Multimodal concept acquisition with NMF
22 Oct 2015

Concepts have often been used as atomic units to describe our inner cognitive world and to try to understand its structure. However, whether such concepts exist, what they are, and whether we can share some of them with machines are still open questions.

In our new paper, MCA-NMF: Multimodal concept acquisition with non-negative matrix factorization, we define some aspects of this notion more clearly by studying two questions: how can a machine acquire concepts? And how can we tell that a machine has learnt a concept?

Various communities of researchers are interested in concepts. Philosophers and cognitive scientists want to know more about the structure of the mind. Developmental psychologists want to understand how children learn. Linguists want to know how language and words signify and interact with cognition. And finally, artificial intelligence researchers and roboticists want to build machines that think like us or can interact with us.

Actually, researchers from all these communities have explained that concepts do not really belong to our inner world, and that they are not really atomic units.

The first aspect is well illustrated by the symbol grounding problem [Harnad1990]. It points out that learning language is not only about learning the signs of communication, such as words, but also requires relating them to their semantic content, and that this semantic content emerges from and is grounded in the interaction with the world.

However, the symbol grounding problem turns out to be ill-posed [Steels2008]. Indeed, words stored in a computer are symbols, but these are not the words a child learns from the voice of its mother, embedded in sentences and in a much broader acoustic landscape. This is actually one of many examples of ambiguity in perception.

Another important feature of perceptual systems is that they often include sensors from several modalities. This property is well illustrated by the McGurk effect, which shows that perception intrinsically involves and mixes several modalities. Finally, growing evidence from psychological studies demonstrates the influence of language on learning concepts, and therefore the co-organization of language and meanings [Waxman1995, Lupyan2007].

What did you hear? ‘ba’, ‘da’ or ‘ga’?

In the paper, we present a set of experiments that demonstrate the learning of such concepts from non-symbolic data consisting of speech sounds, images, and motions. We explain how words may emerge from the understanding of full sentences instead of being a prerequisite to understanding them. An open-source implementation of the learner, as well as scripts and the associated experimental data to reproduce the experiments, are publicly available.

Why is it interesting? Why is it new?

Although previous work has already studied the questions of multimodal learning and symbol grounding, this work is original in several aspects.

First, it demonstrates the acquisition of grounded, complex concepts from raw continuous signal only. That is, it does not rely on symbols to train the artificial agent and does not use transcriptions to process speech.

Also, language learning, and in particular the learning of words, is treated as an instance of multimodal learning. In contrast with other approaches that treat language differently from the other modalities, this puts an emphasis on the similarities between the process of learning words and that of learning, for example, visual concepts.

Finally, it explores the question of the structure and (de)composition of concepts. In particular, it shows that it is possible for an artificial system to discover subcomponents of perception, such as words in spoken sentences, although the system is only exposed to a task that requires associating whole sentences with the corresponding scenes. This aspect is an example where understanding the whole acts as a proxy to understanding the parts. It provides an example of teleological learning occurring without requiring compositional understanding (see the discussion by Wrede et al. [Wrede2012]).

How do we evaluate the learning of concepts?

The setup that we use to evaluate the learning of concepts is very close to the behavior we would expect from a child: the agent is first exposed to sentences paired with the scenes they describe, and is then asked a question for which it has to find the matching picture.

“This is a house.”
“Do you like ice-cream?”
“Trees are nice!”
“I eat fish.”

“Where is the fish?”

How does the agent learn?

We built a learner based on the non-negative matrix factorization (NMF) algorithm that solves the problem explained above. Without going into the technical details, we can say that the agent learns how to compress the multimodal signal. The intuitive idea is that, because the training multimodal signal associates sentences containing the word fish with pictures containing fishes, it is efficient to represent the visual features of fishes in the same way as the acoustic features of the word fish. In other words, from the system's point of view, the word fish and fishes look-sound the same.
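To make the idea concrete, here is a minimal sketch of that mechanism, not the paper's exact implementation: modality feature vectors are concatenated and factorized jointly, and at test time the coefficients inferred from the sound alone are used to reconstruct the expected visual features. The feature extraction steps are replaced by placeholder random arrays, and all names and dimensions below are illustrative assumptions.

```python
# Minimal sketch of multimodal NMF learning (placeholder data, not the paper's code).
import numpy as np
from scipy.optimize import nnls
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_examples, d_sound, d_image, k = 200, 50, 30, 10

# Hypothetical paired training data: one row per (sentence, scene) example.
X_sound = rng.random((n_examples, d_sound))   # acoustic features (non-negative)
X_image = rng.random((n_examples, d_image))   # visual features (non-negative)
X = np.hstack([X_sound, X_image])             # joint multimodal matrix

# Learn a shared non-negative dictionary: X ~ W @ H
nmf = NMF(n_components=k, init="nndsvda", max_iter=500)
W = nmf.fit_transform(X)
H = nmf.components_
H_sound, H_image = H[:, :d_sound], H[:, d_sound:]

# At test time only the sound of a new sentence is observed.
x_sound = rng.random(d_sound)

# Infer the internal coefficients from the sound modality alone
# (non-negative least squares against the sound part of the dictionary)...
h, _ = nnls(H_sound.T, x_sound)

# ...and reconstruct the expected visual features from the same coefficients.
x_image_hat = h @ H_image

# The agent can then point to the candidate picture whose features best
# match the reconstruction (cosine similarity here).
candidates = rng.random((5, d_image))
scores = candidates @ x_image_hat / (
    np.linalg.norm(candidates, axis=1) * np.linalg.norm(x_image_hat) + 1e-12)
print("best matching picture:", scores.argmax())
```

Because the dictionary is shared across modalities, coefficients inferred from one modality carry information about the others; that is the sense in which the word fish and fishes end up looking and sounding the same to the system.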

Of course this is just the general idea. This page gives some more details. All the technical details are available in the paper and in the code.

Does it really learn concepts?

This actually depends on what we call a concept. What we propose here is that the training and evaluation setups provide a behavioral definition of concepts. We also provide an implementation demonstrating that a machine is capable of learning such concepts, as shown by the performance of the algorithm on the formal task.

Of course, this is only one limited definition of the general notion of concepts. But providing a definition of what we mean by concept, and tools to investigate what falls and does not fall under this definition, are ways to better understand what we call concepts, what role they play in language, and how we may share some of them with machines.

In order to go beyond the definition of concepts formalized by the evaluation setup, we investigate the recognition of single words (not full sentences as before) by the agent. To achieve that, we give small fragments of sentences to the learner and ask it to associate them with pictures. The following plots present examples of such scores obtained on various parts of sentences; the location of the word that carries the meaning is manually annotated in gray. They illustrate four typical behaviors of the system (although the last one is the most common).
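As an illustration, here is one hypothetical way such a fragment score could be computed, reusing the shared dictionary from the sketch above; the function name, the fragment features, and the reference picture are assumptions, not the paper's exact procedure.

```python
# Sketch: score how strongly a sound fragment evokes a given visual concept by
# inferring internal coefficients from the fragment alone and comparing the
# reconstructed visual features with those of a reference picture.
import numpy as np
from scipy.optimize import nnls

def concept_score(fragment_sound, concept_image, H_sound, H_image):
    """Cosine similarity between the visual reconstruction of a sound
    fragment and the visual features of a reference concept."""
    h, _ = nnls(H_sound.T, fragment_sound)      # coefficients from sound only
    reconstruction = h @ H_image                # expected visual features
    denom = (np.linalg.norm(reconstruction)
             * np.linalg.norm(concept_image) + 1e-12)
    return float(reconstruction @ concept_image) / denom
```

Sliding this score over consecutive fragments of a sentence produces curves like the ones in the plots: ideally they peak where the word carrying the meaning is pronounced.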

What the results tell us is that sometimes the meaning is well localized on the word that carries it, and sometimes less so. There are many potential explanations for this, knowing that the learner is only exposed to a tiny sub-language and very few examples. However, claiming that the agent goes from a blurred understanding of sentences to a refined knowledge of words would require an analysis of the dynamics of the process, which is unfortunately out of the scope of this article but an open subject for future work.