In the following, we present the multimodal concept acquisition (MCA-NMF) framework based on nonnegative matrix factorization, which enables the autonomous acquisition of multimodal concepts and the testing of this acquisition through a cross-modal classification behavior. More precisely, we explain how to use a nonnegative matrix factorization algorithm to learn a dictionary of components that represent meaningful elements of the multimodal perception, without providing the system with a symbolic representation of the semantics.
The ideas presented here enable the implementation of the experiments detailed on this page. They are part of what is presented in the articles: MCA-NMF: Multimodal concept acquisition with non-negative matrix factorization and Learning semantic components from sub-symbolic multi-modal perception.
Matrix factorization is a class of techniques that can be used in machine learning to solve dictionary learning problems similar to the one we are interested in. We assume that the samples \(v^i\) that are observed can be expressed as linear combinations of atoms \(w^j\) from the dictionary: \[v^i = \sum\limits_{j=1}^{k}h_i^jw^j\] The set of samples is represented by a matrix \(V\), the set of atoms by a matrix \(W\), and the coefficients by a matrix \(H\). The previous equation can thus be re-written as: \[V \simeq W\cdot H\] The equality from the formal model has been replaced by the approximation that is optimized in practice. Finding matrices \(W\) and \(H\) that yield a good approximation is thus a way of learning a decomposition of the observed data into recurrent parts.
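As a minimal illustration of this factorization, the following sketch uses scikit-learn's `NMF` on synthetic data whose columns are, by construction, nonnegative combinations of a few atoms; the dimensions and random data are assumptions chosen only for the example.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy data: 20 samples (columns of V), each a nonnegative linear
# combination of k = 3 ground-truth atoms.
rng = np.random.default_rng(0)
k = 3
W_true = rng.random((30, k))      # atoms w^j as columns
H_true = rng.random((k, 20))      # coefficients h_i^j as columns
V = W_true @ H_true               # observed samples

# Factorize V ≈ W·H with k components.
model = NMF(n_components=k, init="nndsvda", max_iter=1000, random_state=0)
W = model.fit_transform(V)        # learned dictionary (30 x 3)
H = model.components_             # learned coefficients (3 x 20)

rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(rel_err)                    # small: V is exactly low-rank here
```

Since the toy data is exactly rank \(k\), the optimization recovers a near-exact decomposition; on real perceptual data only an approximation is reached.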
In order to apply matrix factorization techniques to learning in a multimodal setup, Driesen et al. have proposed the following approach. We suppose that each observation \(v\) is composed of a part representing the motion and a part representing the linguistic description. A language part and a motion part can therefore also be identified in the data matrix \(V\) and the dictionary matrix \(W\). The coefficients from \(H\), by contrast, are an internal combined representation of the data and thus do not contain separate language and motion parts.
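Concretely, the two modalities are stacked row-wise into a single data matrix, so each learned atom also stacks a language part on top of a motion part while the coefficients remain shared. A sketch, with dimensions chosen arbitrarily for the example:

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical feature sizes: 40-dim language vectors, 60-dim motion vectors.
rng = np.random.default_rng(1)
d_lang, d_motion, n, k = 40, 60, 25, 4

# Ground-truth multimodal atoms: each atom stacks a language part
# over a motion part. Observations share one coefficient vector
# across both modalities.
W_true = rng.random((d_lang + d_motion, k))
V = W_true @ rng.random((k, n))   # stacked multimodal observations

model = NMF(n_components=k, init="nndsvda", max_iter=1000, random_state=0)
W = model.fit_transform(V)
H = model.components_             # one shared coefficient vector per sample

# The dictionary splits by modality; the coefficients H do not.
W_lang, W_motion = W[:d_lang], W[d_lang:]
print(W_lang.shape, W_motion.shape, H.shape)
```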
The idea is then to use the multimodal training data to learn the dictionary. Once the dictionary is learnt, if only speech data is observed, the speech part of the dictionary can be used to infer the internal coefficients. If the speech is to be compared to several observed motions, the same process yields internal coefficients for each motion. The agent can then choose the motion whose internal representation is closest to the speech's internal representation.
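This cross-modal matching step can be sketched as follows. With the dictionary fixed, inferring coefficients for a single-modality observation is a nonnegative least-squares problem; here it is solved with `scipy.optimize.nnls`, and cosine similarity is used to compare internal representations. The dictionary, the similarity measure, and the data are assumptions made for the example, not the exact choices of the papers.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(2)
d_lang, d_motion, k = 40, 60, 4

# Assume a multimodal dictionary has already been learned (random here).
W = rng.random((d_lang + d_motion, k))
W_lang, W_motion = W[:d_lang], W[d_lang:]

# One utterance and three candidate motions, generated from known
# coefficients; candidate 1 shares the utterance's coefficients,
# so it should be selected.
h_speech = rng.random(k)
h_motions = [rng.random(k), h_speech.copy(), rng.random(k)]
speech = W_lang @ h_speech
motions = [W_motion @ h for h in h_motions]

def infer_h(W_mod, v):
    """Nonnegative least squares: find h >= 0 with W_mod @ h ≈ v."""
    h, _ = nnls(W_mod, v)
    return h

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

h_s = infer_h(W_lang, speech)
scores = [cosine(h_s, infer_h(W_motion, m)) for m in motions]
best = int(np.argmax(scores))
print(best)  # the motion sharing the utterance's coefficients
```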
We more specifically use non-negative matrix factorization (NMF, [2], [3]). Importantly, it requires that motion, speech, and images be represented by vectors with non-negative values. Such representations are detailed here and in the papers.