Matrix factorization for multi-modal learning

Matrix factorization is a class of techniques used in machine learning to solve dictionary learning problems similar to the one we are interested in. We assume that each observed sample \(v^i\) can be expressed as a linear combination of atoms \(w^j\) from the dictionary: \[v^i = \sum\limits_{j=1}^{k}h_i^j w^j\] where the \(h_i^j\) are mixing coefficients. The set of samples is represented by a matrix \(V\), the set of atoms by a matrix \(W\), and the coefficients by a matrix \(H\). The previous equation can thus be re-written as: \[V \simeq W\cdot H\] The equality from the formal model has been replaced by the approximation that is optimized in practice. Finding matrices \(W\) and \(H\) that yield a good approximation is thus a way of learning a decomposition of the observed data into recurrent parts.
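As a concrete illustration of the dimensions involved, the following minimal sketch (assuming samples are stored as columns of \(V\), with \(d\)-dimensional samples, \(n\) observations and \(k\) atoms; all variable names are ours) builds a factorization and checks that each sample is indeed a combination of atoms:

```python
import numpy as np

d, n, k = 50, 200, 10  # sample dimension, number of samples, number of atoms

rng = np.random.default_rng(0)
W = rng.random((d, k))   # dictionary: one atom per column
H = rng.random((k, n))   # coefficients: one column per sample
V = W @ H                # each column of V is a linear combination of atoms

# The i-th sample equals sum_j H[j, i] * W[:, j].
assert np.allclose(V[:, 0], W @ H[:, 0])
```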

In order to apply matrix factorization techniques to learning in a multimodal setup, Driesen et al. have proposed the following approach. We suppose that each observation \(v\) is composed of a part representing the motion and a part representing the linguistic description. A language part and a motion part can therefore also be identified in the data matrix \(V\) and in the dictionary matrix \(W\). The coefficients from \(H\) form an internal, combined representation of the data and thus do not contain separate language and motion parts.
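With block notation (the subscript names are ours), this structure can be written explicitly: \[\begin{pmatrix} V_{\mathrm{language}} \\ V_{\mathrm{motion}} \end{pmatrix} \simeq \begin{pmatrix} W_{\mathrm{language}} \\ W_{\mathrm{motion}} \end{pmatrix}\cdot H\] Each column of \(H\) is thus shared by the two modalities of the corresponding observation.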

The idea is then to use the multimodal training data to learn the dictionary. Once the dictionary is learnt, if only the motion data is observed, the motion part of the dictionary can be used to infer the internal coefficients. These coefficients can afterward be used to reconstruct the linguistic part associated with the observed motion. A similar process enables the reconstruction of the motion representation associated with an observed linguistic description.
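A minimal sketch of this cross-modal reconstruction, assuming the dictionary blocks have already been learnt and using an ordinary least-squares estimate of the coefficients purely for illustration (the non-negative variant used in practice is discussed below; all names are ours):

```python
import numpy as np

def reconstruct_language(W_language, W_motion, v_motion):
    """Infer internal coefficients from a motion observation,
    then rebuild the associated linguistic representation.

    W_language: (d_lang, k) language block of the learnt dictionary
    W_motion:   (d_mot, k)  motion block of the learnt dictionary
    v_motion:   (d_mot,)    observed motion representation
    """
    # Least-squares estimate of the coefficients h with W_motion @ h ~= v_motion.
    h, *_ = np.linalg.lstsq(W_motion, v_motion, rcond=None)
    # The shared coefficients let us reconstruct the other modality.
    return W_language @ h
```

The symmetric reconstruction (motion from language) simply swaps the roles of the two dictionary blocks.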

We more specifically use non-negative matrix factorization (NMF, [2], [3]), which requires that we represent motion by vectors with non-negative values. Such a representation is detailed below.
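For reference, here is a compact sketch of NMF using the classical multiplicative update rules for the Frobenius norm (an illustrative implementation under our own naming, not necessarily the exact variant used in the experiments):

```python
import numpy as np

def nmf(V, k, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (d x n) as W (d x k) @ H (k x n).

    Multiplicative updates that decrease ||V - W @ H||_F; eps avoids
    division by zero.
    """
    rng = np.random.default_rng(seed)
    d, n = V.shape
    W = rng.random((d, k))
    H = rng.random((k, n))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Because \(V\) and the initial \(W\), \(H\) are non-negative and the updates are multiplicative, all factors stay non-negative throughout, which is why the motion representation itself must be non-negative.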