Benefits of Multimodal Training

It makes sense that, on a multimodal task, a multimodal model would beat a unimodal one. But would a multimodal model also beat a unimodal one on a unimodal task?

Consider the following vignette: You are trying to teach a robot to lip read, which is inherently a strictly unimodal task. Technically the robot only needs a camera, since all the information it needs to determine what a speaker is saying is conveyed visually. However, some studies have shown that if you equip the robot with both ears and eyes during training, it can perform better than a robot trained with eyes alone, even when both robots only have access to eyes at test time. In other words, in some cases a robot that used to be able to hear will lip read better than a robot that never could.

Over winter break, I ran some experiments indicating that this benefit of multimodal training exists even when the model is simple linear regression! The next step is to understand why the benefit exists, because it is fairly unintuitive. The advantage of analyzing the linear case is that we understand linear transformations much better than nonlinear ones. Furthermore, deep networks are locally linear (they are differentiable, after all), so this linear analysis might apply locally to deep networks, too.
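To make the train-with-both, test-with-one setup concrete, here is a minimal sketch of one way a linear version of the experiment could be arranged. This is an illustration, not the exact experiment I ran: the data-generating process (two noisy views of a shared latent factor), the impute-the-missing-modality step, and all of the constants are assumptions chosen just for the sketch.

```python
# Minimal sketch of a "multimodal training, unimodal testing" linear experiment.
# Not the exact experiment from this post; the generative model and the
# imputation strategy are assumptions made for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 500, 5000, 10

def make_data(n):
    z = rng.normal(size=(n, d))              # shared latent factor
    x1 = z + 0.5 * rng.normal(size=(n, d))   # "visual" view (always observed)
    x2 = z + 0.5 * rng.normal(size=(n, d))   # "audio" view (training only)
    y = z @ np.ones(d) + 0.1 * rng.normal(size=n)
    return x1, x2, y

x1_tr, x2_tr, y_tr = make_data(n_train)
x1_te, _, y_te = make_data(n_test)           # x2 is unavailable at test time

def lstsq(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Unimodal baseline: regress y on x1 alone.
w_uni = lstsq(x1_tr, y_tr)
mse_uni = np.mean((x1_te @ w_uni - y_te) ** 2)

# Multimodal training: regress y on [x1, x2], and also learn a linear map
# from x1 to x2 so the missing modality can be imputed at test time.
w_multi = lstsq(np.hstack([x1_tr, x2_tr]), y_tr)
A = lstsq(x1_tr, x2_tr)                      # x1 -> x2 imputation map
x2_hat = x1_te @ A                           # imputed "audio" features
mse_multi = np.mean((np.hstack([x1_te, x2_hat]) @ w_multi - y_te) ** 2)

print(f"unimodal test MSE:   {mse_uni:.4f}")
print(f"multimodal test MSE: {mse_multi:.4f}")
```

Which approach wins in this toy version depends on the noise levels and sample sizes; the point of the sketch is only the structure of the comparison, namely that both models are evaluated with "eyes only."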

Some studies that show this multimodal training effect

  1. Multimodal Deep Learning. This paper studies the AVSR (Audio-Visual Speech Recognition) task, and shows that a deep autoencoder trained on both audio and visual data performs better on a lip-reading task than one trained on visual data alone.
  2. Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities. This model learns a joint representation for two modalities through translation, so that even when only one modality is present at test time, a representation that “captures” a best guess of the information in the other modality can be used to perform the task (see the sketch after this list).
  3. Visual Word2Vec (vis-w2v): Learning Visually Grounded Word Embeddings Using Abstract Scenes. This model learns word representations by predicting the corresponding visual features (similar in spirit to the translation method above, although there is no “back-translation” loss in this model). The paper reports gains over word2vec embeddings (which don’t use the visual features) on Common-Sense Assertion and Visual Paraphrasing. Note that neither task has visual features present at test time.
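To make the translation idea in item 2 a bit more concrete, here is a rough sketch of what a cyclic-translation objective can look like. It is my own simplification, not the authors' code: plain linear layers stand in for the paper's sequence models, and the dimensions, loss weighting, and function names are made up for illustration.

```python
# Rough sketch of a cyclic-translation objective: encode the modality that is
# always available (A), translate it to the other modality (B), translate back,
# and predict the task label from the intermediate representation.
# A simplification for illustration, not the model from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM_A, DIM_B, DIM_H = 32, 16, 8          # hypothetical feature sizes

enc    = nn.Linear(DIM_A, DIM_H)         # modality A -> shared representation
a_to_b = nn.Linear(DIM_H, DIM_B)         # representation -> modality B (translation)
b_to_a = nn.Linear(DIM_B, DIM_A)         # modality B -> modality A (back-translation)
head   = nn.Linear(DIM_H, 1)             # task prediction from the representation

params = (list(enc.parameters()) + list(a_to_b.parameters())
          + list(b_to_a.parameters()) + list(head.parameters()))
opt = torch.optim.Adam(params, lr=1e-3)

def training_step(x_a, x_b, y):
    """Modality B (x_b) is only needed here, at training time."""
    h = torch.relu(enc(x_a))
    b_hat = a_to_b(h)                    # translate A -> B
    a_hat = b_to_a(b_hat)                # cycle back B -> A
    y_hat = head(h)
    loss = (F.mse_loss(b_hat, x_b)       # translation loss
            + F.mse_loss(a_hat, x_a)     # cyclic (back-translation) loss
            + F.mse_loss(y_hat, y))      # task loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def predict(x_a):
    """At test time only modality A is needed."""
    with torch.no_grad():
        return head(torch.relu(enc(x_a)))
```

At test time only predict(x_a) is used, which mirrors the eyes-only evaluation in the vignette above: the second modality shapes the representation during training but is never required afterwards.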