A more general, modality-agnostic representation method: data2vec

What is multimodal learning?

Initial question list before reading the paper

  1. What is the performance on each task in CV, NLP and speech? Does it surpass prior methods on all of them?
  2. If the answer to 1 is yes, does that mean multi-modality can have an effect similar to multi-task learning?
  3. How did they combine the different modalities?
  4. Will the model be super large and complicated if we want to have a general way to solve the intelligence problem?
  5. As Transformers are spreading from NLP to CV, will they use a Transformer to build their model?
  6. How can we use this multi-modality work to solve a modality-specific task that was not covered in the multi-modality work introduced by Meta?
  7. Since they share the code, what does the code look like?

Let’s read the paper (Q5 solved)


Model architecture

  • CV: following the ViT strategy, an image is encoded as a sequence of patches, each spanning 16x16 pixels, which are then fed to a linear transformation.
  • Speech: the data is encoded using a multi-layer 1-D convolutional neural network that maps the 16 kHz waveform to 50 Hz representations.
  • Text: the input is pre-processed to obtain sub-word units, which are then embedded in distributional space via learned embedding vectors.
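
To make the numbers in the list above concrete, here is a minimal sketch in plain Python. It is not the authors' code. The helper names (patchify, conv_output_length) are hypothetical, the 224x224 image size is an assumption, and the kernel widths and strides are the ones commonly used by wav2vec 2.0-style feature encoders, which these notes do not state. It only checks the arithmetic: how many 16x16 patches an image yields, and why a stack of strided 1-D convolutions turns 16 kHz audio into roughly 50 frames per second.

```python
from math import prod


def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened patch x patch tiles, row-major."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(tile)
    return patches


def conv_output_length(n_samples, kernels, strides):
    """Sequence length after a stack of unpadded 1-D convolutions."""
    length = n_samples
    for k, s in zip(kernels, strides):
        length = (length - k) // s + 1
    return length


# ASSUMPTION: wav2vec 2.0-style kernel widths and strides (not given in these notes).
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)

# ASSUMPTION: a single-channel 224 x 224 image; with 3 channels each patch has 3x the values.
image = [[r * 224 + c for c in range(224)] for r in range(224)]
patches = patchify(image, 16)
print(len(patches), len(patches[0]))                 # 196 256

total_stride = prod(STRIDES)
print(total_stride, 16000 / total_stride)            # 320 50.0
print(conv_output_length(16000, KERNELS, STRIDES))   # 49
```

Each flattened patch would then go through the linear transformation to become one token. On the speech side, the total stride of 320 samples is what gives the 50 Hz rate; one second of audio comes out as 49 frames rather than 50 because the convolutions are unpadded.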


Results (Q1 and Q2 solved)

CV on ImageNet





