A more general modality unspecific representation method: data2vec

What is multimodal learning?

The cat image is a screenshot from [1]. The text was written by the author of this Medium article. Feel free to use this image, but please cite this article if you do.

Initial question list before reading the paper

  1. What is the performance on each task in CV, NLP, and speech? Does it surpass the existing methods on all of them?
  2. If the answer to 1 is yes, does it mean that multi-modality can have a similar effect to multi-task learning?
  3. How did they combine the different modalities?
  4. Will the model be very large and complicated if we want a general way to solve the intelligence problem?
  5. As transformers are spreading from NLP to CV, will they use a transformer to build their model?
  6. How can we use this multi-modality work to solve a modality-specific task that was not covered in the multi-modality work introduced by Meta?
  7. Since they share the code, what does the code look like?

Let’s read the paper (Q5 solved)

Workflow

Model architecture

  • CV: the model follows the ViT strategy of encoding an image as a sequence of patches, each spanning 16x16 pixels, which are fed into a linear transformation.
  • Speech: the data is encoded using a multi-layer 1-D convolutional neural network that maps the 16 kHz waveform to 50 Hz representations.
  • Text: the input is pre-processed to obtain sub-word units, which are then embedded in distributional space via learned embedding vectors.
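
To make the three input encoders above more concrete, here is a minimal pure-Python sketch of the shape arithmetic only. It is not the data2vec implementation: the helper names (`image_to_patches`, `speech_frames`, `embed_tokens`), the 224x224 single-channel image, and the toy embedding table are all assumptions made for illustration, and the speech frame count ignores the edge effects of a real convolutional stack.

```python
# Illustrative shape arithmetic for the three modality-specific encoders.
# Every name and default size here is an assumption for this sketch,
# not something taken from the data2vec code base.

def image_to_patches(image, patch=16):
    """Split an H x W image (a list of rows) into flattened patch x patch blocks."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            block = [image[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            patches.append(block)  # each block would then go through a linear layer
    return patches


def speech_frames(num_samples, sample_rate=16_000, frame_rate=50):
    """Approximate number of output frames when 16 kHz audio is mapped to 50 Hz."""
    stride = sample_rate // frame_rate  # 320 samples, i.e. 20 ms per frame
    return num_samples // stride


def embed_tokens(token_ids, table):
    """Look up one learned vector per sub-word id (table is a list of vectors)."""
    return [table[i] for i in token_ids]


# Assumed 224x224 single-channel image: 14 x 14 = 196 patches of 256 values each.
image = [[0.0] * 224 for _ in range(224)]
patches = image_to_patches(image)
print(len(patches), len(patches[0]))

# One second of 16 kHz audio becomes roughly 50 frames.
print(speech_frames(16_000))

# Toy embedding table with three sub-word units and 2-dimensional vectors.
table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
print(embed_tokens([2, 0], table))
```

In all three cases the output is a sequence of vectors, which is what lets a single transformer consume images, speech, and text in the same way.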

Loss

Results (Q1 and Q2 solved)

CV on ImageNet

Speech

NLP

Code

Citation

Xiaoke Shen, "A more general modality unspecific representation method: data2vec". https://jimmy-shen.medium.com/finally-we-have-a-more-general-modality-unspecific-representation-method-data2vec-5dcba6c853ef, 2022.

BibTeX citation:

@misc{shen,
author = {Xiaoke Shen},
title = {{A more general modality unspecific representation method: data2vec}},
year = {2022},
howpublished = {\url{https://jimmy-shen.medium.com/finally-we-have-a-more-general-modality-unspecific-representation-method-data2vec-5dcba6c853ef}},
}

Reference
