A more general modality unspecific representation method: data2vec

Meta’s new paper data2vec

Jimmy (xiaoke) Shen
5 min read · Jan 21, 2022

As members of the ML community, and especially as NLP professionals, we should be pretty familiar with word2vec. Recently, Meta posted a new paper named “data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language”.

Based on the abstract, “Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input.”

What is multimodal learning?

The cat image is a screenshot from [1]. The text was created by the author of this Medium article. Feel free to use this image, but do not forget to cite this article if you do.

According to Wikipedia, multimodal learning aims to build “a good model to represent the joint representations of different modalities.” The most common modalities are image and text. “For instance, images are usually represented as pixel intensities, while texts are represented as discrete word count vectors”. A quick illustration can be found in the image above.
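As a tiny, concrete illustration of these two raw representations (my own toy example, not taken from the paper or Wikipedia), the snippet below builds a pixel-intensity grid with NumPy and word-count vectors with scikit-learn’s CountVectorizer:

```python
# Toy example: the two modalities mentioned above, in their raw representations.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Image modality: a tiny 4x4 grayscale "image" as raw pixel intensities in [0, 255].
image = np.array([
    [ 12,  40,  40,  12],
    [ 40, 200, 200,  40],
    [ 40, 200, 200,  40],
    [ 12,  40,  40,  12],
], dtype=np.uint8)
print("image shape:", image.shape)            # (4, 4) grid of pixel intensities

# Text modality: sentences represented as discrete word-count vectors.
corpus = ["the cat is sitting", "the cat sees the cat"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus).toarray()
print("vocabulary:", vectorizer.get_feature_names_out())
print("word counts:\n", counts)               # one row of counts per sentence
```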

Initial question list before reading the paper

Before reading the paper, I have several questions:

  1. What is the performance on each task in CV, NLP and speech? Will it surpass all of them?
  2. If the answer to 1 is yes, does it mean that multi-modality can have a similar effect to multi-task learning?
  3. How did they combine the different modalities together?
  4. Will the model be super large and complicated if we want a general way to solve the intelligence problem?
  5. As Transformers are becoming popular from NLP to CV, will they use a Transformer to build their model?
  6. How can we use this multi-modality work to solve a modality-specific task that was not involved in the multi-modality work introduced by Meta?
  7. Since they share the code, what does the code look like?

Let’s read the paper (Q5 solved)

The paper can be found HERE.

Workflow

Yes, as the paper mentions, it uses a Transformer with a teacher and student approach. (Question 5 is solved.)

Specifically, we train an off-the-shelf Transformer network (Vaswani et al., 2017) which we use either in teacher or student mode (Illustration in Figure 1)

The workflow is shown in Figure 1. It has two parts: the top one is the normal approach and the bottom one is a masked approach, as in BERT. The upper part runs in teacher mode and the bottom part in student mode. The student (with mask) is asked to predict the teacher’s representation of the original inputs (without mask). The inputs come from CV, NLP and speech. The paper also mentions that “since different modalities have vastly different inputs, e.g., pixels vs. words, we use modality-specific feature encoders.”
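To make the teacher/student idea concrete, here is a minimal PyTorch sketch of it. This is my own illustration rather than Meta’s released code: the tiny Transformer stands in for the modality-specific encoder plus Transformer from the paper, the zero-masking and hyperparameter values are placeholders, and the EMA update reflects the paper’s description of the teacher weights tracking the student’s.

```python
# Minimal sketch of the masked student / unmasked teacher setup (my own
# illustration, not the released data2vec code).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2)
teacher = copy.deepcopy(student)          # teacher starts as a copy of the student
teacher.eval()                            # no dropout noise in the targets
for p in teacher.parameters():
    p.requires_grad = False

def training_step(x, mask_prob=0.15, beta=1.0):
    # x: (batch, seq_len, d_model), already produced by a modality-specific encoder.
    with torch.no_grad():
        target = teacher(x)               # contextualized targets from the full, unmasked input

    mask = torch.rand(x.shape[:2], device=x.device) < mask_prob
    x_masked = x.clone()
    x_masked[mask] = 0.0                  # crude zero-masking, just for illustration
    pred = student(x_masked)              # the student only sees the masked input

    # regress the teacher's representations at the masked positions
    return F.smooth_l1_loss(pred[mask], target[mask], beta=beta)

@torch.no_grad()
def ema_update(tau=0.999):
    # teacher parameters track an exponential moving average of the student's
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(tau).add_(ps, alpha=1 - tau)
```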

So far, everything makes sense to me. However, it was still not clear whether they share a single network across modalities until I read this:

In an effort to get closer to machines that learn in more general ways about the environment, we designed data2vec, a framework for general self-supervised learning that works for images, speech and text where the learning objective is identical in each modality. The present work unifies the learning algorithm but still learns representations individually for each modality. We hope that a single algorithm will make future multi-modal learning simpler, more effective and lead to models that understand the world better through multiple modalities.

It looks like Meta has a similar idea to Tesla’s: Tesla uses 8 cameras and hopes that the fixed structure can help it keep improving the model’s performance. It is a pretty interesting point.

An extra benefit of reading this paper: the related work section gives a nice summary of self-supervised learning in CV, NLP, etc. Check it out.

Model architecture

“We use the standard Transformer architecture with a modality-specific encoding of the input data borrowed from prior work.”

  • CV: the ViT strategy of encoding an image as a sequence of patches, each spanning 16x16 pixels, which are fed to a linear transformation (a minimal sketch follows this list).
  • Speech: data is encoded using a multi-layer 1-D convolutional neural network that maps a 16 kHz waveform to 50 Hz representations.
  • Text: the input is pre-processed to obtain sub-word units, which are then embedded in distributional space via learned embedding vectors.
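As a rough illustration of these three input encodings, here is a minimal PyTorch sketch. It is my own simplification, not the actual data2vec encoders (which are borrowed from prior work such as ViT, wav2vec-style convolutional front-ends and BERT-style embeddings, and include more layers and normalization); the channel counts, kernel sizes and vocabulary size below are placeholders.

```python
# Hedged sketch of the three modality-specific input encodings (my own minimal
# versions; the real encoders have more layers, normalization and careful init).
import torch
import torch.nn as nn

d_model = 768

# CV: 16x16 pixel patches -> linear projection, implemented as a strided conv (ViT trick)
patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
img = torch.randn(1, 3, 224, 224)
print(patch_embed(img).flatten(2).transpose(1, 2).shape)  # (1, 196, 768): 14x14 patches

# Speech: stacked 1-D convs with a total stride of 320, so 16 kHz -> ~50 frames/second
speech_encoder = nn.Sequential(
    nn.Conv1d(1, d_model, kernel_size=10, stride=5), nn.GELU(),
    nn.Conv1d(d_model, d_model, kernel_size=8, stride=4), nn.GELU(),
    nn.Conv1d(d_model, d_model, kernel_size=4, stride=4), nn.GELU(),
    nn.Conv1d(d_model, d_model, kernel_size=4, stride=2), nn.GELU(),
    nn.Conv1d(d_model, d_model, kernel_size=2, stride=2), nn.GELU(),
)
wav = torch.randn(1, 1, 16000)                            # 1 second of 16 kHz audio
print(speech_encoder(wav).shape)                          # (1, 768, 49): ~50 Hz features

# Text: sub-word token IDs -> learned embedding vectors
vocab_size = 30000                                        # placeholder vocabulary size
text_embed = nn.Embedding(vocab_size, d_model)
tokens = torch.randint(0, vocab_size, (1, 12))
print(text_embed(tokens).shape)                           # (1, 12, 768)
```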

Loss

The smooth L1 loss, borrowed from Faster R-CNN, is used to regress the targets (the contextualized training targets produced in teacher mode).
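For reference, the smooth L1 loss is quadratic for small errors and linear for large ones; the threshold β (a tunable parameter in the paper) controls the transition. A quick sketch of the piecewise definition (placeholder shapes, β = 1.0 chosen only for illustration):

```python
# Piecewise definition of the smooth L1 (Huber-style) loss used to regress the
# teacher's contextualized targets; `beta` is the squared-to-L1 transition point.
import torch
import torch.nn.functional as F

def smooth_l1(pred, target, beta=1.0):
    diff = (pred - target).abs()
    return torch.where(diff < beta,
                       0.5 * diff ** 2 / beta,
                       diff - 0.5 * beta).mean()

pred, target = torch.randn(8, 768), torch.randn(8, 768)
# sanity check against PyTorch's built-in version
assert torch.allclose(smooth_l1(pred, target), F.smooth_l1_loss(pred, target, beta=1.0))
```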

Results (Q1 and Q2 solved)

Everything looks better: data2vec matches or outperforms the prior modality-specific approaches on all three benchmarks below.

CV on ImageNet

Speech

NLP

Code

Code can be found HERE. It looks like the pretrained models are available for speech and NLP; however, for CV, it says “coming soon”.

Similarly, on the code side, only the speech and NLP models are available so far.

I will check the code in detail later.

Thanks for reading. Overall, it is a pretty interesting paper, and I highly recommend reading the original.

Citation

For attribution in academic contexts or books, please cite this work as:

Xiaoke Shen, "A more general modality unspecific representation method: data2vec". https://jimmy-shen.medium.com/finally-we-have-a-more-general-modality-unspecific-representation-method-data2vec-5dcba6c853ef, 2022.

BibTeX citation:

@misc{shen,
author = {Xiaoke Shen},
title = {{A more general modality unspecific representation method: data2vec}},
year = {2022},
howpublished = {\url{https://jimmy-shen.medium.com/finally-we-have-a-more-general-modality-unspecific-representation-method-data2vec-5dcba6c853ef}},
}

Reference

[1] https://www.youtube.com/watch?v=uHKfrz65KSU
