Multimodal Large Language Model (MLLM)

Jimmy (xiaoke) Shen
Mar 1, 2023


A recent paper by Microsoft is titled Language Is Not All You Need: Aligning Perception with Language Models.

It recently (Feb 27, 2023) became available on arXiv.

The paper is well written, and I highly recommend reading it yourself. One example I want to highlight is below:

[Example from the paper]

Isn’t this the future of dialog with the computer? Maybe we also need to add video to the chat as well.

The model architecture

The model architecture, cropped from the paper, is shown above. The input is a sequence of embeddings from language, vision, etc. The output can be of many types, depending on the task.

Input

A unified input format was designed for this work. Some example formats can be found in the following table.

[Table of input formats, from the paper]
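
To make the unified format concrete, here is a minimal sketch of how an interleaved document might be flattened into a single sequence using the special tags described in the paper (<s>, </s>, <image>, </image>). The helper function, its name, and the way image embeddings are represented are my own illustrative assumptions, not code from the paper.

```python
# Minimal sketch: flatten interleaved text/image segments into one sequence.
# The tag names (<s>, </s>, <image>, </image>) come from the paper; everything
# else here is an illustrative assumption.

def build_input_sequence(segments):
    """segments: list of ("text", str) or ("image", image_embedding) pairs."""
    tokens = ["<s>"]
    for kind, payload in segments:
        if kind == "text":
            tokens.extend(payload.split())  # stand-in for a real tokenizer
        elif kind == "image":
            tokens.append("<image>")
            tokens.append(payload)          # placeholder for the image embedding
            tokens.append("</image>")
    tokens.append("</s>")
    return tokens

# Example: an image followed by its caption, as one interleaved document.
sequence = build_input_sequence([
    ("image", "IMG_EMBEDDING_0"),
    ("text", "A dog chasing a ball on the beach."),
])
print(sequence)
```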

Multimodal Large Language Models (MLLMs)

Based on the paper, and not surprisingly, the model is built on the Transformer architecture. As it is a generative model, it is similar to GPT in that it uses the decoder of the Transformer: “After obtaining the embeddings of an input sequence, we feed them into the Transformer-based decoder.”
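
To illustrate that sentence, the sketch below embeds text tokens, drops projected vision features into the image positions, and runs everything through a causal (decoder-only) Transformer stack. It uses plain PyTorch modules as a stand-in for the paper’s actual backbone (MAGNETO, discussed below); all dimensions, names, and masking details are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TinyMultimodalDecoder(nn.Module):
    """Toy decoder-only LM over a mix of text and vision embeddings.
    A sketch for illustration, not the paper's implementation."""

    def __init__(self, vocab_size=32000, d_model=512, n_layers=4, n_heads=8, vision_dim=768):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.vision_proj = nn.Linear(vision_dim, d_model)      # map vision features into the text space
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)   # causal mask below makes this decoder-only
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, vision_feats=None, image_positions=None):
        x = self.token_embed(token_ids)                        # (batch, seq, d_model)
        if vision_feats is not None:
            # Overwrite the placeholder positions with projected image embeddings.
            x[:, image_positions] = self.vision_proj(vision_feats)
        seq_len = x.size(1)
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.blocks(x, mask=causal)
        return self.lm_head(h)                                 # next-token logits

model = TinyMultimodalDecoder()
logits = model(torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # torch.Size([1, 16, 32000])
```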

Implementation details

Source code

The source code is supposed to be available here; however, it seems the code has not been released yet. We can still find some details in the paper:

Backbone network

MAGNETO, a Transformer variant, is used in the paper as the backbone architecture because “MAGNETO has better training stability and superior performance across modalities.”
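
My understanding is that MAGNETO’s main architectural change is Sub-LayerNorm (Sub-LN): each sublayer gets an extra LayerNorm in addition to the usual pre-LN, together with a dedicated initialization. The snippet below sketches that idea for the feed-forward sublayer only; the exact placement, the attention-side changes, and the initialization scheme are described in the MAGNETO (Foundation Transformers) paper, so treat this as an approximation.

```python
import torch
import torch.nn as nn

class SubLNFeedForward(nn.Module):
    """Feed-forward sublayer with Sub-LN: the usual pre-LN on the input plus an
    extra LayerNorm before the output projection. A sketch of the MAGNETO idea,
    not the reference implementation."""

    def __init__(self, d_model=512, d_ffn=2048):
        super().__init__()
        self.ln_in = nn.LayerNorm(d_model)   # standard pre-LN
        self.fc1 = nn.Linear(d_model, d_ffn)
        self.ln_mid = nn.LayerNorm(d_ffn)    # the extra "sub" LayerNorm
        self.fc2 = nn.Linear(d_ffn, d_model)

    def forward(self, x):
        h = torch.nn.functional.gelu(self.fc1(self.ln_in(x)))
        h = self.fc2(self.ln_mid(h))
        return x + h                         # residual connection

block = SubLNFeedForward()
out = block(torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```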

Implementation related library

According to the paper, the implementation is based on the TorchScale library, which is designed for large-scale model training.

TorchScale: paper, code
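
If the released code follows the public TorchScale examples, building a MAGNETO-style decoder should look roughly like the snippet below. I have not verified this against the (not yet released) KOSMOS code, and the config fields shown here (vocab_size, subln) reflect my reading of the TorchScale documentation, so double-check them against the TorchScale repo.

```python
# Rough sketch based on the TorchScale README; argument names are assumptions
# taken from that documentation and may need adjustment.
from torchscale.architecture.config import DecoderConfig
from torchscale.architecture.decoder import Decoder

config = DecoderConfig(vocab_size=64000, subln=True)  # subln enables the MAGNETO-style Sub-LN blocks
decoder = Decoder(config)
print(decoder)
```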

References

[1] Language Is Not All You Need: Aligning Perception with Language Models

[2] Code
