Multimodal Large Language Model (MLLM)
A recent paper by Microsoft, titled “Language Is Not All You Need: Aligning Perception with Language Models”, introduces a multimodal large language model (MLLM) called KOSMOS-1.
It recently (Feb 27, 2023) became available on arXiv.
The paper is well written, and I highly recommend reading it yourself. One example I want to highlight here is below:
Isn’t this the future of dialog with the computer? Maybe we also need to add video to the chat in the future.
The model architecture
The model architecture is shown in the figure above.
Input
A unified input format was designed for this work. Example formats can be found in a table in the paper; a sketch of the idea follows.
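To make the format concrete, here is a minimal sketch of how an interleaved image-text document is flattened into one sequence. The special tokens <s>, </s>, <image>, and </image> come from the paper; the helper function and the example content are my own illustration:

```python
# Minimal sketch of the unified multimodal input format described in the
# paper: the sequence is wrapped in <s> ... </s>, and image embeddings are
# spliced in between <image> and </image>. The helper is hypothetical.
def build_multimodal_sequence(segments):
    """segments: list of ("text", str) or ("image", embedding) pairs."""
    tokens = ["<s>"]
    for kind, content in segments:
        if kind == "text":
            tokens.append(content)  # tokenized and embedded downstream
        else:
            tokens += ["<image>", content, "</image>"]  # precomputed image embedding
    tokens.append("</s>")
    return tokens

# An interleaved image-text example in the spirit of the paper's table
print(build_multimodal_sequence([
    ("image", "ImageEmbedding"),
    ("text", "This is WALL-E."),
]))
# ['<s>', '<image>', 'ImageEmbedding', '</image>', 'This is WALL-E.', '</s>']
```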
Based on the paper, not surprisingly, the model is built on the Transformer architecture; since it is a generative model, it is similar to GPT in using the Transformer decoder. From the paper: “After obtaining the embeddings of an input sequence, we feed them into the Transformer-based decoder.”
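To make that concrete, here is a self-contained sketch of the decoder-only idea in PyTorch. This is not the paper’s implementation: the layer sizes, the dummy image embeddings, and the use of nn.TransformerEncoder with a causal mask (the standard PyTorch idiom for a GPT-style decoder-only stack) are all my assumptions.

```python
import torch
import torch.nn as nn

# Sketch (my assumptions, not the paper's code): image embeddings and
# text-token embeddings are concatenated into one sequence and fed
# through a causally masked Transformer stack.
vocab_size, d_model = 64000, 768

token_embed = nn.Embedding(vocab_size, d_model)
# nn.TransformerEncoder + a causal mask is the standard PyTorch idiom
# for a GPT-style decoder-only model.
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=12, batch_first=True)
decoder = nn.TransformerEncoder(layer, num_layers=12)
lm_head = nn.Linear(d_model, vocab_size)

text_ids = torch.randint(0, vocab_size, (1, 16))  # dummy text tokens
image_embeds = torch.randn(1, 4, d_model)         # dummy image embeddings

x = torch.cat([image_embeds, token_embed(text_ids)], dim=1)  # unified sequence
causal_mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))

h = decoder(x, mask=causal_mask)  # "feed them into the Transformer-based decoder"
logits = lm_head(h)               # next-token logits over the vocabulary
print(logits.shape)               # torch.Size([1, 20, 64000])
```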
Implementation details
Source code
Source code is available from here. However, it seems the code is not released yet. We can find some details in the paper:
Backbone network
MAGNETO, a Transformer variant, was used in the paper as the backbone architecture. This is because “MAGNETO has better training stability and superior performance across modalities.”
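As I read the MAGNETO (Foundation Transformers) paper, its main architectural change is Sub-LN: each sublayer gets a second LayerNorm just before its output projection, in addition to the usual pre-LayerNorm. Below is a rough sketch of that idea for the feed-forward sublayer; it is my reading, not the official implementation:

```python
import torch
import torch.nn as nn

# Rough sketch of MAGNETO's Sub-LN idea (my reading of the Foundation
# Transformers paper, not the official code): an extra LayerNorm inside
# the sublayer, which is credited with the improved training stability.
class SubLNFeedForward(nn.Module):
    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.ln_in = nn.LayerNorm(d_model)   # standard pre-LN at sublayer input
        self.fc1 = nn.Linear(d_model, d_ffn)
        self.ln_inner = nn.LayerNorm(d_ffn)  # the extra "sub" LayerNorm
        self.fc2 = nn.Linear(d_ffn, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.fc1(self.ln_in(x)))
        h = self.fc2(self.ln_inner(h))
        return x + h  # residual connection

x = torch.randn(2, 10, 768)
print(SubLNFeedForward(768, 3072)(x).shape)  # torch.Size([2, 10, 768])
```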
Implementation-related library
The implementation is based on the TorchScale library, which is designed for large-scale model training, as noted in the paper.
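The TorchScale README shows how to instantiate a decoder-only model in a few lines; something like the following should build a MAGNETO-style backbone (the subln flag follows the README, but exact options may vary across TorchScale versions):

```python
# Building a decoder-only backbone with TorchScale, following its README.
# subln=True enables the MAGNETO-style Sub-LN variant; treat this as a
# sketch, since flags may differ across TorchScale versions.
from torchscale.architecture.config import DecoderConfig
from torchscale.architecture.decoder import Decoder

config = DecoderConfig(vocab_size=64000, subln=True)
decoder = Decoder(config)
print(decoder)
```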
References
[1] Language Is Not All You Need: Aligning Perception with Language Models. arXiv:2302.14045.
[2] Code