Vision Foundational Models

Jimmy (xiaoke) Shen
2 min read · May 14, 2024


Segment Anything (SAM) from Meta

Figure from [4]

The figure above already shows the key components of the SAM paper: a more flexible, prompt-based way to tell the model which mask to generate; a model architecture made of an image encoder, a prompt encoder, and a lightweight mask decoder; and the large dataset used for training.


  • Image encoder: “we use an MAE [47] pre-trained Vision Transformer (ViT) [33]”
  • Prompt encoder: “We consider two sets of prompts: sparse (points, boxes, text) and dense (masks). We represent points and boxes by positional encodings [95] summed with learned embeddings for each prompt type and free-form text with an off-the-shelf text encoder from CLIP”
  • Mask decoder: “The mask decoder efficiently maps the image embedding, prompt embeddings, and an output token to a mask. This design, inspired by [14, 20], employs a modification of a Transformer decoder block [103] followed by a dynamic mask prediction head. Our modified decoder block uses prompt self-attention and cross-attention in two directions (prompt-to-image embedding and vice-versa) to update all embeddings.”
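To make the prompt encoder quote more concrete, here is a minimal pure-Python sketch of "positional encodings summed with learned embeddings for each prompt type". Several things are assumptions for illustration and not taken from the paper text above: the random Fourier feature form of the positional encoding, the toy embedding size, the four prompt type names, and the random vectors standing in for trained embeddings. It is not the official SAM code.

```python
import math
import random

EMBED_DIM = 8  # toy size chosen for readability (assumption)
rng = random.Random(0)

# Random 2D frequencies for a Fourier-feature positional encoding (assumed form).
FREQS = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(EMBED_DIM // 2)]

# One vector per prompt type; random stand-ins for learned embeddings.
PROMPT_TYPES = ("fg_point", "bg_point", "box_top_left", "box_bottom_right")
TYPE_EMBED = {t: [rng.gauss(0, 1) for _ in range(EMBED_DIM)] for t in PROMPT_TYPES}


def positional_encoding(x, y, width, height):
    """Map a pixel coordinate to sin/cos Fourier features."""
    # Normalise to [-1, 1] so the encoding does not depend on image size.
    u = 2.0 * x / width - 1.0
    v = 2.0 * y / height - 1.0
    angles = [2.0 * math.pi * (fu * u + fv * v) for fu, fv in FREQS]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]


def encode_prompt(kind, x, y, width, height):
    """Positional encoding summed with the embedding for this prompt type."""
    pe = positional_encoding(x, y, width, height)
    return [p + t for p, t in zip(pe, TYPE_EMBED[kind])]


def encode_point(x, y, width, height, foreground=True):
    kind = "fg_point" if foreground else "bg_point"
    return encode_prompt(kind, x, y, width, height)


def encode_box(x0, y0, x1, y1, width, height):
    """A box becomes two tokens, one per corner."""
    return [
        encode_prompt("box_top_left", x0, y0, width, height),
        encode_prompt("box_bottom_right", x1, y1, width, height),
    ]


# One click plus one box on a 640x480 image gives three sparse prompt tokens.
sparse_tokens = [encode_point(120, 80, 640, 480)] + encode_box(50, 40, 300, 260, 640, 480)
```

The point of the sum is that the same location yields different tokens depending on what kind of prompt it is, so the decoder can tell a foreground click from a box corner at the same pixel.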

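The mask decoder quote can also be sketched in a few lines: self-attention over the prompt and output tokens, cross-attention in both directions, then a dynamic head where the updated output token scores every image location. This is a rough sketch under simplifying assumptions, not the paper's implementation: attention here is single-head with no learned projections, there are no MLP, normalisation, or upsampling layers, the number of blocks is arbitrary, and all function names and inputs are hypothetical.

```python
import math
import random


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys_values):
    """Scaled dot-product attention with a residual connection.

    Learned query/key/value projections are omitted to keep the sketch short.
    """
    dim = len(queries[0])
    scale = math.sqrt(dim)
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        mixed = [sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(dim)]
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out


def two_way_block(tokens, image):
    tokens = attend(tokens, tokens)  # self-attention over output + prompt tokens
    tokens = attend(tokens, image)   # cross-attention: tokens attend to the image embedding
    image = attend(image, tokens)    # cross-attention: image embedding attends to the tokens
    return tokens, image


def decode_mask(output_token, prompt_tokens, image, height, width, num_blocks=2):
    tokens = [output_token] + prompt_tokens
    for _ in range(num_blocks):
        tokens, image = two_way_block(tokens, image)
    # Dynamic head: the updated output token acts as a per-image linear classifier.
    classifier = tokens[0]
    logits = [dot(classifier, location) for location in image]
    return [[logits[r * width + c] > 0.0 for c in range(width)] for r in range(height)]


# Toy inputs: a 4x4 grid of image embedding vectors and two prompt tokens.
rng = random.Random(0)
DIM, H, W = 8, 4, 4
image_embedding = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(H * W)]
prompt_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(2)]
output_token = [rng.gauss(0, 1) for _ in range(DIM)]
mask = decode_mask(output_token, prompt_tokens, image_embedding, H, W)
```

Because the classifier weights come from the output token instead of being fixed, the same image embedding can produce a different mask for every prompt, which is what "dynamic" refers to here.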

[1] Foundational Models Defining a New Era in Vision: A Survey and Outlook. Submitted 25 Jul 2023. [pdf]

[2] Recent Advances in Vision Foundation Models. CVPR 2024 talk, June 2023. [web link]

[3] Vision-Language Pre-training: Basics, Recent Advances, and Future Trends. Oct 2022. [pdf]

[4] Segment Anything. April 2023. [pdf]