Vision Foundational Models
May 14, 2024
Segment Anything (SAM) from Meta
The figure above summarizes the key components of the SAM paper [4]: a flexible prompting mechanism that triggers mask generation, a model architecture made up of an image encoder, a prompt encoder, and a lightweight mask decoder, and the large-scale SA-1B dataset used for training. A minimal usage sketch of these components follows the list below.
Model
- Image encoder: “we use an MAE [47] pre-trained Vision Transformer (ViT) [33]”
- Prompt encoder. “We consider two sets of prompts: sparse (points, boxes, text) and dense (masks). We represent points and boxes by positional encodings [95] summed with learned embeddings for each prompt type and free-form text with an off-the-shelf text encoder from CLIP”
- Mask decoder. “The mask decoder efficiently maps the image embedding, prompt embeddings, and an output token to a mask. This design, inspired by [14, 20], employs a modification of a Transformer decoder block [103] followed by a dynamic mask prediction head. Our modified decoder block uses prompt self-attention and cross-attention in two directions (prompt-to-image embedding and vice-versa) to update all embeddings.”
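To make the three components concrete, here is a minimal usage sketch with Meta's open-source `segment_anything` package. This is an illustrative sketch, not the paper's code: the checkpoint filename, image path, and point coordinates are placeholders, and it assumes the package and a downloaded ViT-H checkpoint are available locally. The point of the sketch is the split the architecture enables: the heavy image encoder runs once per image, while the prompt encoder and mask decoder are light enough to re-run for every new prompt.

```python
# Minimal SAM inference sketch (assumptions: `segment_anything` and `opencv-python`
# are installed; the checkpoint file, image path, and point coordinates below
# are placeholders for illustration).
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Build the full model: MAE pre-trained ViT image encoder + prompt encoder
# + lightweight mask decoder, loaded from a released checkpoint.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# The heavy image encoder runs once per image; the embedding is cached.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Sparse prompt: one foreground point (x, y) with label 1 (1 = foreground, 0 = background).
point_coords = np.array([[500, 375]])
point_labels = np.array([1])

# The prompt encoder and mask decoder are cheap, so many prompts can be tried
# against the cached embedding; multimask_output=True returns 3 candidate masks
# to handle ambiguous prompts, each with a predicted quality score.
masks, scores, logits = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,
)
print(masks.shape, scores)  # (3, H, W) boolean masks and their scores
```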
References
[1] Foundational Models Defining a New Era in Vision: A Survey and Outlook, Jul 2023. [pdf]
[2] Recent Advances in Vision Foundation Models, CVPR 2024 talk, Jun 2023. [web link]
[3] Vision-Language Pre-training: Basics, Recent Advances, and Future Trends, Oct 2022. [pdf]
[4] Segment Anything, Apr 2023. [pdf]