Paper: https://arxiv.org/abs/2304.02643
Code:
Web: https://segment-anything.com/

Motivation
- Reduce the need for task-specific modeling expertise, training compute, and custom data annotation in image segmentation.
- Build a foundation model for object segmentation.
Objective
Segment Anything (SAM) = Interactive Segmentation + Automatic Segmentation
- SAM allows users to interactively segment objects (click, bounding box, text).
- SAM can output multiple valid masks.
- SAM can automatically find and mask all objects in an image.
- SAM can generate a segmentation mask for any prompt in real time once the image embedding has been precomputed, enabling interactive use (see the sketch below).
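A minimal usage sketch with the officially released segment_anything package (the checkpoint path, dummy image, and click coordinates below are placeholders):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (hypothetical local path; vit_h / vit_l / vit_b
# checkpoints are released on the project page).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

# Replace with a real RGB image (HxWx3, uint8).
image = np.zeros((768, 1024, 3), dtype=np.uint8)
predictor.set_image(image)  # the heavy step: runs the image encoder once

# After set_image, each prompt is cheap (~50 ms in the paper's browser demo):
masks, scores, logits = predictor.predict(
    point_coords=np.array([[512, 384]]),  # one click at (x, y)
    point_labels=np.array([1]),           # 1 = foreground, 0 = background
    multimask_output=True,                # return 3 candidate masks
)
```

The fully automatic mode is exposed as SamAutomaticMaskGenerator, which prompts the model with a regular grid of points and filters the resulting masks.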
Global Framework

Loss Function
1) L_mask: Supervise mask prediction with a linear combination of focal loss [65] and dice loss [73] in a 20:1 ratio of focal loss to dice loss.
2) L_IoU: The IoU prediction head is trained with a mean-squared-error loss between the predicted IoU and the predicted mask's actual IoU with the ground-truth mask. It is added to the mask loss with a constant scaling factor of 1.0.
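A PyTorch sketch of the combined loss. The focal-loss parameters alpha and gamma are not specified in the paper's main text, so the RetinaNet defaults below are an assumption; the paper also backpropagates only the minimum loss over the three predicted masks, which is omitted here:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    # Sigmoid focal loss [65]; alpha/gamma are assumed defaults.
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * target + (1 - p) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, target, eps=1.0):
    # Soft dice loss [73], computed per mask and averaged over the batch.
    p = torch.sigmoid(logits).flatten(1)
    t = target.flatten(1)
    inter = (p * t).sum(-1)
    return (1 - (2 * inter + eps) / (p.sum(-1) + t.sum(-1) + eps)).mean()

def sam_loss(mask_logits, gt_mask, iou_pred):
    # L_mask: focal and dice combined in a 20:1 ratio.
    l_mask = 20.0 * focal_loss(mask_logits, gt_mask) + dice_loss(mask_logits, gt_mask)
    # L_IoU: MSE between the predicted IoU and the actual IoU of the
    # thresholded predicted mask with the ground truth.
    with torch.no_grad():
        pred = (mask_logits > 0).float().flatten(1)
        t = gt_mask.flatten(1)
        inter = (pred * t).sum(-1)
        union = pred.sum(-1) + t.sum(-1) - inter
        actual_iou = inter / union.clamp(min=1e-6)
    l_iou = F.mse_loss(iou_pred, actual_iou)
    return l_mask + 1.0 * l_iou  # IoU term weighted by 1.0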

Detail
Image Encoder
An MAE pre-trained Vision Transformer (ViT) adapted to high-resolution (1024×1024) inputs. It is the heavyweight component of SAM: it runs once per image and produces a 64×64 image embedding with 256 channels that all subsequent prompts reuse.

Prompt Encoder
Dense Prompt
Dense prompts (masks) have a spatial correspondence with the image.

- Masks are input at a 4× lower resolution than the input image, then downscaled an additional 4× using two 2×2, stride-2 convolutions with output channels 4 and 16, respectively.
- A final 1×1 convolution maps the channel dimension to 256.
- Each layer is separated by GELU activations [50] and layer normalization.
- The mask embedding is then added element-wise to the image embedding. If there is no mask prompt, a learned embedding representing "no mask" is added at each image embedding location. (A sketch of the downscaler follows.)
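A minimal PyTorch sketch of this mask downscaler, assuming the paper's default 1024×1024 input (so the mask enters at 256×256 and exits at 64×64, matching the image embedding); LayerNorm2d is a per-pixel, channel-wise layer norm as in SAM's released code:

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.Module):
    # Channel-wise LayerNorm over NCHW feature maps.
    def __init__(self, ch, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ch))
        self.bias = nn.Parameter(torch.zeros(ch))
        self.eps = eps

    def forward(self, x):
        mu = x.mean(1, keepdim=True)
        var = (x - mu).pow(2).mean(1, keepdim=True)
        x = (x - mu) / torch.sqrt(var + self.eps)
        return self.weight[:, None, None] * x + self.bias[:, None, None]

# 256x256 input mask -> 64x64 embedding with 256 channels.
mask_downscaler = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=2, stride=2),   # 256 -> 128, 4 channels
    LayerNorm2d(4), nn.GELU(),
    nn.Conv2d(4, 16, kernel_size=2, stride=2),  # 128 -> 64, 16 channels
    LayerNorm2d(16), nn.GELU(),
    nn.Conv2d(16, 256, kernel_size=1),          # map channels to 256
)

dense_embedding = mask_downscaler(torch.randn(1, 1, 256, 256))
print(dense_embedding.shape)  # torch.Size([1, 256, 64, 64])
```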
Sparse Prompt
Sparse prompts (points, boxes, text) are mapped to 256-dimensional embedding vectors.
The prompt encoder accepts multiple prompts at a time; a minimal encoding sketch follows the list below.

- Point: a point is represented as the sum of:
  - a positional encoding of the point's (x, y) coordinates;
  - one of two learned embeddings indicating whether the point is labeled "foreground" or "background".
- Box: a box is represented by an embedding pair:
  - the positional encoding of its top-left corner summed with a learned embedding representing "top-left corner";
  - the same structure for the bottom-right corner, using a learned embedding representing "bottom-right corner".
- Text: free-form text is encoded with the text encoder from CLIP.
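A minimal sketch of the point/box encoding, assuming a random Fourier-feature positional encoding over normalized coordinates (as in SAM's released PositionEmbeddingRandom); the class and method names here are hypothetical simplifications:

```python
import torch
import torch.nn as nn

class SparsePromptEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Fixed random spatial frequencies for the positional encoding.
        self.register_buffer("freqs", torch.randn(2, dim // 2))
        # Learned type embeddings: 0 = background point, 1 = foreground
        # point, 2 = top-left box corner, 3 = bottom-right box corner.
        self.type_embed = nn.Embedding(4, dim)

    def pos_enc(self, xy):
        # xy: (N, 2) coordinates normalized to [0, 1].
        proj = 2 * torch.pi * (2 * xy - 1) @ self.freqs     # (N, dim/2)
        return torch.cat([proj.sin(), proj.cos()], dim=-1)  # (N, dim)

    def encode_points(self, xy, labels):
        # labels: (N,) long tensor, 1 = foreground, 0 = background.
        return self.pos_enc(xy) + self.type_embed(labels)

    def encode_box(self, corners):
        # corners: (2, 2) tensor [top-left, bottom-right] -> embedding pair.
        return self.pos_enc(corners) + self.type_embed(torch.tensor([2, 3]))

enc = SparsePromptEncoder()
pts = enc.encode_points(torch.rand(3, 2), torch.tensor([1, 1, 0]))  # (3, 256)
box = enc.encode_box(torch.tensor([[0.1, 0.2], [0.6, 0.8]]))        # (2, 256)
```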


Mask Decoder
A modified Transformer decoder block: updates the prompt tokens and the image embedding using prompt self-attention and cross-attention in both directions (tokens-to-image and image-to-tokens).
A dynamic prediction head: an MLP maps the output mask token to the weights of a dynamic linear classifier, which is applied at every location of the upsampled image embedding to compute the mask foreground probability (sketched below).
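A tiny sketch of the dynamic head, assuming the released model's shapes (the image embedding is upsampled 4× and reduced to 32 channels, and the hypernetwork MLP maps the 256-d mask token to 32 weights); hyper_mlp below is a simplified stand-in for SAM's three-layer MLP:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the hypernetwork MLP.
hyper_mlp = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 32))

mask_token = torch.randn(1, 256)         # decoder's output mask token (B, C)
upscaled = torch.randn(1, 32, 256, 256)  # upsampled image embedding (B, C', H, W)

weights = hyper_mlp(mask_token)          # per-mask linear classifier weights (B, C')
mask_logits = torch.einsum("bc,bchw->bhw", weights, upscaled)  # (B, H, W)
```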
Data Collection
Data Engine
SAM is co-developed with its dataset through a three-stage data engine: assisted-manual annotation, semi-automatic annotation, and fully automatic mask generation.
Dataset
The resulting SA-1B dataset contains over 1.1 billion masks on 11 million licensed, privacy-respecting images.
Experimental Results
SAM is evaluated zero-shot on a suite of 23 segmentation datasets and on downstream tasks such as edge detection, object proposal generation, instance segmentation, and text-to-mask prediction, and is often competitive with or better than prior fully supervised results.
Conclusion
- SAM achieves strong zero-shot performance on diverse segmentation tasks by combining a powerful image encoder with a promptable interface and massive dataset.
- SAM’s flexible design positions it as a foundation model for numerous applications, with potential extending beyond computer vision.
- Future work includes enhancing SAM’s capabilities in complex scenes, prompt engineering, and 3D segmentation.