[SAM] Segment Anything

Paper: https://arxiv.org/abs/2304.02643

Code: https://github.com/facebookresearch/segment-anything

Web: https://segment-anything.com/

Motivation

  • Reduce the need for task-specific modeling expertise, training compute, and custom data annotation in image segmentation.
  • Build a foundation model for object segmentation.

Objective

Segment Anything (SAM) = Interactive Segmentation + Automatic Segmentation

  • SAM allows users to interactively segment objects (click, bounding box, text).
  • SAM can output multiple valid masks.
  • SAM can automatically find and mask all objects in an image.
  • SAM can generate a segmentation mask for any prompt in real time once the image embedding has been precomputed, enabling interactive use (see the usage sketch after this list).
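
A minimal usage sketch of both modes, assuming the official segment-anything Python package and a separately downloaded ViT-H checkpoint (the file and image paths below are illustrative):

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor, SamAutomaticMaskGenerator

# Load a SAM backbone (checkpoint path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)

# Interactive mode: run the heavy image encoder once, then prompt repeatedly.
predictor = SamPredictor(sam)
predictor.set_image(image)                      # precomputes the image embedding
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),        # one click, in pixel coordinates
    point_labels=np.array([1]),                 # 1 = foreground, 0 = background
    multimask_output=True,                      # return several valid masks
)

# Automatic mode: segment everything via a grid of point prompts.
mask_generator = SamAutomaticMaskGenerator(sam)
all_masks = mask_generator.generate(image)      # list of dicts, one per mask
```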

Global Framework

Loss Function

1) Lmask: Supervise mask prediction with a linear combination of focal loss [65] and dice loss [73] in a 20:1 ratio of focal loss to dice loss.

2) LIoU: The IoU prediction head is trained with a mean-squared-error loss between the predicted IoU and the predicted mask’s actual IoU with the ground-truth mask. It is added to the mask loss with a constant scaling factor of 1.0 (a combined sketch of both terms follows below).
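
A minimal PyTorch sketch of this combined objective (the focal-loss hyperparameters α = 0.25 and γ = 2.0 are conventional defaults, not values stated in this summary):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    # Sigmoid focal loss, averaged over all pixels.
    prob = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = prob * target + (1 - prob) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, target, eps=1.0):
    # Soft dice loss on the sigmoid probabilities.
    prob = torch.sigmoid(logits)
    num = 2 * (prob * target).sum(dim=(-2, -1)) + eps
    den = prob.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1)) + eps
    return (1 - num / den).mean()

def sam_loss(mask_logits, gt_mask, iou_pred):
    # L_mask: focal and dice combined in a 20:1 ratio (focal : dice).
    l_mask = 20.0 * focal_loss(mask_logits, gt_mask) + 1.0 * dice_loss(mask_logits, gt_mask)
    # L_IoU: MSE between the predicted IoU and the thresholded mask's
    # actual IoU with the ground truth, added with a scale of 1.0.
    pred_bin = (mask_logits > 0).float()
    inter = (pred_bin * gt_mask).sum(dim=(-2, -1))
    union = pred_bin.sum(dim=(-2, -1)) + gt_mask.sum(dim=(-2, -1)) - inter
    l_iou = F.mse_loss(iou_pred, inter / union.clamp(min=1e-6))
    return l_mask + 1.0 * l_iou
```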

Detail

Image Encoder

Prompt Encoder

Dense Prompt

Dense prompts (masks) have a spatial correspondence with the image.

  1. The input mask is given at a 4× lower resolution than the input image, then downscaled an additional 4× using two 2×2, stride-2 convolutions with output channels 4 and 16, respectively.
  2. A final 1×1 convolution maps the channel dimension to 256.
  3. Each layer is separated by GELU activations [50] and layer normalization.
  4. The mask embedding and the image embedding are then added element-wise. If there is no mask prompt, a learned embedding representing “no mask” is added at every image embedding location (see the sketch after this list).
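
A minimal PyTorch sketch of this path, assuming a 1024×1024 input image and a 64×64×256 image embedding (the LayerNorm2d helper reflects the channel-wise layer normalization described above):

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.Module):
    # Channel-wise layer normalization for NCHW tensors.
    def __init__(self, num_channels, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(num_channels))
        self.bias = nn.Parameter(torch.zeros(num_channels))
        self.eps = eps

    def forward(self, x):
        mu = x.mean(1, keepdim=True)
        var = (x - mu).pow(2).mean(1, keepdim=True)
        x = (x - mu) / torch.sqrt(var + self.eps)
        return self.weight[:, None, None] * x + self.bias[:, None, None]

# Two 2x2 stride-2 convs (4x downscale) followed by a 1x1 conv to 256 channels.
mask_downscaling = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=2, stride=2),
    LayerNorm2d(4),
    nn.GELU(),
    nn.Conv2d(4, 16, kernel_size=2, stride=2),
    LayerNorm2d(16),
    nn.GELU(),
    nn.Conv2d(16, 256, kernel_size=1),
)

# A mask prompt at 4x lower resolution than a 1024x1024 image (i.e. 256x256)
# is downscaled a further 4x to match the 64x64x256 image embedding.
mask_prompt = torch.zeros(1, 1, 256, 256)
dense_embedding = mask_downscaling(mask_prompt)   # shape: (1, 256, 64, 64)
# With no mask prompt, a learned "no mask" embedding is broadcast instead.
```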

Sparse Prompt

Sparse prompts (points, boxes, text) are each mapped to 256-dimensional embedding vectors; free-form text is encoded with an off-the-shelf text encoder from CLIP.

The prompt encoder accepts multiple prompts at a time; a minimal encoding sketch follows the list below.

  1. Point: A point is represented as the sum of:
    • A positional encoding of the point’s (x, y) coordinates, giving a positional embedding vector.
    • One of two learned embeddings that indicates whether the point is labeled “foreground” or “background”.
  2. Box: A box is represented by an embedding pair:
    • The positional encoding of its top-left corner summed with a learned embedding representing “top-left corner”.
    • The positional encoding of its bottom-right corner summed with a learned embedding representing “bottom-right corner”.
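
A rough sketch of how a point or box could be mapped to such embeddings, using random Fourier features for the positional encoding and learned type embeddings (class and method names are illustrative, not SAM’s actual module names):

```python
import torch
import torch.nn as nn

class SparsePromptEncoder(nn.Module):
    def __init__(self, embed_dim=256, scale=1.0):
        super().__init__()
        # Random Fourier features for encoding (x, y) coordinates in [0, 1]^2.
        self.register_buffer("pe_gaussian", scale * torch.randn(2, embed_dim // 2))
        # Learned type embeddings: 0 = background point, 1 = foreground point,
        # 2 = box top-left corner, 3 = box bottom-right corner.
        self.type_embed = nn.Embedding(4, embed_dim)

    def positional_encoding(self, coords):               # coords: (N, 2), normalized
        proj = 2 * torch.pi * coords @ self.pe_gaussian  # (N, embed_dim // 2)
        return torch.cat([proj.sin(), proj.cos()], dim=-1)

    def encode_points(self, coords, labels):
        # Positional embedding of each click plus its foreground/background embedding.
        return self.positional_encoding(coords) + self.type_embed(labels)

    def encode_box(self, box):
        # box: (x1, y1, x2, y2), normalized; encoded as a pair of corner tokens.
        corners = box.view(2, 2)                          # top-left, bottom-right
        corner_ids = torch.tensor([2, 3], device=box.device)
        return self.positional_encoding(corners) + self.type_embed(corner_ids)

# Example: one foreground click and one box on a normalized coordinate grid.
enc = SparsePromptEncoder()
point_tokens = enc.encode_points(torch.tensor([[0.4, 0.6]]), torch.tensor([1]))
box_tokens = enc.encode_box(torch.tensor([0.2, 0.1, 0.8, 0.9]))
```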

Mask Decoder

A modified Transformer Decoder Block

A Dynamic Prediction Head

Data Collection

Data Engine

Dataset

Experimental Results

Conclusion

  1. SAM achieves strong zero-shot performance on diverse segmentation tasks by combining a powerful image encoder with a promptable interface and massive dataset.
  2. SAM’s flexible design positions it as a foundation model for numerous applications, with potential extending beyond computer vision.
  3. Future work includes enhancing SAM’s capabilities in complex scenes, prompt engineering, and 3D segmentation.