[Paper Review] Segment Anything Model 2 (SAM2)

5 minute read


Introduction

Unlike static images, videos require reasoning over spatio-temporal extents—object appearance changes across time and interactions with the scene evolve. SAM2 tackles this challenge by extending promptable visual segmentation (PVS) to video, powered by a memory-based mechanism that maintains and updates object representations across frames.

  • SAM2 integrates memory modules that store object features and interaction history from previous frames.
  • In static images, the memory remains empty, and SAM2 naturally reverts to SAM behavior.

Overview

Beyond architecture, SAM2 introduces a data engine designed to build a generalized segmentation dataset through semi-automatic annotation. Unlike previous segmentation datasets, which only annotate whole objects, the resulting dataset includes finer-grained labels such as parts and subparts (e.g., a dog’s tongue).

Relevant domains include:

  • Image Segmentation (e.g., SAM)
  • Interactive and Semi-supervised Video Object Segmentation (VOS)
  • Dataset construction for fine-grained object understanding

Tongue Example

SAM2 supports part-level segmentation by keeping prompts free-form and resolving the resulting ambiguity over whether a prompt refers to a whole object, a part, or a subpart.

Task: Promptable Visual Segmentation (PVS)

SAM2 follows the prompt-based segmentation paradigm from SAM, supporting multiple prompt types:

  • Positive/Negative points (clicks)
  • Bounding boxes (bboxes)
  • Reference masks

These prompts can be used to:

  1. Select new target objects
  2. Refine or correct segmentation

Evaluation settings include:

  • Online/offline multi-frame video segmentation with varying prompt frequency
  • Semi-supervised VOS (only first-frame label provided)
  • Image segmentation (SA benchmark)
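
As a rough mental model, the prompt types and their two uses above can be represented as simple records attached to frame indices. A minimal sketch in Python (all names are my own, not the SAM2 API):

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

import numpy as np

# Hypothetical containers for PVS prompts; the names are illustrative,
# not the actual SAM2 API.
@dataclass
class FramePrompt:
    frame_idx: int                                     # frame the prompt refers to
    points: List[Tuple[float, float]] = field(default_factory=list)  # (x, y) clicks
    labels: List[int] = field(default_factory=list)    # 1 = positive, 0 = negative
    box: Optional[Tuple[float, float, float, float]] = None  # (x0, y0, x1, y1)
    mask: Optional[np.ndarray] = None                  # reference mask, H x W

# Example: a positive click selects the object on frame 0; a negative click
# on frame 30 corrects the propagated prediction.
select = FramePrompt(frame_idx=0, points=[(410.0, 220.0)], labels=[1])
refine = FramePrompt(frame_idx=30, points=[(395.0, 260.0)], labels=[0])
```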

Model Architecture

Model Block

At its core, SAM2 performs single-frame segmentation and extends it over time using memory. Before the mask for a frame is decoded, that frame’s embedding is conditioned on two kinds of stored information:

  1. Memories of prompted frames (which, in offline mode, may come before or after the current frame)
  2. Memories of past predictions from recent frames, stored in the memory bank

These memories are jointly attended to, and the conditioned embedding is used to predict the mask for the current frame; a sketch of this per-frame loop follows.
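
A minimal sketch, assuming hypothetical `image_encoder`, `memory_attention`, `mask_decoder`, `memory_encoder`, and `memory_bank` components (a paraphrase of the pipeline, not the released code):

```python
# Schematic per-frame loop of a SAM2-style model. All components are passed
# in as hypothetical callables/objects; this is a paraphrase of the pipeline,
# not the released implementation.
def segment_video(frames, prompts, image_encoder, memory_attention,
                  mask_decoder, memory_encoder, memory_bank):
    masks = []
    for t, frame in enumerate(frames):
        z_t = image_encoder(frame)                       # unconditioned frame tokens
        z_t = memory_attention(z_t, memory_bank.read())  # condition on stored memories
        mask_t = mask_decoder(z_t, prompts.get(t))       # prompts may be absent for frame t
        memory_bank.write(memory_encoder(z_t, mask_t))   # store memory for later frames
        masks.append(mask_t)
    return masks
```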

Image Encoder

SAM2 uses a Hiera (hierarchical MAE-pretrained) encoder to generate multi-scale image features:

  • These are unconditional tokens $\phi(x_t) \in \mathbb{R}^{N_t \times d}$, where $N_t$ is the number of patches in frame $t$.
  • Hierarchical encoding allows capturing both coarse and fine image structures.
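
As a shape-level illustration only (not Hiera itself), a hierarchical encoder emits feature grids at several strides, so the token count $N_t$ differs per level. A toy sketch with assumed channel widths:

```python
import torch
import torch.nn as nn

# Toy stand-in for a hierarchical encoder: strided convolutions produce
# feature maps at strides 4, 8, 16, 32, mimicking multi-scale token grids.
# This is only a shape illustration, not the Hiera architecture.
class ToyHierarchicalEncoder(nn.Module):
    def __init__(self, dims=(96, 192, 384, 768)):
        super().__init__()
        chans = (3,) + dims
        self.stages = nn.ModuleList(
            nn.Conv2d(chans[i], chans[i + 1],
                      kernel_size=4 if i == 0 else 2,
                      stride=4 if i == 0 else 2)
            for i in range(4)
        )

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)          # one feature map per scale
        return feats

frame = torch.randn(1, 3, 1024, 1024)
for f in ToyHierarchicalEncoder()(frame):
    print(tuple(f.shape))            # (1, 96, 256, 256) ... (1, 768, 32, 32)
```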

Memory Attention

For a current frame $x_t$, its representation $\, z_t = \phi(x_t) \,$ is updated by attending to previously stored frames $\{z_{t-k}\}$ and their associated object masks $\{m_{t-k}\}$.

The conditioned embedding is computed via:

\[\tilde{z}_t = \text{Attention}(z_t, [z_{t-1}, z_{t-2}, \ldots], [m_{t-1}, m_{t-2}, \ldots])\]

Internally, this consists of self-attention among $z_t$ and cross-attention with memory tokens.
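
A single-layer sketch of this conditioning step using standard attention modules (the dimensions are assumptions; the real block stacks several layers and adds positional encodings):

```python
import torch
import torch.nn as nn

class ToyMemoryAttention(nn.Module):
    """Self-attention over current-frame tokens followed by cross-attention
    into memory tokens. A single-layer sketch, not the actual SAM2 block."""
    def __init__(self, d=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, z_t, memory):
        # z_t:    (B, N_t, d)   unconditioned tokens of the current frame
        # memory: (B, N_mem, d) concatenated memory tokens from stored frames/masks
        z, _ = self.self_attn(z_t, z_t, z_t)
        z_tilde, _ = self.cross_attn(z, memory, memory)
        return z_tilde

z_t = torch.randn(1, 32 * 32, 256)          # current frame tokens
memory = torch.randn(1, 2 * 32 * 32, 256)   # e.g. two stored frames
print(ToyMemoryAttention()(z_t, memory).shape)  # torch.Size([1, 1024, 256])
```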

Prompt Encoder & Mask Decoder

SAM2 inherits the two-way transformer architecture from SAM, which updates both the prompt and image embeddings iteratively:

Prompt Module

  • Spatially ambiguous prompts produce multiple candidate masks
  • The most confident candidate (highest predicted IoU) is selected and propagated
  • SAM2 also predicts object presence, allowing for null masks when no object exists in the frame
  • Skip connections from the encoder directly enhance the final mask quality
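
The ambiguity and presence handling can be sketched as choosing among several candidate masks by predicted IoU and gating on an object-presence score (shapes and the threshold are illustrative, not the exact decoder head):

```python
import torch

def select_mask(cand_masks, pred_ious, obj_score, presence_threshold=0.0):
    """cand_masks: (K, H, W) candidate mask logits for one prompt
    pred_ious:  (K,) predicted IoU per candidate
    obj_score:  () object-presence logit for the frame
    Returns the highest-IoU candidate, or None if the object is judged absent.
    Illustrative logic only, not the exact SAM2 decoder head."""
    if obj_score.item() < presence_threshold:
        return None                        # object not present -> null mask
    best = torch.argmax(pred_ious)
    return cand_masks[best]

masks = torch.randn(3, 256, 256)           # 3 candidates for an ambiguous click
ious = torch.tensor([0.62, 0.81, 0.44])
print(select_mask(masks, ious, torch.tensor(2.3)).shape)  # torch.Size([256, 256])
```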

Memory Encoder

Memory encoding combines:

  • Downsampled predicted masks $m_{t-1}$
  • Corresponding frame embeddings $z_{t-1}$

These are added elementwise and passed through convolutional layers to form memory tokens:

\[M_{t-1} = \text{Conv}(z_{t-1} + \text{Down}(m_{t-1}))\]
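
In code, this fusion amounts to downsampling the mask to the feature resolution, adding it elementwise, and applying light convolutions. A sketch with assumed channel sizes and layer counts:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMemoryEncoder(nn.Module):
    """Fuse a predicted mask with its frame embedding into memory features.
    Channel sizes and layer count are assumptions, not the paper's exact config."""
    def __init__(self, d=256, mem_dim=64):
        super().__init__()
        self.mask_proj = nn.Conv2d(1, d, kernel_size=1)   # lift mask to d channels
        self.fuse = nn.Sequential(
            nn.Conv2d(d, mem_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(mem_dim, mem_dim, kernel_size=3, padding=1),
        )

    def forward(self, frame_feat, mask_logits):
        # frame_feat:  (B, d, H, W)       embedding z_{t-1}
        # mask_logits: (B, 1, H_img, W_img) predicted mask m_{t-1}
        m = F.interpolate(mask_logits, size=frame_feat.shape[-2:],
                          mode="bilinear", align_corners=False)  # Down(m_{t-1})
        return self.fuse(frame_feat + self.mask_proj(m))         # Conv(z + Down(m))

feat = torch.randn(1, 256, 64, 64)
mask = torch.randn(1, 1, 1024, 1024)
print(ToyMemoryEncoder()(feat, mask).shape)  # torch.Size([1, 64, 64, 64])
```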

Memory Bank

  • Memory tokens and prompt frames are stored in FIFO queues
  • Object pointers are derived from decoder output and stored separately
  • Memory attention cross-attends the current frame’s tokens to both the spatial memory tokens and the object pointers
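
The bank itself behaves like two bounded FIFO queues plus the object pointers; a minimal sketch with illustrative capacities:

```python
from collections import deque

class ToyMemoryBank:
    """FIFO storage for recent-frame memories and prompted-frame memories,
    plus object pointer vectors from the decoder. Capacities are illustrative."""
    def __init__(self, max_recent=6, max_prompted=8):
        self.recent = deque(maxlen=max_recent)       # memories of recent frames
        self.prompted = deque(maxlen=max_prompted)   # memories of prompted frames
        self.obj_pointers = deque(maxlen=max_recent)

    def write(self, mem_tokens, obj_ptr, is_prompted=False):
        (self.prompted if is_prompted else self.recent).append(mem_tokens)
        self.obj_pointers.append(obj_ptr)

    def read(self):
        # Everything is concatenated and attended to by memory attention.
        return list(self.prompted) + list(self.recent), list(self.obj_pointers)
```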

Data Engine Phases

Phase 1: Per-frame SAM

Each frame is independently annotated using SAM:

  • No temporal reuse → slow but accurate
  • Speed: ≈37.8 sec/frame

Phase 2: SAM + SAM2 Mask Propagation

First frame annotated with SAM, others automatically labeled by SAM2:

  • Annotator corrects future frames
  • Speed: ≈7.4 sec/frame

Phase 3: Full SAM2 Prompt Loop

All SAM2 prompt types used to initialize and refine masks:

  • Annotator edits rather than drawing from scratch
  • Speed: ≈4.5 sec/frame
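
The relative speedups implied by the reported per-frame times are easy to verify:

```python
# Per-frame annotation times reported for the three data-engine phases.
times = {"Phase 1 (per-frame SAM)": 37.8,
         "Phase 2 (SAM + propagation)": 7.4,
         "Phase 3 (full SAM2 loop)": 4.5}

baseline = times["Phase 1 (per-frame SAM)"]
for phase, t in times.items():
    print(f"{phase}: {t:.1f} s/frame, {baseline / t:.1f}x faster than Phase 1")
# Phase 3 comes out to ~8.4x faster, matching the appendix numbers.
```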

Data Loop

Quality Verification

Masks are labeled:

  • “Satisfactory”: accurate and consistent over time
  • “Unsatisfactory”: spatially incorrect or temporally inconsistent → sent back for refinement

Auto Masklet Generation

To ensure diverse annotations, SAM2 auto-generates candidate masks using grid-based prompts on the first frame.

  • Satisfactory ones are added directly
  • Unsatisfactory ones are sent for manual refinement
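
Grid-based prompting on the first frame amounts to a regular lattice of point prompts; a minimal sketch (the grid density here is an assumption, not the paper's setting):

```python
import numpy as np

def grid_point_prompts(height, width, points_per_side=16):
    """Return an array of (x, y) point prompts laid out on a regular grid,
    offset by half a cell so points fall inside the image. The grid density
    is an illustrative choice."""
    xs = (np.arange(points_per_side) + 0.5) * (width / points_per_side)
    ys = (np.arange(points_per_side) + 0.5) * (height / points_per_side)
    gx, gy = np.meshgrid(xs, ys)
    return np.stack([gx.ravel(), gy.ravel()], axis=1)  # (points_per_side**2, 2)

print(grid_point_prompts(1080, 1920).shape)  # (256, 2)
```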

Auto Masklets

Video Segmentation Tasks

VOS Comparison

  • SAM2 outperforms conventional pipelines like SAM+XMem, SAM+Cutie
  • Its memory and object-prompt integration yield better temporal consistency

SOTA Results

  • SAM2 performs better in object boundary quality and frame consistency, especially under class-agnostic settings

Comparison to SOTA: Semi-supervised VOS

SSVOS Benchmarks

Using Hiera-S and Hiera-B encoders, SAM2 surpasses the best previous semi-supervised video segmentation models on the J&F metric.
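
For reference, J&F is the standard VOS metric: the mean of the region similarity $\mathcal{J}$ (mask IoU) and the contour accuracy $\mathcal{F}$ (boundary F-measure):

\[\mathcal{J} = \frac{|M \cap G|}{|M \cup G|}, \qquad \mathcal{F} = \frac{2 P_c R_c}{P_c + R_c}, \qquad \mathcal{J}\&\mathcal{F} = \frac{\mathcal{J} + \mathcal{F}}{2},\]

where $M$ is the predicted mask, $G$ the ground-truth mask, and $P_c$, $R_c$ are the contour precision and recall.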

Appendix

1. Memory Encoder & Bank (Appendix C.1)

Memory Diagram

Memory is formed by combining frame embeddings with mask tokens:

\[M_t = \text{Conv}(\phi(x_t) + \text{Down}(m_t))\]

  • Object pointers come from decoder outputs and represent salient regions for the current frame
  • These are used in cross-attention to refine future predictions

2. Model Capacity (Appendix C.2.1)

Scalability tradeoffs:

  • Input resolution: 512 → 1024 gives about +4% J&F but is roughly 4.5× slower
  • Number of memory frames: 4 → 8 gives about +2% J&F with a negligible speed drop

Model Scaling

3. Data Engine Comparison (Appendix D.2.2)

Data Efficiency

  • Phase 3 reduces annotation time to 4.5 s/frame (8.4× faster than Phase 1)
  • The fraction of frames needing manual refinement drops to 19%, and the average number of clicks per frame decreases as well

The data engine is key for category-agnostic generalization, preventing overfitting to predefined classes.

4. Zero-shot Video Tasks (Appendix E.1)

  • Achieves 75.4% J&F with 3 clicks
  • Scores 77.6% on zero-shot video datasets with no retraining

Zero-shot Results

Conclusion

SAM2 represents a major leap in promptable segmentation for videos:

  • Integrates spatio-temporal memory, improving object tracking over time
  • Uses flexible prompts and structured memory to handle occlusion, ambiguity, and absence
  • Combines accurate segmentation with efficient annotation pipelines

Through its generalization and high annotation throughput, SAM2 provides a foundation for large-scale, object-centric video understanding.