Meta Unveils SAM 3: Promptable Concept Segmentation for Images and Videos

Meta AI launched SAM 3, a unified 848M-parameter model for promptable concept segmentation that detects, segments and tracks open-vocabulary concepts across images and long videos using text and visual prompts.

What SAM 3 brings

Meta AI released Segment Anything Model 3 (SAM 3), a unified open-source foundation model for promptable segmentation that operates on visual concepts rather than only pixels. SAM 3 detects, segments and tracks every instance of a concept across images and videos using text prompts and visual prompts such as points, boxes and exemplar crops. Compared with SAM 2, SAM 3 aims to exhaustively find all instances of open-vocabulary concepts — for example every 'red baseball cap' across a long video — with a single model.

From interactive masks to promptable concept segmentation

Earlier Segment Anything models emphasized interactive segmentation: a user provided a point or box and the model returned a single mask. That approach does not scale when systems must locate all instances of a concept across large image or video collections. SAM 3 formalizes Promptable Concept Segmentation (PCS): the model accepts concept prompts and returns instance masks along with stable identities for each matching object in images and videos.
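
To make the PCS contract concrete, the sketch below models it as a function from a concept prompt to every matching instance with a stable identity. All class and function names here are illustrative assumptions, not the released SAM 3 API.

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    object_id: int      # identity kept stable across video frames
    frame_index: int    # 0 for single images
    mask_rle: str       # run-length-encoded binary mask
    score: float        # confidence that this instance matches the concept

@dataclass
class PCSResult:
    concept: str
    instances: list = field(default_factory=list)

def promptable_concept_segmentation(frames, concept):
    """Hypothetical PCS call: return *every* instance of `concept` in *every*
    frame, rather than one mask per click as in interactive segmentation."""
    result = PCSResult(concept=concept)
    # ... model inference would populate result.instances here ...
    return result

# One call covers the whole clip for a single open-vocabulary phrase.
clip = ["frame_0.jpg", "frame_1.jpg"]   # placeholder frame references
caps = promptable_concept_segmentation(clip, "red baseball cap")
print(caps.concept, len(caps.instances))
```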

Concept prompts combine short noun phrases with visual exemplars. Text prompts allow phrases like 'yellow school bus' or 'player in red', while exemplar crops act as positive or negative examples to disambiguate fine-grained visual differences. SAM 3 can also be invoked from multimodal large language models that generate extended referring expressions and distill them into compact concept prompts for the model to act on.
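
The prompt itself, as described above, bundles a short noun phrase with optional positive and negative exemplar crops. The structure below is a sketch of that combination; the field names are assumptions rather than the official interface.

```python
from dataclasses import dataclass, field

@dataclass
class ExemplarCrop:
    image_path: str
    box_xyxy: tuple     # (x1, y1, x2, y2) region containing an example object
    positive: bool      # True: "more like this", False: "not like this"

@dataclass
class ConceptPrompt:
    noun_phrase: str
    exemplars: list = field(default_factory=list)

# Disambiguate a fine-grained concept with one positive and one negative crop.
prompt = ConceptPrompt(
    noun_phrase="player in red",
    exemplars=[
        ExemplarCrop("frame_12.jpg", (40, 60, 120, 260), positive=True),
        ExemplarCrop("frame_12.jpg", (300, 55, 380, 250), positive=False),
    ],
)
print(prompt.noun_phrase, len(prompt.exemplars))
```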

Architecture, presence token and tracking

SAM 3 has 848 million parameters and is composed of a detector and a tracker that share a single vision encoder. The detector uses a DETR-style architecture conditioned on three inputs: text prompts, geometric prompts and image exemplars. This design separates core image representation from prompting interfaces, enabling one backbone to support many segmentation tasks.
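
The separation described here, a single shared vision encoder feeding a prompt-conditioned DETR-style detector, can be pictured roughly as below. This is a structural sketch with assumed module names and sizes, not the actual SAM 3 implementation.

```python
import torch
import torch.nn as nn

class SharedVisionEncoder(nn.Module):
    """Single backbone whose features are reused by detector and tracker."""
    def __init__(self, dim=256):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)

    def forward(self, images):                      # (B, 3, H, W)
        feats = self.patchify(images)               # (B, dim, H/16, W/16)
        return feats.flatten(2).transpose(1, 2)     # (B, tokens, dim)

class PromptConditionedDetector(nn.Module):
    """DETR-style head: object queries attend to image tokens concatenated
    with embedded text / geometric / exemplar prompts (shapes assumed)."""
    def __init__(self, dim=256, num_queries=100):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, image_tokens, prompt_tokens):
        memory = torch.cat([image_tokens, prompt_tokens], dim=1)
        q = self.queries.unsqueeze(0).repeat(image_tokens.size(0), 1, 1)
        return self.decoder(q, memory)              # per-query features -> boxes/masks

encoder, detector = SharedVisionEncoder(), PromptConditionedDetector()
images = torch.randn(1, 3, 256, 256)
prompt_tokens = torch.randn(1, 4, 256)              # e.g. embedded phrase + exemplars
query_feats = detector(encoder(images), prompt_tokens)
print(query_feats.shape)                            # torch.Size([1, 100, 256])
```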

A notable addition is the presence token, a module that predicts whether each candidate box or mask actually corresponds to the requested concept. The presence token reduces confusion when related prompts are used simultaneously, for example 'a player in white' versus 'a player in red'. In SAM 3, recognition (classifying a candidate as the concept) is decoupled from localization (predicting box and mask shapes), which improves open-vocabulary precision.
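
That decoupling can be illustrated with a toy scoring head in which one global presence probability answers "is the prompted concept here at all?" and per-query scores only rank candidates. This is a simplified sketch under that reading, not the real presence-token implementation.

```python
import torch
import torch.nn as nn

class PresenceScoring(nn.Module):
    """Toy decoupling of recognition from localization (assumed design)."""
    def __init__(self, dim=256):
        super().__init__()
        self.presence_head = nn.Linear(dim, 1)   # global: is the concept present?
        self.query_head = nn.Linear(dim, 1)      # local: how good is each candidate?

    def forward(self, query_feats):              # (B, Q, dim) decoder outputs
        presence = torch.sigmoid(self.presence_head(query_feats.mean(dim=1)))  # (B, 1)
        per_query = torch.sigmoid(self.query_head(query_feats))                # (B, Q, 1)
        return per_query * presence.unsqueeze(1)  # final per-candidate confidence

scores = PresenceScoring()(torch.randn(2, 100, 256))
print(scores.shape)   # torch.Size([2, 100, 1])
```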

For video, SAM 3 reuses a transformer encoder-decoder tracker concept from SAM 2 but integrates it tightly with the new detector. The tracker propagates instance identities across frames and supports interactive refinement. This decoupled detector-and-tracker design minimizes task interference, scales with more data and concepts, and preserves interactive refinement capabilities similar to earlier Segment Anything models.
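
SAM 3's tracker is a transformer memory model, but the core contract of identity propagation can be shown with something much simpler. The greedy IoU association below is a deliberately naive stand-in that only illustrates what "propagating instance identities across frames" means; it is not the SAM 2-style tracker.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def propagate_identities(frame_detections, iou_thresh=0.5):
    """Greedy IoU association: keeps each object's id stable across frames."""
    tracks, next_id, history = {}, 0, []
    for dets in frame_detections:                # dets: list of boxes per frame
        assigned = {}
        for box in dets:
            best_id, best_iou = None, iou_thresh
            for tid, prev_box in tracks.items():
                score = iou(box, prev_box)
                if score > best_iou:
                    best_id, best_iou = tid, score
            if best_id is None:                  # unmatched -> new identity
                best_id, next_id = next_id, next_id + 1
            assigned[best_id] = box
        tracks = assigned
        history.append(assigned)
    return history

# Two frames, one object drifting right: the same id is kept in both frames.
print(propagate_identities([[(10, 10, 50, 50)], [(14, 10, 54, 50)]]))
```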

SA-Co dataset and benchmark suite

To train and evaluate PCS, Meta introduces the SA-Co family of datasets and benchmarks. The SA-Co benchmark includes around 270,000 unique evaluated concepts, more than 50 times the concept coverage of previous open-vocabulary segmentation benchmarks. Every image or video in SA-Co is paired with noun phrases and dense instance masks for all objects matching each phrase, including hard negatives where no objects should match.
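
A plausible shape for one SA-Co-style record, given that description, is sketched below; the field names are assumptions, and the hard-negative phrase is paired with zero masks by design.

```python
# Illustrative record: media paired with noun phrases, dense instance masks
# for every matching object, and hard negatives that should match nothing.
sample = {
    "media": "videos/park_0123.mp4",
    "phrases": [
        {
            "text": "yellow school bus",
            "instances": [              # one mask per matching object per frame
                {"object_id": 0, "frame": 0, "mask_rle": "..."},
                {"object_id": 0, "frame": 1, "mask_rle": "..."},
            ],
        },
        {
            "text": "red fire truck",   # hard negative: plausible phrase,
            "instances": [],            # absent from this clip, so zero masks
        },
    ],
}

# A correct PCS model should return no instances for the hard-negative phrase.
for phrase in sample["phrases"]:
    print(phrase["text"], "->", len(phrase["instances"]), "instances")
```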

Meta reports that the associated data engine has automatically annotated over 4 million unique concepts, making SA-Co one of the largest high-quality open-vocabulary segmentation corpora. The engine leverages large ontologies, automated checks and hard negative mining to assemble diverse and challenging concept examples needed to train a model that responds robustly to varied text prompts in real-world scenes.
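
Hard-negative mining from an ontology can be pictured as selecting sibling concepts of whatever is actually present, since near-miss phrases are the ones a robust open-vocabulary model must learn to reject. The snippet below is a toy illustration with an invented two-category ontology, not Meta's data engine.

```python
ONTOLOGY = {
    "vehicle": ["yellow school bus", "red fire truck", "city bus"],
    "clothing": ["red baseball cap", "blue beanie"],
}

def mine_hard_negatives(present_concepts, max_per_category=2):
    """Pick same-category concepts that are absent from the scene."""
    negatives = []
    for category, children in ONTOLOGY.items():
        if any(c in present_concepts for c in children):
            negatives += [c for c in children
                          if c not in present_concepts][:max_per_category]
    return negatives

print(mine_hard_negatives({"yellow school bus"}))
# ['red fire truck', 'city bus']  -> annotated with zero matching masks
```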

Image and video performance

On the SA-Co image benchmarks SAM 3 reaches roughly 75 to 80 percent of human performance measured by the cgF1 metric. Competing systems such as OWLv2, DINO-X and Gemini 2.5 fall significantly behind. For example, on SA-Co Gold box detection SAM 3 reports a cgF1 of 55.7 versus OWLv2 at 24.5, DINO-X at 22.5 and Gemini 2.5 at 14.4. These results indicate that a single unified model can outperform specialized detectors on open-vocabulary segmentation tasks.

For video, SAM 3 is evaluated on SA-V, YT-Temporal 1B, SmartGlasses, LVVIS and BURST. Reported metrics include cgF1, pHOTA, mAP and HOTA across datasets, demonstrating the model's ability to handle both image-level PCS and long-horizon video tracking within a single architecture.

Implications for annotation platforms and production stacks

Data-centric annotation platforms such as Encord, CVAT, SuperAnnotate and Picsellia already integrate Segment Anything models for auto-labeling and tracking. SAM 3's promptable concept segmentation and unified image-video tracking create immediate integration and benchmarking opportunities for these platforms: quantifying labeling-cost reductions, measuring quality gains when moving from SAM 2 to SAM 3, and extending zero-shot labeling and model-in-the-loop annotation workflows to dense video datasets and multimodal settings.
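
A model-in-the-loop labeling pass built on a PCS-style model could follow the control flow sketched below: auto-accept high-confidence proposals and route the rest to a human reviewer. Function names, thresholds and the proposal format are all assumptions, not any platform's API.

```python
def model_in_the_loop(frames, concept, model, review_fn, accept_thresh=0.8):
    """Hedged sketch of a pre-annotation workflow: the model proposes concept
    masks, confident ones are auto-accepted, the rest go to a human reviewer."""
    accepted, to_review = [], []
    for frame in frames:
        for proposal in model(frame, concept):        # assumed: yields dicts
            if proposal["score"] >= accept_thresh:
                accepted.append(proposal)
            else:
                to_review.append(proposal)
    accepted += [p for p in map(review_fn, to_review) if p is not None]
    return accepted

# Toy stand-ins to show the control flow only.
fake_model = lambda frame, concept: [{"frame": frame, "mask": "...", "score": 0.9},
                                     {"frame": frame, "mask": "...", "score": 0.4}]
keep_all = lambda p: p
labels = model_in_the_loop(["f0", "f1"], "red baseball cap", fake_model, keep_all)
print(len(labels))   # 4: two auto-accepted, two accepted after review
```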

Key takeaways

SAM 3 unifies image and video promptable concept segmentation into a single 848M-parameter model that accepts text prompts, exemplars, points and boxes. The SA-Co benchmark provides broad concept coverage with roughly 270K evaluated concepts and over 4M auto-annotated concepts. SAM 3 substantially outperforms prior open-vocabulary systems on SA-Co benchmarks, and its decoupled DETR-based detector and SAM 2-style tracker with a presence head enable stable instance tracking and interactive refinement in production settings.
