
Meta AI Launches SAM Audio: Revolutionary Audio Separation

Explore SAM Audio, a unified model for separating target sounds from complex audio mixtures using intuitive prompts.

Overview

Meta has released SAM Audio, a prompt-driven audio separation model that targets a common editing bottleneck: isolating one sound from a real-world mix without building a custom model for each sound class. The release includes three main sizes: sam-audio-small, sam-audio-base, and sam-audio-large. The model is available for download and experimentation in the Segment Anything Playground.

Architecture

SAM Audio employs separate encoders for each conditioning signal, including:

  • An audio encoder for the mixture
  • A text encoder for natural language descriptions
  • A span encoder for time anchors
  • A visual encoder that processes a visual prompt derived from video plus an object mask

The encoded streams are concatenated into a single time-aligned feature sequence, which is processed by a diffusion transformer that applies self-attention over the time-aligned representation and cross-attention to the text features. Finally, a DACVAE decoder reconstructs waveforms, emitting two outputs: target audio and residual audio.
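To make the wiring concrete, here is a minimal, schematic PyTorch sketch of that data flow. Everything in it is an illustrative assumption rather than Meta's implementation: the toy linear encoders stand in for the real audio, text, span, and visual encoders, the simplified transformer blocks stand in for the diffusion transformer, the final linear layer stands in for the DACVAE decoder, and the dimensions are arbitrary.

```python
# Schematic sketch of the SAM Audio data flow described above.
# All module names, dimensions, and toy encoders are placeholder
# assumptions for illustration; this is not Meta's implementation.
import torch
import torch.nn as nn


class ToyDiTBlock(nn.Module):
    """One transformer block: self-attention over the time-aligned
    sequence, cross-attention to the text features, then an MLP."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, text):
        x = x + self.self_attn(self.n1(x), self.n1(x), self.n1(x))[0]
        x = x + self.cross_attn(self.n2(x), text, text)[0]
        return x + self.mlp(self.n3(x))


class ToySAMAudio(nn.Module):
    """Encoders for each conditioning signal -> time-aligned fusion ->
    transformer blocks -> decoder that emits target and residual."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.audio_enc = nn.Linear(1, dim)       # mixture frames -> features
        self.span_enc = nn.Linear(1, dim)        # per-frame span mask -> features
        self.visual_enc = nn.Linear(32, dim)     # per-frame visual-prompt features
        self.text_enc = nn.Embedding(1000, dim)  # token ids -> text features
        self.fuse = nn.Linear(3 * dim, dim)      # concatenate time-aligned streams
        self.blocks = nn.ModuleList(ToyDiTBlock(dim) for _ in range(2))
        self.decoder = nn.Linear(dim, 2)         # stand-in for the DACVAE decoder

    def forward(self, mix, span, visual, text_ids):
        x = torch.cat(
            [self.audio_enc(mix), self.span_enc(span), self.visual_enc(visual)], dim=-1
        )
        x = self.fuse(x)                          # time-aligned representation
        text = self.text_enc(text_ids)            # cross-attention memory
        for blk in self.blocks:
            x = blk(x, text)
        out = self.decoder(x)
        return out[..., 0], out[..., 1]           # target, residual


if __name__ == "__main__":
    model = ToySAMAudio()
    mix = torch.randn(1, 100, 1)                        # mixture frames
    span = torch.zeros(1, 100, 1); span[:, 20:60] = 1.0  # time-anchor mask
    visual = torch.randn(1, 100, 32)                     # visual-prompt features
    text_ids = torch.randint(0, 1000, (1, 8))            # tokenized text prompt
    target, residual = model(mix, span, visual, text_ids)
    print(target.shape, residual.shape)                  # torch.Size([1, 100]) each
```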

Functionality of SAM Audio

SAM Audio takes an input recording containing multiple overlapping sources—like speech, traffic, and music—and separates out a target source based on a prompt. In the public inference API, the model yields two outputs: result.target (the isolated sound) and result.residual (everything else).

The target-plus-residual interface maps onto common editorial operations. For instance, to remove a dog bark from a podcast track, treat the bark as the target and keep only the residual. Conversely, to extract a guitar part from a concert recording, keep the target waveform instead.
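In practice, that workflow might look like the following sketch. The sam_audio package, the SAMAudio class, and the from_pretrained and separate calls are assumed names used only to illustrate the split described above; the output fields result.target and result.residual are the ones the release documents.

```python
# Hedged usage sketch: `SAMAudio`, `from_pretrained`, and `separate` are
# hypothetical names illustrating the target/residual workflow described
# above; consult the official SAM Audio release for the real API.
import soundfile as sf          # pip install soundfile
from sam_audio import SAMAudio  # assumed package/class name

model = SAMAudio.from_pretrained("sam-audio-large")

# Case 1: remove a dog bark from a podcast -> keep the residual.
mixture, sr = sf.read("podcast_episode.wav")
result = model.separate(mixture, sample_rate=sr, text="dog barking")
sf.write("podcast_clean.wav", result.residual, sr)

# Case 2: extract a guitar part from a concert recording -> keep the target.
concert, sr = sf.read("concert.wav")
result = model.separate(concert, sample_rate=sr, text="electric guitar")
sf.write("guitar_stem.wav", result.target, sr)
```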

Prompt Types Supported

Meta positions SAM Audio as a unified model accommodating three prompt types, which can be used alone or in combination (a hedged sketch follows the list):

  1. Text prompting: Describe the sound using natural language (e.g., "dog barking," "singing voice").
  2. Visual prompting: Select a person or object in a video to isolate the audio associated with that visual element.
  3. Span prompting: Mark time segments where the target sound occurs, guiding separation and preventing over-separation in ambiguous contexts.
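As an illustration of combining the three prompt types, the sketch below reuses the same assumed SAMAudio wrapper; the text, visual, and spans argument names are placeholders chosen to mirror the prompt types above, not a documented signature.

```python
# Hedged sketch of combining prompt types; all argument names are
# assumptions chosen to mirror the three prompt types described above.
import soundfile as sf
from sam_audio import SAMAudio  # assumed package/class name

model = SAMAudio.from_pretrained("sam-audio-base")
mixture, sr = sf.read("street_scene.wav")

result = model.separate(
    mixture,
    sample_rate=sr,
    text="dog barking",               # 1. text prompt
    visual=("street_scene.mp4", 17),  # 2. visual prompt: video + object mask id
    spans=[(2.0, 4.5), (9.0, 11.0)],  # 3. span prompt: seconds where the bark occurs
)
sf.write("bark_only.wav", result.target, sr)
```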

Performance Metrics

Meta reports state-of-the-art performance across diverse real-world scenarios, positioning SAM Audio as a unified alternative to specialized audio tools. The research team reports subjective evaluation scores for different categories:

  • General: 3.62 (small), 3.28 (base), 3.50 (large)
  • Instr(pro): 4.49 (large)

Key Takeaways

  1. Unified Model: SAM Audio segments sounds from complex mixtures using text, visual, and time span prompts.
  2. Output Structure: The core API produces target for isolated sound and residual for everything else, facilitating tasks like noise removal or stem extraction.
  3. Multiple Variants: Includes sam-audio-small, sam-audio-base, and sam-audio-large, with additional performance-oriented tv variants.
  4. Beyond Inference: The release also offers a sam-audio-judge model to score separation results against a text description based on quality, recall, precision, and faithfulness (see the sketch below).
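Scoring a separation with the judge model might look like the following sketch; the SAMAudioJudge class, its score method, and the returned field names are assumptions chosen only to mirror the four criteria listed above.

```python
# Hedged sketch of scoring a separation with the judge model; the
# `SAMAudioJudge` class, `score` method, and field names are assumptions.
import soundfile as sf
from sam_audio import SAMAudioJudge  # assumed package/class name

judge = SAMAudioJudge.from_pretrained("sam-audio-judge")
target, sr = sf.read("guitar_stem.wav")

scores = judge.score(target, sample_rate=sr, text="electric guitar")
# Per the release notes, the judge rates quality, recall, precision,
# and faithfulness of the separated audio against the text description.
print(scores.quality, scores.recall, scores.precision, scores.faithfulness)
```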