Meta AI Launches SAM Audio: Revolutionary Audio Separation
Explore SAM Audio, a unified model for separating target sounds from complex audio mixtures using intuitive prompts.
Overview
Meta has released SAM Audio, a prompt-driven audio separation model that targets a common editing bottleneck: isolating one sound from a real-world mix without building a custom model for each sound class. The release includes three main sizes: sam-audio-small, sam-audio-base, and sam-audio-large, and the model is available for download and experimentation in the Segment Anything Playground.
Architecture
SAM Audio employs a separate encoder for each conditioning signal:
- An audio encoder for the mixture
- A text encoder for natural language descriptions
- A span encoder for time anchors
- A visual encoder that processes a visual prompt derived from video plus an object mask
The encoded streams are concatenated into time-aligned features, which are then processed by a diffusion transformer that applies self-attention to the time-aligned representation and cross-attention to the text features. Finally, a DACVAE decoder reconstructs waveforms, emitting two outputs: target audio and residual audio.
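To make this data flow concrete, here is a minimal PyTorch sketch. Every width, module, and the concatenation axis below is an assumption for illustration only, and the diffusion sampling loop and DACVAE latent space are omitted; this is not Meta's released implementation.

```python
# Minimal sketch of the data flow described above; all dimensions and
# modules are stand-ins chosen for exposition.
import torch
import torch.nn as nn

D = 256  # shared feature width (assumed)

class DiTBlock(nn.Module):
    """Self-attention over the time-aligned features, then
    cross-attention to the text features, as described in the text."""
    def __init__(self, d=D, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.n1, self.n2, self.n3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x, text):
        q = self.n1(x)
        x = x + self.self_attn(q, q, q)[0]
        x = x + self.cross_attn(self.n2(x), text, text)[0]
        return x + self.ff(self.n3(x))

class SamAudioSketch(nn.Module):
    def __init__(self, d=D, blocks=2):
        super().__init__()
        # One encoder per conditioning signal (all stand-ins).
        self.audio_enc = nn.Linear(64, d)      # mixture frames
        self.span_enc = nn.Linear(1, d)        # per-frame span mask
        self.visual_enc = nn.Linear(128, d)    # masked video features
        self.text_enc = nn.Embedding(1000, d)  # text token ids
        self.fuse = nn.Linear(3 * d, d)        # merge time-aligned streams
        self.blocks = nn.ModuleList(DiTBlock(d) for _ in range(blocks))
        self.decode = nn.Linear(d, 2 * 64)     # stand-in for the DACVAE decoder

    def forward(self, mixture, span, visual, text_ids):
        # Concatenate the time-aligned streams along the feature axis.
        x = self.fuse(torch.cat([self.audio_enc(mixture),
                                 self.span_enc(span),
                                 self.visual_enc(visual)], dim=-1))
        text = self.text_enc(text_ids)  # reached via cross-attention only
        for blk in self.blocks:
            x = blk(x, text)
        # Two outputs: target audio and residual audio.
        target, residual = self.decode(x).chunk(2, dim=-1)
        return target, residual

model = SamAudioSketch()
target, residual = model(torch.randn(1, 100, 64),   # 100 mixture frames
                         torch.zeros(1, 100, 1),    # empty span mask
                         torch.randn(1, 100, 128),  # visual features
                         torch.randint(0, 1000, (1, 12)))  # text tokens
print(target.shape, residual.shape)  # torch.Size([1, 100, 64]) each
```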
Functionality of SAM Audio
SAM Audio takes an input recording containing multiple overlapping sources—like speech, traffic, and music—and separates out a target source based on a prompt. In the public inference API, the model yields two outputs: result.target (the isolated sound) and result.residual (everything else).
The target-plus-residual interface maps onto common editorial operations, as sketched below. For instance, to remove a dog bark from a podcast track, treat the bark as the target and keep only the residual. Conversely, to extract a guitar part from a concert recording, keep the target waveform instead.
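A hypothetical sketch of both operations follows. The load_sam_audio loader and the separate() signature are assumed names for illustration, not the documented API; only the target and residual fields come from the article above.

```python
# Hypothetical usage sketch: load_sam_audio() and separate() are assumed
# names; only result.target / result.residual come from the release.
import soundfile as sf

model = load_sam_audio("sam-audio-base")  # hypothetical loader
mix, sr = sf.read("podcast.wav")

# Removal: the dog bark is the target, so keep only the residual.
bark = model.separate(mix, text_prompt="dog barking")
sf.write("podcast_clean.wav", bark.residual, sr)

# Extraction: the guitar is the target, so keep the target waveform.
guitar = model.separate(mix, text_prompt="electric guitar")
sf.write("guitar_stem.wav", guitar.target, sr)
```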
Prompt Types Supported
Meta positions SAM Audio as a unified model accommodating three prompt types, which can be used alone or in combination (a combined call is sketched after the list):
- Text prompting: Describe the sound using natural language (e.g., "dog barking," "singing voice").
- Visual prompting: Select a person or object in a video to isolate the audio associated with that visual element.
- Span prompting: Mark time segments where the target sound occurs, guiding separation and preventing over-separation in ambiguous contexts.
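Continuing the hypothetical sketch above, the three prompt types might be combined in a single call. The argument names here are assumptions; only the three prompt types themselves come from the announcement.

```python
# Argument names are assumptions for illustration; selected_object_mask
# stands in for an object picked in the accompanying video.
result = model.separate(
    mix,
    text_prompt="singing voice",         # text: natural-language description
    visual_prompt=selected_object_mask,  # visual: object selected in the video
    span_prompt=[(12.0, 18.5)],          # span: seconds where the target occurs
)
```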
Performance Metrics
SAM Audio shows state-of-the-art performance across diverse real-world scenarios, positioning it as a unified alternative to specialized audio tools. The research team provides subjective evaluation scores for several categories:
| Category | sam-audio-small | sam-audio-base | sam-audio-large |
|---|---|---|---|
| General | 3.62 | 3.28 | 3.50 |
| Instr (pro) | – | – | 4.49 |
Key Takeaways
- Unified Model: SAM Audio segments sounds from complex mixtures using text, visual, and time span prompts.
- Output Structure: The core API produces target for the isolated sound and residual for everything else, facilitating tasks like noise removal or stem extraction.
- Multiple Variants: Includes sam-audio-small, sam-audio-base, and sam-audio-large, with additional performance-oriented tv variants.
- Beyond Inference: The release offers a sam-audio-judge model to score separation results against a text description based on quality, recall, precision, and faithfulness (sketched below).
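Continuing the hypothetical sketch from earlier, the judge model might be invoked along these lines. The loader, evaluate() signature, and score fields are assumed names; only the four criteria come from the release.

```python
# Hypothetical scoring sketch: loader and evaluate() are assumed names.
judge = load_sam_audio("sam-audio-judge")

scores = judge.evaluate(
    separated=result.target,      # candidate separation to score
    description="singing voice",  # text description it is judged against
)
# The release lists four criteria for the judge:
print(scores.quality, scores.recall, scores.precision, scores.faithfulness)
```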