NVIDIA Launches Audio Flamingo 3: Pioneering Open-Source Audio General Intelligence
NVIDIA has released Audio Flamingo 3, an open-source model that advances how AI understands and reasons about sound across speech, ambient noise, and music for extended audio durations.
Introducing Audio Flamingo 3
NVIDIA has unveiled Audio Flamingo 3 (AF3), a groundbreaking open-source large audio-language model that marks a significant step forward in how AI understands and reasons about sound. Unlike previous models limited to speech transcription or basic audio classification, AF3 comprehends audio content in a rich, human-like context — spanning speech, ambient sounds, and music over extended durations.
Core Innovations in Audio Flamingo 3
AF-Whisper: A Unified Audio Encoder
At the heart of AF3 is AF-Whisper, a novel encoder adapted from Whisper-v3. This unified encoder processes speech, ambient sounds, and music with a single architecture, overcoming the inconsistencies of earlier models that relied on separate encoders per modality. It is trained on audio-caption datasets and synthesized metadata, aligning audio with textual representations in a dense 1280-dimensional embedding space.
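To make the idea of a shared audio-text embedding space concrete, here is a minimal sketch of how an audio embedding can be matched against caption embeddings by cosine similarity. The encoders are stubbed with random vectors; only the 1280-dimensional size comes from the article, the rest is illustrative.

```python
import numpy as np

EMBED_DIM = 1280  # dimensionality of AF-Whisper's embedding space

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for encoder outputs: one audio clip, two candidate captions.
rng = np.random.default_rng(0)
audio_emb = rng.standard_normal(EMBED_DIM)
caption_embs = {
    # A caption whose embedding lies close to the audio embedding...
    "a dog barking in the rain": audio_emb + 0.1 * rng.standard_normal(EMBED_DIM),
    # ...and an unrelated caption with an independent embedding.
    "a piano sonata": rng.standard_normal(EMBED_DIM),
}

# The best-aligned caption is the one with the highest cosine similarity.
best = max(caption_embs, key=lambda c: cosine_similarity(audio_emb, caption_embs[c]))
print(best)
```

In a trained model, the audio and text encoders are optimized so that matching pairs end up close in exactly this sense.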
Chain-of-Thought Reasoning for Audio
AF3 introduces on-demand reasoning capabilities by employing the AF-Think dataset comprising 250,000 examples. This allows the model to perform chain-of-thought reasoning, explaining its inference steps before reaching conclusions—an essential advancement toward transparent and interpretable audio AI.
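A chain-of-thought reply interleaves intermediate reasoning with a final conclusion. The sketch below shows one way a client could separate the two; the `Answer:` delimiter is a hypothetical convention for illustration, not AF3's documented output format.

```python
def parse_cot_reply(reply: str, answer_tag: str = "Answer:") -> tuple[str, str]:
    """Split a chain-of-thought reply into (reasoning, final answer).

    `answer_tag` is a hypothetical delimiter; the model's actual output
    format may differ.
    """
    reasoning, _, answer = reply.rpartition(answer_tag)
    return reasoning.strip(), answer.strip()

# Example reply in the assumed format.
reply = (
    "The clip contains a low rumble followed by a rising siren. "
    "Sirens with this pattern are typical of emergency vehicles. "
    "Answer: an ambulance passing by"
)
reasoning, answer = parse_cot_reply(reply)
print(answer)  # an ambulance passing by
```

Exposing the reasoning steps this way is what makes the inference inspectable rather than a black-box label.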
Multi-Turn, Multi-Audio Conversations
Through training on the AF-Chat dataset (75,000 dialogues), AF3 supports contextual conversations involving multiple audio inputs across several turns. It mimics natural human interactions by referencing prior audio cues and enables voice-to-voice communication powered by a streaming text-to-speech module.
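The key requirement for multi-turn, multi-audio chat is that every clip introduced earlier in the dialogue stays referenceable later. A minimal, hypothetical data structure for such a history (not AF3's actual API) might look like:

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str                                        # "user" or "assistant"
    text: str
    audio_ids: list[str] = field(default_factory=list)  # clips attached this turn

@dataclass
class Chat:
    turns: list[Turn] = field(default_factory=list)

    def add(self, role: str, text: str, audio_ids: tuple[str, ...] = ()) -> None:
        self.turns.append(Turn(role, text, list(audio_ids)))

    def audio_in_context(self) -> list[str]:
        """All clips the model can reference, in order of appearance."""
        return [a for t in self.turns for a in t.audio_ids]

chat = Chat()
chat.add("user", "What instrument opens this piece?", ("clip_1.wav",))
chat.add("assistant", "A solo cello.")
chat.add("user", "Compare it with this recording.", ("clip_2.wav",))
print(chat.audio_in_context())  # ['clip_1.wav', 'clip_2.wav']
```

A follow-up like "compare it with this recording" only resolves because both clips remain in the accumulated context.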
Long Audio Reasoning Abilities
AF3 is the first fully open model capable of reasoning over audio inputs lasting up to 10 minutes. Using the LongAudio-XL dataset with 1.25 million examples, it can perform complex tasks such as meeting summarization, podcast comprehension, sarcasm detection, and temporal grounding.
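Long-form audio is commonly handled by sliding overlapping windows over the recording so that no event is cut at a boundary. The windowing below is a generic sketch of that strategy with assumed window and hop sizes; AF3's actual long-audio handling is internal to the model.

```python
def window_bounds(duration_s: float, window_s: float = 30.0, hop_s: float = 25.0):
    """Yield (start, end) offsets in seconds that cover `duration_s` of audio
    with overlapping fixed-length windows."""
    start = 0.0
    while start < duration_s:
        yield start, min(start + window_s, duration_s)
        if start + window_s >= duration_s:
            break
        start += hop_s

# Cover a 10-minute (600 s) recording, the upper bound AF3 reasons over.
bounds = list(window_bounds(600.0))
print(len(bounds), bounds[0], bounds[-1])
```

The 5-second overlap between consecutive windows means a sound spanning a boundary still appears whole in at least one window.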
Performance Benchmarks and Practical Impact
Audio Flamingo 3 outperforms both open and closed-source models on over 20 benchmarks, including:
- MMAU average accuracy: 73.14% (+2.14% over Qwen2.5-Omni)
- LongAudioBench score: 68.6 (evaluated by GPT-4o), surpassing Gemini 2.5 Pro
- LibriSpeech ASR WER: 1.57%, better than Phi-4-mm
- ClothoAQA accuracy: 91.1% (vs. 89.2% for Qwen2.5-Omni)
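The LibriSpeech figure above is a word error rate (WER): the word-level edit distance between the model's transcript and the reference, divided by the reference length. A self-contained implementation for checking transcripts:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[-1][-1] / len(ref)

# One dropped word out of six: WER = 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

A 1.57% WER therefore means fewer than two word-level errors per hundred reference words.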
It also advances voice-chat and speech-generation benchmarks, cutting generation latency to 5.94 seconds (versus 14.62 seconds for Qwen2.5-Omni) while improving similarity metrics.
Data and Open Source Availability
NVIDIA revamped the data pipeline with fully open datasets including:
- AudioSkills-XL: 8 million examples spanning ambient-sound, music, and speech reasoning.
- LongAudio-XL: Long-form speech data from audiobooks, podcasts, and meetings.
- AF-Think: Dataset focused on chain-of-thought reasoning.
- AF-Chat: Designed for multi-turn, multi-audio dialogues.
All datasets, model weights, training recipes, and inference code are fully open-source, fostering reproducibility and accelerating research in auditory reasoning and multi-modal AI.
Toward True Audio General Intelligence
Audio Flamingo 3 sets a new standard in deep auditory understanding by combining scale, innovative training methods, and diverse data sources. This model listens, comprehends, and reasons about audio in ways that were previously unattainable, bringing us closer to real-world Audio General Intelligence.
Explore the research paper, code, and model on Hugging Face to dive deeper into this exciting advancement.