StepFun AI Unveils Step-Audio 2 Mini — Open-Source 8B Speech-to-Speech Model That Tops GPT-4o-Audio
Overview
StepFun AI released Step-Audio 2 Mini, an open-source 8B-parameter speech-to-speech large audio language model (LALM) under the Apache 2.0 license. The model combines powerful text reasoning with fine-grained audio generation to enable expressive, grounded, and real-time audio interactions. Step-Audio 2 Mini demonstrates state-of-the-art results across speech recognition, audio understanding, translation, and conversational benchmarks, outperforming several commercial systems including GPT-4o-Audio.
Unified audio–text tokenization
Instead of a cascaded ASR + LLM + TTS pipeline, Step-Audio 2 uses multimodal discrete token modeling in which text and audio tokens share a single modeling stream. This unified tokenization enables seamless cross-modal reasoning, on-the-fly voice style switching during inference, and outputs that stay consistent in semantics, prosody, and emotion.
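To make this concrete, here is a minimal Python sketch of a shared text–audio token stream. The vocabulary sizes, offset scheme, and helper names are illustrative assumptions for clarity, not the actual Step-Audio 2 tokenizer.

```python
# Hypothetical illustration of a single interleaved token stream.
# Vocabulary sizes and the offset scheme are assumptions, not the real
# Step-Audio 2 vocabulary layout.

TEXT_VOCAB_SIZE = 32_000        # assumed size of the text sub-vocabulary
AUDIO_CODEBOOK_SIZE = 4_096     # assumed size of the discrete audio codebook
AUDIO_OFFSET = TEXT_VOCAB_SIZE  # audio codes are placed after the text tokens

def audio_token(code: int) -> int:
    """Map a discrete audio codec code into the shared vocabulary."""
    assert 0 <= code < AUDIO_CODEBOOK_SIZE
    return AUDIO_OFFSET + code

def is_audio(token_id: int) -> bool:
    """Route a generated token to the audio decoder or the text detokenizer."""
    return token_id >= AUDIO_OFFSET

# One autoregressive stream: text and audio tokens interleave freely, so the
# model can reason in text and emit speech without a separate TTS stage.
stream = [101, 2057, 96, audio_token(17), audio_token(942), audio_token(3), 11, 2048]

text_ids    = [t for t in stream if not is_audio(t)]
audio_codes = [t - AUDIO_OFFSET for t in stream if is_audio(t)]
print(text_ids, audio_codes)  # [101, 2057, 96, 11, 2048] [17, 942, 3]
```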
Expressive and emotion-aware generation
Step-Audio 2 is designed to capture paralinguistic features such as pitch, rhythm, timbre, emotion, and speaking style. It does more than transcribe: it interprets and produces natural-sounding styles and emotional tones such as whispering, sadness, or excitement. On the StepEval-Audio-Paralinguistic benchmark, Step-Audio 2 reaches 83.1% accuracy, far ahead of GPT-4o Audio at 43.5% and Qwen-Omni at 44.2%.
Retrieval-augmented speech generation
The model supports multimodal retrieval-augmented generation (RAG). It integrates web search for factual grounding and introduces audio search that can retrieve real voice samples from a large library and fuse them into responses. This audio retrieval capability enables realistic voice timbre and style imitation at inference time.
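Conceptually, the retrieval step looks like a standard RAG loop with an extra voice index. The sketch below illustrates that flow under stated assumptions: the retriever functions, data structures, and library contents are placeholders, not StepFun's actual retrieval API.

```python
# Toy audio-RAG flow. The retrievers, data structures, and model inputs are
# placeholders that illustrate the idea, not Step-Audio 2's real interfaces.
from dataclasses import dataclass
from typing import Optional

@dataclass
class VoiceSample:
    speaker: str
    style: str               # e.g. "whisper", "excited"
    codec_tokens: list[int]  # discrete audio tokens of the reference clip

def search_web(query: str) -> str:
    """Placeholder: return grounding text from a web search backend."""
    return f"[retrieved facts for: {query}]"

def search_voice_library(style: str, library: list[VoiceSample]) -> Optional[VoiceSample]:
    """Placeholder: pick a reference voice whose style matches the request."""
    return next((v for v in library if v.style == style), None)

def build_model_inputs(query: str, style: str, library: list[VoiceSample]) -> dict:
    facts = search_web(query)                     # textual grounding
    voice = search_voice_library(style, library)  # timbre/style reference
    # Both the grounded text and the retrieved voice tokens are handed to the
    # model, which fuses them into the spoken response.
    return {
        "grounding_text": facts,
        "voice_prompt": voice.codec_tokens if voice else None,
    }

library = [VoiceSample("spk-1", "whisper", [17, 942, 3])]
print(build_model_inputs("Who discovered penicillin?", "whisper", library))
```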
Tool calling and multimodal reasoning
Step-Audio 2 extends beyond generation by supporting tool invocation. It matches textual LLMs on tool selection and parameter accuracy and excels at audio search tool calls, a capability absent from text-only LLMs. This makes the model useful in complex pipelines that require external data lookup, tool interaction, or multimodal decision making.
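The sketch below shows one way such a tool call could be routed in an application; the tool names, JSON argument schema, and dispatcher are hypothetical, not the model's published tool-calling format.

```python
# Hypothetical tool-call routing around the model. Tool names, the JSON
# argument schema, and the dispatcher are illustrative assumptions.
import json

TOOLS = {
    "web_search":   lambda args: f"[web results for: {args['query']}]",
    "audio_search": lambda args: {"voice_id": "spk-1", "style": args.get("style", "neutral")},
}

def dispatch(model_output: str):
    """Parse a (hypothetical) JSON tool call emitted by the model and execute it."""
    call = json.loads(model_output)
    return TOOLS[call["name"]](call["arguments"])

# Example: the model decides it needs a reference voice before answering.
model_output = '{"name": "audio_search", "arguments": {"query": "calm narrator", "style": "whisper"}}'
print(dispatch(model_output))  # the tool result would be fed back to the model
```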
Training scale and data
The model was trained on a massive corpus: 1.356 trillion text and audio tokens, over 8 million hours of real and synthetic audio, and roughly 50,000 distinct speaker voices covering many languages and dialects. A multi-stage pretraining curriculum covered ASR, TTS, speech-to-speech translation, and emotion-labeled conversational synthesis. Step-Audio 2 Mini builds on Qwen2-Audio for text reasoning and CosyVoice for tokenization to achieve both robust language understanding and detailed audio control.
Performance benchmarks
Automatic speech recognition (ASR)
- English: Average WER 3.14%, outperforming GPT-4o Transcribe at about 4.5%.
- Chinese: Average CER 3.08%, substantially lower than GPT-4o and Qwen-Omni (see the WER/CER sketch below).
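WER and CER are word- and character-level edit-distance error rates. To score your own transcripts in the same general way, the jiwer package is a common choice; the strings below are toy examples, not benchmark data.

```python
# Word and character error rates with jiwer (pip install jiwer).
# The reference/hypothesis pair is a toy example, not benchmark data.
import jiwer

reference  = "step audio two mini is an open source speech model"
hypothesis = "step audio too mini is an open source speech model"

print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")  # word-level error rate
print(f"CER: {jiwer.cer(reference, hypothesis):.2%}")  # character-level error rate
```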
Audio understanding (MMAU benchmark)
- Step-Audio 2: 78.0 average, ahead of Omni-R1 at 77.0 and Audio Flamingo 3 at 73.1, with especially strong performance in sound and speech reasoning.
Speech translation
- CoVoST 2 (S2TT): BLEU 39.26, top among open and closed models.
- CVSS (S2ST): BLEU 30.87, ahead of GPT-4o at 23.68 (see the BLEU scoring sketch below).
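These are corpus-level BLEU scores. To evaluate your own translation outputs in a comparable way, the sacrebleu package is a standard option; note that the published CoVoST 2 and CVSS numbers follow each benchmark's own protocol, and the sentences below are toy data.

```python
# Corpus-level BLEU with sacrebleu (pip install sacrebleu); toy sentences only.
import sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```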
Conversational benchmarks (URO-Bench)
- Chinese conversations: top scores with 83.3 (basic) and 68.2 (pro).
- English conversations: competitive with GPT-4o, scoring 83.9 versus GPT-4o's 84.5.
What this means for developers and researchers
Step-Audio 2 Mini makes advanced multimodal speech intelligence accessible under an open-source license. With unified tokenization, emotion-aware generation, retrieval-augmented grounding, and tool support, it is positioned for research and product use cases that require realistic, controllable, and grounded speech interactions. The model, paper, and checkpoints are available on Hugging Face for experimentation and integration.
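As a starting point, the released checkpoint can be fetched with huggingface_hub; the repository id below is an assumption, so confirm the exact name on the official Hugging Face page and run the model with the inference code shipped alongside the weights.

```python
# Download the released checkpoint with huggingface_hub (pip install huggingface_hub).
# The repo id is an assumption; check the official Hugging Face page for the
# exact name and use the bundled inference scripts to run the model.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="stepfun-ai/Step-Audio-2-mini")  # assumed repo id
print("Checkpoint downloaded to:", local_dir)
```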
For more details, model weights, and evaluation artifacts see the Hugging Face page and the paper linked on the release pages.