StepFun Launches Step-Audio-AQAA: Revolutionizing Voice Interaction with End-to-End Audio Language Model
StepFun unveils Step-Audio-AQAA, a fully unified audio language model that enables natural and expressive voice interaction by directly converting spoken queries into spoken responses without text intermediaries.
Rethinking Audio-Based Human-Computer Interaction
Building machines that respond to human speech with equally natural and expressive audio has become a key focus of intelligent interaction systems. Audio-language modeling merges speech recognition, natural language understanding, and audio generation so that machines can understand and reply using voice alone, without relying on text conversions. This approach promotes accessibility, inclusiveness, and more fluid, human-like interactions for voice assistants, audio storytelling, and hands-free computing.
Challenges of Cascaded Speech Pipelines
Most current systems depend on a sequence of separate modules for speech-to-text, text processing, and text-to-speech conversion. This modular pipeline tends to accumulate errors, add latency, and flatten expressiveness, making such systems inadequate for nuanced tasks such as emotional dialogue or dynamic speech synthesis. An ideal model would unify these steps, directly interpreting audio queries and generating expressive audio answers without intermediate text.
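To make the failure mode concrete, here is a minimal sketch of one conversational turn in such a cascade. The stage functions are hypothetical placeholders rather than code from any of the systems discussed; the point is that each hand-off strips information and adds latency.

```python
from typing import Callable

def cascaded_voice_turn(
    audio_query: bytes,
    asr: Callable[[bytes], str],   # speech-to-text stage
    llm: Callable[[str], str],     # text-only language model stage
    tts: Callable[[str], bytes],   # text-to-speech stage
) -> bytes:
    """One turn in a cascaded pipeline (illustrative, not any vendor's API).

    Each hand-off is lossy: the ASR transcript drops prosody and emotion,
    the LLM reasons only over that transcript, and the TTS stage must
    reconstruct expressiveness from plain text, while latency adds up stage by stage.
    """
    text_query = asr(audio_query)
    text_reply = llm(text_query)
    return tts(text_reply)
```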
Advances Toward Unified Audio Language Models
Initial attempts like HuggingGPT and AudioGPT combined separate speech and language models but struggled with real-time voice interaction. Token-based models such as VALL-E, SpeechGPT, AudioPaLM, and Qwen2-Audio convert audio into discrete tokens but typically output text and require vocoders for speech synthesis, limiting immediacy and expressiveness.
Introducing Step-Audio-AQAA: A Fully End-to-End Audio Query–Audio Answer System
StepFun’s Step-Audio-AQAA is a groundbreaking large audio-language model designed for Audio Query–Audio Answer tasks. It directly transforms spoken input into expressive spoken output without intermediate text conversion. The system integrates a dual-codebook tokenizer, a 130-billion-parameter large language model named Step-Omni, and a flow-matching vocoder to synthesize natural speech with low latency.
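A rough sketch of the resulting control flow, with the three components reduced to hypothetical callables (the interfaces are illustrative assumptions, not StepFun's API), highlights the contrast with the cascade above: no intermediate text string is ever materialized.

```python
from typing import Callable, List

def aqaa_turn(
    audio_query: bytes,
    dual_tokenizer: Callable[[bytes], List[int]],  # dual-codebook audio tokenizer
    step_omni: Callable[[List[int]], List[int]],   # 130B decoder-only LLM over token sequences
    vocoder: Callable[[List[int]], bytes],         # flow-matching vocoder
) -> bytes:
    """One audio-query/audio-answer turn: audio tokens in, audio tokens out, speech synthesized directly."""
    query_tokens = dual_tokenizer(audio_query)
    answer_tokens = step_omni(query_tokens)
    return vocoder(answer_tokens)
```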
Tokenization and Model Architecture
Two separate audio tokenizers handle different aspects of speech: a linguistic tokenizer based on Paraformer captures structured speech elements like phonemes at 16.7 Hz with 1,024 tokens, while a semantic tokenizer inspired by CosyVoice 1.0 encodes acoustic richness at 25 Hz with 4,096 tokens. These tokens are interleaved in a 2:3 ratio and fed into Step-Omni, a multimodal decoder-only LLM trained on text, audio, and images. The model outputs tri-codebook sequences of audio and text tokens, which the vocoder converts into smooth, expressive speech. This design allows precise control over voice characteristics, including emotion and speech rate.
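The 2:3 interleaving can be sketched as below. The ordering (two linguistic tokens followed by three semantic tokens) and the token IDs are illustrative assumptions, since the article only states the ratio, which follows from the 16.7 Hz and 25 Hz frame rates.

```python
from typing import List

def interleave_2_to_3(linguistic: List[int], semantic: List[int]) -> List[int]:
    """Interleave the two token streams in a 2:3 ratio.

    The linguistic tokenizer runs at ~16.7 Hz and the semantic tokenizer at 25 Hz,
    so two linguistic tokens cover roughly the same audio span as three semantic tokens.
    Token IDs here are made up; the real codebooks have 1,024 and 4,096 entries.
    """
    merged: List[int] = []
    li, si = 0, 0
    while li < len(linguistic) or si < len(semantic):
        merged.extend(linguistic[li:li + 2])  # 2 linguistic tokens ...
        merged.extend(semantic[si:si + 3])    # ... followed by 3 semantic tokens
        li += 2
        si += 3
    return merged

# Example: 4 linguistic and 6 semantic tokens cover the same ~0.24 s of audio.
print(interleave_2_to_3([10, 11, 12, 13], [900, 901, 902, 903, 904, 905]))
# -> [10, 11, 900, 901, 902, 12, 13, 903, 904, 905]
```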
Benchmark Performance
Step-Audio-AQAA was tested on the StepEval-Audio-360 benchmark, which covers multilingual and multi-dialectal audio tasks across nine categories such as creativity, gaming, emotion control, and voice understanding. It outperformed state-of-the-art models like Kimi-Audio and Qwen-Omni, achieving the highest Mean Opinion Scores in most categories. Notably, a 10:15 text-audio token ratio yielded strong Chat (4.03), Relevance (0.65), and Factuality (0.67) scores, while marker-preserving concatenation for audio interleaving raised the Chat score further (4.22) with Relevance (0.57) and Factuality (0.57), demonstrating strong semantic accuracy and emotional richness.
Advancing Expressive Machine Speech
By integrating expressive audio tokenization, a powerful multimodal LLM, and advanced training techniques like Direct Preference Optimization and model merging, Step-Audio-AQAA transcends the limitations of modular speech pipelines. It produces high-quality, emotionally resonant audio responses, marking a significant advance toward machines that communicate with natural, expressive speech.
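For reference, the standard Direct Preference Optimization objective the article mentions can be written as a short PyTorch function. This is the generic DPO formulation applied to pairs of preferred and dispreferred responses, not StepFun's training code, and the beta value is illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x) for preferred responses
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x) for dispreferred responses
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # strength of the implicit KL regularization (illustrative)
) -> torch.Tensor:
    """Generic DPO objective: push the policy to prefer y_w over y_l relative to the reference."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```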
For more details, see the Paper and the Model on Hugging Face.