MiMo-Audio: Xiaomi's 7B Speech LM Trained on 100M+ Hours with High-Fidelity RVQ Tokens

High-fidelity discrete tokens for speech

Xiaomi’s MiMo team introduced MiMo-Audio, a 7-billion-parameter audio-language model trained with a single next-token objective over interleaved text and discretized speech. The key novelty is a high-fidelity residual vector quantization (RVQ) tokenizer that preserves prosody, timbre, and speaker identity, enabling autoregressive modeling of speech alongside text on more than 100 million hours of audio.
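
To make the single-objective idea concrete, here is a minimal sketch of how an interleaved text/speech example could be flattened into one token stream, assuming audio codes are simply mapped above a hypothetical text vocabulary; the actual MiMo-Audio sequence format is not spelled out here.

```python
# Hypothetical flattening of interleaved text and speech segments into one
# stream of token IDs for next-token prediction. TEXT_VOCAB and the offset
# scheme are assumptions for illustration, not MiMo-Audio's actual format.

TEXT_VOCAB = 32000            # hypothetical text vocabulary size
AUDIO_OFFSET = TEXT_VOCAB     # audio codes mapped into IDs above the text range

def interleave(segments):
    """segments: ordered list of ("text", [token ids]) or ("audio", [rvq codes])."""
    stream = []
    for kind, ids in segments:
        if kind == "text":
            stream.extend(ids)
        else:
            stream.extend(AUDIO_OFFSET + code for code in ids)
    return stream

example = interleave([
    ("text",  [101, 17, 52]),       # transcript span
    ("audio", [3, 998, 12, 407]),   # corresponding speech codes
    ("text",  [88, 23]),
])
```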

RVQ tokenizer and why it matters

Instead of lossy acoustic tokens or separate task heads, MiMo-Audio uses a bespoke RVQ tokenizer that operates at 25 Hz and emits 8 RVQ layers per frame, i.e. 200 tokens per second. The tokenizer targets both semantic fidelity and high-quality reconstruction, giving the language model a near-lossless speech representation that it can predict with the same next-token objective used for text.
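
As a rough illustration of the mechanism, the sketch below shows residual vector quantization with 8 codebooks: each layer quantizes what the previous layers left unexplained. The codebook size and latent dimension are hypothetical; the actual tokenizer configuration may differ.

```python
import numpy as np

# Minimal RVQ sketch: 8 codebooks applied in sequence, each quantizing the
# residual left by the previous layers. Codebook size (1024) and latent
# dimension (512) are hypothetical, not MiMo-Audio's actual settings.

rng = np.random.default_rng(0)
NUM_LAYERS, CODEBOOK_SIZE, DIM = 8, 1024, 512
codebooks = rng.normal(size=(NUM_LAYERS, CODEBOOK_SIZE, DIM))

def rvq_encode(frame):
    """Quantize one 25 Hz latent frame into 8 codebook indices."""
    residual = frame.copy()
    indices = []
    for layer in range(NUM_LAYERS):
        # Pick the codeword closest to the current residual.
        dists = np.linalg.norm(codebooks[layer] - residual, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        # The next layer quantizes whatever this layer failed to capture.
        residual = residual - codebooks[layer][idx]
    return indices

def rvq_decode(indices):
    """Reconstruct the frame by summing the selected codewords."""
    return sum(codebooks[layer][idx] for layer, idx in enumerate(indices))

# At 25 Hz with 8 layers, one second of speech yields 25 * 8 = 200 tokens.
frame = rng.normal(size=DIM)
recon = rvq_decode(rvq_encode(frame))
```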

Architecture and patchification

The stack is organized as: patch encoder → 7B LLM → patch decoder. To reconcile the mismatch between audio frame rates and language model sequence lengths, the system groups four 25 Hz timesteps into a single patch, downsampling to 6.25 Hz for LM consumption. A causal patch decoder reconstructs full-rate RVQ streams. A delayed multi-layer RVQ generation scheme staggers codebook predictions to respect inter-layer dependencies and stabilize synthesis.
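
The following sketch illustrates both ideas under simplifying assumptions: patchification is shown as a plain reshape of a (frames × layers) token grid, and the delayed scheme as a one-step-per-layer stagger. The real patch encoder and decoder are learned modules, and the exact delay offsets are not specified here.

```python
import numpy as np

# tokens: a (T, 8) grid of codebook indices at 25 Hz (T frames, 8 RVQ layers).
# Patchification and the delay pattern below are simplified illustrations.

PATCH = 4  # four 25 Hz frames per patch -> 6.25 Hz seen by the 7B LLM

def patchify(tokens):
    """Group four consecutive frames into one patch of 4 * 8 = 32 tokens."""
    T, L = tokens.shape
    assert T % PATCH == 0
    return tokens.reshape(T // PATCH, PATCH * L)

def apply_delay(tokens, pad=-1):
    """Stagger layer k by k steps, so higher codebooks at a given frame are
    only generated after the lower ones (a delayed multi-layer pattern)."""
    T, L = tokens.shape
    out = np.full((T + L - 1, L), pad, dtype=tokens.dtype)
    for k in range(L):
        out[k:k + T, k] = tokens[:, k]
    return out

tokens = np.arange(16 * 8).reshape(16, 8)   # 16 frames, 8 layers
patches = patchify(tokens)                  # (4, 32): patches at 6.25 Hz
delayed = apply_delay(tokens)               # (23, 8): staggered codebook streams
```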

All three components are trained jointly under a single next-token objective, avoiding task-specific pretraining losses for ASR or TTS.
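
A minimal sketch of that shared objective, assuming one vocabulary that contains both text and audio token IDs and a mask that selects which target positions count:

```python
import torch
import torch.nn.functional as F

# Shifted cross-entropy over one interleaved stream. The mask policy and the
# toy vocabulary size are illustrative; only the next-token idea is the point.

def next_token_loss(logits, tokens, loss_mask):
    """Predict token t+1 from positions <= t; loss_mask marks which target
    positions contribute (e.g. text-only targets in the first stage)."""
    vocab = logits.size(-1)
    pred = logits[:, :-1].reshape(-1, vocab)
    target = tokens[:, 1:].reshape(-1)
    mask = loss_mask[:, 1:].reshape(-1).float()
    ce = F.cross_entropy(pred, target, reduction="none")
    return (ce * mask).sum() / mask.sum().clamp(min=1)

# Toy shapes: batch of 2, sequence of 10, hypothetical shared vocabulary of 40960.
logits = torch.randn(2, 10, 40960)
tokens = torch.randint(0, 40960, (2, 10))
mask = torch.ones(2, 10, dtype=torch.bool)
loss = next_token_loss(logits, tokens, mask)
```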

Training stages and emergent few-shot behavior

Training proceeds in two phases. First, an ‘understanding’ stage optimizes a text-token loss over interleaved speech-text corpora. Second, a joint ‘understanding + generation’ stage adds audio-token losses, covering speech continuation, speech-to-text and text-to-speech (S2T/T2S), and instruction-style data. The report highlights a compute and data threshold beyond which few-shot capabilities begin to appear, mirroring emergence curves observed in large text-only LMs.
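
One way to express the stage switch, continuing the masking idea above; the stage names follow the report, while the flag and helper are hypothetical:

```python
# Illustrative two-stage schedule: a per-stage flag decides whether audio-token
# targets contribute to the loss. Everything except the stage names is assumed.

STAGES = [
    {"name": "understanding",              "audio_loss": False},  # text-token loss only
    {"name": "understanding + generation", "audio_loss": True},   # audio targets added
]

def build_loss_mask(is_audio, stage):
    """Text positions always count; audio positions count only once the joint
    stage turns on audio losses."""
    return [(not a) or stage["audio_loss"] for a in is_audio]

is_audio = [False, False, True, True, True, False]   # toy token-type pattern
stage1_mask = build_loss_mask(is_audio, STAGES[0])   # audio targets masked out
stage2_mask = build_loss_mask(is_audio, STAGES[1])   # all targets count
```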

Benchmarks, demos, and tooling

MiMo-Audio is evaluated on speech reasoning and general audio benchmarks. Reported results include strong scores on SpeechMMLU and MMAU, with a reduced modality gap between text-only and speech-in/speech-out settings. Xiaomi also released MiMo-Audio-Eval, a public toolkit to reproduce evaluations, plus listen-and-respond demos for speech continuation, voice and emotion conversion, denoising, and speech translation. See the demo page: https://xiaomimimo.github.io/MiMo-Audio-Demo/

Why this approach is important

The design intentionally keeps pretraining simple: a GPT-style next-token prediction over text and high-fidelity audio tokens. The critical engineering choices are a tokenizer that preserves prosody and speaker identity, patchification to keep sequence lengths manageable, and delayed RVQ decoding to preserve generation quality. For teams building spoken agents, this translates into effective few-shot speech-to-speech editing and robust speech continuation with minimal task-specific finetuning.

Six technical takeaways

1. One GPT-style next-token objective covers both text and speech; there are no task-specific heads or pretraining losses for ASR or TTS.
2. The bespoke RVQ tokenizer runs at 25 Hz with 8 codebook layers (200 tokens per second) and preserves prosody, timbre, and speaker identity.
3. Patchification groups four frames into one patch, so the 7B LLM consumes sequences at 6.25 Hz instead of 25 Hz.
4. A delayed multi-layer RVQ generation scheme staggers codebook predictions to respect inter-layer dependencies and stabilize synthesis.
5. Training runs in two stages, a text-loss ‘understanding’ phase followed by a joint ‘understanding + generation’ phase, with few-shot abilities emerging past a compute and data threshold.
6. Reported results include strong SpeechMMLU and MMAU scores with a reduced modality gap, and the MiMo-Audio-Eval toolkit plus public demos support reproduction.

Where to find more

The paper, technical details, and demos are available from the MiMo project page and GitHub. The demos show practical in-context speech-to-speech (S2S) editing, continuation, and cross-modal tasks using the 7B stack.