VoXtream Starts Speaking From the First Word — Open-Source Full-Stream Zero-Shot TTS for Real-Time Use
What full-stream TTS means
Most streaming TTS systems decode audio in chunks and require the whole input text before playback can start, which introduces a perceptible pause before speech begins. Full-stream TTS flips that model: it consumes text as it arrives (for example, word by word from an LLM) and emits audio in lockstep, minimizing input-side buffering and enabling much earlier vocal onset.
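As a mental model, the full-stream contract looks like the loop below. This is a sketch, not VoXtream's actual interface: `feed` and `flush` are hypothetical method names, and the sketch assumes each yielded frame is a fixed-size chunk of audio.

```python
# Minimal sketch of the full-stream contract (hypothetical feed/flush
# API, not VoXtream's actual interface): words go in as they arrive,
# fixed-size audio frames come out as soon as they are ready.
def stream_speech(words, tts, play):
    for word in words:                 # e.g. tokens arriving from an LLM
        for frame in tts.feed(word):   # emit whatever frames are ready now
            play(frame)                # each frame is ~80 ms of audio
    for frame in tts.flush():          # drain the tail once input ends
        play(frame)
```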
How VoXtream begins immediately
VoXtream, released by KTH’s Speech, Music and Hearing group, tackles onset latency directly. It can start generating speech after the first word, producing audio in 80 ms frames and reporting first-packet latency (FPL) as low as 102 ms on a modern GPU when compiled with torch.compile. The key is an incremental phoneme predictor that uses a dynamic look-ahead of up to 10 phonemes to stabilize prosody while still allowing generation to begin immediately, rather than waiting for a fixed window of future context.
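The "dynamic" part is what keeps onset latency low: early in the stream the look-ahead window is simply shorter, so decoding never blocks on future phonemes. The sketch below is illustrative rather than the paper's implementation, and assumes phonemes accumulate in an append-only list.

```python
MAX_LOOKAHEAD = 10  # phonemes; the paper's dynamic look-ahead cap

def phoneme_context(phonemes, cursor):
    """Current phoneme plus up to MAX_LOOKAHEAD future phonemes.

    At the start of the stream the window is simply shorter, so
    decoding can begin at the first phoneme instead of blocking
    until a full future context is available.
    """
    lookahead = phonemes[cursor + 1 : cursor + 1 + MAX_LOOKAHEAD]
    return phonemes[cursor], lookahead
```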
Architecture overview
VoXtream is a single, fully autoregressive pipeline composed of three transformers that stream in sequence; a structural sketch of one decoding step follows the list:
- Phoneme Transformer (PT): a decoder-only, incremental transformer that phonemizes words (g2pE at the word level) and applies a dynamic phoneme look-ahead (≤ 10 phonemes) so prosody can be stabilized without blocking generation.
- Temporal Transformer (TT): an autoregressive predictor over Mimi codec semantic tokens plus a duration token that enforces monotonic phoneme-to-audio alignment. Mimi operates at 12.5 Hz, producing 80 ms frames.
- Depth Transformer (DT): an autoregressive generator for the remaining Mimi acoustic codebooks, conditioned on TT outputs and a ReDimNet speaker embedding that enables zero-shot voice prompting.
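The sketch below shows how one decoding step could chain the three stages. The module call signatures are hypothetical and only mirror the conditioning described above; the real modules carry key-value caches, masks, and sampling logic.

```python
import torch

@torch.no_grad()
def decode_one_frame(pt, tt, dt, mimi_decoder, phoneme_ctx, tt_state, spk_emb):
    # Hypothetical signatures that mirror the paper's dataflow, not its code.
    ph_repr = pt(phoneme_ctx)                   # PT: incremental phoneme encoding
    semantic, duration = tt(ph_repr, tt_state)  # TT: Mimi semantic token + duration token
    acoustic = dt(semantic, spk_emb)            # DT: remaining Mimi acoustic codebooks
    frame = mimi_decoder(semantic, acoustic)    # one 80 ms waveform frame
    return frame, duration
```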
The Mimi codec provides a dual-stream tokenization: VoXtream uses the first Mimi codebook as semantic context and the remaining codebooks for high-fidelity reconstruction. The Mimi decoder reconstructs waveforms frame-by-frame, enabling continuous emission of audio frames.
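In tensor terms the split is simple. The layout below assumes a token array of shape (num_codebooks, num_frames) with placeholder sizes; the shape is an assumption about the interface, not a documented one.

```python
import numpy as np

# Dummy Mimi token array: (num_codebooks, num_frames); shape and sizes
# are placeholders for illustration.
mimi_tokens = np.zeros((8, 125), dtype=np.int64)

semantic_tokens = mimi_tokens[0]    # codebook 0: semantic stream (TT context)
acoustic_tokens = mimi_tokens[1:]   # remaining codebooks: acoustic detail (DT output)
```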
Performance and benchmarks
The project includes benchmark scripts measuring FPL and real-time factor (RTF). Reported results (a generic measurement sketch follows the table):
| GPU | FPL, uncompiled | RTF, uncompiled | FPL, compiled | RTF, compiled |
| --- | --- | --- | --- | --- |
| A100 | 171 ms | 1.00 | 102 ms | 0.17 |
| RTX 3090 | 205 ms | 1.19 | 123 ms | 0.19 |
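The repository ships its own benchmark scripts; the sketch below is a generic stand-in that assumes a hypothetical `model.stream(text)` generator yielding 80 ms frames.

```python
import time

FRAME_SEC = 0.08  # one Mimi frame at 12.5 Hz

def measure(model, text):
    start = time.perf_counter()
    fpl, n_frames = None, 0
    for _ in model.stream(text):               # hypothetical streaming generator
        if fpl is None:
            fpl = time.perf_counter() - start  # first-packet latency
        n_frames += 1
    wall = time.perf_counter() - start
    rtf = wall / (n_frames * FRAME_SEC)        # < 1.0 means faster than real time
    return fpl, rtf
```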
On the LibriSpeech-long full-stream evaluation (word-by-word input), VoXtream achieves a WER of 3.24% versus 6.11% for CosyVoice2, and a listener study shows a significant naturalness preference for VoXtream (p ≤ 5e-10). CosyVoice2 still scores higher on speaker similarity, consistent with its flow-matching decoder design.
Why the autoregressive design helps onset latency
Diffusion and flow vocoders commonly generate audio in chunks and often require multi-step sampling, which sets a floor on first-packet latency. VoXtream keeps every stage autoregressive and frame-synchronous (PT → TT → DT → Mimi decoder), so the first 80 ms packet can emerge after a single pass through the stack rather than waiting on a chunked sampler.
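A back-of-envelope comparison makes the point concrete. The chunk size below and the reuse of the compiled A100 RTF for both designs are illustrative assumptions, not measurements from the paper.

```python
def fpl_floor_ms(rtf: float, first_unit_ms: float) -> float:
    """Compute-time floor: the first packet cannot be emitted before the
    first unit is generated, which costs roughly rtf * first_unit_ms."""
    return rtf * first_unit_ms

RTF = 0.17  # VoXtream's compiled A100 RTF, reused here for both designs
print(fpl_floor_ms(RTF, 1000))  # chunked sampler, 1 s first chunk: ~170 ms
print(fpl_floor_ms(RTF, 80))    # frame-synchronous, 80 ms first frame: ~14 ms
```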
Training data and quality control
VoXtream was trained on a mid-scale corpus of roughly 9k hours: approximately 4.5k hours from Emilia and 4.5k hours from HiFiTTS-2 (22 kHz subset), all resampled to 24 kHz. The team applied speaker diarization to remove multi-speaker clips, verified transcripts with ASR, and used NISQA to drop low-quality audio; an illustrative sketch of such a quality gate follows. The dataset card documents preprocessing, Mimi tokenization, MFA alignments, duration labels, and speaker templates.
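As an illustration of that quality gate (field names and thresholds are invented for the sketch, not taken from the dataset card):

```python
from dataclasses import dataclass

@dataclass
class Clip:
    num_speakers: int   # from diarization
    asr_wer: float      # transcript vs. ASR hypothesis
    nisqa_mos: float    # NISQA quality estimate

def keep(clip: Clip) -> bool:
    # Illustrative thresholds only.
    return (
        clip.num_speakers == 1      # single-speaker clips only
        and clip.asr_wer <= 0.10    # transcript confirmed by ASR
        and clip.nisqa_mos >= 3.0   # perceptual quality floor
    )
```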
Ablations and real-world robustness
Table 1 in the paper shows VoXtream is competitive on zero-shot metrics (WER, UTMOS, and speaker similarity) across SEED-TTS test-en and LibriSpeech test-clean. Ablation studies indicate that adding the CSM Depth Transformer and the speaker encoder improves speaker similarity without a large WER penalty. Subjective listening tests use a MUSHRA-like protocol followed by a tailored preference test for full-stream generation.
Where VoXtream fits in the TTS landscape
The contribution of VoXtream is not a new codec or a massive model but a latency-oriented autoregressive arrangement and duration-token alignment that enable true input-side streaming. For live agents, simultaneous translation, or low-latency dubbing, the trade-off is explicit: a modest reduction in speaker similarity compared with some flow-based decoders, but an order-of-magnitude reduction in FPL when operating in full-stream conditions.
Try it and learn more
The paper, model weights on Hugging Face, and code are available from the project pages and GitHub repository. Benchmarks, tutorials, and notebooks are provided to help developers evaluate and integrate VoXtream into real-time pipelines. See the arXiv paper for full technical details: https://arxiv.org/pdf/2509.15969