
Kyutai Unveils Ultra-Low Latency 2B Parameter Streaming Text-to-Speech Model Trained on 2.5M Hours

Kyutai has launched an open-source streaming TTS model with roughly 2 billion parameters, trained on 2.5 million hours of speech and achieving 220 ms latency. The model serves multiple concurrent users per GPU, making it well suited to real-time speech applications.

Breakthrough in Streaming Text-to-Speech Technology

Kyutai, an open AI research lab, has launched a state-of-the-art streaming Text-to-Speech (TTS) model featuring approximately 2 billion parameters. Built for real-time applications, this model achieves an exceptionally low latency of 220 milliseconds without compromising audio quality. Trained on a massive dataset of 2.5 million hours of speech, it demonstrates significant advancements in speech generation efficiency and accessibility.

High-Performance Streaming with Multiple Users

The model excels in streaming capabilities, supporting up to 32 concurrent users on a single NVIDIA L40 GPU while maintaining latency below 350 milliseconds. For individual users, latency can be as low as 220 milliseconds, enabling near real-time scenarios such as conversational agents, voice assistants, and live narration. This is made possible through Kyutai’s innovative Delayed Streams Modeling approach, which allows incremental speech generation as text is processed.
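The client-side pattern this enables can be pictured with the short sketch below. Note that `fake_tts_stream` is a stand-in, not Kyutai's API (the real inference code lives in its GitHub repository); the point is that playback overlaps generation, so the user-perceived latency is time-to-first-frame rather than total synthesis time.

```python
# Illustrative only: fake_tts_stream stands in for a real streaming backend.
import time
from typing import Iterator


def fake_tts_stream(text: str) -> Iterator[bytes]:
    """Pretend model: emits one placeholder audio frame per text token."""
    for _word in text.split():
        time.sleep(0.02)          # stand-in for one generation step
        yield b"\x00" * 3840      # placeholder frame (80 ms of 24 kHz 16-bit mono)


start = time.monotonic()
for i, frame in enumerate(fake_tts_stream("Streaming speech starts almost immediately")):
    if i == 0:
        # Time-to-first-audio: the quantity the 220 ms figure refers to.
        print(f"first audio frame after {(time.monotonic() - start) * 1e3:.0f} ms")
    # A real client would hand each frame to an audio device here.
```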

Technical Specifications

  • Model size: ~2 billion parameters
  • Training data: 2.5 million hours of audio
  • Latency: 220 ms for a single user; under 350 ms for 32 concurrent users on one L40 GPU
  • Language support: English and French
  • License: CC-BY-4.0 (permissive open source)

Delayed Streams Modeling: Enabling Real-Time Responsiveness

Kyutai’s Delayed Streams Modeling technique lets the model start synthesizing speech before the complete input text has arrived. By trading a small, fixed delay for prediction accuracy, it achieves streaming TTS with far lower latency than conventional full-utterance pipelines while maintaining temporal coherence.
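As a rough intuition (a toy picture, not Kyutai's actual implementation), delayed streams can be thought of as two time-aligned token sequences in which the audio stream trails the text stream by a fixed offset. The sketch below uses an illustrative delay of 2 steps:

```python
# Toy picture of delayed streams: the audio stream is shifted DELAY steps
# behind the text stream, so audio emission starts after only DELAY text
# tokens instead of waiting for the full sentence.
DELAY = 2  # illustrative offset
PAD = "<pad>"

text_tokens = ["Hello", ",", "world", "!", "<eos>"]

for t in range(len(text_tokens) + DELAY):
    text_in = text_tokens[t] if t < len(text_tokens) else PAD
    audio_out = f"audio[{t - DELAY}]" if t >= DELAY else PAD
    print(f"step {t}: read text={text_in!s:>7} -> emit {audio_out}")
```

The offset gives each audio token a short window of text lookahead, which is how a small, bounded latency is traded for prediction accuracy.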

Open Access and Community Collaboration

The entire codebase and training recipes are publicly available on Kyutai’s GitHub repository, promoting reproducibility and community engagement. Model weights and inference scripts are also accessible on Hugging Face under the CC-BY-4.0 license, allowing unrestricted use with attribution.
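For example, the published weights can be fetched with the standard `huggingface_hub` client. The repository id below is a placeholder, so substitute the actual name from Kyutai's Hugging Face page:

```python
# Sketch: pulling the released weights with the huggingface_hub client.
# NOTE: the repo id below is a placeholder, not the real repository name;
# look it up on Kyutai's Hugging Face page before running.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="kyutai/streaming-tts-placeholder")
print(f"Weights downloaded to {local_dir}")
```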

Applications and Impact

Reducing speech generation latency to approximately 220 milliseconds significantly enhances user experience across various fields:

  • Conversational AI for more natural voice interactions
  • Assistive technologies including faster screen readers
  • Media production with rapid voiceover turnaround
  • Edge computing devices optimized for on-device inference

The ability to efficiently serve multiple users on a single GPU also benefits scalable cloud-based speech services.

Ready for Deployment

Kyutai’s streaming TTS model stands out as an open, fast, and versatile solution for researchers and developers seeking high-quality real-time speech synthesis. Its English and French support and scalable performance provide a compelling alternative to proprietary technologies.

For additional information, please visit Kyutai’s GitHub and Hugging Face pages, and check the official model documentation.
