
Kyutai Unveils Ultra-Low Latency 2B Parameter Streaming Text-to-Speech Model Trained on 2.5M Hours

Kyutai has launched an open-source streaming TTS model with roughly 2 billion parameters, trained on 2.5 million hours of speech and achieving 220 ms latency. The model serves multiple concurrent users per GPU, making it well suited to real-time speech applications.

Breakthrough in Streaming Text-to-Speech Technology

Kyutai, an open AI research lab, has launched a state-of-the-art streaming Text-to-Speech (TTS) model featuring approximately 2 billion parameters. Built for real-time applications, this model achieves an exceptionally low latency of 220 milliseconds without compromising audio quality. Trained on a massive dataset of 2.5 million hours of speech, it demonstrates significant advancements in speech generation efficiency and accessibility.

High-Performance Streaming with Multiple Users

The model excels in streaming capabilities, supporting up to 32 concurrent users on a single NVIDIA L40 GPU while maintaining latency below 350 milliseconds. For individual users, latency can be as low as 220 milliseconds, enabling near real-time scenarios such as conversational agents, voice assistants, and live narration. This is made possible through Kyutai’s innovative Delayed Streams Modeling approach, which allows incremental speech generation as text is processed.
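The client-side pattern this enables can be pictured with the short sketch below. Note that `fake_tts_stream` is a stand-in, not Kyutai's API (the real inference code lives in its GitHub repository); the point is that playback overlaps generation, so the user-perceived latency is time-to-first-frame rather than total synthesis time.

```python
# Illustrative only: fake_tts_stream stands in for a real streaming backend.
import time
from typing import Iterator


def fake_tts_stream(text: str) -> Iterator[bytes]:
    """Pretend model: emits one placeholder audio frame per text token."""
    for _word in text.split():
        time.sleep(0.02)          # stand-in for one generation step
        yield b"\x00" * 3840      # placeholder frame (80 ms of 24 kHz 16-bit mono)


start = time.monotonic()
for i, frame in enumerate(fake_tts_stream("Streaming speech starts almost immediately")):
    if i == 0:
        # Time-to-first-audio: the quantity the 220 ms figure refers to.
        print(f"first audio frame after {(time.monotonic() - start) * 1e3:.0f} ms")
    # A real client would hand each frame to an audio device here.
```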

Technical Specifications

  • Model size: ~2 billion parameters
  • Training data: 2.5 million hours of audio
  • Latency: 220 ms for a single user; under 350 ms for 32 concurrent users on one L40 GPU
  • Language support: English and French
  • License: CC-BY-4.0 (permissive open source)

Delayed Streams Modeling: Enabling Real-Time Responsiveness

Kyutai’s Delayed Streams Modeling technique lets the model start synthesizing speech before the complete input text has arrived. By trading a small, fixed delay for prediction accuracy, it achieves streaming TTS with far lower latency than conventional full-utterance pipelines while maintaining temporal coherence.
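As a rough intuition (a toy picture, not Kyutai's actual implementation), delayed streams can be thought of as two time-aligned token sequences in which the audio stream trails the text stream by a fixed offset. The sketch below uses an illustrative delay of 2 steps:

```python
# Toy picture of delayed streams: the audio stream is shifted DELAY steps
# behind the text stream, so audio emission starts after only DELAY text
# tokens instead of waiting for the full sentence.
DELAY = 2  # illustrative offset
PAD = "<pad>"

text_tokens = ["Hello", ",", "world", "!", "<eos>"]

for t in range(len(text_tokens) + DELAY):
    text_in = text_tokens[t] if t < len(text_tokens) else PAD
    audio_out = f"audio[{t - DELAY}]" if t >= DELAY else PAD
    print(f"step {t}: read text={text_in!s:>7} -> emit {audio_out}")
```

The offset gives each audio token a short window of text lookahead, which is how a small, bounded latency is traded for prediction accuracy.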

Open Access and Community Collaboration

The entire codebase and training recipes are publicly available on Kyutai’s GitHub repository, promoting reproducibility and community engagement. Model weights and inference scripts are also accessible on Hugging Face under the CC-BY-4.0 license, allowing unrestricted use with attribution.
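For example, the published weights can be fetched with the standard `huggingface_hub` client. The repository id below is a placeholder, so substitute the actual name from Kyutai's Hugging Face page:

```python
# Sketch: pulling the released weights with the huggingface_hub client.
# NOTE: the repo id below is a placeholder, not the real repository name;
# look it up on Kyutai's Hugging Face page before running.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="kyutai/streaming-tts-placeholder")
print(f"Weights downloaded to {local_dir}")
```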

Applications and Impact

Reducing speech generation latency to approximately 220 milliseconds significantly enhances user experience across various fields:

  • Conversational AI for more natural voice interactions
  • Assistive technologies including faster screen readers
  • Media production with rapid voiceover turnaround
  • Edge computing devices optimized for on-device inference

The ability to efficiently serve multiple users on a single GPU also benefits scalable cloud-based speech services.

Ready for Deployment

Kyutai’s streaming TTS model stands out as an open, fast, and versatile solution for researchers and developers seeking high-quality real-time speech synthesis. Its English and French support and scalable performance provide a compelling alternative to proprietary technologies.

For additional information, please visit Kyutai’s GitHub and Hugging Face pages, and check the official model documentation.
