
VibeVoice-1.5B: Microsoft’s Open TTS for 90-Minute, Multi-Speaker Audio

Microsoft released VibeVoice-1.5B, an open-source TTS model that generates up to 90 minutes of expressive audio with up to four speakers and supports cross-lingual and singing synthesis.

VibeVoice in brief

Microsoft released VibeVoice-1.5B as an open-source text-to-speech framework built for long-form, multi-speaker generation. The model is MIT-licensed and aimed at research and developer use cases that need expressive, coherent speech over extended durations. It can synthesize up to 90 minutes of continuous audio, support up to four distinct speakers in a single session, and perform cross-lingual and singing synthesis.

Notable capabilities

  • Massive context and multi-speaker support: generate very long monologues or multi-turn dialogues with up to four speakers in one session (see the script sketch after this list).
  • Simultaneous generation: the model supports parallel audio streams that simulate conversational turn-taking rather than simply concatenating single-voice clips.
  • Cross-lingual and singing synthesis: despite primary training on English and Chinese, VibeVoice can perform cross-lingual narration and basic singing synthesis.
  • Open and permissive license: released under the MIT license to encourage research, transparency, and reproducibility.
  • Scalable streaming design: built for efficient long-duration synthesis, with an upcoming 7B streaming model announced to expand real-time and low-latency use cases.
  • Emotion and expressiveness: supports emotion control and natural prosody suited to podcasts, audiobooks, and conversational agents.
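
As a rough illustration of the multi-speaker workflow, the sketch below assembles a turn-by-turn script in a "Speaker N:" format similar to the project's demo transcripts and writes it to a file. The exact script conventions and the inference entry point should be confirmed against the official VibeVoice repository; the format here is an assumption, not the documented API.

    # Minimal sketch: preparing a multi-speaker script for VibeVoice.
    # The "Speaker N:" turn format mirrors the project's demo transcripts;
    # verify the exact conventions against the official repository.
    from pathlib import Path

    turns = [
        ("Speaker 1", "Welcome back to the show. Today we're covering long-form TTS."),
        ("Speaker 2", "Thanks for having me. Ninety minutes in a single pass is a big deal."),
        ("Speaker 1", "Let's start with how the model keeps each voice consistent."),
    ]

    # VibeVoice supports up to four distinct speakers in one session.
    script = "\n".join(f"{speaker}: {text}" for speaker, text in turns)

    Path("podcast_script.txt").write_text(script, encoding="utf-8")
    print(script)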

Architecture highlights

VibeVoice is built on a 1.5B-parameter language model backbone (Qwen2.5-1.5B) and pairs it with two complementary tokenizers plus a diffusion-based decoder head:

  • Acoustic tokenizer: a sigma-VAE variant with a mirrored encoder-decoder structure (about 340M parameters per side) that achieves 3200x downsampling from raw 24 kHz audio (see the quick arithmetic after this list).
  • Semantic tokenizer: an encoder-only topology trained with an ASR proxy task, designed to work at a low frame rate for consistent long-sequence modeling.
  • Diffusion decoder head: a lightweight conditional diffusion module of roughly 123M parameters that predicts acoustic features using classifier-free guidance and DPM-Solver for better perceptual quality.
  • Context curriculum: training progressively scales the context length from 4k tokens up to 65k tokens, enabling coherent long-form generation.
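
To see how these figures fit together, the back-of-the-envelope calculation below uses only numbers quoted above: 3200x downsampling of 24 kHz audio implies about 7.5 acoustic frames per second, so a 90-minute session yields roughly 40,500 frames, which sits comfortably inside the 65k-token training context. How exactly those frames map onto context tokens is an assumption for illustration, not a published accounting.

    # Back-of-the-envelope check using only numbers quoted in this article.
    sample_rate_hz = 24_000      # raw audio sample rate
    downsampling = 3_200         # acoustic tokenizer compression factor

    frame_rate_hz = sample_rate_hz / downsampling
    print(frame_rate_hz)         # 7.5 acoustic frames per second

    minutes = 90
    frames_per_session = frame_rate_hz * minutes * 60
    print(frames_per_session)    # 40,500 frames for a 90-minute session

    # ~40.5k acoustic frames fit within the 65k-token training context,
    # leaving headroom for the text transcript and speaker prompts.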

This separation of semantic sequence modeling from diffusion-based acoustic generation helps preserve speaker identity and produce detailed audio over lengthy segments.
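
Purely as a schematic, and not the actual VibeVoice code, the toy loop below mirrors the division of labor just described: a backbone walks the sequence and emits a conditioning state per step, while a separate diffusion-style head iteratively refines a noisy acoustic frame from that state. All dimensions, update rules, and function names here are stand-ins.

    import numpy as np

    rng = np.random.default_rng(0)
    HIDDEN, ACOUSTIC_DIM, DENOISE_STEPS = 64, 16, 4

    def lm_backbone_step(prev_hidden, token_embedding):
        # Stand-in for the 1.5B LM backbone: mixes previous state with the next token.
        return np.tanh(0.5 * prev_hidden + 0.5 * token_embedding)

    def diffusion_head(condition):
        # Stand-in for the ~123M-parameter diffusion decoder: iteratively
        # refines a noisy acoustic frame conditioned on the LM state.
        frame = rng.normal(size=ACOUSTIC_DIM)                     # start from noise
        for _ in range(DENOISE_STEPS):
            frame = 0.7 * frame + 0.3 * condition[:ACOUSTIC_DIM]  # toy denoising update
        return frame

    hidden = np.zeros(HIDDEN)
    acoustic_frames = []
    for token_embedding in rng.normal(size=(8, HIDDEN)):  # 8 toy "semantic" tokens
        hidden = lm_backbone_step(hidden, token_embedding)  # sequence modeling
        acoustic_frames.append(diffusion_head(hidden))      # acoustic generation

    print(np.stack(acoustic_frames).shape)  # (8, 16): one acoustic frame per step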

Limits and responsible use

  • Language coverage: trained on English and Chinese only, so text in other languages may produce unintelligible or unsafe output.
  • No overlapping speech: the model handles sequential turn-taking but does not model overlapping speakers.
  • Speech only: VibeVoice generates speech and does not add background sound effects, Foley, or music.
  • Legal and ethical constraints: Microsoft forbids use for impersonation, disinformation, and authentication bypass, and requires disclosure of AI-generated content; users should follow applicable laws and ethical guidelines.
  • Not yet suited to low-latency interactive production: while efficient, this 1.5B release is not optimized for professional real-time or live-streaming use; the upcoming 7B model targets those scenarios.

Getting started and resources

Researchers and creators can find the model and documentation on Hugging Face and GitHub. Community reports indicate that the 1.5B checkpoint can run inference on an 8 GB consumer GPU such as an RTX 3060, with roughly 7 GB of VRAM used for multi-speaker dialogues. For downloads, docs, and examples, visit the Hugging Face model page: https://huggingface.co/microsoft/VibeVoice-1.5B
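
As a concrete first step, the snippet below fetches the checkpoint with the huggingface_hub client. Only the download is shown here; inference code (loading the tokenizers and diffusion head and feeding a script) is best taken from the examples in the official GitHub repository.

    # Download the VibeVoice-1.5B weights from the Hugging Face Hub.
    # Requires: pip install huggingface_hub
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(repo_id="microsoft/VibeVoice-1.5B")
    print(f"Model files downloaded to: {local_dir}")
    # For synthesis, follow the usage examples in the official VibeVoice GitHub repo.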

Who should explore VibeVoice

VibeVoice is particularly interesting for teams building synthetic voices for podcasts, long-form narration, multi-speaker demos, and research on expressive TTS. Its open license and streaming-oriented design make it a practical starting point for labs and developers experimenting with long-duration, multi-speaker speech synthesis.
