Microsoft Unveils VibeVoice-ASR for Long-Form Audio

Overview of VibeVoice-ASR

Microsoft has released VibeVoice-ASR as part of the VibeVoice family of open-source frontier voice AI models. This unified speech-to-text model handles 60-minute long-form audio in a single pass and outputs structured transcriptions that encode Who, When, and What, with support for Customized Hotwords.

VibeVoice Ecosystem

The VibeVoice repository hosts Text-to-Speech, real-time TTS, and Automatic Speech Recognition models under an MIT license. The family uses continuous speech tokenizers that operate at 7.5 Hz and a next-token diffusion framework, in which a Large Language Model reasons over text and a diffusion head generates acoustic details. This framework is documented primarily for TTS, but it sets the design context for VibeVoice-ASR.

Advantages of Long-Form ASR

Unlike conventional ASR systems that segment audio, VibeVoice-ASR accepts up to 60 minutes of continuous input within a 64K token length budget. This allows it to maintain a global representation of the entire session, preserving speaker identity and topic context throughout the hour without frequent resets.
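To see why an hour of audio can fit in that budget, here is a rough back-of-the-envelope sketch. It rests on two assumptions that the announcement does not confirm: that the ASR model consumes audio at the same 7.5 Hz token rate quoted for the VibeVoice tokenizers, and that "64K" means 65,536 tokens. The function names are illustrative only.

```python
# Rough token budget for long-form audio (illustrative arithmetic only).
FRAME_RATE_HZ = 7.5         # tokenizer rate quoted for the VibeVoice family (assumed to apply to ASR)
CONTEXT_TOKENS = 64 * 1024  # assumption: "64K" read as 65,536 tokens


def audio_tokens(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of speech tokens needed to represent `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


def remaining_budget(minutes: float, context_tokens: int = CONTEXT_TOKENS) -> int:
    """Tokens left over for hotwords and the generated transcript."""
    return context_tokens - audio_tokens(minutes)


print(audio_tokens(60))      # 27000
print(remaining_budget(60))  # 38536
```

Under these assumptions, 60 minutes of audio takes 27,000 tokens, which leaves more than half of the window for the transcript text itself.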

Single-Pass Processing Benefits

Why Single-Pass?

Conventional ASR systems often split long audio into short segments, which risks losing global context. VibeVoice-ASR retains context across the full recording, which matters for applications such as meeting transcription and lectures. A single pass also simplifies the pipeline: there is no need for custom logic to merge hypotheses or repair speaker labels at chunk boundaries.

Enhanced Recognition with Customized Hotwords

Hotword Functionality

Customized Hotwords let users supply specific terminology, such as product names or technical terms. This guidance improves recognition accuracy for domain-specific terms without retraining the model. Developers can pass internal project names at inference time, so one model can serve products with different vocabularies but similar acoustic profiles.
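As a minimal sketch of the per-product vocabulary idea, the snippet below keeps one hotword list per product and normalizes it before inference. Everything here is hypothetical: the `build_hotword_context` helper, the comma-joined format, and the example terms are illustrations, not the VibeVoice-ASR interface, which this article does not describe.

```python
def build_hotword_context(hotwords: list[str], max_terms: int = 50) -> str:
    """Normalize, de-duplicate (case-insensitively), and join hotwords.

    Hypothetical helper: the real model may expect a different format.
    """
    seen: set[str] = set()
    cleaned: list[str] = []
    for term in hotwords:
        term = " ".join(term.split())  # collapse stray whitespace
        key = term.casefold()
        if term and key not in seen:
            seen.add(key)
            cleaned.append(term)
    return ", ".join(cleaned[:max_terms])


# One vocabulary per product; the acoustic model stays the same.
PRODUCT_HOTWORDS = {
    "support_calls": ["VibeVoice", "Contoso CRM", "vibevoice "],
    "eng_standup": ["tcpWER", "diarization", "Project Falcon"],
}

context = build_hotword_context(PRODUCT_HOTWORDS["support_calls"])
print(context)  # VibeVoice, Contoso CRM
```

Keeping the lists in plain data like this means switching products is a lookup rather than a retraining job, which is the point the paragraph above makes.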

Rich Outputs and Diarization

Structured Transcriptions

The model provides Rich Transcription: it performs ASR, diarization, and timestamping, and returns structured output that records who said what and when. Its performance on multi-speaker long-form data is assessed with Diarization Error Rate (DER), concatenated minimum-permutation word error rate (cpWER), and time-constrained minimum-permutation word error rate (tcpWER), targeting scenarios like meetings and lectures.
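To make the Who/When/What structure and the cpWER metric concrete, here is a small sketch. The `Segment` schema is an assumption for illustration, not the model's actual output format, and the brute-force permutation search is a simplification that is only practical for a handful of speakers; production scoring tools typically solve the speaker matching as an assignment problem.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass
class Segment:
    speaker: str  # Who
    start: float  # When (seconds)
    end: float
    text: str     # What


def words_by_speaker(segments: list[Segment]) -> dict[str, list[str]]:
    """Concatenate each speaker's words in time order."""
    out: dict[str, list[str]] = {}
    for seg in sorted(segments, key=lambda s: s.start):
        out.setdefault(seg.speaker, []).extend(seg.text.lower().split())
    return out


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def cp_wer(ref: list[Segment], hyp: list[Segment]) -> float:
    """cpWER: word errors under the best speaker-label permutation,
    divided by the total number of reference words."""
    ref_streams = list(words_by_speaker(ref).values())
    hyp_streams = list(words_by_speaker(hyp).values())
    n = max(len(ref_streams), len(hyp_streams))
    # Pad with empty streams so both sides have the same speaker count.
    ref_streams += [[]] * (n - len(ref_streams))
    hyp_streams += [[]] * (n - len(hyp_streams))
    total_ref = sum(len(r) for r in ref_streams)
    if total_ref == 0:
        raise ValueError("reference contains no words")
    best = min(
        sum(edit_distance(r, h) for r, h in zip(ref_streams, perm))
        for perm in permutations(hyp_streams)
    )
    return best / total_ref


ref = [
    Segment("alice", 0.0, 2.0, "hello everyone"),
    Segment("bob", 2.0, 4.0, "hi alice"),
    Segment("alice", 4.0, 6.0, "let us start"),
]
hyp = [
    Segment("spk0", 0.0, 2.0, "hello everyone"),
    Segment("spk1", 2.0, 4.0, "hi alice"),
    Segment("spk0", 4.0, 6.0, "let us begin"),
]
print(f"{cp_wer(ref, hyp):.3f}")  # 0.143
```

The hypothesis uses different speaker labels than the reference, yet only the one misrecognized word is counted: the permutation search maps `spk0` to `alice` and `spk1` to `bob`, giving 1 error over 7 reference words.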

Key Takeaways

VibeVoice-ASR is designed for 60-minute audio processing in a single pass.
It produces structured transcripts with speaker details and timing.
Hotwords refine accuracy for specific domain terms.
Its evaluation uses DER, cpWER, and tcpWER on multi-speaker long-form data such as meetings and lectures.
It is available under the MIT license, with official weights and fine-tuning scripts for experimentation.