Microsoft Unveils VibeVoice-ASR for Long-Form Audio
Records found: 14
VibeVoice-ASR is a unified speech-to-text model that can handle 60-minute audio.
Build a low-latency voice agent with ASR, LLM, and TTS streaming.
Explore NVIDIA's new Nemotron Speech ASR model designed for voice agents and live captioning with low-latency performance.
A practical framework for evaluating modern voice agents that extends beyond ASR and WER to include task success, barge-in handling, hallucination-under-noise, safety, and perceptual quality.
Liquid AI released LFM2-Audio-1.5B, an end-to-end audio-language model that achieves sub-100 ms latency and supports ASR, TTS and conversational agents from a compact 1.5B-parameter stack.
Learn how to build a real-time voice AI agent with Whisper for ASR, FLAN-T5 for reasoning, and Bark for TTS, all running in Colab with a simple Gradio UI.
A hands-on guide to building a compact pipeline with SpeechBrain that generates speech, adds noise, enhances audio with MetricGAN+, and measures ASR word error rates before and after denoising.
Qwen3-ASR Flash from Alibaba is a single-model ASR that auto-detects and transcribes 11 languages, supports context injection for domain terms, and keeps WER below 8% in noisy or musical audio.
StepFun AI released Step-Audio 2 Mini, an open-source 8B speech-to-speech model that combines unified audio-text tokenization, emotion-aware generation, and retrieval-augmented grounding to beat GPT-4o-Audio on multiple benchmarks.
Discover how AI voice agents work, why they matter now, and compare the top 9 platforms to build production-grade voice bots in 2025.
NVIDIA launched Granary, a one-million-hour open-source speech dataset covering 25 European languages, alongside Canary-1b-v2 and Parakeet-tdt-0.6b-v3 models for fast, accurate ASR and speech translation.
NVIDIA's Canary-Qwen-2.5B model sets a new benchmark in speech recognition with a record-low Word Error Rate and fast processing speed. This open-source, commercially licensed hybrid ASR-LLM model enables advanced audio transcription and language understanding.
Mistral AI launches Voxtral, cutting-edge open-weight speech recognition models that integrate transcription and language understanding with support for long audio contexts and multiple languages.
NVIDIA has released Parakeet TDT 0.6B, an open-source ASR model that transcribes an hour of audio in just one second while achieving top accuracy benchmarks, setting a new industry standard.