
NVIDIA Launches Canary-Qwen-2.5B: The Leading ASR-LLM Hybrid Model with Unmatched Accuracy and Speed

NVIDIA's Canary-Qwen-2.5B model sets a new benchmark in speech recognition with a record low Word Error Rate and fast processing speed. This open-source, commercially licensed hybrid ASR-LLM model enables advanced audio transcription and language understanding.

Breakthrough in Speech Recognition and Language Modeling

NVIDIA has introduced Canary-Qwen-2.5B, an innovative hybrid model that combines automatic speech recognition (ASR) and large language modeling (LLM). It currently holds the top position on the Hugging Face OpenASR leaderboard with a record low Word Error Rate (WER) of 5.63%. Licensed under CC-BY, this model is open-source and commercially available, supporting enterprise applications without restrictions.

Unified Architecture for Transcription and Language Understanding

Canary-Qwen-2.5B's core strength lies in its hybrid architecture, which integrates transcription and language understanding in a single model. It pairs a FastConformer encoder, optimized for low-latency, high-accuracy speech transcription, with a Qwen3-1.7B LLM decoder. The decoder is an unmodified pretrained large language model; lightweight adapters project the encoder's audio representations into the decoder's input space so the LLM can consume them like token embeddings, enabling seamless multimodal processing.
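To make the adapter idea concrete, here is a minimal sketch, not NVIDIA's implementation: all dimensions and names are hypothetical, and plain Python stands in for a real deep-learning stack. The adapter acts as a learned linear map from the encoder's feature dimension to the LLM's embedding dimension, so audio frames arrive at the decoder in the same space as token embeddings.

```python
def linear_adapter(features, weights):
    """Map encoder features (frames x d_enc) into the LLM embedding
    space (frames x d_llm) using a learned weight matrix (d_enc x d_llm).
    A real adapter would be a trained neural layer; this is illustrative."""
    return [[sum(f[k] * weights[k][j] for k in range(len(weights)))
             for j in range(len(weights[0]))]
            for f in features]

# Hypothetical tiny dimensions: 2 audio frames, d_enc = 3, d_llm = 2.
frames = [[1.0, 0.0, 2.0],
          [0.0, 1.0, 1.0]]
w = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]

embedded = linear_adapter(frames, w)  # each frame now lives in the LLM's space
```

Because only the adapter is trained while the encoder and decoder stay frozen pretrained components, either side can in principle be swapped without retraining the other, which is the flexibility the NeMo framework exposes.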

High Performance and Speed

This model achieves a WER of 5.63%, surpassing all previous entries on the OpenASR leaderboard despite its moderate size of 2.5 billion parameters. It also reaches a Real-Time Factor (RTFx) of 418, meaning it processes audio 418 times faster than real time, making it highly efficient for large-scale transcription and live captioning.
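For readers unfamiliar with the two metrics, here is a rough illustration (not NVIDIA's or the leaderboard's evaluation code): WER is the word-level edit distance between the hypothesis and reference transcripts divided by the reference length, and RTFx is audio duration divided by processing time.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, single-row dynamic programming.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1] / len(ref)

# One substitution in four reference words -> WER of 25%.
print(wer("a b c d", "a b x d"))  # 0.25

# At RTFx = 418, one hour of audio takes roughly 3600 / 418 ≈ 8.6 seconds.
seconds_per_hour_of_audio = 3600 / 418
```

So a WER of 5.63% means roughly one word error per eighteen reference words, and an RTFx of 418 means an hour-long recording transcribes in under nine seconds of compute.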

Extensive Training Dataset and Flexibility

The model was trained on 234,000 hours of diverse English speech spanning a wide range of accents and speaking styles, giving it robustness on noisy and domain-specific audio. It is built with NVIDIA's NeMo framework, which lets researchers customize the architecture by swapping encoders or decoders without full retraining.

Broad Hardware Support and Deployment

Canary-Qwen-2.5B is optimized for a wide range of NVIDIA GPUs, from data center-class A100 and H100 to workstation and consumer-grade GPUs like RTX PRO 6000 and GeForce RTX 5090. This enables flexible deployment scenarios, including cloud and edge computing.

Enterprise-Ready Features and Use Cases

Released under a permissive CC-BY license, the model supports commercial deployment in applications like enterprise transcription, real-time meeting summarization, voice-commanded AI agents, and regulatory-compliant documentation in healthcare, legal, and finance sectors. Its integrated LLM decoding improves punctuation, capitalization, and contextual accuracy, critical for sensitive industries.

Open Innovation and Future Directions

By open-sourcing the model and training recipes, NVIDIA encourages community-driven development and experimentation. The approach of integrating LLMs as active components in ASR pipelines signals a shift towards more intelligent, agentic systems capable of comprehensive multimodal understanding.

Explore the model and leaderboard on Hugging Face to experience this state-of-the-art speech AI technology.
