NVIDIA Unveils Granary: Europe’s Largest Open-Source Speech Dataset and Ultra-Fast ASR Models

Granary: a new foundation for European speech AI

NVIDIA announced Granary, a massive open-source speech corpus developed with Carnegie Mellon University and Fondazione Bruno Kessler. The collection totals about one million hours of audio, split into roughly 650,000 hours for automatic speech recognition (ASR) and 350,000 hours for speech translation (AST). Granary covers 25 European languages, including nearly all official EU languages plus Russian and Ukrainian, and prioritizes languages with scarce annotated data such as Croatian, Estonian, and Maltese.

Key characteristics of Granary:

Largest open-source speech dataset for 25 European languages.
Pseudo-labeling pipeline: public unlabeled audio is processed with NVIDIA NeMo's Speech Data Processor to add structure and improve quality, reducing manual annotation needs.
Designed to support both transcription (ASR) and translation (AST) tasks.
Open access for developers, researchers, and companies to train production-scale models.
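
The pseudo-labeling idea can be illustrated with a minimal sketch. This is not the NeMo Speech Data Processor API: the Segment record, its confidence field, and the thresholds below are hypothetical, chosen only to show how machine-generated transcripts might be filtered before they are used as training labels.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A hypothetical pseudo-labeled audio segment (not a NeMo type)."""
    audio_path: str
    duration_s: float
    text: str          # transcript produced by an existing ASR model
    confidence: float  # assumed model confidence in [0, 1]


def filter_pseudo_labels(segments, min_conf=0.9, min_dur=1.0, max_dur=40.0):
    """Keep segments whose machine transcript looks usable as a training label.

    All three thresholds are illustrative assumptions, not published values.
    """
    kept = []
    for seg in segments:
        if not seg.text.strip():
            continue  # drop empty transcripts
        if not (min_dur <= seg.duration_s <= max_dur):
            continue  # drop clips that are too short or too long
        if seg.confidence < min_conf:
            continue  # drop low-confidence labels
        kept.append(seg)
    return kept


segments = [
    Segment("a.wav", 12.5, "dobar dan svima", 0.97),
    Segment("b.wav", 0.4, "da", 0.99),              # too short
    Segment("c.wav", 8.0, "tere hommikust", 0.55),  # low confidence
    Segment("d.wav", 20.0, "   ", 0.95),            # empty transcript
]
clean = filter_pseudo_labels(segments)
print([s.audio_path for s in clean])
```

A real pipeline would likely add further steps, such as language identification and text normalization, which this sketch leaves out.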

Thanks to the high-quality preprocessing and breadth of data, models trained on Granary converge faster. According to NVIDIA, reaching a target accuracy takes roughly half as much Granary training data as it does with other popular datasets, which helps most with low-resource languages and rapid prototyping.

Canary-1b-v2: compact, accurate, multitask model

Canary-1b-v2 is a billion-parameter encoder-decoder model trained on Granary. It transcribes speech in 25 European languages and translates between English and the other 24, a significant expansion of Canary's language coverage.

Highlights:

Languages supported: 25 European languages in total, a major increase from previous coverage.
Performance: accuracy comparable to models three times larger, with inference up to 10x faster.
Multitask capability: performs both ASR and AST reliably.
Output features: automatic punctuation and capitalization, word- and segment-level timestamps, and timestamped translated outputs.
Architecture: FastConformer encoder with Transformer decoder and a unified SentencePiece vocabulary across languages.
Robustness: maintains performance under noisy conditions and reduces output hallucinations.
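
To show what segment-level timestamps make possible, here is a minimal sketch that turns timestamped segments into SRT subtitles. The input shape (dictionaries with start, end, and text keys, times in seconds) is an assumption made for illustration and may not match the structure the model actually returns.

```python
def srt_time(seconds):
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render segments (assumed keys: start, end, text) as an SRT string."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n"
            f"{seg['text']}\n"
        )
    # Subtitle blocks are separated by a blank line.
    return "\n".join(blocks)


# Hypothetical segments, as they might look after transcription.
example = [
    {"start": 0.0, "end": 1.25, "text": "Hello."},
    {"start": 1.5, "end": 3.0, "text": "Welcome to the demo."},
]
print(to_srt(example))
```

The same function would work for timestamped translations, since only the text field changes.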

NVIDIA reports an ASR word error rate (WER, lower is better) of 7.15% on the AMI dataset and 10.82% on LibriSpeech Clean. For AST, the reported COMET scores (higher is better) are 79.3 for X→English and 84.56 for English→X. Canary-1b-v2 is released under a CC BY 4.0 license and is optimized for NVIDIA GPU-accelerated systems to support fast training and inference at scale.
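
For readers unfamiliar with the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the model output into the reference transcript, divided by the number of reference words. The sketch below is a generic word-level edit distance, not NVIDIA's evaluation code; real evaluations usually normalize casing and punctuation first, which is omitted here.

```python
def word_error_rate(reference, hypothesis):
    """Compute WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.2%}")
```

Read this way, a WER of 7.15% corresponds to about seven word errors per hundred reference words.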

Parakeet-tdt-0.6b-v3: real-time, high-throughput ASR

Parakeet-tdt-0.6b-v3 is a 600M-parameter multilingual ASR model designed for fast, large-volume transcription across the same 25 European languages. It extends Parakeet beyond English to every language that Granary covers.

Key traits:

Automatic language detection that transcribes audio without extra prompts.
Long-form transcription: handles audio segments of up to 24 minutes in a single inference pass.
Optimized for low latency, high throughput, and batch processing with word-level timestamps, punctuation, and capitalization.
Robustness for complex audio content, such as numbers and lyrics, and in challenging acoustic conditions.
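
Recordings longer than the 24-minute single-pass limit have to be split before transcription. The sketch below only plans the window boundaries; the overlap parameter is a hypothetical addition, and the usable window length in practice may depend on hardware and configuration not described here.

```python
def plan_windows(total_s, max_window_s=24 * 60, overlap_s=0.0):
    """Split a recording of total_s seconds into (start, end) windows.

    Each window is at most max_window_s long. Consecutive windows share
    overlap_s seconds, which a caller could use to stitch transcripts.
    """
    if not 0 <= overlap_s < max_window_s:
        raise ValueError("overlap_s must be >= 0 and smaller than max_window_s")
    if total_s <= 0:
        return []

    windows = []
    start = 0.0
    step = max_window_s - overlap_s
    while start < total_s:
        end = min(start + max_window_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break  # the last window reached the end of the recording
        start += step
    return windows


# A one-hour recording needs three passes: 24 + 24 + 12 minutes.
for start, end in plan_windows(3600):
    print(f"{start / 60:.0f} min to {end / 60:.0f} min")
```

Splitting at silences rather than at fixed offsets would avoid cutting words in half, but that requires inspecting the audio and is beyond this sketch.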

Why this release matters

By releasing Granary along with Canary-1b-v2 and Parakeet-tdt-0.6b-v3, NVIDIA significantly lowers the barrier for building multilingual speech applications across Europe. The combination of a large, well-processed dataset and efficient models enables developers to create inclusive systems such as multilingual chatbots, customer service voice agents, and near-real-time translation services.

Open access and production-quality optimizations mean researchers, startups, and enterprises can rapidly prototype and scale speech AI solutions that support linguistic diversity across Europe.
