OLMoASR: AI2’s Open ASR Suite Challenging OpenAI Whisper
The Allen Institute for AI (AI2) published OLMoASR, a fully open automatic speech recognition (ASR) suite that goes beyond releasing model weights: it also provides dataset identifiers, filtering steps, training recipes, and benchmark scripts. This level of transparency positions OLMoASR as one of the most extensible platforms for speech recognition research and practical deployment.
Why open ASR matters
Many state-of-the-art ASR systems from providers like OpenAI, Google, and Microsoft are available only via closed APIs. While these services deliver strong performance, they act as black boxes: training data, filtering procedures, and evaluation details are often hidden. That opacity hampers reproducibility, verification, and domain adaptation. OLMoASR confronts these limitations by opening the full pipeline, enabling researchers to reproduce results, test alternatives, and adapt models to new domains without rebuilding massive datasets from scratch.
Model architecture and scaling
OLMoASR uses a transformer encoder–decoder architecture: the encoder converts audio (as log-Mel spectrogram features) into hidden representations, and the decoder generates text tokens conditioned on those representations. This design mirrors modern ASR systems such as Whisper, but OLMoASR's implementation, training code, and configs are fully open.
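To make the data flow concrete, here is a minimal, self-contained sketch of a Whisper-style encoder–decoder in PyTorch. It is illustrative only: layer counts, dimensions, and the vocabulary size are arbitrary stand-ins, not OLMoASR's actual configuration, and positional encodings are omitted for brevity.

import torch
import torch.nn as nn

class TinyASR(nn.Module):
    def __init__(self, n_mels=80, d_model=256, vocab_size=8000):  # sizes are arbitrary
        super().__init__()
        # Convolutional frontend downsamples the spectrogram in time.
        self.frontend = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1), nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.GELU(),
        )
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=4
        )
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=4
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, mels, tokens):
        # mels: (batch, n_mels, time); tokens: (batch, seq)
        audio = self.frontend(mels).transpose(1, 2)   # (batch, time', d_model)
        memory = self.encoder(audio)                  # acoustic representations
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.decoder(self.token_emb(tokens), memory, tgt_mask=causal)
        return self.lm_head(hidden)                   # next-token logits

model = TinyASR()
logits = model(torch.randn(1, 80, 300), torch.randint(0, 8000, (1, 12)))
print(logits.shape)  # torch.Size([1, 12, 8000])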
The released family covers six English-only sizes, enabling trade-offs between inference cost and accuracy:
- tiny.en – 39M parameters, for lightweight inference
- base.en – 74M parameters
- small.en – 244M parameters
- medium.en – 769M parameters
- large.en-v1 – 1.5B parameters, trained on 440K hours
- large.en-v2 – 1.5B parameters, trained on 680K hours
Smaller models suit embedded or real-time use cases; larger models prioritize accuracy for research and batch workloads.
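The parameter counts above map directly to a simple selection rule when planning a deployment. The helper below is hypothetical (the checkpoint names come from the release list, but this function is not part of the olmoasr package):

# Hypothetical: pick the largest OLMoASR checkpoint within a parameter budget.
OLMOASR_MODELS = {
    "tiny.en": 39e6, "base.en": 74e6, "small.en": 244e6,
    "medium.en": 769e6, "large.en-v1": 1.5e9, "large.en-v2": 1.5e9,
}

def pick_model(max_params: float) -> str:
    fitting = {name: p for name, p in OLMOASR_MODELS.items() if p <= max_params}
    if not fitting:
        raise ValueError("No checkpoint fits the budget")
    return max(fitting, key=fitting.get)

print(pick_model(100e6))  # base.en, e.g. for a constrained edge device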
Data: from web scraping to curated mixes
A key contribution of OLMoASR is publishing the training data strategy, not just model weights.
- OLMoASR-Pool (~3M hours): a massive, weakly supervised collection of speech paired with transcripts scraped from the web. It contains noisy, misaligned, and duplicate examples, similar in spirit to Whisper's large-scale, noisy corpora.
- OLMoASR-Mix (~1M hours): a heavily filtered subset created to improve quality and zero-shot generalization. AI2 applied alignment heuristics, fuzzy deduplication, and cleaning rules to ensure better audio–transcript matches and reduce low-diversity or mismatched examples.
This two-tiered approach mirrors large-scale language model pretraining: scale with noisy corpora, then refine with curated subsets.
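The filtering stage is where most of the quality gain comes from. As a rough illustration of what such rules look like (the thresholds and heuristics below are invented for demonstration; the actual cleaning rules ship with AI2's dataset recipes), a pipeline of this kind combines agreement checks with fuzzy deduplication:

import difflib

def plausible_pair(transcript: str, asr_draft: str, audio_seconds: float,
                   min_ratio: float = 0.6, max_wps: float = 5.0) -> bool:
    """Keep a pair only if the scraped transcript roughly agrees with a
    first-pass ASR draft and the speaking rate is physically plausible."""
    # Fuzzy agreement between the web transcript and a draft transcription.
    ratio = difflib.SequenceMatcher(None, transcript.lower(), asr_draft.lower()).ratio()
    # Words-per-second sanity check to catch misaligned audio/transcript pairs.
    wps = len(transcript.split()) / max(audio_seconds, 1e-6)
    return ratio >= min_ratio and wps <= max_wps

def dedupe(transcripts, threshold=0.9):
    """Drop near-duplicate transcripts via pairwise fuzzy matching.
    O(n^2) for clarity; real pipelines use MinHash or similar to scale."""
    kept = []
    for t in transcripts:
        if all(difflib.SequenceMatcher(None, t, k).ratio() < threshold for k in kept):
            kept.append(t)
    return kept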
Performance benchmarks
AI2 compared OLMoASR against Whisper on short-form and long-form tasks using datasets such as LibriSpeech, TED-LIUM3, Switchboard, AMI, and VoxPopuli.
- Medium model (769M): 12.8% WER on short-form, 11.0% WER on long-form, close to Whisper-medium.en (12.4% / 10.5%).
- Large models (1.5B): large.en-v1 (440K hours) records about 13.0% WER short-form (Whisper large-v1 ~12.2%); large.en-v2 (680K hours) improves to around 12.6% WER, narrowing the gap to under 0.5% in some settings.
- Smaller models remain competitive: tiny.en at ~20.5% and base.en at ~16.6% WER on short-form benchmarks, useful options for constrained environments.
These results give developers flexibility to choose models that match latency and compute budgets while keeping strong accuracy.
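All of these figures use word error rate, the standard ASR metric: WER = (S + D + I) / N, where S, D, and I count word substitutions, deletions, and insertions against a reference of N words. A minimal implementation for sanity-checking reported numbers:

# WER via a standard edit distance over word sequences.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table of edit distances between word prefixes.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat down"))  # 0.333..., one insertion over 3 words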
How to use
Basic transcription takes just a few lines of Python code:
import olmoasr

# Load a checkpoint by size (see the model list above) in inference mode
model = olmoasr.load_model("medium", inference=True)

# Transcribe an audio file; the result carries the text and timing info
result = model.transcribe("audio.mp3")
print(result)
The output includes both the transcription and time-aligned segments, which is useful for captioning, meeting notes, or downstream NLP pipelines.
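Assuming the output follows a Whisper-style layout, with a "segments" list whose entries carry start, end, and text fields (an assumption worth verifying against the OLMoASR repo), captions fall out almost for free:

# Hypothetical: convert Whisper-style segments to SRT captions.
# Assumes each segment has "start", "end", and "text" keys.
def to_srt(segments) -> str:
    def ts(seconds: float) -> str:
        ms = int(seconds * 1000)
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"
    lines = []
    for i, seg in enumerate(segments, start=1):
        lines.append(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text'].strip()}\n")
    return "\n".join(lines)

# print(to_srt(result["segments"]))  # using result from the snippet above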
Fine-tuning and domain adaptation
Because AI2 provides full training recipes and code, OLMoASR can be fine-tuned to specialized domains:
- Medical speech recognition using clinical datasets
- Legal or courtroom transcription with domain-specific audio
- Low-resource accents and dialects by fine-tuning on targeted corpora
Open pipelines simplify domain adaptation, which is critical when out-of-distribution jargon or acoustic conditions reduce off-the-shelf performance.
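At its core, fine-tuning such a model is a standard seq2seq training loop. The sketch below is schematic, not AI2's published recipe: it assumes a model with the (mels, tokens) → logits interface of the TinyASR sketch above and a DataLoader yielding padded (mels, tokens) batches.

import torch
import torch.nn.functional as F

def finetune(model, loader, epochs=3, lr=1e-5):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for mels, tokens in loader:
            # Teacher forcing: predict token t+1 from tokens up to t.
            logits = model(mels, tokens[:, :-1])
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                tokens[:, 1:].reshape(-1),
            )
            opt.zero_grad()
            loss.backward()
            opt.step()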
Applications
OLMoASR’s openness unlocks multiple use cases:
- Academic research into dataset construction, filtering, and architecture effects on ASR performance
- Embedding speech recognition in HCI, real-time meeting transcription, and accessibility tools without relying on closed APIs
- Multimodal systems that combine speech input with large language models for context-aware assistants
- Standardized benchmarking, since models, data identifiers, and evaluation scripts are published for reproducible comparisons