OLMoASR: AI2’s Open ASR Suite Challenging OpenAI Whisper
The Allen Institute for AI (AI2) published OLMoASR, a fully open automatic speech recognition (ASR) suite that goes beyond releasing model weights: it also provides dataset identifiers, filtering steps, training recipes, and benchmark scripts. This level of transparency positions OLMoASR as one of the most extensible platforms for speech recognition research and practical deployment.
Why open ASR matters
Many state-of-the-art ASR systems from providers like OpenAI, Google, and Microsoft are available only via closed APIs. While these services deliver strong performance, they act as black boxes: training data, filtering procedures, and evaluation details are often hidden. That opacity hampers reproducibility, verification, and domain adaptation. OLMoASR confronts these limitations by opening the full pipeline, enabling researchers to reproduce results, test alternatives, and adapt models to new domains without rebuilding massive datasets from scratch.
Model architecture and scaling
OLMoASR uses a transformer encoder–decoder architecture: the encoder converts audio (as log-Mel spectrogram features) into hidden representations, and the decoder generates text tokens conditioned on those representations. This design mirrors modern ASR systems such as Whisper, but OLMoASR's implementation, training code, and configs are fully open.
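To make the data flow concrete, here is a minimal, self-contained sketch of a Whisper-style encoder–decoder in PyTorch. It is illustrative only: layer counts, dimensions, and the vocabulary size are arbitrary stand-ins, not OLMoASR's actual configuration, and positional encodings are omitted for brevity.

import torch
import torch.nn as nn

class TinyASR(nn.Module):
    def __init__(self, n_mels=80, d_model=256, vocab_size=8000):  # sizes are arbitrary
        super().__init__()
        # Convolutional frontend downsamples the spectrogram in time.
        self.frontend = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1), nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.GELU(),
        )
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=4
        )
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=4
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, mels, tokens):
        # mels: (batch, n_mels, time); tokens: (batch, seq)
        audio = self.frontend(mels).transpose(1, 2)   # (batch, time', d_model)
        memory = self.encoder(audio)                  # acoustic representations
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.decoder(self.token_emb(tokens), memory, tgt_mask=causal)
        return self.lm_head(hidden)                   # next-token logits

model = TinyASR()
logits = model(torch.randn(1, 80, 300), torch.randint(0, 8000, (1, 12)))
print(logits.shape)  # torch.Size([1, 12, 8000])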
The released family covers six English-only sizes, enabling trade-offs between inference cost and accuracy:
- tiny.en – 39M parameters, for lightweight inference
- base.en – 74M parameters
- small.en – 244M parameters
- medium.en – 769M parameters
- large.en-v1 – 1.5B parameters, trained on 440K hours
- large.en-v2 – 1.5B parameters, trained on 680K hours
Smaller models suit embedded or real-time use cases; larger models prioritize accuracy for research and batch workloads.
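The parameter counts above map directly to a simple selection rule when planning a deployment. The helper below is hypothetical (the checkpoint names come from the release list, but this function is not part of the olmoasr package):

# Hypothetical: pick the largest OLMoASR checkpoint within a parameter budget.
OLMOASR_MODELS = {
    "tiny.en": 39e6, "base.en": 74e6, "small.en": 244e6,
    "medium.en": 769e6, "large.en-v1": 1.5e9, "large.en-v2": 1.5e9,
}

def pick_model(max_params: float) -> str:
    fitting = {name: p for name, p in OLMOASR_MODELS.items() if p <= max_params}
    if not fitting:
        raise ValueError("No checkpoint fits the budget")
    return max(fitting, key=fitting.get)

print(pick_model(100e6))  # base.en, e.g. for a constrained edge device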
Data: from web scraping to curated mixes
A key contribution of OLMoASR is publishing the training data strategy, not just model weights.
- OLMoASR-Pool (~3M hours): a massive, weakly supervised collection of speech paired with transcripts scraped from the web. It contains noisy, misaligned, and duplicate examples, similar in spirit to Whisper's large-scale, noisy corpora.
- OLMoASR-Mix (~1M hours): a heavily filtered subset created to improve quality and zero-shot generalization. AI2 applied alignment heuristics, fuzzy deduplication, and cleaning rules to ensure better audio–transcript matches and reduce low-diversity or mismatched examples.
This two-tiered approach mirrors large-scale language model pretraining: scale with noisy corpora, then refine with curated subsets.
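The filtering stage is where most of the quality gain comes from. As a rough illustration of what such rules look like (the thresholds and heuristics below are invented for demonstration; the actual cleaning rules ship with AI2's dataset recipes), a pipeline of this kind combines agreement checks with fuzzy deduplication:

import difflib

def plausible_pair(transcript: str, asr_draft: str, audio_seconds: float,
                   min_ratio: float = 0.6, max_wps: float = 5.0) -> bool:
    """Keep a pair only if the scraped transcript roughly agrees with a
    first-pass ASR draft and the speaking rate is physically plausible."""
    # Fuzzy agreement between the web transcript and a draft transcription.
    ratio = difflib.SequenceMatcher(None, transcript.lower(), asr_draft.lower()).ratio()
    # Words-per-second sanity check to catch misaligned audio/transcript pairs.
    wps = len(transcript.split()) / max(audio_seconds, 1e-6)
    return ratio >= min_ratio and wps <= max_wps

def dedupe(transcripts, threshold=0.9):
    """Drop near-duplicate transcripts via pairwise fuzzy matching.
    O(n^2) for clarity; real pipelines use MinHash or similar to scale."""
    kept = []
    for t in transcripts:
        if all(difflib.SequenceMatcher(None, t, k).ratio() < threshold for k in kept):
            kept.append(t)
    return kept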
Performance benchmarks
AI2 compared OLMoASR against Whisper on short-form and long-form tasks using datasets such as LibriSpeech, TED-LIUM3, Switchboard, AMI, and VoxPopuli.
- Medium model (769M): 12.8% WER on short-form, 11.0% WER on long-form, close to Whisper-medium.en (12.4% / 10.5%).
- Large models (1.5B): large.en-v1 (440K hours) records about 13.0% WER short-form (Whisper large-v1 ~12.2%); large.en-v2 (680K hours) improves to around 12.6% WER, narrowing the gap to under 0.5% in some settings.
- Smaller models remain competitive: tiny.en at ~20.5% and base.en at ~16.6% WER on short-form benchmarks, useful options for constrained environments.
These results give developers flexibility to choose models that match latency and compute budgets while keeping strong accuracy.
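All of these figures use word error rate, the standard ASR metric: WER = (S + D + I) / N, where S, D, and I count word substitutions, deletions, and insertions against a reference of N words. A minimal implementation for sanity-checking reported numbers:

# WER via a standard edit distance over word sequences.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table of edit distances between word prefixes.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat down"))  # 0.333..., one insertion over 3 words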
How to use
Basic transcription takes just a few lines of Python code:
import olmoasr

# Load a checkpoint by size (see the model list above) in inference mode
model = olmoasr.load_model("medium", inference=True)

# Transcribe an audio file; the result carries the text and timing info
result = model.transcribe("audio.mp3")
print(result)
The output includes both the transcription and time-aligned segments, which is useful for captioning, meeting notes, or downstream NLP pipelines.
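Assuming the output follows a Whisper-style layout, with a "segments" list whose entries carry start, end, and text fields (an assumption worth verifying against the OLMoASR repo), captions fall out almost for free:

# Hypothetical: convert Whisper-style segments to SRT captions.
# Assumes each segment has "start", "end", and "text" keys.
def to_srt(segments) -> str:
    def ts(seconds: float) -> str:
        ms = int(seconds * 1000)
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"
    lines = []
    for i, seg in enumerate(segments, start=1):
        lines.append(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text'].strip()}\n")
    return "\n".join(lines)

# print(to_srt(result["segments"]))  # using result from the snippet above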
Fine-tuning and domain adaptation
Because AI2 provides full training recipes and code, OLMoASR can be fine-tuned to specialized domains:
- Medical speech recognition using clinical datasets
- Legal or courtroom transcription with domain-specific audio
- Low-resource accents and dialects by fine-tuning on targeted corpora
Open pipelines simplify domain adaptation, which is critical when out-of-distribution jargon or acoustic conditions reduce off-the-shelf performance.
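At its core, fine-tuning such a model is a standard seq2seq training loop. The sketch below is schematic, not AI2's published recipe: it assumes a model with the (mels, tokens) → logits interface of the TinyASR sketch above and a DataLoader yielding padded (mels, tokens) batches.

import torch
import torch.nn.functional as F

def finetune(model, loader, epochs=3, lr=1e-5):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for mels, tokens in loader:
            # Teacher forcing: predict token t+1 from tokens up to t.
            logits = model(mels, tokens[:, :-1])
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                tokens[:, 1:].reshape(-1),
            )
            opt.zero_grad()
            loss.backward()
            opt.step()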
Applications
OLMoASR’s openness unlocks multiple use cases:
- Academic research into dataset construction, filtering, and architecture effects on ASR performance
- Embedding speech recognition in HCI, real-time meeting transcription, and accessibility tools without relying on closed APIs
- Multimodal systems that combine speech input with large language models for context-aware assistants
- Standardized benchmarking, since models, data identifiers, and evaluation scripts are published for reproducible comparisons