
Google Health AI Launches MedASR for Clinical Dictation

MedASR is an advanced speech-to-text model optimized for clinical dictation.

What MedASR is and where it fits

MedASR is a speech-to-text model based on the Conformer architecture, pre-trained for medical dictation and transcription. It serves as a foundation for developers building healthcare-focused voice applications such as radiology dictation tools and visit note capture systems.

The model contains 105 million parameters and accepts mono channel audio at 16,000 Hz with 16-bit integer waveforms. It outputs text only, making it suitable for integration into downstream natural language processing or generative models like MedGemma.
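Recordings in other formats need converting to this layout before inference. Below is a minimal sketch, assuming the librosa and soundfile libraries and a hypothetical input file named input.m4a:

import numpy as np
import librosa
import soundfile as sf

# Resample to 16 kHz and downmix to mono; librosa returns float32 in [-1, 1].
waveform, _ = librosa.load("input.m4a", sr=16000, mono=True)

# Quantize to 16-bit integer PCM, the waveform format the model expects.
pcm16 = (np.clip(waveform, -1.0, 1.0) * 32767).astype(np.int16)
sf.write("input_16k.wav", pcm16, 16000, subtype="PCM_16")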

MedASR is part of the Health AI Developer Foundations portfolio, alongside MedGemma, MedSigLIP, and other specialized medical models that adhere to common usage terms and consistent governance.

Training Data and Domain Specialization

MedASR was trained on a diverse corpus of de-identified medical speech: roughly 5,000 hours of physician dictations and clinical conversations spanning specialties such as radiology, internal medicine, and family medicine.

The training data pairs audio segments with transcripts and metadata, and subsets are annotated with medical named entities such as symptoms, medications, and conditions. This helps the model learn clinical vocabulary and the phrasing patterns common in routine documentation.

The model currently supports English only and was trained predominantly on audio from native English speakers raised in the United States. Performance may degrade for other speaker profiles or in noisy environments; fine-tuning is recommended in those cases.

Architecture and Decoding

MedASR adopts the Conformer encoder design, which integrates convolution blocks with self-attention layers to capture both local acoustic patterns and longer-range temporal dependencies.
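To make that design concrete, here is a small, illustrative PyTorch sketch of a single Conformer block. The dimensions, kernel size, and layer details are assumptions chosen for clarity, not MedASR's actual configuration:

import torch
import torch.nn as nn
import torch.nn.functional as F

def feed_forward(dim):
    # Half-step "macaron" feed-forward used on both sides of the block.
    return nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                         nn.SiLU(), nn.Linear(4 * dim, dim))

class ConvModule(nn.Module):
    def __init__(self, dim, kernel=31):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise_in = nn.Conv1d(dim, 2 * dim, 1)
        self.depthwise = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.pointwise_out = nn.Conv1d(dim, dim, 1)

    def forward(self, x):                                # x: (batch, time, dim)
        y = self.norm(x).transpose(1, 2)                  # Conv1d wants (batch, dim, time)
        y = F.glu(self.pointwise_in(y), dim=1)            # gated pointwise expansion
        y = F.silu(self.bn(self.depthwise(y)))            # local acoustic patterns
        return x + self.pointwise_out(y).transpose(1, 2)  # residual connection

class ConformerBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.ff1, self.ff2 = feed_forward(dim), feed_forward(dim)
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = ConvModule(dim)
        self.out_norm = nn.LayerNorm(dim)

    def forward(self, x):
        x = x + 0.5 * self.ff1(x)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]  # longer-range dependencies
        x = self.conv(x)                                   # includes its own residual
        x = x + 0.5 * self.ff2(x)
        return self.out_norm(x)

print(ConformerBlock()(torch.randn(2, 100, 256)).shape)  # torch.Size([2, 100, 256])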

The model exposes a CTC-style automatic speech recognition interface. Developers can use AutoProcessor to turn waveform audio into input features and AutoModelForCTC to map those features to token sequences. Decoding is greedy by default, although pairing the model with an external six-gram language model under beam search (beam size 8) improves the word error rate.
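One common way to couple a CTC model with an external n-gram language model is the pyctcdecode library. A hedged sketch follows, in which the KenLM file medical_6gram.arpa is a hypothetical placeholder and the vocabulary extraction is simplified and may need adapting to the model's actual token set:

import torch
import librosa
from transformers import AutoProcessor, AutoModelForCTC
from pyctcdecode import build_ctcdecoder

processor = AutoProcessor.from_pretrained("google/medasr")
model = AutoModelForCTC.from_pretrained("google/medasr")

# Build the decoder from the model's vocabulary, in token-index order,
# plus an external n-gram language model in ARPA format.
vocab = [tok for tok, _ in sorted(processor.tokenizer.get_vocab().items(),
                                  key=lambda kv: kv[1])]
decoder = build_ctcdecoder(vocab, kenlm_model_path="medical_6gram.arpa")

# Run the acoustic model once to get frame-level scores.
audio, _ = librosa.load("dictation.wav", sr=16000, mono=True)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0].cpu().numpy()

# Beam search with beam size 8, per the configuration described above.
print(decoder.decode(logits, beam_width=8))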

Training used JAX and ML Pathways on TPUv4p, TPUv5p, and TPUv5e hardware, in line with Google's broader approach to scaling foundation model training.

Performance on Medical Speech Tasks

Key results, all reported as word error rate (WER; lower is better), include the following; a short sketch for computing WER on your own transcripts appears after the list:

  • RAD DICT (radiologist dictation): MedASR greedy 6.6%, MedASR + LM 4.6%; Gemini 2.5 Pro 10.0%, Gemini 2.5 Flash 24.4%, Whisper v3 Large 25.3%.
  • GENERAL DICT (general and internal medicine): MedASR greedy 9.3%, MedASR + LM 6.9%; Gemini 2.5 Pro 16.4%, Gemini 2.5 Flash 27.1%, Whisper v3 Large 33.1%.
  • FM DICT (family medicine): MedASR greedy 8.1%, MedASR + LM 5.8%; Gemini 2.5 Pro 14.6%, Gemini 2.5 Flash 19.9%, Whisper v3 Large 32.5%.
  • Eye Gaze (998 MIMIC chest X-ray cases): MedASR greedy 6.6%, MedASR + LM 5.2%; Gemini 2.5 Pro 5.9%, Gemini 2.5 Flash 9.3%, Whisper v3 Large 12.5%.
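For reference, here is a minimal sketch of computing WER on your own transcripts with the jiwer library; the example strings are illustrative:

import jiwer

reference = "no acute cardiopulmonary abnormality"
hypothesis = "no acute cardio pulmonary abnormality"

# WER = (substitutions + insertions + deletions) / words in the reference.
print(f"WER: {jiwer.wer(reference, hypothesis):.2f}")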

Developer Workflow and Deployment Options

A minimal pipeline example is:

from transformers import pipeline
import huggingface_hub

# Download a sample audio file from the model repository.
audio = huggingface_hub.hf_hub_download("google/medasr", "test_audio.wav")

# Build an ASR pipeline backed by MedASR.
pipe = pipeline("automatic-speech-recognition", model="google/medasr")

# Transcribe in 20-second chunks with 2-second overlap between chunks.
result = pipe(audio, chunk_length_s=20, stride_length_s=2)
print(result)

For finer control, developers can load AutoProcessor and AutoModelForCTC directly, resample audio to 16,000 Hz with librosa, move tensors to CUDA when available, and decode the model's output, typically by taking the argmax of the logits and passing the predicted IDs to processor.batch_decode.
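A sketch of that workflow, assuming the standard Hugging Face CTC interface and a hypothetical recording dictation.wav:

import torch
import librosa
from transformers import AutoProcessor, AutoModelForCTC

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("google/medasr")
model = AutoModelForCTC.from_pretrained("google/medasr").to(device)

# The model expects 16 kHz mono audio.
audio, _ = librosa.load("dictation.wav", sr=16000, mono=True)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt").to(device)

with torch.no_grad():
    logits = model(**inputs).logits        # shape: (batch, frames, vocab)

# Greedy decoding: pick the best token per frame, then collapse
# repeats and CTC blanks into the final transcript.
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])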

Key Takeaways

  1. MedASR is a lightweight, open-weights, Conformer-based medical ASR model: with 105M parameters, it is designed for medical dictation and transcription and is released under the Health AI Developer Foundations program; it currently supports English only.
  2. Domain-specific training on 5,000 hours of de-identified medical audio: MedASR was trained on physician dictations and clinical conversations, giving it a strong command of clinical terminology relative to general-purpose ASR systems.
  3. Competitive word error rates on medical dictation benchmarks: In evaluations covering radiology, general and family medicine, as well as Eye Gaze datasets, MedASR demonstrates performance that either matches or exceeds large general models like Gemini 2.5 and Whisper v3.

Additional resources and implementation details can be found on GitHub and Hugging Face.
