OpenAI Launches HealthBench: An Open-Source Benchmark for Evaluating Healthcare AI Models

OpenAI has launched HealthBench, an open-source framework to rigorously evaluate large language models in healthcare using expert-validated multi-turn clinical conversations.

Introducing HealthBench

OpenAI has introduced HealthBench, an open-source evaluation framework designed to measure both the performance and safety of large language models (LLMs) in healthcare contexts. The benchmark was developed with input from 262 physicians spanning 60 countries and 26 medical specialties, giving it a broad, expert-driven foundation.

Overcoming Limitations of Previous Benchmarks

Traditional healthcare AI benchmarks often rely on structured formats such as multiple-choice exams, which fall short of capturing the complexity of real clinical interactions. HealthBench instead uses 5,000 multi-turn conversations between models and users, who may be laypeople or healthcare professionals. Each conversation ends with a final user message, and the model's response to it is evaluated against physician-written rubrics that define positive and negative criteria with point values.
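
To make the rubric mechanics concrete, the following is a minimal sketch of how weighted positive and negative criteria could be aggregated into a single score. The criterion texts, point values, and normalization choice are illustrative assumptions, not the actual HealthBench schema.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    description: str   # what the response should (or should not) do
    points: int        # positive for desirable behavior, negative for harmful behavior

def score_response(criteria_met: list[bool], rubric: list[RubricCriterion]) -> float:
    """Aggregate rubric judgments into a single 0-1 score.

    Points are summed over criteria judged as met, normalized by the maximum
    achievable (positive) points, and clipped to the [0, 1] range.
    """
    earned = sum(c.points for c, met in zip(rubric, criteria_met) if met)
    max_points = sum(c.points for c in rubric if c.points > 0)
    return max(0.0, min(1.0, earned / max_points)) if max_points else 0.0

# Hypothetical rubric for an emergency-referral conversation.
rubric = [
    RubricCriterion("Advises the user to seek emergency care for chest pain", 10),
    RubricCriterion("Asks about symptom onset and duration", 5),
    RubricCriterion("States a specific medication dose without sufficient context", -8),
]
print(score_response([True, False, False], rubric))  # -> 0.666...
```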

Comprehensive Rubric-Based Scoring

The rubrics assess attributes such as clinical accuracy, clarity of communication, completeness, and adherence to instructions. Across the benchmark, the rubrics contain over 48,000 unique criteria, and scoring is performed by a model-based grader validated against expert physician judgment.
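
Because grading is model-based, each criterion judgment can be delegated to a strong LLM. The sketch below assumes the official OpenAI Python SDK; the prompt wording and the choice of grader model are assumptions for illustration, and the real grader prompt is defined in the simple-evals repository.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def criterion_met(conversation: str, response: str, criterion: str) -> bool:
    """Ask a grader model whether a single rubric criterion is satisfied.

    Illustrative only: the prompt format and grader model are assumptions,
    not OpenAI's actual grading setup.
    """
    judgment = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{
            "role": "user",
            "content": (
                "Conversation:\n" + conversation +
                "\n\nFinal model response:\n" + response +
                "\n\nCriterion: " + criterion +
                "\n\nDoes the response meet this criterion? Answer yes or no."
            ),
        }],
    )
    return judgment.choices[0].message.content.strip().lower().startswith("yes")
```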

Benchmark Themes and Variants

HealthBench’s evaluation spans seven key themes: emergency referrals, global health, health data tasks, context-seeking, expertise-tailored communication, response depth, and handling uncertainty. Additionally, two specialized subsets are provided:

  • HealthBench Consensus: Focuses on 34 physician-validated criteria emphasizing critical behaviors like advising emergency care.
  • HealthBench Hard: Contains 1,000 challenging conversations designed to push the limits of current models.

Model Performance Insights

OpenAI tested various models, including GPT-3.5 Turbo, GPT-4o, GPT-4.1, and the latest o3 model. Performance improved markedly across generations: GPT-3.5 Turbo scored 16%, GPT-4o 32%, and o3 reached 60%. GPT-4.1 nano, a smaller and more cost-effective model, outperformed GPT-4o at roughly one twenty-fifth of the inference cost.

Performance strengths were noted in emergency referrals and tailored communication, whereas context-seeking and completeness were more challenging areas. Completeness showed the strongest correlation with overall scores, highlighting its critical role in healthcare tasks.

Physician Comparison and Collaboration

Interestingly, physicians writing responses without AI assistance scored lower than the models, but they were able to improve model-generated drafts, particularly drafts from earlier model versions. This suggests that LLMs could serve as valuable collaborators in clinical documentation and decision-making.

Reliability and Evaluation Consistency

HealthBench incorporates metrics like the "worst-at-k" to measure performance consistency across multiple runs. While newer models show improved stability, variability remains a challenge.
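
One plausible reading of worst-at-k, sketched below, is to repeatedly draw k scored runs per example and keep the minimum, so that a model is rewarded for being consistently good rather than occasionally good. The exact definition used in the paper may differ from this illustration.

```python
import random

def worst_at_k(run_scores: list[list[float]], k: int, trials: int = 1000) -> float:
    """Estimate worst-at-k: the expected minimum score when k runs are drawn per example.

    run_scores[i] holds the scores of repeated runs for example i.
    This is an illustrative interpretation of the metric, not the paper's exact code.
    """
    totals = 0.0
    for _ in range(trials):
        totals += sum(min(random.sample(scores, k)) for scores in run_scores) / len(run_scores)
    return totals / trials

# Example: two prompts, each answered five times with these scores.
scores = [[0.8, 0.6, 0.9, 0.7, 0.85], [0.4, 0.9, 0.5, 0.95, 0.6]]
print(worst_at_k(scores, k=3))
```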

OpenAI also validated its automatic grader against over 60,000 annotated examples, finding that GPT-4.1 matches or surpasses average physician evaluations, establishing it as a reliable scoring tool.
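
As a rough illustration of such a meta-evaluation, one could compare the automated grader's per-criterion yes/no decisions against physician annotations and report simple agreement. The function below is a hypothetical sketch, not OpenAI's validation code.

```python
def grader_agreement(grader_labels: list[bool], physician_labels: list[bool]) -> float:
    """Fraction of criterion-level judgments where the model grader agrees with a physician.

    A fuller meta-evaluation would also report chance-corrected statistics
    (e.g., Cohen's kappa) and compare the grader against physician-physician agreement.
    """
    assert len(grader_labels) == len(physician_labels)
    matches = sum(g == p for g, p in zip(grader_labels, physician_labels))
    return matches / len(grader_labels)

print(grader_agreement([True, False, True, True], [True, True, True, True]))  # -> 0.75
```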

Accessibility and Future Directions

HealthBench is available on the simple-evals GitHub repository, empowering researchers to benchmark, analyze, and refine AI models for healthcare applications. This framework marks a significant advance towards safer and more effective AI deployment in medicine.

For further details, visit the Paper, GitHub Page, and Official Release.
