MedAgentBench: Benchmarking AI Agents in Real EHR Workflows

What MedAgentBench Is

MedAgentBench is a new benchmark suite from Stanford designed to evaluate large language model agents in realistic healthcare settings. Moving beyond static question-answer tests, it creates a virtual electronic health record environment where agents must interact with patient data, plan multi-step tasks, and execute actions such as documentation and orders.

Why an Agentic Healthcare Benchmark Matters

Recent LLMs are increasingly agentic: they interpret high-level instructions, call APIs, integrate patient data, and orchestrate multi-step procedures. In medicine, these capabilities could reduce staff workload, streamline documentation, and improve process efficiency. Existing general-purpose agent benchmarks miss healthcare-specific complexity such as FHIR interoperability and longitudinal records; MedAgentBench fills that gap with a reproducible, clinically relevant framework.

Benchmark Composition and Patient Data

MedAgentBench contains 300 tasks across 10 categories, authored by licensed physicians. Tasks include patient information retrieval, lab tracking, documentation, test ordering, referrals, and medication management, averaging 2–3 steps to mirror common inpatient and outpatient workflows. The benchmark uses 100 realistic patient profiles derived from Stanford’s STARR repository, with over 700,000 records including labs, vitals, diagnoses, procedures, and medication orders. Data were de-identified and jittered to protect privacy while preserving clinical validity.
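To make the composition concrete, here is a hedged illustration of how a single task instance might be represented: an instruction tied to a patient profile, plus a reference answer for grading. The field names and example values below are assumptions for illustration only; they are not MedAgentBench's actual schema.

```python
# Hypothetical representation of one benchmark task (illustrative only).
example_task = {
    "task_id": "task_example_001",
    "category": "patient information retrieval",   # one of the 10 task categories
    "instruction": (
        "What is the most recent magnesium level for patient S1234567? "
        "Respond with the numeric value only."
    ),
    "patient_id": "S1234567",                       # hypothetical patient identifier
    "expected_steps": 2,                            # tasks average 2-3 steps
    "reference_answer": "1.9",                      # graded strictly under pass@1
}
```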

EHR Environment and FHIR Compliance

The evaluation environment is FHIR-compliant and supports both retrieval (GET) and modification (POST) operations. Agents can perform realistic clinical interactions such as documenting vitals or placing medication orders, making the benchmark translatable to live EHR systems and enabling more practical assessment of agent capabilities in tool-based workflows.
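As a minimal sketch of the two interaction types the environment supports, the snippet below issues a FHIR search (GET) for a recent lab value and places a medication order (POST) as a MedicationRequest, using the Python requests library. The base URL, patient identifier, and resource contents are illustrative placeholders, not values or endpoints from MedAgentBench itself; the resource types and search parameters follow standard FHIR conventions.

```python
import requests

FHIR_BASE = "http://localhost:8080/fhir"   # hypothetical FHIR server
PATIENT_ID = "example-patient-id"          # hypothetical patient identifier

# Retrieval (GET): search for the patient's most recent blood-glucose observation.
# "Observation", the LOINC code, "_sort", and "_count" are standard FHIR search syntax.
resp = requests.get(
    f"{FHIR_BASE}/Observation",
    params={"patient": PATIENT_ID, "code": "2339-0", "_sort": "-date", "_count": 1},
)
bundle = resp.json()
print(bundle.get("total", 0), "matching observation(s)")

# Modification (POST): place a medication order as a FHIR MedicationRequest resource.
medication_request = {
    "resourceType": "MedicationRequest",
    "status": "active",
    "intent": "order",
    "subject": {"reference": f"Patient/{PATIENT_ID}"},
    "medicationCodeableConcept": {"text": "metformin 500 mg oral tablet"},
    "dosageInstruction": [{"text": "500 mg PO twice daily"}],
}
resp = requests.post(f"{FHIR_BASE}/MedicationRequest", json=medication_request)
print("Order accepted" if resp.status_code in (200, 201) else "Order rejected")
```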

Evaluation Method and Results

Models are judged by task success rate (SR) using a strict pass@1 metric to reflect real-world safety requirements. The suite tested 12 leading LLMs including GPT-4o, Claude 3.5 Sonnet, Gemini 2.0, DeepSeek-V3, Qwen2.5, and Llama 3.3, with a baseline orchestrator exposing nine FHIR functions and limiting agents to eight interaction rounds per task. Claude 3.5 Sonnet v2 led overall with a 69.67% success rate, excelling in retrieval tasks at 85.33%. GPT-4o achieved 64.0% and DeepSeek-V3 reached 62.67%, the top performance among open-weight models. Most models performed well on queries but struggled with action-based tasks that require safe multi-step execution.
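The evaluation loop can be sketched as follows: each task gets a single attempt (pass@1) and a bounded number of model-environment exchanges (eight here, matching the round limit above). The functions call_llm, execute_tool_call, and the per-task check are hypothetical stand-ins for the orchestrator, the FHIR tool dispatcher, and the physician-authored grading logic; this is not the benchmark's actual code.

```python
from dataclasses import dataclass
from typing import Callable, List

MAX_ROUNDS = 8  # interaction rounds allowed per task

@dataclass
class Task:
    instruction: str
    check: Callable[[str, object], bool]   # grades the final answer and/or EHR state

def run_task(task: Task, call_llm, execute_tool_call, ehr_state) -> bool:
    """Run one task with at most MAX_ROUNDS model-environment exchanges."""
    history = [{"role": "user", "content": task.instruction}]
    for _ in range(MAX_ROUNDS):
        reply = call_llm(history)                              # tool call or final answer
        if reply.get("type") == "tool_call":
            observation = execute_tool_call(reply, ehr_state)  # GET/POST against the FHIR server
            history.append({"role": "tool", "content": observation})
        else:
            return task.check(reply["content"], ehr_state)     # one attempt, graded strictly
    return False                                               # out of rounds counts as failure

def success_rate(tasks: List[Task], call_llm, execute_tool_call, make_state) -> float:
    """Strict pass@1: one attempt per task, no retries."""
    passes = sum(run_task(t, call_llm, execute_tool_call, make_state()) for t in tasks)
    return passes / len(tasks)
```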

Common Failure Modes

Two primary failure patterns emerged: instruction-adherence errors, such as invalid API calls or incorrect JSON formatting, and output mismatches, where models returned free-form sentences instead of the required structured numerical values. These errors reveal gaps in precision and reliability that must be addressed before clinical deployment.
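The checks below sketch how these two failure modes could be caught programmatically: rejecting malformed or unrecognized tool calls, and rejecting free-text answers when a bare numeric value is required. The function names and allowed-tool list are assumptions for illustration, not MedAgentBench's grading code.

```python
import json
from typing import Optional

ALLOWED_TOOLS = {"get_observation", "post_medication_request"}  # hypothetical tool names

def validate_tool_call(raw: str) -> Optional[str]:
    """Return an error message if the model's tool call is unusable, else None."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return "invalid JSON in tool call"
    if call.get("name") not in ALLOWED_TOOLS:
        return f"unknown function: {call.get('name')!r}"
    return None

def validate_numeric_answer(raw: str) -> Optional[str]:
    """Reject free-form sentences when the task expects a bare numeric value."""
    try:
        float(raw.strip())
        return None
    except ValueError:
        return "expected a structured numerical value, got free-form text"

# Example: "The latest glucose was 102 mg/dL" is rejected, while "102" passes.
```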

Implications for Clinical AI Development

MedAgentBench establishes a first-of-its-kind, large-scale benchmark for LLM agents operating in EHR-like settings, pairing clinician-authored tasks with real patient profiles and a FHIR-compliant environment. Results show promising retrieval capabilities but highlight an urgent need to improve safe action execution and structured output reliability. While the benchmark is constrained by single-institution data and a focus on EHR interactions, it provides an open and reproducible platform to drive development of dependable healthcare AI agents.

For more details, see the paper and technical blog, and visit the project GitHub for tutorials, code, and notebooks. The study is available at https://ai.nejm.org/doi/full/10.1056/AIdbp2500144 (DOI: 10.1056/AIdbp2500144).