Biomni-R0: Reinforcement-Learned LLMs Reach Expert-Level Biomedical Reasoning

AI’s expanding role in biomedical research

Biomedical AI is moving beyond retrieval and classification toward agents that can reason across genomics, clinical diagnostics, and molecular workflows. Practical biomedical assistants must interface with domain-specific tools, understand biological hierarchies, and simulate multi-step experimental or diagnostic pipelines to provide useful, actionable guidance for researchers and clinicians.

The gap to expert-level reasoning

Most large language models can surface facts or spot patterns, but they struggle with the deep, chained reasoning required for tasks like rare disease diagnosis, gene prioritization, and complex experimental troubleshooting. These tasks demand contextual interpretation, integration across heterogeneous data, and domain judgment rather than surface-level matching to literature.

Why prior methods are limited

Supervised fine-tuning on curated datasets or retrieval-augmented approaches help ground outputs, but they tend to be brittle. Static prompting and fixed behaviors limit adaptability, and many systems break when they must orchestrate external tools or handle unfamiliar biomedical structures. In dynamic, high-stakes environments, robustness, interpretability, and sustained multi-step reasoning are crucial.

Biomni-R0: agentic models trained with reinforcement learning

Stanford and UC Berkeley introduced Biomni-R0, a family of biomedical agent models trained end-to-end with reinforcement learning in an environment tailored to biomedical reasoning. The reported models, Biomni-R0-8B and Biomni-R0-32B, combine Stanford's Biomni agent platform and environment with UC Berkeley's SkyRL reinforcement learning infrastructure, and they use expert-annotated tasks plus a specialized reward design to push performance toward and beyond human expert levels.

Two-phase training pipeline and system design

Training followed two phases. First, supervised fine-tuning (SFT) was performed using high-quality trajectories sampled from Claude 4 Sonnet via rejection sampling to bootstrap structured reasoning behavior. Second, reinforcement learning optimized two reward axes: correctness (for example, selecting the right gene or diagnosis) and formatting (encouraging structured outputs such as explicit thinking and answer tags).
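
The write-up does not publish the exact reward implementation, so the sketch below is only an illustration of how the two axes might compose. The `<think>`/`<answer>` tag format, the weights, and the exact-match grading are all assumptions, not the Biomni-R0 reward itself.

```python
import re

# Hypothetical two-axis reward sketch: tag format, weights, and exact-match
# grading are assumptions, not the published Biomni-R0 implementation.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    """Reward structured output: explicit thinking and answer tags."""
    has_think = bool(THINK_RE.search(completion))
    has_answer = ANSWER_RE.search(completion) is not None
    return 1.0 if (has_think and has_answer) else 0.0

def correctness_reward(completion: str, gold: str) -> float:
    """Reward a correct final answer, e.g. the right gene or diagnosis."""
    answer = ANSWER_RE.search(completion)
    if answer is None:
        return 0.0
    return 1.0 if answer.group(1).strip().lower() == gold.strip().lower() else 0.0

def total_reward(completion: str, gold: str,
                 w_correct: float = 1.0, w_format: float = 0.2) -> float:
    """Weighted sum of the two reward axes (weights are illustrative)."""
    return (w_correct * correctness_reward(completion, gold)
            + w_format * format_reward(completion))
```

In practice the correctness term would likely use task-specific graders rather than string matching, but the structure, a dominant correctness signal plus a small formatting bonus, captures the reported design.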

To handle expensive external tool rollouts and reduce GPU idle time, the team implemented asynchronous rollout scheduling and decoupled environment execution from model inference. They also extended context windows to 64k tokens so the agent can maintain and reason over long multi-turn dialogues and workflows.
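
SkyRL's actual scheduler is not detailed in the announcement; the asyncio sketch below only illustrates the core idea of decoupling, where tool execution and inference run as separate awaitables so one trajectory's slow tool call never stalls the others. The functions `run_tool` and `generate` are hypothetical stand-ins.

```python
import asyncio
import random

async def run_tool(call: str) -> str:
    # Stand-in for an expensive external biomedical tool (variable latency).
    await asyncio.sleep(random.uniform(0.1, 2.0))
    return f"result-of-{call}"

async def generate(prompt: str) -> str:
    # Stand-in for model inference on a decoupled inference server.
    await asyncio.sleep(0.05)
    return f"tool-call-after:{prompt[:20]}"

async def rollout(task: str, max_turns: int = 3) -> str:
    # One multi-turn trajectory: alternate inference and tool execution.
    context = task
    for _ in range(max_turns):
        action = await generate(context)
        observation = await run_tool(action)
        context += f"\n{action}\n{observation}"
    return context

async def collect(tasks: list[str]) -> list[str]:
    # Schedule all rollouts concurrently: fast trajectories finish without
    # waiting on slow ones, so inference hardware stays busy.
    return await asyncio.gather(*(rollout(t) for t in tasks))

if __name__ == "__main__":
    trajectories = asyncio.run(collect([f"task-{i}" for i in range(8)]))
    print(len(trajectories), "trajectories collected")
```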

Benchmarks and key results

Performance improvements were substantial. Biomni-R0-32B achieved a composite score of 0.669 versus the base model's 0.346, and Biomni-R0-8B scored 0.588, outpacing much larger general-purpose models such as Claude 4 Sonnet and GPT-5 on many tasks. Across the 10-task breakdown, Biomni-R0-32B led on seven tasks, GPT-5 on two, and Claude 4 Sonnet on one.

Standout gains include rare disease diagnosis, where Biomni-R0-32B scored 0.67 versus Qwen-32B's 0.03 (a more than 20× improvement), and GWAS variant prioritization, which rose from 0.16 to 0.74 after RL training. These results underscore the value of domain-specific RL and reward engineering.

Scalability, longer reasoning traces, and implications

The system design focused on decoupling heavy environment steps from inference to scale efficiently despite variable tool latencies. RL-trained agents consistently produced longer, more structured reasoning traces, and those extended traces correlated with better task performance. The work suggests that depth and explicit structure in reasoning traces are important markers of expert-level understanding in biomedical domains.
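
As a rough illustration of that trace-length observation, the snippet below checks for a length-score correlation on fabricated toy numbers; these are not data from the Biomni-R0 report.

```python
import statistics

# Toy, fabricated numbers for illustration only -- not paper data.
trace_lengths = [512, 800, 1400, 2100, 3300, 4000]   # tokens per trajectory
task_scores = [0.21, 0.30, 0.41, 0.52, 0.60, 0.66]   # per-trajectory scores

# Pearson r via the standard library (statistics.correlation, Python 3.10+).
r = statistics.correlation(trace_lengths, task_scores)
print(f"Pearson r = {r:.2f}")
```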

Biomni-R0 demonstrates that end-to-end RL with expert-informed rewards and careful system engineering can produce smaller models that outperform much larger generalist models on domain-critical tasks, paving the way for more capable and reliable biomedical agents.