Anthropic Launches Bloom for Automated AI Behavioral Evaluations
Explore Bloom, the open-source framework automating behavioral evaluations for frontier AI models.
Overview
Anthropic has released Bloom, an open-source agentic framework that automates behavioral evaluations for frontier AI models. The system takes a researcher-specified behavior and builds targeted evaluations that measure how often and how strongly that behavior appears in realistic scenarios.
Why Bloom?
Behavioral evaluations for safety and alignment are expensive to design and maintain. Teams must handcraft scenarios, run many interactions, analyze long transcripts, and aggregate scores. As models evolve, old benchmarks can become obsolete or leak into training data. Anthropic’s research team frames this as a scalability problem: it needs a way to generate fresh evaluations for misaligned behaviors quickly while keeping the metrics meaningful.
Bloom targets this gap. Instead of a fixed benchmark with a small set of prompts, Bloom grows an evaluation suite from a seed configuration. The seed anchors what behavior to study, how many scenarios to generate, and what interaction style to use. The framework then produces new yet behavior-consistent scenarios on each run, while still allowing reproducibility through the recorded seed.
Seed Configuration and System Design
Bloom is implemented as a Python pipeline and is released under the MIT license on GitHub. The core input is the evaluation “seed”, defined in seed.yaml. This file references a behavior key in behaviors/behaviors.json, optional example transcripts, and global parameters that shape the entire run.
Key configuration elements, illustrated in a sketch after this list, include:
- behavior: A unique identifier defined in behaviors.json for the target behavior, for example, sycophancy or self-preservation.
- examples: Zero or more few-shot transcripts stored under behaviors/examples/.
- total_evals: The number of rollouts to generate in the suite.
- rollout.target: The model under evaluation, such as claude-sonnet-4.
- Controls such as diversity, max_turns, modality, reasoning effort, and additional judgment qualities.
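For orientation, here is a minimal sketch of what such a seed might look like, parsed with PyYAML. The key names follow the description above; the exact schema of the real seed.yaml may differ, so treat this as illustrative rather than authoritative.

```python
# Hypothetical Bloom seed, expressed as a YAML string. Key names mirror
# the article's description; the real schema in the repository may differ.
import yaml  # pip install pyyaml

SEED_YAML = """
behavior: sycophancy                 # key defined in behaviors/behaviors.json
examples:                            # optional few-shot transcripts
  - behaviors/examples/sycophancy_1.json
total_evals: 100                     # rollouts to generate in the suite
rollout:
  target: claude-sonnet-4            # model under evaluation
  max_turns: 10                      # cap on conversation length
diversity: 0.7                       # how varied the generated scenarios are
modality: conversation
"""

seed = yaml.safe_load(SEED_YAML)
print(seed["behavior"], "->", seed["rollout"]["target"])
```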
Bloom uses LiteLLM as a backend for model API calls and can interact with both Anthropic and OpenAI models through a single interface. It integrates with Weights & Biases for large sweeps and exports Inspect-compatible transcripts.
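As a sketch of what that unified interface looks like in practice, the snippet below calls an Anthropic and an OpenAI model through LiteLLM's completion function. The model identifiers are examples, the call requires the corresponding API keys, and this is not Bloom's internal code.

```python
# Minimal LiteLLM usage sketch: one call signature, two providers.
# Assumes ANTHROPIC_API_KEY and OPENAI_API_KEY are set in the environment.
from litellm import completion

for model in ["anthropic/claude-sonnet-4-20250514", "openai/gpt-4o"]:
    resp = completion(
        model=model,
        messages=[{"role": "user", "content": "Reply with one short sentence."}],
    )
    print(model, "->", resp.choices[0].message.content)
```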
Four-Stage Agentic Pipeline
Bloom’s evaluation process is organized into four agent stages that run in sequence; a toy sketch of the flow follows the list:
- Understanding agent: Reads the behavior description and example conversations, building a summary of what counts as a positive instance of the behavior and why this behavior matters.
- Ideation agent: Generates candidate evaluation scenarios, describing situations, user personas, and the tools available to the target model. It batches scenario generation to use token budgets efficiently and uses the diversity parameter to control how varied the scenarios are.
- Rollout agent: Instantiates scenarios with the target model for multi-turn conversations, recording all messages and tool calls. Configuration parameters control the autonomy of the target model.
- Judgment and meta-judgment agents: A judge model scores each transcript numerically for behavior presence and additional qualities. A meta-judge then summarizes all rollouts, producing a report highlighting important cases and patterns.
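The runnable toy below wires the four stages together to show the data flow. All function names and signatures are illustrative stand-ins, not Bloom's actual API; each stub marks where the real framework delegates to an LLM agent.

```python
# Illustrative stand-in for Bloom's four-stage pipeline; not the real API.
from dataclasses import dataclass

@dataclass
class Transcript:
    scenario: str
    messages: list[str]

def understand(behavior: str, examples: list[str]) -> str:
    # Stage 1: summarize what counts as a positive instance of the behavior.
    return f"working definition of {behavior} ({len(examples)} examples)"

def ideate(understanding: str, n: int, diversity: float) -> list[str]:
    # Stage 2: generate n candidate scenarios, varied per the diversity knob.
    return [f"scenario {i} ({diversity=}) from {understanding}" for i in range(n)]

def rollout(scenario: str, target: str, max_turns: int) -> Transcript:
    # Stage 3: run the target model inside the scenario, multi-turn.
    return Transcript(scenario, [f"{target} turn {t}" for t in range(max_turns)])

def judge(transcript: Transcript) -> float:
    # Stage 4a: score behavior presence numerically (stubbed to 0.0).
    return 0.0

def meta_judge(scores: list[float]) -> str:
    # Stage 4b: summarize the whole suite into a report.
    return f"{len(scores)} rollouts, mean score {sum(scores) / len(scores):.2f}"

def run_suite(seed: dict) -> str:
    u = understand(seed["behavior"], seed.get("examples", []))
    scenarios = ideate(u, seed["total_evals"], seed.get("diversity", 0.5))
    transcripts = [rollout(s, seed["rollout"]["target"], seed.get("max_turns", 5))
                   for s in scenarios]
    return meta_judge([judge(t) for t in transcripts])

print(run_suite({"behavior": "sycophancy", "total_evals": 3,
                 "rollout": {"target": "claude-sonnet-4"}}))
```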
Validation on Frontier Models
Anthropic used Bloom to build four alignment-relevant evaluation suites for delusional sycophancy, instructed long-horizon sabotage, self-preservation, and self-preferential bias. Each suite contains 100 distinct rollouts, repeated three times across 16 frontier models. The reported plots show elicitation rate with standard deviation error bars.
Bloom is also tested on intentionally misaligned ‘model organisms’ from earlier alignment work. Across 10 quirky behaviors, Bloom separates the organism from the baseline production model in 9 cases. In the remaining case, the baseline model exhibits the behavior at a similar frequency, leading to score overlap. A separate validation compared candidate judge models’ scores against human labels on the same transcripts, reaching a Spearman correlation of 0.86.
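To make those metrics concrete, the sketch below computes an elicitation rate with a standard deviation across repeated runs, plus a Spearman correlation between judge scores and human labels. All numbers are invented for illustration and are not Anthropic's results.

```python
# Toy metric computations with made-up data.
import statistics
from scipy.stats import spearmanr

# Three repeats of a small suite: 1 = behavior elicited, 0 = not.
repeats = [[1, 0, 1, 1, 0], [1, 1, 0, 1, 1], [0, 0, 1, 1, 1]]
rates = [sum(run) / len(run) for run in repeats]
print(f"elicitation rate: {statistics.mean(rates):.2f} "
      f"+/- {statistics.stdev(rates):.2f}")

# Rank agreement between an LLM judge and human annotators.
judge_scores = [7, 2, 9, 5, 1, 6]
human_labels = [8, 3, 9, 4, 2, 5]
rho, _ = spearmanr(judge_scores, human_labels)
print(f"Spearman correlation: {rho:.2f}")
```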
Relationship to Petri and Positioning
Anthropic positions Bloom as complementary to Petri, its broad-coverage auditing tool. Where Petri surveys many behaviors at once, Bloom starts from a single behavior definition and automates the engineering needed to create a large, targeted evaluation suite with quantitative metrics.
Key Takeaways
- Bloom is an open-source agentic framework that transforms a single behavior specification into a complete behavioral evaluation suite for large models, using a four-stage pipeline: understanding, ideation, rollout, and judgment.
- The system is driven by a seed configuration in seed.yaml and behaviors/behaviors.json, where researchers specify the target behavior, example transcripts, total evaluations, and controls.
- Bloom uses LiteLLM for unified access to Anthropic and OpenAI models, tracks experiments with Weights & Biases, and exports Inspect-compatible JSON with an interactive viewer for transcript inspection.
- Anthropic validates Bloom with four behavior suites run across 16 frontier models; in the model-organism test it separates intentionally misaligned organisms from a baseline model in 9 of 10 cases, and judge scores reach a Spearman correlation of 0.86 with human labels.