Meta ARE and Gaia2 Redefine Agent Evaluation for Asynchronous Event-Driven AI
Why asynchronous, event-driven evaluation matters
Most agent benchmarks so far have simplified interaction by pausing the environment while the model plans a response. That synchronous setup misses crucial real-world demands: agents must operate while the world keeps changing. ARE decouples agent and environment time so the environment continues to evolve during reasoning, injecting scheduled or stochastic events such as replies, reminders, or updates. This design surfaces capabilities like proactivity, interruption handling, deadline awareness, and graceful recovery from unexpected changes.
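To make the decoupling concrete, here is a minimal, self-contained Python sketch (illustrative only, not the ARE API; every class and function name here is invented) of an environment whose clock keeps advancing and delivering scheduled events while the agent is still reasoning:

```python
import heapq
import time

class AsyncEnvironment:
    """Toy event-driven environment: time advances independently of the agent."""
    def __init__(self, scheduled_events):
        # (fire_time_seconds, description) pairs, e.g. a reply or reminder arriving
        self._queue = list(scheduled_events)
        heapq.heapify(self._queue)
        self._start = time.monotonic()
        self.inbox = []  # events the agent has not yet observed

    def _now(self):
        return time.monotonic() - self._start

    def poll(self):
        """Deliver every event whose fire time has passed, even if the agent was busy."""
        while self._queue and self._queue[0][0] <= self._now():
            _, event = heapq.heappop(self._queue)
            self.inbox.append(event)
        delivered, self.inbox = self.inbox, []
        return delivered

def slow_agent_step(observations):
    """Stand-in for model reasoning; the simulated world keeps moving during this call."""
    time.sleep(1.5)
    return f"plan built from {len(observations)} new event(s)"

env = AsyncEnvironment([(1.0, "reply: meeting moved to 3pm"), (2.0, "reminder: send report")])
for _ in range(3):
    new_events = env.poll()  # whatever arrived while the agent was thinking
    print(round(env._now(), 1), new_events, slow_agent_step(new_events))
```

Running this shows events firing at their scheduled times regardless of how long the agent's reasoning step takes, which is the property ARE's time decoupling is designed to test.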
ARE architecture: everything as events
Agents Research Environments (ARE) is a modular, event-driven simulation stack that treats “everything as an event.” ARE exposes five core concepts that structure simulations:
- Apps: stateful interfaces that represent tools or services the agent can use. Tools are typed as read or write, enabling precise verification of state-mutating actions.
- Environments: collections of apps, rules and datasets that define the simulated world.
- Events: timestamped, logged occurrences that drive the simulation’s clock and behavior.
- Notifications: configurable observability primitives that determine what the agent can see and when.
- Scenarios: initial state plus scheduled events and a verifier to judge agent behavior.
The Mobile environment used in many experiments mimics a smartphone with apps like email, messaging and calendar to create realistic, multitasking conditions.
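These five concepts map naturally onto simple data structures. The sketch below is illustrative only (the dataclass names and fields are assumptions, not ARE's actual classes), but it shows how apps with typed tools, events, notifications, and a verifying scenario fit together:

```python
from dataclasses import dataclass, field
from typing import Callable, Literal

# Hypothetical shapes for ARE's five core concepts; names and fields are
# illustrative assumptions, not the library's actual API.

@dataclass
class Tool:
    name: str
    kind: Literal["read", "write"]  # write tools mutate app state and can be verified
    fn: Callable

@dataclass
class App:
    """A stateful interface such as email, messaging, or calendar."""
    name: str
    state: dict = field(default_factory=dict)
    tools: list[Tool] = field(default_factory=list)

@dataclass
class Event:
    """A logged occurrence that drives simulated time and behavior."""
    at_seconds: float
    app: str
    payload: dict

@dataclass
class Notification:
    """Controls what the agent can observe, and when."""
    event: Event
    visible_to_agent: bool = True

@dataclass
class Scenario:
    """Initial state plus scheduled events and a verifier judging the agent."""
    environment: list[App]  # the collection of apps defining the simulated world
    scheduled_events: list[Event]
    verifier: Callable[[list[Event]], bool]

calendar = App(
    "calendar",
    tools=[Tool("create_event", "write",
                fn=lambda state, **kwargs: state.setdefault("events", []).append(kwargs))],
)
scenario = Scenario(
    environment=[calendar],
    scheduled_events=[Event(60.0, "email", {"type": "reply", "body": "Works for me"})],
    verifier=lambda log: any(e.app == "calendar" for e in log),  # toy success check
)
```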
What Gaia2 measures beyond search-and-execute
Gaia2 is the follow-up benchmark built on top of ARE. It moves evaluation away from single-turn, synchronous correctness toward capabilities that matter under change. Key focal points include:
- Adaptability to environment responses and unexpected events
- Handling of ambiguity and noisy inputs
- Time-aware behavior, including meeting deadlines and tolerances for action timing
- Agent-to-Agent collaboration, where sub-agents represent apps and must coordinate
Scenarios in Gaia2 are verifiable and reproducible: deterministic seeds and oracle traces are used to ensure repeatable evaluation and precise scoring.
Scale and available datasets
The counts differ between the public release and the paper: the Hugging Face release provides 800 scenarios across 10 universes, while the paper reports 1,120 verifiable annotated scenarios in the Mobile environment used for its experiments (the larger figure reflects extended or augmented configurations). Practitioners will most commonly work with the 800-scenario public dataset; the paper demonstrates how the suite can scale for larger studies.
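For a quick look at the public release, the standard `datasets` library should suffice. The repository identifier and the config/split names below are assumptions; confirm the exact names on the Hugging Face hub page before running:

```python
from datasets import load_dataset

# The dataset id below is an assumed placeholder; check the hub page linked
# from the project GitHub for the exact repository and configuration names.
gaia2 = load_dataset("meta-agents-research-environments/gaia2")

print(gaia2)  # available splits/configs and scenario counts
first_split = list(gaia2.keys())[0]
first_record = next(iter(gaia2[first_split]))
print(sorted(first_record.keys()))  # inspect the fields of one scenario record
```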
Scoring agents in a changing world
Gaia2 scores sequences of write actions by comparing them to oracle actions with argument-level checks. Argument validation can be strict (exact match) or soft (LLM-based judge) depending on the argument type. The evaluation preserves causality and enforces relative-time constraints so agents are not credited merely for achieving a final state if intermediate trajectories were unsafe or policy-violating.
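The following sketch illustrates the idea (it is not Gaia2's actual verifier; the soft check stubs out where an LLM judge would sit, and the field names and tolerance default are assumptions):

```python
def strict_match(predicted, expected):
    """Exact-match check for arguments such as IDs, dates, or enum values."""
    return predicted == expected

def soft_match(predicted, expected):
    """Placeholder for the LLM judge used on free-text arguments.

    A real implementation would ask a judge model whether `predicted`
    conveys the same content as `expected`; here it is only stubbed.
    """
    return str(predicted).strip().lower() == str(expected).strip().lower()

CHECKERS = {"strict": strict_match, "soft": soft_match}

def score_write_actions(agent_actions, oracle_actions):
    """Credit oracle write actions in order, preserving causality.

    Each oracle action must be matched by a later agent action than the one
    that matched the previous oracle action, within its time tolerance, with
    every argument passing its strict or soft check.
    """
    cursor = 0
    for oracle in oracle_actions:
        found = False
        for i in range(cursor, len(agent_actions)):
            action = agent_actions[i]
            if action["tool"] != oracle["tool"]:
                continue
            if abs(action["time"] - oracle["time"]) > oracle.get("tolerance", 30):
                continue  # violates the relative-time constraint
            args_ok = all(
                CHECKERS[kind](action["args"].get(name), value)
                for name, (value, kind) in oracle["args"].items()
            )
            if args_ok:
                cursor = i + 1  # later oracle actions must match later agent actions
                found = True
                break
        if not found:
            return 0.0  # a missing or out-of-order write action fails the scenario
    return 1.0

oracle = [{
    "tool": "calendar.create_event",
    "time": 120.0,
    "tolerance": 60,
    "args": {"title": ("Team sync", "soft"), "date": ("2025-06-03", "strict")},
}]
agent = [{"tool": "calendar.create_event",
          "time": 150.0,
          "args": {"title": "team sync", "date": "2025-06-03"}}]
print(score_write_actions(agent, oracle))  # 1.0
```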
Implications for production-ready agents
Together, ARE and Gaia2 shift the bar from static correctness to correctness under change. An agent claimed to be production-ready must demonstrably handle asynchrony, interruptions, ambiguity, noise, timing constraints, and multi-agent coordination, while producing verifiable write-action traces. Meta supplies a controllable simulator, a challenging benchmark, and a transparent evaluation loop to stress these real-world behaviors.
Resources
Read the paper and explore code, tutorials and notebooks on the Meta AI pages and the project GitHub to try ARE and Gaia2 yourself:
https://ai.meta.com/research/publications/are-scaling-up-agent-environments-and-evaluations/