WebChoreArena: Pushing AI Web Agents Beyond Simple Browsing with Complex Memory and Reasoning Tasks

The Rise of Web Automation Agents

Web automation agents have drawn growing attention in AI because they can carry out human-like tasks in digital environments. These agents simulate human interactions on websites by clicking, typing, and navigating graphical user interfaces, which lets them operate without relying on APIs that may be unavailable or restricted. Because they work through the same interface a person would use, they can be applied across a broad range of web domains.

Need for Advanced Benchmarks

With the advancement of large language models (LLMs), these agents have grown more sophisticated, capable of interpreting content, reasoning, planning, and executing complex actions. However, existing benchmarks primarily focus on simple browsing tasks and fail to adequately challenge agents on memory-intensive, multi-step, and logic-heavy tasks that mirror real-world digital chores.

Limitations of Previous Benchmarks

Prior benchmarks such as WebArena provided reproducible tasks across simulated websites but mainly tested general browsing abilities. Other benchmarks like Mind2Web, GAIA, and MMInA offered various strengths but suffered from issues including limited interactivity, narrow scope, and lack of reproducibility. These shortcomings left a gap in evaluating how well agents handle complex decision-making and long-term memory.

Introducing WebChoreArena

Researchers at the University of Tokyo developed WebChoreArena to address these challenges. Building on WebArena's structure, WebChoreArena introduces 532 new, more demanding tasks spanning four simulated websites. These tasks involve data aggregation, memory recall, multi-step reasoning, and more, designed to reflect realistic and challenging web interactions.

Task Categories and Input Modalities

WebChoreArena divides tasks into four main categories:

Massive Memory: Extracting and remembering extensive information, such as compiling customer names linked to high-value transactions.
Calculation: Performing arithmetic over information gathered from pages, such as identifying peak spending months.
Long-Term Memory: Connecting information across multiple pages, for example, retrieving pricing rules from one site and applying them on another.
Others: Miscellaneous tasks, including assigning labels in GitLab.

By website, the 532 tasks break down as Shopping (117), Shopping Admin (132), Reddit (91), GitLab (127), and cross-site tasks (65).

Input modalities vary, with most tasks allowing any observation type, while some require textual or image inputs.
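
As a rough illustration of how category and input-modality labels might be attached to tasks, the sketch below models a task record and filters tasks by the observation types an agent provides. The class, field names, modality labels ("any", "text", "image"), and sample task IDs are hypothetical and are not taken from the benchmark's actual data format.

```python
from dataclasses import dataclass

# Hypothetical labels; the benchmark's real schema may differ.
CATEGORIES = {"massive_memory", "calculation", "long_term_memory", "others"}
MODALITIES = {"any", "text", "image"}


@dataclass(frozen=True)
class Task:
    task_id: str
    category: str
    required_input: str = "any"  # "any", "text", or "image"

    def __post_init__(self):
        # Reject labels outside the assumed vocabularies.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        if self.required_input not in MODALITIES:
            raise ValueError(f"unknown modality: {self.required_input}")


def runnable_tasks(tasks, agent_observations):
    """Keep tasks whose required input the agent is able to observe."""
    return [
        t for t in tasks
        if t.required_input == "any" or t.required_input in agent_observations
    ]


# Made-up sample tasks for illustration only.
tasks = [
    Task("t1", "massive_memory"),
    Task("t2", "calculation", "text"),
    Task("t3", "others", "image"),
]
text_only = runnable_tasks(tasks, {"text"})
```

Under these assumptions, an agent that only receives textual observations would skip the image-dependent task and attempt the other two.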

Benchmark Evaluation Using Leading LLMs and Agents

The benchmark was tested using GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro, paired with advanced web agents AgentOccam and BrowserGym. Results showed a significant drop in performance compared to previous benchmarks, highlighting the increased difficulty:

GPT-4o scored 6.8% accuracy on WebChoreArena, down from 42.8% on WebArena.
Gemini 2.5 Pro achieved the highest accuracy at 44.9%, which still leaves more than half of the tasks unsolved.

WebChoreArena also proved more sensitive in differentiating model performance levels, making it a valuable tool for ongoing development.
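
The size of the GPT-4o drop can be checked with simple arithmetic. The sketch below computes the absolute drop in percentage points and the relative drop as a share of the WebArena score; the function name is illustrative, and the only inputs are the two accuracies quoted above.

```python
def accuracy_drop(baseline_pct, new_pct):
    """Return (absolute drop in points, relative drop in percent)."""
    absolute = baseline_pct - new_pct
    relative = absolute / baseline_pct * 100
    return round(absolute, 1), round(relative, 1)


# GPT-4o: 42.8% on WebArena, 6.8% on WebChoreArena
abs_drop, rel_drop = accuracy_drop(42.8, 6.8)
print(f"{abs_drop} points, {rel_drop}% relative")  # 36.0 points, 84.1% relative
```

In other words, GPT-4o loses 36.0 percentage points, or roughly 84% of its WebArena accuracy.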

Key Insights and Contributions

532 tasks across multiple categories and websites ensure diversity and challenge.
Task templates and rigorous annotation (over 300 hours) support reproducibility and standardization.
Evaluation methods include string matching, URL matching, and HTML structure comparisons.
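
To make the three evaluation styles concrete, here is a minimal sketch of what such checks could look like. It is not the benchmark's evaluation code: the function names, the normalization rules, and the choice to compare HTML by its sequence of opening tags are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urlsplit


def string_match(answer, expected, exact=True):
    """Compare answers after lowercasing and collapsing whitespace."""
    def norm(s):
        return " ".join(s.lower().split())
    if exact:
        return norm(answer) == norm(expected)
    return norm(expected) in norm(answer)


def _normalize_url(url):
    # Ignore a trailing slash and query-parameter order (assumed rules).
    parts = urlsplit(url)
    return (
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/"),
        sorted(parse_qsl(parts.query)),
    )


def url_match(final_url, expected_url):
    """Check that the agent ended on the expected page."""
    return _normalize_url(final_url) == _normalize_url(expected_url)


class _TagCollector(HTMLParser):
    """Record the sequence of opening tags in a document."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def _tag_sequence(html):
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def html_structure_match(page_html, expected_html):
    """Compare tag structure only, ignoring text content."""
    return _tag_sequence(page_html) == _tag_sequence(expected_html)
```

For example, `url_match("http://shop.example/orders/?b=2&a=1", "http://shop.example/orders?a=1&b=2")` returns `True` under these normalization rules, while `html_structure_match("<ul><li>a</li></ul>", "<ol><li>a</li></ol>")` returns `False` because the list tags differ. A real evaluator would likely need stricter, task-specific rules.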

WebChoreArena bridges the gap between simple navigation and higher-order reasoning, memory, and logic required for real-world web automation. It sets a new standard for benchmarking AI web agents, pushing the field toward more practical and capable solutions.

For more details, see the [Paper], [GitHub Page], and [Project Page].