WALT: Turning Website Features into Reusable Tools for LLM Agents

What WALT does

WALT (Web Agents that Learn Tools) is a framework from Salesforce AI Research that reverse-engineers latent website functionality into stable, reusable tools. Instead of driving web agents through long chains of low-level clicks and heuristics, WALT surfaces callable operations such as search, filter, sort, post_comment, and create_listing. These encapsulated tools reduce how much step-by-step freeform reasoning the large language model must do and make execution more deterministic.
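
To make the contrast concrete, here is a minimal, self-contained Python sketch. The tool names, signatures, and site URLs are illustrative assumptions, not WALT's actual API.

    # Illustrative only: names and signatures are hypothetical, not WALT's API.

    # Without tools, an agent plans brittle low-level actions one by one:
    raw_plan = [
        ("click", "#search-box"),
        ("type", "#search-box", "mountain bike"),
        ("click", "button[type=submit]"),
        ("click", ".result:nth-child(1)"),
    ]

    # With discovered tools, the same goal is a short, typed call sequence:
    def search(query: str) -> list[str]:
        """Hypothetical discovered tool; returns result URLs deterministically."""
        return [f"https://example-site.test/item/{i}?src={query}" for i in (1, 2)]

    def post_comment(item_url: str, body: str) -> None:
        """Hypothetical discovered tool wrapping a validated comment form."""
        print(f"POST comment {body!r} -> {item_url}")

    results = search("mountain bike")
    post_comment(results[0], "Is this still available?")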

Two-phase discovery and construction pipeline

WALT operates in two phases: discovery, then construction with validation. During discovery, the system explores a site offline and proposes candidate tools mapped to common goals such as discovery, content management, and communication. In the construction phase, WALT converts interaction traces into deterministic scripts, stabilizes selectors, promotes operations to URLs where feasible, induces input schemas, and runs end-to-end checks. Only candidates that pass validation are registered as tools. The pipeline shifts as much work as possible into stable URL and form operations, reserving agentic grounding for the cases that genuinely require flexible reasoning.
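
A rough sketch of that flow under stated assumptions: the stage stubs, data structures, and the trivially passing validation below are placeholders for shape only, not the paper's implementation.

    # Placeholder pipeline: each construction step is stubbed to show the
    # shape of the flow, not how WALT actually implements it.
    from dataclasses import dataclass, field

    @dataclass
    class Candidate:
        name: str
        trace: list[str]                  # recorded steps from offline exploration
        schema: dict = field(default_factory=dict)
        validated: bool = False

    def construct(cand: Candidate) -> Candidate:
        script = list(cand.trace)                    # 1. replay trace as a deterministic script
        # 2. selector stabilization and 3. URL promotion would rewrite the
        #    script here; both are stubbed out in this sketch.
        cand.schema = {"query": {"type": "string"}}  # 4. induce an input schema (stub)
        cand.validated = bool(script)                # 5. end-to-end check (stub)
        return cand

    candidates = [Candidate("search", trace=["click #q", "type #q {query}", "submit"])]
    registry = [c for c in map(construct, candidates) if c.validated]
    print([t.name for t in registry])     # only validated candidates are registered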

How tools are represented

Each discovered tool carries a contract in the form of a schema and usage examples. Tools encapsulate navigation, selection, extraction, and optional bounded agentic steps such as waiting for dynamic content or extracting text. At runtime, an agent composes a short program that invokes a few tools to complete a task, which yields higher success with fewer steps and more predictable behavior than lengthy click sequences.
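
For illustration, such a contract might look like the dictionary below; every field name and step op here is an assumption about shape, not WALT's actual schema format.

    # Hypothetical tool contract: field names and step ops are illustrative.
    post_comment_contract = {
        "name": "post_comment",
        "description": "Post a comment on a forum thread.",
        "parameters": {                                   # induced input schema
            "thread_url": {"type": "string"},
            "body": {"type": "string", "minLength": 1},
        },
        "examples": [                                     # shipped usage examples
            {"thread_url": "https://forum.test/t/123", "body": "Nice build!"},
        ],
        "steps": [                                        # mostly deterministic
            {"op": "goto", "target": "{thread_url}"},
            {"op": "fill", "selector": "#comment-box", "value": "{body}"},
            {"op": "click", "selector": "button.submit"},
            {"op": "wait_agentic", "until": "new comment visible"},  # bounded agentic step
        ],
    }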

Benchmark results

WALT was evaluated on VisualWebArena and WebArena. On VisualWebArena it reports an average success rate of 52.9 percent, with 64.1 percent on Classifieds, 53.4 percent on Shopping, and 39.0 percent on Reddit. Baselines cited in the paper include SGV at 50.2 percent and ExACT at 33.7 percent, while human performance averages 88.7 percent.

On WebArena WALT reaches a 50.1 percent average across GitLab, Map, Shopping, CMS, Reddit, and Multi, beating the best skill-induction baseline by about nine points; human performance averages 78.2 percent.

Efficiency gains and ablation findings

Toolization reduces action counts by roughly 1.4x on average compared to a matched agent without tools. On the Classifieds split, using tools consistently improves performance across agent backbones: WALT with GPT-5 mini, for example, shows a 7 percent higher success rate and 27 percent fewer steps than the same backbone without tools. A human-demonstration strategy achieves 66.0 percent success, while fully autonomous WALT hits 64.1 percent with 5 percent fewer steps than the human-demonstration case. Individual components also contribute measurable gains: multimodal DOM parsing adds 2.6 percent absolute, and external verification adds another 3.3 percent at the cost of extra checks. Overall, WALT takes about 21.3 percent fewer steps than baseline policies.

Design choices that favor determinism

WALT prioritizes URL-level operations when search and filtering can be expressed via query parameters or explicit routes. When dynamic grounding is unavoidable, the tool scripts insert small bounded agentic steps such as content extraction or waiting for page load. Selector stabilization and schema validation reduce drift as site layouts change, and the discovered tool sets keep the fraction of agentic operations low while biasing toward deterministic actions like navigation, input, and click.
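
As a concrete instance of URL promotion, consider the sketch below. The base URL, route, and parameter names are invented for a hypothetical shop, but the pattern of expressing search and filtering as query parameters rather than click sequences is the one described above.

    # URL promotion sketch: search + filtering as one deterministic URL.
    # The base URL, route, and parameter names are assumptions for illustration.
    from urllib.parse import urlencode

    def promoted_search_url(base: str, query: str, **filters: str) -> str:
        """Build a stable search URL instead of replaying clicks and keystrokes."""
        params = {"q": query, **filters}
        return f"{base}/search?{urlencode(params)}"

    print(promoted_search_url("https://shop.test", "usb-c hub",
                              category="electronics", sort="price_asc"))
    # -> https://shop.test/search?q=usb-c+hub&category=electronics&sort=price_asc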

Practical notes and availability

The release ships a CLI with commands like walt discover and walt agent, along with MCP serving for integration. The paper and code are linked in the references below for readers who want tutorials, notebooks, and implementation details.

References

Paper: https://arxiv.org/pdf/2510.01524

Code: see the project GitHub repository for code, tutorials, and notebooks.