Salesforce's GTA1 Sets New Benchmark in GUI Agents, Surpassing OpenAI's CUA

Salesforce AI releases GTA1, a powerful GUI agent that outperforms OpenAI's CUA by leveraging innovative test-time scaling and reinforcement learning techniques to improve task success and action grounding.

Introducing GTA1: A Breakthrough in GUI Agent Technology

Salesforce AI Research has unveiled GTA1, a novel graphical user interface (GUI) agent designed to autonomously interact with real operating system environments, including Linux. GTA1 tackles two major challenges in GUI agent development: ambiguous task planning and inaccurate action grounding. Achieving a 45.2% task success rate on the OSWorld benchmark, GTA1 outperforms OpenAI’s Computer-Using Agent (CUA), establishing a new state-of-the-art for open-source models.

Overcoming Core Challenges in GUI Agents

GUI agents convert high-level instructions into sequences of low-level actions such as clicks and keystrokes, observing UI changes after each step to plan the next. Two issues persist: planning ambiguity and grounding precision. Ambiguity arises because many different action sequences can complete the same task, with widely varying efficiency. Grounding is hard because abstract action descriptions must be mapped to exact pixel coordinates of UI elements in complex, cluttered interfaces.
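The observe-plan-act cycle described above can be sketched as a minimal loop. This is an illustrative skeleton, not GTA1's actual implementation; the `observe`, `plan`, and `execute` callables and the `Action` type are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "click", "type", or "done"
    x: int = 0         # screen coordinates for click actions
    y: int = 0
    text: str = ""     # payload for typing actions

def run_agent(instruction, observe, plan, execute, max_steps=20):
    """Generic GUI-agent loop: observe the screen, plan one action,
    execute it, and repeat until the planner signals completion."""
    history = []
    for _ in range(max_steps):
        screenshot = observe()                         # current UI state
        action = plan(instruction, screenshot, history)
        if action.kind == "done":                      # planner says task complete
            break
        execute(action)                                # click / type / etc.
        history.append(action)
    return history
```

The key point is that planning is re-run after every action, so the agent can react to whatever the UI actually did rather than following a fixed script.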

Innovative Test-Time Scaling for Smarter Planning

GTA1 introduces test-time scaling, a method that samples multiple candidate actions concurrently at each decision point rather than committing to a single action prematurely. A multimodal judge model, often a large language model (LLM), evaluates these candidates to select the best action. This approach allows GTA1 to explore execution paths more thoroughly without needing future rollouts, which are impractical for GUI tasks due to irreversible actions. The technique is planner-agnostic and scales efficiently with task complexity.
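In outline, the test-time scaling step looks like the sketch below: sample several candidate actions at one decision point, score each with a judge, and commit to the highest-scoring one. The `planner` and `judge` callables are assumed interfaces for illustration, not GTA1's published API.

```python
def select_action(planner, judge, instruction, screenshot, n_candidates=8):
    """Test-time scaling sketch: sample several candidate actions for the
    same state, let a judge model score each, and pick the best, instead
    of committing to the planner's first sample. No future rollouts are
    simulated, which matters when GUI actions are irreversible."""
    candidates = [planner(instruction, screenshot) for _ in range(n_candidates)]
    scores = [judge(instruction, screenshot, c) for c in candidates]
    best = max(range(n_candidates), key=lambda i: scores[i])
    return candidates[best]
```

Because selection happens within a single step, the approach works with any planner and the candidate count can be raised for harder tasks.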

Reinforcement Learning Enhances Grounding Accuracy

Unlike prior models that rely on supervised fine-tuning to predict UI element centers, GTA1 trains grounding with reinforcement learning based on Group Relative Policy Optimization (GRPO). The agent receives a reward only when a click lands within the correct UI element, so it learns precise grounding directly from interaction outcomes. This removes the need for intermediate reasoning chains or explicit bounding-box prediction and improves accuracy, particularly in static environments.
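A minimal sketch of this reward scheme, under the assumption that each training group consists of several sampled clicks for the same target: the reward is binary (inside the element or not), and GRPO scores each sample relative to its group's mean rather than against a learned value function. Function names here are illustrative.

```python
def click_reward(click_xy, target_bbox):
    """Binary grounding reward: 1.0 if the predicted click lands inside
    the target UI element's bounding box, else 0.0."""
    x, y = click_xy
    left, top, right, bottom = target_bbox
    return 1.0 if left <= x <= right and top <= y <= bottom else 0.0

def group_relative_advantages(rewards):
    """GRPO-style advantage: each sampled click is scored relative to
    the group mean (normalized by the group's standard deviation), so
    the policy is pushed toward clicks that beat its own current average."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0   # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

With a binary reward, the advantage is positive exactly for the samples that hit the element, giving the policy a direct, coordinate-level learning signal.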

Benchmark Performance Highlights

  • OSWorld Task Success Rate: GTA1-7B achieves 45.2%, surpassing OpenAI’s CUA (42.9%) and Claude 3.7 (28.0%).
  • ScreenSpot-Pro Grounding Accuracy: GTA1-7B reaches 50.1%, outperforming UGround-72B (34.5%).
  • ScreenSpot-V2 Cross-platform Grounding: GTA1-72B scores 94.8%, nearing top proprietary models.
  • OSWorld-G Linux GUI Grounding: GTA1-7B attains 67.7%, leading all open-source methods.

These results demonstrate the effectiveness of GTA1’s dual innovations in planning and grounding.

Additional Design Considerations

Data quality is enhanced by filtering misaligned annotations from datasets such as Aria-UI and OS-Atlas using OmniParser. The model scales effectively from 7B to 72B parameters, with GTA1-7B offering an optimal performance-to-compute balance. The multimodal judge is reusable, often the same LLM employed for planning, reducing computational overhead.
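One plausible form of the filtering step described above (a sketch under assumed interfaces, not the released pipeline): keep a grounding annotation only if its labeled click point falls inside at least one UI element box found by a screen parser such as OmniParser. The `detect_boxes` callable stands in for that parser.

```python
def point_in_box(pt, box):
    """True if point (x, y) lies inside box (left, top, right, bottom)."""
    x, y = pt
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

def filter_misaligned(samples, detect_boxes):
    """Data-cleaning sketch: drop annotations whose labeled click point
    does not fall inside any detected UI element, on the assumption that
    such points are misaligned with the actual interface."""
    kept = []
    for screenshot, click_pt in samples:
        boxes = detect_boxes(screenshot)   # parser-produced element boxes
        if any(point_in_box(click_pt, b) for b in boxes):
            kept.append((screenshot, click_pt))
    return kept
```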

GTA1 exemplifies a streamlined, modular framework that leverages diverse planning and precise reinforcement learning-based grounding to advance GUI agent capabilities, pushing the boundaries of open-ended digital interaction.

For more details, see the paper, code repositories, and model releases.
