
CoAct-1: Revolutionizing Autonomous Computing with Hybrid GUI and Code Execution

CoAct-1 introduces a hybrid approach combining GUI manipulation and direct coding to improve efficiency and reliability in autonomous computer operation, achieving a record 60.76% success rate on the OSWorld benchmark.

Introducing CoAct-1: A New Era in Autonomous Computer Agents

Researchers from USC, Salesforce AI, and the University of Washington have developed CoAct-1, a multi-agent computer-using agent (CUA) system for autonomous computer operation. CoAct-1 treats coding as a first-class action alongside traditional GUI manipulation, targeting long-standing efficiency and reliability problems in complex, long-horizon tasks.

Limitations of Conventional GUI-Only Agents

Traditional CUAs rely exclusively on pixel-based interactions, mimicking human users by clicking, typing, and navigating graphical interfaces. Although this approach mirrors typical user workflows, it is fragile and inefficient for multi-step tasks that involve dense user interfaces, multiple applications, or complex operating system operations. A single mis-click can derail an entire workflow, and action sequences grow excessively long as task complexity increases.

The Hybrid Architecture of CoAct-1

CoAct-1 overcomes these issues by integrating three specialized agents:

  • Orchestrator: A high-level planner that breaks down complex tasks and dynamically assigns subtasks to the Programmer or GUI Operator depending on the nature of the task.
  • Programmer: Executes backend operations like file management, data processing, and environment configuration directly through Python or Bash scripts, bypassing the slow and error-prone GUI sequences.
  • GUI Operator: Employs a vision-language model to interact with graphical interfaces when human-like navigation is essential.
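
The division of labor described above can be sketched as a simple dispatch loop. This is a minimal illustration, not CoAct-1's actual implementation: the names Subtask, programmer, gui_operator, and orchestrate are hypothetical, the subtask labels are assigned by hand rather than by a planning model, and the two agents are stand-in functions that only return a description of what they would do.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Subtask:
    description: str
    kind: str  # hypothetical label: "code" or "gui"


def programmer(subtask: Subtask) -> str:
    # Stand-in for an agent that would write and run a Python or Bash script.
    return f"[programmer] ran a script for: {subtask.description}"


def gui_operator(subtask: Subtask) -> str:
    # Stand-in for an agent that would drive the GUI with a vision-language model.
    return f"[gui_operator] navigated the GUI for: {subtask.description}"


# Routing table from subtask kind to the agent that handles it.
HANDLERS: Dict[str, Callable[[Subtask], str]] = {
    "code": programmer,
    "gui": gui_operator,
}


def orchestrate(subtasks: List[Subtask]) -> List[str]:
    """Send each subtask to the agent registered for its kind."""
    results = []
    for subtask in subtasks:
        if subtask.kind not in HANDLERS:
            raise ValueError(f"unknown subtask kind: {subtask.kind!r}")
        results.append(HANDLERS[subtask.kind](subtask))
    return results


# A hand-written plan; a real orchestrator would produce this itself.
plan = [
    Subtask("rename all report files in a folder", "code"),
    Subtask("confirm a dialog in the image editor", "gui"),
    Subtask("convert a CSV export to JSON", "code"),
]
results = orchestrate(plan)
for line in results:
    print(line)
```

In this sketch the routing decision is a dictionary lookup. In the system described here, that decision is the Orchestrator's job and depends on the nature of each subtask.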

This hybrid design allows CoAct-1 to replace fragile and lengthy mouse-keyboard activities with concise, dependable code executions, while still utilizing GUI interactions when necessary.
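
To make the contrast concrete, here is the kind of short script that could stand in for a long sequence of file-manager clicks. It is an illustrative example only and is not taken from CoAct-1: the function add_prefix, the "archived_" prefix, and the demo file names are all hypothetical, and the demo runs in a temporary directory so it touches no real files.

```python
import tempfile
from pathlib import Path
from typing import List


def add_prefix(directory: Path, prefix: str, pattern: str = "*.txt") -> List[str]:
    """Rename every file matching `pattern` in `directory` by adding `prefix`."""
    renamed = []
    # sorted() builds the full list first, so renaming does not disturb iteration.
    for path in sorted(directory.glob(pattern)):
        target = path.with_name(prefix + path.name)
        path.rename(target)
        renamed.append(target.name)
    return renamed


# Demo in a throwaway directory with three hypothetical files.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ("a.txt", "b.txt", "notes.md"):
        (folder / name).write_text("demo")

    renamed = add_prefix(folder, "archived_")
    remaining = sorted(p.name for p in folder.iterdir())

print(renamed)
print(remaining)
```

Done through a GUI, the same job would take a select, rename, type, and confirm cycle for each file, and every one of those actions is a chance to mis-click. The script handles any number of matching files in one step.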

Benchmark Results on OSWorld

OSWorld, a comprehensive benchmark with 369 tasks across office productivity, IDEs, browsers, file managers, and multi-application workflows, serves as a rigorous testbed for such systems. Each task is a real-world goal stated in natural language and is evaluated with a detailed rule-based scoring system.

  • Overall Success Rate: CoAct-1 achieved a state-of-the-art success rate of 60.76% in the 100+ step category, making it the first CUA to surpass 60% and outperforming GTA-1 (53.10%), OpenAI CUA 4o (31.40%), and UI-TARS-1.5 (29.60%).
  • Step Efficiency: CoAct-1 completes successful tasks in an average of 10.15 steps, fewer than GTA-1 (15.22) and UI-TARS (14.90). OpenAI CUA 4o averages fewer steps (6.14) but has a much lower success rate.
  • Task Type Performance: CoAct-1 performs especially well on workflows that benefit from code execution, including multi-app workflows (47.88% success), OS tasks (75.00%), and VLC tasks (66.07%). In productivity and IDE tasks, it matches or surpasses the strongest competing agents.
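
The gaps between these figures are easier to compare once they are computed explicitly. The short calculation below uses only the numbers quoted in the list above. The helper name relative_reduction is an illustrative choice, and the system labels are copied as they appear in the text.

```python
# Average steps on successful tasks, as quoted above.
avg_steps = {"CoAct-1": 10.15, "GTA-1": 15.22, "UI-TARS": 14.90, "OpenAI CUA 4o": 6.14}

# Overall success rates in percent, as quoted above.
success = {"CoAct-1": 60.76, "GTA-1": 53.10, "OpenAI CUA 4o": 31.40, "UI-TARS-1.5": 29.60}


def relative_reduction(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100


steps_vs_gta = relative_reduction(avg_steps["GTA-1"], avg_steps["CoAct-1"])
steps_vs_uitars = relative_reduction(avg_steps["UI-TARS"], avg_steps["CoAct-1"])
margin_vs_gta = success["CoAct-1"] - success["GTA-1"]

print(f"Fewer steps than GTA-1:   {steps_vs_gta:.1f}%")
print(f"Fewer steps than UI-TARS: {steps_vs_uitars:.1f}%")
print(f"Success margin over GTA-1: {margin_vs_gta:.2f} percentage points")
```

By these figures, CoAct-1 needs roughly a third fewer steps than GTA-1 on successful tasks while finishing more than seven percentage points ahead on success rate.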

Key Factors Behind CoAct-1's Success

  • Coding Replaces Redundant GUI Actions: Tasks like batch image resizing or advanced file manipulations are handled by single scripts, reducing error-prone clicks and overall steps.
  • Dynamic Task Delegation: The Orchestrator decides, subtask by subtask, whether coding or GUI interaction is the better fit.
  • Strong Backbone Models: Using OpenAI CUA 4o for GUI operations, OpenAI o3 for orchestration, and o4-mini for programming yields the highest performance.
  • Efficiency Drives Reliability: Fewer steps mean fewer opportunities for error, and shorter trajectories correlate strongly with successful task completion.
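
One way to see why step count matters so much is a toy model in which every step succeeds independently with the same probability, so a task of n steps succeeds with probability p^n. This model is an assumption made here for illustration and is not taken from the CoAct-1 work. The per-step probabilities and the step counts of 10 and 15 are hypothetical values chosen to be close to the averages quoted above.

```python
def chain_success(per_step_success: float, steps: int) -> float:
    """Probability that all `steps` succeed, assuming independent, equally reliable steps."""
    return per_step_success ** steps


# Hypothetical per-step reliabilities and trajectory lengths.
for p in (0.95, 0.98):
    short_run = chain_success(p, 10)
    long_run = chain_success(p, 15)
    print(f"per-step success {p:.2f}: 10 steps -> {short_run:.3f}, 15 steps -> {long_run:.3f}")
```

Under this simplification, any per-step reliability below 1 gives the shorter trajectory a higher chance of finishing, and the gap widens as individual steps become less reliable. Real agent errors are neither independent nor uniform, so this is only a rough intuition for the correlation described above.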

CoAct-1 marks a significant advance in autonomous computer agents: by combining coding and GUI actions, it sets a new benchmark for reliable and efficient computer automation.

