CoAct-1: Revolutionizing Autonomous Computing with Hybrid GUI and Code Execution

Introducing CoAct-1: A New Era in Autonomous Computer Agents

Researchers from USC, Salesforce AI, and the University of Washington have developed CoAct-1, a multi-agent computer-using agent (CUA) system that treats coding as a first-class action alongside traditional GUI manipulation. The design targets long-standing efficiency and reliability problems that arise when agents perform complex, long-horizon tasks.

Limitations of Conventional GUI-Only Agents

Traditional CUAs rely exclusively on pixel-based interactions, mimicking human users by clicking, typing, and navigating graphical interfaces. Although this approach mirrors typical user workflows, it is fragile and inefficient for multi-step tasks that involve dense user interfaces, multiple applications, or complex operating system operations. A single mis-click can derail an entire workflow, and action sequences grow excessively long as task complexity increases.

The Hybrid Architecture of CoAct-1

CoAct-1 overcomes these issues by integrating three specialized agents:

Orchestrator: A high-level planner that breaks down complex tasks and dynamically assigns subtasks to the Programmer or GUI Operator depending on the nature of the task.
Programmer: Executes backend operations such as file management, data processing, and environment configuration directly through Python or Bash scripts, bypassing slow, error-prone GUI sequences.
GUI Operator: Employs a vision-language model to interact with graphical interfaces when human-like navigation is essential.

This hybrid design allows CoAct-1 to replace fragile, lengthy mouse-and-keyboard sequences with concise, dependable code execution, while still using GUI interaction when necessary.
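The division of labor described above can be sketched as a simple dispatch loop. This is only an illustrative sketch, not the authors' implementation: the `Subtask` class, the `kind` labels, and the `programmer`, `gui_operator`, and `orchestrate` functions are hypothetical names, and the real agents are backed by language and vision-language models rather than the stub functions used here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Subtask:
    """A unit of work produced by the planner (hypothetical structure)."""
    description: str
    kind: str  # "code" for scriptable work, "gui" for visual interaction


def programmer(subtask: Subtask) -> str:
    # Stand-in for an agent that would write and run a Python or Bash script.
    return f"[Programmer] ran script for: {subtask.description}"


def gui_operator(subtask: Subtask) -> str:
    # Stand-in for an agent that would click and type in the interface.
    return f"[GUI Operator] interacted with UI for: {subtask.description}"


# Routing table: which agent handles which kind of subtask.
HANDLERS: Dict[str, Callable[[Subtask], str]] = {
    "code": programmer,
    "gui": gui_operator,
}


def orchestrate(subtasks: List[Subtask]) -> List[str]:
    """Send each subtask to the matching agent and collect the results."""
    results = []
    for subtask in subtasks:
        handler = HANDLERS.get(subtask.kind)
        if handler is None:
            raise ValueError(f"unknown subtask kind: {subtask.kind!r}")
        results.append(handler(subtask))
    return results


if __name__ == "__main__":
    plan = [
        Subtask("resize every image in a folder", "code"),
        Subtask("confirm the export dialog", "gui"),
    ]
    for line in orchestrate(plan):
        print(line)
```

In this sketch the routing decision is already encoded in each subtask's `kind`; in the system described above, making that decision is the Orchestrator's job.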

Benchmark Results on OSWorld

OSWorld, a comprehensive benchmark with 369 tasks across office productivity, IDEs, browsers, file managers, and multi-application workflows, serves as a rigorous testbed for such systems. Tasks are specified as real-world natural-language goals and are evaluated with a detailed rule-based scoring system.

Overall Success Rate: CoAct-1 achieved a state-of-the-art success rate of 60.76% in the 100+ step category, making it the first CUA to surpass 60% and outperforming GTA-1 (53.10%), OpenAI CUA 4o (31.40%), and UI-TARS-1.5 (29.60%).
Step Efficiency: It completes successful tasks in an average of 10.15 steps, fewer than GTA-1 (15.22) and UI-TARS (14.90). OpenAI CUA 4o uses fewer steps still (6.14), but at a much lower success rate.
Task Type Performance: CoAct-1 excels especially in workflows that benefit from code execution, including multi-app workflows (47.88% success), OS tasks (75.00%), and VLC tasks (66.07%). In productivity and IDE tasks, it matches or surpasses the best competing results.
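One rough way to see why the step counts above matter is a toy model in which every step succeeds independently with the same probability, so a task of n steps succeeds with probability p to the power n. This model and the per-step probability of 0.97 are assumptions made purely for illustration; they are not taken from the CoAct-1 evaluation, and the reported step averages cover successful tasks only, so the output should not be read as a prediction of benchmark scores.

```python
def chain_success(per_step_success: float, steps: float) -> float:
    """Probability that all steps succeed, assuming independent steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    return per_step_success ** steps


if __name__ == "__main__":
    P_STEP = 0.97  # assumed per-step reliability, chosen only for illustration

    # Average step counts as reported in the results above.
    average_steps = {"CoAct-1": 10.15, "UI-TARS": 14.90, "GTA-1": 15.22}

    for agent, steps in average_steps.items():
        estimate = chain_success(P_STEP, steps)
        print(f"{agent}: {steps} steps, toy success estimate {estimate:.3f}")
```

Under these assumptions, a shorter trajectory always yields a higher estimate, because each additional step multiplies in another chance of failure.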

Key Factors Behind CoAct-1's Success

Coding Replaces Redundant GUI Actions: Tasks like batch image resizing or advanced file manipulations are handled by single scripts, reducing error-prone clicks and overall steps.
Dynamic Task Delegation: The Orchestrator routes each subtask to whichever agent, Programmer or GUI Operator, is better suited to it.
Strong Backbone Models: Using OpenAI CUA 4o for GUI operations, OpenAI o3 for orchestration, and o4-mini for programming yields the highest performance.
Efficiency Drives Reliability: Fewer steps mean fewer chances for errors, and shorter trajectories correlate strongly with successful task completion.
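To make the first factor concrete, the sketch below shows the kind of file manipulation that takes one short script but would need a separate rename interaction per file in a file manager. It is a minimal illustration using only the Python standard library, not code from CoAct-1; the `batch_rename` function and the example file names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List


def batch_rename(directory: Path, old_suffix: str, new_suffix: str) -> List[str]:
    """Change the extension of every matching file in `directory`.

    Returns the new file names in sorted order.
    """
    renamed = []
    # Materialize the listing first so renaming does not disturb iteration.
    for path in sorted(directory.iterdir()):
        if path.is_file() and path.suffix == old_suffix:
            target = path.with_suffix(new_suffix)
            path.rename(target)
            renamed.append(target.name)
    return renamed


if __name__ == "__main__":
    # Demonstrate on a throwaway directory so no real files are touched.
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        for i in range(5):
            (folder / f"note_{i}.txt").write_text("draft")
        print(batch_rename(folder, ".txt", ".md"))
```

The script does the same amount of agent-level work whether the folder holds five files or five hundred, whereas a GUI-only approach grows by several actions per file.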

CoAct-1 marks a significant advance in autonomous computer agents by combining coding and GUI actions, setting a new benchmark for reliable and efficient computer automation.