ComputerRL: Zhipu AI’s Hybrid API-GUI Framework for Autonomous Desktop Agents
Zhipu AI's ComputerRL combines programmatic APIs with GUI actions and a scalable RL infrastructure to build more capable desktop agents. Experimental results show strong gains on the OSWorld benchmark, driven by the API-GUI paradigm and the Entropulse training method.
A new direction for desktop agents
Zhipu AI unveiled ComputerRL, a framework that fuses programmatic API control with direct GUI interactions to enable agents that can effectively operate in complex desktop environments. The aim is to bridge the gap between human-oriented graphical interfaces and machine-friendly programmatic control, letting agents choose the most efficient way to complete tasks.
The API-GUI paradigm
ComputerRL introduces a hybrid API-GUI paradigm. Rather than relying solely on slow, brittle GUI simulations of clicks and scrolls, agents can call programmatic APIs where available and fall back to GUI actions when needed. This hybrid approach preserves the precision of API calls for tasks that benefit from direct programmatic control while maintaining the adaptability of GUI interactions for broader compatibility.
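The dispatch idea can be sketched in a few lines: prefer a registered programmatic API for an operation, and fall back to a GUI primitive when no API exists. This is an illustrative sketch, not the paper's implementation; the names (`HybridDispatcher`, `register_api`, the `Action` record) are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Action:
    kind: str    # "api" for a programmatic call, "gui" for a click/scroll fallback
    detail: str  # result of the API call, or a description of the GUI action


class HybridDispatcher:
    """Route each operation to an API if one is registered, else to the GUI."""

    def __init__(self) -> None:
        self._apis: Dict[str, Callable[..., str]] = {}

    def register_api(self, op: str, fn: Callable[..., str]) -> None:
        self._apis[op] = fn

    def execute(self, op: str, **kwargs) -> Action:
        fn = self._apis.get(op)
        if fn is not None:
            return Action("api", fn(**kwargs))
        # No API available: fall back to a GUI action description.
        return Action("gui", f"gui:{op}:{kwargs}")


dispatcher = HybridDispatcher()
dispatcher.register_api("open_document", lambda path: f"opened {path}")

print(dispatcher.execute("open_document", path="report.odt").kind)  # api
print(dispatcher.execute("resize_image", width=800).kind)           # gui
```

The point of the hybrid design is visible in the last two lines: the same `execute` call is precise where an API exists and still covers every other operation through the GUI path.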
Automated API construction with LLMs
The framework automates much of the API-creation process using large language models. Given example tasks, the system analyzes requirements, implements APIs using appropriate Python libraries, and generates test cases. The result is a set of reusable, general-purpose APIs for desktop applications. Zhipu demonstrates this with integrations for Ubuntu apps such as GIMP and LibreOffice, enabling operations like image processing or document formatting with fewer steps than GUI-only approaches.
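The generate-then-verify loop described above can be sketched as follows. Here `llm_generate` is a stand-in for a real LLM call that would return implementation and test source for a requested API; it returns fixed strings so the validation step runs end to end. All names are illustrative assumptions, not the framework's actual interfaces.

```python
from typing import Tuple


def llm_generate(task: str) -> Tuple[str, str]:
    """Stand-in for an LLM call: return (implementation, test case) as source."""
    impl = (
        "def set_cell(sheet, row, col, value):\n"
        "    sheet[(row, col)] = value\n"
        "    return sheet\n"
    )
    test = (
        "sheet = {}\n"
        "set_cell(sheet, 0, 0, 'Total')\n"
        "assert sheet[(0, 0)] == 'Total'\n"
    )
    return impl, test


def build_api(task: str) -> bool:
    """Generate a candidate API for the task and accept it only if its test passes."""
    impl, test = llm_generate(task)
    namespace: dict = {}
    exec(impl, namespace)          # install the candidate implementation
    try:
        exec(test, namespace)      # run the generated test case against it
    except AssertionError:
        return False               # would trigger regeneration in a real pipeline
    return True


print(build_api("write a value into a spreadsheet cell"))  # True
```

A rejected candidate would loop back to regeneration; only APIs whose generated tests pass join the reusable library.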
Scalable RL infrastructure
Training desktop agents at scale requires efficient virtual environments. ComputerRL addresses this with a distributed RL infrastructure built on Docker and gRPC, capable of running thousands of parallel Ubuntu VMs. Key technical components include lightweight qemu-in-docker VM deployment, multi-node clustering, and a web-based monitoring interface. The infrastructure supports fully asynchronous training through the AgentRL framework, decoupling data collection from parameter updates to increase throughput and mitigate resource bottlenecks.
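The decoupling of collection from updates can be illustrated with a minimal producer-consumer sketch: worker threads stand in for parallel VM environments pushing rollouts onto a shared queue, which a learner drains asynchronously. The real system uses Docker containers and gRPC rather than threads; this is only a toy model of the data flow.

```python
import queue
import threading

# Shared buffer between environment workers (producers) and the learner (consumer).
rollouts: "queue.Queue[tuple]" = queue.Queue()


def env_worker(env_id: int, n_steps: int) -> None:
    """Stand-in for one VM environment generating rollout records."""
    for step in range(n_steps):
        rollouts.put((env_id, step, 1.0))  # (environment id, step, reward)


# Launch 4 "environments" in parallel, 3 steps each.
workers = [threading.Thread(target=env_worker, args=(i, 3)) for i in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

# The learner consumes whatever has accumulated, independent of any single env.
collected = []
while not rollouts.empty():
    collected.append(rollouts.get())
print(len(collected))  # 12
```

Because the learner reads from the buffer rather than waiting on any one environment, a slow VM delays only its own rollouts, not the whole update step.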
Entropulse: alternating training to preserve exploration
To prevent entropy collapse, where agents lose the ability to explore during extended training, ComputerRL uses Entropulse. This method alternates reinforcement learning phases with supervised fine-tuning on successful rollout trajectories. The pipeline begins with behavior cloning from diverse LLM-generated trajectories, proceeds with step-level Group Relative Policy Optimization using rule-based rewards, and interleaves supervised fine-tuning on curated high-quality rollouts to restore exploration and sustain learning progress.
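The alternating schedule can be sketched as a loop that interleaves RL phases with SFT on the successful trajectories each phase produced. The phase bodies here are placeholders (the rollout is a random stub, the updates are counters), so this shows the schedule, not GRPO itself; all function names are assumptions.

```python
import random

random.seed(0)


def rollout_stub(policy: dict):
    """Placeholder for an environment rollout; reward is rule-based (0 or 1)."""
    reward = 1.0 if random.random() < 0.4 else 0.0
    return ([f"step@{policy['updates']}"], reward)


def rl_phase(policy: dict, steps: int) -> list:
    """RL phase: update the policy and keep the successful trajectories."""
    successes = []
    for _ in range(steps):
        traj, reward = rollout_stub(policy)
        if reward > 0:
            successes.append(traj)
        policy["updates"] += 1      # stand-in for a GRPO policy update
    return successes


def sft_phase(policy: dict, trajectories: list) -> None:
    """SFT phase: fine-tune on curated successful rollouts to restore exploration."""
    policy["sft_rounds"] += 1       # stand-in for supervised fine-tuning


policy = {"updates": 0, "sft_rounds": 0}
for _ in range(3):                  # alternate: RL phase, then SFT on its successes
    good = rl_phase(policy, steps=10)
    sft_phase(policy, good)

print(policy["updates"], policy["sft_rounds"])  # 30 3
```

The structural point is that the SFT data comes from the immediately preceding RL phase's own successes, so each fine-tuning round re-anchors the policy before exploration resumes.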
Experimental results on OSWorld
Zhipu applied ComputerRL to open models like GLM-4-9B-0414 and Qwen2.5-14B, producing AutoGLM-OS variants. On the OSWorld benchmark, AutoGLM-OS-9B reached a 48.1% success rate, outperforming proprietary baselines such as OpenAI’s CUA o3 (42.9%) and Claude 4.0 (30.7%). The API-GUI approach produced a 134% improvement over GUI-only baselines in many office and professional scenarios. Ablation studies show behavior cloning provides a strong initialization, while Entropulse-enabled RL phases contribute substantial additional gains.
Practical cases and remaining challenges
Case studies include tasks like building sales summary tables in LibreOffice Calc and generating system reports through Terminal commands. Error analysis points to visual perception failures and cross-application coordination as dominant failure modes, indicating areas for further research such as improved multimodal perception and hierarchical planning.
Path forward for desktop autonomy
ComputerRL charts a practical route toward more capable desktop agents by combining scalable RL infrastructure with a pragmatic interaction paradigm. Future work will likely focus on richer training diversity, multimodal sensing, hierarchical task planning, and safety mechanisms such as permission controls and action validation to prepare agents for real-world deployment.
For full technical details see the paper on arXiv and the project GitHub for tutorials, code, and notebooks.