Alibaba Unveils GUI-Owl and Mobile-Agent-v3: AI Agents That Automate Any Interface

Why GUI agents are the next frontier

Graphical user interfaces remain the dominant way people interact with apps and operating systems across mobile, desktop, and web. Traditional automation approaches rely on brittle scripts or hand-engineered rules that fail on UI changes and cross-platform variations. Recent advances in vision-language models enable agents that can actually perceive a screen, reason about tasks, plan steps, and execute actions end to end.

GUI-Owl: a unified multimodal policy

GUI-Owl is an end-to-end multimodal agent initialized from Qwen2.5-VL and extensively post-trained on diverse GUI interaction data. Instead of splitting perception, planning, and execution into separate modules, GUI-Owl integrates grounding, reasoning, planning, and action in a single policy network. That design allows explicit multi-turn reasoning and seamless decision-making across ambiguous and dynamic interfaces.

Key capabilities include:

- Grounding natural-language references to on-screen elements across mobile, desktop, and web interfaces
- Explicit multi-turn reasoning before each action
- Task planning and step-by-step decision-making on ambiguous, dynamic interfaces
- Direct action execution from the same policy, with no separate perception, planning, or execution modules
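
To make the single-policy design concrete, here is a minimal sketch of what one step could look like, assuming a hypothetical gui_owl_policy wrapper and illustrative field names rather than the released interface:

```python
# Hypothetical interface: one model call covers grounding, reasoning,
# planning, and action selection. Names and fields are illustrative only.
from dataclasses import dataclass

@dataclass
class AgentStep:
    reasoning: str            # explicit reasoning trace for this step
    action: str               # e.g. "tap", "swipe", "type"
    target: tuple[int, int]   # grounded screen coordinates
    argument: str | None      # text to type, app to open, etc.

def gui_owl_policy(screenshot_png: bytes, instruction: str,
                   history: list[str]) -> AgentStep:
    """Placeholder for a single forward pass of the unified policy:
    perceive the screenshot, reason over the instruction and history,
    and emit one grounded action."""
    raise NotImplementedError  # stands in for actual model inference

# One step of the agent: everything comes back from a single call.
# step = gui_owl_policy(screen_bytes, "Turn on dark mode", history=[])
# print(step.reasoning, step.action, step.target)
```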

Mobile-Agent-v3: modular multi-agent coordination

Mobile-Agent-v3 builds on GUI-Owl as its core module and coordinates specialized agents to tackle long-horizon, cross-application workflows. The framework breaks tasks into subgoals, updates plans dynamically, and preserves contextual memory. The four coordinated roles are:

- Manager: decomposes the task into subgoals and revises the plan as execution progresses
- Worker: carries out the current subgoal by interacting with the interface
- Reflector: checks each outcome, flags errors, and triggers recovery
- Notetaker: records persistent contextual information across steps and applications

This multi-agent orchestration improves robustness on error-prone, multi-step tasks by enabling reflection, recovery, and persistent memory.
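
The coordination pattern can be sketched as the loop below. The four role functions are trivial hypothetical stand-ins, not the released framework code; the point is the control flow of planning, acting, reflecting, and note-taking:

```python
# Rough sketch of Mobile-Agent-v3-style coordination. The four role
# functions are trivial stubs so the control flow stays in focus.

def manager_plan(task, notes=None):          # Manager: decompose into subgoals
    return [f"{task} - step 1", f"{task} - step 2"]

def worker_execute(subgoal, notes):          # Worker: act on the GUI (stubbed)
    return f"executed: {subgoal}"

def reflector_review(subgoal, outcome):      # Reflector: judge the outcome (stubbed)
    return "success"                         # could also return "retry" or "replan"

def notetaker_record(outcome):               # Notetaker: persist context (stubbed)
    return outcome

def run_task(task: str, max_steps: int = 20) -> bool:
    plan = manager_plan(task)
    notes = []
    steps = 0
    while plan and steps < max_steps:
        steps += 1
        subgoal = plan[0]
        outcome = worker_execute(subgoal, notes)
        notes.append(notetaker_record(outcome))
        verdict = reflector_review(subgoal, outcome)
        if verdict == "success":
            plan.pop(0)                      # subgoal done, move to the next one
        elif verdict == "replan":
            plan = manager_plan(task, notes) # Manager revises the remaining plan
        # a "retry" verdict simply re-attempts the same subgoal next iteration
    return not plan                          # True if every subgoal finished

print(run_task("Book a table and add it to the calendar"))
```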

Training pipeline and data generation at scale

A major challenge for GUI agents is scalable, high-quality training data. The team created a self-evolving data pipeline that automates dataset creation and curation:

- Task queries are generated automatically and rolled out by the current model in virtual mobile and desktop environments
- The resulting trajectories are judged for correctness by automated critics rather than human annotators
- Verified trajectories flow back into training, so each improved model produces better data for the next round

They also synthesize grounding tasks from accessibility trees and screenshots, distill planning knowledge from historical trajectories and large language models, and generate action-effect data from before-and-after screenshots.
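
The self-evolving loop is easiest to picture as pseudocode. In this sketch the rollout, critic, and training helpers are hypothetical placeholders for the paper's environment infrastructure and correctness judges:

```python
# Sketch of a self-evolving data loop: roll out trajectories with the
# current model, keep only those an automated critic accepts, and retrain
# on the growing dataset. All helpers are hypothetical placeholders.
import random

def generate_queries(n):                     # propose new task instructions
    return [f"synthetic task #{i}" for i in range(n)]

def rollout(query):                          # run the policy in a virtual device (stubbed)
    return {"query": query, "steps": ["tap", "type", "tap"]}

def critic_accepts(trajectory):              # automated correctness judgment (stubbed)
    return random.random() > 0.5

def train_on(dataset):                       # fine-tune the policy (stubbed)
    print(f"training on {len(dataset)} verified trajectories")

dataset = []
for training_round in range(3):              # each round: better model -> better data
    for query in generate_queries(10):
        trajectory = rollout(query)
        if critic_accepts(trajectory):       # only verified trajectories are kept
            dataset.append(trajectory)
    train_on(dataset)
```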

Reinforcement learning advances

GUI-Owl is refined with a scalable RL framework that supports fully asynchronous training and introduces Trajectory-aware Relative Policy Optimization (TRPO). TRPO assigns credit across long and variable-length action sequences, addressing the sparse-reward nature of many GUI tasks where success is only verifiable after a sequence completes.
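
One common way to assign trajectory-level credit in group-relative RL is sketched below; it is an illustrative approximation in the spirit of TRPO as described above, not the paper's exact formulation:

```python
# Illustrative sketch of trajectory-level credit assignment: each rollout
# of the same task earns one sparse reward, advantages are normalized
# across the group, and every step in a trajectory shares that
# trajectory's advantage. In the spirit of TRPO, not its exact form.
import numpy as np

def trajectory_advantages(rewards, lengths):
    """rewards: one scalar per rollout (e.g. 1.0 success, 0.0 failure).
    lengths: number of actions in each rollout (lengths vary).
    Returns one per-step advantage array per rollout."""
    r = np.asarray(rewards, dtype=np.float64)
    baseline = r.mean()                      # group-relative baseline
    scale = r.std() + 1e-6                   # guard against zero variance
    traj_adv = (r - baseline) / scale        # one advantage per trajectory
    # Broadcast each trajectory's advantage to all of its steps.
    return [np.full(n, a) for n, a in zip(lengths, traj_adv)]

# Four rollouts of one task: two succeed, two fail, with different lengths.
advantages = trajectory_advantages(rewards=[1.0, 0.0, 1.0, 0.0],
                                   lengths=[7, 12, 5, 9])
for i, adv in enumerate(advantages):
    print(f"rollout {i}: {len(adv)} steps, per-step advantage {adv[0]:+.2f}")
```

Because every step inherits its trajectory's normalized advantage, a sparse end-of-task reward still produces a learning signal for each action in a long, variable-length sequence.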

Benchmarks and performance highlights

GUI-Owl and Mobile-Agent-v3 were evaluated across grounding, single-step decision-making, question answering, and end-to-end task completion benchmarks.

Grounding and UI understanding: GUI-Owl posts leading open-source results on GUI grounding benchmarks such as ScreenSpot-v2 and ScreenSpot-Pro, locating elements from natural-language references.

Single-step decision-making and reasoning: the model's explicit reasoning traces carry over to strong step-level action prediction and GUI question answering.

End-to-end multi-step tasks: GUI-Owl is competitive on its own, and Mobile-Agent-v3 reaches 73.3 on AndroidWorld and 37.7 on OSWorld, state-of-the-art results among open-source GUI agent frameworks.

Real-world integration: the same models drive live Android devices and desktop sessions through low-level command interfaces, and can plug into larger multi-agent systems as grounding, planning, or execution modules.

These results show both broad grounding capability and strong long-horizon task performance when combined with multi-agent orchestration.

Deployment and action space

GUI-Owl supports platform-specific action spaces. On mobile: taps, long presses, swipes, text entry, system buttons, and app launches. On desktop: mouse moves, clicks, drags, scrolls, keyboard input, and app-specific commands. Actions are mapped to low-level device interfaces, such as ADB on Android and pyautogui on desktop, enabling real deployments.
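
As a concrete illustration, a dispatcher from abstract actions to device commands might look like the following sketch. The action dictionary format is hypothetical, while the adb shell input and pyautogui calls are the standard low-level interfaces mentioned above:

```python
# Sketch of mapping abstract agent actions to device commands. The action
# dictionary format is hypothetical; the adb and pyautogui calls are the
# usual low-level interfaces mentioned above.
import subprocess

def execute_android(action: dict) -> None:
    """Dispatch an abstract action to an Android device via adb."""
    if action["type"] == "tap":
        x, y = action["coord"]
        subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)
    elif action["type"] == "swipe":
        x1, y1, x2, y2 = action["from"] + action["to"]
        subprocess.run(["adb", "shell", "input", "swipe",
                        str(x1), str(y1), str(x2), str(y2), "300"], check=True)
    elif action["type"] == "text":
        subprocess.run(["adb", "shell", "input", "text", action["text"]], check=True)

def execute_desktop(action: dict) -> None:
    """Dispatch an abstract action to a desktop session via pyautogui."""
    import pyautogui  # imported lazily; requires a display
    if action["type"] == "click":
        x, y = action["coord"]
        pyautogui.click(x, y)
    elif action["type"] == "type":
        pyautogui.write(action["text"], interval=0.02)
    elif action["type"] == "scroll":
        pyautogui.scroll(action["amount"])

# Example: tap the point the model grounded for "Settings".
# execute_android({"type": "tap", "coord": (540, 1180)})
```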

The agent follows an explicit reasoning loop: observe screen, recall compressed history, reason about the next action, summarize intent, and execute. This transparency helps debugging and integration with larger multi-agent systems where specialized roles can collaborate.
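
Put together, the loop might look like this rough sketch, where capture_screen, policy_step, and execute are hypothetical stand-ins for screenshot capture, the unified model call, and the device dispatcher:

```python
# Rough sketch of the observe -> recall -> reason -> summarize -> execute
# loop. All helpers are hypothetical placeholders; only the structure
# mirrors the reasoning loop described above.

def capture_screen() -> bytes:               # observe: grab the current screenshot
    raise NotImplementedError

def policy_step(screen: bytes, goal: str, history: list[str]) -> dict:
    """One unified-policy call: returns reasoning, an intent summary,
    and the next grounded action."""
    raise NotImplementedError

def execute(action: dict) -> None:           # act on the device (adb / pyautogui)
    raise NotImplementedError

def run(goal: str, max_steps: int = 30) -> None:
    history: list[str] = []                  # compressed record of past steps
    for _ in range(max_steps):
        screen = capture_screen()            # observe
        step = policy_step(screen, goal, history)  # recall + reason
        history.append(step["summary"])      # summarize intent for later steps
        if step["action"]["type"] == "done":
            break
        execute(step["action"])              # execute
```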

Implications for automation and research

By unifying perception, grounding, reasoning, and action, and by building a self-improving training pipeline, GUI-Owl and Mobile-Agent-v3 push the field toward general-purpose, autonomous GUI agents. As open-source models, they surpass many proprietary baselines on key benchmarks, suggesting practical value for automation, testing, and assistive interfaces.

For more details, see the paper at https://arxiv.org/abs/2508.15144 and check the project’s GitHub for code, tutorials, and notebooks.