Alibaba Unveils GUI-Owl and Mobile-Agent-v3: AI Agents That Automate Any Interface
Why GUI agents are the next frontier
Graphical user interfaces remain the dominant way people interact with apps and operating systems across mobile, desktop, and web. Traditional automation relies on brittle scripts or hand-engineered rules that break when UIs change or differ across platforms. Recent advances in vision-language models enable agents that can actually perceive a screen, reason about tasks, plan steps, and execute actions end to end.
GUI-Owl: a unified multimodal policy
GUI-Owl is an end-to-end multimodal agent initialized from Qwen2.5-VL and extensively post-trained on diverse GUI interaction data. Instead of splitting perception, planning, and execution into separate modules, GUI-Owl integrates grounding, reasoning, planning, and action in a single policy network. That design allows explicit multi-turn reasoning and seamless decision-making across ambiguous and dynamic interfaces.
Key capabilities include:
- Grounding UI elements from natural language queries (see the interface sketch after this list)
- Task planning that decomposes complex instructions into steps
- Action semantics understanding to predict GUI state changes
- Fine-tuning with a mix of supervised learning and reinforcement learning aimed at task success
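To make the grounding capability concrete, here is a minimal Python sketch of what a caller-side interface might look like: a screenshot plus a natural-language query goes in, a click point comes out. The prompt wording, the `GroundingResult` type, and the stubbed `call_gui_owl` function are illustrative assumptions, not the released API.

```python
from dataclasses import dataclass

@dataclass
class GroundingResult:
    """A UI element located from a natural-language query (hypothetical schema)."""
    query: str
    x: int            # click point, pixels from the left edge
    y: int            # click point, pixels from the top edge

def call_gui_owl(screenshot_png: bytes, prompt: str) -> str:
    """Placeholder for an actual GUI-Owl inference call (e.g. a server hosting
    the Qwen2.5-VL-based checkpoint). Wire this to your own model endpoint."""
    raise NotImplementedError

def ground(screenshot_png: bytes, query: str) -> GroundingResult:
    """Ask the model to localize a UI element described in natural language."""
    prompt = (
        "Locate the UI element matching this description and reply with "
        f"its click coordinates as 'x,y': {query}"
    )
    reply = call_gui_owl(screenshot_png, prompt)   # e.g. "512,1340"
    x, y = (int(v) for v in reply.strip().split(","))
    return GroundingResult(query=query, x=x, y=y)
```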
Mobile-Agent-v3: modular multi-agent coordination
Mobile-Agent-v3 builds on GUI-Owl as a core module and coordinates specialized agents to tackle long-horizon, cross-application workflows. The framework breaks tasks into subgoals, updates plans dynamically, and preserves contextual memory. The four coordinated roles are:
- Manager: decomposes high-level instructions and updates plans
- Worker: executes the most relevant actionable subgoal in the current UI state
- Reflector: evaluates outcomes and generates diagnostic feedback for replanning
- Notetaker: persists important context like codes or credentials across steps
This multi-agent orchestration improves robustness on error-prone, multi-step tasks by enabling reflection, recovery, and persistent memory.
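The coordination pattern can be pictured as a simple loop. The class names and message shapes below are assumptions chosen for illustration, not Mobile-Agent-v3's actual interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class TaskState:
    instruction: str
    plan: list[str] = field(default_factory=list)         # remaining subgoals
    notes: dict[str, str] = field(default_factory=dict)   # Notetaker memory
    done: bool = False

def run_task(instruction, manager, worker, reflector, notetaker, env, max_steps=50):
    """Illustrative Manager/Worker/Reflector/Notetaker loop (hypothetical API)."""
    state = TaskState(instruction=instruction, plan=manager.decompose(instruction))
    for _ in range(max_steps):
        if state.done or not state.plan:
            break
        observation = env.screenshot()                       # current UI state
        subgoal = state.plan[0]
        action, outcome = worker.act(observation, subgoal, state.notes)
        feedback = reflector.evaluate(observation, action, outcome, subgoal)
        notetaker.record(state.notes, observation, outcome)  # persist codes, credentials, etc.
        if feedback.success:
            state.plan.pop(0)                                # subgoal achieved, move on
        else:
            state.plan = manager.replan(state, feedback)     # diagnostic feedback drives replanning
    return state
```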
Training pipeline and data generation at scale
A major challenge for GUI agents is scalable, high-quality training data. The team created a self-evolving data pipeline that automates dataset creation and curation:
- Query generation: human-annotated directed acyclic graphs (DAGs) model realistic navigation flows and input slots; LLMs synthesize natural-language instructions from them
- Trajectory generation: agents interact with virtual environments (Android, Ubuntu, macOS, Windows) to produce action-state sequences
- Trajectory correctness judgment: a two-level critic evaluates individual steps and full trajectories using multimodal reasoning and consensus
- Guidance synthesis and iterative training: successful trajectories yield guidance and are added back into training, creating a closed loop of improvement (sketched below)
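A rough Python sketch of one round of this closed loop; the function and object names stand in for the paper's query generator, environment rollout, two-level critic, and guidance synthesizer, and are placeholders rather than actual components.

```python
def self_evolving_round(query_generator, rollout, step_critic, traj_critic,
                        synthesize_guidance, training_set, guidance_bank, n_queries=100):
    """One iteration of the data flywheel: generate tasks, roll them out,
    keep only trajectories that pass both critic levels, and feed successes
    back as training data and guidance (illustrative sketch)."""
    for query in query_generator.sample(n_queries):
        trajectory = rollout(query, guidance=guidance_bank.get(query.domain))
        # Level 1: each step is judged on its own (action vs. resulting state).
        steps_ok = all(step_critic(step) for step in trajectory.steps)
        # Level 2: the full trajectory is judged against the original instruction.
        traj_ok = traj_critic(query, trajectory)
        if steps_ok and traj_ok:
            training_set.append(trajectory)                               # new supervision
            guidance_bank.add(query.domain, synthesize_guidance(trajectory))
    return training_set, guidance_bank
```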
They also synthesize grounding tasks from accessibility trees and screenshots, distill planning knowledge from historical trajectories and large LLMs, and generate action-effect data using before-and-after screenshots.
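For the accessibility-tree-derived grounding data, the idea is to pair each labeled UI node with its on-screen bounding box. The node schema below is a simplified assumption; the real pipeline's format may differ.

```python
def grounding_examples_from_a11y_tree(nodes, screenshot_path):
    """Turn accessibility-tree nodes into (instruction, bounding box) grounding pairs.
    `nodes` is assumed to be a list of dicts with 'text'/'content_desc' labels and
    pixel 'bounds' [x1, y1, x2, y2]."""
    examples = []
    for node in nodes:
        label = node.get("text") or node.get("content_desc")
        bounds = node.get("bounds")
        if not label or not bounds:
            continue                      # skip unlabeled elements
        examples.append({
            "image": screenshot_path,
            "instruction": f"Locate the element labeled '{label}'.",
            "target_bbox": bounds,        # supervision target for grounding
        })
    return examples
```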
Reinforcement learning advances
GUI-Owl is refined with a scalable RL framework that supports fully asynchronous training and introduces Trajectory-aware Relative Policy Optimization (TRPO). TRPO assigns credit across long and variable-length action sequences, addressing the sparse-reward nature of many GUI tasks where success is only verifiable after a sequence completes.
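The paper's exact objective is not reproduced here, but the general idea of trajectory-level credit assignment under a sparse terminal reward can be sketched as follows: sample several rollouts of the same task, normalize their final rewards within the group, and share each trajectory's advantage across all of its steps. Treat this as an interpretation under stated assumptions, not the published algorithm.

```python
import numpy as np

def trajectory_relative_advantages(group_rewards, traj_lengths, eps=1e-6):
    """Given one scalar reward per sampled trajectory of the same task
    (e.g. 1.0 if a task verifier passes, else 0.0), compute a
    group-normalized advantage and broadcast it to every step of that
    trajectory, so long and short rollouts both receive a consistent
    per-step learning signal despite the sparse reward."""
    rewards = np.asarray(group_rewards, dtype=np.float64)
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)   # relative to the group
    return [np.full(length, a) for a, length in zip(adv, traj_lengths)]

# Example: 4 rollouts of the same task; only the second and fourth succeeded.
step_advantages = trajectory_relative_advantages([0.0, 1.0, 0.0, 1.0], [12, 9, 20, 15])
```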
Benchmarks and performance highlights
GUI-Owl and Mobile-Agent-v3 were evaluated across grounding, single-step decision-making, question answering, and end-to-end task completion benchmarks.
Grounding and UI understanding:
- GUI-Owl-7B and GUI-Owl-32B lead among open-source models. On MMBench-GUI L2, GUI-Owl-7B scores 80.49 and GUI-Owl-32B reaches 82.97.
- On ScreenSpot Pro, GUI-Owl-7B scores 54.9, outperforming comparable large models.
Single-step decision-making and reasoning:
- On MMBench-GUI L1, GUI-Owl-7B scores 84.5 (easy), 86.9 (medium), and 90.9 (hard), indicating strong UI understanding and reasoning.
- On Android Control, GUI-Owl-7B hits 72.8 while GUI-Owl-32B achieves 76.6.
End-to-end multi-step tasks:
- GUI-Owl-7B scores 66.4 on AndroidWorld and 34.9 on OSWorld.
- Mobile-Agent-v3 (with GUI-Owl core) reaches 73.3 and 37.7 respectively, establishing a new open-source state of the art.
Real-world integration:
- Used as a core module inside other agentic systems, GUI-Owl-32B achieves 62.1% success on AndroidWorld and 48.4% on a challenging OSWorld subset.
These results show both broad grounding capability and strong long-horizon task performance when combined with multi-agent orchestration.
Deployment and action space
GUI-Owl supports platform-specific action spaces. On mobile: taps, long presses, swipes, text entry, system buttons, and app launches. On desktop: mouse moves, clicks, drags, scrolls, keyboard input, and app-specific commands. Abstract actions are mapped to low-level execution backends such as ADB on Android and pyautogui on desktop, enabling real-world deployment.
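A sketch of how abstract actions might be dispatched to those two backends. The action dictionary format is an assumption made for illustration; the `adb shell input` commands and pyautogui calls themselves are standard.

```python
import subprocess
import pyautogui  # desktop backend; `pip install pyautogui`

def execute_android(action: dict) -> None:
    """Dispatch an abstract mobile action to ADB (assumed action schema)."""
    if action["type"] == "tap":
        subprocess.run(["adb", "shell", "input", "tap",
                        str(action["x"]), str(action["y"])], check=True)
    elif action["type"] == "swipe":
        subprocess.run(["adb", "shell", "input", "swipe",
                        str(action["x1"]), str(action["y1"]),
                        str(action["x2"]), str(action["y2"]), "300"], check=True)
    elif action["type"] == "text":
        subprocess.run(["adb", "shell", "input", "text", action["text"]], check=True)
    elif action["type"] == "key":
        subprocess.run(["adb", "shell", "input", "keyevent", action["keycode"]],
                       check=True)                      # e.g. KEYCODE_BACK

def execute_desktop(action: dict) -> None:
    """Dispatch an abstract desktop action to pyautogui (assumed action schema)."""
    if action["type"] == "click":
        pyautogui.click(action["x"], action["y"])
    elif action["type"] == "drag":
        pyautogui.moveTo(action["x1"], action["y1"])
        pyautogui.dragTo(action["x2"], action["y2"], duration=0.5)
    elif action["type"] == "scroll":
        pyautogui.scroll(action["amount"])
    elif action["type"] == "type":
        pyautogui.write(action["text"], interval=0.02)
    elif action["type"] == "hotkey":
        pyautogui.hotkey(*action["keys"])               # e.g. ("ctrl", "s")
```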
The agent follows an explicit reasoning loop: observe the screen, recall compressed history, reason about the next action, summarize intent, and execute. This transparency aids debugging and eases integration into larger multi-agent systems where specialized roles collaborate.
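The loop itself can be written compactly. Everything below (the history summarizer, the policy call, the executor) is a placeholder interface rather than GUI-Owl's actual code.

```python
def agent_loop(env, policy, executor, summarize, max_steps=30):
    """Observe -> recall compressed history -> reason -> summarize intent -> execute."""
    history_summary = ""                                   # compressed record of past steps
    for _ in range(max_steps):
        screenshot = env.screenshot()                      # observe the current screen
        decision = policy(screenshot, history_summary)     # reasoning, next action, intent summary
        if decision.action["type"] == "done":
            return True
        executor(decision.action)                          # run the action on the device
        # Fold the new step into a short textual summary instead of storing raw frames.
        history_summary = summarize(history_summary, decision.intent, decision.action)
    return False
```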
Implications for automation and research
By unifying perception, grounding, reasoning, and action, and by building a self-improving training pipeline, GUI-Owl and Mobile-Agent-v3 push the field toward general-purpose, autonomous GUI agents. Their open-source performance surpasses many proprietary baselines on key benchmarks, suggesting practical value for automation, testing, and assistive interfaces.
For more details, see the paper at https://arxiv.org/abs/2508.15144 and check the project’s GitHub for code, tutorials, and notebooks.