
Alibaba Tongyi Lab Unveils MAI-UI: Next-Gen GUI Agents

MAI-UI sets new state-of-the-art results in GUI grounding and mobile navigation, outperforming leading competitors on mobile GUI benchmarks.

Overview of MAI-UI

Alibaba Tongyi Lab has introduced MAI-UI, a family of foundation GUI agents. The system integrates MCP tool use, agent-user interaction, device-cloud collaboration, and online reinforcement learning (RL), achieving state-of-the-art results in general GUI grounding and mobile navigation. Notably, it outperforms Gemini 2.5 Pro, Seed 1.8, and UI-TARS-2 on AndroidWorld.

What is MAI-UI?

MAI-UI is built on Qwen3-VL, with model sizes of 2B, 8B, 32B, and 235B-A22B. It takes natural language instructions and rendered UI screenshots as input and produces structured actions for real-time Android environments. The action space covers standard operations such as clicking, swiping, entering text, and pressing buttons, along with higher-level actions for user interaction and MCP tool calls.
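To make the idea of a structured action space concrete, here is a minimal sketch of what such an output format could look like. The field names and action types are illustrative assumptions; Tongyi Lab has not published MAI-UI's exact schema.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical action schema; field names are illustrative, not MAI-UI's actual format.
@dataclass
class GUIAction:
    type: str                  # e.g. "click", "swipe", "input_text", "press", "ask_user", "call_tool"
    x: Optional[int] = None    # screen coordinates for click / swipe start
    y: Optional[int] = None
    text: Optional[str] = None # payload for input_text / ask_user / call_tool

def to_json(action: GUIAction) -> str:
    """Serialize an action for the device executor, dropping unused fields."""
    return json.dumps({k: v for k, v in asdict(action).items() if v is not None})

click = GUIAction(type="click", x=540, y=1200)
print(to_json(click))  # {"type": "click", "x": 540, "y": 1200}
```

A structured, typed schema like this is what lets a device-side executor validate and replay model outputs deterministically.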

Advanced GUI Grounding Techniques

Grounding is critical for GUI agents, enabling them to convert free-form language (e.g., "open monthly billing settings") into actionable commands. MAI-UI's strategy builds upon the UI-Ins concept, utilizing multiple perspectives for each UI element through diverse training data. This minimizes the impact of ambiguous instructions and improves accuracy.
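The multi-perspective idea can be sketched as a data-augmentation step: each annotated UI element is paired with instructions phrased from several viewpoints, so no single ambiguous phrasing dominates training. The templates and element fields below are illustrative stand-ins, not the actual UI-Ins pipeline.

```python
# Sketch of multi-perspective grounding data: one element, several instruction views.
# Element fields and templates are hypothetical.

def perspectives(element: dict) -> list[str]:
    """Generate instruction variants for one annotated UI element."""
    name, role = element["name"], element["role"]
    return [
        f"Tap the {name} {role}",                  # appearance/label perspective
        f"Open {name}",                            # function perspective
        f"Select the {role} that leads to {name}", # user-intent perspective
    ]

billing = {"name": "Monthly billing", "role": "button", "bbox": [120, 640, 480, 700]}
for instruction in perspectives(billing):
    print(instruction, "->", billing["bbox"])
```

Training on all three phrasings for the same bounding box teaches the model that distinct instructions can resolve to the same target.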

Self-Evolving Navigation Data Pipeline

MAI-UI's navigation capability is trained through a self-evolving data pipeline that tracks user contexts across varied app interactions. The system grows its dataset dynamically by simulating user tasks in Android environments, with a judge model evaluating each run and keeping only effective trajectories for training.
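The loop described above, roll out tasks, score each trajectory with a judge, keep only the successes, can be sketched as follows. The rollout and judge here are random stand-ins for illustration, not Tongyi Lab's actual components.

```python
import random

# Minimal sketch of a self-evolving data loop. The rollout and judge are
# stubs; in MAI-UI the rollout runs in a real Android environment and the
# judge is a learned model.

def rollout(task: str, rng: random.Random) -> dict:
    """Stand-in for executing a task in an Android environment."""
    return {"task": task, "steps": rng.randint(3, 12), "done": rng.random() > 0.4}

def judge(trajectory: dict) -> bool:
    """Stand-in judge: accept completed, reasonably short trajectories."""
    return trajectory["done"] and trajectory["steps"] <= 10

def evolve(tasks: list[str], seed: int = 0) -> list[dict]:
    """One pipeline round: simulate every task, keep judge-approved runs."""
    rng = random.Random(seed)
    return [t for t in (rollout(task, rng) for task in tasks) if judge(t)]

kept = evolve(["open settings", "clear cache", "enable dark mode"] * 10)
print(f"kept {len(kept)} of 30 trajectories")
```

Each round's surviving trajectories feed the next training iteration, which is what makes the dataset "self-evolving" rather than fixed.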

Online Reinforcement Learning in Action

To adapt to rapidly changing mobile environments, MAI-UI employs an online RL framework that interacts directly with containerized Android Virtual Devices. This framework is scalable, demonstrating significant performance enhancements with increased parallel environments and extended task execution steps.
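The scaling pattern, many containerized emulators stepped in parallel with their transitions pooled for each policy update, can be sketched as below. Emulator control and the RL update itself are stubbed out; only the parallel-collection structure is shown.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of parallel experience collection across N environments. run_episode
# is a stub standing in for driving one containerized Android Virtual Device.

def run_episode(env_id: int, max_steps: int) -> list[tuple[int, int]]:
    """Stub episode: returns (env_id, step) transitions instead of real frames."""
    return [(env_id, step) for step in range(max_steps)]

def collect(num_envs: int, max_steps: int) -> list[tuple[int, int]]:
    """Step all environments concurrently and pool transitions for one update."""
    with ThreadPoolExecutor(max_workers=num_envs) as pool:
        batches = pool.map(run_episode, range(num_envs), [max_steps] * num_envs)
        return [t for batch in batches for t in batch]

batch = collect(num_envs=8, max_steps=16)
print(len(batch))  # 128 transitions per update with 8 envs x 16 steps
```

Raising `num_envs` or `max_steps` grows the batch per update, which mirrors the article's claim that more parallel environments and longer task horizons improve learning.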

Key Performance Metrics

  • Grounding Accuracy: MAI-UI achieves 73.5% accuracy on ScreenSpot-Pro and ranks highly across other grounding benchmarks.
  • MobileWorld Success: It attained 41.7% overall task success in the MobileWorld benchmark, outpacing leading end-to-end GUI solutions.
  • Scalable Learning: The RL system shows that scaling environments leads to notable improvements in navigation success rates.

Conclusion

MAI-UI stands out as a versatile solution for mobile GUI tasks by integrating advanced technology for enhanced user interaction and decision-making. Its innovative architecture marks a significant step forward in achieving dynamic real-world mobile deployments.
