Gelato-30B-A3B Sets a New Standard for GUI Grounding, Beating GTA1-32B

Gelato-30B-A3B converts screenshots and text instructions into precise click coordinates, achieving state-of-the-art results on GUI grounding benchmarks and improving end-to-end agent success over GTA1-32B.

What Gelato-30B-A3B does

Gelato-30B-A3B is a 31B-parameter grounding model designed to convert natural language instructions and screenshots into a single, precise click coordinate. Built as a modular component for computer-use agents, it is intended to be called by a planner model (for example GPT-5 in the experiments) to resolve high-level actions into concrete clicks across different operating systems and app layouts.
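As an illustration of how such a grounder is typically called, here is a minimal inference sketch. The Hugging Face model ID, the prompt format, and the coordinate output format are assumptions for illustration, not the confirmed interface of the released model.

```python
# Hedged sketch: resolving an instruction to a click coordinate with Gelato.
# The model ID and the "(x, y)" output format are assumptions, not the
# confirmed interface of the official release.
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "mlfoundations/Gelato-30B-A3B"  # hypothetical Hugging Face ID

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, device_map="auto")

screenshot = Image.open("desktop.png")
instruction = "Click the 'Save As...' entry in the File menu"

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": instruction},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[prompt], images=[screenshot], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=32)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # expected to contain a single click coordinate, e.g. "(1043, 212)"
```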

The model is fine-tuned from Qwen3-VL-30B-A3B-Instruct, a mixture-of-experts model, and trained on the Click 100k dataset. On benchmarks it reaches 63.88% accuracy on ScreenSpot Pro, 69.15% on OSWorld-G (with refusal prompting), and 74.65% on OSWorld-G Refined. It outperforms prior grounding models such as GTA1-32B as well as larger vision-language models, including Qwen3-VL-235B-A22B-Instruct.

Click 100k: a targeted dataset for GUI grounding

Click 100k pairs real screenshots with precise, low-level instructions and bounding boxes for the target element. Each sample includes the instruction, image dimensions, bounding box, and normalized coordinates. The dataset aggregates and unifies multiple public GUI sources including ShowUI, AutoGUI, PC Agent E, WaveUI, OS Atlas, UGround, PixMo Points, SeeClick, UI VISION, a focused JEDI subset, and 85 professional tutorial videos annotated with Claude-4-Sonnet.
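For illustration, a single record might look like the sketch below; the field names are assumptions based on the description above, not the dataset's published schema.

```python
# Illustrative Click 100k sample; field names are assumed, not the released schema.
sample = {
    "instruction": "Click the 'Export as PDF' button in the toolbar",
    "image": "screenshots/editor_0421.png",
    "image_size": {"width": 1920, "height": 1080},
    "bbox": {"x1": 1412, "y1": 64, "x2": 1506, "y2": 96},  # pixel coordinates
    "point": {"x": 0.7599, "y": 0.0741},                   # normalized click target
}

# A predicted click is counted as correct when it falls inside the bounding box.
def click_in_bbox(x_px, y_px, bbox):
    return bbox["x1"] <= x_px <= bbox["x2"] and bbox["y1"] <= y_px <= bbox["y2"]
```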

To ensure high-quality supervision, the team applies an aggressive filtering pipeline. OmniParser removes clicks that do not land on detected interface elements. Qwen2.5-7B-VL and SE-GUI-3B filter out trivial examples. GTA1-7B-2507 and UI-Venus-7B drop samples where the instruction and the click region do not match. Training a baseline on a balanced 10k subset shows a gain of 9 percentage points on ScreenSpot Pro over unfiltered data, validating the filtering approach.
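A schematic of that pipeline, following the stages described above; the model interfaces and helper names are placeholders rather than the team's released code.

```python
# Schematic of the Click 100k filtering pipeline; all model interfaces here
# (omniparser, easy_models, graders) are placeholders, not the released code.
def in_box(x, y, box):
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def keep_sample(sample, omniparser, easy_models, graders):
    x, y, target = sample["click_x"], sample["click_y"], sample["bbox"]

    # 1. OmniParser: drop clicks that do not land on any detected UI element.
    if not any(in_box(x, y, el) for el in omniparser.detect(sample["image"])):
        return False

    # 2. Trivial-example filter: if the small models (Qwen2.5-7B-VL, SE-GUI-3B)
    #    already get the click right, the sample adds little training signal.
    if all(in_box(*m.predict_click(sample), target) for m in easy_models):
        return False

    # 3. Mismatch filter: drop samples where the grading models (GTA1-7B-2507,
    #    UI-Venus-7B) judge that the instruction and click region disagree.
    if any(not g.instruction_matches(sample) for g in graders):
        return False

    return True
```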

Professional application coverage is emphasized: the dataset supplements public sources with UI VISION, the JEDI subset for spreadsheet and text cell actions, and bounding boxes generated for the tutorial videos, which were then manually inspected and corrected.

Training recipe: GRPO on Qwen3-VL

Gelato-30B-A3B uses a GRPO reinforcement learning setup on top of Qwen3-VL initialization. The team follows a DAPO-like configuration with adjustments: they remove the KL divergence term, set the clipping threshold to 0.28, and skip rollouts with zero advantage. Rewards are sparse and granted only when the predicted click lands inside the target bounding box, mirroring the GTA1 recipe.
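A minimal sketch of the sparse click reward and the zero-advantage skip, with the rest of the GRPO machinery abstracted away; the function names are illustrative, not the training code.

```python
import numpy as np

# Sparse grounding reward: 1 if the predicted click falls inside the target
# bounding box, else 0, mirroring the GTA1-style recipe described above.
def click_reward(pred_x, pred_y, bbox):
    x1, y1, x2, y2 = bbox
    return 1.0 if (x1 <= pred_x <= x2 and y1 <= pred_y <= y2) else 0.0

# GRPO-style group-normalized advantages; groups whose rewards are all
# identical carry zero advantage and are skipped, as in the setup above.
def group_advantages(rewards, eps=1e-6):
    r = np.asarray(rewards, dtype=np.float32)
    if r.std() < eps:          # all hits or all misses: no learning signal
        return None            # caller skips this rollout group
    return (r - r.mean()) / (r.std() + eps)
```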

Training ran for 100 RL steps on 32 A100 40GB GPUs, with the best checkpoint picked at step 84 based on mean performance across ScreenSpot Pro, OSWorld-G, and OSWorld-G Refined. At that checkpoint the model attains 63.88% on ScreenSpot Pro, 67.19% on OSWorld-G, and 73.40% on OSWorld-G Refined. Adding a refusal-prompting strategy raises the two OSWorld-G scores to 69.15% and 74.65% respectively.
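The checkpoint-selection rule amounts to taking the RL step with the highest mean score across the three benchmarks; a toy sketch, in which only the step-84 numbers come from the reported results and the neighbouring steps are placeholders.

```python
# Pick the RL step with the best mean score across the three benchmarks.
# Only the step-84 entry reflects reported numbers; the others are placeholders.
eval_results = {
    80: (62.1, 66.0, 72.5),     # (ScreenSpot Pro, OSWorld-G, OSWorld-G Refined)
    84: (63.88, 67.19, 73.40),  # reported scores at the selected checkpoint
    88: (63.2, 66.8, 72.9),
}
best_step = max(eval_results, key=lambda s: sum(eval_results[s]) / 3)
print(best_step)  # -> 84
```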

End-to-end agent evaluation on OSWorld

The researchers integrated Gelato-30B-A3B into the GTA1.5 agent framework to measure full computer-use performance. In these experiments GPT-5 handled planning, Gelato handled grounding, and agents were allowed up to 50 steps with 3-second pauses between actions.
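Conceptually, the loop looks roughly like the sketch below; the class and method names are invented for illustration and do not correspond to the actual GTA1.5 code.

```python
import time

# Illustrative planner + grounder loop; the desktop, planner, and grounder
# interfaces are invented placeholders, not the GTA1.5 framework API.
MAX_STEPS = 50
PAUSE_SECONDS = 3

def run_episode(task, planner, grounder, desktop):
    for _ in range(MAX_STEPS):
        screenshot = desktop.screenshot()
        # Planner (GPT-5 in this setup) decides the next high-level action.
        action = planner.next_action(task, screenshot)
        if action.kind == "done":
            break
        if action.kind == "click":
            # Grounder (Gelato-30B-A3B) resolves the target description to pixels.
            x, y = grounder.locate(screenshot, action.target_description)
            desktop.click(x, y)
        elif action.kind == "type":
            desktop.type_text(action.text)
        time.sleep(PAUSE_SECONDS)  # pause between actions, as in the evaluation
```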

On a fixed OSWorld snapshot, Gelato-30B-A3B achieved a 58.71% automated success rate (with small standard deviation across runs) versus 56.97% for GTA1-32B under the same evaluation harness. Human evaluation of 20 problematic tasks produced 61.85% success for Gelato compared with 59.47% for GTA1-32B, indicating that the automatic evaluator sometimes misses valid solutions.

Why this matters

Gelato-30B-A3B demonstrates that a Qwen3-VL-based mixture-of-experts model trained on the carefully curated Click 100k dataset can improve grounding accuracy and translate that improvement into stronger end-to-end agent performance. By outperforming older grounding models and even much larger VLMs on standard GUI benchmarks, Gelato establishes a new open baseline for GUI grounding and is accessible via GitHub and Hugging Face for further experimentation.

Repository and model weights are available at https://github.com/mlfoundations/Gelato
