Holo1.5: Vision Models Tuned for Pinpoint GUI Localization and UI-VQA

What Holo1.5 delivers

H Company has released Holo1.5, a family of open foundation vision models optimized for computer-use (CU) agents that operate on real user interfaces. The lineup includes 3B, 7B, and 72B checkpoints. Across sizes, Holo1.5 shows about a 10% accuracy gain over Holo1. The 7B checkpoint is available under Apache-2.0; the 3B and 72B checkpoints inherit research-only constraints from their upstream bases.

Focus on pixel-precise localization

A core capability for CU agents is converting intent into pixel-level actions. Executing an instruction like ‘Open Spotify’, for example, depends on predicting the exact clickable coordinates on the current screen. Small localization errors cascade through multi-step workflows and can break automation. Holo1.5 is trained and evaluated for high-resolution screens (up to 3840×2160) across desktop (macOS, Ubuntu, Windows), web, and mobile interfaces. The models improve robustness on dense professional UIs where small icons and tight layouts otherwise raise error rates.
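As a rough sketch of what such a call might look like, the snippet below asks a Holo1.5-style checkpoint for click coordinates given a screenshot and an instruction. It assumes the 7B model is served through the Hugging Face transformers chat-template interface of its Qwen2.5-VL base; the repo id, prompt wording, and ‘x,y’ output convention are assumptions to verify against the model card.

```python
# Sketch: ask a Holo1.5-style checkpoint for click coordinates on a screenshot.
# The model id, prompt, and "x,y" answer format are assumptions, not a confirmed API.
import re
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "Hcompany/Holo1.5-7B"  # hypothetical repo id; check the model card

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

def locate(screenshot: Image.Image, instruction: str) -> tuple[int, int] | None:
    """Return predicted (x, y) pixel coordinates for the instruction, if parseable."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": screenshot},
            {"type": "text", "text": f"Localize the element to click for: {instruction}. "
                                     "Answer with pixel coordinates as x,y."},
        ],
    }]
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    generated = model.generate(**inputs, max_new_tokens=32)
    text = processor.batch_decode(
        generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0]
    match = re.search(r"(\d+)\s*,\s*(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None

print(locate(Image.open("desktop.png"), "Open Spotify"))  # e.g. (1312, 884) on a 3840x2160 screen
```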

Training objectives and design choices

Unlike general VLMs that prioritize broad grounding and captioning, Holo1.5 aligns its data and objectives with CU needs. Training includes large-scale supervised fine-tuning on GUI tasks followed by GRPO-style reinforcement learning focused on tightening coordinate accuracy and decision reliability. These models are intended as perception components to be embedded in planners and executors (for example, Surfer-style agents), not as end-to-end agents themselves.
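The GRPO stage is described only at a high level, but its core signal, rewarding sampled coordinate predictions that land inside the target element and normalizing rewards within each group of samples, can be sketched as follows. The reward shaping and normalization details here are illustrative assumptions, not the published training recipe.

```python
# Sketch of a GRPO-style reward and group-normalized advantage for coordinate grounding.
# The reward shaping (binary hit plus a distance term) is an illustrative assumption.
import math
from statistics import mean, pstdev

def click_reward(pred: tuple[float, float],
                 bbox: tuple[float, float, float, float]) -> float:
    """Binary hit reward plus a small distance-to-center shaping term."""
    x, y = pred
    x0, y0, x1, y1 = bbox
    hit = 1.0 if (x0 <= x <= x1 and y0 <= y <= y1) else 0.0
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    diag = math.hypot(x1 - x0, y1 - y0) or 1.0
    shaping = max(0.0, 1.0 - math.hypot(x - cx, y - cy) / diag)
    return hit + 0.1 * shaping

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each sample's reward within its group."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

# One group of sampled predictions for the same (screenshot, instruction) pair:
bbox = (100, 200, 180, 240)                 # ground-truth element box (x0, y0, x1, y1)
samples = [(140, 220), (90, 210), (400, 400), (160, 235)]
rewards = [click_reward(p, bbox) for p in samples]
print(group_advantages(rewards))            # positive for in-box samples, negative otherwise
```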

Benchmark performance

Holo1.5 reports state-of-the-art GUI grounding across multiple benchmarks: ScreenSpot-v2, ScreenSpot-Pro, GroundUI-Web, Showdown, and WebClick. A representative 7B result:

On ScreenSpot-Pro (professional apps with dense layouts), Holo1.5-7B achieves 57.94 versus 29.00 for Qwen2.5-VL-7B, indicating materially better target selection under realistic conditions. The 3B and 72B checkpoints show similar relative gains versus their Qwen2.5-VL counterparts.
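These grounding benchmarks are typically scored by checking whether the predicted click point falls inside the annotated target box. The harness below is a minimal sketch of that metric; the JSONL field names (image, instruction, bbox) and the predict callable are illustrative assumptions, not any benchmark's official tooling.

```python
# Minimal grounding-accuracy harness: a prediction counts as correct when the
# predicted point lands inside the annotated target box. Field names are assumed.
import json

def point_in_box(x: float, y: float, bbox: list[float]) -> bool:
    x0, y0, x1, y1 = bbox
    return x0 <= x <= x1 and y0 <= y <= y1

def grounding_accuracy(jsonl_path: str, predict) -> float:
    """predict(image_path, instruction) -> (x, y) in original pixel coordinates."""
    hits, total = 0, 0
    with open(jsonl_path) as f:
        for line in f:
            ex = json.loads(line)   # {"image": ..., "instruction": ..., "bbox": [x0, y0, x1, y1]}
            pred = predict(ex["image"], ex["instruction"])
            if pred is not None and point_in_box(*pred, ex["bbox"]):
                hits += 1
            total += 1
    return hits / max(total, 1)

# accuracy = grounding_accuracy("screenspot_pro.jsonl", predict=my_model_predict)
```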

UI understanding (UI-VQA)

Holo1.5 also improves UI visual question answering. On VisualWebBench, WebSRC, and ScreenQA (short/complex), the models deliver consistent accuracy gains. Reported 7B averages are around 88.17, with the 72B variant near 90.00. Stronger UI-VQA reduces ambiguity for queries like ‘Which tab is active?’ or ‘Is the user signed in?’, enabling agents to verify state before and after actions.
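A minimal sketch of using short UI-VQA answers for state verification is shown below; the ask callable stands in for a Holo1.5 VQA call, and its prompt and answer format are assumptions for illustration.

```python
# Sketch: use short UI-VQA answers to verify screen state around an action.
# `ask(screenshot, question) -> str` stands in for a Holo1.5 VQA call (assumed interface).
from typing import Callable
from PIL import Image

def verify_state(ask: Callable[[Image.Image, str], str],
                 screenshot: Image.Image,
                 expectation: str) -> bool:
    """Ask a yes/no question about the screen and check the short answer."""
    answer = ask(screenshot, f"{expectation} Answer yes or no.").strip().lower()
    return answer.startswith("yes")

# Example usage (ask() would wrap the model call):
# signed_in  = verify_state(ask, Image.open("before.png"), "Is the user signed in?")
# tab_active = verify_state(ask, Image.open("after.png"), "Is the Settings tab active?")
```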

Comparison to other systems

Under the published evaluation setup, Holo1.5 outperforms open baselines such as Qwen2.5-VL and competitive specialized systems (UI-TARS, UI-Venus), and shows advantages over some closed generalist models on the cited UI tasks. Because protocols, prompts, and screen resolutions affect results, practitioners should reproduce evaluations on their own harness before drawing deployment conclusions.

How to use Holo1.5 in a CU stack

Think of Holo1.5 as the screen perception layer. Input is full-resolution screenshots (optionally with UI metadata). Outputs include target coordinates with confidence scores and short textual answers about screen state. Downstream action policies convert predictions into click and keyboard events while monitoring layers verify outcomes and trigger retries or fallbacks.
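A wiring along those lines might look like the sketch below: the perception call proposes coordinates with a confidence score, an executor issues the click, and a verifier re-screenshots and checks the expected outcome. The locate, click, screenshot, and verify interfaces are placeholders rather than a published Holo1.5 API, and the retry policy is illustrative.

```python
# Sketch of a perception -> action -> verification loop around a Holo1.5-style
# perception layer. locate(), click(), screenshot(), and verify() are placeholder
# interfaces, not a published API; the retry/fallback policy is illustrative.
from dataclasses import dataclass
from typing import Callable, Optional
from PIL import Image

@dataclass
class Localization:
    x: int
    y: int
    confidence: float

def run_step(instruction: str,
             expectation: str,
             screenshot: Callable[[], Image.Image],
             locate: Callable[[Image.Image, str], Optional[Localization]],
             click: Callable[[int, int], None],
             verify: Callable[[Image.Image, str], bool],
             min_confidence: float = 0.5,
             retries: int = 2) -> bool:
    """Run one UI step with confidence gating, post-action verification, and retries."""
    for _ in range(retries + 1):
        shot = screenshot()
        loc = locate(shot, instruction)
        if loc is None or loc.confidence < min_confidence:
            continue                       # low confidence: retake the screenshot and retry
        click(loc.x, loc.y)
        if verify(screenshot(), expectation):
            return True                    # outcome confirmed by the UI-VQA check
    return False                           # caller escalates to a fallback policy

# ok = run_step("Open Spotify", "Is the Spotify window open?",
#               screenshot, locate, click, verify)
```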

Practical recommendation

If you need a commercially usable perception base today, start with Holo1.5-7B (Apache-2.0), benchmark it on your own screens, and instrument your planner and safety layers around it. Models and technical details are available via the H Company pages and Hugging Face, with tutorials, code, and notebooks on GitHub.

Links: https://www.hcompany.ai/blog/holo-1-5