Ovis 2.5: Alibaba's Native-Resolution Multimodal LLMs Push Visual Reasoning Forward
Alibaba's Ovis2.5 (9B and 2B) advances multimodal AI with a native-resolution vision transformer and an optional thinking mode, achieving top scores for open-source models under 40B and improved OCR and chart understanding.
What Ovis 2.5 Brings
Ovis2.5 is the newest multimodal large language model (MLLM) release from Alibaba's AIDC-AI team, available in 9B and 2B parameter variants. It targets long-standing weaknesses in multimodal AI by improving high-resolution visual perception, multimodal reasoning, and robust OCR. The model is designed to retain fine detail in images and to perform deeper, stepwise analysis on complex tasks.
Native-Resolution Vision with NaViT
A central innovation in Ovis2.5 is the native-resolution vision transformer (NaViT). Unlike approaches that rely on tiling or forced resizing, NaViT processes images at their original, variable resolutions to preserve global context and small visual details. This capability makes Ovis2.5 particularly strong on visually dense inputs: scientific diagrams, complex infographics, charts, forms, and natural images where resizing would otherwise discard critical information.
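In practice, this means an image can be handed to the model at whatever size it comes in. The sketch below shows one plausible way to query the 9B checkpoint (published on Hugging Face as AIDC-AI/Ovis2.5-9B) without resizing the input; the `model.chat` helper is an illustrative assumption, not the documented API, so consult the model card for the exact calling convention.

```python
# Minimal sketch of querying Ovis2.5 at native resolution via Hugging Face Transformers.
# The checkpoint name comes from the official release; the chat-style call below is an
# illustrative assumption, not the documented API -- check the model card for exact usage.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis2.5-9B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,   # Ovis ships custom modeling code
).cuda()

# No tiling or forced resizing: the image is handed over at its original size,
# and NaViT tokenizes it at that native resolution.
image = Image.open("dense_chart.png")
prompt = "What is the peak value in this chart, and in which year does it occur?"

# Hypothetical helper name; earlier Ovis releases expose similar chat/preprocess utilities.
response = model.chat(prompt=prompt, images=[image], max_new_tokens=512)
print(response)
```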
Deeper Multimodal Reasoning and Thinking Mode
To improve reasoning, Ovis2.5 uses a curriculum that extends beyond conventional chain-of-thought supervision. The training set includes "thinking-style" samples designed to encourage self-correction and reflection. At inference time, users can optionally enable a "thinking mode" that trades latency for enhanced step-by-step accuracy and model introspection. This mode is useful for tasks requiring systematic multimodal analysis, such as scientific QA or mathematical problem solving.
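Building on the loading sketch above, the snippet below illustrates how such a switch might look at call time. The `enable_thinking` flag is a hypothetical parameter name used only for illustration; the actual toggle (a generation argument or chat-template option) is described in the official model card.

```python
# Sketch of toggling the optional thinking mode at inference time.
# `enable_thinking` is a hypothetical flag, shown here to illustrate the latency/accuracy trade-off.
fast_answer = model.chat(
    prompt="Solve the geometry problem shown in the figure.",
    images=[Image.open("geometry.png")],
    enable_thinking=False,       # low-latency direct answer
    max_new_tokens=256,
)

careful_answer = model.chat(
    prompt="Solve the geometry problem shown in the figure.",
    images=[Image.open("geometry.png")],
    enable_thinking=True,        # slower, but reflects and self-corrects step by step
    max_new_tokens=2048,         # leave room for the intermediate reasoning trace
)
```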
Performance and Benchmarks
Ovis2.5-9B achieves an average score of 78.3 on the OpenCompass multimodal leaderboard, outperforming all open-source MLLMs under 40B parameters. The 2B variant scores 73.9, establishing a strong benchmark for lightweight models suitable for on-device or resource-constrained scenarios. The models lead open-source competitors across several specialized domains:
- STEM reasoning (MathVista, MMMU, WeMath)
- OCR and chart analysis (OCRBench v2, ChartQA Pro)
- Visual grounding (RefCOCO, RefCOCOg)
- Video and multi-image comprehension (BLINK, VideoMME)
Community commentary on Reddit and X highlights notable OCR and document-processing improvements, including better text extraction in cluttered images and improved form understanding.
Training Efficiency and Deployment
Ovis2.5 uses multimodal data packing and advanced hybrid parallelism to boost end-to-end training efficiency, reporting up to 3–4× throughput improvements. The design philosophy of the series—'small model, big performance'—is embodied in the 2B variant, which targets mobile and edge deployment where compute and memory are limited.
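To make the packing idea concrete, the toy sketch below greedily concatenates variable-length vision-plus-text samples into fixed-size training slots so that little compute is wasted on padding. It illustrates the general technique only and is not the team's actual pipeline.

```python
# Toy illustration of multimodal data packing: greedy first-fit grouping of
# variable-length token sequences into fixed-capacity training slots.
from typing import List

MAX_LEN = 4096  # tokens per packed training sequence (assumed value for illustration)

def pack_samples(sample_lengths: List[int], max_len: int = MAX_LEN) -> List[List[int]]:
    """Return groups of sample indices, each group fitting within one packed sequence."""
    bins: List[List[int]] = []
    bin_loads: List[int] = []
    for idx, length in enumerate(sample_lengths):
        for b, load in enumerate(bin_loads):
            if load + length <= max_len:
                bins[b].append(idx)
                bin_loads[b] += length
                break
        else:  # no existing slot has room: open a new packed sequence
            bins.append([idx])
            bin_loads.append(length)
    return bins

# Mixed image+text samples of very different lengths pack into far fewer
# sequences than one-sample-per-sequence padding would require.
lengths = [3100, 800, 1500, 2600, 400, 1200]
print(pack_samples(lengths))  # [[0, 1], [2, 4, 5], [3]]
```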
Where to Find More
The release includes a technical report and model checkpoints available on Hugging Face. The team also maintains a GitHub page with tutorials, code, and notebooks. Interested users and researchers can follow the project on social channels and community forums for updates and discussion.