Baidu's ERNIE-4.5 'Thinking' Brings 3B-Scale Multimodal Reasoning
Baidu launches ERNIE-4.5-VL-28B-A3B-Thinking, a compact open-source multimodal model that activates 3B parameters per token while offering strong document, chart and video reasoning capabilities.
Overview
Baidu has added ERNIE-4.5-VL-28B-A3B-Thinking, a vision-language model optimized for document, chart and video understanding, to the open-source ERNIE-4.5 family while keeping the runtime activation budget in the 3B-parameter class. The release targets teams that need strong multimodal reasoning over dense text, fine chart structure and video segments without the full resource footprint of a flagship model.
Architecture and training setup
ERNIE-4.5-VL-28B-A3B-Thinking is based on a heterogeneous multimodal Mixture-of-Experts design. The model belongs to the family's 28B-VL architectural branch, keeping a pool of roughly 28B total parameters, but its A3B routing scheme activates only about 3B parameters per token. That combination delivers the compute and memory profile of a 3B-class model while preserving a larger capacity pool for reasoning.
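To make the A3B idea concrete, the following is a minimal, generic top-k Mixture-of-Experts routing sketch in PyTorch. It is not Baidu's implementation; the expert count, hidden sizes and top-k value are placeholders chosen only to illustrate how sparse routing lets a large parameter pool run with a much smaller activated budget per token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    """Generic top-k expert routing: each token is processed by only k of
    the experts, so the activated parameter count is a small fraction of
    the total expert pool. Sizes are illustrative, not ERNIE's."""

    def __init__(self, d_model=1024, d_ff=4096, num_experts=64, top_k=6):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                           # x: [num_tokens, d_model]
        scores = self.router(x)                     # [num_tokens, num_experts]
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)        # mix only the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out
```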
The model undergoes a dedicated mid-training stage focused on visual-language reasoning. This stage strengthens representation power and alignment between the vision and language modalities, which matters for dense document text and subtle chart features. Training also incorporates multimodal reinforcement learning on verifiable tasks, using GSPO and IcePop strategies together with dynamic difficulty sampling to stabilize MoE training and emphasize challenging examples.
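The article does not spell out how dynamic difficulty sampling is implemented. One common realization, sketched below purely as an assumption rather than Baidu's recipe, is to re-weight prompts by the model's recent pass rate so that examples near the capability boundary are drawn most often during RL on verifiable tasks.

```python
import random


def difficulty_weight(pass_rate: float) -> float:
    """Generic difficulty weighting, not Baidu's GSPO/IcePop recipe:
    prompts the model always solves or always fails carry little training
    signal, so the weight peaks near a 50% pass rate."""
    return max(1e-3, pass_rate * (1.0 - pass_rate))


def sample_batch(prompts, pass_rates, batch_size=32):
    # Draw a batch with probability proportional to informativeness.
    weights = [difficulty_weight(p) for p in pass_rates]
    return random.choices(prompts, weights=weights, k=batch_size)
```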
How 'Thinking with Images' and tool utilization work
Thinking with Images is a core capability of this variant. The model can iteratively zoom into regions, reason over the cropped views and then integrate the local observations into a final answer. This enables stepwise inspection of documents or charts rather than a single-shot pass.
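From the application side, the zoom step can be as simple as cropping the region the model asks to inspect and feeding the crop back as an additional image. The sketch below assumes the model emits normalized coordinates under hypothetical field names; the real interface may differ.

```python
from PIL import Image


def crop_region(image_path: str, bbox_norm: dict) -> Image.Image:
    """Crop a region the model asked to look at more closely.
    bbox_norm is assumed to hold normalized [0, 1] corner coordinates
    under hypothetical keys x1, y1, x2, y2."""
    img = Image.open(image_path)
    w, h = img.size
    box = (
        int(bbox_norm["x1"] * w),
        int(bbox_norm["y1"] * h),
        int(bbox_norm["x2"] * w),
        int(bbox_norm["y2"] * h),
    )
    return img.crop(box)

# The cropped view is then appended to the conversation as a new image
# input for the model's next reasoning turn.
```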
Tool utilization complements internal reasoning by enabling calls to external tools, such as image search, when internal knowledge is insufficient. Both Thinking with Images and tool calls are exposed through the reasoning parser and tool-call parser during deployment, allowing integrated pipelines that mix internal multimodal reasoning with external lookups.
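As a rough sketch of how this looks from a client, the snippet below calls an OpenAI-compatible endpoint served by vLLM with reasoning and tool-call parsing enabled. The server flags, parser names and the image_search tool schema are illustrative assumptions and should be checked against the model card.

```python
# Assumed launch command (parser names are placeholders, check the model card):
#   vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code \
#       --reasoning-parser <parser> --tool-call-parser <parser> --enable-auto-tool-choice
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical external tool the model may call when its own knowledge is insufficient.
tools = [{
    "type": "function",
    "function": {
        "name": "image_search",
        "description": "Search the web for images matching a text query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="baidu/ERNIE-4.5-VL-28B-A3B-Thinking",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            {"type": "text", "text": "What style of chart is this? Find similar examples."},
        ],
    }],
    tools=tools,
)
# The parsed response may carry reasoning content and/or a structured tool call.
print(response.choices[0].message)
```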
Key capabilities
Baidu lists visual reasoning, STEM reasoning, visual grounding, Thinking with Images, tool utilization and video understanding among the model's official capabilities. The model shows strength on analytics-style charts, circuit and other STEM problems, visual grounding with JSON bounding boxes, and video segment localization with timestamped answers.
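Since grounding answers are described as JSON bounding boxes, a small helper like the one below could pull them out of the generated text. The key names and coordinate convention are assumptions; the real output schema should be taken from the model card.

```python
import json
import re


def extract_boxes(answer_text: str):
    """Pull JSON bounding boxes out of a grounding answer.
    Assumes the model embeds a JSON array of objects such as
    {"label": "...", "bbox": [x1, y1, x2, y2]}; the real schema may differ."""
    match = re.search(r"\[.*\]", answer_text, flags=re.DOTALL)
    if not match:
        return []
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
```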
ERNIE-4.5-VL variants support both thinking and non-thinking modes. The thinking mode is tuned to improve reasoning-centered tasks while maintaining high perception quality.
Performance and positioning
On internal benchmarks, ERNIE-4.5-VL-28B-A3B-Thinking reportedly achieves competitive or superior results compared to Qwen2.5-VL-7B and Qwen2.5-VL-32B on many tasks while activating fewer parameters. Baidu positions the Thinking variant as closely matching flagship industry models across multimodal benchmarks while keeping a small activation footprint in production.
Deployment, licensing and fine-tuning
The model is released under Apache License 2.0 and supports deployment via transformers, vLLM and FastDeploy. Teams can fine-tune it with ERNIEKit using SFT, LoRA and DPO for commercial multimodal applications.
Hugging Face model page: https://huggingface.co/baidu/ERNIE-4.5-VL-28B-A3B-Thinking
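For a quick local test with transformers, a loading sketch might look like the following. The generic Auto classes, chat-template call and message format are assumptions that depend on the model's custom code, so the snippet on the Hugging Face page above takes precedence if they differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "baidu/ERNIE-4.5-VL-28B-A3B-Thinking"

# trust_remote_code pulls in the model's own multimodal classes; the generic
# Auto* entry points used here are an assumption.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/report_page.png"},
        {"type": "text", "text": "Summarize the table shown on this page."},
    ],
}]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```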
Who should consider this model
This release is practical for teams that need strong multimodal reasoning on documents, charts and videos but want to keep inference costs and memory use close to a 3B class model. It is also suitable for deployments that require flexible tool calls and iterative visual reasoning strategies.