
GLM-4.1V-Thinking: Breaking New Ground in Multimodal Reasoning and Understanding

GLM-4.1V-Thinking is a cutting-edge vision-language model that pushes the boundaries of multimodal reasoning, setting new standards across various challenging AI tasks.

The Rise of Vision-Language Models

Vision-language models (VLMs) have become essential in modern intelligent systems, enabling detailed interpretation of visual data. The complexity of tasks requiring multimodal intelligence is increasing, spanning from scientific problem-solving to the creation of autonomous agents. The expectations for VLMs now extend beyond simple perception to include advanced reasoning abilities.

Introducing GLM-4.1V-Thinking

Researchers from Zhipu AI and Tsinghua University have developed GLM-4.1V-Thinking, a vision-language model aimed at enhancing general-purpose multimodal understanding and reasoning. This model employs Reinforcement Learning with Curriculum Sampling (RLCS) to unlock its full potential, improving performance across various domains such as STEM problem solving, video analysis, content recognition, coding, grounding, GUI-based agents, and long document comprehension.

Model Architecture and Innovations

GLM-4.1V-Thinking consists of three main components: a vision encoder, an MLP adapter, and an LLM decoder. It uses AIMv2-Huge as the vision encoder and GLM as the large language model. Key innovations include replacing 2D convolutions with 3D convolutions for temporal downsampling, integrating 2D-RoPE to handle arbitrary image resolutions and extreme aspect ratios (over 200:1), and extending RoPE to 3D-RoPE within the LLM for enhanced spatial understanding in multimodal contexts. For video temporal modeling, time index tokens and timestamp strings are added to help the model grasp real-world time gaps between frames.
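To make that pipeline concrete, here is a minimal PyTorch sketch of the flow described above: a 3D convolution patchifies video so time is downsampled alongside space, and an MLP adapter projects the resulting visual tokens into the language model's embedding space. All module names, dimensions, and kernel sizes are illustrative assumptions; the actual model uses the AIMv2-Huge encoder and a GLM decoder, which are not reproduced here.

```python
import torch
import torch.nn as nn


class TemporalDownsample(nn.Module):
    """Patchify video with a 3D convolution so that time is downsampled
    alongside space. Kernel sizes and dimensions are illustrative only."""

    def __init__(self, in_channels=3, embed_dim=1536, patch=14, temporal_stride=2):
        super().__init__()
        self.proj = nn.Conv3d(
            in_channels,
            embed_dim,
            kernel_size=(temporal_stride, patch, patch),
            stride=(temporal_stride, patch, patch),
        )

    def forward(self, video):
        # video: (batch, channels, frames, height, width)
        x = self.proj(video)                 # (B, D, T', H', W')
        return x.flatten(2).transpose(1, 2)  # (B, T'*H'*W', D) visual tokens


class MLPAdapter(nn.Module):
    """Project vision features into the language model's embedding space."""

    def __init__(self, vision_dim=1536, llm_dim=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens):
        return self.net(vision_tokens)


# Toy forward pass: 8 frames of 224x224 video -> tokens ready for the LLM decoder.
video = torch.randn(1, 3, 8, 224, 224)
visual_tokens = MLPAdapter()(TemporalDownsample()(video))
print(visual_tokens.shape)  # torch.Size([1, 1024, 4096])
```

In the real model, positional information for these tokens comes from 2D-RoPE in the encoder and 3D-RoPE in the decoder, with time index tokens marking frame boundaries; those pieces are omitted from this sketch.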

Training Strategy

The pre-training phase combines diverse datasets, including large academic corpora and interleaved image-text data rich in knowledge. Incorporating pure text data helps preserve the model's core language capabilities, resulting in superior pass@k performance compared to other similar-sized models. Supervised fine-tuning uses a curated long chain-of-thought (long-CoT) corpus that teaches the model to reason step by step across both verifiable tasks (such as STEM problems) and non-verifiable tasks (such as instruction following). The reinforcement learning phase combines Reinforcement Learning with Verifiable Rewards (RLVR) and Reinforcement Learning from Human Feedback (RLHF) to conduct large-scale training across multimodal domains including STEM problem solving, grounding, optical character recognition (OCR), GUI agents, and more.
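As a rough illustration of what curriculum sampling in the RL stage could look like, here is a short Python sketch: RL prompts are grouped into difficulty buckets, and sampling weight is concentrated on buckets the policy currently solves some, but not all, of the time. The thresholds, bucket names, and helper functions are hypothetical and are not taken from the paper.

```python
import random


def curriculum_weights(solve_rates, low=0.2, high=0.8):
    """Weight each task bucket by how useful it currently is for learning:
    buckets the policy almost never or almost always solves give little
    training signal. The thresholds are illustrative assumptions."""
    return [1.0 if low <= rate <= high else 0.1 for rate in solve_rates]


def sample_batch(task_buckets, solve_rates, batch_size=8, seed=0):
    """Draw a batch of RL prompts, favoring buckets in the useful band."""
    rng = random.Random(seed)
    weights = curriculum_weights(solve_rates)
    picks = rng.choices(range(len(task_buckets)), weights=weights, k=batch_size)
    return [rng.choice(task_buckets[i]) for i in picks]


# Hypothetical buckets and current solve rates: too hard / useful / too easy.
buckets = [["stem_q1", "stem_q2"], ["ocr_q1", "ocr_q2"], ["gui_q1", "gui_q2"]]
rates = [0.15, 0.55, 0.95]
print(sample_batch(buckets, rates))
```

The design intuition is simply that prompts the policy already masters or cannot yet solve at all contribute little gradient signal, so the sampler keeps training focused where improvement is still possible.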

Performance Highlights

GLM-4.1V-9B-Thinking outperforms all open-source models under 10B parameters on general visual question answering (VQA) tasks, covering both single-image and multi-image scenarios. It achieves top scores on challenging STEM benchmarks such as MMMU_Val, MMMU_Pro, VideoMMMU, and AI2D. In OCR and chart analysis, it sets new state-of-the-art results on ChartQAPro and ChartMuseum. For long document understanding, it leads on MMLongBench and establishes new records in GUI-agent and multimodal coding tasks. The model also shows strong video understanding, delivering leading results on benchmarks such as VideoMME, MMVU, and MotionBench.

Challenges and Future Directions

Despite the impressive results, some limitations remain: reinforcement learning does not consistently improve reasoning quality, training can become unstable, and the model still struggles with particularly complex scenarios. Future work should aim to strengthen the supervision and evaluation of reasoning, for example with reward models that assess intermediate reasoning steps while detecting hallucinations and logical inconsistencies. Preventing reward hacking in subjective evaluation tasks is also crucial on the path toward general-purpose intelligence.

Researchers have made GLM-4.1V-9B-Thinking open source. For further details, check out the paper and the GitHub page.

All credit goes to the dedicated researchers behind this project.
