ReVisual-R1: Breaking New Ground in Multimodal Reasoning with a 7B Parameter Open-Source Model
ReVisual-R1 is an innovative open-source 7B multimodal language model that advances complex reasoning by integrating a three-stage training pipeline with novel reinforcement learning techniques.
Challenges in Multimodal Reasoning
Recent successes in text-based language models like DeepSeek-R1 have shown that reinforcement learning (RL) can significantly enhance reasoning skills. Researchers have tried to extend these RL methods to multimodal large language models (MLLMs) that process both visual and textual inputs, but these attempts have met with limited success: MLLMs still struggle with complex reasoning tasks. This suggests that RL strategies effective for text-only models do not transfer directly to multimodal settings, where integrating different data types creates unique challenges that require specialized solutions.
Progress and Limitations in Multimodal Language Models
The development of MLLMs builds on advances in large language models (LLMs) by integrating visual data with language understanding. Foundational models like CLIP and MiniGPT-4 paved the way, followed by instruction-tuned models such as LLaVA. While some closed-source models achieve strong reasoning with extended chain-of-thought (CoT) outputs, open-source alternatives have mostly focused on fine-tuning and CoT adaptations, often producing brief answers that limit deeper reasoning. Reinforcement learning methods, including RLHF and GRPO, have shown promise in improving reasoning for LLMs. Inspired by these, recent research aims to apply RL to MLLMs to enhance visual reasoning and elicit richer, longer responses.
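For context, GRPO replaces the learned value (critic) network of classic policy-gradient RL with group-relative scoring: several responses are sampled per prompt, and each response's reward is normalized against the group's mean and standard deviation. A minimal sketch of that advantage computation (the function name and example rewards are ours, not from the paper):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages in the style of GRPO: each sampled
    response's reward is normalized against the group's statistics,
    removing the need for a separate learned value network."""
    rewards = np.asarray(rewards, dtype=np.float64)
    std = rewards.std()
    if std < 1e-8:
        # Degenerate group: every response scored identically, so each
        # advantage is zero and no policy gradient flows from this prompt.
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std

# Example: 4 responses to one prompt, binary correctness rewards.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # mixed group -> nonzero signal
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))  # uniform group -> all zeros
```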
Introduction of ReVisual-R1
A collaborative team from Tsinghua University, Shanghai Jiao Tong University, and the Shanghai Artificial Intelligence Laboratory introduced ReVisual-R1, an open-source multimodal model with 7 billion parameters that advances multimodal reasoning capabilities. Their research highlights three key findings:
- Careful text-only pretraining creates a strong base, outperforming many existing MLLMs even before RL.
- The widely used GRPO algorithm suffers from gradient stagnation, which they overcome with a novel approach named Prioritized Advantage Distillation (PAD); a sketch of the idea follows below.
- Incorporating a final text-only RL phase after multimodal RL further improves reasoning.
Their three-stage training pipeline—text pretraining, multimodal RL with PAD, and final text RL—strikes an effective balance between grounding in visual data and deeper cognitive reasoning.
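The article does not reproduce the exact PAD procedure, but the failure mode it targets is visible in the GRPO sketch above: when every response in a group earns the same reward, all advantages are zero and the gradient stagnates. The sketch below illustrates one plausible reading of the fix, biasing each training batch toward trajectories with informative, nonzero advantages; the function name, temperature parameter, and sampling scheme are our assumptions, not the authors' implementation.

```python
import numpy as np

def pad_select(rollouts, batch_size, temperature=1.0):
    """Illustrative Prioritized Advantage Distillation step: weight rollouts
    by advantage magnitude and sample the training batch toward informative
    (nonzero-advantage) trajectories, so degenerate groups where GRPO's
    gradient stagnates contribute little or nothing.

    `rollouts` is a list of (trajectory, advantage) pairs.
    """
    advs = np.array([abs(a) for _, a in rollouts])
    if advs.sum() == 0:
        return []  # nothing informative to learn from this step
    # Priority weights: larger |advantage| -> more likely to be trained on.
    weights = advs ** (1.0 / temperature)
    probs = weights / weights.sum()
    k = min(batch_size, int((advs > 0).sum()))
    idx = np.random.choice(len(rollouts), size=k, replace=False, p=probs)
    return [rollouts[i] for i in idx]

# Example: two informative trajectories and two zero-advantage ones;
# only the informative ones can be selected.
batch = pad_select([("t1", 0.9), ("t2", 0.0), ("t3", -0.7), ("t4", 0.0)], batch_size=2)
```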
The GRAMMAR Dataset
Recognizing that existing multimodal cold-start datasets lack sufficient complexity for strong reasoning training, the researchers developed the GRAMMAR dataset, which combines diverse textual and multimodal samples through a multi-stage curation process. In their experiments, text-only datasets like DeepMath yielded larger reasoning gains than multimodal ones, indicating that textual complexity better stimulates reasoning abilities. GRAMMAR supports the Staged Reinforcement Optimization (SRO) framework: first a multimodal RL phase, enhanced by PAD to prevent stalled learning and by an efficient-length reward to curb verbosity, followed by a text-only RL phase to boost reasoning and language fluency.
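The article does not spell out the efficient-length reward, so the following is a hypothetical shaping function in that spirit: full credit for a correct answer, with a penalty that grows once the response overruns a token budget, so the policy is not rewarded for verbosity. All names and constants here are illustrative.

```python
def efficiency_reward(is_correct, length, target_len=1024, alpha=0.2):
    """Hypothetical length-aware reward: correctness gives the base reward,
    and tokens beyond `target_len` incur a capped linear penalty."""
    base = 1.0 if is_correct else 0.0
    overflow = max(0, length - target_len) / target_len
    return base - alpha * min(overflow, 1.0)

print(efficiency_reward(True, 800))    # 1.0: correct and within budget
print(efficiency_reward(True, 2048))   # 0.8: correct but penalized for length
```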
Three-Stage Training Pipeline
ReVisual-R1’s training proceeds in three structured stages: pure text data first to establish a solid language foundation, multimodal RL next to improve visual-text reasoning, and text-only RL last to refine reasoning and fluency. Evaluated across multiple benchmarks, the model outperformed open-source peers and even some commercial models on multimodal and mathematical reasoning tasks, achieving top scores on 9 out of 10 benchmarks. Ablation studies underscored the importance of the training sequence and of PAD, which focused updates on high-quality responses and significantly boosted overall performance.
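To make the ordering concrete, here is a small sketch of the staged curriculum expressed as data plus a driver loop; the stage names, dataset labels, and trainer hook are our own framing, not the authors' released code.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Stage:
    name: str
    data: str
    objective: str

# The three stages as described above; labels paraphrase the article,
# while the code structure itself is our illustration.
CURRICULUM = [
    Stage("text_cold_start", "text-only reasoning corpus", "supervised fine-tuning"),
    Stage("multimodal_rl", "GRAMMAR image-text tasks", "GRPO + PAD with efficient-length reward"),
    Stage("text_rl", "text-only RL prompts", "reasoning and fluency refinement"),
]

def run_curriculum(train_stage: Callable[[Stage], None]) -> None:
    # Order matters: the ablations cited above indicate that reordering
    # or dropping stages degrades final reasoning performance.
    for stage in CURRICULUM:
        train_stage(stage)

if __name__ == "__main__":
    run_curriculum(lambda s: print(f"[{s.name}] {s.objective} on {s.data}"))
```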
Contributions and Impact
ReVisual-R1 is a 7B parameter open-source MLLM designed to address the complexities of multimodal reasoning. Rather than relying solely on model scale, it employs a carefully crafted three-stage training strategy: strong text-only pretraining for foundational reasoning, multimodal RL enhanced with PAD for stable learning, and a final text-only RL phase for reasoning refinement. This approach substantially elevates performance, setting a new standard among 7B models and excelling in challenging tasks such as MathVerse and AIME. The research demonstrates how structured training curricula can unlock deeper reasoning capabilities in multimodal language models.