VL-Cogito: Curriculum RL and Adaptive Length Rewards Transform Multimodal Reasoning
VL-Cogito uses staged curriculum reinforcement learning and adaptive length rewards to significantly boost multimodal reasoning on math, science, and chart benchmarks, outperforming several prior MLLMs.
Tackling multimodal reasoning challenges
Multimodal reasoning requires integrating signals from text, images and diagrams to solve problems in math, science, logic and chart interpretation. VL-Cogito, developed by DAMO Academy and partners, targets instability and domain gaps in this space by applying a dedicated reinforcement learning pipeline that boosts stepwise reasoning and general understanding.
Progressive Curriculum Reinforcement Learning (PCuRL)
The core of VL-Cogito is the PCuRL framework, which stages RL training into easy, medium and hard phases. Instead of a single, undifferentiated RL stage, PCuRL gradually increases task difficulty exposure while adapting reward and weighting mechanisms so the model builds reliable reasoning skills across domains.
Online Difficulty Soft Weighting (ODSW)
ODSW assigns dynamic weights to training samples based on sample difficulty and the model's current capabilities. Rather than discarding samples deemed too easy or too hard, ODSW adjusts each prompt's contribution to gradients using a piecewise function tuned to prioritize easy, medium or hard stages. This continuous curriculum is grounded in learnability theory and empirical difficulty distributions, enabling smoother progression from straightforward cases to intricate problems.
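To make the idea concrete, here is a minimal sketch of such a soft weighting scheme. The piecewise shapes, thresholds, and function names (`empirical_difficulty`, `odsw_weight`) are illustrative assumptions for exposition, not the paper's exact formulation.

```python
# Illustrative sketch of Online Difficulty Soft Weighting (ODSW).
# The piecewise shapes and the 0.1 floor are assumptions, not VL-Cogito's exact functions.

def empirical_difficulty(pass_count: int, num_rollouts: int) -> float:
    """Difficulty in [0, 1]: 0 = always solved by the model, 1 = never solved."""
    return 1.0 - pass_count / num_rollouts

def odsw_weight(difficulty: float, stage: str) -> float:
    """Soft weight on a prompt's gradient contribution in the current curriculum stage."""
    if stage == "easy":    # emphasize prompts the model already solves most of the time
        return max(0.1, 1.0 - difficulty)
    if stage == "medium":  # peak around intermediate difficulty
        return max(0.1, 1.0 - 2.0 * abs(difficulty - 0.5))
    if stage == "hard":    # emphasize prompts the model rarely solves
        return max(0.1, difficulty)
    raise ValueError(f"unknown stage: {stage}")

# Example: a prompt solved in 12 of 16 rollouts during the easy stage
w = odsw_weight(empirical_difficulty(pass_count=12, num_rollouts=16), stage="easy")
```

The nonzero floor keeps every prompt in play, matching the design goal of reweighting rather than discarding samples outright.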
Dynamic Length Reward (DyLR)
Static length rewards can push models toward verbosity or encourage premature truncation of reasoning. DyLR computes an ideal response length for each prompt by estimating the average length of correct rollouts for that question. Easy tasks therefore reward shorter, efficient chains; harder tasks receive incentives for deeper, multi-step exploration. This balances efficiency with correctness across heterogeneous tasks.
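The adaptive target can be sketched as follows, assuming the target is the mean length of a prompt's correct rollouts and the reward decays linearly with relative deviation from that target; the decay shape and names (`target_length`, `length_reward`) are assumptions, not the paper's exact reward.

```python
# Illustrative sketch of a Dynamic Length Reward (DyLR).
# The per-prompt target comes from correct rollouts; the linear decay is an assumption.

from statistics import mean

def target_length(rollouts: list[dict]) -> float | None:
    """Mean token length of the correct rollouts for this prompt, if any exist."""
    correct = [r["length"] for r in rollouts if r["is_correct"]]
    return mean(correct) if correct else None

def length_reward(response_length: int, target: float | None, scale: float = 0.5) -> float:
    """Reward in [0, 1] that peaks when the response length matches the target."""
    if target is None:  # no correct rollout to calibrate against
        return 0.0
    deviation = abs(response_length - target) / target
    return max(0.0, 1.0 - scale * deviation)

# Easy prompts tend to have short correct rollouts, so concise answers score well;
# hard prompts have longer correct rollouts, rewarding deeper multi-step chains.
rollouts = [{"length": 180, "is_correct": True}, {"length": 420, "is_correct": False}]
r_len = length_reward(response_length=200, target=target_length(rollouts))
```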
Training pipeline and hyperparameters
VL-Cogito starts RL post-training directly from the Qwen2.5-VL-7B-Instruct backbone, without a supervised fine-tuning cold start. PCuRL runs three sequential RL stages (easy, medium, hard). In each stage the same dataset is reshuffled to expose the model to different generalization challenges, ODSW biases gradient updates toward the target difficulty, and DyLR is activated in the hard stage to expand reasoning chains; a schematic of this loop follows the setup list below.
Key technical setup:
- Optimizer: AdamW, learning rate 1e-6, DeepSpeed-ZeRO3
- Rollout batch size: 512; global batch size: 128; sequence length: 4096
- KL loss weight: 1e-3; 16 response samples per prompt; temperature 1.0
- Reward hyperparameters: α=1, β=0.5, γ=1, w=0.25 (penalty for zero-accuracy prompts)
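The stages, ODSW weighting, and hard-stage DyLR could be wired together roughly as follows. The `trainer` and `dataset` objects and their methods (`sample_rollouts`, `grpo_update`) are hypothetical placeholders; only the hyperparameter values mirror the list above, so treat this as a schematic rather than the released training code.

```python
# Schematic PCuRL training loop. Trainer/dataset methods are hypothetical placeholders;
# hyperparameters mirror the setup listed above. Reuses the odsw_weight sketch from earlier.

CONFIG = {
    "lr": 1e-6,
    "rollout_batch_size": 512,
    "global_batch_size": 128,
    "max_seq_len": 4096,
    "kl_weight": 1e-3,
    "samples_per_prompt": 16,
    "temperature": 1.0,
}

STAGES = ["easy", "medium", "hard"]

def train_pcurl(trainer, dataset):
    for stage in STAGES:
        dataset.shuffle()  # same data, reshuffled for each curriculum stage
        for batch in dataset.iter_batches(CONFIG["rollout_batch_size"]):
            rollouts = trainer.sample_rollouts(
                batch,
                n=CONFIG["samples_per_prompt"],
                temperature=CONFIG["temperature"],
                max_len=CONFIG["max_seq_len"],
            )
            # ODSW: bias gradient contributions toward the stage's target difficulty
            weights = [odsw_weight(prompt.difficulty, stage) for prompt in batch]
            trainer.grpo_update(
                rollouts,
                prompt_weights=weights,
                use_dynamic_length_reward=(stage == "hard"),  # DyLR only in the hard stage
                kl_weight=CONFIG["kl_weight"],
                lr=CONFIG["lr"],
            )
```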
Dataset curation and sampling
The training set covers 23 open-source multimodal datasets across six categories: Mathematical Reasoning, Logical Reasoning, Counting, Science Reasoning, Chart Understanding and General Image Understanding. All samples were reformulated into open-ended QA to prevent shortcuts from multiple-choice cues.
Difficulty sampling uses a trial filter with Qwen2.5-VL-7B-Instruct: any sample passed by that model at ≥50% accuracy over 8 runs is dropped, ensuring the RL phase focuses on genuinely challenging cases.
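A minimal sketch of that pre-filter, assuming helper functions (`generate_answer`, `is_correct`) that stand in for the actual rollout and grading code:

```python
# Illustrative difficulty pre-filter: drop any sample the baseline model already
# answers correctly in at least 50% of 8 attempts. Helper functions are hypothetical.

def keep_for_rl(model, sample, num_trials: int = 8, max_pass_rate: float = 0.5) -> bool:
    passes = sum(
        is_correct(generate_answer(model, sample["question"], sample["image"]),
                   sample["answer"])
        for _ in range(num_trials)
    )
    return passes / num_trials < max_pass_rate  # keep only genuinely challenging samples

rl_training_set = [s for s in raw_dataset if keep_for_rl(baseline_model, s)]
```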
Benchmarks and results
VL-Cogito was evaluated on a ten-task panel including Geometry3K, MathVerse, MathVista, ChartQA, ScienceQA, MMMU, EMMA and MMStar. Absolute accuracy improvements over the Qwen backbone include +7.6% on Geometry3K, +5.5% on MathVista, +4.9% on LogicVista, +2.2% on ScienceQA, +4.5% on EMMA and +3.8% on MMStar. VL-Cogito achieves state-of-the-art or matched top results on 6 of 10 benchmarks, excelling particularly on challenging math and scientific tasks.
Component-wise ablations show curriculum RL alone raises average scores by +0.8% compared to vanilla GRPO. Adding Dynamic Length Reward boosts performance further in difficult math domains, and ODSW outperforms binary hard-sample filtering especially on imbalanced data.
Efficiency and training dynamics
Adaptive length rewards improve token efficiency and average accuracy compared to fixed-length cosine rewards. DyLR results in longer reasoning chains for math and logic tasks and shorter chains for science and general understanding, as intended. The PCuRL hard stage triggers a noticeable spike in reasoning length and validation accuracy, while vanilla GRPO plateaus with static output length.
Case studies and behavioral observations
VL-Cogito generates detailed, stepwise reasoning and self-reflection. In math problems it decomposes solutions into granular chains and actively corrects missteps, a behavior reinforced by RL verification and advantage estimation. For classification-style multimodal problems the model evaluates options methodically before committing to an answer, demonstrating robust multimodal comprehension and reliable process reasoning.
Key insights
VL-Cogito’s pipeline supports several takeaways: learnability and intermediate-difficulty sampling accelerate progress; progressive exposure to harder prompts cultivates deeper analytic ability; fine-grained reward components (correctness, format, length) yield context-sensitive reasoning; and starting RL directly from the backbone, without an SFT cold start, can be both feasible and effective when paired with a curriculum and adaptive rewards.