
Surprising Math Reasoning Gains from Incorrect and Random Rewards in Qwen2.5-Math

Qwen2.5-Math models improve math reasoning significantly even when trained with incorrect or random reward signals, highlighting unique reinforcement learning dynamics not seen in other models.

Reinforcement Learning with Verifiable Rewards in NLP

In natural language processing (NLP), reinforcement learning (RL) methods such as reinforcement learning from human feedback (RLHF) improve model outputs by optimizing responses against a feedback signal. Reinforcement learning with verifiable rewards (RLVR) extends this idea by replacing human judgments with automatically checkable signals, such as mathematical correctness or structural features of the output, enabling large-scale tuning of language models. The goal is to strengthen mathematical, logical, and structural reasoning without extensive human supervision.
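As a concrete (and deliberately simplified) illustration, a verifiable reward for math problems can be as simple as checking whether the model's final answer matches a reference answer. The sketch below is not from the paper; `extract_final_answer` is a hypothetical helper, and real RLVR pipelines use far more robust answer parsing.

```python
# Minimal sketch of a verifiable reward for math answers (illustrative only).

def extract_final_answer(response: str) -> str:
    """Naive stand-in parser: treat the last non-empty line as the final answer."""
    lines = [line.strip() for line in response.strip().splitlines() if line.strip()]
    return lines[-1] if lines else ""

def verifiable_reward(response: str, reference_answer: str) -> float:
    """Binary verifiable reward: 1.0 if the parsed answer matches the reference, else 0.0."""
    return 1.0 if extract_final_answer(response) == reference_answer else 0.0
```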

Challenges of Learning with Imperfect Supervision

Building models that reason effectively under minimal or noisy supervision remains difficult. In mathematical problem-solving, perfect ground-truth labels are often impractical to obtain at scale, which raises the question of whether models can still learn from noisy, misleading, or even incorrect signals. Methods that depend on perfect feedback risk breaking down when such supervision is unavailable, limiting their real-world usefulness.

Exploring Reward Signals on Qwen2.5-Math

Researchers from the University of Washington, Allen Institute for AI, and UC Berkeley tested various reward signals on Qwen2.5-Math, a family of models fine-tuned for math reasoning. These included ground-truth rewards, majority-vote rewards, format-based rewards, random rewards, and incorrect rewards. Surprisingly, even spurious signals like random and incorrect rewards led to significant performance gains.

For example, training Qwen2.5-Math-7B with RLVR yielded the following accuracy gains on MATH-500:

  • Ground-truth rewards: +28.8% accuracy
  • Majority-vote rewards: +26.5% accuracy
  • Incorrect labels: +24.6% accuracy
  • Random rewards: +21.4% accuracy
  • Format rewards: +16.4% accuracy

Qwen2.5-Math-1.5B also showed strong gains, e.g., +24.4% accuracy with incorrect labels. Conversely, other model families such as Llama3 and OLMo2 did not benefit and sometimes performed worse with spurious rewards.
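To make the comparison concrete, the reward variants above can be written informally as small scoring functions. The sketch below is illustrative only, not the authors' implementation, and reuses the hypothetical `extract_final_answer` parser from the earlier sketch.

```python
import random
from collections import Counter

# Illustrative versions of the reward variants compared in the study
# (not the authors' implementation).

def extract_final_answer(response: str) -> str:
    """Hypothetical parser, repeated from the earlier sketch for completeness."""
    lines = [line.strip() for line in response.strip().splitlines() if line.strip()]
    return lines[-1] if lines else ""

def ground_truth_reward(response: str, reference: str) -> float:
    # Rewards agreement with the correct label.
    return 1.0 if extract_final_answer(response) == reference else 0.0

def incorrect_label_reward(response: str, wrong_reference: str) -> float:
    # Rewards agreement with a deliberately wrong label.
    return 1.0 if extract_final_answer(response) == wrong_reference else 0.0

def random_reward(response: str) -> float:
    # Ignores the response entirely.
    return 1.0 if random.random() < 0.5 else 0.0

def format_reward(response: str) -> float:
    # Rewards only the presence of an expected answer format, e.g. \boxed{...}.
    return 1.0 if "\\boxed{" in response else 0.0

def majority_vote_reward(response: str, sampled_responses: list[str]) -> float:
    # Uses the most common answer across sampled responses as a pseudo-label.
    answers = [extract_final_answer(r) for r in sampled_responses]
    pseudo_label = Counter(answers).most_common(1)[0][0]
    return 1.0 if extract_final_answer(response) == pseudo_label else 0.0
```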

Emergence of Code Reasoning Behavior

A key insight was that Qwen models increasingly generated math solutions in a code-like format, particularly resembling Python, regardless of the reward type. The frequency of code reasoning rose from 66.7% to over 90% during training with spurious rewards, and answers containing code reasoning reached about 64% accuracy versus roughly 29% for those without it. This suggests spurious rewards may surface latent reasoning capabilities learned during pretraining rather than teach new skills.
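Measuring "code reasoning frequency" requires deciding when a response counts as code-style reasoning. The heuristic below is a rough illustration of one way to do this, not the classifier used in the paper; the marker list is an assumption.

```python
# Rough heuristic (not the paper's classifier) for flagging Python-style
# code reasoning in model responses.

CODE_MARKERS = ("def ", "import ", "print(")  # assumed markers, illustrative only

def looks_like_code_reasoning(response: str) -> bool:
    """Flag a response as code reasoning if it contains any Python-like marker."""
    return any(marker in response for marker in CODE_MARKERS)

def code_reasoning_frequency(responses: list[str]) -> float:
    """Fraction of responses flagged as code reasoning."""
    if not responses:
        return 0.0
    return sum(looks_like_code_reasoning(r) for r in responses) / len(responses)
```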

Robustness and Model-Specific Results

The gains from random and incorrect rewards nearly matched those from ground-truth rewards, especially for Qwen models. Improvements held across benchmarks such as AMC and AIME 2024, though ground-truth labels retained some advantage. In contrast, non-Qwen models such as Llama3.1-8B saw performance drops of up to 8.5% with spurious rewards, underscoring that these benefits are model-specific.

Key Takeaways

  • Qwen2.5-Math-7B achieved up to a 28.8% accuracy gain with ground-truth rewards and up to 24.6% with incorrect rewards.
  • Code reasoning patterns increased significantly under RLVR, boosting accuracy.
  • Non-Qwen models did not benefit and sometimes declined with spurious rewards.
  • Gains appeared rapidly within 50 training steps, indicating fast elicitation of reasoning abilities.
  • Caution is advised in generalizing RLVR results from Qwen models to others.

These findings emphasize the need to validate RLVR methods across diverse model architectures rather than relying solely on Qwen-centric results.

Further Resources

For more details, see the paper, official release, and GitHub page linked in the original announcement.
