
Revolutionizing Math Reasoning: How 1-Shot Reinforcement Learning Boosts LLM Performance

Researchers show that training a large language model with reinforcement learning on a single example (1-shot RLVR) significantly enhances its math reasoning abilities, matching results obtained from much larger datasets.

Breakthrough in Mathematical Reasoning with LLMs

Recent advancements in large language models (LLMs) like OpenAI-o1, DeepSeek-R1, and Kimi-1.5 have made significant strides in solving complex math problems. A key innovation behind this progress is Reinforcement Learning with Verifiable Reward (RLVR), which rewards a model only when its final answer can be automatically checked as correct, fostering abilities such as self-reflection and improved generalization.
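To make this concrete, a verifiable reward in this setting is typically a simple programmatic check of the final answer rather than a learned reward model. Below is a minimal sketch of such a check; the function name and the \boxed{...} answer convention are illustrative assumptions, not details taken from the paper.

```python
import re

def verifiable_reward(model_output: str, reference_answer: str) -> float:
    """Return 1.0 if the model's final boxed answer matches the reference, else 0.0.

    Assumes the model is prompted to wrap its final answer in \\boxed{...},
    a common convention for math benchmarks such as MATH500.
    """
    match = re.search(r"\\boxed\{([^{}]*)\}", model_output)
    if match is None:
        return 0.0
    predicted = match.group(1).strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0

# Example: a correct completion receives reward 1.0
print(verifiable_reward(r"... so the result is \boxed{42}", "42"))
```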

Data Efficiency in RLVR: The Role of Minimal Training Examples

Most prior work has focused on improving the reinforcement learning algorithms themselves, such as PPO and GRPO, while the impact of training data quantity and quality on RLVR effectiveness has received less attention. Earlier efforts like LIMR showed that a much smaller dataset could maintain performance, but the extreme scenario of using only one or a few examples remained under-investigated.

1-Shot RLVR: Unlocking Power with a Single Example

Researchers from the University of Washington, Microsoft, USC, and other institutions demonstrated that training LLMs with just one example using 1-shot RLVR significantly enhances mathematical reasoning. For instance, applying this method to the Qwen2.5-Math-1.5B model improved its MATH500 benchmark accuracy from 36.0% to 73.6%, equaling results achieved with much larger datasets.

Generalization Across Models and Domains

The 1-shot RLVR improvement generalizes across models, tasks, and domains, and gains from a single math example often transfer to tasks outside its domain. The study also observes what it calls "post-saturation generalization": test performance keeps improving long after accuracy on the single training example has saturated. The researchers further identify policy gradient loss and entropy-driven exploration as the key drivers of these gains.

Data Selection and Training Details

The study selected training examples from a subset of the DeepScaleR dataset and evaluated on the MATH benchmark, training models including Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Llama-3.2-3B-Instruct, and DeepSeek-R1-Distill-Qwen-1.5B. Training used the Verl pipeline with the hyperparameters reported in the paper. Surprisingly, training on just one or two specific examples (π1 and π13) led to strong generalization beyond math tasks, despite typical signs of overfitting on the training example itself.
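The article does not explain how π1 and π13 were chosen. One plausible selection heuristic, consistent with the paper's reported practice of ranking examples by how much their training accuracy varies across checkpoints, is sketched below; the function, array layout, and toy data are illustrative assumptions.

```python
import numpy as np

def rank_by_historical_variance(accuracy_history: np.ndarray) -> np.ndarray:
    """Rank candidate training examples by the variance of their accuracy
    across training checkpoints, highest variance first.

    accuracy_history: shape (num_checkpoints, num_examples); each entry is the
    average rollout accuracy of one example at one checkpoint.
    """
    variance_per_example = accuracy_history.var(axis=0)
    return np.argsort(-variance_per_example)  # indices of the most "informative" examples

# Toy usage: 3 checkpoints x 4 candidate examples
history = np.array([
    [0.0, 0.2, 1.0, 0.1],
    [0.1, 0.6, 1.0, 0.1],
    [0.2, 0.9, 1.0, 0.2],
])
print(rank_by_historical_variance(history))  # example 1 varies most, so it ranks first
```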

Insights into Mechanisms Behind 1-Shot RLVR

The findings suggest that base LLMs already have intrinsic reasoning capabilities that can be unlocked with minimal examples. Policy gradient loss is critical to the success of 1-shot RLVR, and entropy regularization further enhances model exploration and post-saturation generalization.
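As a rough illustration of these two ingredients, the sketch below combines a REINFORCE-style policy gradient term with an entropy bonus over the token distribution. It is a simplified stand-in rather than the paper's exact GRPO objective, and all names and tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def policy_gradient_loss_with_entropy(logits: torch.Tensor,
                                      actions: torch.Tensor,
                                      advantages: torch.Tensor,
                                      entropy_coef: float = 0.01) -> torch.Tensor:
    """Simplified policy-gradient loss with an entropy bonus.

    logits:     (batch, seq_len, vocab) token logits from the policy
    actions:    (batch, seq_len) sampled token ids
    advantages: (batch,) scalar advantage per sequence (e.g., reward minus baseline)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Log-probability of the tokens actually sampled, summed over the sequence.
    action_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1).sum(dim=-1)

    # Policy gradient term: push up sequences with positive advantage.
    pg_loss = -(advantages * action_log_probs).mean()

    # Entropy bonus: keep the token distribution from collapsing, encouraging exploration.
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1).mean()

    return pg_loss - entropy_coef * entropy
```

The entropy term discourages the policy from collapsing onto a single answer pattern, which matches the exploration role the study attributes to entropy regularization.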

Implications and Future Directions

This research highlights the potential to drastically reduce training data while maintaining or improving LLM performance on reasoning tasks. It also underscores the importance of careful data selection and exploration strategies, especially in resource-constrained settings.

For more details, check out the original paper and the GitHub repository linked in the source.
