
Polaris-4B and Polaris-7B: Scalable Reinforcement Learning Unlocks Advanced Math and Logic Reasoning

Polaris-4B and Polaris-7B introduce a novel reinforcement learning recipe that scales reasoning capabilities efficiently, achieving state-of-the-art results on math benchmarks with smaller models.

The Demand for Scalable Reasoning Models

Advanced reasoning models are crucial in machine intelligence fields such as mathematical problem solving and symbolic reasoning. These models aim to replicate human-like multi-step calculations and logical deductions. Although reinforcement learning (RL) techniques are used post-pretraining to improve accuracy, scaling these methods efficiently remains challenging. Researchers seek smaller, resource-efficient models that maintain high reasoning performance by focusing on data quality, exploration strategies, and long-context generalization.

Challenges in Reinforcement Learning for Large Models

One significant issue in RL for large reasoning models is balancing task difficulty against model capability. Tasks that are too simple lead to stagnation, while tasks that are too hard yield no learning signal because every rollout fails. This imbalance is especially pronounced when recipes designed for smaller models are applied to larger architectures. In addition, existing approaches lack mechanisms to adapt rollout diversity and output length dynamically during training and inference, which limits reasoning performance on complex benchmarks. The sketch below illustrates why tasks at either extreme contribute nothing to learning.
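To see why extreme difficulty levels produce no signal, consider group-normalized advantages of the kind used in GRPO-style objectives: each rollout's reward is centered and scaled against the other rollouts for the same prompt. If every rollout succeeds (task too easy) or every rollout fails (task too hard), all advantages collapse to zero. A minimal sketch, assuming binary rewards:

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages (GRPO-style): each rollout's reward is
    centered and scaled by the statistics of its own group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A well-calibrated task: mixed outcomes give non-zero advantages.
print(group_advantages([1, 0, 1, 0]))   # roughly [1.0, -1.0, 1.0, -1.0]

# Too easy (all correct) or too hard (all wrong): every advantage is 0,
# so the group contributes no gradient signal at all.
print(group_advantages([1, 1, 1, 1]))   # [0.0, 0.0, 0.0, 0.0]
print(group_advantages([0, 0, 0, 0]))   # [0.0, 0.0, 0.0, 0.0]
```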

Limitations of Previous Post-Training Methods

Previous recipes such as DeepScaleR, built on GRPO, have successfully enhanced small reasoning models of around 1.5 billion parameters. However, applying the same methods to larger models such as Qwen3-4B or DeepSeek-R1-Distill-Qwen-7B yields minimal improvement or even performance drops. The main culprits are static data distributions and limited sampling diversity: the training data is not adapted to the model's capability, and neither sampling temperature nor response length is controlled dynamically.

Introducing Polaris: A Novel Recipe for Scalable Reinforcement Learning

Researchers from the University of Hong Kong, ByteDance Seed, and Fudan University developed Polaris, a post-training RL recipe tailored for advanced reasoning tasks. Polaris includes Polaris-4B-Preview (fine-tuned from Qwen3-4B) and Polaris-7B-Preview (based on DeepSeek-R1-Distill-Qwen-7B). The framework is model-agnostic and combines data difficulty adjustment, controlled sampling temperature for diverse exploration, and extended inference through length extrapolation. Both models are optimized to run on consumer-grade GPUs and are trained with open-source data and pipelines.

Innovations in Polaris

Polaris curates training data by filtering out examples that are either trivially easy or unsolvable for the model, producing a balanced, J-shaped difficulty distribution that evolves with the model's capability. Sampling temperature is raised in stages during training (1.4, 1.45, and 1.5 for Polaris-4B; 0.7, 1.0, and 1.1 for Polaris-7B) to maintain rollout diversity. A YaRN-based extrapolation method extends the inference context length up to 96K tokens without additional training, enabling a "train-short, test-long" paradigm. Additional techniques, such as the Rollout Rescue Mechanism and Intra-Batch Informative Substitution, prevent zero-reward batches and preserve useful training signal even with a small rollout size of 8. A sketch of how the difficulty filter and temperature schedule could be wired together follows below.
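The following is a minimal sketch of two of these controls, assuming a pass-rate-based difficulty filter and a staged temperature schedule. The filter thresholds and the stage boundaries are assumptions for illustration; only the temperature values themselves are reported for Polaris.

```python
def filter_by_difficulty(problems, pass_rates, lo=0.0, hi=1.0):
    """Drop problems the current model always solves (too easy) or never
    solves (no learning signal), keeping moderately hard items.
    The open-interval thresholds are illustrative, not the paper's values."""
    return [p for p, rate in zip(problems, pass_rates) if lo < rate < hi]

# Staged sampling temperatures reported for Polaris. When each stage begins
# (i.e., which training step triggers the switch) is an assumption here.
TEMPERATURE_STAGES = {
    "Polaris-4B-Preview": [1.40, 1.45, 1.50],
    "Polaris-7B-Preview": [0.70, 1.00, 1.10],
}

def rollout_temperature(model_name, stage):
    """Return the sampling temperature for the current training stage,
    raising rollout diversity as training progresses."""
    schedule = TEMPERATURE_STAGES[model_name]
    return schedule[min(stage, len(schedule) - 1)]
```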

Benchmark Performance

Polaris models achieve state-of-the-art accuracy across math benchmarks. Polaris-4B-Preview attains 81.2% on AIME24 and 79.4% on AIME25, outperforming Qwen3-32B despite having only a fraction of its parameters (4B vs. 32B). It also scores 44.0% on Minerva Math, 69.1% on Olympiad Bench, and 94.8% on AMC23. Polaris-7B-Preview reaches 72.6% on AIME24 and 52.6% on AIME25. These results surpass models such as Claude-4-Opus and Grok-3-Beta, positioning Polaris as a competitive lightweight model that narrows the gap between small open models and commercial models with 30B+ parameters.

Efficient Reinforcement Learning via Smart Post-Training

The key to scaling reasoning models lies in smart control over training data difficulty, sampling diversity, and inference length rather than simply increasing model size. Polaris provides a reproducible recipe that effectively adjusts these factors, enabling smaller models to rival the reasoning power of massive commercial systems.

Explore the model and code to learn more. All credit goes to the researchers involved in this project.
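As a starting point for experimentation, here is a minimal inference sketch using Hugging Face transformers. The checkpoint name and generation settings are assumptions for illustration; consult the project's release page for the exact model IDs and recommended sampling parameters.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "POLARIS-Project/Polaris-4B-Preview"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Find all real x such that x^2 - 5x + 6 = 0. Think step by step."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.4,       # high-diversity sampling, in the spirit of the recipe
    top_p=0.95,            # assumed value, not reported in the article
    max_new_tokens=4096,   # long reasoning traces need a generous budget
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```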
