
MiroMind-M1 Sets New Standards in Open-Source Mathematical Reasoning with Innovative Multi-Stage Reinforcement Learning

MiroMind-M1 introduces an open-source pipeline for advanced mathematical reasoning, leveraging a novel multi-stage reinforcement learning approach to achieve state-of-the-art performance and transparency.

Advancing Mathematical Reasoning with Open-Source Models

Large language models (LLMs) have shown impressive progress in multi-step reasoning, particularly in solving complex mathematical problems. While proprietary models like GPT-4o and Claude Sonnet 4 lead the field, their closed-source nature limits transparency and reproducibility. The MiroMind AI team addresses this by releasing the MiroMind-M1 series, an entirely open-source pipeline including datasets, models, training code, and evaluation scripts. The initiative advances both openness and state-of-the-art mathematical reasoning within the Qwen-2.5 model ecosystem.

Architecture and Training Approach

MiroMind-M1 is built upon the Qwen-2.5 backbone enhanced specifically for mathematical reasoning. The training protocol involves two main stages:

  • Supervised Fine-Tuning (SFT): The model is fine-tuned on 719,000 carefully curated and verified mathematical problems, enabling strong step-by-step reasoning capabilities.
  • Reinforcement Learning with Verifiable Rewards (RLVR): Following SFT, the model is further trained on 62,000 challenging math problems using reinforcement learning guided by a robust external verifier that provides accurate reward signals.

This two-stage approach leverages the strengths of chain-of-thought imitation and precise reward-driven reinforcement to improve accuracy and efficiency.
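
The defining property of the RLVR stage is that the reward can be checked mechanically rather than learned. The sketch below illustrates such a verifiable reward; it assumes final answers appear in a `\boxed{...}` wrapper and uses simple string/numeric comparison, whereas the MiroMind verifier handles many more answer formats:

```python
import re

def extract_boxed_answer(completion: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a model completion."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 if the predicted final answer matches the reference, else 0.0."""
    predicted = extract_boxed_answer(completion)
    if predicted is None:
        return 0.0
    # Exact string match first, then a numeric comparison as a fallback.
    if predicted == reference.strip():
        return 1.0
    try:
        return 1.0 if abs(float(predicted) - float(reference)) < 1e-6 else 0.0
    except ValueError:
        return 0.0

# A chain-of-thought ending in the correct boxed answer earns full reward.
print(verifiable_reward("... therefore the answer is \\boxed{42}.", "42"))  # 1.0
```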

High-Quality and Transparent Data

The MiroMind-M1 project emphasizes full transparency and cleanliness in its training data:

  • The SFT corpus includes datasets like OpenR1, OpenThoughts, Light-R1, and Synthetic-1, all containing verified solutions and rich multi-step reasoning traces.
  • Stringent deduplication and decontamination procedures remove overlaps and data leaks with evaluation sets such as AIME24, AIME25, and MATH500.
  • Preference is given to longer reasoning trajectories, as experiments show they lead to higher benchmark scores because they carry richer intermediate reasoning.

This results in a dataset of 719,000 verified training traces, significantly advancing reproducible open research.
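
Decontamination of this kind is commonly implemented as an n-gram overlap check between training problems and evaluation problems. The report does not spell out MiroMind's exact procedure, so the snippet below is only a generic illustration of the idea:

```python
def ngrams(text: str, n: int = 10) -> set[tuple[str, ...]]:
    """Lowercased word n-grams used as a fuzzy fingerprint of a problem statement."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_problem: str, eval_problems: list[str], n: int = 10) -> bool:
    """Flag a training problem if it shares any n-gram with an evaluation problem."""
    train_grams = ngrams(train_problem, n)
    return any(train_grams & ngrams(e, n) for e in eval_problems)

# Training items overlapping AIME24/AIME25/MATH500 statements would be dropped.
eval_set = ["find the number of ordered pairs of integers such that the sum is even"]
print(is_contaminated("Find the number of ordered pairs of integers such that the sum is even", eval_set))
```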

Superior Fine-Tuning Performance

MiroMind-SFT-7B, initialized from Qwen2.5-Math-7B, is trained with a large context window of up to 32,768 tokens and employs a no-packing strategy to prevent cross-sample attention contamination. Its performance surpasses peer open models on critical benchmarks:

| Model               | AIME24 | AIME25 | MATH500 |
|---------------------|--------|--------|---------|
| DeepSeek-R1-Distill | 55.5   | 40.4   | 92.8    |
| MiMo-7B-SFT         | 58.7   | 44.3   | 93.0    |
| MiroMind-SFT-7B     | 60.4   | 50.4   | 94.6    |

These results demonstrate the effectiveness of meticulous data curation and training design.
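
The "no-packing" choice means each training example occupies its own padded sequence, so attention never flows between unrelated problems, as it can when short samples are concatenated into one long sequence. A minimal collate sketch, assuming token IDs already come from the tokenizer and using `0` as an illustrative pad ID:

```python
import torch

def collate_no_packing(batch: list[list[int]], pad_id: int = 0) -> dict[str, torch.Tensor]:
    """Pad each sample to the batch maximum instead of packing samples end to end.

    Every row holds exactly one problem/solution trace, so a standard causal mask keeps
    attention within the sample and no cross-sample contamination can occur.
    """
    max_len = max(len(ids) for ids in batch)
    input_ids = torch.full((len(batch), max_len), pad_id, dtype=torch.long)
    attention_mask = torch.zeros((len(batch), max_len), dtype=torch.long)
    for row, ids in enumerate(batch):
        input_ids[row, :len(ids)] = torch.tensor(ids, dtype=torch.long)
        attention_mask[row, :len(ids)] = 1
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two samples of different lengths stay in separate rows rather than being concatenated.
print(collate_no_packing([[5, 6, 7], [8, 9]])["input_ids"])
```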

CAMPO: Context-Aware Multi-Stage Policy Optimization

A key innovation during the RLVR phase is the CAMPO algorithm, which tackles challenges like training instability and token inefficiency:

  • Multi-stage training gradually expands output length limits, starting from 16,000 tokens to allow deeper reasoning while maintaining efficiency.
  • A dynamic repetition penalty discourages early or excessive repetition, preserving output diversity.
  • An improved external verifier accurately evaluates complex math answers, including tricky cases involving units, π, and percentages.

CAMPO stabilizes reinforcement learning and produces models that solve problems using fewer, more relevant tokens, accelerating inference and reducing costs without compromising accuracy.
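
The paper describes CAMPO only at a high level, so the following is a loose sketch of how a staged length cap and a repetition penalty could be layered on top of the verifiable reward; the thresholds and penalty form here are assumptions, not CAMPO's actual formulas:

```python
def repetition_fraction(text: str, n: int = 20) -> float:
    """Fraction of word n-grams that are duplicates; higher means more repetition."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def campo_style_reward(completion: str, correct: bool, stage_max_tokens: int,
                       num_tokens: int, penalty_weight: float = 0.5) -> float:
    """Correctness reward, zeroed past the stage's length cap and dampened by repetition."""
    if num_tokens > stage_max_tokens:  # over the current stage's output budget
        return 0.0
    base = 1.0 if correct else 0.0
    return max(0.0, base - penalty_weight * repetition_fraction(completion))

# Stage 1 might cap outputs at 16k tokens; later stages raise the cap for deeper reasoning.
print(campo_style_reward("step one ... step two ...", correct=True,
                         stage_max_tokens=16_000, num_tokens=1_200))
```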

Benchmark Results and Efficiency

MiroMind’s open models achieve highly competitive or state-of-the-art results among Qwen-2.5-based math models (7B and 32B parameters):

| Model            | AIME24 | AIME25 | MATH500 |
|------------------|--------|--------|---------|
| DeepSeek-R1-7B   | 55.5   | 39.2   | –       |
| MiMo-7B-RL       | 68.2   | 55.4   | 95.8    |
| Skywork-OR1-7B   | 72.2   | 54.6   | –       |
| MiroMind-RL-7B   | 73.4   | 57.8   | 96.7    |
| Skywork-OR1-32B  | 77.1   | 68.2   | 97.5    |
| MiroMind-RL-32B  | 77.5   | 65.6   | 96.4    |

The 32B MiroMind-RL model notably produces shorter and more concise solutions without losing correctness, highlighting CAMPO’s efficiency.

Open-Source Full Stack and Reproducibility

All components of the MiroMind-M1 project are openly available:

  • Model weights for both SFT and RL checkpoints (7B and 32B)
  • Complete datasets (719K SFT and 62K RLVR)
  • Training scripts supporting multi-node distributed training
  • Evaluation code with standardized scripts and benchmark configurations

This comprehensive openness enables researchers to replicate, audit, and extend the work, fostering reproducibility and accelerating new research in mathematical reasoning LLMs.
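
For readers who simply want to try the released checkpoints, the standard Hugging Face `transformers` loading pattern applies. The repository ID below is a guess at the naming convention and should be checked against the project's actual Hugging Face page:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repository ID; confirm the exact name on the MiroMind Hugging Face page.
model_id = "miromind-ai/MiroMind-M1-RL-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Solve: if x + 3 = 10, what is x?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                       return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```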

Resources

For more information, check out the [Paper], [GitHub Page], and the model on [Hugging Face].
