<RETURN_TO_BASE

Xiaomi's MiMo-7B: Compact AI Model Excelling in Math and Code Reasoning Beyond Larger Rivals

Xiaomi's MiMo-7B is a compact language model that surpasses larger models in math and code reasoning through advanced pre-training and reinforcement learning strategies.

Rising Demand for Advanced Reasoning AI

The increasing need for AI systems capable of handling multi-step logic, mathematical proofs, and software development has driven researchers to enhance the reasoning capabilities of language models. Traditionally seen as a uniquely human skill, reasoning is now being integrated into smaller, more efficient models to expand their practical use.

Challenges in Developing Compact Reasoning Models

Achieving strong performance in both mathematics and programming within a compact model remains a significant challenge. Most top models in these domains have around 32 billion parameters or more. Smaller models often struggle due to limited generalization, sparse reward feedback in reinforcement learning, and less curated training data focused on reasoning tasks.

Existing Approaches and Their Limitations

Models like OpenAI's o-series, DeepSeek R1, and Claude 3.7 use large parameter counts and complex reinforcement learning techniques, including stepwise planning and backtracking, to improve reasoning. However, these models tend to rely heavily on post-training and less on high-quality pre-training data. They also use fixed, template-based reward systems that can be exploited, resulting in inconsistent performance on challenging code generation tasks.

Xiaomi's Innovative MiMo-7B Approach

Xiaomi's research team introduced the MiMo-7B family with a balanced focus on both pre-training and post-training phases to build reasoning capabilities. MiMo-7B-Base was trained from scratch on 25 trillion tokens using a three-stage mixture strategy that increased mathematical and programming content progressively. A multiple-token prediction objective improved both performance and inference speed.

For post-training, a curated dataset of 130,000 math and programming problems annotated with difficulty scores was used. Reinforcement learning employed a difficulty-driven reward framework, allowing nuanced feedback. This process produced two variants: MiMo-7B-RL and MiMo-7B-RL-Zero.

Data Preparation and Pre-Training Innovations

The team developed a custom HTML extractor to preserve math equations and code snippets from web pages, academic papers, and books. Enhanced PDF parsing tools ensured accurate interpretation of scientific and programming content. Global deduplication prevented data repetition. Small fine-tuned language models filtered content quality, replacing outdated heuristics. Synthetic reasoning data generated by advanced models was added in the final training stage.

This resulted in a training mix with 70% math and code data in stage two and 10% synthetic content in stage three. The model supports a maximum context length of 32,768 tokens, enabling processing of long-form reasoning problems.

Reinforcement Learning Enhancements

A seamless rollout engine with asynchronous reward computation and early termination reduced GPU idle time, accelerating training by 2.29 times and validation by 1.96 times. Fine-grained difficulty-based rewards addressed sparse feedback issues in programming benchmarks. Data re-sampling improved training stability and efficiency, allowing the model to learn effectively even from cold-start states.

Performance Highlights

MiMo-7B-Base scored 75.2 on the Big-Bench Hard task, outperforming other open-source 7B models. MiMo-7B-RL scored 55.4 on the AIME 2025 benchmark, surpassing OpenAI’s o1-mini by 4.7 points. On code generation benchmarks, it outperformed larger models like DeepSeek-R1-Zero-32B and Qwen2.5-32B-RL-Zero on LiveCodeBench v5 and v6.

Significance of MiMo-7B

This project demonstrates that with optimized pre-training, data quality, and reinforcement learning infrastructure, compact models can rival or exceed much larger models in complex reasoning tasks. Xiaomi’s approach challenges the notion that model size alone determines intelligence or versatility and highlights the potential of smaller, well-designed AI models.

Key Takeaways

  • MiMo-7B trained on 25 trillion tokens with structured data mixtures.
  • 130,000 annotated math and code problems used for reinforcement learning.
  • Three-stage pre-training with increasing math and coding content.
  • Training speed improvements via a seamless rollout engine.
  • Superior benchmark performance compared to larger models.
  • All model variants and checkpoints are publicly available.

For more information, check out the official paper and GitHub repository.

🇷🇺

Сменить язык

Читать эту статью на русском

Переключить на Русский