OpenThoughts: Advancing Scalable Data Curation for Cutting-Edge Reasoning Models
OpenThoughts introduces a scalable supervised fine-tuning (SFT) data curation pipeline that produces markedly stronger reasoning datasets and models, achieving state-of-the-art performance in math, coding, and science.
The Challenge of Curating Reasoning Data
Recent reasoning models like DeepSeek-R1 and o3 have demonstrated exceptional capabilities in mathematical, coding, and scientific tasks by leveraging post-training techniques such as supervised fine-tuning (SFT) and reinforcement learning (RL). However, the full methodologies behind these models remain proprietary, complicating efforts to build advanced reasoning systems. While SFT data curation is a powerful strategy for enhancing reasoning abilities, existing approaches often limit themselves to human-written questions or a single teacher model, leaving much of the design space unexplored. Exploring that space is also expensive, since generating diverse question-answer pairs requires substantial teacher inference and model training.
Existing Approaches and Innovations
Models like Gemini, QwQ, and DeepSeek-R1 provide reasoning traces that enable knowledge distillation for smaller models. Initiatives such as OpenR1, OpenMathReasoning, and OpenCodeReasoning collect questions from forums and competitions, while Natural Reasoning uses pre-training corpora as seeds. Some projects like S1 and LIMO curate small, high-quality prompt datasets manually. Meanwhile, DeepMath-103K and Nvidia Nemotron innovate in sourcing, filtering, and scaling data. RL-based methods such as AceReason and Skywork-OR1 further improve reasoning beyond traditional SFT.
Introducing OpenThoughts: A Scalable SFT Dataset Pipeline
OpenThoughts is a new state-of-the-art open reasoning data pipeline developed collaboratively by researchers from Stanford, University of Washington, BespokeLabs.ai, Toyota Research Institute, UC Berkeley, and others. It progresses through three iterations:
- OpenThoughts-114K: Extends the Sky-T1 pipeline with automated verification.
- OpenThoughts2-1M: Expands data volume via augmented question diversity and synthetic generation.
- OpenThoughts3-1.2M: Incorporates insights from over 1,000 ablation studies to create a simple, scalable, and high-performance data curation pipeline.
The OpenThinker3-7B model trained on this data achieves state-of-the-art results among open-data models at the 7B parameter scale.
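To make the pipeline stages concrete, below is a minimal, hypothetical sketch of what such an SFT data-curation flow looks like end to end: source questions, deduplicate them, collect teacher reasoning traces, verify the results, and emit (prompt, completion) records for fine-tuning. All function names and the placeholder teacher call are illustrative assumptions, not the project's actual code.

```python
# Hypothetical sketch of an SFT reasoning-data curation pipeline.
# Stage names mirror the steps described above; all helpers are placeholders.
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    trace: str = ""   # teacher reasoning trace
    answer: str = ""  # final answer extracted from the trace

def source_questions() -> list[str]:
    # e.g. scrape forums and competitions, or generate synthetic questions
    return ["What is 2 + 2?", "What is 2 + 2?", "Prove that sqrt(2) is irrational."]

def deduplicate(questions: list[str]) -> list[str]:
    # exact-match dedup; real pipelines also use fuzzy / embedding similarity
    return list(dict.fromkeys(questions))

def annotate_with_teacher(questions: list[str]) -> list[Example]:
    # placeholder for querying a teacher model (e.g. QwQ-32B) for reasoning traces
    return [Example(q, trace=f"<think>...reasoning about: {q}...</think>", answer="4")
            for q in questions]

def verify(examples: list[Example]) -> list[Example]:
    # keep only examples whose answers pass an automated check (verifier, unit tests, etc.)
    return [ex for ex in examples if ex.answer]

def build_sft_dataset() -> list[dict]:
    questions = deduplicate(source_questions())
    examples = verify(annotate_with_teacher(questions))
    # final SFT records: (prompt, completion) pairs for fine-tuning
    return [{"prompt": ex.question, "completion": ex.trace} for ex in examples]

if __name__ == "__main__":
    for record in build_sft_dataset():
        print(record["prompt"])
```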
Methodology and Evaluation
OpenThoughts3-1.2M was developed by independently ablating each pipeline component while holding others constant, generating 31,600 data points per strategy and fine-tuning the Qwen2.5-7B-Instruct model on each dataset. Evaluation covered eight benchmarks across mathematics (AIME24, AMC23, MATH500), coding (CodeElo, CodeForces, LiveCodeBench), and science (GPQA Diamond, JEEBench). Rigorous data decontamination was performed to remove high-similarity samples, and a held-out benchmark set ensured robust generalization testing. Evalchemy was used to maintain consistent evaluation standards.
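The decontamination step mentioned above removes training questions that closely resemble benchmark problems. A minimal sketch of one common approach, n-gram overlap filtering, is shown below; the n-gram size, tokenization, and threshold are illustrative assumptions rather than the paper's exact procedure.

```python
# Minimal n-gram-overlap decontamination sketch (illustrative, not the paper's exact method).
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(question: str, benchmark_questions: list[str],
                    n: int = 8, threshold: float = 0.5) -> bool:
    """Flag a training question whose n-gram overlap with any benchmark item is too high."""
    q_grams = ngrams(question, n)
    if not q_grams:
        return False
    for bench in benchmark_questions:
        b_grams = ngrams(bench, n)
        if not b_grams:
            continue
        overlap = len(q_grams & b_grams) / min(len(q_grams), len(b_grams))
        if overlap >= threshold:
            return True
    return False

def decontaminate(train_questions: list[str], benchmark_questions: list[str]) -> list[str]:
    # Keep only training questions that do not match held-out benchmarks.
    return [q for q in train_questions if not is_contaminated(q, benchmark_questions)]
```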
Key Findings from Pipeline Evaluation
- Question sourcing: Competitive coding questions (CodeGolf) yield the highest scores on code benchmarks (25.3-27.5), LLM-generated and human-written questions perform best in math (58.5-58.8), and questions from physics StackExchange combined with chemistry textbooks perform best in science (43.2-45.3).
- Question mixing: Combining many question sources tends to decrease performance; even careful, selective mixing yields only about a 5% accuracy improvement.
- Teacher model: QwQ-32B outperforms DeepSeek-R1 as a distillation teacher by 1.9-2.6% accuracy (see the sketch of the distillation step after this list).
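To make the distillation step concrete, the sketch below shows how reasoning traces might be sampled from an open teacher model such as QwQ-32B with Hugging Face transformers. The model id, chat-template usage, and sampling settings are assumptions for illustration, not the paper's exact setup.

```python
# Hedged sketch: sampling a reasoning trace from a teacher model for distillation.
# Model id, chat-template usage, and generation settings are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_id = "Qwen/QwQ-32B"  # assumed teacher; any open reasoning model could be substituted
tokenizer = AutoTokenizer.from_pretrained(teacher_id)
model = AutoModelForCausalLM.from_pretrained(teacher_id, device_map="auto", torch_dtype="auto")

def generate_trace(question: str, max_new_tokens: int = 2048) -> str:
    """Sample one long-form reasoning trace to use as an SFT target."""
    messages = [{"role": "user", "content": question}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens,
                             do_sample=True, temperature=0.7)
    # Strip the prompt tokens and return only the generated trace.
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

# Example usage: distill a trace for one question from the curated pool.
# print(generate_trace("Find all real x such that x^2 - 5x + 6 = 0."))
```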
Future Directions
OpenThoughts demonstrates that systematic experimentation can significantly improve SFT data curation for reasoning models. Despite impressive results, challenges remain such as exploring reinforcement learning methods, staged fine-tuning, and curriculum learning. Future work will investigate cross-domain transfer effects and scaling behavior as smaller models approach teacher performance.
For more details, see the Paper, Project Page, and GitHub repository.