
GURU: Advancing LLM Reasoning Across Six Diverse Domains with Reinforcement Learning

GURU introduces a multi-domain reinforcement learning dataset and models that significantly improve reasoning abilities of large language models across six diverse domains, outperforming previous open models.

Challenges of Reinforcement Learning in Reasoning

Reinforcement Learning (RL) has shown promising results in enhancing the reasoning abilities of large language models (LLMs), particularly in systems like OpenAI-O3 and DeepSeek-R1. However, most RL research has been limited to narrow domains such as mathematics and coding. This narrow focus restricts the generalizability of RL improvements and results in models that lack versatility. Expanding RL to broader reasoning domains is difficult due to the scarcity of reliable reward signals and curated datasets, which are easier to define for math and code but challenging for open-ended reasoning tasks.

Limitations of Narrow Domain Focus

RL has become popular for boosting LLM reasoning skills, especially after successes with models like OpenAI-O3 and DeepSeek-R1. Many open-source projects focus mainly on mathematical and coding problems. While effective in these areas, these models' reasoning abilities often do not transfer well to other domains. Research suggests RL may not always teach new skills but rather helps the model better access reasoning patterns it already possesses. However, some studies indicate that extended RL training can unlock entirely new reasoning strategies.

Introduction of the GURU Dataset

A team of researchers from UC San Diego, MBZUAI, Carnegie Mellon, and Purdue developed GURU, a comprehensive RL dataset containing 92,000 examples across six reasoning domains: Math, Code, Science, Logic, Simulation, and Tabular data. Each domain features carefully designed reward functions and rigorous filtering to ensure quality. Training models on GURU demonstrated that RL benefits depend on domain familiarity: common domains gain from cross-domain RL, while unfamiliar domains require in-domain training for significant gains. Their GURU-7B and GURU-32B models outperform prior open models by up to 7.9% on 17 benchmark tasks, underscoring the importance of multi-domain benchmarks for RL.
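The paper emphasizes that each domain ships with its own verifiable reward function. A minimal sketch of how such a per-domain reward dispatch might look is below; the function names and matching rules here are illustrative assumptions, not GURU's actual implementation (which, for example, executes unit tests for code):

```python
# Illustrative sketch: per-domain reward dispatch for RL training.
# The matching rules below are simplified stand-ins for GURU's
# domain-specific verifiers, not the actual reward code.

def math_reward(response: str, target: str) -> float:
    # Simplified: exact match on the normalized final answer.
    return 1.0 if response.strip() == target.strip() else 0.0

def tabular_reward(response: str, target: str) -> float:
    # Simplified: case-insensitive match for tabular QA answers.
    return 1.0 if response.strip().lower() == target.strip().lower() else 0.0

# Registry mapping each domain to its verifier.
REWARD_FNS = {
    "math": math_reward,
    "tabular": tabular_reward,
}

def score(domain: str, response: str, target: str) -> float:
    """Route a model response to its domain's reward function."""
    return REWARD_FNS[domain](response, target)
```

The key design point is that every domain exposes the same `score` interface, so a single RL loop can train on all six domains without domain-specific branching.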

Cross-Domain vs. In-Domain RL Training

To explore RL’s impact on reasoning across domains, models were trained on both individual and mixed-domain data from GURU. Math, Code, and Science domains showed more improvement from cross-domain RL, likely due to their prevalence in pre-training data. Mixed-domain training matched or exceeded single-domain training, suggesting that diverse tasks help enhance general reasoning. Training on only difficult examples improved performance within that domain but reduced accuracy on simpler tasks in others. These results highlight the importance of data diversity and balanced difficulty for building transferable reasoning skills.
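The mixed-domain training described above requires composing batches across all six domains. A minimal sketch of one plausible sampling policy (uniform over domains) is shown below; the paper does not specify this exact scheme, so treat the sampling strategy as an assumption:

```python
# Illustrative sketch: drawing a mixed-domain RL batch.
# Uniform sampling over domains is an assumption here; the actual
# mixing ratio used in GURU training may differ.
import random

DOMAINS = ["math", "code", "science", "logic", "simulation", "tabular"]

def mixed_batch(pools: dict, batch_size: int, seed: int = 0) -> list:
    """Draw (domain, example) pairs with each domain equally likely."""
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        domain = rng.choice(DOMAINS)
        batch.append((domain, rng.choice(pools[domain])))
    return batch
```

Balancing domains this way reflects the finding that diverse tasks, rather than any single domain, drive transferable reasoning gains.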

GURU Model Architecture and Evaluation

The researchers trained 7-billion and 32-billion parameter models using the Verl framework and GRPO algorithm on the GURU dataset. Models were evaluated on a broad array of tasks—math, code, logic, science, simulation, and tabular—using consistent metrics. GURU models outperformed domain-specific baselines and performed well on unseen tasks. Analysis of Pass@k showed that performance varies with task type, model size, and decoding settings. Larger models gained more from RL, and tuning sampling parameters like temperature and top-p improved model diversity and reasoning coverage.
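The Pass@k analysis mentioned above is typically computed with the unbiased estimator from the code-generation literature: generate n samples per problem, count the c correct ones, and estimate the probability that at least one of k drawn samples passes. A self-contained sketch:

```python
# Unbiased Pass@k estimator: given n generated samples of which
# c are correct, the probability that a random size-k subset
# contains at least one correct sample is 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        # Fewer than k incorrect samples exist, so every
        # size-k subset must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=2 samples and c=1 correct, Pass@1 is 0.5. Sampling settings such as temperature and top-p change the diversity of the n generations, which is why they affect Pass@k at large k.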

Summary: Towards General-Purpose Reasoning

GURU provides a high-quality RL dataset with 92,000 examples spanning six reasoning domains, enabling broader studies beyond prior math- and code-focused RL research. The GURU-7B and GURU-32B models achieve state-of-the-art results on 17 benchmarks, especially excelling in domains less represented during pretraining. Findings demonstrate that RL can both refine existing knowledge and foster new reasoning strategies. All data, models, and code are publicly available to support further research in general-purpose reasoning.

For more information, check the Paper, Project Page, and GitHub Page.
