
NVIDIA and CMU Unveil Nemotron-CrossThink: Advancing Multi-Domain Reasoning in Large Language Models

NVIDIA, CMU, and Boston University researchers introduce Nemotron-CrossThink, a novel framework that expands reinforcement learning for large language models beyond math to multiple reasoning domains with improved accuracy and efficiency.

Expanding the Reach of Reinforcement Learning in Large Language Models

Large Language Models (LLMs) have shown impressive reasoning abilities across a wide range of tasks, with Reinforcement Learning (RL) playing a key role in deepening their multi-step reasoning. RL has traditionally excelled in domains like mathematics and coding, where rules and correctness are well defined. However, scaling RL to broader reasoning domains is challenging due to limited data and the difficulty of ensuring that models generalize across different fields.

The Evolution of Reasoning Techniques

Chain-of-Thought (CoT) reasoning marked a breakthrough by enabling LLMs to solve complex problems through multi-step intermediate reasoning, improving performance in math, science, and programming. Despite successes in mathematical reasoning, expanding RL training to diverse fields such as law, social sciences, and humanities remains largely unexplored.

Challenges in Multi-Domain Reasoning

Diversifying RL training data raises questions about the best strategies for blending data from various domains. Developing verifiable reward models in domains without deterministic solutions is particularly difficult. Different fields require unique reasoning approaches, and varying question formats (open-ended vs multiple-choice) demand adaptable strategies. Incorporating diverse reasoning domains could significantly enhance the cognitive abilities of LLMs.

Introducing Nemotron-CrossThink

Researchers from NVIDIA, Carnegie Mellon University, and Boston University propose Nemotron-CrossThink, a framework designed to integrate multi-domain corpora into RL training to boost cross-task generalization. This approach curates diverse data from synthetic sources like CommonCrawl and open-source question-answer pairs covering STEM, humanities, law, and social sciences. By applying templated formats such as Multiple Choice Questions (MCQ) and open-ended questions, filtering for verifiable rewards, and using strategic data blending, the framework enables robust self-learning across varied reasoning tasks.

Key Innovations and Results

Nemotron-CrossThink enhances reasoning accuracy and response adaptability. Models trained this way can generate concise answers for general queries and detailed explanations for math problems, optimizing inference efficiency. The framework addresses non-deterministic reward challenges through templated data curation and filters data by complexity to amplify RL effectiveness. Performance improvements include +30.1% on MATH-500, +27.5% on AMC23, +12.8% on MMLU-PRO, and +11.3% on GPQA-DIAMOND benchmarks.

Comprehensive Data Curation

The training dataset combines synthetic data from CommonCrawl with open-source QA datasets spanning general-purpose reasoning and mathematical content. General-purpose datasets include MMLU, Natural Reasoning, and synthesized QA pairs from STEM, economics, social sciences, and humanities. Mathematical reasoning data includes MATH, NuminaMath, and synthetically generated problems.

Template Application and Data Filtering

To enable verifiable rewards in non-mathematical domains, Nemotron-CrossThink applies templates to structure question-answer pairs into MCQ and open-ended formats. This limits answer variability for effective reward modeling. Filtering removes samples that cannot be reliably evaluated, such as MCQs lacking correct answers or open-ended responses longer than ten words.
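The filtering step described above can be sketched as a simple predicate. This is a minimal illustration, not the paper's implementation; the `sample` dictionary layout and field names are assumptions for the example:

```python
def is_verifiable(sample: dict) -> bool:
    """Keep only samples whose answers can be checked deterministically.

    `sample` is a hypothetical record with keys:
      'format'  -- 'mcq' or 'open'
      'options' -- list of answer choices (MCQ only)
      'answer'  -- the reference answer string
    """
    if sample["format"] == "mcq":
        # Discard MCQs whose labeled answer is not among the options.
        return sample["answer"] in sample.get("options", [])
    if sample["format"] == "open":
        # Discard open-ended answers longer than ten words,
        # since long free-form answers are hard to reward reliably.
        return len(sample["answer"].split()) <= 10
    return False
```

A curation pipeline would apply this predicate to every templated QA pair and drop the failures before RL training begins.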

Strategic Data Blending and Reinforcement Learning

The framework uses Group Relative Policy Optimization (GRPO) to enhance RL efficiency by estimating baselines from group scores without a separate critic model. Six blending recipes analyze how various data sources and question types impact training, demonstrating that combining general-purpose reasoning with mathematical data yields more adaptable LLMs.
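The core of GRPO's critic-free design is its advantage estimate: each sampled response is scored against the mean reward of its own group rather than a learned value function. A minimal sketch of that computation (function name and example rewards are illustrative, not from the paper):

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each response in a sampled group, GRPO-style.

    Instead of querying a separate critic model, the baseline is the
    group's mean reward; advantages are normalized by the group's
    standard deviation so updates are scale-invariant.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Four responses to one prompt: two rewarded, two not.
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct responses receive positive advantages and incorrect ones negative, which is what the policy gradient then amplifies or suppresses.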

Technical Contributions

  • Templated QA formats stabilize reward modeling, with unified open-ended formats improving performance by 1.21% over mixed formats and short-form answers outperforming long-form by 1.20%.
  • Multi-domain data blending boosts reasoning accuracy by 1.61% compared to math-only training, while reducing token usage by 28%.
  • Model-driven filtering selects challenging samples, adding 2.15% accuracy gains for Qwen-2.5-32B.
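The model-driven filtering idea in the last bullet can be illustrated as follows. This is a hedged sketch of one common approach (estimate a reference model's pass rate per question and keep only the harder ones); the helper names, attempt count, and threshold are assumptions, not details from the paper:

```python
def filter_hard(samples, solve, attempts: int = 8, max_rate: float = 0.5):
    """Keep samples a reference model solves at most `max_rate` of the time.

    `solve(sample)` is a hypothetical callable that draws one answer from
    the reference model and returns True when it matches the ground truth.
    Repeating it `attempts` times approximates the model's pass rate,
    so easy questions (high pass rate) are dropped from RL training.
    """
    kept = []
    for s in samples:
        rate = sum(solve(s) for _ in range(attempts)) / attempts
        if rate <= max_rate:
            kept.append(s)
    return kept
```

Filtering this way concentrates the RL signal on questions the model cannot yet answer reliably, which is consistent with the accuracy gains the paper reports from selecting challenging samples.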

Experimental Findings

The NuminaMath dataset delivered the highest average performance, excelling in math tasks and generalizing across domains. Synthetic QA data improved performance by about 1.0%, especially on the MMLU-PRO, AGIEVAL, and MATH-500 benchmarks. Nemotron-CrossThink outperformed base models, with the general-purpose reasoning blend achieving a 5% average improvement over OPEN-REASONER-ZERO and notable gains on reasoning benchmarks.

Open-ended question formats yielded better results in math benchmarks than multiple-choice, aligning with the open-ended nature of math problems. Mathematical reasoning data transfers well to structured tasks, while general-purpose data alone is less effective, indicating the importance of including math in training blends.

Summary

Nemotron-CrossThink offers a scalable RL framework that improves LLM generalization by blending diverse reasoning data with a 2:1 ratio of general-purpose to mathematical content. Its innovations in data curation, templating, filtering, and blending deliver significant accuracy improvements, moving LLM reasoning beyond math to embrace broad human knowledge.
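The 2:1 blend above can be sketched as a weighted sampler over the two data pools. A minimal illustration under stated assumptions (pool contents, batch size, and the seeded RNG are all hypothetical):

```python
import random

def blend_batch(general, math, n: int, ratio=(2, 1), seed: int = 0):
    """Draw a training batch mixing general-purpose and math samples.

    With ratio=(2, 1), each draw picks from the general-purpose pool
    with probability 2/3 and from the math pool with probability 1/3,
    matching the reported 2:1 blend.
    """
    rng = random.Random(seed)
    g, m = ratio
    batch = []
    for _ in range(n):
        pool = general if rng.random() < g / (g + m) else math
        batch.append(rng.choice(pool))
    return batch
```

In practice the ratio would be one of the blending recipes swept during training, with 2:1 the setting the paper reports as best.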
