
Microsoft and Tsinghua Introduce Reward Reasoning Models to Enhance LLM Judgement with Dynamic Compute Scaling

Microsoft and Tsinghua researchers propose Reward Reasoning Models that adaptively allocate compute resources during evaluation, significantly improving large language model judgment and alignment across complex tasks.

Challenges in Reward Modeling for Large Language Models

Reinforcement learning (RL) is a key technique for fine-tuning large language models (LLMs), whether through human feedback (RLHF) or verifiable rewards (RLVR). While RLVR shows promise on mathematical reasoning tasks, it depends on queries with verifiable answers, which limits its scalability to large, general-domain datasets. Moreover, current reward models, whether scalar or generative, cannot dynamically allocate computational resources during evaluation: they apply uniform compute regardless of query complexity.

Reward Reasoning Models (RRMs): A New Approach

Researchers from Microsoft Research, Tsinghua University, and Peking University have developed Reward Reasoning Models (RRMs), which incorporate an explicit reasoning phase before delivering final reward judgments. This approach allows RRMs to dynamically scale computational resources during test time, allocating more compute to complex queries that require deeper analysis.

RRMs use chain-of-thought reasoning to self-evolve their reward assessment capabilities without requiring explicit reasoning traces in the training data. They build on the Qwen2 model with a Transformer-decoder architecture and treat reward modeling as a text completion task: the model autoregressively generates a reasoning trace followed by a final preference judgment between two candidate responses, with no ties allowed.
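
This pairwise, generation-based formulation can be sketched with standard Hugging Face tooling. The model name, prompt template, and "Verdict:" convention below are illustrative assumptions rather than the paper's exact setup; the key idea is simply that the reward model first emits a reasoning trace and only then a binary preference.

```python
# Hypothetical sketch: pairwise reward reasoning as text completion.
# Model name, prompt template, and verdict format are assumptions,
# not the paper's exact implementation.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2-7B-Instruct"  # placeholder; actual RRM checkpoints differ

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

PROMPT_TEMPLATE = (
    "You are a reward model. Think step by step about which response "
    "better answers the query, then finish with 'Verdict: A' or 'Verdict: B'.\n\n"
    "Query:\n{query}\n\nResponse A:\n{a}\n\nResponse B:\n{b}\n\nReasoning:"
)

def judge_pair(query: str, resp_a: str, resp_b: str, max_new_tokens: int = 1024) -> str:
    """Generate a reasoning trace, then parse the final A/B preference."""
    prompt = PROMPT_TEMPLATE.format(query=query, a=resp_a, b=resp_b)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,  # a larger budget allows a longer reasoning trace
        do_sample=True,
        temperature=0.7,
    )
    completion = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    # The prompt asks for an explicit final verdict; ties are not permitted.
    return "A" if "Verdict: A" in completion else "B"
```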

Evaluation and Performance

The researchers used the RewardBench repository to evaluate RRMs across criteria such as instruction fidelity, helpfulness, accuracy, harmlessness, and level of detail. For multi-response evaluation, RRMs support Elo rating systems and knockout tournaments, both combined with majority voting to improve robustness.
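
The tournament-plus-voting procedure can be approximated in a few lines of plain Python. The sketch below assumes a `judge_pair` callable that returns "A" or "B" for two candidates (for example, the one sketched above); the vote count and bracket handling are illustrative choices, not the authors' exact protocol.

```python
# Hypothetical sketch of multi-response evaluation: a knockout tournament
# whose matches are decided by majority voting over repeated pairwise judgments.
import random
from collections import Counter

def vote_match(judge_pair, query, resp_a, resp_b, votes=5):
    """Majority vote over several sampled pairwise judgments."""
    tally = Counter(judge_pair(query, resp_a, resp_b) for _ in range(votes))
    return resp_a if tally["A"] >= tally["B"] else resp_b

def knockout_tournament(judge_pair, query, responses, votes=5):
    """Repeatedly pair up surviving responses until one winner remains."""
    pool = list(responses)
    random.shuffle(pool)  # avoid a fixed bracket order
    while len(pool) > 1:
        next_round = [
            vote_match(judge_pair, query, pool[i], pool[i + 1], votes)
            for i in range(0, len(pool) - 1, 2)
        ]
        if len(pool) % 2 == 1:  # odd candidate out gets a bye
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]
```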

Results demonstrate that RRMs achieve competitive or superior performance compared to strong baselines on the RewardBench and PandaLM Test benchmarks. Notably, the RRM-32B model reached 98.6% accuracy on reasoning tasks. RRMs also outperform DirectJudge models trained on the same data, confirming their ability to leverage increased test-time compute on complex queries.

Moreover, RRMs excel in reward-guided best-of-N inference without additional compute and deliver steady post-training gains on downstream tasks such as MMLU-Pro and GPQA. Scaling experiments indicate that longer reasoning horizons consistently improve accuracy across model sizes (7B, 14B, and 32B).
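
One way to combine the Elo formulation mentioned above with best-of-N inference is to run round-robin pairwise judgments and keep the highest-rated candidate. The sketch below is an illustrative reading of that idea, reusing the assumed `judge_pair` callable; the K-factor, base rating, and round-robin schedule are not taken from the paper.

```python
# Hypothetical sketch of reward-guided best-of-N selection using Elo ratings
# derived from round-robin pairwise judgments.
from itertools import combinations

def elo_best_of_n(judge_pair, query, responses, k=32, base_rating=1000.0):
    """Rank N candidate responses by Elo rating and return the highest-rated one."""
    ratings = {i: base_rating for i in range(len(responses))}
    for i, j in combinations(range(len(responses)), 2):
        winner = i if judge_pair(query, responses[i], responses[j]) == "A" else j
        loser = j if winner == i else i
        # Standard Elo update: expected score derived from the rating gap.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        ratings[winner] += k * (1.0 - expected_win)
        ratings[loser] -= k * (1.0 - expected_win)
    return responses[max(ratings, key=ratings.get)]
```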

Implications for Alignment and Future Use

By introducing explicit reasoning before reward assignment, RRMs address the computational rigidity of traditional reward models. Their ability to use test-time compute efficiently, through both parallel and sequential scaling, makes them a strong alternative to scalar reward models in alignment pipelines.

The research team has made the paper and models available on Hugging Face, inviting the community to explore and build upon their work.
