Rubrics as Rewards: Enhancing Language Model Training with Structured Multi-Criteria Feedback
Rubrics as Rewards (RaR) introduces a reinforcement learning approach that uses structured rubrics as reward signals, improving language model training in complex domains like medicine and science.
Challenges in Reinforcement Learning for Language Models
Reinforcement Learning with Verifiable Rewards (RLVR) has enabled large language models (LLMs) to excel at tasks with clear, verifiable outcomes, such as mathematics and coding. Many real-world applications, however, lack explicitly verifiable answers, leaving no direct reward signal to train against. Current approaches such as Reinforcement Learning from Human Feedback (RLHF) rely on preference rankings over model outputs, but they require extensive pairwise comparisons and often overfit to superficial features such as response length or annotator biases.
Extending RLVR Beyond Traditional Domains
Recent advancements extend RLVR techniques beyond mathematics and coding into areas like physics, finance, and policy. For example, General-Reasoner achieved significant improvements on complex benchmarks through fine-tuning. Rubric-based evaluation frameworks, such as HealthBench, combine clinician-written criteria with automated judges to assess factors like factuality, safety, and empathy. Despite their success in evaluation, these rubrics are typically not used during training.
Introducing Rubrics as Rewards (RaR)
Researchers from Scale AI proposed Rubrics as Rewards (RaR), an on-policy reinforcement learning framework that uses checklist-style rubrics as the reward signal for multi-criteria tasks. RaR generates prompt-specific rubrics grounded in expert principles, with each item clearly defining a standard for a high-quality response and providing a human-interpretable supervision signal. The approach was applied in the medicine and science domains, producing two specialized datasets: RaR-Medicine-20k and RaR-Science-20k.
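As an illustration, a prompt-specific, checklist-style rubric entry might look something like the sketch below. The field names and the medical scenario are hypothetical, chosen for readability rather than taken from the released RaR-Medicine-20k or RaR-Science-20k schemas.

```python
# Hypothetical sketch of a checklist-style rubric entry.
# Field names and the example scenario are illustrative assumptions,
# not the actual schema of the released RaR datasets.
example_rubric = {
    "prompt": "A patient presents with chest pain radiating to the left arm. What should be done?",
    "rubric_items": [
        {
            "criterion": "Identifies acute coronary syndrome as a leading differential diagnosis.",
            "category": "Essential Criteria",
        },
        {
            "criterion": "Recommends an immediate ECG and cardiac troponin measurement.",
            "category": "Essential Criteria",
        },
        {
            "criterion": "Communicates the plan in clear, empathetic language.",
            "category": "Important Criteria",
        },
    ],
}
```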
How RaR Works
LLMs serve as expert proxies to generate rubrics that meet four key criteria: expert grounding, comprehensive coverage, semantic weighting, and self-contained evaluation. For each domain, prompts instruct the LLM to create 7-20 rubric items, each weighted categorically (e.g., Essential or Important Criteria) to reflect its relevance. Training employs the GRPO algorithm with Qwen2.5-7B as the base model, and the training pipeline consists of three main components: Response Generation, Reward Computation, and Policy Update.
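To make the Reward Computation step concrete, here is a minimal Python sketch of one way categorically weighted rubric items could be aggregated into a scalar reward. The specific weight values, the `judge` callable, and the function name are assumptions for illustration, not the paper's released implementation.

```python
from typing import Callable

# Illustrative category weights; the paper weights rubric items categorically
# (e.g., Essential vs. Important), but these exact values are assumptions.
CATEGORY_WEIGHTS = {"Essential Criteria": 1.0, "Important Criteria": 0.5}


def rubric_reward(
    response: str,
    rubric_items: list[dict],
    judge: Callable[[str, str], bool],
) -> float:
    """Score a response as the weighted fraction of satisfied rubric items.

    `judge(criterion, response)` is a placeholder for an LLM-judge call that
    returns True if the response satisfies the criterion.
    """
    total, earned = 0.0, 0.0
    for item in rubric_items:
        weight = CATEGORY_WEIGHTS.get(item["category"], 0.25)  # assumed default
        total += weight
        if judge(item["criterion"], response):
            earned += weight
    return earned / total if total > 0 else 0.0
```

A scalar of this kind is what an on-policy algorithm like GRPO consumes: it samples a group of responses per prompt, scores each one, and normalizes the scores within the group to form the advantages used in the policy update.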
Performance and Benefits
RaR-Implicit, a variant of the method, outperforms several baselines including Simple-Likert, achieving up to 28% relative improvement on HealthBench-1k and 13% on GPQA. It surpasses both base and instruction-tuned models, demonstrating the effectiveness of rubric-guided training for nuanced response evaluation. Rubric-based rewards offer clearer, more accurate signals and better alignment with human preferences across various model sizes.
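For intuition about the implicit variant, the sketch below shows one plausible way a judge could be given the entire rubric at once and asked for a single holistic score, rather than summing fixed per-item weights. The prompt wording, the 1-10 scale, and the `llm_judge` placeholder are assumptions for illustration only.

```python
def implicit_rubric_reward(response: str, rubric_items: list[dict], llm_judge) -> float:
    """Implicit aggregation sketch: the judge sees the whole rubric and
    returns one holistic score. `llm_judge` is a placeholder for an LLM
    call that returns a number on an assumed 1-10 scale."""
    rubric_text = "\n".join(
        f"- [{item['category']}] {item['criterion']}" for item in rubric_items
    )
    prompt = (
        "Rate the response from 1 to 10 against the rubric below, "
        "prioritizing Essential Criteria.\n\n"
        f"Rubric:\n{rubric_text}\n\nResponse:\n{response}\n\nScore:"
    )
    return float(llm_judge(prompt)) / 10.0  # normalize to [0, 1]
```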
Limitations and Future Directions
While RaR advances language model training by integrating structured, checklist-style rubrics as reward signals, its current application is limited to the medical and science domains. Future work could validate the approach on broader tasks such as open-ended dialogue, explore reward aggregation methods beyond the implicit and explicit strategies, and run controlled studies of reward-hacking risks. Additionally, the current reliance on off-the-shelf LLMs as judges suggests that dedicated evaluators with stronger reasoning capabilities could be beneficial.
For detailed insights, refer to the original paper linked by the researchers.