
Crome: Google DeepMind's Causal Framework Enhances Reward Modeling for Safer LLM Alignment

Google DeepMind and collaborators introduce Crome, a causal framework that improves the robustness of reward modeling in LLM alignment, using counterfactual data augmentation to counter reward hacking.

Challenges in Reward Modeling for LLM Alignment

Reward models are essential for aligning large language models (LLMs) with human preferences, but they often suffer from reward hacking. These models mistakenly focus on superficial features such as response length or formatting instead of true quality indicators like factual correctness and relevance. This happens because traditional training objectives fail to distinguish between spurious correlations in the data and actual causal factors that drive response quality. As a result, reward models become brittle and produce misaligned policies.
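For context, most reward models are trained with a pairwise Bradley-Terry objective along the lines of the PyTorch sketch below. This is a generic illustration rather than code from the paper; it shows why the standard objective is agnostic to which features actually drive a preference.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Standard pairwise objective: maximize P(chosen ranked above rejected).
    # Nothing here tells the model *why* the chosen response is better, so any
    # feature correlated with preference (length, formatting, ...) can be
    # exploited -- the opening for reward hacking.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```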

Limitations of Current Approaches

Current solutions to reward hacking in RLHF (Reinforcement Learning from Human Feedback) systems include architectural tweaks like Odin, policy-level adjustments, and data-centric methods such as ensembles and consistency checks. Some recent causal-inspired methods attempt to regularize or correct for known spurious factors, but they often miss unknown ones. Augmentation and evaluation strategies remain limited in their ability to train reward models robustly against diverse spurious cues.

Introducing Crome: A Causal Framework for Robust Reward Modeling

Researchers from Google DeepMind, McGill University, and MILA have introduced Crome (Causally Robust Reward Modeling), a framework based on an explicit causal model of answer generation. Crome trains reward models to distinguish genuine quality drivers from superficial cues by incorporating preference datasets enriched with LLM-generated counterfactual examples.

Two types of synthetic training pairs are created:

  • Causal Augmentations: These introduce changes along specific causal attributes such as factuality to ensure sensitivity to true quality changes.
  • Neutral Augmentations: These enforce invariance along spurious attributes like style by using tie-labels.

This approach substantially improves robustness, boosting RewardBench accuracy by up to 4.5% and improving safety and reasoning performance.
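A minimal sketch of how such augmented pairs could be assembled is shown below. The data structures, labels, and the hypothetical `degrade_fn` and `restyle_fn` helpers (standing in for LLM rewriting calls) are illustrative assumptions, not the authors' pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class PreferencePair:
    prompt: str
    response_a: str
    response_b: str
    label: Optional[int]  # 0 -> A preferred, 1 -> B preferred, None -> tie

def make_causal_pair(prompt: str, answer: str,
                     degrade_fn: Callable[[str], str]) -> PreferencePair:
    # Causal augmentation: corrupt a causal attribute (e.g., factuality) of a
    # good answer; the original answer must still be preferred.
    corrupted = degrade_fn(answer)  # e.g., an LLM prompted to inject a factual error
    return PreferencePair(prompt, answer, corrupted, label=0)

def make_neutral_pair(prompt: str, answer: str,
                      restyle_fn: Callable[[str], str]) -> PreferencePair:
    # Neutral augmentation: rewrite only spurious attributes (style, length,
    # formatting) and assign a tie label so the reward model learns invariance.
    restyled = restyle_fn(answer)  # e.g., an LLM prompted to paraphrase without changing content
    return PreferencePair(prompt, answer, restyled, label=None)
```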

Technical Approach: Counterfactual Data and Composite Loss

Crome's methodology involves generating attribute-aware counterfactual data guided by a causal model, followed by training the reward model with a specialized composite loss on combined datasets. Theoretical analysis demonstrates how causal augmentation isolates true reward drivers from spurious correlations under idealized conditions.
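The precise composite loss is defined in the paper; the sketch below only illustrates the general idea, under the assumption that a standard preference term on ordinary and causally augmented pairs is combined with an invariance penalty on tie-labeled neutral pairs.

```python
import torch
import torch.nn.functional as F

def composite_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor,
                   r_tie_a: torch.Tensor, r_tie_b: torch.Tensor,
                   lam: float = 1.0) -> torch.Tensor:
    # Preference term on original + causally augmented pairs.
    preference_term = -F.logsigmoid(r_chosen - r_rejected).mean()
    # Invariance term: drive the score gap on tie-labeled neutral pairs toward
    # zero (a squared-margin penalty here is an assumption, not the paper's form).
    invariance_term = (r_tie_a - r_tie_b).pow(2).mean()
    return preference_term + lam * invariance_term
```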

Experiments use the UltraFeedback dataset with counterfactuals generated via Gemini 2.0 Flash. Evaluation is performed on the RewardBench and reWordBench benchmarks using diverse base LLMs such as Gemma-2-9B-IT, Qwen2.5-7B, and Gemma-2-2B, with both Pairwise Preference and Bradley-Terry reward models. Downstream alignment effects are measured through Best-of-N selection across multiple tasks.
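Best-of-N selection itself is a standard evaluation harness: sample N candidate responses and keep the one the reward model scores highest. A minimal sketch, assuming placeholder `policy_generate` and `reward_fn` callables, is given below.

```python
from typing import Callable
import torch

@torch.no_grad()
def best_of_n(prompt: str,
              policy_generate: Callable[[str], str],
              reward_fn: Callable[[str, str], float],
              n: int = 16) -> str:
    # Sample N candidates from the policy and rerank them with the reward model.
    # A reward model that latches onto spurious cues will promote superficially
    # "better" answers here, which is what this evaluation is meant to expose.
    candidates = [policy_generate(prompt) for _ in range(n)]
    scores = torch.tensor([reward_fn(prompt, c) for c in candidates])
    return candidates[int(scores.argmax())]
```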

Performance Improvements Across Benchmarks

On RewardBench, Crome consistently outperforms prior methods (e.g., RRM), achieving up to 13.18% improvement in Safety and 7.19% in Reasoning categories. On reWordBench, it attains up to 9.1% aggregate accuracy gains and excels in 21 of 23 tested transformations. Additionally, Crome maintains more stable ranking accuracy between RewardBench and reWordBench compared to baselines.

On WildGuardTest, Crome shows enhanced safety by reducing attack success rates on harmful prompts without increasing refusal rates for benign ones.

Future Directions in Causal Data Augmentation

Crome demonstrates that targeted synthetic data augmentation strategies informed by causal models can effectively address reward hacking in LLM alignment. This data-centric training approach opens new avenues for synthetic data generation and causal attribute verification, which hold promise for future advances in robust language model alignment.

For more details, see the original research paper.
