Meta and NYU's Semi-Online Reinforcement Learning Enhances LLM Alignment Efficiency
Meta and NYU developed a semi-online reinforcement learning method that balances offline and online training to enhance large language model alignment, boosting performance in both instruction-based and mathematical tasks.
Reinforcement Learning for Large Language Model Alignment
Large language models (LLMs) often require a fine-tuning phase to better align with human expectations. Reinforcement learning (RL) is key in this process, enabling models to adjust their outputs based on human feedback or task correctness, thereby improving their suitability for instruction-driven applications and precise tasks like mathematical problem-solving.
Challenges of Offline vs. Online RL Approaches
Fine-tuning methods typically fall into offline and online categories. Offline RL relies on static datasets and lacks adaptability during training, while online RL continuously updates the model with each interaction but demands significantly more computational resources. Balancing these approaches is challenging, especially when models must perform well on both verifiable (mathematical) and non-verifiable (open-ended) tasks.
Existing Alignment Algorithms: DPO and GRPO
Direct Preference Optimization (DPO) is an offline method that learns from pairs of preferred and rejected responses; it is valued for its simplicity and data efficiency, but its adaptability is limited because the preference data is fixed before training. Group Relative Policy Optimization (GRPO), derived from PPO, fine-tunes online by sampling a group of outputs for each prompt and estimating each output's advantage relative to the group. Although adaptive, GRPO's on-policy nature increases computational cost and complicates experimentation.
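To make the contrast concrete, here is a minimal sketch of the two objectives as they are commonly formulated; the tensor names and the beta value are illustrative and not taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO (offline): push the policy's log-ratio for the preferred response
    above its log-ratio for the rejected response, measured against a frozen
    reference model."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

def grpo_advantages(rewards, eps=1e-8):
    """GRPO (online): score each sampled output relative to the group of
    outputs generated for the same prompt (rewards shape: [batch, group])."""
    return (rewards - rewards.mean(dim=-1, keepdim=True)) / (
        rewards.std(dim=-1, keepdim=True) + eps)
```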
Introducing Semi-Online Reinforcement Learning
Meta and NYU researchers proposed a semi-online training approach that modulates how often the model’s generation module is synchronized with the training module. Instead of synchronizing after every step (fully online) or never during training (offline), the method refreshes the generator at a configurable interval, striking a balance that reduces training time while preserving model adaptability (see the sketch below). The approach supports flexible use of either DPO or GRPO with task-specific reward models.
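The loop below is a schematic of how such a synchronization interval can interpolate between the two regimes; the function and argument names are placeholders for illustration, not the authors' code.

```python
import copy
import torch

def semi_online_training(policy, batches, reward_fn, update_fn, sync_interval):
    """Schematic semi-online loop (illustrative names, not the paper's API).

    A frozen copy of the policy generates rollouts; its weights are refreshed
    from the trained policy every `sync_interval` steps.
      sync_interval == 1             -> fully online (on-policy) training
      sync_interval >= num of steps  -> effectively offline training
    """
    generator = copy.deepcopy(policy)  # generation module, periodically synced
    for step, prompts in enumerate(batches):
        if step % sync_interval == 0:
            generator.load_state_dict(policy.state_dict())  # resync weights
        with torch.no_grad():
            responses = generator.generate(prompts)    # rollouts, possibly stale
        rewards = reward_fn(prompts, responses)        # reward model or verifier
        update_fn(policy, prompts, responses, rewards) # DPO or GRPO update step
    return policy
```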
Application to Instruction Following and Mathematical Reasoning
The team fine-tuned the Llama-3.1-8B-Instruct model on two task types: open-ended instruction following and math problem-solving. For open-ended tasks, prompts from the WildChat-1M dataset were evaluated using the Athene-RM-8B reward model, which assigns scalar scores. For verifiable math tasks, the NuminaMath dataset and Math-Verify toolkit ensured answer correctness. Experiments ran on 32 NVIDIA H200 GPUs for training and 8 GPUs for inference, comparing offline, semi-online, and online synchronization setups.
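The sketch below illustrates how per-prompt rewards might be routed to either an answer verifier or a scalar reward model in such a mixed setup; the callables merely stand in for Math-Verify and Athene-RM-8B and do not reflect their actual APIs.

```python
def mixed_reward(prompt, response, task_type, reward_model, verify_answer):
    """Route each sample to the appropriate reward source (illustrative only).

    task_type: "math" for verifiable prompts (e.g. NuminaMath); anything else
    is treated as open-ended instruction following (e.g. WildChat prompts).
    reward_model: callable returning a scalar score (stand-in for Athene-RM-8B).
    verify_answer: callable returning True/False (stand-in for Math-Verify).
    """
    if task_type == "math":
        # Verifiable task: binary reward from an answer checker.
        return 1.0 if verify_answer(prompt, response) else 0.0
    # Non-verifiable task: scalar score from a preference reward model.
    return reward_model(prompt, response)
```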
Performance Improvements Across Benchmarks
On the Math500 benchmark, offline DPO achieved 53.7% accuracy, while semi-online DPO with synchronization interval s = 100 reached 58.9%. Online DPO and GRPO showed similar performance at 58.7% and 58.1%. On NuminaMath, offline DPO scored 36.4%, improving to 39.4% with semi-online variants (s = 10). For non-verifiable tasks measured by AlpacaEval 2.0 and Arena-Hard benchmarks, models trained with mixed reward types consistently outperformed others. Combining verifiable and non-verifiable rewards in training enhanced overall model generalization.
A Scalable and Flexible Framework
This research demonstrates that strict offline or online training is not necessary. By tuning synchronization frequency and balancing reward types, the semi-online method improves training efficiency and model performance across diverse task types without incurring high computational costs. This flexible framework opens new avenues for efficient LLM alignment.
For further details, see the original research paper. Credit to the Meta and NYU research teams for this advancement.