Revolutionizing LLM Reasoning with Off-Policy RL and KL Divergence Regularization
Researchers introduce Regularized Policy Gradient (RPG), a unified framework that applies KL-divergence regularization to off-policy reinforcement learning, improving both reasoning performance and training stability in large language models.
Enhancing Reasoning in Large Language Models with Policy Gradient Methods
Policy gradient methods in reinforcement learning (RL) have significantly boosted the reasoning capabilities of large language models (LLMs). A pivotal ingredient in stabilizing these methods is Kullback-Leibler (KL) regularization, which penalizes abrupt shifts between the current policy and a reference policy.
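To make this concrete, here is a minimal sketch of how a per-token KL penalty is commonly folded into the reward signal in RLHF-style training. The function name, tensor shapes, and the coefficient beta are illustrative assumptions, not details taken from the paper.

```python
import torch

def kl_shaped_rewards(logp_policy, logp_ref, task_reward, beta=0.05):
    """Fold a per-token KL penalty into the reward, as is common in RLHF-style training.

    logp_policy, logp_ref: (batch, seq_len) log-probs of the sampled tokens
    task_reward: (batch,) scalar reward assigned at the end of each response
    beta: strength of the KL penalty (illustrative value)
    """
    # Per-token log-ratio log pi(a|s) - log pi_ref(a|s); its expectation is the KL term.
    per_token_kl = logp_policy - logp_ref
    shaped = -beta * per_token_kl        # penalize drifting away from the reference policy
    shaped[:, -1] += task_reward         # add the terminal task reward on the last token
    return shaped
```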
Exploring KL Divergence Variants in RL
Although KL regularization is commonly implemented in algorithms like Proximal Policy Optimization (PPO), there remains a rich landscape of KL variants—such as Forward KL, Reverse KL, and unnormalized forms—that can be estimated and integrated into loss functions. These choices, alongside different gradient estimators and the distinction between on-policy and off-policy settings, influence training stability and performance in subtle and underexplored ways.
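As a rough illustration of what "estimating" these divergences from samples looks like, the sketch below shows simple Monte Carlo estimators for the reverse and forward KL between a policy and a reference model, written over per-sample log-probabilities. The estimator choices (the nonnegative "k3" form for reverse KL, importance weighting for forward KL) are common practice and are not necessarily the ones used in the paper; direction conventions also vary between papers.

```python
import torch

def reverse_kl_estimate(logp_pi, logp_ref):
    """Monte Carlo estimate of KL(pi || pi_ref) from samples drawn from pi.

    Uses the low-variance, nonnegative estimator r - 1 - log r with r = pi_ref / pi.
    """
    log_ratio = logp_ref - logp_pi              # log(pi_ref / pi) on samples from pi
    return (log_ratio.exp() - 1.0 - log_ratio).mean()

def forward_kl_estimate(logp_pi, logp_ref):
    """Monte Carlo estimate of KL(pi_ref || pi) using samples from pi,
    via importance weighting with the ratio pi_ref / pi."""
    log_ratio = logp_ref - logp_pi
    weight = log_ratio.exp()                    # importance weight pi_ref / pi
    return (weight * log_ratio).mean()
```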
Aligning LLMs through Human Feedback and Reinforcement Learning
Fine-tuning LLMs with human feedback is vital for building aligned AI systems. Two primary strategies prevail: policy gradient optimization with reward models (e.g., PPO) and direct training on human preferences via methods like Direct Preference Optimization (DPO). While PPO stabilizes training through reward models, DPO simplifies and scales learning by leveraging pairwise preference comparisons. Reinforcement learning is increasingly applied to enhance reasoning abilities in complex domains such as mathematics and coding, with new approaches aiming to lower computational costs and improve stability by modifying KL penalties or replacing value networks.
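For reference, a minimal version of the DPO objective mentioned above can be written directly over sequence-level log-probabilities; the function signature and the beta value here are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Minimal Direct Preference Optimization loss over sequence log-probabilities.

    Each argument is a (batch,) tensor of summed log-probs of a full response
    under either the trainable policy or the frozen reference model.
    """
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    # Push the implicit reward of the chosen response above that of the rejected one.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```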
Introducing Regularized Policy Gradient (RPG)
Researchers from UCLA, Tsinghua University, and the Shanghai Qi Zhi Institute propose Regularized Policy Gradient (RPG), a unified framework for KL-regularized policy gradients in online reinforcement learning. RPG derives both policy gradients and corresponding surrogate loss functions for Forward and Reverse KL divergences, covering normalized and unnormalized policy distributions. It supports fully differentiable objectives as well as REINFORCE-style estimators tailored to off-policy training with importance sampling. The framework also addresses theoretical issues in existing methods such as GRPO and examines the KL regularization used in REINFORCE++.
RPG's Methodology and Gradient Structure
The study derives policy gradient methods that incorporate KL-divergence regularization in both online and off-policy settings, using importance sampling over responses drawn from older policies. For the forward KL, the gradient combines importance-weighted rewards with a regularization term, and the corresponding loss reduces to a maximum-likelihood loss when rewards vanish. The unnormalized forward KL variant adds a correction for mismatched distribution mass. Reverse KL variants instead penalize deviations from the reference policy by adjusting rewards with log-probability ratios. All variants share a REINFORCE-like gradient structure, which enables alternative implementations via the stop-gradient operator and supports stable, efficient optimization.
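The sketch below illustrates the general shape of such an off-policy, KL-regularized REINFORCE-style surrogate: detached importance weights multiply an advantage shaped by a log-ratio penalty, so the gradient reduces to a weighted score-function term. It is a hypothetical reconstruction from the description above, not the authors' implementation, and all names and the beta value are assumptions.

```python
import torch

def off_policy_kl_regularized_loss(logp_new, logp_old, logp_ref, advantages, beta=0.05):
    """REINFORCE-style surrogate with importance sampling and a reverse-KL-style penalty.

    logp_new:   (batch,) log-probs of sampled responses under the current policy (differentiable)
    logp_old:   (batch,) log-probs under the older behavior policy that generated the samples
    logp_ref:   (batch,) log-probs under the frozen reference policy
    advantages: (batch,) advantage estimates for the sampled responses
    beta:       strength of the KL penalty (illustrative value)
    """
    # Detached importance weight pi_theta / pi_old: the gradient then takes the
    # REINFORCE form (weight * shaped advantage * grad log pi_theta).
    iw = (logp_new - logp_old).exp().detach()
    # Log-ratio penalty against the reference policy (reverse-KL-style reward shaping).
    kl_penalty = (logp_new - logp_ref).detach()
    shaped_advantage = advantages.detach() - beta * kl_penalty
    # Minimizing this surrogate ascends the shaped, importance-weighted objective.
    return -(iw * shaped_advantage * logp_new).mean()
```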
Experimental Validation on Complex Reasoning Tasks
The researchers evaluated RPG's fully differentiable and REINFORCE-style variants against leading baselines on challenging math-reasoning tasks using Qwen2.5 language models. Training was conducted on the DAPO-Math-17k dataset, with performance measured on benchmarks such as AMC23 and AIME. RPG variants consistently delivered higher accuracy, better training stability, and better memory efficiency. Implementations were built on the Verl framework and combined KL regularization, PPO-style clipping, and the Schedule-Free AdamW optimizer for smoother training dynamics. RPG models also showed well-behaved reward curves, entropy, and response lengths during training, underscoring their robustness and suitability for stable, high-performance learning.
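As a brief aside on the optimizer, a schedule-free AdamW workflow differs slightly from a standard PyTorch loop because the optimizer maintains separate train and eval parameter states. The sketch below assumes the open-source schedulefree package and uses a placeholder model and loss; it is not the authors' training script.

```python
import torch
import schedulefree  # pip install schedulefree

# Placeholder model, shown only to illustrate the optimizer workflow.
model = torch.nn.Linear(16, 2)
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-3, warmup_steps=100)

optimizer.train()  # schedule-free optimizers must be switched into train mode explicitly
for _ in range(1000):
    x = torch.randn(32, 16)
    loss = model(x).pow(2).mean()  # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

optimizer.eval()  # switch to the averaged weights before evaluation or checkpointing
```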
Advancing Policy Gradient Methods for LLMs
RPG offers a comprehensive framework for designing and analyzing policy gradient methods with KL regularization in both online and off-policy reinforcement learning contexts. By exploring various configurations—including forward and reverse KL divergences, normalized and unnormalized policies, and fully differentiable versus REINFORCE-style estimators—it provides a structured approach to implementation and theoretical understanding. Applied to reasoning tasks in LLMs, RPG demonstrates enhanced training stability and performance improvements over established baselines such as GRPO, REINFORCE++, and DAPO.
For more details, check out the paper and the project's GitHub page. All credit goes to the researchers behind this project.