Shanghai AI Lab Unveils Entropy-Based Scaling Laws to Tackle Exploration Collapse in Reinforcement Learning for LLMs
Shanghai AI Lab researchers propose entropy-based scaling laws and novel techniques to overcome exploration collapse in reinforcement learning for reasoning-centric large language models, achieving significant performance improvements.
Expanding Reinforcement Learning in Large Language Models
Recent developments in reasoning-focused large language models (LLMs) have extended the use of reinforcement learning (RL) beyond narrow, task-specific settings, enabling broader generalization and stronger reasoning. This expansion, however, brings significant challenges, particularly in scaling the computation needed for training. Unlike pre-training and fine-tuning, which rely on imitation learning, RL learns from experience and therefore demands considerably more computation.
The Role of Policy Entropy in Exploration and Exploitation
A key issue in RL is the decline of policy entropy, which determines the balance between exploiting known strategies and exploring new possibilities. Maintaining this balance is crucial for effective training, as excessive exploitation can limit the discovery of better policies. Maximum entropy RL techniques incorporate a regularization term to encourage uncertainty and broader exploration, but their application to LLMs remains under debate.
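For context, the conventional maximum-entropy RL objective augments the expected return with an entropy bonus weighted by a temperature coefficient. The formulation below is the standard textbook version, not a formula given in the article:

```latex
% Standard maximum-entropy RL objective (conventional form, not from the paper):
% the entropy bonus \mathcal{H}(\pi(\cdot \mid s_t)), weighted by \alpha, rewards
% the policy for staying uncertain and thus exploring more broadly.
J(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t} \Big( r(s_t, a_t)
          + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right]
```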
Introducing an Entropy-Performance Relationship
Researchers from Shanghai AI Laboratory and collaborating universities formulated an empirical equation linking policy entropy (H) and downstream performance (R):
R = -a \exp(H) + b
where a and b are fitting coefficients. The fit implies that performance gains are bottlenecked by the exhaustion of policy entropy: as H falls toward zero, the predicted performance saturates at R = b - a, underscoring the importance of sustaining entropy during training.
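As an illustration of how such a fit could be obtained, the sketch below applies scipy.optimize.curve_fit to hypothetical (entropy, performance) pairs; the data points and variable names are placeholders, not values from the study.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (policy entropy, downstream performance) measurements taken at
# several points during RL training; placeholder numbers, not the study's data.
H = np.array([1.2, 0.9, 0.6, 0.4, 0.2, 0.1])
R = np.array([0.35, 0.45, 0.52, 0.56, 0.59, 0.60])

def entropy_performance(h, a, b):
    """Empirical relation R = -a * exp(H) + b proposed in the paper."""
    return -a * np.exp(h) + b

# Fit the coefficients a and b to the observed points.
(a, b), _ = curve_fit(entropy_performance, H, R)

# With H -> 0, exp(H) -> 1, so the fitted performance ceiling is b - a.
print(f"a = {a:.3f}, b = {b:.3f}, predicted ceiling at H = 0: {b - a:.3f}")
```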
New Techniques to Manage Entropy Collapse
The team identified that changes in policy entropy are driven by the covariance between action probabilities and changes in logits. To mitigate entropy collapse, they proposed two techniques:
- Clip-Cov: clips the gradient contribution of a small fraction of tokens with the highest covariance.
- KL-Cov: applies a KL-divergence penalty to the tokens with the highest covariance.
These methods aim to preserve exploration by managing tokens that contribute to entropy reduction.
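A minimal PyTorch-style sketch of the idea follows. It assumes flattened per-token log-probabilities and advantages, and treats the centered product of the two as the per-token covariance score; the function names, threshold fractions, and KL estimator are illustrative assumptions, not the authors' released implementation.

```python
import torch

def covariance_scores(logprobs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Per-token covariance-style score: centered log-prob times centered advantage
    (assumed proxy for the covariance quantity described in the paper)."""
    return (logprobs - logprobs.mean()) * (advantages - advantages.mean())

def clip_cov_mask(scores: torch.Tensor, clip_frac: float = 2e-4) -> torch.Tensor:
    """Clip-Cov-style gating: exclude the top `clip_frac` fraction of tokens by
    covariance score from the policy-gradient loss (fraction is an assumption)."""
    k = max(1, int(clip_frac * scores.numel()))
    threshold = torch.topk(scores, k).values.min()
    return scores < threshold  # True = keep token in the loss

def kl_cov_penalty(logprobs: torch.Tensor, ref_logprobs: torch.Tensor,
                   scores: torch.Tensor, top_frac: float = 2e-4,
                   kl_coef: float = 1.0) -> torch.Tensor:
    """KL-Cov-style penalty: a simple KL estimate (log pi - log pi_ref) applied
    only to the highest-covariance tokens (estimator and coefficient assumed)."""
    k = max(1, int(top_frac * scores.numel()))
    idx = torch.topk(scores, k).indices
    return kl_coef * (logprobs[idx] - ref_logprobs[idx]).mean()

# Usage sketch inside an RL update step:
#   scores = covariance_scores(token_logprobs, token_advantages)
#   mask   = clip_cov_mask(scores)
#   loss   = -(mask.float() * token_logprobs * token_advantages).mean() \
#            + kl_cov_penalty(token_logprobs, ref_logprobs, scores)
```

The two functions are alternatives in spirit: Clip-Cov removes the offending tokens' influence on the gradient, while KL-Cov keeps them but penalizes how far the policy drifts on them.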
Experimental Validation Across Diverse Models and Benchmarks
The study evaluated these techniques on 11 open-source LLMs spanning four families (Qwen2.5, Mistral, LLaMA, DeepSeek) and ranging from 0.5B to 32B parameters. Evaluation covered eight benchmarks, including MATH500 and AIME 2024, using autoregressive generation with RL algorithms such as GRPO, REINFORCE++, and PRIME.
Significant Performance Gains with Clip-Cov and KL-Cov
On the Qwen2.5 models trained on the DAPO-MATH dataset, Clip-Cov and KL-Cov delivered consistent improvements over the GRPO baseline, with average gains of 2.0% for the 7B model and 6.4% for the 32B model. KL-Cov, in particular, kept entropy more than ten times higher than the baseline once the baseline's entropy had plateaued. On the most challenging benchmarks, the 32B model improved by up to 15.0%.
Implications for Scaling Reinforcement Learning in LLMs
This research addresses the critical challenge of policy entropy collapse, revealing a trade-off between performance and exploration. The proposed regularization strategies offer a pathway to sustain exploration during RL training, essential for scaling reasoning-centric LLMs beyond pre-training methods. These insights lay groundwork for future advances in building more intelligent and capable language models through reinforcement learning.