RL's Edge: MIT Shows Reinforcement Learning Cuts Catastrophic Forgetting vs Supervised Fine-Tuning
What catastrophic forgetting means for foundation models
Foundation models are typically trained once and then deployed broadly. When these models are fine-tuned on new tasks, they often lose previously acquired abilities — a phenomenon known as catastrophic forgetting. This limits the promise of long-lived, continually improving agents.
A simple, measurable law for forgetting
The MIT team proposes a clear, empirical relationship between forgetting and distributional shift. In their formulation:
Forgetting ∝ KL(π0 || π)
Here π0 is the base policy (or base model) and π is the fine-tuned policy. Measured on the new task, the forward KL divergence from the base model to the fine-tuned model strongly predicts how much prior capability is lost. Importantly, this makes forgetting quantifiable without needing data from previously seen tasks.
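Because the law only needs the model's behavior on new-task inputs, it can be estimated directly from next-token distributions. Below is a minimal sketch of such an estimate; the tensor shapes, the random stand-in logits, and the function name are illustrative assumptions, not the paper's code.

```python
# Minimal sketch: per-token estimate of KL(pi0 || pi) on new-task inputs.
# Shapes, vocabulary size, and random stand-in logits are illustrative assumptions.
import torch
import torch.nn.functional as F

def forward_kl(base_logits: torch.Tensor, tuned_logits: torch.Tensor) -> torch.Tensor:
    """Average KL(pi0 || pi) over token positions.

    base_logits, tuned_logits: [batch, seq_len, vocab] next-token logits produced
    by the base and fine-tuned models on the same new-task inputs.
    """
    log_p0 = F.log_softmax(base_logits, dim=-1)   # log pi0(y | x)
    log_p = F.log_softmax(tuned_logits, dim=-1)   # log pi(y | x)
    # KL(pi0 || pi) = sum_y pi0(y|x) * (log pi0(y|x) - log pi(y|x))
    kl_per_position = (log_p0.exp() * (log_p0 - log_p)).sum(dim=-1)
    return kl_per_position.mean()

# Random logits stand in for real model outputs here:
base = torch.randn(2, 16, 1000)
tuned = base + 0.1 * torch.randn_like(base)  # a mildly shifted fine-tuned policy
print(float(forward_kl(base, tuned)))
```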
Why reinforcement learning forgets less than supervised fine-tuning
MIT’s experiments show that online reinforcement learning (RL) tends to preserve prior abilities better than supervised fine-tuning (SFT). Both approaches can reach comparable performance on a new task, but SFT typically shifts the model’s output distribution farther from the base policy and overwrites previous skills. By contrast, on-policy RL draws samples from the model’s own output distribution and reweights them by reward, which naturally keeps the updated policy close to the base model.
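A stripped-down sketch makes the contrast concrete. This compares a cross-entropy SFT step with a REINFORCE-style on-policy step under common formulations; it is not the exact objective used in the paper, and all names are assumptions.

```python
# Illustrative contrast between an SFT update and an on-policy RL update.
# The policy is assumed to expose log-probabilities; all names are assumptions.
import torch

def sft_loss(logprob_of_external_target: torch.Tensor) -> torch.Tensor:
    # SFT pushes probability mass toward externally provided targets,
    # wherever those targets lie relative to the base policy.
    return -logprob_of_external_target.mean()

def on_policy_rl_loss(logprob_of_own_sample: torch.Tensor,
                      reward: torch.Tensor,
                      baseline: float = 0.0) -> torch.Tensor:
    # On-policy RL scores samples drawn from the current policy itself and
    # merely reweights them by (reward - baseline), so the update cannot
    # stray far from the distribution that generated the samples.
    advantage = (reward - baseline).detach()
    return -(advantage * logprob_of_own_sample).mean()
```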
Empirical results on large language models
Using Qwen 2.5 3B-Instruct as a base, the researchers fine-tuned models on diverse tasks: math reasoning (Open-Reasoner-Zero), science Q&A (SciKnowEval subset), and tool use (ToolAlpaca). They evaluated retention on benchmarks like HellaSwag, MMLU, TruthfulQA, and HumanEval. The findings were consistent: RL improved new-task accuracy while keeping prior-task accuracy stable, whereas SFT often improved the new task at the cost of earlier capabilities.
Robotics experiments: preserving manipulation skillsets
The team repeated the comparison in a robotics context with OpenVLA-7B adapted in SimplerEnv pick-and-place scenarios. RL-based adaptation maintained general manipulation skills across different tasks. SFT could succeed on the immediate target task but frequently degraded broader manipulation abilities, again reflecting RL’s more conservative, knowledge-preserving updates.
ParityMNIST: isolating the mechanism
To test mechanisms in a controlled setting, the researchers introduced ParityMNIST, a toy problem that isolates distributional effects. Both RL and SFT achieved high new-task accuracy, but SFT led to sharper drops on an auxiliary FashionMNIST benchmark. When forgetting was plotted against KL divergence, both methods fell on a single predictive curve, validating forward KL as the governing factor.
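As a rough illustration of how such a toy task can be set up (assuming ParityMNIST relabels each MNIST digit by its parity; see the paper for the exact construction), one could wrap the standard MNIST dataset like this:

```python
# Rough sketch of a ParityMNIST-style dataset (assumption: each MNIST digit is
# relabeled by its parity; consult the paper for the exact construction).
import torch
from torchvision import datasets, transforms

class ParityMNIST(torch.utils.data.Dataset):
    """Wrap MNIST so the target becomes the parity of the original digit."""

    def __init__(self, base: torch.utils.data.Dataset):
        self.base = base

    def __len__(self) -> int:
        return len(self.base)

    def __getitem__(self, i: int):
        image, digit = self.base[i]
        return image, digit % 2  # 0 = even, 1 = odd

mnist = datasets.MNIST(root="./data", train=True, download=True,
                       transform=transforms.ToTensor())
parity_train = ParityMNIST(mnist)
```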
Why on-policy updates constrain forgetting
On-policy RL gathers data from the model’s own output distribution and adjusts token probabilities incrementally according to reward. This incremental, reward-weighted update implicitly favors solutions that stay close, in distribution, to the base policy. A theoretical analysis in the paper shows that policy gradient methods converge to KL-minimal optimal solutions, which explains why RL’s distributional shifts tend to be conservative.
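Schematically, and using this section's notation rather than the paper's exact theorem statement, the claim can be written as:

```latex
% Schematic restatement in this section's notation (not the paper's exact theorem):
% among the policies that solve the new task, on-policy policy-gradient training
% tends toward the one closest in KL to the base policy.
\[
  \pi_{\mathrm{RL}} \;\approx\; \operatorname*{arg\,min}_{\pi \in \Pi^{*}}
  \; \mathrm{KL}\!\left(\pi_{0} \,\|\, \pi\right),
  \qquad
  \Pi^{*} = \{\pi \,:\, \pi \text{ is optimal on the new task}\}.
\]
```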
Rejected alternative explanations
The team evaluated other potential explanations for forgetting, including weight-space change magnitude, hidden representation drift, sparsity of updates, and alternative metrics like reverse KL, total variation, and L2 distance. None matched the forward KL divergence’s predictive power, reinforcing that distributional closeness (forward KL) is the critical factor.
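For readers who want to run this kind of comparison themselves, the distribution-level alternatives are straightforward to compute on categorical output distributions. The sketch below is illustrative only, assumes strictly positive probabilities (e.g., softmax outputs), and does not reproduce the paper's experimental protocol.

```python
# Illustrative definitions of the distribution-level distance measures, computed
# for categorical distributions p0 (base) and p (fine-tuned). Assumes strictly
# positive probabilities (e.g., softmax outputs); not the paper's exact protocol.
import torch

def forward_kl(p0: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    return (p0 * (p0.log() - p.log())).sum(-1)    # KL(pi0 || pi)

def reverse_kl(p0: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    return (p * (p.log() - p0.log())).sum(-1)     # KL(pi || pi0)

def total_variation(p0: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    return 0.5 * (p0 - p).abs().sum(-1)

def l2_distance(p0: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    return (p0 - p).pow(2).sum(-1).sqrt()
```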
Design implications for post-training and continual learning
The study reframes catastrophic forgetting as a distributional problem and suggests practical axes for algorithm design and evaluation:
- Evaluation metrics should account for KL-conservatism, not only new-task accuracy.
- Hybrid methods that combine the efficiency of SFT with explicit KL minimization or conservative RL updates could yield better trade-offs (a minimal sketch of such a KL-regularized loss follows this list).
- For continual learning and lifelong agents, measuring and limiting forward KL shift provides a precise control knob to avoid erasing previous capabilities.
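One way such a hybrid could look is an ordinary SFT objective plus an explicit forward-KL penalty toward the frozen base model. This is a sketch of the idea only; the function name, the coefficient beta, and the token-level formulation are assumptions, not a method from the paper.

```python
# Sketch of a KL-regularized SFT objective (an illustrative assumption, not a
# method from the paper): standard cross-entropy on supervised targets plus a
# forward-KL penalty toward the frozen base model.
import torch
import torch.nn.functional as F

def kl_regularized_sft_loss(tuned_logits: torch.Tensor,
                            base_logits: torch.Tensor,
                            target_ids: torch.Tensor,
                            beta: float = 0.1) -> torch.Tensor:
    """tuned_logits, base_logits: [batch, seq, vocab]; target_ids: [batch, seq].

    base_logits should come from the frozen base model (e.g., under torch.no_grad()).
    """
    # Standard next-token cross-entropy on the supervised targets.
    ce = F.cross_entropy(tuned_logits.flatten(0, 1), target_ids.flatten())
    # Forward KL(pi0 || pi) toward the base model, averaged over positions.
    log_p0 = F.log_softmax(base_logits, dim=-1)
    log_p = F.log_softmax(tuned_logits, dim=-1)
    kl = (log_p0.exp() * (log_p0 - log_p)).sum(-1).mean()
    return ce + beta * kl
```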
Where to read more
The full technical details are available in the MIT paper (https://arxiv.org/pdf/2509.04259) and accompanying project resources, which include code, tutorials, and notebooks.