Revolutionizing LLMs: Self-Evolving Language Models Learn Without Labels Using Test-Time Reinforcement Learning
Researchers from Tsinghua University and Shanghai AI Lab introduce TTRL, a novel method allowing large language models to improve their performance without labeled data by leveraging self-generated pseudo-rewards during inference.
Overcoming Dependence on Labeled Data in LLMs
Large language models (LLMs) have made impressive strides in reasoning thanks to reinforcement learning (RL). However, they still depend heavily on supervised data and human feedback, which limits their adaptability in dynamic, real-world scenarios. Existing techniques such as reinforcement learning from human feedback (RLHF) improve instruction following but require costly labeled datasets.
Introducing Test-Time Reinforcement Learning (TTRL)
Researchers from Tsinghua University and Shanghai AI Lab have developed Test-Time Reinforcement Learning (TTRL), a novel framework that enables LLMs to learn during inference using only unlabeled test data. TTRL leverages the model’s own priors to estimate pseudo-rewards by aggregating multiple output samples through majority voting.
Instead of explicit labels, TTRL treats the most frequent prediction across sampled outputs as a pseudo-label. Responses aligning with this consensus receive positive reinforcement, turning inference into a self-supervised adaptive learning process that improves the model over time without external supervision.
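As a rough illustration of this consensus step, here is a minimal Python sketch (the helper name, and the assumption that final answers have already been extracted as strings, are ours rather than the authors'):

```python
from collections import Counter

def majority_vote_label(answers: list[str]) -> str:
    """Return the most frequent sampled answer; TTRL uses it as the pseudo-label."""
    return Counter(answers).most_common(1)[0][0]

# Five sampled final answers: "42" wins the vote and becomes the pseudo-label.
print(majority_vote_label(["42", "41", "42", "42", "7"]))  # -> 42
```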
How TTRL Works
TTRL operates in two stages:
- Label Estimation via Majority Voting: For each prompt, the model generates multiple outputs. The most common answer is taken as the estimated label.
- Reward Assignment and Policy Optimization: Responses matching the pseudo-label receive a reward of 1; all others receive 0. Using policy-gradient RL algorithms such as PPO or GRPO, the model then updates its policy to maximize agreement with these pseudo-labels.
This simple yet effective approach uses temperature-controlled sampling (commonly temperature=1.0), with 64 samples for voting and 16 for training updates, requiring no ground-truth labels.
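Putting the two stages together, the sketch below shows one TTRL update signal for a single prompt. It assumes the voting samples and training rollouts have already been generated and reduced to final-answer strings, and it uses a GRPO-style group normalization to turn the binary rewards into advantages; the names and details are illustrative rather than the authors' implementation.

```python
from collections import Counter

import numpy as np

def ttrl_step(voting_answers: list[str], rollout_answers: list[str]):
    """Compute the reward signal for one prompt.

    voting_answers:  e.g. 64 final answers sampled at temperature 1.0, used only for voting.
    rollout_answers: e.g. 16 final answers from the rollouts that will be trained on.
    """
    # Stage 1: label estimation via majority voting over the sampled answers.
    pseudo_label = Counter(voting_answers).most_common(1)[0][0]

    # Stage 2: binary rewards -- 1 if a rollout matches the consensus, else 0.
    rewards = np.array([1.0 if ans == pseudo_label else 0.0 for ans in rollout_answers])

    # GRPO-style group normalization turns rewards into advantages, so a
    # policy-gradient update pushes probability mass toward consensus answers.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    return pseudo_label, rewards, advantages
```

In a full pipeline, these advantages would weight the token log-probabilities of the corresponding rollouts in a PPO- or GRPO-style loss; the sketch stops at the reward signal itself.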
Empirical Success in Mathematical Reasoning
TTRL was tested on three math benchmarks: AIME 2024, AMC, and MATH-500. Results showed significant improvements:
- Qwen2.5-Math-7B’s accuracy on AIME 2024 jumped from 16.7% to 43.3% pass@1, a 159.3% relative increase achieved without any labeled data.
- The same model averaged an 84.1% relative gain across all three benchmarks.
- Smaller models like Qwen2.5-Math-1.5B also improved markedly, from 33.0% to 80.0% on MATH-500.
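For reference, these relative gains follow directly from the reported pass@1 scores; on AIME 2024, for instance, (43.3 − 16.7) / 16.7 ≈ 1.593, i.e. the quoted 159.3% relative increase.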
These results demonstrate TTRL’s ability to boost performance beyond the accuracy of the majority-voted pseudo-labels, indicating a self-reinforcing learning loop. Moreover, TTRL generalizes well across tasks, maintaining gains on benchmarks not used during training.
Implications and Future Directions
TTRL represents a paradigm shift for RL in LLMs by enabling continuous, label-free adaptation through self-generated supervision signals. Its compatibility with various RL algorithms and scalability with model size make it a promising approach for evolving language models in real-world applications.
While initially demonstrated on mathematical reasoning, the principles of TTRL—self-estimated supervision and test-time adaptation—may extend to other domains. Further research is needed to explore its theoretical properties and applications in interactive or multi-agent environments.
TTRL lays a foundation for LLMs to self-evolve efficiently, reducing reliance on expensive human annotation and improving robustness as models encounter novel tasks beyond their training data.