CURE: Revolutionizing Code and Unit Test Generation with Self-Supervised Reinforcement Learning in LLMs
CURE is a novel self-supervised reinforcement learning framework that enables large language models to co-evolve code and unit test generation, significantly enhancing performance and efficiency without requiring ground-truth code.
Advancements in Large Language Models for Code Generation
Large Language Models (LLMs) have made significant strides in reasoning and precision, largely due to reinforcement learning (RL) and test-time scaling techniques. However, existing unit-test-based training methods such as O1-Coder and UTGEN require supervision from ground-truth code, which makes data collection expensive and limits the scale of training data.
Challenges with Current Approaches
Existing unit test generation approaches depend on rigid software analysis rules or neural machine translation methods that often fail to maintain semantic alignment. Even recent prompt-based and agentic methods, which show improved performance, still rely heavily on labeled code for fine-tuning. This dependency restricts scalability and adaptability in large-scale, real-world applications.
Introducing CURE: A Self-Supervised Co-Evolution Framework
Researchers at the University of Chicago, Princeton University, Peking University, and ByteDance Seed have developed CURE, a self-supervised reinforcement learning framework that trains a code generator and a unit test generator simultaneously, without any ground-truth code.
CURE uses a self-play mechanism where the LLM generates both correct and incorrect code. The unit test generator learns to identify failure modes and refines itself accordingly. This co-evolution process improves both code generation and verification without external supervision.
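At the core of this self-play loop is a cross-evaluation step: every sampled unit test is executed against every sampled code completion, yielding a pass/fail matrix that drives both models' updates. The sketch below illustrates this step with toy stand-ins; `run_test`, `codes`, and `tests` are hypothetical placeholders for real model outputs, not CURE's actual API.

```python
# Sketch of CURE's self-play evaluation step (illustrative only).
# Each sampled unit test is run against each sampled code completion
# to build a pass/fail matrix.

def build_pass_matrix(codes, tests, run_test):
    """Execute every sampled unit test against every sampled code
    completion, producing a boolean pass/fail matrix."""
    return [[run_test(code, test) for test in tests] for code in codes]

# Toy stand-ins: a "code" is a callable, a "test" is an (input, expected) pair.
codes = [lambda x: x * 2,   # correct implementation of "double x"
         lambda x: x + 2]   # buggy implementation
tests = [(3, 6), (2, 4)]    # the second test fails to separate the two

def run_test(code, test):
    inp, expected = test
    return code(inp) == expected

matrix = build_pass_matrix(codes, tests, run_test)
print(matrix)  # [[True, True], [False, True]]
```

Note how the second test is passed by both the correct and the buggy code: a unit test generator rewarded for discriminating power learns to prefer tests like the first one, which is exactly the failure-mode identification the co-evolution process targets.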
Architecture and Methodology
Base Models and Sampling: CURE is based on the Qwen2.5-7B and 14B Instruct models, with Qwen3-4B for the long-chain-of-thought (CoT) variant. Each training step samples 16 candidate code completions and 16 candidate unit tests per task. Sampling is managed by vLLM with temperature 1.0 and top-p 1.0. For long-CoT models, a response-length-aware reward transformation penalizes lengthy outputs to improve inference efficiency.
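The article does not give the exact form of the length-aware transformation, but its effect can be sketched as a simple reward discount that grows with response length, so that concise correct responses are preferred. The function name and constants below are illustrative assumptions, not CURE's published formula.

```python
# Illustrative response-length-aware reward transformation (assumed
# form, not CURE's exact formula): linearly downweight the reward of
# longer chain-of-thought responses.

def length_adjusted_reward(base_reward, response_len, max_len=4096, penalty=0.5):
    """Scale reward down linearly with response length so that shorter
    correct responses earn more than verbose ones."""
    frac = min(response_len / max_len, 1.0)  # normalized length in [0, 1]
    return base_reward * (1.0 - penalty * frac)

# A correct but verbose answer earns less than a correct concise one.
concise = length_adjusted_reward(1.0, 512)    # 0.9375
verbose = length_adjusted_reward(1.0, 4096)   # 0.5
print(concise, verbose)
```

Any monotone-decreasing discount would serve the same purpose; the key design point is that the penalty is applied at training time, so the model internalizes brevity rather than needing truncation at inference.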
Reward Function and Optimization: CURE introduces a reward function designed to maximize reward precision—the probability that correct code scores higher than incorrect code when ranked by the generated unit tests. It also applies response-length-based reward adjustments to reduce latency on longer outputs. The coder and unit tester are optimized jointly with policy gradient methods.
Evaluation and Performance
CURE was evaluated on five standard coding datasets: LiveBench, MBPP, LiveCodeBench, CodeContests, and CodeForces. Metrics included unit test accuracy, one-shot code generation accuracy, and Best-of-N (BoN) accuracy using 16 code and test samples.
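Best-of-N selection in this setting amounts to executing the N sampled unit tests against the N sampled code completions and returning the completion that passes the most tests. A minimal sketch, with a toy pass matrix standing in for real executions:

```python
# Illustrative Best-of-N (BoN) selection: among N sampled code
# completions, return the one passing the most sampled unit tests.

def best_of_n(pass_matrix):
    """Index of the code completion with the highest pass count.
    Rows are code candidates; columns are generated unit tests."""
    return max(range(len(pass_matrix)), key=lambda i: sum(pass_matrix[i]))

# Rows: 3 candidate codes; columns: 4 generated unit tests.
pass_matrix = [
    [True, False, True, False],   # candidate 0 passes 2 tests
    [True, True,  True, True],    # candidate 1 passes all 4
    [False, False, True, False],  # candidate 2 passes 1
]
print(best_of_n(pass_matrix))  # 1
```

Under this scheme, BoN accuracy depends directly on the unit test generator's quality: weak tests pass everything and cannot separate candidates, which is why the co-evolved tester lifts BoN accuracy.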
The ReasonFlux-Coder models trained with CURE achieved impressive gains:
- +37.8% in unit test accuracy
- +5.3% in one-shot code generation accuracy
- +9.0% in BoN accuracy
ReasonFlux-Coder-4B notably reduced average unit test response length by 64.8%, boosting inference speed. Across all benchmarks, these models outperformed traditional supervised fine-tuned models like Qwen2.5-Coder-Instruct.
Commercial Application and Cost Efficiency
When used as the unit test generator alongside GPT-series models, ReasonFlux-Coder-4B improved their Best-of-N performance:
- GPT-4o-mini saw +5.5% BoN accuracy
- GPT-4.1-mini improved by +1.8%
This pairing also reduced API costs, presenting a cost-effective solution for production inference pipelines.
Using CURE for Label-Free Fine-Tuning
Unit test generators trained with CURE can serve as reward models in RL training. Utilizing ReasonFlux-Coder-4B’s generated unit tests offers comparable improvements to human-labeled test supervision, enabling fully label-free reinforcement learning workflows.
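In this label-free setup, the trained unit test generator's outputs replace human-written tests as the verifier, and a rollout's reward is simply its pass rate on those generated tests. The sketch below illustrates that reward signal; all names are hypothetical, not CURE's actual training API.

```python
# Sketch of a label-free RL reward: with no ground-truth tests, score a
# candidate code rollout by its pass rate on model-generated unit tests.

def label_free_reward(code, generated_tests, run_test):
    """Fraction of generated unit tests the candidate code passes."""
    if not generated_tests:
        return 0.0
    passed = sum(run_test(code, t) for t in generated_tests)
    return passed / len(generated_tests)

# Toy stand-ins: a "code" is a callable, a "test" is an (input, expected) pair.
run = lambda code, t: code(t[0]) == t[1]
tests = [(1, 1), (2, 4), (3, 9)]          # generated tests for "square x"

print(label_free_reward(lambda x: x * x, tests, run))  # 1.0
print(label_free_reward(lambda x: x + x, tests, run))  # 0.333...
```

Because this reward requires only test execution, the entire RL loop runs without any human-labeled code or tests, which is the scalability claim at the heart of CURE.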
Broader Impact and Future Prospects
ReasonFlux-Coder models integrate well with agentic coding frameworks such as MPSC (Multi-Perspective Self-Consistency), AlphaCodium, and S*. CURE improves agentic unit test generation accuracy by 25.1%, showing that its gains carry over to iterative, tool-using pipelines.
CURE marks a significant leap in self-supervised learning for code generation and verification, allowing LLMs to co-evolve coding and unit test generation without ground-truth data. Its improvements in accuracy, efficiency, and adaptability make it a scalable, cost-effective choice for training and deployment in large-scale environments.