
Prefix-RFT: Guiding LLMs with Partial Demonstrations to Merge SFT and RFT

Prefix-RFT blends supervised and reinforcement fine-tuning by using partial demonstration prefixes to guide exploration, achieving stronger and more stable performance on math reasoning benchmarks than SFT, RFT, and hybrid baselines.

What Prefix-RFT does

Prefix-RFT is a fine-tuning strategy that blends supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) by conditioning model updates on partial demonstrations, or prefixes. Instead of fully imitating expert outputs or relying purely on reward-driven exploration, the model receives a sampled prefix from a demonstration and is allowed to generate the remainder of the solution. This hybrid setup steers exploration toward promising solution trajectories while preserving adaptability.
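
To make the rollout scheme concrete, here is a minimal sketch of prefix-guided sampling, assuming a Hugging Face-style causal LM. The `prefix_guided_rollout` helper, the sampling settings, and the toy example are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a prefix-guided rollout (illustrative, not the authors' code).
# Assumes a Hugging Face causal LM; the model name is one of the bases used in the paper.
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Math-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

def prefix_guided_rollout(prompt: str, demonstration: str, prefix_fraction: float,
                          max_new_tokens: int = 1024) -> str:
    """Condition generation on the prompt plus a truncated demonstration prefix."""
    demo_ids = tokenizer(demonstration, add_special_tokens=False).input_ids
    cut = int(len(demo_ids) * prefix_fraction)        # keep only the first part of the expert solution
    prefix_text = tokenizer.decode(demo_ids[:cut])
    inputs = tokenizer(prompt + prefix_text, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, do_sample=True, temperature=1.0,
                             max_new_tokens=max_new_tokens)
    # The model completes the remainder of the solution on its own.
    return tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

# Usage sketch: draw a prefix fraction per rollout, then score
# prompt + prefix + continuation with the task reward (e.g., answer correctness).
fraction = random.uniform(0.05, 0.95)
completion = prefix_guided_rollout("Solve: 2x + 3 = 11. ",
                                   "We isolate x. 2x = 8, so x = 4.", fraction)
```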

Why combine SFT and RFT

SFT teaches models to follow instruction-style examples and stabilizes training through imitation, but it can make behavior rigid and harm generalization. RFT optimizes for task success using reward signals and encourages creative strategies, yet it can be unstable and overly dependent on a strong initial policy. By using partial demonstrations, Prefix-RFT retains the structural guidance of SFT while keeping the exploratory, outcome-driven benefits of RFT.

Key techniques in the method

Prefix-RFT uses several mechanisms to stabilize and improve learning. Entropy-based clipping focuses updates on high-entropy prefix tokens, so the model learns from the uncertain parts of the demonstration rather than overfitting to the predictable ones. The authors apply a Dr. GRPO update in which only the top 20% of prefix tokens, ranked by entropy, contribute to the loss. A cosine decay scheduler shortens the prefix over training, decaying the prefix fraction from 95% down to 5% so the model gradually shoulders more of the generation responsibility. Together, these choices keep the SFT loss at an intermediate level throughout training, preserving a balance between imitation and exploration.
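
A rough sketch of these two mechanisms under stated assumptions: a cosine schedule that anneals the prefix fraction from 95% to 5%, and an entropy mask that keeps only the top 20% highest-entropy prefix tokens in the per-token loss. The function names and the exact masking granularity are illustrative, not taken from the paper's code.

```python
import math
import torch
import torch.nn.functional as F

def prefix_fraction(step: int, total_steps: int,
                    start: float = 0.95, end: float = 0.05) -> float:
    """Cosine-decay the share of the demonstration given as a prefix."""
    progress = min(step / max(total_steps, 1), 1.0)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))

def top_entropy_mask(logits: torch.Tensor, prefix_len: int, keep: float = 0.2) -> torch.Tensor:
    """Keep only the top-`keep` fraction of prefix tokens by predictive entropy.

    logits: [seq_len, vocab] token logits for one sequence.
    Returns a boolean mask of shape [seq_len]; generated (non-prefix) tokens
    are always kept, low-entropy prefix tokens are dropped from the loss.
    """
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)   # [seq_len]
    mask = torch.ones(logits.shape[0], dtype=torch.bool)
    k = max(int(prefix_len * keep), 1)
    top_idx = entropy[:prefix_len].topk(k).indices
    mask[:prefix_len] = False
    mask[top_idx] = True
    return mask

# Usage sketch: at each training step, anneal the prefix length and mask the
# per-token policy-gradient loss before averaging.
# frac = prefix_fraction(step, total_steps)
# mask = top_entropy_mask(logits, prefix_len=int(frac * demo_len))
# loss = (per_token_loss * mask).sum() / mask.sum()
```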

Datasets, models, and evaluation

The method was evaluated on high-quality offline math datasets, including OpenR1-Math-220K (with 46k filtered problems), and tested using models such as Qwen2.5-Math-7B, Qwen2.5-Math-1.5B, and LLaMA-3.1-8B. Benchmarks included AIME 2024/25, AMC, MATH500, Minerva, and OlympiadBench. Compared to standalone SFT, RFT, and mixed-policy baselines such as ReLIFT and LUFFY, Prefix-RFT achieved superior avg@32 and pass@1 scores across tasks.
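
For readers unfamiliar with the metrics: avg@k is the mean per-problem accuracy over k sampled completions, and pass@1 is the fraction of problems solved by a single completion. A toy sketch of one common way to compute them (helper names and data are hypothetical, not from the paper's evaluation code):

```python
from typing import List

def avg_at_k(correct: List[List[bool]]) -> float:
    """avg@k: mean per-problem accuracy over k sampled completions.

    correct[i][j] is True if completion j for problem i is judged correct.
    """
    per_problem = [sum(flags) / len(flags) for flags in correct]
    return sum(per_problem) / len(per_problem)

def pass_at_1(correct_single: List[bool]) -> float:
    """pass@1: fraction of problems solved by one completion per problem."""
    return sum(correct_single) / len(correct_single)

# Toy example with 3 problems and 4 sampled completions each (not paper results).
samples = [[True, False, True, True], [False, False, False, True], [True, True, True, True]]
print(avg_at_k(samples))                     # ~0.667
print(pass_at_1([True, False, True]))        # ~0.667
```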

Empirical findings

Prefix-RFT delivered consistent gains across architectures and datasets. Notably, even with a drastic reduction in training data (to just 1%, roughly 450 prompts), average performance remained strong: avg@32 dropped only modestly, from 40.8 to 37.6. The top-20% entropy-based token update strategy produced the best results and yielded shorter, higher-quality outputs. The cosine decay scheduler for prefix length improved stability and convergence over a uniform prefix strategy, especially on difficult tasks like AIME.

Practical implications

By guiding exploration with sampled demonstration prefixes, Prefix-RFT offers a straightforward and robust way to merge imitation and reward-driven learning. It integrates cleanly into existing fine-tuning pipelines and is resilient to variations in demonstration quality and quantity. For tasks that benefit from structured reasoning, such as math problem solving, blending partial demonstrations with reinforcement updates can produce more adaptive and higher-performing LLMs.

Where to find more information

The full paper is available on arXiv: https://arxiv.org/abs/2507.01679. The authors also point to accompanying code, tutorials, and notebooks in their project repository.
