SRL: Teaching 7B Models to Reason Step-by-Step on Hard Math and Code
SRL converts expert trajectories into per-step rewarded actions and lets models produce private reasoning spans before each action, giving dense learning signals that boost 7B open models on hard math and coding tasks
What SRL tries to fix
Small open models often fail on the hardest reasoning tasks even when provided with expert traces. Supervised fine-tuning (SFT) can cause models to imitate long demonstrations token-by-token and degrade performance on scarce, hard datasets. Outcome-level reinforcement learning approaches can also fail when no correct rollout exists to reward. Supervised Reinforcement Learning (SRL) offers a third path: keep the RL optimization but inject supervision directly into the reward channel using expert trajectories.
How the SRL training loop works
Each expert trajectory is decomposed into a sequence of actions. For every prefix of that sequence, training creates an example in which the model first emits a private reasoning span wrapped in think-style tags and then outputs its next action; only the action is scored, using a lightweight string-similarity reward against the expert's action at that step. Because every prefix becomes its own rewarded example, the model receives a dense, step-level signal instead of a single outcome reward at the end of the trajectory.
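The sketch below illustrates this per-prefix expansion, assuming the expert trajectory has already been segmented into a list of action strings; the function name, prompt template, and field names are illustrative, not the paper's code.

```python
# Hypothetical sketch of SRL-style data construction (not the authors' implementation).

def build_srl_examples(problem: str, expert_actions: list[str]) -> list[dict]:
    """Turn one expert trajectory into per-step training examples.

    For each prefix of the expert action sequence, the model is prompted to
    produce a private reasoning span (<think>...</think>) followed by the next
    action; only the action is later compared against the expert's action.
    """
    examples = []
    for step, target_action in enumerate(expert_actions):
        context = expert_actions[:step]  # earlier expert actions shown as context
        prompt = (
            f"Problem:\n{problem}\n\n"
            "Steps so far:\n" + "\n".join(context) + "\n\n"
            "Think privately inside <think>...</think>, then output the next step."
        )
        examples.append({
            "prompt": prompt,
            "target_action": target_action,  # used only by the step-wise reward
        })
    return examples
```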
Why this matters for small models
SRL avoids two common failure modes:
- SFT overfitting to long demonstrations, which causes performance drop on small, hard datasets.
- Outcome-based RL collapse when no rollout is correct and the reward is sparse or misleading.
By turning each step of an expert trajectory into a rewarded action, SRL provides consistent, dense feedback that keeps learning on track in this hard-data regime.
Math benchmark outcomes
All models in the reported experiments are initialized from Qwen2.5 7B Instruct and trained on the DeepSeek-R1-formatted s1K-1.1 set for an apples-to-apples comparison. Key results from the paper:
- Base Qwen2.5 7B Instruct: AMC23 greedy 50.0, AIME24 greedy 13.3, AIME25 greedy 6.7.
- SRL: AMC23 greedy 50.0, AIME24 greedy 16.7, AIME25 greedy 13.3.
- SRL then RLVR: AMC23 greedy 57.5, AIME24 greedy 20.0, AIME25 greedy 10.0.
SRL alone removes the SFT degradation and raises performance on the tougher benchmarks. Running RLVR after SRL yields the best open-source scores reported in the study; the authors emphasize the pipeline SRL -> RLVR as the strongest configuration.
Software engineering (SWE) experiments
The team applied SRL to Qwen2.5 Coder 7B Instruct using 5,000 verified agent trajectories generated by Claude 3.7 Sonnet. Each trajectory was decomposed into step-wise items, yielding a total of 134,000 training examples. Evaluated on SWE-Bench Verified:
- Base Qwen2.5 Coder 7B Instruct: 5.8% (oracle file edit) / 3.2% (end-to-end).
- SWE-Gym 7B (SFT baseline): 8.4% / 4.2%.
- SRL: 14.8% / 8.6%.
SRL roughly doubles the base model's performance and outperforms the SFT baseline on this coding task set.
Practical characteristics and trade-offs
SRL keeps a GRPO-style objective and relies only on expert actions plus a lightweight string-similarity reward, avoiding the need for an extra learned reward model. That makes it feasible to run on small, hard datasets where collecting large numbers of examples or training an auxiliary reward model is impractical. The method generalizes across domains: the same SRL recipe works for mathematical reasoning and agentic software engineering traces.
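For intuition, the following sketch shows one plausible form of such a reward, using Python's difflib sequence matcher as the string-similarity metric and a group-relative normalization in the spirit of GRPO; the exact similarity function and normalization used in the paper may differ.

```python
# Hypothetical sketch of the step-wise reward and a GRPO-style advantage.
import difflib
import re

def step_reward(model_output: str, expert_action: str) -> float:
    """Score only the action text; the private <think>...</think> span is ignored."""
    action = re.sub(r"<think>.*?</think>", "", model_output, flags=re.DOTALL).strip()
    return difflib.SequenceMatcher(None, action, expert_action).ratio()

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within a group of rollouts for the same step (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]
```

Because the reward is a cheap string comparison against the expert action, no auxiliary reward model is needed, which is what makes the recipe practical on small, hard datasets.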
Implications for open models
SRL is a practical bridge between process-level supervision and reinforcement learning. By providing dense, step-wise rewards derived from expert traces and allowing the model to develop unconstrained internal reasoning, SRL enables small models to learn tasks that SFT and outcome-based RL struggle with. The recommended pipeline from the paper is to apply SRL first and then refine with RLVR for further gains.