Reinforcement Learning Unlocks Open-Weight LLMs for Long-Horizon Software Engineering
Nebius AI and Humanoid adapted DAPO-based reinforcement learning to train an open-weight Qwen2.5 agent for long-horizon software engineering, reaching 39% Pass@1 on SWE-bench Verified without teacher supervision.
Reinforcement learning applied to open-weight models
Researchers from Nebius AI and Humanoid developed a reinforcement learning (RL) pipeline that trains open-weight large language models to act as capable, long-context software engineering (SWE) agents. Instead of relying on teacher-based supervision or proprietary models, the team adapted DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) and introduced practical changes to make RL work for multi-turn, real-environment debugging and code-editing tasks.
Why single-turn RL falls short for software engineering
Most RL work on LLMs focuses on one-shot tasks such as math problems or single-step code generation, where reward arrives at the end and interactions are short. Software engineering is different: agents must handle long trajectories, interpret rich environment feedback like compiler errors and test logs, and preserve context across tens of thousands of tokens. These differences make standard bandit-style RL approaches poorly suited for SWE automation.
Core challenges for RL in SWE environments
- Long-horizon reasoning: agents must remain coherent across many sequential steps, often requiring context windows above 100k tokens.
- Stateful environment feedback: actions produce meaningful observations (shell outputs, failing tests) that must guide future decisions.
- Sparse and delayed rewards: success often appears only after long interactions, complicating credit assignment.
- Noisy evaluation: rollout trajectories can be noisy because tests and environments may be flaky or unstable.
The two-stage training pipeline
The team trained a Qwen2.5-72B-Instruct agent using a two-stage approach: rejection fine-tuning followed by RL with a modified DAPO.
Rejection Fine-Tuning (RFT)
The project starts with supervised fine-tuning on 7,249 filtered SWE tasks drawn from the SWE-REBENCH dataset. Interaction traces that successfully pass the environment test suite are retained for fine-tuning, with training adapted to mask invalid environment-formatting actions. This stage raises baseline Pass@1 accuracy from about 11% to 20% on the SWE-bench Verified benchmark.
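A minimal sketch of how such rejection filtering and action masking could look in code; the `Trace` fields and helper names below are illustrative assumptions, not the authors' implementation:

```python
# Illustrative rejection-filtering and loss-masking step; all field and
# function names here are assumptions for the sake of the example.
from dataclasses import dataclass

@dataclass
class Trace:
    tokens: list[int]        # tokenized interaction trace (prompt + agent turns)
    loss_mask: list[int]     # 1 = train on this token, 0 = ignore it
    passed_tests: bool       # did the final patch pass the environment test suite?

def build_rft_dataset(traces: list[Trace]) -> list[Trace]:
    """Rejection fine-tuning data: keep only traces whose patches passed the tests."""
    return [t for t in traces if t.passed_tests]

def mask_invalid_actions(trace: Trace, invalid_spans: list[tuple[int, int]]) -> Trace:
    """Zero the loss on tokens belonging to malformed environment actions."""
    for start, end in invalid_spans:
        for i in range(start, end):
            trace.loss_mask[i] = 0
    return trace
```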
Reinforcement Learning using a modified DAPO
Building on DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization), the researchers introduced several modifications for stability and scalability; a loss sketch follows the list below:
- Asymmetric clipping to prevent policy collapse and preserve exploration.
- Dynamic sample filtering to prioritize trajectories that carry useful learning signals.
- Length penalties to discourage excessively long episodes and mitigate looping behavior.
- Token-level averaging so every token in each trajectory contributes equally to gradient updates, allowing long trajectories to meaningfully shape learning.
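To make these modifications concrete, here is a minimal PyTorch-style sketch of a clipped policy-gradient loss with asymmetric clip bounds and token-level averaging; the clip values and the advantage construction are assumptions for illustration, not the authors' released code:

```python
import torch

def dapo_style_loss(logp_new, logp_old, advantages, loss_mask,
                    clip_low=0.2, clip_high=0.28):
    """Clipped policy-gradient loss with asymmetric clipping and token-level averaging.

    logp_new, logp_old : [batch, seq] log-probs of the sampled tokens under the
                         current and behavior policies.
    advantages         : [batch, seq] per-token advantage, e.g. a trajectory-level
                         normalized return broadcast over that trajectory's tokens.
    loss_mask          : [batch, seq] 1 for agent action tokens, 0 for prompt and
                         environment tokens.
    clip_low/clip_high : asymmetric clip range; a looser upper bound keeps some
                         probability mass on rare, exploratory tokens.
    """
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
    per_token = -torch.min(ratio * advantages, clipped * advantages)
    # Token-level averaging: every unmasked token in the batch gets equal weight,
    # so long trajectories shape the gradient in proportion to their length.
    return (per_token * loss_mask).sum() / loss_mask.sum().clamp(min=1)
```

Dynamic sample filtering would then sit upstream of this loss, dropping trajectory groups whose rewards are all identical and therefore carry no learning signal.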
The agent operates in a ReAct-style loop, interleaving reasoning with tool use. Supported tools include arbitrary shell commands, fine-grained code edits, navigation and search utilities, and a submit action to mark episode completion. Episodes run inside sandboxed environments initialized from real repository snapshots and GitHub-style issue prompts.
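A simplified sketch of such an episode loop, assuming a hypothetical `llm` client and `sandbox` interface with shell, edit, and submit tools (the paper's actual tool schema and prompts may differ):

```python
# Hypothetical ReAct-style episode loop; the llm and sandbox objects, tool
# names, and message format are illustrative, not the authors' interface.
def run_episode(llm, sandbox, issue_prompt, max_steps=40):
    history = [{"role": "user", "content": issue_prompt}]
    for _ in range(max_steps):
        # The model interleaves free-form reasoning with a single tool call.
        step = llm.generate(history)  # assumed to return {"thought": ..., "tool": ..., "args": {...}}
        history.append({"role": "assistant", "content": str(step)})
        if step["tool"] == "submit":
            return sandbox.run_tests()            # terminal reward: does the patch pass?
        if step["tool"] == "shell":
            observation = sandbox.run_shell(step["args"]["cmd"])
        elif step["tool"] == "edit":
            observation = sandbox.edit_file(**step["args"])
        else:
            observation = f"Unknown tool: {step['tool']}"
        history.append({"role": "user", "content": observation})
    return False                                  # step budget exhausted without a submit
```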
Scaling context length and episode length
Initial training used a context length of 65k tokens, already longer than the windows most open models operate with, but performance plateaued at 32% Pass@1. A subsequent RL phase extended the context to 131k tokens and raised the episode-length ceiling while focusing training on the most informative tasks. This scaling lets the agent handle the long stack traces and extended diff histories typical of real debugging and patching workflows.
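Read as a staged curriculum, the schedule looks roughly like the following; only the 65k to 131k context jump comes from the article, the other fields are placeholders:

```python
# Illustrative staged-RL schedule; only the context lengths (65k -> 131k, written
# here as powers of two) come from the article, the rest are placeholders.
rl_stages = [
    {"stage": 1, "context_length": 65_536,  "episode_step_cap": "baseline",
     "task_pool": "full filtered SWE-REBENCH subset"},
    {"stage": 2, "context_length": 131_072, "episode_step_cap": "raised",
     "task_pool": "most informative tasks only"},
]
```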
Results compared to baselines
The final RL-trained agent achieves 39.04% Pass@1 on SWE-bench Verified and 58.4% Pass@10, roughly doubling the rejection fine-tuned baseline and matching the performance of state-of-the-art open-weight models such as DeepSeek-V3-0324, all without teacher supervision. On held-out SWE-REBENCH splits the method remains robust, with 35.0% for May and 31.7% for June. Head-to-head comparisons show the RL agent matches or outperforms several leading open baselines and specialized SWE agents.
Key insights and future directions
- Credit assignment remains challenging in sparse-reward, long-horizon tasks. Future work could explore reward shaping, step-level critics, or prefix-based rollouts to provide more granular learning signals.
- Uncertainty estimation is important for real-world agents so they can abstain or signal low confidence. Techniques like output-entropy measures or explicit confidence scoring are promising next steps (see the sketch after this list).
- Efficient infrastructure matters: training used context parallelism over 16 H200 nodes, Kubernetes-based orchestration, Tracto AI, and vLLM for inference speedups.
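For the uncertainty-estimation direction above, here is a minimal sketch of an output-entropy confidence signal; the threshold and abstention rule are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor) -> float:
    """Average per-token entropy of the model's output distribution.

    logits: [seq_len, vocab_size] raw scores at each generated position.
    Higher mean entropy suggests the agent is less certain of its answer.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # [seq_len]
    return entropy.mean().item()

def should_abstain(logits: torch.Tensor, threshold: float = 2.5) -> bool:
    """Illustrative rule: abstain (or flag low confidence) above an entropy threshold."""
    return mean_token_entropy(logits) > threshold
```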
Practical implications
This research demonstrates that reinforcement learning can be a potent method for building autonomous software engineering agents using open-weight LLMs. By addressing long-horizon interaction, stateful environment feedback, and sparse rewards, the methodology opens a path to scalable, teacher-free agent development that directly leverages interactive learning rather than static instruction.
Where to find more
The paper and associated code, tutorials, and notebooks are available on the authors' GitHub and project pages. The research team also maintains social channels and a newsletter for updates and community engagement.