ToolTrain: ByteDance's RL Framework That Teaches LLMs to Explore Code Repos
ToolTrain teaches LLMs to use simple repository tools and combines SFT with tool-integrated RL to improve multi-hop issue localization, delivering state-of-the-art results on real-world benchmarks.
The challenge of locating issues in large repositories
Issue localization means finding the exact code locations that need changes to fix bugs. In large repositories this is often a slow, manual process that demands multi-step reasoning, careful navigation, and targeted tool use. Large language models can act as agents and call external tools to explore repositories, but they struggle with Repo Deep Search, a sequential, multi-hop navigation task that requires coherent reasoning and strategic tool calls.
Previous approaches and their limitations
Existing approaches to fault localization include classical deep learning methods and newer LLM-based techniques. Methods such as DeepFL and DeepRL4FL use neural networks and CNNs to analyze test coverage, data dependencies, and static code representations. Recent LLM-based techniques can narrow down code locations but often lack the complex reasoning and disciplined tool usage needed for multi-step repository exploration.
Agentic training attempts to close this gap. Frameworks like SWE-Gym and SEAlign fine-tune LLMs with high-quality trajectories to improve behavior, while LocAgent constructs ground truth for localization from functions modified by golden patches on GitHub. Despite these advances, maintaining correct tool calls and coherent reasoning chains during deep repo searches remains challenging.
What ToolTrain introduces
Researchers from Peking University, ByteDance, and Beijing Institute of Technology propose ToolTrain, a training framework that integrates tools into the learning loop to improve multi-hop reasoning in issue localization. ToolTrain centers on two ideas:
- RepoSearcher: a lightweight agent that exposes simple retrieval tools. These tools let the model locate function or class definitions by name, enabling more targeted exploration than blind search.
- Two-stage training: a rejection-sampled supervised fine-tuning stage followed by tool-integrated reinforcement learning. This combination teaches models not only to call tools correctly but to use them strategically, avoiding redundant exploration and focusing on promising code paths.
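To make the first idea concrete, a name-based definition lookup of the kind RepoSearcher exposes can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: it assumes Python repositories and uses the standard library `ast` module to find functions or classes by name.

```python
import ast
from pathlib import Path

def find_definition(repo_root: str, name: str) -> list[dict]:
    """Illustrative sketch of a name-based retrieval tool: return the
    file and line of every function or class definition matching `name`."""
    hits = []
    for path in Path(repo_root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that fail to parse
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                                 ast.ClassDef)) and node.name == name:
                hits.append({"file": str(path), "line": node.lineno})
    return hits
```

A tool of this shape lets the agent jump directly to a symbol mentioned in an issue report instead of scanning files blindly, which is what enables the "targeted exploration" described above.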
To build training data, the researchers extract labeled trajectories from open-source repositories so the model learns real, multi-step exploration patterns.
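The rejection-sampling step in the first training stage can be sketched as follows. This is an assumed reading of the approach, not the authors' code: `rollout` is a hypothetical function that runs the tool-using agent once on an issue, and a sampled trajectory is kept for SFT only if its predicted locations overlap the golden-patch edits.

```python
def rejection_sample(issues, rollout, gold_locations, k=8):
    """Hedged sketch of rejection-sampled SFT data construction:
    sample up to k trajectories per issue, keep the first one whose
    predicted code locations hit the golden-patch ground truth."""
    kept = []
    for issue in issues:
        for _ in range(k):
            trajectory, predictions = rollout(issue)
            if set(predictions) & set(gold_locations[issue]):
                kept.append(trajectory)  # accepted as a training example
                break  # one verified trajectory per issue suffices here
    return kept
```

Filtering on ground-truth overlap means the model only imitates exploration traces that actually ended at the right code, which is the point of rejection sampling before the RL stage refines tool-use strategy further.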
Evaluation dataset and metrics
The team evaluates ToolTrain using SWE-Bench-Verified, a benchmark drawn from real GitHub issues and manually verified by professional developers. Ground-truth answers are the functions and files actually modified in golden patches. RepoSearcher and ToolTrain are evaluated with metrics including Recall@k, MAP, MRR, nDCG@k, and a %Resolved metric that measures issue resolution success.
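Two of these metrics have simple standard definitions worth spelling out. Given a ranked list of predicted code locations and the set of gold locations from the golden patch, Recall@k and MRR can be computed as:

```python
def recall_at_k(ranked, gold, k):
    """Fraction of gold locations that appear in the top-k predictions."""
    return len(set(ranked[:k]) & set(gold)) / len(gold)

def mrr(ranked, gold):
    """Reciprocal rank of the first correct prediction (0.0 if none)."""
    for rank, location in enumerate(ranked, start=1):
        if location in gold:
            return 1.0 / rank
    return 0.0
```

For example, with predictions `["a", "b", "c", "d"]` and gold set `{"b", "d"}`, Recall@2 is 0.5 (one of two gold locations in the top 2) and MRR is 0.5 (first hit at rank 2).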
ToolTrain is applied to two model sizes: Qwen-7B and Qwen-32B. These are compared against four state-of-the-art baselines representing diverse designs: Agentless, OrcaLoca, CoSIL, and LocAgent. The comparison highlights ToolTrain's impact on precise and strategic code exploration.
Results and notable findings
ToolTrain-enabled RepoSearcher achieves state-of-the-art performance among similarly sized models, and even surpasses larger commercial models on some metrics. Examples:
- RepoSearcher with ToolTrain-32B reaches function-level Recall@5 of 68.55, outperforming Claude-3.7-Sonnet at 66.38.
- The 7B model enhanced by ToolTrain outperforms other frameworks that use 32B models, showing ToolTrain boosts tool-calling capabilities even in smaller models.
- For issue resolution, RepoSearcher with ToolTrain-7B achieves Recall@5 of 62.38 and a resolution rate of 14.00, the best among 7B models.
However, resolution rates vary depending on the patch generation model used. For instance, ToolTrain-7B shows a resolution rate of 14.00 versus 31.60 for ToolTrain-32B, despite similar localization performance. This suggests downstream patch generation quality still affects final resolution outcomes.
Why this matters for software engineering
ToolTrain demonstrates that combining supervised fine-tuning with reinforcement learning and explicit tool integration can teach LLMs to perform disciplined, multi-hop repository searches. By reducing redundant exploration and improving strategic tool use, ToolTrain helps smaller models punch above their weight and makes automated issue localization more practical for real-world codebases.
Additional resources and community notes
The paper and the GitHub repository include full details, tutorials, code, and notebooks. The authors also encourage following project updates on social channels and joining community hubs for further discussion and support.