
Achieving 50.8% on SWE-Bench Using Monolithic Long-Context Language Models Without Tooling

New research shows that powerful long-context language models can reach 50.8% accuracy on the SWE-Bench Verified software engineering benchmark without relying on complex tool scaffolding, simplifying LM agent design.

Advancements in LM Agents for Complex Tasks

Recent developments in language model (LM) agents have demonstrated strong potential for automating complex real-world tasks by proposing and executing actions through APIs. These applications span software engineering, robotics, and scientific experimentation. To handle increasing task complexity, LM agent frameworks have evolved to incorporate multiple agents, multi-step retrieval, and custom scaffolding techniques aimed at optimizing performance.

Challenges with Partial Observability and Tool Use

A key challenge is the agents’ need to explore and understand their environment effectively. This has led to engineered scaffolds such as tools, memory mechanisms, and customized pipelines. Most existing approaches assume partial observability, requiring incremental observation gathering. While this fits dynamic or unfamiliar environments, it is less relevant in fully observable settings like SWE-Bench, where all necessary information is available upfront.

Strategies in Software Engineering LM Agents

Research in software engineering has mainly followed two strategies: agent-based frameworks and structured pipelines. Agent-based systems such as SWE-Agent and OpenHands CodeAct let LMs interact autonomously with codebases through custom interfaces and retrieval tools. Other systems, including Moatless, AutoCodeRover, and SpecRover, focus on improved localization and scaffold design. In contrast, structured pipelines such as Agentless and CodeMonkeys break tasks into phases like localization, repair, and validation. Both families rely heavily on engineered components.

Leveraging Long-Context Language Models (LCLMs)

This study proposes using Long-Context LMs (LCLMs) to directly interpret the entire task environment without complex scaffolding. Recent advances in LCLM architecture and infrastructure enable these models to outperform retrieval-augmented systems, reducing reliance on external scaffolds.

Experimental Results with Gemini Models

Researchers from Stanford, IBM, and the University of Toronto tested whether complex scaffolding is actually necessary for tasks like SWE-Bench. Using LCLMs such as Gemini-1.5-Pro with simple prompting and no scaffolding, they achieved 38% accuracy on SWE-Bench Verified; Gemini-2.5-Pro reached 50.8% under the same conditions. A hybrid two-stage approach that uses Gemini-1.5-Pro for localization and Claude-3.7-Sonnet for patch generation achieved 48.6%, supporting the case for simplified architectures.

State-in-Context Agents and Methods

Traditional LM agents rely on interactive exploration due to partial observability, but many tasks such as software debugging offer full observability. The study introduces state-in-context agents that process full or compressed environment states directly via LCLMs, eliminating complex scaffolding. For large codebases, a ranking-based compression selects relevant files to fit context limits. Two methods are proposed:

  • DIRECTSOLVE: the LCLM solves the task end-to-end, reading the full (or compressed) codebase in context.
  • SELECTSOLVE: the LCLM localizes the relevant files, which a short-context LM (SCLM) then uses to generate the patch.

Both methods use targeted patch formats and validation to ensure accuracy and minimize hallucinations.
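
To make this concrete, the sketch below shows how ranking-based compression and the two solving modes could fit together. It is a minimal illustration under assumed helper signatures: rank_files, count_tokens, call_lclm, and call_sclm stand in for a relevance ranker, a tokenizer, and the LCLM/SCLM API calls, and the prompt wording and token budget are not taken from the paper.

    # Minimal sketch of state-in-context solving (assumed helper names, not the paper's code).
    from typing import Callable

    TOKEN_BUDGET = 900_000  # assumed budget, kept below the LCLM's context limit

    def compress_codebase(issue: str, files: dict[str, str],
                          rank_files: Callable[[str, dict[str, str]], list[str]],
                          count_tokens: Callable[[str], int]) -> list[tuple[str, str]]:
        # Ranking-based compression: keep the highest-ranked files until the budget is spent.
        kept, used = [], 0
        for path in rank_files(issue, files):
            cost = count_tokens(files[path])
            if used + cost > TOKEN_BUDGET:
                break
            kept.append((path, files[path]))
            used += cost
        return kept

    def build_prompt(issue: str, context: list[tuple[str, str]]) -> str:
        # Issue plus selected files, ending with instructions for a targeted patch format.
        body = "\n\n".join(f"### {path}\n{code}" for path, code in context)
        return (f"Issue:\n{issue}\n\nRepository files:\n{body}\n\n"
                "Return a fix as minimal search/replace edits to the files above.")

    def direct_solve(issue, context, call_lclm):
        # DIRECTSOLVE: the LCLM reads the compressed codebase and writes the patch itself.
        return call_lclm(build_prompt(issue, context))

    def select_solve(issue, context, call_lclm, call_sclm, k=5):
        # SELECTSOLVE: the LCLM only localizes files; a short-context LM writes the patch.
        ask = build_prompt(issue, context) + f"\nInstead of edits, list the {k} most relevant file paths."
        top_paths = set(call_lclm(ask).splitlines()[:k])
        selected = [(p, c) for p, c in context if p in top_paths]
        return call_sclm(build_prompt(issue, selected))

In either mode, the generated edits would then be applied and validated before being accepted, matching the validation step described above.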

Evaluation on SWE-Bench Verified Benchmark

The simplified agent framework was evaluated on SWE-Bench Verified, a benchmark of 500 real-world software engineering tasks. DIRECTSOLVE and SELECTSOLVE use LCLMs such as Gemini-1.5-Pro and Gemini-2.5-Pro; SELECTSOLVE additionally employs an SCLM (Claude-3.7-Sonnet) for patch generation. Results show that DIRECTSOLVE outperforms complex agent frameworks like Agentless and CodeAct with minimal engineering, and SELECTSOLVE improves accuracy further by delegating patch generation to a stronger short-context model. Ablation studies highlight the importance of chain-of-thought prompting, restating the relevant code, and token-efficient context design. Performance also depends on where relevant files appear in the prompt, indicating that LCLMs still process very long contexts unevenly.
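
As a rough illustration of the prompt-design choices those ablations probe, a hypothetical prompt builder is sketched below. The ordering heuristic (most relevant files placed last, nearest the instructions), the step wording, and the function name are assumptions for illustration, not the paper's actual prompts.

    # Hypothetical prompt assembly illustrating the ablated design choices.
    def build_solver_prompt(issue: str, ranked_files: list[tuple[str, str]]) -> str:
        # One plausible ordering heuristic: least relevant files first, so the most
        # relevant code sits closest to the instructions at the end of the prompt.
        ordered = list(reversed(ranked_files))  # ranked_files assumed most-relevant-first
        file_block = "\n\n".join(f"### {path}\n{code}" for path, code in ordered)
        return (
            f"Repository files (least to most relevant):\n{file_block}\n\n"
            f"Issue:\n{issue}\n\n"
            "Step 1: Reason step by step about which code causes the issue.\n"  # chain-of-thought
            "Step 2: Restate the exact snippets you will change.\n"             # code restatement
            "Step 3: Output minimal search/replace edits only.\n"               # token-efficient patch format
        )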

Cost and Practical Considerations

Using LCLM-based methods currently costs more than existing approaches: about $2.60 per instance versus $0.25 for Agentless and $0.87 for CodeAct. However, inference costs are dropping rapidly, and growing context lengths make LCLMs increasingly practical. Techniques like KV caching cut the cost of repeated runs substantially, to roughly $0.725 per instance, although even minor codebase changes invalidate parts of the cache and limit these savings today; further improvements are expected. LCLMs can also absorb long interaction histories directly, reducing the need for complex memory and retrieval mechanisms. Notably, even unscaffolded LCLMs perform competitively on SWE-Bench tasks.
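
A back-of-the-envelope calculation shows how prompt caching drives most of the saving. All per-token prices and token counts below are assumptions chosen only to reproduce the article's per-instance figures; they are not the paper's accounting.

    # Illustrative cost arithmetic; prices and token counts are assumed, not measured.
    PRICE_IN = 1.25 / 1e6           # assumed $ per uncached input token
    PRICE_IN_CACHED = 0.3125 / 1e6  # assumed $ per cached input token (a quarter of the uncached price)
    PRICE_OUT = 5.00 / 1e6          # assumed $ per output token

    prompt_tokens = 2_000_000       # assumed codebase-in-context prompt size
    output_tokens = 20_000          # assumed reasoning-plus-patch output size

    cold = prompt_tokens * PRICE_IN + output_tokens * PRICE_OUT         # first run, no cache
    warm = prompt_tokens * PRICE_IN_CACHED + output_tokens * PRICE_OUT  # prompt prefix already cached

    print(f"uncached: ${cold:.2f} per instance")   # ~$2.60, the article's uncached figure
    print(f"cached:   ${warm:.3f} per instance")   # ~$0.725, the article's cached figure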

Further Information

For full details, refer to the original research paper.
