AWS Launches SWE-PolyBench: A Multilingual Benchmark to Evaluate AI Coding Agents
AWS AI Labs has launched SWE-PolyBench, an open-source, multilingual benchmark that evaluates AI coding agents on real-world coding tasks across multiple languages, addressing the limitations of earlier, mostly Python-only benchmarks.
Advancing AI Coding Agent Evaluation
Recent progress in large language models (LLMs) has enabled AI coding agents to generate, modify, and understand software code. However, current evaluation methods mostly rely on synthetic or narrowly focused benchmarks, primarily in Python, which do not reflect the complexity and diversity of real-world codebases. This limitation causes many AI agents to overfit benchmark-specific patterns rather than demonstrate robust and transferable coding skills.
Introducing SWE-PolyBench
AWS AI Labs has addressed these challenges by introducing SWE-PolyBench, a comprehensive, multilingual, repository-level benchmark designed for execution-based evaluation of AI coding agents. SWE-PolyBench covers 21 GitHub repositories across four popular programming languages: Java, JavaScript, TypeScript, and Python. It includes 2,110 tasks such as bug fixes, feature implementations, and code refactorings.
Unlike previous benchmarks, SWE-PolyBench uses real pull requests that resolve actual issues and come with associated test cases, allowing verifiable evaluation. Additionally, a smaller subset called SWE-PolyBench500 has been released to facilitate faster experimentation without sacrificing task and language diversity.
Technical Setup and Metrics
SWE-PolyBench employs an execution-based evaluation pipeline. Each task provides a repository snapshot and a problem statement derived from a GitHub issue. The benchmark applies the ground truth patch within a containerized test environment configured for the language ecosystem (e.g., Maven for Java, npm for JavaScript/TypeScript).
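To make that flow concrete, here is a minimal sketch of what one execution-based evaluation step could look like, using Python, Docker, and git apply. The image tags, test commands, and the evaluate_patch helper are illustrative assumptions, not SWE-PolyBench's actual harness.

```python
"""Minimal sketch of an execution-based evaluation step, assuming a generic
Docker image per language ecosystem and a unified-diff candidate patch."""
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, image: str, test_cmd: str) -> bool:
    """Apply a candidate patch inside a disposable container and run the repo's tests."""
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{repo_dir}:/workspace",          # repository snapshot for the task
        "-v", f"{patch_file}:/tmp/candidate.patch",
        "-w", "/workspace",
        image,                                    # e.g. a Maven or Node.js image
        "bash", "-lc",
        f"git apply /tmp/candidate.patch && {test_cmd}",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0  # non-zero exit means the patch failed to apply or tests failed

# Hypothetical usage for a JavaScript repository:
# ok = evaluate_patch("/data/repo_snapshot", "/data/agent.patch",
#                     "node:18", "npm ci && npm test")
```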
Evaluation relies on unit test results measuring fail-to-pass (F2P) and pass-to-pass (P2P) transitions. Moreover, SWE-PolyBench introduces Concrete Syntax Tree (CST)-based metrics at both file and node levels to assess an agent’s capacity to locate and modify relevant code sections. These metrics provide detailed insights beyond simple pass/fail results, especially for complex multi-file changes.
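As a rough illustration of how F2P and P2P transitions can be derived from raw test outcomes, the sketch below compares per-test results recorded before and after a patch is applied. The dictionary format and the test_transitions / is_resolved helpers are assumptions made for exposition, not SWE-PolyBench's scoring code.

```python
"""Sketch of deriving fail-to-pass (F2P) and pass-to-pass (P2P) transitions
from per-test pass/fail outcomes recorded before and after a patch."""

def test_transitions(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """before/after map test IDs to True (pass) or False (fail)."""
    transitions: dict[str, list[str]] = {"F2P": [], "P2P": [], "P2F": [], "F2F": []}
    for test_id, passed_before in before.items():
        passed_after = after.get(test_id, False)
        key = ("P" if passed_before else "F") + "2" + ("P" if passed_after else "F")
        transitions[key].append(test_id)
    return transitions

def is_resolved(transitions: dict[str, list[str]],
                required_f2p: list[str], required_p2p: list[str]) -> bool:
    """A task counts as resolved when every required failing test now passes (F2P)
    and every previously passing required test still passes (P2P)."""
    return (set(required_f2p) <= set(transitions["F2P"])
            and set(required_p2p) <= set(transitions["P2P"]))
```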
Key Findings from Empirical Evaluation
Three open-source coding agents—Aider, SWE-Agent, and Agentless—were adapted to SWE-PolyBench, all using Anthropic’s Claude 3.5 as their underlying model.
The agents performed best on Python tasks, reaching up to a 24.1% pass rate, while struggling with TypeScript tasks, which had pass rates as low as 4.7%. Interestingly, Java, despite its complexity, showed higher success than TypeScript, highlighting the importance of model pretraining and syntax familiarity.
Task complexity also influenced performance. Simpler tasks involving single-function or single-class changes had success rates up to 40%, whereas multi-file or mixed-change tasks showed significant drops. Notably, high precision and recall in file and CST node retrieval did not always correlate with better pass rates, indicating that accurate code localization alone is insufficient for problem solving.
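For intuition on the retrieval metrics referenced above, the following sketch computes file-level localization precision and recall from two sets of paths; a node-level variant would compare the modified functions or classes extracted from the CST instead. The precision_recall helper is a simplified stand-in, not the benchmark's official scorer.

```python
"""Sketch of file-level retrieval precision/recall: compare the files an agent
edited against the files touched by the ground-truth patch."""

def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical example: the agent edited two files, only one of which the
# ground-truth patch also touched.
# p, r = precision_recall({"src/app.ts", "src/util.ts"}, {"src/app.ts", "src/db.ts"})
# -> p = 0.5, r = 0.5
```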
Impact and Future Directions
SWE-PolyBench offers a nuanced and robust evaluation framework that overcomes limitations of existing benchmarks by supporting multiple languages, diverse task types, and syntax-aware metrics. It provides a realistic measure of AI coding agents’ capabilities and highlights areas for improvement in generalizability, robustness, and reasoning.
This new benchmark lays the groundwork for advancing AI coding assistants to better handle the complexities of real-world software development.
For more details, visit the AWS DevOps Blog and check out SWE-PolyBench on Hugging Face and GitHub.