AWS Launches SWE-PolyBench: A Multilingual Benchmark to Evaluate AI Coding Agents
AWS AI Labs has launched SWE-PolyBench, an open-source, multilingual benchmark that evaluates AI coding agents on real-world coding tasks across multiple languages, addressing the limitations of earlier, mostly Python-only benchmarks.
Advancing AI Coding Agent Evaluation
Recent progress in large language models (LLMs) has enabled AI coding agents to generate, modify, and understand software code. However, current evaluation methods mostly rely on synthetic or narrowly focused benchmarks, primarily in Python, which do not reflect the complexity and diversity of real-world codebases. This limitation causes many AI agents to overfit benchmark-specific patterns rather than demonstrate robust and transferable coding skills.
Introducing SWE-PolyBench
AWS AI Labs has addressed these challenges by introducing SWE-PolyBench, a comprehensive, multilingual, repository-level benchmark designed for execution-based evaluation of AI coding agents. SWE-PolyBench covers 21 GitHub repositories across four popular programming languages: Java, JavaScript, TypeScript, and Python. It includes 2,110 tasks such as bug fixes, feature implementations, and code refactorings.
Unlike previous benchmarks, SWE-PolyBench uses real pull requests that resolve actual issues and come with associated test cases, allowing verifiable evaluation. Additionally, a smaller subset called SWE-PolyBench500 has been released to facilitate faster experimentation without sacrificing task and language diversity.
Technical Setup and Metrics
SWE-PolyBench employs an execution-based evaluation pipeline. Each task provides a repository snapshot and a problem statement derived from a GitHub issue. The benchmark applies the ground truth patch within a containerized test environment configured for the language ecosystem (e.g., Maven for Java, npm for JavaScript/TypeScript).
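To make that flow concrete, here is a minimal sketch of what one execution-based evaluation step could look like, using Python, Docker, and git apply. The image tags, test commands, and the evaluate_patch helper are illustrative assumptions, not SWE-PolyBench's actual harness.

```python
"""Minimal sketch of an execution-based evaluation step, assuming a generic
Docker image per language ecosystem and a unified-diff candidate patch."""
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, image: str, test_cmd: str) -> bool:
    """Apply a candidate patch inside a disposable container and run the repo's tests."""
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{repo_dir}:/workspace",          # repository snapshot for the task
        "-v", f"{patch_file}:/tmp/candidate.patch",
        "-w", "/workspace",
        image,                                    # e.g. a Maven or Node.js image
        "bash", "-lc",
        f"git apply /tmp/candidate.patch && {test_cmd}",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0  # non-zero exit means the patch failed to apply or tests failed

# Hypothetical usage for a JavaScript repository:
# ok = evaluate_patch("/data/repo_snapshot", "/data/agent.patch",
#                     "node:18", "npm ci && npm test")
```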
Evaluation relies on unit test results measuring fail-to-pass (F2P) and pass-to-pass (P2P) transitions. Moreover, SWE-PolyBench introduces Concrete Syntax Tree (CST)-based metrics at both file and node levels to assess an agent’s capacity to locate and modify relevant code sections. These metrics provide detailed insights beyond simple pass/fail results, especially for complex multi-file changes.
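As a rough illustration of how F2P and P2P transitions can be derived from raw test outcomes, the sketch below compares per-test results recorded before and after a patch is applied. The dictionary format and the test_transitions / is_resolved helpers are assumptions made for exposition, not SWE-PolyBench's scoring code.

```python
"""Sketch of deriving fail-to-pass (F2P) and pass-to-pass (P2P) transitions
from per-test pass/fail outcomes recorded before and after a patch."""

def test_transitions(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """before/after map test IDs to True (pass) or False (fail)."""
    transitions: dict[str, list[str]] = {"F2P": [], "P2P": [], "P2F": [], "F2F": []}
    for test_id, passed_before in before.items():
        passed_after = after.get(test_id, False)
        key = ("P" if passed_before else "F") + "2" + ("P" if passed_after else "F")
        transitions[key].append(test_id)
    return transitions

def is_resolved(transitions: dict[str, list[str]],
                required_f2p: list[str], required_p2p: list[str]) -> bool:
    """A task counts as resolved when every required failing test now passes (F2P)
    and every previously passing required test still passes (P2P)."""
    return (set(required_f2p) <= set(transitions["F2P"])
            and set(required_p2p) <= set(transitions["P2P"]))
```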
Key Findings from Empirical Evaluation
Three open-source coding agents—Aider, SWE-Agent, and Agentless—were adapted to SWE-PolyBench, all using Anthropic’s Claude 3.5 as their underlying model.
The agents performed best on Python tasks, reaching up to a 24.1% pass rate, while struggling with TypeScript tasks, which had pass rates as low as 4.7%. Interestingly, Java, despite its complexity, showed higher success than TypeScript, highlighting the importance of model pretraining and syntax familiarity.
Task complexity also influenced performance. Simpler tasks involving single-function or single-class changes had success rates up to 40%, whereas multi-file or mixed-change tasks showed significant drops. Notably, high precision and recall in file and CST node retrieval did not always correlate with better pass rates, indicating that accurate code localization alone is insufficient for problem solving.
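For intuition on the retrieval metrics referenced above, the following sketch computes file-level localization precision and recall from two sets of paths; a node-level variant would compare the modified functions or classes extracted from the CST instead. The precision_recall helper is a simplified stand-in, not the benchmark's official scorer.

```python
"""Sketch of file-level retrieval precision/recall: compare the files an agent
edited against the files touched by the ground-truth patch."""

def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical example: the agent edited two files, only one of which the
# ground-truth patch also touched.
# p, r = precision_recall({"src/app.ts", "src/util.ts"}, {"src/app.ts", "src/db.ts"})
# -> p = 0.5, r = 0.5
```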
Impact and Future Directions
SWE-PolyBench offers a nuanced and robust evaluation framework that overcomes limitations of existing benchmarks by supporting multiple languages, diverse task types, and syntax-aware metrics. It provides a realistic measure of AI coding agents’ capabilities and highlights areas for improvement in generalizability, robustness, and reasoning.
This new benchmark lays the groundwork for advancing AI coding assistants to better handle the complexities of real-world software development.
For more details, visit the AWS DevOps Blog and check out SWE-PolyBench on Hugging Face and GitHub.