
TikTok Launches SWE-Perf: Benchmarking LLMs for Real-World Code Performance Optimization

TikTok researchers have launched SWE-Perf, the first benchmark designed to assess LLMs' ability to optimize code performance across entire repositories, and report that current models still fall well short of human experts.

Advancing Performance Optimization in Software Engineering with SWE-Perf

Large language models (LLMs) have made significant progress in software engineering tasks such as code generation and bug fixing. However, optimizing code performance, especially at the repository level, remains a challenging frontier. To address this, TikTok researchers, together with collaborators, have introduced SWE-Perf, the first benchmark designed specifically to evaluate LLMs' ability to optimize code performance across entire repositories.

Why SWE-Perf Stands Out

Previous benchmarks such as SWE-Bench, Mercury, and EffiBench focused predominantly on correctness or function-level efficiency. SWE-Perf goes further by capturing the complexity of repository-scale performance tuning, providing a reproducible, quantitative framework for studying and improving the performance-optimization capabilities of modern LLMs.

Challenges in Repository-Level Optimization

Real-world codebases are large, modular, and deeply interconnected. Effective performance optimization requires understanding cross-file dependencies, execution paths, and computational bottlenecks. These challenges are beyond the scope of datasets focusing on isolated function-level tasks. SWE-Perf is designed to measure LLM performance in these realistic, complex scenarios.

Dataset Composition

SWE-Perf is based on a curated dataset derived from over 100,000 pull requests across nine high-profile GitHub repositories. Key features include:

  • 140 curated instances demonstrating measurable and stable performance improvements.
  • Complete codebases before and after optimization.
  • Target functions categorized into oracle (file-level) and realistic (repo-level) settings.
  • Unit tests and Docker environments ensuring reproducible execution and performance measurement.
  • Expert-authored patches serving as gold standards.

Each instance's unit tests must pass both before and after the patch is applied, and the patch must demonstrate a statistically significant runtime improvement (Mann-Whitney U test, p < 0.1) across 20 repeated runs. Performance gains are measured as the minimum performance gain (δ), which filters out measurement noise and isolates genuine improvements.
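
As a rough illustration of this verification step, the sketch below uses SciPy's Mann-Whitney U test to compare pre- and post-patch runtimes and applies a conservative gain threshold. The function name and the exact gain formula are illustrative assumptions, not the paper's official harness.

```python
# Illustrative sketch, not the official SWE-Perf harness: decide whether a
# patch produced a statistically significant runtime improvement.
from scipy.stats import mannwhitneyu


def is_significant_improvement(runtimes_before, runtimes_after,
                               alpha=0.1, min_gain=0.0):
    """Accept only if post-patch runtimes are significantly lower and the
    most conservative gain estimate exceeds `min_gain` (a stand-in for the
    minimum performance gain delta)."""
    # One-sided test: are post-patch runtimes stochastically smaller?
    _, p_value = mannwhitneyu(runtimes_after, runtimes_before,
                              alternative="less")
    # Conservative gain: slowest post-patch run vs. fastest pre-patch run.
    gain = (min(runtimes_before) - max(runtimes_after)) / min(runtimes_before)
    return p_value < alpha and gain > min_gain


# Example with 20 repetitions per side (seconds; made-up numbers).
before = [1.92, 1.95, 1.90, 1.97, 1.93] * 4
after = [1.70, 1.72, 1.69, 1.74, 1.71] * 4
print(is_significant_improvement(before, after))  # True
```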

Benchmark Settings: Oracle vs. Realistic

  • Oracle Setting: The model receives only the target functions and corresponding files, testing localized optimization capabilities.
  • Realistic Setting: The model is provided with the entire repository and must autonomously identify and optimize performance-critical paths, simulating real engineering work.
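
The sketch below shows one way the two settings could be assembled as model inputs, with the oracle setting restricted to the target files and the realistic setting exposing the whole repository. The helper names, file layout, and prompt format are hypothetical, not SWE-Perf internals.

```python
# Hypothetical sketch of building inputs for the oracle vs. realistic settings.
from pathlib import Path


def build_oracle_input(repo_root: str, target_files: list[str]) -> str:
    """Oracle setting: expose only the files that contain the target functions."""
    sections = []
    for rel_path in target_files:
        source = Path(repo_root, rel_path).read_text()
        sections.append(f"### {rel_path}\n{source}")
    return "\n\n".join(sections)


def list_repository_files(repo_root: str) -> list[str]:
    """Realistic setting: the agent sees the whole repository and must locate
    performance-critical code itself; here we simply enumerate Python files."""
    root = Path(repo_root)
    return [str(p.relative_to(root)) for p in root.rglob("*.py")]
```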

Evaluation Metrics

SWE-Perf uses a three-tier evaluation framework:

  • Apply: Whether the model-generated patch can be applied cleanly.
  • Correctness: Whether the patch preserves functional integrity (all unit tests pass).
  • Performance: Whether the patch results in measurable runtime improvements.

These metrics are reported independently, allowing nuanced analysis of tradeoffs between correctness and performance.
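
A minimal sketch of how these three tiers could be checked in sequence is shown below. The commands, paths, and timing approach are assumptions for illustration, not the official SWE-Perf harness, which measures runtimes over repeated runs inside the dataset's Docker environments.

```python
# Hedged sketch of the apply / correctness / performance tiers.
import subprocess
import time


def evaluate_patch(repo_dir: str, patch_file: str, test_cmd: list[str]) -> dict:
    result = {"apply": False, "correct": False, "runtime_s": None}

    # Tier 1 (Apply): does the model-generated patch apply cleanly?
    check = subprocess.run(["git", "apply", "--check", patch_file], cwd=repo_dir)
    if check.returncode != 0:
        return result
    subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)
    result["apply"] = True

    # Tier 2 (Correctness): do the repository's unit tests still pass?
    start = time.perf_counter()
    tests = subprocess.run(test_cmd, cwd=repo_dir)
    elapsed = time.perf_counter() - start
    if tests.returncode != 0:
        return result
    result["correct"] = True

    # Tier 3 (Performance): record the runtime for comparison against the
    # unpatched baseline.
    result["runtime_s"] = elapsed
    return result


# Usage (hypothetical paths):
# evaluate_patch("/workspace/repo", "model_patch.diff", ["pytest", "-q"])
```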

Experimental Results

The benchmark evaluated multiple leading LLMs in both the oracle and realistic settings, with the following performance gains (% improvement):

| Model                  | Setting   | Performance (%) |
|------------------------|-----------|-----------------|
| Claude-4-opus          | Oracle    | 1.28            |
| GPT-4o                 | Oracle    | 0.60            |
| Gemini-2.5-Pro         | Oracle    | 1.48            |
| Claude-3.7 (Agentless) | Realistic | 0.41            |
| Claude-3.7 (OpenHands) | Realistic | 2.26            |
| Expert (Human Patch)   | –         | 10.85           |

Even the best LLMs fall well short of expert performance. The agent-based OpenHands scaffold on Claude-3.7-sonnet achieves the highest score in the realistic setting (2.26%), yet still trails the expert patches (10.85%) by a wide margin.

Insights from the Benchmark

  • Agent-based frameworks like OpenHands excel in complex, multi-step optimizations compared to direct prompts and pipeline approaches.
  • Performance tends to decline as the number of target functions grows, indicating scalability challenges for LLMs.
  • LLMs show limited gains in long-runtime scenarios where expert optimizations continue to improve performance.
  • Analysis reveals LLMs focus more on low-level code elements (imports, environment setup), whereas experts target high-level semantic abstractions for effective tuning.

Impact and Future Directions

SWE-Perf offers a critical foundation to measure and enhance LLMs' repository-scale performance optimization capabilities. It highlights a substantial gap between current models and human experts, guiding future research towards practical, production-ready software optimization at scale. As LLMs evolve, SWE-Perf will serve as an essential benchmark to drive progress in real-world software enhancement.

For more details, see the paper and the project GitHub page.
