Reinforcement Learning Empowers LLMs to Outperform Traditional Compilers in Assembly Code Optimization
Researchers have developed a reinforcement learning framework that enables LLMs to optimize assembly code beyond traditional compilers, achieving an average 1.47× speedup and a 96% test pass rate across thousands of real-world programs.
Expanding the Role of LLMs in Code Optimization
Large Language Models (LLMs) have demonstrated remarkable abilities in programming tasks, yet their use for optimizing code, particularly at the assembly level, has been limited. While previous research has improved performance in higher-level languages such as C++ and Python, optimizing low-level code remains a challenge. Typical LLM benchmarks focus on code generation or bug fixing, and models such as Codex and AlphaCode target functional correctness rather than performance.
Emerging Learning-Based Optimization Techniques
Recent advances have applied reinforcement learning (RL) and graph neural networks to compiler optimization. Approaches such as AutoPhase and Coreset have shown promise in selecting and ordering compiler passes. Superoptimization techniques search for the most efficient version of a program but are typically restricted to small code snippets. Frameworks such as AutoTVM and Ansor tune GPU kernels using learned cost models and search.
Reinforcement Learning for Assembly Code
A new study by researchers from Stanford, UIUC, CMU, and Visa Research explores using LLMs with reinforcement learning to optimize assembly code, a task traditionally handled by compilers such as GCC. They developed a Proximal Policy Optimization (PPO) training framework that balances code correctness with execution speed, using a dataset of 8,072 real-world C programs compiled to assembly.
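For context, PPO itself optimizes the standard clipped surrogate objective; the correctness and speed signals described in this work enter through the reward that shapes the advantage estimate. The formula below is the generic PPO objective, not a detail taken from the paper:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$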
The resulting model, Qwen2.5-Coder-7B-PPO, achieves a 96.0% test pass rate and an average 1.47× speedup over the gcc -O3 baseline, outperforming 20 other models, including Claude 3.7 Sonnet. Given a C program and its compiled assembly, the task is to generate a functionally equivalent assembly program that runs faster, with correctness verified against test cases and speed measured relative to the baseline. Two reward functions guide training: Correctness-Guided Speedup and Speedup-Only.
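The article names the two reward functions but does not spell out their exact shaping, so the following C sketch is only an assumption about how such rewards could be computed from compilation status, test results, and measured runtimes; the function names, constants, and partial-credit scheme are hypothetical, not the paper's definition.

```c
#include <stdbool.h>

/* Speedup of the candidate assembly over the gcc -O3 baseline. */
static double speedup(double t_baseline, double t_candidate) {
    return (t_candidate > 0.0) ? t_baseline / t_candidate : 0.0;
}

/* Correctness-Guided Speedup (assumed shaping): partial credit for the
   fraction of tests passed, with the speedup term added only once every
   test passes. */
double reward_correctness_guided(bool compiles, int passed, int total,
                                 double t_baseline, double t_candidate) {
    if (!compiles || total <= 0) return 0.0;
    double pass_rate = (double)passed / (double)total;
    if (passed < total) return pass_rate;                  /* correctness only */
    return pass_rate + speedup(t_baseline, t_candidate);   /* 1.0 + speedup */
}

/* Speedup-Only (assumed shaping): the measured speedup, granted only to
   candidates that compile and pass all tests. */
double reward_speedup_only(bool compiles, int passed, int total,
                           double t_baseline, double t_candidate) {
    if (!compiles || passed < total) return 0.0;
    return speedup(t_baseline, t_candidate);
}
```

Under a shaping like this, the correctness-guided variant still gives the policy a useful signal when it has not yet found a faster program, which matches the article's description of balancing correctness with execution speed.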
Evaluation and Insights
Most LLMs struggle with assembly optimization, showing low pass rates and minimal speedups, but Qwen2.5-Coder-7B-PPO stands out clearly. Ablation studies highlight the importance of using gcc -O3 as a performance reference. Notably, some models, such as Claude 3.7 Sonnet, can outperform the compiler by applying hardware-specific optimizations, for example replacing a bit-counting loop with a single popcnt instruction, indicating semantic code transformations beyond what traditional compilers typically apply.
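To make the popcnt example concrete, here is a hypothetical C illustration (not taken from the paper's benchmark): a naive bit-counting loop of the kind that appears in real programs, alongside the builtin form that GCC lowers to a single popcnt instruction when the target supports it (for example with -mpopcnt on x86-64).

```c
#include <stdint.h>

/* Naive loop: one iteration per bit, as the source program might be written. */
int popcount_loop(uint64_t x) {
    int count = 0;
    while (x) {
        count += (int)(x & 1);  /* add the lowest bit */
        x >>= 1;                /* move to the next bit */
    }
    return count;
}

/* Equivalent single-instruction form: compiles to `popcnt` on x86-64
   when -mpopcnt (or -march=native on a supporting CPU) is enabled. */
int popcount_builtin(uint64_t x) {
    return __builtin_popcountll(x);
}
```

An optimizer that recognizes the loop's intent can replace the entire loop body with that one instruction, which is the kind of hardware-aware, semantics-level rewrite described above.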
Challenges and Future Directions
While the study demonstrates the potential of RL-trained LLMs to surpass traditional compiler optimizations, challenges remain. The approach lacks formal correctness guarantees, and performance can vary across hardware platforms. Nonetheless, this research opens new avenues for applying AI in low-level code optimization, potentially transforming compiler design and performance tuning.
For further details, see the original research paper.