UC Berkeley Launches CyberGym: The Ultimate AI Cybersecurity Benchmark for Real-World Large-Scale Code Vulnerabilities

The Growing Importance of AI in Cybersecurity

The field of cybersecurity is increasingly intertwined with artificial intelligence as software systems grow larger and more complex. Traditional security measures are no longer sufficient; modern cybersecurity demands advanced AI capabilities for automated reasoning, vulnerability detection, and deep code comprehension. To meet these challenges, AI tools must be rigorously tested against real-world conditions that reflect the intricacies of vast software ecosystems.

Shortcomings of Existing Benchmarks

Current benchmarks for evaluating AI in cybersecurity often fall short. Many rely on simplified, small-scale tasks that fail to capture the complex nature of vulnerabilities in large, actively maintained codebases. These tests typically do not simulate real-world environments where bugs hide deep within millions of lines of code, requiring nuanced understanding and advanced reasoning. This gap makes it difficult to gauge whether AI agents can be trusted for critical security functions.

Introducing CyberGym: A Comprehensive Benchmark

UC Berkeley’s new benchmarking tool, CyberGym, addresses these limitations by providing a large-scale, realistic evaluation framework. CyberGym includes 1,507 benchmark tasks derived from real vulnerabilities found and patched in 188 major open-source projects. These vulnerabilities, discovered by Google’s OSS-Fuzz, come with full pre-patch codebases, executables, and detailed textual descriptions.

AI agents are tasked with generating proof-of-concept exploits that reproduce vulnerabilities in the unpatched code while ensuring these exploits fail on the patched versions. This approach requires agents to navigate complex code paths and synthesize inputs under realistic conditions. CyberGym is modular and containerized, enabling easy expansion and reproducibility.

Multi-Level Evaluation Pipeline

CyberGym’s evaluation system consists of four difficulty levels, each providing increasing amounts of information:

Level 0: Only the codebase is provided, no hints about vulnerabilities.
Level 1: Adds a natural language description of the vulnerability.
Level 2: Includes a ground-truth proof of concept and crash stack trace.
Level 3: Provides the patch and post-patch codebase.

Each level demands deeper reasoning. For example, Level 1 requires agents to infer vulnerabilities solely from descriptions and code. CyberGym also filters tasks for quality by validating reproducibility and removing redundancies.

The dataset spans codebases with a median of 1,117 files and nearly 400,000 lines of code, sometimes reaching over 40,000 files and 7 million lines. Vulnerabilities cover various crash types, with heap-buffer-overflow READ and uninitialized value use among the most common.

Experimental Findings

Tests with leading AI agent frameworks revealed notable challenges:

The best performer, OpenHands combined with Claude-3.7-Sonnet, reproduced just 11.9% of vulnerabilities.
Success dropped sharply for longer proof-of-concept inputs, with under 8% success for inputs over 100 bytes.
Open-source models like DeepSeek-V3 achieved only 3.6% success.
Specialized fine-tuned models struggled to generalize, scoring below 2%.
Performance improved with richer input information, peaking at 17.1% success on Level 3.

Interestingly, most successful exploits were generated within the first 20 to 40 execution steps, with diminishing returns after 80 steps. Furthermore, agents discovered 15 previously unknown zero-day vulnerabilities and two unpatched disclosed ones, demonstrating practical potential.

Key Insights

CyberGym provides the largest and most realistic benchmark for AI cybersecurity evaluation.
Current AI agents have significant limitations in reproducing and discovering vulnerabilities.
Providing more context and information notably improves agent performance.
Handling long and complex proof-of-concept inputs remains a major challenge.
Agent interactions with runtime tools enhance exploit generation success.

CyberGym paves the way for more robust AI evaluation in cybersecurity, highlighting the gap between current AI capabilities and the demands of real-world security tasks, while showcasing the potential for future breakthroughs.