VERINA Benchmark: Pushing the Limits of Verifiable Code Generation with LLMs

VERINA introduces a holistic benchmark for evaluating LLMs on verifiable code generation, combining code, formal specifications, and proofs across diverse difficulty levels.

The Verification Challenge in LLM-Based Code Generation

Large Language Models (LLMs) like those powering Cursor and GitHub Copilot have significantly boosted programming productivity. Because these models are probabilistic, however, they offer no formal guarantees of correctness: generated code may contain subtle bugs, which limits how far critical applications can rely on LLM-based code generation alone.

Limitations of Existing Benchmarks

Current benchmarks such as HumanEval and MBPP focus on code generation alone and do not address the formal specifications or proofs needed for verification. Many verification efforts target only part of the pipeline, either code generation, specification writing, or proof generation, and rely on human input for the rest. DafnyBench and miniCodeProps focus on proof generation, while AutoSpec and SpecGen infer specifications and proofs from human-written code. Interactive theorem provers such as Lean are a natural setting for verifiable code generation with LLMs, since proofs are constructed from checkable intermediate steps. However, existing Lean benchmarks have limited coverage and lack robust quality control.
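To make the Lean setting concrete, here is a minimal sketch, not drawn from any benchmark and assuming a Lean 4 toolchain with Mathlib available for the `ring` tactic, of the kind of artifact verifiable code generation is expected to produce: an executable function, a specification stated as a theorem, and a proof assembled from intermediate steps that Lean checks mechanically.

```lean
import Mathlib.Tactic

-- Implementation: sum of the first n natural numbers.
def sumUpTo : Nat → Nat
  | 0     => 0
  | n + 1 => (n + 1) + sumUpTo n

-- Specification: a closed-form characterization of the result.
theorem sumUpTo_spec (n : Nat) : 2 * sumUpTo n = n * (n + 1) := by
  induction n with
  | zero => rfl
  | succ k ih =>
    -- Intermediate steps: unfold one layer of the recursion and
    -- distribute, apply the induction hypothesis, then close the
    -- remaining arithmetic goal.
    simp only [sumUpTo, Nat.mul_add]
    rw [ih]
    ring
```

Every step is checked by Lean's kernel, so a completed proof is a machine-verified guarantee rather than a statistical one.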

Introducing VERINA: A Comprehensive Benchmark

VERINA (Verifiable Code Generation Arena), developed by researchers at the University of California and Meta FAIR, addresses these gaps by offering a holistic benchmark encompassing code, specification, and proof generation. It includes 189 programming challenges formatted in Lean, featuring detailed problem descriptions, formal specifications, code implementations, and test suites with full line coverage.

All problems are manually curated for clarity and accuracy and are sourced from MBPP, LiveCodeBench, LeetCode, and other collections, giving a range of difficulty levels. Each problem includes tests for both positive and negative scenarios: input-output pairs the specification should accept and ones it should reject, enabling a thorough evaluation of generated specifications as well as code.
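As an illustration of the shape such a problem takes, the sketch below uses hypothetical names (`absVal`, `absVal_pre`, `absVal_post`, `absVal_correct`); the actual schema and naming conventions in VERINA may differ, so treat this as an outline rather than the benchmark format. It needs only the core Lean 4 toolchain (the `omega` tactic ships with recent releases).

```lean
-- Hypothetical VERINA-style task: compute the absolute value of an
-- integer. Names and layout are illustrative, not the actual schema.
-- Natural-language description: "Return |x| for an integer x."

-- Precondition: no restriction on the input.
def absVal_pre (_x : Int) : Prop := True

-- Postcondition: the result is non-negative and equals x or -x.
def absVal_post (x result : Int) : Prop :=
  0 ≤ result ∧ (result = x ∨ result = -x)

-- Candidate implementation.
def absVal (x : Int) : Int :=
  if x < 0 then -x else x

-- Proof obligation: the implementation satisfies the postcondition
-- whenever the precondition holds.
theorem absVal_correct (x : Int) (_h : absVal_pre x) :
    absVal_post x (absVal x) := by
  unfold absVal_post absVal
  by_cases hx : x < 0
  · rw [if_pos hx]; omega
  · rw [if_neg hx]; omega
```

A model can be asked to produce any of the three artifacts (implementation, specification, or proof) given the others, which is exactly the task split VERINA evaluates.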

Dataset Structure: VERINA-BASIC and VERINA-ADV

VERINA is divided into two subsets:

  • VERINA-BASIC: Contains 108 problems translated from human-written Dafny code, including 49 problems from MBPP-DFY50 and 59 from CloverBench, translated with OpenAI's o3-mini model and manually inspected.

  • VERINA-ADV: Comprises 81 more challenging problems derived from student coursework involving formalization in Lean, sourced from platforms like LeetCode and LiveCodeBench.

Rigorous quality assurance underpins the dataset: every problem has a detailed description, positive tests achieve full line coverage of the ground-truth code, and every test behaves as expected against the ground-truth specification.
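One way to picture that last check, reusing the hypothetical `absVal` names from the sketch above (VERINA's actual harness may automate this differently): a positive test pairs an input with a correct output and must satisfy the postcondition, while a negative test pairs an input with a wrong output and must be rejected by it.

```lean
-- Positive test: the output for input -5 satisfies the postcondition.
example : absVal_post (-5) (absVal (-5)) := by
  unfold absVal_post absVal
  decide

-- Negative test: 3 is not a valid result for input -5, so the
-- postcondition must reject it.
example : ¬ absVal_post (-5) 3 := by
  unfold absVal_post
  decide
```

Checks like these catch specifications that are vacuously true or overly permissive before a problem enters the benchmark.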

Evaluating LLMs on VERINA: Insights and Challenges

Nine state-of-the-art LLMs were evaluated on VERINA, revealing a clear performance hierarchy:

  • Code generation achieved the highest success rates.
  • Specification generation was moderately successful.
  • Proof generation was by far the hardest, with pass@1 rates below 3.6% across models.

The advanced subset, VERINA-ADV, was considerably more challenging across all tasks. Iterative proof refinement using the o4-mini model improved success rates on simpler problems from 7.41% to 22.22% after 64 iterations, though improvements were limited on advanced problems. Providing ground truth specifications notably enhanced code generation, demonstrating that formal specifications effectively guide synthesis processes.
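For reference, pass@1 here is the fraction of problems solved by a single sampled attempt. When several samples are drawn per problem, the standard unbiased estimator introduced alongside HumanEval is commonly used; whether and how VERINA applies it for larger k is detailed in the paper. With n samples per problem, c of which are correct:

```latex
\operatorname{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right],
\qquad \operatorname{pass@}1 = \mathbb{E}_{\text{problems}}\!\left[\frac{c}{n}\right].
```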

Future Directions and Impact

VERINA sets a new standard for evaluating verifiable code generation by integrating code, specifications, and proofs into a single benchmark. Although 189 problems is too small a set for fine-tuning, it offers a solid foundation for future research. Promising directions include scaling the dataset through automated annotation, strengthening the evaluation metrics with more capable provers (LLM-based or SMT-backed), and handling more nuanced soundness and completeness relationships between generated and ground-truth specifications, all of which would improve specification evaluation and overall verification reliability.

For more information, explore the Paper, Dataset Card, and GitHub Page.
