
ByteDance Launches Seed-Prover: A Breakthrough in Automated Mathematical Theorem Proving

ByteDance introduces Seed-Prover, a novel lemma-centric system that achieves breakthrough results in automated mathematical theorem proving, solving 5 out of 6 IMO 2025 problems and excelling across multiple benchmarks.

Advances in Mathematical Reasoning with LLMs

Large Language Models (LLMs) have significantly improved mathematical reasoning by leveraging natural language, boosting performance on benchmarks like MATH and AIME. However, reinforcement learning (RL) faces challenges because verifying natural language proofs requires meticulous manual checking, limiting RL's use in training theorem-proving models.

Seed-Prover: Lemma-Centric Whole-Proof Reasoning

ByteDance's Seed Team presents Seed-Prover, a lemma-style whole-proof reasoning system that iteratively refines proofs using Lean feedback, previously proven lemmas, and self-summarization. Unlike traditional step-by-step or whole-proof generation approaches, Seed-Prover centers its reasoning on lemmas, enabling deeper and broader inference strategies to solve complex problems such as those in the International Mathematical Olympiad (IMO).
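To make the lemma-centric style concrete, here is a toy Lean illustration (mine, not from the paper): a small helper lemma is proved first, then reused to close the main goal. Seed-Prover applies the same lemma-first structure to vastly harder statements, accumulating verified lemmas across refinement rounds.

```lean
-- Toy example of lemma-centric proving: `double` is established
-- independently, then the main theorem is reduced to it.
theorem double (n : Nat) : n + n = 2 * n := by omega

theorem quadruple (n : Nat) : (n + n) + (n + n) = 4 * n := by
  rw [double]  -- reuse the previously proven lemma
  omega        -- close the remaining arithmetic goal
```

Because each lemma is checked by Lean independently, a failed attempt at the main theorem still leaves verified intermediate results behind for the next attempt to build on.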

Overcoming Limitations with Seed-Geometry

To complement Seed-Prover, Seed-Geometry is introduced as a geometry reasoning engine addressing Lean's constraints in geometric support. This enhances the system's ability to handle geometry problems effectively.

Training Methodology and Dataset

Seed-Prover utilizes multi-stage, multi-task reinforcement learning based on VAPO for interaction with Lean. The training data merges open-source datasets and in-house formal problems, with a proposer generating simpler task variants while excluding overly simple problems with high proof rates. Seed-Geometry's backend supports large-scale problem generation, identifying over 230 million unique problems within a week, achieving an eightfold improvement in search efficiency.
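The generate-verify-refine interaction with Lean described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the paper's implementation; all names (`model`, `check_in_lean`, `refine_proof`) are hypothetical stand-ins.

```python
# Hypothetical sketch of the Lean-feedback loop: a model proposes a
# proof, the checker returns pass/fail plus feedback, and the next
# attempt conditions on that feedback and any lemmas already proven.

def refine_proof(statement, model, check_in_lean, max_rounds=4):
    """Iteratively refine a candidate proof using verifier feedback."""
    proven_lemmas, feedback = [], None
    for _ in range(max_rounds):
        candidate = model(statement, feedback, proven_lemmas)
        ok, feedback, new_lemmas = check_in_lean(candidate)
        proven_lemmas.extend(new_lemmas)  # keep sub-lemmas that did verify
        if ok:
            return candidate
    return None  # budget exhausted without a verified proof
```

The key design point this sketch captures is that verification is automatic and cheap, which is what makes reinforcement learning over proof attempts feasible in the first place.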

Performance Across Benchmarks

Seed-Prover demonstrates state-of-the-art results:

  • IMO 2025: Fully solved 5 out of 6 problems; Seed-Geometry instantly solved Problem 2; combined methods derived proofs for the remaining problem.
  • Past IMO Problems: Solved 121 out of 155 tasks with a 78.1% success rate across difficulties.
  • MiniF2F Benchmark: Achieved 99.6% proof rate on validation and test sets, solving challenging problems like IMO 1990 Problem 3.
  • PutnamBench: Solved problems rose from 201 to 331 out of 657 as inference settings were scaled up.
  • CombiBench: Solved 30 out of 100 combinatorics problems, outperforming existing methods while highlighting challenges in combinatorial reasoning.
  • MiniCTX-v2: Achieved 81.8% success, outperforming baseline methods significantly.

Future Directions

The integration of formal languages like Lean with LLMs provides rapid, cost-effective, and reliable proof verification compared to human experts and LLM judges. Future research aims to merge formal systems with LLM capabilities to tackle open mathematical conjectures.

For more information, see the paper and the GitHub page, which provide tutorials, code, and notebooks.
