ShinkaEvolve: LLM-Driven Program Evolution Reaches SOTA Circle Packing in ~150 Evaluations
What ShinkaEvolve aims to do
ShinkaEvolve is an open-source framework from Sakana AI that couples large language models (LLMs) with evolutionary search to evolve programs for scientific and engineering problems. The core claim is dramatic sample efficiency: instead of thousands of program evaluations, ShinkaEvolve can find strong solutions with only hundreds of evaluations in several benchmark domains.
Key techniques that cut evaluation costs
The framework reduces wasted evaluations with three interacting mechanisms:
Adaptive parent sampling: Parents for mutation are drawn from islands using fitness- and novelty-aware policies that balance exploration and exploitation. Rather than always mutating the current best program, sampling strategies include power-law draws over fitness ranks and weighting by performance and offspring counts.
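A minimal sketch of power-law parent sampling over fitness ranks; the decay exponent `alpha` and the function interface are illustrative assumptions, not values from the ShinkaEvolve paper:

```python
import random

def sample_parent_power_law(fitnesses, alpha=2.0, rng=None):
    """Pick a parent index with probability decaying as a power law over
    fitness rank, so elites are favored but weaker programs still get drawn.

    `alpha` is an illustrative exponent: larger values concentrate sampling
    on the best programs, smaller values spread it out.
    """
    rng = rng or random.Random()
    # rank 0 = highest fitness
    order = sorted(range(len(fitnesses)), key=lambda i: -fitnesses[i])
    rank = {idx: r for r, idx in enumerate(order)}
    weights = [1.0 / (rank[i] + 1) ** alpha for i in range(len(fitnesses))]
    return rng.choices(range(len(fitnesses)), weights=weights, k=1)[0]
```

With `alpha=2.0`, the best program is drawn roughly four times as often as the second best, but every program retains nonzero probability, which is what keeps exploration alive.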
Novelty-based rejection filtering: Candidate edits are embedded and compared to existing entries in the archive. If cosine similarity is above a threshold, a secondary LLM acts as a novelty judge to decide whether to execute the candidate, avoiding near-duplicate evaluations.
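The filtering step can be sketched as follows; the similarity threshold and the judge's callable interface are illustrative assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def should_evaluate(candidate_emb, archive_embs, threshold=0.95, judge=None):
    """Skip evaluation of near-duplicate candidates.

    If the candidate's maximum cosine similarity to any archived program
    exceeds `threshold`, defer the decision to a secondary LLM `judge`
    (modeled here as a callable returning True/False); otherwise evaluate
    the candidate directly.
    """
    max_sim = max((cosine(candidate_emb, e) for e in archive_embs), default=0.0)
    if max_sim < threshold:
        return True
    return judge(candidate_emb) if judge else False
```

The key point is that the cheap embedding comparison gates the expensive steps: only borderline candidates reach the LLM judge, and only approved candidates reach the evaluator.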
Bandit-based LLM ensembling: Multiple LLM backends are treated like arms in a bandit. The system tracks relative fitness gains produced by each model and routes future mutation proposals to the most promising models using a UCB1-style update on improvement over parent or baseline.
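A compact sketch of UCB1-style routing over LLM backends, where the reward is a child's fitness improvement over its parent; the class name, reward clipping, and exploration constant are illustrative assumptions:

```python
import math

class LLMBandit:
    """UCB1-style router: each LLM backend is a bandit arm, and rewards are
    fitness improvements of a proposed child over its parent program."""

    def __init__(self, models, c=1.0):
        self.models = list(models)
        self.c = c  # exploration constant (illustrative choice)
        self.counts = {m: 0 for m in self.models}
        self.total_reward = {m: 0.0 for m in self.models}

    def select(self):
        """Return the model with the highest upper confidence bound."""
        t = sum(self.counts.values()) + 1
        def ucb(m):
            n = self.counts[m]
            if n == 0:
                return float("inf")  # try every model at least once
            mean = self.total_reward[m] / n
            return mean + self.c * math.sqrt(math.log(t) / n)
        return max(self.models, key=ucb)

    def update(self, model, child_fitness, parent_fitness):
        """Record the improvement (clipped at zero) produced by `model`."""
        self.counts[model] += 1
        self.total_reward[model] += max(0.0, child_fitness - parent_fitness)
```

Models that keep producing improvements accumulate higher mean rewards and get routed more proposals, while the confidence term keeps occasionally probing the others.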
Benchmarks and empirical results
ShinkaEvolve was evaluated across four domains and showed consistent gains under tight evaluation budgets:
Circle packing (n=26 in a unit square): The system reached a new SOTA configuration using approximately 150 program evaluations. The team also validated solutions with strict exact-constraint checks.
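A validity check of the kind used to vet packings is easy to state: every circle must lie inside the unit square and no two circles may overlap. A minimal sketch with an illustrative numerical tolerance (the paper's own validation uses strict exact-constraint checks):

```python
import math

def check_packing(circles, tol=1e-9):
    """Verify a candidate packing of circles (x, y, r) in the unit square:
    each circle stays within the boundary, and no pair overlaps."""
    # boundary constraints
    for x, y, r in circles:
        if x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            return False
    # pairwise non-overlap: center distance >= sum of radii
    for i in range(len(circles)):
        for j in range(i + 1, len(circles)):
            xi, yi, ri = circles[i]
            xj, yj, rj = circles[j]
            if math.hypot(xi - xj, yi - yj) < ri + rj - tol:
                return False
    return True
```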
AIME math reasoning (2024 set): ShinkaEvolve evolved agentic scaffolds that map out a Pareto frontier of accuracy versus LLM-call budget, outperforming hand-built baselines under limited query budgets and transferring across AIME years and LLMs.
Competitive programming (ALE-Bench LITE): Starting from ALE-Agent solutions, ShinkaEvolve obtained about a 2.3% mean improvement across 10 tasks and elevated one solution from 5th to 2nd in an AtCoder leaderboard counterfactual.
LLM training (Mixture-of-Experts): The system discovered a new load-balancing loss that adds an entropy-modulated under-use penalty to the global-batch objective, reducing mis-routing and improving perplexity and downstream accuracy across regularization strengths.
How the evolutionary loop operates in practice
ShinkaEvolve maintains an archive of evaluated programs with fitness, public metrics, and textual feedback. Each generation follows these steps:
- Sample an island and select parent(s) according to adaptive policies.
- Build a mutation context combining top-K candidates and random “inspiration” programs.
- Propose edits via three operators: diff edits, full rewrites, and LLM-guided crossovers, while protecting immutable code regions with explicit markers.
- Apply novelty filtering on proposed candidates and run only those passing the judge.
- Execute evaluated candidates, update the archive, and update bandit statistics that steer future LLM selection.
The system also periodically generates a meta-scratchpad summarizing recently successful strategies; those summaries are fed back into prompts to accelerate later generations.
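The generation loop above can be sketched in one function; every callback name and signature here is an illustrative assumption about the control flow, not the framework's actual API:

```python
import random

def evolve(archive, islands, n_generations, *, sample_parent, build_context,
           propose_edit, is_novel, evaluate, bandit, meta_period=10,
           summarize=None):
    """One possible shape of the ShinkaEvolve generation loop (sketch only).

    `archive` holds (program, fitness) pairs; callbacks stand in for the
    framework's components described in the steps above."""
    scratchpad = ""
    for gen in range(n_generations):
        island = random.choice(islands)
        parent = sample_parent(island)                # adaptive parent sampling
        context = build_context(archive, parent, scratchpad)
        model = bandit.select()                       # bandit-based LLM routing
        child = propose_edit(model, parent, context)  # diff / rewrite / crossover
        if not is_novel(child, archive):              # novelty rejection filter
            continue
        fitness = evaluate(child)
        archive.append((child, fitness))
        bandit.update(model, fitness, parent[1])
        if summarize and gen % meta_period == 0:
            scratchpad = summarize(archive)           # meta-scratchpad refresh
    return archive
```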
Concrete engineering discoveries
ShinkaEvolve did not simply reapply hand-coded strategies. Examples of discovered techniques include:
Circle packing: structured initialization patterns (like golden-angle patterns), a hybrid global–local search that mixes simulated annealing with SLSQP, and escape mechanisms such as temperature reheating and ring rotations.
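A golden-angle ("sunflower") initialization of the kind mentioned above is a few lines; the `scale` parameter and centering are illustrative choices for keeping points inside the unit square:

```python
import math

def golden_angle_init(n, center=(0.5, 0.5), scale=0.45):
    """Sunflower-style seeding: point k sits at angle k * golden-angle and
    radius proportional to sqrt(k/n), spreading points evenly in a disc.
    Used here only to illustrate the initialization pattern."""
    golden = math.pi * (3.0 - math.sqrt(5.0))  # ~2.39996 radians
    pts = []
    for k in range(n):
        r = scale * math.sqrt((k + 0.5) / n)
        theta = k * golden
        pts.append((center[0] + r * math.cos(theta),
                    center[1] + r * math.sin(theta)))
    return pts
```

Such structured starts give the subsequent annealing/SLSQP refinement a well-spread configuration instead of a random cloud.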
AIME scaffolds: a three-stage expert ensemble workflow involving generation, critical peer review, and synthesis that reaches a cost-effective accuracy point at roughly seven LLM calls.
ALE-Bench improvements: targeted engineering optimizations such as caching kd-tree subtree statistics and targeted edge moves toward misclassified items, which improve scores without full rewrites.
MoE loss improvement: adding an entropy-modulated under-use penalty to the global-batch objective reduces mis-routing and improves perplexity and downstream benchmarks.
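To make the general idea concrete, here is a toy sketch of combining a standard load-balance term with an entropy-modulated under-use penalty. This is NOT the loss ShinkaEvolve discovered; every term, weight, and name below is an illustrative assumption about the general pattern:

```python
import math

def load_balance_loss_sketch(router_probs, target=None, lam=1.0):
    """Toy illustration only: a uniform-usage balance term plus an under-use
    penalty scaled by mean routing entropy (an undecided router raises the
    penalty's weight). `router_probs` is a list of per-token probability
    vectors over experts; `lam` is an illustrative coefficient."""
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    # mean routing probability per expert
    mean_p = [sum(p[e] for p in router_probs) / n_tokens
              for e in range(n_experts)]
    # standard balance term: minimized when usage is uniform
    balance = n_experts * sum(q * q for q in mean_p)
    # mean per-token routing entropy (high when the router is undecided)
    entropy = -sum(sum(q * math.log(q + 1e-12) for q in p)
                   for p in router_probs) / n_tokens
    # under-use penalty: experts whose usage falls below the uniform target
    target = target if target is not None else 1.0 / n_experts
    under_use = sum(max(0.0, target - q) for q in mean_p)
    return balance + lam * entropy * under_use
```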
Comparison with prior systems
Closed-source systems like AlphaEvolve reported strong results but relied on far larger evaluation budgets. ShinkaEvolve reproduces and surpasses the circle-packing result using orders-of-magnitude fewer samples, and it releases the entire stack under Apache-2.0. Ablation studies show that adaptive parent selection, novelty filtering, and bandit ensembles each contribute measurably to the observed efficiency gains.
Availability and resources
ShinkaEvolve is released under an Apache-2.0 license with public code, a research report, tutorials, and a WebUI. The project page and repository contain technical details, examples, and notebooks for reproducing the reported runs.
For more details and code, see https://sakana.ai/shinka-evolve/.