78 Examples, Massive Gains: LIMI Turns Tiny Datasets into Powerful Software Agents

LIMI at a glance

Researchers from Shanghai Jiao Tong University and the SII Generative AI Research Lab present LIMI, short for Less Is More for Agency. LIMI is a supervised fine-tuning approach that converts a base language model into a capable software and research agent using only 78 curated, long-horizon, tool-grounded trajectories. On AgencyBench, LIMI achieves a 73.5% average (FTFC 71.7, RC@3 74.2, SR@3 74.6) and outperforms multiple strong baselines, including large models trained on orders of magnitude more samples.
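As a quick sanity check, the headline number is consistent with an unweighted mean of the three sub-metrics (an assumption about how the benchmark aggregates them):

```python
# Sanity check, assuming the AgencyBench average is the unweighted mean
# of the three sub-metrics reported above.
ftfc, rc_at_3, sr_at_3 = 71.7, 74.2, 74.6
print(f"{(ftfc + rc_at_3 + sr_at_3) / 3:.1f}%")  # -> 73.5%
```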

What makes LIMI different

The core claim is an Agency Efficiency Principle: agent competence scales more with the quality and structure of supervision than with raw sample count. Instead of thousands of short instruction-response pairs, LIMI uses a small set of dense demonstrations where each sample captures an entire multi-turn workflow. These trajectories document model reasoning, tool calls, environment observations, and verification steps collected inside the SII-CLI execution environment.
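To make the data shape concrete, here is one plausible way such a trajectory could be represented; the schema below is a hypothetical sketch, not the paper's actual format:

```python
# Hypothetical trajectory schema; field names are illustrative only,
# not the paper's actual data format.
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class Step:
    kind: Literal["reasoning", "tool_call", "observation", "verification"]
    content: str  # model thoughts, issued command, or environment output

@dataclass
class Trajectory:
    query: str                  # the original practitioner task
    steps: list[Step] = field(default_factory=list)  # full multi-turn workflow
    completed: bool = False     # whether the task reached verified success
```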

How the data were constructed

The training set contains 78 trajectories drawn from two sources: 60 real practitioner queries and 18 cases synthesized from high-star GitHub pull requests. PhD-level annotators rigorously validated each query-answer pair. Each trajectory is token-dense (roughly 13k to 152k tokens, about 42.4k on average) and records the full path from query to verified completion across tasks spanning interactive software development and research workflows such as search, analysis, and experiment design.
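A minimal, runnable sketch of how one might summarize trajectory lengths; the token counts below are hypothetical, chosen only to fall within the reported range:

```python
# Length statistics over trajectory token counts, assuming the counts were
# computed elsewhere with the base model's tokenizer. Sample values are
# hypothetical, not the actual dataset.
from statistics import mean

def length_stats(token_counts: list[int]) -> dict[str, float]:
    return {
        "count": len(token_counts),
        "min": min(token_counts),
        "max": max(token_counts),
        "mean": round(mean(token_counts), 1),
    }

print(length_stats([13_000, 28_500, 42_400, 97_000, 152_000]))
```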

Training and base models

LIMI was applied to GLM-4.5 (355B) and GLM-4.5-Air (106B) using the slime SFT framework. The authors kept training configs identical across comparisons to isolate the effect of data quality and structure. The focus was to measure how far a small set of long-horizon, tool-grounded trajectories can push agentic behavior relative to conventional large-scale SFT baselines.
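The slime training code isn't reproduced here; as a rough illustration of the same recipe (plain supervised fine-tuning over full trajectory text), a sketch using Hugging Face TRL and a small stand-in model might look like this:

```python
# Illustrative only: the paper fine-tunes GLM-4.5 / GLM-4.5-Air with the slime
# framework; this sketch uses TRL and a small stand-in model to show the idea
# of plain SFT over complete trajectory text.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical data: each record holds one full trajectory serialized as text.
train_data = Dataset.from_list([
    {"text": "User: Fix the failing test...\nAssistant: <reasoning, tool calls, verification>"},
])

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # stand-in; not the paper's base model
    train_dataset=train_data,
    args=SFTConfig(
        output_dir="limi-style-sft",
        per_device_train_batch_size=1,
        num_train_epochs=2,
        learning_rate=1e-5,
    ),
)
trainer.train()
```

In practice, the long trajectory lengths (up to roughly 152k tokens) would make sequence length and memory handling the main engineering concerns for such a run.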

Evaluation and results

Primary evaluation used AgencyBench with three-round interactions and the metrics FTFC, SR@3, and RC@3, alongside a suite of generalization benchmarks (TAU2, EvalPlus-HE/MBPP, DS-1000, SciCode). LIMI achieved a 73.5% average on AgencyBench, a large margin above the baselines: GLM-4.5 scored 45.1, Qwen3-235B-A22B 27.5, and the others lower still. Remarkably, LIMI outperformed a GLM-4.5 variant fine-tuned on 10,000 AFM-CodeAgent SFT samples (73.5% vs. 47.8%), a 128x reduction in training samples alongside a 25.7-point absolute (53.7% relative) improvement.
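Under a simplified reading of these metrics (FTFC as solving a task on the first attempt, SR@3 as solving it within three rounds), the scoring logic looks roughly like this; the real benchmark evaluates requirement completion more finely:

```python
# Simplified round-based metrics, under an assumed reading of their definitions.
def ftfc(results: list[list[bool]]) -> float:
    """Fraction of tasks solved on the first round."""
    return sum(rounds[0] for rounds in results) / len(results)

def sr_at_k(results: list[list[bool]], k: int = 3) -> float:
    """Fraction of tasks solved within the first k rounds."""
    return sum(any(rounds[:k]) for rounds in results) / len(results)

# Hypothetical per-task success flags across three rounds:
outcomes = [[True, True, True], [False, True, True], [False, False, False]]
print(ftfc(outcomes), sr_at_k(outcomes))  # 0.333..., 0.666...
```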

Generalization results show LIMI averaging about 57% across diverse tool-use, coding, and scientific computing tasks. Even without tool access, LIMI retains a small lead, suggesting that curated trajectories impart intrinsic reasoning and orchestration skills beyond environment-specific tooling.

Why trajectory quality matters

The LIMI results emphasize that long, coherent trajectories that mirror real multi-step workflows capture planning, tool orchestration, and verification patterns that short, generic instructions do not. Each dense demonstration encodes multi-turn strategies and environment interactions that the model can imitate and generalize from more effectively than many disconnected examples.

Practical implications and next steps

LIMI indicates a promising direction for building practical software agents with limited labeling budgets: invest in collecting high-quality, tool-grounded trajectories rather than scaling generic SFT data blindly. Future work will explore broader task coverage, automated trajectory synthesis, and transfer to other base models and tooling environments. The paper, code, and model card are available for deeper inspection and reproduction of results.