MIT's PDDL-INSTRUCT Turns an 8B LLM into a 94% Accurate Planner — Massive Gains on Hard Domains

Problem and motivation

Large language models often produce multi-step plans that sound plausible but break down logically: individual steps violate action preconditions, or the sequence never actually reaches the goal. The MIT CSAIL team behind PDDL-INSTRUCT set out to make planning outputs provably valid rather than merely plausible by combining explicit semantic reasoning with an external plan validator.

How PDDL-INSTRUCT works

PDDL-INSTRUCT is an instruction-tuning framework that teaches models to produce logical chain-of-thought (CoT) reasoning grounded in PDDL-style state and action semantics, and pairs that reasoning with external verification from the classic VAL plan validator. The approach rests on three core components:

- Logical CoT reasoning that steps through each action's preconditions and effects, making intermediate states explicit.
- External verification with VAL, whose feedback on invalid steps is returned to the model.
- A two-stage instruction-tuning procedure, described next.
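To make the semantics concrete, the sketch below (illustrative only, not the authors' code) shows the kind of state-transition check a logical CoT step has to satisfy: the action's preconditions must hold in the current state, and the claimed next state must equal the current state with the action's effects applied.

```python
# Minimal illustration (not the authors' code) of PDDL-style action semantics:
# a reasoning step is valid only if the action's preconditions hold in the
# current state and applying its effects yields the claimed next state.

from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset   # ground predicates that must hold before the action
    add_effects: frozenset     # predicates made true by the action
    del_effects: frozenset     # predicates made false by the action

def apply(action: Action, state: frozenset) -> frozenset:
    """Apply a ground action to a state (a set of ground predicates)."""
    if not action.preconditions <= state:
        raise ValueError(f"preconditions of {action.name} not satisfied")
    return (state - action.del_effects) | action.add_effects

def check_step(state, action, claimed_next) -> bool:
    """Return True iff the claimed state transition is logically correct."""
    try:
        return apply(action, frozenset(state)) == frozenset(claimed_next)
    except ValueError:
        return False

# Toy Blocksworld-style example with hand-written ground predicates.
unstack_a_b = Action(
    name="unstack(a, b)",
    preconditions=frozenset({"on(a,b)", "clear(a)", "handempty"}),
    add_effects=frozenset({"holding(a)", "clear(b)"}),
    del_effects=frozenset({"on(a,b)", "clear(a)", "handempty"}),
)
state = {"on(a,b)", "ontable(b)", "clear(a)", "handempty"}
next_state = {"holding(a)", "clear(b)", "ontable(b)"}
print(check_step(state, unstack_a_b, next_state))  # True
```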

Two-stage training

The tuning process uses a two-stage optimization: first the model is optimized to produce correct reasoning chains by penalizing state-transition errors, and then it is optimized for end-task planning accuracy. Detailed validator feedback and longer feedback budgets consistently improve performance.
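To make the distinction between the two stages concrete, here is a minimal sketch (a toy state representation, not the paper's actual objectives): stage 1 corresponds to a per-step penalty on incorrect state transitions inside the reasoning chain, while stage 2 corresponds to a single end-task score for whether the emitted plan actually achieves the goal.

```python
# Sketch (hypothetical, not the authors' objectives) contrasting the two training
# signals: stage 1 penalizes every wrong state transition in the reasoning chain;
# stage 2 only scores whether the final plan reaches the goal via valid steps.

def apply(state, action):
    """Apply a ground action {'pre', 'add', 'del'}; return None if preconditions fail."""
    if not action["pre"] <= state:
        return None
    return (state - action["del"]) | action["add"]

def stage1_penalty(chain):
    """Count reasoning steps whose claimed next state contradicts the action semantics."""
    return sum(
        1 for state, action, claimed in chain
        if apply(frozenset(state), action) != frozenset(claimed)
    )

def stage2_score(initial_state, plan, goal):
    """1.0 if executing the whole plan from the initial state satisfies the goal, else 0.0."""
    state = frozenset(initial_state)
    for action in plan:
        state = apply(state, action)
        if state is None:
            return 0.0
    return 1.0 if goal <= state else 0.0

# Toy usage: a single correct reasoning step followed by end-task scoring.
pickup_a = {"pre": frozenset({"clear(a)", "ontable(a)", "handempty"}),
            "add": frozenset({"holding(a)"}),
            "del": frozenset({"clear(a)", "ontable(a)", "handempty"})}
init = {"clear(a)", "ontable(a)", "handempty"}
chain = [(init, pickup_a, {"holding(a)"})]
print(stage1_penalty(chain), stage2_score(init, [pickup_a], frozenset({"holding(a)"})))
# -> 0 1.0
```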

Benchmarks and results

Evaluation follows PlanBench and covers Blocksworld, Mystery Blocksworld (with predicate names obfuscated to prevent pattern matching), and Logistics — benchmarks designed to stress planning capabilities where generic LLMs historically underperform.
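For intuition, here is a toy illustration of the obfuscation idea behind Mystery Blocksworld, using a made-up renaming map rather than the benchmark's actual one: the domain's logical structure is untouched, only the surface vocabulary changes, so memorized Blocksworld patterns no longer help.

```python
# Toy illustration of predicate obfuscation (the mapping below is invented;
# the benchmark defines its own). The domain structure is unchanged, so a
# valid plan for the obfuscated version is still a valid Blocksworld plan.

import re

OBFUSCATION = {"on": "blorps", "clear": "vexed", "holding": "grips",
               "handempty": "idle", "ontable": "grounded"}

def obfuscate(pddl_text: str) -> str:
    pattern = re.compile(r"\b(" + "|".join(OBFUSCATION) + r")\b")
    return pattern.sub(lambda m: OBFUSCATION[m.group(1)], pddl_text)

print(obfuscate("(:goal (and (on a b) (clear a) (handempty)))"))
# -> (:goal (and (blorps a b) (vexed a) (idle)))
```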

Key reported results:

- The instruction-tuned 8B model reaches the headline 94% planning accuracy, with the largest gains on the harder domains (see the paper for the full per-domain numbers).
- Detailed validator feedback outperforms binary valid/invalid signals, and allocating more feedback steps helps further.

These findings suggest that grounding reasoning steps in formal semantics and checking them with an external oracle is a practical route to substantially more reliable planning from LLMs.
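In a verifier-in-the-loop pipeline, that detailed feedback is what the model sees on its next attempt. The sketch below is hypothetical glue code, assuming VAL's `Validate` binary is installed locally and that `generate_plan` stands in for the tuned model; it is not the authors' released tooling.

```python
# Sketch of a generate-validate-refine loop around VAL (hypothetical glue code;
# `Validate` is assumed to be VAL's binary on PATH, and `generate_plan` is a
# placeholder for a call to the tuned model).

import pathlib
import subprocess
import tempfile

def validate_with_val(domain: str, problem: str, plan_text: str) -> tuple[bool, str]:
    """Run VAL on a candidate plan and return (is_valid, validator_output)."""
    with tempfile.NamedTemporaryFile("w", suffix=".plan", delete=False) as f:
        f.write(plan_text)
        plan_path = f.name
    result = subprocess.run(
        ["Validate", "-v", domain, problem, plan_path],  # -v asks VAL for step-by-step detail
        capture_output=True, text=True,
    )
    pathlib.Path(plan_path).unlink(missing_ok=True)
    # Assumes VAL reports success with a line containing "Plan valid".
    return "Plan valid" in result.stdout, result.stdout

def plan_with_feedback(domain, problem, generate_plan, max_rounds=4):
    """Feed VAL's detailed report (not just a binary verdict) into each retry."""
    feedback = ""
    for _ in range(max_rounds):
        plan_text = generate_plan(domain, problem, feedback)  # placeholder LLM call
        ok, report = validate_with_val(domain, problem, plan_text)
        if ok:
            return plan_text
        feedback = report
    return None
```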

Scope and limitations

PDDL-INSTRUCT is demonstrated on classical PDDL domains and currently depends on VAL as an external oracle. The approach shows immediate utility for agent pipelines that can tolerate a verifier in the loop, but longer-horizon planning, temporal and numeric constraints, and cost-sensitive planning remain open areas for extension. The method is an important neuro-symbolic step: it trains LLMs to produce reasoning traces that can be automatically validated against formal semantics, reducing the gap between plausible answers and provably correct plans.

References and artifacts

The authors provide the full paper along with code artifacts for reproduction; see the arXiv paper for complete experimental details and numbers.

(Original research: arXiv:2509.13351)