MIT's PDDL-INSTRUCT Turns an 8B LLM into a 94% Accurate Planner — Massive Gains on Hard Domains
Problem and motivation
Large language models often produce multi-step plans that sound plausible but fail logically when executed. The MIT CSAIL team behind PDDL-INSTRUCT set out to make planning outputs provably valid rather than just plausible, by combining explicit semantic reasoning with an external plan validator.
How PDDL-INSTRUCT works
PDDL-INSTRUCT is an instruction-tuning framework that encourages logical chain-of-thought (CoT) reasoning grounded in PDDL-style state and action semantics, and pairs that reasoning with external verification using the classic VAL plan validator. The approach has three core components:
- Error education: During training, the model learns to explain why candidate plans fail, identifying issues such as unsatisfied preconditions, incorrect effects, frame violations, or failure to reach the goal.
- Logical chain-of-thought: Prompts force the model to produce explicit state→action→state traces of the form ⟨sᵢ, aᵢ₊₁, sᵢ₊₁⟩, reasoning step-by-step over preconditions and add/delete effects.
- External verification (VAL): Each step is checked with the VAL plan validator. Feedback can be binary (valid/invalid) or detailed (which precondition or effect failed); the paper finds detailed feedback yields the strongest gains. A minimal sketch of this kind of step check appears after this list.
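Concretely, a single step check verifies that the action's preconditions hold in sᵢ, then applies its add and delete effects to obtain sᵢ₊₁. Below is a minimal, self-contained sketch of that check for one Blocksworld action; the set-based encoding and the `check_step` helper are illustrative choices, not the paper's or VAL's implementation.

```python
# Minimal sketch of one <s_i, a_{i+1}, s_{i+1}> transition check in a
# STRIPS-style Blocksworld. Illustrative only; not the paper's code.

# Grounded action: returns (preconditions, add effects, delete effects).
def unstack(x, y):
    pre = {("on", x, y), ("clear", x), ("handempty",)}
    add = {("holding", x), ("clear", y)}
    delete = {("on", x, y), ("clear", x), ("handempty",)}
    return pre, add, delete

def check_step(state, action, detailed=True):
    """Validate one transition; return (next_state_or_None, feedback)."""
    pre, add, delete = action
    missing = pre - state
    if missing:
        msg = f"unsatisfied preconditions: {sorted(missing)}" if detailed else "invalid"
        return None, msg
    next_state = (state - delete) | add   # apply delete effects, then add effects
    return next_state, "valid"

state = {("on", "A", "B"), ("clear", "A"), ("ontable", "B"), ("handempty",)}
next_state, feedback = check_step(state, unstack("A", "B"))
print(feedback)                          # -> valid
print(("holding", "A") in next_state)    # -> True
```

The detailed branch mirrors the stronger feedback condition: instead of a bare "invalid", the checker names exactly which precondition was unsatisfied.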
Two-stage training
The tuning process uses a two-stage optimization: first the model is optimized to produce correct reasoning chains by penalizing state-transition errors, and then it is optimized for end-task planning accuracy. Detailed validator feedback and longer feedback budgets consistently improve performance.
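To make the two stages concrete, the sketch below shows the two kinds of supervision targets one might derive from validator output: stage 1 supervises the reasoning chain itself, with every ⟨state, action, state⟩ step carrying the checker's verdict, while stage 2 supervises only the final plan. The field names and example format here are assumptions for illustration, not the paper's actual data schema.

```python
# Hedged sketch: building two-stage instruction-tuning targets from
# validator feedback. Field names and formats are illustrative assumptions.

def stage1_example(problem, trace, step_feedback):
    """Stage 1 target: the logical chain-of-thought itself.
    `trace` is a list of (state, action, next_state) triples;
    `step_feedback` holds the validator verdict for each step."""
    chain = "\n".join(
        f"step {i}: apply {a} in {sorted(s)} -> {sorted(s2)} [{fb}]"
        for i, ((s, a, s2), fb) in enumerate(zip(trace, step_feedback))
    )
    return {"prompt": f"Reason step by step to solve: {problem}", "target": chain}

def stage2_example(problem, plan):
    """Stage 2 target: end-task planning accuracy (the plan alone)."""
    return {"prompt": f"Produce a valid plan for: {problem}",
            "target": "\n".join(plan)}

trace = [({("clear", "A")}, "pickup A", {("holding", "A")})]
print(stage1_example("stack A on B", trace, ["valid"])["target"])
# -> step 0: apply pickup A in [('clear', 'A')] -> [('holding', 'A')] [valid]
```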
Benchmarks and results
Evaluation follows PlanBench and covers Blocksworld, Mystery Blocksworld (with predicate names obfuscated to prevent pattern matching), and Logistics — benchmarks designed to stress planning capabilities where generic LLMs historically underperform.
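For intuition about the obfuscated variant: Mystery Blocksworld keeps the domain's dynamics identical but renames predicates and actions to semantically meaningless tokens, so a model cannot rely on surface familiarity with block-stacking vocabulary. The particular renaming below is an invented illustration, not the benchmark's actual mapping.

```python
# Illustrative only: predicate obfuscation preserves structure while
# destroying surface cues. This mapping is made up, not the benchmark's.
rename = {"on": "craves", "clear": "province", "holding": "harmony"}

readable = ("on", "A", "B")
obfuscated = (rename[readable[0]],) + readable[1:]
print(obfuscated)  # -> ('craves', 'A', 'B'): same arity, same transition rules
```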
Key reported results:
- Blocksworld: A PDDL-INSTRUCT-tuned Llama-3-8B produces valid plans on 94% of instances.
- Mystery Blocksworld: Dramatic relative gains over near-zero baselines; the paper reports order-of-magnitude improvements, including a 64× gain cited in its summary figures.
- Logistics: Substantial increases in valid plans across scenarios.
- Across domains: The authors report up to a 66% absolute improvement over untuned baselines.
Detailed validator feedback outperforms binary signals, and allocating more feedback steps helps further. These findings suggest that grounding reasoning steps in formal semantics and checking them with an oracle is a practical route to much more reliable planning from LLMs.
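The feedback loop these findings describe is straightforward to picture: generate a plan, validate it, feed any error back to the model, and retry within a fixed budget. In the sketch below, `generate_plan` and `validate` are hypothetical stand-ins; in the paper the validation role is played by the external VAL tool.

```python
# Sketch of a verify-and-refine loop with a feedback budget.
# `generate_plan` and `validate` are hypothetical stand-ins: in the paper,
# validation is performed by the external VAL plan validator.

def refine_with_feedback(generate_plan, validate, problem, budget=5, detailed=True):
    """Ask the model for a plan, validate it, and retry with feedback."""
    feedback = None
    for attempt in range(budget):
        plan = generate_plan(problem, feedback)   # model call (stand-in)
        ok, report = validate(problem, plan)      # VAL-style check (stand-in)
        if ok:
            return plan, attempt + 1
        # Detailed feedback names the failing precondition/effect;
        # binary feedback only says the plan is invalid.
        feedback = report if detailed else "plan invalid"
    return None, budget  # budget exhausted without a valid plan

# Toy demo with stub functions:
attempts = iter(["bad plan", "pickup A; stack A B"])
plan, n = refine_with_feedback(
    lambda p, fb: next(attempts),
    lambda p, pl: ("stack" in pl, "precondition (holding A) unsatisfied"),
    "stack A on B")
print(plan, n)  # -> pickup A; stack A B 2
```

Setting `detailed=False` reproduces the weaker binary-feedback condition the paper compares against, where the model learns only that its plan failed, not why.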
Scope and limitations
PDDL-INSTRUCT is demonstrated on classical PDDL domains and currently depends on VAL as an external oracle. The approach shows immediate utility for agent pipelines that can tolerate a verifier in the loop, but longer-horizon planning, temporal and numeric constraints, and cost-sensitive planning remain open areas for extension. The method is an important neuro-symbolic step: it trains LLMs to produce reasoning traces that can be automatically validated against formal semantics, reducing the gap between plausible answers and provably correct plans.
References and artifacts
The full paper and accompanying artifacts are available for reproduction and further study; see the arXiv paper for complete experimental details and numbers.
(Original research: arXiv:2509.13351)