PoE-World: Modular Symbolic Models Surpass RL Baselines in Montezuma’s Revenge with Minimal Data
PoE-World introduces a modular symbolic approach that surpasses traditional reinforcement learning methods in Montezuma’s Revenge with minimal data, enabling efficient planning and strong generalization.
The Role of Symbolic Reasoning in AI World Modeling
Creating AI agents that can adapt to complex environments requires a deep understanding of how the world works. Traditional neural network models like Dreamer are flexible but demand vast amounts of data, far more than humans need to learn comparable skills. In contrast, program synthesis using large language models (LLMs) offers data-efficient world models by generating code-based representations. However, these approaches struggle to scale to complex, dynamic environments because of the difficulty of generating a single large, comprehensive program.
Challenges of Existing Programmatic World Models
Current methods often synthesize a single large program to represent the world, as seen in models like WorldCoder and CodeWorldModels. This approach limits scalability and the ability to handle uncertainty and partial observability in complex domains. Some research integrates symbolic reasoning with visual input in robotics, often relying on restricted domain-specific languages or related structures like factor graphs. Theoretical frameworks such as AIXI also explore world modeling through Turing machines and history-based methods.
Introducing PoE-World: A Modular and Probabilistic Approach
PoE-World, developed by researchers from Cornell, Cambridge, The Alan Turing Institute, and Dalhousie University, takes a different approach by combining multiple small LLM-synthesized programs, each encoding a specific environmental rule. This modular, probabilistic structure enables learning from minimal demonstrations and supports generalization to new situations. PoE-World emphasizes symbolic object observations rather than raw pixels and focuses on accurate modeling to facilitate efficient planning in complex games like Pong and Montezuma’s Revenge.
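To make this concrete, the sketch below shows what one small programmatic expert operating on symbolic object observations might look like. It is purely illustrative: the `Obj` class, the `gravity_expert` function, and the specific constants are assumptions for exposition, not PoE-World's actual interface or learned rules.

```python
# Hypothetical sketch of a single "programmatic expert": a small rule that
# predicts one aspect of the next state from symbolic object observations.
# Names (Obj, gravity_expert) and constants are illustrative only.
from dataclasses import dataclass


@dataclass
class Obj:
    name: str
    x: float
    y: float
    vy: float = 0.0
    on_ground: bool = False


def gravity_expert(prev_obj: Obj, action: str) -> dict:
    """Predict the player's vertical velocity: stay put on a platform,
    gain upward velocity on a jump, otherwise fall under gravity."""
    if action == "JUMP" and prev_obj.on_ground:
        return {"vy": -4.0}                 # illustrative jump impulse
    if prev_obj.on_ground:
        return {"vy": 0.0}                  # resting on a platform
    return {"vy": prev_obj.vy + 1.0}        # illustrative gravity increment
```

Because each rule is this small, the LLM only has to synthesize many simple programs rather than one monolithic simulator.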
Architecture and Learning in PoE-World
The environment is modeled as a product of programmatic experts (the "PoE" in PoE-World): small, interpretable Python programs, each responsible for a particular rule or behavior. These experts are weighted and combined to predict future states from the interaction history and the chosen action, assuming conditional independence among state features. Hard constraints refine the predictions, and experts are updated or pruned as new data arrives. The system supports planning and reinforcement learning by simulating likely futures. Programs are synthesized with LLMs and interpreted probabilistically, with expert weights optimized by gradient descent.
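The following is a minimal sketch of how such a weighted, constraint-aware combination could work for a single feature, assuming each expert returns a probability distribution over that feature's next value. The function names, the constraint check, and the representation of distributions as plain dictionaries are assumptions made for illustration, not the paper's implementation.

```python
# Sketch of a product-of-experts prediction for one state feature.
# Each expert maps (history, action) -> {value: probability}; expert weights
# scale its log-probabilities, and hard constraints remove impossible values.
import math
from collections import defaultdict


def satisfies_constraints(value) -> bool:
    # Placeholder: a real constraint would reject physically impossible values,
    # e.g. the player occupying the same cell as a wall.
    return True


def poe_predict(experts, weights, history, action, feature_values):
    """Combine expert distributions multiplicatively and renormalize."""
    log_scores = defaultdict(float)
    for expert, w in zip(experts, weights):
        dist = expert(history, action)
        for v in feature_values:
            log_scores[v] += w * math.log(dist.get(v, 1e-9))
    # Apply hard constraints, then normalize over the feasible values.
    feasible = {v: math.exp(s) for v, s in log_scores.items()
                if satisfies_constraints(v)}
    total = sum(feasible.values()) or 1.0
    return {v: p / total for v, p in feasible.items()}
```

In this picture, fitting the weights by gradient descent on observed trajectories lets the model downweight experts whose rules are contradicted by new data.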
Performance on Atari Benchmarks
PoE-World + Planner was tested on Atari's Pong and Montezuma's Revenge, including challenging variants, using minimal demonstration data. It outperformed baselines such as PPO (a deep reinforcement learning method), ReAct, and WorldCoder, especially in low-data regimes. The model generalizes well, accurately capturing game dynamics even in modified environments without new demonstrations. Notably, PoE-World is the only method to consistently achieve positive scores in Montezuma's Revenge. Pre-training policies within PoE-World's simulated environment significantly accelerates learning in the real game. Compared to WorldCoder's limited and sometimes inaccurate models, PoE-World produces detailed, constraint-aware representations that yield better planning and more realistic in-game behavior.
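The pre-train-then-transfer idea can be sketched as follows: roll a policy out inside the learned world model before it ever touches the real game. The `simulated_env` and `policy` objects and their methods below are assumptions for illustration, not PoE-World's actual training API.

```python
# Hedged sketch: pre-train a policy inside the learned (simulated) world model,
# then fine-tune it on the real environment with far fewer real interactions.
def pretrain_in_world_model(policy, simulated_env, steps=10_000):
    state = simulated_env.reset()
    for _ in range(steps):
        action = policy.act(state)
        next_state, reward, done = simulated_env.step(action)  # imagined rollout
        policy.train_step(state, action, reward, next_state)
        state = simulated_env.reset() if done else next_state
    return policy
```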
Symbolic Modular Programs for Scalable AI Planning
PoE-World illustrates the power of modular, symbolic world models synthesized by LLMs for building adaptive AI agents. By recombining programmatic experts, it achieves strong generalization from minimal data, efficient planning, and robust performance in complex tasks. The code and demos are publicly available for further exploration.
For more information, see the Paper, Project Page, and GitHub repository.