
DeepMind's SIMA 2: Gemini-Powered Agent Tackles Complex 3D Game Worlds

SIMA 2 upgrades DeepMind's embodied agent with a Gemini reasoning core, doubling task performance and enabling multimodal, self-improving behavior across commercial and generated 3D worlds.

What SIMA 2 Is

DeepMind's SIMA 2 (Scalable Instructable Multiworld Agent) is a generalist embodied agent designed to operate inside complex 3D game environments. The new release builds on the original SIMA by replacing its low-level policy with a Gemini reasoning core, enabling the agent to form internal plans, explain its intentions, and improve from self-generated experience across many different virtual worlds.

From SIMA 1 to SIMA 2

SIMA 1 learned over 600 instruction-following skills such as 'turn left', 'climb the ladder', and 'open the map'. It controlled commercial games using only rendered pixels and a virtual keyboard and mouse, with no access to game internals. On DeepMind's main benchmark, SIMA 1 achieved roughly 31% task completion, compared with around 71% for human players.
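
The pixels-in, inputs-out contract is easy to state precisely. Below is a minimal sketch of that interface; the type names and the stub policy are illustrative, not DeepMind's actual API:

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Observation:
    """What the agent sees: rendered pixels only, no game internals."""
    frame: np.ndarray  # H x W x 3 RGB screenshot


@dataclass
class VirtualInput:
    """What the agent emits: the same inputs a human player would use."""
    keys: List[str]          # e.g. ["w", "w", "space"]
    mouse_dx: float = 0.0    # relative mouse movement
    mouse_dy: float = 0.0
    click: bool = False


def follow_instruction(obs: Observation, instruction: str) -> VirtualInput:
    """SIMA 1-style policy stub: pixels plus text in, keyboard/mouse out.

    A real policy is a learned network; this stub only illustrates the
    interface contract described above.
    """
    if "turn left" in instruction:
        return VirtualInput(keys=[], mouse_dx=-200.0)
    if "climb the ladder" in instruction:
        return VirtualInput(keys=["w"])
    return VirtualInput(keys=[])
```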

SIMA 2 preserves the same embodied interface but swaps the policy for a Gemini model. Reportedly using Gemini 2.5 Flash Lite as its reasoning engine, the agent no longer maps pixels directly to actions. Instead it infers high-level goals from observations and instructions, reasons about plans in language, and executes action sequences via the virtual input interface. DeepMind frames this as a shift from an instruction follower to an interactive gaming companion that can collaborate with a human player.

Gemini in the Control Loop

The SIMA 2 architecture places Gemini at the agent's core. Visual observations and user instructions are fed to the model, which infers a task-level goal and generates action outputs sent through the virtual keyboard and mouse. Training mixes human demonstration videos annotated with language labels, some written by people and some generated by Gemini itself. This hybrid supervision aligns the agent's internal reasoning with both human intent and model-generated behavior descriptions.
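
A minimal sketch of one step of that loop, assuming a generic multimodal model call; the prompt format and JSON action schema here are assumptions, not the published interface:

```python
import json
from typing import Callable, List

# Stand-in for a multimodal Gemini call: image bytes plus a text prompt
# in, model text out. The real model, prompt, and schema are not public;
# everything below is illustrative.
ModelFn = Callable[[bytes, str], str]


def control_step(model: ModelFn, screenshot_png: bytes, instruction: str) -> List[str]:
    """One perceive -> reason -> act step of the described architecture."""
    prompt = (
        "You control a game character via a virtual keyboard and mouse.\n"
        f"Player instruction: {instruction}\n"
        'Reply as JSON: {"goal": "...", "actions": ["<input token>", ...]}'
    )
    reply = model(screenshot_png, prompt)
    plan = json.loads(reply)
    # The inferred goal is kept for interpretability; the action tokens
    # are sent through the virtual keyboard/mouse interface.
    return plan["actions"]
```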

A practical consequence of this design is interpretability: SIMA 2 can explain what it intends to do, list the steps it will take, answer questions about current objectives, and expose an interpretable chain of thought about the environment.
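
Concretely, the exposed reasoning might take a form like the following; this structure is purely illustrative, not DeepMind's actual output format:

```python
# Illustrative only: one way an agent's exposed reasoning could be
# structured so a player can inspect goal, steps, and rationale.
plan = {
    "instruction": "open the map and find the nearest village",
    "goal": "locate the nearest village on the in-game map",
    "steps": [
        "press the map key to open the map",
        "scan the map for village icons",
        "mark the closest village as a waypoint",
    ],
    "rationale": "the map screen shows settlement icons; the icon "
                 "closest to the player marker is the nearest village",
}
```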

Generalization and Measured Performance

On DeepMind's main evaluation suite SIMA 2 roughly doubles SIMA 1's performance, moving from about 31% to approximately 62% task completion, while humans remain around 70%. The important takeaway is the shape of the improvement: the Gemini-centered agent closes most of the gap between SIMA 1 and human players on long, language-specified missions in training games.
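
The arithmetic behind 'closes most of the gap' is worth making explicit:

```python
# Fraction of the SIMA 1 -> human gap closed by SIMA 2, using the
# approximate task-completion rates quoted above.
sima1, sima2, human = 0.31, 0.62, 0.70
gap_closed = (sima2 - sima1) / (human - sima1)
print(f"SIMA 2 closes {gap_closed:.0%} of the gap")  # roughly 79%
```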

Held-out environments like ASKA and MineDojo, never seen during training, show a similar pattern: SIMA 2 achieves substantially higher task completion than SIMA 1. This suggests genuine zero-shot generalization rather than overfitting. The agent also demonstrates concept transfer — for example, reusing a learned notion of 'mining' when asked to 'harvest' in a different title.

Multimodal Instruction Following

SIMA 2 expands the instruction channel beyond plain text. Demonstrations show the agent following spoken commands, reacting to on-screen sketches, and executing tasks prompted by emojis. In one example a user asks SIMA 2 to go to 'the house that is the color of a ripe tomato'; the Gemini core reasons that ripe tomatoes are red, selects that house, and walks to it.

Gemini also brings support for multiple natural languages and for mixed prompts that combine language and visual cues. For robotics and physical AI, this is a concrete multimodal stack: a shared representation links text, audio, images, and in-game actions, so abstract symbols can be grounded in concrete control sequences.
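
A sketch of what that grounding step might look like, reducing a mixed prompt to a single goal string; the field names and stub logic are illustrative assumptions, with the 'ripe tomato' case taken from the example above:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MultimodalInstruction:
    """A mixed prompt: any combination of text, speech, sketch, emoji.

    Field names are illustrative; SIMA 2's internal representation is
    not public.
    """
    text: Optional[str] = None          # "go to the house the color of a ripe tomato"
    audio_wav: Optional[bytes] = None   # spoken command
    sketch_png: Optional[bytes] = None  # on-screen drawing
    emoji: Optional[str] = None         # e.g. an axe and a tree


def ground(instruction: MultimodalInstruction) -> str:
    """Reduce a mixed prompt to one goal string for the planner.

    A real system would delegate this to the multimodal model; the stub
    only shows the shape of the step (abstract symbol -> concrete goal).
    """
    if instruction.text and "ripe tomato" in instruction.text:
        return "walk to the red house"  # ripe tomatoes are red
    if instruction.emoji:
        return f"interpret emoji prompt: {instruction.emoji}"
    return instruction.text or "await clarification"
```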

Self-Improvement at Scale

A major research contribution in SIMA 2 is the explicit self-improvement loop. After initial training on human gameplay, the agent is deployed in new games to learn from its own experience. A separate Gemini model generates tasks for the agent in each world, and a learned reward model scores its attempts. Trajectories are stored in a bank of self-generated data; later generations train on this dataset, allowing the agent to succeed on tasks where earlier versions failed, without new human demonstrations.
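
The loop as described maps naturally onto a few components. A minimal sketch, where every signature is an assumption rather than a public API:

```python
from typing import Callable, List, Tuple

# Stand-ins for the components named above. Signatures are assumptions;
# none of these correspond to a public DeepMind API.
TaskSetter = Callable[[str], str]                # world -> task description
Agent = Callable[[str, str], List[str]]          # (world, task) -> trajectory
RewardModel = Callable[[str, List[str]], float]  # (task, trajectory) -> score
Trainer = Callable[[List[Tuple[str, List[str]]]], Agent]


def self_improvement_generation(
    worlds: List[str],
    set_task: TaskSetter,
    agent: Agent,
    score: RewardModel,
    train: Trainer,
    keep_threshold: float = 0.5,
) -> Agent:
    """One generation of the described loop: generate tasks, act,
    score the attempts, bank the good trajectories, train a successor."""
    bank: List[Tuple[str, List[str]]] = []
    for world in worlds:
        task = set_task(world)           # a Gemini model proposes a task
        trajectory = agent(world, task)  # the current agent attempts it
        if score(task, trajectory) >= keep_threshold:
            bank.append((task, trajectory))  # self-generated training data
    return train(bank)                   # next generation learns from the bank
```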

This is an example of a multitask, model-in-the-loop data engine: a language model specifies goals and provides feedback, while the embodied agent converts that feedback into improved policies.

Testing with Genie 3

To push generalization further, DeepMind couples SIMA 2 with Genie 3, a world model that synthesizes interactive 3D environments from a single image or text prompt. In these generated worlds SIMA 2 orients, parses instructions, and acts even when geometry and assets differ from training games.

The reported behavior shows that SIMA 2 can navigate Genie 3 scenes, identify objects like benches and trees, and perform coherent requested actions. This demonstrates that a single agent can operate across commercial titles and procedurally generated environments using the same reasoning core and control interface.
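
Under those assumptions, an evaluation rollout could be as simple as the loop below; `WorldModel` and `Agent` are illustrative interfaces, since neither Genie 3 nor SIMA 2 is exposed as a public API:

```python
from typing import Callable, List

# Illustrative interfaces only. A world model turns a text prompt into an
# environment; the environment maps the actions taken so far to the next
# rendered frame (PNG bytes).
WorldModel = Callable[[str], Callable[[List[str]], bytes]]
Agent = Callable[[bytes, str], str]  # (frame, instruction) -> next action token


def rollout(genie: WorldModel, sima: Agent, prompt: str,
            instruction: str, steps: int = 32) -> List[str]:
    """Drop the agent into a world synthesized from a text prompt and let
    it act, through the same control interface as in commercial games."""
    env = genie(prompt)        # e.g. "a sunlit park with benches and trees"
    actions: List[str] = []
    for _ in range(steps):
        frame = env(actions)   # world model renders the next observation
        actions.append(sima(frame, instruction))
    return actions
```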

Implications for Robotics and Embodied AI

SIMA 2 is presented as a systems milestone: it integrates a reasoning core (reportedly Gemini 2.5 Flash Lite) with multimodal perception, language-based planning, and a self-improvement loop, validated across both commercial and generated environments. The results indicate a practical recipe for more general-purpose embodied agents and point toward next steps for real-world robotic systems.
