Gemini Robotics 1.5: DeepMind's ER↔VLA Stack Brings Agentic Robots to the Real World
How the stack works
DeepMind splits embodied intelligence into two cooperating models to tackle long-horizon real-world tasks. Gemini Robotics-ER 1.5 handles high-level embodied reasoning: spatial understanding, planning, progress and success estimation, point grounding, and external tool calls. Gemini Robotics 1.5 (the VLA controller) handles low-level visuomotor execution, converting visual and language inputs into motor commands while emitting explicit “think-before-act” traces to decompose long tasks into short-horizon skills.
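To make the division of labor concrete, here is a minimal sketch of how an orchestration loop over the two models could be wired. The class names, method signatures, and message format are illustrative assumptions for exposition, not a published interface.

```python
# Illustrative sketch of the ER <-> VLA split; class names, methods, and the
# SubGoal format are assumptions for exposition, not DeepMind's actual API.
from dataclasses import dataclass

@dataclass
class SubGoal:
    instruction: str          # natural-language sub-task handed to the VLA controller
    success_criterion: str    # what the reasoner checks against fresh images

class EmbodiedReasoner:
    """Stands in for Gemini Robotics-ER 1.5: plans, grounds, estimates progress/success."""
    def plan(self, task: str, image: bytes) -> list[SubGoal]: ...
    def is_done(self, goal: SubGoal, image: bytes) -> bool: ...

class VLAController:
    """Stands in for Gemini Robotics 1.5: turns a sub-goal plus percepts into a motor command."""
    def step(self, goal: SubGoal, image: bytes) -> "MotorCommand": ...

def run_task(reasoner: EmbodiedReasoner, controller: VLAController,
             robot, task: str, max_steps: int = 500) -> bool:
    """High-level loop: ER plans and verifies, the VLA executes each sub-goal in closed loop."""
    image = robot.capture_image()
    for goal in reasoner.plan(task, image):
        for _ in range(max_steps):
            image = robot.capture_image()
            if reasoner.is_done(goal, image):          # progress / success estimation
                break
            robot.apply(controller.step(goal, image))  # closed-loop visuomotor control
        else:
            return False  # sub-goal timed out; a real system would replan here
    return True
```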
Gemini Robotics-ER 1.5: the reasoner
ER 1.5 is a multimodal planner that ingests images and video (with optional audio), grounds references via 2D points in the scene, tracks progress, detects success, and can query external tools or APIs to fetch constraints before issuing sub-goals. Because it can call external tools, its plans can be conditioned on outside information, such as local rules or context-sensitive constraints. ER 1.5 is available through the Gemini API in Google AI Studio.
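For teams already using the Gemini API, a point-grounding request might look roughly like the snippet below, using the google-genai Python SDK. The model ID and the JSON prompt convention are assumptions here and should be checked against the official documentation.

```python
# Hedged sketch of a point-grounding call via the Gemini API (google-genai SDK).
# The model ID and prompt format are assumptions; verify against the current docs.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("scene.jpg", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",  # assumed preview model ID
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        "Point to every mug on the table. Answer as JSON: "
        '[{"point": [y, x], "label": "<name>"}] with coordinates normalized to 0-1000.',
    ],
)
print(response.text)  # parse the returned points before passing them to a controller
```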
Gemini Robotics 1.5 (VLA): the executor
The VLA model focuses on closed-loop visuomotor control. It converts instructions and percepts into motor commands and produces explicit intermediate reasoning traces during execution. Those traces enable the system to “think” while acting, improving decomposition of long tasks and facilitating mid-task plan revisions. Initial availability of the VLA is limited to selected partners.
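Since the VLA itself is only available to selected partners, the following is just a sketch of how an interleaved think-before-act rollout could be represented as data; every field name below is an assumption made for exposition.

```python
# Hypothetical shape of a VLA rollout with "think-before-act" traces; the field
# names and action encoding are assumptions, not the model's real output schema.
from dataclasses import dataclass, field

@dataclass
class ControlStep:
    thought: str         # short reasoning trace, e.g. "grasp the blue cup by the handle"
    action: list[float]  # low-level command, e.g. end-effector delta pose + gripper state
    timestamp_s: float = 0.0

@dataclass
class Rollout:
    instruction: str
    steps: list[ControlStep] = field(default_factory=list)

    def replan_points(self) -> list[int]:
        """Indices where the emitted thought changed, i.e. mid-task plan revisions."""
        return [i for i in range(1, len(self.steps))
                if self.steps[i].thought != self.steps[i - 1].thought]
```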
Why separate cognition from control
End-to-end vision-language-action models often struggle with robust planning, verifiable success detection, and transfer across different robots. By isolating deliberation and orchestration in ER 1.5 and execution in the VLA, DeepMind gains interpretability through visible internal traces, better error recovery, and improved long-horizon reliability. The modular design also makes it easier to evolve or swap components independently.
Motion Transfer across embodiments
A core contribution of Gemini Robotics 1.5 is Motion Transfer (MT). The idea is to train the VLA on a unified motion representation built from heterogeneous robot data sources, including ALOHA, a bi-arm Franka platform, and the Apptronik Apollo humanoid. With MT, skills learned on one platform can be reused zero-shot or with few-shot adaptation on another, reducing per-robot data collection and narrowing sim-to-real gaps by leveraging cross-embodiment priors.
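A minimal sketch of the intuition behind a unified motion representation, assuming a fixed-size shared action layout that each robot's native command vector is mapped into. The schemas, dimensions, and mapping are placeholders, not values from the technical report.

```python
# Sketch of a unified motion representation: map each robot's native action
# format into one shared, fixed-size space so a single policy can train on all
# of them. Schemas and dimension counts are illustrative assumptions.
import numpy as np

EMBODIMENT_SCHEMAS = {
    "aloha":  {"arms": 2, "dof_per_arm": 6, "grippers": 2},
    "franka": {"arms": 2, "dof_per_arm": 7, "grippers": 2},
    "apollo": {"arms": 2, "dof_per_arm": 7, "grippers": 2},
}

UNIFIED_DIM = 32  # shared slot layout used by all embodiments

def native_dim(embodiment: str) -> int:
    s = EMBODIMENT_SCHEMAS[embodiment]
    return s["arms"] * s["dof_per_arm"] + s["grippers"]

def to_unified(embodiment: str, action: np.ndarray) -> np.ndarray:
    """Place a robot-specific action vector into the shared representation."""
    n = native_dim(embodiment)
    assert action.shape == (n,)
    unified = np.zeros(UNIFIED_DIM, dtype=np.float32)
    unified[:n] = action  # a real system would also align frames, units, and semantics
    return unified

def from_unified(embodiment: str, unified: np.ndarray) -> np.ndarray:
    """Recover the robot-specific command from the shared representation."""
    return unified[:native_dim(embodiment)]

# Example: an ALOHA action (14-dim) round-trips through the shared space.
aloha_action = np.random.uniform(-1, 1, size=14).astype(np.float32)
assert np.allclose(from_unified("aloha", to_unified("aloha", aloha_action)), aloha_action)
```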
Quantitative findings and real-hardware tests
DeepMind reports controlled A/B comparisons on real hardware and aligned MuJoCo scenes. Highlights include superior instruction following, action generalization, visual generalization, and task generalization across the three platforms compared with prior Gemini Robotics baselines. Motion Transfer yields measurable gains in progress and success for cross-robot transfers (for example, Franka→ALOHA and ALOHA→Apollo). Allowing the VLA to produce thought traces increases long-horizon completion rates and stabilizes mid-rollout plan revisions. Pairing ER 1.5 with the VLA substantially improves progress on multi-step tasks versus a Gemini-2.5-Flash-based baseline.
Safety, evaluation, and rollout
DeepMind layers safety controls across the stack: policy-aligned dialog and planning checks, safety-aware grounding to avoid pointing at hazardous objects, low-level physical limits, and expanded evaluation suites such as ASIMOV-style scenarios and automated red-teaming to probe edge-case failures. The goal is to catch hallucinated affordances or nonexistent objects before actuation. ER 1.5 is available via the Gemini API with docs and preview controls; the VLA is initially accessible to select partners with a public waitlist.
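As a hedged illustration of what a pre-actuation gate of this kind could look like, the sketch below refuses to act on hazardous or ungrounded targets. The hazard list, confidence threshold, and function shape are assumptions, not DeepMind's actual safety stack.

```python
# Illustrative pre-actuation guard: block sub-goals that reference objects the
# grounder never found (hallucinated affordances) or that sit on a hazard list.
# The lists and threshold are assumptions, not DeepMind's published policy.
HAZARDOUS_LABELS = {"knife", "scissors", "hot pan", "power outlet"}

def gate_subgoal(target_label: str,
                 grounded_points: dict[str, tuple[float, float]],
                 confidence: dict[str, float],
                 min_confidence: float = 0.6) -> bool:
    """Allow actuation only if the target exists, is confidently grounded, and is not hazardous."""
    if target_label in HAZARDOUS_LABELS:
        return False  # safety-aware grounding: refuse hazardous targets
    if target_label not in grounded_points:
        return False  # nonexistent object: never actuate on a hallucination
    return confidence.get(target_label, 0.0) >= min_confidence

# Example: only "mug" passes; "knife" is hazardous and "banana" was never grounded.
points = {"mug": (412.0, 233.0), "knife": (120.0, 640.0)}
conf = {"mug": 0.92, "knife": 0.88}
for label in ["mug", "knife", "banana"]:
    print(label, gate_subgoal(label, points, conf))
```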
Why this matters for real-world agents
Gemini Robotics 1.5 operationalizes a clear separation between deliberation and execution, adds Motion Transfer to reuse data across heterogeneous robots, and surfaces reasoning outputs like point grounding and progress estimation to developers. For teams building embodied agents, the design reduces per-platform data burden, strengthens long-horizon reliability, and keeps safety testing and guardrails central to deployment.