Google's Gemini 3 Pro: Sparse MoE and 1M-Token Context Fuel Practical Multimodal Agents
Google unveils Gemini 3 Pro, a sparse MoE multimodal model with a 1M-token context window designed for robust long-form reasoning and agentic workflows.
What Gemini 3 Pro brings
Google released the Gemini 3 family with Gemini 3 Pro as the flagship model, aimed at bridging the gap between single-prompt language models and truly agentic, multimodal systems. Google frames Gemini 3 as its most capable model to date, with major improvements in reasoning, multimodal understanding, and agentic behavior. The model is already available in preview and integrated across Google products, including the Gemini app, AI Mode in Search, the Gemini API, Google AI Studio, Vertex AI, and the new Google Antigravity agent development platform.
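For developers who want a first taste, a minimal call through the google-genai Python SDK might look like the sketch below. The model identifier gemini-3-pro-preview is an assumption for the preview release; check Google AI Studio or the API documentation for the exact string.

```python
# Minimal sketch: calling Gemini 3 Pro via the google-genai Python SDK.
# "gemini-3-pro-preview" is an assumed identifier for the preview release.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-3-pro-preview",
    contents="Summarize the trade-offs of sparse mixture-of-experts models.",
)
print(response.text)
```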
Sparse MoE architecture and massive context window
Gemini 3 Pro is built as a sparse mixture-of-experts (MoE) transformer with native support for text, images, audio, and video. Sparse MoE layers route each token to a small subset of experts, letting the model scale total parameter count while keeping per-token compute efficient. The model accepts inputs of up to 1M tokens and can generate up to 64k output tokens, making it suitable for very long codebases, extended documents, or multi-hour transcripts. Importantly, Gemini 3 Pro was trained from scratch rather than as a fine-tune of Gemini 2.5.
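To make the routing idea concrete, here is a toy, self-contained sketch of sparse top-k expert routing in Python. It illustrates the general MoE pattern, not Google's implementation; the dimensions, router, and expert weights are all made up.

```python
# Toy sketch of sparse MoE top-k routing (not Google's implementation).
# A learned router scores each token; only the top-k experts (k << num_experts)
# run, so per-token compute stays roughly constant as total parameters grow.
import numpy as np

rng = np.random.default_rng(0)
d_model, num_experts, top_k = 512, 64, 2

router_w = rng.normal(size=(d_model, num_experts))        # router weights
experts = [rng.normal(size=(d_model, d_model)) * 0.02     # toy expert FFNs
           for _ in range(num_experts)]

def moe_layer(tokens: np.ndarray) -> np.ndarray:
    logits = tokens @ router_w                            # (T, num_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]         # top-k expert indices
    out = np.zeros_like(tokens)
    for t, token in enumerate(tokens):
        chosen = logits[t, top[t]]
        gates = np.exp(chosen - chosen.max())
        gates /= gates.sum()                              # softmax over the k winners
        for gate, e in zip(gates, top[t]):
            out[t] += gate * (token @ experts[e])         # weighted expert outputs
    return out

tokens = rng.normal(size=(8, d_model))                    # a batch of 8 tokens
print(moe_layer(tokens).shape)                            # (8, 512)
```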
Training data and fine-tuning
Training data spans large-scale public web text, multilingual content, code, images, audio, and video, combined with licensed sources, user interaction data, and synthetic samples. After base training, the model undergoes multimodal instruction tuning and reinforcement learning from human and critic feedback to strengthen multi-step reasoning, problem solving, and theorem proving. Training runs on Google TPUs using JAX and ML Pathways.
Reasoning and academic benchmarks
Gemini 3 Pro shows large improvements over Gemini 2.5 Pro on public benchmarks and competes with frontier systems such as GPT 5.1 and Claude Sonnet 4.5. On Humanity’s Last Exam, which aggregates PhD-level questions across science and humanities, Gemini 3 Pro scores 37.5 percent without tools versus 21.6 percent for Gemini 2.5 Pro, 26.5 percent for GPT 5.1 and 13.7 percent for Claude Sonnet 4.5. With search and code execution enabled, Gemini 3 Pro reaches 45.8 percent.
On the ARC AGI 2 visual reasoning puzzles, it scores 31.1 percent, up from 4.9 percent for Gemini 2.5 Pro and ahead of GPT 5.1 at 17.6 percent and Claude Sonnet 4.5 at 13.6 percent. For scientific question answering on GPQA Diamond, Gemini 3 Pro reaches 91.9 percent, slightly ahead of GPT 5.1 at 88.1 percent and Claude Sonnet 4.5 at 83.4 percent. In mathematics, the model achieves 95.0 percent on AIME 2025 without tools and 100.0 percent with code execution, while also scoring 23.4 percent on MathArena Apex, a challenging contest benchmark.
Multimodal understanding and long context behavior
As a native multimodal model, Gemini 3 Pro outperforms previous versions on benchmarks that test understanding across modalities. On MMMU Pro, measuring multimodal reasoning across many university-level subjects, it scores 81.0 percent versus 68.0 percent for Gemini 2.5 Pro and Claude Sonnet 4.5, and 76.0 percent for GPT 5.1. On Video MMMU, which evaluates knowledge acquisition from videos, Gemini 3 Pro reaches 87.6 percent.
User interface and document understanding also show big gains. ScreenSpot Pro, a benchmark for locating elements on a screen, reports Gemini 3 Pro at 72.7 percent compared to 11.4 percent for Gemini 2.5 Pro. On OmniDocBench 1.5, which measures OCR and structured document understanding, Gemini 3 Pro achieves a 0.115 edit distance, lower than all baselines in the comparison.
For long context, Gemini 3 Pro is evaluated on MRCR v2 with 8-needle retrieval. At a 128k average context it scores 77.0 percent, and at a 1M token pointwise setting it reaches 26.3 percent, ahead of Gemini 2.5 Pro at 16.4 percent. Competing published models do not yet support that context length in the comparison.
Coding, agentic workflows, and Google Antigravity
A central focus for Gemini 3 Pro is coding and agentic use. It tops the LMArena leaderboard with an Elo score of 1501 and achieves 1487 Elo in WebDev Arena. On Terminal Bench 2.0, which tests an agent's ability to operate a computer through a terminal, it reaches 54.2 percent, above GPT 5.1 at 47.6 percent and Claude Sonnet 4.5 at 42.8 percent.
On SWE Bench Verified, which measures single-attempt code changes across GitHub issues, Gemini 3 Pro scores 76.2 percent compared to 59.6 percent for Gemini 2.5 Pro, 76.3 percent for GPT 5.1 and 77.2 percent for Claude Sonnet 4.5. The model also performs well on τ2 bench for tool use at 85.4 percent and on Vending Bench 2, which evaluates long-horizon planning for a simulated business, producing a mean net worth of 5478.16 dollars versus 573.64 dollars for Gemini 2.5 Pro and 1473.43 dollars for GPT 5.1.
Google Antigravity exposes these capabilities in an agent-first development environment. Antigravity pairs Gemini 3 Pro with the Gemini 2.5 Computer Use model for browser control and the Nano Banana image model, enabling agents to plan, write and run code, interact with terminals or browsers, and verify results within a single workflow.
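Antigravity itself is a full development environment rather than a scripting API, but the underlying plan-act-verify pattern can be approximated with Gemini function calling. The sketch below is hypothetical: run_shell is an illustrative tool, not part of Antigravity, and gemini-3-pro-preview is the same assumed model identifier as above.

```python
# Minimal sketch of a plan-act-verify loop using Gemini function calling.
# Antigravity is a full IDE-like environment; this only illustrates the pattern.
# run_shell is a hypothetical tool that the SDK's automatic function calling
# can invoke when the model decides to act.
import subprocess
from google import genai
from google.genai import types

def run_shell(command: str) -> str:
    """Run a shell command and return its combined output (hypothetical tool)."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # assumed preview identifier
    contents="Create hello.py that prints 'hi', run it, and report the output.",
    config=types.GenerateContentConfig(tools=[run_shell]),
)
print(response.text)  # the model's final summary after planning and tool calls
```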
Implications for developers and products
Gemini 3 Pro is positioned as an API-ready workhorse for agentic, production-facing systems. Its combination of sparse MoE architecture, massive context window, strong multimodal reasoning, and tooling integration makes it a clear step toward more general AI systems that can reason over long inputs and act reliably on the user's behalf. The model is benchmark-driven but also integrated across Google platforms, giving developers an immediate path to experiment with agentic workflows and long-context applications.
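As one concrete long-context pattern, a large artifact can be uploaded once and then reasoned over in a single request. This sketch assumes the google-genai Files API and the same assumed preview model identifier as above.

```python
# Minimal sketch: exploiting the 1M-token window by sending a large file.
# Assumes the google-genai Files API; "gemini-3-pro-preview" is an assumed id.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

# Upload a long artifact, e.g. a multi-hour transcript or a codebase dump.
big_doc = client.files.upload(file="transcript.txt")

response = client.models.generate_content(
    model="gemini-3-pro-preview",
    contents=[big_doc, "List every decision made in this transcript."],
)
print(response.text)
```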
For deeper exploration, Google provides further technical documentation, plus GitHub tutorials, notebooks, and community channels.