
Gemma Scope 2 Unveiled: A Game-Changer for AI Interpretability

Explore the new features of Gemma Scope 2 for deep model insights.

Overview of Gemma Scope 2

Google DeepMind researchers introduce Gemma Scope 2, an open suite of interpretability tools that exposes how Gemma 3 language models, from 270M to 27B parameters, process and represent information across all layers.

Its core goal is simple: give AI safety and alignment teams a practical way to trace model behavior back to internal features instead of relying only on input-output analysis. When a Gemma 3 model is jailbroken, hallucinates, or behaves sycophantically, Gemma Scope 2 lets researchers inspect which internal features fired and how those activations flowed through the network.
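As a rough illustration of that workflow, the sketch below uses a PyTorch forward hook to capture a layer's activations so they can later be decomposed by an SAE. The toy `nn.Linear` layer and tensor shapes are placeholders, not the actual Gemma 3 interface; the same hook pattern would simply be attached to a real transformer block.

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer block; in practice the hook is attached
# to a Gemma 3 layer, but the capture pattern is the same.
d_model = 16
layer = nn.Linear(d_model, d_model)

captured = {}

def save_activation(module, inputs, output):
    # Store the layer's output so an SAE can decompose it afterwards.
    captured["resid"] = output.detach()

handle = layer.register_forward_hook(save_activation)

x = torch.randn(1, d_model)   # pretend residual-stream input for one token
_ = layer(x)
handle.remove()

print(captured["resid"].shape)  # activations ready to be passed through an SAE
```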

What is Gemma Scope 2?

Gemma Scope 2 is a comprehensive, open suite of sparse autoencoders (SAEs) and related tools trained on internal activations of the Gemma 3 model family. SAEs act as a microscope on the model, decomposing high-dimensional activations into a sparse set of human-inspectable features that correspond to concepts or behaviors.
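To make the "microscope" idea concrete, here is a minimal NumPy sketch of an SAE's encode/decode step. The dimensions, random weights, and plain ReLU nonlinearity are illustrative assumptions; the released Gemma Scope 2 SAEs ship trained weights and use their own architecture.

```python
import numpy as np

# Illustrative dimensions only; real SAEs use the model's hidden size
# and much wider feature dictionaries.
d_model, d_sae = 16, 64
rng = np.random.default_rng(0)

# Randomly initialised SAE parameters (the released SAEs are trained).
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_encode(activation):
    """Map a dense activation vector to sparse feature activations."""
    pre = activation @ W_enc + b_enc
    return np.maximum(pre, 0.0)  # ReLU stands in for the trained nonlinearity

def sae_decode(features):
    """Reconstruct the original activation from the sparse features."""
    return features @ W_dec + b_dec

# One residual-stream activation (random here, captured from a model in practice).
x = rng.normal(size=d_model)
f = sae_encode(x)
x_hat = sae_decode(f)

active = np.nonzero(f)[0]           # which features fired for this activation
error = np.linalg.norm(x - x_hat)   # how faithfully they reconstruct it
print(f"{active.size} active features, reconstruction error {error:.3f}")
```

The features that fire for a given input are the human-inspectable units researchers then try to label with concepts or behaviors.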

Training Gemma Scope 2 required storing around 110 petabytes of activation data and fitting over 1 trillion total parameters across all interpretability models.

The suite targets every Gemma 3 variant (270M, 1B, 4B, 12B, and 27B parameters) and covers the full depth of each network. This is crucial because many safety-relevant behaviors appear only at larger scales.

Enhancements Over the Original Gemma Scope

The first Gemma Scope release focused on Gemma 2 and enabled research on model hallucination, identifying secrets known by a model, and training safer models. Gemma Scope 2 extends that work in four main ways:

  1. The tools now span the entire Gemma 3 family up to 27B parameters, essential for studying emergent behaviors observed only in larger models.
  2. It includes SAEs and transcoders trained on every layer of Gemma 3, helping trace multi-step computations distributed across layers.
  3. The suite applies the Matryoshka training technique, allowing SAEs to learn more useful and stable features (a sketch of the idea follows this list).
  4. New interpretability tools are fine-tuned for chat-oriented Gemma 3 models, analyzing multi-step behaviors like jailbreaks and refusal mechanisms.
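For intuition on item 3: a Matryoshka-style objective asks nested prefixes of the feature dictionary to each reconstruct the activation on their own, which pushes the earliest features to capture the most general structure. The snippet below is a hedged sketch of such a nested reconstruction loss; the prefix sizes, L1 sparsity penalty, and random weights are illustrative and not the exact objective used for Gemma Scope 2.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 256

# Illustrative SAE decoder plus one (pretend) sparse feature vector.
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)
x = rng.normal(size=d_model)                          # activation to reconstruct
features = np.maximum(rng.normal(size=d_sae), 0.0)    # stand-in encoder output

def matryoshka_loss(x, features, W_dec, b_dec,
                    prefix_sizes=(16, 64, 256), l1_coef=1e-3):
    """Reconstruction loss summed over nested prefixes of the dictionary,
    plus an L1 sparsity penalty on the full feature vector (illustrative)."""
    total = 0.0
    for m in prefix_sizes:
        # Decode using only the first m features; smaller prefixes are
        # forced to carry the most broadly useful structure.
        x_hat = features[:m] @ W_dec[:m] + b_dec
        total += np.sum((x - x_hat) ** 2)
    return total + l1_coef * np.sum(np.abs(features))

print(f"nested loss: {matryoshka_loss(x, features, W_dec, b_dec):.3f}")
```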

Key Takeaways

  1. Gemma Scope 2 is an open interpretability suite for all Gemma 3 models, empowering AI safety work.
  2. It uses sparse autoencoders and transcoders to analyze internal activations.
  3. The suite allows for thorough investigation of issues like jailbreaks and discrepancies between a model's internal state and its stated reasoning.