Gemma Scope 2 Unveiled: A Game-Changer for AI Interpretability
Explore the new features of Gemma Scope 2 for deep model insights.
Overview of Gemma Scope 2
Google DeepMind researchers have introduced Gemma Scope 2, an open suite of interpretability tools that exposes how Gemma 3 language models, from 270M to 27B parameters, process and represent information across all of their layers.
Its core goal is simple: give AI safety and alignment teams a practical way to trace model behavior back to internal features instead of relying only on input-output analysis. When a Gemma 3 model is jailbroken, hallucinates, or behaves sycophantically, Gemma Scope 2 lets researchers inspect which internal features fired and how those activations flowed through the network.
What is Gemma Scope 2?
Gemma Scope 2 is a comprehensive, open suite of sparse autoencoders (SAEs) and related tools trained on internal activations of the Gemma 3 model family. SAEs act as a microscope on the model, decomposing high-dimensional activations into a sparse set of human-inspectable features that correspond to concepts or behaviors.
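To make the microscope analogy concrete, here is a minimal sketch of what a sparse autoencoder does to a single activation vector. It is a toy written for this article, not the Gemma Scope 2 code: the class name `TinySAE`, the helpers `sae_loss` and `top_features`, the plain ReLU activation, the L1 sparsity penalty, the random weights, and the tiny dimensions are all assumptions chosen for illustration.

```python
import random


def relu(v):
    return v if v > 0.0 else 0.0


class TinySAE:
    """Toy sparse autoencoder (hypothetical): d_model activations -> d_sae features."""

    def __init__(self, d_model, d_sae, seed=0):
        rng = random.Random(seed)
        # Encoder: one row of weights per feature.
        self.W_enc = [[rng.gauss(0.0, 0.5) for _ in range(d_model)] for _ in range(d_sae)]
        self.b_enc = [0.0] * d_sae
        # Decoder starts as the transpose of the encoder (a common initialization).
        self.W_dec = [[self.W_enc[j][i] for j in range(d_sae)] for i in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        # Non-negative feature activations; after training most would be zero.
        return [relu(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.W_enc, self.b_enc)]

    def decode(self, f):
        # Reconstruct the activation as a weighted sum of feature directions.
        return [sum(w * fj for w, fj in zip(row, f)) + b
                for row, b in zip(self.W_dec, self.b_dec)]


def sae_loss(x, sae, l1_coeff=0.01):
    """Reconstruction error plus an L1 penalty that pushes features toward sparsity."""
    f = sae.encode(x)
    x_hat = sae.decode(f)
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return mse + l1_coeff * sum(f)


def top_features(f, k):
    """Indices of the k strongest active features, strongest first."""
    active = [(v, j) for j, v in enumerate(f) if v > 0.0]
    return [j for _, j in sorted(active, reverse=True)[:k]]


sae = TinySAE(d_model=4, d_sae=16)
x = [0.3, -1.2, 0.8, 0.1]   # stand-in for one activation vector
f = sae.encode(x)
x_hat = sae.decode(f)
print("strongest features:", top_features(f, 3))
```

In a trained SAE, the indices returned by `top_features` are what a researcher would look up and label with a concept or behavior. With the random weights used here they carry no meaning.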
Training Gemma Scope 2 required storing around 110 petabytes of activation data and fitting more than 1 trillion parameters in total across all of the interpretability models.
The suite covers every Gemma 3 size (270M, 1B, 4B, 12B, and 27B parameters) and the full depth of each network. This matters because many safety-relevant behaviors only appear at larger scales.
Enhancements Over the Original Gemma Scope
The first Gemma Scope release focused on Gemma 2 and enabled research on model hallucination, identifying secrets known by a model, and training safer models. Gemma Scope 2 extends that work in four main ways:
- The tools now span the entire Gemma 3 family up to 27B parameters, which is essential for studying emergent behaviors observed only in larger models.
- The suite includes SAEs and transcoders trained on every layer of Gemma 3, which helps trace multi-step computations distributed across layers.
- The suite applies the Matryoshka training technique, which helps SAEs learn more useful and stable features.
- New interpretability tools are tailored to the chat-tuned Gemma 3 models, for analyzing multi-step behaviors such as jailbreaks and refusal mechanisms.
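The article does not describe how Matryoshka training is configured in Gemma Scope 2, so the following is only a sketch of the general idea: nested prefixes of the feature dictionary must each reconstruct the activation on their own, and the loss sums the reconstruction error over those prefixes. The function names, the prefix sizes, and the toy decoder weights are assumptions for illustration.

```python
def reconstruct(features, W_dec, b_dec, n_active):
    """Decode using only the first n_active features (one nested 'doll')."""
    return [
        b + sum(row[j] * features[j] for j in range(n_active))
        for row, b in zip(W_dec, b_dec)
    ]


def mse(x, x_hat):
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def matryoshka_loss(x, features, W_dec, b_dec, prefix_sizes):
    """Sum of reconstruction errors over nested feature prefixes (assumed equal weights)."""
    return sum(mse(x, reconstruct(features, W_dec, b_dec, n)) for n in prefix_sizes)


# Toy example: 2-dimensional activation, 4 features.
x = [1.0, 2.0]
features = [1.0, 0.0, 2.0, 0.0]
W_dec = [
    [1.0, 0.0, 0.0, 0.0],   # output dim 0 is written by feature 0
    [0.0, 0.0, 1.0, 0.0],   # output dim 1 is written by feature 2
]
b_dec = [0.0, 0.0]

loss = matryoshka_loss(x, features, W_dec, b_dec, prefix_sizes=[2, 4])
print(loss)  # 2.0: the 2-feature prefix misses dim 1, the full dictionary is exact
```

Because the short prefix is penalized for what it cannot reconstruct, training under a loss like this encourages the earliest features to capture the most broadly useful directions, with later features adding detail.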
Key Takeaways
- Gemma Scope 2 is an open interpretability suite for all Gemma 3 models, empowering AI safety work.
- It uses sparse autoencoders and transcoders to analyze internal activations.
- The suite supports thorough investigation of issues such as jailbreaks and discrepancies between a model's internal state and its stated reasoning.