
Gemma Scope 2 Unveiled: A Game-Changer for AI Interpretability

Explore the new features of Gemma Scope 2 for deep model insights.

Overview of Gemma Scope 2

Google DeepMind researchers introduce Gemma Scope 2, an open suite of interpretability tools that exposes how Gemma 3 language models, from 270M to 27B parameters, process and represent information across all layers.

Its core goal is simple: give AI safety and alignment teams a practical way to trace model behavior back to internal features instead of relying only on input-output analysis. When a Gemma 3 model is jailbroken, hallucinates, or behaves sycophantically, Gemma Scope 2 lets researchers inspect which internal features fired and how those activations flowed through the network.
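As a rough illustration of that workflow, the sketch below uses a PyTorch forward hook to capture a layer's activations so they can later be decomposed by an SAE. The toy `nn.Linear` layer and tensor shapes are placeholders, not the actual Gemma 3 interface; the same hook pattern would simply be attached to a real transformer block.

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer block; in practice the hook is attached
# to a Gemma 3 layer, but the capture pattern is the same.
d_model = 16
layer = nn.Linear(d_model, d_model)

captured = {}

def save_activation(module, inputs, output):
    # Store the layer's output so an SAE can decompose it afterwards.
    captured["resid"] = output.detach()

handle = layer.register_forward_hook(save_activation)

x = torch.randn(1, d_model)   # pretend residual-stream input for one token
_ = layer(x)
handle.remove()

print(captured["resid"].shape)  # activations ready to be passed through an SAE
```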

What is Gemma Scope 2?

Gemma Scope 2 is a comprehensive, open suite of sparse autoencoders (SAEs) and related tools trained on internal activations of the Gemma 3 model family. SAEs act as a microscope on the model, decomposing high-dimensional activations into a sparse set of human-inspectable features that correspond to concepts or behaviors.
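To make the "microscope" idea concrete, here is a minimal NumPy sketch of an SAE's encode/decode step. The dimensions, random weights, and plain ReLU nonlinearity are illustrative assumptions; the released Gemma Scope 2 SAEs ship trained weights and use their own architecture.

```python
import numpy as np

# Illustrative dimensions only; real SAEs use the model's hidden size
# and much wider feature dictionaries.
d_model, d_sae = 16, 64
rng = np.random.default_rng(0)

# Randomly initialised SAE parameters (the released SAEs are trained).
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_encode(activation):
    """Map a dense activation vector to sparse feature activations."""
    pre = activation @ W_enc + b_enc
    return np.maximum(pre, 0.0)  # ReLU stands in for the trained nonlinearity

def sae_decode(features):
    """Reconstruct the original activation from the sparse features."""
    return features @ W_dec + b_dec

# One residual-stream activation (random here, captured from a model in practice).
x = rng.normal(size=d_model)
f = sae_encode(x)
x_hat = sae_decode(f)

active = np.nonzero(f)[0]           # which features fired for this activation
error = np.linalg.norm(x - x_hat)   # how faithfully they reconstruct it
print(f"{active.size} active features, reconstruction error {error:.3f}")
```

The features that fire for a given input are the human-inspectable units researchers then try to label with concepts or behaviors.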

Training Gemma Scope 2 required storing around 110 petabytes of activation data and fitting over 1 trillion total parameters across all interpretability models.

The suite targets every Gemma 3 variant (270M, 1B, 4B, 12B, and 27B parameters) and covers the full depth of each network. This is crucial because many safety-relevant behaviors appear only at larger scales.

Enhancements Over the Original Gemma Scope

The first Gemma Scope release focused on Gemma 2 and enabled research on model hallucination, identifying secrets known by a model, and training safer models. Gemma Scope 2 extends that work in four main ways:

  1. The tools now span the entire Gemma 3 family up to 27B parameters, essential for studying emergent behaviors observed only in larger models.
  2. It includes SAEs and transcoders trained on every layer of Gemma 3, helping trace multi-step computations distributed across layers.
  3. The suite applies the Matryoshka training technique, allowing SAEs to learn more useful and stable features (a sketch of the idea follows this list).
  4. New interpretability tools are fine-tuned for chat-oriented Gemma 3 models, analyzing multi-step behaviors like jailbreaks and refusal mechanisms.
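For intuition on item 3: a Matryoshka-style objective asks nested prefixes of the feature dictionary to each reconstruct the activation on their own, which pushes the earliest features to capture the most general structure. The snippet below is a hedged sketch of such a nested reconstruction loss; the prefix sizes, L1 sparsity penalty, and random weights are illustrative and not the exact objective used for Gemma Scope 2.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 256

# Illustrative SAE decoder plus one (pretend) sparse feature vector.
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)
x = rng.normal(size=d_model)                          # activation to reconstruct
features = np.maximum(rng.normal(size=d_sae), 0.0)    # stand-in encoder output

def matryoshka_loss(x, features, W_dec, b_dec,
                    prefix_sizes=(16, 64, 256), l1_coef=1e-3):
    """Reconstruction loss summed over nested prefixes of the dictionary,
    plus an L1 sparsity penalty on the full feature vector (illustrative)."""
    total = 0.0
    for m in prefix_sizes:
        # Decode using only the first m features; smaller prefixes are
        # forced to carry the most broadly useful structure.
        x_hat = features[:m] @ W_dec[:m] + b_dec
        total += np.sum((x - x_hat) ** 2)
    return total + l1_coef * np.sum(np.abs(features))

print(f"nested loss: {matryoshka_loss(x, features, W_dec, b_dec):.3f}")
```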

Key Takeaways

  1. Gemma Scope 2 is an open interpretability suite for all Gemma 3 models, empowering AI safety work.
  2. It uses sparse autoencoders and transcoders to analyze internal activations.
  3. The suite allows for thorough investigation of issues like jailbreaks and discrepancies between a model's internal state and its stated reasoning.