
Fudan University Unveils Lorsa: Decoding Transformer Attention Superposition with Sparse Mechanisms

Fudan University researchers have developed Lorsa, a sparse attention mechanism that disentangles atomic attention units hidden in transformer superposition, enhancing interpretability of large language models.

Understanding Attention Mechanisms in Transformers

Large Language Models (LLMs) have advanced rapidly, yet deciphering their internal workings remains difficult. Transformer models use multiple attention heads, some of which have identifiable roles, such as induction heads that complete repeated patterns in the context. Most attention heads, however, distribute their focus across varied inputs without a clear, isolated function. This complexity is attributed to attention superposition, where multiple atomic attention units overlap within the same heads, making individual heads hard to interpret.
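To make the idea concrete, here is a small numerical sketch (not from the paper; the shapes, mixing weights, and tensor names are invented for illustration) of how two atomic attention units can be smeared across two observed heads while the layer's total output is preserved:

```python
import torch

torch.manual_seed(0)
seq_len, d_model = 8, 16

# Two hypothetical "atomic" attention units, each writing a rank-1 update into
# the residual stream: a per-token activation times a fixed write direction.
write_dirs = torch.randn(2, d_model)
unit_acts = torch.randn(2, seq_len)
unit_out = unit_acts[:, :, None] * write_dirs[:, None, :]   # (unit, token, d_model)

# Under superposition, each observed head emits a mixture of both units rather
# than one unit each, so inspecting any single head shows a blend of behaviors.
mixing = torch.tensor([[0.7, 0.3],
                       [0.3, 0.7]])
head_out = torch.einsum("hu,utd->htd", mixing, unit_out)

# The layer-level output (the sum over heads) still equals the sum over atomic
# units, because each unit's mixing weights sum to 1 across heads.
assert torch.allclose(head_out.sum(0), unit_out.sum(0))
```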

Challenges in Explaining Attention Heads

Past studies have identified specialized attention heads using methods like activation and path patching, revealing functions such as name moving, copy suppression, and long context retrieval. Yet, the superposition hypothesis suggests that neurons and attention heads represent multiple overlapping features rather than single functionalities. Sparse Autoencoders have helped extract sparse, interpretable features from neural networks but still face challenges explaining the collaborative behavior of attention heads in language models.

Introduction to Lorsa: Low-Rank Sparse Attention

Fudan University's research team introduced Lorsa, a novel approach designed to disentangle atomic attention units from the complex superposition found in Multi-Head Self-Attention (MHSA). Lorsa replaces standard MHSA with an overcomplete set of attention heads featuring single-dimensional OV circuits and sparsity constraints. This structure allows for better interpretability by activating only a subset of heads dynamically per token position.
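As a rough illustration of what a single-dimensional OV circuit looks like computationally, the following PyTorch sketch implements one hypothetical Lorsa-style head. The function and parameter names (`w_q`, `w_k`, `v_read`, `o_write`) are ours rather than the paper's, and details such as scaling and masking are assumptions:

```python
import torch

def lorsa_head(resid, w_q, w_k, v_read, o_write):
    """One hypothetical Lorsa-style head (illustrative names, not the released code).

    Unlike a standard head, whose value/output projections are d_head-dimensional,
    this head's OV circuit is one-dimensional: it reads a single scalar per token
    (a projection onto `v_read`) and writes back along the single direction `o_write`.
    """
    seq, _ = resid.shape
    q, k = resid @ w_q, resid @ w_k                      # (seq, d_head) queries / keys
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    attn = torch.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)

    v = resid @ v_read           # (seq,) scalar value per source token
    z = attn @ v                 # (seq,) the head's activation at each position
    return z[:, None] * o_write  # rank-1 contribution to the residual stream
```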

Lorsa's Architecture and Methodology

Lorsa is trained to reconstruct MHSA outputs by minimizing mean squared error, using one-dimensional OV circuits that restrict each head's read/write operations to specific residual-stream directions. To stay parameter-efficient, Query and Key parameters are shared across groups of Lorsa heads. Unlike traditional MHSA, Lorsa activates only the top-K heads per token position, resembling attention Sparse Autoencoders except that head activations are computed from attention over preceding token positions rather than from the current position alone.
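The sketch below puts these pieces together under the same assumptions: an overcomplete set of heads with shared Query/Key groups, one-dimensional OV circuits, top-K selection per token, and a mean-squared-error objective against the original attention output. Shapes and names are illustrative, not the released implementation:

```python
import torch

def lorsa_forward(resid, w_q, w_k, v_read, o_write, k_active):
    """Vectorized sketch of a Lorsa-style layer (parameter names are illustrative).

    resid:    (seq, d_model) residual stream input to the attention layer
    w_q, w_k: (n_groups, d_model, d_head) Query/Key parameters shared per group;
              heads are assigned to groups contiguously, n_heads % n_groups == 0
    v_read:   (n_heads, d_model) one read direction per head (1-D value circuit)
    o_write:  (n_heads, d_model) one write direction per head (1-D output circuit)
    """
    seq, _ = resid.shape
    n_heads, n_groups = v_read.shape[0], w_q.shape[0]
    heads_per_group = n_heads // n_groups

    q = torch.einsum("td,gdh->gth", resid, w_q)
    k = torch.einsum("td,gdh->gth", resid, w_k)
    scores = torch.einsum("gth,gsh->gts", q, k) / q.shape[-1] ** 0.5
    causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    attn = torch.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)

    # Each group's attention pattern is reused by all of its 1-D OV heads.
    v = torch.einsum("td,nd->nt", resid, v_read)               # scalar values per head
    attn_per_head = attn.repeat_interleave(heads_per_group, dim=0)
    z = torch.einsum("nts,ns->nt", attn_per_head, v)           # head activations

    # Sparsity constraint: keep only the top-K head activations at each position.
    topk = torch.topk(z.abs(), k_active, dim=0)
    sparse_z = torch.zeros_like(z).scatter(0, topk.indices, z.gather(0, topk.indices))

    recon = torch.einsum("nt,nd->td", sparse_z, o_write)       # reconstructed output
    return recon, sparse_z

# Training objective (sketch): match the original attention layer's output.
# loss = torch.mean((lorsa_forward(resid, w_q, w_k, v_read, o_write, k)[0] - mhsa_out) ** 2)
```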

Evaluating Lorsa's Interpretability

The team developed an exploration interface showcasing comprehensive data on each Lorsa head. Key metrics include top activations, showing the tokens on which a head fires most strongly, and z-pattern analysis, which breaks a head's activation at a given position down into contributions from earlier tokens. These views illuminate how specific heads, such as a "you"-specific induction head, operate by attending to the relevant earlier tokens and promoting the corresponding predictions.
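A minimal version of these two views might look as follows, assuming we already have a head's per-position activations, attention pattern, and scalar values; the function and its inputs are hypothetical, not the authors' interface:

```python
import torch

def head_diagnostics(z, attn, values, tokens, top_n=5):
    """Sketch of the two interpretability views described above (assumed shapes).

    z:      (seq,) a Lorsa head's activation at each destination position
    attn:   (seq, seq) that head's attention pattern
    values: (seq,) the head's scalar value at each source position
    tokens: list of seq token strings
    """
    # Top activations: the destination tokens where this head fires most strongly.
    top = torch.topk(z.abs(), top_n)
    top_activations = [(tokens[i], z[i].item()) for i in top.indices.tolist()]

    # z-pattern: for a chosen destination position, break the activation down into
    # contributions from each earlier source token (attention weight x scalar value).
    dest = int(top.indices[0])
    z_pattern = [(tokens[s], (attn[dest, s] * values[s]).item()) for s in range(dest + 1)]

    return top_activations, z_pattern
```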

Discoveries and Results

Lorsa successfully recovered known attention mechanisms like induction, name mover, successor, and attention sink heads in models such as Pythia-160M and Llama-3.1-8B. New findings include arithmetic-specific heads that handle simple math operations and thematic anchor heads that maintain long-range topic attention, influencing predictions towards domain-relevant vocabulary.

Significance and Future Directions

This work provides unprecedented visibility into transformer attention mechanisms, highlighting the importance of accounting for attention superposition in model interpretability. Despite this progress, challenges persist in fully disentangling Query-Key circuits and reducing residual superposition. Future research aims to explore low-dimensional QK structures, cross-layer superposition, and systematic Query/Key/Value composition to enhance understanding and control of language models.

For more details, see the paper, the model on Hugging Face, and the GitHub page.
