
Microsoft's Muon Optimizer Dramatically Speeds Up Grokking in Transformers

Microsoft researchers demonstrate that the Muon optimizer drastically speeds up grokking in Transformer models, enabling faster transition from memorization to generalization compared to AdamW.

Revisiting Grokking in Deep Learning

Grokking is a fascinating phenomenon where deep learning models suddenly shift from memorizing training data to effectively generalizing to unseen data after a prolonged delay. Initially observed in simple algorithmic tasks like modular arithmetic, grokking challenges our understanding of training dynamics. Models often achieve near-perfect training accuracy but perform poorly on validation for many epochs before abruptly improving.

The Role of Optimizers in Grokking

Microsoft researchers explored how different optimizers impact this grokking behavior, comparing the popular AdamW optimizer with Muon, a novel optimizer featuring spectral norm constraints and second-order information. The study aimed to determine if Muon could accelerate the generalization phase compared to AdamW.

Experimental Setup

The researchers tested seven algorithmic tasks, including modular arithmetic operations and parity classification, all implemented using a Transformer architecture in PyTorch. They also examined three softmax variants—standard softmax, stablemax, and sparsemax—to see if output normalization influenced grokking, though the focus remained on optimizer effects.
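For concreteness, here is a minimal PyTorch-style sketch of how such a modular-arithmetic dataset can be constructed. The modulus, the train/validation split fraction, and the token layout below are illustrative assumptions, not the paper's exact configuration.

```python
import torch

def make_modular_addition_data(p: int = 97, train_frac: float = 0.5, seed: int = 0):
    """Enumerate all pairs (a, b) with label (a + b) mod p and split into train/val."""
    a = torch.arange(p).repeat_interleave(p)   # 0,0,...,0,1,1,...
    b = torch.arange(p).repeat(p)              # 0,1,...,p-1,0,1,...
    inputs = torch.stack([a, b], dim=1)        # each example is the token pair [a, b]
    targets = (a + b) % p                      # the model must predict the sum mod p

    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(p * p, generator=g)
    n_train = int(train_frac * p * p)
    train_idx, val_idx = perm[:n_train], perm[n_train:]
    return (inputs[train_idx], targets[train_idx]), (inputs[val_idx], targets[val_idx])

(train_x, train_y), (val_x, val_y) = make_modular_addition_data()
```

Because the full input space is enumerable, the held-out half of the pairs serves as the validation set on which grokking is measured.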

Model Architecture and Optimization Details

The Transformer model employed components like multi-head self-attention, rotary positional embeddings (RoPE), RMS normalization, SiLU activations, and dropout. Inputs were encoded through identity embeddings.
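To make two of the named components concrete, the following is a minimal sketch of an RMS normalization layer and a SiLU feed-forward block in PyTorch. The dimensions, dropout rate, and block wiring are assumptions for illustration, not the paper's settings.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMS normalization: rescale by the root-mean-square of the features, with no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight

class FeedForward(nn.Module):
    """Position-wise MLP with a SiLU activation and dropout, as used inside each block."""
    def __init__(self, dim: int, hidden: int, p_drop: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.SiLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```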

AdamW, which serves as the baseline, uses adaptive per-parameter learning rates with decoupled weight decay.
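A baseline setup along these lines uses torch.optim.AdamW directly; the hyperparameter values below are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 128)  # stand-in for the Transformer described above

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,              # illustrative values
    betas=(0.9, 0.98),
    weight_decay=0.1,     # decoupled weight decay: applied to the weights, not folded into the gradient
)
```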

The Muon optimizer applies orthogonalized gradient updates, enforces spectral norm constraints for training stability, and approximates second-order curvature to produce more informative steps. These techniques help avoid pitfalls such as "softmax collapse" and promote more efficient training by scaling updates to match each layer's dimensionality.
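The orthogonalization at the core of Muon is typically implemented with a Newton-Schulz iteration applied to the momentum buffer. The sketch below follows the publicly released Muon reference formulation; the coefficients, scaling rule, and the exact variant used in the paper may differ.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D update matrix via a quintic Newton-Schulz iteration.

    Coefficients follow the public Muon reference implementation; the paper's
    exact variant may differ.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)          # scale so the spectral norm is at most ~1
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(param: torch.Tensor, momentum: torch.Tensor, grad: torch.Tensor,
              lr: float = 0.02, beta: float = 0.95) -> None:
    """One Muon-style update on a weight matrix: momentum, orthogonalize, dimension-aware scaling."""
    momentum.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum)
    scale = max(1.0, param.size(0) / param.size(1)) ** 0.5  # dimension-aware scaling (assumption)
    param.data.add_(update, alpha=-lr * scale)
```

Because the orthogonalized update has a bounded spectral norm, every layer moves a comparable "distance" per step regardless of how skewed its raw gradient spectrum is.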

Results and Findings

Experiments were run on NVIDIA H100 GPUs with multiple seeds for robustness. Grokking was defined as the epoch where validation accuracy first exceeded 95% after training accuracy stabilized.
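In code, this criterion reduces to scanning the accuracy curves for the first qualifying epoch. The sketch below treats "training accuracy stabilized" as training accuracy above 99%, which is an assumption for illustration.

```python
from typing import Optional, Sequence

def grokking_epoch(train_acc: Sequence[float], val_acc: Sequence[float],
                   val_threshold: float = 0.95, train_threshold: float = 0.99) -> Optional[int]:
    """First epoch at which validation accuracy exceeds 95% while training accuracy is already saturated."""
    for epoch, (tr, va) in enumerate(zip(train_acc, val_acc)):
        if tr > train_threshold and va > val_threshold:
            return epoch
    return None  # the run never grokked within the training budget
```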

Muon reached this threshold significantly faster, averaging 102.89 epochs versus 153.09 epochs for AdamW. Statistical analysis confirmed the difference was highly significant (t = 5.0175, p ≈ 6.33e−8). Additionally, Muon showed more consistent grokking times across tasks.
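The comparison itself amounts to a standard two-sample t-test over per-run grokking epochs, for example via scipy. The arrays below are synthetic placeholders, not the paper's measurements, and the exact test variant the authors used is not restated here.

```python
import numpy as np
from scipy import stats

# Synthetic placeholder arrays standing in for per-run grokking epochs collected
# across tasks and seeds; these are NOT the paper's measurements.
rng = np.random.default_rng(0)
muon_epochs = rng.normal(loc=100.0, scale=20.0, size=30)
adamw_epochs = rng.normal(loc=150.0, scale=45.0, size=30)

t_stat, p_value = stats.ttest_ind(adamw_epochs, muon_epochs)
print(f"t = {t_stat:.4f}, p = {p_value:.3e}")
```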

Implications for Neural Network Training

These findings highlight that optimizer design—particularly incorporating geometry-aware updates and spectral norm constraints—can profoundly influence when and how models generalize. Muon’s approach helps models bypass extended memorization phases, steering training more directly toward underlying data structures.

This study suggests that beyond data and regularization, the choice and design of optimizers must be considered a crucial factor in deep learning training strategies.

For more details, check out the original paper and follow related discussions on social media platforms.
