Fixing Instability in Hyper Connections with a 1967 Algorithm
DeepSeek researchers address training instability in LLMs using a 1967 matrix normalization technique.
The Challenge of Large Language Models
DeepSeek researchers target a specific problem in large language model training. Residual connections made very deep networks trainable; hyper connections widened that residual stream but introduced instability at scale. The new method, mHC (Manifold Constrained Hyper Connections), keeps the richer topology of hyper connections while constraining the mixing matrices to a well-defined manifold so that deep stacks remain numerically stable.
From Residual Connections To Hyper Connections
Standard residual connections, as seen in ResNets and Transformers, propagate activations using the formula:
x_{l+1} = x_l + F(x_l, W_l)
The identity path preserves magnitude and ensures gradients remain usable even when stacking many layers.
Hyper Connections generalize this structure by using an n-stream buffer x_l ∈ R^{n×C}, with learned mappings that control how each layer reads from and writes to the streams:
- H_l^{pre} selects a mixture of streams as the layer input
- F is the usual attention or feed-forward sublayer
- H_l^{post} writes the sublayer output back into the n-stream buffer
- H_l^{res} ∈ R^{n×n} mixes streams between layers
The update occurs as:
x_{l+1} = H_l^{res} x_l + (H_l^{post})^T F(H_l^{pre} x_l, W_l)
Setting n = 4 increases expressivity at little extra floating-point cost, which is why hyper connections improve downstream performance in language models.
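A minimal NumPy sketch of this update, assuming a single-vector read/write path; the sublayer, dimensions, and random mappings are illustrative placeholders rather than DeepSeek's implementation:

```python
import numpy as np

def hyper_connection_layer(x, H_res, H_pre, H_post, sublayer_fn):
    """One hyper connection update on an n-stream buffer x of shape (n, C).

    H_res: (n, n) mixes streams between layers.
    H_pre: (1, n) selects a mixture of streams as the sublayer input.
    H_post: (1, n) writes the sublayer output back into the n streams.
    """
    layer_input = H_pre @ x                      # (1, C): mixture of streams
    layer_output = sublayer_fn(layer_input)      # stand-in for attention / FFN
    return H_res @ x + H_post.T @ layer_output   # (n, C): mixed residual + write-back

# Toy usage with n = 4 streams and hidden size C = 8.
n, C = 4, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(n, C))
H_res = rng.normal(size=(n, n))
H_pre = rng.normal(size=(1, n))
H_post = rng.normal(size=(1, n))
x_next = hyper_connection_layer(x, H_res, H_pre, H_post, sublayer_fn=np.tanh)
print(x_next.shape)  # (4, 8)
```

With n = 1 and all mappings fixed to the identity, this reduces to the standard residual update; the extra cost of larger n comes mainly from the small mixing matrices and the wider buffer.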
Why Hyper Connections Become Unstable
Instability arises from the product of residual mixers across many layers. For a 27B mixture-of-experts model, the paper tracks an Amax Gain Magnitude of the composite mapping, computed from maximum row and column sums, which measures worst-case amplification along signal paths. Without constraints, this gain can peak around 3000, far from the value of 1 expected of a stable residual path.
Small deviations per layer therefore compound into large amplification factors, causing loss spikes and unstable gradient norms relative to baseline models. On top of that, the multi-stream buffer increases memory traffic per token, which makes naive scaling of hyper connections unattractive for production-scale models.
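To see the compounding effect, here is a small NumPy sketch that multiplies per-layer mixers sitting close to the identity and tracks a max row/column-sum gain; the depth, the deviation scale, and the exact gain formula are illustrative assumptions, not the paper's definition:

```python
import numpy as np

def amax_gain(M):
    """Worst-case amplification proxy: larger of max absolute row sum and column sum."""
    A = np.abs(M)
    return max(A.sum(axis=1).max(), A.sum(axis=0).max())

rng = np.random.default_rng(0)
n, depth = 4, 60                      # illustrative stream count and layer count
composite = np.eye(n)
per_layer = []
for _ in range(depth):
    # Each learned mixer deviates only slightly from the identity, but with a
    # consistent positive bias the deviations multiply instead of cancelling.
    H_res = np.eye(n) + 0.04 * np.abs(rng.normal(size=(n, n)))
    per_layer.append(amax_gain(H_res))
    composite = H_res @ composite

print(f"typical per-layer gain: {np.mean(per_layer):.2f}")                # close to 1
print(f"composite gain after {depth} layers: {amax_gain(composite):.0f}")  # orders of magnitude above 1
```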
Manifold Constrained Hyper Connections
The mHC approach retains the multi-stream residual idea but constrains the risky component. The residual mixing matrix H_l^{res} is projected onto the manifold of doubly stochastic matrices (the Birkhoff polytope), where all entries are non-negative and every row and column sums to 1.
DeepSeek enforces this constraint with the classical Sinkhorn-Knopp algorithm from 1967, which alternates row and column normalizations to approximate a doubly stochastic matrix. The team applies 20 iterations per layer during training, enough to keep the mapping close to the target manifold at manageable cost.
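The projection itself is short. A minimal NumPy sketch, assuming the unconstrained parameters are made positive with an exponential before normalization (the paper's exact parameterization may differ):

```python
import numpy as np

def sinkhorn_knopp(logits, num_iters=20, eps=1e-8):
    """Project an unconstrained n x n matrix toward the Birkhoff polytope:
    entries become non-negative and row/column sums approach 1."""
    M = np.exp(logits)  # strict positivity (assumed parameterization)
    for _ in range(num_iters):
        M = M / (M.sum(axis=1, keepdims=True) + eps)   # normalize rows
        M = M / (M.sum(axis=0, keepdims=True) + eps)   # normalize columns
    return M

rng = np.random.default_rng(0)
H_res = sinkhorn_knopp(rng.normal(size=(4, 4)), num_iters=20)
print(np.round(H_res.sum(axis=1), 4))  # row sums ~ 1
print(np.round(H_res.sum(axis=0), 4))  # column sums ~ 1 (exact after the final column step)
```

Because every step is differentiable, gradients flow through the normalizations, so the constrained mixer can still be trained end to end.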
Under these constraints, H_l^{res} x_l behaves like a convex combination of residual streams: total feature mass is redistributed rather than amplified, which removes the explosive growth seen in unconstrained hyper connections.
Because a product of doubly stochastic matrices is itself doubly stochastic, the composite residual-mixing path cannot amplify signals on its own. Accordingly, with mHC the composite Amax Gain Magnitude stays bounded, peaking at about 1.6 in the 27B model versus roughly 3000 in the unconstrained version, a reduction of about three orders of magnitude in worst-case amplification.
Systems Work And Training Overhead
Integrating Sinkhorn-style iterations adds overhead on paper. However, the DeepSeek team employs several systems optimizations:
- Fused kernels combine RMSNorm, projections, and gating for mHC mappings to maintain low memory traffic.
- Recompute-based activation checkpointing trades compute for memory, recalculating mHC activations during backpropagation in blocks of layers (see the sketch below).
- Integration with a DualPipe-style pipeline schedule maximizes the overlap between communication and recomputation.
In large-scale training runs, mHC with an expansion rate of n = 4 adds roughly 6.7% training-time overhead compared to the baseline architecture, with the extra Sinkhorn-Knopp compute and the optimizations above already factored in.
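As an illustration of the recompute idea only (not DeepSeek's fused kernels or DualPipe integration), PyTorch's generic activation checkpointing can rerun a block of layers during the backward pass instead of storing their intermediate activations; the layer stack below is a stand-in:

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# Stand-in for a stack of mHC-augmented transformer layers (illustrative only).
layers = torch.nn.Sequential(*[
    torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU())
    for _ in range(8)
])

x = torch.randn(4, 512, requires_grad=True)

# Split the stack into 2 segments; activations inside each segment are not kept
# but recomputed during backprop, trading extra compute for lower memory.
y = checkpoint_sequential(layers, 2, x, use_reentrant=False)
y.sum().backward()
```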
Empirical Results
The research team trained 3B, 9B, and 27B mixture of experts models, evaluating them across standard language benchmarks including BBH, DROP, GSM8K, HellaSwag, MMLU, PIQA, and TriviaQA.
For the 27B model, results indicate substantial gains:
- Baseline: BBH 43.8, DROP (F1) 47.0
- With hyper connections: BBH 48.9, DROP (F1) 51.6
- With mHC: BBH 51.0, DROP (F1) 53.9
These results show that hyper connections already improve on plain residual designs, and that manifold constrained hyper connections both stabilize training and push performance further. Similar trends hold across benchmarks and model sizes, indicating the advantage persists across compute budgets.
Key Takeaways
- mHC stabilizes widened residual streams: Retaining the widening of the residual pathway into 4 streams while constraining the residual mixing matrices prevents exploding behavior.
- Amplification reduction: The composite gain magnitude drops from ≈3000 to ≈1.6 for the 27B MoE model, preventing the loss spikes and unstable gradient norms seen with unconstrained hyper connections.
- Doubly stochastic enforcement: Sinkhorn iterations ensure the sums of rows and columns are 1, which maintains stability in residual behavior.
- Small overhead, measurable gains: Across varying model sizes, mHC boosts benchmark accuracy while only introducing about 6.7% training time overhead.
- New scaling dimension for LLM design: mHC illustrates the potential of designing topology and manifold constraints to unlock further performance and stability improvements in future large language models.