March 31, 2026 · 6 min read

Beyond Regular Polytopes: How Correlations Reshape Superposition in Neural Networks

The Puzzle of Neural Geometry

Mechanistic interpretability has long operated under a compelling geometric narrative. Following the influential framework of Elhage et al. (2022), researchers have understood neural superposition as a packing problem: when a network must represent more features than it has dimensions, features arrange themselves to minimize interference, typically forming regular polytopes or antipodal pairs. In this view, interference represents harmful noise that non-linearities such as ReLUs must filter out, allowing only sparse, uncorrelated features to activate cleanly.

Yet this theoretical picture conflicts with what we observe in real language models. Rather than isotropic, locally orthogonal arrangements, researchers have identified ordered circular structures, such as the months of the year arranged in a ring, and semantic clusters where related concepts group together rather than repelling each other. These observations suggest that our understanding of how features organize geometrically remains incomplete.

In their ICLR 2026 paper, "From Data Statistics to Feature Geometry: How Correlations Shape Superposition," Prieto et al. resolve this tension by demonstrating that interference need not be purely destructive. Through their proposed Bag-of-Words Superposition (BOWS) framework, they show that when features exhibit realistic correlation structures, interference becomes constructive, enabling efficient encoding that gives rise precisely to the cyclical and clustered geometries observed in practice.

The Standard Model and Its Constraints

To appreciate the contribution, one must first understand the canonical formulation of superposition. Consider a set of d features embedded in an m-dimensional space where m < d. Definition 1 from Prieto et al. formalizes superposition as requiring two conditions: interference (non-zero dot products between feature directions) and recoverability (the existence of a decoder achieving R²_i ≥ 1−ε for each feature).
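Definition 1 can be made concrete in the classic antipodal-pair toy case (a minimal numpy sketch, not an experiment from the paper): two sparse features share a single hidden dimension with dot product −1, yet a ReLU decoder recovers each with high R², satisfying both the interference and recoverability conditions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two features, one hidden dimension: the classic antipodal pair.
# Encoder directions w1 = +1, w2 = -1, so <w1, w2> = -1 (interference).
n = 10_000
p = 0.05  # each feature is independently active (value 1) with prob p
f = (rng.random((n, 2)) < p).astype(float)

h = f[:, 0] * 1.0 + f[:, 1] * (-1.0)   # 1-D superposed code

# ReLU decoder: each feature reads out along its own direction.
f_hat = np.stack([np.maximum(h, 0.0), np.maximum(-h, 0.0)], axis=1)

# R^2_i = 1 - SSE_i / SST_i  (Definition 1's recoverability criterion)
sse = ((f - f_hat) ** 2).sum(axis=0)
sst = ((f - f.mean(axis=0)) ** 2).sum(axis=0)
r2 = 1.0 - sse / sst
print(r2)  # both near 1: errors occur only when both features fire at once
```

Reconstruction fails only on the rare (probability p²) joint activations, which is exactly why sparsity makes this destructive-interference scheme workable.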

In the standard account, features are assumed sparse and uncorrelated. Under these conditions, the optimal geometric arrangement minimizes pairwise dot products to reduce interference. ReLU activations serve as a selective filter, allowing the network to suppress harmful cross-talk between active features. This yields highly symmetric, local structures: regular polytopes where features are roughly equidistant, or antipodal pairs for mutually exclusive concepts.
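The symmetric packing can be illustrated with a regular pentagon (my own sketch, assuming the standard setup of unit-norm feature directions): five features in two dimensions, equally spaced, so every feature sees the same bounded interference from its neighbours.

```python
import numpy as np

# Five features packed into two dimensions as a regular pentagon:
# the symmetric arrangement that caps worst-case interference
# when features are sparse and uncorrelated.
d, m = 5, 2
angles = 2 * np.pi * np.arange(d) / d
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (d, m) unit rows

G = W @ W.T                          # Gram matrix of pairwise dot products
off_diag = G[~np.eye(d, dtype=bool)]
print(np.round(off_diag.max(), 3))   # cos(2*pi/5): adjacent features
print(np.round(off_diag.min(), 3))   # cos(4*pi/5): non-adjacent features
```

Every off-diagonal entry is one of two values, reflecting the rotational symmetry that the standard account predicts.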

This framework has proven productive for understanding toy models and motivated dictionary learning approaches such as sparse autoencoders (SAEs). However, it contains a critical assumption: that interference is noise to be minimized. Prieto et al. argue that this assumption breaks down for realistic data distributions where features correlate according to natural co-occurrence patterns.

Constructive Interference and Linear Superposition

The authors introduce Bag-of-Words Superposition (BOWS) as a controlled experimental setting that bridges synthetic toy models and real language data. In BOWS, an autoencoder learns to encode binary bag-of-words representations derived from internet text. This preserves the natural correlation structure of language while maintaining ground-truth feature identities, allowing precise geometric analysis.
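As a toy stand-in for the BOWS construction (the vocabulary and documents below are invented for illustration, not the paper's dataset), one can build binary bag-of-words vectors from short documents and inspect the resulting feature covariance, which already shows the off-diagonal structure that drives the rest of the argument:

```python
import numpy as np

# Illustrative mini-BOWS: binary bag-of-words vectors over a tiny
# vocabulary, built from short "documents" so that natural
# co-occurrence structure is preserved in the features.
vocab = ["christmas", "december", "winter", "beach", "july", "summer"]
docs = [
    {"christmas", "december", "winter"},
    {"december", "winter"},
    {"christmas", "winter"},
    {"beach", "july", "summer"},
    {"july", "summer"},
    {"beach", "summer"},
]
X = np.array([[1.0 if w in doc else 0.0 for w in vocab] for doc in docs])

# Feature covariance: positive blocks for co-occurring words,
# negative entries across the two "seasons".
C = np.cov(X, rowvar=False)
print(np.round(C, 2))
```

The ground-truth feature identities (the vocabulary entries) remain known throughout, which is what makes precise geometric analysis possible in this setting.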

Using BOWS, the authors identify a regime they term linear superposition (Definition 2). In contrast to the standard view, linear superposition occurs when features correlate in ways that make their interference constructive rather than destructive. When features A and B frequently co-occur, arranging their directions to have positive interference allows the presence of A to assist in reconstructing B, and vice versa. The ReLU still functions to suppress spurious activations and false positives, but it no longer needs to filter out all interference; some interference now carries signal.
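A stripped-down numerical sketch (hypothetical, not the paper's setup) makes the point at its extreme: when A and B are perfectly correlated, writing both onto a single shared direction with positive interference reconstructs each exactly, while the same code fails once the correlation is broken.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Hypothetical demo: features A and B are perfectly correlated, so a
# single shared direction with POSITIVE interference <w_A, w_B> = 1
# encodes both losslessly in one dimension.
a = (rng.random(n) < 0.3).astype(float)
b = a.copy()                       # A and B always co-occur

h = a + b                          # both written to the same 1-D direction
a_hat, b_hat = h / 2, h / 2        # each feature borrows the other's signal
print(np.abs(a - a_hat).max())     # 0.0: A's presence reconstructs B exactly

b_ind = (rng.random(n) < 0.3).astype(float)   # break the correlation
h2 = a + b_ind
err = np.abs(a - h2 / 2).max()
print(err)                          # 0.5: the same interference is now noise
```

The identical geometric arrangement is constructive under one data distribution and destructive under another, which is the paper's central observation in miniature.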

This regime is characterized by low-rank structure in the data covariance. When the feature covariance matrix exhibits strong off-diagonal patterns, the optimal encoder leverages these statistical dependencies. The authors demonstrate that under tight bottlenecks (high compression ratios) or with weight decay regularization, networks preferentially adopt these linear superposition solutions. Weight decay proves particularly significant; by penalizing weight norm, regularization pushes the model toward solutions that exploit data correlations to achieve efficient reconstruction with smaller weights, effectively mirroring the statistical structure of the training data in the geometry of the representation.
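The low-rank claim can be illustrated with synthetic data (an assumed two-topic generative process, invented for this sketch): when eight binary features are driven by two latent topics, their covariance spectrum collapses onto two eigenvalues, so an encoder with a two-dimensional bottleneck can exploit the dependencies rather than fight them.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 5_000, 8

# Hypothetical two-topic generative process: 8 features driven by
# 2 latent "topics", giving an approximately rank-2 feature covariance.
topics = (rng.random((n, 2)) < 0.4).astype(float)
loading = np.zeros((2, d))
loading[0, :4] = 1.0      # features 0-3 fire with topic 0
loading[1, 4:] = 1.0      # features 4-7 fire with topic 1
X = topics @ loading      # binary feature vectors, topic-aligned

C = np.cov(X, rowvar=False)
eig = np.sort(np.linalg.eigvalsh(C))[::-1]
print(np.round(eig / eig.sum(), 3))  # top 2 eigenvalues carry ~all variance
```

In this extreme case two dimensions suffice for perfect reconstruction; realistic correlations are softer, but the same spectral concentration is what the optimal encoder leverages.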

Emergent Geometries: Circles and Clusters

The geometric consequences of constructive interference differ markedly from the regular polytope picture. Rather than distributing features isotropically to minimize all dot products, the network arranges features according to their co-activation patterns.

First, this gives rise to semantic clusters, a regime of anisotropic superposition. Related features that frequently co-occur, such as "Christmas," "December," and "winter," cluster together in representation space. Their mutual interference is constructive, reinforcing the signal for seasonally related concepts. This explains observations in real language models where SAEs find features grouped by semantic similarity rather than isolated in orthogonal directions.

Second, and more strikingly, the framework naturally produces cyclical structures. The authors reproduce the "months of the year" phenomenon observed in LLMs: months arrange in a circular pattern because seasonal words constructively interfere with adjacent months. January activates February; December activates Christmas; the topology emerges from transition probabilities in the training data. This circular arrangement is not a quirk to be explained away, but a natural consequence of encoding correlated features in superposition while maintaining recoverability.
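A small linear sketch (illustrative; the paper's models also involve ReLU decoders) shows why chain-like co-occurrence produces a ring: build a circulant "seasonal" covariance in which each month correlates with its two neighbours, wrapping December back to January, and embed the months with the top non-constant eigenvectors of that covariance.

```python
import numpy as np

# Each month co-occurs most with its two neighbours, wrapping Dec -> Jan,
# giving a circulant covariance matrix.
d = 12
C = np.zeros((d, d))
for i in range(d):
    C[i, i] = 1.0
    C[i, (i + 1) % d] = C[i, (i - 1) % d] = 0.5   # circular seasonal links

# The best rank-2 linear (PCA-style) encoding of a circulant covariance
# uses its leading eigenvectors; for a circulant matrix these are a
# sine/cosine pair, i.e. points on a circle.
vals, vecs = np.linalg.eigh(C)
emb = vecs[:, -3:-1]          # skip the constant top eigenvector; k=1 pair
radii = np.linalg.norm(emb, axis=1)
print(np.round(radii, 3))     # all equal: the 12 months lie on a circle
```

Adjacent months land next to each other on the ring while opposite months land far apart, so the topology of the embedding directly mirrors the transition structure of the data.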

The paper also introduces a distinction between presence-coding and value-coding features to account for structure that persists even without correlations. Presence-coding features (binary indicators) exhibit the cyclical geometries when correlated, while value-coding features (continuous magnitudes) may show different organizations. This dichotomy helps explain why certain structured representations appear robustly across different data regimes.

Implications for Interpretability Research

These findings carry significant implications for the practice of mechanistic interpretability, particularly for dictionary learning. Current SAE training often assumes that features are approximately orthogonal or that interference represents noise to be eliminated. If real networks extensively employ linear superposition, then optimal dictionaries may need to account for correlated, non-orthogonal basis vectors. The geometric prior used in SAE training may need revision to accommodate constructive interference and anisotropic arrangements.

Furthermore, the role of weight decay in shaping feature geometry suggests that regularization choices have profound consequences for interpretability. Strong weight decay does not merely prevent overfitting; it actively pushes representations toward semantic clustering by forcing the network to exploit data correlations for efficient encoding. This connects to recent work on model editing and adversarial robustness, where understanding which features interfere constructively versus destructively becomes crucial for targeted interventions.

Limitations and Open Questions

While BOWS provides a crucial intermediate step between toy models and full language models, several limitations remain. The framework uses binary bag-of-words representations, which capture co-occurrence statistics but lack the compositional depth and contextual variation of real transformer representations. The experiments focus primarily on single-layer autoencoders, leaving open the question of how constructive interference propagates through deep, non-linear stacks.

Additionally, the distinction between beneficial constructive interference and harmful cross-talk requires careful empirical validation in larger models. While the theory predicts that correlated features should cluster, identifying whether specific interference patterns in LLMs are indeed exploited for efficient computation, as opposed to being merely tolerated, remains challenging.

Conclusion

The work of Prieto et al. fundamentally reframes how we conceptualize neural feature geometry. Superposition is not merely a packing problem where features jostle for orthogonal real estate; it is an encoding strategy that respects and exploits the statistical structure of the data. By recognizing that interference can be constructive, we gain explanatory purchase on the cyclical manifolds and semantic clusters that have puzzled interpretability researchers.

Moving forward, the field must update its geometric intuitions. Rather than searching only for regular polytopes and antipodal pairs, we should expect and seek out anisotropic structures that mirror the correlation geometry of the training distribution. The path toward interpretable AI requires understanding not just how features avoid each other, but how they cooperate through the mathematics of constructive interference.