The Geometry of Superposition: Why Sparse Autoencoders Fail at Compositional Generalization
The mechanistic interpretability community is repeating the mistake of variational autoencoders. In the rush to understand large language models, we have traded theoretical rigor for computational convenience, assuming that efficient feedforward architectures can recover the sparse structure of neural activations. The paper Stop Probing, Start Coding: Why Linear Probes and Sparse Autoencoders Fail at Compositional Generalisation by Barin Pacela, Joshi, and colleagues presents a sobering corrective. Their analysis demonstrates that the dominant methods for interpreting neural networks, linear probes and sparse autoencoders (SAEs), fail under compositional distribution shifts not because of insufficient training data or model capacity, but because of a fundamental geometric misunderstanding of how concepts are encoded under superposition.
The Representation-Accessibility Distinction
The linear representation hypothesis (LRH) has served as the bedrock assumption for much of modern interpretability. It posits that high-level concepts exist as linear combinations within a model’s activation space, formally expressed as y ≈ Wz, where z represents the ground-truth latent concepts and y represents the observed activations. This assumption has motivated the widespread use of linear probes and activation steering, operating on the intuition that if concepts are linearly encoded, they should be linearly decodable.
However, as the authors carefully articulate, linear representation is not equivalent to linear accessibility. In the regime of superposition, where the dimensionality of the concept space exceeds that of the activation space (d_z > d_y), the encoding matrix W projects a high-dimensional concept space onto a lower-dimensional activation manifold. This projection folds the latent space onto itself, transforming linear decision boundaries in concept space into nonlinear, potentially fractured boundaries in activation space. A linear probe may achieve perfect accuracy in-distribution, yet fail catastrophically when evaluated on out-of-distribution (OOD) compositions of familiar concepts.
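The folding is easy to see in miniature. When d_z > d_y, the matrix W has a nontrivial null space, so distinct sparse concept vectors can collide onto the exact same activation vector; only the sparsity prior can break the tie, and exploiting that prior requires nonlinear inference. A toy numpy sketch (the matrix and vectors are illustrative, not taken from the paper):

```python
import numpy as np

# Superposition in miniature: 3 concepts squeezed into 2 activation dims.
W = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])   # encoding matrix, d_y = 2, d_z = 3

z1 = np.array([1.0, 1.0, 0.0])    # concepts 0 and 1 active
z2 = np.array([0.0, 0.0, 1.0])    # only concept 2 active

y1, y2 = W @ z1, W @ z2
print(np.allclose(y1, y2))        # True: distinct concept vectors collide
```

No linear readout can separate z1 from z2 here; a sparse-recovery method would prefer z2 because it is the sparser explanation of the shared activation.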
The theoretical stakes become clear through the lens of compressed sensing. Classical results establish that recovering a k-sparse signal from d_y measurements requires d_y = O(k log(d_z/k)) dimensions when using nonlinear iterative methods such as basis pursuit or iterative thresholding. In contrast, linear recovery methods require d_y = Ω((k²/log k) · log(d_z/k)) dimensions. For a concrete scenario with d_z = 10^6 latent concepts and k = 100 active concepts, nonlinear methods require merely d_y ≈ 920 dimensions, while linear methods demand d_y ≈ 20,000. Standard transformer hidden dimensions of 4096 comfortably satisfy the former bound but fall drastically short of the latter. This quantitative gap explains why linear probes, despite their simplicity, are fundamentally ill-equipped for compositional generalization under superposition.
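Plugging the numbers into the two bounds (natural logarithms, hidden constants taken as 1, so this checks only the asymptotic scaling) reproduces the quoted figures up to rounding:

```python
import math

d_z, k = 10**6, 100
nonlinear = k * math.log(d_z / k)                   # O(k log(d_z/k))
linear = (k**2 / math.log(k)) * math.log(d_z / k)   # Ω((k²/log k)·log(d_z/k))
print(round(nonlinear), round(linear))              # ~921 vs 20000
```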
The Amortization Gap and Dictionary Learning
Given that linear methods are insufficient, one might hope that SAEs, which perform nonlinear inference through their encoder-decoder architecture, would succeed. Yet the authors identify a critical failure mode that persists even when the inference is nonlinear. The issue lies not in the complexity of the inference procedure, but in its amortization. Classical sparse coding solves an optimization problem from scratch for each input, using only the observation y and a learned dictionary W. SAEs, by contrast, amortize this inference into a fixed encoder r: R^{d_y} → R^{d_h} trained to predict sparse codes in a single forward pass.
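The two inference strategies can be contrasted in a few lines. The function names and the choice of a standard ReLU encoder and plain ISTA below are illustrative, not the paper's exact implementation:

```python
import numpy as np

def sae_encode(y, W_enc, b_enc):
    # Amortized inference: one fixed forward pass, no per-input optimization.
    return np.maximum(0.0, W_enc @ y + b_enc)

def sparse_code(y, D, lam=0.1, n_iter=200):
    # Classical inference: solve min_x 0.5*||y - D x||^2 + lam*||x||_1
    # from scratch for this input, via ISTA (proximal gradient descent).
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        x = x - D.T @ (D @ x - y) / L        # gradient step on the quadratic term
        x = np.sign(x) * np.maximum(np.abs(x) - lam / L, 0.0)  # soft-threshold
    return x
```

The encoder's weights are frozen after training, so it can only reproduce the co-occurrence patterns it has seen; the iterative solver re-optimizes against the dictionary for every observation.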
This amortization introduces a systematic gap between the training distribution and OOD generalization. The fixed encoder learns to exploit statistical regularities in the training data, effectively memorizing common co-occurrence patterns of latent factors. When evaluated on novel compositional combinations, these patterns change, and the encoder fails to recover the correct sparse codes. The authors term this discrepancy the amortization gap, and they demonstrate that it persists across varying training set sizes, latent dimensions, and sparsity levels.
Crucially, the authors decompose this failure to identify its root cause. By replacing the SAE encoder with per-sample FISTA (Fast Iterative Shrinkage-Thresholding Algorithm) inference while keeping the SAE-learned dictionary fixed, they show that the gap remains substantial. This indicates that the dictionary learning procedure, not the inference mechanism, constitutes the binding constraint. The dictionaries learned by SAEs point in substantially wrong directions, such that even optimal per-sample inference cannot recover the true latent factors. To prove that the problem is solvable in principle, the authors implement an oracle baseline using the ground-truth dictionary with per-sample FISTA. This baseline achieves near-perfect OOD recovery at all tested scales, confirming that the geometric challenges of superposition are surmountable with a correct dictionary.
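The oracle baseline is easy to reproduce in miniature: given the true dictionary, per-sample FISTA recovers a random sparse code with small error even though the linear system is underdetermined. This is a self-contained sketch with made-up problem sizes, not the paper's experimental setup:

```python
import numpy as np

def soft(u, thr):
    return np.sign(u) * np.maximum(np.abs(u) - thr, 0.0)

def fista(y, D, lam=0.01, n_iter=500):
    # FISTA: accelerated proximal gradient for min_x 0.5*||y - D x||^2 + lam*||x||_1.
    L = np.linalg.norm(D, 2) ** 2        # step size 1/L from the Lipschitz constant
    x = np.zeros(D.shape[1])
    v, t = x.copy(), 1.0
    for _ in range(n_iter):
        x_new = soft(v - D.T @ (D @ v - y) / L, lam / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        v = x_new + ((t - 1.0) / t_new) * (x_new - x)   # momentum extrapolation
        x, t = x_new, t_new
    return x

rng = np.random.default_rng(0)
d_y, d_z, k = 50, 200, 5                     # superposition: d_z > d_y
D = rng.standard_normal((d_y, d_z))
D /= np.linalg.norm(D, axis=0)               # unit-norm dictionary atoms
z_true = np.zeros(d_z)
support = rng.choice(d_z, size=k, replace=False)
z_true[support] = rng.uniform(0.5, 1.5, size=k) * rng.choice([-1.0, 1.0], size=k)
y = D @ z_true                               # observed "activations"

z_hat = fista(y, D)
err = np.linalg.norm(z_hat - z_true) / np.linalg.norm(z_true)
print(err)  # relative error; near zero when recovery succeeds
```

Swapping the ground-truth D for an SAE-learned dictionary in this loop is exactly the decomposition the authors perform: if the error stays large with optimal per-sample inference, the dictionary is the binding constraint.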
Implications for Mechanistic Interpretability
These findings force a reconsideration of current practices in interpretability research. The field has increasingly relied upon SAEs as the primary tool for disentangling superposed representations, treating them as scalable approximations to sparse coding. The reality is more nuanced. SAEs are not merely approximating sparse coding; they are implementing a fundamentally different computational strategy that sacrifices compositional robustness for inference speed. This is not a bug that can be patched with more data or architectural tweaks, but a structural limitation inherent to amortized inference under distribution shift.
The work suggests that our current interpretability tools may be capturing statistical artifacts rather than true causal latent factors. If SAE features fail to generalize compositionally, then interventions based on these features, such as activation steering or concept ablation, risk producing unreliable or misleading results when applied to novel inputs. The safety implications are significant. As we attempt to align increasingly capable systems, we cannot rely on interpretability methods that conflate training distribution correlations with invariant causal structure.
Looking forward, the research points toward scalable dictionary learning as the critical open problem. Rather than refining amortized encoders, we should invest in algorithms that can learn high-quality dictionaries capable of supporting accurate per-sample inference. Potential directions include hybrid approaches that use amortized inference for initialization followed by iterative refinement, or meta-learning frameworks that explicitly train dictionaries to be compatible with iterative sparse recovery. Additionally, the connection between superposition geometry and the quadratic dimensional requirements of linear probes suggests that we should abandon linear methods for any task requiring compositional generalization, regardless of their in-distribution performance.
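The hybrid direction above can be sketched as: use the amortized encoder only to warm-start an iterative solver, so the final code is still the solution of a per-sample optimization. Everything here is a hypothetical illustration of that idea, not a method proposed in the paper:

```python
import numpy as np

def soft(u, thr):
    return np.sign(u) * np.maximum(np.abs(u) - thr, 0.0)

def hybrid_infer(y, D, W_enc, b_enc, lam=0.05, n_iter=50):
    # Amortized pass gives a cheap initial guess...
    x = np.maximum(0.0, W_enc @ y + b_enc)
    # ...then per-sample ISTA refines it against the dictionary D,
    # so OOD inputs are not locked to memorized co-occurrence patterns.
    L = np.linalg.norm(D, 2) ** 2
    for _ in range(n_iter):
        x = soft(x - D.T @ (D @ x - y) / L, lam / L)
    return x
```

A good warm start mostly buys fewer refinement iterations; the compositional robustness still hinges on the quality of D, which is why dictionary learning remains the open problem.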
Conclusion
The analysis presented in Stop Probing, Start Coding reframes the challenge of neural network interpretability from one of representation extraction to one of geometric inference. Under superposition, the linear representation hypothesis holds, yet linear accessibility fails. Sparse autoencoders attempt to bridge this gap through nonlinear amortized inference, but their dictionaries are not sufficiently aligned with the true latent structure to support compositional generalization.
The path forward requires us to stop probing with linear classifiers and start coding with principled sparse inference. We must develop dictionary learning algorithms that scale to the massive latent spaces of modern language models while maintaining the geometric fidelity required for OOD recovery. Until we solve this dictionary learning problem, our understanding of neural networks will remain bound to the training distribution, unable to generalize to the novel compositional structures that characterize intelligent reasoning. The question is no longer whether concepts are linearly represented, but whether we can learn the dictionaries necessary to decode them.