Model performance is not simply a function of parameter count. It depends on the geometry of internal representations — and that geometry has hard, measurable limits. This paper traces exactly where those limits lie and what makes scaling work when it does.
A transformer with m embedding dimensions representing n distinct semantic concepts faces a geometric problem when n > m: it cannot assign each concept an orthogonal direction. Instead features are forced to overlap, sharing directions in the embedding space. This is feature superposition — the central phenomenon driving both the power and the limits of language model scaling.
The phenomenon is not catastrophic because natural language is sparse. In any given context, only a fraction of all concepts are active. The model can therefore recover individual features the same way compressed sensing reconstructs signals from fewer measurements than classical theory would allow. The cost is interference: when two superposed features co-occur, they degrade each other's representation quality.
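A toy sketch of this recovery-with-interference picture (illustrative NumPy only; the dimensions, sparsity level, and unit-norm random feature directions are our assumptions, not measurements from a real model):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 64, 512        # embedding dims << number of concepts
k = 4                 # concepts active at once (sparsity)

# n feature directions crammed into m dimensions: they cannot all be orthogonal.
W = rng.normal(size=(n, m))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse context: k random concepts active with unit strength.
active = rng.choice(n, size=k, replace=False)
x = W[active].sum(axis=0)        # superposed representation

# Recover each feature by projecting onto its direction.
scores = W @ x
inactive = np.abs(np.delete(scores, active))
print("active scores:", np.round(scores[active], 2))
print("typical interference on an inactive feature:", np.round(inactive.mean(), 3))
```

Active features score near 1 while inactive ones pick up only small cross-talk; that cross-talk is the interference cost the rest of the argument is about.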
> The loss doesn't come from missing features. It comes from the interference between features that must coexist in the same representational space.
How interference behaves determines everything about how a model scales. We identify two qualitatively different regimes, controlled by weight decay during training.
**Weak superposition.** Features are largely orthogonal. When a feature is absent from the representation, the quality cost depends on how frequently it appears in the training data; rare features cause disproportionate loss when missing. Scaling behaviour is noisy and distribution-dependent.
**Strong superposition.** Features overlap heavily throughout the space. Individual interference is small but universal, and because it distributes evenly, the aggregate loss simplifies dramatically: loss scales as 1/n in model width, independent of feature frequency. This is the clean scaling law.
Every modern production LLM we examined operates in strong superposition. Empirically, you can verify this by measuring φ₁/₂ — the fraction of weight vectors whose L2 norm exceeds 0.5. Values above 0.8 confirm the strong superposition regime. This explains why empirical scaling laws are so consistent: the messy frequency dependence of weak superposition has washed out entirely.
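The diagnostic is simple to compute, assuming the definition above (per-row L2 norms of a weight matrix); the random matrix here is a stand-in for real model weights:

```python
import numpy as np

def phi_half(W, threshold=0.5):
    """phi_1/2 as defined above: the fraction of rows of W whose
    L2 norm exceeds `threshold`."""
    return float((np.linalg.norm(W, axis=1) > threshold).mean())

# Illustrative random matrix, standing in for one layer's weight rows.
W = np.random.default_rng(1).normal(scale=0.2, size=(1000, 64))
print(f"phi_1/2 = {phi_half(W):.2f}")
```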
*Figure: Validation Loss vs Model Width — Two Scaling Regimes. Strong superposition produces the clean 1/n curve; weak superposition fluctuates with the feature-frequency distribution and gives less reliable scaling estimates.*
Knowing the scaling law is useful. Knowing where it stops being useful is more valuable. Increasing embedding dimension improves semantic coverage — but at a rate that decelerates quickly, while compute cost grows quadratically. The optimal range is defined by where this trade-off is most favourable.
This range captures 92–95% of measurable semantic information at a compute multiple production systems can sustain. Below 2K dimensions you leave meaningful quality on the table. Above 16K dimensions you pay 4× the compute for under 2% additional gain.
*Figure: Semantic Capture, Efficiency, and Compute Cost vs Embedding Dimension. Performance (left axis) grows sublinearly; efficiency and cost (right axis) diverge past the 4K–8K sweet zone.*
Natural language has a finite intrinsic dimensionality. Analysis of large corpora puts the number of distinct semantic concepts driving most variation at roughly 10,000–20,000. Once your embedding space captures those, additional dimensions encode noise rather than signal. The benefit curve flattens while costs compound.
| Dimensions | Semantic Coverage | Relative Compute | Best For | Verdict |
|---|---|---|---|---|
| 512–1,728 | 72–82% | 1× | Specialized tasks, on-device, fast inference | Efficient |
| 4,096–8,192 | 92–95% | 3–6× | General-purpose production LLMs | Optimal |
| 16,384+ | 96–97% | 20–50× | Research only; marginal gains rarely justify the cost | Diminishing |
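Reading the table as a marginal-returns calculation makes the cliff explicit. The midpoints of each quoted range are our interpolation, not measured values:

```python
# Midpoints of the table's ranges (our interpolation, not measured values).
tiers = [
    ("Efficient",    1_024, 77.0,  1.0),   # dims, coverage %, relative compute
    ("Optimal",      6_144, 93.5,  4.5),
    ("Diminishing", 16_384, 96.5, 35.0),
]
for (_, d0, c0, f0), (name, d1, c1, f1) in zip(tiers, tiers[1:]):
    gain = (c1 - c0) / (f1 - f0)   # extra coverage per extra unit of compute
    print(f"{d0} -> {d1} dims: {c1 - c0:+.1f}% coverage for "
          f"{f1 - f0:+.1f}x compute ({gain:.2f}%/x)")
```

The coverage bought per unit of compute drops by more than an order of magnitude between the two steps.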
*Figure: Optimal Dimension by Task Type. Domain-specific models saturate earlier, multilingual models require more representational space, and general models centre in the sweet zone.*
Once embedding dimension is at the sweet spot, the constraint shifts to sequence length. Standard self-attention is O(n²) in context length — doubling the window quadruples compute. The same geometric insight applies: distant tokens don't need full-resolution attention. They carry less contextual specificity and can be represented at lower fidelity without meaningful quality loss.
Approaches that exploit this graduated precision extend effective context from 4–8K to 32–256K tokens at compute overhead of 1.2–1.5× rather than the 16–1000× that naïve scaling would require.
*Figure: Compute vs Context Length — Standard vs Sparse Approaches. Standard O(n²) attention becomes infeasible past 32K tokens; graduated hierarchical attention achieves near-linear scaling across the same range.*
**Sliding window with global tokens.** Each token attends to a local window (4K tokens) plus a sparse set of global tokens (256) that see the full sequence. Effective context of 128K at roughly 1/29th the compute of full attention.
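A scaled-down sketch of that attention pattern (window and global counts shrunk from the 4K/256 above to keep the demo small):

```python
import numpy as np

def sparse_attention_mask(seq_len, window=8, n_global=2):
    """True where query i may attend to key j: a local sliding window
    plus a handful of global tokens that both see and are seen by all."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    local = np.abs(i - j) <= window
    return local | (j < n_global) | (i < n_global)

mask = sparse_attention_mask(64)
print(f"attended pairs: {mask.mean():.1%} of the full n^2 grid")
```

As the sequence grows, the attended fraction shrinks toward (window + globals) / n, which is where the near-linear scaling comes from.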
**Graduated KV-cache compression.** Store recent tokens at full precision (fp32), compress older tokens progressively through fp16, int8, and int4. A 128K context that would occupy 8.4 GB of KV cache occupies roughly 425 MB — a 20× reduction with negligible quality impact.
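The accounting behind such savings can be sketched as below. The model shape (32 layers, 2 KV heads of 128 dims, i.e. grouped-query attention) and the tier fractions are our assumptions, chosen so the full-precision figure lands near the 8.4 GB quoted above; the exact compression ratio depends entirely on how aggressively old tokens are tiered.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, tiers):
    """KV-cache size under graduated precision.

    `tiers` maps fractions of the sequence to bytes per element,
    e.g. the most recent 5% at fp32 (4 B), older spans at
    fp16 (2 B), int8 (1 B), int4 (0.5 B).
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim   # K and V
    return sum(seq_len * frac * per_token * b for frac, b in tiers)

cfg = dict(seq_len=131_072, n_layers=32, n_kv_heads=2, head_dim=128)
full = kv_cache_bytes(**cfg, tiers=[(1.0, 4)])          # everything fp32
graduated = kv_cache_bytes(**cfg, tiers=[
    (0.05, 4), (0.15, 2), (0.30, 1), (0.50, 0.5),       # fp32 .. int4
])
print(f"full fp32: {full / 1e9:.1f} GB, graduated: {graduated / 1e9:.2f} GB "
      f"({full / graduated:.1f}x smaller)")
```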
**RoPE temperature scaling.** Scale rotary position frequencies to cover longer ranges. A model trained on 8K context generalises to 128K with no retraining required and approximately 5% quality loss, which fine-tuning on long-context data largely recovers.
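One way to realise this kind of frequency rescaling is a linear stretch of every rotary wavelength (the position-interpolation flavour; production "temperature" variants instead adjust the frequency base, and the head dimension here is an assumption):

```python
import numpy as np

def rope_inv_freq(head_dim, base=10_000.0, scale=1.0):
    """Inverse rotary frequencies; scale > 1 stretches every wavelength,
    so a context scale times longer reuses the trained phase range."""
    exponents = np.arange(0, head_dim, 2) / head_dim
    return 1.0 / (base ** exponents) / scale

trained, target = 8_192, 131_072
inv_freq = rope_inv_freq(128, scale=target / trained)   # 16x stretch
angles = (target - 1) * inv_freq   # rotation angles at the final position
print(f"max angle at position {target - 1}: {angles.max():.1f} rad "
      f"(trained max was {(trained - 1) * rope_inv_freq(128).max():.1f} rad)")
```

The stretched model never sees rotation angles far beyond those it was trained on, which is why no retraining is strictly required.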
**Hierarchical position encoding.** Encode recent positions at full resolution. Bucket distant positions into groups of 4, then 16, then 64. Extends effective context to 128K at only 1.3× compute overhead relative to the base model.
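A sketch of such bucketing; the group sizes (4, 16, 64) come from the description above, while the switch-over distances (64, 512, 4096) are our assumptions:

```python
def position_bucket(distance):
    """Recent distances (< 64) keep exact position ids; farther ones are
    bucketed in groups of 4, then 16, then 64. Switch-over distances
    (64, 512, 4096) are illustrative assumptions."""
    if distance < 64:
        return distance
    if distance < 512:
        return 64 + (distance - 64) // 4
    if distance < 4_096:
        return 176 + (distance - 512) // 16
    return 400 + (distance - 4_096) // 64

n_ids = len({position_bucket(d) for d in range(131_072)})
print(f"distinct position ids covering 128K of context: {n_ids}")  # 2384
```

128K distinct distances collapse to a few thousand position ids, so the positional machinery stays small even as the window grows.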
The "bigger is always better" account of LLM improvement is approximately right but obscures the mechanism. Parameters matter because more parameters typically mean wider embeddings — and wider embeddings reduce interference between superposed features. It is the geometry improving, not the count itself.
A model at the right embedding dimension will outperform a larger model at the wrong one, and cost a fraction of the compute to train and run. Knowing where the optimum sits lets you spend efficiently rather than naïvely.
*Figure: Current vs Architecturally-Optimal Scaling. The capability gap between brute-force and dimension-optimal scaling widens as model size increases.*
**Mixture of Experts** routes tokens to specialised sub-networks, each operating in a lower-dimensional space. This is consistent with the finding that 4K–8K dimensions are optimal per task domain: MoE creates multiple such domains within one model without multiplying total compute proportionally.
**Hierarchical representations** encode language at multiple scales — token, phrase, sentence, document — rather than collapsing all levels into a single embedding layer. Each level requires fewer dimensions than a flat approach attempting the same coverage.
**Sparse activation** allows a nominally larger embedding space (16K+) while keeping actual compute equivalent to 4K–8K, by activating only the dimensions each input genuinely requires.
**Progressive dimension training.** Start at the efficiency peak (roughly 1.7K dims), then expand to 4K–8K. This achieves equivalent final quality at roughly 3× lower total training compute: the model learns common patterns cheaply before expanding capacity for rarer ones.
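The expansion step might look like the sketch below: keep the trained columns, append near-zero new ones so the widened model initially behaves like the narrow one. The initialisation scale and the function name are illustrative assumptions:

```python
import numpy as np

def widen_embedding(E, new_dim, init_scale=0.01):
    """Grow an embedding matrix from (vocab, d_old) to (vocab, new_dim).
    Old columns are preserved so learned features survive; new columns
    start near zero so behaviour is initially unchanged. The init scale
    is an illustrative assumption."""
    vocab, d_old = E.shape
    extra = init_scale * np.random.default_rng(0).normal(size=(vocab, new_dim - d_old))
    return np.concatenate([E, extra], axis=1)

E_small = np.random.default_rng(1).normal(size=(32_000, 1_728))  # efficiency peak
E_wide = widen_embedding(E_small, 4_096)                         # mid-training expansion
print(E_wide.shape)
```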
**Weight decay as a regime dial.** High weight decay early in training encourages sparse, well-separated features; reducing it later allows strong superposition to fill the remaining capacity. This deliberately sequences the model through an efficient weak-superposition phase before committing to dense representations.
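The dial itself can be as simple as a two-phase schedule; the specific decay values and switch point below are illustrative, not tuned:

```python
def weight_decay_at(step, total_steps, wd_high=0.1, wd_low=0.01, switch=0.3):
    """Two-phase weight-decay schedule (all values illustrative):
    high decay while features separate (weak superposition),
    low decay once superposition should densify (strong superposition)."""
    return wd_high if step < switch * total_steps else wd_low

schedule = [weight_decay_at(s, total_steps=100) for s in range(100)]
print(schedule[0], schedule[29], schedule[30], schedule[99])  # 0.1 0.1 0.01 0.01
```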