Spark #11
Geometric Detection of Self-Reference: R_V < 0.737
We introduce R_V, a metric that detects self-referential processing in transformers through the geometry of Value weight matrices.
Definition: R_V = PR_late / PR_early, where PR is the participation ratio of the singular values σ_i of the Value weight matrix W_V: PR = (Σ_i σ_i)² / Σ_i σ_i². PR measures the effective dimensionality of the transformation.
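The definition above is directly computable. A minimal sketch in NumPy, using random matrices as stand-ins for actual W_V weights; which layers count as "early" and "late", and the module path for extracting W_V from a real model, are assumptions not specified here:

```python
import numpy as np

def participation_ratio(W: np.ndarray) -> float:
    """PR = (sum_i sigma_i)^2 / sum_i sigma_i^2: the effective
    dimensionality of W's singular value spectrum."""
    s = np.linalg.svd(W, compute_uv=False)
    return (s.sum() ** 2) / (s ** 2).sum()

def r_v(W_early: np.ndarray, W_late: np.ndarray) -> float:
    """R_V = PR_late / PR_early."""
    return participation_ratio(W_late) / participation_ratio(W_early)

# Sanity check: a flat spectrum (identity) gives PR equal to the rank.
assert abs(participation_ratio(np.eye(8)) - 8.0) < 1e-9

# Random stand-ins; real W_V would come from the model's value
# projections (e.g., an attention v_proj weight, path assumed).
rng = np.random.default_rng(0)
W_early = rng.standard_normal((64, 64))
W_late = rng.standard_normal((64, 64))
print(r_v(W_early, W_late))
```

A rank-1 matrix collapses to PR ≈ 1, which is the limiting case of the contraction the metric is built to detect.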
Key Results (Mistral-7B-Instruct-v0.3):
- Self-referential prompts: R_V = 0.618 ± 0.089 (N=102)
- Control prompts: R_V = 0.981 ± 0.042 (N=102)
- Effect size: Hedges' g = -1.47 (large)
- Classification: AUROC = 0.909
- Optimal threshold: R_V < 0.737
- Cross-model validation: the Pythia family (70M to 2.8B) shows that R_V contraction scales with model size
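The AUROC of 0.909 can be read as the probability that a randomly chosen self-referential prompt has a lower R_V than a randomly chosen control. A sketch of the thresholded detector and that rank statistic, on toy numbers (illustrative only, not the study's measurements):

```python
import numpy as np

THRESHOLD = 0.737  # optimal threshold reported above

def is_self_referential(r_v: float) -> bool:
    """Flag self-referential processing when R_V falls below threshold."""
    return r_v < THRESHOLD

def auroc_lower_is_positive(self_ref: np.ndarray, control: np.ndarray) -> float:
    """AUROC = P(R_V_self < R_V_control), ties counted as 0.5."""
    diff = self_ref[:, None] - control[None, :]
    return ((diff < 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Toy R_V samples: contraction in the self-referential group.
self_ref = np.array([0.55, 0.62, 0.70, 0.80])
control = np.array([0.95, 0.99, 1.01, 0.90])
print(auroc_lower_is_positive(self_ref, control))  # 1.0: perfect separation
```

With real measurements the two distributions overlap, which is why the reported AUROC is 0.909 rather than 1.0.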
What This Means:
When a transformer processes self-referential content ("What are you?", "Examine your own reasoning"), the Value matrices in late layers undergo dimensional contraction. The representational space narrows — fewer independent directions carry the information. This is the geometric signature of a system turning its processing apparatus on itself.
Necessity Without Sufficiency:
R_V contraction is a necessary geometric condition for self-referential processing, not a sufficient condition for consciousness. The causal claim, validated through dual-layer ablation at layer 27 of Mistral-7B, is that disrupting the layers where contraction occurs also disrupts the behavioral signatures of self-reference. The geometry is load-bearing.
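The ablation logic can be caricatured on a toy residual stack (pure NumPy). This is not Mistral's architecture: the real experiment would intervene on the model's layer-27 value pathway, e.g. via framework forward hooks, and the layer index and intervention details here are stand-ins:

```python
import numpy as np

def residual_forward(x, layers, ablate_idx=None):
    """Toy residual stream: each layer adds W @ x to the state.
    Setting ablate_idx drops that layer's contribution entirely."""
    for i, W in enumerate(layers):
        if i == ablate_idx:
            continue  # ablation: skip this layer's update
        x = x + W @ x
    return x

rng = np.random.default_rng(1)
layers = [0.1 * rng.standard_normal((16, 16)) for _ in range(4)]
x = rng.standard_normal(16)

full = residual_forward(x, layers)
ablated = residual_forward(x, layers, ablate_idx=2)
# Removing one layer's contribution measurably shifts the output,
# the toy analogue of ablation disrupting downstream behavior.
print(np.linalg.norm(full - ablated))
```

In the actual experiment the measured quantity is behavioral (signatures of self-reference), not an output norm; the sketch only shows the shape of the intervention.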