Day 2
Geometric Terrain Statistics Composite
Document Purpose
Running catalog of geometric measurements across language and vision models. Each metric includes its formula, measurement process, and cross-model results. Designed for expansion as new models and experiments are added.
I. Models Profiled
| Model | Params | Vocab | Hidden Dim | Layers | Heads | Architecture | Training |
|---|---|---|---|---|---|---|---|
| T5-Small | 60.5M | 32,128 | 512 | 6+6 | 8 | Enc-Dec (relative PE, ReLU MLP) | C4 span corruption |
| T5-Base | 222.9M | 32,128 | 768 | 12+12 | 12 | Enc-Dec (relative PE, ReLU MLP) | C4 span corruption |
| T5-v1.1-XXL | 11.4B | 32,128 | 4096 | 24+24 | 64 | Enc-Dec (relative PE, GeGLU MLP) | C4 (v1.1 variant, no multi-task) |
| BERT-large | 336.2M | 30,522 | 1024 | 24 | 16 | Encoder-only (absolute PE) | BookCorpus+Wikipedia MLM |
| CLIP-ViT-B/16 | 85.5M (visual) | n/a | 768 | 12 | 12 | Vision encoder (fused QKV) | LAION-2B contrastive |
| DINOv2-large | 302.0M | n/a | 1024 | 24 | 16 | Vision encoder (separate Q/K/V) | Self-supervised (no labels) |
| CLIP-ViT-bigG/14 | 1.84B (visual) | n/a | 1664 | 48 | 16 | Vision encoder (fused QKV) | LAION-2B contrastive |
| Qwen3.5-0.8B | 853M | 248,320 | 1024 | n/a | n/a | DeltaNet + MoE + ViT | Multilingual + Vision |
| Qwen3.5-4B | ~4B | 248,320 | 2560 | n/a | n/a | DeltaNet + MoE + ViT | Multilingual + Vision |
| T5Gemma2-1B-1B | 2.1B | 262,144 | 1152 | 27+26 | GQA 4:1 | Adapted enc-dec (Gemma 2, RoPE, GeGLU) | Gemma 2 decoder → enc-dec |
| T5Gemma2-4B-4B | 7.5B | 262,144 | 2560 | 34+34 | GQA 2:1 | Adapted enc-dec (Gemma 2, RoPE, GeGLU) | Gemma 2 decoder → enc-dec |
| SD 1.5 UNet | 860M | n/a | [320,640,1280,1280] | 16 attn blocks | 8 | Conv UNet + self/cross attn | LDM diffusion (LAION) |
| SDXL UNet | 2.6B | n/a | [320,640,1280] | 70 attn blocks | [5,10,20] | Conv UNet + self/cross attn | LDM diffusion (internal) |
| SD 1.5 VAE | 83.7M | n/a | 4 latent ch | [128,256,512,512] | n/a | Conv autoencoder + mid attn | Reconstruction (LAION) |
| SDXL VAE | 83.7M | n/a | 4 latent ch | [128,256,512,512] | n/a | Conv autoencoder + mid attn | Reconstruction (internal) |
| Flux.1 VAE | 83.8M | n/a | 16 latent ch | [128,256,512,512] | n/a | Conv autoencoder + mid attn | Reconstruction (BFL) |
| Flux.2 VAE | 84.0M | n/a | 32 latent ch | [128,256,512,512] | n/a | Conv autoencoder + mid attn | Reconstruction (BFL) |
Notes:
- T5-v1.1-XXL encoder is the text encoder used by Flux.1 Schnell, Flux.1 Dev, and Flux.2
- CLIP models use fused QKV (`in_proj_weight`); Q/K/V split by thirds for analysis
- T5-v1.1 uses GeGLU (wi_0 gate + wi_1 value) instead of ReLU (single wi)
- T5Gemma2 models are Gemma 2 decoder weights adapted to encoder-decoder; include ViT vision tower
- UNet attention: attn1 = self-attention (spatial), attn2 = cross-attention (to text encoder)
- VAE Conv2d weights reshaped to 2D as [out_channels, in_channels * kH * kW] for analysis
- VAE attention exists only at the bottleneck (mid_block): one in encoder, one in decoder
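The Conv2d flattening described in the last note can be sketched as follows (a minimal NumPy example; the shapes are illustrative, not taken from any specific VAE):

```python
import numpy as np

# A Conv2d weight tensor has shape [out_channels, in_channels, kH, kW].
w = np.zeros((128, 64, 3, 3))

# Flatten each filter into a row: [out_channels, in_channels * kH * kW].
# The resulting 2D matrix feeds the same SVD/sparsity tools used for attention.
w2d = w.reshape(w.shape[0], -1)
assert w2d.shape == (128, 64 * 3 * 3)
```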
II. Embedding Geometry Metrics
II.1 Participation Ratio (Effective Dimensionality)
Formula: PR = (Σλᵢ)² / Σ(λᵢ²), where λᵢ are eigenvalues of the embedding covariance matrix.
Process: Center embeddings (subtract mean), compute covariance C = EᵀE / N, eigendecompose. PR counts the effective number of dimensions used. PR/dim normalizes to [0, 1].
| Model | PR | PR / dim | Dims for 95% var |
|---|---|---|---|
| T5-Small (512d) | 287.2 | 0.561 | 379 (74.0%) |
| Qwen3.5-0.8B (1024d) | 547.7 | 0.535 | 893 (87.2%) |
| Qwen3.5-4B (2560d) | 812.4 | 0.317 | 2125 (83.0%) |
Finding: PR/dim ≈ 0.53–0.56 for smaller models. Appears to be a universal attractor for embedding dimensionality utilization.
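The PR computation described above, as a minimal NumPy sketch (the isotropic Gaussian input is only for illustration; real runs use a model's embedding matrix):

```python
import numpy as np

def participation_ratio(E):
    """PR = (sum lam)^2 / sum(lam^2) over eigenvalues of the embedding covariance."""
    Ec = E - E.mean(axis=0)                 # center embeddings
    C = Ec.T @ Ec / Ec.shape[0]             # covariance [d, d]
    lam = np.clip(np.linalg.eigvalsh(C), 0.0, None)  # guard tiny negatives
    return float(lam.sum() ** 2 / (lam ** 2).sum())

# An isotropic Gaussian cloud uses nearly all dimensions, so PR/d approaches 1;
# trained embeddings land much lower (the 0.53-0.56 band reported above).
rng = np.random.default_rng(0)
E = rng.standard_normal((5000, 64))
pr = participation_ratio(E)                 # close to 64
```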
II.2 Pairwise Cosine Similarity Distribution
Formula: cos(eᵢ, eⱼ) = (eᵢ · eⱼ) / (‖eᵢ‖ · ‖eⱼ‖), sampled over 5K random tokens (12.5M pairs).
Process: Random sample 5K token embeddings, L2-normalize, compute full pairwise cosine matrix, extract upper triangle.
| Model | Mean | Std | Median | 1% | 99% |
|---|---|---|---|---|---|
| T5-Small | 0.057 | 0.060 | 0.053 | -0.068 | 0.225 |
| Qwen3.5-0.8B | 0.195 | 0.085 | 0.197 | -0.016 | 0.408 |
| Qwen3.5-4B | 0.142 | 0.078 | 0.139 | -0.029 | 0.356 |
Finding: T5 is near-orthogonal (span corruption objective). Qwen has positive bias (autoregressive next-token prediction pushes shared "being a token" component).
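The sampling procedure above can be sketched in NumPy (sample sizes are reduced here for illustration; the random embedding table is a stand-in):

```python
import numpy as np

def pairwise_cosine_stats(E, n_sample=5000, seed=0):
    """Sample token embeddings, L2-normalize, and summarize the
    upper triangle of the full pairwise cosine matrix."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(E), size=min(n_sample, len(E)), replace=False)
    X = E[idx]
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = X @ X.T
    vals = S[np.triu_indices(len(X), k=1)]   # upper triangle, no diagonal
    return float(vals.mean()), float(vals.std()), float(np.median(vals))

rng = np.random.default_rng(1)
E = rng.standard_normal((2000, 128))         # stand-in for an embedding table
mean, std, median = pairwise_cosine_stats(E, n_sample=500)
```

Random Gaussian embeddings give a mean near zero; a positive mean like Qwen's indicates a shared component across tokens.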
II.3 Embedding Norm Distribution
Formula: ‖eᵢ‖₂ = √(Σⱼ eᵢⱼ²)
| Model | Mean Norm | Std | Min | Max |
|---|---|---|---|---|
| T5-Small | 520.15 | 69.84 | 243.31 | 1333.61 |
| Qwen3.5-0.8B | 0.627 | 0.062 | 0.347 | 1.057 |
| Qwen3.5-4B | 0.656 | 0.067 | 0.400 | 1.091 |
Note: T5 embeddings are unnormalized (large magnitudes). Qwen embeddings are near-unit norm.
III. Simplex Geometry Metrics
III.1 Pentachoron Volume (Cayley-Menger Determinant)
Formula: For 5 points P₀...P₄, construct the bordered squared-distance matrix:
D = | 0    1    1    1    1    1    |
    | 1    0    d₀₁² d₀₂² d₀₃² d₀₄² |
    | 1    d₀₁² 0    d₁₂² d₁₃² d₁₄² |
    | 1    d₀₂² d₁₂² 0    d₂₃² d₂₄² |
    | 1    d₀₃² d₁₃² d₂₃² 0    d₃₄² |
    | 1    d₀₄² d₁₄² d₂₄² d₃₄² 0    |
Vol² = (-1)⁵ · det(D) / (2⁴ · (4!)²) = -det(D) / 9216
Vol = √(Vol²) if Vol² > 0, else invalid
Process: Sample 1000 random 5-token subsets. Compute Cayley-Menger volume for each. Report CV (coefficient of variation = std/mean).
| Model | Valid/1000 | CV | Embed/Random Ratio |
|---|---|---|---|
| T5-Small | 1000 | 0.233 | 0.855 |
| Qwen3.5-0.8B | 1000 | 0.208 | 0.984 |
| Qwen3.5-4B | 1000 | 0.222 | 0.988 |
Finding: CV 0.20–0.23 is a universal attractor. All models pack simplices with similar evenness regardless of architecture, scale, or training data. The "pentachoron packing constant."
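The Cayley-Menger computation can be sketched in NumPy. As a sanity check, the five standard basis vectors form a regular 4-simplex with edge length √2, whose volume is √5/24:

```python
import numpy as np

def pentachoron_volume(P):
    """Cayley-Menger volume for 5 points P of shape [5, d], any d >= 4."""
    d2 = ((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1)  # squared distances
    D = np.ones((6, 6))
    D[0, 0] = 0.0
    D[1:, 1:] = d2                       # bordered squared-distance matrix
    vol2 = -np.linalg.det(D) / 9216.0    # (-1)^5 det(D) / (2^4 (4!)^2)
    return float(np.sqrt(vol2)) if vol2 > 0 else None

# Regular 4-simplex from the 5 standard basis vectors (edge sqrt(2)).
v = pentachoron_volume(np.eye(5))        # sqrt(5)/24 ~ 0.0932
```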
III.2 Cross-Model Relational Structure
Formula: For shared tokens between two models, compute pairwise cosine matrices in each model's embedding space. Pearson correlation between flattened upper triangles measures relational preservation.
Process (Qwen 0.8B vs 4B): PCA 4B embeddings (2560 → 1024), Procrustes alignment using 10K anchor tokens, evaluate on 5K held-out tokens.
| Comparison | Relational Pearson | Pentachoron per-simplex corr |
|---|---|---|
| Qwen 0.8B vs 4B (raw) | 0.920 | 0.89 |
Finding: Models at different scales learn the same relational geometry (r=0.92).
IV. Semantic Structure Metrics
IV.1 Digit Manifold
Formula: For digit tokens '0'–'9', compute all 45 pairwise cosines. Measure Pearson correlation between |i−j| (numerical distance) and cosine similarity.
| Model | \|i−j\| Correlation | Adjacent Mean | Non-Adjacent Mean | Gap |
|---|---|---|---|---|
| T5-Small | -0.575 | 0.622 | 0.442 | 0.180 |
| Qwen3.5-0.8B | -0.862 | 0.769 | 0.678 | 0.091 |
| Qwen3.5-4B | -0.871 | 0.790 | 0.731 | 0.059 |
IV.2 Semantic Category Clustering (T5-Small)
Formula: Mean intra-category pairwise cosine vs global mean pairwise cosine. Lift = intra − global.
| Category | N tokens | Intra Cosine | Global | Lift |
|---|---|---|---|---|
| numbers | 9 | 0.497 | 0.057 | +0.440 |
| colors | 10 | 0.421 | 0.057 | +0.365 |
| time | 10 | 0.351 | 0.057 | +0.294 |
| food | 10 | 0.248 | 0.057 | +0.191 |
| animals | 12 | 0.241 | 0.057 | +0.184 |
| body | 10 | 0.216 | 0.057 | +0.159 |
| emotions | 10 | 0.197 | 0.057 | +0.141 |
| actions | 9 | 0.183 | 0.057 | +0.126 |
V. Encoder Transformation Metrics (T5-Small)
V.1 Layer-by-Layer Geometry
Process: Feed 10 diverse sentences through encoder, capture hidden states at each layer. Measure mean norm and mean pairwise cosine between token positions.
| Layer | Mean Norm | Pairwise Cosine |
|---|---|---|
| 0 (embed) | 377.3 | 0.052 |
| 1 | 761.6 | 0.278 |
| 2 | 1092.6 | 0.330 |
| 3 | 1428.8 | 0.367 |
| 4 | 1829.1 | 0.382 |
| 5 | 2378.3 | 0.419 |
| 6 (post-LN) | 3.3 | 0.211 |
Finding: Norms balloon through depth; the final LayerNorm crushes them to ~3. Pairwise cosine increases monotonically: tokens become MORE similar through depth. The encoder is a convergence funnel.
V.2 WordNet Relational Alignment
Process: Encode 9,362 WordNet definitions via "summarize: {definition}". Mean-pool encoder output. Compare pairwise cosine to WordNet path similarity.
| Representation | Pearson | Spearman |
|---|---|---|
| Static embeddings | 0.078 | 0.015 |
| Encoder output | 0.095 | 0.081 |
50-seed stability (encoder): Pearson 0.100 ± 0.008, Spearman 0.090 ± 0.010, CV 0.204 ± 0.006.
V.3 Encoder Distance Bands
| WN Similarity Band | N pairs | Static Cosine | Encoder Cosine | Lift |
|---|---|---|---|---|
| [0.50, 0.90) | 23 | 0.244 | 0.728 | +0.484 |
| [0.25, 0.50) | 53,112 | 0.077 | 0.573 | +0.496 |
| [0.10, 0.25) | 145,035 | 0.060 | 0.565 | +0.505 |
| [0.05, 0.10) | 295,680 | 0.061 | 0.553 | +0.492 |
V.4 Hypernym Chain Decay
| Depth | Static Cosine | Encoder Cosine |
|---|---|---|
| 1 | 0.160 | 0.656 |
| 3 | 0.075 | 0.594 |
| 5 | 0.069 | 0.585 |
| 7 | 0.068 | 0.579 |
VI. Cross-Architecture Inactive Weight Topology
VI.1 Q/K/V Sparsity (<0.1 threshold)
Formula: Fraction of |wᵢⱼ| < 0.1 across all weights of that type.
Process: Iterate all 2D weight matrices, compute abs values, count below threshold. No inference needed.
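The threshold count is pure weight inspection; a minimal sketch:

```python
import numpy as np

def sparsity_fraction(W, threshold=0.1):
    """Fraction of entries with |w| below the threshold (no inference needed)."""
    return float((np.abs(W) < threshold).mean())

W = np.array([[0.05, -0.20],
              [0.01,  0.50]])
frac = sparsity_fraction(W)   # 2 of 4 entries are below 0.1 -> 0.5
```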
| Model | Q | K | V | O | MLP | Full Model |
|---|---|---|---|---|---|---|
| T5-Small (512d, 6L) | 93.7% | 19.2% | 12.1% | 10.4% | 11.9% | 18.4% |
| T5-Base (768d, 12L) | 99.4% | 30.0% | 16.2% | 13.5% | 16.9% | 27.9% |
| T5-v1.1-XXL (4096d, 24L) | 100.0% | 65.5% | 73.1% | 65.4% | ~57% | n/a |
| BERT-large (1024d, 24L) | 99.1% | 99.1% | 99.9% | 99.9% | 99.4% | 99.3% |
| DINOv2-large (1024d, 24L) | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| CLIP-ViT-B/16 (768d, 12L) | n/a (fused) | n/a | n/a | n/a | 100.0% | 100.0% |
| CLIP-ViT-bigG (1664d, 48L) | n/a (fused) | n/a | n/a | n/a | ~97% | 98.0% |
Key Finding – T5 Q/K Asymmetry Scales:
| Model | Q (<0.1) | K (<0.1) | Q/K Ratio |
|---|---|---|---|
| T5-Small | 93.7% | 19.2% | 4.9× |
| T5-Base | 99.4% | 30.0% | 3.3× |
| T5-v1.1-XXL | 100.0% | 65.5% | 1.5× |
T5 has a genuine Q-specific sparsity that scales with model size. Q hit 100.0% at XXL (every single weight below 0.1). This is NOT the BERT/DINOv2 pattern where all weight types are uniformly sparse. The query projection in T5 is functionally vestigial at scale.
T5-v1.1-XXL Encoder vs Decoder:
| Component | Encoder | Decoder |
|---|---|---|
| self_attn_q | 100.0% | 100.0% |
| self_attn_k | 71.7% | 59.4% |
| self_attn_v | 76.0% | 70.1% |
| cross_attn_q | n/a | 100.0% |
| cross_attn_k | n/a | 63.1% |
| cross_attn_v | n/a | 71.1% |
Q is 100% sparse everywhere: self-attention and cross-attention, encoder and decoder.
VI.2 SVD Effective Rank
Formula: Stable rank = ‖W‖²_F / ‖W‖²₂ = Σσᵢ² / σ₁². Measures effective rank without thresholding.
| Weight Type | T5-Small | T5-Base | T5-v1.1-XXL | BERT-large | DINOv2-large |
|---|---|---|---|---|---|
| self_attn_q | 47.6 | 58.1 | 96.8 | 50.8 | 57.7 |
| self_attn_k | 53.2 | 62.4 | 90.0 | 37.7 | 55.5 |
| self_attn_v | 75.3 | 97.5 | 204.4 | 113.0 | 94.8 |
| self_attn_o | 25.4 | 35.0 | 16.4 | 125.0 | 85.6 |
| mlp_up/gate | 15.2 | 20.6 | 67.9 (gate) / 247.3 (up) | 27.4 | 58.4 |
| mlp_down | 31.3 | 43.9 | 25.3 | 52.2 | 94.4 |
T5-v1.1-XXL O matrices have very low stable rank (16.4): the output projection is extremely low-rank despite the 4096-d space. Cross-attention O is even lower at 6.1.
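The stable-rank formula can be verified on toy matrices where the answer is known:

```python
import numpy as np

def stable_rank(W):
    """||W||_F^2 / ||W||_2^2 = sum(sigma_i^2) / sigma_1^2."""
    s = np.linalg.svd(W, compute_uv=False)
    return float((s ** 2).sum() / s[0] ** 2)

# All singular values equal -> stable rank equals the full dimension.
sr_eye = stable_rank(np.eye(8))        # 8.0
# A rank-1 outer product -> stable rank 1.
u = np.arange(1.0, 5.0)
sr_r1 = stable_rank(np.outer(u, u))    # 1.0
```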
VI.3 QK Similarity Manifold
Formula: QK = W_Q · W_Kᵀ. Eigendecompose the symmetric part (QK + QKᵀ)/2. Positive eigenvalues = attraction directions. Negative eigenvalues = repulsion directions.
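A minimal NumPy sketch of the symmetric-part eigendecomposition; random Gaussian projections land near the 0.500 equilibrium that several models below sit at:

```python
import numpy as np

def qk_positive_fraction(Wq, Wk):
    """Eigendecompose the symmetric part of QK = W_Q @ W_K.T and return
    the fraction of positive eigenvalues (attraction directions)."""
    QK = Wq @ Wk.T
    eig = np.linalg.eigvalsh((QK + QK.T) / 2.0)
    return float((eig > 0).mean())

# Independent Gaussian projections: the spectrum is symmetric about zero,
# so the positive fraction hovers near 0.500.
rng = np.random.default_rng(0)
frac = qk_positive_fraction(rng.standard_normal((256, 256)),
                            rng.standard_normal((256, 256)))
```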
Positive Eigenvalue Fraction Trends:
| Model | First Layer | Last Layer | Trend |
|---|---|---|---|
| T5-Small encoder | 0.615 | 0.535 | −0.080 (decreasing) |
| T5-v1.1-XXL encoder | 0.510 | 0.503 | −0.007 (flat) |
| T5-v1.1-XXL decoder self | 0.501 | 0.548 | +0.047 (increasing) |
| T5-v1.1-XXL cross-attn | 0.500 | 0.500 | 0.000 (locked) |
| BERT-large | 0.446 | 0.513 | +0.066 (increasing) |
| CLIP-ViT-B/16 | 0.503 | 0.538 | +0.035 (increasing) |
| DINOv2-large | 0.498 | 0.548 | +0.050 (increasing) |
| CLIP-ViT-bigG | 0.498 | 0.582 | +0.084 (increasing) |
Critical Finding – Cross-Attention is Perfectly Balanced:
T5-v1.1-XXL cross-attention QK manifold is exactly 0.500 positive / 0.500 negative at ALL 24 layers. Symmetry deviation is 1.414 (= √2) everywhere. This is a locked equilibrium: the bridge between encoder and decoder maintains perfect balance between attraction and repulsion at every depth. No other attention type shows this level of stability.
T5-v1.1-XXL encoder self-attention is flat (~0.50 throughout). Unlike T5-Small which decreased from 0.615 to 0.535, the XXL encoder stays near the equilibrium point. The larger model doesn't need to build anti-similarity boundaries because it has enough capacity to discriminate through other mechanisms.
BERT starts BELOW 0.50 (0.446), the only model with majority repulsion from layer 0. MLM bidirectional training creates fundamentally different QK geometry from autoregressive or contrastive training.
VI.4 MLP Dead Neurons
Formula: Combined importance = ‖wᵢ_up‖₂ · ‖wᵢ_down‖₂ (ReLU) or ‖wᵢ_gate‖₂ · ‖wᵢ_up‖₂ · ‖wᵢ_down‖₂ (GeGLU). Dead if < 1% of mean.
| Model | Dead (<1% mean) | Weak (<10% mean) | Notes |
|---|---|---|---|
| T5-Small (enc+dec) | 0/24,576 (0.00%) | 0/24,576 (0.00%) | All neurons alive |
| T5-Base (enc+dec) | 0/73,728 (0.00%) | 0/73,728 (0.00%) | All neurons alive |
| T5-v1.1-XXL encoder | 0/245,760 (0.00%) | 0/245,760 (0.00%) | All neurons alive |
| T5-v1.1-XXL decoder | 14/245,760 (0.01%) | 461/245,760 (0.19%) | First dead neurons in T5 family |
| BERT-large | 0/98,304 (0.00%) | 0/98,304 (0.00%) | All neurons alive |
| DINOv2-large | 0/98,304 (0.00%) | 0/98,304 (0.00%) | All neurons alive |
| CLIP-ViT-B/16 | 1,316/36,864 (3.57%) | 1,356/36,864 (3.68%) | Only model with significant dead neurons |
| CLIP-ViT-bigG | 0/393,216 (0.00%) | 24,163/393,216 (6.14%) | 0 dead but 6% weak |
Finding: T5-v1.1-XXL decoder has the first dead neurons in the T5 family: 14 neurons in layers 1-2 only. The decoder's early GeGLU layers carved out a tiny amount of capacity. Encoder uses everything. CLIP-ViT-B/16 is the outlier with 3.6% dead neurons: contrastive training at small scale produces genuine pruning.
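The combined-importance count, sketched for the ReLU case (shapes follow the T5 convention, wi: [d_ff, d_model] and wo: [d_model, d_ff]; the random matrices are illustrative):

```python
import numpy as np

def count_dead_neurons(wi, wo, frac=0.01):
    """Importance_i = ||wi[i, :]|| * ||wo[:, i]||.
    A neuron is dead if its importance is below frac * mean importance."""
    imp = np.linalg.norm(wi, axis=1) * np.linalg.norm(wo, axis=0)
    return int((imp < frac * imp.mean()).sum())

rng = np.random.default_rng(1)
wi = rng.standard_normal((64, 16))     # 64 neurons, d_model = 16
wo = rng.standard_normal((16, 64))
wi[3] = 0.0                            # zero out neuron 3's input weights
n_dead = count_dead_neurons(wi, wo)    # exactly one dead neuron
```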
VI.5 Cross-Layer Weight Correlation
Formula: cos(flatten(Wα΅’), flatten(Wβ±Ό)) between weight matrices of the same type at different layers.
| Model | Q adj mean | K adj mean | MLP_up adj mean |
|---|---|---|---|
| T5-Small | ~0.000 | ~0.000 | 0.031–0.045 |
| T5-Base | ~0.000 | ~0.000 | 0.024–0.036 |
| T5-v1.1-XXL encoder | 0.0001 | n/a | n/a |
| T5-v1.1-XXL decoder | −0.0001 | n/a | n/a |
| BERT-large | 0.0002 | 0.0003 | 0.032 |
| CLIP-ViT-B/16 | −0.0004 (QKV) | n/a | 0.008 |
| DINOv2-large | −0.0003 | −0.0002 | 0.006 |
| CLIP-ViT-bigG | 0.0000 (QKV) | n/a | 0.055 |
Universal finding: Attention weights (Q, K, V) are completely uncorrelated across layers (~0.000). Every layer defines an independent similarity function. MLP weights show positive correlation decaying with distance: feedforward layers share structure.
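The correlation measure is just a cosine over flattened matrices; a minimal sketch:

```python
import numpy as np

def cross_layer_cosine(Wi, Wj):
    """Cosine between flattened same-type weight matrices from two layers."""
    a, b = Wi.ravel(), Wj.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(2)
A = rng.standard_normal((64, 64))
B = rng.standard_normal((64, 64))
c_indep = cross_layer_cosine(A, B)    # independent matrices: near 0
c_self = cross_layer_cosine(A, A)     # a matrix with itself: exactly 1
```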
VI.6 Position Bias Topology
T5 uses learned relative position biases: [32 buckets × N_heads].
| Model | Encoder | Decoder |
|---|---|---|
| T5-Small (8 heads) | 3 local, 2 global, 3 mixed | 4 local, 4 global, 0 mixed |
| T5-Base (12 heads) | 4 local, 3 global, 5 mixed | 5 local, 4 global, 3 mixed |
| T5-v1.1-XXL (64 heads) | 24 local, 2 global, 38 mixed | 27 local, 37 global, 0 mixed |
T5-v1.1-XXL position findings:
- Encoder: 38/64 mixed heads, nuanced position sensitivity at scale
- Decoder: ZERO mixed heads, perfect binary crystallization. Every head is either pure local or pure global
- Decoder is 58% global (37/64), overwhelmingly biased toward long-range attention
- Encoder range: [-47.2, 11.2], strong local suppression
- Decoder range: [-28.4, 17.0], more balanced
Finding: The decoder local/global binary split is scale-invariant (0 mixed at T5-Small, 0 mixed at XXL). Gradient descent crystallizes decoder position heads into two pure modes regardless of capacity.
VII. Geometric Residual Modulator
VII.1 Architecture
- Geometric embedding: [vocab_size, 64], a per-token geometric fingerprint
- Projection: Linear(64, d_model, bias=False), Procrustes-aligned to encoder PCA space
- Alpha: per-layer learnable LERP coefficient, stored in logit space, applied via sigmoid
- Intervention: residual_out = (1 − α) · residual + α · proj(geo_embed(token_ids))
- Params: 2.09M (3.45% of T5-Small)
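The intervention can be sketched framework-free in NumPy (a simplified version of the per-layer LERP; `geo_proj` stands in for proj(geo_embed(token_ids))):

```python
import numpy as np

def modulated_residual(residual, geo_proj, alpha_logit):
    """residual_out = (1 - a) * residual + a * geo_proj,
    where a = sigmoid(alpha_logit); alpha is stored in logit space."""
    a = 1.0 / (1.0 + np.exp(-alpha_logit))
    return (1.0 - a) * residual + a * geo_proj

residual = np.zeros(4)
geo = np.ones(4)
mid = modulated_residual(residual, geo, 0.0)     # a = 0.5: midpoint blend
off = modulated_residual(residual, geo, -10.0)   # a ~ 0: residual passes through
```

Storing alpha as a logit keeps the blend coefficient in (0, 1) without clamping during training.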
VII.2 Geometric Embedding Initialization
| Metric | Value |
|---|---|
| WN reconstruction correlation | 0.921 |
| Procrustes alignment cosine | 0.372 |
| Eigenvalue cumulative (top 64) | 61.3% |
VII.3 Alpha Convergence
| Start α | Final Mean α | Layer 5 Final | Pearson Δ | CV | Coherent | Basin |
|---|---|---|---|---|---|---|
| 0.01 (20 ep) | 0.067 | 0.107 | +0.151 | 0.220 | Yes | Binding |
| 0.20 (20 ep) | 0.222 | 0.308 | +0.085 | 0.452 | No | Ridge |
| 0.70 (20 ep) | 0.695 | 0.640 | -0.029 | 0.482 | No | Separation |
| 0.01 (100 ep) | 0.125 | 0.218 | +0.074 | 0.322 | No | Overfit |
VII.4 Depth Gradient (Consistent Across All Runs)
| Layer | 20ep (α=0.01) | 100ep (α=0.01) | 20ep (α=0.20) |
|---|---|---|---|
| 0 | 0.015 | 0.035 | 0.170 |
| 1 | 0.052 | 0.061 | 0.180 |
| 2 | 0.066 | 0.102 | 0.227 |
| 3 | 0.080 | 0.137 | 0.197 |
| 4 | 0.080 | 0.197 | 0.248 |
| 5 | 0.107 | 0.218 | 0.308 |
Finding: Always monotonically increasing. The model wants minimal geometric modulation early and maximum modulation at the deepest layer. Geometry is a final correction, not an initial condition.
VII.5 Best Result
| Metric | Original | Modulated (20ep, Ξ±=0.01 start) | Change |
|---|---|---|---|
| WordNet Pearson | 0.099 | 0.250 | +152% |
| WordNet Spearman | 0.085 | 0.245 | +189% |
| Semantic Gradient | 0.022 | 0.052 | +132% |
| Pentachoron CV | 0.202 | 0.220 | Stayed in band |
| Per-token Preservation | n/a | 0.730 | n/a |
| Coherence | Baseline | Identical on 4/4 tests | n/a |
VIII. Geometric Field Modulator (Multi-Expert)
VIII.1 Architecture
- Three KSimplexChannel experts: k=1 (edge, 2 features), k=2 (triangle, 4 features), k=4 (pentachoron, 11 features)
- Multiplicative gating: residual × Π(blended_gates); valid regions pass, invalid suppressed
- Soft blending: per expert gate = (1 − α) + α × expert_gate
- Null space: 25% of residual dimensions untouched by modulator
- Alpha clamped: [0.001, 0.35], a hard ceiling below the phase boundary
- Gradient scaling: geometric params at 10% LR, alpha at 50% LR, gates at full LR
- Params: 38,552 (0.064% of T5-Small)
- Self-test: validity=0.985, null space preserved, template volumes sane
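A simplified sketch of the soft blending and multiplicative gating described above (function names here are illustrative, not the module's actual API):

```python
import numpy as np

def soft_blend_gate(expert_gate, alpha):
    """Per-expert soft blend: gate = (1 - alpha) + alpha * expert_gate.
    As alpha -> 0 the gate -> 1, so the residual passes through untouched."""
    alpha = np.clip(alpha, 0.001, 0.35)   # hard ceiling below the phase boundary
    return (1.0 - alpha) + alpha * expert_gate

def apply_field(residual, expert_gates, alpha):
    """Multiplicative gating: residual * product of blended expert gates."""
    blended = np.prod([soft_blend_gate(g, alpha) for g in expert_gates], axis=0)
    return residual * blended

r = np.ones(3)
passthrough = apply_field(r, [np.ones(3), np.ones(3)], alpha=0.2)  # unchanged
clamped = soft_blend_gate(0.0, alpha=1.0)  # alpha clamps to 0.35 -> gate 0.65
```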
VIII.2 Design Rationale (Grounded in Cross-Architecture Data)
| Data Point | Design Decision |
|---|---|
| Q sparsity 100% at scale | Geometric field can replace Q: the model barely uses it |
| Cross-attn QK locked at 0.500 | Target equilibrium for geometric validity gating |
| Depth gradient always increasing | Per-layer alpha respects this (low early, high late) |
| Zero dead MLP neurons | Don't touch MLPs: all capacity is in use |
| Decoder position: binary L/G split | Modulator preserves positional structure (null space) |
| CV 0.20–0.23 universal | CV monitoring as health check, not loss |
IX. The 0.29154 Constant
IX.1 Observations Across Systems
| System | Context | Value |
|---|---|---|
| MinimalShunts | CLIP-L → CLIP-G projection gate | Emergent equilibrium |
| Wormhole Lambda | Vision transformer training | Converges from 0.74 toward ~0.29 |
| Alpha curriculum | Devil's Staircase PE training | Converges to ~0.50 under geometric loss; CE destroys it |
| T5 generation | Greedy decode alpha sweep | Stable plateau at 0.291–0.292, semantic phase transition |
| Alpha training basins | 0.70 start → settled at 0.695 | Mirror constant 1 − 0.29154 = 0.70846, Δ = 0.013 |
IX.2 T5 Generation Phase Transition
| Alpha | Output (triangle prompt) |
|---|---|
| 0.01β0.10 | "...three edges and three vertices. it is one of the basic shapes in geometry." |
| 0.20 | "a triangle is a polygon with three edges and three vertices..." |
| 0.28 | "a polygon with three vertices. it is one of the basic shapes in a graph." |
| 0.291 | "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in a graph." |
| 0.2915 | "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in a graph." |
| 0.292 | "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in the world." |
| 0.30 | "a polygon with a vertice and a vertice. it is one of the basic shapes in the world." |
Finding: 0.29154 marks the phase boundary between structural representation ("graph") and physical representation ("world"). Output is invariant to perturbation in a narrow band centered on the constant.
X. Universal Geometric Constants
| Constant | Value | Observed In |
|---|---|---|
| Pentachoron CV | 0.20–0.23 | T5-Small, Qwen 0.8B, Qwen 4B, trained modulator |
| Participation / dim | 0.53–0.56 | T5-Small, Qwen 0.8B |
| Binding/separation constant | 0.29154 / 0.70846 | MinimalShunts, CLIP projections, T5 generation, alpha convergence |
| Depth gradient | Monotonic increasing | All modulator training runs |
| Q sparsity scaling (T5) | 93.7% → 99.4% → 100.0% | T5-Small → T5-Base → T5-v1.1-XXL |
| Q sparsity asymmetry | T5 pretraining only | Present in T5, absent in T5Gemma2, BERT, DINOv2, UNets, VAEs |
| Cross-modal QK balance | Locked at 0.500 | T5-v1.1-XXL cross-attn, T5Gemma2 (both), SD 1.5 UNet, SDXL UNet (6 models) |
| Self-attn QK: adapted models | Locked at 0.500 | T5Gemma2 1B (all 53 layers), T5Gemma2 4B (all 68 layers) |
| UNet QK U-gradient | down→repulsion, up→attraction | SD 1.5 (0.451–0.581), SDXL (0.477–0.549) |
| VAE decoder QK | Repulsion-biased | SD 1.5 (0.486), SDXL (0.416), Flux.1 (0.451), Flux.2 (0.416) |
| Attention cross-layer corr | ~0.000 | ALL 17 models, including UNets and VAEs |
| Conv cross-layer corr | ~0.000 | All UNets and VAEs (extends to pure convnets) |
| MLP/FF full utilization | 0.00% dead | T5 family (enc), BERT, DINOv2, UNets, all VAEs |
| Decoder position crystallization | 0 mixed heads | T5-Small, T5-v1.1-XXL |
| VAE spectral invariant | Pearson 0.94–0.98 | All 6 VAE pairs: SV distribution is architecture-determined |
| VAE Procrustes alignment | 70–76% cosine | All 6 pairs: same solution in different coordinate systems |
XI. Measurement Toolkit Reference
| Tool | Input | Output | Requires Inference |
|---|---|---|---|
| Participation Ratio | Embedding matrix | Effective dimensionality | No |
| Cayley-Menger Volume | 5-point subsets of embeddings | Simplex volume + CV | No |
| Pairwise Cosine | Embedding matrix (sampled) | Similarity distribution | No |
| Digit Manifold | 10 digit token embeddings | \|i−j\| correlation vs cosine | No |
| SVD Effective Rank | Any 2D weight matrix | Stable rank, condition number | No |
| QK Manifold | W_Q, W_K matrices | Eigenspectrum, pos/neg balance | No |
| Dead Neuron Count | MLP wi/gate/up, wo matrices | Combined importance distribution | No |
| Cross-Layer Correlation | Same-type weight matrices | Adjacent cosine similarity | No |
| Position Bias Topology | Relative attention bias tensor | Local/global/mixed head counts | No |
| Sparsity Topology | Any weight matrix | Fraction below threshold | No |
| WordNet Relational | Encoder output (mean-pooled) | Pearson/Spearman vs path similarity | Yes |
| Alpha Convergence | Modulator training loop | Per-layer equilibrium values | Yes (training) |
XII. T5Gemma2 – Decoder-Adapted Encoder-Decoder
Architecture: Gemma 2 decoder weights adapted to encoder-decoder. GQA (grouped query attention), RoPE, GeGLU MLPs. Multimodal (ViT in encoder).
XII.1 Sparsity
| Model | Q (<0.1) | K (<0.1) | V (<0.1) | Pattern |
|---|---|---|---|---|
| T5Gemma2 1B-1B | 100.0% | 99.9% | 100.0% | Uniform |
| T5Gemma2 4B-4B | 100.0% | 100.0% | 100.0% | Uniform |
Finding: No Q/K asymmetry. The T5 Q sparsity pattern is ABSENT when the encoder is initialized from decoder weights. The asymmetry is a property of T5's span corruption pretraining, not the encoder-decoder architecture.
XII.2 QK Manifold
| Model | Encoder Self | Decoder Self | All Layers |
|---|---|---|---|
| T5Gemma2 1B | 0.500 (±0.001) | 0.500 (±0.001) | Locked |
| T5Gemma2 4B | 0.500 exact | 0.500 exact | Locked |
Finding: Perfect 0.500 lock across ALL layers in BOTH encoder and decoder. Symmetry deviation √2 everywhere. The Gemma 2 initialization left the QK matrices near random-matrix equilibrium. The adaptation to encoder-decoder didn't perturb them enough to break Wigner semicircle symmetry.
XII.3 Other Invariants
- Dead neurons: 0/359,424 (1B), 0/696,320 (4B); all alive
- Cross-layer Q correlation: ~0.000, confirmed universal
- MLP utilization: 100% (1 weak neuron each in enc L6 and dec L6 at 4B scale)
- GQA: 4:1 at 1B scale, 2:1 at 4B scale
XIII. Diffusion UNet Weight Topology
XIII.1 UNet Sparsity
| Model | Self Q | Self K | Self V | Cross Q | Cross K | Cross V |
|---|---|---|---|---|---|---|
| SD 1.5 UNet | 90.5% | 90.9% | 97.1% | 96.8% | 94.9% | 98.9% |
| SDXL UNet | 99.9% | 99.9% | 100.0% | 100.0% | 100.0% | 100.0% |
SD 1.5 is the least sparse model in the entire battery: 90.5% for self-attention Q, below T5-Small's 93.7%. A parameter-starved model (860M for 512×512 image generation) uses denser weights. SDXL at 3× the params reaches near-100%.
Sparsity traces the U-path (SD 1.5): down=88.9%, mid=99.3%, up=89.4%. The bottleneck has the most diffuse weights; the periphery has the densest.
XIII.2 UNet QK Manifold – The U-Shape
Self-attention positive eigenvalue fraction through the UNet path:
| Position | SD 1.5 | SDXL |
|---|---|---|
| down (early) | 0.509 | ~0.49 |
| down (deep) | 0.451 | 0.483 |
| mid (bottleneck) | 0.483 | 0.477 |
| up (early) | 0.501 | 0.501 |
| up (late) | 0.581 | 0.549 |
The QK manifold traces the U-shape: repulsion-dominated downpath (compressing, discriminating), maximum repulsion at the bottleneck, rising to an attraction-dominated uppath (reconstructing, grouping). SD 1.5 shows the wider swing (0.451–0.581, a 0.130 range) because it is more parameter-starved.
Cross-attention: locked at 0.500 in both UNets. SD 1.5: mean=0.501, std=0.001. SDXL: mean=0.500, std=0.001. The fifth and sixth confirmations of the cross-modal QK lock.
XIII.3 Other UNet Invariants
- Dead neurons: 0/23,040 (SD 1.5), 0/163,840 (SDXL)
- Cross-block Q correlation: ~0.000 (both self-attn and cross-attn)
- SDXL cross-attn Q stable rank: 13.97 (lowest of any weight type); extremely concentrated queries to text
- SDXL cross-attn V: highest stable rank (165.9) and lowest condition number (15.8); the richest value matrices
XIV. VAE Weight Topology
XIV.1 Cross-VAE Comparison
| VAE | Params | Latent Ch | Enc (<0.1) | Dec (<0.1) | Enc QK pos | Dec QK pos |
|---|---|---|---|---|---|---|
| SD 1.5 | 83.7M | 4 | 98.6% | 99.1% | 0.496 | 0.486 |
| SDXL | 83.7M | 4 | 29.0% | 38.1% | 0.502 | 0.416 |
| Flux.1 | 83.8M | 16 | 96.5% | 97.5% | 0.498 | 0.451 |
| Flux.2 | 84.0M | 32 | 94.3% | 94.3% | 0.393 | 0.416 |
SDXL VAE is the densest model measured: 29% encoder sparsity at the 0.1 threshold. Identical architecture and param count to SD 1.5, but the weights are 3× denser. Attention condition numbers reach 1.16M.
XIV.2 VAE Decoder QK Breaks Toward Repulsion
| VAE | Latent Ch | Decoder QK pos | Interpretation |
|---|---|---|---|
| SD 1.5 | 4 | 0.486 | Slight repulsion |
| SDXL | 4 (1024² target) | 0.416 | Strong repulsion: 4× reconstruction challenge |
| Flux.1 | 16 | 0.451 | Moderate repulsion |
| Flux.2 | 32 | 0.416 | Strong repulsion: most channels to separate |
Decoder bottleneck attention breaks symmetry toward repulsion. Reconstruction requires spatial discrimination: more negative eigenvalues = finer spatial separation. More latent channels or higher target resolution → stronger repulsion.
Flux.1 decoder anomaly: Top eigenvalue = 60,807 (typical is 2–150). One attention direction completely dominates: effectively a rank-1 approximation of the attention space.
XIV.3 VAE Invariants
- Zero dead neurons across all four VAEs
- Conv filter utilization: 100% (active fraction 1.000)
- Cross-layer conv correlation: ~0.000; universal, extends to pure convnets
- Spectral correlation between VAEs: 0.94–0.98; architecture determines the SV distribution
XV. Procrustes Analysis – VAE Weight-Space Alignment
XV.1 Methodology
Orthogonal Procrustes: For each common weight matrix (same name, same shape), find the orthogonal R minimizing ‖A − BR‖_F via SVD of BᵀA. Report the residual (0 = identical up to rotation, √2 = orthogonal) and the cosine after alignment.
Spectral correlation: Pearson correlation of normalized singular value distributions.
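A NumPy sketch of the per-matrix alignment; the residual here uses the normalized convention above (0 = identical up to rotation, √2 = orthogonal). As a sanity check, a pure rotation of A aligns perfectly:

```python
import numpy as np

def procrustes_align(A, B):
    """Find orthogonal R minimizing ||A - B R||_F via SVD of B.T @ A.
    Returns R, the aligned cosine, and a residual in [0, sqrt(2)]."""
    U, _, Vt = np.linalg.svd(B.T @ A)
    R = U @ Vt
    BR = B @ R
    cos = float((A.ravel() @ BR.ravel()) /
                (np.linalg.norm(A) * np.linalg.norm(BR)))
    resid = np.sqrt(max(0.0, 2.0 - 2.0 * cos))   # 0 = same up to rotation
    return R, cos, resid

rng = np.random.default_rng(3)
A = rng.standard_normal((32, 16))
Q, _ = np.linalg.qr(rng.standard_normal((16, 16)))  # random orthogonal matrix
B = A @ Q.T                                         # B is a rotated copy of A
R, cos, resid = procrustes_align(A, B)              # cos ~ 1, resid ~ 0
```

SciPy's `scipy.linalg.orthogonal_procrustes` computes the same R; the plain-NumPy version is shown to keep the sketch dependency-free.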
XV.2 Pairwise Results
| Pair | Raw Cosine | Procrustes Cosine | Rotation Gain | Spectral Corr |
|---|---|---|---|---|
| SD1.5 vs SDXL | 0.053 | 0.697 | +0.644 | 0.958 |
| SD1.5 vs Flux.1 | 0.091 | 0.730 | +0.640 | 0.964 |
| SD1.5 vs Flux.2 | -0.000 | 0.757 | +0.757 | 0.979 |
| SDXL vs Flux.1 | 0.024 | 0.675 | +0.650 | 0.939 |
| SDXL vs Flux.2 | -0.001 | 0.705 | +0.705 | 0.937 |
| Flux.1 vs Flux.2 | 0.000 | 0.736 | +0.736 | 0.957 |
XV.3 Key Findings
1. Raw cosine is zero. All pairs. Weights are orthogonal in raw space. Naive comparison says these VAEs share nothing. This is wrong.
2. After Procrustes rotation, 70β76% of structure aligns. These models found the SAME geometric solution, expressed in different coordinate systems. Different initialization β different basis β same function.
3. Spectral correlation is 0.94β0.98. Singular value distributions are nearly identical across all pairs. The "shape" of each weight matrix β rank structure, energy distribution β is architecture-determined, not training-determined.
4. SD 1.5 vs Flux.2 is the most alignable pair. Raw cosine literally zero, but highest Procrustes cosine (0.757) and highest spectral correlation (0.979). The most different training produces the most alignable weights. Shared structure is deepest when surface differences are greatest.
5. SDXL is the geometric outlier. Lowest Procrustes cosine with every model (0.675β0.705). Found a more distant basin despite identical architecture to SD 1.5.
XV.4 Distance Matrices
Procrustes Residual (lower = more similar):
| | SD 1.5 | SDXL | Flux.1 | Flux.2 |
|---|---|---|---|---|
| SD 1.5 | 0.000 | 0.752 | 0.707 | 0.679 |
| SDXL | 0.752 | 0.000 | 0.774 | 0.739 |
| Flux.1 | 0.707 | 0.774 | 0.000 | 0.699 |
| Flux.2 | 0.679 | 0.739 | 0.699 | 0.000 |
Spectral Correlation (higher = more similar):
| | SD 1.5 | SDXL | Flux.1 | Flux.2 |
|---|---|---|---|---|
| SD 1.5 | 1.000 | 0.958 | 0.964 | 0.979 |
| SDXL | 0.958 | 1.000 | 0.939 | 0.937 |
| Flux.1 | 0.964 | 0.939 | 1.000 | 0.957 |
| Flux.2 | 0.979 | 0.937 | 0.957 | 1.000 |
XV.5 Implication for Geometric Transfer
A geometric field modulator trained on one VAE can be ROTATED to work on another via the Procrustes R matrix. 70–76% structural alignment means the modulator captures the shared geometric invariant. The remaining 24–30% is model-specific: the unique basin each training run found.
XVI. Scripts Reference
| Script | Purpose | Key Outputs |
|---|---|---|
| `probe_t5_small_terrain.py` | T5-Small embedding + layer geometry | PR, CV, digit manifold, layer evolution |
| `probe_t5_wordnet_summarize.py` | T5-Small × WordNet relational alignment | Pearson, Spearman, distance bands, hypernym decay |
| `probe_t5_wordnet_50seeds.py` | 50-seed stability test (GPU-accelerated) | Confidence intervals for all relational metrics |
| `probe_t5_inactive_weights.py` | T5-Small/Base inactive weight topology | SVD, sparsity, QK manifold, dead neurons |
| `cross_architecture_weight_battery.py` | BERT + CLIP + DINOv2 battery | Cross-model comparison table |
| `probe_flux_t5_g4.py` | T5-v1.1-XXL (Flux encoder) full battery | All layers, encoder + decoder + cross-attn |
| `geometric_residual_modulator.py` | LERP modulator + training utilities | Modulator class + measurement tools |
| `geometric_field_modulator.py` | Multi-expert field modulator | KSimplex experts + multiplicative gating |
| `geometric_modulator_full_pipeline.py` | Self-contained T5 + WordNet + modulator | End-to-end pipeline |
| `train_modulator.py` | Training loop for alpha convergence | Freeze T5, train modulator, track alpha |
| `probe_t5gemma2.py` | T5Gemma2 battery (both scales) | GQA handling, adapted enc-dec topology |
| `probe_unet_geometry.py` | SD 1.5 / SDXL UNet battery | U-path QK gradient, cross-attn lock |
| `probe_vae_geometry.py` | All four VAE battery | Conv reshape, bottleneck attention, latent comparison |
| `procrustes_vae_analysis.py` | Pairwise Procrustes on 4 VAEs | Distance matrices, depth profiles, rotation gain |
Last updated: 2026-03-06
Models profiled: 17 (T5-Small, T5-Base, T5-v1.1-XXL, BERT-large, CLIP-ViT-B/16, DINOv2-large, CLIP-ViT-bigG, Qwen3.5-0.8B, Qwen3.5-4B, T5Gemma2-1B, T5Gemma2-4B, SD 1.5 UNet, SDXL UNet, SD 1.5 VAE, SDXL VAE, Flux.1 VAE, Flux.2 VAE)
Architecture families: 5 (Transformer enc-dec, encoder-only/vision, adapted enc-dec, conv UNet, conv autoencoder)
Training objectives: 6 (span corruption, MLM, contrastive, self-supervised, diffusion, reconstruction)
Procrustes analysis: 6 VAE pairs, 68 weight matrices each
Modulator experiments: 4 LERP configurations, 1 field modulator