@@ -128,6 +128,41 @@ relevant comparison is ANN recall@k on embedding datasets, where TurboQuant's
block decomposition, PDX scan layout, and per-vector encode/decode are the
critical features.

+ ### Comparison to RotorQuant / IsoQuant
+
+ RotorQuant [13] replaces TurboQuant's full-dimension SORF with Clifford algebra
+ rotors in Cl(3,0), chunking vectors into 3-dimensional groups and applying SO(3)
+ sandwich products. IsoQuant extends this to SO(4) via quaternions, and PlanarQuant
+ uses SO(2) Givens rotations. All three are block-diagonal rotation strategies with
+ very small blocks (2-4 dimensions).
+
+ On real KV-cache tensors (Qwen2.5-3B), these small-block rotations showed severe
+ quality regressions: RotorQuant at 3-bit measured 3.843 MSE vs. TurboQuant's
+ 0.354 (10.8× worse), and IsoQuant at 4-bit incurred +36% perplexity impact vs.
+ TurboQuant's +11.7% [13]. Independent analysis attributed this to the fundamental
+ decorrelation limitation: block-diagonal rotations in SO(2)/SO(3)/SO(4) provide
+ no cross-group coordinate mixing, while WHT/SORF mixes all coordinates
+ simultaneously. Real embedding vectors exhibit full-dimension correlations that
+ small-block rotations cannot break.
+
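The mixing argument above can be illustrated with a minimal sketch. It is not taken from the RotorQuant or TurboQuant code, and the names `fwht`, `givens_pairs`, and `support` are hypothetical. It also omits the random sign flips and normalization that a real SORF round applies. Under these assumptions, a unit impulse in an 8-dimensional vector reaches all 8 coordinates after one Walsh-Hadamard transform, but only the 2 coordinates of its own group after a block-diagonal SO(2) rotation with a generic angle.

```rust
/// In-place, unnormalized fast Walsh-Hadamard transform.
/// Illustrative only: a SORF round would also apply random signs and scaling.
pub fn fwht(v: &mut [f64]) {
    let n = v.len();
    assert!(n.is_power_of_two(), "length must be a power of two");
    let mut h = 1;
    while h < n {
        // One butterfly stage: combine entries that are h apart.
        for start in (0..n).step_by(2 * h) {
            for i in start..start + h {
                let (a, b) = (v[i], v[i + h]);
                v[i] = a + b;
                v[i + h] = a - b;
            }
        }
        h *= 2;
    }
}

/// Block-diagonal SO(2) rotation: rotate each consecutive pair by `theta`.
/// Coordinates in different pairs never interact.
pub fn givens_pairs(v: &mut [f64], theta: f64) {
    let (s, c) = theta.sin_cos();
    for pair in v.chunks_exact_mut(2) {
        let (a, b) = (pair[0], pair[1]);
        pair[0] = c * a - s * b;
        pair[1] = s * a + c * b;
    }
}

/// Number of coordinates that are nonzero (up to a small tolerance).
pub fn support(v: &[f64]) -> usize {
    v.iter().filter(|x| x.abs() > 1e-12).count()
}
```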
+ |                        | TurboQuant (SORF)                             | RotorQuant (SO(3))         | IsoQuant (SO(4))            |
+ | ---------------------- | --------------------------------------------- | -------------------------- | --------------------------- |
+ | Decorrelation          | Full dimension (3-round SORF, all coords mix) | Block-diagonal (3D groups) | Block-diagonal (4D groups)  |
+ | Params (d=128)         | 384 sign bits (3 × 128)                       | 186 rotor params           | ~500 quaternion params      |
+ | MSE at 3-bit (Qwen KV) | 0.354                                         | 3.843 (10.8× worse)        | Not reported at 3-bit       |
+ | Speed vs. WHT          | Baseline (896 FMAs at d=128)                  | 2,408 FMAs (2.7× slower)   | ~3.6× slower (CUDA prefill) |
+
+ **Relevance to our design.** RFC 0033's Stage 2 block decomposition is also
+ block-diagonal: each B-dim block has an independent SORF with no cross-block
+ mixing. The critical difference is block size. B=256 with 3-round SORF provides
+ 24 butterfly stages of within-block mixing (3 rounds × log2(256) = 8 stages per
+ round, comparable to the current B=1024's 30 stages), whereas RotorQuant and
+ IsoQuant rotate groups of only 3-4 coordinates with no structured mixing at
+ all. The RotorQuant/IsoQuant data is empirical evidence that the quality cliff
+ for block-diagonal rotations is steep at very small B, which is consistent with
+ the RFC's minimum B ≥ 64 constraint. Whether B=256 is large enough to avoid
+ meaningful decorrelation loss is an empirical question addressed in the
+ Experimental plan.
+
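The stage counts quoted above follow from simple arithmetic, sketched here under the assumption that each SORF round applies one Walsh-Hadamard transform with log2(B) butterfly stages. The function name `mixing_stages` is hypothetical and is not part of the `vortex-tensor` crate.

```rust
/// Butterfly stages in a `rounds`-round SORF over a block of `block_dim`
/// coordinates, assuming one Walsh-Hadamard transform (log2(block_dim)
/// stages) per round. Returns `None` if `block_dim` is not a power of two,
/// since the WHT is only defined for power-of-two sizes.
pub fn mixing_stages(block_dim: usize, rounds: u32) -> Option<u32> {
    if !block_dim.is_power_of_two() {
        return None;
    }
    // For a power of two, trailing_zeros() equals log2.
    Some(rounds * block_dim.trailing_zeros())
}
```

For example, `mixing_stages(256, 3)` gives `Some(24)` and `mixing_stages(1024, 3)` gives `Some(30)`, matching the figures above. A 3-dimensional group is not a power of two and yields `None`.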
### Current Vortex implementation

The [current implementation][current-impl] (Rust, in the `vortex-tensor` crate,
@@ -555,6 +590,18 @@ smaller block dimension B, within-block coordinate dependence after rotation may
be stronger even when marginals are correct; this is an additional motivation
for the experimental plan's comparison of block sizes.

+ **Empirical evidence from small-block rotations.** The RotorQuant/IsoQuant
+ experiments [13] provide direct evidence of this decorrelation failure mode:
+ block-diagonal rotations in SO(3) (3-dim groups) and SO(4) (4-dim groups)
+ caused severe quality regressions on real KV-cache tensors (10.8× higher MSE
+ for RotorQuant at 3-bit), attributed to the complete absence of cross-group
+ coordinate mixing. Our Stage 2 design operates at a fundamentally different
+ scale: B=256 blocks with 3-round SORF provide 24 butterfly mixing stages within
+ each block, whereas RotorQuant and IsoQuant rotate 3-4 raw coordinates with no
+ structured mixing, so the decorrelation loss should be far less severe.
+ Nevertheless, the experimental plan includes explicit cross-block correlation
+ measurement on real embeddings to quantify any residual decorrelation gap
+ between block-decomposed (B=256) and single-block (B=d) SORF.
+
The actual MSE may depend on block dimension B: at larger B the coordinate
distribution is more concentrated (variance ~ 1/B), giving the Max-Lloyd
quantizer more to exploit. See Experimental plan.
@@ -954,6 +1001,15 @@ to 64 or raising to 256.
- Test SORF coordinate distribution at each B: histogram vs. analytical Beta
- Test 3, 4, 5 SORF rounds at each B
- Determine if the practical MSE constant is worse at smaller B
+ - Measure cross-block coordinate correlation on real embeddings (Contriever,
+   OpenAI) before and after per-block SORF rotation: compute the average
+   absolute Pearson correlation between coordinates in different blocks. Compare
+   block-decomposed (B=256, k=3) vs. single-block (B=d) SORF at d=768 to
+   quantify how much cross-block dependence survives block decomposition. The
+   RotorQuant/IsoQuant experiments [13] showed that very small block-diagonal
+   rotations (3-4 dims) leave full-dimension correlations intact; this test
+   determines where on the block-size spectrum the decorrelation gap becomes
+   negligible.
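The cross-block correlation metric in the last item could be computed roughly as follows. This is a minimal sketch under stated assumptions: vectors are held as rows of `f64`, blocks are contiguous runs of `block_dim` coordinates, and the names `pearson` and `mean_abs_cross_block_corr` are hypothetical. The all-pairs loop is quadratic in d, so it is meant for offline analysis rather than the encode path.

```rust
/// Pearson correlation between two equal-length samples.
/// Returns 0.0 when either sample has zero variance.
fn pearson(x: &[f64], y: &[f64]) -> f64 {
    let n = x.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let (mut sxy, mut sxx, mut syy) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (a, b) in x.iter().zip(y) {
        sxy += (a - mx) * (b - my);
        sxx += (a - mx) * (a - mx);
        syy += (b - my) * (b - my);
    }
    if sxx == 0.0 || syy == 0.0 {
        0.0
    } else {
        sxy / (sxx * syy).sqrt()
    }
}

/// Mean |Pearson r| over all coordinate pairs (i, j) that fall in different
/// blocks. `vectors` holds one d-dimensional vector per row; block k is
/// assumed to cover coordinates k * block_dim .. (k + 1) * block_dim.
pub fn mean_abs_cross_block_corr(vectors: &[Vec<f64>], block_dim: usize) -> f64 {
    let d = vectors[0].len();
    // Transpose so that each coordinate becomes one contiguous sample.
    let cols: Vec<Vec<f64>> = (0..d)
        .map(|j| vectors.iter().map(|v| v[j]).collect())
        .collect();
    let (mut total, mut pairs) = (0.0_f64, 0usize);
    for i in 0..d {
        for j in (i + 1)..d {
            if i / block_dim != j / block_dim {
                total += pearson(&cols[i], &cols[j]).abs();
                pairs += 1;
            }
        }
    }
    // With a single block (block_dim >= d) there are no cross-block pairs.
    if pairs == 0 { 0.0 } else { total / pairs as f64 }
}
```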

The block-size rule ("greatest qualifying B") is a starting heuristic that
maximizes per-block quality and minimizes norm count. Experiments may show that
@@ -1299,6 +1355,13 @@ IEEE Trans. PAMI 36(4):744-755, 2014.
Alistarh, D. "Pushing the Limits of Large Language Model Quantization via the
Linearity Theorem." arXiv:2411.17525, November 2024.

+ [13] johndpope et al. "RotorQuant: Clifford algebra vector quantization." PR #34,
+ TheTom/turboquant_plus, March-April 2026.
+ https://github.com/TheTom/turboquant_plus/pull/34
+ Explores SO(2)/SO(3)/SO(4) block-diagonal rotations as alternatives to
+ full-dimension SORF. Rejected due to 10×+ MSE regressions on real KV-cache
+ tensors, attributed to insufficient cross-group decorrelation.
+
## Appendix A: Reference implementation bugs and Theorem 1 constant

### Reference implementation bugs