Commit b5a171d

Add RotorQuant/IsoQuant comparison and decorrelation analysis to RFC 0033
Incorporate findings from TheTom/turboquant_plus#34, where small block-diagonal rotations (SO(2)/SO(3)/SO(4)) caused 10x+ MSE regressions on real KV-cache data. This empirical evidence strengthens the case for large block sizes (B=256+) in Stage 2 and motivates a new experimental plan item measuring cross-block correlation on real embeddings. https://claude.ai/code/session_016qKqZ579LA83p7ThoAdqut
1 parent ef7e874 commit b5a171d

1 file changed

Lines changed: 63 additions & 0 deletions

proposed/0033-block-turboquant.md

@@ -128,6 +128,41 @@ relevant comparison is ANN recall@k on embedding datasets, where TurboQuant's
 block decomposition, PDX scan layout, and per-vector encode/decode are the
 critical features.

+### Comparison to RotorQuant / IsoQuant
+
+RotorQuant [13] replaces TurboQuant's full-dimension SORF with Clifford algebra
+rotors in Cl(3,0), chunking vectors into 3-dimensional groups and applying SO(3)
+sandwich products. IsoQuant extends this to SO(4) via quaternions, and PlanarQuant
+uses SO(2) Givens rotations. All three are block-diagonal rotation strategies with
+very small blocks (2-4 dimensions).
+
+On real KV-cache tensors (Qwen2.5-3B), these small-block rotations showed severe
+quality regressions: RotorQuant at 3-bit measured 3.843 MSE vs. TurboQuant's
+0.354 (10.8× worse), and IsoQuant at 4-bit incurred a +36% perplexity impact vs.
+TurboQuant's +11.7% [13]. Independent analysis attributed this to the fundamental
+decorrelation limitation: block-diagonal rotations in SO(2)/SO(3)/SO(4) provide
+no cross-group coordinate mixing, while WHT/SORF mixes all coordinates
+simultaneously. The KV-cache results suggest that real vectors carry
+full-dimension correlations that small-block rotations cannot break.
+
+|                        | TurboQuant (SORF)                             | RotorQuant (SO(3))         | IsoQuant (SO(4))            |
+| ---------------------- | --------------------------------------------- | -------------------------- | --------------------------- |
+| Decorrelation          | Full dimension (3-round SORF, all coords mix) | Block-diagonal (3D groups) | Block-diagonal (4D groups)  |
+| Params (d=128)         | 384 sign bits (3 × 128)                       | 186 rotor params           | ~500 quaternion params      |
+| MSE at 3-bit (Qwen KV) | 0.354                                         | 3.843 (10.8× worse)        | Not reported at 3-bit       |
+| Speed vs. WHT          | Baseline (896 FMAs at d=128)                  | 2,408 FMAs (2.7× slower)   | ~3.6× slower (CUDA prefill) |
+
+**Relevance to our design.** RFC 0033's Stage 2 block decomposition is also
+block-diagonal: each B-dim block has an independent SORF with no cross-block
+mixing. The critical difference is block size: B=256 with 3-round SORF provides
+24 butterfly stages of within-block mixing (3 × log2(256); comparable to the
+current B=1024's 30 stages), vs. RotorQuant/IsoQuant's 3- or 4-dimensional
+groups with no cross-group mixing at all. The RotorQuant/IsoQuant data is
+empirical evidence that the quality cliff for block-diagonal rotations is steep
+at very small B, which is consistent with the RFC's minimum B ≥ 64 constraint.
+Whether B=256 is large enough to avoid meaningful decorrelation loss is an
+empirical question addressed in the Experimental plan.
+
 ### Current Vortex implementation

 The [current implementation][current-impl] (Rust, in the `vortex-tensor` crate,
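
The counts quoted in the hunk above (24 and 30 butterfly stages, 384 sign bits, 896 operations at d=128) can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes that one WHT pass over a power-of-two dimension n has log2(n) stages and costs n · log2(n) add/subtract operations, which is the counting convention that matches the table's 896 figure.

```python
def log2_exact(n: int) -> int:
    """Return log2(n) for a power of two n >= 2, else raise ValueError."""
    if n < 2 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two >= 2")
    return n.bit_length() - 1


def butterfly_stages(block_size: int, rounds: int = 3) -> int:
    """Butterfly stages applied to one block by `rounds` (sign flip, WHT) rounds."""
    return rounds * log2_exact(block_size)


def wht_add_sub_ops(dim: int) -> int:
    """Add/subtract operations in one WHT pass (assumed convention: dim * log2(dim))."""
    return dim * log2_exact(dim)


def sign_bits(dim: int, rounds: int = 3) -> int:
    """Random sign bits needed for `rounds` diagonal sign matrices of size dim."""
    return rounds * dim


for b in (64, 256, 1024):
    print(f"B={b}: {butterfly_stages(b)} butterfly stages over 3 rounds")
print(f"d=128: {wht_add_sub_ops(128)} ops per WHT pass, {sign_bits(128)} sign bits")
```

Under these assumptions the minimum block size B=64 gives 18 stages, so the step from B=1024 to B=256 gives up 6 of 30 stages, whereas a 3- or 4-dimensional group has no butterfly structure at all.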
@@ -555,6 +590,18 @@ smaller block dimension B, within-block coordinate dependence after rotation may
 be stronger even when marginals are correct; this is an additional motivation
 for the experimental plan's comparison of block sizes.

+**Empirical evidence from small-block rotations.** The RotorQuant/IsoQuant
+experiments [13] provide direct evidence of this decorrelation failure mode: on
+real KV-cache vectors, SO(3) rotations (3-dim groups) caused a 10.8× MSE
+regression at 3-bit and SO(4) rotations (4-dim groups) a +36% perplexity impact
+at 4-bit, attributed to the complete absence of cross-group mixing. Our Stage 2
+design operates at a fundamentally different scale: B=256 blocks with 3-round
+SORF provide 24 butterfly mixing stages within each block, vs. 3 or 4 raw
+coordinates per group with no cross-group mixing, so the decorrelation loss
+should be far less severe. Nevertheless, the experimental plan includes an
+explicit cross-block correlation measurement on real embeddings to quantify any
+residual gap between block-decomposed (B=256) and single-block (B=d) SORF.
+
 The actual MSE may depend on block dimension B: at larger B the coordinate
 distribution is more concentrated (variance ~1/B), giving the Max-Lloyd
 quantizer more to exploit. See Experimental plan.
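
The "no cross-group mixing" property discussed in the hunk above can be seen on a toy input. The following is a minimal sketch, not the RFC's or RotorQuant's implementation: it assumes SORF is modelled as rounds of a random sign flip followed by an orthonormal Walsh-Hadamard transform, and all helper names are hypothetical. A unit vector with all its energy in coordinate 0 is pushed through 2-dim Givens blocks, through an independent 3-round SORF per block of B=8, and through a single full-dimension round at d=16; the script then reports how much energy ends up outside the input's own block.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = x[i], x[i + h]
                x[i], x[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]


def sorf(x, sign_rounds):
    """Apply one (sign flip, WHT) pair per entry of sign_rounds."""
    for signs in sign_rounds:
        x = fwht([s * v for s, v in zip(signs, x)])
    return x


def random_signs(rng, dim, rounds):
    """Draw `rounds` independent vectors of +/-1 signs of length dim."""
    return [[rng.choice((-1.0, 1.0)) for _ in range(dim)] for _ in range(rounds)]


def block_diagonal(x, block_size, rotate):
    """Apply rotate(block_index, block) to each consecutive block of x."""
    out = []
    for k, start in enumerate(range(0, len(x), block_size)):
        out.extend(rotate(k, x[start:start + block_size]))
    return out


def givens(theta):
    """2-dim rotation by theta, ignoring the block index."""
    c, s = math.cos(theta), math.sin(theta)
    return lambda _k, v: [c * v[0] - s * v[1], s * v[0] + c * v[1]]


def energy_outside(x, start, stop):
    """Squared norm of x outside the coordinate range [start, stop)."""
    return sum(v * v for i, v in enumerate(x) if not start <= i < stop)


d = 16
rng = random.Random(0)
e0 = [1.0] + [0.0] * (d - 1)  # all energy in coordinate 0

planar = block_diagonal(e0, 2, givens(0.7))            # SO(2)-style blocks
block_signs = [random_signs(rng, 8, 3) for _ in range(d // 8)]
blocked = block_diagonal(e0, 8, lambda k, v: sorf(v, block_signs[k]))
full = sorf(e0, random_signs(rng, d, 1))               # one full-dimension round

print("energy outside own 2-dim block:", energy_outside(planar, 0, 2))
print("energy outside own 8-dim block:", energy_outside(blocked, 0, 8))
print("energy outside first 8 coords after full round:", energy_outside(full, 0, 8))
```

Both block-diagonal variants leave exactly zero energy outside the input's block regardless of block size or round count, while a single full-dimension round spreads the input evenly so that half of the energy lands in the other half of the vector. This shows only the structural limitation; how much it costs in MSE on real data is what the experimental plan is meant to measure.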
@@ -954,6 +1001,15 @@ to 64 or raising to 256.
 - Test SORF coordinate distribution at each B: histogram vs. analytical Beta
 - Test 3, 4, 5 SORF rounds at each B
 - Determine if the practical MSE constant is worse at smaller B
+- Measure cross-block coordinate correlation on real embeddings (Contriever,
+OpenAI) before and after per-block SORF rotation: compute the average
+absolute Pearson correlation between coordinates in different blocks. Compare
+block-decomposed (B=256, k=3) vs. single-block (B=d) SORF at d=768 to
+quantify how much cross-block dependence survives block decomposition. The
+RotorQuant/IsoQuant experiments [13] suggest that very small block-diagonal
+rotations (3-4 dims) leave full-dimension correlations intact; this test
+determines where on the block-size spectrum the decorrelation gap becomes
+negligible

 The block-size rule ("greatest qualifying B") is a starting heuristic that
 maximizes per-block quality and minimizes norm count. Experiments may show that
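
The metric named in the new experimental-plan bullet above, the average absolute Pearson correlation between coordinates in different blocks, could be computed along the following lines. This is a sketch under stated assumptions rather than the RFC's measurement code: the function names are hypothetical, a constant coordinate is treated as having zero correlation, and the pure-Python loops are quadratic in the dimension, so a real run at d=768 would want a vectorized implementation.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def mean_abs_cross_block_corr(vectors, block_size):
    """Average |Pearson r| over coordinate pairs that fall in different blocks."""
    dim = len(vectors[0])
    columns = [[v[j] for v in vectors] for j in range(dim)]
    total, count = 0.0, 0
    for i in range(dim):
        for j in range(i + 1, dim):
            if i // block_size != j // block_size:
                total += abs(pearson(columns[i], columns[j]))
                count += 1
    return total / count if count else 0.0


# Toy data: the second 2-dim block is an exact copy of the first, and the two
# coordinates inside a block are uncorrelated with each other.
c0 = [1.0, 2.0, 3.0, 4.0]
c1 = [1.0, -1.0, -1.0, 1.0]
toy = [[c0[i], c1[i], c0[i], c1[i]] for i in range(4)]
print(mean_abs_cross_block_corr(toy, block_size=2))
```

On the toy data, two of the four cross-block pairs are perfectly correlated and two are uncorrelated. Running the same function on vectors before and after a per-block rotation gives the before/after comparison the bullet describes; with a single block (B=d) there are no cross-block pairs and the function returns 0.0 by convention.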
@@ -1299,6 +1355,13 @@ IEEE Trans. PAMI 36(4):744-755, 2014.
 Alistarh, D. "Pushing the Limits of Large Language Model Quantization via the
 Linearity Theorem." arXiv:2411.17525, November 2024.

+[13] johndpope et al. "RotorQuant: Clifford algebra vector quantization." PR #34,
+TheTom/turboquant_plus, March-April 2026.
+https://github.com/TheTom/turboquant_plus/pull/34
+Explores SO(2)/SO(3)/SO(4) block-diagonal rotations as alternatives to
+full-dimension SORF. Rejected due to 10×+ MSE regressions on real KV-cache
+tensors, attributed to insufficient cross-group decorrelation.
+
 ## Appendix A: Reference implementation bugs and Theorem 1 constant

 ### Reference implementation bugs
