Lance FSL Leaf Compression Community Discussion Proposal #6819
Xuanwo started this conversation in Lance File Format
+1 I have no problem with pursuing this. I am curious what kinds of tokens you were encountering. When working with embeddings (FSL) I've found that compression is typically a lost cause (embeddings should be high entropy by definition). Still, there is no harm (other than some write penalty) in attempting to find good compression and there are cases where it can be done (e.g. if the user originally had f16 data and they inflated it into f32 or something like that).
Abstract
I propose adding shape-preserving leaf compression for FixedSizeList in Lance File Format 2.3.
The immediate motivation for this proposal comes from token dataset compression: when the input column is FixedSizeList<Int64/UInt64>, Lance v2.2 fixed-size storage currently consumes about 8.03 bytes/token. Given a token vocabulary on the order of 128K, the lower bound for a fixed-width encoding is roughly 17 bits/token (log2 of 128K), i.e., 2.125 bytes/token. Lance's current results indicate that it is still writing out raw 64-bit values without routing the child primitive values through an integer compression path.
The proposed direction is to treat FixedSizeList as a shape layer, treat the primitive child values as a leaf value stream, and apply existing or new general-purpose physical compression on that leaf stream.
The target model is:
For token IDs, leaf compression would typically choose integer bitpacking. For float embeddings, leaf compression could select byte-stream splitting or general compression. For timestamps, decimals, or integer-like values, the corresponding primitive encodings can also be reused. FixedSizeList itself remains responsible for preserving the fixed shape, while the physical layout of child values is determined by the leaf encoding.
Background
Lance is positioned as a columnar format for ML workloads, and FixedSizeList is a critically important type for such workloads:
- FixedSizeList<Float32>[dim]
- FixedSizeList<Int64/UInt64>[seq_len]
The logical schema for these data types should stay Arrow-native, while the physical encoding should make compression choices based on the actual domain of the leaf values.
Lance’s structure already moves in this direction:
- FixedSizeList.values is already a CompressiveEncoding, which in theory can express combinations such as FixedSizeList -> Bitpacking.
- FixedSizeListBlock keeps the dimension and child block.
The gap is that the actual writer/reader paths still largely treat FSL child values as flat raw values. As a result, FixedSizeList<UInt64> token IDs, even when their maximum value is far smaller than 2^64, still consume close to 8 bytes/token.
Proposal
Enable shape-preserving leaf compression for FixedSizeList in Lance 2.3.
Core semantics:
During writing, the FSL wrapper is responsible for maintaining row boundaries; the leaf encoder is responsible for compressing the primitive item stream.
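The split between the shape layer and the leaf stream can be sketched as follows. This is a minimal illustration under stated assumptions, not Lance's implementation: `FslPage`, `bitpack`, and `bitunpack` are hypothetical names, values are assumed non-negative, and the bit layout (little-endian, one width per page) is chosen only for simplicity.

```python
from dataclasses import dataclass


def bitpack(values: list[int], width: int) -> bytes:
    """Pack non-negative integers into `width` bits each, little-endian."""
    acc = 0
    for i, v in enumerate(values):
        if v < 0 or v >> width:
            raise ValueError(f"value {v} does not fit in {width} bits")
        acc |= v << (i * width)
    return acc.to_bytes((len(values) * width + 7) // 8, "little")


def bitunpack(buf: bytes, width: int, start: int, count: int) -> list[int]:
    """Decode `count` values beginning at item index `start`.

    For brevity this loads the whole buffer as one integer; a real decoder
    would read only the bytes that cover the requested items.
    """
    acc = int.from_bytes(buf, "little")
    mask = (1 << width) - 1
    return [(acc >> ((start + i) * width)) & mask for i in range(count)]


@dataclass
class FslPage:
    """Hypothetical page: the shape layer (dimension) wraps a packed leaf stream."""

    dimension: int
    bit_width: int
    leaf: bytes

    @classmethod
    def encode(cls, rows: list[list[int]]) -> "FslPage":
        dimension = len(rows[0])
        if any(len(row) != dimension for row in rows):
            raise ValueError("all rows must have the same length")
        items = [v for row in rows for v in row]  # flatten into the leaf stream
        bit_width = max(1, max(items).bit_length())
        return cls(dimension, bit_width, bitpack(items, bit_width))

    def take(self, row: int) -> list[int]:
        # Row boundaries come from the shape layer alone: row i covers
        # items [i * dimension, (i + 1) * dimension) of the leaf stream.
        return bitunpack(self.leaf, self.bit_width, row * self.dimension, self.dimension)
```

With two rows of three token IDs below 2^17, `FslPage.encode` picks a 17-bit width and the leaf stream takes 13 bytes instead of the 48 bytes needed for raw 64-bit values, while `take(row)` still needs only the dimension and the bit width to locate a row.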
This approach simultaneously ensures that:
- take(row) / range scans can still perform chunk-local decoding.
- FixedSizeList.
The physical encoding for a token column could then become:
This encoding still expresses the same Arrow logical schema:
The change occurs only at the physical encoding level.
Applicability
The same shape-preserving leaf compression mechanism can serve:
- FixedSizeList<UInt64>: integer bitpacking / frame-of-reference / delta-like encoding
- FixedSizeList<Int64>: signed integer encoding, zigzag, frame-of-reference
- FixedSizeList<Float32>: byte-stream splitting, general compression
- FixedSizeList<Boolean>: bitmap / sparsity-aware boolean encoding
- FixedSizeList<FixedSizeList<T>>: stacked shape layers over a single primitive leaf stream
All these types share the same core requirement: the shape layer is responsible for restoring the logical layout, and the leaf layer is responsible for selecting the physical encoding based on the primitive value domain.
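As one concrete case from the list above, zigzag lets a signed Int64 leaf stream reuse an unsigned bitpacking path by mapping small-magnitude negatives to small unsigned values. The sketch below shows the standard zigzag mapping; the helper name `required_bits` is hypothetical and only illustrates how a writer might derive a bit width from it.

```python
def zigzag_encode(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    if not -(1 << 63) <= n < (1 << 63):
        raise ValueError("value out of int64 range")
    # Python's >> on negative ints is arithmetic, so n >> 63 is 0 or -1.
    return (n << 1) ^ (n >> 63)


def zigzag_decode(z: int) -> int:
    """Invert zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def required_bits(values: list[int]) -> int:
    """Bit width needed to bitpack the zigzag-encoded values (hypothetical helper)."""
    return max(1, max(zigzag_encode(v) for v in values).bit_length())
```

For example, the values -3, 5, 0, -1 encode to 5, 10, 0, 1, so four bits per value are enough even though the raw type is 64 bits wide.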
Compatibility Boundary
This capability can be introduced as a Lance 2.3 physical encoding feature. The compatibility boundary is:
The reason is that although the existing proto shape allows FixedSizeList.values = CompressiveEncoding, the old reader’s implementation semantics do not fully support arbitrary inner encodings. If a new writer directly writes the following under an old file version:
the old reader may panic, report an internal error, or misinterpret the buffers. This risk cannot be avoided simply by relying on the presence of a proto field.
Therefore, Lance 2.3 should explicitly introduce this feature as a physical encoding capability:
- FixedSizeList(values = supported leaf encoding).
- FixedSizeList -> Flat.
The recommended strategy is to gate this capability through the Lance 2.3 file version, because it changes the reader’s semantic contract for FSL inner encodings.
Writer Behavior
A Lance 2.3 writer can enable FSL leaf compression when the following conditions are met:
If the benefit is insufficient, the writer continues to write flat. High-entropy 64-bit values entering the FSL path should still maintain the size-based fallback.
The writer does not require explicit user configuration for token columns. The default strategy should be driven by data statistics.
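A statistics-driven choice with a size-based fallback could look roughly like the sketch below. The function name, the encoding labels, and the 10% savings threshold are assumptions made for illustration; the proposal does not specify them.

```python
from dataclasses import dataclass


@dataclass
class LeafChoice:
    encoding: str        # "flat" or "bitpacking" (illustrative labels)
    bits_per_value: int


def choose_leaf_encoding(
    values: list[int],
    raw_bits: int = 64,
    min_savings: float = 0.10,  # assumed threshold, not from the proposal
) -> LeafChoice:
    """Pick bitpacking only when it saves at least `min_savings` of the raw size."""
    if not values or min(values) < 0:
        # Empty or signed data: this sketch simply falls back to flat.
        return LeafChoice("flat", raw_bits)
    width = max(1, max(values).bit_length())
    savings = 1.0 - width / raw_bits
    if savings >= min_savings:
        return LeafChoice("bitpacking", width)
    # High-entropy values that need nearly all 64 bits stay flat.
    return LeafChoice("flat", raw_bits)
```

Token IDs with a maximum of 131071 would be bitpacked at 17 bits, while a chunk containing a value such as 2^63 would keep the flat layout, with no user configuration involved in either case.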
Reader Behavior
A Lance 2.3 reader should change FSL decompression to a wrapper model:
The reader must retain the old path:
and simultaneously support the new path:
When encountering an unsupported inner encoding, the error should directly state the missing capability, for example:
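The wrapper model and the capability error can be sketched as follows. Everything here is hypothetical: the decoder table, the encoding labels, the parameter names, and in particular the error text, which is invented for illustration and is not the message from the proposal.

```python
class UnsupportedEncodingError(Exception):
    """Raised when the reader lacks a decoder for the FSL inner encoding."""


def decode_flat(buf: bytes, params: dict) -> list[int]:
    # Old path: raw 8-byte little-endian unsigned values.
    return [int.from_bytes(buf[i:i + 8], "little") for i in range(0, len(buf), 8)]


def decode_bitpacked(buf: bytes, params: dict) -> list[int]:
    # New path: fixed-width bitpacked values, little-endian.
    width, count = params["bit_width"], params["num_items"]
    acc = int.from_bytes(buf, "little")
    mask = (1 << width) - 1
    return [(acc >> (i * width)) & mask for i in range(count)]


LEAF_DECODERS = {"flat": decode_flat, "bitpacking": decode_bitpacked}


def decode_fsl(dimension: int, inner: str, buf: bytes, params: dict) -> list[list[int]]:
    """Decode the leaf stream first, then let the FSL wrapper restore row shape."""
    decoder = LEAF_DECODERS.get(inner)
    if decoder is None:
        raise UnsupportedEncodingError(
            f"FixedSizeList inner encoding '{inner}' is not supported by this reader"
        )
    items = decoder(buf, params)
    if len(items) % dimension != 0:
        raise ValueError("leaf item count is not a multiple of the dimension")
    return [items[i:i + dimension] for i in range(0, len(items), dimension)]
```

Keeping the shape logic in `decode_fsl` and the value logic in the per-encoding decoders means the old flat path and any new leaf encoding share the same row reconstruction.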
Expected Impact
For token workloads:
Real-world results will be affected by chunk size, validity, page metadata, signedness, and fallback strategy, but the goal should be to eliminate the ~4× size gap caused by FSL blocking primitive compression.
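The size gap follows from the figures quoted in the abstract. The short calculation below reproduces it, assuming the vocabulary is exactly 128 × 1024 entries and ignoring page metadata and validity overhead.

```python
import math

vocab_size = 128 * 1024          # assumed exact size for "on the order of 128K"
observed_bytes_per_token = 8.03  # Lance v2.2 figure quoted in the abstract

# A fixed-width code needs ceil(log2(vocab_size)) bits per token.
bits_per_token = math.ceil(math.log2(vocab_size))
packed_bytes_per_token = bits_per_token / 8

# Ratio between the observed size and the fixed-width bitpacked size.
size_gap = observed_bytes_per_token / packed_bytes_per_token
```

This gives 17 bits per token, 2.125 bytes per token, and a ratio of about 3.78, which is where the roughly 4× figure comes from.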
For non-token workloads:
Lance 2.3 Scope
This proposal suggests limiting the Lance 2.3 implementation scope to the physical encoding layer:
- FixedSizeList<T>[dimension].
- FixedSizeList -> Flat reader path remains readable.
- List encoding can continue to evolve as an independent design.