Replies: 6 comments 2 replies
This may be related to discussion #6592, which also addresses duplicated blob data.
I like this idea. I am a little curious how this will eventually evolve to handle blob v2. In blob v2 we can store blobs inside a data file (inline), in a "pack file" (packed) or with a file-per-blob scheme (dedicated / external). The solution here would handle the inline case. I think we can definitely extend it for the packed case. Maybe we just can't do this for the file-per-blob schemes? If we do try to somehow support cross-file references we might have to deal with complexities like "what happens if a row referenced by another row is deleted and that delete is materialized". Anyways, +1 to the general idea. I'll post more specific feedback on the PR.
Thanks @westonpace! I definitely think we should support both inline and blob v2 (specifically the packed format). In our use cases, we often partition the data by ID, which means we naturally need the blobs to be packed externally from the Lance fragment itself. Regarding your concern about materialized deletes breaking cross-references: the impact should actually be quite lightweight. We aren't keeping a single base blob for a long sequence of rows. Instead, we maintain multiple base blobs (creating a new base every 5 diffs on average). Because these delta chains are so short, if a base row is deleted and materialized during compaction, the process would only need to rebase a very small handful of dependent deltas. It prevents the cascading complexity you'd get with deep delta chains.
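As a rough illustration of why the rebase cost stays small, the sketch below counts the rows that would need rebasing when one base is deleted. It assumes every delta depends on the nearest preceding base and that a base is followed by exactly 5 deltas; the function name and the list-of-strings layout are hypothetical and not taken from the PR.

```python
def dependents_of_base(kinds, base_row):
    """Rows that would need rebasing if base_row were deleted.

    Assumption: each delta depends on the nearest preceding base, so the
    dependents are the deltas between base_row and the next base.
    """
    if kinds[base_row] != "base":
        raise ValueError(f"row {base_row} is not a base")
    dependents = []
    for row in range(base_row + 1, len(kinds)):
        if kinds[row] == "base":
            break
        dependents.append(row)
    return dependents


# One base followed by 5 deltas, repeated three times (18 rows in total).
kinds = (["base"] + ["delta"] * 5) * 3
print(dependents_of_base(kinds, 6))  # [7, 8, 9, 10, 11]
```

Under these assumptions the work per deleted base is bounded by the chain length, not by the size of the fragment.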
I think this is like H.265 (video), and we can borrow some ideas from there later (binary-tree references instead of fixed intervals, etc.).
There is now a full description detailing the idea of separating physical and logical blobs in #6736 (comment).
Proposal
Add a new DeltaBlobLayout variant to the PageLayout protobuf oneof in encodings_v2_1.proto, enabling binary delta encoding for blob columns.
Motivation
When storing source code datasets (e.g., file revision histories from Git repositories), consecutive rows often differ by only a few lines. Today each blob is stored independently. Delta encoding can reduce storage by 5-10x for such workloads by storing only the binary differences (copy/insert instructions) between similar values.
This is directly motivated by the Git Lake project — storing the world's Git data in Lance.
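To make "copy/insert instructions" concrete, here is a minimal Python sketch of encoding one revision against another and reconstructing it. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for a real binary delta algorithm; `make_delta`, `apply_delta`, and the tuple instruction format are hypothetical and not taken from the PR.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes):
    """Encode target as a list of copy/insert instructions against base."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))       # reuse bytes already in base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))   # literal bytes not in base
        # "delete" emits nothing: those base bytes are simply never copied
    return ops


def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild the target by replaying the instructions against base."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


v1 = b"def add(a, b):\n    return a + b\n"
v2 = b"def add(a, b):\n    # sum two numbers\n    return a + b\n"

delta = make_delta(v1, v2)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
assert apply_delta(v1, delta) == v2
print(f"stored {literal_bytes} literal bytes instead of {len(v2)}")
```

Only the inserted bytes need to be stored in full; copied ranges cost a fixed-size instruction each, which is where the savings on near-identical revisions would come from.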
Design
The new layout is a thin wrapper, structurally identical to BlobLayout.
How it works
(position, size, kind, base_offset): extends the blob descriptor with kind (DeltaBase=4 or Delta=5 in BlobKind) and base_offset (distance back to the base value).
DeltaBlobPageScheduler reads descriptors, expands ranges to include required bases, loads bytes from external buffers, applies delta chains, and returns reconstructed values.
Backwards compatibility
A new variant in the PageLayout oneof: old readers will see an unknown field and fail gracefully.
Delta encoding is used only when lance-encoding:delta-blob = "true" is set, so existing data is unaffected.
New BlobKind variants (DeltaBase=4, Delta=5) with proper TryFrom<u8> handling.
PR
#6733
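The descriptor and read path described under "How it works", together with the fallible kind parsing mentioned under "Backwards compatibility", can be sketched in Python roughly as follows. This is not code from the PR: the class and function names are hypothetical, `base_offset` is assumed to count rows back to the value a delta applies to, and the toy `apply_delta` (append a suffix) only stands in for a real delta decoder.

```python
from dataclasses import dataclass
from enum import IntEnum


class BlobKind(IntEnum):
    # Only the two values named in the proposal; existing variants are omitted.
    DELTA_BASE = 4
    DELTA = 5


def kind_from_u8(value: int) -> BlobKind:
    """Python analogue of a fallible TryFrom<u8>: unknown bytes are an error."""
    try:
        return BlobKind(value)
    except ValueError:
        raise ValueError(f"unsupported blob kind: {value}") from None


@dataclass
class Descriptor:
    position: int     # byte offset into the external buffer
    size: int         # number of bytes stored for this row
    kind: BlobKind
    base_offset: int  # assumed: rows back to the value this delta applies to


def chain_to_base(descs, row):
    """Rows from `row` back to its base, following base_offset at each step."""
    chain = [row]
    while descs[row].kind is not BlobKind.DELTA_BASE:
        step = descs[row].base_offset
        if step <= 0 or step > row:
            raise ValueError(f"row {row} has an invalid base_offset {step}")
        row -= step
        chain.append(row)
    return chain


def required_rows(descs, wanted):
    """Expand the requested rows to include every row their chains depend on."""
    return sorted({r for row in wanted for r in chain_to_base(descs, row)})


def read_row(descs, buffer, row, apply_delta):
    """Load the base bytes, then replay deltas from oldest to newest."""
    *deltas, base = chain_to_base(descs, row)
    d = descs[base]
    value = buffer[d.position:d.position + d.size]
    for r in reversed(deltas):
        d = descs[r]
        value = apply_delta(value, buffer[d.position:d.position + d.size])
    return value


def append_suffix(base, delta):
    """Toy delta decoder: each delta is a suffix appended to the previous value."""
    return base + delta


buffer = b"v1" + b"+a" + b"+b"
descs = [
    Descriptor(0, 2, kind_from_u8(4), 0),  # base row
    Descriptor(2, 2, kind_from_u8(5), 1),  # delta against row 0
    Descriptor(4, 2, kind_from_u8(5), 1),  # delta against row 1
]
print(required_rows(descs, [2]))                 # [0, 1, 2]
print(read_row(descs, buffer, 2, append_suffix))  # b'v1+a+b'
```

Rejecting an unrecognized kind byte up front, as `kind_from_u8` does, is one way a reader could fail with a clear error instead of misinterpreting a descriptor.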
Vote
Please vote with:
The vote will be open for 72 hours. Requires 3 binding +1 votes and no binding -1 votes to pass.