[Proposal]Blob v2 has no cross-row dedup API— proposing a ref_id schema field to add one #6592
Replies: 2 comments 2 replies
-
|
this maybe somehow related to this discussion which also addresses duplicated blob data #6736 |
Beta Was this translation helpful? Give feedback.
-
|
I agree this is a very common problem. One solution I often see to this problem is to use two tables. One table (fewer, wider rows) contains the blob contents. The other (more, narrower rows) contains the metadata. For example, one table contains videos and another table contains segments. The segments table then points back into the videos table (classic one-to-many relationship with foreign key). What advantages do you see to this approach over the two-table approach? |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
TL;DR
Blob v2 today has no way to share bytes across rows. Every row owns its own payload; the preprocessor / encoder have no hook to say "this row reuses that row's blob." There is no API, no descriptor field, no internal cache. If 20 rows carry the same image, we write 20 copies of the image bytes. Dedicated path → 20 sidecar files. Packed path → 20 appended copies. Inline path → 20 buffer regions.
Today's behavior: 20 rows referencing the same payload → 20 copies
Concretely, here's what users write today when 20 rows conceptually reference
the same 6 MB payload. There is no way in the current API to express "these
rows share one blob":
What ends up on disk today — 20 independent blobs, each fully materialized:
The 120 MB is entirely avoidable bloat — the same 6 MB written 20 times. In
real multimodal training workloads where 8–1000 rows commonly reference one
object, this bloat becomes 8×–1000× of the raw data size.
A Common problem
Blob v2 today assumes 1 row = 1 blob. Every row owns its own bytes.
But real workloads often have rows at different frequencies aligned into a single table:
The shared object is not restricted to any particular format — it can be a raw image, a video GOP, a compressed archive (tar / zip containing many files), a point cloud, an audio chunk, a model checkpoint shard — anything that multiple rows logically reference as a single opaque byte blob.
Today the only options are: (a) duplicate the low-freq column 10–1000×, (b) downsample (lose data), or (c) two tables + runtime JOIN (slow, breaks columnar scans). None is good.
What we actually want: "physically low-frequency, logically high-frequency" — every row stays a complete record; storage keeps one copy.
The proposal: add
ref_idso users can express "these rows share one blob"We want to add a
ref_id: u32field so rows with the same positiveref_idshare one physical blob.ref_id = 0or null means no sharing (existing behavior — unchanged).Reference implementation: PR #6600
— a minimal, non-invasive version that keeps the on-disk Blob v2 descriptor at
5 fields (unchanged from upstream).
ref_idlives in memory only during thewrite; it is consumed by the preprocessor and encoder, then dropped before
any byte touches disk. Already running in our production workloads, stable,
~188 lines of production Rust changed.
The same example, with the proposed API:
What ends up on disk with
ref_id:Read-back is byte-identical to the no-dedup case — every row sees the full
6 MB payload. The sharing is invisible to readers; only the disk layout
differs.
How a single batch flows from
write_datasetinto RustThis section will recap the workflow when the Python
code above executes. The dedup primitive hooks into two existing layers — we
don't add a new stage. The goal here is to make it easy to see exactly where
the new behavior fits.
At a glance
flowchart TD P["<b>Python</b><br/>20 × Blob(data=bytes, ref_id=42)<br/>lance.write_dataset(batch, ...)"] P -->|PyO3| R R["<b>Rust orchestration</b><br/>Dataset::write → InsertBuilder::execute_stream<br/>→ write_fragments_internal <i>(gate: version ≥ 2.2)</i><br/>→ do_write_fragments"] R --> V V["V2WriterAdapter::write<br/><i>one BlobPreprocessor per fragment</i>"] V --> H1 H1(["<b>Hook 1 · BlobPreprocessor::preprocess_batch</b><br/>5-field Struct ──▶ 7-field Struct<br/>Packed / Dedicated dedup<br/><i>via ref_id_sidecar_cache</i>"]) H1 --> F F["FileWriter::write_batch<br/>encode_batch — per-column dispatch"] F --> H2 H2(["<b>Hook 2 · BlobV2StructuralEncoder::maybe_encode</b><br/>7-field Struct ──▶ 5-field descriptor<br/>Inline dedup<br/><i>via ref_dedup_tmp_map</i>"]) H2 --> D D[("<b>On disk</b><br/>kind · position · size · blob_id · blob_uri<br/><i>ref_id not persisted</i>")] classDef hook fill:#fff3e0,stroke:#e65100,stroke-width:2.5px,color:#3e2723 classDef normal fill:#fafafa,stroke:#9e9e9e,color:#212121 classDef disk fill:#e0f2f1,stroke:#00695c,stroke-width:2px,color:#004d40 class H1,H2 hook class P,R,V,F normal class D diskThe two pill-shaped amber nodes are the only additions by this PR;
everything else is the existing Blob v2 write path. Each hook transforms the
data shape (5 → 7 → 5 fields) and consults / updates one in-memory cache.
ref_idexits the pipeline at Hook 2 — it is never persisted to disk.Entry point: Python → Rust
At the Python layer the
batchis apyarrow.RecordBatchwhosepayloadcolumn is a
BlobType(lance.blob.v2) ExtensionArray. Its storage is a5-field StructArray:
lance.write_dataset(...)crosses the PyO3 boundary viapython/src/dataset.rs::write_dataset, which parses options intoWriteParamsand calls
Dataset::writein Rust. ExtensionArray is unwrapped to its storageStructArray on the way; extension identity survives as field metadata
(
ARROW:extension:name = "lance.blob.v2").Rust orchestration:
Dataset::write→do_write_fragmentsThe Rust side of the write pipeline is a thin chain of builders before the
per-batch work starts:
For the example above (one 20-row batch),
break_streamdoesn't splitanything — the chunk is the whole batch. A single
V2WriterAdapteris createdfor the fragment, and its bundled
BlobPreprocessorwill see this chunk.The Blob v2 branch:
V2WriterAdapter::write→BlobPreprocessorV2WriterAdapteris the layer that knows the difference between a plaincolumnar write and a write that carries blob v2 columns. The preprocessor
runs only when the fragment's schema has blob v2 columns:
preprocess_blob_batchesis a thin loop over batches; the real work isper-batch:
Crucially, the
BlobPreprocessorlives per fragment: its caches arecreated when the fragment's
V2WriterAdapteris created, and are dropped whenthe fragment is finalized. The cache is not shared across fragments in the
same write call, and nothing persists across write calls. This is the
property that makes the dedup strictly a write-time primitive.
preprocess_batch: input / output shapeThe preprocessor's job is to transform the user-facing 5-field struct into a
7-field intermediate struct the Lance file writer understands, writing any
Packed / Dedicated sidecar files to disk along the way.
Input (5 fields, user-supplied):
Output (7 fields, passed to the encoder):
The crucial bit: all 20 rows end up with the same
blob_id = 1afterpreprocessing. Exactly one sidecar file is written; the 19 later rows hit the
cache and reuse the coordinates.
Where the dedup decision happens
BlobPreprocessorholds one cache, keyed byref_id, covering the twopreprocessor-owned paths (Packed and Dedicated):
The row-level loop inside
preprocess_batchconsults the cache beforerouting by size. For the example (20 rows, 6 MB each, ref_id=42):
write_dedicated(blob_id=1, bytes),cache.insert(42, Dedicated{1, 6307500})Dedicated{1, 6307500}blob_id=1,continueThe relevant Rust snippet:
Inline path: a symmetric cache lives in the encoder
For Inline blobs (
data_len <= 64 KB) the preprocessor does not write bytesitself — the actual placement happens one layer down, in
BlobV2StructuralEncoder::maybe_encode, because only the encoder knows theout-of-line buffer offset. A symmetric cache keyed by
ref_idlives there:The two caches partition the problem cleanly:
BlobPreprocessor.ref_id_sidecar_cache— Packed + Dedicated (sidecar files)BlobV2StructuralEncoder.ref_dedup_tmp_map— Inline (main-file out-of-line buffer)One-line summary of the write path
ref_idBlob(ref_id=42)BlobArray.from_pylistBlobPreprocessor::preprocess_batchBlobV2StructuralEncoder::maybe_encodelance-file::v2::FileWriter::write_batchStages 4 and 5 are the only places in the whole pipeline that read
ref_id.The cache state they maintain is per-fragment, and once the writer is
finalized both caches are dropped. The descriptor that actually lands on disk
contains the shared
blob_id(for Dedicated/Packed) or sharedposition(for Inline) — which is exactly how dedup is expressed to the reader, without
any reader-side change required.
Beta Was this translation helpful? Give feedback.
All reactions