[Proposal] Blob v2 has no cross-row dedup API: proposing a ref_id schema field to add one #6592

DanielMao1 · 2026-04-22T06:08:26Z

TL;DR

Blob v2 today has no way to share bytes across rows. Every row owns its own payload; the preprocessor / encoder have no hook to say "this row reuses that row's blob." There is no API, no descriptor field, no internal cache. If 20 rows carry the same image, we write 20 copies of the image bytes. Dedicated path → 20 sidecar files. Packed path → 20 appended copies. Inline path → 20 buffer regions.

Today's behavior: 20 rows referencing the same payload → 20 copies

Concretely, here's what users write today when 20 rows conceptually reference
the same 6 MB payload. There is no way in the current API to express "these
rows share one blob":

import lance
import numpy as np
import pyarrow as pa
from lance.blob import Blob, blob_array, blob_field

# ~6 MB image blob (1450×1450 RGB). Could equally be encoded JPEG/PNG,
# a video mp4, a tar of smaller files, or any other opaque byte blob.
shared_bytes = np.random.default_rng(42).integers(
    0, 256, size=(1450, 1450, 3), dtype=np.uint8
).tobytes()
print(f"Payload size: {len(shared_bytes) / 2**20:.2f} MB")   # ~6.02 MB

# The only way to write 20 rows carrying this payload today:
blobs = blob_array([shared_bytes] * 20)
#   (equivalent: blob_array([Blob(data=shared_bytes) for _ in range(20)]))
# No primitive exists to say "these 20 rows share the same blob."

batch = pa.RecordBatch.from_arrays(
    [pa.array(range(20), type=pa.int64()), blobs],
    schema=pa.schema([pa.field("row_idx", pa.int64()), blob_field("payload")]),
)

lance.write_dataset(batch, "ds.lance", mode="create", data_storage_version="2.2")

What ends up on disk today — 20 independent blobs, each fully materialized:

ds.lance/
├── _versions/
│   └── 1.manifest
└── data/
    ├── <stem>.lance                              # main file, 20 rows × descriptors
    └── <stem>/
        ├── blob_1.blob    (6 MB copy 1)
        ├── blob_2.blob    (6 MB copy 2)
        ├── blob_3.blob    (6 MB copy 3)
        ├── ... (17 more identical 6 MB files)
        └── blob_20.blob   (6 MB copy 20)
    total sidecar storage: 120 MB                 # all 20 copies are byte-identical

The 120 MB is entirely avoidable bloat: the same 6 MB written 20 times. In
real multimodal training workloads, where 8–1000 rows commonly reference one
object, the bytes written grow to 8×–1000× the raw data size.
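The arithmetic behind these figures can be checked directly. The sketch below uses only the payload shape from the example above; the helper name `sidecar_bytes` is made up for illustration and is not part of Lance.

```python
# Storage cost of the example above, with and without sharing.
PAYLOAD_BYTES = 1450 * 1450 * 3          # one uint8 RGB image: 6,307,500 bytes
MIB = 2**20


def sidecar_bytes(rows: int, rows_per_blob: int = 1) -> int:
    """Bytes written when every `rows_per_blob` rows share one physical copy."""
    physical_copies = -(-rows // rows_per_blob)   # ceiling division
    return physical_copies * PAYLOAD_BYTES


today = sidecar_bytes(rows=20)                      # 1 row = 1 blob
shared = sidecar_bytes(rows=20, rows_per_blob=20)   # all 20 rows share one blob

print(f"today:  {today / MIB:.1f} MiB")
print(f"shared: {shared / MIB:.1f} MiB")
print(f"ratio:  {today // shared}x")
```

Changing `rows_per_blob` to 8 or 1000 reproduces the range quoted for real workloads.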

A common problem

Blob v2 today assumes 1 row = 1 blob. Every row owns its own bytes.

But real workloads often have rows at different frequencies aligned into a single table:

| Low-frequency (shared) | High-frequency (per-row) |
| --- | --- |
| Reference image / keyframe / video GOP / packaged archive (tar, zip) | per-frame label, bbox, annotation |
| LiDAR scan | per-object detection |
| Video embedding | caption / QA pair |
| RL observation | action / reward |

The shared object is not restricted to any particular format — it can be a raw image, a video GOP, a compressed archive (tar / zip containing many files), a point cloud, an audio chunk, a model checkpoint shard — anything that multiple rows logically reference as a single opaque byte blob.

Today the only options are: (a) duplicate the low-freq column 10–1000×, (b) downsample (lose data), or (c) two tables + runtime JOIN (slow, breaks columnar scans). None is good.

What we actually want: "physically low-frequency, logically high-frequency" — every row stays a complete record; storage keeps one copy.

The proposal: add `ref_id` so users can express "these rows share one blob"

We want to add a ref_id: u32 field so rows with the same positive ref_id share one physical blob. ref_id = 0 or null means no sharing (existing behavior — unchanged).
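To make the rule concrete, here is a minimal sketch of how many physical blobs a set of rows would produce under it. It assumes all rows land in a single fragment and that rows with the same ref_id really do carry the same bytes; `count_physical_blobs` is a hypothetical helper, not an API in the PR.

```python
from typing import Optional, Sequence


def count_physical_blobs(ref_ids: Sequence[Optional[int]]) -> int:
    """Number of physical blobs under the proposed rule, within one fragment.

    Rows with the same positive ref_id share one blob; ref_id 0 or None
    means "no sharing", so each such row keeps its own blob.
    """
    shared_groups = set()
    unshared_rows = 0
    for ref_id in ref_ids:
        if ref_id is None or ref_id == 0:
            unshared_rows += 1
        else:
            shared_groups.add(ref_id)
    return len(shared_groups) + unshared_rows


print(count_physical_blobs([42] * 20))            # 20 rows, one shared blob
print(count_physical_blobs([0, None, 7, 7, 9]))   # two unshared rows + groups {7, 9}
```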

Reference implementation: PR #6600, a minimal, non-invasive version that
keeps the on-disk Blob v2 descriptor at 5 fields (unchanged from upstream).
ref_id lives in memory only during the write; it is consumed by the
preprocessor and encoder, then dropped before any byte touches disk. It is
already running stably in our production workloads, with ~188 lines of
production Rust changed.

The same example, with the proposed API:

# Every row carries ref_id=42 -> Lance dedups on write
blobs = blob_array([Blob(data=shared_bytes, ref_id=42) for _ in range(20)])
# ... everything else identical to above ...

What ends up on disk with ref_id:

ds.lance/
├── _versions/
│   └── 1.manifest
└── data/
    ├── <stem>.lance                              # same 20 rows, descriptors all share one blob_id
    └── <stem>/
        └── blob_1.blob    (6 MB, single physical copy)
    total sidecar storage: 6 MB                   # 20× reduction

Read-back is byte-identical to the no-dedup case — every row sees the full
6 MB payload. The sharing is invisible to readers; only the disk layout
differs.

How a single batch flows from `write_dataset` into Rust

This section recaps what happens when the Python code above executes. The
dedup primitive hooks into two existing layers; we don't add a new stage.
The goal here is to make it easy to see exactly where the new behavior fits.

At a glance

flowchart TD
    P["<b>Python</b><br/>20 × Blob(data=bytes, ref_id=42)<br/>lance.write_dataset(batch, ...)"]
    P -->|PyO3| R

    R["<b>Rust orchestration</b><br/>Dataset::write → InsertBuilder::execute_stream<br/>→ write_fragments_internal <i>(gate: version ≥ 2.2)</i><br/>→ do_write_fragments"]
    R --> V

    V["V2WriterAdapter::write<br/><i>one BlobPreprocessor per fragment</i>"]
    V --> H1

    H1(["<b>Hook 1 · BlobPreprocessor::preprocess_batch</b><br/>5-field Struct ──▶ 7-field Struct<br/>Packed / Dedicated dedup<br/><i>via ref_id_sidecar_cache</i>"])
    H1 --> F

    F["FileWriter::write_batch<br/>encode_batch — per-column dispatch"]
    F --> H2

    H2(["<b>Hook 2 · BlobV2StructuralEncoder::maybe_encode</b><br/>7-field Struct ──▶ 5-field descriptor<br/>Inline dedup<br/><i>via ref_dedup_tmp_map</i>"])
    H2 --> D

    D[("<b>On disk</b><br/>kind · position · size · blob_id · blob_uri<br/><i>ref_id not persisted</i>")]

    classDef hook fill:#fff3e0,stroke:#e65100,stroke-width:2.5px,color:#3e2723
    classDef normal fill:#fafafa,stroke:#9e9e9e,color:#212121
    classDef disk fill:#e0f2f1,stroke:#00695c,stroke-width:2px,color:#004d40

    class H1,H2 hook
    class P,R,V,F normal
    class D disk

The two pill-shaped amber nodes are the only additions by this PR;
everything else is the existing Blob v2 write path. Each hook transforms the
data shape (5 → 7 → 5 fields) and consults / updates one in-memory cache.
ref_id exits the pipeline at Hook 2 — it is never persisted to disk.

Entry point: Python → Rust

lance.write_dataset(batch, "ds.lance", mode="create", data_storage_version="2.2")

At the Python layer the batch is a pyarrow.RecordBatch whose payload
column is a BlobType (lance.blob.v2) ExtensionArray. Its storage is a
5-field StructArray:

Field "payload": Struct {
    data:     LargeBinary    # [shared_bytes × 20]  (PyArrow materializes every row)
    uri:      Utf8           # [null × 20]
    position: UInt64         # [null × 20]
    size:     UInt64         # [null × 20]
    ref_id:   UInt32         # [42 × 20]            <-- the new field
}

lance.write_dataset(...) crosses the PyO3 boundary via
python/src/dataset.rs::write_dataset, which parses options into WriteParams
and calls Dataset::write in Rust. ExtensionArray is unwrapped to its storage
StructArray on the way; extension identity survives as field metadata
(ARROW:extension:name = "lance.blob.v2").

Rust orchestration: `Dataset::write` → `do_write_fragments`

The Rust side of the write pipeline is a thin chain of builders before the
per-batch work starts:

Dataset::write                                 rust/lance/src/dataset.rs
    └─> InsertBuilder::execute_stream          rust/lance/src/dataset/write/insert.rs
        └─> write_uncommitted_stream_impl
            └─> write_fragments_internal       rust/lance/src/dataset/write.rs
                │   # gate: requires file format >= 2.2 for blob v2 columns
                │   let storage_version = ...;
                │   if storage_version < V2_2 && schema.has_blob_v2()
                │       { return Err("Blob v2 requires file version >= 2.2"); }
                └─> do_write_fragments
                    │   # chunk the input stream by max_rows_per_file
                    │   let mut buffered_reader = break_stream(data, max_rows_per_file);
                    │
                    │   # WriterGenerator creates V2WriterAdapter per fragment;
                    │   # each fragment owns one BlobPreprocessor (if blob v2 present)
                    │   let writer_generator = WriterGenerator::new(..., external_blob_mode, ...);
                    │
                    └─> while let Some(chunk) = buffered_reader.next().await {
                           writer.write(&chunk).await?;   // <-- per-chunk entry
                        }

For the example above (one 20-row batch), break_stream doesn't split
anything — the chunk is the whole batch. A single V2WriterAdapter is created
for the fragment, and its bundled BlobPreprocessor will see this chunk.

The Blob v2 branch: `V2WriterAdapter::write` → `BlobPreprocessor`

V2WriterAdapter is the layer that knows the difference between a plain
columnar write and a write that carries blob v2 columns. The preprocessor
runs only when the fragment's schema has blob v2 columns:

// rust/lance/src/dataset/write.rs  (V2WriterAdapter::write)
impl GenericWriter for V2WriterAdapter {
    async fn write(&mut self, batches: &[RecordBatch]) -> Result<()> {
        if let Some(pre) = self.preprocessor.as_mut() {
            let processed = preprocess_blob_batches(batches, pre).await?;  // <-- blob-aware
            for batch in processed {
                self.writer.write_batch(&batch).await?;  // <-- normal v2 file writer
            }
        } else {
            for batch in batches {
                self.writer.write_batch(batch).await?;
            }
        }
        Ok(())
    }
}

preprocess_blob_batches is a thin loop over batches; the real work is
per-batch:

pub async fn preprocess_blob_batches(batches: &[RecordBatch], pre: &mut BlobPreprocessor)
    -> Result<Vec<RecordBatch>>
{
    let mut out = Vec::with_capacity(batches.len());
    for batch in batches {
        out.push(pre.preprocess_batch(batch).await?);   // <-- the dedup decision point
    }
    Ok(out)
}

Crucially, the BlobPreprocessor lives per fragment: its caches are
created when the fragment's V2WriterAdapter is created, and are dropped when
the fragment is finalized. The cache is not shared across fragments in the
same write call, and nothing persists across write calls. This is the
property that makes the dedup strictly a write-time primitive.
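One consequence of the per-fragment scope is worth spelling out: if the input is split across several fragments, each fragment writes its own copy of a shared blob. The sketch below is a toy model of that behavior, not the Rust code; `split_into_fragments` stands in for the chunking by max_rows_per_file, and both function names are hypothetical.

```python
from typing import Iterator, List


def split_into_fragments(ref_ids: List[int], max_rows_per_file: int) -> Iterator[List[int]]:
    """Stand-in for chunking the input stream by max_rows_per_file."""
    for start in range(0, len(ref_ids), max_rows_per_file):
        yield ref_ids[start:start + max_rows_per_file]


def sidecars_written(ref_ids: List[int], max_rows_per_file: int) -> int:
    """Count physical writes when the dedup cache is scoped to one fragment."""
    written = 0
    for fragment in split_into_fragments(ref_ids, max_rows_per_file):
        cache = set()                 # fresh cache per fragment, dropped afterwards
        for ref_id in fragment:
            if ref_id > 0 and ref_id in cache:
                continue              # cache hit: reuse the existing sidecar
            written += 1              # cache miss (or ref_id == 0): write bytes
            if ref_id > 0:
                cache.add(ref_id)
    return written


rows = [42] * 20
print(sidecars_written(rows, max_rows_per_file=20))   # one fragment
print(sidecars_written(rows, max_rows_per_file=8))    # fragments of 8, 8 and 4 rows
```

In this model the number of physical copies equals the number of fragments a ref_id spans, so sharing is most effective when rows with the same ref_id are written close together.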

`preprocess_batch`: input / output shape

The preprocessor's job is to transform the user-facing 5-field struct into a
7-field intermediate struct the Lance file writer understands, writing any
Packed / Dedicated sidecar files to disk along the way.

Input (5 fields, user-supplied):

Struct "payload" (20 rows)
├── data     : LargeBinary  [shared_bytes × 20]
├── uri      : Utf8         [null × 20]
├── position : UInt64       [null × 20]
├── size     : UInt64       [null × 20]
└── ref_id   : UInt32       [42 × 20]               <-- read here for dedup

Output (7 fields, passed to the encoder):

Struct "payload" (20 rows)
├── kind      : UInt8       [Dedicated × 20]         <-- new: routing decision
├── data      : LargeBinary [null × 20]              (bytes moved to sidecar)
├── uri       : Utf8        [null × 20]
├── blob_id   : UInt32      [1 × 20]                 <-- new: all share blob_id=1
├── blob_size : UInt64      [6307500 × 20]           <-- new: actual payload size
├── position  : UInt64      [null × 20]              (Dedicated has no in-file offset)
└── ref_id    : UInt32      [42 × 20]                (carried through for Inline dedup)

The crucial bit: all 20 rows end up with the same blob_id = 1 after
preprocessing. Exactly one sidecar file is written; the 19 later rows hit the
cache and reuse the coordinates.
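The `kind` column above is the result of routing each payload by size. A rough sketch of that routing follows. The 64 KB Inline limit comes from this write-up (taken here as 64 × 1024 bytes, which is an assumption); the Dedicated threshold is not stated anywhere in this post, so it is a caller-supplied parameter and the 4 MiB used below is an illustrative value only. `BlobRoute` and `route_by_size` are hypothetical names, not the real `BlobKind` enum.

```python
from enum import Enum

INLINE_MAX_BYTES = 64 * 1024   # Inline covers data_len <= 64 KB (assumed 64 * 1024)


class BlobRoute(Enum):
    INLINE = "inline"          # bytes go to the main file's out-of-line buffer
    PACKED = "packed"          # bytes are appended to a shared sidecar file
    DEDICATED = "dedicated"    # bytes get their own sidecar file


def route_by_size(data_len: int, dedicated_threshold: int) -> BlobRoute:
    """Pick a storage path from the payload size alone.

    `dedicated_threshold` is a placeholder: the text only says Dedicated
    applies above a threshold, without giving its value.
    """
    if data_len <= INLINE_MAX_BYTES:
        return BlobRoute.INLINE
    if data_len > dedicated_threshold:
        return BlobRoute.DEDICATED
    return BlobRoute.PACKED


ASSUMED_THRESHOLD = 4 * 2**20   # illustrative value only
print(route_by_size(1450 * 1450 * 3, ASSUMED_THRESHOLD))   # the 6 MB example payload
print(route_by_size(10_000, ASSUMED_THRESHOLD))
print(route_by_size(1_000_000, ASSUMED_THRESHOLD))
```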

Where the dedup decision happens

BlobPreprocessor holds one cache, keyed by ref_id, covering the two
preprocessor-owned paths (Packed and Dedicated):

pub struct BlobPreprocessor {
    object_store: ObjectStore,
    local_counter: u32,                                // allocates blob_id
    pack_writer: PackWriter,                           // rolling Packed sidecar writer
    ...
    ref_id_sidecar_cache: HashMap<u32, SidecarRef>,    // <-- new in this PR
}

#[derive(Clone, Copy)]
enum SidecarRef {
    Dedicated { blob_id: u32, size: u64 },
    Packed    { blob_id: u32, position: u64, size: u64 },
}

The row-level loop inside preprocess_batch consults the cache before
routing by size. For the example (20 rows, 6 MB each, ref_id=42):

| Row | Cache lookup (key = 42) | Action | Writes a sidecar? |
| --- | --- | --- | --- |
| 0 | miss | `write_dedicated(blob_id=1, bytes)`, `cache.insert(42, Dedicated{1, 6307500})` | yes (1 file) |
| 1 | hit → `Dedicated{1, 6307500}` | emit descriptor with `blob_id=1`, `continue` | no |
| 2..19 | hit | same as row 1 | no |

The relevant Rust snippet:

// 0 (or null) means no sharing; non-zero values participate in dedup.
let ref_id = ref_id_col.as_ref()
    .filter(|c| !c.is_null(i)).map(|c| c.value(i)).unwrap_or(0);

// Early cache hit: reuse a previously-written sidecar blob.
if ref_id > 0 {
    if let Some(cached) = self.ref_id_sidecar_cache.get(&ref_id).copied() {
        match cached {
            SidecarRef::Dedicated { blob_id, size } => {
                kind_builder.append_value(BlobKind::Dedicated as u8);
                blob_id_builder.append_value(blob_id);
                blob_size_builder.append_value(size);
                // no write_dedicated, no new sidecar
                continue;
            }
            SidecarRef::Packed { blob_id, position, size } => { /* analogous */ }
        }
    }
}

// Normal write path below. For Dedicated (> threshold):
let blob_id = self.next_blob_id();
self.write_dedicated(blob_id, BlobWriteSource::Bytes(bytes)).await?;
if ref_id > 0 {
    self.ref_id_sidecar_cache.insert(
        ref_id, SidecarRef::Dedicated { blob_id, size: data_len as u64 },
    );
}
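For readers who prefer Python, the same control flow can be modeled in a few lines. This is a toy restatement of the Dedicated branch only, with in-memory dicts standing in for the object store; `Descriptor` and `preprocess_dedicated` are hypothetical names and this is not how the PR is implemented.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Descriptor:
    blob_id: int
    blob_size: int


def preprocess_dedicated(
    rows: List[Tuple[bytes, Optional[int]]],
) -> Tuple[List[Descriptor], Dict[int, bytes]]:
    """Toy model of the Dedicated branch: returns descriptors and sidecar files."""
    cache: Dict[int, Descriptor] = {}      # plays the role of ref_id_sidecar_cache
    sidecars: Dict[int, bytes] = {}        # blob_id -> bytes "written to disk"
    descriptors: List[Descriptor] = []
    next_blob_id = 1

    for data, raw_ref_id in rows:
        ref_id = raw_ref_id or 0           # None and 0 both mean "no sharing"
        if ref_id > 0 and ref_id in cache:
            descriptors.append(cache[ref_id])   # hit: no new sidecar
            continue
        desc = Descriptor(blob_id=next_blob_id, blob_size=len(data))
        sidecars[desc.blob_id] = data           # miss: write one sidecar
        next_blob_id += 1
        if ref_id > 0:
            cache[ref_id] = desc
        descriptors.append(desc)
    return descriptors, sidecars


payload = bytes(1450 * 1450 * 3)           # same size as the example payload
descs, files = preprocess_dedicated([(payload, 42)] * 20)
print(len(files), {d.blob_id for d in descs}, descs[0].blob_size)
```

With ref_id set to 0 or None on every row, the same function writes one sidecar per row, which matches today's behavior.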

Inline path: a symmetric cache lives in the encoder

For Inline blobs (data_len <= 64 KB) the preprocessor does not write bytes
itself — the actual placement happens one layer down, in
BlobV2StructuralEncoder::maybe_encode, because only the encoder knows the
out-of-line buffer offset. A symmetric cache keyed by ref_id lives there:

pub struct BlobV2StructuralEncoder {
    descriptor_encoder: Box<dyn FieldEncoder>,
    ref_dedup_tmp_map: HashMap<u32, (u64, u64)>,   // ref_id -> (position, size)
}

BlobKind::Inline => {
    if ref_id > 0 {
        if let Some(&(pos, sz)) = self.ref_dedup_tmp_map.get(&ref_id) {
            (pos, sz)                                   // reuse, no add_buffer
        } else {
            let pos = external_buffers.add_buffer(bytes);
            self.ref_dedup_tmp_map.insert(ref_id, (pos, bytes.len() as u64));
            (pos, bytes.len() as u64)
        }
    } else { /* unconditional add_buffer */ }
}

The two caches partition the problem cleanly:

- BlobPreprocessor.ref_id_sidecar_cache: Packed + Dedicated (sidecar files)
- BlobV2StructuralEncoder.ref_dedup_tmp_map: Inline (main-file out-of-line buffer)
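The Inline side can be sketched the same way. The model below assumes the out-of-line buffer is append-only and that `add_buffer` returns the offset of the bytes it just appended; that is a simplification for illustration, and `OutOfLineBuffer` and `encode_inline` are hypothetical names rather than the encoder's real types.

```python
from typing import Dict, List, Tuple


class OutOfLineBuffer:
    """Toy append-only buffer; add_buffer returns the offset of the new bytes."""

    def __init__(self) -> None:
        self.data = bytearray()

    def add_buffer(self, payload: bytes) -> int:
        position = len(self.data)
        self.data.extend(payload)
        return position


def encode_inline(rows: List[Tuple[bytes, int]], buf: OutOfLineBuffer) -> List[Tuple[int, int]]:
    """Return (position, size) per row, reusing coordinates for repeated ref_ids."""
    seen: Dict[int, Tuple[int, int]] = {}    # plays the role of ref_dedup_tmp_map
    out: List[Tuple[int, int]] = []
    for payload, ref_id in rows:
        if ref_id > 0 and ref_id in seen:
            out.append(seen[ref_id])         # reuse, nothing appended
            continue
        coords = (buf.add_buffer(payload), len(payload))
        if ref_id > 0:
            seen[ref_id] = coords
        out.append(coords)
    return out


buf = OutOfLineBuffer()
coords = encode_inline([(b"thumb", 7), (b"icon", 0), (b"thumb", 7)], buf)
print(coords, len(buf.data))
```

Rows 0 and 2 end up with identical (position, size) pairs, and the buffer holds the shared payload only once.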

One-line summary of the write path

| Stage | Function | What it does with `ref_id` |
| --- | --- | --- |
| 1 | Python `Blob(ref_id=42)` | user-supplied value |
| 2 | `BlobArray.from_pylist` | packs into 5th column of input Struct |
| 3 | PyO3 boundary | carried across unchanged |
| 4 | `BlobPreprocessor::preprocess_batch` | consult + insert cache; dedup Packed/Dedicated |
| 5 | `BlobV2StructuralEncoder::maybe_encode` | consult + insert cache; dedup Inline |
| 6 | `lance-file::v2::FileWriter::write_batch` | writes the final descriptor page |

Stages 4 and 5 are the only places in the whole pipeline that read ref_id.
The cache state they maintain is per-fragment, and once the writer is
finalized both caches are dropped. The descriptor that actually lands on disk
contains the shared blob_id (for Dedicated/Packed) or shared position
(for Inline) — which is exactly how dedup is expressed to the reader, without
any reader-side change required.
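Seen from the reader's side, the point is that each row is resolved on its own and sharing never has to be detected. The sketch below is a deliberately reduced model: it keeps only `blob_id` and `size` from the on-disk descriptor and uses a dict in place of sidecar files, and `DiskDescriptor` and `read_rows` are hypothetical names.

```python
from typing import Dict, List, NamedTuple


class DiskDescriptor(NamedTuple):
    blob_id: int
    size: int


def read_rows(descriptors: List[DiskDescriptor], sidecars: Dict[int, bytes]) -> List[bytes]:
    """Resolve every row independently; the reader never looks for sharing."""
    return [sidecars[d.blob_id][: d.size] for d in descriptors]


sidecars = {1: b"shared-image-bytes"}                      # one physical copy
descriptors = [DiskDescriptor(blob_id=1, size=len(sidecars[1]))] * 20
rows = read_rows(descriptors, sidecars)
print(len(rows), len(set(rows)), len(sidecars))
```

All 20 rows come back with the full payload even though only one copy exists in storage.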

hpvd · 2026-05-12T09:11:39Z

This may be related to discussion #6736, which also addresses duplicated blob data.

0 replies

westonpace · 2026-05-12T20:53:28Z

Maintainer

I agree this is a very common problem. One solution I often see to this problem is to use two tables. One table (fewer, wider rows) contains the blob contents. The other (more, narrower rows) contains the metadata. For example, one table contains videos and another table contains segments. The segments table then points back into the videos table (classic one-to-many relationship with foreign key).

What advantages do you see to this approach over the two-table approach?

2 replies

DanielMao1 May 15, 2026
Author

> I agree this is a very common problem. One solution I often see to this problem is to use two tables. One table (fewer, wider rows) contains the blob contents. The other (more, narrower rows) contains the metadata. For example, one table contains videos and another table contains segments. The segments table then points back into the videos table (classic one-to-many relationship with foreign key).
>
> What advantages do you see to this approach over the two-table approach?

The two-table pattern is clean when the 1:N relationship is part of the user's conceptual model — e.g., "segments belong to videos" naturally fits a foreign key. It is much less clean when the dedup is purely a storage-layer accident:

(a) random-read latency. On our hot path the dataloader randomly picks frame rows from a 50-episode dataset. With ref_id each row's descriptor points directly at the physical blob: one S3 Range GET, cold or warm. With two tables, each random frame becomes "lookup segments → resolve FK → fetch from videos", which is two GETs on a cold cache even if the second one is small. We'd be doubling RTT on the path that matters most.

(b) the dedup boundary is below the schema. Multiple rows sharing the same physical bytes isn't an N:1 user-visible relationship for us; it's a storage-layer accident (8 consecutive frames happen to live in the same GOP blob). Forcing the user to model that as segments(frame_no, video_id, gop_idx) + videos(video_id, blob) leaks an encoding detail (GOP=8) into every consumer's query. ref_id keeps that detail in storage.

(c) writes. Streaming converters from LeRobot and similar tools work row by row. With ref_id the converter writes one table and the write path dedups rows that share a ref_id. Two tables require either two-phase commits or user-side coordination to keep FK integrity.

hpvd May 15, 2026

@DanielMao1 thanks for giving these details!

Please see also #6736
