Re-evaluating stable row ID implementation #6933
Replies: 2 comments
-
Current State
Proposed Direction
Roadmap
|
Beta Was this translation helpful? Give feedback.
-
The "index-backed vs. inline metadata" choice is a false dichotomyThe framing in the community discussions so far has largely been: "inline row-id metadata is usually more efficient than a BTree, so an index-backed design is a regression." That's true as stated — but it frames a generic BTree as the alternative, and a BTree is the worst case for our data, not the target. Stable row IDs are mostly-ordered and largely sequential. The current inline sequence already exploits that: it's effectively a piecewise-linear model with ε=0 (integer runs of The axis I'd rather evaluate: is there a general, maintained index type that keeps the best-case-when-ordered behavior of the inline sequence, but degrades to no worse than a BTree? That's a well-studied class — piecewise-linear / learned indexes — and two are directly relevant:
Both generalize what I want to be honest about the risk: rigorous on-disk benchmarks (SIGMOD'23) show learned indexes are not a universal win over a tuned B+-tree for arbitrary data (B+-tree wins on p99/robustness; most learned indexes are larger on disk, PGM excepted). The advantage shows up for genuinely near-linear data — i.e. exactly the row-id / identity-column regime, not the general case. So this is a "measure it" claim, not a "trust me" claim. Proposed next step — and why it isn't blocked on infra. Two parallel workstreams intersect here, but neither needs to gate the experiment:
So the concrete experiment: prototype a PGM/FITing-tree-style segment index for the row-id map (Rust crates like On identity columns: I'd decouple this entirely. A user-facing identity / generated column is a valuable feature on its own, and whoever's interested should feel free to pursue it in parallel — it shouldn't be blocked on, or framed as, "replacing stable row ID." The shared payoff is just that if the experiment above gives us a good order-exploiting index type, both features can reuse it. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
Stable row IDs are becoming an important Lance primitive. They provide a durable logical row identifier that can survive physical rewrites such as compaction and updates.
Today, stable row IDs are implemented as a dedicated Lance system feature. The manifest tracks the stable-row-id feature flag and
next_row_id. Each fragment carriesrow_id_meta, which points to that fragment's row ID sequence, either inline in metadata or through an external file reference. At runtime, Lance materializes a row-id-to-row-address lookup from these per-fragment sequences and caches it.This raises an implementation question:
Could stable row IDs instead be implemented as a system-managed identity column backed by a maintained index?
This would preserve
_rowidas an internal Lance system column. The question is mostly about implementation: should stable row ID remain a purpose-built storage path, or should it become a special case of a more general identity-column and write-maintained-index mechanism?Related prior discussion: #6466
Requirements
Stable row IDs should continue to support:
The updated design should preserve these capabilities even if the internal implementation changes.
Current approach: dedicated row ID metadata
The current design has a purpose-built row ID storage path:
next_row_idis tracked in the manifestrow_id_metaPros
The current model is compact and tailored to the row ID use case.
In many cases, row ID metadata can be encoded in or near the manifest, which means lookup metadata may be available without extra index IO. The row ID index is also guaranteed to be updated on insert because it is part of the write path today.
There may also be an algorithmic advantage. Stable row IDs are usually ordered and mostly sequential. A specialized row ID structure can exploit this and may support cheaper maintenance than a generic BTree. For example, some updates may be closer to
O(N)instead ofO(N log N).Cons
The current model is a separate indexing path from Lance's general scalar index system. That adds long-term format and implementation surface area:
It may also create many fragmented row ID metadata file references over time and potentially inflate the manifest.
Another concern is behavior under frequent row-wide updates. As rows are rewritten, the row ID mapping may become less compact and less efficient. If the specialized structure degrades enough, a maintained scalar index could become more attractive despite higher baseline overhead.
Alternative approach: identity column + maintained index
The alternative is to model stable row ID as:
In this framing, stable row ID properties come from identity-column and write-maintained-index semantics:
Pros
This would reuse Lance's existing index infrastructure instead of maintaining a one-off row ID lookup system.
Stable row ID lookup could benefit from existing or shared mechanisms for:
It also moves row ID metadata out of the manifest and into index files. With segmented indexes, write-time deltas could be committed as new index segments and merged later. With transaction support that can atomically commit data files and index updates, data and stable row ID index maintenance could become part of one composite write.
This design also connects stable row ID to two broader Lance features:
Identity columns
Stable row ID could be the internal version of a more general identity-column feature. That could later support user-defined identity columns, generated columns, optional start/step configuration, and explicit semantics around uniqueness, immutability, and gaps.
Write-time index maintenance
Stable row ID needs its lookup structure to remain current as data changes. Other lightweight indexes may want similar treatment. If an index is cheap enough to maintain during writes, users could choose to pay write-time cost in exchange for immediately available lookup/index behavior.
Cons
A maintained BTree may require more IO than the current design, especially for simple append-heavy workloads where row ID metadata is already available from the manifest or fragment metadata.
A generic BTree may also be less efficient than a row-id-specific structure for mostly ordered, sequential IDs. If stable row IDs have predictable ranges, a specialized range-like index may be smaller and cheaper to update than a BTree.
There is also implementation risk. Lance needs a robust mechanism for keeping indexes up to date during writes. Segmented indexes and transaction improvements make this more feasible, but the correctness story needs to be clear before stable row ID depends on it.
Dedicated sequence-like index type
The choice may not be limited to the current custom row ID metadata versus a generic BTree.
If there is an index structure that is better than BTree for mostly ordered, sequence-like data, and that structure is useful beyond
_rowid, then it may be worth supporting as a dedicated Lance index type.That could matter if Lance eventually supports general identity columns. Identity columns often produce ordered or mostly ordered values, and they may need efficient point lookup, range lookup, uniqueness checks, and write-time maintenance. A sequence-optimized index could serve both stable row IDs and future identity columns without making stable row ID a one-off special case.
In that model:
This would preserve the possibility of a row-id-optimized structure while still moving toward a more general indexing abstraction.
Core questions
Should stable row ID remain a purpose-built metadata path, or should it become a system-managed identity column backed by the maintained index framework?
If it becomes index-backed, should the physical index be a generic BTree, or should Lance support a dedicated sequence-optimized index type that can also serve future identity columns?
Can write-time index maintenance provide the correctness guarantees needed for stable row ID use cases, including update-by-id, returning generated IDs, reserving IDs before append, and incremental refresh workflows?
What are the practical IO, storage, and lookup tradeoffs between the current design and an index-backed design for append-heavy workloads versus workloads with frequent row-wide updates?
Beta Was this translation helpful? Give feedback.
All reactions