feat: index-included columns (covering index) for vector search #6909
Replies: 4 comments 2 replies
-
+1 for this. Some points to discuss:
-
I love this idea. The only thing I want to mention is that I would prefer to have concepts called "system columns" and "payload columns" inside the index:
system columns:
  pk_fingerprint # hidden, for MemWAL stale suppression
  maybe shard/gen # optional, if needed by LSM planner
payload columns:
  category
  source
  small metadata columns
This way, users don't need to specify the
-
I think this is very reasonable. The work in #6899 targeted the same problem: how can we provide a prefilter of stale row IDs to queries? This is another route to get there.
This isn't (or at least shouldn't be) true by design. It's the
-
+1 to the idea; it seems like a very natural extension. If we make any changes to the general proto (e.g. to indicate which fields are "covered" by the index), then I'd just request we do it in a way that is generic (e.g. we may want to do something like this for other index types as well).
-
Motivation
Today's Lance vector index stores only vectors (possibly quantized) and rowids. Any additional column value needed after index search (for PK dedup, bloom filter checking, or user projections) requires a TakeExec that does random I/O by rowid into data files. For small-k, low-latency vector search workloads, this take step dominates query latency.
This problem surfaces concretely in two areas:
1. LSM stale-read suppression
The MemWAL LSM architecture searches multiple generations (base table, flushed memtables, active memtable) independently, then merges results with a global PK dedup. When a PK's fresh version falls out of its source's top-k, the stale copy from an older generation can leak through.
The correct fix is a pre-search exclusion filter: during HNSW/IVF traversal in older generations, skip any candidate whose PK exists in a newer generation (checked via a bloom filter). But today the PK value is not available during index traversal, because the index only has rowids. Getting the PK requires a take, which defeats the purpose of filtering during traversal.
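To make the stale-read leak and the exclusion filter concrete, here is a minimal toy sketch in plain Python. It is not Lance code: `Row`, `top_k`, and `merge_dedup` are hypothetical names, brute-force ranking stands in for HNSW/IVF traversal, and an exact set of PKs stands in for the bloom filter (a real bloom filter could also return false positives). The point it illustrates is that the filter needs `r.pk` while candidates are being ranked, which is exactly what the index does not have today.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    pk: str
    vec: tuple
    gen: int  # higher generation = newer data (hypothetical field)


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(rows, query, k, exclude_pks=frozenset()):
    """Per-generation search. Brute force stands in for index traversal;
    candidates whose PK is in exclude_pks are skipped before ranking."""
    candidates = [r for r in rows if r.pk not in exclude_pks]
    return sorted(candidates, key=lambda r: sq_dist(r.vec, query))[:k]


def merge_dedup(results, query, k):
    """Global merge: keep the newest version seen per PK, then re-rank."""
    newest = {}
    for r in results:
        if r.pk not in newest or r.gen > newest[r.pk].gen:
            newest[r.pk] = r
    return sorted(newest.values(), key=lambda r: sq_dist(r.vec, query))[:k]


query, k = (0.0, 0.0), 1
base = [Row("a", (0.1, 0.0), gen=0), Row("b", (5.0, 5.0), gen=0)]
# "a" was updated and moved far from the query; "c" is a new row.
memtable = [Row("a", (9.0, 9.0), gen=1), Row("c", (0.5, 0.0), gen=1)]

# Without a filter: the fresh "a" falls out of the memtable's top-k,
# so the dedup never sees it and the stale base copy wins.
naive = merge_dedup(top_k(base, query, k) + top_k(memtable, query, k), query, k)

# With a pre-search exclusion filter on the older generation.
newer_pks = frozenset(r.pk for r in memtable)  # stand-in for a bloom filter
filtered = merge_dedup(
    top_k(base, query, k, exclude_pks=newer_pks) + top_k(memtable, query, k),
    query,
    k,
)

print("naive:   ", naive[0])
print("filtered:", filtered[0])
```

In this toy setup the unfiltered merge returns the stale generation-0 copy of `a`, while the filtered run returns `c`.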
PR #6899 works around this by building a per-query GenPkIndex (scanning the entire base table's PK + rowid columns on every query) to construct a rowid-based block mask. This is O(base_table_size) per query, which is not scalable.
2. General query performance
The most common vector search pattern is "embed → search → return a few metadata columns." Production vector databases such as Pinecone, Qdrant, Elasticsearch, Milvus, and Weaviate store metadata/payload alongside vectors. Lance currently requires a random I/O take for every additional column, even for small, frequently-read columns like IDs or categories.
Proposal
Allow vector indexes to include additional columns alongside the vector data and rowid. These included columns are stored in the index at build time and returned directly from index search, eliminating the take step for covered queries.
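As a rough illustration of what "covered" means here, the following toy sketch in plain Python keeps a few included columns next to each vector and counts how many take-style lookups a query needs. It is not the proposed Lance API: `ToyVectorIndex`, `run_query`, the `include` parameter, and the column names are all hypothetical, and brute-force search stands in for a real vector index.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ToyVectorIndex:
    """Brute-force stand-in for a vector index with included columns."""

    def __init__(self, table, vector_col, include=()):
        self.include = tuple(include)
        # Each entry stores rowid, vector, and a copy of the included columns.
        self.entries = [
            (rowid, row[vector_col], {c: row[c] for c in self.include})
            for rowid, row in enumerate(table)
        ]

    def search(self, query, k):
        ranked = sorted(self.entries, key=lambda e: sq_dist(e[1], query))[:k]
        return [(rowid, payload) for rowid, _, payload in ranked]


def run_query(table, index, query, k, columns, stats):
    """Return the requested columns, falling back to a take by rowid
    only when the index does not cover all of them."""
    out = []
    for rowid, payload in index.search(query, k):
        if all(c in payload for c in columns):
            out.append({c: payload[c] for c in columns})
        else:
            stats["takes"] += 1  # simulated random I/O into data files
            out.append({c: table[rowid][c] for c in columns})
    return out


table = [
    {"vector": (0.0, 0.1), "category": "news", "body": "long text 0"},
    {"vector": (0.2, 0.0), "category": "blog", "body": "long text 1"},
    {"vector": (4.0, 4.0), "category": "docs", "body": "long text 2"},
]
index = ToyVectorIndex(table, "vector", include=["category"])

covered_stats = {"takes": 0}
covered = run_query(table, index, (0.0, 0.0), 2, ["category"], covered_stats)

uncovered_stats = {"takes": 0}
uncovered = run_query(
    table, index, (0.0, 0.0), 2, ["category", "body"], uncovered_stats
)

print(covered, covered_stats)
print(uncovered, uncovered_stats)
```

The covered query is answered from the index entries alone, while the query that also asks for `body` falls back to one take per result row. The cost of the included columns is the extra copy held in every index entry.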
API sketch
Tradeoffs
Context