feat: index-included columns (covering index) for vector search #6909

jackye1995 · 2026-05-22T06:28:14Z

jackye1995
May 22, 2026
Maintainer

Motivation

Today's Lance vector index stores only vectors (possibly quantized) and rowids. Any additional column value needed after index search — for PK dedup, bloom filter checking, or user projections — requires a TakeExec that does random I/O by rowid into data files. For small-k, low-latency vector search workloads, this take step dominates query latency.

This problem surfaces concretely in two areas:

1. LSM stale-read suppression

The MemWAL LSM architecture searches multiple generations (base table, flushed memtables, active memtable) independently, then merges results with a global PK dedup. When a PK's fresh version falls out of its source's top-k, the stale copy from an older generation can leak through.

The correct fix is a pre-search exclusion filter: during HNSW/IVF traversal in older generations, skip any candidate whose PK exists in a newer generation (via bloom filter). But today the PK value is not available during index traversal — the index only has rowids. Getting the PK requires a take, which defeats the purpose of filtering during traversal.
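
To make the failure mode and the proposed fix concrete, here is a toy sketch. It assumes a brute-force scan standing in for HNSW/IVF traversal and a plain Python set standing in for the bloom filter (exact membership, no false positives). All function names are hypothetical and are not Lance APIs.

```python
# Toy model: each generation is a list of (pk, vector) rows.
# Brute force stands in for HNSW/IVF traversal; names are illustrative only.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search(rows, query, k, exclude_pks=frozenset()):
    # Candidates whose PK is in exclude_pks are skipped during the search,
    # so they never occupy a top-k slot.
    hits = [(dist(vec, query), pk) for pk, vec in rows if pk not in exclude_pks]
    return sorted(hits)[:k]

def merge_dedup(per_gen_hits, k):
    # per_gen_hits is ordered newest generation first; the first hit seen
    # for a PK wins, then the merged list is cut to k by distance.
    seen, merged = set(), []
    for hits in per_gen_hits:
        for d, pk in hits:
            if pk not in seen:
                seen.add(pk)
                merged.append((d, pk))
    return sorted(merged)[:k]

query = [0.0, 0.0]
base = [("a", [0.1, 0.0]), ("b", [5.0, 5.0])]      # old "a" is near the query
memtable = [("a", [9.0, 9.0]), ("c", [0.5, 0.0])]  # fresh "a" moved far away
k = 1

# Post-merge dedup only: fresh "a" is outside the memtable's top-1,
# so the dedup never sees it and the stale "a" from base leaks through.
leaky = merge_dedup([search(memtable, query, k), search(base, query, k)], k)

# Pre-search exclusion: the base search skips every PK that exists in the
# newer generation (set used here as a stand-in for a bloom filter).
newer_pks = {pk for pk, _ in memtable}
fixed = merge_dedup(
    [search(memtable, query, k), search(base, query, k, exclude_pks=newer_pks)],
    k,
)
```

In this sketch `leaky` returns the stale copy of "a", while `fixed` returns "c". The exclusion only works because `search` can see each candidate's PK while it scans, which is exactly what the index lacks today.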

PR #6899 works around this by building a per-query GenPkIndex (scanning the entire base table's PK + rowid columns on every query) to construct a rowid-based block mask. This is O(base_table_size) per query, which is not scalable.

2. General query performance

The most common vector search pattern is "embed → search → return a few metadata columns." Every production vector database stores metadata/payload alongside vectors (Pinecone, Qdrant, Elasticsearch, Milvus, Weaviate). Lance currently requires a random I/O take for every additional column, even for small, frequently-read columns like IDs or categories.

Proposal

Allow vector indexes to include additional columns alongside the vector data and rowid. These included columns are stored in the index at build time and returned directly from index search, eliminating the take step for covered queries.

API sketch

# At index creation time
ds.create_index(
    "vector",
    index_type="IVF_PQ",
    include_columns=["pk_id", "category"],  # stored in index alongside vectors
)

# At query time — if projection is fully covered by index + included columns,
# no TakeExec is needed
ds.search(query_vector).select(["pk_id", "category"]).limit(10)
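
As a rough illustration of the "fully covered" check, a planner could compare the projection against what the index can serve by itself. This is only a sketch: `needs_take` is a hypothetical helper, and treating `_rowid` and `_distance` as the columns an index search yields natively is an assumption made for the example.

```python
# Hypothetical planner check; none of these names are Lance APIs.
# Assumption: the index search natively yields a row id and a distance.
INDEX_BUILTIN_COLUMNS = {"_rowid", "_distance"}

def needs_take(projection, include_columns):
    """Return the projected columns the index cannot serve on its own."""
    covered = INDEX_BUILTIN_COLUMNS | set(include_columns)
    return [col for col in projection if col not in covered]

included = ["pk_id", "category"]

# Fully covered projection: nothing is missing, so no take is needed.
missing_a = needs_take(["pk_id", "category"], included)

# "title" is not in the index, so a take is still required for it.
missing_b = needs_take(["pk_id", "title"], included)
```

An empty result would mean the TakeExec can be dropped from the plan; a non-empty one means the take is still needed, though possibly for fewer columns.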

Tradeoffs

| Aspect | Pro | Con |
| --- | --- | --- |
| Index size | Small columns (PK hash, category) add negligible overhead vs vector data | Large or many included columns increase index size |
| Write amplification | - | Included columns stored in both index and data files |
| Read performance | Eliminates random I/O take for covered queries | - |
| Index rebuild | - | Required if included column schema changes |
| Data freshness | - | If an included column's data changes (e.g. via update or merge-insert), the index entries for affected fragments become stale. Those fragments must be invalidated immediately and their indexes rebuilt. Scoping invalidation to per-fragment level keeps this compatible with data evolution operations and horizontal merge-insert. |
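
The per-fragment invalidation in the data freshness tradeoff could be tracked roughly as follows. This is a minimal sketch under assumptions: `CoveringIndexState` and its methods are hypothetical, and it only models bookkeeping for included columns, not how Lance actually records index coverage.

```python
from dataclasses import dataclass, field

@dataclass
class CoveringIndexState:
    """Hypothetical bookkeeping for which fragments the index still covers."""
    include_columns: frozenset
    covered_fragments: set = field(default_factory=set)

    def on_update(self, fragment_ids, updated_columns):
        # Only updates that touch an included column make index entries stale.
        if self.include_columns & set(updated_columns):
            self.covered_fragments -= set(fragment_ids)

    def can_serve(self, fragment_id):
        # Invalidated fragments fall back to the regular take path
        # until their index entries are rebuilt.
        return fragment_id in self.covered_fragments

state = CoveringIndexState(frozenset({"pk_id", "category"}), {0, 1, 2})
state.on_update([1], ["title"])     # unrelated column: nothing invalidated
state.on_update([2], ["category"])  # included column: fragment 2 invalidated
```

After these two updates only fragments 0 and 1 remain covered, so the rebuild cost stays proportional to the fragments that actually changed.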

Context

PR fix(mem_wal): suppress stale LSM vector-search reads via block-list post-filter #6899 — block-list post-filter approach for LSM staleness (builds GenPkIndex per query, O(base_size))
PR fix(mem_wal): exact PK dedup for LSM vector search and point lookup #6856 comment — pre-search bloom filter exclusion proposal (requires PK available during traversal)
Every production vector DB (Pinecone, Qdrant, ES, Milvus, Weaviate) stores metadata alongside index entries

BubbleCal · 2026-05-22T07:03:44Z

BubbleCal
May 22, 2026
Maintainer

+1 for this.
I believe other vector DBs also do this.

Some points to discuss:

Indexing would be much slower if the included columns contain a large column, so the first cut should be internal only.
I'm also thinking that for a column like category, we could probably partition the index by the scalar value for more efficient search.
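
A toy sketch of the partition-by-scalar idea, assuming a brute-force scan in place of a real vector index. The helper names `build_partitions` and `search_partition` are made up for illustration.

```python
from collections import defaultdict

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_partitions(rows):
    # rows: iterable of (rowid, category, vector); one bucket per category.
    parts = defaultdict(list)
    for rowid, category, vec in rows:
        parts[category].append((rowid, vec))
    return parts

def search_partition(parts, query, k, category):
    # Only the bucket for the requested category is scanned at all.
    hits = sorted((dist(vec, query), rowid) for rowid, vec in parts.get(category, []))
    return hits[:k]

rows = [
    (0, "shoes", [0.0, 0.0]),
    (1, "hats", [0.1, 0.0]),
    (2, "shoes", [1.0, 1.0]),
]
parts = build_partitions(rows)
hat_hits = search_partition(parts, [0.0, 0.0], 1, "hats")
```

Here the query filtered to "hats" never touches the "shoes" rows, even though row 0 is the global nearest neighbor.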

2 replies

jackye1995 May 22, 2026
Maintainer Author

Now that we have things like primary key and clustering key in the table definition, I feel one natural choice would be to include those columns in the index automatically, because they would basically never change and they are always pretty small. cc @beinan for your opinion here, since you are using clustering keys; let us know if this would make sense.

Xuanwo May 22, 2026
Maintainer

Exactly what I'm thinking. The PK should not be specified by users.

Xuanwo · 2026-05-22T07:18:41Z

Xuanwo
May 22, 2026
Maintainer

I love this idea. The only thing I want to mention is that I prefer to have concepts called "system columns" and "payload columns" inside the index's auxiliary.idx. For example:

system columns:
  pk_fingerprint    # hidden, for MemWAL stale suppression
  maybe shard/gen   # optional, if needed by LSM planner

payload columns:
  category
  source
  small metadata columns

This way, users don't need to specify the pk in code. They only need to include the payload columns that they actually need. And under the hood, they all can use the same mechanism.
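
Roughly what I have in mind, as a sketch only. The class and function names are hypothetical, and `pk_fingerprint` is just the placeholder name from the list above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IncludedColumn:
    name: str
    kind: str     # "system" or "payload"
    hidden: bool  # hidden columns are never returned to users

def build_layout(payload_columns, has_primary_key):
    # System columns are added automatically; users only list payload columns.
    layout = []
    if has_primary_key:
        layout.append(IncludedColumn("pk_fingerprint", "system", hidden=True))
    layout += [IncludedColumn(c, "payload", hidden=False) for c in payload_columns]
    return layout

layout = build_layout(["category", "source"], has_primary_key=True)
user_visible = [c.name for c in layout if not c.hidden]
```

Both kinds go through the same storage path; the only difference is who decides to include them and whether they show up in query results.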


hamersaw · 2026-05-22T11:18:37Z

hamersaw
May 22, 2026
Collaborator

I think this is very reasonable. The work in #6899 targeted this same problem: how can we provide a prefilter of stale row IDs to queries? This is another route to get there.

> scanning the entire base table's PK + rowid columns on every query

This isn't (or at least shouldn't be) true by design. The GenPkIndex is built once and cached, so perf-wise I imagine it would be similar to this approach, right?


westonpace · 2026-05-25T13:08:03Z

westonpace
May 25, 2026
Maintainer

+1 to the idea, it seems like a very natural extension. If we make any changes to the general proto (e.g. to indicate which fields are "covered" by the index) then I'd just request we do it in a way that is generic (e.g. maybe we will want to do something like this for other index types as well).

