feat(semantic): improve code indexing with tree-sitter, enriched embeddings, and smart re-indexing #25

Merged: samestrin merged 8 commits into main from feat/semantic-indexing-improvements on Apr 11, 2026
@samestrin
Owner

Summary

Five major improvements to the semantic code indexing system:

  • Signature-enriched embeddings: Prepend a metadata header of the form "[language:type] signature in filepath" to each chunk's content before embedding, improving retrieval quality with zero schema changes
  • Content-aware diff re-indexing: Only re-embed chunks whose content actually changed; reuse embeddings for moved-but-unchanged code via content hash tracking
  • Overlapping chunks with parent context: Add configurable line overlap (default 3) between adjacent chunks and prepend enclosing scope (package/class) to embedding text
  • Tree-sitter chunkers: Replace regex-based Python/JS/TS/PHP/Rust chunkers with pure-Go tree-sitter AST parsing (odvcencio/gotreesitter) for accurate function boundary detection
  • Chunk-level dependency graph: Extract function calls, imports, and type references during indexing; store in chunk_refs table with batch resolution for caller/callee queries

Also updates default embedding API URL from Ollama (port 11434) to TEI (port 8080) and makes error messages provider-agnostic.

Test plan

  • All existing tests pass (5 packages, 0 failures)
  • Race detection passes (go test -race)
  • New tests for all features: embed enrichment, diff reindex, overlap, tree-sitter chunkers, ref storage/extraction
  • Schema migrations are idempotent and backward compatible
  • Adversarial review completed and all critical/high issues fixed

Prepend structured metadata ([language:type] signature in filepath) to
chunk content before embedding. The stored Content field is unchanged —
only the text sent to the embedder is enriched. This helps the embedding
model understand code context (language, type, location) for better
retrieval quality.

Also updates default embedding API URL from Ollama (11434) to TEI (8080)
and makes error messages provider-agnostic.
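
A minimal sketch of the enrichment step described above, for illustration only. The `Chunk` fields, the `EmbedText` function name, and the newline separating the header from the content are assumptions; only the `[language:type] signature in filepath` header shape comes from this PR.

```go
package main

import "fmt"

// Chunk is a hypothetical, trimmed-down stand-in for the project's chunk type.
type Chunk struct {
	Language  string // e.g. "go"
	Type      string // e.g. "function"
	Signature string // declaration line of the symbol
	FilePath  string
	Content   string // stored as-is; never modified by enrichment
}

// EmbedText builds the text sent to the embedder: a one-line metadata
// header followed by the unchanged chunk content.
// The newline separator is an assumption, not taken from the PR.
func EmbedText(c Chunk) string {
	header := fmt.Sprintf("[%s:%s] %s in %s", c.Language, c.Type, c.Signature, c.FilePath)
	return header + "\n" + c.Content
}

func main() {
	c := Chunk{
		Language:  "go",
		Type:      "function",
		Signature: "func Add(a, b int) int",
		FilePath:  "math/add.go",
		Content:   "func Add(a, b int) int { return a + b }",
	}
	fmt.Println(EmbedText(c))
	fmt.Println("stored content unchanged:", c.Content)
}
```

Because the header lives only in the text passed to the embedder, search results can still display the original `Content` without any stripping step.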

Add ContentHash field to Chunk, DiffChunks() algorithm for comparing
old vs new chunks by content hash, schema migration for content_hash
column, and GetChunksByFilePath/ReadEmbeddings storage methods.

This enables re-indexing to only re-embed chunks whose content actually
changed, and to reuse embeddings for code that moved but didn't change.
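
The hash-based comparison could look roughly like the sketch below. The PR names `DiffChunks()` and `ContentHash`, but the signature, the `Diff` result type, the `newChunk` helper, and the use of SHA-256 are assumptions made for this example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Chunk is a hypothetical, trimmed-down chunk record.
type Chunk struct {
	ID          string
	Content     string
	ContentHash string
}

// HashContent returns a hex SHA-256 digest of the content.
// The hash function is an assumption; the PR does not name one.
func HashContent(content string) string {
	sum := sha256.Sum256([]byte(content))
	return hex.EncodeToString(sum[:])
}

// Diff groups chunks by what re-indexing has to do with them.
type Diff struct {
	Reused  map[string]string // new chunk ID -> old chunk ID whose embedding is reused
	Changed []Chunk           // new chunks that need a fresh embedding
	Removed []Chunk           // old chunks with no match in the new set
}

// DiffChunks matches new chunks to old ones purely by content hash, so a
// chunk that moved within the file still finds its old embedding.
// Chunk IDs are assumed to be unique within each slice.
func DiffChunks(oldChunks, newChunks []Chunk) Diff {
	byHash := make(map[string][]Chunk)
	for _, c := range oldChunks {
		byHash[c.ContentHash] = append(byHash[c.ContentHash], c)
	}
	d := Diff{Reused: make(map[string]string)}
	used := make(map[string]bool)
	for _, n := range newChunks {
		candidates := byHash[n.ContentHash]
		if len(candidates) == 0 {
			d.Changed = append(d.Changed, n)
			continue
		}
		// Consume one old chunk per match so duplicated content pairs up 1:1.
		d.Reused[n.ID] = candidates[0].ID
		used[candidates[0].ID] = true
		byHash[n.ContentHash] = candidates[1:]
	}
	for _, o := range oldChunks {
		if !used[o.ID] {
			d.Removed = append(d.Removed, o)
		}
	}
	return d
}

// newChunk is a helper for the example that fills in the hash.
func newChunk(id, content string) Chunk {
	return Chunk{ID: id, Content: content, ContentHash: HashContent(content)}
}

func main() {
	oldChunks := []Chunk{newChunk("old-1", "func a() {}"), newChunk("old-2", "func b() {}")}
	// b moved to the top unchanged; a was edited.
	newChunks := []Chunk{newChunk("new-1", "func b() {}"), newChunk("new-2", "func a() { return }")}
	d := DiffChunks(oldChunks, newChunks)
	fmt.Println("reused:", d.Reused)
	fmt.Println("re-embed:", len(d.Changed), "removed:", len(d.Removed))
}
```

Matching on the hash rather than on line ranges is what lets moved code keep its embedding; only entries in `Changed` would be sent to the embedding API.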
…em 5)

Replace the delete-all + reprocess pattern in Update() with
processFileWithDiff() which uses DiffChunks() to compare old vs new
chunks by content hash. Only changed chunks get re-embedded; moved but
unchanged code reuses existing embeddings. Also compute ContentHash
during initial indexing via chunkFile().

Post-process chunks to add configurable line overlap between adjacent
chunks and prepend enclosing scope context (package/class name).
Enrichment goes into EmbedText field only — stored Content is unchanged.

Defaults: 3-line overlap + parent context enabled. Configurable via
--overlap-lines and --no-parent-context CLI flags.

Includes DeduplicateOverlapping() for removing redundant search results
caused by chunk overlap.
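
One way to picture the overlap post-processing is the sketch below. It assumes the overlap extends each chunk by N lines in both directions (clamped to the file) and that the parent context is a single header line; the `ApplyOverlap` name and signature are hypothetical and may differ from the actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// Chunk is a hypothetical, trimmed-down chunk record.
type Chunk struct {
	StartLine int    // 1-based, inclusive
	EndLine   int    // 1-based, inclusive
	Content   string // stored text; left untouched
	EmbedText string // enriched text sent to the embedder
}

// ApplyOverlap returns copies of the chunks with EmbedText set to the
// chunk's lines widened by `overlap` lines on each side, optionally
// prefixed by a parent-scope line such as "package demo".
// Assumes 1 <= StartLine <= EndLine <= len(fileLines) and overlap >= 0.
func ApplyOverlap(fileLines []string, chunks []Chunk, overlap int, parent string) []Chunk {
	out := make([]Chunk, len(chunks))
	for i, c := range chunks {
		start := c.StartLine - overlap
		if start < 1 {
			start = 1
		}
		end := c.EndLine + overlap
		if end > len(fileLines) {
			end = len(fileLines)
		}
		text := strings.Join(fileLines[start-1:end], "\n")
		if parent != "" {
			text = parent + "\n" + text
		}
		c.EmbedText = text // c is a copy, so the input slice is not modified
		out[i] = c
	}
	return out
}

func main() {
	lines := []string{"l1", "l2", "l3", "l4", "l5", "l6", "l7", "l8"}
	chunks := []Chunk{
		{StartLine: 1, EndLine: 4, Content: "l1\nl2\nl3\nl4"},
		{StartLine: 5, EndLine: 8, Content: "l5\nl6\nl7\nl8"},
	}
	for _, c := range ApplyOverlap(lines, chunks, 2, "package demo") {
		fmt.Printf("lines %d-%d embed text:\n%s\n\n", c.StartLine, c.EndLine, c.EmbedText)
	}
}
```

Since adjacent chunks now share lines in their embedded text, two neighbours can both match the same query, which is the situation `DeduplicateOverlapping()` is described as handling.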
… (Item 1)

Replace regex-based chunkers with tree-sitter AST parsing using the
pure-Go gotreesitter runtime (no CGo). Provides accurate function
boundary detection, nested structure handling, decorator support, and
generics for all supported languages.

Go chunker stays as-is (already uses go/ast). Old regex chunkers remain
available as fallback but are no longer registered by default.

New package: internal/semantic/treesitter/ with shared helpers and
per-language chunker implementations.

Introduce ChunkRef type and RefExtractor/RefStorage interfaces for
tracking references between chunks. SQLite storage implements the
chunk_refs table with StoreRefs, GetRefs, GetCallers, DeleteRefsByChunk,
and ResolveRefs methods.

Go chunker implements RefExtractor to extract function calls, imports,
and type references from AST. Refs are automatically extracted and
stored during indexing when the chunker supports it.

ResolveRefs batch-resolves symbol names to chunk IDs for caller/callee
queries at search time.
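
As a rough illustration of AST-based reference extraction, the sketch below uses only the standard `go/parser` and `go/ast` packages to collect imports and call targets. The `Ref` type and `ExtractRefs` function are hypothetical simplifications, not the PR's `ChunkRef` or `RefExtractor`, and type references are left out for brevity.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
)

// Ref is a hypothetical, simplified reference record.
type Ref struct {
	Kind string // "import" or "call"
	Name string // import path or called symbol
}

// ExtractRefs parses Go source and returns imports followed by call
// targets in source order. Conversions such as string(x) parse as call
// expressions too, so a real extractor would need to filter them.
func ExtractRefs(src string) ([]Ref, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "snippet.go", src, 0)
	if err != nil {
		return nil, err
	}
	var refs []Ref
	for _, imp := range file.Imports {
		path, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			return nil, err
		}
		refs = append(refs, Ref{Kind: "import", Name: path})
	}
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		switch fn := call.Fun.(type) {
		case *ast.Ident: // local call: helper(x)
			refs = append(refs, Ref{Kind: "call", Name: fn.Name})
		case *ast.SelectorExpr: // qualified call: pkg.Func(x) or recv.Method(x)
			if x, ok := fn.X.(*ast.Ident); ok {
				refs = append(refs, Ref{Kind: "call", Name: x.Name + "." + fn.Sel.Name})
			}
		}
		return true
	})
	return refs, nil
}

func main() {
	src := `package demo

import "strings"

func helper(s string) string { return strings.ToUpper(s) }

func Run() string { return helper("x") }
`
	refs, err := ExtractRefs(src)
	if err != nil {
		panic(err)
	}
	for _, r := range refs {
		fmt.Printf("%s %s\n", r.Kind, r.Name)
	}
}
```

The extracted names are plain strings at this point; mapping them to chunk IDs is the separate batch step the commit message attributes to `ResolveRefs`.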

Critical fixes:
- Fix unique constraint violation in processFileWithDiff when reusing
  chunks with same ID (always delete before create)
- Fix overlapConfig race condition by adding mutex protection
- Fix chunksCreated double-count for fallback reuse items
- Set overlapConfig in Update() path (was missing, overlap silently disabled)
- Add OverlapLines/IncludeParentContext to UpdateOptions
- Guard os.ReadFile errors in overlap application paths
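
The mutex fix for `overlapConfig` presumably follows the usual guarded-field pattern. The sketch below shows that pattern under assumed names (`Indexer`, `SetOverlapConfig`, `Overlap`); only `OverlapLines` and `IncludeParentContext` appear in this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// OverlapConfig mirrors the two options named in the PR.
type OverlapConfig struct {
	OverlapLines         int
	IncludeParentContext bool
}

// Indexer is a hypothetical owner of the shared configuration.
type Indexer struct {
	mu      sync.RWMutex
	overlap OverlapConfig
}

// SetOverlapConfig replaces the configuration under the write lock.
func (ix *Indexer) SetOverlapConfig(cfg OverlapConfig) {
	ix.mu.Lock()
	defer ix.mu.Unlock()
	ix.overlap = cfg
}

// Overlap returns a copy of the configuration under the read lock, so
// callers never observe a partially written value.
func (ix *Indexer) Overlap() OverlapConfig {
	ix.mu.RLock()
	defer ix.mu.RUnlock()
	return ix.overlap
}

func main() {
	ix := &Indexer{}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			ix.SetOverlapConfig(OverlapConfig{OverlapLines: n, IncludeParentContext: true})
			_ = ix.Overlap() // concurrent read alongside the writes
		}(i)
	}
	wg.Wait()
	fmt.Println("parent context enabled:", ix.Overlap().IncludeParentContext)
}
```

Running a program like this with `go run -race` is a quick way to confirm that the guarded accessors remove the data race.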
@samestrin merged commit d7dc9fb into main on Apr 11, 2026
1 check passed
@samestrin deleted the feat/semantic-indexing-improvements branch on April 11, 2026 19:04