feat(semantic): improve code indexing with tree-sitter, enriched embeddings, and smart re-indexing by samestrin · Pull Request #25 · samestrin/llm-tools

samestrin · 2026-04-11T19:04:35Z

Summary

Five major improvements to the semantic code indexing system:

- Signature-enriched embeddings: Prepend [language:type] signature in filepath metadata to chunk content before embedding, improving retrieval quality with zero schema changes
- Content-aware diff re-indexing: Only re-embed chunks whose content actually changed; reuse embeddings for moved-but-unchanged code via content hash tracking
- Overlapping chunks with parent context: Add configurable line overlap (default 3) between adjacent chunks and prepend enclosing scope (package/class) to embedding text
- Tree-sitter chunkers: Replace regex-based Python/JS/TS/PHP/Rust chunkers with pure-Go tree-sitter AST parsing (odvcencio/gotreesitter) for accurate function boundary detection
- Chunk-level dependency graph: Extract function calls, imports, and type references during indexing; store in chunk_refs table with batch resolution for caller/callee queries

Also updates default embedding API URL from Ollama (port 11434) to TEI (port 8080) and makes error messages provider-agnostic.

Test plan

- All existing tests pass (5 packages, 0 failures)
- Race detection passes (go test -race)
- New tests for all features: embed enrichment, diff reindex, overlap, tree-sitter chunkers, ref storage/extraction
- Schema migrations are idempotent and backward compatible
- Adversarial review completed and all critical/high issues fixed

Prepend structured metadata ([language:type] signature in filepath) to chunk content before embedding. The stored Content field is unchanged — only the text sent to the embedder is enriched. This helps the embedding model understand code context (language, type, location) for better retrieval quality. Also updates default embedding API URL from Ollama (11434) to TEI (8080) and makes error messages provider-agnostic.
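A minimal sketch of the enrichment step described above, assuming a trimmed-down chunk type. The `Chunk` fields and the `EmbedText` helper are hypothetical names used for illustration and are not taken from the PR's code; only the `[language:type] signature in filepath` header shape comes from the description.

```go
package main

import (
	"fmt"
	"strings"
)

// Chunk is a hypothetical, trimmed-down stand-in for the indexer's chunk type.
type Chunk struct {
	Language  string
	Type      string // e.g. "function", "method", "class"
	Signature string
	FilePath  string
	Content   string // stored as-is; never modified by enrichment
}

// EmbedText builds the text sent to the embedder: a one-line metadata
// header followed by the unmodified chunk content.
func EmbedText(c Chunk) string {
	var b strings.Builder
	fmt.Fprintf(&b, "[%s:%s] %s in %s\n", c.Language, c.Type, c.Signature, c.FilePath)
	b.WriteString(c.Content)
	return b.String()
}

func main() {
	c := Chunk{
		Language:  "go",
		Type:      "function",
		Signature: "func Add(a, b int) int",
		FilePath:  "math/add.go",
		Content:   "func Add(a, b int) int { return a + b }",
	}
	fmt.Println(EmbedText(c))
}
```

Because the header exists only in the embedder input, the stored content and the schema need no changes in this sketch.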

Add ContentHash field to Chunk, DiffChunks() algorithm for comparing old vs new chunks by content hash, schema migration for content_hash column, and GetChunksByFilePath/ReadEmbeddings storage methods. This enables re-indexing to only re-embed chunks whose content actually changed, and to reuse embeddings for code that moved but didn't change.
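The comparison by content hash can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the `Chunk` fields, the `DiffResult` type, the `DiffChunks` signature, and the choice of SHA-256 in `HashContent` are all hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Chunk is a hypothetical, trimmed-down chunk record.
type Chunk struct {
	ID          string
	Content     string
	ContentHash string
	Embedding   []float32
}

// HashContent hashes chunk content. SHA-256 is an assumption here.
func HashContent(content string) string {
	sum := sha256.Sum256([]byte(content))
	return hex.EncodeToString(sum[:])
}

// DiffResult groups new chunks by whether an old embedding could be reused.
type DiffResult struct {
	Reused  []Chunk // new chunks carrying over an existing embedding
	Changed []Chunk // new chunks that must be re-embedded
	Removed []Chunk // old chunks with no matching new chunk
}

// DiffChunks matches new chunks to old ones by content hash, ignoring
// position, so code that moved without changing keeps its embedding.
func DiffChunks(oldChunks, newChunks []Chunk) DiffResult {
	// Index old chunks by hash; identical content may occur more than once.
	unused := make(map[string][]int)
	for i, c := range oldChunks {
		unused[c.ContentHash] = append(unused[c.ContentHash], i)
	}
	consumed := make([]bool, len(oldChunks))

	var res DiffResult
	for _, n := range newChunks {
		idxs := unused[n.ContentHash]
		if len(idxs) == 0 {
			res.Changed = append(res.Changed, n)
			continue
		}
		i := idxs[0]
		unused[n.ContentHash] = idxs[1:]
		consumed[i] = true
		n.Embedding = oldChunks[i].Embedding
		res.Reused = append(res.Reused, n)
	}
	for i, c := range oldChunks {
		if !consumed[i] {
			res.Removed = append(res.Removed, c)
		}
	}
	return res
}

func main() {
	oldChunks := []Chunk{
		{ID: "a", Content: "func A() {}", ContentHash: HashContent("func A() {}"), Embedding: []float32{0.1, 0.2}},
		{ID: "b", Content: "func B() {}", ContentHash: HashContent("func B() {}"), Embedding: []float32{0.3, 0.4}},
	}
	newChunks := []Chunk{
		{ID: "b2", Content: "func B() {}", ContentHash: HashContent("func B() {}")},
		{ID: "c", Content: "func C() {}", ContentHash: HashContent("func C() {}")},
	}
	d := DiffChunks(oldChunks, newChunks)
	fmt.Printf("reused=%d changed=%d removed=%d\n", len(d.Reused), len(d.Changed), len(d.Removed))
}
```

In this sketch only the chunks in `Changed` would be sent to the embedder; chunks in `Removed` would be deleted from storage.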

…em 5) Replace the delete-all + reprocess pattern in Update() with processFileWithDiff() which uses DiffChunks() to compare old vs new chunks by content hash. Only changed chunks get re-embedded; moved but unchanged code reuses existing embeddings. Also compute ContentHash during initial indexing via chunkFile().

Post-process chunks to add configurable line overlap between adjacent chunks and prepend enclosing scope context (package/class name). Enrichment goes into EmbedText field only — stored Content is unchanged. Defaults: 3-line overlap + parent context enabled. Configurable via --overlap-lines and --no-parent-context CLI flags. Includes DeduplicateOverlapping() for removing redundant search results caused by chunk overlap.
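One way the overlap and parent-context post-processing could look is sketched below. It assumes the overlap is applied by widening each chunk's line range on both sides when building the embedder input, which the description does not specify; the `Chunk` fields and the `ApplyOverlap` signature are hypothetical, and chunk line ranges are assumed to be valid for the given file.

```go
package main

import (
	"fmt"
	"strings"
)

// Chunk is a hypothetical, trimmed-down chunk record.
type Chunk struct {
	StartLine int    // 1-indexed, inclusive
	EndLine   int    // 1-indexed, inclusive
	Parent    string // enclosing scope, e.g. "package demo"
	Content   string // stored content, left untouched
	EmbedText string // enriched text sent to the embedder
}

// ApplyOverlap fills EmbedText for each chunk by widening its line range by
// up to `overlap` lines on each side (clamped to the file) and optionally
// prepending the enclosing scope. Content is never modified.
func ApplyOverlap(lines []string, chunks []Chunk, overlap int, parentContext bool) []Chunk {
	out := make([]Chunk, 0, len(chunks))
	for _, c := range chunks {
		start := c.StartLine - overlap
		if start < 1 {
			start = 1
		}
		end := c.EndLine + overlap
		if end > len(lines) {
			end = len(lines)
		}
		text := strings.Join(lines[start-1:end], "\n")
		if parentContext && c.Parent != "" {
			text = c.Parent + "\n" + text
		}
		c.EmbedText = text
		out = append(out, c)
	}
	return out
}

func main() {
	lines := []string{"l1", "l2", "l3", "l4", "l5", "l6"}
	chunks := []Chunk{
		{StartLine: 1, EndLine: 3, Parent: "package demo", Content: "l1\nl2\nl3"},
		{StartLine: 4, EndLine: 6, Parent: "package demo", Content: "l4\nl5\nl6"},
	}
	for _, c := range ApplyOverlap(lines, chunks, 1, true) {
		fmt.Printf("%q\n", c.EmbedText)
	}
}
```

With a 1-line overlap, the first chunk's embedder input ends with `l4` and the second's begins with `l3`, so text near a chunk boundary appears in both inputs.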

… (Item 1) Replace regex-based chunkers with tree-sitter AST parsing using the pure-Go gotreesitter runtime (no CGo). Provides accurate function boundary detection, nested structure handling, decorator support, and generics for all supported languages. Go chunker stays as-is (already uses go/ast). Old regex chunkers remain available as fallback but are no longer registered by default. New package: internal/semantic/treesitter/ with shared helpers and per-language chunker implementations.

Introduce ChunkRef type and RefExtractor/RefStorage interfaces for tracking references between chunks. SQLite storage implements the chunk_refs table with StoreRefs, GetRefs, GetCallers, DeleteRefsByChunk, and ResolveRefs methods. Go chunker implements RefExtractor to extract function calls, imports, and type references from AST. Refs are automatically extracted and stored during indexing when the chunker supports it. ResolveRefs batch-resolves symbol names to chunk IDs for caller/callee queries at search time.
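To illustrate the idea of batch resolution, here is an in-memory sketch. It is not the SQLite-backed implementation from the PR: the `ChunkRef` fields and the signatures of `ResolveRefs` and `GetCallers` below are hypothetical stand-ins that only borrow the names from the description.

```go
package main

import "fmt"

// ChunkRef is a hypothetical shape for one reference between chunks.
type ChunkRef struct {
	FromChunkID string // chunk containing the reference
	Symbol      string // referenced name, e.g. "ParseConfig"
	Kind        string // "call", "import", or "type"
	ToChunkID   string // defining chunk; empty until resolved
}

// ResolveRefs fills ToChunkID for every ref whose symbol has a known
// defining chunk and returns how many refs were resolved.
func ResolveRefs(refs []ChunkRef, definedIn map[string]string) int {
	resolved := 0
	for i := range refs {
		if id, ok := definedIn[refs[i].Symbol]; ok {
			refs[i].ToChunkID = id
			resolved++
		}
	}
	return resolved
}

// GetCallers returns the IDs of chunks that call into chunkID,
// in first-seen order and without duplicates.
func GetCallers(refs []ChunkRef, chunkID string) []string {
	seen := make(map[string]bool)
	var callers []string
	for _, r := range refs {
		if r.Kind != "call" || r.ToChunkID != chunkID || seen[r.FromChunkID] {
			continue
		}
		seen[r.FromChunkID] = true
		callers = append(callers, r.FromChunkID)
	}
	return callers
}

func main() {
	refs := []ChunkRef{
		{FromChunkID: "main", Symbol: "ParseConfig", Kind: "call"},
		{FromChunkID: "reload", Symbol: "ParseConfig", Kind: "call"},
		{FromChunkID: "main", Symbol: "fmt", Kind: "import"},
	}
	// Symbol name -> ID of the chunk that defines it.
	definedIn := map[string]string{"ParseConfig": "parse"}

	n := ResolveRefs(refs, definedIn)
	fmt.Println(n, GetCallers(refs, "parse"))
}
```

Resolving all symbols in one pass over a prebuilt symbol-to-chunk map avoids a lookup per reference at query time, which is the motivation for batching in this sketch.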

Critical fixes:
- Fix unique constraint violation in processFileWithDiff when reusing chunks with same ID (always delete before create)
- Fix overlapConfig race condition by adding mutex protection
- Fix chunksCreated double-count for fallback reuse items
- Set overlapConfig in Update() path (was missing, overlap silently disabled)
- Add OverlapLines/IncludeParentContext to UpdateOptions
- Guard os.ReadFile errors in overlap application paths
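The overlapConfig race fix is described only as "adding mutex protection". A generic sketch of that pattern follows; the `Indexer` and `OverlapConfig` types and their accessor methods are hypothetical and may not match how the PR structures the code.

```go
package main

import (
	"fmt"
	"sync"
)

// OverlapConfig is a hypothetical settings struct for chunk overlap.
type OverlapConfig struct {
	OverlapLines         int
	IncludeParentContext bool
}

// Indexer guards its overlap settings with a mutex so that concurrent
// indexing and update paths never read a partially written config.
type Indexer struct {
	mu      sync.RWMutex
	overlap OverlapConfig
}

// SetOverlapConfig replaces the config under the write lock.
func (ix *Indexer) SetOverlapConfig(cfg OverlapConfig) {
	ix.mu.Lock()
	defer ix.mu.Unlock()
	ix.overlap = cfg
}

// GetOverlapConfig returns a copy of the config under the read lock.
func (ix *Indexer) GetOverlapConfig() OverlapConfig {
	ix.mu.RLock()
	defer ix.mu.RUnlock()
	return ix.overlap
}

func main() {
	ix := &Indexer{}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			ix.SetOverlapConfig(OverlapConfig{OverlapLines: n, IncludeParentContext: true})
			_ = ix.GetOverlapConfig()
		}(i)
	}
	wg.Wait()
	fmt.Println(ix.GetOverlapConfig().IncludeParentContext)
}
```

Running a program like this with `go run -race` is a quick way to confirm that the accessors no longer trip the race detector.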

samestrin added 8 commits April 11, 2026 11:32

fix: remove accidentally committed .index and .claude files

c96a7a0

samestrin merged commit d7dc9fb into main Apr 11, 2026
1 check passed

samestrin deleted the feat/semantic-indexing-improvements branch April 11, 2026 19:04
