feat(semantic): improve code indexing with tree-sitter, enriched embeddings, and smart re-indexing #25
Merged
Prepend structured metadata ([language:type] signature in filepath) to chunk content before embedding. The stored Content field is unchanged — only the text sent to the embedder is enriched. This helps the embedding model understand code context (language, type, location) for better retrieval quality. Also updates default embedding API URL from Ollama (11434) to TEI (8080) and makes error messages provider-agnostic.
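A minimal sketch of the enrichment step described above, not the PR's actual code. The `Chunk` fields and the `EmbedText` helper are hypothetical names chosen for illustration; only the `[language:type] signature in filepath` header format comes from the commit message.

```go
package main

import "fmt"

// Chunk is a hypothetical, trimmed-down stand-in for the project's chunk type.
type Chunk struct {
	Language  string
	Type      string // e.g. "function", "method", "type"
	Signature string
	FilePath  string
	Content   string // stored as-is; never modified by enrichment
}

// EmbedText builds the text that would be sent to the embedder.
// It prepends a "[language:type] signature in filepath" header and
// leaves c.Content untouched.
func EmbedText(c Chunk) string {
	header := fmt.Sprintf("[%s:%s] %s in %s", c.Language, c.Type, c.Signature, c.FilePath)
	return header + "\n" + c.Content
}

func main() {
	c := Chunk{
		Language:  "go",
		Type:      "function",
		Signature: "func Add(a, b int) int",
		FilePath:  "math/add.go",
		Content:   "func Add(a, b int) int { return a + b }",
	}
	fmt.Println(EmbedText(c))
}
```

Because the header lives only in the embedder input, existing rows need no schema change; they only need re-embedding to benefit.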
Add ContentHash field to Chunk, DiffChunks() algorithm for comparing old vs new chunks by content hash, schema migration for content_hash column, and GetChunksByFilePath/ReadEmbeddings storage methods. This enables re-indexing to only re-embed chunks whose content actually changed, and to reuse embeddings for code that moved but didn't change.
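The hash-based comparison can be sketched roughly as follows. This is an illustration under assumptions, not the PR's `DiffChunks()` implementation: the `Diff` result type, the SHA-256 choice, and the `HashContent` helper are all hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Chunk is a hypothetical, reduced chunk type for this sketch.
type Chunk struct {
	ID          string
	Content     string
	ContentHash string
}

// HashContent returns a hex SHA-256 digest (the hash function is an assumption).
func HashContent(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])
}

// Diff is a hypothetical result type.
type Diff struct {
	Reused  map[string]string // new chunk ID -> old chunk ID whose embedding can be reused
	Changed []Chunk           // new chunks that need a fresh embedding
	Removed []Chunk           // old chunks with no counterpart in the new set
}

// DiffChunks matches new chunks to old ones by content hash, so code that
// moved but did not change keeps its embedding.
func DiffChunks(oldChunks, newChunks []Chunk) Diff {
	byHash := make(map[string][]Chunk)
	for _, c := range oldChunks {
		byHash[c.ContentHash] = append(byHash[c.ContentHash], c)
	}
	d := Diff{Reused: make(map[string]string)}
	used := make(map[string]bool)
	for _, c := range newChunks {
		if queue := byHash[c.ContentHash]; len(queue) > 0 {
			d.Reused[c.ID] = queue[0].ID
			used[queue[0].ID] = true
			byHash[c.ContentHash] = queue[1:] // each old chunk is matched at most once
			continue
		}
		d.Changed = append(d.Changed, c)
	}
	for _, c := range oldChunks {
		if !used[c.ID] {
			d.Removed = append(d.Removed, c)
		}
	}
	return d
}

func main() {
	oldChunks := []Chunk{
		{ID: "a", ContentHash: HashContent("func A() {}")},
		{ID: "b", ContentHash: HashContent("func B() {}")},
	}
	newChunks := []Chunk{
		{ID: "x", ContentHash: HashContent("func B() {}")}, // moved, unchanged
		{ID: "y", ContentHash: HashContent("func C() {}")}, // new content
	}
	d := DiffChunks(oldChunks, newChunks)
	fmt.Println(len(d.Reused), len(d.Changed), len(d.Removed))
}
```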
…em 5) Replace the delete-all + reprocess pattern in Update() with processFileWithDiff() which uses DiffChunks() to compare old vs new chunks by content hash. Only changed chunks get re-embedded; moved but unchanged code reuses existing embeddings. Also compute ContentHash during initial indexing via chunkFile().
Post-process chunks to add configurable line overlap between adjacent chunks and prepend enclosing scope context (package/class name). Enrichment goes into EmbedText field only — stored Content is unchanged. Defaults: 3-line overlap + parent context enabled. Configurable via --overlap-lines and --no-parent-context CLI flags. Includes DeduplicateOverlapping() for removing redundant search results caused by chunk overlap.
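One way the overlap and deduplication steps might look, as a sketch under stated assumptions. The commit message does not say how overlap is applied or how redundancy is detected; here each chunk's embed text is extended by `overlap` lines on both sides, and two results count as redundant when their line ranges in the same file intersect. `ApplyOverlap` and the `Chunk` layout are hypothetical, and this `DeduplicateOverlapping` only borrows the name from the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// Chunk is a hypothetical, reduced chunk type; lines are 1-based and inclusive.
type Chunk struct {
	FilePath  string
	StartLine int
	EndLine   int
	Content   string // stored content, never modified here
	EmbedText string // enriched text sent to the embedder
}

// ApplyOverlap fills EmbedText with the parent context (if any) followed by
// the chunk's lines widened by `overlap` lines on each side, clamped to the file.
func ApplyOverlap(fileLines []string, chunks []Chunk, overlap int, parent string) {
	for i := range chunks {
		start := chunks[i].StartLine - overlap
		if start < 1 {
			start = 1
		}
		end := chunks[i].EndLine + overlap
		if end > len(fileLines) {
			end = len(fileLines)
		}
		if start > end {
			continue // chunk lies outside the file; leave EmbedText empty
		}
		text := strings.Join(fileLines[start-1:end], "\n")
		if parent != "" {
			text = parent + "\n" + text
		}
		chunks[i].EmbedText = text
	}
}

// DeduplicateOverlapping keeps the first (highest-ranked) result of any group
// whose line ranges intersect within the same file.
func DeduplicateOverlapping(results []Chunk) []Chunk {
	var kept []Chunk
	for _, r := range results {
		dup := false
		for _, k := range kept {
			if k.FilePath == r.FilePath && r.StartLine <= k.EndLine && k.StartLine <= r.EndLine {
				dup = true
				break
			}
		}
		if !dup {
			kept = append(kept, r)
		}
	}
	return kept
}

func main() {
	lines := []string{"package demo", "", "func A() {", "}", "", "func B() {", "}"}
	chunks := []Chunk{
		{FilePath: "demo.go", StartLine: 3, EndLine: 4},
		{FilePath: "demo.go", StartLine: 6, EndLine: 7},
	}
	ApplyOverlap(lines, chunks, 1, "package demo")
	fmt.Printf("%q\n", chunks[1].EmbedText)
}
```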
… (Item 1) Replace regex-based chunkers with tree-sitter AST parsing using the pure-Go gotreesitter runtime (no CGo). Provides accurate function boundary detection, nested structure handling, decorator support, and generics for all supported languages. Go chunker stays as-is (already uses go/ast). Old regex chunkers remain available as fallback but are no longer registered by default. New package: internal/semantic/treesitter/ with shared helpers and per-language chunker implementations.
Introduce ChunkRef type and RefExtractor/RefStorage interfaces for tracking references between chunks. SQLite storage implements the chunk_refs table with StoreRefs, GetRefs, GetCallers, DeleteRefsByChunk, and ResolveRefs methods. Go chunker implements RefExtractor to extract function calls, imports, and type references from AST. Refs are automatically extracted and stored during indexing when the chunker supports it. ResolveRefs batch-resolves symbol names to chunk IDs for caller/callee queries at search time.
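To illustrate the call-extraction part, here is a small sketch using the standard `go/ast` package. The `ChunkRef` fields and the `ExtractCallRefs` function are hypothetical and cover only function calls; the PR's actual `RefExtractor` interface, its import and type-reference handling, and the SQLite layer are not shown.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// ChunkRef is a hypothetical shape for a reference between chunks.
type ChunkRef struct {
	FromChunk string // identifier of the referencing chunk (here: the function name)
	Symbol    string // name of the referenced symbol, to be resolved later
	Kind      string // "call" in this sketch
}

// ExtractCallRefs parses Go source and records every call made inside each
// top-level function body.
func ExtractCallRefs(src string) ([]ChunkRef, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "demo.go", src, 0)
	if err != nil {
		return nil, err
	}
	var refs []ChunkRef
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Body == nil {
			continue
		}
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			call, ok := n.(*ast.CallExpr)
			if !ok {
				return true
			}
			switch f := call.Fun.(type) {
			case *ast.Ident: // plain call: helper()
				refs = append(refs, ChunkRef{FromChunk: fn.Name.Name, Symbol: f.Name, Kind: "call"})
			case *ast.SelectorExpr: // qualified call: pkg.Func() or recv.Method()
				refs = append(refs, ChunkRef{FromChunk: fn.Name.Name, Symbol: f.Sel.Name, Kind: "call"})
			}
			return true
		})
	}
	return refs, nil
}

func main() {
	src := "package demo\nfunc helper() int { return 1 }\nfunc Run() int { return helper() + 1 }\n"
	refs, err := ExtractCallRefs(src)
	if err != nil {
		panic(err)
	}
	for _, r := range refs {
		fmt.Printf("%s -> %s (%s)\n", r.FromChunk, r.Symbol, r.Kind)
	}
}
```

Storing the symbol name rather than a chunk ID at extraction time matches the description above, where resolution to chunk IDs happens in a later batch step.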
Critical fixes:
- Fix unique constraint violation in processFileWithDiff when reusing chunks with same ID (always delete before create)
- Fix overlapConfig race condition by adding mutex protection
- Fix chunksCreated double-count for fallback reuse items
- Set overlapConfig in Update() path (was missing, overlap silently disabled)
- Add OverlapLines/IncludeParentContext to UpdateOptions
- Guard os.ReadFile errors in overlap application paths
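The mutex fix for overlapConfig could look something like the sketch below. The `Indexer` type and its accessor names are hypothetical; only the `OverlapLines` and `IncludeParentContext` field names come from the commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// OverlapConfig mirrors the two options named in the commit message.
type OverlapConfig struct {
	OverlapLines         int
	IncludeParentContext bool
}

// Indexer is a hypothetical owner of the shared configuration.
type Indexer struct {
	mu            sync.RWMutex
	overlapConfig OverlapConfig
}

// SetOverlap replaces the configuration under the write lock.
func (ix *Indexer) SetOverlap(c OverlapConfig) {
	ix.mu.Lock()
	defer ix.mu.Unlock()
	ix.overlapConfig = c
}

// Overlap returns a copy of the configuration under the read lock.
func (ix *Indexer) Overlap() OverlapConfig {
	ix.mu.RLock()
	defer ix.mu.RUnlock()
	return ix.overlapConfig
}

func main() {
	ix := &Indexer{}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			// Concurrent writers and readers are safe because every
			// access goes through the mutex-guarded accessors.
			ix.SetOverlap(OverlapConfig{OverlapLines: n, IncludeParentContext: true})
			_ = ix.Overlap()
		}(i)
	}
	wg.Wait()
	fmt.Println(ix.Overlap().IncludeParentContext)
}
```

Returning the struct by value means callers never hold a pointer into the guarded state, so no lock needs to outlive the accessor call.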
Summary
Five major improvements to the semantic code indexing system:
- Prepend [language:type] signature in filepath metadata to chunk content before embedding, improving retrieval quality with zero schema changes
- chunk_refs table with batch resolution for caller/callee queries

Also updates default embedding API URL from Ollama (port 11434) to TEI (port 8080) and makes error messages provider-agnostic.
Test plan
go test -race)