Added embedding benchmark along with new config file and plot updates #14
radoslavralev wants to merge 3 commits into main
Conversation
Pull request overview
This PR adds optional cross-encoder reranking functionality to improve retrieval quality, implements top-k candidate retrieval in embedding matching, normalizes embeddings before Redis operations, enhances the Redis index handling with dimension probing, and improves visualization with dual-subplot charts. The changes span the core matching logic, benchmarking infrastructure, and result visualization.
Key Changes
- Adds cross-encoder reranking with configurable top-k retrieval for both Redis-based and standard neural embedding matching
- Implements top-k retrieval logic in blockwise embedding matching with support for variable k values
- Normalizes embeddings to float32 and unit length before Redis operations, with dimension probing to handle models with incorrect config dimensions
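The normalization step described above can be sketched as follows (a minimal illustration under assumptions; `normalize_for_redis` is a hypothetical helper name, not necessarily the PR's actual function):

```python
import numpy as np

def normalize_for_redis(embeddings):
    """Cast to float32 and scale each row to unit L2 norm before indexing.

    Hypothetical helper illustrating the normalization the PR applies
    before writing vectors to Redis.
    """
    emb = np.asarray(embeddings, dtype=np.float32)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    return emb / norms
```

Unit-normalizing up front lets cosine similarity be computed as a plain dot product inside Redis.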
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 16 comments.
| File | Description |
|---|---|
| src/customer_analysis/query_engine.py | Adds embedding dimension probing, enables trust_remote_code, and always overwrites the Redis index on initialization |
| src/customer_analysis/embedding_interface.py | Implements top-k retrieval in blockwise matching methods with support for variable k; updates dimension inference to probe models first |
| src/customer_analysis/data_processing.py | Integrates cross-encoder reranking in both run_matching and run_matching_redis; adds embedding normalization before Redis operations |
| scripts/plot_multiple_precision_vs_cache_hit_ratio.py | Creates dual-subplot visualization with precision-CHR curves and AUC bar chart; adds numpy version compatibility for trapezoid/trapz |
| run_benchmark.sh | Provides an example shell script for running benchmarks with cross-encoder models |
| run_benchmark.py | Extends the benchmark loop to iterate over cross-encoder models and include them in output paths |
| evaluation.py | Adds command-line arguments for the cross-encoder model and rerank_k parameter |
Comments suppressed due to low confidence (2)
src/customer_analysis/embedding_interface.py:414
- The docstring for `calculate_best_matches_with_cache_large_dataset` does not document the new `k` parameter or how it affects the return value shapes. The function returns arrays with shape `(num_sentences,)` when `k=1` but `(num_sentences, k)` when `k>1`. This should be documented to avoid confusion.
"""Large-dataset variant: find best cache match for each sentence using memmaps.
Writes two memmaps (rows for sentences, cols for cache), normalised, and
performs blockwise dot-products. If `sentence_offset` is provided and the
cache corresponds to the same corpus, the self-similarity diagonal is masked.
"""
src/customer_analysis/embedding_interface.py:490
- The docstrings for `calculate_best_matches_with_cache` and `calculate_best_matches_from_embeddings_with_cache` do not document the new `k` parameter or how it affects return value shapes. When `k=1`, arrays have shape `(N,)`, but when `k>1`, they have shape `(N, k)`. This should be documented.
"""
Calculate the best similarity match for each sentence against all other
sentences using a neural embedding model.
"""
```python
else:
    # Top-k logic: if this block has fewer than k columns, take all valid ones
    curr_block_size = col_end - col_start
    if curr_block_size <= k:
        # Small block: a full sort yields every column (may be fewer than k)
        part_idx = np.argsort(-sim, axis=1)
    else:
        # Use argpartition to pull out the k largest without a full sort
        part_idx = np.argpartition(-sim, k, axis=1)[:, :k]
    top_k_in_block_val = np.take_along_axis(sim, part_idx, axis=1)

    # Sort the selected columns so the block's top-k is ordered
    # (needed for a consistent merge below)
    sorted_sub_idx = np.argsort(-top_k_in_block_val, axis=1)
    top_k_in_block_val = np.take_along_axis(top_k_in_block_val, sorted_sub_idx, axis=1)
    top_k_in_block_idx = np.take_along_axis(part_idx, sorted_sub_idx, axis=1)

    # Merge with accumulated bests:
    #   chunk_best_scores:  (batch, k)
    #   top_k_in_block_val: (batch, min(block, k))

    # Adjust indices to global column indices
    top_k_in_block_idx_global = top_k_in_block_idx + col_start

    combined_vals = np.concatenate([chunk_best_scores, top_k_in_block_val], axis=1)
    combined_idxs = np.concatenate([chunk_best_indices, top_k_in_block_idx_global], axis=1)

    # Find the top k in the combined candidates
    best_combined_args = np.argsort(-combined_vals, axis=1)[:, :k]

    chunk_best_scores = np.take_along_axis(combined_vals, best_combined_args, axis=1)
    chunk_best_indices = np.take_along_axis(combined_idxs, best_combined_args, axis=1)
```
The top-k retrieval logic in blockwise matching lacks test coverage. Since the repository has comprehensive tests and this is a significant new feature that changes the shape of return values and introduces complex merging logic, it should have tests covering: 1) k > 1 with various cache sizes, 2) edge cases where block size < k, 3) self-similarity masking with k > 1, and 4) correct sorting and merging across blocks.
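A hedged sketch of what such a test could look like, comparing a blockwise top-k merge against a brute-force argsort reference (the `blockwise_top_k` here is a standalone stand-in for the PR's merging logic, written only to demonstrate the test structure):

```python
import numpy as np

def brute_force_top_k(sim, k):
    """Reference: global top-k per row via a full sort."""
    idx = np.argsort(-sim, axis=1)[:, :k]
    return idx, np.take_along_axis(sim, idx, axis=1)

def blockwise_top_k(sim, k, col_block):
    """Stand-in for the PR's blockwise merge: process columns in blocks,
    keeping a running (n, k) buffer of best scores/indices."""
    n = sim.shape[0]
    best_scores = np.full((n, k), -np.inf)
    best_idx = np.zeros((n, k), dtype=np.int64)
    for col_start in range(0, sim.shape[1], col_block):
        block = sim[:, col_start:col_start + col_block]
        take = min(k, block.shape[1])  # block may have fewer than k columns
        local = np.argsort(-block, axis=1)[:, :take]
        vals = np.take_along_axis(block, local, axis=1)
        # Merge block candidates with the accumulated bests
        cand_vals = np.concatenate([best_scores, vals], axis=1)
        cand_idx = np.concatenate([best_idx, local + col_start], axis=1)
        order = np.argsort(-cand_vals, axis=1)[:, :k]
        best_scores = np.take_along_axis(cand_vals, order, axis=1)
        best_idx = np.take_along_axis(cand_idx, order, axis=1)
    return best_idx, best_scores

def test_blockwise_matches_brute_force():
    rng = np.random.default_rng(0)
    sim = rng.random((5, 13))
    for k in (1, 3, 7):              # includes k larger than the smallest block
        for col_block in (2, 4, 13):  # includes block size < k and a single block
            got_idx, got_val = blockwise_top_k(sim, k, col_block)
            want_idx, want_val = brute_force_top_k(sim, k)
            assert np.allclose(got_val, want_val)
            assert np.array_equal(got_idx, want_idx)
```

The same pattern extends to the self-similarity masking case by filling the diagonal with `-inf` in both implementations before comparing.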
```diff
 [[tool.uv.index]]
-url = "https://artifactory.dev.redislabs.com/artifactory/api/pypi/cloud-pypi-local/simple"
+url = ""
```
Empty UV index URL breaks package installation
High Severity
The `[[tool.uv.index]]` URL is set to an empty string `""` instead of being removed entirely. An empty URL is not a valid package index URL and will cause `uv` to fail when attempting to resolve or install packages from this index. The section should either be removed completely or contain a valid URL.
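Either fix keeps `uv` resolvable; sketched below are both options (the Artifactory URL shown is the one this diff removed):

```toml
# Option 1: restore a valid index URL
[[tool.uv.index]]
url = "https://artifactory.dev.redislabs.com/artifactory/api/pypi/cloud-pypi-local/simple"

# Option 2: delete the [[tool.uv.index]] table entirely, so uv falls back
# to its default index (PyPI)
```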
```python
if all_pairs:
    all_scores = cross_encoder.predict(all_pairs, batch_size=128, show_progress_bar=True)
    all_scores = all_scores.reshape(N, k)
```
Cross-encoder reranking creates invalid pairs when cache is small
Medium Severity
When using cross-encoder reranking with `k > 1` and the cache has fewer entries than `k`, the code iterates over all `k` indices in `best_indices[i]`, but some of these indices are initialized to 0 with scores of `-inf` (from `init_results`). This causes `cache_list[idx]` to repeatedly access `cache_list[0]` for invalid entries, creating spurious query-cache pairs. The cross-encoder then scores these invalid pairs, which could lead to incorrect match selection.
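One way to avoid the spurious pairs is to skip candidates whose retrieval score is still the `-inf` sentinel before building cross-encoder inputs (a sketch only; the variable names follow the comment above, while `build_rerank_pairs` and the rest are assumed):

```python
import math

def build_rerank_pairs(queries, cache_list, best_indices, best_scores):
    """Collect (query, candidate) text pairs, skipping -inf placeholder slots."""
    pairs, owners = [], []  # owners[j] maps pair j back to its query index
    for qi, query in enumerate(queries):
        for idx, score in zip(best_indices[qi], best_scores[qi]):
            if math.isinf(score) and score < 0:
                continue  # placeholder slot from init_results: no real candidate
            pairs.append((query, cache_list[idx]))
            owners.append(qi)
    return pairs, owners
```

With variable-length candidate lists per query, the flat `reshape(N, k)` of the scores would also need to be replaced by grouping scores via the `owners` mapping.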
```python
    emb_path, len(sentences), embedding_dim, row_block, col_block, dtype, early_stop
)

self._cleanup_memmap(created, memmap_dir, emb_path)
```
Missing try/finally for memmap cleanup causes temp file leak
Medium Severity
In `_calculate_best_matches_large_dataset` and `calculate_best_matches_with_cache_large_dataset`, temporary memmap files are created via `_write_embeddings_memmap`, but the cleanup code that removes these files is not wrapped in a try/finally block. If an exception is raised during `_compute_blockwise_best_matches` or `_compute_blockwise_best_matches_two_sets`, the cleanup code is never executed and temporary files accumulate on disk. Over time, with repeated failures, this could fill up disk space.
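The suggested fix is the standard try/finally pattern, sketched here with a stand-in compute step (the memmap write and cleanup are simplified; only the names `emb_path` and the memmap workflow come from the comment above):

```python
import os
import tempfile
import numpy as np

def blockwise_with_cleanup(embeddings):
    """Write a temp memmap, compute on it, and guarantee the file is removed."""
    fd, emb_path = tempfile.mkstemp(suffix=".dat")
    os.close(fd)
    try:
        mm = np.memmap(emb_path, dtype=np.float32, mode="w+",
                       shape=embeddings.shape)
        mm[:] = embeddings
        mm.flush()
        result = float(np.asarray(mm).sum())  # stand-in for the blockwise step
        del mm  # release the memmap handle before unlinking
        return result
    finally:
        # Runs even if the computation raises, so temp files never leak
        os.remove(emb_path)
```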
Note
Introduces reranking and a unified, provider-agnostic embedding stack, plus end-to-end benchmarking utilities.
- New CLI options (`--cross_encoder_model`, `--rerank_k`) across local and Redis workflows; retrieval upgraded to top-k with cross-encoder selection
- Introduces `EmbeddingModel`, `SimilarityMatcher`, and `embedding_providers` (HuggingFace, OpenAI, Gemini); updates `RedisVectorIndex` to use providers and correct vector handling
- Benchmark tooling (`run_benchmark.py`, `run_benchmark.sh`) with bootstrap runs, model/CE sweeps, and organized outputs
- Adds `run_chr_analysis.sh`

Written by Cursor Bugbot for commit fbfe4e7. This will update automatically on new commits.