
IndexError: index (2864) out of range when not re-creating index or restarting webapp after config change #131

@nh2

I indexed some .cpp files as described in #90 (comment): I added a - doc_path: ... entry to my config and ran llmsearch index update, but did not restart the llmsearch interact webapp ... process.
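For reference, the change was nothing more than one additional doc_path entry in the documents list of the YAML config; schematically it looked roughly like this (the paths and the surrounding key name are placeholders, not my real config):

```yaml
# Schematic only; paths and the surrounding key name are placeholders.
document_settings:
  - doc_path: /home/ubuntu/existing-docs   # entry that was already indexed
  - doc_path: /home/ubuntu/cpp-sources     # newly added entry with the .cpp files
```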

When I then query something via the web UI, I get:

2025-05-30 00:18:38.969 | DEBUG    | __main__:<module>:246 - CONFIG FILE: /home/ubuntu/llm-search/configs/niklas-config-1.yaml
2025-05-30 00:18:38.975 | DEBUG    | llmsearch.ranking:get_relevant_documents:105 - Evaluating query: What's the name of the API endpoint that generates thumbnails?
2025-05-30 00:18:38.975 | INFO     | llmsearch.ranking:get_relevant_documents:107 - Adding query prefix for retrieval: query: 
2025-05-30 00:18:38.975 | INFO     | llmsearch.splade:query:248 - SPLADE search will search over all documents of chunk size: 1024. Number of docs: 2865
────────────────────────── Traceback (most recent call last) ───────────────────────────
  /home/ubuntu/.venv/lib/python3.12/site-packages/streamlit/runtime/scriptrunner/exec_  
  code.py:121 in exec_func_with_error_handling                                          
                                                                                        
  /home/ubuntu/.venv/lib/python3.12/site-packages/streamlit/runtime/scriptrunner/scrip  
  t_runner.py:645 in code_to_exec                                                       
                                                                                        
  /home/ubuntu/.venv/lib/python3.12/site-packages/llmsearch/webapp.py:342 in <module>   
                                                                                        
    339 │   │   │   │   conv_history_rewrite_query                                      
    340 │   │   │   )                                                                   
    341 │   │                                                                           
  ❱ 342 │   │   output = generate_response(                                             
    343 │   │   │   question=text,                                                      
    344 │   │   │   use_hyde=st.session_state["llm_bundle"].hyde_enabled,               
    345 │   │   │   use_multiquery=st.session_state["llm_bundle"].multiquery_enabled,   
                                                                                        
  /home/ubuntu/.venv/lib/python3.12/site-packages/streamlit/runtime/caching/cache_util  
  s.py:219 in __call__                                                                  
                                                                                        
  /home/ubuntu/.venv/lib/python3.12/site-packages/streamlit/runtime/caching/cache_util  
  s.py:261 in _get_or_create_cached_value                                               
                                                                                        
  /home/ubuntu/.venv/lib/python3.12/site-packages/streamlit/runtime/caching/cache_util  
  s.py:320 in _handle_cache_miss                                                        
                                                                                        
  /home/ubuntu/.venv/lib/python3.12/site-packages/llmsearch/webapp.py:175 in            
  generate_response                                                                     
                                                                                        
    172 ):                                                                              
    173 │   # _config and _bundle are under scored so paratemeters aren't hashed        
    174 │                                                                               
  ❱ 175 │   output = get_and_parse_response(                                            
    176 │   │   query=question, config=_config, llm_bundle=_bundle, label=label_filter  
    177 │   )                                                                           
    178 │   return output                                                               
                                                                                        
  /home/ubuntu/.venv/lib/python3.12/site-packages/llmsearch/process.py:66 in            
  get_and_parse_response                                                                
                                                                                        
     63 │   │   offset_max_chars = 0                                                    
     64 │                                                                               
     65 │   semantic_search_config = config.semantic_search                             
  ❱  66 │   most_relevant_docs, score = get_relevant_documents(                         
     67 │   │   original_query, queries, llm_bundle, semantic_search_config, label=lab  
     68 │   │   offset_max_chars = offset_max_chars                                     
     69 │   )                                                                           
                                                                                        
  /home/ubuntu/.venv/lib/python3.12/site-packages/llmsearch/ranking.py:109 in           
  get_relevant_documents                                                                
                                                                                        
    106 │   │   │   if config.query_prefix:                                             
    107 │   │   │   │   logger.info(f"Adding query prefix for retrieval: {config.query  
    108 │   │   │   │   query = config.query_prefix + query                             
  ❱ 109 │   │   │   sparse_search_docs_ids, sparse_scores = sparse_retriever.query(     
    110 │   │   │   │   search=query, n=config.max_k, label=label, chunk_size=chunk_si  
    111 │   │   │   )                                                                   
    112                                                                                 
                                                                                        
  /home/ubuntu/.venv/lib/python3.12/site-packages/llmsearch/splade.py:253 in query      
                                                                                        
    250 │   │   │   )                                                                   
    251 │   │                                                                           
    252 │   │   # print(indices)                                                        
  ❱ 253 │   │   embeddings = self._embeddings[indices]  # type: ignore                  
    254 │   │   ids = self._ids[indices]  # type: ignore                                
    255 │   │   l2_norm_matrix = scipy.sparse.linalg.norm(embeddings, axis=1)           
    256                                                                                 
                                                                                        
  /home/ubuntu/.venv/lib/python3.12/site-packages/scipy/sparse/_index.py:30 in          
  __getitem__                                                                           
                                                                                        
     27 │   This class provides common dispatching and validation logic for indexing.   
     28 │   """                                                                         
     29 │   def __getitem__(self, key):                                                 
  ❱  30 │   │   index, new_shape = self._validate_indices(key)                          
     31 │   │                                                                           
     32 │   │   # 1D array                                                              
     33 │   │   if len(index) == 1:                                                     
                                                                                        
  /home/ubuntu/.venv/lib/python3.12/site-packages/scipy/sparse/_index.py:288 in         
  _validate_indices                                                                     
                                                                                        
    285 │   │   │   │   index_ndim = tmp_ndim                                           
    286 │   │   │   else:  # dense array                                                
    287 │   │   │   │   N = self._shape[index_ndim]                                     
  ❱ 288 │   │   │   │   idx = self._asindices(idx, N)                                   
    289 │   │   │   │   index.append(idx)                                               
    290 │   │   │   │   array_indices.append(index_ndim)                                
    291 │   │   │   │   index_ndim += 1                                                 
                                                                                        
  /home/ubuntu/.venv/lib/python3.12/site-packages/scipy/sparse/_index.py:332 in         
  _asindices                                                                            
                                                                                        
    329 │   │   # Check bounds                                                          
    330 │   │   max_indx = x.max()                                                      
    331 │   │   if max_indx >= length:                                                  
  ❱ 332 │   │   │   raise IndexError('index (%d) out of range' % max_indx)              
    333 │   │                                                                           
    334 │   │   min_indx = x.min()                                                      
    335 │   │   if min_indx < 0:                                                        
────────────────────────────────────────────────────────────────────────────────────────
IndexError: index (2864) out of range

The error goes away when I restart llmsearch interact webapp AND run llmsearch index create ... instead of llmsearch index update ...

Is that expected?

If so, it would be nice to get a clearer error than a bare IndexError, telling me that I have to restart the whole webapp after changing the config.

But then again, if I add yet another doc_path entry for another programming language, the IndexError persists.
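For what it's worth, my guess at the mechanism (purely a guess, not taken from the llmsearch code): the running webapp still holds the SPLADE embeddings matrix from before the update, while the document/ID list it searches over now has 2865 entries, so a hit on the newest document asks for row 2864 of a matrix that only has rows 0..2863. A standalone sketch of that kind of mismatch, with sizes matching my log:

```python
# Standalone sketch of the suspected mismatch (hypothetical, not llmsearch code):
# a stale sparse embeddings matrix indexed with hits from a refreshed, larger index.
import numpy as np
import scipy.sparse

n_before, n_after, dim = 2864, 2865, 16

stale_embeddings = scipy.sparse.random(n_before, dim, density=0.1, format="csr")
refreshed_ids = np.arange(n_after)      # 2865 docs after llmsearch index update

top_hits = refreshed_ids[-1:]           # best match happens to be the newly added doc
rows = stale_embeddings[top_hits]       # raises IndexError: index (2864) out of range
```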
