Add validation to prevent zero vectors in KNN fields #15621
vigyasharma wants to merge 7 commits into apache:main from
@vigyasharma sorry for not commenting on the other issue. I am not sure we should prevent
That's okay, I can make the changes if needed. :) I was thinking about this, and while cosine is mathematically broken for zero vectors, I'm not sure zero vectors are meaningful for the other functions either. They always produce 0 for dot product and maximum inner product, so search can't really differentiate based on those similarity scores. This silently affects scores, graph geometry, and the result set, which feels trappy. Are there meaningful scenarios where we want to allow zero vectors? I'm not sure whether bit vectors need them. Are Hamming distance or Jaccard similarity meaningful when all bits are 0?
Technically, this is exactly the same problem that vectors have in general. Two different vectors can return the same scores. I don't think this is a good reason. Additionally, "same scores" is possible even in term-based search.
I am not sure it will do so silently. Again, users do all sorts of things to test, and it seems plausible to me for there to be zero vectors.
I don't know. But preventing this for cosine only keeps the system sane.
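To make the distinction discussed above concrete, here is a minimal Java sketch of how a zero vector behaves under dot product versus cosine. The class and method names are hypothetical and written from the textbook formulas; this is not Lucene's own implementation, which may scale or normalize scores differently.

```java
// Illustrative only: textbook dot product and cosine, not Lucene's code.
public final class ZeroVectorSimilarityDemo {

    // Plain dot product: sum of element-wise products.
    public static float dotProduct(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Cosine: dot(a, b) / (|a| * |b|). With a zero vector this is 0 / 0.
    public static float cosine(float[] a, float[] b) {
        float dot = 0f;
        float normA = 0f;
        float normB = 0f;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (float) (dot / Math.sqrt((double) normA * normB));
    }

    public static void main(String[] args) {
        float[] zero = {0f, 0f, 0f};
        float[] query = {1f, 2f, 3f};
        System.out.println("dot = " + dotProduct(zero, query));   // prints "dot = 0.0"
        System.out.println("cosine = " + cosine(zero, query));    // prints "cosine = NaN"
    }
}
```

Under these assumptions the dot product is well defined (always 0, so uninformative), while cosine is undefined and evaluates to NaN in floating point.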
Sounds good, I'll update this PR to disallow zero vectors only when cosine similarity is configured.
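A check of that shape could look roughly like the sketch below. It is only an illustration of the idea: `ZeroVectorCheck`, its `Similarity` enum, and the exception message are hypothetical stand-ins, not the code in this PR or in #15751, and not Lucene's API.

```java
// Hypothetical sketch: reject zero vectors only when cosine is configured.
public final class ZeroVectorCheck {

    // Stand-in for the configured similarity function (assumption, not Lucene's enum).
    public enum Similarity { EUCLIDEAN, DOT_PRODUCT, COSINE, MAXIMUM_INNER_PRODUCT }

    // True when every component is zero (an empty array also counts as zero here).
    public static boolean isZeroVector(float[] vector) {
        for (float component : vector) {
            if (component != 0f) {
                return false;
            }
        }
        return true;
    }

    // Throws only for the cosine case; other similarities accept zero vectors.
    public static void validate(float[] vector, Similarity similarity) {
        if (similarity == Similarity.COSINE && isZeroVector(vector)) {
            throw new IllegalArgumentException(
                "zero vectors are not allowed with cosine similarity");
        }
    }
}
```

Failing at validation time surfaces the problem when the document is added, rather than later at search or merge time.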
Closing this in favor of #15751
We don't have checks to disallow zero vectors from getting indexed today. This becomes a problem later when vectors are searched or segments are merged, as noted in #15540. This change prevents zero vectors from getting indexed.
AI Disclosure: The change itself is trivial, but it broke quite a few tests. I fixed about half of them manually, then leveraged AI to fix the remaining tests. I've reviewed all AI test fixes. Most of them simply replaced zero/empty vectors with non-empty vector values. Some tests checked vector similarity scores, and I've fixed them accordingly.
Addresses #15540