Search for all vectors with similarity above threshold

I want to query a collection of embedding vectors for all vectors whose similarity with a given vector is above a certain threshold. To be clear, I’m not asking about a classic vector search where you typically want the n most similar vectors. In my case, I might want all the vectors in the collection, and there may be tens of thousands of them. Currently I’m doing it manually in Python like this:

result = [entry['vector'] for entry in db.embeddings.find() if similarity(entry['vector'], v) > threshold]

But it’s slow. Is there a way to do this with a database query? Thanks in advance!

Hello @Hammarberg, I think you could just set your vector search limit and numCandidates to the size of your collection, and then add a $match stage afterwards that filters on the minimum similarity score you’d accept.

I would expect it to be slow though, as it is effectively a collection scan, but potentially faster than reading everything into the application?
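
For reference, here is a rough sketch of what that pipeline might look like from Python. The index name "vector_index", the field name "vector", and the helper name build_threshold_pipeline are assumptions for illustration, not taken from the original post. Note also that the score exposed by $vectorSearch is normalized, so the threshold may not map one-to-one onto a raw cosine similarity.

```python
def build_threshold_pipeline(query_vector, threshold, collection_size,
                             index_name="vector_index", path="vector"):
    """Build an aggregation pipeline that keeps only results scoring above threshold.

    index_name and path are hypothetical defaults; use the names from your own index.
    """
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": list(query_vector),
                # Ask for as many candidates and results as there are documents.
                "numCandidates": collection_size,
                "limit": collection_size,
            }
        },
        # Expose the search score as a regular field so it can be filtered on.
        {"$project": {path: 1, "score": {"$meta": "vectorSearchScore"}}},
        # Drop everything at or below the threshold.
        {"$match": {"score": {"$gt": threshold}}},
    ]


# Usage with pymongo, assuming db is a pymongo Database and v is the query vector:
#   size = db.embeddings.estimated_document_count()
#   pipeline = build_threshold_pipeline(v, 0.8, size)
#   result = [doc["vector"] for doc in db.embeddings.aggregate(pipeline)]
```

The filtering happens on the server, so only the matching documents are sent back to the application.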

Thank you very much - I tested and it’s much faster actually. However, there seems to be an upper limit of 10,000 on numCandidates, so I have to set limit and numCandidates to min(10000, collection_size). It works in practice for now, but as the database grows I’ll need a better solution.
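
In case it helps anyone else, the capping described above could be written roughly like this. The names MAX_NUM_CANDIDATES, capped_search_size and may_be_truncated are made up for illustration, and the 10,000 figure is simply the limit reported in this thread.

```python
MAX_NUM_CANDIDATES = 10_000  # upper limit on numCandidates reported in this thread


def capped_search_size(collection_size):
    """Value to use for both limit and numCandidates."""
    return min(MAX_NUM_CANDIDATES, collection_size)


def may_be_truncated(collection_size):
    """True when the collection is larger than the cap, so some
    above-threshold vectors might never be considered."""
    return collection_size > MAX_NUM_CANDIDATES
```

Checking may_be_truncated before trusting the result is one way to notice when the collection has outgrown this workaround.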