Vector Search Pre-filtering

Henry_Weller · February 6, 2025, 5:28pm

Hi @Cameron_Flanagan, thanks for the question. It would be useful to have more insight into your query pattern and data model, but I can take a stab at a broad answer on a couple of dimensions.

Depending on the degree of data duplication, I think you have a few options:

Scenario 1. Minimal metadata duplication relative to the size of the binData compressed vectors (say <20% of total document size is non-vector)

In this scenario I would recommend duplicating the parent metadata within the same document as each embedding. The impact on storage shouldn’t be massive, and the query performance gained compared to the query pattern suggested in scenario 2 is likely considerable.
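
As a rough sketch of what scenario 1 could look like, the snippet below builds a `$vectorSearch` stage that pre-filters on a parent field copied into each embedding document. The collection layout and every name in it (`chunk_vector_index`, `embedding`, `parent_id`, `parent_category`, `text`, and the 1536 dimensions) are assumptions for illustration, not anything taken from your data model. The one firm requirement is that any field used in `filter` is declared as a `filter` field in the vector index definition.

```python
from typing import Any

# Hypothetical vector index definition: the duplicated parent field must be
# indexed with type "filter" before it can be used for pre-filtering.
VECTOR_INDEX_DEFINITION: dict[str, Any] = {
    "fields": [
        {"type": "vector", "path": "embedding",
         "numDimensions": 1536, "similarity": "cosine"},  # assumed values
        {"type": "filter", "path": "parent_category"},
    ]
}


def build_prefiltered_search(
    query_vector: list[float],
    category: str,
    limit: int = 10,
    num_candidates: int = 200,
) -> list[dict[str, Any]]:
    """Build a pipeline that pre-filters on duplicated parent metadata."""
    return [
        {
            "$vectorSearch": {
                "index": "chunk_vector_index",  # hypothetical index name
                "path": "embedding",
                "queryVector": query_vector,
                # numCandidates must not be smaller than limit.
                "numCandidates": max(num_candidates, limit),
                "limit": limit,
                # The filter is applied as part of the vector search itself.
                "filter": {"parent_category": {"$eq": category}},
            }
        },
        {
            "$project": {
                "parent_id": 1,
                "parent_category": 1,
                "text": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
```

With PyMongo this would be run as something like `db["chunks"].aggregate(build_prefiltered_search(vec, "manuals"))`, where `chunks` is again a hypothetical collection name.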

Scenario 2. Large metadata duplication relative to the size of the binData compressed vectors (say >80% of total document size is non-vector)

Here I would suggest a query pattern that unions the result sets of two queries issued against separate collections and then groups on a common key:

[$match(find matching documents in parent metadata), $unionWith($vectorSearch(find relevant child documents with key tying back to parent metadata)), $group(group on common key)]

This is very similar to post-filtering: the vector search runs without the metadata filter, so it has to request more documents than you ultimately need, and performance will depend on how many documents are over-requested.
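
To make the scenario 2 pattern a bit more concrete, here is a minimal sketch of the pipeline as Python dicts. The collection names (`parents`, `chunks`), field names (`parent_id`, `embedding`), the index name, and the `overrequest` factor are all assumptions. The `$project` stages that normalize the key, and the final `$match`, `$sort`, and `$limit`, are additions of mine on top of the `$match` / `$unionWith` / `$group` outline, there to keep only parents that appear in both result sets.

```python
from typing import Any


def build_union_pipeline(
    parent_filter: dict[str, Any],
    query_vector: list[float],
    limit: int = 10,
    overrequest: int = 5,
    num_candidates: int = 500,
) -> list[dict[str, Any]]:
    """Build a pipeline to run against the (hypothetical) parents collection."""
    # Over-request children, since many hits may belong to non-matching parents.
    child_limit = limit * overrequest
    child_branch = [
        {
            "$vectorSearch": {
                "index": "chunk_vector_index",  # hypothetical index name
                "path": "embedding",
                "queryVector": query_vector,
                "numCandidates": max(num_candidates, child_limit),
                "limit": child_limit,
            }
        },
        # Expose the parent key under the same name used by the parent branch.
        {"$project": {"_id": 0, "join_key": "$parent_id",
                      "is_parent": {"$literal": 0},
                      "score": {"$meta": "vectorSearchScore"}}},
    ]
    return [
        {"$match": parent_filter},
        {"$project": {"_id": 0, "join_key": "$_id",
                      "is_parent": {"$literal": 1}}},
        {"$unionWith": {"coll": "chunks", "pipeline": child_branch}},
        {"$group": {
            "_id": "$join_key",
            "parent_hits": {"$sum": "$is_parent"},
            "child_hits": {"$sum": {"$subtract": [1, "$is_parent"]}},
            "best_score": {"$max": "$score"},
        }},
        # Keep only keys seen in both the metadata match and the vector search.
        {"$match": {"parent_hits": {"$gt": 0}, "child_hits": {"$gt": 0}}},
        {"$sort": {"best_score": -1}},
        {"$limit": limit},
    ]
```

This would be run as something like `db["parents"].aggregate(build_union_pipeline({"category": "manuals"}, vec))`. The output is one row per parent key with its best child score, so fetching the full parent metadata would take a follow-up query. If too few rows come back, raising `overrequest` is the knob to try, at the cost of a larger vector search.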

Future State

We are looking at supporting nested vector search, which I think should help with the parent metadata duplication problem, depending on how performant either of these query patterns and data models is for you. Do let me know how these suggestions fit (or don’t fit) your use case, so we can make sure that experience accounts for your needs appropriately.