Vector Search Pre-filtering

How can I efficiently pre-filter a vector search query based on fields in another collection? Because of the limitation of a vector index being one-to-one with a document, my content has an associated embedding collection in which each embedding is associated with a content id. I want to search my embeddings but apply pre-filters based on properties in the content collection. Is there a way of doing this cleanly without having to duplicate all of the properties from my content collection onto my embedding documents and then filtering on those?

Hi @Cameron_Flanagan, thanks for the question. It would be useful to have more insight into your query pattern and data model, but I can take a stab at a broad answer on a couple of dimensions.

Depending on the degree of data duplication, I think you have a few options:

Scenario 1. Minimal metadata duplication relative to the size of the binData compressed vectors (say <20% of total document size is non-vector)

In this scenario I would recommend duplicating the parent metadata into the same document as each embedding. The impact on storage shouldn't be massive, and the query performance gained compared to the pattern suggested in Scenario 2 is likely considerable.
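As a rough sketch of what this looks like in practice, here is an aggregation pipeline expressed as a Python data structure. The index name (`vector_index`), field names (`embedding`, `content_id`, `category`, `published`), and the query vector are all illustrative assumptions, not taken from your schema. Because the parent metadata lives on the same document as the vector, the `filter` inside `$vectorSearch` acts as a true pre-filter:

```python
# Scenario 1 sketch: metadata duplicated onto each embedding document,
# so $vectorSearch can pre-filter directly. All names below are assumed.
query_vector = [0.1, 0.2, 0.3]  # placeholder embedding

pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",      # name of the Atlas Vector Search index (assumed)
            "path": "embedding",          # field holding the binData vector (assumed)
            "queryVector": query_vector,
            "numCandidates": 200,
            "limit": 10,
            # pre-filter on metadata copied from the parent content document;
            # these fields must be declared as filter fields in the index
            "filter": {"category": "news", "published": True},
        }
    },
    {"$project": {"content_id": 1, "score": {"$meta": "vectorSearchScore"}}},
]
```

You would then run this with something like `db.embeddings.aggregate(pipeline)` (collection name assumed) from your driver of choice.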

Scenario 2. Large metadata duplication relative to the size of the binData compressed vectors (say >80% of total document size is non-vector)

Here I would suggest unioning the result sets of two queries issued against the separate collections, and then grouping on a common key:

[
  { $match: [find matching documents in parent metadata] },
  { $unionWith: {
      coll: [embedding collection],
      pipeline: [ { $vectorSearch: [find relevant child documents with a key tying back to the parent metadata] } ]
  } },
  { $group: [group on the common key] }
]

This is very similar to post-filtering, so it may have performance implications depending on how many documents are over-requested.
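To make the union-and-group pattern above concrete, here is a pipeline sketch run against the parent content collection. All collection, index, and field names (`embeddings`, `vector_index`, `content_id`, `category`) are assumptions for illustration; the over-request sizes are placeholders you would tune:

```python
# Scenario 2 sketch: metadata match on the parent collection, unioned with
# a vector search on the embeddings collection, then grouped on a shared key.
query_vector = [0.1, 0.2, 0.3]  # placeholder embedding

pipeline = [
    # 1) find parent documents matching the metadata pre-filter
    {"$match": {"category": "news"}},
    {"$project": {"content_id": "$_id", "matched_parent": True}},
    # 2) union in the vector search results from the embeddings collection
    {
        "$unionWith": {
            "coll": "embeddings",
            "pipeline": [
                {
                    "$vectorSearch": {
                        "index": "vector_index",
                        "path": "embedding",
                        "queryVector": query_vector,
                        "numCandidates": 500,  # over-request so enough hits survive the join
                        "limit": 100,
                    }
                },
                {"$project": {"content_id": 1, "score": {"$meta": "vectorSearchScore"}}},
            ],
        }
    },
    # 3) group on the common key; keep only keys present in BOTH branches
    #    (a metadata match AND a vector hit)
    {
        "$group": {
            "_id": "$content_id",
            "matched_parent": {"$max": "$matched_parent"},
            "score": {"$max": "$score"},
        }
    },
    {"$match": {"matched_parent": True, "score": {"$ne": None}}},
    {"$sort": {"score": -1}},
]
```

The over-request trade-off mentioned above is visible here: `limit` in the inner `$vectorSearch` must be large enough that, after intersecting with the metadata matches, you still have as many results as you actually want to return.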

Scenario 3. Future State

We are looking at supporting nested vector search, which should help with the parent metadata duplication problem, depending on how performant either of these query patterns / data models is for you. Do let me know how these suggestions fit (or don't) against your use case, so we can make sure that experience considers your needs appropriately.


Yeah, so I definitely think we can go with Scenario 1 for now, as there are a limited number of filters we would be looking to apply. Longer term, option 3 would definitely be nice!

In terms of the detail: because of the limit of having a vector index on a non-list field, we have stored all of the document embeddings for the content in question in a separate collection. This is because we are chunking large pieces of content for better search performance. However, we also have a series of metadata fields on that content object model that we would use to filter the results, so we are of course duplicating those fields across each "chunk" in the other collection so we can do the filtering. But as you say, it's a small amount of duplication, and hopefully in future we can find a nicer solution.
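For anyone else reading this, a hypothetical shape for one of those chunk documents might look like the following. All field names and values here are illustrative assumptions, not the actual schema:

```python
# Hypothetical chunk document in the separate embeddings collection.
# "content_id" ties the chunk back to its parent content document, and the
# small set of filterable metadata fields is copied onto every chunk.
chunk_doc = {
    "content_id": "article-123",       # key back to the parent content document
    "chunk_index": 4,                  # position of this chunk within the content
    "text": "...one chunk of the long-form content...",
    "embedding": [0.12, -0.05, 0.31],  # vector for this chunk (binData in practice)
    # duplicated parent metadata used for pre-filtering:
    "category": "news",
    "published": True,
}
```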

I imagine this sort of pattern is quite common for anyone searching across long-form content? Really appreciate the advice though; Scenario 1 was sort of my thinking, so it's good to validate that.