We are using Vector Search to power our semantic search applications and are trying to find information on the best way to structure our application/data for scaling up. Here are a few questions we couldn’t answer from the docs:
https://www.mongodb.com/docs/atlas/atlas-vector-search/tune-vector-search/ says: “You must ensure that the data nodes have enough RAM to hold the vector data and indexes.” So I assume that “the vector data” means n_documents * embedding_dim; where can I see the index size? Is it the size indicated under the “Atlas Search” tab (5 GB in the screenshot below), or does that also include the vector data?
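To make our assumption explicit, this is the back-of-envelope calculation we have been using for “the vector data” (a rough sketch only; it assumes float32 vectors at 4 bytes per dimension, ignores any graph/metadata overhead of the index, and the numbers are just placeholders):

```python
# Back-of-envelope estimate of the raw vector data size only (not the full
# index). Assumes float32 storage (4 bytes per dimension); index graph and
# metadata overhead come on top of this. Numbers below are placeholders.
n_documents = 1_000_000   # number of embedded chunks (example value)
embedding_dim = 1536      # embedding dimensionality (example value)
bytes_per_float = 4       # float32

raw_vector_bytes = n_documents * embedding_dim * bytes_per_float
print(f"Raw vector data: {raw_vector_bytes / 1024**3:.2f} GiB")
# ~5.72 GiB for 1M x 1536-dim float32 vectors, before any index overhead
```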
Assuming that the index fits in RAM: how does the performance scale with the number of vectors? Is it ~O(1)? O(log n)? Something else?
The docs recommend pre-filtering the data to improve performance. How does the performance scale w.r.t. the number of vectors / the number matched by the filter? Say we have 1M vectors and our filter matches 20k of them. Will the vector search performance be the same as searching an index with only 20k vectors?
Is there a limit to how many “partitions” we can have on the vectors for filtering? Say we have a field filter_field which we use to split the embeddings into N groups, which we then use to pre-filter the embeddings before doing the vector search. Can N be arbitrarily large without increasing the search latency?
Our embeddings are all in a single collection and are partitioned into N buckets by a field (say bucket_id). We typically search within a group of M buckets by using the pre-filter "$in": [bucket_id_1, ..., bucket_id_M]. How does M impact the performance of the search? Is this approach suitable for scenarios where M might grow very large (100? 1k? 10k?)? The query pattern we use is sketched below for concreteness.
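This is roughly what our queries look like (pymongo; the connection string, index name, vector field name, and query vector are placeholders; only bucket_id comes from our actual schema, and it is declared as a filter field in the vector search index definition):

```python
# Minimal sketch of the bucket-based pre-filter query pattern described above.
# Index name, vector field, namespace, and query vector are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<cluster-uri>")  # placeholder URI
coll = client["mydb"]["embeddings"]                   # placeholder namespace

bucket_ids = ["bucket_id_1", "bucket_id_2"]           # the M buckets to search
query_vector = [0.01] * 1536                          # placeholder embedding

pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",
            "path": "embedding",
            "queryVector": query_vector,
            "numCandidates": 200,
            "limit": 10,
            # Pre-filter: only consider vectors in the requested buckets
            "filter": {"bucket_id": {"$in": bucket_ids}},
        }
    },
    {"$project": {"bucket_id": 1, "score": {"$meta": "vectorSearchScore"}}},
]
results = list(coll.aggregate(pipeline))
```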
In case the answer to the previous question is that M should stay relatively small, here is the problem we are trying to solve. Our current scenario (simplifying a bit) is that we partition data per customer, where a customer might have a few “buckets” of embeddings (actually, each bucket is a set of text documents that are then split and embedded). So when doing a search for customer X, we get the list of that customer’s buckets that are relevant to the search and filter embeddings belonging to those buckets using “$in”. We’re scaling up now and are facing the issue that some customers have massive amounts of buckets in which the majority of text documents (and thus embeddings) are duplicated across many of the buckets, leading to massive amounts of duplicate embeddings.
One solution to this (cf. the previous question) would be that rather than filtering based on buckets, we filter based on the text documents within the bucket (each embedding would only point to its parent text document, and “buckets” would contain a list of parent text documents). This means that to compute the filter for a query, we’d get the list of text documents within a bucket (potentially very large) and filter for embeddings whose parent document is $in that list.
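For reference, this is roughly what we assume the index definition would need to look like to support pre-filtering on either field (a sketch only; field names, dimensions, and similarity are placeholders for our real schema, and creating the index from the driver like this needs a recent pymongo, otherwise it can be done from the Atlas UI):

```python
# Sketch of a vector search index definition that allows pre-filtering on
# either bucket_id or document_id. Field names, dimensions, and similarity
# are assumptions; adjust to the real schema.
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

client = MongoClient("mongodb+srv://<cluster-uri>")  # placeholder URI
coll = client["mydb"]["embeddings"]                   # placeholder namespace

index_definition = {
    "fields": [
        {
            "type": "vector",
            "path": "embedding",
            "numDimensions": 1536,
            "similarity": "cosine",
        },
        {"type": "filter", "path": "bucket_id"},
        {"type": "filter", "path": "document_id"},
    ]
}

coll.create_search_index(
    SearchIndexModel(definition=index_definition, name="vector_index", type="vectorSearch")
)
```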
Hey @Luca_Dorigo! Thanks for submitting your question. Here are some answers that are hopefully a helpful guide:
Can you help me understand a bit more why there is data duplication? From how you described it, it sounded like a pretty standard multi-tenant architecture.
Hi Henry, thank you for your answer, and sorry for the delay in replying; I did not get a notification for your reply!
Let me give some more details about our actual use case. The general setup is indeed multi-tenant: in our case, our topmost “tenants” (our customers) are educational institutions. However, for each tenant we have a further subdivision into user groups, which in practice generally correspond to a specific instance of a course given at the institution.
At the tenant level, there is indeed little to no duplication; however, we found that at the course level, the vast majority of documents are heavily duplicated across courses (in particular because some institutions tend to run many “copies” of the same course in parallel, with most but not all content shared amongst them). To give an idea, we found that the vast majority (95%+) of course-specific documents are duplicated in up to 50 different courses.
The “dumb” solution for now was to just chunk/embed each of those files 50 times, which will obviously not scale well at all.
The easy solution is what I described above: rather than using a filter to get all documents/chunks associated with a course (where the filter would be chunk.document.course_id == XXX), we would store each document as a unique object, and the chunks (which are embedded) would point to the document to which they belong; for each course, we would then store a list of documents, and the queries would use a filter like chunk.document_id in course.document_ids (see the sketch below). My question is whether the performance here would be reasonable, since course.document_ids might contain several thousand elements, and I’m not sure whether the vector search index is optimized for this type of query.
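Concretely, the query would look something like this (a sketch under assumed collection/field names; courses, document_ids, document_id, embedding, the index name, and the query vector are all placeholders for our real schema):

```python
# Sketch of the "easy solution": resolve the (potentially large) list of
# document IDs for a course, then use it as a pre-filter in $vectorSearch.
# Collection/field/index names and the query vector are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<cluster-uri>")  # placeholder URI
db = client["mydb"]                                   # placeholder database

course = db.courses.find_one({"_id": "course_XXX"})
document_ids = course["document_ids"]   # may contain several thousand IDs

query_vector = [0.01] * 1536            # placeholder embedding

pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",
            "path": "embedding",
            "queryVector": query_vector,
            "numCandidates": 200,
            "limit": 10,
            # Pre-filter each chunk by its parent document rather than by bucket
            "filter": {"document_id": {"$in": document_ids}},
        }
    },
    {"$project": {"document_id": 1, "score": {"$meta": "vectorSearchScore"}}},
]
results = list(db.embeddings.aggregate(pipeline))
```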
Otherwise, we would have to write logic to factor out the “shared documents” and identify which courses use which sets of shared documents, but that’s significantly more work and housekeeping, so I only want to do this if I’m sure the “easy solution” is not enough.