How to Model Your Documents for Vector Search
Rate this tutorial
Atlas Vector Search was recently released, so let’s dive into a tutorial on how to properly model your documents when utilizing vector search to revolutionize your querying capabilities!
Data modeling in MongoDB revolves around organizing your data into documents within various collections. Varied projects or organizations will require different ways of structuring data models due to the fact that successful data modeling depends on the specific requirements of each application, and for the most part, no one document design can be applied for every situation. There are some commonalities, though, that can guide the user. These are:
- Choosing whether to embed or reference your related data.
- Using arrays in a document.
- Indexing your documents (finding fields that are frequently used and applying the appropriate indexing, etc.).
We are going to be building our vector embedding example using a MongoDB document for our series. Here, we have a single MongoDB document representing our MongoDB TV show, without any embeddings in place. We have a nested array featuring our array of seasons, and within that, our array of different episodes. This way, in our document, we are capable of seeing exactly which season each episode is a part of, along with the episode number, the title, the description, and the date:
Now that we have our example set up, let’s incorporate vector embeddings and discuss the proper techniques to set you up for success.
Let’s first understand exactly what vector search is: Vector search is the way to search based on meaning rather than specific words. This comes in handy when querying using similarities rather than searching based on keywords. When using vector search, you can query using a question or a phrase rather than just a word. In a nutshell, vector search is great for when you can’t think of exactly that book or movie, but you remember the plot or the climax.
This process happens when text, video, or audio is transformed via an encoder into vectors. With MongoDB, we can do this using OpenAI, Hugging Face, or other natural language processing models. Once we have our vectors, we can upload them in the base of our document and conduct vector search using them. Please keep in mind the of vector search and how to properly embed your vectors.
You can store your vector embeddings alongside other data in your document, or you can store them in a new collection. It is really up to the user and the project goals. Let’s go over what a document with vector embeddings can look like when you incorporate them into your data model, using the same example from above:
Here, you have your vector embeddings classified at the base in your document. Currently, there is a limitation where vector embeddings cannot be nested in an array in your document. Please ensure your document has your embeddings at the base. There are various tutorials on our , alongside our and our , that can help you figure out how to embed these vectors into your document and how to acquire the necessary vectors in the first place.
When setting up your search index, you want to change the “” to be your vector path. In our case, it would be “vectorEmbeddings”. “type” can stay the way it is. For “numDimensions”, please match the dimensions of the model you’ve chosen. This is just the number of vector dimensions, and the value cannot be greater than 2048. This limitation comes from the base embedding model that is being used, so please ensure you’re using a supported LLM (large language model) such as OpenAI or Hugging Face. When using one of these, there won’t be any issues running into vector dimensions. For “similarity”, please pick which vector function you want to use to search for the top K-nearest neighbors.
When you’re ready to query and find results from your embedded documents, it’s time to create an aggregation pipeline on your embedded vector data. To do this, you can use the“$vectorSearch” operator, which is a new aggregation stage in Atlas. It helps execute an Approximate Nearest Neighbor query.