Semantic search with Jina Embeddings v2 and MongoDB Atlas
AI embeddings are a great ally for semantic search.
Using vectors to identify and rank matches has been a part of search for longer than AI has. The venerable tf/idf algorithm, whose roots go back to the 1960s and 1970s, uses the counts of words, and sometimes parts of words and short combinations of words, to create representative vectors for text documents. It then uses the distance between vectors to find and rank potential query matches and to compare documents to each other. It forms the basis of many information retrieval systems.
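To make the idea concrete, here is a minimal sketch of tf/idf-style retrieval using only the Python standard library. It assumes whitespace tokenization, a plain logarithmic idf weight, and cosine similarity as the measure of closeness; real systems use many variants of each. The function names and the three example texts are invented for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf/idf vector (a dict of word -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Weight = relative frequency in this document * log-scaled rarity overall.
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "cozy apartment near the beach",
    "sunny apartment near the park",
    "mountain cabin with a fireplace",
]
vectors = tfidf_vectors(docs)
```

Comparing `vectors[0]` with `vectors[1]` gives a higher score than comparing it with `vectors[2]`, because the first two texts share words. Note that such vectors only match on shared words, which is the gap that AI embeddings are meant to close.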
We call this “semantic search” because these vectors already have information about the meaning of documents built into them. Searching with semantic embeddings works the same way, but instead, the vectors come from AI models that do a much better job of making sense of the documents.
Because vector-based retrieval is a time-honored technique, there are database platforms that already have all the mechanics to do it. All you have to do is plug in your AI embeddings model.
This article will show you how to enhance MongoDB Atlas — an out-of-the-box, cloud-based solution for document retrieval — with Jina Embeddings’ top-of-the-line AI to produce your own killer search solution.
To get started, log in to your MongoDB Atlas account. Once logged in, you should see your Projects page. If not, use the navigation menu on the left to get to it.
Create a new project by clicking the New Project button on the right.
You can add new members as you like, but you shouldn’t need to for this tutorial.
This should return you to the Overview page where you can now create a deployment. Click the +Create button to do so.
Select the M0 Free tier for this project and the provider of your choice, and then click the Create button at the bottom of the screen.
On the next screen, you will need to create a user with a username and secure password for this deployment. Do not lose this password and username! They are the only way you will be able to access your work.
Then, select access options. For this tutorial, we recommend selecting My Local Environment and clicking the Add My Current IP Address button.
If you have a VPN or a more complex security topology, you may have to consult your system administrator to find out what IP address you should insert here instead of your current one.
After that, click Finish and Deploy at the bottom of the page. After a brief pause, you will have an empty MongoDB database deployed on Atlas for you to use.
Note: If you have difficulty accessing your database from outside, you can get rid of the IP Access List and accept connections from all IP addresses. Normally, this would be very poor security practice, but because this is a tutorial that uses publicly available sample data, there is little real risk.
To do this, click the Network Access tab under Security on the left side of the page:
Then, click ADD IP ADDRESS from the right side of the page:
You will get a modal window. Click the button marked ALLOW ACCESS FROM ANYWHERE, and then click Confirm.
Your Network Access tab should now have an entry labeled 0.0.0.0/0.
This will allow any IP address to access your database if it has the right username and password.
In this tutorial, we will be using a sample database of Airbnb reviews. You can add this to your database from the Database tab under Deployment in the menu on the left side of the screen. Once you are on the “Database Deployments” page, find your cluster (on the free tier, you are only allowed one, so it should be easy). Then, click the “three dots” button and choose Load Sample Data. It may take several minutes to load the data.
This will add a collection of free data sources to your MongoDB instance for you to experiment with, including a database of Airbnb reviews.
For the rest of this tutorial, we will use Python and PyMongo to access your new MongoDB Atlas database.
Make sure PyMongo is installed in your Python environment. You can install it with pip by running pip install pymongo on the command line.
You will also need to know:
- The username and password you set when you set up the database.
- The URL to access your database deployment.
If you have lost your username and password, click on the Database Access tab under Security on the left side of the page. That page will enable you to reset your password.
To get the URL to access your database, return to the Database tab under Deployment on the left side of the screen. Find your cluster, and look for the button labeled Connect. Click it.
A modal pop-up window will open. Click Drivers under Connect to your application. In the window that follows, under number three, you will see the URL you need but without your password. You will need to add your password when using this URL.
Create a file for a new Python script. You can call it test_mongo_connection.py. Write into this file code that uses PyMongo to create a client connection to your database.
Remember to insert the URL to connect to your database, including the correct username and password.
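One way to write the connection step is sketched below. It assumes the connection string follows the usual mongodb+srv:// pattern; the host name, username, and password shown are hypothetical placeholders, and the helper names `build_atlas_uri` and `connect` are invented for this example. Escaping the credentials with `quote_plus` guards against passwords that contain characters such as "@" or ":".

```python
from urllib.parse import quote_plus


def build_atlas_uri(username, password, host):
    """Assemble a mongodb+srv:// connection string with escaped credentials."""
    # Characters such as "@" or ":" in a password must be percent-encoded,
    # otherwise the driver cannot parse the URI.
    user = quote_plus(username)
    pwd = quote_plus(password)
    return f"mongodb+srv://{user}:{pwd}@{host}/?retryWrites=true&w=majority"


def connect(uri):
    """Create a client and check the connection with a ping."""
    # Imported here so the helper above can be used without PyMongo installed.
    from pymongo import MongoClient

    client = MongoClient(uri)
    client.admin.command("ping")  # raises an exception if the server is unreachable
    return client


# Hypothetical placeholder values; substitute the ones from your own deployment.
uri = build_atlas_uri("my_user", "p@ss:word", "cluster0.example.mongodb.net")
print(uri)
```

With real credentials in place, `client = connect(uri)` should return a working client, or raise an error if the URL, the credentials, or the IP access list are wrong.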
Next, add code to connect to the Airbnb review dataset that was installed as sample data:
The variable collection now refers to the listings collection, and querying it returns the dataset item by item. To test that it works, add a line such as print(collection.find_one()) and run test_mongo_connection.py. This will print JSON-like text that contains the information in one database entry, whichever one it happened to find first. The entry should include fields such as the name, summary, and description of a listing.
Getting a text response like this will show that you can connect to your MongoDB Atlas database.
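The collection lookup and the test can be sketched as follows. The database and collection names, sample_airbnb and listingsAndReviews, are the ones Atlas uses for this sample data; the function names are invented for this example, and `client` is assumed to be an already connected PyMongo client.

```python
def get_listings(client):
    """Return the sample Airbnb collection from a connected client."""
    return client["sample_airbnb"]["listingsAndReviews"]


def show_first_listing(client):
    """Fetch one document from the collection, print it, and return it."""
    listing = get_listings(client).find_one()
    print(listing)
    return listing
```

Calling `show_first_listing(client)` should print a single listing as a Python dictionary. If it prints None, the connection works but the sample data has probably not finished loading.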
Next, go to the Jina Embeddings web page and copy the API key shown there. It provides you with 10,000 tokens of free embedding using Jina Embeddings models. Because of this limit on the free tier, we will only embed a small part of the Airbnb reviews collection. You can buy additional quota by clicking the “Top up” tab on the Jina Embeddings web page if you want to either embed the entire collection on MongoDB Atlas or apply these steps to another dataset.
Test your API key by creating a new script called test_jina_ai_connection.py that sends a short text to the Jina Embeddings API, inserting your API key where needed. Run the script test_jina_ai_connection.py. It should print a response that contains an embedding for your text.
This indicates you have access to Jina Embeddings via its API.
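A minimal sketch of such a test script follows. It uses only the standard library rather than a third-party HTTP client. The endpoint URL and the shape of the response (a "data" list whose items carry an "embedding") are assumptions based on Jina AI's public API and may change, so check them against the current Jina documentation. The helper name `build_embedding_request` is invented for this example.

```python
import json
import urllib.request

JINA_API_URL = "https://api.jina.ai/v1/embeddings"  # assumed endpoint
MODEL_NAME = "jina-embeddings-v2-base-en"


def build_embedding_request(texts, api_key):
    """Prepare (but do not send) a POST request for a list of texts."""
    body = json.dumps({"input": texts, "model": MODEL_NAME}).encode("utf-8")
    return urllib.request.Request(
        JINA_API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def generate_embeddings(texts, api_key):
    """Send the texts to the API and return one embedding vector per text."""
    request = build_embedding_request(texts, api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    # Assumed response shape: {"data": [{"index": 0, "embedding": [...]}, ...]}
    items = sorted(payload["data"], key=lambda item: item.get("index", 0))
    return [item["embedding"] for item in items]
```

To try it, call `generate_embeddings(["Hello, world!"], "<your API key>")` and print the length of the first vector. An HTTP 401 or 403 error usually means the key was not copied correctly.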
Now, we’re going to put all these pieces together with some Python functions to use Jina Embeddings to assign embedding vectors to descriptions in the Airbnb dataset.
Create a new Python script called index_embeddings.py, and insert some code to import libraries and declare some variables. Then, add code to set up a MongoDB client and connect to the Airbnb dataset.
Now, we will add to the script a function to convert lists of texts into embeddings using the jina-embeddings-v2-base-en AI model.
And we will create a function that iterates over up to 30 documents in the listings database, creating embeddings for the descriptions and summaries, and adding them to each entry in the database.
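The indexing function could look roughly like the sketch below. Several details are assumptions made for illustration: the summary and description are joined into one text per listing, the vector is stored in a field named `embedding`, and listings that already have that field are skipped. The `embed` parameter stands for the function that converts lists of texts into embeddings, and `collection` for the PyMongo collection.

```python
def text_to_embed(doc):
    """Join the summary and description of a listing into one text."""
    parts = [doc.get("summary") or "", doc.get("description") or ""]
    return " ".join(part for part in parts if part)


def index_embeddings(collection, embed, limit=30, field="embedding"):
    """Add an embedding to up to `limit` listings that do not have one yet.

    `embed` is any callable that maps a list of texts to a list of vectors.
    Returns the number of documents updated.
    """
    # Only fetch listings without an embedding, so reruns do not repeat work.
    docs = list(collection.find({field: {"$exists": False}}).limit(limit))
    # Skip listings that have neither a summary nor a description.
    docs = [doc for doc in docs if text_to_embed(doc)]
    if not docs:
        return 0
    # One API call for the whole batch keeps the number of requests low.
    vectors = embed([text_to_embed(doc) for doc in docs])
    for doc, vector in zip(docs, vectors):
        collection.update_one({"_id": doc["_id"]}, {"$set": {field: vector}})
    return len(docs)
```

Keeping the limit at 30 keeps the job within the free token quota. Because documents that already carry the field are skipped, running the function again extends the indexed set instead of paying to embed the same listings twice.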
With this in place, we can now index the collection. Run the script index_embeddings.py. This may take several minutes.
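Once the script has finished, one way to confirm that the embeddings were stored is to count the documents that carry the new field. This sketch assumes the vectors were written to a field named `embedding`, which is a hypothetical name; use whatever field name your script writes to.

```python
def count_embedded(collection, field="embedding"):
    """Count the documents that have an embedding stored in `field`."""
    return collection.count_documents({field: {"$exists": True}})
```

If `count_embedded(collection)` returns 0, the update step did not write anything, and the search index built in the next steps would have nothing to search.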
When this finishes, we will have added embeddings to 30 of the Airbnb items.
Return to the MongoDB website, and click on Database under Deployment on the left side of the screen.
Click on the link for your cluster (Cluster0 in the image above).
Find the Search tab in the cluster page and click it to get a page like this:
Click the button marked Create Search Index.
Now, click JSON Editor and then Next:
Now, perform the following steps:
- Under Database and Collection, find sample_airbnb, and underneath it, check listingsAndReviews.
- Under Index Name, fill in the name listings_comments_semantic_search.
- Underneath that, in the numbered lines, add the JSON text that defines the vector index.
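The sketch below prints a JSON definition you could paste into the editor. It assumes the Atlas Vector Search definition format (a "fields" list with a "vector" entry), a hypothetical field name `embedding`, and cosine similarity. Atlas has offered more than one index definition format over time, so compare the output with what your version of the editor expects. The dimension count of 768 is the output size of jina-embeddings-v2-base-en.

```python
import json

EMBEDDING_FIELD = "embedding"   # hypothetical field name, must match your data
EMBEDDING_DIMENSIONS = 768      # output size of jina-embeddings-v2-base-en


def vector_index_definition(path=EMBEDDING_FIELD,
                            dimensions=EMBEDDING_DIMENSIONS,
                            similarity="cosine"):
    """Build a vector index definition as a plain dictionary."""
    if similarity not in {"cosine", "dotProduct", "euclidean"}:
        raise ValueError(f"unsupported similarity: {similarity}")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": dimensions,
                "similarity": similarity,
            }
        ]
    }


# Print the definition as JSON text, ready to paste into the editor.
print(json.dumps(vector_index_definition(), indent=2))
```

The path has to name the field that holds the vectors, and the dimension count has to match the model, or the index will not match the stored embeddings.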
Now click Next and then Create Search Index in the next screen:
This will schedule the indexing in MongoDB Atlas. You may have to wait several minutes for it to complete.
When indexing has completed, a modal window will pop up.
Now that our embeddings are indexed, return to your Python client to perform a search.
We will write a search function that does the following:
- Take a query string and convert it to an embedding using Jina Embeddings and our existing generate_embeddings function.
- Query the index on MongoDB Atlas using the client connection we already set up.
- Print names, summaries, and descriptions of the matches.
Define the search functions as follows:
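A minimal sketch of the three steps above, built on the `$vectorSearch` aggregation stage. The index name is the one created earlier; the field name `embedding`, the default of three results, and the 100 candidates are assumptions chosen for illustration. As before, `embed` stands for the function that turns a list of texts into vectors, and `collection` for the PyMongo collection.

```python
INDEX_NAME = "listings_comments_semantic_search"


def build_search_pipeline(query_vector, limit=3, num_candidates=100,
                          field="embedding"):
    """Build an aggregation pipeline that runs a vector search."""
    return [
        {
            "$vectorSearch": {
                "index": INDEX_NAME,
                "path": field,                  # field holding the stored vectors
                "queryVector": query_vector,
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            # Keep only the fields we want to print, plus the match score.
            "$project": {
                "_id": 0,
                "name": 1,
                "summary": 1,
                "description": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


def search(collection, query, embed, limit=3):
    """Embed the query, run the vector search, and print the matches."""
    query_vector = embed([query])[0]
    results = list(collection.aggregate(build_search_pipeline(query_vector, limit)))
    for result in results:
        print(result.get("name"))
        print(result.get("summary"))
        print(result.get("description"))
        print()
    return results
```

A call might look like `search(collection, "a quiet place near the beach", embed)`, where the query text is only an example. Raising `num_candidates` generally trades speed for recall.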
And now, let’s run a search by calling the search function with a query string.
Your results may vary because this tutorial did not index all the documents in the dataset, and which ones were indexed may differ. Each match should be printed with its name, summary, and description.
Experiment with your own queries to see what you get.
You’ve now created the core of a MongoDB Atlas-based semantic search engine, powered by Jina AI’s state-of-the-art embedding technology. For any project, you will follow essentially the same steps outlined above:
- Create an Atlas instance and fill it with your data.
- Create embeddings for your data items using the Jina Embeddings API and store them in your Atlas instance.
- Index the embeddings using MongoDB’s vector indexer.
- Implement semantic search using embeddings.
This boilerplate Python code will integrate easily into your own projects, and you can create equivalent code in Java, JavaScript, or any other language or integration framework that supports HTTPS.
To see the full documentation of the MongoDB Atlas API, so you can integrate it into your own offerings, see the Atlas API section of the MongoDB website.
To learn more about Jina Embeddings and its subscription offerings, see the Embeddings page of the Jina AI website. You can find the latest news about Jina AI’s embedding models on the Jina AI website and X/Twitter, and you can contribute to discussions on Discord.