Semantic search with Jina Embeddings v2 and MongoDB Atlas

Scott Martens, Saahil Ognawala • 12 min read • Published Dec 05, 2023 • Updated Dec 05, 2023
Semantic search is a natural fit for AI embeddings.
Using vectors to identify and rank matches has been a part of search for longer than AI has. The venerable tf/idf algorithm, which dates back to the 1960s, uses the counts of words, and sometimes parts of words and short combinations of words, to create representative vectors for text documents. It then uses the distance between vectors to find and rank potential query matches and compare documents to each other. It forms the basis of many information retrieval systems.
We call this “semantic search” because these vectors already have information about the meaning of documents built into them. Searching with semantic embeddings works the same way, except that the vectors come from AI models that do a much better job of making sense of the documents.
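To make the vector idea concrete, here is a minimal sketch of tf/idf ranking in plain Python. It is an illustration only: the smoothed idf formula, the whitespace tokenizer, the toy listings, and the choice to include the query in the corpus statistics are all assumptions made for brevity, not part of this tutorial's pipeline.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (a dict: word -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            word: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[word])) + 1)
            for word, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy data, invented for this example.
docs = [
    "quiet apartment near the beach",
    "sunny apartment with a beach view",
    "mountain cabin with a fireplace",
]
query = "apartment by the beach"

# Vectorize the query together with the documents so they share one vocabulary.
query_vec, *doc_vecs = tfidf_vectors([query] + docs)
scores = [cosine_similarity(query_vec, vec) for vec in doc_vecs]
ranked = sorted(zip(scores, docs), reverse=True)
for score, text in ranked:
    print(f"{score:.3f}  {text}")
```

The listing that shares no words with the query scores zero. That is the weakness of word counts that AI embeddings address: a semantically related text with different wording can still land close to the query.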
Because vector-based retrieval is a time-honored technique, there are database platforms that already have all the mechanics to do it. All you have to do is plug in your AI embeddings model.
This article will show you how to enhance MongoDB Atlas — an out-of-the-box, cloud-based solution for document retrieval — with Jina Embeddings’ top-of-the-line AI to produce your own killer search solution.

Setting up

You will first need a MongoDB Atlas account. Register for a new account or sign in using your Google account directly on the website.
[Image: Mongo Atlas sign-up screen]

Create a project

Once logged in, you should see your Projects page. If not, use the navigation menu on the left to get to it.
[Image: Mongo Atlas Projects page]
Create a new project by clicking the New Project button on the right.
[Image: Mongo Atlas "Create a Project" screen]
You can add new members as you like, but you shouldn’t need to for this tutorial.
[Image: The "Add Members" screen of the "Create a Project" page]

Create a deployment

This should return you to the Overview page where you can now create a deployment. Click the +Create button to do so.
[Image: "Create a Deployment" screen on the "Overview" page]
Select the M0 Free tier for this project and the provider of your choice, and then click the Create button at the bottom of the screen.
[Image: Mongo Atlas deployment screen]
On the next screen, you will need to create a user with a username and secure password for this deployment. Do not lose this password and username! They are the only way you will be able to access your work.
[Image: Adding a user and configuring security settings for a Mongo Atlas deployment]
Then, select access options. For this tutorial, we recommend selecting My Local Environment and clicking the Add My Current IP Address button.
[Image: Configuring access restrictions for a Mongo Atlas deployment]
If you have a VPN or a more complex security topology, you may have to consult your system administrator to find out which IP address you should enter here instead of your current one.
After that, click Finish and Deploy at the bottom of the page. After a brief pause, you will now have an empty MongoDB database deployed on Atlas for you to use.
Note: If you have difficulty accessing your database from outside, you can get rid of the IP Access List and accept connections from all IP addresses. Normally, this would be very poor security practice, but because this is a tutorial that uses publicly available sample data, there is little real risk.
To do this, click the Network Access tab under Security on the left side of the page:
[Image: The Network Access tab on the Mongo Atlas sidebar]
Then, click ADD IP ADDRESS from the right side of the page:
[Image: Allowing access from all IP addresses, on the Network Access screen]
You will get a modal window. Click the button marked ALLOW ACCESS FROM ANYWHERE, and then click Confirm.
[Image: Modal window for entering information about specific IP address restrictions]
Your Network Access tab should now have an entry labeled 0.0.0.0/0.
This will allow any IP address to access your database if it has the right username and password.

Adding data

In this tutorial, we will be using a sample database of Airbnb reviews. You can add this to your database from the Database tab under Deployments in the menu on the left side of the screen. Once you are on the “Database Deployments” page, find your cluster (on the free tier, you are only allowed one, so it should be easy). Then, click the “three dots” button and choose Load Sample Data. It may take several minutes to load the data.
[Image: Loading sample data into a Mongo Atlas deployment]
This will add a collection of free data sources to your MongoDB instance for you to experiment with, including a database of Airbnb reviews.

Using PyMongo to access your data

For the rest of this tutorial, we will use Python and PyMongo to access your new MongoDB Atlas database.
Make sure PyMongo is installed in your Python environment. You can do this with the following command:
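PyMongo is published on PyPI under the name pymongo, so in a typical pip-based environment it can be installed with `python -m pip install pymongo`. The small check below is a sketch (the helper name is hypothetical) that reports whether the package is visible to the interpreter you are about to use.

```python
# Install PyMongo first, for example with:  python -m pip install pymongo
import importlib.util


def pymongo_available():
    """Return True if the pymongo package can be imported in this environment."""
    return importlib.util.find_spec("pymongo") is not None


print("PyMongo installed:", pymongo_available())
```

If this prints False, the install most likely went into a different Python environment than the one running the script.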
You will also need to know:
  1. The username and password you set when you set up the database.
  2. The URL to access your database deployment.
If you have lost your username and password, click on the Database Access tab under Security on the left side of the page. That page will enable you to reset your password.
[Image: The Database Access tab on the Mongo Atlas sidebar]
To get the URL to access your database, return to the Database tab under Deployment on the left side of the screen. Find your cluster, and look for the button labeled Connect. Click it.
[Image: The “Database Deployments” page of Mongo Atlas]
You will see a modal pop-up window like the one below:
[Image: Modal window providing information on accessing a MongoDB Atlas deployment]
Click Drivers under Connect to your application. You will see a modal window like the one below. Under number three, you will see the URL you need but without your password. You will need to add your password when using this URL.
[Image: Finding specific access information in the modal window]

Connecting to your database

Create a file for a new Python script. You can call it test_mongo_connection.py.
Write into this file the following code, which uses PyMongo to create a client connection to your database:
Remember to insert the URL to connect to your database, including the correct username and password.
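A connection script along these lines might look like the sketch below. The username, password, and cluster host are placeholders, and the `mongodb+srv://` layout with `retryWrites` and `w=majority` is assumed from the usual shape of the URL Atlas displays; copy the real one from the Connect dialog. The helper names `build_uri` and `connect` are hypothetical.

```python
from urllib.parse import quote_plus


def build_uri(username, password, host):
    """Build an Atlas connection string, percent-escaping the credentials."""
    return (
        f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}"
        f"@{host}/?retryWrites=true&w=majority"
    )


def connect(uri):
    """Create a client and ping the deployment to confirm the connection."""
    # Imported here so that build_uri can be used even without PyMongo installed.
    from pymongo import MongoClient

    client = MongoClient(uri)
    client.admin.command("ping")  # raises an exception if the server is unreachable
    return client


# Placeholder values: replace them with your own credentials and cluster host.
uri = build_uri("my_user", "p@ss/word", "cluster0.example.mongodb.net")
print(uri)
# client = connect(uri)  # uncomment once the placeholders are replaced
```

Escaping matters because characters such as `@`, `:` and `/` in a password would otherwise be read as part of the URL structure.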
Next, add code to connect to the Airbnb review dataset that was installed as sample data:
The variable collection is a handle to the dataset; calling its find() method returns a cursor that yields the entire dataset item by item. To test that it works, add the following line and run test_mongo_connection.py:
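As a sketch of this step, the function below looks up the sample data and fetches a single entry. It assumes the sample data lives in the `sample_airbnb` database and the `listingsAndReviews` collection, the names used in the index configuration later in this tutorial; `first_listing` is a hypothetical helper name, and `client` stands for the PyMongo client created earlier.

```python
def first_listing(client):
    """Return one document from the sample Airbnb collection (None if it is empty)."""
    db = client["sample_airbnb"]
    collection = db["listingsAndReviews"]
    # find_one() returns the first document the server happens to find.
    return collection.find_one()


# With a connected client, for example pymongo.MongoClient("<your URL>"):
# print(first_listing(client))
```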
This will print JSON-like text that contains the information in one database entry, whichever one it happened to find first. It should look something like this:
Getting a text response like this will show that you can connect to your MongoDB Atlas database.

Accessing Jina Embeddings v2

Go to the Jina AI embeddings website, and you will see a page like this:
[Image: Getting a token to access Jina Embeddings from the Jina AI website]
Copy the API key from this page. It provides you with 10,000 tokens of free embedding using Jina Embeddings models. Due to this limitation on the number of tokens allowed to be used in the free tier, we will only embed a small part of the Airbnb reviews collection. You can buy additional quota by clicking the “Top up” tab on the Jina Embeddings web page if you want to either embed the entire collection on MongoDB Atlas or apply these steps to another dataset.
Test your API key by creating a new script called test_jina_ai_connection.py and putting the following code into it, inserting your API key where marked:
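A test script could be sketched as follows, using only the Python standard library (the `requests` package would work equally well). The endpoint URL, the bearer-token header, and the `input`/`model` request fields are assumptions based on the publicly documented shape of the Jina Embeddings API; check them against the example shown on the Jina AI page where you copied your key. The function names are hypothetical.

```python
import json
import urllib.request

JINA_URL = "https://api.jina.ai/v1/embeddings"  # assumed endpoint
MODEL = "jina-embeddings-v2-base-en"


def build_request(api_key, texts):
    """Prepare the POST request without sending it."""
    body = json.dumps({"input": texts, "model": MODEL}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    return urllib.request.Request(JINA_URL, data=body, headers=headers, method="POST")


def fetch_embeddings(api_key, texts):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(api_key, texts), timeout=30) as resp:
        return json.load(resp)


# Insert your API key below, then uncomment to run the test.
# response = fetch_embeddings("<your Jina AI API key>", ["Hello, world!"])
# print(response)
```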
Run the script test_jina_ai_connection.py. You should get something like this:
This indicates you have access to Jina Embeddings via its API.

Indexing your MongoDB collection

Now, we’re going to put all these pieces together with some Python functions to use Jina Embeddings to assign embedding vectors to descriptions in the Airbnb dataset.
Create a new Python script, call it index_embeddings.py, and insert some code to import libraries and declare some variables:
Then, add code to set up a MongoDB client and connect to the Airbnb dataset:
Now, we will add to the script a function to convert lists of texts into embeddings using the jina-embeddings-v2-base-en AI model:
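One possible shape for this function is sketched below. It assumes the API answers with a `data` list whose items carry an `index` and an `embedding` (the OpenAI-style layout), and that the endpoint is the one named in the code; both are assumptions to verify against the Jina AI documentation. Passing the API key as a parameter is a choice made here so that the sketch is self-contained.

```python
import json
import urllib.request

JINA_URL = "https://api.jina.ai/v1/embeddings"  # assumed endpoint
MODEL = "jina-embeddings-v2-base-en"


def parse_embeddings(response):
    """Pull the vectors out of a response, ordered like the input texts."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def generate_embeddings(texts, api_key):
    """Convert a list of texts into a list of embedding vectors."""
    request = urllib.request.Request(
        JINA_URL,
        data=json.dumps({"input": texts, "model": MODEL}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as resp:
        return parse_embeddings(json.load(resp))
```

Sending all texts in one request keeps the number of API calls low, which helps when working within a limited token quota.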
And we will create a function that iterates over up to 30 documents in the listings database, creating embeddings for the descriptions and summaries, and adding them to each entry in the database:
With this in place, we can now index the collection:
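The indexing step could be sketched like this. Several details are assumptions rather than the tutorial's confirmed code: the summary and description are concatenated into a single text per listing, the resulting vector is stored in a hypothetical field named `embedding`, and `embed` is any callable that maps a list of texts to a list of vectors (for example, a wrapper around generate_embeddings with your API key filled in).

```python
def index_listings(collection, embed, limit=30):
    """Add embeddings to up to `limit` listings; return how many were updated."""
    # Only consider listings that have both text fields.
    query = {"summary": {"$exists": True}, "description": {"$exists": True}}
    docs = list(collection.find(query).limit(limit))
    if not docs:
        return 0
    # One text per listing: summary and description joined together (an assumption).
    texts = [f"{doc['summary']}\n{doc['description']}" for doc in docs]
    vectors = embed(texts)  # a single batched call for all listings
    for doc, vector in zip(docs, vectors):
        collection.update_one({"_id": doc["_id"]}, {"$set": {"embedding": vector}})
    return len(docs)


# With the PyMongo collection and an embedding function in place:
# index_listings(collection, lambda texts: generate_embeddings(texts, "<your API key>"))
```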
Run the script index_embeddings.py. This may take several minutes. When this finishes, we will have added embeddings to 30 of the Airbnb items.

Create the embedding index in MongoDB Atlas

Return to the MongoDB website, and click on Database under Deployment on the left side of the screen.
[Image: Creating an index on Mongo Atlas from the “Database Deployments” page]
Click on the link for your cluster (Cluster0 in the image above). Find the Search tab in the cluster page and click it to get a page like this:
[Image: Creating an index from the Search tab on the page for a specific deployment]
Click the button marked Create Search Index.
[Image: Configuring an index before creation]
Now, click JSON Editor and then Next:
[Image: Configuring an index by specifying parameters in JSON format]
Now, perform the following steps:
  1. Under Database and Collection, find sample_airbnb, and underneath it, check listingsAndReviews.
  2. Under Index Name, fill in the name listings_comments_semantic_search.
  3. Underneath that, in the numbered lines, add the following JSON text:
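An index definition for the JSON editor might look like the one printed by the sketch below. It is an assumption-laden example, not necessarily the exact text the authors used: it relies on the `knnVector` field type that Atlas Search offered when this tutorial was published, on a hypothetical `embedding` field holding the vectors, and on cosine similarity as the distance measure. The 768 dimensions match the output size of jina-embeddings-v2-base-en.

```python
import json

index_definition = {
    "mappings": {
        "dynamic": True,
        "fields": {
            # "embedding" is the hypothetical field that holds each listing's vector.
            "embedding": {
                "type": "knnVector",
                "dimensions": 768,       # output size of jina-embeddings-v2-base-en
                "similarity": "cosine",  # assumed; "euclidean" and "dotProduct" also exist
            }
        },
    }
}

# Print the definition as JSON, ready to paste into the Atlas JSON editor.
print(json.dumps(index_definition, indent=2))
```

Whatever field name you choose here has to match the field your indexing script writes to and the path your search query reads from.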
Your screen should look like this:
[Image: Completed index configuration in JSON format]
Now click Next and then Create Search Index in the next screen:
[Image: Confirming JSON configuration before creating an index]
This will schedule the indexing in MongoDB Atlas. You may have to wait several minutes for it to complete.
[Image: Modal confirmation that your index is being created]
When completed, the following modal window will pop up:
[Image: Modal confirmation that your index is ready to use]
Return to your Python client, and we will perform a search.

Search with embeddings

Now that our embeddings are indexed, we will perform a search.
We will write a search function that does the following:
  1. Take a query string and convert it to an embedding using Jina Embeddings and our existing generate_embeddings function.
  2. Query the index on MongoDB Atlas using the client connection we already set up.
  3. Print names, summaries, and descriptions of the matches.
Define the search functions as follows:
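The three steps above can be sketched as follows, using the `$vectorSearch` aggregation stage. The index name comes from the configuration step earlier; the `embedding` path, the `numCandidates` value, and the default of three results are assumptions, and `embed` again stands for any callable that turns a list of texts into a list of vectors.

```python
INDEX_NAME = "listings_comments_semantic_search"


def build_pipeline(query_vector, limit=3, num_candidates=100):
    """Aggregation pipeline that asks Atlas for the nearest stored vectors."""
    return [
        {
            "$vectorSearch": {
                "index": INDEX_NAME,
                "path": "embedding",  # hypothetical field holding the vectors
                "queryVector": query_vector,
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        # Keep only the fields we want to display.
        {"$project": {"_id": 0, "name": 1, "summary": 1, "description": 1}},
    ]


def search(query, collection, embed, limit=3):
    """Embed the query, run the vector search, and print each match."""
    (query_vector,) = embed([query])
    matches = list(collection.aggregate(build_pipeline(query_vector, limit)))
    for doc in matches:
        print(f"Name: {doc.get('name')}")
        print(f"Summary: {doc.get('summary')}")
        print(f"Description: {doc.get('description')}\n")
    return matches
```

A call might then look like `search("<your query>", collection, embed)`, where `collection` is the PyMongo collection for the listings and `embed` wraps generate_embeddings with your API key.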
And now, let’s run a search:
Your results may vary because this tutorial did not index all the documents in the dataset, and which ones were indexed may differ from one run to another. You should get a result like this:
Experiment with your own queries to see what you get.

Next steps

You’ve now created the core of a MongoDB Atlas-based semantic search engine, powered by Jina AI’s state-of-the-art embedding technology. For any project, you will follow essentially the same steps outlined above:
  1. Create an Atlas instance and fill it with your data.
  2. Create embeddings for your data items using the Jina Embeddings API and store them in your Atlas instance.
  3. Index the embeddings using MongoDB’s vector indexer.
  4. Implement semantic search using embeddings.
This boilerplate Python code will integrate easily into your own projects, and you can create equivalent code in Java, JavaScript, or any other language or integration framework that supports HTTPS.
To see the full documentation of the MongoDB Atlas API, so you can integrate it into your own offerings, see the Atlas API section of the MongoDB website.
To learn more about Jina Embeddings and its subscription offerings, see the Embeddings page of the Jina AI website. You can find the latest news about Jina AI’s embedding models on the Jina AI website and X/Twitter, and you can contribute to discussions on Discord.
