Semantic search with Jina Embeddings v2 and MongoDB Atlas
Semantic search is a great ally for AI embeddings.
Using vectors to identify and rank matches has been a part of search for longer than AI has. The venerable tf/idf algorithm, which dates back to the 1960s, uses the counts of words, and sometimes parts of words and short combinations of words, to create representative vectors for text documents. It then uses the distance between vectors to find and rank potential query matches and compare documents to each other. It forms the basis of many information retrieval systems.
We call this “semantic search” because these vectors already have information about the meaning of documents built into them. Searching with semantic embeddings works the same way, but instead, the vectors come from AI models that do a much better job of making sense of the documents.
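To make the classic approach concrete, here is a minimal sketch of tf/idf vectors compared by cosine similarity. It assumes whitespace tokenization and one common weighting variant (term frequency times the log of inverse document frequency); the sample texts are made up, and real retrieval systems differ in many details.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse tf-idf vector (a dict of word -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "bright apartment with ocean view",
    "quiet studio with ocean view and patio",
    "double room near the metro",
]
vectors = tfidf_vectors(docs)
# The first two texts share words; the first and third share none.
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

The first and third texts score zero because they have no words in common, even though both describe places to stay. That gap is what embeddings from an AI model are meant to close.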
Because vector-based retrieval is a time-honored technique, there are database platforms that already have all the mechanics to do it. All you have to do is plug in your AI embeddings model.
This article will show you how to enhance MongoDB Atlas — an out-of-the-box, cloud-based solution for document retrieval — with Jina Embeddings’ top-of-the-line AI to produce your own killer search solution.
Once logged in, you should see your Projects page. If not, use the navigation menu on the left to get to it.
Create a new project by clicking the New Project button on the right.
You can add new members as you like, but you shouldn’t need to for this tutorial.
This should return you to the Overview page where you can now create a deployment. Click the +Create button to do so.
Select the M0 Free tier for this project and the provider of your choice, and then click the Create button at the bottom of the screen.
On the next screen, you will need to create a user with a username and secure password for this deployment. Do not lose this password and username! They are the only way you will be able to access your work.
Then, select access options. For this tutorial, we recommend selecting My Local Environment and clicking the Add My Current IP Address button.
If you have a VPN or a more complex security topology, you may have to consult your system administrator to find out what IP address you should insert here instead of your current one.
After that, click Finish and Deploy at the bottom of the page. After a brief pause, you will now have an empty MongoDB database deployed on Atlas for you to use.
Note: If you have difficulty accessing your database from outside, you can get rid of the IP Access List and accept connections from all IP addresses. Normally, this would be very poor security practice, but because this is a tutorial that uses publicly available sample data, there is little real risk.
To do this, click the Network Access tab under Security on the left side of the page:
Then, click ADD IP ADDRESS from the right side of the page:
You will get a modal window. Click the button marked ALLOW ACCESS FROM ANYWHERE, and then click Confirm.
Your Network Access tab should now have an entry labeled 0.0.0.0/0.
This will allow any IP address to access your database if it has the right username and password.
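To see why the 0.0.0.0/0 entry matches every client, the short sketch below uses Python's standard ipaddress module. The two example addresses are hypothetical documentation addresses, and the /32 entry is only an illustration of what a single-address entry looks like in CIDR notation.

```python
import ipaddress

# 0.0.0.0/0 has a zero-length prefix, so it covers the whole IPv4 address space.
allow_all = ipaddress.ip_network("0.0.0.0/0")
# Hypothetical single-address entry in CIDR notation.
single_host = ipaddress.ip_network("203.0.113.7/32")

client_ip = ipaddress.ip_address("198.51.100.24")  # hypothetical client address
print(client_ip in allow_all)    # True
print(client_ip in single_host)  # False
print(allow_all.num_addresses)   # 4294967296
```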
In this tutorial, we will be using a sample database of Airbnb reviews. You can add this to your database from the Database tab under Deployments in the menu on the left side of the screen. Once you are on the “Database Deployments” page, find your cluster (on the free tier, you are only allowed one, so it should be easy). Then, click the “three dots” button and choose Load Sample Data. It may take several minutes to load the data.
This will add a collection of free data sources to your MongoDB instance for you to experiment with, including a database of Airbnb reviews.
For the rest of this tutorial, we will use Python and PyMongo to access your new MongoDB Atlas database.
Make sure PyMongo is installed in your Python environment. You can do this with the following command:
    pip install pymongo
You will also need to know:
- The username and password you set when you set up the database.
- The URL to access your database deployment.
If you have lost your username and password, click on the Database Access tab under Security on the left side of the page. That page will enable you to reset your password.
To get the URL to access your database, return to the Database tab under Deployment on the left side of the screen. Find your cluster, and look for the button labeled Connect. Click it.
You will see a modal pop-up window like the one below:
Click Drivers under Connect to your application. You will see a modal window like the one below. Under number three, you will see the URL you need but without your password. You will need to add your password when using this URL.
Create a file for a new Python script. You can call it test_mongo_connection.py. Write into this file the following code, which uses PyMongo to create a client connection to your database:

    from pymongo.mongo_client import MongoClient

    client = MongoClient("<URL from above>")
Remember to insert the URL to connect to your database, including the correct username and password.
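If your username or password contains characters such as @, : or /, they need to be percent-escaped before they go into the URL. The sketch below shows one way to do that with the standard library's quote_plus. The helper name build_mongo_url, the credentials, and the host are all placeholders for illustration; the URL Atlas shows you may also carry extra options after the host, which you should keep.

```python
from urllib.parse import quote_plus


def build_mongo_url(username, password, host):
    """Return a mongodb+srv connection URL with percent-escaped credentials."""
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}/"


# All three values below are placeholders, not real credentials.
url = build_mongo_url("tutorial_user", "p@ss:word/1", "cluster0.example.mongodb.net")
print(url)
```

The resulting string can be passed to MongoClient in place of a hand-edited URL.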
Next, add code to connect to the Airbnb review dataset that was installed as sample data:
    db = client.sample_airbnb
    collection = db.listingsAndReviews
The variable collection is a handle to the listingsAndReviews collection. Its find() method returns a cursor that yields the dataset item by item, and find_one() returns a single item. To test that it works, add the following line and run test_mongo_connection.py:

    print(collection.find_one())
This will print a Python dictionary, formatted much like JSON, that contains the information in one database entry, whichever one it happened to find first. It should look something like this:
    {'_id': '10006546',
     'listing_url': 'https://www.airbnb.com/rooms/10006546',
     'name': 'Ribeira Charming Duplex',
     'summary': 'Fantastic duplex apartment with three bedrooms, located in the historic
                 area of Porto, Ribeira (Cube) - UNESCO World Heritage Site. Centenary
                 building fully rehabilitated, without losing their original character.',
     'space': 'Privileged views of the Douro River and Ribeira square, our apartment offers
               the perfect conditions to discover the history and the charm of Porto.
               Apartment comfortable, charming, romantic and cozy in the heart of Ribeira.
               Within walking distance of all the most emblematic places of the city of Porto.
               The apartment is fully equipped to host 8 people, with cooker, oven, washing
               machine, dishwasher, microwave, coffee machine (Nespresso) and kettle. The
               apartment is located in a very typical area of the city that allows to cross
               with the most picturesque population of the city, welcoming, genuine and happy
               people that fills the streets with his outspoken speech and contagious with
               your sincere generosity, wrapped in a only parochial spirit.',
     'description': 'Fantastic duplex apartment with three bedrooms, located in the historic
                     area of Porto, Ribeira (Cube) - UNESCO World Heritage Site. Centenary
                     building fully rehabilitated, without losing their original character.
                     Privileged views of the Douro River and Ribeira square, our apartment
                     offers the perfect conditions to discover the history and the charm of
                     Porto. Apartment comfortable, charming, romantic and cozy in the heart of
                     Ribeira. Within walking distance of all the most emblematic places of the
                     city of Porto. The apartment is fully equipped to host 8 people, with
                     cooker, oven, washing machine, dishwasher, microwave, coffee machine
                     (Nespresso) and kettle. The apartment is located in a very typical area
                     of the city that allows to cross with the most picturesque population of
                     the city, welcoming, genuine and happy people that fills the streets with
                     his outspoken speech and contagious with your sincere generosity, wrapped
                     in a only parochial spirit. We are always available to help guests',
     ...
    }
Getting a text response like this will show that you can connect to your MongoDB Atlas database.
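Because a full listing is long, it can help to print only a few fields while checking the connection. The helper below is a hypothetical convenience for this tutorial, not part of PyMongo; it works on any dictionary, so here it is shown with a shortened stand-in for the document returned by find_one().

```python
def preview(doc, fields=("name", "summary"), width=80):
    """Return a one-line-per-field summary of a document, truncating long values."""
    lines = []
    for field in fields:
        value = str(doc.get(field, "<missing>"))
        if len(value) > width:
            value = value[: width - 3] + "..."
        lines.append(f"{field}: {value}")
    return "\n".join(lines)


# Stand-in for the dict returned by collection.find_one().
sample = {
    "_id": "10006546",
    "name": "Ribeira Charming Duplex",
    "summary": "Fantastic duplex apartment with three bedrooms, located in the "
               "historic area of Porto, Ribeira (Cube) - UNESCO World Heritage Site.",
}
print(preview(sample))
```

With a live connection, print(preview(collection.find_one())) would give the same kind of compact output.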
Copy the API key from this page. It provides you with 10,000 tokens of free embedding using Jina Embeddings models. Due to this limitation on the number of tokens allowed to be used in the free tier, we will only embed a small part of the Airbnb reviews collection. You can buy additional quota by clicking the “Top up” tab on the Jina Embeddings web page if you want to either embed the entire collection on MongoDB Atlas or apply these steps to another dataset.
Test your API key by creating a new script, call it test_jina_ai_connection.py, and put the following code into it, inserting your API key where marked:

    import requests

    url = 'https://api.jina.ai/v1/embeddings'

    headers = {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer <insert your API key here>'
    }

    data = {
        'input': ["Your text string goes here"],
        'model': 'jina-embeddings-v2-base-en'
    }

    response = requests.post(url, headers=headers, json=data)

    print(response.content)
Run the script test_jina_ai_connection.py. You should get something like this:
    b'{"model":"jina-embeddings-v2-base-en","object":"list","usage":{"total_tokens":14,
    "prompt_tokens":14},"data":[{"object":"embedding","index":0,"embedding":[-0.14528547,
    -1.0152762,1.3449358,0.48228237,-0.6381836,0.25765118,0.1794826,-0.5094953,0.5967494,
    ...,
    -0.30768695,0.34024483,-0.5897042,0.058436804,0.38593403,-0.7729841,-0.6259417]}]}'
This indicates you have access to Jina Embeddings via its API.
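The raw bytes are hard to read, so here is a small sketch of pulling the useful parts out of a response body. It parses a shortened stand-in with only three embedding values, built from the sample output in this tutorial; with a live response, response.json() should give you the same kind of dictionary.

```python
import json

# Shortened stand-in for response.content; a real embedding is much longer.
raw = (b'{"model":"jina-embeddings-v2-base-en","object":"list",'
       b'"usage":{"total_tokens":14,"prompt_tokens":14},'
       b'"data":[{"object":"embedding","index":0,'
       b'"embedding":[-0.14528547,-1.0152762,1.3449358]}]}')

body = json.loads(raw)
embedding = body["data"][0]["embedding"]
print("tokens used:", body["usage"]["total_tokens"])
print("embedding length:", len(embedding))
```

The usage field is worth watching, since it shows how many tokens of your free quota each request consumed.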
Now, we’re going to put all these pieces together with some Python functions to use Jina Embeddings to assign embedding vectors to descriptions in the Airbnb dataset.
Create a new Python script, call it index_embeddings.py, and insert some code to import libraries and declare some variables:

    import requests
    from pymongo.mongo_client import MongoClient

    jinaai_token = "<your Jina token here>"
    mongo_url = "<your MongoDB Atlas database URL>"
    embedding_url = "https://api.jina.ai/v1/embeddings"
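If you would rather not paste secrets into the script, one option is to read them from environment variables. This is a minimal sketch, not something the tutorial requires; the helper read_setting and the variable names JINA_API_KEY and MONGODB_URL are hypothetical choices for illustration.

```python
import os


def read_setting(name, environ=os.environ):
    """Return a required setting from the environment, or raise a clear error."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"Please set the {name} environment variable")
    return value


# JINA_API_KEY and MONGODB_URL are hypothetical variable names chosen for this sketch.
# jinaai_token = read_setting("JINA_API_KEY")
# mongo_url = read_setting("MONGODB_URL")
```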
Then, add code to set up a MongoDB client and connect to the Airbnb dataset:
    client = MongoClient(mongo_url)
    db = client.sample_airbnb
Now, we will add to the script a function to convert lists of texts into embeddings using the jina-embeddings-v2-base-en AI model:

    def generate_embeddings(texts):
        # Request body: the texts to embed and the model to use.
        payload = {"input": texts,
                   "model": "jina-embeddings-v2-base-en"}
        try:
            response = requests.post(
                embedding_url,
                headers={"Authorization": f"Bearer {jinaai_token}"},
                json=payload
            )
        except Exception as e:
            raise ValueError(f"Error in calling embedding API: {e}\nInput: {texts}")
        if response.status_code != 200:
            raise ValueError(f"Error in embedding service {response.status_code}: {response.text}, {texts}")
        # One embedding per input text, taken from the "data" list of the response.
        embeddings = [d["embedding"] for d in response.json()["data"]]
        return embeddings
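When you have many texts, you may prefer to send them in several smaller requests rather than one large one. The sketch below shows one way to do that. The helper names and the default batch size of 8 are arbitrary assumptions, and a stand-in embedder is used so the example runs without calling the API; in the script you would pass generate_embeddings instead.

```python
def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(texts, embed, size=8):
    """Call `embed` (e.g. generate_embeddings) once per batch and flatten the results."""
    embeddings = []
    for batch in batches(texts, size):
        embeddings.extend(embed(batch))
    return embeddings


# Stand-in embedder so the sketch runs without calling the API:
# it maps each text to a one-element "vector" holding its length.
fake_embed = lambda batch: [[float(len(text))] for text in batch]
print(embed_in_batches(["a", "bb", "ccc"], fake_embed, size=2))
```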
And we will create a function that iterates over up to 30 documents in the listings database, creating embeddings for the descriptions and summaries, and adding them to each entry in the database:
    def index():
        collection = db.listingsAndReviews
        # Only fetch documents that do not have an embedding yet, 30 at most.
        docs_to_encode = collection.find({"embedding_summary": {"$exists": False}}).limit(30)
        for i, doc in enumerate(docs_to_encode):
            if i and i % 5 == 0:
                print("Finished embedding", i, "documents")
            try:
                embedding_summary, embedding_description = generate_embeddings([doc["summary"], doc["description"]])
            except Exception as e:
                print("Error in embedding", doc["_id"], e)
                continue
            # Store both embeddings on the document and write it back.
            doc["embedding_summary"] = embedding_summary
            doc["embedding_description"] = embedding_description
            collection.replace_one({'_id': doc['_id']}, doc)
With this in place, we can now index the collection:
    index()
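The index function writes each whole document back with replace_one. An alternative, sketched below, is to change only the two new fields with PyMongo's update_one and a $set update. This is a design variation rather than the tutorial's method, and the helper embedding_update is a hypothetical name; the example values are made up.

```python
def embedding_update(doc_id, embedding_summary, embedding_description):
    """Return (filter, update) arguments that set only the two embedding fields."""
    return (
        {"_id": doc_id},
        {"$set": {
            "embedding_summary": embedding_summary,
            "embedding_description": embedding_description,
        }},
    )


filter_doc, update_doc = embedding_update("10006546", [0.1, 0.2], [0.3, 0.4])
print(filter_doc, update_doc)
# With a live connection this could be applied as:
# collection.update_one(filter_doc, update_doc)
```

Updating only the new fields avoids overwriting changes that something else may have made to the rest of the document in the meantime.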
Run the script index_embeddings.py. This may take several minutes.
When this finishes, we will have added embeddings to 30 of the Airbnb items.
Return to the MongoDB website, and click on Database under Deployment on the left side of the screen.
Click on the link for your cluster (Cluster0 in the image above).
Find the Search tab in the cluster page and click it to get a page like this:
Click the button marked Create Search Index.
Now, click JSON Editor and then Next:
Now, perform the following steps:
- Under Database and Collection, find sample_airbnb, and underneath it, check listingsAndReviews.
- Under Index Name, fill in the name listings_comments_semantic_search.
- Underneath that, in the numbered lines, add the following JSON text:

    {
      "mappings": {
        "dynamic": true,
        "fields": {
          "embedding_description": {
            "dimensions": 768,
            "similarity": "dotProduct",
            "type": "knnVector"
          },
          "embedding_summary": {
            "dimensions": 768,
            "similarity": "dotProduct",
            "type": "knnVector"
          }
        }
      }
    }
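The index definition uses dotProduct as its similarity measure. A dot product ranks results the same way as cosine similarity only when the vectors have unit length, and the sample API response earlier in this tutorial contains components such as 1.3449358, so those vectors are not unit length. The sketch below, using tiny made-up vectors, shows the difference and how normalizing removes it. Whether you need to normalize before storing is something to confirm in the Atlas documentation on similarity functions.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity: the dot product divided by both vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Tiny made-up vectors; the index above expects 768 components per embedding.
a = [3.0, 4.0]
b = [4.0, 3.0]
print(dot(a, b))                        # 24.0 (depends on vector length)
print(cosine(a, b))                     # 0.96
print(dot(normalize(a), normalize(b)))  # approximately 0.96
```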
Your screen should look like this:
Now click Next and then Create Search Index in the next screen:
This will schedule the indexing in MongoDB Atlas. You may have to wait several minutes for it to complete.
When completed, the following modal window will pop up:
Now that our embeddings are indexed, return to your Python client, and we will perform a search.
We will write a search function that does the following:
- Take a query string and convert it to an embedding using Jina Embeddings and our existing generate_embeddings function.
- Query the index on MongoDB Atlas using the client connection we already set up.
- Print names, summaries, and descriptions of the matches.
Define the search function as follows:

    def search(query):
        # Embed the query with the same model used for the documents.
        query_embedding = generate_embeddings([query])[0]
        results = db.listingsAndReviews.aggregate([
            {
                '$search': {
                    "index": "listings_comments_semantic_search",
                    "knnBeta": {
                        "vector": query_embedding,
                        "k": 3,
                        "path": ["embedding_summary", "embedding_description"]
                    }
                }
            }
        ])
        for document in results:
            print(f'Listing Name: {document["name"]}\nSummary: {document["summary"]}\nDescription: {document["description"]}\n\n')

And now, let’s run a search:

    search("an amazing view and close to amenities")

Your results may vary because this tutorial did not index all the documents in the dataset, and which ones were indexed may differ from run to run. You should get a result like the one below. Note that this sample output was produced with the listing name printed in the Summary field, so its Summary lines repeat the names; your output will show the actual summaries.
    Listing Name: Rented Room
    Summary: Rented Room
    Description: Beautiful room and with a great location in the city of Rio de Janeiro


    Listing Name: Spacious and well located apartment
    Summary: Spacious and well located apartment
    Description: Enjoy Porto in a spacious, airy and bright apartment, fully equipped, in a
    building with lift, located in a region full of cafes and restaurants, close to the subway
    and close to the best places of the city. The apartment offers total comfort for those
    who, besides wanting to enjoy the many attractions of the city, also like to relax and
    feel at home, All airy and bright, with a large living room, fully equipped kitchen, and a
    delightful balcony, which in the summer refreshes and in the winter protects from the cold
    and rain, accommodating up to six people very well. It has 40-inch interactive TV, internet
    and high-quality wi-fi, and for those who want to work a little, it offers a studio with a
    good desk and an inspiring view. The apartment is all available to guests. I leave my guests
    at ease, but I am available whenever they need me. It is a typical neighborhood of Porto,
    where you have silence and tranquility, little traffic, no noise, but everything at hand:
    good restaurants and c


    Listing Name: Panoramic Ocean View Studio in Quiet Setting
    Summary: Panoramic Ocean View Studio in Quiet Setting
    Description: Luxury studio unit is located in a family-oriented neighborhood that lets you
    experience Hawaii like a local! with tranquility and serenity, while in close proximity to
    beaches and restaurants! The unit is surrounded by lush tropical vegetation! High-speed
    Wi-Fi available in the unit!! A large, private patio (lanai) with fantastic ocean views is
    completely under roof and is part of the studio unit. It's a great space for eating outdoors
    or relaxing, while checking our the surfing action. This patio is like a living room
    without walls, with only a roof with lots and lots of skylights!!! We provide Wi-Fi and
    beach towels! The studio is detached from the main house, which has long-term tenants
    upstairs and downstairs. The lower yard and the front yard are assigned to those tenants,
    not the studio guests. The studio has exclusive use of its large (600 sqft) patio - under
    roof! Check-in and check-out times other than the ones listed, are by request only and an
    additional charges may apply;


    Listing Name: GOLF ROYAL RESIDENCE SUİTES(2+1)-2
    Summary: GOLF ROYAL RESIDENCE SUİTES(2+1)-2
    Description: A BIG BED ROOM WITH A BIG SALOON INCLUDING A NICE BALAKON TO HAVE SOME FRESH
    AIR . OUR RESIDENCE SITUATED AT THE CENTRE OF THE IMPORTANT MARKETS SUCH AS NİŞANTAŞİ,
    OSMANBEY AND TAKSIM SQUARE,


    Listing Name: DOUBLE ROOM for 1 or 2 ppl
    Summary: DOUBLE ROOM for 1 or 2 ppl
    Description: 10m2 with interior balkony kitchen, bathroom small but clean and modern metro
    in front of the building 7min walk to Sagrada Familia, 2min walk TO amazing Gaudi Hospital
    Sant Pau SAME PRICE FOR 1 OR 2 PPL-15E All flat for your use, terrace, huge TV.
Experiment with your own queries to see what you get.
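When experimenting, it can be useful to see how strongly each result matched. The sketch below builds an aggregation pipeline that adds a $project stage exposing the search score through {"$meta": "searchScore"}. The helper build_search_pipeline is a hypothetical name, it searches a single embedding field for simplicity, and the all-zero query vector is only a placeholder so the example runs without a database or API call.

```python
def build_search_pipeline(query_embedding, path, k=3,
                          index="listings_comments_semantic_search"):
    """Return an aggregation pipeline that also reports each match's search score."""
    return [
        {"$search": {
            "index": index,
            "knnBeta": {"vector": query_embedding, "path": path, "k": k},
        }},
        {"$project": {
            "name": 1,
            "summary": 1,
            "score": {"$meta": "searchScore"},
        }},
    ]


# Placeholder query vector; in practice use generate_embeddings([query])[0].
pipeline = build_search_pipeline([0.0] * 768, "embedding_description")
print(pipeline[1])
# With a live connection, each result would then carry a "score" field:
# results = db.listingsAndReviews.aggregate(pipeline)
```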
You’ve now created the core of a MongoDB Atlas-based semantic search engine, powered by Jina AI’s state-of-the-art embedding technology. For any project, you will follow essentially the same steps outlined above:
- Create an Atlas instance and fill it with your data.
- Create embeddings for your data items using the Jina Embeddings API and store them in your Atlas instance.
- Index the embeddings using MongoDB’s vector indexer.
- Implement semantic search using embeddings.
This boilerplate Python code will integrate easily into your own projects, and you can create equivalent code in Java, JavaScript, or any other language or integration framework that supports HTTPS.
To see the full documentation of the MongoDB Atlas API, so you can integrate it into your own offerings, see the Atlas API section of the MongoDB website.
To learn more about Jina Embeddings and its subscription offerings, see the Embeddings page of the Jina AI website. You can find the latest news about Jina AI’s embedding models on the Jina AI website and X/Twitter, and you can contribute to discussions on Discord.