Confessions of a PyMongoArrowholic: Using Atlas Vector Search and PyMongoArrow to Semantically Search Through Luxury Fashion Items

Anaiya Raisinghani • 9 min read • Published Jun 27, 2024 • Updated Jun 27, 2024
I’m a twenty-something-year-old living in New York City, and online shopping is my second favorite hobby. What is my first, you ask? Figuring out ways to optimize my shopping addiction so I can spend fewer hours scrolling.
Anyone who fantasizes about luxury (it’s called manifestation, people!) knows about Net-A-Porter and all the incredible pieces the website offers. While my normal approach is sorting by price from low to high, I’d be lying if I said it wasn’t incredibly fun to see the full scope of what’s out there. So, let’s take a fun dataset of last season's Net-A-Porter items and use semantic search to explore some of the most expensive pieces, from any brand and any category, with natural language queries.
In this tutorial, we are going to be using MongoDB Atlas, the PyMongoArrow library, MongoDB Atlas Vector Search, and a luxury fashion dataset from Kaggle.
Before we dive in, let’s first cover the two pieces that make this possible: PyMongoArrow and MongoDB Atlas Vector Search.

What is PyMongoArrow?

PyMongoArrow is a Python library for data analysis with MongoDB. Because our dataset is a .csv file, we will read it with the Pandas library, so it will be loaded as a Pandas dataframe. With the pymongoarrow library, we can then export that data to MongoDB Atlas in a format well suited to this tutorial, in a handful of easy steps. It’s built on top of pymongo, so it lets us work with MongoDB data in an easy and performant manner. As you work through this tutorial, you’ll see how simple transferring and configuring your data becomes with pymongoarrow, a task many data developers have struggled with in the past.

What is MongoDB Atlas Vector Search?

MongoDB Atlas Vector Search lets you search semantically through your database while keeping your vector embeddings in the same place as your source data. Searching semantically means searching by meaning: instead of matching exact keywords, we can query and receive results that convey the same idea without the precise wording.
For example, instead of searching through our sample with simple queries such as “dress,” we can use phrases or generalities, like “summer beach tropical” or even just “summer.” We are going to be using the $vectorSearch aggregation stage in this tutorial, which makes Atlas Vector Search even simpler to use.
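To make “search by meaning” concrete, here is a toy sketch in plain Python. The three-number vectors are hand-picked stand-ins for real embeddings (which come from a model and have far more dimensions), and the item names are made up for illustration. The point is only that the closest vectors, not matching keywords, decide what comes back.

```python
import math

# Hand-picked toy vectors standing in for real embeddings (assumption:
# real embeddings come from a model and have many more dimensions).
TOY_ITEMS = {
    "linen beach dress": [0.9, 0.1, 0.0],
    "straw sun hat": [0.8, 0.2, 0.1],
    "wool ski jacket": [0.1, 0.9, 0.2],
}


def nearest_items(query_vector, items, k=2):
    """Return the k item names whose vectors are closest to query_vector."""
    ranked = sorted(items, key=lambda name: math.dist(query_vector, items[name]))
    return ranked[:k]


# Pretend this vector is the embedding of the query "summer".
summer_like = [1.0, 0.0, 0.0]
print(nearest_items(summer_like, TOY_ITEMS))
```

Neither of the two closest items contains the word “summer”; they are returned because their vectors sit nearest to the query vector.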
Let’s get started!

Pre-requisites

  1. IDE of your choosing — this tutorial uses Google Colab. Feel free to run the commands directly in the notebook.
  2. A MongoDB Atlas account
  3. A MongoDB Atlas cluster — the free tier works perfectly for this tutorial.
  4. Dataset from Kaggle — please ensure you’re downloading the correct .csv file.
  5. An OpenAI API key — this is how we will be embedding our data prior to uploading it into MongoDB Atlas.
Once your cluster has been created and you’ve downloaded the dataset locally, you’re ready to begin!

Upload our .csv file

Our first step is to upload our .csv file into Google Colab. On the left-hand side of Google Colab, access the "Files" section. Select the downloaded net-a-porter.csv file and upload.
[Image: Uploaded file in Google Colab]
Once your file is uploaded, we need to do two important things:
  1. We need to use OpenAI to create embeddings on each item in our file.
  2. We need to clean up our dataset and reconfigure it into a format best suited for our end goal, which is to ensure we can use semantic search to find items in our database.

Configure OpenAI

If you take a look at your .csv file, you’ll notice it consists of four columns (brand, description, price_usd, type) and a multitude of rows.
[Image: Our .csv file]
We need to add a column in our dataframe that contains the embeddings for our item descriptions. To embed each item description, we will use the “text-embedding-3-small” embedding model, and this embedding function:
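A minimal sketch of such an embedding function follows. It assumes the openai Python package (version 1.x) and takes the client as a parameter; the name get_embedding and the newline clean-up are illustrative choices, not necessarily the exact code from the original notebook.

```python
def get_embedding(text, client, model="text-embedding-3-small"):
    """Embed one piece of text and return the vector as a list of floats.

    `client` is assumed to be an `openai.OpenAI` instance (openai >= 1.0).
    """
    # Collapse newlines so a multi-line description is embedded as one passage.
    cleaned = text.replace("\n", " ")
    response = client.embeddings.create(input=[cleaned], model=model)
    return response.data[0].embedding
```

With this model's default settings, each returned vector has 1,536 dimensions.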
We also need to use the pandas library to work with the data we have on hand. Pandas is a Python library used for working with data sets, and it’s super crucial when analyzing, cleaning, exploring, and manipulating data.
To start off, we want to install our dependencies. This means installing openai and importing pandas.
Now, we need to grab our OpenAI secret key. Make sure to save your key somewhere safe and do not share it anywhere, as it’s very sensitive. In this tutorial, to keep things simple and demonstrate other functionalities available, we are hard-coding in our API key, but in production or anywhere else, it’s important to store your sensitive values in a .env file.
Copy in your key and the embedding function from above:
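The tutorial hard-codes the key for simplicity. As a hedged alternative in line with the advice above, the sketch below reads the key from an environment variable and builds the client from it. The helper names load_api_key and make_client are hypothetical; OPENAI_API_KEY is the variable name the openai package conventionally looks for.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Read the OpenAI key from an environment variable instead of the source code."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


def make_client(api_key):
    """Create an OpenAI client from the given key."""
    # Imported here so load_api_key works even before `pip install openai`.
    from openai import OpenAI

    return OpenAI(api_key=api_key)
```

In the notebook, this would be used as `client = make_client(load_api_key())`, and the resulting client passed to the embedding function.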
We’ve set ourselves up for success with processing our embeddings, so let’s go ahead and configure our dataframe!
First, we want to read in the file that was just uploaded.
We only want to focus on the first three columns, since we don’t necessarily need the type column, so let’s drop it.
It’s important to make sure that the columns we are dealing with are clean and don’t have any null values. Null values can cause problems later on, for example when we try to embed a missing description, and it’s good practice to always work with a clean dataset. To remove them, use dropna.
To ensure we are not spending a ton of money and time embedding each and every description in our large dataset, let’s slice it down to 100 rows. This will still provide us with an interesting sample size, but it won’t take up too many resources:
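The four steps above (read, drop the type column, drop nulls, slice to 100 rows) can be sketched with pandas as follows. The function name load_items is hypothetical, and the inline sample rows are made up so the sketch runs without the Kaggle file.

```python
import io

import pandas as pd


def load_items(csv_source, n_rows=100):
    """Read the CSV, drop the unused column and null rows, keep the first n_rows."""
    df = pd.read_csv(csv_source)
    df = df.drop(columns=["type"])
    df = df.dropna(subset=["brand", "description", "price_usd"])
    return df.head(n_rows)


# Tiny made-up sample so the sketch runs offline; in the notebook you
# would call load_items("net-a-porter.csv") instead.
sample_csv = io.StringIO(
    "brand,description,price_usd,type\n"
    "Brand A,Wool coat,1200,clothing\n"
    "Brand B,,300,bags\n"
    "Brand C,Silk dress,850,clothing\n"
)
df = load_items(sample_csv)
print(df)
```

The row with the missing description is removed by dropna, so only two of the three sample rows survive.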
Now, we’re ready to create a new column for where our embeddings will go, and then we can print out our first 20 rows just to ensure we’re on the right track:
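A sketch of this step, assuming an embedding callable that maps a string to a list of floats. The stand-in embedder below is fake so the example runs without an API key; in the notebook you would pass a function that calls OpenAI instead.

```python
import pandas as pd


def add_embeddings(df, embed, text_column="description",
                   target_column="description_embedding"):
    """Return a copy of df with one embedding per row stored in target_column."""
    out = df.copy()
    out[target_column] = out[text_column].apply(embed)
    return out


def fake_embed(text):
    """Stand-in embedder (not a real model): returns a tiny two-number vector."""
    return [float(len(text)), 0.0]


items = pd.DataFrame(
    {"brand": ["Brand A"], "description": ["Wool coat"], "price_usd": [1200]}
)
embedded = add_embeddings(items, fake_embed)
print(embedded.head(20))
```

Note that apply makes one embedding call per row, which is one reason the tutorial limits the sample to 100 rows.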
This should be your output, with the new description_embedding column:
[Image: Our data and newly included description_embedding column]
As you can see, we have a dataframe with the columns we need, and specifically, our newly included description_embedding column! Let’s make sure we can save this into our cluster so we can use MongoDB Atlas Vector Search when we’re ready to do so.

Import data into Atlas using PyMongoArrow

pymongoarrow uses Apache Arrow behind the scenes, so to move our data into MongoDB Atlas, we need to convert our Pandas dataframe into an Arrow table. The great thing about Arrow tables is that they allow for nested columns, so if we had a more complicated dataset, we wouldn’t need to jump through too many hoops to accommodate nesting.
Now that we have all our items and embeddings, let’s use pymongoarrow to import all of our data into MongoDB Atlas. Use a pip command to install pymongo, pymongoarrow, and pyarrow.
Once that succeeds, we can sort our items from most expensive to least expensive (just for fun) and then import all our items into our cluster. Please ensure you have your MongoDB connection string on hand so you can connect to your cluster and do this step. While we are hard-coding the connection string for this tutorial, please keep in mind that this is not secure; credentials should be kept out of your code, for example in a .env file.
Copy the code below to do this:
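A hedged sketch of this step is shown below. It uses `pyarrow.Table.from_pandas` and `pymongoarrow.api.write`; the function names, and the database and collection names you pass in, are hypothetical placeholders rather than the values from the original notebook.

```python
import pandas as pd


def sort_by_price(df):
    """Order items from most to least expensive and renumber the rows."""
    return df.sort_values(by="price_usd", ascending=False).reset_index(drop=True)


def upload_items(df, connection_string, db_name, collection_name):
    """Convert the dataframe to an Arrow table and write it to MongoDB Atlas."""
    # Imported here so sort_by_price can be tried without the MongoDB drivers.
    import pyarrow as pa
    from pymongo import MongoClient
    from pymongoarrow.api import write

    # preserve_index=False keeps the pandas index out of the documents.
    table = pa.Table.from_pandas(sort_by_price(df), preserve_index=False)
    client = MongoClient(connection_string)
    try:
        return write(client[db_name][collection_name], table)
    finally:
        client.close()
```

Each row of the Arrow table becomes one document, with the description_embedding column stored as an array field.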
Once you run this code block, be sure to double-check in MongoDB Atlas that everything looks as expected. The rows from your .csv file will have been transformed into separate documents, with each column as a new field. Make sure that your new description_embedding field is included as well!
[Image: Our data has been properly imported into MongoDB Atlas]
Now that we have our embedded documents in place, we can set up MongoDB Atlas Vector Search.
Let’s start searching semantically through our newly imported data. We first need to create a Vector Search index. To do this, head into your Atlas account and follow the steps.
Once finished, it should look like this.
The path we are using is description_embedding since we want our Vector Search index to be used against our newly incorporated embedding column. For the similarity field, we are choosing “euclidean,” but depending on your use case, you can also use “cosine” or “dotProduct.”
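For reference, the settings just described correspond to an index definition along these lines. This sketch only builds and prints the JSON so you can paste it into the Atlas JSON editor; it does not create the index. The numDimensions value of 1536 is an assumption based on the default output size of text-embedding-3-small, and the helper name is hypothetical.

```python
import json

ALLOWED_SIMILARITY = {"euclidean", "cosine", "dotProduct"}


def vector_index_definition(path="description_embedding",
                            num_dimensions=1536,
                            similarity="euclidean"):
    """Build an Atlas Vector Search index definition as a plain dict."""
    if similarity not in ALLOWED_SIMILARITY:
        raise ValueError(f"similarity must be one of {sorted(ALLOWED_SIMILARITY)}")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


print(json.dumps(vector_index_definition(), indent=2))
```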
Keep your “Index Name” as “vector_index,” or change it to something that you’ll remember, but make sure you’ve selected the correct database and collection. Once you’ve saved your index and it has finished building, you’ll know it’s active when the status looks like this.
[Image: Active Vector Search index]
Keep in mind that your Vector Search index lives in MongoDB Atlas. It is not a part of your overall Python script, and you should not try to create or run the index from your script.
Now, go back to the Google Colab file. To search semantically, we need to embed our queries. This is a very important part: When we are using semantic search, we are not comparing vectors to text — we are comparing vectors to vectors! Do this with these couple of lines:
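A sketch of the query-embedding step is below. It accepts any embedding callable (for example, the same function used for the item descriptions) and checks that the vector length matches the index, since a query vector must have the same number of dimensions as the indexed field. The helper name and the 1536 default are assumptions.

```python
def embed_query(query_text, embed, expected_dimensions=1536):
    """Embed the search phrase and check it matches the index dimensions."""
    vector = list(embed(query_text))
    if len(vector) != expected_dimensions:
        raise ValueError(
            f"Expected {expected_dimensions} dimensions, got {len(vector)}"
        )
    return vector


# In the notebook this would look something like:
#   query_vector = embed_query("summer", your_embedding_function)
```

Using the same model for queries and documents matters: vectors from different models are not comparable.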
Since we’ve already used the embedding model above, it doesn’t take much work to embed our queries as well.
Now, we need to define the aggregation pipeline so that we can semantically search. We can do this using $vectorSearch. The pipeline looks like this:
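A sketch of such a pipeline, written as a small builder function. The index name and path follow the tutorial; the numCandidates and limit values, the projected fields, and the function name are assumptions you should adjust to your own data.

```python
def build_search_pipeline(query_vector, index_name="vector_index",
                          path="description_embedding",
                          num_candidates=100, limit=10):
    """Return a $vectorSearch + $project aggregation pipeline."""
    if limit > num_candidates:
        raise ValueError("limit cannot be larger than num_candidates")
    return [
        {
            "$vectorSearch": {
                "index": index_name,          # name of the Atlas Vector Search index
                "path": path,                 # field holding the stored embeddings
                "queryVector": query_vector,  # the embedded search phrase
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            "$project": {
                "_id": 0,
                "brand": 1,
                "description": 1,
                "price_usd": 1,
                # Similarity score computed by Atlas Vector Search.
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
```

$vectorSearch must be the first stage of the pipeline, and it returns documents ordered by similarity score. If you want a different order, such as by price, one option is to append a stage like `{"$sort": {"price_usd": -1}}`.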
As you can see, we have used the $project stage to show only the fields that we want. We’ve also used $vectorSearch to define the index, the path, and our query vector. Double-check that the index name and path match what you created in Atlas before you proceed. Otherwise, the query will not run.
Once your pipeline has been written, define which database and collection you want it to run on, and then print your results:
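A minimal sketch of running the pipeline. The function takes a pymongo collection (or anything with a compatible aggregate method) as a parameter, so the database and collection names are left to you; the function names are hypothetical.

```python
def run_search(collection, pipeline):
    """Run the aggregation pipeline and return the matching documents as a list."""
    return list(collection.aggregate(pipeline))


def print_results(documents):
    """Print one result document per line."""
    for document in documents:
        print(document)


# In the notebook, with pymongo installed, this would look something like:
#   from pymongo import MongoClient
#   collection = MongoClient(connection_string)[db_name][collection_name]
#   print_results(run_search(collection, pipeline))
```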
In this tutorial, we used the simple query of “summer,” and these are our results:
[Image: Our results with the query: “summer”]
It’s interesting here because when I queried on “summer,” items that included summer months showed up, such as the month of August.
Let’s change our query to say “winter” and see the results. As you can see, from our sample, we are pulling up results that are oriented toward colder weather, such as coats, ski jackets, and wool pants.
[Image: Output of our query “winter”]
They are also sorted in descending order from most expensive to least (to dream!) and we can search through the items with limited scrolling. So, if you’re ever in ultimate lounging mode and need a cashmere-hoodie-and-sweatpants ‘fit that’ll set you back almost $1300 (before tax), you know where to look.

Next steps

This tutorial used a flat dataset. Once you’re comfortable with how the platforms and libraries introduced here fit together, feel free to create a web scraper and try the same method on live data.

Conclusion

This tutorial gives you an overview of what is possible with PyMongoArrow and MongoDB Atlas Vector Search. We were able to take a dataset, process it using Pandas, generate the necessary embeddings with OpenAI, store the resulting Arrow table in MongoDB Atlas using PyMongoArrow, and then query our database semantically.
For more information on PyMongoArrow, please visit the documentation, and for more information on MongoDB Atlas Vector Search, explore the tutorial. If you have questions or want to share your work, join us in the MongoDB Developer Community.