✅ Sign-up for a free cluster at → https://www.mongodb.com/cloud/atlas/register
✅ Get help on our Community Forums → https://www.mongodb.com/community/forums/
✅ https://mdb.link/subscribe
In this MongoDB video, we'll explore the seamless integration of MongoDB Atlas with a luxury fashion dataset, demonstrating how to semantically search for high-end items using natural language queries. We'll delve into the power of MongoDB Atlas, the PyMongoArrow library, and MongoDB Atlas Vector Search, showcasing how to process, embed, and query data to find the most extravagant luxury items. This tutorial is perfect for anyone looking to harness the capabilities of MongoDB for advanced data analysis and search functionality.
⏱️ Timestamps ⏱️
Introduction and Overview [00:00:00 - 00:04:27]
Anay Sani introduces herself as an associate developer advocate and discusses the luxury items available on Net-a-Porter. She outlines the tutorial's objective: exploring expensive items using MongoDB Atlas, the PyMongoArrow library, and semantic search over a luxury dataset from Kaggle.
Setting Up the Environment [00:04:27 - 00:08:54]
Anay explains the importance of the PyMongoArrow Python library for data analysis with MongoDB and the process of reading the CSV file in as a pandas dataframe. She covers the prerequisites for the tutorial: a MongoDB Atlas account, the Kaggle dataset, and an OpenAI API key.
Preparing the Data [00:08:54 - 00:13:21]
The focus shifts to uploading the CSV file to Google Colab, adding comments for clarity, and preparing the dataset by embedding item descriptions using OpenAI's embedding model. Anay also emphasizes cleaning the dataset to ensure no null values are present.
Importing Data into MongoDB Atlas [00:13:21 - 00:17:48]
Anay demonstrates how to install PyMongo, PyMongoArrow, and PyArrow, convert the pandas dataframe to an Arrow table, and import the data into MongoDB Atlas. She also shows how to sort items by price and keep the MongoDB connection string secure.
Setting Up MongoDB Atlas Vector Search [00:17:48 - 00:23:31]
Anay guides viewers through creating a vector search index in MongoDB Atlas and explains the significance of embedding queries so that vectors are compared to vectors. She then builds the aggregation pipeline for the vector search around the $vectorSearch stage.
Semantic Search and Conclusion [00:23:31 - 00:26:42]
The tutorial concludes with Anay performing semantic searches on the database using summer and winter queries, showcasing the results, and discussing the potential of PyMongoArrow and MongoDB Atlas Vector Search. She invites viewers to join the MongoDB developer community and share their work.
------
✅ Subscribe to our channel → https://mdb.link/subscribe
Full Video Transcript
Hi everyone, my name is Anay Sani and I'm an associate developer advocate here at MongoDB. Anyone who fantasizes about luxury knows all about Net-a-Porter and the amazing pieces the website offers. While my normal approach is sorting by price from low to high, I'd be lying if I said it wasn't incredibly fun to see the full range of what the website offers. So let's use a fun dataset that holds last season's Net-a-Porter items and semantically search it to explore some of the most expensive items, from any brand, in any category, with natural language queries. In this tutorial we're going to be using MongoDB Atlas, the PyMongoArrow library, MongoDB Atlas Vector Search, and of course our fun luxury dataset from Kaggle.

Before we dive in, let's go over some of the really important aspects of the platforms we're going to be using. So what is PyMongoArrow? PyMongoArrow is a Python library for data analysis with MongoDB. Because our dataset is a CSV file, we're going to read it in as a pandas dataframe, and with the PyMongoArrow library we can then export all of our data to MongoDB Atlas in the most ideal format for our tutorial. It's actually built on top of PyMongo, so it allows us to work with MongoDB data in a super easy and performant manner. As you work through this tutorial, you'll see how simple transferring and configuring your data becomes when you're using the PyMongoArrow library, which is something many, many data scientists have struggled with in the past.

So what is MongoDB Atlas Vector Search? MongoDB Atlas Vector Search has really revolutionized how you can search, because it lets you search semantically through your database while keeping all of your vector embeddings right next to your source data in the same database. Searching semantically means searching by meaning instead of by exact keywords or phrases. This means we can query and receive results that convey the same idea without that precise wording. For example, instead of searching through our sample using simple queries such as "dress" or "skirt", we can use phrases or generalities such as "summer beach tropical", or even just "summer". We're going to be utilizing the $vectorSearch aggregation stage in this tutorial, which will make things even a little bit easier.

So let's go over some prerequisites for being successful during this tutorial. The first one, of course, is an IDE of your choosing; we're going to be using Google Colab in this tutorial. Then you need a MongoDB Atlas account, and you need to set up a MongoDB Atlas cluster; the free tier works perfectly fine for this tutorial. Then we need to download a dataset from Kaggle (you can follow along with the same one I used), and you're going to need an OpenAI API key in order to embed all of your documents. Links to everything will be in the description box below. Perfect, let's get started.

So once your cluster has been created and you've downloaded the dataset locally, we're ready to begin the tutorial. Our very first step is to upload our .csv file into Google Colab. Before we upload it, I just want to point out that for this tutorial I've added in a lot of comments, just to make things a little bit easier, because we're going to be doing a lot of really cool stuff. I've also left in a couple of things, like our getpass prompt, because I know that's not part of the written tutorial if you're following along; it's just to keep things secret and safe while I'm doing this live.
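As a rough sketch, that getpass setup might look something like this (the exact variable names aren't shown in the video, so treat them as illustrative):

```python
import getpass

# Prompt for secrets at runtime so they never appear in the notebook itself.
openai_api_key = getpass.getpass("OpenAI API key: ")
connection_string = getpass.getpass("MongoDB Atlas connection string: ")
```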
So we have our vector search index here, and then just some comments that we're going to fill out to make things a little bit easier for you and for myself. Once again, our first step is to go ahead and upload the .csv file that we downloaded from Kaggle. We can just go over here, click that little arrow, find the net_a_porter.csv file, and hit OK. Awesome, it's loaded in.

Once your file is uploaded, we need to do two very important things. Our first step is to use OpenAI to create embeddings for each item in our file, and our second step is to clean up our dataset and configure it into a format best suited for our end goal. This is just to ensure that we can use semantic search to find any items we want inside of our database, any luxury items our heart desires.

If you take a look at our CSV file right over here, you'll notice that it actually consists of four columns: brand, description, price in USD, and type. Then there is a multitude of rows; I'm not going to scroll through all of them, but you get the idea, and you can scroll through them yourself. We need to add a column inside of our dataframe that contains the embeddings for each of our item descriptions. This is going to allow us to search semantically based off of the item description. To embed each item, we're going to use the text-embedding-3-small embedding model along with this embedding function right over here.

We'll also need the pandas library to work with the data we have on hand. If you're a data scientist or you've worked with Python and data before, you probably know pandas, but for those who are new to it: pandas is a Python library used for working with datasets, and it's super crucial when analyzing, cleaning, exploring, and manipulating data. As you're going to see, it makes it a lot easier to get to the parts of the dataframe we want to use and to pull out the data we most desire.

To start off, we first have to install openai, which we can do with a very simple pip command. Run that, make sure it works and downloads. Perfect. I've already put our imports up top over here: we're importing pandas as pd, we're importing openai since we just installed it, and we're importing getpass as well. Speaking of keys, make sure you take a second and grab your OpenAI secret key. Then we're going to fill out some of those handy little comments I created earlier. The first one is the embedding model we're going to be using for text, which once again is text-embedding-3-small, and then the embedding function, which I'm just going to copy and paste from up top; it's right over here. Perfect, we've set ourselves up for success with processing our embeddings.

So let's go ahead and actually configure our dataframe. The first step is to read in the file that was just uploaded; there's no point in uploading a file if we're not going to read it in, right? So let's just do df, for dataframe, equals pd.read_csv, and here we're just using pandas to read in our "content/net_a_porter.csv" file.
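The embedding helper itself isn't fully readable in the transcript, so here is a minimal sketch of how it might look with the current openai Python client; the get_embedding name and the file path are assumptions:

```python
import pandas as pd
from openai import OpenAI

# Uses the API key we captured with getpass above.
openai_client = OpenAI(api_key=openai_api_key)

EMBEDDING_MODEL = "text-embedding-3-small"

def get_embedding(text: str) -> list[float]:
    """Return the embedding vector for a single piece of text."""
    response = openai_client.embeddings.create(input=text, model=EMBEDDING_MODEL)
    return response.data[0].embedding

# Read the uploaded Kaggle CSV in as a pandas dataframe.
df = pd.read_csv("/content/net_a_porter.csv")
```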
Cool. Now, we only really want to focus on the first three columns, and that's just because I personally feel we don't need the type column. As you can see here, for a large majority of the items in our dataset the type is "clothing", and besides that it just gives things like shoes, accessories, stuff like that. So I'm personally going to drop that column; if you want to use it in yours, no worries at all, but I'm dropping it to make things a little bit cleaner for ourselves. So: df.drop, and from our columns we want to drop type. Perfect. I'm going to do inplace equals True. Very nice, so now we're down to the three columns we want to use.

It's important to make sure the columns we're dealing with are clean and don't have any null values at all. This is crucial because null values can mess up our data in the long run, and it's just really good practice to always ensure you're working with a clean dataset. To do this we're going to use something called dropna: df.dropna, with subset specifying the columns we want to drop the null values from, so we're going to do brand, then price USD, and then our description. Perfect, and once again hit it with inplace equals True.

Now, to ensure we're not spending a ton of time or money (just resources in general) embedding the entire gigantic CSV file, I'm going to cut it down to 100 rows. That still gives us a pretty good sample size of data to work with, but it won't take up a lot of our resources, and we'll get the same idea whether we embed 100 descriptions or the thousands that are probably in that CSV file. This is as simple as df = df.head(100) to slice it down.
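Put together, the cleanup steps described here might look like the following; the exact column names (especially price_usd) are taken from the spoken walkthrough and may differ slightly in the actual Kaggle file:

```python
# Drop the "type" column; we only keep brand, description, and price.
df.drop(columns=["type"], inplace=True)

# Drop any rows with null values in the columns we rely on.
df.dropna(subset=["brand", "price_usd", "description"], inplace=True)

# Cut the dataset down to 100 rows to limit embedding time and cost.
df = df.head(100)
```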
Now what we have to do is create a new column in our dataframe that contains the actual vectorized embeddings. We can do this by saying that we're going to have a new column called description embedding, taking it from the description column, and then applying our embedding function to it. Awesome. Then we're just going to print out the first 20 rows, just to make sure we're on the right track: df.head(20). Very nice.

Okay, cool, so let's run it. Once again, the expectation for what's going to come out of this section is that we'll have four columns, but we won't have the type column anymore; instead, we'll have a description embedding column with all of our descriptions embedded as vectors. So let's run that and see if it works. Perfect. As you can see, we have a dataframe with the columns we need: brand, description, price in USD, and description embedding, which is our description column as vectorized embeddings.
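In code, that embedding step (reusing the hypothetical get_embedding helper sketched above) might be:

```python
# Vectorize every item description into a new column.
df["description_embedding"] = df["description"].apply(get_embedding)

# Sanity check: the first rows should now show the embedding column
# in place of the dropped "type" column.
print(df.head(20))
```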
Our next step is to save this data inside of MongoDB Atlas so that we can use MongoDB Atlas Vector Search when we're ready to do so, and we can do this very easily by importing our data into Atlas using the PyMongoArrow library. The really cool thing about PyMongoArrow is that it actually uses Apache Arrow behind the scenes. To move our data into Atlas, we need to convert our pandas dataframe into an Arrow table. The great part about Arrow tables is that they allow for nested columns, so if we did have a more complicated dataset, we wouldn't need to jump through too many hoops to accommodate nesting.

So let's go ahead and get started. Our first step, of course, is to install everything we need: we're just going to use a pip statement and install pymongo, pymongoarrow, and pyarrow. Perfect. Then these are our imports: from pymongo we're importing MongoClient; from pymongoarrow.api we're importing write, which is going to write our dataframe into MongoDB Atlas (some magic is going to happen and every single row from our CSV file is going to be turned into a document, as you'll see in a second); and then we're importing pyarrow as pa. Before I forget, let me run this and make sure everything gets installed. Sometimes we have to restart the runtime here; that's been happening a handful of times, but it's not a problem at all. We just click "Restart session" and then rerun everything we've already run. I like to rerun everything, just to make sure.

Once that succeeds, we can sort our items from most expensive to least expensive. That's just a personal preference, and it's just for fun for this tutorial; especially when luxury shopping, sometimes it's fun to see the most extravagant items, so that's what I'm going to be searching based off of. Another very important part: please ensure that you have your MongoDB connection string. I'll show you how to grab that in a second if you haven't done it before. Once again, if you're doing this in production or you need more security, please keep these secret values in a .env file; for the sake of the tutorial, and just for simplicity purposes, I'm using getpass to input my variables and my secrets.

So let's first sort by most expensive items to least expensive: df = df.sort_values, and we want to sort by price (it's nice that we already have our price as a column), with ascending set to False. If you wanted it the other way around, of course, it would be ascending True. Then this is where our connection string is going to go, and we can just write our client: MongoClient with the connection string. Perfect. Once again, you can really name your database and collection anything you like, because we've created the cluster but we haven't created the database and the collection yet; when we write in our dataframe using PyMongoArrow, that's when our database and collection get created. I'm going to make my database net_a_porter and name my collection average_prices_descending. Perfect.

In order to save our data, we first need to convert our pandas dataframe to an Arrow table using PyArrow, which we can do with pa.Table.from_pandas, passing in our dataframe. Nice. Now we can use that write function: write the Arrow table into our collection. Very nice. Then we can just add a print, just to make sure it worked; I'll have it say "successful". If it is successful (fingers crossed), I'll show you what it looks like in Atlas with all of our documents. So let's run this and see what happens. Oh yes, first let's put in our connection string: go over to MongoDB Atlas, hit Connect on your cluster, and go to Drivers. We're using Python, of course, so I'm going to copy that, take it off screen for a second, input my password, and then copy and paste it inside of Colab. So my connection string has been copied and pasted in with my password inside of it. Let's enter that in... and it was successful.
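Pulling those steps together, a sketch of the import-into-Atlas code might look like this; the database and collection names follow what's spoken in the video:

```python
import pyarrow as pa
from pymongo import MongoClient
from pymongoarrow.api import write

# Sort items from most expensive to least expensive.
df = df.sort_values(by="price_usd", ascending=False)

# Connect to Atlas; the database and collection are created on first write.
mongo_client = MongoClient(connection_string)
database = mongo_client["net_a_porter"]
collection = database["average_prices_descending"]

# Convert the pandas dataframe to an Arrow table and write it to Atlas.
# Every row in the table becomes one document in the collection.
arrow_table = pa.Table.from_pandas(df)
write(collection, arrow_table)
print("successful")
```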
Awesome. Let's go back over to our cluster and take a look to see if our documents have actually loaded in. Nice. As you can see, our database was created, our collection was created, and we have a total of 100 documents, which is perfect because we sliced down our entire dataset. Every single item inside of our database (well, the hundred we sliced it down to) has been turned into a document: we have our description embedding, which is the most important part, and we also have the brand, the description of what the item is, and the price. We have that for, I believe, the first 100 items, which is awesome.

Now that we have our embedded documents in place, we can actually set up MongoDB Atlas Vector Search. To start searching semantically, the first thing we have to do is create a vector search index. I've left the definition over here for clarity's sake, so I'm going to copy it, go back to my cluster, and go to Atlas Search. I'll go down here and hit "Create Search Index". We're using Atlas Vector Search, of course, so click that and click Next. The collection I want to build this on is, of course, our average_prices_descending collection. I'm going to keep the index name the same, just because it's the default and a little bit easier, but if you do change the index name, please follow that through the entirety of the tutorial. Let me paste that in. Of course, if you remember from seeing the embeddings, because we're using OpenAI the number of dimensions is 1536, the path we want to use is description_embedding, the similarity we're using is cosine, and the type, of course, is vector. Hit Next, scroll down, Create Search Index, Close, and then let's just wait for this to turn green and tell us that it's active. Perfect.

Once it's active, we can go back to our Google Colab file. To search semantically, we first need to embed our queries. A lot of people struggle with this concept, but one of the most important things to take away from vector search is that we are not comparing text to vectors; we want to be comparing vectors to vectors, so of course we have to embed our queries as well. This is very simple to do because we've already used our embedding function before. All we have to do is name the query something (I'm calling it query description), and for now we can just use "summer"; I'll run both "summer" and "winter" as my queries and show y'all the outcomes of both. Now let's embed it over here: query vector equals the same embedding function from above, because we want to make sure we're using the same model to embed not only our documents but also our queries. Perfect.

Now for the longer part (I'll leave some documentation below on aggregation pipelines if y'all aren't used to those): we need to use the $vectorSearch operator and create our aggregation pipeline so that we can actually search semantically on our data.
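Based on the values read out here (1536 dimensions, cosine similarity, the description_embedding path, and the default index name), a sketch of the index definition and the query-embedding step might be:

```python
# Vector search index definition, as pasted into the Atlas UI
# (the default index name is "vector_index"); this mirrors the
# JSON editor fields described in the video.
vector_index_definition = {
    "fields": [
        {
            "type": "vector",
            "path": "description_embedding",
            "numDimensions": 1536,
            "similarity": "cosine",
        }
    ]
}

# Embed the query with the same model used for the documents,
# reusing the hypothetical get_embedding helper from earlier.
query_description = "summer"  # try "winter" as well
query_vector = get_embedding(query_description)
```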
In order to do that, we first open up our brackets and use $vectorSearch, and then we can put in some information. The first thing we want to do is define our index. We just created it, so that's vector_index; if you named it something else, this is where you want to put that in, and if you left it default the way I did, keep it as vector_index. Then the path, which of course is going to be description_embedding, and then our query vector, which is the query vector from above. Remember, I only had 100 rows saved, so for numCandidates we'll use our whole sample size, and then let's limit the results to five. Awesome.

Now we can begin our second stage, so let's open that up and use $project, which is just saying which fields we want to show. Let me actually go back to our cluster and show you what we mean by this stage: out of all of the fields that are listed, which do we want to see? Ideally I don't want to see the ID (I don't necessarily care about it), but I do want to see the brand, the description, and the price. I don't need the description embedding, so I'm not going to include that, but we are going to include one more thing, called the score. So: _id we set to 0, brand to 1, description to 1, price to 1, and then for score we open that up and use the vector search score from $meta; I'll point that out when we run it.

The next and last step is just to sort by price. We can do $sort on price USD, and once again we want to sort by most expensive, so we use -1. Awesome. Let's double-check that everything looks fine before we continue on. I believe it looks good, so let's finish things off strong: we want to run this on our database and our collection, right? So we do database equals client, and that's net_a_porter; the collection we're running this on is average_prices_descending; and then results equals collection.aggregate with our pipeline, which is just named pipeline. Perfect. Then let's print something, because we want to see results: for each result, just print it.
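Assembled as described, the pipeline and the final query might look like this sketch (the price field name follows the assumption used above):

```python
pipeline = [
    {
        # Semantic search over the embedded descriptions.
        "$vectorSearch": {
            "index": "vector_index",
            "path": "description_embedding",
            "queryVector": query_vector,
            "numCandidates": 100,  # our whole 100-row sample
            "limit": 5,
        }
    },
    {
        # Keep only the fields we care about, plus the similarity score.
        "$project": {
            "_id": 0,
            "brand": 1,
            "description": 1,
            "price_usd": 1,
            "score": {"$meta": "vectorSearchScore"},
        }
    },
    # Sort the five matches from most to least expensive.
    {"$sort": {"price_usd": -1}},
]

results = collection.aggregate(pipeline)
for result in results:
    print(result)
```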
So let's run this and see what happens. Cool. Keep in mind the query is "summer", and when we search semantically over the luxury items inside of our database, we see five responses telling us they're summer related, going from most expensive to least expensive (still pretty expensive), and this right over here is the vector search score. Pretty cool. One of the cool aspects, as you can see, is that when we search for "summer" we see summer months show up as well, like August, and things like shirts, midi dresses, mini dresses, maxi dresses, and halterneck tops: all things you would wear in the summer. So let's change the query description to "winter" and see what happens. Perfect, awesome: all things you would wear in the winter, like ski jackets, wool wide-leg pants, cashmere sweaters, a cashmere hoodie and track pant set, and more cashmere sweaters. Perfect. Try this out yourself: change the query to be whatever you like and see what happens.

While this tutorial was done using a flat dataset, once you truly understand each of the platforms and concepts used, feel free to create a web scraper and try this on live data. This tutorial gives you a really great overview of what is possible with PyMongoArrow and MongoDB Atlas Vector Search: we were able to take a dataset, process it using pandas, generate the necessary embeddings with OpenAI, store our newly created Arrow table in MongoDB Atlas using PyMongoArrow, and then semantically query on top of our database. For more information on PyMongoArrow, please visit the documentation, and for more information on MongoDB Atlas Search, please explore the tutorial linked in the description box below. If you have any questions or want to share your work, please join us in the MongoDB developer community. Drop a comment, like this video, and let me know what you think. Thank you so much!