How to Implement Databricks Workflows and Atlas Vector Search for Enhanced Ecommerce Search Accuracy
In the vast realm of Ecommerce, customers' ability to quickly and accurately search through an extensive range of products is paramount. Vector search is emerging as a turning point in this space, offering a refined approach to search that goes beyond mere keyword matching. Let's delve into its implementation using MongoDB Atlas, Atlas Vector Search, and Databricks.
In a previous tutorial, we showcased how the integration of MongoDB and Databricks provides a comprehensive solution for the retail industry. It combines real-time data processing, workflow orchestration, machine learning, custom data functions, and advanced search capabilities to optimize product catalog management and enhance customer interactions.
The diagram below represents the Databricks workflow for indexing data from the atp (available to promise), images, prd_desc (product discount), prd_score (product score), and price collections. These collections are also part of the previously mentioned tutorial, so please refer back if you need to access them.
Within the MongoDB Atlas platform, we can use change streams and the MongoDB Connector for Spark to move data from the collections into a new collection called Catalog. From there, we will use a text transformer to create the Catalog Final Collection. This will enable us to create a corpus of indexed and vector-embedded data that will be used later as the search dictionary. We’ll call this collection catalog_final_myn. This will be shown further along after we embed the product names.
The catalog final collection will include the available to promise status for each product, its images, the product discount, product relevance score, and price, along with the vectorized or embedded product name that we’ll point our vector search engine at.
This part of the notebook reads data changes from the atp_status_myn collection in the search database, drops the _id field, and then writes (or updates) the processed data to the catalog_myn collection in the same database.
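As a rough sketch of this step, assuming the MongoDB Connector for Spark (10.x) is installed on the cluster and `spark` is the session a Databricks notebook provides. The connection string, the checkpoint path, and the `id` field used as the upsert key are placeholders and assumptions, not values taken from the notebook:

```python
# Placeholder connection string (assumption): substitute your own Atlas URI.
MONGO_CONN = "mongodb+srv://<username>:<password>@<cluster-host>/"


def read_options(uri, collection, database="search"):
    """Options for a change-stream read from one collection."""
    return {
        "spark.mongodb.connection.uri": uri,
        "spark.mongodb.database": database,
        "spark.mongodb.collection": collection,
        # Emit the changed document itself rather than the change-event envelope.
        "spark.mongodb.change.stream.publish.full.document.only": "true",
    }


def upsert_options(uri, collection, checkpoint_dir, database="search", id_field="id"):
    """Options for writing (or updating) documents keyed on id_field (assumed key)."""
    return {
        "spark.mongodb.connection.uri": uri,
        "spark.mongodb.database": database,
        "spark.mongodb.collection": collection,
        "spark.mongodb.operationType": "update",
        "spark.mongodb.upsertDocument": "true",
        "spark.mongodb.idFieldList": id_field,
        # Structured Streaming needs a checkpoint directory for the sink.
        "checkpointLocation": checkpoint_dir,
    }


def stream_atp_to_catalog(spark, uri, checkpoint_dir):
    """Read changes from atp_status_myn, drop _id, and upsert into catalog_myn."""
    atp = (spark.readStream.format("mongodb")
           .options(**read_options(uri, "atp_status_myn"))
           .load())
    atp = atp.drop("_id")
    return (atp.writeStream.format("mongodb")
            .options(**upsert_options(uri, "catalog_myn", checkpoint_dir))
            .outputMode("append")
            .start())
```

In a notebook cell, this would be started with something like `stream_atp_to_catalog(spark, MONGO_CONN, "/tmp/<your-checkpoint-dir>")`. Depending on your connector version, the streaming read may also need an explicit schema.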
Using a combination of Python libraries and PySpark operations to process data from the Catalog MongoDB collection, we’ll transform it, vectorize it, and write the transformed data back to the Final Catalog collection. On top of this, we’ll build our application search business logic.
We start by using the %pip magic command, which is specific to Jupyter notebooks and IPython environments. The necessary packages are:
- pymongo: A Python driver for MongoDB.
- tqdm: A library to display progress bars.
- sentence-transformers: A library for state-of-the-art sentence, text, and image embeddings.
First, let’s use pip to install these packages in our Databricks notebook:
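For the three packages listed above, the notebook cell would look like this (a Databricks/IPython magic, so it only runs inside a notebook):

```python
%pip install pymongo tqdm sentence-transformers
```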
We continue the notebook with the following code:
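One plausible shape for this cell is loading a sentence-embedding model and wrapping it in a small helper. The model name `all-MiniLM-L6-v2` (which outputs 384-dimensional vectors) and the helper names `load_model` and `embed_title` are assumptions for illustration:

```python
def load_model(name="all-MiniLM-L6-v2"):
    """Load a sentence-transformers model (model name is an assumption)."""
    # Imported here so the helper below can be exercised without the package.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(name)


def embed_title(model, title):
    """Encode one product title and return a plain list of floats,
    which is a shape MongoDB can store as an array field."""
    vector = model.encode(title)
    return [float(x) for x in vector]
```

Typical use would be `model = load_model()` followed by `embed_title(model, "tan leather bag")`. Converting to a plain list avoids handing NumPy types to the MongoDB driver.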
The next step is to bring in the data from the catalog collection in the search database in MongoDB Atlas. This is a continuation of the same notebook:
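A minimal sketch of such a read, assuming the MongoDB Connector for Spark (10.x) and a connection string held in a variable such as `MONGO_CONN` (a placeholder name). The helper names are hypothetical:

```python
def catalog_read_options(uri):
    """Read options for the catalog_myn collection in the search database."""
    return {
        "spark.mongodb.connection.uri": uri,
        "spark.mongodb.database": "search",
        "spark.mongodb.collection": "catalog_myn",
        # Return the full changed document from the change stream.
        "spark.mongodb.change.stream.publish.full.document.only": "true",
        # No aggregation pipeline option is set, so no aggregation is applied.
    }


def read_catalog_stream(spark, uri):
    """Return a streaming DataFrame backed by the collection's change stream."""
    return (spark.readStream.format("mongodb")
            .options(**catalog_read_options(uri))
            .load())


# In the notebook (spark is provided by Databricks):
# catalog_status = read_catalog_stream(spark, MONGO_CONN)
```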
With this code, we set up a structured streaming read from the catalog_myn collection in the search database of MongoDB. The resulting data is stored in the catalog_status DataFrame in Spark. The read operation is configured to fetch the full document from MongoDB's change stream and does not apply any aggregation.
The notebook code continues with:
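A sketch of how the vectorize-and-write step could look, under several assumptions: the product name lives in a `title` column, the embedding is stored in a `vec` column, the final collection sits in the same `search` database, and documents are upserted on an `id` field. `encode` stands for any callable that maps a string to a sequence of numbers, for example a sentence-transformers model's `encode` method:

```python
def final_catalog_write_options(uri, checkpoint_dir, id_field="id"):
    """Write options targeting catalog_final_myn (database and key are assumptions)."""
    return {
        "spark.mongodb.connection.uri": uri,
        "spark.mongodb.database": "search",
        "spark.mongodb.collection": "catalog_final_myn",
        "spark.mongodb.operationType": "update",
        "spark.mongodb.upsertDocument": "true",
        "spark.mongodb.idFieldList": id_field,
        "checkpointLocation": checkpoint_dir,
    }


def vectorize_and_write(catalog_status, encode, uri, checkpoint_dir, text_col="title"):
    """Add an embedding column to the stream and write it to the final collection."""
    # Imported lazily so the options helper works without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, FloatType

    @F.udf(returnType=ArrayType(FloatType()))
    def get_vec(text):
        # Rows without a product name get no embedding.
        if text is None:
            return None
        return [float(x) for x in encode(text)]

    final = catalog_status.withColumn("vec", get_vec(F.col(text_col))).drop("_id")
    return (final.writeStream.format("mongodb")
            .options(**final_catalog_write_options(uri, checkpoint_dir))
            .outputMode("append")
            .start())
```

A Python UDF calls the model once per row, which is simple but not the fastest option. For large catalogs, a batched approach may be worth considering.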
Here we’ll define how data should be stored and indexed for efficient searching. To configure the index, you can insert the snippet in MongoDB Atlas by browsing to your cluster splash page and clicking on the “Search” tab:
Next, click “Create Index.” Make sure you select “JSON Editor”:
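For illustration, a vector index definition could look like the one generated below, built as a Python dict and printed as JSON for pasting into the JSON Editor. The field name `vec`, the 384 dimensions (the output size of the `all-MiniLM-L6-v2` model), and cosine similarity are assumptions; match them to the embedding field and model you actually use:

```python
import json


def vector_index_definition(path="vec", dimensions=384, similarity="cosine"):
    """Build an Atlas Vector Search index definition for one embedding field."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,                 # document field holding the embedding
                "numDimensions": dimensions,  # must equal the model's output size
                "similarity": similarity,     # cosine, euclidean, or dotProduct
            }
        ]
    }


index_json = json.dumps(vector_index_definition(), indent=2)
print(index_json)
```

This uses the `vectorSearch` index definition format. Older Atlas Search indexes express the same idea with a `knnVector` field mapping instead.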
By implementing these indexing parameters, we speed up retrieval times, especially with high-dimensional vector data, since raw vectors can consume a significant amount of storage. Indexing also reduces the computational cost of operations like similarity calculations.
Instead of comparing a query vector with every vector in the dataset, indexing allows the system to compare with a subset, saving computational resources.
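To make the "subset" idea concrete, here is a sketch of a `$vectorSearch` aggregation pipeline. The `numCandidates` setting bounds how many nearest-neighbor candidates the index considers before the top `limit` results are returned. The index name `vector_index`, the `vec` path, and the projected `title` and `price` fields are assumptions:

```python
def vector_search_pipeline(query_vector, index_name="vector_index", path="vec",
                           num_candidates=100, limit=10):
    """Build an aggregation pipeline for an approximate nearest-neighbor search."""
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": [float(x) for x in query_vector],
                # Candidates examined by the index; larger values trade speed for recall.
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            "$project": {
                "_id": 0,
                "title": 1,
                "price": 1,
                # Similarity score computed by the vector search stage.
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
```

With pymongo, such a pipeline would be passed to `collection.aggregate(...)`, using the embedding of the user's query text as `query_vector`.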
Searching for “tan bags” with keyword matching alone first returns results that match the specific tokenized keywords “tan” and “bags”. In other words, you get any product whose document in the product catalog collection contains either or both of those keywords.
However, not all the results are bags or tan in color. You can see shoes, wallets, a dress, and a pair of pants. This could be frustrating for a customer, prompting them to leave the site.
Now, enable vector search by clicking the checkbox to the left of the magnifying glass icon in the search bar, and re-run the query “tan bags”. The results you get are in the image below:
As you can see from the screenshot, the results became more relevant for a consumer. Our search engine is able to identify similar products by understanding the context that “beige” is a similar color to “tan”, and therefore showcase these products as alternatives.
By working with MongoDB Atlas and Databricks, we can create real-time data transformation pipelines. We achieve this by leveraging the MongoDB Connector for Spark to prepare our operational data for vectorization, and store it back into our MongoDB Atlas collections. This approach allows us to develop the search logic for our Ecommerce app with minimal operational overhead.
On top of that, Atlas Vector Search provides a robust solution for implementing advanced search features, making it easy to deliver a great search user experience for your customers. By understanding and integrating these tools, developers can create search experiences that are fast, relevant, and user-friendly.