Explore Developer Center's New Chatbot! MongoDB AI Chatbot can be accessed at the top of your navigation to answer all your MongoDB questions.

MongoDB Developer
Atlas
plus
Sign in to follow topics
MongoDB Developer Centerchevron-right
Developer Topicschevron-right
Productschevron-right
Atlaschevron-right

Comparing NLP Techniques for Scalable Product Search

AS
VH
NC
MM
Adit Shrimal, Varun Hande, Neil Chitre, Manav Middha8 min read • Published Sep 23, 2024 • Updated Sep 23, 2024
AIPythonAtlas
Facebook Icontwitter iconlinkedin icon
Rate this article
star-empty
star-empty
star-empty
star-empty
star-empty
In the rapidly expanding online retail industry, data is being generated at an unprecedented scale. To provide a seamless shopping experience to customers, it’s crucial to build an efficient, scalable product search architecture for storing and processing data. In this article, we will compare four popular natural language processing (NLP) techniques — namely BM25, TF-IDF, Word2Vec, and BERT — to empirically find the most optimal solution for retrieving the most relevant results for a search query from a large corpus of products.

Where’s the code?

The full code for this project can be found here.

Data processing

Our data is sourced from Asos’s (an online retailer for clothing, footwear, accessories, etc.) real-time product data APIs. We utilized Apache Spark, Google Cloud Storage (GCS), and MongoDB to store and process data orchestrated by Apache Airflow.
Our data processing pipeline consists of three main parts:

1. Fetch data from APIs and write to GCS

We first fetch the product categories. Then, we get the products in each category and store this data to Google Cloud Storage. The Python snippet below illustrates how we fetched the data using RapidAPI and stored the data to GCS:
1import json
2import requests
3
4url = "https://asos2.p.rapidapi.com/categories/list"
5headers = {
6 "x-rapidapi-key": "YOUR_RAPIDAPI_KEY",
7 "x-rapidapi-host": "asos2.p.rapidapi.com",
8}
9response = requests.request("GET", url, headers=headers)
10categories = json.loads(response.text)
11for category in categories:a
12 url = f"https://asos2.p.rapidapi.com/products/v2/list/{category['id']}"
13 response = requests.request("GET", url, headers=headers)
14 products = json.loads(response.text)
15 # Store products in GCS
16 bucket = os.environ.get('GS_BUCKET_NAME')
17 product_filename = f"products_{datetime.now().strftime('%H%m%S')}.json"
18 write(bucket, str(date.today()) + "/" + product_filename, products)

2. Pre-processing

The next step is preprocessing. Since product names are often not clean, we tokenized the data and removed stop words. For this, we used the RegexTokenizer and StopWordsRemover from SparkML.
1from pyspark.ml.feature import RegexTokenizer, StopWordsRemover
2
3# Instantiate and set input and output column parameters
4regexTokenizer = RegexTokenizer(
5 inputCol="product_name", outputCol="words", pattern="\\W"
6)
7remover = StopWordsRemover(inputCol="words", outputCol="filtered")
8
9# Transform the data
10regexTokenized = regexTokenizer.transform(df)
11removedStopWords = remover.transform(regexTokenized)

3. Write transformed data to MongoDB

The last step is to store the transformed documents in a MongoDB collection. MongoDB’s document model, ability to efficiently handle large data volumes, built-in full-text search functionality, and easy integrations with big data tools such as PySpark make it a fitting choice for our data processing pipeline. The following code snippet shows how to write data to MongoDB:
1import json
2
3from pymongo import MongoClient
4
5client = MongoClient(
6 "mongodb://<username>:<password>@cluster0.mongodb.net/test?retryWrites=true&w=majority"
7)
8db = client.get_database("product_db")
9products = db.products
10
11# Fetch data from GCS
12for product in gcs_data:
13 products.insert_one(product)

Experimenting with different NLP techniques

Our primary objective was to find the most similar products for a given search query. We experimented with several methodologies:

1. TF-IDF + cosine similarity

TF-IDF, short for Term Frequency–Inverse Document Frequency, is a measure of importance of a word to a document in a corpus. It is a widely used method to encode text. Given a user query, we first encode it using TF-IDF and then use cosine similarity to find the most similar products from our catalog.
1from pyspark.ml.feature import IDF, HashingTF
2
3hashingTF = HashingTF(inputCol="filtered", outputCol="rawFeatures", numFeatures=20)
4featurizedData = hashingTF.transform(removedStopWords)
5
6idf = IDF(inputCol="rawFeatures", outputCol="features")
7idfModel = idf.fit(featurizedData)
8rescaledData = idfModel.transform(featurizedData)

2. Word2Vec + cosine similarity

Word2Vec is a family of model architectures to learn word embeddings from large datasets. Here, we use the Word2Vec implementation in SparkML to embed the user query and then use cosine similarity to find the most similar products from our catalog.
1from pyspark.ml.feature import Word2Vec
2
3word2Vec = Word2Vec(
4 vectorSize=300, minCount=0, inputCol="filtered", outputCol="features"
5)
6model = word2Vec.fit(removedStopWords)
7result = model.transform(removedStopWords)

3. BM25

BM25, short for “Best Match 25,” is a ranking function used in information retrieval systems to rank documents based on their relevance to a given search query. It is a probabilistic ranking models and is known for its ability to balance term frequency (TF) and inverse document frequency (IDF), thus making it a more sophisticated version of TF-IDF.
In our project, we implemented BM25 from scratch in PySpark due to the lack of a direct implementation in the library. Here is a more detailed look at our approach and code:
Firstly, we calculated the IDF values for each term, which is a measure of how much information the word provides, i.e., if it’s common or rare across all documents.
Secondly, we computed the term frequency in each document. Term frequency is simply the number of times a word appears in a document.
With these two metrics, we could then compute the BM25 score for each document relative to a query. Below is a simplified pseudocode for the process:
1# Compute IDF values for each term
2# Compute term frequency in each document for each term
3# For each query term, do:
4# For each document containing the term do:
5# Compute BM25 Score
6# Sort documents by score

Building an inverted index

In order to enhance the time efficiency of calculating TF and IDF every time for a product, we created an inverted index. This inverted index saves these important values beforehand to improve the retrieval time when someone searches for a product.
The first part of preprocessing involves stemming, which is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form. It’s done to get to the basic part of a word that carries the meaning. For example, the word stem of “running” is “run.” In this work, we used the Porter Stemming algorithm, a popular and robust stemming algorithm.
1def stemming(words):
2 ps = PorterStemmer()
3 result = []
4
5 for w in words:
6 root_word = ps.stem(w)
7 result.append(root_word)
8
9 return result
10
11stemming_udf = udf(lambda x: stemming(x), ArrayType(elementType=StringType()))
12preprocessed_data = preprocessed_data.withColumn(
13"tokens", stemming_udf("filtered_words")
14)
Next, we flatten the list of tokens into individual words and count the total number of unique words and total words.
1words = preprocessed_data.select(explode("tokens").alias("word"))
2total_unique_words = words.select("word").distinct().count()
3total_words = words.count()
Once we have a list of words, we start building the inverted index. For each word in our dataset, we create an entry in our index, where each entry points to a list of documents that contain the word, along with the word’s frequency in each document.
1id_str = concat_ws("", col("_id").cast(StringType()))
2
3word_count_df = (
4 preprocessed_data.select("_id", explode("tokens").alias("word"))
5 .groupBy("word", "_id")
6 .agg(size(collect_list("word")).alias("occurrence"))
7 .groupBy("word")
8 .agg(
9 map_from_entries(collect_list(struct(id_str.alias("id"), "occurrence"))).alias(
10 "documents"
11 )
12 )
13 .withColumnRenamed("word", "_id")
14).rdd.map(lambda row: (row["_id"], row["documents"]))
15
16inverted_index = word_count_df.collectAsMap()
Finally, we also compute the length of all documents and the unique word count for each document. These values are used later in the BM25 formula.
1doc_length_df = (
2 preprocessed_data.select("_id", size("tokens").alias("length"))
3 .groupBy("_id")
4 .agg(sum(col("length")).alias("doc_length"))
5 .withColumn("_id", concat_ws("", col("_id").cast(StringType())))
6 .rdd.map(lambda row: (row["_id"], row["doc_length"]))
7)
8doc_length = doc_length_df.collectAsMap()
9
10doc_length["word_counter"] = total_unique_words
11doc_length["doc_counter"] = total_words
12
13unique_word_count_df = preprocessed_data.select(
14 "_id",
15 size(array_distinct("tokens")).alias("num_unique_words"),
16).withColumn("_id", concat_ws("", col("_id").cast(StringType()))).rdd.map(lambda row: (row["_id"], row["num_unique_words"]))
17unique_word = unique_word_count_df.collectAsMap()
This concludes the preprocessing and inverted index-building step. The inverted index is a critical component of the BM25 ranking function, as it enables fast retrieval of relevant documents for a given query.
Now, let’s dive into the implementation of the BM25 algorithm.

Implementing BM25

The mathematical formula for BM25 is as follows:
BM25 mathematical formula
Where:
  • f(q_i, D): Frequency of query term q_i in document D
  • |D|: Word count of document D
  • avgdl: Average word count across all documents
  • k_1 and b: Tuning parameters, usually set as k_1 in [1.2, 2.0] and b=0.75
  • IDF(q_i): Weight of query term q_i based on its rarity across all documents
Here’s how we’ve implemented the above algorithm in Python:
1def bm25(doc_length,
2 inverted_index,
3 unique_word,
4 word_counter,
5 doc_counter,
6 query):
7 k1 = 1.2
8 b = 0.75
9 k2 = 100
10 N = len(doc_length) - 2
11 avgdl = doc_length["doc_counter"] / N
12
13 score = defaultdict(int)
14 for word in query:
15 if word in inverted_index:
16 n = len(inverted_index[word])
17 idf = math.log((N - n + 0.5) / (n + 0.5))
18 for doc, freq in inverted_index[word].items():
19 f = freq
20 qf = query.count(word)
21 dl = doc_length[doc]
22 K = k1 * ((1 - b) + b * (float(dl) / float(avgdl)))
23 R1 = idf * ((f * (k1 + 1)) / (f + K))
24 R2 = (qf * (k2 + 1)) / (qf + k2)
25 R = R1 * R2
26 score[doc] += R
27 else:
28 Continue
29
30 return score
This function takes the document length dictionary, the inverted index, the unique word count dictionary, the total word counter, the document counter, and the search query. It calculates the BM25 score for each document in relation to the query and returns a dictionary where the keys are the document IDs, and the values are the corresponding BM25 scores. The documents with the highest BM25 scores are the ones most relevant to the query. In this study, we chose values for k1, b, and k2 based on other existing studies.

4. BERT

BERT, short for Bidirectional Encoder Representations from Transformers, was one of the early examples of large language models (LLMs) and uses a transformer encoder architecture to represent text as vectors. We used a pre-trained BERT model to map the user query to a high-dimensional vector space and then use cosine similarity to find the most similar products from our catalog.
1from sentence_transformers import SentenceTransformer
2
3model = SentenceTransformer('bert-base-nli-mean-tokens')
4sentence_embeddings = model.encode(data['product_name'])

Results

Each of the models had their strengths and weaknesses. While TF-IDF was the easiest to interpret, it struggled to capture the semantic meaning of the data. Word2Vec was limited to handling queries present in the training corpus. BM25 was very quick, but it lacked semantic understanding. Among the algorithms we tested, the best model was BERT, as it could incorporate the semantic meaning of the data and handle queries in multiple languages.
For our project, we used a GPU-accelerated compute instance on AWS. Below are the details of the specifications we used:
  • Databricks Runtime version: 12.1 ML (includes Apache Spark 3.3.1, GPU, Scala 2.12)
  • Worker/Driver types: g4dn.xlarge
  • Number of workers: minimum 2, maximum 5
  • Resources per worker: 32–80 GB Memory, 8–20 Cores
  • Driver resources: 16 GB Memory, 4 Cores
We compared the performance of the four search algorithms (BM25, TF-IDF, Word2Vec, and BERT) on three distinct queries: "sneakers", "t-shirt", and "chappals" (which means sandals/slippers in Hindi). The results varied in terms of speed and relevance for each query and algorithm combination. A summary of the results is as follows:
  • Sneakers query: BM25 was the fastest in returning relevant results. Both TF-IDF and Word2Vec were slower compared to BM25, while BERT offered a balance between speed and relevance.
  • T-shirt query: While BM25 was the fastest to return results, it provided irrelevant results due to the limited vocabulary of the corpus. Word2Vec performed poorly in terms of both speed and relevance, similar to TF-IDF. BERT was moderately slow but provided highly relevant results.
  • Chappals query: Chappal is a Hindi word for leather sandals. Again, BM25 excelled in terms of speed but delivered irrelevant results due to language limitations in the corpus. Word2Vec and TF-IDF had similar performance metrics as with the "t-shirt" query. BERT again demonstrated its effectiveness by delivering highly relevant results even with the complexity of handling multiple languages.
TD-IDF results on “chappals” query
TF-IDF's results for the “chappals” query show the algorithm's struggle with multi-language queries.
 Word2Vec results on “chappals” query
Word2Vec results demonstrate its limitations in handling diverse languages.
BM25 results on “chappals” query
BM25 responds quickly but produces irrelevant results due to language limitations in the corpus.
BERT results on “chappals” query
BERT successfully retrieves relevant results, showcasing its strength in handling multi-language queries.

Conclusion

In this article, we compared four different natural language processing (NLP) techniques — namely BM25, TF-IDF, Word2Vec, and BERT — to empirically find the most optimal solution for retrieving the most relevant results for a search query from a large corpus of products.
Our preliminary experiments show that the BM25 algorithm is the most time-efficient, while BERT provides the best trade-off between speed and relevance. This observation is in-line with what we see in practice in the real world, with BM25 being a popular algorithm for lexical search, and vector search being the preferred choice for semantic search. Hybrid search, a technique to combine the strengths of keyword and semantic search, is becoming increasingly common.
MongoDB Atlas supports lexical, vector, as well as hybrid search. Here are some resources to get you started:
For further questions, reach out to us in our community forums!
Top Comments in Forums
There are no comments on this article yet.
Start the Conversation

Facebook Icontwitter iconlinkedin icon
Rate this article
star-empty
star-empty
star-empty
star-empty
star-empty
Related
Tutorial

Instant GraphQL APIs for MongoDB with Grafbase


Oct 12, 2023 | 7 min read
Code Example

EHRS-Peru


Sep 11, 2024 | 3 min read
Tutorial

Java Meets Queryable Encryption: Developing a Secure Bank Account Application


Oct 08, 2024 | 14 min read
Code Example

myLeG


Jul 07, 2022 | 1 min read
Table of Contents