Comparing NLP Techniques for Scalable Product Search

Adit Shrimal, Varun Hande, Neil Chitre, Manav Middha8 min read • Published Sep 23, 2024 • Updated Sep 23, 2024

AI Python Atlas

Rate this article

In the rapidly expanding online retail industry, data is being generated at an unprecedented scale. To provide a seamless shopping experience to customers, it’s crucial to build an efficient, scalable product search architecture for storing and processing data. In this article, we will compare four popular natural language processing (NLP) techniques — namely BM25, TF-IDF, Word2Vec, and BERT — to empirically find the most optimal solution for retrieving the most relevant results for a search query from a large corpus of products.

Where’s the code?

The full code for this project can be found here.

Data processing

Our data is sourced from Asos’s (an online retailer for clothing, footwear, accessories, etc.) real-time product data APIs. We utilized Apache Spark, Google Cloud Storage (GCS), and MongoDB to store and process data orchestrated by Apache Airflow.

Our data processing pipeline consists of three main parts:

1. Fetch data from APIs and write to GCS

We first fetch the product categories. Then, we get the products in each category and store this data to Google Cloud Storage. The Python snippet below illustrates how we fetched the data using RapidAPI and stored the data to GCS:

1 import json
2 import requests
3 
4 url = "https://asos2.p.rapidapi.com/categories/list"
5 headers = {
6    "x-rapidapi-key": "YOUR_RAPIDAPI_KEY",
7    "x-rapidapi-host": "asos2.p.rapidapi.com",
8 }
9 response = requests.request("GET", url, headers=headers)
10 categories = json.loads(response.text)
11 for category in categories:a
12    url = f"https://asos2.p.rapidapi.com/products/v2/list/{category['id']}"
13    response = requests.request("GET", url, headers=headers)
14    products = json.loads(response.text)
15    # Store products in GCS
16    bucket = os.environ.get('GS_BUCKET_NAME')
17    product_filename = f"products_{datetime.now().strftime('%H%m%S')}.json"
18    write(bucket, str(date.today()) + "/" + product_filename, products)

2. Pre-processing

The next step is preprocessing. Since product names are often not clean, we tokenized the data and removed stop words. For this, we used the RegexTokenizer and StopWordsRemover from SparkML.

1 from pyspark.ml.feature import RegexTokenizer, StopWordsRemover
2 
3 # Instantiate and set input and output column parameters
4 regexTokenizer = RegexTokenizer(
5    inputCol="product_name", outputCol="words", pattern="\\W"
6 )
7 remover = StopWordsRemover(inputCol="words", outputCol="filtered")
8 
9 # Transform the data
10 regexTokenized = regexTokenizer.transform(df)
11 removedStopWords = remover.transform(regexTokenized)

3. Write transformed data to MongoDB

The last step is to store the transformed documents in a MongoDB collection. MongoDB’s document model, ability to efficiently handle large data volumes, built-in full-text search functionality, and easy integrations with big data tools such as PySpark make it a fitting choice for our data processing pipeline. The following code snippet shows how to write data to MongoDB:

1 import json
2 
3 from pymongo import MongoClient
4 
5 client = MongoClient(
6    "mongodb://<username>:<password>@cluster0.mongodb.net/test?retryWrites=true&w=majority"
7 )
8 db = client.get_database("product_db")
9 products = db.products
10 
11 # Fetch data from GCS
12 for product in gcs_data:
13    products.insert_one(product)

Experimenting with different NLP techniques

Our primary objective was to find the most similar products for a given search query. We experimented with several methodologies:

1. TF-IDF + cosine similarity

TF-IDF, short for Term Frequency–Inverse Document Frequency, is a measure of importance of a word to a document in a corpus. It is a widely used method to encode text. Given a user query, we first encode it using TF-IDF and then use cosine similarity to find the most similar products from our catalog.

1 from pyspark.ml.feature import IDF, HashingTF
2 
3 hashingTF = HashingTF(inputCol="filtered", outputCol="rawFeatures", numFeatures=20)
4 featurizedData = hashingTF.transform(removedStopWords)
5 
6 idf = IDF(inputCol="rawFeatures", outputCol="features")
7 idfModel = idf.fit(featurizedData)
8 rescaledData = idfModel.transform(featurizedData)

2. Word2Vec + cosine similarity

Word2Vec is a family of model architectures to learn word embeddings from large datasets. Here, we use the Word2Vec implementation in SparkML to embed the user query and then use cosine similarity to find the most similar products from our catalog.

1 from pyspark.ml.feature import Word2Vec
2 
3 word2Vec = Word2Vec(
4    vectorSize=300, minCount=0, inputCol="filtered", outputCol="features"
5 )
6 model = word2Vec.fit(removedStopWords)
7 result = model.transform(removedStopWords)

3. BM25

BM25, short for “Best Match 25,” is a ranking function used in information retrieval systems to rank documents based on their relevance to a given search query. It is a probabilistic ranking models and is known for its ability to balance term frequency (TF) and inverse document frequency (IDF), thus making it a more sophisticated version of TF-IDF.

In our project, we implemented BM25 from scratch in PySpark due to the lack of a direct implementation in the library. Here is a more detailed look at our approach and code:

Firstly, we calculated the IDF values for each term, which is a measure of how much information the word provides, i.e., if it’s common or rare across all documents.

Secondly, we computed the term frequency in each document. Term frequency is simply the number of times a word appears in a document.

With these two metrics, we could then compute the BM25 score for each document relative to a query. Below is a simplified pseudocode for the process:

1 # Compute IDF values for each term
2 # Compute term frequency in each document for each term
3 # For each query term, do:
4 # For each document containing the term do:
5 # Compute BM25 Score
6 # Sort documents by score

Building an inverted index

In order to enhance the time efficiency of calculating TF and IDF every time for a product, we created an inverted index. This inverted index saves these important values beforehand to improve the retrieval time when someone searches for a product.

The first part of preprocessing involves stemming, which is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form. It’s done to get to the basic part of a word that carries the meaning. For example, the word stem of “running” is “run.” In this work, we used the Porter Stemming algorithm, a popular and robust stemming algorithm.

1 def stemming(words):
2    ps = PorterStemmer()
3    result = []
4 
5    for w in words:
6        root_word = ps.stem(w)
7        result.append(root_word)
8 
9    return result
10 
11 stemming_udf = udf(lambda x: stemming(x), ArrayType(elementType=StringType()))
12 preprocessed_data = preprocessed_data.withColumn(
13 "tokens", stemming_udf("filtered_words")
14 )

Next, we flatten the list of tokens into individual words and count the total number of unique words and total words.

1 words = preprocessed_data.select(explode("tokens").alias("word"))
2 total_unique_words = words.select("word").distinct().count()
3 total_words = words.count()

Once we have a list of words, we start building the inverted index. For each word in our dataset, we create an entry in our index, where each entry points to a list of documents that contain the word, along with the word’s frequency in each document.

1 id_str = concat_ws("", col("_id").cast(StringType()))
2 
3 word_count_df = (
4    preprocessed_data.select("_id", explode("tokens").alias("word"))
5    .groupBy("word", "_id")
6    .agg(size(collect_list("word")).alias("occurrence"))
7    .groupBy("word")
8    .agg(
9        map_from_entries(collect_list(struct(id_str.alias("id"), "occurrence"))).alias(
10            "documents"
11        )
12    )
13    .withColumnRenamed("word", "_id")
14 ).rdd.map(lambda row: (row["_id"], row["documents"]))
15 
16 inverted_index = word_count_df.collectAsMap()

Finally, we also compute the length of all documents and the unique word count for each document. These values are used later in the BM25 formula.

1 doc_length_df = (
2    preprocessed_data.select("_id", size("tokens").alias("length"))
3    .groupBy("_id")
4    .agg(sum(col("length")).alias("doc_length"))
5    .withColumn("_id", concat_ws("", col("_id").cast(StringType())))
6    .rdd.map(lambda row: (row["_id"], row["doc_length"]))
7 )
8 doc_length = doc_length_df.collectAsMap()
9 
10 doc_length["word_counter"] = total_unique_words
11 doc_length["doc_counter"] = total_words
12 
13 unique_word_count_df = preprocessed_data.select(
14    "_id",
15    size(array_distinct("tokens")).alias("num_unique_words"),
16 ).withColumn("_id", concat_ws("", col("_id").cast(StringType()))).rdd.map(lambda row: (row["_id"], row["num_unique_words"]))
17 unique_word = unique_word_count_df.collectAsMap()

This concludes the preprocessing and inverted index-building step. The inverted index is a critical component of the BM25 ranking function, as it enables fast retrieval of relevant documents for a given query.

Now, let’s dive into the implementation of the BM25 algorithm.

Implementing BM25

The mathematical formula for BM25 is as follows:

Where:

f(q_i, D): Frequency of query term q_i in document D
|D|: Word count of document D
avgdl: Average word count across all documents
k_1 and b: Tuning parameters, usually set as k_1 in [1.2, 2.0] and b=0.75
IDF(q_i): Weight of query term q_i based on its rarity across all documents

Source

Here’s how we’ve implemented the above algorithm in Python:

1 def bm25(doc_length,
2   inverted_index, 
3    unique_word, 
4    word_counter, 
5    doc_counter, 
6    query):
7    k1 = 1.2
8    b = 0.75
9    k2 = 100
10    N = len(doc_length) - 2
11    avgdl = doc_length["doc_counter"] / N
12 
13    score = defaultdict(int)
14    for word in query:
15        if word in inverted_index:
16            n = len(inverted_index[word])
17            idf = math.log((N - n + 0.5) / (n + 0.5))
18            for doc, freq in inverted_index[word].items():
19                f = freq
20                qf = query.count(word)
21                dl = doc_length[doc]
22                K = k1 * ((1 - b) + b * (float(dl) / float(avgdl)))
23                R1 = idf * ((f * (k1 + 1)) / (f + K))
24                R2 = (qf * (k2 + 1)) / (qf + k2)
25                R = R1 * R2
26                score[doc] += R
27        else:
28            Continue
29 
30    return score

This function takes the document length dictionary, the inverted index, the unique word count dictionary, the total word counter, the document counter, and the search query. It calculates the BM25 score for each document in relation to the query and returns a dictionary where the keys are the document IDs, and the values are the corresponding BM25 scores. The documents with the highest BM25 scores are the ones most relevant to the query. In this study, we chose values for k1, b, and k2 based on other existing studies.

4. BERT

BERT, short for Bidirectional Encoder Representations from Transformers, was one of the early examples of large language models (LLMs) and uses a transformer encoder architecture to represent text as vectors. We used a pre-trained BERT model to map the user query to a high-dimensional vector space and then use cosine similarity to find the most similar products from our catalog.

1 from sentence_transformers import SentenceTransformer
2 
3 model = SentenceTransformer('bert-base-nli-mean-tokens')
4 sentence_embeddings = model.encode(data['product_name'])

Results

Each of the models had their strengths and weaknesses. While TF-IDF was the easiest to interpret, it struggled to capture the semantic meaning of the data. Word2Vec was limited to handling queries present in the training corpus. BM25 was very quick, but it lacked semantic understanding. Among the algorithms we tested, the best model was BERT, as it could incorporate the semantic meaning of the data and handle queries in multiple languages.

For our project, we used a GPU-accelerated compute instance on AWS. Below are the details of the specifications we used:

Databricks Runtime version: 12.1 ML (includes Apache Spark 3.3.1, GPU, Scala 2.12)
Worker/Driver types: g4dn.xlarge
Number of workers: minimum 2, maximum 5
Resources per worker: 32–80 GB Memory, 8–20 Cores
Driver resources: 16 GB Memory, 4 Cores

We compared the performance of the four search algorithms (BM25, TF-IDF, Word2Vec, and BERT) on three distinct queries: "sneakers", "t-shirt", and "chappals" (which means sandals/slippers in Hindi). The results varied in terms of speed and relevance for each query and algorithm combination. A summary of the results is as follows:

Sneakers query: BM25 was the fastest in returning relevant results. Both TF-IDF and Word2Vec were slower compared to BM25, while BERT offered a balance between speed and relevance.
T-shirt query: While BM25 was the fastest to return results, it provided irrelevant results due to the limited vocabulary of the corpus. Word2Vec performed poorly in terms of both speed and relevance, similar to TF-IDF. BERT was moderately slow but provided highly relevant results.
Chappals query: Chappal is a Hindi word for leather sandals. Again, BM25 excelled in terms of speed but delivered irrelevant results due to language limitations in the corpus. Word2Vec and TF-IDF had similar performance metrics as with the "t-shirt" query. BERT again demonstrated its effectiveness by delivering highly relevant results even with the complexity of handling multiple languages.

TF-IDF's results for the “chappals” query show the algorithm's struggle with multi-language queries.

Word2Vec results demonstrate its limitations in handling diverse languages.

BM25 responds quickly but produces irrelevant results due to language limitations in the corpus.

BERT successfully retrieves relevant results, showcasing its strength in handling multi-language queries.

Conclusion

In this article, we compared four different natural language processing (NLP) techniques — namely BM25, TF-IDF, Word2Vec, and BERT — to empirically find the most optimal solution for retrieving the most relevant results for a search query from a large corpus of products.

Our preliminary experiments show that the BM25 algorithm is the most time-efficient, while BERT provides the best trade-off between speed and relevance. This observation is in-line with what we see in practice in the real world, with BM25 being a popular algorithm for lexical search, and vector search being the preferred choice for semantic search. Hybrid search, a technique to combine the strengths of keyword and semantic search, is becoming increasingly common.

MongoDB Atlas supports lexical, vector, as well as hybrid search. Here are some resources to get you started:

For further questions, reach out to us in our community forums!

Top Comments in Forums

There are no comments on this article yet.

Start the Conversation

Rate this article

Tutorial

Migrate From an RDBMS to MongoDB With the Help of AI: An Introduction to Query Converter

Aug 28, 2024 | 4 min read

Article

AI Shop: The Power of LangChain, OpenAI, and MongoDB Atlas Working Together

Sep 18, 2024 | 7 min read

Tutorial

How to Develop a Web App With Netlify Serverless Functions and MongoDB

Jan 13, 2025 | 6 min read

Tutorial

RAG Evaluation: Detecting Hallucinations With Patronus AI and MongoDB

Aug 15, 2024 | 11 min read

Where’s the code?
Data processing
Experimenting with different NLP techniques
Results
Conclusion

Atlas

Comparing NLP Techniques for Scalable Product Search

Where’s the code?

Data processing

1. Fetch data from APIs and write to GCS

2. Pre-processing

3. Write transformed data to MongoDB

Experimenting with different NLP techniques

1. TF-IDF + cosine similarity

2. Word2Vec + cosine similarity

3. BM25

Building an inverted index

Implementing BM25

4. BERT

Results

Conclusion

Top Comments in Forums

Related

Migrate From an RDBMS to MongoDB With the Help of AI: An Introduction to Query Converter

AI Shop: The Power of LangChain, OpenAI, and MongoDB Atlas Working Together

How to Develop a Web App With Netlify Serverless Functions and MongoDB

RAG Evaluation: Detecting Hallucinations With Patronus AI and MongoDB

Table of Contents

1	import json
2	import requests
3
4	url = "https://asos2.p.rapidapi.com/categories/list"
5	headers = {
6	"x-rapidapi-key": "YOUR_RAPIDAPI_KEY",
7	"x-rapidapi-host": "asos2.p.rapidapi.com",
8	}
9	response = requests.request("GET", url, headers=headers)
10	categories = json.loads(response.text)
11	for category in categories:a
12	url = f"https://asos2.p.rapidapi.com/products/v2/list/{category['id']}"
13	response = requests.request("GET", url, headers=headers)
14	products = json.loads(response.text)
15	# Store products in GCS
16	bucket = os.environ.get('GS_BUCKET_NAME')
17	product_filename = f"products_{datetime.now().strftime('%H%m%S')}.json"
18	write(bucket, str(date.today()) + "/" + product_filename, products)

1	from pyspark.ml.feature import RegexTokenizer, StopWordsRemover
2
3	# Instantiate and set input and output column parameters
4	regexTokenizer = RegexTokenizer(
5	inputCol="product_name", outputCol="words", pattern="\\W"
6	)
7	remover = StopWordsRemover(inputCol="words", outputCol="filtered")
8
9	# Transform the data
10	regexTokenized = regexTokenizer.transform(df)
11	removedStopWords = remover.transform(regexTokenized)

1	import json
2
3	from pymongo import MongoClient
4
5	client = MongoClient(
6	"mongodb://<username>:<password>@cluster0.mongodb.net/test?retryWrites=true&w=majority"
7	)
8	db = client.get_database("product_db")
9	products = db.products
10
11	# Fetch data from GCS
12	for product in gcs_data:
13	products.insert_one(product)

1	from pyspark.ml.feature import IDF, HashingTF
2
3	hashingTF = HashingTF(inputCol="filtered", outputCol="rawFeatures", numFeatures=20)
4	featurizedData = hashingTF.transform(removedStopWords)
5
6	idf = IDF(inputCol="rawFeatures", outputCol="features")
7	idfModel = idf.fit(featurizedData)
8	rescaledData = idfModel.transform(featurizedData)

1	from pyspark.ml.feature import Word2Vec
2
3	word2Vec = Word2Vec(
4	vectorSize=300, minCount=0, inputCol="filtered", outputCol="features"
5	)
6	model = word2Vec.fit(removedStopWords)
7	result = model.transform(removedStopWords)

1	# Compute IDF values for each term
2	# Compute term frequency in each document for each term
3	# For each query term, do:
4	# For each document containing the term do:
5	# Compute BM25 Score
6	# Sort documents by score

1	def stemming(words):
2	ps = PorterStemmer()
3	result = []
4
5	for w in words:
6	root_word = ps.stem(w)
7	result.append(root_word)
8
9	return result
10
11	stemming_udf = udf(lambda x: stemming(x), ArrayType(elementType=StringType()))
12	preprocessed_data = preprocessed_data.withColumn(
13	"tokens", stemming_udf("filtered_words")
14	)

1	words = preprocessed_data.select(explode("tokens").alias("word"))
2	total_unique_words = words.select("word").distinct().count()
3	total_words = words.count()

1	id_str = concat_ws("", col("_id").cast(StringType()))
2
3	word_count_df = (
4	preprocessed_data.select("_id", explode("tokens").alias("word"))
5	.groupBy("word", "_id")
6	.agg(size(collect_list("word")).alias("occurrence"))
7	.groupBy("word")
8	.agg(
9	map_from_entries(collect_list(struct(id_str.alias("id"), "occurrence"))).alias(
10	"documents"
11	)
12	)
13	.withColumnRenamed("word", "_id")
14	).rdd.map(lambda row: (row["_id"], row["documents"]))
15
16	inverted_index = word_count_df.collectAsMap()

1	doc_length_df = (
2	preprocessed_data.select("_id", size("tokens").alias("length"))
3	.groupBy("_id")
4	.agg(sum(col("length")).alias("doc_length"))
5	.withColumn("_id", concat_ws("", col("_id").cast(StringType())))
6	.rdd.map(lambda row: (row["_id"], row["doc_length"]))
7	)
8	doc_length = doc_length_df.collectAsMap()
9
10	doc_length["word_counter"] = total_unique_words
11	doc_length["doc_counter"] = total_words
12
13	unique_word_count_df = preprocessed_data.select(
14	"_id",
15	size(array_distinct("tokens")).alias("num_unique_words"),
16	).withColumn("_id", concat_ws("", col("_id").cast(StringType()))).rdd.map(lambda row: (row["_id"], row["num_unique_words"]))
17	unique_word = unique_word_count_df.collectAsMap()

1	def bm25(doc_length,
2	inverted_index,
3	unique_word,
4	word_counter,
5	doc_counter,
6	query):
7	k1 = 1.2
8	b = 0.75
9	k2 = 100
10	N = len(doc_length) - 2
11	avgdl = doc_length["doc_counter"] / N
12
13	score = defaultdict(int)
14	for word in query:
15	if word in inverted_index:
16	n = len(inverted_index[word])
17	idf = math.log((N - n + 0.5) / (n + 0.5))
18	for doc, freq in inverted_index[word].items():
19	f = freq
20	qf = query.count(word)
21	dl = doc_length[doc]
22	K = k1 * ((1 - b) + b * (float(dl) / float(avgdl)))
23	R1 = idf * ((f * (k1 + 1)) / (f + K))
24	R2 = (qf * (k2 + 1)) / (qf + k2)
25	R = R1 * R2
26	score[doc] += R
27	else:
28	Continue
29
30	return score

1	from sentence_transformers import SentenceTransformer
2
3	model = SentenceTransformer('bert-base-nli-mean-tokens')
4	sentence_embeddings = model.encode(data['product_name'])