Leveraging MongoDB Atlas Vector Search With LangChain

Arek Borucki6 min read • Published Sep 18, 2024 • Updated Sep 18, 2024

AI Atlas Vector Search Python

Introduction to Vector Search in MongoDB Atlas

Vector search engines, also referred to as vector databases, semantic search, or cosine search, locate the entries closest to a given vectorized query. Whereas conventional search methods rely on keyword references, lexical matching, and term frequency, vector search engines measure similarity by distance in the embedding space. Finding related data then becomes a search for the nearest neighbors of your query.
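To make the nearest-neighbor idea concrete, here is a minimal sketch of a brute-force search over a handful of vectors. The three-dimensional vectors and their labels are invented for illustration; real embeddings have hundreds or thousands of dimensions, and a production engine uses an index rather than scanning every entry.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, vectors, k=2):
    """Score every stored vector against the query and keep the k best."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings"; the values are made up for this example.
toy_vectors = {
    "database backup": [0.9, 0.1, 0.0],
    "database restore": [0.8, 0.2, 0.1],
    "chocolate cake": [0.0, 0.1, 0.9],
}

top = nearest_neighbors([1.0, 0.0, 0.0], toy_vectors, k=2)
for name, score in top:
    print(f"{name}: {score:.3f}")
```

The two database-related entries rank first because their vectors point in nearly the same direction as the query, regardless of any shared keywords.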

Vector embeddings are numeric representations of data and its accompanying context, stored as dense, high-dimensional vectors. Various models are designed to produce these embeddings, both proprietary (such as those from OpenAI) and open-source (such as FastText or many of the models hosted on Hugging Face). These models can be trained on millions of samples to deliver results that are more relevant and more precise. In certain situations, numeric data that you have gathered or designed to capture the essential characteristics of your documents can itself serve as embeddings. The crucial part is to have an efficient search mechanism, such as the one MongoDB Atlas provides.

MongoDB Atlas is a fully managed cloud database available on AWS, Azure, and GCP. It has recently added native vector search capabilities for your MongoDB document data. Atlas Vector Search uses the Hierarchical Navigable Small Worlds (HNSW) algorithm to execute semantic searches. You can use its support for approximate nearest neighbor (ANN) queries to find results similar to a specific product, conduct image searches, and more.

$vectorSearch operator

Atlas Vector Search queries take the form of an aggregation pipeline stage, the new $vectorSearch stage. It performs an ANN search on a vector in the specified field. The field you intend to search must be indexed with the Atlas Vector Search vector type. $vectorSearch must be the first stage of any pipeline in which it appears.
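As a rough illustration, the sketch below builds such a pipeline as a plain Python list. The index name default and the embedding path mirror the setup used later in this article; the text field name, the numCandidates value of 100, and the helper name build_vector_search_pipeline are assumptions made for this example.

```python
def build_vector_search_pipeline(query_vector, index_name="default",
                                 path="embedding", limit=3, num_candidates=100):
    """Return an aggregation pipeline whose first stage is $vectorSearch."""
    if num_candidates < limit:
        raise ValueError("numCandidates must be greater than or equal to limit")
    return [
        {
            # $vectorSearch has to be the first stage of the pipeline.
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": list(query_vector),
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            # Return the stored text (assumed field name) and the similarity score.
            "$project": {
                "_id": 0,
                "text": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


pipeline = build_vector_search_pipeline([0.1, 0.2, 0.3], limit=3)
print(pipeline[0]["$vectorSearch"]["limit"])
```

With a PyMongo collection object, the pipeline would be passed to collection.aggregate(pipeline), using a query vector with the same number of dimensions as the index.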

Introduction to LangChain

LangChain is an open-source framework that simplifies the creation of applications powered by large language models (LLMs). It extends beyond standard API calls by being data-aware and agentic, which makes it easier to connect language models to various data sources and provide richer, more personalized experiences.

This data awareness lets an application interact with different datasets and improve its responses based on the data it retrieves. For example, a developer could use LangChain to create an application in which a user's query is passed to an embedding model, which generates a vector representation of the query. This vector could then be used to search the vector data stored in MongoDB Atlas through its vector search feature. The results from MongoDB Atlas could be returned to the user directly or processed further by a language model to provide more detailed or personalized responses.

Now, let's consolidate all of these elements in an architectural view.

With the theoretical groundwork laid out, it's time to transition from conceptual understanding to practical application. Let's delve into the implementation process to see how these concepts come to life.

Setting up the environment

Create a search index

The first step is creating the vector search index. In the Atlas UI (you can also use Atlas Vector Search with local Atlas deployments that you create with the Atlas CLI), choose Search and Create Search Index. Please also visit the official MongoDB documentation to learn more.

Next, utilize the JSON Editor to configure fields of type vector. I named the field containing the embedding vector embedding.

{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    }
  ]
}

Specify the namespace (database and collection) on which the vector search index should be created. I chose the namespace langchain.vectorSearch. Name the index default, because the scripts later in this article pass index_name="default".
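If you prefer to create the index from code rather than the Atlas UI, the following sketch shows one possible approach. It assumes PyMongo 4.7 or later, where SearchIndexModel accepts type="vectorSearch"; the helper names vector_index_definition and create_vector_index are hypothetical.

```python
ALLOWED_SIMILARITY = {"euclidean", "cosine", "dotProduct"}


def vector_index_definition(path="embedding", num_dimensions=1536, similarity="cosine"):
    """Build the same index definition shown in the JSON Editor example."""
    if similarity not in ALLOWED_SIMILARITY:
        raise ValueError(f"Unsupported similarity function: {similarity}")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


def create_vector_index(collection, name="default"):
    """Create the vector search index on a PyMongo collection (assumes PyMongo 4.7+)."""
    # Imported lazily so the definition helper works without PyMongo installed.
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(
        definition=vector_index_definition(),
        name=name,
        type="vectorSearch",
    )
    return collection.create_search_index(model)


definition = vector_index_definition()
print(definition["fields"][0]["numDimensions"])
```

Index creation is asynchronous on Atlas, so the index may take a short while to become queryable after create_vector_index returns.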

The similarity field in the vector definition specifies the function used to find the top K-nearest neighbors. The possible values are:

euclidean: Measures the straight-line distance between the end points of two vectors, so it is sensitive to both direction and magnitude.
cosine: Measures similarity by the angle between vectors, independent of magnitude; it cannot be used with zero-magnitude vectors. If you want cosine similarity, normalizing your vectors and using dotProduct is recommended.
dotProduct: Similar to cosine but also takes vector magnitude into account, allowing efficient similarity measurement based on both angle and magnitude. To use it, normalize vectors to unit length at both index time and query time.
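The relationship between these three functions can be checked with a few lines of plain Python. This is an illustrative sketch with made-up two-dimensional vectors, not how Atlas computes scores internally.

```python
import math


def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    # Undefined for zero-magnitude vectors (division by zero).
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot_product(v, v))
    return [x / length for x in v]


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the magnitude

print(euclidean_distance(a, b))                 # distance reflects the magnitude gap
print(cosine_similarity(a, b))                  # angle only, so magnitude is ignored
print(dot_product(a, b))                        # grows with magnitude
print(dot_product(normalize(a), normalize(b)))  # matches cosine once normalized
```

After normalization the dot product and the cosine similarity agree, which is why normalized vectors combined with dotProduct are a common substitute for cosine.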

You can perform semantic searches on an Atlas cluster running MongoDB version 6.0.11 or later. Atlas Vector Search lets you store vector embeddings for any type of data alongside the other data in your collection, and it supports embeddings of up to 4096 dimensions.

Now that we have configured Atlas Vector Search, let's move on to configuring LangChain.

LangChain and OpenAI

In this article, we will utilize OpenAI to generate vector embeddings. Firstly, you will need the OpenAI API Key. Create an account, and then locate your Secret API key in your user settings.

To install LangChain for Python, first upgrade pip and then run the install command:

pip3 install pip --upgrade
pip3 install langchain

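We will also need other Python modules: pymongo for communication with MongoDB Atlas, openai for communication with the OpenAI API, pypdf and tiktoken for PDF parsing and tokenization, and the LangChain integration packages (including langchain-mongodb, which provides the MongoDBAtlasVectorSearch class used below).

pip3 install pymongo openai pypdf tiktoken langchain-community langchain-text-splitters langchain-openai langchain-mongodb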

Start using Atlas Vector Search

In our exercise, we utilize a publicly accessible PDF document titled "MongoDB Atlas Best Practices" as a data source for constructing a text-searchable vector space. The implemented Python script employs several modules to process, vectorize, and index the document's content into a MongoDB Atlas collection.

In order to implement it, let's begin by setting up and exporting the environmental variables. We need the Atlas connection string and the OpenAI API key.

export OPENAI_API_KEY="xxxxxxxxxxx"
export ATLAS_CONNECTION_STRING="mongodb+srv://user:passwd@vectorsearch.abc.mongodb.net/?retryWrites=true"

Next, we can execute the code provided below. This script retrieves a PDF from a specified URL, splits the text into segments, and indexes them in MongoDB Atlas for vector search, using LangChain's embedding and vector search features. The full code is accessible on GitHub.

# Requires the langchain-community, langchain-text-splitters, langchain-openai,
# langchain-mongodb, pymongo, and pypdf packages.
import os
from pymongo import MongoClient
# Recent LangChain releases ship these classes in separate integration packages.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_mongodb import MongoDBAtlasVectorSearch

# Define the URL of the PDF MongoDB Atlas Best Practices document
pdf_url = "https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE4HkJP"

# Retrieve environment variables for sensitive information
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
if not OPENAI_API_KEY:
    raise ValueError("The OPENAI_API_KEY environment variable is not set.")

ATLAS_CONNECTION_STRING = os.getenv('ATLAS_CONNECTION_STRING')
if not ATLAS_CONNECTION_STRING:
    raise ValueError("The ATLAS_CONNECTION_STRING environment variable is not set.")

# Connect to MongoDB Atlas cluster using the connection string
cluster = MongoClient(ATLAS_CONNECTION_STRING)

# Define the MongoDB database and collection names
DB_NAME = "langchain"
COLLECTION_NAME = "vectorSearch"

# Connect to the specific collection in the database
MONGODB_COLLECTION = cluster[DB_NAME][COLLECTION_NAME]

# Initialize the PDF loader with the defined URL and load one document per page
loader = PyPDFLoader(pdf_url)
data = loader.load()

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)

# Split the document into manageable segments
docs = text_splitter.split_documents(data)

# Embed the document segments and store them in the Atlas collection
vector_search = MongoDBAtlasVectorSearch.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(),
    collection=MONGODB_COLLECTION,
    index_name="default"  # Must match the name of the vector search index
)
# At this point, 'docs' are split, embedded, and stored in MongoDB Atlas, enabling vector search.

Upon completion of the script, the PDF has been segmented and its vector representations are now stored within the langchain.vectorSearch namespace in MongoDB Atlas.
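To build some intuition for the chunk_size and chunk_overlap arguments passed to the splitter, here is a simplified sketch of overlapping, fixed-width chunking. It is not the algorithm RecursiveCharacterTextSplitter uses (that class prefers to break on separators such as paragraphs and sentences), and the chunk_text helper is hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=150):
    """Split text into fixed-width chunks that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

The overlap means that a sentence falling on a chunk boundary still appears with some surrounding context in at least one chunk, which tends to help retrieval quality.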

Run a similarity search query in Atlas Vector Search

"MongoDB Atlas auditing" serves as our search statement for the similarity search. Using the OpenAIEmbeddings class, we generate a vector embedding for this phrase. A similarity search is then executed to find and return the three most semantically related documents from our MongoDB Atlas collection.

In the first step, we need to create a MongoDBAtlasVectorSearch object:

def create_vector_search():
    """
    Creates a MongoDBAtlasVectorSearch object using the connection string, database, and collection names, along with the OpenAI embeddings and index configuration.

    :return: MongoDBAtlasVectorSearch object
    """
    vector_search = MongoDBAtlasVectorSearch.from_connection_string(
        ATLAS_CONNECTION_STRING,
        f"{DB_NAME}.{COLLECTION_NAME}",
        OpenAIEmbeddings(),
        index_name="default"
    )
    return vector_search

Subsequently, we can perform a similarity search.

def perform_similarity_search(query, top_k=3):
    """
    Performs a similarity search within a MongoDB Atlas collection. It relies on Atlas Vector Search, which uses the `$vectorSearch` stage under the hood, to find and return the top `k` documents that semantically match the provided query.

    :param query: The search query string.
    :param top_k: Number of top matches to return.
    :return: A list of the top `k` matching documents with their similarity scores.
    """

    # Get the MongoDBAtlasVectorSearch object
    vector_search = create_vector_search()

    # Execute the similarity search with the given query
    results = vector_search.similarity_search_with_score(
        query=query,
        k=top_k,
    )

    return results

# Example of calling the function directly
search_results = perform_similarity_search("MongoDB Atlas auditing")

The function returns the documents in the MongoDB Atlas collection that are most semantically similar to the search query, here "MongoDB Atlas auditing". Each entry in the returned list pairs a document's content with a similarity score that reflects how closely the document aligns with the intent of the query. The number of matches is controlled by top_k, which defaults to 3 but can be set to any number of results desired. Please find the code on GitHub.
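Each result is a (document, score) pair, so a small helper can turn the list into readable output. The sketch below uses a stand-in FakeDocument class and invented sample data so that it runs without a database; with real results, the documents are LangChain Document objects exposing the same page_content and metadata attributes. The helper name summarize_results is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document, used only to make this example runnable."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_results(results, preview_chars=60):
    """Format (document, score) pairs as one line per match."""
    lines = []
    for rank, (doc, score) in enumerate(results, start=1):
        # Collapse whitespace left over from PDF extraction, then truncate.
        preview = " ".join(doc.page_content.split())[:preview_chars]
        page = doc.metadata.get("page", "?")
        lines.append(f"{rank}. score={score:.3f} page={page} {preview}")
    return lines


# Invented sample data standing in for real search results.
sample = [
    (FakeDocument("Atlas supports   database auditing.", {"page": 12}), 0.9312),
    (FakeDocument("Backups are continuous."), 0.8021),
]
summary = summarize_results(sample)
for line in summary:
    print(line)
```

In the article's script, the same helper could be called as summarize_results(search_results).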

Summary

MongoDB Atlas Vector Search enhances AI applications by letting you store vector embeddings alongside the rest of your data in MongoDB documents. It simplifies the creation of search indexes and the execution of approximate nearest neighbor searches through the $vectorSearch aggregation stage, using the Hierarchical Navigable Small Worlds algorithm for efficient lookups. The integration with LangChain builds on this functionality, contributing to more streamlined and powerful semantic search capabilities. Harness the potential of MongoDB Atlas Vector Search and LangChain to meet your semantic search needs today!

In the next blog post, we will delve into LangChain Templates, a new feature set to enhance the capabilities of MongoDB Atlas Vector Search. Alongside this, we will examine the role of retrieval-augmented generation (RAG) in semantic search and AI development. Stay tuned for an in-depth exploration in our upcoming article!

Questions? Comments? We’d love to continue the conversation over in the Developer Community forum.
