Simplify Semantic Search With LangChain and MongoDB

Brian Leonard • 4 min read • Published Sep 23, 2024 • Updated Oct 28, 2024

AI Atlas Vector Search Python

Semantic Search Made Easy With LangChain and MongoDB

Enabling semantic search on user-specific data is a multi-step process that includes loading, transforming, embedding and storing data before it can be queried.

The team over at LangChain, whose goal is to provide a set of utilities that greatly simplify this process, pictures it as a sequence of stages: load, transform, embed, store and retrieve.

In this tutorial, we'll walk through each of these steps, using MongoDB Atlas as our Store. Specifically, we'll use the AT&T and Bank of America Wikipedia pages as our data source. We'll then use libraries from LangChain to Load, Transform, Embed and Store.

Once the source is stored in MongoDB, we can retrieve the data that interests us.

Prerequisites

MongoDB Atlas Subscription (Free Tier is fine)
OpenAI API key

Quick Start Steps

Get the code:

git clone https://github.com/wbleonard/atlas-langchain.git

Update params.py with your MongoDB connection string and OpenAI API key.
Create a new Python environment:

python3 -m venv env

Activate the new Python environment:

source env/bin/activate

Install the requirements:

pip3 install -r requirements.txt

Load, Transform, Embed and Store:

python3 vectorize.py

Retrieve:

python3 query.py -q "Who started AT&T?"
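
The Quick Start asks you to update params.py, and the listings later in this tutorial read five settings from it. As a rough sketch of what that file might contain, the setting names below are the ones the tutorial's code references, while every value (including the environment-variable fallback) is a placeholder assumption rather than the repository's actual content:

```python
# params.py (sketch): all values below are placeholders, not taken from the repository.
import os

# Atlas connection string; replace the dummy default or set the environment variable.
MONGODB_CONN_STRING = os.environ.get(
    "MONGODB_CONN_STRING", "mongodb+srv://<user>:<password>@<cluster-host>/"
)

# Hypothetical database and collection names for the vectorized chunks.
DB_NAME = os.environ.get("DB_NAME", "langchain_demo")
COLL_NAME = os.environ.get("COLL_NAME", "wiki_chunks")

# Index name used later in this tutorial when the search index is created.
INDEX_NAME = os.environ.get("INDEX_NAME", "langchain_vsearch_index")

# OpenAI API key; never commit a real key to source control.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "<your-openai-api-key>")
```

Reading the values from environment variables is only one option; hard-coding them in a file that stays out of version control works just as well for a local experiment.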

The Details

Load -> Transform -> Embed -> Store

Step 1: Load

There's no shortage of data sources: Slack, YouTube, Git, Excel, Reddit, Twitter, etc., and LangChain provides a growing list of integrations that covers these and many more.

For this exercise, we're going to use the WebBaseLoader to load the Wikipedia pages for AT&T and Bank of America.

from langchain_community.document_loaders import WebBaseLoader

# Step 1: Load
loaders = [
    WebBaseLoader("https://en.wikipedia.org/wiki/AT%26T"),
    WebBaseLoader("https://en.wikipedia.org/wiki/Bank_of_America")
]

docs = []

for loader in loaders:
    for doc in loader.lazy_load():
        docs.append(doc)

Step 2: Transform (Split)

Now that we have a bunch of text loaded, it needs to be split into smaller chunks so we can tease out the relevant portion based on our search query. For this example, we'll use the recommended RecursiveCharacterTextSplitter. As I have it configured, it attempts to split on paragraphs ("\n\n"), then lines ("\n"), then sentences ("(?<=\. )"), then words (" "), using a chunk size of 1,000 characters. So if a paragraph doesn't fit into 1,000 characters, the splitter falls back to the next separator and breaks at the last word that fits, keeping each chunk under 1,000 characters. You can tune chunk_size to your liking: smaller values produce more documents, and larger values produce fewer.

# Step 2: Transform (Split)
from langchain.text_splitter import RecursiveCharacterTextSplitter

# is_separator_regex=True makes the splitter treat r"(?<=\. )" as a regex, not literal text
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=0, length_function=len,
    separators=["\n\n", "\n", r"(?<=\. )", " "], is_separator_regex=True)
docs = text_splitter.split_documents(docs)
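
To make the fallback behavior concrete, here is a minimal pure-Python sketch of the idea: try the coarsest separator first and only descend to finer ones for pieces that are still too long. This is not LangChain's implementation. The function name recursive_split is hypothetical, the separators are plain strings rather than regular expressions, and unlike the real splitter this sketch drops the separator where it cuts.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into chunks of at most chunk_size characters (simplified sketch)."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left to try: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # This separator does not occur in the text, so try the next finer one.
        return recursive_split(text, finer, chunk_size)

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they still fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # A single piece is too long: split it with the finer separators.
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]


text = "First paragraph, short.\n\nSecond paragraph is longer. It has two sentences."
separators = ["\n\n", ". ", " "]

chunks = recursive_split(text, separators, chunk_size=30)
smaller_chunks = recursive_split(text, separators, chunk_size=15)

for chunk in chunks:
    print(repr(chunk))
print(len(chunks), "chunks at size 30;", len(smaller_chunks), "chunks at size 15")
```

The first paragraph fits as a whole, while the second is too long and gets divided at the sentence boundary. Lowering chunk_size forces the sketch down to word-level splits and yields more chunks, which mirrors the tuning advice above.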

Step 3: Embed

Embedding is where you use an LLM to create a vector representation of text. There are many options to choose from, such as OpenAI and Hugging Face, and LangChain provides a standard interface for interacting with all of them.

For this exercise we're going to use the popular OpenAI embedding. Before proceeding, you'll need an API key for the OpenAI platform, which you will set in params.py.

We're simply going to load the embedder in this step. The real power comes when we store the embeddings in Step 4.

# Step 3: Embed
from langchain_openai import OpenAIEmbeddings
import params  # local params.py holding the API key and connection settings

embeddings = OpenAIEmbeddings(openai_api_key=params.OPENAI_API_KEY)
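
If "vector representation" feels abstract, the following sketch may help. The idea is that texts with similar meaning end up as vectors pointing in similar directions, and a similarity function such as cosine similarity (the one named in the index definition later in this tutorial) scores how close two vectors are. The three-dimensional vectors below are made up for illustration; the embeddings used in this tutorial have 1536 dimensions and come from the OpenAI API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy "embeddings"; real ones are produced by the embedding model.
query_vec = [0.9, 0.1, 0.0]     # stands in for a question about a telecom company
telecom_vec = [0.8, 0.2, 0.1]   # stands in for a chunk about telecom
banking_vec = [0.1, 0.2, 0.9]   # stands in for a chunk about banking

telecom_score = cosine_similarity(query_vec, telecom_vec)
banking_score = cosine_similarity(query_vec, banking_vec)

print(f"telecom chunk: {telecom_score:.3f}")
print(f"banking chunk: {banking_score:.3f}")
```

The chunk whose vector points in nearly the same direction as the query scores close to 1, while the unrelated chunk scores much lower. A vector search ranks stored chunks by this kind of score.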

Step 4: Store

You'll need a vector database to store the embeddings, and lucky for you MongoDB fits that bill. Even luckier for you, the folks at LangChain have a MongoDB Atlas module that will do all the heavy lifting for you! Don't forget to add your MongoDB Atlas connection string to params.py.

# Step 4: Store
from pymongo import MongoClient
from langchain_mongodb.vectorstores import MongoDBAtlasVectorSearch

client = MongoClient(params.MONGODB_CONN_STRING)
collection = client[params.DB_NAME][params.COLL_NAME]

# Insert the documents in MongoDB Atlas with their embedding
docsearch = MongoDBAtlasVectorSearch.from_documents(
    docs, embeddings, collection=collection, index_name=params.INDEX_NAME
)

Step 5: Index the Vector Embeddings

The final step before we can query the data is to create a search index on the stored embeddings.

If you're on Atlas dedicated compute, LangChain can do this for you.

# Step 5: Create Vector Search Index
# THIS ONLY WORKS ON DEDICATED CLUSTERS (M10+)
docsearch.create_vector_search_index(dimensions=1536, update=True)

If you are on shared compute (M0, M2 or M5), create an Atlas Vector Search index named langchain_vsearch_index in the Atlas console with the following definition:

{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    }
  ]
}
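
The numDimensions value has to match the length of the vectors your embedding model produces (1536 in this tutorial), and path has to name the field that holds the embedding. If you switch embedding models, a small helper can regenerate the definition to paste into the Atlas console. The function vector_index_definition below is a hypothetical convenience for this tutorial, not part of LangChain or PyMongo:

```python
import json

# Similarity functions accepted by Atlas Vector Search index definitions.
ALLOWED_SIMILARITIES = {"cosine", "euclidean", "dotProduct"}


def vector_index_definition(path="embedding", num_dimensions=1536, similarity="cosine"):
    """Build an Atlas Vector Search index definition as a plain dict."""
    if similarity not in ALLOWED_SIMILARITIES:
        raise ValueError(f"unsupported similarity: {similarity!r}")
    if num_dimensions <= 0:
        raise ValueError("num_dimensions must be a positive integer")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


definition = vector_index_definition()
print(json.dumps(definition, indent=2))
```

Called with its defaults, the helper prints the same definition shown above.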

You'll find the complete script in vectorize.py, which needs to be run only once or when new data sources are added.

python3 vectorize.py

Retrieve

We could now run a search using methods like similarity_search or max_marginal_relevance_search, and that would return the relevant slice of data, which in our case would be an entire paragraph. However, we can continue to harness the power of the LLM to contextually compress the response so that it more directly tries to answer our question.

import argparse

from pymongo import MongoClient
from langchain_mongodb.vectorstores import MongoDBAtlasVectorSearch
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

import params

# Read the question from the -q flag shown in the Quick Start
# (assumed parsing: the original listing does not show where query is set)
parser = argparse.ArgumentParser()
parser.add_argument("-q", dest="query", required=True)
query = parser.parse_args().query

# Initialize MongoDB python client
client = MongoClient(params.MONGODB_CONN_STRING)
collection = client[params.DB_NAME][params.COLL_NAME]

# Initialize vector store
vectorStore = MongoDBAtlasVectorSearch(
    collection, OpenAIEmbeddings(openai_api_key=params.OPENAI_API_KEY), index_name=params.INDEX_NAME
)

# Perform a search between the embedding of the query and the embeddings of the documents
print("\nQuery Response:")
print("---------------")
docs = vectorStore.max_marginal_relevance_search(query, k=1)
# docs = vectorStore.similarity_search(query, k=1)

print(docs[0].metadata['title'])
print(docs[0].page_content)

# Contextual Compression
llm = OpenAI(openai_api_key=params.OPENAI_API_KEY, temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorStore.as_retriever()
)

# Assumed completion: the original listing stops after building the retriever.
# Invoking it returns documents whose content has been compressed by the LLM.
compressed_docs = compression_retriever.invoke(query)
print(compressed_docs[0].metadata['title'])
print(compressed_docs[0].page_content)
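
The listing above calls max_marginal_relevance_search rather than a plain similarity search. Maximal marginal relevance (MMR) picks results that are relevant to the query but not redundant with results already chosen. The sketch below shows the greedy selection idea on made-up two-dimensional vectors. It is a simplified illustration, not LangChain's code; the function name mmr_select is hypothetical, and lambda_mult here plays the same role as the parameter of that name in LangChain (1.0 means pure relevance, lower values favor diversity).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k documents chosen greedily by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))

    def score(i):
        relevance = cosine(query_vec, doc_vecs[i])
        # Redundancy is the highest similarity to anything already selected.
        redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Made-up vectors: documents 0 and 1 are near-duplicates, document 2 is different.
query_vec = [1.0, 0.0]
doc_vecs = [[0.9, 0.1], [0.89, 0.11], [0.6, -0.8]]

diverse = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0)

print("MMR picks:", diverse)
print("Relevance-only picks:", relevance_only)
```

With lambda_mult at 0.5, the second pick skips the near-duplicate in favor of the different document; with 1.0, the ranking collapses to plain similarity and both near-duplicates are returned.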

python3 query.py -q "Who started AT&T?"

Your question:
-------------
Who started AT&T?

AI Response:
-----------
AT&T - Wikipedia
"AT&T was founded as Bell Telephone Company by Alexander Graham Bell, Thomas Watson and Gardiner Greene Hubbard after Bell's patenting of the telephone in 1875."[25] "On December 30, 1899, AT&T acquired the assets of its parent American Bell Telephone, becoming the new parent company."[28]
