Caching LLM Responses With MongoDB Atlas and Vector Search
Large language models (LLMs) have become the go-to solution for many business domains in 2024. With an estimated 750 million applications expected to integrate LLMs by 2025, and with training LLMs consuming significant monetary resources, LLM platforms such as OpenAI's GPT REST API reflect those costs in their pricing. The question, then, is how to reduce the operational cost of AI applications in production. The obvious answer is to call the API less often, but this raises another problem: maintaining the quality of the responses returned to users.
Caching has been a fundamental solution in software engineering for many years. The application creates a cache by extracting a key from the request and storing the corresponding result. The next time the same key is requested, the server can respond immediately without repeating the computation.
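To make this concrete, here is a minimal, illustrative sketch of such an exact-match cache; the call_backend function is a hypothetical stand-in for any expensive call, such as an LLM request.

# Minimal sketch of a traditional exact-match cache (illustrative only).
# call_backend is a hypothetical placeholder for an expensive operation,
# such as a call to an LLM API.
cache = {}

def call_backend(query: str) -> str:
    return "expensive result for: " + query

def answer(query: str) -> str:
    if query in cache:            # the exact request string is the key
        return cache[query]
    result = call_backend(query)
    cache[query] = result         # store the result under the fixed key
    return result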
However, an LLM query is not a fixed key; it is free-form text whose meaning matters more than its exact wording. Consequently, a traditional cache that stores fixed keys is not efficient enough to handle LLM queries.
A semantic cache, unlike a traditional cache, stores a meaning-based representation of the data. We call the process of producing this representation embedding: in LLM systems, a model converts text into numerical vectors that capture its semantic meaning.
We will store the embeddings in the cache system. When a new request comes in, the system extracts its semantic representation by creating an embedding. It then searches for similarities between this new embedding and the stored embeddings in the cache system. If a high similarity match is found, the corresponding cached response will be returned. This process allows for semantic-based retrieval of previously computed responses, potentially reducing the need for repeated API calls to the LLM service.
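As a rough, self-contained illustration of the idea (this helper is not part of the project code; later in the tutorial, MongoDB Atlas Vector Search performs this comparison for us), two embeddings can be compared with cosine similarity:

import math

# Illustrative sketch: cosine similarity between two embedding vectors
# of the same dimension. A value close to 1 means very similar meaning.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# If the similarity between a new query's embedding and a cached embedding
# exceeds a threshold (e.g., 0.70), the cached response can be returned.

With the concept in place, let's build the system. The following dependencies are required: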
• Python (3.12.3 or newer)
• FastAPI (0.11 or newer)
• PyMongo (4.7.2 or newer)
• uvicorn (0.29.0 or newer)
We need to install the dependencies mentioned above. You can use pip as the package manager. The necessary dependencies are listed in requirements.txt. After cloning the project and entering the project's directory, run the command below to install them.
pip install fastapi pymongo openai uvicorn
(In case you are creating an isolated project, you can enable a Python virtualenv for this specific environment.)
To simulate the caching server, requests will arrive over HTTP, so we set up a web server in Python with FastAPI.
1.2.1) Create app.py in the root directory.
1.2.2) Import FastAPI and initiate the / and /ask routes.
app.py
from fastapi import FastAPI

server = FastAPI()

# root route
@server.get("/")
async def home():
    return {"message": "This is home server"}

# search route
@server.get("/ask")
async def search(query: str):
    return {"message": "The query is: {}".format(query)}
Next, run the application to test our routes. (--reload enables hot reload whenever the application code is edited.)
uvicorn app:server --reload
Your server should now be running at http://127.0.0.1:8000. We can test our search route using the command below.
curl -X GET "http://127.0.0.1:8000/ask?query=hello+this+is+search+query"
The server should respond as shown below:
1 { "message": "The query is: hello this is search query" }
We have now set up a basic FastAPI server and the /ask route. Next, we will integrate the LLM functionality.
1.3.1) Create llm.py in the same directory as app.py.
1.3.2) Set up OpenAI as an LLM service.
llm.py
from openai import OpenAI

open_api_key = "..."  # OpenAI API key
openai_client = OpenAI(api_key=open_api_key)
language_model = "gpt-3.5-turbo"

# getTextResponse receives text and asks the LLM for an answer
def getTextResponse(text):
    chat_completion = openai_client.chat.completions.create(
        messages=[{"role": "user", "content": text}], model=language_model
    )
    return chat_completion.choices[0].message.content
We have to modify app.py with a few lines of code.
app.py
# ... other import dependencies
from llm import getTextResponse

# ... other routes

# search route
@server.get("/ask")
async def search(query):
    llm_response = getTextResponse(query)
    return {"message": "Your AI response is: {}".format(llm_response)}
Then, we can invoke the /ask route with a new query.
curl -X GET "http://127.0.0.1:8000/ask?query=what+is+llm?"

Response:
{
  "message": "Your AI response is: LLM stands for Master of Laws, which is an advanced law degree typically pursued by individuals who have already received a law degree (such as a JD) and want to further specialize or advance their knowledge in a specific area of law. LLM programs are typically one year in length and often focus on areas such as international law, human rights, or commercial law."
}
Now, we can receive a response from the OpenAI LLM. However, the system still relies on the OpenAI service for every request. Our goal is to shift that load from the AI service to the cache system.
To cache the LLM response, we must transform our text (or any type of data) into vector data.
A vector can be thought of as an N-dimensional array (where N depends on the embedding model) in which each number represents part of the meaning of the original data.
Example:
1 text = "Large Language Model" 2 # embedding_function(text: str) -> vector<number>([...N]) 3 vector = embedding_function(text) 4 5 # vector = [12, 23, 0.11, 22, 85, 43, ..., 90]
We can embed our data using a language model. In our case, we utilize OpenAI's text-embedding model. Therefore, we modify llm.py with a few lines.
llm.py
# ... other code
text_embedding_model = "text-embedding-3-small"

# getEmbedding receives text and returns the embedding (vector) of its original data
def getEmbedding(text):
    embedding = openai_client.embeddings.create(input=text, model=text_embedding_model)
    return embedding.data[0].embedding
Next, we modify app.py to use the new functionality from llm.py.
app.py
# ... other import dependencies
from llm import getTextResponse, getEmbedding

# ... other routes

# search route
@server.get("/ask")
async def search(query):
    llm_response = getTextResponse(query)
    query_vector = getEmbedding(query)
    print("embedding : ", query_vector)
    return {"message": "Your AI response is: {}".format(llm_response)}
If we run the curl command to invoke /ask again, the server shell should print output similar to the data below.

embedding :  [-0.02254991, 0.031336114, 0.019013261, 0.00017081834, -0.0202526, -0.0020466715, 0.0111036645, -0.0111036645, 0.036172554, 0.04038429, -0.027043771, 0.0046273666, -0.039820038, -0.011456322, 0.0048339227, 0.021824444, 0.0048666694, -0.017501874, 0.03915503, 0.03895351, 0.041311275, ..., 0.046349235]
We now have vector data that captures the semantics of the query. Let's see how to store it for our cache system.
MongoDB Atlas Vector Search is a key feature that lets us run AI-powered semantic search over vector data. To use it, we must first store documents in the MongoDB database.
First, register for a MongoDB Atlas account. Existing users can sign in to MongoDB Atlas. Follow the instructions and select the Atlas UI as the procedure to deploy your first cluster.
4.1) Connect MongoDB with Python.
4.1.1) Create db.py in the same directory as app.py.
4.1.2) Implement document saving in MongoDB.
db.py
import pymongo

MONGO_URI = ""  # MongoDB connection string
mongo_client = pymongo.MongoClient(MONGO_URI)

db = mongo_client.get_database("logging")  # database name in MongoDB
collection = db.get_collection("test")  # collection name

# document {
#   query: string,
#   response: string,
#   embeddings: vector<number<1536>>
# }

def save_cache(document):
    collection.insert_one(document)
4.1.3) Save response from AI to database.
Modify app.py to save the AI response and its vector information in the database.
app.py
# ... other import dependencies
from db import save_cache

# ... other code

# search route
@server.get("/ask")
async def search(query):
    llm_response = getTextResponse(query)
    query_vector = getEmbedding(query)

    document = {
        "response": llm_response,
        "embeddings": query_vector,
        "query": query
    }
    save_cache(document)

    return {"message": "Your AI response is: {}".format(llm_response)}
4.2) Create a vector search index in MongoDB Atlas
MongoDB's Vector Search enables AI-powered experiences that perform semantic search of unstructured data through embeddings created with machine learning models. We have to enable a vector search index on the database. In Atlas, go to your database -> Atlas Search -> CREATE SEARCH INDEX.
Below is the JSON editor version of the Atlas index definition.
1 { 2 "type": "vectorSearch", # vector search identity 3 "fields": [ 4 { 5 # type of data in the asking the system again with a new question (but similar meaning): Howfield (vector information) 6 "type": "vector", 7 # field of document that store vector information 8 "path": "embeddings", 9 # dimension of vector, receiving from language model specification. 10 # for `text-embedding-3-small` of OpenAI, it's 1408 dimensions of vector. 11 "numDimensions": 1408, 12 # vector search find similarity of searching document with other. So, closest documents would get high score. 13 "similarity": "cosine" 14 } 15 ] 16 }
Logically, when we receive a new request from the client, we’ll embed the search query and perform a vector search to find the documents that contain embeddings that are semantically similar to the query embedding.
Vector search is one of the stages of aggregation pipelines. The pipeline is constructed as shown below.
pipeline = [
    # Step 1: Perform the vector search
    {
        # $vectorSearch operation
        "$vectorSearch": {
            # Vector Search index in Atlas
            "index": "vector_index",
            # field of the document that stores the vector information
            "path": "embeddings",
            # embedding of the client's query
            "queryVector": numerical_embedding,
            # number of nearest neighbors to consider; this value is used for 'searchScore' ranking
            "numCandidates": 20,
            # number of documents to return, ranked by semantic meaning / searchScore
            "limit": 5,
        },
    },
    # add a 'score' field to show the vector search score
    {
        "$addFields": {
            "score": {
                # vectorSearchScore is available via $meta in stages after the $vectorSearch operation
                "$meta": "vectorSearchScore"
            }
        }
    },
    # return only documents whose score is greater than a specific ratio;
    # in this case, only documents with a score > 70% are returned
    {
        "$match": {
            "score": {"$gte": 0.70}
        }
    },
    # remove the `embeddings` field to reduce the amount of data retrieved
    {
        "$unset": ["embeddings"]
    }
]
Modify db.py and app.py to implement the PyMongo aggregation pipeline for vector search.
db.py
# ... other code
def perform_search_cache(query):
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "path": "embeddings",
                "queryVector": query,
                "numCandidates": 20,
                "limit": 1,
            },
        },
        {"$addFields": {"score": {"$meta": "vectorSearchScore"}}},
        {"$match": {"score": {"$gte": 0.70}}},
        {"$unset": ["embeddings"]},
    ]
    result = collection.aggregate(pipeline)
    return result
app.py
# ... other import dependencies
from db import save_cache, perform_search_cache

# ... other code

# search route
@server.get("/ask")
async def search(query):
    query_vector = getEmbedding(query)
    cache_response = list(perform_search_cache(query_vector))
    response = ""
    if len(cache_response) < 1:
        # no cache hit:
        # ask the LLM and save the result to the cache
        llm_response = getTextResponse(query)
        document = {"response": llm_response, "embeddings": query_vector, "query": query}
        save_cache(document)
        response = llm_response
    else:
        # cache hit: return the cached response
        response = cache_response[0]["response"]
        print("cache hit")
        print(cache_response)
    return {"message": "Your AI response is: {}".format(response)}
We can try to send the request to our system. Let’s ask the system, “How are things with you?”
curl -X GET "http://127.0.0.1:8000/ask?query=how+are+things+with+you?"
Response: The first time, the system retrieves the answer from the AI service.
1 { 2 "message": "Your AI response is: I'm just a computer program, so I don't have feelings or emotions. But I'm here to help you with anything you need! How can I assist you today?" 3 }
Let’s try to ask the system a new question (but with a similar meaning): How are you today?
curl -X GET "http://127.0.0.1:8000/ask?query=how+are+you+today?"
Now, the system will return cache data from MongoDB Atlas.
1 { 2 "message": "Your AI response is: I'm just a computer program, so I don't have feelings or emotions. But I'm here to help you with anything you need! How can I assist you today?" 3 }
If you look at the shell/terminal, you will see a log like the one below.
cache hit
[
  {
    '_id': ObjectId('6671440cb2bf0b0eb12b75b3'),
    'response': "I'm just a computer program, so I don't have feelings or emotions. But I'm here to help you with anything you need! How can I assist you today?",
    'query': 'how are thing with you?',
    'score': 0.8066450357437134
  }
]
It seems that the query “How are you today?” is about 80% similar (a score of roughly 0.81) to “How are things with you?” That is what we expect.
This article outlines the implementation of a semantic caching system for LLM responses using MongoDB Atlas and Vector Search. The solution covered in this article aims to reduce costs and latency associated with frequent LLM API calls by caching responses based on query semantics rather than exact matches.
The solution integrates FastAPI, OpenAI, and MongoDB Atlas to create a workflow where incoming queries are embedded into vectors and compared against cached entries. Matching queries retrieve stored responses, while new queries are processed by the LLM and then cached.
Key benefits include reduced LLM service load, lower costs, faster response times for similar queries, and scalability. The system demonstrates how combining vector search capabilities with LLMs can optimize natural language processing applications, offering a balance between efficiency and response quality.