Caching LLM Responses With MongoDB Atlas and Vector Search
Large language models (LLMs) have become the go-to solution for many business domains in 2024. With an estimated 750 million applications expected to integrate LLMs by 2025, and with training LLMs consuming significant monetary resources, LLM platforms such as OpenAI's GPT REST API reflect those costs in their pricing. The question, then, is how to reduce the operational cost of AI applications in production. The obvious answer is to call the API less often, but this raises another problem: maintaining the quality of the responses returned to users.
Caching has been a fundamental solution in software engineering for many years. The application creates a cache by extracting a key from the request and storing the corresponding result. The next time the same key is requested, the server can respond immediately without repeating the computation.
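To make this concrete, here is a minimal, illustrative sketch of such an exact-match cache; the call_backend function is a hypothetical stand-in for any expensive call, such as an LLM request.

# Minimal sketch of a traditional exact-match cache (illustrative only).
# call_backend is a hypothetical placeholder for an expensive operation,
# such as a call to an LLM API.
cache = {}

def call_backend(query: str) -> str:
    return "expensive result for: " + query

def answer(query: str) -> str:
    if query in cache:            # the exact request string is the key
        return cache[query]
    result = call_backend(query)
    cache[query] = result         # store the result under the fixed key
    return result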
However, an LLM query is not a fixed key; it is free-form text whose meaning matters more than its exact wording. Consequently, a traditional cache that stores fixed keys is not efficient enough to handle LLM queries.
A semantic cache, unlike a traditional cache, stores a meaning-based representation of the data. We call the process of producing this representation embedding: in LLM systems, a model converts text into numerical vectors that capture its semantic meaning.
We will store the embeddings in the cache system. When a new request comes in, the system extracts its semantic representation by creating an embedding. It then searches for similarities between this new embedding and the stored embeddings in the cache system. If a high similarity match is found, the corresponding cached response will be returned. This process allows for semantic-based retrieval of previously computed responses, potentially reducing the need for repeated API calls to the LLM service.
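As a rough, self-contained illustration of the idea (this helper is not part of the project code; later in the tutorial, MongoDB Atlas Vector Search performs this comparison for us), two embeddings can be compared with cosine similarity:

import math

# Illustrative sketch: cosine similarity between two embedding vectors
# of the same dimension. A value close to 1 means very similar meaning.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# If the similarity between a new query's embedding and a cached embedding
# exceeds a threshold (e.g., 0.70), the cached response can be returned.

With the concept in place, let's build the system. The following dependencies are required: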
• Python (3.12.3 or newer)
• FastAPI (0.11 or newer)
• PyMongo (4.7.2 or newer)
• uvicorn (0.29.0 or newer)
We need to install the dependencies mentioned above. You can use pip as the package manager. The necessary dependencies are listed in requirements.txt. After cloning the project and entering the project's directory, run the command below to install them.
pip install fastapi pymongo openai uvicorn
(In case you are creating an isolated project, you can enable a Python virtualenv for this specific environment.)
To simulate the caching server, requests will arrive over HTTP, so we set up a web server in Python with FastAPI.
1.2.1) Create app.py in the root directory.
1.2.2) Import FastAPI and initiate the / and /ask routes.
app.py
from fastapi import FastAPI

server = FastAPI()

# root route
@server.get("/")
async def home():
    return {"message": "This is home server"}

# search route
@server.get("/ask")
async def search(query: str):
    return {"message": "The query is: {}".format(query)}
Next, run the application to test our routes. (--reload enables hot reload whenever the application code is edited.)
uvicorn app:server --reload
Your server should now be running at http://127.0.0.1:8000. We can test our search route using the command below.
curl -X GET "http://127.0.0.1:8000/ask?query=hello+this+is+search+query"
The server should respond as shown below:
1 { "message": "The query is: hello this is search query" }
We have now set up a basic FastAPI server and the /ask route. Next, we will integrate the LLM functionality.
1.3.1) Create llm.py in the same directory as app.py.
1.3.2) Set up OpenAI as an LLM service.
llm.py
from openai import OpenAI

open_api_key = "..."  # OpenAI API key
openai_client = OpenAI(api_key=open_api_key)
language_model = "gpt-3.5-turbo"

# getTextResponse receives text and asks the LLM for an answer
def getTextResponse(text):
    chat_completion = openai_client.chat.completions.create(
        messages=[{"role": "user", "content": text}], model=language_model
    )
    return chat_completion.choices[0].message.content
We have to modify app.py with a few lines of code.
app.py
# ... other import dependencies
from llm import getTextResponse

# ... other routes

# search route
@server.get("/ask")
async def search(query):
    llm_response = getTextResponse(query)
    return {"message": "Your AI response is: {}".format(llm_response)}
Then, we can invoke the /ask route with a new query.
curl -X GET "http://127.0.0.1:8000/ask?query=what+is+llm?"

Response:
{
  "message": "Your AI response is: LLM stands for Master of Laws, which is an advanced law degree typically pursued by individuals who have already received a law degree (such as a JD) and want to further specialize or advance their knowledge in a specific area of law. LLM programs are typically one year in length and often focus on areas such as international law, human rights, or commercial law."
}
Now, we can receive a response from the OpenAI LLM. However, the system still relies on the OpenAI service for every request. Our goal is to shift that load from the AI service to the cache system.
To cache the LLM response, we must transform our text (or any type of data) into vector data.
A vector can be thought of as an N-dimensional array (where N depends on the embedding model) in which each number represents part of the meaning of the original data.
Example:
1 text = "Large Language Model" 2 # embedding_function(text: str) -> vector<number>([...N]) 3 vector = embedding_function(text) 4 5 # vector = [12, 23, 0.11, 22, 85, 43, ..., 90]
We can embed our data using a language model. In our case, we utilize OpenAI's text-embedding model. Therefore, we modify llm.py with a few lines.
llm.py
# ... other code
text_embedding_model = "text-embedding-3-small"

# getEmbedding receives text and returns the embedding (vector) of its original data
def getEmbedding(text):
    embedding = openai_client.embeddings.create(input=text, model=text_embedding_model)
    return embedding.data[0].embedding
Next, we modify app.py to use the new functionality from llm.py.
app.py
# ... other import dependencies
from llm import getTextResponse, getEmbedding

# ... other routes

# search route
@server.get("/ask")
async def search(query):
    llm_response = getTextResponse(query)
    query_vector = getEmbedding(query)
    print("embedding : ", query_vector)
    return {"message": "Your AI response is: {}".format(llm_response)}
If we run the curl command to invoke /ask again, the server shell should print output similar to the data below.

embedding :  [-0.02254991, 0.031336114, 0.019013261, 0.00017081834, -0.0202526, -0.0020466715, 0.0111036645, -0.0111036645, 0.036172554, 0.04038429, -0.027043771, 0.0046273666, -0.039820038, -0.011456322, 0.0048339227, 0.021824444, 0.0048666694, -0.017501874, 0.03915503, 0.03895351, 0.041311275, ..., 0.046349235]
We now have vector data that captures the semantics of the query. Let's see how to store it for our cache system.
MongoDB Atlas Vector Search is a key feature that lets us run AI-powered semantic search over vector data. To use it, we must first store documents in the MongoDB database.
First, register for a MongoDB Atlas account. Existing users can sign in to MongoDB Atlas. Follow the instructions and select the Atlas UI as the procedure to deploy your first cluster.
4.1) Connect MongoDB with Python.
4.1.1) Create db.py in the same directory as app.py.
4.1.2) Implement document saving in MongoDB.
db.py
import pymongo

MONGO_URI = ""  # MongoDB connection string
mongo_client = pymongo.MongoClient(MONGO_URI)

db = mongo_client.get_database("logging")  # database name in MongoDB
collection = db.get_collection("test")  # collection name

# document {
#   query: string,
#   response: string,
#   embeddings: vector<number<1536>>
# }

def save_cache(document):
    collection.insert_one(document)
4.1.3) Save response from AI to database.
Modify app.py to save the AI response and its vector information in the database.
app.py
# ... other import dependencies
from db import save_cache

# ... other code

# search route
@server.get("/ask")
async def search(query):
    llm_response = getTextResponse(query)
    query_vector = getEmbedding(query)

    document = {
        "response": llm_response,
        "embeddings": query_vector,
        "query": query
    }
    save_cache(document)

    return {"message": "Your AI response is: {}".format(llm_response)}
4.2) Create a vector search index in MongoDB Atlas
MongoDB's Vector Search enables AI-powered experiences that perform semantic search of unstructured data through embeddings created with machine learning models. We have to enable a vector search index on the database. In Atlas, go to your database -> Atlas Search -> CREATE SEARCH INDEX.
Below is the JSON editor version of the Atlas index definition.
1 { 2 "type": "vectorSearch", # vector search identity 3 "fields": [ 4 { 5 # type of data in the asking the system again with a new question (but similar meaning): Howfield (vector information) 6 "type": "vector", 7 # field of document that store vector information 8 "path": "embeddings", 9 # dimension of vector, receiving from language model specification. 10 # for `text-embedding-3-small` of OpenAI, it's 1408 dimensions of vector. 11 "numDimensions": 1408, 12 # vector search find similarity of searching document with other. So, closest documents would get high score. 13 "similarity": "cosine" 14 } 15 ] 16 }
Logically, when we receive a new request from the client, we’ll embed the search query and perform a vector search to find the documents that contain embeddings that are semantically similar to the query embedding.
Vector search is one of the stages of aggregation pipelines. The pipeline is constructed as shown below.
pipeline = [
    # Step 1: Perform the vector search
    {
        # $vectorSearch operation
        "$vectorSearch": {
            # Vector Search index in Atlas
            "index": "vector_index",
            # field of the document that stores the vector information
            "path": "embeddings",
            # embedding of the client's query
            "queryVector": numerical_embedding,
            # number of nearest neighbors to consider; this value is used for 'searchScore' ranking
            "numCandidates": 20,
            # number of documents to return, ranked by semantic meaning / searchScore
            "limit": 5,
        },
    },
    # add a 'score' field to show the vector search score
    {
        "$addFields": {
            "score": {
                # vectorSearchScore is available via $meta in stages after the $vectorSearch operation
                "$meta": "vectorSearchScore"
            }
        }
    },
    # return only documents whose score is greater than a specific ratio;
    # in this case, only documents with a score > 70% are returned
    {
        "$match": {
            "score": {"$gte": 0.70}
        }
    },
    # remove the `embeddings` field to reduce the amount of data retrieved
    {
        "$unset": ["embeddings"]
    }
]
Modify db.py and app.py to implement the PyMongo aggregation pipeline for vector search.
db.py
# ... other code
def perform_search_cache(query):
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "path": "embeddings",
                "queryVector": query,
                "numCandidates": 20,
                "limit": 1,
            },
        },
        {"$addFields": {"score": {"$meta": "vectorSearchScore"}}},
        {"$match": {"score": {"$gte": 0.70}}},
        {"$unset": ["embeddings"]},
    ]
    result = collection.aggregate(pipeline)
    return result
app.py
# ... other import dependencies
from db import save_cache, perform_search_cache

# ... other code

# search route
@server.get("/ask")
async def search(query):
    query_vector = getEmbedding(query)
    cache_response = list(perform_search_cache(query_vector))
    response = ""
    if len(cache_response) < 1:
        # no cache hit:
        # ask the LLM and save the result to the cache
        llm_response = getTextResponse(query)
        document = {"response": llm_response, "embeddings": query_vector, "query": query}
        save_cache(document)
        response = llm_response
    else:
        # cache hit: return the cached response
        response = cache_response[0]["response"]
        print("cache hit")
        print(cache_response)
    return {"message": "Your AI response is: {}".format(response)}
We can try to send the request to our system. Let’s ask the system, “How are things with you?”
curl -X GET "http://127.0.0.1:8000/ask?query=how+are+things+with+you?"
Response: The first time, the system retrieves the answer from the AI service.
1 { 2 "message": "Your AI response is: I'm just a computer program, so I don't have feelings or emotions. But I'm here to help you with anything you need! How can I assist you today?" 3 }
Let’s try to ask the system a new question (but with a similar meaning): How are you today?
curl -X GET "http://127.0.0.1:8000/ask?query=how+are+you+today?"
Now, the system will return cache data from MongoDB Atlas.
1 { 2 "message": "Your AI response is: I'm just a computer program, so I don't have feelings or emotions. But I'm here to help you with anything you need! How can I assist you today?" 3 }
If you look at the shell/terminal, you will see a log like the one below.
cache hit
[
  {
    '_id': ObjectId('6671440cb2bf0b0eb12b75b3'),
    'response': "I'm just a computer program, so I don't have feelings or emotions. But I'm here to help you with anything you need! How can I assist you today?",
    'query': 'how are thing with you?',
    'score': 0.8066450357437134
  }
]
It seems that the query “How are you today?” is about 80% similar (a score of roughly 0.81) to “How are things with you?” That is what we expect.
This article outlines the implementation of a semantic caching system for LLM responses using MongoDB Atlas and Vector Search. The solution covered in this article aims to reduce costs and latency associated with frequent LLM API calls by caching responses based on query semantics rather than exact matches.
The solution integrates FastAPI, OpenAI, and MongoDB Atlas to create a workflow where incoming queries are embedded into vectors and compared against cached entries. Matching queries retrieve stored responses, while new queries are processed by the LLM and then cached.
Key benefits include reduced LLM service load, lower costs, faster response times for similar queries, and scalability. The system demonstrates how combining vector search capabilities with LLMs can optimize natural language processing applications, offering a balance between efficiency and response quality.