
Caching LLM Responses With MongoDB Atlas and Vector Search

Kanin Kearpimy • 8 min read • Published Sep 02, 2024 • Updated Sep 02, 2024
AI • Atlas • Python
Large language models (LLMs) have become the go-to solution across many business domains in 2024. By some estimates, 750 million applications will be integrated with LLMs by 2025, yet training and serving these models consumes significant resources, and the pricing of LLM platforms such as OpenAI's GPT REST API reflects that cost. The challenge is reducing the operational cost of AI applications in production. The obvious answer is to call the API less often, which raises a second problem: maintaining the quality of the responses users receive.
Caching has been a fundamental technique in software engineering for many years. The application extracts a key from the request and stores the corresponding result. The next time the same key is requested, the server can respond immediately without recomputing anything or calling a downstream service.
However, an LLM query is not a fixed key. It is free-form text whose meaning matters more than its exact wording, so a traditional cache that stores fixed keys is not well suited to handling LLM queries.
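To see why, here is a minimal sketch (my own illustration, not code from this tutorial) of an exact-match cache built on a Python dictionary. The call_llm function is a hypothetical stand-in for a real API call. Two questions with the same meaning but different wording produce different keys, so the second one misses the cache and triggers another LLM call.

exact_cache = {}  # hypothetical exact-match cache: keys are the raw query strings

def call_llm(query: str) -> str:
    # Stand-in for an expensive LLM API call.
    return "answer for: {}".format(query)

def ask(query: str) -> str:
    if query in exact_cache:       # hit only on identical text
        return exact_cache[query]
    response = call_llm(query)     # any paraphrase misses and pays for a new call
    exact_cache[query] = response
    return response

ask("How are things with you?")  # miss -> LLM call
ask("How are you today?")        # same meaning, different key -> another LLM call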

Semantic cache

Unlike a traditional cache, a semantic cache keys entries on the meaning of the data rather than its exact form. The process of producing such a meaning-based representation is called embedding: in LLM systems, a model converts text into a numerical vector that captures its semantic meaning.
We store these embeddings in the cache system. When a new request comes in, the system creates an embedding for it and searches for similar embeddings already stored in the cache. If a sufficiently similar match is found, the corresponding cached response is returned. This allows semantic retrieval of previously computed responses and reduces the need for repeated calls to the LLM service.
High level system and logic.
All of the code is available on GitHub.
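Before wiring this up with real services, here is a rough sketch of the lookup logic described above (my own illustration, not code from this tutorial): compare a new query's embedding against stored embeddings with cosine similarity and return the stored response when the score clears a threshold. The 0.70 threshold mirrors the one used in the aggregation pipeline later in this tutorial.

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Each cache entry keeps the query's embedding and the cached response.
semantic_cache = []  # list of {"embedding": [...], "response": "..."}

def lookup(query_embedding, threshold=0.70):
    best, best_score = None, 0.0
    for entry in semantic_cache:
        score = cosine_similarity(query_embedding, entry["embedding"])
        if score > best_score:
            best, best_score = entry, score
    if best is not None and best_score >= threshold:
        return best["response"]  # cache hit: a semantically similar query was seen before
    return None                  # cache miss: fall through to the LLM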

Prerequisites

• Python (3.12.3 or newer)
• FastAPI (0.11 or newer)
• PyMongo (4.7.2 or newer)
• uvicorn (0.29.0 or newer)

1) Set up dependencies in Python

We need to install the dependencies mentioned above, using pip as the package manager. The necessary dependencies are listed in requirements.txt. After cloning the project and entering the project’s directory, run the command below to install them.
pip install fastapi pymongo openai uvicorn
If you want an isolated project, you can create a Python virtual environment for this specific setup.

2) Create FastAPI server

To simulate the caching server, requests will arrive over HTTP, so we set up a web server in Python with FastAPI.
2.1) Create app.py in the root directory.
2.2) Import FastAPI and initiate the / and /ask routes.
app.py
from fastapi import FastAPI

server = FastAPI()

# root route
@server.get("/")
async def home():
    return {"message": "This is home server"}

# search route
@server.get("/ask")
async def search(query: str):
    return {"message": "The query is: {}".format(query)}
Next, run the application to test our routes (--reload enables hot reload when the application code is edited).
uvicorn app:server --reload
Your server should now be running at http://127.0.0.1:8000. We can test our search route using the command below.
curl -X GET "http://127.0.0.1:8000/ask?query=hello+this+is+search+query"
The server should respond as below:
{ "message": "The query is: hello this is search query" }

3) Connect OpenAI

We previously set up a basic FastAPI server and the /ask route. Next, we will integrate the LLM functionality.
3.1) Create llm.py in the same directory as app.py.
3.2) Set up OpenAI as the LLM service.
llm.py
from openai import OpenAI

open_api_key = "..."  # OpenAI API key
openai_client = OpenAI(api_key=open_api_key)
language_model = "gpt-3.5-turbo"

# getTextResponse receives text and asks the LLM for an answer
def getTextResponse(text):
    chat_completion = openai_client.chat.completions.create(
        messages=[{"role": "user", "content": text}], model=language_model
    )
    return chat_completion.choices[0].message.content
We have to modify app.py with a few lines of code.
app.py
# ... other import dependencies
from llm import getTextResponse

# ... other routes

# search route
@server.get("/ask")
async def search(query):
    llm_response = getTextResponse(query)
    return {"message": "Your AI response is: {}".format(llm_response)}
Then, we can invoke the ask route with a new query.
curl -X GET "http://127.0.0.1:8000/ask?query=what+is+llm?"
Response:
{
    "message": "Your AI response is: LLM stands for Master of Laws, which is an advanced law degree typically pursued by individuals who have already received a law degree (such as a JD) and want to further specialize or advance their knowledge in a specific area of law. LLM programs are typically one year in length and often focus on areas such as international law, human rights, or commercial law."
}

4) Embed the LLM response

Now, we can receive a response from the OpenAI LLM. However, the system still relies on the OpenAI service for every request; our goal is to shift that load from the AI service to the cache system. To cache the LLM response, we must transform our text (or any other type of data) into vector data. A vector can be thought of as an N-dimensional array (where N depends on the embedding model) in which the numbers together represent the meaning of the original data. Example:
1text = "Large Language Model"
2# embedding_function(text: str) -> vector<number>([...N])
3vector = embedding_function(text)
4
5# vector = [12, 23, 0.11, 22, 85, 43, ..., 90]
We can embed our data using a language model. In our case, we utilize OpenAI's text-embedding model. Therefore, we modify llm.py with a few lines.
llm.py
# ... other code
text_embedding_model = "text-embedding-3-small"

# getEmbedding receives text and returns the embedding (vector) of the original data
def getEmbedding(text):
    embedding = openai_client.embeddings.create(input=text, model=text_embedding_model)
    return embedding.data[0].embedding
So, we shall modify app.py to use the new functionality of llm.py.
app.py
# ... other import dependencies
from llm import getTextResponse, getEmbedding

# ... other routes

# search route
@server.get("/ask")
async def search(query):
    llm_response = getTextResponse(query)
    query_vector = getEmbedding(query)
    print("embedding : ", query_vector)
    return {"message": "Your AI response is: {}".format(llm_response)}
If we run the curl command to invoke /ask again, the server shell should print data similar to the output below.
embedding :  [-0.02254991, 0.031336114, 0.019013261, 0.00017081834, -0.0202526, -0.0020466715, 0.0111036645, -0.0111036645, 0.036172554, 0.04038429, -0.027043771, 0.0046273666, -0.039820038, -0.011456322, 0.0048339227, 0.021824444, 0.0048666694, -0.017501874, 0.03915503, 0.03895351, 0.041311275, ..., 0.046349235]
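One practical detail worth checking at this point (an aside of mine, not part of the original steps): the length of this vector is the dimension your Atlas Vector Search index will need later. For OpenAI's text-embedding-3-small, the default output length is 1536.

# Quick sanity check of the embedding dimension, reusing getEmbedding from llm.py.
from llm import getEmbedding

vector = getEmbedding("Large Language Model")
print(len(vector))  # expected: 1536 for text-embedding-3-small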

5) Store vector information in MongoDB Atlas

We now have vector data that captures the query's semantics. Let's see how to store it for our cache system. MongoDB Atlas Vector Search is the key feature that enables AI-powered semantic search over vector data, and to use it we must first store documents in a MongoDB database. First, register for a MongoDB Atlas account (existing users can sign in to MongoDB Atlas). Follow the instructions and select the Atlas UI as the procedure to deploy your first cluster.
5.1) Connect MongoDB with Python.
5.1.1) Create db.py in the same directory as app.py.
5.1.2) Implement document saving in MongoDB.
db.py
import pymongo

MONGO_URI = ""  # MongoDB connection string
mongo_client = pymongo.MongoClient(MONGO_URI)

db = mongo_client.get_database("logging")  # database name in MongoDB
collection = db.get_collection("test")  # collection name

# document {
#   query: string,
#   response: string,
#   embeddings: vector<number<1536>>
# }

def save_cache(document):
    collection.insert_one(document)
5.1.3) Save the AI response to the database. Modify app.py to save the AI response and its vector information in the database.
app.py
# ... other import dependencies
from db import save_cache

# ... other code

# search route
@server.get("/ask")
async def search(query):
    llm_response = getTextResponse(query)
    query_vector = getEmbedding(query)

    document = {
        "response": llm_response,
        "embeddings": query_vector,
        "query": query
    }
    save_cache(document)

    return {"message": "Your AI response is: {}".format(llm_response)}
5.2) Create a vector search index in MongoDB Atlas. MongoDB Vector Search enables AI-powered experiences that perform semantic search of unstructured data through embeddings produced by machine learning models. We have to create a vector search index on the collection. In Atlas, go to the database -> Atlas Search -> CREATE SEARCH INDEX.
Below is the JSON editor version of the Atlas index definition.
{
  # vector search index type
  "type": "vectorSearch",
  "fields": [
    {
      # type of data in the field (vector information)
      "type": "vector",
      # field of the document that stores the vector information
      "path": "embeddings",
      # dimension of the vector, taken from the embedding model's specification.
      # for OpenAI's `text-embedding-3-small`, it is 1536 dimensions.
      "numDimensions": 1536,
      # vector search ranks documents by similarity to the query vector,
      # so the closest documents get the highest score.
      "similarity": "cosine"
    }
  ]
}
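If you prefer to create the index from code instead of the Atlas UI, a rough sketch using PyMongo's search index helpers is shown below. This is my own addition and assumes PyMongo 4.7 or newer; it reuses the collection object from db.py, and the index name vector_index matches the name the aggregation pipeline expects later.

# Sketch: create the vector search index programmatically (assumes PyMongo 4.7+).
from pymongo.operations import SearchIndexModel

from db import collection  # the collection defined in db.py

index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embeddings",
                "numDimensions": 1536,  # output size of text-embedding-3-small
                "similarity": "cosine",
            }
        ]
    },
    name="vector_index",
    type="vectorSearch",
)
collection.create_search_index(model=index_model)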

6) Retrieve the cached response from MongoDB Atlas

Logically, when we receive a new request from the client, we embed the search query and perform a vector search to find documents whose embeddings are semantically similar to the query embedding. Vector search runs as a stage of an aggregation pipeline, which is constructed as shown below.
[
    # Step 1: Perform the vector search
    {
        # $vectorSearch operation
        "$vectorSearch": {
            # Vector Search index name in Atlas
            "index": "vector_index",
            # field of the document that stores the vector information
            "path": "embeddings",
            # embedding of the client's query
            "queryVector": numerical_embedding,
            # number of nearest neighbors to consider; this value is used for 'searchScore' ranking
            "numCandidates": 20,
            # number of documents to return, ranked by semantic similarity / searchScore
            "limit": 5,
        },
    },
    # Step 2: Add a 'score' field to expose the vector search score.
    {
        "$addFields": {
            "score": {
                # vectorSearchScore is attached in $meta for stages after the $vectorSearch operation
                "$meta": "vectorSearchScore"
            }
        }
    },
    # Step 3: Return only documents whose score exceeds a specific ratio.
    # In this case, only results with a score >= 0.70 are returned.
    {
        "$match": {
            "score": {"$gte": 0.70}
        }
    },
    # Step 4: Remove the `embeddings` field to reduce the payload when retrieving data.
    {
        "$unset": ["embeddings"]
    }
]
Modify db.py and app.py to implement the PyMongo aggregation pipeline for vector search.
db.py
# ... other code
def perform_search_cache(query):
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "path": "embeddings",
                "queryVector": query,
                "numCandidates": 20,
                "limit": 1,
            },
        },
        {"$addFields": {"score": {"$meta": "vectorSearchScore"}}},
        {"$match": {"score": {"$gte": 0.70}}},
        {"$unset": ["embeddings"]},
    ]
    result = collection.aggregate(pipeline)
    return result
app.py
# ... other import dependencies
from db import save_cache, perform_search_cache

# ... other code

# search route
@server.get("/ask")
async def search(query):
    query_vector = getEmbedding(query)
    cache_response = list(perform_search_cache(query_vector))
    response = ""
    if len(cache_response) < 1:
        # no cache hit: ask the LLM and save the response to the cache
        llm_response = getTextResponse(query)
        document = {"response": llm_response, "embeddings": query_vector, "query": query}
        save_cache(document)
        response = llm_response
    else:
        # cache hit: return the cached response
        response = cache_response[0]["response"]
        print("cache hit")
        print(cache_response)
    return {"message": "Your AI response is: {}".format(response)}
We can try to send the request to our system. Let’s ask the system, “How are things with you?”
curl -X GET "http://127.0.0.1:8000/ask?query=how+are+things+with+you?"
Response: the first time we ask, the system retrieves the answer from the AI service.
{
    "message": "Your AI response is: I'm just a computer program, so I don't have feelings or emotions. But I'm here to help you with anything you need! How can I assist you today?"
}
Let’s try to ask the system a new question (but with a similar meaning): How are you today?
curl -X GET "http://127.0.0.1:8000/ask?query=how+are+you+today?"
Now, the system will return cache data from MongoDB Atlas.
{
    "message": "Your AI response is: I'm just a computer program, so I don't have feelings or emotions. But I'm here to help you with anything you need! How can I assist you today?"
}
If you go to shell/terminal, you will see a log like below.
cache hit
[
    {
        '_id': ObjectId('6671440cb2bf0b0eb12b75b3'),
        'response': "I'm just a computer program, so I don't have feelings or emotions. But I'm here to help you with anything you need! How can I assist you today?",
        'query': 'how are thing with you?',
        'score': 0.8066450357437134
    }
]
The query “How are you today?” turns out to be about 80% similar to “How are things with you?”, which is exactly what we expect.
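If you want to experiment with how strict the cache is, one small variation (my own sketch, building directly on perform_search_cache from db.py) is to make the score threshold a parameter. Raising it returns cached answers only for very close paraphrases; lowering it increases cache hits at the risk of less relevant responses.

# Sketch: parameterize the similarity threshold of the cache lookup.
def perform_search_cache(query, min_score=0.70):
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "path": "embeddings",
                "queryVector": query,
                "numCandidates": 20,
                "limit": 1,
            },
        },
        {"$addFields": {"score": {"$meta": "vectorSearchScore"}}},
        {"$match": {"score": {"$gte": min_score}}},  # stricter or looser matching
        {"$unset": ["embeddings"]},
    ]
    return collection.aggregate(pipeline)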

Summary

This article outlines the implementation of a semantic caching system for LLM responses using MongoDB Atlas and Vector Search. The solution covered in this article aims to reduce costs and latency associated with frequent LLM API calls by caching responses based on query semantics rather than exact matches.
The solution integrates FastAPI, OpenAI, and MongoDB Atlas to create a workflow where incoming queries are embedded into vectors and compared against cached entries. Matching queries retrieve stored responses, while new queries are processed by the LLM and then cached.
Key benefits include reduced LLM service load, lower costs, faster response times for similar queries, and scalability. The system demonstrates how combining vector search capabilities with LLMs can optimize natural language processing applications, offering a balance between efficiency and response quality.
Learn how to implement a semantic cache with LangChain, a widely adopted LLM framework.