Enable Generative AI and Semantic Search Capabilities on Your Database With MongoDB Atlas and OpenAI
Our goal for this tutorial is to leverage popular LLMs available in the market and add their capabilities and power to the same database as your operational (in other words, primary) workload.
Creating a large language model (LLM) is not a one- or two-day process. It can take years to build a tuned and optimized model. The good news is that many LLMs are already available, including BERT, GPT-3, GPT-4, Claude, and the many open models hosted on Hugging Face, and we can make good use of them in different ways.
LLMs provide vector representations of text data, capturing semantic relationships and understanding the context of language. These vector representations can be leveraged for various tasks, including vector search, to find similar or relevant text items within datasets.
Vector representations of text data can be used for capturing semantic similarity, search and document retrieval, recommendation systems, text clustering and categorization, and anomaly detection.
In this article, we will explore the semantic search capability with vector representations of text data with a real-world use case. We will use the Airbnb sample dataset from MongoDB wherein we will try to find a room of our choice by giving an articulated prompt.
We will use MongoDB Atlas as a data platform, where we will have our sample dataset (an operational workload) of Airbnb and will enable search and vector search capabilities on top of it.
Semantic search is an information retrieval technique that improves the user’s search experience by understanding the intent or meaning behind the queries and the content. Semantic search focuses on context and semantics rather than exact word match, like traditional search would. Learn more about semantic search and how it is different from Google search and text-based search.
Vector search is a technique used for information retrieval and recommendation systems to find items that are similar to query items or vectors. Data items are represented as high-dimensional vectors, and similarity between items is calculated based on the mathematical properties of these vectors. This is a very useful and commonly used approach in content recommendation, image retrieval, and document search.
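To make this concrete, here is a toy sketch of the underlying idea (illustrative only, with made-up three-dimensional vectors; real embeddings have hundreds or thousands of dimensions, and Atlas does this math for you at scale): items and a query are represented as vectors, and items are ranked by cosine similarity to the query.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the vectors divided by the
    # product of their magnitudes. Closer to 1 means more similar.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical 3-dimensional "embeddings" for a query and two items.
query = [0.9, 0.1, 0.3]
items = {
    "cozy room near the beach": [0.8, 0.2, 0.25],
    "studio with a city view": [0.1, 0.9, 0.4],
}

# Rank items by similarity to the query, most similar first.
for name, vec in sorted(items.items(), key=lambda kv: cosine_similarity(query, kv[1]), reverse=True):
    print(f"{cosine_similarity(query, vec):.3f}  {name}")
```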
Atlas Vector Search enables searching through unstructured data. You can store vector embeddings generated by popular machine learning models like OpenAI and Hugging Face, utilizing them for semantic search and personalized user experiences, creating RAGs, and many other use cases.
We have an Airbnb dataset that has a nice description written for each of the properties. We will let users express their choice of location in words — for example, “Nice cozy, comfy room near beach,” “3 bedroom studio apartment for couples near beach,” “Studio with nice city view,” etc. — and the database will return the relevant results based on the sentence and keywords added.
Under the hood, the application will make an API call to the LLM we’re using (OpenAI) and get vector embeddings for the search prompt we pass in (as we do in the ChatGPT interface). We will then search with those embeddings against our operational dataset, which enables the database to return semantic, contextual results.
Within a few clicks, and with the power of existing LLMs, we can provide a much better search experience on top of our existing operational dataset.
- Create a database called sample_airbnb and add a single dummy record in the collection called listingsAndReviews.
- Use a machine with a recent version of Python (3.11.1 was used while preparing this article) and the latest PyMongo driver installed (4.6.1 was used while preparing this article). A quick version check is shown below.
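If you want to confirm your environment matches, a quick check of the installed versions looks like this (nothing here is specific to this tutorial):

```python
import sys
import pymongo

# Print the interpreter and driver versions; compare against the ones above.
print(sys.version)     # 3.11.1 was used while preparing this article
print(pymongo.version) # 4.6.1 was used while preparing this article
```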
At this point, assuming the initial setup is done, let's jump right into the integration steps.
- Create a trigger to add/update vector embeddings.
- Create a variable to store OpenAI credentials. (We will use this for retrieval in the trigger code.)
- Create an Atlas search index.
- Load/insert your data.
- Query the database.
We will walk through each of the integration steps mentioned above, with instructions to help you find the relevant screens and easily configure your own environment.
On the left menu of your Atlas cluster, click on Triggers.
Click on Add Trigger which will be visible in the top right corner of the triggers page.
Select the appropriate options on the Add Trigger page, as shown below.
The function editor shown here is where you will add the trigger code in the next step.
Add the following code in the function area, visible in Step 3 above. It adds or updates vector embeddings for documents and runs whenever a new document is created or an existing document is updated.
```javascript
exports = async function(changeEvent) {
  // Get the full document from the change event.
  const doc = changeEvent.fullDocument;

  // Define the OpenAI API url and key.
  const url = 'https://api.openai.com/v1/embeddings';
  // Use the name you gave the value of your API key in the "Values" utility inside of App Services.
  const openai_key = context.values.get("openAI_value");
  try {
    console.log(`Processing document with id: ${doc._id}`);

    // Call the OpenAI API to get the embeddings.
    let response = await context.http.post({
      url: url,
      headers: {
        'Authorization': [`Bearer ${openai_key}`],
        'Content-Type': ['application/json']
      },
      body: JSON.stringify({
        // The field inside your document that contains the data to embed.
        // Here it is the "description" field from the Airbnb sample data.
        input: doc.description,
        model: "text-embedding-3-small"
      })
    });

    // Parse the JSON response.
    let responseData = EJSON.parse(response.body.text());

    // Check the response status.
    if (response.statusCode === 200) {
      console.log("Successfully received embedding.");

      const embedding = responseData.data[0].embedding;

      // Use the name of your linked MongoDB Atlas cluster.
      const collection = context.services.get("AtlasSearch").db("sample_airbnb").collection("listingsAndReviews");

      // Update the document in MongoDB.
      const result = await collection.updateOne(
        { _id: doc._id },
        // The name of the new field that will contain your embeddings.
        { $set: { description_embedding: embedding } }
      );

      if (result.modifiedCount === 1) {
        console.log("Successfully updated the document.");
      } else {
        console.log("Failed to update the document.");
      }
    } else {
      console.log(`Failed to receive embedding. Status code: ${response.statusCode}`);
    }
  } catch (err) {
    console.error(err);
  }
};
```
At this point, with the above code block and configuration in place, the trigger will fire when a document is inserted or updated in the listingsAndReviews collection of our sample_airbnb database. The code will call the OpenAI API, fetch the embeddings of the description field, and store the result in the description_embedding field of the listingsAndReviews collection.
Now that we’ve configured a trigger, let's create variables to store the OpenAI credentials in the next step.
Once you’ve created the cluster, you will see the App Services tab in the top left area next to Charts.
Click on App Services. You will see the trigger that you created in the first step.
Click on the trigger present and it will open up a page where you can click on the Values tab present on the left menu, as shown below.
Click on Create New Value. Create a secret named openAI_key that stores your OpenAI API key, and then create a value named openAI_value that is linked to that secret. openAI_value is the name the trigger code reads via context.values.get().
We’ve prepared our app service to fetch API credentials and have also added a trigger function that will be triggered/executed upon document inserts or updates.
Now, we will move on to creating an Atlas search index, loading MongoDB’s provided sample data, and querying the database.
Click on the cluster name and then the search tab from the cluster page.
Click on Create Index as shown below to create an Atlas search index.
Select the JSON Editor and paste a vector search index definition like the one shown below.
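Here is a minimal index definition consistent with the rest of this tutorial. It assumes the index is named default (the name used by the query code later in this article), that the embeddings live in the description_embedding field, and that you keep text-embedding-3-small’s default 1,536 dimensions with cosine similarity:

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "description_embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    }
  ]
}
```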
We’ve created the Atlas search index and are now ready to load data into our prepared environment. As a next step, let's load the sample data.
As a prerequisite for this step, we need to make sure that the cluster is up and running and the screen is visible, as shown in Step 1 below. Make sure that the collection named listingsAndReviews is created under the sample_airbnb database. If you’ve not created it yet, create it by switching to the Data Explorer tab.
We can load the sample dataset from the Atlas cluster option itself, as shown below.
Once you load the data, verify whether the embedding field was added in the collection.
At this point, we’ve loaded the sample dataset. It should have triggered the code we configured to be triggered upon insert or updates. As a result of that, the description_embedding field will be added, containing an array of vectors.
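One quick way to verify is to count the documents that now carry the new field. Here is a minimal sketch (substitute your own connection string, as in the query script later in this article):

```python
import pymongo

# Count the documents that the trigger has enriched with embeddings.
client = pymongo.MongoClient("mongodb+srv://<username>:<password>@<cluster-url.mongodb.net>/")
collection = client.sample_airbnb["listingsAndReviews"]
print(collection.count_documents({"description_embedding": {"$exists": True}}))
```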
Now that we’ve prepared everything, let’s jump right into querying our dataset and see the exciting results we get from our user prompt. In the next section of querying the database, we will pass our sample user prompt directly to the Python script.
As a prerequisite for this step, you will need a runtime for the Python script. It can be your local machine, an EC2 instance on AWS, or AWS Lambda, whichever option is most convenient. Make sure you’ve installed PyMongo in the environment of your choice. The following code block can be written in a Jupyter notebook or VS Code and executed from the Jupyter runtime or the command line, depending on which option you go with. It demonstrates how to perform an Atlas vector search: it fetches embeddings for the user prompt from the OpenAI API and uses them to retrieve contextually relevant records from your operational database.
```python
import pymongo
import requests
import pprint

def get_vector_embeddings_from_openai(query):
    openai_api_url = "https://api.openai.com/v1/embeddings"
    openai_api_key = "<your-open-ai-api-key>"

    data = {
        'input': query,
        'model': "text-embedding-3-small"
    }

    headers = {
        'Authorization': 'Bearer {0}'.format(openai_api_key),
        'Content-Type': 'application/json'
    }

    response = requests.post(openai_api_url, json=data, headers=headers)
    embedding = []
    if response.status_code == 200:
        embedding = response.json()['data'][0]['embedding']
    return embedding

def find_similar_documents(embedding):
    mongo_url = 'mongodb+srv://<username>:<password>@<cluster-url.mongodb.net>/?retryWrites=true&w=majority'
    client = pymongo.MongoClient(mongo_url)
    db = client.sample_airbnb
    collection = db["listingsAndReviews"]

    pipeline = [
        {
            "$vectorSearch": {
                "index": "default",
                "path": "description_embedding",
                "queryVector": embedding,
                "numCandidates": 150,
                "limit": 10
            }
        },
        {
            "$project": {
                "_id": 0,
                "description": 1
            }
        }
    ]
    documents = collection.aggregate(pipeline)
    return documents

def main():
    query = "Best for couples, nearby beach area with cool weather"
    try:
        embedding = get_vector_embeddings_from_openai(query)
        documents = find_similar_documents(embedding)
        print("Documents")
        pprint.pprint(list(documents))
    except Exception as e:
        print("Error occurred: {0}".format(e))

main()
```
We searched for “best for couples, nearby beach area with cool weather” with the code block above. The results we got back are contextually and semantically relevant and closely match user expectations.
To summarize, we used Atlas App Services to configure the trigger and the OpenAI API credentials. In the trigger code, we wrote logic to fetch embeddings from OpenAI and store them in newly inserted or updated documents. With these steps, we have enabled semantic search capabilities on our primary workload dataset, which in this case is Airbnb listings.
If you have any doubts or questions, or want to discuss this or other use cases further, you can reach out to me on LinkedIn.