Using OpenAI Latest Embeddings in a RAG System With MongoDB
OpenAI recently released new embeddings and moderation models. This article explores the step-by-step implementation process of utilizing one of the new embedding models: text-embedding-3-small within a retrieval-augmented generation (RAG) system powered by MongoDB Atlas Vector Database.
An embedding is a mathematical representation of data within a high-dimensional space, typically referred to as a vector space. Within a vector space, vector embeddings are positioned based on their semantic relationships, concepts, or contextual relevance. This spatial relationship within the vector space effectively mirrors the associations in the original data, making embeddings useful in various artificial intelligence domains, such as machine learning, deep learning, generative AI (GenAI), natural language processing (NLP), computer vision, and data science.
Creating an embedding involves mapping data related to entities like words, products, audio, and user profiles into a numerical format. In NLP, this process involves transforming words and phrases into vectors, converting their semantic meanings into a machine-readable form.
AI applications that utilize RAG architecture design patterns leverage embeddings to augment the large language model (LLM) generative process by retrieving relevant information from a data store such as MongoDB Atlas. By comparing embeddings of the query with those in the database, RAG systems incorporate external knowledge, improving the relevance and accuracy of the responses.
OpenAI recently introduced two new embedding models: text-embedding-3-small and text-embedding-3-large. The text-embedding-3-small model offers a compact and highly efficient solution, ideal for applications requiring speed and agility, while the text-embedding-3-large model provides a more detailed and powerful vector representation suitable for complex and nuanced data processing tasks.
| | ada v2 | text-embedding-3-small | text-embedding-3-large |
| --- | --- | --- | --- |
| Embedding Size | 1536 | 256, 512 and 1536 | 256, 1024 and 3072 |
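The flexible sizes in the table come from the `dimensions` parameter of the embeddings API: the text-embedding-3 models are trained so that an embedding can also be shortened client-side by truncating it and re-normalizing the result. Below is a minimal sketch of that truncate-and-renormalize step in pure Python; the helper name `shorten_embedding` is illustrative, not part of the OpenAI library.

```python
import math

def shorten_embedding(embedding, dimensions):
    """Truncate an embedding to `dimensions` entries and L2-normalize the result.

    This mirrors what the API's `dimensions` parameter does for the
    text-embedding-3 models, whose leading dimensions carry the most information.
    """
    truncated = embedding[:dimensions]
    norm = math.sqrt(sum(x * x for x in truncated))
    return [x / norm for x in truncated]

# Toy example with a made-up 3-dimensional "embedding"
short = shorten_embedding([3.0, 4.0, 12.0], 2)
print(short)  # a unit-length 2-dimensional vector
```

In production, passing `dimensions=512` (for example) to `openai.embeddings.create` achieves the same effect server-side.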
- OpenAI's embedding models: Get introduced to OpenAI's new embedding models, text-embedding-3-small and text-embedding-3-large, and their applications.
- Practical implementation steps: Follow through practical steps, including library installation, data loading and preprocessing, creating embeddings, and data ingestion into MongoDB.
- Vector Search index in MongoDB: Learn to create and use a vector search index for efficient retrieval and user query processing.
- AI-driven query responses: Understand how to handle user queries and generate AI responses, integrating RAG system insights for more accurate answers.
- Real-world application insight: Gain hands-on experience in implementing an advanced RAG system for practical uses like a movie recommendation engine.
The following section walks through the steps required to use the new OpenAI embedding model text-embedding-3-small to embed the plot data points of movies within a movie dataset, powering a RAG system that answers user queries against the movie collection.
The steps also cover the typical stages within RAG systems and pipelines that AI engineers are likely to encounter:
- Data loading: importing and accessing datasets from various data sources for processing and analysis; the step involves making data available in the application environment.
- Data cleaning and preparation: refining the dataset by removing inaccuracies, filling missing values, and formatting data for use in the downstream stages in the pipeline.
- Data ingestion and indexing: moving the processed data into a data store such as MongoDB Atlas database and creating indexes to optimize retrieval efficiency and search performance.
- Querying: executing search queries against the database to retrieve relevant data based on specific criteria or user inputs.
The development environment for the demonstration of the text-embedding-3-small embedding model and the retrieval system requires setting up libraries and tools, installed using the Python package manager pip.
```shell
!pip install datasets pandas openai pymongo
```
Below are brief explanations of the tools and libraries utilized within the implementation code:
- datasets: This library is part of the Hugging Face ecosystem. By installing 'datasets', we gain access to a number of pre-processed and ready-to-use datasets, which are essential for training and fine-tuning machine learning models or benchmarking their performance.
- pandas: This is a data science library that provides robust data structures and methods for data manipulation, processing, and analysis.
- openai: This is the official Python client library for accessing OpenAI's suite of AI models and tools, including GPT and embedding models.
- pymongo: PyMongo is a Python toolkit for MongoDB. It enables interactions with a MongoDB database.
The code snippet below shows the data loading phase, where load_dataset from the Hugging Face datasets library and the pandas library, denoted as pd, are imported into the development environment. The load_dataset function provides access to a wide range of datasets available in Hugging Face's repository.
Load the dataset titled AIatMongoDB/embedded_movies. This dataset is a collection of movie-related details that include attributes such as the title, release year, cast, and plot. A unique feature of this dataset is the plot_embedding field for each movie. These embeddings are generated using OpenAI's text-embedding-ada-002 model.
After loading the dataset, it is converted into a pandas DataFrame; this data format simplifies data manipulation and analysis. Display the first five rows using the head(5) function to gain an initial understanding of the data. This preview provides a snapshot of the dataset's structure and its various attributes, such as genres, cast, and plot embeddings.
```python
from datasets import load_dataset
import pandas as pd

# https://huggingface.co/datasets/AIatMongoDB/embedded_movies
dataset = load_dataset("AIatMongoDB/embedded_movies")

# Convert the dataset to a pandas DataFrame
dataset_df = pd.DataFrame(dataset['train'])

dataset_df.head(5)
```
Import libraries:
- `from datasets import load_dataset`: imports the `load_dataset` function from the Hugging Face datasets library; this function is used to load datasets from Hugging Face's extensive dataset repository.
- `import pandas as pd`: imports the pandas library, a fundamental Python tool for data manipulation and analysis, under the alias `pd`.

Load the dataset:
- `dataset = load_dataset("AIatMongoDB/embedded_movies")`: loads the dataset named `embedded_movies` from the Hugging Face datasets repository; this dataset is provided by MongoDB and is specifically designed for embedding and retrieval tasks.

Convert the dataset to a pandas DataFrame:
- `dataset_df = pd.DataFrame(dataset['train'])`: converts the training portion of the dataset into a pandas DataFrame.

Preview the dataset:
- `dataset_df.head(5)`: displays the first five entries of the DataFrame.
The next step cleans the data and prepares it for the next stage, which creates a new embedding data point using the new OpenAI embedding model.
```python
# Remove data points where the plot column is missing
dataset_df = dataset_df.dropna(subset=['plot'])
print("\nNumber of missing values in each column after removal:")
print(dataset_df.isnull().sum())

# Remove the plot_embedding from each data point, as we will create new
# embeddings with the new OpenAI embedding model "text-embedding-3-small"
dataset_df = dataset_df.drop(columns=['plot_embedding'])
dataset_df.head(5)
```
Removing incomplete data:
- `dataset_df = dataset_df.dropna(subset=['plot'])`: ensures data integrity by removing any row where the "plot" column is missing data; since "plot" is a vital component of the new embeddings, its completeness directly affects retrieval performance.

Preparing for new embeddings:
- `dataset_df = dataset_df.drop(columns=['plot_embedding'])`: removes the existing "plot_embedding" column; because new embeddings will be created with OpenAI's "text-embedding-3-small" model, the existing embeddings (generated by a different model) are no longer needed.
- `dataset_df.head(5)`: previews the first five rows of the updated DataFrame to confirm the removal of the "plot_embedding" column and data readiness.
This stage focuses on generating new embeddings using OpenAI's advanced model.
This demonstration utilizes a Google Colab Notebook, where environment variables are configured explicitly within the notebook's Secrets section and accessed using the userdata module. In a production environment, the environment variables that store secret keys are usually stored in a .env file or equivalent.
An OpenAI API key is required to ensure the successful completion of this step. More details on OpenAI's embedding models can be found on the official site.
```python
import openai
from google.colab import userdata

openai.api_key = userdata.get("open_ai")

EMBEDDING_MODEL = "text-embedding-3-small"

def get_embedding(text):
    """Generate an embedding for the given text using OpenAI's API."""

    # Check for valid input
    if not text or not isinstance(text, str):
        return None

    try:
        # Call the OpenAI API to get the embedding
        embedding = openai.embeddings.create(input=text, model=EMBEDDING_MODEL).data[0].embedding
        return embedding
    except Exception as e:
        print(f"Error in get_embedding: {e}")
        return None

dataset_df["plot_embedding_optimised"] = dataset_df['plot'].apply(get_embedding)

dataset_df.head()
```
Setting up OpenAI API:
- Imports and API key: Import the openai library and retrieve the API key from Google Colab's userdata.
- Model selection: Set the variable EMBEDDING_MODEL to text-embedding-3-small.
Embedding generation function:
- get_embedding: converts text into an embedding; it takes a string input and generates the text embedding using the OpenAI model specified by EMBEDDING_MODEL.
- Input validation and API call: validates the input to ensure it's a valid string, then calls the OpenAI API to generate the embedding.
- If the process encounters any issues, such as invalid input or API errors, the function returns None.
- Applying to dataset: The function get_embedding is applied to the “plot” column of the DataFrame dataset_df. Each plot is transformed into an optimized embedding data stored in a new column, plot_embedding_optimised.
- Preview updated dataset: dataset_df.head() displays the first few rows of the DataFrame.
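Because `.apply(get_embedding)` issues one API request per row, transient failures such as rate limits or timeouts can leave scattered `None` values in the new column. One way to harden this, sketched below with a generic retry wrapper (the name `with_retries` and the backoff values are illustrative, not part of the tutorial's code), is to retry each call with exponential backoff before giving up:

```python
import time

def with_retries(fn, max_attempts=3, base_delay=1.0):
    """Return a wrapped version of `fn` that retries on exceptions.

    Waits base_delay seconds after the first failure, doubling the delay
    after each subsequent failure (simple exponential backoff).
    Re-raises the last exception after max_attempts failed attempts.
    """
    def wrapper(*args, **kwargs):
        delay = base_delay
        for attempt in range(1, max_attempts + 1):
            try:
                return fn(*args, **kwargs)
            except Exception as e:
                if attempt == max_attempts:
                    raise
                print(f"Attempt {attempt} failed ({e}); retrying in {delay:.1f}s")
                time.sleep(delay)
                delay *= 2
    return wrapper

# Usage sketch: wrap the tutorial's get_embedding before applying it
# dataset_df["plot_embedding_optimised"] = dataset_df["plot"].apply(with_retries(get_embedding))
```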
MongoDB acts as both an operational and a vector database. It offers a database solution that efficiently stores, queries, and retrieves vector embeddings — the advantages of this lie in the simplicity of database maintenance, management, and cost.
To create a new MongoDB database, set up a database cluster:
1. Select the “Database” option on the left-hand pane to navigate to the Database Deployment page, which lists the deployment specification of any existing cluster.
2. Create a new database cluster by clicking the “+Create” button.
3. Select all the applicable configurations for the database cluster. Once all the configuration options are selected, click the “Create Cluster” button to deploy the newly created cluster. MongoDB also enables the creation of free clusters on the “Shared” tab.

Note: Don’t forget to whitelist the IP of the Python host, or 0.0.0.0/0 (any IP) when creating proofs of concept.

4. After the cluster is successfully created and deployed, it becomes accessible on the “Database Deployment” page.
5. Click the cluster's “Connect” button to view the options for connecting to the cluster via various language drivers.
6. This tutorial only requires the cluster's URI (uniform resource identifier). Copy the URI into the Google Colab Secrets environment in a variable named MONGO_URI, or place it in a .env file or equivalent.
```python
import pymongo
from google.colab import userdata

def get_mongo_client(mongo_uri):
    """Establish a connection to MongoDB."""
    try:
        client = pymongo.MongoClient(mongo_uri)
        print("Connection to MongoDB successful")
        return client
    except pymongo.errors.ConnectionFailure as e:
        print(f"Connection failed: {e}")
        return None

mongo_uri = userdata.get('MONGO_URI')
if not mongo_uri:
    print("MONGO_URI not set in environment variables")

mongo_client = get_mongo_client(mongo_uri)

# Ingest data into MongoDB
db = mongo_client['movies']
collection = db['movie_collection']

documents = dataset_df.to_dict('records')
collection.insert_many(documents)

print("Data ingestion into MongoDB completed")
```
7. Database connection setup:
- MongoDB connection function: The `get_mongo_client` function establishes a connection to MongoDB using the provided URI. It includes error handling to manage connection failures.

8. Data ingestion process:
- Retrieving the MongoDB URI: The MongoDB URI, which is required for connecting to the database, is obtained from the environment variables using `userdata.get('MONGO_URI')`.
- Establishing the database connection: The script attempts to connect to MongoDB using this URI.
- Database and collection selection: Once connected, the script selects the `movies` database and the `movie_collection` collection, specifying where the data will be stored in MongoDB. If the database or collection does not exist, MongoDB creates them automatically.
- Data conversion and insertion: The DataFrame, with its enhanced embeddings, is converted into a list of dictionaries suitable for MongoDB using `to_dict('records')`. The `insert_many` method then ingests the data in a batch.
This next step is mandatory for conducting efficient and accurate vector-based searches based on the vector embeddings stored within the documents in the movie_collection collection. Creating a Vector Search index enables efficient traversal of the documents to retrieve those whose embeddings match the query embedding by vector similarity. Read more about MongoDB Vector Search indexes.
1. Navigate to the movie_collection in the movie database. At this point, the database is populated with several documents containing information about various movies, particularly within the action and romance genres.
2. Select the “Atlas Search” tab option on the navigation pane to create an Atlas Vector Search index. Click the “Create Search Index” button to create an Atlas Vector Search Index.
3. On the page to create a Vector Search index, select the Atlas Vector Search option that enables the creation of a vector search index by defining the index using JSON.
4. The following page depicted below enables the definition of the index via JSON. This page also provides the ability to name the vector index search. The name given to the index will be referenced in the implementation code in the following steps. For this tutorial, the name “vector_index” will be used.
5. To complete the creation of the vector search index, select the appropriate database and collection for which the index should be created. In this scenario, it is the “movies” database and the “movie_collection” collection. The JSON entered into the JSON editor should look similar to the following:
```json
{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "plot_embedding_optimised",
      "similarity": "cosine",
      "type": "vector"
    }
  ]
}
```
- fields: a list that specifies the fields to be indexed in the MongoDB collection, along with the definition of each field's index characteristics.
- numDimensions: specifies the dimensionality of the vector data; here it is set to 1536. This number must match the dimensionality of the vectors stored in the field, and 1536 is the default dimensionality of embeddings created by OpenAI's `text-embedding-3-small`.
- path: indicates the path to the field within the database documents to be indexed; here it is set to `plot_embedding_optimised`.
- similarity: defines the similarity distance metric used to compare vectors during a search. Here it is set to `cosine`, which measures the cosine of the angle between two vectors, effectively determining how similar or different their orientations are in the vector space. Other available similarity metrics are Euclidean and dot product. Find more information about how to index vector embeddings for vector search.
- type: specifies the data type the index will handle. Here it is set to `vector`, indicating that this index is specifically designed for handling and optimizing searches over vector data.
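To make the `cosine` metric concrete, here is a small pure-Python sketch of cosine similarity between two vectors. This is purely illustrative; Atlas computes the metric internally during the search.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: dot(a, b) / (|a| * |b|).

    Returns 1.0 for identical directions, 0.0 for orthogonal vectors,
    and -1.0 for opposite directions.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0
```

Since OpenAI embeddings are normalized to unit length, cosine similarity coincides with the dot product for them.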
Now, the vector search index should be created successfully. Navigating back to the Atlas Search page should show the index named vector_index with a status of active.
This step combines all the activities in the previous step to provide the functionality of conducting vector search on stored records based on embedded user queries.
This step implements a function that returns a vector search result by generating a query embedding and defining a MongoDB aggregation pipeline. The pipeline, consisting of the `$vectorSearch` and `$project` stages, queries using the generated vector and formats the results to include only the required information, such as plot, title, and genres, while incorporating a search score for each result. This selective projection enhances query performance by reducing data transfer and optimizes the use of network and memory resources, which is especially critical when handling large datasets. For AI engineers and developers considering data security at an early stage, carefully excluding fields irrelevant to the user's query minimizes the chance of sensitive data leaking to the client side.

```python
def vector_search(user_query, collection):
    """
    Perform a vector search in the MongoDB collection based on the user query.

    Args:
        user_query (str): The user's query string.
        collection (MongoCollection): The MongoDB collection to search.

    Returns:
        list: A list of matching documents.
    """

    # Generate an embedding for the user query
    query_embedding = get_embedding(user_query)

    if query_embedding is None:
        return "Invalid query or embedding generation failed."

    # Define the vector search pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "plot_embedding_optimised",
                "numCandidates": 150,  # Number of candidate matches to consider
                "limit": 5  # Return the top 5 matches
            }
        },
        {
            "$project": {
                "_id": 0,  # Exclude the _id field
                "plot": 1,  # Include the plot field
                "title": 1,  # Include the title field
                "genres": 1,  # Include the genres field
                "score": {
                    "$meta": "vectorSearchScore"  # Include the search score
                }
            }
        }
    ]

    # Execute the search
    results = collection.aggregate(pipeline)
    return list(results)
```
1. Vector Search custom function:
- The `vector_search` function performs a sophisticated search within a MongoDB collection, utilizing the vector embeddings stored in the database.
- It accepts two parameters: `user_query`, a string representing the user's search query, and `collection`, the MongoDB collection to be searched.

2. Query embedding and search pipeline:
- Embedding generation: The function begins by generating an embedding for the user query using the `get_embedding` function.
- Defining the search pipeline: A MongoDB aggregation pipeline is defined for the vector search. This pipeline uses the `$vectorSearch` operator to find documents whose embeddings closely match the query embedding; it specifies the index to use, the query vector, and the path to the embeddings in the documents, and it limits the number of candidate matches and the number of results returned.
- Projection of results: The `$project` stage formats the output by including relevant fields, such as the plot, title, genres, and search score, while excluding the MongoDB document ID.
The final step in the implementation phase focuses on the practical application of our vector search functionality and AI integration to handle user queries effectively. The `handle_user_query` function performs a vector search on the MongoDB collection based on the user's query and utilizes OpenAI's GPT-3.5 model to generate context-aware responses.

```python
def handle_user_query(query, collection):

    get_knowledge = vector_search(query, collection)

    search_result = ''
    for result in get_knowledge:
        search_result += f"Title: {result.get('title', 'N/A')}, Plot: {result.get('plot', 'N/A')}\n"

    completion = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a movie recommendation system."},
            {"role": "user", "content": "Answer this user query: " + query + " with the following context: " + search_result}
        ]
    )

    return completion.choices[0].message.content, search_result

# Conduct a query with retrieval of sources
query = "What is the best romantic movie to watch?"
response, source_information = handle_user_query(query, collection)

print(f"Response: {response}")
print(f"Source Information: \n{source_information}")
```
1. Functionality for query handling:
- The `handle_user_query` function takes a user's query and the MongoDB collection as inputs.
- It starts by executing a vector search on the collection based on the user query, retrieving relevant movie documents.

2. Generating AI-driven responses:
- Context compilation: Next, the function compiles a context string from the search results, concatenating the titles and plots of the retrieved movies.
- OpenAI model integration: The `openai.chat.completions.create` function is called with the model `gpt-3.5-turbo`.
- System and user roles: In the messages sent to the OpenAI model, two roles are defined: system, which establishes the AI's role as a movie recommendation system, and user, which provides the actual user query and the context.

3. Executing and displaying responses:
- The `handle_user_query` function returns the AI-generated response and the search result context used.
Below is the result of running the function:
```
Response: Based on the context provided, the best romantic movie to watch would be "Gorgeous". It revolves around a romantic girl who travels to Hong Kong in search of true love but unexpectedly falls for a kind-hearted professional fighter.
Source Information:
Title: Run, Plot: This action movie is filled with romance and adventure. As Abhisek fights for his life against the forces of crime and injustice, he meets Bhoomika, who captures his heart.
Title: China Girl, Plot: A modern day Romeo & Juliet story is told in New York when an Italian boy and a Chinese girl become lovers, causing a tragic conflict between ethnic gangs.
Title: Gorgeous, Plot: A romantic girl travels to Hong Kong in search of certain love but instead meets a kind-hearted professional fighter with whom she begins to fall for instead.
Title: Once a Thief, Plot: A romantic and action packed story of three best friends, a group of high end art thieves, who come into trouble when a love-triangle forms between them.
Title: House of Flying Daggers, Plot: A romantic police captain breaks a beautiful member of a rebel group out of prison to help her rejoin her fellows, but things are not what they seem.
```
The new OpenAI embedding models promise better performance in multi-language retrieval and task-specific accuracy than previously released OpenAI embedding models. This article outlined the implementation steps for a RAG system that leverages one of the latest embedding models. View the GitHub repo for the implementation code.
In practical scenarios, lower-dimension embeddings that can maintain a high level of semantic capture are beneficial for Generative AI applications where the relevance and speed of retrieval are crucial to user experience and value.
Further advantages of lower embedding dimensions with high performance are:
- Improved user experience and relevance: Relevance of information retrieval is optimized, directly impacting the user experience and value in AI-driven applications.
- Comparison with previous model: In contrast to the previous ada v2 model, which only provided embeddings at a dimension of 1536, the new models offer more flexibility. The text-embedding-3-large extends this flexibility further with dimensions of 256, 1024, and 3072.
- Efficiency in data processing: The availability of lower-dimensional embeddings aids in more efficient data processing, reducing computational load without compromising the quality of results.
- Resource optimization: Lower-dimensional embeddings are resource-optimized, beneficial for applications running on limited memory and processing power, and for reducing overall computational costs.
Future articles will cover advanced topics, such as benchmarking embedding models and handling migration of embeddings.
An embedding is a technique where data — such as words, audio, or images — is transformed into mathematical representations, vectors of real numbers in a high-dimensional space referred to as a vector space. This process allows AI models to understand and process complex data by capturing the underlying semantic relationships and contextual nuances.
A vector store, such as a MongoDB Atlas database, is a storage mechanism for vector embeddings. It allows efficient storing, indexing, and retrieval of vector data, essential for tasks like semantic search, recommendation systems, and other AI applications.
A RAG system uses embeddings to improve the response generated by a large language model (LLM) by retrieving relevant information from a knowledge store based on semantic similarities. The query embedding is compared with the knowledge store (database record) embedding to fetch contextually similar and relevant data, which improves the accuracy and relevance of generated responses by the LLM to the user’s query.