How to Model Documents for Vector Search to Improve Querying Capabilities
Atlas Vector Search was recently released, so let’s dive into a tutorial on how to properly model your documents when utilizing vector search to revolutionize your querying capabilities!
Vector search is new, so let’s first go over the basic ways of modeling your data in a MongoDB document before continuing on into how to incorporate vector embeddings.
Data modeling in MongoDB revolves around organizing your data into documents within various collections. Different projects and organizations will require different data models because successful data modeling depends on the specific requirements of each application, and for the most part, no single document design fits every situation. There are some commonalities, though, that can guide you:
- Choosing whether to embed or reference your related data.
- Using arrays in a document.
- Indexing your documents (finding fields that are frequently used and applying the appropriate indexing, etc.).
For a more in-depth explanation and a comprehensive guide of data modeling with MongoDB, please check out our data modeling article.
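As a quick illustration of the indexing guideline above, here is a minimal mongosh sketch; the collection and field names are hypothetical and only meant to show the idea:

// Hypothetical "shows" collection: "title" is queried frequently,
// so a single-field index speeds up those lookups.
db.shows.createIndex({ title: 1 })

// A compound index for queries that filter on genre and then sort by title.
db.shows.createIndex({ genre: 1, title: 1 })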
We are going to build our vector embedding example around a MongoDB document for MongoDB TV. Here, we have a single document representing the show, without any embeddings in place. We have a nested array of seasons, and within each season, an array of episodes. This way, we can see exactly which season each episode belongs to, along with the episode number, title, description, and date:
1 { 2 "_id": ObjectId("238478293"), 3 "title": "MongoDB TV", 4 "description": "All your MongoDB updates, news, videos, and podcast episodes, straight to you!", 5 "genre": ["Programming", "Database", "MongoDB"], 6 "seasons": [ 7 { 8 "seasonNumber": 1, 9 "episodes": [ 10 { 11 "episodeNumber": 1, 12 "title": "EASY: Build Generative AI Applications", 13 "description": "Join Jesse Hall….", 14 "date": ISODate("Oct52023") 15 }, 16 { 17 "episodeNumber": 2, 18 "title": "RAG Architecture & MongoDB: The Future of Generative AI Apps", 19 "description": "Join Prakul Agarwal…", 20 "date": ISODate("Oct42023") 21 } 22 ] 23 }, 24 { 25 "seasonNumber": 2, 26 "episodes": [ 27 { 28 "episodeNumber": 1, 29 "title": "Cloud Connect - Harness the Power of AI/ML and Generative AI on AWS with MongoDB Atlas", 30 "description": "Join Igor Alekseev….", 31 "date": ISODate("Oct32023") 32 }, 33 { 34 "episodeNumber": 2, 35 "title": "The Index: Here’s what you missed last week…", 36 "description": "Join Megan Grant…", 37 "date": ISODate("Oct22023") 38 } 39 ] 40 } 41 ] 42 }
Now that we have our example set up, let’s incorporate vector embeddings and discuss the proper techniques to set you up for success.
Let’s first understand exactly what vector search is: vector search lets you search based on meaning rather than on specific words. This comes in handy when you want to query by similarity instead of by keyword. With vector search, you can query using a question or a phrase rather than just a word. In a nutshell, vector search is great for when you can’t think of exactly that book or movie, but you remember the plot or the climax.
This process happens when text, video, or audio is transformed via an encoder into vectors. With MongoDB, we can do this using OpenAI, Hugging Face, or other natural language processing models. Once we have our vectors, we can store them at the base of our document and conduct vector search using them. Please keep in mind the current limitations of vector search and how to properly embed your vectors.
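As a rough sketch of that encoding step, here is what generating an embedding with the OpenAI Node.js SDK and storing it on a document with the MongoDB Node.js driver might look like. The model name, connection string, and database/collection names are assumptions for illustration, not part of the example above:

import OpenAI from "openai";
import { MongoClient } from "mongodb";

// Assumed credentials and names, for illustration only.
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const client = new MongoClient(process.env.MONGODB_URI);

async function embedShowDescription() {
  const shows = client.db("media").collection("shows");
  const show = await shows.findOne({ title: "MongoDB TV" });

  // Encode the description into a vector; "text-embedding-ada-002"
  // returns 1,536-dimensional vectors.
  const response = await openai.embeddings.create({
    model: "text-embedding-ada-002",
    input: show.description,
  });

  // Store the embedding at the base (root level) of the document.
  await shows.updateOne(
    { _id: show._id },
    { $set: { vectorEmbeddings: response.data[0].embedding } }
  );
}

embedShowDescription().finally(() => client.close());

The same pattern applies to any encoder; the key point is that the resulting array of numbers ends up at the base of the document.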
You can store your vector embeddings alongside other data in your document, or you can store them in a new collection. It is really up to the user and the project goals. Let’s go over what a document with vector embeddings can look like when you incorporate them into your data model, using the same example from above:
1 { 2 "_id": ObjectId("238478293"), 3 "title": "MongoDB TV", 4 "description": "All your MongoDB updates, news, videos, and podcast episodes, straight to you!", 5 "genre": ["Programming", "Database", "MongoDB"], 6 “vectorEmbeddings”: [ 0.25, 0.5, 0.75, 0.1, 0.1, 0.8, 0.2, 0.6, 0.6, 0.4, 0.9, 0.3, 0.2, 0.7, 0.5, 0.8, 0.1, 0.8, 0.2, 0.6 ], 7 "seasons": [ 8 { 9 "seasonNumber": 1, 10 "episodes": [ 11 { 12 "episodeNumber": 1, 13 "title": "EASY: Build Generative AI Applications", 14 "description": "Join Jesse Hall….", 15 "date": ISODate("Oct 5, 2023") 16 17 }, 18 { 19 "episodeNumber": 2, 20 "title": "RAG Architecture & MongoDB: The Future of Generative AI Apps", 21 "description": "Join Prakul Agarwal…", 22 "date": ISODate("Oct 4, 2023") 23 } 24 ] 25 }, 26 { 27 "seasonNumber": 2, 28 "episodes": [ 29 { 30 "episodeNumber": 1, 31 "title": "Cloud Connect - Harness the Power of AI/ML and Generative AI on AWS with MongoDB Atlas", 32 "description": "Join Igor Alekseev….", 33 "date": ISODate("Oct 3, 2023") 34 }, 35 { 36 "episodeNumber": 2, 37 "title": "The Index: Here’s what you missed last week…", 38 "description": "Join Megan Grant…", 39 "date": ISODate("Oct 2, 2023") 40 } 41 ] 42 } 43 ] 44 }
Here, your vector embeddings sit at the base of your document. Currently, there is a limitation where vector embeddings cannot be nested in an array in your document, so please ensure your embeddings live at the base. There are various tutorials on our Developer Center, alongside our YouTube account and our documentation, that can help you figure out how to embed these vectors into your document and how to acquire the necessary vectors in the first place.
When you’re using vector search, you need to create a search index so your semantic queries can run against your embeddings. To do this, please view our Vector Search documentation. Here is the skeleton code provided by our documentation:
1 { 2 "fields":[ 3 { 4 "type": "vector", 5 "path": "<field-to-index>", 6 "numDimensions": <number-of-dimensions>, 7 "similarity": "euclidean | cosine | dotProduct" 8 }, 9 { 10 "type": "filter", 11 "path": "<field-to-index>" 12 }, 13 ... 14 ] 15 }
When setting up your search index, change the "path" value ("<field-to-index>") to the field that holds your embeddings; in our case, that is "vectorEmbeddings". "type" can stay the way it is. For "numDimensions", match the number of dimensions of the model you’ve chosen. This is just the number of vector dimensions, and the value cannot be greater than 4096, so please ensure the number you specify matches what your embedding model produces. If you use an embedding model from a supported provider such as OpenAI or Hugging Face, you won’t run into dimension issues. For "similarity", pick which vector function you want to use to search for the top K nearest neighbors: "euclidean", "cosine", or "dotProduct".
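For example, assuming the embedding model outputs 1,536-dimensional vectors (as OpenAI’s text-embedding-ada-002 does) and that cosine similarity is an acceptable choice for your use case, the filled-in definition for our document could look like this:

{
  "fields": [
    {
      "type": "vector",
      "path": "vectorEmbeddings",
      "numDimensions": 1536,
      "similarity": "cosine"
    }
  ]
}

Adjust "numDimensions" and "similarity" to match whichever model and distance function you actually use.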
When you’re ready to query and find results from your embedded documents, it’s time to create an aggregation pipeline on your embedded vector data. To do this, you can use the "$vectorSearch" operator, a new aggregation stage in Atlas that executes an approximate nearest neighbor (ANN) query.
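As a minimal sketch, assuming the index above was created with the name "vector_index" on a "shows" collection, and that queryEmbedding holds the vector for your search phrase (generated with the same model you used for the documents), the pipeline could look like this:

db.shows.aggregate([
  {
    $vectorSearch: {
      index: "vector_index",        // name of the Atlas Vector Search index
      path: "vectorEmbeddings",     // field that holds the stored embeddings
      queryVector: queryEmbedding,  // embedding of the search phrase
      numCandidates: 100,           // nearest neighbors considered during the search
      limit: 5                      // number of documents returned
    }
  },
  {
    // Keep just the fields we want, plus the similarity score.
    $project: {
      title: 1,
      description: 1,
      score: { $meta: "vectorSearchScore" }
    }
  }
])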
For more information on this step, please check out the tutorial on Developer Center about building generative AI applications, and our YouTube video on vector search.