How to Create Vector Embeddings
On this page
You can store vector embeddings alongside your other data in Atlas. These embeddings capture meaningful relationships in your data and allow you to perform semantic search and implement RAG with Atlas Vector Search.
Get Started
Use the following tutorial to learn how to create vector embeddings and query them using Atlas Vector Search. Specifically, you perform the following actions:
Define a function that uses an embedding model to generate vector embeddings.
Create embeddings from your data and store them in Atlas.
Create embeddings from your search terms and run a vector search query.
For production applications, you typically write a script to generate vector embeddings. You can start with the sample code on this page and customize it for your use case.
Prerequisites
➤ Use the Select your language drop-down menu to set the language of the examples on this page.
To complete this tutorial, you must have the following:
An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
A terminal and code editor to run your Go project.
Go installed.
A Hugging Face Access Token or OpenAI API Key.
An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
Java Development Kit (JDK) version 8 or later.
An environment to set up and run a Java application. We recommend that you use an integrated development environment (IDE) such as IntelliJ IDEA or Eclipse IDE to configure Maven or Gradle to build and run your project.
One of the following:
A Hugging Face Access Token with read access
An OpenAI API Key. You must have a paid OpenAI account with credits available for API requests. To learn more about registering an OpenAI account, see the OpenAI API website.
An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
A terminal and code editor to run your Node.js project.
npm and Node.js installed.
If you're using OpenAI models, you must have an OpenAI API Key.
An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
An environment to run interactive Python notebooks such as VS Code or Colab.
If you're using OpenAI models, you must have an OpenAI API Key.
Define an Embedding Function
In this section, you define a function to generate vector embeddings by using an embedding model. Select a tab based on whether you want to use an open-source embedding model or a proprietary model from OpenAI.
Note
Open-source embedding models are free to use and can be loaded locally from your application. Proprietary models require an API key to access the models.
Create a .env
file to manage secrets.
In your project, create a .env
file to store your Atlas connection
string and Hugging Face access token.
HUGGINGFACEHUB_API_TOKEN = "<access-token>" ATLAS_CONNECTION_STRING = "<connection-string>"
Replace the <access-token>
placeholder value with your Hugging Face access token.
Replace the <connection-string>
placeholder value with the SRV
connection string for
your Atlas cluster.
Your connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
Define a function to generate vector embeddings.
Create a directory in your project called
common
to store common code that you'll use in later steps:mkdir common && cd common Create a file named
get-embeddings.go
and paste the following code. This code defines a function namedGetEmbeddings
to generate an embedding for a given input. This function specifies:The
feature-extraction
task using the Go port of the LangChain library. To learn more, see the Tasks documentation in the LangChain JavaScript documentation.The mxbai-embed-large-v1 embedding model.
get-embeddings.gopackage common import ( "context" "log" "github.com/tmc/langchaingo/embeddings/huggingface" ) func GetEmbeddings(documents []string) [][]float32 { hf, err := huggingface.NewHuggingface( huggingface.WithModel("mixedbread-ai/mxbai-embed-large-v1"), huggingface.WithTask("feature-extraction")) if err != nil { log.Fatalf("failed to connect to Hugging Face: %v", err) } embs, err := hf.EmbedDocuments(context.Background(), documents) if err != nil { log.Fatalf("failed to generate embeddings: %v", err) } return embs } Note
503 when calling Hugging Face models
You may occasionally get 503 errors when calling Hugging Face model hub models. To resolve this issue, retry after a short delay.
Move back into the main project root directory.
cd ../
Create a .env
file to manage secrets.
In your project, create a .env
file to store your connection
string and OpenAI API token.
OPENAI_API_KEY = "<api-key>" ATLAS_CONNECTION_STRING = "<connection-string>"
Replace the <api-key>
and <connection-string>
placeholder values with your OpenAI API key
and the SRV connection string
for your Atlas cluster. Your connection string should use
the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
Note
Your connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
Define a function to generate vector embeddings.
Create a directory in your project called
common
to store code you'll use in multiple steps:mkdir common && cd common Create a file named
get-embeddings.go
and paste the following code. This code defines a function namedGetEmbeddings
that uses OpenAI'stext-embedding-3-small
model to generate an embedding for a given input.get-embeddings.gopackage common import ( "context" "log" "github.com/milosgajdos/go-embeddings/openai" ) func GetEmbeddings(docs []string) [][]float64 { c := openai.NewClient() embReq := &openai.EmbeddingRequest{ Input: docs, Model: openai.TextSmallV3, EncodingFormat: openai.EncodingFloat, } embs, err := c.Embed(context.Background(), embReq) if err != nil { log.Fatalf("failed to connect to OpenAI: %v", err) } var vectors [][]float64 for _, emb := range embs { vectors = append(vectors, emb.Vector) } return vectors } Move back into the main project root directory.
cd ../
Create your Java project and install dependencies.
From your IDE, create a Java project using Maven or Gradle.
Add the following dependencies, depending on your package manager:
If you are using Maven, add the following dependencies to the
dependencies
array in your project'spom.xml
file:pom.xml<dependencies> <!-- MongoDB Java Sync Driver v5.2.0 or later --> <dependency> <groupId>org.mongodb</groupId> <artifactId>mongodb-driver-sync</artifactId> <version>[5.2.0,)</version> </dependency> <!-- Java library for working with Hugging Face models --> <dependency> <groupId>dev.langchain4j</groupId> <artifactId>langchain4j-hugging-face</artifactId> <version>0.35.0</version> </dependency> </dependencies> If you are using Gradle, add the following to the
dependencies
array in your project'sbuild.gradle
file:build.gradledependencies { // MongoDB Java Sync Driver v5.2.0 or later implementation 'org.mongodb:mongodb-driver-sync:[5.2.0,)' // Java library for working with Hugging Face models implementation 'dev.langchain4j:langchain4j-hugging-face:0.35.0' } Run your package manager to install the dependencies to your project.
Set your environment variables.
Note
This example sets the variables for the project in the IDE. Production applications might manage environment variables through a deployment configuration, CI/CD pipeline, or secrets manager, but you can adapt the provided code to fit your use case.
In your IDE, create a new configuration template and add the following variables to your project:
If you are using IntelliJ IDEA, create a new Application run configuration template, then add your variables as semicolon-separated values in the Environment variables field (for example,
FOO=123;BAR=456
). Apply the changes and click OK.To learn more, see the Create a run/debug configuration from a template section of the IntelliJ IDEA documentation.
If you are using Eclipse, create a new Java Application launch configuration, then add each variable as a new key-value pair in the Environment tab. Apply the changes and click OK.
To learn more, see the Creating a Java application launch configuration section of the Eclipse IDE documentation.
HUGGING_FACE_ACCESS_TOKEN=<access-token> ATLAS_CONNECTION_STRING=<connection-string>
Update the placeholders with the following values:
Replace the``<access-token>`` placeholder value with your Hugging Face access token.
Replace the
<connection-string>
placeholder value with the SRV connection string for your Atlas cluster.Your connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
Define a method to generate vector embeddings.
Create a file named EmbeddingProvider.java
and paste
the following code.
This code defines two methods to generate embeddings for a given input using the mxbai-embed-large-v1 open-source embedding model:
Multiple Inputs: The
getEmbeddings
method accepts an array of text inputs (List<String>
), allowing you to create multiple embeddings in a single API call. The method converts the API-provided arrays of floats to BSON arrays of doubles for storing in your Atlas cluster.Single Input: The
getEmbedding
method accepts a singleString
, which represents a query you want to make against your vector data. The method converts the API-provided array of floats to a BSON array of doubles to use when querying your collection.
import dev.langchain4j.data.embedding.Embedding; import dev.langchain4j.data.segment.TextSegment; import dev.langchain4j.model.huggingface.HuggingFaceEmbeddingModel; import dev.langchain4j.model.output.Response; import org.bson.BsonArray; import org.bson.BsonDouble; import java.util.List; import static java.time.Duration.ofSeconds; public class EmbeddingProvider { private static HuggingFaceEmbeddingModel embeddingModel; private static HuggingFaceEmbeddingModel getEmbeddingModel() { if (embeddingModel == null) { String accessToken = System.getenv("HUGGING_FACE_ACCESS_TOKEN"); if (accessToken == null || accessToken.isEmpty()) { throw new RuntimeException("HUGGING_FACE_ACCESS_TOKEN env variable is not set or is empty."); } embeddingModel = HuggingFaceEmbeddingModel.builder() .accessToken(accessToken) .modelId("mixedbread-ai/mxbai-embed-large-v1") .waitForModel(true) .timeout(ofSeconds(60)) .build(); } return embeddingModel; } /** * Takes an array of strings and returns a BSON array of embeddings to * store in the database. */ public List<BsonArray> getEmbeddings(List<String> texts) { List<TextSegment> textSegments = texts.stream() .map(TextSegment::from) .toList(); Response<List<Embedding>> response = getEmbeddingModel().embedAll(textSegments); return response.content().stream() .map(e -> new BsonArray( e.vectorAsList().stream() .map(BsonDouble::new) .toList())) .toList(); } /** * Takes a single string and returns a BSON array embedding to * use in a vector query. */ public BsonArray getEmbedding(String text) { Response<Embedding> response = getEmbeddingModel().embed(text); return new BsonArray( response.content().vectorAsList().stream() .map(BsonDouble::new) .toList()); } }
Create your Java project and install dependencies.
From your IDE, create a Java project using Maven or Gradle.
Add the following dependencies, depending on your package manager:
If you are using Maven, add the following dependencies to the
dependencies
array in your project'spom.xml
file:pom.xml<dependencies> <!-- MongoDB Java Sync Driver v5.2.0 or later --> <dependency> <groupId>org.mongodb</groupId> <artifactId>mongodb-driver-sync</artifactId> <version>[5.2.0,)</version> </dependency> <!-- Java library for working with OpenAI models --> <dependency> <groupId>dev.langchain4j</groupId> <artifactId>langchain4j-open-ai</artifactId> <version>0.35.0</version> </dependency> </dependencies> If you are using Gradle, add the following to the
dependencies
array in your project'sbuild.gradle
file:build.gradledependencies { // MongoDB Java Sync Driver v5.2.0 or later implementation 'org.mongodb:mongodb-driver-sync:[5.2.0,)' // Java library for working with OpenAI models implementation 'dev.langchain4j:langchain4j-open-ai:0.35.0' } Run your package manager to install the dependencies to your project.
Set your environment variables.
Note
This example sets the variables for the project in the IDE. Production applications might manage environment variables through a deployment configuration, CI/CD pipeline, or secrets manager, but you can adapt the provided code to fit your use case.
In your IDE, create a new configuration template and add the following variables to your project:
If you are using IntelliJ IDEA, create a new Application run configuration template, then add your variables as semicolon-separated values in the Environment variables field (for example,
FOO=123;BAR=456
). Apply the changes and click OK.To learn more, see the Create a run/debug configuration from a template section of the IntelliJ IDEA documentation.
If you are using Eclipse, create a new Java Application launch configuration, then add each variable as a new key-value pair in the Environment tab. Apply the changes and click OK.
To learn more, see the Creating a Java application launch configuration section of the Eclipse IDE documentation.
OPEN_AI_API_KEY=<api-key> ATLAS_CONNECTION_STRING=<connection-string>
Update the placeholders with the following values:
Replace the``<api-key>`` placeholder value with your OpenAI API key.
Replace the
<connection-string>
placeholder value with the SRV connection string for your Atlas cluster.Your connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
Define a method to generate vector embeddings.
Create a file named EmbeddingProvider.java
and paste
the following code.
This code defines two methods to generate embeddings for a given input using the text-embedding-3-small OpenAI embedding model:
Multiple Inputs: The
getEmbeddings
method accepts an array of text inputs (List<String>
), allowing you to create multiple embeddings in a single API call. The method converts the API-provided arrays of floats to BSON arrays of doubles for storing in your Atlas cluster.Single Input: The
getEmbedding
method accepts a singleString
, which represents a query you want to make against your vector data. The method converts the API-provided array of floats to a BSON array of doubles to use when querying your collection.
import dev.langchain4j.data.embedding.Embedding; import dev.langchain4j.data.segment.TextSegment; import dev.langchain4j.model.openai.OpenAiEmbeddingModel; import dev.langchain4j.model.output.Response; import org.bson.BsonArray; import org.bson.BsonDouble; import java.util.List; import static java.time.Duration.ofSeconds; public class EmbeddingProvider { private static OpenAiEmbeddingModel embeddingModel; private static OpenAiEmbeddingModel getEmbeddingModel() { if (embeddingModel == null) { String apiKey = System.getenv("OPEN_AI_API_KEY"); if (apiKey == null || apiKey.isEmpty()) { throw new IllegalStateException("OPEN_AI_API_KEY env variable is not set or is empty."); } return OpenAiEmbeddingModel.builder() .apiKey(apiKey) .modelName("text-embedding-3-small") .timeout(ofSeconds(60)) .build(); } return embeddingModel; } /** * Takes an array of strings and returns a BSON array of embeddings to * store in the database. */ public List<BsonArray> getEmbeddings(List<String> texts) { List<TextSegment> textSegments = texts.stream() .map(TextSegment::from) .toList(); Response<List<Embedding>> response = getEmbeddingModel().embedAll(textSegments); return response.content().stream() .map(e -> new BsonArray( e.vectorAsList().stream() .map(BsonDouble::new) .toList())) .toList(); } /** * Takes a single string and returns a BSON array embedding to * use in a vector query. */ public BsonArray getEmbedding(String text) { Response<Embedding> response = getEmbeddingModel().embed(text); return new BsonArray( response.content().vectorAsList().stream() .map(BsonDouble::new) .toList()); } }
Update your package.json
file.
Configure your project to use ES modules
by adding "type": "module"
to your package.json
file
and then saving it.
{ "type": "module", // other fields... }
Create a .env
file.
In your project, create a .env
file to store your Atlas connection
string.
ATLAS_CONNECTION_STRING = "<connection-string>"
Replace the <connection-string>
placeholder value with the
SRV connection string
for your Atlas cluster. Your connection string should use
the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
Note
Minimum Node.js Version Requirements
Node.js v20.x introduced the --env-file
option. If you are using an
older version of Node.js, add the dotenv
package to your project, or
use a different method to manage your environment variables.
Define a function to generate vector embeddings.
Create a file named get-embeddings.js
and paste
the following code. This code defines a function named
to generate an embedding for a given input. This function specifies:
The
feature-extraction
task from Hugging Face's transformers.js library. To learn more, see Tasks.The nomic-embed-text-v1 embedding model.
import { pipeline } from '@xenova/transformers'; // Function to generate embeddings for a given data source export async function getEmbedding(data) { const embedder = await pipeline( 'feature-extraction', 'Xenova/nomic-embed-text-v1'); const results = await embedder(data, { pooling: 'mean', normalize: true }); return Array.from(results.data); }
Update your package.json
file.
Configure your project to use ES modules
by adding "type": "module"
to your package.json
file
and then saving it.
{ "type": "module", // other fields... }
Create a .env
file.
In your project, create a .env
file to store your Atlas connection
string and OpenAI API key.
OPENAI_API_KEY = "<api-key>" ATLAS_CONNECTION_STRING = "<connection-string>"
Replace the <api-key>
and <connection-string>
placeholder values with your OpenAI
API key and the SRV connection string
for your Atlas cluster. Your connection string should use
the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
Note
Minimum Node.js Version Requirements
Node.js v20.x introduced the --env-file
option. If you are using an
older version of Node.js, add the dotenv
package to your project, or
use a different method to manage your environment variables.
Define a function to generate vector embeddings.
Create a file named get-embeddings.js
and paste
the following code. This code defines a function named getEmbedding
that uses OpenAI's text-embedding-3-small
model to generate an
embedding for a given input.
import OpenAI from 'openai'; // Setup OpenAI configuration const openai = new OpenAI({apiKey: process.env.OPENAI_API_KEY}); // Function to get the embeddings using the OpenAI API export async function getEmbedding(text) { const results = await openai.embeddings.create({ model: "text-embedding-3-small", input: text, encoding_format: "float", }); return results.data[0].embedding; }
Define a function to generate vector embeddings.
Paste and run the following code in your notebook to create a function that generates vector embeddings by using an open-source embedding model from Nomic AI. This code does the following:
Loads the nomic-embed-text-v1 embedding model.
Creates a function named
get_embedding
that uses the model to generate an embedding for a given text input.Tests the function by generating a single embedding for the string
foo
.
from sentence_transformers import SentenceTransformer # Load the embedding model (https://huggingface.co/nomic-ai/nomic-embed-text-v1") model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True) # Define a function to generate embeddings def get_embedding(data): """Generates vector embeddings for the given data.""" embedding = model.encode(data) return embedding.tolist() # Generate an embedding get_embedding("foo")
[-0.029808253049850464, 0.03841473162174225, -0.02561120130121708, -0.06707508116960526, 0.03867151960730553, ... ]
Define a function to generate vector embeddings.
Paste and run the following code in your notebook to create
a function that generates vector embeddings by using a
proprietary embedding model from
OpenAI.
Replace <api-key>
with your OpenAI API key.
This code does the following:
Specifies the
text-embedding-3-small
embedding model.Creates a function named
get_embedding
that calls the model's API to generate an embedding for a given text input.Tests the function by generating a single embedding for the string
foo
.
import os from openai import OpenAI # Specify your OpenAI API key and embedding model os.environ["OPENAI_API_KEY"] = "<api-key>" model = "text-embedding-3-small" openai_client = OpenAI() # Define a function to generate embeddings def get_embedding(text): """Generates vector embeddings for the given text.""" embedding = openai_client.embeddings.create(input = [text], model=model).data[0].embedding return embedding # Generate an embedding get_embedding("foo")
[-0.005843308754265308, -0.013111298903822899, -0.014585349708795547, 0.03580040484666824, 0.02671629749238491, ... ]
Create Embeddings from Data
In this section, you create vector embeddings from your data using the function that you defined, and then you store these embeddings in a collection in Atlas.
Select a tab based on whether you want to create embeddings from new data or from existing data that you already have in Atlas.
Create a file named create-embeddings.go
and paste the following code.
Use the following code to generate embeddings from an existing collection in Atlas.
Specifically, this code uses the GetEmbeddings
function
that you defined and the MongoDB Go Driver
to generate embeddings from an array
of sample texts and ingest them into the sample_db.embeddings
collection in Atlas.
package main import ( "context" "fmt" "log" "my-embeddings-project/common" "os" "github.com/joho/godotenv" "go.mongodb.org/mongo-driver/mongo" "go.mongodb.org/mongo-driver/mongo/options" ) var data = []string{ "Titanic: The story of the 1912 sinking of the largest luxury liner ever built", "The Lion King: Lion cub and future king Simba searches for his identity", "Avatar: A marine is dispatched to the moon Pandora on a unique mission", } type TextWithEmbedding struct { Text string Embedding []float32 } func main() { ctx := context.Background() if err := godotenv.Load(); err != nil { log.Println("no .env file found") } // Connect to your Atlas cluster uri := os.Getenv("ATLAS_CONNECTION_STRING") if uri == "" { log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.") } clientOptions := options.Client().ApplyURI(uri) client, err := mongo.Connect(ctx, clientOptions) if err != nil { log.Fatalf("failed to connect to the server: %v", err) } defer func() { _ = client.Disconnect(ctx) }() // Set the namespace coll := client.Database("sample_db").Collection("embeddings") embeddings := common.GetEmbeddings(data) docsToInsert := make([]interface{}, len(embeddings)) for i, string := range data { docsToInsert[i] = TextWithEmbedding{ Text: string, Embedding: embeddings[i], } } result, err := coll.InsertMany(ctx, docsToInsert) if err != nil { log.Fatalf("failed to insert documents: %v", err) } fmt.Printf("Successfully inserted %v documents into Atlas\n", len(result.InsertedIDs)) }
package main import ( "context" "fmt" "log" "my-embeddings-project/common" "os" "github.com/joho/godotenv" "go.mongodb.org/mongo-driver/mongo" "go.mongodb.org/mongo-driver/mongo/options" ) var data = []string{ "Titanic: The story of the 1912 sinking of the largest luxury liner ever built", "The Lion King: Lion cub and future king Simba searches for his identity", "Avatar: A marine is dispatched to the moon Pandora on a unique mission", } type TextWithEmbedding struct { Text string Embedding []float64 } func main() { ctx := context.Background() if err := godotenv.Load(); err != nil { log.Println("no .env file found") } // Connect to your Atlas cluster uri := os.Getenv("ATLAS_CONNECTION_STRING") if uri == "" { log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.") } clientOptions := options.Client().ApplyURI(uri) client, err := mongo.Connect(ctx, clientOptions) if err != nil { log.Fatalf("failed to connect to the server: %v", err) } defer func() { _ = client.Disconnect(ctx) }() // Set the namespace coll := client.Database("sample_db").Collection("embeddings") embeddings := common.GetEmbeddings(data) docsToInsert := make([]interface{}, len(data)) for i, movie := range data { docsToInsert[i] = TextWithEmbedding{ Text: movie, Embedding: embeddings[i], } } result, err := coll.InsertMany(ctx, docsToInsert) if err != nil { log.Fatalf("failed to insert documents: %v", err) } fmt.Printf("Successfully inserted %v documents into Atlas\n", len(result.InsertedIDs)) }
Save and run the file.
go run create-embeddings.go
Successfully inserted 3 documents into Atlas
go run create-embeddings.go
Successfully inserted 3 documents into Atlas
You can also view your vector embeddings in the Atlas UI by navigating to the sample_db.embeddings
collection in your cluster.
Note
This example uses the sample_airbnb.listingsAndReviews
collection from our sample data,
but you can adapt the code to work with any collection
in your cluster.
Create a file named create-embeddings.go
and paste the following code.
Use the following code to generate embeddings from an existing collection in Atlas. Specifically, this code does the following:
Connects to your Atlas cluster.
Gets a subset of documents from the
sample_airbnb.listingsAndReviews
collection that have a non-emptysummary
field.Generates embeddings from each document's
summary
field by using theGetEmbeddings
function that you defined.Updates each document with a new
embeddings
field that contains the embedding value by using the MongoDB Go Driver.
package main import ( "context" "log" "my-embeddings-project/common" "os" "github.com/joho/godotenv" "go.mongodb.org/mongo-driver/bson" "go.mongodb.org/mongo-driver/mongo" "go.mongodb.org/mongo-driver/mongo/options" ) func main() { ctx := context.Background() if err := godotenv.Load(); err != nil { log.Println("no .env file found") } // Connect to your Atlas cluster uri := os.Getenv("ATLAS_CONNECTION_STRING") if uri == "" { log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.") } clientOptions := options.Client().ApplyURI(uri) client, err := mongo.Connect(ctx, clientOptions) if err != nil { log.Fatalf("failed to connect to the server: %v", err) } defer func() { _ = client.Disconnect(ctx) }() // Set the namespace coll := client.Database("sample_airbnb").Collection("listingsAndReviews") filter := bson.D{ {"$and", bson.A{ bson.D{ {"$and", bson.A{ bson.D{{"summary", bson.D{{"$exists", true}}}}, bson.D{{"summary", bson.D{{"$ne", ""}}}}, }, }}, bson.D{{"embeddings", bson.D{{"$exists", false}}}}, }}, } opts := options.Find().SetLimit(50) cursor, err := coll.Find(ctx, filter, opts) if err != nil { log.Fatalf("failed to retrieve documents: %v", err) } var listings []common.Listing if err = cursor.All(ctx, &listings); err != nil { log.Fatalf("failed to unmarshal retrieved documents to Listing object: %v", err) } var summaries []string for _, listing := range listings { summaries = append(summaries, listing.Summary) } log.Println("Generating embeddings.") embeddings := common.GetEmbeddings(summaries) docsToUpdate := make([]mongo.WriteModel, len(listings)) for i := range listings { docsToUpdate[i] = mongo.NewUpdateOneModel(). SetFilter(bson.D{{"_id", listings[i].ID}}). SetUpdate(bson.D{{"$set", bson.D{{"embeddings", embeddings[i]}}}}) } bulkWriteOptions := options.BulkWrite().SetOrdered(false) result, err := coll.BulkWrite(context.Background(), docsToUpdate, bulkWriteOptions) if err != nil { log.Fatalf("failed to write embeddings to existing documents: %v", err) } log.Printf("Successfully added embeddings to %v documents", result.ModifiedCount) }
Create a file that contains Go models for the collection.
To simplify marshalling and unmarshalling Go objects to and from BSON, create a file that contains models for the documents in this collection.
Move into the
common
directory.cd common Create a file named
models.go
, and paste the following code into it:models.gopackage common import ( "time" "go.mongodb.org/mongo-driver/bson/primitive" ) type Image struct { ThumbnailURL string `bson:"thumbnail_url"` MediumURL string `bson:"medium_url"` PictureURL string `bson:"picture_url"` XLPictureURL string `bson:"xl_picture_url"` } type Host struct { ID string `bson:"host_id"` URL string `bson:"host_url"` Name string `bson:"host_name"` Location string `bson:"host_location"` About string `bson:"host_about"` ThumbnailURL string `bson:"host_thumbnail_url"` PictureURL string `bson:"host_picture_url"` Neighborhood string `bson:"host_neighborhood"` IsSuperhost bool `bson:"host_is_superhost"` HasProfilePic bool `bson:"host_has_profile_pic"` IdentityVerified bool `bson:"host_identity_verified"` ListingsCount int32 `bson:"host_listings_count"` TotalListingsCount int32 `bson:"host_total_listings_count"` Verifications []string `bson:"host_verifications"` } type Location struct { Type string `bson:"type"` Coordinates []float64 `bson:"coordinates"` IsLocationExact bool `bson:"is_location_exact"` } type Address struct { Street string `bson:"street"` Suburb string `bson:"suburb"` GovernmentArea string `bson:"government_area"` Market string `bson:"market"` Country string `bson:"Country"` CountryCode string `bson:"country_code"` Location Location `bson:"location"` } type Availability struct { Thirty int32 `bson:"availability_30"` Sixty int32 `bson:"availability_60"` Ninety int32 `bson:"availability_90"` ThreeSixtyFive int32 `bson:"availability_365"` } type ReviewScores struct { Accuracy int32 `bson:"review_scores_accuracy"` Cleanliness int32 `bson:"review_scores_cleanliness"` CheckIn int32 `bson:"review_scores_checkin"` Communication int32 `bson:"review_scores_communication"` Location int32 `bson:"review_scores_location"` Value int32 `bson:"review_scores_value"` Rating int32 `bson:"review_scores_rating"` } type Review struct { ID string `bson:"_id"` Date time.Time `bson:"date,omitempty"` ListingId string `bson:"listing_id"` ReviewerId string `bson:"reviewer_id"` ReviewerName string `bson:"reviewer_name"` Comments string `bson:"comments"` } type Listing struct { ID string `bson:"_id"` ListingURL string `bson:"listing_url"` Name string `bson:"name"` Summary string `bson:"summary"` Space string `bson:"space"` Description string `bson:"description"` NeighborhoodOverview string `bson:"neighborhood_overview"` Notes string `bson:"notes"` Transit string `bson:"transit"` Access string `bson:"access"` Interaction string `bson:"interaction"` HouseRules string `bson:"house_rules"` PropertyType string `bson:"property_type"` RoomType string `bson:"room_type"` BedType string `bson:"bed_type"` MinimumNights string `bson:"minimum_nights"` MaximumNights string `bson:"maximum_nights"` CancellationPolicy string `bson:"cancellation_policy"` LastScraped time.Time `bson:"last_scraped,omitempty"` CalendarLastScraped time.Time `bson:"calendar_last_scraped,omitempty"` FirstReview time.Time `bson:"first_review,omitempty"` LastReview time.Time `bson:"last_review,omitempty"` Accommodates int32 `bson:"accommodates"` Bedrooms int32 `bson:"bedrooms"` Beds int32 `bson:"beds"` NumberOfReviews int32 `bson:"number_of_reviews"` Bathrooms primitive.Decimal128 `bson:"bathrooms"` Amenities []string `bson:"amenities"` Price primitive.Decimal128 `bson:"price"` WeeklyPrice primitive.Decimal128 `bson:"weekly_price"` MonthlyPrice primitive.Decimal128 `bson:"monthly_price"` CleaningFee primitive.Decimal128 `bson:"cleaning_fee"` ExtraPeople primitive.Decimal128 `bson:"extra_people"` GuestsIncluded primitive.Decimal128 `bson:"guests_included"` Image Image `bson:"images"` Host Host `bson:"host"` Address Address `bson:"address"` Availability Availability `bson:"availability"` ReviewScores ReviewScores `bson:"review_scores"` Reviews []Review `bson:"reviews"` Embeddings []float32 `bson:"embeddings,omitempty"` } models.gopackage common import ( "time" "go.mongodb.org/mongo-driver/bson/primitive" ) type Image struct { ThumbnailURL string `bson:"thumbnail_url"` MediumURL string `bson:"medium_url"` PictureURL string `bson:"picture_url"` XLPictureURL string `bson:"xl_picture_url"` } type Host struct { ID string `bson:"host_id"` URL string `bson:"host_url"` Name string `bson:"host_name"` Location string `bson:"host_location"` About string `bson:"host_about"` ThumbnailURL string `bson:"host_thumbnail_url"` PictureURL string `bson:"host_picture_url"` Neighborhood string `bson:"host_neighborhood"` IsSuperhost bool `bson:"host_is_superhost"` HasProfilePic bool `bson:"host_has_profile_pic"` IdentityVerified bool `bson:"host_identity_verified"` ListingsCount int32 `bson:"host_listings_count"` TotalListingsCount int32 `bson:"host_total_listings_count"` Verifications []string `bson:"host_verifications"` } type Location struct { Type string `bson:"type"` Coordinates []float64 `bson:"coordinates"` IsLocationExact bool `bson:"is_location_exact"` } type Address struct { Street string `bson:"street"` Suburb string `bson:"suburb"` GovernmentArea string `bson:"government_area"` Market string `bson:"market"` Country string `bson:"Country"` CountryCode string `bson:"country_code"` Location Location `bson:"location"` } type Availability struct { Thirty int32 `bson:"availability_30"` Sixty int32 `bson:"availability_60"` Ninety int32 `bson:"availability_90"` ThreeSixtyFive int32 `bson:"availability_365"` } type ReviewScores struct { Accuracy int32 `bson:"review_scores_accuracy"` Cleanliness int32 `bson:"review_scores_cleanliness"` CheckIn int32 `bson:"review_scores_checkin"` Communication int32 `bson:"review_scores_communication"` Location int32 `bson:"review_scores_location"` Value int32 `bson:"review_scores_value"` Rating int32 `bson:"review_scores_rating"` } type Review struct { ID string `bson:"_id"` Date time.Time `bson:"date,omitempty"` ListingId string `bson:"listing_id"` ReviewerId string `bson:"reviewer_id"` ReviewerName string `bson:"reviewer_name"` Comments string `bson:"comments"` } type Listing struct { ID string `bson:"_id"` ListingURL string `bson:"listing_url"` Name string `bson:"name"` Summary string `bson:"summary"` Space string `bson:"space"` Description string `bson:"description"` NeighborhoodOverview string `bson:"neighborhood_overview"` Notes string `bson:"notes"` Transit string `bson:"transit"` Access string `bson:"access"` Interaction string `bson:"interaction"` HouseRules string `bson:"house_rules"` PropertyType string `bson:"property_type"` RoomType string `bson:"room_type"` BedType string `bson:"bed_type"` MinimumNights string `bson:"minimum_nights"` MaximumNights string `bson:"maximum_nights"` CancellationPolicy string `bson:"cancellation_policy"` LastScraped time.Time `bson:"last_scraped,omitempty"` CalendarLastScraped time.Time `bson:"calendar_last_scraped,omitempty"` FirstReview time.Time `bson:"first_review,omitempty"` LastReview time.Time `bson:"last_review,omitempty"` Accommodates int32 `bson:"accommodates"` Bedrooms int32 `bson:"bedrooms"` Beds int32 `bson:"beds"` NumberOfReviews int32 `bson:"number_of_reviews"` Bathrooms primitive.Decimal128 `bson:"bathrooms"` Amenities []string `bson:"amenities"` Price primitive.Decimal128 `bson:"price"` WeeklyPrice primitive.Decimal128 `bson:"weekly_price"` MonthlyPrice primitive.Decimal128 `bson:"monthly_price"` CleaningFee primitive.Decimal128 `bson:"cleaning_fee"` ExtraPeople primitive.Decimal128 `bson:"extra_people"` GuestsIncluded primitive.Decimal128 `bson:"guests_included"` Image Image `bson:"images"` Host Host `bson:"host"` Address Address `bson:"address"` Availability Availability `bson:"availability"` ReviewScores ReviewScores `bson:"review_scores"` Reviews []Review `bson:"reviews"` Embeddings []float64 `bson:"embeddings,omitempty"` } Move back into the project root directory.
cd ../
Generate embeddings.
go run create-embeddings.go
2024/10/10 09:58:03 Generating embeddings. 2024/10/10 09:58:12 Successfully added embeddings to 50 documents
You can view your vector embeddings as they generate by
navigating to the sample_airbnb.listingsAndReviews
collection
in the Atlas UI.
Define code to generate embeddings from an existing collection in Atlas.
Create a file named CreateEmbeddings.java
and paste the following code.
This code uses the getEmbeddings
method and the MongoDB
Java Sync Driver to do the following:
Connect to your Atlas cluster.
Get the array of sample texts.
Generate embeddings from each text using the
getEmbeddings
method that you defined previously.Ingest the embeddings into the
sample_db.embeddings
collection in Atlas.
import com.mongodb.MongoException; import com.mongodb.client.MongoClient; import com.mongodb.client.MongoClients; import com.mongodb.client.MongoCollection; import com.mongodb.client.MongoDatabase; import com.mongodb.client.result.InsertManyResult; import org.bson.BsonArray; import org.bson.Document; import java.util.ArrayList; import java.util.Arrays; import java.util.List; public class CreateEmbeddings { static List<String> data = Arrays.asList( "Titanic: The story of the 1912 sinking of the largest luxury liner ever built", "The Lion King: Lion cub and future king Simba searches for his identity", "Avatar: A marine is dispatched to the moon Pandora on a unique mission" ); public static void main(String[] args){ String uri = System.getenv("ATLAS_CONNECTION_STRING"); if (uri == null || uri.isEmpty()) { throw new RuntimeException("ATLAS_CONNECTION_STRING env variable is not set or is empty."); } // establish connection and set namespace try (MongoClient mongoClient = MongoClients.create(uri)) { MongoDatabase database = mongoClient.getDatabase("sample_db"); MongoCollection<Document> collection = database.getCollection("embeddings"); System.out.println("Creating embeddings for " + data.size() + " documents"); EmbeddingProvider embeddingProvider = new EmbeddingProvider(); // generate embeddings for new inputted data List<BsonArray> embeddings = embeddingProvider.getEmbeddings(data); List<Document> documents = new ArrayList<>(); int i = 0; for (String text : data) { Document doc = new Document("text", text).append("embedding", embeddings.get(i)); documents.add(doc); i++; } // insert the embeddings into the Atlas collection List<String> insertedIds = new ArrayList<>(); try { InsertManyResult result = collection.insertMany(documents); result.getInsertedIds().values() .forEach(doc -> insertedIds.add(doc.toString())); System.out.println("Inserted " + insertedIds.size() + " documents with the following ids to " + collection.getNamespace() + " collection: \n " + insertedIds); } catch (MongoException me) { throw new RuntimeException("Failed to insert documents", me); } } catch (MongoException me) { throw new RuntimeException("Failed to connect to MongoDB ", me); } catch (Exception e) { throw new RuntimeException("Operation failed: ", e); } } }
Generate the embeddings.
Save and run the file. The output resembles:
Creating embeddings for 3 documents Inserted 3 documents with the following ids to sample_db.embeddings collection: [BsonObjectId{value=6735ff620d88451041f6dd40}, BsonObjectId{value=6735ff620d88451041f6dd41}, BsonObjectId{value=6735ff620d88451041f6dd42}]
You can also view your vector embeddings in the Atlas UI by navigating to the sample_db.embeddings
collection in your cluster.
Note
This example uses the sample_airbnb.listingsAndReviews
collection from our sample data,
but you can adapt the code to work with any collection
in your cluster.
Define code to generate embeddings from an existing collection in Atlas.
Create a file named CreateEmbeddings.java
and paste the following code.
This code uses the getEmbeddings
method and the MongoDB Java Sync Driver
to do the following:
Connect to your Atlas cluster.
Get a subset of documents from the
sample_airbnb.listingsAndReviews
collection that have a non-emptysummary
field.Generate embeddings from each document's
summary
field using thegetEmbeddings
method that you defined previously.Update each document with a new
embeddings
field that contains the embedding value.
import com.mongodb.MongoException; import com.mongodb.bulk.BulkWriteResult; import com.mongodb.client.MongoClient; import com.mongodb.client.MongoClients; import com.mongodb.client.MongoCollection; import com.mongodb.client.MongoCursor; import com.mongodb.client.MongoDatabase; import com.mongodb.client.model.BulkWriteOptions; import com.mongodb.client.model.Filters; import com.mongodb.client.model.UpdateOneModel; import com.mongodb.client.model.Updates; import com.mongodb.client.model.WriteModel; import org.bson.BsonArray; import org.bson.Document; import org.bson.conversions.Bson; import java.util.ArrayList; import java.util.List; public class CreateEmbeddings { public static void main(String[] args){ String uri = System.getenv("ATLAS_CONNECTION_STRING"); if (uri == null || uri.isEmpty()) { throw new RuntimeException("ATLAS_CONNECTION_STRING env variable is not set or is empty."); } // establish connection and set namespace try (MongoClient mongoClient = MongoClients.create(uri)) { MongoDatabase database = mongoClient.getDatabase("sample_airbnb"); MongoCollection<Document> collection = database.getCollection("listingsAndReviews"); Bson filterCriteria = Filters.and( Filters.and(Filters.exists("summary"), Filters.ne("summary", null), Filters.ne("summary", "")), Filters.exists("embeddings", false)); try (MongoCursor<Document> cursor = collection.find(filterCriteria).limit(50).iterator()) { List<String> summaries = new ArrayList<>(); List<String> documentIds = new ArrayList<>(); int i = 0; while (cursor.hasNext()) { Document document = cursor.next(); String summary = document.getString("summary"); String id = document.get("_id").toString(); summaries.add(summary); documentIds.add(id); i++; } System.out.println("Generating embeddings for " + summaries.size() + " documents."); System.out.println("This operation may take up to several minutes."); EmbeddingProvider embeddingProvider = new EmbeddingProvider(); List<BsonArray> embeddings = embeddingProvider.getEmbeddings(summaries); List<WriteModel<Document>> updateDocuments = new ArrayList<>(); for (int j = 0; j < summaries.size(); j++) { UpdateOneModel<Document> updateDoc = new UpdateOneModel<>( Filters.eq("_id", documentIds.get(j)), Updates.set("embeddings", embeddings.get(j))); updateDocuments.add(updateDoc); } int updatedDocsCount = 0; try { BulkWriteOptions options = new BulkWriteOptions().ordered(false); BulkWriteResult result = collection.bulkWrite(updateDocuments, options); updatedDocsCount = result.getModifiedCount(); } catch (MongoException me) { throw new RuntimeException("Failed to insert documents", me); } System.out.println("Added embeddings successfully to " + updatedDocsCount + " documents."); } } catch (MongoException me) { throw new RuntimeException("Failed to connect to MongoDB", me); } catch (Exception e) { throw new RuntimeException("Operation failed: ", e); } } }
Generate the embeddings.
Save and run the file. The output resembles:
Generating embeddings for 50 documents. This operation may take up to several minutes. Added embeddings successfully to 50 documents.
You can also view your vector embeddings in the Atlas UI by navigating to the
sample_airbnb.listingsAndReviews
collection in your cluster.
Create a file named create-embeddings.js
and paste the following code.
Use the following code to generate embeddings from an existing collection in Atlas.
Specifically, this code uses the getEmbedding
function
that you defined and the MongoDB Node.js Driver
to generate embeddings from an array
of sample texts and ingest them into the sample_db.embeddings
collection in Atlas.
import { MongoClient } from 'mongodb'; import { getEmbedding } from './get-embeddings.js'; // Data to embed const data = [ "Titanic: The story of the 1912 sinking of the largest luxury liner ever built", "The Lion King: Lion cub and future king Simba searches for his identity", "Avatar: A marine is dispatched to the moon Pandora on a unique mission" ] async function run() { // Connect to your Atlas cluster const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING); try { await client.connect(); const db = client.db("sample_db"); const collection = db.collection("embeddings"); await Promise.all(data.map(async text => { // Check if the document already exists const existingDoc = await collection.findOne({ text: text }); // Generate an embedding by using the function that you defined const embedding = await getEmbedding(text); // Ingest data and embedding into Atlas if (!existingDoc) { await collection.insertOne({ text: text, embedding: embedding }); console.log(embedding); } })); } catch (err) { console.log(err.stack); } finally { await client.close(); } } run().catch(console.dir);
Save and run the file.
node --env-file=.env create-embeddings.js
[ -0.04323853924870491, -0.008460805751383305, 0.012494648806750774, -0.013014335185289383, ... ] [ -0.017400473356246948, 0.04922063276171684, -0.002836339408531785, -0.030395228415727615, ... ] [ -0.016950927674770355, 0.013881809078156948, -0.022074559703469276, -0.02838018536567688, ... ]
node --env-file=.env create-embeddings.js
[ 0.031927742, -0.014192767, -0.021851597, 0.045498233, -0.0077904654, ... ] [ -0.01664538, 0.013198251, 0.048684783, 0.014485021, -0.018121032, ... ] [ 0.030449908, 0.046782598, 0.02126599, 0.025799986, -0.015830345, ... ]
Note
The number of dimensions in the output have been truncated for readability.
You can also view your vector embeddings in the Atlas UI by navigating to the sample_db.embeddings
collection in your cluster.
Note
This example uses the sample_airbnb.listingsAndReviews
collection from our sample data,
but you can adapt the code to work with any collection
in your cluster.
Create a file named create-embeddings.js
and paste the following code.
Use the following code to generate embeddings from an existing collection in Atlas. Specifically, this code does the following:
Connects to your Atlas cluster.
Gets a subset of documents from the
sample_airbnb.listingsAndReviews
collection that have a non-emptysummary
field.Generates embeddings from each document's
summary
field by using thegetEmbedding
function that you defined.Updates each document with a new
embedding
field that contains the embedding value by using the MongoDB Node.js Driver.
import { MongoClient } from 'mongodb'; import { getEmbedding } from './get-embeddings.js'; // Connect to your Atlas cluster const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING); async function run() { try { await client.connect(); const db = client.db("sample_airbnb"); const collection = db.collection("listingsAndReviews"); // Filter to exclude null or empty summary fields const filter = { "summary": { "$nin": [ null, "" ] } }; // Get a subset of documents from the collection const documents = await collection.find(filter).limit(50).toArray(); // Create embeddings from a field in the collection let updatedDocCount = 0; console.log("Generating embeddings for documents..."); await Promise.all(documents.map(async doc => { // Generate an embedding by using the function that you defined const embedding = await getEmbedding(doc.summary); // Update the document with a new embedding field await collection.updateOne({ "_id": doc._id }, { "$set": { "embedding": embedding } } ); updatedDocCount += 1; })); console.log("Count of documents updated: " + updatedDocCount); } catch (err) { console.log(err.stack); } finally { await client.close(); } } run().catch(console.dir);
Save and run the file.
node --env-file=.env create-embeddings.js
Generating embeddings for documents... Count of documents updated: 50
You can view your vector embeddings as they generate by
navigating to the sample_airbnb.listingsAndReviews
collection
in the Atlas UI and expanding
the fields in a document.
Paste the following code in your notebook.
Use the following code to generate embeddings from new data.
Specifically, this code uses the get_embedding
function
that you defined and the MongoDB PyMongo Driver to
generate embeddings from an array of sample texts and
ingest them into the sample_db.embeddings
collection.
import pymongo # Connect to your Atlas cluster mongo_client = pymongo.MongoClient("<connection-string>") db = mongo_client["sample_db"] collection = db["embeddings"] # Sample data data = [ "Titanic: The story of the 1912 sinking of the largest luxury liner ever built", "The Lion King: Lion cub and future king Simba searches for his identity", "Avatar: A marine is dispatched to the moon Pandora on a unique mission" ] # Ingest data into Atlas inserted_doc_count = 0 for text in data: embedding = get_embedding(text) collection.insert_one({ "text": text, "embedding": embedding }) inserted_doc_count += 1 print(f"Inserted {inserted_doc_count} documents.")
Inserted 3 documents.
Specify the connection string.
Replace <connection-string>
with your Atlas cluster's SRV
connection string.
Note
Your connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
Run the code.
You can verify your vector embeddings by viewing them
in the Atlas UI
by navigating to the sample_db.embeddings
collection in your cluster.
Note
This example uses the sample_airbnb.listingsAndReviews
collection from our sample data,
but you can adapt the code to work with any collection
in your cluster.
Paste the following code in your notebook.
Use the following code to generate embeddings from a field in an existing collection. Specifically, this code does the following:
Connects to your Atlas cluster.
Gets a subset of documents from the
sample_airbnb.listingsAndReviews
collection that have a non-emptysummary
field.Generates embeddings from each document's
summary
field by using theget_embedding
function that you defined.Updates each document with a new
embedding
field that contains the embedding value by using the MongoDB PyMongo Driver.
import pymongo # Connect to your Atlas cluster mongo_client = pymongo.MongoClient("<connection-string>") db = mongo_client["sample_airbnb"] collection = db["listingsAndReviews"] # Filter to exclude null or empty summary fields filter = { "summary": {"$nin": [ None, "" ]} } # Get a subset of documents in the collection documents = collection.find(filter).limit(50) # Update each document with a new embedding field updated_doc_count = 0 for doc in documents: embedding = get_embedding(doc["summary"]) collection.update_one( { "_id": doc["_id"] }, { "$set": { "embedding": embedding } } ) updated_doc_count += 1 print(f"Updated {updated_doc_count} documents.")
Updated 50 documents.
Specify the connection string.
Replace <connection-string>
with your Atlas cluster's SRV
connection string.
Note
Your connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
Run the code.
You can view your vector embeddings as they generate by
navigating to the sample_airbnb.listingsAndReviews
collection
in the Atlas UI
and expanding the fields in a document.
Create Embeddings for Queries
In this section, you index the vector embeddings in your collection and create an embedding that you use to run a sample vector search query.
When you run the query, Atlas Vector Search returns documents whose embeddings are closest in distance to the embedding from your vector search query. This indicates that they are similar in meaning.
Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Complete the following steps to create an index on the
sample_db.embeddings
collection that specifies the
embedding
field as the vector type
and the similarity measure as euclidean
.
Create a file named named
create-index.go
and paste the following code.create-index.gopackage main import ( "context" "fmt" "log" "os" "time" "github.com/joho/godotenv" "go.mongodb.org/mongo-driver/bson" "go.mongodb.org/mongo-driver/mongo" "go.mongodb.org/mongo-driver/mongo/options" ) func main() { ctx := context.Background() if err := godotenv.Load(); err != nil { log.Println("no .env file found") } // Connect to your Atlas cluster uri := os.Getenv("ATLAS_CONNECTION_STRING") if uri == "" { log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.") } clientOptions := options.Client().ApplyURI(uri) client, err := mongo.Connect(ctx, clientOptions) if err != nil { log.Fatalf("failed to connect to the server: %v", err) } defer func() { _ = client.Disconnect(ctx) }() // Set the namespace coll := client.Database("sample_db").Collection("embeddings") indexName := "vector_index" opts := options.SearchIndexes().SetName(indexName).SetType("vectorSearch") type vectorDefinitionField struct { Type string `bson:"type"` Path string `bson:"path"` NumDimensions int `bson:"numDimensions"` Similarity string `bson:"similarity"` } type vectorDefinition struct { Fields []vectorDefinitionField `bson:"fields"` } indexModel := mongo.SearchIndexModel{ Definition: vectorDefinition{ Fields: []vectorDefinitionField{{ Type: "vector", Path: "embedding", NumDimensions: <dimensions>, Similarity: "dotProduct"}}, }, Options: opts, } log.Println("Creating the index.") searchIndexName, err := coll.SearchIndexes().CreateOne(ctx, indexModel) if err != nil { log.Fatalf("failed to create the search index: %v", err) } // Await the creation of the index. log.Println("Polling to confirm successful index creation.") searchIndexes := coll.SearchIndexes() var doc bson.Raw for doc == nil { cursor, err := searchIndexes.List(ctx, options.SearchIndexes().SetName(searchIndexName)) if err != nil { fmt.Errorf("failed to list search indexes: %w", err) } if !cursor.Next(ctx) { break } name := cursor.Current.Lookup("name").StringValue() queryable := cursor.Current.Lookup("queryable").Boolean() if name == searchIndexName && queryable { doc = cursor.Current } else { time.Sleep(5 * time.Second) } } log.Println("Name of Index Created: " + searchIndexName) } Replace the
<dimensions>
placeholder value with1024
if you used the open-source model and1536
if you used the model from OpenAI.Save the file, then run the following command:
go run create-index.go 2024/10/09 17:38:51 Creating the index. 2024/10/09 17:38:52 Polling to confirm successful index creation. 2024/10/09 17:39:22 Name of Index Created: vector_index
Note
The index should take about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.
Create embeddings for vector search queries and run a query.
Create a file named named
vector-query.go
and paste the following code.To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the
queryVector
field in your aggregation pipeline.Runs a sample vector search query. This query uses the the
$vectorSearch
stage to perform an ENN search over your embeddings. It returns semantically similar documents in order of relevance and their vector search score.Note
Your output might vary since environmental differences can introduce slight variations to your embeddings.
To learn more, see Run Vector Search Queries.
vector-query.gopackage main import ( "context" "fmt" "log" "my-embeddings-project/common" "os" "github.com/joho/godotenv" "go.mongodb.org/mongo-driver/bson" "go.mongodb.org/mongo-driver/mongo" "go.mongodb.org/mongo-driver/mongo/options" ) type TextAndScore struct { Text string `bson:"text"` Score float32 `bson:"score"` } func main() { ctx := context.Background() // Connect to your Atlas cluster if err := godotenv.Load(); err != nil { log.Println("no .env file found") } // Connect to your Atlas cluster uri := os.Getenv("ATLAS_CONNECTION_STRING") if uri == "" { log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.") } clientOptions := options.Client().ApplyURI(uri) client, err := mongo.Connect(ctx, clientOptions) if err != nil { log.Fatalf("failed to connect to the server: %v", err) } defer func() { _ = client.Disconnect(ctx) }() // Set the namespace coll := client.Database("sample_db").Collection("embeddings") query := "ocean tragedy" queryEmbedding := common.GetEmbeddings([]string{query}) pipeline := mongo.Pipeline{ bson.D{ {"$vectorSearch", bson.D{ {"queryVector", queryEmbedding[0]}, {"index", "vector_index"}, {"path", "embedding"}, {"exact", true}, {"limit", 5}, }}, }, bson.D{ {"$project", bson.D{ {"_id", 0}, {"text", 1}, {"score", bson.D{ {"$meta", "vectorSearchScore"}, }}, }}, }, } // Run the pipeline cursor, err := coll.Aggregate(ctx, pipeline) if err != nil { log.Fatalf("failed to run aggregation: %v", err) } defer func() { _ = cursor.Close(ctx) }() var matchingDocs []TextAndScore if err = cursor.All(ctx, &matchingDocs); err != nil { log.Fatalf("failed to unmarshal results to TextAndScore objects: %v", err) } for _, doc := range matchingDocs { fmt.Printf("Text: %v\nScore: %v\n", doc.Text, doc.Score) } } vector-query.gopackage main import ( "context" "fmt" "log" "my-embeddings-project/common" "os" "github.com/joho/godotenv" "go.mongodb.org/mongo-driver/bson" "go.mongodb.org/mongo-driver/mongo" "go.mongodb.org/mongo-driver/mongo/options" ) type TextAndScore struct { Text string `bson:"text"` Score float64 `bson:"score"` } func main() { ctx := context.Background() // Connect to your Atlas cluster if err := godotenv.Load(); err != nil { log.Println("no .env file found") } // Connect to your Atlas cluster uri := os.Getenv("ATLAS_CONNECTION_STRING") if uri == "" { log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.") } clientOptions := options.Client().ApplyURI(uri) client, err := mongo.Connect(ctx, clientOptions) if err != nil { log.Fatalf("failed to connect to the server: %v", err) } defer func() { _ = client.Disconnect(ctx) }() // Set the namespace coll := client.Database("sample_db").Collection("embeddings") query := "ocean tragedy" queryEmbedding := common.GetEmbeddings([]string{query}) pipeline := mongo.Pipeline{ bson.D{ {"$vectorSearch", bson.D{ {"queryVector", queryEmbedding[0]}, {"index", "vector_index"}, {"path", "embedding"}, {"exact", true}, {"limit", 5}, }}, }, bson.D{ {"$project", bson.D{ {"_id", 0}, {"text", 1}, {"score", bson.D{ {"$meta", "vectorSearchScore"}, }}, }}, }, } // Run the pipeline cursor, err := coll.Aggregate(ctx, pipeline) if err != nil { log.Fatalf("failed to run aggregation: %v", err) } defer func() { _ = cursor.Close(ctx) }() var matchingDocs []TextAndScore if err = cursor.All(ctx, &matchingDocs); err != nil { log.Fatalf("failed to unmarshal results to TextAndScore objects: %v", err) } for _, doc := range matchingDocs { fmt.Printf("Text: %v\nScore: %v\n", doc.Text, doc.Score) } } Save the file, then run the following command:
go run vector-query.go Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built Score: 0.0042472864 Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission Score: 0.0031167597 Text: The Lion King: Lion cub and future king Simba searches for his identity Score: 0.0024476869 go run vector-query.go Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built Score: 0.4552372694015503 Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission Score: 0.4050072133541107 Text: The Lion King: Lion cub and future king Simba searches for his identity Score: 0.35942140221595764
Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Complete the following steps to create an index on the
sample_airbnb.listingsAndReviews
collection that specifies the
embeddings
field as the vector type
and the similarity measure as euclidean
.
Create a file named named
create-index.go
and paste the following code.create-index.gopackage main import ( "context" "fmt" "log" "os" "time" "github.com/joho/godotenv" "go.mongodb.org/mongo-driver/bson" "go.mongodb.org/mongo-driver/mongo" "go.mongodb.org/mongo-driver/mongo/options" ) func main() { ctx := context.Background() if err := godotenv.Load(); err != nil { log.Println("no .env file found") } // Connect to your Atlas cluster uri := os.Getenv("ATLAS_CONNECTION_STRING") if uri == "" { log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.") } clientOptions := options.Client().ApplyURI(uri) client, err := mongo.Connect(ctx, clientOptions) if err != nil { log.Fatalf("failed to connect to the server: %v", err) } defer func() { _ = client.Disconnect(ctx) }() // Set the namespace coll := client.Database("sample_airbnb").Collection("listingsAndReviews") indexName := "vector_index" opts := options.SearchIndexes().SetName(indexName).SetType("vectorSearch") type vectorDefinitionField struct { Type string `bson:"type"` Path string `bson:"path"` NumDimensions int `bson:"numDimensions"` Similarity string `bson:"similarity"` } type vectorDefinition struct { Fields []vectorDefinitionField `bson:"fields"` } indexModel := mongo.SearchIndexModel{ Definition: vectorDefinition{ Fields: []vectorDefinitionField{{ Type: "vector", Path: "embeddings", NumDimensions: <dimensions>, Similarity: "dotProduct"}}, }, Options: opts, } log.Println("Creating the index.") searchIndexName, err := coll.SearchIndexes().CreateOne(ctx, indexModel) if err != nil { log.Fatalf("failed to create the search index: %v", err) } // Await the creation of the index. log.Println("Polling to confirm successful index creation.") searchIndexes := coll.SearchIndexes() var doc bson.Raw for doc == nil { cursor, err := searchIndexes.List(ctx, options.SearchIndexes().SetName(searchIndexName)) if err != nil { fmt.Errorf("failed to list search indexes: %w", err) } if !cursor.Next(ctx) { break } name := cursor.Current.Lookup("name").StringValue() queryable := cursor.Current.Lookup("queryable").Boolean() if name == searchIndexName && queryable { doc = cursor.Current } else { time.Sleep(5 * time.Second) } } log.Println("Name of Index Created: " + searchIndexName) } Replace the
<dimensions>
placeholder value with1024
if you used the open-source model and1536
if you used the model from OpenAI.Save the file, then run the following command:
go run create-index.go 2024/10/10 10:03:12 Creating the index. 2024/10/10 10:03:13 Polling to confirm successful index creation. 2024/10/10 10:03:44 Name of Index Created: vector_index
Note
The index should take about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.
Create embeddings for vector search queries and run a query.
Create a file named named
vector-query.go
and paste the following code.To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the
queryVector
field in your aggregation pipeline.Runs a sample vector search query. This query uses the the
$vectorSearch
stage to perform an ENN search over your embeddings. It returns semantically similar documents in order of relevance and their vector search score.Note
Your output might vary since environmental differences can introduce slight variations to your embeddings.
To learn more, see Run Vector Search Queries.
vector-query.gopackage main import ( "context" "fmt" "log" "my-embeddings-project/common" "os" "github.com/joho/godotenv" "go.mongodb.org/mongo-driver/bson" "go.mongodb.org/mongo-driver/mongo" "go.mongodb.org/mongo-driver/mongo/options" ) type SummaryAndScore struct { Summary string `bson:"summary"` Score float32 `bson:"score"` } func main() { ctx := context.Background() // Connect to your Atlas cluster if err := godotenv.Load(); err != nil { log.Println("no .env file found") } // Connect to your Atlas cluster uri := os.Getenv("ATLAS_CONNECTION_STRING") if uri == "" { log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.") } clientOptions := options.Client().ApplyURI(uri) client, err := mongo.Connect(ctx, clientOptions) if err != nil { log.Fatalf("failed to connect to the server: %v", err) } defer func() { _ = client.Disconnect(ctx) }() // Set the namespace coll := client.Database("sample_airbnb").Collection("listingsAndReviews") query := "beach house" queryEmbedding := common.GetEmbeddings([]string{query}) pipeline := mongo.Pipeline{ bson.D{ {"$vectorSearch", bson.D{ {"queryVector", queryEmbedding[0]}, {"index", "vector_index"}, {"path", "embeddings"}, {"exact", true}, {"limit", 5}, }}, }, bson.D{ {"$project", bson.D{ {"_id", 0}, {"summary", 1}, {"score", bson.D{ {"$meta", "vectorSearchScore"}, }}, }}, }, } // Run the pipeline cursor, err := coll.Aggregate(ctx, pipeline) if err != nil { log.Fatalf("failed to run aggregation: %v", err) } defer func() { _ = cursor.Close(ctx) }() var matchingDocs []SummaryAndScore if err = cursor.All(ctx, &matchingDocs); err != nil { log.Fatalf("failed to unmarshal results to SummaryAndScore objects: %v", err) } for _, doc := range matchingDocs { fmt.Printf("Summary: %v\nScore: %v\n", doc.Summary, doc.Score) } } vector-query.gopackage main import ( "context" "fmt" "log" "my-embeddings-project/common" "os" "github.com/joho/godotenv" "go.mongodb.org/mongo-driver/bson" "go.mongodb.org/mongo-driver/mongo" "go.mongodb.org/mongo-driver/mongo/options" ) type SummaryAndScore struct { Summary string `bson:"summary"` Score float64 `bson:"score"` } func main() { ctx := context.Background() // Connect to your Atlas cluster if err := godotenv.Load(); err != nil { log.Println("no .env file found") } // Connect to your Atlas cluster uri := os.Getenv("ATLAS_CONNECTION_STRING") if uri == "" { log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.") } clientOptions := options.Client().ApplyURI(uri) client, err := mongo.Connect(ctx, clientOptions) if err != nil { log.Fatalf("failed to connect to the server: %v", err) } defer func() { _ = client.Disconnect(ctx) }() // Set the namespace coll := client.Database("sample_airbnb").Collection("listingsAndReviews") query := "beach house" queryEmbedding := common.GetEmbeddings([]string{query}) pipeline := mongo.Pipeline{ bson.D{ {"$vectorSearch", bson.D{ {"queryVector", queryEmbedding[0]}, {"index", "vector_index"}, {"path", "embeddings"}, {"exact", true}, {"limit", 5}, }}, }, bson.D{ {"$project", bson.D{ {"_id", 0}, {"summary", 1}, {"score", bson.D{ {"$meta", "vectorSearchScore"}, }}, }}, }, } // Run the pipeline cursor, err := coll.Aggregate(ctx, pipeline) if err != nil { log.Fatalf("failed to run aggregation: %v", err) } defer func() { _ = cursor.Close(ctx) }() var matchingDocs []SummaryAndScore if err = cursor.All(ctx, &matchingDocs); err != nil { log.Fatalf("failed to unmarshal results to SummaryAndScore objects: %v", err) } for _, doc := range matchingDocs { fmt.Printf("Summary: %v\nScore: %v\n", doc.Summary, doc.Score) } } Save the file, then run the following command:
go run vector-query.go Summary: Near to underground metro station. Walking distance to seaside. 2 floors 1 entry. Husband, wife, girl and boy is living. Score: 0.0045180833 Summary: A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door. Score: 0.004480799 Summary: Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway. Score: 0.0042421296 Summary: Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily. Score: 0.004227752 Summary: A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses. Score: 0.0042201905 go run vector-query.go Summary: A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses. Score: 0.4832950830459595 Summary: Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily. Score: 0.48093676567077637 Summary: THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!! Score: 0.4629695415496826 Summary: A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door. Score: 0.45800843834877014 Summary: The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2. Score: 0.45398443937301636
Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Complete the following steps to create an index on the
sample_db.embeddings
collection that specifies the
embedding
field as the vector type
and the similarity measure as euclidean
.
Create a file named
CreateIndex.java
and paste the following code:CreateIndex.javaimport com.mongodb.MongoException; import com.mongodb.client.ListSearchIndexesIterable; import com.mongodb.client.MongoClient; import com.mongodb.client.MongoClients; import com.mongodb.client.MongoCollection; import com.mongodb.client.MongoCursor; import com.mongodb.client.MongoDatabase; import com.mongodb.client.model.SearchIndexModel; import com.mongodb.client.model.SearchIndexType; import org.bson.Document; import org.bson.conversions.Bson; import java.util.Collections; import java.util.List; public class CreateIndex { public static void main(String[] args) { String uri = System.getenv("ATLAS_CONNECTION_STRING"); if (uri == null || uri.isEmpty()) { throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty."); } // establish connection and set namespace try (MongoClient mongoClient = MongoClients.create(uri)) { MongoDatabase database = mongoClient.getDatabase("sample_db"); MongoCollection<Document> collection = database.getCollection("embeddings"); // define the index details String indexName = "vector_index"; int dimensionsHuggingFaceModel = 1024; int dimensionsOpenAiModel = 1536; Bson definition = new Document( "fields", Collections.singletonList( new Document("type", "vector") .append("path", "embedding") .append("numDimensions", <dimensions>) // replace with var for the model used .append("similarity", "dotProduct"))); // define the index model using the specified details SearchIndexModel indexModel = new SearchIndexModel( indexName, definition, SearchIndexType.vectorSearch()); // Create the index using the model try { List<String> result = collection.createSearchIndexes(Collections.singletonList(indexModel)); System.out.println("Successfully created a vector index named: " + result); } catch (Exception e) { throw new RuntimeException(e); } // Wait for Atlas to build the index and make it queryable System.out.println("Polling to confirm the index has completed building."); System.out.println("It may take up to a minute for the index to build before you can query using it."); ListSearchIndexesIterable<Document> searchIndexes = collection.listSearchIndexes(); Document doc = null; while (doc == null) { try (MongoCursor<Document> cursor = searchIndexes.iterator()) { if (!cursor.hasNext()) { break; } Document current = cursor.next(); String name = current.getString("name"); boolean queryable = current.getBoolean("queryable"); if (name.equals(indexName) && queryable) { doc = current; } else { Thread.sleep(500); } } catch (Exception e) { throw new RuntimeException(e); } } System.out.println(indexName + " index is ready to query"); } catch (MongoException me) { throw new RuntimeException("Failed to connect to MongoDB ", me); } catch (Exception e) { throw new RuntimeException("Operation failed: ", e); } } } Replace the
<dimensions>
placeholder value with the appropriate variable for the model you used:dimensionsHuggingFaceModel
:1024
dimensions ("mixedbread-ai/mxbai-embed-large-v1" model)dimensionsOpenAiModel
:1536
dimensions ("text-embedding-3-small" model)
Note
The number of dimensions is determined by the model used to generate the embeddings. If you adapt this code to use a different model, ensure that you pass the correct value to
numDimensions
. See also the Choosing an Embedding Model section.Save and run the file. The output resembles:
Successfully created a vector index named: [vector_index] Polling to confirm the index has completed building. It may take up to a minute for the index to build before you can query using it. vector_index index is ready to query
Note
The index should take about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.
Create embeddings for vector search queries and run a query.
Create a file named named
VectorQuery.java
and paste the following code.To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the
queryVector
field in your aggregation pipeline.Runs a sample vector search query. This query uses the the
$vectorSearch
stage to perform an ENN search over your embeddings. It returns semantically similar documents in order of relevance and their vector search score.Note
Your output might vary since environmental differences can introduce slight variations to your embeddings.
To learn more, see Run Vector Search Queries.
VectorQuery.javaimport com.mongodb.MongoException; import com.mongodb.client.MongoClient; import com.mongodb.client.MongoClients; import com.mongodb.client.MongoCollection; import com.mongodb.client.MongoDatabase; import com.mongodb.client.model.search.FieldSearchPath; import org.bson.BsonArray; import org.bson.BsonValue; import org.bson.Document; import org.bson.conversions.Bson; import java.util.ArrayList; import java.util.List; import static com.mongodb.client.model.Aggregates.project; import static com.mongodb.client.model.Aggregates.vectorSearch; import static com.mongodb.client.model.Projections.exclude; import static com.mongodb.client.model.Projections.fields; import static com.mongodb.client.model.Projections.include; import static com.mongodb.client.model.Projections.metaVectorSearchScore; import static com.mongodb.client.model.search.SearchPath.fieldPath; import static com.mongodb.client.model.search.VectorSearchOptions.exactVectorSearchOptions; import static java.util.Arrays.asList; public class VectorQuery { public static void main(String[] args) { String uri = System.getenv("ATLAS_CONNECTION_STRING"); if (uri == null || uri.isEmpty()) { throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty."); } // establish connection and set namespace try (MongoClient mongoClient = MongoClients.create(uri)) { MongoDatabase database = mongoClient.getDatabase("sample_db"); MongoCollection<Document> collection = database.getCollection("embeddings"); // define $vectorSearch query options String query = "ocean tragedy"; EmbeddingProvider embeddingProvider = new EmbeddingProvider(); BsonArray embeddingBsonArray = embeddingProvider.getEmbedding(query); List<Double> embedding = new ArrayList<>(); for (BsonValue value : embeddingBsonArray.stream().toList()) { embedding.add(value.asDouble().getValue()); } // define $vectorSearch pipeline String indexName = "vector_index"; FieldSearchPath fieldSearchPath = fieldPath("embedding"); int limit = 5; List<Bson> pipeline = asList( vectorSearch( fieldSearchPath, embedding, indexName, limit, exactVectorSearchOptions() ), project( fields(exclude("_id"), include("text"), metaVectorSearchScore("score")))); // run query and print results List<Document> results = collection.aggregate(pipeline).into(new ArrayList<>()); if (results.isEmpty()) { System.out.println("No results found."); } else { results.forEach(doc -> { System.out.println("Text: " + doc.getString("text")); System.out.println("Score: " + doc.getDouble("score")); }); } } catch (MongoException me) { throw new RuntimeException("Failed to connect to MongoDB ", me); } catch (Exception e) { throw new RuntimeException("Operation failed: ", e); } } } Save and run the file. The output resembles one of the following, depending on the model you used:
Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built Score: 0.004247286356985569 Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission Score: 0.003116759704425931 Text: The Lion King: Lion cub and future king Simba searches for his identity Score: 0.002447686856612563 Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built Score: 0.45522359013557434 Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission Score: 0.4049977660179138 Text: The Lion King: Lion cub and future king Simba searches for his identity Score: 0.35942474007606506
Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Complete the following steps to create an index on the
sample_airbnb.listingsAndReviews
collection that specifies the
embeddings
field as the vector type
and the similarity measure as euclidean
.
Create a file named
CreateIndex.java
and paste the following code:CreateIndex.javaimport com.mongodb.MongoException; import com.mongodb.client.MongoClient; import com.mongodb.client.MongoClients; import com.mongodb.client.MongoCollection; import com.mongodb.client.MongoDatabase; import com.mongodb.client.ListSearchIndexesIterable; import com.mongodb.client.MongoCursor; import com.mongodb.client.model.SearchIndexModel; import com.mongodb.client.model.SearchIndexType; import org.bson.Document; import org.bson.conversions.Bson; import java.util.Collections; import java.util.List; public class CreateIndex { public static void main(String[] args) { String uri = System.getenv("ATLAS_CONNECTION_STRING"); if (uri == null || uri.isEmpty()) { throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty."); } // establish connection and set namespace try (MongoClient mongoClient = MongoClients.create(uri)) { MongoDatabase database = mongoClient.getDatabase("sample_airbnb"); MongoCollection<Document> collection = database.getCollection("listingsAndReviews"); // define the index details String indexName = "vector_index"; int dimensionsHuggingFaceModel = 1024; int dimensionsOpenAiModel = 1536; Bson definition = new Document( "fields", Collections.singletonList( new Document("type", "vector") .append("path", "embeddings") .append("numDimensions", <dimensions>) // replace with var for the model used .append("similarity", "dotProduct"))); // define the index model using the specified details SearchIndexModel indexModel = new SearchIndexModel( indexName, definition, SearchIndexType.vectorSearch()); // create the index using the model try { List<String> result = collection.createSearchIndexes(Collections.singletonList(indexModel)); System.out.println("Successfully created a vector index named: " + result); System.out.println("It may take up to a minute for the index to build before you can query using it."); } catch (Exception e) { throw new RuntimeException(e); } // wait for Atlas to build the index and make it queryable System.out.println("Polling to confirm the index has completed building."); ListSearchIndexesIterable<Document> searchIndexes = collection.listSearchIndexes(); Document doc = null; while (doc == null) { try (MongoCursor<Document> cursor = searchIndexes.iterator()) { if (!cursor.hasNext()) { break; } Document current = cursor.next(); String name = current.getString("name"); boolean queryable = current.getBoolean("queryable"); if (name.equals(indexName) && queryable) { doc = current; } else { Thread.sleep(500); } } catch (Exception e) { throw new RuntimeException(e); } } System.out.println(indexName + " index is ready to query"); } catch (MongoException me) { throw new RuntimeException("Failed to connect to MongoDB ", me); } catch (Exception e) { throw new RuntimeException("Operation failed: ", e); } } } Replace the
<dimensions>
placeholder value with the appropriate variable for the model you used:dimensionsHuggingFaceModel
:1024
dimensions (open-source)dimensionsOpenAiModel
:1536
dimensions
Note
The number of dimensions is determined by the model used to generate the embeddings. If you are using a different model, ensure that you pass the correct value to
numDimensions
. See also the Choosing an Embedding Model section.Save and run the file. The output resembles:
Successfully created a vector index named: [vector_index] Polling to confirm the index has completed building. It may take up to a minute for the index to build before you can query using it. vector_index index is ready to query
Note
The index should take about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.
Create embeddings for vector search queries and run a query.
Create a file named named
VectorQuery.java
and paste the following code.To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the
queryVector
field in your aggregation pipeline.Runs a sample vector search query. This query uses the the
$vectorSearch
stage to perform an ENN search over your embeddings. It returns semantically similar documents in order of relevance and their vector search score.Note
Your output might vary since environmental differences can introduce slight variations to your embeddings.
To learn more, see Run Vector Search Queries.
VectorQuery.javaimport com.mongodb.MongoException; import com.mongodb.client.MongoClient; import com.mongodb.client.MongoClients; import com.mongodb.client.MongoCollection; import com.mongodb.client.MongoDatabase; import com.mongodb.client.model.search.FieldSearchPath; import org.bson.BsonArray; import org.bson.BsonValue; import org.bson.Document; import org.bson.conversions.Bson; import java.util.ArrayList; import java.util.List; import static com.mongodb.client.model.Aggregates.project; import static com.mongodb.client.model.Aggregates.vectorSearch; import static com.mongodb.client.model.Projections.exclude; import static com.mongodb.client.model.Projections.fields; import static com.mongodb.client.model.Projections.include; import static com.mongodb.client.model.Projections.metaVectorSearchScore; import static com.mongodb.client.model.search.SearchPath.fieldPath; import static com.mongodb.client.model.search.VectorSearchOptions.exactVectorSearchOptions; import static java.util.Arrays.asList; public class VectorQuery { public static void main(String[] args) { String uri = System.getenv("ATLAS_CONNECTION_STRING"); if (uri == null || uri.isEmpty()) { throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty."); } // establish connection and set namespace try (MongoClient mongoClient = MongoClients.create(uri)) { MongoDatabase database = mongoClient.getDatabase("sample_airbnb"); MongoCollection<Document> collection = database.getCollection("listingsAndReviews"); // define the query and get the embedding String query = "beach house"; EmbeddingProvider embeddingProvider = new EmbeddingProvider(); BsonArray embeddingBsonArray = embeddingProvider.getEmbedding(query); List<Double> embedding = new ArrayList<>(); for (BsonValue value : embeddingBsonArray.stream().toList()) { embedding.add(value.asDouble().getValue()); } // define $vectorSearch pipeline String indexName = "vector_index"; FieldSearchPath fieldSearchPath = fieldPath("embeddings"); int limit = 5; List<Bson> pipeline = asList( vectorSearch( fieldSearchPath, embedding, indexName, limit, exactVectorSearchOptions()), project( fields(exclude("_id"), include("summary"), metaVectorSearchScore("score")))); // run query and print results List<Document> results = collection.aggregate(pipeline).into(new ArrayList<>()); if (results.isEmpty()) { System.out.println("No results found."); } else { results.forEach(doc -> { System.out.println("Summary: " + doc.getString("summary")); System.out.println("Score: " + doc.getDouble("score")); }); } } catch (MongoException me) { throw new RuntimeException("Failed to connect to MongoDB ", me); } catch (Exception e) { throw new RuntimeException("Operation failed: ", e); } } } Save and run the file. The output resembles one of the following, depending on the model you used:
Summary: Near to underground metro station. Walking distance to seaside. 2 floors 1 entry. Husband, wife, girl and boy is living. Score: 0.004518083296716213 Summary: A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door. Score: 0.0044807991944253445 Summary: Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway. Score: 0.004242129623889923 Summary: Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily. Score: 0.004227751865983009 Summary: A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses. Score: 0.004220190457999706 Summary: A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses. Score: 0.4832950830459595 Summary: Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily. Score: 0.48092085123062134 Summary: THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!! Score: 0.4629460275173187 Summary: A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door. Score: 0.4581468403339386 Summary: The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2. Score: 0.45398443937301636
Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Complete the following steps to create an index on the
sample_db.embeddings
collection that specifies the
embedding
field as the vector type
and the similarity measure as euclidean
.
Create a file named named
create-index.js
and paste the following code.create-index.jsimport { MongoClient } from 'mongodb'; // connect to your Atlas deployment const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING); async function run() { try { const database = client.db("sample_db"); const collection = database.collection("embeddings"); // define your Atlas Vector Search index const index = { name: "vector_index", type: "vectorSearch", definition: { "fields": [ { "type": "vector", "path": "embedding", "similarity": "dotProduct", "numDimensions": <dimensions> } ] } } // run the helper method const result = await collection.createSearchIndex(index); console.log(result); } finally { await client.close(); } } run().catch(console.dir); Replace the
<dimensions>
placeholder value with768
if you used the open-source model and1536
if you used the model from OpenAI.Save the file, then run the following command:
node --env-file=.env create-index.js
Note
The index should take about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.
Create embeddings for vector search queries and run a query.
Create a file named named
vector-query.js
and paste the following code.To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the
queryVector
field in your aggregation pipeline.Runs a sample vector search query. This query uses the the
$vectorSearch
stage to perform an ENN search over your embeddings. It returns semantically similar documents in order of relevance and their vector search score.Note
Your output might vary since environmental differences can introduce slight variations to your embeddings.
To learn more, see Run Vector Search Queries.
vector-query.jsimport { MongoClient } from 'mongodb'; import { getEmbedding } from './get-embeddings.js'; // MongoDB connection URI and options const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING); async function run() { try { // Connect to the MongoDB client await client.connect(); // Specify the database and collection const database = client.db("sample_db"); const collection = database.collection("embeddings"); // Generate embedding for the search query const queryEmbedding = await getEmbedding("ocean tragedy"); // Define the sample vector search pipeline const pipeline = [ { $vectorSearch: { index: "vector_index", queryVector: queryEmbedding, path: "embedding", exact: true, limit: 5 } }, { $project: { _id: 0, text: 1, score: { $meta: "vectorSearchScore" } } } ]; // run pipeline const result = collection.aggregate(pipeline); // print results for await (const doc of result) { console.dir(JSON.stringify(doc)); } } finally { await client.close(); } } run().catch(console.dir); Save the file, then run the following command:
node --env-file=.env vector-query.js '{"text":"Titanic: The story of the 1912 sinking of the largest luxury liner ever built","score":0.5103757977485657}' '{"text":"Avatar: A marine is dispatched to the moon Pandora on a unique mission","score":0.4616812467575073}' '{"text":"The Lion King: Lion cub and future king Simba searches for his identity","score":0.4115804433822632}' node --env-file=.env vector-query.js {"text":"Titanic: The story of the 1912 sinking of the largest luxury liner ever built","score":0.4551968574523926} {"text":"Avatar: A marine is dispatched to the moon Pandora on a unique mission","score":0.4050074517726898} {"text":"The Lion King: Lion cub and future king Simba searches for his identity","score":0.3594386577606201}
Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Complete the following steps to create an index on the
sample_airbnb.listingsAndReviews
collection that specifies the
embedding
field as the vector type
and the similarity measure as euclidean
.
Create a file named named
create-index.js
and paste the following code.create-index.jsimport { MongoClient } from 'mongodb'; // connect to your Atlas deployment const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING); async function run() { try { const database = client.db("sample_airbnb"); const collection = database.collection("listingsAndReviews"); // Define your Atlas Vector Search index const index = { name: "vector_index", type: "vectorSearch", definition: { "fields": [ { "type": "vector", "path": "embedding", "similarity": "dotProduct", "numDimensions": <dimensions> } ] } } // Call the method to create the index const result = await collection.createSearchIndex(index); console.log(result); } finally { await client.close(); } } run().catch(console.dir); Replace the
<dimensions>
placeholder value with768
if you used the open-source model and1536
if you used the model from OpenAI.Save the file, then run the following command:
node --env-file=.env create-index.js
Note
The index should take about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.
Create embeddings for vector search queries and run a query.
Create a file named named
vector-query.js
and paste the following code.To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the
queryVector
field in your aggregation pipeline.Runs a sample vector search query. This query uses the the
$vectorSearch
stage to perform an ENN search over your embeddings. It returns semantically similar documents in order of relevance and their vector search score.Note
Your output might vary since environmental differences can introduce slight variations to your embeddings.
To learn more, see Run Vector Search Queries.
vector-query.jsimport { MongoClient } from 'mongodb'; import { getEmbedding } from './get-embeddings.js'; // MongoDB connection URI and options const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING); async function run() { try { // Connect to the MongoDB client await client.connect(); // Specify the database and collection const database = client.db("sample_airbnb"); const collection = database.collection("listingsAndReviews"); // Generate embedding for the search query const queryEmbedding = await getEmbedding("beach house"); // Define the sample vector search pipeline const pipeline = [ { $vectorSearch: { index: "vector_index", queryVector: queryEmbedding, path: "embedding", exact: true, limit: 5 } }, { $project: { _id: 0, summary: 1, score: { $meta: "vectorSearchScore" } } } ]; // run pipeline const result = collection.aggregate(pipeline); // print results for await (const doc of result) { console.dir(JSON.stringify(doc)); } } finally { await client.close(); } } run().catch(console.dir); Save the file, then run the following command:
node --env-file=.env vector-query.js '{"summary":"Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.","score":0.5334879159927368}' '{"summary":"A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.","score":0.5240535736083984}' '{"summary":"The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.","score":0.5232879519462585}' '{"summary":"Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.","score":0.5186381340026855}' '{"summary":"A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.","score":0.5078228116035461}' node --env-file=.env vector-query.js {"summary": "A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.", "score": 0.483333021402359} {"summary": "Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.", "score": 0.48092877864837646} {"summary": "THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!", "score": 0.46294474601745605} {"summary": "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.", "score": 0.4580020606517792} {"summary": "The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.", "score": 0.45400717854499817}
Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Paste the following code in your notebook.
This code creates an index on your collection that specifies the
embedding
field as the vector type, the similarity function asdotProduct
, and the number of dimensions as768
.from pymongo.operations import SearchIndexModel # Create your index model, then create the search index search_index_model = SearchIndexModel( definition = { "fields": [ { "type": "vector", "path": "embedding", "similarity": "dotProduct", "numDimensions": 768 } ] }, name="vector_index", type="vectorSearch", ) collection.create_search_index(model=search_index_model) Run the code.
The index should take about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Paste the following code in your notebook.
This code creates an index on your collection that specifies the
embedding
field as the vector type, the similarity function asdotProduct
, and the number of dimensions as1536
.from pymongo.operations import SearchIndexModel # Create your index model, then create the search index search_index_model = SearchIndexModel( definition = { "fields": [ { "type": "vector", "path": "embedding", "similarity": "dotProduct", "numDimensions": 1536 } ] }, name="vector_index", type="vectorSearch", ) collection.create_search_index(model=search_index_model) Run the code.
The index should take about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.
Create embeddings for vector search queries and run a query.
To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the
queryVector
field in your aggregation pipeline.Runs a sample vector search query. This query uses the the
$vectorSearch
stage to perform an ENN search over your embeddings. It returns semantically similar documents in order of relevance and their vector search score.Note
Your output might vary since environmental differences can introduce slight variations to your embeddings.
To learn more, see Run Vector Search Queries.
# Generate embedding for the search query query_embedding = get_embedding("ocean tragedy") # Sample vector search pipeline pipeline = [ { "$vectorSearch": { "index": "vector_index", "queryVector": query_embedding, "path": "embedding", "exact": True, "limit": 5 } }, { "$project": { "_id": 0, "text": 1, "score": { "$meta": "vectorSearchScore" } } } ] # Execute the search results = collection.aggregate(pipeline) # Print results for i in results: print(i)
{"text": "Titanic: The story of the 1912 sinking of the largest luxury liner ever built", "score": 0.5166476964950562} {"text": "Avatar: A marine is dispatched to the moon Pandora on a unique mission", "score": 0.4587385058403015} {"text": "The Lion King: Lion cub and future king Simba searches for his identity", "score": 0.41374602913856506}
# Generate embedding for the search query query_embedding = get_embedding("ocean tragedy") # Sample vector search pipeline pipeline = [ { "$vectorSearch": { "index": "vector_index", "queryVector": query_embedding, "path": "embedding", "exact": True, "limit": 5 } }, { "$project": { "_id": 0, "text": 1, "score": { "$meta": "vectorSearchScore" } } } ] # Execute the search results = collection.aggregate(pipeline) # Print results for i in results: print(i)
{"text":"Titanic: The story of the 1912 sinking of the largest luxury liner ever built","score":0.4551968574523926} {"text":"Avatar: A marine is dispatched to the moon Pandora on a unique mission","score":0.4050074517726898} {"text":"The Lion King: Lion cub and future king Simba searches for his identity","score":0.3594386577606201}
Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Paste the following code in your notebook.
This code creates an index on your collection that specifies the
embedding
field as the vector type, the similarity function asdotProduct
, and the number of dimensions as768
.from pymongo.operations import SearchIndexModel # Create your index model, then create the search index search_index_model = SearchIndexModel( definition = { "fields": [ { "type": "vector", "path": "embedding", "similarity": "dotProduct", "numDimensions": 768 } ] }, name="vector_index", type="vectorSearch", ) collection.create_search_index(model=search_index_model) Run the code.
The index should take about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Paste the following code in your notebook.
This code creates an index on your collection that specifies the
embedding
field as the vector type, the similarity function asdotProduct
, and the number of dimensions as1536
.from pymongo.operations import SearchIndexModel # Create your index model, then create the search index search_index_model = SearchIndexModel( definition = { "fields": [ { "type": "vector", "path": "embedding", "similarity": "dotProduct", "numDimensions": 1536 } ] }, name="vector_index", type="vectorSearch", ) collection.create_search_index(model=search_index_model) Run the code.
The index should take about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.
Create embeddings for vector search queries and run a query.
To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the
queryVector
field in your aggregation pipeline.Runs a sample vector search query. This query uses the the
$vectorSearch
stage to perform an ENN search over your embeddings. It returns semantically similar documents in order of relevance and their vector search score.Note
Your output might vary since environmental differences can introduce slight variations to your embeddings.
To learn more, see Run Vector Search Queries.
# Generate embedding for the search query query_embedding = get_embedding("beach house") # Sample vector search pipeline pipeline = [ { "$vectorSearch": { "index": "vector_index", "queryVector": query_embedding, "path": "embedding", "exact": True, "limit": 5 } }, { "$project": { "_id": 0, "summary": 1, "score": { "$meta": "vectorSearchScore" } } } ] # Execute the search results = collection.aggregate(pipeline) # Print results for i in results: print(i)
{"summary": "Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.", "score": 0.5372996926307678} {"summary": "The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.", "score": 0.5297179818153381} {"summary": "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.", "score": 0.5234108567237854} {"summary": "Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.", "score": 0.5171462893486023} {"summary": "A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.", "score": 0.5095183253288269}
# Generate embedding for the search query query_embedding = get_embedding("beach house") # Sample vector search pipeline pipeline = [ { "$vectorSearch": { "index": "vector_index", "queryVector": query_embedding, "path": "embedding", "exact": True, "limit": 5 } }, { "$project": { "_id": 0, "summary": 1, "score": { "$meta": "vectorSearchScore" } } } ] # Execute the search results = collection.aggregate(pipeline) # Print results for i in results: print(i)
{"summary": "A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.", "score": 0.483333021402359} {"summary": "Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.", "score": 0.48092877864837646} {"summary": "THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!", "score": 0.46294474601745605} {"summary": "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.", "score": 0.4580020606517792} {"summary": "The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.", "score": 0.45400717854499817}
Considerations
Consider the following factors when creating vector embeddings:
Choosing a Method to Create Embeddings
In order to create vector embeddings, you must use an embedding model. Embedding models are algorithms that you use to convert your data into embeddings. You can choose one of the following methods to connect to an embedding model and create vector embeddings:
Method | Description |
---|---|
Load an open-source model | If you don't have an API key for a proprietary embedding model, load an open-source embedding model locally from your application. |
Use a proprietary model | Most AI providers offer APIs for their proprietary embedding models that you can use to create vector embeddings. |
Leverage an integration | You can integrate Atlas Vector Search with open-source frameworks and AI services to quickly connect to both open-source and proprietary embedding models and generate vector embeddings for Atlas Vector Search. To learn more, see Integrate Vector Search with AI Technologies. |
Choosing an Embedding Model
The embedding model you choose affects your query results and determines the number of dimensions you specify in your Atlas Vector Search index. Each model offers different advantages depending on your data and use case.
For a list of popular embedding models, see the Massive Text Embedding Benchmark (MTEB). This list provides insights into various open-source and proprietary text embedding models and allows you to filter models by use case, model type, and specific model metrics.
When choosing an embedding model for Atlas Vector Search, consider the following metrics:
Embedding Dimensions: The length of the vector embedding.
Smaller embeddings are more storage efficient, while larger embeddings can capture more nuanced relationships in your data. The model you choose should strike a balance between efficiency and complexity.
Max Tokens: The number of tokens that can be compressed in a single embedding.
A max token length of 512 can be used for most semantic search use cases, as you typically don't want more than a paragraph of text (~100 tokens) in a single embedding.
Model Size: The size of the model in gigabytes.
While larger models perform better, they require more computational resources as you scale Atlas Vector Search to production.
Retrieval Average: A score that measures the performance of retrieval systems.
A higher score indicates that the model is better at ranking relevant documents higher in the list of retrieved results. This score is important when choosing a model for RAG applications.
Validating Your Embeddings
Consider the following strategies to ensure that your embeddings are correct and optimal:
Test your functions and scripts.
Generating embeddings takes time and computational resources. Before you create embeddings from large datasets or collections, test that your embedding functions or scripts work as expected on a small subset of your data.
Create embeddings in batches.
If you want to generate embeddings from a large dataset or a collection with many documents, create them in batches to avoid memory issues and optimize performance.
Evaluate performance.
Run test queries to check if your search results are relevant and accurately ranked.
To learn more about how to evaluate your results and fine-tune the performance of your indexes and queries, see How to Measure the Accuracy of Your Query Results and Improve Vector Search Performance.
Troubleshooting
Consider the following strategies if you encounter issues with your embeddings:
Verify your environment.
Check that the necessary dependencies are installed and up-to-date. Conflicting library versions can cause unexpected behavior. Ensure that no conflicts exist by creating a new environment and installing only the required packages.
Note
If you're using Colab, ensure that your notebook session's IP address is included in your Atlas project's access list.
Monitor memory usage.
If you experience performance issues, check your RAM, CPU, and disk usage to identify any potential bottlenecks. For hosted environments like Colab or Jupyter Notebooks, ensure that your instance is provisioned with sufficient resources and upgrade the instance if necessary.
Ensure consistent dimensions.
Verify that the Atlas Vector Search index definition matches the dimensions of the embeddings stored in Atlas and your query embeddings match the dimensions of the indexed embeddings. Otherwise, you might encounter errors when running vector search queries.
To troubleshoot specific problems, see Troubleshooting.
Next Steps
Once you've learned how to create embeddings and query your embeddings with Atlas Vector Search, start building generative AI applications by implementing retrieval-augmented generation (RAG):
You can also convert your embeddings to BSON vectors for efficient storage and ingestion of vectors in Atlas. To learn more, see How to Ingest Pre-Quantized Vectors.