Semantic search with Jina Embeddings v2 and MongoDB Atlas

Scott Martens, Saahil Ognawala • 12 min read • Published Dec 05, 2023 • Updated Dec 05, 2023
Semantic search is a great ally for AI embeddings.
Using vectors to identify and rank matches has been a part of search for longer than AI has. The venerable tf/idf algorithm, which dates back to the 1960s, uses the counts of words, and sometimes parts of words and short combinations of words, to create representative vectors for text documents. It then uses the distance between vectors to find and rank potential query matches and compare documents to each other. It forms the basis of many information retrieval systems.
We call this “semantic search” because these vectors already have information about the meaning of documents built into them. Searching with semantic embeddings works the same way, but instead, the vectors come from AI models that do a much better job of making sense of the documents.
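To make the word-count idea concrete, here is a minimal sketch of tf/idf-style vectors compared by cosine similarity. The tokenizer (lowercase, split on whitespace), the smoothed weighting formula, and the three example texts are all simplifying assumptions for illustration, not the scheme used by any particular search engine.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse tf/idf vector (a word -> weight dict) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each word appears in.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            word: (count / len(tokens)) * (math.log(n_docs / doc_freq[word]) + 1.0)
            for word, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "quiet apartment with ocean view",
    "apartment near the ocean",
    "loud room above a nightclub",
]
vectors = tfidf_vectors(docs)
print(cosine_similarity(vectors[0], vectors[1]))  # shares "apartment" and "ocean"
print(cosine_similarity(vectors[0], vectors[2]))  # no words in common
```

The first pair shares two words and scores above zero, while the second pair shares none and scores zero. Word-count vectors cannot tell that two texts mean similar things when they use different words, which is the gap AI embeddings are meant to close.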
Because vector-based retrieval is a time-honored technique, there are database platforms that already have all the mechanics to do it. All you have to do is plug in your AI embeddings model.
This article will show you how to enhance MongoDB Atlas — an out-of-the-box, cloud-based solution for document retrieval — with Jina Embeddings’ top-of-the-line AI to produce your own killer search solution.

Setting up

You will first need a MongoDB Atlas account. Register for a new account or sign in using your Google account directly on the website.
Mongo Atlas sign-up screen

Create a project

Once logged in, you should see your Projects page. If not, use the navigation menu on the left to get to it.
Mongo Atlas Projects page
Create a new project by clicking the New Project button on the right.
Mongo Atlas "Create a Project" screen
You can add new members as you like, but you shouldn’t need to for this tutorial.
The "Add Members" screen of the "Create a Project" page

Create a deployment

This should return you to the Overview page where you can now create a deployment. Click the +Create button to do so.
"Create a Deployment" screen on the "Overview" page.
Select the M0 Free tier for this project and the provider of your choice, and then click the Create button at the bottom of the screen.
Mongo Atlas deployment screen
On the next screen, you will need to create a user with a username and secure password for this deployment. Do not lose this password and username! They are the only way you will be able to access your work.
Adding a user and configuring security settings for a Mongo Atlas deployment
Then, select access options. For this tutorial, we recommend selecting My Local Environment and clicking the Add My Current IP Address button.
Configuring access restrictions for a Mongo Atlas deployment
If you have a VPN or a more complex security topology, you may have to consult your system administrator to find out which IP address you should insert here instead of your current one.
After that, click Finish and Deploy at the bottom of the page. After a brief pause, you will have an empty MongoDB database deployed on Atlas for you to use.
Note: If you have difficulty accessing your database from outside, you can get rid of the IP Access List and accept connections from all IP addresses. Normally, this would be very poor security practice, but because this is a tutorial that uses publicly available sample data, there is little real risk.
To do this, click the Network Access tab under Security on the left side of the page:
The Network Access tab on the Mongo Atlas sidebar
Then, click ADD IP ADDRESS from the right side of the page:
Allowing access from all IP addresses, on the Network Access screen
You will get a modal window. Click the button marked ALLOW ACCESS FROM ANYWHERE, and then click Confirm.
Modal window for entering information about specific IP address restrictions
Your Network Access tab should now have an entry labeled 0.0.0.0/0.
This will allow any IP address to access your database if it has the right username and password.
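To see why the 0.0.0.0/0 entry matches every client, here is a small sketch using Python's standard ipaddress module. The function name is_allowed and the example addresses (taken from reserved documentation ranges) are assumptions for illustration. This only demonstrates CIDR matching and is not how Atlas itself is implemented.

```python
import ipaddress


def is_allowed(client_ip, access_list):
    """Return True if client_ip falls inside any CIDR block in access_list."""
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(entry) for entry in access_list)


# A narrow entry only admits addresses inside that block.
print(is_allowed("203.0.113.7", ["198.51.100.0/24"]))  # False
# 0.0.0.0/0 has a zero-length prefix, so every IPv4 address matches it.
print(is_allowed("203.0.113.7", ["0.0.0.0/0"]))  # True
```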

Adding Data

In this tutorial, we will be using a sample database of Airbnb reviews. You can add this to your database from the Database tab under Deployments in the menu on the left side of the screen. Once you are on the “Database Deployments” page, find your cluster (on the free tier, you are only allowed one, so it should be easy). Then, click the “three dots” button and choose Load Sample Data. It may take several minutes to load the data.
Loading sample data into a Mongo Atlas deployment
This will add a collection of free data sources to your MongoDB instance for you to experiment with, including a database of Airbnb reviews.

Using PyMongo to access your data

For the rest of this tutorial, we will use Python and PyMongo to access your new MongoDB Atlas database.
Make sure PyMongo is installed in your Python environment. You can do this with the following command:
pip install pymongo
You will also need to know:
  1. The username and password you set when you set up the database.
  2. The URL to access your database deployment.
If you have lost your username and password, click on the Database Access tab under Security on the left side of the page. That page will enable you to reset your password.
The Database Access tab on the Mongo Atlas sidebar
To get the URL to access your database, return to the Database tab under Deployment on the left side of the screen. Find your cluster, and look for the button labeled Connect. Click it.
The “Database Deployments” page of Mongo Atlas
You will see a modal pop-up window like the one below:
Modal window providing information on accessing a MongoDB Atlas deployment
Click Drivers under Connect to your application. You will see a modal window like the one below. Under number three, you will see the URL you need but without your password. You will need to add your password when using this URL.
Finding specific access information in the modal window

Connecting to your database

Create a file for a new Python script. You can call it test_mongo_connection.py.
Write into this file the following code, which uses PyMongo to create a client connection to your database:
from pymongo.mongo_client import MongoClient

client = MongoClient("<URL from above>")
Remember to insert the URL to connect to your database, including the correct username and password.
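If your password contains characters such as @ or /, pasting it directly into the URL can break the connection string. A minimal sketch of one way to handle this is to percent-encode the credentials with urllib.parse.quote_plus before inserting them. The helper name build_mongo_url, the <username> and <password> placeholders, and the example host name are assumptions; use the URL that Atlas shows you.

```python
from urllib.parse import quote_plus


def build_mongo_url(template, username, password):
    """Fill percent-encoded credentials into a connection string template."""
    return (
        template
        .replace("<username>", quote_plus(username))
        .replace("<password>", quote_plus(password))
    )


# Hypothetical template and credentials, for illustration only.
template = "mongodb+srv://<username>:<password>@cluster0.example.mongodb.net/"
print(build_mongo_url(template, "tutorial_user", "p@ss/word"))
# mongodb+srv://tutorial_user:p%40ss%2Fword@cluster0.example.mongodb.net/
```

The resulting string can then be passed to MongoClient in place of the hand-edited URL.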
Next, add code to connect to the Airbnb review dataset that was installed as sample data:
db = client.sample_airbnb
collection = db.listingsAndReviews
The variable collection is a handle to the listingsAndReviews collection. Calling collection.find() on it returns a cursor that yields the dataset item by item. To test that it works, add the following line and run test_mongo_connection.py:
print(collection.find_one())
This will print a Python dictionary that contains the information in one database entry, whichever one it happened to find first. It should look something like this:
{'_id': '10006546',
 'listing_url': 'https://www.airbnb.com/rooms/10006546',
 'name': 'Ribeira Charming Duplex',
 'summary': 'Fantastic duplex apartment with three bedrooms, located in the historic
 area of Porto, Ribeira (Cube) - UNESCO World Heritage Site. Centenary
 building fully rehabilitated, without losing their original character.',
 'space': 'Privileged views of the Douro River and Ribeira square, our apartment offers
 the perfect conditions to discover the history and the charm of Porto.
 Apartment comfortable, charming, romantic and cozy in the heart of Ribeira.
 Within walking distance of all the most emblematic places of the city of Porto.
 The apartment is fully equipped to host 8 people, with cooker, oven, washing
 machine, dishwasher, microwave, coffee machine (Nespresso) and kettle. The
 apartment is located in a very typical area of the city that allows to cross
 with the most picturesque population of the city, welcoming, genuine and happy
 people that fills the streets with his outspoken speech and contagious with
 your sincere generosity, wrapped in a only parochial spirit.',
 'description': 'Fantastic duplex apartment with three bedrooms, located in the historic
 area of Porto, Ribeira (Cube) - UNESCO World Heritage Site. Centenary
 building fully rehabilitated, without losing their original character.
 Privileged views of the Douro River and Ribeira square, our apartment
 offers the perfect conditions to discover the history and the charm of
 Porto. Apartment comfortable, charming, romantic and cozy in the heart of
 Ribeira. Within walking distance of all the most emblematic places of the
 city of Porto. The apartment is fully equipped to host 8 people, with
 cooker, oven, washing machine, dishwasher, microwave, coffee machine
 (Nespresso) and kettle. The apartment is located in a very typical area
 of the city that allows to cross with the most picturesque population of
 the city, welcoming, genuine and happy people that fills the streets with
 his outspoken speech and contagious with your sincere generosity, wrapped
 in a only parochial spirit. We are always available to help guests',
 ...
}
Getting a text response like this will show that you can connect to your MongoDB Atlas database.

Accessing Jina Embeddings v2

Go to the Jina AI embeddings website, and you will see a page like this:
Getting a token to access Jina Embeddings from the Jina AI website
Copy the API key from this page. It provides you with 10,000 tokens of free embedding using Jina Embeddings models. Due to this limitation on the number of tokens allowed to be used in the free tier, we will only embed a small part of the Airbnb reviews collection. You can buy additional quota by clicking the “Top up” tab on the Jina Embeddings web page if you want to either embed the entire collection on MongoDB Atlas or apply these steps to another dataset.
Test your API key by creating a new script called test_jina_ai_connection.py and putting the following code into it, inserting your API key where marked:
import requests

url = 'https://api.jina.ai/v1/embeddings'

headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer <insert your API key here>'
}

data = {
    'input': ["Your text string goes here"],
    'model': 'jina-embeddings-v2-base-en'
}

response = requests.post(url, headers=headers, json=data)

print(response.content)
Run the script test_jina_ai_connection.py. You should get something like this:
b'{"model":"jina-embeddings-v2-base-en","object":"list","usage":{"total_tokens":14,
"prompt_tokens":14},"data":[{"object":"embedding","index":0,"embedding":[-0.14528547,
-1.0152762,1.3449358,0.48228237,-0.6381836,0.25765118,0.1794826,-0.5094953,0.5967494,
...,
-0.30768695,0.34024483,-0.5897042,0.058436804,0.38593403,-0.7729841,-0.6259417]}]}'
This indicates you have access to Jina Embeddings via its API.
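The raw bytes are hard to read, so it can help to decode the response and pull out just the parts you need. The sketch below assumes the response has the shape shown above (a "data" list of items with "index" and "embedding" keys, plus a "usage" block). The helper name parse_embedding_response and the shortened sample payload are made up for illustration. In the test script, you would pass response.json() instead of the sample.

```python
def parse_embedding_response(payload):
    """Return (embeddings, total_tokens) from a decoded embeddings response."""
    # Sort by "index" so the embeddings line up with the order of the input texts.
    items = sorted(payload["data"], key=lambda item: item["index"])
    embeddings = [item["embedding"] for item in items]
    total_tokens = payload.get("usage", {}).get("total_tokens", 0)
    return embeddings, total_tokens


# Shortened stand-in for a real response; real vectors have many more values.
sample = {
    "model": "jina-embeddings-v2-base-en",
    "object": "list",
    "usage": {"total_tokens": 14, "prompt_tokens": 14},
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.3, 0.4]},
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2]},
    ],
}

embeddings, tokens_used = parse_embedding_response(sample)
print(len(embeddings), "embeddings,", tokens_used, "tokens used")
```

Keeping a running total of tokens_used is one way to watch how much of the free quota your experiments consume.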

Indexing your MongoDB collection

Now, we’re going to put all these pieces together with some Python functions to use Jina Embeddings to assign embedding vectors to descriptions in the Airbnb dataset.
Create a new Python script, call it index_embeddings.py, and insert some code to import libraries and declare some variables:
import requests
from pymongo.mongo_client import MongoClient

jinaai_token = "<your Jina token here>"
mongo_url = "<your MongoDB Atlas database URL>"
embedding_url = "https://api.jina.ai/v1/embeddings"
Then, add code to set up a MongoDB client and connect to the Airbnb dataset:
client = MongoClient(mongo_url)
db = client.sample_airbnb
Now, we will add to the script a function to convert lists of texts into embeddings using the jina-embeddings-v2-base-en AI model:
def generate_embeddings(texts):
    # Request one embedding per input text from the Jina Embeddings API.
    payload = {"input": texts,
               "model": "jina-embeddings-v2-base-en"}
    try:
        response = requests.post(
            embedding_url,
            headers={"Authorization": f"Bearer {jinaai_token}"},
            json=payload
        )
    except Exception as e:
        raise ValueError(f"Error in calling embedding API: {e}\nInput: {texts}")
    if response.status_code != 200:
        raise ValueError(f"Error in embedding service {response.status_code}: {response.text}, {texts}")
    embeddings = [d["embedding"] for d in response.json()["data"]]
    return embeddings
And we will create a function that iterates over up to 30 documents in the listings database, creating embeddings for the descriptions and summaries, and adding them to each entry in the database:
def index():
    collection = db.listingsAndReviews
    # Only fetch documents that do not have embeddings yet.
    docs_to_encode = collection.find({"embedding_summary": {"$exists": False}}).limit(30)
    for i, doc in enumerate(docs_to_encode):
        if i and i % 5 == 0:
            print("Finished embedding", i, "documents")
        try:
            embedding_summary, embedding_description = generate_embeddings([doc["summary"], doc["description"]])
        except Exception as e:
            print("Error in embedding", doc["_id"], e)
            continue
        doc["embedding_summary"] = embedding_summary
        doc["embedding_description"] = embedding_description
        collection.replace_one({'_id': doc['_id']}, doc)
With this in place, we can now index the collection:
index()
Run the script index_embeddings.py. This may take several minutes. When this finishes, we will have added embeddings to up to 30 of the Airbnb items.
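The index function writes each whole document back with replace_one. A narrower alternative, sketched here, is to send only the two new fields using PyMongo's update_one with a $set update. The helper name build_embedding_update and the short example vectors are assumptions for illustration. The tutorial's own code works as written.

```python
def build_embedding_update(doc_id, embedding_summary, embedding_description):
    """Return the (filter, update) pair for collection.update_one."""
    filter_doc = {"_id": doc_id}
    update_doc = {
        "$set": {
            "embedding_summary": embedding_summary,
            "embedding_description": embedding_description,
        }
    }
    return filter_doc, update_doc


# Short placeholder vectors; real embeddings from this model have 768 values.
filter_doc, update_doc = build_embedding_update("10006546", [0.1, 0.2], [0.3, 0.4])
print(filter_doc)
print(sorted(update_doc["$set"]))
```

Inside index(), the pair would be passed as collection.update_one(filter_doc, update_doc), which leaves all other fields of the listing untouched.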

Create the embedding index in MongoDB Atlas

Return to the MongoDB website, and click on Database under Deployment on the left side of the screen.
Creating an index on Mongo Atlas from the “Database Deployments” page
Click on the link for your cluster (Cluster0 in the image above). Find the Search tab in the cluster page and click it to get a page like this:
Creating an index from the Search tab on the page for a specific deployment
Click the button marked Create Search Index.
Configuring an index before creation
Now, click JSON Editor and then Next:
Configuring an index by specifying parameters in JSON format
Now, perform the following steps:
  1. Under Database and Collection, find sample_airbnb, and underneath it, check listingsAndReviews.
  2. Under Index Name, fill in the name listings_comments_semantic_search.
  3. Underneath that, in the numbered lines, add the following JSON text:
{
  "mappings": {
    "dynamic": true,
    "fields": {
      "embedding_description": {
        "dimensions": 768,
        "similarity": "dotProduct",
        "type": "knnVector"
      },
      "embedding_summary": {
        "dimensions": 768,
        "similarity": "dotProduct",
        "type": "knnVector"
      }
    }
  }
}
Your screen should look like this:
Completed index configuration in JSON format
Now click Next and then Create Search Index in the next screen:
Confirming JSON configuration before creating an index
This will schedule the indexing in MongoDB Atlas. You may have to wait several minutes for it to complete.
Modal confirmation that your index is being created
When completed, the following modal window will pop up:
Modal confirmation that your index is ready to use
Return to your Python client, and we will perform a search.
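The index above uses dotProduct similarity, which is sensitive to vector length as well as direction. As far as we understand the Atlas documentation, dotProduct is intended for vectors normalized to unit length, and the sample API response earlier contains components larger than 1, which suggests the raw vectors are not unit length. If that matters for your ranking, one option is to normalize vectors before storing and querying them. The helper names and the two-dimensional example vectors below are assumptions for illustration.

```python
import math


def normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])
# After normalization, the dot product equals the cosine similarity.
print(dot(a, a))
print(dot(a, b))
```

With unit-length vectors, a vector's score against itself is approximately 1, and every other score reflects only the angle between the vectors.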

Search with Embeddings

Now that our embeddings are indexed, we will perform a search.
We will write a search function that does the following:
  1. Take a query string and convert it to an embedding using Jina Embeddings and our existing generate_embeddings function.
  2. Query the index on MongoDB Atlas using the client connection we already set up.
  3. Print names, summaries, and descriptions of the matches.
Define the search function as follows:
def search(query):
    query_embedding = generate_embeddings([query])[0]
    results = db.listingsAndReviews.aggregate([
        {
            '$search': {
                "index": "listings_comments_semantic_search",
                "knnBeta": {
                    "vector": query_embedding,
                    "k": 3,
                    "path": ["embedding_summary", "embedding_description"]
                }
            }
        }
    ])
    for document in results:
        # Print the name, summary, and description of each match.
        print(f'Listing Name: {document["name"]}\nSummary: {document["summary"]}\nDescription: {document["description"]}\n\n')
And now, let’s run a search:
search("an amazing view and close to amenities")
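Your results may vary because this tutorial did not index all the documents in the dataset, and which ones were indexed may vary dramatically. You should get a result in the following format (in this sample run, the Summary lines show the listing names rather than the summary text):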
Listing Name: Rented Room
Summary: Rented Room
Description: Beautiful room and with a great location in the city of Rio de Janeiro


Listing Name: Spacious and well located apartment
Summary: Spacious and well located apartment
Description: Enjoy Porto in a spacious, airy and bright apartment, fully equipped, in a
building with lift, located in a region full of cafes and restaurants, close to the subway
and close to the best places of the city. The apartment offers total comfort for those
who, besides wanting to enjoy the many attractions of the city, also like to relax and
feel at home, All airy and bright, with a large living room, fully equipped kitchen, and a
delightful balcony, which in the summer refreshes and in the winter protects from the cold
and rain, accommodating up to six people very well. It has 40-inch interactive TV, internet
and high-quality wi-fi, and for those who want to work a little, it offers a studio with a
good desk and an inspiring view. The apartment is all available to guests. I leave my guests
at ease, but I am available whenever they need me. It is a typical neighborhood of Porto,
where you have silence and tranquility, little traffic, no noise, but everything at hand:
good restaurants and c


Listing Name: Panoramic Ocean View Studio in Quiet Setting
Summary: Panoramic Ocean View Studio in Quiet Setting
Description: Luxury studio unit is located in a family-oriented neighborhood that lets you
experience Hawaii like a local! with tranquility and serenity, while in close proximity to
beaches and restaurants! The unit is surrounded by lush tropical vegetation! High-speed
Wi-Fi available in the unit!! A large, private patio (lanai) with fantastic ocean views is
completely under roof and is part of the studio unit. It's a great space for eating outdoors
or relaxing, while checking our the surfing action. This patio is like a living room
without walls, with only a roof with lots and lots of skylights!!! We provide Wi-Fi and
beach towels! The studio is detached from the main house, which has long-term tenants
upstairs and downstairs. The lower yard and the front yard are assigned to those tenants,
not the studio guests. The studio has exclusive use of its large (600 sqft) patio - under
roof! Check-in and check-out times other than the ones listed, are by request only and an
additional charges may apply;


Listing Name: GOLF ROYAL RESIDENCE SUİTES(2+1)-2
Summary: GOLF ROYAL RESIDENCE SUİTES(2+1)-2
Description: A BIG BED ROOM WITH A BIG SALOON INCLUDING A NICE BALAKON TO HAVE SOME FRESH
AIR . OUR RESIDENCE SITUATED AT THE CENTRE OF THE IMPORTANT MARKETS SUCH AS NİŞANTAŞİ,
OSMANBEY AND TAKSIM SQUARE,


Listing Name: DOUBLE ROOM for 1 or 2 ppl
Summary: DOUBLE ROOM for 1 or 2 ppl
Description: 10m2 with interior balkony kitchen, bathroom small but clean and modern metro
in front of the building 7min walk to Sagrada Familia, 2min walk TO amazing Gaudi Hospital
Sant Pau SAME PRICE FOR 1 OR 2 PPL-15E All flat for your use, terrace, huge TV.
Experiment with your own queries to see what you get.
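When experimenting, it helps to see how strongly each result matched. The sketch below builds an aggregation pipeline that adds a $project stage exposing the relevance score through {"$meta": "searchScore"}, which Atlas Search makes available after a $search stage. The function name build_search_pipeline, the choice of a single path, and the three-value placeholder vector are assumptions for illustration. A real query vector comes from generate_embeddings and has 768 values.

```python
def build_search_pipeline(query_embedding, path, k=3):
    """Build a $search pipeline that also returns each match's score."""
    return [
        {
            "$search": {
                "index": "listings_comments_semantic_search",
                "knnBeta": {"vector": query_embedding, "k": k, "path": path},
            }
        },
        {
            # Keep only the fields of interest and attach the relevance score.
            "$project": {
                "_id": 0,
                "name": 1,
                "summary": 1,
                "score": {"$meta": "searchScore"},
            }
        },
    ]


pipeline = build_search_pipeline([0.1, 0.2, 0.3], "embedding_description")
print(pipeline[1]["$project"]["score"])  # {'$meta': 'searchScore'}
```

Passing the pipeline to db.listingsAndReviews.aggregate(pipeline) would return documents carrying name, summary, and score fields, which makes it easier to compare queries side by side.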

Next steps

You’ve now created the core of a MongoDB Atlas-based semantic search engine, powered by Jina AI’s state-of-the-art embedding technology. For any project, you will follow essentially the same steps outlined above:
  1. Create an Atlas instance and fill it with your data.
  2. Create embeddings for your data items using the Jina Embeddings API and store them in your Atlas instance.
  3. Index the embeddings using MongoDB’s vector indexer.
  4. Implement semantic search using embeddings.
This boilerplate Python code will integrate easily into your own projects, and you can create equivalent code in Java, JavaScript, or code for any other integration framework that supports HTTPS.
To see the full documentation of the MongoDB Atlas API, so you can integrate it into your own offerings, see the Atlas API section of the MongoDB website.
To learn more about Jina Embeddings and its subscription offerings, see the Embeddings page of the Jina AI website. You can find the latest news about Jina AI’s embedding models on the Jina AI website and X/Twitter, and you can contribute to discussions on Discord.
