Aperol Spritz Summer With MongoDB Geospatial Queries & Vector Search
It’s summer in New York City and you know what that means: It’s the season of the spritz! There is nothing (and I fully, truly, 110% mean nothing) better than a crisp Aperol spritz to end a day that was so hot and muggy that the subway was indistinguishable from a sauna.
While I normally love adventuring through the city in search of what will perfectly fulfill my current craving, there are certain months when I refuse to spend any more time than necessary moving around outdoors (hello, heatwave?!). At night during an NYC summer, we are lounging — lounging on rooftops, terraces, and sidewalks — wherever we can fit. And with minimal movement, we want our Aperol spritzes as close as possible. So, let’s use MongoDB geospatial queries, MongoDB Atlas Vector Search, and the Google Places API to find our closest spritz locations in the West Village neighborhood of New York City while using semantic search to help us get the most out of our queries.
In this tutorial, we will use the platforms listed above to find all the locations selling Aperol spritzes in the West Village neighborhood of New York City, narrow them down to the ones that match our semantic query of being outdoors with quick service (we need those spritzes and need them NOW!), and then find the one closest to our starting location.
Before we begin the tutorial, let’s go over some of the important platforms we will be using on our journey.
MongoDB geospatial queries allow you to search your database based on geographical locations! This means you are able to find different locations such as restaurants, parks, museums, etc. based just on their coordinates. In this tutorial, we will use MongoDB geospatial queries to search the locations of places that serve Aperol spritzes that we sourced from Google’s Places API. To use geospatial queries properly with MongoDB, we will need to ensure our data points are loaded in GeoJSON format. More on that below!
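As a quick illustration (the field names and coordinates here are made up for the example), a GeoJSON `Point` stored in a document looks like this, with longitude listed before latitude:

```python
# a GeoJSON Point stores coordinates as [longitude, latitude] -- longitude always comes first
sample_spritz_spot = {
    "name": "Hypothetical Spritz Bar",
    "location": {
        "type": "Point",
        "coordinates": [-74.0052, 40.7349]  # roughly the middle of the West Village
    }
}
```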
MongoDB Atlas Vector Search is a way of searching through your database semantically, or by meaning. This means instead of searching based on specific keywords or exact text phrases, you can retrieve results even if a word is spelled wrong, or retrieve results based on synonyms. This will integrate fabulously with our tutorial because we can search through the reviews we retrieve from our Google Places API and see which ones match closest to what we’re looking for.
Let’s go!
To be successful with this tutorial, you will need:
- The IDE of your choosing — this tutorial uses a Google Colab notebook. Please feel free to run your commands directly from a notebook.
- A Google Cloud Platform account — please create an account and a project. We will go through this together.
- An OpenAI API key — this is how we will embed our location reviews so we can use MongoDB Atlas Vector Search!
- A MongoDB Atlas cluster — this is where we will store our spritz locations and run our queries.
Once your MongoDB Atlas cluster has been provisioned and you have everything else written down in a secure spot, you’re ready to begin. Please also ensure you have allowed "Access From Anywhere" in your MongoDB cluster, under "Network Access". This is not recommended for production, but it is used in this tutorial for ease of use. Without this in place, you will not be able to write to your MongoDB cluster.
Our first step is to create a project inside of our Google Cloud account. This is so we can ensure the use of the Google Places API to find all locations that serve Aperol spritzes in the West Village.
This is what your project will look like once it’s been created. Please make sure to set up your billing account information on the left-hand side of the screen. You can set up a free trial for $300 worth of credits, so if you’re trying out this tutorial, please feel free to do that and save some money!
Once your account is set up, let’s enable the Google Places API that we are going to be using. You can do this through the same link to set up your Google Cloud project.
This is the API we want to use:
Hit the Enable button and a popup will come up with your API key. Store it somewhere safe since we will be using it in our tutorial! Make sure to not lose it or expose it anywhere.
With every Places API request made, your API key must be used. You can find out more from the documentation.
Once that’s in place, we can get started on our tutorial.
Now, head over to your Google Colab notebook.
We want to install `googlemaps` and `openai` in our notebook since these are necessary for us when building this tutorial:

```python
!pip install googlemaps
!pip install openai==0.28
```
Then, define and run your imports:
```python
import googlemaps
import getpass
import openai
```
We are going to use the `getpass` library to keep our API keys secret. Set it up for your Google API key and your OpenAI API key:

```python
# Google API key
google_api_key = getpass.getpass(prompt="Put in Google API Key here")
map_client = googlemaps.Client(key=google_api_key)

# OpenAI API key
openai_api_key = getpass.getpass(prompt="Put in OpenAI API Key here")
```
Now, let's set ourselves up for Vector Search success. First, set your key and then establish our embedding function. For this tutorial, we are using OpenAI's "text-embedding-3-small" embedding model. We are going to be embedding the reviews of our spritz locations so we can make some judgments on where to go!
```python
# set your key
openai.api_key = openai_api_key

# embedding model we are using
EMBEDDING_MODEL = "text-embedding-3-small"

# our embedding function
def get_embedding(text):
    response = openai.Embedding.create(input=text, model=EMBEDDING_MODEL)
    return response['data'][0]['embedding']
```
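As a quick optional check (not part of the original flow, just a sketch to confirm the setup), you can verify that the model returns 1,536-dimensional vectors, since that is the dimension we will declare in our vector search index later:

```python
# text-embedding-3-small returns 1536-dimensional vectors by default
test_vector = get_embedding("a crisp aperol spritz on a shady patio")
print(len(test_vector))  # expected: 1536
```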
When using Nearby Search in our Google Places API, we are required to set up three parameters: location, radius, and keyword. For our location, we can find our starting coordinates (the very middle of the West Village) by right-clicking on Google Maps and copying the coordinates to our clipboard. This is how I got the coordinates shown below:
For our radius, we have to have it in meters. Since I’m not very savvy with meters, let’s write a small function to help us make that conversion.
```python
# for the Google Maps API we need to use a radius in meters, so let's first convert our miles to meters
def miles_to_meters(miles):
    return miles * 1609.344
```
Our keyword will just be what we’re hoping to find from the Google Places API: Aperol spritzes!
```python
middle_of_west_village = (40.73490473393682, -74.00521094160642)
search_radius = miles_to_meters(0.4)  # West Village is small, so just do less than half a mile
spritz_finder = 'aperol spritz'
```
We can then make our API call using the `places_nearby` method:

```python
# making the API call using our places_nearby method and our parameters
response = map_client.places_nearby(
    location=middle_of_west_village,
    radius=search_radius,
    keyword=spritz_finder
)
```
Before we can go ahead and print out our locations, let’s think about our end goal. We want to achieve a couple of things before we insert our documents into our MongoDB Atlas cluster. We want to:
- Get detailed information about our locations, so we need to make another API call using each location’s `place_id` to grab the location `name`, our `formatted_address`, the `geometry` for our coordinates, some `reviews` (only up to five), and the location `rating`. You can find more fields to return (if your heart desires!) in the Nearby Search documentation.
- Embed our reviews for each location using our embedding function. We want to make sure that we have a field for these so our vectors are stored in an array inside our cluster. We are choosing to embed here just to make things easier for ourselves in the long run. Let’s also join the five reviews together into one string to make things a bit easier on the embedding.
- Think about how our coordinates are set up while we’re creating a dictionary with all the important information we want to store. MongoDB geospatial queries require GeoJSON objects, which means we need to make sure we have the proper format, or else we won’t be able to use our geospatial query operators later. We also need to keep in mind that the longitude and latitude are nested under `geometry` and `location` in the Google Places API response, so, unfortunately, we cannot just access them from the top level. We need to work some magic first. Here is an example of the output, copied from the documentation, showing where the latitude and longitude are nested:
```json
{
  "html_attributions": [],
  "results": [
    {
      "business_status": "OPERATIONAL",
      "geometry": {
        "location": { "lat": -33.8587323, "lng": 151.2100055 },
        "viewport": {
          "northeast": { "lat": -33.85739847010727, "lng": 151.2112436298927 },
          "southwest": { "lat": -33.86009812989271, "lng": 151.2085439701072 }
        },
```
With all this in mind, let’s get to it!
```python
# find the information we want: use the Nearby Places documentation to figure out which fields you want
spritz_locations = []
for location in response.get('results', []):
    location_detail = map_client.place(
        place_id=location['place_id'], fields=['name', 'formatted_address', 'geometry', 'reviews', 'rating']
    )

    # these are the specific details we want saved as fields in our documents
    details = location_detail.get('result', {})

    # we want to embed the five reviews, so let's extract and join them together
    location_reviews = details.get('reviews', [])
    store_reviews = [review['text'] for review in location_reviews[:5]]
    joined_reviews = " ".join(store_reviews)

    # generate the embedding for the reviews
    embedding_reviews = get_embedding(joined_reviews)

    # we know the longitude and latitude are nested inside geometry and location,
    # so let's grab them using .get and then format them how we want
    geometry = details.get('geometry', {})
    location = geometry.get('location', {})

    # both are nested under location, so open it up
    longitude = location.get('lng')
    latitude = location.get('lat')

    location_info = {
        'name': details.get('name'),
        'address': details.get('formatted_address'),

        # MongoDB geospatial queries require GeoJSON formatting
        'location': {
            'type': 'Point',
            'coordinates': [longitude, latitude]
        },
        'rating': details.get('rating'),
        'reviews': store_reviews,
        'embedding': embedding_reviews
    }
    spritz_locations.append(location_info)
```
Let’s print out our output and see what our spritz locations in the West Village neighborhood are! Let’s also check and make sure that we have a newly developed embedding field with our reviews embedded:
```python
# print our spritz information
for location in spritz_locations:
    print(f"Name: {location['name']}, Address: {location['address']}, Coordinates: {location['location']}, Rating: {location['rating']}, Reviews: {location['reviews']}, Embedding: {location['embedding']}")
```
So, if I scroll over in my notebook, I can see there are embeddings, but I will prove they are there once we insert our data into MongoDB Atlas since it’s a bit hard to capture in a single picture.
Let’s insert them using the `pymongo` library. First, let’s install `pymongo`:

```python
# install pymongo
!pip install pymongo
```
Please keep in mind that you can name your database and collection anything you like, since it won’t be created until we write in our data. I am naming my database “spritz_summer” and my collection “spritz_locations_WV”. Run the code block below to insert your documents into your cluster:
```python
from pymongo import MongoClient

# set up your MongoDB connection
connection_string = getpass.getpass(prompt="Enter connection string WITH USER + PASS here")
client = MongoClient(connection_string)

# name your database and collection anything you want since they will be created when you insert your data
database = client['spritz_summer']
collection = database['spritz_locations_WV']

# insert our spritz locations
collection.insert_many(spritz_locations)
```
Go ahead and double-check that everything was written correctly in MongoDB Atlas:

Make sure your embedding field exists and that it’s an array of 1,536 values, and please make sure your coordinates are properly configured the way mine are in the image.
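If you’d like to verify from the notebook as well, here is a small optional sanity check (an addition to the tutorial, assuming the `collection` handle from the insert step above) that confirms the embedding length and the GeoJSON shape of one document:

```python
# optional sanity check on one of the documents we just inserted
doc = collection.find_one()
print(len(doc['embedding']))  # should print 1536, matching our embedding model's dimensions
print(doc['location'])        # should look like {'type': 'Point', 'coordinates': [<lng>, <lat>]}
```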
So, should we run the vector search or the geospatial query first? Since both of these, looked at simply as aggregation pipeline operators, need to be the first stage in their pipelines, we can’t combine them into one pipeline. Instead, we can use a little loophole and create two. But how will we decide which one to run first?!
When I’m using Google Maps to figure out where to go, I normally first search for what I’m craving, and then I see how far away it is from where I currently am. So let’s keep that mindset and start off with MongoDB Atlas Vector Search. But, I understand that intuitively, some of you might prefer to search via all nearby locations and then semantically search (geospatial queries first and then vector search), so let’s highlight that method as well below.
We have a couple of steps here. Our first step is to create a Vector Search Index. Please do this inside of MongoDB Atlas by following the Vector Search documentation.
Please keep in mind that your index is not run in your script. It lives in your cluster. You’ll know it’s ready to go when it turns green and is activated.
Here is the index definition; name the index `vector_index` so it matches the aggregation pipelines below:

```json
{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    }
  ]
}
```
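If you’d rather create the index from the notebook instead of the Atlas UI, here is a minimal sketch of one way to do it, assuming a recent PyMongo version (4.7 or newer) and an Atlas cluster; this is an optional alternative, not part of the original tutorial:

```python
from pymongo.operations import SearchIndexModel

# create the same vector search index programmatically (requires PyMongo 4.7+ and MongoDB Atlas)
search_index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "numDimensions": 1536,
                "path": "embedding",
                "similarity": "cosine",
                "type": "vector"
            }
        ]
    },
    name="vector_index",
    type="vectorSearch"
)
collection.create_search_index(model=search_index_model)
```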
Once it’s activated, let’s get to vector searching!
So. Let’s say I just finished dinner with my besties at our favorite restaurant in the West Village, Balaboosta. The food was great, it’s a summer night, we’re in the mood for post-dinner spritzes outside, and we would prefer to be seated quickly. Let’s see if we can find a spot!
Our first step in building our pipeline is to embed our query. We cannot compare text to vectors; we have to compare vectors to vectors. We can do this with only a couple of lines since we are using the same embedding model that we embedded our reviews with:
```python
# You have to embed your query the same way you embedded your documents.
# my query
query_description = "outdoor seating quick service"

# we need to embed the query as well, since our documents are embedded
query_vector = get_embedding(query_description)
```
Now, let’s build out our aggregation pipeline. Since we are going to be using `$geoNear` in our pipeline next, we want to keep the `_id`s found from this aggregation pipeline so we don’t search through everything; we only search through our sample size. For now, make sure your `$vectorSearch` stage is at the very top!

```python
spritz_near_me_vector = [
    {
        '$vectorSearch': {
            'index': 'vector_index',
            'path': 'embedding',
            'queryVector': query_vector,
            'numCandidates': 15,
            'limit': 5
        }
    },
    {
        "$project": {
            "_id": 1,  # we want to keep this in place so we can search again using $geoNear
            "name": 1,
            "rating": 1,
            "reviews": 1
            #"address": 1,
            #"location": 1,
            #"embedding": 1
        }
    }
]
```
Let’s print out our results and see what happens from our query of “outdoor seating quick service”:
```python
spritz_near_me_vector_results = list(collection.aggregate(spritz_near_me_vector))
for result in spritz_near_me_vector_results:
    print(result)
```
We have five fantastic options! If we go and read through the reviews, we can see they align with what we’re looking for. Here is one example:
Let’s go ahead and save the `_id`s from our pipeline above in a simple line so we can specify that we only want to use our `$geoNear` operator on these five:

```python
# now, we want to take the _ids from the pipeline above so we can use them in our geo search
spritz_near_me_ids = [result['_id'] for result in spritz_near_me_vector_results]
print(spritz_near_me_ids)
```
Now that they’re saved, we can build out our `$geoNear` pipeline and see which one of these options is closest to us from our starting point, Balaboosta, so we can walk on over. To figure out the coordinates of Balaboosta, I right-clicked on Google Maps and saved the coordinates, and then made sure I had the longitude and latitude in the proper order.

`$geoNear` requires a geospatial index, so first create a 2dsphere index on our `location` field:

```python
# create a 2dsphere index on our location field so $geoNear can use it
collection.create_index([("location", "2dsphere")])
```
Here is the pipeline, with our query specifying that we only want to use the IDs of the locations we found above:
```python
# use the $geoNear operator to return documents that are at least 100 meters and at most 1,000 meters from our specified GeoJSON point
spritz_near_me_geo = [
    {
        "$geoNear": {
            "near": {
                "type": "Point",
                "coordinates": [-74.0059456749148, 40.73781277366724]
            },
            # here we are saying that we only want to use the sample size from above
            "query": {"_id": {"$in": spritz_near_me_ids}},
            "minDistance": 100,
            "maxDistance": 1000,
            "spherical": True,
            "distanceField": "dist.calculated"
        }
    },
    {
        "$project": {
            "_id": 0,
            "name": 1,
            "address": 1,
            "rating": 1,
            "dist.calculated": 1,
            #"location": 1,
            #"embedding": 1
        }
    },
    {
        "$limit": 3
    },
    {
        "$sort": {
            "dist.calculated": 1
        }
    }
]
```
Let’s print it out and see what we get!
```python
spritz_near_me_geo_results = collection.aggregate(spritz_near_me_geo)
for result in spritz_near_me_geo_results:
    print(result)
```
It seems like the restaurant we are heading over to is Pastis since it’s only 182.83 meters (0.1 miles) away. Time for an Aperol spritz outdoors!
For those who would prefer to switch things around and run geospatial queries first and then incorporate vector search, here is the pipeline:
```python
# create a 2dsphere index on our location field
collection.create_index([("location", "2dsphere")])

# our $geoNear pipeline
spritz_near_me_geo = [
    {
        "$geoNear": {
            "near": {
                "type": "Point",
                "coordinates": [-74.0059456749148, 40.73781277366724]
            },
            "minDistance": 100,
            "maxDistance": 1000,
            "spherical": True,
            "distanceField": "dist.calculated"
        }
    },
    {
        "$project": {
            "_id": 1,
            "dist.calculated": 1
        }
    }
]

# list of _ids and distances so we can use them as our sample size
places_ids = list(collection.aggregate(spritz_near_me_geo))
distances = {result['_id']: result['dist']['calculated'] for result in places_ids}  # a new dictionary to keep our distances
spritz_near_me_ids = [result['_id'] for result in places_ids]
# print(spritz_near_me_ids)
```
First, create our `$geoNear` pipeline and ensure you’re saving your `places_ids` and the `distances` so that we can carry them through our vector search pipeline. We also need to rebuild our MongoDB Atlas Vector Search index with "_id" included as a filter path:
```python
# our vector search index that was created inside of MongoDB Atlas
vector_search_index = {
    "fields": [
        {
            "numDimensions": 1536,
            "path": "embedding",
            "similarity": "cosine",
            "type": "vector"
        },
        {
            "type": "filter",
            "path": "_id"
        }
    ]
}
```
Once that’s active and ready, we can build out our vector search pipeline:
```python
# vector search pipeline
spritz_near_me_vector = [
    {
        '$vectorSearch': {
            'index': 'vector_index',
            'path': 'embedding',
            'queryVector': query_vector,
            'numCandidates': 15,
            'limit': 3,
            'filter': {"_id": {'$in': spritz_near_me_ids}}
        }
    },
    {
        "$project": {
            "_id": 1,  # we want to keep this in place
            "name": 1,
            "rating": 1,
            "dist.calculated": 1
            #"reviews": 1,
            #"address": 1,
            #"location": 1,
            #"embedding": 1
        }
    }
]

spritz_near_me_vector_results = collection.aggregate(spritz_near_me_vector)
for result in spritz_near_me_vector_results:
    result['dist.calculated'] = distances.get(result['_id'])
    print(result)
```
Run it, and you should see pretty similar results to before! Leave a comment below letting me know which locations showed up for you as your output; these are mine:
As you can see, they’re the same results but in a slightly different order, as they are no longer ordered by distance.
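If you do want the closest spot listed first, one option (a small optional addition, not part of the original tutorial) is to sort the results by the distances we carried over before printing:

```python
# optional: re-run the vector search pipeline, attach the carried-over distances, and sort closest-first
results_with_distance = []
for result in collection.aggregate(spritz_near_me_vector):
    result['dist.calculated'] = distances.get(result['_id'])
    results_with_distance.append(result)

for result in sorted(results_with_distance, key=lambda r: r['dist.calculated']):
    print(result)
```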
In this tutorial, we covered how to use MongoDB Atlas Vector Search and the Google Places API to find our closest spritz locations in the West Village neighborhood of New York City with semantic search, and then used MongoDB geospatial queries to find which locations were closest to us from a specific starting point.
For more information on MongoDB geospatial queries, please visit the documentation linked above, and if you have any questions or want to share your work, please join us in the MongoDB Developer Community.