Add US Postal Abbreviations to Your Atlas Search in 5 Minutes
Ksenia Samokhvalova, Amelia Short • 9 min read • Published Sep 29, 2022 • Updated Sep 29, 2022
There are cases when it helps to have synonyms set up to work with your Atlas Search index. For example, if the search in your application needs to work with addresses, it might help to set up a list of common synonyms for postal abbreviations, so one could type in "blvd" instead of "boulevard" and still find all places with "boulevard" in the address.
This tutorial will show you how to set up your Atlas Search index to recognize US postal abbreviations.
To be successful with this tutorial, you will need:
- Python, to use a script that scrapes a list of street suffix abbreviations helpfully compiled by the United States Postal Service (USPS). This tutorial was written using Python 3.10.15, but you could try it on earlier versions of Python 3 if you'd like.
- A MongoDB Atlas cluster. Follow the Get Started with Atlas guide to create your account and a MongoDB cluster. For this tutorial, you can use your free-forever MongoDB Atlas cluster! Keep a note of your database username, password, and connection string, as you will need those later.
- Rosetta, if you're on a Mac with an M1 chip. This will allow you to run MongoDB tools like mongoimport and mongosh.
- mongosh, for running commands in the MongoDB shell. If you don't already have it, install mongosh.
- A copy of mongoimport. If you have MongoDB installed on your workstation, then you may already have mongoimport installed. If not, follow the instructions on the MongoDB website to install mongoimport.
- The sample_restaurants dataset, which we use in this tutorial because it contains address data. For instructions on how to load sample data, see the documentation. You can also see all available sample datasets.
The examples shown here were all written on macOS but should run on any Unix-type system. If you're running on Windows, we recommend running the example commands inside the Windows Subsystem for Linux.
To learn about synonyms in Atlas Search, we suggest you start by checking out our documentation. Synonyms allow you to index and search your collection for words that have the same or nearly the same meaning. In the case of our tutorial, they let you search using different ways to write out an address and still get the results you expect. To set up and use synonyms in Atlas Search, you will need to:
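- Create a collection containing the synonyms in the same database as the collection you're indexing. Note that every document in the synonyms collection must have a specific format.
- Create a search index with a synonym mapping that references the synonyms collection.
- Reference that synonym mapping in your $search queries.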
We will walk you through these steps in the tutorial, but first, let's start with creating the JSON documents that will form our synonyms collection.
We will use the list of official street suffix abbreviations and a list of secondary unit designators from the USPS website to create a JSON document for each set of synonyms.
All documents in the synonyms collection must have a specific format that specifies the type of synonyms: equivalent or explicit. Explicit synonyms have a one-way mapping. For example, if "boat" is explicitly mapped to "sail," we'd be saying that if someone searches "boat," we want to return all documents that include "sail" and "boat." However, if we search the word "sail," we would not get any documents that have the word "boat." In the case of postal abbreviations, however, one can use all abbreviations interchangeably, so we will use the "equivalent" type of synonym in the mappingType field.
Here is a sample document in the synonyms collection for all the possible abbreviations of "avenue":
"Avenue":

    {
        "mappingType": "equivalent",
        "synonyms": ["AVENUE", "AV", "AVEN", "AVENU", "AVN", "AVNUE", "AVE"]
    }
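To make the difference between the two mapping types concrete, here is a small Python sketch. The `expand_query` helper is hypothetical and only imitates the behavior described above; Atlas Search does this expansion internally, and this is not its implementation. The explicit document uses an `input` field, the same field that appears in the validator later in this tutorial.

```python
# Toy illustration of "equivalent" vs. "explicit" synonym documents.
# expand_query is a hypothetical helper, not part of any MongoDB API.

equivalent_doc = {
    "mappingType": "equivalent",
    "synonyms": ["avenue", "av", "ave"],
}

explicit_doc = {
    "mappingType": "explicit",
    "input": ["boat"],
    "synonyms": ["boat", "sail"],
}


def expand_query(term, doc):
    """Return the set of terms a query for `term` would match under `doc`."""
    if doc["mappingType"] == "equivalent":
        # Two-way: any member of the group expands to the whole group.
        if term in doc["synonyms"]:
            return set(doc["synonyms"])
    elif doc["mappingType"] == "explicit":
        # One-way: only terms listed in "input" expand to the synonyms.
        if term in doc["input"]:
            return set(doc["synonyms"])
    # No mapping applies, so the term only matches itself.
    return {term}


print(sorted(expand_query("ave", equivalent_doc)))
print(sorted(expand_query("boat", explicit_doc)))
print(sorted(expand_query("sail", explicit_doc)))
```

Searching "sail" against the explicit document matches only "sail", which is the one-way behavior we want to avoid for postal abbreviations.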
We wrote the web scraping code for you in Python, and you can run it with the following commands to create a document for each synonym group:
    git clone https://github.com/mongodb-developer/Postal-Abbreviations-Synonyms-Atlas-Search-Tutorial/
    cd Postal-Abbreviations-Synonyms-Atlas-Search-Tutorial
    python3 main.py
To see details of the Python code, read the rest of the section.
To scrape the USPS website, we need three third-party packages, which you can install with pip: requests, BeautifulSoup (the beautifulsoup4 package), and pandas. We'll also import json and re from the standard library for formatting our data when we're ready:

    import json
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    import re
Let's start with the Street Suffix Abbreviations page. We want to create objects that represent both the URL and the page itself:

    # Create a URL object
    streetsUrl = 'https://pe.usps.com/text/pub28/28apc_002.htm'

    # Create object page
    headers = {
        "User-Agent": 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Mobile Safari/537.36'}
    streetsPage = requests.get(streetsUrl, headers=headers)
Next, we want to get the information on the page. We'll start by parsing the HTML, and then get the table by its id:

    # Obtain page's information
    streetsSoup = BeautifulSoup(streetsPage.text, 'html.parser')

    # Get the table by its id
    streetsTable = streetsSoup.find('table', {'id': 'ep533076'})
Now that we have the table, we're going to want to transform it into a dataframe, and then format it in a way that's useful for us:

    # Transform the table into a list of dataframes
    streetsDf = pd.read_html(str(streetsTable))
One thing to take note of is that in the table provided on the USPS website, one primary name is usually mapped to multiple commonly used names.
This means we need to dynamically group together commonly used names by their corresponding primary name and compile that into a list:

    # Group together all "Commonly Used Street Suffix or Abbreviation" entries
    streetsGroup = streetsDf[0].groupby(0)[1].apply(list)
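If the pandas one-liner is hard to picture, the following sketch does the same kind of grouping with only the standard library. The `rows` list is illustrative sample data (a few avenue and boulevard spellings), not the scraped table, and `group_by_primary` is a hypothetical helper name.

```python
from collections import defaultdict

# Sample rows in the shape (primary name, commonly used name).
# Illustrative values only; the real rows come from the USPS table.
rows = [
    ("AVENUE", "AV"),
    ("AVENUE", "AVE"),
    ("AVENUE", "AVENUE"),
    ("BOULEVARD", "BLVD"),
    ("BOULEVARD", "BOUL"),
    ("BOULEVARD", "BOULEVARD"),
]


def group_by_primary(rows):
    """Collect every commonly used name under its primary name."""
    groups = defaultdict(list)
    for primary, common in rows:
        groups[primary].append(common)
    return dict(groups)


groups = group_by_primary(rows)
for primary, names in groups.items():
    print(primary, names)
```

Each key plays the role of the Series index in `streetsGroup`, and each list is what ends up in the "synonyms" field.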
Once our names are all grouped together, we can loop through them and export them as individual JSON files.
    for x in range(streetsGroup.size):
        dictionary = {
            "mappingType": "equivalent",
            # .iloc selects by position; the Series index holds the primary names
            "synonyms": streetsGroup.iloc[x]
        }

        # export the JSON into a file named after the primary street suffix
        with open(streetsGroup.index.values[x] + ".json", "w") as outfile:
            json.dump(dictionary, outfile)
Now, let's do the same thing for the Secondary Unit Designators page.
Just as before, we'll start with getting the page and transforming it to a dataframe:

    # Create a URL object
    unitsUrl = 'https://pe.usps.com/text/pub28/28apc_003.htm'

    unitsPage = requests.get(unitsUrl, headers=headers)

    # Obtain page's information
    unitsSoup = BeautifulSoup(unitsPage.text, 'html.parser')

    # Get the table by its id
    unitsTable = unitsSoup.find('table', {'id': 'ep538257'})

    # Transform the table into a list of dataframes
    unitsDf = pd.read_html(str(unitsTable))
If we look at the table more closely, we can see that one of the values is blank. While it makes sense that the USPS would include this in the table, it's not something that we want in our synonyms list.
To take care of that, we'll simply remove all rows that have blank values:

    unitsDf[0] = unitsDf[0].dropna()
Next, we'll take our new dataframe and turn it into a list:

    # Create a 2D list that we will use for our synonyms
    unitsList = unitsDf[0][[0, 2]].values.tolist()
You may have noticed that some of the values in the table have asterisks in them. Let's quickly get rid of them so they won't be included in our synonym mappings:

    # Replace every character that is not a space or a word character,
    # then trim and lowercase (raw string so that \w is passed to re untouched)
    unitsList = [[re.sub(r"[^ \w]", " ", x).strip().lower() for x in y] for y in unitsList]
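As a quick check of what that substitution does, here is the same expression applied to a couple of hand-written rows. The sample values are made up for illustration, and `clean` is just a hypothetical name for the expression used above.

```python
import re


def clean(value):
    # Replace anything that is not a space or a word character with a space,
    # then trim both ends and lowercase the result.
    return re.sub(r"[^ \w]", " ", value).strip().lower()


# Hand-written sample rows; the asterisks mimic the footnote markers in the table.
sample_rows = [["APARTMENT", "APT"], ["BASEMENT", "BSMT**"]]
cleaned = [[clean(x) for x in row] for row in sample_rows]
print(cleaned)  # [['apartment', 'apt'], ['basement', 'bsmt']]
```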
Now we can loop through them and export them as individual JSON files, just as we did before. The one thing to note is that we restrict the range we iterate over so that it includes only the relevant data:

    # Restrict the range to only retrieve the results we want
    for x in range(1, len(unitsList) - 1):
        dictionary = {
            "mappingType": "equivalent",
            "synonyms": unitsList[x]
        }

        # export the JSON into a file
        with open(unitsList[x][0] + ".json", "w") as outfile:
            json.dump(dictionary, outfile)
Now that we have created the JSON documents for abbreviations, let's load them all into a collection in the sample_restaurants database. If you haven't already created a MongoDB cluster, now is a good time to do that and load the sample data in.
The first step is to connect to your Atlas cluster. We will use mongosh to do it. If you don't have mongosh installed, follow the instructions.
To connect to your Atlas cluster, you will need a connection string. Choose the "Connect with the MongoDB Shell" option and follow the instructions. Note that you will need to connect with a database user that has permissions to modify the database, since we will be creating a collection in the sample_restaurants database. The command you need to enter in the terminal will look something like:

    mongosh "mongodb+srv://cluster0.XXXXX.mongodb.net/sample_restaurants" --apiVersion 1 --username <USERNAME>

When prompted for the password, enter the database user's password.
We created our synonym JSON documents in the right format already, but let's make sure that if we decide to add more documents to this collection, they will also have the correct format. To do that, we will create a synonyms collection with a validator that uses $jsonSchema. The commands below will create a collection with the name "postal_synonyms" in the sample_restaurants database and ensure that only documents with the correct format are inserted into the collection.

    use('sample_restaurants')

    db.createCollection("postal_synonyms", {
      validator: {
        $jsonSchema: {
          "bsonType": "object",
          "required": ["mappingType", "synonyms"],
          "properties": {
            "mappingType": {
              "type": "string",
              "enum": ["equivalent", "explicit"],
              "description": "must be either equivalent or explicit"
            },
            "synonyms": {
              "bsonType": "array",
              "items": { "type": "string" },
              "description": "must be an array with each item a string and is required"
            },
            "input": {
              "type": "array",
              "items": { "type": "string" },
              "description": "must be an array with each item a string"
            }
          },
          "anyOf": [
            { "not": { "properties": { "mappingType": { "enum": ["explicit"] } }, "required": ["mappingType"] } },
            { "required": ["input"] }
          ]
        }
      }
    })
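The `anyOf` clause is the least obvious part of that schema: it says a document is acceptable if its mappingType is not "explicit", or if it has an `input` field. As a reading aid, here is a plain-Python restatement of the rules. `is_valid_synonym_doc` is a hypothetical helper that mirrors our reading of the schema; it is not how MongoDB evaluates $jsonSchema.

```python
def is_valid_synonym_doc(doc):
    """Plain-Python restatement of the validator rules (hypothetical helper)."""

    def is_string_list(value):
        return isinstance(value, list) and all(isinstance(v, str) for v in value)

    # "bsonType": "object"
    if not isinstance(doc, dict):
        return False
    # "required": ["mappingType", "synonyms"]
    if "mappingType" not in doc or "synonyms" not in doc:
        return False
    # "enum": ["equivalent", "explicit"]
    if doc["mappingType"] not in ("equivalent", "explicit"):
        return False
    # "synonyms" must be an array of strings
    if not is_string_list(doc["synonyms"]):
        return False
    # "input", when present, must also be an array of strings
    if "input" in doc and not is_string_list(doc["input"]):
        return False
    # "anyOf": either the type is not "explicit", or "input" is present
    if doc["mappingType"] == "explicit" and "input" not in doc:
        return False
    return True


print(is_valid_synonym_doc({"mappingType": "equivalent", "synonyms": ["ave", "avenue"]}))
print(is_valid_synonym_doc({"mappingType": "explicit", "synonyms": ["boat", "sail"]}))
```

The first call prints True and the second prints False, because an explicit mapping without `input` is rejected.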
We will use mongoimport to import all the JSON files we created.
You will need a connection string for your Atlas cluster to use in the mongoimport command. If you don't already have mongoimport installed, use the instructions in the MongoDB documentation.
In the terminal, navigate to the folder where all the JSON files for postal abbreviation synonyms were created.
    cat *.json | mongoimport --uri 'mongodb+srv://<USERNAME>:<PASSWORD>@cluster0.pwh9dzy.mongodb.net/sample_restaurants?retryWrites=true&w=majority' --collection='postal_synonyms'
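If you would rather stay in Python than pipe files into mongoimport, a rough equivalent is sketched below. It is not part of the original script: `load_synonym_docs` and `import_synonyms` are hypothetical helpers, and the sketch assumes you pass in a pymongo collection (or any object with an `insert_many` method). The connection details in the trailing comment are placeholders.

```python
import json
from pathlib import Path


def load_synonym_docs(folder):
    """Read every .json file in `folder` and return the parsed documents."""
    docs = []
    for path in sorted(Path(folder).glob("*.json")):
        with open(path) as infile:
            docs.append(json.load(infile))
    return docs


def import_synonyms(collection, folder):
    """Insert all synonym documents from `folder` into `collection`.

    `collection` is assumed to be a pymongo Collection; anything with an
    insert_many method works. Returns the number of documents sent.
    """
    docs = load_synonym_docs(folder)
    if docs:
        collection.insert_many(docs)
    return len(docs)


# Example usage (assumes pymongo is installed and the URI is filled in):
# from pymongo import MongoClient
# client = MongoClient("mongodb+srv://<USERNAME>:<PASSWORD>@<CLUSTER_HOST>/")
# import_synonyms(client["sample_restaurants"]["postal_synonyms"], ".")
```

Keeping the file loading separate from the insert makes it easy to inspect the documents before anything is written to the cluster.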
Take a look at the synonyms collection you just created in Atlas. You should see around 229 documents there.
Now that we have created the synonyms collection in our sample_restaurants database, let's put it to use.
Let's start by creating a search index. Navigate to the Search tab in your Atlas cluster and click the "CREATE INDEX" button.
Since the Visual Index builder doesn't support synonym mappings yet, we will choose JSON Editor and click Next.
In the JSON Editor, pick the restaurants collection in the sample_restaurants database and enter the following into the index definition. Here, the source collection name refers to the name of the collection with all the postal abbreviation synonyms, which we named "postal_synonyms."

    {
      "mappings": {
        "dynamic": true
      },
      "synonyms": [
        {
          "analyzer": "lucene.standard",
          "name": "synonym_mapping",
          "source": {
            "collection": "postal_synonyms"
          }
        }
      ]
    }
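If you keep index definitions in code, the same JSON can be generated from a small Python function, which makes the three moving parts (analyzer, mapping name, source collection) explicit. `build_index_definition` is a hypothetical helper for illustration; it only produces the text you would paste into the JSON Editor.

```python
import json


def build_index_definition(synonym_collection,
                           mapping_name="synonym_mapping",
                           analyzer="lucene.standard"):
    """Build the index definition shown above as a Python dict."""
    return {
        "mappings": {"dynamic": True},
        "synonyms": [
            {
                "analyzer": analyzer,
                "name": mapping_name,
                # The synonyms collection must live in the same database
                # as the collection being indexed.
                "source": {"collection": synonym_collection},
            }
        ],
    }


definition = build_index_definition("postal_synonyms")
print(json.dumps(definition, indent=2))
```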
We are indexing the restaurants collection and creating a synonym mapping with the name "synonym_mapping" that references the synonyms collection "postal_synonyms."
Click on Next and then on Create Search Index, and wait for the search index to build.
Once the index is active, we're ready to test that our synonyms are working. Let's head to the Aggregation pipeline in the Collections tab to test different calls to $search. You can also use Compass, the MongoDB GUI, if you prefer.
Choose $search from the list of pipeline stages. The UI gives us a helpful placeholder for the $search command's arguments.
Let's look for all restaurants that are located on a boulevard. We will search in the "address.street" field, so the arguments to the $search stage will look like this:

    {
      index: 'default',
      text: {
        query: 'boulevard',
        path: 'address.street'
      }
    }
Let's add a $count stage after the $search stage to see how many restaurants with an address that contains "boulevard" we found.
As expected, we found a lot of restaurants with the word "boulevard" in the address. But what if we don't want to have users type "boulevard" in the search bar? What would happen if we put in "blvd," for example?

    {
      index: 'default',
      text: {
        query: 'blvd',
        path: 'address.street'
      }
    }
Looks like it found us restaurants with addresses that have "blvd" in them. What about the addresses with "boulevard," though? Those did not get picked up by the search.
And what if we weren't sure how to spell "boulevard" and just searched for "boul"? The USPS website tells us it's an acceptable abbreviation for boulevard, but our $search finds nothing.
This is where our synonyms come in! We need to add a synonyms option to the text operator in the $search command and reference the synonym mapping's name:

    {
      index: 'default',
      text: {
        query: 'blvd',
        path: 'address.street',
        synonyms: 'synonym_mapping'
      }
    }
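For comparison outside the Atlas UI, the two-stage pipeline ($search followed by $count) can be written as a Python list. This is only a sketch: `build_search_pipeline` is a hypothetical helper, and "total" is an arbitrary name we chose for the $count output field.

```python
def build_search_pipeline(query, use_synonyms=True):
    """Build a $search + $count pipeline for the address.street field."""
    text = {"query": query, "path": "address.street"}
    if use_synonyms:
        # Must match the "name" of the synonym mapping in the index definition.
        text["synonyms"] = "synonym_mapping"
    return [
        {"$search": {"index": "default", "text": text}},
        {"$count": "total"},  # "total" is an arbitrary output field name
    ]


with_synonyms = build_search_pipeline("blvd")
without_synonyms = build_search_pipeline("blvd", use_synonyms=False)
print(with_synonyms)
print(without_synonyms)
```

With pymongo, either list could be passed to `collection.aggregate(...)` on the restaurants collection, which lets you compare the counts with and without the synonyms option.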
And there you have it! We found all the restaurants on boulevards, regardless of which way the address was abbreviated, all thanks to our synonyms.
Synonyms are just one of many features Atlas Search offers to give you all the necessary search functionality in your application. All of these features are available right now on MongoDB Atlas. We just showed you how to add support for common postal abbreviations to your Atlas Search index. What can you do with Atlas Search next? Try it now on your free-forever MongoDB Atlas cluster and head over to the community forums if you have any questions!