Utilizing PySpark to Connect MongoDB Atlas with Azure Databricks
Data processing is no easy feat, but with the proper tools it can be simplified, enabling you to make the best data-driven decisions possible. In a world overflowing with data, we need the best methods to derive the most useful information.
The combination of MongoDB Atlas and Azure Databricks makes an efficient choice for big data processing. By connecting Atlas with Azure Databricks, we can extract data from our Atlas cluster, process and analyze it using PySpark, and then store the processed data back in our Atlas cluster. Analyzing your Atlas data in Azure Databricks gives you access to Databricks’ wide range of advanced analytics capabilities, including machine learning, data science, and areas of artificial intelligence like natural language processing. Processing your Atlas data with these advanced Databricks tools lets us handle any amount of data in an efficient and scalable way, making it easier than ever to gain insights into our data sets and make the most effective data-driven decisions.
This tutorial will show you how to utilize PySpark to connect Atlas with Databricks so you can take advantage of both platforms.
MongoDB Atlas is a scalable and flexible storage solution for your data, while Azure Databricks provides the power of Apache Spark along with the security and collaboration features available with a Microsoft Azure subscription. PySpark, the Python interface for Apache Spark, offers an easy-to-use way to develop Spark applications in Python. To connect PySpark with MongoDB Atlas, we use the MongoDB Spark Connector, which ensures seamless compatibility, as you will see below in the tutorial.
Our tutorial combining the above platforms will consist of viewing and manipulating an Atlas cluster and visualizing data from that cluster back in our PySpark console. We will set up both Atlas and Azure Databricks clusters, connect our Databricks cluster to our IDE, and write scripts to view and contribute to the cluster in our Atlas account. Let’s get started!
In order to successfully recreate this project, please ensure you have everything in the following list:
- MongoDB Atlas account.
- Microsoft Azure subscription (two-week free tier trial).
- Python 3.8+.
- Java on your local machine.
Our first step is to set up a MongoDB Atlas cluster. Access the Atlas UI and follow these steps. For this tutorial, a free “shared” cluster is perfect. Create a database named “bookshelf” with a collection inside named “books”. To keep this tutorial simple, allow connections from anywhere in your cluster’s network security settings.
Once properly provisioned, your cluster will look like this:
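If you would rather create the “bookshelf” database and “books” collection programmatically instead of through the Atlas UI, a minimal PyMongo sketch could look like the one below. It assumes you have installed `pymongo` and saved your cluster’s connection string in a `CONNECTION_STRING` environment variable (the same variable we use later in this tutorial); the placeholder book values are yours to replace.

```python
import os

from pymongo import MongoClient

# Connect to the Atlas cluster using the connection string stored in an environment variable.
client = MongoClient(os.environ["CONNECTION_STRING"])

# Inserting a first document implicitly creates the "bookshelf" database and "books" collection.
client["bookshelf"]["books"].insert_one(
    {"title": "<title>", "author": "<author>", "year": 2000}  # placeholder values
)
```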
Now we can set up our Azure Databricks cluster.
Go to the Azure Databricks page, sign in, and access the Azure Databricks tab. This is where you’ll create an Azure Databricks workspace.
For our Databricks cluster, a free trial works perfectly for this tutorial. Once the cluster is provisioned, you’ll only have two weeks to access it before you need to upgrade.
Hit “Review and Create” at the bottom. Once your workspace is validated, click “Create.” Once your deployment is complete, click on “Go to Resource.” You’ll be taken to your workspace overview. Click on “Launch Workspace” in the middle of the page.
This will direct you to the Microsoft Azure Databricks UI where we can create the Databricks cluster. On the left-hand side of the screen, click on “Create a Cluster,” and then click “Create Compute” to access the correct form.
When creating your cluster, pay close attention to your “Databricks runtime version,” as you will need to install a matching version of databricks-connect later in this tutorial. Continue through the steps to create your cluster.
We’re now going to install the libraries we need in order to connect to our MongoDB Atlas cluster. Head to the “Libraries” tab of your cluster, click on “Install New,” and select “Maven.” Hit “Search Packages” next to “Coordinates.” Search for `mongo` and select the `mongo-spark` package. Do the same thing with `xml` and select the `spark-xml` package. When done, your Libraries tab will look like this:
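For reference, the Maven coordinates for the two packages look roughly like the lines below. The `mongo-spark-connector` version shown is the one used later in this tutorial, while the `spark-xml` coordinate is only an example; check the MVN Repository for the releases that match your cluster’s Databricks runtime (Spark and Scala versions).

```
org.mongodb.spark:mongo-spark-connector:10.0.3
com.databricks:spark-xml_2.12:0.14.0
```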
Now that we have our Azure Databricks cluster ready, we need to properly connect it to our IDE. We can do this through a very handy tool called Databricks Connect, which allows Azure Databricks clusters to connect seamlessly to the IDE of your choosing.

Before we establish our connection, let’s make sure we have our configuration essentials. These are available in the Databricks Connect tutorial on Microsoft’s website under “Step 2: Configure connection properties.” Please note these properties down in a safe place, as you will not be able to connect properly without them.
Access the Databricks Connect page linked above to properly set up `databricks-connect` on your machine. Ensure that you download the `databricks-connect` version that is compatible with your Python version and is the same as the Databricks runtime version in your Azure cluster.

Please ensure prior to installation that you are working with a virtual environment for this project. Failure to use a virtual environment may cause PySpark package conflicts in your console.
Virtual environment steps in Python:
```
python3 -m venv name
```
Where `name` is the name of your environment, so you can call it anything.

Our second step is to activate our virtual environment:
```
source name/bin/activate
```
And that’s it. We are now in our Python virtual environment. You can tell you’re in it when the (name) prefix, or whatever you named it, shows up in your terminal prompt.
Continuing on with our project, use this installation command:
```
pip install -U "databricks-connect==10.4.*"
```
Once fully downloaded, we need to set up our cluster configuration. Use the configure command and follow the instructions. This is where you will input your configuration essentials from our “Databricks configuration essentials” section.
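For the `databricks-connect` client we installed above, the configuration step is run from your activated virtual environment with the command below; it should prompt you for the connection properties you noted down earlier.

```
databricks-connect configure
```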
Once finished, use this command to check if you’re connected to your cluster:
```
databricks-connect test
```
You’ll know you’re correctly configured when you see “All tests passed” in your console.
Now, it’s time to set up our SparkSessions and connect them to our Atlas cluster.
The creation of a SparkSession object is crucial for our tutorial because it provides a way to access all important PySpark features in one place. These features include: reading data, creating data frames, and managing the overall configuration of PySpark applications. Our SparkSession will enable us to read and write to our Atlas cluster through the data frames we create.
The full code is on our GitHub account, so please access it there if you would like to replicate this exact tutorial. Below, we will only go over some of the essential parts of the code.
This is the SparkSession object we need to include. We are going to use a basic structure where we describe the application name, configure our “read” and “write” connectors to our `connection_string` (our MongoDB cluster connection string, which we have saved safely as an environment variable), and configure our `mongo-spark-connector`. Make sure to use the correct `mongo-spark-connector` version for your environment. For ours, it is version 10.0.3. Depending on your cluster’s Spark and Scala versions, the `mongo-spark-connector` version you need might be different. To find which version is compatible with your environment, please refer to the MVN Repository documents.

```python
import os

from dotenv import load_dotenv
from pyspark.sql import SparkSession

# Use an environment variable for the uri.
load_dotenv()
connection_string: str = os.environ.get("CONNECTION_STRING")

# Create a SparkSession. Ensure you have the mongo-spark-connector included.
my_spark = SparkSession \
    .builder \
    .appName("tutorial") \
    .config("spark.mongodb.read.connection.uri", connection_string) \
    .config("spark.mongodb.write.connection.uri", connection_string) \
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector:10.0.3") \
    .getOrCreate()
```
For more help on how to create a SparkSession object with MongoDB and for more details on the `mongo-spark-connector`, please view the documentation.

Our next step is to create two data frames: one to `write` a book to our Atlas cluster, and a second to `read` back all the books in our cluster. These data frames are essential; make sure to use the proper format or else they will not properly connect to your cluster.

Data frame to `write` a book:

```python
add_books = my_spark \
    .createDataFrame([("<title>", "<author>", <year>)], ["title", "author", "year"])

add_books.write \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .option('uri', connection_string) \
    .option('database', 'bookshelf') \
    .option('collection', 'books') \
    .mode("append") \
    .save()
```
Data frame to `read` back our books:

```python
# Create a data frame so you can read in your books from your bookshelf.
return_books = my_spark.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option('uri', connection_string) \
    .option('database', 'bookshelf') \
    .option('collection', 'books') \
    .load()

# Show the books in your PySpark shell.
return_books.show()
```
Add in the book of your choosing under the `add_books` data frame. Here, exchange the title, author, and year for the areas with the `< >` brackets. Once you add in your book and run the file, the logs will show that we are connecting properly, and we can see the added books in our PySpark shell. This demo script was run six separate times to add in six different books. A picture of the console is below:

We can double-check our cluster in Atlas to ensure they match up:
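As an optional illustration of what you can do once the data is read back, the short sketch below filters the `return_books` data frame with standard PySpark operations before displaying it; the column name and the year threshold are just examples.

```python
# Filter the books read back from Atlas to those published after 2000, then display them.
recent_books = return_books.filter(return_books.year > 2000)
recent_books.show()
```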
Congratulations! We have successfully connected our MongoDB Atlas cluster to Azure Databricks through PySpark, and we can `read` and `write` data straight to our Atlas cluster.

The skills you’ve learned from this tutorial will allow you to utilize Atlas’s scalable and flexible storage solution while leveraging Azure Databricks’ advanced analytics capabilities. This combination allows developers to handle any amount of data in an efficient and scalable manner and to gain insights into complex data sets so they can make exciting data-driven decisions!