Utilizing PySpark to Connect MongoDB Atlas with Azure Databricks
Data processing is no easy feat, but with the proper tools it can be simplified, enabling you to make the best data-driven decisions possible. In a world overflowing with data, we need the best methods to derive the most useful information.
The combination of MongoDB Atlas and Azure Databricks makes an efficient choice for big data processing. By connecting Atlas with Azure Databricks, we can extract data from our Atlas cluster, process and analyze it using PySpark, and then store the processed data back in our Atlas cluster. Analyzing your Atlas data in Azure Databricks gives you access to Databricks’ wide range of advanced analytics capabilities, including machine learning, data science, and areas of artificial intelligence like natural language processing. Processing your Atlas data with these advanced Databricks tools lets us handle any amount of data in an efficient and scalable way, making it easier than ever to gain insights into our data sets and make the most effective data-driven decisions.
This tutorial will show you how to utilize PySpark to connect Atlas with Databricks so you can take advantage of both platforms.
MongoDB Atlas is a scalable and flexible storage solution for your data, while Azure Databricks provides the power of Apache Spark along with the security and collaboration features available with a Microsoft Azure subscription. PySpark, the Python interface for Apache Spark, offers an easy-to-use way to develop Spark applications in Python. To connect PySpark with MongoDB Atlas, we use the MongoDB Spark Connector, which ensures seamless compatibility, as you will see below in the tutorial.
Our tutorial combining the above platforms will consist of viewing and manipulating an Atlas cluster and visualizing data from that cluster back in our PySpark console. We will set up both Atlas and Azure Databricks clusters, connect our Databricks cluster to our IDE, and write scripts to view and contribute to the cluster in our Atlas account. Let’s get started!
In order to successfully recreate this project, please ensure you have everything in the following list:
- MongoDB Atlas account.
- Microsoft Azure subscription (two-week free tier trial).
- Python 3.8+.
- Java on your local machine.
Our first step is to set up a MongoDB Atlas cluster. Access the Atlas UI and follow these steps. For this tutorial, a free “shared” cluster is perfect. Create a database named “bookshelf” with a collection inside named “books”. To keep this tutorial simple, allow connections from anywhere in your cluster’s network security settings.
Once properly provisioned, your cluster will look like this:
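If you would rather create the “bookshelf” database and “books” collection programmatically instead of through the Atlas UI, a minimal PyMongo sketch could look like the one below. It assumes you have installed `pymongo` and saved your cluster’s connection string in a `CONNECTION_STRING` environment variable (the same variable we use later in this tutorial); the placeholder book values are yours to replace.

```python
import os

from pymongo import MongoClient

# Connect to the Atlas cluster using the connection string stored in an environment variable.
client = MongoClient(os.environ["CONNECTION_STRING"])

# Inserting a first document implicitly creates the "bookshelf" database and "books" collection.
client["bookshelf"]["books"].insert_one(
    {"title": "<title>", "author": "<author>", "year": 2000}  # placeholder values
)
```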
Now we can set up our Azure Databricks cluster.
Go to the Azure Databricks page, sign in, and access the Azure Databricks tab. This is where you’ll create an Azure Databricks workspace.
For our Databricks cluster, a free trial works perfectly for this tutorial. Once the cluster is provisioned, you’ll only have two weeks to access it before you need to upgrade.
Hit “Review and Create” at the bottom. Once your workspace is validated, click “Create.” Once your deployment is complete, click on “Go to Resource.” You’ll be taken to your workspace overview. Click on “Launch Workspace” in the middle of the page.
This will direct you to the Microsoft Azure Databricks UI where we can create the Databricks cluster. On the left-hand side of the screen, click on “Create a Cluster,” and then click “Create Compute” to access the correct form.
When creating your cluster, pay close attention to your “Databricks runtime version,” as you will need to install a matching version of databricks-connect later in this tutorial. Continue through the steps to create your cluster.
We’re now going to install the libraries we need in order to connect to our MongoDB Atlas cluster. Head to the “Libraries” tab of your cluster, click on “Install New,” and select “Maven.” Hit “Search Packages” next to “Coordinates.” Search for `mongo` and select the `mongo-spark` package. Do the same thing with `xml` and select the `spark-xml` package. When done, your Libraries tab will look like this:
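For reference, the Maven coordinates for the two packages look roughly like the lines below. The `mongo-spark-connector` version shown is the one used later in this tutorial, while the `spark-xml` coordinate is only an example; check the MVN Repository for the releases that match your cluster’s Databricks runtime (Spark and Scala versions).

```
org.mongodb.spark:mongo-spark-connector:10.0.3
com.databricks:spark-xml_2.12:0.14.0
```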
Now that we have our Azure Databricks cluster ready, we need to properly connect it to our IDE. We can do this through a very handy tool called Databricks Connect, which allows Azure Databricks clusters to connect seamlessly to the IDE of your choosing.

Before we establish our connection, let’s make sure we have our configuration essentials. These are available in the Databricks Connect tutorial on Microsoft’s website under “Step 2: Configure connection properties.” Please note these properties down in a safe place, as you will not be able to connect properly without them.
Access the Databricks Connect page linked above to properly set up `databricks-connect` on your machine. Ensure that you download the `databricks-connect` version that is compatible with your Python version and is the same as the Databricks runtime version in your Azure cluster.

Please ensure prior to installation that you are working with a virtual environment for this project. Failure to use a virtual environment may cause PySpark package conflicts in your console.
Virtual environment steps in Python:
```
python3 -m venv name
```
Where `name` is the name of your environment, so you can call it anything.

Our second step is to activate our virtual environment:
```
source name/bin/activate
```
And that’s it. We are now in our Python virtual environment. You can tell you’re in it when the (name) prefix, or whatever you named it, shows up in your terminal prompt.
Continuing on with our project, use this installation command:
```
pip install -U "databricks-connect==10.4.*"
```
Once fully downloaded, we need to set up our cluster configuration. Use the configure command and follow the instructions. This is where you will input your configuration essentials from our “Databricks configuration essentials” section.
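For the `databricks-connect` client we installed above, the configuration step is run from your activated virtual environment with the command below; it should prompt you for the connection properties you noted down earlier.

```
databricks-connect configure
```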
Once finished, use this command to check if you’re connected to your cluster:
```
databricks-connect test
```
You’ll know you’re correctly configured when you see “All tests passed” in your console.
Now, it’s time to set up our SparkSessions and connect them to our Atlas cluster.
The creation of a SparkSession object is crucial for our tutorial because it provides a way to access all important PySpark features in one place. These features include: reading data, creating data frames, and managing the overall configuration of PySpark applications. Our SparkSession will enable us to read and write to our Atlas cluster through the data frames we create.
The full code is on our GitHub account, so please access it there if you would like to replicate this exact tutorial. Below, we will only go over some of the essential parts of the code.
This is the SparkSession object we need to include. We are going to use a basic structure where we describe the application name, configure our “read” and “write” connectors to our `connection_string` (our MongoDB cluster connection string, which we have saved safely as an environment variable), and configure our `mongo-spark-connector`. Make sure to use the correct `mongo-spark-connector` version for your environment. For ours, it is version 10.0.3. Depending on your cluster’s Spark and Scala versions, the `mongo-spark-connector` version you need might be different. To find which version is compatible with your environment, please refer to the MVN Repository documents.

```python
import os

from dotenv import load_dotenv
from pyspark.sql import SparkSession

# Use an environment variable for the uri.
load_dotenv()
connection_string: str = os.environ.get("CONNECTION_STRING")

# Create a SparkSession. Ensure you have the mongo-spark-connector included.
my_spark = SparkSession \
    .builder \
    .appName("tutorial") \
    .config("spark.mongodb.read.connection.uri", connection_string) \
    .config("spark.mongodb.write.connection.uri", connection_string) \
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector:10.0.3") \
    .getOrCreate()
```
For more help on how to create a SparkSession object with MongoDB and for more details on the `mongo-spark-connector`, please view the documentation.

Our next step is to create two data frames: one to `write` a book to our Atlas cluster, and a second to `read` back all the books in our cluster. These data frames are essential; make sure to use the proper format or else they will not properly connect to your cluster.

Data frame to `write` a book:

```python
add_books = my_spark \
    .createDataFrame([("<title>", "<author>", <year>)], ["title", "author", "year"])

add_books.write \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .option('uri', connection_string) \
    .option('database', 'bookshelf') \
    .option('collection', 'books') \
    .mode("append") \
    .save()
```
Data frame to `read` back our books:

```python
# Create a data frame so you can read in your books from your bookshelf.
return_books = my_spark.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option('uri', connection_string) \
    .option('database', 'bookshelf') \
    .option('collection', 'books') \
    .load()

# Show the books in your PySpark shell.
return_books.show()
```
Add in the book of your choosing under the `add_books` data frame. Here, exchange the title, author, and year for the areas with the `< >` brackets. Once you add in your book and run the file, the logs will show that we are connecting properly, and we can see the added books in our PySpark shell. This demo script was run six separate times to add in six different books. A picture of the console is below:

We can double-check our cluster in Atlas to ensure they match up:
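As an optional illustration of what you can do once the data is read back, the short sketch below filters the `return_books` data frame with standard PySpark operations before displaying it; the column name and the year threshold are just examples.

```python
# Filter the books read back from Atlas to those published after 2000, then display them.
recent_books = return_books.filter(return_books.year > 2000)
recent_books.show()
```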
Congratulations! We have successfully connected our MongoDB Atlas cluster to Azure Databricks through PySpark, and we can `read` and `write` data straight to our Atlas cluster.

The skills you’ve learned from this tutorial will allow you to utilize Atlas’s scalable and flexible storage solution while leveraging Azure Databricks’ advanced analytics capabilities. This combination allows developers to handle any amount of data in an efficient and scalable manner and to gain insights into complex data sets so they can make exciting data-driven decisions!