Stream Data Into MongoDB Atlas Using AWS Glue
Babu Srinivasan, Anuj Panchal, Igor Alekseev • 6 min read • Published Apr 16, 2024 • Updated Jan 15, 2025
In this tutorial, you'll see how AWS Glue, Amazon Kinesis, and MongoDB Atlas integrate to provide a streamlined data streaming solution with extract, transform, and load (ETL) capabilities. The accompanying repository uses the AWS CDK to automate deployment across environments, making the whole process repeatable and efficient.
To follow along with this tutorial, you should have intermediate proficiency with AWS and MongoDB services.

In this architecture, streams of data such as Order and Customer are retrieved via Amazon Kinesis data streams. AWS Glue Studio is then used to enrich the data. The enriched data is backed up in an S3 bucket, while the consolidated stream is stored in MongoDB Atlas and made accessible to downstream systems.
This repo uses us-east-1 as the default region; update the scripts if you need a different region. The repo creates a MongoDB Atlas project and a free-tier database cluster automatically, so there is no need to create a cluster manually. It is intended for demo purposes, and IP access is not restricted (0.0.0.0/0); strengthen security by restricting access to the relevant IP addresses if required.
git clone https://github.com/mongodb-partners/Stream_Data_into_MongoDB_AWS_Glue
cd kinesis-glue-aws-cdk
a. Set up the AWS environment variables: the AWS access key ID, the AWS secret access key, and, optionally, the AWS session token.
export AWS_ACCESS_KEY_ID="<your AWS access key>"
export AWS_SECRET_ACCESS_KEY="<your AWS secret access key>"
export AWS_SESSION_TOKEN="<your AWS session token>"
b. We will use CDK to make our deployments easier.
You should have npm pre-installed.
If you don’t have CDK installed:
npm install -g aws-cdk
Make sure you’re in the root directory.
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt
For development setup, use requirements-dev.txt.
c. Bootstrap the application with the AWS account.
cdk bootstrap
d. Set ORG_ID as an environment variable in the .env file. All other parameters are set to defaults in global_args.py in the kinesis-glue-aws-cdk folder. The MONGODB_USER and MONGODB_PASSWORD parameters are set directly in mongodb_atlas_stack.py and glue_job_stack.py.
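For reference, a minimal .env file only needs the organization ID (the value below is a placeholder):
ORG_ID=<your MongoDB Atlas organization ID>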
The screenshot below shows where to find the Organization ID in MongoDB Atlas.

Please note that using "0.0.0.0/0" as the IP_ADDRESS allows access to the database from anywhere. This might be suitable for development or testing purposes, but it is highly discouraged for production environments because it exposes the database to potential attacks from unauthorized sources.
e. List the CDK stacks:
cdk ls
You should see an output of the available stacks:
aws-etl-mongo-atlas-stack
aws-etl-kinesis-stream-stack
aws-etl-bucket-stack
aws-etl-glue-job-stack
Let’s walk through each of the stacks:
Stack for MongoDB Atlas: aws-etl-mongo-atlas-stack
This stack will create a MongoDB Atlas project and a free-tier database cluster, along with a database user and open network access.
a. Create an AWS IAM role with a trust relationship that allows the CloudFormation service to assume it.

b. The following public extensions should be activated in the CloudFormation registry with the role created in the earlier step. After logging in to the AWS console, use this link to register extensions on CloudFormation.
MongoDB::Atlas::Project
MongoDB::Atlas::DatabaseUser
MongoDB::Atlas::Cluster
MongoDB::Atlas::ProjectIpAccessList
Pass the ARN of the role from the earlier step as input when activating each MongoDB resource in the public extensions.
MongoDB Resource Activation in Public Extension:

The above screenshot shows the activation of the Registry Public Extensions for the MongoDB::Atlas::Cluster.
Alternatively, you can activate the above public extensions through the AWS CLI.
Command to list the MongoDB public extensions. Note down the ARNs of the four public extensions above.
aws cloudformation list-types \
    --visibility PUBLIC \
    --filters "Category=THIRD_PARTY,TypeNamePrefix=Mongodb"
Command to activate a public extension. Run this command for each of the four public extensions mentioned in the previous step.
aws cloudformation activate-type --region us-east-1 --public-type-arn "<arn for the public extension noted down in the previous step>" --execution-role-arn "<arn of the role created in step a>"
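If you prefer to script the activation instead of running the CLI command four times, here is a minimal boto3 sketch that looks up the four public extensions and activates each one with the execution role. The role ARN is a placeholder, and the type-name prefix filter is an assumption; verify the ARNs against the list-types output above.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# Placeholder: ARN of the execution role created in step a.
EXECUTION_ROLE_ARN = "arn:aws:iam::123456789012:role/mongodb-atlas-cfn-execution-role"

# The four public extensions required by the stacks.
TYPE_NAMES = {
    "MongoDB::Atlas::Project",
    "MongoDB::Atlas::DatabaseUser",
    "MongoDB::Atlas::Cluster",
    "MongoDB::Atlas::ProjectIpAccessList",
}

# List the third-party MongoDB extensions in the public registry.
response = cfn.list_types(
    Visibility="PUBLIC",
    Filters={"Category": "THIRD_PARTY", "TypeNamePrefix": "MongoDB::Atlas"},
    MaxResults=100,
)

# Activate each required extension with the execution role.
for summary in response["TypeSummaries"]:
    if summary["TypeName"] in TYPE_NAMES:
        cfn.activate_type(
            PublicTypeArn=summary["TypeArn"],
            ExecutionRoleArn=EXECUTION_ROLE_ARN,
        )
        print(f"Activated {summary['TypeName']}")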
c. Log in to the MongoDB Atlas console and note down the organization ID. Ensure the organization ID is updated in global_args.py.

The above screenshot shows the MongoDB Cluster Organization settings.
d. Create an API Key in an organization with Organization Owner access. Note down the API credentials.

The above screenshot shows the access managers for the API Key created in the MongoDB Atlas cluster.
Restrict access to the organization API with the API access list. We provided open access (0.0.0.0/0) for demo purposes only. We strongly discourage this in any production environment or equivalent.
e. Create a profile in AWS Secrets Manager containing the MongoDB Atlas programmatic API key.

The above screenshot shows the parameters for the AWS CloudFormation stack.
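As a sketch, the secret can also be created with boto3. The secret name below assumes the default profile naming convention used by the MongoDB Atlas CloudFormation resources (cfn/atlas/profile/<profile-name>), and the key values are placeholders:
import json
import boto3

secrets = boto3.client("secretsmanager", region_name="us-east-1")

# Assumption: the stacks use the "default" Atlas profile, so the secret
# follows the cfn/atlas/profile/<profile-name> naming convention.
secrets.create_secret(
    Name="cfn/atlas/profile/default",
    SecretString=json.dumps(
        {
            "PublicKey": "<your Atlas public API key>",
            "PrivateKey": "<your Atlas private API key>",
        }
    ),
)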
Initiate the deployment with the following command:
cdk deploy aws-etl-mongo-atlas-stack
After successfully deploying the stack, validate the Outputs section of the stack and MongoDB Atlas cluster. You will find the stdUrl and stdSrvUrl for the connection string.

The above screenshot shows the output of the CloudFormation stack.
The above screenshot shows the successful creation of the MongoDB Atlas cluster.
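To verify connectivity from your machine, you can ping the new cluster with a short pymongo sketch. The URL and credentials below are placeholders; use the stdSrvUrl from the stack output and the MONGODB_USER/MONGODB_PASSWORD configured in the stacks.
from pymongo import MongoClient  # requires pymongo with SRV support (dnspython)

# Placeholders: take stdSrvUrl from the stack output and the credentials
# configured in mongodb_atlas_stack.py.
MONGODB_USER = "<your MongoDB user>"
MONGODB_PASSWORD = "<your MongoDB password>"
STD_SRV_URL = "mongodb+srv://cluster0.ab1cd.mongodb.net"

# Insert the credentials into the SRV connection string and ping the cluster.
uri = STD_SRV_URL.replace("mongodb+srv://", f"mongodb+srv://{MONGODB_USER}:{MONGODB_PASSWORD}@")
client = MongoClient(uri)
print(client.admin.command("ping"))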
Stack for the Kinesis data streams: aws-etl-kinesis-stream-stack
This stack will create two Kinesis data streams. Each producer ingests a stream of events for different customers and their orders.
Initiate the deployment with the following command:
cdk deploy aws-etl-kinesis-stream-stack
After successfully deploying the stack, check the Outputs section of the stack. You will find the CustomerOrderKinesisDataStream output.
The above screenshot shows the output of the CloudFormation stack for Kinesis streams.
The above screenshot shows the Kinesis stream created.
Stack for the S3 bucket: aws-etl-bucket-stack
This stack will create an S3 bucket that the AWS Glue jobs use to persist the incoming customer and order details.
cdk deploy aws-etl-bucket-stack
After successfully deploying the stack, check the Outputs section of the stack. You will find the S3SourceBucket resource.
The above screenshot shows the output of the CloudFormation stack.
AWS S3 Bucket:
The above screenshot shows the S3 buckets created.
Stack for the AWS Glue jobs: aws-etl-glue-job-stack
This stack will create two AWS Glue jobs: one job for the customer stream and another for the order stream. The code is at glue_job_stack/glue_job_scripts/customer_kinesis_streams_s3.py and glue_job_stack/glue_job_scripts/order_kinesis_streams_s3.py.
cdk deploy aws-etl-glue-job-stack
The above screenshot shows the output of the CloudFormation stack.
The above screenshot shows the job setup in AWS Glue Studio.
The MongoDB URL of the newly created cluster and other parameters will be passed to the AWS Glue job programmatically. Update these parameters to your values (if required).
"Spark UI logs path" and "Temporary path" details will be maintained in the same bucket location with folder name /sparkHistoryLogs and /temporary.
s3://<S3_BUCKET_NAME>/sparkHistoryLogs
s3://<S3_BUCKET_NAME>/temporary/
The above screenshot shows the AWS Glue parameters.
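For orientation, here is a minimal sketch of a Glue streaming job of this shape: it reads a Kinesis data stream and persists each micro-batch to S3. This is not the repo's actual script; the job argument names (STREAM_ARN, BUCKET_NAME), window size, and S3 prefix are illustrative.
import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Illustrative job arguments; the real jobs receive their parameters
# programmatically from the CDK stack.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "STREAM_ARN", "BUCKET_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the Kinesis data stream as a streaming DataFrame of JSON events.
kinesis_df = glue_context.create_data_frame.from_options(
    connection_type="kinesis",
    connection_options={
        "streamARN": args["STREAM_ARN"],
        "startingPosition": "TRIM_HORIZON",
        "classification": "json",
        "inferSchema": "true",
    },
)

def process_batch(data_frame, batch_id):
    # Skip empty micro-batches.
    if data_frame.count() == 0:
        return
    dyf = DynamicFrame.fromDF(data_frame, glue_context, "batch")
    # Persist the raw batch to S3 as JSON for backup.
    glue_context.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type="s3",
        connection_options={"path": f"s3://{args['BUCKET_NAME']}/customer/"},
        format="json",
    )

# Process the stream in micro-batches, checkpointing to the same bucket.
glue_context.forEachBatch(
    frame=kinesis_df,
    batch_function=process_batch,
    options={
        "windowSize": "100 seconds",
        "checkpointLocation": f"s3://{args['BUCKET_NAME']}/temporary/checkpoint/",
    },
)
job.commit()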
Once all the stacks are ready, start the producers for the customer and order streams to ingest data into the Kinesis data streams, and start the Glue jobs for both. The producer code is at:
producer/customer.py
producer/order.py
After the Glue jobs process the streams, a consolidated customer document with its embedded orders is stored in MongoDB Atlas, for example:
{
    "_id": "1",
    "country_id": "1",
    "customer_name": "NICK",
    "email_id": "nick@gmail.com",
    "orders": [
        {
            "order_id": "8",
            "product_name": "Artisanal Cheese Selection",
            "quantity": "8",
            "price": "29.17"
        }
    ]
}
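The producers essentially push JSON events like this into the Kinesis streams. A minimal boto3 sketch of that pattern (not the actual producer/customer.py; the stream name is a placeholder taken from the Kinesis stack output) looks like:
import json
import boto3

# Placeholder: use the stream name from the aws-etl-kinesis-stream-stack output.
STREAM_NAME = "<your customer Kinesis data stream>"

kinesis = boto3.client("kinesis", region_name="us-east-1")

customer_event = {
    "_id": "1",
    "country_id": "1",
    "customer_name": "NICK",
    "email_id": "nick@gmail.com",
}

# The partition key keeps records for the same customer on the same shard.
kinesis.put_record(
    StreamName=STREAM_NAME,
    Data=json.dumps(customer_event),
    PartitionKey=customer_event["_id"],
)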
Use
cdk destroy
to clean up all the AWS CDK resources.
Refer to GitHub to resolve some common issues encountered when using AWS CloudFormation/CDK with MongoDB Atlas resources.
Other useful CDK commands:
cdk ls
lists all stacks in the app.
cdk synth
emits the synthesized CloudFormation template.
cdk deploy
deploys this stack to your default AWS account/region.
cdk diff
compares the deployed stack with the current state.
cdk docs
opens CDK documentation.