UDF Announcement for MongoDB to BigQuery Dataflow Templates
DJ
Venkatesh Shanbhag, Diana Annie Jenosh4 min read • Published Jan 26, 2023 • Updated Apr 02, 2024
Rate this tutorial
Many enterprise customers using MongoDB Atlas as their core operational database also use BigQuery for their Batch and AI/ML based analytics, making it pivotal for seamless transfer of data between these entities. Since the announcement of the Dataflow templates (in October Of 2022) on moving data between MongoDB and BigQuery, we have seen a lot of interest from customers as it made it effortless for an append-only, one-to-one migration of data. Though the three Dataflow templates provided cater to most of the common use cases, there was also a demand to be able to do transformations as part of these templates.
We are excited to announce the addition of the ability to write your own user-defined functions (UDFs) in these Dataflow pipelines! This new feature allows you to use UDFs in JavaScript to transform and analyze data within BigQuery. With UDFs, you can define custom logic and business rules that can be applied to your data as it is being processed by Dataflow. This allows you to perform complex transformations like transforming fields, concatenating fields, deleting fields, converting embedded documents to separate documents, etc. These UDFs take unprocessed documents as input parameters and return the processed documents as output.
To use UDFs with BigQuery Dataflow, simply write your JavaScript function and store it in the Google cloud storage bucket. Use the Dataflow templates’ optional parameter to read these UDFs while running the templates. The function will be executed on the data as it is being processed, allowing you to apply custom logic and transformations to your data during the transfer.
Let’s have a quick look at how to set up a sample UDF to process (transform a field, flatten an embedded document, and delete a field) from an input document before writing the processed data to BigQuery.
- MongoDB Atlas setup through GCP Marketplace. (MongoDB Atlas is available pay as you go in the GC marketplace).
- Click on Browse collections and click on +Create Database.
5: Name your database Sample_Company and collection Sample_Employee.
6: Click on INSERT DOCUMENT.
Copy and paste the below document and click on Insert.
1 { 2 "Name":"Venkatesh", 3 "Address":{"Phone":{"$numberLong":"123455"},"City":"Honnavar"}, 4 "Department":"Solutions Consulting", 5 "Direct_reporting": "PS" 6 }
7: To have authenticated access on the MongoDB Sandbox cluster from Google console, we need to create database users.
Click on the Database Access from the left pane on the Atlas Dashboard.
Choose to Add New User using the green button on the left. Enter the username
appUser
and password appUser123
. We will use built-in roles; click Add Default Privileges and in the Default Privileges section, add the roles readWriteAnyDatabase. Then press the green Add User button to create the user.8: Whitelist the IPs.
For the purpose of this demo, we will allow access from any ip, i.e 0.0.0.0/0. However, this is not recommended for a production setup, where the recommendation will be to use VPC Peering and private IPs.
- Create a cloud storage bucket.
- On your local machine, create a Javascript file transform.js and add below sample code.
1 function transform(inputDoc) { 2 var outputDoc = new Object(); 3 inputDoc["City"] = inputDoc["Address"]["City"]; 4 delete doc.Address; 5 outputDoc = doc; 6 return returnObj; 7 }
This function will read the document read from MongoDB using the Apache beam MongoDB IO connector. Flatten the embedded document Address/City to City. Delete the Address field and return the updated document.
a. Click on the Create Job from the template button at the top.
b. Job Name: mongodb-udf.
c. Region: Same as your BigQuery dataset region.
d. MongoDB connection URI: Copy the connection URI for connecting applications from MongoDB Atlas.
e. MongoDB Database: Sample_Company.
f. MongoDB Collection: Sample_Employee.
g. BigQuery Destination Table: Copy the destination table link from the BigQuery
h. Dataset details page in format: bigquery-project:sample_dataset.sample_company.
i. User Option: FLATTEN.
j. Click on show optional parameters.
k. Cloud storage location of your Javascript UDF: Browse your UDF file loaded to bucket location. This is the new feature that allows running the UDF and applies the transformations before inserting into BigQuery.
l. Name of your Javascript function: transform.
6: Click on RUN JOB to start running the pipeline. Once the pipeline finishes running, your graph should show Succeeded on each stage as shown below.
7: After completion of the job, you will be able to see the transformed document inserted into BigQuery.
In this blog, we introduced UDFs to MongoDB to BigQuery Dataflow templates and their capabilities to transform the documents read from MongoDB using custom user defined Javascript functions stored on Google Cloud storage buckets. This blog also includes a simple tutorial on how to set up MongoDB Atlas, Google Cloud, and the UDFs.
- Set up your first MongoDB cluster using Google Marketplace.