UDF Announcement for MongoDB to BigQuery Dataflow Templates

Venkatesh Shanbhag, Diana Annie Jenosh4 min read • Published Jan 26, 2023 • Updated Apr 02, 2024

AI Atlas JavaScript

Rate this tutorial

Many enterprise customers using MongoDB Atlas as their core operational database also use BigQuery for their Batch and AI/ML based analytics, making it pivotal for seamless transfer of data between these entities. Since the announcement of the Dataflow templates (in October Of 2022) on moving data between MongoDB and BigQuery, we have seen a lot of interest from customers as it made it effortless for an append-only, one-to-one migration of data. Though the three Dataflow templates provided cater to most of the common use cases, there was also a demand to be able to do transformations as part of these templates.

We are excited to announce the addition of the ability to write your own user-defined functions (UDFs) in these Dataflow pipelines! This new feature allows you to use UDFs in JavaScript to transform and analyze data within BigQuery. With UDFs, you can define custom logic and business rules that can be applied to your data as it is being processed by Dataflow. This allows you to perform complex transformations like transforming fields, concatenating fields, deleting fields, converting embedded documents to separate documents, etc. These UDFs take unprocessed documents as input parameters and return the processed documents as output.

To use UDFs with BigQuery Dataflow, simply write your JavaScript function and store it in the Google cloud storage bucket. Use the Dataflow templates’ optional parameter to read these UDFs while running the templates. The function will be executed on the data as it is being processed, allowing you to apply custom logic and transformations to your data during the transfer.

How to set it up

Let’s have a quick look at how to set up a sample UDF to process (transform a field, flatten an embedded document, and delete a field) from an input document before writing the processed data to BigQuery.

Set up MongoDB

MongoDB Atlas setup through registration.
MongoDB Atlas setup through GCP Marketplace. (MongoDB Atlas is available pay as you go in the GC marketplace).
Create your MongoDB cluster.
Click on Browse collections and click on +Create Database.

5: Name your database Sample_Company and collection Sample_Employee. 6: Click on INSERT DOCUMENT. Copy and paste the below document and click on Insert.

1 {
2   "Name":"Venkatesh",
3   "Address":{"Phone":{"$numberLong":"123455"},"City":"Honnavar"},
4   "Department":"Solutions Consulting",
5   "Direct_reporting": "PS"
6 }

7: To have authenticated access on the MongoDB Sandbox cluster from Google console, we need to create database users.

Click on the Database Access from the left pane on the Atlas Dashboard.

Choose to Add New User using the green button on the left. Enter the username appUser and password appUser123. We will use built-in roles; click Add Default Privileges and in the Default Privileges section, add the roles readWriteAnyDatabase. Then press the green Add User button to create the user.

8: Whitelist the IPs.

For the purpose of this demo, we will allow access from any ip, i.e 0.0.0.0/0. However, this is not recommended for a production setup, where the recommendation will be to use VPC Peering and private IPs.

Set up Google Cloud

Create a cloud storage bucket.
On your local machine, create a Javascript file transform.js and add below sample code.

1 function transform(inputDoc) {
2    var outputDoc = new Object();
3    inputDoc["City"] = inputDoc["Address"]["City"];
4    delete doc.Address;
5    outputDoc = doc;
6    return returnObj;
7 }

This function will read the document read from MongoDB using the Apache beam MongoDB IO connector. Flatten the embedded document Address/City to City. Delete the Address field and return the updated document.

3: Upload the javascript file to the Google Cloud storage bucket.

4: Create a BigQuery Dataset in your project in the region close to your physical location.

5: Create a Dataflow pipeline.

a. Click on the Create Job from the template button at the top.

b. Job Name: mongodb-udf.

c. Region: Same as your BigQuery dataset region.

d. MongoDB connection URI: Copy the connection URI for connecting applications from MongoDB Atlas.

e. MongoDB Database: Sample_Company.

f. MongoDB Collection: Sample_Employee.

g. BigQuery Destination Table: Copy the destination table link from the BigQuery

h. Dataset details page in format: bigquery-project:sample_dataset.sample_company.

i. User Option: FLATTEN.

j. Click on show optional parameters.

k. Cloud storage location of your Javascript UDF: Browse your UDF file loaded to bucket location. This is the new feature that allows running the UDF and applies the transformations before inserting into BigQuery.

l. Name of your Javascript function: transform.

6: Click on RUN JOB to start running the pipeline. Once the pipeline finishes running, your graph should show Succeeded on each stage as shown below.

7: After completion of the job, you will be able to see the transformed document inserted into BigQuery.

Conclusion

In this blog, we introduced UDFs to MongoDB to BigQuery Dataflow templates and their capabilities to transform the documents read from MongoDB using custom user defined Javascript functions stored on Google Cloud storage buckets. This blog also includes a simple tutorial on how to set up MongoDB Atlas, Google Cloud, and the UDFs.

Configure Email/Password Authentication in MongoDB Atlas App Services

Mar 13, 2024 | 3 min read

Article

Capturing and Storing Real-World Optics With MongoDB Atlas, OpenAI GPT-4o, and PyMongo

Sep 04, 2024 | 7 min read

Tutorial

Building a Knowledge Base and Visualization Graphs for RAG With MongoDB

Sep 02, 2024 | 12 min read

Quickstart

Quickstart Guide to RAG Application Using LangChain and LlamaIndex

Sep 18, 2024 | 10 min read

How to set it up
Conclusion

Atlas