Create an Atlas Data Lake Pipeline - Preview
You can create Atlas Data Lake pipelines using the Atlas UI, the Data Lake Pipelines API, or the Atlas CLI. This page guides you through the steps for creating an Atlas Data Lake pipeline.
Prerequisites
Before you begin, you must have the following:
- A backup-enabled M10 or higher Atlas cluster.
- The Project Owner role for the project for which you want to deploy a Data Lake.
- Sample data loaded on your cluster (if you wish to try the example in the following Procedure).
Procedure
To create a new Data Lake pipeline using the Atlas CLI, run the following command:
atlas dataLakePipelines create <pipelineName> [options]
To learn more about the command syntax and parameters, see the Atlas CLI documentation for atlas dataLakePipelines create.
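For example, a create command for the sample_mflix.movies pipeline that this tutorial builds might look like the following sketch. The option names shown here (--sourceType, --sourceClusterName, --sourceDatabaseName, --sourceCollectionName, and --sinkPartitionField) are assumptions based on the pipeline settings described on this page; verify them against the atlas dataLakePipelines create reference or atlas dataLakePipelines create --help before running the command:
# Hypothetical invocation; confirm the option names with: atlas dataLakePipelines create --help
atlas dataLakePipelines create sample_mflix.movies \
  --sourceType PERIODIC_DPS \
  --sourceClusterName <clusterName> \
  --sourceDatabaseName sample_mflix \
  --sourceCollectionName movies \
  --sinkPartitionField year,title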
Watch for a Pipeline to Complete
To watch for the specified data lake pipeline to complete using the Atlas CLI, run the following command:
atlas dataLakePipelines watch <pipelineName> [options]
To learn more about the command syntax and parameters, see the Atlas CLI documentation for atlas dataLakePipelines watch.
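For example, to wait for the sample_mflix.movies pipeline created earlier in this tutorial to become available, you could run:
atlas dataLakePipelines watch sample_mflix.movies
The command blocks until the pipeline completes, so it is useful in scripts that need the pipeline to be ready before continuing.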
To create an Atlas Data Lake pipeline through the API, send a POST request to the Data Lake pipelines endpoint. To learn more about the pipelines endpoint syntax and parameters for creating a pipeline, see Create One Data Lake Pipeline.
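As an illustration only, a request that creates the sample_mflix.movies pipeline built in this tutorial might resemble the sketch below. The URL, the versioned Accept header, and the field names in the JSON body (source, sink, partitionFields, and transformations) are assumptions about the endpoint's shape; treat Create One Data Lake Pipeline as the authoritative reference for the exact syntax:
# Hypothetical request; confirm the URL and body fields in the API reference
curl --user "{PUBLIC-KEY}:{PRIVATE-KEY}" --digest \
  --header "Content-Type: application/json" \
  --header "Accept: application/vnd.atlas.2023-01-01+json" \
  --request POST "https://cloud.mongodb.com/api/atlas/v2/groups/{groupId}/pipelines" \
  --data '{
    "name": "sample_mflix.movies",
    "source": {
      "type": "PERIODIC_DPS",
      "clusterName": "<clusterName>",
      "databaseName": "sample_mflix",
      "collectionName": "movies"
    },
    "sink": {
      "type": "DLS",
      "metadataProvider": "AWS",
      "metadataRegion": "us-east-1",
      "partitionFields": [
        { "fieldName": "year", "order": 0 },
        { "fieldName": "title", "order": 1 }
      ]
    },
    "transformations": [
      { "field": "fullplot", "type": "EXCLUDE" }
    ]
  }'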
Tip
You can send a GET request to the Data Lake availableSchedules endpoint to retrieve the list of backup schedule policy items that you can use to create your Data Lake pipeline of type PERIODIC_DPS.
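For example, a hedged sketch of that request, assuming the availableSchedules endpoint lives alongside the pipelines endpoint in the Admin API, might look like this; confirm the exact path and headers in the Data Lake Pipelines API reference:
# Hypothetical request; confirm the path in the API reference
curl --user "{PUBLIC-KEY}:{PRIVATE-KEY}" --digest \
  --header "Accept: application/vnd.atlas.2023-01-01+json" \
  --request GET "https://cloud.mongodb.com/api/atlas/v2/groups/{groupId}/pipelines/availableSchedules"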
Log in to MongoDB Atlas.
Go to Atlas Data Lake in the Atlas UI.
If it's not already displayed, select the organization that contains your project from the Organizations menu in the navigation bar.
If it's not already displayed, select your project from the Projects menu in the navigation bar.
In the sidebar, click Data Lake under the Deployment heading.
Define the data source for the pipeline.
You can create a copy of data on your Atlas cluster in MongoDB-managed cloud object storage optimized for analytic queries with workload isolation.
To set up a pipeline, specify the following in the Setup Pipeline page:
Select the Atlas cluster from the dropdown.
Example
If you loaded the sample data on your cluster, select the Atlas cluster where you loaded the sample data.
Select the database on the specified cluster from the dropdown, or type the database name in the field if the database isn't listed in the dropdown.
Atlas Data Lake won't display the database if it's unable to fetch the names of the databases for the specified cluster.
Example
If you selected the cluster where the sample data is loaded, select sample_mflix.
Select the collection in the specified database from the dropdown, or type the collection name in the field if the collection isn't available.
Atlas Data Lake won't display the collection if it's unable to fetch the collection namespace for the specified cluster.
Atlas Data Lake doesn't support Views as a data source for pipelines. You must select a collection from your cluster.
Example
If you selected the sample_mflix database, select the movies collection in the sample_mflix database.
Enter a name for the pipeline.
Atlas Data Lake pipeline names can't exceed 64 characters and can't contain:
- Forward slashes (/)
- Backward slashes (\)
- Empty spaces
- Dollar signs ($)
Example
If you are following the examples in this tutorial, enter sample_mflix.movies in the Pipeline Name field.
Click Continue.
Specify an ingestion schedule for your cluster data.
You can specify how frequently your cluster data is extracted from your Atlas Backup Snapshots and ingested into Data Lake Datasets. Each snapshot represents your data at that point in time and is stored in workload-isolated, analytic storage. You can query any snapshot data in the Data Lake datasets.
You can choose Basic Schedule or On Demand.
Basic Schedule lets you define the frequency for automatically ingesting data from available snapshots. You must choose from the following schedules; pick the Snapshot Schedule that most closely matches your backup schedule:
Every day
Every Saturday
Last day of the month
For example, if you select Every day, you must have a Daily backup schedule configured in your policy. Or, if you want to select a schedule of once a week, you must have a Weekly backup schedule configured in your policy. To learn more, see Backup Scheduling.
You can send a GET request to the Data Lake availableSchedules endpoint to retrieve the list of backup schedule policy items that you can use in your Data Lake pipeline.
Example
For this tutorial, select Daily from the Snapshot Schedule dropdown if you don't have a backup schedule yet. If you already have a backup schedule, the available options are based on that schedule.
On Demand lets you manually trigger ingestion of data from available snapshots whenever you want.
Example
For this tutorial, if you select On Demand, you must manually trigger the ingestion of data from the snapshot after creating the pipeline. To learn more, see Trigger Data Ingestion On Demand - Preview.
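If you manage the pipeline with the Atlas CLI instead of the UI, the on-demand run can also be started from the command line. The subcommand shown below is an assumption to verify against the Trigger Data Ingestion On Demand - Preview page and atlas dataLakePipelines --help:
# Hypothetical invocation; confirm the subcommand name before use
atlas dataLakePipelines trigger sample_mflix.movies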
Select the AWS region for storing your extracted data.
Atlas Data Lake provides optimized storage in the following AWS regions:
| Data Lake Regions | AWS Regions |
| --- | --- |
| Virginia, USA | us-east-1 |
| Oregon, USA | us-west-2 |
| Sao Paulo, Brazil | sa-east-1 |
| Ireland | eu-west-1 |
| London, England | eu-west-2 |
| Frankfurt, Germany | eu-central-1 |
| Mumbai, India | ap-south-1 |
| Singapore | ap-southeast-1 |
| Sydney, Australia | ap-southeast-2 |
By default, Atlas Data Lake automatically selects the region closest to your Atlas cluster for storing extracted data. If Atlas Data Lake is unable to determine the region, it defaults to us-east-1.
Specify fields in your collection to create partitions.
Enter the most commonly queried fields from the collection in the Partition Attributes section. To specify nested fields, use dot notation. Do not include quotes ("") around nested fields that you specify using dot notation. You can't specify fields inside an array. The specified fields are used to partition your data.
Warning
You can't specify field names that contain periods (.) for partitioning.
List the most frequently queried fields first because they have a larger impact on performance and cost than fields listed lower in the list. The order of fields is important in the same way as it is for Compound Indexes: data is optimized for queries on the first field, followed by the second field, and so on.
Example
Enter year in the Most commonly queried field field and title in the Second most commonly queried field field.
Atlas Data Lake optimizes performance for the year field, followed by the title field. If you configure a Federated Database Instance for your Data Lake dataset, Atlas Data Federation optimizes performance for queries on the following fields:
- the year field, and
- the year field and the title field.
Atlas Data Federation can also support a query on the title field only. However, in this case, Atlas Data Federation wouldn't be as efficient in supporting the query as it would be if the query also included the year field. Performance is optimized in order; if a query omits a particular partition, Atlas Data Federation is less efficient in making use of any partitions that follow it.
You can run Atlas Data Federation queries on fields not specified here, but Atlas Data Lake is less efficient in processing such queries.
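To make the effect of partition order concrete, the sketch below shows the kinds of filters you might later run against a federated database instance backed by this dataset. $FEDERATED_URI is a placeholder for the connection string of the federated database instance you configure in the Next steps section, and the title value is purely illustrative:
# Filters on year, the first partition field: most efficient
mongosh "$FEDERATED_URI" --eval 'db.getSiblingDB("sample_mflix").movies.find({ year: 1999 })'
# Filters on year and title, matching the partition order: also optimized
mongosh "$FEDERATED_URI" --eval 'db.getSiblingDB("sample_mflix").movies.find({ year: 1999, title: "Example Title" })'
# Filters on title only: supported, but less efficient because year is omitted
mongosh "$FEDERATED_URI" --eval 'db.getSiblingDB("sample_mflix").movies.find({ title: "Example Title" })'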
(Optional) Specify fields inside your documents to exclude.
By default, Atlas Data Lake extracts and stores all fields inside the documents in your collection. To specify fields to exclude:
Click Add Field.
Enter the field name in the Add Transformation Field Name window.
Example
(Optional) Enter fullplot to exclude the field named fullplot in the movies collection.
Click Done.
Repeat these steps for each field you wish to exclude. To remove a field from the list, click the remove icon next to it.
Next steps
Now that you've created your Data Lake pipeline, proceed to Set Up a Federated Database Instance for Your Dataset - Preview.