Create an Atlas Data Lake Pipeline
You can create Atlas Data Lake pipelines using the Atlas UI or the Data Lake Pipelines API. This page guides you through the steps for creating an Atlas Data Lake pipeline.
Prerequisites
Before you begin, you must have the following:

- A backup-enabled M10 or higher Atlas cluster.
- The Project Owner role for the project in which you want to deploy a Data Lake.
- Sample data loaded on your cluster (if you wish to try the example in Create a Pipeline from the Atlas UI below).
Create a Pipeline from the Atlas UI
Navigate to Atlas Data Lake in the Atlas UI.
To navigate to the Atlas Data Lake page:
Log in to MongoDB Atlas.
Select Data Lake under Deployment on the left-hand navigation panel.
Define the data source for the pipeline.
Atlas Data Lake creates a copy of the data on your Atlas cluster in MongoDB-managed cloud object storage that is optimized for analytic queries and isolated from your operational workload.
To set up a pipeline, specify the following in the Setup Pipeline page:
Select the Atlas cluster from the dropdown.
Example
If you loaded the sample data on your cluster, select the Atlas cluster where you loaded the sample data.
Select the database on the specified cluster from the dropdown.
Example
If you selected the cluster where the sample data is loaded, select `sample_mflix`.
Select the collection in the specified database from the dropdown.
Example
If you selected the `sample_mflix` database, select the `movies` collection in the `sample_mflix` database.
Enter a name for the pipeline.
Example
If you are following the examples in this tutorial, enter `sample_mflix.movies` in the Pipeline Name field.
Click Continue.
Specify an ingestion schedule for your cluster data.
You can specify how frequently your cluster data is extracted from your Atlas Backup Snapshots and ingested into Data Lake datasets. Each snapshot represents your data at that point in time and is stored in workload-isolated analytic storage. You can query the data from any ingested snapshot in your Data Lake datasets.
You can choose Basic Schedule, which ingests new data automatically on your backup schedule, or On Demand, which ingests data only when you trigger the pipeline manually (see the sketch below).
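If you plan to create the pipeline through the API instead (see Create a Pipeline from the API below), this choice corresponds to the pipeline's source type. The following is a minimal sketch in Python, assuming the `PERIODIC_CPS` and `ON_DEMAND_CPS` source type values from the Atlas Admin API reference and a hypothetical cluster name:

```python
# Source block for a Basic Schedule pipeline: Atlas ingests data from
# periodic cloud backup snapshots automatically.
basic_schedule_source = {
    "type": "PERIODIC_CPS",           # periodic cloud provider snapshots (assumed enum value)
    "clusterName": "myTestCluster",   # hypothetical cluster name
    "databaseName": "sample_mflix",
    "collectionName": "movies",
}

# Source block for an On Demand pipeline: ingestion runs only when you
# trigger the pipeline manually.
on_demand_source = {**basic_schedule_source, "type": "ON_DEMAND_CPS"}
```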
Select the AWS region for storing your extracted data.
Atlas Data Lake provides optimized storage in the following AWS regions:
| Data Lake Regions | AWS Regions |
| --- | --- |
| Virginia, USA | us-east-1 |
| Oregon, USA | us-west-2 |
| Sao Paulo, Brazil | sa-east-1 |
| Ireland | eu-west-1 |
| London, England | eu-west-2 |
| Frankfurt, Germany | eu-central-1 |
| Mumbai, India | ap-south-1 |
| Singapore | ap-southeast-1 |
| Sydney, Australia | ap-southeast-2 |
By default, Atlas Data Lake automatically selects the region closest to your Atlas cluster for storing extracted data.
Specify fields in your collection to create partitions.
Enter the most commonly queried fields from the collection in the Partition Attributes section. To specify nested fields, use dot notation. Do not include quotes ("") around nested fields that you specify using dot notation. You can't specify fields inside an array. The specified fields are used to partition your data.
Warning
You can't specify field names that contain periods (.) for partitioning.
List the most frequently queried fields towards the top, because they have a larger impact on performance and cost than fields lower in the list. The order of fields is important in the same way as it is for compound indexes: data is optimized for queries on the first field, followed by the second field, and so on.
Example
Enter `year` in the Most commonly queried field field and `title` in the Second most commonly queried field field.
Atlas Data Lake optimizes performance for the `year` field, followed by the `title` field. If you configure a Federated Database Instance for your Data Lake dataset, Atlas Data Federation optimizes performance for queries on the following fields:

- the `year` field, and
- the `year` field and the `title` field.

Atlas Data Federation can also support a query on the `title` field only. However, because such a query omits the first partition field, `year`, Atlas Data Federation isn't as efficient in supporting it as it is for queries that include `year`. Performance is optimized in order; if a query omits a particular partition field, Atlas Data Federation is less efficient in making use of any partition fields that follow it.
You can run Atlas Data Federation queries on fields not specified here, but Atlas Data Lake is less efficient in processing such queries.
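To make the partition ordering concrete, the following is a minimal sketch using PyMongo against a Federated Database Instance configured for this dataset. The connection string and credentials are placeholders; the query shapes mirror the example above:

```python
from pymongo import MongoClient

# Placeholder connection string; copy the real one for your Federated
# Database Instance from the Atlas UI.
client = MongoClient("mongodb://<username>:<password>@<federated-host>/?ssl=true")
movies = client["sample_mflix"]["movies"]

# Efficient: the query filters on the first partition field (year).
list(movies.find({"year": 1999}))

# Also efficient: filters on both partition fields in order (year, then title).
list(movies.find({"year": 1999, "title": "The Matrix"}))

# Supported but less efficient: omits the first partition field (year),
# so the title partition can't be used as effectively.
list(movies.find({"title": "The Matrix"}))
```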
(Optional) Specify fields inside your documents to exclude.
By default, Atlas Data Lake extracts and stores all fields inside the documents in your collection. To specify fields to exclude:
Click Add Field.
Enter the field name in the Add Transformation Field Name window.
Example
Enter `fullplot` to exclude the field named `fullplot` in the `movies` collection.
Click Done.
Repeat these steps for each field you wish to exclude. To remove a field from the list, click the delete icon next to it.
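If you create the pipeline through the API instead, each field you exclude here becomes an entry in the pipeline's transformations array. A minimal sketch of that fragment, assuming the `EXCLUDE` transformation type from the API reference (the full request appears in the next section):

```python
# One EXCLUDE entry per field removed in the walkthrough above.
transformations = [
    {"type": "EXCLUDE", "field": "fullplot"},
]
```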
Create a Pipeline from the API
To create an Atlas Data Lake pipeline through the API, send a `POST` request to the Data Lake `pipelines` endpoint. To learn more about the `pipelines` endpoint syntax and parameters for creating a pipeline, see Create One Data Lake Pipeline.
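The following is a minimal sketch in Python that creates the pipeline from the UI walkthrough above. It assumes the v2 Atlas Admin API base URL, digest authentication with a programmatic API key, and the payload shape (source, sink, transformations) shown in the Create One Data Lake Pipeline reference; verify the field names and resource version there before relying on them:

```python
import requests
from requests.auth import HTTPDigestAuth

# Placeholders: substitute your own values.
PUBLIC_KEY = "your-public-key"
PRIVATE_KEY = "your-private-key"
GROUP_ID = "your-project-id"  # Atlas project (group) ID

# Pipeline spec mirroring the UI walkthrough above; field names are
# assumptions taken from the API reference.
pipeline = {
    "name": "sample_mflix.movies",
    "source": {
        "type": "PERIODIC_CPS",          # Basic Schedule, backed by snapshots
        "clusterName": "myTestCluster",  # hypothetical cluster name
        "databaseName": "sample_mflix",
        "collectionName": "movies",
    },
    "sink": {
        "type": "DLS",  # MongoDB-managed Data Lake Storage
        "partitionFields": [
            {"fieldName": "year", "order": 0},
            {"fieldName": "title", "order": 1},
        ],
    },
    "transformations": [
        {"type": "EXCLUDE", "field": "fullplot"},
    ],
}

resp = requests.post(
    f"https://cloud.mongodb.com/api/atlas/v2/groups/{GROUP_ID}/pipelines",
    json=pipeline,
    auth=HTTPDigestAuth(PUBLIC_KEY, PRIVATE_KEY),
    headers={"Accept": "application/vnd.atlas.2023-01-01+json"},
)
resp.raise_for_status()
print(resp.json())  # on success, the response describes the new pipeline
```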
Tip
You can send a `GET` request to the Data Lake `availableSchedules` endpoint to retrieve the list of backup schedule policy items that you can use to create your Data Lake pipeline of type `PERIODIC_DPS`.
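A minimal sketch of that request, under the same assumptions as the creation example above (v2 base URL, digest authentication; the `availableSchedules` path comes from the linked endpoint reference):

```python
import requests
from requests.auth import HTTPDigestAuth

GROUP_ID = "your-project-id"  # Atlas project (group) ID

# List the backup schedule policy items available for Data Lake pipelines.
resp = requests.get(
    f"https://cloud.mongodb.com/api/atlas/v2/groups/{GROUP_ID}/pipelines/availableSchedules",
    auth=HTTPDigestAuth("your-public-key", "your-private-key"),
    headers={"Accept": "application/vnd.atlas.2023-01-01+json"},
)
resp.raise_for_status()
for policy_item in resp.json():
    print(policy_item)
```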
Next steps
Now that you've created your Data Lake pipeline, proceed to Set Up a Federated Database Instance for Your Dataset.