Create an Atlas Data Lake Pipeline

On this page

  • Prerequisites
  • Procedure
  • Next steps

This page guides you through the steps for creating an Atlas Data Lake pipeline.

Before you begin, you must have the following:

  • Backup-enabled M10 or higher Atlas cluster.
  • Project Owner role for the project for which you want to deploy a Data Lake.
  • Sample data loaded on your cluster (if you wish to try the example in the following Procedure).
1

To navigate to the Atlas Data Lake page:

  1. Log in to MongoDB Atlas.
  2. Select Data Lake under Deployment on the left-hand navigation panel.
2
3

A pipeline creates a copy of the data on your Atlas cluster in MongoDB-managed cloud object storage. This storage is optimized for analytic queries and is isolated from your cluster's workload.

To set up a pipeline, specify the following on the Setup Pipeline page:

  1. Select the Atlas cluster from the dropdown.

    Example

    If you loaded the sample data on your cluster, select the Atlas cluster where you loaded the sample data.

  2. Select the database on the specified cluster from the dropdown.

    Example

    If you selected the cluster where the sample data is loaded, select sample_mflix.

  3. Select the collection in the specified database from the dropdown.

    Example

    If you selected the sample_mflix database, select the movies collection.

  4. Enter a name for the pipeline.

    Example

    If you are following the examples in this tutorial, enter sample_mflix.movies in the Pipeline Name field.

  5. Click Continue.
4

You can specify how frequently your cluster data is extracted for querying. Each snapshot represents your data at a point in time and is stored in workload-isolated analytic storage. You can query the data from any snapshot in the Data Lake datasets.

Choose the Snapshot Schedule that most closely matches your backup schedule. The available options are:

  • Every day
  • Every Saturday
  • Last day of the month
Example

For this tutorial, select Daily from the Snapshot Schedule dropdown if you don't have a backup schedule yet. If you already have a backup schedule, the available options depend on that schedule.
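To make the three schedule options concrete, the following minimal Python sketch checks which calendar dates each option covers. The function name is_snapshot_day is hypothetical and the logic only models the option names listed above; it is not how Atlas schedules snapshots internally.

```python
import calendar
from datetime import date


def is_snapshot_day(day, schedule):
    """Return True if `day` falls on the given schedule option (illustrative model only)."""
    if schedule == "Every day":
        return True
    if schedule == "Every Saturday":
        return day.weekday() == 5  # Monday is 0, so Saturday is 5
    if schedule == "Last day of the month":
        last_day = calendar.monthrange(day.year, day.month)[1]
        return day.day == last_day
    raise ValueError(f"Unknown schedule: {schedule!r}")


# 31 December 2022 was both a Saturday and the last day of the month.
sample_day = date(2022, 12, 31)
matches = [s for s in ("Every day", "Every Saturday", "Last day of the month")
           if is_snapshot_day(sample_day, s)]
```

A less frequent schedule yields fewer point-in-time snapshots to query, so the choice determines how fresh the queryable data can be.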

5

Enter the most commonly queried fields from the collection in the Partition Attributes section. To specify nested fields, use dot notation. Do not put quotes ("") around nested fields that you specify using dot notation. You can't specify fields inside an array. The specified fields are used to partition your data.
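As an illustration of the dot notation rule above, the following Python sketch resolves a dotted path in a nested document and rejects any path that passes through an array. The helper resolve_dotted_path and the sample document are hypothetical; they only model the rule and are not part of Atlas.

```python
def resolve_dotted_path(document, path):
    """Follow a dot-notation path through nested dicts; reject paths that cross an array."""
    value = document
    for part in path.split("."):
        if isinstance(value, list):
            raise ValueError(f"{path!r} points inside an array, which is not allowed")
        if not isinstance(value, dict) or part not in value:
            return None  # the path does not exist in this document
        value = value[part]
    return value


# Hypothetical document shaped loosely like a movie record.
movie = {
    "title": "Example Movie",
    "year": 1999,
    "imdb": {"rating": 7.5},
    "writers": [{"name": "Example Writer"}],
}

rating = resolve_dotted_path(movie, "imdb.rating")  # nested field, written without quotes
```

In this model, imdb.rating is a valid partition field, while writers.name is rejected because writers is an array.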

List the most frequently queried fields first, because they have a larger impact on performance and cost than fields listed lower down. The order of fields matters in the same way as it does for Compound Indexes. Data is optimized for queries by the first field, followed by the second field, and so on.

Example

Enter year in the Most commonly queried field field and title in the Second most commonly queried field field.

Atlas Data Lake optimizes performance for the year field, followed by the title field. If you configure a Federated Database Instance for your Data Lake dataset, Atlas Data Federation optimizes performance for queries on the following fields:

  • the year field,
  • the title field, and
  • the year field and the title field.

Atlas Data Federation can also support a query on the title field only. However, in this case, Atlas Data Federation wouldn't be as efficient in supporting the query as it would be if the query were on the year field only. Performance is optimized in order; if a query omits a particular partition field, Atlas Data Federation is less efficient in making use of any partition fields that follow it.

You can run Atlas Data Federation queries on fields not specified here, but Atlas Data Lake is less efficient in processing such queries.
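The effect of field order can be sketched with a toy model in Python. The code below groups a few made-up documents into nested buckets, first by year and then by title, and counts how many buckets an equality query has to open. The functions build_tree and buckets_visited are hypothetical, and the model is an assumption for illustration; it does not describe the actual Atlas Data Lake storage layout.

```python
def build_tree(docs, fields):
    """Group documents into nested buckets, one level per partition field, in order."""
    tree = {}
    for doc in docs:
        node = tree
        for field in fields[:-1]:
            node = node.setdefault(doc[field], {})
        node.setdefault(doc[fields[-1]], []).append(doc)
    return tree


def buckets_visited(tree, fields, query):
    """Count the buckets opened at every level when answering an equality query."""
    level_nodes = [tree]
    visited = 0
    for field in fields:
        next_nodes = []
        for node in level_nodes:
            if field in query:
                child = node.get(query[field])
                candidates = [child] if child is not None else []
            else:
                # The query does not constrain this field, so every bucket is opened.
                candidates = list(node.values())
            visited += len(candidates)
            next_nodes.extend(candidates)
        level_nodes = next_nodes
    return visited


fields = ["year", "title"]
movies = [
    {"title": "A", "year": 1999},
    {"title": "B", "year": 1999},
    {"title": "A", "year": 2005},
    {"title": "C", "year": 2010},
]
tree = build_tree(movies, fields)

both = buckets_visited(tree, fields, {"year": 1999, "title": "A"})  # 2 buckets
year_only = buckets_visited(tree, fields, {"year": 1999})           # 3 buckets
title_only = buckets_visited(tree, fields, {"title": "A"})          # 5 buckets
```

In this toy model, a query on title alone must open every year bucket before it can use the title level, which mirrors why a query that omits the first partition field is less efficient.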

6

By default, Atlas Data Lake extracts and stores all fields inside the documents in your collection. To specify fields to exclude:

  1. Click Add Field.
  2. Enter the field name in the Add Transformation Field Name window.

    Example

    (Optional) Enter fullplot to exclude the field named fullplot in the movies collection.

  3. Click Done.
  4. Repeat these steps for each field you wish to exclude. To remove a field from this list, click .
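The effect of excluding a field can be pictured with a short Python sketch. The helper exclude_fields and the sample document are hypothetical, and the sketch handles top-level fields only; it models the outcome described above rather than how Atlas Data Lake performs the extraction.

```python
def exclude_fields(document, excluded):
    """Return a shallow copy of `document` without the top-level fields in `excluded`."""
    return {key: value for key, value in document.items() if key not in excluded}


# Hypothetical document shaped loosely like a movie record.
movie = {
    "title": "Example Movie",
    "year": 1999,
    "plot": "Short plot.",
    "fullplot": "A much longer plot.",
}

# Mirrors the tutorial choice of excluding fullplot from the stored copy.
stored = exclude_fields(movie, {"fullplot"})
```

The original document is left unchanged; only the stored copy omits the excluded field.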
7

Now that you've created your Data Lake pipeline, proceed to Set Up a Federated Database Instance for Your Dataset.

© 2022 MongoDB, Inc.