MongoDB Atlas Data Lake Setup

Atlas Data Lake Setup

At MongoDB World 2019, we announced the Beta version of MongoDB Atlas Data Lake which allows for the querying of data stored in Amazon S3 buckets. We’re able to query uncompressed or compressed data stored in JSON, BSON, CSV, TSV, Avro, and Parquet formats right in S3. This is a great feature of MongoDB Atlas to enable the exploration of data stored on S3 without having to import it into a database.

Before diving into the Atlas Data Lake, however, there are some setup steps that need to be done. To start, you'll need an Amazon Web Services (AWS) account, access to the AWS Management Console, and Command Line Interface (CLI) tools. Let's take a look.

Setting Up the AWS CLI

The first thing we’ll need to create is an Identity and Access Management (IAM) user. From the AWS Management Console, I’ll search for IAM.

AWS Management Console Search

On the left navigation, I’ll select Users, and Add User. I’ll make a new user, I’ll just call it DataLakeUser, and assign Programmatic access.

Add User

Then I’ll click on Next: Permissions and assign, or I can create, a group that has Administrator Access privileges.

Set Permissions

I’ll click on Next: Tags, and since I’m not assigning any tags at this point, I’ll move onto Next: Review. And everything looks good, so I’ll create the user.

Access and Secret Key Information

I’ll need the Access key ID and Secret access key information to configure my command-line interface. I’ll download the .csv file and store it locally. Note that once you leave this screen you won’t have access to this information again and will have to go through the credential creation process to get a new set of keys. That’s why downloading that information at this point is important.

Next, in a terminal window I'll run:

aws configure

Here, I'll be prompted for four pieces of information to configure my AWS Command Line Interface (CLI) tools, the Access Key Id and Secret Access Key we just saved in the last step, my AWS region that I want to use, and the output format.

AWS Access Key ID [****************UHHP]: AKI********HS
AWS Secret Access Key [****************EcNP]: uJ*******Lpk
Default region name [us-west-2]: 
Default output format [json]:

Great, the AWS CLI is now configured for my AWS account. The next step is organizing the S3 bucket directories.

AWS S3 Bucket Structure

While I could go through and create the bucket structure using the cli commands, I find it to be faster and easier with the S3 console. I'll navigate to the S3 service and locate the bucket where my data is stored. For this example, I have several datasets inside an atlas-data-lake-demo bucket.

Download the sample data and bucket structure here.

Inside that bucket, I have several folders, each with a different data type:

S3 Buckets

Let's take a look inside the CSV folder:

CSV Folder Contents

Inside each of these folders, I have the actual CSV data files. Similarly, in the other data file format folders (avro, json, parquet, and tsv), I have datasets of those respective data types. These folder names, comics and seattleLibrary, in this CSV example, will become collections inside Altas Data Lake. This structure is what's outlined in our Data Lake setup walkthrough. It can be configured differently, but the structure here works pretty well.

With the CLI configured and the S3 directory structure in place, we're ready to set up the MongoDB Atlas Data Lake.

Connecting MongoDB Atlas to S3

Now it’s time to get our feet wet with setting up the connection between Atlas Data Lake and our S3 bucket. Inside the MongoDB Atlas console, on the left side, I’ll click on Data Lake which takes me to the MongoDB Atlas Data Lake dashboard.

Try MongoDB Atlas Data Lake

From here I'll select Try MongoDB Atlas Data Lake which will start the step-by-step setup of the connection, but I'll point out a few things along the way as well. After the opening splash screen, we move onto naming our Data Lake. Once you pick a name here, you can’t change it. For now, I’ll call this LakeDemo.

Name Your Data Lake

Next up is the S3 bucket or buckets, this Data Lake will attach to. The S3 bucket I'm using is called atlas-data-lake-demo… not very creative I know… but, I’ll put that into this text box. If you are connecting to more than a single S3 bucket, you can add a comma-separated list in the text box as well.

S3 Buckets to Query

Role & Policy Task

The next step in the configuration process is the Role & Policy task. This is where the AWS CLI configuration we did previously pays off. Something to note here: As the warning box states, the information for the External ID and ARN aren’t shown inside the MongoDB Atlas interface in this format again, so we’ll want to save this information. I like to copy this information and put it into a config.txt file that I can find later. The information is also available in the JSON file we’ll create in a moment, but it’s good to save it separately.

The next step is to enter a new IAM role for MongoDB Atlas to use. I’ll go with the default suggestion here of mdb-dl-role.

There is a generated trust policy that we’ll copy the text from and save into a file named dl-role-trust.json. This allows MongoDB Atlas to be trusted in AWS S3.

Using that JSON file, we can create an AWS IAM role from the command line… see, I said our previous work would pay off. We copy the command, go to the terminal in the same directory as the dl-role-trust.json file we just saved, and run the command.

aws iam create-role \
  --role-name mdb-dl-role \
  --assume-role-policy-document file://dl-role-trust.json

We should get returned a JSON document that confirms our role.

There’s some information we’ll need out of this response in a later step, so don’t clear your screen!

You can also find the information from the AWS Console, but since it was just provided, we can use it from here.

One more bit of JSON to copy and paste into a file. This time the file is called dl-s3-role-policy.json, and it defines the IAM role that we just created. Once that’s saved, we can apply that policy to the role with the command here in the Roles & Policy step 5.

aws iam put-role-policy \
  --role-name mdb-dl-role \
  --policy-name mdb-dl-role-policy \
  --policy-document file://dl-s3-role-policy.json

We run this command in the terminal, again in the same folder that the dl-s3-role-policy.json file resides.

The next step in the process is to validate the connection. Remember that piece of information I said we’d need? Well, now’s the time. We want the get the Amazon Resource Name (ARN) string from the response we got when we created the IAM role.

If we scroll up a bit in the terminal window we can find it. Note that we want the ARN under the Role information, not the ARN under the Principal information. We paste that into the text box and we don’t need to touch the bucket name box. We next click on Validate & Launch which makes sure that all of our configurations are in place and correct.

Congratulations! You’ve successfully created an Atlas Data Lake.

Swimming in the Data Lake

Let's connect to the MongoDB Atlas Data Lake using the Mongo Shell. From the Data Lake console, I’ll click on Step 2 and connect. Then Connect with the Mongo Shell. I already have the Mongo Shell installed, so I’ll choose that option. If you don’t have it already, you can download it for your OS platform.

I’ll copy the connection string and paste it into the command line.

> mongo "<<CONNECTION STRING>>" --ssl --authenticationDatabase 
        admin --username <username> --password <password>

Note that I need to replace the connection string, username, and password portions of this with the appropriate information for database access. This user must have Atlas Admin privileges to be able to configure the database. You can configure the users and their permissions from the Database Access tab on the left.

After I run that command from the terminal, I should be inside the Mongo Shell. The Atlas Data Lake console will take me to the next step: database configuration. By clicking on Step 3 here, I’m taken to the documentation for a storageSetConfig command that I need to run to enable Atlas Data Lake to pull in the data from S3.

I need to update the information in the sample code to match my S3 bucket region and bucket name. Recall from an earlier step when I set up the directory structure in S3 I mentioned that those directories would come into Atlas Data Lake as collections. This command is where that is defined. We can define the databases that are generated in MongoDB Atlas as well as the collections, and what that structure looks like. Additionally, to improve performance you can structure your data on S3 in a way that aligns with common queries similar to indexing strategies and make the accompanying optimizations to your Storage Config

For example, if our data is stored in S3 in yearly folders and inside each of those folders is data of different file types, e.g. JSON, CSV, TSV, etc. MongoDB Atlas Data Lake will intelligently read the S3 partitions and make the partition names park of the documents. This allows for the efficient querying of data by only opening the needed partitions. Meaning that data in a 2018 folder will be pulled into the query logic and searches for 2018 data will only open 2018 folders.

For now, I’ll stick with a single database called sample, but I’ll bring in different datasets from the different file types. I just need to copy these lines here:

{                      
        /* Works with the definition to create a
        collection the data in each directory */
        "store": "s3store",        // Links to the S3 store above
        "definition": "/json/{collectionName()}"
      }

Paste them in and update json to the different file types.

use admin;
db.runCommand( { "storageSetConfig": {
    "stores": [{
      "s3": {
        "name": "s3store",           // Creates an S3 store
  
        "region":"us-west-2",  // Update with the bucket region code

        "bucket": "atlas-data-lake-demo"    // Update with your bucket name

      }
    }],
    "databases": {
      "sample": {                    // Creates a database named sample
        "*": [
    {                      
        /* Works with the definition to create a
        collection the data in each directory */
        "store": "s3store",        // Links to the S3 store above
        "definition": "/avro/{collectionName()}"
      },
    {
        /* Works with the definition to create a
        collection the data in each directory */
        "store": "s3store",        // Links to the S3 store above
        "definition": "/csv/{collectionName()}"
      },
      {
        /* Works with the definition to create a
        collection the data in each directory */
        "store": "s3store",        // Links to the S3 store above
        "definition": "/json/{collectionName()}"
      },
      {
        /* Works with the definition to create a
        collection the data in each directory */
        "store": "s3store",        // Links to the S3 store above
        "definition": "/parquet/{collectionName()}"
      },
      {
        /* Works with the definition to create a
        collection the data in each directory */
        "store": "s3store",        // Links to the S3 store above
        "definition": "/tsv/{collectionName()}"
      }]
    }
  }
}})

Then I run the command from inside the Mongo Shell. I should then get back a { "ok" : 1 } response indicating that all is well.

From here, depending on the size of the datasets, it can take a bit for Atlas Data Lake to generate everything needed. You’ll know it’s ready to go when you can run a

show dbs

command and you are given a list of databases. In the case of this example, I’ll get back a single database named sample.

Then with a

use sample

command and a

show collections

I should get a list of the collections.

The Data Lake console will have three green boxes with checkmarks at this point, indicating that the database has been configured. The last step is to run a query.

Back in the Mongo Shell, I can query any of those collections using MQL, the MongoDB Query Language. I’ll run a simple findOne() on the comics collection. Behind the scenes here, this operation should have fairly reasonable performance as this will only open the associated S3 partition that holds this data.

db.comics.findOne()

{
	"align" : "Good",
	"alive" : "Living Characters",
	"appearances" : "4043",
	"eye" : "Hazel Eyes",
	"first_appear" : "Aug-1962",
	"gender" : "Male",
	"gsm" : "NA",
	"hair" : "Brown Hair",
	"id" : "Secret",
	"name" : "Spider-Man (Peter Parker)",
	"publisher" : "marvel"
}

After running that query our Data Lake setup dashboard should be cleared and an actual dashboard will show up.

Wrap Up

Excellent work! We’ve come a long way in a few relatively short steps. You now have the power to explore your data stored in S3 using MQL. Don’t stop with just the Mongo Shell though. You can also explore data using MongoDB Compass and programmatically using MongoDB Drivers for your favorite language.

See MongoDB Atlas Data Lake In Action

Enjoy this great new feature of MongoDB Atlas. Let us know how you’re putting it to use, we’re always excited to hear how folks are using our products and features.

Until next time, Happy Coding!