Querying Your Data Lake
You can use the MongoDB Query Language (MQL) on Atlas Data Lake to query and analyze data on your data store. Atlas Data Lake supports most, but not all, of the standard server commands. To learn more about the supported and unsupported MongoDB server commands and aggregation pipeline stages, see Supported MongoDB Commands.
To query data on your data store, your Data Lake storage configuration must contain settings that define:
- Your data store.
- Data Lake virtual databases and collections that map to your data store.
You can create or update your Data Lake storage configuration for your data store using the Visual Editor or the JSON Editor. For more information, see Deploy a Data Lake for an Atlas Cluster Data Store.
Data Lake creates the virtual databases and collections you specified in your Data Lake configuration for the data in your data store. When you connect to your Data Lake and run queries, Data Lake processes your queries against the data and returns the query results.
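For example, here is a minimal sketch of connecting with mongosh and querying a virtual collection. The connection string is a placeholder for the one Atlas displays for your Data Lake, and sample-db and sample-coll are hypothetical names from your storage configuration:

    mongosh "mongodb://<username>:<password>@<your-data-lake-host>/?ssl=true&authSource=admin"

    // Query the virtual collection exactly as you would an ordinary
    // MongoDB collection; Data Lake resolves it against your data store.
    db.getSiblingDB("sample-db").getCollection("sample-coll").find({ status: "active" }).limit(5)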
You can run up to 30 simultaneous queries on your Data Lake against:
- Data in your S3 bucket.
- Documents in your MongoDB Atlas cluster.
- Data in files hosted at publicly accessible URLs.
The following sections contain information about running queries against data in your data store.
Querying Data on S3
When deploying your Data Lake, if you specified an S3 bucket with both read and write permissions or the AWS s3:PutObject permission, you can also save your query results in your S3 bucket using $out to S3.
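For example, here is a hedged sketch of writing aggregation results back to S3 with $out. The bucket, region, and filename prefix are placeholders, and the exact set of supported fields may vary by Data Lake version:

    // Runs against a hypothetical virtual collection and writes the
    // matching documents to the named S3 bucket as JSON files.
    db.getSiblingDB("sample-db").getCollection("sample-coll").aggregate([
      { $match: { year: 2021 } },               // reduce the data before writing
      { $out: {
          "s3": {
            "bucket": "my-results-bucket",      // placeholder: bucket with s3:PutObject
            "region": "us-east-1",              // placeholder region
            "filename": "query-results/run1-",  // placeholder filename prefix
            "format": { "name": "json" }
          }
      } }
    ])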
If you successfully create or update an object on your S3 data store, Data Lake returns the latest version of that object for any subsequent read requests, and all list operations on the objects also reflect the change. If your query contains multiple stages, each stage receives the most recent data available from the data store as that stage is processed.
By default, Atlas Data Lake does not return documents in any specific order for queries on Data Lakes for S3 data stores. Atlas Data Lake reads the partitions concurrently, and the underlying storage response order determines which documents Atlas Data Lake returns first, unless you define an order using $sort in your query. For example, if you run the same findOne() query twice, you could see different documents, and if you use $skip without $sort, different documents might be skipped.
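For example, here is a sketch of deterministic paging over an S3-backed collection; the collection name is hypothetical:

    // Without the $sort stage, repeated runs could return or skip
    // different documents.
    db.getSiblingDB("sample-db").getCollection("sample-coll").aggregate([
      { $sort: { _id: 1 } },  // establish a stable order first
      { $skip: 100 },         // now the same 100 documents are skipped every run
      { $limit: 10 }
    ])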
You incur "Data Processed" costs for the amount of data that Atlas Data Lake processes to return results for your queries in addition to the "Data Returned" cost for the amount of data that Atlas Data Lake returns. For example, for a 10 GB file, you incur the following "Data Processed" cost in addition to the "Data Returned" cost:
- If you have no partitions, Atlas Data Lake reads the entire file to return results for the query. Therefore, you incur 10 GB of "Data Processed" cost.
- If you have 10 partitions of 1 GB each, Atlas Data Lake targets and reads a single partition. Therefore, you incur 1 GB of "Data Processed" cost.
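Partitions come from the path template in the dataSources of your storage configuration. As a hedged sketch (the store, collection, and attribute names are hypothetical), a path such as the following lets Atlas Data Lake parse year and month from your S3 keys and read only the partitions that a query on those fields needs:

    "collections" : [
      {
        "name" : "sales",
        "dataSources" : [
          {
            "storeName" : "s3-store",
            "path" : "/sales/{year int}/{month int}/*"
          }
        ]
      }
    ]

With this layout, a query that filters on year and month touches only the matching files, which is what produces the 1 GB rather than 10 GB of "Data Processed" in the example above.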
Querying Data in Your Atlas Cluster
If you query a collection in Atlas Data Lake that is mapped to only one Atlas collection, Atlas Data Lake acts as a proxy and forwards your query to Atlas. When acting as a proxy, Atlas Data Lake doesn't scan data into its virtual collection to process the query, which improves performance and reduces cost. This optimization is not available for queries on Atlas Data Lake collections that are mapped to multiple Atlas collections.
Consider the following Data Lake storage configuration:
{ "stores" : [ { "name" : "atlas-store", "provider": "atlas", "clusterName": "myCluster", "projectId": "5e2211c17a3e5a48f5497de3" } ], "databases" : [ { "name" : "atlas-db", "collections" : [ { "name" : "foo", "dataSources" : [ { "storeName" : "atlas-store", "database" : "myFooData", "collection" : "foo" } ] }, { "name" : "barbaz", "dataSources" : [ { "storeName" : "atlas-store", "database" : "myBarData", "collection" : "bar" }, { "storeName" : "atlas-store", "database" : "myBazData", "collection" : "baz" } ] } ] } ] }
For the above storage configuration, Atlas Data Lake acts as a proxy for queries on the foo collection and forwards the queries to Atlas. This performance and cost optimization is not available for queries on the barbaz collection because barbaz is mapped to multiple Atlas collections.
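For example (the qty field is hypothetical):

    // Proxied: atlas-db.foo maps to exactly one Atlas collection,
    // so Data Lake forwards this query to myFooData.foo on myCluster.
    db.getSiblingDB("atlas-db").foo.find({ qty: { $gt: 10 } })

    // Not proxied: atlas-db.barbaz maps to two Atlas collections,
    // so Data Lake must scan both to answer the query.
    db.getSiblingDB("atlas-db").barbaz.find({ qty: { $gt: 10 } })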
You can also save your query results in your Atlas cluster using $out to Atlas.
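Here is a hedged sketch of $out to Atlas, reusing the projectId and clusterName from the configuration above; the target database, target collection, and match field are hypothetical:

    db.getSiblingDB("atlas-db").foo.aggregate([
      { $match: { qty: { $gt: 10 } } },
      { $out: {
          "atlas": {
            "projectId": "5e2211c17a3e5a48f5497de3",
            "clusterName": "myCluster",
            "db": "myResults",      // hypothetical target database
            "coll": "fooResults"    // hypothetical target collection
          }
      } }
    ])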
If you successfully create or update a document in your collection on the Atlas cluster, Data Lake returns the latest version of that document for any subsequent read requests, and all list operations on the collection also reflect the change. If your query contains multiple stages, each stage receives the most recent data available from the data store as that stage is processed.
Querying Data at an HTTP or HTTPS URL
The support for HTTP data stores is available as a Beta feature. The feature and the corresponding documentation may change at any time during the Beta stage.
Data Lake creates one partition for each URL in your collection. When you connect to your Data Lake and run queries, Data Lake processes your queries against the data and returns the query results.
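As a minimal sketch, a storage configuration for an HTTP data store might look like the following. The URL and all names are placeholders, and the exact fields may change while the feature is in Beta:

    {
      "stores" : [
        {
          "name" : "http-store",
          "provider" : "http",
          "urls" : [ "https://www.example.com/data.json" ]
        }
      ],
      "databases" : [
        {
          "name" : "http-db",
          "collections" : [
            {
              "name" : "urlData",
              "dataSources" : [
                {
                  "storeName" : "http-store",
                  "urls" : [ "https://www.example.com/data.json" ]
                }
              ]
            }
          ]
        }
      ]
    }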
Running Federated Queries
You can use Atlas Data Lake to query and analyze a unified view of data in your Atlas cluster, in your S3 bucket, and at your HTTP URLs. For federated queries, your Data Lake storage configuration must contain the settings that define:
- Your S3, Atlas, and HTTP data stores.
- Data Lake virtual databases and collections that map to your S3, Atlas, and HTTP data stores.
You can create or update your Data Lake storage configuration using the Visual Editor or the JSON Editor. For more information, see Deploy a Data Lake for an Atlas Cluster Data Store.
When you connect to your Data Lake and run federated queries, Data Lake combines data from your Atlas cluster, S3 bucket, and HTTP store in virtual databases and collections and returns a union of data in the results.
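As a hedged sketch, assuming the atlas-store, s3-store, and http-store stores from the earlier examples are defined, mapping one virtual collection to data sources in all three stores yields query results that are the union of the underlying data (the database and collection names are hypothetical):

    "databases" : [
      {
        "name" : "federated-db",
        "collections" : [
          {
            "name" : "allEvents",
            "dataSources" : [
              { "storeName" : "atlas-store", "database" : "events", "collection" : "events" },
              { "storeName" : "s3-store", "path" : "/events/*" },
              { "storeName" : "http-store", "urls" : [ "https://www.example.com/events.json" ] }
            ]
          }
        ]
      }
    ]

A query on federated-db.allEvents then returns matching documents from the Atlas cluster, the S3 bucket, and the HTTP URL as if they lived in a single collection.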
Troubleshooting
Error: We are currently experiencing increased query processing wait times for Atlas Data Lake. Our Engineering team is investigating. Normal service will resume shortly, please try again.
Atlas Data Lake returns this error only when Atlas Data Lake can't execute queries because of resource contention. We recommend that you run your queries again.