Path Syntax
On this page
- Overview
- Supported Parsing Functions
- Parsing Null Values from Filenames
- Parsing Padded Numbers from Filenames
- Examples
- Parse Single Field from Filename
- Parse Multiple Fields from Filename
- Use Regular Expression to Parse Fields from Filename
- Identify Ranges of Queryable Data from Filename
- Identify Nested Fields from Filename
- Create Partitions from ObjectIds
- Create Partitions from File Path
- Generate Dynamic Collection Names from File Path
Overview
When you query documents in your S3 buckets, the Atlas Data Lake
path
value allows Data Lake to map the data inside your document to the
filename of the document.
path
supports parsing filenames in S3 buckets into computed fields.
Data Lake can add the computed fields to each document
generated from the parsed file. Data Lake can target
queries on those computed field values to only those file(s) with
a matching file name. See Supported Parsing Functions and
Examples for more information.
path
also supports creating partitions using partition attributes in the
path to the file. Data Lake can target queries on the
parameter defined in the partition attribute to only those files
that contain the query in the filename or partition prefix.
Consider the following files in your S3 bucket:
/users/26/1234.json /users/26/5678.json
The JSON document 1234.json
contains the following:
{ "name": "jane doe", "age": 26, "userID": "1234" }
Your Data Lake configuration for the files in your S3 bucket defines
the following path
:
"path": "/users/{age int}/{userID string}"
The following shows how Data Lake maps a query to the
partitions created from the path
definition:
db.users.findOne( /users { /40 "age": 26 -----------------> /26 "userID": "1234" ----------> /1234.json } /5678.json )
If the computed field for the partition attribute already exists
in your document, Data Lake maps your query to the appropriate file.
If the computed field does not exist, Data Lake adds the computed
field to the document. For example, if the age
field does not
exist in 1234.json
, Data Lake adds the age
field and value to
1234.json
.
Supported Parsing Functions
You can specify a single parsing function on the filename. |
| |
You can specify multiple parsing functions on the filename. |
| |
You can specify parsing functions alongside static strings in the
filename: |
| |
You can specify dot (i.e. . ) along the path to the filename. |
| |
You can specify ObjectIds in the path to the files to create
partitions. |
| |
You can specify a range of ObjectIds in the path to the files to
create partitions. |
| |
You can specify parsing functions along the path to the filename. |
|
Parsing Null Values from Filenames
Data Lake automatically parses an empty string (""
) in the place of an
attribute in the file path as the BSON null value for all the Atlas Data Lake attribute
types except string
. With a string
, empty string could either represent
a BSON null value or a BSON empty string value. Atlas Data Lake does not parse any BSON
value for string
attribute type. This avoids adding a BSON value with a
conflicting type to documents read from S3.
Consider the following S3 data store:
/records/january/1.json /records/february/1.json /records//1.json For the path ``/records/{month string}/*``, Data Lake does not add any computed fields for the ``month`` attribute to documents generated from the third record in the above store.
When writing files to S3, write BSON null values as empty strings in filenames for all Atlas Data Lake attribute types.
Parsing Padded Numbers from Filenames
File path can include numeric values that are padded with leading zeros. For
Data Lake to correctly parse padded numeric values for attribute types like
int
, epoch_millis
, and epoch_secs
, specify the number of digits in
the value using regular expressions.
Consider a S3 store with the following files:
|--users |--001.json |--002.json ...
The following path
syntax uses a regular expression to specify the
number of digits in the filename. Data Lake identifies the portion of the
path that corresponds to the partition attribute and then maps that
partition attribute to a type int
:
/users/{user_id int:\\d{3}}
Examples
The following examples demonstrate how to parse filenames into computed fields:
- Parse Single Field from Filename
- Parse Multiple Fields from Filename
- Use Regular Expression to Parse Fields from Filename
- Identify Ranges of Queryable Data from Filename
- Identify Nested Fields from Filename
- Create Partitions from ObjectIds
- Create Partitions from File Path
- Generate Dynamic Collection Names from File Path
Parse Single Field from Filename
Consider a data store accountingArchive
containing files
where the filename describes an invoice date. For example, the filename
/invoices/1564671291998.json
contains the invoices for the UNIX
timestamp 1564671291998
.
The following databases
object generates a field
invoiceDate
by parsing the filename as a UNIX timestamp:
"databases" : [ { "name" : "accounting", "collections" : [ { "name" : "invoices", "dataSources" : [ { "storeName" : "accountingArchive", "path" : "/invoices/{invoiceDate epoch_millis}" } ] } ] } ]
Data Lake adds the computed field and value to each
document generated from the filename. Documents generated from the
example filename includes a field
invoiceDate: ISODate("2019-08-01T14:54:51Z")
. Queries on the
invoiceDate
field can be targeted to only those files
that match the specified value.
Parse Multiple Fields from Filename
Consider a data store accountingArchive
containing files
where the filename describes an invoice number and invoice date. For
example, the filename /invoices/MONGO12345-1564671291998.json
contains the invoice MONGODB12345
for the UNIX timestamp
1564671291998
.
The following databases
object
generates:
- A field
invoiceNumber
by parsing the first segment of the filename as a string. - A field
invoiceDate
by parsing the second segment of the filename as a UNIX timestamp.
"databases" : [ { "name": "accounting", "collections" : [ { "name" : "invoices", "dataSources" : [ { "storeName" : "accountingArchive", "path" : "/invoices/{invoiceNumber string}-{invoiceDate epoch_millis}" } ] } ] } ]
Data Lake adds the computed fields and values to each document generated from the filename. Documents generated from the example filename include the following fields:
invoiceNumber : "MONGODB12345"
invoiceDate : ISODate("2019-08-01T14:54:51Z")
Queries that include both the invoiceNumber
and invoiceDate
fields can be targeted to only those files that match the
specified values.
Use Regular Expression to Parse Fields from Filename
Consider a data store accountingArchive
containing files
where the filename describes an invoice number and invoice date. For
example, the filename /invoices/MONGODB12345-20190102.json
contains
the invoice MONGODB12345
for the date 20190102
.
The following databases
object
generates:
- A field
invoiceNumber
by parsing the first segment of the filename as a string - A field
year
by using a regular expression to parse only the first 4 digits of the second segment of the filename as an int. - A field
month
by using a regular expression to parse only the next 2 digits of the second segment of the filename as an int. - A field
day
by using a regular expression to parse only the next 2 digits of the second segment of the filename as an int.
"databases" : [ { "name" : "accounting", "collections" : [ { "name" : "invoices", "dataSources" : [ { "storeName" : "accountingArchive", "path" : "/invoices/{invoiceNumber string}-{year int:\\d{4}}{month int:\\d{2}}{day int:\\d{2}}" } ] } ] } ]
Data Lake adds the computed fields and values to each document generated from the filename. Documents generated from the example filename include the following fields:
invoiceNumber : "MONGODB12345"
year : 2019
month: 01
day: 02
You must escape the regex string specified in the path
. For
example, if the regex string includes double quotes, you must
escape those values. Data Lake supports the Package Syntax for regular expressions
in the storage configuration.
Queries that include all generated fields can be targeted to only those files that match the specified values.
Identify Ranges of Queryable Data from Filename
Consider a data store accountingArchive
containing files
where the filename describes the range of data contained in the file.
For example, the filename
/invoices/1546367712000-1549046112000.json
contains invoices for
the time period between 2019-01-01 and 2019-01-02 with the date range
represented as milliseconds elapsed since the UNIX epoch.
The following databases
object identifies
the minimum time range as the first segment of the filename and the
maximum time range as the second segment of the filename:
"databases" : [ { "name: "accounting", "collections" : [ { "name: "invoices", "dataSources" : [ { "storeName" : "accountingArchive", "path" : "/invoices/{min(invoiceDate) epoch_millis}-{max(invoiceDate) epoch_millis}" } ] } ] } ]
When Data Lake receives a query on the "invoiceDate"
field, it uses the specified path to identify which
files contain the data that matches the query.
Queries on the invoiceDate
field can be targeted to only those
files whose range captures the specified value, including the min
and max
date.
The field specified for the min and max ranges must exist in every document contained in the file to avoid unexpected or undesired behavior. Data Lake does not perform any validation that the underlying data conforms to this constraint.
Identify Nested Fields from Filename
Data Lake supports querying nested data when the nested data
value is also the filename. You can use the dot operator (i.e. .
) in your
path
to map the
partitions attributes in your storage configuration to nested fields in your
documents.
Consider a data store accountingArchive
. The data store
contains files with names that match values of nested fields in the documents.
For example:
accountingArchive |--invoices |--January.json |--February.json ...
Suppose the January.json
file contains a document with the following
fields:
{ "invoice": { "invoiceNumber" : "MONGODB12345", "year" : 2019, "month": "January", //value matches filename "date": 02 }, "vendor": "MONGODB", ... }
The following databases
object identifies
month
as a nested field inside a document. The
databases
object also identifies the value of
month
as the name of the file that contains the document.
"databases" : [ { "name" : "accounting", "collections" : [ { "name" : "invoices", "dataSources" : [ { "storeName" : "accountingArchive", "path" : "/invoices/{invoice.month string}" } ] } ] } ]
When Data Lake receives a query on a specific month such as
January
, it uses the specified path to identify which file
contains the data that matches the query.
Create Partitions from ObjectIds
You can specify ObjectIds in
the path to the files. For files that contain ObjectId
in the filename,
Data Lake creates partitions for each ObjectId
.
Consider the following data store, accountingArchive
. This data
store contains files that include ObjectId
in the filename:
accountingArchive |--invoices |--507f1f77bcf86cd799439011.json |--507f1f77bcf86cd799439012.json |--507f1f77bcf86cd799439013.json |--507f1f77bcf86cd799439014.json |--507f1f77bcf86cd799439015.json
The following databases
object creates
partitions for the ObjectIds
.
"databases" : [ { "name" : "accounting", "collections" : [ { "name" : "invoices", "dataSources" : [ { "storeName" : "accountingArchive", "path" : "/invoices/{objid objectid}" } ] } ] } ]
Or, suppose the data store accountingArchive
contains
files that include a range of ObjectIds
in the filename. For
example:
accountingArchive |--invoices |--507f1f77bcf86cd799439011-507f1f77bcf86cd799439020.json |--507f1f77bcf86cd799439021-507f1f77bcf86cd799439030.json |--507f1f77bcf86cd799439031-507f1f77bcf86cd799439040.json
The following databases
object creates
partitions for the given range of ObjectIds
.
"databases" : [ { "name" : "accounting", "collections" : [ { "name" : "invoices", "dataSources" : [ { "storeName" : "accountingArchive", "path" : "/invoices/{min(obj) objectid}-{max(obj) objectid}" } ] } ] } ]
When Data Lake receives a query on the ObjectId
, it uses the
specified path to identify which file contains the data that matches
the query.
Create Partitions from File Path
You can specify parsing functions on any path leading up to the filename. Each computed field is based on a parsing function along the path. When querying the data, Data Lake converts each computed field to a partition. Partitions, which are synonymous with subdirectories, are then used to filter files with increasing precision.
Consider a data store accountingArchive
with the following
directory structure:
invoices |--MONGO12345 |--2019 |--01 |--02
The following databases
object creates the
invoiceNumber
, year
, month
, and day
partitions with
a small set of filtered files:
"databases" : [ { "name" : "accounting", "collections" : [ { "name" : "invoices", "dataSources" : [ { "storeName" : "accountingArchive", "path" : "/invoices/{invoiceNumber string}/{year int}/{month int:\\d{2}}/{day int:\\d{2}}/*" } ] } ] } }
Generate Dynamic Collection Names from File Path
Consider a data store accountingArchive
with the
following directory structure:
invoices |--SuperSoftware |--UltraSoftware |--MegaSoftware
The following databases
object
generates a dynamic collection name from the file path:
"databases" : [ { "name" : "invoices", "collections" : [ { "name" : "*", "dataSources" : [ { "storeName" : "accountingArchive", "path" : "/invoices/{collectionName()}/" } ] } ] } ]
When applied to the example directory structure, the path results in the following collections:
- SuperSoftware
- UltraSoftware
- MegaSoftware
Or, consider a data store accountingArchive
with the
following files:
/orders/MONGODB/invoices/January.json /orders/MONGODB/purchaseOrders/January.json /orders/MONGODB/invoices/February.json ...
The following databases
object
generates a dynamic collection name from the file path:
"databases" : [ { "name" : "invoices", "collections" : [ { "name" : "*", "dataSources" : [ { "storeName" : "accountingArchive", "path" : "/orders/MONGODB/{collectionName()}/{invoiceMonth string}.json/" } ] } ] } ]
When applied to the example filenames, the path results in the following collections:
invoices
purchaseOrders
When you dynamically generate collections from filenames, the number of collections is not accurately reported in the Data Lake view.