$out

On this page

  • Permissions Required
  • Syntax
  • Fields
  • Options
  • Examples
  • Limitations
  • Error Output

$out takes documents returned by the aggregation pipeline and writes them to a specified collection. The $out stage must be the last stage in the aggregation pipeline. In Atlas Data Federation, you can use $out to write data from any one of the supported federated database instance stores, or from multiple stores when using federated queries, to any one of the following:

  • Atlas cluster namespace

  • AWS S3 buckets with read and write permissions

  • Azure Blob Storage containers with read and write permissions

You must connect to your federated database instance to use $out.

To write to Azure Blob Storage, you must have:

  • A federated database instance configured for Azure Blob Storage with an Azure Role that has read and write permissions.

  • A MongoDB user with the atlasAdmin role or a custom role with the outToAzure privilege.

Note

To use $out to write to a collection in a different database on the same Atlas cluster, your Atlas cluster must be on MongoDB version 5.0 or later.

You must be a database user with one of the following roles:

{
  "$out": {
    "s3": {
      "bucket": "<bucket-name>",
      "region": "<aws-region>",
      "filename": "<file-name>",
      "format": {
        "name": "<file-format>",
        "maxFileSize": "<file-size>",
        "maxRowGroupSize": "<row-group-size>",
        "columnCompression": "<compression-type>"
      },
      "errorMode": "stop"|"continue"
    }
  }
}

{
  "$out": {
    "azure": {
      "serviceURL": "<storage-account-url>",
      "containerName": "<container-name>",
      "region": "<azure-region>",
      "filename": "<file-name>",
      "format": {
        "name": "<file-format>",
        "maxFileSize": "<file-size>",
        "maxRowGroupSize": "<row-group-size>",
        "columnCompression": "<compression-type>"
      },
      "errorMode": "stop"|"continue"
    }
  }
}

{
  "$out": {
    "atlas": {
      "projectId": "<atlas-project-ID>",
      "clusterName": "<atlas-cluster-name>",
      "db": "<atlas-database-name>",
      "coll": "<atlas-collection-name>"
    }
  }
}

Field
Type
Description
Necessity
s3
object
Location to write the documents from the aggregation pipeline.
Required
s3.bucket
string

Name of the S3 bucket to write the documents from the aggregation pipeline to.

Important

The generated call to S3 inserts a / between s3.bucket and s3.filename. Don't append a / to your s3.bucket string.

For example, if you set s3.bucket to myBucket and s3.filename to myPath/myData, Atlas Data Federation writes the output location as follows:

s3://myBucket/myPath/myData.[n].json
Required
s3.region
string
Name of the AWS region in which the bucket is hosted. If omitted, uses the federated database instance configuration to determine the region where the specified s3.bucket is hosted.
Optional
s3.filename
string

Name of the file to write the documents from the aggregation pipeline to. Filename can be constant or created dynamically from the fields in the documents that reach the $out stage. Any filename expression you provide must evaluate to a string data type. If there are any files on S3 with the same name and path as the newly generated files, $out overwrites the existing files with the newly generated files.

Important

The generated call to S3 inserts a / between s3.bucket and s3.filename. Don't prepend a / to your s3.filename string.

For example, if you set s3.filename to myPath/myData and s3.bucket to myBucket, Atlas Data Federation writes the output location as follows:

s3://myBucket/myPath/myData.[n].json
Required
s3.format
object
Details of the file in S3.
Required
s3.format.name
enum

Format of the file in S3. Value can be one of the following:

  • bson

  • bson.gz

  • csv

  • csv.gz

  • json [1]

  • json.gz [1]

  • parquet

  • tsv

  • tsv.gz

[1] For this format, $out writes data in MongoDB Extended JSON format.

To learn more, see Limitations.

Required
s3.format.maxFileSize
bytes

Maximum size of the file in S3. When the file size limit for the current file is reached, a new file is created in S3. The first file appends a 1 before the filename extension. For each subsequent file, Atlas Data Federation increments the appended number by one.

For example, <filename>.1.<fileformat>, and <filename>.2.<fileformat>.

If a document is larger than the maxFileSize, Atlas Data Federation writes the document to its own file. The following suffixes are supported:

Base 10: scaling in multiples of 1000
  • B

  • KB

  • MB

  • GB

  • TB

  • PB

Base 2: scaling in multiples of 1024
  • KiB

  • MiB

  • GiB

  • TiB

  • PiB

If omitted, defaults to 200MiB.

Optional
s3.format.maxRowGroupSize
string

Supported for Parquet file format only.

Maximum row group size to use when writing to Parquet file. If omitted, defaults to 128 MiB or the value of the s3.format.maxFileSize, whichever is smaller. The maximum allowed value is 1 GB.

Optional
s3.format.columnCompression
string

Supported for Parquet file format only.

Compression type to apply for compressing data inside a Parquet file when formatting the Parquet file. Valid values are:

  • gzip

  • snappy

  • uncompressed

If omitted, defaults to snappy.

To learn more, see Supported Data Formats.

Optional
errorMode
enum

Specifies how Atlas Data Federation proceeds if it encounters an error while processing a document. For example, if Atlas Data Federation encounters an array in a document while writing to a CSV file, it uses this value to determine whether to skip that document and continue processing the others. Valid values are:

  • continue to skip the document and continue processing the remaining documents. Atlas Data Federation also writes the document that caused the error to an error file.

    To learn more, see Errors.

  • stop to stop at that point and not process the remaining documents.

If omitted, defaults to continue.

Optional
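
The two suffix families that maxFileSize accepts (base 10 and base 2) can be illustrated with a short JavaScript sketch that converts a size string to a byte count. The helper name parseSize and its parsing rules are assumptions made for this illustration; they are not part of Atlas Data Federation.

```javascript
// Hypothetical helper: converts a size string such as "200MiB" to bytes.
// Base 10 suffixes scale by 1000, base 2 suffixes scale by 1024.
const UNITS = new Map([
  ["B", 1],
  ["KB", 1000], ["MB", 1000 ** 2], ["GB", 1000 ** 3], ["TB", 1000 ** 4], ["PB", 1000 ** 5],
  ["KiB", 1024], ["MiB", 1024 ** 2], ["GiB", 1024 ** 3], ["TiB", 1024 ** 4], ["PiB", 1024 ** 5],
]);

function parseSize(text) {
  // Assumed input shape: a number immediately followed by a suffix.
  const match = /^(\d+(?:\.\d+)?)\s*([A-Za-z]+)$/.exec(text.trim());
  if (!match || !UNITS.has(match[2])) {
    throw new Error(`Unsupported size: ${text}`);
  }
  return Number(match[1]) * UNITS.get(match[2]);
}

console.log(parseSize("200MiB")); // 209715200
console.log(parseSize("200MB"));  // 200000000
```

The difference between the two results shows why the suffix matters: 200MiB is roughly 4.9% larger than 200MB.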
Field
Type
Description
Necessity
azure
object
Location to write the documents from the aggregation pipeline.
Required
azure.serviceURL
string
URL of the Azure storage account in which to write documents from the aggregation pipeline.
Required
azure.containerName
string
Name of the Azure Blob Storage container in which to write documents from the aggregation pipeline.
Required
azure.region
string
Name of the Azure region which hosts the Blob Storage container.
Required
azure.filename
string

Name of the file in which to write documents from the aggregation pipeline.

Accepts a constant string, or an expression that evaluates to a string and is built dynamically from the fields in the documents that reach the $out stage. If there are any files in Azure Blob Storage with the same name and path as the newly generated files, $out overwrites the existing files with the newly generated files.

Required
azure.format
object
Details of the file in Azure Blob Storage.
Required
azure.format.name
enum

Format of the file in Azure Blob Storage. Value can be one of the following:

  • bson

  • bson.gz

  • csv

  • csv.gz

  • json [1]

  • json.gz [1]

  • parquet

  • tsv

  • tsv.gz

[1] For this format, $out writes data in MongoDB Extended JSON format.

To learn more, see Limitations.

Required
azure.format.maxFileSize
bytes

Maximum size of the file in Azure Blob Storage.

When the file size limit for the current file is reached, $out automatically creates a new file. The first file appends a 1 after its name. For each subsequent file, Atlas Data Federation increments the appended number by one.

For example, <filename>.1.<fileformat>, and <filename>.2.<fileformat>.

If a document is larger than the maxFileSize, Atlas Data Federation writes the document to its own file. The following suffixes are supported:

Base 10: scaling in multiples of 1000
  • B

  • KB

  • MB

  • GB

  • TB

  • PB

Base 2: scaling in multiples of 1024
  • KiB

  • MiB

  • GiB

  • TiB

  • PiB

If omitted, defaults to 200MiB.

Optional
azure.format.maxRowGroupSize
string

Supported for Parquet file format only.

Maximum row group size to use when writing to Parquet file. If omitted, defaults to 128 MiB or the value of the azure.format.maxFileSize, whichever is smaller. The maximum allowed value is 1 GB.

Optional
azure.format.columnCompression
string

Supported for Parquet file format only.

Compression type to apply for compressing data inside a Parquet file when formatting the Parquet file. Valid values are:

  • gzip

  • snappy

  • uncompressed

If omitted, defaults to snappy.

To learn more, see Supported Data Formats.

Optional
errorMode
enum

Specifies how Atlas Data Federation should proceed when it encounters an error while processing a document. Valid values are:

  • continue to skip the document and continue processing the remaining documents. Atlas Data Federation records the error in an error file.

  • stop to stop without processing the remaining documents. Atlas Data Federation records the error in an error file.

If omitted, defaults to continue.

To learn more, see Errors.

Optional
Field | Type | Description | Necessity
atlas | object | Location to write the documents from the aggregation pipeline. | Required
clusterName | string | Name of the Atlas cluster. | Required
coll | string | Name of the collection on the Atlas cluster. | Required
db | string | Name of the database on the Atlas cluster that contains the collection. | Required
projectId | string | Unique identifier of the project that contains the Atlas cluster. The project ID must be the ID of the project that contains your federated database instance. If omitted, defaults to the ID of the project that contains your federated database instance. | Optional
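
As a rough sketch of how the required and optional atlas fields fit together, the following JavaScript helper assembles an $out stage for an Atlas cluster target. The function name buildAtlasOutStage and its validation rules are hypothetical and exist only for this illustration.

```javascript
// Hypothetical helper that assembles an $out stage for an Atlas cluster target.
function buildAtlasOutStage({ clusterName, db, coll, projectId }) {
  // clusterName, db, and coll are the required fields.
  for (const [key, value] of Object.entries({ clusterName, db, coll })) {
    if (typeof value !== "string" || value.length === 0) {
      throw new Error(`"${key}" is required and must be a non-empty string`);
    }
  }
  const atlas = { clusterName, db, coll };
  // projectId is optional; when omitted, the project of the
  // federated database instance is used.
  if (projectId !== undefined) atlas.projectId = projectId;
  return { $out: { atlas } };
}

const stage = buildAtlasOutStage({
  clusterName: "myTestCluster",
  db: "sampleDB",
  coll: "mySampleData",
});
console.log(JSON.stringify(stage, null, 2));
```

The returned object can be placed as the last element of a pipeline array passed to aggregate().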
Option
Type
Description
Necessity
background
boolean

Flag to run aggregation operations in the background. If omitted, defaults to false. When set to true, Atlas Data Federation runs the queries in the background.

{ "background" : true }

Use this option if you want to submit new queries without waiting for currently running queries to complete, or if you want to disconnect from your federated database instance while the queries continue to run in the background.

Optional

Create a Filename (S3)

The following examples show $out syntaxes for dynamically creating a filename from a constant string or from the fields of the same or different data types in the documents that reach the $out stage.

Example

You want to write 1 GiB of data as compressed BSON files to an S3 bucket named my-s3-bucket.

You use the following $out syntax:

{
  "$out": {
    "s3": {
      "bucket": "my-s3-bucket",
      "filename": "big_box_store/",
      "format": {
        "name": "bson.gz"
      }
    }
  }
}

Because s3.region is omitted, Atlas Data Federation determines from the storage configuration the region that hosts the bucket named my-s3-bucket. $out writes five compressed BSON files:

  1. The first 200 MiB of data to a file that $out names big_box_store/1.bson.gz.

    • The value of s3.filename serves as a constant in each filename. This value doesn't depend upon any document field or value.

    • Your s3.filename ends with a delimiter, so Atlas Data Federation appends the counter after the constant.

    • If it didn't end with a delimiter, Atlas Data Federation would have added a . between the constant and the counter, like big_box_store.1.bson.gz

    • Because you didn't change the maximum file size using s3.format.maxFileSize, Atlas Data Federation uses the default value of 200 MiB.

  2. The second 200 MiB of data to a new file that $out names big_box_store/2.bson.gz.

  3. Three more files that $out names big_box_store/3.bson.gz through big_box_store/5.bson.gz.
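
The naming rule described above can be sketched in JavaScript as follows. The helper outputFileName is hypothetical, and it assumes that / is the only delimiter that matters; it illustrates the documented behavior and is not Atlas Data Federation's implementation.

```javascript
// Hypothetical helper mirroring the naming rule described above.
// Assumption: "/" is the only delimiter considered.
function outputFileName(filename, counter, format) {
  // A filename that ends with the delimiter gets the counter appended directly;
  // otherwise a "." separates the constant from the counter.
  const separator = filename.endsWith("/") ? "" : ".";
  return `${filename}${separator}${counter}.${format}`;
}

console.log(outputFileName("big_box_store/", 1, "bson.gz")); // big_box_store/1.bson.gz
console.log(outputFileName("big_box_store", 1, "bson.gz"));  // big_box_store.1.bson.gz
```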

Example

You want to write 90 MiB of data as JSON files to an S3 bucket named my-s3-bucket.

You use the following $out syntax:

{
  "$out": {
    "s3": {
      "bucket": "my-s3-bucket",
      "region": "us-east-1",
      "filename": { "$toString": "$saleDate" },
      "format": {
        "name": "json",
        "maxFileSize": "100MiB"
      }
    }
  }
}

$out writes 90 MiB of data to JSON files in the root of the bucket. Each JSON file contains all of the documents with the same saleDate value. $out names each file using the documents' saleDate value converted to a string.

Example

You want to write 176 MiB of data as BSON files to an S3 bucket named my-s3-bucket.

You use the following $out syntax:

{
  "$out": {
    "s3": {
      "bucket": "my-s3-bucket",
      "region": "us-east-1",
      "filename": {
        "$concat": [
          "persons/",
          "$name", "/",
          "$uniqueId", "/"
        ]
      },
      "format": {
        "name": "bson",
        "maxFileSize": "200MiB"
      }
    }
  }
}

$out writes 176 MiB of data to BSON files. To name each file, $out concatenates:

  • A constant string persons/ and, from the documents:

    • The string value of the name field,

    • A forward slash (/),

    • The string value of the uniqueId field, and

    • A forward slash (/).

Each BSON file contains all of the documents with the same name and uniqueId values. $out names each file using the documents' name and uniqueId values.
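
To illustrate how a dynamic filename partitions documents, the following JavaScript sketch groups sample documents by the prefix that the $concat expression above would produce. The sample values and the helper names filenameFor and groupByFilename are made up for this illustration; the template string is only a plain-JavaScript stand-in for the aggregation expression.

```javascript
// Sample documents; field names follow the example above, values are made up.
const docs = [
  { name: "alice", uniqueId: "a1", score: 1 },
  { name: "alice", uniqueId: "a1", score: 2 },
  { name: "bob", uniqueId: "b7", score: 3 },
];

// Plain-JavaScript stand-in for the $concat filename expression.
const filenameFor = (doc) => `persons/${doc.name}/${doc.uniqueId}/`;

// Groups documents by the filename prefix they evaluate to.
function groupByFilename(documents, toFilename) {
  const groups = new Map();
  for (const doc of documents) {
    const key = toFilename(doc);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(doc);
  }
  return groups;
}

const groups = groupByFilename(docs, filenameFor);
for (const [prefix, members] of groups) {
  console.log(prefix, members.length);
}
// persons/alice/a1/ 2
// persons/bob/b7/ 1
```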

Example

You want to write 154 MiB of data as compressed JSON files to an S3 bucket named my-s3-bucket.

Consider the following $out syntax:

{
  "$out": {
    "s3": {
      "bucket": "my-s3-bucket",
      "region": "us-east-1",
      "filename": {
        "$concat": [
          "big-box-store/",
          { "$toString": "$storeNumber" }, "/",
          { "$toString": "$saleDate" }, "/",
          "$partId", "/"
        ]
      },
      "format": {
        "name": "json.gz",
        "maxFileSize": "200MiB"
      }
    }
  }
}

$out writes 154 MiB of data to compressed JSON files, where each file contains all documents with the same storeNumber, saleDate, and partId values. To name each file, $out concatenates:

  • A constant string value of big-box-store/,

  • A string value of a unique store number in the storeNumber field,

  • A forward slash (/),

  • A string value of the date from the saleDate field,

  • A forward slash (/),

  • A string value of part ID from the partId field, and

  • A forward slash (/).

Create a Filename (Azure Blob Storage)

The following examples show $out syntaxes for dynamically creating a filename from a constant string or from the fields of the same or different data types in the documents that reach the $out stage.

Example

You want to write 1 GiB of data as compressed BSON files to an Azure storage account mystorageaccount and container named my-container.

You use the following $out syntax:

{
  "$out": {
    "azure": {
      "serviceURL": "http://mystorageaccount.blob.core.windows.net/",
      "containerName": "my-container",
      "filename": "big_box_store/",
      "format": {
        "name": "bson.gz"
      }
    }
  }
}

Because azure.region is omitted, Atlas Data Federation determines from the storage configuration the region that hosts the container named my-container. $out writes five compressed BSON files:

  1. The first 200 MiB of data to a file that $out names big_box_store/1.bson.gz.

    • The value of azure.filename serves as a constant in each filename. This value doesn't depend upon any document field or value.

    • Your azure.filename ends with a delimiter, so Atlas Data Federation appends the counter after the constant.

    • If it didn't end with a delimiter, Atlas Data Federation would have added a . between the constant and the counter, like big_box_store.1.bson.gz

    • Because you didn't change the maximum file size using azure.format.maxFileSize, Atlas Data Federation uses the default value of 200 MiB.

  2. The second 200 MiB of data to a new file that $out names big_box_store/2.bson.gz.

  3. Three more files that $out names big_box_store/3.bson.gz through big_box_store/5.bson.gz.

Example

You want to write 90 MiB of data as JSON files to an Azure Blob Storage container named my-container.

You use the following $out syntax:

{
  "$out": {
    "azure": {
      "serviceURL": "http://mystorageaccount.blob.core.windows.net/",
      "containerName": "my-container",
      "region": "eastus2",
      "filename": { "$toString": "$saleDate" },
      "format": {
        "name": "json",
        "maxFileSize": "100MiB"
      }
    }
  }
}

$out writes 90 MiB of data to JSON files in the root of the container. Each JSON file contains all of the documents with the same saleDate value. $out names each file using the documents' saleDate value converted to a string.

Example

You want to write 176 MiB of data as BSON files to an Azure Blob Storage container named my-container.

You use the following $out syntax:

{
  "$out": {
    "azure": {
      "serviceURL": "http://mystorageaccount.blob.core.windows.net/",
      "containerName": "my-container",
      "region": "eastus2",
      "filename": {
        "$concat": [
          "persons/",
          "$name", "/",
          "$uniqueId", "/"
        ]
      },
      "format": {
        "name": "bson",
        "maxFileSize": "200MiB"
      }
    }
  }
}

$out writes 176 MiB of data to BSON files. To name each file, $out concatenates:

  • A constant string persons/ and, from the documents:

    • The string value of the name field,

    • A forward slash (/),

    • The string value of the uniqueId field, and

    • A forward slash (/).

Each BSON file contains all of the documents with the same name and uniqueId values. $out names each file using the documents' name and uniqueId values.

Example

You want to write 154 MiB of data as compressed JSON files to an Azure Blob Storage container named my-container.

Consider the following $out syntax:

{
  "$out": {
    "azure": {
      "serviceURL": "http://mystorageaccount.blob.core.windows.net/",
      "containerName": "my-container",
      "region": "eastus2",
      "filename": {
        "$concat": [
          "big-box-store/",
          { "$toString": "$storeNumber" }, "/",
          { "$toString": "$saleDate" }, "/",
          "$partId", "/"
        ]
      },
      "format": {
        "name": "json.gz",
        "maxFileSize": "200MiB"
      }
    }
  }
}

$out writes 154 MiB of data to compressed JSON files, where each file contains all documents with the same storeNumber, saleDate, and partId values. To name each file, $out concatenates:

  • A constant string value of big-box-store/,

  • A string value of a unique store number in the storeNumber field,

  • A forward slash (/),

  • A string value of the date from the saleDate field,

  • A forward slash (/),

  • A string value of part ID from the partId field, and

  • A forward slash (/).

This $out syntax sends the aggregated data to a sampleDB.mySampleData collection in the Atlas cluster named myTestCluster. The syntax doesn't specify a project ID; $out uses the ID of the project that contains your federated database instance.

Example

{
  "$out": {
    "atlas": {
      "clusterName": "myTestCluster",
      "db": "sampleDB",
      "coll": "mySampleData"
    }
  }
}

The following example shows $out syntax for running an aggregation pipeline that ends with the $out stage in the background.

Example

db.mySampleData.aggregate(
  [
    {
      "$out": {
        "s3": {
          "bucket": "my-s3-bucket",
          "filename": { "$toString": "$saleDate" },
          "format": {
            "name": "json"
          }
        }
      }
    }
  ],
  { "background" : true }
)

$out writes to JSON files in the root of the bucket in the background. Each JSON file contains all of the documents with the same saleDate value. $out names each file using the documents' saleDate value converted to a string.

Example

db.mySampleData.aggregate(
  [
    {
      "$out": {
        "azure": {
          "serviceURL": "http://mystorageaccount.blob.core.windows.net/",
          "containerName": "my-container",
          "filename": { "$toString": "$saleDate" },
          "format": {
            "name": "json"
          }
        }
      }
    }
  ],
  { "background" : true }
)

$out writes to JSON files in the root of the Azure Blob Storage container in the background. Each JSON file contains all of the documents with the same saleDate value. $out names each file using the documents' saleDate value converted to a string.

Example

db.mySampleData.aggregate(
  [
    {
      "$out": {
        "atlas": {
          "clusterName": "myTestCluster",
          "db": "sampleDB",
          "coll": "mySampleData"
        }
      }
    }
  ],
  { background: true }
)

$out writes to the sampleDB.mySampleData collection in the Atlas cluster named myTestCluster in the background.

Atlas Data Federation interprets empty strings ("") as null values when parsing filenames. If you want Atlas Data Federation to generate parseable filenames, wrap the field references that could have null values using $convert with an empty string onNull value.

Example

This example shows how to handle null values in the year field when creating a filename from the field value.

{
  "$out": {
    "s3": {
      "bucket": "my-s3-bucket",
      "region": "us-east-1",
      "filename": {
        "$concat": [
          "big-box-store/",
          {
            "$convert": {
              "input": "$year",
              "to": "string",
              "onNull": ""
            }
          }, "/"
        ]
      },
      "format": {
        "name": "json.gz",
        "maxFileSize": "200MiB"
      }
    }
  }
}
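
To see why the $convert wrapper matters, the following JavaScript sketch models only the null handling of $concat (which returns null when any argument is null or missing) and of $convert with an onNull value. The functions concat and convertToString are simplified stand-ins written for this illustration, not the server's implementations.

```javascript
// Stand-in for $concat: the result is null if any argument is null or missing.
function concat(...parts) {
  return parts.some((p) => p === null || p === undefined) ? null : parts.join("");
}

// Stand-in for $convert to string, limited to the onNull behavior.
function convertToString(input, onNull) {
  return input === null || input === undefined ? onNull : String(input);
}

const withYear = { year: 2024 };
const withoutYear = {};

// Without the wrapper, a missing year makes the whole filename null.
console.log(concat("big-box-store/", withoutYear.year, "/"));                      // null
// With the wrapper, the filename is still a string.
console.log(concat("big-box-store/", convertToString(withoutYear.year, ""), "/")); // big-box-store//
console.log(concat("big-box-store/", convertToString(withYear.year, ""), "/"));    // big-box-store/2024/
```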

When writing to CSV, TSV, or Parquet file format, Atlas Data Federation doesn't support more than 32000 unique fields.

When writing to CSV or TSV format, Atlas Data Federation does not support the following data types in the documents:

  • Arrays

  • DB pointer

  • JavaScript

  • JavaScript code with scope

  • Minimum or maximum key data type

In a CSV file, Atlas Data Federation represents nested documents using the dot (.) notation. For example, Atlas Data Federation writes { x: { a: 1, b: 2 } } as the following in the CSV file:

x.a,x.b
1,2

Atlas Data Federation represents all other data types as strings. Therefore, the data types in MongoDB read back from the CSV file may not be the same as the data types in the original BSON documents from which the data types were written.
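
The dot-notation flattening of nested documents described above can be sketched in JavaScript as follows. The flatten helper is hypothetical and handles only plain nested objects; arrays and other special types are out of scope for this illustration.

```javascript
// Flattens nested plain objects into dot-notation columns.
function flatten(doc, prefix = "", out = {}) {
  for (const [key, value] of Object.entries(doc)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      flatten(value, path, out); // descend into the nested document
    } else {
      out[path] = value;
    }
  }
  return out;
}

const flat = flatten({ x: { a: 1, b: 2 } });
console.log(Object.keys(flat).join(","));   // x.a,x.b
console.log(Object.values(flat).join(",")); // 1,2
```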

For Parquet, Atlas Data Federation reads back fields with null or undefined values as missing because Parquet doesn't distinguish between null or undefined values and missing values. Although Atlas Data Federation supports all data types, for BSON data types that do not have a direct equivalent in Parquet, such as JavaScript, regular expression, etc., it:

  • Chooses a representation that allows the resulting Parquet file to be read back using a non-MongoDB tool.

  • Stores a MongoDB schema in the Parquet file's key/value metadata so that Atlas Data Federation can reconstruct the original BSON document with the correct data types if the Parquet file is read back by Atlas Data Federation.

Example

Consider the following BSON documents:

{
  "clientId": 102,
  "phoneNumbers": ["123-4567", "234-5678"],
  "clientInfo": {
    "name": "Taylor",
    "occupation": "teacher"
  }
}
{
  "clientId": "237",
  "phoneNumbers": ["345-6789"],
  "clientInfo": {
    "name": "Jordan"
  }
}

If you write the preceding BSON documents to Parquet format using $out to S3, the Parquet file schema for your BSON documents would look similar to the following:

message root {
  optional group clientId {
    optional int32 int;
    optional binary string (STRING);
  }
  optional group phoneNumbers (LIST) {
    repeated group list {
      optional binary element (STRING);
    }
  }
  optional group clientInfo {
    optional binary name (STRING);
    optional binary occupation (STRING);
  }
}

Your Parquet data on S3 would look similar to the following:

clientId:
.int = 102
phoneNumbers:
.list:
..element = "123-4567"
.list:
..element = "234-5678"
clientInfo:
.name = "Taylor"
.occupation = "teacher"

clientId:
.string = "237"
phoneNumbers:
.list:
..element = "345-6789"
clientInfo:
.name = "Jordan"

The preceding example demonstrates how Atlas Data Federation handles complex data types:

  • Atlas Data Federation maps documents at all levels to a Parquet group.

  • Atlas Data Federation encodes arrays using the LIST logical type and the mandatory three-level list or element structure. To learn more, see Lists.

  • Atlas Data Federation maps polymorphic BSON fields to a group of multiple single-type columns because Parquet doesn't support polymorphic columns. Atlas Data Federation names the group after the BSON field. In the preceding example, Atlas Data Federation creates a Parquet group named clientId for the polymorphic field named clientId with two children named after its BSON types, int and string.
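
As a loose illustration of how a polymorphic field ends up with one child per observed type, the following JavaScript sketch collects the set of types seen for each top-level field in the two sample documents. The type labels and the helpers typeLabel and fieldTypes are simplifications invented for this example; they don't reproduce the actual BSON-to-Parquet mapping.

```javascript
// Maps a JavaScript value to a simplified type label.
// Assumption: only the cases needed for the sample documents are covered.
function typeLabel(value) {
  if (Array.isArray(value)) return "list";
  if (Number.isInteger(value)) return "int";
  if (typeof value === "string") return "string";
  if (value !== null && typeof value === "object") return "group";
  return typeof value;
}

// Collects the set of type labels observed for each top-level field.
function fieldTypes(documents) {
  const types = new Map();
  for (const doc of documents) {
    for (const [field, value] of Object.entries(doc)) {
      if (!types.has(field)) types.set(field, new Set());
      types.get(field).add(typeLabel(value));
    }
  }
  return types;
}

const sampleDocs = [
  { clientId: 102, phoneNumbers: ["123-4567", "234-5678"], clientInfo: { name: "Taylor", occupation: "teacher" } },
  { clientId: "237", phoneNumbers: ["345-6789"], clientInfo: { name: "Jordan" } },
];

const types = fieldTypes(sampleDocs);
console.log([...types.get("clientId")]);     // [ 'int', 'string' ]
console.log([...types.get("phoneNumbers")]); // [ 'list' ]
```

A field with more than one label, such as clientId here, corresponds to the case where the Parquet schema uses a group with one child column per type.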

Atlas Data Federation interprets empty strings ("") as null values when parsing filenames. If you want Atlas Data Federation to generate parseable filenames, wrap the field references that could have null values using $convert with an empty string onNull value.

Example

This example shows how to handle null values in the year field when creating a filename from the field value.

1{
2 "$out": {
3 "azure": {
4 "serviceURL": "http://mystorageaccount.blob.core.windows.net/",
5 "container": "my-container",
6 "region": "eastus2",
7 "filename": {
8 "$concat": [
9 "big-box-store/",
10 {
11 "$convert": {
12 "input": "$year",
13 "to": "string",
14 "onNull": ""
15 }
16 }, "/"
17 ]
18 },
19 "format": {
20 "name": "json.gz",
21 "maxFileSize": "200MiB"
22 }
23 }
24 }
25}

When writing to CSV, TSV, or Parquet file format, Atlas Data Federation doesn't support more than 32000 unique fields.

When writing to CSV or TSV format, Atlas Data Federation does not support the following data types in the documents:

  • Arrays

  • DB pointer

  • JavaScript

  • JavaScript code with scope

  • Minimum or maximum key data type

In a CSV file, Atlas Data Federation represents nested documents using the dot (.) notation. For example, Atlas Data Federation writes { x: { a: 1, b: 2 } } as the following in the CSV file:

x.a,x.b
1,2

Atlas Data Federation represents all other data types as strings. Therefore, the data types in MongoDB read back from the CSV file may not be the same as the data types in the original BSON documents from which the data types were written.

For Parquet, Atlas Data Federation reads back fields with null or undefined values as missing because Parquet doesn't distinguish between null or undefined values and missing values. Although Atlas Data Federation supports all data types, for BSON data types that do not have a direct equivalent in Parquet, such as JavaScript, regular expression, etc., it:

  • Chooses a representation that allows the resulting Parquet file to be read back using a non-MongoDB tool.

  • Stores a MongoDB schema in the Parquet file's key/value metadata so that Atlas Data Federation can reconstruct the original BSON document with the correct data types if the Parquet file is read back by Atlas Data Federation.

Example

Consider the following BSON documents:

{
  "clientId": 102,
  "phoneNumbers": ["123-4567", "234-5678"],
  "clientInfo": {
    "name": "Taylor",
    "occupation": "teacher"
  }
}
{
  "clientId": "237",
  "phoneNumbers": ["345-6789"],
  "clientInfo": {
    "name": "Jordan"
  }
}

If you write the preceding BSON documents to Parquet format using $out to Azure, the Parquet file schema for your BSON documents would look similar to the following:

message root {
  optional group clientId {
    optional int32 int;
    optional binary string (STRING);
  }
  optional group phoneNumbers (LIST) {
    repeated group list {
      optional binary element (STRING);
    }
  }
  optional group clientInfo {
    optional binary name (STRING);
    optional binary occupation (STRING);
  }
}

Your Parquet data in Azure Blob Storage would look similar to the following:

clientId:
.int = 102
phoneNumbers:
.list:
..element = "123-4567"
.list:
..element = "234-5678"
clientInfo:
.name = "Taylor"
.occupation = "teacher"

clientId:
.string = "237"
phoneNumbers:
.list:
..element = "345-6789"
clientInfo:
.name = "Jordan"

The preceding example demonstrates how Atlas Data Federation handles complex data types:

  • Atlas Data Federation maps documents at all levels to a Parquet group.

  • Atlas Data Federation encodes arrays using the LIST logical type and the mandatory three-level list or element structure. To learn more, see Lists.

  • Atlas Data Federation maps polymorphic BSON fields to a group of multiple single-type columns because Parquet doesn't support polymorphic columns. Atlas Data Federation names the group after the BSON field. In the preceding example, Atlas Data Federation creates a Parquet group named clientId for the polymorphic field named clientId with two children named after its BSON types, int and string.

This section applies only to S3 buckets and Azure Blob Storage containers with read and write permissions.

Atlas Data Federation uses the error handling mechanism described below for documents that enter the $out stage and cannot be written for one of the following reasons:

  • The s3.filename does not evaluate to a string value.

  • The s3.filename evaluates to a file that cannot be written to.

  • The s3.format.name is set to csv, tsv, csv.gz, or tsv.gz and the document passed to $out contains data types that are not supported by the specified file format. For a full list of unsupported data types, see CSV and TSV File Format.

If $out encounters one of the above errors while processing a document, Atlas Data Federation writes to the following three special error files in the path s3://<bucket-name>/atlas-data-lake-<correlation-id>/:

Error File Name | Description
out-error-docs/<i>.json | Atlas Data Federation writes the document that encountered an error to this file. i begins with 1 and increments whenever the file being written to reaches the maxFileSize. Then, any further documents are written to the new file out-error-docs/<i+1>.json.
out-error-index/<i>.json | Atlas Data Federation writes an error message to this file. Each error message contains a description of the error and an index value n that begins with 0 and increments with each additional error message written to the file. i begins with 1 and increments whenever the file being written to reaches the maxFileSize. Then, any further error messages are written to the new file out-error-index/<i+1>.json.
out-error-summary.json | Atlas Data Federation writes a single summary document for each type of error encountered during an aggregation operation to this file. Each summary document contains a description of the type of error and a count of the number of documents that encountered that type of error.

Example

This example shows how to generate error files using $out in a federated database instance.

The following aggregation pipeline sorts documents in the analytics.customers sample dataset collection by descending customer birthdate and attempts to write the _id, name and accounts fields of the youngest three customers to the file named youngest-customers.csv in the S3 bucket named customer-data.

db.customers.aggregate([
  { $sort: { "birthdate" : -1 } },
  { $unset: [ "username", "address", "email", "tier_and_details", "birthdate" ] },
  { $limit: 3 },
  { $out: {
      "s3": {
        "bucket": "customer-data",
        "filename": "youngest-customers",
        "region": "us-east-2",
        "format": {
          "name": "csv"
        }
      }
    }
  }
])

Because accounts is an array field, $out encounters an error when it tries to write each document in the csv format specified by s3.format.name. To handle these errors, Atlas Data Federation writes to the following three error files:

  • The following output shows the first of three documents written to the out-error-docs/1.json file:

    s3://customer-data/atlas-data-lake-1773b3d5e2a7f3858530daf5/out-error-docs/1.json
    {
      "_id" : {"$oid":"5ca4bbcea2dd94ee58162ba7"},
      "name": "Marc Cain",
      "accounts": [{"$numberInt":"980440"}, {"$numberInt":"626807"}, {"$numberInt":"313907"}, {"$numberInt":"218101"}, {"$numberInt":"157495"}, {"$numberInt":"736396"}]
    }
  • The following output shows the first of three error messages written to the out-error-index/1.json file. The n field starts at 0 and increments for each error written to the file.

    s3://customer-data/atlas-data-lake-1773b3d5e2a7f3858530daf5/out-error-index/1.json
    {
    "n" : {"$numberInt": "0"},
    "error" : "field accounts is of unsupported type array"
    }
  • The following output shows the error summary document written to the out-error-summary file. The count field represents the number of documents passed to $out that encountered an error due to the accounts array field.

    s3://customer-data/atlas-data-lake-1773b3d5e2a7f3858530daf5/out-error-summary.json
    {
    "errorType": "field accounts is of unsupported type array",
    "count": {"$numberInt":"3"}
    }
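
Because the error above is caused by an array-valued field, it can help to check documents for such fields before choosing a CSV or TSV format. The following is a minimal sketch, not part of any MongoDB API: findArrayFields is a hypothetical helper, it inspects only top-level fields, and it checks only for arrays rather than the full list of data types that CSV and TSV do not support.

```javascript
// Hypothetical helper (an assumption for illustration): returns the names of
// the top-level fields in a document whose values are arrays.
function findArrayFields(doc) {
  return Object.keys(doc).filter((key) => Array.isArray(doc[key]));
}

// Sample document shaped like the one in the error output above.
const sample = {
  _id: "5ca4bbcea2dd94ee58162ba7",
  name: "Marc Cain",
  accounts: [980440, 626807, 313907, 218101, 157495, 736396]
};

console.log(findArrayFields(sample)); // prints [ 'accounts' ]
```

A non-empty result suggests that the document would likely be routed to the error files if it were written as CSV.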

Atlas Data Federation uses the error handling mechanism described below for documents that enter the $out stage and cannot be written for one of the following reasons:

  • The azure.filename does not evaluate to a string value.

  • The azure.filename evaluates to a file that cannot be written to.

  • The azure.format.name is set to csv, tsv, csv.gz, or tsv.gz and the document passed to $out contains data types that are not supported by the specified file format. For a full list of unsupported data types, see CSV and TSV File Format.

If $out encounters one of the above errors while processing a document, Atlas Data Federation writes to the following three special error files in the path http://<storage-account>.blob.core.windows.net/<container-name>/atlas-data-lake-<correlation-id>/:

Error File Name
Description
out-error-docs/<i>.json

Atlas Data Federation writes the document that encountered an error to this file.

i begins with 1 and increments whenever the file being written to reaches the maxFileSize. Then, any further documents are written to the new file out-error-docs/<i+1>.json.

out-error-index/<i>.json

Atlas Data Federation writes an error message to this file. Each error message contains a description of the error and an index value n that begins with 0 and increments with each additional error message written to the file.

i begins with 1 and increments whenever the file being written to reaches the maxFileSize. Then, any further error messages are written to the new file out-error-index/<i+1>.json.

out-error-summary.json
Atlas Data Federation writes a single summary document for each type of error encountered during an aggregation operation to this file. Each summary document contains a description of the type of error and a count of the number of documents that encountered that type of error.
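
The relationship between the index entries and the summary documents described above can be modeled in a few lines. This is an illustrative sketch only, not Atlas Data Federation code: summarizeErrors and the shape of its input are assumptions, and the model ignores the rollover to a new file when maxFileSize is reached.

```javascript
// Illustrative model (an assumption, not the service's implementation).
// Input: one { error: "<message>" } entry per document that failed in $out.
function summarizeErrors(failures) {
  // One index entry per failure; n starts at 0 and increments by 1.
  const index = failures.map((failure, n) => ({ n, error: failure.error }));

  // One summary document per distinct error message, with a count.
  const counts = new Map();
  for (const failure of failures) {
    counts.set(failure.error, (counts.get(failure.error) || 0) + 1);
  }
  const summary = [...counts].map(([errorType, count]) => ({ errorType, count }));

  return { index, summary };
}

const message = "field accounts is of unsupported type array";
const result = summarizeErrors([
  { error: message },
  { error: message },
  { error: message }
]);

console.log(result.index.map((entry) => entry.n)); // prints [ 0, 1, 2 ]
console.log(result.summary[0].count);              // prints 3
```

In this model, three documents that fail for the same reason produce three index entries but a single summary document with a count of 3.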

Example

This example shows how to generate error files using $out in a federated database instance.

The following aggregation pipeline sorts documents in the analytics.customers sample dataset collection by descending customer birthdate and attempts to write the _id, name and accounts fields of the youngest three customers to the file named youngest-customers.csv in the Azure Blob Storage container named customer-data.

db.customers.aggregate([
  { $sort: { "birthdate" : -1 } },
  { $unset: [ "username", "address", "email", "tier_and_details", "birthdate" ] },
  { $limit: 3 },
  // Write the remaining fields (_id, name, accounts) to Azure Blob Storage as CSV.
  { $out: {
      "azure": {
        "serviceURL": "https://mystorageaccount.blob.core.windows.net",
        "containerName": "customer-data",
        "filename": "youngest-customers",
        "region": "eastus2",
        "format": {
          "name": "csv"
        }
      }
    }
  }
])

Because accounts is an array field, $out encounters an error when it tries to write a document in the csv format specified by azure.format.name. To handle these errors, Atlas Data Federation writes to the following three error files:

  • The following output shows the first of three documents written to the out-error-docs/1.json file:

    http://mystorageaccount.blob.core.windows.net/customer-data/atlas-data-lake-1773b3d5e2a7f3858530daf5/out-error-docs/1.json
    {
    "_id" : {"$oid":"5ca4bbcea2dd94ee58162ba7"},
    "name": "Marc Cain",
    "accounts": [{"$numberInt":"980440"}, {"$numberInt":"626807"}, {"$numberInt":"313907"}, {"$numberInt":"218101"}, {"$numberInt":"157495"}, {"$numberInt":"736396"}]
    }
  • The following output shows the first of three error messages written to the out-error-index/1.json file. The n field starts at 0 and increments for each error written to the file.

    http://mystorageaccount.blob.core.windows.net/customer-data/atlas-data-lake-1773b3d5e2a7f3858530daf5/out-error-index/1.json
    {
    "n" : {"$numberInt": "0"},
    "error" : "field accounts is of unsupported type array"
    }
  • The following output shows the error summary document written to the out-error-summary.json file. The count field represents the number of documents passed to $out that encountered an error due to the accounts array field.

    http://mystorageaccount.blob.core.windows.net/customer-data/atlas-data-lake-1773b3d5e2a7f3858530daf5/out-error-summary.json
    {
    "errorType": "field accounts is of unsupported type array",
    "count": {"$numberInt":"3"}
    }
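
One possible way to avoid this error is to convert the array to a string before the $out stage, so that every field written to the CSV file is a scalar. The sketch below is an assumption, not documented guidance: buildCsvSafePipeline is a hypothetical helper, the "|" delimiter is arbitrary, and the pipeline has not been verified against a federated database instance. It relies on the standard $reduce, $concat, $cond, and $toString aggregation operators.

```javascript
// Hypothetical helper, not part of any MongoDB API: returns a pipeline whose
// accounts field is a string by the time documents reach $out.
function buildCsvSafePipeline(outStage) {
  // Joins the elements of the accounts array into one "|"-delimited string.
  const accountsAsString = {
    $reduce: {
      input: "$accounts",
      initialValue: "",
      in: {
        $concat: [
          "$$value",
          // Add the delimiter only between elements, not before the first.
          { $cond: [{ $eq: ["$$value", ""] }, "", "|"] },
          { $toString: "$$this" }
        ]
      }
    }
  };
  return [
    { $sort: { birthdate: -1 } },
    { $limit: 3 },
    // _id is kept by default; name is passed through unchanged.
    { $project: { name: 1, accounts: accountsAsString } },
    { $out: outStage }
  ];
}

const pipeline = buildCsvSafePipeline({
  azure: {
    serviceURL: "https://mystorageaccount.blob.core.windows.net",
    containerName: "customer-data",
    filename: "youngest-customers",
    region: "eastus2",
    format: { name: "csv" }
  }
});

// In mongosh, connected to the federated database instance:
// db.customers.aggregate(pipeline)
```

The same helper could be given an s3 output document instead, because only the $out stage differs between the two storage targets.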

This section applies only to S3 buckets and Azure Blob Storage containers with read and write permissions.
