$out
$out takes documents returned by the aggregation pipeline and writes them to a specified collection. The $out operator must be the last stage in the aggregation pipeline. In Atlas Data Federation, you can use $out to write data from one supported federated database instance store, or from multiple stores when you use federated queries, to any one of the following:
Atlas cluster namespace
AWS S3 buckets with read and write permissions
Azure Blob Storage containers with read and write permissions
You must connect to your federated database instance to use $out.
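The requirement that $out be the last stage can be checked before you submit a pipeline. The following is a minimal sketch, not part of Atlas Data Federation: the helper name outIsLastStage and the $match stage are hypothetical and exist only for illustration.

```javascript
// Returns true when the pipeline has exactly one $out stage and it is the last stage.
function outIsLastStage(pipeline) {
  const outPositions = pipeline
    .map((stage, index) =>
      Object.prototype.hasOwnProperty.call(stage, "$out") ? index : -1
    )
    .filter((index) => index !== -1);
  return outPositions.length === 1 && outPositions[0] === pipeline.length - 1;
}

// Hypothetical pipeline: the $match filter is illustrative only.
const pipeline = [
  { $match: { storeNumber: 42 } },
  {
    $out: {
      atlas: { clusterName: "myTestCluster", db: "sampleDB", coll: "mySampleData" },
    },
  },
];

console.log(outIsLastStage(pipeline)); // true
```

A check like this only validates the shape of the pipeline on the client; the server still enforces the rule when the aggregation runs.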
Permissions Required
To write to an S3 bucket, you must have:
A federated database instance configured for an S3 bucket with read and write permissions or s3:PutObject permissions.
A MongoDB user with the atlasAdmin role or a custom role with the outToS3 privilege.
To write to Azure Blob Storage, you must have:
A federated database instance configured for Azure Blob Storage with an Azure Role that has read and write permissions.
A MongoDB user with the atlasAdmin role or a custom role with the outToAzure privilege.
To write to Google Cloud Storage, you must have:
A federated database instance configured for a Google Cloud Storage bucket with access to a GCP Service Account.
A MongoDB user with the atlasAdmin role or a custom role with the outToGCP privilege.
Note
To use $out to write to a collection in a different database on the same Atlas cluster, your Atlas cluster must be on MongoDB version 5.0 or later.
You must be a database user with one of the following roles:
A custom role with the following privileges:
Syntax
{
  "$out": {
    "s3": {
      "bucket": "<bucket-name>",
      "region": "<aws-region>",
      "filename": "<file-name>",
      "format": {
        "name": "<file-format>",
        "maxFileSize": "<file-size>",
        "maxRowGroupSize": "<row-group-size>",
        "columnCompression": "<compression-type>"
      },
      "errorMode": "stop"|"continue"
    }
  }
}

{
  "$out": {
    "azure": {
      "serviceURL": "<storage-account-url>",
      "containerName": "<container-name>",
      "region": "<azure-region>",
      "filename": "<file-name>",
      "format": {
        "name": "<file-format>",
        "maxFileSize": "<file-size>",
        "maxRowGroupSize": "<row-group-size>",
        "columnCompression": "<compression-type>"
      },
      "errorMode": "stop"|"continue"
    }
  }
}

{
  "$out": {
    "gcs": {
      "bucket": "<bucket-name>",
      "region": "<gcp-region>",
      "filename": "<file-name>",
      "format": {
        "name": "<file-format>",
        "maxFileSize": "<file-size>",
        "maxRowGroupSize": "<row-group-size>",
        "columnCompression": "<compression-type>"
      },
      "errorMode": "stop"|"continue"
    }
  }
}

{
  "$out": {
    "atlas": {
      "projectId": "<atlas-project-ID>",
      "clusterName": "<atlas-cluster-name>",
      "db": "<atlas-database-name>",
      "coll": "<atlas-collection-name>"
    }
  }
}
Fields
Field | Type | Description | Necessity
---|---|---|---
s3 | object | Location to write the documents from the aggregation pipeline. | Required
s3.bucket | string | Name of the S3 bucket to write the documents from the aggregation pipeline to. The generated call to S3 inserts a For example, if you set | Required
s3.region | string | Name of the AWS region in which the bucket is hosted. If omitted, uses the federated database instance configuration to determine the region where the specified s3.bucket is hosted. | Optional
s3.filename | string | Name of the file to write the documents from the aggregation pipeline to. Filename can be constant or created dynamically from the fields in the documents that reach the $out stage. Any filename expression you provide must evaluate to a IMPORTANT: If there are any files on S3 with the same name and path as the newly generated files, $out overwrites the existing files with the newly generated files. The generated call to S3 inserts a For example, if you set | Required
s3.format | object | Details of the file in S3. | Required
s3.format.name | enum | Format of the file in S3. Value can be one of the following: 1 For this format, $out writes data in MongoDB Extended JSON format. To learn more, see Limitations. | Required
s3.format.maxFileSize | bytes | Maximum size of the file in S3. When the file size limit for the current file is reached, a new file is created in S3. The first file appends a For example, If a document is larger than the If omitted, defaults to 200MiB. | Optional
s3.format.maxRowGroupSize | string | Supported for Parquet file format only. Maximum row group size to use when writing to Parquet file. If omitted, defaults to | Optional
s3.format.columnCompression | string | Supported for Parquet file format only. Compression type to apply for compressing data inside a Parquet file when formatting the Parquet file. Valid values are: If omitted, defaults to To learn more, see Supported Data Formats. | Optional
errorMode | enum | Specifies how Atlas Data Federation should proceed if there are errors when processing a document. For example, if Atlas Data Federation encounters an array in a document when Atlas Data Federation is writing to a CSV file, Atlas Data Federation uses this value to determine whether or not to skip the document and process other documents. Valid values are "stop" and "continue". If omitted, defaults to | Optional
Field | Type | Description | Necessity
---|---|---|---
azure | object | Location to write the documents from the aggregation pipeline. | Required
azure.serviceURL | string | URL of the Azure storage account in which to write documents from the aggregation pipeline. | Required
azure.containerName | string | Name of the Azure Blob Storage container in which to write documents from the aggregation pipeline. | Required
azure.region | string | Name of the Azure region which hosts the Blob Storage container. | Required
azure.filename | string | Name of the file in which to write documents from the aggregation pipeline. Accepts constant value, or values that evaluate to | Required
azure.format | object | Details of the file in Azure Blob Storage. | Required
azure.format.name | enum | Format of the file in Azure Blob Storage. Value can be one of the following: 1 For this format, $out writes data in MongoDB Extended JSON format. To learn more, see Limitations. | Required
azure.format.maxFileSize | bytes | Maximum size of the file in Azure Blob Storage. When the file size limit for the current file is reached, $out automatically creates a new file. The first file appends a For example, If a document is larger than the If omitted, defaults to 200MiB. | Optional
azure.format.maxRowGroupSize | string | Supported for Parquet file format only. Maximum row group size to use when writing to Parquet file. If omitted, defaults to | Optional
azure.format.columnCompression | string | Supported for Parquet file format only. Compression type to apply for compressing data inside a Parquet file when formatting the Parquet file. Valid values are: If omitted, defaults to To learn more, see Supported Data Formats. | Optional
errorMode | enum | Specifies how Atlas Data Federation should proceed when it encounters an error while processing a document. Valid values are "stop" and "continue". If omitted, defaults to To learn more, see Errors. | Optional
Field | Type | Description | Necessity
---|---|---|---
gcs | object | Location to write the documents from the aggregation pipeline. | Required
gcs.bucket | string | Name of the Google Cloud Storage bucket to write the documents from the aggregation pipeline to. Important: The generated call to Google Cloud inserts a For example, if you set | Required
gcs.region | string | Name of the Google Cloud region in which the bucket is hosted. If omitted, uses the federated database instance configuration to determine the region where the specified gcs.bucket is hosted. | Optional
gcs.filename | string | Name of the file to write the documents from the aggregation pipeline to. Filename can be constant or created dynamically from the fields in the documents that reach the $out stage. Any filename expression you provide must evaluate to a Important: The generated call to Google Cloud Storage inserts a For example, if you set | Required
gcs.format | object | Details of the file in Google Cloud Storage. | Required
gcs.format.name | enum | Format of the file in Google Cloud Storage. Value can be one of the following: 1 For this format, $out writes data in MongoDB Extended JSON format. To learn more, see Limitations. | Required
Field | Type | Description | Necessity
---|---|---|---
atlas | object | Location to write the documents from the aggregation pipeline. | Required
clusterName | string | Name of the Atlas cluster. | Required
coll | string | Name of the collection on the Atlas cluster. | Required
db | string | Name of the database on the Atlas cluster that contains the collection. | Required
projectId | string | Unique identifier of the project that contains the Atlas cluster. The project ID must be the ID of the project that contains your federated database instance. If omitted, defaults to the ID of the project that contains your federated database instance. | Optional
Options
Option | Type | Description | Necessity
---|---|---|---
background | boolean | Flag to run aggregation operations in the background. If omitted, defaults to false. Use this option if you want to submit other new queries without waiting for currently running queries to complete, or to disconnect your federated database instance connection while the queries continue to run in the background. | Optional
Examples
Create a Filename
The following examples show $out syntaxes for dynamically creating a filename from a constant string or from the fields of the same or different data types in the documents that reach the $out stage.
Simple String Example
Example
You want to write 1 GiB of data as compressed BSON files to an S3 bucket named my-s3-bucket.
Use the following $out syntax:
{
  "$out": {
    "s3": {
      "bucket": "my-s3-bucket",
      "filename": "big_box_store/",
      "format": {
        "name": "bson.gz"
      }
    }
  }
}
Because s3.region is omitted, Atlas Data Federation determines the region where the bucket named my-s3-bucket is hosted from the storage configuration. $out writes five compressed BSON files:
The first 200 MiB of data to a file that $out names big_box_store/1.bson.gz.
  The value of s3.filename serves as a constant in each filename. This value doesn't depend upon any document field or value.
  Your s3.filename ends with a delimiter, so Atlas Data Federation appends the counter after the constant.
  If it didn't end with a delimiter, Atlas Data Federation would have added a . between the constant and the counter, like big_box_store.1.bson.gz.
  Because you didn't change the maximum file size using s3.format.maxFileSize, Atlas Data Federation uses the default value of 200 MiB.
The second 200 MiB of data to a new file that $out names big_box_store/2.bson.gz.
Three more files that $out names big_box_store/3.bson.gz through big_box_store/5.bson.gz.
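The naming rule described above can be sketched in a few lines. This is an illustration only, not the Atlas Data Federation implementation: the helper nthFileName is hypothetical, and treating / as the only delimiter is an assumption.

```javascript
// Builds the name of the n-th output file from a constant filename.
// Assumption: "/" is the only delimiter considered in this sketch.
function nthFileName(filename, counter, extension) {
  // A filename that ends with a delimiter gets the counter appended directly;
  // otherwise a "." separates the constant from the counter.
  const separator = filename.endsWith("/") ? "" : ".";
  return `${filename}${separator}${counter}.${extension}`;
}

console.log(nthFileName("big_box_store/", 1, "bson.gz")); // big_box_store/1.bson.gz
console.log(nthFileName("big_box_store", 1, "bson.gz"));  // big_box_store.1.bson.gz
```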
Single Field from Documents
Example
You want to write 90 MiB of data to JSON files to an S3 bucket named my-s3-bucket.
Use the following $out syntax:
{
  "$out": {
    "s3": {
      "bucket": "my-s3-bucket",
      "region": "us-east-1",
      "filename": { "$toString": "$saleDate" },
      "format": {
        "name": "json",
        "maxFileSize": "100MiB"
      }
    }
  }
}
$out writes 90 MiB of data to JSON files in the root of the bucket. Each JSON file contains all of the documents with the same saleDate value. $out names each file using the documents' saleDate value converted to a string.
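To see why a dynamic filename produces one file per distinct value, consider this small client-side sketch. It only mimics the grouping behavior described above; the helper partitionByFilename and the sample documents are hypothetical, and saleDate is kept as a plain string for simplicity.

```javascript
// Groups documents by the filename that each document evaluates to.
function partitionByFilename(docs, filenameOf) {
  const files = new Map();
  for (const doc of docs) {
    const name = filenameOf(doc);
    if (!files.has(name)) files.set(name, []);
    files.get(name).push(doc);
  }
  return files;
}

// Hypothetical sample documents.
const sales = [
  { saleDate: "2023-01-05", item: "pens" },
  { saleDate: "2023-01-05", item: "paper" },
  { saleDate: "2023-01-06", item: "ink" },
];

const files = partitionByFilename(sales, (doc) => String(doc.saleDate));
for (const [name, group] of files) {
  console.log(name, group.length);
}
// 2023-01-05 2
// 2023-01-06 1
```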
Multiple Fields from Documents
Example
You want to write 176 MiB of data as BSON files to an S3 bucket named my-s3-bucket.
Use the following $out syntax:
{
  "$out": {
    "s3": {
      "bucket": "my-s3-bucket",
      "region": "us-east-1",
      "filename": {
        "$concat": [
          "persons/",
          "$name", "/",
          "$uniqueId", "/"
        ]
      },
      "format": {
        "name": "bson",
        "maxFileSize": "200MiB"
      }
    }
  }
}
$out writes 176 MiB of data to BSON files. To name each file, $out concatenates:
A constant string persons/ and, from the documents:
  The string value of the name field,
  A forward slash (/),
  The string value of the uniqueId field, and
  A forward slash (/).
Each BSON file contains all of the documents with the same name and uniqueId values. $out names each file using the documents' name and uniqueId values.
Multiple Types of Fields from Documents
Example
You want to write 154 MiB of data as compressed JSON files to an S3 bucket named my-s3-bucket.
Consider the following $out syntax:
{
  "$out": {
    "s3": {
      "bucket": "my-s3-bucket",
      "region": "us-east-1",
      "filename": {
        "$concat": [
          "big-box-store/",
          { "$toString": "$storeNumber" }, "/",
          { "$toString": "$saleDate" }, "/",
          "$partId", "/"
        ]
      },
      "format": {
        "name": "json.gz",
        "maxFileSize": "200MiB"
      }
    }
  }
}
$out writes 154 MiB of data to compressed JSON files, where each file contains all documents with the same storeNumber, saleDate, and partId values. To name each file, $out concatenates:
A constant string value of big-box-store/,
A string value of a unique store number in the storeNumber field,
A forward slash (/),
A string value of the date from the saleDate field,
A forward slash (/),
A string value of part ID from the partId field, and
A forward slash (/).
Create a Filename
The following examples show $out syntaxes for dynamically creating a filename from a constant string or from the fields of the same or different data types in the documents that reach the $out stage.
Simple String Example
Example
You want to write 1 GiB of data as compressed BSON files to an Azure storage account named mystorageaccount and a container named my-container.
Use the following $out syntax:
{
  "$out": {
    "azure": {
      "serviceURL": "http://mystorageaccount.blob.core.windows.net/",
      "containerName": "my-container",
      "filename": "big_box_store/",
      "format": {
        "name": "bson.gz"
      }
    }
  }
}
Because azure.region is omitted, Atlas Data Federation determines the region where the container named my-container is hosted from the storage configuration. $out writes five compressed BSON files:
The first 200 MiB of data to a file that $out names big_box_store/1.bson.gz.
  The value of azure.filename serves as a constant in each filename. This value doesn't depend upon any document field or value.
  Your azure.filename ends with a delimiter, so Atlas Data Federation appends the counter after the constant.
  If it didn't end with a delimiter, Atlas Data Federation would have added a . between the constant and the counter, like big_box_store.1.bson.gz.
  Because you didn't change the maximum file size using azure.format.maxFileSize, Atlas Data Federation uses the default value of 200 MiB.
The second 200 MiB of data to a new file that $out names big_box_store/2.bson.gz.
Three more files that $out names big_box_store/3.bson.gz through big_box_store/5.bson.gz.
Single Field from Documents
Example
You want to write 90 MiB of data to JSON files to an Azure Blob Storage container named my-container.
Use the following $out syntax:
{
  "$out": {
    "azure": {
      "serviceURL": "http://mystorageaccount.blob.core.windows.net/",
      "containerName": "my-container",
      "region": "eastus2",
      "filename": { "$toString": "$saleDate" },
      "format": {
        "name": "json",
        "maxFileSize": "100MiB"
      }
    }
  }
}
$out writes 90 MiB of data to JSON files in the root of the container. Each JSON file contains all of the documents with the same saleDate value. $out names each file using the documents' saleDate value converted to a string.
Multiple Fields from Documents
Example
You want to write 176 MiB of data as BSON files to an Azure Blob Storage container named my-container.
Use the following $out syntax:
{
  "$out": {
    "azure": {
      "serviceURL": "http://mystorageaccount.blob.core.windows.net/",
      "containerName": "my-container",
      "region": "eastus2",
      "filename": {
        "$concat": [
          "persons/",
          "$name", "/",
          "$uniqueId", "/"
        ]
      },
      "format": {
        "name": "bson",
        "maxFileSize": "200MiB"
      }
    }
  }
}
$out writes 176 MiB of data to BSON files. To name each file, $out concatenates:
A constant string persons/ and, from the documents:
  The string value of the name field,
  A forward slash (/),
  The string value of the uniqueId field, and
  A forward slash (/).
Each BSON file contains all of the documents with the same name and uniqueId values. $out names each file using the documents' name and uniqueId values.
Multiple Types of Fields from Documents
Example
You want to write 154 MiB of data as compressed JSON files to an Azure Blob Storage container named my-container.
Consider the following $out syntax:
{
  "$out": {
    "azure": {
      "serviceURL": "http://mystorageaccount.blob.core.windows.net/",
      "containerName": "my-container",
      "region": "eastus2",
      "filename": {
        "$concat": [
          "big-box-store/",
          { "$toString": "$storeNumber" }, "/",
          { "$toString": "$saleDate" }, "/",
          "$partId", "/"
        ]
      },
      "format": {
        "name": "json.gz",
        "maxFileSize": "200MiB"
      }
    }
  }
}
$out writes 154 MiB of data to compressed JSON files, where each file contains all documents with the same storeNumber, saleDate, and partId values. To name each file, $out concatenates:
A constant string value of big-box-store/,
A string value of a unique store number in the storeNumber field,
A forward slash (/),
A string value of the date from the saleDate field,
A forward slash (/),
A string value of part ID from the partId field, and
A forward slash (/).
Create a Filename
The following examples show $out syntaxes for dynamically creating a filename from a constant string or from the fields of the same or different data types in the documents that reach the $out stage.
Simple String Example
Example
You want to write 1 GiB of data as compressed BSON files to a Google Cloud Storage bucket named my-gcs-bucket.
Use the following $out syntax:
{
  "$out": {
    "gcs": {
      "bucket": "my-gcs-bucket",
      "filename": "big_box_store/",
      "format": {
        "name": "bson.gz"
      }
    }
  }
}
Because gcs.region is omitted, Atlas Data Federation determines the region where the bucket named my-gcs-bucket is hosted from the storage configuration. $out writes five compressed BSON files:
The first 200 MiB of data to a file that $out names big_box_store/1.bson.gz.
  The value of gcs.filename serves as a constant in each filename. This value doesn't depend upon any document field or value.
  Your gcs.filename ends with a delimiter, so Atlas Data Federation appends the counter after the constant.
  If it didn't end with a delimiter, Atlas Data Federation would have added a . between the constant and the counter, like big_box_store.1.bson.gz.
  Because you didn't change the maximum file size using gcs.format.maxFileSize, Atlas Data Federation uses the default value of 200 MiB.
The second 200 MiB of data to a new file that $out names big_box_store/2.bson.gz.
Three more files that $out names big_box_store/3.bson.gz through big_box_store/5.bson.gz.
Single Field from Documents
Example
You want to write 90 MiB of data to JSON files to a Google Cloud Storage bucket named my-gcs-bucket.
Use the following $out syntax:
{
  "$out": {
    "gcs": {
      "bucket": "my-gcs-bucket",
      "region": "us-central1",
      "filename": { "$toString": "$saleDate" },
      "format": {
        "name": "json",
        "maxFileSize": "100MiB"
      }
    }
  }
}
$out writes 90 MiB of data to JSON files in the root of the bucket. Each JSON file contains all of the documents with the same saleDate value. $out names each file using the documents' saleDate value converted to a string.
Multiple Fields from Documents
Example
You want to write 176 MiB of data as BSON files to a Google Cloud Storage bucket named my-gcs-bucket.
Use the following $out syntax:
{
  "$out": {
    "gcs": {
      "bucket": "my-gcs-bucket",
      "region": "us-central1",
      "filename": {
        "$concat": [
          "persons/",
          "$name", "/",
          "$uniqueId", "/"
        ]
      },
      "format": {
        "name": "bson",
        "maxFileSize": "200MiB"
      }
    }
  }
}
$out writes 176 MiB of data to BSON files. To name each file, $out concatenates:
A constant string persons/ and, from the documents:
  The string value of the name field,
  A forward slash (/),
  The string value of the uniqueId field, and
  A forward slash (/).
Each BSON file contains all of the documents with the same name and uniqueId values. $out names each file using the documents' name and uniqueId values.
Multiple Types of Fields from Documents
Example
You want to write 154 MiB of data as compressed JSON files to a Google Cloud Storage bucket named my-gcs-bucket.
Consider the following $out syntax:
{
  "$out": {
    "gcs": {
      "bucket": "my-gcs-bucket",
      "region": "us-central1",
      "filename": {
        "$concat": [
          "big-box-store/",
          { "$toString": "$storeNumber" }, "/",
          { "$toString": "$saleDate" }, "/",
          "$partId", "/"
        ]
      },
      "format": {
        "name": "json.gz",
        "maxFileSize": "200MiB"
      }
    }
  }
}
$out writes 154 MiB of data to compressed JSON files, where each file contains all documents with the same storeNumber, saleDate, and partId values. To name each file, $out concatenates:
A constant string value of big-box-store/,
A string value of a unique store number in the storeNumber field,
A forward slash (/),
A string value of the date from the saleDate field,
A forward slash (/),
A string value of part ID from the partId field, and
A forward slash (/).
Write to Collection on Atlas Cluster
This $out syntax sends the aggregated data to a sampleDB.mySampleData collection in the Atlas cluster named myTestCluster. The syntax doesn't specify a project ID; $out uses the ID of the project that contains your federated database instance.
Example
{
  "$out": {
    "atlas": {
      "clusterName": "myTestCluster",
      "db": "sampleDB",
      "coll": "mySampleData"
    }
  }
}
Run a Query in the Background
The following example shows $out syntax for running an aggregation pipeline that ends with the $out stage in the background.
Example
db.runCommand({
  "aggregate": "my-collection",
  "pipeline": [
    {
      "$out": {
        "s3": {
          "bucket": "my-s3-bucket",
          "filename": { "$toString": "$saleDate" },
          "format": {
            "name": "json"
          }
        }
      }
    }
  ],
  // The aggregate command requires a cursor document.
  "cursor": {},
  // Run the aggregation in the background.
  "background": true
})
$out writes to JSON files in the root of the bucket in the background. Each JSON file contains all of the documents with the same saleDate value. $out names each file using the documents' saleDate value converted to a string.
Example
db.runCommand({
  "aggregate": "my-collection",
  "pipeline": [
    {
      "$out": {
        "azure": {
          "serviceURL": "http://mystorageaccount.blob.core.windows.net/",
          "containerName": "my-container",
          "filename": { "$toString": "$saleDate" },
          "format": {
            "name": "json"
          }
        }
      }
    }
  ],
  // The aggregate command requires a cursor document.
  "cursor": {},
  // Run the aggregation in the background.
  "background": true
})
$out writes to JSON files in the root of the Azure Blob Storage container in the background. Each JSON file contains all of the documents with the same saleDate value. $out names each file using the documents' saleDate value converted to a string.
Example
db.runCommand({
  "aggregate": "my-collection",
  "pipeline": [
    {
      "$out": {
        "gcs": {
          "bucket": "my-gcs-bucket",
          "filename": { "$toString": "$saleDate" },
          "format": {
            "name": "json"
          }
        }
      }
    }
  ],
  // The aggregate command requires a cursor document.
  "cursor": {},
  // Run the aggregation in the background.
  "background": true
})
$out writes to JSON files in the root of the bucket in the background. Each JSON file contains all of the documents with the same saleDate value. $out names each file using the documents' saleDate value converted to a string.
Example
db.runCommand({
  "aggregate": "my-collection",
  "pipeline": [
    {
      "$out": {
        "atlas": {
          "clusterName": "myTestCluster",
          "db": "sampleDB",
          "coll": "mySampleData"
        }
      }
    }
  ],
  // The aggregate command requires a cursor document.
  "cursor": {},
  // Run the aggregation in the background.
  "background": true
})
$out writes to the sampleDB.mySampleData collection in the Atlas cluster named myTestCluster in the background.
Limitations
String Data Type
Atlas Data Federation interprets empty strings ("") as null values when parsing filenames. If you want Atlas Data Federation to generate parseable filenames, wrap the field references that could have null values using $convert with an empty string onNull value.
Example
This example shows how to handle null values in the year field when creating a filename from the field value.
{
  "$out": {
    "s3": {
      "bucket": "my-s3-bucket",
      "region": "us-east-1",
      "filename": {
        "$concat": [
          "big-box-store/",
          {
            "$convert": {
              "input": "$year",
              "to": "string",
              "onNull": ""
            }
          }, "/"
        ]
      },
      "format": {
        "name": "json.gz",
        "maxFileSize": "200MiB"
      }
    }
  }
}
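The reason for the onNull fallback is easier to see with a small emulation. In MongoDB, $concat returns null when any operand is null or missing, and $convert returns its onNull value for a null or missing input. The sketch below mimics only those two behaviors in plain JavaScript; the helper names concat and convertToString and the sample documents are hypothetical.

```javascript
// Mimics $concat: the result is null if any operand is null or missing.
function concat(parts) {
  return parts.some((part) => part === null || part === undefined)
    ? null
    : parts.join("");
}

// Mimics $convert with "to": "string" and an "onNull" fallback value.
function convertToString(input, onNull) {
  return input === null || input === undefined ? onNull : String(input);
}

// Hypothetical documents: one has a year field, the other does not.
const withYear = { year: 2021 };
const withoutYear = {};

// Without the fallback, a missing year makes the whole filename null.
console.log(concat(["big-box-store/", withoutYear.year, "/"])); // null

// With the fallback, the filename is still a string.
console.log(concat(["big-box-store/", convertToString(withoutYear.year, ""), "/"])); // big-box-store//
console.log(concat(["big-box-store/", convertToString(withYear.year, ""), "/"]));    // big-box-store/2021/
```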
Number of Unique Fields
When writing to CSV, TSV, or Parquet file format, Atlas Data Federation doesn't support more than 32000 unique fields.
CSV and TSV File Format
When writing to CSV or TSV format, Atlas Data Federation does not support the following data types in the documents:
Arrays
DB pointer
JavaScript
JavaScript code with scope
Minimum or maximum key data type
In a CSV file, Atlas Data Federation represents nested documents using the dot (.) notation. For example, Atlas Data Federation writes { x: { a: 1, b: 2 } } as the following in the CSV file:
x.a,x.b
1,2
Atlas Data Federation represents all other data types as strings. Therefore, the data types in MongoDB read back from the CSV file may not be the same as the data types in the original BSON documents from which the data types were written.
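The dot-notation flattening can be illustrated with a short sketch. This is not the Atlas Data Federation implementation: the helper flatten is hypothetical, it handles only plain nested objects, and it does not reject the unsupported types listed above.

```javascript
// Flattens nested plain objects into dot-notation keys,
// for example { x: { a: 1 } } becomes { "x.a": 1 }.
function flatten(doc, prefix = "", out = {}) {
  for (const [key, value] of Object.entries(doc)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      flatten(value, path, out);
    } else {
      out[path] = value;
    }
  }
  return out;
}

const flat = flatten({ x: { a: 1, b: 2 } });
const header = Object.keys(flat).join(",");
// Every value is written as a string, which is why types may differ when read back.
const row = Object.values(flat).map(String).join(",");

console.log(header); // x.a,x.b
console.log(row);    // 1,2
```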
Parquet File Format
For Parquet, Atlas Data Federation reads back fields with null or undefined values as missing because Parquet doesn't distinguish between null or undefined values and missing values. Although Atlas Data Federation supports all data types, for BSON data types that do not have a direct equivalent in Parquet, such as JavaScript, regular expression, etc., it:
Chooses a representation that allows the resulting Parquet file to be read back using a non-MongoDB tool.
Stores a MongoDB schema in the Parquet file's key/value metadata so that Atlas Data Federation can reconstruct the original BSON document with the correct data types if the Parquet file is read back by Atlas Data Federation.
Example
Consider the following BSON documents:
{
  "clientId": 102,
  "phoneNumbers": ["123-4567", "234-5678"],
  "clientInfo": {
    "name": "Taylor",
    "occupation": "teacher"
  }
}
{
  "clientId": "237",
  "phoneNumbers": ["345-6789"],
  "clientInfo": {
    "name": "Jordan"
  }
}
If you write the preceding BSON documents to Parquet format using $out to S3, the Parquet file schema for your BSON documents would look similar to the following:
message root {
  optional group clientId {
    optional int32 int;
    optional binary string (STRING);
  }
  optional group phoneNumbers (LIST) {
    repeated group list {
      optional binary element (STRING);
    }
  }
  optional group clientInfo {
    optional binary name (STRING);
    optional binary occupation (STRING);
  }
}
Your Parquet data on S3 would look similar to the following:
clientId:
.int = 102
phoneNumbers:
.list:
..element = "123-4567"
.list:
..element = "234-5678"
clientInfo:
.name = "Taylor"
.occupation = "teacher"

clientId:
.string = "237"
phoneNumbers:
.list:
..element = "345-6789"
clientInfo:
.name = "Jordan"
The preceding example demonstrates how Atlas Data Federation handles complex data types:
Atlas Data Federation maps documents at all levels to a Parquet group.
Atlas Data Federation encodes arrays using the LIST logical type and the mandatory three-level list or element structure. To learn more, see Lists.
Atlas Data Federation maps polymorphic BSON fields to a group of multiple single-type columns because Parquet doesn't support polymorphic columns. Atlas Data Federation names the group after the BSON field. In the preceding example, Atlas Data Federation creates a Parquet group named clientId for the polymorphic field named clientId with two children named after its BSON types, int and string.
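As a rough illustration of how a polymorphic field ends up with one child column per type, the sketch below collects the type names seen for each top-level field across documents. It is an assumption-laden approximation, not the Atlas Data Federation algorithm: the helpers typeName and fieldTypes are hypothetical, and the type detection covers only integers, strings, and arrays.

```javascript
// Maps a JavaScript value to an illustrative BSON-style type name.
function typeName(value) {
  if (Array.isArray(value)) return "array";
  if (Number.isInteger(value)) return "int";
  if (typeof value === "string") return "string";
  return typeof value;
}

// Collects, for each top-level field, the set of type names seen across documents.
function fieldTypes(docs) {
  const types = {};
  for (const doc of docs) {
    for (const [field, value] of Object.entries(doc)) {
      if (!types[field]) types[field] = new Set();
      types[field].add(typeName(value));
    }
  }
  return types;
}

const docs = [
  { clientId: 102, phoneNumbers: ["123-4567", "234-5678"] },
  { clientId: "237", phoneNumbers: ["345-6789"] },
];

const types = fieldTypes(docs);
console.log([...types.clientId].join(", "));     // int, string
console.log([...types.phoneNumbers].join(", ")); // array
```

A field with more than one type name, such as clientId here, corresponds to the case where the Parquet group has one child per type.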
String Data Type
Atlas Data Federation interprets empty strings ("") as null values when parsing filenames. If you want Atlas Data Federation to generate parseable filenames, wrap the field references that could have null values using $convert with an empty string onNull value.
Example
This example shows how to handle null values in the year field when creating a filename from the field value.
{
  "$out": {
    "azure": {
      "serviceURL": "http://mystorageaccount.blob.core.windows.net/",
      "containerName": "my-container",
      "region": "eastus2",
      "filename": {
        "$concat": [
          "big-box-store/",
          {
            "$convert": {
              "input": "$year",
              "to": "string",
              "onNull": ""
            }
          }, "/"
        ]
      },
      "format": {
        "name": "json.gz",
        "maxFileSize": "200MiB"
      }
    }
  }
}
Number of Unique Fields
When writing to CSV, TSV, or Parquet file format, Atlas Data Federation doesn't support more than 32000 unique fields.
CSV and TSV File Format
When writing to CSV or TSV format, Atlas Data Federation does not support the following data types in the documents:
Arrays
DB pointer
JavaScript
JavaScript code with scope
Minimum or maximum key data type
In a CSV file, Atlas Data Federation represents nested documents using the dot (.) notation. For example, Atlas Data Federation writes { x: { a: 1, b: 2 } } as the following in the CSV file:
x.a,x.b
1,2
Atlas Data Federation represents all other data types as strings. Therefore, the data types in MongoDB read back from the CSV file may not be the same as the data types in the original BSON documents from which the data types were written.
Parquet File Format
For Parquet, Atlas Data Federation reads back fields with null or undefined values as missing because Parquet doesn't distinguish between null or undefined values and missing values. Although Atlas Data Federation supports all data types, for BSON data types that do not have a direct equivalent in Parquet, such as JavaScript, regular expression, etc., it:
Chooses a representation that allows the resulting Parquet file to be read back using a non-MongoDB tool.
Stores a MongoDB schema in the Parquet file's key/value metadata so that Atlas Data Federation can reconstruct the original BSON document with the correct data types if the Parquet file is read back by Atlas Data Federation.
Example
Consider the following BSON documents:
{
  "clientId": 102,
  "phoneNumbers": ["123-4567", "234-5678"],
  "clientInfo": {
    "name": "Taylor",
    "occupation": "teacher"
  }
}
{
  "clientId": "237",
  "phoneNumbers": ["345-6789"],
  "clientInfo": {
    "name": "Jordan"
  }
}
If you write the preceding BSON documents to Parquet format using $out to Azure, the Parquet file schema for your BSON documents would look similar to the following:
message root {
  optional group clientId {
    optional int32 int;
    optional binary string (STRING);
  }
  optional group phoneNumbers (LIST) {
    repeated group list {
      optional binary element (STRING);
    }
  }
  optional group clientInfo {
    optional binary name (STRING);
    optional binary occupation (STRING);
  }
}
Your Parquet data in Azure Blob Storage would look similar to the following:
clientId:
.int = 102
phoneNumbers:
.list:
..element = "123-4567"
.list:
..element = "234-5678"
clientInfo:
.name = "Taylor"
.occupation = "teacher"

clientId:
.string = "237"
phoneNumbers:
.list:
..element = "345-6789"
clientInfo:
.name = "Jordan"
The preceding example demonstrates how Atlas Data Federation handles complex data types:
Atlas Data Federation maps documents at all levels to a Parquet group.
Atlas Data Federation encodes arrays using the LIST logical type and the mandatory three-level list or element structure. To learn more, see Lists.
Atlas Data Federation maps polymorphic BSON fields to a group of multiple single-type columns because Parquet doesn't support polymorphic columns. Atlas Data Federation names the group after the BSON field. In the preceding example, Atlas Data Federation creates a Parquet group named clientId for the polymorphic field named clientId with two children named after its BSON types, int and string.
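To make the polymorphic-field rule concrete, the following is a minimal sketch in plain JavaScript that finds fields whose values have more than one type across a set of documents. The helper names typeName, collectFieldTypes, and polymorphicFields are hypothetical, the type detection covers only a few types, and the code is not how Atlas Data Federation derives the Parquet schema; it only illustrates which fields would end up as a group of single-type columns.

```javascript
// Hypothetical helper: names a value's type using BSON-like type names.
// Only a handful of types are covered; this is an illustration, not a BSON library.
function typeName(value) {
  if (Array.isArray(value)) return "array";
  if (value === null) return "null";
  if (typeof value === "string") return "string";
  if (typeof value === "number") return Number.isInteger(value) ? "int" : "double";
  if (typeof value === "boolean") return "bool";
  return "object";
}

// Collects, for each top-level field, the set of type names seen across all documents.
function collectFieldTypes(docs) {
  const types = new Map();
  for (const doc of docs) {
    for (const [field, value] of Object.entries(doc)) {
      if (!types.has(field)) types.set(field, new Set());
      types.get(field).add(typeName(value));
    }
  }
  return types;
}

// Returns the fields that carry more than one type, with their sorted type names.
function polymorphicFields(docs) {
  const result = {};
  for (const [field, seen] of collectFieldTypes(docs)) {
    if (seen.size > 1) result[field] = [...seen].sort();
  }
  return result;
}

const docs = [
  { clientId: 102, phoneNumbers: ["123-4567", "234-5678"], clientInfo: { name: "Taylor", occupation: "teacher" } },
  { clientId: "237", phoneNumbers: ["345-6789"], clientInfo: { name: "Jordan" } },
];

console.log(polymorphicFields(docs));
```

For the two sample documents, only clientId is reported, with the type names int and string, which matches the two children of the clientId group in the schema above.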
String Data Type
Atlas Data Federation interprets empty strings ("") as null values when parsing filenames. If you want Atlas Data Federation to generate parseable filenames, wrap the field references that could have null values using $convert with an empty string onNull value.
Example
This example shows how to handle null values in the year field when creating a filename from the field value.
{
  "$out": {
    "gcs": {
      "bucket": "my-gcs-bucket",
      "region": "us-central1",
      "filename": {
        "$concat": [
          "big-box-store/",
          {
            "$convert": {
              "input": "$year",
              "to": "string",
              "onNull": ""
            }
          }, "/"
        ]
      },
      "format": {
        "name": "json.gz",
        "maxFileSize": "200MiB"
      }
    }
  }
}
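When a filename is built from several fields, the same $convert wrapper has to be repeated for each one. The following is a small sketch of how that expression could be assembled in JavaScript before passing it to $out. The helpers nullSafe and partitionedFilename and the $month field are hypothetical and used only for illustration.

```javascript
// Wraps a field reference so that a null or missing value becomes an empty string.
function nullSafe(fieldRef) {
  return { $convert: { input: fieldRef, to: "string", onNull: "" } };
}

// Builds a $concat expression: the prefix, then one "<value>/" segment per field.
function partitionedFilename(prefix, fieldRefs) {
  const parts = [prefix];
  for (const ref of fieldRefs) {
    parts.push(nullSafe(ref), "/");
  }
  return { $concat: parts };
}

// "$month" is an assumed field name; the source example uses only "$year".
const filename = partitionedFilename("big-box-store/", ["$year", "$month"]);
console.log(JSON.stringify(filename, null, 2));
```

The resulting object can be used as the value of the filename option in the $out stage.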
Number of Unique Fields
When writing to CSV, TSV, or Parquet file format, Atlas Data Federation doesn't support more than 32000 unique fields.
CSV and TSV File Format
When writing to CSV or TSV format, Atlas Data Federation does not support the following data types in the documents:
Arrays
DB pointer
JavaScript
JavaScript code with scope
Minimum or maximum key data type
In a CSV file, Atlas Data Federation represents nested documents using the dot (.) notation. For example, Atlas Data Federation writes { x: { a: 1, b: 2 } } as the following in the CSV file:
x.a,x.b
1,2
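The dot notation mapping can be sketched in a few lines of JavaScript. This is an illustration of the naming rule only, not the writer that Atlas Data Federation uses: the helpers flatten and toCsv are hypothetical, and the sketch assumes plain nested objects with scalar values and does no CSV quoting or escaping.

```javascript
// Flattens nested plain objects into a single level keyed by dotted paths.
function flatten(doc, prefix = "", out = {}) {
  for (const [key, value] of Object.entries(doc)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      flatten(value, path, out); // descend into the nested document
    } else {
      out[path] = value; // scalar value: record it under its dotted path
    }
  }
  return out;
}

// Renders one document as a header line and a value line.
function toCsv(doc) {
  const flat = flatten(doc);
  const header = Object.keys(flat).join(",");
  const row = Object.values(flat).map(String).join(",");
  return `${header}\n${row}`;
}

console.log(toCsv({ x: { a: 1, b: 2 } }));
```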
Atlas Data Federation represents all other data types as strings. Therefore, the data types that MongoDB reads back from the CSV file might not match the data types in the original BSON documents from which the data was written.
Parquet File Format
For Parquet, Atlas Data Federation reads back fields with null or undefined values as missing because Parquet doesn't distinguish between null or undefined values and missing values. Although Atlas Data Federation supports all data types, for BSON data types that do not have a direct equivalent in Parquet, such as JavaScript, regular expression, etc., it:
Chooses a representation that allows the resulting Parquet file to be read back using a non-MongoDB tool.
Stores a MongoDB schema in the Parquet file's key/value metadata so that Atlas Data Federation can reconstruct the original BSON document with the correct data types if the Parquet file is read back by Atlas Data Federation.
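As a rough illustration of the null handling described above, the following JavaScript sketch shows what a document might look like after a round trip through Parquet, where fields holding null or undefined come back as missing. The helper dropNullish is hypothetical, and leaving null array elements untouched is an assumption of this sketch, not documented behavior.

```javascript
// Removes fields whose value is null or undefined, recursing into nested documents.
// Array elements are passed through (assumption: only fields are affected).
function dropNullish(value) {
  if (Array.isArray(value)) return value.map(dropNullish);
  if (value !== null && typeof value === "object") {
    const out = {};
    for (const [key, inner] of Object.entries(value)) {
      if (inner === null || inner === undefined) continue; // reads back as missing
      out[key] = dropNullish(inner);
    }
    return out;
  }
  return value;
}

const original = { a: 1, b: null, c: { d: undefined, e: "x" } };
console.log(JSON.stringify(dropNullish(original)));
```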
Example
Consider the following BSON documents:
{ "clientId": 102, "phoneNumbers": ["123-4567", "234-5678"], "clientInfo": { "name": "Taylor", "occupation": "teacher" } }
{ "clientId": "237", "phoneNumbers": ["345-6789"], "clientInfo": { "name": "Jordan" } }
If you write the preceding BSON documents to Parquet format using $out to GCP, the Parquet file schema for your BSON documents would look similar to the following:
message root {
  optional group clientId {
    optional int32 int;
    optional binary string (STRING);
  }
  optional group phoneNumbers (LIST) {
    repeated group list {
      optional binary element (STRING);
    }
  }
  optional group clientInfo {
    optional binary name (STRING);
    optional binary occupation (STRING);
  }
}
Your Parquet data on Google Cloud Storage would look similar to the following:
clientId:
.int = 102
phoneNumbers:
.list:
..element = "123-4567"
.list:
..element = "234-5678"
clientInfo:
.name = "Taylor"
.occupation = "teacher"

clientId:
.string = "237"
phoneNumbers:
.list:
..element = "345-6789"
clientInfo:
.name = "Jordan"
The preceding example demonstrates how Atlas Data Federation handles complex data types:
Atlas Data Federation maps documents at all levels to a Parquet group.
Atlas Data Federation encodes arrays using the LIST logical type and the mandatory three-level list or element structure. To learn more, see Lists.
Atlas Data Federation maps polymorphic BSON fields to a group of multiple single-type columns because Parquet doesn't support polymorphic columns. Atlas Data Federation names the group after the BSON field. In the preceding example, Atlas Data Federation creates a Parquet group named clientId for the polymorphic field named clientId with two children named after its BSON types, int and string.
This section applies only to cloud service provider storage offerings.
Error Output
Atlas Data Federation uses the error handling mechanism described below for documents that enter the $out stage and cannot be written for one of the following reasons:
The s3.filename does not evaluate to a string value.
The s3.filename evaluates to a file that cannot be written to.
The s3.format.name is set to csv, tsv, csv.gz, or tsv.gz and the document passed to $out contains data types that are not supported by the specified file format. For a full list of unsupported data types, see CSV and TSV File Format.
If $out encounters one of the above errors while processing a document, Atlas Data Federation writes to the following three special error files in the path s3://<bucket-name>/atlas-data-lake-<correlation-id>/:
Error File Name | Description |
---|---|
out-error-docs/<i>.json | Atlas Data Federation writes the document that encountered an error to this file. i begins with 1 and increments whenever the file being written to reaches the maxFileSize. Then, any further documents are written to the new file out-error-docs/<i+1>.json. |
out-error-index/<i>.json | Atlas Data Federation writes an error message to this file. Each error message contains a description of the error and an index value n that begins with 0 and increments with each additional error message written to the file. i begins with 1 and increments whenever the file being written to reaches the maxFileSize. Then, any further error messages are written to the new file out-error-index/<i+1>.json. |
out-error-summary.json | Atlas Data Federation writes a single summary document for each type of error encountered during an aggregation operation to this file. Each summary document contains a description of the type of error and a count of the number of documents that encountered that type of error. |
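The error file locations follow directly from the bucket name, the correlation ID, and the file index i. The following JavaScript sketch assembles the three paths from the pattern described above. The helper errorFilePaths is hypothetical, and the correlation ID must be taken from the actual output location; it is not something this sketch can derive.

```javascript
// Builds the three S3 error file paths for a given bucket, correlation ID, and file index.
function errorFilePaths(bucket, correlationId, i) {
  const base = `s3://${bucket}/atlas-data-lake-${correlationId}`;
  return {
    docs: `${base}/out-error-docs/${i}.json`,   // documents that failed
    index: `${base}/out-error-index/${i}.json`, // one error message per failed document
    summary: `${base}/out-error-summary.json`,  // one summary document per error type
  };
}

const paths = errorFilePaths("customer-data", "1773b3d5e2a7f3858530daf5", 1);
console.log(paths.docs);
console.log(paths.index);
console.log(paths.summary);
```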
Example
This example shows how to generate error files using $out in a federated database instance.
The following aggregation pipeline sorts documents in the analytics.customers sample dataset collection by descending customer birthdate and attempts to write the _id, name, and accounts fields of the youngest three customers to the file named youngest-customers.csv in the S3 bucket named customer-data.
db.customers.aggregate([
  // Youngest customers first.
  { $sort: { "birthdate" : -1 } },
  { $unset: [ "username", "address", "email", "tier_and_details", "birthdate" ] },
  { $limit: 3 },
  // Write the result as CSV; the accounts array field triggers the error.
  { $out: {
      "s3": {
        "bucket": "customer-data",
        "filename": "youngest-customers",
        "region": "us-east-2",
        "format": { "name": "csv" }
      }
  } }
])
Because accounts is an array field, $out encounters an error when it tries to write a document with s3.format.name set to csv. To handle these errors, Atlas Data Federation writes to the following three error files:
The following output shows the first of three documents written to the out-error-docs/1.json file:
s3://customer-data/atlas-data-lake-1773b3d5e2a7f3858530daf5/out-error-docs/1.json
{ "_id" : {"$oid":"5ca4bbcea2dd94ee58162ba7"}, "name": "Marc Cain", "accounts": [{"$numberInt":"980440"}, {"$numberInt":"626807"}, {"$numberInt":"313907"}, {"$numberInt":"218101"}, {"$numberInt":"157495"}, {"$numberInt":"736396"}] }
The following output shows the first of three error messages written to the out-error-index/1.json file. The n field starts at 0 and increments for each error written to the file.
s3://customer-data/atlas-data-lake-1773b3d5e2a7f3858530daf5/out-error-index/1.json
{ "n" : {"$numberInt": "0"}, "error" : "field accounts is of unsupported type array" }
The following output shows the error summary document written to the out-error-summary.json file. The count field represents the number of documents passed to $out that encountered an error due to the accounts array field.
s3://customer-data/atlas-data-lake-1773b3d5e2a7f3858530daf5/out-error-summary.json
{ "errorType": "field accounts is of unsupported type array", "count": {"$numberInt":"3"} }
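One way to avoid this error is to convert the array to a string before the $out stage. The following is a sketch of such a pipeline, written as a JavaScript value that could be passed to db.customers.aggregate() in mongosh. It assumes that the $reduce, $cond, and $toString operators are available in your federated database instance, and the ";" separator is an arbitrary choice made for this illustration.

```javascript
// Variant of the example pipeline that joins the accounts array into one string,
// so that every field written to CSV has a supported type.
const pipeline = [
  { $sort: { birthdate: -1 } },
  { $limit: 3 },
  {
    $project: {
      name: 1,
      accounts: {
        $reduce: {
          input: "$accounts",
          initialValue: "",
          in: {
            $concat: [
              "$$value",
              // Add the separator only between elements, not before the first one.
              { $cond: [{ $eq: ["$$value", ""] }, "", ";"] },
              { $toString: "$$this" },
            ],
          },
        },
      },
    },
  },
  {
    $out: {
      s3: {
        bucket: "customer-data",
        filename: "youngest-customers",
        region: "us-east-2",
        format: { name: "csv" },
      },
    },
  },
];

// In mongosh, connected to the federated database instance:
// db.customers.aggregate(pipeline)
console.log(JSON.stringify(pipeline[2], null, 2));
```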
Atlas Data Federation uses the error handling mechanism described below for documents that enter the $out stage and cannot be written for one of the following reasons:
The azure.filename does not evaluate to a string value.
The azure.filename evaluates to a file that cannot be written to.
The azure.format.name is set to csv, tsv, csv.gz, or tsv.gz and the document passed to $out contains data types that are not supported by the specified file format. For a full list of unsupported data types, see CSV and TSV File Format.
If $out encounters one of the above errors while processing a document, Atlas Data Federation writes to the following three special error files in the path http://<storage-account>.blob.core.windows.net/<container-name>/atlas-data-lake-<correlation-id>/:
Error File Name | Description |
---|---|
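out-error-docs/<i>.json | Atlas Data Federation writes the document that encountered an error to this file. i begins with 1 and increments whenever the file being written to reaches the maxFileSize. Then, any further documents are written to the new file out-error-docs/<i+1>.json. |
out-error-index/<i>.json | Atlas Data Federation writes an error message to this file. Each error message contains a description of the error and an index value n that begins with 0 and increments with each additional error message written to the file. i begins with 1 and increments whenever the file being written to reaches the maxFileSize. Then, any further error messages are written to the new file out-error-index/<i+1>.json. |
out-error-summary.json | Atlas Data Federation writes a single summary document for each type of error encountered during an aggregation operation to this file. Each summary document contains a description of the type of error and a count of the number of documents that encountered that type of error. |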
Example
This example shows how to generate error files using $out in a federated database instance.
The following aggregation pipeline sorts documents in the analytics.customers sample dataset collection by descending customer birthdate and attempts to write the _id, name, and accounts fields of the youngest three customers to the file named youngest-customers.csv in the Azure Blob Storage container named customer-data.
db.customers.aggregate([
  // Youngest customers first.
  { $sort: { "birthdate" : -1 } },
  { $unset: [ "username", "address", "email", "tier_and_details", "birthdate" ] },
  { $limit: 3 },
  // Write the result as CSV; the accounts array field triggers the error.
  { $out: {
      "azure": {
        "serviceURL": "https://mystorageaccount.blob.core.windows.net",
        "containerName": "customer-data",
        "filename": "youngest-customers",
        "region": "eastus2",
        "format": { "name": "csv" }
      }
  } }
])
Because accounts is an array field, $out encounters an error when it tries to write a document with azure.format.name set to csv. To handle these errors, Atlas Data Federation writes to the following three error files:
The following output shows the first of three documents written to the out-error-docs/1.json file:
http://mystorageaccount.blob.core.windows.net/customer-data/atlas-data-lake-1773b3d5e2a7f3858530daf5/out-error-docs/1.json
{ "_id" : {"$oid":"5ca4bbcea2dd94ee58162ba7"}, "name": "Marc Cain", "accounts": [{"$numberInt":"980440"}, {"$numberInt":"626807"}, {"$numberInt":"313907"}, {"$numberInt":"218101"}, {"$numberInt":"157495"}, {"$numberInt":"736396"}] }
The following output shows the first of three error messages written to the out-error-index/1.json file. The n field starts at 0 and increments for each error written to the file.
http://mystorageaccount.blob.core.windows.net/customer-data/atlas-data-lake-1773b3d5e2a7f3858530daf5/out-error-index/1.json
{ "n" : {"$numberInt": "0"}, "error" : "field accounts is of unsupported type array" }
The following output shows the error summary document written to the out-error-summary.json file. The count field represents the number of documents passed to $out that encountered an error due to the accounts array field.
http://mystorageaccount.blob.core.windows.net/customer-data/atlas-data-lake-1773b3d5e2a7f3858530daf5/out-error-summary.json
{ "errorType": "field accounts is of unsupported type array", "count": {"$numberInt":"3"} }
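The relationship between the error index entries and the summary document can be sketched as a simple count per error description. The following JavaScript is an illustration only: the helper summarize is hypothetical, the entries use plain numbers for n instead of the Extended JSON form shown in the output, and the second and third entries are assumed to carry the same error message as the first.

```javascript
// Groups error index entries by their error description and counts each group.
function summarize(indexEntries) {
  const counts = new Map();
  for (const entry of indexEntries) {
    counts.set(entry.error, (counts.get(entry.error) || 0) + 1);
  }
  return [...counts].map(([errorType, count]) => ({ errorType, count }));
}

// Three entries, one per document that failed because of the accounts array field.
const entries = [
  { n: 0, error: "field accounts is of unsupported type array" },
  { n: 1, error: "field accounts is of unsupported type array" },
  { n: 2, error: "field accounts is of unsupported type array" },
];

console.log(summarize(entries));
```

With these three entries, the sketch yields a single summary item with a count of 3, which corresponds to the summary document shown above.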
Atlas Data Federation uses the error handling mechanism described below for documents that enter the $out stage and cannot be written for one of the following reasons:
The gcs.filename does not evaluate to a string value.
The gcs.filename evaluates to a file that cannot be written to.
The gcs.format.name is set to csv, tsv, csv.gz, or tsv.gz and the document passed to $out contains data types that are not supported by the specified file format. For a full list of unsupported data types, see CSV and TSV File Format.
If $out encounters one of the above errors while processing a document, Atlas Data Federation writes to the following three special error files in the path gcs://<bucket-name>/atlas-data-lake-<correlation-id>/:
Error File Name | Description |
---|---|
out-error-docs/<i>.json | Atlas Data Federation writes the document that encountered an error to this file. i begins with 1 and increments whenever the file being written to reaches the maxFileSize. Then, any further documents are written to the new file out-error-docs/<i+1>.json. |
out-error-index/<i>.json | Atlas Data Federation writes an error message to this file. Each error message contains a description of the error and an index value n that begins with 0 and increments with each additional error message written to the file. i begins with 1 and increments whenever the file being written to reaches the maxFileSize. Then, any further error messages are written to the new file out-error-index/<i+1>.json. |
out-error-summary.json | Atlas Data Federation writes a single summary document for each type of error encountered during an aggregation operation to this file. Each summary document contains a description of the type of error and a count of the number of documents that encountered that type of error. |
Example
This example shows how to generate error files using $out in a federated database instance.
The following aggregation pipeline sorts documents in the analytics.customers sample dataset collection by descending customer birthdate and attempts to write the _id, name, and accounts fields of the youngest three customers to the file named youngest-customers.csv in the Google Cloud Storage bucket named customer-data.
db.customers.aggregate([
  // Youngest customers first.
  { $sort: { "birthdate" : -1 } },
  { $unset: [ "username", "address", "email", "tier_and_details", "birthdate" ] },
  { $limit: 3 },
  // Write the result as CSV; the accounts array field triggers the error.
  { $out: {
      "gcs": {
        "bucket": "customer-data",
        "filename": "youngest-customers",
        "region": "us-central1",
        "format": { "name": "csv" }
      }
  } }
])
Because accounts is an array field, $out encounters an error when it tries to write a document with gcs.format.name set to csv. To handle these errors, Atlas Data Federation writes to the following three error files:
The following output shows the first of three documents written to the out-error-docs/1.json file:
gcs://customer-data/atlas-data-lake-1773b3d5e2a7f3858530daf5/out-error-docs/1.json
{ "_id" : {"$oid":"5ca4bbcea2dd94ee58162ba7"}, "name": "Marc Cain", "accounts": [{"$numberInt":"980440"}, {"$numberInt":"626807"}, {"$numberInt":"313907"}, {"$numberInt":"218101"}, {"$numberInt":"157495"}, {"$numberInt":"736396"}] }
The following output shows the first of three error messages written to the out-error-index/1.json file. The n field starts at 0 and increments for each error written to the file.
gcs://customer-data/atlas-data-lake-1773b3d5e2a7f3858530daf5/out-error-index/1.json
{ "n" : {"$numberInt": "0"}, "error" : "field accounts is of unsupported type array" }
The following output shows the error summary document written to the out-error-summary.json file. The count field represents the number of documents passed to $out that encountered an error due to the accounts array field.
gcs://customer-data/atlas-data-lake-1773b3d5e2a7f3858530daf5/out-error-summary.json
{ "errorType": "field accounts is of unsupported type array", "count": {"$numberInt":"3"} }
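The error files use an Extended JSON form in which 32-bit integers appear as { "$numberInt": "<value>" }. If you inspect these files with a generic JSON parser, a small helper can turn those wrappers into plain numbers. The following JavaScript sketch handles only $numberInt; the helper unwrapInts is hypothetical, and other wrappers such as $oid are deliberately left unchanged.

```javascript
// Recursively replaces { "$numberInt": "<digits>" } wrappers with plain numbers.
function unwrapInts(value) {
  if (Array.isArray(value)) return value.map(unwrapInts);
  if (value !== null && typeof value === "object") {
    const keys = Object.keys(value);
    if (keys.length === 1 && keys[0] === "$numberInt") {
      return parseInt(value.$numberInt, 10);
    }
    const out = {};
    for (const key of keys) out[key] = unwrapInts(value[key]);
    return out;
  }
  return value;
}

// The summary document from the example above, parsed as ordinary JSON.
const summaryDoc = JSON.parse(
  '{ "errorType": "field accounts is of unsupported type array", "count": {"$numberInt":"3"} }'
);

console.log(unwrapInts(summaryDoc));
```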
This section applies only to cloud service provider storage.