Hello,
I am working on a solution to save some costs by moving cold data from my Atlas database to S3 buckets and access them with data federation.
I am following this excellent blog post, as it covers the same scenario I am trying to solve: https://mongodb.prakticum-team.ru/developer/products/atlas/atlas-data-federation-out-aws-s3/.
I got it up and running using a small test data set, and everything works as expected. I am wondering if this solution is also cost effective, when deploying to a production environment with large data sets.
I have the following schema (simplified):
{
  _id: id,
  adddate: ISODate,
  some_field: string
}
There is an index on adddate within the Atlas database.
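For reference, the index was created roughly like this (the collection name is a placeholder):

// Ascending index on the date field used to select the archival window
// (ATLAS_COLLECTION is a hypothetical name)
db.ATLAS_COLLECTION.createIndex({ adddate: 1 })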
I created an aggregation pipeline to write the data to S3, using the $out stage. This should be executed e.g. once a day:
[
  {
    $match: {
      adddate: {
        $lt: ISODate("2024-12-13"),
        $gte: ISODate("2024-12-12")
      }
    }
  },
  {
    $out: {
      s3: {
        bucket: "BUCKETNAME",
        region: "eu-west-1",
        filename: {
          $concat: [
            "v1/",
            "$some_field"
          ]
        },
        format: {
          name: "json"
        }
      }
    }
  }
]
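For context, this is roughly how I intend to run it once a day: a small mongosh script triggered by cron against the federated database connection string. The connection string, database, and collection names below are placeholders, not my real setup.

// run-archive.js
// Run daily, e.g. via cron: 0 1 * * * mongosh "$FEDERATED_URI" run-archive.js
// Archives yesterday's documents to S3 using the federated $out stage.

const end = new Date();
end.setUTCHours(0, 0, 0, 0);                                   // today at midnight UTC
const start = new Date(end.getTime() - 24 * 60 * 60 * 1000);   // yesterday at midnight UTC

db.getSiblingDB("ATLAS_DATABASE").ATLAS_COLLECTION.aggregate([
  { $match: { adddate: { $gte: start, $lt: end } } },
  {
    $out: {
      s3: {
        bucket: "BUCKETNAME",
        region: "eu-west-1",
        filename: { $concat: ["v1/", "$some_field"] },
        format: { name: "json" }
      }
    }
  }
]);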
This aggregation pipeline works as expected, moving all documents within the given adddate range to S3 (see the $match stage).
When looking at the output of “Explain”, I have some concerns that this is not very cost-effective: it shows that 1 MiB of data was traversed, which is the full size of my test data set. In production this would be hundreds of gigabytes:
{
  "ok": 1,
  "stats": {
    "size": "1.003183364868164 MiB",
    "numberOfPartitions": 1
  },
  "truncated": false,
  "plan": {
    "kind": "region",
    "region": "aws/eu-west-1",
    "node": {
      "kind": "data",
      "size": "1.003183364868164 MiB",
      "numberOfPartitions": 1,
      "partitionsTruncated": false,
      "partitions": [
        {
          "source": "FEDERATED_DATABASE_CONNECTION",
          "provider": "atlas",
          "size": "1.003183364868164 MiB",
          "database": "ATLAS_DATABASE",
          "collection": "ATLAS_COLLECTION",
          "pipeline": [
            {
              ...
This behaviour seems to be independent of the selected date; it is the same even if I use a value outside the range of the data, which yields 0 documents.
Is the index not utilized?
Does anybody know a more cost-effective solution? As far as I understand it, I need to use the Data Federation connection string to write to S3, even though I only query from Atlas and write to S3.
Thanks,
Philipp