Atlas Data Federation cost effective $out

Hello,

I am working on a solution to save some costs by moving cold data from my Atlas database to S3 buckets and accessing it with Data Federation.

I am following the excellent blog post, as it covers the same scenario I am trying to solve: https://mongodb.prakticum-team.ru/developer/products/atlas/atlas-data-federation-out-aws-s3/.

I got it up and running using a small test data set, and everything works as expected. I am wondering whether this solution is also cost-effective when deployed to a production environment with large data sets.

I have the following schema (simplified):

{
    _id: id,
    adddate: ISODate,
    some_field: string
}

There is an index on adddate within the Atlas database.
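For reference, the index was created roughly like this (a minimal sketch; database and collection names are placeholders):

// Run directly against the Atlas cluster: ascending index on adddate
use db
db.collection.createIndex({ adddate: 1 })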

I created an aggregation pipeline that writes the data to S3 using the $out stage. It should be executed e.g. once a day:

[
  {
    $match: {
      adddate: {
        $lt: ISODate("2024-12-13"),
        $gte: ISODate("2024-12-12")
      }
    }
  },
  {
    $out: {
      s3: {
        bucket: "BUCKETNAME",
        region: "eu-west-1",
        filename: {
          $concat: [
            "v1/",
            "$some_field"
          ]
        },
        format: {
          name: "json"
        }
      }
    }
  }
]

This aggregation pipeline works as expected, moving all documents in the given adddate range to S3 (see the $match stage).
Looking at the output of “Explain”, I am concerned that this is not very cost-effective, as it shows that 1 MiB of data was traversed (which is the full size of my test data set). In production this would be hundreds of gigabytes:

{
  "ok": 1,
  "stats": {
    "size": "1.003183364868164 MiB",
    "numberOfPartitions": 1
  },
  "truncated": false,
  "plan": {
    "kind": "region",
    "region": "aws/eu-west-1",
    "node": {
      "kind": "data",
      "size": "1.003183364868164 MiB",
      "numberOfPartitions": 1,
      "partitionsTruncated": false,
      "partitions": [
        {
          "source": "FEDERATED_DATABASE_CONNECTION",
          "provider": "atlas",
          "size": "1.003183364868164 MiB",
          "database": "ATLAS_DATABASE",
          "collection": "ATLAS_COLLECTION",
          "pipeline": [
            {
		...

This behaviour seems to be independent of the selected date; it is the same even if I use a value that is outside the range of the data, yielding 0 documents.
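For reference, the explain output above was produced roughly like this (a minimal sketch using mongosh against the Data Federation connection; the virtual collection name is a placeholder and the pipeline is the one shown above):

// Connected to the federated database instance via mongosh
db.VIRTUAL_COLLECTION.explain("queryPlanner").aggregate([
  { $match: { adddate: { $gte: ISODate("2024-12-12"), $lt: ISODate("2024-12-13") } } },
  { $out: { s3: { bucket: "BUCKETNAME", region: "eu-west-1",
                  filename: { $concat: ["v1/", "$some_field"] },
                  format: { name: "json" } } } }
])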

Is the index not utilized?
Does anybody know a more cost-effective solution? As far as I understand it, I need to use the Data Federation connection string to write to S3, even though I only read from Atlas and write to S3.

Thanks,
Philipp

Hello @Philipp_Kratzer - Kudos to you for getting this to work by using the tutorial. Let me review this thoroughly to figure out whether this is the expected behavior. There may be a simple fix or reason why the index didn’t get applied; I will review this and may ask for some follow-up info, but this seems like a good amount to go on. One quick question, though: in the underlying cluster that you are querying, is this a view (materialized or other) or a collection?
Let me know, and I will see why it is scanning the whole data set.

Hello @Alexi_Antonino

Thanks for your reply!

I have the following configuration for data federation:

{
  "databases": [
    {
      "collections": [
        {
          "name": "data_combined",
          "dataSources": [
            {
              "collection": "collection",
              "database": "db",
              "storeName": "store_name_atlas"
            },
            {
              "path": "/{some_field string}.1.json",
              "storeName": "store_name_s3"
            }
          ]
        },
        {
          "name": "data_s3_only",
          "dataSources": [
            {
              "path": "/{some_field string}.1.json",
              "storeName": "store_name_s3"
            }
          ]
        },
        {
          "name": "data_atlas_only",
          "dataSources": [
            {
              "collection": "collection",
              "database": "db",
              "storeName": "store_name_atlas"
            }
          ]
        }
      ],
      "name": "virtual_data_v1",
      "views": []
    }
  ],
  "stores": [
    {
      "clusterName": "store_name_atlas",
      "name": "store_name_atlas",
      "projectId": "xxx",
      "provider": "atlas",
      "readPreference": {
        "mode": "secondary"
      }
    },
    {
      "bucket": "store_name_s3",
      "delimiter": "/",
      "name": "store_name_s3",
      "prefix": "v1/",
      "provider": "s3",
      "region": "eu-west-1"
    }
  ]
}

It is one virtual database with three different virtual collections:

  • data_combined: Uses both Atlas and S3 as data sources, so I can query cold and hot data through one universal connection, no matter where the data is stored
  • data_s3_only: Allows querying cold data only
  • data_atlas_only: Allows querying hot data only

For the aggregation pipeline in my previous post I use the data_atlas_only collection, so only hot data is read and written to S3.
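To illustrate, a minimal sketch of how I run it (connected with mongosh to the federated database instance; the pipeline is the $match/$out pipeline from my previous post):

// Connect to the Data Federation endpoint, not to the cluster directly
use virtual_data_v1
const pipeline = [ /* the $match + $out stages from my previous post */ ]
db.data_atlas_only.aggregate(pipeline)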

I query directly from those virtual collections, no views (I don’t know if views would help?).

In the meantime I came up with a workaround that should work:
I could create another temporary collection within Atlas and first move the data that should be written to S3 into this collection. Then I could run the aggregation pipeline on this (small) collection without any $match stage. I think that would work and be cost-effective, but it creates an additional step and makes the application more complex - I consider this more of a workaround.
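A minimal sketch of that workaround (collection and virtual collection names are just examples; cold_batch would need its own entry in the federation config):

// Step 1: on the Atlas cluster, copy one day of cold data into a small
// temporary collection, e.g. cold_batch:
db.collection.aggregate([
  { $match: { adddate: { $gte: ISODate("2024-12-12"), $lt: ISODate("2024-12-13") } } },
  { $merge: { into: "cold_batch" } }
])

// Step 2: via the federated connection, $out the whole temporary
// collection to S3 - no $match needed, so only the small collection is scanned:
db.data_cold_batch_only.aggregate([
  { $out: { s3: { bucket: "BUCKETNAME", region: "eu-west-1",
                  filename: { $concat: ["v1/", "$some_field"] },
                  format: { name: "json" } } } }
])

// Step 3: once the $out has succeeded, drop cold_batch and delete the
// moved documents from the original collection.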

I hope this makes the situation a bit clearer.

Thanks,
Philipp

@Philipp_Kratzer excellent information. I will ask around about this behavior. I believe you have given me enough info to chase down an answer. Back to you later…


Hi @Philipp_Kratzer! This is a really good question, and thank you for the thorough details. In this case, when Atlas Data Federation processes a query for the data_atlas_only collection, it pushes the query down to the Atlas cluster data source directly for processing and streams the data back. This means that as long as the index on adddate exists on the underlying Atlas cluster, it will be used when the cluster processes the query.
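If it helps, one way to sanity-check this is to run the same date-range match directly against the underlying Atlas cluster (a sketch; database and collection names as in your federation config):

// Run against the Atlas cluster, not the federated connection
db.collection.getIndexes()                        // should list the { adddate: 1 } index
db.collection.explain("executionStats").aggregate([
  { $match: { adddate: { $gte: ISODate("2024-12-12"), $lt: ISODate("2024-12-13") } } }
])
// The winning plan should show an IXSCAN on the adddate index rather than a COLLSCAN.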