Optimizing your Online Archive for Query Performance

Rachelle Palmer • 3 min read • Published Aug 30, 2022 • Updated Jan 23, 2024

AWS Atlas Online Archive


Contributed By

This article was contributed by Prem Krishna, a Senior Product Manager for Analytics at MongoDB.

Introduction

With Atlas Online Archive, you can tier cold or infrequently accessed data off your MongoDB cluster into MongoDB-managed cloud object storage (Amazon S3 or Microsoft Azure Blob Storage). This lowers storage costs for old data, while active data that is accessed and queried more often remains in the primary database.

FYI: If you use Online Archive together with MongoDB's Atlas Data Federation, you can also see a unified view of production data and archived data side by side through a read-only, federated database instance.

In this blog, we discuss how to improve the query performance of your online archive by choosing the correct partitioning fields.

Why is partitioning so critical when configuring Online Archive?

Once you have started archiving data, you cannot edit any partition fields, because the structure in which the data is stored in object storage becomes fixed after the archival job begins. Therefore, you'll want to think critically about your partitioning strategy beforehand.

Archival query performance is also determined by how the data is structured in object storage, so it is important to choose not only the correct partition fields but also the correct order for them.
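To make the point about ordering concrete, here is a simplified mental model in Python. It assumes that partition fields behave like the prefix of a compound key, so a query can only narrow down partitions through the leading fields it actually filters on. This is an illustration only, not how Atlas plans queries internally, and the helper name `usable_prefix` is hypothetical. The field names are the placeholder names used later in this article.

```python
def usable_prefix(partition_order, query_fields):
    """Return the leading partition fields that a query filters on.

    Toy model: partitions are nested in `partition_order`, so a query can
    narrow the search only while each successive leading field is present.
    """
    prefix = []
    for field in partition_order:
        if field not in query_fields:
            break  # the chain of leading fields is broken here
        prefix.append(field)
    return prefix


# A query that filters only on Field_A
query_fields = {"Field_A"}

# Field_A first: the query can use the first partition level
print(usable_prefix(["Field_A", "Field_B"], query_fields))  # ['Field_A']

# Field_B first: the query cannot use any leading partition level
print(usable_prefix(["Field_B", "Field_A"], query_fields))  # []
```

In this model, the same query benefits from partitioning only when the field it filters on comes first, which is why the order deserves as much thought as the choice of fields.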

Do this...

Choose the most frequently queried fields. You can choose up to two partition fields for a custom query-based archive or up to three fields for a date-based archive. Ensure that the fields you query most frequently in the archive are chosen. Note that we are talking about how you are going to query the archive, not the custom query criteria provided at the time of archiving!

Check the order of partitioned fields. While selecting the partitions is important, it is equally critical to choose the correct order of partitions. The most frequently queried field should be the first chosen partition field, followed by the second and third. That's simple enough.
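One way to apply both tips is to look at a sample of the filters you expect to run against the archive and count how often each field appears. The sketch below does that in plain Python. The function name `rank_partition_fields` and the sample filters are made up for illustration, and it only counts top-level field names (operators such as `$and` are skipped rather than unpacked).

```python
from collections import Counter


def rank_partition_fields(query_filters, max_fields):
    """Rank fields by how often they appear in a sample of archive query filters.

    Returns at most `max_fields` names, most frequently queried first.
    """
    counts = Counter()
    for query_filter in query_filters:
        # Count top-level field names only; skip operators like "$and"
        counts.update(key for key in query_filter if not key.startswith("$"))
    return [field for field, _ in counts.most_common(max_fields)]


# Hypothetical sample of filters used against the archive
sample_filters = [
    {"Field_A": "x", "exampleDate": "2022-01-01"},
    {"Field_A": "y"},
    {"Field_A": "x", "Field_B": 3},
    {"Field_B": 7, "exampleDate": "2022-02-01"},
    {"Field_A": "z", "Field_B": 1},
]

# Date-based archive: up to three partition fields
print(rank_partition_fields(sample_filters, 3))  # ['Field_A', 'Field_B', 'exampleDate']

# Custom query-based archive: up to two partition fields
print(rank_partition_fields(sample_filters, 2))  # ['Field_A', 'Field_B']
```

The resulting list doubles as a suggested partition order: the first entry is the field to place first.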

Not this

Don't add irrelevant fields as partitions. If you are not querying a specific field from the archive, then that field should not be added as a partition field. Remember that you can add a maximum of two or three partition fields, depending on the archive type, so it is important to choose these fields carefully based on how you query your archive.

Don't ignore the “Move down” option. The “Move down” option applies to an archive with a date-based rule. For example, if you want to query on Field_A the most, then Field_B, and then exampleDate, be sure to select the “Move down” option next to the “Archive date field” at the top so that the date field ends up last in the partition order.

Don't choose high-cardinality partitions. Choosing a high-cardinality field such as _id will create a large number of partitions in object storage, and aggregate-style queries against the archive will then suffer increased latency. The same applies when multiple partition fields are selected and their combined values are high cardinality. For example, if you select Field_A, Field_B, and Field_C as your partition fields and the combination of these fields is unique (or nearly unique) for each document, the result is high-cardinality partitions.

Please note that this high-cardinality caveat is not applicable to new Online Archives.
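If you want a rough feel for cardinality before committing to partition fields, you can count the distinct value combinations in a sample of documents. The sketch below is plain Python over a list of dictionaries. The helper names and sample documents are hypothetical, and it assumes the sampled field values are hashable (strings, numbers, dates, and so on).

```python
def partition_cardinality(documents, fields):
    """Count distinct value combinations of `fields` across sampled documents."""
    return len({tuple(doc.get(field) for field in fields) for doc in documents})


def cardinality_ratio(documents, fields):
    """Distinct combinations per document; 1.0 means every document is unique."""
    if not documents:
        return 0.0
    return partition_cardinality(documents, fields) / len(documents)


# Hypothetical sample of documents from the collection to be archived
sample_docs = [
    {"_id": 1, "Field_A": "a", "Field_B": 1},
    {"_id": 2, "Field_A": "a", "Field_B": 2},
    {"_id": 3, "Field_A": "b", "Field_B": 1},
    {"_id": 4, "Field_A": "b", "Field_B": 1},
    {"_id": 5, "Field_A": "a", "Field_B": 2},
    {"_id": 6, "Field_A": "b", "Field_B": 3},
]

print(partition_cardinality(sample_docs, ["_id"]))                 # 6
print(partition_cardinality(sample_docs, ["Field_A"]))             # 2
print(partition_cardinality(sample_docs, ["Field_A", "Field_B"]))  # 4
print(cardinality_ratio(sample_docs, ["_id"]))                     # 1.0
```

A ratio close to 1.0, whether for a single field or for a combination, is a sign that the candidate partition fields would produce roughly one partition per document.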

Additional guidance

In addition to the partitioning guidelines, there are a couple of additional considerations that are relevant for the optimal configuration of your data archival strategy.

Add data expiration rules and scheduled windows. These fields are optional, but depending on your use case they can improve your archival speed and control how long your data remains in the archive.

Index required fields. Before archiving the data, ensure that the fields used by your archival rule are indexed for optimal performance. You can run an explain plan on the archival query to verify whether the archival rule will use an index.
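As a sketch of how that check could be automated, the Python below walks an explain plan and reports whether the winning plan uses an index scan rather than a collection scan. It assumes the usual explain layout (`queryPlanner.winningPlan` with nested `inputStage` or `inputStages`, and an optional `queryPlan` wrapper on newer server versions). The function names and the two sample plans are hand-written for illustration. With PyMongo, a dictionary of this general shape can be obtained from `collection.find(archival_filter).explain()`.

```python
def plan_stages(plan):
    """Yield the name of every stage in a winning-plan tree."""
    yield plan.get("stage")
    if "inputStage" in plan:
        yield from plan_stages(plan["inputStage"])
    for child in plan.get("inputStages", []):
        yield from plan_stages(child)


def uses_index(explain_output):
    """Return True if the winning plan scans an index and never the whole collection."""
    winning_plan = explain_output["queryPlanner"]["winningPlan"]
    # Assumption: some server versions nest the plan tree under "queryPlan"
    winning_plan = winning_plan.get("queryPlan", winning_plan)
    stages = set(plan_stages(winning_plan))
    return "IXSCAN" in stages and "COLLSCAN" not in stages


# Hand-written sample explain outputs (illustrative only)
indexed_plan = {
    "queryPlanner": {
        "winningPlan": {
            "stage": "FETCH",
            "inputStage": {"stage": "IXSCAN", "indexName": "exampleDate_1"},
        }
    }
}
unindexed_plan = {
    "queryPlanner": {"winningPlan": {"stage": "COLLSCAN"}}
}

print(uses_index(indexed_plan))    # True
print(uses_index(unindexed_plan))  # False
```

If the check returns False for your archival query, consider adding an index on the fields used by the archival rule before you begin archiving.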

Conclusion

It is important to follow these do's and don'ts before hitting “Begin Archiving” so that your partitions are correctly configured, thereby optimizing the performance of your online archives.

For more information on the configuration of Online Archive, please see the documentation for setting up an Online Archive and our blog post on how to create an Online Archive.

