Optimizing your Online Archive for Query Performance
This article was contributed by Prem Krishna, a Senior Product Manager for Analytics at MongoDB.
With Atlas Online Archive, you can tier cold or infrequently accessed data off your MongoDB cluster to MongoDB-managed cloud object storage (Amazon S3 or Microsoft Azure Blob Storage). This lowers the cost of storing old data, while active data that is accessed and queried more often remains in the primary database.
FYI: If you use Online Archive together with MongoDB's Atlas Data Federation, you can also see a unified view of production data and archived data side by side through a read-only federated database instance.
In this blog, we will discuss how to improve the performance of your online archive by choosing the correct partitioning fields.
Once you have started archiving data, you cannot edit any partition fields as the structure of how the data will be stored in the object storage becomes fixed after the archival job begins. Therefore, you'll want to think critically about your partitioning strategy beforehand.
Also, archival query performance is determined by how the data is structured in object storage, so it is important to not only choose the correct partitions but also choose the correct order of partitions.
Choose the most frequently queried fields. You can choose up to two partition fields for a custom query-based archive or up to three fields for a date-based online archive. Make sure you pick the fields that you query most often in the archive. Note that this is about how you are going to query the archive, not about the custom query criteria provided at the time of archiving!
Check the order of partitioned fields. While selecting the partitions is important, it is equally critical to choose the correct order of partitions. The most frequently queried field should be the first chosen partition field, followed by the second and third. That's simple enough.
Don't add irrelevant fields as partitions. If you are not querying a specific field from the archive, then that field should not be added as a partition field. Remember that you can add a maximum of two or three partition fields, depending on the archive type, so it is important to choose these fields carefully based on how you query your archive.
Don't ignore the “Move down” option. The “Move down” option applies to an archive with a date-based rule. For example, if you query on Field_A the most, then on Field_B, and then on exampleDate, make sure you select the “Move down” option next to the “Archive date field” at the top of the list so that it ends up below Field_A and Field_B.
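As a rough illustration of the ordering advice above, the sketch below ranks fields by how often they appear in a sample of archive query filters and assigns each a position. The helper name `rank_partition_fields`, the sample filters, and the `fieldName`/`order` dictionary shape are assumptions made for this example; check the Atlas documentation for the exact format your tooling expects.

```python
from collections import Counter


def rank_partition_fields(query_filters, max_fields):
    """Rank top-level filter fields by frequency, most queried first.

    Hypothetical helper: only top-level keys are counted, and operator
    keys such as "$and" are skipped rather than unpacked.
    """
    counts = Counter(
        field
        for query_filter in query_filters
        for field in query_filter
        if not field.startswith("$")
    )
    ranked = [name for name, _ in counts.most_common(max_fields)]
    # Position 0 is the first (most frequently queried) partition field.
    return [{"fieldName": name, "order": i} for i, name in enumerate(ranked)]


# Illustrative filters you expect to run against the archive (made up).
sample_filters = [
    {"Field_A": "x"},
    {"Field_A": "y", "Field_B": 1},
    {"Field_A": "z", "Field_B": 2, "exampleDate": "2023-01-01"},
]

partition_fields = rank_partition_fields(sample_filters, max_fields=3)
print(partition_fields)
```

With these sample filters, Field_A comes first, Field_B second, and exampleDate last, which matches the case where the archive date field needs to be moved down.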
Don't choose high cardinality partition(s). Choosing a high cardinality field such as _id will create a large number of partitions in the object storage, and querying the archive with any aggregate-based queries will then suffer increased latency. The same applies when multiple partition fields are selected whose combined values are high cardinality. For example, if you select Field_A, Field_B, and Field_C as your partitions and the combination of these fields produces mostly unique values, the result is high cardinality partitions.
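One way to sanity-check a candidate set of partition fields before archiving is to count how many distinct value combinations it produces on a sample of documents. The sketch below is only an illustration: the helper `partition_cardinality`, the field names `region` and `tier`, and the sample documents are hypothetical, and it assumes the field values are hashable (strings, numbers, dates).

```python
def partition_cardinality(docs, fields):
    """Count the distinct value combinations of `fields` across `docs`."""
    combos = {tuple(doc.get(field) for field in fields) for doc in docs}
    return len(combos)


# A small made-up sample standing in for documents from your collection.
docs = [
    {"_id": 1, "region": "EU", "tier": "gold"},
    {"_id": 2, "region": "EU", "tier": "gold"},
    {"_id": 3, "region": "US", "tier": "gold"},
    {"_id": 4, "region": "US", "tier": "silver"},
]

by_id = partition_cardinality(docs, ["_id"])
by_region_tier = partition_cardinality(docs, ["region", "tier"])

# A ratio close to 1.0 means nearly every document gets its own partition.
print("_id:", by_id / len(docs))
print("region + tier:", by_region_tier / len(docs))
```

A count that approaches the number of sampled documents, as with _id here, suggests the field or combination of fields is a poor partitioning choice.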
Beyond the partitioning guidelines, there are a couple of other considerations that are relevant for the optimal configuration of your data archival strategy.
Add data expiration rules and scheduled windows. These settings are optional, but where they fit your use case they can improve archival speed and control how long your data remains in the archive.
Index required fields. Before archiving the data, ensure that your data is indexed for optimal performance. You can run an explain plan on the archival query to verify whether the archival rule will use an index.
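To make the explain-plan check concrete, here is a minimal sketch that inspects an explain result and reports whether the winning plan relies on an index scan. It assumes the explain output is already available as a Python dictionary (for example, from a driver call such as PyMongo's `find(...).explain()`); the helper names and the two sample plans are hand-written for illustration, and the exact explain layout can vary by server version.

```python
def plan_stages(plan):
    """Yield every stage name in a plan, walking nested input stages."""
    yield plan.get("stage")
    if "inputStage" in plan:
        yield from plan_stages(plan["inputStage"])
    for child in plan.get("inputStages", []):
        yield from plan_stages(child)


def uses_index(explain_output):
    """Return True if the winning plan has an IXSCAN and no COLLSCAN."""
    winning = explain_output["queryPlanner"]["winningPlan"]
    # Assumption: some server versions nest the plan under "queryPlan".
    winning = winning.get("queryPlan", winning)
    stages = set(plan_stages(winning))
    return "IXSCAN" in stages and "COLLSCAN" not in stages


# Hand-written sample explain outputs (not captured from a real cluster).
indexed_plan = {
    "queryPlanner": {
        "winningPlan": {
            "stage": "FETCH",
            "inputStage": {"stage": "IXSCAN", "indexName": "exampleDate_1"},
        }
    }
}
unindexed_plan = {"queryPlanner": {"winningPlan": {"stage": "COLLSCAN"}}}

print(uses_index(indexed_plan))
print(uses_index(unindexed_plan))
```

If the check reports a collection scan for the filter your archival rule uses, consider adding an index on the fields in that rule before you begin archiving.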
It is important to follow these do’s and don’ts before hitting “Begin Archiving” so that your partitions are configured correctly, thereby optimizing the performance of your online archives.