The Cost of Not Knowing MongoDB, Part 3: appV6R0 to appV6R4
October 9, 2025
Welcome to the third and final part of the series "The Cost of Not Knowing MongoDB." Building upon the foundational optimizations explored in Part 1 and Part 2, this article delves into advanced MongoDB design patterns that can dramatically transform application performance.
In Part 1, we improved application performance by concatenating fields, changing data types, and shortening field names. In Part 2, we implemented the Bucket Pattern and Computed Pattern and optimized the aggregation pipeline to achieve even better performance.
In this final article, we address the issues and improvements identified in appV5R4. Specifically, we focus on reducing the document size in our application to alleviate the disk throughput bottleneck on the MongoDB server. This reduction will be accomplished by adopting a dynamic schema and modifying the storage compression algorithm.
All the application versions and revisions in this article would be written by a senior MongoDB developer: they build on all the previous versions and rely on the Dynamic Schema pattern, which is not commonly seen in the wild.
Application version 6 revision 0 (appV6R0): A dynamic monthly bucket document
As mentioned in the Issues and Improvements of appV5R4 from the previous article, the primary limitation of our MongoDB server is its disk throughput. To address this, we need to reduce the size of the documents being stored.
Consider the following document from appV5R3, which has provided the best performance so far:
const document = {
 _id: Buffer.from("...01202202"),
 items: [
 { date: new Date("2022-06-05"), a: 10, n: 3 },
 { date: new Date("2022-06-16"), p: 1, r: 1 },
 { date: new Date("2022-06-27"), a: 5, r: 1 },
 { date: new Date("2022-06-29"), p: 1 },
 ],
};

The items array in this document contains only four elements, but on average, it will have around 10 elements, and in the worst-case scenario, it could have up to 90 elements. These elements are the primary contributors to the document size, so they should be the focus of our optimization efforts.
One commonality among the elements is the date field: in the document above, every element's date repeats the same year and month information. By rethinking how this field and its value are stored, we can reduce storage requirements.
An unconventional solution we could use is:
- Changing the items field type from an array to a document.
- Using the date value as the field name in the items document.
- Storing the status totals as the value for each date field.
Here is the previous document represented using the new schema idea:
const document = {
 _id: Buffer.from("...01202202"),
 items: {
 "20220605": { a: 10, n: 3 },
 "20220616": { p: 1, r: 1 },
 "20220627": { a: 5, r: 1 },
 "20220629": { p: 1 },
 },
};

While this schema may not significantly reduce the document size compared to appV5R3, we can further optimize it by leveraging the fact that the year is already embedded in the _id field. This eliminates the need to repeat the year in the field names of the items document.
With this approach, the items document adopts a Dynamic Schema, where field names encode information and are not predefined.
To demonstrate various implementation possibilities, we will revisit all the bucketing criteria used in the appV5RX implementations, starting with appV5R0.
For appV6R0, which builds upon appV5R0 but uses a dynamic schema, data is bucketed by year and month. The field names in the items document represent only the day of the date, as the year and month are already stored in the _id field.
A detailed explanation of the bucketing logic and functions used to implement the current application can be found in the appV5R0 introduction.
The following document stores data for January 2022 (2022-01-XX), applying the newly presented idea:
const document = {
 _id: Buffer.from("...01202201"),
 items: {
 "05": { a: 10, n: 3 },
 16: { p: 1, r: 1 },
 27: { a: 5, r: 1 },
 29: { p: 1 },
 },
};

Schema
The application implementation presented above would have the following TypeScript document schema, named SchemaV6R0:
export type SchemaV6R0 = {
 _id: Buffer;
 items: Record<
 string,
 {
 a?: number;
 n?: number;
 p?: number;
 r?: number;
 }
 >;
};

Bulk upsert
Based on the specification presented, we have the following updateOne operation for each event generated by this application version:
const DD = getDD(event.date); // Extract the `day` from the `event.date`

const operation = {
 updateOne: {
 filter: { _id: buildId(event.key, event.date) }, // key + year + month
 update: {
 $inc: {
 [`items.${DD}.a`]: event.approved,
 [`items.${DD}.n`]: event.noFunds,
 [`items.${DD}.p`]: event.pending,
 [`items.${DD}.r`]: event.rejected,
 },
 },
 upsert: true,
 },
};

filter:
- Targets the document where the _id field matches the concatenated value of key, year, and month.
- The buildId function converts the key+year+month into a binary format.
update:
- Uses the $inc operator to increment the fields corresponding to the same DD as the event by the status values provided.
- If a field does not exist in the items document and the event provides a value for it, $inc treats the non-existent field as having a value of 0 and performs the operation.
- If a field exists in the items document but the event does not provide a value for it (i.e., undefined), the field is effectively incremented by 0 and left unchanged.
upsert:
- Ensures a new document is created if no matching document exists.
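The getDD and buildId helpers are referenced but not implemented here. Below is a minimal sketch of what they could look like, assuming key is a string and the _id encodes key+YYYY+MM as an ASCII buffer, matching the slicing used later in the Get Reports pipeline:
// Hypothetical sketches of the helpers used above; the exact _id encoding
// is an assumption based on the examples (key + YYYY + MM as an ASCII buffer).
const getDD = (date: Date): string => date.toISOString().slice(8, 10); // "2022-01-05" -> "05"

const buildId = (key: string, date: Date): Buffer => {
 const yyyymm = date.toISOString().slice(0, 7).replace("-", ""); // -> "202201"
 return Buffer.from(`${key}${yyyymm}`);
};

// The operations are then typically submitted in batches, e.g.:
// await collection.bulkWrite(operations, { ordered: false });
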
Get reports
To fulfill the Get Reports operation, five aggregation pipelines are required, one for each date interval. Each pipeline follows the same structure, differing only in the filtering criteria in the $match stage:
const pipeline = [
 { $match: docsFromKeyBetweenDate },
 { $addFields: buildTotalsField },
 { $group: groupSumTotals },
 { $project: { _id: 0 } },
];

The complete code for this aggregation pipeline is fairly involved, so only pseudocode is presented here.
1: { $match: docsFromKeyBetweenDate }
- Range-filters documents by _id to retrieve only buckets within the report date range. It has the same logic as appV5R0.
2: { $addFields: buildTotalsField }
- The logic is similar to the one used in the Get Reports of appV5R3.
- The $objectToArray operator is used to convert the items document into an array, enabling a $reduce operation.
- Filtering the items fields within the report's range involves extracting the year and month from the _id field and the day from the field names in the items document.
- The following JavaScript code is logically equivalent to the real aggregation pipeline code.
// Equivalent JavaScript logic:
const MM = _id.slice(-2).toString(); // Get month from _id
const YYYY = _id.slice(-6, -2).toString(); // Get year from _id
const items_array = Object.entries(items); // Convert the object to an array of [key, value]

const totals = items_array.reduce(
 (accumulator, [DD, status]) => {
 let statusDate = new Date(`${YYYY}-${MM}-${DD}`);

 if (statusDate >= reportStartDate && statusDate < reportEndDate) {
 accumulator.a += status.a || 0;
 accumulator.n += status.n || 0;
 accumulator.p += status.p || 0;
 accumulator.r += status.r || 0;
 }

 return accumulator;
 },
 { a: 0, n: 0, p: 0, r: 0 }
);

3: { $group: groupSumTotals }
- Groups the totals of each document in the pipeline into the final status totals using $sum operations.
4: { $project: { _id: 0 } }
- Formats the resulting document to match the reports format.
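To make the pseudocode above more concrete, here is a hedged sketch of how buildTotalsField and groupSumTotals could be expressed with $objectToArray and $reduce. It is a simplified illustration, not the article's exact pipeline: the date-range check against the day encoded in each field name is elided for brevity.
// A simplified sketch, not the article's exact code. The boundary filtering
// of `items` field names (day vs. report range) is omitted for brevity.
const buildTotalsField = {
 totals: {
  $reduce: {
   input: { $objectToArray: "$items" }, // [{ k: "05", v: { a: 10, n: 3 } }, ...]
   initialValue: { a: 0, n: 0, p: 0, r: 0 },
   in: {
    a: { $add: ["$$value.a", { $ifNull: ["$$this.v.a", 0] }] },
    n: { $add: ["$$value.n", { $ifNull: ["$$this.v.n", 0] }] },
    p: { $add: ["$$value.p", { $ifNull: ["$$this.v.p", 0] }] },
    r: { $add: ["$$value.r", { $ifNull: ["$$this.v.r", 0] }] },
   },
  },
 },
};

const groupSumTotals = {
 _id: null,
 a: { $sum: "$totals.a" },
 n: { $sum: "$totals.n" },
 p: { $sum: "$totals.p" },
 r: { $sum: "$totals.r" },
};
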
Indexes
No additional indexes are required, maintaining the single _id index approach established in the appV4 implementation.
Initial scenario statistics
Collection statistics
To evaluate the performance of appV6R0, we inserted 500 million event documents into the collection using the schema and Bulk Upsert function described earlier. For comparison, the tables below also include statistics from previous comparable application versions:
Collection | Documents | Data Size | Document Size | Storage Size | Indexes | Index Size |
---|---|---|---|---|---|---|
appV5R0 | 95,350,431 | 19.19GB | 217B | 5.06GB | 1 | 2.95GB |
appV5R3 | 33,429,492 | 11.96GB | 385B | 3.24GB | 1 | 1.11GB |
appV6R0 | 95,350,319 | 11.1GB | 125B | 3.33GB | 1 | 3.13GB |
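As a side note, statistics like the ones above can be read with the $collStats aggregation stage (the collection name below is illustrative):
// Illustrative: read storage statistics for a collection (name assumed).
const stats = await db
 .collection("appV6R0")
 .aggregate([{ $collStats: { storageStats: {} } }])
 .toArray();
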
Event statistics
To evaluate the storage efficiency per event, the Event Statistics are calculated by dividing the total data size and index size by the 500 million events.
Collection | Data Size/Events | Index Size/Events | Total Size/Events |
---|---|---|---|
appV5R0 | 41.2B | 6.3B | 47.5B |
appV5R3 | 25.7B | 2.4B | 28.1B |
appV6R0 | 23.8B | 6.7B | 30.5B |
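As a sanity check for appV6R0: 11.1GB of data across 500 million events works out to 11.1 × 1024³ / 500,000,000 ≈ 23.8 bytes per event, and the 3.13GB index to roughly 6.7 bytes per event.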
It is challenging to make a direct comparison between appV6R0 and appV5R0 from a storage perspective. The appV5R0 implementation is the simplest bucketing possible, where event documents were merely appended to the items array without bucketing by day, as is done in appV6R0.
However, we can attempt a comparison between appV6R0 and appV5R3, the best solution so far. In appV6R0, data is bucketed by month, whereas in appV5R3, it is bucketed by quarter. Assuming document size scales linearly with the bucketing criteria (though this is not entirely accurate), an equivalent quarter-bucketed appV6R0 document would be approximately 3 * 125 = 375 bytes, about 2.6% smaller than appV5R3's 385 bytes.
Another indicator of improvement is the Data Size/Events metric in the Event Statistics table. For appV6R0, each event uses an average of 23.8 bytes, compared to 25.7 bytes for appV5R3, representing a 7.4% reduction in size.
Load test results
Executing the load test for appV6R0 and plotting the results alongside those of appV5R0 and the desired rates, we get the following results for Get Reports and Bulk Upsert.
Get Reports rates
The two versions exhibit very similar rate performance, with appV6R0 showing slight superiority in the second and third quarters, while appV5R0 is superior in the first and fourth quarters.

Get Reports latency
The two versions exhibit very similar latency performance, with appV6R0 showing slight advantages in the second and third quarters, while appV5R0 is superior in the first and fourth quarters.

Bulk Upsert rates
Both versions have similar rate values, but it can be seen that appV6R0 has a small edge over appV5R0.

Bulk Upsert latency
Although both versions have similar latency values for the first quarter of the test, for the final three quarters, appV6R0 has a clear advantage over appV5R0.

Performance summary
Despite the significant reduction in document and storage size achieved by appV6R0, the performance improvement was not as substantial as expected. This suggests that the bottleneck in the application when bucketing data by month may not be related to disk throughput.
Examining the collection stats table reveals that the index size for both versions is close to 3GB. This is near the 4GB of available memory on the machine running the database and exceeds the 1.5GB allocated by WiredTiger for cache. Therefore, it is likely that the limiting factor in this case is memory/cache rather than document size, which explains the lack of a significant performance improvement.
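For reference, the configured WiredTiger cache size can be confirmed from serverStatus(); the snippet below is a mongosh example.
// mongosh: configured WiredTiger cache size, in GiB.
const cacheBytes = db.serverStatus().wiredTiger.cache["maximum bytes configured"];
print(cacheBytes / 1024 ** 3);
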
Issues and improvements
To address the limitations observed in appV6R0, we propose adopting the same line of improvements applied from appV5R0 to appV5R1. Specifically, we will bucket the events by quarter in appV6R1. This approach not only follows the established pattern of enhancements but also aligns with the need to optimize performance further.
As highlighted in the Load Test Results, the current bottleneck lies in the size of the index relative to the available cache/memory. By increasing the bucketing interval from month to quarter, we can reduce the number of documents by approximately a factor of three. This reduction will, in turn, decrease the number of index entries by the same factor, leading to a smaller index size.
Application version 6 revision 1 (appV6R1): A dynamic quarter bucket document
As discussed in the previous Issues and Improvements section, the primary bottleneck in appV6R0 was the index size nearing the memory capacity of the machine running MongoDB. To mitigate this issue, we propose increasing the bucketing interval from a month to a quarter for appV6R1, following the approach used in appV5R1.
This adjustment aims to reduce the number of documents and index entries by approximately a factor of three, thereby decreasing the overall index size. By adopting a quarter-based bucketing strategy, we align with the established pattern of enhancements applied in appV5R1 versions while addressing the specific memory/cache constraints identified in appV6R0.
The implementation of appV6R1 retains most of the code from appV6R0, with the following key differences:
- The _id field will now be composed of key+year+quarter.
- The field names in the items document will encode both month and day, as this information is necessary for filtering date ranges in the Get Reports operation.
The following example demonstrates how data for June 2022 (2022-06-XX), within the second quarter (Q2), is stored using the new schema:
const document = {
 _id: Buffer.from("...01202202"),
 items: {
 "0605": { a: 10, n: 3 },
 "0616": { p: 1, r: 1 },
 "0627": { a: 5, r: 1 },
 "0629": { p: 1 },
 },
};

Schema
The application implementation presented above would have the following TypeScript document schema, named SchemaV6R1:
export type SchemaV6R1 = {
 _id: Buffer;
 items: Record<
 string,
 {
 a?: number;
 n?: number;
 p?: number;
 r?: number;
 }
 >;
};

Bulk upsert
Based on the specification presented, we have the following updateOne operation for each event generated by this application version:
const MMDD = getMMDD(event.date); // Extract the month (MM) and day (DD) from the `event.date`

const operation = {
 updateOne: {
 filter: { _id: buildId(event.key, event.date) }, // key + year + quarter
 update: {
 $inc: {
 [`items.${MMDD}.a`]: event.approved,
 [`items.${MMDD}.n`]: event.noFunds,
 [`items.${MMDD}.p`]: event.pending,
 [`items.${MMDD}.r`]: event.rejected,
 },
 },
 upsert: true,
 },
};

This updateOne operation has similar logic to the one in appV6R0, with the only differences being the filter and update criteria.
filter:
- Targets the document where the _id field matches the concatenated value of key, year, and quarter.
- The buildId function converts the key+year+quarter into a binary format.
update:
- Uses the $inc operator to increment the fields corresponding to the same MMDD as the event by the status values provided.
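As with appV6R0, the helpers are referenced but not implemented here; a minimal sketch, assuming the quarter is encoded as two digits after the year in the _id:
// Hypothetical sketches of the appV6R1 helpers; the quarter encoding
// (key + YYYY + QQ as an ASCII buffer) is an assumption from the examples.
const getMMDD = (date: Date): string =>
 date.toISOString().slice(5, 10).replace("-", ""); // "2022-06-05" -> "0605"

const buildId = (key: string, date: Date): Buffer => {
 const quarter = Math.floor(date.getUTCMonth() / 3) + 1; // June -> 2
 return Buffer.from(`${key}${date.getUTCFullYear()}0${quarter}`);
};
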
Get reports
To fulfill the Get Reports operation, five aggregation pipelines are required, one for each date interval. Each pipeline follows the same structure, differing only in the filtering criteria in the $match stage:
const pipeline = [
 { $match: docsFromKeyBetweenDate },
 { $addFields: buildTotalsField },
 { $group: groupSumTotals },
 { $project: { _id: 0 } },
];

This aggregation operation has similar logic to the one in appV6R0, with the only difference being the implementation of the $addFields stage.
{ $addFields: buildTotalsField }:
- A similar implementation to the one in appV6R0.
- The difference lies in extracting the year (YYYY) from the _id field and the month and day (MMDD) from the field names.
- The following JavaScript code is logically equivalent to the real aggregation pipeline code.
const YYYY = _id.slice(-6, -2).toString(); // Get year from _id
const items_array = Object.entries(items); // Convert the object to an array of [key, value]

const totals = items_array.reduce(
 (accumulator, [MMDD, status]) => {
 let [MM, DD] = [MMDD.slice(0, 2), MMDD.slice(2, 4)];
 let statusDate = new Date(`${YYYY}-${MM}-${DD}`);

 if (statusDate >= reportStartDate && statusDate < reportEndDate) {
 accumulator.a += status.a || 0;
 accumulator.n += status.n || 0;
 accumulator.p += status.p || 0;
 accumulator.r += status.r || 0;
 }

 return accumulator;
 },
 { a: 0, n: 0, p: 0, r: 0 }
);

Indexes
No additional indexes are required, maintaining the single _id index approach established in the appV4 implementation.
Initial scenario statistics
Collection statistics
To evaluate the performance of appV6R1, we inserted 500 million event documents into the collection using the schema and Bulk Upsert function described earlier. For comparison, the tables below also include statistics from previous comparable application versions:
Collection | Documents | Data Size | Document Size | Storage Size | Indexes | Index Size |
---|---|---|---|---|---|---|
appV5R3 | 33,429,492 | 11.96GB | 385B | 3.24GB | 1 | 1.11GB |
appV6R0 | 95,350,319 | 11.1GB | 125B | 3.33GB | 1 | 3.13GB |
appV6R1 | 33,429,366 | 8.19GB | 264B | 2.34GB | 1 | 1.22GB |
Event statistics
To evaluate the storage efficiency per event, the Event Statistics are calculated by dividing the total data size and index size by the 500 million events.
Collection | Data Size/Events | Index Size/Events | Total Size/Events |
---|---|---|---|
appV5R3 | 25.7B | 2.4B | 28.1B |
appV6R0 | 23.8B | 6.7B | 30.5B |
appV6R1 | 17.6B | 2.6B | 20.2B |
In the previous Initial Scenario Statistics analysis, we assumed that document size would scale linearly with the bucketing range. This assumption proved inaccurate, in our favor: the average document size in appV6R1 is approximately twice as large as in appV6R0, even though it stores three times more data.
Since appV6R1 buckets data by quarter at the document level and by day within the items sub-document, a fair comparison is with appV5R3, the best-performing version so far. From the tables above, we observe a significant improvement in Document Size, and consequently Data Size, when transitioning from appV5R3 to appV6R1: specifically, a 31.4% reduction in Document Size. From an index size perspective, there was practically no change, as both versions bucket events by quarter.
Load test results
Executing the load test for appV6R1 and plotting the results alongside those of appV5R3 and the desired rates, we get the following results for Get Reports and Bulk Upsert.
Get Reports rates
For the first three quarters of the test, both versions have similar rate values, but, for the final quarter, appV6R1 has a notable edge over appV5R3.

Get Reports latency
The two versions exhibit very similar latency performance, with appV6R1 showing slight advantages in the second and third quarters, while appV5R3 is superior in the first and fourth quarters.

Bulk Upsert rates
Both versions have similar rate values, but it can be seen that appV6R1 has a small edge over appV5R3.

Bulk Upsert latency
Although both versions have similar latency values for the first quarter of the test, for the final three quarters, appV6R1 has a clear advantage over appV5R3.

Performance summary
The reduction in document count achieved by bucketing per quarter brought the index size down to 1.22GB, comfortably below the 1.5GB WiredTiger cache, removing the memory pressure identified in appV6R0. Combined with the smaller documents of the dynamic schema, this makes appV6R1 the best-performing version so far, as reflected in the load test results above.
Issues and improvements
appV6R1 also delivers the best storage efficiency so far, at 20.2B per event. The next revision, appV6R2, keeps the quarter bucketing and continues iterating on the dynamic schema; its initial scenario statistics are presented below alongside those of appV5R3 and appV6R1.
Initial scenario statistics
Collection statistics
To evaluate the performance of appV6R2, we inserted 500 million event documents into the collection using its schema and Bulk Upsert function. For comparison, the tables below also include statistics from previous comparable application versions:
Collection | Documents | Data Size | Document Size | Storage Size | Indexes | Index Size |
---|---|---|---|---|---|---|
appV5R3 | 33,429,492 | 11.96GB | 385B | 3.24GB | 1 | 1.11GB |
appV6R1 | 33,429,366 | 8.19GB | 264B | 2.34GB | 1 | 1.22GB |
appV6R2 | 33,429,207 | 9.11GB | 293B | 2.8GB | 1 | 1.26GB |
Event statistics
To evaluate the storage efficiency per event, the Event Statistics are calculated by dividing the total data size and index size by the 500 million events.
Collection | Data Size/Events | Index Size/Events | Total Size/Events |
---|---|---|---|
appV5R3 | 25.7B | 2.4B | 28.1B |
appV6R1 | 17.6B | 2.6B | 20.2B |
appV6R2 | 19.6B | 2.7B | 22.3B |