Schema Performance Evaluation in MongoDB Using PerformanceBench
Rate this tutorial
MongoDB is often incorrectly described as being schemaless. While it is true that MongoDB offers a level of flexibility when working with schema designs that traditional relational databases systems cannot match, as with any database system, the choice of schema design employed by an application built on top of MongoDB will still ultimately determine whether the application is able to meet its performance objectives and SLAs.
Fortunately, a number of design patterns (and corresponding anti-patterns) exist to help guide application developers design appropriate schemas for their MongoDB applications. A significant part of our role as developer advocates within the global strategic account team at MongoDB involves educating developers new to MongoDB on the use of these design patterns and how they differ from those they may have previously used working with relational database systems. My colleague, Daniel Coupal, contributed to a fantastic set of blog posts on the most common patterns and anti-patterns we see working with MongoDB.
Whilst schema design patterns provide a great starting point for guiding our design process, for many applications, there may come a point where it becomes unclear which one of a set of alternative designs will best support the application’s anticipated workloads. In these situations, a quote by Rear Admiral Grace Hopper that my manager, Rick Houlihan, made me aware of rings true:“One accurate measurement is worth a thousand expert opinions.”
In this article, we will explore using PerformanceBench, a Java framework application used by my team when evaluating candidate data models for a customer workload.
PerformanceBench is a simple Java framework designed to allow developers to assess the relative performance of different database design patterns within MongoDB.
PerformanceBench defines its functionality in terms of models (the design patterns being assessed) and measures (the operations to be measured against each model). As an example, a developer may wish to assess the relative performance of a design based on having data spread across multiple collections and accessed using $lookup (join) aggregations, versus one based on a hierarchical model where related documents are embedded within each other. In this scenario, the models might be respectively referred to as multi-collection and hierarchical, with the "measures" for each being CRUD operations: Create, Read, Update, and Delete.
The framework allows Java classes to be developed that implement a defined interface known as “SchemaTest,” with one class for each model to be tested. Each SchemaTest class implements the functionality to execute the measures defined for that model and returns, as output, an array of documents with the results of the execution of each measure — typically timing data for the measure execution, plus any metadata needed to later identify the parameters used for the specific execution. PerformanceBench stores these returned documents in a MongoDB collection for later analysis and evaluation.
PerformanceBench is configured via a JSON format configuration file which contains an array of documents — one for each model being tested. Each model document in the configuration file contains a set of standard fields that are common across all models being tested, plus a set of custom fields specific to that model. Developers implementing SchemaTest model classes are free to include whatever custom parameters their testing of a specific model requires.
When executed, PerformanceBench uses the data in the configuration file to identify the implementing class for each model to be tested and its associated measures. It then instructs the implementing classes to execute a specified number of iterations of each measure, optionally using multiple threads to simulate multi-user/multi-client environments.
Full details of the SchemaTest interface and the format of the PerformanceBench JSON configuration file are provided in the GitHub readme file for the project.
The PerformanceBench source in Github was developed using IntelliJ IDEA 2022.2.3 with OpenJDK Runtime Environment Temurin-17.0.3+7 (build 17.0.3+7).
The compiled application has been run on Amazon Linux using OpenJDK 17.0.5 (2022-10-18 LTS - Corretto).
Other than the requirement to implement the SchemaTest interface, PerformanceBench gives model class developers wide latitude in designing their classes in whatever way is needed to meet the requirements of their test cases. However, there are some common considerations to take into account.
The SchemaTest interface defines the following four methods:
The initialize method is intended to allow implementing classes to carry out any necessary steps prior to measures being executed. This could, for example, include establishing and verifying connection to the database, building or preparing a test data set, and/or removing the results of prior execution runs. PerformanceBench calls initialize immediately after instantiating an instance of the class, but before any measures are executed.
The name method should return a string name for the implementing class. Class implementers can set the returned value to anything that makes sense for their use case. Currently, PerformanceBench only uses this method to add context to logging messages.
The warmup method is called by PerformanceBench prior to any iterations of any measure being executed. It is designed to allow model class implementers to attempt to create an environment that accurately reflects the expected state of the database in real-life. This could, for example, include carrying out queries designed to seed the MongoDB cache with an appropriate working set of data.
The executeMeasure method allows PerformanceBench to instruct a model-implementing class to execute a defined number of iterations of a specified measure. Typically, the method implementation will contain a case statement redirecting execution to the code for each defined measure. However, there is no requirement to implement in that way. The return from this method should be an array of BSON Document objects containing the results of each test iteration. Implementers are free to include whatever fields are necessary in these documents to support the metrics their use case requires.
The cleanup method is called by PerformanceBench after all iterations of all measures have been executed by the implementing class and is designed primarily to allow test data to be deleted or reset ahead of future test executions. However, the method can also be used to execute any other post test-run functionality necessary for a given use case. This may, for example, include calculating average/mean/percentile execution times for a test run, or for cleanly disconnecting from a database.
When assessing a given model, it is important to measure the model’s performance against varying data sets. For example, the following can all impact the performance of different search and data manipulation operations:
- Overall database and collection sizes
- Individual document sizes
- Available CPU and memory on the MongoDB servers being used
- Total number of documents within individual collections.
Executing a sequence of measures using different test data sets can help to identify if there is a threshold beyond which one model may perform better than another. It may also help to identify the amount of memory needed to store the working set of data necessary for the workload being tested to avoid excessive paging. Model-implementing classes should ensure that they add sufficient metadata to the results documents they generate to allow the conditions of the test to be identified during later analysis.
As with most databases, query performance in MongoDB is dependent on appropriate indexes existing on collections being queried. Model class implementers should ensure any such indexes needed by their test cases either exist or are created during the call to their classes’ initialize method. Index size compared with available cache memory should be considered, and often, finding the point at which performance is negatively impacted by paging of indexes is a major objective of PerformanceBench testing.
With any testing regime, one goal should be to limit the number of variables potentially impacting performance discrepancies between test runs so differences in measured performance can be attributed with confidence to the intentional differences in test conditions. Items that come under this heading include network latency between the server running PerformanceBench and the MongoDB cluster servers. When working with MongoDB Atlas in a cloud environment, for example, specifying dedicated rather than shared servers can help avoid background load on the servers impacting performance, whilst deploying all servers in the same availability zone/region can reduce potential impacts from varying network latency.
PerformanceBench allows measures to be executed concurrently in multiple threads to simulate a multi-user environment. However, if making use of this facility, put some thought into how to accurately model real user behavior. It is rare, for example, for users to execute a complex ad-hoc aggregation pipeline and immediately execute another on its completion. Your model class may therefore want to insert a delay between execution of measure iterations to attempt to model a realistic length of time you may expect between query requests from an individual user in a realistic production environment.
The PerformaceBench GitHub repository includes example model class implementations for a hypothetical application designed to report on success and failure rates of calls to a set of APIs monitored by observability software.
Data for the application is stored in two document types in two different collections.
The APIDetails collection contains one document for each monitored API with metadata about that API:
The second collection, APIMetrics, is designed to represent the output from monitoring software with one document generated for each API at 15-minute intervals, giving the total number of calls to the API, the number that were successful, and the number that failed:
The documents include a deployment region value for each API (one of “Tokyo,” “Hong Kong,” “India,” or “UK”). The sample model classes in the repository are designed to compare the performance of options for running aggregation pipelines that calculate the total number of calls, the overall success rate, and the corresponding failure rate for all the APIs in a given region, for a given time period.
Four approaches are evaluated:
- Carrying out an aggregation pipeline against the APIDetails collection that includes a $lookup stage to perform a join with and summarization of relevant data in the APIMetrics collection.
- Carrying out an initial query against the APIDetails collection to produce a list of the API ids for a given region and use that list as input to an $in clause as part of a $match stage in a separate aggregation pipeline against the APIMetrics collection to summarize the relevant monitoring data.
- A third approach that uses an equality clause on the region information in each document as part of the initial $match stage of a pipeline against the APIMetrics collection to summarize the relevant monitoring data. This approach is designed to test whether an equality match against a single value performs better than one using an $in clause with a large number of possible values, as used in the second approach. Two measures are implemented in this model: one that queries the two collections sequentially using the standard MongoDB Java driver, and one that queries the two collections in parallel using the MongoDB Java Reactive Streams driver.
- A fourth approach that adds a third collection called APIPreCalc that stores documents with pre-calculated total calls, total failed calls, and total successful calls for each API for each complete day, month, and year in the data set, with the aim of reducing the number of documents and size of calculations the aggregation pipeline has to execute. This model is an example implementation of the Computed schema design pattern and also uses the MongoDB Java Reactive Streams driver to query the collections in parallel.
For the fourth approach, the pre-computed documents in the APIPreCalc collection look like the following:
Note the type field in the documents used to differentiate between totals for a year, month, or day of month.
For the purposes of showing how PerformanceBench organizes models and measures, in the PerformanceBench GitHub repository, the first and second approaches are implemented as two separate SchemaTest model classes, each with a single measure, while the third and fourth approaches are implemented in a third SchemaTest model class with two measures — one for each approach.
The first model, implementing the $lookup approach, is implemented in package com.mongodb.devrel.pods.performancebench.models.apimonitor_lookup in a class named APIMonitorLookupTest.
The aggregation pipeline implemented by this approach is:
The pipeline is executed against the APIDetails collection and is run once for each of the four geographical regions. The $lookup stage of the pipeline contains its own sub-pipeline which is executed against the APIMetrics collection once for each API belonging to each region.
This results in documents looking like the following being produced:
One document will be produced for each API in each region. The model implementation records the total time taken (in milliseconds) to generate all the documents for a given region and returns this in a results document to PerformanceBench. The results documents look like:
As can be seen, as well as the region, start time, end time, and duration of the execution run, the result documents also include:
- The model name and measure executed (in this case, ‘USEPIPELINE’).
- The number of APIs (apiCount) found for this region, and number of APIs for which metrics were able to be generated (metricsCount). These numbers should always match and are included as a sanity check that data was generated correctly by the measure.
- The number of threads and iterations used for the execution of the measure. PerformanceBench allows measures to be executed a defined number of times (iterations) to allow a good average to be determined. Executions can also be run in one or more concurrent threads to simulate multi-user/multi-client environments. In the above example, three threads each concurrently executed 1,000 iterations of the measure (3,000 total iterations).
- The MongoDB Atlas cluster tier on which the measures were executed. This is simply used for tracking purposes when analyzing the results and could be set to any value by the class developer. In the sample class implementations, the value is set to match a corresponding value in the PerformanceBench configuration file. Importantly, it remains the user’s responsibility to ensure the cluster tier being used matches what is written to the results documents.
- baseDate indicates the date period for which monitoring data was summarized. For a given baseDate, the summarized period is always baseDate to the current date (inclusive). An earlier baseDate will therefore result in more data being summarized.
With a single measure defined for the model, and with three threads each carrying out 1,000 iterations of the measure, an array of 3,000 results documents will be returned by the model class to PerformanceBench. PerformanceBench then writes these documents to a collection for later analysis.
To support the aggregation pipeline, the model implementation creates the following indexes in its initialize method implementation:
APIDetails: {"deployments.region": 1}
APIMetrics: {"appname": 1, "creationDate": 1}
The model temporarily drops any existing indexes on the collection to avoid contention for memory cache space. The above indexes are subsequently dropped in the model’s cleanup method implementation, and all original indexes restored.
The second model carries out an initial query against the APIDetails collection to produce a list of the API ids for a given region and then uses that list as input to an $in clause as part of a $match stage in an aggregation pipeline against the APIMetrics collection. It is implemented in package com.mongodb.devrel.pods.performancebench.models.apimonitor_multiquery in a class named APIMonitorMultiQueryTest.
The initial query, carried out against the APIDetails collection, looks like:
This query is carried out for each of the four regions in turn and, from the returned documents, a list of the APIs belonging to each region is generated. The generated list is then used as the input to a $in clause in the $match stage of the following aggregation pipeline run against the APIMetrics collection:
This pipeline is essentially the same as the sub-pipeline in the $lookup stage of the aggregation used by the APIMonitorLookupTest class, the main difference being that this pipeline returns the summary documents for all APIs in a region using a single execution, whereas the sub-pipeline is executed once per API as part of the $lookup stage in the APIMonitorLookupTest class. Note that the pipeline shown above has only three API values listed in its $in clause. In reality, the list generated during testing was between two and three hundred items long for each region.
When the documents are returned from the pipeline, they are merged with the corresponding API details documents retrieved from the initial query to create a set of documents equivalent to those created by the pipeline in the APIMonitorLookupTest class. From there, the model implementation creates the same summary documents to be returned to and saved by PerformanceBench.
To support the pipeline, the model implementation creates the following indexes in its initialize method implementation:
APIDetails: {"deployments.region": 1}
APIMetrics: {"appname": 1, "creationDate": 1}
As with the APIMonitorLookupTest class, this model temporarily drops any existing indexes on the collections to avoid contention for memory cache space. The above indexes are subsequently dropped in the model’s cleanup method implementation, and all original indexes restored.
The third model class, com.mongodb.devrel.pods.performancebench.models.apimonitor_regionquery.APIMonitorRegionTest, implements two measures, both similar to the measure in APIMonitorMultiQueryTest, but where the $in clause in the $match stage is replaced with a equivalency check on the ”region” field. The purpose of these measures is to assess whether an equivalency check against the region field provides any performance benefit versus an $in clause where the list of matching values could be several hundred items long. The difference between the two measures in this model, named “QUERYSYNC” and “QUERYASYNC” respectively, is that the first performs the initial find query against the APIDetails collection, and then the aggregation pipeline against the APIMetrics collection in sequence, whilst the second model uses the Reactive Streams MongoDB Driver to carry out the two operations in parallel to assess whether that provides any performance benefit.
With these changes, the match stage of the aggregation pipeline for this model looks like:
In all other regards, the pipeline and the subsequent processes for creating summary documents to pass back to PerformanceBench are the same as those used in APIMonitorMultiQueryTest.
The fourth model class, com.mongodb.devrel.pods.performancebench.models.apimonitor_precompute.APIMonitorPrecomputeTest, implements a single measure named “PRECOMPUTE”. This measure makes use of a third collection named APIPreCalc that contains precalculated summary data for each API for each complete day, month, and year in the data set. The intention with this measure is to assess what, if any, performance gain can be obtained by reducing the number of documents and resulting calculations the aggregation pipeline is required to carry out.
The measure calculates complete days, months, and years between the baseDate specified in the configuration file, and the current date. The total number of calls, failed calls and successful calls for each API for each complete day, month, or year is then retrieved from APIPreCalc. A $unionWith stage in the pipeline is then used to combine these values with the metrics for the partial days at either end of the period (the basedate and current date) retrieved from APIMetrics.
The pipeline used for this measure looks like:
The $group and $project stages are identical to the prior models and are not shown above.
To support the queries and carried out by the pipeline, the model creates the following indexes in its initialize method implementation:
APIDetails: {"deployments.region": 1}
APIMetrics: {"region": 1, "year": 1, "dayOfYear": 1, "creationDate": 1}
APIPreCalc: {"region": 1, "dateTag": 1}
The execution of PerformanceBench is controlled by a configuration file in JSON format. The name and path to this file is passed as a command line argument using the -c flag. In the PerformanceBench GitHub repository, the file is called config.json:
The document contains a single top-level field called “models,” the value of which is an array of sub-documents, each of which describes a model and its corresponding measures to be executed. PerformanceBench attempts to execute the models and measures in the order they appear in the file.
For each model, the configuration file defines the Java class implementing the model and its measures, the number of concurrent threads there should be executing each measure, the number of iterations of each measure each thread should execute, an array listing the names of the measures to be executed, and the connection URI, database name, and collection name where PerformanceBench should write results documents.
Additionally, there is a “custom” sub-document for each model where model class implementers can add any parameters specific to their model implementations. In the case of the APIMonitor class implementations, this includes the connection URI, database name and collection names where the test data resides, an array of acronyms for the geographic regions, the base date from which monitoring data should be summarized (summaries are based on values for baseDate to the current date, inclusive), and the Atlas cluster tier on which the tests were run (this is included in the results documents to allow comparison of performance of different tiers). The custom parameters also include a flag indicating if the test data set should be rebuilt before any of the measures for a model are executed and, if so, how many APIs data should be built for. The data rebuild code included in the sample model implementations builds data for the given number of APIs with the data for each API starting from a random date within the last 90 days.
By having PerformanceBench save the results of each test to a MongoDB collection, we are able to carry out analysis of the results in a variety of ways. The MongoDB aggregation framework includes over 20 different available stages and over 150 available expressions allowing enormous flexibility in performing analysis, and if you are using MongoDB Atlas, you have access to Atlas Charts, allowing you to quickly and easily visually display and analyze the data in a variety of chart formats.
For analyzing larger data sets, the MongoDB driver for Python or Connector for Apache Spark could be considered.
The output from one simulated test run generated the following results:
Note that the AWS EC2 server used to run PerformanceBench was located within the same AWS availability zone as the MongoDB Atlas cluster in order to minimize variations in measurements due to variable network latency.
The above conditions resulted in a total of 20,000 results documents being written by PerformanceBench to MongoDB (five measures, executed 500 times for each of four geographical regions, by two threads). Atlas Charts was used to display the results:
A further aggregation pipeline was then run on the results to find, for each measure, run by each model:
- The shortest iteration execution time
- The longest iteration execution time
- The mean iteration execution time
- The 95 percentile execution time
- The number of iterations completed per second.
The pipeline used was:
This produced the following results:
As can be seen, the pipelines using the $lookup stage and the equality searches on the region values in APIMetrics performed significantly slower than the other approaches. In the case of the $lookup based pipeline, this was most likely because of the overhead of marshaling one call to the sub-pipeline within the lookup for every API (1,000 total calls to the sub-pipeline for each iteration), rather than one call per geographic region (four calls total for each iteration) in the other approaches. With two threads each performing 500 iterations of each measure, this would mean marshaling 1,000,000 calls to the sub-pipeline with the $lookup approach as opposed to 4,000 calls for the other measures.
If verification of the results indicated they were accurate, this would be a good indicator that an approach that avoided using a $lookup aggregation stage would provide better query performance for this particular use case. In the case of the pipelines with the equality clause on the region field (QUERYSYNC and QUERYASYNC), their performance was likely impacted by having to sort a large number of documents by APIID in the $group stage of their pipeline. In contrast, the pipeline using the $in clause (USEINQUERY) utilized an index on the APPID field, meaning documents were returned to the pipeline already sorted by APPID — this likely gave it enough of an advantage during the $group stage of the pipeline for it to consistently complete the stage faster. Further investigation and refinement of the indexes used by the QUERYSYNC and QUERYASYNC measures could reduce their performance deficit.
It’s also noticeable that the precompute model was between 25 and 40 times faster than the other approaches. By using the precomputed values for each API, the number of documents the pipeline needed to aggregate was reduced from as much as 96,000, to, at most, 1,000 for each full day being measured, and from as much as 2,976,000 to, at most, 1,000 for each complete month being measured. This has a significant impact on throughput and underlies the value of the computed schema design pattern.
PerformanceBench provides a quick way to organize, create, execute, and record the results of tests to measure how different schema designs perform when executing different workloads. However, it is important to remember that the accuracy of the results will depend on how well the implemented model classes simulate the real life access patterns and workloads they are intended to model.
Ensuring the models accurately represent the workloads and schemas being measured is the job of the implementing developers, and PerformanceBench can only provide the framework for executing those models. It cannot improve or provide any guarantee that the results it records are an accurate prediction of an application’s real world performance.
Finally, it is important to understand that PerformanceBench, while free to download and use, is not in any way endorsed or supported by MongoDB.
The repository for PerformanceBench can be found on Github. The project was created in IntelliJ IDEA using Gradle.