We conducted a performance test on MongoDB for one of our scenarios. We set up a test collection containing 100,000 documents with fields such as _id, userId, offerId, and createdAt. For each unique offerId we inserted 20,000 documents, and we created an index on the offerId field.
MongoDB instance config: AWS instance with 2 CPU cores and 16 GB RAM.
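For reference, each test document and the index looked roughly like this (the collection name and values are placeholders, not our real data):

```js
// Illustrative document shape and index for the test collection
db.offers.insertOne({
  userId: "user-000123",
  offerId: "QRSTWVTZ",
  createdAt: new Date()
});
db.offers.createIndex({ offerId: 1 });
```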
Our goal was to execute an aggregation pipeline with the following stages (a rough sketch follows the list):
Filtering by offerId ("QRSTWVTZ").
Grouping the results, calculating counts based on specific conditions involving userId.
Further filtering to include only records meeting specific count criteria.
Replacing the root document with new data.
Merging the results.
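Roughly, the pipeline looked like the sketch below; the grouping key, count condition, threshold, and collection names are simplified placeholders rather than our exact logic:

```js
db.offers.aggregate([
  { $match: { offerId: "QRSTWVTZ" } },
  // count documents per user for this offer
  { $group: { _id: "$userId", claimCount: { $sum: 1 } } },
  // keep only users meeting the count criterion
  { $match: { claimCount: { $gte: 3 } } },
  // reshape the output document
  { $replaceRoot: { newRoot: { userId: "$_id", claimCount: "$claimCount" } } },
  // write the results to another collection
  { $merge: { into: "eligibleUsers", whenMatched: "replace", whenNotMatched: "insert" } }
])
```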
During testing, we observed that when a particular offerId had 20,000 records and was subjected to a load test at approximately 90 requests per second (RPS), the average response time for the aggregation pipeline ranged from 200 to 300 milliseconds. This was despite having indexes on both offerId and userId, which were utilized in the aggregation pipeline.
However, when we tested an offerId with only 1,000 records out of the total 100,000, the response time improved significantly to 20-30 milliseconds. Even when performing a simple find query, like find({ offerId: "123" }), for an offerId with 20,000 documents, the response time was around 40-50 milliseconds. The response time decreased when the offerId had only 1-2 thousand documents, despite having an index on offerId.
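For completeness, index usage was checked with explain along these lines (the collection name is again a placeholder):

```js
// Confirm the find uses an IXSCAN on offerId and compare keys/docs examined
// against documents returned.
db.offers.find({ offerId: "QRSTWVTZ" }).explain("executionStats")

// The same check for the initial $match stage of the aggregation.
db.offers.explain("executionStats").aggregate([
  { $match: { offerId: "QRSTWVTZ" } }
])
```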
Questions:
Should we consider a SQL database (Postgres/MySQL) for such scenarios, as it may be more performant?
Are you sure? Did you perform the same load testing? With the same server specs? If you did, and you really found that the performance is better, then you should consider it. If you did not load test as you did with the MongoDB implementation, then you should test to really determine which is more performant. And if performance is the only criterion, then implement the most performant option.
Thanks @steevej for looking into this.
Yes, we did a similar load test on MySQL/MariaDB and found better results.
Apart from performance, could you please point out a few other factors we should consider when choosing a DB, especially MongoDB vs. MySQL/Postgres? Choosing the right DB will be very important for the future of our business-critical application.
Perhaps your aggregation and/or your schema was sub-optimal. If you could share them, we could see whether they can be improved; of course, sample input and result documents are required. Schema-wise, for example, using a string offerId might not be the best choice, since strings are compared character by character; a number or an ObjectId would be more efficient. Aggregation-wise, $group (a blocking stage) might not be the best either; sometimes a self $lookup with a pipeline speeds things up and reduces the memory needed, as in the sketch below. The explain plan is also important to have.
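For illustration only (collection name, fields, and threshold are made up), one possible shape is to let $group accumulate only the distinct keys and have a self $lookup do the per-key counting through the index; whether this actually beats a plain $group has to be measured on your data:

```js
db.offers.aggregate([
  { $match: { offerId: "QRSTWVTZ" } },
  // small blocking stage: only the distinct userId values are accumulated
  { $group: { _id: "$userId" } },
  // self $lookup: count this user's documents for the offer via the index
  { $lookup: {
      from: "offers",
      let: { uid: "$_id" },
      pipeline: [
        { $match: { $expr: { $and: [
            { $eq: ["$offerId", "QRSTWVTZ"] },
            { $eq: ["$userId", "$$uid"] }
        ] } } },
        { $count: "claimCount" }
      ],
      as: "stats"
  } },
  { $unwind: "$stats" },
  { $match: { "stats.claimCount": { $gte: 3 } } }
])
```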
In your first post you mentioned:
Would that be the most frequent use-case? You should find out the most frequent use-case and perform performance tests on that scenario.
One strong point in favour of MongoDB is the flexible schema, where you do not have to write migration scripts using ALTER TABLE. For example, you may simply modify your code and start using a new attribute; I often spent more time writing and testing migration scripts than making the code change itself.

Another strong point is the possibility to have documents with completely different schemas within the same collection. I often use this in a config collection.

As a Unix veteran, I find the aggregation pipeline feels natural, and it is easier to work with than stored procedures. After all, an aggregation pipeline is simply an array of JSON documents, so it is very easy to manipulate the pipeline like any other document (see the sketch below).
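For example (names made up), a stage can be appended to a pipeline with ordinary array operations before it is run:

```js
// A pipeline is just an array of documents, so it can be built dynamically.
const pipeline = [
  { $match: { offerId: "QRSTWVTZ" } },
  { $group: { _id: "$userId", claimCount: { $sum: 1 } } }
];

// Conditionally add a stage, exactly like pushing into any other array.
const onlyFrequentUsers = true;
if (onlyFrequentUsers) {
  pipeline.push({ $match: { claimCount: { $gte: 3 } } });
}

db.offers.aggregate(pipeline);
```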