We’ve had instances in which our MongoDB cluster appears to stall for an extended period of time, performing almost no operations despite having queries in its queue, and consequently freezing or delaying production jobs.
Our setup:
MongoDB v5.0.6 hosted on AWS EC2 (r6g.8xlarge), Primary-Secondary-Arbiter architecture.
Some key aspects:
- We have 6 or 7 jobs running once a day and 2 or 3 jobs running every 30 minutes. All perform simple inserts, updates, deletes or any combination of those
- Each job may act on different and multiple collections
- Volume is usually around 6M inserted, 2.5M updated or 4.5M deleted documents per job
- Expected runtime is 15 minutes at most per job. We’ve seen freezes of over 1 hour
- Inserts (and updates) are mostly run in batches of 100k documents each. Those batches are sent to MongoDB in parallel, i.e. an insert of 5M documents becomes 50 inserts of 100k documents.
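A minimal Python sketch of the batching pattern described in the last bullet, for illustration only. The helper names `chunked` and `parallel_insert`, the worker count, and the idea of passing pymongo's `collection.insert_many` as `insert_batch` are assumptions, not the actual job code.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunked(docs, size):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def parallel_insert(docs, insert_batch, batch_size=100_000, workers=8):
    """Split `docs` into batches and send them concurrently.

    `insert_batch` is any callable that takes a list of documents,
    for example pymongo's `collection.insert_many` (an assumption here).
    Returns the number of batches sent.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(insert_batch, batch)
                   for batch in chunked(docs, batch_size)]
        for future in futures:
            future.result()  # re-raise any error from a failed batch
    return len(futures)
```

With `batch_size=100_000`, 5M documents yield 50 batches, matching the numbers above. Note that this sketch queues every batch up front, so all documents are held in memory at once.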
Over the weekend we have a job which rebuilds most of our collections, and that one is moving ~100M documents.
By “rebuild” I mean creating a new collection, inserting data, renaming it and deleting the old collection.
The issue has never appeared during this weekend rebuild, even though it is much bigger than our daily jobs. In the daily jobs the issue does not happen consistently, but it is common enough to heavily disrupt production.
We ran MongoDB v4.0 in production for a few years and this never happened. We only started seeing it after upgrading to MongoDB v5, and no change at all was made to our job infrastructure.
Here are some more details.
When the issue comes up it looks like queries are just sitting in the MongoDB queue and are not being executed, then at some point something snaps and they all get executed very quickly.
However, we are not actively building indexes on populated collections, and we don’t see any ongoing index builds either. We check for them automatically every 2 seconds, using what’s shown in the MongoDB Manual pages "Index Builds on Populated Collections" and "db.currentOp()".
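A rough Python sketch of such a 2-second check, for illustration only. The two conditions in `is_index_build` are assumptions modeled on the filter described in the manual, and `fetch_ops_pymongo` is a hypothetical helper that assumes pymongo is installed and that the `currentOp` admin command returns its operations under `inprog`.

```python
import time


def is_index_build(op):
    """Return True if a currentOp entry looks like an index build."""
    command = op.get("command") or {}
    if op.get("op") == "command" and "createIndexes" in command:
        return True
    return str(op.get("msg", "")).startswith("Index Build")


def watch_index_builds(fetch_ops, interval_s=2.0, rounds=5, sleep=time.sleep):
    """Call `fetch_ops()` every `interval_s` seconds.

    Returns one list per round with the operations that look like index builds.
    """
    seen = []
    for i in range(rounds):
        seen.append([op for op in fetch_ops() if is_index_build(op)])
        if i < rounds - 1:
            sleep(interval_s)
    return seen


def fetch_ops_pymongo(uri):
    """Hypothetical fetcher built on pymongo (requires `pip install pymongo`)."""
    from pymongo import MongoClient  # imported lazily so the rest runs without it

    client = MongoClient(uri)
    return lambda: client.admin.command("currentOp").get("inprog", [])
```

Usage would be along the lines of `watch_index_builds(fetch_ops_pymongo(uri))`, where `uri` is the connection string of the primary.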
For jobs where we run inserts, we usually create an empty collection, create the indexes while it’s empty, insert the data, and then rename the collection. So perhaps something is happening under the hood.
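The create, index, insert, rename sequence could be sketched in Python roughly as follows. This is an illustration under assumptions, not the actual job code: `db` is assumed to be a pymongo `Database`, the `_rebuild` suffix is a made-up naming scheme, and using `dropTarget=True` on the rename to replace the old collection is one possible way to do the swap.

```python
def rebuild_collection(db, name, docs, index_keys, batch_size=100_000):
    """Build `name` from scratch in a temporary collection, then swap it in."""
    tmp_name = f"{name}_rebuild"      # hypothetical naming scheme
    db.drop_collection(tmp_name)      # clear leftovers from a failed run
    tmp = db.create_collection(tmp_name)

    # Create the indexes while the collection is still empty.
    for keys in index_keys:
        tmp.create_index(keys)

    # Insert the data in batches.
    for start in range(0, len(docs), batch_size):
        tmp.insert_many(docs[start:start + batch_size], ordered=False)

    # Rename over the live collection; dropTarget=True drops the old one.
    tmp.rename(name, dropTarget=True)
```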
Any idea what the issue could be?