MongoDB randomly locking up

Hi,

We’ve had instances in which our MongoDB cluster appears to stall for an extended period, performing almost no operations despite having queries in its queue, which in turn freezes or delays our production jobs.

Our setup:
MongoDB v5.0.6 hosted on AWS EC2 (r6g.8xlarge), Primary-Secondary-Arbiter architecture.

Some key aspects:

  • We have 6 or 7 jobs running once a day and 2 or 3 jobs running every 30 minutes. All perform simple inserts, updates, deletes or any combination of those
  • Each job may act on different and multiple collections
  • Volume is usually around 6M inserted, 2.5M updated or 4.5M deleted documents per job
  • Average runtime should be 15 minutes maximum. We’ve seen freezes of over 1 hour
  • Inserts (and updates) are mostly run as batches of 100k documents each. Those batches are sent to Mongo in parallel, i.e. an insert of 5M documents becomes 50 inserts of 100k documents.
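
For readers who want to reproduce the batching described in the last bullet, here is a minimal sketch in Python. It assumes `collection` is a pymongo `Collection` (only `insert_many` is used). The helper names `batched` and `parallel_insert` and the `max_workers` cap are illustrative assumptions, not part of our actual job code.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(docs, size):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def parallel_insert(collection, docs, batch_size=100_000, max_workers=4):
    """Insert `docs` in batches, with at most `max_workers` batches in flight.

    `collection` is assumed to expose pymongo's Collection.insert_many().
    The concurrency cap is an assumption made for this sketch.
    Returns the number of batches sent.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(collection.insert_many, batch, ordered=False)
            for batch in batched(docs, batch_size)
        ]
        for future in futures:
            future.result()  # re-raise any error from a failed batch
    return len(futures)
```

With `batch_size=100_000`, 5M documents produce 50 calls to `insert_many`, matching the numbers above.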

Over the weekend we have a job which rebuilds most of our collections, and that one is moving ~100M documents.
By “rebuild” I mean creating a new collection, inserting data, renaming it and deleting the old collection.
The issue has never appeared during this weekend rebuild, even though it is much bigger than our daily jobs. In the daily jobs it does not happen consistently, but it is common enough to heavily disrupt production.

Quick note:
We ran MongoDB v4.0 in production for a few years and this never happened. We only started seeing it after upgrading to MongoDB v5, and no change at all was made to our job infrastructure.

Here are some more details.
When the issue comes up, it looks like queries are just sitting in the MongoDB queue without being executed; then at some point something snaps and they all get executed very quickly.
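
One way to put numbers on "sitting in the queue" is to sample `serverStatus` during a freeze. The sketch below only extracts a few counters from a `serverStatus` document. The field paths (`globalLock.currentQueue` and `wiredTiger.concurrentTransactions`) are what I understand the 5.0 output to contain, so please verify them against your own server. The function name `queue_snapshot` is made up for this example.

```python
def queue_snapshot(server_status):
    """Extract queue lengths and WiredTiger ticket counts from serverStatus.

    Field paths are assumed to match MongoDB 5.0 output; missing fields
    come back as 0 (queues) or None (tickets) instead of raising.
    """
    queue = server_status.get("globalLock", {}).get("currentQueue", {})
    tickets = server_status.get("wiredTiger", {}).get("concurrentTransactions", {})
    return {
        "queued_readers": queue.get("readers", 0),
        "queued_writers": queue.get("writers", 0),
        "read_tickets_available": tickets.get("read", {}).get("available"),
        "write_tickets_available": tickets.get("write", {}).get("available"),
    }


# Usage with pymongo (not executed here):
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017")  # hypothetical URI
#   print(queue_snapshot(client.admin.command("serverStatus")))
```

If queued writers climb while available write tickets sit at or near zero, that would suggest operations are waiting on the storage engine rather than on the jobs themselves.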

However, we are not actively building indexes on populated collections, and we don’t see any ongoing index builds either. We check for them automatically every 2 seconds using the approach shown in Index Builds on Populated Collections — MongoDB Manual and db.currentOp() — MongoDB Manual.
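
For context, the check is roughly equivalent to the sketch below, which mirrors the filter from the manual pages mentioned above (an op of type `command` carrying `createIndexes`, or an op of type `none` whose message starts with "Index Build"). The function names are illustrative, and this is a simplified version rather than our exact monitoring script.

```python
def is_index_build(op):
    """Return True if a currentOp entry looks like an index build."""
    command = op.get("command") or {}
    if op.get("op") == "command" and "createIndexes" in command:
        return True
    message = str(op.get("msg", ""))
    return op.get("op") == "none" and message.startswith("Index Build")


def active_index_builds(inprog):
    """Filter the `inprog` list of a currentOp result down to index builds."""
    return [op for op in inprog if is_index_build(op)]


# Usage with pymongo (not executed here):
#   inprog = client.admin.command("currentOp")["inprog"]
#   print(active_index_builds(inprog))
```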

For jobs where we run inserts, we usually create an empty collection, create the indexes while it’s empty, insert the data, and then rename the collection. So perhaps something is happening under the hood.
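
A simplified sketch of that pattern with pymongo-style calls is below. The `_rebuild` suffix, the function name, and the use of `dropTarget=True` (which renames and drops the old collection in one step) are assumptions for illustration; the real jobs may do the rename and drop separately.

```python
def rebuild_collection(db, name, docs, index_specs):
    """Build `name` from scratch in a temporary collection, then swap it in.

    `db` is assumed to be a pymongo Database; `docs` is a list of documents
    and `index_specs` a list of key specs such as [("field", 1)].
    """
    tmp_name = f"{name}_rebuild"  # hypothetical naming convention
    db.drop_collection(tmp_name)  # clear leftovers from a failed run
    tmp = db[tmp_name]

    # Create the indexes while the collection is still empty.
    for keys in index_specs:
        tmp.create_index(keys)

    # insert_many rejects an empty list, so skip it when there is no data.
    if docs:
        tmp.insert_many(docs, ordered=False)

    # Replace the live collection with the freshly built one.
    tmp.rename(name, dropTarget=True)
```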

Any idea what the issue could be?

Thanks!

Hi @Marco_Bellini welcome to the community!

This is a very detailed report of a peculiar issue. Unfortunately, that also means determining the root cause is likely to require a personalized, deep troubleshooting session.

However, I may be able to provide some pointers:

  • Are you using a burstable EBS volume? I’ve seen cases where MongoDB seems to freeze up for no reason, then all of a sudden starts processing workloads again. It was caused by the disk running out of burst credits.
  • Does your deployment follow the recommended settings in the production notes?
  • There are some suggestions for AWS specifically in Maximizing MongoDB Performance on AWS; it might be worth checking.
  • Have you tried using the latest in the 5.0 series (currently 5.0.9)?
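
To check the first point, something like the following sketch could help. It reads the CloudWatch `BurstBalance` metric (namespace `AWS/EBS`), which is reported for burstable volume types such as gp2. The `cloudwatch` argument is assumed to be a boto3 CloudWatch client, and the function name and the 24-hour window are just illustrative choices.

```python
from datetime import datetime, timedelta, timezone


def min_burst_balance(cloudwatch, volume_id, hours=24):
    """Return the lowest BurstBalance (percent) seen in the last `hours`.

    `cloudwatch` is assumed to be boto3.client("cloudwatch").
    Returns None if the volume reports no datapoints (for example,
    because it is not a burstable volume type).
    """
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/EBS",
        MetricName="BurstBalance",
        Dimensions=[{"Name": "VolumeId", "Value": volume_id}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,  # 5-minute buckets
        Statistics=["Minimum"],
    )
    points = response.get("Datapoints", [])
    return min((p["Minimum"] for p in points), default=None)


# Usage (not executed here; the volume ID is a placeholder):
#   import boto3
#   print(min_burst_balance(boto3.client("cloudwatch"), "vol-..."))
```

A minimum at or near 0 around the time of a freeze would point toward exhausted burst credits.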

Best regards
Kevin