MongoDb becomes slow after sometime

I am working on a task in which I am running a cron job, work of cron job is to complete a task and send notifications(emails and sms).
Time interval for cron job is in every 8 minutes.

CronJob is running for 25000 merchants, and there was around 4-5 tasks for each merchant and around 60-70 db queries are running for each merchant, in which around 15-20 db queries are insertion and updation and around 40-50 queries are get queries.

I have run the cron job using goroutines, in which I have implemented worker pool, and I have set the worker to 200, which means 200 merchants are completing their tasks parallely.

I have setup a server for database, for all merchants exists and there was seperate db for each merchant. And I have used mongoDb as Database, and mongoDb is running on it’s default settings.

My servers system specifications are,

Database server:-
Database size:- 570GB
OS:- Ubuntu 22.04

Cron are running on different server and specifications of that server is:-
RAM:- 16GB
OS:-Ubuntu 22.04

My problem is that, whenever I am starting the cron service, for first few merchants the db is working fine. all the Db queries including insert, update, delete, Get are running fast, but after a period of time, db becomes slow, all the queries run very slow.

The db becomes slow for every operations including cronJob or other operations. I have noticed that mongoDb goes into the locking condition for certain period of time.
And this locking time is increasing rapidly, i.e. Whenever it was stopped for first time it was again started in 1-2 seconds, but after some time the time is increased.
After 2-3 hours, It goes to a state in which db got locked for more than 5 minutes and run queries for only 1 minute after that again goes to the locking state.

I have noticed a log which was logged frequently whenever db is stopped

{“t”:{“$date”:“2023-03-31T06:38:04.021+00:00”},“s”:“W”, “c”:“COMMAND”, “id”:20525, “ctx”:“conn60701”,“msg”:“Failed to gather storage statistics for slow operation”,“attr”:{“opId”:2317177,“error”:“lock acquire timeout”}}

I have noticed the locking condition by examining the logs, whenever the db is started after the lock, I am seeing these type of slow query logs in which handleLock and schemaLock is high.

{“t”:{“$date”:“2023-03-31T06:40:34.908+00:00”},“s”:“I”, “c”:“COMMAND”, “id”:51803, “ctx”:“conn59118”,“msg”:“Slow query”,“attr”:{“type”:“command”,“ns”:“ausloc678_bk_db.providers”,“command”:{“find”:“providers”,“filter”:{“uid”:7},“limit”:1,“projection”:{“_id”:1,“show_payment_method_and_price”:1,“show_payment_method_and_price_for”:1,“is_team_member”:1,“who_see_payment_method_and_price”:1,“team_lead_id”:1,“hide_provider_payments”:1,“hidden_provider_payments”:1,“show_booking_price”:1,“show_booking_price_for”:1,“who_see_booking_price”:1},“singleBatch”:true,“lsid”:{“id”:{“$uuid”:“c6c4c42b-216c-48c4-92bf-8ca3b1db93f7”}},“$db”:“ausloc678_bk_db”},“planSummary”:“COLLSCAN”,“keysExamined”:0,“docsExamined”:52,“cursorExhausted”:true,“numYields”:1,“nreturned”:0,“queryHash”:“B89C5911”,“planCacheKey”:“B89C5911”,“reslen”:114,“locks”:{“FeatureCompatibilityVersion”:{“acquireCount”:{“r”:2}},“ReplicationStateTransition”:{“acquireCount”:{“w”:2}},“Global”:{“acquireCount”:{“r”:2}},“Database”:{“acquireCount”:{“r”:2}},“Collection”:{“acquireCount”:{“r”:2}},“Mutex”:{“acquireCount”:{“r”:1}}},“storage”:{“data”:{“bytesRead”:28496,“timeReadingMicros”:13},“timeWaitingMicros”:{“handleLock”:122143,“schemaLock”:15285487}},“protocol”:“op_msg”,“durationMillis”:15899}}

Can someone help me to find the solution to prevent these locking condition, I have optimized all the db queries, there was no lookup or joins are used in any query.

And I have some questions:-

  • What would be the reasons for this issue I am facing?
  • What changes should I do to resolve this problem?

You have


so your server haas at least 50000 open files. At least, because if you have indexes you even have more files. So you effectively implemented the Massive Number of Collections anti-pattern.

I am not too sure about that, since the logs you shared show a