MongoDB leaking memory while recovery is in progress

Hi,
I have a 3-node MongoDB replica set on a K8s cluster with the following profile:
Server version: 4.4.13
WiredTiger storage engine and journaling enabled
CPU: 4
Memory: 4 Gi
PV size: 30 GB
WiredTiger cache size: default

One of the K8s nodes went through an unclean shutdown, which caused MongoDB to start recovering once the node was back up. I now notice that the recovery appears to be leaking memory, and mongod is not coming back up stable even after bumping memory and CPU.

{"t":{"$date":"2023-12-05T17:07:31.235+00:00"},"s":"I",  "c":"STORAGE",  "id":4784929, "ctx":"conn43","msg":"Acquiring the global lock for shutdown"}
{"t":{"$date":"2023-12-05T17:07:31.235+00:00"},"s":"I",  "c":"STORAGE",  "id":4784930, "ctx":"conn43","msg":"Shutting down the storage engine"}
{"t":{"$date":"2023-12-05T17:07:31.235+00:00"},"s":"I",  "c":"STORAGE",  "id":22320,   "ctx":"conn43","msg":"Shutting down journal flusher thread"}
{"t":{"$date":"2023-12-05T17:07:31.235+00:00"},"s":"I",  "c":"STORAGE",  "id":22321,   "ctx":"conn43","msg":"Finished shutting down journal flusher thread"}
{"t":{"$date":"2023-12-05T17:07:31.235+00:00"},"s":"I",  "c":"STORAGE",  "id":20282,   "ctx":"conn43","msg":"Deregistering all the collections"}
{"t":{"$date":"2023-12-05T17:07:31.236+00:00"},"s":"I",  "c":"STORAGE",  "id":22372,   "ctx":"OplogVisibilityThread","msg":"Oplog visibility thread shutting down."}
{"t":{"$date":"2023-12-05T17:07:31.237+00:00"},"s":"I",  "c":"STORAGE",  "id":22261,   "ctx":"conn43","msg":"Timestamp monitor shutting down"}
{"t":{"$date":"2023-12-05T17:07:31.237+00:00"},"s":"I",  "c":"STORAGE",  "id":22317,   "ctx":"conn43","msg":"WiredTigerKVEngine shutting down"}
{"t":{"$date":"2023-12-05T17:07:31.238+00:00"},"s":"I",  "c":"STORAGE",  "id":22318,   "ctx":"conn43","msg":"Shutting down session sweeper thread"}
{"t":{"$date":"2023-12-05T17:07:31.238+00:00"},"s":"I",  "c":"STORAGE",  "id":22319,   "ctx":"conn43","msg":"Finished shutting down session sweeper thread"}
{"t":{"$date":"2023-12-05T17:07:31.238+00:00"},"s":"I",  "c":"STORAGE",  "id":22322,   "ctx":"conn43","msg":"Shutting down checkpoint thread"}
{"t":{"$date":"2023-12-05T17:07:31.239+00:00"},"s":"I",  "c":"STORAGE",  "id":22323,   "ctx":"conn43","msg":"Finished shutting down checkpoint thread"}
{"t":{"$date":"2023-12-05T17:07:31.242+00:00"},"s":"I",  "c":"STORAGE",  "id":4795902, "ctx":"conn43","msg":"Closing WiredTiger","attr":{"closeConfig":"leak_memory=true,"}}
{"t":{"$date":"2023-12-05T17:07:38.469+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn6","msg":"Connection ended","attr":{"remote":"10.78.38.174:37618","connectionId":6,"connectionCount":21}} 
{"t":{"$date":"2023-12-05T17:08:44.963+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"conn43","msg":"WiredTiger message","attr":{"message":"[1701796124:962982][13134:0x7faa49e68700], txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 73 seconds and has inspected 315 files. For more detailed logging, enable WT_VERB_RTS"}}
{"t":{"$date":"2023-12-05T17:08:44.969+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"conn43","msg":"WiredTiger message","attr":{"message":"[1701796124:969631][13134:0x7faa49e68700], txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 73 seconds and has inspected 316 files. For more detailed logging, enable WT_VERB_RTS"}}
{"t":{"$date":"2023-12-05T17:08:44.969+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"conn43","msg":"WiredTiger message","attr":{"message":"[1701796124:969756][13134:0x7faa49e68700], txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 73 seconds and has inspected 317 files. For more detailed logging, enable WT_VERB_RTS"}}
server should be down...  

Can someone please help with how to fix this cluster? Also, please recommend any WiredTiger configuration that could prevent this issue in the future.
Thanks,
Gayathri

Hi @Gayathri_Prasad
Welcome to the MongoDB community forums!

There could certainly be more than one reason for high memory usage in your deployment.
However, it is clear that after the unclean shutdown the restarted node has to run startup recovery, and the rollback-to-stable phase visible in your log can push resource utilisation well above normal. Note that the "Closing WiredTiger" line with leak_memory=true is expected during shutdown: mongod deliberately skips freeing memory because the process is about to exit, so that line by itself is not evidence of a leak.

On the Kubernetes side, the recommendation is to measure the range of memory your pod uses during normal operation and set the memory limit to accommodate it without being excessive.
You can use kubectl top pod to get a list of pods with their CPU and memory usage.
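As an illustrative sketch only (the numbers below are hypothetical starting points based on your 4 Gi profile, not a recommendation), the requests and limits would go in the mongod container spec of your StatefulSet:

```yaml
# Hypothetical resources stanza for the mongod container.
# Tune the numbers to what kubectl top pod reports during normal
# operation, plus headroom so recovery spikes are not OOM-killed.
resources:
  requests:
    cpu: "4"
    memory: 4Gi
  limits:
    cpu: "4"
    memory: 6Gi   # headroom above the steady-state working set
```

Keeping requests equal to the observed working set and limits somewhat above it reduces the chance the kubelet OOM-kills mongod during a temporary spike such as recovery.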

Can you also confirm under what circumstances the node restart was seen at the deployment end?
It is also possible that the application is causing resource contention. Are you seeing restarts or unclean shutdowns in the deployment frequently?

Before recommending anything, could you share some more information from the server logs? That would help clarify whether an OOM kill is involved.
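As a sketch of what to look for (assuming a standard cgroup OOM killer): an OOM-killed container shows reason OOMKilled under lastState.terminated in kubectl describe pod output, and the node's kernel log (dmesg or journalctl -k) contains a line like the sample below; the PID and numbers here are illustrative:

```shell
# Sample kernel-log line produced when the cgroup OOM killer terminates
# mongod (values are illustrative, not from your cluster).
sample='Memory cgroup out of memory: Killed process 13134 (mongod) total-vm:4729632kB'

# The same grep can be run against the real node kernel log.
echo "$sample" | grep -o 'out of memory: Killed process [0-9]* (mongod)'
```

If a line like that appears around the restart time, the memory limit (rather than mongod itself) is what ended the process.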

I would also recommend reviewing the pod logs during startup to see if you can find anything suspicious.
Finally, could you also try upgrading the MongoDB server to the latest patch release for updates and bug fixes, and let us know if the issue still persists?
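Regarding the WiredTiger configuration question: by default, mongod sizes the internal WiredTiger cache to the larger of 50% of (RAM minus 1 GB) or 256 MB, so with a 4 Gi limit that is roughly 1.5 GB, and mongod needs additional memory outside that cache for connections, aggregation, and so on. If you want to cap the cache explicitly, it can be set in mongod.conf; the 1.5 below is just an example value matching that default, not a tuned recommendation:

```yaml
# mongod.conf fragment (example value only; measure before changing).
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 1.5
```

Lowering cacheSizeGB leaves more headroom for non-cache memory within the container limit, at the cost of a smaller working-set cache.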

Regards
Aasawari