Hi,
I have a three-node MongoDB replica set on a Kubernetes cluster with the following profile:
Server version : 4.4.13
Storage engine : WiredTiger, with journaling enabled
CPU : 4
Memory : 4 Gi
PV size : 30 GB
WiredTiger cache size : Default
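For context, here is a minimal sketch of the corresponding mongod configuration. The dbPath and replica set name are placeholders, not our actual values, and cacheSizeGB is shown commented out because we run with the default:

```yaml
# Sketch of the mongod config for this profile (placeholder paths/names).
storage:
  dbPath: /data/db            # placeholder; the actual PV mount path may differ
  journal:
    enabled: true             # journaling enabled, as noted above
  wiredTiger:
    engineConfig:
      # cacheSizeGB is left unset, so WiredTiger uses its default of
      # 50% of (RAM - 1 GB), i.e. roughly 1.5 GB with a 4 Gi limit.
      # cacheSizeGB: 1.5
replication:
  replSetName: rs0            # placeholder replica set name
```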
One of the K8s nodes went through an unclean shutdown, which caused MongoDB to start recovery once the node was back up. I now notice that the recovery is consuming more and more memory (it looks like a memory leak), and mongod does not come back up stable even after bumping memory and CPU. Here is an excerpt from the logs:
{"t":{"$date":"2023-12-05T17:07:31.235+00:00"},"s":"I", "c":"STORAGE", "id":4784929, "ctx":"conn43","msg":"Acquiring the global lock for shutdown"}
{"t":{"$date":"2023-12-05T17:07:31.235+00:00"},"s":"I", "c":"STORAGE", "id":4784930, "ctx":"conn43","msg":"Shutting down the storage engine"}
{"t":{"$date":"2023-12-05T17:07:31.235+00:00"},"s":"I", "c":"STORAGE", "id":22320, "ctx":"conn43","msg":"Shutting down journal flusher thread"}
{"t":{"$date":"2023-12-05T17:07:31.235+00:00"},"s":"I", "c":"STORAGE", "id":22321, "ctx":"conn43","msg":"Finished shutting down journal flusher thread"}
{"t":{"$date":"2023-12-05T17:07:31.235+00:00"},"s":"I", "c":"STORAGE", "id":20282, "ctx":"conn43","msg":"Deregistering all the collections"}
{"t":{"$date":"2023-12-05T17:07:31.236+00:00"},"s":"I", "c":"STORAGE", "id":22372, "ctx":"OplogVisibilityThread","msg":"Oplog visibility thread shutting down."}
{"t":{"$date":"2023-12-05T17:07:31.237+00:00"},"s":"I", "c":"STORAGE", "id":22261, "ctx":"conn43","msg":"Timestamp monitor shutting down"}
{"t":{"$date":"2023-12-05T17:07:31.237+00:00"},"s":"I", "c":"STORAGE", "id":22317, "ctx":"conn43","msg":"WiredTigerKVEngine shutting down"}
{"t":{"$date":"2023-12-05T17:07:31.238+00:00"},"s":"I", "c":"STORAGE", "id":22318, "ctx":"conn43","msg":"Shutting down session sweeper thread"}
{"t":{"$date":"2023-12-05T17:07:31.238+00:00"},"s":"I", "c":"STORAGE", "id":22319, "ctx":"conn43","msg":"Finished shutting down session sweeper thread"}
{"t":{"$date":"2023-12-05T17:07:31.238+00:00"},"s":"I", "c":"STORAGE", "id":22322, "ctx":"conn43","msg":"Shutting down checkpoint thread"}
{"t":{"$date":"2023-12-05T17:07:31.239+00:00"},"s":"I", "c":"STORAGE", "id":22323, "ctx":"conn43","msg":"Finished shutting down checkpoint thread"}
{"t":{"$date":"2023-12-05T17:07:31.242+00:00"},"s":"I", "c":"STORAGE", "id":4795902, "ctx":"conn43","msg":"Closing WiredTiger","attr":{"closeConfig":"leak_memory=true,"}}
{"t":{"$date":"2023-12-05T17:07:38.469+00:00"},"s":"I", "c":"NETWORK", "id":22944, "ctx":"conn6","msg":"Connection ended","attr":{"remote":"10.78.38.174:37618","connectionId":6,"connectionCount":21}}
{"t":{"$date":"2023-12-05T17:08:44.963+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"conn43","msg":"WiredTiger message","attr":{"message":"[1701796124:962982][13134:0x7faa49e68700], txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 73 seconds and has inspected 315 files. For more detailed logging, enable WT_VERB_RTS"}}
{"t":{"$date":"2023-12-05T17:08:44.969+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"conn43","msg":"WiredTiger message","attr":{"message":"[1701796124:969631][13134:0x7faa49e68700], txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 73 seconds and has inspected 316 files. For more detailed logging, enable WT_VERB_RTS"}}
{"t":{"$date":"2023-12-05T17:08:44.969+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"conn43","msg":"WiredTiger message","attr":{"message":"[1701796124:969756][13134:0x7faa49e68700], txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 73 seconds and has inspected 317 files. For more detailed logging, enable WT_VERB_RTS"}}
At this point, the server should be down...
Can someone please help with how to fix this cluster? Also, could you please recommend any WiredTiger configuration that could prevent this issue in the future?
Thanks,
Gayathri