Disclaimers: I know this is a REALLY old version, but it’s a legacy system that I need to make work for a few more months. Also, I know literally nothing about MongoDB.
I have a cluster of 3 nodes running on Windows as EC2 instances on AWS. Each VM has 16GB of memory, and the data partition is provisioned with 1,000 IOPS. The database is roughly one terabyte in size: the largest collection is on the order of 400GB, with several others in the 40-100GB range.
What I'm seeing in the logs is a lot of elections and heartbeat failures. Members are being flagged as down or slow to respond, and I'm also seeing a lot of dropped connections.
Windows shows only about 300MB of free memory, with somewhere around 5,000-8,000 page faults per second on average. The commit charge for the mongod.exe process varies, but goes as high as 55GB.
Additional Info: Two of the smaller collections have TTL indexes. At this point, the only write activity is the TTL deletions.
My analysis is that there's so much swapping going on that it's destabilizing the cluster, and that's what's manifesting as hosts appearing down, elections, and so on. Is that a reasonable read?
What would be a reasonable way to address this? I’m thinking bumping memory allocation to 64GB. Would that be sufficient?
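For context on why I picked 64GB: my (possibly wrong) understanding is that WiredTiger defaults its cache to the larger of 50% of (RAM - 1GB) or 256MB, so here's a rough back-of-envelope sketch under that assumption:

```python
# Rough sizing sketch, assuming WiredTiger's documented default of
# max(50% of (RAM - 1 GB), 256 MB) for its internal cache.
def default_wiredtiger_cache_gb(ram_gb: float) -> float:
    """Approximate default WiredTiger cache size (GB) for a given system RAM (GB)."""
    return max(0.5 * (ram_gb - 1), 0.25)

for ram in (16, 64):
    print(f"{ram} GB RAM -> ~{default_wiredtiger_cache_gb(ram):.1f} GB cache")
```

So at 16GB I'd only have ~7.5GB of cache against a ~1TB database, and 64GB would get me to ~31.5GB. I realize the cache won't fit the whole working set either way; I'm just trying to gauge whether the jump is enough to stop the thrashing.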