Problems with a 3-node replica set: one node goes down with an out-of-memory error

Hello everybody,

I have a server running Ubuntu Server 20 LTS with MongoDB 4.4.10 installed. The deployed architecture is a replica set of 3 nodes: one primary and two secondaries. After several mongoimport operations with big TSV files (several million rows each), one of the nodes goes down, usually a secondary (sometimes member id 1, sometimes member id 2), and very rarely the primary.
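For reference, importing a big TSV file into a replica set looks roughly like the sketch below; the replica set name, hosts, database, collection and file name are placeholders, not the real values of this deployment:

    mongoimport --host "rs0/host1:27017,host2:27017,host3:27017" \
        --db mydb --collection mycoll \
        --type tsv --headerline --file big_file.tsv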

I have the problem on a server with 32 GB of RAM, and also on a server with 64 GB of RAM.

I have checked several things:

  • Linux limits:
    I have set the limits to the maximum possible (a verification sketch follows this list):
    echo "* soft nproc 65536" | sudo tee -a /etc/security/limits.conf
    echo "* hard nproc 65536" | sudo tee -a /etc/security/limits.conf
    echo "* soft nofile 655360" | sudo tee -a /etc/security/limits.conf
    echo "* hard nofile 655360" | sudo tee -a /etc/security/limits.conf
    echo "root soft nproc 65536" | sudo tee -a /etc/security/limits.conf
    echo "root hard nproc 65536" | sudo tee -a /etc/security/limits.conf
    echo "root soft nofile 65536" | sudo tee -a /etc/security/limits.conf
    echo "root hard nofile 65536" | sudo tee -a /etc/security/limits.conf

  • Swappiness factor:
    The current swappiness value is 60, which is the default.

  • Ports:
    The 3 needed ports are open.
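To double-check that the limits and the swappiness value above are actually in effect for the running mongod processes, a quick verification sketch (plain Linux tooling, nothing MongoDB-specific; assumes mongod is already running):

    # Effective limits of every running mongod process
    for pid in $(pidof mongod); do
        grep -E 'open files|processes' /proc/$pid/limits
    done

    # Current swappiness value
    cat /proc/sys/vm/swappiness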

The error can be found by reading the kernel journal:
journalctl -k -p err
The error has this output:
kernel: Out of memory: Killed process 2923646 (mongod) total-vm:28964256kB, anon-rss:23559824kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:53796kB oom_score_adj:0
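To watch memory grow while the imports run, and to confirm it is mongod that gets killed, something along these lines can help (the 5-second interval and the time window are arbitrary choices):

    # Resident (RSS) and virtual (VSZ) memory of each mongod, refreshed every 5 seconds
    watch -n 5 'ps -C mongod -o pid,rss,vsz,comm'

    # Recent OOM-killer events from the kernel log
    journalctl -k --since "1 hour ago" | grep -i 'out of memory'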

Please help me solve this problem. Thank you in advance.

Best regards,
Jose

Hello @Jose_M_Aguero ,

Could you confirm whether each mongod process runs on its own separate machine or VM? Generally, running multiple mongod processes on a single machine is not a recommended setup, since they will compete for resources and that may lead to undesirable outcomes. From what you posted, I think the OS is running out of physical memory, so the kernel’s out-of-memory killer (OOM killer) kicks in and terminates processes to free up some RAM (in this case it’s mongod, but it could be any other process).

To prevent OOM kills, the general approach is to configure swap space on the server. If it is still happening after that, then I tend to think that the hardware is insufficient to serve the workload you’re putting on it, and upgrading the deployment might be an option to consider.
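A minimal sketch of adding swap on Ubuntu, assuming a 16 GB swap file at /swapfile (size and path are just examples; adjust to your server):

    # Create and enable a 16 GB swap file
    sudo fallocate -l 16G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile

    # Make it persistent across reboots
    echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab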

Regards,
Tarun

Hello Tarun,

Thank you for your reply.

All 3 processes are running directly on Ubuntu, not on separate VMs.

For mongod processes I have read that the swappiness factor should be set to 1. This low value means the kernel will try to avoid swapping as much as possible, whereas a higher value makes the kernel try to use swap space more aggressively. In effect, this means MongoDB will barely use swap memory.
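A sketch of applying that setting (the file name under /etc/sysctl.d/ is an arbitrary choice and is created by the command):

    # Apply immediately
    sudo sysctl -w vm.swappiness=1

    # Persist across reboots
    echo 'vm.swappiness = 1' | sudo tee /etc/sysctl.d/99-mongodb.conf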

Regarding hardware RAM, I can say the problem will occur on any server with this configuration. I have seen it with 32, 64 and 128 GB of RAM.

The solution I have found is to set the option --wiredTigerCacheSizeGB to a reasonable amount of RAM for each process. For a 64 GB machine, it would be "--wiredTigerCacheSizeGB 10".
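For example, one of the three members could be started roughly like the sketch below (the replica set name, port and paths are placeholders; only the cache size flag is the point). The same limit can also be set in the config file under storage.wiredTiger.engineConfig.cacheSizeGB:

    # Hypothetical start of one replica set member with a 10 GB WiredTiger cache
    mongod --replSet rs0 --port 27017 --dbpath /data/rs0-0 \
        --wiredTigerCacheSizeGB 10 \
        --fork --logpath /var/log/mongodb/rs0-0.log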

Regards,
Jose

I’d like to mention that it’s strongly recommended not to run more than one mongod on a single machine/VM. The quote below is from the Production Notes:

The default WiredTiger internal cache size value assumes that there is a single mongod instance per machine. If a single machine contains multiple MongoDB instances, then you should decrease the setting to accommodate the other mongod instances.

While running multiple mongod processes is fine during the development phase, it’s best not to do this in a production environment.

Note that although it’s possible to limit the WiredTiger cache size, it is not the only memory the mongod process needs. Things such as aggregation queries, connections, and other operations outside of WiredTiger require memory outside of the WiredTiger cache.
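To compare the overall memory footprint of a member against its WiredTiger cache usage, something like the following can be run from the legacy mongo shell that ships with 4.4 (the port is a placeholder):

    # Resident and virtual memory of the mongod process, in MB
    mongo --port 27017 --eval 'printjson(db.serverStatus().mem)'

    # Bytes currently held in the WiredTiger cache
    mongo --port 27017 --eval 'print(db.serverStatus().wiredTiger.cache["bytes currently in the cache"])'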

