Mongo 3.6 -> 4.4.24 - Docker Swarm Restarts / Memory Creep

Hi all -

Bit of a tricky one here - appreciate any advice you can give.

We have recently upgraded from 3.6 to 4.4.24 for our application, along with a Mongoose upgrade. We use Docker Swarm to deploy it across 3 nodes in a replicaset with a separate service per MongoDB role.

Since upgrading from 3.6, we are seeing regular MongoDB container restarts on the primary node, which is causing our application to crash. The container is restarting on other nodes, but less regularly.

There is no information in the Docker logs about why the container is crashing, which suggests an OOM kill.

We limit the Mongo container to 2GB RAM using Docker limits (we’re not doing much I/O and the DB is small) - watching docker stats output shows memory creeping up to the 2GB limit, then the process is killed by the OS and connections to the replicaset are reset.

Is there anything we can do to enforce the RAM limit within the Mongo application so it doesn’t hit 2GB? I’ve tried limiting the WiredTiger cache size, but it’s not very big - see below - and doesn’t help.
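
For what it's worth, the cap can be applied either at startup (--wiredTigerCacheSizeGB) or at runtime from the mongo shell - a minimal sketch below, where the 512M figure is just a placeholder rather than a recommendation:

// Adjust the WiredTiger cache ceiling at runtime (placeholder value)
db.adminCommand({
        setParameter: 1,
        wiredTigerEngineRuntimeConfig: "cache_size=512M"
})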

Noticed we have a high active connection count today - does each active connection take up RAM / could this be contributing?

I ran a script on one of our production systems and filtered active connections by IP (a sketch of the approach is below the output); the 169.254.4.x addresses are our swarm endpoints, 1 per host.

{
        "TOTAL_CONNECTION_COUNT" : 5264,
        "169.254.4.8" : 1895,
        "169.254.4.9" : 1939,
        "169.254.4.5" : 1397,
        "Internal" : 32,
        "127.0.0.1" : 1
}
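
For reference, a breakdown like this can be produced with the $currentOp aggregation stage (simplified sketch, not the exact script):

// Count connections per client IP via the collection-less $currentOp aggregation
var counts = { "TOTAL_CONNECTION_COUNT" : 0 };
db.getSiblingDB("admin").aggregate([
        { $currentOp: { allUsers: true, idleConnections: true } }
]).forEach(function (op) {
        counts["TOTAL_CONNECTION_COUNT"] += 1;
        // "client" is "ip:port" for remote connections; internal threads have no client field
        var ip = op.client ? op.client.split(":")[0] : "Internal";
        counts[ip] = (counts[ip] || 0) + 1;
});
printjson(counts);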

We only have 1 active application server (running on the 169.254.4.8 host) - any idea why we are seeing a high number of connections across all three? I believe the only connections from the other two are for replication - could there be something wrong with the RS configuration which makes it duplicate connections?
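
In case it's useful, the per-member summary we have been comparing on each node is just the standard serverStatus output:

// Quick summary of connection usage on the node you are connected to
db.serverStatus().connections
// -> { "current" : ..., "available" : ..., "totalCreated" : ..., "active" : ... }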

Replicaset config is here - we’re using DNS for each node as we’re running in Swarm mode:

{
        "_id" : "xxxxxxxxxxxxxx",
        "version" : 2,
        "term" : 25,
        "protocolVersion" : NumberLong(1),
        "writeConcernMajorityJournalDefault" : true,
        "members" : [
                {
                        "_id" : 0,
                        "host" : "mongo_primary:27017",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 2,
                        "tags" : {

                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                },
                {
                        "_id" : 1,
                        "host" : "mongo_secondary:27017",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 1,
                        "tags" : {

                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                },
                {
                        "_id" : 2,
                        "host" : "mongo_manager:27017",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 0,
                        "tags" : {

                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                }
        ],
        "settings" : {
                "chainingAllowed" : true,
                "heartbeatIntervalMillis" : 10000,
                "heartbeatTimeoutSecs" : 20,
                "electionTimeoutMillis" : 10000,
                "catchUpTimeoutMillis" : -1,
                "catchUpTakeoverDelayMillis" : 30000,
                "getLastErrorModes" : {

                },
                "getLastErrorDefaults" : {
                        "w" : 1,
                        "wtimeout" : 0
                },
                "replicaSetId" : ObjectId("6544205e6a308455ab730738")
        }
}

Some more info on the Mongo DB if useful:

Mongo cache

        "bytes allocated for updates" : 25474313,
        "bytes belonging to page images in the cache" : 1444422,
        "bytes belonging to the history store table in the cache" : 547,
        "bytes currently in the cache" : 27046123,
        "bytes dirty in the cache cumulative" : 2172249945,
        "bytes not belonging to page images in the cache" : 25601701,
        "bytes read into cache" : 2713104,
        "bytes written from cache" : 1221520852,

Mongo memory allocation:

db.serverStatus().tcmalloc.tcmalloc.formattedString
------------------------------------------------
MALLOC:     1939406736 ( 1849.6 MiB) Bytes in use by application
MALLOC: +     18894848 (   18.0 MiB) Bytes in page heap freelist
MALLOC: +     25106008 (   23.9 MiB) Bytes in central cache freelist
MALLOC: +      1009344 (    1.0 MiB) Bytes in transfer cache freelist
MALLOC: +    591508312 (  564.1 MiB) Bytes in thread cache freelists
MALLOC: +     84410368 (   80.5 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   2660335616 ( 2537.1 MiB) Actual memory used (physical + swap)
MALLOC: +      5971968 (    5.7 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   2666307584 ( 2542.8 MiB) Virtual address space used
MALLOC:
MALLOC:         132361              Spans in use
MALLOC:          22206              Thread heaps in use
MALLOC:           4096              Tcmalloc page size

Hi @Matt_Crum and welcome to the MongoDB Community Forums!

It would be helpful if you could also share the Mongoose and Docker versions you are using.

When the server restarts/crashes, do you also observe any error entries in the mongod logs?
Generally, these OOM issues occur when the kernel memory limit is set lower than the user memory limit, causing the container to run out of kernel memory. You can find more details in the official Docker documentation: Runtime options with Memory, CPUs, and GPUs | Docker Docs.

Given that there are relatively few IOPS and the database is small, yet the process still fills its allocated memory, this could be a classic case of database fragmentation, which can result from many inserts, updates, or even open connections.

Yes, each open connection takes up some of the allocated resources and can contribute to memory contention.

The mongod process opens connections for every MongoClient that the application creates. You can refer to the documentation on connection pooling for more details.
The recommendation would be to look back at the application code and work out why so many connections are being opened.
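
As an illustration only - the connection string, database name and placeholder values below are assumptions, and recent Mongoose/driver versions use maxPoolSize while older 5.x versions use poolSize, so please verify against your version - the pool can be capped where the application connects:

// Hypothetical sketch: capping the driver connection pool from Mongoose
const mongoose = require("mongoose");

mongoose.connect(
        "mongodb://mongo_primary:27017,mongo_secondary:27017,mongo_manager:27017/<yourDb>?replicaSet=<yourRsName>",
        {
                maxPoolSize: 20,   // upper bound on pooled connections per host
                minPoolSize: 0
        }
);

That said, thousands of connections from a single application host usually suggest new clients or connections being created per request rather than one shared connection being reused, so that would be worth ruling out first.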

Also, to understand the situation further, could you share the hardware specifications of the VMs or systems where you are running the mongod servers?

Regards
Aasawari