MongoDB Docker container not starting properly

Hi All,
Sorry if this is the wrong category.

I inherited this infrastructure from someone who left the company long ago, so I am not aware of the configuration or the steps that were used to bring up the MongoDB containers.

Basically, we are running MongoDB (v3.6.18) as three containers configured as a replica set. When the Docker stack is deployed, two of the containers come up and run fine, but the third takes a very long time to start. The two DBs that are up are about 1GB each; the third DB is about 637GB. Because the third DB is so large, its container tries to start for about 3.5 hours, then exits and gets recreated, and this repeats in a loop.
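
The restart loop is visible from the host with something like the following (the service name is my assumption based on the hostnames in the logs; substitute whatever docker service ls actually shows):

docker service ps mongodb_db03 --no-trunc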

The logs from the two DBs that are up show that they try to reach the other members and fail. The logs from DB01 and DB02 are similar.

**docker logs <DB01_container_ID>** 
<snip>
2022-04-28T00:55:23.546+0000 I CONTROL  [LogicalSessionCacheRefresh] Sessions collection is not set up; waiting until next sessions refresh interval: Could not find host matching read preference { mode: "primary" } for set graylog
2022-04-28T00:59:58.027+0000 I NETWORK  [LogicalSessionCacheRefresh] Starting new replica set monitor for graylog/mongodb_db01:27017,mongodb_db02:27018,mongodb_db03:27019
2022-04-28T01:00:03.030+0000 W NETWORK  [LogicalSessionCacheRefresh] Failed to connect to 10.0.11.2:27017 after 5000ms milliseconds, giving up.
2022-04-28T01:00:03.033+0000 W NETWORK  [ReplicaSetMonitor-TaskExecutor-0] Failed to connect to 10.0.11.5:27018 after 5000ms milliseconds, giving up.
2022-04-28T01:00:08.033+0000 W NETWORK  [LogicalSessionCacheRefresh] Failed to connect to 10.0.11.8:27019 after 5000ms milliseconds, giving up.
2022-04-28T01:00:08.033+0000 W NETWORK  [LogicalSessionCacheRefresh] Unable to reach primary for set graylog
2022-04-28T01:00:08.033+0000 I NETWORK  [LogicalSessionCacheRefresh] Cannot reach any nodes for set graylog. Please check network connectivity and the status of the set. This has happened for 1 checks in a row.
2022-04-28T01:00:13.539+0000 W NETWORK  [LogicalSessionCacheRefresh] Failed to connect to 10.0.11.8:27019 after 5000ms milliseconds, giving up.
2022-04-28T01:00:18.545+0000 W NETWORK  [LogicalSessionCacheRefresh] Failed to connect to 10.0.11.2:27017 after 5000ms milliseconds, giving up.
2022-04-28T01:00:23.551+0000 W NETWORK  [LogicalSessionCacheRefresh] Failed to connect to 10.0.11.5:27018 after 5000ms milliseconds, giving up.
2022-04-28T01:00:23.551+0000 W NETWORK  [LogicalSessionCacheRefresh] Unable to reach primary for set graylog
2022-04-28T01:00:23.551+0000 I NETWORK  [LogicalSessionCacheRefresh] Cannot reach any nodes for set graylog. Please check network connectivity and the status of the set. This has happened for 2 checks in a row.
2022-04-28T01:00:23.551+0000 I CONTROL  [LogicalSessionCacheRefresh] Sessions collection is not set up; waiting until next sessions refresh interval: Could not find host matching read preference { mode: "primary" } for set graylog
<snip>
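
In case it helps anyone reproduce this, the same connectivity can be checked by hand from inside one of the running containers (the service names are taken from the replica set monitor line above; the container ID is a placeholder):

docker exec -it <DB01_container_ID> bash -c 'getent hosts mongodb_db03; mongo --host mongodb_db03 --port 27019 --eval "db.runCommand({ ping: 1 })"'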

For DB03, these are the initial logs:

2022-04-28T00:08:14.735+0000 I CONTROL  [initandlisten] MongoDB starting : pid=1 port=27019 dbpath=/data/db 64-bit host=01d5b6a43712
2022-04-28T00:08:14.735+0000 I CONTROL  [initandlisten] db version v3.6.18
2022-04-28T00:08:14.735+0000 I CONTROL  [initandlisten] git version: 2005f25eed7ed88fa698d9b800fe536bb0410ba4
2022-04-28T00:08:14.735+0000 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.2g  1 Mar 2016
2022-04-28T00:08:14.735+0000 I CONTROL  [initandlisten] allocator: tcmalloc
2022-04-28T00:08:14.735+0000 I CONTROL  [initandlisten] modules: none
2022-04-28T00:08:14.735+0000 I CONTROL  [initandlisten] build environment:
2022-04-28T00:08:14.735+0000 I CONTROL  [initandlisten]     distmod: ubuntu1604
2022-04-28T00:08:14.735+0000 I CONTROL  [initandlisten]     distarch: x86_64
2022-04-28T00:08:14.735+0000 I CONTROL  [initandlisten]     target_arch: x86_64
2022-04-28T00:08:14.735+0000 I CONTROL  [initandlisten] options: { config: "/etc/mongod.conf", net: { bindIpAll: true, port: 27019, ssl: { CAFile: "/etc/certs/ca.pem", PEMKeyFile: "/etc/certs/certandkey.pem", allowConnectionsWithoutCertificates: true, allowInvalidHostnames: true, mode: "preferSSL" } }, replication: { oplogSizeMB: 400, replSetName: "graylog" } }
2022-04-28T00:08:14.737+0000 W -        [initandlisten] Detected unclean shutdown - /data/db/mongod.lock is not empty.
2022-04-28T00:08:14.741+0000 I -        [initandlisten] Detected data files in /data/db created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'.
2022-04-28T00:08:14.743+0000 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2022-04-28T00:08:14.743+0000 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=63873M,cache_overflow=(file_max=0M),session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),compatibility=(release="3.0",require_max="3.0"),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),statistics_log=(wait=0),verbose=(recovery_progress),
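
For what it's worth, the verbose=(recovery_progress) option in the wiredtiger_open line above means DB03 should log its recovery progress while it replays, so following its log shows how far the ~3.5 hours actually get:

docker logs -f <DB03_container_ID>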

When I log into either the DB01 or DB02 container and check the replica set status, it shows the following:

[root@dcvsl126 sjillalla]# docker exec -it 24664c0d5a58 mongo
MongoDB shell version v3.6.18
connecting to: mongodb://127.0.0.1:27017/?gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("964473e9-b60b-46b9-b1d1-de99829f62a4") }
MongoDB server version: 3.6.18
Welcome to the MongoDB shell.
For interactive help, type "help".
For more comprehensive documentation, see
        http://docs.mongodb.org/
Questions? Try the support group
        http://groups.google.com/group/mongodb-user
Server has startup warnings:
2022-04-27T20:09:57.767+0000 I CONTROL  [initandlisten]
2022-04-27T20:09:57.767+0000 I CONTROL  [initandlisten] ** WARNING: Access control is not enabled for the database.
2022-04-27T20:09:57.767+0000 I CONTROL  [initandlisten] **          Read and write access to data and configuration is unrestricted.
2022-04-27T20:09:57.767+0000 I CONTROL  [initandlisten]
2022-04-27T20:09:57.769+0000 I CONTROL  [initandlisten]
2022-04-27T20:09:57.769+0000 I CONTROL  [initandlisten] ** WARNING: You are running on a NUMA machine.
2022-04-27T20:09:57.769+0000 I CONTROL  [initandlisten] **          We suggest launching mongod like this to avoid performance problems:
2022-04-27T20:09:57.769+0000 I CONTROL  [initandlisten] **              numactl --interleave=all mongod [other options]
2022-04-27T20:09:57.769+0000 I CONTROL  [initandlisten]
2022-04-27T20:09:57.769+0000 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
2022-04-27T20:09:57.769+0000 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2022-04-27T20:09:57.769+0000 I CONTROL  [initandlisten]
2022-04-27T20:09:57.769+0000 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.
2022-04-27T20:09:57.769+0000 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2022-04-27T20:09:57.769+0000 I CONTROL  [initandlisten]
graylog:OTHER> rs.status()
{
        "state" : 10,
        "stateStr" : "REMOVED",
        "uptime" : 16290,
        "optime" : {
                "ts" : Timestamp(1650247277, 4),
                "t" : NumberLong(111)
        },
        "optimeDate" : ISODate("2022-04-18T02:01:17Z"),
        "lastHeartbeatMessage" : "",
        "syncingTo" : "",
        "syncSourceHost" : "",
        "syncSourceId" : -1,
        "infoMessage" : "",
        "ok" : 0,
        "errmsg" : "Our replica set config is invalid or we are not a member of it",
        "code" : 93,
        "codeName" : "InvalidReplicaSetConfig",
        "operationTime" : Timestamp(1650247277, 4),
        "$clusterTime" : {
                "clusterTime" : Timestamp(1650247277, 4),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        }
}
graylog:OTHER> exit
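
If it is useful, the replica set config that the node has stored on disk can still be dumped even in the REMOVED state, since it lives in the local database (container ID is a placeholder):

docker exec -it <DB01_container_ID> mongo --quiet --eval 'printjson(db.getSiblingDB("local").system.replset.findOne())'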

The docker-compose file is:

version: '3'
services:
  db01:
    image: docker-prod.tools.royalsunalliance.ca/mongo:3.6.18
    volumes:
    - /docker/services/mongodb/db01:/data/db
    - /docker/services/mongodb/db01-dump:/data/db/dump
    - /docker/services/mongodb/db01-config/mongod.conf:/etc/mongod.conf
    - /docker/services/elasticsearch-prod/certs/db01:/etc/certs
    ports:
    - "27017:27017"
    command: ["mongod", "--sslAllowConnectionsWithoutCertificates", "--sslMode", "preferSSL", "--sslPEMKeyFile", "/etc/certs/certandkey.pem", "--sslCAFile", "/etc/certs/ca.pem", "--config", "/etc/mongod.conf", "--sslAllowInvalidHostnames"]

  db02:
    image: docker-prod.tools.royalsunalliance.ca/mongo:3.6.18
    volumes:
    - /docker/services/mongodb/db02:/data/db
    - /docker/services/mongodb/db02-dump:/data/db/dump
    - /docker/services/mongodb/db02-config/mongod.conf:/etc/mongod.conf
    - /docker/services/elasticsearch-prod/certs/db02:/etc/certs
    ports:
    - "27018:27018"
    command: ["mongod", "--port", "27018", "--sslAllowConnectionsWithoutCertificates", "--sslMode", "preferSSL", "--sslPEMKeyFile", "/etc/certs/certandkey.pem", "--sslCAFile", "/etc/certs/ca.pem", "--config", "/etc/mongod.conf", "--sslAllowInvalidHostnames"]
    #command: ["mongod", "--config", "/etc/mongod.conf"]

  db03:
    image: docker-prod.tools.royalsunalliance.ca/mongo:3.6.18
    volumes:
    - /docker/services/mongodb/db03:/data/db
    - /docker/services/mongodb/db03-dump:/data/db/dump
    - /docker/services/mongodb/db03-config/mongod.conf:/etc/mongod.conf
    - /docker/services/elasticsearch-prod/certs/db03:/etc/certs
    ports:
    - "27019:27019"
    command: ["mongod", "--port", "27019", "--sslAllowConnectionsWithoutCertificates", "--sslMode", "preferSSL", "--sslPEMKeyFile", "/etc/certs/certandkey.pem", "--sslCAFile", "/etc/certs/ca.pem", "--config", "/etc/mongod.conf", "--sslAllowInvalidHostnames"]

There is also an rs.initiate file, which I am fairly sure is used for configuring the replica set, but I am not sure how it was applied (my guess is below the snippet).

rs.initiate({
  "_id": "graylog",
  "version": 1,
  "members" : [
   {"_id": 1, "host": "mongodb_db01:27017"},
   {"_id": 2, "host": "mongodb_db02:27018"},
   {"_id": 3, "host": "mongodb_db03:27019"}
  ]
 })
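
My guess (not confirmed by any handover notes) is that this file was applied once, by opening a mongo shell against a single member and pasting the rs.initiate({...}) block above at the prompt:

docker exec -it <DB01_container_ID> mongo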

Please let me know how I can recover the MongoDB stack so that the containers are all up and running properly.
Let me know if you need any more information.

Hi @Sarojini_Jillalla
Welcome to the community forum!!

Could you share some information regarding the issue?

  1. Was this setup in working condition before the issue started to appear? If yes, could you please help us understand what changes were made in the meantime that could have led to the issue?

  2. Could you share the contents of the mongod.conf file for each replica set member? (An example command for collecting them is sketched after this list.)

  3. Were any changes made to the docker-compose file between when it was working and now? If yes, could you please share the original file.
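
For example, the config files could be collected with something along these lines (the container IDs are placeholders; substitute the actual IDs from docker ps):

for c in <DB01_container_ID> <DB02_container_ID> <DB03_container_ID>; do docker exec $c cat /etc/mongod.conf; done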

Lastly, I would suggest rebuilding the deployment from scratch if the database does not contain important data, or if a backup of the deployment is already available.

Please share the above details so we can assist you better.

Thanks
Aasawari

Hi @Aasawari,

Please find the answers below.

  1. The setup was in working condition. There were no apparent changes made that would have put MongoDB into this corrupted state.
  2. The mongod.conf for all three DBs contains the following:

replication:
   oplogSizeMB: 400
   replSetName: graylog

  3. There have been no changes to the docker-compose.yml file.

Since DB03 contains a huge amount of data (about 640GB), I did not want to tinker with it.

Please let me know if you need any more information.

Thanks,
Sarojini Jillalla

Hi @Sarojini_Jillalla

Could you provide us with the below details, based on the information you have shared?

  1. As per the above information, DB01, DB02 and DB03 are members of the same replica set, so they should be of similar sizes. However, that is not the case: DB03 is much larger than the other two. Can you help me understand the method you used to calculate the sizes of the replica set members? Also, please provide the output of db.stats() from the database that your application is using, on all three nodes.

  2. Can you send the output of rs.status() and rs.conf() by logging into each of the three nodes, along with hostname -f for each node? This would help in understanding the configuration of the replica set. (Example commands for these, and for db.stats() above, are sketched after this list.)

  3. As per the docker-compose file, the Docker image looks like a custom-built image. Can you provide us with information on how the image was created?
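
For example (the container IDs are placeholders; add --port 27018 or --port 27019 on DB02/DB03, and substitute your application's database name for <your_db>):

docker exec -it <container_ID> mongo --quiet --eval 'printjson(rs.status())'
docker exec -it <container_ID> mongo --quiet --eval 'printjson(rs.conf())'
docker exec -it <container_ID> mongo --quiet --eval 'printjson(db.getSiblingDB("<your_db>").stats())'
docker exec -it <container_ID> hostname -f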

Thanks
Aasawari