In our company we have a MongoDB Community deployment with three nodes (Primary, Secondary, and a Secondary on the DR site).
We are having problems stopping and starting a mongod instance.
OS: RHEL 7.7, 64-bit; RAM: 32 GB; WT cache: default; numCores: 4; MongoDB version: 4.4.2.
The idea was to upgrade to version 5, and we started by the book, with the secondary on the DR site.
When we issue systemctl stop mongod, the process does not stop within 5 minutes and something kills it.
Active: failed (Result: signal) since Sat 2022-06-04 16:37:36 CEST; 51s ago
Docs: https://docs.mongodb.org/manual
Main PID: 31801 (code=killed, signal=KILL)
After a normal stop, systemctl status would show ‘inactive (dead)’, but we get ‘failed’.
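To tell the two outcomes apart without reading the status by eye, here is a minimal Python sketch that classifies the "Active:" line of systemctl status output. The function name classify_stop and the labels it returns are hypothetical, chosen only for illustration, and it only recognizes the two cases described here.

```python
import re

# Matches the "Active:" line printed by `systemctl status`, e.g.
# "Active: failed (Result: signal) since ..." or "Active: inactive (dead)".
ACTIVE_RE = re.compile(r"Active:\s+(\w+)\s+\(([^)]*)\)")


def classify_stop(status_text):
    """Return 'clean', 'killed', or 'unknown' for a stopped unit's status text."""
    match = ACTIVE_RE.search(status_text)
    if match is None:
        return "unknown"
    state, detail = match.group(1), match.group(2)
    if state == "inactive" and detail == "dead":
        return "clean"   # the unit stopped on its own
    if state == "failed" and "signal" in detail:
        return "killed"  # the main process was ended by a signal
    return "unknown"
```

Feeding it the text of systemctl status mongod after a stop gives a quick yes/no on whether the shutdown was clean.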
Previously (around two years ago) we could get it started again by issuing the ‘start’ command multiple times.
We noticed that journalctl -xe showed some kind of counter that increased after every start attempt, and eventually the instance started.
It was strange, but since this instance is rarely stopped or restarted, we did not have an opportunity to repeat the test.
Now the upgrade to version 5.0.8 was done via the yum install command. The software was upgraded, but mongod could not start.
Installed:
  mongodb-org-database.x86_64 0:5.0.8-1.el7
Dependency Installed:
  mongodb-mongosh.x86_64 0:1.5.0-1.el8
Updated:
  mongodb-org.x86_64 0:5.0.8-1.el7
  mongodb-org-mongos.x86_64 0:5.0.8-1.el7
  mongodb-org-server.x86_64 0:5.0.8-1.el7
  mongodb-org-shell.x86_64 0:5.0.8-1.el7
  mongodb-org-tools.x86_64 0:5.0.8-1.el7
Complete!
Jun 04 16:48:38 server3.f.hr systemd: Starting MongoDB Databas...
Jun 04 16:48:38 server3.f.hr mongod: about to fork child p...
Jun 04 16:48:38 server3.f.hr mongod: forked process: 31527
Jun 04 16:48:42 server3.f.hr mongod: ERROR: child process ...
Jun 04 16:48:42 server3.f.hr mongod: To see additional inf...
Jun 04 16:48:42 server3.f.hr systemd: mongod.service: control ...
Jun 04 16:48:42 server3.f.hr systemd: Failed to start MongoDB ...
Jun 04 16:48:42 server3.f.hr systemd: Unit mongod.service ente...
Jun 04 16:48:42 server3.f.hr systemd: mongod.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
[root@mongodbp3er ~]# journalctl -xe
-- Unit mongod.service has begun starting up.
Jun 04 16:48:38 server3.f.hr mongod: about to fork child process, waiting unti
Jun 04 16:48:38 server3.f.hr mongod: forked process: 31527
Jun 04 16:48:42 server3.f.hr mongod: ERROR: child process failed, exited with
Jun 04 16:48:42 server3.f.hr mongod: To see additional information in this out
Jun 04 16:48:42 server3.f.hr systemd: mongod.service: control process exited, code
Jun 04 16:48:42 server3.f.hr systemd: Failed to start MongoDB Database Server.
-- Subject: Unit mongod.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit mongod.service has failed.
--
-- The result is failed.
Jun 04 16:48:42 server3.f.hr polkitd: Unregistered Authentication Agent for uni
Jun 04 16:48:42 server3.f.hr systemd: Unit mongod.service entered failed state.
Jun 04 16:48:42 server3.f.hr systemd: mongod.service failed.
Jun 04 16:48:42 server3.f.hr logger: root[/root] : systemctl start mongod
Jun 04 16:48:57 server3.f.hr polkitd: Registered Authentication Agent for unix-
Jun 04 16:48:57 server3.f.hr polkitd: Unregistered Authentication Agent for uni
Jun 04 16:48:57 server3.f.hr logger: root[/root] : systemctl stop mongod
Jun 04 16:49:03 server3.f.hr logger: root[/root] : systemctl status mongod
Jun 04 16:49:39 server3.f.hr logger: root[/root] : journalctl -xe
Then inside mongod.log we found:
Failed to start up WiredTiger under any compatibility version. This may be due to an unsupported upgrade or downgrade.
featureCompatibilityVersion had been set to 4.4 earlier.
Then we also found: Upgrading from a WiredTiger version 10.0.0 database that was not shutdown cleanly is not allowed. Perform a clean shutdown on version 10.0.0 and then upgrade.
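For anyone who wants to check their own log for the same condition before retrying a start, here is a rough Python sketch that scans log lines for the two messages quoted above. The function name find_unclean_shutdown_lines is hypothetical, and the match is a plain substring search, so it makes no assumptions about the rest of the log line format.

```python
# Substrings taken from the two mongod.log messages quoted above.
UNCLEAN_MARKERS = (
    "was not shutdown cleanly",
    "Failed to start up WiredTiger under any compatibility version",
)


def find_unclean_shutdown_lines(log_lines):
    """Return the log lines (without trailing newline) that contain either marker."""
    return [
        line.rstrip("\n")
        for line in log_lines
        if any(marker in line for marker in UNCLEAN_MARKERS)
    ]
```

It accepts any iterable of lines, for example an open file handle on mongod.log (the log path depends on the systemLog.path setting in your config file).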
So it was killed during the stop, but this time issuing the start command multiple times did not help.
Based on some googling, we decided to move back to 4.4.14 (the latest version 4).
We did that, but mongod still could not start.
The final solution was a cold sync, which is now in progress (around 4 days to finish).
We have 32 TB of data online, but users access only the most recent 1% of the data or less (the last 2 years are available, but 99% of the time they use the last 7 days read-write and days 7-30 read-only as history).
Now we think the problem could be related to the WiredTiger cache, which by default is 50% of (RAM - 1 GB): in our setup that is (32 GB - 1 GB) / 2 = 15.5 GB.
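As a sanity check on the default, here is a small Python sketch of the rule as the MongoDB documentation states it: the larger of 50% of (RAM - 1 GB) or 256 MB. The function name default_wt_cache_gb is hypothetical, and the sketch ignores container memory limits.

```python
def default_wt_cache_gb(total_ram_gb):
    """Default WiredTiger internal cache in GB: max(50% of (RAM - 1 GB), 256 MB)."""
    return max(0.5 * (total_ram_gb - 1), 0.25)


# The two RAM sizes relevant to this node: current and planned.
current_default = default_wt_cache_gb(32)
planned_default = default_wt_cache_gb(70)
```

Under this rule a 32 GB node gets 15.5 GB and a 70 GB node gets 34.5 GB.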
Our plan was to:
- wait for cold sync to finish
- increase RAM dynamically to 70 GB
- add to the config file (only on this node): storage.wiredTiger.engineConfig.cacheSizeGB: 50
- execute mongod --wiredTigerCacheSizeGB 50
(this raises the WT cache from the default of (70 - 1) / 2 = 34.5 GB to 50 GB; we know that about 17 GB of RAM for the rest of mongod and the OS is enough for operational use, so 50 GB WT cache + 17 GB = 67 GB of the 70 GB total RAM)
- after that: systemctl stop mongod
wait to see whether it stops cleanly this time or gets killed again
if it stops correctly (clean shutdown), systemctl status mongod should show inactive (dead), and I suppose it can then start normally
Please advise what you think about our situation and our plan for Friday (the expected date for the cold sync to finish).
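The memory budget in the plan can be written down as a quick check. This is only a sketch: the function name ram_budget is hypothetical, and the 17 GB for everything outside the WT cache is our own estimate, not a measured value.

```python
def ram_budget(total_ram_gb, wt_cache_gb, other_gb):
    """Return (used_gb, spare_gb) for a node with a fixed WiredTiger cache size."""
    used = wt_cache_gb + other_gb
    return used, total_ram_gb - used


# Planned figures: 70 GB RAM, 50 GB WT cache, about 17 GB for everything else.
used_gb, spare_gb = ram_budget(70, 50, 17)
```

With these figures the node would use 67 GB and keep 3 GB spare.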
Thank you very much.