MongoDB community stop/start problem

Branimir_Putnikovic · June 6, 2022, 12:53pm

Hello,

In our company we have a MongoDB Community deployment with three nodes (primary, secondary, and a secondary on the DR site).

We are experiencing a problem stopping and starting the mongod instance.
OS: RHEL 7.7, 64-bit, 32 GB RAM, default WiredTiger cache, 4 cores. MongoDB version 4.4.2.

The idea was to upgrade to version 5, and we started by the book, with the secondary on the DR site.

When we issue systemctl stop mongod, the process does not stop within 5 minutes and something kills it.

Active: failed (Result: signal) since Sat 2022-06-04 16:37:36 CEST; 51s ago
     Docs: https://docs.mongodb.org/manual
 Main PID: 31801 (code=killed, signal=KILL)

After a normal stop, systemctl status would show ‘inactive (dead)’, but we get ‘failed’.
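To make the distinction between the two states concrete, here is a minimal Python sketch that classifies a `systemctl status` excerpt as a clean stop or a kill. The function name `classify_stop` is hypothetical, and the patterns are assumptions based only on the status text quoted in this post, not an exhaustive list of systemd states.

```python
import re


def classify_stop(status_text: str) -> str:
    """Return 'clean', 'killed', or 'unknown' for a `systemctl status` excerpt."""
    # A clean stop leaves the unit in the 'inactive (dead)' state.
    if re.search(r"Active:\s+inactive \(dead\)", status_text):
        return "clean"
    # A process terminated by SIGKILL shows up as a failed unit.
    if (re.search(r"Active:\s+failed \(Result: signal\)", status_text)
            or "signal=KILL" in status_text):
        return "killed"
    return "unknown"


# Excerpt taken from the status output above.
sample = (
    "Active: failed (Result: signal) since Sat 2022-06-04 16:37:36 CEST; 51s ago\n"
    " Main PID: 31801 (code=killed, signal=KILL)\n"
)
print(classify_stop(sample))
```

A check like this could be run after every stop, before any package upgrade is attempted.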

Previously (around two years ago) we could start it again by issuing the ‘start’ command multiple times.
We noticed that journalctl -xe shows some kind of counter that increases after every start attempt, and in the end the instance starts.
It was strange, but since this instance is rarely stopped or restarted, we did not have an opportunity to repeat the test.

Now, the upgrade to version 5.0.8 was done via the yum install command. The software was upgraded, but mongod could not start.

Installed:
  mongodb-org-database.x86_64 0:5.0.8-1.el7

Dependency Installed:
  mongodb-mongosh.x86_64 0:1.5.0-1.el8

Updated:
  mongodb-org.x86_64 0:5.0.8-1.el7
  mongodb-org-mongos.x86_64 0:5.0.8-1.el7
  mongodb-org-server.x86_64 0:5.0.8-1.el7
  mongodb-org-shell.x86_64 0:5.0.8-1.el7
  mongodb-org-tools.x86_64 0:5.0.8-1.el7

Complete!

Jun 04 16:48:38 server3.f.hr systemd[1]: Starting MongoDB Databas...
Jun 04 16:48:38 server3.f.hr mongod[31525]: about to fork child p...
Jun 04 16:48:38 server3.f.hr mongod[31525]: forked process: 31527
Jun 04 16:48:42 server3.f.hr mongod[31525]: ERROR: child process ...
Jun 04 16:48:42 server3.f.hr mongod[31525]: To see additional inf...
Jun 04 16:48:42 server3.f.hr systemd[1]: mongod.service: control ...
Jun 04 16:48:42 server3.f.hr systemd[1]: Failed to start MongoDB ...
Jun 04 16:48:42 server3.f.hr systemd[1]: Unit mongod.service ente...
Jun 04 16:48:42 server3.f.hr systemd[1]: mongod.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

[root@mongodbp3er ~]# journalctl -xe
-- Unit mongod.service has begun starting up.
Jun 04 16:48:38 server3.f.hr mongod[31525]: about to fork child process, waiting unti
Jun 04 16:48:38 server3.f.hr mongod[31525]: forked process: 31527
Jun 04 16:48:42 server3.f.hr mongod[31525]: ERROR: child process failed, exited with
Jun 04 16:48:42 server3.f.hr mongod[31525]: To see additional information in this out
Jun 04 16:48:42 server3.f.hr systemd[1]: mongod.service: control process exited, code
Jun 04 16:48:42 server3.f.hr systemd[1]: Failed to start MongoDB Database Server.
-- Subject: Unit mongod.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit mongod.service has failed.
--
-- The result is failed.
Jun 04 16:48:42 server3.f.hr polkitd[1995]: Unregistered Authentication Agent for uni
Jun 04 16:48:42 server3.f.hr systemd[1]: Unit mongod.service entered failed state.
Jun 04 16:48:42 server3.f.hr systemd[1]: mongod.service failed.
Jun 04 16:48:42 server3.f.hr logger[31554]: root[/root] : systemctl start mongod
Jun 04 16:48:57 server3.f.hr polkitd[1995]: Registered Authentication Agent for unix-
Jun 04 16:48:57 server3.f.hr polkitd[1995]: Unregistered Authentication Agent for uni
Jun 04 16:48:57 server3.f.hr logger[31576]: root[/root] : systemctl stop mongod
Jun 04 16:49:03 server3.f.hr logger[31584]: root[/root] : systemctl status mongod
Jun 04 16:49:39 server3.f.hr logger[31618]: root[/root] : journalctl -xe

Then inside mongod.log we found:

Failed to start up WiredTiger under any compatibility version. This may be due to an unsupported upgrade or downgrade.

featureCompatibilityVersion had been set to 4.4 earlier.

Then we also found: Upgrading from a WiredTiger version 10.0.0 database that was not shutdown cleanly is not allowed. Perform a clean shutdown on version 10.0.0 and then upgrade.
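As a rough illustration of how these two messages could be searched for automatically, here is a small Python sketch. The helper name `find_unclean_upgrade_hints` is hypothetical, and the sample lines are simply the two messages quoted above, not real log records; actual mongod.log lines carry additional fields around the message text.

```python
# Substrings taken from the two messages quoted in this post.
FATAL_HINTS = (
    "Failed to start up WiredTiger under any compatibility version",
    "was not shutdown cleanly is not allowed",
)


def find_unclean_upgrade_hints(log_lines):
    """Return the lines that contain one of the known startup-failure messages."""
    return [
        line.rstrip("\n")
        for line in log_lines
        if any(hint in line for hint in FATAL_HINTS)
    ]


# Illustrative input only: the quoted messages plus one unrelated line.
sample_lines = [
    "Failed to start up WiredTiger under any compatibility version. "
    "This may be due to an unsupported upgrade or downgrade.\n",
    "some unrelated log line\n",
    "Upgrading from a WiredTiger version 10.0.0 database that was not "
    "shutdown cleanly is not allowed.\n",
]
hits = find_unclean_upgrade_hints(sample_lines)
print(len(hits))
```

In practice the function would be fed the lines of the real log file, for example via `open(path)`.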

So, it was killed during the stop, but this time issuing the start command multiple times does not help.
Based on some googling, we decided to move back to 4.4.14 (the latest 4.x version).

We did that, but mongod still would not start.

The final solution was a cold sync, which is now in progress (around 4 days to finish).

We have 32 TB of data online, but users access only the most recent 1% of the data or less (the last 2 years are available, but 99% of the time they use the last 7 days read-write and days 7-30 read-only as history).

Now we think that the problem could be related to the WiredTiger cache, which is 50% of RAM - 1 GB: in our setup that is 32 GB / 2 = 16 GB - 1 = 15 GB.
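For reference, the MongoDB documentation states the default WiredTiger internal cache as the larger of 50% of (RAM - 1 GB) and 256 MB, which gives a slightly different number than the rounding above. A minimal Python sketch of that formula follows; the function name is hypothetical.

```python
def default_wt_cache_gb(ram_gb: float) -> float:
    """Default WiredTiger cache size in GB for a host with ram_gb of RAM.

    Documented default: the larger of 50% of (RAM - 1 GB) and 256 MB.
    """
    return max(0.5 * (ram_gb - 1.0), 0.25)


# The 32 GB node described in this post.
print(default_wt_cache_gb(32))
```

For the 32 GB node this yields 15.5 GB rather than 15 GB; the difference is small and does not change the reasoning.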

Our plan is to:

1. Wait for the cold sync to finish.
2. Increase RAM dynamically to 70 GB.
3. Add to the config file (only on this node): cacheSizeGB: 50
4. Execute mongod --wiredTigerCacheSizeGB 50
   (This way we increase the WiredTiger cache from the default of (70/2)-1 = 34 GB to 50 GB. We know that 17 GB of free RAM for MongoDB and the OS is enough for operational use, so we get 50 GB cache + 17 GB = 67 GB of the 70 GB total RAM.)
5. After that: systemctl stop mongod
6. Wait to see whether it stops this time or is killed again.
7. If it stops correctly (clean shutdown), systemctl status mongod should show inactive (dead), and I suppose it can then start normally.
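The memory budget behind this plan can be written down as a tiny Python sketch. The function name is hypothetical; the inputs are the figures from the plan (70 GB total, 50 GB cache, 17 GB assumed sufficient for the rest of mongod and the OS).

```python
def ram_headroom_gb(total_ram_gb: float, wt_cache_gb: float, other_needs_gb: float) -> float:
    """RAM left over after the WiredTiger cache and all other needs are reserved."""
    return total_ram_gb - wt_cache_gb - other_needs_gb


headroom = ram_headroom_gb(total_ram_gb=70, wt_cache_gb=50, other_needs_gb=17)
print(headroom)
```

A positive result means the planned cache size fits within the budget; whether 17 GB is really enough outside the cache is the poster's own estimate.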

Please advise what you think about our situation and our plan for Friday (the expected finish date of the cold sync).

Thank you very much.

Best regards,
Branimir Putniković

chris · June 6, 2022, 3:34pm

The SIGKILL indicates the shutdown was not clean. Likely this is a SIGKILL from systemd after a stop timeout.
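To check how long systemd waits before it escalates, the unit's stop timeout can be inspected, for example with `systemctl show mongod -p TimeoutStopUSec`. The Python sketch below only parses a line of that form into seconds; the sample value of 5 minutes is an assumption matching the behaviour described in this thread, not a value read from the actual unit, and the helper names are hypothetical.

```python
import re

# Seconds per unit, for the time-span units this sketch handles.
_UNITS = {"h": 3600.0, "min": 60.0, "s": 1.0, "ms": 0.001}


def parse_timespan(text: str) -> float:
    """Convert a time span such as '1min 30s' to seconds."""
    text = text.strip()
    if text == "infinity":
        return float("inf")
    total = 0.0
    # 'min' and 'ms' are listed before 's' so the longer units match first.
    for value, unit in re.findall(r"(\d+(?:\.\d+)?)\s*(h|min|ms|s)", text):
        total += float(value) * _UNITS[unit]
    return total


def stop_timeout_seconds(show_line: str) -> float:
    """Parse a 'TimeoutStopUSec=...' line into seconds."""
    _key, _sep, value = show_line.partition("=")
    return parse_timespan(value)


# Assumed sample line, not taken from the real system.
print(stop_timeout_seconds("TimeoutStopUSec=5min"))
```

If the shutdown genuinely needs longer than this value, the process will be killed regardless of how it was asked to stop through systemd.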

A clean shutdown is a prerequisite for the upgrade. If you are not getting a clean shutdown from systemctl then you can try one of the other methods on https://www.mongodb.com/docs/manual/tutorial/manage-mongodb-processes/#stop-mongod-processes

I did a test jumping from 4.2 through 4.4 to 5.0 without updating the FCV and mongod 5.0 gave a good informational message about FCV being on 4.2. So I think “something bad” ™ happened to your installation on that replica.

Branimir_Putnikovic · June 9, 2022, 8:47am

Good morning,

Thank you, Mr. Dellaway, for your answer and advice.

Our cold sync is going to finish tomorrow.
We shall use ‘use admin’ followed by ‘db.shutdownServer()’ after increasing the RAM and the WiredTiger cache as described, and see the result.
If successful, we plan to do the same on the other two nodes and after that upgrade to 5.x.

Best regards,
Branimir Putniković