MongoDB crashes randomly from time to time (version 4.4.15, Ubuntu 20.04.5 LTS)

Hello everyone,

We are facing an issue with MongoDB and still have no clue what is going on, so I am posting here for suggestions or help. The mongod service stops anywhere from a few times a day to once a week, completely at random. This is what we see in the MongoDB logs:

{"t":{"$date":"2023-03-25T05:04:26.904+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"WTCheckpointThread","msg":"WiredTiger message","attr":{"message":"[1679720666:903992][803:0x7f0b1dc21700], WT_SESSION.checkpoint: [WT_VERB_CHECKPOINT_PROGRESS] saving checkpoint snapshot min: 21472, snapshot max: 21472 snapshot count: 0, oldest timestamp: (1679720661, 1) , meta checkpoint timestamp: (1679720666, 1) base write gen: 2878249"}}
{"t":{"$date":"2023-03-25T05:04:33.027+00:00"},"s":"F",  "c":"CONTROL",  "id":4757800, "ctx":"ftdc","msg":"Writing fatal message","attr":{"message":"terminate() called. An exception is active; attempting to gather more information"}}
{"t":{"$date":"2023-03-25T05:04:33.027+00:00"},"s":"F",  "c":"CONTROL",  "id":4757800, "ctx":"ftdc","msg":"Writing fatal message","attr":{"message":"DBException::toString(): FileStreamFailed: Failed to write to interim file buffer for full-time diagnostic data capture: /var/lib/mongodb/diagnostic.data/metrics.interim.temp\nActual exception type: mongo::error_details::ExceptionForImpl<(mongo::ErrorCodes::Error)39, mongo::AssertionException>\n"}}
{"t":{"$date":"2023-03-25T05:04:33.347+00:00"},"s":"I",  "c":"CONTROL",  "id":31431,   "ctx":"ftdc","msg":"BACKTRACE: {bt}","attr":{"bt":{"backtrace":[{"a":"55876002346A","b":"55875D226000","o":"2DFD46A","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.606","s+":"1EA"},{"a":"558760024EF9","b":"55875D226000","o":"2DFEEF9","s":"_ZN5mongo15printStackTraceEv","s+":"29"},{"a":"5587600220C6","b":"55875D226000","o":"2DFC0C6","s":"_ZN5mongo12_GLOBAL__N_111myTerminateEv","s+":"A6"},{"a":"5587601B26D6","b":"55875D226000","o":"2F8C6D6","s":"_ZN10__cxxabiv111__terminateEPFvvE","s+":"6"},{"a":"558760246739","b":"55875D226000","o":"3020739","s":"__cxa_call_terminate","s+":"39"},{"a":"5587601B20C5","b":"55875D226000","o":"2F8C0C5","s":"__gxx_personality_v0","s+":"275"},{"a":"7F0B2630DBEF","b":"7F0B262FD000","o":"10BEF","s":"_Unwind_GetTextRelBase","s+":"1E7F"},{"a":"7F0B2630E281","b":"7F0B262FD000","o":"11281","s":"_Unwind_RaiseException","s+":"331"},{"a":"5587601B2837","b":"55875D226000","o":"2F8C837","s":"__cxa_throw","s+":"37"},{"a":"55875E160F60","b":"55875D226000","o":"F3AF60","s":"_ZN5mongo13error_details23throwExceptionForStatusERKNS_6StatusE","s+":"1B72"},{"a":"55875E1751FD","b":"55875D226000","o":"F4F1FD","s":"_ZN5mongo21uassertedWithLocationERKNS_6StatusEPKcj","s+":"27B"},{"a":"55875DECDC6F","b":"55875D226000","o":"CA7C6F","s":"_ZN5mongo14FTDCController6doLoopEv.cold.395","s+":"2D"},{"a":"55875E70770C","b":"55875D226000","o":"14E170C","s":"_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5mongo4stdx6threadC4IZNS3_14FTDCController5startEvEUlvE0_JELi0EEET_DpOT0_EUlvE_EEEEE6_M_runEv","s+":"5C"},{"a":"5587601CE19F","b":"55875D226000","o":"2FA819F","s":"execute_native_thread_routine","s+":"F"},{"a":"7F0B262E2609","b":"7F0B262DA000","o":"8609","s":"start_thread","s+":"D9"},{"a":"7F0B26207133","b":"7F0B260E8000","o":"11F133","s":"clone","s+":"43"}],"processInfo":{"mongodbVersion":"4.4.15","gitVersion":"bc17cf2c788c5dda2801a090ea79da5ff7d5fac9","compiledModules":[],"uname":{"sysname":"Linux","release":"5.4.0-144-generic","version":"#161-Ubuntu SMP Fri Feb 3 14:49:04 UTC 2023","machine":"x86_64"},"somap":[{"b":"55875D226000","elfType":3,"buildId":"EE0334AB46B2152536232E843AA38EAFC636FE8F"},{"b":"7F0B262FD000","path":"/lib/x86_64-linux-gnu/libgcc_s.so.1","elfType":3,"buildId":"4ABD133CC80E01BB388A9C42D9E3CB338836544A"},{"b":"7F0B262DA000","path":"/lib/x86_64-linux-gnu/libpthread.so.0","elfType":3,"buildId":"7B4536F41CDAA5888408E82D0836E33DCF436466"},{"b":"7F0B260E8000","path":"/lib/x86_64-linux-gnu/libc.so.6","elfType":3,"buildId":"1878E6B475720C7C51969E69AB2D276FAE6D1DEE"}]}}}}
{"t":{"$date":"2023-03-25T05:04:33.348+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"ftdc","msg":"  Frame: {frame}","attr":{"frame":{"a":"55876002346A","b":"55875D226000","o":"2DFD46A","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.606","s+":"1EA"}}}
{"t":{"$date":"2023-03-25T05:04:33.348+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"ftdc","msg":"  Frame: {frame}","attr":{"frame":{"a":"558760024EF9","b":"55875D226000","o":"2DFEEF9","s":"_ZN5mongo15printStackTraceEv","s+":"29"}}}
{"t":{"$date":"2023-03-25T05:04:33.348+00:00"},"s":"I",  "c":"CONTROL",  "id":3
{"t":{"$date":"2023-03-25T18:30:00.556+00:00"},"s":"I",  "c":"CONTROL",  "id":20698,   "ctx":"main","msg":"***** SERVER RESTARTED *****"}
{"t":{"$date":"2023-03-25T18:30:00.597+00:00"},"s":"I",  "c":"CONTROL",  "id":23285,   "ctx":"main","msg":"Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'"}
{"t":{"$date":"2023-03-25T18:30:01.037+00:00"},"s":"W",  "c":"ASIO",     "id":22601,   "ctx":"main","msg":"No TransportLayer configured during NetworkInterface startup"}
{"t":{"$date":"2023-03-25T18:30:01.037+00:00"},"s":"I",  "c":"NETWORK",  "id":4648601, "ctx":"main","msg":"Implicit TCP FastOpen unavailable. If TCP FastOpen is required, set tcpFastOpenServer, tcpFastOpenClient, and tcpFastOpenQueueSize."}
{"t":{"$date":"2023-03-25T18:30:01.038+00:00"},"s":"W",  "c":"ASIO",     "id":22601,   "ctx":"main","msg":"No TransportLayer configured during NetworkInterface startup"}
{"t":{"$date":"2023-03-25T18:30:01.209+00:00"},"s":"I",  "c":"STORAGE",  "id":4615611, "ctx":"initandlisten","msg":"MongoDB starting","attr":{"pid":804,"port":27017,"dbPath":"/var/lib/mongodb","architecture":"64-bit","host":"ahus3"}}
{"t":{"$date":"2023-03-25T18:30:01.209+00:00"},"s":"I",  "c":"CONTROL",  "id":23403,   "ctx":"initandlisten","msg":"Build Info","attr":{"buildInfo":{"version":"4.4.15","gitVersion":"bc17cf2c788c5dda2801a090ea79da5ff7d5fac9","openSSLVersion":"OpenSSL 1.1.1f  31 Mar 2020","modules":[],"allocator":"tcmalloc","environment":{"distmod":"ubuntu2004","distarch":"x86_64","target_arch":"x86_64"}}}}
{"t":{"$date":"2023-03-25T18:30:01.209+00:00"},"s":"I",  "c":"CONTROL",  "id":51765,   "ctx":"initandlisten","msg":"Operating System","attr":{"os":{"name":"Ubuntu","version":"20.04"}}}
{"t":{"$date":"2023-03-25T18:30:01.209+00:00"},"s":"I",  "c":"CONTROL",  "id":21951,   "ctx":"initandlisten","msg":"Options set by command line","attr":{"options":{"config":"/etc/mongod.conf","net":{"bindIp":"127.0.0.1","port":27017},"processManagement":{"timeZoneInfo":"/usr/share/zoneinfo"},"replication":{"replSetName":"rs0"},"security":{"authorization":"enabled","keyFile":"/home/developer/mongo-security/mongodb-key"},"storage":{"dbPath":"/var/lib/mongodb","journal":{"enabled":true}},"systemLog":{"destination":"file","logAppend":true,"logRotate":"reopen","path":"/var/log/mongodb/mongod.log"}}}}
{"t":{"$date":"2023-03-25T18:30:01.214+00:00"},"s":"W",  "c":"STORAGE",  "id":22271,   "ctx":"initandlisten","msg":"Detected unclean shutdown - Lock file is not empty","attr":{"lockFile":"/var/lib/mongodb/mongod.lock"}}
{"t":{"$date":"2023-03-25T18:30:01.215+00:00"},"s":"I",  "c":"STORAGE",  "id":22270,   "ctx":"initandlisten","msg":"Storage engine to use detected by data files","attr":{"dbpath":"/var/lib/mongodb","storageEngine":"wiredTiger"}}
{"t":{"$date":"2023-03-25T18:30:01.215+00:00"},"s":"W",  "c":"STORAGE",  "id":22302,   "ctx":"initandlisten","msg":"Recovering data from the last clean checkpoint."}
{"t":{"$date":"2023-03-25T18:30:01.215+00:00"},"s":"I",  "c":"STORAGE",  "id":22297,   "ctx":"initandlisten","msg":"Using the XFS filesystem is strongly recommended with the WiredTiger storage engine. See http://dochub.mongodb.org/core/prodnotes-filesystem","tags":["startupWarnings"]}
{"t":{"$date":"2023-03-25T18:30:01.221+00:00"},"s":"I",  "c":"STORAGE",  "id":22315,   "ctx":"initandlisten","msg":"Opening WiredTiger","attr":{"config":"create,cache_size=3466M,session_max=33000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000,close_scan_interval=10,close_handle_minimum=250),statistics_log=(wait=0),verbose=[recovery_progress,checkpoint_progress,compact_progress],"}}
{"t":{"$date":"2023-03-25T18:30:02.505+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1679769002:505835][804:0x7f21853c7cc0], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 18 through 19"}}

These log lines should help:

{"t":{"$date":"2023-03-25T05:04:33.027+00:00"},"s":"F",  "c":"CONTROL",  "id":4757800, "ctx":"ftdc","msg":"Writing fatal message","attr":{"message":"terminate() called. An exception is active; attempting to gather more information"}}
{"t":{"$date":"2023-03-25T05:04:33.027+00:00"},"s":"F",  "c":"CONTROL",  "id":4757800, "ctx":"ftdc","msg":"Writing fatal message","attr":{"message":"DBException::toString(): FileStreamFailed: Failed to write to interim file buffer for full-time diagnostic data capture: /var/lib/mongodb/diagnostic.data/metrics.interim.temp\nActual exception type: mongo::error_details::ExceptionForImpl<(mongo::ErrorCodes::Error)39, mongo::AssertionException>\n"}}

After those, MongoDB is down…

I have found some very similar issues in this forum, but none were quite the same. For most of them the cause was “no space left on device”, but in our case there is plenty of free space (and that line does not appear in our logs). So far the only fix is to restart the mongod service.
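
One thing that might still be worth ruling out: the FileStreamFailed message does not say why the write failed, and a filesystem can refuse writes when it runs out of inodes even though free blocks remain. Below is a minimal sketch, assuming the diagnostic.data path from the log above and a Linux df; check_diag_dir is a hypothetical helper name, not a MongoDB tool.

```shell
# Report free space, free inodes, and ownership for a directory.
# Default path is taken from the FileStreamFailed log message; pass another
# directory as the first argument to check a different location.
check_diag_dir() {
    local dir="${1:-/var/lib/mongodb/diagnostic.data}"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    echo "== free space =="
    df -h "$dir"
    echo "== free inodes =="
    df -i "$dir"
    echo "== ownership =="
    ls -ld "$dir"
}
```

Running check_diag_dir with no arguments checks the default path. The directory should be owned by the user the service runs as (mongodb in the unit file further down).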

Thanks in advance for any kind of help!

What’s the platform? If it’s Linux you might check ulimit

Good point! Some of the values are on the low side, so raising them might help.

user@xxxxxx:~$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 31470
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 31470
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Based on the recommended ulimit settings in the MongoDB documentation, the limits on open files and processes/threads should be much higher. But if I run

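return-limits(){
     for process in "$@"; do
          # PIDs of all running processes with this name, without the header line
          process_pids=$(ps -C "$process" -o pid --no-headers)
          # Test the PID list, not the argument list
          if [ -z "$process_pids" ]; then
             echo "[no $process running]"
          else
             # Unquoted on purpose: word splitting drops the padding around each PID
             for pid in $process_pids; do
                   echo "[$process #$pid -- limits]"
                   cat "/proc/$pid/limits"
             done
          fi
     done
}

return-limits mongod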

It looks like mongod does get the recommended limits. Here is the output:

[mongod #260286 -- limits]
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             64000                64000                processes
Max open files            64000                64000                files
Max locked memory         unlimited            unlimited            bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       31470                31470                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

It turned out that the values from the systemd service configuration (/usr/lib/systemd/system/mongod.service) are the ones taking effect, as the output of the return-limits function above shows. This is the content of mongod.service:

[Unit]
Description=MongoDB Database Server
Documentation=https://docs.mongodb.org/manual
After=network-online.target
Wants=network-online.target

[Service]
User=mongodb
Group=mongodb
EnvironmentFile=-/etc/default/mongod
ExecStart=/usr/bin/mongod --config /etc/mongod.conf
PIDFile=/var/run/mongodb/mongod.pid
# file size
LimitFSIZE=infinity
# cpu time
LimitCPU=infinity
# virtual memory size
LimitAS=infinity
# open files
LimitNOFILE=64000
# processes/threads
LimitNPROC=64000
# locked memory
LimitMEMLOCK=infinity
# total threads (user+kernel)
TasksMax=infinity
TasksAccounting=false

# Recommended limits for mongod as specified in
# https://docs.mongodb.com/manual/reference/ulimit/#recommended-ulimit-settings

[Install]
WantedBy=multi-user.target
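
The unit file caps mongod at 64000 open files, so one way to see whether the process ever gets close to that limit is to compare its number of open descriptors with the soft limit. This is a rough sketch, assuming a Linux /proc filesystem; fd_usage is a hypothetical helper name.

```shell
# Print the open-descriptor count and the soft "Max open files" limit per PID.
fd_usage() {
    local pid open limit
    for pid in "$@"; do
        if [ ! -r "/proc/$pid/limits" ]; then
            echo "[pid $pid: no readable /proc entry]"
            continue
        fi
        # One entry per open descriptor; yields 0 if the directory is unreadable
        open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
        # Fourth field of the "Max open files" row is the soft limit
        limit=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
        echo "[pid $pid] open files: $open / soft limit: $limit"
    done
}
```

For example, from a root shell: fd_usage $(pidof mongod). Without root, /proc/<pid>/fd of another user's process is usually unreadable, so the count shows as 0.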

Sure, there is a very dirty workaround: I can modify mongod.service so that it restarts on failure, but that does not help us find the root cause of the “random” crashes.
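
For reference, the restart workaround could look roughly like this. It is a sketch that uses a systemd drop-in instead of editing the packaged unit file, which a package update could overwrite. write_restart_dropin is a hypothetical helper name and the 5-second delay is an arbitrary choice.

```shell
# Write a systemd drop-in that restarts the service after a crash.
# First argument: the drop-in directory to write override.conf into.
write_restart_dropin() {
    local dir="$1"
    mkdir -p "$dir" || return 1
    cat > "$dir/override.conf" <<'EOF'
[Service]
# Restart after a non-zero exit or a fatal signal, but not after a clean stop
Restart=on-failure
# Wait a few seconds before starting again
RestartSec=5
EOF
}
```

From a root shell this would be write_restart_dropin /etc/systemd/system/mongod.service.d, followed by systemctl daemon-reload. It only hides the crash: the fatal ftdc entries will still show up in the log each time.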

Perhaps open an Issue?

You could also try updating to Ubuntu 22.04

Still, there are no reproduction steps; it is purely random. If I open an issue, I believe it will just sit there for a while… Today I noticed the same problem on the Windows platform, so it might be something internal to MongoDB 4.4 rather than an environment configuration or limitation.

You’re probably right, @Benjamin_Beganovic … can you upgrade to a later version of MongoDB?

An upgrade needs to be done anyway in a couple of months (some other things have to be upgraded first), but for now I have to come up with a workaround that is at least good enough.

I would build a Docker container with MongoDB 6.0 and get off 4.4, as you’re not the only person in the last couple of weeks who has brought up 4.4 crashing on Ubuntu 20 and above.

As far as I know, people who moved to 6.0 haven’t experienced the problem since.

After building the 6.0 instance, export your indexes and aggregations to it, then export the BSON over to it. That would be far better than troubleshooting a version that goes EOL in February 2024 anyway: in the next 10 months you’ll be in a worse situation support-wise, so it makes sense to be ahead of the curve for the next two years instead of 10 months. You could even upgrade to the latest Ops Manager on top of it.
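
A dump-and-restore move along those lines might be sketched as follows. The URIs and dump directory are placeholders, not values from this thread, and migrate_dump_restore is a hypothetical helper name. mongodump writes index definitions alongside the BSON and mongorestore rebuilds them, so indexes should not need a separate export; whether a 4.4 dump restores cleanly into 6.0 for a given dataset is an assumption to test on a copy first.

```shell
# Print (DRY_RUN=1, the default) or run (DRY_RUN=0) a dump from the old
# server followed by a restore into the new one.
# Arguments: source URI, destination URI, optional dump directory.
migrate_dump_restore() {
    local src="$1" dst="$2" dir="${3:-./dump}"
    local run="echo"
    # An empty prefix makes the commands below execute instead of print
    [ "${DRY_RUN:-1}" = "0" ] && run=""
    $run mongodump --uri="$src" --gzip --out="$dir" || return 1
    $run mongorestore --uri="$dst" --gzip "$dir"
}
```

With the default dry run, migrate_dump_restore "mongodb://user:pass@old-host:27017" "mongodb://user:pass@new-host:27017" only prints the two commands so they can be reviewed before running with DRY_RUN=0.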
