Multiple core dumps

Oded_Raiches · October 24, 2022, 10:52am

Hi, been working with mongo 6.0 for a few weeks now.
Using a sharded cluster configuration.
Observed a bunch of core dumps happening, but was not able to inspect them with mongod on the VM that it happened in (core size was limited, so its reading garbage).
Core dump names are truncated, but I believe you can identify.

Replica.xecutor
ShardRegistry
ShardRe.Updater
ReplCoord
Logical.cheReap

They happened all at once more or less:

Oct 18 04:01:10.526700 //Replica.xecutor--1666065669.gz
Oct 18 04:01:32.726745 //ShardRegistry--1666065691.gz
Oct 18 04:50:28.010586 //ShardRe.Updater--1666068627.gz
Oct 18 04:50:39.290355 //ReplCoord-16--1666068638.gz
Oct 18 04:52:04.376591 //Logical.cheReap--1666068723.gz
Oct 18 04:52:10.809420 //Logical.cheReap--1666068723.gz
Oct 18 04:52:26.634675 //ReplCoord-0--1666068720.gz

How can I debug these cores? mongod should work fine if the cores are not limited?
is there a reason why they are all crashing separately at the same time?
Might it be related to replica set shut down or primary turned off for a short time?

Thanks!

kevinadi · October 24, 2022, 11:44pm

Hi @Oded_Raiches

Could you provide more background details:

What’s the exact MongoDB version you’re using? 6.0.1? 6.0.2?
What’s the sharded cluster’s topology? How many shards, how many nodes in a shard replica set?
Details of the hardware you’re running it on, e.g. what’s the hardware spec, how much RAM per node/VM, etc.
Details of your deployment, e.g. what’s your OS and filesystem, are each node running inside their own VM, does any mongod process share a node with anything else (e.g. another mongod process)

And also about the event itself:

What’s in the logs of the crashed servers?
Timeline of the events; I assume there are multiple crashes?
Any pattern you can see when the crashes happened, e.g. time of day, workload, or any other details you can share?

This seems like a lot of questions. but the goal is to know enough of the situation you’re facing, your hardware, and the events timeline, so people can imitate your situation and reproduce the issue reliably.

Best regards
Kevin

Oded_Raiches · October 25, 2022, 2:06pm

Hi @kevinadi , thanks for the fast reply!

MongoDB version is 6.0.1
not sure what you mean about “topology”, but we have one shard, 3 nodes of a 3 member replica set.
Hardware: 12vcpu + 46.8GB RAM on each node
OS: Ubuntu 18.04, with XFS filesystem. mongod process running on a separate VM

Logs around this time:
mongos logs:

{"t":{"$date":"2022-10-19T03:40:19.693+00:00"},"s":"I",  "c":"SH_REFR",  "id":4619901, "ctx":"CatalogCache-558","msg":"Refreshed cached collection","attr":{"namespace":"config.system.sessions","lookupSinceVersion":"1|1024||633ffefb43612c48dd1e4cdd||Timestamp(1665138427, 6)","newVersion":"{ chunkVersion: { t: Timestamp(1665138427, 6), e: ObjectId('633ffefb43612c48dd1e4cdd'), v: Timestamp(1, 1024) }, forcedRefreshSequenceNum: 2199, epochDisambiguatingSequenceNum: 2181 }","timeInStore":"{ chunkVersion: \"None\", forcedRefreshSequenceNum: 2198, epochDisambiguatingSequenceNum: 2180 }","durationMillis":6}}
{"t":{"$date":"2022-10-19T03:40:19.713+00:00"},"s":"I",  "c":"SH_REFR",  "id":4619901, "ctx":"CatalogCache-558","msg":"Refreshed cached collection","attr":{"namespace":"config.system.sessions","lookupSinceVersion":"1|1024||633ffefb43612c48dd1e4cdd||Timestamp(1665138427, 6)","newVersion":"{ chunkVersion: { t: Timestamp(1665138427, 6), e: ObjectId('633ffefb43612c48dd1e4cdd'), v: Timestamp(1, 1024) }, forcedRefreshSequenceNum: 2201, epochDisambiguatingSequenceNum: 2183 }","timeInStore":"{ chunkVersion: \"None\", forcedRefreshSequenceNum: 2200, epochDisambiguatingSequenceNum: 2182 }","durationMillis":4}}
{"t":{"$date":"2022-10-19T03:40:36.660+00:00"},"s":"W",  "c":"NETWORK",  "id":4615610, "ctx":"MirrorMaestro-8","msg":"Failed to check socket connectivity","attr":{"error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"}}}
{"t":{"$date":"2022-10-19T03:40:36.660+00:00"},"s":"I",  "c":"CONNPOOL", "id":22561,   "ctx":"MirrorMaestro-8","msg":"Dropping unhealthy pooled connection","attr":{"hostAndPort":"10.3.14.67:7501"}}
{"t":{"$date":"2022-10-19T03:40:36.660+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"MirrorMaestro","msg":"Connecting","attr":{"hostAndPort":"10.3.14.67:7501"}}
{"t":{"$date":"2022-10-19T03:41:54.107+00:00"},"s":"I",  "c":"SH_REFR",  "id":4619901, "ctx":"CatalogCache-559","msg":"Refreshed cached collection","attr":{"namespace":"config.system.sessions","lookupSinceVersion":"1|1024||633ffefb43612c48dd1e4cdd||Timestamp(1665138427, 6)","newVersion":"{ chunkVersion: { t: Timestamp(1665138427, 6), e: ObjectId('633ffefb43612c48dd1e4cdd'), v: Timestamp(1, 1024) }, forcedRefreshSequenceNum: 2203, epochDisambiguatingSequenceNum: 2185 }","timeInStore":"{ chunkVersion: \"None\", forcedRefreshSequenceNum: 2202, epochDisambiguatingSequenceNum: 2184 }","durationMillis":5}}
{"t":{"$date":"2022-10-19T03:41:54.116+00:00"},"s":"I",  "c":"SH_REFR",  "id":4619901, "ctx":"CatalogCache-559","msg":"Refreshed cached collection","attr":{"namespace":"config.system.sessions","lookupSinceVersion":"1|1024||633ffefb43612c48dd1e4cdd||Timestamp(1665138427, 6)","newVersion":"{ chunkVersion: { t: Timestamp(1665138427, 6), e: ObjectId('633ffefb43612c48dd1e4cdd'), v: Timestamp(1, 1024) }, forcedRefreshSequenceNum: 2205, epochDisambiguatingSequenceNum: 2187 }","timeInStore":"{ chunkVersion: \"None\", forcedRefreshSequenceNum: 2204, epochDisambiguatingSequenceNum: 2186 }","durationMillis":4}}

Unfortunately, I don’t have data about what occurred there at this time (logs got rolled) but I got a hint that the storage holding the database might have disconnected for some time.

kevinadi · October 26, 2022, 3:34am

Hi @Oded_Raiches

Unfortunately we need the crashed mongod logs instead of the mongos logs, since the mongos process have no idea what’s going on in the shards themselves.

You also mentioned:

but I got a hint that the storage holding the database might have disconnected for some time.

If I understand correctly, your setup have storage defined external to the nodes, and the VM lost connection to the storage node. Is this accurate? If yes, then this is like pulling the hard drive out when your PC is running. It’s best to double check that no data is lost or corrupt.

I would verify that your deployment follows the settings put forth in the production notes to ensure best performance and reliability.

Best regards,
Kevin

Oded_Raiches · November 15, 2022, 8:38am

Hi @kevinadi, I didn’t notice any corruption in data, but when the storage is disconnected we get these cores.
Is there a way I can share you the logs and cores? I’m not really able to read the cores, at first I thought they were corrupt but I’m probably not using the right binaries

Oded_Raiches · November 22, 2022, 6:55am

Hi @kevinadi , I observed some crashes again.
What happened here is that a storage drive was replaced.

The OplogApplier core says:

#0  0x00007f722da79817 in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x000055a8bc928968 in mongo::(anonymous namespace)::endProcessWithSignal(int) ()
#2  0x000055a8bc929c26 in mongo::(anonymous namespace)::myTerminate() ()
#3  0x000055a8bcab46a6 in __cxxabiv1::__terminate(void (*)()) ()
#4  0x000055a8bcab46e1 in std::terminate() ()
#5  0x000055a8b9a11508 in mongo::ThreadPool::Impl::_startWorkerThread_inlock() [clone .cold.629] ()
#6  0x000055a8bc716f48 in mongo::ThreadPool::Impl::schedule(mongo::unique_function<void (mongo::Status)>) ()
#7  0x000055a8bc717083 in mongo::ThreadPool::schedule(mongo::unique_function<void (mongo::Status)>) ()
#8  0x000055a8ba726f20 in mongo::repl::OplogApplierImpl::_applyOplogBatch(mongo::OperationContext*, std::vector<mongo::repl::OplogEntry, std::allocator<mongo::repl::OplogEntry> >) ()
#9  0x000055a8ba725597 in mongo::repl::OplogApplierImpl::_run(mongo::repl::OplogBuffer*) ()
#10 0x000055a8ba7ca02b in auto mongo::unique_function<void (mongo::executor::TaskExecutor::CallbackArgs const&)>::makeImpl<mongo::repl::OplogApplier::startup()::{lambda(mongo::executor::TaskExecutor::CallbackArgs const&)#1}>(mongo::repl::OplogApplier::startup()::{lambda(mongo::executor::TaskExecutor::CallbackArgs const&)#1}&&)::SpecificImpl::call(mongo::executor::TaskExecutor::CallbackArgs const&) ()
#11 0x000055a8bc0f9ee0 in mongo::executor::ThreadPoolTaskExecutor::runCallback(std::shared_ptr<mongo::executor::ThreadPoolTaskExecutor::CallbackState>) ()
#12 0x000055a8bc0fa2e0 in auto mongo::unique_function<void (mongo::Status)>::makeImpl<mongo::executor::ThreadPoolTaskExecutor::scheduleIntoPool_inlock(std::__cxx11::list<std::shared_ptr<mongo::executor::ThreadPoolTaskExecutor::CallbackState>, std::allocator<std::shared_ptr<mongo::executor::ThreadPoolTaskExecutor::CallbackState> > >*, std::_List_iterator<std::shared_ptr<mongo::executor::ThreadPoolTaskExecutor::CallbackState> > const&, std::_List_iterator<std::shared_ptr<mongo::executor::ThreadPoolTaskExecutor::CallbackState> > const&, std::unique_lock<mongo::latch_detail::Latch>)::{lambda(auto:1)#3}>(mongo::executor::ThreadPoolTaskExecutor::scheduleIntoPool_inlock(std::__cxx11::list<std::shared_ptr<mongo::executor::ThreadPoolTaskExecutor::CallbackState>, std::allocator<std::shared_ptr<mongo::executor::ThreadPoolTaskExecutor::CallbackState> > >*, std::_List_iterator<std::shared_ptr<mongo::executor::ThreadPoolTaskExecutor::CallbackState> > const&, std::_List_iterator<std::shared_ptr<mongo::executor::ThreadPoolTaskExecutor::CallbackState> > const&, std::unique_lock<mongo::latch_detail::Latch>)::{lambda(auto:1)#3}&&)::SpecificImpl::call(mongo::Status&&) ()
#13 0x000055a8bc715a25 in mongo::ThreadPool::Impl::_doOneTask(std::unique_lock<mongo::latch_detail::Latch>*) ()
#14 0x000055a8bc71719b in mongo::ThreadPool::Impl::_consumeTasks() ()
#15 0x000055a8bc71865c in mongo::ThreadPool::Impl::_workerThreadBody(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#16 0x000055a8bc718bd0 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<mongo::stdx::thread::thread<mongo::ThreadPool::Impl::_startWorkerThread_inlock()::{lambda()#4}, , 0>(mongo::ThreadPool::Impl::_startWorkerThread_inlock()::{lambda()#4})::{lambda()#1}> > >::_M_run() ()
#17 0x000055a8bcad093f in execute_native_thread_routine ()
#18 0x00007f722da6e6db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#19 0x00007f722d79761f in clone () from /lib/x86_64-linux-gnu/libc.so.6

For SignalHandler I see:

Program terminated with signal SIGABRT, Aborted.
#0  0x00007fb7875b6817 in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
[Current thread is 1 (Thread 0x7fb7843b2700 (LWP 5666))]
(gdb) bt
#0  0x00007fb7875b6817 in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00005558680be968 in mongo::(anonymous namespace)::endProcessWithSignal(int) ()
#2  0x00005558680bfc26 in mongo::(anonymous namespace)::myTerminate() ()
#3  0x000055586824a6a6 in __cxxabiv1::__terminate(void (*)()) ()
#4  0x000055586824a6e1 in std::terminate() ()
#5  0x0000555864c5c494 in mongo::repl::ReplicationCoordinatorImpl::AutoGetRstlForStepUpStepDown::_startKillOpThread() [clone .cold.3979] ()
#6  0x00005558655bbf28 in mongo::repl::ReplicationCoordinatorImpl::AutoGetRstlForStepUpStepDown::AutoGetRstlForStepUpStepDown(mongo::repl::ReplicationCoordinatorImpl*, mongo::OperationContext*, mongo::repl::ReplicationCoordinator::OpsKillingStateTransitionEnum, mongo::Date_t) ()
#7  0x00005558655c8e10 in mongo::repl::ReplicationCoordinatorImpl::stepDown(mongo::OperationContext*, bool, mongo::Duration<std::ratio<1l, 1000l> > const&, mongo::Duration<std::ratio<1l, 1000l> > const&) ()
#8  0x000055586571c460 in mongo::stepDownForShutdown(mongo::OperationContext*, mongo::Duration<std::ratio<1l, 1000l> > const&, bool) ()
#9  0x0000555865413029 in mongo::(anonymous namespace)::shutdownTask(mongo::ShutdownTaskArgs const&) ()
#10 0x00005558680bb4a5 in mongo::(anonymous namespace)::runTasks(std::stack<mongo::unique_function<void (mongo::ShutdownTaskArgs const&)>, std::deque<mongo::unique_function<void (mongo::ShutdownTaskArgs const&)>, std::allocator<mongo::unique_function<void (mongo::ShutdownTaskArgs const&)> > > >, mongo::ShutdownTaskArgs const&) ()
#11 0x00005558651f2ad2 in mongo::shutdown(mongo::ExitCode, mongo::ShutdownTaskArgs const&) ()
#12 0x0000555866199b80 in mongo::(anonymous namespace)::signalProcessingThread(mongo::LogFileStatus) ()
#13 0x0000555866199d25 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<mongo::stdx::thread::thread<void (*)(mongo::LogFileStatus), mongo::LogFileStatus&, 0>(void (*)(mongo::LogFileStatus), mongo::LogFileStatus&)::{lambda()#1}> > >::_M_run() ()
#14 0x000055586826693f in execute_native_thread_routine ()
#15 0x00007fb7875ab6db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#16 0x00007fb7872d461f in clone () from /lib/x86_64-linux-gnu/libc.so.6

For Replica.xecutor core I only see ??? in the core

Do you know what it means and how we can prevent it?

kevinadi · November 22, 2022, 7:14am

Hi @Oded_Raiches

I have to reiterate that MongoDB was not designed to have storage pulled out under it while it’s running.

If the mongod process cannot access the disk because it was suddenly not available, either deliberately or accidentally, it is not surprising that it crashes.

how we can prevent it?

Please don’t disconnect the disk while the mongod process is running.

Best regards
Kevin

Oded_Raiches · November 30, 2022, 10:39am

Thanks for the reply @kevinadi .
So essentially your saying is that mongo can crash anywhere when it is having storage issues?
For storage systems this is pretty common for a disk to have issues and be needing replace.

One more example I got recently, is ftdc crash because of this.
Here is the stacktrace:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/bin/mongod --quiet --config /etc/mongod.conf'.
Program terminated with signal SIGABRT, Aborted.
#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7fcb4e290700 (LWP 2679))]
(gdb) bt
#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x0000562520b5b968 in mongo::(anonymous namespace)::endProcessWithSignal(int) ()
#2  0x0000562520b5cc26 in mongo::(anonymous namespace)::myTerminate() ()
#3  0x0000562520ce76a6 in __cxxabiv1::__terminate(void (*)()) ()
#4  0x0000562520d7c029 in __cxa_call_terminate ()
#5  0x0000562520ce7095 in __gxx_personality_v0 ()
#6  0x00007fcb59ec2573 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#7  0x00007fcb59ec2ad1 in _Unwind_RaiseException () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#8  0x0000562520ce7807 in __cxa_throw ()
#9  0x000056251dc78971 in mongo::error_details::throwExceptionForStatus(mongo::Status const&) ()
#10 0x000056251dc8d8c2 in mongo::uassertedWithLocation(mongo::Status const&, char const*, unsigned int) ()
#11 0x000056251d72b02e in mongo::FTDCController::doLoop() [clone .cold.495] ()
#12 0x000056251e170dec in std::thread::_State_impl<std::thread::_Invoker<std::tuple<mongo::stdx::thread::thread<mongo::FTDCController::start()::{lambda()#2}, , 0>(mongo::FTDCController::start()::{lambda()#2})::{lambda()#1}> > >::_M_run() ()
#13 0x0000562520d0393f in execute_native_thread_routine ()
#14 0x00007fcb59c9a6db in start_thread (arg=0x7fcb4e290700) at pthread_create.c:463
#15 0x00007fcb599c361f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb)

steevej · November 30, 2022, 2:41pm

Yes. Like all software that needs disks for permanent storage. Like a car that crashes when it looses a wheel.

The solution is RAID at the storage level and replication at the system architecture level.