Multiple core dumps

Hi, been working with mongo 6.0 for a few weeks now.
Using a sharded cluster configuration.
I observed a bunch of core dumps, but I was not able to inspect them with the mongod binary on the VM where they happened (core size was limited, so they read as garbage).
The core dump names are truncated, but I believe you can identify them:

  • Replica.xecutor
  • ShardRegistry
  • ShardRe.Updater
  • ReplCoord
  • Logical.cheReap

They all happened more or less at once:

Oct 18 04:01:10.526700 //Replica.xecutor--1666065669.gz
Oct 18 04:01:32.726745 //ShardRegistry--1666065691.gz
Oct 18 04:50:28.010586 //ShardRe.Updater--1666068627.gz
Oct 18 04:50:39.290355 //ReplCoord-16--1666068638.gz
Oct 18 04:52:04.376591 //Logical.cheReap--1666068723.gz
Oct 18 04:52:10.809420 //Logical.cheReap--1666068723.gz
Oct 18 04:52:26.634675 //ReplCoord-0--1666068720.gz
  • How can I debug these cores? mongod should be able to read them fine if the core size is not limited, right?
  • Is there a reason why they all crashed separately at the same time?
  • Might it be related to the replica set shutting down, or the primary being turned off for a short time?
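For reference, this is roughly how I'd expect to lift the core-size limit on Linux (a sketch; the systemd unit name and core path below are assumptions for a typical install):

```shell
# Check the current core-size limit of the running mongod
# ("unlimited" is needed for a complete, readable dump)
grep "core file" /proc/$(pidof mongod)/limits

# For a systemd-managed mongod (assumed unit name), add an override:
#   [Service]
#   LimitCORE=infinity
sudo systemctl edit mongod
sudo systemctl restart mongod

# Write cores to a known location, with binary name, PID, and timestamp
echo '/var/tmp/core.%e.%p.%t' | sudo tee /proc/sys/kernel/core_pattern
```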

Thanks!

Hi @Oded_Raiches

Could you provide more background details:

  • What’s the exact MongoDB version you’re using? 6.0.1? 6.0.2?
  • What’s the sharded cluster’s topology? How many shards, how many nodes in a shard replica set?
  • Details of the hardware you’re running it on, e.g. what’s the hardware spec, how much RAM per node/VM, etc.
  • Details of your deployment, e.g. what’s your OS and filesystem, is each node running inside its own VM, and does any mongod process share a node with anything else (e.g. another mongod process)?

And also about the event itself:

  • What’s in the logs of the crashed servers?
  • Timeline of the events; I assume there were multiple crashes?
  • Any pattern you can see when the crashes happened, e.g. time of day, workload, or any other details you can share?

This seems like a lot of questions, but the goal is to learn enough about the situation you’re facing, your hardware, and the timeline of events so that people can imitate your situation and reproduce the issue reliably.

Best regards
Kevin

Hi @kevinadi , thanks for the fast reply!

  • MongoDB version is 6.0.1
  • Not sure what you mean by “topology”, but we have one shard consisting of a 3-member replica set.
  • Hardware: 12 vCPUs + 46.8 GB RAM on each node
  • OS: Ubuntu 18.04 with an XFS filesystem. Each mongod process runs on a separate VM.

Logs around this time:
mongos logs:

{"t":{"$date":"2022-10-19T03:40:19.693+00:00"},"s":"I",  "c":"SH_REFR",  "id":4619901, "ctx":"CatalogCache-558","msg":"Refreshed cached collection","attr":{"namespace":"config.system.sessions","lookupSinceVersion":"1|1024||633ffefb43612c48dd1e4cdd||Timestamp(1665138427, 6)","newVersion":"{ chunkVersion: { t: Timestamp(1665138427, 6), e: ObjectId('633ffefb43612c48dd1e4cdd'), v: Timestamp(1, 1024) }, forcedRefreshSequenceNum: 2199, epochDisambiguatingSequenceNum: 2181 }","timeInStore":"{ chunkVersion: \"None\", forcedRefreshSequenceNum: 2198, epochDisambiguatingSequenceNum: 2180 }","durationMillis":6}}
{"t":{"$date":"2022-10-19T03:40:19.713+00:00"},"s":"I",  "c":"SH_REFR",  "id":4619901, "ctx":"CatalogCache-558","msg":"Refreshed cached collection","attr":{"namespace":"config.system.sessions","lookupSinceVersion":"1|1024||633ffefb43612c48dd1e4cdd||Timestamp(1665138427, 6)","newVersion":"{ chunkVersion: { t: Timestamp(1665138427, 6), e: ObjectId('633ffefb43612c48dd1e4cdd'), v: Timestamp(1, 1024) }, forcedRefreshSequenceNum: 2201, epochDisambiguatingSequenceNum: 2183 }","timeInStore":"{ chunkVersion: \"None\", forcedRefreshSequenceNum: 2200, epochDisambiguatingSequenceNum: 2182 }","durationMillis":4}}
{"t":{"$date":"2022-10-19T03:40:36.660+00:00"},"s":"W",  "c":"NETWORK",  "id":4615610, "ctx":"MirrorMaestro-8","msg":"Failed to check socket connectivity","attr":{"error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"}}}
{"t":{"$date":"2022-10-19T03:40:36.660+00:00"},"s":"I",  "c":"CONNPOOL", "id":22561,   "ctx":"MirrorMaestro-8","msg":"Dropping unhealthy pooled connection","attr":{"hostAndPort":"10.3.14.67:7501"}}
{"t":{"$date":"2022-10-19T03:40:36.660+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"MirrorMaestro","msg":"Connecting","attr":{"hostAndPort":"10.3.14.67:7501"}}
{"t":{"$date":"2022-10-19T03:41:54.107+00:00"},"s":"I",  "c":"SH_REFR",  "id":4619901, "ctx":"CatalogCache-559","msg":"Refreshed cached collection","attr":{"namespace":"config.system.sessions","lookupSinceVersion":"1|1024||633ffefb43612c48dd1e4cdd||Timestamp(1665138427, 6)","newVersion":"{ chunkVersion: { t: Timestamp(1665138427, 6), e: ObjectId('633ffefb43612c48dd1e4cdd'), v: Timestamp(1, 1024) }, forcedRefreshSequenceNum: 2203, epochDisambiguatingSequenceNum: 2185 }","timeInStore":"{ chunkVersion: \"None\", forcedRefreshSequenceNum: 2202, epochDisambiguatingSequenceNum: 2184 }","durationMillis":5}}
{"t":{"$date":"2022-10-19T03:41:54.116+00:00"},"s":"I",  "c":"SH_REFR",  "id":4619901, "ctx":"CatalogCache-559","msg":"Refreshed cached collection","attr":{"namespace":"config.system.sessions","lookupSinceVersion":"1|1024||633ffefb43612c48dd1e4cdd||Timestamp(1665138427, 6)","newVersion":"{ chunkVersion: { t: Timestamp(1665138427, 6), e: ObjectId('633ffefb43612c48dd1e4cdd'), v: Timestamp(1, 1024) }, forcedRefreshSequenceNum: 2205, epochDisambiguatingSequenceNum: 2187 }","timeInStore":"{ chunkVersion: \"None\", forcedRefreshSequenceNum: 2204, epochDisambiguatingSequenceNum: 2186 }","durationMillis":4}}

Unfortunately, I don’t have data about what occurred at that time (the logs got rolled), but I got a hint that the storage holding the database might have disconnected for some time.

Hi @Oded_Raiches

Unfortunately we need the logs from the crashed mongod processes rather than the mongos logs, since the mongos process has no idea what’s going on inside the shards themselves.

You also mentioned:

but I got a hint that the storage holding the database might have disconnected for some time.

If I understand correctly, your setup has storage external to the nodes, and the VM lost its connection to the storage node. Is this accurate? If so, this is like pulling the hard drive out of a running PC. It’s best to double-check that no data was lost or corrupted.

I would verify that your deployment follows the settings put forth in the production notes to ensure best performance and reliability.
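For example, a couple of the production-notes checks can be run quickly from a shell (the dbPath below is an assumption; substitute your own):

```shell
# Transparent huge pages should be disabled ("never" selected)
cat /sys/kernel/mm/transparent_hugepage/enabled

# The data volume should be XFS, as recommended for WiredTiger
df -T /var/lib/mongodb    # assumed dbPath; check your mongod.conf
```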

Best regards,
Kevin


Hi @kevinadi, I didn’t notice any data corruption, but when the storage is disconnected we get these cores.
Is there a way I can share the logs and cores with you? I’m not really able to read the cores; at first I thought they were corrupt, but I’m probably not using the right binaries.
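For completeness, this is how I’ve been trying to read them (my understanding is that a core must be opened with the exact mongod build that produced it; the paths below are examples):

```shell
# A core is only readable with the exact binary (same version and
# git hash) that produced it; otherwise gdb shows garbage or ??? frames.
mongod --version                              # note version and git hash

# Open the core with that binary (example paths)
gdb /usr/bin/mongod /var/tmp/core.mongod.12345

# At the (gdb) prompt:
#   bt                    backtrace of the crashing thread
#   thread apply all bt   backtraces of every thread
```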

Hi @kevinadi , I observed some crashes again.
What happened here is that a storage drive was replaced.

  • The OplogApplier core says:
#0  0x00007f722da79817 in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x000055a8bc928968 in mongo::(anonymous namespace)::endProcessWithSignal(int) ()
#2  0x000055a8bc929c26 in mongo::(anonymous namespace)::myTerminate() ()
#3  0x000055a8bcab46a6 in __cxxabiv1::__terminate(void (*)()) ()
#4  0x000055a8bcab46e1 in std::terminate() ()
#5  0x000055a8b9a11508 in mongo::ThreadPool::Impl::_startWorkerThread_inlock() [clone .cold.629] ()
#6  0x000055a8bc716f48 in mongo::ThreadPool::Impl::schedule(mongo::unique_function<void (mongo::Status)>) ()
#7  0x000055a8bc717083 in mongo::ThreadPool::schedule(mongo::unique_function<void (mongo::Status)>) ()
#8  0x000055a8ba726f20 in mongo::repl::OplogApplierImpl::_applyOplogBatch(mongo::OperationContext*, std::vector<mongo::repl::OplogEntry, std::allocator<mongo::repl::OplogEntry> >) ()
#9  0x000055a8ba725597 in mongo::repl::OplogApplierImpl::_run(mongo::repl::OplogBuffer*) ()
#10 0x000055a8ba7ca02b in auto mongo::unique_function<void (mongo::executor::TaskExecutor::CallbackArgs const&)>::makeImpl<mongo::repl::OplogApplier::startup()::{lambda(mongo::executor::TaskExecutor::CallbackArgs const&)#1}>(mongo::repl::OplogApplier::startup()::{lambda(mongo::executor::TaskExecutor::CallbackArgs const&)#1}&&)::SpecificImpl::call(mongo::executor::TaskExecutor::CallbackArgs const&) ()
#11 0x000055a8bc0f9ee0 in mongo::executor::ThreadPoolTaskExecutor::runCallback(std::shared_ptr<mongo::executor::ThreadPoolTaskExecutor::CallbackState>) ()
#12 0x000055a8bc0fa2e0 in auto mongo::unique_function<void (mongo::Status)>::makeImpl<mongo::executor::ThreadPoolTaskExecutor::scheduleIntoPool_inlock(std::__cxx11::list<std::shared_ptr<mongo::executor::ThreadPoolTaskExecutor::CallbackState>, std::allocator<std::shared_ptr<mongo::executor::ThreadPoolTaskExecutor::CallbackState> > >*, std::_List_iterator<std::shared_ptr<mongo::executor::ThreadPoolTaskExecutor::CallbackState> > const&, std::_List_iterator<std::shared_ptr<mongo::executor::ThreadPoolTaskExecutor::CallbackState> > const&, std::unique_lock<mongo::latch_detail::Latch>)::{lambda(auto:1)#3}>(mongo::executor::ThreadPoolTaskExecutor::scheduleIntoPool_inlock(std::__cxx11::list<std::shared_ptr<mongo::executor::ThreadPoolTaskExecutor::CallbackState>, std::allocator<std::shared_ptr<mongo::executor::ThreadPoolTaskExecutor::CallbackState> > >*, std::_List_iterator<std::shared_ptr<mongo::executor::ThreadPoolTaskExecutor::CallbackState> > const&, std::_List_iterator<std::shared_ptr<mongo::executor::ThreadPoolTaskExecutor::CallbackState> > const&, std::unique_lock<mongo::latch_detail::Latch>)::{lambda(auto:1)#3}&&)::SpecificImpl::call(mongo::Status&&) ()
#13 0x000055a8bc715a25 in mongo::ThreadPool::Impl::_doOneTask(std::unique_lock<mongo::latch_detail::Latch>*) ()
#14 0x000055a8bc71719b in mongo::ThreadPool::Impl::_consumeTasks() ()
#15 0x000055a8bc71865c in mongo::ThreadPool::Impl::_workerThreadBody(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#16 0x000055a8bc718bd0 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<mongo::stdx::thread::thread<mongo::ThreadPool::Impl::_startWorkerThread_inlock()::{lambda()#4}, , 0>(mongo::ThreadPool::Impl::_startWorkerThread_inlock()::{lambda()#4})::{lambda()#1}> > >::_M_run() ()
#17 0x000055a8bcad093f in execute_native_thread_routine ()
#18 0x00007f722da6e6db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#19 0x00007f722d79761f in clone () from /lib/x86_64-linux-gnu/libc.so.6
  • For the SignalHandler core I see:
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fb7875b6817 in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
[Current thread is 1 (Thread 0x7fb7843b2700 (LWP 5666))]
(gdb) bt
#0  0x00007fb7875b6817 in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00005558680be968 in mongo::(anonymous namespace)::endProcessWithSignal(int) ()
#2  0x00005558680bfc26 in mongo::(anonymous namespace)::myTerminate() ()
#3  0x000055586824a6a6 in __cxxabiv1::__terminate(void (*)()) ()
#4  0x000055586824a6e1 in std::terminate() ()
#5  0x0000555864c5c494 in mongo::repl::ReplicationCoordinatorImpl::AutoGetRstlForStepUpStepDown::_startKillOpThread() [clone .cold.3979] ()
#6  0x00005558655bbf28 in mongo::repl::ReplicationCoordinatorImpl::AutoGetRstlForStepUpStepDown::AutoGetRstlForStepUpStepDown(mongo::repl::ReplicationCoordinatorImpl*, mongo::OperationContext*, mongo::repl::ReplicationCoordinator::OpsKillingStateTransitionEnum, mongo::Date_t) ()
#7  0x00005558655c8e10 in mongo::repl::ReplicationCoordinatorImpl::stepDown(mongo::OperationContext*, bool, mongo::Duration<std::ratio<1l, 1000l> > const&, mongo::Duration<std::ratio<1l, 1000l> > const&) ()
#8  0x000055586571c460 in mongo::stepDownForShutdown(mongo::OperationContext*, mongo::Duration<std::ratio<1l, 1000l> > const&, bool) ()
#9  0x0000555865413029 in mongo::(anonymous namespace)::shutdownTask(mongo::ShutdownTaskArgs const&) ()
#10 0x00005558680bb4a5 in mongo::(anonymous namespace)::runTasks(std::stack<mongo::unique_function<void (mongo::ShutdownTaskArgs const&)>, std::deque<mongo::unique_function<void (mongo::ShutdownTaskArgs const&)>, std::allocator<mongo::unique_function<void (mongo::ShutdownTaskArgs const&)> > > >, mongo::ShutdownTaskArgs const&) ()
#11 0x00005558651f2ad2 in mongo::shutdown(mongo::ExitCode, mongo::ShutdownTaskArgs const&) ()
#12 0x0000555866199b80 in mongo::(anonymous namespace)::signalProcessingThread(mongo::LogFileStatus) ()
#13 0x0000555866199d25 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<mongo::stdx::thread::thread<void (*)(mongo::LogFileStatus), mongo::LogFileStatus&, 0>(void (*)(mongo::LogFileStatus), mongo::LogFileStatus&)::{lambda()#1}> > >::_M_run() ()
#14 0x000055586826693f in execute_native_thread_routine ()
#15 0x00007fb7875ab6db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#16 0x00007fb7872d461f in clone () from /lib/x86_64-linux-gnu/libc.so.6
  • For the Replica.xecutor core I only see ??? frames in the backtrace.

Do you know what this means and how we can prevent it?

Hi @Oded_Raiches

I have to reiterate that MongoDB was not designed to have its storage pulled out from under it while running.

If the mongod process cannot access the disk because it suddenly became unavailable, whether deliberately or accidentally, it is not surprising that it crashes.

how we can prevent it?

Please don’t disconnect the disk while the mongod process is running.

Best regards
Kevin