Mongodump of collection fails with a HostUnreachable caused by "node is not in primary or recovering state"

Hello Mongo-Deities, I have a 7-node MongoDB (6.x) setup on AWS EC2 (RHEL) that contains 2 shards, on a pretty sturdy machine class with lots of processors/RAM/disk. My goal is to convert one collection into a time-series collection, so I'm dumping ~2.3 TB of data out of it with mongodump (mongotools 100.9.3). I'm trying to be smart by slicing the dump into date ranges and writing the archive to a data drive separate from my main data drives, but after 200-300K documents I get this error:

"Failed: error writing data for collection weatherobservation.weatherobservation to disk: error reading collection: (HostUnreachable) Error on remote shard mongodb6-internal-hostname:27021 :: caused by :: node is not in primary or recovering state"

The command I used was (from my mongodb1 host connecting to “mongos” like a good boy):

"*mongodump --uri "mongodb://admin@localhost:27018/weatherobservation?authSource=test" --collection=weatherobservation --archive=/data3temp/decnov2023.archive --query='{"createdAtMillis": {"$gte":1698796800000, "$lte":1703375999000}}'*"

It is worth mentioning that I do have mongos running alongside the replica set members on all 7 EC2 instances (which seems to work really well). It is also worth mentioning that this isn't a particularly busy Mongo environment. I did look at the health and status of the cluster: rs.status() reports that everything is fine, both shards are healthy, and the hosts are all happy and visible to each other. Initially I thought it might be a hostname resolution issue (an inability to resolve the hosts quickly), so I hard-wired those hosts into /etc/hosts, but that didn't help. I also stopped the balancer, just in case, to remove more moving parts from the equation.
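For the record, the health checks and the balancer stop were just the standard mongosh commands, nothing exotic (shown here from memory):

// against the mongos on 27018
sh.status()            // both shards listed and reachable
sh.stopBalancer()      // pause chunk migrations while the dump runs
sh.getBalancerState()  // should now return false

// against each shard replica set
rs.status()            // every member PRIMARY/SECONDARY, healthy heartbeats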

So, I followed the rabbit trail to the server that produced the error and noticed this NetworkInterfaceExceededTimeLimit entry in the log file:

"t":{"$date":"2024-02-16T17:09:08.428Z"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Host failed in replica set","attr":{"replicaSet":"mongodb1_RS","host":"mongodb6-internal-hostname:27021","error":{"code":202,"codeName":"NetworkInterfaceExceededTimeLimit","errmsg":"Request 41087450 timed out, deadline was 2024-02-16T1
7:09:05.269+00:00, op was RemoteCommand 41087450 -- target:[mongodb6-internal-hostname:27021] db:admin expDate:2024-02-16T17:09:05.269+00:00 cmd:{ hello: 1, maxAwaitTimeMS: 10000, topologyVersion: { processId: ObjectId('6598703f00c0630a11c162ca'), counter: 57 }, internalClient: { minWireVersion: 17, maxWireVersion: 17 } }"},"action":{"dropConnections":false,"re
questImmediateCheck":false,"outcome":{"host":"mongodb6-internal-hostname:27021","success":false,"errorMessage":"NetworkInterfaceExceededTimeLimit: Request 41087450 timed out, deadline was 2024-02-16T17:09:05.269+00:00, op was RemoteCommand 41087450 -- target:[mongodb6-internal-hostname:27021] db:admin expDate:2024-02-16T17:09:05.269+00:00 cmd:{ hello: 1, maxAwa
itTimeMS: 10000, topologyVersion: { processId: ObjectId('6598703f00c0630a11c162ca'), counter: 57 }, internalClient: { minWireVersion: 17, maxWireVersion: 17 } }"}}}}

And below that was an interesting RSM Topology Change…which seems to hint that the election process is about to happen(?):

"s":"I", "c":"NETWORK", "id":4333213, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"RSM Topology Change"

Then it proceeds to ping all the nodes in the cluster for status, an election starts, Mongo5 takes over as the new primary, and then I get the error stated in the topic heading.

Soooo, in the original mongodump command, should I be specifying a read preference for a secondary (a read-only replica)… or something? I'm wondering why the primary (originally Mongo6) was consulted in the first place. Hopefully someone has seen this issue and can drop some knowledge on the topic.
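(What I had in mind, if it's even the right lever: mongodump has a --readPreference option, and the read preference can also go in the connection string, so something like the following should ask mongos to route the reads to secondaries. I haven't confirmed it avoids this error.)

# same dump as above, but with reads routed to secondaries via read preference
mongodump --uri "mongodb://admin@localhost:27018/weatherobservation?authSource=test" \
  --readPreference=secondaryPreferred \
  --collection=weatherobservation \
  --archive=/data3temp/decnov2023.archive \
  --query='{"createdAtMillis": {"$gte":1698796800000, "$lte":1703375999000}}'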

After much rumination, and based on what the servers are reporting, I think the dump is starving resources enough to push the hosts into heavy swap usage, which is destabilizing the cluster (to the point where I need to reboot).
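If it helps with diagnosis: the swap observation comes from the usual OS tools on the shard hosts while the dump is running, along the lines of:

free -h                                  # swap usage climbing steadily during the dump
vmstat 5                                 # sustained non-zero si/so (swap-in/swap-out) columns
sudo dmesg -T | grep -i "out of memory"  # whether the OOM killer has fired on any mongod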