Hi there,
Quick question about the ConfigServerCatalogCacheLoader on a newly elected PRIMARY shard in a sharded deployment running MongoDB 4.0.25.
Question:
Consider a sharded deployment with these pieces (along with mongos & other shards):
fed-db-mongo-ram-configsvr-0
fed-db-mongo-ram-configsvr-1
fed-db-mongo-ram-configsvr-2
fed-db-mongo-ram-shard2-0
fed-db-mongo-ram-shard2-1
fed-db-mongo-ram-shard2-2
The logs below show that ConfigServerCatalogCacheLoader-0 on the newly elected PRIMARY shard2-2 is connected to SECONDARY configsvr-1, even though configsvr-0 is PRIMARY. This seems odd to us. Is it really expected?
It also appears to have bad side effects ("Failed to refresh metadata…") when that SECONDARY, configsvr-1, is then rebooted.
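For reference, my (possibly wrong) understanding is that shard-side catalog cache refreshes read the sharding metadata from the config server replica set with a "nearest" read preference rather than explicitly targeting the PRIMARY, which would explain the loader picking configsvr-1. A rough way to reproduce that style of read by hand from the mongo shell, using the database name that shows up in the logs below (this is only an illustration, not the actual internal code path):

db.getSiblingDB("config").databases.find({ _id: "upf_UPF-CL11" }).readPref("nearest")

Against the cnfRepSet connection string, a read like this is allowed to land on any healthy member, including a SECONDARY.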
Logs:
All logs below are from the perspective of shard2-2:
//
// At this point:
//
// fed-db-mongo-ram-configsvr-0 == PRIMARY
// fed-db-mongo-ram-configsvr-1 == SECONDARY
// fed-db-mongo-ram-configsvr-2 == SECONDARY
//
// fed-db-mongo-ram-shard2-0 == SECONDARY
// fed-db-mongo-ram-shard2-1 == PRIMARY
// fed-db-mongo-ram-shard2-2 == SECONDARY
//
// Where shard2-2 properly understands that configsvr-0 is PRIMARY, and configsvr-1 is SECONDARY:
//
2022-04-03T02:34:13.200+0000 I NETWORK [ReplicaSetMonitor-TaskExecutor-18] Updating host fed-db-mongo-ram-configsvr-0.fed-db-mongo-ram-configsvr-pods.fed-db-mongo-ram.svc.cluster.local:27017 state based on ismaster reply from ReplicaSet cnfRepSet. Self is primary: 1, primary is: fed-db-mongo-ram-configsvr-0.fed-db-mongo-ram-configsvr-pods.fed-db-mongo-ram.svc.cluster.local:27017, election ID: 7fffffff0000000000000008, ReplicaSet version: 3
2022-04-03T02:34:13.201+0000 I NETWORK [ReplicaSetMonitor-TaskExecutor-18] Updating host fed-db-mongo-ram-configsvr-1.fed-db-mongo-ram-configsvr-pods.fed-db-mongo-ram.svc.cluster.local:27017 state based on ismaster reply from ReplicaSet cnfRepSet. Self is primary: 0, primary is: fed-db-mongo-ram-configsvr-0.fed-db-mongo-ram-configsvr-pods.fed-db-mongo-ram.svc.cluster.local:27017, election ID: 000000000000000000000000, ReplicaSet version: 3
2022-04-03T02:34:13.201+0000 I NETWORK [ReplicaSetMonitor-TaskExecutor-18] Updating host fed-db-mongo-ram-configsvr-2.fed-db-mongo-ram-configsvr-pods.fed-db-mongo-ram.svc.cluster.local:27017 state based on ismaster reply from ReplicaSet cnfRepSet. Self is primary: 0, primary is: fed-db-mongo-ram-configsvr-0.fed-db-mongo-ram-configsvr-pods.fed-db-mongo-ram.svc.cluster.local:27017, election ID: 000000000000000000000000, ReplicaSet version: 3
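//
// Side note: to see the same fields the ReplicaSetMonitor is parsing in the three
// lines above, isMaster can be run directly against any config member:
//
db.adminCommand({ isMaster: 1 })
//
// As far as I understand them, the relevant reply fields are "ismaster" (true only on
// the PRIMARY), "primary" (which member the node believes is PRIMARY), "electionId"
// (reported by the PRIMARY, hence the zero election ID logged for the secondaries) and
// "setVersion" (the "ReplicaSet version: 3" above).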
//
// At this point:
//
// fed-db-mongo-ram-configsvr-0 == PRIMARY
// fed-db-mongo-ram-configsvr-1 == SECONDARY
// fed-db-mongo-ram-configsvr-2 == SECONDARY
//
// fed-db-mongo-ram-shard2-0 == SECONDARY
// fed-db-mongo-ram-shard2-1 == PRIMARY ---> reboot
// fed-db-mongo-ram-shard2-2 == SECONDARY ---> stepUp ---> PRIMARY
//
// Where shard2-2 is quickly elected PRIMARY via "stepUp" and is ready within ~1 second.
//
2022-04-03T02:34:17.515+0000 I COMMAND [conn1] Received replSetStepUp request
2022-04-03T02:34:18.518+0000 I REPL [rsSync-0] transition to primary complete; database writes are now permitted
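//
// For reference, the "Received replSetStepUp request" above corresponds to the
// replSetStepUp admin command being issued against shard2-2, i.e. roughly:
//
db.adminCommand({ replSetStepUp: 1 })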
//
// At this point, configsvr-1 is rebooted:
//
// fed-db-mongo-ram-configsvr-0 == PRIMARY
// fed-db-mongo-ram-configsvr-1 == SECONDARY ---> reboot
// fed-db-mongo-ram-configsvr-2 == SECONDARY
//
// fed-db-mongo-ram-shard2-0 == SECONDARY
// fed-db-mongo-ram-shard2-1 == PRIMARY ---> still rebooting
// fed-db-mongo-ram-shard2-2 == PRIMARY
//
// Is it really expected that ConfigServerCatalogCacheLoader-0 was connected to SECONDARY configsvr-1,
// given that configsvr-0 is the PRIMARY?
//
2022-04-03T02:34:18.522+0000 I CONNPOOL [ShardRegistry] dropping unhealthy pooled connection to fed-db-mongo-ram-configsvr-1.fed-db-mongo-ram-configsvr-pods.fed-db-mongo-ram.svc.cluster.local:27017
2022-04-03T02:34:18.522+0000 I ASIO [ShardRegistry] Connecting to fed-db-mongo-ram-configsvr-1.fed-db-mongo-ram-configsvr-pods.fed-db-mongo-ram.svc.cluster.local:27017
2022-04-03T02:34:38.522+0000 I ASIO [ShardRegistry] Connecting to fed-db-mongo-ram-configsvr-1.fed-db-mongo-ram-configsvr-pods.fed-db-mongo-ram.svc.cluster.local:27017
2022-04-03T02:34:38.522+0000 I NETWORK [ConfigServerCatalogCacheLoader-0] Marking host fed-db-mongo-ram-configsvr-1.fed-db-mongo-ram-configsvr-pods.fed-db-mongo-ram.svc.cluster.local:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2022-04-03T02:34:38.522+0000 I SHARDING [ConfigServerCatalogCacheLoader-0] Operation timed out :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2022-04-03T02:34:38.522+0000 I SH_REFR [ConfigServerCatalogCacheLoader-0] Refresh for database upf_UPF-CL11 took 20000 ms and failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2022-04-03T02:34:38.523+0000 I ASIO [ShardRegistry] Failed to connect to fed-db-mongo-ram-configsvr-1.fed-db-mongo-ram-configsvr-pods.fed-db-mongo-ram.svc.cluster.local:27017 - HostUnreachable: Error connecting to fed-db-mongo-ram-configsvr-1.fed-db-mongo-ram-configsvr-pods.fed-db-mongo-ram.svc.cluster.local:27017 :: caused by :: Could not find address for fed-db-mongo-ram-configsvr-1.fed-db-mongo-ram-configsvr-pods.fed-db-mongo-ram.svc.cluster.local:27017: SocketException: Host not found (authoritative)
2022-04-03T02:34:38.523+0000 I CONNPOOL [ShardRegistry] Dropping all pooled connections to fed-db-mongo-ram-configsvr-1.fed-db-mongo-ram-configsvr-pods.fed-db-mongo-ram.svc.cluster.local:27017 due to HostUnreachable: Error connecting to fed-db-mongo-ram-configsvr-1.fed-db-mongo-ram-configsvr-pods.fed-db-mongo-ram.svc.cluster.local:27017 :: caused by :: Could not find address for fed-db-mongo-ram-configsvr-1.fed-db-mongo-ram-configsvr-pods.fed-db-mongo-ram.svc.cluster.local:27017: SocketException: Host not found (authoritative)
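//
// If useful: which config hosts the shard still holds pooled connections to can be
// checked on shard2-2 with connPoolStats; I would expect its per-host output to still
// list configsvr-1 at this point, though I did not capture that output for this incident:
//
db.adminCommand({ connPoolStats: 1 })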
//
// At this point, configsvr-1 is still rebooting (above):
//
// fed-db-mongo-ram-configsvr-0 == PRIMARY
// fed-db-mongo-ram-configsvr-1 == SECONDARY ---> rebooting
// fed-db-mongo-ram-configsvr-2 == SECONDARY
//
// fed-db-mongo-ram-shard2-0 == SECONDARY
// fed-db-mongo-ram-shard2-1 == PRIMARY ---> still rebooting
// fed-db-mongo-ram-shard2-2 == PRIMARY
//
// For about 0.2 seconds there is a series of these logs; is this still trying to connect to the dead SECONDARY configsvr-1?
//
2022-04-03T02:34:38.535+0000 I SHARDING [conn543] Failed to refresh metadata for collection...blah... :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2022-04-03T02:34:38.535+0000 I SHARDING [conn543] Failed to handle stale version exception :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
... repeat ...
2022-04-03T02:34:38.550+0000 I SH_REFR [ConfigServerCatalogCacheLoader-0] Refresh for collection ...blah... to version 1|5||6248fb74821589a79f4590d9 took 2 ms
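//
// For completeness, the version the refresh finally landed on (1|5||6248fb74821589a79f4590d9)
// can be cross-checked against the sharding catalog itself, e.g. via mongos; the namespace
// is a placeholder here since it is truncated in the log line above:
//
db.getSiblingDB("config").collections.find({ _id: "<db>.<collection>" })
db.getSiblingDB("config").chunks.find({ ns: "<db>.<collection>" }).sort({ lastmod: -1 }).limit(1)
//
// ("lastmodEpoch" in config.collections is the epoch part, and the highest chunk "lastmod"
// is the 1|5 major|minor part of that version.)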