@MaBeuLux88 Thanks so much for your previous reply. You mentioned that the lack of { state : 1 } for rs1 could indicate that the mongod process is not running with --shardsvr. However, looking at the mongod.conf files, the mongod processes on these shards do have shardsvr specified:
storage:
  dbPath: /mongodb/data
  journal:
    enabled: true
systemLog:
  destination: file
  logAppend: true
  logRotate: reopen
  path: /mongodb/log/mongod.log
net:
  port: 27018
  bindIp: 0.0.0.0
processManagement:
  timeZoneInfo: /usr/share/zoneinfo
  fork: true
replication:
  replSetName: rs1
sharding:
  clusterRole: shardsvr
We tried to research this but couldn’t find anything conclusive about the “state” flag. We seem to recall that a shard upgraded from an older version of MongoDB might not show the “state” value correctly, but that it was a benign issue.
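For reference, this is roughly how we have been checking the shard metadata (a sketch, run from mongos; the rs1 document is the one that does not show { state : 1 }):
// Inspect the shard registry documents in the config database
use config
db.shards.find().pretty()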
When examining the current ops for the db, we can see that there is a moveChunk operation that has been running since Nov 30:
{
"shard" : "rs1",
…
"active" : true,
"currentOpTime" : "2022-12-08T18:10:24.606+0000",
"opid" : "rs1:767299796",
"secs_running" : NumberLong(673411),
"microsecs_running" : NumberLong("673411012241"),
"op" : "command",
"ns" : "admin.$cmd",
"command" : {
"moveChunk" : "mydatabase.mycollection",
"shardVersion" : [
Timestamp(22, 1),
ObjectId("5b3bf4a351bf517cc03596ce")
],
"epoch" : ObjectId("5b3bf4a351bf517cc03596ce"),
"configdb" : "rsconfig/config1:27019,config2:27019,config3:27019",
"fromShard" : "rs1",
"toShard" : "rs2",
"min" : {
"userId" : NumberLong("-9215713443395991186")
},
"max" : {
"userId" : NumberLong("-9214977367518352602")
},
"maxChunkSizeBytes" : NumberLong(67108864),
"waitForDelete" : false,
"takeDistLock" : false,
"$clusterTime" : {
"clusterTime" : Timestamp(1669849613, 163),
"signature" : {
"hash" : BinData(0,"6VoraWaZJWlSL5Er5dML0dCBvok="),
"keyId" : NumberLong("7122446341149558275")
}
},
"$configServerState" : {
"opTime" : {
"ts" : Timestamp(1669849613, 163),
"t" : NumberLong(16)
}
},
"$db" : "admin"
},
"msg" : "step 3 of 6",
"numYields" : 1213,
"locks" : {
},
"waitingForLock" : false,
"lockStats" : {
"Global" : {
"acquireCount" : {
"r" : NumberLong(2437),
"w" : NumberLong(3)
}
},
"Database" : {
"acquireCount" : {
"r" : NumberLong(1217),
"w" : NumberLong(3)
}
},
"Collection" : {
"acquireCount" : {
"r" : NumberLong(1217),
"W" : NumberLong(1)
},
"acquireWaitCount" : {
"W" : NumberLong(1)
},
"timeAcquiringMicros" : {
"W" : NumberLong(266342)
}
},
"oplog" : {
"acquireCount" : {
"w" : NumberLong(2)
}
}
}
}
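For context, we pulled the operation above from mongos with an aggregation over $currentOp, roughly like this sketch (the exact filter we used may have differed slightly):
// Run against the admin database on mongos: list active moveChunk commands across the shards
db.getSiblingDB("admin").aggregate([
    { $currentOp: { allUsers: true } },
    { $match: { active: true, "command.moveChunk": { $exists: true } } }
]).forEach(printjson)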
It’s unclear what is preventing this operation from completing. At this point it doesn’t seem to simply be taking a long time; it has been sitting at “step 3 of 6” for over a week, so it looks stuck. Some clarity on this would be helpful.
In the config.locks collection we can see this lock held on the collection being migrated:
{
"_id" : "mydatabase.mycollection",
"state" : 2,
"process" : "ConfigServer",
"ts" : ObjectId("6269bc1620b5633916ac3f46"),
"when" : ISODate("2022-11-30T23:06:53.590Z"),
"who" : "ConfigServer:Balancer",
"why" : "Migrating chunk(s) in collection mydatabase.mycollection"
}
And the config.migrations collection shows this entry:
{
"_id" : "mydatabase.mycollection-userId_-9215713443395991186",
"ns" : "mydatabase.mycollection",
"min" : {
"userId" : NumberLong(-9215713443395991186)
},
"max" : {
"userId" : NumberLong(-9214977367518352602)
},
"fromShard" : "rs1",
"toShard" : "rs2",
"chunkVersion" : [
Timestamp(22, 1),
ObjectId("5b3bf4a351bf517cc03596ce")
],
"waitForDelete" : false
}
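Both of these documents came from straightforward reads of the config database, along these lines (a sketch, run from mongos):
// Distributed lock and pending-migration documents for the collection
use config
db.locks.find({ _id: "mydatabase.mycollection" }).pretty()
db.migrations.find({ ns: "mydatabase.mycollection" }).pretty()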
We were originally planning to reboot all of the shard and config machines to try to get things moving again, but we weren’t sure of the consequences of doing that and didn’t want our data to end up in an invalid state. Attempting to stop the balancer first times out with this error:
mongos> sh.stopBalancer()
2022-12-08T10:32:47.860-0800 E QUERY [js] uncaught exception: Error: command failed: {
"ok" : 0,
"errmsg" : "Operation timed out",
"code" : 202,
"codeName" : "NetworkInterfaceExceededTimeLimit",
"operationTime" : Timestamp(1670524367, 1917),
"$clusterTime" : {
"clusterTime" : Timestamp(1670524367, 1917),
"signature" : {
"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
"keyId" : NumberLong(0)
}
}
} :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
doassert@src/mongo/shell/assert.js:18:14
_assertCommandWorked@src/mongo/shell/assert.js:583:17
assert.commandWorked@src/mongo/shell/assert.js:673:16
sh.stopBalancer@src/mongo/shell/utils_sh.js:177:12
@(shell):1:1
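Would read-only checks like the following be useful for diagnosis before we try anything more invasive? (Just a sketch of what we were considering; we haven’t gone further than this.)
// From mongos: overall balancer state and whether a migration is reported in flight
sh.getBalancerState()
db.adminCommand({ balancerStatus: 1 })
// Balancer settings document in the config database
db.getSiblingDB("config").settings.find({ _id: "balancer" }).pretty()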
At this point we are a bit at a loss as to what to do. We want to get the balancer working again so that we can actually get the benefit of the second shard, and also establish a balancer window. Any help or recommendations would be much appreciated.