Troubleshoot or Stop Long-Running Shard Migration

Hi everyone, a few days ago we added a second shard to our database, which seemed to go pretty smoothly. This triggered the shard balancer to start doing some migrations.

However, a few days later, we’re not seeing much data actually move to the new shard:

shards:
        {  "_id" : "rs1",  "host" : "rs1/shard1a:27018,shard1b:27018,shard1c:27018" }
        {  "_id" : "rs2",  "host" : "rs2/shard2a:27018,shard2b:27018,shard2c:27018",  "state" : 1 }
autosplit:
        Currently enabled: yes
balancer:
        Currently enabled:  yes
        Currently running:  yes
        Collections with active migrations: 
                mydatabase.mycollection started at Wed Nov 30 2022 15:06:53 GMT-0800 (PST)
        Failed balancer rounds in last 5 attempts:  0
        Migration Results for the last 24 hours: 
                No recent migrations
databases:
        {  "_id" : "mydatabase",  "primary" : "rs1",  "partitioned" : true }
                mydatabase.mycollection
                        shard key: { "userId" : "hashed" }
                        unique: false
                        balancing: true
                        chunks:
                                rs1	44215
                                rs2	21

As you can see from the above output, around 44k chunks are on rs1 and only 21 chunks are on rs2. The chunk counts also haven't changed for quite a while.

And despite the balancer running, the "active migration" started two days ago and has neither failed nor produced any result.

Tailing the logs for the two shards and the config server didn't show any errors related to this either.

The database actively serves production traffic (user reads and writes). Would that cause migrations to not proceed?

Is there anything I can do to troubleshoot whether this migration is proceeding as expected, or if some issue has occurred that has halted it?

If the migration is stuck, is there any process to “stop and restart” the migration?

Ultimately, any insight would be much appreciated! We aren’t seeing any clear error or sign as to why it would be in this state for multiple days without much movement.


Hi @Clark_Kromenaker and welcome to the MongoDB Community :muscle: !

Just a random guess at this point but at least it’s a lead to explore.

I don't see "state" : 1 on rs1, so my guess is that rs1 isn't started with --shardsvr and isn't shard-aware.
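If you want to double-check quickly, one way (just a sketch; run it directly against an rs1 member rather than through mongos) is to ask the running process for its parsed startup options:

rs1:PRIMARY> db.adminCommand({ getCmdLineOpts: 1 }).parsed.sharding
// a shard-aware node should report { "clusterRole" : "shardsvr" } here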

Cheers,
Maxime.

@MaBeuLux88, thanks for the welcome and the quick reply!

That's a good point - I somehow missed that the process needs to be started with "shardsvr" to make it shard-aware. The newly added shard does have "state" set, but the original shard (which we deployed years ago) does not.

So, we’ll try to rectify that this coming week and see if it resolves the issue. Thanks for the tip!


@MaBeuLux88 Thanks so much for your previous reply. You mentioned that the lack of { state : 1 } for rs1 could indicate that the mongod process is not running with --shardsvr. However, looking at the mongod.conf files, the mongod processes on these shards do have shardsvr specified:

storage:
 dbPath: /mongodb/data
 journal:
  enabled: true

systemLog:
 destination: file
 logAppend: true
 logRotate: reopen
 path: /mongodb/log/mongod.log

net:
 port: 27018
 bindIp: 0.0.0.0

processManagement:
 timeZoneInfo: /usr/share/zoneinfo
 fork: true

replication:
 replSetName: rs1

sharding:
 clusterRole: shardsvr

We tried to research this but couldn't find anything definitive about the "state" field. We seem to recall that a shard upgraded from an older version of MongoDB might not show the "state" value correctly, but that it was considered a benign issue.
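For completeness, the raw documents behind the shards section of sh.status() live in config.shards and can be queried through mongos:

mongos> use config
switched to db config
mongos> db.shards.find()
// per the sh.status() output above, rs1's document presumably has no "state" field,
// while rs2's includes "state" : 1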

When examining the current ops for the db, we can see that there is a moveChunk operation that has been running since Nov 30:

{
			"shard" : "rs1",
			…
			"active" : true,
			"currentOpTime" : "2022-12-08T18:10:24.606+0000",
			"opid" : "rs1:767299796",
			"secs_running" : NumberLong(673411),
			"microsecs_running" : NumberLong("673411012241"),
			"op" : "command",
			"ns" : "admin.$cmd",
			"command" : {
				"moveChunk" : "mydatabase.mycollection",
				"shardVersion" : [
					Timestamp(22, 1),
					ObjectId("5b3bf4a351bf517cc03596ce")
				],
				"epoch" : ObjectId("5b3bf4a351bf517cc03596ce"),
				"configdb" : "rsconfig/config1:27019,config2:27019,config3:27019",
				"fromShard" : "rs1",
				"toShard" : "rs2",
				"min" : {
					"userId" : NumberLong("-9215713443395991186")
				},
				"max" : {
					"userId" : NumberLong("-9214977367518352602")
				},
				"maxChunkSizeBytes" : NumberLong(67108864),
				"waitForDelete" : false,
				"takeDistLock" : false,
				"$clusterTime" : {
					"clusterTime" : Timestamp(1669849613, 163),
					"signature" : {
						"hash" : BinData(0,"6VoraWaZJWlSL5Er5dML0dCBvok="),
						"keyId" : NumberLong("7122446341149558275")
					}
				},
				"$configServerState" : {
					"opTime" : {
						"ts" : Timestamp(1669849613, 163),
						"t" : NumberLong(16)
					}
				},
				"$db" : "admin"
			},
			"msg" : "step 3 of 6",
			"numYields" : 1213,
			"locks" : {

			},
			"waitingForLock" : false,
			"lockStats" : {
				"Global" : {
					"acquireCount" : {
						"r" : NumberLong(2437),
						"w" : NumberLong(3)
					}
				},
				"Database" : {
					"acquireCount" : {
						"r" : NumberLong(1217),
						"w" : NumberLong(3)
					}
				},
				"Collection" : {
					"acquireCount" : {
						"r" : NumberLong(1217),
						"W" : NumberLong(1)
					},
					"acquireWaitCount" : {
						"W" : NumberLong(1)
					},
					"timeAcquiringMicros" : {
						"W" : NumberLong(266342)
					}
				},
				"oplog" : {
					"acquireCount" : {
						"w" : NumberLong(2)
					}
				}
			}
		}

It's unclear what's preventing this operation from completing. At this point it doesn't seem likely that it's simply taking a long time; it appears to be stuck. Some clarity on this would be helpful.
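For reference, an op like this can be found with a currentOp filter along these lines (a sketch; the exact filter may need tweaking):

mongos> db.currentOp({ "command.moveChunk": { $exists: true } })
// or directly on the donor shard's primary:
rs1:PRIMARY> db.currentOp(true).inprog.filter(function (op) { return op.command && op.command.moveChunk; })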

We do see this lock in the config.locks collection for this collection:

{
  "_id" : "mydatabase.mycollection",
  "state" : 2,
  "process" : "ConfigServer",
  "ts" : ObjectId("6269bc1620b5633916ac3f46"),
  "when" : ISODate("2022-11-30T23:06:53.590Z"),
  "who" : "ConfigServer:Balancer",
  "why" : "Migrating chunk(s) in collection mydatabase.mycollection"
}

And the config.migrations collection also shows this:

{
  "_id" : "mydatabase.mycollection-userId_-9215713443395991186",
  "ns" : "mydatabase.mycollection",
  "min" : {
    "userId" : NumberLong(-9215713443395991186)
  },
  "max" : {
    "userId" : NumberLong(-9214977367518352602)
  },
  "fromShard" : "rs1",
  "toShard" : "rs2",
  "chunkVersion" : [ 
    Timestamp(22, 1), 
    ObjectId("5b3bf4a351bf517cc03596ce")
  ],
  "waitForDelete" : false
}
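(For reference, both kinds of documents can be queried through mongos with something along these lines:)

mongos> use config
mongos> db.locks.find({ "_id" : "mydatabase.mycollection" }).pretty()
mongos> db.migrations.find({ "ns" : "mydatabase.mycollection" }).pretty()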

We were originally planning to reboot all shard and config machines to try to get things moving again, but we weren't sure of the consequences of doing that and didn't want our data to end up in an invalid state. Meanwhile, attempting to stop the shard balancer times out with this error:

mongos> sh.stopBalancer()
2022-12-08T10:32:47.860-0800 E QUERY  [js] uncaught exception: Error: command failed: {
	"ok" : 0,
	"errmsg" : "Operation timed out",
	"code" : 202,
	"codeName" : "NetworkInterfaceExceededTimeLimit",
	"operationTime" : Timestamp(1670524367, 1917),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1670524367, 1917),
		"signature" : {
			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
			"keyId" : NumberLong(0)
		}
	}
} :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
doassert@src/mongo/shell/assert.js:18:14
_assertCommandWorked@src/mongo/shell/assert.js:583:17
assert.commandWorked@src/mongo/shell/assert.js:673:16
sh.stopBalancer@src/mongo/shell/utils_sh.js:177:12
@(shell):1:1

At this point we're at a bit of a loss. We want to get the shard balancer working again so that we can actually get the benefit of the second shard, and also establish a balancer window. Any help or recommendations would be very welcome.
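For context, the balancer window we'd like to establish would be the usual config.settings approach, roughly like this (the window times below are placeholders, not our intended values):

mongos> use config
mongos> db.settings.update(
    { _id: "balancer" },
    { $set: { activeWindow: { start: "01:00", stop: "05:00" } } },
    { upsert: true }
)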

Thanks for all the details, they're very helpful!

I'll get a few other sets of eyes on this, as it isn't triggering another "genius" idea on my end.

Could you give me a couple more pieces of information that might be useful?

  • MongoDB version?
  • Your cluster configuration (number of nodes, number of mongos, number of config servers, etc.).

Anything suspicious in the logs of any of these nodes?

Thanks,
Maxime.

Thanks so much, Maxime!

Sure thing. Our MongoDB version is 3.6.23.

We have two sharded clusters: one with user data that users regularly interact with (many reads and writes), and this one, which is more write-heavy as it's essentially an activity log that holds records of user actions.

As I mentioned, we just added this second shard to the activity database in order to scale for user growth. For this cluster, we're running 3 config machines, 2 router machines, and 2 shards, each a replica set with 2 secondaries. As seen in our sh.status():

{  "_id" : "rs1",  "host" : "rs1/shard1a:27018,shard1b:27018,shard1c:27018" }
{  "_id" : "rs2",  "host" : "rs2/shard2a:27018,shard2b:27018,shard2c:27018",  "state" : 1 }

Unfortunately, we didn’t see anything suspicious in the logs.

Just as an update here: we stepped down our primary config machine and that seemed to get some things unstuck. We're now able to successfully call sh.stopBalancer() and sh.startBalancer(), and the timestamp of the active migration has been updated:

  shards:
        {  "_id" : "rs1",  "host" : "rs1/shard1a:27018,shard1b:27018,shard1c:27018" }
        {  "_id" : "rs2",  "host" : "rs2/shard2a:27018,shard2b:27018,shard2c:27018",  "state" : 1 }
  active mongoses:
        "3.6.23" : 1
  autosplit:
        Currently enabled: yes
  balancer:
        Currently enabled:  yes
        Currently running:  no
        Collections with active migrations:
                mydatabase.mycollection started at Fri Dec 09 2022 10:47:11 GMT-0800 (PST)
        Failed balancer rounds in last 5 attempts:  0
        Migration Results for the last 24 hours:
                No recent migrations
  databases:
        {  "_id" : "mydatabase",  "primary" : "rs1",  "partitioned" : true }
                mydatabase.mycollection
                        shard key: { "userId" : "hashed" }
                        unique: false
                        balancing: true
                        chunks:
                                rs1	44215
                                rs2	21
                        too many chunks to print, use verbose if you want to force print

That said, even after calling sh.startBalancer() (which completes successfully), sh.isBalancerRunning() returns false.

We're giving it a bit of time to see if it starts migrating chunks, but we're wondering if this is potentially an issue.
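As a side note, the other balancer checks we know of from mongos look roughly like this (just a sketch):

mongos> sh.getBalancerState()                   // whether the balancer is enabled at all
mongos> db.adminCommand({ balancerStatus: 1 })  // reports "mode" and whether a round is in progress ("inBalancerRound")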


I dug up my old repo and made a new version of my "quick start" sharded cluster with Docker.

I used it to create a 3-shard cluster to play with, and in the output of my sh.status() command I get this:

[...]
shards
[
  {
    _id: 'shard1',
    host: 'shard1/mongod-s1-1:27018,mongod-s1-2:27018,mongod-s1-3:27018',
    state: 1,
    topologyTime: Timestamp({ t: 1670635728, i: 1 })
  },
  {
    _id: 'shard2',
    host: 'shard2/mongod-s2-1:27018,mongod-s2-2:27018,mongod-s2-3:27018',
    state: 1,
    topologyTime: Timestamp({ t: 1670635729, i: 1 })
  },
  {
    _id: 'shard3',
    host: 'shard3/mongod-s3-1:27018,mongod-s3-2:27018,mongod-s3-3:27018',
    state: 1,
    topologyTime: Timestamp({ t: 1670635729, i: 7 })
  }
]
[...]

Without too much surprise, I'm getting { state: 1 } on all 3 shards… so there is definitely something wrong here, and I bet this is our issue.

Can you please make SURE that these 3 nodes are actually running with this option ON at the moment? Maybe the startup scripts were updated after the last restart of these 3 machines and the mongod processes have a long uptime.

To check, you can run this on your first RS:

shard1 [direct: primary] config> use local
switched to db local
shard1 [direct: primary] local> db.startup_log.find()
[
  {
    _id: 'mongod-s1-1-1670635632853',
    hostname: 'mongod-s1-1',
    startTime: ISODate("2022-12-10T01:27:12.000Z"),
    startTimeLocal: 'Sat Dec 10 01:27:12.853',
    cmdLine: {
      net: { bindIp: '*' },
      replication: { replSet: 'shard1' },
      sharding: { clusterRole: 'shardsvr' }
    },
    pid: Long("1"),
    buildinfo: {
      version: '6.0.3',
      gitVersion: 'f803681c3ae19817d31958965850193de067c516',
      modules: [],
      allocator: 'tcmalloc',
      javascriptEngine: 'mozjs',
      sysInfo: 'deprecated',
      versionArray: [ 6, 0, 3, 0 ],
      openssl: {
        running: 'OpenSSL 1.1.1f  31 Mar 2020',
        compiled: 'OpenSSL 1.1.1f  31 Mar 2020'
      },
      buildEnvironment: {
        distmod: 'ubuntu2004',
        distarch: 'x86_64',
        cc: '/opt/mongodbtoolchain/v3/bin/gcc: gcc (GCC) 8.5.0',
        ccflags: '-Werror -include mongo/platform/basic.h -ffp-contract=off -fasynchronous-unwind-tables -ggdb -Wall -Wsign-compare -Wno-unknown-pragmas -Winvalid-pch -fno-omit-frame-pointer -fno-strict-aliasing -O2 -march=sandybridge -mtune=generic -mprefer-vector-width=128 -Wno-unused-local-typedefs -Wno-unused-function -Wno-deprecated-declarations -Wno-unused-const-variable -Wno-unused-but-set-variable -Wno-missing-braces -fstack-protector-strong -fdebug-types-section -Wa,--nocompress-debug-sections -fno-builtin-memcmp',
        cxx: '/opt/mongodbtoolchain/v3/bin/g++: g++ (GCC) 8.5.0',
        cxxflags: '-Woverloaded-virtual -Wno-maybe-uninitialized -fsized-deallocation -std=c++17',
        linkflags: '-Wl,--fatal-warnings -pthread -Wl,-z,now -fuse-ld=gold -fstack-protector-strong -fdebug-types-section -Wl,--no-threads -Wl,--build-id -Wl,--hash-style=gnu -Wl,-z,noexecstack -Wl,--warn-execstack -Wl,-z,relro -Wl,--compress-debug-sections=none -Wl,-z,origin -Wl,--enable-new-dtags',
        target_arch: 'x86_64',
        target_os: 'linux',
        cppdefines: 'SAFEINT_USE_INTRINSICS 0 PCRE_STATIC NDEBUG _XOPEN_SOURCE 700 _GNU_SOURCE _FORTIFY_SOURCE 2 BOOST_THREAD_VERSION 5 BOOST_THREAD_USES_DATETIME BOOST_SYSTEM_NO_DEPRECATED BOOST_MATH_NO_LONG_DOUBLE_MATH_FUNCTIONS BOOST_ENABLE_ASSERT_DEBUG_HANDLER BOOST_LOG_NO_SHORTHAND_NAMES BOOST_LOG_USE_NATIVE_SYSLOG BOOST_LOG_WITHOUT_THREAD_ATTR ABSL_FORCE_ALIGNED_ACCESS'
      },
      bits: 64,
      debug: false,
      maxBsonObjectSize: 16777216,
      storageEngines: [ 'devnull', 'ephemeralForTest', 'wiredTiger' ]
    }
  }
]

And here in my example I confirmed that I have sharding: { clusterRole: 'shardsvr' } in my cmdLine.

Note that all the above outputs are from MongoDB 6.0.3, and it's definitely time for an upgrade on your sharded clusters! :wink:

Cheers,
Maxime.

@MaBeuLux88, thanks so much for your attention while troubleshooting this!

We discovered that stepping down the primary config server seemed to make the shard balancer commands responsive again. However, the migration still never seemed to complete.
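(For anyone who finds this thread later, the step-down itself was just the standard replica set command run against the config server replica set's primary, roughly like this; the 60-second value is only illustrative:)

rsconfig:PRIMARY> rs.stepDown(60)   // step down for 60 seconds so a new primary can be elected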

Ultimately, we scheduled some downtime and rebooted the entire cluster (all config machines, and all replica set members in all shards). This did resolve the issue, and we now see data migrating at a decent clip.

Amazingly, “have you tried turning it off and turning it back on” applies even here! :sweat_smile:

We hear you on updating the cluster - these were initially deployed back in the 3.6 days, and it’s been hard to prioritize updating them. But I’m sure it’d be better to do that proactively than under duress!


Something weird was happening with the primary config server then :confused: !

I get that reference :wink: !

I'm glad it's resolved. Do you see state: 1 now in your sh.status()?

Cheers,
Maxime.
