Indexing after moving a chunk

I’m curious to understand what happens to indexes after a chunk is moved. It appears to me that when a chunk is moved from one replica set to another, a full re-index occurs for all of the source collection’s indexes. Is this true? If so, are there options that can be used to throttle the chunk movement, such as not moving any chunks until the indexing is complete?

In my particular case it appears that indexing after moving a chunk is flooding my disk with activity and crushing the performance.

Let me explain.

I’ve been working with a standalone instance and I’m now working to move it to a sharded cluster of two replica sets. I’ve successfully moved the database to a three-node replica set acting as a single shard. I then created an empty three-node replica set and added it as a second shard. I’ve enabled sharding on the database and, one by one, I am sharding the collections. The small collections balanced without issue, but I’m running into balancing issues when sharding the larger collections.

It seems that after a chunk is moved from the source RS to the target RS, each node in the source RS performs a significant amount of indexing work, so much so that it overwhelms the disk drive, which pegs at 100% activity with a disk queue of around 10. I believe it is indexing activity because the files being written to are index files. The data disk for each node is four 10K SAS drives in a RAID 10 array. CPU utilization hovers around 40% and memory around 50%.

I expect this is resulting in excessive lag and causing the migrations to begin failing.

This issue seems to build over time. What I mean by that is that I can disable balancing on the collection and let the indexing work finish. If I then enable balancing on the collection, the indexing load after the first chunk is moved isn’t too bad, but after 6 or 7 chunks have been moved the load on the drives gets so bad that the balancing begins to throw ‘aborted’ errors.

Hi Tim,

I don’t believe any reindex operation is done during a chunk move, in either the donor shard or the recipient shard. In the donor shard, all indexes must already be present, otherwise the command will not proceed (reference: the moveChunk command).

When you say that issues start to pop up when balancing larger collections, could you give me a sense of the scale of those collections? Are they gigantic collections approaching a TB each? How many of those collections are you trying to shard?

I expect this is resulting in excessive lag and causing the migrations to begin failing.

Could you share the log line or the output of sh.status()? What was the exact cause of the failed migration?

This issue seems to build over time. What I mean by that is that I can disable balancing on the collection and let the indexing work finish

This sounds like two separate issues to me: indexing and moving chunks. If possible, could you let the indexing work finish before doing any balancing?
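
For what it’s worth, a minimal way to pause balancing for just one collection while the disks catch up (the namespace below is a placeholder for your own):

```js
// Run through a mongos. "mydb.bigCollection" is a placeholder namespace.
sh.disableBalancing("mydb.bigCollection")   // balancer skips this collection

// ... wait for the index/disk activity on the shards to settle ...

sh.enableBalancing("mydb.bigCollection")    // allow its chunks to migrate again
sh.getBalancerState()                       // check that the balancer itself is on
```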

Actually, the “best” way of migrating into a sharded cluster is to pre-create and pre-split the sharded collection before filling it with data. In essence, this can be done by:

  1. On the sharded cluster, create the relevant collection and indexes using the db.createCollection() and db.collection.createIndex() methods. Note that we want the sharded collection to remain empty.
  2. Shard the collection, then split it into chunks using the proper shard key and the sh.splitAt() command.
  3. Ensure that the empty sharded collection is balanced. This way, the only operation involved is a shard metadata change, and no actual documents are migrated, since the collection is still empty.
  4. Dump the collection from the original database (e.g. by using mongodump).
  5. Turn off the balancer, then restore the database to the sharded cluster (e.g. by using mongorestore). This is to prevent any chunk split & movement while data import is taking place. Do this restore using the mongos query router and never operate on the shards directly.
  6. Turn on the balancer, and ensure the restored collection reaches a steady state.

If you pre-create the sharded collection and pre-split the (empty) chunks, you don’t need to do migrations anymore. The relevant data would be restored straight into its proper shard.
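
To make steps 1–3 concrete, here is a minimal shell sketch, assuming a hypothetical database mydb, collection events, and a numeric shard key userId; substitute your own names, shard key, and split points:

```js
// All of these commands are run through a mongos.
// "mydb", "events", and the shard key "userId" are placeholders.

// 1. Pre-create the empty collection and its indexes:
use mydb
db.createCollection("events")
db.events.createIndex({ userId: 1 })

// 2. Enable sharding for the database and shard the (still empty) collection:
sh.enableSharding("mydb")
sh.shardCollection("mydb.events", { userId: 1 })

// 3. Pre-split into chunks at chosen shard key values:
sh.splitAt("mydb.events", { userId: 1000000 })
sh.splitAt("mydb.events", { userId: 2000000 })
// ...repeat for as many split points as needed.

// Verify that the empty chunks end up spread across both shards:
sh.status()
```

The dump and restore in steps 4–6 are then done with the mongodump and mongorestore tools, pointed at the mongos rather than at any individual shard.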

If you’re interested in the inner workings of chunk migration and sharded cluster, please see the Sharding Internals article, which is part of the MongoDB source code itself.

Best regards,
Kevin

Hi Kevin, thanks for the reply.

I have three collections that fail. The first has 27M documents and a total document size of 28GB. The second has 86M documents and is 26GB. The third is by far the largest, having 4.2B documents and a total document size of 1.4TB. I only enable balancing on one collection at a time in my attempts to make this work.

sh.status() only says something about n failures in the last 24 hours with a message of ‘aborted’. There weren’t any helpful details. There is more information in the logs, which can be viewed here.

The collections are fully indexed on the single existing shard’s replica set. My application is disabled, so there are no attempted CRUD actions being taken. It is when I add a second shard and set a collection to be sharded that the problem occurs. On the smaller collections a few chunks will move, but eventually the disks on all nodes in the donor replica set are overwhelmed with disk activity (100% and a disk queue of ~10). I suspect it is indexing related because the huge disk activity is all on .wt files that begin with ‘index’. On the large collection errors happen immediately. Eventually one chunk was moved, but it took nearly 24 hours. There are 2,900 chunks in this collection and I can’t wait 8 years for the balancing to complete :wink:

The pre-split process sounds interesting, but my initial concern is that using mongodump and mongorestore was the first method I tried when I started this migration task nearly two months ago. At that time I was merely trying to restore to another standalone instance, and I gave up on that path because after restoring the data it started rebuilding the indexes, and I was in the same situation with the indexing crushing my hard drives.

If I create pre-split chunks, won’t I potentially have the same indexing issue on each member of each replica set?

To try the pre-split I’ll need to round up some equipment for a lab. My application is finally back online after two months of attempting to create a sharded cluster :frowning:

But I have a question regarding pre-splitting the collections. I have my chunk size set to 1024MB to minimize re-balancing needs, which results in 2,900+ chunks in my large collection. In a 1:1 meeting with MongoDB yesterday I was encouraged to reset it back to the default of 64MB, which means I’ll have nearly 50K chunks. Is there a way to tell MongoDB that I want 50K evenly spaced chunks, or do I need to use a loop in a script that calls splitAt 50K times?

Hi Tim,

You have a lot of data there. I think the hardware is overwhelmed by the demand placed on it by all the chunk moves and rebalancing.

From the logs you referenced, two things seem to support this theory:

2020-06-02T06:55:12.326-0700 W STORAGE [FlowControlRefresher] Flow control is engaged and the sustainer point is not moving. Please check the health of all secondaries.

This message seems to indicate that the secondaries are having trouble keeping up with demand. Flow control is a throttling mechanism on the primary that allows the secondaries to keep up (see Flow Control). I can’t tell why the secondaries can’t keep up, but it could be caused by a couple of things, e.g. a slow network, or, if the secondaries are less powerful than the primary, the primary simply having to wait for the secondaries to finish their work.
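
If you want to see what flow control is actually doing while a migration runs, a quick check on the donor shard’s primary might look like this (assuming MongoDB 4.2+, where the flowControl section of serverStatus exists):

```js
// Run on the PRIMARY of the donor shard's replica set.
// Dumps the flow control section of serverStatus, e.g. whether it is
// enabled and whether the primary currently considers itself lagged.
printjson(db.serverStatus().flowControl)

// Rough view of how far behind each member's applied optime is:
rs.status().members.forEach(function (m) {
  print(m.name, m.stateStr, m.optimeDate);
});
```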

The second thing that caught my eye:

2020-06-02T06:55:29.190-0700 I STORAGE [WTCheckpointThread] WiredTiger message [1591106129:189774][1352:140709697032320], WT_SESSION.checkpoint: Checkpoint ran for 99 seconds and wrote: 89632 pages (4141 MB)

By default, WiredTiger creates a checkpoint every 60 seconds. See the linked page for a short explanation of checkpoints. A typical checkpoint runs for much less than 60 seconds. This particular checkpoint ran for 99 seconds, which implies either that: a) it needed to write a whole lot of data, or b) the disk is overwhelmed by the write requests placed on it.

From the numbers you have provided so far, it sounds to me like doing a dump & restore would be faster than waiting for the chunks to rebalance themselves :slight_smile:

Regarding creating chunks, I don’t think there’s an automatic method to create them evenly spaced. When faced with this situation, I would usually create a small script that calls sh.splitAt() in a loop.
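
A minimal sketch of such a script, assuming a numeric shard key whose values are roughly evenly distributed over a known range (the namespace, field name, range, and chunk count below are placeholders):

```js
// Run against a mongos. Placeholders: namespace, shard key field, and range.
var ns = "mydb.bigCollection";   // sharded, still-empty collection
var keyField = "userId";         // shard key field
var minKey = 0;                  // lowest expected shard key value
var maxKey = 4200000000;         // highest expected shard key value
var numChunks = 50000;           // desired number of chunks

var step = Math.floor((maxKey - minKey) / numChunks);
for (var i = 1; i < numChunks; i++) {
  var splitPoint = {};
  splitPoint[keyField] = minKey + i * step;
  sh.splitAt(ns, splitPoint);    // creates a chunk boundary at this value
}
```

If the shard key values aren’t evenly distributed, you’d want to derive the split points from the actual data distribution instead of a fixed step.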

Best regards,
Kevin

Hey Kevin.

Yep, I have a lot of data and I’m just getting started. I anticipate having to add shards on an ongoing basis, and I’ll need to be able to do that without having to manually recreate the cluster, which is why I’m trying to understand where my bottleneck is.

The nodes are all identical. They are VMs running on six identically configured servers.

CPU, memory, and network all seem fine. I am definitely overwhelming my disks, as mentioned in the original post. All of the disk activity is on the index files, so I’m trying to understand why that is and what I can do about it.

Could it be my 1GB chunks? I thought that using the maximum chunk size would be best in my case, since I will be constantly adding new data and wanted to minimize the need to split chunks and balance shards, but maybe these large chunks are the problem? When I get my lab set up I plan to retry this process using the default 64MB chunks.
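
For reference, my understanding is that the chunk size is changed through mongos in the config database, something like this (64 being the default I was told to go back to):

```js
// Run through a mongos; the chunk size lives in the config database (value in MB).
use config
db.settings.updateOne(
  { _id: "chunksize" },
  { $set: { value: 64 } },
  { upsert: true }
)
```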

Thanks.

Tim