Shard Migration Leads to Service Disruption

Clark_Kromenaker · October 2, 2023, 10:42pm

Hi all, we are having an issue with shard balancing/migration, and are looking for any insight into the cause, how it can be fixed, or experience others have had with similar issues. We are using MongoDB 3.6.23 (I know it’s old; we hope to update it soonish).

Our shard balancer has been running without issue for over a year. However, we had a maintenance window last week, after which we started to see issues (incidentally we did not touch the database during this maintenance beyond creating a new index on a collection).

The issue is this: when migrating records for a particular collection (not the one we created a new index on btw), we see a major disruption in our application servers’ ability to query the database. The database becomes effectively unresponsive while the migration occurs.

And just to highlight - this ONLY happens for a single collection. Migrations of all collections are succeeding, but only this collection's migrations correlate with the disruption we’re seeing. The only unique characteristic I could see with the problem collection is that it is sharded on the “_id” field, whereas all others are sharded on a “userId” field.

By monitoring the issue, we have been able to observe a timeline of behaviors:

The shard balancer window is set for midnight to 3am PT. During this window, the shard balancer is actively migrating records.
Migration of records proceeds successfully until the balancer decides to migrate records from this particular problem collection.
MongoDB Compass shows a sudden dip in all operations as well as bytes in/out.
The number of connections to the machine suddenly spikes dramatically. While it usually has maybe 200 concurrent connections, the log is suddenly flooded with these messages: I NETWORK [listener] connection accepted from 10.180.12.59:47654 #5069275 (30995 connections now open).
At about 32636 connections, the log is instead spammed with this message hundreds of thousands of times:

I NETWORK [listener] connection accepted from 10.180.12.59:56484 #5187456 (32636 connections now open)
I - [listener] pthread_create failed: Resource temporarily unavailable
W EXECUTOR [conn5187456] Terminating session due to error: InternalError: failed to create service entry worker thread

Meanwhile, all queries from our application are failing with: Query failed with error code 6 and error message 'End of file' on server
Eventually, the migration from the problem collection succeeds, the number of connections drops back to normal values, and the DB starts to respond to external queries again. However, in one case we did see the mongod process crash during this issue.
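
For anyone trying to pin down the same timeline from their own logs, here is a minimal Python sketch that pulls the open-connection counts and the thread-creation failures out of mongod log lines shaped like the ones quoted above. The function names are made up for illustration, and the regex assumes the 3.6-style "connection accepted" message shown in this post; other versions may format it differently.

```python
import re

# Matches the 3.6-style listener message quoted in this thread, e.g.
# "connection accepted from 10.180.12.59:47654 #5069275 (30995 connections now open)"
CONN_RE = re.compile(
    r"connection accepted from \S+ #\d+ \((\d+) connections? now open\)"
)


def open_connection_counts(lines):
    """Return the 'connections now open' value from every matching log line."""
    counts = []
    for line in lines:
        match = CONN_RE.search(line)
        if match:
            counts.append(int(match.group(1)))
    return counts


def summarize(lines):
    """Summarize a chunk of log: accepted connections, peak, thread failures."""
    lines = list(lines)
    counts = open_connection_counts(lines)
    return {
        "accepted": len(counts),
        "peak_open": max(counts) if counts else 0,
        "thread_failures": sum("pthread_create failed" in l for l in lines),
    }


if __name__ == "__main__":
    sample = [
        "I NETWORK [listener] connection accepted from 10.180.12.59:47654 #5069275 (30995 connections now open)",
        "I NETWORK [listener] connection accepted from 10.180.12.59:56484 #5187456 (32636 connections now open)",
        "I - [listener] pthread_create failed: Resource temporarily unavailable",
    ]
    print(summarize(sample))
```

Running `summarize` over the log for the balancer window and comparing `peak_open` against the usual ~200 connections makes it easy to see exactly when the spike starts and when thread creation begins to fail.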

So yeah, we are perplexed about why we would suddenly start to see this issue after a seemingly innocuous maintenance window. At first, we thought this was a bug in our application server code or perhaps a malicious user, but simply disabling balancing of the problem collection causes the problem to disappear entirely. However, we would of course like to be able to continue balancing our collections for the health of the cluster.
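
For reference, a rough Python sketch of the per-collection workaround mentioned above. In the mongo shell this is normally done with `sh.disableBalancing("db.collection")`; to my understanding, on 3.6 that helper sets a `noBalance` flag on the collection's document in `config.collections`, which is what the sketch reproduces. The namespace `appdb.sessions` and the function names are hypothetical, and the collection object is passed in so the code does not assume any particular driver setup.

```python
def no_balance_update(namespace, enabled):
    """Build the (filter, update) pair that toggles balancing for one namespace.

    Assumption: balancing is controlled by the `noBalance` field in
    config.collections, as sh.disableBalancing()/sh.enableBalancing()
    do in the 3.6 shell.
    """
    if "." not in namespace:
        raise ValueError("namespace must look like 'database.collection'")
    return {"_id": namespace}, {"$set": {"noBalance": not enabled}}


def set_balancing(config_collections, namespace, enabled):
    """Apply the update through any object exposing update_one(filter, update)."""
    query, update = no_balance_update(namespace, enabled)
    return config_collections.update_one(query, update)


# Usage sketch (hypothetical host and namespace; connect through a mongos):
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://mongos.example.internal:27017")
#   set_balancing(client["config"]["collections"], "appdb.sessions", enabled=False)
```

Remember to pass `enabled=True` afterwards to re-enable balancing once the underlying problem is resolved.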

Has anyone encountered this behavior before? Since it is specifically triggered by shard migration, it seems like it might be a bug. The only fixes that come to mind are to try cycling the config server primary or, as a “brute force” option, to restart the entire cluster during a maintenance window.

Clark_Kromenaker · October 3, 2023, 7:25am

Sorry, quick update/clarification. After running some more tests, we found that this issue is not related to a specific collection in the DB. Even with balancing disabled on the apparent problem collection, the exact same issue arose, but with a different collection.

So there seems to be some issue where migrating chunks between shards causes the cluster to become unresponsive to queries and spin up a massive number of connections/threads.

Clark_Kromenaker · October 26, 2023, 12:34am

Just to follow up here, we tried a few different things, and this ultimately fixed the problem:

We discovered that some of the machines in the cluster were using a different patch revision than other machines. A few machines were still using v3.6.5, while others were using the latest patch revision v3.6.23.

After updating all machines to use v3.6.23, we found that this issue no longer occurred.

Though we don’t know the exact cause, this leads us to believe that some incompatibility between these revisions was causing problems.
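
In case it helps others catch the same situation, here is a small Python sketch for spotting a mixed-version cluster. It only does the comparison; the host names are hypothetical, and collecting the version strings is left to you (with pymongo, `MongoClient(host).server_info()["version"]` should return a string like "3.6.23" for each member).

```python
import re


def parse_version(version):
    """Turn '3.6.23' into (3, 6, 23) so versions compare numerically.

    Plain string comparison would rank '3.6.5' above '3.6.23'.
    Any suffix after the third number (e.g. '-rc1') is ignored.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if not match:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def outdated_hosts(host_versions):
    """Return the hosts running anything older than the newest version seen."""
    parsed = {host: parse_version(v) for host, v in host_versions.items()}
    if not parsed:
        return []
    newest = max(parsed.values())
    return sorted(host for host, v in parsed.items() if v < newest)


if __name__ == "__main__":
    # Hypothetical hosts; the versions are the two mentioned in this thread.
    versions = {
        "shard-a-1:27018": "3.6.23",
        "shard-a-2:27018": "3.6.5",
        "shard-b-1:27018": "3.6.23",
    }
    print(outdated_hosts(versions))
```

An empty result means every member reported the same patch revision; anything else is a list of machines worth upgrading before re-enabling the balancer.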
