Specific Mongo shards are getting most request

Mukesh_kumar · May 9, 2020, 3:40pm

Hi Team,

We have about 10 shards, 3 configs and 6 MongoS and what could we see is particular nodes of shards are getting most of request.
Let say we have rep1 to rep10 as shards of the Mongo cluster and

What we could see is particular shard rep2 is getting most read request and it’s not distributed to other shards.
Database for which queries are coming(as per the logs of Mongo) are sharded so it should go to across the shards.
We also checked the stats of Mongo from MongoDB ops manager and we could see high Network for it. It’s about ~700KB/s Bytes in and ~500-600 MB/s bytes out and for rest it’s about 4MB/s Bytes out and ~4-5 KB/s Bytes in.
We also checked the RAM usage and CPUs it seems normal for this node(~10GB RAM is still available).
Queries are taking more than 2 secs of execution time and sometime it’s expire.

MongoD details are below-

We are using DC tagging so that query should go to particular DC’s shard if it’s coming from that DC.
MongoDB version is- 3.0.4
storage Engine- mmapv1

Let me know if more details required.

Thanks in advance.

Cheers,
Mukesh Kumar

chris · May 9, 2020, 9:19pm

Is this particular collection sharded? Unsharded collections are on the Primary shard for the database.

Is the balancer enabled? sh.getBalancerState() If it is disabled then this node could have a disproportionate amount of chunks.

The collections themselves need to be sharded. Perhaps this is just the way you have written and you mean collections.

If I don’t say it someone else will. Mongo 3.0.4 was End of Life February 2018. 3.6 is the next current version, along with 4.0 and 4.2

If you are staying on 3.0 then you should loook at updating to 3.0.15 for the most up to date version of that release.

Doug_Duncan · May 10, 2020, 12:06am

In addition to @chris’s answer, take a look at the results of sh.status(). Does that shard have more chunks than the other shards do?

Stennie_X · May 10, 2020, 7:53am

Hi Mukesh,

The earlier suggestions from @chris and @Doug_Duncan are great starting points. Definitely identify whether the query volume is targeting sharded or unsharded collections, and review your queries and data distribution for any sharded collections on the affected shard. If there are many unsharded collections with rep2 as their primary shard, this difference in network traffic may be expected.

Queries for sharded collections are distributed based on the shard key values. Chunks in a sharded collection are balanced based on roughly equal distribution of chunks according to migration thresholds. There are scenarios where a sharded collection may be balanced by policy but still have an unequal distribution of workload. A good choice of shard key will, on average, distribute your workload appropriately.

Some other considerations for balancing:

Assuming the disproportionate queries are for sharded collections, check if these collections have any jumbo chunks – if so, these will show as jumbo in the output of sh.status(true) . Jumbo chunks are considered too large to move and ignored by the balancer. Jumbo chunks can sometimes be manually divisible, but are usually caused by a poor choice of shard key. See: Clear jumbo Flag .
Does the distribution of estimated docs and data per chunk look reasonable based on db.collection.getShardDistribution() and any shard tagging you may be using to influence the balancer policy?
Are you deleting documents from sharded collections that might create empty chunks based on your shard key? For example, if your shard key is based on a creation date and you expire or delete old documents there will be older chunk ranges that will never be used for new data. Sparsely populated chunks will not be consolidated automatically, but you can manually merge chunks if appropriate.

If chunks are balanced as expected but data/docs distribution is not, you may have empty chunks, jumbo chunks, or a poor choice of shard key.

Also as mentioned earlier, MongoDB 3.0 is an end of life release series. As always, I’d recommend upgrading to the final minor release (3.0.15) and planning an upgrade to a supported release series (currently 3.6 or higher). This is unlikely to affect your query distribution issue, but MongoDB 3.0.4 was released in June, 2015 and you are missing out on 2 years of maintenance and stability updates for the 3.0 release series.

If you are using Ops Manager legally in a production environment, you should also be able to raise a case on the Support Portal as per your MongoDB Enterprise Advanced subscription . However, MongoDB 3.0 is well out of support so I’ll assume this is a test/development environment.

Regards,
Stennie

Mukesh_kumar · May 11, 2020, 10:45am

@Doug_Duncan @Stennie_X Thanks for sharing the info.
We checked the balancer and data on the collection. We could see collection is sharded but there were 2 documents on that shard(let say sh1). And size of those 2 docs is more than 10MB and whenever a query search these doc, we could see getmore multiple times. And there was a flaw in our code base due to which it was trying to search same doc again and again.

We have deleted that doc though code isn’t fixed yet but it seems everything is normal now.

Really appreciate your efforts.

Thanks,
Mukesh Kumar