Performance issue on a job inserting over 700 billion rows into a sharded cluster

Hi, I have tried to insert over 700 billion rows into a sharded cluster using distributed computing.

Each document contains a single field, a hashed string stored as _id (e.g. {_id: e3nlvksnlk12fdnsnkd!}),
and I want it indexed so that I can quickly check whether a given hashed string exists.
I configured around 50 shards (with 10 mongos servers).
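
For illustration, the existence check described above could look like the following minimal sketch. It assumes the hashed string is stored as _id (which MongoDB always indexes automatically) and that `collection` is a pymongo collection handle; the function name `hash_exists` is hypothetical.

```python
def hash_exists(collection, hashed):
    """Return True if a document whose _id equals `hashed` is present.

    `collection` is assumed to be a pymongo Collection (or anything
    exposing a compatible find_one method).
    """
    # Project only _id so that the response stays as small as possible.
    doc = collection.find_one({"_id": hashed}, projection={"_id": 1})
    return doc is not None
```

Because _id already carries a unique index, this lookup would not need any additional index on the collection.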

I created a pymongo client in a Spark job and used insert_many to write to the cluster.
At first, throughput was up to 200k inserts (aggregated across the mongos servers),
but as documents accumulated, write throughput dropped below 50k…
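
As a rough sketch of the insert path described above (not the exact job code), each Spark partition could send its hashes in fixed-size batches with `ordered=False`, which lets the server process the documents of a batch without waiting on each other. The names `batched`, `insert_partition`, and the batch size of 1000 are assumptions made for illustration.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of at most `size` items taken from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def insert_partition(collection, hashes, batch_size=1000):
    """Insert one partition's hashes in unordered batches.

    `collection` is assumed to be a pymongo Collection obtained through
    mongos. Returns the number of documents sent.
    """
    sent = 0
    for batch in batched(hashes, batch_size):
        docs = [{"_id": h} for h in batch]
        # ordered=False: one failed document does not stop the rest of
        # the batch, and the writes need not be applied one after another.
        collection.insert_many(docs, ordered=False)
        sent += len(docs)
    return sent

# Hypothetical use inside the Spark job (URI, database, and collection
# names are placeholders):
#   from pymongo import MongoClient
#   rdd.foreachPartition(
#       lambda part: insert_partition(MongoClient(uri)["mydb"]["hashes"], part))
```

With unordered batches, duplicate _id values surface as a `BulkWriteError` after the rest of the batch has been attempted, so the job would need to decide whether to catch and ignore those.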

Is there any way to increase bulk insert performance?
I am considering several approaches, such as creating the index after an index-free insert, or an ordered bulk insert (from data pre-sorted in index order), but I'm not sure these would be effective.

Hi @JaeHo_Park and welcome to the community!!

In order to understand the issue and provide detailed assistance, could you please share a few details about the setup described above:

  1. What’s the MongoDB version you’re using?
  2. What procedure do you follow to insert over 700 billion rows into the sharded cluster? Are you using a tool such as mongoimport or mongorestore to do so?
  3. Is the sharding done based on the _id as the shard key?
  4. Does the deployment have pre-split and pre-distributed chunks? Was all the data inserted first and sharding configured afterwards, or is the insertion done into an already sharded cluster?
  5. MongoDB strives to respond as quickly as possible to your queries. However, in some cases, you might see a low insert rate due to hitting hardware limitations (e.g. disk limits, CPU limits, etc.). Out of curiosity, what are the hardware specifications for the deployment in this case, and was it set up following the recommended settings in the production notes?
  6. Also, could you share the output of mongostat on the shard servers? It might be useful to determine how the hardware copes with the insert workload.
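
Regarding points 3 and 4, here is a minimal sketch of what sharding an empty collection on a hashed _id with pre-created chunks might look like from pymongo. It assumes a hashed shard key is acceptable for your workload; the helper name `build_shard_command`, the namespace, and the chunks-per-shard figure are hypothetical. `numInitialChunks` only applies to an empty collection with a hashed shard key, and its behavior can differ between server versions, so please check the documentation for the version you run.

```python
def build_shard_command(namespace, num_shards, chunks_per_shard=2):
    """Build a shardCollection command document for a hashed _id key.

    The command name must be the first key; Python dicts keep insertion
    order, so a plain dict is sufficient here.
    """
    return {
        "shardCollection": namespace,
        "key": {"_id": "hashed"},
        # Total number of chunks to create up front across all shards.
        "numInitialChunks": num_shards * chunks_per_shard,
    }

# Hypothetical use against a mongos, before any data is inserted:
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://mongos-host:27017")  # placeholder URI
#   client.admin.command("enableSharding", "mydb")
#   client.admin.command(build_shard_command("mydb.hashes", num_shards=50))
```

If the chunks already exist and are spread across the shards before the load starts, the inserts can be distributed from the first batch instead of waiting for the balancer to move data.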

Is this heavy insert workload the normal day-to-day workload you’re expecting for the cluster? Or is this a one-off job, and querying/aggregation will be the typical workload you envision in the future?