Hi @A_Grikk
There are a lot of info to uncompress here, and I believe @steevej has touched on all of them.
Having said that, I’d like to take a step back a little and maybe add my take on it ![]()
I have a database sharded across the 6 pairs
Is there a reason for you to start with a sharded cluster which is arguably the most complex deployment method that requires deeper knowledge? Is the workload cannot be served by using a replica set with larger machines instead of a sharded cluster with smaller machines?
At 5-6 minutes per daily data set, it would take an estimated 12 days to load this data. At almost two hours,
it will tape almost two hundred days to load this data.
Since you didn’t mention about the sharded collection setup and you mentioned that turning off the balancer sped things up, what I’m guessing happened is that the chunks was not pre-distributed into the shards before import, thus the import is happening at the same time as the cluster constantly trying to rebalance itself, thus slowing down the import. In a sharded collection, the recommended approach is to pre-distribute the chunks before importing (see Create Chunks in a Sharded Cluster).
(Yes, I know the recommended installation is a 3-node replica set, but this is not a production environment so a 2-node will suffice for me for now.)
As @steevej mentioned earlier, the recommended setup is 3 node replica set, for many reasons (high availability, majority write, majority read, sharded transaction support, etc.). Unless you can avoid it, having arbiters in a replica set is not a recommended setup, or unless you fully understand the implications of using arbiters entails.
If that is the case, then to build a six-shard database across replica sets it would require 18 servers?
Yes. A sharded cluster is not a small deployment. Also in your case with 6 shards, it would need 3 more nodes minimum to act as the config servers, so total of 21. It was designed for very large workloads.
I guess my question is twofold: First, have you considered using a single replica set which will more closely mimic your monolithic SQL Server setup, and second, if you really need the performance of a sharded cluster and the complexity that comes with it, have you considered using Atlas so all of this operational concerns are managed for you?
Best regards
Kevin