Horizontal scaling of data that doesn't shard well

Chris_Wiggins · October 29, 2021, 2:00pm

Hello,

I have a non-sharded Mongo instance with some data that I’m about to make much bigger – bigger than a single machine will be able to handle in terms of its working set. It also doesn’t follow the usual pattern needed to achieve good sharding: there’s no obvious shard key.

It seems to me that sharding on _id (which I’ve configured to be a uniformly distributed UUID) could still be a way forward. The id can’t/won’t be in any of my queries, but it does mean the data will be spread across multiple machines and allow the working sets to fit in memory. Without this approach, I don’t see how I can store the data at all.

But, I’ve read lots of dire warnings about scatter-gather queries. In my situation, would these really be significantly slower than my current replica set? I understand that I won’t benefit from the read throughput gains that sharding can bring, but I also don’t understand how statements like “on larger clusters, scatter gather queries are unfeasible for routine operations” can be true.

For example, if I have data of size n that just fits onto a non-sharded Mongo instance, and I can achieve a query throughput of t, then it seems to me that two shards could handle data of size 2n also with a query throughput of t: each machine receives all queries, as they did pre-sharding. The only overhead is merging the results at the router.

In short, is it a totally crazy idea to shard and explitly plan to use scatter-gather queries?!

Thanks for any input!