I have a non-sharded Mongo instance with some data that I’m about to make much bigger – bigger than a single machine will be able to handle in terms of its working set. It also doesn’t follow the usual pattern needed to achieve good sharding: there’s no obvious shard key.
It seems to me that sharding on `_id` (which I've configured to be a uniformly distributed UUID) could still be a way forward. The `_id` can't/won't appear in any of my queries, but sharding on it would spread the data across multiple machines and let each shard's working set fit in memory. Without this approach, I don't see how I can store the data at all.
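For reference, the setup I have in mind would be something like the following (run against a `mongos`; `mydb` and `events` are placeholder names):

```javascript
// Enable sharding on the database ("mydb" is a placeholder name).
sh.enableSharding("mydb")

// Since _id is already a uniformly distributed UUID, a ranged shard key
// on it should balance chunks well, without needing { _id: "hashed" }.
sh.shardCollection("mydb.events", { _id: 1 })
```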
But, I’ve read lots of dire warnings about scatter-gather queries. In my situation, would these really be significantly slower than my current replica set? I understand that I won’t benefit from the read throughput gains that sharding can bring, but I also don’t understand how statements like “on larger clusters, scatter gather queries are unfeasible for routine operations” can be true.
For example, if I have data of size `n` that just fits onto a non-sharded Mongo instance, and I can achieve a query throughput of `t`, then it seems to me that two shards could handle data of size `2n` with the same query throughput `t`: each machine receives every query, just as it did pre-sharding. The only extra overhead is merging the results at the router.
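To make my reasoning concrete, here it is as a toy model (my own assumptions, not a benchmark; the function name and the 5% merge cost are made up):

```python
def cluster_throughput(per_machine_qps, num_shards, merge_overhead=0.0):
    """Toy model of scatter-gather throughput.

    Every query fans out to all shards, so each shard does the same
    per-query work it did pre-sharding; num_shards therefore doesn't
    appear in the result. Only the router's merge step costs anything,
    modelled here as a fractional overhead.
    """
    return per_machine_qps * (1.0 - merge_overhead)


t = 1000.0  # qps one non-sharded instance sustains (made-up figure)
print(cluster_throughput(t, 1))                       # 1000.0 - baseline, data of size n
print(cluster_throughput(t, 2))                       # 1000.0 - 2n of data, same throughput
print(cluster_throughput(t, 2, merge_overhead=0.05))  # ~950.0 with a 5% merge cost
```

In this (admittedly simplistic) model, doubling the data across two shards leaves throughput flat rather than degrading it, which is the claim I'm asking about.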
In short, is it a totally crazy idea to shard and explicitly plan to use scatter-gather queries?!
Thanks for any input!