I have a collection that only holds 10 million documents, totaling 10 gigabytes. This may not seem like enough to necessitate sharding.
But there is a query that takes 1000 seconds to complete on this collection.
If I divide this collection into 1000 shards, then I can take advantage of a divide-and-conquer strategy and reduce the query time to 1 second (in theory, excluding overhead and other complications).
Is the above scenario not the primary reason for sharding? If so, it seems odd that MongoDB Atlas only allows 50 shards maximum.
Do you know what the bottlenecks are on your query before you resort to building a load of shards?
One of the bottlenecks is that the query contains several $lookup stages.
The other bottleneck is that the query may return an arbitrary number of documents, i.e. up to 10 million docs, before the $lookup stages. So if you pass 10 million docs through several $lookups, it will be slow.
Even though the number of docs returned to the client is limited to 50, the query has a $sort, so it must examine all of the matching docs before the limit is applied.
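To give a rough idea, the pipeline looks something like this (collection and field names here are made up / simplified):

```js
// Hypothetical sketch of the pipeline shape described above -- collection and
// field names (orders, customers, items, score) are made up for illustration.
db.orders.aggregate([
  { $match: { status: "active" } },              // may still match ~10M docs
  { $lookup: {                                   // join #1
      from: "customers",
      localField: "customerId",
      foreignField: "_id",
      as: "customer"
  } },
  { $lookup: {                                   // join #2
      from: "items",
      localField: "itemIds",
      foreignField: "_id",
      as: "items"
  } },
  { $addFields: { score: { $size: "$items" } } }, // derived field
  { $sort: { score: -1 } },                       // must examine every doc
  { $limit: 50 }                                  // only 50 reach the client
])
```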
Ahh, ok, I guess you have a specific requirement. As for the sorting, could you do that early and then make use of an index, or do you unwind and reshape the data as it flows, so you end up sorting on a derived field?
Right, the sorting must be done after all of the $lookup stages, which filter documents, and some of the fields involved in the sort are derived. So there is no way to sort early or use an index. The only optimization for the sort is that the returned set of documents is limited to 50.
Sounds like, given the data layout, there is a lot of work! I guess to get it running quicker you need to either change the storage model or throw a lot of hardware at it, as you said at the top.
I’m not quite sure how running a $lookup from a massively sharded collection to another collection would work with that many nodes, in terms of the extra overhead… I guess you get to the point where you export data to a relational reporting database and use joins just for the reporting requirements…
Sorry, not being much help here! I guess you could merge the $lookup data into this collection and keep it updated there, if it does not change very frequently. If you need this query to run in 1s then it would be cheaper to re-model than to throw a LOT of hardware at it.
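As a very rough sketch (all names made up), a periodic batch job could pre-compute the joins and the derived sort key with $merge, so the interactive query becomes just an indexed sort + limit with no $lookup at read time:

```js
// Hypothetical batch job: pre-compute the joined data and the derived sort key,
// writing them back into the collection with $merge (merging back into the
// source collection needs a reasonably recent MongoDB version).
db.orders.aggregate([
  { $lookup: { from: "items", localField: "itemIds",
               foreignField: "_id", as: "items" } },
  { $addFields: { score: { $size: "$items" } } },   // derived sort key
  { $project: { score: 1, items: 1 } },
  { $merge: { into: "orders", on: "_id", whenMatched: "merge" } }
])

// After the batch run, the interactive query is cheap:
db.orders.createIndex({ score: -1 })
db.orders.aggregate([
  { $sort: { score: -1 } },   // served by the index
  { $limit: 50 }
])
```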