Hi @kembhootha_k - my name is Chris and I am one of Maxime’s coworkers here at MongoDB. Thanks again for your question.
Broadly speaking, the takeaway from this comment is going to be that:
SINGLE_SHARD stage is unlikely to be meaningfully contributing to the duration noted in the
Individual query latency is necessarily going to be higher on a single-shard sharded cluster than on a replica set by itself, since every request passes through an additional routing layer.
I would also be curious about what your specific goals are. With a worst-case (presumably cold cache) total time of 46 seconds, each document is being processed in roughly 0.023 milliseconds, which is a processing rate of nearly 43,500 documents per second. Is there a more defined target that you are trying to hit, or are you just exploring what is possible with the current configuration?
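To make that arithmetic explicit, here is a small Python sketch. The document count of 2,000,000 is my assumption, back-calculated from the 46 seconds and 0.023 milliseconds above; it is not a figure taken from your explain output, and `per_document_stats` is just an illustrative helper name.

```python
def per_document_stats(total_seconds: float, documents: int) -> tuple[float, float]:
    """Return (milliseconds per document, documents per second)."""
    ms_per_doc = total_seconds * 1000 / documents
    docs_per_sec = documents / total_seconds
    return ms_per_doc, docs_per_sec


# Assumed inputs: 46 s worst-case run, ~2,000,000 documents (inferred, not measured).
ms_per_doc, docs_per_sec = per_document_stats(46, 2_000_000)
print(f"{ms_per_doc:.3f} ms per document, {docs_per_sec:,.0f} documents per second")
# prints "0.023 ms per document, 43,478 documents per second"
```

Plugging in your real document count and target duration shows quickly how far the current numbers are from whatever goal you have in mind.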
Would you be able to provide the full explain outputs for us to examine? It is difficult to provide specific answers or guidance about what may be happening in your environment given only a few duration metrics. When examined as a whole, explain output really helps tell a story (or acts as a map) about what is going on. Without the complete picture we may be missing important pieces.
Even in the absence of the full output, we can still say a few things that are probably useful. I would expect the execution time reported by each stage in the explain output to be cumulative, that is, inclusive of the time spent in its child stages. This implies a few interesting items:
There may be a typo or mixup in the numbers mentioned in the original post. I don’t think it should be possible for the parent SINGLE_SHARD stage to report a smaller duration (15 seconds) than its child FETCH stage (44 seconds). Is it possible the times for the SINGLE_SHARD stage were transposed between the two runs, as the 46 seconds and 13 seconds from the opposite lines seem to match pretty closely?
The total time for the explain operation should basically be the largest number (e.g. 46 seconds) as opposed to the sum of each duration reported (e.g. 60.5 = 1.5 + 13 + 46).
The SINGLE_SHARD stage should not be responsible for doing much work. Assuming the points above are correct (including the final number being swapped), the maximum time that could be attributable to this stage would be 3 seconds. Even that number could be inflated for other reasons. There is probably not much (or any) optimization which could really be done here.
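The three points above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `sample_plan` tree is a made-up, flattened stand-in for real explain output (an actual sharded explain nests each shard's plan under a `shards` array), the millisecond values are placeholders rather than your measurements, and `exclusive_times` is a hypothetical helper. The field names `stage`, `inputStage`, `inputStages` and `executionTimeMillisEstimate` are the ones explain uses for plan stages.

```python
def exclusive_times(stage: dict) -> list[tuple[str, int]]:
    """Walk a plan tree whose times are inclusive of children and
    return (stage name, time spent in that stage alone), parent first."""
    children = []
    if "inputStage" in stage:
        children.append(stage["inputStage"])
    children.extend(stage.get("inputStages", []))

    # A stage's own time is its inclusive time minus its direct children's.
    child_total = sum(c["executionTimeMillisEstimate"] for c in children)
    own = stage["executionTimeMillisEstimate"] - child_total

    result = [(stage["stage"], own)]
    for child in children:
        result.extend(exclusive_times(child))
    return result


# Hypothetical, simplified plan; the numbers are placeholders for illustration.
sample_plan = {
    "stage": "SINGLE_SHARD",
    "executionTimeMillisEstimate": 46000,
    "inputStage": {
        "stage": "FETCH",
        "executionTimeMillisEstimate": 43000,
        "inputStage": {"stage": "IXSCAN", "executionTimeMillisEstimate": 1500},
    },
}

breakdown = exclusive_times(sample_plan)
for name, millis in breakdown:
    print(f"{name}: {millis} ms")
```

With inclusive timings, the per-stage values in `breakdown` add back up to the root stage's duration, which is why the total is the largest reported number and not the sum of all of them. A negative value for any stage would point to the kind of transposed numbers suspected above.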
As a point of comparison, what is the total duration for the same explain operation when executed directly against the PRIMARY member of the underlying replica set for the shard? I would expect that the majority of the time (when using explain) will be dominated by the work being performed by the underlying shard, so the numbers will likely be similar.