What happens really in the SINGLE_SHARD stage?

I have been trying to benchmark a query ( which is a single field-single value query ) on a collection that is 1.1 billion documents large, but is a single shard setup. My query is expected to return 2 million documents from the list of 1.1 billion in that collection. Further, the index is setup properly for this query. The cluster is a 3 node cluster behind the scenes.

Upon doing an explain of the query, the following are the time spent in each of the stages

Run 1 ) IXSCAN : 2 seconds, FETCH 44 seconds, SINGLE_SHARD 15 seconds.
Run 2) IXSCAN : 1.5 seconds, FETCH 13 seconds, SINGLE_SHARD 46 seconds.


  1. I can understand why the SHARD_MERGE stage is needed in a multisharded collection and that SINGLE_SHARD is it’s equivalent for similar tasks - but why is that stage needed when there is only a single shard ? What really happens in the “SINGLE_SHARD” stage?

  2. Why does “SINGLE_SHARD” stage take so much time given FETCH has already done the job of retrieving physically the documents from the disk ?

  3. While I understand why FETCH takes a lot less time in the second run ( cached documents / loaded in Memory ), why does SINGLE_SHARD stage shoot up wrt the time it takes ?

  4. Is it possible to avoid SINGLE_SHARD stage on one shard setups / or atleast make them more performant ?

Bumping up the thread to elicit a reply.

Hi @kembhootha_k,

I opened a ticket to understand why the stages in the explain output aren’t fully documented. It already frustrated me a few times as well.

My best guess is that during this stage the data is transferred from the shard to the mongos and because this is probably a few hundred megabytes at this point (2 million docs), maybe you are saturating the network or the mongos itself?

Some questions I have for you though:

  • Why are you trying to return that many docs in one query? It’s not possible to aggregate that result directly into the final result you are presenting to the final consumer of this data? I mean I guess you are not sending 2 millions docs to a webpage or a PDF.
  • Why is this collection only in a single shard? If it was distributed it could potentially help with the network throughput (potential) issue.
  • How large is the total result set? Does that fit in RAM in the shard and in the mongos (page fault or dump to disk?)


Thank you for your time. Answers as below

  1. You are right. That many documents will not be sent to the end user. It would be used more for batch jobs or feed analytics type of workloads. Having said that, my questions / observations stem from a performance suite that is run to test the limits of my setup.

  2. Precisely the reason. I’m trying to test out the single shard setup against various scenarios to better formulate my sharding strategy for the future using data created as part of the suite.

  3. The 2 million dataset fits within less than 40% of the RAM on the node serving the data.

Even if we momentarily keep aside what SINGLE_SHARD does, it’s weirder that it should take more time the second time around.

Hi @kembhootha_k - my name is Chris and I am one of Maxime’s coworkers here at MongoDB. Thanks again for your question.

Broadly speaking, the takeaway from this comment is going to be that:

  • The SINGLE_SHARD stage is unlikely to be meaningfully contributing to the duration noted in the explain output.

  • Individual query latency is necessarily going to be higher on a single shard sharded cluster compared to a replica set by itself.

I would also be curious about what your specific goals are. At a worst case (presumably cold cache) total time of 46 seconds, this implies that each document is being processed in 0.023 milliseconds or a processing rate of nearly 43,500 documents per second. Is there a more defined target that you are trying to hit, or are you just exploring what is possible with the current configuration?

Would you be able to provide the full explain outputs for us to examine? It is difficult to provide specific answers or guidance about what may be happening in your environment given only a few duration metrics. When examined as a whole, explain output really helps tell a story (or acts as a map) about what is going on. Without the complete picture we may be missing important pieces.

Even in the absence of the full output, we can still say a few things that are probably useful. I would expect the execution time reported by subsequent stages in the explain output to be cumulative and inclusive of their children stages. This implies a few interesting items:

  1. There may be a typo or mixup in the numbers mentioned in the original post. I don’t think it should be possible for the parent SINGLE_SHARD stage to report a smaller duration (15 seconds) than its child FETCH stage (44 seconds). Is it possible the times for the SINGLE_SHARD stage were transposed between the two runs, as the 46 seconds and 13 seconds from the opposite lines seem to match pretty closely?

  2. The total time for the explain operation should basically be the largest number (e.g. 46 seconds) as opposed to the sum of each duration reported (e.g. 60.5 = 1.5 + 13 + 46).

  3. The SINGLE_SHARD stage should not be responsible for doing much work. Given the assumptions above are correct (including the final number being swapped), the maximum time that could be attributable to this stage would be 3 seconds. Even that number could be inflated for other reasons. There is probably not much (or any) optimization which could really be done here.

As a point of comparison, what is the total duration for the same explain operation when executed directly against the PRIMARY member of the underlying replica set for the shard? I would expect that the majority of the time (when using explain) will be dominated by the work being performed by the underlying shard, so the numbers will likely be similar.