Where can I find the costs of MongoDB operators, and how are they applied?

Hello

To write fast queries, we need to know at least some basics of how MongoDB works.

For example, if I have an array and apply 5 $map operations to it, will they be evaluated lazily with 1 array read, or will it be 5 reads? If I do 1 $map and 1 $avg, will it be 2 array reads, or again 1 lazy read?
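
For concreteness, by "5 $map" I mean something like the following (a made-up example; the field a and the +1 step are just placeholders):

db.collection.aggregate([
  // Five chained $map stages over the same array field "a".
  // The question: does the server walk "a" once, or five times?
  { $set: { a: { $map: { input: "$a", in: { $add: [ "$$this", 1 ] } } } } },
  { $set: { a: { $map: { input: "$a", in: { $add: [ "$$this", 1 ] } } } } },
  { $set: { a: { $map: { input: "$a", in: { $add: [ "$$this", 1 ] } } } } },
  { $set: { a: { $map: { input: "$a", in: { $add: [ "$$this", 1 ] } } } } },
  { $set: { a: { $map: { input: "$a", in: { $add: [ "$$this", 1 ] } } } } }
])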

Also, I noticed that $let is slow when nested, which I didn't expect.

Is there a place that tells us about such things? I don't mean all the internals of MongoDB, just the basics, so we know how to write our queries.

In the documentation I read one nice page about pipeline optimizations; is there any more information about these things?

Thank you

I think specific questions would be easier to answer. For instance, I cannot reproduce the slow embedded $let you mention (let’s discuss that in the other thread).

The entire document will usually be read from disk when you query for it; once it's in memory, how many times we reference a particular field is likely to be nearly negligible by comparison. Generally, in databases the cost of I/O (reads and writes to/from disk) will dominate the cost of actually working with the document.

Asya

As for $let, that's covered in the other question.

Yesterday I tried to answer a Stack Overflow question and test a query.
The Stack Overflow question is here.

It was computing an average from an array, but that array had to be transformed first before we could do it.

So I thought I had to write a query that reads the array only once, and I used $reduce.
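
The idea was something like this (a sketch of the shape, not the exact query; the field names follow the answer quoted below, and $getField needs MongoDB 5.0+):

db.collection.aggregate([
  {
    $set: {
      average: {
        $multiply: [
          {
            $divide: [
              {
                // Sum the first field's value of each element in one pass.
                $reduce: {
                  input: "$analysis",
                  initialValue: 0,
                  in: {
                    $add: [
                      "$$value",
                      {
                        $getField: {
                          field: "v",
                          input: { $first: { $objectToArray: "$$this" } }
                        }
                      }
                    ]
                  }
                }
              },
              // $size walks the array again to count its elements.
              { $size: "$analysis" }
            ]
          },
          100
        ]
      }
    }
  }
])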

Someone else gave the answer below:

db.collection.aggregate([
  {
    // Turn each element of "analysis" into an array of { k, v } pairs.
    $set: {
      analysis: {
        $map: {
          input: "$analysis",
          in: { $objectToArray: "$$this" }
        }
      }
    }
  },
  {
    // Keep only the value of the first field of each element.
    $set: {
      analysis: {
        $map: {
          input: "$analysis",
          in: { $first: "$$this.v" }
        }
      }
    }
  },
  // Average the remaining numbers and scale by 100.
  { $set: { average: { $multiply: [ { $avg: "$analysis" }, 100 ] } } }
])

When I saw it I thought it would be very slow: they did 2 $map operations (2 array reads) and then 1 $avg, which looked like 3 array reads. So I benchmarked it, and it ran faster than my single $reduce (I think one reason my $reduce was slower is that I used 2-3 variables with $let).

Then I did a dummy benchmark where I applied something like 5 $map operations to an array, and the speed was similar to doing only 1 $map, so I thought that internally the $map operations must be evaluated lazily (instead of 5 reads, only 1 lazy read).

That is what got me thinking about performance: whether evaluation is lazy, and why $let is slow.

How did you benchmark it? I just did a quick run and they seem somewhat comparable in performance.

Note that their pipeline can be reduced to a single stage, and it will then run faster. But keep in mind that benchmarking is hard because there is significant variance from run to run, especially if you are measuring round-trip time to/from the client. I recommend measuring how long the actual pipeline takes, to eliminate client-server communication from the equation.
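
For instance, the three $set stages could be folded into one, something like this (a sketch of the idea, not necessarily the exact rewrite; $getField requires MongoDB 5.0+, and on older versions a $let over $objectToArray would do the same):

db.collection.aggregate([
  {
    $set: {
      average: {
        $multiply: [
          {
            $avg: {
              $map: {
                input: "$analysis",
                // Value of the first field of each element, in one pass.
                in: {
                  $getField: {
                    field: "v",
                    input: { $first: { $objectToArray: "$$this" } }
                  }
                }
              }
            }
          },
          100
        ]
      }
    }
  }
])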

Did you try running yours without $let? It doesn't seem to do anything special, just renaming $$value and $$this, which is not necessary.
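
That is, a $let like the following only aliases the built-in variables (a minimal made-up example that sums an array):

// With $let: $$value and $$this are merely renamed before use.
{ $reduce: {
    input: "$analysis", initialValue: 0,
    in: { $let: {
      vars: { acc: "$$value", elem: "$$this" },
      in: { $add: [ "$$acc", "$$elem" ] }
    } }
} }

// Without $let: the same result, with one less layer of evaluation.
{ $reduce: {
    input: "$analysis", initialValue: 0,
    in: { $add: [ "$$value", "$$this" ] }
} }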

Asya


You might only have one $reduce, but you also have { $size: "$analysis" }, which traverses the array to count its size…

I don't know how to do benchmarks; all I do is write the pipeline and end it with an $out stage, to avoid sending data to the client.
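
A way to time the pipeline on the server itself (a sketch; the exact timing fields in the output, such as executionTimeMillisEstimate, vary by version and pipeline shape):

// Runs the pipeline and reports per-stage timing estimates
// without shipping the result documents to the client.
db.collection.explain("executionStats").aggregate([
  /* ...pipeline under test... */
])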

I thought $size had constant cost, O(1). This is why I wrote this question: so we can learn things like that, for example that $size costs O(n).

That query is not important, but it would be good if the documentation said more about the costs of the operators.

I thought $size had constant cost, O(1). This is why I wrote this question

I see - the best way to infer some of this is to look at the BSON specification.

You can see from the way arrays are represented that there is no way to know how big one is without walking to the end of it. You can also see that in the code here (among other places).
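
For example (a sketch using the bson package from npm, not the server code itself): an array is serialized like a sub-document keyed "0", "1", "2", … behind a byte-length prefix, so no element count is stored anywhere.

// Sketch: inspect how BSON serializes an array (npm "bson" package).
const { serialize } = require("bson");

const bytes = serialize({ a: [ 10, 20, 30 ] });
// The array is embedded as { "0": 10, "1": 20, "2": 30 } with a
// length in bytes, not a count of elements.
console.log(Buffer.from(bytes).toString("hex"));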
