Query speed: multiple find().count() queries vs a single $facet aggregation

Hi, I am a newbie to MongoDB and I am currently running into a query-speed problem when comparing multiple find().count() calls with a single aggregation ({"$facet": {"countAmount": [{"$match": ...}, {"$count": ...}]}}), and I would like to get some help.

I discovered that, on a collection with millions of records, running multiple queries

db.collection.find({A: true}).count()
db.collection.find({B: true}).count()
db.collection.find({C: true}).count()

is much faster than merging them into a single aggregation:

 db.collection.aggregate( {"$facet":{ "countA":[{"$match": {A: true}},{"$count":"count"}],   "countB":[{"$match": {B: true}},{"$count":"count"}],   "countC":[{"$match": {C: true}},{"$count":"count"}] }})

Hence I would like to know the best practice for this kind of query. Thank you in advance! 🙂

Hi @Aaron_Lee1 ,

It is possible that running the queries with find will be faster than $facet, as the query planner may be able to parse them and use indexes better.

Do you have separate indexes on each of the fields?

What I would suggest is to create a partial index on each of the three fields:

db.collection.createIndex({A:1},{ partialFilterExpression: { A : true }})
...

Then, in your queries, filter on and project only that field to take advantage of a covered query scan:

db.collection.find({A: true},{A:1}).count()
db.collection.find({B: true},{B:1}).count()
db.collection.find({C: true},{C:1}).count()
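
As a quick check (a sketch, not a guarantee), you can verify that such a count is served from the index alone by looking at its explain output; you would hope to see a COUNT_SCAN input stage and totalDocsExamined: 0:

// Sketch: explain the count and confirm it only touches the index
// (expect stage: 'COUNT_SCAN' and totalDocsExamined: 0)
db.collection.explain("executionStats").count({ A: true })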

Nice to meet you @Pavel_Duchovny,

Thank you for your suggestion to create indexes on the fields. However, would you mind telling me more about how $facet parses the query?

I initially looked into $facet because I want the three count queries to read the same set of data from the secondary MongoDB node (since the queries are merged into one, I can be sure there is only one access rather than three separate ones). There will still be new inserts into the primary MongoDB while the queries run. However, the $facet query times out when there is a lot of data.

What would be the best practice for handling this case?

P.S. I am using Java Spring with MongoTemplate to run the queries.

Thank you once again for your kind reply and suggestion! 😉

@Aaron_Lee1 ,

Look at the explain plans of each query:

When running the single count query (I actually prefer to use the count operation):

db.collection.count({A :true})
executionStats: {
    executionSuccess: true,
    nReturned: 0,
    executionTimeMillis: 0,
    totalKeysExamined: 1001,
    totalDocsExamined: 0,
    executionStages: {
      stage: 'COUNT',
      nReturned: 0,
      executionTimeMillisEstimate: 0,
      works: 1001,
      advanced: 0,
      needTime: 1000,
      needYield: 0,
      saveState: 1,
      restoreState: 1,
      isEOF: 1,
      nCounted: 1000,
      nSkipped: 0,
      inputStage: {
        stage: 'COUNT_SCAN',
        nReturned: 1000,

It performs just an index scan on { A: 1 } with no need to examine documents; in my case I had 1000 documents with A: true in the collection.

When using $facet, look at the explain plan:


          "nReturned": 1600,
          "executionTimeMillis": 4,
          "totalKeysExamined": 0,
          "totalDocsExamined": 1600,
          "executionStages": {
            "stage": "PROJECTION_SIMPLE",
            "nReturned": 1600,
            "executionTimeMillisEstimate": 0,
            "works": 1602,
            "advanced": 1600,
            "needTime": 1,
            "needYield": 0,
            "saveState": 2,
            "restoreState": 2,
            "isEOF": 1,
            "transformBy": {
              "A": 1,
              "B": 1,
              "C": 1,
              "_id": 0
            },
            "inputStage": {
              "stage": "COLLSCAN",
              "nReturned": 1600,
              "executionTimeMillisEstimate": 0,
              "works": 1602,
              "advanced": 1600,
              "needTime": 1,
              "needYield": 0,
              "saveState": 2,
              "restoreState": 2,
              "isEOF": 1,
              "direction": "forward",
              "docsExamined": 1600
            }
          },
          "allPlansExecution": []
        }
      },
      "nReturned": 1600,
      "executionTimeMillisEstimate": 2
    },
    {
      "$facet": {
        "countA": [
          {
            "$internalFacetTeeConsumer": {},
            "nReturned": 1600,
            "executionTimeMillisEstimate": 2
          },
          {
            "$match": {
              "A": {
                "$eq": true
              }
            },
            "nReturned": 1000,
            "executionTimeMillisEstimate": 2
          },
          {
            "$group": {
              "_id": {
                "$const": null
              },
              "count": {
                "$sum": {
                  "$const": 1
                }
              }
            },
            "maxAccumulatorMemoryUsageBytes": {
              "count": 80
            },
            "totalOutputDataSizeBytes": 237,
            "usedDisk": false,
            "spills": 0,
            "nReturned": 1,
            "executionTimeMillisEstimate": 2
          },
          {
            "$project": {
              "count": true,
              "_id": false
            },
            "nReturned": 1,
            "executionTimeMillisEstimate": 2
          }
        ],
        "countB": [
          {
            "$internalFacetTeeConsumer": {},
            "nReturned": 1600,
            "executionTimeMillisEstimate": 0
          },
          {
            "$match": {
              "B": {
                "$eq": true
              }
            },
            "nReturned": 500,
            "executionTimeMillisEstimate": 0
          },
          {
            "$group": {
              "_id": {
                "$const": null
              },
              "count": {
                "$sum": {
                  "$const": 1
                }
              }
            },
            "maxAccumulatorMemoryUsageBytes": {
              "count": 80
            },
            "totalOutputDataSizeBytes": 237,
            "usedDisk": false,
            "spills": 0,
            "nReturned": 1,
            "executionTimeMillisEstimate": 0
          },
          {
            "$project": {
              "count": true,
              "_id": false
            },
            "nReturned": 1,
            "executionTimeMillisEstimate": 0
          }
        ],
        "countC": [
          {
            "$internalFacetTeeConsumer": {},
            "nReturned": 1600,
            "executionTimeMillisEstimate": 0
          },
          {
            "$match": {
              "C": {
                "$eq": true
              }
            },
            "nReturned": 100,
            "executionTimeMillisEstimate": 0
          },
          {
            "$group": {
              "_id": {
                "$const": null
              },
              "count": {
                "$sum": {
                  "$const": 1
                }
              }
            },
            "maxAccumulatorMemoryUsageBytes": {
              "count": 80
            },
            "totalOutputDataSizeBytes": 237,
            "usedDisk": false,
            "spills": 0,
            "nReturned": 1,
            "executionTimeMillisEstimate": 0
          },
          {
            "$project": {
              "count": true,
              "_id": false
            },
            "nReturned": 1,
            "executionTimeMillisEstimate": 0
          }
        ]
      },
      "nReturned": 1,
      "executionTimeMillisEstimate": 2
    }
  ],
  "serverInfo": {
    "host": "atlas-rg2tcy-shard-00-02.uvwhr.mongodb.net",
    "port": 27017,
    "version": "6.1.0",
    "gitVersion": "0ca11aca38c75d3c8fb5bac5bd103b950718a896"
 

It actually does a collection scan over the full collection and needs to read the documents, which are much larger than the index entries.

I think that if you need this in a single query, you are much better off using $unionWith rather than $facet:

db.collection.aggregate([{
 $match: {
  A: true
 }
}, {
 $count: 'countA'
}, {
 $unionWith: {
  coll: 'collection',
  pipeline: [
   {
    $match: {
     B: true
    }
   },
   {
    $count: 'countB'
   }
  ]
 }
}, {
 $unionWith: {
  coll: 'collection',
  pipeline: [
   {
    $match: {
     C: true
    }
   },
   {
    $count: 'countC'
   }
  ]
 }
}])
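
With the sample data from the explain output above, the result would look roughly like this (a sketch; your counts will differ):

// One result document per branch of the pipeline
[ { countA: 1000 }, { countB: 500 }, { countC: 100 } ]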

Ty
Pavel


@Pavel_Duchovny

Thank you for your detailed explanation.

I was just following the suggestion in the comments of the following topic.

However, I would like to know whether it is better to merge the multiple count queries into one using $unionWith in my case, which uses a replica set with a primary and a secondary database, so that the multiple counts read the same set of documents at that point in time. Or are there other methods to ensure the separate queries read the data as of the same specific point in time, given there is some overhead when I connect to the database with MongoTemplate? What is the common or best practice for this kind of problem?

Please let me know if anything is unclear.

Thank you so much!
Aaron

I don’t think there is a definitive answer.

You need to test the alternatives. I suspect that $facet is the least preferable option.

Indexes should fit in your memory working set, so scanning indexes for the counts should potentially be the most efficient approach.
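
As a rough check (a sketch; serverStatus may be restricted on some shared tiers), you can compare the collection's index sizes to the configured WiredTiger cache:

// Per-index sizes in bytes
db.collection.stats().indexSizes
// Configured WiredTiger cache size in bytes
db.serverStatus().wiredTiger.cache["maximum bytes configured"]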

Please test the methods and let us know what worked.

Pavel

Got it! Thank you so much for your help and advice.

Have a nice day,
Aaron
