How to compute frequency for multiple fields using a single pipeline in MongoDB

Md_Abdur_Rakib · August 10, 2021, 5:31am

Is it possible to calculate the frequency of multiple fields with a single query in MongoDB? I can do that with separate $group stages for each field. How can I optimize it and build one pipeline that can do the job for all items?

I have the following pipeline in MongoDB 4.5

{
  $match: {
    field1: { $in: ['value1', 'value2'] },
    field2: { $in: ['v1', 'v2'] },
  }
},
{
  $group: {
    _id: {
      field1: '$field1',
      field2: '$field2'
    },
    frequency: { $sum: 1.0 }
  }
}

From this, I obtain data like the following:

{ 
  "_id": {
      "field1": "value1", 
      "field2": "v1"
  }, 
  "count": 7.0
},
{ 
  "_id": {
      "field1": "value1", 
      "field2": "v2"
  }, 
  "count": 3.0
},
{ 
  "_id": {
      "field1": "value2", 
      "field2": "v1"
  }, 
  "count": 4.0
}

The result that I am trying to get is:

{
  "field1": [
    "value1": 10.0,
    "value2": 4.0
  ],
  "field2": [
    "v1": 11.0,
    "v2": 3.0
  ]
}

steevej · August 10, 2021, 12:38pm

I think you may achieve your goal with https://docs.mongodb.com/manual/reference/operator/aggregation/facet/ with something like:

{
   $facet : {
     "field1" : [
       { $match: { field1: { $in: ['value1', 'value2'] } } }
       // group $field1 
     ]  ,
     "field2" : [
       { $match : // field2 match as original post
       // group $field2
     ]
  }
}

Takis · August 10, 2021, 12:59pm

This takes it from where you are to the final output.
But i dont know from where you started.

Its 2 times the same code, for field1 and field2

[ {
    "$facet" : {
      "field1" : [ {
        "$project" : {
          "_id.field1" : 1,
          "count" : 1
        }
      }, {
        "$project" : {
          "_id" : 0,
          "value" : "$_id.field1",
          "count" : 1
        }
      }, {
        "$group" : {
          "_id" : "$value",
          "count" : {
            "$sum" : "$count"
          }
        }
      }, {
        "$addFields" : {
          "value" : "$_id"
        }
      }, {
        "$project" : {
          "_id" : 0
        }
      } ],
      "field2" : [ {
        "$project" : {
          "_id.field2" : 1,
          "count" : 1
        }
      }, {
        "$project" : {
          "_id" : 0,
          "value" : "$_id.field2",
          "count" : 1
        }
      }, {
        "$group" : {
          "_id" : "$value",
          "count" : {
            "$sum" : "$count"
          }
        }
      }, {
        "$addFields" : {
          "value" : "$_id"
        }
      }, {
        "$project" : {
          "_id" : 0
        }
      } ]
    }
  } ]

Results

{"field1" : [{"count" : 4.0, "value" : "value2"}, {"count" : 10.0, "value" : "value1"}],
 "field2" : [{"count" : 3.0, "value" : "v2"}, {"count" : 11.0, "value"  : "v1"}]}

Asya_Kamsky · August 11, 2021, 5:39pm

It turns out you can do this in a single pipeline and you don’t need $facet at all (which basically emulates multiple pipelines). Here’s how (using sample data that matches your field counts):

db.frequency.find()
{ "_id" : ObjectId("6112ece6bba8077232b072be"), "field1" : "value1", "field2" : "v2" }
{ "_id" : ObjectId("6112ece9bba8077232b072bf"), "field1" : "value1", "field2" : "v1" }
{ "_id" : ObjectId("6112ecebbba8077232b072c0"), "field1" : "value1", "field2" : "v1" }
{ "_id" : ObjectId("6112ecebbba8077232b072c1"), "field1" : "value1", "field2" : "v2" }
{ "_id" : ObjectId("6112ecf0bba8077232b072c2"), "field1" : "value2", "field2" : "v2" }
{ "_id" : ObjectId("6112ecf3bba8077232b072c3"), "field1" : "value2", "field2" : "v1" }
{ "_id" : ObjectId("6112ecf4bba8077232b072c4"), "field1" : "value2", "field2" : "v1" }
{ "_id" : ObjectId("6112ecf8bba8077232b072c5"), "field1" : "value2", "field2" : "v1" }
{ "_id" : ObjectId("6112ed21bba8077232b072c6"), "field1" : "value1", "field2" : "v1" }
{ "_id" : ObjectId("6112ed27bba8077232b072c7"), "field1" : "value1", "field2" : "v1" }
{ "_id" : ObjectId("6112ed27bba8077232b072c8"), "field1" : "value1", "field2" : "v1" }
{ "_id" : ObjectId("6112ed28bba8077232b072c9"), "field1" : "value1", "field2" : "v1" }
{ "_id" : ObjectId("6112ed29bba8077232b072ca"), "field1" : "value1", "field2" : "v1" }
{ "_id" : ObjectId("6112ed34bba8077232b072cb"), "field1" : "value1", "field2" : "v1" }
db.frequency.aggregate([
    {$project:{_id:0}},
    {$project:{x:{$objectToArray:"$$ROOT"}}},
    {$unwind:"$x"},
    {$group:{_id:{f:"$x.k",v:"$x.v"},count:{$sum:1}}},
    {$group:{_id:"$_id.f", values:{$push:{k:"$_id.v", v:"$count"}}}},
    {$group:{_id:0, all:{$push:{k:"$_id", v:{$arrayToObject:"$values"}}}}}, 
    {$replaceWith:{$arrayToObject:"$all"}}
]).pretty()
{
"field2" : {
	"v2" : 3,
	"v1" : 11
},
"field1" : {
	"value2" : 4,
	"value1" : 10
}
}

To explain, after removing _id, I convert the object into an array of key/value pairs, unwind them, then I group by key+value to get counts for each distinct field value, then group by field names (keys) to get things summarized by each field and then group by 0 to get a single document result. Pushing things into array of "k"/"v" subdocuments allows me to convert it back to an object where each field has subdocument of values and their counts.

Asya