How to accumulate values of array items starting from specific index

Stanislav_Kuznetsov · June 30, 2023, 4:58am

I want to get some sort of accumulation object (like Others) as last array item.

Having this data

[
  {
    "_id": "1",
    "name": "name1"
  },
  {
    "_id": "2",
    "name": "name2"
  },
  {
    "_id": "3",
    "name": "name1"
  },
  {
    "_id": "4",
    "name": "name3"
  },
  {
    "_id": "5",
    "name": "name4"
  },
  {
    "_id": "6",
    "name": "name3"
  },
  {
    "_id": "7",
    "name": "name5"
  },
  {
    "_id": "8",
    "name": "name5"
  },
  {
    "_id": "9",
    "name": "name5"
  },
  {
    "_id": "10",
    "name": "name6"
  }
]

i run this aggregation:

db.collection.aggregate([
  {
    $sortByCount: "$name"
  }
])

As result i want to get:

[
  {
    "_id": "name5",
    "count": 3
  },
  {
    "_id": "name3",
    "count": 2
  },
  {
    "_id": "name1",
    "count": 2
  },
  {
    "_id": "others",
    "count": 3
  }
]

How can i do it?

John_Sewell · June 30, 2023, 4:34pm

What goes into others? Do you want a sum for everything with a name prefix grouped by name and everything else in “others”?

Like this?

db.getCollection("test").aggregate([
{
    $facet:{
        matchingItems:[
            {
                $match:{name:/^name/}
            },
            {
                $sortByCount: "$name"
            }
        ],
        otherItems:[
            {
                $match:{name:{$not:/^name/}}
            },
            {
                $group:{
                    _id:'other',
                    count:{$sum:1}
                }
            }
        ]
    }
},
{
    $project:{
        totalResults:{
            $setUnion:[
                '$matchingItems',
                '$otherItems'
            ]
        }
    }
},
{
    $unwind:'$totalResults'
},
{
    $replaceRoot:{newRoot:'$totalResults'}
}
])

Stanislav_Kuznetsov · July 1, 2023, 5:23am

Basically, i’m going to get top N mostly frequently appeared names, and others group should contain number of all other names. But ideally i want to have not just absolute numbers, but percentage of total numbers of elements

[
  {
    "_id": "name5",
    "count": 3 // appeared 3 times in original set of documents
    "percentage": 30 // (3 / 10) 10 is total document number 
  },
  {
    "_id": "name3",
    "count": 2,
    "percentage": 20
  },
  {
    "_id": "name1",
    "count": 2,
   "percentage": 20
  },
  {
    "_id": "others",
    "count": 3,
    "percentage": 30
]

John_Sewell · July 1, 2023, 7:06am

Ahh, gotcha. Shall have a play later.

steevej · July 1, 2023, 1:34pm

I would try with something like:

top_n = 10 
pipeline = [
    { "$sortByCount : "$name" } ,
    { "$facet" : {
        "top_n" : [
            { "$limit" : top_n }
        ] ,
        "others" : [
            { "$skip" : top_n } ,
            { "$group" : {
                _id : null ,
                count : { "$sum" : "$count" } 
            } }
        ] 
    } } ,
    { "$set" : {
        "total" : { "$sum" : [ 
            { "$reduce" : {
                "input" : "$top_n.count" ,
                "initialValue" : 0 ,
                "in" : { "$sum" : [ "$$this" , "$$value" ] }
            } } ,
           { "$arrayElemAt" : [ "$others.count" , 0 ] }
        ] }
    } } ,
    { "$set" : {
       "top_n" : { "$map" : {
            "input" : "$top_n" ,
            /* use total field to update each element with percentage */
       } }
       "others" : { "$map" :
            "input" : "$others" ,
            /* use total field to update each element with percentage */
      } }
    } } ,
    /* what ever cosmetic $project,$unwind stage to match desired output */
]

John_Sewell · July 1, 2023, 3:55pm

I wanted to have a play to see what I could come up with as well, it seems I’ve a different approach.

db.getCollection("test").aggregate([
{
    $sortByCount: "$name"
},
{
    $facet:{
        topItems:[
            {
                $limit:3
            }
        ],
        otherItems:[
            {
                $skip:3
            },
            {
                $group:{
                    _id:'Others',
                    count:{$sum:'$count'}
                }
            }
        ],
        totalitems:[
            {
                $group:{
                    _id:null,
                    total:{$sum:'$count'}
                }
            }
        ]
    }
},
{
    $project:{
        allData:{
            $concatArrays:[
                '$topItems',
                '$otherItems'
            ]
        },
        totalitems:{$arrayElemAt:['$totalitems.total', 0]}
    }
},
{
    $unwind:'$allData'
},
{
    $project:{
        Item:'$allData._id',
        Count:'$allData.count',
        Percentage:{
            $round:[
                {
                    $multiply:[
                        100, 
                        {    
                            $divide:[
                                '$allData.count', 
                                '$totalitems'
                            ]
                        }
                    ]
                }, 
                0
            ]
        } 
    }
}
])

We start with a grouping and sort, then use a facets to perform multiple operations over the same data we’ve just calculated. We calculate:

List of top 3 items
Sum of everything else
Sum of all items, to be used in percentage calculation

We then combine the top X items and the “others” into one array ready for the next stage.

We unwind the field containing the data, which means each document has the data item as well as the total items.

Then it’s a final project to calculate the percentages for each item and round it to 0DP

steevej · July 2, 2023, 1:41pm

I decided to calculate totalitems (count in my code) in a separated $set stage because doing it inside the $facet implies that all incoming documents are processed another time. With the $set after, only the top_n accumulation and others are processed. One version might be more efficient than the other but I cannot tell which one.

I really like your idea of using $unwind on the result of $concatArrays.

Computing percentage after the $unwind is also nice since it is much simpler than doing 2 $map.

John_Sewell · July 2, 2023, 2:02pm

I need to play more with the reduce, most of my normal work involves reporting or data fixes so nice to look at different things.
Shall try building a large dataset and checking performance but i dont think there will bemuch in it as we both reduce the document volume so massively so quickly.

John_Sewell · July 2, 2023, 2:53pm

Created a collection with 1M records and a covering index…both approaches are about the same, sub-600ms for the complete analysis of all 1M records showing top 3 items + others.

Stanislav_Kuznetsov · July 2, 2023, 4:49pm

Thank you for your help. $facet step with $limit, $skip and $group does good job. But unfortunately i can’t use it because in fact this computation is part of another higher level $facet step. Original documents have much more fields than just name, and i have to calculate different statistics on same amount of matched docs.
This is how it looks in general:

{
  $match: {} // some match
},
{
  $facet: {
    name: [] // topic of our discussion,
    category: [],
    type: [],
    ...
  }
},
{
 // some formatting steps of results from above steps.
}

I found an option to get this sort of statistics that works in my case:

{
  $facet: {
    name: [{ $sortByCount: "$name" }],
   ...
  }
},
{
  $set: {
    name: {
      $reduce: {
        input: {
          $map: {
            input: {
              $zip: {
                inputs: [
                  "$name",
                  {
                    $range: [
                      0,
                      {
                        $size: "$name",
                      },
                    ],
                  },
                ],
              },
            },
            as: "item",
            in: {
              $mergeObjects: [
                {
                  $arrayElemAt: [
                    "$$item",
                    0
                  ],
                },
                {
                  index: {
                    $arrayElemAt: [
                      "$$item",
                      1
                    ],
                  },
                },
              ],
            },
          },
        },
        initialValue: {},
        in: {
          $mergeObjects: [
            "$$value",
            {
              $cond: [
                {
                  $gt: [
                    "$$this.index",
                    3
                  ],
                },
                {
                  other: {
                    $add: [
                      {
                        $cond: [
                          "$$value.other",
                          "$$value.other",
                          0
                        ],
                      },
                      "$$this.count",
                    ],
                  },
                  total: {
                    $add: [
                      {
                        $cond: [
                          "$$value.total",
                          "$$value.total",
                          0
                        ],
                      },
                      "$$this.count",
                    ],
                  },
                },
                {
                  top: {
                    $concatArrays: [
                      {
                        $cond: [
                          "$$value.top",
                          "$$value.top",
                          []
                        ],
                      },
                      [
                        {
                          _id: "$$this._id",
                          count: "$$this.count",
                        },
                      ],
                    ],
                  },
                  total: {
                    $add: [
                      {
                        $cond: [
                          "$$value.total",
                          "$$value.total",
                          0
                        ],
                      },
                      "$$this.count",
                    ],
                  },
                },
              ],
            },
          ],
        },
      },
    },
  },
},
{
  $set: {
    name: {
      $concatArrays: [
        {
          $cond: [
            "$name.top",
            {
              $map: {
                input: "$name.top",
                as: "item",
                in: {
                  $setField: {
                    field: "percentage",
                    input: "$$item",
                    value: {
                      $multiply: [
                        {
                          $divide: [
                            "$$item.count",
                            "$name.total"
                          ],
                        },
                        100,
                      ],
                    },
                  },
                },
              },
            },
            [],
          ],
        },
        {
          $cond: [
            "$name.other",
            [
              {
                _id: "__other__",
                count: "$name.other",
                percentage: {
                  $multiply: [
                    {
                      $divide: [
                        "$name.other",
                        "$name.total"
                      ],
                    },
                    100,
                  ],
                },
              },
            ],
            [],
          ],
        },
      ],
    },
  },
}

It can be not as optimal as possible, but for now it works. Feel free to give your feedback

John_Sewell · July 2, 2023, 5:57pm

Be wary of the data that feeds into a facet…as they canot make use of indexes.

I would love to be able to embed facets within facets though…