Compute min/max/avg from all array fields

John_Dunn · October 26, 2022, 3:32pm

I have multiple documents in this form, where the Array is an array of objects whose contents don’t really matter

{ animals : { cats: Array[3], dogs: Array[3] }}
{ animals : { cats: Array[2], fish: Array[1] }}
{ animals : { dogs: Array[2] }}

What I’d like to do is compute things like the total, min, max and average count of each category and end up with something like this

{ cats { min: XXX, max: XXX, avg: XXX }}
{ dogs { min: XXX, max: XXX; avg: XXX }}
{ fish { min: XXX, max: XXX; avg: XXX }}

This pipeline comes close

[{
 $project: {
  foo: {
   $objectToArray: '$animals'
  }
 }
}, {
 $unwind: {
  path: '$foo',
  preserveNullAndEmptyArrays: true
 }
}, {
 $group: {
  _id: '$foo.k',
  min: {
   $min: {
    $size: '$foo.v'
   }
  },
  max: {
   $max: {
    $size: '$foo.v'
   }
  },
  avg: {
   $avg: {
    $size: '$foo.v'
   }
  },
  sum: {
   $sum: {
    $size: '$foo.v'
   }
  }
 }
}]

which returns

{ "_id": "cats", "min": 2, "max": 3, "avg": 2.5, "sum": 5 }
{ "_id": "dogs", "min": 2, "max": 3, "avg": 2.5, "sum": 5 }
{ "_id": "fish", "min": 1, "max": 1, "avg": 1, "sum": 1 }

which is close but it doesn’t account for missing items. I need to treat those as 0 values and incorporate them into the various calculations. The correct result would be

{ "_id": "cats", "min": 0, "max": 3, "avg": 1.666, "sum": 5 }
{ "_id": "dogs", "min": 0, "max": 3, "avg": 1.666, "sum": 5 }
{ "_id": "fish", "min": 0, "max": 1, "avg": 0.3333, "sum": 1 }

Jason_Tran · October 27, 2022, 3:14am

Hi @John_Dunn - Welcome to the community.

Thanks for providing what you’ve attempted, current and expected output.

which is close but it doesn’t account for missing items.

I just need some clarification here - If you need the calculation to cater for missing items, then the data needs to reflect that fact.

I do understand you’ve noted that the values within the Array are not important however providing these may help with reproduction steps as it makes it easier to import to any test environments and also may give insight to how to achieve the expected output.

I need to treat those as 0 values and incorporate them into the various calculations

In the meantime, i’ve replicated the first 2 stages of the array just using integers to fill up the Array[<int>] field of your sample data. Here is the current ouput after the $unwind stage of your pipeline:

[
  {
    _id: ObjectId("6359f0e35e8d9dd1c663bebb"),
    foo: { k: 'cats', v: [ 1, 2, 3 ] }
  },
  {
    _id: ObjectId("6359f0e35e8d9dd1c663bebb"),
    foo: { k: 'dogs', v: [ 1, 2, 3 ] }
  },
  {
    _id: ObjectId("6359f0e35e8d9dd1c663bebc"),
    foo: { k: 'cats', v: [ 1, 2 ] }
  },
  {
    _id: ObjectId("6359f0e35e8d9dd1c663bebc"),
    foo: { k: 'fish', v: [ 1 ] }
  },
  {
    _id: ObjectId("6359f0e35e8d9dd1c663bebd"),
    foo: { k: 'dogs', v: [ 1, 2 ] }
  }
]

Could you clarify where the items which should be 0 would be located?

Regards,
Jason

John_Dunn · October 27, 2022, 3:29pm

Thanks for the reply Jason

I think my use of arrays was confusing the question and isn’t actually required to answer my question. I can simplify my data and swap the arrays for integer values and my question still stands. Take this data for example

{ animals : { cats: 3, dogs: 3 }}
{ animals : { cats: 2, fish: 1 }}
{ animals : { dogs: 2 }}

I would like my query to calculate the average number of cats, dogs and fish across all documents. If a given document doesn’t have a value for a specific animal I would like 0 to be used in place. In the above document the average number of cats would be calculated as ( 3 + 2 + 0 ) / 3. Basically I would like the above documents to be expanded to something like this automatically in the query.

{ animals : { cats: 3, dogs: 3, fish: 0 }}
{ animals : { cats: 2, fish: 1, dogs: 0 }}
{ animals : { dogs: 2, cats: 0, fish: 0 }}

I also need to be able to calculate the average number of all animal types without having to explicitly specify each type of animal. In my case I have 500+ different ‘types’ of objects so having to specify each one in the query isn’t really feasible.

Does that make sense?

Thanks-

John

Jason_Tran · October 28, 2022, 1:36am

Thanks for clarifying John.

I would like my query to calculate the average number of cats, dogs and fish across all documents. If a given document doesn’t have a value for a specific animal I would like 0 to be used in place.

I have a example which outputs a calculated average based off the sum and total count of distinct animal types that exists in the collection as a whole. For brevity, I only used avg in this example.

Sample documents:

DB> db.animals.find({},{_id:0})
[
  { animals: { cats: 3, dogs: 3 } },
  { animals: { cats: 2, fish: 1 } },
  { animals: { dogs: 2 } }
]

Aggregation:

DB> db.animals.aggregate([
  {
    '$facet': {
      animalSums: [
        { '$addFields': { foo: { '$objectToArray': '$animals' } } },
        { '$unwind': { path: '$foo', preserveNullAndEmptyArrays: true } },
        { '$group': { _id: '$foo.k', sum: { '$sum': '$foo.v' } } }
      ],
      totalCount: [ { '$count': 'totalDocuments' } ]
    }
  },
  { $unwind: "$totalCount" },
  { $unwind: "$animalSums" },
  {
    '$project': {
      animal: '$animalSums._id',
      sum: '$animalSums.sum',
      average: { '$divide': [ '$animalSums.sum', '$totalCount.totalDocuments' ] }
    }
  }
])

Output:

[
  { animal: 'cats', sum: 5, average: 1.6666666666666667 },
  { animal: 'fish', sum: 1, average: 0.3333333333333333 },
  { animal: 'dogs', sum: 5, average: 1.6666666666666667 }
]

Some aggregation stages / operators related documentation used in the above example for reference:

Please note I’ve only performed this on the 3 sample documents provided. Please test thoroughly on a test environment to see if it suits your use case and requirements.

In saying the above, although it may be possible to achieve the desired output, it is not performant by any measure. If this is to be part of the standard workload and is often run then it could cause resource issues. On the other hand, if the performance issues can be ignored due to it being used in a once-off scenario, then this would be fine depending on your use case(s).

There are a few possible things you may wish to consider for future:

Store all possible animals in every document, with 0 for animals not existing in the document. This will make the average calculations trivial to do at the expense of larger documents
OR
Store all possible animals in a separate collection, and store the document as they appear from your examples. This will lead to smaller document sizes relative to (1), but this would still be faster than the example aggregation above since you can easily get the total number of possible animals instead of doing it on-the-fly. As a bonus, you can check all possible animals names easily, at the expense of a more complex aggregation compared to (1)

Lastly, you should optimize the schema design according to the most common workload (considering details in 1. and 2. above). If document size is a concern, you could utilise zlib/zstd compression on the collection(s).

Regards,
Jason

John_Dunn · October 31, 2022, 4:08pm

Thanks Jason.

Is the ‘expensive’ part of your example the addFields portion? It seems like if I populated my documents with all animal types the rest of your example would still apply.

Thanks

John

Jason_Tran · November 1, 2022, 4:22am

Is the ‘expensive’ part of your example the addFields portion?

Essentially the pipeline will not be able to use any indexes which will result in all documents in the collection being processed. Although on the 3 test documents it may not seem relatively poor in terms of performance, it may eventually lead to issues with large / growing collections.

I would recommend also viewing the Explain Results documentation to further understand index usage for different aggregation pipelines for your testing.

Regards,
Jason

system · July 9, 2023, 11:52pm

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.