Performance of Aggregate.group

Hi, we are leveraging MongoDB to process a large amount of data, but performance is much slower than expected.

We have 4,600,000 records with position data. Each record has 38 properties, one of which is the priority (the lower the priority number, the higher the record should be prioritised) and one of which is the date-time it was issued at.
About 1 in every 10,000 updates is for the same domain object, and we want the latest state per domain object.

To do this we save incoming data per field and created an aggregation pipeline to get the latest state for a given domain object.
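For illustration, one barProperty document then looks roughly like this (only the fields referenced by the pipeline are shown; the values are made up):

{
   "domainObjectId": "e1f2a6ba-de58-4548-a40f-990bdbf57d48",
   "field": "latitude",
   "value": 51.0543,
   "priority": 2,                                    // lower number = higher priority
   "issuedAt": ISODate("2021-06-01T10:15:00.000Z")
}

And the pipeline itself: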

db.barProperty.aggregate([
   { $match: {
      domainObjectId : "e1f2a6ba-de58-4548-a40f-990bdbf57d48"
   }},
   { $sort: {
      priority: 1,
      issuedAt: -1
   }},
   { $group: {
      "_id": "$field",
      "value": {"$last": "$value"},
      "issuedAt": {"$last": "$issuedAt"}
   }},
   { "$group": {
      "_id": null,
      "issuedAt": { "$max": "$issuedAt" },
      "array": {
         "$push": {
            "k": "$_id",
            "v": "$value"
         }
      }
   }},
   { "$replaceRoot": {
      "newRoot": {
         "$mergeObjects": [
            { "$arrayToObject": "$array" },
            { "issuedAt": "$issuedAt" }
         ]
      }
   }}
],
{ allowDiskUse: true });

There is an index on:

db.barProperty.createIndex(
   { domainObjectId: 1, priority: 1, issuedAt: -1 }
)

The following $group stage takes up to 30 seconds for a given identifier:

{ "$group": {
   "_id": "$field",
   "value": { "$last": "$value" },
   "issuedAt": { "$last": "$issuedAt" }
}}

The second $group and the final $mergeObjects are relatively quick, since at that point they only handle 38 documents.
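For reference, the per-stage behaviour can be inspected with a standard explain("executionStats") run; a minimal sketch on the same example identifier:

db.barProperty.explain("executionStats").aggregate([
   { $match: { domainObjectId: "e1f2a6ba-de58-4548-a40f-990bdbf57d48" } },
   { $sort: { priority: 1, issuedAt: -1 } },
   { $group: {
      "_id": "$field",
      "value": { "$last": "$value" },
      "issuedAt": { "$last": "$issuedAt" }
   }}
])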

Any help with this is highly appreciated.

Hi @Bert_Roex and welcome to the MongoDB Community :muscle:!

It looks like you have created the right index so far, but I’m wondering if we can take it one step further.

Are you really grouping by $field? If the group key doesn’t reuse the fields from the $sort (and the index behind it), the $group stage won’t be able to take advantage of them; see the sketch below.
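For example, a sketch, assuming you can add field to both the index and the sort so that documents reach the $group already ordered by the group key (whether the optimizer can then collapse the $sort + $group into a DISTINCT_SCAN depends on your MongoDB version):

db.barProperty.createIndex(
   { domainObjectId: 1, field: 1, priority: 1, issuedAt: -1 }
)

db.barProperty.aggregate([
   { $match: { domainObjectId: "e1f2a6ba-de58-4548-a40f-990bdbf57d48" } },
   { $sort: { field: 1, priority: 1, issuedAt: -1 } },
   { $group: {
      "_id": "$field",
      "value": { "$last": "$value" },
      "issuedAt": { "$last": "$issuedAt" }
   }}
])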

If that doesn’t help (likely), let’s look at pure performance.

How many documents are entering your pipeline? Is there no way to refine / reduce that number?
Hardware: do you have spare CPU, RAM & IOPS, or is your cluster desperate for more resources?

Another idea would be to maintain that “latest state” in another collection, one document per domain object. As you insert into your main collection, you could also upsert the latest state into that second collection and read from there; a sketch follows below.
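A minimal sketch of that write path, assuming a collection named barLatestState keyed by domainObjectId (the collection name and the update variable are hypothetical, and this ignores the priority tie-breaking for brevity):

// "update" is the hypothetical incoming message that is also written to barProperty
db.barLatestState.updateOne(
   { _id: update.domainObjectId },
   {
      $set: { [update.field]: update.value },   // latest value for this field
      $max: { issuedAt: update.issuedAt }       // keep the newest timestamp
   },
   { upsert: true }
)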

Maybe another idea could be to use on-demand materialized views?
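For instance, a sketch that periodically rebuilds the latest state for every domain object and writes it out with $merge (the output collection name barLatestView is an assumption):

db.barProperty.aggregate([
   { $sort: { domainObjectId: 1, priority: 1, issuedAt: -1 } },
   { $group: {
      "_id": { "domainObjectId": "$domainObjectId", "field": "$field" },
      "value": { "$last": "$value" },
      "issuedAt": { "$last": "$issuedAt" }
   }},
   { $group: {
      "_id": "$_id.domainObjectId",
      "issuedAt": { "$max": "$issuedAt" },
      "array": { "$push": { "k": "$_id.field", "v": "$value" } }
   }},
   { $replaceRoot: {
      "newRoot": {
         "$mergeObjects": [
            { "$arrayToObject": "$array" },
            { "_id": "$_id", "issuedAt": "$issuedAt" }
         ]
      }
   }},
   { $merge: { into: "barLatestView", whenMatched: "replace", whenNotMatched: "insert" } }
], { allowDiskUse: true })

Reads then become a single find by _id on barLatestView, at the cost of the view lagging behind the writes.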

I know it’s just a bunch of ideas to explore, but it’s hard to do better without more context, numbers, and the possibility to explore options myself.

Cheers,
Max.

The dataset cannot be reduced, since any of the fields can reside in any of the messages (e.g. a field may only be sent in the first, 10th, or nth record).

We are indeed grouping by field, as we would like the output of the group stage to be one single value per field.
This means the index cannot be reused.

We are running on a cluster that has sufficient CPU, RAM & IOPS.

Determining the latest state is exactly what the group stage computes, and that is the part we need to speed up to meet our non-functional requirements.

Thanks for your reply
Bert

Any chance you can share a few documents, the real pipeline and the expected result based on these docs?