Performance of Aggregate.group

Hi, we are leveraging MongoDB to process a large amount of data, but performance is much slower than expected.

We have 4,600,000 records with position data. Each record has 38 properties, one of which is the priority (the lower the priority number, the higher the precedence) and one of which is the date-time it was issued at.
About 1 in every 10,000 updates is for the same domain object, and we want to keep the latest state per domain object.

To do this we store the incoming data per field and created an aggregation pipeline to get the latest state for a given domain object:

db.barProperty.aggregate([
   { $match: {
      domainObjectId : "e1f2a6ba-de58-4548-a40f-990bdbf57d48"
   }},
   { $sort: {
      priority: 1,
      issuedAt: -1
   }},
   { $group: {
      "_id": "$field",
      "value": {"$last": "$value"},
      "issuedAt": {"$last": "$issuedAt"}
   }},
   {
      "$group": {
         "_id": null,
         "issuedAt": {"$max": "$issuedAt"},
         "array": {
            "$push": {
            "k": "$_id",
            "v": "$value"
         }
      }
   }},
   { "$replaceRoot": {
      "newRoot": {
         "$mergeObjects": [
            {"$arrayToObject": "$array"},
            {"issuedAt": "$issuedAt"}
         ]
      }
   }}
],
{ allowDiskUse: true });

There is an index on

db.barProperty.createIndex(
   { domainObjectId: 1, priority: 1, issuedAt: -1 }
)

The following $group step takes up to 30 seconds for a given identifier:

{
   "$group":{
      "_id":"$field",
      "value":{
         "$last":"$value"
      },
      "issuedAt":{
         "$last":"$issuedAt"
      }
   }
}

The second $group and the final $mergeObjects are relatively quick, since at that point they only handle 38 documents.

Any help with this is highly appreciated.

Hi @Bert_Roex and welcome to the MongoDB Community :muscle:!

Looks like you have created the right index so far, but I’m wondering if we can maybe take it one step further?

Are you really grouping by $field? If the group key doesn’t line up with the fields from the sort (and the index), the index won’t be reused for the $group.

If that doesn’t help (likely), let’s look at raw performance.

How many documents are entering your pipeline? Is there no way to refine / reduce that number?
Hardware => do you have spare CPU, RAM & IOPS, or is your cluster desperate for more resources?

Another idea would be to maintain that “latest state” in another collection for each domain object: as you insert into your collection, you could also upsert the latest state into a new collection and read from there.
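
A very rough sketch of what I mean (the collection name is made up, and you would still need to apply your priority / issuedAt precedence before overwriting, either in application code or with a conditional update):

// Assumed collection name; one document per (domainObjectId, field).
db.barPropertyLatest.createIndex({ domainObjectId: 1, field: 1 }, { unique: true });

// For each incoming per-field document:
const doc = {
   domainObjectId: "e1f2a6ba-de58-4548-a40f-990bdbf57d48",
   priority: 1,
   issuedAt: new Date(),
   field: "field1",
   value: "value1"
};
db.barPropertyLatest.replaceOne(
   { domainObjectId: doc.domainObjectId, field: doc.field },
   doc,
   { upsert: true }
);

// Reading the latest state then becomes a plain indexed find:
db.barPropertyLatest.find({ domainObjectId: "e1f2a6ba-de58-4548-a40f-990bdbf57d48" });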

Maybe another idea could be to use materialized views?
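
For instance, something along these lines run periodically ($merge into an assumed collection name; it mirrors your first $group but for all domain objects at once, so this is only a sketch):

db.barProperty.aggregate([
   { $sort: { priority: 1, issuedAt: -1 } },
   { $group: {
      _id: { domainObjectId: "$domainObjectId", field: "$field" },
      value: { $last: "$value" },
      issuedAt: { $last: "$issuedAt" }
   }},
   { $merge: { into: "barPropertyView", whenMatched: "replace", whenNotMatched: "insert" } }
], { allowDiskUse: true });

Reads would then only have to reshape 38 tiny documents per domain object instead of sorting and grouping the raw data.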

I know it’s just a bunch of ideas to explore, but it’s hard to do better without more context, numbers, and the possibility to explore options myself.

Cheers,
Max.

The dataset cannot be reduced, since any of the fields can appear in any of the messages (e.g. a field may only be sent in the first, the 10th, or the nth record).

We are indeed grouping by field, as we want the output of the group stage to be a single value per field.
This indeed means the index cannot be reused.

We are running on a cluster that has sufficient CPU, RAM & IOPS.

Determining the latest state is exactly what the group stage produces, and that is the part we need to speed up to meet our non-functional requirements.

Thanks for your reply
Bert

Any chance you can share a few documents, the real pipeline and the expected result based on these docs?

As a reminder: a lower priority number means a higher priority.

Incoming JSON data (38 fields in the real case):

{
   "domainObjectId": "1b8480b9-d11f-4a9c-ba5e-a3144bce3126",
   "priority": 1,
   "field1": "value1",
   "field3": "value3"
},
{
   "domainObjectId": "1b8480b9-d11f-4a9c-ba5e-a3144bce3126",
   "priority": 0,
   "field1": "value1override",
   "field2": "value2"
},
{
   "domainObjectId": "1b8480b9-d11f-4a9c-ba5e-a3144bce3126",
   "priority": 2,
   "field1": "value1lowerprio"
}

Currently we map incoming data before persisting it to MongoDB, as follows:
{
   "domainObjectId": "1b8480b9-d11f-4a9c-ba5e-a3144bce3126",
   "priority": 1,
   "field": "field1",
   "value": "value1"
},
{
   "domainObjectId": "1b8480b9-d11f-4a9c-ba5e-a3144bce3126",
   "priority": 1,
   "field": "field2",
   "value": "value2"
}
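
Roughly, the mapping looks like this (plain JavaScript sketch; the helper name is made up, and issuedAt is omitted from the simplified examples above but handled the same way):

function toPropertyDocuments(message) {
   // Everything that is not shared metadata is treated as a field/value pair.
   const { domainObjectId, priority, issuedAt, ...fields } = message;
   return Object.entries(fields).map(([field, value]) => ({
      domainObjectId, priority, issuedAt, field, value
   }));
}

// One incoming message becomes one document per field:
toPropertyDocuments({
   domainObjectId: "1b8480b9-d11f-4a9c-ba5e-a3144bce3126",
   priority: 1,
   field1: "value1",
   field3: "value3"
});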

The expected outcome (the pipeline is currently exactly as shown in the previous post):

{
   "domainObjectId": "1b8480b9-d11f-4a9c-ba5e-a3144bce3126",
   "field1": "value1override",
   "field2": "value2",
   "field3": "value3"
}

Hey @Bert_Roex,

Sorry for the abominable time to answer… I got ultra busy :smiley: !

How about this?

> db.test.find()
[
  {
    _id: ObjectId("6153bc90a571a9a021541f49"),
    domainObjectId: '1b8480b9-d11f-4a9c-ba5e-a3144bce3126',
    priority: 1,
    field1: 'value1',
    field3: 'value3'
  },
  {
    _id: ObjectId("6153bc90a571a9a021541f4a"),
    domainObjectId: '1b8480b9-d11f-4a9c-ba5e-a3144bce3126',
    priority: 0,
    field1: 'value1override',
    field2: 'value2'
  },
  {
    _id: ObjectId("6153bc90a571a9a021541f4b"),
    domainObjectId: '1b8480b9-d11f-4a9c-ba5e-a3144bce3126',
    priority: 2,
    field1: 'value1lowerprio'
  }
]

Aggregation:

[
  {
    '$match': {
      'domainObjectId': '1b8480b9-d11f-4a9c-ba5e-a3144bce3126'
    }
  }, {
    '$sort': {
      'priority': 1
    }
  }, {
    '$group': {
      '_id': '$domainObjectId', 
      'field1': {
        '$push': '$field1'
      }, 
      'field2': {
        '$push': '$field2'
      }, 
      'field3': {
        '$push': '$field3'
      }
    }
  }, {
    '$project': {
      '_id': 0, 
      'domainObjectId': '$_id', 
      'field1': {
        '$first': '$field1'
      }, 
      'field2': {
        '$first': '$field2'
      }, 
      'field3': {
        '$first': '$field3'
      }
    }
  }
]

Result:

> db.test.aggregate([ /* pipeline as above */ ])
[
  {
    domainObjectId: '1b8480b9-d11f-4a9c-ba5e-a3144bce3126',
    field1: 'value1override',
    field2: 'value2',
    field3: 'value3'
  }
]

Cheers,
Maxime.

That does seem like a viable option.

However, in this example the group still doesn’t reuse the index used for sorting (by the way, we’re also sorting on an issuedAt date on top of the priority).

Furthermore, we’re using Spring Data MongoDB and I’m currently not sure how to convert this; do you by any chance have someone on your end who could help out with that transformation?

I usually don’t recommend using Spring Data MongoDB - but it does support Aggregation Pipelines.

https://docs.spring.io/spring-data/mongodb/docs/current/reference/html/#mongo.aggregation

What version are you using? If it’s anything recent, an index should be used very efficiently to get the $first record. Frankly, I’m not sure why you are using $last rather than $first, which would be more efficient and, I believe, more correct: don’t you want the lowest priority number?

Change your index to be { domainObjectId: 1, field:1, priority: 1, issuedAt: -1 } and change your pipeline to have this sort:

{ "$sort" : {
		"field" : 1,
		"priority" : 1,
		"issuedAt" : -1
}	}

and then in the first $group change $last to be $first. If you do that you should get the optimal plan which will look like this:

"winningPlan" : {
					"stage" : "FETCH",
					"inputStage" : {
						"stage" : "DISTINCT_SCAN",
						"keyPattern" : {
							"domainObjectId" : 1,
							"field" : 1,
							"priority" : 1,
							"issuedAt" : -1
						},
						"indexName" : "domainObjectId_1_field_1_priority_1_issuedAt_-1",

It’s currently the project stage I’m having difficulties with.
See example 74 in the Spring Data MongoDB Reference Documentation.

reactiveMongoTemplateGet.aggregate(Aggregation.newAggregation(BarProperty.class,
   match(Criteria.where("domainObjectId").is(domainObjectId)),
   Aggregation.sort(Sort.Direction.DESC, PRIORITY)
      .and(Sort.Direction.ASC, ISSUED_AT),
   Aggregation.group("domainObjectId")
      .push("$value.field1").as("field1"),
   Aggregation.project()
      .and(AggregationExpression.from(
         MongoExpression.create("'$first' : '$field1'"))).as("field1")
   ).withOptions(AggregationOptions.builder().allowDiskUse(true).build()), BarProjection.class)

We’ve tried switching out $last for $first and the performance remained similar.

When we run explain, we only see a winningPlan on the first step of the aggregation (the $match).
On the group stage there is no indication whatsoever of which index is or isn’t being used.
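
For reference, this is roughly how we run the explain (pipeline abbreviated, same as in the first post):

db.barProperty.explain("executionStats").aggregate([
   { $match: { domainObjectId: "e1f2a6ba-de58-4548-a40f-990bdbf57d48" } },
   { $sort: { priority: 1, issuedAt: -1 } },
   { $group: { _id: "$field", value: { $last: "$value" }, issuedAt: { $last: "$issuedAt" } } }
   // ... remaining stages as before ...
]);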