$group Aggregation in Compass is timing out

James_Hall · June 20, 2020, 9:53pm

I’m working with relatively small (10k documents) database and need to combine collections in an aggregation pipeline. I’m using Compass for this, and it is timing out at a $group stage, but I can’t quite understand the reason.

My goal is to create a view showing the number of paid/free/total memberships, and number of paid/free/total members. (members are different than memberships).

Here is an example of the first doc:
account:

{
_id: '1234566788',
url: 'www.site.com',
memberships: [
     {0: 
         {membershipType: 'free'},
     {1: 
        {membershipType: 'paid'}
]
}

The second doc:
members:

{
_id: 'xxxxxxx',
site: '12345667',       // maps to _id from other doc
stripeInfo: {
    customer: 'xyz'    // This field will not exist if it is a free account
    }
}

… and this is the pipeline:

[
    {
        '$addFields': {
            'total_memberships': {
                '$cond': {
                    'if': {
                        '$isArray': '$memberships'
                    }, 
                    'then': {
                        '$size': '$memberships'
                    }, 
                    'else': 'NA'
                }
            }
        }
    }, {
        '$project': {
            '_id': 1, 
            'url': 1, 
            'memberships': 1, 
            'total_memberships': 1
        }
    }, {
        '$unwind': {
            'path': '$memberships', 
            'preserveNullAndEmptyArrays': True
        }
    }, {
        '$group': {
            '_id': {
                '_id': '$_id', 
                'url': '$url'
            }, 
            'total_memberships': {
                '$sum': 1
            }, 
            'count_of_free_memberships': {
                '$sum': {
                    '$cond': [
                        {
                            '$eq': [
                                '$memberships.membershipType', 'free'
                            ]
                        }, 1, 0
                    ]
                }
            }, 
            'count_of_paid_memberships': {
                '$sum': {
                    '$cond': [
                        {
                            '$eq': [
                                '$memberships.membershipType', 'paid'
                            ]
                        }, 1, 0
                    ]
                }
            }
        }
    }, {
        '$lookup': {                 // connecting to members 
            'from': 'members', 
            'localField': '_id._id', 
            'foreignField': 'site', 
            'as': 'members'
        }
    }, {
        '$unwind': {
            'path': '$members', 
            'preserveNullAndEmptyArrays': True
        }
    }, {
        '$addFields': {
            'is_paid_member': {
                'stripeInfo': {
                    'customer': {
                        '$ifNull': [
                            1, 0
                        ]
                    }
                }
            }, 
            'is_free_member': {
                'stripeInfo': {
                    'customer': {
                        '$ifNull': [
                            0, 1
                        ]
                    }
                }
            }
        }
    }, {
        '$project': {                 // cleaning up/ flatening output
            '_id': '$_id._id', 
            'url': '$_id.url', 
            'total_memberships': 1, 
            'count_of_free_memberships': 1, 
            'count_of_paid_memberships': 1, 
            'is_paid_member': '$is_paid_member.stripeInfo.customer', 
            'is_free_member': '$is_free_member.stripeInfo.customer'
        }
    }, 

/* **This group is where the pipeline is timing out** */
{
        '$group': {
            '_id': {
                '_id': '$_id', 
                'url': '$url', 
                'count_of_free_memberships': '$count_of_free_memberships', 
                'count_of_paid_memberships': '$count_of_paid_memberships', 
                'total_memberships': '$total_memberships'
            }, 
            'free_members': {
                '$sum': '$is_free_member'
            }, 
            'paid_members': {
                '$sum': '$is_paid_member'
            }, 
            'total_members': {
                '$sum': 1
            }
        }
    }, {
        '$project': {
            '_id': '$_id._id', 
            'url': '$_id.url', 
            'free_memberships': '$_id.count_of_free_memberships', 
            'paid_memberships': '$_id.count_of_paid_memberships', 
            'total_memberships': '$_id.total_memberships', 
            'free_members': 1, 
            'paid_members': 1, 
            'total_members': 1
        }
    }, {}
]

This Aggregation pipelinen will run if I limit the input to 1000, or increase the max time from 5000 to 20000, but otherwise will throw :

error in $cursor stage ::caused by:: operator exceeded time limit

I appreciate any advice in optimizing this aggregation pipeline.

Massimiliano_Marcon · June 22, 2020, 9:58am

The default maxTimeMS is 5000 ms. You can increase it in the aggregation pipeline builder settings.

James_Hall · June 22, 2020, 10:28am

Thanks for the quick response and the gif is helpful. However, although raising the time limit works for my aggregation, when I save it to a view for the team they are getting a time out error. Is there a way to raise the time limit for the view?

paul_N_A · June 23, 2020, 1:37pm

Am I wrong, but doesn’t extending the time for a query seems like a way worse solution than optimizing it. I am reading (interpriting) this answer as there is no way to optimize this query. 5 secs? Is this accurate?

James_Hall · June 27, 2020, 6:38pm

Paul, I agree, with this relatively small set I keep running into time limits, and don’t want to have to instruct everyone to change settings each time they want to use a view.

I came across another instance of this. If I increase the time limit I can view the collection, but the view shows total documents N/A:

I’m unable to export this view, presumably because it doesn’t know how many rows to export.

However if I set the limit to something higher than the total docs (I have 10.5k docs, set limit to 12k), then I am able to export to csv and things seem to work.

I’m wondering what is causing this and if there is a way to avoid it in an aggregation stage.

slava · June 28, 2020, 7:23pm

Good point.

@James_Hall, the main issue, that slows down your aggregation - is the $lookup stage. It is called for each membership document.

You make your aggregation faster in one of the following ways:

If the relationship between ‘accounts’ and ‘members’ documents is 1:1 (every account has only 1 member and every member can have only 1 account), then consider merging the collections into one. With this you will not have to do any $lookups to get the output, your current aggregation provides. The aggregation will be way more faster.
If the relationships between ‘accounts’ and ‘members’ documents is 1:M (every account can have many members and one or more members can belong to same account), then consider to put member-related totals into ‘accounts’ documents. This will allow you to avoid using $lookups, which will significantly improve the performance of your aggregation. But you will have to maintain consistency of those totals on the application level. And each modification to ‘members’ collection may require modification of the ‘accounts’ collection.
If the relationships is 1:M (like in approach #2, see above), but you’re trying to avoid data redundancy for some reason, then rewrite your aggregation, so it will would $lookup members to accounts. This will reduce the number of $lookups, but still, you will have plenty of them. Performance will be slightly improved.
Try to rewrite the aggregation, so it would use only 1 $lookup. Like this:

db.test_members.aggregate([
  {
    $group: {
      _id: '$site',
      totalMembers: {
        $sum: 1,
      },
      totalFreeMembers: {
        $sum: {
          $cond: {
            if: {
              $eq: [{ $type: '$stripeInfo.customer' }, 'missing'],
            },
            then: 1,
            else: 0
          }
        }
      }
    }
  },
  {
    // tried to avoid additional operations and conditional 
    // in the $group stage, for the totalPaidMembers value, 
    // that is why its calculation is moved to a separate stage, 
    // so it just a simple subtraction of two integers
    $addFields: {
      totalPaidMembers: {
        $subtract: ['$totalMembers', '$totalFreeMembers'],
      }
    }
  },
  {
    $group: {
      _id: null,
      // accumulate membership sites to do 1 single $lookup
      membershipSites: {
        $push: '$_id',
      },
      // accumulate membership docs into a single objects, 
      // so later it can be easily concatenated with members later
      members: {
        $push: '$$CURRENT',
      }
    }
  },
  {
    $lookup: {
      from: 'test_accounts',
      pipeline: [
        {
          // leave only necessary data in the pipeline
          $project: {
            _id: true,
            url: true,
            memberships: true,
          },
        },
        {
          $unwind: {
            path: '$memberships',
            // some 'membership' fields can contain empty array
            preserveNullAndEmptyArrays: true,
          },
        },
        {
          $group: {
            _id: '$_id',
            url: {
              $first: '$url',
            },
            totalFreeMemberships: {
              $sum: {
                $cond: {
                  if: {
                    $eq: ['$memberships.membershipType', 'free'],
                  },
                  then: 1,
                  else: 0,
                }
              },
            },
            totalMemberships: {
              $sum: 1
            }
          }
        },
        {
          $addFields: {
            totalPaidMemberships: {
              $subtract: ['$totalMemberships', '$totalFreeMemberships'],
            }
          }
        },
      ],
      as: 'memberships',
    }
  },
  {
    $project: {
      result: {
        // at this point it is possible to hit the 100 MiB stage's limit
        $concatArrays: ['$memberships', '$members']
      }
    }
  },
  {
    $unwind: '$result',
  },
  {
    $group: {
      _id: '$result._id',
      url: {
        $max: '$result.url',
      },
      totalFreeMemberships: {
        $sum: '$result.totalFreeMemberships',
      },
      totalPaidMemberships: {
        $sum: '$result.totalPaidMemberships',
      },
      totalMemberships: {
        $sum: '$result.totalMemberships',
      },
      totalMembers: {
        $sum: '$result.totalMembers',
      },
      totalFreeMembers: {
        $sum: '$result.totalFreeMembers',
      },
      totalPaidMembers: {
        $sum: '$result.totalPaidMembers',
      }
    }
  },
]).pretty();

And for those datasets:

db.test_accounts.insertMany([
  {
    _id: 'site1.com',
    url: 'url1.com',
    memberships: [
      { membershipType: 'free' },
      { membershipType: 'paid' },
    ],
  },
  {
    _id: 'site2.com',
    url: 'url2.com',
    memberships: [],
  },
  {
    _id: 'site3.com',
    url: 'url3.com',
    memberships: [
      { membershipType: 'free' },
    ],
  },
  {
    _id: 'site4.com',
    url: 'url4.com',
    memberships: [
      { membershipType: 'paid' },
    ],
  }
]);

db.test_members.insertMany([
  {
    _id: 'm1',
    site: 'site1.com',
    stripeInfo: {
      customer: 'c1'
    }
  },
  {
    _id: 'm2',
    site: 'site1.com',
    stripeInfo: {
      customer: 'c2'
    }
  },
  {
    _id: 'm3',
    site: 'site1.com',
    stripeInfo: {
      customer: 'c3'
    }
  },
  {
    _id: 'm4',
    site: 'site3.com',
    stripeInfo: {
      customer: 'c4'
    }
  },
  {
    _id: 'm5',
    site: 'site4.com',
    stripeInfo: {
      customer: 'c5'
    }
  },
  {
    _id: 'm6',
    site: 'site1.com',
    stripeInfo: {}
  },
  {
    _id: 'm7',
    site: 'site4.com',
    stripeInfo: {}
  }
]);

The aggregation will provide this output:

{
	"_id" : "site3.com",
	"url" : "url3.com",
	"totalFreeMemberships" : 1,
	"totalPaidMemberships" : 0,
	"totalMemberships" : 1,
	"totalMembers" : 1,
	"totalFreeMembers" : 0,
	"totalPaidMembers" : 1
}
{
	"_id" : "site4.com",
	"url" : "url4.com",
	"totalFreeMemberships" : 0,
	"totalPaidMemberships" : 1,
	"totalMemberships" : 1,
	"totalMembers" : 2,
	"totalFreeMembers" : 1,
	"totalPaidMembers" : 1
}
{
	"_id" : "site1.com",
	"url" : "url1.com",
	"totalFreeMemberships" : 1,
	"totalPaidMemberships" : 1,
	"totalMemberships" : 2,
	"totalMembers" : 4,
	"totalFreeMembers" : 1,
	"totalPaidMembers" : 3
}
{
	"_id" : "site2.com",
	"url" : "url2.com",
	"totalFreeMemberships" : 0,
	"totalPaidMemberships" : 1,
	"totalMemberships" : 1,
	"totalMembers" : 0,
	"totalFreeMembers" : 0,
	"totalPaidMembers" : 0
}

The above aggregation will work perfectly, if your collections are not huge. Otherwise, some stages may reach 100 MiB limitation. It can be negotiated with { allowDiskUse: true } option, but it will reduce the performance of the aggregation.

Try to apply this aggregation to your datasets, if you will still have issues with the performance, consider other solutions, that I have suggested above.