Complex aggregation with multiple lookups & unwinds results in nested arrays

Hello there!

My issue is quite complex and I’m new to mongodb. And I’m so close to the solution but one property still won’t display as intended. First i need to explain the setup. We have multiple collections:

Collections

  • requests - Holiday requests from website visitors
  • campaigns - A list of advert campaigns
  • bookings - Holiday bookings
  • flags - List of flags/tags which can be used for tagging documents of the other collections.

Examples document structures
Only the relevant props are shown.

// requests
{
  _id: ObjectId("60be1a668cc56aca6c46bb24"),
  dateRequest: '2021-06-07 13:08:54.675Z',
  apikey: '7bcc49c49cb8afb4ba666a6ae2',
  meta: {
    campaign:  '999999' // references to campaigns.id
  }
}

// campaigns
{
  _id: ObjectId("61af6485a1a6045e25baf1b1"),
  id: '999999',
  name: 'Brand International EU',
  apikey: '7bcc49c49cb8afb4ba666a6ae2',
  flag: ObjectId("616025c3e5bff7b3d6d89928") // reference to flags._id
}

// bookings
{
  _id: ObjectId("61645257c03caa392af894e8"),
  apikey: '7bcc49c49cb8afb4ba666a6ae2',
  revenue: 742.34,
  flags: [ // references to flags._id
    ObjectId("61602615e5bff7b3d6d8992c"),
    ObjectId("616025fee5bff7b3d6d8992b")
  ],
  requests: [ // references to requests._id
    ObjectId("60be1a668cc56aca6c46bb24"),
    ObjectId("6160174c8fa5873c1e068f7a")
]
}

// flags
{
  _id: ObjectId("61602615e5bff7b3d6d8992c"),
  name: 'In House'
},
{
  _id: ObjectId("616025fee5bff7b3d6d8992b"),
  name: 'Dynamic'
}

Now the final output needs to be grouped by campaigns and needs to hold aggregated data for revenue, bookingscount, bookingflags, booking quota, etc. The goal is to have a table like this:

| campaign               | requests | bookings | bookingquota | revenue | revenue/booking | revenue/request |
|------------------------|----------|----------|--------------|---------|-----------------|-----------------|
| null                   |      255 |        0 |            0 |       0 |               0 |               0 |
| Brand International EU |      350 |       50 |        0.14% | 12,650€ |            253€ |           0,03€ |
| 666666 (new)           |       20 |        5 |        0.25% |  6,470€ |          1,294€ |          323.5% |

The following conditions need to be respected:

  • Some requests may not include any campaign id, this requests should be grouped by null
  • Some requests may include a campaign id which does not exist in the campaign collection, therefore we group by the id instead of the object id (_id).
  • bookingFlags can have duplicates because we would use this property to count how many times a booking flag is related to the campaign.
  • earliestRequest is the earliest dateRequest of requests inside the bookings.
  • revenuetotal is the sum of revenue values inside bookings.
  • quota is simply bookingcount divided by requestcount.
  • revenueperbooking and revenueperrequest are revenuetotal divided by each amount.

The desired array output should look something like this (I think):

[{
  _id: null,
  name: null,
  requestcount: 255,
  bookingcount: 0,
  earliestrequest: '2021-04-13 10:16:54.675Z'
  revenuetotal: 0,
  quota: 0,
  revenueperbooking: 0,
  revenueperrequest: 0,
  bookingFlags: []
},
{
  _id: '666666',
  name: null,
  requestcount: 20,
  bookingcount: 5,
  earliestrequest: '2021-04-08 12:18:54.675Z'
  revenuetotal: 6470,
  quota: 0.25,
  revenueperbooking: 1294,
  revenueperrequest: 323.5,
  bookingFlags: ['In House', 'Dynamic']
},
{
  _id: '999999',
  name: 'Brand International EU',
  requestcount: 350,
  bookingcount: 50,
  earliestrequest: '2021-06-07 13:08:54.675Z'
  revenuetotal: 12,650,
  quota: 0.14,
  revenueperbooking: 253,
  revenueperrequest: 0.03,
  bookingFlags: ['In House', 'Dynamic', 'In House', 'Wiederholer']
}]

And I’m trying to figure this out for 2 weeks now and I feel like I’m close but can’t seem to get the bookingFlags property to do what I want.

My current aggregation looks like this:

db.getCollection('requests').aggregate([
  {
    $lookup: {
      from: 'bookings',
      localField: '_id',
      foreignField: 'requests',
      as: 'bookings',
    },
  },
  {
    $lookup: {
      from: 'campaigns',
      localField: 'meta.campaign',
      foreignField: 'id',
      as: 'campaign',
    },
  },
  { $unwind: { path: '$campaign', preserveNullAndEmptyArrays: true } },
  { $unwind: { path: '$bookings', preserveNullAndEmptyArrays: true } },
  {
    $lookup: {
      from: 'flags',
      localField: 'bookings.flags',
      foreignField: '_id',
      as: 'bookings.flags',
    },
  },
  {
    $group: {
      _id: '$_id',
      bookings: {
        $addToSet: {
          $cond: {
            if: { $not: '$bookings._id' },
            then: '$$REMOVE',
            else: '$bookings',
          },
        },
      },
      campaign: { $first: '$campaign' },
      dateRequest: { $first: '$dateRequest' },
      meta: { $first: '$meta' },
    },
  },
  {
    $group: {
      _id: '$meta.campaign',
      name: { $first: '$campaign.name' },
      requestcount: { $sum: 1 },
      earliestrequest: { $min: '$dateRequest' },
      revenuetotal: { $sum: { $sum: '$bookings.revenue' } },
      bookingcount: { $sum: { $size: '$bookings' } },
      bookingFlags: { $push: '$bookings.flags' },
    },
  },
  {
    $project: {
      _id: 1,
      name: 1,
      requestcount: 1,
      earliestrequest: 1,
      revenuetotal: 1,
      bookingcount: 1,
      quota: {
        $cond: {
          if: { $gte: ['$requestcount', 1] },
          then: {
            $divide: [{ $multiply: ['$bookingcount', 100] }, '$requestcount'],
          },
          else: 0,
        },
      },
      revenueperbooking: {
        $cond: {
          if: { $gte: ['$bookingcount', 1] },
          then: { $divide: ['$revenuetotal', '$bookingcount'] },
          else: 0,
        },
      },
      revenueperrequest: {
        $cond: {
          if: { $gte: ['$requestcount', 1] },
          then: { $divide: ['$revenuetotal', '$requestcount'] },
          else: 0,
        },
      },
      bookingFlags: {
        $reduce: {
          input: '$bookingFlags',
          initialValue: [],
          in: { $concatArrays: ['$$value', '$$this'] },
        },
      },
    },
  },
]);

And the result looks like this:

[
{
    "_id" : null,
    "name" : null,
    "requestcount" : 225.0,
    "earliestrequest" : ISODate("2021-06-07T13:26:37.754Z"),
    "revenuetotal" : 0,
    "bookingcount" : 0,
    "quota" : 0.0,
    "revenueperbooking" : 0.0,
    "revenueperrequest" : 0.0,
    "bookingFlags" : []
},
  {
    _id: '99999999',
    name: 'Brand International EU',
    requestcount: 2870.0,
    earliestrequest: ISODate('2021-01-11T23:00:00.000Z'),
    revenuetotal: 16151.4,
    bookingcount: 5,
    quota: 0.174216027874564,
    revenueperbooking: 3230.28,
    revenueperrequest: 5.62766550522648,
    bookingFlags: [
      [
        {
          _id: ObjectId('616025fee5bff7b3d6d8992b'),
          apikeys: [],
          name: 'In House',
          type: 'booking',
          access: 'global',
          __v: 0,
        },
        {
          _id: ObjectId('61602615e5bff7b3d6d8992c'),
          apikeys: [],
          name: 'Wiederholer',
          type: 'booking',
          access: 'global',
          __v: 0,
        },
      ],
      [
        {
          _id: ObjectId('616025f7e5bff7b3d6d8992a'),
          apikeys: [],
          name: 'Telefon',
          type: 'booking',
          access: 'global',
          __v: 0,
        },
        {
          _id: ObjectId('61602615e5bff7b3d6d8992c'),
          apikeys: [],
          name: 'Wiederholer',
          type: 'booking',
          access: 'global',
          __v: 0,
        },
      ],
      [
        {
          _id: ObjectId('616025f7e5bff7b3d6d8992a'),
          apikeys: [],
          name: 'Telefon',
          type: 'booking',
          access: 'global',
          __v: 0,
        },
      ],
      [
        {
          _id: ObjectId('616025fee5bff7b3d6d8992b'),
          apikeys: [],
          name: 'In House',
          type: 'booking',
          access: 'global',
          __v: 0,
        },
      ],
      [
        {
          _id: ObjectId('616025fee5bff7b3d6d8992b'),
          apikeys: [],
          name: 'In House',
          type: 'booking',
          access: 'global',
          __v: 0,
        },
      ],
    ],
  }
]

So the problem is, that bookingFlags is a nested array instead of a flat one with just the names of the flags. My aggregation pipeline is slow and surely not the best approach, but its the only thing i managed to come up with. I’m open for any suggestions, optimizations or other solutions. Honestly, if you’ve read until this part, you are a champ, and if you can help me solve this monster of an aggregation, I will buy you a beer.

I hope someone can help me before i flip my desk and burn down my office.

cheers
George

Hi @Georg_Bote ,

Welcome to mongodb community.

The initial problem I see is with you schema design. It seems that it is more normalised for sql relational databases than Mongodb document model.

The idea is that data that is queried together should be stored together even in cost of duplicating data sometimes.

For example the flags data can be embedded in the bookings collection, where beside the id the flag you store flag names or any additional data important to your critical queries.

Now the aggregation is complex with multiple stages. You can still reshape the data to get embedded arrays out adding stages:

{
$addFields : { bookingsFlags : { $first : "$bookingsFlags"}}
},
{
$addFields : { bookingsFlags :"$bookingsFlags.name"}
}

Thanks
Pavel

@Georg_Bote ,

My recommendation was only by reading the ticket.

I will try to reproduce it and see if I can get with a better pipeline.

Thanks
Pavel

@Pavel_Duchovny

Thank you! $addFields did the trick!

But honestly, I wasn’t aware of the proper MongoDB schema principles. I read up on it and now I understand where we went wrong with our schema design and the way we try to query our data. So we will reevaluate where to use embedding and where references make more sense.

And another thing I tested was changing the way the campaign ids are being saved. Until now, the reason why I used the requests collection as entry point was because requests can hold campaign ids not existing in the campaign collection and I still need them as groups. But I changed it so campaign ids are being saved into the campaign collection as soon as a new request is recieved by the api. therefore I can use the campaign collection as the entry point and after some tests the execution time went down from 4.3s to 0.34s with a fraction of the initial aggregation query.

So thank you very much Pavel for pointing me into the right direction! You ended 2 weeks of frustration with a single post. :pray:

And if you send me your paypal contact via PM I will buy you that beer I promised :wink:

3 Likes

Hey @Georg_Bote,

Maybe this course on MongoDB University can help you get a better grip on the data modeling MongoDB style.

Cheers,
Maxime.

2 Likes

Thank you Maxime! I’ll definitely check it out. Don’t want to struggle like this again :smiley:

1 Like

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.