Pipeline to group zip codes by region

Leyo_Cory · December 14, 2022, 3:12pm

Hi, I tried so long to figure out how to aggregate some documents via a pipeline but I was not able to get close to the result.

I have documents representing zip code ranges belonging to a region and a “last delivery date” for each range of zip codes. I would like to obtain one document for each contiguous range belonging to a region, with the date of the last delivery that happened for that range.

Currently I am doing it programmatically in my app, but doing that with an aggregation would bring to a major performance improvement given that it is much more efficient to don’t have to process all the documents on the client side.

Here a sample of my dataset:

[
  {
    region: "EMEA",
    startZipCode: 0,
    endZipCode: 10,
    lastDelivery: ISODate("2020-01-01T10:58:18Z")
  },
  {
    region: "EMEA",
    startZipCode: 10,
    endZipCode: 20,
    lastDelivery: ISODate("2020-12-01T11:13:26Z")
  },
  {
    region: "EMEA",
    startZipCode: 20,
    endZipCode: 30,
    lastDelivery: ISODate("2020-11-01T11:13:26Z")
  },
  {
    region: "NA",
    startZipCode: 30,
    endZipCode: 40,
    lastDelivery: ISODate("2020-01-01T11:13:26Z")
  },
  {
    region: "NA",
    startZipCode: 40,
    endZipCode: 50,
    lastDelivery: ISODate("2020-01-01T11:13:26Z")
  },
  {
    region: "EMEA",
    startZipCode: 50,
    endZipCode: 60,
    lastDelivery: ISODate("2020-01-01T11:13:26Z")
  }
]

Here an example of the output I would expect:

// Range [0-30) belonging to EMEA
// Grouping together contiguous zip codes (3 first docs)
// And choosing the higher delivery date in the group
  {
    region: "EMEA",
    startZipCode: 0,
    endZipCode: 30,
    lastDelivery: ISODate("2020-12-01T11:13:26Z")
  },
// Range [30, 50) belonging to NA
// Grouping together contiguous zip codes (2 docs)
// And choosing the higher delivery date in the group (they're the same)
   {
    region: "NA",
    startZipCode: 30,
    endZipCode: 50,
    lastDelivery: ISODate("2020-01-01T11:13:26Z")
  },
// Range [50, 60) belonging to EMEA (single document in the range, no grouping needed)
  {
    region: "EMEA",
    startZipCode: 50,
    endZipCode: 60,
    lastDelivery: ISODate("2020-01-01T11:13:26Z")
  }

Thank you very much in advance for your help.

MaBeuLux88_xxx · December 15, 2022, 3:38pm

Hi @Leyo_Cory,

First of all, thank you for the brain teasing exercice! I have to say that this has been a challenge!

My solution IS NOT perfect! I couldn’t find a solution to generalise the regions. They are hardcoded in my pipeline… Hopefully you don’t have 765 regions and you can generate the pipeline from your regions if they are dynamic.

Buckle up, here we go:

[
  {
    '$group': {
      '_id': '$region', 
      'starts': {
        '$push': '$startZipCode'
      }, 
      'ends': {
        '$push': '$endZipCode'
      }
    }
  }, {
    '$set': {
      'mins': {
        '$filter': {
          'input': '$starts', 
          'as': 's', 
          'cond': {
            '$not': {
              '$in': [
                '$$s', '$ends'
              ]
            }
          }
        }
      }
    }
  }, {
    '$lookup': {
      'from': 'coll', 
      'let': {
        'id': '$_id', 
        'mins': '$mins'
      }, 
      'pipeline': [
        {
          '$match': {
            '$expr': {
              '$and': [
                {
                  '$eq': [
                    '$region', '$$id'
                  ]
                }, {
                  '$in': [
                    '$startZipCode', '$$mins'
                  ]
                }
              ]
            }
          }
        }
      ], 
      'as': 'test'
    }
  }, {
    '$unwind': {
      'path': '$test'
    }
  }, {
    '$replaceRoot': {
      'newRoot': '$test'
    }
  }, {
    '$facet': {
      'EMEA': [
        {
          '$match': {
            'region': 'EMEA'
          }
        }, {
          '$graphLookup': {
            'from': 'coll', 
            'startWith': '$startZipCode', 
            'connectFromField': 'endZipCode', 
            'connectToField': 'startZipCode', 
            'as': 'zips', 
            'restrictSearchWithMatch': {
              'region': 'EMEA'
            }
          }
        }
      ], 
      'NA': [
        {
          '$match': {
            'region': 'NA'
          }
        }, {
          '$graphLookup': {
            'from': 'coll', 
            'startWith': '$startZipCode', 
            'connectFromField': 'endZipCode', 
            'connectToField': 'startZipCode', 
            'as': 'zips', 
            'restrictSearchWithMatch': {
              'region': 'NA'
            }
          }
        }
      ]
    }
  }, {
    '$set': {
      'aze': {
        '$concatArrays': [
          '$EMEA.zips', '$NA.zips'
        ]
      }
    }
  }, {
    '$unwind': {
      'path': '$aze'
    }
  }, {
    '$set': {
      'result': {
        '$reduce': {
          'input': '$aze', 
          'initialValue': [], 
          'in': {
            'region': '$$this.region', 
            'startZipCode': {
              '$concatArrays': [
                '$$value.startZipCode', [
                  '$$this.startZipCode'
                ]
              ]
            }, 
            'endZipCode': {
              '$concatArrays': [
                '$$value.endZipCode', [
                  '$$this.endZipCode'
                ]
              ]
            }, 
            'lastDelivery': {
              '$concatArrays': [
                '$$value.lastDelivery', [
                  '$$this.lastDelivery'
                ]
              ]
            }
          }
        }
      }
    }
  }, {
    '$replaceRoot': {
      'newRoot': '$result'
    }
  }, {
    '$set': {
      'startZipCode': {
        '$min': '$startZipCode'
      }, 
      'endZipCode': {
        '$max': '$endZipCode'
      }, 
      'lastDelivery': {
        '$max': '$lastDelivery'
      }
    }
  }
]

Result:

[
  {
    region: 'EMEA',
    startZipCode: 0,
    endZipCode: 30,
    lastDelivery: ISODate("2020-12-01T11:13:26.000Z")
  },
  {
    region: 'EMEA',
    startZipCode: 50,
    endZipCode: 60,
    lastDelivery: ISODate("2020-01-01T11:13:26.000Z")
  },
  {
    region: 'NA',
    startZipCode: 30,
    endZipCode: 50,
    lastDelivery: ISODate("2020-01-01T11:13:26.000Z")
  }
]

Just to explain a bit the pipeline…

First I collect the starts and ends by regions.
Then I remove the ends from the starts. I end up with just the starts that start a range.
From there, I retrieve the docs that are starting ranges by regions with the $lookup. I assume here that no region have the same range start… (first problem).
Then I use a $facet (one facet for each region (second problem)) with a GraphLookup to retrieve the ranges starting from the start I have already from above.
At this point I have almost the solution and the rest of the pipeline is just extracting and cleaning the docs to extract the final solution.

I’m not going to lie, I was like this!

It is 99% sure that there is a more direct and optimized solution (with generic regions) but it’s the best I could do with the limited time I have!

Cheers,
Maxime

Leyo_Cory · December 20, 2022, 5:31pm

Thank you very much Maxime! It required a lot of time for me to digest this solution, a real artwork! The regions are indeed not so many hence I could integrate this aggregation quite nicely into the application. A great performance improvement thanks to the reduced data transfer latency (~213%).

MaBeuLux88_xxx · December 20, 2022, 11:43pm

I’ll take the 213% optimization !

And it would be awesome to see the actual optimized version with generic regions. I’d gladly read that pipeline.

Cheers,
Maxime.

Leyo_Cory · December 21, 2022, 7:48am

And it would be awesome to see the actual optimized version with generic regions.

I ended up manually pasting region names in the $facet stage (21 regions in total). It would be great to have a more generic stage but from my side the aggregation looks acceptable atm.