"Conditional sort", a data modelling problem

I have a data modelling problem and would greatly appreciate any help.

I’ll try to lay down the scenario via a simplified example:

Our data model has profiles for businesses:

{
   name: String,
   locations: [ Location{ adm1_id, adm3_id } ],
   product: enum.PRO|BASIC
}

We need to query them with and without a location (based on adm1_id, adm3_id, or both), sorting the results so that PRO-product profiles appear first, and non-PRO profiles come after them, ordered by a separate, dedicated sort field. This is done and works.

Now, approximately 2 days before product launch, the specs changed :eyes:

We now need to enable multiple locations per profile, with PRO products being location-specific, meaning:

  1. When querying without location (adm1_id or adm3_id), all profiles with any PRO-product will appear first.
  2. When querying with location, only profiles with PRO-product for the specific adm1_id-location will appear first. Profiles with PRO-products for different adm1_id-locations are considered non-PRO.
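To make the intended ordering for case 2 concrete, here is an in-memory sketch of it in plain JavaScript (the `orderKey` field is the separate sort field mentioned above; the sample data and the helper names are invented for illustration). A profile counts as PRO only if one of its locations matches the queried adm1_id *and* carries the PRO product:

```javascript
// A profile is "PRO for this query" only if it has a PRO product
// at the queried adm1_id (hypothetical helper, names assumed).
const isProFor = (profile, adm1Id) =>
  profile.locations.some(
    (loc) => loc.adm1_id === adm1Id && loc.product === "PRO"
  );

// PRO-for-this-location first, then by the separate orderKey field.
const sortForQuery = (profiles, adm1Id) =>
  [...profiles].sort((a, b) => {
    const proDiff = Number(isProFor(b, adm1Id)) - Number(isProFor(a, adm1Id));
    return proDiff !== 0 ? proDiff : a.orderKey - b.orderKey;
  });

const profiles = [
  { name: "A", orderKey: 2, locations: [{ adm1_id: "new-york", product: "BASIC" }] },
  { name: "B", orderKey: 3, locations: [{ adm1_id: "new-york", product: "PRO" }] },
  { name: "C", orderKey: 1, locations: [{ adm1_id: "boston", product: "PRO" }] },
];

// B sorts first (PRO for new-york); C's PRO is for boston, so it is
// treated as non-PRO and sorts by orderKey along with A.
console.log(sortForQuery(profiles, "new-york").map((p) => p.name));
// → [ 'B', 'C', 'A' ]
```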

Number two is where my brain snaps.

How would you model, index & query this? I want to avoid resorting to multiple queries if at all possible, since we also have to support pagination via skip, which already makes things very slow, but the best I can come up with is:

{
   name: String,
   locations: [Location{adm1_id, adm3_id, product}]
}

…but, per my understanding, this requires two queries: one to fetch profiles which have a PRO product, and another to fetch the rest. I’m not very experienced with aggregations; maybe there’s a solution hidden there. From the little I could gather, maybe something like…

[
   { $match },  // query
   { $unwind }, // separate docs based on `locations`
   { $filter }, // filter based on...?
   ...?
]

…could work?
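One possible shape along those lines (a sketch, untested against a real server; field names follow the examples above): instead of `$unwind` + `$group`, compute a per-document "is PRO for this location" flag with `$filter` inside `$addFields`, then sort on that flag. Note that `$filter` is an expression, not a pipeline stage, so it lives inside `$addFields`:

```javascript
// Sketch of a pipeline that flags each profile as PRO-for-this-location
// without unwinding. adm1Id and the proHere field name are assumptions.
const adm1Id = "new-york";

const pipeline = [
  { $match: { "locations.adm1_id": adm1Id } },
  {
    $addFields: {
      // true when at least one location matches the queried adm1_id
      // and carries the PRO product
      proHere: {
        $gt: [
          {
            $size: {
              $filter: {
                input: "$locations",
                as: "loc",
                cond: {
                  $and: [
                    { $eq: ["$$loc.adm1_id", adm1Id] },
                    { $eq: ["$$loc.product", "PRO"] },
                  ],
                },
              },
            },
          },
          0,
        ],
      },
    },
  },
  // PRO-for-this-location first, then the separate sort field
  { $sort: { proHere: -1, orderKey: 1 } },
];
```

This avoids the duplicate-document problem entirely, since no `$unwind` takes place, at the cost of an unindexable computed sort key.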

Thanks for reading this far.

In case someone else has a similar problem, I’ve come up with at least a solution that works, not necessarily the solution:

db.profiles.aggregate([
    // Match only profiles with given location
    {
      "$match": {
        "locations.adm1_id": "new-york"
      }
    },
    // Unwind locations
    {
      "$unwind": "$locations" 
    },
    // The unwound working set might contain docs for non-relevant locations,
    // so match again, keeping only the unwound docs for the given location.
    {
      "$match": {
        "locations.adm1_id": "new-york" 
      }
    },

    // Sort profiles with highlighted location first, second order by
    // orderKey-field, so that the next stage picks up the highlighted profile.
    {
      "$sort": { 
        "locations.highlight": -1,
        "orderKey": 1
      }
    },
    // Now we might have duplicates if a profile had two locations with the
    // same adm1_id, so group by _id, grabbing the `$first` values.
    {
      "$group": {
        "_id": "$_id",
        "name": {
          "$first": "$name"
        },
        "locations": {
          "$first": "$locations"
        },
        "orderKey": {
          "$first": "$orderKey"
        }
      }
    },
    // $group messes up the sort ordering, so we sort again!
    {
      "$sort": {
        "locations.highlight": -1,
        "orderKey": 1
      }
    },
    // We want to skip & limit our results, but are also interested in the total amount of results.
    {
      "$facet": {
        "paginatedResults": [
          {
            "$skip": 24 // skip whatever is needed
          },
          {
            "$limit": 24 // limit to page size
          },
          // Use $project to make sure we only return safe fields.
          // (The $unwind/$group stages earlier are only needed when an
          // adm1_id was provided.)
          {
            "$project": {
              "name": 1,
              "locations": 1,
              "orderKey": 1
            }
          }
        ],
        // Get total count of result documents
        "totalCount": [
          {
            "$count": "count"
          }
        ]
      }
    }
  ]);
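To convince yourself the `$unwind` → `$sort` → `$group` → `$sort` chain does what the comments claim, the same steps can be simulated in plain JavaScript on a couple of sample documents (data and field values invented for illustration):

```javascript
// Doc A has two locations for the queried adm1_id, one highlighted;
// doc B has one non-highlighted location but a lower orderKey.
const docs = [
  { _id: 1, name: "A", orderKey: 2,
    locations: [{ adm1_id: "new-york", highlight: false },
                { adm1_id: "new-york", highlight: true }] },
  { _id: 2, name: "B", orderKey: 1,
    locations: [{ adm1_id: "new-york", highlight: false }] },
];

// $unwind plus the second $match on the same adm1_id
const unwound = docs.flatMap((d) =>
  d.locations
    .filter((loc) => loc.adm1_id === "new-york")
    .map((loc) => ({ ...d, locations: loc }))
);

// first $sort: highlighted location first, then orderKey
const byHighlightThenOrder = (a, b) =>
  Number(b.locations.highlight) - Number(a.locations.highlight) ||
  a.orderKey - b.orderKey;
unwound.sort(byHighlightThenOrder);

// $group by _id, keeping the $first (i.e. highest-ranked) copy of each doc
const seen = new Set();
const grouped = unwound.filter((d) => !seen.has(d._id) && seen.add(d._id));

// the real $group gives no ordering guarantee, hence the second $sort
grouped.sort(byHighlightThenOrder);

console.log(grouped.map((d) => d.name)); // → [ 'A', 'B' ]
```

Doc A survives as its highlighted copy and outranks B despite the higher `orderKey`, which is exactly the "highlighted location first" behaviour the pipeline comments describe.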

It is obviously slower than a plain .find() on an index, but a typical query against a collection of ~5000 documents takes roughly 12-20ms, sometimes 20-30ms, measured from the Node client.

Any ideas on how to achieve the same, but better and faster, are very much welcome. My aggregation knowledge is very limited.
