How to scale and optimize this friend suggestions query?

Big_Cat_Public_Safety_Act · July 26, 2023, 9:00pm

My app has a performance intensive query, which generates a list of suggested friends for any user that logs into the App.

This query currently is slow despite indexes. The $lookup, which is a join, is slow despite being indexed.

Currently, this query is slow with a database of more than 100,000 users. I need to scale my APP to potentially 10s of millions of users.

questions:

Does this look like the right approach?
What is needed to speed it up? Sharding? Faster Hardware? Persist in RAM?
There are appears to be no way to remove the $lookup / join. Or are there?

The business logic of the query goes like so:

const suggested_friends = await all_users.aggregate([
  {
    $match: {
      age: { $gt: 18 },
      city: { $in: ["chicago", "new york"] },
      school: { $in: ["harvard", "standford"] },
      // etc., potentially hundreds of other arbitrary filters, which are different for each query and undefined
    },
  },
  // must lookup and remove users that this user has blocked and who have blocked this user.
  {
    $lookup: {
      from: "block_users",
      let: { tar_id: "$_id", tar_tid: "$tar._id" }, // joining by _id, which is indexed
      pipeline: [
        {
          $match: {
            $expr: {
              $or: [
                {
                  $and: [
                    {
                      $eq: ["$blocker", querier_user_id],
                    },
                    { $eq: ["$blocked", "$$tar_id"] },
                  ],
                },
                {
                  $and: [
                    { $eq: ["$blocker", "$$tar_id"] },
                    {
                      $eq: ["$blocked", querier_user_id],
                    },
                  ],
                },
              ],
            },
          },
        },
        { $limit: 1 },
      ],
      as: "remove",
    },
  },
  {
    $match: {
      $expr: {
        $eq: [{ $size: "$remove" }, 0],
      },
    },
  },
  {
    $project: {
      remove: 0,
    },
  },

  // now to derive some similarity score to predict whether user will be good friend
  {
    $set: {
      similarity_score: complex_function_call(), // some complex function to compute a number
    },
  },

  // finally sorting
  {
    $sort: {
      similarity_score: -1,
      age: 1,
      grade: 1,
      distance: 1,
      // etc. the sort order, by what field, and how many fields is not predefined and can be different for each query
      // and thus, cannot use index.
      // also, similarity_score must be derived, hence another reason an index cannot be used.
      // this must come at the end because blocked users must be filtered and removed first.
    },
  },
  {
    $limit: 100,
  },
]);

John_Sewell · July 27, 2023, 6:53am

Couple of thoughts…

You’re running this on every logon? What’s more often to happen, a user log on or a user sign up / change key information. If it’s logon then there is no point re-calculating the friends list every time, calculate it only as needed if it’s expensive.

Store the meta information as attributes, which would make indexing the fields easier / more efficient.

Don’t keep blocking / blocked user data in a different collection, store it with the users as that’s a place it’s being actively used, you can re-calculate a users friends suggestion list when a user blocks / is blocked.
(with the assumption that if it grows to massive you’ve the 16MB limit, but that’s a VERY disliked user, you could always have a flag to indicate a user like this and process them differently…)

Don’t calculate this in real-time, when a change is made, put the user id on a queue and have a background process calculate this at some point, but not immediately, I assume it’s not critical that friends matches are calculated the second that a user logs in.