Performance problem for lookup with $in

Christian_Schwaderer · May 18, 2023, 3:21am

I have a collection which refers in string array to another collection. Much simplified the data looks like:

db={
  "foo": [
    {
      "fooid": "my_foo_1",
      "name": "name of foo1",
      "organizationId": "myOrganization"
    },
    {
      "fooid": "my_foo_2",
      "name": "name of foo2",
      "organizationId": "myOrganization"
    },
    {
      "fooid": "my_foo_3",
      "name": "name of foo3",
      "organizationId": "myOrganization"
    }
  ],
  "fooCombinations": [
    {
      "id": "combination1",
      "foos": [
        "my_foo_1",
        "my_foo_2"
      ],
      "organizationId": "myOrganization"
    },
    {
      "id": "combination2",
      "foos": [
        "my_foo_1",
        "my_foo_3"
      ],
      "organizationId": "myOrganization"
    },
    {
      "id": "combination3",
      "foos": [
        "my_foo_2",
        "my_foo_3"
      ],
      "organizationId": "myOrganization"
    }
  ]
}

You see, foos in fooCombinations collection refers to fooid in the foo collection.

Now my $lookup looks like:

db.fooCombinations.aggregate([
  {
    $lookup: {
      from: "foo",
      let: {
        foos: "$foos",
        organizationId: "$organizationId",
      },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                {
                  $eq: [
                    "$organizationId",
                    "$$organizationId"
                  ]
                },
                {
                  $in: [
                    "$fooid",
                    "$$foos"
                  ],
                },
              ],
            },
          },
        }
      ],
      as: "fooAdditionalInfo",
    },
  }
])

( Put that also on Mongo playground: Mongo playground )

Now the problem is performance. The fooCombinations collection has in reality about 2000 documents for that organization and will grow. Even right now, that query takes about 20 seconds - all the parts without the $lookup take just 400ms. So, the $lookup is defintely the problem here.
There is an index on fooid but I know that $in inside $lookup cannot use indexes. I also know that certain improvements have been made on newer versions of Mongo, but right now I’m stuck with version 4.

So, what can I do to drastically decrease the execution time? (Small improvements won’t help. Reducing it from 20s to 10s won’t help. It need to be under 1 second.)

My line of thinking is: Is it wise to first create an array of all foo ids of all fooCombinations in one organization and then query “foo” just once with that big array and then afterwards “distribute” to foo results back to fooCombinations?
If that makes sense performance-wise, how would I do that?

steevej · May 20, 2023, 1:06pm

It is really bad because the new flavour of $lookup are more flexible. Not using indexes it definitively a performance killer.

Another issue is if a given foo (like my_foo_1) appears in multiple combinations you potentially (potentially since there might exist an optimization in the server to avoid this) in the looking it up multiple time, without an index. Since it looks like we only have part of the use case it is hard to make recommendation. For example, do you have some kind of $match before the $lookup. One way to reduce the multiple $lookup for the same foo is to use $facet where $group on foo ids of the foos array.

Are the foo ids unique for a given organization? If they are you might forgo the $eq on the organization. You could always make them unique by making the real id to be the concatenation of the original foo id and organization. Or use an object for the _id like

foo collection

{ fooid : { foo: "my_foo_1" , organization: "myOrganization" } ,
  name : "name of foo 1"
}

Then before the $lookup you would use $addFields to $map foos arrays from:

to the shape of ids in the foo collection

"foos": [
        { foo: "my_foo_1" , organization: "myOrganization" }
        { foo: "my_foo_2" , organization: "myOrganization" }
      ]

You will then be able to use localField:foos and foreignField: fooid in you $lookup thus being able to use the unique index of fooid.

All the above might be more work than upgrading to a more recent version which you will have to do any way within a year.

Christian_Schwaderer · May 23, 2023, 10:28am

Many thanks for your answer!

The foo ids are uniqe within an organization, but not across organizations. That’s why I think I need to compare them as well.

Yes, I know that it’s problematic that I cannot show my complete use case, but that would be too complicated and contains code I cannot show.

But many thanks anyway. That already helped quite a good deal!