I have a collection which refers in string array to another collection. Much simplified the data looks like:
db={
"foo": [
{
"fooid": "my_foo_1",
"name": "name of foo1",
"organizationId": "myOrganization"
},
{
"fooid": "my_foo_2",
"name": "name of foo2",
"organizationId": "myOrganization"
},
{
"fooid": "my_foo_3",
"name": "name of foo3",
"organizationId": "myOrganization"
}
],
"fooCombinations": [
{
"id": "combination1",
"foos": [
"my_foo_1",
"my_foo_2"
],
"organizationId": "myOrganization"
},
{
"id": "combination2",
"foos": [
"my_foo_1",
"my_foo_3"
],
"organizationId": "myOrganization"
},
{
"id": "combination3",
"foos": [
"my_foo_2",
"my_foo_3"
],
"organizationId": "myOrganization"
}
]
}
You see, foos
in fooCombinations
collection refers to fooid
in the foo
collection.
Now my $lookup looks like:
db.fooCombinations.aggregate([
{
$lookup: {
from: "foo",
let: {
foos: "$foos",
organizationId: "$organizationId",
},
pipeline: [
{
$match: {
$expr: {
$and: [
{
$eq: [
"$organizationId",
"$$organizationId"
]
},
{
$in: [
"$fooid",
"$$foos"
],
},
],
},
},
}
],
as: "fooAdditionalInfo",
},
}
])
( Put that also on Mongo playground: Mongo playground )
Now the problem is performance. The fooCombinations
collection has in reality about 2000 documents for that organization and will grow. Even right now, that query takes about 20 seconds - all the parts without the $lookup take just 400ms. So, the $lookup is defintely the problem here.
There is an index on fooid
but I know that $in inside $lookup cannot use indexes. I also know that certain improvements have been made on newer versions of Mongo, but right now I’m stuck with version 4.
So, what can I do to drastically decrease the execution time? (Small improvements won’t help. Reducing it from 20s to 10s won’t help. It need to be under 1 second.)
My line of thinking is: Is it wise to first create an array of all foo ids of all fooCombinations in one organization and then query “foo” just once with that big array and then afterwards “distribute” to foo results back to fooCombinations?
If that makes sense performance-wise, how would I do that?