In my collection, each document contains either a notificationId or a schedulerId — never both. I want to perform a $lookup based on whichever field exists to pull data from either the notifications or schedulers collection.
My concern is:
If I write an aggregation pipeline that includes both $lookup stages (one for each field), will the pipeline still run both lookups for every document — even if one of the fields doesn’t exist? Could that affect performance on large datasets?
Is there a better way to structure the aggregation so that only the necessary $lookup is performed based on the field that exists?
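For reference, a sketch of the two-$lookup pipeline the question describes (collection and field names taken from the description above; the "as" names are assumptions):

```js
// Both $lookup stages run for every document. When the localField is
// missing, $lookup treats it as null and still probes the foreign collection.
db.collection.aggregate([
  {
    $lookup: {
      from: "notifications",
      localField: "notificationId",
      foreignField: "_id",
      as: "notification"
    }
  },
  {
    $lookup: {
      from: "schedulers",
      localField: "schedulerId",
      foreignField: "_id",
      as: "schedule"
    }
  }
])
```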
Yes, it will. Even if there’s no match, the lookup has to occur to establish that there’s no match.
Whether it matters is worth measuring, and depends on how large the dataset is. If you’re looking up on the _id field, it will still be quite fast.
To improve performance, you can use $facet or $unionWith, either one combined with a $match stage on the relevant field.
If there’s a $group operation involved, $facet works well. But $facet has a 100 MB per-stage limit, and since it returns everything in a single document, the standard 16 MB document limit also applies to the final result. So if you want the full set of documents, you will need $unionWith.
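A $facet version might look something like this (a sketch only; the facet names are assumptions). Note that the entire output is one document, which is why the 16 MB limit bites here:

```js
db.collection.aggregate([
  {
    $facet: {
      // each facet pre-filters on its field, then does one lookup
      byNotification: [
        { $match: { notificationId: { $exists: true } } },
        { $lookup: { from: "notifications", localField: "notificationId",
                     foreignField: "_id", as: "notification" } }
      ],
      byScheduler: [
        { $match: { schedulerId: { $exists: true } } },
        { $lookup: { from: "schedulers", localField: "schedulerId",
                     foreignField: "_id", as: "schedule" } }
      ]
    }
  }
])
```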
It will, but you’ll need to choose which one makes more sense to have:
a) Lookup both collections
b) Pre-filter before two separate lookups.
Consider that you have 1,000 docs with a 50-50 split between the two fields. Doing both lookups directly means 2,000 lookups, which you have already said you want to avoid. The other way filters the 1,000 docs down to 500 in each branch, so it performs 500 + 500 lookups. To speed up the two $match stages, you can also add a partial index on each of those fields with $exists: true.
One field being null while the other is non-null is different from “only one field exists”. But that’s a minor change to the aggregation and the partial index: in that case you’d filter on $type: "string" or $type: "objectId", depending on what your data actually contains.
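For example, the partial indexes backing those two $match stages could be created like this (a sketch; partialFilterExpression supports both $exists: true and $type):

```js
// Index only the docs where the field actually exists...
db.collection.createIndex(
  { notificationId: 1 },
  { partialFilterExpression: { notificationId: { $exists: true } } }
)

// ...or filter on the field's type to also exclude explicit nulls
// (use "objectId" instead of "string" if that's the actual data type)
db.collection.createIndex(
  { schedulerId: 1 },
  { partialFilterExpression: { schedulerId: { $type: "string" } } }
)
```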
If you don’t have or can’t create partial indexes on those two fields, but your lookup for notificationId and schedulerId is against the _id field of the other collections, then the lookup will use that unique index. And that could make the 2,000 lookups faster than the twice “500 $match on a non-indexed field & lookup on the indexed _id field”. You’ll have to try both options to see where the cutover occurs with respect to the number of documents.
If your foreignField in the lookup for notifications and schedulers collection is not an indexed field, then it’s very likely that filtering twice before two lookups will be faster for non-trivially sized collections.
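One way to compare the two options is to look at executionStats for each pipeline, e.g.:

```js
// naivePipeline and filterFirstPipeline are placeholders for the two
// pipeline variants being compared; check total execution time and
// documents examined in the output
db.collection.explain("executionStats").aggregate(naivePipeline)
db.collection.explain("executionStats").aggregate(filterFirstPipeline)
```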
Got it, thanks for the clarification! I’ll test both approaches—using partial indexes where possible and leveraging the _id field during lookups—to see which one performs better depending on the document count. I’ll focus on filtering first if the foreign fields aren’t indexed.
Adding the twice-match-then-lookup method here, using $unionWith:
Filter on one of the fields, do the lookup
Union with:
Filter on the other field, do the lookup
Add unwind, sort, etc. as needed. I’ve put just one sort at the end.
db.collection.aggregate([
  // the `$type: "string"` implies not null in the data
  // use `$type: "objectId"` if that's the data type in the actual docs
  { $match: { notificationId: { $type: "string" } } },
  {
    $lookup: {
      from: "notifications",
      localField: "notificationId",
      foreignField: "_id",
      as: "notification"
    }
  },
  {
    $unionWith: {
      coll: "collection",
      pipeline: [
        { $match: { schedulerId: { $type: "string" } } },
        {
          $lookup: {
            from: "schedulers",
            localField: "schedulerId",
            foreignField: "_id",
            as: "schedule"
          }
        }
      ]
    }
  },
  { $sort: { _id: 1 } }
])
Mongo Playground with example data and results. By the way, I’ve added one doc which has both fields, just to show that the results would be odd if such a doc existed. (But you have specified that this won’t occur.)