I have a fairly straightforward aggregation pipeline that joins a collection with itself, but I'm having trouble getting the performance I expect. At a high level, I want to join on collection1.dest = collection2.src AND collection2.type = 'some_constant', where collection1 and collection2 are actually the same collection.
My pipeline looks like this:
[
  // Get list of edges, returns 256 docs
  { $match: { source: "some_id", type: "1" } },
  // Lookup destinations, returns 256 docs
  { $lookup: {
      from: "grouped_assocs",
      localField: "destination_id",
      foreignField: "source_id",
      as: "target_objs",
      pipeline: [ { $match: { type: "2" } } ]
  } }
]
Today, this takes ~100ms. By upgrading to MongoDB 6.0 and removing the pipeline filter from the $lookup stage, it speeds up to ~25ms (it then returns ~3k documents, though, which can be mitigated by filtering in another stage afterwards). However, if I simply fire two separate queries at the DB, I can achieve ~1ms for both stages (at the cost of incurring network latency twice).
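In case the wording is unclear, this is roughly what I mean by the MongoDB 6.0 variant: a plain $lookup with no inner pipeline, followed by a separate stage that filters the joined array (a sketch only; the $addFields/$filter step is just one way to do that post-filtering, and the field/collection names are taken from the pipeline above):

db.grouped_assocs.aggregate([
  // Get list of edges
  { $match: { source: "some_id", type: "1" } },
  // Plain lookup, no inner pipeline (fast on 6.0, but joins docs of every type)
  { $lookup: {
      from: "grouped_assocs",
      localField: "destination_id",
      foreignField: "source_id",
      as: "target_objs"
  } },
  // Drop joined docs whose type isn't "2" in a follow-up stage
  { $addFields: {
      target_objs: {
        $filter: {
          input: "$target_objs",
          as: "t",
          cond: { $eq: [ "$$t.type", "2" ] }
        }
      }
  } }
])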
I'm confused about why the pipeline can't be treated the same as two sequential queries to the database. Here are the sequential queries, which result in much better performance:
// Get the list of edges, 256 docs
[ { $match: { source: "some_id", type: "1" } } ]
followed by:
// Load the resulting objects for each edge, 256 docs
[
{ $match: { type: '2', src: { $in: [ list of "dest" returned from first query ] } } }
]
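Issued from the shell, those two round trips look roughly like this (a sketch; the variable names are illustrative, and I'm assuming the destination_id/source_id field names from the $lookup above rather than the shorthand src/dest):

// First round trip: get the edges
const edges = db.grouped_assocs.aggregate([
  { $match: { source: "some_id", type: "1" } }
]).toArray();

// Second round trip: load the target objects for those edges
const destIds = edges.map(e => e.destination_id);
const targets = db.grouped_assocs.find(
  { type: "2", source_id: { $in: destIds } }
).toArray();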
I understand the join and the $in query have slightly different results ($in would remove any dupes and return a different order, although in my case neither of these matters). Is there some way to achieve the same performance with a single pipeline, though?
Note: I have a compound index on (src, type), and it does seem to be used in all of the cases listed above.
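For completeness, that index was created roughly like this (assuming the src/type field names from the second query; adjust if the documents actually use source_id/type):

db.grouped_assocs.createIndex({ src: 1, type: 1 })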