How to optimise a $lookup

Joel_Lafaille · August 11, 2021, 10:28am

Hello!
I am looking for a better way to write an horribly slow aggregate query. The objective is to extract

// from metrics db
{
  $lookup: { 
      "from" : "snapshot", 
      "localField" : "_id", 
      "foreignField" : "events.resource._id", 
      "as" : "snapshots"
  }
},
{
  $match: {'snapshots': []}
},
{
  $project: {'_id': 1}
}

The goal is to extract all metrics that does not have at least one snapshot, metrics entity being a simple json but the snapshot’s foreignField is an array:

{
    "_id": {
        "$oid": "61139681c5189368714d9ef3"
    },
    "events": [{
        "resource": {
            "_id": {
                "$oid": "61139622c5189368714d92c5"
            },
        }
    }],
}

The issue is that with only a couple thousand snapshots entities, the query takes already a couple minutes to run. The joining part is almost instant, the slowness comes from the match…
Any idea on how I could rewrite this?

Thanks a lot

Joel_Lafaille · August 11, 2021, 11:52am

‘explain’ shows a COLLSCAN from the join, so the slowness has nothing to do with the $match. Replaced it with a $sort and it’s the same execution time.

steevej · August 11, 2021, 12:22pm

COLLSCAN means you have no index on the queried field.

Try adding an index on events.resource._id

Joel_Lafaille · August 11, 2021, 1:01pm

Yes I did that already but it’s not taken into account. Explain always shows a ‘indexFilterSet: false’

Joel_Lafaille · August 12, 2021, 7:25am

Actually with a simple

   {
      $lookup: { 
          "from" : "snapshot", 
          "localField" : "_id", 
          "foreignField" : "_id", 
          "as" : "snapshots"
      }
    }

it takes hours to execute with less than 100k elements in each collection. Is that expected?? ‘_id’ fields are indexed by default in all collections…

steevej · August 12, 2021, 1:07pm

If I understand you want to the _id of some resource for which there is NO events from snapshot.

If it is the case then I have the following idea that might work.

Basically, you look at all the events which is slow but in fact you just locate one then you know that you can prune the given _id. I think you could use one of the alternate https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/ where you can set a pipeline where you would limit the $lookup only to the first occurrence of a given events.resource._id.

An alternative might be https://docs.mongodb.com/manual/reference/operator/aggregation/graphLookup/ with a maxDepth.

Joel_Lafaille · August 12, 2021, 1:41pm

Well that does it, thanks. It’s indeed a better approach, but I still don’t understand the slowness related to the simple $lookup above

steevej · August 12, 2021, 1:45pm

The only thing I could think is that since you scan the whole snapshot collection it might not fit in RAM and disk I/O becomes a bottle neck. By limit the search to the first one, the working set has better chance to fit in RAM.

system · August 17, 2021, 1:45pm

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.