How to optimise a $lookup

Hello!
I am looking for a better way to write an horribly slow aggregate query. The objective is to extract

// from metrics db
{
  $lookup: { 
      "from" : "snapshot", 
      "localField" : "_id", 
      "foreignField" : "events.resource._id", 
      "as" : "snapshots"
  }
},
{
  $match: {'snapshots': []}
},
{
  $project: {'_id': 1}
}

The goal is to extract all metrics that does not have at least one snapshot, metrics entity being a simple json but the snapshot’s foreignField is an array:

{
    "_id": {
        "$oid": "61139681c5189368714d9ef3"
    },
    "events": [{
        "resource": {
            "_id": {
                "$oid": "61139622c5189368714d92c5"
            },
        }
    }],
}

The issue is that with only a couple thousand snapshots entities, the query takes already a couple minutes to run. The joining part is almost instant, the slowness comes from the match…
Any idea on how I could rewrite this?

Thanks a lot :slight_smile:

‘explain’ shows a COLLSCAN from the join, so the slowness has nothing to do with the $match. Replaced it with a $sort and it’s the same execution time.

COLLSCAN means you have no index on the queried field.

Try adding an index on events.resource._id

Yes I did that already but it’s not taken into account. Explain always shows a ‘indexFilterSet: false’ :thinking:

Actually with a simple

   {
      $lookup: { 
          "from" : "snapshot", 
          "localField" : "_id", 
          "foreignField" : "_id", 
          "as" : "snapshots"
      }
    }

it takes hours to execute with less than 100k elements in each collection. Is that expected?? ‘_id’ fields are indexed by default in all collections…

If I understand you want to the _id of some resource for which there is NO events from snapshot.

If it is the case then I have the following idea that might work.

Basically, you look at all the events which is slow but in fact you just locate one then you know that you can prune the given _id. I think you could use one of the alternate https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/ where you can set a pipeline where you would limit the $lookup only to the first occurrence of a given events.resource._id.

An alternative might be https://docs.mongodb.com/manual/reference/operator/aggregation/graphLookup/ with a maxDepth.

Well that does it, thanks. It’s indeed a better approach, but I still don’t understand the slowness related to the simple $lookup above :man_shrugging:

The only thing I could think is that since you scan the whole snapshot collection it might not fit in RAM and disk I/O becomes a bottle neck. By limit the search to the first one, the working set has better chance to fit in RAM.

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.