Lookup stages take too long even though indexed

Marco_D_Agostino · January 13, 2022, 4:21pm

So, I have a collection with millions of documents. I’m trying to create an aggregation which does several steps to eventually retrieve a field and its counting in the same collection. by doing some filtering.
Here’s my pipeline:

[{$match: {
 field1: 'xxxxxxxx-yyyy-zzzz-wwww-qqqqqqqqqqqq'
}}, 
{$limit: 50000}, 
{$lookup: {
 from: 'collection',
 localField: 'field2',
 foreignField: 'field2',
 as: 'field2_field2'
}}, 
{$lookup: {
 from: 'collection',
 localField: 'field2',
 foreignField: 'field3.field2',
 as: 'field2_field3'
}}, 
{$lookup: {
 from: 'collection',
 localField: 'field3.field2',
 foreignField: 'field2',
 as: 'field3_field2'
}}, 
{$project: {
 field2: 1,
 ids: {
  $concatArrays: [
   {
    $ifNull: [
     '$field2_field2',
     []
    ]
   },
   {
    $ifNull: [
     '$field2_field3',
     []
    ]
   },
   {
    $ifNull: [
     '$field3_field2',
     []
    ]
   }
  ]
 }
}}, 
{$project: {
 field2: 1,
 ids: {
  $filter: {
   input: '$ids',
   as: 'item',
   cond: {
    $and: [
     {
      $ne: [
       '$$item._id',
       '$_id'
      ]
     },
     {
      $eq: [
       '$$item.field4',
       '23'
      ]
     },
     {
      $eq: [
       {
        $type: '$$item.deleted'
       },
       'missing'
      ]
     }
    ]
   }
  }
 }
}}, 
{$project: {
 _id: 0,
 field2: 1,
 counts: {
  $size: '$ids'
 }
}}]

I’m having troubles with the indexes that I think are not being used properly.
First of all, when I perform an explain(“executionStats”) on the aggregate, the final explain doesn’t have for each stage the indexes used. This way I cannot understand how it worked.
Secondly, I’m limiting the query to 50K documents, since if I do it for the whole collection by field1 (2M+ docs) it gets stuck.

I’m using index on field1:

{“field1”:1, …}

(there are a lot of compound indexes with field1 as prefix)

then for the lookup stages:

{“field2”:1, field4}

(used for another query)

{“field3.field2”:1, “field2”:1}

And these are the execution stats:

           { executionSuccess: true,
             nReturned: 50000,
             executionTimeMillis: 18732,
             totalKeysExamined: 50000,
             totalDocsExamined: 50000,
             executionStages: 
              { stage: 'LIMIT',
                nReturned: 50000,
                executionTimeMillisEstimate: 84,
                works: 50001,
                advanced: 50000,
                needTime: 0,
                needYield: 0,
                saveState: 56,
                restoreState: 56,
                isEOF: 1,
                limitAmount: 50000,
                inputStage: 
                 { stage: 'PROJECTION_DEFAULT',
                   nReturned: 50000,
                   executionTimeMillisEstimate: 84,
                   works: 50000,
                   advanced: 50000,
                   needTime: 0,
                   needYield: 0,
                   saveState: 56,
                   restoreState: 56,
                   isEOF: 0,
                   transformBy: 
                    { _id: 1,
                      field3_field2: 1,
                      'field3.field2': 1,
                      field2_field3: 1,
                      field2_field2: 1,
                      field2: 1 },
                   inputStage: 
                    { stage: 'FETCH',
                      nReturned: 50000,
                      executionTimeMillisEstimate: 73,
                      works: 50000,
                      advanced: 50000,
                      needTime: 0,
                      needYield: 0,
                      saveState: 56,
                      restoreState: 56,
                      isEOF: 0,
                      docsExamined: 50000,
                      alreadyHasObj: 0,
                      inputStage: 
                       { stage: 'IXSCAN',
                         nReturned: 50000,
                         executionTimeMillisEstimate: 21,
                         works: 50000,
                         advanced: 50000,
                         needTime: 0,
                         needYield: 0,
                         saveState: 56,
                         restoreState: 56,
                         isEOF: 0,
                         keyPattern: { field1: 1, field2: 1 },
                         indexName: 'i_field1_field2',
                         isMultiKey: false,
                         multiKeyPaths: { field1: [], field2: [] },
                         isUnique: false,
                         isSparse: false,
                         isPartial: false,
                         indexVersion: 2,
                         direction: 'forward',
                         indexBounds: 
                          { field1: [ '["xxxxxxxx-yyyy-zzzz-wwww-qqqqqqqqqqqq"]' ],
                            field2: [ '[MinKey, MaxKey]' ] },
                         keysExamined: 50000,
                         seeks: 1,
                         dupsTested: 0,
                         dupsDropped: 0 } } } } } },
       nReturned: 50000,
       executionTimeMillisEstimate: 215 },
     { '$lookup': 
        { from: 'collection',
          as: 'field2_field2',
          localField: field2',
          foreignField: 'field2' },
       nReturned: 50000,
       executionTimeMillisEstimate: 5456 },
     { '$lookup': 
        { from: 'collection',
          as: 'field2_field3',
          localField: 'field2',
          foreignField: 'field3.field2' },
       nReturned: 50000,
       executionTimeMillisEstimate: 14914 },
     { '$lookup': 
        { from: 'collection',
          as: 'field3_field2',
          localField: 'field3.field2',
          foreignField: 'field2' },
       nReturned: 50000,
       executionTimeMillisEstimate: 18680 },
     { '$project': 
        { _id: true,
          field2: true,
          ids: 
           { '$concatArrays': 
              [ { '$ifNull': [ '$field2_field2', { '$const': [] } ] },
                { '$ifNull': [ '$field2_field3', { '$const': [] } ] },
                { '$ifNull': [ '$field3_field2', { '$const': [] } ] } ] } },
       nReturned: 50000,
       executionTimeMillisEstimate: 18681 },
     { '$project': 
        { _id: true,
          field2: true,
          ids: 
           { '$filter': 
              { input: '$ids',
                as: 'item',
                cond: 
                 { '$and': 
                    [ { '$ne': [ '$$item._id', '$_id' ] },
                      { '$eq': [ '$$item.field4', { '$const': '23' } ] },
                      { '$eq': [ { '$type': [ '$$item.deleted' ] }, { '$const': 'missing' } ] } ] } } } },
       nReturned: 50000,
       executionTimeMillisEstimate: 18684 },
     { '$project': 
        { field2: true,
          counts: { '$size': [ '$ids' ] },
          _id: false },
       nReturned: 50000,
       executionTimeMillisEstimate: 18685 } ],

Here’s an example of final output document:

{“field2”: “a field”, “counts”: 15}

Did I miss any possible index?
Can the pipeline be optimized?
And why is the $match stage FETCHing docs while being supported by field1 index?

Rafael_Green · January 13, 2022, 9:01pm

Hi @Marco_D_Agostino,
If you’re trying to count elements by value of a specific field that’s the way to do it:

[
  {
    $match: {
       field1: 'xxxxxxxx-yyyy-zzzz-wwww-qqqqqqqqqqqq'
    }
  },
  { 
    $project: {
       _id: 0,
       field1: 1,
       field2: 1
    }
  },
  {
    $limit: 50000
  }, 
  {
    '$group': {
      '_id': '$field2', 
      'count': {
        '$count': {}
      }
    }
  },
  {
    $project: {
        _id: 0,
        field2: '$_id',
        count: 1
    }
  }
]

Thanks,
Rafael,

Marco_D_Agostino · January 14, 2022, 8:19am

hi Rafael, that’s not actually what I’m trying to do.
Let’ susppose we have a document like this:

{ 
  "field1": "xxxxxxxx",
  "field2": "yyyy"
  "field3": {
                field2: "zzzz"
               ....
               }
  ...
}

What I’m trying to do is, for EACH document, finding its countings based on:
field2 in field2 ($ne itself —>
$ne: [
‘$$item._id’,
‘$_id’ so that I don’t count itself)
field2 in field3.field2,
field3.field2 in field2
And this for EACH document.
Take the first document : search its field2 in OTHER field2, its field2 in OTHER field3.field2, its field3.field2 in OTHER field2. Then take the second document and do it again,… and so on, for all the documents that match field1 = “xxxxx-yyyy-zzzz-wwwww”
So for EACH document I’m having 3 lists (given from the left outer joins, i.e. lookups). Each list is the result of those searches. And this way, I’m concatenating those 3 lists and, by doing some filtering, retrieve the object counting of the concatenated lists.
I hope it’s more clear now

Rafael_Green · January 15, 2022, 5:41am

Hi @Marco_D_Agostino,
Sorry but I don’t understand what you’re trying to do.
Can you give some examples?
Thanks,
Rafael,

Marco_D_Agostino · January 17, 2022, 9:07am

Ok so…
Let’s reduce it to 3 documents:

doc1:

{ 
  "field1": "value1"
  "field2": {
                field1: "value2"
               ....
               }
  ...
}

doc2:

{ 
  "field1": "value1"
  "field2": {
                field1: "value3"
               ....
               }
  ...
}

doc3:

{ 
  "field1": "value3",
  "field2": {
                field1: "value1"
               ....
               }
  ...
}

Then, the final result shall be 3 documents, one for each document in the collection :

{"field1": "value1", countings: 2} --> doc1
{"field1": "value1", countings: 3} --> doc2
{"field1": "value3", countings: 3} --> doc3

For doc1:
its field1’s value1 is present in doc2’s field1 and in doc3’s field2.field1, but its field2.field1’s value2 is NOT present in ANY OTHER field1 of other documents. So the counting for it is 2 times.

For doc2:
its field1’s value1 is present in doc1’s field1 and in doc3’s field2.field1, and its field2.field1’s value3 is present in doc3’s field1. So the counting for it is 3.

For doc3:
its field1’s value3 is present in doc2’s field2.field1 and its field2.field1’s value1 is present in both doc1 and doc2’s field1. So its counting is 3.

So, I’m searching recursively inside the same collection, for each document, its:

field1’s value inside ANY OTHER field1 and field2.field1
field2.field1 inside ANY OTHER field1

without counting itself (–> “ANY OTHER”)

I sum both conditions into one counting for each document.

I thought of 3 lookups on same collection. The parent collection will have much less documents than the “from” collection, because the parent collection is first filtered out by another field, but the left joined collection (the same, the “from” collection) is having ALL the collection’s documents, so billions. I tried out also with let and pipeline to filter out elements in “from” collection, but the lookups timed out. So I can filter them out only in the last project stages.
Hope it’s clearer now

Rafael_Green · January 18, 2022, 12:51am

Hi @Marco_D_Agostino,
Try to run this from mongosh:

ensure you have these indexes:

db.yourcollection.createIndex({field4: 1, field1: 1, "field2.field1": 1});
db.yourcollection.createIndex({field4: 1, "field2.field1": 1});

run this from mongosh:

let printCount = function (doc) {
  let c1 = db.yourcollection.count({field4: doc.field4, field1: doc.field1});
  c1-=1;
  let c2 = db.yourcollection.count({field4: doc.field4, "field2.field1": doc.field1});
  if (doc.field2.field1 == doc.field1) c2-=1;
  let c3 = db.yourcollection.count({field4: doc.field4, "field1": doc.field2.field1});
  if (doc.field2.field1 == doc.field1) c3-=1;
  print ({...doc, count: c1 + c2 + c3});
}

db.yourcollection.find({field4: 23}, {_id: 0, field4: 1, field1: 1, "field2.field1": 1}).forEach(printCount)

I tried it for this input:

db.yourcollection.insertMany([
{ 
  "field1": "value1",
  "field2": {
                field1: "value2"
               },
  "field4": 23
},
{ 
  "field1": "value1",
  "field2": {
                field1: "value3"
               },
  "field4": 23
},
{ 
  "field1": "value3",
  "field2": {
                field1: "value1"
               },
  "field4": 23
}])

and got this result:

{ field4: 23,
  field1: 'value1',
  field2: { field1: 'value2' },
  count: 2 },
{ field4: 23,
  field1: 'value1',
  field2: { field1: 'value3' },
  count: 3 },
{ field4: 23,
  field1: 'value3',
  field2: { field1: 'value1' },
  count: 3 }

Goodluck,
Rafael,

Marco_D_Agostino · January 19, 2022, 4:14pm

The very big downside of this solution is the number of operations. This collection has billions of docs. If I even reduce it to millions by the first find, it’s millions x 3 calls to the server. If one count takes even 1ms and the find yields let’s say 2M docs, that means 2 x 10^6 x 10^-3 seconds, i.e. 30 minute ish to compute. This solution works for very small collections. But thank you anyway for your time, I appreciate it a lot

Rafael_Green · January 20, 2022, 1:43pm

Have you tried running the code I sent you?
If you follow the instructions I sent you, all count operations will use the COUNT_SCAN stage which is super fast.
Another way to solve the issue is to use the computed pattern.
Goodluck,
Rafael,

Marco_D_Agostino · January 20, 2022, 2:40pm

Yes, I’ve tried your code and I had to force close mongo client because it was running endlessly, printing out each count. Even though COUNT_SCAN is taking 0 ms to compute, it’s printing counts one by one, with a very small amount of ms from each other. It has to do this millions x millions times. Not the best…

Rafael_Green · January 20, 2022, 3:20pm

In my opinion, the printing itself takes most of the time. Try to check how to save the output to file instead of printing it to the screen.