Query performance on array using $all is slow despite using an index

Hi there, I am a bit new to MongoDB. I am set up on MongoDB Atlas storing a large number of documents (~500,000) that each have an array of tags. The tag array usually has around 10 to 20 tags.

I am trying to query for documents matching several tags, sorted by a field.
The index I have is defined like this:

{ tags: 1, "sort.profit": -1 }

and then the query is something like:

{
  "filter": {
    "tags": {
      "$all": [
        "profitability|profitable",
        "currency-code|gbp",
        "bids|no-bid",
        "end-time|less-than-24-hours"
      ]
    }
  },
  "sort": {
    "sort.profit": -1
  },
  "limit": 50
}
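
In mongosh, the setup looks roughly like this (collection name taken from the explain output further down):

db.getCollection("marketplace.listings").createIndex({ tags: 1, "sort.profit": -1 })

db.getCollection("marketplace.listings")
  .find({ tags: { $all: [
    "profitability|profitable",
    "currency-code|gbp",
    "bids|no-bid",
    "end-time|less-than-24-hours"
  ] } })
  .sort({ "sort.profit": -1 })
  .limit(50)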

I’ve attached a profiler log with some bits redacted:

{
  "type": "command",
  "ns": "",
  "command": {
    "find": "table",
    "filter": {
      "tags": {
        "$all": [
          "profitability|profitable",
          "currency-code|gbp",
          "bids|no-bid",
          "end-time|less-than-24-hours"
        ]
      }
    },
    "sort": {
      "sort.profit": -1
    },
    "limit": 50,
    "lsid": {
      "id": {
        "$binary": {
          "base64": "",
          "subType": ""
        }
      }
    },
    "$clusterTime": {
      "clusterTime": {
        "$timestamp": {
          "t": 0,
          "i": 1
        }
      },
      "signature": {
        "hash": {
          "$binary": {
            "base64": "",
            "subType": ""
          }
        },
        "keyId": 
      }
    },
    "$db": ""
  },
  "planSummary": "IXSCAN { tags: 1, sort.profit: -1 }",
  "keysExamined": 2310,
  "docsExamined": 2310,
  "fromMultiPlanner": true,
  "cursorExhausted": true,
  "numYields": 444,
  "nreturned": 50,
  "queryHash": "",
  "planCacheKey": "",
  "reslen": 342662,
  "locks": {
    "Global": {
      "acquireCount": {
        "r": 445
      }
    },
    "Mutex": {
      "acquireCount": {
        "r": 1
      }
    }
  },
  "readConcern": {
    "level": "local",
    "provenance": "implicitDefault"
  },
  "storage": {
    "data": {
      "bytesRead": 698110513,
      "timeReadingMicros": 7267966
    },
    "timeWaitingMicros": {
      "cache": 28
    }
  },
  "remote": "",
  "protocol": "op_msg",
  "durationMillis": 8175,
  "v": "5.0.9"
}

I am seeing query times of more than 20 seconds in some cases. I assume I have set something up incorrectly, or I am trying to make MongoDB do something it wasn’t designed for.

Hi @Doog, and welcome to the MongoDB Community :muscle:!

Try inverting your index to { "sort.profit": -1, tags: 1 }, as the ESR (Equality, Sort, Range) rule is broken here: $all behaves like a range predicate.

This should avoid the in-memory sort, which can be expensive, especially if you don’t have a lot of RAM.

Can you please share the winning plan from explain(true) for this query, before and after?
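
In mongosh, that would be something along these lines (the collection name is a placeholder here):

db.listings.createIndex({ "sort.profit": -1, tags: 1 })

db.listings.explain(true)
  .find({ tags: { $all: [ /* your four tags */ ] } })
  .sort({ "sort.profit": -1 })
  .limit(50)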

Cheers,
Maxime.


Hi Maxime,

Thank you for the reply. I have done as you asked and this is the result.
Before adding the new index:

{
  explainVersion: '1',
  queryPlanner: {
    namespace: 'pokeprice.marketplace.listings',
    indexFilterSet: false,
    parsedQuery: {
      '$and': [
        { tags: { '$eq': 'profitability|profitable' } },
        { tags: { '$eq': 'currency-code|gbp' } },
        { tags: { '$eq': 'bids|no-bid' } },
        { tags: { '$eq': 'end-time|less-than-24-hours' } }
      ]
    },
    maxIndexedOrSolutionsReached: false,
    maxIndexedAndSolutionsReached: false,
    maxScansToExplodeReached: false,
    winningPlan: {
      stage: 'LIMIT',
      limitAmount: 50,
      inputStage: {
        stage: 'FETCH',
        filter: {
          '$and': [
            { tags: { '$eq': 'profitability|profitable' } },
            { tags: { '$eq': 'currency-code|gbp' } },
            { tags: { '$eq': 'bids|no-bid' } }
          ]
        },
        inputStage: {
          stage: 'IXSCAN',
          keyPattern: { tags: 1, 'sort.profit': -1 },
          indexName: 'tags_1_sort.profit_-1',
          isMultiKey: true,
          multiKeyPaths: { tags: [ 'tags' ], 'sort.profit': [] },
          isUnique: false,
          isSparse: false,
          isPartial: false,
          indexVersion: 2,
          direction: 'forward',
          indexBounds: {
            tags: [
              '["end-time|less-than-24-hours", "end-time|less-than-24-hours"]'
            ],
            'sort.profit': [ '[MaxKey, MinKey]' ]
          }
        }
      }
    },
    rejectedPlans: []
  },
  command: {
    find: 'marketplace.listings',
    filter: {
      tags: {
        '$all': [
          'profitability|profitable',
          'currency-code|gbp',
          'bids|no-bid',
          'end-time|less-than-24-hours'
        ]
      }
    },
    sort: { 'sort.profit': -1 },
    limit: 50,
    '$db': 'pokeprice'
  },
  serverParameters: {
    internalQueryFacetBufferSizeBytes: 104857600,
    internalQueryFacetMaxOutputDocSizeBytes: 104857600,
    internalLookupStageIntermediateDocumentMaxSizeBytes: 104857600,
    internalDocumentSourceGroupMaxMemoryBytes: 104857600,
    internalQueryMaxBlockingSortMemoryUsageBytes: 104857600,
    internalQueryProhibitBlockingMergeOnMongoS: 0,
    internalQueryMaxAddToSetBytes: 104857600,
    internalDocumentSourceSetWindowFieldsMaxMemoryBytes: 104857600
  },
  ok: 1,
  '$clusterTime': {
    clusterTime: Timestamp({ t: 1657142274, i: 1 }),
    signature: {
      hash: Binary(Buffer.from("84e75b45192ea6df09a60ac4f3757211f2aab19a", "hex"), 0),
      keyId: Long("7065994558026285062")
    }
  },
  operationTime: Timestamp({ t: 1657142274, i: 1 })
}

After adding the new index:

{
  explainVersion: '1',
  queryPlanner: {
    namespace: 'pokeprice.marketplace.listings',
    indexFilterSet: false,
    parsedQuery: {
      '$and': [
        { tags: { '$eq': 'profitability|profitable' } },
        { tags: { '$eq': 'currency-code|gbp' } },
        { tags: { '$eq': 'bids|no-bid' } },
        { tags: { '$eq': 'end-time|less-than-24-hours' } }
      ]
    },
    maxIndexedOrSolutionsReached: false,
    maxIndexedAndSolutionsReached: false,
    maxScansToExplodeReached: false,
    winningPlan: {
      stage: 'LIMIT',
      limitAmount: 50,
      inputStage: {
        stage: 'FETCH',
        filter: {
          '$and': [
            { tags: { '$eq': 'profitability|profitable' } },
            { tags: { '$eq': 'currency-code|gbp' } },
            { tags: { '$eq': 'bids|no-bid' } }
          ]
        },
        inputStage: {
          stage: 'IXSCAN',
          keyPattern: { tags: 1, 'sort.profit': -1 },
          indexName: 'tags_1_sort.profit_-1',
          isMultiKey: true,
          multiKeyPaths: { tags: [ 'tags' ], 'sort.profit': [] },
          isUnique: false,
          isSparse: false,
          isPartial: false,
          indexVersion: 2,
          direction: 'forward',
          indexBounds: {
            tags: [
              '["end-time|less-than-24-hours", "end-time|less-than-24-hours"]'
            ],
            'sort.profit': [ '[MaxKey, MinKey]' ]
          }
        }
      }
    },
    rejectedPlans: []
  },
  command: {
    find: 'marketplace.listings',
    filter: {
      tags: {
        '$all': [
          'profitability|profitable',
          'currency-code|gbp',
          'bids|no-bid',
          'end-time|less-than-24-hours'
        ]
      }
    },
    sort: { 'sort.profit': -1 },
    limit: 50,
    '$db': 'pokeprice'
  },
  serverParameters: {
    internalQueryFacetBufferSizeBytes: 104857600,
    internalQueryFacetMaxOutputDocSizeBytes: 104857600,
    internalLookupStageIntermediateDocumentMaxSizeBytes: 104857600,
    internalDocumentSourceGroupMaxMemoryBytes: 104857600,
    internalQueryMaxBlockingSortMemoryUsageBytes: 104857600,
    internalQueryProhibitBlockingMergeOnMongoS: 0,
    internalQueryMaxAddToSetBytes: 104857600,
    internalDocumentSourceSetWindowFieldsMaxMemoryBytes: 104857600
  },
  ok: 1,
  '$clusterTime': {
    clusterTime: Timestamp({ t: 1657143513, i: 11 }),
    signature: {
      hash: Binary(Buffer.from("35b990efa3495ecc98a61cdbce04286081f501c9", "hex"), 0),
      keyId: Long("7065994558026285062")
    }
  },
  operationTime: Timestamp({ t: 1657143513, i: 11 })
}

It seems it doesn’t want to choose the new index.

Just to be clear, I added { "sort.profit": -1, "tags": 1 } as the new index.

I removed the rejected plans from these because there was a large number of them; would you like to see those too?

Looking through the plan, it looks like it uses the index to find one of the tags and then does an un-indexed filter over the remainder, which in this case is going to be quite large.

Is there a way to get it to search for every tag using the index?

If I force the query to use the index with sort first then tags, it runs considerably slower:
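
For reference, forcing it looks like this; the hint string matches the index name that shows up in the plan below:

db.getCollection("marketplace.listings").explain("executionStats")
  .find({ tags: { $all: [
    "profitability|profitable",
    "currency-code|gbp",
    "bids|no-bid",
    "end-time|less-than-24-hours"
  ] } })
  .sort({ "sort.profit": -1 })
  .hint("sort.profit_-1_tags_1")
  .limit(50)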

{
  explainVersion: '1',
  queryPlanner: {
    namespace: 'pokeprice.marketplace.listings',
    indexFilterSet: false,
    parsedQuery: {
      '$and': [
        { tags: { '$eq': 'profitability|profitable' } },
        { tags: { '$eq': 'currency-code|gbp' } },
        { tags: { '$eq': 'bids|no-bid' } },
        { tags: { '$eq': 'end-time|less-than-24-hours' } }
      ]
    },
    maxIndexedOrSolutionsReached: false,
    maxIndexedAndSolutionsReached: false,
    maxScansToExplodeReached: false,
    winningPlan: {
      stage: 'LIMIT',
      limitAmount: 50,
      inputStage: {
        stage: 'FETCH',
        filter: {
          '$and': [
            { tags: { '$eq': 'profitability|profitable' } },
            { tags: { '$eq': 'currency-code|gbp' } },
            { tags: { '$eq': 'bids|no-bid' } },
            { tags: { '$eq': 'end-time|less-than-24-hours' } }
          ]
        },
        inputStage: {
          stage: 'IXSCAN',
          keyPattern: { 'sort.profit': Long("-1"), tags: Long("1") },
          indexName: 'sort.profit_-1_tags_1',
          isMultiKey: true,
          multiKeyPaths: { 'sort.profit': [], tags: [ 'tags' ] },
          isUnique: false,
          isSparse: false,
          isPartial: false,
          indexVersion: 2,
          direction: 'forward',
          indexBounds: {
            'sort.profit': [ '[MaxKey, MinKey]' ],
            tags: [ '[MinKey, MaxKey]' ]
          }
        }
      }
    },
    rejectedPlans: []
  },
  executionStats: {
    executionSuccess: true,
    nReturned: 50,
    executionTimeMillis: 28677,
    totalKeysExamined: 2671925,
    totalDocsExamined: 102010,
    executionStages: {
      stage: 'LIMIT',
      nReturned: 50,
      executionTimeMillisEstimate: 23897,
      works: 2671926,
      advanced: 50,
      needTime: 2671875,
      needYield: 0,
      saveState: 3855,
      restoreState: 3855,
      isEOF: 1,
      limitAmount: 50,
      inputStage: {
        stage: 'FETCH',
        filter: {
          '$and': [
            { tags: { '$eq': 'profitability|profitable' } },
            { tags: { '$eq': 'currency-code|gbp' } },
            { tags: { '$eq': 'bids|no-bid' } },
            { tags: { '$eq': 'end-time|less-than-24-hours' } }
          ]
        },
        nReturned: 50,
        executionTimeMillisEstimate: 23891,
        works: 2671925,
        advanced: 50,
        needTime: 2671875,
        needYield: 0,
        saveState: 3855,
        restoreState: 3855,
        isEOF: 0,
        docsExamined: 102010,
        alreadyHasObj: 0,
        inputStage: {
          stage: 'IXSCAN',
          nReturned: 102010,
          executionTimeMillisEstimate: 1104,
          works: 2671925,
          advanced: 102010,
          needTime: 2569915,
          needYield: 0,
          saveState: 3855,
          restoreState: 3855,
          isEOF: 0,
          keyPattern: { 'sort.profit': Long("-1"), tags: Long("1") },
          indexName: 'sort.profit_-1_tags_1',
          isMultiKey: true,
          multiKeyPaths: { 'sort.profit': [], tags: [ 'tags' ] },
          isUnique: false,
          isSparse: false,
          isPartial: false,
          indexVersion: 2,
          direction: 'forward',
          indexBounds: {
            'sort.profit': [ '[MaxKey, MinKey]' ],
            tags: [ '[MinKey, MaxKey]' ]
          },
          keysExamined: 2671925,
          seeks: 1,
          dupsTested: 2671925,
          dupsDropped: 2569915
        }
      }
    }
  },
  command: {
    find: 'marketplace.listings',
    filter: {
      tags: {
        '$all': [
          'profitability|profitable',
          'currency-code|gbp',
          'bids|no-bid',
          'end-time|less-than-24-hours'
        ]
      }
    },
    sort: { 'sort.profit': -1 },
    hint: 'sort.profit_-1_tags_1',
    limit: 50,
    '$db': 'pokeprice'
  },
  serverInfo: {},
  serverParameters: {
    internalQueryFacetBufferSizeBytes: 104857600,
    internalQueryFacetMaxOutputDocSizeBytes: 104857600,
    internalLookupStageIntermediateDocumentMaxSizeBytes: 104857600,
    internalDocumentSourceGroupMaxMemoryBytes: 104857600,
    internalQueryMaxBlockingSortMemoryUsageBytes: 104857600,
    internalQueryProhibitBlockingMergeOnMongoS: 0,
    internalQueryMaxAddToSetBytes: 104857600,
    internalDocumentSourceSetWindowFieldsMaxMemoryBytes: 104857600
  },
  ok: 1,
  '$clusterTime': {
    clusterTime: Timestamp({ t: 1657186270, i: 1 }),
    signature: {}
  },
  operationTime: Timestamp({ t: 1657186270, i: 1 })
}

It examines around 2.7 million keys using the sort-first index, whereas it examines about 2,000 keys with the tags-first index.


Following more investigation here, the slow query is not a find but a count.

I am running a find and a count in parallel using the same filter, but the count takes ~20 seconds and the find takes less than 2.

Is there a way to get an explain from a count query?

That’s surprising, because they are doing the same thing. Also, why count + find? Just run the find, collect the results in an array, and check the size of the array. Unless you are doing paginated queries, I don’t see the point of running both.
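
A minimal sketch of that idea (the filter here is a placeholder for your query):

const filter = { tags: { $all: [ /* your tags */ ] } }  // placeholder
const docs = db.coll.find(filter).toArray()
const total = docs.length  // no separate count round trip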

You can explain a count like this:

db.coll.explain(true).count({})
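
Applied to your filter, that would be something like:

db.getCollection("marketplace.listings").explain(true).count({
  tags: {
    $all: [
      "profitability|profitable",
      "currency-code|gbp",
      "bids|no-bid",
      "end-time|less-than-24-hours"
    ]
  }
})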

Finally, let’s get back to the initial issue. I asked for the “before” and “after” plans because I was almost certain there was already an issue with the initial index: MongoDB doesn’t use index intersection for $all queries, and your explain plan confirms it. It uses the index to resolve only the first equality predicate and then fetches the documents to resolve the three other checks. Sad, but I guess there are valid reasons for not doing an intersection; if you had 80 values in the $all, the intersection would probably waste time. It’s also mentioned here: https://www.mongodb.com/docs/manual/core/index-intersection/. Index intersection plans are “disfavored in plan selection”.

How can you solve this?

  1. Use Atlas Search. Lucene works differently for this kind of query and will most probably perform better here. It’s probably the best solution.
  2. Try a hack with the aggregation pipeline. The idea is to do something like this:

match TAG1
lookup ID, TAG2
if matched, lookup ID, TAG3
if matched, lookup ID, TAG4

Not ideal, but it might just work.

[
  {
    '$match': {
      'tags': 'C#'
    }
  }, {
    '$lookup': {
      'from': 'messages', 
      'let': {
        'id': '$_id', 
        'tags': '$tags'
      }, 
      'pipeline': [
        {
          '$match': {
            '$expr': {
              '$and': [
                {
                  '$eq': [
                    '$$id', '$_id'
                  ]
                }, {
                  '$in': [
                    'Java', '$$tags'
                  ]
                }
              ]
            }
          }
        }
      ], 
      'as': 'result'
    }
  }, {
    '$match': {
      '$expr': {
        '$eq': [
          {
            '$size': '$result'
          }, 1
        ]
      }
    }
  }, {
    '$replaceRoot': {
      'newRoot': {
        '$arrayElemAt': [
          '$result', 0
        ]
      }
    }
  }, {
    '$lookup': {
      'from': 'messages', 
      'let': {
        'id': '$_id', 
        'tags': '$tags'
      }, 
      'pipeline': [
        {
          '$match': {
            '$expr': {
              '$and': [
                {
                  '$eq': [
                    '$$id', '$_id'
                  ]
                }, {
                  '$in': [
                    'JS', '$$tags'
                  ]
                }
              ]
            }
          }
        }
      ], 
      'as': 'result'
    }
  }, {
    '$match': {
      '$expr': {
        '$eq': [
          {
            '$size': '$result'
          }, 1
        ]
      }
    }
  }, {
    '$replaceRoot': {
      'newRoot': {
        '$arrayElemAt': [
          '$result', 0
        ]
      }
    }
  }, {
    '$lookup': {
      'from': 'messages', 
      'let': {
        'id': '$_id', 
        'tags': '$tags'
      }, 
      'pipeline': [
        {
          '$match': {
            '$expr': {
              '$and': [
                {
                  '$eq': [
                    '$$id', '$_id'
                  ]
                }, {
                  '$in': [
                    'Go', '$$tags'
                  ]
                }
              ]
            }
          }
        }
      ], 
      'as': 'result'
    }
  }, {
    '$match': {
      '$expr': {
        '$eq': [
          {
            '$size': '$result'
          }, 1
        ]
      }
    }
  }, {
    '$replaceRoot': {
      'newRoot': {
        '$arrayElemAt': [
          '$result', 0
        ]
      }
    }
  }
]

I wrote a small Python script to generate some fake docs to try this pipeline:

from random import sample

from faker import Faker
from pymongo import ASCENDING
from pymongo import MongoClient

fake = Faker()


def random_tags():
    # Pick 4 distinct tags from a list of 10 programming languages.
    return sample(["Java", "JS", "Python", "C#", "Bash", "Clojure", "Swift", "C++", "R", "Go"], 4)


def random_messages():
    docs = []
    for _id in range(1, 10001):
        doc = {
            '_id': _id,
            'user_id': fake.pyint(min_value=1, max_value=100),
            'message': fake.sentence(nb_words=10),
            'tags': random_tags()
        }
        docs.append(doc)
    return docs


if __name__ == '__main__':
    client = MongoClient()
    db = client.get_database('test')
    messages = db.get_collection('messages')
    messages.drop()
    messages.insert_many(random_messages())
    print('Import done!')

    # Index the tags array, plus a compound index for the _id + tag lookup.
    messages.create_index('tags')
    messages.create_index([('_id', ASCENDING), ('tags', ASCENDING)])

I spent a few hours on this and I can’t get it to work. When I check the explain plan of the lookups, I don’t see an index being used:

    {
      '$lookup': {
        from: 'messages',
        as: 'result',
        let: { id: '$_id', tags: '$tags' },
        pipeline: [
          {
            '$match': {
              '$expr': {
                '$and': [
                  { '$eq': [ '$$id', '$_id' ] },
                  { '$in': [ 'Java', '$$tags' ] }
                ]
              }
            }
          }
        ]
      },
      totalDocsExamined: Long("40320000"),
      totalKeysExamined: Long("0"),
      collectionScans: Long("8064"),
      indexesUsed: [],
      nReturned: Long("4032"),
      executionTimeMillisEstimate: Long("16709")
    }

I also don’t understand why it says 40,320,000 docs examined when I only have 10k in my collection… I must be doing something wrong here, but I can’t put my finger on it.

I’ll keep digging and ask a few colleagues around me. But for sure, Atlas Search is the best / easiest option.

Cheers,
Maxime.


Sorry, yes, I should have explained the count situation in more detail.

I run two queries: one to find the first 50 documents matching a filter, and another to find the total count of matching documents in order to do pagination.

For the time being, I have been able to remove the count query and work around it.

I took your advice and tried a $search query for the same functionality; this is what I have found works:

PIPELINE
[
      {
        "$search": {
          "index": "idx_name",
          "compound": {
            "must": [
              { "phrase": { "path": "tags", "query": "profitability|profitable" }},
              { "phrase": { "path": "tags", "query": "end-time|less-than-24-hours"}},
              { "phrase": { "path": "tags", "query": "bids|no-bid" }},
              { "phrase": { "path": "tags", "query": "currency-code|gbp" }},
              { "near": {
                  "origin": 1000000,
                  "pivot": 1,
                  "score": {  "boost": {  "value": 1000000000 } },
                  "path": "sort.profit"
               }}
            ]
          }
        }
      },
      {
        "$skip": 0
      },
      {
        "$limit": 40
      }
    ]

and a search index like this:

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "sort": {
        "dynamic": true,
        "type": "document"
      },
      "tags": {
        "type": "string"
      }
    }
  }
}

This allows me to add phrase queries targeting all the tags I need, and pagination can then be carried out using follow-on pipeline stages.
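
If the total count is ever needed again for pagination, Atlas Search can also return it alongside the results. A rough sketch using the count option and $$SEARCH_META (worth verifying against your cluster version):

[
  {
    "$search": {
      "index": "idx_name",
      "count": { "type": "total" },
      "compound": { /* same must clauses as above */ }
    }
  },
  { "$skip": 0 },
  { "$limit": 40 },
  { "$addFields": { "totalCount": "$$SEARCH_META.count.total" } }
]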

The sorting is done using the near operator and boosting its score so that it is the most important factor in determining which documents come top of the list.
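
For context, near scores by distance from the origin, roughly like this (per the Atlas Search docs, before the boost is applied):

score = pivot / (pivot + |origin - sort.profit|)

With origin = 1000000 and pivot = 1, a higher profit means a smaller distance and therefore a higher score, so the large boost makes this term dominate the compound score and the results come back ordered by profit descending.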

This now performs the query I wanted in a range of about 200 ms to 2,000 ms.


20 sec => 200ms to 2s.

I call that a win :muscle:!
I hope it’s good enough, though. :confused:

Thanks for sharing the query!
ping @John_Page

