Exists query with index very slow

Karl_Banke · June 4, 2020, 12:16pm

We are using a fairly trivial query to count data in a collection where a specific nested field is set or not set.

db[‘collname’].find({“consumer.consumed”:{$exists: true}}).count()

As simple as it comes. In order to be quick, we have created an index. The MongoDB profiler and the explain plan show that the index should be used. The query as such should be as simple as skimming over the existing index and return a docment number. However the explain shows a large number of yields that I completely fail to understand and execution quite often takes several seconds.

{
  "appName": "MongoDB Shell",
  "command": {
    "count": "collname",
    "query": {
      "consumer.consumed": {
        "$exists": true
      }
    },
    "fields": {},
    "$clusterTime": {
      "clusterTime": {
        "$timestamp": {
          "t": 1591257092,
          "i": 1
        }
      },
      "signature": {
        "hash": {
          "$binary": "PHh4eHh4eD4=",
          "$type": "00"
        },
        "keyId": {
          "$numberLong": "6790010151542718465"
        }
      }
    },
    "$db": "dpa-id-hub"
  },
  "planSummary": [
    {
      "IXSCAN": {
        "consumer.consumed": 1
      }
    }
  ],
  "keysExamined": 316394,
  "docsExamined": 316394,
  "numYields": 2473,
  "reslen": 170,
  "locks": {
    "Global": {
      "acquireCount": {
        "r": 2474
      }
    },
    "Database": {
      "acquireCount": {
        "r": 2474
      }
    },
    "Collection": {
      "acquireCount": {
        "r": 2474
      }
    }
  },
  "storage": {
    "data": {
      "bytesRead": 10726149,
      "timeReadingMicros": 42554
    }
  },
  "protocol": "op_msg",
  "millis": 709
}

kevinadi · June 9, 2020, 3:14am

Hi,

Looking at these lines:

  "keysExamined": 316394,
  "docsExamined": 316394,
  "numYields": 2473,

seems to indicate that it scans the index (keysExamined), but also must confirm the query condition by loading the documents from disk (docsExamined). This loading is apparently expensive for the disk, since it yields a lot, meaning that MongoDB spent a lot of time just waiting for disk.

I did a quick test using MongoDB 4.2.7 and saw a similar output, where it needs to load the documents from disk to examine them:

> db.test.explain('executionStats').find({a:{$exists:true}}).count()
...
		"totalKeysExamined" : 10,
		"totalDocsExamined" : 10,
...

However, changing the query a little to make it a covered query seem to improve things a bit (note the projection of {_id:0,a:1} to make it covered query):

> db.test.explain('executionStats').find({a:{$gt:MinKey}}, {_id:0,a:1}).count()
...
		"totalKeysExamined" : 11,
		"totalDocsExamined" : 0,
...

so it did the count by scanning the index only. Let me know if a similar method works with your query.

Best regards,
Kevin

Karl_Banke · June 9, 2020, 3:10pm

I tried your suggestion. However, when I issue

db[‘dbname’].find({‘consumer.consumed’:{$gt:MinKey}}, {_id:0,‘consumer.consumed’:1}).count()

I always get back the total number of elements in the collection rather than the number of elements where the queried field is set.

kevinadi · June 9, 2020, 11:08pm

Hi Karl,

I’m not sure I understand. Could you post some example documents, and what the query is supposed to return? Maybe there’s something I’m missing here.

Best regards,
Kevin

Karl_Banke · June 10, 2020, 9:04am

Well the query is supposed to return the number of documents where the (nested) field consumer.consumed exists. There is a regular index on the field (not sparse). From what I understand, the index will have an entry for every field value, where the entry is null when the field is either null or not set. So, all the query would need to do is to count the keys in the index that are not null. I understand that there would be a need to look at the documents for the negated query (not exists), since the field could be either null or not there at all. So consider three documents:
{consumer: {name: ‘karl’, consumed: ‘2019-10-10’}}
{consumer: {name: ‘kevin’, consumed: null}}
{consumer: {name: ‘iceman’}}
In which case I would assume the query to return 2.

kevinadi · June 11, 2020, 6:08am

Hi Karl,

I see what you mean. Well, I have a “hack” that may work with your specific use case. You might be able to use sparse indexes to achieve a quick count. Using the example you provided, I created a sparse index on consumer.consumed:

> db.test.find()
{ "_id" : 0, "consumer" : { "name" : "karl", "consumed" : "2019-10-10" } }
{ "_id" : 1, "consumer" : { "name" : "kevin", "consumed" : null } }
{ "_id" : 2, "consumer" : { "name" : "iceman" } }

> db.test.createIndex({'consumer.consumed':1}, {sparse:true})

A feature of a sparse index is that it doesn’t create an index key for documents that doesn’t have the indicated field. This can be verified by doing a find() by hint():

> db.test.find().hint({'consumer.consumed':1})
{ "_id" : 1, "consumer" : { "name" : "kevin", "consumed" : null } }
{ "_id" : 0, "consumer" : { "name" : "karl", "consumed" : "2019-10-10" } }

Note that the third document is missing here.

For most other queries, MongoDB knows that the sparse index does not cover the whole collection, and would avoid using it if it thinks that it can return the wrong result. That is, unless you force it to use the index by hint(), which can work to your advantage in this specific case.

Since the sparse index doesn’t include the 3rd document, for your count query to return the correct count, you just have to hint() it, and provide an empty parameter for find(), since that {$exists:true} parameter is already implicit in the index itself:

> db.test.find().hint({'consumer.consumed':1}).count()
2

and since you’re forcing it to use that index, it doesn’t load the documents from disk:

> db.test.explain('executionStats').find().hint({'consumer.consumed':1}).count()
...
		"totalKeysExamined" : 3,
		"totalDocsExamined" : 0,
...

Please have a read though Sparse Indexes and its superset Partial Indexes (which is a more flexible version of sparse indexes) for more details.

However, a caveat worth repeating is that since sparse indexes don’t cover the whole collection, other queries may behave differently, e.g. some queries that should use the index may not use the index and end up being a collection scan instead. This is explained in the linked pages above.

Best regards,
Kevin

Karl_Banke · June 15, 2020, 1:16pm

Hi Kevin,

thanks a lot. I tried this and it seems to work alright from the console. As you said, the caveat is that this approach will not work well when we would like to update items based on some where condition on the field but for the time being, this will likely take away the pain. If I manage to translate it to Java properly ;-).

Thank you very much, Karl

Karl_Banke · June 15, 2020, 2:07pm

I now made the changes and the overall behavior is much more predictable. Yet I see that the number of yields is still high, even though no documents are inspected.

"keysExamined": 318501,
"docsExamined": 0,
"cursorExhausted": 1,
"numYields": 2488,
"nreturned": 1,
"reslen": 262,
"locks": {
    "Global": {
      "acquireCount": {
        "r": 2490
      }
    },
    "Database": {
      "acquireCount": {
        "r": 2490
      }
    },
    "Collection": {
      "acquireCount": {
        "r": 2490
      }
    }

kevinadi · June 16, 2020, 6:05am

Hi Karl,

The high number of yields implies that the server needs to fetch data from disk frequently. In typical cases, this means that either the disk is too slow, or the working set (i.e. most frequently accessed data/indexes) are bigger than the available RAM.

This usually means that it’s time to upgrade your hardware

You can check if this is the case by specifying the count() using some parameter, so that it will count only a subset of the data. If you find that the number of yields are drastically lower, this is a good sign that more hardware is needed.

Best regards,
Kevin