A fast way to count distinct values of a field in a large MongoDB collection

I want a faster way to count the distinct values of a field in a MongoDB collection that contains more than 10 million documents; my schema looks like this:

[
  {
    //-- type and date
    "date": "2023-11-09",
    "type": "my_type",
    "resource": "2222"
  },
  {
    //-- type and date
    "date": "2023-11-09",
    "type": "my_type",
    "resource": "1111"
  },
  {
    //-- type and date
    "date": "2023-11-09",
    "type": "my_type",
    "resource": "1111"
  }
]

I want a very performant query that can count the distinct resources over my collection.

Note:

  • this can’t be pre-calculated!
  • I can add an index on resource

An index that is resource or has resource as a prefix will allow for a DISTINCT_SCAN.
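A quick way to confirm the planner actually chooses a DISTINCT_SCAN is to run the distinct command through explain (a sketch; the database, collection, and field names are the ones from the question):

```javascript
// Create the index, then ask the planner how it would execute distinct
db.getSiblingDB('test').collection.createIndex({ resource: 1 });
db.getSiblingDB('test').runCommand({
  explain: { distinct: "collection", key: "resource" },
  verbosity: "queryPlanner"
});
// Look for a "DISTINCT_SCAN" stage under queryPlanner.winningPlan
```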

mongosh: db.getSiblingDB('test').collection.distinct('resource')
nodejs: https://mongodb.github.io/node-mongodb-native/6.3/classes/Collection.html#distinct
python: https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.distinct
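For reference, the Node.js driver call linked above would look roughly like this (a sketch; the connection string, database, and collection names are placeholders):

```javascript
// Sketch using the official "mongodb" Node.js driver; URI is a placeholder
const { MongoClient } = require("mongodb");

async function countDistinctResources() {
  const client = new MongoClient("mongodb://localhost:27017");
  try {
    await client.connect();
    // distinct() returns an array of the unique values
    const values = await client.db("test").collection("collection")
      .distinct("resource");
    return values.length; // number of distinct resources
  } finally {
    await client.close();
  }
}
```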

I have a large collection, so the .distinct() method ends with the error:
MongoServerError: distinct too big, 16mb cap

Okay then this will have to be an aggregation:

db.getSiblingDB('test').collection.aggregate([
  {
    "$group": {
      "_id": "$resource"
    }
  }
])

Yes, but it takes more than 10 seconds to get the result, which is not acceptable.

Just to build off what Chris wrote, you should be able to do this as a covered query.

For example:

db.foo.drop();
db.foo.insertMany([
  {
    //-- type and date
    "date": "2023-11-09",
    "type": "my_type",
    "resource": "2222"
  },
  {
    //-- type and date
    "date": "2023-11-09",
    "type": "my_type",
    "resource": "1111"
  },
  {
    //-- type and date
    "date": "2023-11-09",
    "type": "my_type",
    "resource": "1111"
  }
])
db.foo.createIndex({ resource: 1 });
db.foo.aggregate([
  { $sort: { resource: 1 } },
  { $group: { 
    _id: "$resource",
    count: { $sum: 1 }
  }}
])

This should efficiently use the index and prevent any documents from being fetched from disk to satisfy the distinct count requirements you’ve outlined.
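One way to verify the plan is covered is to run the same pipeline through explain and check that no documents were examined (a sketch, using the same db.foo example as above):

```javascript
// A covered plan reads only index keys; totalDocsExamined should be 0
db.foo.explain("executionStats").aggregate([
  { $sort: { resource: 1 } },
  { $group: {
    _id: "$resource",
    count: { $sum: 1 }
  }}
])
// Inspect executionStats.totalDocsExamined in the output
```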

How long this operation takes will depend on a number of factors (cluster size, available memory, resource usage, etc.).
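If only the total number of distinct resources is needed, rather than one document per value, a $count stage can be appended to the same pipeline (a sketch, using the db.foo example above):

```javascript
// Reduce the grouped values to a single count document
db.foo.aggregate([
  { $sort: { resource: 1 } },
  { $group: { _id: "$resource" } },
  { $count: "distinctResources" }
])
// For the three sample documents this yields a count of 2
// ("1111" and "2222")
```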
