A fast way to count distinct values of a field in a large MongoDB collection

I want a faster way to count the distinct values of a field in a MongoDB collection that contains more than 10 million documents; my schema looks like this:

[
  {
    //-- type and date
    "date": "2023-11-09",
    "type": "my_type",
    "resource": "2222"
  },
  {
    //-- type and date
    "date": "2023-11-09",
    "type": "my_type",
    "resource": "1111"
  },
  {
    //-- type and date
    "date": "2023-11-09",
    "type": "my_type",
    "resource": "1111"
  }
]

I want a very performant query that can count the distinct resources over my collection.

Note:

  • this can’t be pre-calculated!
  • I can add an index on resource

An index that is resource or has resource as a prefix will allow for a DISTINCT_SCAN.
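A quick way to confirm the planner actually chooses a DISTINCT_SCAN is to run the distinct command through explain (a sketch; the database, collection, and field names are the ones from the question):

```javascript
// Create the index, then ask the planner how it would execute distinct
db.getSiblingDB('test').collection.createIndex({ resource: 1 });
db.getSiblingDB('test').runCommand({
  explain: { distinct: "collection", key: "resource" },
  verbosity: "queryPlanner"
});
// Look for a "DISTINCT_SCAN" stage under queryPlanner.winningPlan
```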

mongosh: db.getSiblingDB('test').collection.distinct('resource')
nodejs: https://mongodb.github.io/node-mongodb-native/6.3/classes/Collection.html#distinct
python: https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.distinct
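For reference, the Node.js driver call linked above would look roughly like this (a sketch; the connection string, database, and collection names are placeholders):

```javascript
// Sketch using the official "mongodb" Node.js driver; URI is a placeholder
const { MongoClient } = require("mongodb");

async function countDistinctResources() {
  const client = new MongoClient("mongodb://localhost:27017");
  try {
    await client.connect();
    // distinct() returns an array of the unique values
    const values = await client.db("test").collection("collection")
      .distinct("resource");
    return values.length; // number of distinct resources
  } finally {
    await client.close();
  }
}
```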

I have a large collection, so the .distinct() method ends with the error:
MongoServerError: distinct too big, 16mb cap

Okay then this will have to be an aggregation:

db.getSiblingDB('test').collection.aggregate([
  {
    "$group": {
      "_id": "$resource"
    }
  }
])

Yes, but it takes more than 10 seconds to get the result, which is not acceptable.

Just to build off what Chris wrote, you should be able to do this as a covered query.

For example:

db.foo.drop();
db.foo.insertMany([
  {
    //-- type and date
    "date": "2023-11-09",
    "type": "my_type",
    "resource": "2222"
  },
  {
    //-- type and date
    "date": "2023-11-09",
    "type": "my_type",
    "resource": "1111"
  },
  {
    //-- type and date
    "date": "2023-11-09",
    "type": "my_type",
    "resource": "1111"
  }
])
db.foo.createIndex({ resource: 1 });
db.foo.aggregate([
  { $sort: { resource: 1 } },
  { $group: { 
    _id: "$resource",
    count: { $sum: 1 }
  }}
])

This should efficiently use the index and prevent any documents from being fetched from disk to satisfy the distinct count requirements you’ve outlined.
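One way to verify the plan is covered is to run the same pipeline through explain and check that no documents were examined (a sketch, using the same db.foo example as above):

```javascript
// A covered plan reads only index keys; totalDocsExamined should be 0
db.foo.explain("executionStats").aggregate([
  { $sort: { resource: 1 } },
  { $group: {
    _id: "$resource",
    count: { $sum: 1 }
  }}
])
// Inspect executionStats.totalDocsExamined in the output
```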

How long this operation takes will depend on a number of factors (cluster size, available memory, resource usage, etc.).
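If only the total number of distinct resources is needed, rather than one document per value, a $count stage can be appended to the same pipeline (a sketch, using the db.foo example above):

```javascript
// Reduce the grouped values to a single count document
db.foo.aggregate([
  { $sort: { resource: 1 } },
  { $group: { _id: "$resource" } },
  { $count: "distinctResources" }
])
// For the three sample documents this yields a count of 2
// ("1111" and "2222")
```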
