Query optimization for a collection with over 10 million records

I have a collection with over 10 million records. I need to match on a particular field and get the distinct _ids of the resulting record set.

After the $match stage, the result set contains fewer than 5 million documents. If I then group by ID to get the unique IDs, the execution time on my local environment is over 20 seconds.

db.getCollection('viewscounts').aggregate([
    {
        // Keep only the documents for this module (~5M of the 10M+ records).
        $match: {
            MODULE_ID: 4,
        }
    },
    {
        // Emit one document per distinct ITEM_ID.
        $group: {
            _id: '$ITEM_ID',
        }
    }
], { allowDiskUse: true })
  • If I remove either the $match or the $group stage and run a single-stage pipeline, the execution time is under 0.1 seconds.
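
To see where the time actually goes, the plan can be inspected with explain; a minimal sketch, assuming a reasonably recent mongo shell:

// Sketch: explain('executionStats') reports per-stage timing and the
// number of documents examined, which shows which stage dominates.
db.getCollection('viewscounts').explain('executionStats').aggregate([
    { $match: { MODULE_ID: 4 } },
    { $group: { _id: '$ITEM_ID' } }
], { allowDiskUse: true })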

I’m okay with limiting the number of _ids returned, but they must be unique.

Can anyone suggest a better way to get the results faster?

@Praveen_Ramkumar, can you provide an example document from the ‘viewscounts’ collection?

Hi @slava, thanks for your interest in helping me.

Please find below an example:

{
    "_id" : ObjectId("5ec9b916899b6c0013d84826"),
    "MODULE_ID" : 4,
    "USER_ID" : ObjectId("5c5b2a0a12df970a22fc0467"),
    "ITEM_ID" : ObjectId("5d622b9492463900134b5b82"),
    "COMPANY_ID" : ObjectId("5d1260206c8ca32c10fac59a"),
    "IP_ADDRESS" : "::ffff:10.255.0.2",
    "TYPE" : 3,
    "LOCATION" : 1,
    "VIEWED_DATE" : ISODate("2020-05-24T00:00:22.702Z"),
    "ELEMENT_ID" : 0,
    "UTM_VARIABLES" : null,
    "__v" : 0
}

You can get the unique IDs with an aggregation pipeline, like this:

db.getCollection('viewscounts').aggregate([
    {
        $match: {
            MODULE_ID: 4,
        }
    },
    {
        // Collect every distinct ITEM_ID into a single array
        // on a single result document.
        $group: {
            _id: null,
            UNIQUE_IDS: {
                $addToSet: '$ITEM_ID',
            }
        }
    }
], { allowDiskUse: true })
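
One caveat with the $addToSet approach: with close to 5 million unique ObjectIds (12 bytes each), the single result document could exceed the 16 MB BSON document limit. Grouping by the field itself (one output document per ID) avoids that, and since a capped result was mentioned as acceptable above, a $limit stage keeps the output small; a sketch, with a hypothetical cap of 1000:

db.getCollection('viewscounts').aggregate([
    { $match: { MODULE_ID: 4 } },
    { $group: { _id: '$ITEM_ID' } },  // one document per unique ITEM_ID
    { $limit: 1000 }                  // hypothetical cap; adjust as needed
], { allowDiskUse: true })

Note that $group is a blocking stage, so the $limit here bounds the size of the result rather than the grouping work itself.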

You can also achieve the same with the distinct collection command:

db.getCollection('viewscounts').distinct('ITEM_ID', { MODULE_ID: 4 })

Don’t forget to add an index on the ‘MODULE_ID’ field, so both of the above approaches will run faster :wink:

db.getCollection('viewscounts').createIndex({ MODULE_ID: 1 })
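
If most of the time is spent in the $group stage, a compound index covering both fields may help even more, since the server can then read the distinct values straight from the index instead of fetching millions of full documents (for the distinct command this shows up as a DISTINCT_SCAN in the explain output); a sketch:

// Sketch: a compound index on the filter field plus the distinct field
// lets the query be answered from the index alone.
db.getCollection('viewscounts').createIndex({ MODULE_ID: 1, ITEM_ID: 1 })

// With this index in place, distinct can use a DISTINCT_SCAN plan:
db.getCollection('viewscounts').distinct('ITEM_ID', { MODULE_ID: 4 })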

I’ve already tried the above approaches and they take around 14 seconds in my local environment; the execution time roughly doubles when I run the query on my hosted production DB. Having only $match or only $group executes in less than 1 ms, but combining both stages increases the execution time. What could be the reason? My expectation was that, since the first $match already reduces the dataset, the $group should run even faster.

I have the same problem with $match and $group.
I think it is slow because the aggregation pipeline works like a process pipe.

  • The documents that are output from one stage are passed to the next stage.
    So $match feeds roughly 5 million records (full documents) into the $group stage; that is why combining $match and $group is slow.
    P.S.: a single $match only returns a limited batch of records to the client, which is why it appears to execute in less than 1 ms.
    https://www.mongodb.com/docs/manual/core/aggregation-pipeline/
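
One way to check that last point: the shell only fetches the first batch of a cursor by default, so a bare $match looks instant. Forcing full iteration shows the real cost; a sketch:

// itcount() iterates the whole cursor and returns the document count,
// so this reflects the true cost of the $match over ~5M records.
db.getCollection('viewscounts').aggregate([
    { $match: { MODULE_ID: 4 } }
]).itcount()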