Get data from a collection with millions of records that also contains duplicate data, using limit() and skip()

Hi,
I have a task to get data from a collection which has millions of records, and I need to support pagination. The issue is that the collection also contains duplicate data, and we don't want duplicates in the result. I can use $group, but that makes the query very slow. For example, if the limit is 2000, then using $group after $limit always returns fewer than 2000 results; and if I use $group before $limit, the query groups all the data in the collection before returning the first 2000 records. Since the collection currently holds more than a million documents, that group operation is very slow. Can anyone help?
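For illustration, the limit-first variant I tried looks roughly like this (a sketch using the same fields as my real pipeline below):

db.Collection_A.aggregate([
{
    $match: {
        DELETED: { $ne: 'N' }
    }
}, {
    $skip: 0
}, {
    $limit: 2000
}, {
    // grouping runs on only the 2000 fetched documents, so duplicates
    // collapse and the page comes back with fewer than 2000 results
    $group: {
        _id: {
            userid: '$userid',
            email: '$email'
        },
        distinct: { $first: '$$ROOT' }
    }
}, {
    $replaceRoot: { newRoot: '$distinct' }
}
])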

For example:
Collection_A
All fields:
_id | firstname | lastname | userid | email | purchaseOrder | purchaseUnit | contact number | …

Data example

  • _id | john | doe | johndoe | johndoe@gmail.com | 12345 | 78 | +1 (123)- (123) - 7896 | …

  • _id | john | doe | johndoe | johndoe@gmail.com | 89076 | 800 | +1 (123)- (123) - 7896 | …

So in this collection one user can have multiple purchase orders, and I want each user's details to appear only once… The collection has more than a million documents.

So currently I am using this aggregation:

db.Collection_A.aggregate([
{
    $match: {
        DELETED: { $ne: 'N' }
    }
}, {
    $group: {
        // one group per user, keyed on userid + email
        _id: {
            userid: '$userid',
            email: '$email'
        },
        // keep the first full document seen for each user
        distinct: { $first: '$$ROOT' }
    }
}, {
    $replaceRoot: { newRoot: '$distinct' }
}, {
    $skip: 0
}, {
    $limit: 100
}
])

This query is very slow as I have millions of records, and I need to support pagination.

Hi @Aggregation_need_help,

Yes, this sounds like a suboptimal approach.

I suggest creating a materialized view of the distinct values by running $merge or $out periodically. This will let you index and query an already-distinct set of values.

In this materialized view you can also remove deleted users. We also recommend using _id in the created collection for pagination.

This is a better way than skip and limit.

Thanks
Pavel

Hi @Pavel_Duchovny, I am new to this topic. Could you help me with the query?

Hi @Aggregation_need_help,

First create the new collection by running:

db.Collection_A.aggregate([
{
    $match: {
        DELETED: { $ne: 'N' }
    }
}, {
    $group: {
        _id: {
            userid: '$userid',
            email: '$email'
        },
        distinct: { $first: '$$ROOT' }
    }
}, {
    $replaceRoot: { newRoot: '$distinct' }
}, {
    // write the deduplicated documents to a new collection
    $out: 'distinct_Collection_A'
}
])
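If you rerun this periodically to keep the view fresh, a $merge stage can upsert into the existing collection instead of replacing it wholesale. A sketch (on: '_id' with whenMatched: 'replace' is one reasonable choice, since _id carries over from the source documents):

db.Collection_A.aggregate([
{
    $match: {
        DELETED: { $ne: 'N' }
    }
}, {
    $group: {
        _id: {
            userid: '$userid',
            email: '$email'
        },
        distinct: { $first: '$$ROOT' }
    }
}, {
    $replaceRoot: { newRoot: '$distinct' }
}, {
    $merge: {
        into: 'distinct_Collection_A',
        on: '_id',                // match on the original document _id
        whenMatched: 'replace',   // refresh users already in the view
        whenNotMatched: 'insert'  // add newly seen users
    }
}
])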

Now you can fetch pages of documents, using the last _id of each page as the starting point for the next:

db.distinct_Collection_A.find({}).sort({_id : 1}).limit(100);
db.distinct_Collection_A.find({_id : {$gt : <LAST_ID_ABOVE>}}).sort({_id : 1}).limit(100);
...
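Putting it together, a paging loop in mongosh could look roughly like this (lastId is just a local variable standing in for <LAST_ID_ABOVE>):

let lastId = null;
while (true) {
    // the first page has no filter; later pages start after the last _id seen
    const filter = lastId === null ? {} : { _id: { $gt: lastId } };
    const page = db.distinct_Collection_A.find(filter).sort({ _id: 1 }).limit(100).toArray();
    if (page.length === 0) break;        // no more documents
    // ... process the page here ...
    lastId = page[page.length - 1]._id;  // remember where this page ended
}

Each page is a range scan on the default _id index, so it stays fast however deep you page, unlike skip.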

Thanks
Pavel

