Execute a query in batches in aggregate pipeline for large no. of documents

Santhosh_R · February 17, 2022, 5:57am

Hello Folks,
I have a profile collection where each document will have a birthDate & weddingDatefield ,so i want to calculate birthdays & weddingdays coming in 14 days ,to do so I use the below query, this query find birthdays & weddings for all documents in the collection but can I do the same thing in batches.

Example:

Lets say if I have 1m documents in the profile collection and I want to calculate birthdays or weddingdays , then I want to do it in batches (1000 documents for each batch).
For each 1000 documents we calculate weddingdays & birthdays.
Until the total count.

Is caculating in the above method is a scalable one or doing it for all documents at once, and if there is any other pattern which is scalable it would be of great help.

Note:

Using this query in a cron task to send push notifications to users regarding the upcoming events.

db.collection.profiles.aggregate([
{
        $project: {
          _id: 0,
          fcmToken: 1,
          userNumber: 1,
          displayName: 1,
          todayDayOfYear: { $dayOfYear: new Date() }, // this returns the current day in year ex:if today is feb 10,2022 => 41
          birthDayOfYear: { $dayOfYear: "$birthDate" }, //this returns the day in year of user birthday,
          weddingDayOfYear: { $dayOfYear: "$weddingDate" }, //this returns the day in year of user weddingDay
        },
      },
      //Projecting profiles for calculating the daysTillBirthday or birtday in n days
      {
        $project: {
          fcmToken: 1,
          userNumber: 1,
          displayName: 1,
          daysTillBirthday: {
            $subtract: [
              {
                $add: [
                  "$birthDayOfYear",
                  {
                    $cond: [
                      { $lt: ["$birthDayOfYear", "$todayDayOfYear"] },
                      365,
                      0,
                    ],
                  },
                ],
              },
              "$todayDayOfYear",
            ],
          },
          daysTillWeddingDay: {
            $subtract: [
              {
                $add: [
                  "$weddingDayOfYear",
                  {
                    $cond: [
                      { $lt: ["$weddingDayOfYear", "$todayDayOfYear"] },
                      365,
                      0,
                    ],
                  },
                ],
              },
              "$weddingDayOfYear",
            ],
          },
        },
      },
      // Filtering the profiles which has birthdays in 14 days
      {
        $match: {
          $or: [
            { daysTillBirthday: { $lt: 15 } },
            { daysTillWeddingDay: { $lt: 15 } },
          ],
        },
      },
])

Jason_Tran · February 22, 2022, 10:45pm

Hi @Santhosh_R,

Welcome to the community

but can I do the same thing in batches.

Just to clarify, how would you be running this in batches? Additionally, would you be able to provide some sample documents that are used in the aggregation pipeline you’ve mentioned?

Is caculating in the above method is a scalable one or doing it for all documents at once, and if there is any other pattern which is scalable it would be of great help.

There are some more details on Aggregation Pipeline Optimization documentation but getting some of the sample docs may assist with coming up with alternative approaches.

Look forward to hearing from you.

Regards,
Jason

Santhosh_R · February 23, 2022, 3:07am

@Jason_Tran Thank you for your welcoming .

Below here I have attached some sample documents for the above profile collection.

Regarding the batches one, if the number of documents scaled to more, then will the above same query would be an optimized one or can I do the calculation part as batches.

Example:

Let the total documents in a collection be 50000.
I want to calculate the birthdays in way that take first 1000 documents calculating birthday or wedding day then notify the users.
Repeating step 2 until the total count.

I don’t know whether this method is an optimized one so only I am asking for suggestions, and also if there is any other way we could do this process in a scalable way also it would be helpful.

[{
  "_id": {
    "$oid": "62052fb90cf0b90034c66896"
  },
  "email": {
    "text": "",
    "private": true
  },
  "birthDay": {
    "private": false,
    "text": "1984-01-01"
  },
  "birthYear": {
    "private": true,
    "text": "1984"
  },
  "weddingDay": {
    "private": true
  },
  "weddingYear": {
    "private": true
  },
  "photoURL": "",
  "displayName": "Pravi test",
  "profession": "",
  "location": "",
  "firstName": "",
  "lastName": "",
  "phone1": "+919381062923",
  "fcmToken": "f18d8ef4862d7ae8cd6775",
  "locations": [],
  "userNumber": "+919381062923",
  "userID": "62052fb90cf0b90034c66894",
  "__v": 0,
  "acceptedTerms": true,
  "birthDate": {
    "$date": "1984-01-01T00:00:00Z"
  },
  "weddingDate": {
    "$date": "1999-01-01T00:00:00Z"
  }
},{
  "_id": {
    "$oid": "62052fff0cf0b90034c668a8"
  },
  "email": {
    "text": "",
    "private": true
  },
  "birthDay": {
    "private": false,
    "text": "1920-01-11"
  },
  "birthYear": {
    "private": true,
    "text": "1920"
  },
  "weddingDay": {
    "private": true
  },
  "weddingYear": {
    "private": true
  },  
  "photoURL": "",
  "displayName": "Test User 2",
  "profession": "Business",
  "location": "Goa",
  "firstName": "",
  "lastName": "",
  "phone1": "+917978568123",
  "fcmToken": "0b5457b0fd2bea64ff0dd7",
  "locations": [],
  "userNumber": "+917978568123",
  "userID": "62052fff0cf0b90034c668a6",
  "__v": 0,
  "acceptedTerms": true,
  "birthDate": {
    "$date": "1964-05-05T00:00:00Z"
  },
  "weddingDate": {
    "$date": "1984-08-20T00:00:00Z"
  }
  
},{
  "_id": {
    "$oid": "6205300c0cf0b90034c668b5"
  },
  "email": {
    "text": "",
    "private": true
  },
  "birthDay": {
    "private": false,
    "text": "1920-01-01"
  },
  "birthYear": {
    "private": true,
    "text": "1920"
  },
  "weddingDay": {
    "private": true
  },
  "weddingYear": {
    "private": true
  }, 
  "userName": "6ff737ab-91a7-474e-96b6-f73982494637",
  "photoURL": "",
  "displayName": "Realme Device",
  "profession": "",
  "location": "",
  "firstName": "",
  "lastName": "",
  "phone1": "+918280537765",
  "fcmToken": "6a94c2943e0c0c3dad1402",
  "locations": [],
  "userNumber": "+918280537765",
  "userID": "6205300c0cf0b90034c668b3",
  "__v": 0,
  "acceptedTerms": true,
  "birthDate": {
    "$date": "1984-05-05T00:00:00Z"
  },
  "weddingDate": {
    "$date": "1999-08-20T00:00:00Z"
  }
  
},{
  "_id": {
    "$oid": "620532ea0cf0b90034c66900"
  },
  "email": {
    "text": "",
    "private": true
  },
  "birthDay": {
    "private": false,
    "text": "1976-02-10"
  },
  "birthYear": {
    "private": true,
    "text": "1976"
  },
  "weddingDay": {
    "private": true
  },
  "weddingYear": {
    "private": true
  },  
  "userName": "6ff737ab-91a7-474e-96b6-f73982494637",
  "photoURL": "",
  "displayName": "Demo User 3",
  "profession": "",
  "location": "",
  "firstName": "",
  "lastName": "",
  "phone1": "+919849027123",
  "fcmToken": "5ec6720b539ccc9205a469",
  "locations": [],
  "userNumber": "+919849027123",
  "userID": "620532e90cf0b90034c668fe",
  "__v": 0,
  "acceptedTerms": true,
  "birthDate": {
    "$date": "1976-02-10T00:00:00Z"
  },
  "weddingDate": {
    "$date": "1984-08-20T00:00:00Z"
  }
},{
  "_id": {
    "$oid": "620536b20cf0b90034c66926"
  },
  "email": {
    "text": "",
    "private": true
  },
  "birthDay": {
    "private": false,
    "text": "2007-02-10"
  },
  "birthYear": {
    "private": true,
    "text": "2007"
  },
  "weddingDay": {
    "private": true
  },
  "weddingYear": {
    "private": true
  },
  "userName": "6ff737ab-91a7-474e-96b6-f73982494637",
  "photoURL": "",
  "displayName": "Demo User 5",
  "profession": "",
  "location": "",
  "firstName": "",
  "lastName": "",
  "phone1": "+917893031423",
  "fcmToken": "5a91c2cacd4255ca94962c",
  "locations": [],
  "userNumber": "+917893031423",
  "userID": "620536b20cf0b90034c66924",
  "__v": 0,
  "acceptedTerms": true,
  "birthDate": {
    "$date": "2007-02-10T00:00:00Z"
  },
  "weddingDate": null
},{
  "_id": {
    "$oid": "6205e8e10cf0b90034c669ed"
  },
  "email": {
    "text": "",
    "private": true
  },
  "birthDay": {
    "private": false,
    "text": "1920-01-01"
  },
  "birthYear": {
    "private": true,
    "text": "1920"
  },
  "weddingDay": {
    "private": true
  },
  "weddingYear": {
    "private": true
  },
  "userName": "6ff737ab-91a7-474e-96b6-f73982494637",
  "photoURL": "",
  "displayName": "Demo user 8",
  "profession": "",
  "location": "",
  "firstName": "",
  "lastName": "",
  "phone1": "+919398470744",
  "fcmToken": "c0c6275300593d326df7ae",
  "locations": [],
  "userNumber": "+919398470744",
  "userID": "6205e8e10cf0b90034c669eb",
  "__v": 0,
  "acceptedTerms": true,
  "birthDate": {
    "$date": "2007-02-10T00:00:00Z"
  },
  "weddingDate": null
  
}]

Jason_Tran · February 28, 2022, 5:27am

Hi @Santhosh_R,

Thank you for providing the sample documents that you’re working with.

Regarding the batches one, if the number of documents scaled to more, then will the above same query would be an optimized one or can I do the calculation part as batches.

I do not believe that running the aggregation in batches would help in terms of scalability or performance. The reason being is that the aggregation you’ve mentioned would require collection scan each time it is run, i.e. scan every document in a collection, to select those documents that match the query statement. It cannot make use of any indexes as the $match stage requires values that are computed from the $project stages.

I don’t know whether this method is an optimized one so only I am asking for suggestions, and also if there is any other way we could do this process in a scalable way also it would be helpful.

In terms of scalability, you may wish to alter the schema so that you can utilise indexes. One particular method may be to:

Update the current documents by adding birthDayOfYear and weddingDayOfYear fields which should be static. e.g. A birthday or wedding anniversary that is on the 5th of January will always have a birthDayOfYear or weddingDayOfYear of 5:
Example:

{
"_id": {
    "$oid": "62052fb90cf0b90034c66896"
  },
  "email": {
    "text": "",
    "private": true
  },
  "birthDate": {
    "$date": "1980-01-05T00:00:00Z"
  },
  "weddingDate": {
    "$date": "2005-01-10T00:00:00Z"
  },
  "weddingDayOfYear": 10, /// <--- new field for wedding `$dayOfYear` value in original document
  "birthDayOfYear": 5 /// <---- new field for birthday `$dayOfYear` value in original document
}

Please note, I have not accounted for how birthdays / wedding days on the 29th of February or leap years would be handled.

Create 2 seperate indexes, 1 on each of the new fields weddingDayOfYear and birthDayOfYear:

db> db.collection.createIndex({"birthDayOfYear":1})
birthDayOfYear_1

db> db.collection.createIndex({"weddingDayOfYear":1})
weddingDayOfYear_1

Pre-calculate the current $dayOfYear year value as an int. There are packages like moment.js which can assist with this. In the below step, we will be using today’s day of year value as 8 which will be assigned to a variable called nowDayOfYear:

db> nowDayOfYear = 8
8

Perform the following aggregation to get all documents where the birthDayOfYear OR weddingDayOfYear are within 14 days of noDayOfYear

db> db.collection.aggregate(
{
  '$match': {
    '$or': [
      { weddingDayOfYear: { '$gte': nowDayOfYear, '$lte': (nowDayOfYear+14) } },
      { birthDayOfYear: { '$gte': nowDayOfYear, '$lte': (nowDayOfYear+14) } }
    ]
  }
})

The above possible alternative method would make use of indexes as the $match stage utilising the $or expression has both the birthDayOfYear and weddingDayOfYear indexed. You can view further information regarding this behaviour here.

Of course which ever method you choose to go by will ultimately depend on what best suits your use case. This includes factors such as how often the aggregation is run, how large the collection is, how large it’s expected to grow by, etc.

However, for demonstration purposes, I have a test collection containing 50000 documents with the following structure:

[
  {
    _id: ObjectId("621c2f1081e09c0081f99d61"),
    birthDate: ISODate("1988-03-25T02:06:00.210Z"),
    weddingDate: ISODate("1985-09-27T20:40:33.290Z"),
    birthDayOfYear: 331,
    weddingDayOfYear: 87
  },
  {
    _id: ObjectId("621c2f1081e09c0081f99d62"),
    birthDate: ISODate("2020-03-06T15:08:30.477Z"),
    weddingDate: ISODate("1980-08-02T11:49:46.589Z"),
    birthDayOfYear: 110,
    weddingDayOfYear: 87
  },
  {
    _id: ObjectId("621c2f1081e09c0081f99d63"),
    birthDate: ISODate("2015-01-09T15:25:32.513Z"),
    weddingDate: ISODate("2022-07-06T05:13:32.521Z"),
    birthDayOfYear: 237,
    weddingDayOfYear: 76
  }
]

Please note that the birthDayOfYear and weddingDayOfYear values are not matching the birthDate and weddingDate as this is just a an example for demonstration purposes

I now execute the aggregation mentioned above with the .explain(“executionStats”) method:

db>db.collection.aggregate(
{
   '$match': {
     '$or': [
       { weddingDayOfYear: { '$gte': nowDayOfYear, '$lte': (nowDayOfYear+14) } },
       { birthDayOfYear: { '$gte': nowDayOfYear, '$lte': (nowDayOfYear+14) } }
     ]
   }
}).explain("executionStats")

You can see the output shows that the indexes are both used and the totalDocsExamined to nReturned ratio is 1:1 without needing to scan all 50,000 documents:

executionStats: {
    executionSuccess: true,
    nReturned: 3978,
    executionTimeMillis: 14,
    totalKeysExamined: 4073,
    totalDocsExamined: 3978

Running the .explain(“executionStats”) on the aggregation you provided against the same test collection we can see that although the ratio is still 1:1, a collection scan was required to be done, scanning all 50000 documents in the collection (50,000 returned as the daysTillWeddingDay field projection in your example has a value of 0 for all 50,000 documents in my test data) :

 executionStats: {
          executionSuccess: true,
          nReturned: 50000,
          executionTimeMillis: 241,
          totalKeysExamined: 0,
          totalDocsExamined: 50000

In additon to the explain(“executionStats”) page linked above, you may also find the Analyze Query Performance documentation useful.

I would also like to note that $dateAdd could possibly suit your use case depending on your MongoDB version as it is new in 5.0.

Lastly, I would suggest not using $project except until the end of the pipeline. If you would like to create a field, please use $addFields instead as this can possibly help in reducing the amount of data passing through the pipeline. Please view the Aggregation Pipeline Optimization documentation for more information regarding this.

A bit off the exact scenario mentioned but if precision on the use cases is not required to such a degree, i.e. the 14 days, you may also want to think about perhaps changing the search based off a whole month. This would simplify the queries and additional work to account for leap years (29th of Feb, 366 days, etc). There are a few operators in 5.0 which could possibly be useful if the requirement was cut to birthday months / wedding months rather than 14 days from the exact date:

As always, it’s highly recommended to test various approaches on a test environment to see if it meets your requirements beforehand.

Hope this helps.

Regards,
Jason

Santhosh_R · February 28, 2022, 5:55am

Hey @Jason_Tran ,
Thank you for your reply, it was very helpful.

system · March 5, 2022, 5:55am

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.