I have a data set of dates and I’d like to check whether a user-input date clashes with an existing one. So far this has been my approach:
for await (const doc of collection.find(query)) {
    if (check_clash(doc.date, user_date))
        break;
}
but I have 2 concerns:

1. Will this work even for a larger data set?
2. Is there any way to do this in batches? I’m guessing that fetching 1 doc at a time incurs some overhead per fetch, whereas if we took e.g. 20 docs at a time, compared them, and then took the next batch, it should take less time overall since there are fewer calls to the database.
Maybe pseudocode for what I want would look something like this:
await collection.find(query).getInBatches(20, (docs) => {
    for (const d of docs)
        if (check_clash(d.date, user_date))
            return false; // break out
});
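To make the intent concrete, here is a rough sketch of what I mean, written against the Node.js driver as far as I understand it (check_clash and user_date are the same placeholders as above, and I'm assuming the cursor's batchSize option controls how many docs come back per round trip):

// Sketch only: gather documents from the cursor into chunks of `batch_size`
// and compare a whole chunk before moving on to the next one.
async function has_clash(collection, query, user_date, batch_size = 20) {
    // batchSize should make the driver pull `batch_size` docs per round trip
    const cursor = collection.find(query).batchSize(batch_size);
    let batch = [];

    for await (const doc of cursor) {
        batch.push(doc);
        if (batch.length === batch_size) {
            if (batch.some((d) => check_clash(d.date, user_date)))
                return true; // clash found, stop early
            batch = [];
        }
    }
    // check whatever is left over in the final, smaller batch
    return batch.some((d) => check_clash(d.date, user_date));
}

Is something along these lines the right direction, or does the driver already do this under the hood?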
P.S. I tried the approach in this question: Batching data with find, but a solution there pointed out:
It is never a good idea to use skip in mongo queries
Depending on the rule of check_clash you may be able to implement this as an aggregation pipeline and run it server side.
In regards to the point about .skip, the alternative is to sort your data (make sure there is a supporting index) and then keep track of the last item processed; when you resume, you just fetch the items greater than the last one found.
So you could use the _id field as the primary key for this and do something like this (pseudocode):
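If I understand that suggestion correctly, the idea is roughly the following (my own sketch, not the linked answer's code, again assuming the Node.js driver):

// Sketch of the "resume after the last seen _id" idea: sort on _id
// (indexed by default), fetch one page, remember the last _id, and
// start the next page just after it, instead of using skip.
async function scan_in_pages(collection, query, user_date, page_size = 20) {
    let last_id = null;

    while (true) {
        const page_query = last_id ? { ...query, _id: { $gt: last_id } } : query;
        const docs = await collection
            .find(page_query)
            .sort({ _id: 1 })
            .limit(page_size)
            .toArray();

        if (docs.length === 0)
            break; // no more documents

        for (const d of docs)
            if (check_clash(d.date, user_date))
                return true; // clash found

        last_id = docs[docs.length - 1]._id; // resume point for the next page
    }
    return false;
}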
I’ve actually abstracted my problem in the original question to simplify things, but the core problem is that I have a user-input date range and bookings that also have a date range, stored like so:
MONTHLY_COLLECTION:
{
    "month_year": "01-2020",
    "dates": {
        "1": [...],   // array of booking IDs whose range falls within this date
        "5": [...],
        "28": [...]
    },
    ...