Data Retrieval and then deletion

I have a MongoDB collection with over 2.9M records and i want to remove the old data till a specific date range. & there’s only one index (i.e. _id UNIQUE) created over this collection

Upon running the following command in a shell, it is taking too much time:
"db.collection.remove({_id: {$in: db.collection.find({}, {_id : 1}).limit(100).sort({id:1}).toArray().map(function(doc) { return doc._id; }) }})
"
So which index is more suitable for a function “deleteMany or remove” to remove data from a column using a wildcard?

Secondly, i am creating a new index right now , on the same column in ASC order with option “Create index in the backgound”. How much time the index creation will take ?

Hi @Ahmed_Hussain ,

It sounds like you should have a field in each document stating the date of the document? Is that correct?

If so then you can potentially use a TTL (Time To Live) index to maintain the window of documents alive:

However, in case you want to delete documents based on a timestamp field (eg. createdAt ), you can index that field and delete chunks of data :

db.collection.deleteMany({createdAt : {$gte: <START_TIMESTAMP> ,  $lte : <END_TIMESTAMP>}})

Its better to split the deletes into smaller batches to not overwhelm the database at once. Consider clearing week by week or month by month etc…

Thanks
Pavel

Hi @Pavel_Duchovny ,

Unfortunately we don’t have a date column in the collection but we do have a column that holds such information in this format (“1021-30-12-2021”) . The right side of this string is the date (i.e. 30-12-2021).

I tried using this query ( db.collection.deleteMany({routeKey: /12-01-2022/}) ) but the query is taking too much time.

Can you please suggest under this situation?

Because you are using a regular expression that is not anchored at the beginning. This means that even if you have an index, all documents needs to be scanned to produce a result.

The first error is that you do not keep your dates as date but as string. The second error is that you used the worst string representation of a date by using day-month-year. This means you cannot even use relational operators, you cannot sort your data on date.

If querying with date is important you must have a date field store as a date data type. It takes less space than string, it is faster to compare than string and there is a rich library of date manipulation function.

Using the right model, using the right data type and have an index are the first steps for performance.

@steevej I strongly agree with your perspective about the structure but what to do now?

Once the data has been deleted, i will surely work on it.

1 Like

What is the meaning of the first 4 digits of routeKey?

If it is HHMM like hour-hour-minute-minute, then you might be able to delete one day at a time with

{ routeKey : { $gte : "0000-12-01-2022" , $lte : "2400-12-01-2022" } }

you would still be scanning all documents (no choice since you only have an index on _id) but no regular expression.

You could also leverage the fact that the first 4 bytes of an object ID is a timestamp. See https://www.mongodb.com/docs/manual/reference/method/ObjectId/.

The following is certainly time consuming and having an ascending index on a field for which you already have a descending index is probably completely useless.

Do not create a new index just for your delete. It will probably take more time to 1) create the index 2) delete the documents and 3) update the index for all deleted document than just doing the delete.

Hi @Ahmed_Hussain ,

Why not create an index on { routeKey : 1} and perform a single day deletion as @steevej suggested and see how it performs?

Thanks
Pavel

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.