How best to delete a huge amount of data from a collection?

I am storing multi tenanted data in collections. Each collection has a tenantId. Sometimes, I will need to delete a tenant and that means I will need to run a query such as

db.records.deleteMany({ tenantId: "abc" })

However, the number of records for a tenant in some collections can reach even half a billion records and this is very slow, I cannot easily get progress too. tenantId is indexed. How best should I delete data in such instances?

Thinking if some of these are recommended to try?

  • Query X rows at a time, build an array of _ids which I will then use in delete. Repeat until no more rows
  • Maybe store data for each tenant in separate collections? But this will complicate query if I need to query across collections?
  • What if I bulk delete a deletedAt field that has a TTL index? Dunno if thats faster
1 Like

Hi, it could be worth looking at the bulk operation for removing documents, and here’s a discussion about it on StackOverflow.

Will try this. But the docs doesnt say much about how this works under the hood or hows its different apart from doing it with less load on the DB?

Bulk operations might be of help here. An unordered bulk.find(queryDoc).remove() basically is a version of db.collection.remove(queryDoc) optimized for large numbers of operations. It’s usage is pretty straightforward:

The idea behind this approach is not to speed up the removal, but to produce less load. In my tests, the load was reduced by half and took slightly less time than a db.collection.remove(query).

Another thing to keep in mind is that doing a large amount of delete may cause high network bandwidth usage and longer replication lag. (many deployments have alert on the replication lags)

Hmm but how should I do this instead?

  1. Add rate limit on the bulk remove
  2. Do it at off peak time window
  3. Monitor disk io and network usage while the removal is in progress

How do I add a rate limit tho? I dont see a limit option?

For off peak window guess thats also manually?