Large number of deletes in a transaction

fried_empanada · July 21, 2023, 10:03pm

Hey all, I am currently implementing the database operations that delete all the information related to a particular user when that user wants to delete their account. This involves deleting user data that is spread across multiple collections. Ideally, I would like to perform the deleteMany() operations across those collections in a single all-or-nothing transaction to ensure data consistency, because some of the data is interconnected (the user’s posts, likes, etc.).

The number of deleteMany() operations to be included in the transaction would likely vary from tens of thousands to tens of millions. I read in this post (Performance Best Practices: Transactions and Read / Write Concerns | MongoDB Blog) that “For operations that need to modify more than 1,000 documents, developers should break the transaction into separate parts that process documents in batches.” I am wondering whether this limit applies to document delete operations as well, or only to document update operations. I also read on this page (https://www.mongodb.com/docs/manual/core/transactions-production-consideration/#oplog-size-limit) that starting in version 4.2, MongoDB creates as many oplog entries as necessary for a transaction instead of a single oplog entry limited to 16 MB, so I am also not sure whether the 1,000-document limit is still valid. Either way, I would appreciate advice and recommendations on how to approach this.

Thanks a lot in advance!

Kobe_W · July 22, 2023, 4:56am

Do you have to delete all of that data? Can you simply add a “deleted” boolean flag to the user account instead?
Do you always have to delete all of the related resources? Some may fail to delete, but perhaps you don’t need to delete them if they will never be used again.

Using a very big transaction is almost never a good idea. Big transactions consume more resources and can hurt performance.
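
For anyone who wants to see what that could look like, here is a minimal sketch of the "flag first, then clean up in small batches" idea, written against PyMongo-style collection objects. The `userId` field, the `deleted` flag, the function names, and the batch size of 1,000 are all assumptions for illustration and not taken from the original schema. Note that it deliberately gives up the all-or-nothing guarantee: each batch is its own small write, and the whole routine is meant to be safe to re-run if it stops partway.

```python
from typing import Any, Dict


def soft_delete_user(users: Any, user_id: Any) -> None:
    """Flag the account so the application can hide it immediately."""
    users.update_one({"_id": user_id}, {"$set": {"deleted": True}})


def delete_in_batches(collection: Any, query: dict, batch_size: int = 1000) -> int:
    """Delete documents matching `query`, at most `batch_size` per round trip."""
    total = 0
    while True:
        # Fetch only the _id values of the next batch of matching documents.
        ids = [doc["_id"] for doc in collection.find(query, {"_id": 1}, limit=batch_size)]
        if not ids:
            return total
        result = collection.delete_many({"_id": {"$in": ids}})
        total += result.deleted_count


def purge_user_data(users: Any, related: Dict[str, Any], user_id: Any,
                    batch_size: int = 1000) -> Dict[str, int]:
    """Soft-delete the user, remove related documents, then remove the profile.

    `related` maps a label to a collection whose documents reference the user
    through a hypothetical `userId` field.
    """
    soft_delete_user(users, user_id)
    counts = {
        name: delete_in_batches(coll, {"userId": user_id}, batch_size)
        for name, coll in related.items()
    }
    # The profile goes last, so an interrupted run can simply be repeated.
    users.delete_one({"_id": user_id})
    return counts
```

With a real PyMongo database handle this might be called as `purge_user_data(db.users, {"posts": db.posts, "likes": db.likes}, user_id)`, where the collection names are again hypothetical. Whether each batch should additionally be wrapped in its own short transaction depends on how strictly the related collections must agree with each other while the cleanup is running.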

fried_empanada · July 22, 2023, 9:58pm

Hi @Kobe_W,

I see, those are very good points. I have a follow-up question regarding the second point. Say I delete the single user profile document belonging to user A, but do not delete the resources related to that user in other collections (those resources reference the user by the profile document’s ObjectId). When a new user, say user B, signs up, is there any chance that the ObjectId originally assigned to user A’s profile document will be “recycled” and assigned to user B’s profile document? That would cause the system to mis-assign user A’s stale, left-over resources to user B and essentially allow user B to access them. Basically, I think this boils down to whether deleted documents’ ObjectIds (the _id field) can be “recycled” and reused by new documents in the future.

Thanks for the help again!

Kobe_W · July 23, 2023, 12:33am

For each signup, a new ObjectId is generated and assigned to that specific user, so there is no recycling.

John_Sewell · July 23, 2023, 1:01pm

Take a look at how an ObjectId is constructed: https://www.mongodb.com/docs/manual/reference/method/ObjectId/

Someone pointed out in a comment elsewhere that the driver may generate the ObjectId when you insert through a driver, or the server will if the insert is run on the server via the shell, etc.

steevej · July 23, 2023, 1:09pm

I do not want to be picky about the details, no really I want to be picky B-)

The shell is simply a client application that uses a driver, and it is also the driver that generates the ObjectId, not the server.

John_Sewell · July 23, 2023, 1:36pm

Technically correct is the best form of correct.

I was leaning more towards something like an aggregation with $out on the server side, where you project out the ID fields so the server generates them, but I worded it badly.

As you say though, the shell is just an application, using a driver!

It’s worth knowing how an ID is created. The embedded timestamp can also tell you when a document’s ID was generated (bearing in mind that it may have been generated on the client, using the client’s clock).
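
As a rough illustration of reading that timestamp, the first 4 bytes (8 hex characters) of an ObjectId hold the creation time in seconds since the Unix epoch. The sketch below decodes it with the Python standard library only; the function name `object_id_created_at` is made up for this example. In practice you would more likely use `ObjectId.getTimestamp()` in the shell or the `generation_time` property of `bson.ObjectId` in PyMongo.

```python
from datetime import datetime, timezone


def object_id_created_at(object_id_hex: str) -> datetime:
    """Return the UTC creation time embedded in a 24-character ObjectId hex string."""
    if len(object_id_hex) != 24:
        raise ValueError("an ObjectId is 12 bytes, i.e. 24 hex characters")
    # The leading 4 bytes are a big-endian count of seconds since the Unix epoch.
    seconds = int(object_id_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```

The result is only as trustworthy as the clock of whichever machine generated the ID.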

fried_empanada · September 3, 2023, 4:50pm

I see, gotcha. Thank you all very much for the explanations and insights! I checked the page you shared, https://www.mongodb.com/docs/manual/reference/method/ObjectId/, and it looks like the ObjectId is constructed from the current timestamp, a random value unique to the machine and process, and an incrementing counter. If I understand this correctly, there should not be any recycling of old, deleted ObjectIds, right? Thank you very much for all your help again!

Kobe_W · September 5, 2023, 4:16am

Yes, this is correct.
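
To make the layout concrete, here is a small sketch that assembles a 12-byte value the same way the ObjectId page describes it: a 4-byte timestamp, a 5-byte random value chosen once per process, and a 3-byte counter that starts at a random value and increments. The class name `ObjectIdSketch` and its `next_hex` method are invented for this illustration; this is not the drivers' actual implementation, just the documented layout.

```python
import os
import struct
import threading
import time
from typing import Optional


class ObjectIdSketch:
    """Illustrative generator that follows the documented ObjectId layout."""

    def __init__(self) -> None:
        # 5 random bytes, picked once for the lifetime of this process.
        self._process_random = os.urandom(5)
        # The 3-byte counter starts at a random value.
        self._counter = int.from_bytes(os.urandom(3), "big")
        self._lock = threading.Lock()

    def next_hex(self, now: Optional[float] = None) -> str:
        """Return a new 24-character hex ID; `now` overrides the clock for demos."""
        seconds = int(time.time() if now is None else now)
        with self._lock:
            # Wrap around after 2**24 values so the counter always fits in 3 bytes.
            self._counter = (self._counter + 1) % 0x1000000
            counter = self._counter
        raw = struct.pack(">I", seconds) + self._process_random + counter.to_bytes(3, "big")
        return raw.hex()
```

Because the timestamp keeps moving forward and the counter changes on every call, a new ID does not depend on which documents were deleted earlier, which is why old values are not handed out again.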