I have a collection with 100 million documents. I want to delete every document where the guildID isn’t in an array (the array has 200k elements). I’m using mongoose.
Would it be faster to:
Stream all the data from the collection (using mongoose). Once I recieve each document I check if the guildID is in the array. If it’s not in the array I use deleteOne().
Use deleteMany with $nin. I’m currently using this method and it’s taking 10 minutes to delete 1 million documents.
Both scenarios assume the deletion of the 10 million documents that matches the delete criteria. This is to simplify the understanding.
Scenario 1 - stream
you send one query to retrieve all docs
the server fetches 100M docs from permanent storage
the server sends 100M docs over to client
the client sends 10M delete commands to the server
the server performs 10M delete
you could improve this by streaming only _id and guildID
Scenario 2 - deleteMany
you send one command to deleteMany with an array of 200k elements (ouch)
the server also fetches the 100M docs but mark them as deleted right away
the main problem is that I do not think the index on guildID, if any, can be used because the query is $nin
That’s a very interesting problem. Please do follow up on your findings.
My question is what is the source of the array of 200k elements? Is it another collection in the same database?
I would try to find a way to use $in rather than $nin by getting the guildID that needs to be deleted rather than using the guildID that needs to be kept. Specially if an index on guildID exists.