Deletion of 1 million records with BulkWrite (DeleteManyModel) taking a long time

Hi Team,
I have a collection of 5 million records, and every time a data refresh happens I want to delete approximately 1 million of them. I am using the Java code below, but it takes about 30 seconds to delete, which is too high for the application's response time. Please suggest a better approach to reduce the query response time.

I have created a compound index on the fields I am using in the where clause.

Compound index created on these fields: { "collection_id" : 1, "client_id" : 1, "version" : -1 }
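(For reference, an index of that shape can be created with the Java driver roughly as follows. This is an illustrative sketch only, not the exact code used; it reuses the database and collection names from the code below.)

package com.sample.db.mongo;

import org.bson.Document;

import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Indexes;

public class CreateDeleteIndex {
   public static void main(String[] args) {
      MongoClient mongo = new MongoClient("localhost", 27017);
      MongoCollection<Document> col =
            mongo.getDatabase("integ28").getCollection("BulkInsertTesting");

      // Compound index matching the delete filter:
      // { collection_id: 1, client_id: 1, version: -1 }
      col.createIndex(Indexes.compoundIndex(
            Indexes.ascending("collection_id", "client_id"),
            Indexes.descending("version")));

      mongo.close();
   }
}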

package com.sample.db.mongo;

import java.util.ArrayList;
import java.util.List;

import org.bson.Document;
import org.json.JSONObject;

import com.mongodb.MongoClient;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.BulkWriteOptions;
import com.mongodb.client.model.DeleteManyModel;
import com.mongodb.client.model.WriteModel;

public class InsertingDocumentBulk {
   public static void main(String[] args) {
      // Creating a MongoDB client
      MongoClient mongo = new MongoClient("localhost", 27017);
      MongoDatabase database = mongo.getDatabase("integ28");

      long start = System.currentTimeMillis();

      // Build the delete filter: collection_id = "123321", client_id = "8", version < 23
      JSONObject dataVersion = new JSONObject().put("$lt", 23);
      JSONObject jsonQuery = new JSONObject();
      jsonQuery.put("collection_id", "123321");
      jsonQuery.put("version", dataVersion);
      jsonQuery.put("client_id", "8");

      // Unordered bulk write, skipping document validation
      BulkWriteOptions bulkWriteOptions = new BulkWriteOptions();
      bulkWriteOptions.ordered(false);
      bulkWriteOptions.bypassDocumentValidation(true);

      // Single DeleteManyModel that removes every document matching the filter
      List<WriteModel<Document>> deleteManyModel = new ArrayList<>();
      deleteManyModel.add(new DeleteManyModel<Document>(Document.parse(jsonQuery.toString())));
      database.getCollection("BulkInsertTesting").bulkWrite(deleteManyModel, bulkWriteOptions);

      System.out.println(System.currentTimeMillis() - start);
      System.out.println("Documents deleted successfully");

      mongo.close();
   }
}

Performance is influenced by the characteristics of your configuration.

So those numbers might be totally appropriate depending on your setup.

What are the characteristics of the machines running mongod? RAM, disk, replica sets, …

Indexes are also an important factor. Deleting documents implies that ALL indexes are updated. If your index { "collection_id" : 1, "client_id" : 1, "version" : -1 } exists only to support this delete, then maintaining that single-purpose index might itself be what is causing the problem, so removing it might improve your performance.

If removing the index makes things slower, or is not possible, then you might want to consider adding a field that marks whether a document is to be deleted. When the user action causing the delete occurs, you simply mark the documents. This should be fast because no indexes are updated while the user is waiting for a response. You then write a small delete task that spreads the physical deletion over time.
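(As an illustration only, not from the original posts: a minimal Java-driver sketch of this mark-then-sweep idea, assuming a "deleted" marker field and the same filter values as the code above. The class name, batch size, and throttling are placeholders.)

package com.sample.db.mongo;

import java.util.ArrayList;
import java.util.List;

import org.bson.Document;
import org.bson.conversions.Bson;

import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;

public class MarkThenSweepSketch {
   public static void main(String[] args) {
      MongoClient mongo = new MongoClient("localhost", 27017);
      MongoCollection<Document> col =
            mongo.getDatabase("integ28").getCollection("BulkInsertTesting");

      // 1) While the user waits: only mark the documents to delete (fast path).
      Bson toDelete = Filters.and(
            Filters.eq("collection_id", "123321"),
            Filters.eq("client_id", "8"),
            Filters.lt("version", 23));
      col.updateMany(toDelete, Updates.currentDate("deleted"));

      // 2) Later, from a background task: physically remove the marked documents
      //    in small batches so the cleanup does not monopolise I/O.
      Bson marked = Filters.exists("deleted", true);
      List<Object> ids = new ArrayList<>();
      do {
         ids.clear();
         for (Document d : col.find(marked).projection(new Document("_id", 1)).limit(10_000)) {
            ids.add(d.get("_id"));
         }
         if (!ids.isEmpty()) {
            col.deleteMany(Filters.in("_id", ids));
            // optionally sleep here between batches to throttle the load
         }
      } while (!ids.isEmpty());

      mongo.close();
   }
}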

Hi Steevej,
When you say,

When the user action causing the delete occurs, you simply mark the documents

Can you enlighten me more on how to implement this?
I'm currently doing bulk deletes of a billion records in batches of 100k, and my sharded cluster is overloaded with cache pressure. Almost every time it's at the 5% dirty mark with 80% disk I/O utilization. I'm using local SSDs with provisioned IOPS. I'm trying to understand whether the approach you suggested somehow tackles the deletion overhead. Looking forward to hearing from you.

Share the code you use to do the bulk deletes in batches of 100k.

Why are you doing it in batches? Why not in a single operation? How many documents still remain in the collection when all the batches are removed?

To mark the documents, you do something like:

collection.updateMany( { query that specifies the documents to delete } ,
  { "$currentDate" : { "deleted" : { "$type" : "date" } } } )

This comes with the caveat that all your other normal queries must now specify

"deleted" : { "$exists" : false }
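(In the Java driver, that extra condition could be added along these lines. This is an illustrative sketch; the class and the existingFilter parameter are hypothetical helpers, not part of the thread.)

import org.bson.conversions.Bson;

import com.mongodb.client.model.Filters;

public class SoftDeleteFilters {
   // Hypothetical helper: wraps an existing query so it skips soft-deleted documents.
   public static Bson excludingDeleted(Bson existingFilter) {
      return Filters.and(existingFilter, Filters.exists("deleted", false));
   }
}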

This technique may or may not help: no indexes are updated, but documents that are not in RAM must still be read from disk and written back. It has to be tested against your use cases. The advantage is that you can implement undo for your deletes.
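(For example, as a hedged sketch rather than anything from the thread, undo could be as simple as unsetting the marker field before the sweep runs.)

import org.bson.Document;

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;

public class UndoSoftDelete {
   // Hypothetical undo: clear the "deleted" marker on everything still marked.
   public static void undoAll(MongoCollection<Document> collection) {
      collection.updateMany(Filters.exists("deleted", true), Updates.unset("deleted"));
   }
}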

The biggest question is what kind of use case requires deleting roughly 1 million out of 5 million documents on every data refresh.

How often does the data refresh happen?

Maybe your model has to be revised rather than trying to fix this performance issue.

Performance issues are often caused by one of the following:

  1. insufficient hardware for the use-cases
  2. wrong model for the use-cases

Hopefully, it won't take another 5 months before a follow-up.