Update on 2 billion records will take 1.5 years

Hi community,

We are facing a huge performance issue during a migration. There is a collection with 2B records, and we would like to change a property value in every document that satisfies the filter query (around 78% of all documents). The current performance of that operation is terrible:
it is updating 55 documents per second, which means the 2B records will take around 420 days :grin:.

Is there any way to optimize this down to a few hours instead?

Here are the details.

C# Driver: 2.10.4
MongoDB version: 4.4.2 (replica set with 3 nodes)
Document structure:

{
    "_id" : UUID("9a7b2d8b-0648-4381-9317-c69e2c4d21b7"),
    "Data" : {
        "CustomAttributes" : [
            .....................
            {
                "k" : NumberInt(6),
                "v" : "J"
            },
            .......................
            {
                "k" : NumberInt(160),
                "v" : "00:18"
            },
        ],
        "IsEvent" : false,
        ....................
        "Prop18" : someValue
    }
}

CustomAttributes is a dictionary with 160 items, with keys (k) from 1 to 160.

The actual query:

            var flightCollection = db.GetCollection<BsonDocument>(stateStore);

            var resultFlights = flightCollection.UpdateMany(
                Builders<BsonDocument>.Filter.Eq("Data.IsEvent", false),
                Builders<BsonDocument>.Update.Set("Data.CustomAttributes.$[elem].v", "Z"), 
                new UpdateOptions 
                {
                    ArrayFilters = new[]
                    {
                        new BsonDocumentArrayFilterDefinition<BsonDocument>(new BsonDocument("elem.k", 6))
                    }
                });

Since I’ve dealt with DBs of upwards of 7 billion documents before, I pulled out some of my old scripts for doing stuff like this. Call me cocky, but I love stuff like this…

These use the drivers to connect to the database, fetch documents in batches of 5000, and update them sequentially using the BulkWrite API. The processedDocuments variable keeps track of how many documents have been updated so far, and the loop continues until all documents have been processed. Doing the 2B in batches of 5k at a time will keep your systems from freezing, and honestly you’ll probably be done in a few days.

To optimize the update performance for this scenario, here are some suggestions:

1. Use a multi-update approach - Instead of updating all 2B records in one go, split them into smaller batches and update the batches in parallel. This makes use of the available resources and improves overall update throughput. This is also why I recommend doing it in C# or Node.js; I’ve also fixed up your C# code below.

2. Use a dedicated server - You can use a separate server with high processing power and memory to execute the update operation. This will ensure that other database operations are not impacted due to the high load generated by the update.

3. Use bulk writes - Use bulk write operations to perform multiple update operations in a single request. This can help reduce network latency and improve performance.

4. Use a covered query - Use a covered query to fetch only the necessary fields from the collection (ideally just _id). This helps reduce the amount of data transferred between the database and the application and improves performance (see the sketch after this list).

5. Optimize index usage - Ensure that the collection has an appropriate index supporting the update's filter. This helps improve performance by reducing the time spent searching for the records to update (also covered in the sketch after this list).

6. Optimize query filters - Ensure that the query filters are optimized and are using the appropriate operators to retrieve only the necessary records. This can help reduce the number of records being updated and improve performance.

7. Monitor database resources - Monitor the database resources such as CPU, memory, and network usage during the update operation. This can help identify any performance bottlenecks and optimize the update process accordingly.
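To make tips 4 and 5 concrete, here is a minimal C# sketch against the collection from the question (it reuses the db and stateStore names from the original code). The index and projection choices are illustrative rather than a prescription, and building a new index on 2B documents is itself a long-running operation you would want to schedule:

var flightCollection = db.GetCollection<BsonDocument>(stateStore);

// Tip 5: a compound index on the filter field plus _id lets the server find the
// matching ids without scanning all 2B documents, and with the _id-only
// projection below the find can be covered entirely by the index.
var indexKeys = Builders<BsonDocument>.IndexKeys.Ascending("Data.IsEvent").Ascending("_id");
flightCollection.Indexes.CreateOne(new CreateIndexModel<BsonDocument>(indexKeys));

// Tip 4: when fetching the documents to update, project only _id so each
// 5000-document batch stays small on the wire and in memory.
var idCursor = flightCollection
    .Find(Builders<BsonDocument>.Filter.Eq("Data.IsEvent", false),
          new FindOptions { BatchSize = 5000 })
    .Project(Builders<BsonDocument>.Projection.Include("_id"))
    .ToCursor();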

Compare these to what you have, and let me know what you think.

C#


var collection = database.GetCollection<BsonDocument>("myCollection");

var updates = new List<WriteModel<BsonDocument>>();

// Build the update operation for each document
foreach (var doc in documentsToUpdate)
{
    var filter = Builders<BsonDocument>.Filter.Eq("_id", doc["_id"]);
    var update = Builders<BsonDocument>.Update.Set("myField", "myNewValue");
    var updateModel = new UpdateOneModel<BsonDocument>(filter, update);
    updates.Add(updateModel);
}

// Execute the bulk write operation
BulkWriteResult result = collection.BulkWrite(updates);
int numUpdated = result.ModifiedCount;

Now to do it all in batches that will always keep running until all 2B are done:

var collection = database.GetCollection<BsonDocument>("myCollection");

List<BsonDocument> documentsToUpdate = new List<BsonDocument>(); // retrieve documents to update

int batchSize = 5000;
int totalDocs = documentsToUpdate.Count;
int currentDocIndex = 0;
int numUpdated = 0;

while (currentDocIndex < totalDocs) {
    List<UpdateOneModel<BsonDocument>> updates = new List<UpdateOneModel<BsonDocument>>();

    // Build the update operation for each document in the batch
    for (int i = currentDocIndex; i < currentDocIndex + batchSize && i < totalDocs; i++) {
        BsonDocument doc = documentsToUpdate[i];
        var filter = Builders<BsonDocument>.Filter.Eq("_id", doc.GetValue("_id"));
        var update = Builders<BsonDocument>.Update.Set("myField", "myNewValue");
        var updateModel = new UpdateOneModel<BsonDocument>(filter, update);
        updates.Add(updateModel);
    }

    // Execute the bulk write operation for the batch
    BulkWriteResult result = collection.BulkWrite(updates);
    numUpdated += result.ModifiedCount;

    // Move to the next batch
    currentDocIndex += batchSize;
}

Console.WriteLine($"Total documents updated: {numUpdated}");

For JavaScript/Node:

const collection = db.collection('myCollection');

const updates = documentsToUpdate.map((doc) => ({
  updateOne: {
    filter: { _id: doc._id },
    update: { $set: { myField: 'myNewValue' } },
  },
}));

// Execute the bulk write operation
collection.bulkWrite(updates).then((result) => {
  const numUpdated = result.modifiedCount;
});

const MongoClient = require('mongodb').MongoClient;

const batchSize = 5000;
const dbName = 'myDatabase';
const collectionName = 'myCollection';

async function updateDocumentsSequentially() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const db = client.db(dbName);
  const collection = db.collection(collectionName);

  const totalDocuments = await collection.countDocuments({});

  let processedDocuments = 0;
  while (processedDocuments < totalDocuments) {
    // Only fetch documents that still need the update; find({}) would return
    // the same first 5000 documents on every iteration.
    const documentsToUpdate = await collection
      .find({ myField: { $ne: 'myNewValue' } })
      .limit(batchSize)
      .toArray();

    if (documentsToUpdate.length === 0) {
      break; // nothing left to update
    }

    const updates = documentsToUpdate.map((doc) => ({
      updateOne: {
        filter: { _id: doc._id },
        update: { $set: { myField: 'myNewValue' } },
      },
    }));

    const result = await collection.bulkWrite(updates);
    processedDocuments += result.modifiedCount;
  }

  await client.close();
}

updateDocumentsSequentially().catch((err) => console.error(err));

I’m also going to give you my Python and Rust versions of the 5000-per-batch approach. The Rust version has the lowest impact on system resources while running, with almost no memory issues comparatively, and it’s what I’d recommend overall for that much workload, but it’s up to you.

RUST

use mongodb::bson::{doc, Document};

// NOTE: treat this as a sketch; the exact bulk-write types and method
// signatures depend on which version of the Rust driver you are on.
let collection = database.collection::<Document>("myCollection");

let documents_to_update: Vec<Document> = Vec::new(); // retrieve documents to update

let batch_size = 5000;
let total_docs = documents_to_update.len();
let mut current_doc_index = 0;
let mut num_updated = 0;

while current_doc_index < total_docs {
    let mut updates: Vec<UpdateOneModel> = Vec::new();

    // Build the update operation for each document in the batch
    for i in current_doc_index..std::cmp::min(current_doc_index + batch_size, total_docs) {
        let doc = &documents_to_update[i];
        let filter = doc! { "_id": doc.get("_id").cloned().unwrap() };
        let update = doc! { "$set": { "myField": "myNewValue" } };
        updates.push(UpdateOneModel::new(filter, update, None));
    }

    // Execute the bulk write operation for the batch
    let result = collection.bulk_write(updates, None).unwrap();
    num_updated += result.modified_count;

    // Move to the next batch
    current_doc_index += batch_size;
}

println!("Total documents updated: {}", num_updated);

PYTHON

from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb://localhost:27017/")
database = client["myDatabase"]
collection = database["myCollection"]

documents_to_update = []  # retrieve documents to update here

batch_size = 5000
total_docs = len(documents_to_update)
current_doc_index = 0
num_updated = 0

while current_doc_index < total_docs:
    updates = []

    # Build the update operation for each document in the batch
    for i in range(current_doc_index, min(current_doc_index + batch_size, total_docs)):
        doc = documents_to_update[i]
        filter = {"_id": doc["_id"]}
        update = {"$set": {"myField": "myNewValue"}}
        update_model = UpdateOne(filter, update)
        updates.append(update_model)

    # Execute the bulk write operation for the batch
    result = collection.bulk_write(updates)
    num_updated += result.modified_count

    # Move to the next batch
    current_doc_index += batch_size

print(f"Total documents updated: {num_updated}")

Note that this requires the PyMongo driver to be installed.


Hi Brock,

Thank you very much for your reply, I will try it and let you know.

Just remember to run it all in batches, that’s key, so it goes batch by batch instead of all at once.

It’ll greatly improve the performance. You may scale the batches up or down, but 5000 docs per batch is usually a sweet spot for a lot of systems; even a Raspberry Pi 4 running 4.4 can handle 5000 at a time.
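If you want to find your own sweet spot, a minimal C# sketch of that tuning, reusing the collection and batch of WriteModels from the snippets above (the MeasureBatch helper name is mine, purely illustrative):

// Runs a single bulk-write batch and returns its throughput in docs/sec,
// so you can compare e.g. 1000 vs 5000 vs 10000 documents per batch.
static double MeasureBatch(IMongoCollection<BsonDocument> collection,
                           List<WriteModel<BsonDocument>> batch)
{
    var stopwatch = System.Diagnostics.Stopwatch.StartNew();
    var result = collection.BulkWrite(batch);
    stopwatch.Stop();
    return result.ModifiedCount / stopwatch.Elapsed.TotalSeconds;
}

// Example usage: build batches at a few candidate sizes and compare.
Console.WriteLine($"{MeasureBatch(collection, batch):F0} docs/sec");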

Hi @Brock,

I’m using the following code for the update, but it seems it will take about 300 hours for 1B records:
6 seconds per batch of 5000 items (1B / 5000 = 200,000 batches × 6 s ≈ 333 hours). How can we improve it? If it were 1 second per batch, I could apply this to our PROD db over two weekends.

 public void Up(IMongoDatabase database)
        {
            var db = dBHelper.GetDatabase("name");

            long numUpdated = 0;
            int batchSize = 5000;
            var documentsToUpdate = db.GetCollection<BsonDocument>(stateStore);
            
            System.Diagnostics.Debug.WriteLine($"Total: --- {documentsToUpdate.EstimatedDocumentCount()}");

            string query = @"{ 
                        $set:
                        { 
                            'Data.CustomAttributes' : 
                            {
                                 $map:
                                 {
                                     input: '$Data.CustomAttributes',
                                     as:'this',
                                     in:
                                     {
                                         $cond:
                                         {
                                              if: { $eq: ['$$this.k', 6] },
                                              then:
                                              {
                                                'k' : NumberInt(6),
                                                'v' : {$toString: '$$this.v'}
                                              },
                                              else: '$$this'
                                         }
                                     }
                                 }
                            }
                        }
                    }";

            var pipelineDefinition = new BsonDocumentStagePipelineDefinition<BsonDocument, BsonDocument>(new[] { BsonDocument.Parse(query) });

            var all = db.GetCollection<BsonDocument>(stateStore).Find(Builders<BsonDocument>.Filter.Empty, new FindOptions { BatchSize = batchSize, }).ToCursor();

            List<UpdateOneModel<BsonDocument>> updates = new List<UpdateOneModel<BsonDocument>>();
            
            Stopwatch stopwatch = Stopwatch.StartNew();

            while (all.MoveNext())
            {
                updates.Clear();

                foreach (var document in all.Current)
                {
                    var filter = Builders<BsonDocument>.Filter.Eq("_id", document.GetValue("_id"));
                    updates.Add(new UpdateOneModel<BsonDocument>(filter, pipelineDefinition));
                }

                // Execute the bulk write operation for the batch
                BulkWriteResult result = documentsToUpdate.BulkWrite(updates);
                numUpdated += result.ModifiedCount;
                System.Diagnostics.Debug.WriteLine($"Count: --- {numUpdated}");
                System.Diagnostics.Debug.WriteLine($"Time: --- {stopwatch.Elapsed}");

            }
        }
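For reference, here is a minimal sketch of how the earlier tips about parallel batches (tip 1) and bulk writes (tip 3) might be bolted onto this loop. It reuses the all cursor, pipelineDefinition, documentsToUpdate collection and numUpdated variable from the code above, assumes System.Linq and System.Threading.Tasks are available, and the maxInFlight value is purely illustrative:

            var bulkOptions = new BulkWriteOptions { IsOrdered = false };
            var inFlight = new List<Task<BulkWriteResult<BsonDocument>>>();
            const int maxInFlight = 4; // illustrative; tune to what the replica set tolerates

            while (all.MoveNext())
            {
                var batch = all.Current
                    .Select(document => (WriteModel<BsonDocument>)new UpdateOneModel<BsonDocument>(
                        Builders<BsonDocument>.Filter.Eq("_id", document.GetValue("_id")),
                        pipelineDefinition))
                    .ToList();

                // Keep a few batches in flight at once instead of strictly one at a time.
                inFlight.Add(documentsToUpdate.BulkWriteAsync(batch, bulkOptions));

                if (inFlight.Count >= maxInFlight)
                {
                    Task.WaitAll(inFlight.ToArray());
                    numUpdated += inFlight.Sum(t => t.Result.ModifiedCount);
                    inFlight.Clear();
                }
            }

            // Drain whatever is still in flight after the cursor is exhausted.
            Task.WaitAll(inFlight.ToArray());
            numUpdated += inFlight.Sum(t => t.Result.ModifiedCount);

IsOrdered = false lets the server apply each batch’s updates without preserving order or stopping at the first error, and keeping a few batches in flight overlaps network round trips with server work. Unordered, parallel writes do push the primary and replication harder, so it’s worth watching replication lag on the secondaries before scaling maxInFlight up.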