Since I’ve dealt with DBs with upwards of 7 billion documents before, I pulled out some of my old scripts for jobs like this. Call me cocky, but I love stuff like this…
These scripts use the drivers to connect to the database, fetch documents in batches of 5000, and update them sequentially using the bulk write API. The processedDocuments variable keeps track of how many documents have been updated so far, and the loop continues until all documents have been processed. Doing the 2B in batches of 5k at a time will keep your systems from freezing, and honestly you’ll probably be done in a few days.
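To put rough numbers on "a few days", here's a back-of-envelope sketch. The 10,000 docs/sec throughput figure is purely an assumption; measure your own cluster's sustained bulk-write rate and substitute it:

```javascript
// Back-of-envelope math for 2B documents in batches of 5000.
const totalDocs = 2_000_000_000;
const batchSize = 5000;
const numBatches = Math.ceil(totalDocs / batchSize); // 400,000 bulkWrite calls
// Assumed sustained throughput; replace with a measured figure.
const docsPerSecond = 10000;
const days = totalDocs / docsPerSecond / 86400;
console.log(numBatches, days.toFixed(1)); // prints "400000 2.3"
```

At a tenth of that throughput you'd be looking at weeks, not days, so it's worth benchmarking one batch before committing to the full run.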
Here are some tips to optimize update performance for your scenario:
1. Use a multi-update approach - Instead of updating all 2B records in one go, split them into smaller batches and update them in parallel. This makes use of the available resources and improves overall update performance. (This is also why I recommend doing it in C# or Node.js - and I fixed your C# code while I was at it.)
2. Use a dedicated server - You can use a separate server with high processing power and memory to execute the update operation. This will ensure that other database operations are not impacted due to the high load generated by the update.
3. Use bulk writes - Use bulk write operations to perform multiple update operations in a single request. This can help reduce network latency and improve performance.
4. Use a covered query - Use a covered query to fetch only the necessary fields from the collection. This will help reduce the amount of data transferred between the database and the application, and improve performance.
5. Optimize index usage - Ensure that the collection has an appropriate index that is being used to execute the update operation. This can help improve query performance by reducing the amount of time taken to search for records.
6. Optimize query filters - Ensure that the query filters are optimized and are using the appropriate operators to retrieve only the necessary records. This can help reduce the number of records being updated and improve performance.
7. Monitor database resources - Monitor the database resources such as CPU, memory, and network usage during the update operation. This can help identify any performance bottlenecks and optimize the update process accordingly.
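Tips 1 and 3 boil down to two small pure helpers: slice the work into fixed-size batches, and turn each batch into bulkWrite update models. This is just a sketch - the field name myField and its value are placeholders for whatever you're actually setting:

```javascript
// Split an array of _ids into fixed-size batches.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Turn one batch of _ids into bulkWrite update models.
function toUpdateModels(batch) {
  return batch.map((id) => ({
    updateOne: {
      filter: { _id: id },
      update: { $set: { myField: 'myNewValue' } },
    },
  }));
}

// Example: 12000 ids in batches of 5000 -> 3 batches (5000, 5000, 2000)
const ids = Array.from({ length: 12000 }, (_, i) => i);
const batches = chunk(ids, 5000);
console.log(batches.length, batches[2].length); // prints "3 2000"
```

Keeping these pure makes them trivial to unit-test before you point them at 2B live documents.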
Compare these to what you have, and let me know what you think.
C#
var collection = database.GetCollection<BsonDocument>("myCollection");
var updates = new List<WriteModel<BsonDocument>>();

// Build the update operation for each document
foreach (var doc in documentsToUpdate)
{
    var filter = Builders<BsonDocument>.Filter.Eq("_id", doc["_id"]);
    var update = Builders<BsonDocument>.Update.Set("myField", "myNewValue");
    updates.Add(new UpdateOneModel<BsonDocument>(filter, update));
}

// Execute the bulk write operation
var result = collection.BulkWrite(updates);
long numUpdated = result.ModifiedCount; // ModifiedCount is a long, not an int
Now, to do it all in batches that keep running until all 2B are done:
var collection = database.GetCollection<BsonDocument>("myCollection");
List<BsonDocument> documentsToUpdate = new List<BsonDocument>(); // TODO: retrieve documents to update

int batchSize = 5000;
int totalDocs = documentsToUpdate.Count;
int currentDocIndex = 0;
long numUpdated = 0;

while (currentDocIndex < totalDocs)
{
    var updates = new List<UpdateOneModel<BsonDocument>>();

    // Build the update operation for each document in the batch
    for (int i = currentDocIndex; i < currentDocIndex + batchSize && i < totalDocs; i++)
    {
        BsonDocument doc = documentsToUpdate[i];
        var filter = Builders<BsonDocument>.Filter.Eq("_id", doc.GetValue("_id"));
        var update = Builders<BsonDocument>.Update.Set("myField", "myNewValue");
        updates.Add(new UpdateOneModel<BsonDocument>(filter, update));
    }

    // Execute the bulk write operation for the batch
    var result = collection.BulkWrite(updates);
    numUpdated += result.ModifiedCount;

    // Move to the next batch
    currentDocIndex += batchSize;
}

Console.WriteLine($"Total documents updated: {numUpdated}");
For JavaScript/Node:
const collection = db.collection('myCollection');

const updates = documentsToUpdate.map((doc) => ({
  updateOne: {
    filter: { _id: doc._id },
    update: { $set: { myField: 'myNewValue' } },
  },
}));

// Execute the bulk write operation
collection.bulkWrite(updates).then((result) => {
  const numUpdated = result.modifiedCount;
  console.log(`Documents updated: ${numUpdated}`);
});
const { MongoClient } = require('mongodb');

const batchSize = 5000;
const dbName = 'myDatabase';
const collectionName = 'myCollection';

async function updateDocumentsSequentially() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const db = client.db(dbName);
  const collection = db.collection(collectionName);

  const totalDocuments = await collection.countDocuments({});
  let processedDocuments = 0;
  let lastId = null;

  while (processedDocuments < totalDocuments) {
    // Page through the collection by _id so we never re-read documents we
    // already handled (a plain find().limit() would fetch the same first
    // batch forever and loop infinitely).
    const query = lastId ? { _id: { $gt: lastId } } : {};
    const documentsToUpdate = await collection
      .find(query)
      .sort({ _id: 1 })
      .limit(batchSize)
      .toArray();
    if (documentsToUpdate.length === 0) break;
    lastId = documentsToUpdate[documentsToUpdate.length - 1]._id;

    const updates = documentsToUpdate.map((doc) => ({
      updateOne: {
        filter: { _id: doc._id },
        update: { $set: { myField: 'myNewValue' } },
      },
    }));
    await collection.bulkWrite(updates);
    // Count fetched documents rather than modifiedCount: modifiedCount is 0
    // for documents that already hold the new value, which would stall a re-run.
    processedDocuments += documentsToUpdate.length;
  }

  await client.close();
}

updateDocumentsSequentially().catch((err) => console.error(err));
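Tip 1 mentions running batches in parallel. A minimal way to do that from Node is a small promise pool; runWithConcurrency is a hypothetical helper name, and in a real run each job would be one fetch-plus-bulkWrite batch over a precomputed _id range (cursor-style pagination doesn't parallelize directly, since each page depends on the last):

```javascript
// Run async jobs with at most `limit` in flight at once, preserving order
// of results. Safe because the index increment happens between awaits.
async function runWithConcurrency(jobs, limit) {
  const results = new Array(jobs.length);
  let next = 0;
  async function worker() {
    while (next < jobs.length) {
      const i = next++;
      results[i] = await jobs[i]();
    }
  }
  const workers = Array.from({ length: Math.min(limit, jobs.length) }, worker);
  await Promise.all(workers);
  return results;
}

// Demo with fake jobs standing in for bulkWrite batches.
const jobs = [1, 2, 3, 4].map((n) => async () => n * 2);
runWithConcurrency(jobs, 2).then((r) => console.log(r)); // prints "[ 2, 4, 6, 8 ]"
```

Keep the limit modest (2-4) so the update load doesn't starve regular traffic, which is the same concern tip 7 raises about monitoring resources.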
I’m also going to give you my Python and Rust versions of the 5000-doc batch. The RUST VERSION has the LOWEST impact on system resources while it runs (almost no memory issues at all, comparatively), and it’s what I’d recommend overall for this much workload, but it’s up to you.
RUST
// Assumes the Rust driver's client-level bulk write API (driver 3.0+), using
// the sync client; with the async client, replace `.run()?` with `.await?`.
// `client` and `database` come from the usual connection setup.
use mongodb::bson::{doc, Document};
use mongodb::options::{UpdateOneModel, WriteModel};

let collection = database.collection::<Document>("myCollection");
let namespace = collection.namespace();
let documents_to_update: Vec<Document> = Vec::new(); // TODO: retrieve documents to update
let batch_size = 5000;
let total_docs = documents_to_update.len();
let mut current_doc_index = 0;
let mut num_updated: i64 = 0;

while current_doc_index < total_docs {
    let batch_end = std::cmp::min(current_doc_index + batch_size, total_docs);
    // Build an UpdateOne model for each document in the batch
    let models: Vec<WriteModel> = documents_to_update[current_doc_index..batch_end]
        .iter()
        .map(|d| {
            WriteModel::UpdateOne(
                UpdateOneModel::builder()
                    .namespace(namespace.clone())
                    .filter(doc! { "_id": d.get("_id").cloned().unwrap() })
                    .update(doc! { "$set": { "myField": "myNewValue" } })
                    .build(),
            )
        })
        .collect();
    // Execute the bulk write operation for the batch
    let result = client.bulk_write(models).run()?;
    num_updated += result.modified_count;
    // Move to the next batch
    current_doc_index = batch_end;
}

println!("Total documents updated: {}", num_updated);
PYTHON
from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb://localhost:27017/")
database = client["myDatabase"]
collection = database["myCollection"]

documents_to_update = []  # TODO: retrieve documents to update
batch_size = 5000
total_docs = len(documents_to_update)
current_doc_index = 0
num_updated = 0

while current_doc_index < total_docs:
    updates = []
    # Build the update operation for each document in the batch
    for i in range(current_doc_index, min(current_doc_index + batch_size, total_docs)):
        doc = documents_to_update[i]
        filter_doc = {"_id": doc["_id"]}  # avoid shadowing the builtin filter()
        update = {"$set": {"myField": "myNewValue"}}
        updates.append(UpdateOne(filter_doc, update))
    # Execute the bulk write operation for the batch
    result = collection.bulk_write(updates)
    num_updated += result.modified_count
    # Move to the next batch
    current_doc_index += batch_size

print(f"Total documents updated: {num_updated}")
Note that this requires the PyMongo driver to be installed.