MongoDB v5... ingest of public business data. I need to run titleCase() on the $name field. It works fine up to about 400,000 records, then errors out.

I end up with anywhere from 1.5 million to 3 million documents in the collection. Ingest is from public government CSV data aggregated into the gov_data.businesses collection. Everything is ALLCAPS. I aggregated the data into a new collection with the address.city and name fields run through $toLower. Now I need to titleCase those fields. Using address.city instead of name in the code below takes a while (28 minutes) but succeeds. name, however, fails with TypeError: Cannot read properties of undefined (reading 'toUpperCase') after some 400,000 documents, at about 8 minutes in. Feels like a data size issue, but I've no idea. I'm relatively new to aggregations and coding in mongo/mongosh.
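The lowercasing step was roughly like this (a sketch; the target collection name here is illustrative, not the real one):

db.businesses.aggregate([
    { $set: {
        name: { $toLower: "$name" },
        "address.city": { $toLower: "$address.city" }
    } },
    { $out: "businesses_lc" }  // hypothetical target collection name
]);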

I borrowed the script from here: how to update field value to TittleCase in MongoDb?

use gov_data
function titleCase(str) {
    return str && str.toLowerCase().split(/\s/).map(function(word) {
        return word.replace(word[0], word[0].toUpperCase());
    }).join(' ');
}

console.log(titleCase(undefined));
console.log(titleCase(""));
console.log(titleCase(null));
console.log(titleCase("NAMAR"));

db.businesses.aggregate().forEach(function(doc){
    db.businesses.updateMany(
        { "_id": doc._id },
        { "$set": { "name": titleCase(doc.name) } }
    );
});

Updating one by one is not great; at the very least, batch the operations up with a bulk operator and it'll be much faster:

You can send updates to the server in batches of N updates and tune the batch size for best performance.

Something similar to this:

var batchSize = 1000;
var currentUpdates = 0;
var bulk = db.Test.initializeUnorderedBulkOp();
var results;

print(new Date());
// Queue one update per document, committing every batchSize operations
db.getCollection('Test').find({}, { "name3": 1 }).forEach(theDocument => {
    currentUpdates++;
    var filter = { _id: theDocument._id };
    var update = { $set: { name3: titleCase(theDocument.name3) } };

    bulk.find(filter).update(update);

    if ((currentUpdates % batchSize) == 0) {
        print(`Progress: ${currentUpdates}, committing`);
        results = bulk.execute();
        // Start a fresh bulk operator for the next batch
        bulk = db.Test.initializeUnorderedBulkOp();
    }
});

// Flush whatever is left in the final partial batch
if ((currentUpdates % batchSize) > 0) {
    print(`${currentUpdates % batchSize} remaining in bulk, committing`);
    results = bulk.execute();
}

print(`Done`);
print(new Date());

Running tests on a local collection with 600,000 records, the timings were:
Single Updates: 452s
Batch of 1,000: 104s
Batch of 10,000: 96s

The error you’re getting looks like it’s due to whitespace in the input, e.g.

print(titleCase(' '))

Gives:

TypeError: Cannot read properties of undefined (reading 'toUpperCase')

So one of your documents has a field value with a leading, trailing, or doubled space; split(/\s/) then produces an empty string, whose character at [0] is undefined. You need a base case in the function to catch this before trying to access the first character:

function titleCase(str) {
    return str && str.toLowerCase().split(/\s/).map(function(word) {
        // Guard against empty strings produced by leading, trailing,
        // or consecutive whitespace
        if (word.length == 0) {
            return '';
        } else {
            return word.replace(word[0], word[0].toUpperCase());
        }
    }).join(' ');
}
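With the guard in place, all the earlier test cases, including the whitespace one, run without throwing:

console.log(titleCase(undefined)); // undefined
console.log(titleCase(""));        // ""
console.log(titleCase(null));      // null
console.log(titleCase(" "));       // " " (previously threw)
console.log(titleCase("NAMAR"));   // "Namar"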

To add to this: we capture the results above:

results = bulk.execute();

…you can look at the return object and collate the updates across batches to see how many matches and updates were performed, as a sanity check. The return object looks like this:

{
  acknowledged: true,
  insertedCount: 0,
  insertedIds: [],
  matchedCount: 1,
  modifiedCount: 1,
  deletedCount: 0,
  upsertedCount: 0,
  upsertedIds: []
}
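A minimal way to collate those counts across batches (a sketch that assumes the same loop as above):

var totalMatched = 0;
var totalModified = 0;

// After each bulk.execute(), accumulate the counts:
results = bulk.execute();
totalMatched += results.matchedCount;
totalModified += results.modifiedCount;

// At the end, compare the totals against the number of queued updates:
print(`Queued: ${currentUpdates}, matched: ${totalMatched}, modified: ${totalModified}`);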

Thanks. You are da man… that worked perfectly. I just didn’t know the allowable code. I’m still thinking database…
I had issues with the console in DataGrip (Arity Error), but this worked in mongosh.
Thanks again.

Glad to help! Weird that the SO post didn’t take into account the base case!

Hopefully running it as a bulk operation will make the update faster for you.

John

I refactored it, complete with an initial aggregation to create the collection from the source, then ran two separate batches: one for the name field and one for the city field (this is really just data beautification).
The entire script runs in less than 19 minutes, whereas seeding it all through the API took upwards of 70 hours. I'd say that's an improvement…
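The two batches are essentially the bulk loop from above run once per field. A sketch of how that can be wrapped into a reusable helper (the helper name and the businesses_lc collection name are mine, not from the original script):

function titleCaseField(collName, fieldName, batchSize) {
    var coll = db.getCollection(collName);
    var bulk = coll.initializeUnorderedBulkOp();
    var queued = 0;
    var projection = {};
    projection[fieldName] = 1;

    coll.find({}, projection).forEach(doc => {
        // Resolve a possibly dotted path such as "address.city"
        var value = fieldName.split('.').reduce((o, k) => o && o[k], doc);
        if (typeof value !== "string") return; // skip missing/non-string values

        var update = { $set: {} };
        update.$set[fieldName] = titleCase(value);
        bulk.find({ _id: doc._id }).update(update);
        queued++;

        if (queued % batchSize == 0) {
            bulk.execute();
            bulk = coll.initializeUnorderedBulkOp();
        }
    });

    // Flush the final partial batch
    if (queued % batchSize > 0) {
        bulk.execute();
    }
    return queued;
}

titleCaseField("businesses_lc", "name", 1000);
titleCaseField("businesses_lc", "address.city", 1000);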
Thanks for your help again.
