MongoDB v5... ingest of public business data. I need to run titleCase() on the $name field. It works fine up to about 400,000 records, then errors out.

I end up with anywhere from 1.5 million to 3 million documents in the collection. Ingest is from public government CSV data aggregated into the gov_data.businesses collection. Everything is ALLCAPS. I aggregated the data into a new collection with the address.city and name fields run through $toLower. Now I need to titleCase those fields. Using address.city instead of name in the code below takes a while (28 minutes) but succeeds. name, however, fails with TypeError: Cannot read properties of undefined (reading 'toUpperCase') after some 400,000 documents, at about 8 minutes in. Feels like a data size issue, but I've no idea. I'm relatively new to aggregations and coding in mongo/mongosh.
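The lowercasing step was roughly like this (a sketch; the target collection name here is illustrative, not the real one):

db.businesses.aggregate([
    { $set: {
        name: { $toLower: "$name" },
        "address.city": { $toLower: "$address.city" }
    } },
    { $out: "businesses_lc" }  // hypothetical target collection name
]);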

I borrowed the script from here: how to update field value to TittleCase in MongoDb?

use gov_data
function titleCase(str) {
    return str && str.toLowerCase().split(/\s/).map(function(word) {
        return word.replace(word[0], word[0].toUpperCase());
    }).join(' ');
}

console.log(titleCase(undefined));
console.log(titleCase(""));
console.log(titleCase(null));
console.log(titleCase("NAMAR"));

db.businesses.aggregate().forEach(function(doc){
    db.businesses.updateMany(
        { "_id": doc._id },
        { "$set": { "name": titleCase(doc.name) } }
    );
});

Updating one by one is not great; at the very least, batch the operations up with a bulk operator and it'll be much faster:

You can send updates to the server in batches of N updates and tune the batch size for best performance.

Something similar to this:

var batchSize = 1000;
var currentUpdates = 0;
var bulk = db.Test.initializeUnorderedBulkOp();
var results;

print(new Date());
// Queue one update per document, committing every batchSize operations
db.getCollection('Test').find({}, { "name3": 1 }).forEach(theDocument => {
    currentUpdates++;
    var filter = { _id: theDocument._id };
    var update = { $set: { name3: titleCase(theDocument.name3) } };

    bulk.find(filter).update(update);

    if ((currentUpdates % batchSize) == 0) {
        print(`Progress: ${currentUpdates}, committing`);
        results = bulk.execute();
        // Start a fresh bulk operator for the next batch
        bulk = db.Test.initializeUnorderedBulkOp();
    }
});

// Flush whatever is left in the final partial batch
if ((currentUpdates % batchSize) > 0) {
    print(`${currentUpdates % batchSize} remaining in bulk, committing`);
    results = bulk.execute();
}

print(`Done`);
print(new Date());

Running tests on a local collection with 600,000 records, the timings were:
Single Updates: 452s
Batch of 1,000: 104s
Batch of 10,000: 96s

The error you’re getting looks like it’s due to whitespace in the input, e.g.

print(titleCase(' '))

Gives:

TypeError: Cannot read properties of undefined (reading 'toUpperCase')

So one of your documents has a field value with a leading, trailing, or doubled space; split(/\s/) then produces an empty string, whose character at [0] is undefined. You need a base case in the function to catch this before trying to access the first character:

function titleCase(str) {
    return str && str.toLowerCase().split(/\s/).map(function(word) {
        // Guard against empty strings produced by leading, trailing,
        // or consecutive whitespace
        if (word.length == 0) {
            return '';
        } else {
            return word.replace(word[0], word[0].toUpperCase());
        }
    }).join(' ');
}
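With the guard in place, all the earlier test cases, including the whitespace one, run without throwing:

console.log(titleCase(undefined)); // undefined
console.log(titleCase(""));        // ""
console.log(titleCase(null));      // null
console.log(titleCase(" "));       // " " (previously threw)
console.log(titleCase("NAMAR"));   // "Namar"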

To add to this: we capture the results above:

results = bulk.execute();

…you can look at the return object and collate the updates across batches to see how many matches and updates were performed, as a sanity check. The return object looks like this:

{
  acknowledged: true,
  insertedCount: 0,
  insertedIds: [],
  matchedCount: 1,
  modifiedCount: 1,
  deletedCount: 0,
  upsertedCount: 0,
  upsertedIds: []
}
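A minimal way to collate those counts across batches (a sketch that assumes the same loop as above):

var totalMatched = 0;
var totalModified = 0;

// After each bulk.execute(), accumulate the counts:
results = bulk.execute();
totalMatched += results.matchedCount;
totalModified += results.modifiedCount;

// At the end, compare the totals against the number of queued updates:
print(`Queued: ${currentUpdates}, matched: ${totalMatched}, modified: ${totalModified}`);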

Thanks. You are da man… that worked perfectly. I just didn’t know the allowable code. I’m still thinking database…
I had issues with the console in DataGrip (Arity Error), but this worked in mongosh.
Thanks again.

Glad to help! Weird that the SO post didn’t take into account the base case!

Hopefully running it as a bulk operation will make the update faster for you.

John

I refactored it, complete with an initial aggregation to create the collection from the source, then ran two separate batches: one for the name field and one for the city field (this is really just data beautification).
The entire script runs in less than 19 minutes, whereas seeding it all through the API took upwards of 70 hours. I'd say that's an improvement…
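The two batches are essentially the bulk loop from above run once per field. A sketch of how that can be wrapped into a reusable helper (the helper name and the businesses_lc collection name are mine, not from the original script):

function titleCaseField(collName, fieldName, batchSize) {
    var coll = db.getCollection(collName);
    var bulk = coll.initializeUnorderedBulkOp();
    var queued = 0;
    var projection = {};
    projection[fieldName] = 1;

    coll.find({}, projection).forEach(doc => {
        // Resolve a possibly dotted path such as "address.city"
        var value = fieldName.split('.').reduce((o, k) => o && o[k], doc);
        if (typeof value !== "string") return; // skip missing/non-string values

        var update = { $set: {} };
        update.$set[fieldName] = titleCase(value);
        bulk.find({ _id: doc._id }).update(update);
        queued++;

        if (queued % batchSize == 0) {
            bulk.execute();
            bulk = coll.initializeUnorderedBulkOp();
        }
    });

    // Flush the final partial batch
    if (queued % batchSize > 0) {
        bulk.execute();
    }
    return queued;
}

titleCaseField("businesses_lc", "name", 1000);
titleCaseField("businesses_lc", "address.city", 1000);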
Thanks for your help again.
