Upsert lat lng from city-data collection into customer-data collection based on city, state

I have a full US city-data collection with latitude and longitude coordinates…
I also have over 1.5 million address records for customers. I need to use the combination of city/state as a key and upsert the longitude, latitude data into my geo field for the associated customer city/state.

Structured data sets… relatively easy, but I’m new to unstructured data. The two collections are on the same cluster, in different databases. I get that I need to have the collections in the same DB (it seems silly not to be able to merge data from different DBs, but so be it). I just don’t know how to upsert from many to many in MongoDB. Appreciate any help.

For example… My city_geo collection has:

{
  "_id": {
    "$oid": "63c8c45228636a53ac28cfd7"
  },
  "address": {
    "city": "Cheyenne",
    "region": "WY"
  },
  "geo": {
    "lat": "41.135",
    "lon": "-104.7902"
  }
}

My customer collection has a number of other fields, but it has the same ‘address’ object (city, region) while its ‘geo’ field is null. I need to update the geo field in the customer data with the geo object from the city_geo collection by matching on the ‘address’ object. The city_geo collection is an aggregation of the original city_data collection (which has 43 string fields), written to the same database as the customer collection and keeping just the two objects shown above.
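A customer document looks roughly like this (other fields omitted; ‘name’ is just a placeholder here):

{
  "_id": {
    "$oid": "..."
  },
  "name": "...",
  "address": {
    "city": "Cheyenne",
    "region": "WY"
  },
  "geo": null
}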

Something like this seems to work, but it is taking forever and I don’t know if it will finish. I’m sure that there is a much better way to do it…

db.customers.find().forEach(function (doc1) {
    // match on the nested address fields in city_geo
    var doc2 = db.city_geo.findOne(
        { "address.city": doc1.address.city, "address.region": doc1.address.region },
        { geo: 1 }
    );
    if (doc2 != null) {
        // the customer geo field starts out null, so assign the whole object
        doc1.geo = { lat: doc2.geo.lat, lon: doc2.geo.lon };
        db.customers.save(doc1);
    }
});

Since you are calling

db.customers.save(doc1);

I assume that you use mongoose.

The main issue here is that you process every document one by one. Slow, very slow. You need to use a bulk operation.
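At a minimum you could keep your loop but batch the writes with bulkWrite. A rough, untested sketch (the batch size of 1000 is arbitrary, and it still runs one findOne per customer, so it only fixes the write side):

let ops = [];
db.customers.find({}, { address: 1 }).forEach(function (cust) {
    const geoDoc = db.city_geo.findOne(
        { "address.city": cust.address.city, "address.region": cust.address.region },
        { _id: 0, geo: 1 }
    );
    if (geoDoc != null) {
        // queue the update instead of writing immediately
        ops.push({
            updateOne: {
                filter: { _id: cust._id },
                update: { $set: { geo: geoDoc.geo } }
            }
        });
    }
    // flush a batch of queued updates in a single round trip
    if (ops.length === 1000) {
        db.customers.bulkWrite(ops, { ordered: false });
        ops = [];
    }
});
if (ops.length > 0) {
    db.customers.bulkWrite(ops, { ordered: false });
}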

A better way is to do everything server-side with an aggregation pipeline that performs a $lookup and terminates with a $merge. Something along these untested lines:

/* I want to simplify the $merge by having less data to merge */
project_address = { "$project" : {
    "address" : 1
} }
/* requires MongoDB 5.0+ to combine localField/foreignField with pipeline */
lookup = { "$lookup" : {
    "from" : "city_geo" ,
    "localField" : "address" ,
    "foreignField" : "address" ,
    "as" : "geo" ,
    "pipeline" : [ { "$project" : { "_id" : 0 , "geo" : 1 } } ]
} }
/* the output of $lookup is an array of documents but I want just the geo object */
project_geo = { "$project" : {
    "geo" : { "$arrayElemAt" : [ "$geo.geo" , 0 ] }
} }
merge = { "$merge" : {
    "into" : "customers" ,
    "on" : "_id"
} }
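You would then run all four stages in one shot (again untested):

/* $merge writes the geo field back into the matching customers documents, so nothing comes back to the client */
db.customers.aggregate(
    [ project_address , lookup , project_geo , merge ] ,
    { allowDiskUse : true }
)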

The second issue is that you definitely need the unique index { address : 1 } on the city_geo collection, otherwise the $lookup has to scan city_geo for every customer document.
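In mongosh that is (assuming the address sub-documents hold only city and region):

/* unique index on the embedded address document used as the $lookup foreignField */
db.city_geo.createIndex( { "address" : 1 } , { "unique" : true } )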

OMGoodness… My notification didn’t come across. I just checked this.
Steeve, you are the man. That is exactly the push I needed… I’ve just been sitting here with FOTU (fear of the unknown).
I reverted this to a mongosh js file, but followed the same pattern. It worked like a charm and completed in 36 minutes. AND… fantastic call on the unique index; what a difference that made.

Thanks.

