Update millions of documents in a collection with hundreds of millions of documents

Hi,

we have a collection with more than 300 million documents which needs to be updated multiple times per day. Each update touches a couple of million documents. A single document looks like this (it represents a social media content metric at a certain timestamp):

{
    _id: ObjectId("61a4d4832850c97dec709291"),
    contentId: "657083495702124",
    granularity: "hour",
    metricTimestamp: ISODate("2021-11-28T16:00:00.000+00:00"),
    sampling: "absolute",
    contentCreatedTimestamp: ISODate("2021-11-22T09:32:28.000+00:00"),
    name: "post_video_views_rate_non_paid_unique",
    key: "value",
    value: 0.09930861093651791
}

The update runs in an AWS Lambda, so we would like it to finish in under 15 minutes and consume less than 10 GB of RAM.

At the moment the update is implemented as a Node.js script. Here is an extract (edited for brevity):

........................

const deltaQueryResults = await executeQuery(deltaQuery);

const bulkOps = deltaQueryResults.map(metric => ({
    updateOne: {
        filter: {
            contentId: metric.contentId,
            granularity: metric.granularity,
            metricTimestamp: metric.metricTimestamp,
            sampling: metric.sampling,
            name: metric.name,
            key: metric.key
        },
        update: {
            $set: {
                contentId: metric.contentId,
                granularity: metric.granularity,
                metricTimestamp: metric.metricTimestamp,
                sampling: metric.sampling,
                name: metric.name,
                key: metric.key
            }
        },
        upsert: true
    }
}));

// ordered: false and the write concern both go in a single options object
// (they were previously passed as two separate arguments).
const bulkWriteOpResult = await metricModel.collection.bulkWrite(bulkOps, {
    ordered: false,
    writeConcern: { w: 0, j: false }
});

....................

The MongoDB instance is an M40 low-CPU tier on Atlas (one cluster with 3 nodes).

Any suggestion on how to make it faster is very welcome!

Thank you
Luigi

Any suggestion on how to make it faster is very welcome!

Have you tried sharding the data?
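
If you do try it, enabling sharding would look roughly like the sketch below. Note that the "analytics.metrics" namespace and the shard key on contentId plus metricTimestamp are assumptions based on your filter, not a tested recommendation, and the collection needs an index on the shard key first.

// mongosh sketch (assumed namespace "analytics.metrics"):
// shard the collection on a compound key taken from the upsert filter.
sh.enableSharding("analytics");
db.metrics.createIndex({ contentId: 1, metricTimestamp: 1 });
sh.shardCollection("analytics.metrics", { contentId: 1, metricTimestamp: 1 });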

I recommend the MongoDB University course on MongoDB Performance; it should take less than a day to complete and will surely give you new ideas on how to improve your database's performance.

To troubleshoot your issue further, I have the following questions:

How long do your queries take? i.e. the filter part of your upsert operation (a quick way to measure this with explain() is sketched after the filter below):

                filter: {
                    contentId: metric.contentId,
                    granularity: metric.granularity,
                    metricTimestamp: metric.metricTimestamp,
                    sampling: metric.sampling,
                    name: metric.name,
                    key: metric.key
                }
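
A sketch of that measurement from mongosh (the collection name "metrics" and the literal values are placeholders taken from your sample document; look at executionTimeMillis and whether the winning plan uses an IXSCAN rather than a COLLSCAN):

// mongosh sketch: explain one of the upsert filters with concrete values.
db.metrics.find({
    contentId: "657083495702124",
    granularity: "hour",
    metricTimestamp: ISODate("2021-11-28T16:00:00.000Z"),
    sampling: "absolute",
    name: "post_video_views_rate_non_paid_unique",
    key: "value"
}).explain("executionStats");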

Is the collection properly indexed? And since maintaining indexes makes write operations less performant, are there any unnecessary indexes on the collection? (See the index sketch below.)
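
A compound index covering all the fields in that filter could look like this (a sketch only; "metrics" is a placeholder collection name and the field order is an assumption you would want to validate against your actual query patterns):

// mongosh sketch: compound index matching the upsert filter;
// equality fields are placed before the timestamp field here.
db.metrics.createIndex({
    contentId: 1,
    granularity: 1,
    sampling: 1,
    name: 1,
    key: 1,
    metricTimestamp: 1
});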

As an alternative, you could also distribute the load across multiple Lambda function calls; see the batching sketch below.
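
A minimal sketch of that idea, assuming the same deltaQueryResults and metricModel as in your script plus a hypothetical buildUpdateOne helper that returns the same updateOne document as your map callback; each chunk could run in the same invocation or be handed to a separate Lambda:

// Sketch: split the delta into fixed-size batches and issue one bulkWrite per
// batch, so memory stays bounded and the work can be spread across invocations.
// BATCH_SIZE is an arbitrary starting point to tune.
const BATCH_SIZE = 10000;

for (let i = 0; i < deltaQueryResults.length; i += BATCH_SIZE) {
    const batch = deltaQueryResults
        .slice(i, i + BATCH_SIZE)
        .map(metric => buildUpdateOne(metric));

    await metricModel.collection.bulkWrite(batch, { ordered: false });
}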

Hi,

thank you for the reply and for pointing me to that performance course. I'll definitely have a look at it.

The indexes were fine; the problem was related to peering. We had set up VPC peering between Atlas and our AWS VPC, but it turns out that Lambdas, by default, run in their own VPC.

We solved this by configuring the Lambda to run in the proper VPC, and after that it was really fast.

We are continuing to test and, even though it is now much faster, the updates still take roughly 25 minutes. Any other suggestions are more than welcome.