Bulk operation performance with bulk.find().upsert().updateOne()

Hi

We are processing historical quote data from files (millions of rows). I have followed fixed-size bucketing (1000 quotes per document) for time series data, as per scenario #3 in MongoDB's New Time Series Collections | MongoDB.

So to push data to MongoDB I am sending 10000 operations (bulk.find().upsert().updateOne()) between the bulk initialization (tried both ordered and unordered) and bulk.execute() (tried both with and without write concern), but performance is significantly slow: it is pushing only about 1000 quotes per second.

Here is an example of the generated query that is pushed to MongoDB through Java (for clarity I have included 5 operations here instead of 10000). Any suggestions as to why it is so slow?

var bulk=db.quote_date.initializeOrderedBulkOp(); 

bulk.find({ric:"LCOJ1-V1",utc_date:ISODate("2020-08-13T00:00:00.000000000Z"),nquotes:{$lt:1000}}).upsert().updateOne(
{
	$push:{quotes:{utc_date_time:NumberInt("1597337417"),nano_seconds:NumberInt(784325554),ask_price:-0.98,ask_size:NumberInt(4),bid_price:-1.0,bid_size:NumberInt(4),qualifiers:["[BID_TONE]","[ASK_TONE]"]}},
	$setOnInsert:{exch_utc_offset:NumberInt(1)},
	$min:{quote_first:NumberLong("1597337417784")},
	$max:{quote_last:NumberLong("1597337417784")},
	$inc :{nquotes:NumberInt(1)}
});

bulk.find({ric:"LCOJ1-V1",utc_date:ISODate("2020-08-13T00:00:00.000000000Z"),nquotes:{$lt:1000}}).upsert().updateOne(
{
	$push:{quotes:{utc_date_time:NumberInt("1597337606"),nano_seconds:NumberInt(436207836),ask_price:-0.97,ask_size:NumberInt(4),bid_price:-1.0,bid_size:NumberInt(4)}},
	$setOnInsert:{exch_utc_offset:NumberInt(1)},
	$min:{quote_first:NumberLong("1597337606436")},
	$max:{quote_last:NumberLong("1597337606436")},
	$inc :{nquotes:NumberInt(1)}
});

bulk.find({ric:"LCOJ1-V1",utc_date:ISODate("2020-08-13T00:00:00.000000000Z"),nquotes:{$lt:1000}}).upsert().updateOne(
{
	$push:{quotes:{utc_date_time:NumberInt("1597337635"),nano_seconds:NumberInt(967713742),ask_price:-0.98,ask_size:NumberInt(4),bid_price:-1.0,bid_size:NumberInt(4)}},
	$setOnInsert:{exch_utc_offset:NumberInt(1)},
	$min:{quote_first:NumberLong("1597337635967")},
	$max:{quote_last:NumberLong("1597337635967")},
	$inc :{nquotes:NumberInt(1)}
});

bulk.find({ric:"LCOJ1-V1",utc_date:ISODate("2020-08-13T00:00:00.000000000Z"),nquotes:{$lt:1000}}).upsert().updateOne(
{
	$push:{quotes:{utc_date_time:NumberInt("1597337940"),nano_seconds:NumberInt(812651241),ask_price:-0.98,ask_size:NumberInt(4),qualifiers:["[BID_TONE]"]}},
	$setOnInsert:{exch_utc_offset:NumberInt(1)},
	$min:{quote_first:NumberLong("1597337940812")},
	$max:{quote_last:NumberLong("1597337940812")},
	$inc :{nquotes:NumberInt(1)}
});

bulk.find({ric:"LCOJ1-V1",utc_date:ISODate("2020-08-13T00:00:00.000000000Z"),nquotes:{$lt:1000}}).upsert().updateOne(
{
	$push:{quotes:{utc_date_time:NumberInt("1597337940"),nano_seconds:NumberInt(832562892),ask_price:-0.99,ask_size:NumberInt(4)}},
	$setOnInsert:{exch_utc_offset:NumberInt(1)},
	$min:{quote_first:NumberLong("1597337940832")},
	$max:{quote_last:NumberLong("1597337940832")},
	$inc :{nquotes:NumberInt(1)}
});

bulk.execute({w:0});

Hi @Dhruvesh_Patel,

What is the version of MongoDB?

For replica set bulk updates, do not use w:0 to speed up writes; I would even recommend sticking with w: majority. Using w:0 is probably causing the replicas to lag, resulting in worse overall performance.

Additionally, make sure the fields in the update's filter criteria are covered by a compound index.
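For example (a sketch using the collection and field names from your sample query):

// Compound index covering the bulk update filter fields
db.quote_date.createIndex({ric: 1, utc_date: 1, nquotes: 1});

// Execute the bulk with majority write concern rather than w:0
bulk.execute({w: "majority"});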

MongoDB splits bulk writes into batches of 1000 operations even if you specify a higher number.

Why are all the updates repeating?

Best
Pavel

Hi

Thank you for getting back to me on this issue.

Please see the following comments:

  1. Currently we are using a standalone MongoDB Community Edition in UAT, which is 4.0.2 (without replication). Our test data is from the UAT environment.
    Soon we will test in production, where we have MongoDB 4.2.3.
  2. OK, I will revert to w: majority. It was still slow when we originally used w: majority, though.
  3. Yes, I have created a compound index on the find() fields: {ric:1, utc_date:1, nquotes:1}.
  4. Yes, I am sending 10000 operations per bulk even though MongoDB internally creates batches of 1000.
  5. These are time series data of quote prices, which is why the updates repeat with values for different times.
    I have split the quote time into "utc_date_time" (seconds) and "nano_seconds", because MongoDB can't store nanosecond-precision timestamps (see the sketch after this list).
    I have followed "Scenario 3: Size-based bucketing" as per MongoDB's New Time Series Collections | MongoDB.
  6. Will multiple $push operations cause the document to reallocate space on every $push, or is it an in-place update to the array for an existing document?
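Here is a sketch of how one quote's timestamp is split (values taken from the first sample update above; the combined nanosecond value and the variable names are only illustrative):

// Original timestamp: 1597337417784325554 nanoseconds since the epoch (illustrative)
var utc_date_time = NumberInt("1597337417");     // whole seconds since the epoch
var nano_seconds = NumberInt(784325554);         // nanosecond remainder within that second
var quote_millis = NumberLong("1597337417784");  // millisecond value used for quote_first / quote_last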

Please let me know if you need any additional information.

Thanks

Hi @Dhruvesh_Patel,

One additional question: does the order matter, or could you consider initialising an unordered bulk update?

How much data is in the quote_date collection? Can you partition the data into several collections with a naming convention (weekly or monthly collections)?
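For example (a sketch; the monthly collection name is only illustrative):

// Unordered bulk against a monthly collection such as quote_date_2020_08,
// so the server is free to apply the operations in any order
var bulk = db.quote_date_2020_08.initializeUnorderedBulkOp();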

4.0.2 is an old version; I would use 4.0.20 or 4.2.10, as they include many performance improvements!

Best
Pavel

I have tried both ordered and unordered bulk operations and see the same performance issue. We will soon try in production, which has version 4.2.3, so we will confirm whether the newer version changes the performance.

We don't need the documents in a specific order, but the quotes within a document's "quotes" array should be in a specific order since it is time series data. The "quotes" array holds 1000 quotes. We can change the number of quotes in the array since we are using fixed-size bucketing.

We see the performance issue even with an empty collection, so I am not sure weekly/monthly collections would help us at this point.

Hi @Dhruvesh_Patel,

OK, now that you say the array in a document can grow to 1000 elements, it makes sense that updates are slow…

When MongoDB performs an update to an array, it needs to deserialize and serialize the whole array for each update.

Consider limiting the array to 100 elements per document.
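For example, the first operation from your sample with buckets capped at 100 quotes (only the nquotes bound in the filter changes):

// Same bucket filter, but capped at 100 quotes per document instead of 1000
bulk.find({ric:"LCOJ1-V1",utc_date:ISODate("2020-08-13T00:00:00.000000000Z"),nquotes:{$lt:100}}).upsert().updateOne(
{
	$push:{quotes:{utc_date_time:NumberInt("1597337417"),nano_seconds:NumberInt(784325554),ask_price:-0.98,ask_size:NumberInt(4),bid_price:-1.0,bid_size:NumberInt(4),qualifiers:["[BID_TONE]","[ASK_TONE]"]}},
	$setOnInsert:{exch_utc_offset:NumberInt(1)},
	$min:{quote_first:NumberLong("1597337417784")},
	$max:{quote_last:NumberLong("1597337417784")},
	$inc:{nquotes:NumberInt(1)}
});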

Best
Pavel

We tried the solution you suggested, but there was no significant improvement in production. With 1000 quotes in the array per document we are able to save 1684 quotes/second, but with 100 quotes in the array per document we are able to save 4083 quotes/second.

Hi @Dhruvesh_Patel,

Consider upgrading the version and your hardware; that should speed up the upserts.

One additional comment: I noticed you are setting

$min:{quote_first:NumberLong("1597337417784")},
	$max:{quote_last:NumberLong("1597337417784")},

with a hardcoded value in each upsert command. This is unneeded overhead and traffic. You can set those values in the $setOnInsert clause without running $max or $min, or at least run them only once.
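For illustration, a sketch of that idea based on your first operation (this assumes the first/last values only need to be written when the bucket document is created):

bulk.find({ric:"LCOJ1-V1",utc_date:ISODate("2020-08-13T00:00:00.000000000Z"),nquotes:{$lt:1000}}).upsert().updateOne(
{
	$push:{quotes:{utc_date_time:NumberInt("1597337417"),nano_seconds:NumberInt(784325554),ask_price:-0.98,ask_size:NumberInt(4),bid_price:-1.0,bid_size:NumberInt(4),qualifiers:["[BID_TONE]","[ASK_TONE]"]}},
	// quote_first / quote_last written only when the bucket document is first inserted
	$setOnInsert:{exch_utc_offset:NumberInt(1),quote_first:NumberLong("1597337417784"),quote_last:NumberLong("1597337417784")},
	$inc:{nquotes:NumberInt(1)}
});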

Best
Pavel

We can ask the DBA for a version upgrade (but it's a long shot depending on priorities, and we have to be sure why we are asking for it). However, we adopted MongoDB in our company almost a year ago, so our hardware follows MongoDB's recommendations.
Also, the MongoDB site shows other use cases where time series data was adopted successfully, and those were done with previous versions of MongoDB.

Also, if you look at scenario #3 in MongoDB's New Time Series Collections | MongoDB, the "quote_first" and "quote_last" values should be updated every time a new quote is pushed to the "quotes" array. I am generating the bulk upsert query on the fly, so those values are part of the query before it is executed on MongoDB. So $setOnInsert will not work in this scenario.

$min:{quote_first:NumberLong("1597337417784")},
$max:{quote_last:NumberLong("1597337417784")},

Thanks

Hi @Dhruvesh_Patel,

Further assistance requires a specific environment investigation to identify which resources are blocking you from proceeding.

This kind of investigation is best covered by our Support subscriptions.

Thanks,
Pavel