Upgrading app driver causing issues on server

Hello,
This is a just a quick post-mortem that won’t be complete but I hope it can lead at least to some documentation updates.

We have ~20 shards of 3 nodes (P+S+H) running mongod 4.0.19 and our clients were using the java driver 3.8.0. As we planned to upgrade to mongod 5.0, we started by upgrading our java driver to mongodb-driver-legacy:4.3.4 and things did not go as smoothly as expected.

First of all, not a bug per say but a change in behavior that took us some time to diagnostic : our oplog window got reduced by a factor of 5~6 after the client upgrade. Long story short, it took us some time but found out that this is due to the retryable write feature that is now enabled by default (and we knew that was a change of the driver when upgrading) that puts more pressure on the oplog as explained in this readme buried in the source code repo ( mongo/README.md at master · mongodb/mongo · GitHub ).
I think this should really be spoken about in the official documentation ( https://docs.mongodb.com/manual/core/retryable-writes/ ) as this has quite an unexpected impact.

But then, we hit a real issue. I’ll try to give you some infos, but I won’t be able to provide much more or to help with the resolution. The tl;dr is : mongodb-driver-legacy:4.3.4 does not play well with mongod 4.0.19 and causes performance issues.
In no order in particular:

  • we perform a backup every night on the hidden node. The night after the driver upgrade, the backup caused the rs to have its replication lag skyrocket into oblivion (where every other days the backup has literally no impact at all) causing the node to go past its oplog window. => no, we did not upgrade the backup tools (mongodump), only the application’s drivers
  • the “oplog processing time - batch apply” went from 130ms to 450ms (then back down to 250ms after upgrading the servers to 4.2)
  • some ops just disappeared from the secondary stats. We went from (509 command/s, 322 getmore/s and 122 query/s to 28 command/s, 0.5 getmore/s and 14.8 query/s) but then reappeared when we upgraded to 4.2
    stuff.stooit .com/d/1/623196a66edaf/opscounter.png
  • the “time acquiring write lock” skyrocketed from 500ms to 1000 SECS on the secondaries (then back down after upgrade)

Note: I’m not an adminsys and even if I’ve got some knowledge on mongo, I’m not sure how our grafana metric names translate into mongodb’s term (hopefully they are similar and you understand)

So yeah, as I said, I’m not sure you’ll be able to do much about the second issue given what I provide to you, but I’d really appreciate if the first one ends-up in a documentation somewhere!

Thanks.

Hi @Michael_Longo !

Do you happen to use a lot of findAndModify operations? It looks like this could be a reason for the increase of oplog that you noticed but this has been fixed in 5.1 from what I’m seeing.

Also just a side note but I’d definitely recommend to drop the legacy driver and use the sync driver:

With 5.0 and up, you can also benefit from the Stable API which should simplify the upgrade process in the future for you.

Cheers,
Maxime.