Losing oplog entries when upgrading a replica set from 4.0 to 4.4

I have an application that runs change streams on various collections against a 3-node replica set that was on 4.0.28. While upgrading the DB deployment from 4.0 => 4.2 => 4.4 (we can’t upgrade directly to 4.4), the application began throwing the following error:

Resume of change stream was not possible, as the resume point may no longer be in the oplog., ‘stack’: MongoError: Resume of change stream was not possible, as the resume point may no longer be in the oplog.

The error is ChangeStreamHistoryLost, code 286.
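
A minimal sketch of how an error handler could recognize this specific failure. It assumes the driver error object exposes a numeric `code` and/or a `codeName` property; the `DriverErrorLike` interface and the `isHistoryLost` helper are hypothetical names for illustration, not part of the driver.

```typescript
// Minimal shape of a driver error. This is an assumed structure for the
// sketch, not a type imported from the MongoDB driver.
interface DriverErrorLike {
  code?: number;
  codeName?: string;
  message?: string;
}

// Error code quoted above for ChangeStreamHistoryLost.
const CHANGE_STREAM_HISTORY_LOST = 286;

// Hypothetical helper: true when the error indicates that the resume point
// is no longer available in the oplog.
function isHistoryLost(err: DriverErrorLike): boolean {
  return (
    err.code === CHANGE_STREAM_HISTORY_LOST ||
    err.codeName === "ChangeStreamHistoryLost"
  );
}
```

Matching on the code rather than the message text avoids depending on the exact wording of the error string.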

I’m trying to identify the root cause of the issue but it’s difficult to replicate because:

  1. Atlas won’t allow another deployment on 4.0.x
  2. This issue doesn’t always occur. Another staging environment was upgraded and didn’t experience this.

I don’t believe it’s an incompatibility between our app and our DB deployment, since re-deploying the app fixed the issue.

I noticed that during our version upgrades, the primary was restarted and a secondary was elected as the new primary. Is it possible that oplog entries are lost during the election of a new primary?
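
One way to test that hypothesis would be to compare the application's resume point against the oldest entry still in the oplog. The oplog is a capped collection, so its oldest entries are removed as new ones arrive. The sketch below assumes both timestamps have already been read out as epoch seconds (for example, the `ts` of the oldest `local.oplog.rs` entry and the `clusterTime` of the last change event the app processed); both function names are hypothetical.

```typescript
// Both arguments are seconds since the Unix epoch.
// oldestOplogSeconds: time of the oldest entry still present in the oplog.
// resumePointSeconds: cluster time of the last change event the app processed.
function resumePointStillInOplog(
  oldestOplogSeconds: number,
  resumePointSeconds: number,
): boolean {
  return resumePointSeconds >= oldestOplogSeconds;
}

// How far (in seconds) the resume point is ahead of the oldest oplog entry.
// A negative value means the resume point has already fallen off the oplog.
function resumeHeadroomSeconds(
  oldestOplogSeconds: number,
  resumePointSeconds: number,
): number {
  return resumePointSeconds - oldestOplogSeconds;
}
```

If the headroom on the newly elected primary turns out to be negative at the time of the error, that would be consistent with the resume point having aged out rather than with an application bug.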

Hi @Richard_DeAvila,

I can’t say with 100% certainty what happened without reviewing the logs and diagnostic data of that specific deployment, but I’m afraid you hit a known compatibility change in the way change stream resumability works between versions, mainly 4.0 to 4.2.

I suspect that the failover caused an invalidate event that prevented a proper resume. You should verify that you are using the current methods and code to ensure resumability as designed in 4.4.

When you restarted the change streams, there was no need to resume, which is why it worked.
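
As a rough illustration of that difference, here is a sketch of how an app could choose the options for its next attempt to open a stream: keep resuming from the saved token, unless the last failure was the history-lost error (code 286), in which case start fresh without a token. The `resumeAfter` option name mirrors the driver's `watch()` option; `nextWatchOptions` and the two interfaces are hypothetical names, and nothing here calls the driver itself.

```typescript
// Assumed minimal shapes for this sketch; not types from the MongoDB driver.
interface ErrorWithCode {
  code?: number;
}
interface ResumeOptions {
  resumeAfter?: unknown;
}

// ChangeStreamHistoryLost, as reported in the original post.
const HISTORY_LOST_CODE = 286;

// Decide which options to use for the next attempt to open the stream.
// - History lost: drop the token and start a fresh stream. Events between
//   the old resume point and now are NOT replayed.
// - Otherwise: resume from the saved token if there is one.
function nextWatchOptions(
  savedToken: unknown,
  lastError?: ErrorWithCode,
): ResumeOptions {
  if (lastError?.code === HISTORY_LOST_CODE) {
    return {};
  }
  if (savedToken === undefined || savedToken === null) {
    return {};
  }
  return { resumeAfter: savedToken };
}
```

In a Node.js app the result would be passed as the options argument of `collection.watch(pipeline, options)`. The empty-options branch corresponds to what a redeploy does implicitly, so any events missed in the gap would need to be reconciled some other way.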

Please contact Atlas support for further investigation and to verify that your code is not error-prone.

Thanks
Pavel