TranslatorFatalError with Realm Sync

Hi there,

Late last night I had this error: “TranslatorFatalError - encountered non-recoverable resume token error. Sync cannot be resumed from this state and must be terminated and re-enabled to continue functioning: (ChangeStreamHistoryLost) Resume of change stream was not possible, as the resume point may no longer be in the oplog.” Sync has been paused and I am unable to restart it. I have also tried terminating Sync and starting it again, but with no success.

I am not sure where the error came from as there were no requests around that time. Is anybody aware of what the cause of this issue is and how it can be resolved?

Many thanks

Will

Is anyone from the Realm team able to help with this? I am still facing the issue and cannot figure out the cause.

Is there anyone who can provide some inputs here?

Same here … +1 … this happened on my M0 cluster for no reason at all and I am unable to recover.

Hello,

Before answering the questions, let’s discuss some of the terminology mentioned in the error.

Translator
This refers to an internal automated process called the Sync Translator, whose role is to translate Realm data on the client into Atlas data, and vice versa. This process creates and executes instructions in the Sync metadata that allow syncing to take place between mobile devices and Atlas.

Change Stream
A change stream is a MongoDB feature that lets applications watch collections for real-time data changes recorded in the oplog (operations log). The Sync Translator uses change streams to check for writes that occur and translate them into instructions so that documents/objects can be synced between the Sync client and your Atlas database(s). Similarly, Triggers use change streams to watch for data changes and fire executions on the operation types configured in the trigger.
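
For context, this is what a change stream looks like from an application's point of view. Below is a minimal sketch using the Node.js MongoDB driver; the connection string, database, and collection names are placeholders:

```typescript
import { MongoClient } from "mongodb";

async function main() {
  // Placeholder connection string and namespace, for illustration only
  const client = new MongoClient("mongodb+srv://<your-cluster-uri>");
  await client.connect();
  const tasks = client.db("todo").collection("Task");

  // Open a change stream on the collection; every event the server delivers
  // carries a resume token in its `_id` field
  for await (const change of tasks.watch()) {
    console.log(change.operationType, change._id); // `_id` is the resume token
  }
}

main().catch(console.error);
```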

Resume Token
A resume token marks a point in time in your oplog from which the change stream continues processing change events. If the resume token cannot be found, the Sync Translator or a Trigger will not know from which point in time it needs to continue processing change events.
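
To make the token concrete, here is a hedged sketch (again using the Node.js driver) of resuming a change stream from a stored token, and of the error the server returns when that token has already aged out of the oplog. The loadToken/saveToken helpers are hypothetical placeholders for whatever persistence you use:

```typescript
import { MongoClient, MongoServerError } from "mongodb";
import { loadToken, saveToken } from "./tokenStore"; // hypothetical persistence helpers

async function tail() {
  const client = new MongoClient("mongodb+srv://<your-cluster-uri>"); // placeholder URI
  await client.connect();
  const tasks = client.db("todo").collection("Task");

  try {
    // Resume processing change events from the last token we persisted
    const stream = tasks.watch([], { resumeAfter: await loadToken() });
    for await (const change of stream) {
      // ...process the change event...
      await saveToken(stream.resumeToken); // remember how far we got
    }
  } catch (err) {
    // If the token has fallen out of the oplog, the server refuses to resume
    // with error code 286 (ChangeStreamHistoryLost). The only way forward is
    // to open a new stream without a token and accept the gap in history.
    if (err instanceof MongoServerError && err.code === 286) {
      // re-open the stream here without `resumeAfter`
    } else {
      throw err;
    }
  }
}
```

Sync and Triggers do the equivalent of this internally, which is why losing the token leaves no way to pick up exactly where processing left off.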

If the Sync Translator loses the token, a non-recoverable error can be thrown and Sync must be terminated and re-enabled in order for a new token to be created. This means the metadata instructions that have been created thus far need to be cleared/reset and rebuilt again from what is in Atlas. Existing clients will also need to undergo a Client Reset to continue using Sync. For this reason, we recommend including client reset handling in your app to take care of this automatically.
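
As a rough sketch of what that handling can look like in a sync configuration, here is an example using the Realm JS SDK. The app ID, partition value, and schema are placeholders, and the clientReset property names follow the v11-style API; these names have changed between SDK versions, so treat this as an assumption and check the docs for the SDK and version you use:

```typescript
import Realm from "realm";

// Placeholder app ID and partition value, for illustration only
const app = new Realm.App({ id: "<your-app-id>" });

const config: Realm.Configuration = {
  schema: [/* your object schemas */],
  sync: {
    user: app.currentUser!,
    partitionValue: "<your-partition>",
    clientReset: {
      // Discard un-synced local changes and re-download state from Atlas
      mode: Realm.ClientResetMode.DiscardUnsyncedChanges,
      onBefore: (localRealm) => {
        // Back up anything you cannot afford to lose before the reset
      },
      onAfter: (localRealm, remoteRealm) => {
        // Restore or merge backed-up data once the reset has completed
      },
    },
  },
};

Realm.open(config).then((realm) => {
  // Use `realm` as usual; client resets are handled by the callbacks above
});
```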

Similarly, if a trigger loses its resume token it will not know from which point in time it needs to process events; the trigger will go into a suspended state and must be restarted without a resume token. Unfortunately, this means the trigger will not be able to process change events that occurred prior to the restart and can only fire on new events. For further information, please see the other root causes for which a trigger can be suspended.

Cause of ChangeStreamHistoryLost errors
As discussed above, this error usually means the change stream used by the translator process could not find its resume token. The most common cause is insufficient cluster resources. We recommend a general minimum of an M10 cluster for a production Realm app using Sync, and a best-practice minimum of an M30 cluster (or greater, depending on your needs). A dedicated tier ensures your app is not affected by other tenants that utilise the shared resources of a shared cluster. Please ensure your app uses a dedicated cluster tier before deploying the Realm app live into production; if you choose to later upgrade from a shared tier to a dedicated tier cluster, you will need to terminate Sync as part of the process, which can be an inconvenience to your users.

If you have a dedicated cluster tier and still experience this error, it is most likely because the oplog size on the cluster is insufficient. The oplog size determines how much room there is for change events to be written to the oplog. A surge of writes in your cluster causes more entries to be written to the oplog and may push the resume token outside of the Oplog Replication Window (a graph you can find in your Cluster Metrics). Please increase your oplog size so that there is at least 48 hours of replication oplog window available at any given time.
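
If you want a programmatic estimate of that window in addition to the Cluster Metrics graph, here is a rough sketch using the Node.js driver. It assumes your cluster tier and database user allow reading the local database, and it relies on the wall-clock `wall` field MongoDB stores on each oplog entry; both are assumptions you should verify for your deployment.

```typescript
import { MongoClient } from "mongodb";

async function oplogWindowHours(uri: string): Promise<number> {
  const client = new MongoClient(uri);
  await client.connect();
  try {
    const oplog = client.db("local").collection("oplog.rs");

    // The oldest and newest entries bound the replication window
    const oldest = await oplog.find().sort({ $natural: 1 }).limit(1).next();
    const newest = await oplog.find().sort({ $natural: -1 }).limit(1).next();
    if (!oldest || !newest) {
      throw new Error("Could not read the oplog");
    }

    // `wall` is the wall-clock time at which the entry was written
    const windowMs = newest.wall.getTime() - oldest.wall.getTime();
    return windowMs / (1000 * 60 * 60);
  } finally {
    await client.close();
  }
}

// Example: compare the current window against the 48-hour guideline above
oplogWindowHours("mongodb+srv://<your-cluster-uri>").then((hours) => {
  console.log(`Oplog window: ~${hours.toFixed(1)} hours`);
});
```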

Regards
Manny

