Before answering the questions, let’s discuss some of the terminology mentioned in the error.
This refers to an internal automated process called the Sync Translator, which translates “Realm” data in the client into Atlas data, and vice versa. This process creates and executes instructions in the Sync metadata that allow syncing to take place between mobile devices and Atlas.
A Change Stream is a MongoDB feature that lets applications watch collections for real-time data changes recorded in the Oplog (Operations Log). The Sync Translator uses change streams to detect writes as they occur and translate them into instructions so that documents/objects can be synced between the Sync client and your Atlas database(s). Similarly, Triggers use change streams to watch for data changes and fire executions on the operation types configured in the trigger.
The Resume Token marks a point in time in your oplog from which the change stream continues processing the Change Events recorded there. If the resume token cannot be found, the Sync Translator or Trigger will not know from which point in time it needs to continue processing change events.
If the Sync Translator loses the token, a non-recoverable error can be thrown, and Sync needs to be terminated and restarted in order for a new token to be created. This means the metadata instructions that have been created thus far need to be cleared/reset and rebuilt using what is in Atlas. Existing clients will also need to undergo a Client Reset to continue using Sync. For this reason, we recommend including client reset handling in your app to take care of this automatically.
Similarly, if a trigger loses its resume token, it will not know from which point in time it needs to process events. As a result, the trigger will go into a suspended state and must be restarted without a resume token. Unfortunately, this means the trigger will not be able to process change events that occurred before the restart and can only fire on new events. For further information, please see the other root causes for which triggers can be suspended.
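To illustrate the "restart without a resume token" behaviour in your own change stream consumers, here is a minimal sketch. It is not how the Sync Translator or Triggers are implemented internally. The helper name `open_change_stream` is hypothetical, and `collection` is assumed to behave like a pymongo `Collection` (exposing `watch(resume_after=...)`). The sketch relies on the server reporting ChangeStreamHistoryLost as error code 286, which pymongo exposes as the `code` attribute of `OperationFailure`.

```python
# Server error code for ChangeStreamHistoryLost
# (the resume point is no longer in the oplog).
CHANGE_STREAM_HISTORY_LOST = 286


def open_change_stream(collection, resume_token=None):
    """Open a change stream, falling back to a fresh one if the token is lost.

    Hypothetical helper: `collection` is assumed to behave like a pymongo
    Collection. Returns (stream, resumed). `resumed` is False when no saved
    token was used, which means events before "now" will not be delivered.
    """
    if resume_token is None:
        return collection.watch(), False
    try:
        return collection.watch(resume_after=resume_token), True
    except Exception as exc:
        # pymongo's OperationFailure carries the server error code in `.code`.
        # Anything other than ChangeStreamHistoryLost is re-raised untouched.
        if getattr(exc, "code", None) != CHANGE_STREAM_HISTORY_LOST:
            raise
        # The token fell out of the oplog: start again from the current time.
        return collection.watch(), False
```

When `resumed` comes back as False after a token was supplied, the consumer has skipped every event between the lost token and the restart, so it may need to reconcile its state from the collection itself. Depending on the driver version, the same error may also surface while iterating the stream rather than when opening it, so a production consumer would wrap iteration in similar handling.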
Cause of ChangeStreamHistoryLost errors
As discussed above, this is usually due to the change stream used by the translator process not being able to find its resume token. This is most commonly caused by insufficient cluster resources. We recommend a general minimum of an M10 cluster for a production Realm app using Sync, and a best practice minimum of an M30 cluster (or greater depending on needs). A dedicated cluster ensures that your app is not affected by other clusters which utilise the shared resources in a shared cluster. Please ensure your app uses a dedicated cluster tier before deploying the Realm app live into production. If you later choose to upgrade from a shared tier to a dedicated tier cluster, you will need to undergo a sync termination as part of the process, which may inconvenience your users.
If you have a dedicated cluster tier and experience this error, it is most likely due to the Oplog size on the cluster being insufficient. The oplog size determines how much room there is for change events to be written in the oplog. A surge of writes in your cluster causes more entries to be written into the oplog, and as a result may force the resume token to fall outside of the Oplog Replication Window (a graph you can find in your Cluster Metrics). Please increase your oplog size so that there is at least 48 hours of oplog replication window available at any given time.
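As a rough way to reason about the 48-hour target, the arithmetic can be sketched as below. The function names are hypothetical, and the sketch assumes you read an oplog write rate in GB per hour from your cluster metrics and treat it as constant. Real workloads are bursty, so using the peak rate rather than the average gives a safer estimate.

```python
def required_oplog_size_gb(oplog_gb_per_hour, window_hours=48.0):
    """Oplog size needed to retain `window_hours` of history at a given rate."""
    return oplog_gb_per_hour * window_hours


def oplog_window_hours(oplog_size_gb, oplog_gb_per_hour):
    """Approximate replication window for a given oplog size and write rate."""
    if oplog_gb_per_hour <= 0:
        # No writes means entries are never pushed out of the oplog.
        return float("inf")
    return oplog_size_gb / oplog_gb_per_hour


# Example: a cluster writing 0.5 GB of oplog per hour at peak.
print(required_oplog_size_gb(0.5))    # 24.0 GB needed for a 48-hour window
print(oplog_window_hours(10.0, 0.5))  # 20.0 hours with a 10 GB oplog
```

In this example a 10 GB oplog covers only about 20 hours at the peak rate, so a resume token older than that would no longer be found.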