Approaches to dealing with Flexible Sync outages relating to client change set errors

Hi Mongo, we are using Mongo Realm Flexible Sync as our solution for an offline-first app.

We recently encountered a BSONObjectTooLarge error log in Mongo Atlas, which required us to restart Flexible Sync. While Flexible Sync was down, all our clients were unable to sync data from Mongo Realm.

As this caused an outage for our users, we are wondering what the best action plan is to deal with this in the future, and we're hoping you could help us with some questions:

  • How can we narrow down which document is responsible for the error?

  • Why does one client cause the entire sync system to terminate, and what mitigation steps are taken to prevent this?

  • Is it possible to control the change event size on the client so we can prevent a client from exceeding the change event / document size limit when syncing?

  • When Flexible Sync fails across all clients, how can we re-enable it without critical data loss on our clients?

Some of the errors we are seeing in our logs are:

BadClientFileIdent Error
MongoEncodingError Error
DivergingHistories Error
TranslatorFatalError Error
Encountered BSONObjectTooLarge error. Sync cannot be resumed from this state and must be terminated and re-enabled to continue functioning.

Hi, I apologize for this unfortunate situation, and I will do my best to answer your questions.

If you can send me the entire error message, it should contain a long hex string that can be used to identify the document causing the error. If you do not have it, you can provide me with an application_id (the ObjectId in the URL of realm.mongodb.com) and I can search for it in our logs.

This error is actually caused by a write to MongoDB by an external client (shell, driver, Compass, etc.) that Device Sync is listening to. It is a longstanding issue in MongoDB in which a Change Event (the structure that is used to listen for events) exceeds 16MB and MongoDB does not allow us to continue consuming events (and thus we lose our pointer into the oplog, so it is unsafe to do anything other than fail loudly). We realize this is not ideal, and it is why the MongoDB server team released the $changeStreamSplitLargeEvent feature in 7.0 (https://www.mongodb.com/docs/manual/reference/operator/aggregation/changeStreamSplitLargeEvent/#-changestreamsplitlargeevent--aggregation-). We are completing the work to begin using this new stage soon and avoid the issue altogether.
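For reference, once a cluster is on 7.0+ you can see the stage in action yourself from a driver. Here is a minimal sketch with the Node.js driver; the connection string, database, and collection names are placeholders:

```typescript
import { MongoClient, ChangeStreamDocument } from "mongodb";

const client = new MongoClient("mongodb+srv://<cluster-uri>");
const coll = client.db("app_db").collection("items");

// $changeStreamSplitLargeEvent must be the last stage of the pipeline; it
// splits any event exceeding 16MB into fragments instead of erroring out.
const stream = coll.watch([{ $changeStreamSplitLargeEvent: {} }], {
  // Pre-images count toward event size; this assumes pre-images are enabled
  // on the collection.
  fullDocumentBeforeChange: "whenAvailable",
});

stream.on("change", (event: ChangeStreamDocument) => {
  // Fragments of a split event carry splitEvent: { fragment: n, of: m }.
  const split = (event as { splitEvent?: { fragment: number; of: number } })
    .splitEvent;
  if (split) {
    console.log(`received fragment ${split.fragment} of ${split.of}`);
  }
});
```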

As mentioned above, this is actually a write that originated from outside of Device Sync in which the PreImage of the document combined with the UpdateDescription (https://www.mongodb.com/docs/manual/reference/change-events/) exceeded 16MB. This generally means you have a document somewhere in the 10MB-16MB range causing the issue. While MongoDB does support documents up to 16MB, it is advised that you design your data model such that you don’t come too close to that limit.
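To hunt for candidate documents, one option is a quick $bsonSize aggregation (available since MongoDB 4.4). A sketch, with database and collection names as placeholders:

```typescript
import { MongoClient } from "mongodb";

const client = new MongoClient("mongodb+srv://<cluster-uri>");
const coll = client.db("app_db").collection("items");

// $bsonSize reports each document's BSON size in bytes; anything over
// ~10MB is a candidate for producing oversized change events.
const candidates = await coll
  .aggregate([
    { $project: { size: { $bsonSize: "$$ROOT" } } },
    { $match: { size: { $gt: 10 * 1024 * 1024 } } },
    { $sort: { size: -1 } },
  ])
  .toArray();

console.log(candidates); // [{ _id: ..., size: <bytes> }, ...]
```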

Do you mind explaining more about what kind of data loss you saw? Generally speaking, when terminating and re-enabling sync, clients should perform a Client Reset. The new default is a mode called RecoverUnsyncedChanges that makes a best-effort attempt at not losing any data: https://www.mongodb.com/docs/atlas/app-services/sync/error-handling/client-resets/
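If you want to opt into this explicitly, here is a minimal sketch with the Realm JS SDK (the exact surface varies slightly by SDK version; `user` and `TaskSchema` are placeholders for your logged-in user and schema):

```typescript
import Realm, { ClientResetMode } from "realm";

declare const user: Realm.User;
declare const TaskSchema: Realm.ObjectSchema;

const config: Realm.Configuration = {
  schema: [TaskSchema],
  sync: {
    user,
    flexible: true,
    clientReset: {
      // Best-effort recovery of unsynced local changes during a client reset.
      mode: ClientResetMode.RecoverUnsyncedChanges,
      onBefore: (localRealm) => {
        // Invoked before the reset; a chance to snapshot local state.
      },
      onAfter: (localRealm, remoteRealm) => {
        // Invoked after recovery completes; verify what was merged back.
      },
    },
  },
};

const realm = await Realm.open(config);
```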

These errors are not unexpected. BadClientFileIdent and DivergingHistories are errors that occur when you terminate and re-enable sync. When that happens, clients need to be reset, and those errors are the two ways we have of detecting older clients. They should be accompanied by the message:

The server has forgotten about the client-side file presented by the client. 
This is likely due to using a synchronized realm after terminating and re-enabling sync. 
Please wipe the file on the client to resume synchronization.

See here for more details about these errors: https://www.mongodb.com/docs/atlas/app-services/sync/error-handling/errors/

MongoEncodingError is also normal and indicates that you have MongoDB documents that do not match the schema you have configured in App Services. See here for a better description in our documentation: https://www.mongodb.com/docs/atlas/app-services/sync/error-handling/errors/#mongodb-translator-errors

TranslatorFatalError - This occurs when the component in charge of translating changes between MongoDB and Device Sync has encountered a fatal error that requires user intervention.

Let me know if you have any other questions; I would be happy to answer them. We are excited that we will soon be able to bypass this error on clusters running a newer version of MongoDB.

Best,
Tyler

Hi Tyler, thanks for your time and such a comprehensive response; this has been a huge help for the team in understanding how best to tackle the sync issue.

We have located the document, which stores low-res thumbnails; that somewhat explains the change event size exceeding 16MB and Flexible Sync being terminated. We are taking steps to mitigate the problem, which include reducing the document size and modifying the writes to reduce the change event size!
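For example, where we previously replaced the whole document on save, we are moving to targeted updates along these lines (collection and field names are illustrative):

```typescript
import { MongoClient, ObjectId } from "mongodb";

const client = new MongoClient("mongodb+srv://<cluster-uri>");
const items = client.db("app_db").collection("items");

declare const someId: ObjectId; // the target document's _id, obtained elsewhere

// Before: replaceOne() re-sent the entire document, thumbnails included, so
// the change event carried the full pre-image plus the full replacement.
// After: a targeted $set keeps the UpdateDescription proportional to what
// actually changed.
await items.updateOne(
  { _id: someId },
  { $set: { title: "New title", updatedAt: new Date() } }
);
```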

As far as data loss goes, we currently do not have any reported issues when terminating sync, but we were just concerned about the possibility. We will investigate your client reset and error handling suggestions and report back if we have any issues.

Thanks again for such a great answer; we look forward to your updates to change event handling.
Ben

@Tyler_Kaye, I find the design isn't fault-tolerant enough. We experienced the same issue last year. Luckily for us, it's an enterprise app, so we could contact the user and have them delete the app.

Validating object size only on the client is error-prone and against best practices. All requests must also be validated server-side!

One user trying to sync an object that is too large shouldn't break sync for ALL users! Having to track that user down before sync can be restored isn't justifiable.

We raised our concerns in a support ticket, but given this thread, it is still an issue.

Best regards,
// Mikael

Hey @Mikael_Gurenius

I totally agree! For changes coming from Realm we can handle arbitrarily sized data (though we are limited by the 16MB document size). I will note that, generally speaking, if you find yourself bumping into the 16MB document limit, the bigger issue is likely that your data models should be revisited (you will also have performance problems in MongoDB dealing with such large documents).
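As a sketch of what revisiting the model can look like (names are illustrative, not prescriptive), bulky binary data can live in its own collection and be referenced from the frequently-updated document:

```typescript
import { ObjectId, Binary } from "mongodb";

// Keep the frequently-updated document small...
interface Item {
  _id: ObjectId;
  title: string;
  thumbnailId: ObjectId; // reference into a separate "thumbnails" collection
}

// ...and park the image bytes in a document that rarely changes, so routine
// updates to Item never drag megabytes of binary data into change events.
interface Thumbnail {
  _id: ObjectId;
  itemId: ObjectId;
  data: Binary; // low-res image bytes
}
```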

Also, just to be clear, the issue above is what happens when a change made to MongoDB produces a change event larger than 16MB. So it doesn't have to do with changes from a client, but rather changes from the shell / drivers / etc. This error has been addressed in MongoDB with a new feature, and we are working on rolling it out to users of App Services. (see: https://www.mongodb.com/docs/manual/reference/operator/aggregation/changeStreamSplitLargeEvent/#-changestreamsplitlargeevent--aggregation-)

Thanks,
Tyler

The split looks promising! I'll have another look to see if I can reproduce our issue from last year.

// Mikael