Corrupt collection in MongoDB Atlas, failed synchronization with App Service

Andreas_Ley · March 5, 2024, 1:50pm

I’m managing an Atlas cluster (M10, MongoDB 5.0.25) with App Services (Device Sync). Data in the cluster is created and accessed exclusively via Device Sync.

One of the collections seems to be corrupt, leading to failed synchronization between the Device Sync server and the MongoDB server. As a result, several weeks' worth of new data has not been synchronized to Atlas and is presumably stored only in the temporary cache of the Device Sync server.

I’m extremely worried about possible data loss (e.g. if the Device Sync server is restarted or fails).

Has anyone encountered a similar situation and if so, how did you fix it?
Is there any strategy to prevent collection corruption on Atlas?
Is there any way to be notified if synchronization between MongoDB and App Services fails repeatedly?

Andreas_Ley · March 12, 2024, 3:58pm

Turns out the collection wasn’t corrupt, just big (about 5GB; mostly binary data). This led to various issues:

1) Failed synchronization between App Services and MongoDB

The memory capacity of the M10 instance was insufficient to handle a big changeset. App Services Logs showed the following error:

MaxIntegrationAttempts Error

"Failed to integrate download after attempting the maximum number of tries: error committing transaction: (WriteConflict) WriteConflict error: this operation conflicted with another operation. Please retry your operation or multi-document transaction."

App Services kept retrying the sync to MongoDB and failed every time.
However, the client-side apps continued to sync fine and no alert/notification was generated, so the problem wasn’t recognized for a few weeks.

2) Increased costs and degraded performance

Since App Services isn’t available in the region where the cluster is located, the repeated synchronization attempts generated a few terabytes of cross-region traffic.
Cluster performance was impacted by the continuous read/write actions.

3) Inability to sort some collections

The big collection could not be sorted in the Atlas web interface. The error message was not very detailed:

An error occurred when performing the requested operation. Check your query and try again

Applying the same sort to the same collection in mongosh made it clear that there wasn’t enough memory to execute the query.
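For anyone hitting the same sort failure: the post does not quote the exact mongosh error, so the following is only a sketch under the assumption that the sort stage ran out of its in-memory budget. It builds a sort as an aggregation pipeline and passes `allowDiskUse=True`, which lets the server spill to temporary files. The function name `sorted_documents` and the field name in the test are hypothetical; `collection` is assumed to be a pymongo `Collection`.

```python
from typing import Any, Iterable


def sorted_documents(collection: Any, field: str,
                     descending: bool = False,
                     limit: int = 20) -> Iterable[dict]:
    """Return up to `limit` documents sorted by `field`.

    `collection` is assumed to be a pymongo Collection; any object with a
    compatible `aggregate(pipeline, **kwargs)` method works as well.
    """
    direction = -1 if descending else 1
    pipeline = [
        {"$sort": {field: direction}},
        # Limiting right after the sort keeps the result set small.
        {"$limit": limit},
    ]
    # allowDiskUse permits the server to write temporary files instead of
    # aborting when a pipeline stage exceeds its memory limit.
    return collection.aggregate(pipeline, allowDiskUse=True)
```

With a real connection this would be called on something like `MongoClient(uri)["mydb"]["mycollection"]` (placeholder names). An index on the sort field is usually the more lasting fix, since it avoids the in-memory sort altogether.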

Solution:

Upgrading the cluster from M10 to M30 solved the problem. Synchronization started automatically right after the scale-up. It took 5 hours to commit all the pending changes to MongoDB, but no data was lost.
Scaling the cluster down again is possible.

It’s possible that Cluster Auto-Scaling would have prevented this problem.

The solution was found with the help of a support agent after subscribing to a MongoDB Support Plan (“Developer”) and opening a case.

Lessons learned:

A) Temporary sync data is on the same cluster

In hindsight, I didn’t need to worry about losing the changes that weren’t committed to MongoDB yet.

Device Sync uses the hidden database __realm_sync, which resides on the same cluster. This database can’t be seen in the Atlas UI but can be inspected using mongosh, and it contains the whole client sync history.
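If you prefer a script over mongosh for a quick look at that database, something like the following might do. It is a rough sketch, not an official procedure: it assumes `client` is a pymongo `MongoClient` whose user is allowed to run `dbstats` and list collections on `__realm_sync` (the post does not say which permissions are needed). The function name `sync_db_summary` is made up for illustration.

```python
from typing import Any, Dict

SYNC_DB_NAME = "__realm_sync"  # database name as given in the post


def sync_db_summary(client: Any, db_name: str = SYNC_DB_NAME) -> Dict[str, Any]:
    """Collect a few size figures for the Device Sync history database.

    `client` is assumed to be a pymongo MongoClient (or a compatible object).
    """
    db = client[db_name]
    # dbstats reports document counts and sizes for the whole database.
    stats = db.command("dbstats")
    return {
        "collections": sorted(db.list_collection_names()),
        "objects": stats.get("objects"),
        "data_size_bytes": stats.get("dataSize"),
        "storage_size_bytes": stats.get("storageSize"),
    }
```

Running this now and then and comparing `data_size_bytes` over time could give a crude hint that sync history is piling up, though that interpretation is my assumption and not something the support agent confirmed.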

B) CLI tools are better for troubleshooting

The error reporting capabilities of Atlas’ web interface are limited. When in doubt, use the MongoDB Shell and the MongoDB Database Tools to inspect the database.

C) Sync issues may go unnoticed

It’s advisable to have some mechanism to verify that sync is working properly, and perhaps to set up alerts for certain conditions.
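One possible shape for such a check, as a minimal sketch: scan log entries in chronological order and raise a flag once the most recent entries are an unbroken run of the same integration error. The entry format (a mapping with an `"error"` string) is a hypothetical simplification, not the actual App Services log schema, and how the entries are fetched is left out entirely. The default marker matches the error name from the logs quoted above.

```python
from typing import Iterable, Mapping


def should_alert(entries: Iterable[Mapping[str, str]],
                 error_marker: str = "MaxIntegrationAttempts",
                 threshold: int = 3) -> bool:
    """Return True if the latest entries end in `threshold` or more failures.

    `entries` must be ordered oldest first. Each entry is assumed to carry
    an "error" string that is empty or missing when the sync succeeded.
    """
    streak = 0
    for entry in entries:
        if error_marker in entry.get("error", ""):
            streak += 1
        else:
            # Any entry without the marker resets the failure streak.
            streak = 0
    return streak >= threshold
```

A scheduled job could feed this with recent sync log entries and send a mail or chat message when it returns True. The threshold of 3 is arbitrary and should be tuned to how often sync normally runs.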
