Turns out the collection wasn’t corrupt, just big (about 5GB; mostly binary data). This led to various issues:
1) Failed synchronization between App Services and MongoDB
The memory capacity of the M10 instance was insufficient to handle a big changeset. App Services Logs showed the following error:
MaxIntegrationAttempts Error
"Failed to integrate download after attempting the maximum number of tries: error committing transaction: (WriteConflict) WriteConflict error: this operation conflicted with another operation. Please retry your operation or multi-document transaction."
App Services kept retrying the sync to MongoDB, failing every time.
However, the client-side apps continued to sync fine and no alert/notification was generated, so the problem wasn’t recognized for a few weeks.
2) Increased costs and degraded performance
Since App Services isn’t available in the region where the cluster is located, the repeated synchronization attempts generated a few terabytes of cross-region traffic.
Cluster performance was impacted by the continuous read/write actions.
3) Inability to sort some collections
The large collection could not be sorted in the Atlas web interface. The error message was not very detailed:
An error occurred when performing the requested operation. Check your query and try again
Applying the same sort when querying the collection in mongosh made it clear that there wasn’t enough memory to execute the query.
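As a sketch of what that looks like (the collection and field names here are placeholders, not the actual ones), the failing sort and a possible workaround in mongosh:

```javascript
// Run inside mongosh against the affected database; "files" and
// "uploadedAt" are placeholder names.
// Without a supporting index, a sort on a large collection fails with a
// memory-limit error once it exceeds the server's in-memory sort budget:
db.files.find().sort({ uploadedAt: -1 })

// Allowing the server to spill the sort to disk (MongoDB 4.4+) works
// around the in-memory limit, at the cost of a slower query:
db.files.find().sort({ uploadedAt: -1 }).allowDiskUse()
```

Creating an index on the sort field would avoid the in-memory sort entirely, but on a collection this size that has its own cost.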
Solution:
Upgrading the cluster from M10 to M30 solved the problem. Synchronization started automatically right after the scale-up. It took 5 hours to commit all the pending changes to MongoDB, but no data was lost.
Scaling the cluster down again is possible.
It’s possible that Cluster Auto-Scaling would have prevented this problem.
The solution was found with the help of a support agent after subscribing to a MongoDB Support Plan (“Developer”) and opening a case.
Lessons learned:
A) Temporary sync data is on the same cluster
In hindsight, I didn’t need to worry about losing the changes that hadn’t been committed to MongoDB yet.
Device Sync uses a hidden database, __realm_sync, which resides on the same cluster. It can’t be seen in the Atlas UI but can be inspected using mongosh, and it contains the whole client sync history.
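As a sketch, this is how the hidden database can be poked at from mongosh. Its internal layout is undocumented and may change, so treat this as read-only exploration:

```javascript
// Run inside mongosh; __realm_sync does not appear in the Atlas UI.
use __realm_sync
show collections          // internal sync metadata and client history
db.stats(1024 * 1024)     // approximate size of the sync history, in MB
```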
B) CLI tools are better for troubleshooting
The error-reporting capabilities of Atlas’s web interface are limited. When in doubt, use the MongoDB Shell (mongosh) and the MongoDB Database Tools to inspect the database.
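For example (a sketch; the connection string and namespace are placeholders), the operation that fails opaquely in the UI can be replayed from the command line, and the Database Tools can snapshot the collection before any manual intervention:

```shell
# mongosh prints the full server error that the Atlas UI truncates:
mongosh "mongodb+srv://cluster0.example.mongodb.net/mydb" \
  --eval 'db.files.find().sort({ uploadedAt: -1 }).toArray()'

# mongodump (MongoDB Database Tools) backs up the collection before
# experimenting with fixes:
mongodump --uri "mongodb+srv://cluster0.example.mongodb.net/mydb" \
  --collection files --out ./backup
```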
C) Sync issues may go unnoticed
It’s advisable to have some mechanism that verifies sync is working properly, and perhaps to set up Alerts for certain conditions.
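One low-effort option (a sketch using the Atlas Administration API v1.0; the project ID and API key are placeholders) is to configure an alert on a metric that a sync stall would move, such as cross-region network traffic, and then poll for open alerts from a cron job:

```shell
# Digest auth with a programmatic API key; {GROUP-ID} is the Atlas project ID.
curl --user "{PUBLIC-KEY}:{PRIVATE-KEY}" --digest \
  "https://cloud.mongodb.com/api/atlas/v1.0/groups/{GROUP-ID}/alerts?status=OPEN"
```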