Assessing complexity to fix mongodump behaviour for capped collections

Hi folks, while working with a customer that is hitting this Mongo bug (mongodump fails when capped collection position lost), I ran out of options trying to find a workaround for the problem, which happens sporadically (very likely depending on variable latency issues)
but makes automating backups impossible.

Given the bug has been open for quite a while, we are considering working on a fix, but my concern is that we
might start working on something and end up down a rabbit hole with a solution that is way too complex or that requires architectural changes.

Also, I’m posting in this mongo tools channel, but there’s also a related bug against the mongo core, which further increases my suspicion that this
might require changing things that the mongo team might not be willing to accept.

So, what do you guys think? Does anyone here have the expertise, or know who could give good direction on how to fix this problem?

Welcome to the MongoDB Community forums @Erlon_Cruz!!

The issues you’ve highlighted were originally reported for MongoDB 3.4 server and MongoDB 3.2 mongodump, so it would be good to confirm if any behaviour has changed in recent versions of MongoDB.

To help understand the bug affecting your mongodump backups, can you please:

  • confirm the exact MongoDB server and mongodump versions you are using in your environment

  • confirm the type of deployment you are backing up (standalone, replica set, or sharded cluster)

  • share more context on the capped collection(s) that cause your mongodump backups to periodically fail (for example, is that the oplog system collection or a user collection?)

There have been improvements to the initial sync process and mongodump since MongoDB 3.2, but you may need to consider a different backup method if some of your capped collections are rolling over faster than mongodump can complete.
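
If it helps narrow things down, the capped collections in a database and their configured limits can be listed from the mongo shell. A minimal sketch (names and limits will of course differ in your deployment):

    // list capped collections in the current database with their size limits
    db.getCollectionInfos().forEach(function (c) {
      if (c.options && c.options.capped) {
        print(c.name + ": capped at " + c.options.size + " bytes" +
              (c.options.max ? ", max " + c.options.max + " docs" : ""));
      }
    });

Comparing those limits against your write rate and how long mongodump takes gives a rough sense of whether rollover mid-dump is plausible.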

Since all data dumped via mongodump has to be read into memory by the MongoDB server, it is not an ideal backup approach for deployments that are highly active or have uncompressed data significantly larger than the WiredTiger cache. Backup approaches like filesystem snapshots and agent-based backup (e.g. MongoDB Ops Manager or Cloud Manager) are more common for production deployments that have outgrown mongodump.

Regards,
Stennie

Hi Stennie,

Thanks for answering. To the points you asked about:

  • Both the mongo server and mongodump are version 3.6.1.
  • The deployment is a replica set cluster with 3 nodes.
  • The capped collections that are failing are user collections. It happened a few times with a few collections used to store logs and with a collection used to store transactions [1]. Note that [1] fails inside a program that wraps mongodump. In [2] I have a run of 40 attempts, 1 of which failed.

One of the collections I mentioned, the one used for logs, seems to have no more errors after I increased its size 10x, but doing the same with the transactions collection didn’t have the same effect.
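
For anyone else hitting this: on this server version a capped collection cannot be resized in place, so one way to grow one is to recreate it at the larger size and copy the documents across. A minimal sketch, with "logs" as a placeholder name:

    // grow a capped collection 10x by recreating it at the larger size
    // ("logs" is a placeholder; pause writes while the copy runs,
    // since documents inserted mid-copy would be missed)
    var newSize = db.logs.stats().maxSize * 10;
    db.createCollection("logs_resized", { capped: true, size: newSize });
    db.logs.find().forEach(function (doc) { db.logs_resized.insert(doc); });
    db.logs_resized.renameCollection("logs", true);  // dropTarget: true drops the old one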

About the improvements you mention, can you tell me in which versions of mongo they were added? And if they improved anything about this behavior, shouldn’t the bugs I mentioned in Jira be updated or closed?

Erlon


[1] Bug #1852502 “Juju backups failing Executor error: CappedPositio...” : Bugs : juju
[2] mongodump.log · GitHub

Hi @Erlon_Cruz,

Thanks for the extra details. Since the issues you mentioned were originally reported against older versions, context on the actual versions used (and whether this affects system vs user collections) helps eliminate the possibility that you are running into some related bugs that have since been fixed.

In particular I was thinking of issues with small but very active system capped collections (for example, system.profile is 1MB by default) that should be excluded from mongodump by default.
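
For example, the profiler collection’s default capped size can be confirmed from the mongo shell (a quick sketch; assumes profiling is enabled on the database):

    // system.profile is created as a 1MB capped collection by default
    db.setProfilingLevel(1)            // profile slow operations on this database
    db.system.profile.stats().maxSize  // 1048576 bytes unless explicitly resized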

The underlying problem is twofold: capped collections have a fixed size and can roll over while the mongodump is in progress, and deletes to capped collections are currently implicit rather than replicated.
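
To illustrate the rollover half of the problem, here is a toy example in the mongo shell (collection name and sizes are arbitrary):

    // a tiny capped collection silently discards its oldest documents once
    // its limits are hit; mongodump loses its position if that happens mid-dump
    db.createCollection("demo", { capped: true, size: 4096, max: 5 });
    for (var i = 0; i < 10; i++) { db.demo.insert({ n: i }); }
    db.demo.find().toArray();  // only n: 5..9 remain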

Implicit deletion from capped collections is an implementation detail inherited from the legacy MMAPv1 storage engine. The MMAP storage engine was removed in MongoDB 4.2, and there is currently work in progress to address this behaviour in future server versions. SERVER-55156 (Move capped collection responsibilities to the collection layer) and related issues will unblock being able to address the initial sync issue (SERVER-32827) you mentioned in the first post in this thread. I don’t expect there will be a straightforward fix for mongodump, but I’ll defer to the database tools team to comment on the Jira ticket.

I can think of some possible workarounds without any changes to the server or mongodump:

  • Increase the size of affected capped collections (which can still be a challenge for mongodump depending on workload).

  • Use an alternative backup strategy: filesystem snapshots or a backup agent would be best.

    If you have a replica set deployment you could also consider configuring a hidden member for backup purposes and using db.fsyncLock() / db.fsyncUnlock() to quiesce writes while the mongodump backup is running (see the first sketch after this list). Stopping writes on a secondary in order to take a backup presumes the backup can complete before the secondary’s oplog becomes too stale to sync, so this is a less recommended backup approach.

  • Consider using a TTL index to limit data size instead of a capped collection (see the second sketch after this list).
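
A rough sketch of the quiesced hidden-member approach (hostname and output path below are placeholders):

    // on the hidden secondary: flush pending writes and block new ones
    db.fsyncLock()

    // from a separate shell, dump that member, e.g.:
    //   mongodump --host hidden-member.example.net --port 27017 --out /backups/dump

    // once the dump completes, unblock writes so replication can catch up
    db.fsyncUnlock()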
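
And a sketch of the TTL alternative, assuming your documents carry a createdAt date field (field name and expiry are placeholders):

    // expire log documents after 7 days instead of relying on capped rollover;
    // TTL deletes are explicit operations that replicate normally, so mongodump
    // is not exposed to implicit capped deletion
    db.logs.createIndex({ createdAt: 1 }, { expireAfterSeconds: 7 * 24 * 3600 });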

Issues will definitely be closed if there is an associated commit, but with 50K+ issues in the SERVER project there are sometimes older issues that are either duplicates or indirectly addressed via other code changes.

Regards,
Stennie

Hi Stennie, thanks a lot for your help, support, and detailed information.
I really appreciated it!
