UnrecoverableRollbackError

We have three replica set members running as three pods. When a node is rebooted, the secondary member's pod sometimes goes into a CrashLoopBackOff state.
Please provide some insight, as we have observed this multiple times.

**Here are some log snippets:**

{"t":{"$date":"2024-06-04T02:31:27.069+00:00"},"s":"I",  "c":"ELECTION", "id":4615601, "ctx":"ReplCoord-1","msg":"Scheduling priority takeover","attr":{"when":{"$date":"2024-06-04T02:31:29.312Z"}}}
{"t":{"$date":"2024-06-04T02:31:27.102+00:00"},"s":"I",  "c":"ROLLBACK", "id":21607,   "ctx":"BackgroundSync","msg":"Rollback common point","attr":{"commonPointOpTime":{"ts":{"$timestamp":{"t":1715069941,"i":1}},"t":19}}}
{"t":{"$date":"2024-06-04T02:31:27.102+00:00"},"s":"I",  "c":"ROLLBACK", "id":21612,   "ctx":"BackgroundSync","msg":"Rollback summary","attr":{"startTime":{"$date":"2024-06-04T02:31:27.049Z"},"endTime":{"$date":"2024-06-04T02:31:27.102Z"},"syncSource":"mongo-subscriber-1.mongo-subscriber.mongodb-global.svc.cluster.local:27017","lastOptimeRolledBack":{"ts":{"$timestamp":{"t":1717128607,"i":3}},"t":22},"commonPoint":{"ts":{"$timestamp":{"t":1715069941,"i":1}},"t":19},"lastWallClockTimeRolledBack":{"$date":"2024-05-31T04:10:07.260Z"},"firstOpWallClockTimeAfterCommonPoint":{"$date":"2024-05-07T08:19:01.187Z"},"wallClockTimeDiff":2058666,"shardIdentityRolledBack":false,"configServerConfigVersionRolledBack":false,"affectedSessions":[],"affectedNamespaces":["admin.system.keys"],"rollbackCommandCounts":{"insert":1},"totalEntriesRolledBackIncludingNoops":2}}
{"t":{"$date":"2024-06-04T02:31:27.102+00:00"},"s":"I",  "c":"REPL",     "id":21358,   "ctx":"BackgroundSync","msg":"Replica set state transition","attr":{"newState":"SECONDARY","oldState":"ROLLBACK"}}
{"t":{"$date":"2024-06-04T02:31:27.102+00:00"},"s":"I",  "c":"REPL",     "id":21106,   "ctx":"BackgroundSync","msg":"Resetting sync source to empty","attr":{"previousSyncSource":"mongo-subscriber-1.mongo-subscriber.mongodb-global.svc.cluster.local:27017"}}
{"t":{"$date":"2024-06-04T02:31:27.102+00:00"},"s":"F",  "c":"REPL",     "id":21128,   "ctx":"BackgroundSync","msg":"Rollback failed with unrecoverable error","attr":{"error":{"code":127,"codeName":"UnrecoverableRollbackError","errmsg":"not willing to roll back more than 86400 seconds of data. Have: 2058666 seconds."}}}
{"t":{"$date":"2024-06-04T02:31:27.102+00:00"},"s":"F",  "c":"ASSERT",   "id":23095,   "ctx":"BackgroundSync","msg":"Fatal assertion","attr":{"msgid":50666,"error":"UnrecoverableRollbackError: not willing to roll back more than 86400 seconds of data. Have: 2058666 seconds.","file":"src/mongo/db/repl/bgsync.cpp","line":842}}
{"t":{"$date":"2024-06-04T02:31:27.102+00:00"},"s":"F",  "c":"ASSERT",   "id":23096,   "ctx":"BackgroundSync","msg":"\n\n***aborting after fassert() failure\n\n"}

Hi there!

Since there wasn’t a reply to your message, I tried to run it by our documentation bot. Here is the response I got :slight_smile:


When a node is rebooted in a MongoDB replica set, a secondary member's pod can end up in a CrashLoopBackOff state. This can happen if the set's remaining active members cannot form a majority: the primary steps down and becomes a secondary, and until at least one more member becomes available the set has no primary and cannot elect a new one.

To avoid this situation, make sure a majority of the set's members remain available during maintenance so the set can elect a primary. In a three-member replica set, for example, the set can elect a primary as long as at least two members can connect to each other. If you reboot both secondaries at once, the primary will step down and become a secondary; until at least one of the rebooted secondaries comes back, the set has no primary and cannot elect a new one.
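
As a quick check during maintenance, `rs.status()` shows how many members are healthy and what state each is in. A minimal sketch below; the pod name, namespace, and lack of authentication flags are assumptions based on the hostnames in the log snippets, so adjust them to your deployment.

```sh
# Check member states from a healthy pod (names and missing auth flags are
# assumptions; add --username/--password etc. as your deployment requires).
kubectl -n mongodb-global exec mongo-subscriber-1 -- mongosh --quiet --eval '
  rs.status().members.forEach(m => print(m.name + "  " + m.stateStr + "  health=" + m.health))
'
# A three-member set needs at least two healthy, mutually reachable members
# to elect (or keep) a primary.
```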

To troubleshoot the CrashLoopBackOff state, check the logs of the secondary member's pod for error messages or exceptions that indicate the cause of the issue. Additionally, you can review the MongoDB documentation on troubleshooting replica sets for more information on common issues and their resolutions.
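
For a pod stuck in CrashLoopBackOff, the log of the previous (crashed) container is usually the most useful. A minimal kubectl sketch, with the pod and namespace names assumed from the hostnames in the log snippets above:

```sh
# Logs from the container instance that crashed (not the current restart).
kubectl -n mongodb-global logs mongo-subscriber-0 --previous

# Restart count, last termination state, and recent events for the pod.
kubectl -n mongodb-global describe pod mongo-subscriber-0
```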

If the issue persists or you need further assistance, please provide more details or specific error messages, and I’ll be happy to help you further.

Beta AI answer


Based on the log snippets provided, it appears that the secondary member is experiencing an unrecoverable rollback error, which is causing it to crash. [1]

Replica Set States

In a replica set, members can transition through various states. The log shows the following state transitions:

  • The member entered the ROLLBACK state and found a common point with its sync source.
  • It then transitioned back to the SECONDARY state.
  • Immediately afterwards, the rollback was reported as failed with an unrecoverable error and the process aborted with a fatal assertion (a way to extract these transitions from the structured log is sketched below).
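
Because mongod writes structured JSON logs, these transitions can be pulled out directly. A small sketch using jq, assuming the crashed container's output is still available via kubectl (pod and namespace names are assumptions):

```sh
# Pull replica set state transitions (log id 21358) out of the crashed
# container's structured JSON log: timestamp, old state, new state.
kubectl -n mongodb-global logs mongo-subscriber-0 --previous \
  | jq -c 'select(.id == 21358) | {t: .t["$date"], from: .attr.oldState, to: .attr.newState}'
```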

Rollback Process

During rollback, a member that has diverged from the current primary reverts the write operations it applied but that were never replicated to a majority of the set, winding its data back to the last common point it shares with its sync source so it can resume replicating as a secondary. In this case, the rollback failed with an unrecoverable error.
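
It is also worth checking the oplog window on the healthy members, since a member that lags or diverges for longer than the window the oplog covers cannot catch up through normal replication and needs an initial sync. A minimal sketch (pod name and missing auth flags are assumptions):

```sh
# On a healthy member (ideally the primary): print the oplog size and the
# wall-clock window between its first and last entries.
kubectl -n mongodb-global exec mongo-subscriber-1 -- mongosh --quiet --eval 'rs.printReplicationInfo()'
```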

Error Analysis

The critical error message in the log is:

{"t":{"$date":"2024-06-04T02:31:27.102+00:00"},"s":"F",  "c":"REPL",     "id":21128,   "ctx":"BackgroundSync","msg":"Rollback failed with unrecoverable error","attr":{"error":{"code":127,"codeName":"UnrecoverableRollbackError","errmsg":"not willing to roll back more than 86400 seconds of data. Have: 2058666 seconds."}}}

This error indicates that the rollback process is attempting to roll back more than 24 hours (86400 seconds) of data. Specifically, it’s trying to roll back 2,058,666 seconds, which is approximately 23.8 days.
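
The 86400-second ceiling is the default value of the rollbackTimeLimitSecs server parameter, which caps how much wall-clock time mongod is willing to roll back. You can read it on a running member as in the sketch below (pod name is an assumption); raising it is rarely the right fix when the divergence is measured in weeks, as it is here.

```sh
# Read the rollback time limit on a member (default 86400 seconds).
kubectl -n mongodb-global exec mongo-subscriber-1 -- mongosh --quiet \
  --eval 'db.adminCommand({ getParameter: 1, rollbackTimeLimitSecs: 1 })'
```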

Possible Causes

  • Long period of disconnection: The secondary might have been disconnected from the primary for an extended period, causing it to fall significantly behind in replication (see the lag check sketched after this list).
  • Data inconsistency: There might be a significant data divergence between the secondary and the primary, possibly due to a previous failure or network issue.
  • Configuration issues: There might be misconfigurations in the replica set that are preventing proper synchronization.
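
Replication lag is easy to spot from mongosh before it grows into a problem like this. A minimal sketch, with the same assumed pod and namespace names as above:

```sh
# On the primary: print, for each secondary, how far its last applied oplog
# entry lags behind the primary's.
kubectl -n mongodb-global exec mongo-subscriber-1 -- mongosh --quiet \
  --eval 'rs.printSecondaryReplicationInfo()'
```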

Recommendations

  • Check network connectivity: Ensure that all members of the replica set have stable network connections. [2]
  • Review replica set configuration: Verify that all members are properly configured and can communicate with each other.
  • Monitor replication lag: Regularly monitor the replication lag between the primary and secondaries to catch and address any growing discrepancies early.
  • Consider resynchronization: If the secondary is consistently failing to catch up, you might need to perform a full resynchronization of the data from the primary to the secondary (a Kubernetes-flavoured sketch follows this list). [1]
  • Investigate the root cause: Look into why the secondary fell so far behind. Was there an extended network outage? Was the node down for a long period? Addressing the root cause will help prevent future occurrences.
  • Review your backup and recovery strategy: Ensure you have a solid backup strategy in place to recover from such situations more easily in the future. [2]
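
For the resynchronization mentioned above, a common pattern on Kubernetes is to give the broken member an empty data directory so mongod performs an initial sync from a healthy member when the pod restarts. This is only a sketch: the PVC and pod names follow the usual StatefulSet naming convention but are assumptions here, your operator or manifests may differ, and you should only do this while the other two members are healthy and form a majority.

```sh
# Mark the member's data volume for deletion, then delete the pod; the PVC is
# removed once the pod releases it, and the recreated pod starts with an empty
# data directory (names are illustrative; check `kubectl get pvc` first).
kubectl -n mongodb-global delete pvc data-mongo-subscriber-0 --wait=false
kubectl -n mongodb-global delete pod mongo-subscriber-0

# The recreated pod runs an initial sync from a healthy member and stays in
# STARTUP2 until the sync completes.
kubectl -n mongodb-global get pods -w
```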

If this issue persists, you may need to perform a full resynchronization of the problematic secondary or seek further assistance from MongoDB support for a more in-depth investigation of your specific deployment.

Sources

[1] Replica Set Member States (MongoDB Manual, Replication, Replication Reference)
[2] Three Member Replica Sets (MongoDB Manual, Replication, Replica Set Deployment Architectures)