Troubleshoot Backup and Restore Failures

Backup and restore operations for deployments managed by Ops Manager can fail for a variety of reasons, including agent connectivity issues, disk space constraints, or oplog inconsistencies.

This page describes how to confirm backup and restore failures, outlines common causes and resolutions, and provides guidance on what to collect before contacting support. If the issue persists after you complete the steps below, contact Technical Support.

Before you investigate the root cause of a backup or restore failure, confirm that a failure has occurred by checking the relevant status indicators in the Ops Manager UI or API.

Use the following methods to confirm that a backup job or snapshot has failed.

To confirm whether a snapshot failed:

  1. Click Admin.

  2. Click Backup.

  3. Click Snapshots.

The snapshot list shows whether each snapshot succeeded, is running, or failed.

You can also click JSON next to a snapshot to view additional fields, including:

  • status

  • createdDate

  • completedDate

  • totalDuration

  • transferSpeed

These fields help confirm whether the backup completed successfully.

For a description of all snapshot states, see Backup Overview.

To check for issues with ongoing backup jobs:

  1. Click Admin.

  2. Click Backup.

  3. Click Jobs.

Fields such as Last Snapshot, Last Oplog, or Head Time might appear highlighted when delayed, indicating a problem with the backup process.

For more information, see Jobs.

To review error messages from backup jobs:

  1. Click Admin.

  2. Click Logs.

The logs display error messages grouped by time, which can help diagnose why a backup job failed.

Ops Manager generates alerts that indicate failures or issues with backup jobs, including:

  • "Backup has reached a high number of retries"

  • "Backup is in an unexpected state"

  • "Replica set has a late snapshot"

For a full list of backup-related alert conditions, see Alert Conditions.

To retrieve snapshots that have not completed, query the Ops Manager API using the completed=false query parameter:

curl --user "{PUBLIC-KEY}:{PRIVATE-KEY}" --digest \
--header "Accept: application/json" \
"https://{OPSMANAGER-HOST}:{PORT}/api/public/v1.0/groups/{PROJECT-ID}/clusters/{CLUSTER-ID}/snapshots?completed=false"

The response includes a results array where each object represents a snapshot. The complete field indicates whether the snapshot finished successfully.

Note

The snapshot API does not provide a named failure status. A snapshot with complete: false may still be in progress or may have failed.

For more information, see Get All Snapshots for One Cluster.
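
Because the API does not distinguish in-progress snapshots from failed ones, one practical heuristic is to flag incomplete snapshots that are older than a snapshot normally takes to finish. The following Python sketch illustrates the idea on a sample response. Only the results array and the complete field come from the description above. The id and created.date field names, the sample values, and the 12-hour threshold are assumptions for illustration, so check them against the actual response from your deployment.

```python
import json
from datetime import datetime, timedelta, timezone

# Hypothetical sample shaped like the "results" array described above.
# "id" and "created.date" are assumed field names; the values are made up.
SAMPLE_RESPONSE = """
{"results": [
  {"id": "snap-a", "complete": true,  "created": {"date": "2024-05-01T00:00:00Z"}},
  {"id": "snap-b", "complete": false, "created": {"date": "2024-05-01T06:00:00Z"}},
  {"id": "snap-c", "complete": false, "created": {"date": "2024-05-02T11:30:00Z"}}
]}
"""


def parse_time(value):
    """Parse an ISO 8601 UTC timestamp such as 2024-05-01T06:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stale_incomplete_snapshots(response_text, now, max_age=timedelta(hours=12)):
    """Return IDs of snapshots that are incomplete and older than max_age."""
    results = json.loads(response_text).get("results", [])
    stale = []
    for snapshot in results:
        if snapshot.get("complete"):
            continue  # finished snapshots are not of interest here
        created = parse_time(snapshot["created"]["date"])
        if now - created > max_age:
            stale.append(snapshot["id"])
    return stale


# A fixed "now" keeps the example reproducible.
now = datetime(2024, 5, 2, 12, 0, tzinfo=timezone.utc)
print(stale_incomplete_snapshots(SAMPLE_RESPONSE, now))  # prints ['snap-b']
```

An incomplete snapshot that is only minutes old (snap-c in the sample) is probably still running, while one that is more than a day old (snap-b) deserves a closer look in the logs.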

Use the following methods to confirm that a restore job has failed.

To view the status of restore jobs in the Ops Manager UI, open the Restores page. The page shows a table of the last 300 restore jobs. Check the Status column to identify jobs with the following states:

  • FAILED

  • CANCELED

  • IN_PROGRESS

  • FINISHED

Click a row to view more details about that specific restore operation.

For more information, see Restores.

To retrieve restore jobs programmatically, query the Ops Manager API:

curl --user "{PUBLIC-KEY}:{PRIVATE-KEY}" --digest \
--header "Accept: application/json" \
"https://{OPSMANAGER-HOST}:{PORT}/api/public/v1.0/groups/{PROJECT-ID}/clusters/{CLUSTER-ID}/restoreJobs"

The response includes a results array where each object represents a restore job. The statusName field indicates the job state. Possible values include:

  • FINISHED

  • IN_PROGRESS

  • BROKEN

  • KILLED

Restore jobs with a statusName of BROKEN or KILLED are considered failed.

To filter for failed jobs using jq:

curl --user "{PUBLIC-KEY}:{PRIVATE-KEY}" --digest \
--header "Accept: application/json" \
"https://{OPSMANAGER-HOST}:{PORT}/api/public/v1.0/groups/{PROJECT-ID}/clusters/{CLUSTER-ID}/restoreJobs" \
| jq '.results[] | select(.statusName=="BROKEN" or .statusName=="KILLED")'

For more information, see Get All Restore Jobs for One Cluster.

The following sections describe common causes of backup and restore failures and how to resolve them.

The following sections describe common causes of backup failures and how to resolve them.

A lack of free disk space on the replica set member nodes can cause the cluster to enter an unhealthy state, leading to backup failures.

To resolve this issue, increase the available storage capacity on the dbPath of the affected nodes. Monitor disk usage regularly to prevent recurrence.

The backup process depends on the MongoDB Agent running continuously. If the agent stops or keeps restarting, backups fail.

Symptoms include:

  • Alerts such as "Backup oplog is behind"

  • No oplog slices received for an hour

To resolve this issue, review the agent logs to find out why the agent stopped or restarted. The agent logs are typically located at:

/var/log/mongodb-mms-automation/backup-agent.log

For more information, see Fix Backup Oplog Issues.

The backup agent must maintain a connection to the replica set. Failures can occur due to network connectivity issues, an unavailable MongoDB node, or an authentication failure.

Symptoms in the agent logs include:

  • server selection timeout

  • Authentication failed
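
The two symptoms above can be searched for in an agent log with a few lines of Python. This is a minimal sketch: the symptom strings come from the list above, but the matching is a plain case-insensitive substring search, and the demonstration log lines are made up rather than taken from a real agent.

```python
import tempfile

# Symptom strings taken from the list above.
SYMPTOMS = ("server selection timeout", "Authentication failed")


def count_symptoms(log_path, symptoms=SYMPTOMS):
    """Count log lines that contain each symptom (case-insensitive)."""
    counts = {symptom: 0 for symptom in symptoms}
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            lowered = line.lower()
            for symptom in symptoms:
                if symptom.lower() in lowered:
                    counts[symptom] += 1
    return counts


# Demonstration on a temporary file with made-up log lines.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as demo:
    demo.write("[info] backup started\n")
    demo.write("[error] server selection timeout while connecting\n")
    demo.write("[error] Authentication failed for backup user\n")
    demo.write("[error] server selection timeout while connecting\n")
    demo_path = demo.name

print(count_symptoms(demo_path))
# prints {'server selection timeout': 2, 'Authentication failed': 1}
```

To use it on a real host, pass the path of the agent log file instead of demo_path.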

To resolve this issue, test the connection to each replica set member from the agent host:

mongosh "mongodb://host:port"

Confirm the following:

  • Network access between the agent host and replica set members

  • Replica set availability

  • Backup user credentials and required roles
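
When mongosh is not installed on the agent host, the first item (network access) can be checked with a plain TCP connection test. The sketch below is an illustration only: it verifies that a port accepts connections and says nothing about replica set health or credentials. The can_connect helper is a hypothetical name, and the demonstration uses a temporary local listener so that it runs anywhere.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Self-contained demonstration against a temporary local listener.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

reachable = can_connect("127.0.0.1", demo_port)
listener.close()
print(f"127.0.0.1:{demo_port} reachable: {reachable}")
```

In practice, call can_connect once per replica set member with that member's host name and port, from the host where the agent runs.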

For more information, see Fix Backup Oplog Issues.

If the oplog is too small or the backup agent cannot keep up with write activity, the backup falls behind and eventually fails.

Symptoms include the following alerts:

  • "Backup requires a resync"

  • "Backup oplog is behind"

To resolve this issue:

  • Increase the oplog size so the oplog window covers enough history (a minimum of 24 hours is recommended).

  • If the backup has fallen too far behind, resync the backup.
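
As a rough guide to how much larger the oplog needs to be, the current size can be scaled by the ratio of the target window to the current window. The sketch below assumes a steady write rate, which real workloads rarely have, so treat the result as a lower bound. The function name and the sample figures are illustrative. The current size and window can be read from rs.printReplicationInfo() in mongosh.

```python
def required_oplog_size_gb(current_size_gb, current_window_hours, target_window_hours=24.0):
    """Scale the oplog size linearly so that the window reaches the target.

    Assumes the write rate stays constant, so size and window are proportional.
    """
    if current_window_hours <= 0:
        raise ValueError("current_window_hours must be positive")
    return current_size_gb * target_window_hours / current_window_hours


# Example: a 10 GB oplog that currently covers 6 hours of writes.
print(required_oplog_size_gb(10, 6))  # prints 40.0
```

If write activity peaks at certain times of day, measure the window during a peak and add headroom on top of the computed size.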

A backup job requires a Backup Daemon with enough space to store a local copy of the backed-up replica set. If no daemon has sufficient space, the job fails to bind. To resolve this issue, add an additional Backup Daemon to increase capacity.

This issue can also occur when no primary is detected in the replica set. To resolve this, ensure the replica set is healthy and has a primary before you retry the backup.

For more information, see Backup FAQ.

The following sections describe common causes of restore failures and how to resolve them.

When you restore a sharded cluster, you must restore all shards. The restore process fails if you attempt to restore a single shard in isolation.

For more information, see Restore Limitations.

An automated restore can fail when certain storage settings of the source backup and the target database do not match. If a restore attempt fails, Ops Manager displays any mismatched settings.

For a list of settings that must match, see Potential Causes for Automated Restore Failure.

Point-in-time restores require a continuous oplog history. If there is a gap in the oplog, the restore fails.

Common causes of oplog gaps include:

  • The backup agent stopped tailing the oplog.

  • The oplog rolled over before the agent processed it.

  • Cluster topology changes occurred.

  • A Feature Compatibility Version (FCV) change occurred.

  • A restore was attempted across MongoDB version changes.

To resolve this issue:

  • Restore from the latest valid snapshot taken before the oplog gap, or

  • Wait until a new snapshot is created, then perform the restore again.

For more information, see Restore from a Specific Point in Time.

If the target host does not have enough storage for the snapshot files and restored database, the restore fails.

To resolve this issue, check the size of the data to restore by running the following command in mongosh:

db.stats()

Verify that the dbPath has enough free disk space to accommodate the restored data before proceeding.
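
The comparison can be sketched in Python as follows. The storageSize and indexSize fields are part of the db.stats() output, but the 20% headroom factor, the helper name, and the sample numbers are assumptions. Because db.stats() reports on one database at a time, the helper expects one result per database to restore. It does not account for the extra space that the downloaded snapshot files occupy during the restore.

```python
import shutil


def restore_space_check(db_stats_list, target_path, headroom=1.2):
    """Compare the estimated restore size with the free space at target_path.

    db_stats_list holds one db.stats() result per database to restore.
    """
    raw_bytes = sum(s["storageSize"] + s["indexSize"] for s in db_stats_list)
    required = int(raw_bytes * headroom)
    free = shutil.disk_usage(target_path).free
    return {"required_bytes": required, "free_bytes": free, "enough": free >= required}


# Made-up sizes (in bytes) for two databases.
sample_stats = [
    {"db": "app", "storageSize": 40 * 1024**3, "indexSize": 5 * 1024**3},
    {"db": "logs", "storageSize": 10 * 1024**3, "indexSize": 1 * 1024**3},
]

# "." stands in for the dbPath of the target host.
report = restore_space_check(sample_stats, ".")
print(report)
```

Run the check on the target host and replace "." with the actual dbPath.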

For more information about the dbStats command, see dbStats.

If the issue persists, collect the following information before contacting Technical Support:

  • Complete error messages from the Ops Manager UI or API

  • Backup agent log files

  • MongoDB server version

  • Ops Manager version

  • Relevant MongoDB server logs

  • Output from the Restores page or API restore job query
