Restore Ops Manager from a Secondary Ops Manager

When you lose the primary Ops Manager, you restore its backing databases from the secondary Ops Manager and then start a new primary Ops Manager. Use this procedure after an event such as a failed upgrade, accidental data deletion, or infrastructure failure. When the primary Ops Manager restarts, it automatically enters Restoration Mode to reconcile agent configuration before normal operation resumes. For an overview of this pattern, see Back Up and Restore Ops Manager Using a Secondary Instance. To configure this pattern, see Configure a Secondary Ops Manager to Back Up Ops Manager.

Considerations

Restore Order

The sequence of restoring the backing databases and starting the primary Ops Manager is critical.

Warning

Restore the backing databases before you start the primary Ops Manager. If you start the primary Ops Manager before you restore its backing databases, the primary Ops Manager can write inconsistent state and create a split-brain scenario between the old and new deployments.

If the secondary Ops Manager also backs up the snapshot metadata store and the oplog metadata store, restore all three backing databases to the same point in time, or restore the metadata stores to a slightly later point in time than the application database, before you start the primary Ops Manager. Restoring the metadata stores to an earlier point than the application database causes the primary Ops Manager to reject affected restore jobs with HTTP 409 ("snapshot blocks missing").
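The ordering rule above can be checked before you submit the restores. The following is a minimal sketch, not an Ops Manager API: the function name, store labels, and timestamps are hypothetical, and the check only encodes the rule that no metadata store target may be earlier than the application database target.

```python
from datetime import datetime, timezone


def stores_restored_too_early(app_db_target, metadata_targets):
    """Return the metadata stores whose restore target is earlier than
    the application database target (hypothetical helper)."""
    return [
        name
        for name, target in metadata_targets.items()
        if target < app_db_target
    ]


# Illustrative restore targets; none of these values come from a real deployment.
app_db_target = datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
metadata_targets = {
    "snapshot metadata store": datetime(2024, 5, 1, 12, 0, 30, tzinfo=timezone.utc),
    "oplog metadata store": datetime(2024, 5, 1, 11, 59, 0, tzinfo=timezone.utc),
}

print(stores_restored_too_early(app_db_target, metadata_targets))
# prints ['oplog metadata store']
```

An empty list means that every metadata store target is at the same point in time as, or later than, the application database target.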

Recovery Point

A point-in-time restore recovers the application database to the point in time that you select, not to the moment of the disaster. Operations that Backup didn't capture in an oplog slice before the disaster might not be recoverable. Select a restore point as close to the disaster as your point-in-time recovery window allows.

After the primary Ops Manager restarts, reconciliation recovers the deployment automation configuration from the MongoDB Agents. Other recent application database changes reflect the restore point that you select.

Version Compatibility

The primary and secondary Ops Manager versions must stay compatible.

Warning

Restore the application database to a primary Ops Manager that runs the same version as, or a later version than, the original primary Ops Manager that the snapshot was taken from. If the replacement binary is older than the application database's recorded version, Ops Manager refuses to start with a "Downgrades are not permitted" error.
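As an illustration of the version rule, the sketch below compares a replacement binary version with the version recorded in the restored application database. It assumes plain dotted numeric version strings such as 8.0.24. The function names are hypothetical, and the sketch doesn't reproduce the check that Ops Manager runs at startup.

```python
def parse_version(version):
    """Convert a dotted numeric version string into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def replacement_is_new_enough(replacement_version, recorded_version):
    """Return True when the replacement binary is the same version as, or
    later than, the version recorded in the restored application database."""
    return parse_version(replacement_version) >= parse_version(recorded_version)


print(replacement_is_new_enough("8.0.24", "8.0.24"))  # prints True
print(replacement_is_new_enough("8.0.9", "8.0.24"))   # prints False
```

Comparing integer tuples instead of raw strings avoids ordering 8.0.9 after 8.0.24.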

Prerequisites

Restoration Mode is enabled by default in Ops Manager 8.0.24 and later. If you disabled it, re-enable it on the primary Ops Manager before you restore. For the steps, see Configure a Secondary Ops Manager to Back Up Ops Manager.
Confirm that the secondary Ops Manager has a completed snapshot and a continuous point-in-time recovery window for the backing databases.
Confirm that you preserved the required per-host state for recovery, including the `gen.key` file from the original primary Ops Manager installation.

Preserve Per-Host State

A full primary Ops Manager recovery requires more than the application database. Preserve the following state on each host, in addition to the application database that the secondary Ops Manager backs up:

Item	Location	Description
Encryption key `gen.key`	`/etc/mongodb-mms/gen.key`	Encrypts the application database contents. Must match the key used for the original installation, or the primary Ops Manager can't decrypt the restored application database on startup.
Ops Manager configuration	`conf-mms.properties` and JVM configuration files	Stores database URIs, blockstore configuration, license keys, and TLS certificates. Without it, you must reconfigure the primary Ops Manager by hand.
Agent configuration	`/etc/mongodb-mms/automation-agent.config` on each managed host	Stores the `mmsGroupId` and `mmsApiKey`. These must match the restored application database's project records so agents re-attach without re-registration.

Important

If the `gen.key` file is missing or doesn't match the restored application database, the primary Ops Manager fails its startup preflight check with an error that `gen.key` doesn't match the key already used for this Ops Manager installation. Keep `gen.key` in your disaster recovery backup alongside the application database data.
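One way to confirm that the key in your disaster recovery backup is the same file as the key on the host is to compare checksums. The sketch below is an assumption about how you might do that with the Python standard library; it isn't an Ops Manager tool, and the helper names and backup path are hypothetical.

```python
import hashlib
from pathlib import Path


def fingerprint(key_bytes):
    """Return the SHA-256 hex digest of the key material."""
    return hashlib.sha256(key_bytes).hexdigest()


def key_files_match(installed_path, backup_path):
    """Compare two key files by digest without printing their contents."""
    installed = fingerprint(Path(installed_path).read_bytes())
    backup = fingerprint(Path(backup_path).read_bytes())
    return installed == backup


# Example call (the backup path is hypothetical):
# key_files_match("/etc/mongodb-mms/gen.key", "/mnt/dr-backup/gen.key")
```

A matching digest shows only that the two files are identical. It doesn't prove that the key matches the restored application database, which the startup preflight check verifies.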

Restore the Primary Ops Manager

Choose a restore point

In the secondary Ops Manager, choose a point in time before the failure event, as close to the failure as the point-in-time recovery window for the primary Ops Manager's application database allows.

Restore the backing databases

In the secondary Ops Manager, restore the primary Ops Manager's application database to the point in time that you chose:

Click Continuous Backup, then select the application database replica set.
Click the menu, then click Restore.
Select Point in Time, then enter the target date and time.
Click Choose Cluster to Restore to, select the application database replica set hosts, then click Restore.

The secondary Ops Manager's MongoDB Agent stops the application database processes, replaces the data with the restored snapshot, replays the oplog to the target time, and restarts the processes.

Warning

If the secondary Ops Manager also backs up the snapshot and oplog metadata stores, restore them to the same point in time as the application database, or to a slightly later point, before you start the primary Ops Manager. Restoring the metadata stores to an earlier point causes the primary Ops Manager to reject affected restore jobs with HTTP 409 ("snapshot blocks missing").

If you restore only the application database, the secondary Ops Manager leaves the metadata stores unchanged. This is safe for normal operation, but backup snapshots that the primary Ops Manager took between the restore point and now might be unavailable.

Rebuild lost hosts if needed

If you lost the application database hosts, rebuild them before you run the automated restore in the previous step. Provision empty MongoDB processes with the same replica set name and ownership, then reinstall the MongoDB Agent on each host.

Start the primary Ops Manager

Start the primary Ops Manager. The primary Ops Manager connects to the restored application database, reads the restored state, and begins recovery. To learn more, see Start and Stop Ops Manager Application.

Let reconciliation complete

When a MongoDB Agent reports a later configuration version than the restored one, the primary Ops Manager enters Restoration Mode for the affected project and reconciles the deployment configuration automatically. While reconciliation runs, the primary Ops Manager shows a Restoration Mode banner. No manual action is required. The primary Ops Manager exits Restoration Mode when reconciliation completes.

Restoration Mode and Reconciliation

After you restore the application database and start the primary Ops Manager, the primary Ops Manager compares each MongoDB Agent's configuration version with the restored configuration version. If an agent reports a later version, the primary Ops Manager enters Restoration Mode for that project and reconciles the configuration automatically:

The primary Ops Manager isolates the project. It shows a Restoration Mode banner, returns an unchanged response to agent polls, and blocks deployment changes from the user interface and the API until reconciliation completes.
The primary Ops Manager collects the configuration version from each agent, selects the agent with the latest configuration, and writes that configuration to the application database as the authoritative configuration.
The primary Ops Manager exits Restoration Mode for the project and resumes normal operation. Backups resume automatically.

This reconciliation prevents a split-brain scenario. It converges all agents on a single authoritative configuration before any agent receives a new configuration.
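The selection logic described above can be modeled in a few lines. This is a conceptual sketch only: the `AgentReport` type, the integer version numbers, and the hostnames are hypothetical, and the code doesn't reflect how Ops Manager implements reconciliation internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentReport:
    hostname: str
    config_version: int


def needs_restoration_mode(restored_version, reports):
    """True when any agent reports a later version than the restored one."""
    return any(report.config_version > restored_version for report in reports)


def select_authoritative(reports):
    """Pick the report that carries the latest configuration version."""
    return max(reports, key=lambda report: report.config_version)


reports = [
    AgentReport("db1.example.net", 41),
    AgentReport("db2.example.net", 44),
    AgentReport("db3.example.net", 43),
]

print(needs_restoration_mode(41, reports))     # prints True
print(select_authoritative(reports).hostname)  # prints db2.example.net
```

In this model, the configuration held by `db2.example.net` becomes the authoritative configuration, and the other agents converge on it.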

When you roll the application database back to an earlier point in time, the restored configuration no longer includes deployment changes that you made after the restore point. Without reconciliation, agents receive the older configuration and stop the processes that the configuration no longer references. The impact depends on the deployment change:

Deployment Change	Risk Without Reconciliation
New replica set member	The data exists on other members, so you lose no data.
New shard with migrated chunks	The migrated chunks exist only on the new shard. Stopping it makes that data unreachable, which causes data loss.
New process version	The process can't run on the rolled-back binary version, which causes operational drift.
New index	Index queries degrade until Ops Manager rebuilds the index.

Reconciliation prevents these outcomes. The primary Ops Manager converges all agents on the latest configuration and enqueues an on-demand snapshot before it exits Restoration Mode, which gives you a coherent backup point.

Validate the Restored Ops Manager

Confirm that the restored primary Ops Manager is healthy:

Confirm that the MongoDB Agents reconnect and report as healthy.
Confirm that automation, backup, and monitoring resume for your managed deployments.
Confirm that the primary Ops Manager exits Restoration Mode. To check the Restoration Mode status, send a GET request to the following endpoint as a user with the Project Read Only role. The response shows the current state, the trigger reason, and timestamps for the project:
GET /api/public/v1.0/groups/{PROJECT-ID}/restorationMode
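The status check can be scripted. The sketch below uses only the Python standard library and assumes that the endpoint accepts HTTP Digest authentication with a programmatic API key pair, as other Ops Manager public API endpoints do. The base URL, project ID, and key values are placeholders. Because the response field names aren't listed here, the sketch returns the parsed JSON as is.

```python
import json
import urllib.request


def restoration_mode_url(base_url, project_id):
    """Build the Restoration Mode status URL for one project."""
    return f"{base_url.rstrip('/')}/api/public/v1.0/groups/{project_id}/restorationMode"


def digest_opener(base_url, public_key, private_key):
    """Create an opener that answers HTTP Digest challenges (assumed auth scheme)."""
    passwords = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    passwords.add_password(None, base_url, public_key, private_key)
    return urllib.request.build_opener(urllib.request.HTTPDigestAuthHandler(passwords))


def get_restoration_mode(opener, base_url, project_id):
    """Send the GET request and return the decoded JSON body."""
    with opener.open(restoration_mode_url(base_url, project_id), timeout=30) as response:
        return json.load(response)


# Example call (placeholder values; requires network access to Ops Manager):
# opener = digest_opener("https://opsmanager.example.com:8443", "PUBLIC-KEY", "PRIVATE-KEY")
# print(get_restoration_mode(opener, "https://opsmanager.example.com:8443", "PROJECT-ID"))
```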

Recover From a Stuck Reconciliation

If the primary Ops Manager restarts while it is in Restoration Mode, it re-triggers reconciliation on the next MongoDB Agent poll. You don't need to take action for the restart case.

Reconciliation might not complete, for example because agents are unreachable. In that case, use one of the following API endpoints as a user with the Project Owner role:

To retry reconciliation, send a POST request to the following endpoint. The retry resets the reconciliation failure counter and re-runs reconciliation without exiting Restoration Mode:
POST /api/public/v1.0/groups/{PROJECT-ID}/restorationMode/retry
To force the primary Ops Manager to exit Restoration Mode and accept the restored configuration, send a DELETE request to the following endpoint:
DELETE /api/public/v1.0/groups/{PROJECT-ID}/restorationMode

Warning

When you force the primary Ops Manager to exit Restoration Mode, it accepts the restored configuration without reconciling later agent configurations. Use this endpoint only when reconciliation can't complete. After force-exit, the primary Ops Manager serves the restored configuration to all agents in the project and won't re-enter Restoration Mode until every agent has converged on it.
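The retry and force-exit calls differ only in HTTP method and path. The following sketch builds both requests with the Python standard library so that you can review them before you send anything. The base URL and project ID are placeholders, and the function names are hypothetical. Sending the requests requires an opener that authenticates as a user with the Project Owner role, which this sketch doesn't include.

```python
import urllib.request


def _restoration_mode_url(base_url, project_id):
    """Build the Restoration Mode resource URL for one project."""
    return f"{base_url.rstrip('/')}/api/public/v1.0/groups/{project_id}/restorationMode"


def retry_request(base_url, project_id):
    """Request that retries reconciliation without exiting Restoration Mode."""
    url = _restoration_mode_url(base_url, project_id) + "/retry"
    return urllib.request.Request(url, method="POST")


def force_exit_request(base_url, project_id):
    """Request that forces an exit and accepts the restored configuration."""
    return urllib.request.Request(_restoration_mode_url(base_url, project_id), method="DELETE")


request = retry_request("https://opsmanager.example.com:8443", "PROJECT-ID")
print(request.get_method(), request.full_url)
# prints POST https://opsmanager.example.com:8443/api/public/v1.0/groups/PROJECT-ID/restorationMode/retry
```

Keeping the force-exit call in a separate, explicitly named function makes it harder to send the DELETE request by accident when you intend to retry.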

Cut Over to the Restored Ops Manager

This pattern restores the primary Ops Manager in place. It doesn't promote the secondary Ops Manager to replace the primary Ops Manager.

After you validate the restored primary Ops Manager, complete the cutover:

Update the URL to Access Ops Manager setting and any DNS records to point to the restored primary Ops Manager.
Confirm that the MongoDB Agents connect to the restored primary Ops Manager. The agents reconnect without re-registration when the `mmsGroupId` and `mmsApiKey` in their configuration match the restored project records.
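To spot-check an agent before cutover, you can compare the project ID in its configuration file with the ID of the restored project. The sketch below assumes that the agent configuration file uses one `key=value` setting per line. The sample contents and the project ID are made up, and the helper names are hypothetical.

```python
def read_agent_settings(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def agent_matches_project(settings, project_id):
    """True when the agent is configured for the given project ID."""
    return settings.get("mmsGroupId") == project_id


# Made-up sample contents; a real file lives on each managed host.
sample = """
# MongoDB Agent settings (illustrative values only)
mmsGroupId=0123456789abcdef01234567
mmsApiKey=placeholder-not-a-real-key
"""

settings = read_agent_settings(sample)
print(agent_matches_project(settings, "0123456789abcdef01234567"))  # prints True
```

The sketch never prints `mmsApiKey`, because the key is a credential. Compare it out of band if you also need to verify it.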

Operational Guidance

Use the following practices to operate and validate this pattern over time.

Test the Backup and Restore Path

An untested restore is an operational risk. Test this backup and restore path regularly. Treat the following runbook as a required practice, not optional guidance:

Perform a test restore on a schedule. Restore the backing databases to a sandbox Ops Manager and confirm that it starts and reconciles.
Confirm that snapshots appear on schedule and that the point-in-time recovery window stays continuous.
Review the primary Ops Manager and secondary Ops Manager logs for backup and restore errors.

Monitor the Secondary Ops Manager

Monitor the secondary Ops Manager and the backing databases that it backs up:

Use Ops Manager backup alerts to watch for missed or failed snapshots of the backing databases.
Confirm that the point-in-time recovery window stays continuous and keeps advancing.
Monitor the health of the secondary Ops Manager's application database and Backup Daemon.
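A continuity check for the point-in-time recovery window can be expressed as a search for gaps between covered time ranges. The sketch below is purely illustrative: it assumes that you have already exported the covered ranges as `(start, end)` pairs from your own monitoring data, and it doesn't call any Ops Manager API. The function name and the sample times are hypothetical.

```python
from datetime import datetime, timedelta


def find_gaps(ranges, tolerance=timedelta(0)):
    """Return (gap_start, gap_end) pairs where coverage is interrupted."""
    ordered = sorted(ranges)
    if not ordered:
        return []
    gaps = []
    covered_until = ordered[0][1]
    for start, end in ordered[1:]:
        # A gap exists when the next range starts after current coverage ends.
        if start - covered_until > tolerance:
            gaps.append((covered_until, start))
        covered_until = max(covered_until, end)
    return gaps


t0 = datetime(2024, 5, 1, 0, 0)
hour = timedelta(hours=1)
ranges = [(t0, t0 + hour), (t0 + hour, t0 + 2 * hour), (t0 + 3 * hour, t0 + 4 * hour)]

for gap_start, gap_end in find_gaps(ranges):
    print(gap_start.isoformat(), "->", gap_end.isoformat())
# prints 2024-05-01T02:00:00 -> 2024-05-01T03:00:00
```

To confirm that the window keeps advancing, compare the latest range end with the current time and alert when the difference exceeds a threshold that you choose.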

Failure Scenarios

The following table describes how this pattern behaves in common failure scenarios:

Scenario	Impact	Recovery
Secondary Ops Manager temporarily unavailable	New backups and restores pause. The primary Ops Manager and all managed agents continue running.	Restore the secondary Ops Manager. Backup agents resume automatically.
Secondary Ops Manager fails after a restore	None. After you restore the application database, the primary Ops Manager enters Restoration Mode and reconciles without the secondary Ops Manager.	No action required.
Loss of both Ops Manager instances	You lose point-in-time recovery for the primary Ops Manager's application database.	Rebuild the secondary Ops Manager and re-import the application database, or rebuild the primary Ops Manager and re-import your clusters manually.

Performance, Storage, and Cost Considerations

Consider the following when you run a dedicated secondary Ops Manager:

Size the secondary Ops Manager materially smaller than a production primary Ops Manager. It manages only the primary Ops Manager's backing databases for backup and restore, and doesn't manage your MongoDB clusters.
Plan snapshot storage capacity for the backing databases based on your snapshot schedule and retention policy.
Account for the extra infrastructure and operational cost of running a dedicated secondary Ops Manager.
