You can deploy a second Ops Manager instance, called a secondary Ops Manager, to back up a primary Ops Manager and its backing databases. The secondary Ops Manager also serves as your recovery path if you lose the primary Ops Manager.
This pattern protects the operational data that Ops Manager stores in its application database and metadata stores. Use this guide to design, configure, and operate disaster recovery for Ops Manager itself.
This guide is for Ops Manager administrators who manage backup and disaster recovery and for teams who design high availability and disaster recovery topologies for Ops Manager.
How Secondary Ops Manager Backup Works
In this pattern, two Ops Manager instances have distinct responsibilities:
The primary Ops Manager manages your MongoDB deployments and their backups, as usual.
The secondary Ops Manager manages and backs up only the primary Ops Manager's backing databases. The secondary Ops Manager doesn't manage your application clusters.
The MongoDB Agent runs on each host of the primary Ops Manager's application database and registers with the secondary Ops Manager. The secondary Ops Manager takes continuous and point-in-time backups of those backing databases.
If you lose the primary Ops Manager, you restore its backing databases from the secondary Ops Manager and then start a new primary Ops Manager. The primary Ops Manager reconnects to the restored backing databases and resumes management of your MongoDB deployments.
When the MongoDB Agents reconnect after the restart, they report a configuration version that is newer than the one recorded in the restored application database. The primary Ops Manager detects the mismatch and automatically enters Restoration Mode for the affected project. In Restoration Mode, Ops Manager converges all agents on the restored configuration and blocks deployment changes until reconciliation completes.
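The mismatch check described above can be pictured with a small model. This is an illustrative sketch only, not Ops Manager's implementation: the `Project` class, the integer configuration versions, and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Project:
    """Hypothetical stand-in for a project restored from a snapshot."""
    name: str
    restored_config_version: int  # version recorded in the restored database
    restoration_mode: bool = False


def handle_agent_report(project: Project, agent_config_version: int) -> bool:
    """Process one agent report; return True if deployment changes are allowed.

    An agent that reports a version newer than the restored one signals
    that the database was rolled back, so the project enters Restoration Mode.
    """
    if agent_config_version > project.restored_config_version:
        project.restoration_mode = True
    return not project.restoration_mode


def complete_reconciliation(project: Project) -> None:
    """Mark the project reconciled once all agents match the restored config."""
    project.restoration_mode = False
```

In this model, deployment changes stay blocked from the first mismatched report until `complete_reconciliation` runs, which mirrors the behavior the paragraph describes.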
Architecture
The following table describes the components in this pattern and their responsibilities:
| Component | Responsibility |
|---|---|
| Primary Ops Manager | Manages your MongoDB deployments and their backups. Stores its own operational data in its application database, snapshot metadata store, and oplog metadata store. |
| Secondary Ops Manager | Runs a Backup Daemon that writes to an S3-compatible storage blockstore for application database snapshots and oplog slices. Continuously backs up the primary Ops Manager's backing databases. Doesn't manage your application clusters. |
| Application database | Stores the primary Ops Manager's operational data, including project configuration, automation state, and backup metadata. You must back up the application database. |
| Snapshot and oplog metadata stores | Store the block and oplog indexes for the deployments that the primary Ops Manager backs up. Back up these stores as well. |
| MongoDB Agent | Runs on each backing database host and registers with the secondary Ops Manager to perform backups and restores. |
The secondary Ops Manager stores the backups of the primary Ops Manager's backing databases in its own S3-compatible storage blockstore, separate from the primary Ops Manager's backup storage.
Deployment Variants
Deploy the secondary Ops Manager in a separate failure domain from the primary Ops Manager to prevent a single failure from affecting both instances. Common variants include:
Different Regions
Deploy the secondary Ops Manager in a different cloud region than the primary Ops Manager. This variant protects against the loss of a region.
Different Data Centers
Deploy the secondary Ops Manager in a different data center than the primary Ops Manager. This variant protects against the loss of a data center.
Separate Backup Network
Place the secondary Ops Manager on a separate network that is dedicated to backup traffic. This variant isolates backup traffic from your application network.
Important
Deploy the secondary Ops Manager in a separate failure domain, such as a different rack, availability zone, region, or network segment, from the primary Ops Manager. If both instances share a failure domain, a single failure can disrupt both the primary Ops Manager and its recovery path.
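A placement review can be reduced to a simple comparison of failure-domain labels. The sketch below is a hypothetical planning aid, not an Ops Manager feature. The `Placement` fields and the label values are assumptions, and the labels are expected to be globally unique (for example, a zone name that includes its region).

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Placement:
    """Hypothetical description of where an Ops Manager instance runs."""
    region: str
    availability_zone: str
    rack: str
    network_segment: str


def shared_failure_domains(primary: Placement, secondary: Placement) -> list[str]:
    """Return the names of failure domains that both instances share."""
    return [
        f.name
        for f in fields(Placement)
        if getattr(primary, f.name) == getattr(secondary, f.name)
    ]


primary = Placement("us-east-1", "us-east-1a", "rack-12", "app-net")
secondary = Placement("us-west-2", "us-west-2b", "rack-03", "backup-net")
print(shared_failure_domains(primary, secondary))  # prints []
```

An empty result means the two instances share none of the listed domains. Any name in the result identifies a domain where a single failure could affect both the primary Ops Manager and its recovery path.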
Supported Versions and Limitations
Before you use this pattern, review the following requirements and limitations.
Supported Versions
Both the primary and secondary Ops Manager instances must run Ops Manager 8.0.24 or later.
The secondary Ops Manager must run the same version as the primary Ops Manager or a later version. Don't run a secondary Ops Manager on an earlier version than the primary Ops Manager.
Warning
Restore the application database to a primary Ops Manager that runs the same version as, or a later version than, the original primary Ops Manager that the snapshot was taken from. If the replacement binary is older than the application database's recorded version, Ops Manager refuses to start with a "Downgrades are not permitted" error.
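The version rules in this section can be expressed as a pre-flight check. The following is a minimal sketch that assumes plain `major.minor.patch` version strings. The function names are hypothetical and aren't part of any Ops Manager API.

```python
MINIMUM_VERSION = (8, 0, 24)  # minimum version stated in this guide


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '8.0.24' into (8, 0, 24) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def check_pair(primary: str, secondary: str) -> list[str]:
    """Return a list of problems with a primary/secondary version pair."""
    problems = []
    p, s = parse_version(primary), parse_version(secondary)
    if p < MINIMUM_VERSION:
        problems.append(f"primary {primary} is earlier than 8.0.24")
    if s < MINIMUM_VERSION:
        problems.append(f"secondary {secondary} is earlier than 8.0.24")
    if s < p:
        problems.append(f"secondary {secondary} is earlier than primary {primary}")
    return problems


def can_restore(snapshot_source: str, replacement: str) -> bool:
    """True if the replacement primary is the same version or later than
    the primary Ops Manager that the snapshot was taken from."""
    return parse_version(replacement) >= parse_version(snapshot_source)
```

For example, `check_pair("8.0.25", "8.0.24")` reports that the secondary is earlier than the primary, and `can_restore("8.0.25", "8.0.24")` returns `False`, which corresponds to the downgrade case that the warning describes.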
Limitations
This pattern backs up the primary Ops Manager's backing databases. It doesn't back up arbitrary MongoDB clusters. The primary Ops Manager continues to manage backups for your MongoDB deployments.
Backing up and reconciling the snapshot metadata store and the oplog metadata store is a manual procedure. Ops Manager doesn't automatically select a restore point for these stores. As a result, backup metadata can be inconsistent after a restore, and some backups might be non-restorable. Ops Manager validates a snapshot before it restores and fails with an error rather than performing an unsafe restore.
Restoration Mode doesn't apply to externally managed deployments, such as deployments that a Kubernetes Operator manages. After you restore the application database, agents in these projects receive the restored configuration directly on their next poll and converge without entering Restoration Mode. No action is required for these projects.
A snapshot can become unrestorable if its data blocks are no longer in the snapshot store. Before a restore, the primary Ops Manager verifies that the snapshot's blocks exist. If blocks are missing, the restore fails with an error and leaves the replica set unmodified, instead of wiping it and failing partway through.
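The verify-before-modify ordering described above is the important part of this safeguard. The sketch below illustrates that ordering with in-memory dictionaries. It doesn't reflect Ops Manager's internal code, and `restore_snapshot`, `MissingBlocksError`, and the dictionary-based stores are assumptions made for the example.

```python
class MissingBlocksError(Exception):
    """Raised when a snapshot references blocks that the store no longer has."""


def restore_snapshot(snapshot_blocks: list[str],
                     block_store: dict[str, bytes],
                     target: dict[str, bytes]) -> None:
    """Restore a snapshot into target only if every block is available."""
    # Step 1: verify every block before touching the target.
    missing = [block_id for block_id in snapshot_blocks
               if block_id not in block_store]
    if missing:
        raise MissingBlocksError(f"missing blocks: {missing}")

    # Step 2: only now replace the target's contents.
    target.clear()
    for block_id in snapshot_blocks:
        target[block_id] = block_store[block_id]
```

Because the check runs before `target.clear()`, a snapshot with missing blocks raises an error and leaves the existing data in place.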
An untested restore is an operational risk. Validate the backup and restore path regularly. See the validation runbook in Restore Ops Manager from a Secondary Ops Manager.
Next Steps
To set up and operate this pattern, see the following pages: