Disaster Recovery

The Kubernetes Operator can orchestrate the recovery of MongoDB replica set members to a healthy Kubernetes cluster when it identifies that the original Kubernetes cluster is down.

Disaster Recovery Modes

The Kubernetes Operator can orchestrate either an automatic or manual remediation of the MongoDBMulti resources in a disaster recovery scenario, using one of the following modes:

  • Auto Failover Mode allows the Kubernetes Operator to shift the affected MongoDB replica set members from an unhealthy Kubernetes cluster to healthy Kubernetes clusters. When the Kubernetes Operator performs this auto remediation, it evenly distributes replica set members across the healthy Kubernetes clusters.

    To enable this mode, use --set multiCluster.performFailover=true in the MongoDB Helm Charts for Kubernetes. In the values.yaml file in the MongoDB Helm Charts for Kubernetes directory, this variable's default value is true.
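
    For example, the following sketch upgrades an existing Helm installation to enable auto failover. The release name mongodb-enterprise-operator-multi-cluster and the chart reference mongodb/enterprise-operator are assumptions; substitute the names from your own installation:

    # Release and chart names are illustrative; use your own.
    helm upgrade --install mongodb-enterprise-operator-multi-cluster \
      mongodb/enterprise-operator \
      --namespace mongodb \
      --set multiCluster.performFailover=true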

    Alternatively, you can set the multi-Kubernetes-cluster deployment environment variable PERFORM_FAILOVER to true, as in the following abbreviated example:

    spec:
      template:
        ...
        spec:
          containers:
          - name: mongodb-enterprise-operator
            ...
            env:
            ...
            - name: PERFORM_FAILOVER
              value: "true"
            ...
    
  • Manual (CLI-based) Failover Mode allows you to use the multi-cluster CLI to reconfigure the Kubernetes Operator to use new healthy Kubernetes clusters. In this mode, you distribute replica set members across the new healthy clusters by configuring the MongoDBMulti custom resource.

    To enable this mode, use --set multiCluster.performFailover=false in the MongoDB Helm Charts for Kubernetes, or set the multi-Kubernetes-cluster deployment environment variable PERFORM_FAILOVER to false, as in the following abbreviated example:

    spec:
      template:
        ...
        spec:
          containers:
          - name: mongodb-enterprise-operator
            ...
            env:
            ...
            - name: PERFORM_FAILOVER
              value: "false"
            ...
    

Note

You can’t rely on the auto or manual failover modes when a Kubernetes cluster hosting one or more Kubernetes Operator instances goes down, or when the replica set member resides on the same failed Kubernetes cluster as the Kubernetes Operator instance that manages it.

In such cases, to restore replica set members from lost Kubernetes clusters to the remaining healthy Kubernetes clusters, you must first restore the Kubernetes Operator instance that manages your multi-Kubernetes-cluster deployments, or redeploy the Kubernetes Operator to one of the remaining Kubernetes clusters, and rerun the multi-cluster CLI. To learn more, see Manually Recover from a Failure Using the Multi-Cluster CLI.

Manually Recover from a Failure Using the Multi-Cluster CLI

When a Kubernetes cluster hosting one or more Kubernetes Operator instances goes down, or when the replica set member resides on the same failed Kubernetes cluster as the Kubernetes Operator instance that manages it, you can’t rely on the auto or manual failover modes and must use the following procedure to manually recover from the failed Kubernetes cluster.

The following procedure uses the multi-cluster CLI to:

  • Configure new healthy Kubernetes clusters.
  • Rebalance the replica set members of MongoDBMulti resources onto nodes in the healthy Kubernetes clusters.

Before you start the following procedure, ensure that you:

  • Deployed one central cluster and three member clusters, following the Quick Start Procedure. In this case, the Kubernetes Operator is installed with automated failover disabled (--set multiCluster.performFailover=false).

  • Deployed a MongoDBMulti resource as follows:

    kubectl apply -n mongodb -f - <<EOF
    apiVersion: mongodb.com/v1
    kind: MongoDBMulti
    metadata:
      name: multi-replica-set
    spec:
      version: 5.0.5-ent
      type: ReplicaSet
      persistent: false
      duplicateServiceObjects: true
      credentials: my-credentials
      opsManager:
        configMapRef:
          name: my-project
      security:
        tls:
          ca: custom-ca
      clusterSpecList:
        - clusterName: ${MDB_CLUSTER_1_FULL_NAME}
          members: 3
        - clusterName: ${MDB_CLUSTER_2_FULL_NAME}
          members: 2
        - clusterName: ${MDB_CLUSTER_3_FULL_NAME}
          members: 3
    EOF
    

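The examples on this page reference the cluster names through shell variables such as ${MDB_CLUSTER_1_FULL_NAME}. A minimal sketch of setting them, using hypothetical values; substitute the full names of your own clusters:

    # Hypothetical cluster names; replace with your own full cluster names.
    export MDB_CENTRAL_CLUSTER_FULL_NAME="mdb-central"
    export MDB_CLUSTER_1_FULL_NAME="mdb-1"
    export MDB_CLUSTER_2_FULL_NAME="mdb-2"
    export MDB_CLUSTER_3_FULL_NAME="mdb-3"
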
The Kubernetes Operator periodically checks for connectivity to the clusters in the multi-Kubernetes-cluster deployment by pinging the /healthz endpoints of the corresponding servers. To learn more about /healthz, see Kubernetes API health endpoints.
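
For example, you can query the same endpoint manually to check whether a member cluster's API server is reachable. This sketch assumes that your kubectl context names match the cluster full names used on this page:

    # Query the member cluster's API server health endpoint.
    # A healthy API server returns "ok".
    kubectl --context="${MDB_CLUSTER_3_FULL_NAME}" get --raw='/healthz'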

If CLUSTER_3 in our example becomes unavailable, the Kubernetes Operator detects the failed connections to the cluster and marks the MongoDBMulti resources with the failedClusters annotation for subsequent reconciliations.
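
To see which clusters the Kubernetes Operator considers failed, you can read the annotation from the resource. A minimal sketch, assuming the MongoDBMulti custom resource's plural name is mongodbmulti and the annotation key is failedClusters, as described above:

    # Resource plural and annotation key are assumptions; verify against your CRDs.
    kubectl get mongodbmulti multi-replica-set -n mongodb \
      -o jsonpath='{.metadata.annotations.failedClusters}'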

Resources with data nodes deployed on this cluster fail to reconcile until you complete the manual recovery steps in the following procedure.

To re-balance the MongoDB data nodes so that all the workloads run on CLUSTER_1 and CLUSTER_2:

  1. Recover the multi-Kubernetes-cluster deployment using the multi-cluster CLI as follows:

    go run main.go recover \
    -central-cluster="${MDB_CENTRAL_CLUSTER_FULL_NAME}" \
    -member-clusters="${MDB_CLUSTER_1_FULL_NAME},${MDB_CLUSTER_2_FULL_NAME}" \
    -member-cluster-namespace="mongodb" \
    -central-cluster-namespace="mongodb" \
    -operator-name=mongodb-enterprise-operator-multi-cluster \
    -source-cluster="${MDB_CLUSTER_1_FULL_NAME}"
    

    This command:

    • Reconfigures the Kubernetes Operator to manage workloads on the two healthy clusters. (This list could also include new clusters.)
    • Marks CLUSTER_1 as the source of the member node configuration for the healthy Kubernetes clusters. This means that the Roles and Service Account configuration is replicated to match the configuration in CLUSTER_1.
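
    As an optional check after the command completes, inspect the Kubernetes Operator's member cluster list on the central cluster. This sketch assumes that the multi-cluster CLI stores the member clusters in a ConfigMap named mongodb-enterprise-operator-member-list, created during initial setup:

    # The ConfigMap name is an assumption; verify it against your installation.
    kubectl --context="${MDB_CENTRAL_CLUSTER_FULL_NAME}" -n mongodb \
      get configmap mongodb-enterprise-operator-member-list -o yaml
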
  2. Reconfigure the MongoDBMulti resource to rebalance the data nodes across the healthy Kubernetes clusters by editing the affected resource:

    kubectl apply -n mongodb -f - <<EOF
    apiVersion: mongodb.com/v1
    kind: MongoDBMulti
    metadata:
      name: multi-replica-set
    spec:
      version: 5.0.5-ent
      type: ReplicaSet
      persistent: false
      duplicateServiceObjects: true
      credentials: my-credentials
      opsManager:
        configMapRef:
          name: my-project
      security:
        tls:
          ca: custom-ca
      clusterSpecList:
        - clusterName: ${MDB_CLUSTER_1_FULL_NAME}
          members: 4
        - clusterName: ${MDB_CLUSTER_2_FULL_NAME}
          members: 3
    EOF
    
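After the Kubernetes Operator reconciles the change, you can verify the rebalanced deployment by listing the pods on each healthy cluster. A minimal check, assuming your kubectl context names match the cluster full names:

    # Expect four replica set member pods on CLUSTER_1 and three on CLUSTER_2.
    kubectl --context="${MDB_CLUSTER_1_FULL_NAME}" get pods -n mongodb
    kubectl --context="${MDB_CLUSTER_2_FULL_NAME}" get pods -n mongodb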

Manually Recover from a Failure Using GitOps Workflows

For an example of using the multi-cluster CLI in a GitOps workflow with Argo CD, see multi-cluster CLI example for GitOps.