Troubleshoot the Kubernetes Operator¶
On this page
- Get Status of a Deployed Resource
- Review the Logs
- Check Messages from the Validation Webhook
- View All MongoDB Resource Specifications
- Restore StatefulSet that Failed to Deploy
- Replace a ConfigMap to Reflect Changes
- Remove Kubernetes Components
- Create a New Persistent Volume Claim after Deleting a Pod
- Disable Ops Manager Feature Controls
- Debug a Failing Container
- Verify Correctness of Domain Names in TLS Certificates
- Verify the MongoDB Version when Running in Local Mode
- Upgrade Fails Using kubectl or oc
- Upgrade Fails Using Helm Charts
- Two Operator Instances After an Upgrade
Important
This section is for single Kubernetes cluster deployments only. For multi-Kubernetes-cluster deployments, see Troubleshoot Deployments with Multiple Kubernetes Clusters.
Get Status of a Deployed Resource¶
To find the status of a resource deployed with the Kubernetes Operator, invoke one of the following commands:
For Ops Manager resource deployments:
- The status.applicationDatabase.phase field displays the Application Database resource deployment status.
- The status.backup.phase field displays the backup daemon resource deployment status.
- The status.opsManager.phase field displays the Ops Manager resource deployment status.
Note
The Cloud Manager or Ops Manager controller watches the database resources defined in the following settings:
- spec.backup.opLogStores
- spec.backup.s3Stores
- spec.backup.blockStores
For MongoDB resource deployments:
The status.phase field displays the MongoDB resource deployment status.
The following key-value pairs describe the resource deployment statuses:
Key | Value
---|---
message | Message explaining why the resource is in a Pending or Failed state.
phase | Current phase of the resource deployment, such as Pending, Running, or Failed.
lastTransition | Timestamp in ISO 8601 date and time format in UTC when the last reconciliation happened.
link | Deployment URL in Ops Manager.
backup.statusName | If you enabled continuous backups with spec.backup.mode in Kubernetes for your MongoDB resource, this field indicates the status of the backup, such as backup.statusName: "STARTED". Possible values are STARTED, STOPPED, and TERMINATED.
Resource-specific fields | For descriptions of these fields, see MongoDB Database Resource Specification.
Example
To see the status of a replica set named my-replica-set in the developer namespace, run:
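A minimal sketch of the status check, assuming the mdb short name for MongoDB resources is registered with your CRDs:

```shell
# Print the full resource, including its status block, for the
# my-replica-set deployment in the developer namespace.
kubectl get mdb my-replica-set -n developer -o yaml
```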
If my-replica-set is running, the status.phase field reports Running. If it is not running, the phase field reports Pending or Failed and the message field explains the cause.
Review the Logs¶
Keep and review adequate logs to help debug issues and monitor cluster activity. Use the recommended logging architecture to retain Pod logs even after a Pod is deleted.
Review Logs from the Kubernetes Operator¶
To review the Kubernetes Operator logs, invoke this command:
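A hedged sketch of the log command, assuming the default deployment name mongodb-enterprise-operator in the mongodb namespace:

```shell
# Stream the Operator's logs from its Deployment.
kubectl logs -f deployment/mongodb-enterprise-operator -n mongodb
```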
You can also check the Ops Manager logs to see whether any issues were reported to Ops Manager.
Find a Specific Pod¶
To find which pods are available, invoke this command first:
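For example, assuming your resources run in the mongodb namespace:

```shell
# List all Pods in the namespace so you can pick the one to inspect.
kubectl get pods -n mongodb
```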
See also
Kubernetes documentation on kubectl get.
Review Logs from Specific Pod¶
If you want to narrow your review to a specific Pod, you can invoke this command:
Example
If your replica set is labeled myrs, run:
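A sketch of the command, assuming the mongodb namespace; StatefulSet Pods are named <resource-name>-<index>, so the first member of myrs is myrs-0:

```shell
# Review the log of the first member of the myrs replica set.
kubectl logs myrs-0 -n mongodb
```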
This returns the Automation Agent Log for this replica set.
Check Messages from the Validation Webhook¶
The Kubernetes Operator uses a validation webhook to prevent users from applying invalid resource definitions. The webhook rejects invalid requests.
The ClusterRole and ClusterRoleBinding for the webhook are included in the default configuration files that you apply during the installation. To create the role and binding, you must have cluster-admin privileges.
If you create an invalid resource definition, the webhook returns a message similar to the following that describes the error to the shell:
When the Kubernetes Operator reconciles each resource, it also validates that resource. The Kubernetes Operator doesn’t require the validation webhook to create or update resources.
If you omit the validation webhook, remove the webhook's role and binding from the default configuration, or lack sufficient privileges to run the configuration, the Kubernetes Operator issues warnings, as these are not critical errors. If the Kubernetes Operator encounters a critical error, it marks the resource as Failed.
GKE (Google Kubernetes Engine) deployments
GKE (Google Kubernetes Engine) has a known issue with the webhook when deploying to private clusters. To learn more, see Update Google Firewall Rules to Fix WebHook Issues.
View All MongoDB Resource Specifications¶
To view all MongoDB resource specifications in the provided namespace:
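For example, assuming the mongodb namespace and the mdb short name for MongoDB resources:

```shell
# List every MongoDB resource in the namespace.
kubectl get mdb -n mongodb
```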
Example
To read details about the dublin standalone resource, run this command:
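A sketch of the command, assuming dublin was deployed in the mongodb namespace:

```shell
# Print the full specification and status of the dublin standalone.
kubectl get mdb dublin -n mongodb -o yaml
```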
This returns the resource's full specification and current status.
Restore StatefulSet that Failed to Deploy¶
A StatefulSet Pod may hang with a status of Pending if it encounters an error during deployment.
Pending Pods do not terminate automatically, even if you make and apply configuration changes to resolve the error.
To return the StatefulSet to a healthy state, apply the configuration changes to the MongoDB resource in the Pending state, then delete those Pods.
Example
A host system has a number of running Pods:
my-replica-set-2 is stuck in the Pending stage. To gather more data on the error, run:
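A sketch of the inspection command, assuming the mongodb namespace; the Events section at the end of the output reports scheduling and resource-allocation failures:

```shell
# Describe the hung Pod to surface its scheduling events.
kubectl describe pod my-replica-set-2 -n mongodb
```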
The output indicates an error in memory allocation.
Updating the memory allocations in the MongoDB resource is insufficient, as the pod does not terminate automatically after applying configuration updates.
To remedy this issue, update the configuration, apply the configuration, then delete the hung pod:
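The sequence might look like the following, where replica-set.yaml is a hypothetical file containing the corrected MongoDB resource:

```shell
# 1. Apply the corrected memory allocation to the MongoDB resource.
kubectl apply -f replica-set.yaml -n mongodb

# 2. Delete the hung Pod; the StatefulSet controller re-creates it
#    with the corrected configuration.
kubectl delete pod my-replica-set-2 -n mongodb
```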
Once the hung Pod is deleted, the other Pods restart with your new configuration as part of a rolling upgrade of the StatefulSet.
Note
To learn more about this issue, see Kubernetes Issue 67250.
Replace a ConfigMap to Reflect Changes¶
If you cannot modify or redeploy an already-deployed resource ConfigMap file using the kubectl apply command, run:
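One way to do this is with a force replace, where my-project.yaml is a hypothetical file containing the ConfigMap definition:

```shell
# Force-replace the ConfigMap: this deletes the existing object
# and re-creates it from the file in one step.
kubectl replace --force -f my-project.yaml -n mongodb
```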
This deletes and re-creates the ConfigMap resource file.
This command is useful in cases where you want to make an immediate recursive change, or you need to update resource files that cannot be updated once initialized.
Remove Kubernetes Components¶
Important
To remove any component, you need the following permissions:
Cluster Roles |
---|---
Cluster Role Bindings |
Remove a MongoDB Resource¶
To remove any instance that Kubernetes deployed, you must use Kubernetes.
Important
- You can use only the Kubernetes Operator to remove Kubernetes-deployed instances. If you use Ops Manager to remove the instance, Ops Manager throws an error.
- Deleting a MongoDB resource doesn’t remove it from the Ops Manager UI. You must remove the resource from Ops Manager manually. To learn more, see Remove a Process from Monitoring.
- Deleting a MongoDB resource for which you enabled backup doesn’t delete the resource’s snapshots. You must delete snapshots in Ops Manager.
Example
To remove a single MongoDB instance you created using Kubernetes:
To remove all MongoDB instances you created using Kubernetes:
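Hedged sketches of both commands, assuming the mongodb namespace and the mdb short name; my-replica-set is a placeholder resource name:

```shell
# Remove a single MongoDB resource by name.
kubectl delete mdb my-replica-set -n mongodb

# Remove every MongoDB resource in the namespace.
kubectl delete mdb --all -n mongodb
```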
Remove the CustomResourceDefinitions¶
To remove the CustomResourceDefinitions:
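A sketch of the command, assuming the standard CRD names shipped with the MongoDB Enterprise Kubernetes Operator:

```shell
# Delete the Operator's custom resource definitions.
kubectl delete crd mongodb.mongodb.com mongodbusers.mongodb.com opsmanagers.mongodb.com
```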
Create a New Persistent Volume Claim after Deleting a Pod¶
If you accidentally delete the MongoDB replica set Pod and its Persistent Volume Claim, the Kubernetes Operator fails to reschedule the MongoDB Pod and issues the following error message:
To recover from this error, you must manually create a new PVC with the PVC object's name that corresponds to this replica set Pod, such as data-<replicaset-pod-name>.
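A minimal sketch of re-creating the claim; the claim name, namespace, storage class defaults, and size below are hypothetical and must match the deleted claim's original values:

```shell
# Re-create the PVC for the Pod my-replica-set-1 (placeholder name).
kubectl apply -n mongodb -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-my-replica-set-1
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 16Gi
EOF
```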
Disable Ops Manager Feature Controls¶
When you manage an Ops Manager project through the Kubernetes Operator, the
Kubernetes Operator places the EXTERNALLY_MANAGED_LOCK
feature control policy
on the project. This policy disables certain features in the Ops Manager
application that might compromise your Kubernetes Operator configuration. If
you need to use these blocked features, you can remove the policy
through the feature controls API,
make changes in the Ops Manager application, and then restore the original
policy through the API.
Warning
The following procedure enables you to use features in the Ops Manager application that are otherwise blocked by the Kubernetes Operator.
Retrieve the feature control policies for your Ops Manager project.
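A sketch of the request against the Ops Manager feature controls API; the host, project ID, and credentials are placeholders, and you should verify the endpoint against your Ops Manager version's API documentation:

```shell
# Retrieve the project's current feature control policies.
curl --user "{USERNAME}:{APIKEY}" --digest \
  --header "Accept: application/json" \
  "https://<ops-manager-host>/api/public/v1.0/groups/{PROJECT-ID}/controlledFeature?pretty=true"
```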
Save the response that the API returns. After you make changes in the Ops Manager application, you must add these policies back to the project.
Important
Note the highlighted fields and values in the following sample response. You must send these same fields and values in later steps when you remove and add feature control policies.
The externalManagementSystem.version field corresponds to the Kubernetes Operator version. You must send the exact same field value in your requests later in this task.
Your response should be similar to:
Update the policies array with an empty list:
Note
The values you provide for the externalManagementSystem object, like the externalManagementSystem.version field, must match values that you received in the response in Step 1.
The previously blocked features are now available in the Ops Manager application.
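A sketch of the update request; the host, project ID, credentials, and version value are placeholders, and the payload shape should be checked against the response you saved in Step 1:

```shell
# Clear the policies list; externalManagementSystem.version must
# match the value returned in Step 1 ("12.0.0" is a placeholder).
curl --user "{USERNAME}:{APIKEY}" --digest \
  --header "Content-Type: application/json" \
  --request PUT \
  --data '{"externalManagementSystem": {"version": "12.0.0"}, "policies": []}' \
  "https://<ops-manager-host>/api/public/v1.0/groups/{PROJECT-ID}/controlledFeature"
```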
Make your changes in the Ops Manager application.
Update the policies array with the original feature control policies:
Note
The values you provide for the externalManagementSystem object, like the externalManagementSystem.version field, must match values that you received in the response in Step 1.
The features are now blocked again, preventing you from making further changes through the Ops Manager application. However, the Kubernetes Operator retains any changes you made in the Ops Manager application while features were available.
Debug a Failing Container¶
A container might fail with an error that results in Kubernetes restarting that container in a loop.
You may need to interact with that container to inspect files or run commands. This requires you to prevent the container from restarting.
In your preferred text editor, open the MongoDB resource you need to repair.
To this resource, add a podSpec collection that resembles the following.
The sleep command in the spec.podSpec.podTemplate.spec instructs the container to wait for the number of seconds you specify. In this example, the container waits for 1 hour.
Apply this change to the resource.
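A hypothetical fragment of such an override; the container name mongodb-enterprise-database is an assumption and must match the database container name in your deployment:

```yaml
# Override the database container's entrypoint so it sleeps for
# 1 hour instead of starting the failing process.
spec:
  podSpec:
    podTemplate:
      spec:
        containers:
          - name: mongodb-enterprise-database
            command: ["/bin/sh"]
            args: ["-c", "sleep 3600"]
```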
Invoke the shell inside the container.
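A sketch of the command, where the Pod name and namespace are placeholders:

```shell
# Open an interactive shell inside the now-idle container.
kubectl exec -it my-replica-set-0 -n mongodb -- /bin/bash
```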
Verify Correctness of Domain Names in TLS Certificates¶
A MongoDB replica set or sharded cluster may fail to reach the READY state if the TLS certificate is invalid.
When you configure TLS for MongoDB replica sets or sharded clusters, verify that you specify a valid certificate.
If you don't specify the correct Domain Name for each TLS certificate, the Kubernetes Operator logs may contain an error message similar to the following, where foo.svc.local is the incorrectly specified Domain Name for the cluster member's Pod:
Each certificate should include a valid Domain Name.
For each replica set or sharded cluster member, the Common Name, also known as the Domain Name, for that member’s certificate must match the FQDN of the pod this cluster member is deployed on.
The FQDN in each certificate has the following syntax: pod-name.service-name.namespace.svc.cluster.local. This name is different for each Pod hosting a member of the replica set or sharded cluster.
For example, for a member of a replica set deployed on a Pod with the name rs-mongos-0-0, in the Kubernetes Operator service named mongo-0 that is created in the default mongodb namespace, the FQDN is rs-mongos-0-0.mongo-0.mongodb.svc.cluster.local.
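The FQDN for this example composes from its parts as follows:

```shell
# Build the expected certificate Common Name from the Pod name,
# service name, and namespace given above.
POD_NAME="rs-mongos-0-0"
SERVICE_NAME="mongo-0"
NAMESPACE="mongodb"
echo "${POD_NAME}.${SERVICE_NAME}.${NAMESPACE}.svc.cluster.local"
# → rs-mongos-0-0.mongo-0.mongodb.svc.cluster.local
```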
To check whether you have correctly configured TLS certificates:
Run:
Check for TLS-related messages in the Kubernetes Operator log files.
To learn more about TLS certificate requirements, see the prerequisites on the TLS-Encrypted Connections tab in Deploy a Replica Set or in Deploy a Sharded Cluster.
Verify the MongoDB Version when Running in Local Mode¶
A MongoDB custom resource may fail to reach the Running state if Ops Manager is running in Local Mode and you specify either a MongoDB version that doesn't exist, or a valid MongoDB version for which Ops Manager running in Local Mode did not download a corresponding MongoDB archive.
If you specify a MongoDB version that doesn't exist, or a valid MongoDB version for which Ops Manager could not download a MongoDB archive, then even though the Pods can reach the READY state, the Kubernetes Operator logs contain an error message similar to the following:
This may mean that the MongoDB Agent could not successfully download a corresponding MongoDB binary to the /var/lib/mongodb-mms-automation directory. In cases when the MongoDB Agent can download the MongoDB binary for the specified MongoDB version successfully, this directory contains a MongoDB binary folder, such as mongodb-linux-x86_64-4.4.0.
To check whether a MongoDB binary folder is present:
Specify the Pod’s name to this command:
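A sketch of the command, where the Pod name and namespace are placeholders:

```shell
# List the MongoDB Agent's automation directory inside the Pod.
kubectl exec my-replica-set-0 -n mongodb -- ls /var/lib/mongodb-mms-automation
```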
Check whether a MongoDB binary folder is present in the /var/lib/mongodb-mms-automation directory.
If you cannot locate a MongoDB binary folder, copy the MongoDB archive into the Ops Manager Persistent Volume for each deployed Ops Manager replica set.
Upgrade Fails Using kubectl or oc¶
You might receive the following error when you upgrade the Kubernetes Operator:
To resolve this error:
Remove the old Kubernetes Operator deployment.
Note
Removing the Kubernetes Operator deployment doesn’t affect the lifecycle of your MongoDB resources.
Repeat the kubectl apply command to upgrade to the new version of the Kubernetes Operator.
Upgrade Fails Using Helm Charts¶
You might receive the following error when you upgrade the Kubernetes Operator:
To resolve this error:
Remove the old Kubernetes Operator deployment.
Note
Removing the Kubernetes Operator deployment doesn’t affect the lifecycle of your MongoDB resources.
Repeat the helm command to upgrade to the new version of the Kubernetes Operator.
Two Operator Instances After an Upgrade¶
After you upgrade from Kubernetes Operator version 1.10 or earlier to a version 1.11 or later, your Kubernetes cluster might have two instances of the Kubernetes Operator deployed.
Use the get pods command to view your Kubernetes Operator pods:
Note
If you deployed the Kubernetes Operator to OpenShift, replace the kubectl commands in this section with oc commands.
If the response contains both an enterprise-operator and a mongodb-enterprise-operator pod, your cluster has two Kubernetes Operator instances:
You can safely remove the enterprise-operator deployment. Run the following command to remove it:
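A sketch of the command, assuming the Operator is installed in the mongodb namespace:

```shell
# Remove the duplicate Operator deployment left over from the upgrade.
kubectl delete deployment enterprise-operator -n mongodb
```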