Ops Manager Shows “Last Ping: Never” for a Host That Is Alive

I deployed a sharded cluster with the Kubernetes operator and manage it through Ops Manager 4.4.6. The cluster configuration: 2 mongos, 2 shards (each a 4-member replica set), 1 config server replica set (3 members), server version 4.4.3-ent running on Ubuntu 16.04.

After a stress test, one of the hosts appears to be down, and Ops Manager shows its status as a grey diamond (screenshot: hostdown).

Hovering over the diamond shows: Last ping: Never.

I can browse the agent log.

After kubectl exec into the pod, both the mongod process and the agent process appear to be working fine.
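For reference, here is roughly how I checked (the pod name db-0-0 and the namespace prod-cs-mongodb are inferred from the hostnames in the logs below; adjust for your deployment):

# check that mongod and the agents are running inside the pod
kubectl exec db-0-0 -n prod-cs-mongodb -- ps -ef

# tail the agent log without opening a shell
kubectl exec db-0-0 -n prod-cs-mongodb -- tail /var/log/mongodb-mms-automation/automation-agent.log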

Using the mongo shell to connect to each host of the same shard shows no connectivity issue, and rs.status() lists all the replica set members as expected.
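Roughly like this, for each member of the shard (TLS and authentication options omitted here for brevity):

mongo --host db-0-0.db-sh.prod-cs-mongodb.svc.cluster.local:27017 --eval "rs.status().members.forEach(function(m) { print(m.name, m.stateStr); })"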

Executing rs.printSecondaryReplicationInfo() (db-0-0 is the “never ping” host, while db-0-1 is the primary):

MongoDB Enterprise db-0:SECONDARY> rs.printSecondaryReplicationInfo()
source: db-0-0.db-sh.prod-cs-mongodb.svc.cluster.local:27017
        syncedTo: Wed Jan 13 2021 11:38:30 GMT+0000 (UTC)
        -3 secs (0 hrs) behind the primary 
source: db-0-2.db-sh.prod-cs-mongodb.svc.cluster.local:27017
        syncedTo: Wed Jan 13 2021 11:38:27 GMT+0000 (UTC)
        0 secs (0 hrs) behind the primary 
source: db-0-3.db-sh.prod-cs-mongodb.svc.cluster.local:27017
        syncedTo: Wed Jan 13 2021 11:38:27 GMT+0000 (UTC)
        0 secs (0 hrs) behind the primary

I tried deleting the StatefulSet pod, but with no luck. How does Ops Manager decide to mark a host as “never ping”? What’s the right way to fix deployment monitoring issues for a Kubernetes deployment (there is no sudo user, since the pod runs under a security context)?
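(The pod deletion I tried was roughly the following; the StatefulSet recreates the pod, but the host still shows “Last ping: Never” afterwards. Names are from my deployment.)

kubectl delete pod db-0-0 -n prod-cs-mongodb
kubectl get pods -n prod-cs-mongodb -w    # watch the StatefulSet recreate the pod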

Other logs that might help:

I have no name!@db-0-0:/var/log/mongodb-mms-automation$ tail automation-agent.log
[2021-01-13T10:55:08.026+0000] [header.info] [::0]       TracePlanner = false
[2021-01-13T10:55:08.026+0000] [header.info] [::0]               User = mongodb
[2021-01-13T10:55:08.026+0000] [header.info] [::0]         Go version = go1.14.7
[2021-01-13T10:55:23.918+0000] [.error] [cm/director/director.go:executePlan:915] <db-0-0> [10:55:23.918] <db-0-0> [10:55:23.918] Precondition failed for step 0(RsInit) of move 3(RsInit) because 
['currentState.ReplSetConf' != <nil> ('currentState.ReplSetConf' = ReplSetConfig{id=db-0,version=6,configsvr=false,protocolVersion=1,forceProtocolVersion=false,writeConcernMajorityJournalDefault=true,members={id:0,HostPort:db-0-0.db-sh.prod-cs-mongodb.svc.cluster.local:27017,ArbiterOnly:falsePriority:1,Hidden:false,SlaveDelay:0,Votes:1,Tags:map[]},{id:1,HostPort:db-0-1.db-sh.prod-cs-mongodb.svc.cluster.local:27017,ArbiterOnly:falsePriority:1,Hidden:false,SlaveDelay:0,Votes:1,Tags:map[]},{id:2,HostPort:db-0-2.db-sh.prod-cs-mongodb.svc.cluster.local:27017,ArbiterOnly:falsePriority:1,Hidden:false,SlaveDelay:0,Votes:1,Tags:map[]},{id:3,HostPort:db-0-3.db-sh.prod-cs-mongodb.svc.cluster.local:27017,ArbiterOnly:falsePriority:1,Hidden:false,SlaveDelay:0,Votes:1,Tags:map[]},settings=map[catchUpTakeoverDelayMillis:30000 catchUpTimeoutMillis:-1 chainingAllowed:true electionTimeoutMillis:10000 getLastErrorDefaults:map[w:1 wtimeout:0] getLastErrorModes:map[] heartbeatIntervalMillis:2000 heartbeatTimeoutSecs:10 replicaSetId:ObjectID("5ff832dd7fa909b953e5915a")]})]. Outcome=3
[2021-01-13T10:55:23.918+0000] [.error] [cm/director/director.go:planAndExecute:566] <db-0-0> [10:55:23.918] Plan execution failed on step RsInit as part of move RsInit : <db-0-0> [10:55:23.918] <db-0-0> [10:55:23.918] Precondition failed for step 0(RsInit) of move 3(RsInit) because 
['currentState.ReplSetConf' != <nil> ('currentState.ReplSetConf' = ReplSetConfig{id=db-0,version=6,configsvr=false,protocolVersion=1,forceProtocolVersion=false,writeConcernMajorityJournalDefault=true,members={id:0,HostPort:db-0-0.db-sh.prod-cs-mongodb.svc.cluster.local:27017,ArbiterOnly:falsePriority:1,Hidden:false,SlaveDelay:0,Votes:1,Tags:map[]},{id:1,HostPort:db-0-1.db-sh.prod-cs-mongodb.svc.cluster.local:27017,ArbiterOnly:falsePriority:1,Hidden:false,SlaveDelay:0,Votes:1,Tags:map[]},{id:2,HostPort:db-0-2.db-sh.prod-cs-mongodb.svc.cluster.local:27017,ArbiterOnly:falsePriority:1,Hidden:false,SlaveDelay:0,Votes:1,Tags:map[]},{id:3,HostPort:db-0-3.db-sh.prod-cs-mongodb.svc.cluster.local:27017,ArbiterOnly:falsePriority:1,Hidden:false,SlaveDelay:0,Votes:1,Tags:map[]},settings=map[catchUpTakeoverDelayMillis:30000 catchUpTimeoutMillis:-1 chainingAllowed:true electionTimeoutMillis:10000 getLastErrorDefaults:map[w:1 wtimeout:0] getLastErrorModes:map[] heartbeatIntervalMillis:2000 heartbeatTimeoutSecs:10 replicaSetId:ObjectID("5ff832dd7fa909b953e5915a")]})]. Outcome=3
[2021-01-13T10:55:23.918+0000] [.error] [cm/director/director.go:mainLoop:398] <db-0-0> [10:55:23.918] Failed to planAndExecute : <db-0-0> [10:55:23.918] Plan execution failed on step RsInit as part of move RsInit : <db-0-0> [10:55:23.918] <db-0-0> [10:55:23.918] Precondition failed for step 0(RsInit) of move 3(RsInit) because 
['currentState.ReplSetConf' != <nil> ('currentState.ReplSetConf' = ReplSetConfig{id=db-0,version=6,configsvr=false,protocolVersion=1,forceProtocolVersion=false,writeConcernMajorityJournalDefault=true,members={id:0,HostPort:db-0-0.db-sh.prod-cs-mongodb.svc.cluster.local:27017,ArbiterOnly:falsePriority:1,Hidden:false,SlaveDelay:0,Votes:1,Tags:map[]},{id:1,HostPort:db-0-1.db-sh.prod-cs-mongodb.svc.cluster.local:27017,ArbiterOnly:falsePriority:1,Hidden:false,SlaveDelay:0,Votes:1,Tags:map[]},{id:2,HostPort:db-0-2.db-sh.prod-cs-mongodb.svc.cluster.local:27017,ArbiterOnly:falsePriority:1,Hidden:false,SlaveDelay:0,Votes:1,Tags:map[]},{id:3,HostPort:db-0-3.db-sh.prod-cs-mongodb.svc.cluster.local:27017,ArbiterOnly:falsePriority:1,Hidden:false,SlaveDelay:0,Votes:1,Tags:map[]},settings=map[catchUpTakeoverDelayMillis:30000 catchUpTimeoutMillis:-1 chainingAllowed:true electionTimeoutMillis:10000 getLastErrorDefaults:map[w:1 wtimeout:0] getLastErrorModes:map[] heartbeatIntervalMillis:2000 heartbeatTimeoutSecs:10 replicaSetId:ObjectID("5ff832dd7fa909b953e5915a")]})]. Outcome=3
[2021-01-13T11:00:29.745+0000] [.error] [cm/mongoclientservice/mongoclientservice.go:sendClientToRequesters:720] [11:00:29.745] Error getting client ready for conn params = db-0-2.db-sh.prod-cs-mongodb.svc.cluster.local:27017 (local=false).  Informing all requests and disposing of client (0xc00052cdc0).  requests=[ 0xc0005275c0 ] : connection(db-0-2.db-sh.prod-cs-mongodb.svc.cluster.local:27017[-23]) unable to decode message length: EOF


I have no name!@db-0-0:/var/log/mongodb-mms-automation$ tail monitoring-agent.log
[2021-01-13T11:38:40.977+0000] [agent.info] [monitoring/agent.go:Iterate:177] Received new configuration: Assigned 1 out of 12 plus 0 chunk monitor(s)
[2021-01-13T11:38:40.977+0000] [agent.info] [monitoring/agent.go:Run:266] Done. Sleeping for 55s...
[2021-01-13T11:39:35.986+0000] [agent.info] [monitoring/agent.go:Iterate:177] Received new configuration: Assigned 1 out of 12 plus 0 chunk monitor(s)
[2021-01-13T11:39:35.986+0000] [agent.info] [monitoring/agent.go:Run:266] Done. Sleeping for 55s...
[2021-01-13T11:40:30.978+0000] [agent.info] [monitoring/agent.go:Iterate:177] Received new configuration: Assigned 1 out of 12 plus 0 chunk monitor(s)
[2021-01-13T11:40:30.978+0000] [agent.info] [monitoring/agent.go:Run:266] Done. Sleeping for 55s...
[2021-01-13T11:41:25.996+0000] [agent.info] [monitoring/agent.go:Iterate:177] Received new configuration: Assigned 1 out of 12 plus 0 chunk monitor(s)
[2021-01-13T11:41:25.996+0000] [agent.info] [monitoring/agent.go:Run:266] Done. Sleeping for 55s...
[2021-01-13T11:42:20.980+0000] [agent.info] [monitoring/agent.go:Iterate:177] Received new configuration: Assigned 1 out of 12 plus 0 chunk monitor(s)
[2021-01-13T11:42:20.980+0000] [agent.info] [monitoring/agent.go:Run:266] Done. Sleeping for 55s...
I have no name!@db-0-0:/var/log/mongodb-mms-automation$ tail readiness.log 
2021-01-13T11:42:14.980Z        DEBUG   build/readiness.go:175  ExpectedToBeUp: true, IsInGoalState: true, LastMongoUpTime: 2021-01-13 11:42:10 +0000 UTC
2021-01-13T11:42:19.979Z        DEBUG   build/readiness.go:175  ExpectedToBeUp: true, IsInGoalState: true, LastMongoUpTime: 2021-01-13 11:42:10 +0000 UTC
2021-01-13T11:42:24.958Z        DEBUG   build/readiness.go:175  ExpectedToBeUp: true, IsInGoalState: true, LastMongoUpTime: 2021-01-13 11:42:20 +0000 UTC
2021-01-13T11:42:29.989Z        DEBUG   build/readiness.go:175  ExpectedToBeUp: true, IsInGoalState: true, LastMongoUpTime: 2021-01-13 11:42:20 +0000 UTC
2021-01-13T11:42:34.962Z        DEBUG   build/readiness.go:175  ExpectedToBeUp: true, IsInGoalState: true, LastMongoUpTime: 2021-01-13 11:42:30 +0000 UTC
2021-01-13T11:42:39.981Z        DEBUG   build/readiness.go:175  ExpectedToBeUp: true, IsInGoalState: true, LastMongoUpTime: 2021-01-13 11:42:30 +0000 UTC
2021-01-13T11:42:44.969Z        DEBUG   build/readiness.go:175  ExpectedToBeUp: true, IsInGoalState: true, LastMongoUpTime: 2021-01-13 11:42:40 +0000 UTC
2021-01-13T11:42:49.987Z        DEBUG   build/readiness.go:175  ExpectedToBeUp: true, IsInGoalState: true, LastMongoUpTime: 2021-01-13 11:42:40 +0000 UTC
2021-01-13T11:42:54.976Z        DEBUG   build/readiness.go:175  ExpectedToBeUp: true, IsInGoalState: true, LastMongoUpTime: 2021-01-13 11:42:50 +0000 UTC
2021-01-13T11:42:59.982Z        DEBUG   build/readiness.go:175  ExpectedToBeUp: true, IsInGoalState: true, LastMongoUpTime: 2021-01-13 11:42:50 +0000 UTC


I have no name!@db-0-0:/var/log/mongodb-mms-automation$ tail mongodb.log
{"t":{"$date":"2021-01-13T11:43:04.476+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn727","msg":"Connection ended","attr":{"remote":"10.240.1.237:41278","connectionId":727,"connectionCount":61}}
{"t":{"$date":"2021-01-13T11:43:14.595+00:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"Connection accepted","attr":{"remote":"10.240.1.237:41352","connectionId":729,"connectionCount":62}}
{"t":{"$date":"2021-01-13T11:43:14.603+00:00"},"s":"W",  "c":"NETWORK",  "id":23234,   "ctx":"conn729","msg":"No SSL certificate provided by peer"}
{"t":{"$date":"2021-01-13T11:43:14.603+00:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn729","msg":"client metadata","attr":{"remote":"10.240.1.237:41352","client":"conn729","doc":{"driver":{"name":"mongo-go-driver","version":"v1.1.4"},"os":{"type":"linux","architecture":"amd64"},"platform":"go1.14.7"}}}
{"t":{"$date":"2021-01-13T11:43:14.603+00:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"Connection accepted","attr":{"remote":"10.240.1.237:41354","connectionId":730,"connectionCount":63}}
{"t":{"$date":"2021-01-13T11:43:14.611+00:00"},"s":"W",  "c":"NETWORK",  "id":23234,   "ctx":"conn730","msg":"No SSL certificate provided by peer"}
{"t":{"$date":"2021-01-13T11:43:14.611+00:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn730","msg":"client metadata","attr":{"remote":"10.240.1.237:41354","client":"conn730","doc":{"driver":{"name":"mongo-go-driver","version":"v1.1.4"},"os":{"type":"linux","architecture":"amd64"},"platform":"go1.14.7","application":{"name":"MongoDB Automation Agent v10.14.17.6445 (git: bc85c64c604a1d6beb273b1ea7fc81af16b0aca1)"}}}}
{"t":{"$date":"2021-01-13T11:43:14.629+00:00"},"s":"I",  "c":"ACCESS",   "id":20250,   "ctx":"conn730","msg":"Successful authentication","attr":{"mechanism":"SCRAM-SHA-256","principalName":"__system","authenticationDatabase":"local","client":"10.240.1.237:41354"}}
{"t":{"$date":"2021-01-13T11:43:14.630+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn730","msg":"Connection ended","attr":{"remote":"10.240.1.237:41354","connectionId":730,"connectionCount":62}}
{"t":{"$date":"2021-01-13T11:43:14.630+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn729","msg":"Connection ended","attr":{"remote":"10.240.1.237:41352","connectionId":729,"connectionCount":61}}

Hi there,

I’m facing the same issue. My ops-manager-db pods are running fine, but they appear down in the Ops Manager UI. Did you manage to solve it? @finisky

Hi, the issue may be caused by outdated host mappings. Could you try deleting all host mappings in Ops Manager?

Refer to this post: MongoDB Agent Host Discovery Issue
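If you prefer the API over the UI, stale host entries can also be listed and removed with the Ops Manager public API. A rough sketch only: <ops-manager-url>, {PROJECT-ID}, {HOST-ID}, and the API key pair are placeholders, and the key needs sufficient monitoring permissions. The live host should be rediscovered with a fresh ping shortly after the delete:

# list the hosts in the project and note the id of the stale entry
curl --user "{PUBLIC-KEY}:{PRIVATE-KEY}" --digest \
  "https://<ops-manager-url>/api/public/v1.0/groups/{PROJECT-ID}/hosts?pretty=true"

# stop monitoring the stale host entry
curl --user "{PUBLIC-KEY}:{PRIVATE-KEY}" --digest --request DELETE \
  "https://<ops-manager-url>/api/public/v1.0/groups/{PROJECT-ID}/hosts/{HOST-ID}"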

Thank you for your reply @finisky. I tried your suggestion; however, it turns out my issue is different. After commenting here, I found AuthenticationFailed errors in the mongodb-agent-monitoring logs. Everything was working fine until I disabled backup, and then I ended up with this issue. I am still investigating it.

[2022-02-05T22:40:48.436+0000] [metrics.status.collector-ops-manager-db-2.ops-manager-db-svc.mongodb.svc.cluster.local:27017.error] [monitoring/collector.go:looper:254] Failure during status metrics collection.
Failed to get connectionStatus. Err: `connection() : auth error: sasl conversation error: unable to authenticate using mechanism "SCRAM-SHA-256": (AuthenticationFailed) Authentication failed.`
	at cm/monitoring/dialing.go:491
	at cm/monitoring/dialing.go:240
	at cm/monitoring/dialing.go:351
	at cm/monitoring/dialing.go:368
	at louisaberger/procexec/concurrency.go:45
	at src/runtime/asm_amd64.s:1374
[22:40:48.448] All 0 Mongo processes are in goal state. Monitoring in goal state

Here is my Ops Manager manifest in case it helps with troubleshooting. I assume the automation user is created by Ops Manager itself, so I don’t understand why it was working before but not now.

---
apiVersion: mongodb.com/v1
kind: MongoDBOpsManager
metadata:
  name: ops-manager
  namespace: mongodb
spec:
  version: 4.4.9
  configuration:
    mms.centralUrl: https://mongodb.mydomain.com

  adminCredentials: ops-manager-admin-secret

  externalConnectivity:
    type: NodePort # TODO: should be service behind nginx ingress

  applicationDatabase:
    members: 3
    version: 4.2.11-ent
    persistent: true
    podSpec:
      persistence:
        single:  
          storage: 100G
          storageClass: aws-efs 
  backup:
    enabled: true
    opLogStores:
      - name: oplog1
        mongodbResourceRef:
          name: om-mongodb-oplog-db
        mongodbUserRef:
          name: ops-manager-app
    s3Stores:
    - name: "news3store"
      s3SecretRef:
        name: "REDACTED" 
      s3BucketEndpoint: "REDACTED"
      s3BucketName: "REDACTED"
      pathStyleAccessEnabled: true
---

Hi @Ercin_Demir, I actually had the same error on the appdb as well:

auth error: sasl conversation error: unable to authenticate using mechanism "SCRAM-SHA-256": (AuthenticationFailed) Authentication failed.

For me, the error was caused by recreating Ops Manager after deleting it (deleting Ops Manager while keeping the appdb disks).

I think the root cause is that the Ops Manager credentials become inconsistent with the data on the appdb disks: when Ops Manager is recreated, some credentials are regenerated in the Kubernetes secrets, while the appdb data on disk is not changed at the same time. Unfortunately, I compared the credentials but couldn’t pinpoint which entries are inconsistent. Moreover, modifying the internal appdb data online is non-trivial and dangerous, and may bring the service down.

Since the appdb still works fine, I chose to ignore the issue :frowning: .

Because Ops Manager is a black box, another clean solution would be to migrate the cluster from the old Ops Manager to a new one (with a brand-new appdb).

Thank you very much @finisky for taking the time to reply. It looks like we made the same mistake: I also deleted Ops Manager while keeping the appdb disks. This issue also made me realize that I am not setting any credentials for the appdb pods, so perhaps I can try setting them explicitly.

I could try migrating it as well, but I am using AWS EFS. I need to look into this a bit.

I guess the trick lies in the Kubernetes secrets: ops-manager-db-config (cluster-config.json: scramSha256Creds) and ops-manager-db-om-user-scram-credentials (server-key, salt, etc.). However, I currently don’t understand how they interact with each other.
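If anyone wants to compare them, the contents can be dumped like this (assuming the mongodb namespace from the manifest above; the keys inside the secrets may differ across operator versions):

kubectl get secret ops-manager-db-config -n mongodb -o jsonpath='{.data.cluster-config\.json}' | base64 -d
kubectl get secret ops-manager-db-om-user-scram-credentials -n mongodb -o yaml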

Here is an example of how to set up the appdb SCRAM password:
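(A minimal sketch of the idea: create a secret holding the password and point the MongoDBOpsManager resource at it. The field spec.applicationDatabase.passwordSecretKeyRef and the secret name om-db-user-password below are assumptions on my side, so double-check them against your operator version’s documentation.)

# create a secret holding the desired appdb password
kubectl create secret generic om-db-user-password \
  --from-literal=password='<strong-password>' -n mongodb

# then reference it from the MongoDBOpsManager spec, e.g.:
#   spec:
#     applicationDatabase:
#       passwordSecretKeyRef:
#         name: om-db-user-password
#         key: password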

Good luck!


Just commenting here in case anyone comes back and experiences the same issue.

In my case, I redeployed Ops Manager and noticed that when I enable backup in Ops Manager, the ops-manager-db deployment shows as down in the OM UI even though it is up and running in the cluster. I don’t see any problem other than that, so I’m leaving it like this. I will update here if I find any solution.

FYI @finisky