Sharded cluster is not working after one shard's data is lost

Hello,

I have a sharded cluster with 2 shards. I lost all the replicas of one shard along with all of its data. Now that I have recreated the shard's replica set (with the same replica set name and hosts), it is not being reconnected to the sharded cluster. When I run any command from mongos (e.g. show dbs), I get this error message: Could not find host matching read preference { mode: "primary" } for set myshard2. Is there a way to recover the cluster? I don't need the lost shard's data; I just want to make the cluster work again. Any pointer or reference would be helpful. I'm using MongoDB 4.4.10.

Thank you.

Hello @FahimAbrar
Welcome to the Community!! :star_struck:

Could you provide the output of rs.status() for the replica set?
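
It may also help to compare the shard entry registered on the config servers with the members of the replica set you recreated. A minimal sketch, run from a mongos connection (add authentication options as needed for your deployment):

// List the shards registered in the cluster metadata
db.adminCommand({ listShards: 1 })

// Or inspect the config database directly
use config
db.shards.find().pretty()

The host string stored for each shard must match the replica set name and member hostnames of the rebuilt replica set.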

Thanks
Aasawari

Hi @Aasawari,

Thank you for the reply. Here is the output:

rs.status()

{
  "set" : "shard1",
  "date" : ISODate("2022-03-29T05:49:36.267Z"),
  "myState" : 1,
  "term" : NumberLong(3),
  "syncSourceHost" : "",
  "syncSourceId" : -1,
  "heartbeatIntervalMillis" : NumberLong(2000),
  "majorityVoteCount" : 2,
  "writeMajorityCount" : 2,
  "votingMembersCount" : 3,
  "writableVotingMembersCount" : 3,
  "optimes" : {
    "lastCommittedOpTime" : {
      "ts" : Timestamp(1648532967, 9),
      "t" : NumberLong(3)
    },
    "lastCommittedWallTime" : ISODate("2022-03-29T05:49:27.567Z"),
    "readConcernMajorityOpTime" : {
      "ts" : Timestamp(1648532967, 9),
      "t" : NumberLong(3)
    },
    "readConcernMajorityWallTime" : ISODate("2022-03-29T05:49:27.567Z"),
    "appliedOpTime" : {
      "ts" : Timestamp(1648532967, 9),
      "t" : NumberLong(3)
    },
    "durableOpTime" : {
      "ts" : Timestamp(1648532967, 9),
      "t" : NumberLong(3)
    },
    "lastAppliedWallTime" : ISODate("2022-03-29T05:49:27.567Z"),
    "lastDurableWallTime" : ISODate("2022-03-29T05:49:27.567Z")
  },
  "lastStableRecoveryTimestamp" : Timestamp(1648532967, 9),
  "electionCandidateMetrics" : {
    "lastElectionReason" : "electionTimeout",
    "lastElectionDate" : ISODate("2022-03-29T05:29:23.422Z"),
    "electionTerm" : NumberLong(3),
    "lastCommittedOpTimeAtElection" : {
      "ts" : Timestamp(1648531752, 1),
      "t" : NumberLong(1)
    },
    "lastSeenOpTimeAtElection" : {
      "ts" : Timestamp(1648531752, 1),
      "t" : NumberLong(1)
    },
    "numVotesNeeded" : 1,
    "priorityAtElection" : 1,
    "electionTimeoutMillis" : NumberLong(10000),
    "newTermStartDate" : ISODate("2022-03-29T05:29:23.428Z"),
    "wMajorityWriteAvailabilityDate" : ISODate("2022-03-29T05:29:23.429Z")
  },
  "members" : [
    {
      "_id" : 0,
      "name" : "mg-sh-shard1-0.mg-sh-shard1-pods.demo.svc.cluster.local:27017",
      "health" : 1,
      "state" : 1,
      "stateStr" : "PRIMARY",
      "uptime" : 1265,
      "optime" : {
        "ts" : Timestamp(1648532967, 9),
        "t" : NumberLong(3)
      },
      "optimeDate" : ISODate("2022-03-29T05:49:27Z"),
      "lastAppliedWallTime" : ISODate("2022-03-29T05:49:27.567Z"),
      "lastDurableWallTime" : ISODate("2022-03-29T05:49:27.567Z"),
      "syncSourceHost" : "",
      "syncSourceId" : -1,
      "infoMessage" : "",
      "electionTime" : Timestamp(1648531763, 4),
      "electionDate" : ISODate("2022-03-29T05:29:23Z"),
      "configVersion" : 3,
      "configTerm" : -1,
      "self" : true,
      "lastHeartbeatMessage" : ""
    },
    {
      "_id" : 1,
      "name" : "mg-sh-shard1-2.mg-sh-shard1-pods.demo.svc.cluster.local:27017",
      "health" : 1,
      "state" : 2,
      "stateStr" : "SECONDARY",
      "uptime" : 971,
      "optime" : {
        "ts" : Timestamp(1648532967, 9),
        "t" : NumberLong(3)
      },
      "optimeDurable" : {
        "ts" : Timestamp(1648532967, 9),
        "t" : NumberLong(3)
      },
      "optimeDate" : ISODate("2022-03-29T05:49:27Z"),
      "optimeDurableDate" : ISODate("2022-03-29T05:49:27Z"),
      "lastAppliedWallTime" : ISODate("2022-03-29T05:49:27.567Z"),
      "lastDurableWallTime" : ISODate("2022-03-29T05:49:27.567Z"),
      "lastHeartbeat" : ISODate("2022-03-29T05:49:34.689Z"),
      "lastHeartbeatRecv" : ISODate("2022-03-29T05:49:34.941Z"),
      "pingMs" : NumberLong(0),
      "lastHeartbeatMessage" : "",
      "syncSourceHost" : "mg-sh-shard1-1.mg-sh-shard1-pods.demo.svc.cluster.local:27017",
      "syncSourceId" : 2,
      "infoMessage" : "",
      "configVersion" : 3,
      "configTerm" : -1
    },
    {
      "_id" : 2,
      "name" : "mg-sh-shard1-1.mg-sh-shard1-pods.demo.svc.cluster.local:27017",
      "health" : 1,
      "state" : 2,
      "stateStr" : "SECONDARY",
      "uptime" : 141,
      "optime" : {
        "ts" : Timestamp(1648532967, 9),
        "t" : NumberLong(3)
      },
      "optimeDurable" : {
        "ts" : Timestamp(1648532967, 9),
        "t" : NumberLong(3)
      },
      "optimeDate" : ISODate("2022-03-29T05:49:27Z"),
      "optimeDurableDate" : ISODate("2022-03-29T05:49:27Z"),
      "lastAppliedWallTime" : ISODate("2022-03-29T05:49:27.567Z"),
      "lastDurableWallTime" : ISODate("2022-03-29T05:49:27.567Z"),
      "lastHeartbeat" : ISODate("2022-03-29T05:49:34.527Z"),
      "lastHeartbeatRecv" : ISODate("2022-03-29T05:49:36.239Z"),
      "pingMs" : NumberLong(0),
      "lastHeartbeatMessage" : "",
      "syncSourceHost" : "mg-sh-shard1-0.mg-sh-shard1-pods.demo.svc.cluster.local:27017",
      "syncSourceId" : 0,
      "infoMessage" : "",
      "configVersion" : 3,
      "configTerm" : -1
    }
  ],
  "ok" : 1
}

And here is the output of sh.status():
sh.status()

--- Sharding Status --- 
  sharding version: {
  	"_id" : 1,
  	"minCompatibleVersion" : 5,
  	"currentVersion" : 6,
  	"clusterId" : ObjectId("624297ef666efa1ac6bacfaf")
  }
  shards:
        {  "_id" : "shard0",  "host" : "shard0/mg-sh-shard0-0.mg-sh-shard0-pods.demo.svc.cluster.local:27017,mg-sh-shard0-1.mg-sh-shard0-pods.demo.svc.cluster.local:27017,mg-sh-shard0-2.mg-sh-shard0-pods.demo.svc.cluster.local:27017",  "state" : 1,  "tags" : [ "shard0" ] }
        {  "_id" : "shard1",  "host" : "shard1/mg-sh-shard1-0.mg-sh-shard1-pods.demo.svc.cluster.local:27017,mg-sh-shard1-1.mg-sh-shard1-pods.demo.svc.cluster.local:27017,mg-sh-shard1-2.mg-sh-shard1-pods.demo.svc.cluster.local:27017",  "state" : 1,  "tags" : [ "shard1" ] }
  active mongoses:
        "4.4.12-12" : 2
  autosplit:
        Currently enabled: yes
  balancer:
        Currently enabled:  yes
        Currently running:  no
        Failed balancer rounds in last 5 attempts:  0
        Migration Results for the last 24 hours: 
                8 : Success
                112 : Failed with error 'aborted', from shard0 to shard1
  databases:
        {  "_id" : "config",  "primary" : "config",  "partitioned" : true }
                config.system.sessions
                        shard key: { "_id" : 1 }
                        unique: false
                        balancing: true
                        chunks:
                                shard0	1017
                                shard1	7
                        too many chunks to print, use verbose if you want to force print

Hello @FahimAbrar
Apologies for the delay in response.

To get the cluster back into a working state, I would recommend taking a mongodump from a healthy shard and then restoring the required collections.
You can also refer to the allowPartialResults option in case you have more than one sharded collection.
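
A rough sketch of the dump-and-restore approach, assuming shard0 (whose members appear in your sh.status() output) is the healthy shard; the database name (mydb), collection name (mycollection), dump path, and mongos hostname below are placeholders, so adjust hostnames, authentication, and paths for your deployment:

# Dump the required database from the healthy shard
mongodump --host "shard0/mg-sh-shard0-0.mg-sh-shard0-pods.demo.svc.cluster.local:27017" --db mydb --out /backup/dump

# Restore through a mongos so the data is written back into the sharded cluster
mongorestore --host <mongos-host>:27017 /backup/dump

If some shards are still unreachable while you read through mongos, the allowPartialResults option of the find command lets a query return data from the shards that are available, for example:

db.runCommand({ find: "mycollection", allowPartialResults: true })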

Let us know if you have any more questions

Thanks
Aasawari