I tried to check your GitHub issue; I could read your logs but could not go any further, as the issue is too long for my current time slot.
In short, my wild guess here would be about how k8s shuts down servers.
As you already noted, SIGTERM causes a “PRIMARY not found” error, while db.shutdownServer shuts the node down gracefully.
You may immediately notice this line at the start of the secondary's log for the shutdown command:
... "ctx":"conn220026","msg":"Received replSetStepUp request"}
This clearly indicates continuous communication between nodes.
On the other hand, when SIGTERM is used, the primary says it is about to step down (and tell the cluster), but the secondary starts failing to get heartbeats within that same interval.
... "ctx":"SignalHandler","msg":"Stepping down the ReplicationCoordinator for shutdown","attr":{"waitTimeMillis":10000}}
... "ctx":"ReplCoord-21","msg":"Heartbeat failed after max retries","attr":{"target":"mongodb-0.mongodb-headless.default.svc.cluster.local:27017","maxHeartbeatRetries":2,"error":{"code":6,"codeName":"HostUnreachable","errmsg":"Error connecting to mongodb-0.mongodb-headless.default.svc.cluster.local:27017 (100.96.5.193:27017) :: caused by :: Connection refused"}}}
Your logs have a 5-second gap between these two lines; I wonder at what exact time the secondary lost the heartbeat.
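If it helps to pin that down, here is a rough Python sketch for measuring the gap between any two log lines. It assumes the structured JSON log format that 4.4 and later write, where each line carries its timestamp under "t" -> "$date". The two sample lines are made up for illustration (your pasted lines have the timestamps cut off), and the helper names are my own.

```python
import json
from datetime import datetime


def log_time(line):
    """Return the timestamp of one structured (JSON) mongod log line."""
    entry = json.loads(line)
    # Assumption: the timestamp lives under "t" -> "$date" in ISO-8601 form.
    return datetime.fromisoformat(entry["t"]["$date"])


def gap_seconds(first_line, second_line):
    """Seconds elapsed between two log lines (negative if out of order)."""
    return (log_time(second_line) - log_time(first_line)).total_seconds()


if __name__ == "__main__":
    # Hypothetical lines with invented timestamps, NOT taken from the real logs.
    primary = '{"t":{"$date":"2023-01-01T10:00:00.000+00:00"},"ctx":"SignalHandler","msg":"Stepping down the ReplicationCoordinator for shutdown"}'
    secondary = '{"t":{"$date":"2023-01-01T10:00:05.000+00:00"},"ctx":"ReplCoord-21","msg":"Heartbeat failed after max retries"}'
    print(gap_seconds(primary, secondary))
```

Running it over the first "Heartbeat failed" line on the secondary and the "Stepping down" line on the primary would show whether the heartbeat was lost before the 10-second step-down wait expired.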
Anyway, this leads me to think that the “network connection” of the primary is closed before it can send that step-down command to the cluster, hence no heartbeat reaches the others. And closing the network is the job of k8s.
As you know, SIGTERM is a termination signal, though unlike SIGKILL it lets the program do its cleanup first. Yet this does not tell us anything about the rest of the system, especially the network.
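To illustrate what I mean, here is a minimal Python sketch of a process that catches SIGTERM and runs its own cleanup. It is only an analogy, not how mongod is implemented, and the names in it are hypothetical. The point is that the handler runs entirely inside the process: nothing in it can tell whether the network around the process is still reachable while the cleanup is going on.

```python
import os
import signal
import time

events = []  # records what the handler did, in order


def on_sigterm(signum, frame):
    # The process gets its chance to clean up here. In a database this is
    # where a step-down would be attempted, but it can only succeed if the
    # network path to the other nodes still exists at this moment.
    events.append("cleanup started")
    events.append("cleanup finished")


def install_handler():
    """Register the SIGTERM handler (must be called from the main thread)."""
    signal.signal(signal.SIGTERM, on_sigterm)


def send_sigterm_to_self():
    """Deliver SIGTERM to this process and return what the handler recorded."""
    os.kill(os.getpid(), signal.SIGTERM)
    time.sleep(0.1)  # give the interpreter a moment to run the handler
    return list(events)


if __name__ == "__main__":
    install_handler()
    print(send_sigterm_to_self())
```

So a clean handler on the process side and a failed step-down on the cluster side are not contradictory, if the connection is gone by the time the handler runs.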
Unfortunately I don’t have the setup to test this, so I hope you and others find a better explanation.
PS: You mention a v4.4 server. Have you also tried with 5 or 6?