To whom it may concern,
We are running the Node.js native driver 4.13 on Node 16 in Docker, hosted in Azure Kubernetes Service. We connect to Cosmos DB with the Mongo API, which is essentially a standalone MongoDB that manages all replica information on the server side. We connect to the DB in direct mode, given that replication is managed on the server side.
We do experience network blips from time to time. Most of the time the running pods are able to recover, but some pods are not and go into a “stuck” state: all commands fail with MongoServerSelectionError, with a more detailed message saying that server selection timed out after 30 seconds.
I have enabled cluster monitoring on the SDK side and compared the “good” and “bad” pods side by side. Here are my findings:
After the initial heartbeat failure caused by network issues, each pod goes through a series of events: server description changes, topology description changes, and connection pool cleared/ready. All pods then stabilize, so the recovery logic appears to work as expected. However, I notice that for the “bad” stuck pods, the topology description change events include some values of “REDACTED”, including the event in which the topology stabilizes, while the “good” pods don’t have them. Samples:
“BAD pods”:
{"topologyId":0,"previousDescription":{"type":"Single","servers":{},"stale":"REDACTED","compatible":"REDACTED","heartbeatFrequencyMS":"REDACTED","localThresholdMS":"REDACTED","setName":null,"maxElectionId":null,"maxSetVersion":null,"commonWireVersion":"REDACTED","logicalSessionTimeoutMinutes":null},"newDescription":{"type":"Single","servers":{},"stale":false,"compatible":true,"heartbeatFrequencyMS":"REDACTED","localThresholdMS":"REDACTED","setName":null,"maxElectionId":null,"maxSetVersion":null,"commonWireVersion":null,"logicalSessionTimeoutMinutes":30}}
“GOOD pods”:
{"topologyId":0,"previousDescription":{"type":"Single","servers":{},"stale":false,"compatible":true,"heartbeatFrequencyMS":30000,"localThresholdMS":15,"setName":null,"maxElectionId":null,"maxSetVersion":null,"commonWireVersion":0,"logicalSessionTimeoutMinutes":30},"newDescription":{"type":"Single","servers":{},"stale":false,"compatible":true,"heartbeatFrequencyMS":30000,"localThresholdMS":15,"setName":null,"maxElectionId":null,"maxSetVersion":null,"commonWireVersion":0,"logicalSessionTimeoutMinutes":30}}
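For anyone who wants to reproduce the comparison, here is a minimal sketch of how events like the samples above can be captured and checked for “REDACTED” values at the moment they are emitted. The event names are the ones the 4.x driver emits on MongoClient; the helper names (attachMonitoring, findMarkedFields) and the record shape are my own and purely illustrative. A plain EventEmitter stands in for the MongoClient so the sketch runs without a database.

```javascript
const { EventEmitter } = require('node:events');

// SDAM and connection pool events emitted on MongoClient by the 4.x driver.
const MONITORED_EVENTS = [
  'serverHeartbeatSucceeded',
  'serverHeartbeatFailed',
  'serverDescriptionChanged',
  'topologyDescriptionChanged',
  'connectionPoolCleared',
  'connectionPoolReady',
];

// Hypothetical helper: returns the dotted path of every field equal to `marker`.
function findMarkedFields(value, marker = 'REDACTED', path = '') {
  if (value === marker) return [path];
  if (value === null || typeof value !== 'object') return [];
  return Object.entries(value).flatMap(([key, child]) =>
    findMarkedFields(child, marker, path ? `${path}.${key}` : key)
  );
}

// Hypothetical helper: records a JSON snapshot of each event plus any marked fields.
function attachMonitoring(client, sink = console.log) {
  for (const name of MONITORED_EVENTS) {
    client.on(name, (event) => {
      // Round-trip through JSON so the record cannot change if the event object is modified later.
      const snapshot = JSON.parse(JSON.stringify(event ?? null));
      sink({ name, at: new Date().toISOString(), marked: findMarkedFields(snapshot), event: snapshot });
    });
  }
  return client;
}

// Stand-in for the real MongoClient, fed an abbreviated copy of the "bad" sample.
const records = [];
const demoClient = attachMonitoring(new EventEmitter(), (record) => records.push(record));
demoClient.emit('topologyDescriptionChanged', {
  topologyId: 0,
  previousDescription: { type: 'Single', stale: 'REDACTED', heartbeatFrequencyMS: 'REDACTED' },
  newDescription: { type: 'Single', stale: false, heartbeatFrequencyMS: 30000 },
});
console.log(records[0].marked);
// prints [ 'previousDescription.stale', 'previousDescription.heartbeatFrequencyMS' ]
```

Taking the snapshot inside the listener matters here: if something modifies the description objects after the event fires, a record captured this way would still show the values as they were when the driver emitted them.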
A few questions I have:
- Have you ever seen issues like this?
- Does anyone know whether these “REDACTED” values could cause MongoServerSelectionError even though we are using the Single type (direct connect mode)?
- Where did “REDACTED” get generated? I cloned the SDK source code and didn’t find any place that would create "REDACTED" at all. In our own code, we do have some exception "REDACTED" scrubbing to remove sensitive information from exceptions. But our code is built on top of the SDK code, applies only to exceptions, and only to operational commands like findOne, updateMany, etc. It is hard to believe our code would inject the "REDACTED" data at the SDK level.
- Are there any settings that control this behavior? Do we have to try bumping up the SDK version?
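One possibility worth ruling out, offered only as a hypothesis: as far as I can tell, in the 4.x driver a MongoServerSelectionError carries a reason property that holds the topology description, so a scrubber that overwrites fields in place on a caught exception could in principle reach objects the driver still uses. Below is a minimal sketch of a scrubber that works on a deep copy instead. The name scrubCopy and the SENSITIVE_KEYS list are hypothetical and are not our actual scrubbing code.

```javascript
// Hypothetical list of keys to redact; the real rules would differ.
const SENSITIVE_KEYS = new Set(['password', 'connectionString']);

// Returns a redacted deep copy. The input object, and anything it references, is never modified.
function scrubCopy(value, ancestors = new WeakSet()) {
  if (value === null || typeof value !== 'object') return value;
  if (ancestors.has(value)) return '[Circular]'; // guard against reference cycles
  ancestors.add(value);

  let copy;
  if (Array.isArray(value)) {
    copy = value.map((item) => scrubCopy(item, ancestors));
  } else {
    copy = {};
    // name and message are not own enumerable properties of an Error, so copy them explicitly.
    if (value instanceof Error) {
      copy.name = value.name;
      copy.message = value.message;
    }
    for (const [key, child] of Object.entries(value)) {
      copy[key] = SENSITIVE_KEYS.has(key) ? 'REDACTED' : scrubCopy(child, ancestors);
    }
  }

  ancestors.delete(value); // the same object may legitimately appear again on another branch
  return copy;
}

// Stand-in for an error that references shared state, as a driver error might.
const sharedState = { heartbeatFrequencyMS: 30000, password: 'secret' };
const caught = Object.assign(new Error('Server selection timed out'), { reason: sharedState });

const safeToLog = scrubCopy(caught);
console.log(safeToLog.reason.password); // prints REDACTED
console.log(sharedState.password);      // prints secret
```

If the scrubbing were done in place instead of on a copy, the second line would also print REDACTED, which is the kind of leak into shared state this sketch is meant to rule out.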
Many thx!