MongoDB replica set stability issues

Hello

We have been testing with MongoDB replica sets and seeing MongoDB nodes crashing randomly even when there are no queries being executed. Since we are using supervisord to manage mongod processes, the mongod process gets restarted. Sometimes the restarted mongod comes up and get connected to other replica set nodes. Other times, it does not accept any connections and has to be stopped/killed and restarted. We have seen these crashes in Primary-Secondary and Primary-Secondary-Secondary replica sets. No crashes are seen in single node replica set.

So far what we have found is that hostname resolution method matters. When hostnames are in /etc/hosts file, no crash is seen. When we use a custom resolver, the crash occurs. The custom resolver is a simple python script that reads a file and returns ip for the host. It is configured via /etc/nsswitch.conf. At the moment DNS resolution is not tested.

I have logs available for the crashed servers which I can share. For now, I am pasting one of the backtrace of crash that is logged in the mongod.log file. The backtraces do not all look same to me but I may be wrong. Any help in resolving this issue is greatly appreciated. Please let me know what other information should I can further provide.

Thanks
Arshad

MongoDB version: 4.4.9, CentOS 7, glibc-2.17-323.el7_9.x86_64

mongod.conf

net:
  bindIp: 192.168.1.1, 127.0.0.1
  port: 27017
  tls:
    CAFile: /var/lib/mongo/mongodbCA.crt
    allowConnectionsWithoutCertificates: true
    allowInvalidCertificates: true
    allowInvalidHostnames: true
    certificateKeyFile: /var/lib/mongo/mongodb.pem
    clusterFile: /var/lib/mongo/mongodb.pem
    mode: requireTLS
  unixDomainSocket:
    filePermissions: '0777'
    pathPrefix: /var/lib/mongo
processManagement:
  fork: false
  pidFilePath: /var/lib/mongo/mongod.pid
  timeZoneInfo: /usr/share/zoneinfo
replication:
  replSetName: kiteworks
security:
  authorization: enabled
  clusterAuthMode: x509
  javascriptEnabled: false
storage:
  dbPath: /var/lib/mongo
  directoryPerDB: true
  journal:
    enabled: true
systemLog:
  destination: file
  logAppend: true
  logRotate: reopen
  path: /var/lib/mongo/mongod.log
'backtrace': [{'a': '557F0C6E83AA',
                'b': '557F0991F000',
                'o': '2DC93AA',
                's': '_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.606',
                's+': '1EA'},
               {'a': '557F0C6E9E39',
                'b': '557F0991F000',
                'o': '2DCAE39',
                's': '_ZN5mongo15printStackTraceEv',
                's+': '29'},
               {'a': '557F0C6E7006',
                'b': '557F0991F000',
                'o': '2DC8006',
                's': '_ZN5mongo12_GLOBAL__N_111myTerminateEv',
                's+': 'A6'},
               {'a': '557F0C877216',
                'b': '557F0991F000',
                'o': '2F58216',
                's': '_ZN10__cxxabiv111__terminateEPFvvE',
                's+': '6'},
               {'a': '557F0C90B2C9',
                'b': '557F0991F000',
                'o': '2FEC2C9',
                's': '__cxa_call_terminate',
                's+': '39'},
               {'a': '557F0C876C05',
                'b': '557F0991F000',
                'o': '2F57C05',
                's': '__gxx_personality_v0',
                's+': '275'},
               {'a': '7F69EB5B18E3',
                'b': '7F69EB5A2000',
                'o': 'F8E3',
                's': '_Unwind_GetTextRelBase',
                's+': '23E3'},
               {'a': '7F69EB5B1E17',
                'b': '7F69EB5A2000',
                'o': 'FE17',
                's': '_Unwind_Resume',
                's+': '57'},
               {'a': '557F0A7DB9CD',
                'b': '557F0991F000',
                'o': 'EBC9CD',
                's': '_ZN5mongo10ThreadPool13_consumeTasksEv.cold.574',
                's+': '53'},
               {'a': '557F0C217C21',
                'b': '557F0991F000',
                'o': '28F8C21',
                's': '_ZN5mongo10ThreadPool17_workerThreadBodyEPS0_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE',
                's+': 'E1'},
               {'a': '557F0C217F50',
                'b': '557F0991F000',
                'o': '28F8F50',
                's': '_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5mongo4stdx6threadC4IZNS3_10ThreadPool25_startWorkerThread_inlockEvEUlvE2_JELi0EEET_DpOT0_EUlvE_EEEEE6_M_runEv',
                's+': '60'},
               {'a': '557F0C892D2F',
                'b': '557F0991F000',
                'o': '2F73D2F',
                's': 'execute_native_thread_routine',
                's+': 'F'},
               {'a': '7F69EB38DEA5',
                'b': '7F69EB386000',
                'o': '7EA5',
                's': 'start_thread',
                's+': 'C5'},
               {'a': '7F69EB0B69FD',
                'b': '7F69EAFB8000',
                'o': 'FE9FD',
                's': 'clone',
                's+': '6D'}],
 'processInfo': {'compiledModules': [],
                 'gitVersion': 'b4048e19814bfebac717cf5a880076aa69aba481',
                 'mongodbVersion': '4.4.9',
                 'somap': [{'b': '557F0991F000',
                            'buildId': '375E0455B64A8CCBA2B20814F34100164730166F',
                            'elfType': 3},
                           {'b': '7F69EB5A2000',
                            'buildId': 'EDF51350C7F71496149D064AA8B1441F786DF88A',
                            'elfType': 3,
                            'path': '/usr/lib64/libgcc_s.so.1'},
                           {'b': '7F69EB386000',
                            'buildId': 'E10CC8F2B932FC3DAEDA22F8DAC5EBB969524E5B',
                            'elfType': 3,
                            'path': '/usr/lib64/libpthread.so.0'},
                           {'b': '7F69EAFB8000',
                            'buildId': '7CD4A08C18C60E1E2EA1AEBB88C9379BD7289D38',
                            'elfType': 3,
                            'path': '/usr/lib64/libc.so.6'}],
                 'uname': {'machine': 'x86_64',
                           'release': '3.10.0-1160.59.1.el7.x86_64',
                           'sysname': 'Linux',
                           'version': '#1 SMP Wed Feb 23 16:47:03 UTC 2022'}}}

Seems to me like you’ve found the problem. Every host lookup is going to invoke the script. You’re better off with DNS or hosts file.

Thanks for the reply. I’ll test with DNS as well. The script that I mentioned has been used by other part of the application as well and hasn’t caused any issues. MongoDB crashing doesn’t seem the right behavior on part of MongoDB.

Regards

Hello All

I updated mongodb version to 4.4.13 and now I don’t see any crashes. However, after sometime the replica set nodes fail to replicate.
In my setup I have one primary and one secondary. The logs show that primary is trying to connect with secondary and secondary node accepts the connection, but does not reply back. This makes primary to keep connecting back. On the secondary side the connections keep accumulating and the connectionCount goes upto 7000 until I restarted secondary and killed a the non-responding mongod process.

I am pasting some relevant parts of the logs of primary and secondary. Any help in solving this issue is greatly appreciated.
Regards
Arshad

Excerpt from Primary’s log:

...
{"t":{"$date":"2022-04-06T20:28:04.778+00:00"},"s":"I",  "c":"REPL",     "id":21216,   "ctx":"ReplCoord-27","msg":"Member is now in state RS_DOWN","attr":{"hostAndPort":"secondary:27017","heartbeatMessage":"no response within election timeout period"}}
{"t":{"$date":"2022-04-06T20:28:04.778+00:00"},"s":"I",  "c":"REPL",     "id":21809,   "ctx":"ReplCoord-27","msg":"Can't see a majority of the set, relinquishing primary"}
{"t":{"$date":"2022-04-06T20:28:04.778+00:00"},"s":"I",  "c":"REPL",     "id":21475,   "ctx":"ReplCoord-27","msg":"Stepping down from primary in response to heartbeat"}
{"t":{"$date":"2022-04-06T20:28:04.778+00:00"},"s":"I",  "c":"REPL",     "id":21343,   "ctx":"RstlKillOpThread","msg":"Starting to kill user operations"}
{"t":{"$date":"2022-04-06T20:28:04.778+00:00"},"s":"I",  "c":"REPL",     "id":21344,   "ctx":"RstlKillOpThread","msg":"Stopped killing user operations"}
{"t":{"$date":"2022-04-06T20:28:04.778+00:00"},"s":"I",  "c":"REPL",     "id":21340,   "ctx":"RstlKillOpThread","msg":"State transition ops metrics","attr":{"metrics":{"lastStateTransition":"stepDown","userOpsKilled":0,"userOpsRunning":4}}}
....
{"t":{"$date":"2022-04-06T20:28:06.211+00:00"},"s":"I",  "c":"CONNPOOL", "id":22566,   "ctx":"ReplNetwork","msg":"Ending connection due to bad connection status","attr":{"hostAndPort":"secondary:27017","error":"CallbackCanceled: Callback was canceled","numOpenConns":0}}
{"t":{"$date":"2022-04-06T20:28:06.211+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"ReplNetwork","msg":"Connecting","attr":{"hostAndPort":"secondary:27017"}}

Excerpt from Secondary’s log:

....
{"t":{"$date":"2022-04-06T20:28:06.299+00:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"Connection accepted","attr":{"remote":"192.168.1.1:37632","connectionId":1564,"connectionCount":11}}
{"t":{"$date":"2022-04-06T20:28:06.605+00:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"Connection accepted","attr":{"remote":"192.168.1.1:37634","connectionId":1565,"connectionCount":12}}
{"t":{"$date":"2022-04-06T20:28:08.431+00:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"Connection accepted","attr":{"remote":"192.168.1.1:37636","connectionId":1566,"connectionCount":13}}
{"t":{"$date":"2022-04-06T20:28:10.288+00:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"Connection accepted","attr":{"remote":"192.168.1.1:37644","connectionId":1567,"connectionCount":14}}
{"t":{"$date":"2022-04-06T20:28:11.176+00:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"Connection accepted","attr":{"remote":"192.168.1.1:37652","connectionId":1568,"connectionCount":15}}
{"t":{"$date":"2022-04-06T20:28:18.700+00:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"Connection accepted","attr":{"remote":"192.168.1.1:37848","connectionId":1569,"connectionCount":16}}
{"t":{"$date":"2022-04-06T20:28:21.686+00:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"Connection accepted","attr":{"remote":"192.168.1.1:37858","connectionId":1570,"connectionCount":17}}
{"t":{"$date":"2022-04-06T20:28:23.804+00:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"Connection accepted","attr":{"remote":"192.168.1.1:37880","connectionId":1571,"connectionCount":18}}
{"t":{"$date":"2022-04-06T20:28:25.103+00:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"Connection accepted","attr":{"remote":"192.168.1.1:37882","connectionId":1572,"connectionCount":19}}
{"t":{"$date":"2022-04-06T20:28:26.563+00:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"Connection accepted","attr":{"remote":"192.168.1.1:37884","connectionId":1573,"connectionCount":20}}
....