Mongoose client not reconnecting to primary Mongo replica after replica election

Mongo: 4.4.5
Mongoose: 6.6.0
Node: 14.20.1 (bullseye-slim)
Docker: 20.10.8

Problem

We have a docker swarm with multiple servers and a mongo stack with 3 replicas each on different servers.
Our rs1 (primary) replica goes down due to a Docker restart. When this happens, an election occurs and rs3 is selected as the new primary. After rs1 is re-elected a few seconds later, some of our backend client replicas receive the following error when making queries:

MongooseServerSelectionError: Server selection timed out after 30000 ms
    at Function.Model.$wrapCallback (/usr/share/backend-server/node_modules/mongoose/lib/model.js:5192:32)
    at /usr/share/backend-server/node_modules/mongoose/lib/query.js:4901:21
    at /usr/share/backend-server/node_modules/mongoose/lib/helpers/promiseOrCallback.js:41:5
    at new Promise (<anonymous>)
    at promiseOrCallback (/usr/share/backend-server/node_modules/mongoose/lib/helpers/promiseOrCallback.js:40:10)
    at model.Query.exec (/usr/share/backend-server/node_modules/mongoose/lib/query.js:4900:10)
    at model.Query.Query.then (/usr/share/backend-server/node_modules/mongoose/lib/query.js:4983:15) {
  reason: TopologyDescription {
    type: 'ReplicaSetNoPrimary',
    servers: Map(3) {
      'rs1:27017' => [ServerDescription],
      'rs2:27017' => [ServerDescription],
      'rs3:27017' => [ServerDescription]
    },
    stale: false,
    compatible: true,
    heartbeatFrequencyMS: 10000,
    localThresholdMS: 15,
    setName: 'rs0',
    maxElectionId: new ObjectId("7fffffff0000000000000053"),
    maxSetVersion: 1,
    commonWireVersion: 0,
    logicalSessionTimeoutMinutes: 30
  },
  code: undefined
}

Upon checking the status of the replicas, rs1 is the primary and most of our backend clients fulfill queries correctly. We have seen this issue on two separate occasions, and we are unsure why some Mongoose clients are unable to find the primary replica after the election. We have not been able to reproduce the error intentionally.
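One way to see *why* the driver reports `ReplicaSetNoPrimary` the next time this happens is to log the Node driver's Server Discovery and Monitoring (SDAM) events. A minimal sketch, assuming Mongoose 6 (where the underlying `MongoClient` is available via `mongoose.connection.getClient()` once connected); `attachSdamLogging` is a hypothetical helper name:

```javascript
// Sketch: attach listeners for the Node driver's SDAM events so logs show
// each heartbeat failure and topology transition around an election.
// Works on any EventEmitter that fires these events; with Mongoose 6,
// pass mongoose.connection.getClient() after connecting.
function attachSdamLogging(client) {
  client.on('serverHeartbeatFailed', (ev) => {
    console.log('heartbeat failed:', ev.connectionId,
      ev.failure && ev.failure.message);
  });
  client.on('serverDescriptionChanged', (ev) => {
    // e.g. 'rs1:27017 RSPrimary -> Unknown' when the primary drops
    console.log('server changed:', ev.address,
      ev.previousDescription.type, '->', ev.newDescription.type);
  });
  client.on('topologyDescriptionChanged', (ev) => {
    // e.g. 'ReplicaSetWithPrimary' vs 'ReplicaSetNoPrimary'
    console.log('topology type:', ev.newDescription.type);
  });
}
```

If the logs show heartbeats succeeding but the topology stuck in `ReplicaSetNoPrimary`, that points at the driver's view rather than the network.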

Our Setup

docker-compose.yml:

The image being used is for MongoDB 4.4.5.

version: '3.3'
secrets:
  mongo_cluster_key:
    external: true
services:
  rs1:
    image: mongodb-custom:v1.0.0
    command: mongod --keyFile /run/secrets/mongo_cluster_key --replSet "rs0"
    networks:
      - mongo
    ports:
      - 27017:27017
    secrets:
      - source: mongo_cluster_key
        target: mongo_cluster_key
        uid: '999'
        gid: '999'
        mode: 0400
    environment:
    - MONGO_INITDB_ROOT_USERNAME=admin 
    - MONGO_INITDB_ROOT_PASSWORD=password
    - MONGO_INITDB_DATABASE=admin
    - MAIN_MONGO_DB_NAME=testing
    - MAIN_MONGO_DB_USERNAME=test
    - MAIN_MONGO_DB_PASSWORD=password
    - MAIN_MONGO_DB_ROLE=readWrite
    deploy:
      replicas: 1
    volumes:
      - rs1:/data/db 
      - rs1:/data/configdb
  rs2:
    image: mongodb-custom:v1.0.0
    command: mongod --keyFile /run/secrets/mongo_cluster_key --replSet "rs0"
    networks:
      - mongo
    secrets:
      - source: mongo_cluster_key
        target: mongo_cluster_key
        uid: '999'
        gid: '999'
        mode: 0400
    environment:
    - MONGO_INITDB_ROOT_USERNAME=admin 
    - MONGO_INITDB_ROOT_PASSWORD=password
    - MONGO_INITDB_DATABASE=admin
    - MAIN_MONGO_DB_NAME=testing
    - MAIN_MONGO_DB_USERNAME=test
    - MAIN_MONGO_DB_PASSWORD=password
    - MAIN_MONGO_DB_ROLE=readWrite
    deploy:
      replicas: 1
    volumes:
      - rs2:/data/db 
      - rs2:/data/configdb
  rs3:
    image: mongodb-custom:v1.0.0
    command: mongod --keyFile /run/secrets/mongo_cluster_key --replSet "rs0"
    networks:
      - mongo
    secrets:
      - source: mongo_cluster_key
        target: mongo_cluster_key
        uid: '999'
        gid: '999'
        mode: 0400
    environment:
    - MONGO_INITDB_ROOT_USERNAME=admin 
    - MONGO_INITDB_ROOT_PASSWORD=password
    - MONGO_INITDB_DATABASE=admin
    - MAIN_MONGO_DB_NAME=testing
    - MAIN_MONGO_DB_USERNAME=test
    - MAIN_MONGO_DB_PASSWORD=password
    - MAIN_MONGO_DB_ROLE=readWrite
    deploy:
      replicas: 1
    volumes:
      - rs3:/data/db 
      - rs3:/data/configdb
  rs:
    image: mongodb-custom:v1.0.0
    command: /usr/local/bin/replica-init.sh
    networks:
      - mongo
    secrets:
      - source: mongo_cluster_key
        target: mongo_cluster_key
        uid: '999'
        gid: '999'
        mode: 0400
    environment:
    - MONGO_INITDB_ROOT_USERNAME=admin 
    - MONGO_INITDB_ROOT_PASSWORD=password
    - MONGO_INITDB_DATABASE=admin
    - MAIN_MONGO_DB_NAME=testing
    - MAIN_MONGO_DB_USERNAME=test
    - MAIN_MONGO_DB_PASSWORD=password
    - MAIN_MONGO_DB_ROLE=readWrite
    deploy:
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 10
volumes:
  rs1:
    driver: local
  rs2:
    driver: local
  rs3:
    driver: local
networks:
  mongo:
    driver: overlay
    driver_opts:
      encrypted: "true"
    internal: true
    attachable: true

replica-init.sh:

#!/bin/bash
# Make sure all 3 replicas are reachable
for rs in rs1 rs2 rs3; do
  if ! mongo --host "$rs" --eval 'db'; then
    exit 1
  fi
done
MONGO_INITDB_ROOT_USERNAME="$(< "$MONGO_INITDB_ROOT_USERNAME_FILE")"
MONGO_INITDB_ROOT_PASSWORD="$(< "$MONGO_INITDB_ROOT_PASSWORD_FILE")"
# Connect to rs1 and configure the replica set if not already done
if ! mongo --host rs1 --quiet --eval 'rs.status().members.length'; then
  # Replica set not yet configured
  mongo --username "$MONGO_INITDB_ROOT_USERNAME" -p "$MONGO_INITDB_ROOT_PASSWORD" --host rs1 --eval 'rs.initiate({ _id: "rs0", version: 1, members: [ { _id: 0, host: "rs1", priority: 100 }, { _id: 1, host: "rs2", priority: 2 }, { _id: 2, host: "rs3", priority: 2 } ] })'
fi

backend-server.js

const mongoose = require('mongoose');
// MongoDB Connection Class
class MongoDB {
  constructor() {
    mongoose
      .connect('mongodb://test:password@rs1:27017,rs2:27017,rs3:27017/testing?replicaSet=rs0')
      .then(() => {
        console.log('Connected to MongoDB');
      })
      .catch((err) => {
        console.error('MongoDB Error: ', err.message);
      });
    // Add error handler while connected
    mongoose.connection.on('error', (err) => {
      console.log('MongoDB Error: ', err);
    });
    mongoose.pluralize(null);
  }
}
module.exports = new MongoDB();

We have around 2,500 connections spread across 60 clients, so using the directConnection flag may be too slow.
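Independent of directConnection, the topology snapshot in the error shows the default heartbeatFrequencyMS of 10000, and the timeout is the default 30 s serverSelectionTimeoutMS. Lowering these may make the driver notice the re-election sooner and fail fast instead of hanging. The option names below are standard MongoDB driver / Mongoose connection options, but the values are guesses to experiment with, not recommendations:

```javascript
// Illustrative tuning only; values are assumptions for experimentation.
const connectOptions = {
  serverSelectionTimeoutMS: 10000, // fail queries faster than the 30 s default
  heartbeatFrequencyMS: 2000,      // re-check topology more often than the 10 s default
  maxPoolSize: 40,                 // cap sockets per client (relevant with ~60 clients)
};

// Usage (sketch):
// mongoose.connect('mongodb://test:password@rs1:27017,rs2:27017,rs3:27017/testing?replicaSet=rs0', connectOptions);
```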

What we’ve tried

  1. We have tested the replica election process: a secondary node is correctly elected as the new primary once the original primary goes down, and the original primary is re-elected once that node comes back up.
  2. We have verified that the priorities for the primary and secondary nodes are set correctly.
  3. Docker Swarm DNS resolves rs1’s hostname, and rs1 is reachable from the backend server container whose Mongoose client receives the error above.

We verified that the authentication and connection strings are all set up correctly. When this error occurs, restarting the backend server Docker containers fixes the issue. We are not receiving any connection errors in our backend server logs; errors appear only when querying the database.
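Since a container restart fixes it, an in-process equivalent worth sketching is to tear down and rebuild the Mongoose connection when a query fails with a server selection timeout. The helper below is generic (and hypothetical); with Mongoose one would pass `() => mongoose.disconnect()` and `() => mongoose.connect(uri)` as the close/open callbacks:

```javascript
// Sketch of the in-process equivalent of "restart the container": on a
// server selection timeout, close the broken connection and reopen it.
// The error name matches the MongooseServerSelectionError seen above.
async function recycleOnSelectionTimeout(err, close, open) {
  if (err && err.name === 'MongooseServerSelectionError') {
    try {
      await close();
    } catch (_) {
      // connection is already broken; ignore close failures
    }
    await open();
    return true; // connection was recycled
  }
  return false; // unrelated error; let the caller handle it
}
```

This is a workaround, not a fix; it papers over whatever leaves the client stuck, but it avoids a manual container restart.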


Wouldn’t a restart of the Docker environment also bring down the Docker network? The TCP/IP connection made by your client prior to the Docker restart would have been interrupted by it. It could be that the driver in your app attempts to re-establish the connection while the Docker network is still down during the restart.

@Austin_Conry … not sure if you were able to resolve this, but here is some help from my limited knowledge.

The error “MongooseServerSelectionError: Server selection timed out after 30000 ms” means the Mongoose client could not find a suitable server (here, a primary) within the server selection timeout. It occurred after the primary (rs1) went down due to a Docker restart, rs3 was elected as the new primary, and rs1 was re-elected a few seconds later; at that point some backend clients could no longer find the primary.
The core issue appears to be the Mongoose client’s inability to re-discover the primary after the election. It is unclear why some clients are affected while others are not; the cause could be a network issue or a bug in the MongoDB Node.js driver.
To resolve this issue, you can try the following steps:

  • Check the network settings and connectivity between the backend clients and the MongoDB replicas.
  • Upgrade the MongoDB Node.js driver to the latest version (if not already done).
  • Check the MongoDB logs for any relevant errors or messages.
  • If the issue persists, try restarting the affected backend clients.

Additionally, it is advisable to review the MongoDB configuration and setup, especially in the context of replica sets and Docker Swarm, to ensure optimal performance and stability.
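On the last bullet: a lighter-weight alternative to restarting a client is retrying the failed query after a short delay, giving an in-flight election time to finish. A minimal sketch; the error name matches the stack trace above, and the attempt count and delay are arbitrary examples:

```javascript
// Minimal retry helper: re-run an operation that failed with a server
// selection timeout, waiting between attempts so an election can complete.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withSelectionRetry(operation, attempts = 3, delayMs = 1000) {
  let lastErr;
  for (let i = 0; i < attempts; i++) {
    try {
      return await operation();
    } catch (err) {
      if (err.name !== 'MongooseServerSelectionError') throw err; // unrelated error
      lastErr = err;
      await sleep(delayMs);
    }
  }
  throw lastErr; // still no primary after all attempts
}

// Usage (sketch):
// const doc = await withSelectionRetry(() => Model.findOne({ _id: id }).exec());
```

This only masks the symptom while the driver is stuck, but it keeps queries flowing during a normal election as well.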