Query to MongoDB gets stuck if the cluster primary loses network (using PyMongo)

I have a problem regarding mongo. Here is my setup:

  • 3 mongod instances in a replica set, on 3 different virtual machines
  • 1 mongo-configsvr on each of the 3 previous VMs
  • 1 mongos on the client VM to route to the 3 mongod instances
  • Software developed in Python (with PyMongo)

Now, my problem. I’m developing software that sends a lot of requests to mongo. The problem appeared when my primary mongod VM lost its network connection: the request that was in flight at that moment got stuck.
By stuck I mean that the request won’t complete, but it also won’t raise any “AutoReconnect” that would let me retry it so it can be routed to the new primary.
The request is completely stuck; it only raises an error after about 15 minutes OR when the network comes back up on the previous primary.

I wrote a script to explain the problem:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import time

from pymongo import MongoClient


def main():
    # Connect to the local mongos with the default options.
    mongo_client = MongoClient()
    db = mongo_client.test

    # Start from an empty 'test' collection.
    collection = db['test']
    collection.drop()
    collection = db['test']

    print u'Connected; lets go !'
    while True:
        # Insert one document per second and time how long the insert takes.
        t0 = time.time()
        collection.insert({'time': time.time()})
        print u'Wait [%03.6f] | There is %s items in test collection' % ((time.time() - t0), collection.count())
        print u'---------------------------------------------------------------'
        time.sleep(1)


main()

As you can see in the code, I do one request per second. So I let the script run, and then I unplug the ethernet cable on the primary. The script gets stuck at the insert; nothing is printed until the ethernet cable is plugged back in or roughly 15 minutes have passed.

I can’t believe that no one has ever had this problem, since losing network connectivity is not exactly rare. But I did not find anything using Google.

Hope you will have answers, thank you!

Hi @Thomas_Clemenceau and welcome in the MongoDB Community :muscle: !

I think you are missing a timeout option in your code. Can you try adding serverSelectionTimeoutMS to your client to see if it’s better?
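Something like this, for example (the URI and the timeout values below are just placeholders, adapt them to your setup):

from pymongo import MongoClient

# serverSelectionTimeoutMS bounds how long the driver waits for a suitable
# server (for example a new primary) before raising ServerSelectionTimeoutError,
# while connectTimeoutMS / socketTimeoutMS bound the individual network
# operations so a dead connection fails fast instead of hanging.
mongo_client = MongoClient(
    'mongodb://localhost:27017/',
    serverSelectionTimeoutMS=5000,   # wait at most 5 s for a primary
    connectTimeoutMS=5000,           # fail fast on unreachable hosts
    socketTimeoutMS=10000,           # abort sends/receives that stall
)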

Otherwise, there are a few other client options that can help, I think.

A few other things to consider (there is a small sketch after this list for the last two points):

  • Which version of MongoDB are you using?
  • Did you enable retryable writes?
  • Which write concern are you using to write this document?
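Here is a quick sketch of what I mean by the last two points. The host, timeout and collection names are just examples, and retryable writes need MongoDB 3.6+ with a recent PyMongo:

from pymongo import MongoClient, WriteConcern

# retryWrites lets the driver retry a failed write once after a failover
# instead of surfacing the error immediately.
client = MongoClient('mongodb://localhost:27017/', retryWrites=True)

# w='majority' waits until a majority of the replica set has acknowledged the
# write, and wtimeout bounds that wait so the call cannot block forever.
collection = client.test.get_collection(
    'test',
    write_concern=WriteConcern(w='majority', wtimeout=5000),
)
collection.insert_one({'time': 'example'})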

Also, this setup isn’t production-ready at all. It’s full of single points of failure, and disconnecting an ethernet cable makes the entire system unstable because one of the primary nodes is no longer available and the replica set cannot recover, as there is only a single node in your shard. I guess you are writing to a sharded collection?
In theory, if your shard key distributes the write operations evenly (as it should), only about 1/3 of the write operations should fail when one of your shards is down.
As it’s impossible to perform this write operation - I guess it was supposed to be routed to that one shard - the driver is presumably waiting to retry it, but since your setup isn’t capable of electing a new primary, it’s stuck. It should time out faster with the correct client option, though.
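Once a timeout is set, you can catch the error in your loop and retry after a new primary has been elected. Again, just a sketch with placeholder values:

import time

from pymongo import MongoClient
from pymongo.errors import AutoReconnect, ServerSelectionTimeoutError

# With serverSelectionTimeoutMS set, a lost primary surfaces as an exception
# within a few seconds instead of a long hang, so the application can retry.
client = MongoClient('mongodb://localhost:27017/', serverSelectionTimeoutMS=5000)
collection = client.test.test

for attempt in range(5):
    try:
        collection.insert_one({'time': time.time()})
        break
    except (AutoReconnect, ServerSelectionTimeoutError) as exc:
        print('Insert failed (%s), retrying...' % exc)
        time.sleep(1)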

Cheers,
Maxime.