serverSelectionTimeoutMS is not set correctly

Shenfeng_Liu · January 8, 2022, 12:03pm

We are working on an application that uses pymongo to connect to MongoDB, and we set the server selection timeout to 60 seconds by adding serverSelectionTimeoutMS=60000 to the MONGODB_CONNECTION_STRING string. It works fine most of the time, but every day we see several failures with an error like this:

File "/usr/local/lib/python3.9/site-packages/celery/backends/mongodb.py", line 294, in collection
    collection.create_index('date_done', background=True)
  File "/usr/local/lib/python3.9/site-packages/pymongo/collection.py", line 2059, in create_index
    return self.__create_indexes([index], session, **cmd_options)[0]
  File "/usr/local/lib/python3.9/site-packages/pymongo/collection.py", line 1919, in __create_indexes
    with self._socket_for_writes(session) as sock_info:
  File "/usr/local/lib/python3.9/site-packages/pymongo/collection.py", line 198, in _socket_for_writes
    return self.__database.client._socket_for_writes(session)
  File "/usr/local/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1293, in _socket_for_writes
    server = self._select_server(writable_server_selector, session)
  File "/usr/local/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1278, in _select_server
    server = topology.select_server(server_selector)
  File "/usr/local/lib/python3.9/site-packages/pymongo/topology.py", line 241, in select_server
    return random.choice(self.select_servers(selector,
  File "/usr/local/lib/python3.9/site-packages/pymongo/topology.py", line 199, in select_servers
    server_descriptions = self._select_servers_loop(
  File "/usr/local/lib/python3.9/site-packages/pymongo/topology.py", line 215, in _select_servers_loop
    raise ServerSelectionTimeoutError(
... pymongo.errors.ServerSelectionTimeoutError: No primary available for writes, Timeout: 0.06s, Topology Description: <TopologyDescription id: XXX, topology_type: ReplicaSetNoPrimary, servers: [<ServerDescription ('URI_XXX', PORT_XXX) server_type: Unknown, rtt: None>, <ServerDescription ('URI_XXX', PORT_XXX) server_type: RSSecondary, rtt: XXX>, <ServerDescription ('URI_XXX', PORT_XXX) server_type: Unknown, rtt: None>]>

The error message reports a timeout of 0.06s, which is very strange, since our setting is 60s (60000ms).

I checked the pymongo code and found that in pymongo/client_options.py, __server_selection_timeout is set to either the serverselectiontimeoutms value from the connection string or common.SERVER_SELECTION_TIMEOUT, which is 30 (in seconds).

        # self.__server_selection_timeout is in seconds. Must use full name for
        # common.SERVER_SELECTION_TIMEOUT because it is set directly by tests.
        self.__server_selection_timeout = options.get(
            'serverselectiontimeoutms', common.SERVER_SELECTION_TIMEOUT)

The key point is that all 3 values in the code above are supposed to be numbers in seconds. But the serverselectiontimeoutms we provide in the connection string is in milliseconds (60000)!

Then I found that pymongo/common.py defines a validator for this option:

'serverselectiontimeoutms': validate_timeout_or_zero,

And validate_timeout_or_zero divides serverselectiontimeoutms by 1000, so from then on the value is a number in seconds:

return validate_positive_float(option, value) / 1000.0

I haven't fully figured out why, but in the failures we hit, the validator was apparently applied twice for some reason, so serverselectiontimeoutms went from 60000 to 60 and then to 0.06.
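
To make the arithmetic concrete, here is a minimal sketch using a simplified stand-in for the validator. The function name and body are my own approximation for illustration, not pymongo's actual code; it only reproduces the "divide by 1000" step quoted above.

```python
def validate_timeout_ms_to_seconds(value):
    """Hypothetical stand-in for the validator: takes a positive number
    (assumed to be milliseconds) and returns it divided by 1000."""
    value = float(value)
    if value <= 0:
        raise ValueError("timeout must be positive")
    return value / 1000.0


# Applied once: 60000 ms becomes 60 seconds, as intended.
once = validate_timeout_ms_to_seconds(60000)

# Applied again to the already-converted value: 60 s is treated as 60 ms.
twice = validate_timeout_ms_to_seconds(once)

print(once)   # 60.0
print(twice)  # 0.06
```

The second result matches the "Timeout: 0.06s" in the error message, which is why I suspect a double conversion.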

I think we need to find out in which situations the validator is triggered more than once, and fix that.

I also want to say that it seems like bad design to have the validator divide serverselectiontimeoutms by 1000 and store the new value back under the same name.
serverselectiontimeoutms, as the name indicates, should be a number in milliseconds. The validator changes its meaning to seconds, which will confuse people.
A better approach would be to do the conversion when setting __server_selection_timeout in client_options.py, like below:

        self.__server_selection_timeout = options.get(
            'serverselectiontimeoutms', common.SERVER_SELECTION_TIMEOUT * 1000) / 1000
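
As a standalone sketch of that idea (a plain dict stands in for the parsed options, and the constant and function names are hypothetical, not pymongo's): the stored option always stays in milliseconds, and the conversion happens only at the point where the value is read.

```python
# Assumed default of 30 seconds, matching common.SERVER_SELECTION_TIMEOUT above.
DEFAULT_SERVER_SELECTION_TIMEOUT_S = 30


def server_selection_timeout_seconds(options):
    """Read the timeout in milliseconds and convert to seconds on use.
    The options dict itself is never modified."""
    timeout_ms = options.get(
        'serverselectiontimeoutms', DEFAULT_SERVER_SELECTION_TIMEOUT_S * 1000)
    return timeout_ms / 1000


opts = {'serverselectiontimeoutms': 60000}

# Reading the value any number of times gives the same answer, because
# the stored option is left untouched in milliseconds.
first = server_selection_timeout_seconds(opts)
second = server_selection_timeout_seconds(opts)
print(first, second)  # 60.0 60.0
```

With this shape, running the read path twice cannot shrink the timeout, since nothing writes the converted value back.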

For now, the only workaround we have for our application is to remove serverSelectionTimeoutMS from the connection string. Are there any other suggestions?