We are working on an application that uses pymongo to connect to MongoDB. We set the server selection timeout to 60 seconds by adding serverSelectionTimeoutMS=60000 to the MONGODB_CONNECTION_STRING. It works fine most of the time, but several times a day it fails with an error like this:
File "/usr/local/lib/python3.9/site-packages/celery/backends/mongodb.py", line 294, in collection
collection.create_index('date_done', background=True)
File "/usr/local/lib/python3.9/site-packages/pymongo/collection.py", line 2059, in create_index
return self.__create_indexes([index], session, **cmd_options)[0]
File "/usr/local/lib/python3.9/site-packages/pymongo/collection.py", line 1919, in __create_indexes
with self._socket_for_writes(session) as sock_info:
File "/usr/local/lib/python3.9/site-packages/pymongo/collection.py", line 198, in _socket_for_writes
return self.__database.client._socket_for_writes(session)
File "/usr/local/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1293, in _socket_for_writes
server = self._select_server(writable_server_selector, session)
File "/usr/local/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1278, in _select_server
server = topology.select_server(server_selector)
File "/usr/local/lib/python3.9/site-packages/pymongo/topology.py", line 241, in select_server
return random.choice(self.select_servers(selector,
File "/usr/local/lib/python3.9/site-packages/pymongo/topology.py", line 199, in select_servers
server_descriptions = self._select_servers_loop(
File "/usr/local/lib/python3.9/site-packages/pymongo/topology.py", line 215, in _select_servers_loop
raise ServerSelectionTimeoutError(
... pymongo.errors.ServerSelectionTimeoutError: No primary available for writes, Timeout: 0.06s, Topology Description: <TopologyDescription id: XXX, topology_type: ReplicaSetNoPrimary, servers: [<ServerDescription ('URI_XXX', PORT_XXX) server_type: Unknown, rtt: None>, <ServerDescription ('URI_XXX', PORT_XXX) server_type: RSSecondary, rtt: XXX>, <ServerDescription ('URI_XXX', PORT_XXX) server_type: Unknown, rtt: None>]>
The error message reports a timeout of 0.06s, which is very strange, since our setting is 60s (60000ms).
I checked the pymongo code and found that in pymongo/client_options.py, __server_selection_timeout is set from either the serverselectiontimeoutms value we defined in the connection string or common.SERVER_SELECTION_TIMEOUT, which is 30 (in seconds).
# self.__server_selection_timeout is in seconds. Must use full name for
# common.SERVER_SELECTION_TIMEOUT because it is set directly by tests.
self.__server_selection_timeout = options.get(
'serverselectiontimeoutms', common.SERVER_SELECTION_TIMEOUT)
The key point is that the 3 values in the code above are all supposed to be in seconds, but the serverselectiontimeoutms we provide in the connection string is in milliseconds (60000)!
Then I found that pymongo/common.py defines a validator for this option:
'serverselectiontimeoutms': validate_timeout_or_zero,
And validate_timeout_or_zero divides serverselectiontimeoutms by 1000, so the value is now in seconds:
return validate_positive_float(option, value) / 1000.0
I haven't fully figured out why, but in the failures we hit, the validator was apparently applied twice, so serverselectiontimeoutms went from 60000 to 60 and then to 0.06.
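To illustrate the arithmetic, here is a minimal sketch of what a double pass through the validator would do. The function below is a simplified stand-in I wrote for this post (the name validate_timeout_or_zero_sketch and the non-negative check are my assumptions), not pymongo's actual implementation; only the division by 1000.0 is taken from the code quoted above.

```python
def validate_timeout_or_zero_sketch(option, value):
    """Hypothetical stand-in for pymongo's validator: check the value is a
    non-negative number, then convert milliseconds to seconds."""
    value = float(value)
    if value < 0:
        raise ValueError(f"{option} must be non-negative, got {value}")
    return value / 1000.0


# First pass: the millisecond value from the connection string becomes seconds.
once = validate_timeout_or_zero_sketch("serverselectiontimeoutms", 60000)

# Second pass: the already-converted value is divided by 1000 again.
twice = validate_timeout_or_zero_sketch("serverselectiontimeoutms", once)

print(once)   # 60.0
print(twice)  # 0.06
```

The second result matches the "Timeout: 0.06s" in the error message, which is consistent with (but does not prove) the option being validated twice.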
I think we need to find out in which situations the validator is triggered more than once, and fix that.
I also want to say that it seems like a bad design to have the validator divide serverselectiontimeoutms by 1000 and save the new value back under the same key. As its name indicates, serverselectiontimeoutms should be a number in milliseconds. The validator changes its meaning to seconds, which will confuse people.
A better approach would be to do the conversion when setting __server_selection_timeout in client_options.py, like below:
self.__server_selection_timeout = options.get(
'serverselectiontimeoutms', common.SERVER_SELECTION_TIMEOUT * 1000) / 1000
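As a runnable sketch of the same idea outside pymongo: the options dict keeps the value in milliseconds, and the conversion happens only at the point where the value is read. The names DEFAULT_SERVER_SELECTION_TIMEOUT_S and server_selection_timeout_seconds are hypothetical, and the default of 30 simply mirrors common.SERVER_SELECTION_TIMEOUT as described above.

```python
# Assumed default in seconds, mirroring common.SERVER_SELECTION_TIMEOUT.
DEFAULT_SERVER_SELECTION_TIMEOUT_S = 30


def server_selection_timeout_seconds(options):
    """Read the timeout from options (stored in milliseconds) and return seconds.

    The stored option is never overwritten, so reading it any number of
    times gives the same result.
    """
    timeout_ms = options.get(
        "serverselectiontimeoutms", DEFAULT_SERVER_SELECTION_TIMEOUT_S * 1000
    )
    return timeout_ms / 1000


options = {"serverselectiontimeoutms": 60000}
print(server_selection_timeout_seconds(options))  # 60.0
print(server_selection_timeout_seconds(options))  # 60.0 again; options is unchanged
print(server_selection_timeout_seconds({}))       # 30.0 (default)
```

With this layout, the option keeps the unit its name promises, and the conversion cannot compound because nothing writes the converted value back.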
For now, the only workaround for our application is to remove serverSelectionTimeoutMS from the connection string. Are there any other suggestions?