My team and I have been investigating an error for about a month, sadly without any luck.
We have an on-premise MongoDB cluster and an app written in Node.js using Mongoose (in NestJS).
We run around 10 pods of that app, and everything works fine until suddenly some of the pods start getting “server selection timeout, no primary replicaset”.
The weird thing is that if I kill the pods, everything works again, and then after minutes or hours it happens again on different pods.
We’re really lost and would be grateful for any help or ideas.
Thanks for your reply!
- The primary hasn’t changed for a long time.
- They had been alive for a long time.
- I haven’t checked the network logs yet; I will check and see if there is anything interesting there.
Another thing I forgot to mention is that all of our DB operations happen in transactions.
I tried to ping the DB from the pods that have the error and it works, so it doesn’t seem like a network problem…
Check the MongoDB server config for connection limits. The MongoDB server logs would also indicate this, but connection drops may not be apparent. Check the lines where connections are accepted/dropped in the timespans of your errors.
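To make that log check concrete, here is a rough sketch that counts accepted and ended connections inside a time window. It assumes the server writes structured JSON logs (the format used since MongoDB 4.4) and that the relevant entries carry the messages "Connection accepted" and "Connection ended"; verify both against your own log files. The function name `countConnectionEvents` is made up for this example.

```typescript
interface ConnectionCounts {
  accepted: number;
  ended: number;
}

// Counts connection accept/end entries whose timestamp falls in [from, to].
// Assumes one JSON document per line with a `t.$date` timestamp and a `msg` field.
function countConnectionEvents(logLines: string[], from: Date, to: Date): ConnectionCounts {
  const counts: ConnectionCounts = { accepted: 0, ended: 0 };
  for (const line of logLines) {
    let entry: any;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip lines that are not JSON (e.g. older plain-text logs)
    }
    const ts = new Date(entry?.t?.$date).getTime();
    if (Number.isNaN(ts) || ts < from.getTime() || ts > to.getTime()) continue;
    if (entry.msg === "Connection accepted") counts.accepted += 1;
    else if (entry.msg === "Connection ended") counts.ended += 1;
  }
  return counts;
}
```

If the accepted count keeps climbing well above the ended count around the time of the errors, that would point towards connections piling up rather than a network problem.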
If the limit is 10k and your pods create 100 singleton connections, there is no problem. But if they create, let's say, 10 individual connections per request, you can only serve 1,000 requests at a time, and the 1,001st request will get a timeout.
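The arithmetic behind that estimate, as a small sketch. The function name and the numbers are only illustrative; plug in the limit from your own server config.

```typescript
// Rough capacity estimate: how many requests can be in flight at once
// before the server's connection limit is used up.
function maxConcurrentRequests(connectionLimit: number, connectionsPerRequest: number): number {
  if (!(connectionLimit >= 0) || !(connectionsPerRequest > 0)) {
    throw new RangeError("connectionLimit must be >= 0 and connectionsPerRequest must be > 0");
  }
  return Math.floor(connectionLimit / connectionsPerRequest);
}

// A 10k limit with 10 connections opened per request:
console.log(maxConcurrentRequests(10000, 10)); // 1000
```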
As far as I know, Mongoose should be handling connections as a singleton.
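By "singleton" I mean one shared connection (and its pool) per process, not a new one per request. This is not Mongoose's actual implementation, just a generic sketch of the pattern; `singleton` and `getConnection` are hypothetical names, and the factory below is a stub standing in for a real connect call.

```typescript
// Wraps an async factory so it runs at most once per process; every caller
// gets the same promise and therefore the same connection object.
function singleton<T>(factory: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (!cached) {
      cached = factory().catch((err) => {
        cached = undefined; // allow a retry after a failed attempt
        throw err;
      });
    }
    return cached;
  };
}

let opened = 0;
// Stub factory: in a real app this is where the actual connect call would go.
const getConnection = singleton(async () => {
  opened += 1;
  return { id: opened };
});
```

If your code ends up calling the real connect function on every request instead of going through something like `getConnection`, that is the kind of thing that would eat up the connection limit.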
This might be the issue, or not. A transaction itself should not affect this, but check how Mongoose uses connections during a transaction.