Queries stop finishing after a while without an error

Hey there!

I hope you can help me with a weird issue I’m facing:

I self-host a MongoDB replica set (6.0.8) and have several Node.js microservices that connect to it.

At first, everything works totally fine, but:

  • After a while, simple find() queries do not finish at all. They don’t throw an error and they don’t time out. They just never return, and the script waits for them forever.
  • This happens randomly on all microservices, but not at the same time, so the database itself seems to be working fine.
  • A health check is running, which confirms that the mongo connection is still up. I double-checked this by restarting the mongo service: the health check detects the restart immediately and reconnects.
  • After a restart of the microservice, everything works fine again.
  • The microservice itself is working fine; everything that is not mongo-related keeps working.
  • There are only 2-3 rows of data in the database and we are talking about an absolutely simple find request, no heavy aggregation pipeline or something else.
  • It doesn’t recover automatically, regardless of how long I wait.
  • There is absolutely no load on the microservices at the moment. It works in the evening, and the next morning it no longer works, even though not a single request came in during the meantime.
  • If I run a pen test, I can do 400 requests a minute for a long time without any problems. So it’s not that it fails “after 100 requests”, or that resource limits are reached, or anything like that.

So, these are the hard facts. Initially the microservice was using the 4.x node driver. I updated it to the latest 5.x driver. Nothing changed.

I know that the error is hard to find, but maybe you guys have ideas about what I could test or do to debug this? In particular, the fact that all find() queries just freeze without throwing any errors makes it impossible for me to find the root cause.
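One way to at least turn the silent freeze into a visible error is to race each query against a timer. The following is only a minimal sketch: `withTimeout` is a hypothetical helper (not part of the driver), and it does not fix the underlying cause or cancel the stuck operation. It only makes the hang show up in logs.

```javascript
// Hypothetical helper: rejects if `promise` has not settled within `ms` milliseconds.
// It does not cancel the underlying operation; it only stops waiting for it.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => {
      reject(new Error(`${label} did not settle within ${ms} ms`));
    }, ms);
  });
  // Whichever settles first wins; the timer is always cleared afterwards
  // so it cannot keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage sketch (assumes `collection` is a driver Collection obtained elsewhere):
// const docs = await withTimeout(collection.find({}).toArray(), 5000, 'find');
```

The driver also has its own timeout-related settings, such as the `socketTimeoutMS` connection option and the `maxTimeMS` option on find(), which may be worth checking alongside a wrapper like this.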

Thanks for any help!

FWIW, I’m seeing similar behavior, including on microservices that have run happily for a while. I’m adding a database check to our health check to at least restart around the problem, but it’s frustrating, to be sure.
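A database check of that kind could look roughly like the sketch below. It assumes the probe runs a real query with a deadline, so that a hung query counts as unhealthy instead of blocking the health endpoint. `checkDatabase` and `runQuery` are hypothetical names, and the 2000 ms default is arbitrary.

```javascript
// Hypothetical health probe. `runQuery` is any function returning a promise,
// e.g. () => collection.findOne({}) against a small collection.
async function checkDatabase(runQuery, timeoutMs = 2000) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`no reply within ${timeoutMs} ms`)),
      timeoutMs
    );
  });
  try {
    // Wait for the query or the deadline, whichever comes first.
    await Promise.race([runQuery(), deadline]);
    return { ok: true };
  } catch (err) {
    // A hung query and a failed query are both reported as unhealthy.
    return { ok: false, reason: err.message };
  } finally {
    clearTimeout(timer);
  }
}
```

The health endpoint can then report failure when `ok` is false, which lets the orchestrator restart the service.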

I only noticed this after upgrading to the 5.4 node driver, so I’m also trying a downgrade to 5.3. The cluster is running 6.0.12. Otherwise, the symptoms are the same: queries just seem to suddenly stop, while other parts of the microservice continue to work.

As an added data point, so far I’m only seeing this on microservices that use websocket connections, but that may be a red herring: in our setup, these may simply be the services that show trouble first.
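To narrow down where the queries stall, the driver's command monitoring events can be used to list commands that started but never finished. This is a sketch only: `trackPendingCommands` is a hypothetical helper, and it assumes `client` is a MongoClient created with `{ monitorCommands: true }` so that it emits `commandStarted`, `commandSucceeded` and `commandFailed`.

```javascript
// Hypothetical tracker. `client` is assumed to be a MongoClient created with
// { monitorCommands: true }, which makes it emit the command events used below.
function trackPendingCommands(client) {
  const pending = new Map(); // requestId -> { name, startedAt }

  client.on('commandStarted', (ev) => {
    pending.set(ev.requestId, { name: ev.commandName, startedAt: Date.now() });
  });

  // A command is no longer pending once it has succeeded or failed.
  const done = (ev) => pending.delete(ev.requestId);
  client.on('commandSucceeded', done);
  client.on('commandFailed', done);

  // Returns commands that started at least `olderThanMs` ago and never finished.
  return (olderThanMs) => {
    const now = Date.now();
    return [...pending.values()].filter((c) => now - c.startedAt >= olderThanMs);
  };
}

// Usage sketch: const stuck = trackPendingCommands(client);
// setInterval(() => console.log(stuck(10000)), 10000);
```

If a frozen find() never appears as started at all, that would suggest the stall happens before the command is sent, which is a useful data point in itself.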