Go MongoDB driver is not reconnecting after MongoDB Atlas server goes down due to a disk size increase

We have a Go server in production that writes logs to a database in MongoDB Atlas. Our Atlas configuration automatically increases the server's disk size when it is about to fill up. During the upgrade the server goes down, apparently for a few minutes, but when it comes back up the driver is not able to reconnect, and all subsequent log writes to the database fail.

We tried to reproduce this problem locally by stopping and restarting a Docker container running a MongoDB database, but the driver reconnects without any issue in that scenario. It seems to be something related to DNS resolution.

We have seen several time-related parameters that we can tune, such as:

  • ServerSelectionTimeout
  • HeartbeatInterval
  • ConnectionTimeout
  • MaxConnIdleTime
  • SocketTimeout

But since we don’t know the exact error (and we can’t reproduce the problem locally), it’s not clear which one to use (we’ve tried tuning them all locally).
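For reference, most of these timeouts can also be set directly as connection string options instead of through the driver's options builder; a sketch, where the host and credentials are placeholders and the millisecond values are just examples:

```
mongodb+srv://user:pass@cluster0.example.mongodb.net/?serverSelectionTimeoutMS=30000&connectTimeoutMS=10000&socketTimeoutMS=60000&maxIdleTimeMS=120000&heartbeatFrequencyMS=10000
```

(`heartbeatFrequencyMS` corresponds to `HeartbeatInterval`, and `maxIdleTimeMS` to `MaxConnIdleTime`.)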

We think the driver could be caching the server IP resolved from the connection string and reusing it to skip DNS resolution on future requests. Then, after Atlas upgrades the server, the IP may no longer be the same, which would explain why the driver cannot reconnect remotely but can do so locally (where the server IP doesn’t change and no DNS resolution is involved).
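One way to check this hypothesis from the host running the server is to resolve the SRV record that a `mongodb+srv://` URI points at, and compare the results before and after an Atlas upgrade. A minimal sketch using only the standard library; the cluster hostname is a placeholder:

```go
package main

import (
	"fmt"
	"net"
)

// srvName builds the DNS name queried for a mongodb+srv:// host.
func srvName(host string) string {
	return "_mongodb._tcp." + host
}

func main() {
	host := "cluster0.example.mongodb.net" // placeholder: your Atlas cluster host
	name := srvName(host)
	fmt.Println("querying", name)

	// Passing empty service/proto makes LookupSRV query the name directly.
	_, addrs, err := net.LookupSRV("", "", name)
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, a := range addrs {
		fmt.Printf("%s:%d\n", a.Target, a.Port)
	}
}
```

If the targets (or the IPs they resolve to) differ across an upgrade, that would support the stale-DNS theory.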

Has anyone gone through something similar? How do you handle reconnecting to a remote MongoDB deployment on Atlas?

We found this thread but didn’t find the solution to our problem.

Thanks in advance!

Hi @Jairo_Lozano,

We’re looking at a bug related to this right now, and are planning to get a fix out for the next patch release.


Thanks @Isabella_Siu! Please let me know when it’s released!

Hi again @Jairo_Lozano ! It’ll be released in v1.5.4, which is scheduled for July 6th.


cool! thanks @Isabella_Siu :smiley:

@Jairo_Lozano we just released Go driver v1.5.4, which includes a fix for SRV polling that should resolve the problem you encountered with having to restart your application after scaling an Atlas cluster.

Check out the v1.5.4 release on GitHub.


Thanks @Matt_Dale!! I’ll upgrade the driver and let you know if that solves the problem!
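For anyone landing here later: under Go modules, picking up the fixed version should be a one-line upgrade (assuming the standard module path for the driver):

```shell
go get go.mongodb.org/mongo-driver@v1.5.4
go mod tidy
```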