No way to avoid ReplicaSetNoPrimary errors

Following up from my previous post.

I’m still getting occasional ReplicaSetNoPrimary errors. They’re rare, but they do happen.

ERROR Unhandled Promise Rejection {"errorType":"Runtime.UnhandledPromiseRejection","errorMessage":"MongoServerSelectionError: Server selection timed out after 30000 ms","reason":{"errorType":"MongoServerSelectionError","errorMessage":"Server selection timed out after 30000 ms","reason":{"type":"ReplicaSetNoPrimary","servers":{},"stale":false,"compatible":true,"heartbeatFrequencyMS":10000,"localThresholdMS":15,".

This is despite upgrading to a dedicated M10 cluster. My application has barely any traffic, so I’m confused about why it sometimes can’t connect.

My setup is still the same:

  • My connection string is valid.
  • The network connection is stable.
  • retryWrites=true
  • w=majority
  • Allowed all IPs
  • All errors are caught.
  • I use the Mongo Node Driver version 5.7.0
  • My connection count has been hovering steadily around a relatively low number (see screenshot).

It’s frustrating not knowing why this is happening despite setting everything up correctly, and it’s connecting properly most of the time.

Random, unpredictable errors of unknown cause are unsettling so if someone has insight, please share.

Hey @Pyra_Metrik,

There could be various reasons behind this. A few possibilities:

  • Intermittent network outages that cause the driver to lose connectivity.
  • Re-election of the PRIMARY node in your cluster, which leads to lost connections as the topology changes.

Please refer to Test Primary Failover and Test Resilience to read more about it.

In case you need further assistance, please share the org name of your cluster, so we can look into it or you can reach out to Atlas in-app chat support.

The in-app chat support does not require any payment to use and can be found at the bottom right corner of the Atlas UI.

Best,
Kushagra

Hi @Kushagra_Kesav

Thanks for the reply. I conducted a Primary Failover Test in the Atlas UI, and my app worked fine during and after the test.

So, this leaves us with intermittent network failures.
Is there any way we can verify that the ReplicaSetNoPrimary errors are indeed from network failures, by checking logs somewhere (or something else)?

And how exactly do I get the org name of my cluster?

Hey @Pyra_Metrik,

  1. Just to clarify, have you reached out to the Atlas in-app chat support team about any notable cluster issues during the time these errors occurred?
  2. May I ask if you notice any patterns in the timing of these errors?
  3. Also, could you please provide specific information like the connection string (with sensitive credentials redacted) and details about the client-side environment (e.g., containerized, Lambda, etc.).

The above details will help us to assist you better.

Regards,
Kushagra

@Kushagra_Kesav

  1. Yes, I contacted support just now. Awaiting a response.
  2. The pattern is quite random, but I think it happens more often after longer periods without connecting to the app (i.e., without opening my app URL in the browser).
  • My connection string: mongodb+srv://${process.env.DB_USERNAME}:${process.env.DB_PASSWORD}@cluster0.nkmq1cz.mongodb.net/?retryWrites=true&w=majority;
  • The client-side environment is a Next.js app. The Mongo client connects from a serverless Next.js API function.
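
The thread does not show how the client is created inside the API function, so the following is only a sketch of a pattern commonly used in serverless functions, not a diagnosis: cache the connection promise on `globalThis` so that warm invocations reuse one client instead of opening a new one each time. The helper name `getCachedClient` and the cache key are hypothetical.

```javascript
// Cache a connection promise on globalThis so warm invocations reuse it.
// `connect` is any function that returns a promise of a connected client.
// The name getCachedClient and the default cache key are illustrative only.
function getCachedClient(connect, cacheKey = "_mongoClientPromise") {
  if (!globalThis[cacheKey]) {
    globalThis[cacheKey] = connect().catch((err) => {
      // Drop a failed attempt so the next invocation can retry
      delete globalThis[cacheKey];
      throw err;
    });
  }
  return globalThis[cacheKey];
}

// Hypothetical usage inside an API route, assuming the `mongodb` package
// and a connection string held in a variable named `uri`:
//   const { MongoClient } = require("mongodb");
//   const client = await getCachedClient(() => new MongoClient(uri).connect());
```

Passing the connect function in, rather than importing the driver here, keeps the sketch runnable without a cluster. A rejected attempt is removed from the cache so that one failed server selection does not get replayed to every later request.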

Hey @Pyra_Metrik,

Thanks for sharing the details! :star2:

Just out of curiosity, I’m wondering if you are using Vercel. Could you please confirm it?

Thanks,
Kushagra

@Kushagra_Kesav yes, I am

Hi @Pyra_Metrik,

If you’ve checked with the Atlas in-app chat support and they’ve advised no issues were identified on the Atlas cluster side at the time of the error messages, I would also recommend checking with Vercel support. There was another mention of this previously on this post as well.

Depending on your cluster tier, you might be able to check the mongod logs for the client metadata to determine whether the connection was ended from the application side.

You can perform the same troubleshooting step mentioned in my comment by connecting from a different client, perhaps outside of Vercel, to try to narrow down what the issue could be.
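
On the application side, one possible way to gather similar evidence is to log the driver's server discovery and monitoring (SDAM) events around the time of an error. This is a minimal sketch, assuming the Node driver's `serverHeartbeatFailed` and `topologyDescriptionChanged` events on `MongoClient`; the helper name `logSdamEvents` is hypothetical, and a plain `EventEmitter` stands in for the client so the example runs without a cluster.

```javascript
const { EventEmitter } = require("node:events");

// Attach listeners for two of the driver's SDAM monitoring events.
// `client` is expected to be a MongoClient, which emits these event names;
// any EventEmitter works, which keeps the sketch runnable on its own.
function logSdamEvents(client, log = console.log) {
  client.on("serverHeartbeatFailed", (event) => {
    log(`heartbeat failed: ${event.connectionId}: ${event.failure}`);
  });
  client.on("topologyDescriptionChanged", (event) => {
    log(`topology: ${event.previousDescription.type} -> ${event.newDescription.type}`);
  });
  return client;
}

// Stand-in emitter with a hand-built event, for illustration only
const fake = logSdamEvents(new EventEmitter());
fake.emit("topologyDescriptionChanged", {
  previousDescription: { type: "ReplicaSetWithPrimary" },
  newDescription: { type: "ReplicaSetNoPrimary" },
});
// prints "topology: ReplicaSetWithPrimary -> ReplicaSetNoPrimary"
```

If heartbeat failures show up shortly before a ReplicaSetNoPrimary error while the cluster itself reports no election, that would point more toward the network path or the client environment than toward the cluster.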

Regards,
Jason

Hi @Jason_Tran thanks for sharing the tips + the other post. I’m in contact with Vercel community/support as well to solve the problem.

I did check the logs for my cluster, however, and I see this:

Automation Agent v13.4.2.8420 (git: <id>)"}}}}
{"t":{"$date":"2023-09-28T16:25:39.414+00:00"},"s":"I",  "c":"ACCESS",   "id":20250,   "ctx":"conn115096","msg":"Authentication succeeded","attr":{"mechanism":"SCRAM-SHA-256","speculative":true,"principalName":"__system","authenticationDatabase":"local","remote":"192.168.254.146:43258","extraInfo":{}}}
{"t":{"$date":"2023-09-28T16:25:39.415+00:00"},"s":"I",  "c":"-",        "id":20883,   "ctx":"conn115094","msg":"Interrupted operation as its client disconnected","attr":{"opId":31692522}}
{"t":{"$date":"2023-09-28T16:25:39.415+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn115095","msg":"Connection ended","attr":{"remote":"192.168.254.146:43252","uuid":"ea3b6fab-f503-49f9-8af1-b71110d04158","connectionId":115095,"connectionCount":40}}

Basically, it looks like the authentication succeeded, then the client disconnected immediately after, and then a “Connection ended” message was logged.

I don’t think this is expected behavior, that is, for a client to disconnect immediately after authenticating. Please confirm, and in the meantime, I’m debugging the problem on Vercel’s end.
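
One way to check this reading of the log is to group the lines by their `ctx` field, which identifies the connection each message belongs to. This is a minimal sketch, assuming one JSON document per line as in the excerpt above; the function name `groupByConnection` is hypothetical, and the sample lines are abbreviated copies of the excerpt.

```javascript
// Group structured mongod log lines by their connection context ("ctx").
// Assumes one JSON document per line, as in the excerpt above.
function groupByConnection(logText) {
  const groups = new Map();
  for (const line of logText.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("{")) continue; // skip blank or partial lines
    let entry;
    try {
      entry = JSON.parse(trimmed);
    } catch {
      continue; // skip lines truncated by copy and paste
    }
    const messages = groups.get(entry.ctx) ?? [];
    messages.push(entry.msg);
    groups.set(entry.ctx, messages);
  }
  return groups;
}

// Abbreviated copies of the lines in the excerpt (most fields trimmed)
const sample = [
  'Automation Agent v13.4.2.8420 (git: <id>)"}}}}',
  '{"c":"ACCESS","id":20250,"ctx":"conn115096","msg":"Authentication succeeded"}',
  '{"c":"-","id":20883,"ctx":"conn115094","msg":"Interrupted operation as its client disconnected"}',
  '{"c":"NETWORK","id":22944,"ctx":"conn115095","msg":"Connection ended"}',
].join("\n");

for (const [ctx, messages] of groupByConnection(sample)) {
  console.log(ctx, messages.join("; "));
}
```

On the excerpt this yields three separate contexts (conn115096, conn115094, conn115095), so the three messages may belong to three different connections rather than to one client's lifecycle. The `__system` principal and the private `remote` address may also indicate internal cluster traffic rather than the application, though the excerpt alone does not confirm that.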