Using MongoDB Atlas on Lambda throws “Could not connect to any servers in your MongoDB Atlas cluster” error

Shawn_Varughese · April 19, 2022, 1:35am

I am using a Node.js Lambda as an API server that accepts requests from API Gateway. The Lambda connects to MongoDB Atlas to pull data from the database and then returns it in the response. For the most part it works fine with no issues, but at random it runs into MongoDB Atlas connection errors:

    {
    "errorType": "Runtime.UnhandledPromiseRejection",
    "errorMessage": "MongooseServerSelectionError: Could not connect to any servers in your MongoDB Atlas cluster. One common reason is that you're trying to access the database from an IP that isn't whitelisted. Make sure your current IP address is on your Atlas cluster's IP whitelist: https://docs.atlas.mongodb.com/security-whitelist/",
    "reason": {
        "errorType": "MongooseServerSelectionError",
        "errorMessage": "Could not connect to any servers in your MongoDB Atlas cluster. One common reason is that you're trying to access the database from an IP that isn't whitelisted. Make sure your current IP address is on your Atlas cluster's IP whitelist: https://docs.atlas.mongodb.com/security-whitelist/",
        "message": "Could not connect to any servers in your MongoDB Atlas cluster. One common reason is that you're trying to access the database from an IP that isn't whitelisted. Make sure your current IP address is on your Atlas cluster's IP whitelist: https://docs.atlas.mongodb.com/security-whitelist/",
        "reason": {
            "type": "ReplicaSetNoPrimary",
            "setName": null,
            "maxSetVersion": null,
            "maxElectionId": null,
            "servers": {},
            "stale": false,
            "compatible": true,
            "compatibilityError": null,
            "logicalSessionTimeoutMinutes": null,
            "heartbeatFrequencyMS": 10000,
            "localThresholdMS": 15,
            "commonWireVersion": null
        },
        "stack": [
            "MongooseServerSelectionError: Could not connect to any servers in your MongoDB Atlas cluster. One common reason is that you're trying to access the database from an IP that isn't whitelisted. Make sure your current IP address is on your Atlas cluster's IP whitelist: https://docs.atlas.mongodb.com/security-whitelist/",
            "    at NativeConnection.Connection.openUri (/var/task/node_modules/mongoose/lib/connection.js:846:32)",
            "    at /var/task/node_modules/mongoose/lib/index.js:351:10",
            "    at /var/task/node_modules/mongoose/lib/helpers/promiseOrCallback.js:31:5",
            "    at new Promise (<anonymous>)",
            "    at promiseOrCallback (/var/task/node_modules/mongoose/lib/helpers/promiseOrCallback.js:30:10)",
            "    at Mongoose._promiseOrCallback (/var/task/node_modules/mongoose/lib/index.js:1149:10)",
            "    at Mongoose.connect (/var/task/node_modules/mongoose/lib/index.js:350:20)",
            "    at Object.<anonymous> (/var/task/build/app.js:172:40)",
            "    at Module._compile (internal/modules/cjs/loader.js:999:30)",
            "    at Object.Module._extensions..js (internal/modules/cjs/loader.js:1027:10)",
            "    at Module.load (internal/modules/cjs/loader.js:863:32)",
            "    at Function.Module._load (internal/modules/cjs/loader.js:708:14)",
            "    at Module.require (internal/modules/cjs/loader.js:887:19)",
            "    at require (internal/modules/cjs/helpers.js:74:18)",
            "    at _tryRequire (/var/runtime/UserFunction.js:75:12)",
            "    at _loadUserApp (/var/runtime/UserFunction.js:95:12)"
        ]
    },
    "promise": {},
    "stack": [
        "Runtime.UnhandledPromiseRejection: MongooseServerSelectionError: Could not connect to any servers in your MongoDB Atlas cluster. One common reason is that you're trying to access the database from an IP that isn't whitelisted. Make sure your current IP address is on your Atlas cluster's IP whitelist: https://docs.atlas.mongodb.com/security-whitelist/",
        "    at process.<anonymous> (/var/runtime/index.js:35:15)",
        "    at process.emit (events.js:314:20)",
        "    at process.EventEmitter.emit (domain.js:483:12)",
        "    at processPromiseRejections (internal/process/promises.js:209:33)",
        "    at processTicksAndRejections (internal/process/task_queues.js:98:32)"
    ]
    }

My MongoDB Atlas instance allows IP addresses from everywhere: 0.0.0.0/0.

This is how I am making my mongo connection:

import express from "express";
import serverless from "serverless-http";
import mongoose, { Error } from "mongoose";
import { APIGatewayProxyHandler } from 'aws-lambda';
import { AppSettings } from "./constants/constants";

class App {

    public app: express.Application;
    public mongoUrl: string = AppSettings.MONGODBURL;

    constructor() {
        this.setupDBConn();
    }
    async setupDBConn() {
        this.app = express();
    }

    private async mongoSetup() {
        await mongoClient;
        console.log('MONGO connection successfully made...');
    }
}
const mongoClient = mongoose.connect(AppSettings.MONGODBURL, { useNewUrlParser: true, useUnifiedTopology: true, useCreateIndex: true, useFindAndModify: false });
export const handler: APIGatewayProxyHandler = serverless(new App().app);

I am also using Mongoose version 5.12.8.

MaBeuLux88_xxx · April 20, 2022, 12:47pm

Hi @Shawn_Varughese and welcome to the MongoDB Community!

Are you creating a new connection (i.e. a new mongoClient) for each Lambda, or are your MongoDB connections shared across multiple (all) Lambdas?

I’ll give you a little hint: “a new connection each time” is a very bad answer!

Check out this doc where they explain how to set up the connection correctly:

Using a private endpoint would also be a plus (instead of 0.0.0.0/0).

Cheers,
Maxime.

Shawn_Varughese · April 22, 2022, 10:58pm

Thanks, this is helpful. I did go through it, and we do have the connection outside the Lambda function, so the connection is shared. In the code, mongoClient is defined outside the App class, and the serverless function just waits for the mongo connection and reuses it. Unfortunately, we are still seeing this connection issue.

Any other ideas on why we might be seeing this?

MaBeuLux88_xxx · April 23, 2022, 2:35pm

I don’t understand why you would get this error if the connection was already established and functional and your cluster is healthy. This error message makes me think that the connection is being created at that moment.

If it’s really an occasional error, maybe it happens during maintenance on the cluster? Can you link it to an event in Atlas?
But in theory, it’s supposed to be transparent for the clients, unless something is misconfigured.

Shawn_Varughese · April 24, 2022, 6:47pm

Interesting, I am not sure that could be possible. I don’t see any open alerts in Atlas; is that what you were asking for? To give you more context, our platform is microservices based, so we have 8 microservices, which are all separate Lambda functions, and each one connects to Atlas to read from and write to the database.

You don’t see this as a problem, right?

MaBeuLux88_xxx · April 25, 2022, 12:49pm

Yes. Then it’s something else.

No, sounds like I would do exactly the same thing. Given that your connection pool is centralised and reused correctly by all the lambdas, maybe you could increase the size of the pool? If you have many lambdas running in parallel, I guess you need a connection pool large enough to accommodate all the queries.

Shawn_Varughese · April 26, 2022, 1:50am

So are you saying all 8 microservices should share the same connection pool? How would I be able to do that if they are all separate Lambda functions?

Or does each Lambda function keep its own connection pool and just not create a new connection on each invocation?

MaBeuLux88_xxx · April 26, 2022, 5:59pm

Ado proposed an implementation here based on a cache:

It’s fine to use 8 different pools; as long as it’s a small number, it’s OK. What’s not OK is one connection to the cluster per Lambda execution. That’s definitely a problem.
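The cache-based approach mentioned above can be sketched roughly as follows. This is a minimal illustration, not the implementation from the linked article: `createCachedConnection` is a hypothetical helper, and `connect` stands in for whatever call opens the connection (for example a function wrapping `mongoose.connect`). The idea is that the promise lives in module scope, so warm invocations of the same Lambda container reuse it, and the handler awaits it so that a failure surfaces as a handled error for that invocation.

```typescript
// Hypothetical helper: `connect` is injected so the sketch does not depend
// on a specific driver. In the thread's setup it could wrap mongoose.connect.
type Connector<T> = () => Promise<T>;

function createCachedConnection<T>(connect: Connector<T>): () => Promise<T> {
  // Kept outside the handler, so it survives across warm invocations of the
  // same execution environment.
  let cached: Promise<T> | null = null;

  return function getConnection(): Promise<T> {
    if (cached === null) {
      cached = connect().catch((err: unknown) => {
        // Forget a failed attempt so the next invocation tries again instead
        // of reusing a rejected promise.
        cached = null;
        throw err;
      });
    }
    return cached;
  };
}

// Assumed usage (names are illustrative):
//   const getConnection = createCachedConnection(() => mongoose.connect(uri));
//   export const handler = async (event: unknown) => {
//     await getConnection(); // awaited inside the handler, so errors are catchable
//     ...
//   };
```

Each Lambda function still gets its own pool with this pattern; what it avoids is opening a new connection on every invocation.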

Your cluster can only support a limited number of connections. For an M10 it’s 1500 per node for example.

https://www.mongodb.com/docs/atlas/reference/atlas-limits/

Keep an eye on the monitoring to see if you are getting close to that limit.
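To get a feel for how close a set of Lambdas could come to that limit, a back-of-the-envelope estimate can help. The sketch below assumes each warm execution environment holds its own pool and may open up to its configured pool size against a node; it ignores the driver's monitoring connections, so treat it as a rough bound. The profile fields, the 20-way concurrency, the pool size of 5, and the 80% headroom factor are all illustrative assumptions; only the 1500-per-node figure for an M10 comes from the post above.

```typescript
// Hypothetical description of one Lambda function's connection behaviour.
interface LambdaProfile {
  name: string;           // label only, used for readability
  maxConcurrency: number; // peak number of concurrent execution environments
  poolSize: number;       // driver pool size configured in that function
}

// Rough upper bound on connections the functions could open against one node.
function estimatePeakConnectionsPerNode(profiles: LambdaProfile[]): number {
  let total = 0;
  for (const profile of profiles) {
    total += profile.maxConcurrency * profile.poolSize;
  }
  return total;
}

// True when the estimate stays under the limit with some headroom left.
function fitsWithinLimit(
  profiles: LambdaProfile[],
  limitPerNode: number,
  headroom: number = 0.8,
): boolean {
  return estimatePeakConnectionsPerNode(profiles) <= limitPerNode * headroom;
}

// Example: 8 microservices, each assumed to peak at 20 concurrent
// environments with a pool of 5 connections.
const exampleProfiles: LambdaProfile[] = [];
for (let i = 1; i <= 8; i++) {
  exampleProfiles.push({ name: `service-${i}`, maxConcurrency: 20, poolSize: 5 });
}
const examplePeak = estimatePeakConnectionsPerNode(exampleProfiles); // 8 * 20 * 5
const exampleFits = fitsWithinLimit(exampleProfiles, 1500);
```

The Atlas monitoring charts remain the source of truth; an estimate like this only indicates whether the configuration could plausibly reach the limit.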

Cheers,
Maxime.

Shawn_Varughese · May 6, 2022, 1:42pm

Thanks for this. I was able to implement the caching and it seems to be working well so far! Now, after implementing caching, I am facing a new error:

ERROR	Unhandled Promise Rejection 	{
  "errorType": "Runtime.UnhandledPromiseRejection",
  "errorMessage": "MongoNetworkTimeoutError: connection timed out",
  "reason": {
    "errorType": "MongoNetworkTimeoutError",
    "errorMessage": "connection timed out",
    "name": "MongoNetworkTimeoutError",
    "stack": [
      "MongoNetworkTimeoutError: connection timed out",
      "    at connectionFailureError (/var/task/node_modules/mongodb/lib/core/connection/connect.js:362:14)",
      "    at TLSSocket.<anonymous> (/var/task/node_modules/mongodb/lib/core/connection/connect.js:330:16)",
      "    at Object.onceWrapper (events.js:420:28)",
      "    at TLSSocket.emit (events.js:314:20)",
      "    at TLSSocket.EventEmitter.emit (domain.js:483:12)",
      "    at TLSSocket.Socket._onTimeout (net.js:483:8)",
      "    at listOnTimeout (internal/timers.js:554:17)",
      "    at processTimers (internal/timers.js:497:7)"
    ]
  },
  "promise": {},
  "stack": [
    "Runtime.UnhandledPromiseRejection: MongoNetworkTimeoutError: connection timed out",
    "    at /var/runtime/index.js:35:15",
    "    at /opt/nodejs/node_modules/@lumigo/tracer/dist/tracer/tracer.js:265:37",
    "    at processTicksAndRejections (internal/process/task_queues.js:97:5)"
  ]
}

This doesn’t happen all the time, just randomly. I assume it’s because the cached DB connection has timed out. Any advice here?

Shawn_Varughese · May 11, 2022, 12:39pm

@MaBeuLux88_xxx any thoughts on this error I am getting now, after implementing the caching?

MaBeuLux88_xxx · May 18, 2022, 9:09pm

Hey @Shawn_Varughese,

So sorry for the terrible delay in answering. I had a baby on May 1st so I was a little distracted.

To be honest, I have no idea at all. If it’s happening “randomly”, is this happening maybe after your lambda wasn’t triggered for a long time?

What timeouts have you defined? Did you try increasing them a bit, to give the connection more opportunity to land?

Search for “timeout” in here and try to increase the relevant one maybe? I assume connectTimeoutMS here, no?
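As a rough illustration of adjusting timeouts, the sketch below appends options to a connection string. `withConnectionOptions` is a hypothetical helper, and the host name and the values are placeholders rather than recommendations; `connectTimeoutMS`, `socketTimeoutMS` and `serverSelectionTimeoutMS` are standard MongoDB connection string options, but which one matters for this error is not established in the thread.

```typescript
type ConnectionOptions = { [name: string]: string | number | boolean };

// Hypothetical helper: appends options to a MongoDB connection string.
// Assumes the URI already contains the "/" that precedes any "?" query part.
function withConnectionOptions(uri: string, options: ConnectionOptions): string {
  const pairs: string[] = [];
  for (const name of Object.keys(options)) {
    pairs.push(`${encodeURIComponent(name)}=${encodeURIComponent(String(options[name]))}`);
  }
  if (pairs.length === 0) {
    return uri;
  }
  // Use "&" when the URI already carries options, "?" otherwise.
  const separator = uri.indexOf("?") === -1 ? "?" : "&";
  return uri + separator + pairs.join("&");
}

// Placeholder URI and values, for illustration only.
const exampleUri = withConnectionOptions(
  "mongodb+srv://user:pass@cluster0.example.mongodb.net/mydb?retryWrites=true&w=majority",
  { connectTimeoutMS: 30000, serverSelectionTimeoutMS: 10000 },
);
```

With Mongoose, the same option names can usually be passed in the options object given to `mongoose.connect` instead of the URI; check the documentation for the driver version in use.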

Cheers,
Maxime.

Shawn_Varughese · May 27, 2022, 6:35pm

@MaBeuLux88_xxx ,

Not a problem at all, congratulations on your baby!!!

Yeah, I tried changing socketTimeout but that did not seem to help. Do you think the connectTimeout might be better? I am stumped on what’s causing this.

MaBeuLux88_xxx · May 27, 2022, 8:24pm

Thanks!

connectTimeout would only apply to the very first connection that is then cached for later use by the lambdas. So I think this isn’t the one you are looking for but… Give it a try?

Cheers,
Maxime.

Stewart_Snow · June 22, 2022, 11:07am

We’ve been suffering from the exact same problem - occasional timeouts on connection to Mongo from AWS Lambda. It feels like an event / issue occurring on the MongoDB / Atlas instance - but there is nothing in the Atlas event log to indicate that.

Our lambda timeout is being hit at 60 seconds - but basically Atlas is just not giving a connection - without error other than via connection timeout.

We work in C# and establish our MongoClient connections via DI - i.e. outside of each function call - just on initialisation.

MaBeuLux88_xxx · June 22, 2022, 12:17pm

Hi @Stewart_Snow and welcome to the MongoDB Community!

Did you try adding some options to the C# driver connection, like increasing the timeouts? Maybe waitQueueTimeoutMS can help?

Cheers,
Maxime.

Stewart_Snow · June 22, 2022, 2:31pm

Hey - we can try that - but it feels wrong. In this case the database that we’re connecting to (running in Atlas) is absolutely tiny - barely 50 KB of data. At most there are 10 connections open to it. It’s just so small / lightweight that it really shouldn’t be behaving as it is - it makes no sense.

I can understand needing big timeouts in some sort of heavy-load scenario - but we’re dealing with such a low-load scenario that it’s strange. Timing out at 60 seconds or so just to get the connection to the DB seems very strange indeed.

We’re having a go at trying a slightly different pattern for the initialization of our mongo connection from our Lambda function to see if that makes any difference.

MaBeuLux88_xxx · June 22, 2022, 4:17pm

I’m wondering if there is a limit to keep the connection alive if it’s not used for a long time. Please let me know if you find a solution because I don’t have a test environment up to test it at the moment.
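If idle connections do turn out to be the cause (the thread does not confirm this), one way to test the hypothesis is to discard a cached connection that has not been used for a while and open a fresh one. The sketch below is a hypothetical helper, not driver behaviour: `connect`, the `Closable` shape, and the `maxIdleMs` threshold are all assumptions, and the clock is injected only to make the logic easy to exercise.

```typescript
// Minimal shape assumed for a connection: something that can be closed.
interface Closable {
  close(): Promise<void>;
}

function createIdleAwareCache<T extends Closable>(
  connect: () => Promise<T>,
  maxIdleMs: number,
  now: () => number = Date.now,
): () => Promise<T> {
  let cached: Promise<T> | null = null;
  let lastUsed = 0;

  return function getConnection(): Promise<T> {
    const current = now();
    if (cached !== null && current - lastUsed > maxIdleMs) {
      // The cached connection has been idle too long: drop it and close it
      // on a best-effort basis, ignoring any error from the close itself.
      const stale = cached;
      cached = null;
      stale.then((conn) => conn.close()).catch(() => undefined);
    }
    if (cached === null) {
      cached = connect();
    }
    lastUsed = current;
    return cached;
  };
}
```

Drivers also expose their own idle settings (for example the `maxIdleTimeMS` connection string option), which may be a simpler lever than managing staleness by hand.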

Also, you read that one, right?

Stewart_Snow · June 22, 2022, 4:32pm

Will do - and yes aware of that doc - cheers!

Stewart_Snow · June 27, 2022, 2:47pm

Okay - just following up here. We tweaked our initialisation code so that we’re 100% in line with the mongo / lambda recommendations - but it’s still producing the same issue on occasion.

Each time it occurs we get two errors close together:

Error 1

    An unhandled exception has occurred while executing the request.
    System.TimeoutException: A timeout occurred after 30000ms selecting a server using CompositeServerSelector{ Selectors = ReadPreferenceServerSelector{ ReadPreference = { Mode : Primary } }, LatencyLimitingServerSelector{ AllowedLatencyRange = 00:00:00.0150000 }, OperationsCountServerSelector }. Client view of cluster state is { ClusterId : "1", ConnectionMode : "ReplicaSet", Type : "ReplicaSet", State : "Connected", Servers : [{ ServerId: "{ ClusterId : 1, EndPoint : "Unspecified/XXXXXX" }", EndPoint: "Unspecified/XXXXXXX", ReasonChanged: "Heartbeat", State: "Connected", ServerVersion: 5.0.9, TopologyVersion: { "processId" : ObjectId("62aa3b6c7c78a0c51037b843"), "counter" : NumberLong(4) }, Type: "ReplicaSetSecondary", Tags: "{ region : US_EAST_1, provider : AWS, nodeType : ELECTABLE, workloadType : OPERATIONAL }", WireVersionRange: "[0, 13]", LastHeartbeatTimestamp: "2022-06-27T10:25:32.0295404Z", LastUpdateTimestamp: "2022-06-27T10:25:32.0295416Z" }, { ServerId: "{ ClusterId : 1, EndPoint : "Unspecified/XXXXX" }", EndPoint: "Unspecified/XXXXXX", ReasonChanged: "Heartbeat", State: "Disconnected", ServerVersion: , TopologyVersion: , Type: "Unknown", HeartbeatException: "MongoDB.Driver.MongoConnectionException: An exception occurred while opening a connection to the server.
     ---> System.TimeoutException: Timed out connecting to X.X.X.X. Timeout was 00:00:30.
       at MongoDB.Driver.Core.Connections.TcpStreamFactory.ConnectAsync(Socket socket, EndPoint endPoint, CancellationToken cancellationToken)
       at MongoDB.Driver…

Error 2

    [Error] An unhandled exception has occurred while executing the request.
    MongoDB.Driver.MongoConnectionException: An exception occurred while opening a connection to the server.
     ---> System.IO.IOException: Unable to read data from the transport connection: Connection reset by peer.
     ---> System.Net.Sockets.SocketException (104): Connection reset by peer
       --- End of inner exception stack trace ---
       at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
       at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token)
       at System.Net.FixedSizeReader.ReadPacketAsync(Stream transport, AsyncProtocolRequest request)
       at System.Net.Security.SslStream.InternalEndProcessAuthentication(LazyAsyncResult lazyResult)
       at System.Net.Security.SslStream.EndProcessAuthentication(IAsyncResult result)
       at System.Net.Security.SslStream.EndAuthenticateAsClient(IAsyncResult asyncResult)
       at System.Net.Security.SslStream.<>c.b__64_2(IAsyncResult iar)
       at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
    --- End of stack trace from previous location where exception was thrown

We’re using a mongodb+srv style connection string plus the following options:
retryWrites=true&w=majority

…? We’re at a loss - makes no sense. Any suggestions??

MaBeuLux88_xxx · June 27, 2022, 3:35pm

Could it be a “genuine” error like there was actually a connection problem between AWS and Atlas?

Is this error maybe connected to maintenance in Atlas (scheduled maintenance, minor version upgrade, auto-scaling, etc.)?

How do you connect Lambdas to Atlas? Private Link? Peering? (I’m not sure which solutions are available, to be honest.)