Pymongo AWS Lambda ServerSelectionTimeoutError

Riku · May 12, 2023, 6:28am

Hello,

I have created an AWS Lambda function using Python 3.9 and pymongo 4.3.3. It should count some documents in two collections and send the counts to Slack. I’m facing an issue where, at times, the first count_documents call causes the function to hang for 30 seconds and time out with:

[ERROR] ServerSelectionTimeoutError: cluster0-shard-00-00-....mongodb.net:27017: timed out,cluster0-shard-00-01-....mongodb.net:27017: timed out,cluster0-shard-00-02-....mongodb.net:27017: timed out, Timeout: 30s, Topology Description: <TopologyDescription id: 645dcca1e91881ffaf45360b, topology_type: ReplicaSetNoPrimary, servers: [<ServerDescription ('cluster0-shard-00-00-....mongodb.net', 27017) server_type: Unknown, rtt: None, error=NetworkTimeout('cluster0-shard-00-00-....mongodb.net:27017: timed out')>, <ServerDescription ('cluster0-shard-00-01-....mongodb.net', 27017) server_type: Unknown, rtt: None, error=NetworkTimeout('cluster0-shard-00-01-....mongodb.net:27017: timed out')>, <ServerDescription ('cluster0-shard-00-02-....mongodb.net', 27017) server_type: Unknown, rtt: None, error=NetworkTimeout('cluster0-shard-00-02-....mongodb.net:27017: timed out')>]>

I’m setting

client = MongoClient(uri, readPreference='secondaryPreferred')

outside the handler function. This uri is of the format

mongodb+srv://<user>:<pass>@cluster0-....mongodb.net/db

(the db is probably unnecessary, but it is what I use in another app and get it through param store)

And running client.db.collection.count_documents() with:

    "createdAt": {
      "$gte": datetime(year=yesterday.year, month=yesterday.month, day=yesterday.day),
      "$lt": datetime(year=today.year, month=today.month, day=today.day)
    },

to get all documents for yesterday.
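
The same range can also be built with timedelta instead of spelling out year, month and day. This is only a minimal sketch: it assumes createdAt is stored in UTC and that UTC day boundaries are what is wanted, and the helper name yesterday_filter is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def yesterday_filter(now=None):
    """Build a createdAt filter covering the previous UTC calendar day."""
    # Assumption: createdAt values are stored in UTC.
    now = now or datetime.now(timezone.utc)
    # Midnight at the start of the current UTC day (exclusive upper bound).
    today_start = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # Midnight at the start of the previous UTC day (inclusive lower bound).
    yesterday_start = today_start - timedelta(days=1)
    return {"createdAt": {"$gte": yesterday_start, "$lt": today_start}}
```

The returned dict can be passed straight to count_documents, for example client.db.collection.count_documents(yesterday_filter()).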

The Lambda function gets as far as the first count_documents call, hangs for the default server selection timeout of 30 seconds, and then exits with the error above. At other times, the function works correctly.

I’ve set the function to be invoked on a cron schedule, and AWS Lambda retries twice by default, but even that isn’t always enough. It failed last night, and this morning, when I invoked it manually, it worked on the 19th try. Sometimes it works on the first try without a problem. This makes me believe it is not a networking issue. The Lambda resides in a VPC and the NAT GW IP is allowed in MongoDB Atlas.

Shane · May 12, 2023, 7:14pm

error=NetworkTimeout('cluster0-shard-00-02-....mongodb.net:27017: timed out')

Hi, this error means the client was unable to connect to the remote server(s). Is it possible that some of the Lambda machines are assigned IPs that are not allowed to connect to your Atlas cluster? Could you verify this hypothesis by temporarily allowing all IPs in your cluster to see if these connection errors go away?
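
If opening the allow list is not an option, another way to test reachability is a plain TCP connect from inside the Lambda to each host named in the error. The following is a rough sketch using only the standard library; the helper names can_connect and check_hosts are hypothetical, and the host list has to be filled in with the actual shard host names.

```python
import socket


def can_connect(host, port=27017, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS resolution failures.
        return False


def check_hosts(hosts, port=27017, timeout=5.0):
    """Map each host name to whether it is reachable over TCP."""
    return {host: can_connect(host, port, timeout) for host in hosts}
```

Logging the result of check_hosts at the start of the handler shows whether a failing invocation can reach the servers at all. A successful TCP connect does not prove that the Atlas allow list accepts the source IP, but a timeout here points to routing or firewalling rather than to the query.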

Riku · May 15, 2023, 5:27am

Hey,

Thank you for replying. I don’t think I can open the database to the whole world, but it does sound like a reasonable cause. I have tried explicitly assigning each existing security group to the Lambda one by one, but that didn’t seem to have an effect. The Lambda should also have internet access only through the NAT gateway, which has two associated IP addresses. These are allowed in Atlas.

Edit: I realized we have VPC Peering in place, so the NAT GW shouldn’t be in use anyway. I will check the VPC Flow Logs next.

Riku · May 15, 2023, 8:51am

Every time the error happens, the call to pymongo.MongoClient succeeds, as does “initializing” the collections with collection = client[db_name].collection. The timeout happens when trying to actually count the documents. I have a feeling the error message might be a bit misleading here. I’m trying to count over 13 million documents with a filter for just the previous day and getting the error. I’ll see if adding an index helps…

Riku · May 15, 2023, 9:27am

From the VPC Flow Logs I can see that PyMongo sends a variety of requests and that the source ports vary. The source IP is one that is accepted on the Atlas side, both for requests that succeed and for requests that fail.

Riku · May 15, 2023, 12:59pm

Not surprisingly (in hindsight), I was incorrect. I had placed my Lambda in multiple subnets and didn’t realize that some of them had no route to the VPC peering connection in their route table (a peering connection I had, in addition, initially forgotten about). Specifying the correct subnets fixed my problem. It was indeed a connection issue. Thank you for your help, Shane!
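
For anyone who lands here with the same symptom, the check I ended up doing by hand can be sketched roughly as below. It assumes the route tables are given as dicts shaped like the "RouteTables" entries of the EC2 DescribeRouteTables response and that they all belong to one VPC. The function name is hypothetical, and the sketch only looks for any peering route; it does not verify the destination CIDR or the route state.

```python
def subnets_missing_peering_route(route_tables, subnet_ids):
    """Return the subnet IDs whose effective route table has no peering route."""

    def has_peering(table):
        # A peering route carries a VpcPeeringConnectionId key.
        return any("VpcPeeringConnectionId" in r for r in table.get("Routes", []))

    explicit = {}      # subnet ID -> explicitly associated route table
    main_table = None  # fallback for subnets without an explicit association
    for table in route_tables:
        for assoc in table.get("Associations", []):
            if assoc.get("Main"):
                main_table = table
            if "SubnetId" in assoc:
                explicit[assoc["SubnetId"]] = table

    missing = []
    for subnet_id in subnet_ids:
        table = explicit.get(subnet_id, main_table)
        if table is None or not has_peering(table):
            missing.append(subnet_id)
    return missing
```

Any subnet ID this returns is a candidate for the intermittent failures: invocations placed in those subnets cannot reach the cluster over the peering connection, while invocations in the other subnets work.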
