Random "MongoWaitQueueFullException" errors appear with no sign of actual connection issues

We have a severe issue at the moment. We run an M10 cluster on Atlas, which should support up to 1500 connections. Our application runs on Azure AKS with two replicas. A while ago we were informed that some of our projects cannot be opened in our app because the request returns a 500 saying the wait queue is full.

HTTP “GET” “/projects/652e59ba1b2f2672b2f75d71” responded 500 in 6080.3021 ms
Request: Unhandled Exception for Request “GetProjectByIdQuery” GetProjectByIdQuery { Id: 652e59ba1b2f2672b2f75d71 }
The wait queue for acquiring a connection to server our-shard.mongodb.net:27017 is full.
MongoDB.Driver.MongoWaitQueueFullException: The wait queue for acquiring a connection to server our-shard.mongodb.net:27017 is full.
at void MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool+AcquireConnectionHelper.AcquireWaitQueueSlot()
at void MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool+AcquireConnectionHelper.StartCheckingOut()
at async Task MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool+AcquireConnectionHelper.AcquireConnectionAsync(CancellationToken cancellationToken)
at async Task MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool.AcquireConnectionAsync(CancellationToken cancellationToken)
at async Task MongoDB.Driver.Core.Servers.Server.GetChannelAsync(CancellationToken cancellationToken)
at async Task MongoDB.Driver.Core.Operations.RetryableReadContext.InitializeAsync(CancellationToken cancellationToken)
at async Task MongoDB.Driver.Core.Operations.RetryableReadContext.CreateAsync(IReadBinding binding, bool retryRequested, CancellationToken cancellationToken)
at async Task<IAsyncCursor> MongoDB.Driver.Core.Operations.FindOperation.ExecuteAsync(IReadBinding binding, CancellationToken cancellationToken)
at async Task MongoDB.Driver.OperationExecutor.ExecuteReadOperationAsync(IReadBinding binding, IReadOperation operation, CancellationToken cancellationToken)
at async Task MongoDB.Driver.MongoCollectionImpl.ExecuteReadOperationAsync(IClientSessionHandle session, IReadOperation operation, ReadPreference readPreference, CancellationToken cancellationToken)
at async Task MongoDB.Driver.MongoCollectionImpl.UsingImplicitSessionAsync(Func<IClientSessionHandle, Task> funcAsync, CancellationToken cancellationToken)

At first I thought: sure, I know that some parts were implemented badly due to lack of time, so there are a lot of unnecessary operations going on, but not that many.

So the weird part is this: a lot of projects work fine and load fast (1-3 seconds), but very specific projects don’t work. They almost always fail with this exception, even after restarting the pods (which should eliminate all active connections). Reloading tens of times sometimes results in it randomly working, but most of the time it fails. The issue also appears for different endpoints, but mostly when loading some specific projects.

Well. So I adjusted the connection settings:

  maxConnectionPoolSize: "400"
  minConnectionPoolSize: "10"
  maxConnecting: "4"
  maxConnectionIdleTime: "300000"
  maxConnectionLifeTime: "1800000"

Now the connection pool and wait queue should be large enough for these few operations. And yet it still happens, with no change in behavior.

I implemented metrics based on the events the driver offers. When I try this locally with the very same database (same data, but a local server rather than the production one), I end up with a total of maybe 8 connections in the worst case so far. The whole app never goes beyond that value no matter what I do, and most of the time I have even fewer connections.

Also, the cluster metrics tell me that the primary shard has somewhere between 80 and 150 active connections. I assume the large difference to my local numbers comes from the real network latency, which causes more connections to be open at once on production (plus more people using it simultaneously). But 150 connections, with two replicas that should each support up to 400 connections in their pools plus a wait queue of 400*5 … can you tell me how this adds up?
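
To make the numbers I am working from explicit, here is a small Python sketch of the arithmetic (plain arithmetic, not driver code). The pool size, replica count and observed peak come from this post; the 5x wait queue multiplier is my assumption about how the driver sizes the queue.

```python
# Figures from the post. The x5 wait queue multiplier is an assumption
# about how the driver sizes the queue, not a confirmed value.
MAX_POOL_SIZE = 400
WAIT_QUEUE_MULTIPLE = 5
REPLICAS = 2
OBSERVED_PEAK_CONNECTIONS = 150


def capacity_per_pod(max_pool_size: int, wait_queue_multiple: int) -> int:
    """Connections in use plus operations allowed to wait, for one pod."""
    wait_queue_size = max_pool_size * wait_queue_multiple
    return max_pool_size + wait_queue_size


per_pod = capacity_per_pod(MAX_POOL_SIZE, WAIT_QUEUE_MULTIPLE)
total = per_pod * REPLICAS

print(f"per pod: {per_pod}, both pods: {total}")
print(f"observed peak connections on the primary: {OBSERVED_PEAK_CONNECTIONS}")
```

If these assumptions hold, each pod should tolerate 2400 concurrent operations against one server before the queue is full, which is far above the 150 connections the cluster reports.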

Why do I get that exception? The error occurs immediately. The wait timeout is 30 seconds, so how can the queue be full and the exception be thrown in 2-3 seconds then?
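
One way I can picture an immediate failure is a bounded queue that rejects new waiters as soon as it is full, so the 30 second timeout only applies to operations that already hold a slot in the queue. The stack trace ends in AcquireWaitQueueSlot, which would fit that picture. Below is a toy Python model of that idea; it is an assumption about the behavior, not the driver's implementation, and `simulate_burst` is a hypothetical helper.

```python
def simulate_burst(operations: int, max_pool_size: int, wait_queue_size: int):
    """Toy model: split a burst of concurrent operations into those that
    get a connection, those that wait, and those rejected immediately."""
    in_use = min(operations, max_pool_size)
    waiting = min(operations - in_use, wait_queue_size)
    rejected = operations - in_use - waiting
    return in_use, waiting, rejected


# A handful of operations, like my local test: nothing waits, nothing fails.
print(simulate_burst(8, max_pool_size=400, wait_queue_size=2000))

# A hypothetical burst larger than pool + queue: the overflow is rejected
# right away instead of waiting for the timeout.
print(simulate_burst(3000, max_pool_size=400, wait_queue_size=2000))
```

In this model a request only fails instantly when the number of concurrent operations exceeds pool size plus queue size, which is exactly what I cannot reconcile with the connection counts I am seeing.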

I assume this might be a severe issue in the driver, possibly a regression from one of the recent updates, since as far as I can tell the issue first occurred only about 2 weeks ago.