Failover test on Atlas with Scala driver leads to an exception and interrupted processing

Hi!

I set up a test with the 4.2.3 Scala driver and Atlas. The test is a straightforward one: I increment a field recursively and then start a failover test on Atlas, with the recursive incrementation continuing throughout the test. In my test case, I set the recursion to go through the incrementation 200 000 times, as my experimentation showed that to be long enough to last through the entire failover test.

My expectation was that while a primary election was underway, the driver would pause operations and resume once a primary was available again. However, this didn’t happen: the driver threw an exception and interrupted processing. The exception was a MongoWriteConcernException. The connection had retryWrites=true&w=majority set.

There’s not a whole lot in the documentation about how to handle Failover. I was expecting the driver to handle the election without needing exception handling in our own code. Is this assumption correct? If it is, there seems to be an issue with the driver.

Below is the relevant code from the test.

val testCollection = mongoFactory.database.getCollection("test")

// Chains the requested number of sequential $inc updates; each recursive call runs only after the previous write completes
private def mongoTest(userId: String, curr: Int, times: Int) = {
  if (curr >= times) Future(Ok("Complete"))
  else incrementMongoTestField(userId) flatMap { _ =>
    mongoTest(userId, curr + 1, times)
  }
}

// Increments mongoTestField by 1 on the document matching userId
def incrementMongoTestField(userId: String): Future[UpdateResult] = {
  val query = Document("userId" -> userId)
  val update = Document("$inc" -> Document("mongoTestField" -> 1))
  testCollection.updateOne(query, update).toFuture()
}

I’m just going to answer myself here since I have new information regarding the subject.

Write concern exceptions are not among the errors for which writes are retried, so retryable writes alone never seem to be enough to get through the failover test. I’ve run the tests several times and the Atlas failover test always generates a write concern exception. However, when I restart the primary in another fashion, e.g. by changing the instance size, retryWrites handles the situation smoothly.
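
To make that concrete, the kind of wrapper I have in mind looks roughly like this (only a sketch, not code we run; the handling branch is a placeholder, and blindly retrying the $inc there could double-count, since the write may already have been applied even though the majority acknowledgement failed):

import com.mongodb.MongoWriteConcernException
import org.mongodb.scala.result.UpdateResult
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

def incrementWithFailoverHandling(userId: String): Future[UpdateResult] =
  incrementMongoTestField(userId).recoverWith {
    case e: MongoWriteConcernException =>
      // Inspect the write concern error; what to actually do here is the open question
      println(s"Write concern error ${e.getWriteConcernError.getCode} (${e.getWriteConcernError.getCodeName})")
      Future.failed(e) // placeholder: re-check the document, retry, or surface the error
  }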

I am somewhat surprised that handling the Atlas failover test gracefully seems to need logic in our own application code. If anyone has any links to documentation on possible options to handle the situation, I would be grateful.

Valtteri

Hi @Valtteri_Pirttila and welcome to the MongoDB Community :muscle: !

I can’t read Scala properly, so I wrote my own version in Python:

from pymongo import MongoClient

if __name__ == '__main__':
    client = MongoClient('mongodb+srv://USER:PASSWORD@bigcluster.abcde.mongodb.net/test?retryWrites=true&w=majority')
    db = client.get_database('test')
    coll = db.get_collection('coll')

    coll.drop()

    # Sequentially upsert/$inc the counter so writes keep running across the failover window
    for i in range(1, 1000000):
        print("Iteration " + str(i))
        coll.update_one({'_id': 1}, {'$inc': {'v': 1}}, upsert=True)
    print("Done!")

I ran this against an M10 cluster based in Ireland, with the script running on my PC in Paris.
I ran a failover test and didn’t get any error or exception. I was watching the logs closely and didn’t even notice a slower write operation, so I don’t even know when the primary switched.

Every operation you do requires some protective code. An operation can still time out or fail for one reason or another; only transient network issues will trigger a retry, and even that isn’t true for ALL operations. Here is the list of supported operations.

I hope this helps,

Cheers,
Maxime.

Hi!

Thank you very much for your response @MaBeuLux88 !

Your Python example looks functionally identical to what I made; kudos for getting the intent from the Scala code :grinning_face_with_smiling_eyes:

The tests I had run (four tests on two different days) were always within the same AWS region, Oregon in our case. Today I tested a setup similar to yours, moving my M10 Atlas cluster to the US east coast, with the application remaining on the west coast. No write concern errors occurred and the failover test was successful with the same code.

The number of increments performed during this timeframe was around 120 000 when staying within the Oregon region, but only about 6 000 when going from the west coast to the east coast. This leads me to believe that the same-region test hits the write concern error because of latency: a remote connection spends more time “on the way” than doing operations on the MongoDB side, and the problematic period is probably really short.

Does this sound sensible to you?

My main concern is ensuring that our application can handle Atlas maintenance events without running into non-retryable errors. We can manage by using only supported operations, but a write concern error seems difficult to cope with.

Valtteri

Hi,

Maybe you didn’t set the timeout to a big enough value? The latency shouldn’t make a difference, I think. If I’m not mistaken, when the primary can’t be found but you are trying to write with retryable writes ON, you shouldn’t get an error until you reach the timeout for this query, because the “retry” operation will only happen once and waits for a primary to be present before retrying. If you get another error at that point, you get the exception; there is no second retry. But the latency shouldn’t make a difference: as long as the total time is lower than your timeout, you should be good, in theory.

If I remember correctly, I was barely able to do 1500 updates during the 6 min failover test I did. I think your code is using promises so it’s probably async, no? Maybe that makes a difference, as you may be queuing “too many” queries during the failover, which could then time out? I’m not sure. :confused:

In any case, you should always protect your code with a try/catch mechanism. Maybe someday the network will fail completely and you won’t have a primary for 5 minutes. If a command is retried successfully, you won’t get an exception; it’s transparent. But if it fails twice or times out, you need to be able to handle that case too.

Cheers,
Maxime.

Hi and thanks again for your reply! I really appreciate it.

The code I’m using is not doing parallel writes; there’s just very little latency within the same region, probably less than 1 ms. West coast to east coast / Paris to Ireland is tens of ms, so most of the time is spent in transit. The operation itself is again far less than 1 ms, even on an M10.

The timeout is at the default value of 30 seconds, so that’s not an issue. Looking at the retryable writes spec, a write concern error is not listed, so I don’t think a retry is being triggered at all: Retryable errors
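
For what it’s worth, I’m assuming the timeout in question is serverSelectionTimeoutMS, which defaults to 30 000 ms. If raising it were the fix, it would just be a connection string option (placeholder host below):

import org.mongodb.scala.MongoClient

// Same options as before, with a longer server selection timeout (hypothetical host)
val client = MongoClient(
  "mongodb+srv://USER:PASSWORD@cluster.example.mongodb.net/test" +
    "?retryWrites=true&w=majority&serverSelectionTimeoutMS=60000")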

Here’s my current theory of what’s happening:

  1. $inc operation arrives at the primary
  2. Failover test cuts connection between primary and secondaries
  3. Operation results in a write concern error as it can’t be propagated to secondary nodes

If this is what is happening, then the big question is whether the same might happen during Atlas updates. When the primary is stepped down manually, I don’t think there’s ever a period where the nodes are cut off from each other, so there are no write concern failures. The stepDown errors are all covered by retryWrites.

I see two options here:

  1. The Atlas failover test is more rigorous than what is required to survive Atlas maintenance with no downtime (it has the added complexity of cutting the connection between the primary and the secondaries)
  2. We need more than driver-level retryable operations to survive Atlas maintenance without downtime

You are correct that a try/catch mechanism is needed for more serious outages. Just for the sake of the regularly occurring maintenance events, can you confirm which of the two scenarios is correct? I’m hoping for 1, but fearing 2 :sweat_smile:

Respectfully,

Valtteri

Hi @Valtteri_Pirttila,

Does your test repeatably cause the WriteConcernError? If so, would it be possible to share the full code, so I can try to reproduce the issue and debug it locally?

The test failover process is:

  1. Atlas shuts down the current primary.
  2. The members of the replica set hold an election to choose which of the secondaries will become the new primary.
  3. Atlas brings the original primary back to the replica set as a secondary. When the old primary rejoins the replica set, it will sync with the new primary to catch up any writes that occurred during its downtime.

So given a 3-node replicaSet and that scenario, there should always be a “majority” of nodes available, unless it coincides with some other issue where the new Primary cannot see the other healthy Secondary either (for example a networking blip). In that scenario the majority write concern cannot be fulfilled by the Primary.

I’m guessing that is what you have seen, and in that scenario your application code would have to handle the Write Concern Exception to understand if the write was eventually propagated to all nodes.

Ross.

Also, just to let you know, I’ve added a ticket for the drivers team as a whole to determine whether anything can be done for users in this scenario. It may take some time to review and discuss that ticket, but it could lead to a future improvement to the drivers.

Hi @Ross_Lawley!

While the documentation describes the failover process the way you indicated, in practice I see something very different, even in the Atlas UI. Atlas failover tests always begin with the secondaries restarting, and that isn’t even mentioned in the documentation. Like you said, if the failover took place like an ordinary stepDown process, there would be no room for a write concern error.

And yes, at least for me the test has been repeatable. From the discussion with Maxime, we learned that the regional setup matters. Our ordinary environment is in the same region and additionally peered with our VPC, so I would expect the connection to be quite fast. I used an M10 cluster for the tests and a t3.medium as the application server. With higher latency it’s apparently easy to miss the problematic period.

Regarding the code, I’m afraid I don’t have a full stand-alone reproduction. But an empty database with a single collection containing one document whose “userId” field is set to the value passed as the argument above is enough. Then calling mongoTest(userId, 0, 120000) will, for example, process 120 000 $inc operations in sequence.
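
In other words, seeding is just something like this (a sketch; the userId value here is arbitrary):

import org.mongodb.scala._
import scala.concurrent.ExecutionContext.Implicits.global

val userId = "failover-test-user" // any value, as long as it matches what is passed to mongoTest
// mongoTestField does not need to exist up front; the first $inc creates it
testCollection.insertOne(Document("userId" -> userId)).toFuture()
  .flatMap(_ => mongoTest(userId, 0, 120000))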

I hope that helps, thank you for your assistance!

Valtteri

One further piece of information I have.

I repeated the tests without using the Atlas failover feature, instead changing the instance size from M10 to M20 and back. This causes a primary restart every time. There wasn’t a single write concern error. After that, I tried the failover feature again and it resulted in the write concern error, as always.

I’m getting really suspicious of what the Atlas failover test is actually doing.

Valtteri

Hi @Valtteri_Pirttila,

Thanks for the update and your patience helping debug this. There are two open questions that need answering:

  1. Is the Atlas failover doing the expected thing as per the test failover documentation?
  2. Is the driver working to spec with regard to retryable writes?

I’m following up with the Atlas team but any server logs from the failover test would be helpful.

Regarding the driver, I realised I’ve neglected to ask for the full Write Concern Error message and code. That may also shed some light on the situation. For example, the following responses:

{ok:1, writeConcernError: {code: 91, codeName: "ShutdownInProgress"}}
{ok:1, writeConcernError: {code: 189, codeName: "PrimarySteppedDown"}}
{ok:1, writeConcernError: {code: 11600, codeName: "InterruptedAtShutdown"}}
{ok:1, writeConcernError: {code: 11601, codeName: "Interrupted"}}
{ok:1, writeConcernError: {code: 11602, codeName: "InterruptedDueToReplStateChange"}}

All of these should be handled by the driver per the retryable writes spec. If they aren’t, it could indicate either a bug in the driver, or that the retry mechanism failed and the original error was thrown.

I’m also reopening JAVA-4204 to continue the investigation as logs can be added to the ticket if needs be.

Ross

Hi @Ross_Lawley ,

I repeated the test and downloaded the logs from Atlas. Interestingly, there seem to be no errors pertaining to the actual update operation. But there are InterruptedAtShutdown errors when the primary is going down. If I understand correctly, these appear related to the oplog:

"error":{"code":11600,"codeName":"InterruptedAtShutdown","errmsg":"interrupted at shutdown"},"stats":{"stage":"COLLSCAN","nReturned":35391

The incrementation stopped at number 35321 and there are no other documents in the database.

On the code side, the stack trace doesn’t contain the error code, so I guess I’ll have to add a try/catch to get it. I’ll do that tomorrow. If there’s other information I can provide, or if the Atlas logs would be useful to you, do let me know.

Thank you for looking into this,

Valtteri

Hi @Valtteri_Pirttila,

I’ve spent most of the day trying to reproduce the issue and believe this has unearthed a bug in the Java driver’s handling of retryable writes. I will continue testing tomorrow and hope to confirm then.

I ended up using custom CommandListeners and ClusterListeners to determine the state of the replicaSet from the driver’s perspective, and I observed Atlas behaving as expected during the failover.
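
For reference, the wiring is roughly like this (a simplified Scala sketch rather than my exact harness; the connection string host is a placeholder):

import com.mongodb.{ConnectionString, MongoClientSettings}
import com.mongodb.connection.ClusterSettings
import com.mongodb.event._
import org.mongodb.scala.MongoClient

// Logs every topology change the driver observes (elections show up here)
val clusterLogger = new ClusterListener {
  override def clusterOpening(event: ClusterOpeningEvent): Unit = ()
  override def clusterClosed(event: ClusterClosedEvent): Unit = ()
  override def clusterDescriptionChanged(event: ClusterDescriptionChangedEvent): Unit =
    println(s"Topology: ${event.getNewDescription.getShortDescription}")
}

// Logs any command that fails, e.g. the update hit by the write concern error
val commandLogger = new CommandListener {
  override def commandStarted(event: CommandStartedEvent): Unit = ()
  override def commandSucceeded(event: CommandSucceededEvent): Unit = ()
  override def commandFailed(event: CommandFailedEvent): Unit =
    println(s"${event.getCommandName} failed: ${event.getThrowable}")
}

val settings = MongoClientSettings.builder()
  .applyConnectionString(new ConnectionString("mongodb+srv://USER:PASSWORD@cluster.example.mongodb.net/test?retryWrites=true&w=majority"))
  .applyToClusterSettings((b: ClusterSettings.Builder) => b.addClusterListener(clusterLogger))
  .addCommandListener(commandLogger)
  .build()

val client = MongoClient(settings)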

I was able to get a:

org.mongodb.driver.operation - Unable to retry operation UPDATE due to error "com.mongodb.MongoBulkWriteException: Bulk write operation error on server myServer.mongodb.net:27017. 
Write concern error: WriteConcernError{code=11602, codeName='InterruptedDueToReplStateChange', message='operation was interrupted', details={"writeConcern": {"w": "majority", "wtimeout": 0, "provenance": "clientSupplied"}}, 
errorLabels=[]}. 

This indicates that the exception should have been considered retryable.

I will follow up on JAVA-4204.

Ross

Hi @Ross_Lawley ,

I enabled the logging on the driver side today and the error log matches what you got in your own tests:

Unable to retry operation UPDATE due to error "com.mongodb.MongoBulkWriteException: Bulk write operation error on server mongo-driver-shard-00-01.hs4pm.mongodb.net:27017. 
Write concern error: WriteConcernError{code=11602, codeName='InterruptedDueToReplStateChange', message='operation was interrupted', details={"writeConcern": {"w": "majority", "wtimeout": 0, "provenance": "clientSupplied"}}, errorLabels=[]}. "

Good luck with finding out what’s going on!

Valtteri