Hey,
I am in a pickle with our cluster setup. We are experiencing weird slowdowns/pauses on storage, which the datacenter provider is investigating, but that is taking its time. They occur at random, almost daily, and last a few minutes at a time; the worst case so far has been a 180-second writeConcern wait on the Primary for writes. We also have one old piece of our software stack that is seemingly very, very allergic to slow writes. We are rewriting those parts, but it’s not happening fast enough.
Anyway, the defaultRWConcern used to be majority, but in an effort to tackle this, I changed it to w: 1. It may or may not have helped, but there are still clear slowdowns (storage latency going from <1 ms → 200+ ms), which cause issues.
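For reference, a cluster-wide default write concern like this is set with the `setDefaultRWConcern` admin command. The snippet below is only a sketch of that kind of change, assuming mongosh connected to the Primary; the helper name `buildDefaultWriteConcernCommand` is made up for illustration and is not part of MongoDB.

```javascript
// Hypothetical helper (not a MongoDB API): builds the command document
// for the real `setDefaultRWConcern` admin command.
function buildDefaultWriteConcernCommand(w) {
  return {
    setDefaultRWConcern: 1,
    // wtimeout: 0 means "no timeout", matching the config shown in the PS.
    defaultWriteConcern: { w: w, wtimeout: 0 },
  };
}

// In mongosh, connected to the Primary, this would be used roughly as:
//   db.adminCommand(buildDefaultWriteConcernCommand(1));
//   db.adminCommand({ getDefaultRWConcern: 1 });  // inspect the result
```

The helper only builds a plain object, so it can be checked outside of mongosh before anything is sent to the cluster.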
So, I was thinking about what else I could do. The vast, vast majority of the writes are not critical at all (they are things like heartbeats for devices), so I was thinking of setting `writeConcernMajorityJournalDefault` to false. But apparently that would mean I should change the defaultRWConcern back to majority? In the end, though, the write latencies would go down considerably and possibly/probably would not be affected by varying latencies on disk writes?
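To make the idea concrete: `writeConcernMajorityJournalDefault` is a top-level field of the replica set configuration, so changing it means a reconfig. A rough sketch of what that could look like, assuming mongosh on the Primary; `withMajorityJournalDefault` is a hypothetical helper name used only for this example.

```javascript
// Hypothetical helper (not a MongoDB API): returns a copy of a replica set
// config document with writeConcernMajorityJournalDefault set as requested.
// The original config object is left untouched.
function withMajorityJournalDefault(cfg, enabled) {
  return Object.assign({}, cfg, {
    writeConcernMajorityJournalDefault: enabled,
  });
}

// In mongosh, connected to the Primary, this would be used roughly as:
//   const cfg = rs.conf();
//   rs.reconfig(withMajorityJournalDefault(cfg, false));
//   rs.conf().writeConcernMajorityJournalDefault;  // check the new value
```

Working on a copy keeps the config returned by `rs.conf()` around for comparison in case the change has to be reverted.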
The other thought I had was to keep using the defaultRWConcern w: 1 and add/convert one node to inMemory storage. We now have 5 nodes in total, so I thought one could just as well be an inMemory node. The reasoning here is that this node would then acknowledge writes “immediately”. Of course, as the Primary uses the same storage as the rest of them, the issue would probably/possibly just move to the Primary itself being slow to write…
Any thoughts or other ideas?
Thanks,
Mika
PS. If the current defaultRWConcern config is:
defaultReadConcern: { level: 'local' },
defaultWriteConcern: { w: 1, wtimeout: 0 },
with j: being unspecified, then writes are acknowledged after one replica has written them to memory, correct?