About Write Concerns

I just followed these two videos:
Lecture: Write Concerns

This raised a question for me about what the primary node actually does when it handles a write request from a client.

In my understanding the primary node has to perform these three tasks:

  1. Write the requested data to disk.
  2. Send a confirmation to the client: “Hey, I received your request and I am done with it.”
  3. Send a message to the secondary nodes: “Hey guys, I got a write order from a client; you have to replicate this.”

We can say that once these tasks are performed, the primary node’s job is basically done (with some possible variations in the details depending on the situation).

What is less clear is the order in which they are, or should be, performed.
Some orders may make more sense than others.

I presume, for example, that task 2) should be the last one executed, but is that what actually happens? For tasks 1) and 3) it is less clear: I can see arguments for both choices (1 then 3, or 3 then 1). So what actually happens?

For a standard replica set, as described in the long section, the order is:

  1. Sync oplog (replicate) with at least 1 node (~ms)
  2. Acknowledge to client (~ms)
  3. Write to disk (~1 min).

Somewhere in between, the data is “journaled” (for a default configuration).
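
As a rough illustration of that order, here is a toy Python model of a single majority write on a 3-node replica set. It is only a sketch: `Node`, `handle_majority_write` and the event strings are made-up names, not MongoDB internals, and the journaling step is left out because its exact position is not pinned down above.

```python
from dataclasses import dataclass, field

ACK = "primary: acknowledge to client"


@dataclass
class Node:
    """Hypothetical replica set member; only its oplog is modelled."""
    name: str
    oplog: list = field(default_factory=list)


def handle_majority_write(doc, primary, secondaries):
    """Return the ordered events for one write with a majority write concern."""
    events = []
    needed = (1 + len(secondaries)) // 2 + 1  # majority of all members

    primary.oplog.append(doc)
    events.append("primary: write applied, oplog entry recorded")
    holders = 1  # the primary itself counts towards the majority

    acked = holders >= needed  # only true for a single-node set
    if acked:
        events.append(ACK)

    for node in secondaries:
        node.oplog.append(doc)
        holders += 1
        events.append(f"{node.name}: oplog entry replicated")
        if not acked and holders >= needed:
            events.append(ACK)
            acked = True

    # Assumption taken from the list above: the data files are written much later.
    events.append("primary: checkpoint writes data to disk (about 1 min later)")
    return events


primary = Node("primary")
secondaries = [Node("secondary-1"), Node("secondary-2")]
events = handle_majority_write({"_id": 1}, primary, secondaries)
for step, event in enumerate(events, start=1):
    print(step, event)
```

In this model the acknowledgement is sent as soon as one secondary holds the oplog entry, and the write to the data files comes last.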


Take this case scenario:

A write operation lands on the primary of a 3-node replica set, with a write concern of majority. The user receives the acknowledgement.

Right after that, all servers suffer a hard shutdown.

  • Journal enabled (default): data written milliseconds before shutdown is recovered.

(This doesn’t mean the data itself has reached the data files on disk, but it is in the journal, which is used to rebuild the data.)

If journaling is disabled, the server recovers only up to the last checkpoint, i.e. the last snapshot written from memory to disk, which is what is called durable data.

If journaling is enabled, data is recovered up to the last checkpoint plus whatever is in the journal log.

The point is: the acknowledgement can be received before the data is durable, so an acknowledged write may still be lost. But this should be rather unusual.
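
The recovery rule described above (last checkpoint, plus the journal when it is enabled) can be sketched with a small toy model. `ToyStorage` and its methods are hypothetical names for illustration, not WiredTiger's API, and the model assumes every journaled write reaches the journal immediately, which is a simplification.

```python
class ToyStorage:
    """Toy model of checkpoint + journal recovery on a single node."""

    def __init__(self, journal_enabled=True):
        self.journal_enabled = journal_enabled
        self.memory = []      # in-memory state, lost on a hard shutdown
        self.data_files = []  # state as of the last checkpoint
        self.journal = []     # writes recorded since the last checkpoint

    def write(self, doc):
        self.memory.append(doc)
        if self.journal_enabled:
            # Assumption: the journal entry is on disk right away.
            self.journal.append(doc)

    def checkpoint(self):
        """Snapshot memory to the data files; older journal entries are no longer needed."""
        self.data_files = list(self.memory)
        self.journal.clear()

    def crash_and_recover(self):
        """Hard shutdown: memory is gone, rebuild from checkpoint + journal."""
        recovered = list(self.data_files)
        recovered.extend(self.journal)  # empty when journaling is disabled
        self.memory = recovered
        return recovered


for enabled in (True, False):
    storage = ToyStorage(journal_enabled=enabled)
    storage.write("a")
    storage.checkpoint()
    storage.write("b")  # written after the last checkpoint
    print(enabled, storage.crash_and_recover())
```

With the journal enabled, the write made after the last checkpoint comes back; with it disabled, only the checkpointed data does.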

Step 1 of your list (writing to disk), I believe, will normally be the third to happen.

If the write concern is majority, the confirmation to the client is sent after the secondary nodes sync the oplog. So your step 2 comes after your step 3.
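
To make the dependence on the write concern concrete, here is a minimal sketch of the acknowledgement condition. The function names are made up for illustration, and it assumes the primary counts as one of the members holding the write and that all members are voting members.

```python
def required_members(w, voting_members):
    """Number of members that must hold the write before the client is acknowledged."""
    if w == "majority":
        return voting_members // 2 + 1
    return int(w)


def can_acknowledge(w, members_with_write, voting_members):
    """True once enough members (primary included) hold the write."""
    return members_with_write >= required_members(w, voting_members)


# 3-node replica set, write so far only on the primary:
print(can_acknowledge(1, 1, 3))
print(can_acknowledge("majority", 1, 3))
# After one secondary has synced the oplog entry:
print(can_acknowledge("majority", 2, 3))
```

With a write concern of 1 the primary alone is enough, so the confirmation does not have to wait for any secondary; with majority on three nodes it has to wait for one of them.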

Other useful links:
WiredTiger (default SE)
Replica set config & defaults