HI,
I was going through the HybridClock implementation and I see that the logical clock ticks only on the primary node. The value is replicated through oplog entries to the secondaries.
In a sharded cluster, primary of each shard is where the cluster time will be incremented. I think this works fine in a single replica set, which is controlled by Raft. With the Raft leader controlling the clusterTime, it is guaranteed to be monotonically increasing across the Raft cluster.
But in multi-shard transactions, there are multiple raft clusters (one replica set per shard). Then it is not enough for the shard primary to control cluster time ticking. It must be the transaction co-ordinator, which should chose cluster time, getting all the cluster times of shard primaries involved in the transaction and then chosing the value which is greatest and then incrementing it. Is this how its implemented in mongo?
The hybridclock paper https://dl.acm.org/doi/pdf/10.1145/3299869.3314049, talks about per shard scenarios, but not of how it works in multi document distributed transactions.
Also , this looks similar to how its implemented in LogCabin, the reference implementation of Raft, but I don’t see Logcabin referred in the above mentioned paper. Is there any difference than whats implemented in logcabin? LogCabin.<wbr />appendEntry(5, "Cluster Clock, etc") - ongardie.net