Does query advance Logical Time in MongoDB?

Vinllen_Chen · August 10, 2020, 4:04am

While going through the MongoDB source code and the paper “Implementation of Cluster-wide Logical Clock and Causal Consistency in MongoDB”, I ran into a question about the clock: does MongoDB advance the cluster time when a node receives a query that carries a ClusterTime field? For instance, with Change Streams, MongoS opens a cursor on each MongoD and then sends getMore queries to them, so I want to know whether those queries will advance the HLC.

As the above paper said:

Cluster nodes (mongod, mongos, config server, clients) always track and include the greatest known ClusterTime when sending a message.
The ClusterTime is incremented (“ticks”) only when there is a write to a primary node’s replication operation log (oplog).

It looks like these two rules are contradictory.

Asya_Kamsky · August 10, 2020, 12:58pm

There is no contradiction between those two statements. Tracking and including does not mean advancing. As the paper says, only writes advance the logical clock.

Asya

Vinllen_Chen · August 11, 2020, 2:09am

Hi, Asya,
Thanks for your reply. My question is this: suppose the clusterTime of MongoS is T1 and it sends a query to a MongoD whose clusterTime is T0. Afterwards the HLC of the MongoD is still T0, but it will track T1. So what is the purpose of this tracking and including?

As chapter A1.3 of the paper says:

1. Client sends a write command to the primary, the message includes its current value of the ClusterTime: T1.
2. Primary node receives the message and advances its ClusterTime to T1, if T1 is greater than the primary node’s current ClusterTime value.
3. Primary node “ticks” the cluster time to T2 in the process of preparing the OpTime for the write. This is the only time a new value of ClusterTime is generated.
4. Primary node writes to the oplog.
5. Result is returned to the client, it includes the new ClusterTime T2.
6. The client advances its ClusterTime to T2.
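
To make the difference between "advancing to a received value" and "ticking" concrete, here is a minimal sketch of the six steps above. It is a toy model, not MongoDB's code: the `Node` class, the `observe`/`tick` names, and the use of plain integers instead of real timestamps are all assumptions made for illustration.

```python
class Node:
    """Toy node that tracks cluster time as a plain integer (an assumption)."""

    def __init__(self, name, cluster_time=0):
        self.name = name
        self.cluster_time = cluster_time

    def observe(self, incoming_time):
        # Steps 2 and 6: adopt the greater of the known and the received value.
        # This never produces a value that did not already exist somewhere.
        self.cluster_time = max(self.cluster_time, incoming_time)

    def tick(self):
        # Step 3: the only place where a new cluster time value is generated.
        self.cluster_time += 1
        return self.cluster_time


def write(client, primary, oplog, doc):
    primary.observe(client.cluster_time)  # steps 1-2: message carries T1
    op_time = primary.tick()              # step 3: tick to T2
    oplog.append((op_time, doc))          # step 4: write to the oplog
    client.observe(op_time)               # steps 5-6: client learns T2
    return op_time


def read(client, node):
    node.observe(client.cluster_time)     # the node catches up, no tick
    client.observe(node.cluster_time)     # the reply carries the time back
    return node.cluster_time


client = Node("client", cluster_time=5)   # T1 = 5
primary = Node("primary", cluster_time=3)
oplog = []

t_read = read(client, primary)                 # primary catches up to 5
t2 = write(client, primary, oplog, {"a": 1})   # primary ticks to 6
```

In this model the read moves the primary from 3 to 5, but 5 already existed on the client; only the write creates the new value 6.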

Look at step 2: it looks like the primary node will advance its HLC to T1 as soon as it receives the message. So in my understanding, both a query and a write can advance the HLC to T1, but only a write can tick the HLC to T2.

Asya_Kamsky · August 12, 2020, 12:01am

No, query/read cannot advance the cluster’s time. It can update a particular node’s view of what current cluster time is. But that doesn’t advance the time.

Think of it this way. You are node 1. You think it’s T1. Meanwhile node 2 has taken several writes, advancing the cluster time to T3. The next request to node 1, no matter what it is, will notify node 1 that the current cluster time is T3, not T1. That does not advance the cluster clock; it just catches up node 1, which was snoozing and didn’t realize that the cluster time had advanced.

There is no such thing as multiple cluster times. There is one. Not all nodes/clients may realize where it is at a particular point, and they can learn about it through any communication, even one that doesn’t involve any query.
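
The node 1 / node 2 scenario can be sketched in a few lines. This is only an illustration under simplifying assumptions: cluster time is an integer, each message carries the sender's greatest known time, and the `exchange` helper and the node names are hypothetical.

```python
def exchange(times, sender, receiver):
    """Model one message: the receiver learns the greatest time the sender knows."""
    times[receiver] = max(times[receiver], times[sender])


# node2 has taken writes up to T3; node1 and the client still think it is T1.
times = {"node1": 1, "node2": 3, "client": 1}
cluster_time_before = max(times.values())

exchange(times, "node2", "client")   # client hears about T3 from node2
exchange(times, "client", "node1")   # any later request tells node1 about T3

cluster_time_after = max(times.values())
```

After both messages every participant knows T3, yet the greatest value in the system is unchanged, which is the sense in which catching up is not advancing.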

Vinllen_Chen · August 12, 2020, 2:16am

Thanks, Asya. One more question: if a query can’t advance the HLC, how does a read-only distributed transaction fulfill the snapshot isolation level?

For example, the cluster time of MongoS is T1, while MongoD1 is T2 and MongoD2 is T3. Assume this read-only distributed transaction spans MongoD1 and MongoD2; the response of this transaction may include {a:1} from MongoD1 with cluster time T2 and {b:1} from MongoD2 with cluster time T3.

Is this phenomenon normal? Or is my understanding wrong?

Asya_Kamsky · August 12, 2020, 3:27pm

I think you may be thinking of it the wrong way. For instance:

cluster time of MongoS is T1, while MongoD1 is T2 and MongoD2 is T3

There is only one cluster time. Here it is T3 for the whole cluster; it just so happens that MongoS doesn’t know that yet, and neither does MongoD1. They are simply behind in their knowledge of the cluster time. However, that doesn’t matter, because the writes are ‘consistent’. In other words, each node can start its read at its current “snapshot” and the result will correctly reflect cluster time T3, because the gossip protocol will eventually get each node to update its cluster time up to or past that time, and any write that happens after the snapshot read starts will not be included in the snapshot.

the response of this transaction may include {a:1} from MongoD1 with cluster time T2 and {b:1} from MongoD2 with cluster time T3

But what’s key is that {a:1} may have been there at T2, but it’s also the same at T3! It has to be, because if it weren’t, the cluster time would have had to be incremented. So in reality all the snapshots are consistent with T3, even if some of the components of the cluster don’t yet know that their writes are the same at T3 as at whatever latest time they know about.
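
The argument can be illustrated with a toy versioned store. This is a sketch under stated assumptions, not how MongoDB stores or reads data: the `Shard` class, its `write`/`read_at` methods, and the integer times are hypothetical.

```python
import bisect


class Shard:
    """Toy shard that keeps every version of a key, tagged with its write time."""

    def __init__(self):
        self.history = {}  # key -> list of (time, value), ascending by time

    def write(self, key, value, time):
        self.history.setdefault(key, []).append((time, value))

    def read_at(self, key, time):
        # Return the newest version written at or before `time`, if any.
        versions = self.history.get(key, [])
        write_times = [t for t, _ in versions]
        i = bisect.bisect_right(write_times, time)
        return versions[i - 1][1] if i else None


shard1 = Shard()  # stands in for MongoD1, whose last write was at T2
shard2 = Shard()  # stands in for MongoD2, whose last write was at T3
shard1.write("a", 1, time=2)
shard2.write("b", 1, time=3)

a_at_t2 = shard1.read_at("a", 2)
a_at_t3 = shard1.read_at("a", 3)
b_at_t3 = shard2.read_at("b", 3)
```

Since shard1 took no write between T2 and T3, reading it at T2 or at T3 gives the same {a:1}, so combining it with shard2's {b:1} at T3 yields a view consistent with T3 in this model.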
