Transactions Background Part 1: Low-level timestamps in MongoDB/WiredTiger
May 23, 2019 | Updated: June 2, 2021
Recent features in MongoDB, such as multi-document ACID transactions, have required foundational enhancements in the underlying WiredTiger storage engine.
In this six-part series, we'll be looking at the changes in MongoDB that enabled transaction support at the heart of the database. These changes included:
- Low-level timestamps in MongoDB/WiredTiger
- Logical Sessions in MongoDB
- Support for Local Snapshot Reads
- Implementing a Global Logical Clock
- Enabling Safe Secondary Reads
- Adding Retryable Writes
Each of these features will be examined to answer questions such as "What was added?", "Why was it implemented that way?" and "What impact does it have on transactions as a whole?".
We start here with a look at low-level timestamps in MongoDB and WiredTiger.
Timestamps from MongoDB write operations now appear in the WiredTiger storage layer as additional metadata. This allows MongoDB’s concept of time and order to be queried so that only data from or before a particular time is retrieved. It allows MongoDB snapshots to be created so that database operations and transactions can work from a common point in time.
To enable replication MongoDB maintains an operational log, also known as the oplog. The oplog is a specialized collection within the server which lists the most recent operations that have been applied to the database. By replaying those operations on a secondary server, the replica can be brought up to date, maintaining a consistent state with the primary server. The ordering of operations in the oplog is critical to ensuring that the replicas correctly reflect the contents of the primary server.
MongoDB manages that ordering of the oplog and how replicas access the oplog in the correct order. The switch to the more powerful WiredTiger storage engine, which has its own underlying idea of order, meant that to fully exploit WiredTiger's capabilities, the two ideas of order in the server and the storage layer would need to be reconciled.
WiredTiger stores all data in a tree of keys and values. When used as MongoDB's storage layer, that data can be a document or a part of an index; both are stored in the WiredTiger tree. When any update is made to a key's value, WiredTiger creates an update structure containing information about the transaction, the data that has changed, and a pointer to any later changes. WiredTiger then attaches that structure to the original value. Later update structures append themselves to the previous update structure, creating a chain of different versions of the value over time.
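The update chain described above can be sketched as a simple linked list. This is an illustrative model only; the class and field names here are simplified stand-ins, not WiredTiger's actual C structures.

```python
class Update:
    """One version of a value, with a pointer to the next (later) change."""
    def __init__(self, txn_id, value):
        self.txn_id = txn_id  # transaction that made this change
        self.value = value    # the changed data
        self.next = None      # later update, appended over time

class KeyEntry:
    """A key's slot in the tree: the original value plus its update chain."""
    def __init__(self, original_value):
        self.original_value = original_value
        self.first_update = None

    def apply_update(self, txn_id, value):
        update = Update(txn_id, value)
        if self.first_update is None:
            # First change: attach the chain to the original value.
            self.first_update = update
        else:
            # Later changes append themselves to the end of the chain.
            tail = self.first_update
            while tail.next is not None:
                tail = tail.next
            tail.next = update

    def newest_value(self):
        if self.first_update is None:
            return self.original_value
        tail = self.first_update
        while tail.next is not None:
            tail = tail.next
        return tail.value

entry = KeyEntry("v0")
entry.apply_update(txn_id=1, value="v1")
entry.apply_update(txn_id=2, value="v2")
print(entry.newest_value())  # v2
```

Walking the chain from the original value forward yields every version of the data, which is what makes multi-version reads possible.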
This is the multi-version component of the concurrency control implemented by WiredTiger. WiredTiger has its own rules for how to read the update structures to get the "current" state of a value. The order in which WiredTiger applies updates differs from MongoDB's oplog order. That difference comes from WiredTiger parallelizing writes on secondaries where possible. As the primary can accept many parallel writes, secondaries need to match that throughput with their own parallel writes for replication.
To preserve the MongoDB order within the WiredTiger storage engine, the update structure was extended with a "timestamp" field. The value for this field is passed into WiredTiger by MongoDB and is treated as a significant item of meta information by WiredTiger. When queries are made of WiredTiger, a timestamp can be specified with the query to get the exact state of the data at that particular moment. This provides a way to map between MongoDB order and WiredTiger storage order.
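A timestamped read can be sketched as follows. Assuming the versions of a value are held as an oldest-first list of `(timestamp, value)` pairs (a simplification of the update chain, not WiredTiger's actual API), the query returns the newest version at or before the requested timestamp:

```python
def read_as_of(versions, read_ts):
    """Return the newest value visible at read_ts.

    versions: list of (timestamp, value) pairs, oldest first.
    Returns None if no version existed at or before read_ts.
    """
    visible = None
    for ts, value in versions:
        if ts <= read_ts:
            visible = value  # keep the newest version so far
        else:
            break            # everything later is too new to see
    return visible

versions = [(10, "a"), (20, "b"), (30, "c")]
print(read_as_of(versions, 25))  # b  (the version at ts=30 is not yet visible)
print(read_as_of(versions, 30))  # c
print(read_as_of(versions, 5))   # None (no version existed yet)
```

Queries made at the same timestamp always see the same versions, which is what makes the mapping between MongoDB order and storage order stable.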
When a secondary replica synchronizes with the primary, it does so by reading batches of updates from the oplog and then applying those changes to its own storage. Without timestamps, applying the operations would block read queries until a batch of updates was completed, to ensure that users see no out-of-order writes. With the addition of timestamps, read queries can continue using the timestamp from the start of the current batch; that timestamp ensures a consistent response to the queries. This means that secondary reads are no longer interrupted by replication updates.
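This can be sketched with a toy versioned store (illustrative names, not MongoDB's code): a query pinned to the batch-start timestamp keeps seeing the pre-batch state even while batch writes land in parallel, so there is no need to block it.

```python
class VersionedStore:
    """Toy store keeping every timestamped version of each key."""
    def __init__(self):
        self.versions = {}  # key -> list of (timestamp, value), oldest first

    def write(self, key, ts, value):
        self.versions.setdefault(key, []).append((ts, value))

    def read(self, key, read_ts):
        # Return the newest value at or before read_ts.
        visible = None
        for ts, value in self.versions.get(key, []):
            if ts <= read_ts:
                visible = value
        return visible

store = VersionedStore()
store.write("x", 10, "before-batch")

read_ts = 10                         # timestamp at the start of the batch
store.write("x", 20, "during-batch") # replication write applied in parallel

# A concurrent secondary read pinned to the batch-start timestamp still
# sees a consistent pre-batch state, despite the in-flight batch write.
print(store.read("x", read_ts))  # before-batch
print(store.read("x", 20))       # during-batch (visible once reads advance)
```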
When multiple secondary servers in a MongoDB cluster are updated through replication, they will find themselves at different stages of synchronization with the primary. This gives rise to the "majority commit point": the latest point in time up to which a majority of the servers have caught up. When a primary fails, only data up to that majority commit point is guaranteed to be available on all servers, and that's what the secondaries work from as one of them is elected to be a new primary by our Raft-based consensus protocol.
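As a sketch of the idea (not MongoDB's implementation), the majority commit point can be derived from each member's last-applied timestamp: it is the newest timestamp known to be present on at least a majority of the members.

```python
def majority_commit_point(applied_timestamps):
    """Highest timestamp that a majority of replica-set members have reached.

    applied_timestamps: one last-applied timestamp per member.
    """
    n = len(applied_timestamps)
    majority = n // 2 + 1
    # Sorted descending, the majority-th highest entry is the newest
    # timestamp guaranteed to exist on at least `majority` members.
    return sorted(applied_timestamps, reverse=True)[majority - 1]

# Five-member set: primary at 50, secondaries trailing at various points.
# Three members (a majority) have reached timestamp 45 or later.
print(majority_commit_point([50, 48, 45, 40, 30]))  # 45
```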
When the former primary returns to the cluster, the process of synchronizing that server with the rest of the cluster was quite complex. As it may have had data from beyond the common point, it had to work out which changes it had made that the cluster no longer knew about, and retrieve old versions of the records it had changed.
With timestamps, this process is radically simplified. By taking the timestamp of the majority commit point and applying it to the former primary's storage, changes that happened after that timestamp can be dropped. Once done, the node can rejoin the cluster and start replicating from the primary.
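Conceptually, the rollback step amounts to discarding every version newer than the majority commit point. A minimal sketch, reusing a simple oldest-first list of `(timestamp, value)` versions (WiredTiger performs this inside its trees; this is not its API):

```python
def rollback_to_stable(versions, stable_ts):
    """Keep only the versions at or before the stable timestamp.

    versions: list of (timestamp, value) pairs, oldest first.
    stable_ts: the majority commit point to roll back to.
    """
    return [(ts, value) for ts, value in versions if ts <= stable_ts]

# A former primary holds writes (ts 40 and 50) the majority never saw.
history = [(10, "a"), (20, "b"), (40, "c"), (50, "d")]
print(rollback_to_stable(history, stable_ts=20))  # [(10, 'a'), (20, 'b')]
```

After dropping the post-timestamp versions, the node's data matches what the rest of the cluster agrees on, so ordinary replication can resume from there.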
Timestamps and Transactions
Pushing timestamp information into the heart of the WiredTiger trees has enabled MongoDB to use WiredTiger's multi-version concurrency control to reduce locking and streamline resynchronization. The ability to snapshot a point in time also gives the server the ability to roll back to that point in time, a capability that is fundamental to the correctness guarantees of multi-document ACID transactions.
In the next part of the series, we'll look at Logical Sessions in MongoDB and how they enabled transactions.
The beta test for the next evolution of transactions on MongoDB is open - Sign up for the Distributed Transactions Beta