
MongoDB and Leap Seconds

The short answer

As the June 30, 2015 leap second event approaches, I have received a number of questions about how MongoDB is expected to behave during a leap second event. The short answer is “just fine.” MongoDB treats the observation of leap seconds much as it treats the observation of clock skew between machines or of other time-setting events, like manual clock adjustment.

In more detail

To understand why MongoDB is robust to leap seconds, it helps to think about how leap seconds affect the observation of wall clock time, especially the case where they can make it appear to processes that time has gone backwards, and about how MongoDB uses wall clock time.

Leap seconds come in one of two forms: either an extra second is added at the end of the last minute of a specific calendar day in UTC, or the last second of that minute is omitted. The first case leads to a time of 23:59:60Z on the day with the leap second; the second case leads to time transitioning directly from 23:59:58Z to 00:00:00Z. Unfortunately, the time standard used by almost all computers defines a calendar day as being composed of exactly 86,400 seconds. Two techniques are used to deal with this discrepancy. The cool but far less common one is to make all the computer-reported seconds for a period of time leading up to the end of the leap-second day slightly longer or shorter than true seconds, “smearing” the leap second over several hours. Google apparently does this. The more mundane technique is for the OS clock to repeat the last second of the day, from the point of view of observing processes, or to skip it, depending on the type of leap second. When the last second of the day occurs twice, an observer reading time with subsecond granularity could observe 23:59:59.800Z and subsequently observe 23:59:59.200Z, making it seem as though time has moved backwards. When the last second of the day is omitted, a process might believe that two seconds have passed when in fact only one has, because it observes 23:59:58Z and then 00:00:00Z.

With this information about the observable effects of leap seconds in hand, we can now look at how they might affect MongoDB’s use of wall clock time. MongoDB uses wall clock time for the following:

- To generate diagnostic information, such as log messages;
- To record the wall clock time in fields of documents via the $currentDate update operator and related operators, and to generate OIDs;
- To generate “optime” fields in replication oplogs;
- To schedule periodic events, such as replication heartbeats or cursor expirations.

Impact on Diagnostic Information

Diagnostic data is used by human beings and by tools such as MMS Monitoring to monitor the health of a MongoDB cluster, or to perform a forensic analysis after an observed failure. In these cases, the accuracy of the reported wall clock time aids in diagnosis, but is not required for correct operation of the cluster or for the analytic task. This must be so, because MongoDB clusters are distributed over asynchronous networks, and tight synchronization of clocks among the components of the system cannot be assured.

One caveat in the forensics and monitoring use case is that, if your operating system allows MongoDB to observe time moving backwards, some diagnostic log messages may indicate that an operation took a very long time when it in fact did not. These false positives for slow operations are typically easy to identify because they report absurdly long or negative durations (frequently on the order of two weeks, positive or negative). This can also occur if you manually reset your system clock during MongoDB operation.

Impact on $currentDate et al.

When a client application requests that a document be updated with the server’s notion of the current date and time, MongoDB simply asks the operating system for the current wall clock time and records that value in the client’s document. Any impact of clock adjustments, for leap seconds or otherwise, is effectively passed through to the client application. Applications that require second-granularity precision of timestamps should be examined, whether or not they use MongoDB, as should the time synchronization technology (typically NTP) used in support of those applications.
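As a concrete illustration, here is a minimal PyMongo sketch of requesting a server-side timestamp with $currentDate; the database, collection, and field names are made up for this example:

```python
# Minimal PyMongo sketch (illustrative names only): ask the server to stamp
# a document with its own notion of the current wall clock time.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client.shop.orders  # hypothetical database/collection

orders.update_one(
    {"_id": 42},
    {
        "$set": {"status": "shipped"},
        # The server, not the client, supplies this value; whatever its OS
        # reports around a leap second is exactly what gets stored.
        "$currentDate": {"lastModified": True},
    },
    upsert=True,
)
```

If the server’s clock is stepped for a leap second between two such updates, the stored lastModified values may not be strictly increasing; this is the same behavior the application would see from any other clock adjustment.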
Impact on the replica set oplog

MongoDB replica sets use a replicated operation log, or oplog, to inform secondary nodes of changes to make in order to stay consistent with the primary node. These changes are kept in a total order, described by an “optime”, sometimes called the timestamp. An optime is composed of a wall clock time paired with an “increment”, an integer which uniquely identifies operations that execute during the same wall clock time. For example, the first operation recorded at 23:59:59Z would be recorded as optime (23:59:59Z, 1), and the third operation would have optime (23:59:59Z, 3).

But wall clock time is not used indiscriminately, because system clocks can drift or be reset. The time portion of the optime is actually the maximum of the current observed time and the greatest previous observation. If MongoDB records operation A with an optime of (23:59:59Z, 1), and then observes a time of 23:59:58Z when it attempts to log a subsequent operation B, it acts as if operation B occurred during 23:59:59Z, and thus logs it with an optime of (23:59:59Z, 2). In addition to leap seconds, unsynchronized clocks between replica set members may cause the optime to run ahead of any one node’s local wall clock time. This situation is common and does not negatively affect replication.
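A rough Python sketch may make that max-of-observations rule clearer; this is purely illustrative and not MongoDB’s actual implementation:

```python
import time


class OptimeSketch:
    """Illustrative model of the rule described above: the time component of
    an optime never moves backwards, and the increment orders operations that
    are logged while the clock repeats or lags a previously observed second."""

    def __init__(self):
        self.last_secs = 0   # greatest wall clock second observed so far
        self.increment = 0   # counter for operations sharing that second

    def next_optime(self):
        observed = int(time.time())   # wall clock reading; may step backwards
        if observed > self.last_secs:
            # Clock moved forward normally: adopt the new second, restart counter.
            self.last_secs = observed
            self.increment = 1
        else:
            # Clock repeated a second or appeared to move backwards (leap second,
            # manual reset): keep the previous maximum and bump the increment.
            self.increment += 1
        return (self.last_secs, self.increment)
```

Replayed against the example above, operation A yields (23:59:59Z, 1); when the clock then reads 23:59:58Z for operation B, the sketch keeps 23:59:59Z and returns increment 2, so the total order is preserved.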
Impact on the scheduling of periodic tasks

The final way that MongoDB uses wall clock time is to schedule periodic activities, like sending heartbeats to replica set nodes, cleaning up expired cursors, or invalidating caches that use age-based invalidation policies. These activities are typically scheduled to run after some amount of wall clock time has elapsed, rather than at specific absolute wall clock times, but the difference is not material. In either case, the introduction of a positive leap second may cause an event to occur later than it otherwise would have, and the introduction of a negative leap second may cause an event to occur sooner than it otherwise would have. MongoDB’s algorithms must already be robust to these behaviors, because they are typically indistinguishable from delays induced by higher-than-average network latency or by virtual machine and operating system scheduling issues.

Your Operating System matters

Remember, MongoDB relies on host operating system capabilities for reading the wall clock time and for synchronizing events with wall clock time. As such, you should ensure that the operating system running under MongoDB is itself prepared for leap seconds.

The most widely documented database problems during the June 2012 leap second were actually caused by a livelock bug in the Linux kernel’s futex synchronization primitive. The DataStax developer blog has a brief summary of the cause of the June 2012 issue in Cassandra, which correctly assigns responsibility to a since-resolved issue in the Linux kernel. If you use Red Hat Enterprise Linux, Red Hat has a nice knowledge base article that covers leap second preparedness for RHEL. If you’re running on Windows, Microsoft has a very brief knowledge base article on the subject of leap seconds.

If you’re interested in learning more about the operational best practices of MongoDB, download our guide: Learn Best Practices for Operations

About the Author - Andy

Andy Schwerin is the Director of Distributed Systems Engineering at MongoDB in New York.

April 24, 2015