Hi @Graeme_Henderson ,
Please note that some of the below information is also communicated in the Alert Resolutions > Replication Oplog Alerts documentation.
Do I need to change anything?
If a large number of write operations are performed on the PRIMARY node of the replica set and the SECONDARY nodes do not have enough time to replay all of the operations contained in the oplog, the SECONDARY nodes will typically “fall off the oplog”, which requires an initial sync in order to recover and ensure that the data is consistent across all nodes.
There are a few things that help mitigate the risk of falling off the oplog:
- Increase the oplog size of the cluster via the configuration panel. The larger the oplog is, the larger the window the cluster has to replicate operations, which in turn makes it less likely that SECONDARY nodes will fall off the oplog. I recommend reviewing the Oplog GB/Hour and Replication Window metrics to determine the best value for your case. For instance, you can check how much oplog you generate at peak time and consider an oplog size of at least X GB, where X is the peak value of Oplog GB/Hour in your metrics, so that you have a 1-hour replication oplog window for that peak workload.
- Utilize a write concern of w: majority to ensure that writes are acknowledged by a majority of nodes in the replica set before moving on to the next operation.
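As a rough illustration of the sizing rule above, here is a minimal Python sketch that turns a peak Oplog GB/Hour reading into an oplog size in MB. The function name `min_oplog_size_mb`, the sample value of 2.5 GB/hour, and the conversion of 1 GB = 1024 MB are my own assumptions for the example, not values taken from Atlas or from your cluster.

```python
import math


def min_oplog_size_mb(peak_gb_per_hour: float, window_hours: float = 1.0) -> int:
    """Return the smallest oplog size (in MB) that can hold `window_hours`
    of writes at the observed peak rate.

    Hypothetical helper: assumes 1 GB = 1024 MB and that the peak rate
    is sustained for the whole window.
    """
    if peak_gb_per_hour < 0 or window_hours <= 0:
        raise ValueError("peak rate must be >= 0 and window must be > 0")
    # Round up so the window is never shorter than requested.
    return math.ceil(peak_gb_per_hour * window_hours * 1024)


# Example: a made-up peak of 2.5 GB/hour with the 1-hour window from the post.
print(min_oplog_size_mb(2.5))
```

Passing a larger `window_hours` (for example 24) gives a more conservative figure if you want SECONDARY nodes to tolerate longer periods of lag or downtime.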
It is unclear from the Atlas interface what the oplog capacity is set to
You should be able to view (and set) the cluster's oplog size configuration by following the steps below in the Atlas UI:
- Go to the Clusters section
- Select ... on the cluster you wish to view / configure the oplog size for
- Select Edit configuration
- Select the Additional Settings tab (to expand)
- Select the More Configuration Options tab (to expand)
- View (or set) the Oplog Size (in MB)
- Click Review Changes
- Click Apply Changes
Hope this helps.
Kind Regards,
Jason