Fix Oplog Issues

Replication Oplog alerts can be triggered when the amount of oplog data generated on a primary cluster member is larger than the cluster's configured oplog size.

Alert Conditions

You can configure the following alert conditions in the project-level alert settings page to trigger alerts.

Replication Oplog Window is (X) occurs if the approximate amount of time available in the primary replication oplog meets or goes below the specified threshold. This refers to the amount of time that the primary can continue logging given the current rate at which oplog data is generated.

Oplog Data Per Hour is (X) occurs if the amount of data per hour being written to a primary's replication oplog meets or exceeds the specified threshold.
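
As a rough illustration of how the two conditions relate, the following sketch estimates the oplog window as the configured oplog size divided by the current write rate, then checks both thresholds. The function names and threshold values are hypothetical, and the calculation is a simplification of the description above, not the way Atlas computes its metrics.

```python
def oplog_window_hours(oplog_size_gb: float, oplog_gb_per_hour: float) -> float:
    """Approximate hours of history the oplog can hold at the current write rate."""
    if oplog_gb_per_hour <= 0:
        # No oplog traffic: the window is effectively unbounded.
        return float("inf")
    return oplog_size_gb / oplog_gb_per_hour


def triggered_alerts(oplog_size_gb, oplog_gb_per_hour, min_window_hours, max_gb_per_hour):
    """Return the names of the alert conditions that the given metrics would meet."""
    alerts = []
    # "Replication Oplog Window": window meets or goes below the threshold.
    if oplog_window_hours(oplog_size_gb, oplog_gb_per_hour) <= min_window_hours:
        alerts.append("Replication Oplog Window")
    # "Oplog Data Per Hour": write rate meets or exceeds the threshold.
    if oplog_gb_per_hour >= max_gb_per_hour:
        alerts.append("Oplog Data Per Hour")
    return alerts


# Hypothetical values: a 10 GB oplog receiving 20 GB of oplog data per hour.
print(triggered_alerts(oplog_size_gb=10, oplog_gb_per_hour=20,
                       min_window_hours=1, max_gb_per_hour=15))
```

With these example numbers the estimated window is half an hour, so both conditions would be met.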

Common Triggers

These are a few common events which may lead to increased oplog activity:

Intensive write and update operations in a short period of time.
The cluster's configured oplog size is smaller than the value in the Oplog GB / Hour graph observed in the cluster metrics view.

Fix the Immediate Problem

These are a few possible actions to consider to help resolve Replication Oplog Alerts:

Increase the oplog size by editing your cluster's configuration to ensure it is higher than the peak value from the Oplog GB / Hour graph in the cluster metrics view.
Increase the oplog size if you foresee intense write and update operations occurring in a short time period.

Note
You may need to increase your cluster's storage to free enough space to resize the oplog.
Specify a write concern of majority for all write operations so that each write is acknowledged by a majority of the replica set's data-bearing voting members before the application moves on to the next write. This throttles traffic from your application by preventing the primary from accepting writes more quickly than the secondaries can handle.
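
The sizing guidance above can be sketched as a small calculation. This is only an illustrative helper: the function name, the default one-hour target window, and the 20% headroom factor are assumptions chosen for the example, not Atlas recommendations. Take the peak value from the Oplog GB / Hour graph as the input.

```python
import math


def recommended_oplog_size_gb(peak_gb_per_hour: float,
                              target_window_hours: float = 1.0,
                              headroom: float = 1.2) -> int:
    """Suggest an oplog size in whole GB that covers the target window at the peak rate.

    headroom > 1 keeps the result above the observed peak (assumed default: 20%).
    """
    if peak_gb_per_hour < 0 or target_window_hours <= 0 or headroom < 1:
        raise ValueError("expected a non-negative peak, a positive window, and headroom >= 1")
    # Round up to a whole number of GB and never suggest less than 1 GB.
    return max(1, math.ceil(peak_gb_per_hour * target_window_hours * headroom))


# Hypothetical peak of 8 GB/hour read from the Oplog GB / Hour graph.
print(recommended_oplog_size_gb(8))
```

If you expect a burst of writes, pass the anticipated peak rather than the historical one, and raise target_window_hours to match how long a secondary might need to catch up.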

Implement a Long-Term Solution

Refer to Workloads that Might Require a Larger Oplog Size for more information on oplog sizing requirements for your use case.

Monitor Your Progress

You might observe the following scenarios when these alerts trigger:

The Oplog GB / Hour graph in the metrics view spikes upward.
The Replication Oplog Window graph in the metrics view is low.
The MongoDB logs of secondary or unhealthy nodes (see View and Download MongoDB Logs in the Atlas documentation) display the following message:
We are too stale to use <node>:27017 as a sync source.
An Atlas node reports a state of STARTUP2 or RECOVERING for an extended period of time.

Typically, this indicates that the node has "fallen off the oplog" and is unable to keep up with the oplog data being generated by the primary node. In this case, the node will require an initial sync in order to recover and ensure that the data is consistent across all nodes. You can check the state of a node using the rs.status() shell method.
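
As a minimal sketch of that check, the following function scans a document shaped like the output of rs.status() (the replSetGetStatus command) and reports members in STARTUP2 or RECOVERING, along with how far their last applied operation trails the primary's. The hostnames and timestamps in the sample are made up for illustration. With a driver such as PyMongo, a similar document could presumably be obtained by running replSetGetStatus against the admin database, provided your database user is permitted to run it.

```python
from datetime import datetime

# Member states that suggest a node is still syncing or has fallen behind.
STALE_STATES = {"STARTUP2", "RECOVERING"}


def members_needing_attention(status: dict) -> list:
    """Return (name, state, lag_seconds) for members in STARTUP2 or RECOVERING."""
    members = status.get("members", [])
    primary = next((m for m in members if m.get("stateStr") == "PRIMARY"), None)
    flagged = []
    for member in members:
        if member.get("stateStr") in STALE_STATES:
            lag = None
            # Lag is the gap between the primary's and the member's last applied op.
            if primary and "optimeDate" in primary and "optimeDate" in member:
                lag = (primary["optimeDate"] - member["optimeDate"]).total_seconds()
            flagged.append((member["name"], member["stateStr"], lag))
    return flagged


# Hypothetical status document containing only the fields this sketch reads.
sample_status = {
    "members": [
        {"name": "node-a.example.net:27017", "stateStr": "PRIMARY",
         "optimeDate": datetime(2024, 1, 1, 12, 0, 0)},
        {"name": "node-b.example.net:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2024, 1, 1, 11, 59, 58)},
        {"name": "node-c.example.net:27017", "stateStr": "RECOVERING",
         "optimeDate": datetime(2024, 1, 1, 10, 0, 0)},
    ]
}

for name, state, lag in members_needing_attention(sample_status):
    print(f"{name} is {state}, about {lag} seconds behind the primary")
```

This is a single snapshot. To tell whether a node has been in one of these states for an extended period, you would need to repeat the check over time.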
