Query Sentinel

Query Sentinel is an Intelligent Workload Management (IWM) policy in Atlas that automatically terminates inefficient, long-running query operations on a mongod node when a cluster is overloaded. By targeting only the most expensive, least efficient query operations, Query Sentinel reduces outage risk by quickly relieving pressure with minimal impact to the overall workload.

MongoDB considers a node overloaded when the number of incoming operations is large enough to cause a total or near-total outage. MongoDB computes overload from metrics such as CPU utilization, queue depth, operations per second, and latency.

Important

Query Sentinel is a load-shedding policy. If it is active on your Atlas cluster and the cluster is overloaded, your application might receive the associated overload errors.

When long-running queries consume excessive resources during high traffic, they can degrade cluster performance and increase the risk of outages. Query Sentinel protects cluster availability by:

Detecting and terminating long-running query operations during overload
Preserving availability for shorter operations that continue to succeed
Reducing outage risk without shutting down the entire workload

Considerations

Your Atlas cluster must be running MongoDB 8.3 or later to use this policy. On MongoDB 8.3, this policy is disabled by default. To enable or disable IWM policies, see the IWM settings.
This policy is available only for M10+ Atlas replica set clusters.
This policy is not available on sharded clusters or analytics nodes.
Query Sentinel does not terminate operations based on memory consumption, including idle cursors that consume memory but are not actively executing.
Query Sentinel does not pause or queue operations.
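The version, tier, and topology prerequisites above can be pictured as a simple check. The following is a minimal sketch, not an Atlas API: the function name, the `tier` and `cluster_type` inputs, and their string formats are assumptions made for illustration. It does not cover the analytics-node restriction and assumes a purely numeric dotted version string.

```python
import re

MIN_VERSION = (8, 3)  # "MongoDB 8.3 or later", from the considerations above
MIN_TIER = 10         # "M10+", from the considerations above


def parse_version(version: str) -> tuple:
    """Turn '8.3.1' into (8, 3, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_eligible(version: str, tier: str, cluster_type: str) -> bool:
    """Return True if a cluster meets the prerequisites listed above.

    `tier` (for example "M10") and `cluster_type` ("REPLICASET" or
    "SHARDED") are hypothetical inputs chosen for this sketch.
    """
    match = re.match(r"M(\d+)", tier)
    if match is None or int(match.group(1)) < MIN_TIER:
        return False  # below M10, or not an M-series tier name
    if cluster_type != "REPLICASET":
        return False  # sharded clusters are not supported
    return parse_version(version) >= MIN_VERSION
```

For example, `is_eligible("8.3.0", "M10", "REPLICASET")` returns `True`, while an 8.0 cluster or a sharded cluster returns `False`.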

Behavior

When Atlas runs the Query Sentinel policy on your cluster, it performs the following actions:

Monitors for overload
- Atlas continuously evaluates indicators of overload on each node.
- When Atlas detects overload conditions, the Query Sentinel policy activates. Atlas triggers an alert for the following alert condition:
  "Atlas has activated Query Sentinel to automatically terminate expensive queries and safeguard cluster stability."
  To modify your project's alert settings, see Configure an Alert.
Identifies long-running queries
- Query Sentinel monitors the set of currently running operations and evaluates each one against predefined efficiency criteria, such as query runtime and query plan summary.
Terminates matching operations
- Query Sentinel issues a killOp command to terminate operations that match the policy criteria. Only operations that match the criteria are stopped; the rest of the workload is unaffected.
- When an operation is terminated, the server returns an InterruptedDueToOverload error code. To learn more about error handling, see Overload Errors.
Resumes normal operation
- As overload conditions subside, the policy stops terminating operations and the cluster returns to normal operation.
- When the policy is no longer active, the following informational event appears in the cluster's activity feed:
  "Atlas has switched Query Sentinel to monitoring mode and paused the automatic termination of expensive queries."
  To learn more, see IWM activity feed events.
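The identification step above can be pictured as a filter over the list of running operations. The sketch below is illustrative only: the actual criteria Atlas applies are not published in this section, so the runtime threshold and the plan-summary rule are hypothetical. The input dictionaries use field names from `currentOp` output (`opid`, `active`, `secs_running`, `planSummary`).

```python
# Hypothetical thresholds; the real policy criteria are not given in the text above.
MAX_SECS_RUNNING = 30
INEFFICIENT_PLAN_PREFIXES = ("COLLSCAN",)


def select_ops_to_terminate(current_ops):
    """Return the opids of active, long-running operations with an inefficient plan."""
    selected = []
    for op in current_ops:
        if not op.get("active"):
            continue  # idle operations are not candidates
        if op.get("secs_running", 0) < MAX_SECS_RUNNING:
            continue  # short operations are left alone
        plan = op.get("planSummary", "")
        if plan.startswith(INEFFICIENT_PLAN_PREFIXES):
            selected.append(op["opid"])
    return selected


sample_ops = [
    {"opid": 1, "active": True, "secs_running": 120, "planSummary": "COLLSCAN"},
    {"opid": 2, "active": True, "secs_running": 1, "planSummary": "COLLSCAN"},
    {"opid": 3, "active": True, "secs_running": 300, "planSummary": "IXSCAN { a: 1 }"},
    {"opid": 4, "active": False, "secs_running": 500, "planSummary": "COLLSCAN"},
]
print(select_ops_to_terminate(sample_ops))  # prints [1]
```

In this picture, each selected `opid` is what the policy would pass to killOp, and operations 2, 3, and 4 continue unaffected.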

While the policy is active, long-running operations in your application that match the policy criteria fail with an InterruptedDueToOverload error. Shorter operations continue to succeed. To learn more about handling overload errors, see Overload Errors.
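One way an application might react to this error is to retry a bounded number of times with exponential backoff and jitter, so that retries do not add pressure to an overloaded cluster. The following is a minimal sketch under stated assumptions: `OperationError` is a hypothetical stand-in for whatever exception your driver raises, and `run_with_backoff` and its parameters are illustrative names, not part of any MongoDB driver. Whether retrying is appropriate for your workload is covered in Overload Errors.

```python
import random
import time

OVERLOAD_CODE_NAME = "InterruptedDueToOverload"  # error name from the text above


class OperationError(Exception):
    """Hypothetical stand-in for a driver exception that carries a server code name."""

    def __init__(self, code_name):
        super().__init__(code_name)
        self.code_name = code_name


def run_with_backoff(operation, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(); retry only on overload errors, up to max_attempts in total."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except OperationError as exc:
            # Re-raise unrelated errors at once, and give up after the last attempt.
            if exc.code_name != OVERLOAD_CODE_NAME or attempt == max_attempts:
                raise
            # Exponential backoff with full jitter: wait 0..base_delay * 2^(attempt - 1).
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

The `sleep` parameter exists so that tests can substitute a fake and avoid real waiting. Keeping `max_attempts` small matters here, because an operation terminated for inefficiency is likely to be terminated again while overload persists.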

Observability

You can use the following methods to track how Query Sentinel is affecting your workload:

Monitor Cluster Metrics: Operation throttling metrics show the number of operations that IWM policies have terminated.
Configure Alerts:
- Cluster overload conditions trigger the default alerts for Intelligent Workload Management alert conditions. To learn how to manage alerts, see Configure Alert Settings.
- When cluster overload conditions resolve, Atlas writes informational events to the activity feed to indicate that IWM policies are no longer active. To learn more, see the IWM activity feed events.
