This page provides a set of recommended Prometheus alerts for self-managed mongot deployments. The alert definitions are starting points. You can copy, adapt, and tune these alert definitions to fit your workload.
Each alert entry includes the following fields:
| Field | Description |
|---|---|
| Severity | For details, see Alert Tiers. |
| What It Tells You | Operational meaning of the alert. |
| PromQL | Example expressions. Adapt the metric names to your environment. |
| Threshold Rationale | Reason for the given threshold. |
| First Response | Actions the on-call engineer should take. |
Configure alerts for the Page tier first. Run these alerts for a week and tune any thresholds that produce false positives. Then add the Ticket and Watch alerts.
Alert Tiers
| Tier | When to Alert |
|---|---|
| Page | Customer-visible impact is already occurring or about to occur. Address this alert as soon as possible. |
| Ticket | Operational degradation. Address within hours. |
| Watch | Useful on dashboards or for trend analysis. No immediate action required. |
Page Tier
The following alerts indicate customer-visible impact and require immediate attention.
mongot Process Is Down
mongot is not responding to Prometheus metrics fetches. Search and vector search are not functioning properly.
Use the following PromQL expression to alert on this condition:
up{job="mongot"} == 0
Set the duration to one minute.
A brief miss might be a transient network error. A sustained absence of longer than one minute is an outage.
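As a minimal sketch, the expression and one-minute duration above could be written as a Prometheus alerting rule like the following. The group name, alert name, severity label, and annotation text are illustrative assumptions, not names defined by mongot or by this page.

```yaml
groups:
  - name: mongot-page            # hypothetical group name
    rules:
      - alert: MongotDown        # hypothetical alert name
        expr: up{job="mongot"} == 0
        for: 1m                  # duration from the guidance above
        labels:
          severity: page         # assumed label convention for routing
        annotations:
          summary: "mongot is not responding to Prometheus scrapes"
```

The job label value must match the job_name in your own scrape configuration.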
Respond to this alert by running one of the following commands depending on how you run your mongot:
For deployments using Kubernetes, run kubectl get pods.
For deployments using systemd, run systemctl status mongot.
For deployments using Docker, run docker ps.
Check logs for the crash cause. For troubleshooting steps, see mongot Logs and FTDC.
Crash Loop
The process is restarting repeatedly. The deployment is unstable.
Use the following PromQL expression to alert on this condition:
changes(mongot_process_start_time_seconds[10m]) > 3
More than three restarts in 10 minutes is a crash loop. Crash loops are not a temporary failure and need to be addressed.
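A possible rule for this condition is sketched below. The alert and group names and the severity label are assumptions. No for clause is set because the 10-minute window is already part of the expression.

```yaml
groups:
  - name: mongot-page              # hypothetical group name
    rules:
      - alert: MongotCrashLoop     # hypothetical alert name
        expr: changes(mongot_process_start_time_seconds[10m]) > 3
        labels:
          severity: page           # assumed routing label
        annotations:
          summary: "mongot restarted more than three times in 10 minutes"
```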
Respond to this alert by performing the following actions:
Capture logs from the most recent crash window.
Suspend automated restarts so you can inspect a stopped pod or process.
Open an FTDC capture.
Replication Lag Growing Unboundedly
mongot cannot keep up with mongod. Search results become increasingly out-of-date. If replication lag is left uncorrected, the cursor falls off the oplog and forces a full re-sync.
Use one of the following PromQL expressions to alert on this condition:
max(mongot_index_stats_indexing_replicationLagMs) > 60000
Or, to catch a growing trend before the absolute threshold is reached, use:
deriv(max(mongot_index_stats_indexing_replicationLagMs)[15m:1m]) > 500
This metric is per-index and in milliseconds. Do not divide the metric by 1000 in PromQL. Steady-state lag is below one second. One minute of lag is acceptable for catch-up scenarios. A steadily growing lag is the alarm condition.
The mongot_index_stats_* family of metrics is only present once at least one search index exists. On a fresh deployment with no indexes, this alert does not fire because the series do not exist yet. This is expected behavior.
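The two expressions above can sit side by side as separate rules, so that the trend rule fires before the absolute one. This is a sketch only: the rule names and the severity label are assumptions, and no for duration is set because the page does not specify one.

```yaml
groups:
  - name: mongot-page                          # hypothetical group name
    rules:
      - alert: MongotReplicationLagHigh        # hypothetical alert name
        # The metric is already in milliseconds: 60000 ms = 1 minute.
        expr: max(mongot_index_stats_indexing_replicationLagMs) > 60000
        labels:
          severity: page                       # assumed routing label
        annotations:
          summary: "mongot replication lag exceeds one minute"
      - alert: MongotReplicationLagGrowing     # hypothetical alert name
        # deriv() returns a per-second slope, so 500 means lag is
        # growing by 500 ms every second over the 15-minute window.
        expr: deriv(max(mongot_index_stats_indexing_replicationLagMs)[15m:1m]) > 500
        labels:
          severity: page
        annotations:
          summary: "mongot replication lag is growing steadily"
```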
Respond to this alert by performing the following actions:
Check mongod write rate for a sudden spike.
Check mongot CPU and disk I/O for saturation.
For guidance, see Metrics Reference for mongot.
Sync Exceptions Occurring
mongot is encountering errors during sync. Repeated exceptions force a re-sync. Indexes are temporarily unavailable or stale during the re-sync.
Use one of the following PromQL expressions to alert on this condition.
To alert based on errors during sync, use:
increase(mongot_index_stats_indexing_steadyStateExceptions_total[10m]) > 0
To catch initial sync exceptions, use:
increase(mongot_index_stats_indexing_initialSyncExceptions_total[10m]) > 0
Any steady-state exception in production is a problem. Either the oplog rolled over, meaning that the mongod oplog is too small or mongot is too slow, or a downstream error occurred.
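Both counters can be covered in one rule group, as in the sketch below. Rule names, the severity label, and the use of the indexId_logString label in the annotation are assumptions. Check which labels these series carry in your deployment before relying on the template.

```yaml
groups:
  - name: mongot-page                            # hypothetical group name
    rules:
      - alert: MongotSteadyStateSyncException    # hypothetical alert name
        expr: increase(mongot_index_stats_indexing_steadyStateExceptions_total[10m]) > 0
        labels:
          severity: page                         # assumed routing label
        annotations:
          # Assumes the series carries an indexId_logString label.
          summary: "Steady-state sync exception on index {{ $labels.indexId_logString }}"
      - alert: MongotInitialSyncException        # hypothetical alert name
        expr: increase(mongot_index_stats_indexing_initialSyncExceptions_total[10m]) > 0
        labels:
          severity: page
        annotations:
          summary: "Initial sync exception on index {{ $labels.indexId_logString }}"
```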
Respond to this alert by performing the following actions:
Capture mongot logs around the exception.
Check mongod oplog size.
Open an FTDC capture immediately. It is harder to determine the root cause after the fact.
Heap Exhaustion Imminent
JVM heap is near its limit. An OutOfMemoryError is about to occur.
Use the following PromQL expression to alert on this condition:
sum(mongot_jvm_memory_used_bytes{area="heap"}) / sum(mongot_jvm_memory_max_bytes{area="heap"} > 0) > 0.85
Set the duration to five minutes.
Sustained heap usage above 85% leaves little headroom: the next allocation spike can cause an out-of-memory error. mongot uses the Garbage-First (G1) garbage collector by default. The following expression is a more precise post-GC version of the above PromQL expression:
mongot_jvm_gc_live_data_size_bytes / mongot_jvm_gc_max_data_size_bytes > 0.85
Respond to this alert by checking active indexing operations and query load. If a large index is building, the condition may resolve when the build completes. Otherwise, increase the -Xmx heap setting or reduce the number of concurrent operations.
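A sketch of the heap rule with the five-minute duration described above follows. The names and the severity label are assumptions. The expression is written as a YAML block scalar only to keep the line readable.

```yaml
groups:
  - name: mongot-page                   # hypothetical group name
    rules:
      - alert: MongotHeapNearLimit      # hypothetical alert name
        # The "> 0" filter drops series whose reported max is not
        # positive, so they cannot distort the ratio.
        expr: |
          sum(mongot_jvm_memory_used_bytes{area="heap"})
            / sum(mongot_jvm_memory_max_bytes{area="heap"} > 0)
            > 0.85
        for: 5m                         # duration from the guidance above
        labels:
          severity: page                # assumed routing label
        annotations:
          summary: "mongot JVM heap usage above 85% for five minutes"
```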
Disk Fill and mongot Self-Protection Cascade
mongot enforces three thresholds on disk usage on the dataPath volume. These thresholds are enforced in the mongot binary itself. The thresholds take effect whether or not you are monitoring. Set alerts at all three thresholds so that the on-call engineer sees the cascade and can act before the final threshold is crossed.
| Disk Used | What mongot Does | Customer-Visible Impact | Severity |
|---|---|---|---|
| 85% (15% free) | | Only visible if a new index is created. Existing search and vector search continue normally. | Ticket |
| 90% (10% free) | | Search results are out of date with | Page |
| 95% (5% free) | | All search and vector search become unavailable. | Page |
This alert has three rules. Severity escalates at each threshold.
Use the following PromQL expressions to alert on these conditions:
85%, Ticket level:
(1 - mongot_system_disk_space_data_path_free_bytes / mongot_system_disk_space_data_path_total_bytes) >= 0.85
90%, Page level:
(1 - mongot_system_disk_space_data_path_free_bytes / mongot_system_disk_space_data_path_total_bytes) >= 0.90
95%, Page level (outage):
(1 - mongot_system_disk_space_data_path_free_bytes / mongot_system_disk_space_data_path_total_bytes) >= 0.95
The mongot /metrics endpoint exposes _free_bytes and _total_bytes. Compute used percentage as 1 - free/total.
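The three thresholds could be expressed as three rules in one group, as sketched here. Alert names, the severity label values, and the annotation wording are assumptions. Because the conditions are cumulative, all three rules fire together once usage reaches 95%.

```yaml
groups:
  - name: mongot-disk                   # hypothetical group name
    rules:
      - alert: MongotDiskUsed85         # hypothetical alert name
        expr: (1 - mongot_system_disk_space_data_path_free_bytes / mongot_system_disk_space_data_path_total_bytes) >= 0.85
        labels:
          severity: ticket              # assumed routing label
        annotations:
          summary: "dataPath volume is {{ $value | humanizePercentage }} full"
      - alert: MongotDiskUsed90         # hypothetical alert name
        expr: (1 - mongot_system_disk_space_data_path_free_bytes / mongot_system_disk_space_data_path_total_bytes) >= 0.90
        labels:
          severity: page
        annotations:
          summary: "dataPath volume is {{ $value | humanizePercentage }} full"
      - alert: MongotDiskUsed95         # hypothetical alert name
        expr: (1 - mongot_system_disk_space_data_path_free_bytes / mongot_system_disk_space_data_path_total_bytes) >= 0.95
        labels:
          severity: page
        annotations:
          summary: "dataPath volume is {{ $value | humanizePercentage }} full (outage threshold)"
```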
Respond to this alert by performing the following actions:
At 85%: Audit and drop unused indexes through the search index management API. Never delete files under dataPath manually. New indexes do not build until disk usage drops below 85%.
At 90%: Alert your operations or SRE team that the cluster is in replication-disabled state. Drop unused indexes or expand storage to restore replication.
At 95%: This is an outage. Free disk space, then restart mongot. mongot refuses to restart cleanly until disk is freed.
Ticket Tier
The following alerts indicate operational degradation and should be addressed within hours.
Sustained Query Latency Degradation
Users are experiencing slow searches.
Use one of the following PromQL expressions to alert on this condition:
Cross-index:
max(mongot_command_searchCommandTotalLatency_seconds{quantile="0.99"}) > <your-SLO-threshold>
Per-index breakdown:
max(mongot_index_stats_query_searchResultBatchLatencies_seconds{quantile="0.99"}) by (indexId_logString) > <your-SLO-threshold>
The threshold depends on your Service-Level Objective. A common starting point is a 99th-percentile latency of 500 milliseconds for $search and one second for $vectorSearch. These series are summaries with pre-baked quantile labels, not histograms.
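The sketch below fills in the cross-index expression with 0.5 seconds, the $search starting point mentioned above. That value, the 10-minute for duration, the rule names, and the severity label are all assumptions. Replace them with numbers drawn from your own SLO.

```yaml
groups:
  - name: mongot-ticket                   # hypothetical group name
    rules:
      - alert: MongotSearchLatencyP99High # hypothetical alert name
        # The metric is in seconds, so 0.5 corresponds to 500 ms.
        # 0.5 is an assumed threshold, not a recommendation for every workload.
        expr: max(mongot_command_searchCommandTotalLatency_seconds{quantile="0.99"}) > 0.5
        for: 10m                          # assumed; the page does not state a duration
        labels:
          severity: ticket                # assumed routing label
        annotations:
          summary: "mongot search p99 latency is {{ $value }}s"
```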
Respond to this alert by investigating the following:
Executor queue depth.
JVM Garbage Collection pause time.
Storage IOPS to identify the bottleneck.
Executor Pool Saturation
Workers are saturated and tasks are queuing. Query latency is about to increase.
Use the following PromQL expression to alert on this condition:
max({__name__=~"mongot_.+_executor_queued_tasks"}) > 10
Set the duration to five minutes.
A small, brief queue is normal under load spikes. A sustained queue means worker capacity is undersized. The common hotspot pools are:
mongot_decoding_executor
mongot_change_stream_sync_dispatcher_executor
mongot_indexing_work_executor
mongot_indexing_lifecycle_executor
mongot_index_commit_executor
The indexing work is split across several specialized pools. There is no combined mongot_indexing_executor.
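One way to write this as a rule, using the five-minute duration given above, is sketched here. The rule names and the severity label are assumptions. Because max() collapses all pools into one series, the alert itself does not say which pool is queuing.

```yaml
groups:
  - name: mongot-ticket                    # hypothetical group name
    rules:
      - alert: MongotExecutorQueueBacklog  # hypothetical alert name
        # Matches the queued-task gauge of every executor pool by name.
        expr: max({__name__=~"mongot_.+_executor_queued_tasks"}) > 10
        for: 5m                            # duration from the guidance above
        labels:
          severity: ticket                 # assumed routing label
        annotations:
          summary: "A mongot executor pool has more than 10 queued tasks"
```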
Respond to this alert by identifying which pool is queuing:
topk(5, sum by (__name__) ({__name__=~"mongot_.+_executor_queued_tasks"}))
Scale up or increase the pool size. The queue depth ramp provides early warning before query latency rises.
Storage Advisory: Sustained IOPS
The storage volume is approaching saturation. Lucene latency is increasingly disk-bound.
Use the following PromQL expression to alert on this condition:
rate(mongot_system_disk_reads_events{name="<dataPath device>"}[5m]) > 1000
Set the duration to 15 minutes.
The 1,000 IOPS threshold is the storage class recommendation flag. However, it is not a strict limit. The right number depends on your device. Identify your dataPath device with df or by inspecting mongot_system_disk_* label values.
Respond to this alert by checking whether a merge or initial sync is in progress. If the high IOPS level is sustained, the storage class is likely undersized. Revisit your storage configuration.
Storage Advisory: Page Fault Rate
The OS is repeatedly pulling index pages from disk because they have been evicted from cache. Memory is the constraint, not storage capacity.
Use the following PromQL expression to alert on this condition:
rate(mongot_system_process_majorPageFaults_operations[5m]) > 1000
1,000 major faults per second is the canonical threshold for memory pressure on the critical path. Together with sustained IOPS, this is the signal for memory shortage relative to the working set.
Respond to this alert by adding memory. Lucene memory-maps index files, so more RAM lets the OS page cache hold more of the index.
Indexing Failures Nonzero
A specific index encountered a non-trivial indexing failure.
Use one of the following PromQL expressions to alert on this condition:
increase(mongot_lifecycle_failedInitializationIndexes_total[10m]) > 0
Or:
increase(mongot_indexing_steadyStateChangeStream_unexpectedBatchFailures_total[10m]) > 0
Or:
increase(mongot_index_stats_indexing_invalidGeometryField_total[10m]) > 0
These counters do not increment under normal load. An increase indicates a data issue such as:
A mapping explosion.
An oversized document.
An invalid document.
An increase in these counters might also indicate a code-path problem.
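Since any of the three counters signals a failure, one option is to give each its own rule so that the alert name tells you which path failed. The names and the severity label below are assumptions.

```yaml
groups:
  - name: mongot-ticket                          # hypothetical group name
    rules:
      - alert: MongotIndexInitializationFailed   # hypothetical alert name
        expr: increase(mongot_lifecycle_failedInitializationIndexes_total[10m]) > 0
        labels:
          severity: ticket                       # assumed routing label
      - alert: MongotUnexpectedBatchFailure      # hypothetical alert name
        expr: increase(mongot_indexing_steadyStateChangeStream_unexpectedBatchFailures_total[10m]) > 0
        labels:
          severity: ticket
      - alert: MongotInvalidGeometryField        # hypothetical alert name
        expr: increase(mongot_index_stats_indexing_invalidGeometryField_total[10m]) > 0
        labels:
          severity: ticket
```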
Respond to this alert by performing the following actions:
Inspect labels for the affected index and reason.
Check mongot logs for the underlying exception.
Embedding-Channel Disruption
Automated embedding is encountering problems. The indexing process stalls for new documents on the affected indexes.
Use one of the following PromQL expressions to alert on this condition:
increase(mongot_indexing_steadyStateChangeStream_rescheduledEmbeddingGetMores_total[10m]) > 0
Or:
increase(mongot_initialsync_queue_requeuedEmbeddingInitialSyncs_total[10m]) > 0
Sustained rescheduling or requeuing indicates the embedding path is not draining cleanly. The most common causes are:
An invalid API key.
An unreachable network endpoint.
Voyage AI rate limiting.
Respond to this alert by performing the following actions:
Check mongot logs for the HTTP error against the embedding endpoint.
Verify API key validity and connectivity.
Check Voyage AI status.
Forced Index Status Transition
One or more indexes transitioned out of STEADY state into a recovery, stale, or failed state.
Use the following PromQL expression to alert on this condition:
count by (status) (mongot_index_stats_indexStatusCode{status!="STEADY"} == 1) > 0
A single index in RECOVERING_TRANSIENT for a few seconds during deployment is normal. A sustained count greater than zero in any of the following states indicates a problem:
FAILED
RECOVERING_NON_TRANSIENT
STALE
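To keep brief transitions during deployment from alerting, a rule can require the condition to hold for some time. In this sketch the five-minute for duration is an assumption, as are the rule names and the severity label. The page only says that a few seconds is normal and that a sustained count is a problem.

```yaml
groups:
  - name: mongot-ticket                     # hypothetical group name
    rules:
      - alert: MongotIndexNotSteady         # hypothetical alert name
        expr: count by (status) (mongot_index_stats_indexStatusCode{status!="STEADY"} == 1) > 0
        for: 5m                             # assumed; tune to your deployment windows
        labels:
          severity: ticket                  # assumed routing label
        annotations:
          # status survives the aggregation because of "count by (status)".
          summary: "{{ $value }} index(es) in state {{ $labels.status }}"
```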
Respond to this alert by identifying the affected indexId_logString and checking the corresponding mongot log lines.
FTDC Executor Failure
The diagnostic capture pipeline is failing. mongot is otherwise healthy but you have lost observability for that node.
Note
This metric is gated behind the ftdcExecutorMetricsToPrometheus feature flag. Confirm whether your deployment exposes this metric before adding this alert. The metric is absent from default mongodb/mongodb-community-search scrapes. By default, this flag is off for self-managed deployments.
Use the following PromQL expression to alert on this condition:
mongot_mongot_ftdc_executor_failure_total > 0
Set the duration to five minutes.
If your deployment exposes this metric, treat it as a serious signal that downstream observability is degraded.
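If the metric is exposed in your deployment, a rule along these lines applies the five-minute duration given above. The names and the severity label are assumptions. When the series is absent, the expression returns nothing and the rule stays silent.

```yaml
groups:
  - name: mongot-ticket                    # hypothetical group name
    rules:
      - alert: MongotFtdcExecutorFailure   # hypothetical alert name
        # Only meaningful when the ftdcExecutorMetricsToPrometheus
        # feature flag is on and the series exists.
        expr: mongot_mongot_ftdc_executor_failure_total > 0
        for: 5m                            # duration from the guidance above
        labels:
          severity: ticket                 # assumed routing label
        annotations:
          summary: "mongot FTDC capture is failing; diagnostics may be incomplete"
```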
Respond to this alert by restarting mongot.
Watch Tier
The following metrics are useful on dashboards for trend analysis. None of these metrics require paging.
Heap Utilization (post-GC)
This metric shows heap utilization after Garbage Collection over time.
Use the following PromQL expression to alert on this condition:
sum(mongot_jvm_gc_live_data_size_bytes) / sum(mongot_jvm_gc_max_data_size_bytes)
Investigate if this ratio climbs steadily over several weeks.
GC Pause Time
This metric shows the worst recent pause across collectors.
Use the following PromQL expression to alert on this condition:
max(mongot_jvm_gc_pause_seconds_max)
Investigate if pauses stay above 100 ms for a sustained period.
Open File Descriptors
The metric shows the headroom against the soft limit for open file descriptors.
Use the following PromQL expression to alert on this condition:
mongot_process_*
Investigate if usage exceeds 80% of the soft limit.
Cursor Timeouts
This metric shows the number of clients holding cursors past timeout.
Use the following PromQL expression to alert on this condition:
rate(mongot_cursorManager_trackedCursors[5m])
You do not need to set an alert threshold for this metric. It is purely informational.
Connection Pool Wait
This metric shows the number of threads waiting for a connection.
Compare the following metric against the corresponding _maxSize metric for the connection pool. The condition to watch for is the checked-out count approaching the maximum:
mongot_mongoClient_connectionPool_connectionsCheckedOut
Respond to this alert by investigating if the metric is sustained and greater than zero.
Disk Free Percentage
This metric shows the fraction of storage capacity that remains free on the dataPath volume.
Use the following PromQL expression to alert on this condition:
mongot_system_disk_space_data_path_free_bytes / mongot_system_disk_space_data_path_total_bytes
If this ratio drops below 0.30 (30% free), start planning a storage increase.
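For dashboards, the ratio can be precomputed with a recording rule and reused by a low-urgency alert. This is a sketch: the recorded series name follows the common level:metric:operations naming pattern but is a hypothetical name, as are the alert name and the severity label.

```yaml
groups:
  - name: mongot-watch                                  # hypothetical group name
    rules:
      # Precompute the free-space ratio once per evaluation.
      - record: mongot:disk_space_data_path_free:ratio  # hypothetical series name
        expr: mongot_system_disk_space_data_path_free_bytes / mongot_system_disk_space_data_path_total_bytes
      # Optional watch-tier alert built on the recorded series.
      - alert: MongotDiskFreeBelow30Percent             # hypothetical alert name
        expr: mongot:disk_space_data_path_free:ratio < 0.30
        labels:
          severity: watch                               # assumed routing label
        annotations:
          summary: "dataPath volume has less than 30% free space"
```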