Monitoring Tool Integrations

This page describes how to integrate mongot metrics and logs with common monitoring platforms. These instructions assume that you already run one of these tools and need the mongot-specific configuration. This page does not teach Prometheus, Grafana, or another platform from scratch.

This guidance targets site reliability engineers and platform teams who integrate mongot into an existing observability stack.

The following table summarizes the surfaces that mongot exposes for monitoring:

| Surface | Protocol | Default endpoint | Configured under | Notes |
| --- | --- | --- | --- | --- |
| Metrics | HTTP, Prometheus text format | localhost:9946/metrics | metrics.address | The included config.default.yml binds to localhost:9946 only. Override it to 0.0.0.0:9946 for off-host scraping. mongot applies no authentication by default. Protect the endpoint at the network layer. |
| Liveness | HTTP | localhost:8080/health | healthCheck.address | Returns {"status":"SERVING"} after mongot binds its services. |
| Readiness | HTTP | localhost:8080/ready | healthCheck.address | Returns {"status":"SERVING"} when mongot is ready to receive traffic. This means replication is initialized and catalog indexes are queryable. If no indexes exist, mongot reports ready. |
| Logs | stdout and stderr, or a file | Per logging configuration | logging | JSON or text, depending on the configuration. |
| FTDC | On-disk binary stream | <storage.dataPath>/diagnostic.data/ | advancedConfigs.ftdc | Enabled by default. Tune or disable with advancedConfigs.ftdc. To learn more, see mongot Logs and FTDC. |

Prometheus with Grafana is the recommended monitoring stack for most self-managed deployments. The stack is free, widely supported, and works directly with the mongot metrics endpoint.

Add a scrape job to your Prometheus configuration:

scrape_configs:
  - job_name: mongot
    scrape_interval: 15s
    scrape_timeout: 10s
    static_configs:
      - targets:
          - mongot-host-1.internal:9946
          - mongot-host-2.internal:9946
        labels:
          deployment: prod
          edition: ce
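To spot-check what a scrape job like this collects, the Prometheus text format can be parsed with a few lines of Python. The following is a minimal sketch, not a replacement for a real Prometheus client. The `parse_metrics` helper is hypothetical, the sample values are invented for illustration, and the parser ignores timestamps and comment lines and does not unescape label values.

```python
import re
from typing import Dict, List, Tuple

Sample = Tuple[str, Dict[str, str], float]

# A sample line is: metric name, optional {labels}, whitespace, value.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> List[Sample]:
    """Parse Prometheus text-format samples into (name, labels, value)."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Invented sample payload that reuses metric names from this page.
example = """\
# TYPE mongot_index_stats_indexing_replicationLagMs gauge
mongot_index_stats_indexing_replicationLagMs 1200
mongot_command_searchCommandTotalLatency_seconds{quantile="0.99"} 0.25
"""

parsed = parse_metrics(example)
```

In practice, the text passed to `parse_metrics` would come from an HTTP GET against the metrics endpoint of one instance.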

For Kubernetes deployments that the MongoDB Controllers for Kubernetes Operator manages, use a PodMonitor or ServiceMonitor resource with the Prometheus Operator. Target the pods labeled app=<resource-name>-search:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: mongot
  namespace: <mongot-namespace>
spec:
  selector:
    matchLabels:
      app: <resource-name>-search
  podMetricsEndpoints:
    - port: metrics
      interval: 15s

Recording rules reduce PromQL repetition and make Grafana queries faster. The following rules use the metric names that self-managed mongot exposes:

groups:
  - name: mongot_recording
    interval: 30s
    rules:
      - record: mongot:search_latency_p99
        expr: max(mongot_command_searchCommandTotalLatency_seconds{quantile="0.99"})
      - record: mongot:vector_search_latency_p99
        expr: max(mongot_command_vectorSearchCommandTotalLatency_seconds{quantile="0.99"})
      - record: mongot:search_rate:rate5m
        expr: sum(rate(mongot_command_searchCommandTotalLatency_seconds_count[5m]))
      - record: mongot:search_failure_rate:rate5m
        expr: sum(rate(mongot_command_searchCommandFailure_total[5m]))
      - record: mongot:replication_lag_ms:max
        expr: max(mongot_index_stats_indexing_replicationLagMs)
      - record: mongot:heap_utilization_post_gc
        expr: mongot_jvm_gc_live_data_size_bytes / mongot_jvm_gc_max_data_size_bytes
      - record: mongot:gc_pause_worst
        expr: max(mongot_jvm_gc_pause_seconds_max)
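The two `rate5m` rules apply `rate()` to counters, which yields a per-second increase over the window. The following Python sketch illustrates the idea under simplifying assumptions: it treats any drop in the counter as a reset, as a mongot restart would cause, and it omits the window-boundary extrapolation that Prometheus performs. The `simple_rate` helper and the sample values are hypothetical.

```python
from typing import Sequence, Tuple


def simple_rate(samples: Sequence[Tuple[float, float]]) -> float:
    """Per-second increase of a counter over (timestamp, value) samples.

    Samples must be sorted by timestamp. A value lower than its
    predecessor is treated as a counter reset to zero.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        # After a reset, the whole new value counts as increase.
        increase += value - previous if value >= previous else value
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must increase")
    return increase / elapsed


# Invented search-count samples, one per minute, with a restart
# (counter reset) between the third and fourth sample.
counter = [(0, 100), (60, 160), (120, 220), (180, 10), (240, 70), (300, 130)]
searches_per_second = simple_rate(counter)
```

Here the counter increases by 250 over 300 seconds despite the reset, so the sketch reports roughly 0.83 searches per second.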

Translate the recommended alerts into Prometheus alert rules. For example:

groups:
  - name: mongot_alerts
    rules:
      - alert: MongotDown
        expr: up{job="mongot"} == 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "mongot is down ({{ $labels.instance }})"
      - alert: MongotReplicationLagGrowing
        expr: deriv(max(mongot_index_stats_indexing_replicationLagMs)[15m:1m]) > 500
        for: 10m
        labels:
          severity: page
      - alert: MongotHeapPressure
        expr: mongot:heap_utilization_post_gc > 0.85
        for: 5m
        labels:
          severity: page

To translate the full set of alerts into PromQL, see Recommended Alerts for mongot.
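The MongotReplicationLagGrowing rule relies on the PromQL `deriv()` function, which fits a line through the samples by simple linear regression and returns its slope per second. Because the metric is in milliseconds, the threshold of 500 means that lag grows by more than 500 ms for every second of wall-clock time. The following Python sketch reproduces that calculation on invented samples. The `deriv` helper here is a hypothetical stand-in, not the Prometheus implementation.

```python
from typing import Sequence, Tuple


def deriv(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of value over time, in value units per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("timestamps must not all be equal")
    return covariance / variance


# Invented replication-lag samples at one-minute spacing over 15 minutes.
# Lag grows by 36,000 ms per minute, which is 600 ms per second.
lag_samples = [(60.0 * i, 5_000.0 + 36_000.0 * i) for i in range(16)]
slope_ms_per_second = deriv(lag_samples)
would_alert = slope_ms_per_second > 500
```

With these samples the slope is about 600 ms per second, so the alert expression would evaluate to true. The `for: 10m` clause then requires that condition to hold for ten minutes before the alert fires.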

A starter Grafana dashboard for mongot should include the following panel groups:

  • Process: uptime, restart count, CPU, and resident memory.

  • JVM: heap used compared to max, post-GC heap, GC pause time, and threads.

  • Replication: current state, lag in milliseconds, rate of change in lag, and events applied per second.

  • Indexing: active builds, per-index status, indexing failures, and merge backlog.

  • Query: rate by operator; p50, p95, and p99 latency by operator; and error rate.

  • Executors: queue depth by pool and rejected tasks.

  • Storage: free bytes, IOPS, and page-fault rate.

  • Embedding (if enabled): request rate, latency, errors, and token throughput.

If your organization standardizes on OpenTelemetry, the OpenTelemetry Collector can ingest mongot metrics from the Prometheus endpoint and forward them to any OTLP-compatible backend:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: mongot
          scrape_interval: 15s
          static_configs:
            - targets:
                - localhost:9946
exporters:
  otlphttp:
    endpoint: https://<your-otel-backend>
service:
  pipelines:
    metrics:
      receivers:
        - prometheus
      exporters:
        - otlphttp

This pattern is provider-neutral. The same collector configuration works for Honeycomb, Grafana Cloud, New Relic, and other backends.

To forward logs, configure mongot to write JSON to stdout. The collector can then parse the structured fields and route the logs to your backend.

mongot writes structured JSON logs to stdout and stderr by default. Forward stdout to your centralized log platform and ingest the logs as JSON.

Fluent Bit and Vector both work for mongot log collection. Treat the logs as a tagged stream. To learn which log patterns matter most, see mongot Logs and FTDC.

For AWS-hosted deployments, the CloudWatch agent can tail the log file directly. Create a CloudWatch metric filter on key log patterns, such as "Exception requiring resync", to convert log events into metrics.
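Converting a log pattern into a metric amounts to counting matching lines per time window. The following Python sketch shows that idea for the resync pattern. It is an illustration only: the `count_matches` helper is hypothetical, and the log lines, including their field names, are invented rather than copied from real mongot output. The sketch matches on the raw line so that it does not depend on any particular JSON field.

```python
from typing import Iterable

RESYNC_PATTERN = "Exception requiring resync"


def count_matches(lines: Iterable[str], pattern: str = RESYNC_PATTERN) -> int:
    """Count log lines that contain the pattern as a plain substring."""
    return sum(1 for line in lines if pattern in line)


# Invented log lines. The JSON field names are assumptions.
window = [
    '{"severity": "INFO", "message": "Index build finished"}',
    '{"severity": "ERROR", "message": "Exception requiring resync"}',
    '{"severity": "ERROR", "message": "Exception requiring resync"}',
]

resync_events = count_matches(window)
```

Emitting `resync_events` once per window gives a time series that an alert can watch, which is the same effect that a metric filter provides.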

mongot exposes two HTTP endpoints on port 8080 by default:

| Endpoint | Use For | Meaning |
| --- | --- | --- |
| /health | Liveness | mongot has bound its services. Use this endpoint to detect a crashed or hung process. It does not indicate that mongot can serve queries. |
| /ready | Readiness | mongot has finished initializing index replication. Use this endpoint to gate traffic into a pod. |

Both endpoints return JSON with HTTP 200. Treat {"status":"SERVING"} as healthy and {"status":"NOT_SERVING"} as unhealthy. An invalid query parameter returns HTTP 400 with {"error":"BAD_REQUEST"}.

Like the metrics endpoint, the /health and /ready endpoints are unauthenticated by default. Protect them at the network layer.
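For checks that run outside Kubernetes, such as a cron job or a custom load balancer hook, the response body is the signal to inspect. The following Python sketch classifies a response using only the statuses described above. The `is_serving` and `probe` helpers are hypothetical names, and the two-second timeout is an arbitrary choice.

```python
import json
import urllib.error
import urllib.request


def is_serving(status_code: int, body: str) -> bool:
    """Return True only for HTTP 200 with a JSON status of SERVING."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "SERVING"


def probe(url: str, timeout: float = 2.0) -> bool:
    """Fetch a health URL and classify it. Any network error is unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return is_serving(response.status, body)
    except (urllib.error.URLError, OSError):
        return False
```

A call such as `probe("http://localhost:8080/ready")` then returns `False` for NOT_SERVING, for a BAD_REQUEST response, and for an unreachable process alike, which keeps the caller's logic simple.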

In Kubernetes, map the liveness probe to /health and the readiness probe to /ready:

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3

If you use /health for both probes, a pod can receive traffic before its indexes initialize, because /health returns SERVING as soon as the services bind. The two-endpoint split exists to separate these signals.

When the MongoDB Controllers for Kubernetes Operator manages more than one mongot pod, it provisions a default load balancer and routes traffic based on the /ready endpoint for you. For self-managed deployments that run their own load balancer in front of multiple mongot instances, configure the load balancer to route traffic only to instances that return SERVING on /ready.
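For a self-managed load balancer, the routing rule reduces to filtering the backend pool by the /ready status. The following Python sketch shows that filter with the HTTP call injected as a function, so the logic can be exercised without a running mongot. The `ready_backends` helper, the `fake_fetch` stand-in, and the observed statuses are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def ready_backends(
    instances: Iterable[str],
    fetch_status: Callable[[str], str],
) -> List[str]:
    """Keep only instances whose /ready endpoint reports SERVING.

    fetch_status takes a URL and returns the "status" field of the
    JSON response. It may raise OSError when the instance is unreachable.
    """
    ready: List[str] = []
    for host in instances:
        try:
            status = fetch_status(f"http://{host}/ready")
        except OSError:
            continue  # Unreachable instances receive no traffic.
        if status == "SERVING":
            ready.append(host)
    return ready


# Invented observations for three instances. The third is unreachable.
observed: Dict[str, str] = {
    "http://mongot-host-1.internal:8080/ready": "SERVING",
    "http://mongot-host-2.internal:8080/ready": "NOT_SERVING",
}


def fake_fetch(url: str) -> str:
    if url not in observed:
        raise ConnectionRefusedError(url)
    return observed[url]


pool = ready_backends(
    [
        "mongot-host-1.internal:8080",
        "mongot-host-2.internal:8080",
        "mongot-host-3.internal:8080",
    ],
    fake_fetch,
)
```

Only the first host remains in `pool`. Most load balancers express the same rule declaratively as an HTTP health check, so this sketch is mainly useful for reasoning about the expected behavior.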

To keep a pod ready even when some indexes fail to initialize, set the readiness probe path to /ready?allowFailedIndexes=true. This setting is a deliberate tradeoff, because failed indexes return empty results for queries that reach them.

If your deployment runs more than one mongot instance, each instance exposes its own metrics endpoint. Scrape each instance individually, then use Prometheus aggregations, such as sum, max, and avg, to see a combined view of the metrics.

Track replication lag, executor queue depth, and query latency for each instance and in aggregate. A single saturated instance can degrade latency for the queries it serves, and fleet-wide averages can hide this degradation.
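A small worked example shows how an average can mask one unhealthy instance. The following Python sketch uses invented replication-lag values for three instances. The third host name and all numbers are hypothetical.

```python
from statistics import mean

# Invented replication lag per instance, in milliseconds.
lag_ms_by_instance = {
    "mongot-host-1.internal:9946": 200.0,
    "mongot-host-2.internal:9946": 300.0,
    "mongot-host-3.internal:9946": 45_000.0,
}

# The fleet-wide average blends the healthy and unhealthy instances.
fleet_avg_ms = mean(lag_ms_by_instance.values())

# The per-instance maximum points directly at the lagging instance.
worst_instance, worst_lag_ms = max(
    lag_ms_by_instance.items(), key=lambda item: item[1]
)
```

The average comes out near 15 seconds, a value that describes none of the instances, while the maximum identifies the one instance that is 45 seconds behind. This is why the recording rules on this page aggregate lag with `max` rather than `avg`.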

For sharded clusters, label each scrape with the shard name so that you can roll the metrics up per shard.
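Once each sample carries a shard label, a per-shard roll-up is a group-by on that label. The following Python sketch mirrors what a PromQL expression such as `max by (shard) (...)` would compute. The label name `shard`, the shard names, and the values are assumptions for illustration, and `max_by_label` is a hypothetical helper.

```python
from typing import Dict, Iterable, Tuple

LabeledSample = Tuple[Dict[str, str], float]


def max_by_label(samples: Iterable[LabeledSample], label: str) -> Dict[str, float]:
    """Group samples by one label and keep the maximum value per group."""
    result: Dict[str, float] = {}
    for labels, value in samples:
        key = labels.get(label)
        if key is None:
            continue  # Skip samples that were scraped without the label.
        if key not in result or value > result[key]:
            result[key] = value
    return result


# Invented replication-lag samples from four instances across two shards.
samples = [
    ({"instance": "mongot-a-1", "shard": "shard-a"}, 250.0),
    ({"instance": "mongot-a-2", "shard": "shard-a"}, 900.0),
    ({"instance": "mongot-b-1", "shard": "shard-b"}, 120.0),
    ({"instance": "mongot-b-2", "shard": "shard-b"}, 80.0),
]

lag_by_shard = max_by_label(samples, "shard")
```

The result maps each shard to the lag of its slowest instance, which makes it easy to see whether a problem is confined to one shard.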

When you open a MongoDB Support case, attach the FTDC capture from the affected mongot instance. To learn the capture procedure, see mongot Logs and FTDC.