Peek at your MongoDB Clusters like a Pro with Keyhole: Part 2

Ken Chen
January 8, 2020

This is the second stop on Peek at your MongoDB Clusters like a Pro with Keyhole. In Part 1, we covered how to use keyhole to collect cluster information, generate HTML reports, and review whether the provisioned resources are adequate to support the application. In Part 2, we’ll discuss performance evaluations from FTDC metrics and mongo logs. FTDC, short for Full Time Diagnostic data Capture, is MongoDB's internal diagnostic data stored in a proprietary format. MongoDB records diagnostic information every second.

In Part 2 of this blog post, you will learn how to:

Integrate Keyhole with Grafana
Understand Cluster Performance & Bottlenecks from FTDC data
Identify Slow Operations from Logs

From the information revealed by Keyhole from reading MongoDB logs and FTDC data, you should be able to identify the performance bottlenecks 95% of the time. The complete code of Keyhole is available from this GitHub repository.

Integrate Keyhole with Grafana

MongoDB servers store FTDC data every second. This data includes mongo server status and many hardware resource metrics. To display FTDC metrics visually, Keyhole reads FTDC data and works as a SimpleJson Datasource to support a Grafana UI. Grafana is an open source analytics & monitoring solution for databases.
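To make the datasource role more concrete, here is a minimal sketch of the JSON shapes a SimpleJson backend exchanges with Grafana: a search request returns the available metric names, and a query request returns one series per target, with each datapoint given as a [value, epoch milliseconds] pair. The SAMPLES store and the handle_search and handle_query names are hypothetical, and this is not Keyhole's own code.

```python
import json

# Hypothetical in-memory store: metric name -> list of (epoch_ms, value) samples.
SAMPLES = {
    "wt_cache_dirty": [(1507838934000, 0.4), (1507839234000, 0.7)],
}

def handle_search():
    """Body for POST /search: the metric names Grafana can pick from."""
    return sorted(SAMPLES)

def handle_query(request):
    """Body for POST /query: one series per requested target.

    SimpleJson expects each datapoint as [value, epoch_milliseconds].
    """
    response = []
    for target in request.get("targets", []):
        name = target["target"]
        points = [[value, ts] for ts, value in SAMPLES.get(name, [])]
        response.append({"target": name, "datapoints": points})
    return response

query = {"targets": [{"target": "wt_cache_dirty", "type": "timeserie"}]}
print(json.dumps(handle_query(query)))
```

A real backend would wrap these two functions in HTTP handlers and also answer a plain GET on the root path, which Grafana uses to test the connection.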

Install Grafana and SimpleJson Plugin

Follow Grafana’s installation documents to install Grafana on your favorite OS. For macOS users, after installing Grafana, start it with the command:

brew services start grafana

Continue configuring Grafana by navigating to http://localhost:3000 in a browser, then install the grafana-simple-json-datasource plugin. For example, on macOS you can use the commands below to install the plugin and restart the Grafana service:

grafana-cli plugins install grafana-simple-json-datasource
brew services restart grafana

Add Keyhole as Default Datasource

The next step is to add Keyhole as the default datasource. Locate the Data Source tab under Configuration, and configure the SimpleJson datasource as shown below:

Keyhole as a datasource

You can ignore the HTTP Error Bad Gateway message for now. It appears only because Keyhole has not been started yet.

Import Keyhole FTDC Analytics Template

Download the MongoDB FTDC Analytics template from GitHub. On the Dashboard, click the New dashboard icon, then import the downloaded file. Use the exact parameters shown below. These values must match exactly so that Grafana can locate the correct datasource.

Keyhole Full Time Diagnostic data Capture Template

Click the Import button to complete the configuration. You should be redirected to a MongoDB FTDC Analytics dashboard with a number of blank metrics panels.

Understand Cluster Performance & Bottlenecks from FTDC data

The FTDC data files are kept under a directory called diagnostic.data under your mongo database path. Copy the entire diagnostic.data directory (or simply a few files under it) to the computer where you have Grafana and Keyhole installed. Then, start Keyhole as follows:

keyhole --web --diag ./diagnostic.data/

2019/11/25 14:10:43 reading 1 files with 300 second(s) interval
2019/11/25 14:10:43 metrics.2017-10-12T20-08-53Z-00000 blocks: 164 , time: 245.534381ms Memory Alloc = 58 MiB, TotalAlloc = 101 MiB
2019/11/25 14:10:43 1 files loaded, time spent: 245.718665ms
2019/11/25 14:10:43 Stats from 2017-10-12T20:08:54Z to 2017-10-13T04:29:23Z
2019/11/25 14:10:43 host-0 xxx-00:27018
...
2019/11/25 14:10:43 data points ready for xxx-00:27018 , time spent: 1.244537ms
…
http://localhost:3000/d/simagix-grafana/mongodb-mongo-ftdc?orgId=1&from=1507838934000&to=1507868963000

After Keyhole completes reading all FTDC data files, a URL is provided at the end of the console output. Open the link in a browser to see the previous metrics panels filled with charts, for example:
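The from and to parameters in that URL are the start and end of the reported stats range, expressed as epoch milliseconds. As a small worked example using the timestamps from the sample output above (the to_epoch_ms helper is a hypothetical name, not part of Keyhole):

```python
from datetime import datetime, timezone

def to_epoch_ms(stamp):
    """Convert a UTC timestamp such as 2017-10-12T20:08:54Z to epoch milliseconds."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)

start_ms = to_epoch_ms("2017-10-12T20:08:54Z")
end_ms = to_epoch_ms("2017-10-13T04:29:23Z")
print(f"from={start_ms}&to={end_ms}")
# prints from=1507838934000&to=1507868963000
```

Knowing this makes it easy to edit the URL by hand when you want Grafana to open on a narrower time window.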

Keyhole Full Time Diagnostic data Capture Analytics

Although these charts present only a few of the MongoDB FTDC metrics, they are enough for me to diagnose the health of a MongoDB cluster. My evaluation steps are outlined below:

WiredTiger Tickets

First of all, check whether the WiredTiger tickets dropped to zero at any point in time. In the WiredTiger storage engine, read and write tickets are used to control concurrency. By default, there are 128 read and 128 write tickets. Below is an example of a cluster that ran out of available read tickets at many points in time:

If the server ran out of read tickets, check MongoDB logs for long-running database operations. (We’ll discuss how to identify slow operations later in this blog.) Resolving the problem can be as simple as adding indexes to improve query performance and free up read tickets more quickly. If all queries are supported by proper indexes and the problem is instead caused by a high transaction rate of read operations, consider adding more shards to the MongoDB cluster.

If the server ran out of write tickets, the problem is likely related to disk performance, such as high disk latency or under-provisioned IOPS. Cross-reference the Disk IOPS and Disk Utilization (%) panels to see whether the disk reached its maximum IOPS.
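The ticket check can also be expressed in code. The sketch below assumes you already have a time-ordered series of (timestamp, available tickets) samples for either read or write tickets. The exhausted_intervals name and the input shape are hypothetical and do not reflect how Keyhole stores its data.

```python
def exhausted_intervals(samples):
    """Return (start, end) timestamp pairs during which available tickets were zero.

    samples: list of (timestamp, available_tickets) tuples in time order.
    """
    intervals = []
    start = None
    last_ts = None
    for ts, available in samples:
        if available <= 0 and start is None:
            start = ts                          # exhaustion begins
        elif available > 0 and start is not None:
            intervals.append((start, last_ts))  # exhaustion ended at the previous sample
            start = None
        last_ts = ts
    if start is not None:
        intervals.append((start, last_ts))      # still exhausted at the end of the data
    return intervals

read_tickets = [(0, 128), (1, 0), (2, 0), (3, 50), (4, 0)]
print(exhausted_intervals(read_tickets))
# prints [(1, 2), (4, 4)]
```

The resulting intervals give you the time ranges to look up in the MongoDB logs when searching for long-running operations.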

The growing number of queues in the Queues panel is a reflection of the WiredTiger Tickets panel. When WiredTiger is out of tickets, subsequent requests will be queued, for example:

Queues

Disk IOPS

The line charts in the Disk IOPS panel usually fluctuate sharply, so you will likely see seesaw-shaped charts. However, if the IOPS of a disk device shows as a plateau, it’s very likely that the device has reached its maximum provisioned IOPS. The solution is to increase disk IOPS. If you still use spindle disks, consider an upgrade to SSD or NVMe for better performance. MongoDB has good results and a good price-performance ratio with SATA SSD.
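The seesaw-versus-plateau distinction can be approximated with a simple heuristic: look for long runs of samples that stay close to the highest value observed. This is only a sketch. The find_plateaus name, the 2% tolerance, and the minimum run length of five samples are assumptions chosen for illustration, and the observed maximum is not necessarily the provisioned limit.

```python
def find_plateaus(iops, tolerance=0.02, min_run=5):
    """Return (start_index, end_index) runs where IOPS stays within `tolerance`
    (as a fraction) of the observed maximum for at least `min_run` samples."""
    if not iops:
        return []
    threshold = max(iops) * (1 - tolerance)
    runs, start = [], None
    for i, value in enumerate(iops):
        if value >= threshold:
            if start is None:
                start = i                     # a run near the ceiling begins
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - 1))   # keep only sufficiently long runs
            start = None
    if start is not None and len(iops) - start >= min_run:
        runs.append((start, len(iops) - 1))
    return runs

seesaw = [100, 900, 200, 1500, 300, 800]
plateau = [400, 1200, 3000, 2990, 3000, 2995, 3000, 2998, 900]
print(find_plateaus(seesaw))   # prints []
print(find_plateaus(plateau))  # prints [(2, 7)]
```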

Disk Input/Output Operations per Second

WiredTiger Tickets and Disk IOPS panels are the first two I’ll review. The other panels can show different bottlenecks and/or potential problems. Different cases have very different chart patterns.

WiredTiger Cache (GB)

The wt_cache_dirty metric, shown in the chart below, indicates data in the cache that has been modified but not yet flushed to disk. A growing amount of dirty data usually implies that the rate of writing to the database overwhelms the provisioned disk IOPS.

Metrics

A high number of scan_objects with a low number of scan_keys is typically an indication of a missing index or of inefficient indexes used by the mongo query engine. A growing number of scan_sort implies that the mongo query engine didn’t use an index key to sort. Instead, it had to load all documents into memory before sorting. In either case, we can identify the offending operations from the mongo logs, as discussed below in Identifying Slow Operations from Logs. The chart below is an example of a Metrics panel of a cluster that has a high transaction rate of reads.
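As a rough illustration of the scan_objects versus scan_keys comparison, the sketch below flags a sample when documents examined greatly outnumber index keys examined. The function names and the threshold of 10 are assumptions for illustration only, not values used by Keyhole or recommended by MongoDB.

```python
def scan_ratio(scan_objects, scan_keys):
    """Documents examined per index key examined; None when no keys were scanned."""
    if scan_keys == 0:
        return None
    return scan_objects / scan_keys

def looks_unindexed(scan_objects, scan_keys, max_ratio=10.0):
    """Flag a sample when documents examined greatly outnumber index keys examined."""
    if scan_objects == 0:
        return False  # nothing was scanned, so there is nothing to flag
    ratio = scan_ratio(scan_objects, scan_keys)
    return ratio is None or ratio > max_ratio

print(looks_unindexed(50000, 0))    # prints True
print(looks_unindexed(1200, 1000))  # prints False
```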

Connections

I look at conns_created_per_minute to find out whether connection pools were always used. Ideally, when connection pools are used, there should be a minimal number of connections created per minute. If you see spikes of conns_created_per_minute like those in the chart below, ask your developers whether all applications use connection pools.

Connections

CPU Usages (%)

If a system is properly provisioned, you shouldn’t see CPU pegged out, i.e. 0% of CPU idle. If you have slow disks with a high transaction rate of writes, you will probably see a growing percentage of cpu_iowait. If your mongo server is a virtual machine hosted on a resource overcommitted host, you could see a high flying line of cpu_system even though the mongo server itself has low activity because other VMs could possibly use the CPU cycles heavily.

CPU Usages

Replication Lags (seconds)

Replication lag is the delay between an operation on the primary and the application of that operation from the oplog to the secondary. In other words, it shows how far a secondary is behind the primary. This chart is only available when diagnosing a replica set. A high replication lag can be due to networking issues, slow oplog application on secondary nodes, and/or insufficient write capacity. Below is an example:

Replication Lags in seconds

Other than application tuning, better hardware (faster network switches and/or disks) is likely required to reduce replication lags. A common solution is to replace existing servers with better hardware or to add more shards. Before a new solution is in place, make sure there is enough oplog window (the interval of time between the oldest and the latest entries in the oplog) to avoid losing data.
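The oplog window arithmetic is simple enough to sketch. The example below computes the window from the oldest and latest oplog entry times and compares it against the current replication lag. The function names, the sample timestamps, and the 50% safety factor are assumptions for illustration.

```python
from datetime import datetime

def oplog_window(first_entry, last_entry):
    """Oplog window: time between the oldest and the latest oplog entries."""
    return last_entry - first_entry

def secondary_at_risk(lag_seconds, window, safety_factor=0.5):
    """True when the lag consumes more than `safety_factor` of the oplog window."""
    return lag_seconds > window.total_seconds() * safety_factor

window = oplog_window(datetime(2017, 10, 12, 20, 0), datetime(2017, 10, 13, 4, 0))
print(window)                               # prints 8:00:00
print(secondary_at_risk(1800, window))      # prints False
print(secondary_at_risk(6 * 3600, window))  # prints True
```

A secondary that falls further behind than the oplog window can no longer catch up from the oplog, which is why the margin matters.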

Identifying Slow Operations from Logs

Keyhole, with the --loginfo flag, reads mongo logs and prints a summary of slow operations grouped by query patterns (filters). The input files can be either plain text or gzipped files. Being able to read gzipped files comes in handy. Below is an example result.

The complete usage of log analytics is as follows:

keyhole --loginfo log_file[.gz] [--collscan] [-v]

With the -v flag, Keyhole also prints the original logs of the top 20 slow operations. In addition, with the --collscan flag, Keyhole only outputs operations missing indexes. A query pattern marked as COLLSCAN means that it didn’t use an index and adding a proper index should resolve this. On the other hand, even if an index was used it might not be efficient enough to support the query pattern. It’s quite possible to find indexes that are optimized for querying and others for sorting. Compound indexes can be structured to optimize for both whenever possible.
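To illustrate what grouping by query pattern means, here is a minimal sketch that reduces each filter to its shape by replacing values with 1, then aggregates durations per namespace and pattern. It works on already-parsed operations rather than raw log lines. The ns, filter, and millis field names are hypothetical, and this is not Keyhole's implementation.

```python
from collections import defaultdict

def query_pattern(filter_doc):
    """Reduce a filter to its shape by replacing every leaf value with 1."""
    if isinstance(filter_doc, dict):
        return {key: query_pattern(value) for key, value in sorted(filter_doc.items())}
    if isinstance(filter_doc, list):
        return [query_pattern(item) for item in filter_doc]
    return 1

def summarize(slow_ops):
    """Group ops by (namespace, pattern); report count, average and max milliseconds."""
    groups = defaultdict(list)
    for op in slow_ops:
        key = (op["ns"], repr(query_pattern(op["filter"])))
        groups[key].append(op["millis"])
    summary = [
        {"ns": ns, "pattern": pattern, "count": len(ms),
         "avg_ms": sum(ms) / len(ms), "max_ms": max(ms)}
        for (ns, pattern), ms in groups.items()
    ]
    # Slowest patterns first, mirroring a "top slow operations" view.
    return sorted(summary, key=lambda row: row["avg_ms"], reverse=True)

ops = [
    {"ns": "shop.orders", "filter": {"status": "open", "total": {"$gt": 100}}, "millis": 420},
    {"ns": "shop.orders", "filter": {"status": "closed", "total": {"$gt": 5}}, "millis": 380},
    {"ns": "shop.users", "filter": {"email": "a@example.com"}, "millis": 150},
]
for row in summarize(ops):
    print(row)
```

The two shop.orders queries collapse into a single pattern even though their values differ, which is what makes the summary useful for deciding where an index would help most.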

Note that an output file with a “.enc” extension is created when log parsing completes. This is designed for administrators who are reluctant to copy log files off the servers, or prohibited from doing so. Instead, they can process the logs on the server and later view the summary by using the command below:

keyhole --loginfo mongodb.log.enc

Or, you can use Maobi to generate an HTML report to have a more user-friendly view of the results. Below is an example report:

Sample Maobi Report

Recap

Combined with methods described in Part 1, you can use the Keyhole tool to quickly identify performance bottlenecks and tune them accordingly. In summary, Keyhole helps you to:

Verify new installations
Collect MongoDB cluster information including configurations and statistics
Visualize resource usages in a snapshot
Identify performance bottlenecks and slow database operations

As always, I would love to hear from you about the Keyhole tool. Please get in touch to let me know your thoughts.
