Introducing mtools
By
Thomas Rueckstiess
, Kernel Program Manager at MongoDB
NOTE
: Mtools is a popular tool developed by MongoDB engineers, but, as noted on its
GitHub repository
, it is not a product of MongoDB or supported as such. Bug reports, feature requests and all other questions can be posted in the
Issues
section of the mtools GitHub repository.
mtools
is a collection of helper scripts, implemented in Python, to parse and filter MongoDB log files (both for mongod and mongos), to visualize information from log files and to quickly set up complex MongoDB test environments on a local machine.
I started working on
mtools
a year ago, when I realized I would automate and script most of my daily tasks as an Engineer at
MongoDB
. Since then,
mtools
has grown to a suite of flexible, useful command line tools that are being used by many of our Engineers internally, as well as MongoDB customers and users, to diagnose the root cause of system issues.
If you find yourself looking at MongoDB log files to identify system and performance issues, then I encourage you to try
mtools
as well.
What’s in the box?
mtools
in its current version 1.1.4 consists of 5 individual scripts:
mloginfo
,
mlogfilter
,
mplotqueries
,
mlogvis
and
mlaunch
.
mloginfo
should be your first stop on the log file analysis. This script will parse the file quickly and output general information about its contents, including start and end date and time, line numbers, version and whether the file came from a mongos or mongod (if available in the file). In addition, you can request certain “sections” of additional information; currently those are “queries”, “connections”, “restarts” and “distinct”.
mlogfilter
helps to narrow down the search in log files. The script lets you filter on attributes of log messages, like their namespace (database and collection names), their type of operation (queries, inserts, updates, commands, etc.) or by individual connection. You can also search for slow operations by setting a threshold, identify collection scans (those are the queries not using an index) and other properties. Additional features include slicing the log files by time (with flexible date/time parsing), merging files, shifting them to different time zones or converting timestamp formats, and exporting them to JSON. The key property of
mlogfilter
is that the output format always remains the same (log lines), so you can pipe the output to another instance of
mlogfilter
, to the grep command or to other scripts like
mplotqueries
.
mplotqueries
takes a log file (mlogfiltered or not) and presents the information visually in various ways. There are a number of options for graph types, such as scatter plots (showing all operations over time vs. their duration), histograms, event and range plots, and other more specialized graphs like connection churn or replica set changes. Independent of graph type, you can assign a specific color to different class categories.
mlogvis
is
mplotqueries’
little brother, it is very similar in its functionality, but provides a web-based alternative using the
d3.js
javascript visualization library. This is particularly useful if the dependencies required by
mplotqueries
are not installed/available, or if you want to create a self-contained interactive graph that can be shared with others, such as customers or colleagues.
mlogvis
will create a single
.html
file that can be shared, since it loads the d3.js library dynamically.
mlaunch
is a little different from the other scripts, and actually has nothing to do with log file parsing.
mlaunch
spins up any number of mongodb nodes on your local machine, either as a stand-alone, as replica sets or sharded clusters. This is useful if you want to do testing or reproduce issues locally. Rather than setting this up manually,
mlaunch
will start the processes and connect the replica sets or shards together. Within a few seconds, you can have a complex environment running, like a 5 shard cluster, each shard consisting of a replica set, authentication enabled, and any kinds of individual flags you want to pass onto the processes.
mlaunch
also has options to start and stop individual instances or groups, and to view which ones are running in the current environment and which ones are down.
How does it work?
Rather than going through all the features of each of the scripts, I’d just like to demonstrate two basic use cases. For a full list of features you can visit the
mtools wiki
, which contains the manual and many usage examples.
Use Case 1: Profiling your Queries with
mloginfo
You have a number of slow queries running against MongoDB that are affecting the performance of the database. To get an idea of where MongoDB is slowing down, as a first step take a look at the “queries” section of
mloginfo
. Here is an example output, created with the following command:
mloginfo mongod.log --queries
QUERIES
namespace pattern count min (ms) max (ms) mean (ms) 95%-ile (ms) sum (ms)
serverside.scrum_master {"datetime_used": {"$ne": 1}} 20 15753 17083 16434 17039.3 328692
serverside.django_session {"_id": 1} 562 101 1512 317 842.6 178168
serverside.user {"_types": 1, "emails.email": 1} 804 101 1262 201 684.85 162311
local.slaves {"_id": 1, "host": 1, "ns": 1} 131 101 1048 310 819.5 40738
serverside.email_alerts {"_types": 1, "email": 1, "pp_user_id": 1} 13 153 11639 2465 8865.2 32053
serverside.sign_up {"_id": 1} 77 103 843 269 551.0 20761
serverside.user_credits {"_id": 1} 6 204 900 369 763.75 2218
serverside.counters {"_id": 1, "_types": 1} 8 121 500 263 470.6 2111
serverside.auth_sessions {"session_key": 1} 7 111 684 277 645.6 1940
serverside.credit_card {"_id": 1} 5 145 764 368 705.0 1840
serverside.email_alerts {"_types": 1, "request_code": 1} 6 143 459 277 415.0 1663
Each line (from left to right) shows the namespace, the query pattern, and various statistics of this particular namespace/pattern combination. The rows are sorted by the “sum” column, descending. Sorting by sum is a good way to see where the database spent most of its time overall. In this example, we see that around half the total time is spent on
$ne
-type queries in the
serverside.scrum_master
collection.
$ne
queries are known to be inefficient since these queries cannot use an index, resulting in a high number of documents scanned. In fact, all of the queries took at least 15 seconds (“min” column). The “count” column also shows that only 20 of the queries were issued, yet these queries contributed to a large amount of the total time spent, more than double the time of the 804 email queries on
serverside.user
.
When optimizing queries and indexes, starting from the top of this list is a good idea as these optimizations will result in the highest gains in terms of performance.
Use Case 2: Visualizing Log Files with
mplotqueries
Another way of looking at query performance and other operations is to visualize them graphically. mplotqueries’ scatter plot (the default) shows the duration of any operation (y-axis) over time (x-axis) and makes it easy to spot long-running operations. The following plot is generated with
mplotqueries mongod.log
and then press L for “logarithmic” y-axis view:
While most of the operations are sub-second (below the 10^3 ms mark), the blue dots immediately stand out, reaching up to the hundreds and thousands of seconds. Clicking on one of the blue dots prints out the relevant log line to stdout:
Sat Apr 5 22:29:54 [conn99] command serverside.$cmd command: { getlasterror: 1, w: "majority" } ntoreturn:1 keyUpdates:0 reslen:112 1006370ms
The
getlasterror
command is used for write concern. In this case, it blocked until the write was replicated to a majority of nodes in the replica set, which took 16 minutes. That is of course an issue, and because this is a command and not a query (or the query part of an update), it didn’t show up in the previous use case with
mloginfo --queries
.
To investigate this further, we can overlay the current plot with an “rsstate” plot, that shows replica set status changes over time. The following two commands create an overlay of the two plots:
grep "majority" mongod.log | mplotqueries --overlay
mplotqueries mongod.log --type rsstate
This shows that for each of the blocking “majority”
getlasterrors
, replica set members are unavailable. The red vertical lines represent a node being DOWN, preceding the yellow lines for a node being in
SECONDARY
state again, at which point the getlasterror commands finally succeed.
From here, the next step would be to look at all the log files of the replica set at one of the incidents and investigate why the secondaries became unavailable:
mlogfilter mongod.log mongod-sec.log mongod-arb.log --from Apr 5 20:15 --to +5min
This last command merges the log files of the three replica set members by time, each line prefixed with the filename, slices out a 5-minute window at the first instance of the issue and prints the lines back to stdout.
What’s Next?
This should give you a sense of how to use mtools for diagnosing and debugging issues affecting your MongoDB system. You can organize and visualize data in a number of ways, form a hypothesis, filter out noise and dig deeper into issues affecting your deployment, all from MongoDB log files.
mtools
contains many more useful features that our Support team uses daily in working through customer cases. The best way to learn how you can leverage these scripts is to
download and install
mtools
and follow some of the examples on the
mtools wiki
page.
mtools
is open source and available for download on
github
. It is also in the PyPI package index and can be installed via
pip
. If you have any questions, bug reports or feature requests, simply go to the
mtools github issues
page and open an issue.
My colleague Asya Kamsky (from
askasya.com
) will show some more examples on how
mtools
can be useful for diagnosing and troubleshooting in her talk
Diagnostics and Debugging
at
MongoDB World
. I’ll be in the “Ask the Experts” sessions, so if you have any questions you can come ask in the Ask the Experts room. You can use my discount code “25ThomasRueckstiess” for 25% off tickets.
May 8, 2014