This is part two of a three-part guest series by Alex Giamas, Co-Founder and CTO of CareAcross.
Having the metrics is a useful first step, but it shouldn’t be the end goal. Far more important than viewing the metrics on a web page is having clear procedures for how to act on them. In my case, most of the problems arose from replication lag and page faults. When replication lag was high, our application would automatically fall back to reading from the primary server, which is always up to date. The engineers could then investigate the root cause of the issue and fix it. For page faults, the process was lengthier and usually meant going back to the application and improving the queries or the design that was causing them.
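As a rough illustration of the fallback described above (a minimal sketch, not the code we actually ran), the routing decision can be reduced to a single comparison. The `ReadRouter` class, its `max_lag_seconds` limit and the 10-second value are all hypothetical names and numbers chosen for the example; how the lag sample is obtained is left out.

```python
from dataclasses import dataclass


@dataclass
class ReadRouter:
    """Chooses where reads go based on the latest replication lag sample."""

    max_lag_seconds: float  # hypothetical limit; tune it to your own baseline

    def target_for(self, lag_seconds):
        # An unknown lag (None) is treated as unsafe, so reads go to the primary.
        if lag_seconds is None or lag_seconds > self.max_lag_seconds:
            return "primary"
        return "secondary"


router = ReadRouter(max_lag_seconds=10.0)
print(router.target_for(2.5))   # lag within the limit
print(router.target_for(45.0))  # lag above the limit
```

Treating a missing sample as "too stale" is a deliberate choice here: if monitoring itself is broken, reading from the primary is the safer default.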
For every key metric, set sensible alert thresholds that email or text someone, and define a clear procedure for what to do for each type of alert. The word "sensible" deserves emphasis. An alert should signal a real situation that requires action. Set the threshold too low and you’ll receive alerts all the time and eventually get desensitized to them. Set the threshold too high and, by the time you get the alert, you may have already lost data or otherwise be too late to act.
Unfortunately, it takes a bit of time before you can establish what normal is for your system. Once you have a baseline, you can set up the alerts to make sure that you are operating within normal parameters.
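One way to keep each threshold tied to its procedure is to store them together. The sketch below is only an illustration of that idea: `AlertRule`, `triggered_alerts`, the metric names and the threshold values are all made up for the example and are not part of MMS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str
    threshold: float
    procedure: str  # what the person receiving the alert should do


def triggered_alerts(rules, samples):
    """Return (rule, value) pairs for every metric above its threshold."""
    alerts = []
    for rule in rules:
        value = samples.get(rule.metric)
        if value is not None and value > rule.threshold:
            alerts.append((rule, value))
    return alerts


# Hypothetical rules; the numbers are placeholders, not recommendations.
RULES = [
    AlertRule("replication_lag_seconds", 30.0,
              "Route reads to the primary, then investigate the root cause."),
    AlertRule("page_faults_per_second", 100.0,
              "Review the queries and design behind the faults."),
]

samples = {"replication_lag_seconds": 75.0, "page_faults_per_second": 12.0}
for rule, value in triggered_alerts(RULES, samples):
    print(f"ALERT {rule.metric}={value}: {rule.procedure}")
```

Because the procedure travels with the rule, whoever gets the email or text sees what to do next without having to look it up.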
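A baseline can be turned into a starting threshold in many ways. One common heuristic, shown here purely as an assumption and not as something MMS does for you, is to take the mean of the historical samples plus a few standard deviations. The function name, the factor of three and the sample data are all hypothetical.

```python
import statistics


def baseline_threshold(history, num_stddevs=3.0):
    """Suggest an alert threshold as mean + num_stddevs * standard deviation."""
    if len(history) < 2:
        raise ValueError("need at least two samples to estimate a baseline")
    mean = statistics.mean(history)
    spread = statistics.stdev(history)  # sample standard deviation
    return mean + num_stddevs * spread


# Made-up replication lag samples, in seconds, from a "normal" period.
lag_history = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(baseline_threshold(lag_history))
```

A value computed this way is only a first guess; it still needs adjusting once you see how often it fires in practice.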
An overlooked feature of MMS is that you can get a web view of logs and profile data using a single authentication mechanism across your servers. This is useful for troubleshooting when the production servers are locked up in a room and the janitor has eaten the keys ;)
In my next post, I’ll discuss how you can use MMS to QA new code.
For more on setting alerts in MMS, see Five MMS Monitoring Alerts to Keep Your MongoDB Deployment on Track.
The Top 5 Metrics to Watch in MongoDB
This is part one of a three-part guest series by Alex Giamas, Co-Founder and CTO of CareAcross, a stealth mode startup seeking to empower patients. Alex is also a proud Carnegie Mellon alumnus, a graduate of the onsite courses offered at MongoDB University and a Cloudera Certified developer for Apache Hadoop (CDH-410).
At Upstream Systems, Persado, Care Across and through various consulting roles, I have dealt with all types of MongoDB installations, ranging from single server instances and medium size deployments to large cloud-based sharded clusters. Whether the deployment is large or small, monitoring is essential to ensuring performance and reliability. We needed to visualize the health of production environments, maintain a clearly defined procedure for metrics exceeding threshold values, and measure the impact of development changes.
MongoDB Management Service (MMS) is rich with metrics, but in my experience, the most valuable metrics in practice are the following:
Lock percentage: This was more important in earlier versions, where the global write lock could eat you alive and lock yielding was not yet implemented. While it’s less important with more recent versions (please vote on SERVER-1240!), lock percentage still shows a lot about your database activity. A continuously high lock percentage will affect reads, as they will eventually queue up behind writes.
Replication lag: Designing your application to read data from a secondary node can sometimes be a good idea when it reduces the latency of the read. But if your application is using the secondary’s data and you have high replication lag, your application will use stale data. In addition, if the primary node fails while replication lag is high, the secondary that takes over may not be sufficiently up to date.
Journal writes: If your writes are overwhelming your journal file, this will impact the performance and stability of your MongoDB installation.
Page faults: Page faults are expensive to process, and a sufficiently high rate probably means that your working set does not fit in memory. In complex data-driven applications, page faults may indicate a deeper root cause hidden in the implementation of the app’s business logic.
Non Mapped Virtual Memory: When this grows without bound, it usually indicates a memory leak. It’s better to monitor it and proactively restart the server or try to hunt down the leak rather than wait for the crash to happen.
There’s a lot of data in MMS Monitoring, but I have found that these metrics are the most interesting. In my next post, I will go over how to make this data actionable.
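To make the replication lag metric above concrete, here is a minimal sketch of how lag can be derived outside of MMS. It assumes a status document with the `members`, `name`, `stateStr` and `optimeDate` fields that the `replSetGetStatus` command reports; the function name, the host names and the timestamps are hypothetical, and the sample document is hand-written rather than captured from a real server.

```python
from datetime import datetime


def replication_lag_seconds(status):
    """Map each SECONDARY member's name to its lag behind the PRIMARY, in seconds."""
    members = status["members"]
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None:
        raise ValueError("no primary in replica set status")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


# Hand-written sample shaped like replSetGetStatus output; hosts are made up.
sample_status = {
    "members": [
        {"name": "db1.example.net:27017", "stateStr": "PRIMARY",
         "optimeDate": datetime(2014, 1, 1, 12, 0, 30)},
        {"name": "db2.example.net:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2014, 1, 1, 12, 0, 18)},
    ]
}
print(replication_lag_seconds(sample_status))
```

The lag here is simply the difference between the primary's last applied operation time and each secondary's, which is the quantity that matters when deciding whether a secondary's data is too stale to read.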