This is part two of a three-part guest series by Alex Giamas, Co-Founder and CTO of CareAcross.
Having the metrics is a useful first step, but it shouldn’t be the end goal. Far more important than viewing the metrics on a web page is having clear procedures for acting on them. In my case, most of the problems arose from replication lag and page faults. When replication lag was high, our application would automatically fall back to the primary server, which is always up to date. The engineers could then investigate the root cause of the issue and fix it. For page faults, the process was lengthier and usually meant going back to the application and improving the queries or the design that was causing them.
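As a rough illustration of the fallback described above, here is a minimal Python sketch. It is not our production code. It assumes a status dictionary shaped like the output of MongoDB's `replSetGetStatus` command (a `members` list with `name`, `stateStr` and `optimeDate` fields); the 30-second limit, the function names, and the rule that a single lagging secondary triggers the fallback are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Hypothetical limit; the right value depends on your own baseline.
MAX_LAG = timedelta(seconds=30)


def replication_lag(status):
    """Return {member name: lag behind the primary} for every secondary.

    `status` is assumed to look like replSetGetStatus output.
    """
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: primary["optimeDate"] - m["optimeDate"]
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


def choose_read_target(status, max_lag=MAX_LAG):
    """Return "primary" if any secondary lags too far, else "secondary"."""
    lags = replication_lag(status)
    # No secondaries at all, or at least one too far behind: read from primary.
    if not lags or max(lags.values()) > max_lag:
        return "primary"
    return "secondary"
```

The application would call `choose_read_target` before routing a read and pick its read preference accordingly. Falling back as soon as one secondary lags is the conservative choice; a more permissive variant could simply exclude the lagging member.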
For every key metric, set a sensible alert threshold that emails or texts someone, and give that person a clear procedure to follow for each type of alert. The word to emphasize is sensible: an alert should signal a real situation that requires action. Set the threshold too low and you’ll receive alerts all the time and eventually become desensitized to them. Set it too high and by the time you get the alert, you may have already lost data or it may otherwise be too late to act.
Unfortunately, it takes a bit of time to establish what normal is for your system. Once you have a baseline, you can set up the alerts to make sure that you are operating within normal parameters.
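One simple way to turn a baseline into a threshold is sketched below in Python. This is only an illustration under stated assumptions, not something MMS does for you: the mean-plus-three-standard-deviations rule, the requirement of three consecutive breaches, and the function names are all choices made for the example.

```python
import statistics


def alert_threshold(baseline_samples, k=3.0):
    """Derive a threshold as mean + k standard deviations of the baseline."""
    mean = statistics.fmean(baseline_samples)
    stdev = statistics.stdev(baseline_samples)  # needs at least two samples
    return mean + k * stdev


def should_alert(recent_samples, threshold, min_consecutive=3):
    """Alert only if the last `min_consecutive` samples all exceed the threshold."""
    tail = recent_samples[-min_consecutive:]
    return len(tail) == min_consecutive and all(v > threshold for v in tail)


# Hypothetical baseline, e.g. page faults per second sampled during normal load.
baseline = [10, 12, 11, 9, 13, 10, 11, 12]
threshold = alert_threshold(baseline)
```

Requiring several consecutive breaches guards against the "too low" failure mode, since a single spike will not page anyone, while tying the threshold to the measured baseline keeps it from drifting too high.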
An overlooked feature of MMS is that you can get a web view of logs and profile data using a single authentication mechanism across your servers. This is useful for troubleshooting when the production servers are locked up in a room and the janitor has eaten the keys ;)
In my next post, I’ll discuss how you can use MMS to QA new code.
For more on setting alerts in MMS, see Five MMS Monitoring Alerts to Keep Your MongoDB Deployment on Track.