A hard lesson for any system administrator is suffering an outage without even knowing about it. Monitoring is a basic requirement of any healthy online infrastructure, and monitoring systems range from the very simple to the very complex. This article describes how Technodabbler converted a simple cron monitoring solution into a light-weight infrastructure monitoring solution, as an example of how you could set up your own.
Keeping It Simple
Although any system administrator can set up a monitoring system, doing so creates a chicken-and-egg scenario: you then need to monitor the monitoring system. This is why the responsibility is best deferred to a third party.
The solution chosen for Technodabbler had to be simple and inexpensive. In the enterprise setting, New Relic is the gold standard: powerful, scalable, feature-rich and expensive. In the Open Source community, Prometheus is the current favoured solution. Given its sheer number of features and its flexibility, it is pretty time-consuming to set up properly. Both systems also focus on gathering metrics and alerting when a metric crosses a particular threshold, which introduces additional complexity.
When it comes to monitoring an online server, some of the simplest systems are based on the concept of a dead man's switch: if a service fails to report on time, an alert is raised. These are often called cron monitors. With a bit of scripting, these systems can easily be used as a light-weight infrastructure monitoring system.
With so many solutions available on the Internet, choosing a cron monitor can be daunting. Healthchecks.io stood out because of its simple API and wealth of integrations. It is an Open Source project with the code freely available. They are celebrating 8 years of operation, and have been quite open about their technology stack and revenue. We appreciate this level of openness and have been using them reliably for over 4 years.
Building a Light-Weight Monitoring Solution
By design, a cron monitor sends out an alert when it doesn't receive a signal that a job completed successfully within a given time frame. A script can be written to monitor a specific resource, sending a success signal if the resource is healthy. When the script is executed at a regular interval as a cron job, there are three possible outcomes:
- If the script completes successfully, a signal is sent to the cron monitor and nothing happens.
- If the script detects a problem, a failure signal is sent to the cron monitor and an alert is sent.
- If the script fails to run, no signal is received by the cron monitor and the timer lapses, thus an alert is sent.
The following is a simple example of a script that checks if a specific process is currently running. If the process is not found, the cron monitor is called with "/fail" appended to the end of the URL, indicating that the monitor should send an alert immediately. Otherwise, a regular ping is sent to the cron monitor and no alert is triggered.
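A minimal sketch of such a check might look like this. It is not the actual Technodabbler script: the function names are made up for illustration, the ping URL and process name are placeholders passed as arguments, and it assumes `pgrep` and `curl` are installed.

```shell
#!/usr/bin/env bash
# Illustrative process check (hypothetical names, not the published script).
# Usage: ./check_process.sh <ping-url> <process-name>

# Succeed when a process with exactly this name is running.
process_is_running() {
    pgrep -x "$1" > /dev/null 2>&1
}

# Print the URL to call: the plain ping URL when the process is running,
# or the same URL with "/fail" appended when it is not.
select_ping_url() {
    local base_url="$1" process_name="$2"
    if process_is_running "$process_name"; then
        printf '%s\n' "$base_url"
    else
        printf '%s\n' "$base_url/fail"
    fi
}

# Send the success or failure signal to the cron monitor.
main() {
    local base_url="$1" process_name="$2"
    curl -fsS -m 10 --retry 5 -o /dev/null \
        "$(select_ping_url "$base_url" "$process_name")"
}

# Contact the monitor only when both arguments are supplied.
if [[ $# -eq 2 ]]; then
    main "$@"
fi
```

Called as `./check_process.sh https://hc-ping.com/your-check-uuid nginx`, the sketch sends a regular ping while `nginx` is running and a "/fail" ping when it is not. If the script never runs at all, nothing is sent and the monitor's timer takes care of the alert.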
The scripts used by Technodabbler are available on GitHub. To use these scripts:
- Create an account on Healthchecks.io.
- Create a new cron monitor for the resource you want to monitor.
- Clone the repository of checks.
- Add a cron job to execute the check, replacing the check key with the one from the Healthchecks.io console. For example, the following will check every five minutes to report that a server is online. If the check does not execute (because the server is offline), the cron monitor should send an alert.
The scripts are written in Bash to minimize dependencies and simplify deployment. They should function on any modern OS that supports Bash.
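As a rough illustration of the last step, the snippet below builds a crontab entry that runs a check every five minutes. The script path `/opt/checks/heartbeat.sh` and the key `your-check-key` are placeholders, not names taken from the repository, and `cron_entry` is a helper invented for this sketch.

```shell
#!/usr/bin/env bash
# Illustrative only: the path and check key below are hypothetical.

# Print a crontab line that runs the given command every N minutes.
cron_entry() {
    local minutes="$1" command="$2"
    printf '*/%s * * * * %s\n' "$minutes" "$command"
}

CHECK_SCRIPT="/opt/checks/heartbeat.sh"   # placeholder path to a cloned check
CHECK_KEY="your-check-key"                # copy yours from the Healthchecks.io console

cron_entry 5 "$CHECK_SCRIPT $CHECK_KEY"
# Prints: */5 * * * * /opt/checks/heartbeat.sh your-check-key
```

The printed line can be pasted into the crontab opened by `crontab -e`. The monitor's expected period on the Healthchecks.io side would then be set to match the five-minute schedule, with a grace time that suits your tolerance for delays.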
The first monitoring system used by Technodabbler dispatched alerts by email. During the first outage of the email server, it took several hours to notice the problem, as the alerts never reached their destination.
A good monitoring solution should be able to dispatch alerts in multiple ways. Healthchecks.io supports numerous integrations, including Slack, Discord or even an old-fashioned phone call. Technodabbler dispatches alerts using a combination of email, SMS and push notifications. If an alert is missed, it will most likely be noticed in the daily summary report.
No Perfect Solution
No monitoring solution is tailored for all situations. Proper research should be done before a solution is chosen. However, consider some of the advice presented above:
- Don't operate your own monitoring solution. It is just another system to monitor.
- Solutions range from simple to complicated. Pick the right solution for you, but avoid choosing something more complicated than you need.
- Alert through more than one channel, ideally to more than one destination. Avoid alerting through a system you are monitoring.
Often overlooked, monitoring is a critical component of any healthy infrastructure. It is the last line of defence in dealing with unforeseen events. If you are not monitoring your servers, ask yourself if they are currently online.
What monitoring system do you use? Sound off in the comments below with your solutions. If you enjoyed reading about monitoring, you might be interested in learning more about Voice over IP systems or a Kubernetes Continuous Deployment system. And you can get more Technodabbler articles directly in your inbox as they are published by subscribing to our mailing list.