Hello everyone, I'm embarking on some monitoring enhancements and I want to make it really easy for my colleagues (who don't know Prometheus) to write simple tests and have test failures show up as Prometheus alerts.
The fundamental business need is to make it really easy for any admin to add high-signal, low-noise alerts, particularly as a result of incident response.
My initial idea looks something like this:
I have a directory, similar in spirit to /etc/cron.d/ or /etc/cron.hourly/, containing scripts that each test a particular aspect of the system/application. E.g. you might have a simple bash or Python script that runs some test and reports the result through its return code (0 might indicate 'assertion-passed', and non-0 might indicate 'assertion-failed'). The name of the test might be taken from the filename of the script.
A test-runner would run all of these scripts and generate appropriate metrics for consumption by the textfile collector.
Playing with what this might look like in terms of metrics, and thinking about instances where this would have been useful in the past, it might look like:
# HELP assertion Assertion is passing (1) or failing (0)
# TYPE assertion gauge
assertion{test_name="apache_httpd_configtest_okay"} 1
assertion{test_name="drbd_synced"} 1
assertion{test_name="transfer_queue_not_stuck"} 0
assertion{test_name="can_reach_ldap_server"} 1
assertion{test_name="connected_to_accounting_service"} 1
assertion{test_name="federation_metadata_current"} 1
assertion{test_name="login_test"} 1
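To make the idea concrete, here is a rough sketch of what such a test-runner might look like in Python. Everything in it is my own assumption rather than an existing tool: the function names (run_assertions, render_metrics, write_textfile) are hypothetical, it assumes a POSIX host where each script is directly executable, and it assumes the output file lands in whatever directory the textfile collector has been pointed at.

```python
import os
import subprocess
from pathlib import Path


def run_assertions(script_dir):
    """Run every executable file in script_dir; map test name -> 1 (pass) or 0 (fail)."""
    results = {}
    for script in sorted(Path(script_dir).iterdir()):
        # Assumption: POSIX-style executable bit; Windows would need another rule.
        if not script.is_file() or not os.access(script, os.X_OK):
            continue
        completed = subprocess.run(
            [str(script)], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        # The test name is the filename without its extension.
        results[script.stem] = 1 if completed.returncode == 0 else 0
    return results


def escape_label(value):
    """Escape backslash, double quote and newline for use inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(results):
    """Render the results in the Prometheus text exposition format."""
    lines = [
        "# HELP assertion Assertion is passing (1) or failing (0)",
        "# TYPE assertion gauge",
    ]
    for name, value in sorted(results.items()):
        lines.append(f'assertion{{test_name="{escape_label(name)}"}} {value}')
    return "\n".join(lines) + "\n"


def write_textfile(text, dest):
    """Write to a temporary name, then rename, so a scrape never sees a partial file."""
    dest = Path(dest)
    tmp = dest.with_name(dest.name + ".tmp")
    tmp.write_text(text)
    os.replace(tmp, dest)
```

A cron job or timer could then call something like write_textfile(render_metrics(run_assertions("/etc/assertions.d")), "/path/to/textfile/dir/assertions.prom"), where both paths are placeholders. The write-then-rename step is there because the collector may read the directory at any moment.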
Before I go further down this path, I'd like to know if others have done something similar and to survey what works and what doesn't, so I don't take my group down the wrong path. After all, it doesn't take much playing to determine follow-on requirements such as:
* I want this to be as easy as dropping in some logic (tests) in a manner befitting how the server/service is deployed (e.g. manually, via Ansible, etc.). This must not require the Prometheus team to be the ones creating the tests, only to raise alerts in response to assertion failures.
* I need to have some tests run much more frequently than others (like unit-tests are to integration-tests)
* Some assertions will be warnings, others will be more critical
* Process supervision must be present to handle process timeout etc.
* If we migrate from Prometheus to something else (or a later major version of Prometheus), I want this collateral to still be useful, so I need a decent abstraction interface.
* Oh, and this must work on Linux systems as well as Windows.
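A couple of these requirements (timeouts and severity) can be sketched quite cheaply. The following is only an illustration under my own assumptions: run_check and sample_line are hypothetical names, a timed-out test is simply counted as a failed assertion, and severity is carried as an extra label that whoever writes the alert rules could match on.

```python
import subprocess


def run_check(command, timeout_seconds):
    """Return 1 if command exits 0 within timeout_seconds, otherwise 0."""
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,  # the child is killed if this expires
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung or unlaunchable test counts as a failed assertion.
        return 0
    return 1 if completed.returncode == 0 else 0


def sample_line(test_name, severity, value):
    """Format one sample, with severity as a label (an assumed convention)."""
    return f'assertion{{test_name="{test_name}",severity="{severity}"}} {value}'
```

Note that the timeout only kills the direct child process; a test that spawns its own children would need stronger supervision. Different run frequencies could be handled outside this code, e.g. by keeping separate directories and scheduling the runner against each one at its own interval.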
Hopefully I'm not taking Prometheus in a direction that is terribly foreign, as I imagine this is a path that others have already walked.