textfile collector and system/application assertion tests


Cameron Kerr

May 10, 2020, 11:32:42 PM

to Prometheus Users

Hello everyone, I'm embarking on some monitoring enhancements and I'm wanting to make it really easy for my colleagues (who don't know Prometheus) to write simple tests and have test failures show up as a Prometheus alert.

The fundamental business need is to make it really easy for any admin to add high-signal, low-noise alerts, particularly as a result of incident response.

My initial idea looks something like this:

I have a directory, similar in spirit to /etc/cron.d/ or /etc/cron.hourly/, which are essentially scripts that test particular aspects of the system/application. Eg. you might have a simple bash or Python script that does some test and returns some response (eg. return code of 0 might indicate 'assertion-passed', and non-0 might indicate 'assertion-failed'). The name of the test might be taken from the filename of the script.

A test-runner would run all of these scripts and generate appropriate metrics for consumption by the node_exporter textfile collector.

Playing with what this might look like in terms of metrics, and thinking about instances where this would have been useful in the past, it might look like:

# HELP assertion Assertion is passing (1) or failing (0)
# TYPE assertion gauge
assertion{test_name="apache_httpd_configtest_okay"} 1
assertion{test_name="drbd_synced"} 1
assertion{test_name="transfer_queue_not_stuck"} 0
assertion{test_name="can_reach_ldap_server"} 1
assertion{test_name="connected_to_accounting_service"} 1
assertion{test_name="federation_metadata_current"} 1
assertion{test_name="login_test"} 1
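
To make the idea concrete, here is a minimal sketch of what such a test-runner might look like in Python. It is only an illustration under some assumptions: the function names are hypothetical, test names are taken from the script filename (without extension) and are assumed to need no label escaping, and a script that cannot be started is treated as a failing assertion. The file is written to a temporary name and then renamed, so the textfile collector never reads a half-written file.

```python
import os
import subprocess
import tempfile
from pathlib import Path


def render_metrics(results):
    """Render {test_name: passed} in the Prometheus text exposition format."""
    lines = [
        "# HELP assertion Assertion is passing (1) or failing (0)",
        "# TYPE assertion gauge",
    ]
    for name in sorted(results):
        value = 1 if results[name] else 0
        lines.append('assertion{test_name="%s"} %d' % (name, value))
    return "\n".join(lines) + "\n"


def run_assertions(script_dir):
    """Run every file in script_dir; exit code 0 means the assertion passed."""
    results = {}
    for script in sorted(Path(script_dir).iterdir()):
        if not script.is_file():
            continue
        try:
            returncode = subprocess.run([str(script)]).returncode
        except OSError:
            # Assumption: a script that cannot be started counts as failing.
            returncode = 1
        results[script.stem] = returncode == 0
    return results


def write_textfile(path, text):
    """Write to a temporary file in the same directory, then rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    # mkstemp creates the file as owner-only; let the exporter read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)
```

A cron job or scheduled task could then call something like `write_textfile(out_path, render_metrics(run_assertions(script_dir)))`, where `out_path` is a `.prom` file inside whatever directory the textfile collector has been configured to read.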

Before I go further down this path, I'm wanting to know if others have done something similar and to survey what works and what doesn't, so I don't take my group down a wrong path. After all, it doesn't take much playing to determine follow-on requirements such as:

* I want this to be as easy as dropping in some logic (tests) in a manner befitting how the server/service is deployed (eg. manually, via Ansible, etc.). This must not require the Prometheus team to be the ones creating the tests; they should only have to raise alerts in response to assertion failures.

* I need to have some tests run much more frequently than others (like unit-tests are to integration-tests)

* Some assertions will be warnings, others will be more critical

* Process supervision must be present to handle process timeout etc.

* If we migrate from Prometheus to something else (or a later major version of Prometheus), I want this collateral to still be useful, so I need a decent abstraction interface.

* Oh, and this must work on Linux systems as well as Windows.
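
As a rough sketch of the process-supervision requirement, the standard-library `subprocess.run` accepts a `timeout` and behaves the same way on Linux and Windows, so a runner could wrap each test like this. The function name `run_check` and the three outcome strings are hypothetical. Note that only the direct child process is killed on timeout; anything it spawned itself may survive.

```python
import subprocess
import sys


def run_check(command, timeout_seconds):
    """Return 'passed', 'failed', or 'timed_out' for one assertion command."""
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,  # the child is killed if this is exceeded
        )
    except subprocess.TimeoutExpired:
        return "timed_out"
    except OSError:
        # Assumption: a command that cannot be started counts as failing.
        return "failed"
    return "passed" if completed.returncode == 0 else "failed"


# Example: a check that exits non-zero is reported as a failure.
outcome = run_check([sys.executable, "-c", "raise SystemExit(3)"], 10)
```

Keeping a timeout distinct from an ordinary failure means a hung test can be reported differently from a genuinely failing assertion.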

Hopefully I'm not taking Prometheus in a direction that is terribly foreign, as this seems to be something that I imagine others have already walked.

Thanks for reading,

Cameron Kerr

Ben Kochie

May 11, 2020, 4:59:53 AM

to Cameron Kerr, Prometheus Users

While this is a possible pattern, it doesn't typically follow Prometheus best practices.

The idea behind Prometheus is you want to expose data directly from the thing being monitored, and make the logical decision for "ok/not-ok" on the Prometheus server. This allows for a lot of advantages over host-local checks. For example, you take into account data from your entire fleet, not just what the node can see in isolation.

Take your drbd check: rather than exposing a pass/fail result, it would be better to get the drbd metrics directly, for instance by using the drbd collector in the node_exporter.

Other checks you have there, like the login test, are good examples of blackbox tests. But beware that blackbox tests have limited usefulness. They are blind during the time between probes. They also don't really tell you anything about the actual user requests going on. For something like logins, you would want to gather all login attempts, count them, count failures, etc. For systems that you can't instrument directly, like standard OS logins, you could do this with a log tailing metrics generator. For example https://github.com/google/mtail. There are several variations of this kind of tool.
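
To illustrate the difference between probing and counting, here is a toy sketch in Python of the counting approach. It is not how mtail works internally; it only shows the shape of the resulting data. The log patterns are an assumption modelled on OpenSSH sshd messages, and the metric name `login_attempts_total` is hypothetical. A real log tailer would also keep running totals across log rotation, which this sketch does not attempt.

```python
import re

# Assumed log line shapes, modelled on OpenSSH sshd messages.
FAILED = re.compile(r"Failed password for ")
ACCEPTED = re.compile(r"Accepted \S+ for ")


def count_logins(lines):
    """Count successful and failed logins in an iterable of log lines."""
    counts = {"success": 0, "failure": 0}
    for line in lines:
        if FAILED.search(line):
            counts["failure"] += 1
        elif ACCEPTED.search(line):
            counts["success"] += 1
    return counts


def render_login_counters(counts):
    """Render the counts as a counter in the text exposition format."""
    lines = [
        "# HELP login_attempts_total Login attempts seen in the log, by result.",
        "# TYPE login_attempts_total counter",
    ]
    for result in sorted(counts):
        lines.append(
            'login_attempts_total{result="%s"} %d' % (result, counts[result])
        )
    return "\n".join(lines) + "\n"
```

With counters like these, the "ok/not-ok" decision (for example, a failure ratio over the last few minutes) can be made on the Prometheus server rather than on the host.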

One thing to think about, if you generate "normal" Prometheus metrics, many other monitoring platforms support this format now. So building Prometheus-compatible exporters is not wasted effort. Even the proprietary vendors support processing Prometheus data.
