> I've recently started monitoring a large fleet of hardware devices
> using a combination of blackbox, snmp, node, and json exporters. I
> started out using the *up* metric, but I noticed when using blackbox
> ping, *up* is *always* 1 even when the device is offline. So I plan to
> switch to *probe_success* instead. But I'm thinking about the
> implications of this when mixed with other exporters. For example
> json-exporter does not offer a *probe_success* metric; instead it
> returns *up*=0 when the target times out.

Roughly speaking, up{} tells you if the exporter is running (technically
whether it responds okay to being scraped) and then the exporter may
have its own metrics to say whether or not it's been successful at
generating metrics or doing whatever it normally does. Some exporters
always succeed if they're up; some exporters have more granular success
or failure (for example, individual collectors in the node exporter);
some exporters have completely decoupled up and success statuses, as is
the case with the blackbox exporter (where the exporter is often not
even on the machines you're checking with a particular probe).

Complicating the picture, if an exporter is down (its up is 0), then
it's not generating any metrics and any success metrics it would
normally generate are absent instead of reporting failure.
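
As a rough sketch of how those cases fit together for a single scrape
target (Python purely for illustration; the function name and the status
strings are made up here, and None stands in for a series that is
absent):

```python
from typing import Optional


def target_status(up: Optional[float], success: Optional[float]) -> str:
    """Classify one scrape target from its up value and an exporter-specific
    success metric (for example probe_success). None means the series is absent.
    """
    if up is None:
        return "not-scraped"    # no up series at all for this target
    if up == 0:
        return "exporter-down"  # scrape failed, so success is absent, not 0
    if success is None:
        return "ok"             # assumed: exporter has no separate success metric
    return "ok" if success == 1 else "check-failed"
```

The point of the middle case is that a down exporter never gets as far
as saying 'failed'; you only find out from up{} itself.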

(The 'up' metric is generated internally by Prometheus itself based on
whether the scrape succeeded or not, so an exporter such as the Blackbox
exporter can only influence it by failing the scrape outright, for
example by not responding at all. That would mean Blackbox couldn't
return any metrics that might explain why the probe failed, and even for
ICMP probes there can be multiple reasons for a failure.)

The Blackbox exporter is a bit tricky to understand in relation to up{},
because unlike many exporters you create multiple scrape targets against
(or through) the same exporter. This generally means you want to ignore
the up{} metric for any particular Blackbox probe and instead also
scrape Blackbox's own metrics endpoint and pay attention to that
target's up{} (for alerts, for example). Other exporters are much more
one-to-one; you scrape each exporter once through one target, so there's
only one up{} metric, and it goes to 0 if that particular exporter
instance isn't responding.

(However, this is not universally true; there are other multi-target
indirect exporters like Blackbox. I believe that the SNMP exporter is
another one where you often have one exporter separately scraping a lot
of targets, and each target will have its own up{} metric that you
probably want to ignore.)

> My goal is to build a Grafana dashboard and alerts that monitor a
> combination of blackbox and other exporters. For context, when certain
> devices crash, they remain pingable, but they return their failed
> state via REST API. So I'm setting the json-exporter to an HTTP target
> endpoint. I'm struggling to come up with a unified way of monitoring
> all these different device types.

Unfortunately there is no unified way, as far as I know. If you want one
in the Grafana frontend, you might need to make up some sort of
synthetic 'is-up' metric through recording rules that know how to
combine all of the various status results into one piece of information.
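
The combining logic such a rule would have to encode might look roughly
like this. This is a Python sketch of the decision only, not an actual
recording rule; the job names, the function name, and the choice of
which metric counts as 'health' for each job are all assumptions for
illustration:

```python
from typing import Dict, Optional

# Hypothetical mapping: which metric carries the real status for each job.
HEALTH_METRIC = {
    "blackbox": "probe_success",
    "json": "up",
    "node": "up",
}


def device_is_up(samples: Dict[str, Dict[str, Optional[float]]]) -> int:
    """Return 1 only if every job that covers the device reports healthy.

    samples maps a job name to that job's latest metric values for the
    device. A missing or absent health metric counts as a failure, and so
    does having no data for the device at all.
    """
    if not samples:
        return 0
    for job, metrics in samples.items():
        name = HEALTH_METRIC.get(job, "up")  # unknown jobs fall back to up
        if metrics.get(name) != 1:
            return 0
    return 1
```

Under these assumptions, a device that still answers pings but whose
JSON scrape fails comes out as 0, which is the case described in the
question.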

(I don't think Grafana has a way of defining 'functions' like this that
can be used across multiple panels and reused between dashboards, but
I'm out of touch with the latest features in current versions of
Grafana.)

In our environment, it's useful for us to have a granular view of what
has failed. That a device has stopped pinging is a different issue than
its node_exporter not being up, so our dashboards (and alerts) reflect
that. However, we have a small enough number of devices that we can deal
with things this verbosely.

- cks