probe_success VS up


Elliott Balsley

Nov 27, 2023, 9:37:46 PM
to Prometheus Users
I've recently started monitoring a large fleet of hardware devices using a combination of blackbox, snmp, node, and json exporters.
I started out using the up metric, but I noticed when using blackbox ping, up is always 1 even when the device is offline.  So I plan to switch to probe_success instead.  But I'm thinking about the implications of this when mixed with other exporters.  For example json-exporter does not offer a probe_success metric; instead it returns up=0 when the target times out.

My goal is to build a Grafana dashboard and alerts that monitor a combination of blackbox and other exporters.  For context, when certain devices crash, they remain pingable, but they return their failed state via REST API.  So I'm pointing the json-exporter at an HTTP target endpoint.  I'm struggling to come up with a unified way of monitoring all these different device types.

Chris Siebenmann

Nov 27, 2023, 11:15:41 PM
to Elliott Balsley, Prometheus Users, Chris Siebenmann
> I've recently started monitoring a large fleet of hardware devices
> using a combination of blackbox, snmp, node, and json exporters. I
> started out using the *up* metric, but I noticed when using blackbox
> ping, *up* is *always* 1 even when the device is offline. So I plan to
> switch to *probe_success* instead. But I'm thinking about the
> implications of this when mixed with other exporters. For example
> json-exporter does not offer a *probe_success* metric; instead it
> returns *up*=0 when the target times out.

Roughly speaking, up{} tells you if the exporter is running (technically
whether it responds okay to being scraped) and then the exporter may
have its own metrics to say whether or not it's been successful at
generating metrics or doing whatever it normally does. Some exporters
always succeed if they're up; some exporters have more granular success
or failure (for example, individual collectors in the node exporter);
some exporters have completely decoupled up and success statuses, as is
the case with the blackbox exporter (where the exporter is often not
even on the machines you're checking with a particular probe).
Complicating the picture, if an exporter is down (its up is 0), then
it's not generating any metrics and any success metrics it would
normally generate are absent instead of reporting failure.
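
For instance, with the node exporter the layering looks roughly like
this (a sketch; the "node" job name is only a placeholder for whatever
you use):

    up{job="node"} == 0                               # exporter unreachable: no metrics at all
    node_scrape_collector_success{job="node"} == 0    # exporter up, but an individual collector failing

Note that the second expression can only match while the exporter is
up; once the exporter goes down, the per-collector metrics disappear
rather than turning into 0.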

(The 'up' metric is internally generated by Prometheus itself based on
whether the scrape succeeded or not, so an exporter, such as the
Blackbox exporter, can only influence it by not responding at all, which
would mean that Blackbox can't return any metrics that might explain why
the probe failed. Even for ICMP probes there can be multiple reasons for
the failure.)

The Blackbox exporter is a bit tricky to understand in relation to up{},
because unlike many exporters you create multiple scrape targets against
(or through) the same exporter. This generally means you want to ignore
the up{} metric for any particular blackbox probe and instead scrape
Blackbox's metric endpoint and pay attention to its up{} (for alerts,
for example). Other exporters are much more one to one; you scrape each
exporter once through one target, so there's only one up{} metric that
goes to 0 if that particular exporter instance isn't responding.

(However this is not universally true; there are other multi-target
indirect exporters like Blackbox. I believe that the SNMP exporter is
another one where you often have one exporter separately scraping a lot
of targets, and each target will have its own up{} metric that you
probably want to ignore.)

> My goal is to build a Grafana dashboard and alerts that monitors a
> combination of blackbox and other exporters. For context, when certain
> devices crash, they remain pingable, but they return their failed
> state via REST API. So I'm setting the json-exporter to an HTTP target
> endpoint. I'm struggling to come up with a unified way of monitoring
> all these different devices types.

Unfortunately there is no unified way, as far as I know. If you want one
in the Grafana frontend, you might need to make up some sort of
synthetic 'is-up' metric through recording rules that know how to
combine all of the various status results into one piece of information.
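
A very rough sketch of what such rules could look like, assuming scrape
jobs named "node" and "blackbox-icmp" (the job names are only
placeholders here):

groups:
- name: SyntheticUp
  rules:
  - record: device:reachable:bool
    expr: up{job="node"}
  - record: device:reachable:bool
    expr: probe_success{job="blackbox-icmp"}

Dashboards and alerts could then query device:reachable:bool == 0
without caring which exporter produced the underlying sample, at the
cost of keeping the rules in sync with your exporters.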

(I don't think Grafana has a way of defining 'functions' like this that
can be used across multiple panels and reused between dashboards, but
I'm out of touch with the latest features in current versions of
Grafana.)

In our environment, it's useful for us to have a granular view of what
has failed. That a device has stopped pinging is a different issue than
its node_exporter not being up, so our dashboards (and alerts) reflect
that. However, we have a small enough number of devices that we can deal
with things this verbosely.

- cks

Brian Candler

Nov 28, 2023, 5:17:57 AM
to Prometheus Users
On Tuesday, 28 November 2023 at 04:15:41 UTC Chris Siebenmann wrote:
The Blackbox exporter is a bit tricky to understand in relation to up{},
because unlike many exporters you create multiple scrape targets against
(or through) the same exporter. This generally means you want to ignore
the up{} metric for any particular blackbox probe and instead scrape
Blackbox's metric endpoint and pay attention to its up{} (for alerts,
for example).

I think that's worded in a misleading way.

Blackbox exporter does have a /metrics endpoint, but this is only for metrics internal to the operation of blackbox_exporter itself (e.g. memory stats, software version). You don't need to scrape this, but it gives you a little bit of extra info about how your exporter is performing.

Blackbox exporter's main interface is the /probe endpoint, where you tell it to run individual tests: /probe?target=xxx&module=yyy
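
A typical scrape job for the /probe endpoint looks something like this (a sketch: the module name, targets and exporter address are all placeholders):

scrape_configs:
- job_name: blackbox-icmp
  metrics_path: /probe
  params:
    module: [icmp]                  # must match a module defined in blackbox.yml
  static_configs:
  - targets:
    - 192.0.2.10
    - 192.0.2.11
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target    # becomes ?target=... on the probe URL
  - source_labels: [__param_target]
    target_label: instance          # keep the probed host as the instance label
  - target_label: __address__
    replacement: 127.0.0.1:9115     # actually scrape the blackbox_exporter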

The 'up' metric is generated by Prometheus itself, and only tells you that it was successfully able to communicate with the exporter and get some results (without a 4xx / 5xx error for example).  So it's correct to say that you're not interested in the 'up' metric for scrapes to /probe, since it will always be 1 unless blackbox_exporter itself is badly broken, and you're interested in probe_success instead.

This is pretty easy to arrange in alerting rules. Here's a starting point:

groups:
- name: UpDown
  rules:
  - alert: UpDown
    expr: up == 0
    for: 3m
    keep_firing_for: 3m
    labels:
      severity: critical
    annotations:
      summary: 'Scrape failed: host is down or scrape endpoint down/unreachable'
- name: BlackboxRules
  rules:
  - alert: ProbeFail
    expr: probe_success == 0
    for: 3m
    keep_firing_for: 3m
    labels:
      severity: critical
    annotations:
      description: |
        {{ $labels.instance }} ({{ $labels.module }}) probe is failing
      summary: Probed service is down

For Grafana I'd probably just make two dashboards, but if you really want a grand summary of all "problems" then you can simply use a PromQL expression like this:

    up == 0 or probe_success == 0

The "or" operator in PromQL is not a boolean: it's more like a set union operator.  It will give you all the values of the "up" vector where the value is 0, along with all values of the "probe_success" vector where the value is 0 (except for values of probe_success == 0 which have *exactly* the same labels as up == 0, but those are unlikely anyway)

The consumer of this query is going to see a mixture of up{...} and probe_success{...} metrics, all with value 0.

 there are other multi-target
indirect exporters like Blackbox. I believe that the SNMP exporter is
another one where you often have one exporter separately scraping a lot
of targets, and each target will have its own up{} metric that you
probably want to ignore.)

The first part of that is correct: SNMP exporter uses /snmp?target=xxx&module=yyy&auth=zzz.

But the second part is wrong: if SNMP exporter fails to talk to the target then it returns an empty scrape with a 4xx/5xx error code, which prometheus turns into up==0.  So you definitely *do* want to alert on up==0 in this case, as that's how you detect a device which is failing to respond to SNMP.

 

In our environment, it's useful for us to have a granular view of what
has failed. That a device has stopped pinging is a different issue than
its node_exporter not being up, so our dashboards (and alerts) reflect
that.

I agree with that. Different metrics inherently have different meanings, and although 'up' and 'probe_success' have similar 0/1 semantics, there's other information you can get from blackbox_exporter when probe_success==0 which can tell you more about the nature of the problem (e.g. failure to connect, failure to resolve a DNS name, TLS certificate validation failure, etc.).
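
For example, probe_http_status_code records the last HTTP response code seen, and probe_ssl_earliest_cert_expiry lets you warn about certificates before they turn into a hard failure. A sketch of the latter (the threshold and severity are arbitrary):

  - alert: CertificateExpiringSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: 'TLS certificate for {{ $labels.instance }} expires in under 14 days'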

Ben Kochie

Nov 28, 2023, 8:44:46 AM
to Brian Candler, Prometheus Users
Fantastic summary. This would actually make a really nice addition to the "guides" section of the Prometheus docs.



Ben Kochie

Nov 28, 2023, 8:54:17 AM
to Brian Candler, Prometheus Users
One more thing to talk about is that the Prometheus ecosystem assumes and follows the "Fail Fast" principle[0].

Best practice[1] in Prometheus is to fail the whole scrape and return a 5xx error if any part of the data collection fails. For simple exporters this is typical. The reason for this is that partial failure can be hard to reason about and write alerts for. Either get all the data that's expected or return an error.

But for more complex exporters that gather a lot of data, or where you're OK with partial results and will handle them with more complex alerts, proxy "up" metrics are used.

For example, the mysqld_exporter has a `mysql_up` metric that indicates whether or not it was able to establish a basic connection to the server. In the node_exporter, there is node_scrape_collector_success.
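
So a common pattern is to alert on the proxy metric separately from the generic up == 0 rule, e.g. (a sketch; the timings and labels are arbitrary):

  - alert: MySQLDown
    expr: mysql_up == 0
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: 'mysqld_exporter is running but cannot reach MySQL on {{ $labels.instance }}'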



Chris Siebenmann

Nov 28, 2023, 12:21:37 PM
to Brian Candler, Prometheus Users, Chris Siebenmann
> Blackbox exporter does have a /metrics endpoint, but this is only for
> metrics internal to the operation of blackbox_exporter itself (e.g.
> memory stats, software version). You don't need to scrape this, but it
> gives you a little bit of extra info about how your exporter is
> performing.

The reason I suggested scraping Blackbox's /metrics endpoint was as a
convenient way to create an up{} metric that reflects whether or not
Blackbox itself is up. You can do this with the up{} metrics of
individual probes, but if you use them you either need to pick a
specific probe target that you know will always be present, or you need
to aggregate the up{} metrics of probes together so that if N% of them
go to 0 you can raise an alert. (I failed to explain this at all well in
my original message.)
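
Concretely, that just means an extra scrape job pointed at Blackbox's
own /metrics, something like the following (the address here is a
placeholder):

scrape_configs:
- job_name: blackbox-exporter       # Blackbox's own /metrics endpoint
  static_configs:
  - targets: ['127.0.0.1:9115']

The up{job="blackbox-exporter"} series from that job then reflects
whether the exporter process itself is reachable, independent of any
individual probe.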

(Depending on how you do Blackbox relabeling, your up{} metrics from
probes may not have a label that identifies the Blackbox instance. Now
that I look, our metrics don't, which may be something I want to fix.)

If all your Blackbox targets are generated through dynamic service
discovery, it might be possible for your service discovery to break so
that you have no probes and thus no up{} metrics generated from them.
Although at this point it's probably unimportant whether Blackbox itself
is up, since you likely have bigger problems.
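
If you did want to catch that case, something like this would do it
(the job name is a placeholder):

    # fires only when no up{} series exist for the blackbox job at all,
    # e.g. because service discovery returned no targets
    absent(up{job="blackbox-icmp"})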

- cks