Blackbox Exporter alerting best practices

justin...@gmail.com

Oct 15, 2020, 6:10:26 PM
to Prometheus Users

I am trying to make my alerting rules around blackbox exporter a bit more sane and reduce noise. I would like to know what the recommended best practices are around writing alert rules for blackbox exporter probes.

All the examples I have seen in basic tutorials write a simple alert based on the probe_success metric. This means, for example, that if you use the basic http_2xx module for the probe and alert on "probe_success", the probe may have failed because it did not receive a "2xx" response, or because the endpoint is down completely, and the alert cannot tell you which.

So I have written some alerts in addition to the "probe_success == 0" which query other metrics for the probe, such as:

probe_http_status_code{job="web", module="http_2xx"} != 200

So now I can know for sure that the probe is failing due to the response code and not for some other reason, and I can use the metric value on my alert to display the actual response code. But of course, if I use both rules I now have two alerts firing for the same condition.

Likewise I have several other custom modules testing for other conditions such as "fail_if_body_not_matches_regexp", "valid_status_codes", "fail_if_not_ssl", and some which use a combination of these conditions. How am I to tell why the probe is failing without writing an alert to query the specific metric rather than "probe_success"?
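To make the problem concrete, here is a rough sketch of a module that combines the three conditions named above. The module name, timeout, and regular expression are made up for illustration; only the option names come from the question. With a module like this, "probe_success == 0" alone cannot say which of the three checks failed.

```yaml
modules:
  http_strict:                  # hypothetical module name
    prober: http
    timeout: 5s                 # assumed value
    http:
      # Any one of these conditions failing sets probe_success to 0.
      valid_status_codes: [200]
      fail_if_not_ssl: true
      fail_if_body_not_matches_regexp:
        - "OK"                  # placeholder pattern
```

As far as I know, the HTTP prober also exports probe_http_ssl and probe_failed_due_to_regex next to probe_http_status_code, so each condition has a metric that could be inspected separately.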

Am I on the right track with the specific alerts? If so, I would presumably use inhibition rules in Alertmanager to suppress the "probe_success" alert whenever a more specific alert is firing. That way, if I see an alert based on "probe_success", I know the probe failed because the endpoint is down completely. Or is alerting on specific metrics the "wrong" way to approach blackbox testing?

justin...@gmail.com

Oct 18, 2020, 11:17:33 PM
to Prometheus Users
After doing some more testing it seems the situation is more complicated than I initially thought. And perhaps my question was not very clear.

As far as I can see, there is no way to separate general probe failures from specific ones. Even if you are using a very specific module, e.g. "http_2xx", you can never be sure whether the probe is failing because of a status code other than 2xx or for some other reason.

Here are example scenarios to illustrate:

Setup:

Blackbox Exporter config:

http_2xx:
  prober: http
  http:
    ip_protocol_fallback: true
  tcp:
    ip_protocol_fallback: true
  icmp:
    ip_protocol_fallback: true
  dns:
    ip_protocol_fallback: true

Alerts:

- alert: http-probe-fail
  expr: probe_success{job="web"} == 0
  for: 5m
  labels:
    severity: high
  annotations:
    summary: "HTTP probe failure"
    description: "HTTP probe fail"

- alert: http-200-fail
  expr: probe_http_status_code {job="web", module=~".*http_2xx.*"} != 200
  for: 5m
  labels:
    severity: high
  annotations:
    summary: "HTTP 200 Fail"
    description: "HTTP status code other than \"200\" has been returned"
    value_str: "HTTP STATUS:{{ $value }}"

Scenario 1: Host unreachable (e.g. incorrect DNS name in the scrape target config)

probe_http_status_code 0
probe_success 0

Both alerts will fire.

Scenario 2: Host reachable, nginx running but back-end not responding. Probe fails after long timeout

probe_http_status_code 0
probe_success 0

Both alerts will fire.

Scenario 3: Host reachable, nginx running but returning 401 <------- It would be nice to have some way of separating this from the other scenarios

probe_http_status_code 0
probe_success 0

Both alerts will fire.

Scenario 4: Host completely down

probe_http_status_code 0
probe_success 0

Both alerts will fire.

Obviously this gets even trickier if you have multiple failure conditions in your module config.

justin...@gmail.com

Oct 18, 2020, 11:20:21 PM
to Prometheus Users
EDIT: Scenario 3 above should be:

Scenario 3: Host reachable, nginx running but returning 401 <------- It would be nice to have some way of separating this from the other scenarios

probe_http_status_code 401
probe_success 0
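Going by the values in these scenarios, the status code is 0 whenever no HTTP response arrived and non-zero when the server answered. One possible way to separate the two cases is to combine both metrics in each rule so that the alerts are mutually exclusive. This is only a sketch: the group and alert names are invented, and it assumes both series carry the same "instance" and "job" labels, as they do with the usual blackbox relabelling.

```yaml
groups:
  - name: blackbox-http                  # hypothetical group name
    rules:
      # The server answered, but the probe still failed (Scenario 3, e.g. 401).
      # The left-hand side supplies the value, so $value is the status code.
      - alert: http-probe-bad-response   # hypothetical alert name
        expr: |
          probe_http_status_code{job="web"} > 0
          and on (instance, job)
          probe_success{job="web"} == 0
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "HTTP probe failed with a response"
          value_str: "HTTP STATUS:{{ $value }}"

      # No HTTP response at all (Scenarios 1, 2 and 4).
      - alert: http-probe-no-response    # hypothetical alert name
        expr: |
          probe_success{job="web"} == 0
          and on (instance, job)
          probe_http_status_code{job="web"} == 0
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "HTTP probe failed without a response"
```

Because a failing target matches only one of the two expressions at a time, no inhibition rule is needed for this pair. Note that the first alert would also catch a probe that received a 200 but failed another check, such as a body regexp or the SSL requirement.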

Nolan Crooks

Mar 9, 2021, 3:30:06 PM
to Prometheus Users
I feel that some of your problems may be solved by exploring inhibition rules in your Alertmanager configuration. For example, while "http-probe-fail" is firing, you could inhibit the other rules until it is resolved. Check out the docs for inhibition rules.
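A minimal sketch of such a rule, using the alert names from the earlier post, might look like this. It assumes a recent Alertmanager release that supports the "source_matchers" and "target_matchers" syntax (older releases use "source_match" and "target_match" instead), and that both alerts carry matching "instance" and "job" labels.

```yaml
inhibit_rules:
  # While http-probe-fail is firing for a target, mute http-200-fail
  # for the same target.
  - source_matchers:
      - 'alertname = "http-probe-fail"'
    target_matchers:
      - 'alertname = "http-200-fail"'
    # Only inhibit when these labels are equal on both alerts.
    equal: ['instance', 'job']
```

Swapping the source and target matchers would give the opposite behaviour, where the more specific alert mutes the generic one. Either way, the source alert has to fire only in the case you want it to represent, otherwise the inhibition hides the wrong alert.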