I am trying to make my alerting rules for the blackbox exporter a
bit more sane and less noisy. I would like to know what the
recommended best practices are for writing alert rules for blackbox
exporter probes.
All the examples I have seen
in basic tutorials write a simple alert based on the probe_success
metric. This means, for example, that if you use the basic http_2xx module for the probe and alert on "probe_success", a firing alert does not tell you whether the probe failed because it did not receive a "2xx" response or because the endpoint is down completely.
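For reference, the tutorial-style rule usually looks something like the sketch below. The group name, alert name, "for" duration and severity label are placeholders of mine, not taken from any particular tutorial.

```yaml
groups:
  - name: blackbox-basic        # hypothetical group name
    rules:
      - alert: ProbeFailed      # hypothetical alert name
        # Fires for any failed probe, whatever the underlying reason.
        expr: probe_success == 0
        for: 5m                 # assumed hold time before firing
        labels:
          severity: critical    # assumed label
        annotations:
          summary: "Probe of {{ $labels.instance }} failed"
```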
So, in addition to the "probe_success == 0" alert, I have written
some alerts that query other metrics for the probe, such as:
    probe_http_status_code{job="web", module="http_2xx"} != 200
Now I can tell that the probe is failing because of the response code and not for some other reason, and I can use the metric value in my alert to display the actual response code. But of course, if I use both rules, I now have two alerts firing for the same condition.
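To make the duplication concrete, here is a minimal sketch of the two rules side by side. The expressions are the ones quoted above; the alert names, group name and "for" durations are hypothetical.

```yaml
groups:
  - name: blackbox-web              # hypothetical group name
    rules:
      # Generic rule: fires for any failed probe.
      - alert: ProbeFailed          # hypothetical alert name
        expr: probe_success{job="web"} == 0
        for: 5m                     # assumed duration
        annotations:
          summary: "Probe of {{ $labels.instance }} failed"

      # Specific rule: fires on an unexpected status code and
      # exposes the code itself through the sample value.
      - alert: ProbeBadStatusCode   # hypothetical alert name
        expr: probe_http_status_code{job="web", module="http_2xx"} != 200
        for: 5m                     # assumed duration
        annotations:
          summary: "{{ $labels.instance }} returned HTTP {{ $value }}"
```

With a target returning, say, a 503, both expressions match the same instance, so both alerts fire.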
Likewise, I have several other custom modules that test for other conditions using options such as "fail_if_body_not_matches_regexp", "valid_status_codes", and "fail_if_not_ssl", and some that use a combination of these. How can I tell why a probe is failing without writing an alert that queries the specific metric instead of "probe_success"?
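As a sketch of what per-condition rules for those modules might look like: probe_failed_due_to_regex and probe_http_ssl are metrics the HTTP prober exposes, while the module label values and alert names here are made up for illustration and would need to match the real module names.

```yaml
groups:
  - name: blackbox-web-specific       # hypothetical group name
    rules:
      # Body did not match (or matched a forbidden) regexp.
      - alert: ProbeBodyRegexFailed   # hypothetical alert name
        expr: probe_failed_due_to_regex{job="web", module="http_body_match"} == 1
        for: 5m                       # assumed duration
        annotations:
          summary: "Body check failed for {{ $labels.instance }}"

      # Final response was not served over SSL/TLS.
      - alert: ProbeNotSSL            # hypothetical alert name
        expr: probe_http_ssl{job="web", module="http_ssl_only"} == 0
        for: 5m                       # assumed duration
        annotations:
          summary: "{{ $labels.instance }} did not use SSL"
```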
Am I on the right track here with the specific alerts? If so, I would presumably use inhibition rules in Alertmanager to suppress the "probe_success" alerts whenever a more specific alert is firing. That way, if I see an alert based on "probe_success", I know the probe has failed because the endpoint is down completely. Or is looking at specific metrics the "wrong" way to approach blackbox testing?
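The inhibition idea could be sketched roughly as follows in the Alertmanager configuration. This assumes a version that supports source_matchers/target_matchers (0.22 or later), and the alert names are hypothetical placeholders for the generic and specific rules.

```yaml
inhibit_rules:
  # While any specific probe alert is firing for a target,
  # mute the generic probe_success alert for that same target.
  - source_matchers:
      - alertname =~ "ProbeBadStatusCode|ProbeBodyRegexFailed|ProbeNotSSL"
    target_matchers:
      - alertname = "ProbeFailed"
    # Only inhibit when both alerts share these label values.
    equal: ['job', 'instance']
```

The "equal" list matters here: without it, a specific alert on one target would silence the generic alert for every target.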