How to troubleshoot blackbox_exporter metrics

nafto...@gmail.com

Jun 28, 2017, 1:55:44 PM
to Prometheus Users

I'm getting a lot of downtime alerts, but I don't know how to interpret them or how to dig deeper.

Here is my setup:

blackbox.yml:

        modules:
          http:
            prober: http
            timeout: 60s

prometheus.rules:

...
        ALERT service_down
        IF probe_success == 0
        FOR 15m

prometheus.yml:

...
  - job_name: 'blackbox-exporter'
    metrics_path: /probe
    scrape_interval: 1m
    params:
      module: ['http']
    static_configs:
      - targets:                             # targets to be tested by the blackbox exporter
          - https://domain1.com/
          - https://domain2.com/
    relabel_configs:
          - source_labels: [__address__]     # set param 'target' to the original target
            regex: (.*)
            target_label: __param_target
            replacement: ${1}
          - source_labels: [__param_target]  # set label 'instance' to it as well
            regex: (.*)
            target_label: instance
            replacement: ${1}
          - source_labels: []                # set __address__ to the blackbox exporter
            regex: .*
            target_label: __address__
            replacement: blackbox_exporter:9115
...

I keep getting alerted about “domain1.com”. The Slack alert says,

[FIRING:1] service_down (https://domain1.com/ blackbox-exporter)

However I never observe an actual issue with it. If I immediately open my browser to domain1.com everything is fine.

Where do I go from here?

  1. Do I understand correctly that the alert means that for 15 (or 14?) times in a row, once per minute, the endpoint either timed out or returned an http error status?
  2. When does blackbox_exporter try the endpoints — when prometheus scrapes it, synchronously?
  3. How can I determine what the cause was? Did it error out (and if so what was the status code — I guess the response body is asking too much)? Or did it time out (I guess that could only have a boolean answer — would be nice to know more specifically how long too long responses are)?
  4. Is this the right tool for the job or is there some other tool I should use together with or instead of this?

Thanks.

Brian Brazil

Jun 28, 2017, 2:22:55 PM
to nafto...@gmail.com, Prometheus Users
Regarding your first question: that depends on your evaluation_interval. Presuming it's 1m, it's 16 times in a row.
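
The arithmetic behind that count can be sketched as follows. This is a simplified model, not Prometheus's actual rule evaluator: it assumes evaluations run exactly once per interval, that the first failing evaluation puts the alert into the pending state, and that the alert fires at the first evaluation where the condition has held for at least the FOR duration. The function name is hypothetical.

```python
import math


def evaluations_until_firing(for_seconds: int, interval_seconds: int) -> int:
    """Count consecutive failing rule evaluations needed before an alert fires.

    Simplified model (an assumption, not taken from the Prometheus source):
    the first failing evaluation marks the alert as pending at t=0, and the
    alert fires at the first evaluation where at least `for_seconds` have
    elapsed since then.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # Evaluations happen at t = 0, interval, 2 * interval, ...
    # The firing evaluation is the smallest k with k * interval >= for_seconds,
    # and the evaluation at t=0 has to be counted as well.
    return math.ceil(for_seconds / interval_seconds) + 1


if __name__ == "__main__":
    # FOR 15m with a 1m evaluation_interval, as in the rule above.
    print(evaluations_until_firing(15 * 60, 60))
```

Under these assumptions, FOR 15m with a 1m evaluation interval gives 16 evaluations, which matches the figure above. With a 1m scrape_interval, each evaluation would see a different probe result.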
 
  2. When does blackbox_exporter try the endpoints: when prometheus scrapes it, synchronously?
Yes, synchronously as that's the way exporters are meant to work.
 
  3. How can I determine what the cause was? Did it error out (and if so, what was the status code? I guess the response body is asking too much)? Or did it time out (I guess that could only have a boolean answer, though it would be nice to know more specifically how long the overly slow responses took)?

What were the other metrics returned by the scrape? The blackbox exporter includes a number of other metrics to help you figure out what's going on.
 
  4. Is this the right tool for the job or is there some other tool I should use together with or instead of this?

This is the correct tool.


Naftoli Gugenheim

Jun 28, 2017, 2:25:37 PM
to Brian Brazil, Prometheus Users
How do I know what those metrics are, and how do I know what their values were at the time(s) the exporter reported downtime?

That's kind of the crux of the issue. ;)

Brian Brazil

Jun 28, 2017, 2:55:14 PM
to Naftoli Gugenheim, Prometheus Users
On 28 June 2017 at 19:25, Naftoli Gugenheim <nafto...@gmail.com> wrote:
Pull out metrics with {job="blackbox-exporter"}
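
To see what those metrics were at the time of a reported outage, one option is a range query against the Prometheus HTTP API (`/api/v1/query_range`). The sketch below only builds the request URL and does not contact any server. The base URL `http://localhost:9090`, the timestamps, and both helper names are assumptions for illustration; the job and instance labels are the ones from the config above.

```python
from urllib.parse import urlencode


def probe_selector(metric: str, instance: str, job: str = "blackbox-exporter") -> str:
    """Build a PromQL selector for one probe metric of one target."""
    return f'{metric}{{instance="{instance}",job="{job}"}}'


def range_query_url(base_url: str, expr: str, start: int, end: int, step: str = "60s") -> str:
    """Build a /api/v1/query_range URL (start and end are Unix timestamps)."""
    params = urlencode({"query": expr, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


if __name__ == "__main__":
    # Hypothetical one-hour window around an alert; substitute real timestamps.
    expr = probe_selector("probe_http_status_code", "https://domain1.com/")
    print(range_query_url("http://localhost:9090", expr, 1498672800, 1498676400))
```

Fetching the printed URL (for example with wget or a browser) should return the sampled values as JSON, assuming Prometheus is reachable at that address. The same selector without a metric name, `{job="blackbox-exporter"}`, lists every series for the job.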
 

 
 

Naftoli Gugenheim

Jun 28, 2017, 3:00:19 PM
to Brian Brazil, Prometheus Users

I get back these. Now what? What does each of them mean?

ALERTS{alertname="service_down",alertstate="firing",instance="https://domain1.com/",job="blackbox-exporter"}
probe_http_redirects{instance="https://domain1.com/",job="blackbox-exporter"}
probe_http_content_length{instance="https://domain1.com/",job="blackbox-exporter"}
scrape_duration_seconds{instance="https://domain1.com/",job="blackbox-exporter"}
probe_http_redirects{instance="https://domain2/",job="blackbox-exporter"}
probe_http_content_length{instance="https://domain2/",job="blackbox-exporter"}
probe_duration_seconds{instance="https://domain2/",job="blackbox-exporter"}
probe_ssl_earliest_cert_expiry{instance="https://domain2/",job="blackbox-exporter"}
probe_success{instance="https://domain1.com/",job="blackbox-exporter"}
probe_http_ssl{instance="https://domain2/",job="blackbox-exporter"}
probe_http_ssl{instance="https://domain1.com/",job="blackbox-exporter"}
scrape_samples_scraped{instance="https://domain2/",job="blackbox-exporter"}
probe_http_status_code{instance="https://domain2/",job="blackbox-exporter"}
probe_http_status_code{instance="https://domain1.com/",job="blackbox-exporter"}
up{instance="https://domain2/",job="blackbox-exporter"}
scrape_samples_post_metric_relabeling{instance="https://domain1.com/",job="blackbox-exporter"}
scrape_samples_scraped{instance="https://domain1.com/",job="blackbox-exporter"}
probe_duration_seconds{instance="https://domain1.com/",job="blackbox-exporter"}
probe_success{instance="https://domain2/",job="blackbox-exporter"}
probe_ip_protocol{instance="https://domain1.com/",job="blackbox-exporter"}
scrape_samples_post_metric_relabeling{instance="https://domain2/",job="blackbox-exporter"}
probe_ip_protocol{instance="https://domain2/",job="blackbox-exporter"}
scrape_duration_seconds{instance="https://domain2/",job="blackbox-exporter"}
up{instance="https://domain1.com/",job="blackbox-exporter"}

Brian Brazil

Jun 28, 2017, 3:29:18 PM
to Naftoli Gugenheim, Prometheus Users
See what they indicate. The HELP strings on the original /probe endpoint (if you have a new enough version) will indicate what each one means.



Naftoli Gugenheim

Jul 3, 2017, 2:17:11 AM
to Brian Brazil, Prometheus Users

Is this what you mean?

(Obviously domain1.com is substituted for the real domain.)

$ sudo docker-compose exec blackbox_exporter wget -O - -q 'http://localhost:9115/probe?module=http&target=https://domain1.com'
probe_ip_protocol 4
probe_http_status_code 0
probe_http_content_length 0
probe_http_redirects 1
probe_http_ssl 0
probe_duration_seconds 0.022083
probe_success 0

Brian Brazil

Jul 3, 2017, 5:43:21 AM
to Naftoli Gugenheim, Prometheus Users
That's an older version, but it looks like correcting to the 2nd server after the redirect failed.

Brian



Brian Brazil

Jul 3, 2017, 5:43:35 AM
to Naftoli Gugenheim, Prometheus Users
Connecting, not correcting.



Naftoli Gugenheim

Jul 3, 2017, 3:43:28 PM
to Brian Brazil, Prometheus Users
That's an older version,

Well apparently it's the latest version pushed to the docker image prom/blackbox-exporter:latest. It isn't pulling anything newer (I tried before posting).
 
but it looks like correcting to the 2nd server after the redirect failed.

Connecting, not correcting.

What do you mean "2nd"? Also that still doesn't answer the question. How did it fail? What was the failure mode? How do I get more information?

Like I said, it works fine for me in the browser, so on a practical level it's a false positive.

Also I still don't have the answer to the larger question. What are the metrics that can be returned (are they always the same or not? how would one know?), and what do they mean (what are the possible values and what do they represent)?

 
 

--
You received this message because you are subscribed to a topic in the Google Groups "Prometheus Users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/prometheus-users/RZUlIh9UF-Q/unsubscribe.
To unsubscribe from this group and all its topics, send an email to prometheus-use...@googlegroups.com.
To post to this group, send email to promethe...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/CAHJKeLrK78YaXa2aEo4h3zEPSyhoVCC7554jcZkJkjkJm8nzyw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Brian Brazil

Jul 4, 2017, 6:57:01 AM
to Naftoli Gugenheim, Prometheus Users
That's an older version,

Well apparently it's the latest version pushed to the docker image prom/blackbox-exporter:latest. It isn't pulling anything newer (I tried before posting).

Ah, I thought I'd already released that. 0.6.0 was just released with these changes.
 
 
but it looks like correcting to the 2nd server after the redirect failed.

Connecting, not correcting.

What do you mean "2nd"?

probe_http_redirects is 1, so there was one redirect followed. This 2nd request failed.
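
That reading of the probe output can be sketched in code. The parser below handles only the simple `name value` lines shown in this thread, not the full Prometheus exposition format. Treating `probe_http_status_code 0` as "no HTTP response was received" is an inference from the explanation above, and the exact semantics may differ between exporter versions. Both function names are hypothetical.

```python
def parse_probe_output(text: str) -> dict:
    """Parse 'name value' lines, skipping blank lines and '#' comment lines."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field on the line.
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics


def describe_failure(metrics: dict) -> str:
    """Give a rough, human-readable reading of an HTTP probe result."""
    if metrics.get("probe_success") == 1:
        return "probe succeeded"
    redirects = int(metrics.get("probe_http_redirects", 0))
    status = int(metrics.get("probe_http_status_code", 0))
    if status == 0:
        # Assumption: status 0 means no response was received for the
        # final request, e.g. the connection itself failed.
        return (f"no HTTP response received on request {redirects + 1} "
                f"(after {redirects} redirect(s))")
    return f"got HTTP status {status} after {redirects} redirect(s)"


SAMPLE = """\
probe_ip_protocol 4
probe_http_status_code 0
probe_http_content_length 0
probe_http_redirects 1
probe_http_ssl 0
probe_duration_seconds 0.022083
probe_success 0
"""

if __name__ == "__main__":
    print(describe_failure(parse_probe_output(SAMPLE)))
```

For the output posted earlier in the thread, this reports a failure on request 2 after 1 redirect. It does not say why the second request failed.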
 
Also that still doesn't answer the question. How did it fail? What was the failure mode? How do I get more information?

The log messages are your best option. There are plans for better diagnostics in future, but there are limits to what can be done with metrics for this sort of thing.

Like I said, it works fine for me in the browser, so on a practical level it's a false positive.

Also I still don't have the answer to the larger question. What are the metrics that can be returned (are they always the same or not? how would one know?),

The metrics should generally be the same for a given exporter (though for blackbox, whether the SSL expiry metric exists depends on whether TLS is in use).
 
and what do they mean (what are the possible values and what do they represent)?

For the exact semantics you'd need to look at the code, as there are interactions between the various metrics that may vary depending on the exact code structure and failure modes. In this case I'd suggest breaking out tcpdump.

Brian
 

 
 


Naftoli Gugenheim

Jul 4, 2017, 3:31:42 PM
to Brian Brazil, Prometheus Users
That's an older version,

Well apparently it's the latest version pushed to the docker image prom/blackbox-exporter:latest. It isn't pulling anything newer (I tried before posting).
Ah, I thought I'd already released that. 0.6.0 was just released with these changes.

Ok thanks, now I've got the updated version with some HELP strings. The TYPEs just say "gauge" for all of them.
 
 
 
Also that still doesn't answer the question. How did it fail? What was the failure mode? How do I get more information?

The log messages are your best option. There's plans for better diagnostics in future, but there's limits to what can be done with metrics for this sort of thing.

Ok great. The logs revealed that the setting for redirecting non-logged-in users was misconfigured (it had a typo in the domain name). Let's hope it's resolved now G-d willing. :)

 