blackbox_exporter timeout configuration (status 0, context deadline exceeded)

Matt P

unread,

May 21, 2019, 9:31:38 AM5/21/19

to Prometheus Users

I have configured scrape timeout on prometheus side (10s) and http probe timeout (2s) on blackbox side...but looking at the failure logs it appears there is a third (not set by me) timeout in play of 1.5 seconds. This results in fairly regular alerts due to non-200 status code (0).

This prevents me from monitoring slower http services...generates a lot of noise. I am specifically checking for 200 status so I know if there are any server-side errors occurring. Latency being "too high" is monitored separately...so I want a high timeout so I get real status code and can monitor/report real latency over time. Of course if there is still a timeout over my "high timeout" settings...then for all intents and purposes the server is down so I DO want to get alerted at non-200 status code (0 in this case due to timeout). With 1.5 seconds I get 0 status codes all the time. I could set prometheus to only alert of we are timing out for more than a certain period of time...but then I may miss short-lived periodic performance issues (ie due to cache refresh, db load, etc). So need to figure out how to get the timeouts configured properly.

I am guessing that this isn't a bug but rather me misunderstanding the configuration docs. Have read them multiple times however and am not sure what I missed.

Blackbox exporter version 0.13.0 linux tarball.
Prometheus version 2.7.1 linux tarball.
Ubuntu 16.04

Looking at the probe failure log I see:

Logs for the probe:
ts=2019-05-06T08:27:13.836730378Z caller=main.go:118 module=http_200_noredirect target=http://www.somedomain.net/privacy/ level=info msg="Beginning probe" probe=http timeout_seconds=1.5
ts=2019-05-06T08:27:13.836884962Z caller=utils.go:42 module=http_200_noredirect target=http://www.somedomain.net/privacy/ level=info msg="Resolving target address" preferred_ip_protocol=ip4
ts=2019-05-06T08:27:13.837517123Z caller=utils.go:65 module=http_200_noredirect target=http://www.somedomain.net/privacy/ level=info msg="Resolved target address" ip=123.123.123.123
ts=2019-05-06T08:27:13.837567386Z caller=http.go:281 module=http_200_noredirect target=http://www.somedomain.net/privacy/ level=info msg="Making HTTP request" url=http://[151.139.237.3]/privacy/ host=www.somedomain.net
ts=2019-05-06T08:27:15.337203199Z caller=http.go:296 module=http_200_noredirect target=http://www.somedomain.net/privacy/ level=error msg="Error for HTTP request" err="Get http://[151.139.237.3]/privacy/: context deadline exceeded"
ts=2019-05-06T08:27:15.337331049Z caller=http.go:367 module=http_200_noredirect target=http://www.somedomain.net/privacy/ level=info msg="Response timings for roundtrip" roundtrip=0 start=2019-05-06T08:27:13.83765659Z dnsDone=2019-05-06T08:27:13.83765659Z connectDone=2019-05-06T08:27:13.84138627Z gotConn=2019-05-06T08:27:13.841412248Z responseStart=0001-01-01T00:00:00Z end=0001-01-01T00:00:00Z
ts=2019-05-06T08:27:15.337383646Z caller=main.go:131 module=http_200_noredirect target=http://www.somedomain.net/privacy/ level=error msg="Probe failed" duration_seconds=1.500593477

So the first line says timeout_seconds=1.5 and indeed the Probe failed at duration_seconds=1.500593477. The thing I don't understand is how this 1.5s timeout is being configured because that is not what I set.

excerpt from prometheus.yml

- job_name: 'blackbox_http200_noredirect'
  scrape_interval: 30s
  scrape_timeout: 10s   # should be higher than http timeout
  metrics_path: /probe
  params:
    module: [http_200_noredirect]
  static_configs:
    - targets:
      - http://www.somedomain.net/privacy/
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 127.0.0.1:9115  # The blackbox exporter's real hostname:port.

excerpt from blackbox_exporter config yml

modules:
  http_200_noredirect:
    prober: http
    timeout: 2s  # http timeout
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2"]
      valid_status_codes: [200]  # Defaults to 2xx
      method: GET
      headers:
        User-Agent: "Mozilla/5.0 (blackbox_exporter) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"
        Accept-Language: 'en-US,en;q=0.9,de-DE,de;q=0.2'
      no_follow_redirects: true
      fail_if_ssl: false
      fail_if_not_ssl: false
      tls_config:
        insecure_skip_verify: false
      preferred_ip_protocol: "ip4" # defaults to "ip6"

So if I understand the docs/config/log (very possible I don't):

the prometheus scrape timeout (prometheus calling the blackbox exporter to do a probe) is 10s
the http probe timeout on the blackbox side is 2 seconds
somehow this config is being ignored and the blackbox exporter is using a 1.5 second http probe timeout which fails occasionally

It makes sense to me that prometheus might inform (via X-Prometheus-Scrape-Timeout-Seconds http header) blackbox about the scrape timeout so that blackbox can use that as a maximum probe timeout (in the case where prometheus has a lower timeout than blackbox). However in my case I am not sure how that applies unless 1.5 seconds is a hard-coded maximum scrape timeout.

Did I misread the docs? Appreciate any advice/education.

Simon Pasquier

unread,

May 22, 2019, 5:53:41 AM5/22/19

to Matt P, Prometheus Users

Check the "--timeout-offset" flag of blackbox_exporter.

> --
> You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
> To post to this group, send email to promethe...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/94dcdf55-9ad6-4624-8206-aac031766224%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Matt P

unread,

May 22, 2019, 6:16:28 AM5/22/19

to Prometheus Users

Indeed that option (with default of 0.5 seconds) does explain the discrepancy between my configured 2s and the 1.5s timeout I was seeing.

Verified by changing the timeout to 10s in the config and I now see 9.5s in the logs.

I had already tried something along these lines (just cranking up the timeout to see if it changed what was in the logs) but I guess my test was flawed...perhaps I forgot to restart the service or looked at a stale debug log or something.

Thanks very much for pointing me in the right direction.

Reply all

Reply to author

Forward