I have configured a scrape timeout on the Prometheus side (10s) and an HTTP probe timeout (2s) on the blackbox side, but the failure logs show a third timeout of 1.5 seconds in play that I did not set. This results in fairly regular alerts due to a non-200 status code (0).
This prevents me from monitoring slower HTTP services and generates a lot of noise. I am specifically checking for a 200 status so I know if any server-side errors are occurring. Latency being "too high" is monitored separately, so I want a high timeout that lets me get the real status code and monitor/report real latency over time. Of course, if a probe still times out beyond my "high timeout" setting, then for all intents and purposes the server is down, so I DO want to be alerted on the non-200 status code (0 in this case, due to the timeout). With 1.5 seconds I get 0 status codes all the time. I could set Prometheus to alert only if we are timing out for more than a certain period of time, but then I may miss short-lived periodic performance issues (e.g. due to cache refresh, DB load, etc.). So I need to figure out how to get the timeouts configured properly.
I am guessing that this isn't a bug but rather me misunderstanding the configuration docs. I have read them multiple times, however, and am not sure what I missed.
Blackbox exporter version 0.13.0 linux tarball.
Prometheus version 2.7.1 linux tarball.
Ubuntu 16.04
Logs for the probe:
ts=2019-05-06T08:27:13.836730378Z caller=main.go:118 module=http_200_noredirect target=http://www.somedomain.net/privacy/ level=info msg="Beginning probe" probe=http timeout_seconds=1.5
ts=2019-05-06T08:27:13.836884962Z caller=utils.go:42 module=http_200_noredirect target=http://www.somedomain.net/privacy/ level=info msg="Resolving target address" preferred_ip_protocol=ip4
ts=2019-05-06T08:27:13.837517123Z caller=utils.go:65 module=http_200_noredirect target=http://www.somedomain.net/privacy/ level=info msg="Resolved target address" ip=123.123.123.123
ts=2019-05-06T08:27:13.837567386Z caller=http.go:281 module=http_200_noredirect target=http://www.somedomain.net/privacy/ level=info msg="Making HTTP request" url=http://[151.139.237.3]/privacy/ host=www.somedomain.net
ts=2019-05-06T08:27:15.337203199Z caller=http.go:296 module=http_200_noredirect target=http://www.somedomain.net/privacy/ level=error msg="Error for HTTP request" err="Get http://[151.139.237.3]/privacy/: context deadline exceeded"
ts=2019-05-06T08:27:15.337331049Z caller=http.go:367 module=http_200_noredirect target=http://www.somedomain.net/privacy/ level=info msg="Response timings for roundtrip" roundtrip=0 start=2019-05-06T08:27:13.83765659Z dnsDone=2019-05-06T08:27:13.83765659Z connectDone=2019-05-06T08:27:13.84138627Z gotConn=2019-05-06T08:27:13.841412248Z responseStart=0001-01-01T00:00:00Z end=0001-01-01T00:00:00Z
ts=2019-05-06T08:27:15.337383646Z caller=main.go:131 module=http_200_noredirect target=http://www.somedomain.net/privacy/ level=error msg="Probe failed" duration_seconds=1.500593477
So the first line says timeout_seconds=1.5 and indeed the Probe failed at duration_seconds=1.500593477. The thing I don't understand is how this 1.5s timeout is being configured because that is not what I set.
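To double-check that the probe really ran into the logged timeout rather than some other limit, the two relevant log lines can be compared programmatically. This is only a rough sketch: `parse_logfmt` is a helper name I made up, and it assumes the log lines are plain `key=value` pairs with optional double-quoted values, as in the output above.

```python
import shlex


def parse_logfmt(line: str) -> dict:
    """Split a logfmt-style line into a dict of key/value strings."""
    fields = {}
    # shlex.split keeps quoted values such as msg="Beginning probe" together.
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


# First and last lines of the probe log, copied from above.
begin_line = (
    'ts=2019-05-06T08:27:13.836730378Z caller=main.go:118 '
    'module=http_200_noredirect target=http://www.somedomain.net/privacy/ '
    'level=info msg="Beginning probe" probe=http timeout_seconds=1.5'
)
failed_line = (
    'ts=2019-05-06T08:27:15.337383646Z caller=main.go:131 '
    'module=http_200_noredirect target=http://www.somedomain.net/privacy/ '
    'level=error msg="Probe failed" duration_seconds=1.500593477'
)

begin = parse_logfmt(begin_line)
failed = parse_logfmt(failed_line)

timeout_s = float(begin["timeout_seconds"])
duration_s = float(failed["duration_seconds"])
overshoot_s = duration_s - timeout_s

# A tiny positive overshoot means the probe was cut off right at the timeout.
print(f"timeout={timeout_s}s duration={duration_s}s overshoot={overshoot_s:.6f}s")
```

The probe duration exceeds the logged timeout by well under a millisecond, so the 1.5s value is the limit that was actually enforced.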
- job_name: 'blackbox_http200_noredirect'
  scrape_interval: 30s
  scrape_timeout: 10s  # should be higher than http timeout
  metrics_path: /probe
  params:
    module: [http_200_noredirect]
  static_configs:
    - targets:
      - http://www.somedomain.net/privacy/
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 127.0.0.1:9115  # The blackbox exporter's real hostname:port.

modules:
  http_200_noredirect:
    prober: http
    timeout: 2s  # http timeout
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2"]
      valid_status_codes: [200]  # Defaults to 2xx
      method: GET
      headers:
        User-Agent: "Mozilla/5.0 (blackbox_exporter) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"
        Accept-Language: 'en-US,en;q=0.9,de-DE,de;q=0.2'
      no_follow_redirects: true
      fail_if_ssl: false
      fail_if_not_ssl: false
      tls_config:
        insecure_skip_verify: false
      preferred_ip_protocol: "ip4"  # defaults to "ip6"
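As far as I understand the relabeling in the scrape job, Prometheus ends up requesting `/probe` on the exporter with the module and the original target as query parameters, and passes its scrape timeout in the `X-Prometheus-Scrape-Timeout-Seconds` header. The sketch below just builds that request so it can be replayed by hand. `build_probe_request` is a hypothetical helper, and the exact formatting of the header value that Prometheus sends is an assumption on my part.

```python
from urllib.parse import urlencode

# Exporter address taken from the relabel_configs replacement above.
EXPORTER_ADDRESS = "127.0.0.1:9115"


def build_probe_request(target: str, module: str, scrape_timeout_s: float):
    """Return the (url, headers) pair that I expect Prometheus to send."""
    # __param_target and params.module become URL query parameters.
    query = urlencode({"module": module, "target": target})
    url = f"http://{EXPORTER_ADDRESS}/probe?{query}"
    # Assumption: the scrape timeout is sent as a number of seconds.
    headers = {"X-Prometheus-Scrape-Timeout-Seconds": str(scrape_timeout_s)}
    return url, headers


url, headers = build_probe_request(
    target="http://www.somedomain.net/privacy/",
    module="http_200_noredirect",
    scrape_timeout_s=10,
)
print(url)
print(headers)
```

Sending this URL with and without the header (for example with curl) should show whether the logged `timeout_seconds` changes with the header value or stays tied to the module timeout.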
So if I understand the docs/config/log (very possible I don't):
It makes sense to me that Prometheus might inform blackbox about the scrape timeout (via the X-Prometheus-Scrape-Timeout-Seconds HTTP header) so that blackbox can use it as a maximum probe timeout when Prometheus has a lower timeout than blackbox. However, in my case I am not sure how that applies, since both of my configured timeouts are above 1.5 seconds, unless 1.5 seconds is a hard-coded maximum scrape timeout.
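To make my mental model concrete, here is a small sketch of how I would expect the effective probe timeout to be chosen: the smaller of the scrape timeout and the module timeout. The `offset_s` parameter is an assumption on my part. The exporter has a `--timeout-offset` flag (default 0.5 seconds, as far as I know), and if that offset were subtracted after taking the minimum, it would turn my 2s module timeout into exactly the 1.5s seen in the log. I have not confirmed that this is what 0.13.0 actually does, and `effective_probe_timeout` is a made-up name, not the exporter's code.

```python
def effective_probe_timeout(
    scrape_timeout_s: float,
    module_timeout_s: float,
    offset_s: float = 0.0,
) -> float:
    """Guess at the timeout a probe runs with, in seconds."""
    timeout = scrape_timeout_s
    # A module timeout only applies if it is set (> 0) and is the smaller value.
    if 0 < module_timeout_s < timeout:
        timeout = module_timeout_s
    # Assumption: a safety offset may be subtracted from the result.
    return timeout - offset_s


# What I expected from my configuration: min(10s, 2s) = 2s.
print(effective_probe_timeout(10, 2))

# What would explain the log, if a 0.5s offset is applied after the minimum.
print(effective_probe_timeout(10, 2, offset_s=0.5))
```

If this reading is right, raising the module timeout or lowering the offset should shift the logged `timeout_seconds` accordingly.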
Did I misread the docs? Appreciate any advice/education.