Blackbox reporting “Resolution with IP protocol failed” error after running fine for a few hours.


Chris Paulraj

Nov 22, 2020, 9:14:31 PM
to Prometheus Users
Blackbox exporter (0.17.0), running on Kubernetes, starts reporting DNS lookup failures on HTTP probes after running fine for a few hours. Has anyone experienced such an issue? Could someone help me figure out what is going wrong?

Sample probe output

Logs for the probe:

ts=2020-11-22T00:21:21.378113222Z caller=main.go:304 module=http_2xx target=https://arp-executor-sy-shra-arp-p.icl1p.xyz.com/actuator/health/ level=info msg="Beginning probe" probe=http timeout_seconds=9.5

ts=2020-11-22T00:21:21.378347979Z caller=http.go:323 module=http_2xx target=https://arp-executor-sy-shra-arp-p.icl1p.xyz.com/actuator/health/ level=info msg="Resolving target address" ip_protocol=ip4

ts=2020-11-22T00:21:30.878306074Z caller=http.go:323 module=http_2xx target=https://arp-executor-sy-shra-arp-p.icl1p.xyz.com/actuator/health/ level=error msg="Resolution with IP protocol failed" err="i/o timeout"

ts=2020-11-22T00:21:30.878395746Z caller=main.go:119 module=http_2xx target=https://arp-executor-sy-shra-arp-p.icl1p.xyz.com/actuator/health/ level=error msg="Error resolving address" err="i/o timeout"

ts=2020-11-22T00:21:30.878422453Z caller=main.go:304 module=http_2xx target=https://arp-executor-sy-shra-arp-p.icl1p.xyz.com/actuator/health/ level=error msg="Probe failed" duration_seconds=9.500237978

Metrics that would have been returned:

# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 9.500014964
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 9.500237978
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_http_content_length Length of http content response
# TYPE probe_http_content_length gauge
probe_http_content_length 0
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 0
probe_http_duration_seconds{phase="processing"} 0
probe_http_duration_seconds{phase="resolve"} 0
probe_http_duration_seconds{phase="tls"} 0
probe_http_duration_seconds{phase="transfer"} 0
# HELP probe_http_redirects The number of redirects
# TYPE probe_http_redirects gauge
probe_http_redirects 0
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 0
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 0
# HELP probe_http_uncompressed_body_length Length of uncompressed response body
# TYPE probe_http_uncompressed_body_length gauge
probe_http_uncompressed_body_length 0
# HELP probe_http_version Returns the version of HTTP of the probe response
# TYPE probe_http_version gauge
probe_http_version 0
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 0
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 0
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 0

Module configuration:

prober: http
timeout: 10s
http:
  valid_status_codes:
    - 100
    - 200
    - 201
    - 202
    - 203
    - 204
    - 205
    - 206
    - 207
    - 208
    - 226
    - 300
    - 301
    - 302
    - 303
    - 304
    - 305
    - 306
    - 307
    - 308
  valid_http_versions:
    - HTTP/1.1
    - HTTP/2
  preferred_ip_protocol: ip4
tcp:
  ip_protocol_fallback: true
icmp:
  ip_protocol_fallback: true
dns:
  ip_protocol_fallback: true
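For reference, the targets are handed to the exporter through the standard /probe relabelling pattern; a rough sketch of the Prometheus side is below (the job name and exporter address are placeholders, not my exact config):

scrape_configs:
  - job_name: blackbox-http                  # placeholder job name
    metrics_path: /probe
    params:
      module: [http_2xx]                     # the module shown above
    static_configs:
      - targets:
          - https://arp-executor-sy-shra-arp-p.icl1p.xyz.com/actuator/health/
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115  # placeholder address of the exporter service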

b.ca...@pobox.com

Nov 23, 2020, 3:25:03 AM
to Prometheus Users
Does your blackbox container use systemd-resolved?

Chris Paulraj

Nov 23, 2020, 8:30:56 AM
to Prometheus Users
Thank you. We are not using Arch Linux; OpenShift/K8s is running on RHEL 7.x. When blackbox reports DNS lookup failures, I am able to fetch all of the HTTP URLs from the pod with no issues. There hasn't been any DNS issue with other applications running in the cluster.

b.ca...@pobox.com

Nov 23, 2020, 10:32:54 AM
to Prometheus Users
The OS that the host is running makes no difference; the question is what OS the container is built from.  You'll see this in the Dockerfile used to build the container.

If you are using the off-the-shelf Docker container for blackbox_exporter, then it will be this Dockerfile, which builds from quay.io/prometheus/busybox-linux-amd64:latest.
This in turn appears to come from here, which in turn is based on debian:buster or debian:buster-slim. I think those are systemd-based.

I think you should docker exec into the running container, and see if systemd-resolved is running, and/or if /etc/resolv.conf points to 127.0.0.53.  If so, the systemd bug I pointed to is relevant.

If not, then you can try resolving the host arp-executor-sy-shra-arp-p.icl1p.xyz.com yourself to see whether it resolves or not. Ultimately, this problem isn't with blackbox-exporter; it's a case of debugging why DNS isn't resolving. Intermittent DNS resolution failures can also be caused by problems with your authoritative DNS.
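A minimal set of checks along those lines, assuming you can exec into the pod (the pod name and namespace below are placeholders):

# Does resolv.conf point at 127.0.0.53, i.e. systemd-resolved?
kubectl exec -n <namespace> <blackbox-pod> -- cat /etc/resolv.conf

# Try the lookup from inside the container; busybox-based images usually ship an nslookup applet
kubectl exec -n <namespace> <blackbox-pod> -- nslookup arp-executor-sy-shra-arp-p.icl1p.xyz.com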

Chris Paulraj

Nov 23, 2020, 11:04:12 AM
to Prometheus Users
I created the image using RHEL 7, and I could see that DNS is delegated to the OpenShift node hosting this pod. I was also able to run a curl command from within the pod, which was successful. But as you point out, the issue could very well be within the image I built; I will try to gather more information when it happens again. I updated Prometheus and Alertmanager to the most recent versions and restarted the pods, keeping my fingers crossed. Thank you for your help.

sh-4.2$ cat /etc/resolv.conf
nameserver 10.244.60.18
search prometheus-custom.svc.cluster.local svc.cluster.local cluster.local localdomain xyz.com
options ndots:5
sh-4.2$ 

Chris Paulraj

Nov 25, 2020, 8:29:18 AM
to Prometheus Users
I tried a different build that includes network tools, but I am still unable to figure out why the lookup fails. I also tried a blackbox-exporter image from Docker Hub, which resulted in the same issue, although it ran for 8 hours without error. It does look like this is an environmental issue with my setup. Would you be able to help me with how I can increase the DNS lookup timeout for HTTP probes? Where can I increase the timeout reflected in "probe_dns_lookup_time_seconds"? Thank you.
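My understanding so far is that there is no separate DNS-lookup setting: resolution runs inside the overall probe timeout, which is the module timeout capped by the Prometheus scrape timeout (minus the exporter's offset), which would explain why the probe gives up at 9.5s. So I assume raising it means something like the sketch below (the values are only examples); is that the right place?

# blackbox.yml - module timeout, which I assume also bounds the DNS lookup
modules:
  http_2xx:
    prober: http
    timeout: 30s

# prometheus.yml - scrape_timeout has to be raised too, otherwise it still caps the probe
scrape_configs:
  - job_name: blackbox-http
    scrape_timeout: 30s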

Chris Paulraj

Dec 11, 2020, 6:21:37 PM
to Prometheus Users
Happy to report that the issue has been fixed by setting a custom DNS policy for the blackbox pods, skipping cluster DNS and pointing to an external DNS server.
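For anyone hitting the same problem, the change was along these lines in the blackbox pod spec (the nameserver address below is a placeholder for our external DNS server):

spec:
  dnsPolicy: "None"          # skip the cluster DNS resolver entirely
  dnsConfig:
    nameservers:
      - 192.0.2.53           # placeholder: external DNS server
    searches:
      - xyz.com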