black box exporter monitoring SSH and PING


nina guo

Apr 20, 2022, 7:37:17 AM
to Prometheus Users
Hi guys,

We are using the blackbox exporter to monitor SSH and ping.

For SSH (we monitor port 22): if we stop the sshd service it is automatically restarted, but the blackbox exporter only detects the recovery after about 5 minutes.

For ping, we use the icmp module. We deleted the iptables rules, and Prometheus triggered two alerts: one that SSH failed and one that ping failed. After the iptables rules were restored, the ping alert cleared after about 20 minutes, but the SSH alert is still firing.

So is it a good approach to use the blackbox exporter to monitor SSH and ping?

Brian Candler

Apr 20, 2022, 8:29:10 AM
to Prometheus Users
blackbox_exporter monitoring TCP ports (e.g. for SSH) and ICMP (ping) works fine.

"but black box exporter detect the recover behavior after about 5mins"

Black box exporter only performs a single test when you scrape it.  It does not by itself do any recovery detection.  The problem is therefore most likely with your prometheus scrape config or your alertmanager config.

If you're having a problem, you'll need to be more specific:
* show your blackbox_exporter config, your prometheus scrape config which scrapes it, your alerting rules, and your alertmanager config (if using alertmanager)
* describe more clearly the behaviour you're seeing, and what you expected to see.  (For example, are you waiting for a "recovery" E-mail from alertmanager?)

"And after the IP table is recovered, the alert for Ping can be cleared after about 20mins, but SSH is still there."

Either SSH is working and reachable, or it is not.  You can check the results of blackbox_exporter tests by hand using curl, and also get additional debugging information, like this:


Here is an example:

# curl -g 'http://localhost:9115/probe?module=icmp&target=1.2.3.4&debug=true'
Logs for the probe:
ts=2022-04-20T12:25:11.587855449Z caller=main.go:320 module=icmp target=1.2.3.4 level=info msg="Beginning probe" probe=icmp timeout_seconds=3
ts=2022-04-20T12:25:11.588014456Z caller=icmp.go:91 module=icmp target=1.2.3.4 level=info msg="Resolving target address" ip_protocol=ip6
ts=2022-04-20T12:25:11.588065658Z caller=icmp.go:91 module=icmp target=1.2.3.4 level=info msg="Resolving target address" ip_protocol=ip4
ts=2022-04-20T12:25:11.588098688Z caller=icmp.go:91 module=icmp target=1.2.3.4 level=info msg="Resolved target address" ip=1.2.3.4
ts=2022-04-20T12:25:11.588133368Z caller=main.go:130 module=icmp target=1.2.3.4 level=info msg="Creating socket"
ts=2022-04-20T12:25:11.588188673Z caller=main.go:130 module=icmp target=1.2.3.4 level=debug msg="Unable to do unprivileged listen on socket, will attempt privileged" err="socket: permission denied"
ts=2022-04-20T12:25:11.58829848Z caller=main.go:130 module=icmp target=1.2.3.4 level=info msg="Creating ICMP packet" seq=24581 id=190
ts=2022-04-20T12:25:11.588348917Z caller=main.go:130 module=icmp target=1.2.3.4 level=info msg="Writing out packet"
ts=2022-04-20T12:25:11.588470176Z caller=main.go:130 module=icmp target=1.2.3.4 level=info msg="Waiting for reply packets"
ts=2022-04-20T12:25:14.588761946Z caller=main.go:130 module=icmp target=1.2.3.4 level=debug msg="Cannot get TTL from the received packet. 'probe_icmp_reply_hop_limit' will be missing."
ts=2022-04-20T12:25:14.588979317Z caller=main.go:130 module=icmp target=1.2.3.4 level=warn msg="Timeout reading from socket" err="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout"
ts=2022-04-20T12:25:14.589247538Z caller=main.go:320 module=icmp target=1.2.3.4 level=error msg="Probe failed" duration_seconds=3.001307309



Metrics that would have been returned:
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.000116077
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 3.001307309
# HELP probe_icmp_duration_seconds Duration of icmp request by phase
# TYPE probe_icmp_duration_seconds gauge
probe_icmp_duration_seconds{phase="resolve"} 0.000116077
probe_icmp_duration_seconds{phase="rtt"} 0
probe_icmp_duration_seconds{phase="setup"} 0.000212886
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 3.268949123e+09
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 0



Module configuration:
prober: icmp
timeout: 3s
http:
    ip_protocol_fallback: true
    follow_redirects: true
tcp:
    ip_protocol_fallback: true
icmp:
    ip_protocol_fallback: true
dns:
    ip_protocol_fallback: true



Look at "probe_success" for the overall result.

You can also use the PromQL browser in the Prometheus web interface: enter "probe_success" as the query and look at the graph tab. You'll see the history of your blackbox exporter probes.

nina guo

Apr 21, 2022, 4:22:32 AM
to Prometheus Users
blackbox exporter config:
icmp:
        prober: icmp
        icmp:
          preferred_ip_protocol: "ip4"
tcp:
        prober: tcp
        timeout: 5s
        tcp:
          preferred_ip_protocol: "ip4"

Prometheus scrape config:
global:
      scrape_interval: 60s
      evaluation_interval: 60s
- job_name: PING
        metrics_path: /probe
        params:
          module: [icmp]
        file_sd_configs:
        - files:
          - '/etc/prometheus/targets/'
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
            regex: '([^:]+)(:[0-9]+)?'
            replacement: '${1}'
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: prometheus-blackbox-exporter:9115
      - job_name: SSH
        metrics_path: /probe
        params:
          module: [ssh_banner]
        file_sd_configs:
        - files:
          - '/etc/prometheus/targets/'
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
            regex: '([^:]+)(:[0-9]+)?'
            replacement: '${1}:22'
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: prometheus-blackbox-exporter:9115

Alert rules:
- alert: TargetDown
          expr: probe_success == 0
          for: 5s
          labels:
            severity: critical
          annotations:
            description: Service {{ $labels.instance }} is unreachable.
            value: DOWN ({{ $value }})
            summary: "Target {{ $labels.instance }} is down."

Alert manager config:
config.yml: |-
    global:
      resolve_timeout: 5m
      smtp_smarthost: mail
      smtp_from: alertmanager
      smtp_require_tls: false
    route:
      receiver: email-me
      group_by: [instance, alertname, job]
      group_wait: 45s
      group_interval: 5m
      repeat_interval: 24h
    receivers:
    - name: email-me
      email_configs:
      - to: alert
        send_resolved: true

Brian Candler

Apr 21, 2022, 4:51:18 AM
to Prometheus Users
On Thursday, 21 April 2022 at 09:22:32 UTC+1 ninag...@gmail.com wrote:
blackbox exporter config:
icmp:
        prober: icmp
        icmp:
          preferred_ip_protocol: "ip4"
tcp:
        prober: tcp
        timeout: 5s
        tcp:
          preferred_ip_protocol: "ip4"

Prometheus scrape config:
... 
      - job_name: SSH
        metrics_path: /probe
        params:
          module: [ssh_banner]
        file_sd_configs:
        - files:
          - '/etc/prometheus/targets/'
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
            regex: '([^:]+)(:[0-9]+)?'
            replacement: '${1}:22'
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: prometheus-blackbox-exporter:9115

In your scrape job you are setting parameter module=ssh_banner, but you have not defined a module called "ssh_banner" in your blackbox exporter config.

Therefore it will always result in a failure.  Test like this:
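
For example, the same kind of debug probe as the ICMP example above, but against the ssh_banner module (YOUR_HOST is a placeholder for one of your targets):

# curl -g 'http://localhost:9115/probe?module=ssh_banner&target=YOUR_HOST:22&debug=true'

You will also need to define an ssh_banner module in your blackbox exporter config. A minimal sketch, adapted from the example config shipped with blackbox_exporter (the timeout and expect pattern are illustrative, adjust as needed):

ssh_banner:
        prober: tcp
        timeout: 5s
        tcp:
          preferred_ip_protocol: "ip4"
          query_response:
          - expect: "^SSH-2.0-"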

 
Alert rules:
- alert: TargetDown
          expr: probe_success == 0
          for: 5s
          labels:
            severity: critical
          annotations:
            description: Service {{ $labels.instance }} is unreachable.
            value: DOWN ({{ $value }})
            summary: "Target {{ $labels.instance }} is down."


You can leave out "for: 5s" since you're only scraping and evaluating rules every 60s.

If you don't want an immediate alert in the case of a single probe failure (like a single dropped packet), then set "for: 1m" or "for: 2m" as required.  This will then only alert if the alert is continuously present for that duration.
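
For example, your existing rule with a 2-minute hold (everything else unchanged):

- alert: TargetDown
  expr: probe_success == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    description: Service {{ $labels.instance }} is unreachable.
    value: DOWN ({{ $value }})
    summary: "Target {{ $labels.instance }} is down."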

 
Alert manager config:
...

    - name: email-me
      email_configs:
      - to: alert
        send_resolved: true


In your original post you said "the blackbox exporter only detects the recovery after about 5 minutes". Are you talking about when you receive the "send_resolved" message from alertmanager?

There are various delays which can occur between prometheus making an alert and alertmanager sending it, and also with prometheus withdrawing an alert and alertmanager sending a resolved message.

If I understand correctly: Prometheus doesn't explicitly "resolve" an alert, rather it just stops sending that alert.  The alert comes with an "endsAt" time, which is explained here:
"3x the greater of the evaluation_interval or resend-delay values"
Since you have an evaluation_interval of 60s, I believe this means there will be at least a 3 minute delay between an alert ceasing to fire, and the resolved message being sent.
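(Assuming the default --rules.alert.resend-delay of 1m, that works out to 3 × max(60s, 1m) = 3 minutes; if you have changed that flag, scale accordingly.)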

See also:

# ResolveTimeout is the default value used by alertmanager if the alert does
# not include EndsAt, after this time passes it can declare the alert as resolved if it has not been updated.
# This has no impact on alerts from Prometheus, as they always include EndsAt.
[ resolve_timeout: <duration> | default = 5m ]

Really I think you need to separate your problem into two parts:
1. Making sure that blackbox_exporter is probing ICMP and SSH successfully.  Check that "probe_success" is going to 0 or 1 at the correct times.  View the PromQL history of the probe_success metric to confirm this (example query below).  Ignore alerts.
2. Then look at your alerting configuration, as to exactly when it sends messages.
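
For step 1, an example query for the Prometheus expression browser (the job name "SSH" is taken from your scrape config; use "PING" for the ICMP probes):

probe_success{job="SSH"}

Switch to the Graph tab to see exactly when the value flips between 1 (probe OK) and 0 (probe failed).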

Brian Candler

Apr 21, 2022, 5:10:47 AM
to Prometheus Users
On Thursday, 21 April 2022 at 09:51:18 UTC+1 Brian Candler wrote:
If I understand correctly: Prometheus doesn't explicitly "resolve" an alert, rather it just stops sending that alert.

Sorry, I was wrong. To resolve the alert, prometheus posts an alert with endsAt equal to the time when the alert went away.  (Tested with tcpdump)
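
If you want to see this yourself, the body Prometheus POSTs to Alertmanager (the /api/v2/alerts endpoint, assuming the default v2 API) is a JSON array along these lines; the values here are purely illustrative:

[
  {
    "labels": { "alertname": "TargetDown", "instance": "1.2.3.4", "severity": "critical" },
    "annotations": { "summary": "Target 1.2.3.4 is down." },
    "startsAt": "2022-04-21T08:00:00Z",
    "endsAt": "2022-04-21T08:05:00Z"
  }
]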