How to detect lost connection to Alertmanager


R. Diez

Nov 20, 2021, 4:17:05 PM
to Prometheus Users
Hi all:

I am fairly new to Prometheus. I am using the Prometheus version 2.15.2 that comes with Ubuntu 20.04.

On this page:


I found this alert:

  - alert: PrometheusNotConnectedToAlertmanager
    expr: prometheus_notifications_alertmanagers_discovered < 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus not connected to alertmanager (instance {{ $labels.instance }})

But it is not working properly. The metric prometheus_notifications_alertmanagers_discovered starts at 0 and then goes to 1, as expected.

However, when I stop the service, it does not revert to 0:

systemctl stop prometheus-alertmanager.service

I checked that Alertmanager is not running by trying to load this URL:


By the way, my Prometheus configuration looks like this:

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['localhost:9093']

In fact, with prometheus-alertmanager.service still stopped, if I restart Prometheus, the value for prometheus_notifications_alertmanagers_discovered still goes from 0 to 1.

When I start Alertmanager again, alerts are generated and the e-mails come through.

Is this a known issue with that Prometheus version?

Or is there a better way to check whether the connection between Prometheus and Alertmanager is healthy?

Thanks in advance,
  rdiez

Brian Candler

Nov 21, 2021, 5:29:29 AM
to Prometheus Users
On Saturday, 20 November 2021 at 21:17:05 UTC rdie...@gmail.com wrote:
But it is not working properly. Metric prometheus_notifications_alertmanagers_discovered starts at 0, and then it goes to 1 as expected.

However, when I stop the service, it does not revert to 0:

It's unclear to me what that particular metric measures. It could just be talking about the service discovery of alertmanagers. Given that your prometheus.yml contains:

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['localhost:9093']

 
then the service discovery ("targets") always returns one alertmanager, whether that alertmanager is up or down.
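
If you want to see what Prometheus has actually discovered, the HTTP API has an endpoint for it (a quick check, assuming Prometheus is listening on its default port 9090):

curl http://localhost:9090/api/v1/alertmanagers

It lists the active and dropped Alertmanagers as discovered. With a static_configs entry like yours, localhost:9093 should still show up as active even when nothing is listening there, which would explain why the metric stays at 1.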
 
Or is there a better way to check whether the connection between Prometheus and Alertmanager is healthy?

I suggest you scrape the alertmanager itself, by adding a new scrape job:

  - job_name: alertmanager
    static_configs:
      - targets: ['localhost:9093']

Then you can check the up{job="alertmanager"} metric to tell if alertmanager is up or down.  In addition, you'll collect extra alertmanager-specific metrics, such as the number of alerts which have been sent out over different channels. Use "curl localhost:9093/metrics" to see them.
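
For example, an alerting rule along these lines (just a sketch; adjust the job name, duration and labels to your setup) would fire whenever that scrape fails:

  - alert: AlertmanagerDown
    expr: up{job="alertmanager"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Alertmanager target is down (instance {{ $labels.instance }})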

Of course, if alertmanager is down, it's hard to get alerted on this condition :-)

R. Diez

Nov 21, 2021, 12:57:51 PM
to Prometheus Users
First of all, thanks for your answer.

Scraping the Alertmanager is an interesting idea. However, although it is rather unlikely, Prometheus might be able to scrape it and yet fail to send alerts to it.

In the meantime, I found another approach on the Internet that should be more reliable:

- alert: PrometheusErrorSendingAlertsToSomeAlertmanagers
  annotations:
    description: '{{ printf "%.1f" $value }}% errors while sending alerts from Prometheus
      {{$labels.instance}} to Alertmanager {{$labels.alertmanager}}.'
    summary: Prometheus has encountered more than 1% errors sending alerts to a specific Alertmanager.
  expr: |
    (
      rate(prometheus_notifications_errors_total{job="prometheus"}[5m])
    /
      rate(prometheus_notifications_sent_total{job="prometheus"}[5m])
    )
    * 100
    > 1  # This is a percentage.
  for: 15m
  labels:
    severity: critical
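
Note that the {job="prometheus"} selector assumes Prometheus scrapes its own metrics under a job named "prometheus". If that is not already in place, a self-scrape job roughly like this one (a sketch; adjust to your setup) is needed:

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']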
