Prevent Alertmanager sending resolved when silenced


dc3o

Mar 10, 2021, 3:31:30 AM
to Prometheus Users
A few times we have had to bring our database clusters down for maintenance. Before doing so, we create a silence for a limited period of time, and the silence properly catches all the alerts. The problem is that once the db host is down, Prometheus is no longer scraping its metrics and marks the initial alert as resolved: no metrics, no problem. It looks like sending of resolved notifications skips the silencing pipeline, and we are getting alert fatigue from the resolved events.
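The pre-maintenance silence can also be scripted instead of created by hand. Below is a minimal sketch that builds a silence payload for the Alertmanager v2 API (POST /api/v2/silences); the `cluster` matcher, the author, and the duration are assumptions for illustration, not details from this thread:

```python
# Sketch: build a pre-maintenance silence payload for the Alertmanager
# v2 API. The cluster label, author, and duration are hypothetical;
# adjust them to your own alert labels and deployment.
import json
from datetime import datetime, timedelta, timezone

def make_silence(cluster: str, hours: int) -> dict:
    """Build a silence covering every alert from one db cluster."""
    now = datetime.now(timezone.utc)
    return {
        "matchers": [
            # Matches alerts whose cluster label equals the given value.
            {"name": "cluster", "value": cluster, "isRegex": False},
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": "dba-team",  # hypothetical author
        "comment": "planned database maintenance",
    }

payload = make_silence("db-prod", 2)
print(json.dumps(payload, indent=2))
```

POSTing this JSON to your Alertmanager's /api/v2/silences endpoint activates the silence; the same thing can be done from the command line with `amtool silence add`.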

Alertmanager configuration:

---
global:
  smtp_smarthost: localhost:25
  smtp_from: alertmanager@localhost
templates:
  - "/etc/alertmanager/*.tmpl"
route:
  group_by:
    - "..."
  group_wait: 3s
  group_interval: 10s
  repeat_interval: 12h
  receiver: webhook
  routes:
    - match:
        severity: info
      receiver: webhook
    - match:
        severity: critical
      receiver: pagerduty
    - match:
        severity: warning
      receiver: slack
receivers:
  - name: webhook
    webhook_configs:
      - url: https://wh.local/api/webhooks/prometheus
        send_resolved: true
    slack_configs:
      - api_url: https://XXXXXXXXXXX
        channel: "#prom-info"
        icon_url: https://avatars3.githubusercontent.com/u/3380462
        http_config:
          proxy_url: http://proxy:3129/
        send_resolved: true
        actions:
          - type: button
            text: 'Silence :no_bell:'
            url: '{{ template "__alert_silence_link" . }}'
  - name: pagerduty
    pagerduty_configs:
      - service_key: XXXXXXXXXXXXX
        http_config:
          proxy_url: http://proxy:3129/
    webhook_configs:
      - url: https://wh.local/api/webhooks/prometheus
        send_resolved: true
  - name: slack
    slack_configs:
      - api_url: XXXXXXXXXXXXX
        channel: "#prom-warnings"
        icon_url: https://avatars3.githubusercontent.com/u/3380462
        http_config:
          proxy_url: http://proxy:3129/
        send_resolved: true
        actions:
          - type: button
            text: 'Silence :no_bell:'
            url: '{{ template "__alert_silence_link" . }}'
    webhook_configs:
      - url: https://wh.local/api/webhooks/prometheus
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal:
      - alertname
      - cluster
      - service

Bjoern Rabenstein

Mar 18, 2021, 12:11:36 PM
to dc3o, Prometheus Users
On 10.03.21 00:31, dc3o wrote:
> A few times we have had to bring our database clusters down for
> maintenance. Before doing so, we create a silence for a limited period
> of time, and the silence properly catches all the alerts. The problem
> is that once the db host is down, Prometheus is no longer scraping its
> metrics and marks the initial alert as resolved: no metrics, no
> problem. It looks like sending of resolved notifications skips the
> silencing pipeline, and we are getting alert fatigue from the resolved
> events.

Yeah, in my understanding, resolved notifications currently have
semantics independent of silencing. That is IMHO confusing, because a
silenced alert is also not repeatedly sent to the receiver as configured
with the repeat_interval. (Some receivers are configured to consider an
alert resolved after a while if they do not receive any updates.)

See the old issue
https://github.com/prometheus/alertmanager/issues/226, which collects
considerations about when Alertmanager should send resolved
notifications and when it should not. I expect some movement on this
front in the near future. Reporting your use case and your expectations
there might be helpful.
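[Editor's note: until that issue is settled, one blunt stopgap, not suggested in the thread and therefore only a sketch of one possible trade-off, is to turn off resolved notifications on the noisy receiver entirely. Note that this drops all resolved events for that receiver, not just the silenced ones:

```yaml
# Stopgap sketch: stop the webhook receiver from sending any
# resolved notifications at all (the thread's config has this true).
receivers:
  - name: webhook
    webhook_configs:
      - url: https://wh.local/api/webhooks/prometheus
        send_resolved: false
```
]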

--
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email] bjo...@rabenste.in