alertmanager: 0.21.0
prometheus: 2.30.3
I am trying to get my head around some unexpected alertmanager behaviour.
I am alerting on the following metrics:
client_disconnect{appenv="testbed",conn="2",compid="CLIENT-A"} 1
client_disconnect{appenv="testbed",conn="3",compid="CLIENT-A"} 1
client_disconnect{appenv="testbed",conn="4",compid="CLIENT-A"} 1
client_disconnect{appenv="testbed",conn="5",compid="CLIENT-A"} 0
and have the rule below defined:
- alert: Client Disconnect
  expr: client_disconnect == 1
  for: 2s
  labels:
    severity: critical
    notification: slack
  annotations:
    summary: "Appenv {{ $labels.appenv }} on connection {{ $labels.conn }} compid {{ $labels.compid }} down"
    description: "{{ $labels.instance }} disconnect: {{ $labels.appenv }} on connection {{ $labels.conn }} compid {{ $labels.compid }}"
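To make the expected firing set concrete, here is a minimal Python sketch of the filter that `client_disconnect == 1` applies to the four samples above. It only mimics the comparison and is not how Prometheus evaluates rules internally; `samples` and `firing_series` are hypothetical names used for illustration.

```python
# Hypothetical in-memory copy of the four samples shown above.
samples = [
    ({"appenv": "testbed", "conn": "2", "compid": "CLIENT-A"}, 1),
    ({"appenv": "testbed", "conn": "3", "compid": "CLIENT-A"}, 1),
    ({"appenv": "testbed", "conn": "4", "compid": "CLIENT-A"}, 1),
    ({"appenv": "testbed", "conn": "5", "compid": "CLIENT-A"}, 0),
]


def firing_series(samples, threshold=1):
    """Return the label sets whose sample value equals the threshold."""
    return [labels for labels, value in samples if value == threshold]


firing = firing_series(samples)
# Connections 2, 3 and 4 satisfy the expression; connection 5 does not.
print([labels["conn"] for labels in firing])
```

So I expect three separate alerts (one per `conn` label) to reach alertmanager once the `for: 2s` period has passed.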
My alertmanager config is as below:
global:
route:
  group_wait: 5s
  group_interval: 5s
  group_by: ['section','env']
  repeat_interval: 10m
  receiver: 'default_receiver'
  routes:
    - match:
        notification: slack
      receiver: slack_receiver
      group_by: ['appenv','compid']
receivers:
  - name: 'slack_receiver'
    slack_configs:
      - channel: 'monitoring'
        send_resolved: true
        title: '{{ template "custom_title" . }}'
        text: '{{ template "custom_slack_message" . }}'
  - name: 'default_receiver'
    webhook_configs:
      - send_resolved: true
templates:
  - /etc/alertmanager/notifications.tmpl
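As I understand it, the child route matches on `notification: slack` and then groups by `appenv` and `compid`, so all three alerts should land in a single group. The Python sketch below only illustrates that expectation; `group_alerts` is a hypothetical helper, and the label sets are my assumption of what the rule above produces (the `instance` label is left out).

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alert label sets by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # A missing label is treated as an empty value for grouping purposes.
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Assumed label sets for the three firing alerts.
alerts = [
    {
        "alertname": "Client Disconnect",
        "severity": "critical",
        "notification": "slack",
        "appenv": "testbed",
        "compid": "CLIENT-A",
        "conn": conn,
    }
    for conn in ("2", "3", "4")
]

groups = group_alerts(alerts, ["appenv", "compid"])
for key, members in groups.items():
    print(key, [labels["conn"] for labels in members])
```

With this grouping I would expect one Slack message per notification covering every connection in the group, rather than separate messages per connection.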
My custom template results in a message formatted as below being displayed in Slack:
As expected, this repeats every 10 minutes.
If one of these client_disconnects subsequently resolves, such that the metric now looks like this:
client_disconnect{appenv="testbed",conn="2",compid="CLIENT-A"} 1
client_disconnect{appenv="testbed",conn="3",compid="CLIENT-A"} 1
client_disconnect{appenv="testbed",conn="4",compid="CLIENT-A"} 0
client_disconnect{appenv="testbed",conn="5",compid="CLIENT-A"} 0
Then I receive the following messages:
When the repeat interval comes round (10 mins later) I receive the following messages:
The second firing line comes in at 22:02 and the third firing line at 22:03 (sorry, the timestamps only show on hover in Slack).
I can't understand this behaviour. I am running single unclustered instances of prometheus and alertmanager.
Is anyone in a position to explain this behaviour to me? I get a very similar situation if I simply use the webhook instead of Slack.
The subsequent repeat (after the last message) shows the current state:
Many thanks.
For reference, my slack templates are below:
{{ define "__single_message_title" }}{{ range .Alerts.Firing }}{{ .Labels.alertname }} on {{ .Annotations.identifier }}{{ end }}{{ range .Alerts.Resolved }}{{ .Labels.alertname }} on {{ .Annotations.identifier }}{{ end }}{{ end }}
{{ define "custom_title" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ if or (and (eq (len .Alerts.Firing) 1) (eq (len .Alerts.Resolved) 0)) (and (eq (len .Alerts.Firing) 0) (eq (len .Alerts.Resolved) 1)) }}{{ template "__single_message_title" . }}{{ end }}{{ end }}
{{ define "custom_slack_message" }}
{{ if or (and (eq (len .Alerts.Firing) 1) (eq (len .Alerts.Resolved) 0)) (and (eq (len .Alerts.Firing) 0) (eq (len .Alerts.Resolved) 1)) }}
{{ range .Alerts.Firing }}{{ .Annotations.description }}{{ end }}{{ range .Alerts.Resolved }}{{ .Annotations.description }}{{ end }}
{{ else }}
{{ if gt (len .Alerts.Firing) 0 }}
*Alerts Firing:*
Client disconnect: {{ .CommonLabels.appenv }} for {{ .CommonLabels.compid }}. Connections: {{ range .Alerts.Firing }}{{ .Labels.conn }} {{ end }}have failed.
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}
*Alerts Resolved:*
Client disconnect: {{ .CommonLabels.appenv }} for {{ .CommonLabels.compid }}. Connections: {{ range .Alerts.Resolved }}{{ .Labels.conn }} {{ end }}have failed.
{{ end }}
{{ end }}
{{ end }}
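To spell out the branching in `custom_slack_message`, here is a rough Python mirror of the logic. It is only a sketch: `render_slack_message` is a hypothetical function, alerts are plain dicts with assumed `conn` and `description` keys, and the whitespace does not match what the Go template engine would actually emit.

```python
def render_slack_message(firing, resolved, common):
    """Mimic the template: one alert gives its description, otherwise list connections."""
    single = (len(firing) == 1 and len(resolved) == 0) or (
        len(firing) == 0 and len(resolved) == 1
    )
    if single:
        return "".join(alert["description"] for alert in firing + resolved)

    lines = []
    for heading, group in (("*Alerts Firing:*", firing), ("*Alerts Resolved:*", resolved)):
        if group:
            conns = "".join(alert["conn"] + " " for alert in group)
            lines.append(heading)
            # The template uses the same "have failed" wording in both sections.
            lines.append(
                f"Client disconnect: {common['appenv']} for {common['compid']}. "
                f"Connections: {conns}have failed."
            )
    return "\n".join(lines)


common = {"appenv": "testbed", "compid": "CLIENT-A"}
firing = [{"conn": "2", "description": ""}, {"conn": "3", "description": ""}]
resolved = [{"conn": "4", "description": ""}]
print(render_slack_message(firing, resolved, common))
```

For the state after connection 4 recovers, this is the single combined message I would have expected: connections 2 and 3 under firing and connection 4 under resolved.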