Duplicate Slack notifications from Alertmanager (my misunderstanding?)

William Hargrove

Dec 2, 2021, 5:20:21 PM
to Prometheus Users
alertmanager: 0.21.0
prometheus: 2.30.3

I am trying to get my head around some unexpected alertmanager behaviour.

I am alerting on the following metrics:

client_disconnect{appenv="testbed",conn="2",compid="CLIENT-A"} 1
client_disconnect{appenv="testbed",conn="3",compid="CLIENT-A"} 1
client_disconnect{appenv="testbed",conn="4",compid="CLIENT-A"} 1
client_disconnect{appenv="testbed",conn="5",compid="CLIENT-A"} 0

and have the rule below defined:

    - alert: Client Disconnect
      expr: client_disconnect == 1
      for: 2s
      labels:
        severity: critical
        notification: slack
      annotations:
        summary: "Appenv {{ $labels.appenv }} on connection {{ $labels.conn }} compid {{ $labels.compid }} down"
        description: "{{ $labels.instance }} disconnect: {{ $labels.appenv }} on connection {{ $labels.conn }} compid {{ $labels.compid }}"
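To make the fan-out concrete: the expr matches three of the four series, so three separate alerts fire, each with its own templated annotations. A rough Python stand-in (not Alertmanager's Go templating, just an illustration with the values hard-coded from the listing above):

```python
# The four series above, as (labels, value) pairs.
series = [
    ({"appenv": "testbed", "conn": "2", "compid": "CLIENT-A"}, 1),
    ({"appenv": "testbed", "conn": "3", "compid": "CLIENT-A"}, 1),
    ({"appenv": "testbed", "conn": "4", "compid": "CLIENT-A"}, 1),
    ({"appenv": "testbed", "conn": "5", "compid": "CLIENT-A"}, 0),
]

# `client_disconnect == 1` keeps only series whose value is 1:
# one alert instance per matching label set.
firing = [labels for labels, value in series if value == 1]

# Stand-in for the Go template in the `summary` annotation.
def summary(labels):
    return (f"Appenv {labels['appenv']} on connection {labels['conn']} "
            f"compid {labels['compid']} down")

for labels in firing:
    print(summary(labels))

print(len(firing))  # 3
```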

My alertmanager config is as below:

global:

route:
  group_wait: 5s
  group_interval: 5s
  group_by: ['section','env']
  repeat_interval: 10m
  receiver: 'default_receiver'

  routes:
    - match:
        notification: slack
      receiver: slack_receiver
      group_by: ['appenv','compid']

receivers:
- name: 'slack_receiver'
  slack_configs:
    - channel: 'monitoring'
      send_resolved: true
      title: '{{ template "custom_title" . }}'
      text: '{{ template "custom_slack_message" . }}'

- name: 'default_receiver'
  webhook_configs:
      send_resolved: true

templates:
  - /etc/alertmanager/notifications.tmpl
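For reference, my understanding of how the three intervals interact (per the Alertmanager docs): group_wait delays the first notification for a new group, group_interval is the minimum wait before notifying about changes to an existing group (e.g. an alert resolving), and repeat_interval is the minimum wait before re-sending an unchanged group. A sketch of that decision as I read it, not the real implementation:

```python
def should_notify(group_changed: bool, seconds_since_last: float,
                  group_interval: float = 5.0,
                  repeat_interval: float = 600.0) -> bool:
    """Sketch of per-group notification gating (my reading of the
    documented semantics, not Alertmanager's actual code)."""
    if group_changed:  # an alert fired or resolved within the group
        return seconds_since_last >= group_interval
    # Unchanged group: only re-send once repeat_interval has elapsed.
    return seconds_since_last >= repeat_interval

# With group_interval: 5s, a resolve triggers a fresh notification quickly:
print(should_notify(group_changed=True, seconds_since_last=6))   # True
# An unchanged group repeats only after repeat_interval (10m here):
print(should_notify(group_changed=False, seconds_since_last=6))  # False
```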

My custom template results in a message formatted as below being displayed in Slack:

[attached: slack1.PNG]

As expected, this repeats every 10 minutes.

If one of these client_disconnects subsequently resolves, such that the metric now looks like this:

client_disconnect{appenv="testbed",conn="2",compid="CLIENT-A"} 1
client_disconnect{appenv="testbed",conn="3",compid="CLIENT-A"} 1
client_disconnect{appenv="testbed",conn="4",compid="CLIENT-A"} 0
client_disconnect{appenv="testbed",conn="5",compid="CLIENT-A"} 0
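Since my route groups by ['appenv','compid'] and every one of these series shares those label values, the still-firing and newly-resolved alerts all land in the same notification group. A minimal sketch of that grouping, with the states taken from the series above:

```python
from collections import defaultdict

# Label sets from the series above, with their current state.
alerts = [
    ({"appenv": "testbed", "conn": "2", "compid": "CLIENT-A"}, "firing"),
    ({"appenv": "testbed", "conn": "3", "compid": "CLIENT-A"}, "firing"),
    ({"appenv": "testbed", "conn": "4", "compid": "CLIENT-A"}, "resolved"),
]

# Alertmanager keys each group by the values of the group_by labels.
group_by = ("appenv", "compid")
groups = defaultdict(list)
for labels, state in alerts:
    key = tuple(labels[name] for name in group_by)
    groups[key].append(state)

# Everything collapses into one group: 2 firing + 1 resolved.
print(dict(groups))
```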

Then I receive the following messages:

[attached: slack2.PNG]

When the repeat interval comes round (10 minutes later) I receive the following messages:

[attached: slack3.PNG]

The second firing line comes in at 22:02 and the third firing line at 22:03 (sorry, the timestamps only show on hover in Slack).

I can't understand this behaviour. I am running single, unclustered instances of Prometheus and Alertmanager.

Is anyone in a position to explain this behaviour to me? I get a very similar situation if I use the webhook receiver instead of Slack.

The subsequent repeat (after the last message) shows the current state:

[attached: slack4.PNG]

Many thanks.

For reference, my slack templates are below:

{{ define "__single_message_title" }}{{ range .Alerts.Firing }}{{ .Labels.alertname }} on {{ .Annotations.identifier }}{{ end }}{{ range .Alerts.Resolved }}{{ .Labels.alertname }} on {{ .Annotations.identifier }}{{ end }}{{ end }}

{{ define "custom_title" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ if or (and (eq (len .Alerts.Firing) 1) (eq (len .Alerts.Resolved) 0)) (and (eq (len .Alerts.Firing) 0) (eq (len .Alerts.Resolved) 1)) }}{{ template "__single_message_title" . }}{{ end }}{{ end }}

{{ define "custom_slack_message" }}
{{ if or (and (eq (len .Alerts.Firing) 1) (eq (len .Alerts.Resolved) 0)) (and (eq (len .Alerts.Firing) 0) (eq (len .Alerts.Resolved) 1)) }}
{{ range .Alerts.Firing }}{{ .Annotations.description }}{{ end }}{{ range .Alerts.Resolved }}{{ .Annotations.description }}{{ end }}
{{ else }}
{{ if gt (len .Alerts.Firing) 0 }}
*Alerts Firing:*
Client disconnect: {{ .CommonLabels.appenv }} for {{ .CommonLabels.compid }}. Connections: {{ range .Alerts.Firing }}{{ .Labels.conn }} {{ end }}have failed.
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}
*Alerts Resolved:*
Client disconnect: {{ .CommonLabels.appenv }} for {{ .CommonLabels.compid }}. Connections: {{ range .Alerts.Resolved }}{{ .Labels.conn }} {{ end }}have recovered.
{{ end }}
{{ end }}
{{ end }}
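Note the single-vs-grouped branch in custom_slack_message only uses the per-alert description when the group holds exactly one alert in total; any mix (e.g. two firing plus one resolved) falls through to the grouped "Connections: ..." form, which is why the combined messages above look the way they do. The condition, transcribed into Python for checking:

```python
def single_message(firing: int, resolved: int) -> bool:
    # Transcription of the template condition:
    # (eq firing 1 and eq resolved 0) or (eq firing 0 and eq resolved 1)
    return (firing == 1 and resolved == 0) or (firing == 0 and resolved == 1)

print(single_message(1, 0))  # True  -> per-alert description branch
print(single_message(2, 1))  # False -> grouped "Connections: ..." branch
```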

