Grouping of alarms (group_interval, group_wait and repeat_interval)


rosaLux161

Aug 12, 2020, 9:41:47 AM
to Prometheus Users

I'm trying to understand how the grouping of alarms works.

The alertmanager is configured as follows:

```
---
route:
  group_by:
  - alertname
  group_interval: 10s
  group_wait: 10s
  receiver: opsgenie
  repeat_interval: 1m
  routes:
  - receiver: opsgenie
receivers:
- name: opsgenie
  opsgenie_configs:
  - api_key: '***'
    api_url: '***'
    description: 'Alert: {{ range .Alerts}} {{ .Labels.instance }}, {{ end }}'
    message: 'Alert'
    send_resolved: true
```

What is the difference between group_wait and repeat_interval?

Let's assume:

alert 1:
  alertname: alert1
  label_1: abc

alert 2:
  alertname: alert1
  label: def

If alert 1 and alert 2 occur simultaneously or in a very short time, then only one alert should be sent out. If alert 2 only occurs after some time, then another alert should be sent. The latter does not work. If alert 2 occurs, nothing happens.
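
For reference, here is a minimal sketch of the route section from the configuration above, annotated with what each of the three timers does as described in the Alertmanager documentation. The values are the ones from the post; the comments are an editorial summary, not part of the original configuration.

```yaml
route:
  group_by:
  - alertname
  # How long to wait after the first alert of a NEW group arrives before
  # sending the initial notification, so that alerts firing close
  # together end up in the same notification.
  group_wait: 10s
  # Once a group has already been notified, how long to wait before
  # sending a notification about alerts newly added to that group.
  group_interval: 10s
  # How long to wait before re-sending a notification that was already
  # sent successfully, if the group is still firing.
  repeat_interval: 1m
  receiver: opsgenie
```

With these settings, alert 2 arriving well after alert 1 would be expected to trigger a new notification for the existing group once group_interval has passed, because both alerts share the same alertname and therefore the same group.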

Christian Hoffmann

Aug 12, 2020, 5:00:25 PM
to rosaLux161, Prometheus Users
Hi,

On 8/12/20 3:41 PM, rosaLux161 wrote:
> If alert 1 and alert 2 occur simultaneously or in a very short time,
> then only one alert should be sent out. If alert 2 only occurs after
> some time, then another alert should be sent. The latter does not work.
> If alert 2 occurs, nothing happens.
Hrm, that sounds unexpected. Could it be that OpsGenie is doing some
additional filtering/grouping?
Maybe try with a simpler receiver for testing, e.g. email?
You can also try checking the logs and/or Alertmanager metrics to see if
there are any problems with sending notifications.
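
A test setup along those lines could look like the sketch below. The SMTP host, sender and recipient addresses and the receiver name email-test are placeholders (assumptions, not from this thread); the grouping settings are copied from the original configuration so that only the receiver changes.

```yaml
global:
  # Placeholder SMTP settings; replace with a real mail server.
  smtp_smarthost: 'smtp.example.org:587'
  smtp_from: 'alertmanager@example.org'
route:
  group_by:
  - alertname
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1m
  # Hypothetical receiver name, used only for this test.
  receiver: email-test
receivers:
- name: email-test
  email_configs:
  - to: 'you@example.org'   # placeholder recipient
    send_resolved: true
```

If email notifications arrive for the second alert but OpsGenie shows nothing new, the filtering is most likely happening on the receiving side rather than in Alertmanager.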

Note: What you describe as "alert" is usually referred to as
"notification" in Alertmanager terms.

Kind regards,
Christian

rosaLux161

Aug 13, 2020, 8:20:18 AM
to Prometheus Users
Yes, you're right. Email notifications work.

After some research, I found out that OpsGenie deduplicates notifications using an alias that Alertmanager derives from the group labels: https://github.com/prometheus/alertmanager/issues/1598

The group labels contain the grouping data defined by group_by.
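
To illustrate with the two alerts from the first post: the sketch below is illustration only, not Alertmanager configuration, and the keys alerts, labels and group_labels are hypothetical names chosen for this example. With group_by set to alertname, both alerts end up with identical group labels, so an alias derived from them is identical too.

```yaml
# Illustration only; this is not a valid Alertmanager config file.
group_by:
- alertname
alerts:
- labels:
    alertname: alert1
    label_1: abc
  # Only the labels listed in group_by survive into the group labels.
  group_labels:
    alertname: alert1
- labels:
    alertname: alert1
    label: def
  # Same group labels as the first alert, hence the same alias.
  group_labels:
    alertname: alert1
```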

Consider the following scenario:

* Monitoring the reachability of web pages

Three pages become unreachable within 10 seconds. What should group_wait and group_interval be set to in this case? Only one message should be sent, so grouping by URL is not an option.

But if another website becomes unavailable five minutes later, another message should be sent. This does not happen, because OpsGenie deduplicates on the alias.

Is there any way to achieve this?