First alert of a group does not trigger group_wait; notification is sent immediately

Kristof Bruyninckx

Feb 24, 2022, 6:33:47 AM
to Prometheus Users
I've set up an Alertmanager with a group_wait of 30 seconds. As I understand it, once a group is created (i.e. when the first alert belonging to that group comes in), Alertmanager should wait that long before sending out a notification for the group. This waiting period should also give inhibition a chance to apply, which is especially important when rebooting a system that starts up with several rules firing within milliseconds of each other.

Given this configuration for Alertmanager v0.23.0:

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 30s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'some_ip:some_port/alerthook'
# Suppress warning level notifications when critical alerts of the same instance are firing
inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ['alertname']

I often observe that group_wait is not respected and the receiver is notified immediately. Starting from the first alert coming in, I see something along the lines of:

level=info ts=2022-02-24T09:34:20.939Z caller=cluster.go:688 component=cluster msg="gossip settled; proceeding" elapsed=10.002903524s
level=debug ts=2022-02-24T09:34:30.651Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=Test[f79d400][active]
level=debug ts=2022-02-24T09:34:30.652Z caller=dispatch.go:475 component=dispatcher aggrGroup="{}:{alertname=\"Test\"}" msg=flushing alerts=[Test[f79d400][active]]
level=debug ts=2022-02-24T09:34:30.658Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=Test[7261095][active]
level=debug ts=2022-02-24T09:34:30.661Z caller=notify.go:734 component=dispatcher receiver=web.hook integration=webhook[0] msg="Notify success" attempts=1

As you can see, the webhook is notified within milliseconds rather than after the configured 30 seconds. When I restart my entire stack, warning alerts come in mere milliseconds before the critical alerts that are supposed to inhibit them. Instead of waiting and applying inhibition, Alertmanager sends the warning to the webhook immediately. Note that I do sometimes get the expected behavior, but only if the alerts come in before the "gossip settled" message appears:

level=info ts=2022-02-24T09:39:00.435Z caller=cluster.go:696 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000931443s
level=debug ts=2022-02-24T09:39:00.654Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=Test[f79d400][active]
level=debug ts=2022-02-24T09:39:00.654Z caller=dispatch.go:475 component=dispatcher aggrGroup="{}:{alertname=\"Test\"}" msg=flushing alerts=[Test[f79d400][active]]
level=debug ts=2022-02-24T09:39:00.657Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=Test[7261095][active]
level=debug ts=2022-02-24T09:39:02.435Z caller=cluster.go:693 component=cluster msg="gossip looks settled" elapsed=4.001161046s
level=debug ts=2022-02-24T09:39:04.436Z caller=cluster.go:693 component=cluster msg="gossip looks settled" elapsed=6.001426627s
level=debug ts=2022-02-24T09:39:06.436Z caller=cluster.go:693 component=cluster msg="gossip looks settled" elapsed=8.001634074s
level=info ts=2022-02-24T09:39:08.436Z caller=cluster.go:688 component=cluster msg="gossip settled; proceeding" elapsed=10.00187456s
level=debug ts=2022-02-24T09:39:30.655Z caller=dispatch.go:475 component=dispatcher aggrGroup="{}:{alertname=\"Test\"}" msg=flushing alerts="[Test[7261095][active] Test[f79d400][active]]"
level=debug ts=2022-02-24T09:39:30.655Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=Test[f79d400][active]
level=debug ts=2022-02-24T09:39:30.659Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=Test[7261095][active]
level=debug ts=2022-02-24T09:39:30.665Z caller=notify.go:734 component=dispatcher receiver=web.hook integration=webhook[0] msg="Notify success" attempts=1
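
To put numbers on the two excerpts above, here is a minimal sketch that measures the time between the first "Received alert" line and the first "Notify success" line in a log. It assumes logfmt-style lines with ts= and msg= fields exactly as shown above; the names parse_line and notify_delay_seconds are hypothetical helpers and not part of Alertmanager.

```python
import re
from datetime import datetime

# Match the ts= and msg= fields of a logfmt-style line.
# msg may be quoted (msg="Received alert") or bare (msg=flushing).
TS_RE = re.compile(r'\bts=(\S+)')
MSG_RE = re.compile(r'\bmsg=(?:"([^"]*)"|(\S+))')


def parse_line(line):
    """Return (timestamp, message) for a log line, or None if either field is missing."""
    ts_match = TS_RE.search(line)
    msg_match = MSG_RE.search(line)
    if not ts_match or not msg_match:
        return None
    ts = datetime.strptime(ts_match.group(1), "%Y-%m-%dT%H:%M:%S.%fZ")
    quoted, bare = msg_match.groups()
    return ts, quoted if quoted is not None else bare


def notify_delay_seconds(log_text):
    """Seconds from the first 'Received alert' to the first following 'Notify success'.

    Returns None if the log does not contain both events in that order.
    """
    first_received = None
    for line in log_text.splitlines():
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, msg = parsed
        if msg == "Received alert" and first_received is None:
            first_received = ts
        elif msg == "Notify success" and first_received is not None:
            return (ts - first_received).total_seconds()
    return None
```

Fed the first excerpt, this gives a delay of about 0.010 seconds; fed the second excerpt, it gives about 30.011 seconds, which matches the configured group_wait of 30s.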

Do I have the correct understanding of what group_wait is supposed to do? If not, what does group_wait actually do here? And why does it work as I expect only when the first alert comes in before the "gossip settled" message?
