I'm trying to use the group_wait parameter so that Alertmanager waits for all the alerts arriving from Prometheus, groups them, and sends a single notification.
I have the following configuration:
route:
  receiver: default-receiver
  group_by:
    - alertname
    - environment
  continue: false
  group_wait: 5m
  group_interval: 20m
  repeat_interval: 1d
receivers:
  - name: default-receiver
    email_configs:
      - send_resolved: true
        to: mye...@exmaple.com
        from: alertm...@example.com
        hello: localhost
        smarthost: smptserver:25
Although group_wait is set to 5 minutes, Alertmanager flushes the alerts and sends a notification to the configured receiver as soon as it receives them from Prometheus. I would expect it to hold the notification back and send it only after 5 minutes (the group_wait value). The debug log shows the behavior:
ts=2022-11-22T12:37:19.367Z caller=cluster.go:705 level=info component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000781422s
ts=2022-11-22T12:37:21.368Z caller=cluster.go:702 level=debug component=cluster msg="gossip looks settled" elapsed=4.001197371s
ts=2022-11-22T12:37:23.368Z caller=cluster.go:702 level=debug component=cluster msg="gossip looks settled" elapsed=6.001883916s
ts=2022-11-22T12:37:25.369Z caller=cluster.go:702 level=debug component=cluster msg="gossip looks settled" elapsed=8.00222292s
ts=2022-11-22T12:37:27.369Z caller=cluster.go:697 level=info component=cluster msg="gossip settled; proceeding" elapsed=10.002782746s
ts=2022-11-22T12:37:42.811Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=file_not_processed[c0e2772][active]
ts=2022-11-22T12:37:42.812Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=file_not_processed[64605a5][active]
ts=2022-11-22T12:37:42.812Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=file_not_processed[e70ae18][active]
ts=2022-11-22T12:37:42.812Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=file_not_processed[7325965][active]
ts=2022-11-22T12:37:42.812Z caller=dispatch.go:517 level=debug component=dispatcher aggrGroup="{}:{alertname=\"file_not_processed\", environment=\"ACC\"}" msg=flushing alerts=[file_not_processed[c0e2772][active]]
ts=2022-11-22T12:37:42.812Z caller=dispatch.go:517 level=debug component=dispatcher aggrGroup="{}:{alertname=\"file_not_processed\", environment=\"DEV\"}" msg=flushing alerts="[file_not_processed[64605a5][active] file_not_processed[e70ae18][active] file_not_processed[7325965][active]]"
ts=2022-11-22T12:37:42.883Z caller=notify.go:743 level=debug component=dispatcher receiver=default-receiver integration=webhook[0] msg="Notify success" attempts=1
ts=2022-11-22T12:37:42.914Z caller=notify.go:743 level=debug component=dispatcher receiver=default-receiver integration=webhook[0] msg="Notify success" attempts=1
ts=2022-11-22T12:37:43.031Z caller=notify.go:743 level=debug component=dispatcher receiver=default-receiver integration=email[0] msg="Notify success" attempts=1
ts=2022-11-22T12:37:43.031Z caller=notify.go:743 level=debug component=dispatcher receiver=default-receiver integration=email[0] msg="Notify success" attempts=1
ts=2022-11-22T12:37:43.660Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=locked_oracle_accounts[bcc49ac][active]
ts=2022-11-22T12:37:43.660Z caller=dispatch.go:517 level=debug component=dispatcher aggrGroup="{}:{alertname=\"locked_oracle_accounts\", environment=\"DEV\"}" msg=flushing alerts=[locked_oracle_accounts[bcc49ac][active]]
ts=2022-11-22T12:37:43.704Z caller=notify.go:743 level=debug component=dispatcher receiver=default-receiver integration=webhook[0] msg="Notify success" attempts=1
ts=2022-11-22T12:37:43.840Z caller=notify.go:743 level=debug component=dispatcher receiver=default-receiver integration=email[0] msg="Notify success" attempts=1
ts=2022-11-22T12:37:58.355Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=sdl_critical_services_down[7b9c988][active]
ts=2022-11-22T12:37:58.355Z caller=dispatch.go:517 level=debug component=dispatcher aggrGroup="{}:{alertname=\"sdl_critical_services_down\", environment=\"TST\"}" msg=flushing alerts=[sdl_critical_services_down[7b9c988][active]]
ts=2022-11-22T12:37:58.398Z caller=notify.go:743 level=debug component=dispatcher receiver=default-receiver integration=webhook[0] msg="Notify success" attempts=1
ts=2022-11-22T12:37:58.416Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=sdl_critical_services_down[7b9c988][active]
ts=2022-11-22T12:37:58.494Z caller=notify.go:743 level=debug component=dispatcher receiver=default-receiver integration=email[0] msg="Notify success" attempts=1
ts=2022-11-22T12:38:02.724Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=edl_instance_down[49003d1][active]
ts=2022-11-22T12:38:02.724Z caller=dispatch.go:517 level=debug component=dispatcher aggrGroup="{}:{alertname=\"edl_instance_down\", environment=\"ACC\"}" msg=flushing alerts=[edl_instance_down[49003d1][active]]
ts=2022-11-22T12:38:02.765Z caller=notify.go:743 level=debug component=dispatcher receiver=default-receiver integration=webhook[0] msg="Notify success" attempts=1
ts=2022-11-22T12:38:02.876Z caller=notify.go:743 level=debug component=dispatcher receiver=default-receiver integration=email[0] msg="Notify success" attempts=1
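To put a number on how quickly each flush follows the alert, the gap can be measured directly from log lines like the ones above. This is only a minimal Python sketch: the helper name flush_delays and the regular expressions are my own (not part of Alertmanager), and only four of the log lines are reproduced.

```python
import re
from datetime import datetime

# Four lines copied verbatim from the debug log above.
LOG = r'''
ts=2022-11-22T12:37:42.811Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=file_not_processed[c0e2772][active]
ts=2022-11-22T12:37:42.812Z caller=dispatch.go:517 level=debug component=dispatcher aggrGroup="{}:{alertname=\"file_not_processed\", environment=\"ACC\"}" msg=flushing alerts=[file_not_processed[c0e2772][active]]
ts=2022-11-22T12:38:02.724Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=edl_instance_down[49003d1][active]
ts=2022-11-22T12:38:02.724Z caller=dispatch.go:517 level=debug component=dispatcher aggrGroup="{}:{alertname=\"edl_instance_down\", environment=\"ACC\"}" msg=flushing alerts=[edl_instance_down[49003d1][active]]
'''

TS = re.compile(r"^ts=(\S+)")
# Matches e.g. file_not_processed[c0e2772] when followed by [active].
ALERT = re.compile(r"(\w+\[[0-9a-f]+\])\[active\]")


def parse_ts(line):
    """Parse the leading ts=... field of a log line."""
    return datetime.strptime(TS.match(line).group(1), "%Y-%m-%dT%H:%M:%S.%fZ")


def flush_delays(log_text):
    """Return {alert id: seconds between 'Received alert' and first flush}."""
    received = {}
    delays = {}
    for line in log_text.strip().splitlines():
        ids = ALERT.findall(line)
        if 'msg="Received alert"' in line:
            for alert_id in ids:
                received.setdefault(alert_id, parse_ts(line))
        elif "msg=flushing" in line:
            for alert_id in ids:
                if alert_id in received and alert_id not in delays:
                    delta = parse_ts(line) - received[alert_id]
                    delays[alert_id] = delta.total_seconds()
    return delays


if __name__ == "__main__":
    for alert_id, seconds in flush_delays(LOG).items():
        print(f"{alert_id}: flushed {seconds:.3f}s after being received")
```

For these lines the gap is at most one millisecond, nowhere near the 300 seconds that group_wait: 5m should produce.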
I expect Alertmanager to group the alerts from Prometheus and, after 5 minutes (the group_wait value), send a single notification containing all the grouped alerts. Instead, group_wait seems to be ignored: a notification goes out immediately after each alert is received from Prometheus. As a result, Alertmanager has no time to collect all the alerts of the same type (according to my group_by labels), and I end up with multiple notifications for the same alerts, the later ones arriving at the next group_interval.
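To make the expected timing concrete, here is a small Python sketch of the schedule I have in mind. It is a simplified model of my understanding of group_wait and group_interval, not Alertmanager's actual dispatcher logic; the function name expected_notifications is made up for illustration, and the start time is the first "Received alert" timestamp from the log.

```python
from datetime import datetime, timedelta

# Values taken from the route configuration above.
GROUP_WAIT = timedelta(minutes=5)
GROUP_INTERVAL = timedelta(minutes=20)


def expected_notifications(first_alert_at, count=2):
    """Simplified model: the first notification for a new group is sent
    GROUP_WAIT after its first alert; later notifications (only if the
    group has changed) follow at GROUP_INTERVAL steps."""
    first = first_alert_at + GROUP_WAIT
    return [first + i * GROUP_INTERVAL for i in range(count)]


# First "Received alert" entry in the log: 12:37:42.811 UTC.
t0 = datetime(2022, 11, 22, 12, 37, 42, 811000)

if __name__ == "__main__":
    for when in expected_notifications(t0):
        print(when.isoformat(timespec="milliseconds"))
```

Under this model the first notification for the file_not_processed groups would be due at about 12:42:42, whereas the log shows "Notify success" at 12:37:42, within the same second the alerts were received.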