Hi all.
We recently started using the special group_by value that disables grouping, ['...'], on our Alertmanager 0.20.0 instances.
The label is used in our routes as shown in the piece of configuration below:
routes:
- receiver: 'slack_primary'
  group_by: [...] # disables grouping
  continue: true
  match_re:
    stack: our_stack
    severity: warning|average|high|disaster
We have alerts that carry a "stack" label and an "environment" label, the latter distinguishing our staging and production clusters. Recently we had a very awkward outage in which some clusters went down in both environments. Since our current message templates expect just one alert per notification, we ended up missing the staging alerts in Slack.
Of course I can change the template to iterate over the alerts, but the question remains: is this normal behaviour, or should the alerts be generated separately, which would make this a bug?
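For reference, the template change I have in mind would look roughly like the sketch below. It is only a minimal sketch: the channel name is a placeholder, the Slack API URL is assumed to be set globally, and the message layout is made up for illustration rather than taken from our real templates. The idea is simply to range over .Alerts instead of reading a single alert.

```yaml
receivers:
- name: 'slack_primary'
  slack_configs:
  - channel: '#alerts'  # placeholder channel name (assumption)
    # Title shows how many alerts are contained in this notification.
    title: '{{ .CommonLabels.alertname }} ({{ .Alerts | len }} alert(s))'
    # Iterate over every alert in the notification instead of assuming one.
    text: >-
      {{ range .Alerts }}
      [{{ .Labels.environment }}] {{ .Labels.envoy_cluster_name }}: {{ .Status }}
      {{ end }}
```

With this, a notification containing both the staging and the production alert would list both, one per line.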
One of the expressions that failed was this one:
envoy_cluster_health_check_healthy{envoy_cluster_name=~"name1|name2|name3"} == 0
We basically had several alerts from the expression above, boiling down to these two:
envoy_cluster_health_check_healthy{envoy_cluster_name=<name>, environment="staging" }
envoy_cluster_health_check_healthy{envoy_cluster_name=<name>, environment="production" }
But in Slack only the "production" one was reported, because the two alerts were grouped into a single notification and the template didn't take that into account.
We have currently split the two alerts as follows:
envoy_cluster_health_check_healthy{envoy_cluster_name=~"name1|name2|name3", environment="staging"} == 0
envoy_cluster_health_check_healthy{envoy_cluster_name=~"name1|name2|name3", environment="production"} == 0
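In rule-file form, such a split would look roughly like the sketch below. The group name, alert names, "for" duration and severity value are placeholders chosen for illustration (assumptions, not our actual rules); only the two expressions are the ones shown above.

```yaml
groups:
- name: envoy_health  # placeholder group name (assumption)
  rules:
  # One rule per environment, so each environment gets its own alert name.
  - alert: EnvoyClusterUnhealthyStaging  # placeholder alert name
    expr: envoy_cluster_health_check_healthy{envoy_cluster_name=~"name1|name2|name3", environment="staging"} == 0
    for: 1m  # assumed duration
    labels:
      severity: warning  # assumed; must match the route's severity regex
  - alert: EnvoyClusterUnhealthyProduction  # placeholder alert name
    expr: envoy_cluster_health_check_healthy{envoy_cluster_name=~"name1|name2|name3", environment="production"} == 0
    for: 1m  # assumed duration
    labels:
      severity: warning  # assumed; must match the route's severity regex
```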
Summing up: is this behaviour expected, and should we therefore change the templates and/or split the rules by environment?
Thanks in advance,
F.