Does disabling grouping always result in single alerts in Slack?

baca...@gmail.com

Aug 18, 2020, 3:05:01 AM8/18/20
to Prometheus Users
Hi all.

We recently introduced the special ['...'] value for group_by, which disables grouping, on our 0.20.0 Alertmanager instances.

The setting is used in our routes as shown in the configuration snippet below:

routes:
  - receiver: 'slack_primary'
    group_by: ['...'] # disables grouping
    continue: true
    match_re:
      stack: our_stack
      severity: warning|average|high|disaster

We have alerts which carry a "stack" label and an "environment" label for our staging and production clusters. Recently we had a very awkward outage and some clusters went down in both environments. Since our current message templates expect just one alert per notification, we ended up missing the staging alerts in Slack.

Of course I can change the templates to iterate over the alerts, but the question remains: is that normal behaviour, or should the alerts be delivered separately and this is a bug?
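
For reference, this is roughly the template change I have in mind (a sketch only; the channel and the message fields are placeholders, while the receiver name matches the route above):

receivers:
  - name: 'slack_primary'
    slack_configs:
      - channel: '#alerts' # placeholder channel
        title: '{{ .CommonLabels.alertname }} ({{ .Status }})'
        # Iterate over every alert in the notification instead of
        # assuming there is exactly one.
        text: >-
          {{ range .Alerts }}
          *{{ .Labels.alertname }}* ({{ .Labels.environment }}): {{ .Annotations.description }}
          {{ end }}

As far as I understand, with group_by: ['...'] each notification should contain a single alert anyway, so the range is mostly a safety net.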

One of the expressions that failed was this one:

envoy_cluster_health_check_healthy{envoy_cluster_name=~"name1|name2|name3"} == 0

We basically had several alerts from the expression above boiling down to those two:

envoy_cluster_health_check_healthy{envoy_cluster_name="<name>", environment="staging"}
envoy_cluster_health_check_healthy{envoy_cluster_name="<name>", environment="production"}

But in Slack we were only notified about the "production" one, because the two alerts were grouped together and the template didn't take that into account.

We have currently split the two alerts as follows:

envoy_cluster_health_check_healthy{envoy_cluster_name=~"name1|name2|name3", environment="staging"} == 0

envoy_cluster_health_check_healthy{envoy_cluster_name=~"name1|name2|name3", environment="production"} == 0
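
For completeness, in rule-file form the split looks roughly like this; the alert names, severity and annotations are made up for the example:

groups:
  - name: envoy-health
    rules:
      - alert: EnvoyClusterUnhealthyStaging # hypothetical name
        expr: envoy_cluster_health_check_healthy{envoy_cluster_name=~"name1|name2|name3", environment="staging"} == 0
        labels:
          severity: high
        annotations:
          description: 'Envoy cluster {{ $labels.envoy_cluster_name }} failed its health check in staging.'
      - alert: EnvoyClusterUnhealthyProduction # hypothetical name
        expr: envoy_cluster_health_check_healthy{envoy_cluster_name=~"name1|name2|name3", environment="production"} == 0
        labels:
          severity: high
        annotations:
          description: 'Envoy cluster {{ $labels.envoy_cluster_name }} failed its health check in production.'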

Summing up: is that behaviour expected, meaning we should definitely change the templates and/or split the rules by environment?

Thanks in advance,
F.

Bjoern Rabenstein

Aug 27, 2020, 12:18:21 PM8/27/20
to baca...@gmail.com, Prometheus Users
On 18.08.20 00:05, baca...@gmail.com wrote:
>
> routes:
> - receiver: 'slack_primary'
> group_by: ['...'] # disables grouping
> [...]
>
> One of the expressions that failed was this one:
>
> envoy_cluster_health_check_healthy{envoy_cluster_name=~"name1|name2|name3"} == 0
>
> We basically had several alerts from the expression above boiling down to those
> two:
>
> envoy_cluster_health_check_healthy{envoy_cluster_name="<name>", environment="staging"}
> envoy_cluster_health_check_healthy{envoy_cluster_name="<name>", environment="production"}
>
> But in Slack we were only notified about the "production" one, because the two
> alerts were grouped together and the template didn't take that into account.
>
> We have currently split the two alerts as follows:
>
> envoy_cluster_health_check_healthy{envoy_cluster_name=~"name1|name2|name3", environment="staging"} == 0
>
> envoy_cluster_health_check_healthy{envoy_cluster_name=~"name1|name2|name3", environment="production"} == 0
>
> Summing up: is that behaviour expected, meaning we should definitely change the
> templates and/or split the rules by environment?

Interesting... I'm pretty sure it shouldn't matter whether the alerts come
from the same rule or from different rules.

I'm wondering if this is a bug in Alertmanager. I could imagine that
AM only ever thinks about grouping together alerts coming from
different rules but never considers breaking up alerts coming from the
same rule. Or perhaps this weird behavior only happens with the
`['...']` grouping.

Just to make sure this hasn't been fixed in the latest release, could
you reproduce the behavior in v0.21.0? If it still happens, I think
you should file a bug in https://github.com/prometheus/alertmanager .
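
A stripped-down config along these lines should be enough to see whether the two alerts arrive as one notification or as two (the webhook URL is just a placeholder for anything that prints the request body):

route:
  receiver: 'debug'
  group_by: ['...'] # the same special value as in your config
receivers:
  - name: 'debug'
    webhook_configs:
      - url: 'http://127.0.0.1:8080/' # placeholder: any listener that logs the POST

Fire two alerts from the same rule against it and check whether the webhook receives one POST containing both alerts or two separate POSTs.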

--
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email] bjo...@rabenste.in

baca...@gmail.com

Aug 31, 2020, 12:01:14 PM8/31/20
to Prometheus Users
Hi Björn.

Thank you for the reply. 

Today I had another look at the issue and I must say I've figured out an important detail. My original assumption was that the two alerts had been grouped together and that some badly behaved templates weren't showing the necessary information. At least everything was pointing to that. It turned out that Slack handles long URLs inside descriptions oddly and sometimes cuts content out. But the content is actually there, i.e. the alert for "production" mentioned in my original post was well-formed.
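
If the truncation keeps biting us, one thing I might try is sending links with Slack's <url|label> markup instead of pasting raw URLs into the text. A rough sketch, using the generator URL purely as an example (channel and annotation names are placeholders):

slack_configs:
  - channel: '#alerts' # placeholder
    # Send links in Slack's <url|label> markup so the visible text
    # stays short instead of showing the full URL.
    text: >-
      {{ range .Alerts }}
      {{ .Annotations.description }} (<{{ .GeneratorURL }}|source>)
      {{ end }}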

The inability of some templates to handle grouping was a real issue. Despite that, the alert for "production" was correct. From these two facts we can only conclude that no grouping was happening at all. That opens up another question though: if my colleagues watching the alerts reported that both the "production" and "staging" alerts were firing, why didn't we receive the notification for "staging" in Slack? I know that AM delivers resolved notifications on a best-effort basis, but I wouldn't expect that to be the case for firing alerts. Or is it?

Unfortunately we don't have a record of the alerts apart from Slack - we have yet to implement persistent storage - so it's difficult to judge the scenario at this point. At any rate I'm going to update AM, since some bugs that could affect us have been fixed. Once the update is in place I'll try to reproduce the issue again and see what I can find out.
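
For the record-keeping part, the simplest thing I can think of is a catch-all child route with continue: true that ships every notification to a small webhook we can persist somewhere; the receiver name and URL below are placeholders:

route:
  receiver: 'slack_primary'
  routes:
    # Catch-all archive route; continue: true lets evaluation fall
    # through to the normal routes afterwards.
    - receiver: 'alert_archive' # placeholder name
      continue: true
receivers:
  - name: 'alert_archive'
    webhook_configs:
      - url: 'http://alert-archive.internal:9095/ingest' # placeholder endpoint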

Bests,
F. 