Hi all.
I'm chasing an inhibition rules problem and I'm not sure what I'm doing wrong.
Basically, I'd like to snooze alerting during deployments or maintenance, since it doesn't make sense to alert while the services are purposely down. Despite the inhibition, alert notifications keep popping up in Slack.
To inhibit alerts during deploys, I've defined the following section in Alertmanager:
inhibit_rules:
  - source_match_re:
      alertname: deployment_in_progress|maintenance_in_progress
    target_match_re:
      severity: warning|average|high|disaster
    equal: ['stack', 'environment']
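For reference, on Alertmanager 0.22 or later the same rule can also be written with the newer matchers syntax (assuming that version; the classic *_match_re form above still works):

```yaml
inhibit_rules:
  - source_matchers:
      - alertname =~ "deployment_in_progress|maintenance_in_progress"
    target_matchers:
      - severity =~ "warning|average|high|disaster"
    equal: ['stack', 'environment']
```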
When a deploy is started, a metric is pushed via the Pushgateway and one of the alerts above fires. Let's consider the first one, which looks like this:
- alert: deployment_in_progress
  expr: time() - last_deployment{status="started"} < 300
  labels:
    severity: note
  annotations:
In short, the deploy alert should last for five minutes. The metric is pushed by several services as the deploy progresses, so we can have several of these alerts running at staggered times. Their severity is "note", so the deploy alerts themselves are never inhibited. A note message is also delivered to Slack, with the desired "stack" and "environment" values.
So far, so good. Assuming everything is fine there, the problem starts in Slack: despite the inhibition, notifications about targets being down are still delivered. This morning I had the following in Slack:
<7:17> note deploy firing
<7:17> note deploy firing
<7:22> note deploy firing
<7:22> note deploy firing
<7:23> compound notification for several target down firing <--- this is incomplete, last alert is cut in half
<7:27> note deploy resolve
<7:27> note deploy resolve
<7:28> compound notification for several target down resolve <--- this is incomplete, last alert is cut in half
<7:43> note deploy firing
<7:44> compound notification for several target down firing
<7:48> note deploy firing
<7:49> compound notification for several target down firing (and resolves from before)
<7:53> note deploy resolve
<7:54> compound notification for several target down resolve
I have configured:
group_by: [severity, stack, environment]
group_wait: 30s
group_interval: 5m
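As a sanity check on the timestamps above, here is a toy model (my own sketch, not Alertmanager code) of when a group with these settings would notify: the first batch goes out group_wait after the first alert joins the group, and subsequent batches follow at most every group_interval, which matches the roughly five-minute spacing of the compound notifications:

```python
# Toy model of Alertmanager dispatch timing for a single group.
# The values mirror the config above; the schedule is simplified.
group_wait = 30       # seconds before the first notification for a new group
group_interval = 300  # minimum seconds between notifications for that group

first_alert_at = 0    # t=0: first target-down alert enters the group
notifications = [first_alert_at + group_wait]
while notifications[-1] < 1200:
    notifications.append(notifications[-1] + group_interval)

print(notifications)  # notification times, in seconds after the first alert
```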
The upness rule is as follows:
alert: target_node_with_source_down
expr: avg_over_time(up{job="node",source=~".+"}[5m]) < 0.9
labels:
  severity: average
Is this just a timing issue, i.e. the source alert reaches Prometheus too late to prevent the target alerts from triggering, or could there be something else? Thinking about it, could it be the effect of avg_over_time spreading the down-ness over time?
This afternoon I had a deploy notification at 14:34 and a grouped set of alerts at 14:38, which is indeed within the 5m span. But I guess the averaging plays a role here. Am I wrong?
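To check that intuition, here is a toy simulation (assuming a 15s scrape interval; all values are made up) of how avg_over_time(up[5m]) < 0.9 behaves around a two-minute outage. The condition only becomes true a few scrapes into the outage and, more importantly, stays true for several minutes after the target is back up, i.e. potentially well after the deploy alert has resolved:

```python
# Sketch: how avg_over_time(up[5m]) lags the actual downtime.
SCRAPE = 15                  # assumed seconds between scrapes
WINDOW = 300                 # 5m range in seconds
SAMPLES = WINDOW // SCRAPE   # 20 samples per window

def window_avg(series, t_idx):
    """Average of the samples in the 5m window ending at index t_idx."""
    lo = max(0, t_idx - SAMPLES + 1)
    chunk = series[lo:t_idx + 1]
    return sum(chunk) / len(chunk)

# Target is up for 10m, down for 2m (the deploy), then up again for 10m.
series = [1] * 40 + [0] * 8 + [1] * 40

firing = [i * SCRAPE for i in range(len(series))
          if window_avg(series, i) < 0.9]
# Outage runs t=600s..705s, recovery scrape at t=720s.
print("condition true from t=%ds to t=%ds" % (firing[0], firing[-1]))
```

In this sketch the condition stays true until t=960s, four minutes after the target recovered at t=720s, because the zero samples remain inside the 5m window until they age out.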
Any help much appreciated. Thanks in advance.