Hi all.
I'm chasing an inhibition rules problem and I'm not sure what I'm doing wrong.
Basically, I'd like to snooze alerting during deployments or maintenance, since it doesn't make sense to alert while the services are purposely down. Despite the inhibition, alert notifications keep popping up in Slack.
To inhibit alerts during deploys, I've defined the following section in Alertmanager:
inhibit_rules:
  - source_match_re:
      alertname: deployment_in_progress|maintenance_in_progress
    target_match_re:
      severity: warning|average|high|disaster
    equal: ['stack', 'environment']
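For reference, on Alertmanager 0.22 or later the same rule can also be written with the newer matchers syntax (assuming that version; the classic *_match_re form above still works):

```yaml
inhibit_rules:
  - source_matchers:
      - alertname =~ "deployment_in_progress|maintenance_in_progress"
    target_matchers:
      - severity =~ "warning|average|high|disaster"
    equal: ['stack', 'environment']
```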
When a deploy is started, a metric is pushed via the Pushgateway and one of the alerts above fires. Let's consider the first one, which looks like this:
- alert: deployment_in_progress
  expr: time() - last_deployment{status="started"} < 300
  labels:
    severity: note
  annotations:
In short, the deploy alert should last for five minutes. The metric is pushed by several services as the deploy progresses, so we can have several of these alerts running at staggered times. Their severity is "note", so the deploy alerts themselves are never inhibited. A note message is also delivered to Slack, with the desired "stack" and "environment" values.
So far, so good. Assuming everything is fine there, the problem starts in Slack: despite the inhibition, notifications about targets being down are still delivered. This morning I had the following in Slack:
<7:17> note deploy firing
<7:17> note deploy firing
<7:22> note deploy firing
<7:22> note deploy firing
<7:23> compound notification for several target down firing <--- this is incomplete, last alert is cut in half
<7:27> note deploy resolve
<7:27> note deploy resolve
<7:28> compound notification for several target down resolve <--- this is incomplete, last alert is cut in half
<7:43> note deploy firing
<7:44> compound notification for several target down firing
<7:48> note deploy firing
<7:49> compound notification for several target down firing (and resolves from before)
<7:53> note deploy resolve
<7:54> compound notification for several target down resolve
I have configured:
group_by: [severity, stack, environment]
group_wait: 30s
group_interval: 5m
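As a sanity check on the timestamps above, here is a toy model (my own sketch, not Alertmanager code) of when a group with these settings would notify: the first batch goes out group_wait after the first alert joins the group, and subsequent batches follow at most every group_interval, which matches the roughly five-minute spacing of the compound notifications:

```python
# Toy model of Alertmanager dispatch timing for a single group.
# The values mirror the config above; the schedule is simplified.
group_wait = 30       # seconds before the first notification for a new group
group_interval = 300  # minimum seconds between notifications for that group

first_alert_at = 0    # t=0: first target-down alert enters the group
notifications = [first_alert_at + group_wait]
while notifications[-1] < 1200:
    notifications.append(notifications[-1] + group_interval)

print(notifications)  # notification times, in seconds after the first alert
```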
The upness rule is as follows:
alert: target_node_with_source_down
expr: avg_over_time(up{job="node",source=~".+"}[5m]) < 0.9
labels:
  severity: average
Is this just a timing issue, i.e. the source alert reaches Prometheus too late to prevent the target alerts from triggering, or could there be something else? Thinking about it, could it be the effect of avg_over_time spreading the down-ness over time?
This afternoon I had a deploy notification at 14:34 and a grouped set of alerts at 14:38, which is indeed within the 5m span. But I guess the averaging plays a role here. Am I wrong?
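To check that intuition, here is a toy simulation (assuming a 15s scrape interval; all values are made up) of how avg_over_time(up[5m]) < 0.9 behaves around a two-minute outage. The condition only becomes true a few scrapes into the outage and, more importantly, stays true for several minutes after the target is back up, i.e. potentially well after the deploy alert has resolved:

```python
# Sketch: how avg_over_time(up[5m]) lags the actual downtime.
SCRAPE = 15                  # assumed seconds between scrapes
WINDOW = 300                 # 5m range in seconds
SAMPLES = WINDOW // SCRAPE   # 20 samples per window

def window_avg(series, t_idx):
    """Average of the samples in the 5m window ending at index t_idx."""
    lo = max(0, t_idx - SAMPLES + 1)
    chunk = series[lo:t_idx + 1]
    return sum(chunk) / len(chunk)

# Target is up for 10m, down for 2m (the deploy), then up again for 10m.
series = [1] * 40 + [0] * 8 + [1] * 40

firing = [i * SCRAPE for i in range(len(series))
          if window_avg(series, i) < 0.9]
# Outage runs t=600s..705s, recovery scrape at t=720s.
print("condition true from t=%ds to t=%ds" % (firing[0], firing[-1]))
```

In this sketch the condition stays true until t=960s, four minutes after the target recovered at t=720s, because the zero samples remain inside the 5m window until they age out.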
Any help much appreciated. Thanks in advance.