groups:
  - name: PowerOutageAlert
    rules:
      - alert: PowerOutageAlert
        expr: |
          sum(probe_success{job="blackbox_linux"} or probe_success{job="blackbox_windows"} or probe_success{job="blackbox_router-1"} or probe_success{job="blackbox_router-2"}) by (Site) == 0
        for: 1m
  - name: LinuxGroup
    rules:
      - alert: Linux Servers Down
        expr: |
          sum(probe_success{job="blackbox_linux"} or probe_success{job="blackbox_router-1"}) by (Site) == 0
        for: 1m
  - name: WindowsGroup
    rules:
      - alert: Windows Servers Down
        expr: |
          sum(probe_success{job="blackbox_windows"} or probe_success{job="blackbox_router-2"}) by (Site) == 0
        for: 1m
route:
  group_by: ['alertname']
  receiver: ms-teams
  group_wait: 1m
  group_interval: 1m
  repeat_interval: 1m
receivers:
  - name: ms-teams
    webhook_configs:
      - url: 'http://xx.xx.xx.xx:2000/alertmanager'
        send_resolved: false
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['Site','instance']
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['Site','instance']
Think about it for a moment. This says that:
1. this rule will only suppress alerts which have label severity=warning
2. the alerts which perform the suppressing must have label severity=critical
3. the source alert will only suppress the target alert if both the Site and instance labels are identical
But that's not what you want to do. You've just copied an example, but the example doesn't do what you want. Your requirements don't mention "warning" or "critical" alerts. Furthermore, you're only matching alerts with an equal "instance" label, and since your alerting rules all use sum(...) by (Site), the resulting alerts will only have a Site label, with no instance label (and no severity label either). Note that a missing 'instance' label on both source and target counts as being 'equal', so listing it there is confusing and unnecessary.
So start with your description in English: "Is there a way I can ignore the matched targets from 'PowerOutageAlert' on the 'LinuxGroup/WindowsGroup' alerts?"
Yes, write a rule which says that. A starting point might be like this (using the more modern "matchers" syntax):
inhibit_rules:
  - source_matchers:
      - alertname=PowerOutageAlert
    target_matchers:
      - alertname=~"Linux Servers Down|Windows Servers Down"
    equal: ['Site']
I'm not guaranteeing that will work, because I can't write it properly without seeing examples of the *actual alerts* with *all their labels*. As I said in the other thread, you simply go to the Prometheus web interface or the Alertmanager web interface to see these. Once you can see an example of an alert that you want to suppress, together with an alert that should do the suppression, you can easily write an inhibit rule which inhibits the first by the second.
I'm also wondering about your alerting rules. Given that you've aggregated all the labels away apart from Site, I'm not sure *exactly* what you're trying to do with alerting. I think you only want the Linux Servers Down alert to fire if *all* the Linux servers in a site have gone down, is that right?
That's OK, although it's not what people normally do; normally they generate a separate alert for each server, and then use Alertmanager grouping so that a single alert message gets sent out, listing all the servers that are down.
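For comparison, here is a rough sketch of that more usual per-server pattern. It assumes your probe_success series carry both instance and Site labels; the alert name LinuxServerDown and the choice of grouping labels are only illustrative, not taken from your config.

```yaml
# Prometheus rules file (fragment): no sum(), so one alert fires per
# probed target and the instance and Site labels are preserved.
groups:
  - name: LinuxGroup
    rules:
      - alert: LinuxServerDown   # hypothetical name
        expr: probe_success{job="blackbox_linux"} == 0
        for: 1m
---
# Alertmanager config (fragment): alerts sharing the same alertname and
# Site are batched into a single notification listing each instance.
route:
  receiver: ms-teams
  group_by: ['alertname', 'Site']
```

With this layout, ten servers going down at one site should produce one message containing ten alerts, rather than ten separate messages.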
Now, if you don't want to get alerts for individual servers going down, but only if *all* servers have gone down, that's a perfectly reasonable requirement. But then that's such a major outage, I wouldn't want to be doing alert suppression. I think I'd be doing grouping again.
What you can do is add a label like "severity=MajorOutage" to each of these alerts, and then group them on this label.
Then you'll get a single alert message, which contains a summary of all the information in one place:
- all my Linux servers have gone down
- all my Windows servers have gone down
- there's a power outage
A human being can quickly deduce the connection between these statements. And it's simpler than trying to suppress two major outage alerts because of a third major outage. But either way will work.
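As a sketch of the labelling-and-grouping approach, something along these lines might work. The same labels block would go on all three of your rules; grouping by Site as well as severity is my own assumption, so that outages at different sites arrive as separate messages.

```yaml
# Prometheus rules file (fragment): add the same label to
# PowerOutageAlert, Linux Servers Down and Windows Servers Down.
groups:
  - name: WindowsGroup
    rules:
      - alert: Windows Servers Down
        expr: |
          sum(probe_success{job="blackbox_windows"} or probe_success{job="blackbox_router-2"}) by (Site) == 0
        for: 1m
        labels:
          severity: MajorOutage
---
# Alertmanager config (fragment): group on the new label instead of
# alertname, so all three alerts for a site land in one notification.
route:
  receiver: ms-teams
  group_by: ['severity', 'Site']   # 'Site' is an assumption; drop it to group across sites
```

No inhibit_rules are needed in this variant, since nothing is being suppressed.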