Prometheus Rules

Sandosh Kumar P

Sep 5, 2022, 6:57:21 PM
to Prometheus Users
Hi,

I am new to Prometheus and looking for some guidance on how to get my Prometheus rules to work for the requirements below.

In my environment, all Linux servers are connected to the Router-1 group and all Windows servers are connected to the Router-2 group. I have configured the Prometheus rules based on the requirements below.
  1. When there is a complete outage on a site, the alert should report just the site numbers where all the targets are down. So I have configured a rule "PowerOutageAlert", and this is working as expected.
  2. When the Linux servers are down in a site, the alert should show the sites where the Linux servers are down. So I have configured a rule "LinuxGroup", and this is also working as expected.
  3. When the Windows servers are down in a site, the alert should show the sites where the Windows servers are down. So I have configured a rule "WindowsGroup", and this is also working as expected.

prometheus_rules.yml:

groups:
 - name: PowerOutageAlert
   rules:
   - alert: PowerOutageAlert
     expr: |
       sum(probe_success{job="blackbox_linux"} or probe_success{job="blackbox_windows"} or probe_success{job="blackbox_router-1"} or probe_success{job="blackbox_router-2"}) by (Site) == 0
     for: 1m
 - name: LinuxGroup
   rules:
   - alert: Linux Servers Down
     expr: |
       sum(probe_success{job="blackbox_linux"} or probe_success{job="blackbox_router-1"}) by (Site) == 0
     for: 1m
 - name: WindowsGroup
   rules:
   - alert: Windows Servers Down
     expr: |
       sum(probe_success{job="blackbox_windows"} or probe_success{job="blackbox_router-2"}) by (Site) == 0
     for: 1m


Alertmanager.yml:

route:
  group_by: ['alertname']
  receiver: ms-teams
  group_wait: 1m
  group_interval: 1m
  repeat_interval: 1m
receivers:
- name: ms-teams
  webhook_configs:
    - url: 'http://xx.xx.xx.xx:2000/alertmanager'
      send_resolved: false
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['Site','instance']


The issues I am facing now are:
  1. When there is a complete outage on a site, I get 3 alerts (PowerOutageAlert/LinuxGroup/WindowsGroup) for the same targets with the above configuration. Is there a way I can ignore the matched targets from "PowerOutageAlert" on the "LinuxGroup/WindowsGroup" alerts?
  2. With the above setup, "LinuxGroup/WindowsGroup" alerts only when both "blackbox_router-1" and "blackbox_linux" (or both "blackbox_router-2" and "blackbox_windows") go down. It won't alert if just the Linux/Windows servers are down. How can I get these alerts even when the routers are up?

In a shell script I could achieve this with "if else" conditions, but I am not sure how to apply the same logic in Prometheus. Any help is really appreciated.


Thanks
Sandosh

Brian Candler

Sep 6, 2022, 3:39:51 AM
to Prometheus Users
This has all been explained to you in another thread.

Read your config.  You have written:

  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['Site','instance']


Think about it for a moment.  This says that:

1. this rule will only suppress alerts which have label severity=warning

2. the alerts which perform the suppressing must have label severity=critical

3. the source alert will only suppress the target alert if both the Site and instance labels are identical

But that's not what you want to do.  You've just copied an example, but the example doesn't do what you want.  Your requirements don't mention "warning" or "critical" alerts.  Furthermore, you're only matching alerts with an equal "instance" label, and since your alerting rules all use sum(...) by (Site), those expressions will only have a Site label and no instance label (and no severity label either). Note that a missing 'instance' label on both source and target counts as being 'equal'; listing it is confusing and unnecessary.

So start with your description in English.  You asked: is there a way I can ignore the matched targets from "PowerOutageAlert" on the "LinuxGroup/WindowsGroup" alerts?

Yes, write a rule which says that.  A starting point might be like this (using the more modern "matchers" syntax):

inhibit_rules:
  - source_matchers:
      - alertname=PowerOutageAlert
    target_matchers:
      - 'alertname=~"Linux Servers Down|Windows Servers Down"'
    equal: ['Site']

I'm not guaranteeing that will work, because I can't write it properly without seeing examples of the *actual alerts* with *all their labels*.  As I said in the other thread, you simply go to the Prometheus web interface or the Alertmanager web interface to see these.  Once you can see an example of an alert that you want to suppress, together with an alert that should do the suppression, you can easily write an inhibit rule which inhibits the first by the second.

I'm also wondering about your alerting rules.  Given that you've aggregated all the labels away apart from Site, I'm not sure *exactly* what you're trying to do with alerting.  I think you only want the Linux Servers Down alert to fire if *all* the Linux servers in a site have gone down, is that right?

That's OK, although it's not what people normally do; normally they generate a separate alert for each server, and then use alertmanager grouping so that a single alert message gets sent out, listing all the servers that are down.
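
As a rough illustration of that more conventional setup, here is a minimal sketch. It assumes the blackbox targets carry a Site label, as the original rules imply. The group name PerTargetAlerts, the alert name TargetDown and the annotation text are hypothetical, and only the relevant part of the Alertmanager route is shown.

```yaml
# prometheus_rules.yml (sketch): one alert per failed probe target,
# so the instance label is preserved instead of being aggregated away.
groups:
  - name: PerTargetAlerts          # hypothetical group name
    rules:
      - alert: TargetDown          # hypothetical alert name
        expr: probe_success{job=~"blackbox_linux|blackbox_windows"} == 0
        for: 1m
        annotations:
          summary: "{{ $labels.instance }} at site {{ $labels.Site }} is down"
---
# alertmanager.yml (sketch, route section only): batch the per-target
# alerts so that each site produces one notification listing its servers.
route:
  receiver: ms-teams
  group_by: ['alertname', 'Site']
```

Because each server gets its own alert, a Linux or Windows server going down is reported even while its router is still up.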

Now, if you don't want to get alerts for individual servers going down, but only if *all* servers have gone down, that's a perfectly reasonable requirement.  But then that's such a major outage, I wouldn't want to be doing alert suppression.  I think I'd be doing grouping again.

What you can do is add a label like "severity=MajorOutage" to each of these alerts, and then group them on this label.

Then you'll get a single alert message, which contains a summary of all the information in one place:

- all my Linux servers have gone down
- all my Windows servers have gone down
- there's a power outage

A human being can quickly deduce the connection between these statements.  And it's simpler than trying to suppress two major outage alerts because of a third major outage.  But either way will work.
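
The label-and-group approach described above could be sketched roughly as follows. Only the Linux rule is shown; the same labels block would go on the other two alerts. Including Site in group_by is an assumption added here so that each site gets its own message, and it can be dropped if one message covering all sites is preferred.

```yaml
# prometheus_rules.yml (sketch): tag each site-wide alert with a shared label.
groups:
  - name: LinuxGroup
    rules:
      - alert: Linux Servers Down
        expr: |
          sum(probe_success{job="blackbox_linux"} or probe_success{job="blackbox_router-1"}) by (Site) == 0
        for: 1m
        labels:
          severity: MajorOutage    # same label on PowerOutageAlert and Windows Servers Down
---
# alertmanager.yml (sketch, route section only): group on the shared label
# so related major-outage alerts arrive in a single notification.
route:
  receiver: ms-teams
  group_by: ['severity', 'Site']   # 'Site' is an assumption, see above
```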
