Query on Inhibit rules


Sandosh Kumar P

Aug 24, 2022, 9:04:46 AM
to Prometheus Users

We are using the blackbox exporter at a remote location to monitor gateway routers, hypervisors and virtual machines (router -> hypervisor -> virtual machines). We are looking for behaviour like the following.


Example 1:

If a gateway router is down and its alert is firing, Alertmanager should stop alerting on the hypervisor hosts and servers.

Example 2:

If a hypervisor is down, Alertmanager should not alert on the virtual machines.


In Prometheus, we put the routers in one group, the hypervisors in another and the virtual machines in a third, one job each.

Example:

job_name: 'blackbox_icmp-routers'
job_name: 'blackbox_icmp-hypervisors'
job_name: 'blackbox_icmp-virtualmachines'


Alerting rules are defined per job:

- name: RouterDown
  rules:
  - alert: R-InstanceDown
    expr: probe_success{job="blackbox_icmp-routers"} == 0
    for: 1m

- name: HypervisorDown
  rules:
  - alert: H-InstanceDown
    expr: probe_success{job="blackbox_icmp-hypervisors"} == 0
    for: 1m

- name: VirtualMachinesDown
  rules:
  - alert: V-InstanceDown
    expr: probe_success{job="blackbox_icmp-virtualmachines"} == 0
    for: 1m


Alertmanager config is below:

route:
  group_by: ['alertname']
  receiver: ms-teams
  repeat_interval: 5m
receivers:
- name: ms-teams
  webhook_configs:
    - url: 'http://monitoring:2000/alertmanager'
      send_resolved: false

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']


Any help is much appreciated.


Thanks

Sandosh

Brian Candler

Aug 24, 2022, 11:48:54 AM
to Prometheus Users
You'll need to set some common labels - and if they are target labels, make sure they propagate through to the alert (i.e. don't write your alerting 'expr' in such a way that it aggregates these labels away).

For example: your gateway and all the servers in a particular site can have {site="site123"}. Then you can write an inhibit rule to suppress alerts for 'device down' (target alert) if there's an active alert for 'gateway down' (source alert) and the 'site' label is the same (equal). You may need additional labels to identify "device down" versus "gateway down" alerts, or to distinguish the gateway from a non-gateway device.

Similarly, your VMs and your hypervisor can have some shared label like {cluster="vm123"}.  Then you can suppress alerts for 'VM down' if there's an alert for 'hypervisor down' with an equal 'cluster' label.
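
A minimal sketch of how such shared labels could be attached as target labels on a blackbox job. The exporter address (blackbox:9115), the use of the stock icmp module, the hostnames and the label values are all assumptions for illustration, not taken from the original post:

```yaml
scrape_configs:
  - job_name: 'blackbox_icmp-hypervisors'
    metrics_path: /probe
    params:
      module: [icmp]                 # assumes the default icmp module exists in blackbox.yml
    static_configs:
      - targets: ['hyper1.example.com']   # hypothetical hostname
        labels:
          site: 'site123'            # shared with the site's gateway router
          cluster: 'vm123'           # shared with the VMs on this hypervisor
    relabel_configs:
      # Usual blackbox pattern: pass the target as ?target=, keep it as the
      # instance label, and scrape the exporter itself.
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 'blackbox:9115' # hypothetical exporter address
```

Because site and cluster are target labels, they appear on probe_success and carry through to the alert as long as the alerting expression does not aggregate them away. The router job would carry only site, and the VM job both labels.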

For more info:

Sandosh Kumar P

Aug 25, 2022, 9:39:57 AM
to Prometheus Users
Hi Brian,

Thanks for your response. I created common labels for each category, as shown below, and I now see 3 groupings in Alertmanager.

Since our targets have unique names per cluster (e.g. router111, router112, hypervisor111, hypervisor112, instance111, instance112), is there a way to group them by name, so that all nodes containing 111 are grouped together, all nodes containing 112 are grouped together, and so on? Please let me know.

With the configuration below, we see only Router Down alerts if anything is added to the Router group, and even valid alerts are being suppressed. We are not sure what we are missing.

Rules:

- name: RouterDown
  rules:
  - alert: R-InstanceDown
    expr: probe_success{job="blackbox_icmp-routers"} == 0
    for: 1m
    labels:
      Category: 'Site'
      Type: 'Router'

- name: HypervisorDown
  rules:
  - alert: H-InstanceDown
    expr: probe_success{job="blackbox_icmp-hypervisors"} == 0
    for: 1m
    labels:
      Category: 'Site'
      Type: 'Hypervisor'

- name: VirtualMachinesDown
  rules:
  - alert: V-InstanceDown
    expr: probe_success{job="blackbox_icmp-virtualmachines"} == 0
    for: 1m
    labels:
      Category: 'Site'
      Type: 'Instance'



Alertmanager conf:

route:
  group_by: ['Type']
  receiver: ms-teams
  repeat_interval: 5m
receivers:
- name: ms-teams
  webhook_configs:
    - url: 'http://monitoring:2000/alertmanager'
      send_resolved: false
  routes:
    - match:
       alertname: "R-InstanceDown"
      receiver: ms-teams
      routes:
        - match:
           alertname: "H-InstanceDown"
          receiver: ms-teams
        - match:
           alertname: "V-InstanceDown"
          receiver: ms-teams
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']




Thanks
Sandosh

Brian Candler

Aug 25, 2022, 11:25:52 AM
to Prometheus Users
On Thursday, 25 August 2022 at 14:39:57 UTC+1 sando...@gmail.com wrote:

> Since our targets have unique names per cluster (e.g. router111, router112, hypervisor111, hypervisor112, instance111, instance112), is there a way to group them by name, so that all nodes containing 111 are grouped together, all nodes containing 112 are grouped together, and so on? Please let me know.


You can use the label_replace function to extract the substring of interest into a new label.

However I don't really understand what you're trying to do, because presumably these are N:1 relationships (i.e. N VMs sharing one hypervisor; and N hypervisors sharing one gateway router). If you have router111, it won't be serving just a single hypervisor111 running a single instance111.

 
> With the configuration below, we see only Router Down alerts if anything is added to the Router group, and even valid alerts are being suppressed. We are not sure what we are missing.
>
> ...
>
> inhibit_rules:
>   - source_match:
>       severity: 'critical'
>     target_match:
>       severity: 'warning'
>     equal: ['alertname', 'dev', 'instance']



The problem is that you haven't thought about your inhibit rules.

All that you've written says: suppress any alert with label severity="warning", if there is any active alert with label severity="critical" and matching values of alertname, dev and instance labels.

What you want is something different: e.g. suppress any alert with label alertname="H-InstanceDown", if there is any active alert with label alertname="R-InstanceDown" and matching values of whatever label you have set to identify the "site" that both the router and the hypervisor are in.  It's up to you to write that in the form of an inhibit rule.

Note that you can set additional labels on an alert, in the alerting rule itself, if you need extra labels to be available to alertmanager.
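
As a rough sketch of what such inhibit rules might look like, using the alert names from this thread and assuming every alert carries a `site` label, and that hypervisor and VM alerts also carry a `cluster` label (both label names are placeholders for whatever you actually set):

```yaml
inhibit_rules:
  # A router-down alert suppresses hypervisor-down and VM-down alerts
  # that have the same value of the (assumed) site label.
  - source_match:
      alertname: 'R-InstanceDown'
    target_match_re:
      alertname: 'H-InstanceDown|V-InstanceDown'
    equal: ['site']

  # A hypervisor-down alert suppresses VM-down alerts
  # that have the same value of the (assumed) cluster label.
  - source_match:
      alertname: 'H-InstanceDown'
    target_match:
      alertname: 'V-InstanceDown'
    equal: ['cluster']
```

This keeps the source_match / target_match syntax used earlier in the thread; newer Alertmanager releases prefer source_matchers / target_matchers. Check that the labels listed in `equal` are really present on both alerts: as far as I know, a label that is missing on both sides is treated as equal, so a rule like this could suppress more than intended.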

Sandosh Kumar P

Aug 29, 2022, 12:41:33 PM
to Prometheus Users
In my case there are multiple sites in different locations, and each site has a unique number that is included in the names of its hypervisor, router and instance targets. When I add an extra label in the rules file, as in the configuration I shared previously, it groups all the sites' routers together, all the hypervisors together and all the instances together.

What I am trying to achieve is to group all the targets with the same site number together, and then on top of that separate the targets by type (hypervisor, router and instance). Since I am new to Prometheus, I am stuck on how to separate them by the unique number first and then by type.

As for the inhibit rules, I will definitely make the changes you recommended. Please let me know how I can achieve the above.

Brian Candler

Aug 29, 2022, 4:53:36 PM
to Prometheus Users
What do you mean by "added to the targets"?  Can you give some examples?

If the instance label contains both the instance name and the site name and the structure is clearly demarcated, then you can use the function label_replace(), as I said before, to extract the part of interest.

e.g. if the hypervisor's instance label is "hyper3-site1" then you can use label_replace to match the pattern "-site<N>" and return just the "site<N>" part.  But the exact details of how to do this depend on exactly what you're doing.

The example given matches a label like service="xxx:yyy" and adds a new label foo="xxx".  That's pretty much exactly what you're trying to do, if I understand you correctly.

Use the PromQL browser in the Prometheus web interface to test your expressions as you write them.
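
A sketch of the label_replace() approach applied to one of the alerting rules from this thread, assuming instance labels of the form "hyper3-site1" as in the example above. The new label name `site` and the regex are assumptions and would need adapting to the real naming scheme:

```yaml
groups:
  - name: HypervisorDown
    rules:
      - alert: H-InstanceDown
        # label_replace(v, dst_label, replacement, src_label, regex)
        # The regex must match the whole instance value; "$1" is the
        # captured "site<N>" part, e.g. instance="hyper3-site1" gives
        # site="site1". Series that do not match are left unchanged.
        expr: |
          label_replace(
            probe_success{job="blackbox_icmp-hypervisors"} == 0,
            "site", "$1", "instance", ".*-(site[0-9]+)"
          )
        for: 1m
```

The label_replace(...) expression on its own can be pasted into the expression browser to confirm the new label appears before it goes into a rule file.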

Sandosh Kumar P

Aug 29, 2022, 5:46:33 PM
to Prometheus Users
Hi Brian,

This is what the target files look like.

Hypervisor.yml:

Instance.yml:
Router.yml:

Based on the above targets, they need to be grouped by site number as below:
Group 111:
Group 211:

Group 311:


And then alerts need to be sent based on their category (hypervisor/router/instances).
  • If a router is down in Group 111, then the hypervisor and instance alerts need to be suppressed.
  • If a hypervisor is down in Group 111, then the instance alerts need to be suppressed.
  • If routers in more than one group are down, then they need to be consolidated into one alert for those groups.

I am going through the documentation to understand label_replace and other features, but I am not finding examples or use cases that fit my scenario.


Thanks
Sandosh

Sandosh Kumar P

Aug 30, 2022, 2:51:50 PM
to Prometheus Users
Thanks Brian. "label_replace" did the magic: I am able to extract the site numbers from the targets using the config below, and in Alertmanager I grouped them by "SNumber". But I am not sure how to use inhibit rules to suppress the alerts. Can you help?

I also want to use the "SNumber" grouping to suppress alerts only when a Router/Hypervisor is down; in all other cases the alerts should be consolidated across all "SNumber" values. Is that achievable? Please let me know.

      - source_labels: [__param_target]
        target_label: SNumber
        regex: 'hyper.(.*)com'
        replacement: '${1}'
      - source_labels: [__param_target]
        target_label: SNumber
        regex: 'linux.(.*)com'
        replacement: '${1}'
      - source_labels: [__param_target]
        target_label: SNumber
        regex: 'win.(.*)com'
        replacement: '${1}'
      - source_labels: [__param_target]
        target_label: SNumber
        regex: 'Router.(.*)com'
        replacement: '${1}'
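
Since the four rules differ only in the hostname prefix, they could probably be collapsed into a single rule with an alternation. This is only a sketch, assuming it sits under the same relabel_configs key as the rules above and that no other target names need to match:

```yaml
relabel_configs:
  # Relabel regexes are anchored at both ends, so this matches the same
  # targets as the four separate rules. (?:...) is a non-capturing group,
  # which keeps the site part in ${1}.
  - source_labels: [__param_target]
    target_label: SNumber
    regex: '(?:hyper|linux|win|Router).(.*)com'
    replacement: '${1}'
```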



Thanks
Sandosh

Brian Candler

Aug 30, 2022, 3:46:19 PM
to Prometheus Users
Please start by reading the documentation:

Then write and test your inhibit rule.

Then if it doesn't work, show the inhibit rules you've written, and examples of the alerts in question:
- the target alert (i.e. the one you want to suppress)
- the source alert (i.e. the one which should be suppressing it)

Make sure you include the full, unexpurgated set of labels on both.  You can get this from the alerts views in either prometheus or alertmanager.