Alertmanager inhibit alerts in tree structure at any level

l...@seznam.cz

Feb 19, 2019, 4:58:40 AM
to Prometheus Users
Hello,

  I'm new to Prometheus and Alertmanager. I'm trying to find a way
to set up Alertmanager to suppress (inhibit) alerts in a network tree structure
at any level. Something like:
  • root switch
    • host 1
    • host 2
    • level 1 switch 1
      • host 3
      • host 4
      • level 2 switch
        • host 5
    • level 1 switch 2
      • host 6
If the root switch fails, I want to receive a notification only about the root switch (no other host/switch).
If level 1 switch 1 fails, I want to receive a notification only about level 1 switch 1 (and not about hosts 3-5 or the level 2 switch).
And so on.

What is the best way? I was thinking about using some kind of prefix in a label
net, e.g.
net: root
net: root_host1
net: root_lev2sw
net: root_lev2sw_host5
but I found no way to use a source label in a target match. I do not want to write
a static inhibit rule for every switch node.

Thank you for any hint,

Luf

Sandosh Kumar P

Sep 5, 2022, 3:34:08 PM
to Prometheus Users
I am in the same boat, trying to solve the same issue in my environment. Were you able to find a solution?


Thanks
Sandosh

Brian Candler

Sep 6, 2022, 6:20:56 AM
to Prometheus Users
I think you're on the right lines.

Since the inhibit rules can do nothing more sophisticated than "equal" matching, I would go with multiple labels to represent levels 1/2/3 etc of the hierarchy. The slightly tricky part is to determine the difference between parent and child (remembering that one node can be both).

This is what I came up with:

up{instance="coresw1",level1="coresw1"}
up{instance="host1",level1="coresw1",level2="host1"}
up{instance="host2",level1="coresw1",level2="host2"}
up{instance="l1sw1",level1="coresw1",level2="l1sw1"}
up{instance="host3",level1="coresw1",level2="l1sw1",level3="host3"}
up{instance="host4",level1="coresw1",level2="l1sw1",level3="host4"}
up{instance="l2sw1",level1="coresw1",level2="l1sw1",level3="l2sw1"}
up{instance="host5",level1="coresw1",level2="l1sw1",level3="l2sw1",level4="host5"}
up{instance="l1sw2",level1="coresw1",level2="l1sw2"}
up{instance="host6",level1="coresw1",level2="l1sw2",level3="host6"}

The rule is simply that the lowest "level" label is equal to the "instance" label, and the "depth" in the tree is equal to the number of "level" labels.

Then inhibit rules something like this:

inhibit_rules:
  - source_matchers:
      - level1=~'.+'
      - level2=''
    target_matchers:
      - level2=~'.+'
    equal: ['level1']
  - source_matchers:
      - level1=~'.+'
      - level2=~'.+'
      - level3=''
    target_matchers:
      - level3=~'.+'
    equal: ['level1','level2']
  - source_matchers:
      - level1=~'.+'
      - level2=~'.+'
      - level3=~'.+'
      - level4=''
    target_matchers:
      - level4=~'.+'
    equal: ['level1','level2','level3']
... etc

This means that:
* An alert with level1="foo" (but no level2, i.e. it's at depth 1 in the tree) will suppress any alert for something with depth>1 and level1="foo"
* An alert with level1="foo",level2="bar" (but no level3, i.e. it's at depth 2 in the tree)  will suppress any alert for something with depth>2, level1="foo" and level2="bar"
* etc
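
Since the rules follow a regular pattern, the "... etc" part could be generated rather than typed. A minimal sketch in Python, assuming you know the maximum depth of your tree; the function name `inhibit_rules_yaml` is made up for illustration and the output is plain text meant to be pasted into alertmanager.yml (like the rules above, untested against a real Alertmanager):

```python
def inhibit_rules_yaml(max_depth: int) -> str:
    """Render one inhibit rule per source depth 1 .. max_depth-1."""
    lines = ["inhibit_rules:"]
    for depth in range(1, max_depth):
        lines.append("  - source_matchers:")
        # The source sits exactly at `depth`: level1..level<depth> are set
        # and the next level label is empty (absent).
        for n in range(1, depth + 1):
            lines.append(f"      - level{n}=~'.+'")
        lines.append(f"      - level{depth + 1}=''")
        # The target is anything deeper than the source ...
        lines.append("    target_matchers:")
        lines.append(f"      - level{depth + 1}=~'.+'")
        # ... that shares the source's whole path from the root.
        equal = ",".join(f"'level{n}'" for n in range(1, depth + 1))
        lines.append(f"    equal: [{equal}]")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Depth 4 covers the example tree (host5 is at depth 4).
    print(inhibit_rules_yaml(4), end="")
```

With max_depth=4 this prints the same three rules as written out above.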

Untested, but you get the idea.  Let me know if something like this works for you.

Generating those labels by hand is tedious, but you could write a script which reads in a set of targets with "instance" and "parent" attributes, and rewrites them to depth/level1/level2 etc.
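
A rough sketch of such a script in Python, assuming the targets are available as a simple instance -> parent mapping (the `PARENTS` dict and the function names are made up for illustration; how you feed the resulting labels into Prometheus, e.g. via a file_sd target file, is up to you):

```python
from typing import Dict, Optional

# Hypothetical input: each instance mapped to its parent (None for the root).
PARENTS: Dict[str, Optional[str]] = {
    "coresw1": None,
    "host1": "coresw1",
    "host2": "coresw1",
    "l1sw1": "coresw1",
    "host3": "l1sw1",
    "host4": "l1sw1",
    "l2sw1": "l1sw1",
    "host5": "l2sw1",
    "l1sw2": "coresw1",
    "host6": "l1sw2",
}


def level_labels(parents: Dict[str, Optional[str]]) -> Dict[str, Dict[str, str]]:
    """Return instance -> labels, with level1..levelN holding the path from the root."""
    result = {}
    for instance in parents:
        path = []
        seen = set()
        node: Optional[str] = instance
        # Walk up to the root, collecting the path (self first, root last).
        while node is not None:
            if node in seen:
                raise ValueError(f"cycle in parent links at {node!r}")
            seen.add(node)
            path.append(node)
            node = parents[node]
        path.reverse()
        labels = {"instance": instance}
        for depth, name in enumerate(path, start=1):
            labels[f"level{depth}"] = name
        result[instance] = labels
    return result


def format_series(labels: Dict[str, str]) -> str:
    """Format labels in the same style as the listing above."""
    body = ",".join(f'{key}="{value}"' for key, value in labels.items())
    return "up{" + body + "}"


if __name__ == "__main__":
    for labels in level_labels(PARENTS).values():
        print(format_series(labels))
```

The lowest "level" label always ends up equal to "instance", and the number of "level" labels is the depth, as per the rule above.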

Brian Candler

Sep 6, 2022, 9:11:30 AM
to Prometheus Users
It might be possible to simplify this a bit, if:

1. An active but "inhibited" alert is still able to inhibit another alert (I don't know if this is true, I have not tested it)
2. All devices have unique instance names
3. You know for sure that whenever a device fails at depth N, the failure *will* cascade and cause all its children and their descendants to alert.

In this case, a device's alert only needs to be inhibited by its immediate parent, which means you only need the *lowest two levels*, where level(N-1) is the parent and level(N) is the device itself, and (N) is the depth in the tree.

up{instance="coresw1",level1="coresw1"}
up{instance="host1",level1="coresw1",level2="host1"}
up{instance="host2",level1="coresw1",level2="host2"}
up{instance="l1sw1",level1="coresw1",level2="l1sw1"}
up{instance="host3",level2="l1sw1",level3="host3"}
up{instance="host4",level2="l1sw1",level3="host4"}
up{instance="l2sw1",level2="l1sw1",level3="l2sw1"}
up{instance="host5",level3="l2sw1",level4="host5"}
up{instance="l1sw2",level1="coresw1",level2="l1sw2"}
up{instance="host6",level2="l1sw2",level3="host6"}

inhibit_rules:
  - source_matchers:
      - level1=~'.+'
      - level2=''
    target_matchers:
      - level2=~'.+'
    equal: ['level1']
  - source_matchers:
      - level2=~'.+'
      - level3=''
    target_matchers:
      - level3=~'.+'
    equal: ['level2']
  - source_matchers:
      - level3=~'.+'
      - level4=''
    target_matchers:
      - level4=~'.+'
    equal: ['level3']
... etc

Now consider what happens if coresw1 fails.  All the other devices will raise alerts, due to the way they are interconnected.
The alert for coresw1 will inhibit the alerts for host1, host2, l1sw1 and l1sw2.
The alert for l1sw1 will inhibit the alerts for host3, host4 and l2sw1.
The alert for l1sw2 will inhibit the alert for host6.
The alert for l2sw1 will inhibit the alert for host5.

This isn't as robust as the previous example, where the alert for coresw1 will *directly* inhibit all the other dependent resources; but it is slightly easier to create the instance labels.  You just need to know what depth it is in the tree, and you need to add level(N-1)=parent and level(N)=self.
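
For illustration, a minimal Python sketch of that labelling, again assuming a made-up instance -> parent mapping (`PARENTS` and `two_level_labels` are hypothetical names, not part of any tool):

```python
from typing import Dict, Optional

# Hypothetical input: each instance mapped to its parent (None for the root).
PARENTS: Dict[str, Optional[str]] = {
    "coresw1": None,
    "host1": "coresw1",
    "host2": "coresw1",
    "l1sw1": "coresw1",
    "host3": "l1sw1",
    "host4": "l1sw1",
    "l2sw1": "l1sw1",
    "host5": "l2sw1",
    "l1sw2": "coresw1",
    "host6": "l1sw2",
}


def depth_of(instance: str, parents: Dict[str, Optional[str]]) -> int:
    """Depth in the tree, counting the root as depth 1."""
    depth = 1
    seen = {instance}
    node = parents[instance]
    while node is not None:
        if node in seen:
            raise ValueError(f"cycle in parent links at {node!r}")
        seen.add(node)
        depth += 1
        node = parents[node]
    return depth


def two_level_labels(parents: Dict[str, Optional[str]]) -> Dict[str, Dict[str, str]]:
    """Return instance -> labels with only level(N-1)=parent and level(N)=self."""
    result = {}
    for instance, parent in parents.items():
        n = depth_of(instance, parents)
        labels = {"instance": instance}
        if parent is not None:
            labels[f"level{n - 1}"] = parent
        labels[f"level{n}"] = instance
        result[instance] = labels
    return result


if __name__ == "__main__":
    for labels in two_level_labels(PARENTS).values():
        body = ",".join(f'{key}="{value}"' for key, value in labels.items())
        print("up{" + body + "}")
```

This only produces the labels; whether the chained inhibition actually works still depends on the three conditions listed at the top of this message.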