always notify the boyz


Gabor Kiss

Jul 28, 2021, 10:56:09 AM
to Prometheus Users
Hi there! 

Another silly question for you. If something happens, always notify the boyz group. If it's no route for the critical, then say boyz_pager. Here is my routing table:

route:
  group_by: ['alertname', 'cluster', 'service', 'owner']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: boyz

  routes:
  # This routes performs a regular expression match on alert labels to
  # catch alerts that are related to a list of services.

  - match:
      owner: A
    receiver: A_team
    routes:
    - match:
        severity: warning
      receiver: A_team
    - match:
        severity: critical
      receiver: A_pager

  - match:
      owner: B
    receiver: B_team
    routes:
    - match:
        severity: warning
      receiver: B_team
    - match:
        severity: critical
      receiver: B_pager

  - match:
      owner: team_without_critical_response
    receiver: team_without_critical_response_team
    routes:
    - match:
        severity: warning
      receiver: team_without_critical_response_team

  - match:
      severity: critical
      owner: boyz
    receiver: boyz_pager

I don't know where to put the continue. How does this work? The example led me nowhere.
Example: the sky is falling and team_without_critical_response gets the message, but the alert has critical severity, so the boyz should also be paged via PagerDuty. How?
 

Ian Billett

Jul 29, 2021, 11:33:18 AM
to Gabor Kiss, Prometheus Users
Hey Gabor,

For questions around unexpected alertmanager routing, I always recommend checking out amtool, which is a helper binary for alertmanager configuration.

For example, it lets you see which routes would match for a given alert: amtool config routes test ...
> Will return receiver names which the alert with given labels resolves to. If the labelset resolves to multiple receivers, they are printed out in order as defined in the routing tree.

Hope you can figure out the problem!

Best,

Ian  


Brian Candler

Jul 30, 2021, 4:09:18 AM
to Prometheus Users
And there is an online tool here for testing and visualising alerting configs:

If you want *all* alerts to go to the "boyz" group, then put them at the top - with no "match" clause since you want to match everything, and with "continue".

  - receiver: boyz
    continue: true
  - match:
      owner: A
    receiver: A_team
    ...etc


Your routes appear to be more verbose than necessary. For example:

  - match:
      owner: A
    receiver: A_team
    routes:
    - match:
        severity: warning
      receiver: A_team
    - match:
        severity: critical
      receiver: A_pager

The default receiver in this group is "A_team", so there's no need to match on severity: warning.  Hence I think it can simplify to:

  - match:
      owner: A
    receiver: A_team
    routes:
    - match:
        severity: critical
      receiver: A_pager

Then:
- if the "severity: critical" is matched, it goes to A_pager
- otherwise it goes to A_team
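
As a rough check of this simplification, here is a minimal Python sketch of how a route tree might resolve an alert. It is a simplified model, not Alertmanager code: the `resolve` function and the dict layout are hypothetical, only equality `match` blocks are modelled, and the `info` severity is an invented test value. Under those assumptions, the verbose and simplified forms pick the same receiver for every severity.

```python
def resolve(route, labels):
    """Return the receivers an alert with `labels` resolves to, in tree order."""
    # A route matches only if every label in its `match` block equals the alert's.
    if any(labels.get(k) != v for k, v in route.get("match", {}).items()):
        return []
    receivers = []
    for child in route.get("routes", []):
        hit = resolve(child, labels)
        receivers += hit
        # Stop at the first matching child unless it sets `continue: true`.
        if hit and not child.get("continue", False):
            break
    # If no child matched, the route's own receiver handles the alert.
    return receivers or [route["receiver"]]


# The original, verbose form of the owner: A subtree.
verbose = {
    "match": {"owner": "A"}, "receiver": "A_team",
    "routes": [
        {"match": {"severity": "warning"}, "receiver": "A_team"},
        {"match": {"severity": "critical"}, "receiver": "A_pager"},
    ],
}

# The simplified form: only the critical case needs its own child route.
simplified = {
    "match": {"owner": "A"}, "receiver": "A_team",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "A_pager"},
    ],
}

for severity in ("warning", "critical", "info"):
    alert = {"owner": "A", "severity": severity}
    print(severity, resolve(verbose, alert), resolve(simplified, alert))
```

Running `amtool config routes test` against the real configuration is still the authoritative check.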

Now, your second requirement "If it's no route for the critical, then say boyz_pager" is unclear.  If you want *all* critical alerts to go to boyz_pager, then put another rule at the top (with continue).  If you only want critical alerts which have not been caught already (that is, they don't have owner: A or owner: B etc.), then put a rule at the bottom.  This appears to be what you've done, except you've also matched on owner: boyz, which means only critical alerts with that owner will reach boyz_pager.

If you change your last rule to

  - match:
      severity: critical
    receiver: boyz_pager

then I think it will do what you want: i.e. it catches any alert which hasn't already been matched and is critical.

And if *that* rule doesn't match, it will fall through to using the top-level default receiver (under the top-level "route" block), which in your case is "boyz".  I'm not sure if that will end up sending them twice, since you've already sent to 'boyz' earlier.  If it's a problem, then make a null receiver.
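
To explore the fall-through and the double-send question, here is a minimal Python sketch of the routing walk. It is a simplified model under stated assumptions, not Alertmanager itself: `resolve` and the dict layout are hypothetical, only equality `match` blocks are modelled, and `owner: C` is an invented label value standing in for an unowned alert. In this model the top-level receiver is used only when no child route matches, so with a catch-all child at the top, "boyz" appears once rather than twice.

```python
def resolve(route, labels):
    """Return the receivers an alert with `labels` resolves to, in tree order."""
    # A route matches only if every label in its `match` block equals the alert's.
    if any(labels.get(k) != v for k, v in route.get("match", {}).items()):
        return []
    receivers = []
    for child in route.get("routes", []):
        hit = resolve(child, labels)
        receivers += hit
        # Stop at the first matching child unless it sets `continue: true`.
        if hit and not child.get("continue", False):
            break
    # If no child matched, the route's own receiver handles the alert.
    return receivers or [route["receiver"]]


# Catch-all with continue at the top, owner route in the middle,
# and the critical catch-all (without the owner match) at the bottom.
tree = {
    "receiver": "boyz",
    "routes": [
        {"receiver": "boyz", "continue": True},
        {"match": {"owner": "A"}, "receiver": "A_team",
         "routes": [{"match": {"severity": "critical"}, "receiver": "A_pager"}]},
        {"match": {"severity": "critical"}, "receiver": "boyz_pager"},
    ],
}

print(resolve(tree, {"owner": "A", "severity": "critical"}))  # ['boyz', 'A_pager']
print(resolve(tree, {"owner": "A", "severity": "warning"}))   # ['boyz', 'A_team']
print(resolve(tree, {"owner": "C", "severity": "critical"}))  # ['boyz', 'boyz_pager']
print(resolve(tree, {"owner": "C", "severity": "warning"}))   # ['boyz']
```

If the real behaviour differs from this model, `amtool config routes test` should show it, and a null receiver remains a fallback option.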

Gabor Kiss

Jul 30, 2021, 8:09:55 AM
to Prometheus Users
Hi Guys!

@Bill: Thanks for the tool!
@Brian: you were absolutely right about the first section, and the simplified rule is awesome. I figured it out with the boyz. Also, I got what I wanted; here it is:


route:
  group_by: ['alertname', 'cluster', 'service', 'owner']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: boyz


  # The child route trees.
  routes:
  # This routes performs a regular expression match on alert labels to
  # catch alerts that are related to a list of services.

  - receiver: boyz
    continue: true
  - match:
      severity: critical
    continue: true
    receiver: boyz_pager

  - match:
      owner: A
    receiver: A_team
    routes:
    - match:
        severity: critical
      receiver: A_pager

  - match:
      owner: B 
    receiver: B_team
    routes:
    - match:
        severity: critical
      receiver: B_pager

  - match:
      owner: team_without_on-call_duty 
    receiver: team_without_on-call_duty
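
For readers who want to see where alerts land with this final routing table, here is a minimal Python sketch that walks an equivalent tree. It is a simplified model and not Alertmanager code: `resolve` and the dict layout are hypothetical, and only equality `match` blocks are modelled. Under those assumptions, the "sky is falling" case reaches boyz, boyz_pager, and the owning team, and every critical alert pages boyz_pager even when a team has its own pager.

```python
def resolve(route, labels):
    """Return the receivers an alert with `labels` resolves to, in tree order."""
    # A route matches only if every label in its `match` block equals the alert's.
    if any(labels.get(k) != v for k, v in route.get("match", {}).items()):
        return []
    receivers = []
    for child in route.get("routes", []):
        hit = resolve(child, labels)
        receivers += hit
        # Stop at the first matching child unless it sets `continue: true`.
        if hit and not child.get("continue", False):
            break
    # If no child matched, the route's own receiver handles the alert.
    return receivers or [route["receiver"]]


def team(owner, receiver, pager=None):
    """Build an owner route, with an optional child route for critical alerts."""
    route = {"match": {"owner": owner}, "receiver": receiver}
    if pager:
        route["routes"] = [{"match": {"severity": "critical"}, "receiver": pager}]
    return route


final = {
    "receiver": "boyz",
    "routes": [
        {"receiver": "boyz", "continue": True},
        {"match": {"severity": "critical"}, "receiver": "boyz_pager", "continue": True},
        team("A", "A_team", "A_pager"),
        team("B", "B_team", "B_pager"),
        team("team_without_on-call_duty", "team_without_on-call_duty"),
    ],
}

sky_is_falling = {"owner": "team_without_on-call_duty", "severity": "critical"}
print(resolve(final, sky_is_falling))
# ['boyz', 'boyz_pager', 'team_without_on-call_duty']
print(resolve(final, {"owner": "A", "severity": "critical"}))
# ['boyz', 'boyz_pager', 'A_pager']
print(resolve(final, {"owner": "B", "severity": "warning"}))
# ['boyz', 'B_team']
```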


