[RFC][Proposal]: Alertmanager Time Intervals


Benjamin Ridley

Jun 7, 2020, 11:32:50 PM
to Prometheus Users
Hi everyone,

I'm sure many of you have run into the problem of controlling alerts in Alertmanager based on time, such as the time of day or whether it is outside business hours. There is also a longstanding issue on the Alertmanager GitHub about this, which I encourage you to read if you want more context.

This is a proposed design for defining time intervals in the Alertmanager configuration file and how they would be used in the routing tree to silence particular routes inside or outside the specified intervals, allowing users to model time-based requirements to their liking.

The document is open for suggestions and comments and any feedback is welcomed, so please take a look and let us know what you think. You can access the document here.

Cheers,
Ben

Rajesh Reddy Nachireddi

Jun 8, 2020, 12:37:15 AM
to Benjamin Ridley, Prometheus Users
Hi,


Your proposal makes this generic for time-of-day notifications; however, there might be a few other issues as we go. A better approach is to follow inhibit rules.

Regards,
Rajesh

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/5145dc1a-282f-446a-b5fc-4f4d5d95504eo%40googlegroups.com.

Benjamin Ridley

Jun 8, 2020, 1:11:51 AM
to Rajesh Reddy Nachireddi, Prometheus Users
Thanks Rajesh,

I am aware of the inhibition rules approach and the success that users have had with it. I encourage you to read the GitHub thread for more discussion on this, but the current line of thinking is that this approach is quite complex, especially for new users. It also requires users to define recording rules in Prometheus for something that should, in theory, be outside of Prometheus' 'awareness', since it is purely an alerting decision.

Cheers,
Ben

Benjamin Ridley

Jun 15, 2020, 9:55:53 PM
to Prometheus Users
Hi everyone,

Thank you all for your feedback on the Alertmanager Time Interval design doc so far. The design has been greatly simplified and (in my opinion) improved already due to the feedback received.

I've made some changes to the proposed implementation so that it now works on receivers, not on the routes themselves. Brian pointed out that the previous route-based approach requires users to define many of their routes twice, with the only difference being the active time interval and which receiver to use. For example, notice how in the snippet below the 'severity: warning' alerts have two routes, when all that really needs to change is the receiver and the time:
- match:
    severity: warning
  time_intervals:
    include:
      - business_hours
    exclude:
      - public_holidays
  receiver: team-X-pager
  continue: true
- match:
    severity: warning
  receiver: team-X-slack
  time_intervals:
    exclude:
      - business_hours
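
To make the include/exclude behaviour concrete, here is a minimal Python sketch of how a route's 'time_intervals' might be evaluated. It is only an illustration: the semantics (a route is active when the current time is inside at least one included interval, or no include list is given, and inside no excluded interval), the interval definitions, and the names `route_is_active`, `INTERVALS` and `PUBLIC_HOLIDAYS` are all assumptions made for this example, not something taken from the design doc or from Alertmanager.

```python
from datetime import datetime, time

# Hypothetical interval definitions; the real format is whatever the design doc specifies.
def business_hours(now: datetime) -> bool:
    # Monday to Friday, from 09:00 up to (not including) 17:00.
    return now.weekday() < 5 and time(9, 0) <= now.time() < time(17, 0)

PUBLIC_HOLIDAYS = {(1, 1), (12, 25)}  # (month, day) pairs, illustrative only

def public_holidays(now: datetime) -> bool:
    return (now.month, now.day) in PUBLIC_HOLIDAYS

INTERVALS = {
    "business_hours": business_hours,
    "public_holidays": public_holidays,
}

def route_is_active(now: datetime, include=(), exclude=()) -> bool:
    """Assumed semantics: inside some included interval (or no include
    list at all), and inside none of the excluded intervals."""
    included = not include or any(INTERVALS[name](now) for name in include)
    excluded = any(INTERVALS[name](now) for name in exclude)
    return included and not excluded

# Monday morning: the pager route would be active, the Slack route would not.
monday_10am = datetime(2020, 6, 8, 10, 0)
pager_active = route_is_active(monday_10am, include=["business_hours"], exclude=["public_holidays"])
slack_active = route_is_active(monday_10am, exclude=["business_hours"])
```

Under these assumptions exactly one of the two routes is active on an ordinary day, which is why they differ only in receiver and interval.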

So the current proposal is to introduce a 'timed_receivers' section on a route that pairs receivers with time intervals. This way the two routes above are simplified into a single block:
- match:
    severity: warning
  receiver: team-X-slack
  timed_receivers:
  - receiver: team-X-pager
    include_intervals:
    - business_hours
    exclude_intervals:
    - public_holidays
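
As a rough illustration of how a route with 'timed_receivers' could resolve to a receiver, here is a small Python sketch. The resolution rules are assumptions on my part for the sake of the example (the first timed receiver whose intervals match wins, otherwise the route's plain 'receiver' is used), and `pick_receiver` and `active_intervals` are hypothetical names, not part of Alertmanager.

```python
def pick_receiver(route: dict, active_intervals: set) -> str:
    """Return the receiver for a route, given the names of the time
    intervals that currently apply.

    Assumed rules: the first timed receiver whose include/exclude
    intervals match is chosen; otherwise fall back to route["receiver"].
    """
    for timed in route.get("timed_receivers", []):
        include = timed.get("include_intervals", [])
        exclude = timed.get("exclude_intervals", [])
        included = not include or any(name in active_intervals for name in include)
        excluded = any(name in active_intervals for name in exclude)
        if included and not excluded:
            return timed["receiver"]
    return route["receiver"]

# The single-block route from above, written as a Python dict.
route = {
    "match": {"severity": "warning"},
    "receiver": "team-X-slack",
    "timed_receivers": [
        {
            "receiver": "team-X-pager",
            "include_intervals": ["business_hours"],
            "exclude_intervals": ["public_holidays"],
        },
    ],
}

during_business_hours = pick_receiver(route, {"business_hours"})
outside_business_hours = pick_receiver(route, set())
on_a_holiday = pick_receiver(route, {"business_hours", "public_holidays"})
```

A route without a 'timed_receivers' key simply falls through to its normal receiver, which matches the backwards-compatibility goal.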
 
Additionally, this approach maintains the desirable characteristics of previous solutions in that it has no impact on existing routing decisions. Adding the 'timed_receivers' key also maintains backwards compatibility for existing configurations.

Please let me know what you think, either here or in the comments of the design doc found here.

Cheers,
Ben
