[RFC][Proposal]: Alertmanager Time Intervals


Benjamin Ridley

Jun 7, 2020, 11:31:13 PM
to Prometheus Developers
Hi everyone,

I'm sure many of you have come across the problem of controlling alerts inside Alertmanager based on the time of day, for example suppressing them outside business hours. There is also a long-standing issue on the Alertmanager GitHub repository about this, which I encourage you to read if you want more context.

This is a proposed design for defining time intervals in the Alertmanager configuration file and how they would be used in the routing tree to silence particular routes inside or outside the specified intervals, allowing users to model time-based requirements to their liking.

The document is open for suggestions and comments and any feedback is welcomed, so please take a look and let us know what you think. You can access the document here.

Cheers,
Ben

Benjamin Ridley

Jun 15, 2020, 9:55:29 PM
to Prometheus Developers
Hi everyone,

Thank you all for your feedback on the Alertmanager Time Interval design doc so far. The design has been greatly simplified and (in my opinion) improved already due to the feedback received.

I've changed the proposed implementation so that time intervals now apply to receivers rather than to the routes themselves. Brian pointed out that the previous route-based approach requires users to define many of their routes twice, with the only difference being the active time interval and which receiver to use. For example, in the snippet below the 'severity: warning' alerts need two routes, even though only the receiver and the time intervals differ:
- match:
    severity: warning
  time_intervals:
    include:
      - business_hours
    exclude:
      - public_holidays
  receiver: team-X-pager
  continue: true
- match:
    severity: warning
  receiver: team-X-slack
  time_intervals:
    exclude:
      - business_hours

So the current proposal is to introduce a 'timed_receivers' section on a route that pairs receivers with time intervals. This way the two routes above collapse into a single block:
- match:
    severity: warning
  receiver: team-X-slack
  timed_receivers:
  - receiver: team-X-pager
    include_intervals:
    - business_hours
    exclude_intervals:
    - public_holidays
 
Additionally, this approach keeps the desirable characteristic of the previous solutions: it has no impact on existing routing decisions. Because 'timed_receivers' is a new addition to the route, existing configurations that do not use it keep working unchanged.
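
To illustrate how 'include_intervals' and 'exclude_intervals' might combine for a single timed receiver, here is a minimal Python sketch. It is not the proposed implementation: the INTERVALS table, the is_active helper, and the concrete hours and holiday dates are hypothetical stand-ins, and treating an empty include list as "always included" is an assumption.

```python
from datetime import datetime

# Hypothetical named intervals. In the proposal these would be defined in the
# Alertmanager configuration file; here each name maps to a simple predicate.
INTERVALS = {
    # Assumed definition: Monday to Friday, 09:00 up to (not including) 17:00.
    "business_hours": lambda t: t.weekday() < 5 and 9 <= t.hour < 17,
    # Assumed definition: a fixed set of (month, day) pairs.
    "public_holidays": lambda t: (t.month, t.day) in {(1, 1), (12, 25)},
}


def is_active(now, include_intervals=(), exclude_intervals=()):
    """Return True if a timed receiver should be used at time `now`.

    Assumption: an empty include list means "always included"; any matching
    exclude interval takes precedence over the include list.
    """
    included = not include_intervals or any(
        INTERVALS[name](now) for name in include_intervals
    )
    excluded = any(INTERVALS[name](now) for name in exclude_intervals)
    return included and not excluded
```

With these assumed definitions, a weekday morning that is not a holiday makes the 'team-X-pager' entry active, while a public holiday or a weekend does not, so the route's plain 'receiver' would be used instead.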

Please let me know what you think, either here or in the comments of the design doc found here.

Cheers,
Ben

Benjamin Ridley

Jun 30, 2020, 8:30:51 PM
to Prometheus Developers
Hi everyone,

The Alertmanager Time Interval design doc is now in its final stages. Again, I want to thank you all for your feedback and comments; they have been very productive and helpful.

The final design allows users to specify a hierarchy of receivers with valid times for a route, and Alertmanager will choose the first receiver that matches the time of the alert firing. I have provided some examples in the design doc of how this might be used.
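
As a rough illustration of the "first receiver that matches" rule, here is a minimal Python sketch. The TimedReceiver type, the pick_receiver function, and the fallback to a default receiver when nothing matches are hypothetical and are not taken from the design doc.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, List


@dataclass
class TimedReceiver:
    """Hypothetical pairing of a receiver name with a time check."""
    receiver: str
    # Returns True when the receiver's time intervals cover the given time.
    active: Callable[[datetime], bool]


def pick_receiver(default: str, timed: List[TimedReceiver], fired_at: datetime) -> str:
    """Walk the hierarchy in order and return the first receiver that is active.

    Assumption: if no timed receiver matches, the route's default receiver is used.
    """
    for entry in timed:
        if entry.active(fired_at):
            return entry.receiver
    return default


def business_hours(t: datetime) -> bool:
    # Assumed definition: Monday to Friday, 09:00 up to (not including) 17:00.
    return t.weekday() < 5 and 9 <= t.hour < 17


# Example hierarchy: page during business hours, otherwise fall back to Slack.
hierarchy = [TimedReceiver("team-X-pager", business_hours)]
```

Calling pick_receiver("team-X-slack", hierarchy, fired_at) with a weekday mid-morning timestamp would select 'team-X-pager', and with an evening timestamp it would fall through to 'team-X-slack'. Because the list is scanned in order, earlier entries take priority when several are active at once.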

I will leave the document open for the rest of the week for final comments and review, and then I will close it for comments and hopefully we can begin implementation. You can find the document here.

Cheers,
Ben