Alerting within specific time periods


ma...@syple.com.au

Jun 26, 2016, 9:17:46 PM
to Prometheus Developers
Hi all,

TLDR:
I want to generate alerts if a job fails to run during its uptime window, and don't want alerts generated outside of those hours. How do I configure this time window in Prometheus or Alertmanager?

Full story:
I have Prometheus metrics coming out of a service that runs scheduled jobs, and am attempting to configure alerting rules to alert if the service dies. There are two main failure states: the service completely dies, or the service stays up but the schedule stops running jobs.

One of the metrics exported is a counter that increments each time the job runs. I currently have an alert that fires if the job hasn't run in the last 5 minutes (there's probably a slightly better way of doing this too, perhaps with increase()?). I have 'FOR 5m' in there to give the job time to get the rate up after a service restart.

ALERT InvoicingStopped
IF rate(jobs_invoice_run_total{job="jobs"}[5m]) < 0.1
FOR 5m
LABELS {severity="critical"}

The catch is that the job only runs between 0600 and 2100 each day. I don't want alerts being sent out during the night. I would prefer to have a solution in the IF line of the alert that prevents it from firing during the downtime window, but if that's not possible, I'd be content to have a silence period configured in Alertmanager. I don't know how to do either, so I'm looking for some help. Does the answer have something to do with the last successful scrape (scrapes should be successful during the downtime; only the job run counters won't increment)?

Feel free to throw in any other advice you want as well as I'm still learning how to use Prometheus.

Thanks,
Matt

Ben Kochie

Jun 27, 2016, 3:52:47 AM
to ma...@syple.com.au, Prometheus Developers

For the alert, I would suggest increase() for this style of test.

The time of day thing could be handled if there was a time-of-day-in-seconds type of metric.  Maybe this is something we could add to the query language, but of course these can get complicated with timezones and DST.  We officially do everything in Prometheus in UTC to avoid this.

As a workaround, your app could emit this as a gauge now, and you could do

IF increase(jobs_invoice_run_total{job="jobs"}[5m]) < 1 and time_of_day > 60*60*6 and time_of_day < 60*60*21
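
A possible variant that avoids the app-side gauge, as a rough sketch only: it assumes the Prometheus version in use provides the built-in time() and vector() functions, and that the window bounds are given in UTC. time() % 86400 gives seconds since UTC midnight; because that value carries no labels, `and on()` is used so it can be combined with the job counter regardless of the counter's labels.

```promql
# Sketch: gate the alert on seconds since UTC midnight.
# Assumption: 06:00-21:00 is expressed in UTC, not local time.
increase(jobs_invoice_run_total{job="jobs"}[5m]) < 1
  and on() (vector(time() % 86400) >= 6 * 3600)
  and on() (vector(time() % 86400) < 21 * 3600)
```

Working in seconds rather than whole hours would also allow bounds such as 06:30, at the cost of a less readable expression.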

--
You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-devel...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Björn Rabenstein

Jun 27, 2016, 9:47:33 AM
to Ben Kochie, ma...@syple.com.au, Prometheus Developers
The relevant issue here is
https://github.com/prometheus/prometheus/issues/1545 . There has been
plenty of discussion here. It's hard because of timezones and DST, as
Ben said.

And then it's not clear where the logic should live: in the alerting
expression, in the alert routing on Alertmanager, or delegated
further down the chain to something like PagerDuty, which obviously is
already quite concerned with schedules.

I think it's most likely that there will be a "day of the week" and
"time of the day" function one day in PromQL. But it's not a top
priority.

--
Björn Rabenstein, Engineer
http://soundcloud.com/brabenstein


Matt Smith

Jun 27, 2016, 7:16:03 PM
to Prometheus Developers
Thanks for the replies. I will solve this for now by adding a metric with the last capture date, I guess.  I feel like what I need is already there, but I just can't get to it: Prometheus knows the times at which the counters were scraped, and if I could get at those I could do what I need, even if it was a bit ugly.  Is there a time series available for the scraping history?  In the meantime I'll go and read that GitHub issue as well.

Thanks,
Matt

Matt Smith

Oct 12, 2016, 10:58:06 PM
to Prometheus Developers
Hey Kris,  

I haven't made that attempt, but the new hour() function will allow me to do what I want.  I'll just have to be a little careful when daylight savings changes as the window that the business process runs in is based off local time rather than UTC.
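
A minimal sketch of how the original alert condition might look with hour(), assuming for illustration that the 06:00 to 21:00 window is expressed in UTC (the real window is in local time, so the bounds would need shifting and revisiting at each daylight saving change). hour() returns a single sample with no labels, so `and on()` is used to ignore labels when combining it with the job counter.

```promql
# Fire only when no invoice job ran in the last 5 minutes
# and the current UTC hour is between 06:00 and 20:59.
# Assumption: window bounds are in UTC, not local time.
increase(jobs_invoice_run_total{job="jobs"}[5m]) < 1
  and on() hour() >= 6
  and on() hour() < 21
```

A local window that wraps past UTC midnight would instead need its two bounds joined with `or` inside parentheses on the right-hand side of a single `and on()`.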

On 13 October 2016 at 00:23, Kris Berry <be...@mosaicatm.com> wrote:
Matt,

Were you able to solve this with adding a metric using the last capture date?  If so, can you describe how you did that?

All,

On a more general level, Prometheus recently added time functions, but when I tried this:

  IF rate(datafeed[10m]) == 0 and day_of_week() < 6 and hour() >= 13

it did not result in alerts, even though I knew the datafeed was flat. When I remove the day_of_week() and hour() calls I get alerts.  Any feedback on how to correctly use the time functionality to limit the alerts would be appreciated!

Thanks,
Kris.
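
One likely reason the expression above never fires: day_of_week() and hour() each return a single sample with no labels, while rate(datafeed[10m]) carries the labels of the scraped series, and a plain `and` only keeps samples whose label sets match exactly. A sketch of the same condition using `and on()` to ignore labels when matching (hour() is in UTC, and day_of_week() counts from 0 for Sunday, so `< 6` excludes only Saturday):

```promql
# Same condition as above, but matching on no labels so the
# label-less time vectors can gate the datafeed series.
rate(datafeed[10m]) == 0
  and on() day_of_week() < 6
  and on() hour() >= 13
```

If the intent was Monday to Friday only, an extra `and on() day_of_week() > 0` clause would exclude Sunday as well.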




--

Matt Smith  |  0414 379 210  |  www.syple.com 



geal...@gmail.com

Dec 6, 2018, 1:49:22 PM
to Prometheus Developers

This post shows how to use time restrictions in your alerts. https://www.robustperception.io/combining-alert-conditions

- Neil

Da Sm

Jun 16, 2020, 3:38:28 PM
to Prometheus Developers


Additionally, there's an interesting post about implementing daylight savings here:

