TLDR:
I want to generate alerts if a job fails to run during its uptime window, and don't want alerts generated outside of those hours. How do I configure this time window in Prometheus or Alertmanager?
Full story:
I have Prometheus metrics coming out of a service that runs scheduled jobs, and am attempting to configure alerting rules to alert if the service dies. There are two main failure states: the service completely dies, or the service stays up but the schedule stops running jobs.
One of the metrics exported is a counter that increments each time the job runs. I currently have an alert that fires if the job hasn't run in the last 5 minutes (there's probably a slightly better way of doing this too, perhaps with increase()?). I have 'FOR 5m' in there to give the job time to get the rate up after a service restart.
ALERT InvoicingStopped
IF rate(jobs_invoice_run_total{job="jobs"}[5m]) < 0.1
FOR 5m
LABELS {severity="critical"}
The catch is that the job only runs between 0600 and 2100 each day. I don't want alerts being sent out during the night. I would prefer to have a solution in the IF line of the alert that prevents it from firing during the downtime window, but if that's not possible, I'd be content to have a silence period configured in Alertmanager. I don't know how to do either, and so am looking for some help. Does the answer have something to do with the last successful scrape (scrapes should be successful during the downtime, just the job run counters won't increment)?
Feel free to throw in any other advice you want as well as I'm still learning how to use Prometheus.
Thanks,
Matt
For the alert, I would suggest increase() for this style of test.
The time-of-day restriction could be handled if there were a metric giving the time of day in seconds. Maybe this is something we could add to the query language, but of course it can get complicated with timezones and DST. We officially do everything in Prometheus in UTC to avoid this.
As a workaround, your app could emit this as a gauge now, and you could do
IF increase(jobs_invoice_run_total{job="jobs"}[5m]) < 1 and time_of_day > 60*60*6 and time_of_day < 60*60*21
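A rough sketch of the same window check without an app-side gauge, assuming the built-in time() and vector() functions are available in your Prometheus version. time() % 86400 gives seconds since midnight UTC, so 0600 is 6 * 3600 = 21600 and 2100 is 21 * 3600 = 75600. The on() is needed because the time samples carry no labels while the counter does. This assumes the 0600 to 2100 window is meant in UTC; a local-time window would need its bounds shifted by the UTC offset, and DST is not handled.

```
ALERT InvoicingStopped
  # Left side: the job counter did not move in the last 5 minutes.
  # Right side: current UTC time of day is inside [06:00, 21:00).
  IF increase(jobs_invoice_run_total{job="jobs"}[5m]) < 1
    and on() (vector(time()) % 86400 >= 6 * 3600 and vector(time()) % 86400 < 21 * 3600)
  FOR 5m
  LABELS {severity="critical"}
```

With FOR 5m in place, the condition has to hold for five minutes inside the window before the alert fires, which should also give the job a few minutes to start incrementing the counter after 0600.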
--
You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-devel...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Matt,
Were you able to solve this with adding a metric using the last capture date? If so, can you describe how you did that?
All,
On a more general level, Prometheus recently added time functions, but when I tried this:
IF rate(datafeed[10m]) == 0 and day_of_week() < 6 and hour() >= 13
it did not result in alerts, even though I knew the datafeed was flat. When I remove the day_of_week() and hour() calls I get alerts. Any feedback on how to correctly use the time functionality to limit the alerts would be appreciated!
Thanks,
Kris.
Matt Smith | 0414 379 210 | www.syple.com
This post shows how to use time restrictions in your alerts. https://www.robustperception.io/combining-alert-conditions
- Neil
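For anyone landing here later, a sketch of how a time restriction can be attached to the datafeed expression quoted earlier in the thread. The likely reason that version never fired is label matching: "and" only keeps left-hand samples whose full label set matches a right-hand sample, and hour() and day_of_week() each return a single sample with no labels, so nothing matches a series that carries labels such as job or instance. Adding on() tells "and" to ignore labels when matching. The thresholds are copied from the original expression, and all times are UTC.

```
# Time condition is grouped in parentheses and joined with "and on()",
# so it acts as a label-free on/off switch for the left-hand side.
IF rate(datafeed[10m]) == 0 and on() (day_of_week() < 6 and hour() >= 13)
```

Note that day_of_week() returns 0 for Sunday and 6 for Saturday, so "< 6" still includes Sunday. If the intent was Monday to Friday only, the condition would need day_of_week() > 0 as well.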