alert rules between regions - to avoid triggering absent metric


Dan S

Oct 3, 2021, 2:40:23 PM
to Prometheus Users
Hi,

Looking for some general advice about shared Prometheus alert rules between regions.  We currently push the same alert rules to all regions, and sometimes we run into situations where we have a specific job in region X but not in region Y.

This is fine for basic cases, such as up{job="jenkins"} == 0, which will simply be ignored in regions where there's no jenkins job present (or I could easily specify region="X").

But in some situations I'd like to use absent() on a metric that often has gaps, for example:
absent(jenkins_up{job="jenkins"})
This would trigger in all regions, whether or not there's a job "jenkins" (obviously, because it's triggering on the missing metric), even if I try to be more specific: absent(jenkins_up{job="jenkins", region="US"}).
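For concreteness, the kind of rule I mean is roughly this (the rule name and "for" duration are just placeholders):

  - alert: JenkinsMetricAbsent
    expr: absent(jenkins_up{job="jenkins"})
    for: 10m
    annotations:
      summary: "jenkins_up has not been seen for 10 minutes"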

Any suggestions on how I can craft an alert query using absent() on metrics that don't appear in all regions?  So that if region="US" has job="jenkins" and I want to catch gaps there, it won't also fire in region="EU", which never has job="jenkins"?

Thanks for any advice.

Dan

Brian Candler

Oct 3, 2021, 3:14:59 PM
to Prometheus Users
This might be an XY problem, because it is often better to have a defined "up/down" metric (with value 1/0), which tells you whether something worked or not, rather than alerting on presence or absence of a metric.

However, to answer your question directly, I think you would need to include some condition saying whether that metric *should* be there or not - which is the presence of some other metric.  The "up" metric added by all scrape jobs can be useful for this.  In this case, I expect up{job="jenkins"} will exist, if and only if you have a 'jenkins' scrape job in that region.  Therefore maybe something like this will do what you want:

absent(jenkins_up{job="jenkins"}) unless on (job) absent(up{job="jenkins"})

which I think may simplify, if the 'jenkins_up' metric is only scraped by the 'jenkins' job, to this (not sure):

absent(jenkins_up) unless on () absent(up{job="jenkins"})
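Wrapped up as an alerting rule, that would be something like the following (the rule name, "for" duration and annotation are placeholders, adjust to taste):

groups:
  - name: jenkins
    rules:
      - alert: JenkinsMetricAbsent
        expr: absent(jenkins_up{job="jenkins"}) unless on (job) absent(up{job="jenkins"})
        for: 15m
        annotations:
          summary: "jenkins_up is missing in a region which does have a jenkins scrape job"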

Brian Candler

Oct 3, 2021, 3:21:34 PM
to Prometheus Users
Or slightly weirder:

absent(jenkins_up) and absent(absent(up{job="jenkins"}))

absent(absent(...)) being a way to get the RHS to have no labels, to match the LHS.
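To see why that works, evaluate each side on its own (the values below are what I'd expect from absent()'s semantics):

absent(jenkins_up)                  =>  {} 1                only when jenkins_up is missing
absent(up{job="jenkins"})           =>  {job="jenkins"} 1   only when there is no jenkins scrape job
absent(absent(up{job="jenkins"}))   =>  {} 1                only when there IS a jenkins scrape job

Both sides of the "and" end up with no labels, so they match, and the expression only fires where jenkins_up is missing in a region which does have a jenkins job.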

Ben Kochie

Oct 3, 2021, 5:50:59 PM
to Brian Candler, Prometheus Users
Rather than use absent(), you can use the Prometheus meta-monitoring metric prometheus_target_scrape_pool_targets.

Prometheus alerts are meant to be done in layers, where you have separate alerts on `jenkins_up`, `up`, and `prometheus_target_scrape_pool_targets`.

Trying to manipulate alerts with absent() tends to behave badly.
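As a sketch of the layering (rule names are placeholders, and check the scrape_job label name against your own Prometheus's /metrics):

- alert: JenkinsUnhealthy
  expr: jenkins_up == 0
- alert: JenkinsScrapeFailing
  expr: up{job="jenkins"} == 0
- alert: JenkinsTargetsMissing
  expr: prometheus_target_scrape_pool_targets{scrape_job="jenkins"} == 0

Each layer catches a different failure mode: the application reporting unhealthy, the scrape failing, and the scrape pool having no targets at all.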


Brian Candler

Oct 4, 2021, 10:07:04 AM
to Prometheus Users
On Sunday, 3 October 2021 at 22:50:59 UTC+1 sup...@gmail.com wrote:
Trying to manipulate alerts with absent() tends to behave badly.

Aside: I found it a bit surprising at first that count() and sum() across an empty instant vector give an empty result, rather than 0.  I don't see that behaviour explicitly called out in the documentation, but I guess it makes sense when you think about what "count by", "sum by" or "count_values" would have to do when given no input.

You can of course make it work the other way if required: e.g. "count(foo) or vector(0)"
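For example, if you wanted an expression that is explicitly zero when a job has no targets at all (rather than returning nothing), something like this would do it:

(count(up{job="jenkins"}) or vector(0)) == 0

which, unlike a bare count(), actually produces a result you can alert on when the job is absent.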

Dan Schanler

Oct 4, 2021, 10:29:14 AM
to Brian Candler, Prometheus Users
Thanks Brian for the advice! I found the `absent() and absent()` approach works well.

Also Ben - thank you - I did take your advice as well re: making multiple layers of alerts, and didn't know about prometheus_target_scrape_pool_targets, which could be useful in other ways as well.

Appreciate it!

Dan



Brian Candler

Oct 4, 2021, 12:02:18 PM
to Prometheus Users
count() returns no labels, and it also returns no timeseries when it has no input (rather than a timeseries with value zero, which I had naïvely expected).  So this is simpler again:

absent(jenkins_up) and count(up{job="jenkins"})
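i.e. (illustrative values):

count(up{job="jenkins"})   =>  {} 3           in a region with a jenkins scrape job of 3 targets
count(up{job="jenkins"})   =>  (no result)    in a region with no such job

so the "and" only lets the absent() side through where the jenkins job actually exists.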

Dan Schanler

Oct 4, 2021, 2:25:01 PM
to Brian Candler, Prometheus Users
Brian, even better - great.

Now that you mentioned how count() returns no labels - it relates to another alert rule I was trying to implement.
If I wanted to alert any time a counter is incremented (and have it self-resolve after x time), this seems to do it:

count(exception_total) - count(exception_total offset 1h) 
{}   0

The above returns a zero value when it has been incremented, but no labels or other useful results otherwise. This other query I happened upon, however, returns labels, and I don't understand why:

exception_total unless exception_total offset 1h

exception_total{pod="x"} 1
exception_total{pod="y"} 1
exception_total{pod="z"} 1



--
Dan



Brian Candler

Oct 4, 2021, 2:34:42 PM
to Prometheus Users
On Monday, 4 October 2021 at 19:25:01 UTC+1 Dan S wrote:
Brian, even better - great.

Now that you mentioned how count() returns no labels - it relates to another alert rule I was trying to implement.
If I wanted to alert any time a counter is incremented (and have it self-resolve after x time), this seems to do it:

count(exception_total) - count(exception_total offset 1h) 
{}   0

At first glance, that expression will always alert, so you'll want to wrap it in (....) > 0
 
But are you sure you want "count" there?  It implies that you will get multiple *timeseries* for exception_total.  If it's a single metric, then you want

(metric_total - metric_total offset 1h) > 0
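As a rule that could look roughly like this (name and annotation are placeholders; it self-resolves once the increase falls outside the 1h offset window):

- alert: CounterIncreased
  expr: (metric_total - metric_total offset 1h) > 0
  annotations:
    summary: "metric_total has increased in the last hour"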

 
The above returns a zero value when it has been incremented, but no labels or other useful results otherwise. This other query I happened upon, however, returns labels, and I don't understand why:

exception_total unless exception_total offset 1h

exception_total{pod="x"} 1
exception_total{pod="y"} 1
exception_total{pod="z"} 1


Compare these two expressions separately:

(A) exception_total

(B) exception_total offset 1h

You'll only get a result if (A) has a timeseries but (B) has no corresponding timeseries (meaning with exactly the same labels).
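For example (values purely illustrative):

exception_total{pod="x"}             =>  1             (series first appeared 20 minutes ago)
exception_total{pod="x"} offset 1h   =>  (no result)

so pod="x" survives the "unless", whereas a pod whose series already existed an hour ago is dropped from the result.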

Dan Schanler

Oct 5, 2021, 1:20:40 AM
to Brian Candler, Prometheus Users
Thanks! Very much appreciated
