{{ $labels.instance }} not picking up the instance details

WarneFit

Dec 12, 2021, 5:41:27 AM

to Prometheus Users

Hi all,

I have configured alerting in Prometheus using Alertmanager and PagerDuty, but the alert is not picking up the instance details.

For example, here is the alert config:

- alert: RepairFailed
  expr: (sum(scylla_manager_task_run_total{type=~"repair", status="ERROR"}) or vector(0)) - (sum(scylla_manager_task_run_total{type=~"repair", status="ERROR"} offset 3m) or vector(0)) > 0
  for: 15m
  labels:
    severity: CRITICAL
    environment: "{{ AWS_ENVIRONMENT }}"
    monitoring: RepairFailed
    rackspace: "true"
    hostname: "{{ $labels.instance }}"
  annotations:
    description: 'For {{ $labels.instance }} Repair failed'
    summary: Instance {{ $labels.instance }} Repair task failed

But the alert that gets triggered in PagerDuty does not show the instance details in either the description or the summary.
Could anyone help me with this?

Brian Candler

Dec 12, 2021, 6:10:51 AM

to Prometheus Users

Paste the expression in the PromQL browser in the prometheus web interface. This will show you the results of the expression, including all the labels (switch to graph view to see historical results). If the result of the PromQL expression doesn't have an instance label, then that won't be available to the alert.

A brief look at your expression suggests that you've intentionally got rid of all the labels.

* sum(foo) gives the total value across all timeseries with metric name "foo". The result is a single value with no labels (because the result summarises *all* the timeseries given)

* vector(0) has no labels

If you want instance labels in the result then you're going to have to rewrite your expression. As a starting point,

sum(foo) by (instance)

will give you a vector of results, each of which has a different instance label.
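Applied to the metric from the original alert, that starting point might look like the fragment below. This is only a sketch: it assumes the scylla_manager_task_run_total series actually carry an instance label, which the PromQL browser check described above would confirm. It also drops the "or vector(0)" fallbacks, since vector(0) has no labels to match against.

```yaml
# Sketch only: assumes the series expose an `instance` label.
# Aggregating "by (instance)" keeps that label on both sides of the
# subtraction, so {{ $labels.instance }} has something to expand.
expr: >-
  sum by (instance) (scylla_manager_task_run_total{type=~"repair", status="ERROR"})
  - sum by (instance) (scylla_manager_task_run_total{type=~"repair", status="ERROR"} offset 3m)
  > 0
```

An instance with no sample 3 minutes ago produces no result here, because the subtraction only matches series present on both sides.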

I'm not 100% sure what you're trying to do with the "or vector(0)" stuff, but maybe you want something like this:

expr: increase(blah[3m]) > 0

with the proviso that the resulting value may not be an exact integer - it's the calculated per-second rate, scaled to the time period. Note that the rate window has to include both the first and last data points of the time period you wish to calculate across: so if you're sampling every 1 minute, and you want to calculate the rate using two data points which are 3 minutes apart, then you need blah[4m]. However, the result will also be scaled to give the estimated increase over 4 minutes, even though it only uses 3 minutes' worth of data. I'm afraid this is an ugly corner of Prometheus; more discussion at #3806
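As a rough worked example of that scaling, take the description above at face value and assume (hypothetically) that the counter reads 10 at the first sample in the window and 13 at the last one, 3 minutes later:

```latex
\[
\text{increase}(\text{blah}[4\text{m}])
  \approx (v_{\text{last}} - v_{\text{first}}) \times \frac{\text{window}}{t_{\text{last}} - t_{\text{first}}}
  = (13 - 10) \times \frac{240\,\text{s}}{180\,\text{s}}
  = 4
\]
```

So a counter that really went up by 3 would be reported as roughly 4. The exact figure depends on where the samples fall relative to the window boundaries, so treat this as an approximation.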

However, even increase(blah[3m]) will not work with a single data point, or it won't alarm if the first value in the timeseries is non-zero, because it doesn't know for sure that the counter was previously zero.

Maybe this is closer to what you want:

expr: (blah > 0) unless (blah == blah offset 3m)

However that will give you the value of the counter, not the value of the increase.
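Put together with the original rule, that suggestion could be sketched as follows. The group name is hypothetical, the metric and labels are copied from the original post, and the whole thing assumes the series have an instance label. The environment label is left out for brevity.

```yaml
groups:
  - name: scylla-repair   # hypothetical group name
    rules:
      - alert: RepairFailed
        # No sum() and no vector(0), so every label on the series
        # (including instance, if present) survives into the alert.
        expr: |
          (scylla_manager_task_run_total{type=~"repair", status="ERROR"} > 0)
          unless
          (scylla_manager_task_run_total{type=~"repair", status="ERROR"}
            == scylla_manager_task_run_total{type=~"repair", status="ERROR"} offset 3m)
        # "for:" is omitted in this sketch: after a single increment the
        # expression stays true only for about 3 minutes, so a 15m hold
        # would probably never fire. Adjust to taste.
        labels:
          severity: CRITICAL
          monitoring: RepairFailed
          rackspace: "true"
          hostname: "{{ $labels.instance }}"
        annotations:
          description: 'For {{ $labels.instance }} Repair failed'
          summary: 'Instance {{ $labels.instance }} Repair task failed'
```

Running the expr in the PromQL browser first is still the quickest way to confirm which labels actually come back.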

HTH, Brian.

l.mi...@gmail.com

Dec 12, 2021, 4:21:20 PM

to Prometheus Users

The issue of stripping away labels in queries but then relying on them in alert annotations comes up a lot in the teams I work with. It's just so easy to overlook it.

That's why I've added a check to pint that will try to warn users when that happens, see https://github.com/cloudflare/pint/blob/main/docs/CONFIGURATION.md#template.

It won't work in every case, but when I ran it on our internal repo it found tons of cases, mostly with Grafana links that rely on labels to populate dashboard variables, so it should be useful.
