Null value in alerts


Sebastian Glock

Dec 9, 2022, 2:31:32 AM
to Prometheus Users
Hi,

I'm having trouble setting up an alert that will send a notification when a value is different from 0 and the value is missing (i.e. null).

expression:
windows_mscluster_resourcegroup_state {name!~"Available Storage"} != 0 or on() vector(0)

The alert goes off non-stop. How can I set the metric to send an alert when the value is different from 0 and is null?

I also tried with sum(), but that doesn't work either:
sum(windows_mscluster_resourcegroup_state {name!~"Available Storage"} != 0) or on() vector(0)

Thanks for replies!

sebag...@gmail.com

Dec 9, 2022, 3:49:48 AM
to Matthias Rampke, Prometheus Users

Thanks for the advice,

 

So in this case, do I just need to use absent in the alert, like this?

 

  - alert: Resource group in cluster is down
    expr: absent(windows_mscluster_resourcegroup_state {name!~"Available Storage"}) == 1
    for: 10s
    labels:
      severity: "[Cluster]"
    annotations:
      summary: "Resource group in cluster is down!"
      description: "{{ humanize $value }}"

 

Will this one send a message when the metric is missing?

 

From: Matthias Rampke <matt...@prometheus.io>
Sent: Friday, December 9, 2022 8:57 AM
To: Sebastian Glock <sebag...@gmail.com>
Cc: Prometheus Users <promethe...@googlegroups.com>
Subject: Re: [prometheus-users] Null value in alerts

 

When you say "the value is missing", what condition exactly do you want to alert on?

 

To detect that there is *no* metric matching your selector, you can use the absent(…) function. It returns 1 when … matches nothing.

 

It gets more complicated if you want to detect that a single series has disappeared. In this case, you need to be very specific in telling Prometheus which series *should* exist. Common ways to do this are

 

- listing them all out with separate absent(x) clauses and specific positive matchers

- comparing to a previous time (x offset 15m unless x)

- use some other metric that lets you determine what should be there

- generate recording rules to create such a metric

 

The fundamental challenge here is to distinguish between "this went missing" and "this went away because of expected changes".

 

In general, I prefer splitting "metric indicates there is a problem" and "metric is missing" into two different alerts with separate names and descriptions. To the one investigating, the difference matters. Additionally, using absent() often results in different label sets, because it cannot know the labels of a time series that is absent. This causes trouble with templating, which you sidestep by using separate alert definitions to begin with.

 

/MR

 


Stuart Clark

Dec 9, 2022, 4:00:27 AM
to sebag...@gmail.com, Matthias Rampke, Prometheus Users
On 09/12/2022 08:49, sebag...@gmail.com wrote:

Thanks for advice,

 

So in this case I just need to use absent like this In alert?:

 

  - alert: Resource group in cluster is down

    expr: absent(windows_mscluster_resourcegroup_state {name!~"Available Storage"}) == 1

You aren't listing a metric here as you are using !~. You need to ensure you are only using = in any labels.

-- 
Stuart Clark

Brian Candler

Dec 9, 2022, 4:02:30 AM
to Prometheus Users
On Friday, 9 December 2022 at 07:31:32 UTC sebag...@gmail.com wrote:
expression:
windows_mscluster_resourcegroup_state {name!~"Available Storage"} != 0 or on() vector(0)

The alert goes off non-stop.

Yes, that's correct.

PromQL expressions don't work like normal boolean expressions.  They return the presence or absence of values, not a true or false value.  The presence of *any* value will trigger an alert, and vector(0) generates a value all of the time.

For example, suppose you have 5 timeseries for the metric "node_filesystem_avail_bytes".

The PromQL expression "node_filesystem_avail_bytes" returns an instant vector containing 5 values.

The PromQL expression "node_filesystem_avail_bytes < 10000000" returns an instant vector containing between 0 and 5 values; you have filtered down to just those timeseries whose values are less than the threshold.

If you use this as an alerting expression, then if the instant vector is not empty, i.e. if 1 or more machines have a value less than the threshold, then an alert is generated.
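As a rough sketch of that last point, the filter expression can be dropped straight into an alerting rule. The group name, alert name, "for" duration and the use of a "mountpoint" label are assumptions for illustration, not something taken from a real setup:

```yaml
groups:
  - name: filesystem-example          # hypothetical group name
    rules:
      - alert: FilesystemLowSpace     # hypothetical alert name
        # The comparison filters the instant vector. Each series that
        # survives the filter becomes one alert; an empty result means
        # no alert at all.
        expr: node_filesystem_avail_bytes < 10000000
        for: 5m                       # assumed duration
        annotations:
          # Assumes the series carry "instance" and "mountpoint" labels.
          summary: "Low free space on {{ $labels.instance }} ({{ $labels.mountpoint }})"
          description: "{{ humanize $value }} bytes available"
```

With 5 series and 2 of them below the threshold, this rule would produce 2 alerts, each carrying the labels of the series that matched.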

 
 How can I set the metric to send an alert when the value is different from 0 and is null?

There is no concept of "null" in PromQL.  (Well, you can store a floating point value of "NaN" in a timeseries, but that's not what we're discussing here).

Either a timeseries is present, or it is not.
 
Hence I'm not really sure what you're trying to alert on.  What do your metrics look like?

Let me guess: they look something like this:

windows_mscluster_resourcegroup_state{instance="foo",name="Available Storage"} 123
windows_mscluster_resourcegroup_state{instance="foo",name="Broken Storage"} 0
windows_mscluster_resourcegroup_state{instance="bar",name="Available Storage"} 0 
windows_mscluster_resourcegroup_state{instance="bar",name="Broken Storage"} 4

Now, this alerting expression:

windows_mscluster_resourcegroup_state {name!~"Available Storage"} != 0

will only alert on the last one of these (it filters to labels which are not "Available Storage", and then it filters to values which are not 0, and only the fourth metric shown matches both conditions)

Similarly, "or" works differently to what you might expect.

foo or bar

will return a union of:
- all timeseries with metric name "foo", PLUS:
- all those timeseries with metric name "bar" which *don't* have exactly the same label sets as the timeseries on the LHS (foo)

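Since vector(0) has no labels, but the expression you gave on your LHS has labels, a plain "or vector(0)" would *always* include vector(0) in the result set. With "on()", as you wrote it, the match ignores all labels, so vector(0) is included whenever the LHS is empty. Either way the result is never empty, and therefore the expression will always generate alerts.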

The question is, what sort of "missing" values do you want to look for?

For example, are you trying to alert on instance "baz", which doesn't generate *any* values for windows_mscluster_resourcegroup_state ?  If so, you either need to alert explicitly on this absence, or you need to cross-reference to some other timeseries which refers to "baz" (such a timeseries is often "up").  Otherwise, the PromQL expression for windows_mscluster_resourcegroup_state has no way of knowing that you *expect* a value for baz, but there isn't one.

So one possibility is:

absent(windows_mscluster_resourcegroup_state{instance="baz",name="Available Storage"})

which will alert explicitly if there is no timeseries with that metric name and those particular labels.  But you've hard-coded the existence of a machine called "baz" into your alerting rules.
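Wrapped in a rule, that could look roughly like the sketch below. The group name, alert name and "for" duration are made up, and "baz" is the hypothetical instance from above:

```yaml
groups:
  - name: cluster-absence-example          # hypothetical group name
    rules:
      - alert: ResourceGroupMetricAbsent   # hypothetical alert name
        # absent() returns a single series with value 1 when nothing
        # matches the selector. Labels given with "=" matchers are copied
        # onto that series, so they are available to templates.
        expr: absent(windows_mscluster_resourcegroup_state{instance="baz",name="Available Storage"})
        for: 5m                            # assumed duration
        annotations:
          summary: "No state reported for {{ $labels.name }} on {{ $labels.instance }}"
```

Because only equality matchers are used, the resulting alert carries instance="baz" and name="Available Storage"; with regex or negative matchers those labels would not be present.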

Or are you trying to alert on any node which is being scraped by scrape job "windows_exporter" but is not returning windows_mscluster_resourcegroup_state with a particular label?  The "up" metric tells you whether something is being scraped, so the expression might be along the lines of "... or on (instance) up"
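One way to sketch that cross-reference is shown below. Note that it uses "unless on (instance)" rather than the "or on (instance) up" form hinted at above, and it assumes the job label really is "windows_exporter" and that name="Available Storage" is the label you expect; the group and alert names are invented:

```yaml
groups:
  - name: cluster-cross-reference-example    # hypothetical group name
    rules:
      - alert: ResourceGroupMissingOnTarget  # hypothetical alert name
        # Start from every target that is being scraped successfully, then
        # drop those instances that do return the expected series.
        # What remains are scraped instances without the metric.
        expr: |
          up{job="windows_exporter"} == 1
            unless on (instance)
          windows_mscluster_resourcegroup_state{name="Available Storage"}
        for: 5m                              # assumed duration
        annotations:
          summary: "{{ $labels.instance }} is up but reports no resource group state"
```

The advantage over absent() is that no machine names are hard-coded: the list of expected instances comes from "up".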

If you show the *actual* metrics you are scraping (including the full label sets), and an example of an *actual* condition you are trying to catch, then we can help you write the expression.

For more hints:

Matthias Rampke

Dec 11, 2022, 12:50:24 PM
to Sebastian Glock, Prometheus Users
When you say "the value is missing", what condition exactly do you want to alert on?

To detect that there is *no* metric matching your selector, you can use the absent(…) function. It returns 1 when … matches nothing.

It gets more complicated if you want to detect that a single series has disappeared. In this case, you need to be very specific in telling Prometheus which series *should* exist. Common ways to do this are

- listing them all out with separate absent(x) clauses and specific positive matchers
- comparing to a previous time (x offset 15m unless x)
- use some other metric that lets you determine what should be there
- generate recording rules to create such a metric
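The second option in the list (comparing to a previous time) might look like this as a rule. This is only a sketch: the 15m window comes from the example above, while the group name, alert name and annotation text are assumptions:

```yaml
groups:
  - name: cluster-disappeared-example       # hypothetical group name
    rules:
      - alert: ResourceGroupSeriesVanished  # hypothetical alert name
        # Left side: series that existed 15 minutes ago.
        # "unless" removes every one that still exists now (matched on the
        # full label set), leaving only series that have gone away.
        expr: |
          windows_mscluster_resourcegroup_state offset 15m
            unless
          windows_mscluster_resourcegroup_state
        annotations:
          summary: "{{ $labels.name }} on {{ $labels.instance }} stopped reporting"
```

Unlike absent(), the result keeps the labels of the vanished series. The trade-off is that the alert only stays active for roughly the length of the offset, and it cannot tell an outage from a deliberate removal.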

The fundamental challenge here is to distinguish between "this went missing" and "this went away because of expected changes".

In general, I prefer splitting "metric indicates there is a problem" and "metric is missing" into two different alerts with separate names and descriptions. To the one investigating, the difference matters. Additionally, using absent() often results in different label sets, because it cannot know the labels of a time series that is absent. This causes trouble with templating, which you sidestep by using separate alert definitions to begin with.
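A minimal sketch of such a split, using the metric from this thread. The alert names, "for" durations and annotation texts are assumptions:

```yaml
groups:
  - name: cluster-split-example              # hypothetical group name
    rules:
      # Alert 1: the metric is present and reports a non-zero state.
      - alert: ResourceGroupStateNonZero     # hypothetical alert name
        expr: windows_mscluster_resourcegroup_state{name!~"Available Storage"} != 0
        for: 1m                              # assumed duration
        annotations:
          summary: "Resource group {{ $labels.name }} on {{ $labels.instance }} has state {{ $value }}"

      # Alert 2: the metric is not there at all.
      - alert: ResourceGroupMetricMissing    # hypothetical alert name
        expr: absent(windows_mscluster_resourcegroup_state)
        for: 5m                              # assumed duration
        annotations:
          # absent() without "=" matchers yields no labels, so this
          # template deliberately avoids $labels.
          summary: "No windows_mscluster_resourcegroup_state series found"
```

Each alert gets a template that fits the labels it actually has, and whoever receives the notification can tell from the name which of the two situations occurred.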

/MR



Yashaswini K

Dec 17, 2022, 4:05:20 AM
to Prometheus Users
Hi Team