how to make sure a metric that is to be checked is "there"

Christoph Anton Mitterer

Apr 24, 2023, 10:27:21 PM
to Prometheus Users
Hey there.

What I'm trying to do is basically replace Icinga with Prometheus (or
rather, not fully replace it, but integrate it into the latter, which I
need anyway for other purposes).

So I'll have e.g. some metric that shows me the RAID status on
instances, and I want to get an alert when an HDD is broken.


I guess it's obvious that it could turn out bad if I don't get an
alert, just because the metric data isn't there (for some reason).


In Icinga, this would have been simple:
The system knows about every host and every service it needs to check.
If there's no result (like RAID is OK or FAILED) anymore (e.g. because
the RAID CLI tool is not installed), the check's status would at least
go into UNKNOWN.



I wonder how this is / can be handled in Prometheus?


I mean I can of course check e.g.
expr: up == 0
in some alert.
But AFAIU this actually just tells me whether there are any scrape
targets that couldn't be scraped (in the last run, based on the scrape
interval), right?

If my important checks were all their own exporters, e.g. one exporter
just for the RAID status, then - AFAIU - this would already work and
notify me for sure, even if there's no result at all.

But what if it's part of some larger exporter, like e.g. the mdadm data
in node exporter?

up wouldn't become 0 just because node_md_disks is not part of
the metrics.


Even if I'd say it's the duty of the exporter to make sure that there
is a result even on failure to read the status... what if, e.g., some
tool is already needed just to determine whether that metric makes sense
to be collected at all?
That would be typical for most hardware RAID controllers... you need
the respective RAID tool just to see whether any RAIDs are present.


So in principle I'd like a simple way to check, for a certain group of
hosts, whether a certain time series is available, so that I can set
up e.g. an alert that fires if any node where I have e.g. some MegaCLI-based
RAID lacks megacli_some_metric.

Or is there some other/better way this is done in practice?


Thanks,
Chris.

Brian Candler

Apr 25, 2023, 3:32:25 AM
to Prometheus Users
I think you would have basically the same problem with Icinga unless you have configured Icinga with a list of RAID controllers which should be present on a given device, or a list of drives which should be present in a particular RAID array.

> I mean I can of course check e.g.
> expr: up == 0
> in some alert.
> But AFAIU this actually just tells me whether there are any scrape
> targets that couldn't be scraped (in the last run, based on the scrape
> interval), right?

There is a separate 'up' metric for each individual target that is being scraped, so it's not just "any" target that failed - you can see exactly *which* target(s) failed.

If the exporter on a particular target were to go wrong internally, it should return an error (like a 500 HTTP response) which would cause its corresponding 'up' metric to go to 0.

I'm not sure if you realise this, but the expression "up == 0" is not a boolean, it's a filter.  The metric "up" has many different timeseries, each with a different label set, and each with a value.  The PromQL expression "up" returns all of those timeseries.  The expression "up == 0" filters it down to a subset: just those timeseries where the value is 0.  Hence this expression could return 0, 1 or more timeseries.  When used as an alerting expression, the alert triggers if the expression returns one or more timeseries (and regardless of the *value* of those timeseries).  When you understand this, then using PromQL for alerting makes much more sense.
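
For illustration, a minimal alerting rule built on exactly that semantics (the alert name, 'for' duration and annotation are just examples, nothing prescribed):

    - alert: TargetDown
      expr: up == 0            # one alert per target whose last scrape failed
      for: 5m                  # only fire if it stays down for 5 minutes
      annotations:
        summary: "{{ $labels.instance }} (job {{ $labels.job }}) could not be scraped"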

However, if the RAID controller card were to simply vanish, then yes the corresponding metrics would vanish - similarly if a drive were to vanish from an array, its status would vanish.

You can create alert expressions which check for a specific sentinel metric being present with absent(...), and you can do things like joining with the 'up' metric, so you can say "if any target is being scraped, then alert me if that target doesn't return metric X".  It *is* a bit trickier to understand than a simple alerting condition, but it can be done.
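
For example (my_metric is just a placeholder; both forms come up again further down the thread):

    # alert on one specific series being missing:
    expr: absent(my_metric{instance="somehost", job="node"})

    # or, for every target of a job that is being scraped, alert if it doesn't return my_metric:
    expr: up{job="node"} == 1 unless on(instance) my_metric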


As for drives vanishing from an array, you can write expressions using count() to check the number of drives.  If you have lots of machines and don't want separate rules per controller, then it's possible to use another timeseries as a threshold, although this is a bit more complex. For example:
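
(the metric names here are hypothetical; the expected count per host could e.g. be published via the textfile collector or generated by config management)

    expr: count by (instance) (my_raid_drive_status) < on(instance) my_raid_expected_drives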

But personally I would go really simple, and just create an alert whenever the count *changes*.  You can do this using something as simple as:

   expr: foo != foo offset 5m

(this compares the value of foo now, with the value of foo 5 minutes ago). Similarly, you can alert when any given metric vanishes:

    expr: foo offset 5m unless foo

Those sorts of simple alerts have great value.
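
Applied to the node_md_disks example from earlier in the thread, that could be as simple as (a sketch, assuming node_exporter's mdadm collector is enabled):

    expr: node_md_disks != node_md_disks offset 5m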

Do you have some specifics about what types of RAID you want to monitor?  I've done this for mdraid (using node_exporter) and for MegaRAID, using smartmon.py/sh from https://github.com/prometheus-community/node-exporter-textfile-collector-scripts

If using textfile collector scripts, there is a timestamp metric you can use to check when your script last wrote the file (node_textfile_mtime_seconds), which means it's easy to create an alert to check if your script hasn't run recently.
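
A corresponding alert expression could be (the one-hour threshold is just an example; pick something comfortably larger than the interval your script runs at):

    expr: time() - node_textfile_mtime_seconds > 3600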

This was all running Prometheus completely standalone though.  If you want to feed existing Icinga checks into Prometheus, or Prometheus metrics into Icinga, that's a different matter.

HTH,

Brian.

Christoph Anton Mitterer

Apr 25, 2023, 9:29:03 PM
to Prometheus Users
On Tuesday, April 25, 2023 at 9:32:25 AM UTC+2 Brian Candler wrote:
I think you would have basically the same problem with Icinga unless you have configured Icinga with a list of RAID controllers which should be present on a given device, or a list of drives which should be present in a particular RAID array.

Well true, you still depend on the RAID tool to actually detect the controller and any RAIDs managed by that.

But Icinga would likely catch most real-world issues that may happen by accident:
- RAID tool not installed
- wrong parameters used when invoking the tool (e.g. a new version that might have changed command names)
- permission issues (like the tool not being run as root, or broken sudo rules)
 

I'm not sure if you realise this, but the expression "up == 0" is not a boolean, it's a filter.  The metric "up" has many different timeseries, each with a different label set, and each with a value.  The PromQL expression "up" returns all of those timeseries.  The expression "up == 0" filters it down to a subset: just those timeseries where the value is 0.  Hence this expression could return 0, 1 or more timeseries.  When used as an alerting expression, the alert triggers if the expression returns one or more timeseries (and regardless of the *value* of those timeseries).  When you understand this, then using PromQL for alerting makes much more sense.

Well I think that's clear... I have one (scalar) value in up for each target I scrape, e.g. if I have just node exporter running, I'd get one (scalar) value for the scraped node exporter of every instance.

But the problem is that this does not necessarily tell me if e.g. my raid status result was contained in that scraped data, does it?

It depends on the exporter... if I had a separate exporter just for the RAID metrics, then I'd be fine. But if it's part of a larger one, like node exporter, it would depend on whether that errors out just because the RAID data couldn't be determined. And I guess most exporters would per default just work fine if e.g. there were simply no RAID tools installed (which does make sense in a way).

But it would also mean that I wouldn't notice the error if e.g. I forgot to install the tool.
In Icinga I'd notice this, because I have the configured check per host. If that runs and doesn't find e.g. MegaCli... it would error out.

Prometheus OTOH knows just about the target (i.e. the host) and the exporter (e.g. node)... so it cannot really tell "ah... the RAID tool is missing"... unless node exporter had an option that would tell it to insist on RAID tool xyz being executed and fail otherwise.
That's basically what I'd like to do manually.


However, if the RAID controller card were to simply vanish, then yes the corresponding metrics would vanish - similarly if a drive were to vanish from an array, its status would vanish.

Well but that would usually also be unnoticed in the Icinga setup...  but it's also something that I think never really happens - and if it does one probably sees other errors like broken filesystems.

 
You can create alert expressions which check for a specific sentinel metric being present with absent(...), and you can do things like joining with the 'up' metric, so you can say "if any target is being scraped, then alert me if that target doesn't return metric X".  It *is* a bit trickier to understand than a simple alerting condition, but it can be done.

I guess that sounds like what I'd like to do. Thanks for the below pointers :-)

expr: up{job="myjob"} == 1 unless my_metric

So my_metric would return "something" as soon as it was contained (in the most recent scrape!)... and if it wasn't, the up{job="myjob"} == 1 part would silence the "extra" error in case the target is NOT up anyway.

So in that case one should always do both:
- in general, check for any targets/jobs that are not up
- specifically (for e.g. very important metrics), additionally check for the specific metric.
Right?

In general, when I get the value of some time series like node_cpu_seconds_total... when that is missing for e.g. one instance, I would get nothing, right? I.e. there is no special value; the vector of scalars just has one element less. But if I do get a value, it's for sure the one from the most recent scrape?!
 
Is the check with absent() also needed when I have all my targets/jobs statically configured? I guess not, because Prometheus should know about them and reflect it in `up` if any of them couldn't be scraped, right?

 
As for drives vanishing from an array, you can write expressions using count() to check the number of drives.  If you have lots of machines and don't want separate rules per controller, then it's possible to use another timeseries as a threshold, although this is a bit more complex.

Thanks, but I guess that scenario (a RAID volume suddenly vanishing) is too unlikely anyway for me to bother with. And if it happens... many other bells and whistles would go off.


But personally I would go really simple, and just create an alert whenever the count *changes*.  You can do this using something as simple as:
   expr: foo != foo offset 5m

That's however a really good idea... and quite simple (AFAIU it should work like that out of the box for all possible instances, right?).
But that would also fire once at initialisation, and when it then really fires... it would silence again after another 5 min (unless the count changes again), right?

 
(this compares the value of foo now, with the value of foo 5 minutes ago). Similarly, you can alert when any given metric vanishes:

    expr: foo offset 5m unless foo

But same here as above, right? It would no longer fire after another 5m?

 
Do you have some specifics about what types of RAID you want to monitor?

I'll have a few MegaRAID-based controllers and, other than that, mostly HP Smart Storage Admin CLI (ssacli)... I haven't really looked for any exporters for these yet.

  I've done this for mdraid (using node_exporter) and for MegaRAID, using smartmon.py/sh from https://github.com/prometheus-community/node-exporter-textfile-collector-scripts

How does that work via smartmon?

 
If using textfile collector scripts, there is a timestamp metric you can use to check when your script last wrote the file (node_textfile_mtime_seconds), which means it's easy to create an alert to check if your script hasn't run recently.

Okay... guess I'll have to look into that first... never did that so far (I mean using textfile collector scripts).
 
This was all running Prometheus completely standalone though.  If you want to feed existing Icinga checks into Prometheus, or Prometheus metrics into Icinga, that's a different matter.

For Icinga/Nagios and RAID most people seem to use check_raid [0], which seemed a bit unmaintained for some years, though it got a few commits recently.

But in principle I'd like to use Prometheus standalone to keep the maintenance effort as low as possible.... so if I can do it without any of that, I'd be happy.
OTOH, I would rather want to avoid writing my own exporters just for some RAID checks (=metrics).


Thanks :-)
Chris.

Brian Candler

Apr 26, 2023, 3:35:32 AM
to Prometheus Users
> expr: up{job="myjob"} == 1 unless my_metric

Beware that it will only work if the labels on both 'up' and 'my_metric' match exactly.  If they don't, then you can either use on(...) to specify the set of labels which must match, or ignoring(...) to specify the ones which don't.

You could start with:

expr: up{job="myjob"} == 1 unless on (instance) my_metric

but I believe this will break if there are multiple instances of my_metric for the same host. I'd probably do:

expr: up{job="myjob"} == 1 unless on (instance) count by (instance) (my_metric)

> So my_metric would return "something" as soon as it was contained (in the most recent scrape!)... and if it wasn't, up{job="myjob"} == 1 would silence the "extra" error, in case it is NOT up anyway.

Yes, if up == 0 (i.e. the target is down) then you don't want an additional alert saying the metric is missing, as obviously it will be.

> So in that case one should do always both:
> - in general, check for any targets/jobs that are not up
> - in specific (for e.g. very important metrics), additionally check for the specific metric.
>  Right?

Yes, if there's any chance that the metric could be missing in a "good" scrape. This is rarely the case.

You mention MegaCLI: if you're using the node_exporter textfile collector scripts to collect information on the RAID card, then you can use the timestamp metric I mentioned before to check that the script has run recently.  If you forgot to install the script, then sure you won't get any metrics.  If you want to alert on this specific bad setup, then obviously you'll need a list of targets which *should* have MegaRAID metrics - in which case, you might just use this list with your configuration management system (e.g. ansible or whatever).
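
For example, if your configuration management applies a target label such as raid="megaraid" to those hosts (the label and the metric name here are made-up examples), a single rule then covers all of them:

    expr: up{job="node", raid="megaraid"} == 1 unless on(instance) count by (instance) (megacli_some_metric)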

> In general, when I get the value of some time series like node_cpu_seconds_total ... when that is missing for e.g. one instance I would get nothing, right? I.e. there is no special value, just the vector of scalar has one element less. 

Again, I'd consider it unlikely that a successful scrape from node_exporter would silently drop node_cpu_seconds_total metrics.

If you're talking about the instance vector across all targets, i.e. the PromQL expression "node_cpu_seconds_total", then yes the vector will include all known values.

> But if I do get a value, it's for sure the one from the most recent scrape?!

Yes. Google "prometheus staleness handling".  Basically when you evaluate an instant vector query it's done at some time T (by default "now"), and in the TSDB it looks for the most recent value of the metric, looking back up to 5 minutes (default). Also, if a scrape does not contain a particular timeseries, but the previous scrape *did* contain that timeseries, then the timeseries is marked "stale" by storing a staleness marker.

So if you do see a value, it means:
- it was in the last scrape
- it was in the last 5 minutes
- there has not been a subsequent scrape where the timeseries was missing

> Is this with absent() also needed when I have all my targets/jobs statically configured?

Use absent() when you need to write an expression which you can't do as a join against another existing timeseries.

>    expr: foo != foo offset 5m

> That's however a really good idea... and quite simple (AFAIU it should work like that out of the box for all possible instances, right?).
> But that would also fire once at initialisation, and when it then really fires... it would silence again after another 5 min (unless the count changes again), right?

Almost. It won't fire at initialisation, because foo != bar will give no results unless foo and bar both exist.

If you want to fire when foo exists now but did not exist 5 minutes ago (i.e. alert whenever a new metric is created), then:

expr: foo unless foo offset 5m

And yes, it will silence after 5 minutes. You don't want to send recovery messages on such alerts. (Personally I don't send recovery messages for *any* alerts, but that's a different story: https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped )

> How does that work via smartmon?

Sorry, that was my brainfart. It's "storcli.py" that you want.  (Although collecting smartmon info is a good idea too).

> OTOH, I would rather want to avoid writing my own exporters just for some RAID checks (=metrics).

Hopefully, scripting with node_exporter textfile collector will do the job easily enough.

Christoph Anton Mitterer

Apr 27, 2023, 10:41:19 PM
to Prometheus Users
Hey again.

On Wednesday, April 26, 2023 at 9:35:32 AM UTC+2 Brian Candler wrote:
> expr: up{job="myjob"} == 1 unless my_metric

Beware with that, that it will only work if the labels on both 'up' and 'my_metric' match exactly.  If they don't, then you can either use on(...) to specify the set of labels which match, or ignoring(...) to specify the ones which don't.

You could start with:

expr: up{job="myjob"} == 1 unless on (instance) my_metric

Ah. I see.
I guess one should use on(...) rather than ignoring(...) because one doesn't really know which labels may get added, right?

Also, wouldn't it be better to also consider the "job" label?
   expr: up{job="myjob"} == 1 unless on (instance, job) my_metric
because AFAIU, job is set by Prometheus itself, so if I operate on it as well, I can make sure that my_metric is really from the desired job - and not perhaps from some other job that wrongly exports a metric of that name.
Does that make sense?

 
but I believe this will break if there are multiple instances of my_metric for the same host. I'd probably do:

expr: up{job="myjob"} == 1 unless on (instance) count by (instance) (my_metric)

So with job that would be:
   expr: up{job="myjob"} == 1 unless on (instance,job) count by (instance,job) (my_metric)
 
but I don't quite understand why it's needed in the first place?!

If I do the previous:
  expr: up{job="myjob"} == 1 unless on (instance) my_metric
then even if for one given instance value (and optionally one given job value) there are multiple results for my_metric (just differing in other labels), like:
   node_filesystem_free_bytes{device="/dev/vda1",fstype="vfat",mountpoint="/boot/efi"} 5.34147072e+08
   node_filesystem_free_bytes{device="/dev/vda2",fstype="btrfs",mountpoint="/"} 1.2846592e+10
   node_filesystem_free_bytes{device="/dev/vda2",fstype="btrfs",mountpoint="/data/btrfs-top-level-subvolumes/system"} 1.2846592e+10
(all with the same instance/job)

shouldn't the "unless on (instance)" still work? I mean it wouldn't notice if only one time series were gone (like e.g. only device="/dev/vda1" above), but it should if all were gone?
But the count by would also only notice it if all were gone, because only then does it give back no data for the respective instance (rather than just 0 as the value)?


Also, if a scrape does not contain a particular timeseries, but the previous scrape *did* contain that timeseries, then the timeseries is marked "stale" by storing a staleness marker.

 Is there a way to test for that marker in expressions?

 
So if you do see a value, it means:
- it was in the last scrape
- it was in the last 5 minutes
- there has not been a subsequent scrape where the timeseries was missing

Ah, good to know.
 

> Is this with absent() also needed when I have all my targets/jobs statically configured?

Use absent() when you need to write an expression which you can't do as a join against another existing timeseries.

Okay, ... but AFAIU I couldn't use absent() to reproduce the effect of the above:
   up{job="myjob"} == 1 unless on (instance) my_metric
because if I'd do something like:
   absent(my_metric)
it would be empty as soon as there was at least one time series for the metric.
With that I could really only check for a specific time series to be missing like:
   absent(my_metric{instance="somehost",job="node"})
and would have to make one alert with a different expression for e.g. every instance.

Or is there any way to use absent() for the general case which I just don't see?


If you want to fire when foo exists not but did not exist 5 minutes ago (i.e. alert whenever a new metric is created), then

expr: foo unless foo offset 5m

No, I think I'd only want alerts if something vanishes.
 

And yes, it will silence after 5 minutes. You don't want to send recovery messages on such alerts.

Sounds reasonable.

I wonder whether the expression is ideal:
The above form would already fire even if the value was missing just once, exactly 5m ago.
Wouldn't it be better to do something like:
   expr: foo offset 15s unless foo
   for: 5m
assuming scrape interval of 15s?

With offset I cannot just specify the "previous" sample, right?

Is it somehow possible to do the above automatically for all metrics (and not just foo) from one expression?
And I guess one would again need to link that somehow with `up` to avoid useless errors?

 
> How does that work via smartmon?
Sorry, that was my brainfart. It's "storcli.py" that you want.  (Although collecting smartmon info is a good idea too).

Ah... I even saw that too, but had totally forgotten that they've renamed megacli.


Is there a list of some generally useful alerts, things like:
   up == 0
or like the above idea of checking for metrics that have vanished? Ideally with how to use them properly ;-)


 

Thanks,
Chris.

Brian Candler

Apr 28, 2023, 3:01:54 AM
to Prometheus Users
On Friday, 28 April 2023 at 03:41:19 UTC+1 Christoph Anton Mitterer wrote:
You could start with:

expr: up{job="myjob"} == 1 unless on (instance) my_metric

Ah. I see.
I guess one should use on(...) rather than ignoring(...) because one doesn't really know which labels may get added, right?

It's a matter of taste. I like to keep things simple, and to keep the rules for different metrics similar to each other. You know that 'up' has only job and instance labels plus any target labels from service discovery; my_metric will have these plus some others which vary between metrics.
 

Also, wouldn't it be better to also consider the "job" label?
   expr: up{job="myjob"} == 1 unless on (instance, job) my_metric

Again a matter of taste, but typically not needed: you already filtered to a single job="myjob" on the LHS, and it would be very unusual for the same "mymetric" to be received in scrapes from two different jobs (which usually means two different exporters) for the same host.

In fact by that logic you might as well simplify it to

expr: up == 1 unless on (instance) my_metric
 

but I believe this will break if there are multiple instances of my_metric for the same host. I'd probably do:

expr: up{job="myjob"} == 1 unless on (instance) count by (instance) (my_metric)

So with job that would be:
   expr: up{job="myjob"} == 1 unless on (instance,job) count by (instance,job) (my_metric)

That should be fine.
 
 but I don't quite understand why it's needed in the first place?!

It turns out you're right.

Prometheus provides a web interface where you can test all these expressions: even alerting rules are just expressions, which alert if the instant vector result is non-empty.

Suppose you wanted to alert if a node *isn't* returning node_filesystem_avail_bytes.  You can test it like this:

    up{job="node"} == 1 unless on(instance,job) node_filesystem_avail_bytes

and you were right, "unless" doesn't care if there are multiple matches on the right-hand side.

But suppose you wanted to do arithmetic between the metrics (note extra parentheses required):

    (up{job="node"} == 1) * on(instance,job) node_filesystem_avail_bytes

This will give you an error, because each instance/job combination on the LHS matches multiple filesystems on the RHS. The correct syntax for that is:

    (up{job="node"} == 1) * on(instance,job) group_right() node_filesystem_avail_bytes

which gives N results for each 1:N combination.

This particular example is pointless because the LHS is always 1, so you're just multiplying by 1. But even with a static metric like that there are cases where you want labels from the LHS to be added to the result, which you can do by listing those labels inside the group_right() clause.
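
A common concrete use of that label-copying is joining against an "info" metric. For instance (a sketch; node_uname_info is exported by node_exporter, and here group_left is used because the "many" side is on the left):

    node_filesystem_avail_bytes * on(instance) group_left(nodename) node_uname_info

This copies the nodename label from node_uname_info onto every filesystem series, while multiplying by 1 leaves the values unchanged.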


 

Also, if a scrape does not contain a particular timeseries, but the previous scrape *did* contain that timeseries, then the timeseries is marked "stale" by storing a staleness marker.

 Is there a way to test for that marker in expressions?

Not usefully, and I don't see why you'd want to. Internally it's stored as a special flavour of NaN, but in queries it just looks like the timeseries has disappeared - which indeed it has. It stops you looking back in time to the previous real scrape value.
 
 Use absent() when you need to write an expression which you can't do as a join against another existing timeseries.

Okay, ... but AFAIU I couldn't use absent() to reproduce the effect of the above:
   up{job="myjob"} == 1 unless on (instance) my_metric
because if I'd do something like:
   absent(my_metric)
it would be empty as soon as there was at least one time series for the metric.
With that I could really only check for a specific time series to be missing like:
   absent(my_metric{instance="somehost",job="node"})
and would have to make one alert with a different expression for e.g. every instance.

That's exactly what I mean. If you want to look for the absence of a *specific* timeseries, you can do it that way. But it's very rare that I've had to do that. If you do it with a join on 'up' then it will work for multiple similar timeseries.

You would use absent() if you wanted to test for complete absence of up{job="myjob"}, for example - which would mean that service discovery for that job had returned zero targets.
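
For example:

    expr: absent(up{job="myjob"})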
 
Wouldn't it be better to do something like:
   expr: foo offset 15s unless foo
   for: 5m
assuming scrape interval of 15s?

Well yes, in reality that might be better; but remember that if a few scrapes fail, "foo" will still be present in query results for 5 minutes anyway, because of the lookback - so the original expression I gave is not as fragile as you might think.
 
 With offset I cannot just specify the "previous" sample, right?

Again, why would you want to?

As I said before, the value of a timeseries at time T is *defined* to be the most recent value of the timeseries at or before time T, up to 5 minutes previously; so if a few scrapes fail, then the value is defined to persist at the previous value for 5 minutes. This is a good thing: it helps make the whole ecosystem more robust. You don't want

    expr: foo offset 5m unless foo

to trigger if a single scrape fails. But if the previous scrape was successful and did not include that metric, then it will immediately vanish, so an expression like the above will trigger immediately.
 
At this point, I think it's best if we leave the discussion here, as it's all getting rather theoretical - you clearly have plenty of clue to get running with all this, and if you have a specific problem then you can raise it here.

Regards,

Brian.