collect non-metrics data


Christoph Anton Mitterer

Feb 11, 2023, 5:02:29 AM
to Prometheus Users
Hey.

I wondered whether the following is possible with Prometheus. I'm basically thinking about phasing out Icinga and doing all alerting in Prometheus.

For checks that are clearly metrics based (like load or free disk space) this seems rather easy.

But what about any checks that are not really based on metrics?
Like e.g. check_raid, which gives an error if any RAID has lost a disk or similar.

Of course one could always just try to make a metric out of it - in the example above, one could use e.g. the number of inconsistent RAIDs as the metric.

But what one actually wants from such checks is additional (typically purely textual) information, like in the above example which HDD (enclosure, bay number,... or the serial number) has failed.
Also I have numerous other checks which test for things that are not really related to a number but whose output is a string.

Is there any (good) way to get that done with Prometheus, or is it simply not meant for that specific use case?

Thanks,
Chris.

Ben Kochie

Feb 11, 2023, 5:18:44 AM
to Christoph Anton Mitterer, Prometheus Users
Typically those values are exposed as booleans/states.

For example, mdadm collector in the node_exporter has metrics like this:

# HELP node_md_state Indicates the state of md-device.
# TYPE node_md_state gauge
node_md_state{device="md0",state="active"} 1
node_md_state{device="md0",state="check"} 0
node_md_state{device="md0",state="inactive"} 0
node_md_state{device="md0",state="recovering"} 0
node_md_state{device="md0",state="resync"} 0


You combine this with an "info" metric that tells you about the rest of the device.

For example, there is `node_os_info`, which reads from the os-release data.

# HELP node_os_info A metric with a constant '1' value labeled by build_id, id, id_like, image_id, image_version, name, pretty_name, variant, variant_id, version, version_codename, version_id.
# TYPE node_os_info gauge
node_os_info{build_id="",id="ubuntu",id_like="debian",image_id="",image_version="",name="Ubuntu",pretty_name="Ubuntu 20.04.2 LTS",variant="",variant_id="",version="20.04.2 LTS (Focal Fossa)",version_codename="focal",version_id="20.04"} 1


PromQL allows you to do joins, kinda like SQL, in order to match this information onto an alert.
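
As a rough illustration of what such a join does, here is a minimal Python sketch that emulates an expression along the lines of `(node_md_state{state="inactive"} == 1) * on (instance) group_left (pretty_name) node_os_info`. Both that expression and the `instance` label (which Prometheus attaches at scrape time and which is not shown in the samples above) are assumptions for illustration, and `join_info` is a hypothetical helper, not part of any Prometheus library. It only models the label handling, not the full PromQL matching rules.

```python
from typing import Dict, Iterable, List, Tuple

Labels = Dict[str, str]
Sample = Tuple[Labels, float]


def join_info(
    left: Iterable[Sample],
    info: Iterable[Sample],
    on: List[str],
    group_left: List[str],
) -> List[Sample]:
    """Simplified emulation of `left * on(<on>) group_left(<group_left>) info`."""
    # Index the info series by the matching labels (the "one" side of the join).
    index: Dict[Tuple[str, ...], Sample] = {}
    for labels, value in info:
        key = tuple(labels.get(name, "") for name in on)
        if key in index:
            raise ValueError(f"duplicate info series for {key}")
        index[key] = (labels, value)

    result: List[Sample] = []
    for labels, value in left:
        key = tuple(labels.get(name, "") for name in on)
        if key not in index:
            continue  # no matching info series: the sample is dropped
        info_labels, info_value = index[key]
        merged = dict(labels)
        # Copy only the requested labels over from the info series.
        for name in group_left:
            if name in info_labels:
                merged[name] = info_labels[name]
        # The info value is the constant 1, so the product keeps the left value.
        result.append((merged, value * info_value))
    return result


md_state = [({"instance": "host1", "device": "md0", "state": "inactive"}, 1.0)]
os_info = [({"instance": "host1", "pretty_name": "Ubuntu 20.04.2 LTS"}, 1.0)]
for labels, value in join_info(md_state, os_info, ["instance"], ["pretty_name"]):
    print(labels, value)
```

The resulting sample carries the labels of the state metric plus `pretty_name`, which is what makes the extra text available to an alert.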


Christoph Anton Mitterer

Feb 13, 2023, 6:05:20 PM
to Prometheus Users
Hey Ben.

On Saturday, February 11, 2023 at 11:18:44 AM UTC+1 Ben Kochie wrote:

You combine this with an "info" metric that tells you about the rest of the device.

Ah,... and I assume that one could just also export these info metrics alongside e.g. node_md_state?

Thanks :-)
Chris.

Brian Candler

Feb 16, 2023, 1:33:28 PM
to Prometheus Users
On Saturday, 11 February 2023 at 10:02:29 UTC Christoph Anton Mitterer wrote:
But what one actually wants from such checks is additional (typically purely textual) information, like in the above example which HDD (enclosure, bay number,... or the serial number) has failed.

Typically you'd have one metric for each unit (e.g. physical disk) with its status, something along the lines of:

node_pd_failed{device="/dev/hda",serial="ABC123"} 0
node_pd_failed{device="/dev/hdb",serial="DEF456"} 1

You can get some realistic examples from node-exporter textfile collector example scripts: smartmon.py (for SMART stats) and storcli.py (for megaraid)
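
For a feel of what such a script boils down to, here is a minimal Python sketch that renders per-disk status lines in the text exposition format and writes them to a file for the textfile collector. It is not taken from smartmon.py or storcli.py. The metric name `node_pd_failed` is the illustrative one from above, and the helper names and the input structure are assumptions; how the disk status is actually obtained is left out.

```python
import os
import tempfile
from typing import Dict, List


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_disk_metrics(disks: List[Dict[str, object]]) -> str:
    """Render one node_pd_failed sample per disk (hypothetical metric name)."""
    lines = [
        "# HELP node_pd_failed Whether the physical disk is reported as failed (1) or not (0).",
        "# TYPE node_pd_failed gauge",
    ]
    for disk in disks:
        labels = 'device="{}",serial="{}"'.format(
            escape_label_value(str(disk["device"])),
            escape_label_value(str(disk["serial"])),
        )
        lines.append("node_pd_failed{%s} %d" % (labels, 1 if disk["failed"] else 0))
    return "\n".join(lines) + "\n"


def write_atomically(path: str, content: str) -> None:
    """Write to a temp file in the same directory, then rename it into place,
    so the collector never reads a half-written file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file readable by the owner only
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

A cron job could then call something like `write_atomically("/path/to/textfile-dir/pd.prom", render_disk_metrics(disks))`, where the directory is whatever node_exporter's `--collector.textfile.directory` points to (the path here is a placeholder).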
 
Also I have numerous other checks which test for things that are not really related to a number but whose output is a string.

In that case, Prometheus may or may not be a good solution. You can put the string in a label, but every time it changes it will create a new timeseries. If the values are stable, and only change occasionally, it may be good enough. The main problem is that it is difficult to query timeseries when they appear and disappear over time.
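
To make the point about labels concrete: a timeseries is identified by its metric name plus its full set of label values, so a changed string in a label means a different series. The Python sketch below just counts distinct identities across a few scrapes. The metric name `check_output_info` and its labels are made up for illustration.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

SeriesId = Tuple[str, FrozenSet[Tuple[str, str]]]
Scrape = List[Tuple[str, Dict[str, str]]]


def series_id(name: str, labels: Dict[str, str]) -> SeriesId:
    """A series is identified by its metric name and its complete label set."""
    return (name, frozenset(labels.items()))


def count_series(scrapes: List[Scrape]) -> int:
    """Count how many distinct series appear across all scrapes."""
    seen: Set[SeriesId] = set()
    for scrape in scrapes:
        for name, labels in scrape:
            seen.add(series_id(name, labels))
    return len(seen)


# Hypothetical check whose textual output is stored in a label.
scrapes: List[Scrape] = [
    [("check_output_info", {"check": "raid", "output": "OK"})],
    [("check_output_info", {"check": "raid", "output": "OK"})],
    [("check_output_info", {"check": "raid", "output": "degraded: disk 3 missing"})],
]
print(count_series(scrapes))
```

Here the third scrape starts a second series while the first one simply stops receiving samples, which is the appear-and-disappear behaviour that makes querying awkward.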

Depending on your use case, "exemplars" might work for you: these are blobs of data which are associated with a timeseries, and are intended to give *one* detailed example of a piece of information which went into building that metric. For example, if the metric is a count of HTTP request 502 failures, the exemplar might contain the details of the most recent such failure. Exemplars are relatively new and I believe are still hidden behind a feature flag. They are stored in RAM.

Otherwise you might want to look at a log system like Loki.