Best way to export status


Roman Baeriswyl

Jul 17, 2022, 12:26:24 AM
to Prometheus Users
Hey all
I am working on a Dell iDRAC SNMP exporter and I am struggling with how to expose "Status" fields.
I think there are three main possibilities:

1. EnumAsStateSet
The downside here is that it can really clutter the output. For example, the Dell fans have 10 possible statuses, so each fan gets 10 series of which only one is set to "1".

2. EnumAsInfo
The downside here is that the time history is not as clean, and it is probably harder to write alerts.

3. Use the numeric value
The downside here is that you need to do the enum lookup in the alert / dashboard.
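
For comparison, here is roughly what the three options put on the wire for a single fan. This is a minimal Python sketch, not actual snmp_exporter output: the metric name `fan_status`, the `index` and `state` labels, and the shortened three-value enum are all made up for illustration.

```python
# Hypothetical, shortened enum; the real Dell fan enum has 10 values.
STATES = {1: "other", 2: "unknown", 3: "ok"}


def as_state_set(name, index, current):
    """Option 1: one series per possible state; only the current one is 1."""
    return [
        f'{name}{{index="{index}",state="{label}"}} {int(code == current)}'
        for code, label in STATES.items()
    ]


def as_info(name, index, current):
    """Option 2: a single series whose label carries the current state."""
    return [f'{name}_info{{index="{index}",state="{STATES[current]}"}} 1']


def as_numeric(name, index, current):
    """Option 3: a single series whose value is the raw enum code."""
    return [f'{name}{{index="{index}"}} {current}']


for render in (as_state_set, as_info, as_numeric):
    print("\n".join(render("fan_status", 1, 3)))
    print()
```

With option 2 the label value changes whenever the state changes, so each state shows up as a separate series that starts and stops, which is where the less clean history comes from.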

What do you think is, in general, the best way to expose such statuses?

Thanks for your input.

Ben Kochie

Jul 17, 2022, 4:50:43 AM
to Roman Baeriswyl, Prometheus Users
For things that have state changes you care about, I usually recommend EnumAsStateSet.

The good news is that Prometheus deals with compressing the boolean values very well. And since all fans have the same set of states, those values are deduplicated in the index.

So while it looks like a lot in the metric output, it stores well in the TSDB.

The question is, how many fans on how many servers are we talking about?


Roman Baeriswyl

Jul 18, 2022, 4:50:00 PM
to Prometheus Users
Thanks for the answer. Well, it is not only fans; there are dozens of other status fields as well (I'm writing an iDRAC SNMP exporter), and potentially dozens of servers. Should I stick with the StateSet, or should I switch to exposing just the numerical representation?

Ben Kochie

Jul 18, 2022, 5:14:23 PM
to Roman Baeriswyl, Prometheus Users
Let's do the math:

100 servers * 10 states * 20 sensors = 20,000 metrics

Worst case, say you have 5000 metrics each for 100 servers, that's still only 500,000 series. This will probably take about 4GiB of memory. It should still fit easily in an 8GiB memory instance.

A single Prometheus can handle millions of metrics if you capacity plan accordingly.
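
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the per-series figure is simply what the "about 4GiB for 500,000 series" estimate implies, not a measured value, and real memory use depends on label sizes, churn, and scrape interval.

```python
def series_count(servers, sensors_per_server, states_per_sensor):
    """Each (server, sensor, state) combination is one time series."""
    return servers * sensors_per_server * states_per_sensor


typical = series_count(servers=100, sensors_per_server=20, states_per_sensor=10)

# Worst case from the message: 5000 series on each of 100 servers.
worst_case = 100 * 5000

# Rule-of-thumb memory per series implied by "about 4GiB for 500,000 series".
bytes_per_series = 4 * 1024**3 / worst_case

print(typical)                  # 20000
print(worst_case)               # 500000
print(round(bytes_per_series))  # 8590
```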


Roman Baeriswyl

Jul 18, 2022, 6:21:34 PM
to Ben Kochie, Prometheus Users
True, the amount should not be an issue at all.
I wonder what is more convenient for the end user: having 10 states per sensor but with their state name as label, or just having one with the numerical value (which would allow > and < operations for alerts). I cannot decide between those two.

Regarding the other projects: I've looked through many of them. The first one you mention needs to run on the Dell server itself, which I do not want. The second contains only a few metrics and uses the Redfish API (basically JSON, but I think a bit limited, especially for older systems). There are also a lot of others, mostly based on prometheus/snmp_exporter, but they also lack a lot of metrics. In my first try, I created my own snmp_exporter generator setup (https://github.com/Roemer/idrac-snmp-exporter), even with a fully working automatic pipeline, but I find the generator way too restrictive.
I am now working on a Node.js-based exporter with express, prom-client and net-snmp, and it seems to work fairly well. I can export what I want, exactly how I want. This is the v2 branch, which only exposes one set of metrics.

Ben Kochie

Jul 18, 2022, 11:32:21 PM
to Roman Baeriswyl, Prometheus Users
With PromQL, the state label with a boolean value tends to be more user-friendly.

For example, you can do things like `avg_over_time(foo{state="some state"}[10m])` to detect problems, but maybe ignore one or two state changes.

Similarly, you can be more specific about states with `changes()`.
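
To make the idea concrete, here is a small Python sketch that mimics what PromQL's `avg_over_time()` and `changes()` compute over the samples in a range. The sample windows and the 0.5 alert threshold are invented for illustration; the real functions work on timestamped samples inside Prometheus.

```python
def avg_over_time(samples):
    """Mean of the samples in the window. For a 0/1 series this is the
    fraction of the window during which the state was active."""
    return sum(samples) / len(samples)


def changes(samples):
    """Number of times the value differs from the previous sample."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


# Hypothetical 10-minute windows of a state="criticalUpper" series,
# scraped once per minute.
flapping = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]   # one brief blip
sustained = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]  # stays in the bad state

THRESHOLD = 0.5  # assumed alert threshold: active for over half the window

for name, window in (("flapping", flapping), ("sustained", sustained)):
    print(name, avg_over_time(window), changes(window),
          avg_over_time(window) > THRESHOLD)
```

The single blip averages 0.1 and stays below the threshold, while the sustained state averages 0.8 and would fire.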

Roman Baeriswyl

Jul 19, 2022, 5:41:47 AM
to Ben Kochie, Prometheus Users
Why not both:

idrac_amperage_probe_status{index="1",statusName="other",statusNumber="1"} 0
idrac_amperage_probe_status{index="1",statusName="unknown",statusNumber="2"} 0
idrac_amperage_probe_status{index="1",statusName="ok",statusNumber="3"} 1
idrac_amperage_probe_status{index="1",statusName="nonCriticalUpper",statusNumber="4"} 0
idrac_amperage_probe_status{index="1",statusName="criticalUpper",statusNumber="5"} 0
idrac_amperage_probe_status{index="1",statusName="nonRecoverableUpper",statusNumber="6"} 0
idrac_amperage_probe_status{index="1",statusName="nonCriticalLower",statusNumber="7"} 0
idrac_amperage_probe_status{index="1",statusName="criticalLower",statusNumber="8"} 0
idrac_amperage_probe_status{index="1",statusName="nonRecoverableLower",statusNumber="9"} 0
idrac_amperage_probe_status{index="1",statusName="failed",statusNumber="10"} 0

This way, one can use the name or the number if that would be easier (for < or > checks).

Brian Candler

Jul 19, 2022, 6:14:17 AM
to Prometheus Users
I don't think you can do numeric comparisons on labels(*). If you want both approaches, then you need two sets of metrics: a single metric with a value of 3, and another set of metrics giving the 10 booleans.

(*) apart from a regex like `[1-5]`, in which case you might as well use `(other|unknown|ok|nonCriticalUpper|criticalUpper)`
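
A quick Python sketch of why the two regexes select the same series. Prometheus label regexes are fully anchored, which `re.fullmatch` imitates here; the enum table is copied from the sample output earlier in the thread.

```python
import re

# Enum values as shown in the idrac_amperage_probe_status sample output.
STATUS = {
    1: "other", 2: "unknown", 3: "ok",
    4: "nonCriticalUpper", 5: "criticalUpper", 6: "nonRecoverableUpper",
    7: "nonCriticalLower", 8: "criticalLower", 9: "nonRecoverableLower",
    10: "failed",
}


def label_matches(pattern, value):
    """Imitate a Prometheus =~ matcher: the whole label value must match."""
    return re.fullmatch(pattern, value) is not None


by_number = {code for code in STATUS if label_matches(r"[1-5]", str(code))}
by_name = {
    code for code, name in STATUS.items()
    if label_matches(r"other|unknown|ok|nonCriticalUpper|criticalUpper", name)
}

print(sorted(by_number))     # [1, 2, 3, 4, 5]
print(by_number == by_name)  # True
```

Because the match is anchored, `[1-5]` does not accidentally match "10". Either way this is string matching, not a numeric comparison.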

Stuart Clark

Jul 19, 2022, 6:18:42 AM
to Roman Baeriswyl, Ben Kochie, Prometheus Users
On 19/07/2022 10:41, Roman Baeriswyl wrote:
> Why not both:
>
> idrac_amperage_probe_status{index="1",statusName="other",statusNumber="1"} 0
> idrac_amperage_probe_status{index="1",statusName="unknown",statusNumber="2"} 0
> idrac_amperage_probe_status{index="1",statusName="ok",statusNumber="3"} 1
> idrac_amperage_probe_status{index="1",statusName="nonCriticalUpper",statusNumber="4"} 0
> idrac_amperage_probe_status{index="1",statusName="criticalUpper",statusNumber="5"} 0
> idrac_amperage_probe_status{index="1",statusName="nonRecoverableUpper",statusNumber="6"} 0
> idrac_amperage_probe_status{index="1",statusName="nonCriticalLower",statusNumber="7"} 0
> idrac_amperage_probe_status{index="1",statusName="criticalLower",statusNumber="8"} 0
> idrac_amperage_probe_status{index="1",statusName="nonRecoverableLower",statusNumber="9"} 0
> idrac_amperage_probe_status{index="1",statusName="failed",statusNumber="10"} 0
>
> This way, one can use the name or the number if that would be easier
> (for < or > checks).

The downside with numeric statuses is that you need more knowledge to
use them compared with the label method. I have to know that 7 = unknown
or 5 = too hot, etc.

That suggestion wouldn't actually help BTW, as the statusNumber is a
label, so you could only use regex matches rather than >/<. If you wanted
that as well, you'd need a separate metric
(idrac_amperage_probe_status_number or something) that carries no status
labels and just has the 1-10 value.

The value of that purely numeric status metric also depends on what the
status values actually are. It might be more useful for things which
"progress" (good, poor, bad, broken) but probably not for statuses which
are unrelated (network error, disk error, hardware fault, temperature
error) as you are unlikely to use >/<
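
As a rough illustration using the enum from the sample output in this thread: the codes are only partly ordered by severity, so a numeric threshold can pull in states you did not intend. This is a minimal Python sketch; reading severity off the state names is an assumption, not something taken from the Dell MIB.

```python
# Enum values as shown in the idrac_amperage_probe_status sample output.
STATUS = {
    1: "other", 2: "unknown", 3: "ok",
    4: "nonCriticalUpper", 5: "criticalUpper", 6: "nonRecoverableUpper",
    7: "nonCriticalLower", 8: "criticalLower", 9: "nonRecoverableLower",
    10: "failed",
}

# "value > 3" catches every state listed after ok, but skips other/unknown.
above_ok = [name for code, name in STATUS.items() if code > 3]

# "value >= 5" looks like "critical or worse", yet it also includes
# nonCriticalLower (7), which by its name is only a warning.
at_or_above_critical_upper = [name for code, name in STATUS.items() if code >= 5]

print(above_ok)
print(at_or_above_critical_upper)
```

So `>` works for "anything past ok" here, but finer thresholds need either the state labels or an explicit severity mapping.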

--
Stuart Clark

Roman Baeriswyl

Jul 19, 2022, 7:18:57 AM
to Stuart Clark, Ben Kochie, Prometheus Users
Great feedback as well, thanks.

I will add both metrics:
idrac_amperage_probe_status{index="1",statusName="other"} 0
idrac_amperage_probe_status{index="1",statusName="unknown"} 0
idrac_amperage_probe_status{index="1",statusName="ok"} 1
idrac_amperage_probe_status{index="1",statusName="nonCriticalUpper"} 0
idrac_amperage_probe_status{index="1",statusName="criticalUpper"} 0
idrac_amperage_probe_status{index="1",statusName="nonRecoverableUpper"} 0
idrac_amperage_probe_status{index="1",statusName="nonCriticalLower"} 0
idrac_amperage_probe_status{index="1",statusName="criticalLower"} 0
idrac_amperage_probe_status{index="1",statusName="nonRecoverableLower"} 0
idrac_amperage_probe_status{index="1",statusName="failed"} 0
idrac_amperage_probe_status{index="2",statusName="other"} 0
idrac_amperage_probe_status{index="2",statusName="unknown"} 0
idrac_amperage_probe_status{index="2",statusName="ok"} 1
idrac_amperage_probe_status{index="2",statusName="nonCriticalUpper"} 0
idrac_amperage_probe_status{index="2",statusName="criticalUpper"} 0
idrac_amperage_probe_status{index="2",statusName="nonRecoverableUpper"} 0
idrac_amperage_probe_status{index="2",statusName="nonCriticalLower"} 0
idrac_amperage_probe_status{index="2",statusName="criticalLower"} 0
idrac_amperage_probe_status{index="2",statusName="nonRecoverableLower"} 0
idrac_amperage_probe_status{index="2",statusName="failed"} 0

idrac_amperage_probe_status_code{index="1"} 3
idrac_amperage_probe_status_code{index="2"} 3

and probably make it configurable which of the two are exposed.