Patterns to expose absent metrics when 0 is meaningful.

Dorian Jaminais-Grellier

Aug 26, 2021, 12:14:04 AM
to Prometheus Users
Hi,

I am looking for guidance on how to properly instrument a system where the metric can go missing for extended periods of time.

Here is the setup:
* I have a sensor reporting a gauge value (temperature, in my case). The sensor pushes that data.
* I'm running something (to be built) that will take that sensor value and expose it to Prometheus.

Now the naive way is to create a gauge for the sensor and expose it all the time. The problem is that this makes it impossible for me to distinguish between the temperature staying constant and the sensor not reporting any metrics anymore.

Another way, as suggested in the docs, is to report 0, but then I can't distinguish between the temperature being 0 and a 0 meaning no data.

I could make the metric disappear completely from my /metrics endpoint. I understand this is frowned upon, but it would have the advantage of making it very clear to users that the data is missing.

The final idea I have is to report two metrics, one for the temperature and one for the last collection time. However, this is confusing for users, since they need to think about invalidating the temperature data at query time whenever the last collection time goes above an arbitrary threshold.
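
To make that concrete, the kind of query a user would have to write might look roughly like this (metric and label names are just placeholders):

  # Only keep temperature readings whose companion "last collection time"
  # is less than 10 minutes old (600s is an arbitrary threshold).
  temperature_celsius
    and on (sensor)
  (time() - temperature_last_collected_timestamp_seconds < 600)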

So here I am. Is there a recommended pattern for this type of metrics? Is there another option I haven't listed above?

Thank you

Dorian

Michael Ströder

Aug 26, 2021, 5:26:33 AM
to Prometheus Users
On 8/26/21 06:14, Dorian Jaminais-Grellier wrote:
> Now the naive way is to create a gauge for the sensor and expose it all
> the time. The problem is that this makes it impossible for me to
> distinguish between the temperature staying constant and the sensor not
> reporting any metrics anymore.
>
> Another way, as suggested in the docs
> <https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics>,
> is to report 0, but then I can't distinguish between the temperature
> being 0 and a 0 meaning no data.
>
> I could make the metric disappear completely from my /metrics endpoint.
> I understand this is frowned upon, but it would have the advantage of
> making it very clear to users that the data is missing.
>
> The final idea I have is to report two metrics, one for the temperature
> and one for the last collection time. However, this is confusing for
> users, since they need to think about invalidating the temperature data
> at query time whenever the last collection time goes above an arbitrary
> threshold.
>
> So here I am. Is there a recommended pattern for this type of metrics?
> Is there another option I haven't listed above?

I'd also love to know the answer to that (having considered the very
same work-arounds as you).

What's missing is a metric value like None / NULL, or whatever you're used
to in programming languages to express "nothing here".

Ciao, Michael.

Brian Candler

Aug 29, 2021, 4:22:35 AM
to Prometheus Users
On Thursday, 26 August 2021 at 05:14:04 UTC+1 Dorian Jaminais-Grellier wrote:
I could make the metric disappear completely from my /metrics endpoint. I understand this is frowned upon, but it would have the advantage of making it very clear to users that the data is missing.


I'd say it's not exactly frowned upon.  It can make it more difficult to alert on this condition, but it's doable: either by joining to another timeseries that has all the labels you expect to see (using 'and' or 'unless'), or by joining to itself with a time offset (e.g. alert when the timeseries existed 10 minutes ago but doesn't exist now).
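
For example (with made-up metric and job names), the "existed 10 minutes ago but not now" check and the join against an always-present series could look roughly like:

  # Fires for series that were present 10 minutes ago but are absent now
  sensor_temperature_celsius offset 10m
    unless
  sensor_temperature_celsius

  # Or keep instances of an always-present series (here the exporter's own
  # "up") that have no matching temperature series right now
  up{job="temperature-exporter"} == 1
    unless on (instance)
  sensor_temperature_celsius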


The traditional way to handle this is to have a separate metric representing whether or not temperature was collected successfully - comparable to "up" in regular scraping, or "probe_success" in blackbox_exporter.  This assumes that you are able to scrape, and the exporter is able to say explicitly "I could not talk to the temperature sensor", or "I talked to the temperature sensor, but it had no value to give to me".  In that case, the value 0 or 1 tells you whether there's a problem with temperature collection or not; the main metric can either vanish, or report the last-known value, whichever is more useful to you.
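
In exposition-format terms, that pattern might look something like this (metric and label names are only illustrative):

  # HELP sensor_temperature_celsius Last temperature read from the sensor
  # TYPE sensor_temperature_celsius gauge
  sensor_temperature_celsius{sensor="room1"} 21.5
  # HELP sensor_collection_success 1 if the last read of the sensor succeeded, 0 otherwise
  # TYPE sensor_collection_success gauge
  sensor_collection_success{sensor="room1"} 1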

However, it sounds like rather than scraping, you're using something like pushgateway to get the last reported value.  In that case, the reporting of the temperature (to the push gateway) is not synchronous with the scraping of the data (from the push gateway), and it depends on what failure modes you're trying to deal with.  If the issue is "the temperature probe is broken, but I'm able to report that it's broken", then the pusher can push a separate metric saying success/fail.  But if it just goes offline or stops pushing data, that doesn't help you.

In that case, a separate metric carrying the timestamp of the last push is the safest approach, but as you suggest, you need to process it somewhat to make it more useful.  You could have a recording rule that synthesises a status value, i.e. stores a value of 1 if the push timestamp is "fresh enough" and 0 if it hasn't been seen for longer than some threshold.
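
A rough sketch of such a recording rule, assuming the push includes a made-up metric sensor_last_push_timestamp_seconds:

  groups:
    - name: sensor_freshness
      rules:
        - record: sensor_data_fresh
          # 1 if the last push happened within the last 10 minutes, 0 otherwise
          expr: (time() - sensor_last_push_timestamp_seconds) < bool 600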

Or you can make a pushgateway which has a TTL that expires the metric; that's a feature that has been requested but rejected for the standard pushgateway, so you may find it useful to read the relevant issue threads to understand why it's considered a bad idea.

I did find a fork with TTL: https://github.com/dinumathai/pushgateway