Alerting on freshness (metrics which have not reported in x minutes)


Matt Bostock

May 1, 2016, 2:36:32 PM
to Prometheus Developers
Hi,

I'm looking at migrating from Nagios to Prometheus and one of the alerts we need to migrate over is a freshness check, i.e.:

"Alert if metric `db_lag` has not reported from a given node in the last 5 minutes"

The metric in question is a gauge.

I saw the 'absent()' function but it's not clear to me how to alert on a time threshold using absent(). How can I alert on stale values in Prometheus?

Related, does Prometheus have the concept of null values, i.e. if a metric does not report in a given timeframe does it register a null value, and could I query on that?

Thanks in advance,
Matt

Björn Rabenstein

May 1, 2016, 5:32:37 PM
to Matt Bostock, Prometheus Developers
On 1 May 2016 at 20:36, Matt Bostock <ma...@mattbostock.com> wrote:
> "Alert if metric `db_lag` has not reported from a given node in the last 5
> minutes"

That's not really the way Prometheus works. Prometheus scrapes the
monitored target, and in a sane setup, it always scrapes all the
metrics, or none at all (if the whole scrape fails).

The latter can be tracked via the `up` metric, which contains 1 for
successful scrapes and 0 for failed ones. See
https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series
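A minimal example in the alerting-rule syntax of that era, assuming a five-minute tolerance (the alert name is illustrative):

```
ALERT InstanceDown
  IF up == 0
  FOR 5m
```

This fires once a target has failed to be scraped for five consecutive minutes.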

> I saw the 'absent()' function but it's not clear to me how to alert on a
> time threshold using absent(). How can I alert on stale values in
> Prometheus?

`absent` is indeed the right function to detect if a metric is
missing. (However, as said above, it sounds like your use case is a
different one.) Alerting rules have a `FOR` clause which allows you to
define for how long an alerting condition (e.g. absence of a metric)
has to be true before firing the alert. See
https://prometheus.io/docs/alerting/rules/
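Put together, an absence alert in the same rule syntax might look like this, with the alert name and duration chosen for illustration:

```
ALERT DbLagAbsent
  IF absent(db_lag)
  FOR 5m
```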

> Related, does Prometheus have the concept of null values, i.e. if a metric
> does not report in a given timeframe does it register a null value, and
> could I query on that?

A metric can have any IEEE 754 float value, which includes NaN and
the like. But Prometheus's collection semantics don't really work
the way your question suggests. Prometheus scrapes the targets at
(ideally regular, but in principle arbitrary) intervals. Any given
metric always has the value of the last scrape (modulo what we call
the staleness limit). The monitored targets in turn expose all their
metrics all the time; they don't usually disappear or come back. If
a metric doesn't change, it simply stays at the same value.

--
Björn Rabenstein, Engineer
http://soundcloud.com/brabenstein

SoundCloud Ltd. | Rheinsberger Str. 76/77, 10115 Berlin, Germany
Managing Director: Alexander Ljung | Incorporated in England & Wales
with Company No. 6343600 | Local Branch Office | AG Charlottenburg |
HRB 110657B

Julius Volz

May 1, 2016, 7:04:01 PM
to Björn Rabenstein, Matt Bostock, Prometheus Developers
On Sun, May 1, 2016 at 11:32 PM, Björn Rabenstein <bjo...@soundcloud.com> wrote:
> On 1 May 2016 at 20:36, Matt Bostock <ma...@mattbostock.com> wrote:
>> "Alert if metric `db_lag` has not reported from a given node in the last 5
>> minutes"
>
> That's not really the way Prometheus works. Prometheus scrapes the
> monitored target, and in a sane setup, it always scrapes all the
> metrics, or none at all (if the whole scrape fails).
>
> The latter can be tracked via the `up` metric, which contains 1 for
> successful scrapes and 0 for failed ones. See
> https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series

A variation of this is when you have one of those rare use cases where you have to legitimately push metrics via the Pushgateway (see https://prometheus.io/docs/practices/pushing/ for when this is the right approach). In that case, you don't have an automatic "up" metric for the target that pushed the metrics.

Instead, you will want to push a separate metric that contains the timestamp of the last successful push or run and alert on that getting too old (via "time() - my_last_success_timestamp_seconds > x").

See also the best practices for instrumenting batch jobs: https://prometheus.io/docs/practices/instrumentation/#batch-jobs
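A sketch of such an alert in the rule syntax of that era, using the metric name above and an assumed one-hour staleness threshold:

```
ALERT BatchJobStale
  IF time() - my_last_success_timestamp_seconds > 3600
```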

Matt Bostock

May 2, 2016, 2:23:02 AM
to Prometheus Developers, bjo...@soundcloud.com, ma...@mattbostock.com
Thanks both.

To clarify, I'm currently pushing all metrics into Prometheus as a temporary work-around until all metrics can be migrated from Nagios.

It sounds like sending a 'last time succeeded' metric might be the correct solution, although absent() seems more elegant.

Is there any reason why absent() would not work with the push gateway?

Thanks,
Matt

Björn Rabenstein

May 2, 2016, 5:23:08 AM
to Matt Bostock, Prometheus Developers
On 2 May 2016 at 08:23, Matt Bostock <ma...@mattbostock.com> wrote:
> Is there any reason why absent() would not work with the push gateway?

The point of the Pushgateway is to reconstruct Prometheus semantics in
a push context. To do that, the Pushgateway "keeps alive" each metric
until it is explicitly deleted or overwritten by a later push. The
timestamp will be the time at which Prometheus scraped the
Pushgateway. So if your pushing job ceases to push, you will be stuck
with the old metric forever, and it will always appear fresh.

Matt Bostock

May 2, 2016, 11:01:27 AM
to Björn Rabenstein, Prometheus Developers

Thanks Björn, that's very helpful.

niravs...@gmail.com

Aug 28, 2018, 4:23:13 AM
to Prometheus Developers
You can use the approach below:
https://niravshah2705-software-engineering.blogspot.com/2018/08/prometheus-monitoring.html

We record the scrape time of each metric with a recording rule:

rules:
- record: stackdriver_pubsub:scraptime
  expr: timestamp(stackdriver_pubsub_topic_pubsub_googleapis_com_topic_send_request_count)

Alert rule expression:

time() - max_over_time(stackdriver_pubsub:scraptime[5h]) > 3600 or sum_over_time(stackdriver_pubsub_topic_pubsub_googleapis_com_topic_send_request_count[30m]) < 5
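For reference, the recording rule and alert above could be combined into a complete Prometheus 2.x rule file roughly like this (the group and alert names are illustrative):

```yaml
groups:
- name: freshness
  rules:
  # Record the wall-clock time of the latest sample for the metric.
  - record: stackdriver_pubsub:scraptime
    expr: timestamp(stackdriver_pubsub_topic_pubsub_googleapis_com_topic_send_request_count)
  # Fire if the most recent recorded scrape time is more than an hour old.
  - alert: PubsubScrapeStale
    expr: time() - max_over_time(stackdriver_pubsub:scraptime[5h]) > 3600
```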
