Staleness and failed scrapes


Matthias Rampke

Aug 14, 2018, 11:17:19 AM
to Prometheus
Hey,

we observed something and I wanted to check if this is the desired behaviour. We have an exporter that sometimes fails for reasons. When it does, the time series that were scraped from it immediately disappear, which in turn causes alerts to resolve and then re-fire:


(note that this target is only scraped once per minute, so the up == 0 represents a single scrape error)
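
For illustration, a minimal sketch of the kind of rule that shows this behaviour (the metric name and threshold are made up, not our actual rule):

    - alert: QueueTooLong
      expr: my_queue_length > 100
      # When a scrape of the exporter fails, my_queue_length is marked
      # stale, the expression returns no result at the next evaluation,
      # the alert resolves, and it fires again after the next successful
      # scrape.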

I read through the staleness documentation and found no mention of scrape errors.

I would expect the time series not to end because of ephemeral scrape errors. Is this expectation wrong, or is this something that isn't supposed to happen?

/MR

Brian Brazil

Aug 14, 2018, 11:26:24 AM
to Matthias Rampke, Prometheus
This is as expected: you don't want old, stale values to hang around towards the end of a target's life before it is removed from SD.

This is covered on the 15th slide of the above talk, and I also discussed it on https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0


Matthias Rampke

Aug 15, 2018, 2:45:41 AM
to Brian Brazil, Prometheus
I see, that makes sense. I'm wondering if this is the best possible trade-off, though: as an administrator I can control the delay between a target disappearing and Prometheus being updated, but I cannot exclude all scrape errors.

Also, in this case the target returns a 500, so it's definitely still *there* and Prometheus knows this.

I would prefer not to equate scrape errors with targets that have gone away, tbh.
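
(The scrape failure itself is of course still visible through up, so it can be alerted on separately; a rough sketch, with an invented job name and an arbitrary duration:)

    - alert: ScrapeFailing
      expr: up{job="my-exporter"} == 0
      for: 5m
      # up is written by Prometheus itself for every scrape attempt, even
      # when the exporter answers with a 500, so this alert is unaffected
      # by the exporter's own series going stale.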

/MR

Brian Brazil

Aug 15, 2018, 3:31:01 AM
to Matthias Rampke, Prometheus

What should we do if the target had failed scrapes for 5 minutes and there were no longer any samples within the staleness window? Should we somehow keep those series alive? Making the series stale on a failed scrape is consistent with what we're doing with staleness generally.
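
(For context, the staleness window in question is the 5-minute lookback on instant vector selectors; a sketch, with an invented metric name:)

    my_queue_length        # instant selector: returns nothing once the series
                           # is marked stale or has no sample in the last 5m
    my_queue_length[10m]   # range selector: still returns whatever raw samples
                           # exist in the window; stale markers are not samples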

Brian
 


Julien Pivotto

Aug 15, 2018, 4:20:56 AM
to Matthias Rampke, Brian Brazil, Prometheus

Hi Matthias,

I really like staleness the way it is.

We have one flaky target, however. For it, we simply use max_over_time(metric[5m]) in our alert. Maybe there could be a better function, like last_over_time(metric[5m]), that would just take the last result and ignore the staleness?
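
Concretely, the workaround looks something like this (metric name, threshold and window are invented; the window just has to span at least one successful scrape):

    - alert: QueueTooLong
      expr: max_over_time(my_queue_length[5m]) > 100
      # As long as there is at least one sample in the 5m window, the
      # expression keeps returning a value, so a single failed scrape does
      # not make the alert resolve.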

regards,


--
(o- Julien Pivotto
//\ Open-Source Consultant
V_/_ Inuits - https://www.inuits.eu

Brian Brazil

Aug 15, 2018, 4:33:48 AM
to Julien Pivotto, Matthias Rampke, Prometheus

avg_over_time is a bit safer in general, as it's more resilient to noise in the data and individual outliers.
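
A small worked example of the difference (invented numbers; assume five samples at a 1m scrape interval and an alert threshold of 500):

    # samples in the last 5m: 10, 10, 900, 10, 10
    max_over_time(metric[5m])   # = 900 -> a single outlier scrape trips the alert
    avg_over_time(metric[5m])   # = 188 -> the outlier is damped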
