I see, that makes sense. I'm wondering if this is the best possible trade-off though – as an administrator I can control the delay between a target disappearing and updating Prometheus, but I cannot exclude all scrape errors.

Also, in this case the target returns a 500, so it's definitely still *there* and Prometheus knows this.

I would prefer not to equate scrape errors and targets that have gone away tbh.

/MR

On 14 August 2018 at 16:17, 'Matthias Rampke' via Prometheus Users <prometheus-users@googlegroups.com> wrote:
> Hey,
>
> we observed something and I wanted to check if this is the desired behaviour. We have an exporter that sometimes fails for reasons. When it does, the time series that were scraped from it immediately disappear, which in turn causes alerts to resolve and then re-fire:
>
> (note that this target is only scraped once per minute, so the up == 0 represents a single scrape error)
>
> I read through the staleness documentation:
> and found no mention of scrape errors.
>
> I would expect the time series not to end because of ephemeral scrape errors, is this expectation wrong? Or is this something that isn't supposed to happen?

This is as expected, you don't want old stale values to hang around towards the end of a target's life before it is removed from SD.

This is covered on the 15th slide of the above talk, and I also discussed it on https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0

On 15 Aug 08:45, 'Matthias Rampke' via Prometheus Users wrote:
> I see, that makes sense. I'm wondering if this is the best possible
> trade-off though – as an administrator I can control the delay between a
> target disappearing and updating Prometheus, but I cannot exclude all
> scrape errors.
>
> Also, in this case the target returns a 500, so it's definitely still
> *there* and Prometheus knows this.
>
> I would prefer not to equate scrape errors and targets that have gone away
> tbh.
>
> /MR
Hi Matthias,
I really like staleness the way it is.
We have one flaky target however. For it, we simply use
max_over_time(metric[5m]) in our alert. Maybe there could be a better
function, like last_over_time(metric[5m]) that would just take the last
result and ignore the staleness?
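
To illustrate the workaround, a rule along these lines is roughly what I
mean. This is only a sketch: the group name, alert name, threshold and
"for" duration are made up for the example, and "metric" stands in for the
real series name.

```yaml
groups:
  - name: flaky-target-example        # hypothetical group name
    rules:
      - alert: FlakyTargetValueHigh   # hypothetical alert name
        # A range selector returns the real samples seen in the last 5m,
        # so one failed scrape (which marks the series stale for instant
        # lookups) does not make the expression return nothing.
        expr: max_over_time(metric[5m]) > 100   # threshold is an assumption
        for: 10m                                # assumed duration
```

The trade-off is that max_over_time takes the highest value in the window
and not the most recent one, so the alert can keep firing for up to 5m
after the value has dropped back below the threshold.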