better way to get notified about (true) single scrape failures?


Christoph Anton Mitterer

May 8, 2023, 9:29:40 PM
to Prometheus Users
Hey.

I have an alert rule like this:

groups:
  - name:       alerts_general
    rules:
    - alert: general_target-down
      expr: 'up == 0'
      for:  5m

which is intended to notify about a target instance (or rather, a specific exporter on it) being down.

There are also routes in alertmanager.yml which use longer periods for group_wait and group_interval and distribute the resulting alerts to the various receivers (e.g. depending on the instance that is affected).


By chance I've noticed that some of our instances (or the networking) seem to be a bit unstable, and every now and then a single scrape, or a few, fail.

Since this typically doesn't mean that the exporter is down (in the above sense), I wouldn't want that to cause a notification to be sent to the people responsible for the respective instances.
But I would want one sent to the local Prometheus admin (me ^^), even if only a single scrape fails, so that I can look into what causes the scrape failures.



My (working) solution for that is:
a) another alert rule like:
groups:
  - name:     alerts_general_single-scrapes
    interval: 15s
    rules:
    - alert: general_target-down_single-scrapes     
      expr: 'up{instance!~"(?i)^.*\\.garching\\.physik\\.uni-muenchen\\.de$"} == 0'
      for:  0s

(With 15s being the smallest scrape interval used by any job.)

And a corresponding alertmanager route like:
  - match:
      alertname: general_target-down_single-scrapes
    receiver:       admins_monitoring_no-resolved
    group_by:       [alertname]
    group_wait:     0s
    group_interval: 1s


The group_wait: 0s and group_interval: 1s seemed necessary because, despite the for: 0s, Alertmanager seems to check again before actually sending a notification... and when the alert is already gone by then (because there was e.g. only one single missing scrape) it wouldn't send anything (even though the alert actually fired).


That works so far... that is, admins_monitoring_no-resolved gets a notification for every single failed scrape, while all others only get one when scrapes fail for at least 5m.

I even improved the above a bit by clearing the alert for single failed scrapes when the one for long-term down starts firing, via something like:
      expr: '( up{instance!~"(?i)^.*\\.ignored\\.hosts\\.example\\.org$"} == 0 )  unless on (instance,job)  ( ALERTS{alertname="general_target-down", alertstate="firing"} == 1 )'


I wondered whether this can be done better?

Ideally I'd like a notification for general_target-down_single-scrapes to be sent only if there will be none for general_target-down.

That is, I don't care if the notification comes in late (by the above ~ 5m), it just *needs* to come, unless - of course - the target is "really" down (that is when general_target-down fires), in which case no notification should go out for general_target-down_single-scrapes.


I couldn't think of an easy way to get that. Any ideas?


Thanks,
Chris.

Brian Candler

May 9, 2023, 3:55:22 AM
to Prometheus Users
That's tricky to get exactly right. You could try something like this (untested):

    expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
    for: 5m

- min_over_time will be 0 if any single scrape failed in the past 5 minutes
- max_over_time will be 0 if all scrapes failed (which means the 'standard' failure alert should have triggered)

Therefore, this should alert if any scrape failed over 5 minutes, unless all scrapes failed over 5 minutes.

There is a boundary condition where if the scraping fails for approximately 5 minutes you're not sure if the standard failure alert would have triggered. Hence it might need a bit of tweaking for robustness. To start with, just make it over 6 minutes:

    expr: min_over_time(up[6m]) == 0 unless max_over_time(up[6m]) == 0
    for: 6m

That is, if max_over_time[6m] is zero, we're pretty sure that a standard alert will have been triggered by then.

I'm still not quite convinced about the "for: 6m" and whether we might lose an alert if there were a single failed scrape. Maybe this would be more sensitive:

    expr: min_over_time(up[8m]) == 0 unless max_over_time(up[6m]) == 0
    for: 7m

but I think you might get some spurious alerts at the *end* of a period of downtime.

Christoph Anton Mitterer

May 9, 2023, 9:47:25 PM
to Prometheus Users
Hey Brian.

On Tuesday, May 9, 2023 at 9:55:22 AM UTC+2 Brian Candler wrote:
That's tricky to get exactly right. You could try something like this (untested):

    expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
    for: 5m

- min_over_time will be 0 if any single scrape failed in the past 5 minutes
- max_over_time will be 0 if all scrapes failed (which means the 'standard' failure alert should have triggered)

Therefore, this should alert if any scrape failed over 5 minutes, unless all scrapes failed over 5 minutes.

Ah that seems a pretty smart idea.

And the for: is needed to make it actually "count": the [5m] only looks back 5m, and right after a failed scrape max_over_time(up[5m]) would likely still be 1 while min_over_time(up[5m]) would already be 0, so with e.g. for: 0s it would fire immediately.
 

There is a boundary condition where if the scraping fails for approximately 5 minutes you're not sure if the standard failure alert would have triggered.

You mean that the above one wouldn't fire because it thinks it's the long-term alert, while that one wouldn't fire either, because it has just resolved by then?
 
 
Hence it might need a bit of tweaking for robustness. To start with, just make it over 6 minutes:

    expr: min_over_time(up[6m]) == 0 unless max_over_time(up[6m]) == 0
    for: 6m

That is, if max_over_time[6m] is zero, we're pretty sure that a standard alert will have been triggered by then.

That one I don't quite understand.
What if e.g. the following scenario happens (with each line giving the state 1m after the one before):

                                                  for=6           for=5
m   -5 -4 -3 -2 -1  0   for     min[6m] max[6m] result/short    result/long
up:  1  1  1  1  1  0   1       0       1       pending         pending
up:  1  1  1  1  0  0   2       0       1       pending         pending
up:  1  1  1  0  0  0   3       0       1       pending         pending
up:  1  1  0  0  0  0   4       0       1       pending         pending
up:  1  0  0  0  0  0   5       0       1       pending         fire
up:  0  0  0  0  0  1   6       0       1       fire            clear


After 5m, the long-term alert would fire; after that the scraping would succeed again, but AFAIU the "special" alert for the short failures would still be true at that point and would then start to fire, even though all the previous zeros had actually been reported as part of a long-down alert.


I'm still not quite convinced about the "for: 6m" and whether we might lose an alert if there were a single failed scrape. Maybe this would be more sensitive:

    expr: min_over_time(up[8m]) == 0 unless max_over_time(up[6m]) == 0
    for: 7m

but I think you might get some spurious alerts at the *end* of a period of downtime.

That also seems quite complex. And I guess it might have the same possible issue from above?

The same should be the case if one would do:
    expr: min_over_time(up[6m]) == 0 unless max_over_time(up[5m]) == 0
    for: 6m
It may be just 6m ago that there was a "0" (from a long alert) while the last 5m were all "1"s. So the short-alert would fire, even though it's unclear whether the "0" 6m ago was really just a lone one or the end of a long-alert period.

Actually, I think, any case where the min_over_time goes further back than the long-alert's for:-time should have that.


    expr: min_over_time(up[5m]) == 0 unless max_over_time(up[6m]) == 0
    for: 5m
would also be broken, IMO: if 6m ago there was a "1", only the min_over_time(up[5m]) == 0 would remain (and nothing would silence the alert if needed)... and if 6m ago there was a "0", it should effectively be the same as using [5m]?


Isn't the problem from the very above already solved by placing both alerts in the same rule group?

"Recording and alerting rules exist in a rule group. Rules within a group are run sequentially at a regular interval, with the same evaluation time."
which I guess applies also to alert rules.

Not sure if I'm right, but I think if one places both rules in the same group (and I think even the order shouldn't matter?), then the original:
    expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
    for: 5m
with 5m being the "for:"-time of the long-alert should be guaranteed to work... in the sense that if the above doesn't fire... the long-alert does.

Unless of course the grouping settings in Alertmanager cause trouble... which I don't quite understand... especially: once an alert fires, even if only briefly, is it guaranteed that a notification is sent?
Because as I wrote before, that didn't seem to be the case.

Last but not least, if my assumption is true and your 1st version would work if both alerts are in the same group... how would the interval then matter? Would it still need to be the smallest scrape interval (I guess so)?


Thanks,
Chris.

Brian Candler

May 10, 2023, 3:03:36 AM
to Prometheus Users
> Not sure if I'm right, but I think if one places both rules in the same group (and I think even the order shouldn't matter?), then the original:
>     expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
>     for: 5m
> with 5m being the "for:"-time of the long-alert should be guaranteed to work... in the sense that if the above doesn't fire... the long-alert does.

It depends on the exact semantics of "for". e.g. take a simple case of 1 minute rule evaluation interval. If you apply "for: 1m" then I guess that means the alert must be firing for two successive evaluations (otherwise, "for: 1m" would have no effect).

If so, then "for: 5m" means it must be firing for six successive evaluations.

But up[5m] only looks at samples wholly contained within a 5 minute window, and therefore will normally only look at 5 samples.  (If there is jitter in the sampling time, then occasionally it might look at 4 or 6 samples)

If what I've written above is correct (and it may well not be!), then

expr: up == 0
for: 5m

will fire if "up" is zero for 6 cycles, whereas

... unless max_over_time(up[5m]) == 0

will suppress an alert if "up" is zero for (usually) 5 cycles.

If you want to get to the bottom of this with certainty, you can write unit tests that try out these scenarios.
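
For example, a promtool test file for this might look roughly like the following (untested sketch; it assumes the two rules above live in alerts.yml under the illustrative names target-down and single-scrape-failure, with 1 minute scrape and evaluation intervals):

    # run with: promtool test rules test.yml
    rule_files:
      - alerts.yml
    evaluation_interval: 1m
    tests:
      - interval: 1m
        input_series:
          # sustained outage starting at t=2m
          - series: 'up{job="node", instance="host1"}'
            values: '1 1 0 0 0 0 0 0 0 0'
          # one single failed scrape at t=2m
          - series: 'up{job="node", instance="host2"}'
            values: '1 1 0 1 1 1 1 1 1 1'
        alert_rule_test:
          # after "for: 5m" only the sustained outage on host1 should be firing;
          # the single failure on host2 must not trigger the main alert
          - eval_time: 8m
            alertname: target-down
            exp_alerts:
              - exp_labels:
                  job: node
                  instance: host1
          # analogous cases can be added for the single-scrape-failure rule

You can then adjust the input_series values until the tests cover exactly the boundary cases you care about.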

Christoph Anton Mitterer

May 12, 2023, 10:26:18 PM
to Prometheus Users
Hey Brian

On Wednesday, May 10, 2023 at 9:03:36 AM UTC+2 Brian Candler wrote:
It depends on the exact semantics of "for". e.g. take a simple case of 1 minute rule evaluation interval. If you apply "for: 1m" then I guess that means the alert must be firing for two successive evaluations (otherwise, "for: 1m" would have no effect).

Seems you're right.

I did quite some testing meanwhile with the following alertmanager route (note, that I didn't use 5m, but 1m... simply in order to not have to wait so long):
  routes:
  - match_re:
      alertname: 'td.*'
    receiver:       admins_monitoring

    group_by:       [alertname]
    group_wait:     0s
    group_interval: 1s

and the following rules:
groups:
  - name:     alerts_general_single-scrapes
    interval: 15s   
    rules:
    - alert: td-fast
      expr: 'min_over_time(up[75s]) == 0 unless max_over_time(up[75s]) == 0'
      for:  1m
    - alert: td
      expr: 'up == 0'
      for:  1m


My understanding is, correct me if I'm wrong, that Prometheus basically runs a thread for the scrape job (which in my case has an interval of 15s) and another one that evaluates the alert rules (above, every 15s) and then sends the alert to the alertmanager (if firing).

It felt a bit brittle to have the rules evaluated with the same period as the scrapes, so I did all tests once with 15s for the rule interval, and once with 10s. But it seems as if this doesn't change the behaviour.


But up[5m] only looks at samples wholly contained within a 5 minute window, and therefore will normally only look at 5 samples.

As you can see above... I had already noticed before that you were indeed right, and if my for: is e.g. 4 * evaluation_interval(15s) = 1m ... I need to look back 5 * evaluation_interval(15s) = 75s

At least in my tests, that seemed to cause the desired behaviour, except for one case:
When my "slow" td fires (i.e. after 5 consecutive "0"s) and then there is... within (less than?) 1m, another sequence of "0"s that eventually cause a "slow" td. In that case, td-fast fires for a while, until it directly switches over to td firing.

Was your idea above with something like:
>    expr: min_over_time(up[8m]) == 0 unless max_over_time(up[6m]) == 0
>    for: 7m
intended to fix that issue?

Or could one perhaps use ALERTS{alertname="td",instance="lcg-lrz-ext.grid.lrz.de",job="node"}[??s] == 1 somehow, to check whether it did fire... and then silence the false positive.

 
  (If there is jitter in the sampling time, then occasionally it might look at 4 or 6 samples)

Jitter in the sense that the samples are taken at slightly different times?
Do you think that could affect the desired behaviour? I would intuitively expect that it rather only causes the "base duration" not to be exactly e.g. 1m ... so e.g. instead of taking 1m for the "slow" td to fire, it would happen +/- 15s earlier (and conversely for td-slow).


Another point I basically don't understand... how does all that relate to the scrape intervals?
The plain up == 0 simply looks at the most recent sample (going back up to 5m as you've said in the other thread).

The series up[Ns] looks back N seconds, giving whichever samples are within there and now. AFAIU, there it doesn't go "automatically" back any further (like the 5m above), right?

In order for the for: to work I need at least two samples... so doesn't that mean that as soon as any scrape interval is more than for:-time(1m) / 2 = ~30s (in the above example), the above two alerts will never fire, even if the target is down?

So if I had e.g. some jobs scraping only every 10m ... I'd need another pair of td/td-fast alerts, which then filter on the job (up{job="longRunning"}) and either have only td (if that makes sense)... or a td-fast that fires if one of the every-10m scrapes fails, plus an even longer "slow" td that fires if they fail for e.g. 1h.
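
For illustration, such a separate pair for a slow job might look roughly like this (just a sketch; the job name is hypothetical and the 1h/1h15m windows are simply the 10s case scaled up, so they would need the same kind of tuning as discussed above):

    # hypothetical pair for a job that is only scraped every 10m
    - alert: td_longRunning
      expr: 'up{job="longRunning"} == 0'
      for:  1h
    - alert: td-fast_longRunning
      expr: 'min_over_time(up{job="longRunning"}[1h15m]) == 0 unless max_over_time(up{job="longRunning"}[1h15m]) == 0'
      for:  1h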


If what I've written above is correct (and it may well not be!), then

expr: up == 0
for: 5m

will fire if "up" is zero for 6 cycles, whereas

As far as I understand you... 6 cycles of rule evaluation interval... with at least two samples within that interval, right?
 
... unless max_over_time(up[5m])

will suppress an alert if "up" is zero for (usually) 5 cycles.


 Last but not least an (only) partially related question:

Once an alert fires (in prometheus), even if just for one evaluation interval cycle.... and there is no inhibition rule or so in alertmanager... is it expected that a notification is sent out for sure,... regardless of alertmanager's grouping settings?
Like when the alert fires for one short 15s evaluation interval and clears again afterwards,... but group_wait: is set to some 7d ... is it expected to send that single firing event after 7d, even if it has resolved already once the 7d are over and there was e.g. no further firing in between?


Thanks a lot :-)
Chris.

Brian Candler

May 13, 2023, 6:39:43 AM
to Prometheus Users
On Saturday, 13 May 2023 at 03:26:18 UTC+1 Christoph Anton Mitterer wrote:

  (If there is jitter in the sampling time, then occasionally it might look at 4 or 6 samples)

Jitter in the sense that the samples are taken at slightly different times?

Yes. Each sample is timestamped with the time the scrape took place.

Consider a 5 minute window which generally contains 5 samples at 1 minute intervals:

   |...*......*......*......*......*....|...*....

Now consider what happens when one of those samples is right on the boundary of the window:

   |*......*......*......*......*.......|*.......

Depending on the exact timings that the scrape takes place, it's possible that the first sample could fall outside:

   *|......*......*......*......*.......|*.......

Or the next sample could fall inside:

   |*......*......*......*......*......*|.......

 
Do you think that could affect the desired behaviour?

In my experience, the scraping regularity of Prometheus is very good (just try putting "up[5m]" into the PromQL browser and looking at the timestamps of the samples, they seem to increment in exact intervals).  So it's unlikely to happen much, though it might when the system is under high load, I guess.  Or it might never happen, if Prometheus writes the timestamps of the times it *wanted* to make the scrape, not when it actually occurred.  Determining that would require looking at the source code.
 
Another point I basically don't understand... how does all that relate to the scrap intervals?
The plain up == 0 simply looks at the most recent sample (going back up to 5m as you've said in the other thread).

The series up[Ns] looks back N seconds, giving whichever samples are within there and now. AFAIU, there it doesn't go "automatically" back any further (like the 5m above), right?

That's correct.

So if you're trying to make mutual expressions which fire in case A but not B, and case B but not A, then you'd probably be better off writing them to both use up[5m].

min_over_time(up[5m]) == 0    # use this instead of "up == 0  // for: 5m" for the main alert.

 

In order for the for: to work I need at least two samples

No, you just need two rule evaluations. The rule evaluation interval doesn't have to be the same as the scrape interval, and even if they are the same, they are not synchronized.


If what I've written above is correct (and it may well not be!), then

expr: up == 0
for: 5m

will fire if "up" is zero for 6 cycles, whereas

(rule evaluation cycles, if your rule evaluation interval is 1m)
 

As far as I understand you... 6 cycles of rule evaluation interval... with at least two samples within that interval, right?

No.  The expression "up" is evaluated at each rule evaluation time, and it gives the most recent value of "up", looking back up to 5 minutes.

So if you had a scrape interval of 2 minutes, with a rule evaluation interval of 1 minute it could be that two rule evaluations of "up" see the same scraped value.

(This can also happen in real life with a 1 minute scrape interval, if you have a failed scrape)

 
Once an alert fires (in prometheus), even if just for one evaluation interval cycle.... and there is no inhibition rule or so in alertmanager... is it expected that a notification is sent out for sure,... regardless of alertmanager's grouping settings?

There is group_wait. If the alert were to trigger and clear within the group_wait interval, I'd expect no alert to be sent. But I've not tested that.
 
Like when the alert fires for one short 15s evaluation interval and clears again afterwards,... but group_wait: is set to some 7d ... is it expected to send that single firing event after 7d, even if it has resolved already once the 7d are over and there was e.g. no further firing in between?

You'll need to test it, but my expectation would be that it wouldn't send *anything* for 7 days (while it waits for other similar alerts to appear), and if all alerts have disappeared within that period, that nothing would be sent.  However, I don't know if the 7 day clock resets as soon as all alerts go away, or it continues to tick.  If this matters to you, then test it.

Nobody in their right mind would use 7d for group_wait of course.  Typically you might set it to around a minute, so that if a bunch of similar alerts fire within that 1 minute period, they are gathered together into a single notification rather than a slew of separate notifications.
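
For example, a route might typically look something like this (sketch only; the receiver name is a placeholder and the exact values depend on your environment):

  - match:
      alertname: general_target-down
    receiver:        some-team             # placeholder
    group_by:        [alertname]
    group_wait:      1m    # gather similar alerts for up to a minute
    group_interval:  5m
    repeat_interval: 4h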

HTH,

Brian.

Christoph Anton Mitterer

Mar 17, 2024, 9:29:42 PM
to Prometheus Users
Hey there.

I eventually got back to this and I'm still fighting this problem.

As a reminder, my goal was:
- if e.g. scrapes fail for 1m, a target-down alert shall fire (similar to
  how Icinga would put the host into down state, after pings have failed for a
  number of seconds)
- but even if a single scrape fails (which alone wouldn't trigger the above
  alert) I'd like to get a notification (telling me, that something might be
  fishy with the networking or so), that is UNLESS that single failed scrape
  is part of a sequence of failed scrapes that also caused / will cause the
  above target-down alert

Assuming in the following, each number is a sample value with ~10s distance for
the `up` metric of a single host, with the most recent one being the right-most:
- 1 1 1 1 1 1 1 => should give nothing
- 1 1 1 1 1 1 0 => should NOT YET give anything (might be just a single failure,
                   or develop into the target-down alert)
- 1 1 1 1 1 0 0 => same as above, not clear yet
...
- 1 0 0 0 0 0 0 => here it's clear, this is a target-down alert

In the following:
- 1 1 1 1 1 0 1
- 1 1 1 1 0 0 1
- 1 1 1 0 0 0 1
...
should eventually (not necessarily after the right-most 1, though) all give a
"single-scrape-failure" (even though it's more than just one - it's not a
target-down), simply because there are 0s, but for a time span of less than 1m.

- 1 0 1 0 0 0 0 0 0
should give both, a single-scrape-failure alert (the left-most single 0) AND a
target-down alert (the 6 consecutive zeros)

-           1 0 1 0 1 0 0 0
should give at least 2x a single-scrape-failure alert, and for the rightmost
zeros, it's not yet clear what they'll become.
-   0 0 0 0 0 0 0 0 0 0 0 0  (= 2x six zeros)
should give only 1 target-down alert
- 0 0 0 0 0 0 1 0 0 0 0 0 0  (= 2x six zeros, separated by a 1)
should give 2 target-down alerts

Whether each of such alerts (e.g. in the 1 0 1 0 1 0 ... case) actually results
in a notification (mail) is of course a different matter, and depends on the
alertmanager configuration, but at least the alert should fire and with the right
alert-manager config one should actually get a notification for each single failed
scrape.


Now, Brian has already given me some pretty good ideas how to do this; basically the
ideas were:
(assuming that 1m makes the target down, and a scrape interval of 10s)

For the target-down alert:
a) expr: 'up == 0'
   for:  1m
b) expr: 'max_over_time(up[1m]) == 0'
   for:  0s
=> here (b) was probably better, as it would use the same condition as is also used
   in the alert below, and there can be no weird timing effects depending on the
   for: and when these are actually evaluated.

For the single-scrape-failure alert:
A) expr: min_over_time(up[1m20s]) == 0 unless max_over_time(up[1m]) == 0
   for: 1m10s
   (numbers a bit modified from Brian's example, but I think the idea is the same)
B) expr: min_over_time(up[1m10s]) == 0 unless max_over_time(up[1m10s]) == 0
   for: 1m

=> I did test (B) quite a lot, but there was at least still one case where it failed
   and that was when there were two consecutive but distinct target-down errors, that
   is:
   0 0 0 0 0 0 1 0 0 0 0 0 0  (= 2x six zeros, separated by a 1)
   which would eventually look like e.g.
   0 1 0 0 0 0 0 0   or   0 0 1 0 0 0 0 0
   in the above check, and thus trigger (via the left-most zeros) a false
   single-scrape-failure alert.

=> I'm not so sure whether I truly understand (A),... especially with respect to any
   niche cases, when there's jitter or so (plus, IIRC, it also failed in the case
   described for (B)).


One approach I tried in the meantime was to use sum_over_time .. and then the idea was
simply to check how many ones there are for each case. But it turns out that even if
everything runs normally, the sum is not stable... sometimes, over [1m] I got only 5,
whereas most times it was 6.
Not really sure why that is, because the printed timestamps for each sample seem to
be suuuuper accurate (all the time), but the sum wasn't.
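
A variant that would at least be independent of the exact number of samples in the window might be to count the failed scrapes relative to the samples actually present (only a sketch, and it still has the same "is it part of a target-down?" problem as the other approaches):

    # failed scrapes in the last 1m = all samples in the window minus the
    # successful ones, regardless of whether there happen to be 5, 6 or 7 samples
    expr: '(count_over_time(up[1m]) - sum_over_time(up[1m])) > 0'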


So I tried a different approach now, based on the above from Brian,... which at least in
tests looks promising so far... but I'd like to hear what experts think about it.

- both alerts have to be in the same alert group (I assume this assures they're then
  evaluated in the same thread and at the "same time", that is, with respect to the same
  reference timestamp).
- in my example I assume a scrape interval of 10s and an evaluation interval of 7s (not really
  sure whether the latter matters or could be changed while the rules stay the same - and
  it would still work or not)
- for: is always 0s ... I think that's good, because at least to me it's unclear, how
  things are evaluated if the two alerts have different values for for:, especially in
  border cases.
- rules:
    - alert: target-down
      expr: 'max_over_time( up[1m0s] )  ==  0'
      for:  0s
    - alert: single-scrape-failure
      expr: 'min_over_time(up[15s] offset 1m) == 0 unless max_over_time(up[1m0s]) == 0
                           unless max_over_time(up[1m0s] offset 1m10s) == 0
                           unless max_over_time(up[1m0s] offset 1m) == 0
                           unless max_over_time(up[1m0s] offset 50s) == 0
                           unless max_over_time(up[1m0s] offset 40s) == 0
                           unless max_over_time(up[1m0s] offset 30s) == 0
                           unless max_over_time(up[1m0s] offset 20s) == 0
                           unless max_over_time(up[1m0s] offset 10s) == 0'
      for:  0m
I think the intended working of target-down is obvious so let me explain the ideas behind
single-scrape-failure:

I divide the time spans I look at:
 -130s   -120s   -110s   -100s   -90s   -80s   -70s   -60s   -50s   -40s   -30s   -20s   -10s   0s/now
   |       |       |       |       |      |      |  0   |      |      |      |      |      |      |     case 1
   |       |       |       |       |      |      |  0   |  0   |  0   |  0   |  0   |  0   |  0   |     case 2
   |       |       |       |       |      |      |  0   |  1   |  0   |  0   |  0   |  0   |  0   |     case 3
   |       |       |       |       |      |      |  0   |  1   |  1   |  1   |  1   |  1   |  1   |     case 4
   |       |       |       |       |      |  1   |  0   |  1   |  0   |  0   |  0   |  0   |  0   |     case 5
   |       |       |       |       |      |  1   |  0   |  1   |  1   |  1   |  1   |  1   |  1   |     case 6
   |   1   |   0   |   0   |   0   |  0   |  0   |  0   |  1   |  0   |  0   |  0   |  0   |  0   |     case 7
1: Having a 0 somewhere between -70s and -60s is mandatory for a single scrape failure.
   For every 0 more rightwards it's not yet clear which case it will end up as (well, actually
   it may already be clear if there's a 1 even further right, but that's too complex to check
   and not really needed).
   For every 0 more leftwards (further back than -70s) the alert, if any, would have already fired
   when between -70s and -60s.

   So I check this via:
       min_over_time(up[15s] offset 1m) == 0
   not really sure about the 15s ... the idea is to account for jitter, i.e. if there was
   only one 0 and that came a bit early and was already before -70s.
   I guess the question here is, what happens if I do:
       min_over_time(up[10s] offset 1m)
   and there is NO sample between -70 and -60 ?? Does it take the next older one? Or the
   next newer?

2: Should not be single-scrape-failure, but a target-down failure.
   This I get via the:
      unless max_over_time(up[1m0s]) == 0

3, 4: are actually undefined, because I didn't fill in the older numbers, so maybe there
      was another 1m full of 0 after the leftmost (which would have then been its own
      target-down alert)
5, 6: Here it's clear, the 0 between -70 and -60 must be single-scrape-failures and should
      alert, which they do already if the rule were just:
      expr: min_over_time(up[15s] offset 1m) == 0 unless max_over_time(up[1m0s]) == 0
7: These fail if we had just:
      expr: min_over_time(up[15s] offset 1m) == 0 unless max_over_time(up[1m0s]) == 0
   because, the 0 between -70 and -60 is actually NOT a single-scrape failure, but a
   part of a target-down alert.

This is, where the:
                           unless max_over_time(up[1m0s] offset 1m10s) == 0
                           unless max_over_time(up[1m0s] offset 1m   ) == 0
                           unless max_over_time(up[1m0s] offset 50s  ) == 0
                           unless max_over_time(up[1m0s] offset 40s  ) == 0
                           unless max_over_time(up[1m0s] offset 30s  ) == 0
                           unless max_over_time(up[1m0s] offset 20s  ) == 0
                           unless max_over_time(up[1m0s] offset 10s  ) == 0
come into play.
The idea is that I make a number of excluding conditions, which are the same as the expr
for target-down, just shifted around the important interval from -70 to -60:
 -130s   -120s   -110s   -100s   -90s   -80s   -70s   -60s   -50s   -40s   -30s   -20s   -10s   0s/now
   |       |       |       |       |      |      |  0   |  X   |  X   |  X   |  X   |  X   |  X   |      unless max_over_time(up[1m0s]             ) == 0
   |       |       |       |       |      |      |  0/X |  X   |  X   |  X   |  X   |  X   |      |      unless max_over_time(up[1m0s] offset 10s  ) == 0
   |       |       |       |       |      |  X   |  0/X |  X   |  X   |  X   |  X   |      |      |      unless max_over_time(up[1m0s] offset 20s  ) == 0
   |       |       |       |       |  X   |  X   |  0/X |  X   |  X   |  X   |      |      |      |      unless max_over_time(up[1m0s] offset 30s  ) == 0
   |       |       |       |   X   |  X   |  X   |  0/X |  X   |  X   |      |      |      |      |      unless max_over_time(up[1m0s] offset 40s  ) == 0
   |       |       |   X   |   X   |  X   |  X   |  0/X |  X   |      |      |      |      |      |      unless max_over_time(up[1m0s] offset 50s  ) == 0
   |       |   X   |   X   |   X   |  X   |  X   |  0/X |      |      |      |      |      |      |      unless max_over_time(up[1m0s] offset 1m   ) == 0
   |   X   |   X   |   X   |   X   |  X   |  X   |  0   |      |      |      |      |      |      |      unless max_over_time(up[1m0s] offset 1m10s) == 0

X simply denotes whether the 10s interval is part of the respective 1m interval.
0/X is simply when the important interval from -70 to -60 is also part of that, which doesn't
matter as it's anyway 0 and we use max_over_time.

So, *if* the important interval from -70 to -60 is 0, it looks at the shifted 1m intervals,
whether any of those was a target-down alert, and if so, causes the alert not to fire.


Now there are still many open questions.

First, and perhaps more rhetorical:
Why is this so hard to do in Prometheus? I know Prometheus isn't Icinga/Nagios, but there a
failed probe would immediately cause the check to go into UNKNOWN state.
For Prometheus, whose main purpose is scraping of metrics, one should assume that people may
at least have a simple way to get notified if these scrapes fail.


But more concrete questions:
1) Does the above solution sound reasonable?
2) What about my up[15s] offset 1m ... should it be only [10s]? Or something else?
   (btw: The 10+5s is obviously one scrape interval plus less than one scrape interval - I took half)
3) Should the more or less corresponding
     unless max_over_time(up[1m0s] offset 1m10s) == 0
   be rather
     unless max_over_time(up[1m5s] offset 1m10s) == 0
4) The question from above:
   > what happens if I do:
   >     min_over_time(up[10s] offset 1m)
   > and there is NO sample between -70 and -60 ?? Does it take the next older one? Or the
   > next newer?
5) I split up the time spans in chunks of 10s, which is my scrape interval.
   Is that even reasonable? Or should it rather be split up in evaluation intervals?
6) How do the above alerts depend on the evaluation interval? I mean, will they still work as expected
   if I use e.g. the scrape interval (10s)? Or could this cause the two intervals to be overlaid in just
   the wrong manner? Same if I'd use any divisor of the scrape interval, like 5s, 2s or 1s?
   What if I'd use a evaluation interval *bigger* than the scrape interval?
7) In all my above 10s intervals:
   -130s   -120s   -110s   -100s   -90s   -80s   -70s   -60s   -50s   -40s   -30s   -20s   -10s   0s/now
     |       |       |       |       |      |      |      |      |      |      |      |      |      |
   The query is always inclusive on both ends, right?
   So if a sample would lay e.g. exactly on -70s, it would count for both intervals, the one
   from -80 to -70 and the one from -70 to -60.

   I'm a bit unsure whether or not that matters for my alerts.
   Intuitively not, because my expressions all look at intervals (there is no for: Xs or so)
   and if the sample is right at the border, well that simply means both intervals have that value.
   And if there's another sample in the same interval, the max_ and min_ functions should just
   do the right thing (I... kinda guess ^^).
8) I also thought what would happen if there are multiple samples in one interval e.g.:
   -130s   -120s   -110s   -100s   -90s   -80s   -70s   -60s   -50s   -40s   -30s   -20s   -10s   0s/now
     |   1   |   1   |   1   |   1   |  1   |  1   |  0 1 |  0   |  0   |  0   |  0   |  0   |  0   |     case 8a
     |   1   |   1   |   1   |   1   |  1   |  1   |  1 0 |  0   |  0   |  0   |  0   |  0   |  0   |     case 8b

   8a, 8b: min_over_time for the -70s to -60s interval would be 0 in both cases,
           but in 8a, that would mean single-scrape-failure is lost.

           No idea how one can solve this. I guess not at all. :-(
           Perhaps by using an evaluation interval that prevents this mostly, e.g.
           7s evaluation interval for 10s scrape interval.

           Or could one solve this by using count_over_time or last_over_time?


*If* that approach of mine (largely based on Brian's ideas) would indeed work as intended...
there's still one problem left:

If one wants to make a longer period after which target-down fires (e.g. 5m, rather than 1m)
but still keep the short scrape interval of 10s, one gets an awfully big expression (which probably
doesn't execute any faster the longer it gets).

Any ideas how to make that better?


Thanks,
Chris.

Chris Siebenmann

Mar 17, 2024, 10:40:36 PM
to Christoph Anton Mitterer, Prometheus Users, Chris Siebenmann
> As a reminder, my goal was:
> - if e.g. scrapes fail for 1m, a target-down alert shall fire (similar to
> how Icinga would put the host into down state, after pings failed or a
> number of seconds)
> - but even if a single scrape fails (which alone wouldn't trigger the above
> alert) I'd like to get a notification (telling me, that something might be
> fishy with the networking or so), that is UNLESS that single failed scrape
> is part of a sequence of failed scrapes that also caused / will cause the
> above target-down alert
>
> Assuming in the following, each number is a sample value with ~10s distance
> for
> the `up` metric of a single host, with the most recent one being the
> right-most:
> - 1 1 1 1 1 1 1 => should give nothing
> - 1 1 1 1 1 1 0 => should NOT YET give anything (might be just a single
> failure,
> or develop into the target-down alert)
> - 1 1 1 1 1 0 0 => same as above, not clear yet
> ...
> - 1 0 0 0 0 0 0 => here it's clear, this is a target-down alert

One thing you can look into here for detecting and counting failed
scrapes is resets(). This works perfectly well when applied to a gauge
that is 1 or 0, and in this case it will count the number of times the
metric went from 1 to 0 in a particular time interval. You can similarly
use changes() to count the total number of transitions (either 1->0
scrape failures or 0->1 scrapes starting to succeed after failures).
It may also be useful to multiply the result of this by the current
value of the metric, so for example:

resets(up{..}[1m]) * up{..}

will be non-zero if there have been some number of scrape failures over
the past minute *but* the most recent scrape succeeded (if that scrape
failed, you're multiplying resets() by zero and getting zero). You can
then wrap this in an '(...) > 0' to get something you can maybe use as
an alert rule for the 'scrapes failed' notification. You might need to
make the range for resets() one step larger than you use for the
'target-down' alert, since resets() will also be zero if up{...} was
zero all through its range.

(At this point you may also want to look at the alert 'keep_firing_for'
setting.)
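
Put together, that might look roughly like the following (untested sketch; the
1m10s range and the keep_firing_for value are only illustrative and assume a
10s scrape interval):

    - alert: single-scrape-failure
      # non-zero only if up dropped from 1 to 0 at least once in the window
      # AND the most recent scrape succeeded (otherwise up == 0 zeroes it out)
      expr: '(resets(up[1m10s]) * up) > 0'
      for:  0s
      # keep the alert visible for a while after the blip leaves the window
      keep_firing_for: 5m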

However, my other suggestion here would be that this notification or
count of failed scrapes may be better handled as a dashboard or a
periodic report (from a script) instead of through an alert, especially
a fast-firing alert. I think it will be relatively difficult to make an
alert give you an accurate count of how many times this happened; if you
want such a count to make decisions, a dashboard (possibly visualizing
the up/down blips) or a report could be better. A program is also in the
position to extract the raw up{...} metrics (with timestamps) and then
readily analyze them for things like how long the failed scrapes tend to
last for, how frequently they happen, etc etc.

- cks
PS: This is not my clever set of tricks, I got it from other people.

Christoph Anton Mitterer

Mar 18, 2024, 12:45:46 AM
to Chris Siebenmann, Prometheus Users
Hey Chris.

On Sun, 2024-03-17 at 22:40 -0400, Chris Siebenmann wrote:
>
> One thing you can look into here for detecting and counting failed
> scrapes is resets(). This works perfectly well when applied to a gauge

Though it is documented to be used only with counters... :-/


> that is 1 or 0, and in this case it will count the number of times the
> metric went from 1 to 0 in a particular time interval. You can similarly
> use changes() to count the total number of transitions (either 1->0
> scrape failures or 0->1 scrapes starting to succeed after failures).

The idea sounds promising... especially to also catch cases like
8a, which I've mentioned in my previous mail and where the
{min,max}_over_time approach seems to fail.


> It may also be useful to multiply the result of this by the current
> value of the metric, so for example:
>
> resets(up{..}[1m]) * up{..}
>
> will be non-zero if there have been some number of scrape failures over
> the past minute *but* the most recent scrape succeeded (if that scrape
> failed, you're multiplying resets() by zero and getting zero). You can
> then wrap this in an '(...) > 0' to get something you can maybe use as
> an alert rule for the 'scrapes failed' notification. You might need to
> make the range for resets() one step larger than you use for the
> 'target-down' alert, since resets() will also be zero if up{...} was
> zero all through its range.
>
> (At this point you may also want to look at the alert 'keep_firing_for'
> setting.)

I will give that some more thinking and reply back if I should find
some way to make an alert out of this.

Well and probably also if I fail to ^^ ... at least at a first glance I
wasn't able to use that to create an alert that would behave as
desired. :/


> However, my other suggestion here would be that this notification or
> count of failed scrapes may be better handled as a dashboard or a
> periodic report (from a script) instead of through an alert, especially
> a fast-firing alert.

Well the problem with a dashboard would IMO be, that someone must
actually look at it or otherwise it would be pointless. ;-)

Not really sure how to do that with a script (which I guess would be
conceptually similar to an alert... just that it's sent e.g. weekly).

I guess I'm not so much interested in the exact times, when single
scrapes fail (I cannot correct it retrospectively anyway) but just
*that* it happens and that I have to look into it.

My assumption kinda is, that normally scrapes aren't lost. So I would
really only get an alert mail if something's wrong.
And even if the alert is flaky, like in 1 0 1 0 1 0, I think one could
still reduce the mails at the alertmanager level?


> I think it will be relatively difficult to make an alert give you an
> accurate count of how many times this happened; if you want such a count
> to make decisions, a dashboard (possibly visualizing the up/down blips)
> or a report could be better. A program is also in the position to
> extract the raw up{...} metrics (with timestamps) and then readily
> analyze them for things like how long the failed scrapes tend to last
> for, how frequently they happen, etc etc.

Well that sounds like quite some effort... and I already think that my
current approaches required far too much of an effort (and still don't
fully work ^^).
As said... despite not really being comparable to Prometheus: in
Icinga a failed sensor probe would be immediately noticeable.


Thanks,
Chris.

Ben Kochie

Mar 18, 2024, 3:12:37 AM
to Christoph Anton Mitterer, Chris Siebenmann, Prometheus Users
I usually recommend throwing out any "But this is how Icinga does it" thinking.

The way we do things in Prometheus for this kind of thing is to simply think about "availability".

For any scrape failures:

    avg_over_time(up[5m]) < 1

For more than one scrape failure (assuming 15s intervals)

    avg_over_time(up[5m]) < 0.95

This is a much easier way to think about "uptime".
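
As a rule, that could be as simple as (a sketch; the alert name is just an example, and the threshold follows from the window and scrape interval - a 5m window of 15s scrapes holds ~20 samples, so a single failure gives an average of 0.95):

    - alert: ScrapeAvailabilityDegraded
      # < 1 catches any failed scrape; < 0.95 requires more than one failure
      # out of the ~20 samples in the 5m window
      expr: 'avg_over_time(up[5m]) < 0.95'
      for:  0s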

Also, if you want, there is the new "keep_firing_for" alerting option.


Christoph Anton Mitterer

Mar 21, 2024, 10:31:52 PM
to Prometheus Users

I've been looking into possible alternatives, based on the ideas given here.

I) First one completely different approach might be:
- alert: target-down
  expr: 'max_over_time( up[1m0s] ) == 0'
  for:  0s
and: (
- alert: single-scrape-failure
  expr: 'min_over_time( up[2m0s] ) == 0'
  for:  1m
or
- alert: single-scrape-failure
  expr: 'resets( up[2m0s] ) > 0'
  for:  1m
or perhaps even
- alert: single-scrape-failure
  expr: 'changes( up[2m0s] ) >= 2'
  for:  1m
(which would however behave a bit different, I guess)
)

plus an inhibit rule, that silences single-scrape-failure when
target-down fires.
The for: 1m is needed, so that target-down has a chance to fire
(and inhibit) before single-scrape-failure does.
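
Such an inhibit rule might look roughly like this in alertmanager.yml (a sketch; it assumes the newer matchers syntax and that both alerts share the instance and job labels):

    inhibit_rules:
      - source_matchers:
          - 'alertname="target-down"'
        target_matchers:
          - 'alertname="single-scrape-failure"'
        equal: ['instance', 'job']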

I'm not really sure, whether that works in all cases, though,
especially since I look back much more (and the additional time
span further back may undesirably trigger again).


Using for: > 0 seems generally a bit fragile for my use-case (because I want to capture even single scrape failures, but with for: > 0 it needs at least two evaluations to actually trigger, so my evaluation period must be small enough that it's evaluated at least twice during the scrape interval).

Also, I guess the scrape intervals and the evaluation intervals are not synced, so with for: 0s, when I look back e.g. [1m] and assume a certain number of samples in that range, it may be that there are actually more or less.


If I forget about the above approach with inhibiting, then I need to consider cases like:
----time---->
- 0 1 0 0 0 0 0 0
first zero should be a single-scrape-failure, the last 6 however a
target-down
- 1 0 0 0 0 0 1 0 0 0 0 0 0
same here, the first 5 should be a single-scrape-failure, the last 6
however a target-down
- 1 0 0 0 0 0 0 1 0 0 0 0 0 0
here however, both should be target-down
- 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
or
1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
here, 2x target-down, 1x single-scrape-failure




II) Using the original {min,max}_over_time approach:
- min_over_time(up[1m]) == 0
tells me, there was at least one missing scrape in the last 1m.
but that alone would already be the case for the first zero:
. . . . . 0
so:
- for: 1m
was added (and the [1m] was enlarged)
but this would still fire with
0 0 0 0 0 0 0
which should however be a target-down
so:
- unless max_over_time(up[1m]) == 0
was added to silence it then
but that would still fail in e.g. the case when a previous
target-down runs out:
0 0 0 0 0 0 -> target down
the next is a 1
0 0 0 0 0 0 1 -> single-scrape-failure
and some similar cases,

Plus the usage of for: >0s is - in my special case - IMO fragile.



III) So in my previous mail I came up with the idea of using:
- alert: target-down
  expr: 'max_over_time( up[1m0s] ) == 0'
  for:  0s
- alert: single-scrape-failure
  expr: 'min_over_time(up[15s] offset 1m) == 0 unless max_over_time(up[1m0s]) == 0
                       unless max_over_time(up[1m0s] offset 1m10s) == 0
                       unless max_over_time(up[1m0s] offset 1m) == 0
                       unless max_over_time(up[1m0s] offset 50s) == 0
                       unless max_over_time(up[1m0s] offset 40s) == 0
                       unless max_over_time(up[1m0s] offset 30s) == 0
                       unless max_over_time(up[1m0s] offset 20s) == 0
                       unless max_over_time(up[1m0s] offset 10s) == 0'
  for:  0m
The idea was that, when I don't use for: >0s, the first time
window where one can be really sure (in all cases) whether
it's a single-scrape-failure or a target-down is a 0 between -70s and
-60s:
 -130s   -120s   -110s   -100s   -90s   -80s   -70s   -60s   -50s   -40s   -30s   -20s   -10s   0s/now
   |       |       |       |       |      |      |  0   |      |      |      |      |      |      |
   |       |       |       |       |      |      |      |      |      |      |  1   |  0   |  1   |     case 1
   |       |       |       |       |      |      |  0   |  0   |  0   |  0   |  0   |  0   |  0   |     case 2
   |       |       |       |   1   |  0   |  0   |  0   |  0   |  0   |  0   |  0   |  1   |  1   |     case 3
In case 1 it would already be clear when the zero is between -20
and -10.
But if there's a sequence of zeros, it takes until -70s to -60s
before it becomes clear.

Now the zero in that time span could also be that of a target-down
sequence of zeros like in case 3.
For these cases, I had the shifted silencers that each looked over
1m.

Looked good at first, though there were some open questions.
At least one main problem remained, namely that it would fail in e.g. this case:
 -130s   -120s   -110s   -100s   -90s   -80s   -70s   -60s   -50s   -40s   -30s   -20s   -10s   0s/now
   |   1   |   1   |   1   |   1   |  1   |  1   |  0 1 |  0   |  0   |  0   |  0   |  0   |  0   |     case 8a
The zero between -70s and -60s would be noticed, but still be
silenced, because the 1 in the same interval would not be.




Chris Siebenmann suggested to use resets() ... and keep_firing_for:, which Ben Kochie suggested, too.

First, I didn't quite understand how the latter would help me. Maybe I have the wrong mindset for it, so could you guys please explain what your idea was with keep_firing_for:?




IV) resets() sounded promising at first, but while I tried quite some
variations, I wasn't able to get anything working.
First, something like
resets(up[1m]) >= 1
alone (with or without a for: >0s) would already fire in case of:
----time---->
1 0
which still could become a target-down but also in case of:
1 0 0 0 0 0 0
which is a target down.
And I think even if I add some "unless ..." I'd still have the
problem as above in (II), that I get a false positive alert, when
a true target-down sequence moves through.
So just like in (III) I'd need those shifted silencers.

resets(up[1m]) >= 2
wouldn't work either e.g. in case of:
1 0 1 1 1 1 1 1
there simply is no 2nd reset.

I even tried a variant where the target-down must come first in the
rules definition:
- alert: target-down
  expr: 'up == 0'
  for:  1m   <- for is needed here, or I get no ALERTS
- alert: single-scrape-failure
  expr: 'resets(up[1m0s]) > 0 unless on (instance,job) ALERTS{alertname="target-down"}'
  for:  0m
and where I then used ALERTS trying to filter ... but no success.

V) Instead of resets() I tried changes() (which, unlike resets(), is not
defined only for counters):
- alert: target-down
  expr: 'max_over_time( up[1m0s] ) == 0'
  for:  0s
- alert: single-scrape-failure
  expr:

using just
changes(up[1m]) >= 1
does of course not work, as it could be an incoming target-down
1 0 0 0 0 0 0
or an outgoing one:
0 0 0 0 0 0 1

using
changes(up[1m]) >= 2
seems promising at first: if I have e.g.
1 1 1 1 0 1
it's already clear that it's a single-scrape-failure...
but it could be something like 0 0 0 0 0 0 1 1 0 0 0
i.e. an outgoing target-down and something that may still become
one.
 

using
changes(up[1m5s]) >= 2
unless max_over_time(up[1m0s] offset 1m) == 0
unless max_over_time(up[1m0s] offset 50s) == 0
unless max_over_time(up[1m0s] offset 40s) == 0
unless max_over_time(up[1m0s] offset 30s) == 0
unless max_over_time(up[1m0s] offset 20s) == 0
i.e. I used the above and again filtered with the shifted 1m time spans (no
need to look at offset 0s or 10s).

But that fails e.g. in the case of
0 0 0 0 0 0 1 0 1 1 1 1 1 1 1
(i.e. a target-down followed by a single-scrape-failure followed by
OK)




VI) avg_over_time.
I guess I might just not understand what you mean, but at least
something like:
expr: 'avg_over_time(up[1m10s]) < 1 and avg_over_time(up[1m10s]) > 0'
for: 1m
fails already in the simple case of
0 0 0 0 0 1
where it gives a false alert after the target-down


Well... I guess I'm at my wits' end and this might simply not be possible with PromQL.

Cheers,
Chris.

Brian Candler

Mar 22, 2024, 4:20:45 AM
to Prometheus Users
Personally I think you're looking at this wrong.

You want to "capture" single scrape failures?  Sure - it's already being captured.  Make yourself a dashboard.

But do you really want to be *alerted* on every individual one-time scrape failure?  That goes against the whole philosophy of alerting, where alerts should be "urgent, important, actionable, and real".  A single scrape failure is none of those.

If you want to do further investigation when a host has more than N single-scrape failures in 24 hours, sure. But firstly, is that urgent enough to warrant an alert? If it is, then you also say you *don't* want to be alerted on this when a more important alert has been sent for the same host in the same time period.  That's tricky to get right, which is what this whole thread is about. Like you say: alertmanager is probably not the right tool for that.

How often do you get hosts where:
(1) occasional scrape failures occur; and
(2) there are enough of them to make you investigate further, but not enough to trigger any alerts?

If it's "not often" then I wouldn't worry too much it anyway (check a dashboard), but in any case you don't want to waste time trying to bend existing tooling to work in ways it wasn't intended for. That is: if you need suitable tooling, then write it.

It could be as simple as a script doing one query per day, using the same logic I just outlined above:
- identify hosts with scrape failures above a particular threshold over the last 24 hours
- identify hosts where one or more alerts have been generated over the last 24 hours (there are metrics for this)
- subtract the second set from the first set
- if the remaining set is non-empty, then send a notification

You can do this in any language of your choice, or even a shell script with promtool/curl and jq.
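
As a rough sketch, the core of such a script could be a single query along these lines (the threshold of 3 failed scrapes and the alert name are just examples; run it once a day via the HTTP API or promtool):

    # hosts with more than 3 failed scrapes in the last 24h,
    # minus hosts for which the target-down alert fired in that period
      (count_over_time(up[24h]) - sum_over_time(up[24h])) > 3
    unless on (instance, job)
      max_over_time(ALERTS{alertname="general_target-down", alertstate="firing"}[24h]) == 1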

Christoph Anton Mitterer

Apr 4, 2024, 12:18:59 PM
to Prometheus Users
Hey.

On Friday, March 22, 2024 at 9:20:45 AM UTC+1 Brian Candler wrote:
You want to "capture" single scrape failures?  Sure - it's already being captured.  Make yourself a dashboard.

Well as I've said before, the dashboard always has the problem that someone actually needs to look at it.
 
But do you really want to be *alerted* on every individual one-time scrape failure?  That goes against the whole philosophy of alerting, where alerts should be "urgent, important, actionable, and real".  A single scrape failure is none of those.

I guess in the end I'll see whether or not I'm annoyed by it. ;-)
 

How often do you get hosts where:
(1) occasional scrape failures occur; and
(2) there are enough of them to make you investigate further, but not enough to trigger any alerts?

So far I've seen two kinds of nodes, those where I never get scrape errors, and those where they happen regularly - and probably need investigation.


Anyway... I think I might have found a solution, which - if some
assumptions I've made are correct - I'm somewhat confident that
it works, even in the strange cases.


The assumptions I've made are basically three:
- Prometheus does that "faking" of sample times, and thus these are
  always on point with exactly the scrape interval between each.
  This in turn should mean, that if I have e.g. a scrape interval of
  10s, and I do up[20s], then regardless of when this is done, I get
  at least 2 samples, and in some rare cases (when the evaluation
  happens exactly on a scrape time), 3 samples.
  Never more, never less.
  Which for `up` I think should be true, as Prometheus itself
  generates it, right, and not the exporter that is scraped.
- The evaluation interval is sufficiently less than the scrape
  interval, so that it's guaranteed that none of the `up`-samples are
  being missed.
- After some small time (e.g. 10s) it's guaranteed that all samples
  are in the TSDB and a query will return them.
  (basically, to counter the observation I've made in ...)
- Both alerts run in the same alert group, and that means (I hope) that
  each query in them is evaluated with respect to the very same time.

With that, my final solution would be:
    - alert: general_target-down   (TD below)
      expr: 'max_over_time(up[1m] offset 10s) == 0'
      for:  0s
    - alert: general_target-down_single-scrapes   (TDSS below)
      expr: 'resets(up[20s] offset 60s) >= 1  unless  max_over_time(up[50s] offset 10s) == 0'
      for:  0s

And that seems to actually work for at least practical cases (of
course it's difficult to simulate the cases where the evaluation
happens right on time of a scrape).

For anyone who'd ever be interested in the details, and why I think that works in all cases,
I've attached the git logs where I describe the changes in my config git below.

Thanks to everyone for helping me with that :-)

Best wishes,
Chris.


(needs a mono-spaced font to work out nicely)
TL/DR:
-------------------------------------------------
commit f31f3c656cae4aeb79ce4bfd1782a624784c1c43
Author: Christoph Anton Mitterer <cale...@gmail.com>
Date:   Mon Mar 25 02:01:57 2024 +0100

    alerts: overhauled the `general_target-down_single-scrapes`-alert
    
    This is a major overhaul of the `general_target-down_single-scrapes`-alert,
    which turned out to have been quite an effort that went over several months.
    
    Before this branch was merged, the `general_target-down_single-scrapes`-alert
    (from now on called “TDSS”) had various issues.
    While the alert did stop to fire, when the `general_target-down`-alert (from now
    on called “TD”) started to do so, that alone meant that it would still also fire
    when scrapes failed which eventually turned out to be an actual TD.
    For example the first few (< ≈7) `0`s would have caused TDSS to fire which would
    seamlessly be replaced by a firing TD (unless any `1`s came in between).
    
    Assumptions made below:
    • The scraping interval is `10s`.
    • If a (single) time series for the `up`-metric is given like `0 1 0 0 1`, the
      time goes from left (farther back in time) to right (less far back in
      time).
    
    I) Goals
    ********
    There should be two alerts:
    • TD
      Is for general use and similar to Icinga’s concept of host being `UP` or
      `DOWN` (with the minor difference, that an unreachable Prometheus target does
      not necessarily mean that a host is `DOWN` in that sense).
      It should fire after scraping has failed for some time, for example one
      minute (which is assumed form now on).
    • TDSS
      Since Prometheus is all about monitoring metrics, it’s of interest whether the
      scraping fails, even if it's only every now and then for very short amounts of
      time, because in those cases samples are lost.
      TD will notice any scraping failures that last for more than its time, but
      won’t notice any that last less.
      TDSS shall notice these, but only fire if they are not part of an already
      ongoing TD and neither will be part of one.
      The idea is that it is an alert for the monitoring itself.
    
    Whether each firing alert actually results in a notification being sent is of
    course a different matter and depends on the configuration of the
    `alertmanager` (the current route that matches the alert name
    `general_target-down_single-scrapes` in `alertmanager.yml` should cause every
    single firing alert to be sent).
    Nevertheless, TDSS should fire even for only a single `0` surrounded by `1`s.
    
    Examples (below, the position of the `:` is “now”):
    1 1 1 1 1 1 1: neither alert fires

    
    1 1 1 1 1 1 0
    1 1 1 1 1 0 0
    1 1 1 1 0 0 0
    1 1 1 0 0 0 0
    1 1 0 0 0 0 0: neither alert shall fire yet (it may become either a TD or a
                   TDSS)
    
    1 0 0 0 0 0 0: TD shall fire

    
    1 1 1 1 1 0 1
    1 1 1 1 0 0 1
    1 1 1 0 0 0 1
    1 1 0 0 0 0 1
    1 0 0 0 0 0 1: TDSS shall fire, not necessarily immediately (that is: exactly
                   with the most recent `1`) but at least eventually, and stop
                   firing.
    
    1 1 1 0 1 0 1
    1 1 0 1 0 0 1
    1 0 0 1 0 0 1: TDSS shall fire, stop firing, fire again and stop firing again.
    
              1 0 1 0 0 0 0 0 0: TDSS shall fire, stop firing, then TD shall fire.
    1 0 0 0 0 0 0 1 0 0 0 0 0 0: TD shall fire, stop firing, and fire again.
    
    II) Prometheus’ Mode Of Operation
    *********************************
    Neither an alert’s `for:` (which is however not used here anyway) nor the
    queries work in terms of numbers of samples – both work in terms of time
    durations.
    There is no way to make a query like `metric<6 samples>`, which would then
    (assuming a scrape interval of 10s) be some time around 1 minute. Instead a
    query like `metric[1m]` gives any samples from now until 1m ago.
    Usually, this will be 6 samples; in some cases it may be 7 samples (namely when
    the request is made exactly at the time of a sample), in principle it might even
    be only 5 samples (namely when there is jitter and the samples aren’t recorded
    exactly on time), and for most metrics it could be any other number down to 0
    (namely if metrics couldn’t be scraped for some reason).
    
    `up` is however special and “generated” by Prometheus itself, so it should
    always be there, even if the target couldn’t be scraped.
    
    Moreover, Prometheus (at least within some tolerance) fakes (see [0]) the times
    of samples to be straight on time, so for example a query like `up[1m]` will
    result in times/samples like:
    1711333608.175 "1"
    1711333618.175 "1"
    1711333628.175 "1"
    1711333638.175 "1"
    1711333648.175 "1"
    1711333658.175 "1"
    here, all exactly at `*.175`.
    This means that, relative to some starting point in time, the samples are
    scraped like this:
    +0s   +10s  +20s
     ├─────┼─────┼┈
     ⓢ     ⓢ     ⓢ
     ╵     ╵     ╵
    Above and below, the 0s, +10s and +20s are scraping and sample times.
    If Prometheus wouldn’t fake the times of samples ⓢ, this might instead look
    like:
     0s   +10s  +20s
     ├─────┼─────┼┈
    ⓢ│     │ⓢ   ⓢ│
    ⓢ│     ⓢ     │ⓢ
     ╵     ╵     ╵
    This would then even further complicate what might happen if the “moving”
    behaviour of queries (as described below) is applied on top of that.
    
    With all the above, a query like `up[20s]` may give the following:
    -20s  -10s   0s
     ├─────┼─────┤
     │    ⓢ│    ⓢ│
     │   ⓢ │   ⓢ │
     │  ⓢ  │  ⓢ  │
     │ ⓢ   │ ⓢ   │
     │ⓢ    │ⓢ    │
     ⓢ     ⓢ     ⓢ
     ╵     ╵     ╵
    Above, the -20s, -10s and 0s are **not** the interval points at which scraping
    is performed – they’re rather the duration (which will later be intentionally a
    multiple of the scrape interval) which the query “looks back”, for visualisation
    separated in pieces of the length of the scrape interval. This will also be the
    case in later illustrations where -Ns is used.
    As the query may happen at any time, the samples ⓢ (which, as described above,
    happen exactly on time, that is always exactly one scrape interval apart from
    each other) “move” within the window depending on when the query is made.
    If the query is made exactly “at” the time of a scraping, one will even get 3
    samples (because they, as described above, happen exactly on time).
    A query like `up[20s] offset 50s` would work analogously, just shifted.
    
    With respect to some fixed sample times, and queries made at subsequent times
    this would look like the following:
           …00.314s  …10.314s  …20.314s
    ┊         ┊         ┊         ┊
    ⓢ┊  ┊  ┊  ⓢ┊  ┊     ⓢ┊  ┊  ┊  ⓢ
     └──┊──┊──┊┴──┊──┊──┊┘  ┊  ┊  ┊    query 1, 2 samples
        └──┊──┊───┴──┊──┊───┘  ┊  ┊    query 2, 2 samples
           └──┊──────┴──┊──────┘  ┊    query 3, 2 samples
              └─────────┴─────────┘    query 4 (exactly at a sample time), 3 samples
    
    It follows from all this that the examples in (I) above are actually only
    correct in the usual case and are a bit misleading as to how Prometheus
    (respectively its queries, and thus its alerts) works.
    It's not 6 consecutive `0`s as in:

    1 0 0 0 0 0 0
    that cause TD to fire, but having only `0`s for a time duration of 1m back from
    the current evaluation time:
       -1m                      0s
        ├───────────────────────┤
       1│  0   0   0   0   0   0│
      1 │ 0   0   0   0   0   0 │
     1  │0   0   0   0   0   0  │
    1   0   0   0   0   0   0   0
        ╵                       ╵
    
    III) Failed Approaches
    **********************
    In order to fulfil the goals from (I), various approaches were tried with
    quite some effort.
    Each of them ultimately failed for some reason.
    Some of them are listed here for educational purposes and as a caution about
    which alternatives may fail in subtle ways.
    
    These approaches were discussed at [1].
    
    a) Using `min_over_time()` and `max_over_time()`.
       Based on an idea from Brian Candler an
       expression for the TDSS like:
       ```
       min_over_time(up[1m10s]) == 0  unless  max_over_time(up[1m10s]) == 0
       ```
       with a `for:`-value of `1m` and an expression for the TD like:
       ```
       up == 0
       ```
       with a `for:`-value of `1m` was tried.
       The expression for the latter was later changed to:
       ```
       max_over_time(up[1m]) == 0
       ```
       with a `for:`-value of `0s` in order to make sure that TD would fire exactly
       when the same term would silence the TDSS.
       This was tried with evaluation intervals of `10s` and `7s`.
    
       The TDSS never fired with time durations of exactly `1m` (as used by the
       TD) – it needed to be longer. But that alone already seemed fragile because
       of the differing time durations between TDSS and TD.
       Also, it generally failed when a TD was quickly (probably within ≈ `1m10s`)
       followed by further `0`s, for example:
       0 0 0 0 0 0 0 1 0: This would have first caused TDSS to become pending, after
                          the 6th or 7th `0` TD would have fired (while TDSS would
                          have still been pending), after the `1` TD would have
                          stopped firing and with the next `0`, TDSS would have
                          wrongly fired.
                          Something similar would have happened with the `for: 1m`-
                          based TD.
    
       In [1] it was also suggested to use different time durations in the TDSS,
       for example an expression like:
       ```
       min_over_time(up[1m20s]) == 0  unless  max_over_time(up[1m0s]) == 0
       ```
       with a `for:`-value of `1m10s`.
       This however seemed to have the same issues as above and to be even more
       fragile with respect to the overlapping time windows.
    
    b) Using `min_over_time()` only on a critical time window with shifted
       silencers.
       The solution from (a) was extended to a TDSS with an expression like:
       ```
       min_over_time(up[15s] offset 1m) == 0
       unless  max_over_time(up[1m] offset 1m10s) == 0
       unless  max_over_time(up[1m] offset 1m   ) == 0
       unless  max_over_time(up[1m] offset 50s  ) == 0
       unless  max_over_time(up[1m] offset 40s  ) == 0
       unless  max_over_time(up[1m] offset 30s  ) == 0
       unless  max_over_time(up[1m] offset 20s  ) == 0
       unless  max_over_time(up[1m] offset 10s  ) == 0
       unless  max_over_time(up[1m]             ) == 0
       ```
       with a `for:`-value of `0s` and a TD like above.
       This was tried with an evaluation interval of `8s`.
       Using `15s` instead of `10s` was just to account for jitter (which should,
       however, not happen anyway – see (II) above) and should otherwise not
       matter.
       The idea was to look only at the time window from -(1m+15s) to -1m (at which
       it is always clear whether a series of `0`s becomes a TD or a TDSS – though
       it may also already be clear earlier) for a `0` and silence the alert if it’s
       actually part of a longer series that forms a TD.
    
       There were a number of issues with this approach: it was again fragile with
       the many overlapping time windows. Further investigation would have been
       necessary on whether the many shifted silencers might wrongly silence a true
       TDSS in certain time-series patterns or – less problematic – fail to silence
       a wrong TDSS. Changing the expression to cover a TD time that is longer than
       `1m` (while the scrape interval stays short) would have led to very large
       (and more complex to evaluate) expressions.
    
       It was originally believed that the main problem was a fundamental flaw in
       the usage of `min_over_time()` on the critical time window, if jitter had
       happened like in:
       -80s  -70s  -60s             0s
        ├─────┼─────┼───────────────┤
        │1    │0   0│    0  ⋯  0    │    case 1
        │1    │0   1│    0  ⋯  0    │    case 2
        │1    │1   0│    0  ⋯  0    │    case 3
        │1    │1   1│    0  ⋯  0    │    case 4
        ╵     ╵     ╵               ╵
       In cases 1–3, `min_over_time(up[15s] offset 1m)` would have yielded `0` and
       in case 4 it would have yielded `1` – all as intended. The silencer with
       `max_over_time(up[1m])` would have silenced the alert in all cases, which
       would however have been wrong in case 2, where the single `0` would have gone
       through unnoticed.
       However, as described in (II) above, jitter should be prevented by
       Prometheus faking the times of samples, and thus a query like `up[10s]` (and
       similarly with `15s`) should give two samples (assuming a scrape interval of
       `10s`) only if the evaluation happens exactly at the time where both samples
       are exactly at the boundaries like in:
       -80s  -70s  -60s             0s
        ├─────┼─────┼───────────────┤
        │1    0     0    0  ⋯  0    │    case 1
        │1    0     1    0  ⋯  0    │    case 2
        │1    1     0    0  ⋯  0    │    case 3
        │1    1     1    0  ⋯  0    │    case 4
        ╵     ╵     ╵               ╵
       In cases 1–3, `min_over_time(up[15s] offset 1m)` would have yielded `0` and
       in case 4 it would have yielded `1` – all as intended. The silencer with
       `max_over_time(up[1m])` would have silenced the alert only in cases 1 and 3 –
       again, all as intended.
    
       It might have been possible to get this approach working, but at the time it
       was thought that any jitter (which, however, apparently cannot happen) would
       break it – and even without this issue, the shifted silencers might have
       caused other problems.
    
    c) Using `changes()` only on a critical time window with shifted silencers.
       Chris Siebenmann proposed to use
       `resets()`, which was however at first considered not feasible, as for
       example a time series like `1 0` would already have caused at least a simple
       expression to make TDSS fire, while this might still turn out to be a TD.
    
       Instead, the solution from (b) was modified to use `changes()` in an
       expression like:
       ```
       changes(up[1m5s]) > 1
       unless  max_over_time(up[1m] offset 1m ) == 0
       unless  max_over_time(up[1m] offset 50s) == 0
       unless  max_over_time(up[1m] offset 40s) == 0
       unless  max_over_time(up[1m] offset 30s) == 0
       unless  max_over_time(up[1m] offset 20s) == 0
       ```
       with a `for:`-value of `0s` and a TD like above.
    
       This had similar issues as approach (b): it again seemed fragile because of
       the many overlapping time windows. Cases were found where the silencers would
       wrongly silence a TDSS, which was then lost. This happened sometimes for a
       time series like 0 0 0 0 0 0 0 1 0 1 1 …, that is 7 or 8 `0`s which cause a
       TD, a single `1`, followed by a single `0` (which should be a TDSS), followed
       by only `1`s. Sometimes (but not always) the single `0` was not detected as a
       TDSS. This was probably dependent on how the respective evaluation time of
       the alert is shifted compared to the sample times.
    
       One property of this approach would have been that it fires earlier in some
       (but not all) cases.
    
    d) Using `resets()` only on a critical time window with a silencer.
       Eventually, a probable solution was found by again looking primarily at only
       a critical time window, however with `resets()`, a single (non-shifted)
       silencer and a handler for a case where that silencer would wrongly silence
       the TDSS.
       The TD was used as above (that is: the version that uses the expression
       `max_over_time(up[1m]) == 0` with a `for:`-value of `0s`).
    
       In the first version of this approach, the following expression for the TDSS
       was looked at:
       ```
       (
        resets(up[20s] offset 1m) >= 1
        unless  max_over_time(up[1m]) == 0
       )
       or
       (
        resets(up[20s] offset 1m) >= 1
        and  changes(up[20s] offset 1m) >= 2
        and  sum_over_time(up[20s] offset 1m) >= 2
       )
       ```
       where the critical time window is from -80s to -60s (that is exactly before
       the time window for the TD), but which failed at least in a case like:
           -80s  -70s  -60s            -10s   0s
       ┈┈┈┈┈┼─────┼─────┼───────────────┼─────┼┈┈┈┈┈
            │  1  │  1  │    0  ⋯  0    │  0  │  1      1st step
         1  │  1  │  0  │    0  ⋯  0    │  1  │         2nd step
       in which first the TD would have fired and then the TDSS.
       It might have been possible to solve that in several ways, for example by
       using `sum_over_time()` or by trying a shifted silencer.
    
       Eventually it also turned out that – given how the scraping and alert rule
       evaluation works, and especially as there’s no time jitter with samples – the
       whole second term after the `or` was not needed.
    
    IV) Testing commands
    ********************
    For the final solution described below (but also in similar forms for the
    previous approaches), the following commands (each executed in another terminal)
    or similar were used for testing:
    • Printing the currently pending or firing alerts:
      ```
      while :; do curl -g 'http://localhost:9090/api/v1/alerts' 2>/dev/null | jq '.data.alerts' | grep -E 'alertname|state'; date +%s.%N; printf '\n'; sleep 1; done
      ```
    • Printing the most recent samples and their times:
      ```
      while :; do curl -g 'http://localhost:9090/api/v1/query?query=up{instance="testnode.example.org",job="node"}[1m20s]' 2>/dev/null | jq '.data.result[0].values' | grep '[".]' | paste - - | sed $'3i \n'; printf '%s\n' -------------------------------; sleep 1; done
      ```
    • Causing `0`s:
      ```
      iptables -A OUTPUT --destination testnode.example.org -p tcp -m tcp -j REJECT --reject-with tcp-reset
      ```
    • Causing `1`s again:
      For example by removing the above rule again (e.g. by reloading the netfilter rules).
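
    In addition to the live testing above, the behaviour for specific time-series
    patterns from (I) can presumably also be checked offline with
    `promtool test rules`. A minimal sketch, assuming the final TD/TDSS rules from
    (V) below live in a file `alerts_general.yml` (the file name, the series labels
    and the chosen evaluation time are illustrative assumptions):
    ```
    # single-scrape-failure_test.yml – run with:
    #   promtool test rules single-scrape-failure_test.yml
    rule_files:
      - alerts_general.yml     # assumed to contain the TD and TDSS rules from (V)
    evaluation_interval: 5s    # kept below the 10s scrape interval, as required
    tests:
      - interval: 10s          # one input value per scrape interval, starting at t=0s
        input_series:
          - series: 'up{job="node", instance="testnode.example.org"}'
            values: '1 1 1 1 1 0 1 1 1 1 1 1'   # a single failed scrape at t=50s
        alert_rule_test:
          # at t=100s the single 0 (at t=50s) sits in the critical window (-70s…-50s): TDSS fires …
          - eval_time: 100s
            alertname: general_target-down_single-scrapes
            exp_alerts:
              - exp_labels:
                  instance: testnode.example.org
                  job: node
          # … while TD must not fire, as the last 1m was not all-`0`
          - eval_time: 100s
            alertname: general_target-down
            exp_alerts: []
    ```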
    
    V) Final Solution
    *****************
    The final solution is based on the one shown in (III.d), but uses an
    overlapping critical time window and a shortened silencer.
    The TD was used as above (that is: the version that uses the expression
    `max_over_time(up[1m]) == 0` with a `for:`-value of `0s`).
    
    The critical time window has its middle exactly at the end of the time window
    of the TD, so that there’s one scrape interval’s length on each side (and thus
    it goes from -70s to -50s), and the silencer goes exactly to the right end of
    the critical time window, which gives the following expression:
    ```
    resets(up[20s] offset 50s) >= 1  unless  max_over_time(up[50s]) == 0
    ```
    with a `for:`-value of `0s`.
    
    The problem described for (III.d) cannot occur, as illustrated below:
        -70s  -60s  -50s            -10s   0s
    ┈┈┈┈┈┼─────┼─────┼───────────────┼─────┼┈┈┈┈┈
         │  1  │  0  ╎    0  ⋯  0    │  0  │  1      1st step
      1  │  0  │  0  ╎    0  ⋯  0    │  1  │         2nd step
    in which the 1st step is the “earliest” time series that can cause a TD; but
    even if this moves on and the next sample is a `1`, the TDSS won’t fire, as
    there is no longer a reset in the critical time window (because, time jitter
    being impossible for samples, the leftmost `1` must have moved out by the time
    the rightmost `1` moves in). The same is the case if the evaluation happens
    exactly at a sample time, so that the critical time window has three samples.
    
    In general, the second version of this approach works as follows:
    
    The whole solution depends heavily on time jitter of samples being impossible.
    It should however be possible to change the time duration of the TD and/or the
    scrape interval, as long as the time durations and offsets for the TDSS are
    adapted accordingly.
    Also, the evaluation interval must be sufficiently smaller than the scrape
    interval (enough to account for any evaluation-interval jitter).
    In principle it should even work (though even that was not thoroughly checked)
    if it’s equal to it, but since there may be time jitter with the evaluation
    intervals, samples might be “jumped over”. If it’s greater than the scrape
    interval, samples are definitely “jumped over”. In those cases, the check would
    fail to produce reliable results.
    Also, the TD and TDSS must be in the same alert group, to ensure that they’re
    evaluated relative to the same current time.
    
    A time window of the duration of two scrape intervals is needed, as `resets()`
    may only ever give a value > 0 if there are at least two samples, which – as
    described in (II) above – is only assured when looking back that long (which may
    however also yield up to three samples).
    If the time window were only one scrape interval long, one would need to be very
    lucky to get two samples.
    
    As in (III.d), the basic idea is to look whether a reset occurred in the
    critical time window – here from -70s to -50s – and silence the alert if the
    reset is actually or possibly the start of a TD.
    
    “Reset” means any decrease in value between two consecutive samples, as counted
    by the `resets()`-function.
    While `resets()` is documented for use with counters only, it seems to work with
    gauges too, and especially with `up` (whose samples can only have the values `0`
    or `1`) in a reasonable way.
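
    For illustration, applied to `up`’s possible values this means `resets()`
    effectively counts the `1 → 0` transitions between consecutive samples within
    the window:
    ```
    samples in window: 1 1 1    →  resets() = 0
    samples in window: 1 0      →  resets() = 1
    samples in window: 0 1      →  resets() = 0
    samples in window: 1 0 1 0  →  resets() = 2
    ```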
    
    Since there is – as described in (II) above – no jitter with the sample times,
    the 20s critical time window has always either two samples or three.
    
    It consists of two halves, each the size of a scrape interval, and in
    particular it is impossible that one half contains two samples and the other
    only one – both halves always contain the same number of samples (if there are
    three in total, the middle sample is “shared” by both halves).
    
    If there are two samples (which then are not exactly on the boundaries) one gets
    the following types of cases (at the moment where TDSS might fire):
    -70s  -60s  -50s              0s    r │ m₅₀ ┃ TDSS       m₆₀ ┃  TD
     ├─────┼─────┼────────────────┤    ───┼─────╂───────    ─────┼───────
     │  0  │  0  ╎    0  ⋯  0     │     0 │  0  ┃   -         0  │ fires
     │  0  │  0  ╎ at least one 1 │     0 │  1  ┃   -         1  │   -       case 2
     ├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤    ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈┈    ┈┈┈┈┈┼┈┈┈┈┈┈┈
     │  0  │  1  ╎    0  ⋯  0     │     0 │  0  ┃   -         1  │   -       case 2
     │  0  │  1  ╎ at least one 1 │     0 │  1  ┃   -         1  │   -       case 2
     ├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤    ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈┈    ┈┈┈┈┈┼┈┈┈┈┈┈┈
     │  1  │  0  ╎    0  ⋯  0     │     1 │  0  ┃   -ₛ        0  │ fires
     │  1  │  0  ╎ at least one 1 │     1 │  1  ┃ fires       1  │   -       case 1
     ├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤    ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈┈    ┈┈┈┈┈┼┈┈┈┈┈┈┈
     │  1  │  1  ╎    0  ⋯  0     │     0 │  0  ┃   -         1  │   -
     │  1  │  1  ╎ at least one 1 │     0 │  1  ┃   -         1  │   -
     │  1  │  1  ╎    1  ⋯  1     │     0 │  1  ┃   -         1  │   -
    
    with “r” being `resets(up[20s] offset 50s)`, “m₅₀” being
    `max_over_time(up[50s])` and “m₆₀” being `max_over_time(up[1m])` as well as with
    `ₛ` meaning that the TDSS was silenced by `max_over_time(up[50s]) == 0`.
    
    This gives the desired firing behaviour.
    
    Of course, “at least one 1” might contain time series like 1 0 1 and TDSS would
    still not fire – but it would later, as the time series moves through the
    critical time window.
    Similarly in case 1, where the TDSS does fire, it may do so again later,
    depending on the “at least one 1”.
    
    In cases 2, the alert (either a TD or a TDSS) for the consecutive `0`s on the
    left side would have already fired earlier.
    
    If there are three samples (which then are exactly on the boundaries) one gets
    the following types of cases (at the moment where TDSS might fire):
    -70s  -60s  -50s              0s    r │ m₅₀ ┃ TDSS       m₆₀ │  TD
     ├─────┼─────┼────────────────┤    ───┼─────╂───────    ─────┼───────
     0     0     0    0  ⋯  0     0     0 │  0  ┃   -         0  │ fires
     0     0     0 at least one 1 ⏼     0 │  1  ┃   -         1  │   -       case 1, 3
     ├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤    ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈┈    ┈┈┈┈┈┼┈┈┈┈┈┈┈
     0     0     1    anything    ⏼     0 │  1  ┃   -         1  │   -       case 1, 3
     ├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤    ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈┈    ┈┈┈┈┈┼┈┈┈┈┈┈┈
     0     1     0    0  ⋯  0     0     1 │  0  ┃   -ₛ        1  │   -       case 3
     0     1     0 at least one 1 ⏼     1 │  1  ┃ fires       1  │   -       case 2, 3
     ├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤    ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈┈    ┈┈┈┈┈┼┈┈┈┈┈┈┈
     0     1     1    anything    ⏼     0 │  1  ┃   -         1  │   -       case 1, 3
     ├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤    ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈┈    ┈┈┈┈┈┼┈┈┈┈┈┈┈
     1     0     0    0  ⋯  0     0     1 │  0  ┃   -ₛ        0  │ fires
     1     0     0 at least one 1 ⏼     1 │  1  ┃ fires       1  │   -       case 2
     1     0     0    0  ⋯  0     1     1 │  1  ┃ fires       1  │   -       case 2, 4
     ├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤    ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈┈    ┈┈┈┈┈┼┈┈┈┈┈┈┈
     1     0     1    anything    ⏼     1 │  1  ┃ fires       1  │   -       case 2
     ├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤    ┈┈┈┼┈┈┈┈┈╂┈┈┈┈┈┈┈    ┈┈┈┈┈┼┈┈┈┈┈┈┈
     1     1     0    0  ⋯  0     0     1 │  0  ┃   -ₛ        1  │   -
     1     1     0 at least one 1 ⏼     1 │  1  ┃ fires       1  │   -       case 2
     ├┈┈┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┤    ┈┈┈┼┈┈┈┈┈┼┈┈┈┈┈┈┈    ┈┈┈┈┈┼┈┈┈┈┈┈┈
     1     1     1    anything    ⏼     0 │  1  │   -         1  │   -       case 1
     1     1     1    1  ⋯  1     1     0 │  1  │   -         1  │   -       case 1
    with the same legend as above as well as with “⏼” indicating the position of the
    rightmost sample (which may be `0` or `1`) of “at least one 1”.
    
    This gives the desired firing behaviour.
    
    As above, the “at least one 1” (or the “anything”) of cases 1 may contain time
    series like 1 0 1 and TDSS would still not fire – but it would later, as the
    time series moves through the critical time window.
    Similarly in cases 2, where the TDSS does fire, it may do so again later,
    depending on the “at least one 1” (or the “anything”).
    
    As above, in cases 3, the alert (either a TD or a TDSS) for the consecutive
    `0`s on the left side would have already fired earlier.
    
    Case 4, while its time series has as many consecutive `0`s as in other cases
    could have caused a TD, is still not a TD, as a TD – per definition – requires
    only `0`s in the last `1m`.
    
    For both two and three samples in the critical time window:
    
    A firing TDSS stops doing so when the 1 0 that causes the reset in the critical
    time window “moves” out of it (it should not be possible for this to be caused
    by it getting silenced).
    The time at which a TDSS fires might vary, depending on the time jitter of the
    evaluation intervals.
    
    If the evaluation interval is sufficiently smaller than the scrape interval (to
    account for time jitter in the former), it should not be possible that samples
    are “jumped over”. One interesting example (which includes case 4 above) for
    this is the following:
    ┈┈┼─┼─┼─────────┼┈┈
      0 1 0  0 ⋯ 0  0 ⏼    1st step
    0 1 0  0 ⋯ 0  0 ⏼      2nd step
    Given the evaluation interval is sufficiently small, the leftmost `1` that
    causes the reset cannot have moved further “out” than in the 2nd step. But by
    that time, the next sample `⏼` is assured to have moved “in” and determines
    whether this is a TD or a TDSS.
    
    In the `resets(up[20s] offset 50s) >= 1`-term of the TDSS’ expression, `>= 1`
    rather than `== 1` (which in principle would work, too) was used, merely for the
    conceptual reason that the alert should rather fire as a false positive than not
    fire at all.
    
    VI) Different time durations
    ****************************
    Without having looked at it in detail: if a Prometheus additionally uses larger
    scrape intervals, the alerts should in principle still work, though for those
    targets TDSS might of course never fire, as nothing would be considered a TDSS.
    In any case, the time durations for the TDSS must be aligned to the smallest
    scrape interval in use (and the evaluation interval must be aligned to that, as
    described above).
    
    Some examples for the TDSS expression with different time durations:
    • TD time duration: `5m`, scrape interval: `10s`
      ```
      resets(up[20s] offset 290s) >= 1  unless  max_over_time(up[290s]) == 0
      ```
    • TD time duration: `1m`, scrape interval: `20s`
      ```
      resets(up[40s] offset 40s) >= 1  unless  max_over_time(up[40s]) == 0
      ```
    • TD time duration: `5m`, scrape interval: `20s`
      ```
      resets(up[40s] offset 280s) >= 1  unless  max_over_time(up[280s]) == 0
      ```
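    
    All of the above (including the original `1m`/`10s` version) seem to follow one
    pattern; as a generalisation (my own inference from these examples, not verified
    beyond them), with SI being the smallest scrape interval and T being the TD time
    duration:
    ```
    resets(up[2*SI] offset (T - SI)) >= 1  unless  max_over_time(up[T - SI]) == 0
    ```
    where `2*SI` and `T - SI` of course have to be written out as literal durations.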
    
    (Of course, the TD expression would need to be aligned, too, which should
    however be straightforward.)
    
    [0] “query for time series misses samples (that should be there), but not when
        offset is used” (https://groups.google.com/g/prometheus-users/c/mXk3HPtqLsg)
    [1] “better way to get notified about (true) single scrape failures?”
        (https://groups.google.com/g/prometheus-users/c/BwJNsWi1LhI)

commit b4b5586614b1add0f9cc71629390b8bc223b8181
Author: Christoph Anton Mitterer <cale...@gmail.com>
Date:   Fri Mar 29 05:48:13 2024 +0100

    alerts: shift `general_target-down`- and `general_target-down_single-scrapes`-alerts
    
    It was noted (see [0]) that a query like `metric[1m]` made at a time t+ε may not
    (yet) include the sample at time t if ε is sufficiently small.
    
    This could lead to wrong results with the `general_target-down`- and
    `general_target-down_single-scrapes`-alerts.
    
    For example the TD could fail in cases like this:
    ├─┼─┼─────────┤
    ┊⏼┊0┊ 0 ⋯ 0  ˽┊
    ⏼ 0 0  0 ⋯ 0  ˽
    with “⏼” being either a `0` or a `1` and with “˽” being the not yet available
    sample.
    Here, the alert would fire, which would only be correct if ˽ is a `0`.
    
    For example the TDSS could fail in cases like this:
    ├─┼─┼─────────┤
    ┊1┊0┊ 0 ⋯ 0  ˽┊
    1 0 0  0 ⋯ 0  ˽
    with the same legend as above.
    Here, the alert would not fire (because it would be silenced), which would only
    be correct if ˽ is a `0` – if it’s however a `1`, the TDSS would be missed (and
    a TD would fire instead).
    
    This is solved by shifting everything by a sufficiently large offset into the
    past, with `10s` seeming to be enough for now.
    
    This must be done for both alerts (TD and TDSS) and for all queries in their
    expressions. Those which already have an offset must of course be shifted
    further.
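    
    Concretely, the two expressions thus become (as also quoted at the top of this
    mail):
    ```
    TD:    max_over_time(up[1m]) == 0
       →   max_over_time(up[1m] offset 10s) == 0
    TDSS:  resets(up[20s] offset 50s) >= 1  unless  max_over_time(up[50s]) == 0
       →   resets(up[20s] offset 60s) >= 1  unless  max_over_time(up[50s] offset 10s) == 0
    ```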
    
    With the shifted expression, the following may be used as a testing command:
    • Printing the most recent samples and their times:
      ```
      while :; do curl -g 'http://localhost:9090/api/v1/query?query=up{instance="testnode.example.org",job="node"}[1m30s]' 2>/dev/null | jq '.data.result[0].values' | grep '[".]' | paste - - | sed $'3i \n; 9i \n'; printf '%s\n' -------------------------------; sleep 1; done
      ```
    
    [0] “query for time series misses samples (that should be there), but not when
        offset is used” (https://groups.google.com/g/prometheus-users/c/mXk3HPtqLsg)


Chris Siebenmann

unread,
Apr 4, 2024, 2:41:02 PM4/4/24
to Christoph Anton Mitterer, Prometheus Users, Chris Siebenmann
> The assumptions I've made are basically three:
> - Prometheus does that "faking" of sample times, and thus these are
> always on point with exactly the scrape interval between each.
> This in turn should mean, that if I have e.g. a scrape interval of
> 10s, and I do up[20s], then regardless of when this is done, I get
> at least 2 samples, and in some rare cases (when the evaluation
> happens exactly on a scrape time), 3 samples.
> Never more, never less.
> Which for `up` I think should be true, as Prometheus itself
> generates it, right, and not the exporter that is scraped.
> - The evaluation interval is sufficiently less than the scrape
> interval, so that it's guaranteed that none of the `up`-samples are
> being missed.

I don't believe this assumption about up{} is correct. My understanding
is that up{} is not merely an indication that Prometheus has connected
to the target exporter, but an indication that it has successfully
scraped said exporter. Prometheus can only know this after all samples
from the scrape target have been received and ingested and there are no
unexpected errors, which means that just like other metrics from the
scrape, up{} can only be visible after the scrape has finished (and
Prometheus knows whether it succeeded or not).

How long scrapes take is variable and can be up to almost their timeout
interval. You may wish to check 'scrape_duration_seconds'. Our metrics
suggest that this can go right up to the timeout (possibly in the case
of failed scrapes).

- cks

Christoph Anton Mitterer

unread,
Apr 5, 2024, 1:30:55 PM4/5/24
to Prometheus Users
Hey Chris.

On Thursday, April 4, 2024 at 8:41:02 PM UTC+2 Chris Siebenmann wrote:
> - The evaluation interval is sufficiently less than the scrape
> interval, so that it's guaranteed that none of the `up`-samples are
> being missed.

I assume you were referring to the above specific point?

Maybe there is a misunderstanding:

With the above I merely meant that my solution requires the alert rule evaluation interval to be small enough, so that when it looks at resets(up[20s] offset 60s) (which is the window from -70s to -50s PLUS an additional shift by 10s, so effectively -80s to -60s), the evaluations happen often enough that no sample can "jump over" that time window.

I.e. if the scrape interval was 10s, but the evaluation interval only 20s, it would surely miss some.
 

> I don't believe this assumption about up{} is correct. My understanding
> is that up{} is not merely an indication that Prometheus has connected
> to the target exporter, but an indication that it has successfully
> scraped said exporter. Prometheus can only know this after all samples
> from the scrape target have been received and ingested and there are no
> unexpected errors, which means that just like other metrics from the
> scrape, up{} can only be visible after the scrape has finished (and
> Prometheus knows whether it succeeded or not).

Yes, I'd have assumed so as well. Therefore I generally shifted both alerts by 10s, hoping that 10s is enough for all that.

 
> How long scrapes take is variable and can be up to almost their timeout
> interval. You may wish to check 'scrape_duration_seconds'. Our metrics
> suggest that this can go right up to the timeout (possibly in the case
> of failed scrapes).

Interesting.

I see the same (I mean entries that go up to and even a bit above the timeout). It would be interesting to know whether these are scrapes that still made it "just in time" (despite actually taking a bit longer than the timeout)... or whether these are only ones that timed out and were discarded.
Cause the name scrape_duration_seconds would kind of imply that it's the former, but I guess it's actually the latter.

So what do you think that means for me and my solution now? That I should shift all my checks even further? That is, by at least the scrape_timeout + some extra time for the data getting into the TSDB?


Thanks,
Chris.