Understanding the endsAt parameter and resend-delay.

akshay sharma

Aug 29, 2021, 8:53:47 AM
to Prometheus Users
Hi, 

Recently I've been debugging an issue where an alert resolves even though Prometheus still shows it as firing.
So the cycle is firing -> resolving -> firing.

After going through some documents and blogs, I found out that Alertmanager will resolve an alert if Prometheus doesn't re-send it within the "resolve_timeout".
However, if Prometheus sends the endsAt field to Alertmanager, it sets a short deadline after which Alertmanager may mark the alert as resolved. This overrides the resolve_timeout setting in Alertmanager and produces the firing->resolved->firing behaviour whenever Prometheus does not resend the alert before that short deadline passes.

Is that understanding correct?

My questions are as follows:
1) How is the endsAt time calculated? Is it derived from resend_delay?
2) What is the default value of resend_delay, and how can I check this configuration? In which file is it defined?

3) Are the msg="Received alert" log lines written when Prometheus sends alerts to Alertmanager? And when do the msg=flushing lines get logged? (logs below)

4) With evaluation_interval: 1m and scrape_interval: 1m, why is there a 2m gap between the alert received at 12:34 and the one received at 12:36?
     Also, when I send a GET request for alerts to Alertmanager, the endsAt time is 4 minutes after the last received alert. Why is that? Is my resend_delay 4m? I never set a resend_delay value.

Below are the logs from Alertmanager:

level=debug ts=2021-08-29T12:34:40.342Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=disk_utilization[6356c43][active]
level=debug ts=2021-08-29T12:34:40.342Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=disk_utilization[1db5352][active]

level=debug ts=2021-08-29T12:34:40.381Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{name=~\"^(?:test-1)$\"}:{alertname=\"disk_utilization\"}" msg=flushing alerts="[disk_utilization[6356c43][active] disk_utilization[1db5352][active]]"
level=debug ts=2021-08-29T12:35:10.381Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{name=~\"^(?:test-1)$\"}:{alertname=\"disk_utilization\"}" msg=flushing alerts="[disk_utilization[6356c43][active] disk_utilization[1db5352][active]]"
level=debug ts=2021-08-29T12:35:40.382Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{name=~\"^(?:test-1)$\"}:{alertname=\"disk_utilization\"}" msg=flushing alerts="[disk_utilization[6356c43][active] disk_utilization[1db5352][active]]"
level=debug ts=2021-08-29T12:36:10.382Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{name=~\"^(?:test-1)$\"}:{alertname=\"disk_utilization\"}" msg=flushing alerts="[disk_utilization[6356c43][active] disk_utilization[1db5352][active]]"

level=debug ts=2021-08-29T12:36:40.345Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=disk_utilization[6356c43][active]
level=debug ts=2021-08-29T12:36:40.345Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=disk_utilization[1db5352][active]

GET request to Alertmanager:
curl http://10.233.49.116:9092/api/v1/alerts
{"status":"success","data":[{"labels":{"alertname":"disk_utilization","device":"xx.xx.xx.xx:/media/test","fstype":"nfs4","instance":"xx.xx.xx.xx","job":"test-1","mountpoint":"/media/test","node_name":"test-1","severity":"critical"},"annotations":{"summary":"Disk utilization has crossed x%. Current Disk utilization = 86.823044624783"},"startsAt":"2021-08-29T11:28:40.339802555Z","endsAt":"2021-08-29T12:40:40.339802555Z","generatorURL":"x","status":{"state":"active","silencedBy":[],"inhibitedBy":[]},"receivers":["test-1"],"fingerprint":"1db535212ea6dcf6"},{"labels":{"alertname":"disk_utilization","device":"test","fstype":"ext4","instance":"xx.xx.xx.xx","job":"Node_test-1","mountpoint":"/","node_name":"test-1","severity":"critical"},"annotations":{"summary":"Disk utilization has crossed x%. Current Disk utilization = 94.59612027578963"},"startsAt":"2021-08-29T11:28:40.339802555Z","endsAt":"2021-08-29T12:40:40.339802555Z","generatorURL":"x","status":{"state":"active","silencedBy":[],"inhibitedBy":[]},"receivers":["test-1"],"fingerprint":"6356c43dc3589622"}]}



thanks,
Akshay

Brian Candler

Aug 30, 2021, 7:32:13 AM
to Prometheus Users
Are you sure that's your problem?  Can you show your complete alerting rule and its enclosing rule group?

When starting an alert, the expression has to return a value for a certain amount of time ("for:") before the alert triggers.  But the converse is not true: if the expr value disappears, even for a single evaluation cycle, the alert is immediately resolved.

Therefore, try entering your alert expr in the PromQL browser, and look for any gaps in it.  Any gap will resolve the alert.
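If it's hard to eyeball the graph, a query along these lines can flag gaps directly (a sketch only — the label matchers are placeholders for whatever your rule actually uses, and the [2m] window assumes a 1m scrape interval):

absent_over_time(node_filesystem_avail_bytes{name=x}[2m])

This returns a value of 1 at any timestamp where the metric had no samples at all in the preceding 2 minutes — i.e. exactly the gaps that would resolve your alert.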

akshay sharma

Aug 31, 2021, 3:19:05 AM
to Brian Candler, Prometheus Users
Hi Brian,

Alerting rule:

- name: disk utilization increasing
  rules:
  - alert: disk_utilization
    annotations:
      summary: Disk utilization increased x%
    expr: 100 - (node_filesystem_avail_bytes{name=x}/node_filesystem_size_bytes{name=x}
      * 100) > x%
    for: 5m
    labels:
      severity: critical

Alertmanager route configuration:

receivers:
- name: x
route:
  group_by:
  - alertname
  group_interval: 30s
  group_wait: 30s
  receiver: x
  repeat_interval: 8h

Is there any way to avoid the alert resolving when the expr values disappear? (The values disappear because the node exporter running on the host goes down at some point.)
Do you mean that if we are fetching metrics from an exporter, that exporter must always be up and running while we are monitoring it?

And I want to understand the points below.


My questions are as follows:
1) How is the endsAt time calculated? Is it derived from resend_delay?
2) What is the default value of resend_delay, and how can I check this configuration? In which file is it defined?

3) Are the msg="Received alert" log lines written when Prometheus sends alerts to Alertmanager? And when do the msg=flushing lines get logged? (see logs above)

4) With evaluation_interval: 1m and scrape_interval: 1m, why is there a 2m gap between the alert received at 12:34 and the one received at 12:36?
     Also, when I send a GET request for alerts to Alertmanager, the endsAt time is 4 minutes after the last received alert. Why is that? Is my resend_delay 4m? I never set a resend_delay value.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/0af8717a-791a-49a1-9efc-f256273854b3n%40googlegroups.com.

Brian Candler

Aug 31, 2021, 4:09:06 AM
to Prometheus Users
I can't answer your questions about endsAt because I've never had any problem with this, so I've never had to dig into it.  As long as your alert rule evaluation interval is something sensible like 1m or 15s then it should be fine.  (This is defined at the level of the rule *group*, or the global evaluation_interval if not specified there)

It sounds like the problem is that the alerting expression is resolving transiently - see also https://groups.google.com/g/prometheus-users/c/yVL8e257VvU

To prove this, enter the alerting expression as-is into the PromQL browser:

100 - (node_filesystem_avail_bytes{name=x}/node_filesystem_size_bytes{name=x} * 100) > x%

Graph it over time and look for small gaps.  And/or trim it to

100 - (node_filesystem_avail_bytes{name=x}/node_filesystem_size_bytes{name=x} * 100)

and look for values which are above and below x% which could cause the alert to be resolved briefly before re-firing.

There's no easy way around this except to make your alerting expressions more robust.  For example, suppose you're alerting on

expr: up == 0

but the value of "up" goes like this when a machine becomes overloaded:  1 1 1 1 1 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 ...

You're going to get a new alert every time it goes 0 1 0.  So instead you might write it like this:

expr: min_over_time(up[10m]) == 0

This will alert as soon as it goes to 0, but will only stop alerting when it has been at 1 for 10 minutes continuously (or the entire metric has been missing for 10 minutes continuously).  This is straightforward, but unfortunately more sophisticated cases are tricky. Also, it's harder to cope with occasional failures like 1 1 1 1 1 1 0 1 1 1 1 1 1 without sending spurious alerts.  You could try building an expression with count_over_time instead, or by joining on the ALERTS metric which exists for alerts that are already firing, although joining on metrics which might not exist is not easy.
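For example, a count_over_time-based sketch (assuming a 1m scrape interval, so there are roughly 10 samples in a 10m window — adjust the numbers to your setup):

count_over_time(up[10m]) - sum_over_time(up[10m]) >= 3

count_over_time counts how many samples exist in the window and sum_over_time adds up their values, so the difference is the number of samples where "up" was 0.  Requiring at least 3 zeros means a single failed scrape doesn't fire the alert.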

Aside 1: the divide operator only generates results where the numerator and denominator have exactly the same set of labels, so your expression can be simplified to

100 - (node_filesystem_avail_bytes{name=x}/node_filesystem_size_bytes * 100) > x%

without any change in behaviour.

Aside 2: if the idea is to warn when a disk is *soon going to be full* then you can do something more sophisticated than static thresholds: e.g.

- name: DiskRate3h
  interval: 10m
  rules:
  # Warn if rate of growth over last 3 hours means filesystem will fill in 2 days
  - alert: DiskFilling
    expr: |
      predict_linear(node_filesystem_avail_bytes{instance!~"foo|bar|baz",fstype!~"fuse.*|nfs.*"}[3h], 2*86400) < 0
    for: 6h
    labels:
      severity: warning
    annotations:
      summary: 'Filesystem will be full in less than 2d at current 3h growth rate'

This means that a disk which is sitting at 91% full but unchanging won't alert, but one which goes from 91% to 92% to 93% over 3 hours will alert.  It's something to consider anyway, perhaps in conjunction with higher static thresholds so you get a quick alert when the filesystem reaches 98% (say).
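Such a static-threshold companion rule might look like this (the 98% threshold, rule name, and labels here are purely illustrative):

- name: DiskStatic
  rules:
  - alert: DiskNearlyFull
    expr: |
      100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) > 98
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: 'Filesystem is over 98% full'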

Remember that "alerts" in principle are things which may get someone out of bed, and should be something that can be immediately actioned, and resolved (or at worst silenced while the problem is fixed soon).  If you have alerts which are continuously firing, it leads to "alert fatigue" very quickly. There's an excellent document here (by an ex-Google site reliability engineer): https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/

akshay sharma

Aug 31, 2021, 10:00:53 AM
to Brian Candler, Prometheus Users
Thanks Brian,

So is min_over_time the only option to avoid that scenario (the alert resolving when the metric goes missing)? Because min_over_time takes a range vector, it will keep the alert in the firing state for the duration of that range.
But what if the service/exporter doesn't come back for, say, a day — or sometimes an hour? How could we possibly make that work?



Brian Candler

Aug 31, 2021, 10:19:16 AM
to Prometheus Users
Sorry, I don't understand what you're saying.  It's true that min_over_time works over a range vector, and it returns an instant vector.  What's the problem?

An alert will not be marked as firing if there is no time series returned by the "expr" in the alerting rule.  It is the presence or absence of a time series (not its value) which triggers an alert.

Perhaps you can provide a concrete example of an alerting rule, and what the problem is (e.g. what behaviour you see that you don't want to see).

akshay sharma

Aug 31, 2021, 2:15:37 PM
to Brian Candler, Prometheus Users
Okay, 

The use case is: I have an alert which should be raised when the mentioned threshold is crossed. Say it is firing now; during the next evaluation interval the exporter/target goes down, so Prometheus gets no metrics and marks the alert as resolved.

I want the alert to stay in the firing state until the target comes back, or Prometheus gets the metrics back.

How can I avoid this?

Stuart Clark

Aug 31, 2021, 2:24:35 PM
to promethe...@googlegroups.com, akshay sharma, Brian Candler, Prometheus Users
In general I don't think you can. You can use the absent function, but that will also alert if the scrape fails (e.g. the host is down). You could also use one of the _over_time functions, but that wouldn't be based on the last sample before scrapes stopped.

However, taking a step back, why do you want to keep alerting if scrapes fail? At that point you have no idea what is actually happening, so the server might be fine (but a firewall issue has stopped scrapes). You'd end up wasting time investigating a server when you don't need to.

Instead you'd generally have two alerts - one for your threshold and one for scrape failures. You'd then investigate whichever fires.
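A sketch of that two-alert pattern (the names, thresholds, and labels here are illustrative, not taken from your config):

groups:
- name: disk
  rules:
  # Threshold alert: fires while the metric exists and is over the limit
  - alert: disk_utilization
    expr: 100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) > 90
    for: 5m
    labels:
      severity: critical
  # Scrape-failure alert: fires when the target stops responding
  - alert: scrape_failed
    expr: up == 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: 'Target {{ $labels.instance }} is not being scraped'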
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

akshay sharma

Aug 31, 2021, 3:51:15 PM
to Stuart Clark, Prometheus Users, Brian Candler
Ok @Stuart Clark, thanks for reaching out.

Would you help me understand the queries below?


1) How is the endsAt time calculated? Is it derived from resend_delay?
2) What is the default value of resend_delay, and how can I check this configuration? In which file is it defined?

3) Are the msg="Received alert" log lines written when Prometheus sends alerts to Alertmanager? And when do the msg=flushing lines get logged? (see the mail thread below)