startsAt is correct, but endsAt in the Alertmanager API does not match. What does endsAt actually stand for?


Rahul Hada

Apr 21, 2020, 6:06:50 AM
to Prometheus Users
We have configured Alertmanager to send alert notifications to different mediums. Below is sample output for one alert from the Alertmanager API. What do startsAt and endsAt actually refer to here? Please help, as endsAt does not seem to be the time the alert ended.

**************
Mountpoint: /mnt/vol1 ","summary":"High Disk Usage on 172.20.11.80:9100 - dh4-k1-og-ws-n1.foo.in on the filesystem /mnt/vol1"},"startsAt":"2020-04-13T10:40:53.612166298+05:30","endsAt":"2020-04-21T15:26:53.612166298+05:30","generatorURL":"http://dh4-k1-infra-prometheus-n1.foo.in:9090/graph?g0.expr=%28%28node_filesystem_size_bytes%7Bfstype%21~%22nfs.%2A%22%7D+-+node_filesystem_avail_bytes%7Bfstype%21~%22nfs.%2A%22%7D%29+%2F+node_filesystem_size_bytes%7Bfstype%21~%22nfs.%2A%22%7D+%2A+100+%3E+90%29+%2A+on%28instance%29+group_left%28nodename%29+node_uname_info\u0026g0.tab=1","status":{"state":"active","silencedBy":[],"inhibitedBy":[]},"receivers":["eben_api"]

****************

Thanks in Advance

Brian Brazil

Apr 21, 2020, 6:10:02 AM
to Rahul Hada, Prometheus Users
endsAt is an implementation detail of how alerting is done reliably so that a brief disruption of alerts making it to the Alertmanager won't be an issue. Alerts are basically leases, and endsAt is when the lease is up.

Basically you should never rely on either value, outside of debugging Prometheus alerting in and of itself.
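
To make the lease idea concrete, here is a minimal Python sketch. It is a toy model only, not Alertmanager's actual code: the `LeaseStore` class, the `receive` and `is_active` methods, and the 4-minute `LEASE` length are all hypothetical names and values chosen for illustration. It shows why endsAt keeps moving forward while an alert is still being resent, and why it is not an observed end time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical lease length; the real value depends on Prometheus settings.
LEASE = timedelta(minutes=4)


@dataclass
class Alert:
    name: str
    starts_at: datetime
    ends_at: datetime  # lease expiry, not an observed end time


class LeaseStore:
    """Toy stand-in for the receiving side; not Alertmanager's real code."""

    def __init__(self):
        self.alerts = {}

    def receive(self, name, now):
        """Record a (re)sent alert and push its lease forward."""
        alert = self.alerts.get(name)
        if alert is None:
            alert = Alert(name, starts_at=now, ends_at=now + LEASE)
            self.alerts[name] = alert
        else:
            # A resend keeps starts_at and only extends the lease.
            alert.ends_at = now + LEASE
        return alert

    def is_active(self, name, now):
        """An alert counts as active until its lease runs out."""
        alert = self.alerts.get(name)
        return alert is not None and now < alert.ends_at


t0 = datetime(2020, 4, 21, 10, 0, tzinfo=timezone.utc)
store = LeaseStore()
store.receive("HighDiskUsage", t0)
store.receive("HighDiskUsage", t0 + timedelta(minutes=1))  # refresh

alert = store.alerts["HighDiskUsage"]
print(alert.starts_at, alert.ends_at)
print(store.is_active("HighDiskUsage", t0 + timedelta(minutes=3)))
print(store.is_active("HighDiskUsage", t0 + timedelta(minutes=6)))
```

In this model a missed resend does not clear the alert at once; it only goes inactive when nothing renews the lease before ends_at, which is the tolerance for brief disruptions described above.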

--

Yagyansh S. Kumar

Apr 21, 2020, 6:12:43 AM
to Prometheus Users
So we can't rely on Prometheus' internal ALERTS_FOR_STATE, or on endsAt and startsAt either.

Then, what should we use to get the age of alerts?

Brian Brazil

Apr 21, 2020, 6:20:31 AM
to Yagyansh S. Kumar, Prometheus Users
On Tue, 21 Apr 2020 at 11:12, Yagyansh S. Kumar <yagyans...@gmail.com> wrote:
So we can't rely on Prometheus' internal ALERTS_FOR_STATE, or on endsAt and startsAt either.

Then, what should we use to get the age of alerts?

I'd personally look at it on the relevant graph for the metric.

An alert firing indicates that an issue has gotten so bad that a human needs to be called in to investigate. How long it has been firing is then a matter of on-call response time, and thus not interesting in terms of debugging the problem. What's more interesting is how the system behaved as it approached the threshold: for example, was it a sudden spike, or did it grow gradually over time?

Brian
 
--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/690a9167-ba7e-4a91-bfcb-6bbd6b8f8b25%40googlegroups.com.


--

Yagyansh S. Kumar

Apr 21, 2020, 6:48:36 AM
to Prometheus Users
That is indeed true.

But the age of the alert is useful for keeping a check on response time and several other things. So what, in your view, is the best way to get the age of a particular alert?

Rahul Hada

Apr 21, 2020, 6:49:08 AM
to Prometheus Users
Thanks for the reply, Brian. A metric for the age of an alert would help show which particular server in the cluster has been above the threshold, for team members who do not have access to Grafana or are not familiar with it and rely only on Alertmanager. Is there an open thread on progress toward such a metric, or should we start one?

Brian Candler

Apr 21, 2020, 7:31:31 AM
to Prometheus Users
Alertmanager doesn't cover all use cases, and in particular I don't think it does much in the way of time-based escalation. However, you can forward alerts to a higher-level management system like OpsGenie, VictorOps, PagerDuty, etc.