> I am looking for a solution to calculate the total duration of each
> firing alert since it started firing. Following is the query I tried,
> but I see the value for all the firing alerts is 86400
>
> (avg_over_time(customer_ALERTS{alertstate="firing",severity="critical"}[24h]))
> *24 * 3600
You can only use avg_over_time() this way on metrics that are always
present and are either 0 or 1; the classical metric for this is 'up',
whether scrapes succeed, but there are many others that at least
approximate this (such as Blackbox 'probe_success'). Unfortunately
alerts are not like this; when an ALERTS metric is present at all, it's
always 1. Therefore the average over the time it's present is always 1.
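(For contrast, here is the usual pattern with a 0-or-1 metric like
'up'; the job label is just an illustration. The 86400 you saw is 24
hours in seconds, which is what you get when the average is always 1:

avg_over_time(up{job="node"}[24h]) * 24 * 3600

This gives roughly how many seconds each target was successfully
scraped over the past day, because 'up' is present with a value of 0
even when scrapes fail.)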
There are two possible queries to deal with this, depending on how
many assumptions you want to make about your setup. The simple version
is to count how many times the metric is present over your time period
and then multiply this by your Prometheus alert checking interval. For
example, if you check all alert rules every 15 seconds (a common default),
you would do the following to generate the total duration of every alert
in seconds over the past 24 hours:
count_over_time(ALERTS{alertstate="firing",severity="critical"}[24h]) \
* 15
Every single count of a particular ALERTS metric being present
represents a 15 second period when that particular alert was firing,
so the total amount of time (in seconds) is just that number times 15
(seconds). Note that this is subtly different from 'the total duration
of each alert from when it started firing', which Prometheus cannot
readily calculate. If a single alert fired twice during the past 24
hours, this PromQL expression will give you the total time across both
incidents, not two separate duration metrics (one for each incident).
If you don't want to embed knowledge of your alert rule interval
into your dashboard (perhaps different alert rules have different
check frequency, perhaps you want to change them later), you need to
use a subquery, which lets you explicitly specify the interval. Here
you would change the '[24h]' to '[24h:15s]'. This will likely cause
Prometheus to do more work but you probably have few enough ALERTS
metrics that this doesn't matter.
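Put together, the subquery version of the earlier expression would
look like this (still assuming you want a 15 second resolution):

count_over_time(ALERTS{alertstate="firing",severity="critical"}[24h:15s]) \
  * 15

Here the '15' multiplier comes from the subquery step, not from your
actual rule evaluation interval, so the two always stay in sync if you
change one.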
Generally you want to pick a subquery interval (the '15s') that is
at least as small as your smallest real evaluation interval. This
ensures you won't under-count metrics that only appeared briefly.
- cks