Hi,

rate() actually uses delta() internally, and there is a way to use delta() with a counter, but it is undocumented and discouraged.

You want to alert if the average availability over the past hour falls below 59/60. It seems like rate() would fulfill this requirement with

rate(job:leader-secs:sum[1h]) < (59 / 60)
Could you elaborate why you think it is not a good fit? Where do you want to use an absolute value in your scenario?
On Saturday, May 16, 2015 at 6:33:52 PM UTC+2, Michael Stapelberg wrote:

Hey,

For http://robustirc.net, I want to introduce a new metric/graph/alerting rule based on availability. In this context, the network being available means one node being the raft leader (hence accepting mutations).

My first idea was to add a goroutine which wakes up every second and increments a counter for the raft state the node is in, i.e. I’d end up with a map containing e.g. follower:60, candidate:5, leader:55 if the node had been running for 2 minutes (120s), was busy with elections for 5s total, was a follower for 1 minute and leader for the remaining time.

Now, with Prometheus I want to achieve 2 things:

1. Graph the availability over the past [time interval], e.g. the past 1h, to judge how well we are doing with regard to our SLA.
2. Alert when the availability (of a rolling 60-minute window) is below a certain threshold, say below 59 minutes.

However, I’m not sure how best to attack this problem from the Prometheus side, starting with the alert. I’d be inclined to use:

job:leader-secs:sum = sum(state-secs{state="leader"}) by (job)

ALERT AvailabilityTooLow
IF delta(job:leader-secs:sum[1h]) < (59 * 60)
…

However, as per http://prometheus.io/docs/querying/functions/, delta() should only be used with gauges, and I have a counter. I think I don’t want to use rate() because then I need to specify a time window and the value gets averaged over that time window; I want a precise value, though, not an average.

Is there a way to use delta() with a counter, or am I going down the wrong path entirely?

Thanks,
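For what it’s worth, here is a minimal sketch of that counter-per-state idea using prometheus/client_golang. The metric name seconds_in_state matches the one used later in this thread; currentRaftState() is a hypothetical accessor, not actual RobustIRC code:

package main

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

// seconds_in_state counts how many seconds this node has spent in each
// raft state ("Leader", "Follower", "Candidate").
var secondsInState = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "seconds_in_state",
        Help: "Seconds spent in each raft state.",
    },
    []string{"state"},
)

func init() {
    prometheus.MustRegister(secondsInState)
}

// trackState wakes up every second and increments the counter for the
// state the node is currently in. currentRaftState is a hypothetical
// accessor returning e.g. "Leader".
func trackState(currentRaftState func() string) {
    ticker := time.NewTicker(time.Second)
    for range ticker.C {
        secondsInState.WithLabelValues(currentRaftState()).Inc()
    }
}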
On Sat, May 16, 2015 at 6:33 PM, Michael Stapelberg
<mic...@robustirc.net> wrote:
> My first idea was to add a goroutine which wakes up every second and
> increments a counter for the raft state the node is in, i.e. I’d end up with
> a map containing e.g. follower:60, candidate:5, leader:55 if the node was
> running for 2 minutes (120s), was busy with elections for 5s total, was a
> follower during 1 minute and leader for the remaining time.
Going back to the initial idea here:
Is 1s by any chance the resolution you require? If you could live
with a coarser resolution, you could just expose the state and scrape
every 10s, using the normal Prometheus expressions to calculate
availability.
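A rough sketch of that lower-resolution variant (same import as in the earlier sketch; raft_state and setRaftState() are made-up names), exposing the current state as a 0/1 gauge per state and letting the normal scrape pick it up:

// raft_state is 1 for the state this node is currently in and 0 otherwise;
// availability can then be derived from the scraped samples.
var raftState = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "raft_state",
        Help: "1 for the current raft state of this node, 0 otherwise.",
    },
    []string{"state"},
)

// setRaftState is a hypothetical hook, called on every state transition.
func setRaftState(current string) {
    for _, s := range []string{"Leader", "Follower", "Candidate"} {
        v := 0.0
        if s == current {
            v = 1.0
        }
        raftState.WithLabelValues(s).Set(v)
    }
}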
If you need really high precision, I'd remember the time of each state
change. Then, on each state change, increment a counter vector by the
time spent in the previous state (and do the same upon scraping, to get
the increment for the state currently running).
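Here is a rough sketch of that high-precision variant as a custom collector, so the still-running interval of the current state is added at scrape time. stateTracker, newStateTracker and StateChanged are made-up names; it assumes the prometheus import from the earlier sketch plus sync and time:

// stateTracker accumulates the exact time spent in each raft state and
// reports the increment for the currently running state at scrape time.
type stateTracker struct {
    mu          sync.Mutex
    accumulated map[string]float64 // seconds per state for finished intervals
    current     string
    lastChange  time.Time
    desc        *prometheus.Desc
}

func newStateTracker(initial string) *stateTracker {
    return &stateTracker{
        accumulated: map[string]float64{initial: 0},
        current:     initial,
        lastChange:  time.Now(),
        desc: prometheus.NewDesc("seconds_in_state",
            "Seconds spent in each raft state.", []string{"state"}, nil),
    }
}

// StateChanged is called on every raft state transition and adds the time
// spent in the previous state.
func (t *stateTracker) StateChanged(newState string) {
    t.mu.Lock()
    defer t.mu.Unlock()
    now := time.Now()
    t.accumulated[t.current] += now.Sub(t.lastChange).Seconds()
    if _, ok := t.accumulated[newState]; !ok {
        t.accumulated[newState] = 0
    }
    t.current = newState
    t.lastChange = now
}

func (t *stateTracker) Describe(ch chan<- *prometheus.Desc) { ch <- t.desc }

// Collect reports the accumulated time plus the time spent so far in the
// state currently running, so no precision is lost between scrapes.
func (t *stateTracker) Collect(ch chan<- prometheus.Metric) {
    t.mu.Lock()
    defer t.mu.Unlock()
    for state, secs := range t.accumulated {
        if state == t.current {
            secs += time.Since(t.lastChange).Seconds()
        }
        ch <- prometheus.MustNewConstMetric(t.desc, prometheus.CounterValue, secs, state)
    }
}

Registering it with prometheus.MustRegister(newStateTracker("Follower")) would then expose seconds_in_state with exact per-state timing.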
Have a time.Ticker which increments two metrics: time spent in an alive state, and total time.
This way the result does not depend on the scrape frequency, and you can calculate "last interval aliveness".
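That is the same per-second ticker pattern, just reduced to two plain counters. A sketch, using the availableSeconds/totalSeconds names quoted further down; isAlive() is a made-up predicate (e.g. "currently Leader or Follower"), and the imports are as in the first sketch:

var (
    availableSeconds = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "availableSeconds",
        Help: "Seconds during which this node was in an alive state.",
    })
    totalSeconds = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "totalSeconds",
        Help: "Seconds this node has been running.",
    })
)

// trackAliveness increments both counters once per second, independent of
// how often Prometheus scrapes.
func trackAliveness(isAlive func() bool) {
    ticker := time.NewTicker(time.Second)
    for range ticker.C {
        totalSeconds.Inc()
        if isAlive() {
            availableSeconds.Inc()
        }
    }
}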
Thanks, everyone!

I’ve added these rules:

job_instance:available_secs:sum = sum(seconds_in_state{state=~"Leader|Follower"}) BY (job, instance)
job_instance:total_secs:sum = sum(seconds_in_state) BY (job, instance)

And I’m now graphing this expression:

rate(job_instance:available_secs:sum[1m]) / on(instance) rate(job_instance:total_secs:sum[1m])
This seems to work so far, but I’ve only had it running for a couple of minutes. In case I don’t update this thread, this is the solution. Otherwise, I’ll follow up :).

On Tue, May 19, 2015 at 6:09 PM, Matthias Rampke <m...@soundcloud.com> wrote:

On Tue, May 19, 2015 at 4:03 PM, <tgula...@gmail.com> wrote:
> Scrape these two metrics, and then you can calculate the availability for each scrape period:
> rate((availableSeconds/totalSeconds)[1m])
>
> (I'm not sure about the syntax: you'd need (availableSeconds_t1 - availableSeconds_t0) / (totalSeconds_t1 - totalSeconds_t0) to calculate the availability rate in the [t0,t1) interval).
I think this should work:
rate(availableSeconds[1m])/rate(totalSeconds[1m])
you can adjust the 1m to any interval you are interested in.
/MR
1. The timestamp Prometheus uses for the scraped time series seems to be the timestamp of the start of the scrape (see also https://github.com/prometheus/prometheus/blob/267fd341564d5c29755ead159a2106faf056c4f2/retrieval/target.go#L351). Wouldn’t it make more sense to use the timestamp of when the target actually replied? That should avoid the rates > 1 in the above data.

2. What’s the explanation for job_instance:total_secs:sum_rate being 1 @1432190920.899, when job_instance:available_secs:sum_rate is 1.0192525481313703 @1432190920.899? Recall that total_secs is defined as a superset of available_secs, so with the data at hand I would have expected it to contain precisely the same value:

job_instance:available_secs:sum_rate = sum(task_instance:seconds_in_state:rate{state=~"Leader|Follower"}) BY (job, instance)
job_instance:total_secs:sum_rate = sum(task_instance:seconds_in_state:rate) BY (job, instance)