How do I model availability?

Michael Stapelberg

May 16, 2015, 12:33:52 PM
to prometheus...@googlegroups.com
Hey,

For http://robustirc.net, I want to introduce a new metric/graph/alerting rule based on availability. In this context, the network being available means one node being the raft leader (hence accepting mutations).

My first idea was to add a goroutine which wakes up every second and increments a counter for the raft state the node is currently in. I’d end up with a map containing e.g. follower:60, candidate:5, leader:55 if the node had been running for 2 minutes (120s), had spent 5s total on elections, had been a follower for 1 minute and leader for the remaining time.
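
In Go, I picture it roughly like this (just a sketch; the metric name and the raft-state accessor are placeholders for whatever I end up using):

var secondsInState = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "seconds_in_state",
        Help: "Cumulative seconds spent in each raft state.",
    },
    []string{"state"},
)

func init() { prometheus.MustRegister(secondsInState) }

func trackRaftState() {
    for range time.Tick(time.Second) {
        // currentRaftState() is a placeholder for however I end up asking
        // hashicorp/raft for the node's current state.
        secondsInState.WithLabelValues(currentRaftState()).Inc()
    }
}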

Now, with prometheus I want to achieve 2 things:

1. Graph the availability over the past [time interval], e.g. past 1h, to judge how well we are doing with regards to our SLA.
2. Alert when the availability (of a rolling 60 minute window) is below a certain threshold, say below 59 minutes.

However, I’m not sure how best to attack this problem from the prometheus side, starting with the alert: I’d be inclined to use:

job:leader-secs:sum = sum(state-secs{state="leader"}) by (job)

ALERT AvailabilityTooLow
  IF delta(job:leader-secs:sum[1h]) < (59 * 60)
  …

However, as per http://prometheus.io/docs/querying/functions/, delta() should only be used with gauges, and I have a counter.

I think I don’t want to use rate() because then I need to specify a time window and the value gets averaged over that time window — I want a precise value, though, not an average.

Is there a way to use delta() with a counter, or am I going down the wrong path entirely?

Thanks,

Fabian Reinartz

May 16, 2015, 1:14:40 PM
to prometheus...@googlegroups.com
Hi,

rate() actually uses delta() internally and there is a way – but undocumented and discouraged.

You want to alert if the average availability over the past hour falls below 59/60.
It seems like rate() would fulfill this requirement with rate(job:leader-secs:sum[1h]) < (59 / 60).

Could you elaborate why you think it is not a good fit? Where do you want to use an absolute value in your scenario?

Michael Stapelberg

May 17, 2015, 4:14:08 PM
to Fabian Reinartz, prometheus...@googlegroups.com
On Sat, May 16, 2015 at 7:14 PM, Fabian Reinartz <fab.re...@gmail.com> wrote:
Hi,

rate() actually uses delta() internally and there is a way – but undocumented and discouraged.

You want to alert if the average availability over the past hour falls below 59/60.
It seems like rate() would fulfill this requirement with rate(job:leader-secs:sum[1h]) < (59 / 60).

Yeah, I think that approach should work, but see below.
 

Could you elaborate why you think it is not a good fit? Where do you want to use an absolute value in your scenario?

I’ve added the metric with http://git.io/vTLsD and let the network run for 30 minutes with that new binary. Here is the output of evaluating 600*rate(seconds_in_state{instance="dock0"}[10m]):

{
  "version": 1,
  "value": [
    {
      "timestamp": 1431893160.162,
      "value": "588.421052631579",
      "metric": {
        "state": "Follower",
        "job": "robustirc",
        "instance": "dock0"
      }
    },
    {
      "timestamp": 1431893160.162,
      "value": "0",
      "metric": {
        "state": "Candidate",
        "job": "robustirc",
        "instance": "dock0"
      }
    }
  ],
  "type": "vector"
}

I’m a bit concerned about the fact that I see 588 instead of the expected 600 for the value. I would have expected my expression to return the total number of seconds within the last (scraped) 10m during which the instance dock0 was in the Follower state, which I know to be 10m out of 10m. Am I misunderstanding something?

At first I thought this was because of the 10s interval and the last datapoint being a bit in the past, but then I repeated this query a bunch of times (see http://sprunge.us/JgEf) and the return value doesn’t change until prometheus has a new datapoint, so that doesn’t explain it.

Here are the contents of the seconds_in_state timeseries: http://sprunge.us/QChF — as an aside, is there a better way to query it than using e.g. http://localhost:9090/api/query_range\?end\=0\&expr\=seconds_in_state%7Binstance%3D%22dock0%22%7D\&range\=14400\&step\=10? The downside is that with the step parameter, the values seem to get interpolated, because there are fractional values in an integer counter.
 




Fabian Reinartz

May 17, 2015, 5:55:11 PM
to prometheus...@googlegroups.com, fab.re...@gmail.com
It is true that the ingested samples do not align exactly with the evaluation interval. The implementation of the rate() function attempts to correct for that.

Before we drill down further into Prometheus, I'd like to rule out one last thing on the outside. Your counter is incremented by 1 each time the goroutine wakes up from its one-second sleep. That sleep is not necessarily exact: the scheduler will wake the goroutine up after a second has passed, but possibly a bit later. This means that, in the long run, your counter increments by less than one per second.

I did a trivial example with two goroutines and measured a ratio of 1.0002. Your case would correspond to a ratio of about 1.02. For a non-trivial program, I think that could be realistic (for a one-second sleep).

Could you implement the increment in a way that eventually accounts for the difference (by counting +2 every once in a while)? time.Since() is probably the way to go here to measure the actual elapsed seconds.
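
For example, roughly like this (untested sketch; secondsInState and currentState() stand in for whatever your code actually uses):

start := time.Now()
counted := 0
for range time.Tick(time.Second) {
    elapsed := int(time.Since(start).Seconds())
    // Usually this adds 1; once the sleeps have drifted a full second behind
    // wall-clock time, it adds 2 to catch up.
    secondsInState.WithLabelValues(currentState()).Add(float64(elapsed - counted))
    counted = elapsed
}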

It would be great to verify whether this plays a role in your observations or not - just to be sure.


Björn Rabenstein

May 18, 2015, 7:37:00 AM
to Michael Stapelberg, prometheus-developers
On Sat, May 16, 2015 at 6:33 PM, Michael Stapelberg
<mic...@robustirc.net> wrote:
> My first idea was to add a goroutine which wakes up every second and
> increments a counter for the raft state the node is in, i.e. I’d end up with
> a map containing e.g. follower:60, candidate:5, leader:55 if the node was
> running for 2 minutes (120s), was busy with elections for 5s total, was a
> follower during 1 minute and leader for the remaining time.

Going back to the initial idea here:

Is 1sec by coincidence the resolution you require? If you could live
with broader resolution, you could just expose the state and scrape
every 10s, using the normal Prometheus expressions to calculate
availability.

If you need really high precision, I'd remember the time of each state
change. Then increment a counter vector with the time spent upon each
state change, (and upon scraping to get the increment for the state
currently running).

But I guess that would require meddling quite deeply with the
Hashicorp Raft code...

Would need a deeper look at their API to come up with a more detailed idea...

--
Björn Rabenstein, Engineer
http://soundcloud.com/brabenstein

SoundCloud Ltd. | Rheinsberger Str. 76/77, 10115 Berlin, Germany
Managing Director: Alexander Ljung | Incorporated in England & Wales
with Company No. 6343600 | Local Branch Office | AG Charlottenburg |
HRB 110657B

Michael Stapelberg

May 18, 2015, 3:50:46 PM
to Fabian Reinartz, prometheus...@googlegroups.com
Thanks for pointing this out. I’ve added code to deal with this issue, but it turns out it only corrects one second roughly every 2 hours.

Hence, it doesn’t matter whether we use the current or the fixed version of the code; for the initial example I posted, the results will be the same.


--
Best regards,
Michael

Michael Stapelberg

May 18, 2015, 3:53:26 PM
to Björn Rabenstein, prometheus-developers
On Mon, May 18, 2015 at 1:36 PM, Björn Rabenstein <bjo...@soundcloud.com> wrote:
On Sat, May 16, 2015 at 6:33 PM, Michael Stapelberg
<mic...@robustirc.net> wrote:
> My first idea was to add a goroutine which wakes up every second and
> increments a counter for the raft state the node is in, i.e. I’d end up with
> a map containing e.g. follower:60, candidate:5, leader:55 if the node was
> running for 2 minutes (120s), was busy with elections for 5s total, was a
> follower during 1 minute and leader for the remaining time.

Going back to the initial idea here:

Is 1sec by coincidence the resolution you require? If you could live
with broader resolution, you could just expose the state and scrape
every 10s, using the normal Prometheus expressions to calculate
availability.

1 second is the precision I want to have, yeah.

10s doesn’t cut it since leader elections take less than 10s. Exposing the state is something I already do, but experience shows that 10s scraping is too coarse and misses leader elections.
 

If you need really high precision, I'd remember the time of each state
change. Then increment a counter vector with the time spent upon each
state change, (and upon scraping to get the increment for the state
currently running).

How is that different from what I’m currently doing?

tgula...@gmail.com

May 18, 2015, 5:19:11 PM
to prometheus...@googlegroups.com
Have a time.Ticker which increments two metrics: alive-state time and total time.

This way you eliminate the dependency on scrape frequency, and can calculate "last interval aliveness".

Michael Stapelberg

May 18, 2015, 5:29:04 PM
to tgula...@gmail.com, prometheus...@googlegroups.com
Can you please elaborate on how I’d go about the calculation you have in mind?


Björn Rabenstein

May 19, 2015, 7:17:51 AM
to Michael Stapelberg, prometheus-developers
On Mon, May 18, 2015 at 9:53 PM, Michael Stapelberg
<mic...@robustirc.net> wrote:
> 1 second is the precision I want to have, yeah.
>
> 10s doesn’t cut it since leader elections take less than 10s. Exposing the
> state is something I already do, but experience shows that 10s scraping is
> too coarse and misses leader elections.

You could just scrape every second... :)

>> If you need really high precision, I'd remember the time of each state
>> change. Then increment a counter vector with the time spent upon each
>> state change, (and upon scraping to get the increment for the state
>> currently running).
>
> How is that different from what I’m currently doing?

If I understand correctly, you are currently sampling the state of the
system once per second, i.e. you'll potentially miss a state that
lasts for less than a second, and your exported totals will have a
limited accuracy. (That was the concern behind my previous remarks: If
10sec resolution is not good enough, then the odds are that at some
point, 1sec resolution is not good enough, either, and you should aim
for a solution with higher accuracy by design.)

Pseudo Go-ish code:

var timeCounter *prometheus.CounterVec // seconds spent in each raft state
var lastStateChange time.Time

func onStateChange(oldState string) {
    timeCounter.With(prometheus.Labels{"state": oldState}).Add(time.Since(lastStateChange).Seconds())
    lastStateChange = time.Now()
}

That way, you get quite precise timings. Caveat: you have to call the above on scrape as well, so that the time spent in the current state also makes it into the counters.
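
One way to handle that caveat (sketch, reusing the names above; currentState() is whatever your program uses, and in practice you need a mutex around lastStateChange because Collect runs concurrently with state changes):

type stateCollector struct {
    c *prometheus.CounterVec
}

func (s stateCollector) Describe(ch chan<- *prometheus.Desc) { s.c.Describe(ch) }

func (s stateCollector) Collect(ch chan<- prometheus.Metric) {
    // Credit the time spent in the current state so far, then export the counters.
    onStateChange(currentState())
    s.c.Collect(ch)
}

Register stateCollector{timeCounter} instead of the raw timeCounter (e.g. via prometheus.MustRegister), and the time spent in the current state gets flushed on every scrape.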

tgula...@gmail.com

May 19, 2015, 12:03:52 PM
to prometheus...@googlegroups.com, tgula...@gmail.com
On Monday, May 18, 2015 at 11:29:04 PM UTC+2, Michael Stapelberg wrote:
> Can you please elaborate on how I’d go about the calculation you have in mind?
>

Have two counters: totalSeconds and aliveSeconds.
Have a time.Ticker in a goroutine which increments totalSeconds once per second, and increments aliveSeconds iff your server is available.
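
In Go, roughly (sketch; the metric names and the isAvailable() check are just examples):

totalSeconds := prometheus.NewCounter(prometheus.CounterOpts{
    Name: "total_seconds",
    Help: "Seconds this process has been running.",
})
aliveSeconds := prometheus.NewCounter(prometheus.CounterOpts{
    Name: "alive_seconds",
    Help: "Seconds the node was available.",
})
prometheus.MustRegister(totalSeconds)
prometheus.MustRegister(aliveSeconds)

go func() {
    for range time.NewTicker(time.Second).C {
        totalSeconds.Inc()
        if isAvailable() { // your availability check
            aliveSeconds.Inc()
        }
    }
}()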

Scrape these two metrics, and then you can calculate the availability for each scrape period:
rate((availableSeconds/totalSeconds)[1m])

(I'm not sure about the syntax: you'd need (availableSeconds_t1 - availableSeconds_t0) / (totalSeconds_t1 - totalSeconds_t0) to calculate the availability rate in the [t0,t1) interval).

Hope this helps.

Matthias Rampke

May 19, 2015, 12:10:00 PM
to tgula...@gmail.com, prometheus-developers
On Tue, May 19, 2015 at 4:03 PM, <tgula...@gmail.com> wrote:

> Scrape these two metrics, and then you can calculate the availability for each scrape period:
> rate((availableSeconds/totalSeconds)[1m])
>
> (I'm not sure about the syntax: you'd need (availableSeconds_t1 - availableSeconds_t0) / (totalSeconds_t1 - totalSeconds_t0) to calculate the availability rate in the [t0,t1) interval).


I think this should work:

rate(availableSeconds[1m])/rate(totalSeconds[1m])

you can adjust the 1m to any interval you are interested in.

/MR

Michael Stapelberg

May 19, 2015, 4:22:01 PM
to Matthias Rampke, tgula...@gmail.com, prometheus-developers
Thanks, everyone!

I’ve added these rules:

job_instance:available_secs:sum = sum(seconds_in_state{state=~"Leader|Follower"}) BY (job, instance)
job_instance:total_secs:sum = sum(seconds_in_state) BY (job, instance)

And I’m now graphing this expression:

rate(job_instance:available_secs:sum[1m]) / on(instance) rate(job_instance:total_secs:sum[1m])

This seems to work so far, but it has only been running for a couple of minutes.

In case I don’t update this thread, this is the solution. Otherwise, I’ll follow up :).

Brian Brazil

May 20, 2015, 5:47:38 AM
to Michael Stapelberg, Matthias Rampke, tgula...@gmail.com, prometheus-developers
On 19 May 2015 at 21:21, Michael Stapelberg <mic...@robustirc.net> wrote:
Thanks, everyone!

I’ve added these rules:

job_instance:available_secs:sum = sum(seconds_in_state{state=~"Leader|Follower"}) BY (job, instance)
job_instance:total_secs:sum = sum(seconds_in_state) BY (job, instance)

And I’m now graphing this expression:

rate(job_instance:available_secs:sum[1m]) / on(instance) rate(job_instance:total_secs:sum[1m])

You always want to take a rate, then sum - not sum then rate. Doing it the wrong way around will cause breakage when servers restart. Otherwise I think that'll work.

Brian
 



Michael Stapelberg

May 20, 2015, 2:00:37 PM
to Brian Brazil, Matthias Rampke, tgula...@gmail.com, prometheus-developers
Ah, thanks, I always forget about that :). I’ve sent https://github.com/prometheus/docs/pull/92 so that others don’t run into the same issue.

Current attempt:

task_instance:seconds_in_state:rate = rate(seconds_in_state[1m])

job_instance:available_secs:sum_rate = sum(task_instance:seconds_in_state:rate{state=~"Leader|Follower"}) BY (job, instance)

job_instance:total_secs:sum_rate = sum(task_instance:seconds_in_state:rate) BY (job, instance)

job:availability:max_sum_rate = max(job_instance:available_secs:sum_rate / job_instance:total_secs:sum_rate)

I’m just graphing job:availability:max_sum_rate now, which should describe whether there is at least 1 monitored target which claims to be available.
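
For the alerting part of my original mail (alert when availability drops below 59 out of 60 minutes), I'm thinking of something along these lines, going back to the raw series so I get a 1h window (sketch, not deployed yet):

ALERT AvailabilityTooLow
  IF max(sum(rate(seconds_in_state{state=~"Leader|Follower"}[1h])) BY (job, instance) / sum(rate(seconds_in_state[1h])) BY (job, instance)) < (59 / 60)
  …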

Michael Stapelberg

May 21, 2015, 3:22:19 AM
to Brian Brazil, Matthias Rampke, tgula...@gmail.com, prometheus-developers
I have had the new ruleset running for a couple of hours now and I’m seeing behavior that I cannot explain. Here is a bit of data to illustrate my observations (the rules from my previous email are unchanged). seconds_in_state is the timeseries I’m scraping; its state label can be Candidate, Follower or Leader:

seconds_in_state{instance="alp", state="Follower",job="robustirc"}
108362 @1432190884.346
108362 @1432190910.836
108362 @1432190940.836
108362 @1432190970.836

seconds_in_state{instance="alp", state="Candidate",job="robustirc"}
19 @1432190884.346
19 @1432190910.836
19 @1432190940.836
19 @1432190970.836

seconds_in_state{instance="alp", state="Leader",job="robustirc"}
191398 @1432190884.346 # delta=--, Thu May 21 08:48:04 CEST 2015
191425 @1432190910.836 # delta=27, Thu May 21 08:48:30 CEST 2015
191455 @1432190940.836 # delta=30, Thu May 21 08:49:00 CEST 2015
191485 @1432190970.836 # delta=30, Thu May 21 08:49:30 CEST 2015

task_instance:seconds_in_state:rate{instance="alp", state="Leader",job="robustirc"}
1 @1432190910.899
1.0192525481313703 @1432190920.899
1.0192525481313703 @1432190930.899
1.0192525481313703 @1432190940.899
1 @1432190950.899

job_instance:available_secs:sum_rate{instance="alp",job="robustirc"}
1 @1432190910.899
1.0192525481313703 @1432190920.899
1.0192525481313703 @1432190930.899
1.0192525481313703 @1432190940.899
1 @1432190950.899

job_instance:total_secs:sum_rate{instance="alp",job="robustirc"}
1 @1432190910.899
1 @1432190920.899
1.0192525481313703 @1432190930.899
1.0192525481313703 @1432190940.899
1 @1432190950.899

job_instance:availability:sum_rate{instance="alp",job="robustirc"}
1 @1432190910.899
1 @1432190920.899
1.0192525481313703 @1432190930.899
1 @1432190940.899
1 @1432190950.899

Here are my questions:

1. The timestamp prometheus uses for the scraped timeseries seems to be the timestamp of the start of the scrape (see also https://github.com/prometheus/prometheus/blob/267fd341564d5c29755ead159a2106faf056c4f2/retrieval/target.go#L351). Wouldn’t it make more sense to use the timestamp of when the target actually replied? That should avoid the rates > 1 in the above data.

2. What’s the explanation for job_instance:total_secs:sum_rate being 1 @1432190920.899, when job_instance:available_secs:sum_rate is 1.0192525481313703 @1432190920.899? Recall that total_secs is defined as a superset of available_secs, so I would have expected it to contain precisely the same value with the data at hand:

job_instance:available_secs:sum_rate = sum(task_instance:seconds_in_state:rate{state=~"Leader|Follower"}) BY (job, instance)

job_instance:total_secs:sum_rate = sum(task_instance:seconds_in_state:rate) BY (job, instance)

Thanks in advance,

Brian Brazil

May 21, 2015, 5:40:41 AM
to Michael Stapelberg, Matthias Rampke, tgula...@gmail.com, prometheus-developers
There are various options here, and they all unfortunately result in races. Due to how it's currently implemented, the start of the scrape is the only practical option.
 

2. What’s the explanation for job_instance:total_secs:sum_rate being 1 @1432190920.899, when job_instance:available_secs:sum_rate is 1.0192525481313703 @1432190920.899? Recall that total_secs is defined as a superset of available_secs, so I would have expected it to contain precisely the same value with the data at hand:

job_instance:available_secs:sum_rate = sum(task_instance:seconds_in_state:rate{state=~"Leader|Follower"}) BY (job, instance)

job_instance:total_secs:sum_rate = sum(task_instance:seconds_in_state:rate) BY (job, instance)

Rules are currently all run at the same time, so depending on timing a rule may see data from either the previous or the current run of another rule. Being able to specify an order in which rules run is something we plan on adding.

I'd suggest combining everything into one rule to avoid this.
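
E.g. something like this as a single rule (sketch; the rule name is just an example):

job_instance:availability:rate1m = sum(rate(seconds_in_state{state=~"Leader|Follower"}[1m])) BY (job, instance) / sum(rate(seconds_in_state[1m])) BY (job, instance)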

Yours,
Brian

Julius Volz

May 21, 2015, 6:17:13 AM
to Michael Stapelberg, Brian Brazil, Matthias Rampke, tgula...@gmail.com, prometheus-developers
On Thu, May 21, 2015 at 9:21 AM, Michael Stapelberg <mic...@robustirc.net> wrote:
1. The timestamp prometheus uses for the scraped timeseries seems to be the timestamp of the start of the scrape (see also https://github.com/prometheus/prometheus/blob/267fd341564d5c29755ead159a2106faf056c4f2/retrieval/target.go#L351). Wouldn’t it make more sense to use the timestamp of when the target actually replied? That should avoid the rates > 1 in the above data.

The timestamp at the start of the scrape is usually the best approximation because that is when the client actually "snapshots" its data to be exported. I would assume that in the normal case, the transfer back to Prometheus takes longer than the initial compilation of the state to be transferred, so the "true" timestamp belonging to the transferred samples is closer to the beginning of the scrape than to the end. There can be other cases where the preparation of state for each scrape takes much longer than the transfer, but Prometheus wouldn't know to optimize for each special case...