Rule evaluation at exact intervals


Alin Sînpălean

May 28, 2018, 5:41:53 AM
to Prometheus Developers
[Warning: possibly clueless proposal coming up.]

In my past interactions with Brian and Björn regarding the rate() implementation and its (in my view) limitations, one of the arguments against the improvements I was proposing was that there would always be errors/noise/aliasing caused by jitter in the scraping and evaluation intervals. I took that as granted and tried to argue my way around it, but it occurred to me just now that (theoretically speaking, at least) there need not be any eval interval jitter.

More specifically, I think it should be fully possible to set the timestamps of successive rule evaluations to be exactly evaluation_interval apart from one another: even if the eval run is scheduled a bit late, the evaluation timestamp would still be exactly one evaluation_interval after the previous one. I might be missing some very obvious (to others) reason why this is not feasible, but assuming I'm not, I think this would be an improvement over the current behavior, which seems to be to do the eval at whatever timestamp the rule evaluation happens to be scheduled.
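
To make this concrete, here's a minimal Go sketch of the scheduling I have in mind (names and structure made up for illustration, not Prometheus's actual rules manager): the evaluation timestamp advances by exactly one interval per run, independently of when the run actually gets scheduled.

package main

import (
    "fmt"
    "time"
)

// evalLoop stamps every evaluation with ts = firstTs + n*interval, even if the
// goroutine wakes up late; only the wall-clock start of the run drifts, not the
// timestamp the rules are evaluated at. (Illustrative sketch only.)
func evalLoop(interval time.Duration, evaluate func(ts time.Time)) {
    ts := time.Now() // timestamp of the first evaluation
    for {
        evaluate(ts)
        ts = ts.Add(interval)      // exactly one interval later, no jitter
        time.Sleep(time.Until(ts)) // may oversleep, but ts stays on the grid
    }
}

func main() {
    evalLoop(10*time.Second, func(ts time.Time) {
        fmt.Println("evaluating rules as of", ts.UTC())
    })
}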

The upside of this would be that it would make it possible to reliably summarize stats over longer intervals from stats over shorter intervals: e.g. max_over_time(job:cpu_utilization:max_over_time_10m[1h]) would reliably return the same value as max_over_time(cpu_utilization[1h]), except in extreme cases (such as skipped evaluations or Prometheus being down). The downside would be that rule eval might be a few seconds late, in that a rule evaluation run that is 10 seconds late would compute the rule values from 10 seconds ago. I think this is definitely worth the tradeoff (again, assuming I'm not missing anything obvious).

Any thoughts?

Cheers,
Alin.

Brian Brazil

May 28, 2018, 6:22:05 AM
to Alin Sînpălean, Prometheus Developers
On 28 May 2018 at 10:41, Alin Sînpălean <alin.si...@gmail.com> wrote:
[Warning: possibly clueless proposal coming up.]

In my past interactions with Brian and Björn regarding the rate() implementation and its (in my view) limitations, one of the arguments against the improvements I was proposing was that there would always be errors/noise/aliasing caused by jitter in the scraping and evaluation intervals. I took that as granted and tried to argue my way around it, but it occurred to me just now that (theoretically speaking, at least) there need not be any eval interval jitter.

More specifically, I think it should be fully possible to set the timestamps of successive rule evaluations to be exactly evaluation_interval apart from one another: even if the eval run is scheduled a bit late, the evaluation timestamp would still be exactly one evaluation_interval after the previous one. I might be missing some very obvious (to others) reason why this is not feasible, but assuming I'm not, I think this would be an improvement over the current behavior, which seems to be to do the eval at whatever timestamp the rule evaluation happens to be scheduled.

Usually, as it stands, it'll already be on the button unless the Prometheus is overloaded. The main question I'd have is how you choose the initial value; if that first eval is delayed then you might have fun, as that error would propagate.

This all only works for evals, it doesn't help much with scrapes and doing the same there would be misleading.

The upside of this would be that it would make it possible to reliably summarize stats over longer intervals from stats over shorter intervals: e.g. max_over_time(job:cpu_utilization:max_over_time_10m[1h]) would reliably return the same value as max_over_time(cpu_utilization[1h]), except in extreme cases (such as skipped evaluations or Prometheus being down).

I don't believe that's the case, as the phase of evals still isn't guaranteed. Also an eval could be skipped due to overload, which breaks that too.

In general this sort of summarization doesn't help performance-wise, as you'll have the same amount of data to process: the interval of the summarized data will be the same as that of the original data. It also only works for min and max. Recording rules only help with performance when you can reduce cardinality via aggregation across time series.
 
The downside would be that rule eval might be a few seconds late, in that a rule evaluation run that is 10 seconds late would compute the rule values from 10 seconds ago. I think this is definitely worth the tradeoff (again, assuming I'm not missing anything obvious).

As it stands, if it's late it was going to be a bit late anyway due to overload. That there's a bit more delay in that situation doesn't sound bad to me.

Brian
 

Any thoughts?

Cheers,
Alin.





Alin Sînpălean

May 28, 2018, 9:39:41 AM
to Brian Brazil, Prometheus Developers
On Mon, May 28, 2018 at 12:22 PM, Brian Brazil <brian....@robustperception.io> wrote:
On 28 May 2018 at 10:41, Alin Sînpălean <alin.si...@gmail.com> wrote:
[Warning: possibly clueless proposal coming up.]

In my past interactions with Brian and Björn regarding the rate() implementation and its (in my view) limitations, one of the arguments against the improvements I was proposing was that there would always be errors/noise/aliasing caused by jitter in the scraping and evaluation intervals. I took that as granted and tried to argue my way around it, but it occurred to me just now that (theoretically speaking, at least) there need not be any eval interval jitter.

More specifically, I think it should be fully possible to set the timestamps of successive rule evaluations to be exactly evaluation_interval apart from one another: even if the eval run is scheduled a bit late, the evaluation timestamp would still be exactly one evaluation_interval after the previous one. I might be missing some very obvious (to others) reason why this is not feasible, but assuming I'm not, I think this would be an improvement over the current behavior, which seems to be to do the eval at whatever timestamp the rule evaluation happens to be scheduled.

Usually, as it stands, it'll already be on the button unless the Prometheus is overloaded.

More or less. But if evaluation happens within milliseconds of scraping, then it's fully possible (even assuming perfectly spaced scrapes) for evaluation to be delayed enough that a given sample is included in one more (or one fewer) eval run than other samples. E.g. considering the particular case of a 1 minute eval interval and a rule that computes the max/avg/sum over 1 minute of samples, it is fully possible for a given sample to be included in 2 successive evals or, worse, left out completely.
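
To put some numbers on it, here's a toy Go illustration (not Prometheus code, and with the range-boundary semantics simplified) of how a sample scraped just after the minute can land in one, two or zero of two consecutive one-minute windows, depending only on a few hundred milliseconds of eval jitter:

package main

import "fmt"

func main() {
    const interval = 60.0 // eval interval and rule range, in seconds
    const sample = 60.1   // a sample scraped just after the minute

    // Count how many of two consecutive eval windows (t-interval, t] the
    // sample falls into, for slightly jittered eval timestamps.
    countIn := func(eval1, eval2 float64) int {
        n := 0
        for _, t := range []float64{eval1, eval2} {
            if sample > t-interval && sample <= t {
                n++
            }
        }
        return n
    }

    fmt.Println(countIn(60.3, 120.4))  // 1: counted exactly once
    fmt.Println(countIn(60.3, 120.05)) // 2: counted twice
    fmt.Println(countIn(60.05, 120.4)) // 0: dropped entirely
}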

The main question I'd have is how you choose the initial value; if that first eval is delayed then you might have fun, as that error would propagate.

I don't think the initial value matters. The first eval would get the timestamp of whenever it happened and subsequent evals would use that exact timestamp plus a multiple of evaluation_interval. Or, one could go wild and either (a) run the eval with timestamps that are exact multiples of evaluation_interval (e.g. on the hour, at 5 minutes past, 10 minutes past etc. for a 5 minute interval) but that would probably be bad for load and latency; or (b) use a hash of the rule name or definition to deterministically compute a fixed offset from a multiple of evaluation_interval. But I don't think either is necessary.
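
For what it's worth, option (b) is easy to sketch in Go (hypothetical names, purely illustrative): hash the rule or group name into a fixed offset within [0, interval), so each rule keeps a stable phase without everything firing at the same instant.

package main

import (
    "fmt"
    "hash/fnv"
    "time"
)

// ruleOffset derives a deterministic offset in [0, interval) from the rule or
// group name, giving each one a fixed "phase" on the evaluation grid.
func ruleOffset(name string, interval time.Duration) time.Duration {
    h := fnv.New64a()
    h.Write([]byte(name))
    return time.Duration(h.Sum64() % uint64(interval))
}

func main() {
    interval := 5 * time.Minute
    for _, name := range []string{"node.rules", "http.rules"} {
        fmt.Printf("%s: offset %v past each %v boundary\n",
            name, ruleOffset(name, interval), interval)
    }
}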

This all only works for evals, it doesn't help much with scrapes and doing the same there would be misleading.

Well, I did take the idea further and thought about what it would mean for scrapes, but thought I should leave it out, for the sake of conciseness. But with (admittedly deep) changes to the scraping protocol, Prometheus could register itself with a target and ask for metrics to be pushed to it at a fixed interval. So within the realm of theoretical possibility, at least.

The upside of this would be that it would make it possible to reliably summarize stats over longer intervals from stats over shorter intervals: e.g. max_over_time(job:cpu_utilization:max_over_time_10m[1h]) would reliably return the same value as max_over_time(cpu_utilization[1h]), except in extreme cases (such as skipped evaluations or Prometheus being down).

I don't believe that's the case, as the phase of evals still isn't guaranteed.

Sorry, I didn't get that. What do you mean by "eval phase"?

Also an eval could be skipped due to overload, which breaks that too.

Fair enough. I'm not saying this would be a panacea, only that it might be preferable to what happens right now. Which, AFAICT is that when Prometheus or the machine becomes overloaded, then you get rule evaluations with more or less random timestamps for the duration of that occurrence, some more than evaluation_interval apart, some less, and the only way you can tell one got skipped is counting how many eval results you have and dividing by the difference between their timestamps.

In general this sort of summarization doesn't help performance-wise, as you'll have the same amount of data to process: the interval of the summarized data will be the same as that of the original data. It also only works for min and max. Recording rules only help with performance when you can reduce cardinality via aggregation across time series.

I was not necessarily thinking about performance, but rather downsampling, either for the purpose of long term storage or graphing. In particular, I'm using recorded rules for pre-aggregating data for dashboards (using a range equal to the eval interval), and I have seen quite a few instances of samples that went missing or got included twice. And I'm generally bothered by the fact that (because of this issue) recorded rules, whose results get stored in the TSDB, produce lower quality data than a throwaway range eval for a graph.

Also, it is not entirely true that it only works for min and max. It works for sums and counts too (and, by extension, averages and standard deviations), as well as changes() and resets(), and (beating my own drum again) it would work perfectly with the proposed xrate() implementation. So it could, in theory, work for any function that takes a range argument.

The downside would be that rule eval might be a few seconds late, in that a rule evaluation run that is 10 seconds late would compute the rule values from 10 seconds ago. I think this is definitely worth the tradeoff (again, assuming I'm not missing anything obvious).

As it stands, if it's late it was going to be a bit late anyway due to overload. That there's a bit more delay in that situation doesn't sound bad to me.

I fully agree, I just pointed it out in the interest of completeness. Or covering my ass, whichever way you want to look at it.

Cheers,
Alin.

Brian
 

Any thoughts?

Cheers,
Alin.





Brian Brazil

May 28, 2018, 12:54:19 PM
to Alin Sînpălean, Prometheus Developers
On 28 May 2018 at 14:39, Alin Sînpălean <alin.si...@gmail.com> wrote:


On Mon, May 28, 2018 at 12:22 PM, Brian Brazil <brian.brazil@robustperception.io> wrote:
On 28 May 2018 at 10:41, Alin Sînpălean <alin.si...@gmail.com> wrote:
[Warning: possibly clueless proposal coming up.]

In my past interactions with Brian and Björn regarding the rate() implementation and its (in my view) limitations, one of the arguments against the improvements I was proposing was that there would always be errors/noise/aliasing caused by jitter in the scraping and evaluation intervals. I took that as granted and tried to argue my way around it, but it occurred to me just now that (theoretically speaking, at least) there need not be any eval interval jitter.

More specifically, I think it should be fully possible to set the timestamps of successive rule evaluations to be exactly evaluation_interval apart from one another: even if the eval run is scheduled a bit late, the evaluation timestamp would still be exactly one evaluation_interval after the previous one. I might be missing some very obvious (to others) reason why this is not feasible, but assuming I'm not, I think this would be an improvement over the current behavior, which seems to be to do the eval at whatever timestamp the rule evaluation happens to be scheduled.

Usually, as it stands, it'll already be on the button unless the Prometheus is overloaded.

More or less. But if evaluation happens within milliseconds of scraping, then it's fully possible (even assuming perfectly spaced scrapes) for evaluation to be delayed enough that a given sample is included in one more (or one fewer) eval run than other samples. E.g. considering the particular case of a 1 minute eval interval and a rule that computes the max/avg/sum over 1 minute of samples, it is fully possible for a given sample to be included in 2 successive evals or, worse, left out completely.

I think there are additional races here around scraping, so what you're proposing is better but not perfect.
 

The main question I'd have is how you choose the initial value; if that first eval is delayed then you might have fun, as that error would propagate.

I don't think the initial value matters. The first eval would get the timestamp of whenever it happened and subsequent evals would use that exact timestamp plus a multiple of evaluation_interval. Or, one could go wild and either (a) run the eval with timestamps that are exact multiples of evaluation_interval (e.g. on the hour, at 5 minutes past, 10 minutes past etc. for a 5 minute interval) but that would probably be bad for load and latency; or (b) use a hash of the rule name or definition to deterministically compute a fixed offset from a multiple of evaluation_interval. But I don't think either is necessary.

We already do b) for load reasons (it's one of the reasons for rule groups existing). I'd like to avoid delaying evals unnecessarily in the typical case.
 

This all only works for evals, it doesn't help much with scrapes and doing the same there would be misleading.

Well, I did take the idea further and thought about what it would mean for scrapes, but thought I should leave it out, for the sake of conciseness. But with (admittedly deep) changes to the scraping protocol, Prometheus could register itself with a target and ask for metrics to be pushed to it at a fixed interval. So within the realm of theoretical possibility, at least.

Yes, let's avoid that rabbit hole :) If you care about that level of precision then you need something well beyond what a metrics tool like Prometheus can offer.
 

The upside of this would be that it would make it possible to reliably summarize stats over longer intervals from stats over shorter intervals: e.g. max_over_time(job:cpu_utilization:max_over_time_10m[1h]) would reliably return the same value as max_over_time(cpu_utilization[1h]), except in extreme cases (such as skipped evaluations or Prometheus being down).

I don't believe that's the case, as the phase of evals still isn't guaranteed.

Sorry, I didn't get that. What do you mean by "eval phase"?

If the interval is 10s, the phase is whether it happens on the minute, 1s after the minute, 2s after the minute etc.
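
In other words, the phase is just the evaluation timestamp modulo the interval; a trivial Go illustration, only to pin the term down:

package main

import (
    "fmt"
    "time"
)

func main() {
    interval := 10 * time.Second
    t := time.Date(2018, 5, 28, 10, 41, 7, 0, time.UTC)
    // The phase is the timestamp modulo the interval: this eval runs 7s past
    // each 10s boundary, and every later eval keeps that same 7s phase.
    phase := time.Duration(t.UnixNano() % int64(interval))
    fmt.Println("phase:", phase) // phase: 7s
}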


Also an eval could be skipped due to overload, which breaks that too.

Fair enough. I'm not saying this would be a panacea, only that it might be preferable to what happens right now. Which, AFAICT is that when Prometheus or the machine becomes overloaded, then you get rule evaluations with more or less random timestamps for the duration of that occurrence, some more than evaluation_interval apart, some less, and the only way you can tell one got skipped is counting how many eval results you have and dividing by the difference between their timestamps.

There is a metric for the skips, but it only helps you detect that skipping is happening in general. All bets are off, really, when you're overloaded.
 

In general this sort of summarization doesn't help performance-wise, as you'll have the same amount of data to process: the interval of the summarized data will be the same as that of the original data. It also only works for min and max. Recording rules only help with performance when you can reduce cardinality via aggregation across time series.

I was not necessarily thinking about performance, but rather downsampling, either for the purpose of long term storage or graphing. In particular, I'm using recorded rules for pre-aggregating data for dashboards (using a range equal to the eval interval), and I have seen quite a few instances of samples that went missing or got included twice. And I'm generally bothered by the fact that (because of this issue) recorded rules, whose results get stored in the TSDB, produce lower quality data than a throwaway range eval for a graph.

Hmm, what exact setup are you thinking of here?

We've had users use higher eval intervals to try and downsample data within Prometheus, and they quickly find that doesn't work out due to staleness (multiple eval intervals is also not good for sanity), so that's not something we recommend. If you're using a relatively long range with your usual interval then this can be more performant for graphs, but that's not something that pops up often.
 

Also, it is not entirely true that it only works for min and max. It works for sums and counts too (and, by extension, averages and standard deviations), as well as changes() and resets(), and (beating my own drum again) it would work perfectly with the proposed xrate() implementation. So it could, in theory, work for any function that takes a range argument.

When using it in the standard recommended way, it works only for min and max as in any other scenario you'll be double counting (and even then min/max will be working over a slightly different range than you requested).
 

The downside would be that rule eval might be a few seconds late, in that a rule evaluation run that is 10 seconds late would compute the rule values from 10 seconds ago. I think this is definitely worth the tradeoff (again, assuming I'm not missing anything obvious).

As it stands, if it's late it was going to be a bit late anyway due to overload. That there's a bit more delay in that situation doesn't sound bad to me.

I fully agree, I just pointed it out in the interest of completeness. Or covering my ass, whichever way you want to look at it.

I think the idea makes sense, and I don't think it'll collide with my pending PromQL PR. Do you want to send a PR?

Brian

 

Cheers,
Alin.

Brian
 

Any thoughts?

Cheers,
Alin.









Alin Sînpălean

May 29, 2018, 4:23:27 AM
to Brian Brazil, Prometheus Developers
Just sent out PR 4201; it still needs tests. And I'll reply to your points in a follow-up email.

Cheers,
Alin.

Alin Sînpălean

May 29, 2018, 10:42:45 AM
to Brian Brazil, Prometheus Developers
On Mon, May 28, 2018 at 6:54 PM, Brian Brazil <brian....@robustperception.io> wrote:
On 28 May 2018 at 14:39, Alin Sînpălean <alin.si...@gmail.com> wrote:

On Mon, May 28, 2018 at 12:22 PM, Brian Brazil <brian.brazil@robustperception.io> wrote:
On 28 May 2018 at 10:41, Alin Sînpălean <alin.si...@gmail.com> wrote:
[Warning: possibly clueless proposal coming up.]

In my past interactions with Brian and Björn regarding the rate() implementation and its (in my view) limitations, one of the arguments against the improvements I was proposing was that there would always be errors/noise/aliasing caused by jitter in the scraping and evaluation intervals. I took that as granted and tried to argue my way around it, but it occurred to me just now that (theoretically speaking, at least) there need not be any eval interval jitter.

More specifically, I think it should be fully possible to set the timestamps of successive rule evaluations to be exactly evaluation_interval apart from one another: even if the eval run is scheduled a bit late, the evaluation timestamp would still be exactly one evaluation_interval after the previous one. I might be missing some very obvious (to others) reason why this is not feasible, but assuming I'm not, I think this would be an improvement over the current behavior, which seems to be to do the eval at whatever timestamp the rule evaluation happens to be scheduled.

Usually, as it stands, it'll already be on the button unless the Prometheus is overloaded.

More or less. But if evaluation happens within milliseconds of scraping, then it's fully possible (even assuming perfectly spaced scrapes) for evaluation to be delayed enough that a given sample is included in one more (or one fewer) eval run than other samples. E.g. considering the particular case of a 1 minute eval interval and a rule that computes the max/avg/sum over 1 minute of samples, it is fully possible for a given sample to be included in 2 successive evals or, worse, left out completely.

I think there are additional races here around scraping, so what you're proposing is better but not perfect.

If you're referring to scrapes in progress at evaluation time, which will later append samples with timestamps earlier than the evaluation timestamp, that can definitely happen. It might be possible (although not always desirable) to wait for in-progress scrapes (and other rule groups?) to complete before proceeding with evaluation, but that's a much more complex issue and discussion.

Another way of approaching this issue would be to set the timestamp of scraped samples to scrape end time, rather than scrape start time. There would still exist the possibility of a race (between the samples being committed and evaluation), but the window would be significantly smaller than under the current status quo, where it covers the whole request-collect-response-commit period.

The main question I'd have is how you choose the initial value; if that first eval is delayed then you might have fun, as that error would propagate.

I don't think the initial value matters. The first eval would get the timestamp of whenever it happened and subsequent evals would use that exact timestamp plus a multiple of evaluation_interval. Or, one could go wild and either (a) run the eval with timestamps that are exact multiples of evaluation_interval (e.g. on the hour, at 5 minutes past, 10 minutes past etc. for a 5 minute interval) but that would probably be bad for load and latency; or (b) use a hash of the rule name or definition to deterministically compute a fixed offset from a multiple of evaluation_interval. But I don't think either is necessary.

We already do b) for load reasons (it's one of the reasons for rule groups existing). I'd like to avoid delaying evals unnecessarily in the typical case.

Thanks for the pointer; I have repurposed the existing offset calculation to assign a "phase" to the evaluation timestamps.
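
Roughly speaking (this is a sketch with made-up names, not the actual diff in the PR), the evaluation timestamp becomes the wall clock rounded down onto an interval-spaced grid that is shifted by the group's hash-derived offset:

package main

import (
    "fmt"
    "hash/fnv"
    "time"
)

// alignedEvalTime rounds `now` down onto a grid spaced `interval` apart and
// shifted by a per-group offset derived from hashing the group name, so
// successive evaluations of a group always land on the same grid.
func alignedEvalTime(group string, interval time.Duration, now time.Time) time.Time {
    h := fnv.New64a()
    h.Write([]byte(group))
    offset := int64(h.Sum64() % uint64(interval))

    adjusted := now.UnixNano() - offset
    base := adjusted - adjusted%int64(interval)
    return time.Unix(0, base+offset).UTC()
}

func main() {
    interval := time.Minute
    now := time.Now()
    fmt.Println("wall clock:    ", now.UTC())
    fmt.Println("eval timestamp:", alignedEvalTime("example.rules", interval, now))
    // A run scheduled a few seconds late still lands on the same offset-aligned
    // grid, so the spacing between eval timestamps is always a whole multiple
    // of the interval.
    fmt.Println("next run:      ", alignedEvalTime("example.rules", interval, now.Add(interval+3*time.Second)))
}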

The upside of this would be that it would make it possible to reliably summarize stats over longer intervals from stats over shorter intervals: e.g. max_over_time(job:cpu_utilization:max_over_time_10m[1h]) would reliably return the same value as max_over_time(cpu_utilization[1h]), except in extreme cases (such as skipped evaluations or Prometheus being down).

I don't believe that's the case, as the phase of evals still isn't guaranteed.

If you mean (assuming a 1 minute eval interval) that e.g. job:cpu_utilization:max_over_time_10m might be evaluated at 10 seconds past the minute and then the max_over_time() at another 10 seconds past that (i.e. 20 seconds past the minute), whereas the max_over_time(cpu_utilization[1h]) expression would be evaluated at 10 seconds past, I don't think that's actually a problem. The former would have the same value as the latter, had it been computed with a 20 seconds phase. In other words, the values of the 2 expressions wouldn't necessarily always match, but you could adjust offsets/phases of one and/or the other evaluation and get perfectly matching values.

If you are referring to the race conditions you mentioned above, then yes, you could get different outputs from the 2 expressions, even with perfectly matching phases.

In general this sort of summarization doesn't help performance-wise, as you'll have the same amount of data to process: the interval of the summarized data will be the same as that of the original data. It also only works for min and max. Recording rules only help with performance when you can reduce cardinality via aggregation across time series.

I was not necessarily thinking about performance, but rather downsampling, either for the purpose of long term storage or graphing. In particular, I'm using recorded rules for pre-aggregating data for dashboards (using a range equal to the eval interval), and I have seen quite a few instances of samples that went missing or got included twice. And I'm generally bothered by the fact that (because of this issue) recorded rules, whose results get stored in the TSDB, produce lower quality data than a throwaway range eval for a graph.

Hmm, what exact setup are you thinking of here?

We've had users use higher eval intervals to try and downsample data within Prometheus, and they quickly find that doesn't work out due to staleness (multiple eval intervals is also not good for sanity), so that's not something we recommend. If you're using a relatively long range with your usual interval then this can be more performant for graphs, but that's not something that pops up often.

What I'm doing is this: I have a setup with a 10 second eval interval and I'm computing something along the lines of job_env:http_requests:increase_10s. I then graph that in Grafana as sum_over_time(job_env:http_requests:increase_10s[$__interval]), to get a fast, variable-resolution graph that (in the overwhelming majority of cases) covers every single request (with particular emphasis on errors).

Also, it is not entirely true that it only works for min and max. It works for sums and counts too (and, by extension, averages and standard deviations), as well as changes() and resets(), and (beating my own drum again) it would work perfectly with the proposed xrate() implementation. So it could, in theory, work for any function that takes a range argument.

When using it in the standard recommended way, it works only for min and max as in any other scenario you'll be double counting (and even then min/max will be working over a slightly different range than you requested).

If by "standard recommended way" you're referring to the recommendation of using a time range ~2.5x the eval interval for rate(), then sure. But if you're (a) willing to ignore the odd missed evaluation run and go with a range equal to the eval interval (my personal choice); or (b) use a range that's a multiple of the eval interval (say 3x) and then take that into account (e.g. by dividing everything by 3); then I think you can get very accurate sums/counts/rates (again, modulo missed scrapes/evals).

Cheers,
Alin.

Brian Brazil

May 29, 2018, 10:57:38 AM
to Alin Sînpălean, Prometheus Developers
On 29 May 2018 at 15:42, Alin Sînpălean <alin.si...@gmail.com> wrote:

On Mon, May 28, 2018 at 6:54 PM, Brian Brazil <brian.brazil@robustperception.io> wrote:
On 28 May 2018 at 14:39, Alin Sînpălean <alin.si...@gmail.com> wrote:

On Mon, May 28, 2018 at 12:22 PM, Brian Brazil <brian.brazil@robustperception.io> wrote:
On 28 May 2018 at 10:41, Alin Sînpălean <alin.si...@gmail.com> wrote:
[Warning: possibly clueless proposal coming up.]

In my past interactions with Brian and Björn regarding the rate() implementation and its (in my view) limitations, one of the arguments against the improvements I was proposing was that there would always be errors/noise/aliasing caused by jitter in the scraping and evaluation intervals. I took that as granted and tried to argue my way around it, but it occurred to me just now that (theoretically speaking, at least) there need not be any eval interval jitter.

More specifically, I think it should be fully possible to set the timestamps of successive rule evaluations to be exactly evaluation_interval apart from one another: even if the eval run is scheduled a bit late, the evaluation timestamp would still be exactly one evaluation_interval after the previous one. I might be missing some very obvious (to others) reason why this is not feasible, but assuming I'm not, I think this would be an improvement over the current behavior, which seems to be to do the eval at whatever timestamp the rule evaluation happens to be scheduled.

Usually, as it stands, it'll already be on the button unless the Prometheus is overloaded.

More or less. But if evaluation happens within milliseconds of scraping, then it's fully possible (even assuming perfectly spaced scrapes) for evaluation to be delayed enough that a given sample is included in one more (or one fewer) eval run than other samples. E.g. considering the particular case of a 1 minute eval interval and a rule that computes the max/avg/sum over 1 minute of samples, it is fully possible for a given sample to be included in 2 successive evals or, worse, left out completely.

I think there are additional races here around scraping, so what you're proposing is better but not perfect.

If you're referring to scrapes in progress at evaluation time, which will later append samples with timestamps earlier than the evaluation timestamp, that can definitely happen. It might be possible (although not always desirable) to wait for in-progress scrapes (and other rule groups?) to complete before proceeding with evaluation, but that's a much more complex issue and discussion.

That could delay things by rather a lot; having realtime metrics is valuable.

Another way of approaching this issue would be to set the timestamp of scraped samples to scrape end time, rather than scrape start time. There would still exist the possibility of a race (between the samples being committed and evaluation), but the window would be significantly smaller than under the current status quo, where it covers the whole request-collect-response-commit period.

I've pondered this and could see this increasing artifacts, as the lag between recorded timestamp and actual timestamp would have more jitter.

The main question I'd have is how you choose the initial value; if that first eval is delayed then you might have fun, as that error would propagate.

I don't think the initial value matters. The first eval would get the timestamp of whenever it happened and subsequent evals would use that exact timestamp plus a multiple of evaluation_interval. Or, one could go wild and either (a) run the eval with timestamps that are exact multiples of evaluation_interval (e.g. on the hour, at 5 minutes past, 10 minutes past etc. for a 5 minute interval) but that would probably be bad for load and latency; or (b) use a hash of the rule name or definition to deterministically compute a fixed offset from a multiple of evaluation_interval. But I don't think either is necessary.

We already do b) for load reasons (it's one of the reasons for rule groups existing). I'd like to avoid delaying evals unnecessarily in the typical case.

Thanks for the pointer; I have repurposed the existing offset calculation to assign a "phase" to the evaluation timestamps.

The upside of this would be that it would make it possible to reliably summarize stats over longer intervals from stats over shorter intervals: e.g. max_over_time(job:cpu_utilization:max_over_time_10m[1h]) would reliably return the same value as max_over_time(cpu_utilization[1h]), except in extreme cases (such as skipped evaluations or Prometheus being down).

I don't believe that's the case, as the phase of evals still isn't guaranteed.

If you mean (assuming a 1 minute eval interval) that e.g. job:cpu_utilization:max_over_time_10m might be evaluated at 10 seconds past the minute and then the max_over_time() at another 10 seconds past that (i.e. 20 seconds past the minute), whereas the max_over_time(cpu_utilization[1h]) expression would be evaluated at 10 seconds past, I don't think that's actually a problem. The former would have the same value as the latter, had it been computed with a 20 seconds phase. In other words, the values of the 2 expressions wouldn't necessarily always match, but you could adjust offsets/phases of one and/or the other evaluation and get perfectly matching values.

Yes, that's what I meant. We often get users who want 100% accurate numbers over arbitrary time ranges, and don't realise that this aspect of metrics makes it fundamentally impossible (scrapes all have different phases too). Mismatching phases are fine for practical purposes, though; it just makes the data lag a bit.

If you are referring to the race conditions you mentioned above, then yes, you could get different outputs from the 2 expressions, even with perfectly matching phases.

In general this sort of summarization doesn't help performance-wise, as you'll have the same amount of data to process: the interval of the summarized data will be the same as that of the original data. It also only works for min and max. Recording rules only help with performance when you can reduce cardinality via aggregation across time series.

I was not necessarily thinking about performance, but rather downsampling, either for the purpose of long term storage or graphing. In particular, I'm using recorded rules for pre-aggregating data for dashboards (using a range equal to the eval interval), and I have seen quite a few instances of samples that went missing or got included twice. And I'm generally bothered by the fact that (because of this issue) recorded rules, whose results get stored in the TSDB, produce lower quality data than a throwaway range eval for a graph.

Hmm, what exact setup are you thinking of here?

We've had users use higher eval intervals to try and downsample data within Prometheus, and they quickly find that doesn't work out due to staleness (multiple eval intervals is also not good for sanity), so that's not something we recommend. If you're using a relatively long range with your usual interval then this can be more performant for graphs, but that's not something that pops up often.

What I'm doing is this: I have a setup with a 10 second eval interval and I'm computing something along the lines of job_env:http_requests:increase_10s. I then graph that in Grafana as sum_over_time(job_env:http_requests:increase_10s[$__interval]), to get a fast, variable-resolution graph that (in the overwhelming majority of cases) covers every single request (with particular emphasis on errors).

What do you see as the benefit of this against requesting the increase directly in Grafana? With a 10s range there's going to be basically no performance gain from using a recording rule here, especially given the resources the recording rule uses.
 
Also, it is not entirely true that it only works for min and max. It works for sums and counts too (and, by extension, averages and standard deviations), as well as changes() and resets(), and (beating my own drum again) it would work perfectly with the proposed xrate() implementation. So it could, in theory, work for any function that takes a range argument.

When using it in the standard recommended way, it works only for min and max as in any other scenario you'll be double counting (and even then min/max will be working over a slightly different range than you requested).

If by "standard recommended way" you're referring to the recommendation of using a time range ~2.5x the eval interval for rate(), then sure.

I'm considering entirely gauges (though the argument generalises). I thought you were doing something different that we've seen users attempt, where you'd have, say, a 1h eval interval.
 
But if you're (a) willing to ignore the odd missed evaluation run and go with a range equal to the eval interval (my personal choice); or (b) use a range that's a multiple of the eval interval (say 3x) and then take that into account (e.g. by dividing everything by 3); then I think you can get very accurate sums/counts/rates (again, modulo missed scrapes/evals).

I'm not sure this is going to buy you much compared to simpler approaches.


Alin Sînpălean

May 29, 2018, 12:27:15 PM
to Brian Brazil, Prometheus Developers
On Tue, May 29, 2018 at 4:57 PM, Brian Brazil <brian....@robustperception.io> wrote:
On 29 May 2018 at 15:42, Alin Sînpălean <alin.si...@gmail.com> wrote:
On Mon, May 28, 2018 at 6:54 PM, Brian Brazil <brian.brazil@robustperception.io> wrote:
On 28 May 2018 at 14:39, Alin Sînpălean <alin.si...@gmail.com> wrote:

On Mon, May 28, 2018 at 12:22 PM, Brian Brazil <brian.brazil@robustperception.io> wrote:
On 28 May 2018 at 10:41, Alin Sînpălean <alin.si...@gmail.com> wrote:
[Warning: possibly clueless proposal coming up.]

In my past interactions with Brian and Björn regarding the rate() implementation and its (in my view) limitations, one of the arguments against the improvements I was proposing was that there would always be errors/noise/aliasing caused by jitter in the scraping and evaluation intervals. I took that as granted and tried to argue my way around it, but it occurred to me just now that (theoretically speaking, at least) there need not be any eval interval jitter.

More specifically, I think it should be fully possible to set the timestamps of successive rule evaluations to be exactly evaluation_interval apart from one another: even if the eval run is scheduled a bit late, the evaluation timestamp would still be exactly one evaluation_interval after the previous one. I might be missing some very obvious (to others) reason why this is not feasible, but assuming I'm not, I think this would be an improvement over the current behavior, which seems to be to do the eval at whatever timestamp the rule evaluation happens to be scheduled.

Usually, as it stands, it'll already be on the button unless the Prometheus is overloaded.

More or less. But if evaluation happens within milliseconds of scraping, then it's fully possible (even assuming perfectly spaced scrapes) for evaluation to be delayed enough that a given sample is included in one more (or one fewer) eval run than other samples. E.g. considering the particular case of a 1 minute eval interval and a rule that computes the max/avg/sum over 1 minute of samples, it is fully possible for a given sample to be included in 2 successive evals or, worse, left out completely.

I think there are additional races here around scraping, so what you're proposing is better but not perfect.

If you're referring to scrapes in progress at evaluation time, which will later append samples with timestamps earlier than the evaluation timestamp, that can definitely happen. It might be possible (although not always desirable) to wait for in-progress scrapes (and other rule groups?) to complete before proceeding with evaluation, but that's a much more complex issue and discussion.

That could delay things by rather a lot; having realtime metrics is valuable.

That's what I meant by "not always desirable". It could be a flag one could set on the rule group. But that may be unnecessarily complex (even leaving aside the implementation) and you can probably get the exact same results by adding offset 1m to all your rules. :o)

Another way of approaching this issue would be to set the timestamp of scraped samples to scrape end time, rather than scrape start time. There would still exist the possibility of a race (between the samples being committed and evaluation), but the window would be significantly smaller than under the current status quo, where it covers the whole request-collect-response-commit period.

I've pondered this and could see this increasing artifacts, as the lag between recorded timestamp and actual timestamp would have more jitter.

I'm not entirely sure about that, particularly as it relates to rules. You're more likely to miss a sample in a rule eval as things stand right now because of this race condition between scrape and eval: a sample scraped while an eval was in progress is going to be ignored by that eval as well as the following one (assuming an expression with a range equal to the eval interval; but you have a similar problem with any range that is a multiple of the eval interval: that sample will get included in one fewer eval run than its peers; and even worse problems with ranges that are not a multiple of the eval interval).

As for the jitter, you will very likely have wider variance in terms of the intervals between samples. But as for the lag between the recorded and actual timestamps, how would you determine the actual timestamp? It's going to be somewhere between the scrape start and end times, but you can't say where (unless you actually measure it). One might argue that the /metrics request is likely to have been sitting in a queue for a while (so the actual timestamp is likely to be closer to the end), or that in the case of multiple collectors most would be quick (and thus most sample timestamps would be closer to the scrape start time), with one or two laggards.

So I guess it's more of a choice between, on the one hand, uniformly spaced (somewhat misleadingly so) timestamps and consistent numbers of samples in each range; and, on the other, more reliable recorded rules computed over jittery samples. Neither is ideal.

If you are referring to the race conditions you mentioned above, then yes, you could get different outputs from the 2 expressions, even with perfectly matching phases.

In general this sort of summarization doesn't help performance-wise, as you'll have the same amount of data to process: the interval of the summarized data will be the same as that of the original data. It also only works for min and max. Recording rules only help with performance when you can reduce cardinality via aggregation across time series.

I was not necessarily thinking about performance, but rather downsampling, either for the purpose of long term storage or graphing. In particular, I'm using recorded rules for pre-aggregating data for dashboards (using a range equal to the eval interval), and I have seen quite a few instances of samples that went missing or got included twice. And I'm generally bothered by the fact that (because of this issue) recorded rules, whose results get stored in the TSDB, produce lower quality data than a throwaway range eval for a graph.

Hmm, what exact setup are you thinking of here?

We've had users use higher eval intervals to try and downsample data within Prometheus, and they quickly find that doesn't work out due to staleness (multiple eval intervals is also not good for sanity), so that's not something we recommend. If you're using a relatively long range with your usual interval then this can be more performant for graphs, but that's not something that pops up often.

What I'm doing is this: I have a setup with a 10 second eval interval and I'm computing something along the lines of job_env:http_requests:increase_10s. I then graph that in Grafana as sum_over_time(job_env:http_requests:increase_10s[$__interval]), to get a fast, variable-resolution graph that (in the overwhelming majority of cases) covers every single request (with particular emphasis on errors).

What do you see as the benefit of this against requesting the increase directly in Grafana? With a 10s range there's going to be basically no performance gain from using a recording rule here, especially given the resources the recording rule uses.

Well, according to best practices (and common sense) you shouldn't compute an increase over a sum of counters. So because I'm actually aggregating away a number of dimensions here (such as instance, request, status code or some combination thereof) computing the rate on the fly is significantly slower than using a precomputed increase. Of course, Prometheus has to waste a lot of CPU on this rule evaluation, but for the time being we have enough CPU to spare on the Prometheus instance. And our debug dashboards (which compute the increase on-the-fly in order to allow for arbitrary filtering) take tens of seconds to load, whereas the regular dashboards (which lack some of the filters, but not many) load in less than a second.

Also, it is not entirely true that it only works for min and max. It works for sums and counts too (and, by extension, averages and standard deviations), as well as changes() and resets(), and (beating my own drum again) it would work perfectly with the proposed xrate() implementation. So it could, in theory, work for any function that takes a range argument.

When using it in the standard recommended way, it works only for min and max as in any other scenario you'll be double counting (and even then min/max will be working over a slightly different range than you requested).

If by "standard recommended way" you're referring to the recommendation of using a time range ~2.5x the eval interval for rate(), then sure.

I'm considering entirely gauges (though the argument generalises). I thought you were doing something different that we've seen users attempt, where you'd have, say, a 1h eval interval.
 
But if you're (a) willing to ignore the odd missed evaluation run and go with a range equal to the eval interval (my personal choice); or (b) use a range that's a multiple of the eval interval (say 3x) and then take that into account (e.g. by dividing everything by 3); then I think you can get very accurate sums/counts/rates (again, modulo missed scrapes/evals).

I'm not sure this is going to buy you much compared to simpler approaches.

To be perfectly honest, the reason why I started down this path (in addition to the performance gain described above) was that I just couldn't get rate()/increase() to produce good enough numbers (both in terms of the noise introduced by extrapolation and in terms of including some samples, or rather increases between adjacent samples, more times than others). So I'm computing an increase over a fixed range (which I know how to adjust for both issues) and then I can do simple arithmetic with the results and be relatively sure to not miss or overcount any samples.

Cheers,
Alin.

Brian Brazil

May 29, 2018, 12:41:13 PM
to Alin Sînpălean, Prometheus Developers
On 29 May 2018 at 17:26, Alin Sînpălean <alin.si...@gmail.com> wrote:

On Tue, May 29, 2018 at 4:57 PM, Brian Brazil <brian.brazil@robustperception.io> wrote:
On 29 May 2018 at 15:42, Alin Sînpălean <alin.si...@gmail.com> wrote:
On Mon, May 28, 2018 at 6:54 PM, Brian Brazil <brian.brazil@robustperception.io> wrote:
On 28 May 2018 at 14:39, Alin Sînpălean <alin.si...@gmail.com> wrote:

On Mon, May 28, 2018 at 12:22 PM, Brian Brazil <brian.brazil@robustperception.io> wrote:
On 28 May 2018 at 10:41, Alin Sînpălean <alin.si...@gmail.com> wrote:
[Warning: possibly clueless proposal coming up.]

In my past interactions with Brian and Björn regarding the rate() implementation and its (in my view) limitations, one of the arguments against the improvements I was proposing was that there would always be errors/noise/aliasing caused by jitter in the scraping and evaluation intervals. I took that as granted and tried to argue my way around it, but it occurred to me just now that (theoretically speaking, at least) there need not be any eval interval jitter.

More specifically, I think it should be fully possible to set the timestamps of successive rule evaluations to be exactly evaluation_interval apart from one another: even if the eval run is scheduled a bit late, the evaluation timestamp would still be exactly one evaluation_interval after the previous one. I might be missing some very obvious (to others) reason why this is not feasible, but assuming I'm not, I think this would be an improvement over the current behavior, which seems to be to do the eval at whatever timestamp the rule evaluation happens to be scheduled.

Usually, as it stands, it'll already be on the button unless the Prometheus is overloaded.

More or less. But if evaluation happens within milliseconds of scraping, then it's fully possible (even assuming perfectly spaced scrapes) for evaluation to be delayed enough that a given sample is included in one more (or one fewer) eval run than other samples. E.g. considering the particular case of a 1 minute eval interval and a rule that computes the max/avg/sum over 1 minute of samples, it is fully possible for a given sample to be included in 2 successive evals or, worse, left out completely.

I think there are additional races here around scraping, so what you're proposing is better but not perfect.

If you're referring to scrapes in progress at evaluation time, which will later append samples with timestamps earlier than the evaluation timestamp, that can definitely happen. It might be possible (although not always desirable) to wait for in-progress scrapes (and other rule groups?) to complete before proceeding with evaluation, but that's a much more complex issue and discussion.

That could delay things by rather a lot; having realtime metrics is valuable.

That's what I meant by "not always desirable". It could be a flag one could set on the rule group. But that may be unnecessarily complex (even leaving aside the implementation) and you can probably get the exact same results by adding offset 1m to all your rules. :o)

Another way of approaching this issue would be to set the timestamp of scraped samples to scrape end time, rather than scrape start time. There would still exist the possibility of a race (between the samples being committed and evaluation), but the window would be significantly smaller than under the current status quo, where it covers the whole request-collect-response-commit period.

I've pondered this and could see this increasing artifacts, as the lag between recorded timestamp and actual timestamp would have more jitter.

I'm not entirely sure about that, particularly as it relates to rules. You're more likely to miss a sample in a rule eval as things stand right now because of this race condition between scrape and eval: a sample scraped while an eval was in progress is going to be ignored by that eval as well as the following one (assuming an expression with a range equal to the eval interval; but you have a similar problem with any range that is a multiple of the eval interval: that sample will get included in one fewer eval run than its peers; and even worse problems with ranges that are not a multiple of the eval interval).

As for the jitter, you will very likely have wider variance in terms of the intervals between samples. But as for the lag between the recorded and actual timestamps, how would you determine the actual timestamp? It's going to be somewhere between the scrape start and end times, but you can't say where (unless you actually measure it). One might argue that the /metrics request is likely to have been sitting in a queue for a while (so the actual timestamp is likely to be closer to the end), or that in the case of multiple collectors most would be quick (and thus most sample timestamps would be closer to the scrape start time), with one or two laggards.

The way I reason about it is that what matters is when the collection actually starts within the client. How long the collection actually takes currently doesn't affect the timestamp, so any effect will (roughly) be a constant offset between the true collection start time and the recorded timestamp, which is fine.
If you use the end time then that offset can jitter a lot more based on load and network.

So I guess it's more of a choice between, on the one hand, uniformly spaced (somewhat misleadingly so) timestamps and consistent numbers of samples in each range; and, on the other, more reliable recorded rules computed over jittery samples. Neither is ideal.

Yeah, I don't think there's a clear winner here. I've seen both ways implemented, and they both work in practice.


If you are referring to the race conditions you mentioned above, then yes, you could get different outputs from the 2 expressions, even with perfectly matching phases.

In general this sort of summarization doesn't help performance-wise, as you'll have the same amount of data to process: the interval of the summarized data will be the same as that of the original data. It also only works for min and max. Recording rules only help with performance when you can reduce cardinality via aggregation across time series.

I was not necessarily thinking about performance, but rather downsampling, either for the purpose of long term storage or graphing. In particular, I'm using recorded rules for pre-aggregating data for dashboards (using a range equal to the eval interval), and I have seen quite a few instances of samples that went missing or got included twice. And I'm generally bothered by the fact that (because of this issue) recorded rules, whose results get stored in the TSDB, produce lower quality data than a throwaway range eval for a graph.

Hmm, what exact setup are you thinking of here?

We've had users use higher eval intervals to try and downsample data within Prometheus, and they quickly find that doesn't work out due to staleness (multiple eval intervals is also not good for sanity), so that's not something we recommend. If you're using a relatively long range with your usual interval then this can be more performant for graphs, but that's not something that pops up often.

What I'm doing is this: I have a setup with a 10 second eval interval and I'm computing something along the lines of job_env:http_requests:increase_10s. I then graph that in Grafana as sum_over_time(job_env:http_requests:increase_10s[$__interval]), to get a fast, variable-resolution graph that (in the overwhelming majority of cases) covers every single request (with particular emphasis on errors).

What do you see as the benefit of this against requesting the increase directly in Grafana? With a 10s range there's going to be basically no performance gain from using a recording rule here, especially given the resources the recording rule uses.

Well, according to best practices (and common sense) you shouldn't compute an increase over a sum of counters.

You can take an average of the increase though. Generally you should use rate rather than increase in rules as it produces base units which are easier to work with, and leave increase only for Grafana expressions and equivalents.

So because I'm actually aggregating away a number of dimensions here (such as instance, request, status code or some combination thereof) computing the rate on the fly is significantly slower than using a precomputed increase.

That's my previous point: it's aggregation that matters here in performance terms, not the recording rule with rate/increase over a small range.
 
Of course, Prometheus has to waste a lot of CPU on this rule evaluation, but for the time being we have enough CPU to spare on the Prometheus instance. And our debug dashboards (which compute the increase on-the-fly in order to allow for arbitrary filtering) take tens of seconds to load, whereas the regular dashboards (which lack some of the filters, but not many) load in less than a second.

That's generally a good tradeoff.


Also, it is not entirely true that it only works for min and max. It works for sums and counts too (and, by extension, averages and standard deviations), as well as changes() and resets(), and (beating my own drum again) it would work perfectly with the proposed xrate() implementation. So it could, in theory, work for any function that takes a range argument.

When using it in the standard recommended way, it works only for min and max as in any other scenario you'll be double counting (and even then min/max will be working over a slightly different range than you requested).

If by "standard recommended way" you're referring to the recommendation of using a time range ~2.5x the eval interval for rate(), then sure.

I'm considering entirely gauges (though the argument generalises). I thought you were doing something different that we've seen users attempt, where you'd have, say, a 1h eval interval.
 
But if you're (a) willing to ignore the odd missed evaluation run and go with a range equal to the eval interval (my personal choice); or (b) use a range that's a multiple of the eval interval (say 3x) and then take that into account (e.g. by dividing everything by 3); then I think you can get very accurate sums/counts/rates (again, modulo missed scrapes/evals).

I'm not sure this is going to buy you much compared to simpler approaches.

To be perfectly honest, the reason why I started down this path (in addition to the performance gain described above) was that I just couldn't get rate()/increase() to produce good enough numbers (both in terms of the noise introduced by extrapolation and in terms of including some samples, or rather increases between adjacent samples, more times than others). So I'm computing an increase over a fixed range (which I know how to adjust for both issues) and then I can do simple arithmetic with the results and be relatively sure to not miss or overcount any samples.

My experience is that taking a normal rate, and then either graphing that directly or taking an avg_over_time is sufficient for practical purposes. The interesting numbers also tend to be ratios such as latency and failure ratios, in which such artifacts are cancelled out.

--

Alin Sînpălean

May 30, 2018, 5:39:29 AM
to Brian Brazil, Prometheus Developers
On Tue, May 29, 2018 at 6:41 PM, Brian Brazil <brian....@robustperception.io> wrote:
On 29 May 2018 at 17:26, Alin Sînpălean <alin.si...@gmail.com> wrote:
On Tue, May 29, 2018 at 4:57 PM, Brian Brazil <brian.brazil@robustperception.io> wrote:
On 29 May 2018 at 15:42, Alin Sînpălean <alin.si...@gmail.com> wrote:
Another way of approaching this issue would be to set the timestamp of scraped samples to scrape end time, rather than scrape start time. There would still exist the possibility of a race (between the samples being committed and evaluation) but it would be significantly lower than the current status quo, where the race window covers the whole request-collect-response-commit period.

I've pondered this and could see this increasing artifacts, as the lag between recorded timestamp and actual timestamp would have more jitter.

I'm not entirely sure about that, particularly as it relates to rules. As things stand right now, you're more likely to miss a sample in a rule eval because of this race condition between scrape and eval: a sample scraped while an eval was in progress is going to be ignored by that eval as well as by the following one (assuming an expression with a range equal to the eval interval; with a range that is a multiple of the eval interval you have a similar problem, in that the sample gets included in one less eval run than its peers, and ranges that are not a multiple of the eval interval have even worse problems).

As for the jitter, you will very likely have wider variance in terms of the intervals between samples. But as for the lag between the recorded and actual timestamps, how would you determine the actual timestamp? It's going to be somewhere between the scrape start and end times, but you can't say where (unless you actually measure it). One might argue that the /metrics request is likely to have been sitting in a queue for a while (so the actual timestamp is likely to be closer to the end), or that in the case of multiple collectors most would be quick (and thus most sample timestamps would be closer to the scrape start time), with one or two laggards.

The way I reason about it is that what matters is when the collection actually starts within the client. How long the collection actually takes currently doesn't affect the timestamp, so any effect will (roughly) be a constant offset between the true collection start time and the recorded timestamp - which is fine.
If you use the end time then that offset can jitter a lot more based on load and network.

I can fully see and agree with the evenly spaced vs. jittery timestamps argument. I'm just wondering how much of a problem that is in practice, both in terms of (a) how much jitter you're talking of in the average case of a service instrumented with one of the Prometheus client libraries; and (b) how significant the resulting artifacts would be. For (b) I'm thinking it would be mostly sum_over_time/count_over_time/changes/resets and the rate/increase/delta functions that would actually be affected, with the latter group relatively straightforward to fix. (With apologies again for the product placement ad.)

So I guess it's more of a choice between uniformly spaced (somewhat misleadingly so) timestamps and consistent numbers of samples in each range on the one hand, and more reliable recorded rules computed over jittery samples on the other. Neither is ideal.

Yeah, I don't think there's a clear winner here. I've seen both ways implemented, and they both work in practice.

I'm just thinking out loud here, but I'm wondering whether there is a third way, where you produce quick and dirty values for the most recent samples, to be replaced with a permanent value once all samples that may be referenced by the expression have been collected and stored (whatever that means). Or, looking at it from a different perspective, where recorded rules are not actually recorded, but rather computed on the fly, with samples older than some threshold "cached" because they are now unlikely to change. It's very likely a rabbit hole, though.

In general this sort of summarization doesn't help performance-wise, as you'll have the same amount of data to process: the interval of the summarized data will be the same as the original data's. It also only works for min and max. Recording rules only help with performance when you can reduce cardinality via aggregation across time series.

You can take an average of the increase though. Generally you should use rate rather than increase in rules as it produces base units which are easier to work with, and leave increase only for Grafana expressions and equivalents.

The reason I'm using increase rather than rate is that I'm interested in the absolute number. In a more reporting-focused view of the world you might want to know how many errors you got per day/hour/whatever. And when you have relatively few occurrences, it may make more sense to see 2 events here, 3 events there on a graph than a couple of spikes with 0.035 error QPS. And I think you meant "sum of the increase", rather than average: average as an aggregation goes together with rate.
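
As a concrete illustration of the reporting view (label name and numbers purely illustrative):

  # "how many 5xx did we serve today?" reads naturally as a count:
  sum(increase(http_requests_total{status=~"5.."}[1d]))   # e.g. 42
  # rather than as the equivalent average rate:
  sum(rate(http_requests_total{status=~"5.."}[1d]))       # e.g. ~0.00049 errors/s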

My experience is that taking a normal rate, and then either graphing that directly or taking an avg_over_time is sufficient for practical purposes. The interesting numbers also tend to be ratios such as latency and failure ratios, in which such artifacts are cancelled out.

I am probably more OCD than most (or it could be that the numbers I'm dealing with are smaller than those of the average Prometheus user), but the combination of graphs with varying resolution, spiky counters and "leaky" ranges (when talking about rate/increase) pushed me in the opposite direction. The particular case that pushed me over the edge was a graph with only a handful of spikes corresponding to one-off errors and not being able to see all of them at once, depending on when I loaded the graph, unless I "smeared" them out over a long enough time range. With sincere apologies for ranting.

There exists the possibility that a "crisp" graph of overly precise values and timestamps might lead one to believe that the underlying data is 100% accurate (which it isn't, for all the arguments above and many more). But it is a more useful debugging tool and all it needs is a disclaimer and/or forced introduction to metric based monitoring, which might eventually lead prospective users to better instrument their code.

Cheers,
Alin.

Brian Brazil

May 30, 2018, 6:12:44 AM
to Alin Sînpălean, Prometheus Developers
On 30 May 2018 at 10:39, Alin Sînpălean <alin.si...@gmail.com> wrote:

I can fully see and agree with the evenly spaced vs. jittery timestamps argument. I'm just wondering how much of a problem that is in practice, both in terms of (a) how much jitter you're talking of in the average case of a service instrumented with one of the Prometheus client libraries; and (b) how significant the resulting artifacts would be. For (b) I'm thinking it would be mostly sum_over_time/count_over_time/changes/resets and the rate/increase/delta functions that would actually be affected, with the latter group relatively straightforward to fix. (With apologies again for the product placement ad.)

I don't see how this is fixable; you've got timestamp jitter one way or the other. If you're using the end timestamp, then the jitter will (roughly) be determined by the slowest collector, which can vary more - especially with exporters.
 

I'm just thinking out loud here, but I'm wondering whether there is a third way, where you produce quick and dirty values for the most recent samples, to be replaced with a permanent value once all samples that may be referenced by the expression have been collected and stored (whatever that means). Or, looking at it from a different perspective, where recorded rules are not actually recorded, but rather computed on the fly, with samples older than some threshold "cached" because they are now unlikely to change. It's very likely a rabbit hole, though.

That'd be some rather deep changes to how Prometheus works alright, for relatively little gain.
 
The reason I'm using increase rather than rate is that I'm interested in the absolute number. In a more reporting-focused view of the world you might want to know how many errors you got per day/hour/whatever. And when you have relatively few occurrences, it may make more sense to see 2 events here, 3 events there on a graph than a couple of spikes with 0.035 error QPS.

I'm not saying you shouldn't display other units, but for internal calculations it's best to stick with per-second values.

And I think you meant "sum of the increase", rather than average: average as an aggregation goes together with rate.

I meant average. Increase is only syntactic sugar over rate, so all the same reasoning applies.
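
Concretely, for any counter:

  increase(http_requests_total[5m])
  # is the same calculation as
  rate(http_requests_total[5m]) * 300

so whatever reasoning (or artifacts) apply to one apply to the other.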

I am probably more OCD than most (or it could be that the numbers I'm dealing with are smaller than those of the average Prometheus user), but the combination of graphs with varying resolution, spiky counters and "leaky" ranges (when talking about rate/increase) pushed me in the opposite direction. The particular case that pushed me over the edge was a graph with only a handful of spikes corresponding to one-off errors and not being able to see all of them at once, depending on when I loaded the graph, unless I "smeared" them out over a long enough time range. With sincere apologies for ranting.

I'm well aware of such issues, and the important thing to me personally is to be able to notice when graphs aren't telling the whole truth due to being incorrectly constructed (which sounds like your example) or one of the fundamental metrics issues. At the point where I hit the limit of metrics, I switch to other data sources - but I likely understand the limitations of metrics better than most.

No matter what we do, users are always going to construct incorrect or less than perfectly useful dashboards. To reduce that and help users produce monitoring that generally works, I think we should be providing consistent, clear guidance there (e.g. the 4x recommendation).
 

There exists the possibility that a "crisp" graph of overly precise values and timestamps might lead one to believe that the underlying data is 100% accurate (which it isn't, for all the arguments above and many more). But it is a more useful debugging tool and all it needs is a disclaimer and/or forced introduction to metric based monitoring, which might eventually lead prospective users to better instrument their code.

It's a bit more than a possibility; it's a reasonably common occurrence. In addition we have users believing that about previous systems they used, and not understanding that those weren't accurate either. I did my Counting With Prometheus talk with all this in mind.

I'm not sure what you're getting at with better instrumentation.

-- 

Alin Sînpălean

May 30, 2018, 8:07:58 AM
to Brian Brazil, Prometheus Developers
[In the interest of not turning a very useful and enlightening discussion into a flamewar, I'll try to refrain from going into my personal experiences and (hopefully) stick to objective measures].

On Wed, May 30, 2018 at 12:12 PM, Brian Brazil <brian....@robustperception.io> wrote:
I don't see how this is fixable; you've got timestamp jitter one way or the other. If you're using the end timestamp, then the jitter will (roughly) be determined by the slowest collector, which can vary more - especially with exporters.

Fair enough. I was thinking specifically about the case of timestamps moving around a bit but not so much as to end up in the neighboring range. If that was the only problem you were trying to address, then you might e.g. not adjust the increase by the requested_range / actual_range ratio and just take the actual increase as a starting point for the rate calculation. As long as your samples stayed within the same range, you'd get fewer artifacts. As for samples "jumping" from one range to the next, you would then get an underestimation in range R and an overestimation by the same amount in range R+1. Not ideal, but not critical either.
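
Back-of-the-envelope, and glossing over rate()'s boundary handling, the two options are roughly:

  raw_increase  = v_last - v_first                 # over the samples actually found in the range
  covered       = t_last - t_first
  extrapolated  ≈ raw_increase * range / covered   # roughly what rate()/increase() report today
  unadjusted    = raw_increase                     # what I'd use, as long as samples stay in their range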

But regardless, you would indeed end up producing more artifacts than currently.

That'd be some rather deep changes to how Prometheus works alright, for relatively little gain.

Indeed. For the vast majority of use cases.

I meant average. Increase is only syntactic sugar over rate, so all the same reasoning applies.

With this approach you end up with a graph with (say) one point per day, displaying the average rate of errors per 10 seconds. That's objectively worse than a graph of the average rate of errors per second. And my goal is to have a graph of the number of errors per day, which is what sum(increase(foo)) provides (given a slightly tweaked increase calculation).
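
With a 10s increase rule and a 1d step, the two roads are (rule name and numbers purely illustrative):

  avg_over_time(job:errors:increase_10s[1d])   # average errors per 10s bucket, e.g. 0.005
  sum_over_time(job:errors:increase_10s[1d])   # errors that day, e.g. 0.005 * 8640 buckets = 43.2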

I'm well aware of such issues, and the important thing to me personally is to be able to notice when graphs aren't telling the whole truth due to being incorrectly constructed (which sounds like your example) or one of the fundamental metrics issues. At the point where I hit the limit of metrics, I switch to other data sources - but I likely understand the limitations of metrics better than most.

I am sure you do and I think I do too, having worked on the Borgmon TSDB for a couple of years, but that's neither here nor there.

And (going back to personal experience, which I said I'd try to avoid) it is not an issue of an incorrectly constructed graph, but rather of the limitations of the tools at my disposal: Grafana will not let me do any sort of time arithmetic (beyond "fill in the graph resolution into the Prometheus query"), and Prometheus will not produce a rate/increase/whatever over a given time range, only an extrapolated approximation. As proof of the concept's correctness, I do have Grafana graphs that reliably display almost every single one-off increase in a counter, at any resolution (higher than the eval interval).

And most of the reason why it's only "almost every single one-off increase" rather than every single one is due to the fact that I have to rely on recorded rules which, as our discussion has shown, suffer from time jitter and race conditions. An /api/v1/query_range?rate(foo[10s])&step=10 query that did not leave out any samples would cover all possible edge cases except target or Prometheus restart (which is pretty close to where logs get you: they will also go missing when the target or logs collector crashes).
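
(Spelled out in full that would be something like

  /api/v1/query_range?query=rate(foo[10s])&start=1527680000&end=1527683600&step=10

with the query URL-encoded in practice and the start/end timestamps purely illustrative.)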

No matter what we do, users are always going to construct incorrect or less than perfectly useful dashboards. To reduce that and help users produce monitoring that generally works, I think we should be providing consistent, clear guidance there (e.g. the 4x recommendation).

I fully support the sentiment (and the best practices) except for the details behind that 4x (which is really a 3x).

I'm not sure what you're getting at with better instrumentation.

I was mostly venting at my peers happily accepting my generic HTTP request latency metrics as a cool stunt but failing to further instrument their code and define any metrics of their own.

Cheers,
Alin.

Brian Brazil

May 30, 2018, 8:25:01 AM
to Alin Sînpălean, Prometheus Developers
On 30 May 2018 at 13:07, Alin Sînpălean <alin.si...@gmail.com> wrote:
[In the interest of not turning a very useful and enlightening discussion into a flamewar, I'll try to refrain from going into my personal experiences and (hopefully) stick to objective measures].

That's not what I'm suggesting. Let's say you have a rate5m recorded for some counter. To zoom out to hour resolution you might do avg_over_time(rate5m[1h]) in a graph with a 1h step. The same works with increase, but the math is more finicky due to the lack of base units.
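
For example (rule names illustrative):

  # zooming a 5m rate rule out to a 1h step:
  avg_over_time(job:http_requests:rate5m[1h])            # still requests/s, comparable everywhere
  # the increase flavour needs an extra factor that depends on the range baked into the rule:
  avg_over_time(job:http_requests:increase5m[1h]) * 12   # requests per hour (12 five-minute buckets)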
 

I was mostly venting at my peers happily accepting my generic HTTP request latency metrics as a cool stunt but failing to further instrument their code and define any metrics of their own.

Ah right. At some point it'll click and you'll have to try to slow them down :)

--

Alin Sînpălean

May 30, 2018, 9:50:09 AM
to Brian Brazil, Prometheus Developers
On Wed, May 30, 2018 at 2:24 PM, Brian Brazil <brian....@robustperception.io> wrote:
That's not what I'm suggesting. Let's say you have a rate5m recorded for some counter. To zoom out to hour resolution you might do avg_over_time(rate5m[1h]) in a graph with a 1h step. The same works with increase, but the math is more finicky due to the lack of base units.

Well, isn't it the case that if your rate5m has a base unit of QPS (i.e. "queries/s") then the equivalent increase5m would have a base unit of "queries"? You would have to make it explicit in the graph (e.g. in the title or legend) that it is displaying "total queries over $__interval" (whatever that is), and that would be dynamic depending on zoom level, but the base unit is clearly "queries".

Ah right. At some point it'll click and you'll have to try to slow them down :)

Fingers crossed.

Cheers,
Alin.

Brian Brazil

May 30, 2018, 9:55:56 AM
to Alin Sînpălean, Prometheus Developers
On 30 May 2018 at 14:49, Alin Sînpălean <alin.si...@gmail.com> wrote:

Well, isn't it the case that if your rate5m has a base unit of QPS (i.e. "queries/s") then the equivalent increase5m would have a base unit of "queries"?

An increase5m has a unit of queries/5m.
 
You would have to make it explicit in the graph (e.g. in the title or legend) that it is displaying "total queries over $__interval" (whatever that is), and that would be dynamic depending on zoom level, but the base unit is clearly "queries".

The base unit is queries per second.

Brian
 

Also, it is not entirely true that it only works for min and max. It works for sums and counts too (and, by extension, averages and standard deviations), as well as changes() and resets(), and (beating my own drum again) it would work perfectly with the proposed xrate() implementation. So it could, in theory, work for any function that takes a range argument.

When using it in the standard recommended way, it works only for min and max, since in any other scenario you'll be double counting (and even then min/max will be working over a slightly different range than you requested).

If by "standard recommended way" you're referring to the recommendation of using a time range ~2.5x the eval interval for rate(), then sure.

I'm considering gauges entirely (though the argument generalises). I thought you were doing something different that we've seen users attempt, where you'd have, say, a 1h eval interval.
 
But if you're (a) willing to ignore the odd missed evaluation run and go with a range equal to the eval interval (my personal choice), or (b) willing to use a range that's a multiple of the eval interval (say 3x) and then take that into account (e.g. by dividing everything by 3), then I think you can get very accurate sums/counts/rates (again, modulo missed scrapes/evals).
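
As a sketch of (b), with made-up names: with a 10s eval interval I could record

  - record: job:http_requests:increase_30s
    expr: sum without (instance) (increase(http_requests_total[30s]))

and then, because every 10 seconds' worth of increase shows up in roughly three consecutive samples, divide the roll-up by 3:

  sum_over_time(job:http_requests:increase_30s[$__interval]) / 3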

I'm not sure this is going to buy you much compared to simpler approaches.

To be perfectly honest, the reason why I started down this path (in addition to the performance gain described above) was that I just couldn't get rate()/increase() to produce good enough numbers (both in terms of the noise introduced by extrapolation and in terms of including some samples, or rather increases between adjacent samples, more times than others). So I'm computing an increase over a fixed range (which I know how to adjust for both issues) and then I can do simple arithmetic with the results and be relatively sure to not miss or overcount any samples.

My experience is that taking a normal rate, and then either graphing that directly or taking an avg_over_time is sufficient for practical purposes. The interesting numbers also tend to be ratios such as latency and failure ratios, in which such artifacts are cancelled out.

I am probably more OCD than most (or it could be that the numbers I'm dealing with are smaller than those of the average Prometheus user), but the combination of graphs with varying resolution, spiky counters and "leaky" ranges (when talking about rate/increase) pushed me in the opposite direction. The particular case that pushed me over the edge was a graph with only a handful of spikes corresponding to one-off errors, where, depending on when I loaded the graph, I could not see all of them at once unless I "smeared" them out over a long enough time range. With sincere apologies for ranting.

I'm well aware of such issues, and the important thing to me personally is to be able to notice when graphs aren't telling the whole truth due to being incorrectly constructed (which sounds like your example) or one of the fundamental metrics issues. At the point where I hit the limit of metrics, I switch to other data sources - but I likely understand the limitations of metrics better than most.

I am sure you do and I think I do too, having worked on the Borgmon TSDB for a couple of years, but that's neither here nor there.

And (going back to personal experience, which I said I would try not to) it is not an issue of an incorrectly constructed graph, but rather of the limitations of the tools at my disposal: Grafana will not let me do any sort of time arithmetic (beyond "fill in the graph resolution into the Prometheus query") and Prometheus will not produce a rate/increase/whatever over a given time range, only an extrapolated approximation. As proof of the concept's correctness, I do have Grafana graphs that reliably display almost every single one-off increase in a counter, at any resolution (higher than eval interval).

And the main reason why it's only "almost every single one-off increase" rather than every single one is that I have to rely on recording rules which, as our discussion has shown, suffer from time jitter and race conditions. An /api/v1/query_range?rate(foo[10s])&step=10 query that did not leave out any samples would cover all possible edge cases except a target or Prometheus restart (which is pretty close to where logs get you: they will also go missing when the target or logs collector crashes).
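
Spelled out in full (with placeholder start/end times), that query would look something like

  /api/v1/query_range?query=rate(foo[10s])&start=<graph start>&end=<graph end>&step=10

where start and end are whatever window the graph covers.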

No matter what we do, users are always going to construct incorrect or less than perfectly useful dashboards. To reduce that and help users produce monitoring that generally works, I think we should be providing consistent, clear guidance there (e.g. the 4x recommendation).

I fully support the sentiment (and the best practices) except for the details behind that 4x (which is really a 3x).

There exists the possibility that a "crisp" graph of overly precise values and timestamps might lead one to believe that the underlying data is 100% accurate (which it isn't, for all the arguments above and many more). But it is a more useful debugging tool, and all it needs is a disclaimer and/or a forced introduction to metric-based monitoring, which might eventually lead prospective users to better instrument their code.

It's a bit more than a possibility, it's a reasonably common occurrence. In addition we have users believing that about previous systems they used, and not understanding that those weren't accurate either. I did my Counting With Prometheus talk with all this in mind.

I'm not sure what you're getting at with better instrumentation.

I was mostly venting at my peers happily accepting my generic HTTP request latency metrics as a cool stunt but failing to further instrument their code and define any metrics of their own.

Ah right. At some point it'll click and you'll have to try to slow them down :)

Fingers crossed.

Cheers,
Alin.


