Counting events over arbitrary time intervals


Nick Yantikov

Apr 19, 2016, 10:38:16 AM
to Prometheus Developers
Hello All

I have a monotonically incrementing counter of processed requests. My goal is to have an equivalent of Graphite's summarize(nonNegativeDerivative(simulated_requests_total), '1m', 'sum', false), or the same with '1h', '1d', etc.
The closest I could get to the expected result is increase(simulated_requests_total[1m]), but it still does not match the result I expect.

Could you please point me in the right direction?

Thank you
-Nick

Brian Brazil

unread,
Apr 19, 2016, 10:48:06 AM4/19/16
to Nick Yantikov, Prometheus Developers
When doing Prometheus evaluation over a time range, each point in time is evaluated independently. So if you want to do alignToFrom=true, you need to specify a start time that's aligned with what you want and then use increase().
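As a rough illustration of the alignment advice above, the following sketch builds range-query parameters whose evaluation points fall on whole-bucket boundaries. The helper name aligned_range_params and the choice of 30 buckets are assumptions for illustration; query, start, end and step are the parameters accepted by the Prometheus HTTP API's query_range endpoint.

```python
import time


def aligned_range_params(query, end_ts, bucket_s, buckets):
    """Build query_range parameters whose evaluation points sit on bucket boundaries.

    Hypothetical helper: it only computes the parameters, it does not send a request.
    """
    # Round the end time down to the most recent bucket boundary.
    aligned_end = int(end_ts) // bucket_s * bucket_s
    # Step back so that exactly `buckets` evaluation points are produced.
    aligned_start = aligned_end - (buckets - 1) * bucket_s
    return {
        "query": query,
        "start": aligned_start,
        "end": aligned_end,
        "step": bucket_s,
    }


# Thirty one-minute buckets ending at the last full minute.
params = aligned_range_params(
    "increase(simulated_requests_total[1m])", time.time(), 60, 30
)
```

With start aligned and step equal to the range, every evaluation point covers one distinct minute, which is the closest analogue to a per-minute report.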



Nick Yantikov

Apr 19, 2016, 11:51:13 AM
to Prometheus Developers, ovo...@gmail.com

The other thing is that, in order to properly plot a 1-minute counter, it looks like I also have to specify a proper "step" value of 60 seconds. Otherwise, if I understand correctly, it plots the results of overlapping 1-minute intervals, which does not make much sense. Or does it?

A question about the "increase" function. The doc says it "calculates the increase in the time series in the range vector". If my scraping interval is 15 seconds then "simulated_requests_total[1m]" seems to always return four data points (which is expected). However, does the "increase" function only calculate the increase between these four data points? For the purpose of what I am trying to achieve, it should account for the data point from the scrape immediately preceding the 1m interval. Is that right? Otherwise, I was able to model a situation where, if the first data point in the 1m window is a significant spike in request count (compared to the data point right before it), my chart completely misses the spike.

Brian Brazil

Apr 19, 2016, 11:59:23 AM
to Nick Yantikov, Prometheus Developers
On 19 April 2016 at 16:51, Nick Yantikov <ovo...@gmail.com> wrote:

The other thing is that, in order to properly plot a 1-minute counter, it looks like I also have to specify a proper "step" value of 60 seconds. Otherwise, if I understand correctly, it plots the results of overlapping 1-minute intervals, which does not make much sense. Or does it?

It makes sense, but not if you're trying to produce a per-minute report.

A question about the "increase" function. The doc says it "calculates the increase in the time series in the range vector". If my scraping interval is 15 seconds then "simulated_requests_total[1m]" seems to always return four data points (which is expected). However, does the "increase" function only calculate the increase between these four data points?

Yes, it calculates based only on those data points. There's a small bit of extrapolation in there too to produce more accurate numbers overall.

For the purpose of what I am trying to achieve, it should account for the data point from the scrape immediately preceding the 1m interval. Is that right?

That's one way you could do it, but what we have works quite well without that.

Otherwise, I was able to model a situation where, if the first data point in the 1m window is a significant spike in request count (compared to the data point right before it), my chart completely misses the spike.

If that's the sort of thing you're looking for then this is not how you should go about it. We recommend use of rate() to keep things consistently measured per-second, and then have your step smaller than the range in your rate(). This will let you see any spikes.
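The missed spike and the effect of a smaller step can be reproduced with a small simulation. This is only a sketch under stated assumptions: the counter values are invented, the window is taken as start < t <= end, and naive_increase is a plain last-minus-first difference, so it deliberately omits the extrapolation and counter-reset handling that the real increase() performs.

```python
def samples_in(samples, start, end):
    """Return the samples whose timestamp satisfies start < t <= end (assumed semantics)."""
    return [(t, v) for t, v in samples if start < t <= end]


def naive_increase(samples, start, end):
    """Last value minus first value in the window; no extrapolation, no reset handling."""
    pts = samples_in(samples, start, end)
    return pts[-1][1] - pts[0][1] if len(pts) >= 2 else 0


# Hypothetical counter scraped every 15s at t = 5, 20, 35, ...
# It grows by 1 per scrape, except for a spike of 100 between t=50 and t=65.
samples = []
value = 0
for t in range(5, 185, 15):
    value += 100 if t == 65 else 1
    samples.append((t, value))

# Non-overlapping 1m windows evaluated at t = 60, 120, 180 (step equal to the range).
per_minute = [naive_increase(samples, end - 60, end) for end in (60, 120, 180)]

# Overlapping 1m windows evaluated every 15s (step smaller than the range).
overlapping = [naive_increase(samples, end - 60, end) for end in range(60, 181, 15)]
```

In this toy data the spike falls between the last sample of one window and the first sample of the next, so none of the non-overlapping windows sees it, while the overlapping windows do.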


Nick Yantikov

Apr 19, 2016, 12:38:19 PM
to Brian Brazil, Prometheus Developers
On Tue, Apr 19, 2016 at 8:59 AM, Brian Brazil <brian....@robustperception.io> wrote:

A question about the "increase" function. The doc says it "calculates the increase in the time series in the range vector". If my scraping interval is 15 seconds then "simulated_requests_total[1m]" seems to always return four data points (which is expected). However, does the "increase" function only calculate the increase between these four data points?

Yes, it calculates based only on those data points. There's a small bit of extrapolation in there too to produce more accurate numbers overall.

Could you please point me to the source code so I can learn more?

For the purpose of what I am trying to achieve, it should account for the data point from the scrape immediately preceding the 1m interval. Is that right?

That's one way you could do it, but what we have works quite well without that.

What is the other way that works well?

Otherwise, I was able to model a situation where, if the first data point in the 1m window is a significant spike in request count (compared to the data point right before it), my chart completely misses the spike.

If that's the sort of thing you're looking for then this is not how you should go about it. We recommend use of rate() to keep things consistently measured per-second, and then have your step smaller than the range in your rate(). This will let you see any spikes.


I'd say that plotting counters over time periods (1m, 5m, 1h, 1d, etc.) is a very common request. I cannot override this requirement. There is a Graphite (in Grafana) dashboard that plots counters over time intervals. summarize(nonNegativeDerivative(simulated_requests_total), '1m', 'sum', false) produces exact results. My goal is to get the same dashboard using Prometheus data and queries.

Summarizing the conversation so far, it looks like in order to produce a dashboard with 1-minute counters I need to query "increase(simulated_requests_total[75s])", where 75 = 60 + the 15-second scraping interval, and use step = 60. Is this how you would recommend approaching this, or am I missing the mark entirely? Are there better ways of achieving this requirement?

Thank you for your prompt answers.
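The idea behind the 75s window can be checked on toy data. This is a sketch, assuming evenly spaced scrapes, invented increments, a window of start < t <= end, and a raw last-minus-first difference; the real increase() also extrapolates, so its output for a [75s] range would differ from these raw numbers.

```python
from itertools import accumulate


def naive_increase(samples, start, end):
    """Last value minus first value for samples with start < t <= end (no extrapolation)."""
    pts = [(t, v) for t, v in samples if start < t <= end]
    return pts[-1][1] - pts[0][1] if len(pts) >= 2 else 0


# Hypothetical per-scrape increments, scraped every 15s at t = 5, 20, ..., 170.
increments = [1, 2, 5, 1, 100, 3, 2, 4, 1, 7, 2, 6]
samples = list(zip(range(5, 185, 15), accumulate(increments)))

ends = (60, 120, 180)
with_60s = [naive_increase(samples, end - 60, end) for end in ends]
with_75s = [naive_increase(samples, end - 75, end) for end in ends]

# Total growth between the first and last sample, for comparison.
total = samples[-1][1] - samples[0][1]
```

On this idealized data the 75s windows each reach back to the last sample of the previous minute, so their raw differences add up to the total growth, whereas the 60s windows drop the delta that straddles each boundary.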

Brian Brazil

Apr 20, 2016, 5:19:44 AM
to Nick Yantikov, Prometheus Developers
On 19 April 2016 at 17:37, Nick Yantikov <ovo...@gmail.com> wrote:

For the purpose of what I am trying to achieve, it should account for the data point from the scrape immediately preceding the 1m interval. Is that right?

That's one way you could do it, but what we have works quite well without that.

What is the other way that works well?

Have your step smaller than the rate's range.

I'd say that plotting counters over time periods (1m, 5m, 1h, 1d, etc.) is a very common request. I cannot override this requirement. There is a Graphite (in Grafana) dashboard that plots counters over time intervals. summarize(nonNegativeDerivative(simulated_requests_total), '1m', 'sum', false) produces exact results.

Looking at the code, Graphite's summarize function is not returning exact results.

My goal is to get the same dashboard using Prometheus data and queries.

If you're looking for exact results then Prometheus (and Graphite) aren't for you.


Summarizing the conversation so far, it looks like in order to produce a dashboard with 1-minute counters I need to query "increase(simulated_requests_total[75s])", where 75 = 60 + the 15-second scraping interval, and use step = 60. Is this how you would recommend approaching this, or am I missing the mark entirely? Are there better ways of achieving this requirement?

The only way to get this data exactly is from log processing. You can't do what you're looking to do with a metrics-based system such as Prometheus or Graphite.


When switching monitoring systems you can't expect that everything will work in exactly the same way, as different systems have different tradeoffs and design decisions. If you're looking to graph request rate over time then Prometheus is perfectly suited to that and will produce better results than Graphite; they're not going to be identical, though.


Nick Yantikov

Apr 21, 2016, 1:03:42 AM
to Brian Brazil, Prometheus Developers
It was never my intention to slide into religious wars or a framework battle.

My question really is how to derive some results based on the information already recorded in Prometheus. The reasons I brought Graphite up are: a) as people currently use Graphite queries, they (myself included) will be asking what the path is to achieve the same in Prometheus, and b) Graphite does return results that match the test data.

Namely, how do I aggregate counters into larger time intervals based on a monotonically increasing counter metric? I might be oversimplifying things, but it looks like if there were a function that takes the diff between the t and t-1 data points (accounting for counter resets, of course), then I could "sum_over_time" the results of this function to get the desired result. By the same token, if I reset the counter in my test harness after every scraping period (just for the sake of the experiment), then sum_over_time(simulated_requests_total[1m]) produces results that match the test data.


Brian Brazil

Apr 21, 2016, 3:44:02 AM
to Nick Yantikov, Prometheus Developers
On 21 April 2016 at 06:03, Nick Yantikov <ovo...@gmail.com> wrote:

My question really is how to derive some results based on the information already recorded in Prometheus. The reasons I brought Graphite up are: a) as people currently use Graphite queries, they (myself included) will be asking what the path is to achieve the same in Prometheus, and b) Graphite does return results that match the test data.

Graphite could return the correct result on synthetic test data, but not real world data. This is actually a small challenge when unit testing some of our more intricate time-based logic, as perfectly aligned data won't expose real world behaviour.

In real world scenarios Graphite and Prometheus simply don't have access to the data needed to get the exact correct result; no math can fix that. We can only do statistical estimates, which summarize doesn't appear to attempt, as it presumes all results belong to the bucket they land in rather than apportioning some of the first point in a bucket to the previous bucket. For Prometheus we do some extrapolation, which uses the points in a bucket to estimate out to the boundaries of the bucket.
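The basic idea of estimating out to the bucket boundaries can be sketched as follows. This is a simplified illustration, not the Prometheus implementation: it scales the observed increase linearly to the full window, ignores counter resets, and leaves out any limits on how far the estimate is extended. The function name and sample values are invented.

```python
def extrapolated_increase(points, window_start, window_end):
    """Scale the increase seen between the first and last point to the whole window.

    Simplified sketch: linear scaling only, no counter-reset handling.
    """
    (t_first, v_first), (t_last, v_last) = points[0], points[-1]
    observed = v_last - v_first   # increase actually seen between samples
    covered = t_last - t_first    # seconds spanned by those samples
    if covered <= 0:
        return 0.0
    return observed * (window_end - window_start) / covered


# Four scrapes inside a 60s window; they span only 45s of it.
points = [(5, 100), (20, 110), (35, 120), (50, 130)]
estimate = extrapolated_increase(points, 0, 60)
```

Here the samples show an increase of 30 over 45 seconds, and the estimate for the full 60-second window is proportionally larger. It is a statistical estimate of what happened in the bucket, not an exact count.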

Namely, how do I aggregate counters into larger time intervals based on a monotonically increasing counter metric? I might be oversimplifying things, but it looks like if there were a function that takes the diff between the t and t-1 data points (accounting for counter resets, of course), then I could "sum_over_time" the results of this function to get the desired result. By the same token, if I reset the counter in my test harness after every scraping period (just for the sake of the experiment), then sum_over_time(simulated_requests_total[1m]) produces results that match the test data.

We don't have functions and features to allow for that currently; it'd require https://github.com/prometheus/prometheus/issues/394 at the least. What you're really asking for isn't just a new function, it's a change in our core evaluation model.

If you're trying to do this sort of custom analysis you're best off getting the raw data from Prometheus and doing the processing yourself.
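Processing the raw data yourself might look roughly like the sketch below. It assumes the raw (timestamp, value) samples have already been fetched from Prometheus, and the function name and sample values are hypothetical. Each reset-aware delta is attributed entirely to the bucket in which the later sample lands, which is the same simplification described above for summarize, so the result is still an estimate rather than an exact count.

```python
from collections import defaultdict


def bucketed_increase(samples, bucket_s):
    """Sum reset-aware deltas between consecutive samples into fixed-size buckets."""
    buckets = defaultdict(float)
    for (t_prev, v_prev), (t_cur, v_cur) in zip(samples, samples[1:]):
        # A drop in value is treated as a counter reset: count up from zero.
        delta = v_cur - v_prev if v_cur >= v_prev else v_cur
        # The whole delta goes to the bucket containing the later sample.
        buckets[int(t_cur // bucket_s) * bucket_s] += delta
    return dict(buckets)


# Hypothetical raw samples; the counter resets between t=35 and t=50.
samples = [(5, 10), (20, 14), (35, 20), (50, 3), (65, 9), (80, 15)]
result = bucketed_increase(samples, 60)
```

The keys of result are bucket start times in seconds and the values are the summed increases for each bucket.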

At the end of the day Graphite produces the wrong result. Prometheus produces a different wrong result. We're not likely to make major changes to the project to add support for a different type of wrong result merely because another project made different design decisions. Both approaches are generally good enough in practice.

Brian


