Best way to calculate the increase in the counter value over a time period?

O

Apr 30, 2020, 9:57:50 PM
to Prometheus Users
Hi everyone,
I am using increase() function to calculate the increase in counter value over a time period. These values are being displayed in a table in Grafana. But, for a duration of 15 days or so, it errors out because the number of samples that are being pulled is too high and the limit for --query.max-samples flag is crossed

So, my question is whether there is a better way to calculate the increase in the counter and display it in the Grafana table without pulling so many samples from Prometheus.

Thanks!

Brian Candler

May 1, 2020, 2:34:47 AM
to Prometheus Users
Can you show your actual query? Also, what version of Prometheus are you using?

Christian Hoffmann

May 1, 2020, 2:46:58 AM
to O, Prometheus Users
Hi,
increase() tries to detect counter resets. In order for this to work,
each data point has to be considered (at least I assume that this is the
case). I don't see a way around this.

If you know for sure that your counter does not reset (at least in the
timeframe you are interested in), you might achieve what you want by a
simple subtraction, which should be less resource-intensive:

your_metric - your_metric offset 14d

Of course, you can also increase the max-samples value. It is primarily
there as a safeguard against high resource usage (i.e. raising it may
mean more RAM and longer processing times).
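For reference, --query.max-samples is a startup flag on the Prometheus server (its default in recent versions is 50,000,000); the value below is only an illustration, not a recommendation, and must fit your RAM budget:

```shell
# Raise the per-query sample limit at startup
# (100000000 is just an example value, not a recommendation).
prometheus --config.file=prometheus.yml --query.max-samples=100000000
```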

Kind regards,
Christian

Brian Brazil

May 1, 2020, 2:51:43 AM
to Christian Hoffmann, O, Prometheus Users
On Fri, 1 May 2020 at 07:46, Christian Hoffmann <ma...@hoffmann-christian.info> wrote:
Hi,

On 5/1/20 3:57 AM, O wrote:
> I am using increase() function to calculate the increase in counter
> value over a time period. These values are being displayed in a table in
> Grafana. But, for a duration of 15 days or so, it errors out because the
> number of samples that are being pulled is too high and the limit
> for |--query.max-samples| flag is crossed. 
>
> So, my question is if there is a better way to calculate the increase in
> counter and display it in the Grafana table without pulling so many
> labels from Prometheus.

increase() tries to detect counter resets. In order for this to work,
each data point has to be considered (at least I assume that this is the
case). I don't see a way around this.

You're correct.
 

If you know for sure that your counter does not reset (at least in the
timeframe you are interested in), you might achieve what you want by a
simple subtraction, which should be less resource-intensive:

your_metric - your_metric offset 14d

Of course, you can also increase the max-samples value. It is primarily
there as a safeguard against high resource usage (i.e. you might need
more RAM and longer processing times).

Even with 15d of data at a 1s interval, that's only 1.3M samples that need to be in memory at a time to calculate the rate() - so it's not the rate() function that's the issue here.

--

O

May 1, 2020, 4:16:08 AM
to Prometheus Users
Thanks for your responses. I am using grok_exporter to parse logs and convert them into Prometheus metrics.
The grok_exporter service won't restart very often, but it could still happen.

Here's the query that I am using to calculate the egress data by different users in the selected time range in Grafana:
sum by (category, user)(increase(user_egress_total{job="test", user=~"$user", category=~"$category"}[$__range]))

Cardinality is too high for the metric, so I end up with ~1200 time series. After applying the step and changing [$__range] to [$__range:1m] (a subquery with an explicit 1m resolution), I was able to make it work for ~15 days.
I understand that high cardinality metrics are not recommended for Prometheus. But, I am wondering if there is a better way of implementing it in Prometheus either using a different exporter or by rewriting the query. 
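Written out in full, the subquery variant described above would look like this (the 1m resolution is the value mentioned in this thread; a coarser step would cut samples further at some cost in accuracy):

```promql
sum by (category, user) (
  increase(user_egress_total{job="test", user=~"$user", category=~"$category"}[$__range:1m])
)
```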

Appreciate your inputs. Thanks!

Ben Kochie

May 1, 2020, 4:53:57 AM
to O, Prometheus Users
The other way to solve this is to use recording rules to pre-summarize the data.

For example:

groups:
- name: User Egress
  interval: 5m
  rules:
  - record: category_user:user_egress_total:increase5m
    expr: sum by (category, user) (increase(user_egress_total{job="test"}[5m]))

With this, you can now summarize with fewer samples over longer periods of time.

sum_over_time(category_user:user_egress_total:increase5m{user=~"$user", category=~"$category"}[$__range])
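Since each recorded sample already holds the increase over its own 5m window, summing those samples over the dashboard range approximates the total increase while touching only one point per series per 5 minutes (~4,320 points per series for 15 days, instead of every raw sample). Further rollups stay equally cheap; e.g. a per-category-only view (a hypothetical variant, not from the thread) would be:

```promql
sum by (category) (
  sum_over_time(category_user:user_egress_total:increase5m{category=~"$category"}[$__range])
)
```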

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/4d076cee-9ac9-4b0d-8247-83c2bceb0ff3%40googlegroups.com.

O

May 3, 2020, 5:40:54 PM
to Prometheus Users
Thanks Ben! I will learn more about recording rules and give this a shot.