How to rate a sum of the same counter from different machines?


amir....@gmail.com

Oct 21, 2018, 6:25:28 AM
to Prometheus Users
Hello,

I have a Prometheus problem that I have been failing to solve for a long time, although it seems basic to me.

I have a Prometheus counter, and I want to get its rate over a time range
(the real goal is to sum the rate, and sometimes to use histogram_quantile on it for a histogram metric).
However, I have multiple machines running that kind of job, and each one sets its own instance label.
As a result, inc operations on this counter on different machines create different counter entities, because each combination of label values is unique.
The problem is that rate() works separately on each such counter entity.
The result is that counter entities with unique label combinations are not taken into account by rate().
For example, if I've got:

    mycounter{aaa="1",instance="1.2.3.4:6666",job="job1"} value: 1
    mycounter{aaa="2",instance="1.2.3.4:6666",job="job1"} value: 1
    mycounter{aaa="2",instance="1.2.3.4:7777",job="job1"} value: 1
    mycounter{aaa="1",instance="5.5.5.5:6666",job="job1"} value: 1

All counter entities are unique, so each of them gets a value of 1.
If counter label combinations are always unique, rate(mycounter[5m]) would return 0 in this case,
and sum(rate(mycounter[5m])) would return 0, which is not what I need!
I want to ignore the instance label so that these mycounter inc operations are treated as if they were made on the same counter entity.
In other words, I expect to have only 2 entities (they can have a common instance value or no instance value):

    mycounter{aaa="1", job="job1"} value: 2
    mycounter{aaa="2", job="job1"} value: 2

In such a case, the values of the existing entities are increased instead of new entities with a value of 1 being added, and rate() would return real rates for each, so we can sum() them.
How do I do that?  

I made several attempts to solve it, but all failed:
  • Doing a rate() of the sum() - fails because of a type mismatch...
  • Removing the automatic instance label, using metric_relabel_configs with action: labeldrop in the configuration, but then it is assigned the default address value.
  • Changing all instance values to a common one using metric_relabel_configs with replacement, but it seems that one of the entities overwrites all the others, so it doesn't help...
Any suggestions?  

Prometheus version: 2.3.2  
Thanks in advance!
-- Amir

Simon Pasquier

Oct 22, 2018, 4:36:08 AM
to amir....@gmail.com, Prometheus Users
In general you would do:
sum by (instance) (rate(mycounter[5m]))

You are probably not initializing the counter values to zero which is
why you are getting a rate of 0.
https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics
https://www.robustperception.io/existential-issues-with-metrics
> --
> You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
> To post to this group, send email to promethe...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/7e60b875-5874-4974-8260-f6137b4f63fe%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

amir....@gmail.com

Oct 22, 2018, 9:28:41 AM
to Prometheus Users
Thanks Simon,
But you probably didn't get my point!

First, I don't want to get the counter rate for every instance (which is what your query does) - I don't care about instance-level info!

Second, it wouldn't even get the right rate for each instance - and that's actually my problem here:
If the counter increases in a new instance, I get a new instance counter with a value of 1 - at this point rate() gives 0 for that instance.
I have multiple instances and jobs are given to them in parallel, each of which increases its counter.
So it's a typical case that each instance increases the counter once.
Assume I start with 10 instances of the job; in such a scenario I get a rate of 0, although it should really reflect the change from 0 to 10.
It doesn't happen only at the beginning, because I work in a Kubernetes cluster, where Kubernetes jobs are created with auto-created addresses and removed all the time.
So many new instances are added during runtime.

Maybe it's really a reset problem:
I tried to reset the counter after creation in the client, but it doesn't help - maybe I did something wrong...
I write NodeJS, so I use the 'prom-client' module in the following way:

    const client = require('prom-client');
    const requestCounter = new client.Counter(...);
    requestCounter.reset();

It seems to clear the internal map (which is already empty).
What I really need is a way to make the Prometheus client treat every increase of a counter with a unique combination of label values as if the counter had been 0 before,
so that it would get a real rate from new instances as well.

Any ideas?
Thanks

Simon Pasquier

Oct 22, 2018, 10:55:40 AM
to amir....@gmail.com, Prometheus Users
On Mon, Oct 22, 2018 at 3:28 PM <amir....@gmail.com> wrote:
>
> Thanks Simon,
> But you probably didn't get my point!
>
> First, I don't want to get the counter rate for every instance (which is what your query does) - I don't care about instance level info!

Then the query is simply: sum(rate(mycounter[5m]))

>
> Second, It wouldn't even get the right rate of each instance - and that's actually my problem here:
> If the counter increases in a new instance, I get a new instance counter with value of 1 - at this point rate() gives 0 for that instance.
> I have multiple instances and jobs are given to them in parallel, each increases its counter.
> So it's a typical case when each instance increases the counter once.
> Assume I start with 10 instances of the job, in such scenario I get a rate of 0, although it should really reflect the change from 0 to 10.
> It doesn't happen only at the beginning, because I work in Kubernetes cluster, where Kubernetes jobs are created with auto-created address and removed all the time.
> So many new instances are added during runtime.

If Prometheus gets the counter with 0 and later it gets the counter
with 1 then rate() won't be zero if the range contains the 2 different
values.
But IIUC your counter value is always 1 for Prometheus. If so then
rate() can't detect a change...
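
To make the point above concrete, here is a minimal sketch in plain JavaScript. It is a deliberately simplified model (last sample minus first sample in the window), not the actual rate() implementation, which also handles counter resets and extrapolation. The function name `naiveIncrease` and the sample values are made up for illustration.

```javascript
// Simplified model of the increase seen inside one range window:
// last sample minus first sample. This is an assumption-laden sketch,
// not how Prometheus computes rate() internally.
function naiveIncrease(samples) {
  if (samples.length < 2) return null; // not enough samples to compare
  return samples[samples.length - 1] - samples[0];
}

// The series was scraped at 0 before the first increment: the change is visible.
const seenFromZero = naiveIncrease([0, 1, 1]);

// The series first appears with the value 1: there is nothing to compare against.
const firstSeenAtOne = naiveIncrease([1, 1, 1]);

console.log(seenFromZero, firstSeenAtOne);
```

In this model the first series shows an increase of 1 and the second shows 0, which is why exposing the counter at 0 before the first increment matters.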

>
> Maybe it's really a reset problem:
> I tried to reset the counter after creation in the client, but it doesn't help - maybe I did something wrong...
> I write NodeJS so I use 'prom-client' module in the following way:
>
> const client = require('prom-client');
> requestCounter = new client.Counter(...);
> requestCounter.reset();
>
> It seems to clear the internal map (which is already empty).

I'm not at all familiar with the NodeJS client but in general you
don't have to reset counters.
If you have created the counter metric with all the possible
combinations of label names and values then it is fine provided that
Prometheus scrapes your target before the counter is incremented.

val...@gmail.com

Oct 22, 2018, 3:33:59 PM
to Prometheus Users

As far as I know, this task is impossible with vanilla PromQL, but it is easy to do with Extended PromQL provided by VictoriaMetrics:

    srate(sum(remove_resets(mycounter)) without (instance))

- remove_resets removes counter resets from mycounter.
- srate calculates `step rate`, i.e. rate over each step in the query range.

amir....@gmail.com

Oct 23, 2018, 5:15:17 AM
to Prometheus Users
Hi,
Thanks for answering.
I didn't find any info related to these extension functions except the Extended PromQL page (no code, detailed description or reference guide).
I don't understand:
  • How can it solve my problem?
  • What do I need for Prometheus and Grafana to support Extended PromQL?
  • Is it open source?
  • Has it been released already, and if so, when?

Aliaksandr Valialkin

Oct 23, 2018, 7:30:36 AM
to amir....@gmail.com, promethe...@googlegroups.com
On Tue, Oct 23, 2018 at 12:15 PM <amir....@gmail.com> wrote:
> Hi,
> Thanks for answering.
> I didn't find any info related to these extension functions except the Extended PromQL page (no code, detailed description or reference guide).
> I don't understand:
>   • How can it solve my problem?

Let's look at the query again:

   srate(sum(remove_resets(mycounter)) without (instance))
 
The inner `remove_resets(mycounter)` removes counter resets from each `mycounter` timeseries. This is necessary, since `mycounter` may reset when the corresponding instance resets. Otherwise incorrect results will appear.
Then the `sum() without (instance)` sums all the counters ignoring `instance` label. For example, the following input timeseries:

    mycounter{instance="foo", job="x"} 10
    mycounter{instance="bar", job="x"} 20
    mycounter{instance="baz", job="y"} 5
    mycounter{instance="xxx", job="y"} 10

will be converted into

    {job="x"} 10+20=30
    {job="y"} 5+10=15

I.e. an independent counter for each group of mycounter timeseries ignoring `instance` label.
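
The grouping step described above can be sketched in plain JavaScript as follows. This only models what `sum() without (instance)` does to one instant of data; the helper name `sumWithout` and the object layout are assumptions made for this illustration.

```javascript
// Sum series values after dropping one label, grouping by the remaining labels.
// Hypothetical helper for illustration; not part of any Prometheus client.
function sumWithout(series, dropLabel) {
  const groups = new Map();
  for (const { labels, value } of series) {
    // Keep every label except the dropped one, in a stable order.
    const kept = Object.entries(labels)
      .filter(([name]) => name !== dropLabel)
      .sort(([a], [b]) => a.localeCompare(b));
    const key = JSON.stringify(kept);
    groups.set(key, (groups.get(key) || 0) + value);
  }
  return [...groups].map(([key, value]) => ({
    labels: Object.fromEntries(JSON.parse(key)),
    value,
  }));
}

const grouped = sumWithout(
  [
    { labels: { instance: 'foo', job: 'x' }, value: 10 },
    { labels: { instance: 'bar', job: 'x' }, value: 20 },
    { labels: { instance: 'baz', job: 'y' }, value: 5 },
    { labels: { instance: 'xxx', job: 'y' }, value: 10 },
  ],
  'instance'
);
console.log(grouped);
```

With the four input series from the example, this yields one entry for job "x" with value 30 and one for job "y" with value 15.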

The `srate` calculates the rate for each i-th point using the following formula:

    (v(i+1) - v(i)) / step

where v(i) is the timeseries value at the i-th point, and step is the duration in seconds between subsequent points. Step is usually passed into the query_range request by Grafana. It usually corresponds to the time distance between subsequent points on the graph.
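
As a rough sketch of that formula in plain JavaScript (the function name `stepRate` and the sample values are made up for illustration, and this is not the actual srate implementation):

```javascript
// Per-step rate: difference between neighbouring points divided by the step.
// Returns one value fewer than the input, since each rate needs a pair of points.
function stepRate(values, stepSeconds) {
  const rates = [];
  for (let i = 0; i + 1 < values.length; i++) {
    rates.push((values[i + 1] - values[i]) / stepSeconds);
  }
  return rates;
}

// Hypothetical counter sampled every 30 seconds.
const rates = stepRate([0, 30, 90, 90], 30);
console.log(rates);
```

Here the counter grows by 30, then by 60, then not at all, so the per-second rates are 1, 2 and 0.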
>   • What do I need for Prometheus and Grafana to support Extended PromQL?
Use VictoriaMetrics as a remote storage for Prometheus and then use the provided URL as a Prometheus datasource in Grafana. See quick start for details.

>   • Is it open source?
No. It is SaaS.

>   • Has it been released already, and if so, when?
The preview version was released a few days ago - see the announcement.

Aliaksandr Valialkin

Oct 23, 2018, 11:41:45 AM
to אמיר יראון, promethe...@googlegroups.com

On Tue, Oct 23, 2018 at 4:18 PM אמיר יראון <amir....@gmail.com> wrote:
> OK, that's much more clear!
> So now I know better what to ask:
>   1. Does remove_resets get the time series from last reset till now?
remove_resets() just adds the last value before the reset to all the subsequent values. For example, the following timeseries:

    1, 2, 3, 5, 1, 6, 1, 2

would be transformed into

    1, 2, 3, 5, 1+5, 6+5, 1+(6+5), 2+(6+5) => 1, 2, 3, 5, 6, 11, 12, 13
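
The transformation above can be sketched in plain JavaScript like this. It is only an illustration of the described behaviour, not the VictoriaMetrics implementation; the name `removeResets` is chosen here for readability.

```javascript
// Undo counter resets: whenever a value drops below its predecessor,
// add the predecessor's raw value to a running offset that is applied
// to that point and to every later point.
function removeResets(values) {
  let offset = 0;
  let previous = null;
  return values.map((value) => {
    if (previous !== null && value < previous) offset += previous;
    previous = value;
    return value + offset;
  });
}

const cleaned = removeResets([1, 2, 3, 5, 1, 6, 1, 2]);
console.log(cleaned);
```

For the series from the example, this gives 1, 2, 3, 5, 6, 11, 12, 13.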
 
>   2. How could srate(sum(..)) work when rate(sum(..)) couldn't? (type mismatch because rate expects time series and sum returns a vector...)
rate() works on a range vector, i.e. a metric selector followed by a window duration in square brackets, while srate() works on an instant vector.

>   3. Does srate reflect the rate between the last 2 vectors?
srate() reflects the rate between each consecutive pair of points returned by Prometheus to Grafana in the query_range response.

>   4. Is it normalized to some unit? (counter per second or so) or do I need to calculate it myself according to the step?
Yes, it is normalized to counter increments per second, like the rate() result.


Aliaksandr Valialkin

Oct 25, 2018, 2:25:31 PM
to אמיר יראון, promethe...@googlegroups.com

On Wed, Oct 24, 2018 at 9:14 AM אמיר יראון <amir....@gmail.com> wrote:
> What about a new instance (i.e. with a new address)?
> If the counter is increased in a new instance and gets the value 1, would it give a non-zero rate as I expect?

It will get a (1 / step) rate. A rate of 0 is returned from the moment the new instance appears until it reports a non-zero value.
 
> My original problem is that rate() returns 0 for such an instance (reflecting the change from undefined to 0, to the best of my knowledge).
> It happens because I work in a Kubernetes cluster where jobs are created and removed often.
> I can't even reset the counter at init time because I have some dynamic labels whose future values I can't guess at init time.
>
> Secondly, the SaaS concept is problematic for us, because we develop open source software,
> and also because some of the customers work within an internal network.
> Any other options?

Rate may be emulated with offset. Try something like the following in Prometheus:

    sum(counter - counter offset 60s) without (instance) / 60

This is a rough equivalent of the following invalid query:

    rate((sum(counter) without (instance))[60s])

However, this doesn't account for counter resets, so results may be inaccurate.
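
To illustrate why resets make the offset-based emulation inaccurate, here is a small sketch in plain JavaScript. The sample values, the object layout and the name `offsetRate` are all hypothetical; the function only mirrors the arithmetic of subtracting the value from 60 seconds ago, summing across instances and dividing by the window.

```javascript
// Hypothetical samples: counter value now and 60 seconds ago, per instance.
const samples = [
  { instance: 'a', now: 130, ago: 100 },
  { instance: 'b', now: 5, ago: 50 }, // restarted in between, so the counter dropped
];

// Sum of (now - ago) across instances, divided by the window length in seconds.
function offsetRate(points, windowSeconds) {
  const delta = points.reduce((sum, p) => sum + (p.now - p.ago), 0);
  return delta / windowSeconds;
}

const healthyOnly = offsetRate(samples.slice(0, 1), 60);
const withReset = offsetRate(samples, 60);
console.log(healthyOnly, withReset);
```

With only the healthy instance the result is 0.5 per second, but once the restarted instance is included the result becomes negative (-0.25), which cannot be a real request rate.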

Aliaksandr Valialkin

Oct 25, 2018, 2:37:34 PM
to אמיר יראון, promethe...@googlegroups.com

On Wed, Oct 24, 2018 at 3:48 PM אמיר יראון <amir....@gmail.com> wrote:

> Also, the original rate() should already handle resets, as stated in the docs:
>
> rate(v range-vector) calculates the per-second average rate of increase of the time series in the range vector. Breaks in monotonicity (such as counter resets due to target restarts) are automatically adjusted for. Also, the calculation extrapolates to the ends of the time range, allowing for missed scrapes or imperfect alignment of scrape cycles with the range's time period.

Yes, rate() and srate() both handle resets. But they won't work as expected for the following query if counter resets exist:

srate(sum(counter))

For instance, let the following counters exist:

counter{job="1"} 1 2 3 4
counter{job="2"} 1 3 1 2

Then sum() will transform them to a single counter:
1+1=2, 2+3=5, 3+1=4, 4+2=6

Then srate() would remove counter resets:
2, 5, 4+5=9, 6+5=11

And then calculate the rate (for example, for step=1):
5-2=3, 9-5=4, 11-9=2 => 3, 4, 2

This rate is invalid.

Let's remove counter resets before calculating the sum():
counter{job="1"} 1 2 3 4
counter{job="2"} 1 3 1+3=4 2+3=5

Then sum would be:
1+1=2, 2+3=5, 3+4=7, 4+5=9

And srate() should be:
5-2=3, 7-5=2, 9-7=2 => 3, 2, 2

As you can see, the second data point differs from the incorrect value calculated above.
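
The two orderings in the worked example above can be reproduced with a short plain-JavaScript sketch. The helper names (`removeResets`, `stepRate`, `sumSeries`) are chosen for this illustration and only model the described behaviour; they are not the actual implementations.

```javascript
// Undo counter resets: when a value drops, carry the previous raw value forward.
function removeResets(values) {
  let offset = 0;
  let previous = null;
  return values.map((value) => {
    if (previous !== null && value < previous) offset += previous;
    previous = value;
    return value + offset;
  });
}

// Per-step rate: difference between neighbouring points divided by the step.
function stepRate(values, stepSeconds) {
  return values.slice(1).map((value, i) => (value - values[i]) / stepSeconds);
}

// Point-by-point sum of two equally long series.
function sumSeries(a, b) {
  return a.map((value, i) => value + b[i]);
}

const job1 = [1, 2, 3, 4];
const job2 = [1, 3, 1, 2]; // resets after the second point

// Sum first, then remove resets from the combined series.
const sumFirst = stepRate(removeResets(sumSeries(job1, job2)), 1);

// Remove resets per series first, then sum.
const resetsFirst = stepRate(sumSeries(removeResets(job1), removeResets(job2)), 1);

console.log(sumFirst, resetsFirst);
```

Summing first gives 3, 4, 2, while removing resets per series before summing gives 3, 2, 2, matching the numbers in the example.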