Sum over time without instance


Gergely Brautigam

Sep 1, 2020, 7:58:43 AM
to Prometheus Users
Hi!

I was wondering how to group a sum over time for a value that comes from multiple instances?

For example, consider a distributed environment where the same metric is reported by instance 'a' and instance 'b'. So I have the same server-restart metric reported by the same metric reporter, but twice.

I would like to have a sum over time for server restarts to see which server restarted how many times over the last 2 days.

I tried various combinations, but how do I ignore the duplicate entries? I tried something like:

    sum_over_time(server_statistics_total{restart_reason="failed"}[7d])>0

But this results in insanely large numbers. The metric is a Gauge, so that doesn't help. :) Because I'm guessing that when the gauge decreases because the server normalises, that will also be reported as 1 and counted towards server restarts... Uh. :/

Any advice would be much appreciated.
Cheers,
Gergely.

Brian Candler

Sep 1, 2020, 8:16:34 AM
to Prometheus Users
First understand what your metric does, and what change in that metric you're looking for.  Decide whether it is a counter (it increments for each server restart, resetting to zero only when some collector restarts) or a gauge (e.g. the number of restarts in a 5 minute period).

If it's a counter, sum_over_time() is not what you're looking for.  If each of your metrics goes

{instance="a"} va1 va2 va3 va4
{instance="b"} vb1 vb2 vb3 vb4
               t1  t2  t3  t4 -->

then the results of sum_over_time will be

{instance="a"} va1+va2+va3+va4
{instance="b"} vb1+vb2+vb3+vb4
                     t4

You are probably looking for either increase() or resets().

The first says how much the counter has increased in total.  It is roughly equal to va4-va1.  More accurately, it is equal to va4-va1 (but excluding counter resets), divided by the difference in timestamps between va4 and va1, multiplied by the entire time window the query covers.  If the counter goes up by 1 each time a "restart" event occurs, this is what you want to get the total number of restarts over a given period.

The second says how many times the value has dipped downwards, i.e. the number of times the counter has reset.  If the counter is going up continuously, but resets to zero each time a "restart" event occurs, this is what you want.
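
To make that concrete, here is a rough sketch reusing the labels from your original query (and assuming the metric were actually exposed as a counter):

    # total increase over the last 2 days, per instance -- use this if the
    # counter goes up by 1 per restart
    increase(server_statistics_total{restart_reason="failed"}[2d])

    # number of counter resets over the last 2 days, per instance -- use this
    # if the counter climbs continuously and drops to zero on each restart
    resets(server_statistics_total{restart_reason="failed"}[2d])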

Gergely Brautigam

Sep 1, 2020, 8:24:33 AM
to Brian Candler, Prometheus Users
Hi Brian!

Thanks! :) 

The metric IS a gauge. It goes up on server restarts and then goes down again once the reconciliation no longer finds any servers that have restarted. It's part of a statistic for all servers. So, for now, I have to deal with it being a gauge.

resets MIGHT actually be something that could work. Please correct me if I'm wrong here, but resets is actually for counters and not gauges?

resets should only be used with counters.

Or does this not imply the literal metric type, counter?

Cheers.


Brian Candler

Sep 1, 2020, 9:18:02 AM
to Prometheus Users
On Tuesday, 1 September 2020 13:24:33 UTC+1, Gergely Brautigam wrote: 
The metric IS a gauge. It goes up on server restarts and then goes down again once the reconciliation no longer finds any servers that have restarted. It's part of a statistic for all servers. So, for now, I have to deal with it being a gauge.


This doesn't make any sense to me.  The value goes 0... 1... 2... 3... but then goes back to 0.  That sounds like a counter, but under what circumstances does it go back to 0?

Is it something that runs periodically, and goes 0 ... (runs) 3 .... (runs again) 0 .... ?

The latter is not a good metric.  Depending on your scrape interval, you might scrape it as 0  0  3  3  0  0 ...  (which incorrectly suggests 6 restarts)

Or you might scrape it as 0  0  (missing the "3" entirely)

Treating it as a prometheus counter doesn't fix the problem.  If it goes 0  0  3  3  1  0  0  0 ....  then the transition from 3 to 1 will be considered as a counter reset (and hence ignored), and the transition from 1 to 0 will be considered as a second counter reset (and also ignored).

You really want to convert this into a real counter: one where the value stays unchanged when there are no events, and increments for each event.  If you are unable to maintain state within your application, e.g. because it's a one-shot script, then use statsd_exporter to maintain the counter state for you.

Given a true counter, you can compare values at any two times t1 and t2 and get the increase, or calculate the average rate of increase.
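
For example (hypothetical metric name, assuming the application exposed a proper counter such as server_restarts_total):

    increase(server_restarts_total[2d])   # restarts over the last 2 days
    rate(server_restarts_total[5m])       # average restarts per second over the last 5 minutes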

Or does this not imply the literal metric type, counter?


Metrics aren't labelled as "counter" or "gauge" in the database AFAIK.  All metrics are just float64 values, so you can apply any function to any value.  Whether the results are meaningful or not is up to you.

Gergely Brautigam

Sep 1, 2020, 9:33:23 AM
to Prometheus Users
I agree that it should be a counter. And to be fair, I did create a ticket to make it a counter. :) So I agree with you on that. For now, however, I need to work with this. And you see it right, yes... it will report the same count repeatedly, and that's probably why the number is so high.

The thing about the counter I copied in from here: https://prometheus.io/docs/prometheus/latest/querying/functions/#resets

I ended up using this: 

sum(sum by (deploymentId)(increase(restart_reason{statistic="failed"}[$__range])>0))

It's... close enough. It counts increases, which is a way to detect a new restart, and disregards decreases, which are the counter resetting.

What do you think about that?
G.

Brian Candler

Sep 1, 2020, 9:59:17 AM
to Prometheus Users
If it works for you, then you get to own the query and the results :-)  But "garbage in, garbage out" applies here.

What you've written doesn't "count increases".  It calculates average rates of increase and scales them across the whole period, whilst skipping steps where the counter appears to have reset.

Suppose you take a window of 10 time units, which contains the following 10 samples:

. 0 ... 0 ... 3 ... 3 ... 0 ... 0 ... 0 ... 2 ... 0 ... 0 .
  <----------------->     <----------------->     <----->
     increase of 3   (rst)   increase of 2   (rst)increase of 0 

|<--------------------- 10 time units ------------------->|


I think that increase() works something like this:
. an increase of 3 over 3 time units
. an increase of 2 over 3 time units
. an increase of 0 over 1 time unit
= an increase of 5 over 7 time units
= average rate of 5/7
. scaled to a total time window of 10 units
= answer of 7.14

The value you were expecting to see was probably 5 (3+2) or possibly 8 (3+3+2).

Gergely Brautigam

Sep 1, 2020, 12:00:38 PM
to Prometheus Users
Uh. I see what you mean. Also, yes, after some fiddling, this isn't actually a good query.

Also, also, I have some further problems with instances and pods. The same number is coming from different pods in the cluster. Prometheus nicely shows that separately but I can't "without" it.

Somehow I have to collapse the two different metrics into one. So now there are two rows:

| time | id  | endpoint | instance      | pod   | service | metric           | value |
|------|-----|----------|---------------|-------|---------|------------------|-------|
| ...  | ... | ...      | 10.1.1.1:1234 | pod-a | metrics | server_statistic | 5     |
| ...  | ... | ...      | 10.1.1.2:1234 | pod-b | metrics | server_statistic | 5     |

So I have to merge these into a single metric, ignoring the instance and pod differences but not adding the values together. I could just divide by two, but that would mean I'd potentially miss metrics. But I might have to take that into consideration and say... meh.

Brian Candler

Sep 1, 2020, 1:28:51 PM
to Prometheus Users
It depends what these numbers mean individually and what is a meaningful way to combine them.  You have avg(), max() etc.

Gergely Brautigam

Sep 1, 2020, 3:42:14 PM
to Prometheus Users
They are basically duplicates. Two instances of the same service report the same metric twice, from different pods with different ids. But it's literally the same metric.
One of them should be ignored... I don't necessarily know how, and I can't find any good write-ups covering a situation like this. :/ Any ideas?? :)

Brian Candler

Sep 1, 2020, 3:59:34 PM
to Prometheus Users
max() sounds like a reasonable approach.  If the values are the same, you get the value.  If one is higher than the other, choose the more pessimistic one.

Gergely Brautigam

Sep 2, 2020, 3:31:15 AM
to Brian Candler, Prometheus Users
This sounds good, but there is more than one entry. So it wouldn't just choose the max out of two, it would choose the max out of several. So I would have to add some kind of check to only choose the max if there are two for the same server id.

Is that even possible with promQL?


Gergely Brautigam

Sep 2, 2020, 3:49:34 AM
to Prometheus Users
I could possibly do something like a union to create a unique vector out of two vectors with the same information 'ignoring' pod and instance?

Brian Candler

Sep 2, 2020, 3:59:13 AM
to Prometheus Users
On Wednesday, 2 September 2020 08:31:15 UTC+1, Gergely Brautigam wrote:
So I would have to add some kind of check to only choose the max if there are two for the same server id.


max by (server_id) (metric)

or

max without (pod, instance) (metric)


Gergely Brautigam

Sep 2, 2020, 4:31:50 AM
to Prometheus Users
Thanks! I didn't know by or without would work like that, so now I know. :) Fantastic. I think this is now a working query:

round(sum(max by (serverId, clusterId) (increase(server_statistics_total{statistic="failed", serverId=~"$serverId"}[$__range])) > 0))

Do you think that is as accurate as possible with a gauge and multiple metric providers? :)
