Aggregations over timeseries


rcham...@zulily.com

Oct 3, 2016, 1:09:46 PM
to Prometheus Developers

I've been looking into methods to do time series aggregations in Prometheus where, for example, I want to work out a quantile over a time range.  The latest version of Prometheus provides the xxx_over_time functions (e.g. quantile_over_time) for this.  However, these functions don't allow for aggregation across time series.  For example there is no way to say "give me an average over this time series by metric name, but combine all the hosts together in a single dataset."


That's annoying, since that's typically the data I find useful from a dashboarding perspective; a single host running slow is not interesting as long as the overall service performance remains within the latency bounds I've outlined.


Since there is nothing in the APIs to do this, I looked into writing a function to do the time series aggregations, which I could provide a patch for.  However, as I added functionality I came to the realization that I was essentially rewriting the aggregation logic which already exists but only supports instant vectors.


From what I've been able to tell, getting the aggregation functions to support time series is not that hard, although all "instant" readings need to be silently upconverted to time series with a single entry.


I'd like to understand whether there is a philosophical reason this has not been done already; it seems that all of the existing aggregations would apply to time series just as they do to the existing instant vectors.  The benefit of doing it is a much more natural and powerful query interface; all of the xxx_over_time functions could be deprecated and the facilities available to time series would become much more powerful.


Rod.


Brian Brazil

Oct 3, 2016, 1:17:11 PM
to rcham...@zulily.com, Prometheus Developers
On 3 October 2016 at 18:09, <rcham...@zulily.com> wrote:

I've been looking into methods to do time series aggregations in Prometheus where, for example, I want to work out a quantile over a time range.  The latest version of Prometheus provides the xxx_over_time functions (e.g. quantile_over_time) for this.  However, these functions don't allow for aggregation across time series.  For example there is no way to say "give me an average over this time series by metric name, but combine all the hosts together in a single dataset."


I don't understand the calculation you're trying to perform. Can you provide a concrete example?

Brian
 




--
You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-developers+unsub...@googlegroups.com.
To post to this group, send email to prometheus-developers@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/2a161d64-dfcf-4e33-a8a2-99d6e45c684d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



Rod Chamberlin

Oct 3, 2016, 1:43:53 PM
to Brian Brazil, Prometheus Developers

Suppose I have a load of gauge metrics across a fleet of 50 hosts, collected every 15 seconds: for example, JMX memory usage.  I would like to calculate the p90 JMX memory usage across the fleet.

 

quantile_over_time(0.9, jmx_memory{}[5m]) will return a set of results (one per instance).  However, there is no way of statistically combining these to get an accurate p90 over the entire dataset.

 

The first approach I outlined below involves writing a combine() function such that I can use:

 

quantile_over_time(0.9, combine(jmx_memory{}[5m])), which combines the multiple time series into a single one over which I can run a time series aggregation function.
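To make the statistical point concrete, here is a toy sketch in Python (not PromQL; node_a and node_b are made-up sample sets, and the median stands in for any quantile).  The quantile of the pooled samples is generally not recoverable from the per-series quantiles:

```python
from statistics import median

# Hypothetical samples for one metric from two instances over the same window
node_a = [1, 1, 1, 100]   # mostly fast, one outlier
node_b = [50, 50, 50]     # consistently slow

# What an *_over_time function gives: one result per series
per_series = [median(node_a), median(node_b)]   # [1, 50]

# The true statistic over the combined dataset
pooled = median(node_a + node_b)                # 50

# No further instant-vector step on per_series recovers the pooled value:
assert median(per_series) == 25.5               # != 50
```

This is exactly why combining per-series quantiles after the fact is statistically wrong, and why the combination has to happen before the quantile is taken.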

 

However, I may want to apply the same approach across multiple metrics.  For example, maybe I have host-based cache hit rate metrics over a number of caches and want to alarm if the p90 hit rate falls out of bounds for any one of the caches.

 

e.g. the following metrics:

cache_1_hitrate{instance="blah"}
cache_2_hitrate{instance="blah"}
cache_3_hitrate{instance="blah"}

 

The above combine function then requires additional logic to know it needs to aggregate metrics with the same name (or other labels) into the same bucket.

 

What I’d really like to write is:

 

quantile(0.9, {__name__=~".*_hitrate"}[5m]) by (__name__)

 

However, the aggregation functions sum(), quantile(), topk(), etc. don't support time series (matrix types), despite the fact that the operations they perform make logical sense on them.  Indeed, the workaround was for Prometheus to provide a set of duplicate functions to perform these operations (xxx_over_time).

 

What I’m proposing is to update the parser/engine to support aggregation functions over matrix types.

 

Rod.

 

 

Brian Brazil

Oct 3, 2016, 1:52:17 PM
to Rod Chamberlin, Prometheus Developers
On 3 October 2016 at 18:43, Rod Chamberlin <rcham...@zulily.com> wrote:

Suppose, I have a load of gauge metrics across a fleet of 50 hosts, collected every 15 seconds.  For example JMX memory usage.  I would like to calculate the p90 JMX memory usage across the fleet.

 

Quantile_over_time(0.9,jmx_memory{}[5m]) will return a set of metrics (one per instance).  However, there is no way of statistically combining these to get an accurate p90 over the entire dataset.


What exactly is the number you are trying to calculate here? I'm having difficulty understanding what you want in a way that would make sense both statistically and operationally.
 

 

The first approach I outlined below involves writing a function: combine() such that I can use:

 

Quantile_over_time(0.9, combine(jmx_memory()[5m])) which combines the multiple time series into a single one over which I can run a timeseries aggregation function.

 

However, if I want to apply the same approach with multiple metrics (for example maybe I have host-based cache hit rate metrics over a number of caches and want to alarm if the p90 hit rate falls out of bounds for any one of the caches).


That's not how caches work. You care about the overall hit rate, which is sum(rate(hits))/sum(rate(requests)).

Brian


rcham...@zulily.com

Oct 3, 2016, 2:32:20 PM
to Prometheus Developers, rcham...@zulily.com


On Monday, October 3, 2016 at 10:52:17 AM UTC-7, Brian Brazil wrote:
On 3 October 2016 at 18:43, Rod Chamberlin <rcham...@zulily.com> wrote:

Suppose, I have a load of gauge metrics across a fleet of 50 hosts, collected every 15 seconds.  For example JMX memory usage.  I would like to calculate the p90 JMX memory usage across the fleet.

 

Quantile_over_time(0.9,jmx_memory{}[5m]) will return a set of metrics (one per instance).  However, there is no way of statistically combining these to get an accurate p90 over the entire dataset.


What is exactly the number you are trying to calculate here? I'm having difficult understanding what you want in a way that would make sense both statistically and operationally.

A quantile over a combined set of timeseries which represent similar data points, but have different labels.  
 
 

 


However, if I want to apply the same approach with multiple metrics (for example maybe I have host-based cache hit rate metrics over a number of caches and want to alarm if the p90 hit rate falls out of bounds for any one of the caches).


That's not how caches work. You care about the overall hit rate, which is sum(rate(hits))/sum(rate(requests)).


Whilst in an ideal world you would be correct, we do not always have the opportunity to add the instrumentation to our services that we might desire.  I am avoiding going into the specifics of the system I'm instrumenting because I don't feel it will add a great deal to the discussion and will likely send us off on a tangent.

You asked for examples of what I would like to accomplish, and I thought I had provided some: I have collected time series, and I'd like to be able to perform aggregations over them.  The need for this has clearly been identified in the past, because the xxx_over_time functions were provided.  However, I'm surprised this isn't supported as a first-class aggregation, because it's easy to implement at what appears to be low cost to the framework, yet it considerably increases the flexibility beyond what the existing aggregation functions offer.




Brian Brazil

Oct 3, 2016, 2:55:53 PM
to Rod Chamberlin, Prometheus Developers
On 3 October 2016 at 19:32, <rcham...@zulily.com> wrote:


On Monday, October 3, 2016 at 10:52:17 AM UTC-7, Brian Brazil wrote:
On 3 October 2016 at 18:43, Rod Chamberlin <rcham...@zulily.com> wrote:

Suppose, I have a load of gauge metrics across a fleet of 50 hosts, collected every 15 seconds.  For example JMX memory usage.  I would like to calculate the p90 JMX memory usage across the fleet.

 

Quantile_over_time(0.9,jmx_memory{}[5m]) will return a set of metrics (one per instance).  However, there is no way of statistically combining these to get an accurate p90 over the entire dataset.


What exactly is the number you are trying to calculate here? I'm having difficulty understanding what you want in a way that would make sense both statistically and operationally.

A quantile over a combined set of timeseries which represent similar data points, but have different labels.  

In what fashion do you want to combine the time series? Can you provide a numeric example?

Brian

Rod Chamberlin

Oct 3, 2016, 3:43:45 PM
to Brian Brazil, Prometheus Developers

Consider the following:

 

Metric{context="c1",instance="node-5"}
  313 @1475518428
Metric{context="c1",instance="node-7"}
  321 @1475518367.128

Metric{context="c2",instance="node-5"}
  96 @1475518353.948
Metric{context="c2",instance="node-7"}
  24 @1475518382
  23 @1475518427
  24 @1475518442

Metric{context="c3",instance="node-5"}
  230 @1475518368
  229 @1475518428
  230 @1475518444
Metric{context="c3",instance="node-6"}
  224 @1475518358
  222 @1475518433
  220 @1475518434
Metric{context="c3",instance="node-8"}
  225 @1475518368
  225 @1475518369
  223 @1475518370
  221 @1475518381
  219 @1475518428
  221 @1475518442
  222 @1475518443
Metric{context="c3",instance="node-4"}
  265 @1475518418

Metric{context="c4",instance="node-6"}
  647 @1475518358
  537 @1475518359
  714 @1475518400
  512 @1475518410
  501 @1475518420
  552 @1475518433
  553 @1475518434
Metric{context="c4",instance="node-5"}
  678 @1475518353
  565 @1475518353
  589 @1475518354
  535 @1475518432
  576 @1475518440
  523 @1475518442
Metric{context="c4",instance="node-7"}
  556 @1475518352
  509 @1475518353
  547 @1475518355
  530 @1475518427
  523 @1475518433
  554 @1475518434
  506 @1475518436

I would like to be able to generate aggregations over:

* Everything (what's the overall p50 for this data?)
* What is the p50 by context, aggregated over all instances?
* What is the p50 over all contexts, by instance?
* What is the p50 for each distinct label set?

At the moment I can only answer the last of these questions:

quantile_over_time(0.5, Metric{}[5m])

 

There is no way to combine that result, because the input to quantile_over_time needs to be the full dataset under consideration.  For the "p50 by context, aggregated over all instances" case, I need the input to quantile_over_time to be:

 

Metric{context="c1"}
  313 @1475518428
  321 @1475518367.128

Metric{context="c2"}
  96 @1475518353.948
  24 @1475518382
  23 @1475518427
  24 @1475518442

Metric{context="c3"}
  230 @1475518368
  229 @1475518428
  230 @1475518444
  224 @1475518358
  222 @1475518433
  220 @1475518434
  225 @1475518368
  225 @1475518369
  223 @1475518370
  221 @1475518381
  219 @1475518428
  221 @1475518442
  222 @1475518443
  265 @1475518418

Metric{context="c4"}
  647 @1475518358
  537 @1475518359
  714 @1475518400
  512 @1475518410
  501 @1475518420
  552 @1475518433
  553 @1475518434
  678 @1475518353
  565 @1475518353
  589 @1475518354
  535 @1475518432
  576 @1475518440
  523 @1475518442
  556 @1475518352
  509 @1475518353
  547 @1475518355
  530 @1475518427
  523 @1475518433
  554 @1475518434
  506 @1475518436

but that cannot be generated: Metric{context="c4"}[5m] corresponds to three distinct series within the matrix.  One could add a function which takes a set of series and returns the single series that is their combination (which is what I originally started on).  However, if aggregation functions supported time series, I could instead write:

quantile(0.5, Metric{}[5m]) by (context)

 

At present, quantile(0.5, Metric{}[5m]) will throw an error ("expected type vector in aggregation expression, got matrix") because aggregations do not support time series (matrix) arguments.

 

Fixing that seems like an overall better approach, so that vectors and matrixes can be treated equally in aggregate queries.
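As a numeric check of what such a query would return on the data above, here is a Python sketch (using statistics.median; values are copied from the dataset above, with c4 omitted for brevity; this is an illustration, not Prometheus output):

```python
from statistics import median

# Samples pooled by context, copied from the dataset above (c4 omitted)
c1 = [313, 321]
c2 = [96, 24, 23, 24]
c3 = [230, 229, 230, 224, 222, 220,
      225, 225, 223, 221, 219, 221, 222, 265]

# The p50-by-context result the proposed query would yield
result = {ctx: median(vals) for ctx, vals in
          {"c1": c1, "c2": c2, "c3": c3}.items()}
print(result)   # {'c1': 317.0, 'c2': 24.0, 'c3': 223.5}
```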

 

Rod.

 

 

 


Brian Brazil

Oct 3, 2016, 3:56:19 PM
to Rod Chamberlin, Prometheus Developers
On 3 October 2016 at 20:43, Rod Chamberlin <rcham...@zulily.com> wrote:

* Everything (what's the overall p50 for this data?)

I don't think that has any statistical use; each series has different properties.
 

* What is the p50 by context, aggregated over all instances?


What do you mean by "aggregated"?
 

* What is the p50 over all contexts, by instance?


In what time slice?


I'm not seeing your use case. What's your ultimate goal here?

Brian


rcham...@zulily.com

Oct 3, 2016, 5:11:15 PM
to Prometheus Developers
Message has been deleted

rcham...@zulily.com

Oct 3, 2016, 5:20:42 PM
to Prometheus Developers, rcham...@zulily.com
To clarify:
* The data presented represents a single timeslice (or 'matrix' in the language of the internals)

* The goal is to run aggregation functions over these matrixes combining them into a single larger matrix based around some label aggregation
** Whether all of the aggregations I've suggested make sense is beyond the scope of the discussion (since I'm hacking together data to generate examples, I'm not going to guarantee it all looks sensible)
* The label aggregations should be able to follow a similar pattern to those available for standard aggregation functions (i.e. BY (label-list) or WITHOUT (label-list))

Does this make sense?
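A minimal sketch of the BY-style grouping being described, in Python (the aggregate_over_time helper is hypothetical, and the label sets and sample values are toy data, not the thread's full dataset):

```python
from collections import defaultdict
from statistics import median

def aggregate_over_time(series, fn, by):
    """Pool the window samples of every series sharing the `by` labels,
    then apply `fn` once per group (the proposed matrix-aggregation semantics)."""
    groups = defaultdict(list)
    for labels, samples in series.items():
        # Grouping key: only the labels named in `by`, order-normalized
        key = tuple(sorted((k, v) for k, v in labels if k in by))
        groups[key].extend(samples)
    return {key: fn(vals) for key, vals in groups.items()}

# Toy matrix: label set -> samples observed in the window
matrix = {
    (("context", "c1"), ("instance", "node-5")): [313],
    (("context", "c1"), ("instance", "node-7")): [321],
    (("context", "c2"), ("instance", "node-5")): [96, 24],
}

p50_by_context = aggregate_over_time(matrix, median, by={"context"})
# {(('context', 'c1'),): 317.0, (('context', 'c2'),): 60.0}
```

A WITHOUT-style variant would simply invert the key filter (keep labels *not* in the given set).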


On Monday, October 3, 2016 at 12:56:19 PM UTC-7, Brian Brazil wrote:
On 3 October 2016 at 20:43, Rod Chamberlin <rcham...@zulily.com> wrote:

* Everything (what's the overall p50 for this data?)


I don't think that has any statistical use, each series has different properties.

Surely that's for me to judge based on knowledge of my data.  I'll admit that, given the random data pulled from an arbitrary source, it does look like a couple of the series don't make sense to combine, but it's my job to understand my data and know when and where that's applicable.

Brian Brazil

Oct 3, 2016, 5:33:11 PM
to Rod Chamberlin, Prometheus Developers
On 3 October 2016 at 22:20, <rcham...@zulily.com> wrote:
To clarify:
* The data presented represents a single timeslice (or 'matrix' in the language of the internals)

* The goal is to run aggregation functions over these matrixes combining them into a single larger matrix based around some label aggregation
** The question as to whether all of the aggregations I've suggested make sense is something that is beyond the scope of the discussion (since I'm hacking together data to generate examples, I'm not going to guaratee it all looks sensible)
* The label aggregations should be able to follow a similar pattern to those available for standard aggregation functions (i.e. BY (label-list) or WITHOUT (label-list))

Does this make sense?

It doesn't make sense to me. If you can't convince me that this has a valid use case that makes sense both operationally and statistically, it has extremely little chance of getting added. We don't add features merely because one user asserts they're useful; features cost us maintenance and cost our users cognitive load.
 

On Monday, October 3, 2016 at 12:56:19 PM UTC-7, Brian Brazil wrote:
On 3 October 2016 at 20:43, Rod Chamberlin <rcham...@zulily.com> wrote:

* Everything (what's the overall p50 for this data?)


I don't think that has any statistical use, each series has different properties.

Surely that's for me to judge based on knowledge of my data; I'll admit that, given the random data pulled from an arbitrary source it does look like a couple of the series don't make sense to combine together, but it's my job to understand my data and know when and where that's applicable.

It's my job to try and understand your use case, which you have not yet shared. That limits my options to help you.

Brian

 
 


Rod Chamberlin

Oct 3, 2016, 5:45:18 PM
to Brian Brazil, Prometheus Developers

“It doesn't make sense to me. If you can't convince me that this has a valid use case that makes sense both operationally and statistically, it has extremely little chance of getting added. We don't add features merely because one user asserts it's useful to them, features cost us maintenance and cost our users cognitive load.”

 

It isn’t clear to me whether you don’t understand:

1/ what I’m trying to do, or

2/ why I’m trying to do it.

 

Those two are fundamentally different questions; you potentially need to understand both of them in the longer term, but right now I’m focusing on the “what” side of things.

 

To be honest, I’m not sure what I can explain more clearly, but focusing just on the “what”, the following attempts to explain the situation:

 

* I have a number of distinct metrics (as determined by their labels).
* I can create time series (or timeslices, or matrixes) from these metrics.
* I can perform analysis on these metrics one series at a time using the xxx_over_time aggregation functions.
* I cannot combine these metrics prior to aggregation in order to execute the aggregation function (xxx_over_time) on a combined set comprising all (or some grouped subset) of the datapoints from a larger group of such timeslices.

 

Given this explanation do you understand what it is I am trying to accomplish (if not the why)?

 

Rod.

 

 


Brian Brazil

Oct 3, 2016, 5:51:21 PM
to Rod Chamberlin, Prometheus Developers
On 3 October 2016 at 22:45, Rod Chamberlin <rcham...@zulily.com> wrote:

“It doesn't make sense to me. If you can't convince me that this has a valid use case that makes sense both operationally and statistically, it has extremely little chance of getting added. We don't add features merely because one user asserts it's useful to them, features cost us maintenance and cost our users cognitive load.”

 

It isn’t clear to me whether you don’t understand:

1/ what I’m trying to do, or

2/ why I’m trying to do it.

 

Those two are fundamentally different questions; you may need to understand both of them in the longer term, but right now I’m focusing on the “What” side of things. 


I'm focusing on the why, as it's my experience that when a user asks for something non-standard that seems odd, there's usually a simpler and more standard way to do what they want. If you're not willing to share that, then all I can do is point you at the query and query_range endpoints and let you write your own tooling on top of those.

Brian
 


Björn Rabenstein

unread,
Oct 5, 2016, 9:08:59 AM10/5/16
to Brian Brazil, Rod Chamberlin, Prometheus Developers
If I understood correctly, the fundamental issue here is:

- PromQL allows you to aggregate samples from multiple time series at
any one given point in time.

- PromQL allows you to aggregate samples over time from any one given
time series.

- PromQL does not allow you to do both in one step.

The question is if the latter would be a useful feature. To vet that,
we need real use-cases.
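The three cases above can be sketched in a few lines. This is a Python model of a range-query result, not Prometheus code; the series names and values are made up for illustration:

```python
# Model a range query result as one row per time series,
# one column per timestamp.
samples = {
    "host=a": [10.0, 12.0, 14.0],   # values at t0, t1, t2
    "host=b": [20.0, 22.0, 24.0],
}

# Across series at one instant, here t1 (what `avg(metric)` does):
instant_avg = sum(s[1] for s in samples.values()) / len(samples)

# Over time within each series (what `avg_over_time(metric[range])` does):
per_series_avg = {name: sum(v) / len(v) for name, v in samples.items()}

# The missing combined operation: one average over every sample at once.
all_points = [x for s in samples.values() for x in s]
combined_avg = sum(all_points) / len(all_points)

print(instant_avg, per_series_avg, combined_avg)
```

The first two reductions run along one axis of the matrix each; the request in this thread is a reduction over the whole matrix in a single step.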

Note that Prometheus is not an event logging system. Sampling events
into buckets of a histogram should happen on the client side (using
the Prometheus metric type `Histogram`). Histograms can then be
aggregated at will.
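The histogram point can be sketched the same way: cumulative bucket counts from two instances sum element-wise, and the quantile is estimated afterwards. This is a crude stand-in for `histogram_quantile`, with made-up bucket bounds and counts:

```python
# Upper bounds of the histogram buckets (last bucket is +Inf).
upper_bounds = [0.1, 0.5, 1.0, float("inf")]

# Cumulative bucket counts per instance (Prometheus histograms are cumulative).
instance_a = [50, 90, 100, 100]
instance_b = [10, 40, 95, 100]

# Aggregating across instances is just element-wise addition of bucket counts.
combined = [a + b for a, b in zip(instance_a, instance_b)]

def estimate_quantile(q, bounds, cumulative):
    """Crude quantile estimate from cumulative buckets, interpolating
    linearly within the bucket that contains the target rank."""
    rank = q * cumulative[-1]
    prev_bound, prev_count = 0.0, 0
    for bound, count in zip(bounds, cumulative):
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into the +Inf bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return bounds[-1]

print(estimate_quantile(0.9, upper_bounds, combined))
```

Because bucket counts add, the combined quantile is computed from the aggregated data rather than from per-instance quantiles, which is why client-side histograms sidestep the problem discussed in this thread.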

--
Björn Rabenstein, Engineer
http://soundcloud.com/brabenstein

SoundCloud Ltd. | Rheinsberger Str. 76/77, 10115 Berlin, Germany
Managing Director: Alexander Ljung | Incorporated in England & Wales
with Company No. 6343600 | Local Branch Office | AG Charlottenburg |
HRB 110657B

Vaisakh Rajagopal

unread,
Aug 13, 2018, 1:43:59 PM8/13/18
to Prometheus Developers
Hi @brian,
I'm facing the same kind of issue recently:
avg(metric) = overall average value,
but avg_over_time(metric[interval]) = average value over time, per label set.

What method would you propose to find the overall average of a metric over time?

I recently discovered that avg( avg_over_time(metric[scrape interval]) ) is not the same as avg(metric) when the data is not continuous and each series contributes a different number of samples. Given such a scenario, what is the correct way to find the overall average over a time period?

Eg: find the average response time now, and find the overall average response time of all the requests triggered in the last hour.
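That pitfall can be reproduced in a few lines. This is a Python sketch with made-up samples; the "true" overall average corresponds to the PromQL idiom sum(sum_over_time(m[w])) / sum(count_over_time(m[w])):

```python
# avg(avg_over_time(m[w])) weights every series equally, so it diverges
# from the true overall average whenever the series contain different
# numbers of samples in the window.
series = {
    "instance=a": [1.0, 1.0, 1.0, 1.0],  # 4 samples
    "instance=b": [9.0],                 # 1 sample (gap in the data)
}

# What avg(avg_over_time(...)) computes: mean of the per-series means.
avg_of_avgs = sum(sum(v) / len(v) for v in series.values()) / len(series)

# True overall average: total sum divided by total sample count,
# i.e. sum(sum_over_time(...)) / sum(count_over_time(...)).
overall = (sum(sum(v) for v in series.values())
           / sum(len(v) for v in series.values()))

print(avg_of_avgs, overall)
```

With equal sample counts the two agree; with gaps they do not, which matches the behaviour described above.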

mark...@verizon.net

unread,
May 20, 2019, 1:11:22 PM5/20/19
to Prometheus Developers
If I understand Rod, Bjorn, and Vaisakh from above, I have the same question. Here's my use case:

I have gauges and histograms recording durations. One of the labels is the marathon_task_id (a unique identifier for each instance of a process). I would like to collect statistics about these metrics within a specific time window aggregating the results of a metric across all marathon_task_id (i.e., the aggregate across all instances of the process that produces the metric).

This can be correctly done for the min or max as follows:

query?query=min(min_over_time(<metric>[<duration>]))&time=<EVAL_TIME>
query?query=max(max_over_time(<metric>[<duration>]))&time=<EVAL_TIME>

because the minimum of all the minimums across all marathon_task_id instances is the true minimum. Same for the maximum.

This isn't true for stddev, avg and quantiles.

Did anyone ever figure out a solution to this?
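The asymmetry is easy to demonstrate directly. This is a Python sketch with made-up data, using a nearest-rank quantile for simplicity:

```python
# min/max commute with grouping; quantiles (and avg, stddev) generally do not.
a = [1.0, 2.0, 3.0, 100.0]   # samples from one marathon_task_id
b = [4.0, 5.0]               # samples from another

def quantile(q, xs):
    """Nearest-rank quantile, enough for the illustration."""
    xs = sorted(xs)
    return xs[min(int(q * len(xs)), len(xs) - 1)]

# min(min_over_time(...)) is exact:
assert min(min(a), min(b)) == min(a + b)

# quantile-of-quantiles is not, in general:
per_series = [quantile(0.5, a), quantile(0.5, b)]
print(quantile(0.5, per_series), quantile(0.5, a + b))
```

The median of the per-series medians differs from the median of the pooled samples, which is why min/max can be composed this way but quantiles, averages, and standard deviations cannot.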

Ngân Nguyễn

unread,
May 19, 2020, 9:41:55 PM5/19/20
to Prometheus Developers

Hi all,

I’m looking for some help with percentile in Grafana - Prometheus.
I would expect that CPU usage shown by Grafana would match that of the Excel calculation =PERCENTILE(B2:B268,0.95).
I used this query quantile_over_time(0.95, (100 - (avg by (instance) (irate(node_cpu_seconds_total{instance=~"$node",mode="idle"}[1m])) * 100))[$__range:])
but it doesn't work as expected: it returns an instant vector with per-series aggregation results.
Please tell me how I can get exactly what I want!