Compute aggregate percentile (or average) over multiple time series?

Yongjik Kim

Feb 14, 2020, 2:06:48 PM
to Prometheus Users
Hi,

I have a problem with aggregation. I want to get the CPU usage of a set of jobs (each with a potentially different start/stop time) over the past week, and then get the 95th percentile of these values.

So, I can get the raw data points with this:

> rate(cpu_usage{name="myjob"}[5m])[1d:5m]

cpu_usage is a cumulative series (a counter?) that records "the amount of CPU resource this job has used since it started." So, as far as I understand, this gives me a nice list of "average CPU usage for each 5-minute interval, for every job and for every interval the job was alive."
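To make the shape of that result concrete, here is a rough Python sketch (not Prometheus code) of what I understand the subquery to yield: one average rate per step, and no point where the series has too few samples. It is a simplification under stated assumptions: it ignores counter resets and the extrapolation that the real rate() performs, and the function name `simple_rate` and the sample data are made up for illustration.

```python
def simple_rate(samples, window_end, window):
    """Average per-second increase of a counter over (window_end - window, window_end].

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    Returns None when fewer than two samples fall inside the window,
    i.e. when no rate can be computed for that step.
    """
    in_window = [(t, v) for t, v in samples if window_end - window < t <= window_end]
    if len(in_window) < 2:
        return None
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter scraped once per minute for 20 minutes,
# growing by 0.5 CPU-seconds per second.
samples = [(t, 0.5 * t) for t in range(0, 1201, 60)]

window = 300  # 5 minutes, like the [5m] range
points = []
for end in range(300, 1201, 300):  # evaluate every 5 minutes, like the :5m step
    r = simple_rate(samples, end, window)
    if r is not None:
        points.append((end, r))
```

Each job contributes its own list of such points, and a job that was not running during a step contributes nothing for that step.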

So far so good, but then how do I get the 95th percentile of *all these values*?

If I try this:

> quantile(0.95, rate(cpu_usage{name="myjob"}[5m])[1d:5m])

I get: "Error executing query: invalid parameter 'query': parse error at char 147: expected type instant vector in aggregation expression, got range vector"

I can make it output *some number* by removing [1d:5m], but that's not what I want. I don't need the 95th percentile at the current instant, but over the past week.

Any way to make it work without piping the result through a custom script?

Thanks,
- Yongjik Kim

Aliaksandr Valialkin

Feb 24, 2020, 1:41:19 PM
to Yongjik Kim, Prometheus Users
Hi Yongjik!

Try using `quantile_over_time` instead of `quantile`. See https://prometheus.io/docs/prometheus/latest/querying/functions/#aggregation_over_time 

--
Best Regards,

Aliaksandr

Yongjik Kim

Feb 24, 2020, 1:56:34 PM
to Aliaksandr Valialkin, Prometheus Users
Hi Aliaksandr,

Thanks a lot for the reply, but I think quantile_over_time() computes percentiles for each series separately?

So, for example, if I have three different time series A/B/C (representing three instances of the same task T), and I use quantile_over_time(), then I could get the "95th-percentile CPU usage" of A, B, and C separately, but it still won't tell me the "95th-percentile CPU usage across all instances of T", as far as I can tell.
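To illustrate the difference, here is a small Python sketch with made-up numbers for A/B/C. The `quantile` helper uses linear interpolation between the closest ranks, which I assume is close to what Prometheus does; treat it as an illustration rather than a reference implementation.

```python
import math


def quantile(q, values):
    """q-quantile (0 <= q <= 1) with linear interpolation between closest ranks."""
    if not values:
        return math.nan
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = min(lower + 1, len(ordered) - 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


# Hypothetical 5-minute CPU rates; the series have different lengths
# because the instances ran for different amounts of time.
series = {
    "A": [0.10, 0.12, 0.11, 0.90],
    "B": [0.20, 0.22, 0.21, 0.23, 0.19, 0.20],
    "C": [0.50, 0.55],
}

# What quantile_over_time() gives: one number per series.
per_series = {name: quantile(0.95, vals) for name, vals in series.items()}

# What I actually want: one number over all data points of all series.
pooled = quantile(0.95, [v for vals in series.values() for v in vals])
```

With these numbers the pooled value differs from all three per-series values, so no single per-series result can stand in for it.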


Aliaksandr Valialkin

Feb 24, 2020, 3:09:28 PM
to Yongjik Kim, Prometheus Users
On Mon, Feb 24, 2020 at 8:56 PM Yongjik Kim <yon...@houzz.com> wrote:
> Hi Aliaksandr,
>
> Thanks a lot for the reply, but I think quantile_over_time() will compute percentiles over each series?

Yes.

> So, for example, if I have three different time series A/B/C (representing three instances of the same task T), and I use quantile_over_time(), then I could get "95% CPU usage of A/B/C" separately, but it still won't tell me "95% CPU usage across all instances of T", as far as I can tell.

I'm afraid PromQL doesn't provide a function for calculating percentiles over the data points of multiple time series on a given range :( The closest approximation is max(quantile_over_time(0.95, ...)). I don't recommend using avg() instead of max(), since it hides spikes in individual time series.
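If a small script turns out to be acceptable after all, the pooled percentile can be computed client-side from the range query result. The following is a minimal Python sketch, not a tested tool: it assumes the standard `/api/v1/query_range` HTTP endpoint and its `matrix` response format, and the helper names (`fetch_matrix`, `pooled_quantile`, `max_of_per_series`) and the sample data are made up for illustration.

```python
import json
import math
import urllib.parse
import urllib.request


def quantile(q, values):
    """q-quantile (0 <= q <= 1) with linear interpolation between closest ranks."""
    if not values:
        return math.nan
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = min(lower + 1, len(ordered) - 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def fetch_matrix(base_url, query, start, end, step):
    """Run a range query and return the list of series from the response."""
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "step": step}
    )
    with urllib.request.urlopen(f"{base_url}/api/v1/query_range?{params}") as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]


def pooled_quantile(q, matrix):
    """Quantile over every data point of every series."""
    values = [float(v) for s in matrix for _, v in s["values"]]
    return quantile(q, values)


def max_of_per_series(q, matrix):
    """Client-side equivalent of max(quantile_over_time(q, ...))."""
    return max(quantile(q, [float(v) for _, v in s["values"]]) for s in matrix)


# Hypothetical response data in the shape returned for a matrix result:
# each sample is [unix_timestamp, "value as string"].
sample_matrix = [
    {"metric": {"instance": "a"}, "values": [[0, "0.10"], [300, "0.12"], [600, "0.90"]]},
    {"metric": {"instance": "b"}, "values": [[0, "0.20"], [300, "0.22"]]},
]

pooled = pooled_quantile(0.95, sample_matrix)
approx = max_of_per_series(0.95, sample_matrix)
```

For the query from this thread, the call would look something like `fetch_matrix("http://localhost:9090", 'rate(cpu_usage{name="myjob"}[5m])', start, end, "5m")`, where the server address is an assumption and `start`/`end` are Unix timestamps covering the week of interest.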

--
Best Regards,

Aliaksandr