Best way to calculate cluster-wide cpu usage figure for Kubernetes?

Khusro Jaleel

Sep 18, 2017, 12:18:44 PM

to Prometheus Users

Hi, I'm trying to show a line graph in Grafana of the CPU usage across a whole Kubernetes cluster (6 nodes) but I'm not sure if this is a useful metric to have, and what the best query would be?

I have come across many "per-node" queries but I wanted a single number that could correctly indicate that my cluster is very busy or very idle.

I tried to make a line graph of the following query but the values it generates "vary" up and down quite a bit so I'm not sure how reliable this query is:

sum (rate (container_cpu_usage_seconds_total{id="/"}[1m])) / sum (machine_cpu_cores) * 100

I saw this blog post about calculating CPU usage, would this work across the entire 6 node cluster? It seems to not be per-node:

https://www.robustperception.io/understanding-machine-cpu-usage/

I also tried this, but am not sure if this is correct and what the scale should be for the values:

sum(sum(rate(node_cpu[5m])) by (instance)) / sum (machine_cpu_cores)
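
On the scale question for this last query: node_cpu has one series per CPU per mode, and for any single CPU the per-second rates of all modes add up to roughly 1. Summing every mode (idle included) and dividing by the core count should therefore sit near 1.0 whatever the load. A minimal Python sketch of that arithmetic follows; the two-core layout and every rate in it are invented for illustration and are not data from this cluster.

```python
# Hypothetical per-second rates for a 2-core machine, keyed by (cpu, mode).
# They stand in for rate(node_cpu[5m]); the values are invented.
rates = {
    ("cpu0", "idle"): 0.80, ("cpu0", "user"): 0.15, ("cpu0", "system"): 0.05,
    ("cpu1", "idle"): 0.40, ("cpu1", "user"): 0.50, ("cpu1", "system"): 0.10,
}

# Number of distinct CPUs in the sample.
cores = len({cpu for cpu, _ in rates})

# Summing every mode, idle included, just recovers the number of cores,
# so the ratio says nothing about how busy the machine is.
all_modes_ratio = sum(rates.values()) / cores
print(round(all_modes_ratio, 3))
```

If that reading is right, the query needs a mode filter before the result can say anything about load.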

Ben Kochie

Sep 18, 2017, 12:50:25 PM

to Khusro Jaleel, Prometheus Users

Take a look at the recording rules example[0] to see some usage ideas.

I'm guessing you probably want something like this:

sum(rate(node_cpu{mode!="idle"}[5m])) / count(node_cpu{mode="idle"})

[0]: https://github.com/prometheus/node_exporter/blob/master/example.rules

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscribe@googlegroups.com.
To post to this group, send email to prometheus-users@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/9c008bae-9c16-42da-8aa5-8431e4c046e7%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Khusro Jaleel

Sep 18, 2017, 2:09:08 PM

to Prometheus Users

Thanks Ben, that's useful. I'm trying to compare that to the Stackdriver graphs for a single instance. The Stackdriver graphs are quite smooth and hover around 14-15%, while the graphs from that query are quite variable, ranging between 0.629 and 0.703.

I assume that since "node_cpu" is a seconds measure, the numbers I'm getting are seconds values showing how much time (in seconds) the whole cluster is spending "not idle". Is this accurate? Maybe converting this to some kind of percentage is a better way?


Ben Kochie

Sep 18, 2017, 3:12:30 PM

to Khusro Jaleel, Prometheus Users


The first half of the query is the non-idle CPU seconds per second, summed over all CPUs.

The second half divides by the number of CPUs.

This gives you a ratio of cores in use from 0.0 to 1.0. To make this a percent, multiply by 100.

As to why Stackdriver is reporting only 15% CPU while node_exporter reports 60-70%, I don't know.
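
The arithmetic described above can be sketched in Python. This is an illustration only: the sample tuples stand in for the per-CPU, per-mode series that rate(node_cpu[5m]) would return, and the function name cluster_cpu_percent, the node names, and all the numbers are made up.

```python
from typing import Iterable, Tuple

# (instance, cpu, mode, per-second rate); a hypothetical stand-in for
# the series returned by rate(node_cpu[5m]).
Sample = Tuple[str, str, str, float]


def cluster_cpu_percent(samples: Iterable[Sample]) -> float:
    """Non-idle CPU seconds per second divided by the CPU count, times 100."""
    samples = list(samples)
    # First half of the query: sum of every non-idle rate.
    busy = sum(rate for _, _, mode, rate in samples if mode != "idle")
    # Second half: each CPU has exactly one idle series, so counting
    # those series gives the number of CPUs.
    cpus = sum(1 for _, _, mode, _ in samples if mode == "idle")
    return busy / cpus * 100


# Two invented nodes with two CPUs each.
samples = [
    ("node-a", "cpu0", "idle", 0.75), ("node-a", "cpu0", "user", 0.25),
    ("node-a", "cpu1", "idle", 0.50), ("node-a", "cpu1", "user", 0.50),
    ("node-b", "cpu0", "idle", 1.00), ("node-b", "cpu0", "user", 0.00),
    ("node-b", "cpu1", "idle", 0.75), ("node-b", "cpu1", "user", 0.25),
]
print(cluster_cpu_percent(samples))
```

On this scale, the 0.629 to 0.703 range reported earlier reads as roughly 63% to 70%.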


Matthias Rampke

Sep 19, 2017, 4:23:29 AM

to Ben Kochie, Khusro Jaleel, Prometheus Users

Here are our rules for cluster-wide CPU allocation (requests) and real utilisation (usage):

cluster:cpu_allocation:percent =
sum(kube_pod_container_resource_requests_cpu_cores)
/
sum(kube_node_status_capacity_cpu_cores)
* 100

cluster:node_cpu_use:percent =
sum(rate(node_cpu{mode!="idle"}[5m]))
/
sum(machine_cpu_cores)
* 100

We show both on the main cluster dashboard, but as a cluster operator the first number is actually the one that is more important to me. It doesn't help me if users reserve CPUs and don't use them; it still means other users can't use them.

Keep in mind that the latter is derived from node exporter metrics. You can probably get similar results by crunching cAdvisor metrics for the "/" container.

/MR
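
To see how far the two rules above can diverge, here is a small Python sketch with invented totals. None of these figures come from the thread, the variable names are only illustrative, and it assumes the core count from machine_cpu_cores matches the node capacity reported by kube-state-metrics.

```python
# Hypothetical cluster totals; every number here is invented.
capacity_cores = 24.0   # sum(kube_node_status_capacity_cpu_cores)
requested_cores = 18.0  # sum(kube_pod_container_resource_requests_cpu_cores)
busy_cores = 3.6        # sum(rate(node_cpu{mode!="idle"}[5m]))

# Share of the cluster that pods have reserved through CPU requests.
allocation_percent = requested_cores / capacity_cores * 100
# Share of the cluster that is actually busy.
use_percent = busy_cores / capacity_cores * 100

print(f"allocated: {allocation_percent:.1f}%")
print(f"used: {use_percent:.1f}%")
```

A cluster like this one would look nearly idle on a utilisation graph while having little room left for new pods, which is the situation the allocation rule is meant to expose.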
