Need to capture metrics using Prometheus


Monica

Aug 23, 2023, 6:55:02 AM
to Prometheus Users
Hi All,

I need to capture the 'st' parameter (which represents the time stolen from this virtual machine by the hypervisor) from the Linux 'top' command using Prometheus for monitoring purposes.

%Cpu(s):  1.5 us,  1.2 sy,  0.0 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.0 si, 0.0 st

Could anyone please suggest whether this is achievable using Prometheus? If so, could you also explain how?

Brian Candler

Aug 23, 2023, 9:44:40 AM
to Prometheus Users
The CPU steal time is already available as a metric from node_exporter, as node_cpu_seconds_total{instance="XXX",cpu="N",mode="steal"}

Since this is an accumulated number of seconds, you'd use rate() to find out how fast it is growing.
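For example, a minimal query might look like this (a sketch; pick a range window of at least twice your scrape interval):

```promql
# Per-CPU steal time as a percentage, averaged over the last 5 minutes
rate(node_cpu_seconds_total{mode="steal"}[5m]) * 100
```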

If you want to view this in Grafana there are existing dashboards you can use, e.g. https://grafana.com/grafana/dashboards/1860-node-exporter-full/

Monica

Aug 24, 2023, 3:03:11 AM
to Prometheus Users
Hi Brian,

Thank you for the update. The node_cpu_seconds_total metric is already present in the system. However, the 'st' value highlighted below is not being reflected, and I want to capture only this specific parameter. Please suggest.

%Cpu(s):  1.5 us,  1.2 sy,  0.0 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.0 si, 0.0 st

Brian Candler

Aug 24, 2023, 4:54:20 AM
to Prometheus Users
You will need to make a PromQL query that performs the same calculation "top" does to produce that value. I don't know what time period top averages over, nor what scrape interval you are using for your node_exporter metrics.

As I said before, if you want example queries for any node_exporter metrics, there are Grafana dashboards available. Just open them up and copy the queries they are making.

If you are scraping at 1 minute intervals, then something like this should do the trick:
avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[2m])) * 100

You can use 'sum' instead of 'avg', but then the percentages will reflect multiple CPUs (e.g. host has 8 CPUs => values will be out of 800%)
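For example, the 'sum' variant would look like this (a sketch derived from the query above):

```promql
# Total steal percentage across all CPUs on each host;
# on an 8-CPU host this can range up to 800%
sum by (instance) (rate(node_cpu_seconds_total{mode="steal"}[2m])) * 100
```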

How this works:

- the metric node_cpu_seconds_total{mode="steal"} accumulates all the time that each CPU has spent in the "steal" state

- taking a rate(...) of this metric will tell you the fraction of time in this state, i.e. the number of seconds in "steal" state, per second of real time

- there will be separate values of this metric for each host (instance) and each cpu on that host

- avg by (instance) will group together all the metrics for each unique host, i.e. all CPUs on that host, and average them - giving one metric per host

The values for all CPU states *should* add up to 100%. In practice, they don't add up exactly; you can check with:
sum by (instance,cpu)(rate(node_cpu_seconds_total[2m])) * 100

If this matters, you can make a more complex query to normalize the results.
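For example, one way to normalize (a sketch: divide the steal rate by the total rate across all modes, so the result is always a share of 100%):

```promql
# Steal time as a fraction of total observed CPU time, per host
sum by (instance) (rate(node_cpu_seconds_total{mode="steal"}[2m]))
  / sum by (instance) (rate(node_cpu_seconds_total[2m]))
  * 100
```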