Can node_exporter expose an aggregated node_cpu_seconds_total?


koly li

Feb 2, 2023, 1:26:29 AM
to Prometheus Users
Hi,

Currently, node_exporter exposes time series for each CPU core (an example below), which generates a lot of data in a large cluster (a 10k-node cluster). However, we only care about total CPU usage rather than usage per core. So is there a way for node_exporter to expose only an aggregated node_cpu_seconds_total?

We also noticed there is a discussion here (reduce cardinality of node_cpu_seconds_total), but it does not seem to have reached a conclusion.

node_cpu_seconds_total{container="node-exporter",cpu="85",endpoint="metrics",hostname="603k09311-9-bjsimu01",instance="10.253.108.171:9100",ip="10.253.108.171",job="node-exporter",mode="system",namespace="product-coc-monitor",pod="coc-monitor-prometheus-node-exporter-c2plp",service="coc-monitor-prometheus-node-exporter",prometheus="product-coc-monitor/coc-prometheus",prometheus_replica="prometheus-coc-prometheus-1"} 9077.24 1675059665571
node_cpu_seconds_total{container="node-exporter",cpu="85",endpoint="metrics",hostname="603k09311-9-bjsimu01",instance="10.253.108.171:9100",ip="10.253.108.171",job="node-exporter",mode="user",namespace="product-coc-monitor",pod="coc-monitor-prometheus-node-exporter-c2plp",service="coc-monitor-prometheus-node-exporter",prometheus="product-coc-monitor/coc-prometheus",prometheus_replica="prometheus-coc-prometheus-1"} 19298.57 1675059665571
node_cpu_seconds_total{container="node-exporter",cpu="86",endpoint="metrics",hostname="603k09311-9-bjsimu01",instance="10.253.108.171:9100",ip="10.253.108.171",job="node-exporter",mode="idle",namespace="product-coc-monitor",pod="coc-monitor-prometheus-node-exporter-c2plp",service="coc-monitor-prometheus-node-exporter",prometheus="product-coc-monitor/coc-prometheus",prometheus_replica="prometheus-coc-prometheus-1"} 1.060892164e+07 1675059665571
node_cpu_seconds_total{container="node-exporter",cpu="86",endpoint="metrics",hostname="603k09311-9-bjsimu01",instance="10.253.108.171:9100",ip="10.253.108.171",job="node-exporter",mode="iowait",namespace="product-coc-monitor",pod="coc-monitor-prometheus-node-exporter-c2plp",service="coc-monitor-prometheus-node-exporter",prometheus="product-coc-monitor/coc-prometheus",prometheus_replica="prometheus-coc-prometheus-1"} 4.37 1675059665571

Stuart Clark

Feb 2, 2023, 1:40:34 AM
to koly li, Prometheus Users

You can't remove it as far as I'm aware, but you can use a recording rule to aggregate that data to just give you a metric that represents the overall CPU usage (not broken down by core/status).
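
For example, a minimal sketch of such a rule (the record name and the 5m window are arbitrary choices here, not an official convention):

groups:
  - name: node_cpu_aggregation
    rules:
      # Overall CPU utilisation per instance; the cpu and mode labels are aggregated away.
      - record: instance:node_cpu_utilisation:rate5m
        expr: |
          1 - avg without (cpu, mode) (
            rate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m])
          )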

-- 
Stuart Clark

koly li

Feb 2, 2023, 4:05:30 AM
to Prometheus Users
If I use a recording rule to aggregate the data, then I have to store both the per-core samples and the aggregated samples in the same Prometheus, which costs a lot of memory.

After some investigation of the node_exporter source code, I found:
1. The updateStat function (cpu_linux.go) reads /proc/stat and generates the per-core node_cpu_seconds_total samples.
2. updateStat calls c.fs.Stat() to read and parse /proc/stat.
3. fs.Stat() parses /proc/stat and stores the CPU totals in Stat.CPUTotal (stat.go).
4. However, updateStat ignores Stat.CPUTotal; it only uses stats.CPU, which contains the per-core information.

So the question is: why don't the node_exporter developers use CPUTotal to expose total CPU statistics? Should a new metric for total usage statistics be added to node_exporter?

Ben Kochie

Feb 2, 2023, 4:20:58 AM
to koly li, Prometheus Users
The node_exporter exposes per-CPU metrics because that's what most users want. Knowing about per-core saturation, single-core IO wait, etc. is extremely useful, and these are common use cases.
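
For example, an ad-hoc query like the following (just an illustration, not something node_exporter ships) only works if the cpu label is kept:

# cores busier than 90% over the last 5 minutes, per instance and per cpu
(1 - rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9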

Using a recording rule is recommended.


Brian Candler

Feb 5, 2023, 6:13:23 AM
to Prometheus Users
On Thursday, 2 February 2023 at 09:05:30 UTC koly li wrote:
If I use a recording rule to aggregate the data, then I have to store both the per-core samples and the aggregated samples in the same Prometheus, which costs a lot of memory.

Timeseries in Prometheus are extremely cheap.  If you're talking about 10K nodes and 96 cores per node, that's less than 1M timeseries; compared to the cost of the estate you are managing, it's a drop in the ocean :-)  How many *other* timeseries are you storing from node_exporter?

But if you still want to drop these timeseries, I can see two options:

1. Scrape into a primary Prometheus, use recording rules to aggregate, and then either remote_write or federate to a second Prometheus to store only the timeseries of interest (a federation sketch follows after these options).  This can be done with out-of-the-box components.  The primary Prometheus needs only a very small retention window.

2. Write a small proxy which scrapes node_exporter, does the aggregation, and returns only the aggregates.  Then scrape the proxy.  That will involve some coding.
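
For option 1, a rough sketch of the federation job on the second Prometheus, pulling only the recorded aggregate (the rule name matches the sketch earlier in the thread; the target address is a placeholder):

scrape_configs:
  - job_name: 'federate-node-cpu'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - 'instance:node_cpu_utilisation:rate5m'
    static_configs:
      - targets:
          - 'primary-prometheus.example.com:9090'   # placeholder for the primary Prometheus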

Ben Kochie

Feb 5, 2023, 6:25:55 AM
to Brian Candler, Prometheus Users
Well, there are 8 modes per CPU, so around 8M series. But still, that's not much for such a large infra. Since it's bare metal, you can drop "steal" to get it down to 7 modes.

If you really only cared about utilization, you could maybe just keep "idle" and maybe "iowait".
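
One way to do that at scrape time is a drop rule in metric_relabel_configs on the node-exporter scrape job, listing the modes to discard (a sketch; the job name, target, and mode list are assumptions about your setup):

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node1.example.com:9100']   # placeholder target
    metric_relabel_configs:
      # Drop the listed modes so only idle and iowait remain.
      - source_labels: [__name__, mode]
        separator: ';'
        regex: 'node_cpu_seconds_total;(user|system|nice|irq|softirq|steal)'
        action: drop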

It would probably be a small patch to the node_exporter to only expose system-wide. But it's probably not something we would really want to maintain upstream.


l.mi...@gmail.com

Feb 5, 2023, 3:46:10 PM
to Prometheus Users
On Sunday, 5 February 2023 at 11:13:23 UTC Brian Candler wrote:
Timeseries in Prometheus are extremely cheap.  If you're talking about 10K nodes and 96 cores per node, that's less than 1M timeseries; compared to the cost of the estate you are managing, it's a drop in the ocean :-)  How many *other* timeseries are you storing from node_exporter?

A single timeseries eats >=4KB on all nodes I touch. Having a lot of labels (or long labels) will make it more expensive.
So 1M timeseries will eat 4GB of memory.
Not everyone would call that extremely cheap, especially if that's just to tell what the CPU usage of each server is.

There was a PR that tried to implement scrape time recording rules, which would help here, but it didn't seem to go far - https://github.com/prometheus/prometheus/pull/10529

koly li

Feb 6, 2023, 2:03:28 AM
to Prometheus Users
Thank you, everyone.

We are planning for 10K nodes, and each node has 128 cores, so the node CPU timeseries count is 128 * 8 * 10000 = 10,240,000. Meanwhile, there are other timeseries from kubelet, kube-state-metrics and more (business data). In total, everything comes to around 30M timeseries, and Prometheus then eats 170G of memory (as tested), plus we think there should be some buffer (maybe 100G). So it makes sense to reduce the node-exporter timeseries to 128 * 1 * 10000 = 1,280,000.

We will try keeping only the idle mode for CPU usage.  Some expressions under consideration:
1 - avg without(cpu, mode) (rate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[5m]))

avg(1 - avg(rate(node_cpu_seconds_total{origin_prometheus=~"$origin_prometheus",job=~"$job",mode="idle"}[$interval])) by (instance)) * 100


