node_exporter high CPU usage


Dimitri Yioulos

Jan 6, 2022, 1:14:41 AM
to Prometheus Users
Hello, all, and Happy New Year.  This is my first post here, and I hope I'm posting in the right place. 

I've installed node_exporter v. 1.3.1 on a host with lots of memory and CPUs.  When I run top, I note that node_exporter uses up to 20% CPU.  Is that normal?  That seems high.  Is there a way to "optimize" node_exporter/make it more efficient, resource-wise?

Many thanks.

Ben Kochie

Jan 6, 2022, 2:21:20 AM
to Dimitri Yioulos, Prometheus Users
What do you get for rate(process_cpu_seconds_total[5m]) for the node_exporter job?

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/a5754b72-52c4-4124-a594-f49ae81bd144n%40googlegroups.com.

Brian Candler

Jan 6, 2022, 6:27:41 AM
to Prometheus Users
What flags are you running node_exporter with?

How often are you scraping it?

Dimitri Yioulos

Jan 14, 2022, 9:12:02 AM
to Prometheus Users
Thanks for the responses, and apologies for not answering more quickly.

@sup  I added the query that you posted, though I don't really know if/how it should be set up.  The value I get for one of my monitored hosts (and, one I'm chiefly concerned about, vis-a-vis high CPU usage) is 0.194.  Does that mean 19.4%?  It's relatively consistent over time.

Additionally, I'm scraping every 5 seconds.  Is that too aggressive?

@Brian Candler  I'm using the node_exporter defaults, as described here: https://github.com/prometheus/node_exporter.

I hope this information helps you help me.  I'll try to provide any additional information that I can in order to fine-tune node_exporter.

Ben Kochie

Jan 14, 2022, 10:16:12 AM
to Dimitri Yioulos, Prometheus Users
On Fri, Jan 14, 2022 at 3:12 PM Dimitri Yioulos <dyio...@gmail.com> wrote:
> @sup  I added the query that you posted, though I don't really know if/how it should be set up.  The value I get for one of my monitored hosts (and, one I'm chiefly concerned about, vis-a-vis high CPU usage) is 0.194.  Does that mean 19.4%?  It's relatively consistent over time.

Yes, that means 0.194 CPU seconds per second. Where a 1 would be a full CPU. So you're correct, that's 19.4% of one CPU.
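
As a rough illustration of that arithmetic, here is a minimal Python sketch of how a per-second rate falls out of two samples of a counter such as process_cpu_seconds_total. It is not how Prometheus implements rate() (the real function also handles counter resets and extrapolates to the window edges), and the sample values are made up so that the result matches the 0.194 reported above.

```python
def simple_rate(v_start: float, v_end: float, t_start: float, t_end: float) -> float:
    """Average per-second increase of a counter between two samples."""
    if t_end <= t_start:
        raise ValueError("t_end must be later than t_start")
    return (v_end - v_start) / (t_end - t_start)


# Hypothetical samples: the counter grew by 58.2 CPU-seconds over a 300 s window.
cpu_rate = simple_rate(1000.0, 1058.2, 0.0, 300.0)

# 1.0 would mean one core fully busy, so multiply by 100 for "% of one core".
percent_of_one_core = cpu_rate * 100.0

print(f"{cpu_rate:.3f} CPU-seconds/second = {percent_of_one_core:.1f}% of one core")
```

The result is a fraction of a single core, so on a many-core host it is not a percentage of total machine capacity.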
 
I have some 96 core VMs that only use 0.02 to 0.03 with 15s scrapes (2x HA, so 7.5s average).

Something is clearly abnormal with what is happening for you.

I would recommend using pprof to pull a CPU profile chart.

go tool pprof -svg http://localhost:9100/debug/pprof/profile > node_exporter.svg
You can replace "localhost" with your target hostname/IP of course.

That should give us an idea of what is taking up all your CPU time.


> Additionally, I'm scraping every 5 seconds.  Is that too aggressive?

Typical is 15 seconds, but 5 seconds isn't too aggressive.
 


Dimitri Yioulos

Jan 14, 2022, 11:03:31 AM
to Prometheus Users
@sup. OK, I installed Go, then ran go tool pprof -svg http://localhost:9100/debug/pprof/profile > node_exporter.svg (which seems handy, btw).  But I'm somewhat unsure of how to interpret the output.  See attached.
node_exporter.svg

Ben Kochie

Jan 15, 2022, 2:46:26 AM
to Dimitri Yioulos, Prometheus Users
It looks like you have several non-default collectors enabled.


Based on the profile, it looks like the most expensive ones are the processes and systemd collectors.

The processes collector is known to be expensive for larger systems, and I suspect the systemd one is going to be affected for similar reasons.

Brian Candler

Jan 15, 2022, 11:23:43 AM
to Prometheus Users
On Friday, 14 January 2022 at 14:12:02 UTC dyio...@gmail.com wrote:
> @Brian Candler  I'm using the node_exporter defaults, as described here - https://github.com/prometheus/node_exporter.

Are you really?  Can you show the exact command line that node_exporter is running with?  e.g.

ps auxwww | grep node_exporter

Dimitri Yioulos

Jan 18, 2022, 9:33:25 AM
to Prometheus Users
[root@myhost1 ~]# ps auxwww | grep node_exporter
node_ex+ 4143664 12.5  0.0 725828 22668 ?        Ssl  09:29   0:06 /usr/local/bin/node_exporter --no-collector.wifi

Brian Candler

Jan 18, 2022, 1:12:04 PM
to Prometheus Users
Can you show the output of:

curl -Ss localhost:9100/metrics | grep -i collector

Dimitri Yioulos

Jan 19, 2022, 6:27:40 PM
to Prometheus Users
[root@myhost1 ~]# curl -Ss localhost:9100/metrics | grep -i collector
# HELP node_scrape_collector_duration_seconds node_exporter: Duration of a collector scrape.
# TYPE node_scrape_collector_duration_seconds gauge
node_scrape_collector_duration_seconds{collector="arp"} 0.002911805
node_scrape_collector_duration_seconds{collector="bcache"} 1.4571e-05
node_scrape_collector_duration_seconds{collector="bonding"} 0.000112308
node_scrape_collector_duration_seconds{collector="btrfs"} 0.001308192
node_scrape_collector_duration_seconds{collector="conntrack"} 0.002750716
node_scrape_collector_duration_seconds{collector="cpu"} 0.010873961
node_scrape_collector_duration_seconds{collector="cpufreq"} 0.008559194
node_scrape_collector_duration_seconds{collector="diskstats"} 0.01727642
node_scrape_collector_duration_seconds{collector="dmi"} 0.000971785
node_scrape_collector_duration_seconds{collector="edac"} 0.006972343
node_scrape_collector_duration_seconds{collector="entropy"} 0.001360089
node_scrape_collector_duration_seconds{collector="fibrechannel"} 2.8256e-05
node_scrape_collector_duration_seconds{collector="filefd"} 0.000739988
node_scrape_collector_duration_seconds{collector="filesystem"} 0.00554684
node_scrape_collector_duration_seconds{collector="hwmon"} 0.014143617
node_scrape_collector_duration_seconds{collector="infiniband"} 1.3484e-05
node_scrape_collector_duration_seconds{collector="ipvs"} 7.5532e-05
node_scrape_collector_duration_seconds{collector="loadavg"} 0.004074291
node_scrape_collector_duration_seconds{collector="mdadm"} 0.000974966
node_scrape_collector_duration_seconds{collector="meminfo"} 0.004201816
node_scrape_collector_duration_seconds{collector="netclass"} 0.013852102
node_scrape_collector_duration_seconds{collector="netdev"} 0.006993921
node_scrape_collector_duration_seconds{collector="netstat"} 0.007896151
node_scrape_collector_duration_seconds{collector="nfs"} 0.000125062
node_scrape_collector_duration_seconds{collector="nfsd"} 3.6075e-05
node_scrape_collector_duration_seconds{collector="nvme"} 0.001064067
node_scrape_collector_duration_seconds{collector="os"} 0.005645435
node_scrape_collector_duration_seconds{collector="powersupplyclass"} 0.001394135
node_scrape_collector_duration_seconds{collector="pressure"} 0.001466664
node_scrape_collector_duration_seconds{collector="rapl"} 0.00226622
node_scrape_collector_duration_seconds{collector="schedstat"} 0.006677493
node_scrape_collector_duration_seconds{collector="sockstat"} 0.000970676
node_scrape_collector_duration_seconds{collector="softnet"} 0.002014497
node_scrape_collector_duration_seconds{collector="stat"} 0.004216999
node_scrape_collector_duration_seconds{collector="tapestats"} 1.0296e-05
node_scrape_collector_duration_seconds{collector="textfile"} 5.2573e-05
node_scrape_collector_duration_seconds{collector="thermal_zone"} 0.010936983
node_scrape_collector_duration_seconds{collector="time"} 0.00568072
node_scrape_collector_duration_seconds{collector="timex"} 3.3662e-05
node_scrape_collector_duration_seconds{collector="udp_queues"} 0.004138555
node_scrape_collector_duration_seconds{collector="uname"} 1.3713e-05
node_scrape_collector_duration_seconds{collector="vmstat"} 0.005691152
node_scrape_collector_duration_seconds{collector="xfs"} 0.008633677
node_scrape_collector_duration_seconds{collector="zfs"} 2.8179e-05
# HELP node_scrape_collector_success node_exporter: Whether a collector succeeded.
# TYPE node_scrape_collector_success gauge
node_scrape_collector_success{collector="arp"} 1
node_scrape_collector_success{collector="bcache"} 1
node_scrape_collector_success{collector="bonding"} 0
node_scrape_collector_success{collector="btrfs"} 1
node_scrape_collector_success{collector="conntrack"} 1
node_scrape_collector_success{collector="cpu"} 1
node_scrape_collector_success{collector="cpufreq"} 1
node_scrape_collector_success{collector="diskstats"} 1
node_scrape_collector_success{collector="dmi"} 1
node_scrape_collector_success{collector="edac"} 1
node_scrape_collector_success{collector="entropy"} 1
node_scrape_collector_success{collector="fibrechannel"} 0
node_scrape_collector_success{collector="filefd"} 1
node_scrape_collector_success{collector="filesystem"} 1
node_scrape_collector_success{collector="hwmon"} 1
node_scrape_collector_success{collector="infiniband"} 0
node_scrape_collector_success{collector="ipvs"} 0
node_scrape_collector_success{collector="loadavg"} 1
node_scrape_collector_success{collector="mdadm"} 1
node_scrape_collector_success{collector="meminfo"} 1
node_scrape_collector_success{collector="netclass"} 1
node_scrape_collector_success{collector="netdev"} 1
node_scrape_collector_success{collector="netstat"} 1
node_scrape_collector_success{collector="nfs"} 0
node_scrape_collector_success{collector="nfsd"} 0
node_scrape_collector_success{collector="nvme"} 0
node_scrape_collector_success{collector="os"} 1
node_scrape_collector_success{collector="powersupplyclass"} 1
node_scrape_collector_success{collector="pressure"} 0
node_scrape_collector_success{collector="rapl"} 1
node_scrape_collector_success{collector="schedstat"} 1
node_scrape_collector_success{collector="sockstat"} 1
node_scrape_collector_success{collector="softnet"} 1
node_scrape_collector_success{collector="stat"} 1
node_scrape_collector_success{collector="tapestats"} 0
node_scrape_collector_success{collector="textfile"} 1
node_scrape_collector_success{collector="thermal_zone"} 1
node_scrape_collector_success{collector="time"} 1
node_scrape_collector_success{collector="timex"} 1
node_scrape_collector_success{collector="udp_queues"} 1
node_scrape_collector_success{collector="uname"} 1
node_scrape_collector_success{collector="vmstat"} 1
node_scrape_collector_success{collector="xfs"} 1
node_scrape_collector_success{collector="zfs"} 0

Brian Candler

Jan 20, 2022, 3:46:35 AM
to Prometheus Users
So the systemd and process collectors aren't active.  I wonder why they appeared in your pprof graph then?  Was it exactly the same binary you were running?

20% CPU usage from a once-every-five-second scrape implies that each scrape should take about 1 CPU-second in total, but all the collectors seem very fast.  The five slowest take between roughly 0.01 and 0.017 seconds each - and that's wall clock time, not CPU time.

node_scrape_collector_duration_seconds{collector="cpu"} 0.010873961
node_scrape_collector_duration_seconds{collector="diskstats"} 0.01727642
node_scrape_collector_duration_seconds{collector="hwmon"} 0.014143617
node_scrape_collector_duration_seconds{collector="netclass"} 0.013852102
node_scrape_collector_duration_seconds{collector="thermal_zone"} 0.010936983
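
To put numbers on that mismatch, here is a small Python sketch. It assumes Prometheus is the only thing scraping this node_exporter and treats the 20% figure from top as a steady average; the collector durations are copied from the output posted earlier in the thread.

```python
scrape_interval_s = 5.0  # from the thread: one scrape every 5 seconds
cpu_fraction = 0.20      # roughly 20% of one core, as seen in top

# CPU time each scrape would have to burn to sustain that average.
cpu_seconds_per_scrape = cpu_fraction * scrape_interval_s

# Wall-clock durations of the five slowest collectors, from the posted output.
top_five = {
    "cpu": 0.010873961,
    "diskstats": 0.01727642,
    "hwmon": 0.014143617,
    "netclass": 0.013852102,
    "thermal_zone": 0.010936983,
}
top_five_total = sum(top_five.values())

print(f"implied CPU per scrape: {cpu_seconds_per_scrape:.2f} s")
print(f"five slowest collectors: {top_five_total:.3f} s (wall clock)")
print(f"gap: roughly {cpu_seconds_per_scrape / top_five_total:.0f}x")
```

Even if every one of those collectors were pure CPU time, they would account for only a small part of the implied per-scrape cost, which is why the collectors themselves look like an unlikely culprit.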

Something weird is going on.  Next you might want to drill down into node_exporter's user versus system time.  Is the usage mostly system time?  That might point you some way, although the implication then is that the high CPU usage is some part of node_exporter outside of individual collectors.

Dimitri Yioulos

Jan 20, 2022, 7:33:06 AM
to Prometheus Users
Brian,

Originally, I had not activated any additional collectors.  Then, I read somewhere that I should add the systemd and process collectors (still learning, here).  That's why you saw them in the pprof graph.  I then circled back and removed them.  However, high CPU usage has always been an issue.  That goes for every system on which I have node_exporter running.  While a few are test machines, and I care a bit less, for production machines it's an issue.

Here's some time output for node_exporter, though I'm not good at interpreting the results:

[root@myhost1 ~]# time for ((i=1;i<=1000;i++)); do node_exporter >/dev/null 2>&1; done

real        0m6.103s
user        0m3.658s
sys        0m3.151s

So, if the above is a good way to measure node_exporter's user versus system time, then they're about equal.  If you have another means to do such measurement, I'd appreciate your sharing it.  Once that's determined and, if system time versus user time is "out-of-whack", how do I remediate?

Many thanks.

Brian Candler

Jan 20, 2022, 11:54:33 AM
to Prometheus Users
So now go back to the original suggestion: run pprof with node_exporter running the way you *want* to be running it.

> [root@myhost1 ~]# time for ((i=1;i<=1000;i++)); do node_exporter >/dev/null 2>&1; done

That's meaningless.  node_exporter is a daemon, not something you can run one-shot like that.  If you remove the ">/dev/null 2>&1" you'll see lots of startup messages, probably ending with

ts=2022-01-20T16:49:07.433Z caller=node_exporter.go:202 level=error err="listen tcp :9100: bind: address already in use"

and then node_exporter terminating.  So you're not seeing the CPU overhead of any node_exporter scrape jobs, only its startup overhead.

If the system is idle apart from running node_exporter, then "top" will show you user time and system time.  More accurately, find the process ID of node_exporter, then look in /proc/<pid>/stat
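
For anyone unsure how to read that file, here is a minimal Python sketch (Linux only). Per the proc(5) man page, fields 14 and 15 of /proc/<pid>/stat are utime and stime, measured in clock ticks. The helper names below are just for illustration, and the demo reads the script's own PID; substitute node_exporter's PID in practice.

```python
import os


def parse_cpu_ticks(stat_line: str) -> tuple[int, int]:
    """Return (utime, stime) in clock ticks from one /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces, so split on the last ')' before splitting on whitespace.
    after_comm = stat_line.rsplit(")", 1)[1].split()
    # after_comm[0] is field 3 (state), so fields 14 and 15 are at 11 and 12.
    return int(after_comm[11]), int(after_comm[12])


def cpu_seconds(pid: int) -> tuple[float, float]:
    """Read cumulative user and system CPU seconds for a process."""
    with open(f"/proc/{pid}/stat") as f:
        utime, stime = parse_cpu_ticks(f.read())
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    return utime / ticks_per_second, stime / ticks_per_second


if __name__ == "__main__":
    # Demo on this script's own PID; use node_exporter's PID instead.
    if os.path.exists("/proc/self/stat"):
        user_s, sys_s = cpu_seconds(os.getpid())
        print(f"user={user_s:.2f}s system={sys_s:.2f}s")
```

The values are cumulative since the process started, so to see where the time is going now, read them twice a minute or so apart and subtract.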

Dimitri Yioulos

Jan 20, 2022, 6:30:25 PM
to Prometheus Users
I ran pprof (attached).  I'll have to work on /proc/<pid>/stat (even with the much appreciated reference :-) ).
node_exporter.svg

Dimitri Yioulos

Jan 20, 2022, 7:27:14 PM
to Prometheus Users
The attached is the pprof output in text format, which may be easier to read.
node_exporter.txt

Brian Candler

Jan 21, 2022, 3:06:43 AM
to Prometheus Users
The question is, why are the systemd and processes collectors still in that graph?

Dimitri Yioulos

Jan 21, 2022, 7:27:18 AM
to Prometheus Users
That's a good question.  The machine whose pprof output you're seeing was just rebuilt, so the output is from a fresh, basic install of node_exporter.  This is the systemd node_exporter service:

[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target

and, the prometheus target:

  - job_name: 'myserver1'
    scrape_interval: 5s
    static_configs:
      - targets: ['myserver1:9100']
        labels:
          env: prod
          alias: myserver1

I'm not sure what else to look at.

Dimitri Yioulos

Jan 28, 2022, 7:59:09 AM
to Prometheus Users
All,

I realize this is a very long thread (with apologies), but I really need to find a solution to the high CPU usage.  Please let me know if I can provide any additional information so that you can help me.

Brian Candler

Jan 28, 2022, 11:30:59 AM
to Prometheus Users
It appears to be a problem that's specific to your system only.  Therefore, you'll need to debug it on your side I'm afraid.  Tools like strace may be able to identify specific system calls which are taking a lot of time.

FWIW, I ran that "go tool pprof" command here, pointing it at a node_exporter instance running on a very low power (NUC DN2820 dual Celeron) box.
go tool pprof -svg http://nuc1:9100/debug/pprof/profile > node_exporter.svg

At the same time I hit it with curl roughly every 2 seconds:
while true; do curl nuc1:9100/metrics >/dev/null; sleep 2; done

The target system's node_exporter is running with just one flag:
node_exporter --collector.textfile.directory=/var/lib/node_exporter

In the SVG summary I see:

File: node_exporter
Type: cpu
Time: Jan 28, 2022 at 4:20pm (GMT)
Duration: 30s, Total samples = 1.95s ( 6.50%)
Showing nodes accounting for 1.42s, 72.82% of 1.95s total
Showing top 80 nodes out of 374

and there are no nodes for the systemd collector.  They do appear if I add "--collector.systemd", as expected.
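
For reference, the percentage in that pprof header is just sampled CPU time divided by the profile duration. The Python sketch below reproduces it and then makes a rough per-scrape estimate. That estimate assumes the curl loop was the only client and managed one scrape every 2 seconds, which slightly overstates the scrape count since each curl call also takes some time.

```python
profile_duration_s = 30.0  # "Duration: 30s"
sampled_cpu_s = 1.95       # "Total samples = 1.95s"

# Average share of one core used while the profile was being collected.
avg_cpu_fraction = sampled_cpu_s / profile_duration_s
print(f"average CPU during profile: {avg_cpu_fraction:.2%} of one core")

# Assumption: one scrape every 2 seconds for the whole profiling window.
scrapes_in_window = profile_duration_s / 2.0
cpu_per_scrape_s = sampled_cpu_s / scrapes_in_window
print(f"approximate CPU per scrape: {cpu_per_scrape_s:.2f} s")
```

Reading the same header line from your own profile gives a quick cross-check against what top and rate(process_cpu_seconds_total[5m]) report.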

So: there's something strange on your system.  It could be one of a hundred things, but the fact that systemd appears in your pprof output, even though you claim you're not running the systemd collector, is a big red flag.  Finding out what's happening there will probably point you at the answer.

Remember of course that if node_exporter is running on a remote host, say 1.2.3.4, but you're running go tool pprof on another system (say your laptop), then you'd need to do
go tool pprof -svg http://1.2.3.4:9100/debug/pprof/profile > node_exporter.svg

If you leave it as 127.0.0.1 then you're looking at the node_exporter instance running on the *same* system as where you're running go tool pprof.

Good luck with your hunting!

Dimitri Yioulos

Feb 1, 2022, 5:31:15 PM
to Prometheus Users
Thank you, so much, for your time and patience.  The information you've provided is, at the very least, instructive.  I'm most appreciative!