[JIRA] (JENKINS-58018) High CPU load at slave with SSH-agent running freestyle project

33 views
Skip to first unread message

macmac@xdsnet.pl (JIRA)

unread,
Jun 14, 2019, 4:05:02 AM6/14/19
to jenkinsc...@googlegroups.com
Maciej Maczynski created an issue
 
Jenkins / Improvement JENKINS-58018
High CPU load at slave with SSH-agent running freestyle project
Issue Type: Improvement Improvement
Assignee: Jeff Thompson
Components: remoting
Created: 2019-06-14 08:04
Environment: Jenkins ver. 2.164.3,
Slave: OS: Ubuntu 18.04 LTS, JDK: OpenJDK 1.8.0_212-8u212-b03-0ubuntu1.18.04.1-b03
Priority: Minor Minor
Reporter: Maciej Maczynski

When running "freestyle project" (gmake) at slave, java takes high CPU percentage: from 40% to even 70%, 50% maybe considered average.

The load does not depend on build-process output.

Easy way to reproduce:

1) Create repository with single Makefile that just waits:

all:
    sleep 300

2) Create jenkins "freestyle project" with single build step:

make

3) Execute the build at node

What you would see at slave is java process taking about 50% CPU waiting for process that does nothing.

I tried various options for agent including http://-dhudson.remoting.remoteinvocationhandler/$Unexporter.retainOrigin=false, but it makes no difference.

 I've run remoting with profiler. Unfortunately I don't have results saved and it's hard for me to repeat this investigation right now, but I remember that most CPU was consumed for I/O read attempts. Child process output stream polling? Anyway - one can easily reproduce it as described above.

This affects make-style (freestyle builds) only. E.g. for maven builds, remoting process behaves correctly, I mean it takes few % of CPU.

I create it as "Improvement" as it allows builds to run SOMEHOW. But, frankly speaking, for me heating CPU for nothing is a bug. Besides, we have some builds which involve timing-tests - for these ones this is KO as tests cannot complete in such heavily-loaded environment.

 

 

 

Add Comment Add Comment
 
This message was sent by Atlassian Jira (v7.11.2#711002-sha1:fdc329d)

macmac@xdsnet.pl (JIRA)

unread,
Jun 14, 2019, 5:06:02 AM6/14/19
to jenkinsc...@googlegroups.com
Maciej Maczynski stopped work on Improvement JENKINS-58018
 
Change By: Maciej Maczynski
Status: In Progress Open

macmac@xdsnet.pl (JIRA)

unread,
Jun 14, 2019, 5:06:02 AM6/14/19
to jenkinsc...@googlegroups.com
Maciej Maczynski started work on Improvement JENKINS-58018
 
Change By: Maciej Maczynski
Status: Open In Progress

jthompson@cloudbees.com (JIRA)

unread,
Jun 14, 2019, 1:38:03 PM6/14/19
to jenkinsc...@googlegroups.com
Jeff Thompson commented on Improvement JENKINS-58018
 
Re: High CPU load at slave with SSH-agent running freestyle project

I tried a quick reproduction on my system, which is different than the reported configuration. I'm not able to observe anything similar. Many times these sorts of issues are specific to the environment. It is often more about the platform on which it is hosted and the resources it has than the specific Jenkins version or OS flavor and version.

Some VM or cloud environments can be particularly susceptible to these sorts of issues. JENKINS-56581 is another recent with similar behavior, though the OSes are very different so the causes probably aren't related.

Without reproducibility and further details it will be very hard for anyone to figure this one out. 

macmac@xdsnet.pl (JIRA)

unread,
Jun 14, 2019, 4:05:02 PM6/14/19
to jenkinsc...@googlegroups.com

Thanks for fast response.

This slave is VM running at ESX server v 5.5.  Mentioned CPU usage values were noted with 2 virtual CPU cores.

Increasing CPU resources for VM helps in a way. E.g. with 8 virtual cores, java CPU usage gets to somehow like 10% or a bit less.  I would say - still far too much for process that could  just wait on "select" system call, using non-blocking I/O.   It looks like more CPU resources "mask" actual problem.  Maybe you could try to reproduce it with 2 CPU cores? Sometimes running a program with limited resources shows flaws that are not visible otherwise. I know that 2 cores is kind of laugh today, but for build process of this kind (which for some reasons is not suitable for parallel GCC compilation) it is pretty enough.

From my side, I would test the build at newer ESX running at different hardware. And try to profile the process and post results here.

macmac@xdsnet.pl (JIRA)

unread,
Jun 18, 2019, 5:03:02 AM6/18/19
to jenkinsc...@googlegroups.com

macmac@xdsnet.pl (JIRA)

unread,
Jun 18, 2019, 5:07:02 AM6/18/19
to jenkinsc...@googlegroups.com
 
Re: High CPU load at slave with SSH-agent running freestyle project

Here are screenshots from profiling session. It was single run of make with target as above (60s sleep).

Looks like readBytes method is CPU-eater. I'm not Java developer, but if it was C, I would say this is too frequent polling of input streams, without taking benefit of system asynchronous I/O features. I can post .NPS file if it can be useful for you.

 

jthompson@cloudbees.com (JIRA)

unread,
Jun 19, 2019, 6:39:02 PM6/19/19
to jenkinsc...@googlegroups.com

I don't have your configuration, but I've tried a couple of different configurations to see if I can reproduce anything similar. No luck. Do you think the gmake and Makefile approach you use has any significance? I tried that on one of my configurations but I wouldn't think that would be meaningful.

I'm not certain what your profiler session is showing (they can be difficult to read and this one appears to be cutoff or in a different layout than I've used before). It looks like it shows that the agent is spending quite a while blocked waiting to read, which is expected behavior. Especially if the job isn't doing anything, these threads will spend most of their time blocked waiting for input. That wouldn't cause high CPU load. You might see if you can catch the number of times the thread performs this operation, which could possibly affect CPU load.

I've tried it with 2 cores and not seen any problems with CPU load. The only setup I currently have with 2 cores uses a different platform, but the code is the same.

I'm sure lots of people run agents on machines with two cores. And the SSH-agent launcher is very popular. If it's a general problem we would expect to see more problem reports or others adding on here. We have seen, though, that VMs can be susceptible to weird problems, including odd load and I/O problems.

 

macmac@xdsnet.pl (JIRA)

unread,
Jun 24, 2019, 3:48:02 AM6/24/19
to jenkinsc...@googlegroups.com

You are right - total method time may be misleading, I don't know exactly what NetBeans profiler means here. BUT - I've run another profiling session: selected only 1 class: hudson.remoting.FlightRecorderInputStream - following my guess.

The screenshot is from single run of sample project mentioned above, that just gets repo and runs Makefile, which sleeps for 60 seconds. Thus - producing no output. And "input stream"-related method of remoting was called 25k times during this. In lower pane of the screenshot you have "Hot spots": here java.io.BufferedInputStream.getBufIfOpen was called over 100k times!  For me, this shows clearly, that high CPU shown in previous screenshot correspond to actual method execution time, not to the time spent on blocked input. As I said, I barely know Java, but I know async I/O - this is not how "asynchronous I/O" behaves.

Answering your questions: I don't think this behavior has anything to do with "Makefile" - when I run this project manually in shell, it barely takes any CPU. Same, when I run it from another program (Emacs), where reading process' output is involved.

macmac@xdsnet.pl (JIRA)

unread,
Jun 24, 2019, 3:48:02 AM6/24/19
to jenkinsc...@googlegroups.com
Maciej Maczynski updated an issue
Change By: Maciej Maczynski
Attachment: remoting-profile-hit-count.PNG

macmac@xdsnet.pl (JIRA)

unread,
Jun 24, 2019, 3:48:02 AM6/24/19
to jenkinsc...@googlegroups.com

macmac@xdsnet.pl (JIRA)

unread,
Jun 24, 2019, 4:27:02 AM6/24/19
to jenkinsc...@googlegroups.com

I've tried same test at another virtualization environment: newest Hyper-V at i7 PC.  VM still the same: Ubuntu 18.04 LTS, 2 virtual cores.

I can see same effect here. Well, maybe not exactly same, but very similiar: "top" at VM shows about 30% CPU for java process during test-project build. Better then 60/70% that I see in original environment, but IMHO still much too much as for waiting for process to complete sleep command.

 

macmac@xdsnet.pl (JIRA)

unread,
Jun 24, 2019, 4:50:02 AM6/24/19
to jenkinsc...@googlegroups.com
Maciej Maczynski edited a comment on Improvement JENKINS-58018
You are right - total method time may be misleading, I don't know exactly what NetBeans profiler means here. BUT - I've run another profiling session: selected only 1 class: hudson.remoting.FlightRecorderInputStream - following my guess.

The screenshot is from single run of sample project mentioned above, that just gets repo and runs Makefile, which sleeps for 60 seconds. Thus - producing no output. And "input stream"-related method of remoting was called 25k times during this. In lower pane of the screenshot you have "Hot spots": here java.io.BufferedInputStream.getBufIfOpen was called over 100k times!  For me, this shows clearly, that high CPU shown in previous screenshot correspond to actual method execution time, not to the time spent on blocked input. As I said, I barely know Java, but I know async I/O - this is not how "asynchronous I/O" behaves.   Or "synchronous" - if you wait for blocking "read()" to complete.

Answering your questions: I don't think this behavior has anything to do with "Makefile" - when I run this project manually in shell, it barely takes any CPU. Same, when I run it from another program (Emacs), where reading process' output is involved.

macmac@xdsnet.pl (JIRA)

unread,
Jun 26, 2019, 5:00:06 AM6/26/19
to jenkinsc...@googlegroups.com

It looks like I was completely wrong regarding the source of this CPU load. I noticed that sshd on slave also gives noticeable CPU load during build process.
I've made tcpdump session and found out that there is constant date exchange between jenkins master and slave. I counted TCP data frames during "sleep 60" execution: there were about 37 MB sent from slave to master and about 18 MB from master to slave. This gives me back when I started, trying to set retainOrigin agent option, but it does not help. Maybe I'm setting it wrong? This is my agent process:

/usr/lib/jvm/java-1.11.0-openjdk-amd64/bin/java -Dhudson.remoting.remoteinvocationhandler.Unexporter.retainOrigin=false -jar remoting.jar -workDir /home/jenkins

Is retainOrigin passed correctly? Or maybe there are another options that can be used to limit sending (probably some error) information between agent and slave? Or some logging options to see what actually happens there? All I can see in remoting.log are messages like that:

 

INFO: rate(1min) = 333.0±32.1/sec; rate(5min) = 224.4±167.3/sec; rate(15min) = 102.7±160.2/sec; rate(total) = 0.9±18.0/sec; N = 24,002
Jun 25, 2019 7:12:26 PM hudson.remoting.RemoteInvocationHandler$Unexporter reportStats
INFO: The 1 minute average rate is 333.0±32.1/sec. At the 95% confidence level this is above 100.0/sec. If this message is repeated often in the logs then consider setting system property '
hudson.remoting.RemoteInvocationHandler.Unexporter.retainOrigin' to 'false' to trade debug diagnostics for reduced memory pressure.

 

Any hints wellcome

 

 

 

macmac@xdsnet.pl (JIRA)

unread,
Jun 26, 2019, 5:05:03 AM6/26/19
to jenkinsc...@googlegroups.com
Maciej Maczynski edited a comment on Improvement JENKINS-58018
It looks like I was completely wrong regarding the source of this CPU load. I noticed that sshd on slave also gives noticeable CPU load during build process.
I've made tcpdump session and found out that there is constant date exchange between jenkins master and slave. I counted TCP data frames during "sleep 60" execution: there were about 37 MB sent from slave to master and about 18 MB from master to slave. This gives me back when I started, trying to set retainOrigin agent option, but it does not help. Maybe I'm setting it wrong? This is my agent process:

{{/usr/lib/jvm/java-1.11.0-openjdk-amd64/bin/java -Dhudson.remoting. remoteinvocationhandler RemoteInvocationHandler .Unexporter.retainOrigin=false -jar remoting.jar -workDir /home/jenkins}}


Is retainOrigin passed correctly? Or maybe there are another options that can be used to limit sending (probably some error) information between agent and slave? Or some logging options to see what actually happens there? All I can see in remoting.log are messages like that:

 

{{INFO: rate(1min) = 333.0±32.1/sec; rate(5min) = 224.4±167.3/sec; rate(15min) = 102.7±160.2/sec; rate(total) = 0.9±18.0/sec; N = 24,002}}
{{Jun 25, 2019 7:12:26 PM hudson.remoting.RemoteInvocationHandler$Unexporter reportStats}}
{{INFO: The 1 minute average rate is 333.0±32.1/sec. At the 95% confidence level this is above 100.0/sec. If this message is repeated often in the logs then consider setting system property '}}
{{hudson.remoting.RemoteInvocationHandler.Unexporter.retainOrigin' to 'false' to trade debug diagnostics for reduced memory pressure.}}

 

Any hints wellcome

 

 

 

macmac@xdsnet.pl (JIRA)

unread,
Jun 26, 2019, 5:12:27 AM6/26/19
to jenkinsc...@googlegroups.com

One more thing: this happens also with OpenJDK 11, which is default at Ubuntu 18.04 (apt-get install default-jdk).

jthompson@cloudbees.com (JIRA)

unread,
Jun 27, 2019, 6:13:02 PM6/27/19
to jenkinsc...@googlegroups.com

Hmmm ... 

So, it's not really a CPU load problem. sshd also shows heavy CPU load. That's good to know.

That's a lot of data transfer. I didn't have anything like that with my jobs trying to duplicate the problem. My console output shows practically nothing. Do you get a large amount of stuff in your console output?

A lot of times things that appear to be Remoting problems turn out to be caused by plugins. Sometimes it involves weird interactions between plugins. That could be what's going on here. Unfortunately, it can be complicated to isolate the problem and track it down.

I don't have experience with the retainOrigin setting. Reading the code, it appears to reduce some of the resource consumption. However, if you're getting to the point where that makes a difference, there is probably something going on that is still keeping the consumption too high. Sometimes it's high server load. Network issues. Agent load. VM host issues or load.

These links have some related info: JENKINS-49736, JENKINS-48810, and https://groups.google.com/forum/#!topic/jenkinsci-dev/zAkbD_NmnrM . Most of these point to resource issues.

These can be really challenging to track down.

macmac@xdsnet.pl (JIRA)

unread,
Jun 28, 2019, 3:18:02 AM6/28/19
to jenkinsc...@googlegroups.com

There is no console output from build-process at all.  Checking this option is the reason why I ended-up in "sleep project" - it just runs "sleep" command and produces no output. So build process output is not the reason of these tons of data transferred.

Issues you've quoted seems to be related rather to network connectivity - not the case here. My connection between node and server is rock-solid. Just very busy sending nobody-knows-what.

VM load, agent load, server load - also not the case here. My server is happily serving several nodes on various OSes. This surely not happens for Windows slaves. And rather not for older linux slaves (I'm not 100%, must verify it). But try any Ubuntu 18.04 LTS - and you got it.

These tons of traffic between master and slave must be Jenkins-originated. Only remoting java process can send to server TCP data I've tcpdump-ed.  And as child process is silent - it must be remoting that generates it. I tried all those options for reporting, stats - it has no effect (unless I specify it incorrectly).

What I really need now is to see what data are transferred so vigorously between node and server. Is there any low-level logging option that lets you see the traffic between node and server? Sniffing the network is useless as data stream is encrypted.

 

gunter@grodotzki.co.za (JIRA)

unread,
Jul 26, 2019, 11:27:03 AM7/26/19
to jenkinsc...@googlegroups.com
Reply all
Reply to author
Forward
0 new messages