Jenkins jvm spike (abrupt and rapid) leads to unresponsive/molasses-slow Jenkins

Richard Geiger

Jul 13, 2011, 6:33:14 PM
to Jenkins Users
We recently migrated our Jenkins server from running on Solaris (a VM)
to Ubuntu (also a VM). The VMs are similarly configured.

After the move to Ubuntu, commonly within a couple of hours (but we've
also seen it survive for most of a week), the memory usage of the
Jenkins master jvm goes vertical, and top indicates that the CPU goes
from perhaps a few tens of a percent to 100%+ (typically
something like 104%), and stays there until we intervene (typically
by killing and restarting Jenkins).

Pictures being helpful sometimes, I've posted a plot of the RSS size
for the jvm process, which includes three instances of the behavior we
see. (So, the graph is depicting multiple different jvm/jenkins
instances over time):

http://www.purestorage.foxcove.com/jenkins-mem-woes.jpg

The three spikes are pretty clear to see. The first, on the far left,
happened just after we enabled the logging that collected the graph
data. We killed/restarted Jenkins at the end of the time it was
spiking. We then see what we consider the "normal" operation of the
server between 100 and 5700 minutes. All this time, the process runs
with a mean RSS of something like 600m.
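
(For anyone wanting to collect the same kind of data: the logging
needn't be anything fancy. Here is a sketch of one way to do it; the
jenkins.war process match, log file name, and 60-second interval are
illustrative, not the exact script we used:)

$ PID=$(pgrep -f jenkins.war)    # any way of finding the jvm PID works
$ while true; do
      echo "$(date +%s) $(ps -o rss= -p "$PID")" >> jenkins-rss.log   # RSS in KB
      sleep 60
  done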

Then, our problem happens again near 5700 minutes... within two or
three minutes, the RSS for the jvm climbs to around 2.5g. The CPU also
changes gears at that point, going from a few tens of a percent to
100+% (presumably this is gc kicking in).

One other clue: In the first few minutes of the spike, the Jenkins web
UI gets slower and slower, until any operation is painfully slow. But
I have been able to peek inside the jvm with the Monitoring plugin
during the spikes, and noticed something that looked odd: a seemingly
immortal request. Unfortunately I don't have a copy of the request
information, but it was a very long-lived thread (many hours) that
would *always* show up in the Current Request list. Sometimes, it
would be the only current request, but when repeatedly reloading the
page, I'd see other requests coming and going.

Here's a little of the possibly-relevant configuration information for
the systems where we are and are not seeing the problem:

===== Old Jenkins Server host (problem never observed here):

Jenkins version: 1.410

root $ uname -a
SunOS hudson 5.11 oi_147 i86pc i386 i86pc

host memory: 4 GB

root $ java -version
java version "1.6.0_18"
Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Java HotSpot(TM) Server VM (build 16.0-b13, mixed mode)

Jenkins jvm heap size: 512m (-Xmx512m when starting java)

===== New Jenkins Server host (where the problem happens)

Jenkins version: 1.417

root $ uname -a
Linux jenkins 2.6.38-8-server #42-Ubuntu SMP Mon Apr 11 03:49:04 UTC
2011 x86_64 x86_64 x86_64 GNU/Linux

host memory: 4 GB

root $ java -version
java version "1.6.0_22"
OpenJDK Runtime Environment (IcedTea6 1.10.1) (6b22-1.10.1-0ubuntu1)
OpenJDK 64-Bit Server VM (build 20.0-b11, mixed mode)

Jenkins jvm heap size: 2048m (-Xmx2048m when starting java; this is an
increased value - we bumped it in an early attempt at troubleshooting
the problem).
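
(For reference, the flag is just passed on the java command line; with
the stock Debian/Ubuntu package that means JAVA_ARGS in
/etc/default/jenkins. The GC-logging flags below are ones we could add
next to confirm or refute the "gc kicking in" theory; the gc.log path
is illustrative:)

JAVA_ARGS="-Xmx2048m -verbose:gc -Xloggc:/var/log/jenkins/gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"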

Obviously, there are many things we could change in further
experiments, but given that it can take several days for the problem
to crop up, such experimentation is very slow. I'm hoping that the
description here might be enough for somebody to recognize a known
problem or gotcha.

evernat

Jul 15, 2011, 7:18:25 PM
to Jenkins Users, eve...@free.fr
Hi,

Next time, I suggest taking the stack trace of the thread handling the
long request.
The stack trace is displayed in a tooltip when you hover over the
thread name in the "Current requests" list of the monitoring page (on
1.6+).
You can then take a screenshot of the stack trace, or select it with
the mouse and the up/down arrow keys.
Otherwise you can probably use the JDK's jstack tool to get all the
stack traces and then find the problematic one.
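
A minimal sketch of the jstack route (the pgrep pattern is only one
example of finding the Jenkins PID; run it as the user that owns the
Jenkins process):

$ PID=$(pgrep -f jenkins.war)
$ jstack -l "$PID" > /tmp/jenkins-threads.$(date +%H%M%S).txt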

Given that the memory grows very fast, I think a particular HTTP
request is the cause of the issue.

bye,
Emeric