Getting OutOfMemoryError (OOM) constantly after updating to 1.542.


Tim Drury

Dec 6, 2013, 8:33:07 AM
to jenkins...@googlegroups.com
We updated Jenkins to 1.542 two days ago (from 1.514) and we're now getting a lot of OOM errors. (Environment: Windows Server 2008 R2, Jenkins running on 64-bit JDK 1.6.0_26.)

At first I did the simplest thing and increased the heap from 3 GB to 4.2 GB (and bumped up PermGen).  This didn't help, so I started looking at threads via the Jenkins monitoring tool, which indicated the disk-usage plugin was hung.  Viewing the page for a particularly large job would hang, and the stack trace pointed at the disk-usage plugin (or so I thought).  Jira report with thread dump here: https://issues.jenkins-ci.org/browse/JENKINS-20876
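A quick sanity check in this kind of situation: confirm from the Jenkins script console that the bigger heap actually reached the master JVM (on a Windows service install, the wrapper config and the running JVM can disagree). A minimal sketch using only standard JDK calls:

import java.lang.management.ManagementFactory

// Print the Java version, the effective max heap, and the -X... flags the
// master JVM was actually started with.
println "java.version = ${System.getProperty('java.version')}"
println "max heap     = ${(Runtime.runtime.maxMemory() / (1024 * 1024)) as long} MB"
ManagementFactory.runtimeMXBean.inputArguments
        .findAll { it.startsWith('-X') }
        .each { println it }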

We disabled the disk-usage plugin and restarted, and now we can visit that job page.  However, we still get OOM errors and lots of GCs in the logs at least once a day, and the stack trace looks frighteningly similar to the one from the disk-usage plugin.  Here is an edited stack trace showing the methods common to the two OOM incidents, one while the disk-usage plugin was enabled and one after it was disabled:

[lots of xstream methods snipped]
hudson.XmlFile.unmarshal(XmlFile.java:165)
hudson.model.Run.reload(Run.java:323)
hudson.model.Run.<init>(Run.java:312)
hudson.model.AbstractBuild.<init>(AbstractBuild.java:185)
hudson.maven.AbstractMavenBuild.<init>(AbstractMavenBuild.java:54)
hudson.maven.MavenModuleSetBuild.<init>(MavenModuleSetBuild.java:146)
... [JVM methods snipped]
hudson.model.AbstractProject.loadBuild(AbstractProject.java:1155)
hudson.model.AbstractProject$1.create(AbstractProject.java:342)
hudson.model.AbstractProject$1.create(AbstractProject.java:340)
hudson.model.RunMap.retrieve(RunMap.java:225)
hudson.model.RunMap.retrieve(RunMap.java:59)
jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:677)
jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:660)
jenkins.model.lazy.AbstractLazyLoadRunMap.search(AbstractLazyLoadRunMap.java:502)
jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:536)
hudson.model.AbstractProject.getBuildByNumber(AbstractProject.java:1077)
hudson.maven.MavenBuild.getParentBuild(MavenBuild.java:165)
hudson.maven.MavenBuild.getWhyKeepLog(MavenBuild.java:273)
hudson.model.Run.isKeepLog(Run.java:572)
...

It seems something in "core" Jenkins has changed, and not for the better.  Is anyone else seeing these issues?
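The lazy-loading frames in that trace (AbstractLazyLoadRunMap, RunMap.retrieve) suggest the cost scales with how many build records a job has on disk. A rough script-console sketch to list the biggest jobs by build count; it only counts directories under builds/, so on older layouts with numbered symlinks the figures may be inflated:

import jenkins.model.Jenkins
import hudson.model.Job

// List the ten jobs with the most build records on disk; each record is a
// candidate for lazy loading when something walks the build history.
Jenkins.instance.getAllItems(Job).collect { job ->
    def buildsDir = new File(job.rootDir, 'builds')
    def count = buildsDir.isDirectory() ? buildsDir.listFiles().count { it.isDirectory() } : 0
    [count, job.fullName]
}.sort { -it[0] }.take(10).each { println "${it[0]}\t${it[1]}" }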

-tim

Nigel Magnay

Dec 9, 2013, 6:54:35 AM
to jenkins...@googlegroups.com
Yes, we're seeing the same.

Our jenkins install is currently unusable :-(




Stephen Connolly

Dec 9, 2013, 7:45:28 AM
to jenkins...@googlegroups.com
How does the current LTS (1.532.1) hold up?



Tim Drury

Dec 9, 2013, 9:41:25 AM
to jenkins...@googlegroups.com
I intended to install 1.532 on Friday, but mistakenly installed 1.539.  It gave us the same OOM exceptions.  I'm installing 1.532 now and will - hopefully - know tomorrow whether it's stable or not.  I'm not exactly sure what's going to happen with our plugins though.  Hopefully Jenkins will tell me if they must be downgraded too.

-tim

Nigel Magnay

Dec 9, 2013, 10:22:57 AM
to jenkins...@googlegroups.com
For us, I think we actually have several things going on:

- Something is taking far more memory. Bumping up -Xmx significantly seemed to help, though...
  - this then caused the host (SmartOS / Solaris) to run out of swap because of fork(), so I had to significantly raise tmpfs too.

- We habitually run under screen so people can log in and re-attach to the session to check what's going on. That seems b0rked (god alone knows why, it's been fine for years), as multiple threads lock up waiting on this monitor:

"Computer.threadPoolForRemoting [#51] : IO ID=7610 : seq#=7609" daemon prio=3 tid=0x00000000037bb800 nid=0x242 runnable [0xfffffd7fec2fd000]
   java.lang.Thread.State: RUNNABLE
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:318)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
        - locked <0x00000000a2692488> (a java.io.BufferedOutputStream)
        at java.io.PrintStream.write(PrintStream.java:480)
        - locked <0x00000000a2692468> (a java.io.PrintStream)
        at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
        at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
        at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
        at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
        - locked <0x00000000a26a1cf8> (a java.io.OutputStreamWriter)
        at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
        at java.util.logging.StreamHandler.flush(StreamHandler.java:242)
        - locked <0x00000000a26a1c88> (a java.util.logging.ConsoleHandler) <-- all web UI hangs waiting on this.
        at java.util.logging.ConsoleHandler.publish(ConsoleHandler.java:106)
        at java.util.logging.Logger.log(Logger.java:565)
        at java.util.logging.Logger.doLog(Logger.java:586)
        at java.util.logging.Logger.log(Logger.java:675)
        at hudson.remoting.ProxyOutputStream$Chunk$1.run(ProxyOutputStream.java:271)
        at hudson.remoting.PipeWriter$1.run(PipeWriter.java:158)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:111)
        at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)

Not running under screen seemed to solve that issue.
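If running under screen is otherwise useful, one possible workaround (a sketch, assuming your file-based log handlers capture everything you need) is to detach the java.util.logging ConsoleHandler from the root logger via the script console, so log flushes can no longer block on a stalled terminal:

import java.util.logging.ConsoleHandler
import java.util.logging.Logger

// Drop the ConsoleHandler from the root logger; remoting's logging then no
// longer has to flush to stdout at all.
def root = Logger.getLogger('')
root.handlers.findAll { it instanceof ConsoleHandler }.each { handler ->
    root.removeHandler(handler)
    println "removed ${handler}"
}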

Sadly I can't seem to persuade JMX to work properly, so it's a bit of a black box at the moment, though I have a couple of heap dumps I'll walk through when I have the time.

Tim Drury

Dec 9, 2013, 11:32:21 AM
to jenkins...@googlegroups.com
I'm doing a heap-dump analysis now and I think I might know what the issue is.  The start of this whole problem was the disk-usage plugin hanging our attempts to view a job in Jenkins (see https://issues.jenkins-ci.org/browse/JENKINS-20876), so we disabled that plugin.  After disabling it, Jenkins complained about data in an older/unreadable format:

You have data stored in an older format and/or unreadable data.

If I click the "Manage" button to delete it, it takes a _long_ time for it to display all the disk-usage plugin data - there must be thousands of rows, but it does display it all eventually.  The error shown in each row is:

CannotResolveClassException: hudson.plugins.disk_usage.BuildDiskUsageAction

If I click "Discard Unreadable Data" at the bottom of the page, I quickly get a stack trace:

javax.servlet.ServletException: java.util.ConcurrentModificationException
at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:735)
at org.kohsuke.stapler.Stapler.invoke(Stapler.java:799)
at org.kohsuke.stapler.MetaClass$6.doDispatch(MetaClass.java:239)
at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:53)
at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:685)
at org.kohsuke.stapler.Stapler.invoke(Stapler.java:799)
at org.kohsuke.stapler.Stapler.invoke(Stapler.java:587)
at org.kohsuke.stapler.Stapler.service(Stapler.java:218)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:45)
at winstone.ServletConfiguration.execute(ServletConfiguration.java:248)
at winstone.RequestDispatcher.forward(RequestDispatcher.java:333)
at winstone.RequestDispatcher.doFilter(RequestDispatcher.java:376)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:96)
at net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:203)
at net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:181)
at net.bull.javamelody.PluginMonitoringFilter.doFilter(PluginMonitoringFilter.java:86)

and it fails to discard the data.  Older data isn't usually a problem, so I brushed this error off.  However, here is the dominator tree of the heap dump:

Class Name                                                                                                              | Shallow Heap | Retained Heap | Percentage
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
hudson.diagnosis.OldDataMonitor @ 0x6f9f2c4a0                                                                           |           24 | 3,278,466,984 |     88.69%
com.thoughtworks.xstream.converters.SingleValueConverterWrapper @ 0x6f9da8780                                           |           16 |    13,825,616 |      0.37%
hudson.model.Hudson @ 0x6f9b8b8e8                                                                                       |          272 |     3,572,400 |      0.10%
org.eclipse.jetty.webapp.WebAppClassLoader @ 0x6f9a73598                                                                |           88 |     2,308,760 |      0.06%
org.apache.commons.jexl.util.introspection.Introspector @ 0x6fbb74710                                                   |           32 |     1,842,392 |      0.05%
org.kohsuke.stapler.WebApp @ 0x6f9c0ff10                                                                                |           64 |     1,127,480 |      0.03%
java.lang.Thread @ 0x7d5c2d138  Handling GET /view/Alle/job/common-translation-main/ : RequestHandlerThread[#105] Thread|          112 |       971,336 |      0.03%
--------------------------------------------------------------------------------------------------------------------------------------------------------------------

What is hudson.diagnosis.OldDataMonitor?  Could the disk-usage plugin data be the cause of all my recent OOM errors?  If so, how do I get rid of it?

-tim

Robert CHAVANNE

Dec 9, 2013, 12:13:06 PM
to jenkins...@googlegroups.com
Tim Drury <tdrury <at> gmail.com> writes:

> [earlier message quoted in full; snipped]

Hello,

I don't know if this helps, but I had an issue with both the disk-usage plugin and the Jenkins Job Configuration History plugin. The Job Configuration History plugin was badly configured and kept history of many things, for example every time a configuration page was visited. A recent update of the disk-usage plugin seems to have amplified that effect; I didn't fully understand why (perhaps the usage for each build and each configuration was being logged).

I have multi-configuration projects with thousands of configurations, so after some time Jenkins got slower and we finally ran out of inodes (I am working on Linux). To fix it, I had to clean out config-history and change the Job Config History settings in the Configure System tab to disable some of the logging.

Regards,
Robert



Tim Drury

Dec 9, 2013, 12:58:27 PM
to jenkins...@googlegroups.com
This Jira issue seems to capture my problem: https://issues.jenkins-ci.org/browse/JENKINS-19544

Nigel Magnay

Dec 9, 2013, 3:09:48 PM
to jenkins...@googlegroups.com
Interesting - we use the disk-usage plugin too, though we had disabled it (though not removed it).


Nigel Magnay

Dec 11, 2013, 12:55:55 PM
to jenkins...@googlegroups.com
I've just cracked open MAT on an OOM dump from our machine, and I can confirm that OldDataMonitor looks like the culprit here too (750 MB of retained heap).

There are over a million entries in its hashmap...
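For anyone who wants that count without opening a heap dump in MAT, a script-console sketch along these lines should print it, assuming getData() is the accessor behind the Manage Old Data page:

import hudson.ExtensionList
import hudson.diagnosis.OldDataMonitor

// Assumption: OldDataMonitor.getData() returns the Saveable -> version-range
// map that the Manage Old Data page renders.
def odm = ExtensionList.lookup(OldDataMonitor).get(0)
println "OldDataMonitor is tracking ${odm.data.size()} objects"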


Nigel Magnay

Dec 12, 2013, 5:26:42 AM
to jenkins...@googlegroups.com
So this is what is happening for us:

- The disk-usage plugin was showing the problems described at the beginning of the thread, so we disabled it.
- Now every build we run, and every sub-project, fills up the 'Old data' log with hundreds of

CannotResolveClassException: hudson.plugins.disk_usage.BuildDiskUsageAction

entries, even though that plugin is not used in that build and does not exist any more.

After a modest number of builds (say, half a day's worth or so), Jenkins bombs with OOM as this log fills up with millions of entries, and it's game over.

Is there a way to disable this functionality? I can't see the utility of it, and it's making the system totally unusable.

Tim Drury

Dec 12, 2013, 2:05:46 PM
to jenkins...@googlegroups.com
I had some success getting the Manage Old Data screen to work.  Most of the time it throws a ConcurrentModificationException, but occasionally it lists a few hundred records with the "Discard Old Data" button.  I press the button and, again, sometimes it "works" and sometimes throws a CME, but in either case it does seem to delete some of the old data.  I repeated this process about every hour a couple of days ago and managed to delete enough old data that Jenkins has kept running for more than a day.  The best chance of this working is when no build jobs are running.
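A script-console equivalent of that repeated discarding, sketched under the same assumption that OldDataMonitor exposes its map via getData(): copy the set of tracked items (to sidestep the ConcurrentModificationException) and re-save each one, which rewrites its XML in the current format and should drop it from the monitor.

import hudson.ExtensionList
import hudson.diagnosis.OldDataMonitor

def odm = ExtensionList.lookup(OldDataMonitor).get(0)
// Snapshot first so running builds adding new entries can't trip a CME mid-loop.
def items = new ArrayList(odm.data.keySet())
items.each { saveable ->
    try {
        saveable.save()   // rewrites the XML without the unreadable elements
    } catch (e) {
        println "could not re-save ${saveable}: ${e}"
    }
}
println "re-saved ${items.size()} items"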

The alternative is to manually delete the disk-usage XML elements from the build.xml files in each job's builds directories.  I did this for about 200 files before I got tired of it.  A Groovy script could probably be written to do this.
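Something along those lines might look like the sketch below. Assumptions to check before running it: the JENKINS_HOME path (the one below is hypothetical), the standard jobs/<job>/builds/<id>/build.xml layout, and the tag name (XStream normally doubles the underscore in the disk_usage package name, so the tag should be hudson.plugins.disk__usage.BuildDiskUsageAction; verify against one of your build.xml files). Try it on a copy of one job directory first.

import groovy.io.FileType
import java.util.regex.Pattern

// Strip BuildDiskUsageAction elements from every build.xml under JENKINS_HOME/jobs.
// Point jenkinsHome at your real JENKINS_HOME and stop Jenkins (or at least quiet
// the builds) before rewriting files it owns.
def jenkinsHome = new File('D:/Jenkins')
def tag = Pattern.quote('hudson.plugins.disk__usage.BuildDiskUsageAction')

new File(jenkinsHome, 'jobs').eachFileRecurse(FileType.FILES) { f ->
    if (f.name != 'build.xml') return
    def original = f.text
    def cleaned = original
            .replaceAll("(?s)<${tag}>.*?</${tag}>", '')  // element with children
            .replaceAll("<${tag}\\s*/>", '')             // self-closing form
    if (cleaned != original) {
        new File(f.parentFile, 'build.xml.bak').text = original  // keep a backup
        f.text = cleaned
        println "cleaned ${f}"
    }
}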

-tim