[JIRA] (JENKINS-50223) Pipeline Jenkins master ran out of Java Metaspace


mkozell@magento.com (JIRA)

Mar 16, 2018, 2:04:03 PM
to jenkinsc...@googlegroups.com
Mike Kozell created an issue
 
Jenkins / Bug JENKINS-50223
Pipeline Jenkins master ran out of Java Metaspace
Issue Type: Bug
Assignee: Unassigned
Components: pipeline
Created: 2018-03-16 18:03
Priority: Major
Reporter: Mike Kozell

Setup

Jenkins v2.89.3 / v2.89.4 LTS
Pipeline API Plugin: 2.26
Pipeline Nodes and Processes Plugin: 2.19
Durable Task Plugin: 1.18
Pipeline Job Plugin: 2.17
Pipeline Shared Groovy Libraries Plugin: 2.9
Pipeline Supporting APIs Plugin: 2.18
Pipeline Groovy Plugin: 2.45
Script Security Plugin: 1.41

Pipeline Default Speed/Durability Level: Performance-Optimized
"Do not allow the pipeline to resume if the master restarts": Enabled on all jobs

Problem

Shortly after upgrading from Jenkins 2.60.3 to Jenkins 2.89.3 LTS, we experienced a java.lang.OutOfMemoryError: Metaspace error. I'm unsure if this is a bug or if increased Java Metaspace usage is to be expected in the newer version of Jenkins and the pipeline plugins.

Details

Jenkins v2.60.3

In Jenkins 2.60.3 we used the memory parameters below and the Java nonheap space (Metaspace) stayed between 350mb and 550mb. We ran for months without restarting Jenkins.

-XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication -XX:MaxMetaspaceSize=1g -XX:MetaspaceSize=256M -Xms4g -Xmx8g -Dgroovy.use.classvalue=true

Jenkins v2.89.3

After upgrading to Jenkins 2.89.3 and upgrading all of the plugins, we used the Java memory settings below. We removed the "groovy.use.classvalue=true" parameter based on the recommendation here: https://support.cloudbees.com/hc/en-us/articles/115001505172-Jenkins-hangs-due-to-Groovy-ClassInfo-deadlock. We saw the Java nonheap memory fluctuate between 250mb and 750mb, slowly trending upward. After a few weeks, nonheap memory was at 1GB and Jenkins no longer responded to requests. The log file was full of java.lang.OutOfMemoryError: Metaspace errors.

-XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication -XX:MaxMetaspaceSize=1g -XX:MetaspaceSize=256M -Xms4g -Xmx8g

Debugging

Our Jenkins masters run a large number of pipeline jobs, and I noticed that all of our Jenkins 2.89.4 masters show higher Java nonheap memory usage than before the Jenkins upgrade. We disabled pipeline durability but still see high memory usage. Standard Java heap memory usage looks fine.

Example 1:

Jenkins Master 1 was completely idle and I performed a System.gc() to force a full garbage collection. Afterwards I saw that the nonheap memory was still over 700mb. Below are some of the Java details I collected on that installation just after the full GC.
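(For reference, similar numbers can also be read from the Jenkins script console using the standard JMX beans. The snippet below is only a sketch of that approach on a HotSpot JVM, not the exact commands used to collect the output that follows.)

// Script-console sketch: force a collection, then print the class-related memory pools.
import java.lang.management.ManagementFactory

System.gc()  // note: with -XX:+ExplicitGCInvokesConcurrent this triggers a concurrent cycle, not a full stop-the-world collection

def pools = ManagementFactory.getMemoryPoolMXBeans().findAll { it.name.contains('Metaspace') || it.name.contains('Compressed Class Space') }
pools.each { pool ->
    println String.format('%s: %.1f MB used of %.1f MB committed', pool.name, pool.usage.used / (1024 * 1024), pool.usage.committed / (1024 * 1024))
}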

VM.native_memory summary

Native Memory Tracking:

Total: reserved=11014208KB, committed=5551020KB
Java Heap (reserved=8388608KB, committed=4194304KB)
(mmap: reserved=8388608KB, committed=4194304KB)

Class (reserved=1686933KB, committed=707989KB)
(classes #48326)
(malloc=13717KB #208203)
(mmap: reserved=1673216KB, committed=694272KB)

Thread (reserved=93951KB, committed=93951KB)
(thread #92)
(stack: reserved=93548KB, committed=93548KB)
(malloc=296KB #463)
(arena=107KB #182)

Code (reserved=273528KB, committed=155620KB)
(malloc=23928KB #24978)
(mmap: reserved=249600KB, committed=131692KB)

GC (reserved=412351KB, committed=256703KB)
(malloc=68287KB #693569)
(mmap: reserved=344064KB, committed=188416KB)

Compiler (reserved=406KB, committed=406KB)
(malloc=276KB #2009)
(arena=131KB #3)

Internal (reserved=88791KB, committed=88791KB)
(malloc=88759KB #184270)
(mmap: reserved=32KB, committed=32KB)

Symbol (reserved=30516KB, committed=30516KB)
(malloc=27279KB #301740)
(arena=3236KB #1)

Native Memory Tracking (reserved=22549KB, committed=22549KB)
(malloc=348KB #5361)
(tracking overhead=22201KB)

Arena Chunk (reserved=190KB, committed=190KB)
(malloc=190KB)

Unknown (reserved=16384KB, committed=0KB)
(mmap: reserved=16384KB, committed=0KB)

GC.class_histogram

num #instances #bytes class name
----------------------------------------------
1: 490462 84832616 [C
2: 2552835 40845360 com.cloudbees.groovy.cps.impl.ConstantBlock
3: 930699 37227960 com.cloudbees.groovy.cps.impl.FunctionCallBlock
4: 1493734 35849616 com.cloudbees.groovy.cps.impl.SourceLocation
5: 883507 33258176 [Ljava.lang.Object;
6: 179552 30097544 [B
7: 922229 29511328 java.util.HashMap$Node
8: 1151159 27386104 [Lcom.cloudbees.groovy.cps.Block;
9: 947492 22739808 java.lang.String
10: 790957 18982968 com.cloudbees.groovy.cps.impl.LocalVariableBlock
11: 213822 13097984 [Ljava.util.HashMap$Node;
12: 519301 12463224 com.cloudbees.groovy.cps.impl.SequenceBlock
13: 452808 10867392 java.util.ArrayList
14: 320616 10259712 com.cloudbees.groovy.cps.impl.PropertyAccessBlock
15: 250810 10032400 com.google.common.cache.LocalCache$WeakEntry
16: 168578 9440368 org.codehaus.groovy.runtime.metaclass.MetaMethodIndex$Entry
17: 260734 8343488 java.util.concurrent.locks.ReentrantLock$NonfairSync
18: 250147 7394208 [Lhudson.model.Action;
19: 142590 6844320 java.util.HashMap
20: 139363 6689424 org.jenkinsci.plugins.workflow.cps.nodes.StepAtomNode
21: 264178 6340272 com.google.common.collect.SingletonImmutableList
22: 195095 6243040 com.cloudbees.groovy.cps.impl.AssignmentBlock
23: 49477 6237064 java.lang.Class
24: 253041 6072984 java.util.concurrent.CopyOnWriteArrayList
25: 250681 6016344 org.jenkinsci.plugins.workflow.support.storage.BulkFlowNodeStorage$Tag

GC.class_stats

We use the Pipeline Shared Groovy Libraries Plugin and noticed a large number of class files with names similar to those below. Each of our shared library classes (org.xxxxxxxx.scripts.xxxxxxx) had about 400 references.

script15202683643871464209649
sun.reflect.GeneratedMethodAccessor2903
org.xxxxxxxx.scripts.xxxxxxx

Example 2

I ran a Java heap dump through an analyzer program, and below is what it reported as the top consumers.

Class Name Objects Shallow Heap Retained Heap
org.jenkinsci.plugins.workflow.support.actions.LogActionImpl 976,720 23,441,280 >= 343,262,256
org.jenkinsci.plugins.workflow.cps.actions.ArgumentsActionImpl 735,961 17,663,064 >= 305,348,632

Example 3

I ran a different Java heap dump through the analyzer and below are the results. Again this was after a full GC.

Classes = 35k
Objects = 47M
Class Load = 4.6k
GC Roots = 4.3k

952,652 instances of "org.jenkinsci.plugins.workflow.support.actions.LogActionImpl", loaded by "hudson.ClassicPluginStrategy$AntClassLoader2 @ 0x5c1746598" occupy 342,684,616 (20.83%) bytes. These instances are referenced from one instance of "org.codehaus.groovy.util.AbstractConcurrentMapBase$Segment[]", loaded by "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x5c0000000"

717,629 instances of "org.jenkinsci.plugins.workflow.cps.actions.ArgumentsActionImpl", loaded by "hudson.ClassicPluginStrategy$AntClassLoader2 @ 0x5c17d9770" occupy 296,727,528 (18.04%) bytes. These instances are referenced from one instance of "org.codehaus.groovy.util.AbstractConcurrentMapBase$Segment[]", loaded by "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x5c0000000"

293 instances of "org.jenkinsci.plugins.workflow.cps.CpsFlowExecution", loaded by "hudson.ClassicPluginStrategy$AntClassLoader2 @ 0x5c17d9770" occupy 210,909,800 (12.82%) bytes. These instances are referenced from one instance of "org.codehaus.groovy.util.AbstractConcurrentMapBase$Segment[]", loaded by "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x5c0000000"

Package Retained Heap Retained Heap, % Top Dominators
workflow 1,340,511,168 81.50% 5,043,559

Workaround

We once again added "-Dgroovy.use.classvalue=true" to the Java startup parameters and have noticed a reduction in Java nonheap memory usage. With the parameter we see the memory increase to around 750mb, but when the server becomes idle the Java nonheap memory drops back under 300mb. Without the parameter we don't see the same reduction. One of our Jenkins masters without the parameter started on 3/2/18 at .2 GB and is currently (3/16/18) at .76 GB; the graph looks like a slow climb without much fluctuation. According to the CloudBees article, this parameter is not recommended for Jenkins 2.89, however it does seem to help lower Java nonheap memory usage. Since usage increases by over 200 mb a week, I should know shortly whether some type of garbage collection reduces it on this second Jenkins master that is nearing the 1GB max. Java nonheap memory usage only dropped by 1mb when I manually performed a System.gc().


svanoort@cloudbees.com (JIRA)

Mar 19, 2018, 8:57:02 PM
to jenkinsc...@googlegroups.com
Sam Van Oort commented on Bug JENKINS-50223
 
Re: Pipeline Jenkins master ran out of Java Metaspace

Mike Kozell Are these the same masters that are also experiencing https://issues.jenkins-ci.org/browse/JENKINS-50199? It is quite possible that the issues there are preventing the Pipeline memory cleanup routines from running, if a failure occurs before the cleanup can run.

org/jenkinsci/plugins/workflow/cps/CpsGroovyShell$CleanGroovyClassLoader is an internal ClassLoader which would be removed during the cleanup process.

Recommendation: for now, switch back to the MAX_SURVIVABILITY durability level (slowest).
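The durability level can also be pinned per Pipeline rather than relying on the global default; a minimal sketch, assuming the durabilityHint job property available in these plugin versions:

// Sketch: force a single scripted Pipeline back to the slowest/safest durability setting.
// (The global default can instead be changed under Manage Jenkins > Configure System.)
properties([durabilityHint('MAX_SURVIVABILITY')])

node {
    echo 'Flow node data is persisted synchronously again for this build.'
}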

mkozell@magento.com (JIRA)

Mar 20, 2018, 12:22:04 PM
to jenkinsc...@googlegroups.com

Yes, these are the same masters.  I will try to compare the memory usage from successful builds and build failures.  Then I will try changing the durability level.  We had another master run out of Metaspace today.

mkozell@magento.com (JIRA)

Mar 21, 2018, 12:04:02 PM
to jenkinsc...@googlegroups.com

I switched a test Jenkins master back to MAX_SURVIVABILITY and still see the memory leak when failed builds occur.  I'm going to test if the issue occurs on successful builds.

svanoort@cloudbees.com (JIRA)

Mar 21, 2018, 12:35:02 PM
to jenkinsc...@googlegroups.com

Mike Kozell Several questions, then:

1. Are you on the latest version of the Artifactory plugin, if you are using it? Older versions of its Pipeline step had a memory leak; the latest versions include fixes that prevent the leak.
2. Are you using ConfigSlurper or extending java.util.Level in your Pipelines? Both can cause memory leaks and are discouraged now.
3. Do you have a fairly simple Pipeline that will reproduce the issue?

My suspicion is that this represents an edge case perhaps linked to the failures above or a bad plugin interaction, because we have explicit tests for memory leaks in Pipeline.

Finally... if you are by any chance a CloudBees customer, please open a support case with a support bundle and heap dump, because that combination will give us the fastest route to a fix. Wearing my open-source hat rather than my CloudBees hat, there's not a good and secure way to transfer a heap dump.

Either way I consider memory leaks to be critical issues needing an urgent fix, but it's exponentially easier to do that with the additional information in a heap dump.
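If capturing the dump itself is the obstacle, one option is to trigger it from the script console; a hedged sketch using the standard HotSpot diagnostic MXBean (the output path is only an example), roughly equivalent to jmap -dump:live:

// Sketch: write a heap dump of live objects from the Jenkins script console (HotSpot JVMs).
import java.lang.management.ManagementFactory
import com.sun.management.HotSpotDiagnosticMXBean

def server = ManagementFactory.getPlatformMBeanServer()
def hotspot = ManagementFactory.newPlatformMXBeanProxy(server, 'com.sun.management:type=HotSpotDiagnostic', HotSpotDiagnosticMXBean)
// 'true' dumps live objects only, which forces a full GC first; the call fails if the target file already exists
hotspot.dumpHeap('/tmp/jenkins-metaspace-leak.hprof', true)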

mkozell@magento.com (JIRA)

Mar 23, 2018, 6:26:03 PM
to jenkinsc...@googlegroups.com

We are able to reproduce the memory leak with two examples that are not specific to our code.  In both examples, and in our potential workaround, the Jenkins jobs were all successful and the output in the console log was the same.  The classes remained in the GC.class_stats query when using the examples and didn't when using the workaround.

Abstract Animal class

Concrete Lion class extends Animal

Example 1

We create a custom pipeline step and pass in an instance of the extended abstract class.  The issue occurs when we use "runIdentify(AbstractObject)".

runZoo.groovy

def call() {
    node {
        def myLion = new Lion()
        myLion.noise = "Roar"
        myLion.sound(steps, myLion.noise)
        runIdentify(myLion)
    }
}

runIdentify.groovy

def call(Animal animal) {
    if (animal.noise.toLowerCase().contains('roar')) {
        println('This must be a Lion')
    } else {
        println('NOT a Lion')
    }
}

 

Example 2

We instantiate an object of the extended abstract class and pass it into a constructor.  The issue occurs when we use "new Identity(AbstractObject)" to create the new object.

runZoo.groovy

def call() {
    node {
        def myLion = new Lion()
        myLion.noise = "Roar"
        myLion.sound(steps,myLion.noise)
        new Identity(steps,myLion)
    }
}
Identity.groovy

import org.jenkinsci.plugins.workflow.cps.DSL

class Identity implements Serializable {
    Identity(DSL steps, Animal animal) {
        if (animal.noise.toLowerCase().contains('roar')) {
            steps.echo('This must be a Lion')
        } else {
            steps.echo('NOT a Lion')
        }
    }
}

Workaround

We were able to work around the issue by adding a new method to the abstract or concrete class.

runZoo.groovy

def call() {
    node {
        def myLion = new Lion()
        myLion.noise = "Roar"
        myLion.sound(steps,myLion.noise)
        myLion.identifyAnimal(steps)
    }
}
New method added to the abstract (Animal) or concrete (Lion) class:

    def identifyAnimal(DSL steps) {
        if (this.noise.toLowerCase().contains('roar')) {
            steps.echo('This must be a Lion')
        }
    }

 

I'm not sure if the memory leak is a bug or if I'm doing something incorrect with pipeline in the above examples. The output on the screen for all three scenarios is the same and the builds are successful.

james@howeswho.co.uk (JIRA)

Oct 26, 2018, 7:39:02 AM
to jenkinsc...@googlegroups.com

Just to note:

I performed a System.gc() to force a full garbage collection

If you have -XX:+ExplicitGCInvokesConcurrent then it doesn't do a "full" collection.


jglick@cloudbees.com (JIRA)

May 23, 2019, 5:31:04 PM
to jenkinsc...@googlegroups.com
Jesse Glick updated an issue
Change By: Jesse Glick
Component/s: pipeline → workflow-cps-plugin

jglick@cloudbees.com (JIRA)

May 23, 2019, 5:31:05 PM
to jenkinsc...@googlegroups.com
Jesse Glick started work on Bug JENKINS-50223
 
Change By: Jesse Glick
Status: Open → In Progress

jglick@cloudbees.com (JIRA)

May 23, 2019, 6:16:02 PM
to jenkinsc...@googlegroups.com
Jesse Glick commented on Bug JENKINS-50223
 
Re: Pipeline Jenkins master ran out of Java Metaspace

I was able to reproduce a similar leak based on these hints. The use of the node step, as well as a library, appears critical: the leak goes through ExecutorStepExecution.PlaceholderTask.context, but that is not really at fault: StepContext is advertised as Serializable, and a CpsStepContext.outcome will hold a reference to the return value of the step, even after the build finishes. (PlaceholderTask can stick around for a while, for example in Queue.leftItems.) With that fixed, a basic test passes, but in a real Jenkins instance the library classes (and their loaders) are nonetheless retained, though with no apparent root references, so I may be missing a secondary issue.

jglick@cloudbees.com (JIRA)

May 23, 2019, 8:16:03 PM
to jenkinsc...@googlegroups.com

In a live Jenkins instance, any use of a type (from src), but not a global variable (from vars), triggers a mystery leak for me. The MemoryAssert test does not reproduce this, since there are no apparent root references. Perhaps that is what the comment of 2018-04-02 alludes to. Running with -XX:MetaspaceSize=1m -verbose:gc -verbose:class seems to confirm that while WorkflowScript, globalVar, and sun.reflect.GeneratedSerializationConstructorAccessor1234 get unloaded pretty quickly, pkg.NamedType and pkg.NamedType$method never do. So far I am at a loss to explain why.

jglick@cloudbees.com (JIRA)

May 23, 2019, 8:36:03 PM
to jenkinsc...@googlegroups.com

Correction: the library classes do eventually get unloaded; it just takes a while. I suspect there is a SoftReference to blame somewhere, but I have yet to find where.

jglick@cloudbees.com (JIRA)

May 23, 2019, 9:47:02 PM
to jenkinsc...@googlegroups.com

Finally tracked it down to ClassHelperCache, and managed to fully clear all classes immediately upon the end of the build.

jglick@cloudbees.com (JIRA)

May 23, 2019, 9:53:03 PM
to jenkinsc...@googlegroups.com

For future reference, my monitoring script for the live version:

while :; do jmap -histo:live $(jps -lm | fgrep war | cut -f1 -d' ') | fgrep GroovyClassLoader > /tmp/x; clear; cat /tmp/x; sleep 5; done

dnusbaum@cloudbees.com (JIRA)

May 28, 2019, 4:48:04 PM
to jenkinsc...@googlegroups.com
Devin Nusbaum updated Bug JENKINS-50223
 

A fix for this issue was just released in version 2.69 of Pipeline Groovy Plugin.

Change By: Devin Nusbaum
Status: In Review → Resolved
Resolution: Fixed
Released As: workflow-cps 2.69

tatu@indeed.com (JIRA)

May 29, 2019, 4:45:03 PM
to jenkinsc...@googlegroups.com
Tatu Saloranta commented on Bug JENKINS-50223
 
Re: Pipeline Jenkins master ran out of Java Metaspace

Excellent! I think we managed to trigger this problem as well, via some seemingly unrelated refactoring, resulting in max uptime of 4-5 days for some masters.

For what it is worth, the Jenkins Metrics plug-in sends two useful metrics – `vm.class.loaded` and `vm.class.unloaded` – the difference of which is the number of classes currently retained. So we are graphing that to see class retention over time, which showed a steady increase without the pronounced sawtooth pattern that normally occurs.
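For a quick point-in-time check of the same counters without the Metrics plug-in, a script-console sketch using the standard ClassLoadingMXBean (my own sketch, not part of the plug-in):

// Sketch: loaded/unloaded class counts straight from the JVM (the same data behind vm.class.loaded / vm.class.unloaded).
import java.lang.management.ManagementFactory

def classes = ManagementFactory.getClassLoadingMXBean()
println "loaded (cumulative):   ${classes.totalLoadedClassCount}"
println "unloaded (cumulative): ${classes.unloadedClassCount}"
println "currently retained:    ${classes.totalLoadedClassCount - classes.unloadedClassCount}"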

tatu@indeed.com (JIRA)

Aug 6, 2019, 3:27:03 PM
to jenkinsc...@googlegroups.com

Alas, it is actually not certain that version 2.69 fixes this problem for good. The problem came back, and upgrading the Pipeline Groovy plug-in to 2.70 does not seem to have stopped the steady increase (although, given that it takes over a week for classes to accumulate, it is difficult to be sure only two days after a restart).

tatu@indeed.com (JIRA)

Aug 6, 2019, 3:28:02 PM
to jenkinsc...@googlegroups.com
Tatu Saloranta edited a comment on Bug JENKINS-50223
Alas, it is actually not certain that version 2.69 fixes this problem for good. The problem came back (we had initially reverted code we deemed must have triggered the retention), and upgrading the Pipeline Groovy plug-in to 2.70 does not seem to have stopped the steady increase (although, given that it takes over a week for classes to accumulate, it is difficult to be sure only two days after a restart). If confirmed, it's back to Ye Olde Drawing Board to solve the problem.