[JIRA] (JENKINS-51057) EventDispatcher and ConcurrentLinkedQueue ate my JVM

95 views
Skip to first unread message

docwhat@gerf.org (JIRA)

unread,
May 1, 2018, 3:37:02 PM5/1/18
to jenkinsc...@googlegroups.com
Christian Höltje created an issue
 
Jenkins / Bug JENKINS-51057
EventDispatcher and ConcurrentLinkedQueue ate my JVM
Issue Type: Bug Bug
Assignee: Tom FENNELLY
Components: sse-gateway-plugin
Created: 2018-05-01 19:36
Environment: Jenkins 2.107.2
ace-editor-1.1
analysis-core-1.95
ansicolor-0.5.2
ant-1.8
antisamy-markup-formatter-1.5
apache-httpcomponents-client-4-api-4.5.3-2.1
artifactory-2.15.1
authentication-tokens-1.3
badge-1.4
blueocean-1.5.0
blueocean-autofavorite-1.2.2
blueocean-bitbucket-pipeline-1.5.0
blueocean-commons-1.5.0
blueocean-config-1.5.0
blueocean-core-js-1.5.0
blueocean-dashboard-1.5.0
blueocean-display-url-2.2.0
blueocean-events-1.5.0
blueocean-git-pipeline-1.5.0
blueocean-github-pipeline-1.5.0
blueocean-i18n-1.5.0
blueocean-jira-1.5.0
blueocean-jwt-1.5.0
blueocean-personalization-1.5.0
blueocean-pipeline-api-impl-1.5.0
blueocean-pipeline-editor-1.5.0
blueocean-pipeline-scm-api-1.5.0
blueocean-rest-1.5.0
blueocean-rest-impl-1.5.0
blueocean-web-1.5.0
bouncycastle-api-2.16.2
branch-api-2.0.19
build-timeout-1.19
build-user-vars-plugin-1.5
buildtriggerbadge-2.9
checkstyle-3.50
cli-commander-0.3
cloudbees-bitbucket-branch-source-2.2.10
cloudbees-folder-6.4
clover-4.9.0
cobertura-1.12
command-launcher-1.2
conditional-buildstep-1.3.6
config-file-provider-2.18
copyartifact-1.39.1
credentials-2.1.16
credentials-binding-1.16
cucumber-reports-3.16.0
cvs-2.14
dashboard-view-2.9.11
display-url-api-2.2.0
docker-commons-1.11
docker-workflow-1.15.1
durable-task-1.22
email-ext-2.62
embeddable-build-status-1.9
envinject-2.1.5
envinject-api-1.5
extended-read-permission-2.0
external-monitor-job-1.7
favorite-2.3.1
findbugs-4.72
git-3.8.0
git-client-2.7.1
git-server-1.7
github-1.29.0
github-api-1.90
github-branch-source-2.3.3
github-oauth-0.29
gradle-1.28
groovy-2.0
groovy-postbuild-2.4
handlebars-1.1.1
handy-uri-templates-2-api-2.1.6-1.0
htmlpublisher-1.16
http_request-1.8.22
ibm-application-security-1.1.2
icon-shim-2.0.3
ivy-1.28
jackson2-api-2.8.11.1
jacoco-3.0.1
javadoc-1.4
jenkins-design-language-1.5.0
jira-2.5.1
job-dsl-1.69
jquery-detached-1.2.1
jsch-0.1.54.2
junit-1.24
ldap-1.20
lockable-resources-2.2
mailer-1.21
mapdb-api-1.0.9.0
matrix-auth-2.2
matrix-project-1.13
maven-plugin-3.1.2
mercurial-2.3
metrics-3.1.2.11
metrics-graphite-3.0.0
modernstatus-1.2
momentjs-1.1.1
monitoring-1.72.0
pam-auth-1.3
parameterized-trigger-2.35.2
pegdown-formatter-1.3
performance-3.8
pipeline-build-step-2.7
pipeline-github-2.0
pipeline-github-lib-1.0
pipeline-graph-analysis-1.6
pipeline-input-step-2.8
pipeline-maven-3.5.3
pipeline-milestone-step-1.3.1
pipeline-model-api-1.2.9
pipeline-model-declarative-agent-1.1.1
pipeline-model-definition-1.2.9
pipeline-model-extensions-1.2.9
pipeline-rest-api-2.10
pipeline-stage-step-2.3
pipeline-stage-tags-metadata-1.2.9
pipeline-stage-view-2.10
pipeline-utility-steps-2.0.2
plain-credentials-1.4
pmd-3.50
pubsub-light-1.12
rebuild-1.28
resource-disposer-0.8
run-condition-1.0
scm-api-2.2.6
script-security-1.44
sidebar-link-1.9.1
simple-theme-plugin-0.4
slack-2.3
sse-gateway-1.15
ssh-agent-1.15
ssh-credentials-1.13
ssh-slaves-1.26
structs-1.14
subversion-2.10.5
support-core-2.46
swarm-3.12
tap-2.2.1
timestamper-1.8.9
token-macro-2.5
translation-1.16
variant-1.1
view-job-filters-1.27
webhook-step-1.3
windows-slaves-1.3.1
workflow-aggregator-2.5
workflow-api-2.27
workflow-basic-steps-2.6
workflow-cps-2.48
workflow-cps-global-lib-2.9
workflow-durable-task-step-2.19
workflow-job-2.19
workflow-multibranch-2.17
workflow-scm-step-2.6
workflow-step-api-2.14
workflow-support-2.18
ws-cleanup-0.34
xunit-1.102
Priority: Major Major
Reporter: Christian Höltje

We started running out of memory in our JVM (Xmx 8G) and when looking at Melody's memory (heap) histogram (JENKINS_URL/monitoring?part=heaphisto) the top two items were:

 

Class Size (Kb) % size Instances % instances Source
org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$Retry 2,890,399 44 92,492,793 43  
java.util.concurrent.ConcurrentLinkedQueue$Node 2,167,981 33 92,500,553 43  

 
77% (and growing as we were researching the problem) of the memory was being used by these two items.

I have two support bundles from this time and an .hprof as well.

I can either screen share with someone or if you can tell me how to analyze these files I would be happy to.

Add Comment Add Comment
 
This message was sent by Atlassian JIRA (v7.3.0#73011-sha1:3c73d0e)
Atlassian logo

docwhat@gerf.org (JIRA)

unread,
May 3, 2018, 11:52:02 AM5/3/18
to jenkinsc...@googlegroups.com
Christian Höltje commented on Bug JENKINS-51057
 
Re: EventDispatcher and ConcurrentLinkedQueue ate my JVM

Our Jenkins server has been up 23 hours and we're already seeing large numbers of the EventDispacher objects:

 

 Class  Size (Kb)  % size  Instances  % instances  Source

 

It isn't a problem (yet) but this is alarming.

 

Our other, Jenkins server has no references to EventDispatcher$Retry in the memory histogram, even when expanding details.

docwhat@gerf.org (JIRA)

unread,
May 3, 2018, 12:12:02 PM5/3/18
to jenkinsc...@googlegroups.com
Christian Höltje edited a comment on Bug JENKINS-51057
Our Jenkins server has been up 23 hours and we're already seeing large numbers of the EventDispacher objects:

 
|| Class|| Size (Kb)|| % size|| Instances|| % instances|| Source||
|[


 

It isn't a problem (yet) but this is alarming.

 

Our other, Jenkins server has no references to {{EventDispatcher$Retry}} in the memory histogram, even when expanding details.

docwhat@gerf.org (JIRA)

unread,
May 3, 2018, 12:20:03 PM5/3/18
to jenkinsc...@googlegroups.com

In our logs, we're see messages like:

May 02, 2018 1:04:41 PM WARNING org.jenkinsci.plugins.ssegateway.sse.EventDispatcher unsubscribe
Invalid SSE unsubscribe configuration. No active subscription matching filter: 
May 02, 2018 1:04:41 PM WARNING org.jenkinsci.plugins.ssegateway.sse.EventDispatcher unsubscribe
Invalid SSE unsubscribe configuration. No active subscription matching filter: 
May 02, 2018 1:04:41 PM WARNING org.jenkinsci.plugins.ssegateway.sse.EventDispatcher unsubscribe
Invalid SSE unsubscribe configuration. No active subscription matching filter: 
May 02, 2018 1:04:41 PM WARNING org.jenkinsci.plugins.ssegateway.sse.EventDispatcher unsubscribe
Invalid SSE unsubscribe configuration. No active subscription matching filter: 

Could that be related?

 

docwhat@gerf.org (JIRA)

unread,
May 3, 2018, 12:39:01 PM5/3/18
to jenkinsc...@googlegroups.com

And now its at:

Class  Size (Kb)  % size  Instances  % instances  Source

docwhat@gerf.org (JIRA)

unread,
May 3, 2018, 1:16:03 PM5/3/18
to jenkinsc...@googlegroups.com

There are a lot of changes/improvements in sse-gateway (mostly by R. Tyler Croy) since 1.15 was released (back in Jan 16th, 2017).

In the logs, I see a commit titled message leaks which sounds interesting.

I tried compiling the master branch and I get errors:

$ docker run --rm -it -v maven-repo:/root/.m2 -v $PWD:/src:rw -w /src maven:3-jdk-8-alpine mvn verify 
...
[ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:1.0:npm (npm install) on project sse-gateway: Failed to run task: 'npm install' failed. java.io.IOException: Cannot run program "/src/node/node" (in directory "/src"): error=2, No such file or directory -> [Help 1]

docwhat@gerf.org (JIRA)

unread,
May 3, 2018, 1:30:02 PM5/3/18
to jenkinsc...@googlegroups.com

 

I got further when using the non-alpine openjdk image.  Still doesn't build, but it is due to tests failing.

 

$ docker run --rm -it -v maven-repo:/root/.m2 -v $PWD:/src:rw -w /src maven:3-jdk-8 mvn verify
...
Results :
Failed tests:
org.jenkinsci.plugins.ssegateway.EventHistoryStoreTest.test_autoDeleteOnExpire(org.jenkinsci.plugins.ssegateway.EventHistoryStoreTest)
 Run 1: EventHistoryStoreTest.test_autoDeleteOnExpire:119 expected:<100> but was:<50>
 Run 2: EventHistoryStoreTest.test_autoDeleteOnExpire:119 expected:<100> but was:<50>
 Run 3: EventHistoryStoreTest.test_autoDeleteOnExpire:119 expected:<100> but was:<50>
 Run 4: EventHistoryStoreTest.test_autoDeleteOnExpire:119 expected:<100> but was:<50>
 Run 5: EventHistoryStoreTest.test_autoDeleteOnExpire:119 expected:<100> but was:<50>
org.jenkinsci.plugins.ssegateway.EventHistoryStoreTest.test_delete_stale_events(org.jenkinsci.plugins.ssegateway.EventHistoryStoreTest)
 Run 1: EventHistoryStoreTest.test_delete_stale_events:69
 Run 2: EventHistoryStoreTest.test_delete_stale_events:69
 Run 3: EventHistoryStoreTest.test_delete_stale_events:69
 Run 4: EventHistoryStoreTest.test_delete_stale_events:69
 Run 5: EventHistoryStoreTest.test_delete_stale_events:69

Tests run: 12, Failures: 2, Errors: 0, Skipped: 0
...
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test (default-test) on project sse-gateway: There are test failures.
...

 

 

docwhat@gerf.org (JIRA)

unread,
May 3, 2018, 1:44:03 PM5/3/18
to jenkinsc...@googlegroups.com

Some more events from our logs:

 

May 03, 2018 1:36:00 PM FINE org.jenkinsci.plugins.ssegateway.sse.EventDispatcher doDispatch
Error dispatching event to SSE channel. Write failed.
May 03, 2018 1:36:36 PM FINE org.jenkinsci.plugins.ssegateway.sse.EventDispatcher doDispatch
Error dispatching event to SSE channel. Write failed.
May 03, 2018 1:36:39 PM FINE org.jenkinsci.plugins.ssegateway.sse.EventDispatcher processRetries
Error dispatching retry event to SSE channel. Write failed. Dispatcher jenkins-blueocean-core-js-1525286577663-o9g92 (1006446612).
May 03, 2018 1:36:39 PM FINE org.jenkinsci.plugins.ssegateway.sse.EventDispatcher processRetries
Error dispatching retry event to SSE channel. Write failed. Dispatcher jenkins-blueocean-core-js-1525286577663-o9g92 (1006446612).
May 03, 2018 1:36:39 PM FINE org.jenkinsci.plugins.ssegateway.sse.EventDispatcher processRetries
Error dispatching retry event to SSE channel. Write failed. Dispatcher jenkins-blueocean-core-js-1525286577663-o9g92 (1006446612).

docwhat@gerf.org (JIRA)

unread,
May 4, 2018, 9:08:02 AM5/4/18
to jenkinsc...@googlegroups.com

This morning:

 Class  Size (Kb)  % size  Instances  % instances  Source

 

Is there a way I can find out what's in the queues to track what's causing the Retry to happen?

docwhat@gerf.org (JIRA)

unread,
May 4, 2018, 10:40:02 AM5/4/18
to jenkinsc...@googlegroups.com

Looking in $JENKINS_HOME/logs/sse-events/ I see 16 .json files in jobs and 108 .json files in pipeline. And they change frequently (e.g. they are now, a minute later, 6 and 54).

Is there another place I can look at why there are some many Retry objects?

docwhat@gerf.org (JIRA)

unread,
May 7, 2018, 4:09:02 PM5/7/18
to jenkinsc...@googlegroups.com

I managed to use jvirtualvm to look at the .hprof from the initial comment and it was our friend EventDispatcher$Retry and $ConcurrentLinkedQueue$Node using up 90% of the memory.

docwhat@gerf.org (JIRA)

unread,
May 11, 2018, 2:28:02 PM5/11/18
to jenkinsc...@googlegroups.com

I finally got VirtualVM working reasonably well (using lots of RAM and a RAM Disk for temp files).

I found that there are 0 instances of org.jenkinsci.plugins.ssegateway.EventHistoryStore and 0 instances of org.jenkinsci.plugins.ssegateway.sse.EventDispatcher.

Is it possible that the Retry objects are not being reclaimed by GC?

When I look at the objects in VirtualVM the Retry object has a reference (item) to a ConcurrentLinkedQueue$Node object.

The Node object has a reference (item) to the Retry object plus a reference to the next {{Node} in the queue.

docwhat@gerf.org (JIRA)

unread,
May 11, 2018, 2:28:02 PM5/11/18
to jenkinsc...@googlegroups.com

docwhat@gerf.org (JIRA)

unread,
May 11, 2018, 3:45:02 PM5/11/18
to jenkinsc...@googlegroups.com
 
Re: EventDispatcher and ConcurrentLinkedQueue ate my JVM

Since I didn't see this problem in Blue Ocean 1.4.2 but I'm seeing it in 1.5.0 I went looking for changes in Blue Ocean that may have triggered this problem...

I found PR #1667 by Vivek Pandey. It was to fix JENKINS-48820. I doubt it caused this problem, but maybe it made it worse?

me@wilfred.me.uk (JIRA)

unread,
Jun 1, 2018, 5:23:01 AM6/1/18
to jenkinsc...@googlegroups.com
Wilfred Hughes edited a comment on Bug JENKINS-51057
We're seeing this problem with Blue Ocean 1.4.2 and Jenkins 2 . 89.4.

me@wilfred.me.uk (JIRA)

unread,
Jun 1, 2018, 5:23:01 AM6/1/18
to jenkinsc...@googlegroups.com

We're seeing this problem with Blue Ocean 1.4.2.

docwhat@gerf.org (JIRA)

unread,
Jun 25, 2018, 2:04:02 PM6/25/18
to jenkinsc...@googlegroups.com

The only workaround I found was to completely uninstall Blue Ocean and the pubsub plugins.

This message was sent by Atlassian JIRA (v7.10.1#710002-sha1:6efc396)

sternr@windowslive.com (JIRA)

unread,
Jul 5, 2018, 7:08:02 AM7/5/18
to jenkinsc...@googlegroups.com

We think we might've found the root cause, we're checking it now on prod.

https://github.com/jenkinsci/sse-gateway-plugin/pull/27

gustaf.lundh@gmail.com (JIRA)

unread,
Aug 30, 2018, 4:53:02 AM8/30/18
to jenkinsc...@googlegroups.com

We just had to restart a critical Jenkins master due to this memory leak, where org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$Retry and java.util.concurrent.ConcurrentLinkedQueue$Node consumed 20GB of our 40GB heap. We need to get rid of blueocean due to this issue. The priority should be raised to critical IMHO.

This message was sent by Atlassian Jira (v7.11.2#711002-sha1:fdc329d)

jon.sten@gmail.com (JIRA)

unread,
Aug 31, 2018, 8:09:03 AM8/31/18
to jenkinsc...@googlegroups.com
Jon Sten commented on Bug JENKINS-51057

As a temporary workaround, here's a script which will go through and look for old event dispatchers:

// First of, due to groovy and private field in super classes, we need to
// change visibility of retryQueue so that we can use reflection instead...
def retryQueueField = org.jenkinsci.plugins.ssegateway.sse.EventDispatcher.getDeclaredField('retryQueue')
retryQueueField.setAccessible(true)

def sessions = Jenkins.instance.servletContext.this$0._sessionHandler._sessionCache._sessions
println("There are ${sessions.size()} sessions...")
def numEventsRemoved = 0
def numEventDispatchersPurged = 0
sessions.each{id, session->
  def eventDispatchers = session.sessionData._attributes['org.jenkinsci.plugins.ssegateway.sse.EventDispatcher']
  if (eventDispatchers) {
    println("Sessions ${id} has a ssegateway EventDispatcher map...")
    eventDispatchers.findAll{k, v -> k.startsWith('jenkins-blueocean-core-js')}.each { dispatcherId, dispatcher ->
      def retryQueue = retryQueueField.get(dispatcher) // Need to use reflection since retryQueue is private in super class...
      if (retryQueue.isEmpty()) {
        println("  Found one EventDispatcher, '${dispatcherId}', it has no retry events.")
      } else {
        def oldestAge = (System.currentTimeMillis() - retryQueue.peek().timestamp)/1000
        if (oldestAge > 300) {
          println("  Found one EventDispatcher, '${dispatcherId}', its oldest retry event is ${oldestAge} seconds old, it contains ${retryQueue.size()} retry events, removing events and unsubscribing.")
          numEventsRemoved += retryQueue.size()
          numEventDispatchersPurged++
          dispatcher.unsubscribeAll()
          retryQueue.clear()
     	 } else {
        println("  Found one EventDispatcher, its oldest retry event is ${oldestAge} seconds old, so sparing it for now...")
      	}
      }
    }
  }
}

println("Removed ${numEventsRemoved} retry events from ${numEventDispatchersPurged} EventDispatchers!")

Btw, I've verified this bug in a fresh Jenkins install (2.121.3) and with latest version of Blue Ocean (1.8.2) which pulls SSE-gateway (1.15).

mstewart@riotgames.com (JIRA)

unread,
Sep 5, 2018, 10:00:01 PM9/5/18
to jenkinsc...@googlegroups.com
Maxfield Stewart updated an issue
 
Change By: Maxfield Stewart
Attachment: image-2018-09-05-18-59-08-587.png

mstewart@riotgames.com (JIRA)

unread,
Sep 5, 2018, 10:00:04 PM9/5/18
to jenkinsc...@googlegroups.com
Maxfield Stewart commented on Bug JENKINS-51057
 
Re: EventDispatcher and ConcurrentLinkedQueue ate my JVM

We're having this problem with blue ocean 1.4.2 and Jenkins 2.112.  Out of a 18 GB heap, 16GB are the GuavaPubSub object which containes the concurrent queue and EventDispatchers.  We ran Jon Sten 's script and it cleaned up only 2.5GB's of retry objects but still had 14GB's of references to EventDispatchers.  It takes' out our Jenkins Server in about 14-15 days. 

I took a heap dump and am adding a screenshot of a MAT analyzer view of the memory consumption.  Was hoping someone would accept the PR for the memory leak fix which seems to be outstanding for some time now.

g.sironi@elifesciences.org (JIRA)

unread,
Oct 18, 2018, 8:30:03 AM10/18/18
to jenkinsc...@googlegroups.com

Experiencing the same problem, it takes ~30 days for Jenkins to build up to its VM memory limit. We are on SSE Gateway 1.15 too.

 

Heap     Classes: 10,648,     Instances: 57,541,468,     Kilo-Bytes: 1,715,402

 Class  Size (Kb)  % size  Instances  % instances  Source
org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$Retry 800,661 46 25,621,179 44  
java.util.concurrent.ConcurrentLinkedQueue$Node 600,538 35 25,622,966 44  
char[] 125,236 7 1,298,894 2  
java.lang.String 30,420 1 1,297,943 2  
short[] 20,271 1 9,119 0  

msa@rtx.dk (JIRA)

unread,
Nov 6, 2018, 5:21:03 AM11/6/18
to jenkinsc...@googlegroups.com

Bump. Same issue here on SSE Gateway 1.16 it seems. 

Any chance for a fix in the near future?

Heap     Classes: 20,087,     Instances: 280,652,984,     Kilo-Bytes: 9,119,689

 Class  Size (Kb)  % size  Instances  % instances  Source
org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$Retry 3,383,758 37 108,280,267 38  
java.util.concurrent.ConcurrentLinkedQueue$Node 2,538,166 27 108,295,118 38  
char[] 1,075,470 11 9,044,530 3  
java.lang.Object[] 205,241 2 5,771,490 2  
int[] 182,430 2 1,633,983 0  
byte[] 165,738 1 558,281 0  
java.lang.String 157,701 1 6,728,615 2  
java.util.HashMap$Node 99,299 1 3,177,591 1  

batmat@batmat.net (JIRA)

unread,
Jan 22, 2019, 8:10:03 AM1/22/19
to jenkinsc...@googlegroups.com
Baptiste Mathus assigned an issue to Unassigned
 
Change By: Baptiste Mathus
Assignee: Tom FENNELLY

mstewart@riotgames.com (JIRA)

unread,
Mar 5, 2019, 10:15:01 PM3/5/19
to jenkinsc...@googlegroups.com
Maxfield Stewart commented on Bug JENKINS-51057
 
Re: EventDispatcher and ConcurrentLinkedQueue ate my JVM

This problem continues. It continues to take production Jenkins offline forcing restarts every 15-25 days. We're beginning to see if there's a way to deploy jenkins with Zero Blue Ocean elements as it's not production ready.  Though the plugin seems to install by default. it's a shame there's been zero ownership of this for almost a year now

airadier@gmail.com (JIRA)

unread,
Mar 7, 2019, 5:45:02 AM3/7/19
to jenkinsc...@googlegroups.com

Just my 2 cents...

Jon Sten, I tried your cleanup script with mixed results. It was able to empty retryQueues but we still had a big leak. After isolating the problem, I noticed that if the HTTP session from the user expires, that script is not able to find the leaking EventDispatchers. They are, however, still in the GuavaPubSubBus subscribers map, and can be seen with:

 

 

import org.jenkinsci.plugins.pubsub.PubsubBus;
this.bus = PubsubBus.getBus();
this.bus.subscribers.each{ channelSubscriber, guavaSubscriber -> println channelSubscriber }
println "Done"

 

This outputs some lines containing:

org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$SSEChannelSubscriber@3472875b 

which is the Inner class that implements the ChannelSubscriber interface. I suspect that the memory is still leaking as it is being referenced through this subscriber.

I saw there is already a PR https://github.com/jenkinsci/sse-gateway-plugin/pull/27 and a fork of the plugin https://github.com/taboola/sse-gateway-plugin

The approach of this fix is changing the handling of retries so they are aborted and the queue cleared after some amount of time or retries. Howevery, I wonder if the problem is the cleanup should be done when the HttpSession is destroyed. Currently, I can see in https://github.com/jenkinsci/sse-gateway-plugin/blob/master/src/main/java/org/jenkinsci/plugins/ssegateway/sse/EventDispatcher.java the cleanup code for sessionDestroyed is:

    /**
     * Http session listener.
     */
    @Extension
    public static final class SSEHttpSessionListener extends HttpSessionListener {
        @Override
        public void sessionDestroyed(HttpSessionEvent httpSessionEvent) {
            try {
                Map<String, EventDispatcher> dispatchers = EventDispatcherFactory.getDispatchers(httpSessionEvent.getSession());
                try {
                    for (EventDispatcher dispatcher : dispatchers.values()) {
                        try {
                            dispatcher.unsubscribeAll();
                        } catch (Exception e) {
                            LOGGER.log(Level.FINE, "Error during unsubscribeAll() for dispatcher " + dispatcher.getId() + ".", e);
                        }
                    }
                } finally {
                    dispatchers.clear();
                }
            } catch (Exception e) {
                LOGGER.log(Level.FINE, "Error during session cleanup. The session has probably timed out.", e);
            }
        }
    }

but although I can see that for every dispatcher there is a call to dispatcher.unsubscribeAll(), I am missing a call to retryQueue.clear(); But I am just guessing here...

Anyone knowing the internals of the plugin can confirm this? I might try forking the plugin and testing a modified version.

Regards.

airadier@gmail.com (JIRA)

unread,
Mar 8, 2019, 2:41:03 PM3/8/19
to jenkinsc...@googlegroups.com

After many tests and cleanup scripts, we still we able to find some EventDispatcher$Retry elements in the heap but no subscribers in the GuavaPubSubBus or EventDispatchers in the HttpSessions. Some additional verifications in the plugin made me notice that as AsyncEventDispatcher was being used, it is possible that the EventDispatchers were still being referenced by some asyncContexts or threads that were not completed or released. Finally I also added a dispatcher.stop() call in my cleanup script, and it looks like there are no traces of leaked EventDispatcher$Retry clases in the heap anymore, but we still keep observing our instance and analyzing heap dumps, it is very soon to confirm.

Just in case it can help, this is the cleanup script we are running daily:

import org.jenkinsci.plugins.pubsub.PubsubBus;
import org.jenkinsci.plugins.ssegateway.sse.*;

def dryRun = false
this.bus = PubsubBus.getBus();

// change visibility of retryQueue so that we can use reflection instead...
def retryQueueField = org.jenkinsci.plugins.ssegateway.sse.EventDispatcher.getDeclaredField('retryQueue')
retryQueueField.setAccessible(true
)

def dispatcherCount = 0
def dispatchersList = []

//Build a list of EventDispatchers in all existing HTTP sessions
println "DISPATCHERS IN HTTP SESSIONS"
println "----------------------------"
def sessions = Jenkins.instance.servletContext.this$0._sessionHandler._sessionCache._sessions
sessions.each{id, session->
  def eventDispatchers = EventDispatcherFactory.getDispatchers(session)
  if (eventDispatchers) {
    eventDispatchers.each { dispatcherId, dispatcher ->
      dispatchersList.add(dispatcherId)
      def retryQueue = retryQueueField.get(dispatcher) // Need to use reflection since retryQueue is private in super class...
      if (retryQueue.peek() != null) {
        def oldestAge = (System.currentTimeMillis() - retryQueue.peek().timestamp)/1000
        println "Dispatcher: " + dispatcher.getClass().getName() + " - " + dispatcher.id + " with " + retryQueue.size() + " events, oldest is " + oldestAge + " seconds."      
      } else {
        println "Dispatcher: " + dispatcher.getClass().getName() + " - " + dispatcher.id + " with no retryEvents"
      }      
    }
  }
}

println "There are " + dispatchersList.size() + " dispatchers in HTTP sessions"
println ""

//Find all subscribers in bus
println "DISPATCHERS IN PUBSUBBUS"
println "------------------------"
this.bus.subscribers.each{ channelSubscriber, guavaSubscriber ->
  if (channelSubscriber.getClass().getName().equals('org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$SSEChannelSubscriber')) {
    dispatcherCount++
    def dispatcher = channelSubscriber.this$0

    def retryQueue = retryQueueField.get(dispatcher) // Need to use reflection since retryQueue is private in super class...
    if (retryQueue.peek() != null) {
      def oldestAge = (System.currentTimeMillis() - retryQueue.peek().timestamp)/1000
	  println "Dispatcher: " + dispatcher.id + " with " + retryQueue.size() + " events, oldest is " + oldestAge + " seconds."      
      if (oldestAge > 300) {
      	println "  Clearing retryQueue with " + retryQueue.size() + " events"
        if (!dryRun) {
        	retryQueue.clear()
        } else {
          println "  Ignoring, dryrun"
        }
      }
    } else {
	    println "Dispatcher: " + dispatcher.id + " with no retryEvents"
    }
    
    if (dispatchersList.indexOf(dispatcher.id) < 0) {
      println "  Dispatcher is not in HTTP Sessions. Clearing"
        if (!dryRun) {
          dispatcher.unsubscribeAll()
          dispatcher.stop()
        } else {
          println "  Ignoring, dryrun"
        }
    }
  } 
}

println ""
println "Dispatchers in PubSubBus: " + dispatcherCount

 

jon.sten@gmail.com (JIRA)

unread,
Mar 11, 2019, 3:42:04 AM3/11/19
to jenkinsc...@googlegroups.com
Jon Sten commented on Bug JENKINS-51057

Álvaro Iradier Thank you for the investigation. In our case I've only noticed EventDispatcher objects being kept alive by the HTTP sessions (for some unknown reason sessions aren't terminated, might be related to the LDAP plugin, not sure though). However, as you noted it seems like there are other places where the objects are kept referenced on the heap. So some kind of timeout of the dispatcher objects and sessions is really needed.

mstewart@riotgames.com (JIRA)

unread,
Mar 23, 2019, 12:23:03 AM3/23/19
to jenkinsc...@googlegroups.com

Álvaro Iradier Thank you for the hard work, your script almost works on some of our servers.  I had to modify the code to put an exception trap around "dispatcher.stop()" as in some cases that was blowing up on my server, after that it cleared memory.

However on another (older version) of Jenkins the above code explodes trying to access `def dispatcher = channelSubscriber.this$0` as the reflection property fails and doesn't return the parent class (Dispatcher). Not sure what's up there, but we're upgrading that server soon and I'll try again.

airadier@gmail.com (JIRA)

unread,
Mar 24, 2019, 4:46:02 PM3/24/19
to jenkinsc...@googlegroups.com
Álvaro Iradier edited a comment on Bug JENKINS-51057
After many tests and cleanup scripts, we still we able to find some EventDispatcher$Retry elements in the heap but no subscribers in the GuavaPubSubBus or EventDispatchers in the HttpSessions. Some additional verifications in the plugin made me notice that as AsyncEventDispatcher was being used, it is possible that the EventDispatchers were still being referenced by some asyncContexts or threads that were not completed or released. Finally I also added a dispatcher.stop() call in my cleanup script, and it looks like there are no traces of leaked EventDispatcher$Retry clases in the heap anymore, but we still keep observing our instance and analyzing heap dumps, it is very soon to confirm.

Just in case it can help, this is the cleanup script we are running daily:
{code:java}
this.bus.subscribers. each any { channelSubscriber, guavaSubscriber ->

  if (channelSubscriber.getClass().getName().equals('org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$SSEChannelSubscriber')) {
    dispatcherCount++
    def dispatcher = channelSubscriber.this$0

    def retryQueue = retryQueueField.get(dispatcher) // Need to use reflection since retryQueue is private in super class...
    if (retryQueue.peek() != null) {
      def oldestAge = (System.currentTimeMillis() - retryQueue.peek().timestamp)/1000
   println "Dispatcher: " + dispatcher.id + " with " + retryQueue.size() + " events, oldest is " + oldestAge + " seconds."      
      if (oldestAge > 300) {
       println "  Clearing retryQueue with " + retryQueue.size() + " events"
        if (!dryRun) {
         retryQueue.clear()
            dispatcher.unsubscribeAll()
            try {
              dispatcher.stop()
        } catch (Exception ex) {
              println "  !! Exception stopping AsynDispatcher"
            }          
        }
else {

          println "  Ignoring, dryrun"
        }
      }
    } else {
     println "Dispatcher: " + dispatcher.id + " with no retryEvents"
    }
    
    if (dispatchersList.indexOf(dispatcher.id) < 0) {
      println "  Dispatcher is not in HTTP Sessions. Clearing"
        if (!dryRun) {
          dispatcher.unsubscribeAll()
          try {
          dispatcher.stop()
        }
catch (Exception ex) {
            println "  !! Exception stopping AsynDispatcher"
          }
        }
else {

          println "  Ignoring, dryrun"
        }
    }
  }
}

println ""
println "Dispatchers in PubSubBus: " + dispatcherCount
{code}
 

airadier@gmail.com (JIRA)

unread,
Mar 24, 2019, 4:50:03 PM3/24/19
to jenkinsc...@googlegroups.com

Hi Maxfield Stewart, yes I also got the exception after some tests in our production environment, and had to wrap in try-catch. I just updated my comment with latest version of the script we are running. Probably the dispatcher.stop() is not necessary, as I mostly always get an exception, but it won't hurt. Also I noticed that the part in the " Dispatcher is not in HTTP Sessions. Clearing" was not executed most of the times, so I copied the cleanup part (unsubscribeAll() and the .stop()) to the "Clearing retryQueue with..." part.

After a couple of weeks running I examined the thread dumps and saw no trace of any leak regarding the EventDispatcher$Retry elements, so for us it is working. Looking forward for a fix in the official upstream version of the plugin

mstewart@riotgames.com (JIRA)

unread,
Mar 29, 2019, 1:08:01 PM3/29/19
to jenkinsc...@googlegroups.com

Álvaro Iradier Confirmed by the way that your script needs a newer/newest version of plugins, not entirely sure which ones. Of 3 4 deployments I managed, 1 the script was not working for ( couldn't get the parent dispatcher via this$0 ) until I moved to the latest versions.  Tested on Blue ocean 1.13 and 1.14, Jenkins v2.168.  Doesn't seem to work on versions of Blue Ocean older than 1.10/Jenkins 2.147.

dan.mord@gmail.com (JIRA)

unread,
May 13, 2019, 8:31:03 AM5/13/19
to jenkinsc...@googlegroups.com

@Álvaro Iradier Thank you for this script! I will definitely try it as we are facing the same issue on our 2.107.2 Jenkins.

Some things to point out:

  1. I had to change the line 
    def dispatcher = channelSubscriber.this$0
    

    to

    def dispatcher = channelSubscriber.eventDispatcher
    
  1. Your script will only work for SSE gatway plugin version <= 1.15 as the SSEChannelSubscriber inner class in EventDispatcher.java will not have the eventDispatcher member any more.

dan.mord@gmail.com (JIRA)

unread,
May 13, 2019, 8:32:05 AM5/13/19
to jenkinsc...@googlegroups.com
Dan Mordechai edited a comment on Bug JENKINS-51057
@[~airadier] Thank you for this script! I will definitely try it as we are facing the same issue on our 2.107.2 Jenkins.


Some things to point out:
# I had to change the line 
{code:java}
def dispatcher = channelSubscriber.this$0
{code}
to
{code:java}
def dispatcher = channelSubscriber.eventDispatcher
{code}

# Your script will only work for SSE gatway plugin version <= 1.15 as the SSEChannelSubscriber inner class in EventDispatcher.java will not have the eventDispatcher member any more.

dan.mord@gmail.com (JIRA)

unread,
May 13, 2019, 8:33:03 AM5/13/19
to jenkinsc...@googlegroups.com
Dan Mordechai edited a comment on Bug JENKINS-51057
@[~airadier] Thank you for this script! I will definitely try it as we are facing the same issue on our 2.107.2 Jenkins.

Some things to point out:
# I had to change the line 
{code:java}
def dispatcher = channelSubscriber.this$0
{code}
to
{code:java}
def dispatcher = channelSubscriber.eventDispatcher
{code}

#          2. Your script will only work for SSE gatway plugin version <= 1.15 as the SSEChannelSubscriber inner class in EventDispatcher.java will not have the eventDispatcher member any more.

dan.mord@gmail.com (JIRA)

unread,
May 13, 2019, 8:36:03 AM5/13/19
to jenkinsc...@googlegroups.com
Dan Mordechai edited a comment on Bug JENKINS-51057
@ [~airadier] Thank you for this script! I will definitely try it as we are facing the same issue on our 2.107.2 Jenkins.


Some things to point out:
# I had to change the line 
{code:java}
def dispatcher = channelSubscriber.this$0
{code}
to
{code:java}
def dispatcher = channelSubscriber.eventDispatcher
{code}

         2. Your script will only work for SSE gatway plugin version <= 1.15 as the SSEChannelSubscriber inner class in EventDispatcher.java will not have the eventDispatcher member any more.

dan.mord@gmail.com (JIRA)

unread,
May 16, 2019, 5:43:03 AM5/16/19
to jenkinsc...@googlegroups.com
Dan Mordechai edited a comment on Bug JENKINS-51057
[~airadier] Thank you for this script! I will definitely try it as we are facing the same issue on our 2.107.2 Jenkins.

Some things to point out:
# I had to change the line 
{code:java}
def dispatcher = channelSubscriber.this$0
{code}
to
{code:java}
def dispatcher = channelSubscriber.eventDispatcher
{code}

         2. Your script will only work for SSE gatway plugin version
<= > 1.15 as the SSEChannelSubscriber inner class in EventDispatcher.java will not have the eventDispatcher member any more.
Add Comment Add Comment
 

raihaan.shouhell@autodesk.com (JIRA)

unread,
Jun 23, 2019, 11:13:02 PM6/23/19
to jenkinsc...@googlegroups.com

A fix got merged to master that definitely helped our masters

olamy@apache.org (JIRA)

unread,
Jun 24, 2019, 10:56:04 PM6/24/19
to jenkinsc...@googlegroups.com

olamy@apache.org (JIRA)

unread,
Jun 24, 2019, 10:57:04 PM6/24/19
to jenkinsc...@googlegroups.com

andreas.galek@dieboldnixdorf.com (JIRA)

unread,
Aug 12, 2019, 2:43:02 AM8/12/19
to jenkinsc...@googlegroups.com
Andreas Galek commented on Bug JENKINS-51057
 
Re: EventDispatcher and ConcurrentLinkedQueue ate my JVM

Having installed the newest plugin's version 1.19 I was facing error messages like this:

 
Aug 09, 2019 6:36:46 PM hudson.init.impl.InstallUncaughtExceptionHandler$DefaultUncaughtExceptionHandler uncaughtException
SEVERE: A thread (EventDispatcher.retryProcessor/37063) died unexpectedly due to an uncaught exception, this may leave your Jenkins in a bad way and is usually indicative of a bug in the code.
java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:717)
    at java.util.Timer.<init>(Timer.java:160)
    at org.jenkinsci.plugins.ssegateway.sse.EventDispatcher.scheduleRetryQueueProcessing(EventDispatcher.java:296)
    at org.jenkinsci.plugins.ssegateway.sse.EventDispatcher.processRetries(EventDispatcher.java:437)
    at org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$1.run(EventDispatcher.java:299)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)
 
 
Is it the same problem as the one you fixed in this Jira ticket?
 

olamy@apache.org (JIRA)

unread,
Aug 12, 2019, 3:11:09 AM8/12/19
to jenkinsc...@googlegroups.com
Reply all
Reply to author
Forward
0 new messages