[JIRA] (JENKINS-46507) Parallel Pipeline random java.lang.InterruptedException


bsousa.pro@gmail.com (JIRA)

Apr 11, 2018, 2:20:02 AM
to jenkinsc...@googlegroups.com
Bruno Sousa commented on Bug JENKINS-46507
 
Re: Parallel Pipeline random java.lang.InterruptedException

+1 same here

 
This message was sent by Atlassian JIRA (v7.3.0#73011-sha1:3c73d0e)

lightningbladex2@hotmail.com (JIRA)

Apr 18, 2018, 2:30:02 PM
to jenkinsc...@googlegroups.com

svanoort@cloudbees.com (JIRA)

Apr 18, 2018, 2:32:02 PM
to jenkinsc...@googlegroups.com
Sam Van Oort started work on Bug JENKINS-46507
 
Change By: Sam Van Oort
Status: Open → In Progress

jenkins@ghenzler.de (JIRA)

Apr 18, 2018, 2:42:02 PM
to jenkinsc...@googlegroups.com

We also had this problem again today... reading Jesse Glick's comment in https://issues.jenkins-ci.org/browse/JENKINS-35710?focusedCommentId=268398&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-268398 I suspect adding more slaves does not help (the stack trace clearly points to CPS, and according to the comment that always runs on the master)... but I suppose using https://github.com/jenkinsci/throttle-concurrent-builds-plugin to ensure only a maximum number of builds run in parallel globally could help. Sam Van Oort, before trying, do you think this is a valid way of getting rid of the problem? (Asking before trying, as this is not easy to reproduce.)

svanoort@cloudbees.com (JIRA)

Apr 18, 2018, 3:40:03 PM
to jenkinsc...@googlegroups.com

Georg Henzler Executors have nothing to do with this, and this issue is generally within the scope of a single build, so limiting the maximum number of parallel builds should have no impact – except when restarting Jenkins, where there might be enough of a bottleneck due to loading FlowExecutions to hit the 1-minute timeout. That might happen if you had maybe 10x as many Pipelines as CPU cores on the master, or really complex program state – in that case we'd be blocking on the code here: https://github.com/jenkinsci/workflow-job-plugin/blob/master/src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java#L1024

Do you have a message like this in your logs? WARNING, failed to wait for $someRun to be loaded (and then a stack trace).

More likely there's something subtly wrong with the concurrency/error-handling model here – some of what I'm doing right now to untangle some nasty issues in these guts may help – and there was a change in the last release of workflow-cps that might help if you don't have it already (it caught more exceptions and prevented threads from potentially waiting forever).
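Sam's note about a workflow-cps change that "caught more exceptions and prevented threads from potentially waiting forever" describes a general hazard that can be sketched in plain Java (hypothetical names; this is not the plugin's code): if the code responsible for completing a future dies with an uncaught exception, every waiter on get() blocks indefinitely unless the failure is propagated into the future.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class NeverCompleted {
    // If the producer thread dies from an unexpected exception before completing
    // the future, a plain get() would wait forever. A catch-all that completes
    // the future exceptionally wakes the waiter up with the error instead.
    static String runProducer(boolean catchAll) throws InterruptedException {
        CompletableFuture<String> result = new CompletableFuture<>();
        Thread producer = new Thread(() -> {
            try {
                throw new IllegalStateException("boom"); // unexpected failure mid-work
            } catch (RuntimeException e) {
                if (catchAll) {
                    result.completeExceptionally(e); // propagate the failure to waiters
                }
                // without the catch-all, result is simply never completed
            }
        });
        producer.start();
        producer.join();
        try {
            // Use a short timeout so the demo terminates; a plain get() would hang.
            return result.get(1, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            return "hung";
        } catch (ExecutionException e) {
            return "failed: " + e.getCause().getMessage();
        }
    }
}
```

With the catch-all the waiter sees the original exception immediately; without it, the waiter only gives up because of the demo's 1-second timeout.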

alexeygrigorov@gmail.com (JIRA)

Sep 4, 2018, 11:54:02 AM
to jenkinsc...@googlegroups.com

Same here. I'm running a @NonCPS-labeled method in a library, and it fails with this error randomly in the middle.

import jenkins.*
import jenkins.model.*
import hudson.*
import hudson.model.*

/* This method finds all jobs and verifies their disk usage,
then removes job runs to fit the quota.

That helps to save disk space.
*/
@NonCPS
def call() {
    def maxRunSize = 21474836480 //20Gb
    def maxJobSize = 42949672960 //40Gb
    Jenkins.instance.getAllItems(hudson.model.Job).each
            {
                println it
                def runs = it.getBuilds()
                def job = it
                def totalSize = 0
                def jobSizeReached = false
                runs.each{
                    def currentRun = it
                    def currentRunSize = it.getRootDir().directorySize()
                    println it
                    println currentRunSize
                    switch (currentRunSize) {
                        case {it > maxRunSize} :
                            currentRun.delete()
                            println "deleted " + currentRun + " for having size " + currentRunSize + " over allowed " + maxRunSize
                            break
                        case {it + totalSize > maxJobSize || jobSizeReached } :
                            jobSizeReached = true
                            currentRun.delete()
                            println "deleted " + currentRun + " for not fitting total " + job + " jobs size " + (totalSize+currentRunSize) + " to max allowed " + maxJobSize
                            break
                        default :
                            totalSize += currentRunSize
                            println 'leaving '  + currentRun + ". current total = " + totalSize
                            break
                    }
                    println '--------------------------'
                }
                println '=========================='
            }
}

--------------------------
[Pipeline] echo
[Pipeline] End of Pipeline
java.lang.InterruptedException
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1302)
	at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:275)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:111)
	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadGroupSynchronously(CpsStepContext.java:248)
	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadSynchronously(CpsStepContext.java:237)
	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.doGet(CpsStepContext.java:298)
	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:61)
	at org.jenkinsci.plugins.workflow.steps.StepDescriptor.checkContextAvailability(StepDescriptor.java:258)
	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:206)
	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:153)
	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
	at org.jenkinsci.plugins.workflow.cps.CpsScript.println(CpsScript.java:221)
	at sun.reflect.GeneratedMethodAccessor1023.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1213)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
	at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:925)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:908)
	at org.codehaus.groovy.runtime.DefaultGroovyMethods.println(DefaultGroovyMethods.java:678)
	at org.codehaus.groovy.runtime.dgm$500.doMethodInvoke(Unknown Source)
	at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:294)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.callCurrent(PogoMetaClassSite.java:69)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:166)

jenkins 2.107.3


toby.broyles@gmail.com (JIRA)

Sep 5, 2018, 8:32:11 AM
to jenkinsc...@googlegroups.com

We are seeing this error quite often lately. Is there anything I can do to help troubleshoot?

svanoort@cloudbees.com (JIRA)

Sep 5, 2018, 9:23:06 AM
to jenkinsc...@googlegroups.com

alexeygrigorov@gmail.com (JIRA)

Sep 5, 2018, 10:01:02 AM
to jenkinsc...@googlegroups.com
Alexey Grigorov edited a comment on Bug JENKINS-46507
Same here. I'm running a @NonCPS-labeled method in a library, and it fails with this error randomly in the middle.



{noformat}

<<Some output from the loop here sometime few lines sometime few hundred lines, i guess it fails after some random number>>
{noformat}

jenkins 2.107.3

svanoort@cloudbees.com (JIRA)

Sep 14, 2018, 10:20:03 AM
to jenkinsc...@googlegroups.com
Change By: Sam Van Oort
Assignee: Sam Van Oort → Jose Blas Camacho Taboada

dnusbaum@cloudbees.com (JIRA)

Sep 14, 2018, 5:17:05 PM
to jenkinsc...@googlegroups.com

I finally had a chance to take a look at this.

Alexey Grigorov: You are using println inside of an @NonCPS method. It might seem like this should be OK, but behind the scenes println ends up calling the echo step, which means that you are executing Pipeline steps inside of an @NonCPS method. That should not be done, even though it may appear to work correctly in some cases.

Toby Broyles If you have a minimal Pipeline that can reproduce the issue, or a scenario in which you can reproduce the problem consistently, it would be very helpful if you could post the Pipeline script and the build folder of one of the builds that failed with the exception here. Otherwise, it would be great to see the actual Pipeline that is hitting the error, so we can check for any similarities with Alexey's or Rick's Pipelines that might give us an idea of what is happening.

The only way I have been able to reproduce anything remotely similar to the exceptions seen here has been to create @NonCPS methods that take more than 5 minutes to complete, which causes CpsThread#runNextChunk to interrupt the thread. AIUI that may very well be the code that is interrupting pp.get here, but even so, I am not sure why the CpsFlowExecution's program promise would be taking so long to load. My only theory is that in CpsStepContext#getThreadGroupSynchronously the local variable pp is set to the nonnull value of flowExecution.programPromise, but then something else changes the value of flowExecution.programPromise; we keep waiting on the stale program promise pp, which does not complete for whatever reason and eventually times out. If that is the issue, we might be able to fix it with a patch like the following to CpsStepContext#getThreadGroupSynchronously:

diff --git a/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsStepContext.java b/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsStepContext.java
index 2b90bec1..6f0a45cb 100644
--- a/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsStepContext.java
+++ b/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsStepContext.java
@@ -63,6 +63,8 @@ import java.util.concurrent.Callable;
 import java.util.concurrent.ExecutionException;
 import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Executors;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.TimeoutException;
 import java.util.logging.Level;
 import java.util.logging.Logger;
 import jenkins.model.CauseOfInterruption;
@@ -241,14 +243,19 @@ public class CpsStepContext extends DefaultStepContext { // TODO add XStream cla
         if (threadGroup == null) {
             ListenableFuture<CpsThreadGroup> pp;
             CpsFlowExecution flowExecution = getFlowExecution();
+            while (true) {
             while ((pp = flowExecution.programPromise) == null) {
                 Thread.sleep(100); // TODO does JENKINS-33005 remove the need for this?
             }
             try {
-                threadGroup = pp.get();
+                // Wait in 30-second intervals in case the local variable pp becomes stale.
+                threadGroup = pp.get(30, TimeUnit.SECONDS);
+            } catch (TimeoutException e) {
+                // Update reference to flowExecution.programPromise and attempt to load it again.
             } catch (ExecutionException e) {
                 throw new IOException(e);
             }
+            }
         }
         return threadGroup;
     }
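The refresh-on-timeout idea in the patch above can be sketched in isolation (illustrative names only; this is not the plugin code): wait on the current snapshot of the future-valued field in bounded slices, and on each timeout re-read the field in case the snapshot has been replaced and gone stale.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class RefreshableWait {
    // Instead of blocking forever on one snapshot of a future-valued field,
    // wait in 30-second slices and re-read the field after each timeout, so a
    // replaced ("stale") snapshot cannot strand the waiter indefinitely.
    static <T> T getRefreshing(Supplier<Future<T>> field)
            throws InterruptedException, ExecutionException {
        while (true) {
            Future<T> snapshot;
            while ((snapshot = field.get()) == null) {
                Thread.sleep(100); // field not published yet; poll briefly
            }
            try {
                return snapshot.get(30, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                // Timed out: loop around, re-read the field, and wait again.
            }
        }
    }
}
```

The trade-off is the same as in the patch: a genuinely slow promise is retried forever here, so a real fix would likely also want an overall deadline.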

dnusbaum@cloudbees.com (JIRA)

Sep 18, 2018, 5:20:13 PM
to jenkinsc...@googlegroups.com
Devin Nusbaum edited a comment on Bug JENKINS-46507
I finally had a chance to take a look at this.

[~iceiceice]: You are using println inside of an {{@NonCPS}} method. It might seem like this should be ok, but [behind the scenes|https://github.com/jenkinsci/workflow-cps-plugin/blob/0e4c25f8d7b84470aa523491e29933db3b3df588/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsScript.java#L221] {{println}} ends up calling the {{echo}} step, which means that you are executing Pipeline Steps inside of an {{@NonCPS}} method, which should not be done even though it may appear to work correctly in some cases.

[~tobybro] If you have a minimal pipeline that can reproduce the issue, or a scenario in which you can reproduce the problem consistently it would be very helpful if you could post the Pipeline script and the build folder of one of the builds that failed with the exception here. Otherwise, it would be great to see your actual pipeline that is hitting the error so we can check for any similarities with Alexey's or Rick's Pipelines that might give us an idea of what is happening.

EDIT: My original hypothesis was misleading and incorrect, so I've edited it based on new information.

Notably, the exception being thrown in all these cases comes from {{AbstractQueuedSynchronizer#acquireSharedInterruptibly}} Line 1302, which checks the interrupt status before the thread tries to acquire the lock. This means that the thread was actually interrupted before the step started executing, and {{pp.get}} [here|https://github.com/jenkinsci/workflow-cps-plugin/blob/54d2f4fe8069fde53789bfe21229ce8e545300bb/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsStepContext.java#L248] just happens to be the first piece of code to check the interrupted flag and throw an exception. I got access to the flow nodes of another user's pipeline hitting the same exception, and noticed that the duration between the last successfully executed node and the start of the step that threw the exception was just over 5 minutes, which leads me to believe that the code doing the interruption is the [Timeout in CpsThread#runNextChunk|https://github.com/jenkinsci/workflow-cps-plugin/blob/54d2f4fe8069fde53789bfe21229ce8e545300bb/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsThread.java#L170]. None of the Pipelines that I've seen that have thrown this error look like they should have taken more than 5 minutes in runNextChunk, so I am suspicious that there is something going wrong inside of that method (maybe it is executing more code than it should), but I do not know how to validate my hypothesis.
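Devin's point about where the exception originates can be demonstrated with a plain java.util.concurrent.Semaphore (illustrative only; not plugin code): acquireSharedInterruptibly checks the interrupt flag before trying to acquire, so an already-interrupted thread fails immediately, even when no waiting would have been necessary at all.

```java
import java.util.concurrent.Semaphore;

public class PreInterruptDemo {
    // Returns true if acquire() threw InterruptedException even though a
    // permit was free, i.e. the thread never actually had to wait.
    static boolean interruptedBeforeAcquire() {
        Semaphore permit = new Semaphore(1); // one permit available: no contention
        Thread.currentThread().interrupt();  // set the interrupt flag *before* acquiring
        try {
            permit.acquire(); // AQS#acquireSharedInterruptibly checks the flag first
            return false;
        } catch (InterruptedException e) {
            return true; // thrown immediately, before any waiting happened
        }
    }

    public static void main(String[] args) {
        System.out.println("interrupted before acquiring: " + interruptedBeforeAcquire());
    }
}
```

So the stack traces above say only that the thread carried an interrupt when it reached pp.get; they do not say the wait itself was what timed out.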

spam@ncrmnt.org (JIRA)

Sep 22, 2018, 9:21:02 AM
to jenkinsc...@googlegroups.com
Andrew a commented on Bug JENKINS-46507

Hi everyone. A little info on this issue that we've been hitting as well. We're using Jenkins and its Pipeline plugin to run our regression and CI builds for a SoC (system-on-chip) we're designing. As a side effect of what we're doing, some tests (under the parallel step) take a very long time to complete and are very CPU-intensive. Here are some statistics I've noticed:

  1. CI builds that have a small amount (~10) of short tests under parallel step almost always complete successfully.
  2. Regression runs that take a few days to complete and have a huge list of longer tests almost always hit this bug.
  3. If Jenkins was restarted fresh, the chances of hitting this bug are somewhat lower. I came to think it was related to a memory leak in the Jenkins test-results-analyzer plugin at first, which (due to our huge, sometimes over 500MiB log files) crashed Jenkins often at first, until I started filtering logs to mitigate the issue (https://issues.jenkins-ci.org/browse/JENKINS-34134). Tuning GC options seemed to further improve things.
  4. Since we're running builds on an NFS filesystem to later distribute and execute tests on different servers, thus balancing the load, I first considered filesystem lags due to other tasks being run on our servers. Adding the snippet below seemed to improve things a little bit. The disk load is not evenly distributed: it's heavy at the start of a test (when the model is loaded into memory, and during reset due to extensive logging); later it's usually very low throughout the whole run.

/* Workaround laggy filesystem */
script {
   System.setProperty("org.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL", "3800");
}

5. Another thing I've noticed: some tests seem to randomly fail. According to the simulation log, the simulation process was just killed (by something other than OOM), rather than getting stuck, and that happens quite fast (e.g. the simulation hadn't even completed the reset sequence, which takes a few seconds of real time but usually spews a lot of info into the log).

6. In the Blue Ocean UI, all failures in parallel steps seem to be grouped in the very bottom half of the screen, after 150+ successfully executed tests. Since these are mostly in the order they're started (I assume), it makes me believe it might be somehow related to either GC taking a lot of time or some memory leaking here and there.

In total, 5-10 tests out of 300+ crash randomly, and almost always the annoying java.lang.InterruptedException leaves the whole run (which took 3-4 days to complete!) just frozen.

A full regression that takes a few days to complete almost always triggers the bug. I'm willing to help solve this issue, but since it requires some sacred knowledge about Jenkins' guts that I do not have, I can either post my pipeline script (which will most likely be useless, due to the huge number of internal tools being called), or give some of the proposed fixes a try and post the results, since I have the right environment.

dnusbaum@cloudbees.com (JIRA)

Sep 25, 2018, 10:54:03 AM
to jenkinsc...@googlegroups.com

Andrew a Thank you for including all of those details in your comment! It's interesting that you are hitting the issue consistently on multi-day builds. How much of your build is running external processes (sh/bat) vs Pipeline/Groovy code? I.e., are you more or less running a multi-day bash script (or many such scripts in a parallel block), is most of the time spent running Groovy code in your Jenkinsfile, or is it about even?

I am not aware of any workarounds for the issue. One thing that would be helpful would be if you could upload the build folder for one of your builds which failed with this issue. In particular, I am interested in the XML files in the workflow/ directory to see if the serialized flow nodes can tell us what was executing when the issue occurred which may give us some insight.

spam@ncrmnt.org (JIRA)

Sep 26, 2018, 7:57:04 AM
to jenkinsc...@googlegroups.com
Andrew a updated an issue
Change By: Andrew a
Attachment: Jenkinsfile

spam@ncrmnt.org (JIRA)

Sep 26, 2018, 8:02:03 AM
to jenkinsc...@googlegroups.com
Andrew a commented on Bug JENKINS-46507

Devin Nusbaum, my last run was inconsistent due to the fact that the server with Jenkins was out of disk space (sic!). I'll start the next regression run today and post results once it's done or hangs. This will take a few days though, depending on the load. Once it's done I'll try to provide you with as much data as I can.

> How much of your build is running external processes sh/bat vs Pipeline/Groovy code (i.e. Are you more or less running a multi-day bash script (or many such scripts in a parallel block), or is most of the time spent running Groovy code in your Jenkinsfile, or is it about even)?

Basically it all comes down to running a single "cmake's ctest" process with one test in the list that may take anything from a few minutes to a few days with little to no stdout. Here's my pipeline script:

Jenkinsfile

Please note that it may seem weird in places due to our internal server config and me trying to work around this bug. It configures and builds the project on one selected node (the build is done on an NFS share with the exact same path on all nodes), the touchstone simulation step creates simulation checkpoints, and, finally, the result is executed in parallel during the regression phase, with throttling enabled and the load distributed over our servers.

spam@ncrmnt.org (JIRA)

Sep 30, 2018, 4:29:06 PM
to jenkinsc...@googlegroups.com
Andrew a updated an issue
Change By: Andrew a
Attachment: hs_err_pid239040.log

spam@ncrmnt.org (JIRA)

Sep 30, 2018, 4:39:02 PM
to jenkinsc...@googlegroups.com
Andrew a commented on Bug JENKINS-46507

Devin Nusbaum, an update. It looks like I got the hit on the first try and now see a handful of

[rumboot-default-rumboot-basis-Debug-basis-stress-test-uart-eth-dma] Cannot contact oribi: java.lang.InterruptedException
[rumboot-default-rumboot-basis-Debug-simple-iram-mirror-hello-arinc_1] Cannot contact bison: java.lang.InterruptedException
[rumboot-default-rumboot-basis-Debug-simple-iram-mirror-hello-arinc_6] Cannot contact oribi: java.lang.InterruptedException

in the log. I now have the build in the stuck state and can provide the data. Can you give a bit more detail on where to find the relevant XML files? A quick grep on the slaves didn't reveal any workflow directories. Where should I look for them?

spam@ncrmnt.org (JIRA)

Oct 1, 2018, 6:47:05 AM
to jenkinsc...@googlegroups.com
Andrew a edited a comment on Bug JENKINS-46507
[~dnusbaum], an update. It looks like I got the hit on the first try and now see a handful of
{code:java}

[rumboot-default-rumboot-basis-Debug-basis-stress-test-uart-eth-dma] Cannot contact oribi: java.lang.InterruptedException
[rumboot-default-rumboot-basis-Debug-simple-iram-mirror-hello-arinc_1] Cannot contact bison: java.lang.InterruptedException
[rumboot-default-rumboot-basis-Debug-simple-iram-mirror-hello-arinc_6] Cannot contact oribi: java.lang.InterruptedException
{code}

in the log. I now have the build in the stuck state and can provide the data. Can you give a bit more detail on where to find the relevant XML files? A quick grep on the slaves didn't reveal any workflow directories. Where should I look for them?

EDIT: Hm... I've been running tests with a maximum timeout of 260000 s and it seems it has failed once it stumbled upon the first test that actually timed out. Looking at the tests, all tests that caused InterruptedException are tests that are timing out in our current project.

spam@ncrmnt.org (JIRA)

Oct 1, 2018, 6:51:34 AM
to jenkinsc...@googlegroups.com
Andrew a edited a comment on Bug JENKINS-46507
[~dnusbaum], an update. It looks like I got the hit on the first try and now see a handful of
{code:java}
[rumboot-default-rumboot-basis-Debug-basis-stress-test-uart-eth-dma] Cannot contact oribi: java.lang.InterruptedException
[rumboot-default-rumboot-basis-Debug-simple-iram-mirror-hello-arinc_1] Cannot contact bison: java.lang.InterruptedException
[rumboot-default-rumboot-basis-Debug-simple-iram-mirror-hello-arinc_6] Cannot contact oribi: java.lang.InterruptedException
{code}
in the log. I now have the build in the stuck state and can provide the data. Can you give a bit more detail on where to find the relevant XML files? A quick grep on the slaves didn't reveal any workflow directories. Where should I look for them?


EDIT: Hm... I've been running tests with a maximum timeout of 260000 s and it seems it has failed once it stumbled upon the first test that actually timed out. Actually, all tests that have an interrupted exception are those that are currently timing out and being worked on! They hang for at least 260000 seconds consuming 100% of one CPU core with no stdout before being killed internally by the ctest process.

spam@ncrmnt.org (JIRA)

Oct 1, 2018, 6:52:01 AM
to jenkinsc...@googlegroups.com

spam@ncrmnt.org (JIRA)

Oct 1, 2018, 8:42:04 AM
to jenkinsc...@googlegroups.com
Andrew a edited a comment on Bug JENKINS-46507
[~dnusbaum], an update. It looks like I got the hit on the first try and now see a handful of
{code:java}
[rumboot-default-rumboot-basis-Debug-basis-stress-test-uart-eth-dma] Cannot contact oribi: java.lang.InterruptedException
[rumboot-default-rumboot-basis-Debug-simple-iram-mirror-hello-arinc_1] Cannot contact bison: java.lang.InterruptedException
[rumboot-default-rumboot-basis-Debug-simple-iram-mirror-hello-arinc_6] Cannot contact oribi: java.lang.InterruptedException
{code}
in the log. I now have the build in the stuck state and can provide the data. Can you give a bit more detail on where to find the relevant XML files? A quick grep on the slaves didn't reveal any workflow directories. Where should I look for them?

EDIT: Hm... I've been running tests with a maximum timeout of 260000 s and it seems it has failed once it stumbled upon the first test that actually timed out. Actually, all tests that have an interrupted exception are those that are currently timing out and being worked on! They should've hung for at least 260000 seconds consuming 100% of one CPU core with no stdout before being killed internally by the ctest process. InterruptedException arrives earlier.

I wonder if wrapping up 'sleep 260000' in groovy will produce the same results.

dnusbaum@cloudbees.com (JIRA)

Oct 1, 2018, 9:20:03 AM
to jenkinsc...@googlegroups.com

Can you tell a bit more detail of where to find the relevant XML files? A quick grep on the slaves didn't reveal any workflow directories. Where should I look for them?

Yep, sorry for not specifying earlier. They will be on the master, not on the agents. The path will be $JENKINS_HOME/jobs/<path-to-job>/builds/<failing_build_number>/. Inside, the main files that would be interesting are build.xml and everything in workflow/.

If it is reproducible with sleep 260000 that would be interesting. At those lengths of time I would start to suspect something going wrong with a network call from the master to the agent causing code somewhere to hang that Pipeline does not handle gracefully. Do your overall Jenkins master logs have any warnings/errors around the time the InterruptedException was thrown?

spam@ncrmnt.org (JIRA)

Oct 1, 2018, 10:58:09 AM
to jenkinsc...@googlegroups.com

spam@ncrmnt.org (JIRA)

Oct 1, 2018, 11:05:05 AM
to jenkinsc...@googlegroups.com
Andrew a commented on Bug JENKINS-46507

Here goes the tarball (stuff.tgz) with all the XML files. Be careful, there's a ton of them in there.

I do see some exceptions in jenkins master logs like this:

Oct 01, 2018 1:34:10 PM org.jenkinsci.plugins.workflow.cps.CpsStepContext completed
INFO: new failure
org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
        at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.interrupt(CpsFlowExecution.java:1125)
        at org.jenkinsci.plugins.workflow.job.WorkflowRun$2$1.run(WorkflowRun.java:378)
        at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
        Suppressed: java.lang.InterruptedException
                at java.lang.Object.wait(Native Method)
                at hudson.remoting.Request.call(Request.java:177)
                at hudson.remoting.Channel.call(Channel.java:954)
                at hudson.Launcher$RemoteLauncher.kill(Launcher.java:1084)
                at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.stop(FileMonitoringTask.java:304)
                at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.stop(DurableTaskStep.java:258)
                at org.jenkinsci.plugins.workflow.cps.CpsThread.stop(CpsThread.java:296)
                at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$6.onSuccess(CpsFlowExecution.java:1139)
                at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$6.onSuccess(CpsFlowExecution.java:1128)
                at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4$1.run(CpsFlowExecution.java:906)
                at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35)
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
                at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
                at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        ... 3 more

Oct 01, 2018 1:34:10 PM org.jenkinsci.plugins.workflow.cps.CpsStepContext completed
INFO: previously delivered here
java.lang.Throwable
        at org.jenkinsci.plugins.workflow.cps.CpsStepContext.completed(CpsStepContext.java:341)
        at org.jenkinsci.plugins.workflow.cps.CpsStepContext.onFailure(CpsStepContext.java:329)
        at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution$1.run(DurableTaskStep.java:254)
        at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

But I'm not sure whether these are due to the build failing or to restarting the master earlier today. Our instance has become quite bloated recently.

Btw, I restarted the build with an insanely large heartbeat interval that should match the tests' timeout. We'll see how it goes now.

c.amshoff@gmx.de (JIRA)

unread,
Oct 10, 2018, 3:25:06 PM10/10/18
to jenkinsc...@googlegroups.com

Same issue here...

We have a fairly small Jenkinsfile that calls a RESTful service, which takes (depending on parameters) anywhere from a few seconds to an hour or more to complete. The JSON returned by the service should be attached to the build, which is why it's called synchronously. The longer the service call takes, the more likely the InterruptedException is to occur. There are no parallel steps in our pipeline, so I'm pretty sure it's related to the long-running step.

The REST call is done by some self-made Groovy functions and classes in our shared library, basically setting up an HttpURLConnection instance. However, that works nicely for fast service calls, so I don't think there is an issue in this code.

Let me know if I can help with any files/logs.

dnusbaum@cloudbees.com (JIRA)

unread,
Oct 10, 2018, 3:51:02 PM10/10/18
to jenkinsc...@googlegroups.com

Christoph Amshoff Without seeing your Jenkinsfile, it's hard to say, but my guess would be that you are hitting the 5-minute timeout here because your code is waiting for a network response (or similar) in a method that doesn't respond to being interrupted, so it doesn't fail until the next step is invoked, which then notices that the thread was interrupted (assuming your stack trace matches the OP's). I would highly recommend not making direct HTTP calls in your Jenkinsfile even if it seems to work fine in some cases. You can create a bash script that does whatever you want (curl, java -jar yourRestClient.jar, etc.) and call that script using the sh or bat steps, which run asynchronously and are not subject to a 5-minute timeout.
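A rough sketch of that approach follows. Everything here is hypothetical: the function name, paths, and the 2-hour `--max-time` ceiling are placeholders, not values from this build; the demo uses a local `file://` URL instead of a real service so it is self-contained.

```shell
#!/usr/bin/env bash
set -euo pipefail

# call_service <url> <output-file>: run the long REST call via curl so that
# it executes inside a durable `sh` step rather than on the CPS VM thread.
call_service() {
  local url="$1" out="$2"
  # --max-time is a placeholder ceiling (2 hours), not a limit from this thread.
  curl --fail --silent --show-error --max-time 7200 -o "$out" "$url"
}

# Self-contained demo against a file:// URL instead of a real endpoint.
printf '{"status":"done"}' > /tmp/service-response.json
call_service "file:///tmp/service-response.json" /tmp/result.json
cat /tmp/result.json
```

From the Jenkinsfile, such a script would presumably be invoked with something like `sh './call-service.sh https://example/api result.json'`, after which the JSON file can be read or archived as a build artifact.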

Andrew a Based on those logs, there does appear to be a network issue with the remoting layer involved as well. I will check through the build files you sent to see if they give any indication of a root cause that we can address.

dnusbaum@cloudbees.com (JIRA)

unread,
Oct 10, 2018, 3:57:02 PM10/10/18
to jenkinsc...@googlegroups.com
Devin Nusbaum edited a comment on Bug JENKINS-46507
[~chamshoff] Without seeing your Jenkinsfile, it's hard to say, but my guess would be that you are hitting [the 5 minute timeout here|https://github.com/jenkinsci/workflow-cps-plugin/blob/a67af1db1eb7b83c287243547b8c6b0d452fd3c3/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsThread.java#L178] because your code is waiting for a network response or similar in a method that doesn't respond to being interrupted (so it doesn't fail until the next step is invoked which notices that the thread was interrupted, assuming your stack trace is original to the OP's). I would highly recommend not making direct HTTP calls in your Jenkinsfile /Shared Library even if it seems like it works fine in some cases. You can create a bash script that does whatever you want (curl, java -jar yourRestClient.jar, etc.) and call that script using the {{sh}} or {{bat}} steps which run asynchronously and are not subject to a 5 minute timeout.

[~ncrmnt] Based on those logs, there does appear to be a network issue with the remoting layer involved as well. I will check through the build files you sent to see if they give any indication of a root cause that we can address.

spam@ncrmnt.org (JIRA)

unread,
Oct 11, 2018, 8:40:02 AM10/11/18
to jenkinsc...@googlegroups.com
Andrew a commented on Bug JENKINS-46507

Devin Nusbaum Thanks a lot for looking into it. Let me know if I can be of any help with this issue.

 

P.S. Another run with a super-high heartbeat interval ran a little longer, but also ended with another InterruptedException on one of our tests that times out.

dnusbaum@cloudbees.com (JIRA)

unread,
Oct 23, 2018, 2:08:02 PM10/23/18
to jenkinsc...@googlegroups.com

Andrew a The exception in your most recent build looks like a networking issue after you manually aborted the build. Here is the stack trace:

org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.interrupt(CpsFlowExecution.java:1125)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun$2$1.run(WorkflowRun.java:378)
    at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
    Suppressed: java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at hudson.remoting.Request.call(Request.java:177)
        at hudson.remoting.Channel.call(Channel.java:954)
        at hudson.Launcher$RemoteLauncher.kill(Launcher.java:1084)
        at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.stop(FileMonitoringTask.java:304)
        at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.stop(DurableTaskStep.java:258)
        at org.jenkinsci.plugins.workflow.cps.CpsThread.stop(CpsThread.java:296)
        at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$6.onSuccess(CpsFlowExecution.java:1139)
        at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$6.onSuccess(CpsFlowExecution.java:1128)
        at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4$1.run(CpsFlowExecution.java:906)
        at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
        at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
        at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
    ... Suppressed exception repeated for each parallel block

And the serialized FlowInterruptedException has:

<causes class="java.util.Arrays$ArrayList">
  <a class="jenkins.model.CauseOfInterruption-array">
    <jenkins.model.CauseOfInterruption_-UserInterruption>
      <user>andrianov</user>
    </jenkins.model.CauseOfInterruption_-UserInterruption>
  </a>
</causes>

Have you been manually aborting the builds, or is something else happening? Either way, it seems like you might have a distinct issue, so it might be better to move the discussion to a new ticket or to the mailing list to help diagnose your remoting problems.

dnusbaum@cloudbees.com (JIRA)

unread,
Oct 23, 2018, 2:41:02 PM10/23/18
to jenkinsc...@googlegroups.com

For anyone still seeing the original exception in the description, try adding a custom logger (Manage Jenkins -> System Log -> Add new log recorder) at FINE level for org.jenkinsci.plugins.workflow.support.concurrent.Timeout so that we see those log messages. The next time you see the issue, add the output of the custom logger here; it will tell us whether my hypothesis is correct, and if so, what code was actually running at the time of the issue.
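For reference, a hypothetical command-line equivalent of that log recorder, using a plain java.util.logging configuration file; the file path is a placeholder, and the in-UI log recorder above remains the simpler route.

```shell
# Write a java.util.logging configuration that enables FINE output for the
# Timeout class named in the comment above.
cat > /tmp/jenkins-logging.properties <<'EOF'
handlers = java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level = FINE
org.jenkinsci.plugins.workflow.support.concurrent.Timeout.level = FINE
EOF

# A controller launched like this would emit the FINE messages (not run here):
#   java -Djava.util.logging.config.file=/tmp/jenkins-logging.properties -jar jenkins.war
grep 'Timeout.level' /tmp/jenkins-logging.properties
```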

spam@ncrmnt.org (JIRA)

unread,
Oct 24, 2018, 11:43:03 AM10/24/18
to jenkinsc...@googlegroups.com
Andrew a commented on Bug JENKINS-46507

Thanks for looking into it. Yes, I have been manually aborting the builds, but only after they were already stuck with the exception. I'll clean up all logs and fire up a fresh regression run this weekend with the FINE logging you've mentioned. Let's see if that gives us any better results.

spam@ncrmnt.org (JIRA)

unread,
Oct 24, 2018, 11:43:04 AM10/24/18
to jenkinsc...@googlegroups.com
Andrew a edited a comment on Bug JENKINS-46507

dnusbaum@cloudbees.com (JIRA)

unread,
Oct 24, 2018, 11:45:03 AM10/24/18
to jenkinsc...@googlegroups.com

Sounds good. When you abort the build, it can change the stored info, so I would collect an archive of the build files prior to aborting it.
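A minimal sketch of snapshotting a build directory before aborting the run; the job name, build number, and directory layout below are placeholders, and the sketch runs against a mock directory so it is self-contained.

```shell
set -euo pipefail

# Mock the $JENKINS_HOME layout so the sketch runs anywhere; on a real
# controller BUILD_DIR would be $JENKINS_HOME/jobs/<job>/builds/<number>.
BUILD_DIR=/tmp/demo-jenkins/jobs/my-pipeline/builds/42
mkdir -p "$BUILD_DIR"
printf '<flow-build/>' > "$BUILD_DIR/build.xml"   # stand-in for the real build files

# Archive the whole build directory, preserving the build-number directory
# name ("42") inside the tarball, before aborting the run changes its state.
tar -czf /tmp/build-42-snapshot.tar.gz -C "$(dirname "$BUILD_DIR")" "$(basename "$BUILD_DIR")"
tar -tzf /tmp/build-42-snapshot.tar.gz
```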

spam@ncrmnt.org (JIRA)

unread,
Oct 28, 2018, 5:22:03 AM10/28/18
to jenkinsc...@googlegroups.com
Andrew a updated an issue
Change By: Andrew a
Attachment: jenkins.log

spam@ncrmnt.org (JIRA)

unread,
Oct 28, 2018, 5:22:09 AM10/28/18
to jenkinsc...@googlegroups.com
Andrew a commented on Bug JENKINS-46507

Before actually starting the job for this weekend, I went to the folks at our IT department and talked about the issue. We looked at the Zabbix logs and could see no problem with network connectivity during the previous runs, and the nodes still had plenty of free resources. The VM running Jenkins was NOT being backed up (VM backup might break/freeze TCP connections). The only thing that might have interfered is the NFS home-directory backup that runs every evening, since that introduces some file-system lag. We'll have another good look at the Zabbix logs on Monday. Meanwhile, here are the FINE exception logs:

Oct 28, 2018 12:58:13 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#222]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=541659 after 10 SECONDS
java.lang.Throwable
	at java.lang.Object.wait(Native Method)
	at hudson.remoting.Request.call(Request.java:177)
	at hudson.remoting.Channel.call(Channel.java:954)
	at hudson.FilePath.act(FilePath.java:1070)
	at hudson.FilePath.act(FilePath.java:1059)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Oct 28, 2018 12:58:14 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#159]: checking /home/jenkins/basis_rtl_regression/build on oryx / waiting for oryx id=541661 after 10 SECONDS
java.lang.Throwable
	at java.lang.Object.wait(Native Method)
	at hudson.remoting.Request.call(Request.java:177)
	at hudson.remoting.Channel.call(Channel.java:954)
	at hudson.FilePath.act(FilePath.java:1070)
	at hudson.FilePath.act(FilePath.java:1059)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Oct 28, 2018 1:26:27 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#227]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=546106 after 10 SECONDS
java.lang.Throwable
	at java.lang.Object.wait(Native Method)
	at hudson.remoting.Request.call(Request.java:177)
	at hudson.remoting.Channel.call(Channel.java:954)
	at hudson.FilePath.act(FilePath.java:1070)
	at hudson.FilePath.act(FilePath.java:1059)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Oct 28, 2018 2:32:40 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#235]: checking /home/jenkins/basis_rtl_regression/build on oryx / waiting for oryx id=556575 after 10 SECONDS
java.lang.Throwable
	at java.lang.Object.wait(Native Method)
	at hudson.remoting.Request.call(Request.java:177)
	at hudson.remoting.Channel.call(Channel.java:954)
	at hudson.FilePath.act(FilePath.java:1070)
	at hudson.FilePath.act(FilePath.java:1059)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Oct 28, 2018 2:32:40 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#234]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=556578 after 10 SECONDS
java.lang.Throwable
	at java.lang.Object.wait(Native Method)
	at hudson.remoting.Request.call(Request.java:177)
	at hudson.remoting.Channel.call(Channel.java:954)
	at hudson.FilePath.act(FilePath.java:1070)
	at hudson.FilePath.act(FilePath.java:1059)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Oct 28, 2018 2:32:40 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#225]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=556579 after 10 SECONDS
java.lang.Throwable
	at java.lang.Object.wait(Native Method)
	at hudson.remoting.Request.call(Request.java:177)
	at hudson.remoting.Channel.call(Channel.java:954)
	at hudson.FilePath.act(FilePath.java:1070)
	at hudson.FilePath.act(FilePath.java:1059)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Oct 28, 2018 2:51:49 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#238]: checking /home/jenkins/basis_rtl_regression/build on oryx / waiting for oryx id=559586 after 10 SECONDS
java.lang.Throwable
	at java.lang.Object.wait(Native Method)
	at hudson.remoting.Request.call(Request.java:177)
	at hudson.remoting.Channel.call(Channel.java:954)
	at hudson.FilePath.act(FilePath.java:1070)
	at hudson.FilePath.act(FilePath.java:1059)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Oct 28, 2018 2:51:49 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#236]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=559587 after 10 SECONDS
java.lang.Throwable
	at java.lang.Object.wait(Native Method)
	at hudson.remoting.Request.call(Request.java:177)
	at hudson.remoting.Channel.call(Channel.java:954)
	at hudson.FilePath.act(FilePath.java:1070)
	at hudson.FilePath.act(FilePath.java:1059)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Oct 28, 2018 2:51:49 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#230]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=559588 after 10 SECONDS
java.lang.Throwable
	at java.lang.Object.wait(Native Method)
	at hudson.remoting.Request.call(Request.java:177)
	at hudson.remoting.Channel.call(Channel.java:954)
	at hudson.FilePath.act(FilePath.java:1070)
	at hudson.FilePath.act(FilePath.java:1059)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Oct 28, 2018 3:00:59 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#232]: checking /home/jenkins/basis_rtl_regression/build on oryx / waiting for oryx id=561098 after 10 SECONDS
java.lang.Throwable
	at java.lang.Object.wait(Native Method)
	at hudson.remoting.Request.call(Request.java:177)
	at hudson.remoting.Channel.call(Channel.java:954)
	at hudson.FilePath.act(FilePath.java:1070)
	at hudson.FilePath.act(FilePath.java:1059)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Oct 28, 2018 3:00:59 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#225]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=561099 after 10 SECONDS
java.lang.Throwable
	at java.lang.Object.wait(Native Method)
	at hudson.remoting.Request.call(Request.java:177)
	at hudson.remoting.Channel.call(Channel.java:954)
	at hudson.FilePath.act(FilePath.java:1070)
	at hudson.FilePath.act(FilePath.java:1059)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Oct 28, 2018 3:00:59 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#221]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=561100 after 10 SECONDS
java.lang.Throwable
	at java.lang.Object.wait(Native Method)
	at hudson.remoting.Request.call(Request.java:177)
	at hudson.remoting.Channel.call(Channel.java:954)
	at hudson.FilePath.act(FilePath.java:1070)
	at hudson.FilePath.act(FilePath.java:1059)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Oct 28, 2018 3:03:24 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#242]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=561455 after 10 SECONDS
java.lang.Throwable
	at java.lang.Object.wait(Native Method)
	at hudson.remoting.Request.call(Request.java:177)
	at hudson.remoting.Channel.call(Channel.java:954)
	at hudson.FilePath.act(FilePath.java:1070)
	at hudson.FilePath.act(FilePath.java:1059)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Oct 28, 2018 3:03:24 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#245]: checking /home/jenkins/basis_rtl_regression/build on chinkara / waiting for chinkara id=561456 after 10 SECONDS
java.lang.Throwable
	at java.lang.Object.wait(Native Method)
	at hudson.remoting.Request.call(Request.java:177)
	at hudson.remoting.Channel.call(Channel.java:954)
	at hudson.FilePath.act(FilePath.java:1070)
	at hudson.FilePath.act(FilePath.java:1059)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Oct 28, 2018 3:03:24 AM FINE org.jenkinsci.plugins.workflow.support.concurrent.TimeoutInterrupting org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#228]: checking /home/jenkins/basis_rtl_regression/build on oryx / waiting for oryx id=561457 after 10 SECONDS
java.lang.Throwable
	at java.lang.Object.wait(Native Method)
	at hudson.remoting.Request.call(Request.java:177)
	at hudson.remoting.Channel.call(Channel.java:954)
	at hudson.FilePath.act(FilePath.java:1070)
	at hudson.FilePath.act(FilePath.java:1059)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:199)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:457)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:416)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Here goes the jenkins.log full of exceptions

jenkins.log

 

spam@ncrmnt.org (JIRA)

unread,
Oct 28, 2018, 5:25:06 AM10/28/18
to jenkinsc...@googlegroups.com
Andrew a edited a comment on Bug JENKINS-46507
Before actually starting the job for this weekend, I went to the folks at our IT department and talked about the issue. We looked at the zabbix logs and could see no problem with network connectivity during the previous runs, and the nodes still had plenty of free resources. The VM running Jenkins was NOT being backed up (VM backup might break/freeze TCP connections). The only thing that might have interfered is the NFS home directory backup that is done every evening, since that will introduce some filesystem lags. We'll have a good look at the zabbix logs once again on Monday. Meanwhile, here are the FINE exception logs:

Here goes the jenkins.log full of exceptions

[^jenkins.log]

  P.S. I haven't yet interrupted the build, so I can collect some more info if you need any. I'll fetch files from nodes later.

spam@ncrmnt.org (JIRA)

unread,
Nov 3, 2018, 3:19:03 AM11/3/18
to jenkinsc...@googlegroups.com
Andrew a commented on Bug JENKINS-46507

A little follow-up. Careful inspection of the zabbix logs showed that the issue occurred while we were performing routine backups of the NFS home directories. As the builds are performed on an NFS share, this may have resulted in filesystem lags. I have applied the following patch (600 s is overkill, but let's see if it helps) and restarted the regression.

{code:java}
+++ b/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java
@@ -451,7 +451,7 @@ public abstract class DurableTaskStep extends Step {
                 return; // slave not yet ready, wait for another day
             }
             TaskListener listener = listener();
-            try (Timeout timeout = Timeout.limit(10, TimeUnit.SECONDS)) {
+            try (Timeout timeout = Timeout.limit(600, TimeUnit.SECONDS)) {
                 if (watching) {
                     Integer exitCode = controller.exitStatus(workspace, launcher(), listener);
                     if (exitCode == null) {
{code}

A question to other folks experiencing the issue: how much RAM/disk IO do your processes take? Could it be that a huge memory-hungry process freezes for a while (>10 seconds) when terminated, syncing data to disk?


spam@ncrmnt.org (JIRA)

unread,
Nov 3, 2018, 3:23:03 AM11/3/18
to jenkinsc...@googlegroups.com
Andrew a edited a comment on Bug JENKINS-46507
A little follow-up. Careful inspection of the zabbix logs showed that the issue occurred while we were performing routine backups of the NFS home directories at night. iowait was still very low, but a little higher than usual. We didn't hit this bug every night, but almost always on the second night. My hypothesis: as the builds are performed on an NFS share, this may have resulted in filesystem lags. I have applied the following patch (600 s is overkill, but let's see if it helps) and restarted the regression.

{code:java}
+++ b/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java
@@ -451,7 +451,7 @@ public abstract class DurableTaskStep extends Step {
                 return; // slave not yet ready, wait for another day
             }
             TaskListener listener = listener();
-            try (Timeout timeout = Timeout.limit(10, TimeUnit.SECONDS)) {
+            try (Timeout timeout = Timeout.limit(600, TimeUnit.SECONDS)) {
                 if (watching) {
                     Integer exitCode = controller.exitStatus(workspace, launcher(), listener);
                     if (exitCode == null) {

{code}
A question to other folks experiencing the issue: how much RAM/disk IO do your processes take? Could it be that a huge memory-hungry process freezes for a while (>10 seconds) when terminated, syncing data to disk?

dnusbaum@cloudbees.com (JIRA)

unread,
Nov 5, 2018, 10:01:03 AM11/5/18
to jenkinsc...@googlegroups.com

Andrew a Are you able to upload the content of the custom logger for org.jenkinsci.plugins.workflow.support.concurrent.Timeout (or was it blank?)? Custom loggers are not included in the normal jenkins.log, so you'll have to add it separately. Along with your patch, it could be interesting to add another custom logger at FINE level for org.jenkinsci.plugins.workflow.steps.durable_task.DurableTask.
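For anyone wiring this up outside the Jenkins UI, a custom FINE-level logger can be attached programmatically via java.util.logging; the sketch below is illustrative plain JUL code (the class name `FineLoggerSetup` is mine, not Jenkins'):

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class FineLoggerSetup {
    // Attach a FINE-capable handler to a named logger. By default JUL
    // handlers filter at INFO, so FINE records (like the Timeout stack
    // traces above) are dropped unless both the logger and a handler
    // are lowered to FINE.
    public static Logger configure(String name) {
        Logger logger = Logger.getLogger(name);
        logger.setLevel(Level.FINE);
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.FINE);
        logger.addHandler(handler);
        return logger;
    }
}
```

In Jenkins itself, the equivalent is a custom log recorder under Manage Jenkins » System Log, which is what keeps these records out of the normal jenkins.log.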

ian.cotton@clearswift.com (JIRA)

unread,
Nov 5, 2018, 10:32:03 AM11/5/18
to jenkinsc...@googlegroups.com

I have the custom logger turned on. I have just added additional logging to it for DurableTask, so there are no results from that yet, just for the Timeout class. We are getting many exceptions like the one below. Our Jenkins master runs on an NFS file system, which ought to be reasonably quick but apparently may have issues.

Interrupting Thread[org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep 78969: checking /var/lib/jenkins/workspace/anches_ert_gw-ert-custard-3-GARHGYOSDUCJ4EHKSORKR3E2RAFWD3A5ZVBAKP6LQ5ZKQZBMWUQA on RedHatSlave3 / waiting for RedHatSlave3 id=14810179,5,main] after 10 SECONDS
java.lang.Throwable
at java.lang.Object.wait(Native Method)
at hudson.remoting.Request.call(Request.java:169)
at hudson.remoting.Channel.call(Channel.java:904)
at hudson.FilePath.act(FilePath.java:987)
at hudson.FilePath.act(FilePath.java:976)
at hudson.FilePath.isDirectory(FilePath.java:1514)
at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.getWorkspace(DurableTaskStep.java:193)
at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:306)
at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:289)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)


dnusbaum@cloudbees.com (JIRA)

unread,
Nov 5, 2018, 11:06:03 AM11/5/18
to jenkinsc...@googlegroups.com
Devin Nusbaum edited a comment on Bug JENKINS-46507
[~icotton64] Thanks for adding the stack trace! If [~ncrmnt] is seeing identical stack traces, then the patch they are testing will not have any effect, because the timeout that is firing is in {{DurableTaskStep#getWorkspace}} [here|https://github.com/jenkinsci/workflow-durable-task-step-plugin/blob/00f94234cfb54181980352f9290ce987edde9b17/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java#L314], before the timeout in {{DurableTaskStep#check}} itself. It could be worth increasing that timeout just to see if it changes the behavior (though my initial guess is that if we don't get a response after 10 seconds, we will never get one). Given that the stuck operation is a remoting call, my guess is that this is a networking issue, perhaps in Jenkins' networking stack (remoting) rather than the network itself, if your system monitoring shows no network problems. Did you see any network-related issues on the machine running Jenkins around the time of the issue?

ian.cotton@clearswift.com (JIRA)

unread,
Nov 5, 2018, 11:19:04 AM11/5/18
to jenkinsc...@googlegroups.com

Thanks Devin Nusbaum. No, I didn't observe any networking-related issues. I monitor both machines using both Nagios and Prometheus. The granularity of monitoring on Prometheus is one minute, and it didn't see any disconnect.

spam@ncrmnt.org (JIRA)

unread,
Nov 6, 2018, 3:02:05 AM11/6/18
to jenkinsc...@googlegroups.com
Andrew a commented on Bug JENKINS-46507

Devin Nusbaum, yes, I've just included those in the code section; see my comments above (apparently JIRA has hidden those already). All of them are related to file operations.

Anyway, I think I have resolved the issue on my setup with this patch (600 seconds is overkill, but I just had to test it):

 

{code:java}
@@ -311,7 +311,7 @@ public abstract class DurableTaskStep extends Step {
                 }
             }
             boolean directory;
-            try (Timeout timeout = Timeout.limit(10, TimeUnit.SECONDS)) {
+            try (Timeout timeout = Timeout.limit(600, TimeUnit.SECONDS)) {
                 directory = ws.isDirectory();
             } catch (Exception x) {
                 getWorkspaceProblem(x);
@@ -451,7 +451,7 @@ public abstract class DurableTaskStep extends Step {
                 return; // slave not yet ready, wait for another day
             }
             TaskListener listener = listener();
-            try (Timeout timeout = Timeout.limit(10, TimeUnit.SECONDS)) {
+            try (Timeout timeout = Timeout.limit(600, TimeUnit.SECONDS)) {
                 if (watching) {
                     Integer exitCode = controller.exitStatus(workspace, launcher(), listener);
                     if (exitCode == null) {
{code}

It seems that the 10-second timeout for file-based operations is what causes the problem under heavy disk/network load. During all three runs with FINE exception logging, the timeouts occurred during routine backups.

I also found the following workaround: a restart of the master will cause the pipeline to resume and correctly reconnect with all the nodes, effectively unfreezing the job.
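For context on where these InterruptedExceptions come from: the Timeout helper seen in the traces wraps a body of work and interrupts the calling thread if it overruns the limit. A simplified, self-contained stand-in for that pattern (my own sketch, not the plugin's actual `org.jenkinsci.plugins.workflow.support.concurrent.Timeout` class) looks like this:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// On construction, schedule an interrupt of the calling thread; close()
// cancels it. If the body (e.g. a remoting call stalled by NFS lag) blocks
// past the limit, the thread is interrupted and the blocked call fails
// with InterruptedException -- the failure mode reported in this issue.
public class SimpleTimeout implements AutoCloseable {
    private static final ScheduledExecutorService SCHEDULER =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "timeout-scheduler");
                t.setDaemon(true);
                return t;
            });
    private final ScheduledFuture<?> interruptTask;

    private SimpleTimeout(long time, TimeUnit unit) {
        final Thread target = Thread.currentThread();
        interruptTask = SCHEDULER.schedule(target::interrupt, time, unit);
    }

    public static SimpleTimeout limit(long time, TimeUnit unit) {
        return new SimpleTimeout(time, unit);
    }

    @Override public void close() {
        interruptTask.cancel(false);
        Thread.interrupted(); // clear a pending interrupt so callers are unaffected
    }
}
```

With a 10-second limit around a FilePath.act call, any filesystem stall longer than 10 s surfaces as the interrupts seen in the logs; raising the limit only helps if the stall eventually clears.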

 

ian.cotton@clearswift.com (JIRA)

unread,
Nov 6, 2018, 7:07:03 AM11/6/18
to jenkinsc...@googlegroups.com

I now get the following exception from the DurableTaskStep log:

Interrupting Thread[org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep 88150: checking /var/lib/jenkins/workspace/GW_GW_Pipeline_Build_trunk-KY5NRFS7BWLOEMDFRPJGO5IAG4Y6DFVRUT333ENYFD5NOOFQINTA on RedHatSlave10 / waiting for RedHatSlave10 id=16567391,5,main] after 10 SECONDS
java.lang.Throwable
at java.lang.Object.wait(Native Method)
at hudson.remoting.Request.call(Request.java:169)
at hudson.remoting.Channel.call(Channel.java:904)
at hudson.FilePath.act(FilePath.java:987)
at hudson.FilePath.act(FilePath.java:976)
at hudson.FilePath.isDirectory(FilePath.java:1514)
at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.getWorkspace(DurableTaskStep.java:193)
at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:306)
at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:289)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

 

So it looks like it is timing out calling the node.

spam@ncrmnt.org (JIRA)

unread,
Nov 6, 2018, 9:39:03 AM11/6/18
to jenkinsc...@googlegroups.com
Andrew a commented on Bug JENKINS-46507

Ian Cotton, can you give my proposed patch with increased timeouts a try?


spam@ncrmnt.org (JIRA)

unread,
Nov 6, 2018, 9:43:02 AM11/6/18
to jenkinsc...@googlegroups.com
Andrew a edited a comment on Bug JENKINS-46507
[~icotton64], can you give my proposed patch with increased timeouts a try?

Here's my pre-compiled version of the plugin for your convenience. (I hope sharing a precompiled plugin is allowed here, if not - sorry) [^workflow-durable-task-step.hpi]

dnusbaum@cloudbees.com (JIRA)

unread,
Nov 6, 2018, 9:44:03 AM11/6/18
to jenkinsc...@googlegroups.com

Andrew a, any chance you are able to play with the timeout to figure out a lower bound of what works for you? I would be fine with increasing the limit to maybe 30-60 seconds by default and providing a system property to increase it further, but I'm not sure about increasing the default beyond that.


ian.cotton@clearswift.com (JIRA)

unread,
Nov 6, 2018, 11:07:04 AM11/6/18
to jenkinsc...@googlegroups.com

Sorry Andrew a, the system we are seeing the issue on is our production system, and I really don't want to risk installing a pre-compiled plugin on it.

spam@ncrmnt.org (JIRA)

unread,
Nov 6, 2018, 11:10:16 AM11/6/18
to jenkinsc...@googlegroups.com
Andrew a commented on Bug JENKINS-46507

Devin Nusbaum I'll recompile the plugin with a 60-second timeout and fire up the next regression tomorrow; expect results somewhere by Friday/Saturday. A facility to override this timeout would be very useful, because (according to zabbix) the iowait fluctuations were barely noticeable during the backup. At higher loads things will get waaay worse, so I'd also put a note somewhere in a README/TROUBLESHOOTING section.


spam@ncrmnt.org (JIRA)

unread,
Nov 6, 2018, 11:16:03 AM11/6/18
to jenkinsc...@googlegroups.com
Andrew a edited a comment on Bug JENKINS-46507
[~dnusbaum] I'll recompile the plugin with a 60 second timeout and fire up the next regression tomorrow, expect results somewhere by friday/saturday. A facility to override this timeout would be very useful, because (according to zabbix) the iowait fluctuations were barely noticeable during backup. At higher loads things will get waaay worse, so I'd also put a note somewhere in README/TROUBLESHOOTING section.  

[~icotton64] No problem, I fully understand. Can you still recompile it yourself with a 30 second timeout (see my patch above) and give it a try? This way we'll provide [~dnusbaum] some results about a suitable timeout faster.

dnusbaum@cloudbees.com (JIRA)

unread,
Nov 6, 2018, 11:42:02 AM11/6/18
to jenkinsc...@googlegroups.com
Devin Nusbaum assigned an issue to Devin Nusbaum
Change By: Devin Nusbaum
Assignee: Jose Blas Camacho Taboada Devin Nusbaum

ian.cotton@clearswift.com (JIRA)

unread,
Nov 6, 2018, 11:56:03 AM11/6/18
to jenkinsc...@googlegroups.com

Andrew a I am not able to build the plugin at the moment; my Jenkins server doesn't have the required plugins. We are setting up some new servers, and hopefully I can use one of them.

dnusbaum@cloudbees.com (JIRA)

unread,
Nov 6, 2018, 12:01:03 PM11/6/18
to jenkinsc...@googlegroups.com

For now, I went ahead and filed https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/90 to allow the timeout to be configured by system property. Feel free to pull down the incremental build of that PR once it is complete for testing if you are already running workflow-durable-task-step 2.26, or review/comment on the PR. Thanks!
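The general shape of such a change is to read the limit from a system property, falling back to the old hardcoded value. A hedged sketch of the idiom (the property name, default, and class name here are illustrative, not necessarily what the PR actually uses):

```java
public class RemoteTimeout {
    // Hypothetical: override the limit at JVM startup with e.g.
    //   -Dorg.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.REMOTE_TIMEOUT=60
    // Long.getLong returns the parsed property value, or the supplied
    // default (10) when the property is unset or unparsable.
    static final long REMOTE_TIMEOUT_SECONDS = Long.getLong(
            "org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.REMOTE_TIMEOUT",
            10L);
    // The step would then use the field where 10 was hardcoded:
    //   try (Timeout timeout = Timeout.limit(REMOTE_TIMEOUT_SECONDS, TimeUnit.SECONDS)) { ... }
}
```

Reading the property once into a static field is the common Jenkins pattern; it keeps the hot path free of repeated property lookups, at the cost of requiring a restart to pick up a new value.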

ian.cotton@clearswift.com (JIRA)

unread,
Nov 7, 2018, 6:17:02 AM11/7/18
to jenkinsc...@googlegroups.com

I realised we are running an old version of the plugin (2.17). Unfortunately, we are also running Jenkins 2.73.2, and plugin versions beyond 2.22 require 2.73.3. We can try the plugin on our newer replacement Jenkins instances, but we don't have the failing builds on those. We are working to improve matters, but I don't think we will be in a position to test this properly for at least a few days.

spam@ncrmnt.org (JIRA)

unread,
Nov 13, 2018, 1:56:04 AM11/13/18
to jenkinsc...@googlegroups.com
Andrew a commented on Bug JENKINS-46507

I've just run a few regressions with that insanely huge timeout, and the bad news is that the problem didn't completely go away. Moreover, two different problems have emerged (I'm now not really sure whether they are directly related to this issue or whether I should open a new ticket; posting everything here for now).

First one:
I'm now seeing a pipeline freezing AFTER all the tasks under the parallel statement have completed. A restart of Jenkins causes some of the steps under parallel to be rerun with the following warning:

Queue item for node block in SoC » RTL_REGRESSION #255 is missing (perhaps JENKINS-34281); rescheduling

But the pipeline completes. I'm also seeing runaway simulation processes that have to be killed by hand. Those kept running after the pipeline completed, perhaps due to a master node restart (and thus preventing further builds in that workspace). I'm not yet sure how to debug this one.

 

Second one:

In an attempt to mitigate another issue (an old ctest on RHEL that doesn't always handle timeouts correctly) I've added a timeout() block inside parallel, and that exposed another filesystem/timeout problem:
Cancelling nested steps due to timeout
Sending interrupt signal to process
Cancelling nested steps due to timeout
After 10s process did not stop
java.nio.file.FileSystemException: /home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/.nfs0000000029ee028d00002716: Device or resource busy
at sun.nio.fs.UnixException.translateToIOException(Unknown Source)
at sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)
at sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)
at sun.nio.fs.UnixFileSystemProvider.implDelete(Unknown Source)
at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(Unknown Source)
at java.nio.file.Files.deleteIfExists(Unknown Source)
at hudson.Util.tryOnceDeleteFile(Util.java:316)
at hudson.Util.deleteFile(Util.java:272)
Also: hudson.remoting.Channel$CallSiteStackTrace: Remote call to taruca
at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1741)
at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
at hudson.remoting.Channel.call(Channel.java:955)
at hudson.FilePath.act(FilePath.java:1070)
at hudson.FilePath.act(FilePath.java:1059)
at hudson.FilePath.deleteRecursive(FilePath.java:1266)
at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.cleanup(FileMonitoringTask.java:340)
at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution$1.run(DurableTaskStep.java:382)
at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Caused: java.io.IOException: Unable to delete '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/.nfs0000000029ee028d00002716'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
at hudson.Util.deleteFile(Util.java:277)
at hudson.FilePath.deleteRecursive(FilePath.java:1303)
at hudson.FilePath.deleteContentsRecursive(FilePath.java:1312)
at hudson.FilePath.deleteRecursive(FilePath.java:1302)
at hudson.FilePath.access$1600(FilePath.java:211)
at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1272)
at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1268)
at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3084)
at hudson.remoting.UserRequest.perform(UserRequest.java:212)
at hudson.remoting.UserRequest.perform(UserRequest.java:54)
at hudson.remoting.Request$2.run(Request.java:369)
at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Sending interrupt signal to process
After 10s process did not stop
java.nio.file.FileSystemException: /home/jenkins/ws/BootromSignoff/build@tmp/durable-e53cb05b/.nfs0000000029ee0a9d00010597: Device or resource busy
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:244)
at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
at java.nio.file.Files.deleteIfExists(Files.java:1165)
at hudson.Util.tryOnceDeleteFile(Util.java:316)
at hudson.Util.deleteFile(Util.java:272)
Also: hudson.remoting.Channel$CallSiteStackTrace: Remote call to oryx
at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1741)
at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
at hudson.remoting.Channel.call(Channel.java:955)
at hudson.FilePath.act(FilePath.java:1070)
at hudson.FilePath.act(FilePath.java:1059)
at hudson.FilePath.deleteRecursive(FilePath.java:1266)
at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.cleanup(FileMonitoringTask.java:340)
at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution$1.run(DurableTaskStep.java:382)
at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

Caused: java.io.IOException: Unable to delete '/home/jenkins/ws/BootromSignoff/build@tmp/durable-e53cb05b/.nfs0000000029ee0a9d00010597'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
at hudson.Util.deleteFile(Util.java:277)
at hudson.FilePath.deleteRecursive(FilePath.java:1303)
at hudson.FilePath.deleteContentsRecursive(FilePath.java:1312)
at hudson.FilePath.deleteRecursive(FilePath.java:1302)
at hudson.FilePath.access$1600(FilePath.java:211)
at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1272)
at hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1268)
at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3084)
at hudson.remoting.UserRequest.perform(UserRequest.java:212)
at hudson.remoting.UserRequest.perform(UserRequest.java:54)
at hudson.remoting.Request$2.run(Request.java:369)
at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
[Pipeline] }
[Pipeline] }
[Pipeline] // timeout
[Pipeline] // timeout
[Pipeline] echo
EXCEPTION: org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
[Pipeline] echo
CTEST BUG: Ctest didn't honor timeout setting?
[Pipeline] }
[Pipeline] echo
EXCEPTION: org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
[Pipeline] echo
CTEST BUG: Ctest didn't honor timeout setting?
[Pipeline] }
[Pipeline] // dir
[Pipeline] // dir
[Pipeline] }
[Pipeline] }
[Pipeline] // node
[Pipeline] // node
[Pipeline] }
[Pipeline] }
sh: line 1: 104849 Terminated              sleep 3
sh: line 1: 163732 Terminated              { while [ ( -d /proc/$pid -o ! -d /proc/$$ ) -a -d '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03' -a ! -f '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/jenkins-result.txt' ]; do
touch '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/jenkins-log.txt'; sleep 3;
done; }
sh: line 1: 163733 Terminated              JENKINS_SERVER_COOKIE=$jsc '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/script.sh' > '/home/jenkins/ws/BootromSignoff/build@tmp/durable-bcad1b03/jenkins-log.txt' 2>&1
1/1 Test #56: rumboot-default-rumboot-Production-bootrom-integration-no-selftest-host-easter-egg ...***Failed  20250.70 sec
It looks like when Jenkins tries to kill off the simulation, it takes way more than 10 seconds (perhaps because the simulator interprets the signal as a crash and starts collecting logs/core dumps, which takes a lot of time). I'll try to patch this timeout as well and see how it goes.
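The kill sequence in the log above ("Sending interrupt signal to process", then "After 10s process did not stop") is the common terminate-then-force pattern; a self-contained sketch of that pattern using the standard Process API (illustrative only, not the durable-task plugin's actual code):

```java
import java.util.concurrent.TimeUnit;

public class GracefulKill {
    // Ask the process to stop (SIGTERM on Unix); if it has not exited within
    // the grace period -- e.g. a simulator busy dumping core and logs --
    // escalate to a forcible kill (SIGKILL). Returns true if it is gone.
    public static boolean stop(Process p, long graceSeconds) throws InterruptedException {
        p.destroy();
        if (p.waitFor(graceSeconds, TimeUnit.SECONDS)) {
            return true; // exited within the grace period
        }
        p.destroyForcibly();
        return p.waitFor(graceSeconds, TimeUnit.SECONDS);
    }
}
```

A process that needs longer than the grace period to flush its state is reported as "did not stop", which is why a hardcoded 10-second grace period is painful for heavyweight simulators.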

 

spam@ncrmnt.org (JIRA)

unread,
Nov 13, 2018, 1:59:04 AM11/13/18
to jenkinsc...@googlegroups.com
Andrew a edited a comment on Bug JENKINS-46507
I've just ran a few regressions with that insanely huge timeout and the bad news is, the problem didn't completely go away. More, 2 different problems have emerged (I'm now not really sure if they are directly related to this issue, or I should open a new ticket. Posting everything here for now)

*First one:*

I'm now seeing a pipeline freezing AFTER all the tasks under parallel statement have completed. A restart of jenkins causes some of the steps under parallel to be rerun with the following warning:
{code:java}

Queue item for node block in SoC » RTL_REGRESSION #255 is missing (perhaps JENKINS-34281); rescheduling
{code}

But the pipeline completes. I'm also seeing runaway simulation processes that have to be killed by hand. Those kept running after the pipeline has been completed, perhaps due to a master node restart (and thus preventing further builds in that workspace). Not yet sure how I should debug this one.



 

*Second one:*


In an attempt to mitigate another issue (an old ctest on RHEL not always handling timeouts correctly), I've added a timeout() block inside parallel, and that exposed another filesystem/timeout problem:


{code:java}


{code}

It looks like when Jenkins is trying to kill off the simulation it takes way more than 10 seconds (perhaps because the simulator interprets the signal as a crash and starts collecting logs/core dumps, which takes a lot of time). I'll try to patch this timeout as well and see how it goes.

 

spam@ncrmnt.org (JIRA)

unread,
Nov 13, 2018, 2:47:04 AM11/13/18
to jenkinsc...@googlegroups.com
Andrew a edited a comment on Bug JENKINS-46507
P.S. I've just updated Jenkins and all plugins, built workflow-durable-task-step-plugin from git, and applied the following patch. I hope 60s timeouts will do nicely.

{code}
diff --git a/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java b/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java
index 9b449d7..b338690 100644
--- a/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java
+++ b/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java

@@ -311,7 +311,7 @@ public abstract class DurableTaskStep extends Step {
                 }
             }
             boolean directory;
-            try (Timeout timeout = Timeout.limit(10, TimeUnit.SECONDS)) {
+            try (Timeout timeout = Timeout.limit(60, TimeUnit.SECONDS)) {

                 directory = ws.isDirectory();
             } catch (Exception x) {
                 getWorkspaceProblem(x);
@@ -374,7 +374,7 @@ public abstract class DurableTaskStep extends Step {
                     stopTask = null;
                     if (recurrencePeriod > 0) {
                     recurrencePeriod = 0;
-                     listener().getLogger().println("After 10s process did not stop");
+                     listener().getLogger().println("After 60s process did not stop");
                     getContext().onFailure(cause);
                     try {
                     FilePath workspace = getWorkspace();
@@ -386,7 +386,7 @@ public abstract class DurableTaskStep extends Step {
                     }
                     }
                     }
-                }, 10, TimeUnit.SECONDS);
+                }, 60, TimeUnit.SECONDS);
                 controller.stop(workspace, launcher());
             } else {
                 listener().getLogger().println("Could not connect to " + node + " to send interrupt signal to process");

@@ -451,7 +451,7 @@ public abstract class DurableTaskStep extends Step {
                 return; // slave not yet ready, wait for another day
             }
             TaskListener listener = listener();
-            try (Timeout timeout = Timeout.limit(10, TimeUnit.SECONDS)) {
+            try (Timeout timeout = Timeout.limit(60, TimeUnit.SECONDS)) {

                 if (watching) {
                     Integer exitCode = controller.exitStatus(workspace, launcher(), listener);
                     if (exitCode == null) {

{code}
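The `Timeout.limit(...)` guards tuned in the patch above bound a potentially slow remote call (e.g. the `ws.isDirectory()` check over a flaky agent channel) with a hard deadline instead of blocking the thread indefinitely. A simplified, self-contained analogue of that pattern — this is a sketch, not the actual Jenkins `Timeout` class:

```java
import java.util.concurrent.*;

// Sketch of a deadline-bounded call: run the work on a helper thread and
// give up (with a TimeoutException) if it overruns the limit.
public class TimeoutSketch {
    static <T> T withTimeout(Callable<T> call, long seconds) throws Exception {
        ExecutorService ex = Executors.newSingleThreadExecutor();
        try {
            return ex.submit(call).get(seconds, TimeUnit.SECONDS);
        } finally {
            ex.shutdownNow(); // interrupt the call if it overran the deadline
        }
    }

    public static void main(String[] args) throws Exception {
        // A fast "remote" check completes well under the 60s limit; a hung
        // one would surface as a TimeoutException instead of a stuck thread.
        System.out.println(withTimeout(() -> "ok", 60));
    }
}
```

As the thread shows, raising the limit only helps when the guarded operation is slow but does eventually finish; a truly hung channel still fails, just later.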


dnusbaum@cloudbees.com (JIRA)

unread,
Jan 14, 2019, 4:39:03 PM1/14/19
to jenkinsc...@googlegroups.com

Andrew a, did those timeouts end up helping? If so, I can roll them up into https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/90 and release that so they can be configured without needing to run custom code.

spam@ncrmnt.org (JIRA)

unread,
Jan 15, 2019, 5:13:03 AM1/15/19
to jenkinsc...@googlegroups.com
Andrew a commented on Bug JENKINS-46507

Devin Nusbaum Sorry for not reporting earlier. The 60-second timeouts seem to have fixed all issues for me. The rest of the problems were due to ctest (and our NUMA scheduler wrapped within it, before the actual simulator) not correctly dying when Jenkins asked them to do so.

dnusbaum@cloudbees.com (JIRA)

unread,
Jan 15, 2019, 9:20:03 AM1/15/19
to jenkinsc...@googlegroups.com

Andrew a No problem! I will move forward with my PR (adding an additional timeout), thanks so much for interactively debugging the issue!

dnusbaum@cloudbees.com (JIRA)

unread,
Jan 31, 2019, 4:33:16 PM1/31/19
to jenkinsc...@googlegroups.com
 

As of version 2.29 of the Pipeline Nodes and Process Plugin, the default timeout for remote calls is 20 seconds, and the value can be configured using the system property org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.REMOTE_TIMEOUT.
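A hedged sketch of how a system-property override like the one above is typically consumed — the property name is from the comment above, and the 20-second default matches the stated release default; the class and method here are illustrative, not the plugin's actual code:

```java
// Sketch: read a tunable timeout from a JVM system property with a
// built-in default, as set via -D<property>=60 on the Jenkins java command line.
public class RemoteTimeoutExample {
    static final String KEY =
        "org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.REMOTE_TIMEOUT";

    static long remoteTimeoutSeconds() {
        // Long.getLong falls back to the default when the property is unset.
        return Long.getLong(KEY, 20L);
    }

    public static void main(String[] args) {
        System.setProperty(KEY, "60"); // simulate passing -D...REMOTE_TIMEOUT=60
        System.out.println(remoteTimeoutSeconds());
    }
}
```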

I am marking this ticket as closed since that was identified as the main cause of the issue in the comment discussion (thanks Andrew a!). If this issue still occurs frequently for someone after increasing that value, please comment and we can investigate further.

Change By: Devin Nusbaum
Status: In Review → Resolved
Resolution: Fixed
Released As: workflow-durable-task-step 2.29

tpoerio@argo.ai (JIRA)

unread,
Oct 18, 2019, 2:31:04 PM10/18/19
to jenkinsc...@googlegroups.com

Hi Devin Nusbaum – when will this fix be released? My team needs it. (Or at least we think we do.)

 

And I saw that this message is from about 9 months back as of this writing.

 

As of right now, the current release is only `2.200` (released 10-14-2019).

 

However, the post immediately above is referencing version `2.29`.

 

Is it possible that this update is already present in `2.200`?

 

If not, when will it become available in a stable release?  Many thanks for the help.
