[JIRA] (JENKINS-48300) Pipeline shell step aborts prematurely with ERROR: script returned exit code -1


svanoort@cloudbees.com (JIRA)

Feb 16, 2018, 2:59:03 PM
Sam Van Oort closed an issue as Fixed
 

Released to the wild now

Change By: Sam Van Oort
Status: Open → Closed
Resolution: Fixed

me@basilcrow.com (JIRA)

Feb 16, 2018, 3:10:02 PM
Basil Crow commented on Bug JENKINS-48300
 

I see that this bug has been closed as fixed, but I'm not sure I'd consider it fixed. I guess that depends on what the scope of this bug is. There were two problems identified in the bug:

  1. JENKINS-47791 introduced a new failure mode that only manifests when using NFS-based workspaces.
  2. This new failure mode produced a poor error message.

If the scope of this bug is both of these issues, then only the second has been fixed. The first issue remains, and this bug shouldn't be marked as fixed.

If the scope of this bug is only the second issue, then a new bug should be filed covering the first issue.

Which of the two is the case?

svanoort@cloudbees.com (JIRA)

Feb 16, 2018, 4:28:03 PM

Basil Crow #2 has been addressed. There is separate work in the proposed solution to https://issues.jenkins-ci.org/browse/JENKINS-37575 (PRs from Jesse that I am reviewing) that should address your issue, so I'm trying to avoid double-tracking the same cluster of issues. The issues are phrased differently (resending output vs. timing out), but the root cause is the same (timing in the communication).

So, that means there's a more comprehensive long-term solution in the works.

Oh, and if you're by any chance using NFS for your master too: the parts of JENKINS-47170 that I've released already will probably benefit you a lot (particularly the PERFORMANCE-OPTIMIZED pipeline mode) – docs are up at https://jenkins.io/doc/book/pipeline/scaling-pipeline/. It should greatly reduce the I/O needs of your master when running Pipelines.
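
For reference, the durability setting can also be chosen per pipeline rather than only globally; in declarative syntax it looks roughly like this (a sketch, see the linked docs for the global and per-branch settings):

pipeline {
    agent any
    options {
        // trade some step-boundary durability for far less master I/O
        durabilityHint('PERFORMANCE_OPTIMIZED')
    }
    stages {
        stage('Build') {
            steps {
                sh 'echo hello'
            }
        }
    }
}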

me@basilcrow.com (JIRA)

Feb 16, 2018, 4:59:03 PM

Sam Van Oort Thanks! I'll start following JENKINS-37575 now. Glad to hear there's a long-term solution in the works.

I did see the PERFORMANCE-OPTIMIZED pipeline mode and am looking forward to trying it out soon.

jenkins@gyoo.com (JIRA)

Feb 26, 2018, 8:15:02 PM

With the folder jenkins/jobs (and thus the log) living on the master, should the JVM parameter -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300 be set on the master JVM (in the start script) or on each slave node JVM (in the node configuration)?


zdmerlin@gmail.com (JIRA)

Feb 27, 2018, 7:55:02 AM

Hi Jean-Paul, the -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300 must be set on the master JVM. Note that I use only 60s, and currently this solves my issues.
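
For example, on a Debian-style install the flag goes into the service defaults (a sketch; the file location and variable name vary by distribution and are an assumption here):

# /etc/default/jenkins (on RPM-based systems, see /etc/sysconfig/jenkins instead)
JAVA_ARGS="$JAVA_ARGS -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=60"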

moritz.baumann@oracle.com (JIRA)

Apr 24, 2018, 8:50:05 AM

Sam Van Oort:
We're periodically running into this even though we don't use NFS on either the master or the slaves and even though we're using the fastest durability setting, so I did some research. It looks like between two heartbeat checks, there are a lot of network I/O operations between master and slave which can easily cause a timeout, even without NFS. Therefore, the current error message was extremely misleading in our case.

At the very least, the error message should be changed to make people aware that the heartbeat timestamps are compared on the Jenkins master and that there are a lot of other network operations happening in between those two heartbeat checks. Without a code review of both plugins involved (Durable Task, Durable Task Step), I would have never figured that out.

But I'm also questioning whether the defaults are sensible at all. Why should Jenkins assume that the shell process is dead just because a bunch of network operations between master and slave took more than 15 seconds to complete? That's an awfully short time span.

Please reconsider the default value for this. I think something in the order of minutes might be more reasonable; short-term network congestion can happen from time to time and shouldn't cause builds to fail.


moritz.baumann@oracle.com (JIRA)

Apr 24, 2018, 9:12:02 AM

Sorry for my earlier comment (which I deleted); I misunderstood the logic in ShellController::exitStatus when I first glanced over it. I will try to better understand what's happening in my case (why we're getting failures even though we're using a local hard disk) and comment here and/or open a new issue once I've actually understood the problem.

federicon@al.com.au (JIRA)

Jun 11, 2018, 8:59:03 PM
Federico Naum updated an issue
 
Change By: Federico Naum
A few of my Jenkins pipelines failed last night with this failure mode:

{noformat}
01:19:19 Running on blackbox-slave2 in /var/tmp/jenkins_slaves/jenkins-regression/path/to/workspace.   [Note: this is an SSH slave]
[Pipeline] {
[Pipeline] ws
01:19:19 Running in /net/nas.delphix.com/nas/regression-run-workspace/jenkins-regression/workspace@10. [Note: This is an NFS share on a NAS]
[Pipeline] {
[Pipeline] sh
01:20:10 [qa-gate] Running shell script
[... script output ...]
01:27:19 Running test_create_domain at 2017-11-29 01:27:18.887531...
[Pipeline] // dir
[Pipeline] }
[Pipeline] // ws
[Pipeline] }
[Pipeline] // node
[Pipeline] }
[Pipeline] // timestamps
[Pipeline] }
[Pipeline] // timeout

ERROR: script returned exit code -1
Finished: FAILURE
{noformat}

As far as I can tell the script was running fine, but apparently Jenkins killed it prematurely because Jenkins didn't think the process was still alive.

The interesting thing is that this normally works, but it failed last night at exactly the same time in multiple pipeline jobs. And I only started seeing this after upgrading {{durable-task-plugin}} from 1.14 to 1.17. I looked at the code change and saw that the main change was in {{ProcessLiveness}}: from a {{ps}}-based liveness check to a timestamp-based one. What I suspect is that the NFS server on which this workspace is hosted wasn't processing I/O operations fast enough at the time this problem occurred, so the timestamp wasn't updated even though the script continued running. Note that I am not using Docker here; this is just a regular SSH slave.

The ps-based approach may have been suboptimal, but it was more reliable for us than the new timestamp-based approach, at least when using NFS-based workspaces. Expecting a timestamp to increase on a file every 15 seconds may be a tall order for some system and network administrators, especially over NFS -- network issues can and do happen, and they shouldn't take down Jenkins jobs when they do. Our Jenkins jobs used to just hang when there was an NFS outage; now the script liveness check kills the job. I view this as a regression. As flawed as the old approach may have been, it was immune to this failure mode. Is there anything I can do here besides increasing various timeouts to avoid hitting this? The fact that no diagnostic information was printed to the Jenkins log or the SSH slave remoting log is also problematic here.

rodrigc@FreeBSD.org (JIRA)

Aug 3, 2018, 3:28:02 PM
Craig Rodrigues commented on Bug JENKINS-48300
 

If you see problems like this, I recommend that you go to Manage Jenkins -> System Log, and create a logger which logs all events for org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.

This will print out more debug statements and help diagnose the problem.
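
If you prefer the Script Console, a rough equivalent is the following sketch (the System Log UI is still the easier way to actually view the records, since it attaches a log recorder for you):

import java.util.logging.Level
import java.util.logging.Logger

// raise the level of the DurableTaskStep logger so FINE/FINER messages pass the level check
Logger.getLogger('org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep').setLevel(Level.ALL)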

 

There is additional logging in https://github.com/jenkinsci/workflow-basic-steps-plugin/blob/stable/src/main/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepExecution.java#L177

 

That logging is enabled if you create the logger above, and can help identify the problem.


rodrigc@FreeBSD.org (JIRA)

Aug 3, 2018, 3:59:02 PM

I ran into the same problem as others did.

I am using Durable Task Plugin 1.23.

I have a Pipeline which takes a very long time, and looks like this:

#!groovy
pipeline {
    agent {
        label 'CRAIG-1'
    }
    options {
        disableConcurrentBuilds()
        timeout(time: 10, unit: 'HOURS')
    }
    parameters {
        booleanParam(name: 'UPDATE_PARAMETERS',
                     defaultValue: false,
                     description: 'Update the parameters from this pipeline script')
        string(defaultValue: 'master',
               description: 'branch',
               name: 'BRANCH')
    }
    stages {
        stage("Display build parameters") {
            steps {
                script {
                    /*
                     * Print out the build parameters
                     */
                    def all_params = ""
                    for ( k in params ) {
                        all_params = all_params + "${k.key}=${k.value}\n"
                    }
                    print("These parameters were passed to this build:\n" + all_params)
                    writeFile(file: "env-vars.txt", text: "$all_params")
                }
            }
        }
        /*
         * Jenkins needs to parse the entire pipeline before it can
         * parse the parameters, if the parameters are specified in this file.
         */
        stage("Updating Parameters") {
            when {
                expression {
                    params.UPDATE_PARAMETERS == true
                }
            }
            steps {
                script {
                    currentBuild.result = 'ABORTED'
                    error('DRY RUN COMPLETED. JOB PARAMETERIZED.')
                }
            }
        }
        stage("First") {
            steps {
                dir("dir1") {
                    git(url: 'https://github.com/twisted/twisted')
                    sh("""
                       echo do some stuff
                       """)
                }
            }
        }
        stage("Second") {
            steps {
                dir("dir2") {
                    git(url: 'https://github.com/twisted/twisted')
                    sh("""
                       echo do more stuff
                       """)
                }
            }
        }
        stage("Third: Takes a long time, over 1.5 hours") {
            steps {
                sh("""
                   echo this operation takes a long time
                   """)
            }
            post {
                always {
                    junit "report.xml"
                }
            }
        }
    }
    post {
        failure {
            slackSend(channel: '#channel-alerts', color: '#FF0000', message: "FAILED: Job '${env.JOB_NAME} started by ${env.CAUSEDBY} [${env.BUILD_NUMBER}]' (${env.RUN_DISPLAY_URL})")
        }
        changed {
            script {
                /*
                 * Only send e-mails on failures, or when status changes from failure
                 * to success, or success to failure.
                 * This requires currentBuild.result to be set.
                 *
                 * See: https://baptiste-wicht.com/posts/2017/06/jenkins-tip-send-notifications-fixed-builds-declarative-pipeline.html
                 */
                def prevBuild = currentBuild.getPreviousBuild()
                /*
                 * If this pipeline has never run before, then prevBuild will be null.
                 */
                if (prevBuild == null) {
                    return
                }
                def prevResult = prevBuild.getResult()
                def result = currentBuild.getResult()
                if ("${prevResult}" != "${result}" && "${result}" != "FAILURE") {
                    if ("${prevResult}" == "FAILURE") {
                        slackSend(channel: '#smoketest-alerts', color: 'good', message: "SUCCEEDED: Job '${env.JOB_NAME} [${env.BUILD_NUMBER}]' (${env.RUN_DISPLAY_URL})")
                    }
                }
            }
        }
    }
}

In Manage Jenkins -> System Log I enabled a logger to log ALL events for org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep

After running my pipeline for 2 hours, Jenkins terminated the pipeline, and I saw this in the log:

Post stage
wrapper script does not seem to be touching the log file in /root/workspace/workspace/PX-TEST-STABLE@tmp/durable-502ca4bd
(JENKINS-48300: if on a laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300)         

rodrigc@FreeBSD.org (JIRA)

Aug 3, 2018, 4:01:01 PM

Is there a way to specify -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300 without modifying the invocation of java which starts the Jenkins master?

I am running Jenkins using the jenkins-lts docker image, and it is a pain to modify the startup command-line unless I build my own docker image running jenkins-lts.
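
Maybe the image's JAVA_OPTS environment variable would do it; a sketch (untested here, assuming the official jenkins/jenkins image):

docker run -e JAVA_OPTS="-Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300" \
    -p 8080:8080 -p 50000:50000 jenkins/jenkins:lts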



rodrigc@FreeBSD.org (JIRA)

Aug 4, 2018, 12:23:02 PM

The workaround I tried was to go to Manage Jenkins -> System Console

then I entered:

 

System.setProperty("org.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL", 36000)

 

I then ran my pipeline, and it wasn't terminated.

Is there a way I can do this inside the pipeline?

rodrigc@FreeBSD.org (JIRA)

Aug 6, 2018, 3:38:02 PM

I was able to do this in my pipeline, and it worked:

script {
   System.setProperty("org.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL", "3800");
}

I had to approve the setProperty call in the security settings at Manage Jenkins > In-process Script Approval, but it worked.


sverre.moe@gmail.com (JIRA)

Aug 7, 2018, 10:28:03 AM

We are using durable-task-plugin 1.23, but are still seeing this problem. According to the changelog it was fixed in version 1.18.

A few (4-5) weeks ago we didn't have this problem; then yesterday we upgraded Jenkins and all our plugins. Now it fails building on Windows.

[Native master-windows7-x86_64] wrapper script does not seem to be touching the log file in C:\cygwin64\home\build\jenkins\workspace\applicationA_sverre_work-3U54DPE57F6TMOZM2O6QBWDQ2LNRU2QHAXT6INC3UPGWF2ERMXAQ@tmp\durable-0ead6a5b
[Native master-windows7-x86_64] (JENKINS-48300: if on a laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300)

Is the workaround mentioned above the actual fix that went into version 1.18? We had been using the plugin for months without seeing a problem. We are not seeing the problem now that we have started Jenkins with that system property.

bartek.kania@sharespine.com (JIRA)

Aug 8, 2018, 3:41:02 AM

I get the same problem as Sverre Moe above as of version 1.23 on Windows build slaves.

Didn't have any problems before.

System.setProperty("org.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL", "3800");

Seems to work around it for me.


rodrigc@FreeBSD.org (JIRA)

Aug 8, 2018, 5:03:02 PM

@dwnusbaum can you take a look at this? This seems to be affecting a few people, and the workaround seems to be to set org.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL to some really high value.

eric@blackbagtech.com (JIRA)

Aug 8, 2018, 8:27:02 PM
E H commented on Bug JENKINS-48300

I ran into this as well; durable-task 1.25 appears to resolve the problem. Thanks much for the quick fix.

michael@stieler.it (JIRA)

Aug 21, 2018, 2:40:02 PM

I seem to have the same problem with durable-task 1.25. Trying to build an (empty) Spring Boot app using the Artifactory Maven plugin on a Kubernetes slave, the build script was aborted with a link to this issue. After setting the property in the pipeline script (which fixed it), I noticed that there is a rather long time between two log statements:

 

17:36:55.137 [DEBUG] [org.gradle.initialization.DefaultGradlePropertiesLoader] Found system project properties: []
>> Without the increased timeout, the pipeline was aborted here before the next log entry
17:36:57.323 [DEBUG] [org.gradle.internal.operations.DefaultBuildOperationExecutor] Build operation 'Apply script settings.gradle to settings 'ci-test'' started

 

I am not using NFS but of course the build slave is a virtual machine which will be a little slower.

jglick@cloudbees.com (JIRA)

Aug 21, 2018, 2:51:34 PM
Jesse Glick reopened an issue
 

Do not attempt to call System.setProperty from within a sandboxed script. Whitelisting this would constitute a possibly severe vulnerability.

The default value for the check interval should likely be higher. Still, presence of this issue suggests that there is something wrong with the agent’s filesystem, or that control processes are being killed.

Not sure why this was marked fixed. PR 57 merely added better logging; it did not change the behavior otherwise.

Change By: Jesse Glick
Resolution: Fixed (cleared)
Status: Closed → Reopened

rodrigc@FreeBSD.org (JIRA)

Aug 21, 2018, 3:00:02 PM

Michael Cornel Can you add a new logger to your Jenkins system by navigating to Manage Jenkins -> System Log and creating a new logger for org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep set to ALL?

Then re-run your pipeline and look for debugging messages that highlight where the problem might be.

That logger is defined here: https://github.com/jenkinsci/workflow-durable-task-step-plugin/blob/master/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java#L71

jglick@cloudbees.com (JIRA)

Aug 21, 2018, 3:07:06 PM

durable-task PR 81 at least increases the grace period, pending some determination of root cause by users seeing this issue.

jglick@cloudbees.com (JIRA)

Aug 21, 2018, 3:07:07 PM
Jesse Glick started work on Bug JENKINS-48300
 
Change By: Jesse Glick
Status: Reopened → In Progress

jglick@cloudbees.com (JIRA)

Aug 21, 2018, 3:15:02 PM

Craig Rodrigues that logger is unlikely to be helpful in this case. Really there is no Java logging that is very pertinent to this issue (a -1 return status which vanishes iff HEARTBEAT_CHECK_INTERVAL is increased)—all the meaningful messages are already sent to the build log as of PR 57.

Root cause diagnosis would involve using an interactive shell to somehow figure out why jenkins-log.txt is not getting touched at least every three seconds. (More often when there is active log output from the user process.) Possibly it is getting touched, but the agent JVM in BourneShellScript.exitStatus is not seeing the right timestamp, or is somehow misinterpreting what it sees; or perhaps one of the two controller sh processes (the one usually inside sleep 3) has been killed by something (such as was claimed in JENKINS-50892).
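
For example, one could watch the file from a shell on the agent while the step runs (a sketch; the workspace path and the per-step durable-* directory name are illustrative):

while sleep 1; do
    stat -c '%y %n' /path/to/workspace@tmp/durable-*/jenkins-log.txt
done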

michael@stieler.it (JIRA)

Aug 21, 2018, 3:41:02 PM

Ok, it took me a while to actually reproduce this error message.

I have to:

  • Start Gradle manually, so in my case sh('./gradlew -d --no-daemon clean bootJar') – it does not occur if I start the Gradle build using the Artifactory Jenkins plugin
  • Configure the Kubernetes build slave with a memory limit of 512Mi, which seems not to be enough and results in a (silent) out-of-memory problem

So it appears that durable-task is not actually aborting the build but correctly detects that the process is not running any more. Maybe the error message could give a hint that the most probable reason for the log file not being touched is that the wrapper script has actually died?
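
(For reference, the limit in question is the ordinary Kubernetes container resource limit, roughly this fragment of the pod spec -- illustrative only:)

resources:
  limits:
    memory: "512Mi"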


jglick@cloudbees.com (JIRA)

Aug 21, 2018, 4:32:02 PM

Michael Cornel that is useful information indeed. If I understand correctly, some out of memory condition is resulting in something (Kubernetes? Docker? the Linux kernel?) deciding to just kill off processes such as the wrapper script. Is the agent JVM also being killed? Whatever the case, Jenkins is then behaving appropriately in marking the sh step as a failure (the -1 pseudo exit code represents the fact that the actual exit code of the process is unknown and something fundamental went wrong), but is not clearly explaining the real problem.

michael@stieler.it (JIRA)

Aug 21, 2018, 5:08:01 PM

Almost. As far as I understand, Kubernetes simply "translates" the resource limit configuration and applies it when starting the Docker containers. I am pretty sure that I saw an IOException: Out of memory with a Gradle stacktrace during one of the builds. Thus, Docker is probably not killing the process but just preventing it from allocating more memory.

I would expect the Gradle JVM to exit with a non-zero exit code and the agent to recognize this and immediately mark the build as failed. I don't know if it does, or what happens to the agent JVM and so on, though.

jglick@cloudbees.com (JIRA)

Aug 21, 2018, 11:33:03 PM

Possibly the container is so hosed that just trying to fork sleep 3 from the controller process fails.

irc@webratz.de (JIRA)

Aug 29, 2018, 5:17:02 AM
Andreas Sieferlinger assigned an issue to Andreas Sieferlinger
 
Change By: Andreas Sieferlinger
Assignee: Jesse Glick → Andreas Sieferlinger

jglick@cloudbees.com (JIRA)

Aug 29, 2018, 9:18:07 AM
Jesse Glick stopped work on Bug JENKINS-48300
 
Change By: Jesse Glick
Status: In Progress → Open

dnusbaum@cloudbees.com (JIRA)

Sep 25, 2018, 10:14:03 AM
Change By: Devin Nusbaum
Status: Fixed but Unreleased → Resolved
Released As: durable-task 1.26

dnusbaum@cloudbees.com (JIRA)

Sep 25, 2018, 10:17:02 AM
Devin Nusbaum commented on Bug JENKINS-48300
 

The fix that increases the default heartbeat interval to 5 minutes was just released in version 1.26 of the Durable Task Plugin.
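
(To check which plugin version you are actually on, the Script Console works; a sketch:)

println(jenkins.model.Jenkins.instance.pluginManager.getPlugin('durable-task')?.version)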

ByteEnable@outlook.com (JIRA)

Oct 1, 2018, 2:33:02 PM

I am experiencing this issue randomly. The durable-task plugin is at version 1.26. I am running the agent on a 10GbE port that is dedicated to Jenkins.

Cannot contact XXXXXXX: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on JNLP4-connect connection from 10.10.11.205/10.10.11.205:54092 failed. The channel is closing down or has closed down
wrapper script does not seem to be touching the log file in /XXXXXXXX@tmp/durable-7e71b4e1
(JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)

jglick@cloudbees.com (JIRA)

Oct 1, 2018, 3:42:04 PM

Unlikely to be helpful. The problem here is likely a Remoting channel outage, which is not a Pipeline issue. Diagnosing those is tricky.

ByteEnable@outlook.com (JIRA)

Oct 1, 2018, 4:41:03 PM

I just hit it again. I added the logging as requested earlier, and the log was empty. The plugin is stating that the heartbeat interval should be set to 86400; if that is in seconds, then that is 24 hours. I am running Jenkins on Ubuntu inside a Hyper-V VM. I have 32 GB of memory assigned, but it is only using 5 GB, with 12 CPUs assigned as well.

However, I am using rsync to download some files in my scripts from the server Jenkins is running on.  I had three other Pipeline scripts running at the time in various stages when this occurred.  I was also running top on the Ubuntu VM.  I noticed that rsync was at 90% and the load jumped to around 1.01.

Cannot contact XXXXXXX: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on JNLP4-connect connection from 10.10.11.205/10.10.11.205:56958 failed. The channel is closing down or has closed down
wrapper script does not seem to be touching the log file in XXXXXXXXXXX@tmp/durable-c5848708
(JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)

worldstream@protonmail.com (JIRA)

Oct 5, 2018, 3:08:05 AM
J S reopened an issue
 

I don't think this problem is solved. I have a Red Hat 7 Jenkins master with the corresponding Red Hat 7 slaves. The Jenkins Durable Task plugin version is 1.26, the latest. I have the latest LTS version of Jenkins, 2.138.1, and recently I have the following problem:

wrapper script does not seem to be touching the log file in /data/build/workspace/ro-TWADR5DH34OARMVNOXJQZ74HP4G7QAQ@2/build@tmp/durable-0a842734
(JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)
script returned exit code -1

The step failed after 10 minutes with the above error. How can I set the durable task in the Jenkinsfile to abort only after 60 minutes or 2 hours? Can I set this anywhere in Jenkins? Could someone please write a tutorial about this? I think the problem has been occurring a lot since the last update.

Change By: J S
Resolution: Fixed (cleared)
Status: Resolved → Reopened

jglick@cloudbees.com (JIRA)

Oct 5, 2018, 10:24:06 AM
Jesse Glick resolved as Fixed
 

J S no, this cannot be set per Jenkinsfile, only via system property, as it is merely an escape hatch for a system with a very laggy filesystem. Anyway, if the log file received no touch after the new default of 5m (perhaps ×2), it probably never was going to. Something is broken in your system. I cannot diagnose the exact problem for you, though in your case it does not sound like a broken Remoting channel. Possibly the watcher process was killed off by something. I have heard of cases where the Linux kernel, running under low memory conditions, starts killing processes at random.

Change By: Jesse Glick
Status: Reopened → Resolved
Resolution: Fixed

ByteEnable@outlook.com (JIRA)

Oct 5, 2018, 11:14:03 AM

The kernel invokes the OOM (out of memory) killer when swap space is filled and memory allocations keep happening, such as with a memory leak. This was around RHEL 6. The issue is not fixed. I did not experience this issue until upgrading to the latest version recently. What is a "laggy filesystem"? I/O is blocked? System under heavy load?

jglick@cloudbees.com (JIRA)

Oct 5, 2018, 11:32:03 AM

The “laggy filesystem” issue pertains to a failure of a watcher process to touch a process log file while the process is idle, or the failure of the Jenkins agent JVM to see/interpret that timestamp. There could be many causes of that, such as a very slow network filesystem. The fix referenced in this issue was just a fix for a particular root cause: it made the grace period very long, so any filesystem which is still functioning at all should not have that issue. Exit codes of -1 from a sh step can be traced ultimately to many, many causes, such as problems with file permissions when using containers, processes being abruptly killed off by the kernel, the system having been rebooted, etc. If there are other conditions in which a -1 exit code is returned improperly—i.e., the process actually did finish with some real exit code but Jenkins failed to either notice it or display diagnostics—then those would be other issues. I cannot attempt to guess at the root cause encountered by a particular user in a particular condition. In general these things need to be tracked down by logging in to the agent machine and inspecting what is actually going on in the durable task control directory vs. what is happening with the user process (usually, but not necessarily, sh) and the two control processes (always sh).
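
As a concrete starting point, something like this on the agent (a sketch; the exact file names come from durable-task and may vary by version, and the workspace path is illustrative):

# inspect the control directory for the running step
ls -la /path/to/workspace@tmp/durable-*/      # expect something like script.sh, jenkins-log.txt, jenkins-result.txt
# look for the user process and the two controller sh processes
ps -ef | grep -w sh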

wuguohua.5281@bytedance.com (JIRA)

Oct 11, 2018, 4:05:02 AM
Guohua Wu updated an issue
 
Change By: Guohua Wu
Attachment: image-2018-10-11-16-04-23-478.png

wuguohua.5281@bytedance.com (JIRA)

Oct 11, 2018, 4:06:02 AM
Guohua Wu commented on Bug JENKINS-48300
 

I met the same issue recently after upgrading the durable-task plugin. Here's the error message (see the screenshot attached above):

My durable-task plugin version is 1.26, the latest.

I wonder if there is any workaround for this issue.

brian.murrell@intel.com (JIRA)

Oct 12, 2018, 12:32:03 PM

The workaround I tried was to go to Manage Jenkins -> System Console

Did you mean Script Console (/script)?

System.setProperty("org.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL", 36000)

 When I did that I got:

groovy.lang.MissingMethodException: No signature of method: static java.lang.System.setProperty() is applicable for argument types: (java.lang.String, java.lang.Integer) values: [org.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL, ...]
Possible solutions: setProperty(java.lang.String, java.lang.String), getProperty(java.lang.String), getProperty(java.lang.String, java.lang.String), hasProperty(java.lang.String), getProperties(), getProperties()
	at groovy.lang.MetaClassImpl.invokeStaticMissingMethod(MetaClassImpl.java:1501)
	at groovy.lang.MetaClassImpl.invokeStaticMethod(MetaClassImpl.java:1487)
	at org.codehaus.groovy.runtime.callsite.StaticMetaClassSite.call(StaticMetaClassSite.java:53)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:133)
	at Script1.run(Script1.groovy:1)
	at groovy.lang.GroovyShell.evaluate(GroovyShell.java:585)
	at groovy.lang.GroovyShell.evaluate(GroovyShell.java:623)
	at groovy.lang.GroovyShell.evaluate(GroovyShell.java:594)
	at hudson.util.RemotingDiagnostics$Script.call(RemotingDiagnostics.java:142)
	at hudson.util.RemotingDiagnostics$Script.call(RemotingDiagnostics.java:114)
	at hudson.remoting.LocalChannel.call(LocalChannel.java:45)
	at hudson.util.RemotingDiagnostics.executeGroovy(RemotingDiagnostics.java:111)
	at jenkins.model.Jenkins._doScript(Jenkins.java:4381)
	at jenkins.model.Jenkins.doScript(Jenkins.java:4352)
	at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:627)
	at org.kohsuke.stapler.Function$MethodFunction.invoke(Function.java:343)
	at org.kohsuke.stapler.Function.bindAndInvoke(Function.java:184)
	at org.kohsuke.stapler.Function.bindAndInvokeAndServeResponse(Function.java:117)
	at org.kohsuke.stapler.MetaClass$1.doDispatch(MetaClass.java:129)
	at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:58)
	at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:734)
	at org.kohsuke.stapler.Stapler.invoke(Stapler.java:864)
	at org.kohsuke.stapler.Stapler.invoke(Stapler.java:668)
	at org.kohsuke.stapler.Stapler.service(Stapler.java:238)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:860)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1650)
	at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:154)
	at org.jenkinsci.plugins.ssegateway.Endpoint$SSEListenChannelFilter.doFilter(Endpoint.java:225)
	at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
	at io.jenkins.blueocean.auth.jwt.impl.JwtAuthenticationFilter.doFilter(JwtAuthenticationFilter.java:61)
	at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
	at io.jenkins.blueocean.ResourceCacheControl.doFilter(ResourceCacheControl.java:134)
	at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
	at hudson.plugins.scm_sync_configuration.extensions.ScmSyncConfigurationFilter$1.call(ScmSyncConfigurationFilter.java:49)
	at hudson.plugins.scm_sync_configuration.extensions.ScmSyncConfigurationFilter$1.call(ScmSyncConfigurationFilter.java:44)
	at hudson.plugins.scm_sync_configuration.ScmSyncConfigurationDataProvider.provideRequestDuring(ScmSyncConfigurationDataProvider.java:106)
	at hudson.plugins.scm_sync_configuration.extensions.ScmSyncConfigurationFilter.doFilter(ScmSyncConfigurationFilter.java:44)
	at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
	at net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:239)
	at net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:215)
	at net.bull.javamelody.PluginMonitoringFilter.doFilter(PluginMonitoringFilter.java:88)
	at org.jvnet.hudson.plugins.monitoring.HudsonMonitoringFilter.doFilter(HudsonMonitoringFilter.java:114)
	at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
	at hudson.util.PluginServletFilter.doFilter(PluginServletFilter.java:157)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637)
	at hudson.security.csrf.CrumbFilter.doFilter(CrumbFilter.java:99)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637)
	at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:84)
	at hudson.security.UnwrapSecurityExceptionFilter.doFilter(UnwrapSecurityExceptionFilter.java:51)
	at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
	at jenkins.security.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:117)
	at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
	at org.acegisecurity.providers.anonymous.AnonymousProcessingFilter.doFilter(AnonymousProcessingFilter.java:125)
	at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
	at org.acegisecurity.ui.rememberme.RememberMeProcessingFilter.doFilter(RememberMeProcessingFilter.java:142)
	at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
	at org.acegisecurity.ui.AbstractProcessingFilter.doFilter(AbstractProcessingFilter.java:271)
	at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
	at jenkins.security.BasicHeaderProcessor.doFilter(BasicHeaderProcessor.java:93)
	at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
	at org.acegisecurity.context.HttpSessionContextIntegrationFilter.doFilter(HttpSessionContextIntegrationFilter.java:249)
	at hudson.security.HttpSessionContextIntegrationFilter2.doFilter(HttpSessionContextIntegrationFilter2.java:67)
	at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
	at hudson.security.ChainedServletFilter.doFilter(ChainedServletFilter.java:90)
	at hudson.security.HudsonFilter.doFilter(HudsonFilter.java:171)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637)
	at org.kohsuke.stapler.compression.CompressionFilter.doFilter(CompressionFilter.java:49)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637)
	at hudson.util.CharacterEncodingFilter.doFilter(CharacterEncodingFilter.java:82)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637)
	at org.kohsuke.stapler.DiagnosticThreadNameFilter.doFilter(DiagnosticThreadNameFilter.java:30)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:190)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:168)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:166)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.eclipse.jetty.server.Server.handle(Server.java:530)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:347)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:256)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
	at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:247)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:140)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:382)
	at winstone.BoundedExecutorService$1.run(BoundedExecutorService.java:77)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
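
Presumably the value needs to be passed as a String rather than an Integer (the "Possible solutions" line above shows the setProperty(String, String) signature), i.e.:

System.setProperty("org.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL", "36000")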
 

brian.murrell@intel.com (JIRA)

Oct 14, 2018, 9:06:03 AM

Jesse Glick Your explanation above is great for people who understand the internals of Jenkins and Pipeline and how durability works, but it doesn't leave the "layman" (i.e. Jenkins user) a lot to debug with.

Where is this "laggy filesystem"?  On the agent I am gathering?  How exactly is this laggyness being measured?  What would I have to do when logged on to the agent to see what Jenkins is doing to determine "laggy filesystem"?

brian.murrell@intel.com (JIRA)

Oct 14, 2018, 10:36:03 AM

I've added -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=3600 to my Java command line but am still getting this error in my jobs.

Here is my entire java command line:

java -Dcom.sun.akuma.Daemon=daemonized -Djava.awt.headless=true -DsessionTimeout=8000 -Xms4g -Xmx8g -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication -XX:+UnlockDiagnosticVMOptions -XX:G1SummarizeRSetStatsPeriod=1 -Xloggc:/var/log/jenkins/gc-%t.log -XX:NumberOfGCLogFiles=5 -XX:+UseGCLogFileRotation -XX:GCLogFileSize=30m -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCCause -XX:+PrintTenuringDistribution -XX:+PrintReferenceGC -XX:+PrintAdaptiveSizePolicy -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=12345 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -DJENKINS_HOME=/var/lib/jenkins -jar /usr/lib/jenkins/jenkins.war --logfile=/var/log/jenkins/jenkins.log --webroot=/var/cache/jenkins/war --daemon --webroot=/var/lib/jenkins/war --httpsPort=-1 --httpPort=8080 --ajp13Port=-1 -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=3600

 When I put the -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=3600 before the -jar flag as such:

java -Djava.awt.headless=true -DsessionTimeout=8000 -Xms4g -Xmx8g -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication -XX:+UnlockDiagnosticVMOptions -XX:G1SummarizeRSetStatsPeriod=1 -Xloggc:/var/log/jenkins/gc-%t.log -XX:NumberOfGCLogFiles=5 -XX:+UseGCLogFileRotation -XX:GCLogFileSize=30m -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCCause -XX:+PrintTenuringDistribution -XX:+PrintReferenceGC -XX:+PrintAdaptiveSizePolicy -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=12345 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=3600 -DJENKINS_HOME=/var/lib/jenkins -jar /usr/lib/jenkins/jenkins.war --logfile=/var/log/jenkins/jenkins.log --webroot=/var/cache/jenkins/war --daemon --webroot=/var/lib/jenkins/war --httpsPort=-1 --httpPort=8080 --ajp13Port=-1

 Jenkins just doesn't start. The java process starts and runs but nothing is added to jenkins.log and nothing is listening on the web interface.

Any ideas?

jglick@cloudbees.com (JIRA)

Oct 15, 2018, 10:18:02 AM

Brian J Murrell yes my explanation was about how to start diagnosing issues in this class, given sufficient knowledge of Jenkins internals. The result of such a diagnosis would be understanding of one new kind of environmental problem that leads to this symptom, and thus a new issue report and an idea for a product patch to either recover automatically or display a user-friendly error. If you are encountering this error on current versions of durable-task, it is likely that your problem is not a laggy filesystem, but something unrelated and yet to be identified.

chris.and.amy.shannon@gmail.com (JIRA)

Feb 3, 2020, 8:16:07 AM

In case it helps anyone else who stumbles across this thread, I just ran into this problem and was able to figure out why (it was not a durable-task or Jenkins thing, but something I was doing wrong).

I basically had three different stages in my pipeline using static code analysis tools. Each of these tools can be CPU intensive and by default is happy to consume as many cores as are available on the host. We also have multiple Jenkins executors on each of our nodes (e.g. 4 executors on a 4-core node).

This problem presented itself when I put these three stages in a parallel block, and they all mapped to three executors on the same physical node.  When they started analyzing the code, I'm sure that my system load was completely railed (i.e. 2 if not 3 processes each trying to peg every core to 100% CPU).

It is no surprise that this error message would occur in this scenario.  Sure, Jenkins could have been more patient, but it also pointed to a pipeline architecture problem on my end.
