[JIRA] (JENKINS-17116) gracefull job termination

markusb@cityweb.de (JIRA)

Mar 7, 2013, 6:43:53 AM
to jenkinsc...@googlegroups.com
Issue Type: Bug
Assignee: Unassigned
Components: core
Created: 07/Mar/13 11:43 AM
Description:

Using freestyle projects to execute bash shell scripts works fine. But cancelling a Jenkins job seems to use SIGKILL, so the script cannot perform cleanup operations and free resources.

SIGKILL cannot be handled by shell

SIGINT/SIGTERM are not used by Jenkins

Preferred: SIGINT -> wait 5 seconds -> SIGKILL
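
To illustrate the point that SIGKILL cannot be handled while SIGINT and SIGTERM can, here is a minimal Python sketch. It is only an illustration of the operating-system rule, not of anything Jenkins does; the helper name can_install_handler is made up for this example.

```python
import signal

def can_install_handler(signum):
    """Return True if this process is allowed to install a handler for signum."""
    try:
        previous = signal.signal(signum, lambda signo, frame: None)
    except (OSError, ValueError):
        # The kernel refuses handlers for SIGKILL (and SIGSTOP).
        return False
    # Restore whatever was installed before so the probe has no side effects.
    if previous is not None:
        signal.signal(signum, previous)
    return True

for name in ("SIGINT", "SIGTERM", "SIGKILL"):
    signum = getattr(signal, name)
    print(name, "can be trapped:", can_install_handler(signum))
```

On a POSIX system this should report that SIGINT and SIGTERM can be trapped and SIGKILL cannot, which is why a job killed with SIGKILL gets no chance to clean up.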

Environment: any
Project: Jenkins
Priority: Minor
Reporter: Markus Breuer
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira

martin.danjou14@gmail.com (JIRA)

Apr 16, 2013, 11:25:32 PM
to jenkinsc...@googlegroups.com

To verify this, I created the following build script:

#!/bin/bash

trap 'echo 1' 1
trap 'echo 2' 2
trap 'echo 3' 3
trap 'echo 4' 4
trap 'echo 5' 5
trap 'echo 6' 6
trap 'echo 7' 7
...
trap 'echo 64' 64
sleep 120

And I have not forgotten any signal in between. When I hit the [x] to kill the job, none of the traps gets invoked, proving that Jenkins uses kill dash nine (kill -9).

No job clean up is possible. This is a terrible design flaw. Please tell me I am wrong.

martin.danjou14@gmail.com (JIRA)

Apr 16, 2013, 11:25:32 PM
to jenkinsc...@googlegroups.com

Making this a major issue because there is no way a free style job can clean up after itself.

Change By: Martin d'Anjou (17/Apr/13 3:25 AM)
Priority: Minor -> Major

martin.danjou14@gmail.com (JIRA)

Apr 16, 2013, 11:52:32 PM
to jenkinsc...@googlegroups.com
 
Martin d'Anjou edited a comment on Bug JENKINS-17116

I edited this comment. Something is wrong with it.

martin.danjou14@gmail.com (JIRA)

Apr 17, 2013, 12:22:33 AM
to jenkinsc...@googlegroups.com
 
Martin d'Anjou edited a comment on Bug JENKINS-17116

I created this freestyle job, but the traps are never invoked when hitting [x] to "stop" the job.

#!/bin/bash
echo "Starting $0"
echo "Listing traps"
trap -p
echo "Setting trap"
trap 'echo SIGTERM; kill $pid; exit 15;' SIGTERM
trap 'echo SIGINT; kill $pid; exit 2;' SIGINT
echo "Listing traps again"
trap -p
echo "Sleeping"
sleep 10 & pid=$!
echo "Waiting"
wait $pid
echo "Exit status: $?"
echo "Ending"

It looks like Jenkins is using kill -9, but it is not since the rest of the script is executed:

Listing traps
Setting trap
Listing traps again
trap -- 'echo SIGINT; kill $pid; exit 2;' SIGINT
trap -- 'echo SIGTERM; kill $pid; exit 15;' SIGTERM
Sleeping
Waiting
Build was aborted
Aborted by d'Anjou, Martin
Build step 'Groovy Postbuild' marked build as failure
Recording test results
Exit status: 143
Ending

Is it possible that Jenkins disables the traps?
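
A note on the "Exit status: 143" line in the output above: by shell convention, a child killed by signal N is reported as 128 + N, so 143 corresponds to SIGTERM (15), whereas SIGKILL (9) would show up as 137. The Python sketch below reproduces that arithmetic; it assumes a POSIX `sleep` binary is available, and the function name shell_status_after is made up for this example.

```python
import signal
import subprocess

def shell_status_after(signum):
    """Start a sleep, kill it with signum, and return the status a shell would report."""
    proc = subprocess.Popen(["sleep", "30"])
    proc.send_signal(signum)
    proc.wait()
    # Popen reports death-by-signal as a negative return code (-15 for SIGTERM).
    # Shells such as bash report the same event as 128 + signal number.
    return 128 + (-proc.returncode)

print("SIGTERM ->", shell_status_after(signal.SIGTERM))
print("SIGKILL ->", shell_status_after(signal.SIGKILL))
```

So the sleep in the job above appears to have received SIGTERM, not SIGKILL, even though the trap output never reached the console.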

torbent@java.net (JIRA)

Apr 19, 2013, 5:17:32 AM
to jenkinsc...@googlegroups.com
torbent commented on Bug JENKINS-17116

I am struggling with this as well! There is documentation which states that Jenkins uses SIGTERM to kill processes, but I too am having a hard time trapping it. One of the problems I have is that even if my script might trap the TERM, Jenkins appears to not wait for termination of the process(es) it has started. It's a bit difficult, then, to know whether the traps work or not when I cannot see the output.

You should be aware that the bash build scripts are usually invoked with -e, which may "break" your error handling. Jenkins will list all of the processes you have started, including the sleep, and send a TERM to all of them. Your sleep then fails (before you can kill it), causing the rest of the script to fail. It looks like you may have worked around that to get the "Ending" text out, but it caught me and may confuse others trying to reproduce the problem.
The "list all of the processes" part involves an environment variable called BUILD_ID. See https://wiki.jenkins-ci.org/display/JENKINS/ProcessTreeKiller

By using a set +e (and maybe BUILD_ID=ignore – so many experiments lately) I have managed to make my script ignore TERM, which can consistently lead to an orphaned bash. Jenkins is certain the build is aborted, but the script keeps running. I can kill the script (behind Jenkins) with -9, however.
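
For readers wondering how "list all of the processes" via an environment variable can work at all, here is a rough Python sketch of the idea. It is Linux-only (it reads /proc), it is not the actual ProcessTreeKiller code, and the names pids_with_env and DEMO_COOKIE are made up for the example.

```python
import os
import subprocess

def pids_with_env(name, value):
    """Return PIDs whose initial environment contains name=value (Linux /proc only)."""
    wanted = f"{name}={value}".encode()
    matches = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/environ", "rb") as fh:
                variables = fh.read().split(b"\0")
        except OSError:
            # The process exited, or it belongs to another user.
            continue
        if wanted in variables:
            matches.append(int(entry))
    return matches

# Start a child tagged with a hypothetical marker variable, then find it again.
env = dict(os.environ, DEMO_COOKIE="build-42")
child = subprocess.Popen(["sleep", "30"], env=env)
try:
    print(child.pid in pids_with_env("DEMO_COOKIE", "build-42"))
finally:
    child.kill()
    child.wait()
```

Because children inherit the environment, every process started by the build carries the marker unless it overrides the variable, which is presumably why changing BUILD_ID lets a process escape the lookup.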

martin.danjou14@gmail.com (JIRA)

Apr 19, 2013, 9:13:32 AM
to jenkinsc...@googlegroups.com

When the shell script starts with the shebang:

#!/bin/bash
set -o
echo $-

I get:

allexport      	off
braceexpand    	on
emacs          	off
errexit        	off
errtrace       	off
functrace      	off
hashall        	on
histexpand     	off
history        	off
ignoreeof      	off
interactive-comments	on
keyword        	off
monitor        	off
noclobber      	off
noexec         	off
noglob         	off
nolog          	off
notify         	off
nounset        	off
onecmd         	off
physical       	off
pipefail       	off
posix          	off
privileged     	off
verbose        	off
vi             	off
xtrace         	off
hB

When the shell script does not start with the shebang:

set -o
echo $-

I get:

+ set -o
allexport      	off
braceexpand    	on
emacs          	off
errexit        	on
errtrace       	off
functrace      	off
hashall        	on
histexpand     	off
history        	off
ignoreeof      	off
interactive-comments	on
keyword        	off
monitor        	off
noclobber      	off
noexec         	off
noglob         	off
nolog          	off
notify         	off
nounset        	off
onecmd         	off
physical       	off
pipefail       	off
posix          	on
privileged     	off
verbose        	off
vi             	off
xtrace         	on
+ echo ehxB
ehxB

Conclusion: Jenkins forces -ex when there is no shebang (#!/bin/bash) line, so you can control at least that part.
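
The same comparison can be reproduced outside Jenkins. The Python sketch below runs a one-line script with and without the -xe options and prints the resulting `$-` flags. It assumes that `sh -xe <script>` is how a script without a shebang gets launched (a build log later in this thread shows `/bin/sh -xe /tmp/hudson...sh`); the helper name shell_flags is made up for this example.

```python
import os
import subprocess
import tempfile

def shell_flags(extra_args):
    """Run a one-line script with `sh <extra_args> script` and return its $- flags."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as fh:
        fh.write("echo $-\n")
        path = fh.name
    try:
        # The -x trace goes to stderr, so stdout holds only the flags.
        result = subprocess.run(["sh", *extra_args, path],
                                capture_output=True, text=True, check=True)
        return result.stdout.strip()
    finally:
        os.unlink(path)

print("sh script     ->", shell_flags([]))
print("sh -xe script ->", shell_flags(["-xe"]))
```

The exact letters depend on which shell `sh` is, but the second line should contain both e and x.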

martin.danjou14@gmail.com (JIRA)

May 7, 2013, 5:50:58 PM
to jenkinsc...@googlegroups.com

First point: Changing the value of the BUILD_ID variable to bypass the tree killer is a bad idea: it changes the meaning of BUILD_ID. It would have been better to use a different variable name to express the "don't kill me" idea (hint: if the user sets DONTKILLME=true, then don't kill it).

Second point: Changing BUILD_ID has no effect on the example script shown in the first comment: it seems Jenkins disables the traps. I tried setting BUILD_ID in a job parameter and in the environment injection plugin to no avail.

Here are 2 scenarios explaining why Jenkins must not intercept the signals and must let the freestyle jobs handle their own termination:
1) the freestyle job needs a way to remove temporary files it might have created
2) the freestyle job needs a way to kill remote processes it might have created

I feel scenario 2 needs an explanation: Say the freestyle job spawned a process on a remote host, and disconnected from that remote host. There is no way for the process tree killer to find the connection between the freestyle job bash script, and the remote process, only the freestyle job script can kill the remote job. This is why signals must be propagated and not intercepted.

martin.danjou14@gmail.com (JIRA)

May 8, 2013, 10:06:57 PM
to jenkinsc...@googlegroups.com

After experimenting some more, it seems Jenkins cuts the ties to the child process too soon after sending the TERM signal. Sometimes, when the job runs on the master, I do see the message from the SIGTERM trap, and a lot of times, I don't see it. This makes it hard to tell what really happens. It looks like Jenkins simply needs to wait for the job process to cut the ties to stdout/stderr before it stops listening to the job itself.

On IRC (May 8, 2013), there was a discussion on changing SIGTERM to SIGTERM -> wait 10 sec -> SIGKILL, but I would prefer if this delay was configurable or even optional, as the clean up done by a properly behaving job could take more than 10 seconds (and it does take a few minutes in my case due to a very large amount of small files to clean up on NFS).

Here are loosely related but different requests:
JENKINS-11995
JENKINS-11996

owen@nerdnetworks.org (JIRA)

May 10, 2013, 12:54:58 PM
to jenkinsc...@googlegroups.com
owenmehegan commented on Bug JENKINS-17116

This may explain a problem I've been seeing. When a user cancels a build while a Ruby 'bundle install' operation is happening, the job exits but the bundle process goes into a zombie-ish state (not literally a zombie process but it never exits), no longer a child of the Jenkins process. I have to kill it manually, and sometimes it freaks out and consumes a lot of resources on the box as well. I'm not sure if we need a bigger/different hammer here, or what.

martin.danjou14@gmail.com (JIRA)

Jul 24, 2013, 9:01:47 AM
to jenkinsc...@googlegroups.com

Jenkins leaks processes when jobs are killed. I think this is related to this issue, so instead of creating a new bug report, I am adding this comment.

To reproduce the process leak, create a new freestyle job from a fresh install, and enter this script:

#!/usr/bin/python
import signal
import time
print "Main 1"
def handler(*ignored):
    print "Ignored 1"
    time.sleep(120)
    print "Ignored 2"

print "Main 2"
signal.signal(signal.SIGTERM, handler)
print "Main 3"
time.sleep(120)
print "Main 4"

Then execute the build, and after a few seconds once the build is running, hit the red [x] button to kill the job. After the job is killed and Jenkins is done, go to the terminal and look for the python process. You should find something like this:

$ ps -efH
...
mdanjou   2154  2150  0 08:22 pts/0    00:00:00     bash
mdanjou   2531  2154 16 08:24 pts/0    00:00:36       java -jar jenkins.war
mdanjou   2601  2531  0 08:25 pts/0    00:00:00         /usr/bin/python /tmp/hudson3048464595979281901.sh

The python script is still in memory, and still executing. However, Jenkins has cut the ties to the python script.

Jenkins must not cut the ties until the script is done.

The script in this comment is a simple example. In real project scripts, the signal handler is used to clean up temporary files and to terminate gracefully (e.g. by killing other spawned processes).
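
As a rough idea of what such a real-world handler looks like, here is a Python 3 sketch that removes a temporary directory and stops a spawned helper when SIGTERM arrives. All names (install_cleanup, demo, the "job-" prefix) are made up for the example, and the demo sends the signal to itself to stand in for the abort.

```python
import os
import shutil
import signal
import subprocess
import sys
import tempfile
import time

def install_cleanup(tmpdir, children):
    """Install a SIGTERM handler that stops children, removes tmpdir, then exits."""
    def handler(signo, frame):
        for child in children:           # subprocess.Popen objects started by the job
            if child.poll() is None:
                child.terminate()        # pass SIGTERM on to spawned processes
                child.wait()             # assumes the child honours SIGTERM
        shutil.rmtree(tmpdir, ignore_errors=True)
        sys.exit(128 + signo)            # conventional "killed by signal" status
    signal.signal(signal.SIGTERM, handler)

def demo():
    """Simulate an abort and report (exit code, tmpdir still exists, helper status)."""
    previous = signal.getsignal(signal.SIGTERM)
    workdir = tempfile.mkdtemp(prefix="job-")
    helper = subprocess.Popen(["sleep", "60"])
    install_cleanup(workdir, [helper])
    try:
        os.kill(os.getpid(), signal.SIGTERM)   # stand-in for the abort
        time.sleep(5)                          # the handler interrupts this
    except SystemExit as exc:
        return exc.code, os.path.exists(workdir), helper.returncode
    finally:
        if previous is not None:
            signal.signal(signal.SIGTERM, previous)
    return None

print(demo())
```

The cleanup only helps if whoever sends the signal then waits long enough for the handler to finish, which is the point of this report.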

owen@nerdnetworks.org (JIRA)

Jul 24, 2013, 4:54:47 PM
to jenkinsc...@googlegroups.com
Change By: owenmehegan (24/Jul/13 8:54 PM)
Assignee: Kohsuke Kawaguchi

martin.danjou14@gmail.com (JIRA)

Jun 20, 2014, 9:17:50 AM
to jenkinsc...@googlegroups.com

There is more to it than cleaning up the orphaned processes, which by the way should be done by Jenkins and not as an external process. The way this should work is that Jenkins should send the signal (SIGINT or SIGTERM) and wait for the sub-processes to do their own cleanup. This gives the sub-processes a chance to propagate the signal to sub-sub-processes of their own (which, when you use a grid engine, might be running on yet other remote machines that are not Jenkins slaves).

I modified the first shell script to write to a file during the traps: Jenkins cuts the ties too early and no files show up anywhere.

#!/bin/bash
echo "Starting $0"
echo "Listing traps"
trap -p
echo "Setting trap"
trap 'echo SIGTERM | tee trap.sigterm; kill $pid; exit 15;' SIGTERM
trap 'echo SIGINT  | tee trap.sigint; kill $pid; exit 2;' SIGINT
echo "Listing traps again"
trap -p
echo "Sleeping"
sleep 20 & pid=$!
echo "Waiting"
wait $pid
echo "Exit status: $?"
echo "Ending"

So the SIGINT -> wait N seconds for the build process to return -> SIGKILL (with a user configurable N) would be an acceptable solution. The value of N should be configurable for each job.

appidman@gmail.com (JIRA)

Jul 16, 2014, 11:45:37 AM
to jenkinsc...@googlegroups.com
AppId Man commented on Bug JENKINS-17116

I'm also affected by this issue and would highly appreciate the solution proposed by Martin d'Anjou, in which Jenkins waits (a configurable amount of time) for its children to finish.

Will this be implemented in the near future?

tintinwebweb@oststrom.com (JIRA)

Jul 16, 2014, 7:18:22 PM
to jenkinsc...@googlegroups.com

I see the exact same issue as described in comment-182402.

I am using the execute python-script build step to invoke fairly long-running Python processes (parents) that also spawn multiple sub-processes on demand, which the parent manages. I've implemented proper signal handling to clean up child processes and threads whenever the parent gets terminated. Unfortunately it looks like, as described in comment-182402, Jenkins notifies the parent but does not wait for it to clean up and terminate; instead it detaches from the process, leaving it in a zombie-like state. In my case I keep finding processes sitting in futex calls, waiting for a lock on a resource that never gets unlocked.

Clean-up bash scripts are not an option, as they do not prevent the process from locking, so some of the external resources that are also locked by my script will never be freed. The option I see is to make Jenkins wait for the hudson<...>.py process to terminate gracefully, and optionally force termination if the process's cleanup takes too long.

I'd appreciate any clues on fixing this issue.
Thanks

sandor.balazsi@ericsson.com (JIRA)

Aug 11, 2014, 3:25:54 AM
to jenkinsc...@googlegroups.com

Is there any progress on this issue?

We are using Jenkins to start a Java-based test framework. This tool has a couple of Java shutdown hooks defined that must be executed when the Java process terminates.

Because of this problem, Jenkins does not wait for our Java process to terminate properly.

yevgen.kovalienia@gmail.com (JIRA)

Aug 19, 2014, 5:37:01 PM
to jenkinsc...@googlegroups.com

Hi all,

I have the same problem and would appreciate the solution described by Martin.
Is anybody working on an implementation?

daniel@beckweb.net (JIRA)

Aug 26, 2014, 4:44:03 PM
to jenkinsc...@googlegroups.com
Daniel Beck commented on Bug JENKINS-17116

Jenkins preferably uses the java.lang.UNIXProcess.destroy(...) method in the JRE running Jenkins.

In OpenJDK 7 and up it seems to send SIGTERM, which is consistent with my observations below.

http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/27e0909d3fa0/src/solaris/native/java/lang/UNIXProcess_md.c#l722 (parameter is "false")
http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/solaris/native/java/lang/UNIXProcess_md.c#l720 (parameter is "false")
http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/9b8c96f96a0f/src/solaris/native/java/lang/UNIXProcess_md.c#l947

The call from Jenkins:
https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/util/ProcessTree.java#L580

A build output of a very simple shell script demonstrating that SIGTERM is handled:

Building on master in workspace /var/lib/jenkins/workspace/jobname
[jobname] $ /bin/sh -xe /tmp/hudson6478022098890718097.sh
+ trap 'echo TERM' TERM
+ sleep 50
Terminated
++ echo TERM
TERM
Build was aborted
Aborted by Daniel Beck
Finished: ABORTED

So check your JRE's source code or documentation to see whether/how UNIXProcess is implemented. OpenJDK (in my case OpenJDK 1.7.0.45) seems to behave.

That said, the logging of hudson.util.ProcessTree might be interesting. Log on FINER or higher.

daniel@beckweb.net (JIRA)

Aug 26, 2014, 4:46:03 PM
to jenkinsc...@googlegroups.com
 
Daniel Beck edited a comment on Bug JENKINS-17116

(To clarify, this comment is about the issue as reported, not any other process killing issues discussed in comments.)


pyrogx1133@gmail.com (JIRA)

Sep 3, 2014, 10:51:22 PM
to jenkinsc...@googlegroups.com
Ronald Chen commented on Bug JENKINS-17116

Something is odd. The trap is working when Jenkins is on Ubuntu 12.10 but not on CentOS 6.3

martin.danjou14@gmail.com (JIRA)

Sep 9, 2014, 8:38:34 AM
to jenkinsc...@googlegroups.com

This has gone from bad to worst. I have non-concurrent builds running back to back. When the first one is killed, it somehow keeps running in the background while the other one starts in the same workspace and fails when it should have passed.

Daniel Beck: how do I set the Log to FINER or higher on the process tree, and where do I look up the log? Give me urls please, I sometimes don't understand all the jargon.

This is the Java I am using:

/usr/java/jdk1.7/bin/java -version
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

I run jenkins as "java -jar jenkins.war"

martin.danjou14@gmail.com (JIRA)

Sep 9, 2014, 8:52:34 AM
to jenkinsc...@googlegroups.com
 
Martin d'Anjou edited a comment on Bug JENKINS-17116

I run jenkins as /usr/java/jdk1.7/bin/java -jar jenkins.war

daniel@beckweb.net (JIRA)

Sep 9, 2014, 9:58:34 AM
to jenkinsc...@googlegroups.com

> This has gone from bad to worst

Unhelpful statement without mentioning the involved Jenkins versions. Which were bad, which are worse?

> how do I set the Log to FINER or higher on the process tree, and where do I look up the log?

Go to http://jenkins/log, create a new log recorder (use any name), add a logger named hudson.util.ProcessTree and set level to FINER. Save. Go to the log recorder's page occasionally when the issue occurs to see what it logs.

martin.danjou14@gmail.com (JIRA)

Sep 9, 2014, 3:58:38 PM
to jenkinsc...@googlegroups.com

Sorry I should have been more useful in my comment. By worst I meant that I have found that a killed job can corrupt the current job's workspace. I have found a way to reproduce this corruption 100% of the time.

I use Jenkins 1.578 and Java SE JRE (build 1.7.0_45-b18), Java HotSpot 64-bit Server VM (build 24.35-b08).

I launch jenkins from linux RHEL 6.4 (Santiago) with java -jar jenkins.war

The job needs to be configured with the following script (it is a variation on the python script above):

#!/usr/bin/python
import signal
import time
import os
def handler(*ignored):
    time.sleep(120)
    fh = open("a_file.txt","a")
    fh.write("Handler of Build number: "+os.environ['BUILD_NUMBER'])
    fh.close()

signal.signal(signal.SIGTERM, handler)
fh = open("a_file.txt","w")
fh.write("Main of Build number: "+os.environ['BUILD_NUMBER'])
fh.close()
time.sleep(120)

Then configure the job to archive the artifact named a_file.txt
Run two jobs back to back, kill the first one shortly after it started. Leave the second one to complete until it ends normally.

The log as configured in the above comment, shows:

killAll: process=java.lang.UNIXProcess@3d7c07c9 and envs={HUDSON_COOKIE=06668ba4-b481-4a17-86b3-5f4fbd4061b2}
Sep 09, 2014 3:30:49 PM FINE hudson.util.ProcessTree
Recursively killing pid=1840
Sep 09, 2014 3:30:49 PM FINE hudson.util.ProcessTree
Killing pid=1840
Sep 09, 2014 3:30:49 PM FINE hudson.util.ProcessTree
Recursively killing pid=1840
Sep 09, 2014 3:30:49 PM FINE hudson.util.ProcessTree
Killing pid=1840

The unix process table, after the kill, shows that both jobs are still running:

mdanjou   1251   953  0 10:07 pts/30   00:01:46           java -jar jenkins.war
mdanjou   1840  1251  0 15:30 pts/30   00:00:00             /usr/bin/python /tmp/hudson6469713064377741807.sh
mdanjou   1851  1251  0 15:30 pts/30   00:00:00             /usr/bin/python /tmp/hudson1969984296722384280.sh

Both jobs are still running.

When the second job completes, examine its artifact. It contains this:

Main of Build number: 18Handler of Build number: 17

So the killed build (#17) corrupts the workspace of the running build (#18).

daniel@beckweb.net (JIRA)

Sep 9, 2014, 4:07:35 PM
to jenkinsc...@googlegroups.com
Daniel Beck commented on Bug JENKINS-17116

Makes sense. I don't see how this could be circumvented. Maybe by waiting a bit to see whether SIGTERM worked, and if not, send SIGKILL? But Jenkins uses the JRE's abstraction of "kill a Unix process" and that behavior appears to be implementation dependent.

Should be possible to write a plugin that sends SIGKILL if configured (e.g. for specific jobs only). Would that help?

martin.danjou14@gmail.com (JIRA)

Sep 9, 2014, 5:39:35 PM
to jenkinsc...@googlegroups.com

Maximum flexibility, as a plugin or built-in, in my view and without regard to feasibility, would be:

  • wait a configurable amount of time for the SIGTERM killed process to come to its natural completion (i.e. let it run its traps/handlers)
  • if not dead by the timeout, send SIGKILL and wait for process to be gone (N seconds, configurable)
  • If not dead, move on to the next job or hang (as determined by the user - sometimes hanging is the right thing: spectacular failures are usually easy to debug but it's a judgement call)
  • When moving on, perform the post-build steps

Regarding the last point, I am not sure whether Jenkins is supposed to perform the post-build steps when a build is killed by the user - but it is certainly something that would help me. Perhaps this is something that could be configured?

I do not know what would belong to a plugin vs. what should be built-in.
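
The escalation steps listed in this message (SIGTERM, configurable wait, SIGKILL, second wait, then a caller-defined decision) can be sketched in Python as follows. This is only an illustration of the proposed policy under the assumption that the process is a direct child; it is not Jenkins code, and the names stop_gracefully, term_timeout, kill_timeout and the returned labels are made up for the example.

```python
import signal
import subprocess
import sys

def stop_gracefully(proc, term_timeout, kill_timeout):
    """SIGTERM, wait term_timeout seconds, then SIGKILL and wait kill_timeout.

    Returns "exited", "terminated", "killed" or "stuck".
    """
    if proc.poll() is not None:
        return "exited"
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=term_timeout)   # let traps/handlers finish
        return "terminated"
    except subprocess.TimeoutExpired:
        pass
    proc.kill()                           # SIGKILL cannot be trapped or ignored
    try:
        proc.wait(timeout=kill_timeout)
        return "killed"
    except subprocess.TimeoutExpired:
        return "stuck"                    # caller decides: move on or hang

# A child that ignores SIGTERM, to exercise the escalation path.
STUBBORN = ("import signal, time; "
            "signal.signal(signal.SIGTERM, signal.SIG_IGN); "
            "print('ready', flush=True); time.sleep(60)")

def demo():
    polite = subprocess.Popen(["sleep", "60"])
    stubborn = subprocess.Popen([sys.executable, "-c", STUBBORN],
                                stdout=subprocess.PIPE, text=True)
    stubborn.stdout.readline()            # wait until SIGTERM is being ignored
    results = (stop_gracefully(polite, term_timeout=5, kill_timeout=5),
               stop_gracefully(stubborn, term_timeout=1, kill_timeout=5))
    stubborn.stdout.close()
    return results

print(demo())
```

Keeping both timeouts as parameters reflects the request that they be configurable per job, since a cleanup on NFS may need minutes rather than seconds.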
