Steps to reproduce:
# Windows
# Jenkins 2.176.1
# Create pipeline:
{code}
node() {
    bat "ping 127.0.0.1 -n 100000"
}
{code}
# Run pipeline
# Abort pipeline
# View build log
Expected: the pipeline aborts quickly and without any issues.
Actual (reproducible about 66% of the time, not on every run):
# The pipeline takes 20s to abort.
# The build log contains "Click here to forcibly terminate running steps" and "After 20s process did not stop", indicating that Jenkins has trouble stopping the pipeline.
# The "Click here to forcibly terminate running steps" link is still visible even after the build has finished.
# Sometimes the ping processes are NOT terminated even though the build has been aborted.
Issue analysis:
# There is a race condition between the 2-minute timer in {{hudson.util.ProcessTree.WindowsOSProcess#killSoftly}}, introduced for JENKINS-17116 by [PR#3414|https://github.com/jenkinsci/jenkins/pull/3414], and the 20s timer in {{org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.Execution#stop}}. {{DurableTaskStep}} can report the step as cancelled while in fact the process is still running. Because of this race condition, Jenkins can be tricked into thinking that the build has finished while in fact there are still processes running in the workspace and potentially locking files there (this happens to us in practice).
# {{org.jvnet.winp.WinProcess#sendCtrlC}}, which is used in {{hudson.util.ProcessTree.WindowsOSProcess#killSoftly}}, is NOT a proper way to terminate processes. Many apps do not interpret CTRL+C as a shutdown signal ({{cmd.exe}} being the most important one here, because running {{bat}} in a pipeline involves TWO {{cmd.exe}} processes: one running {{jenkins-wrapper.bat}} and a second running {{jenkins-main.bat}}). Why aren't you using the [TerminateProcess function|https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-terminateprocess] from WinAPI?
# There is a race condition between gathering the process list in the {{hudson.util.ProcessTree.Windows#Windows}} constructor and killing the processes. In that window the build can spawn new processes that Jenkins will never attempt to kill.
# Using {{JENKINS_NODE_COOKIE}} to find which processes to kill is unreliable because 1) processes are free to alter their environment, 2) [CreateProcessA|https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-createprocessa] allows passing custom environment variables, 3) it has unpredictable order, and 4) it doesn't match Jenkins behavior on Linux.
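The first two reasons in the last item can be illustrated without Jenkins at all. The sketch below is only an illustration, not Jenkins code: the class name, {{childWithoutCookie}} and {{matchesBuild}} are hypothetical, and the cookie value is made up. It relies only on the standard {{java.lang.ProcessBuilder#environment}} map, which a parent may edit freely before starting a child. No process is actually started.

```java
import java.util.Map;

public class CookieScrubSketch {
    static final String COOKIE = "JENKINS_NODE_COOKIE";

    // Hypothetical helper: prepares (but does not start) a child process whose
    // environment no longer carries the cookie, as any build tool is free to do.
    static ProcessBuilder childWithoutCookie(String cookieValue) {
        ProcessBuilder pb = new ProcessBuilder("ping", "127.0.0.1", "-n", "100000");
        Map<String, String> env = pb.environment();
        env.put(COOKIE, cookieValue); // simulate the value inherited from the agent
        env.remove(COOKIE);           // the build script drops it before spawning
        return pb;
    }

    // Hypothetical stand-in for a cookie-based matcher: true only if the
    // environment still carries the expected cookie value.
    static boolean matchesBuild(Map<String, String> env, String cookieValue) {
        return cookieValue.equals(env.get(COOKIE));
    }

    public static void main(String[] args) {
        ProcessBuilder pb = childWithoutCookie("abc123");
        // The child prepared above would not be recognised as part of the build.
        System.out.println("matched: " + matchesBuild(pb.environment(), "abc123"));
    }
}
```

A child started from such a {{ProcessBuilder}} would not carry the cookie, so a search based purely on that variable could not associate it with the build.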
I do not agree that PR#4225 fully fixed this issue. The race conditions between multiple timers are still there. Shortening the soft-kill timeout makes the issue less frequent, but it is still possible.
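A minimal sketch of why a shorter timeout can narrow the window without closing it, assuming a process that ignores CTRL+C and therefore dies only when the soft-kill timer expires. This is a toy model with logical time, not Jenkins code: the class and method names, the 10s timeout, and the 15s start delay are made-up values for illustration only. The 2-minute and 20s values are the ones from the description.

```java
public class AbortTimerRaceSketch {
    // Simplified, hypothetical model; all times are milliseconds after the abort request.
    //   killStartDelayMs  - delay before the soft kill of this process begins
    //   softKillTimeoutMs - how long the soft kill waits before the process is killed forcibly
    //   stepStopTimeoutMs - how long the step waits before it reports itself as stopped
    static boolean aliveWhenStepReportedStopped(long killStartDelayMs,
                                                long softKillTimeoutMs,
                                                long stepStopTimeoutMs) {
        // Assumption: the process ignores CTRL+C, so it dies only on the forced kill.
        long processDiesAt = killStartDelayMs + softKillTimeoutMs;
        // The step is reported as stopped when the process dies or when the
        // step's own timer expires, whichever comes first.
        long stepReportedAt = Math.min(processDiesAt, stepStopTimeoutMs);
        return processDiesAt > stepReportedAt;
    }

    public static void main(String[] args) {
        // Timer values from the description: 2 minutes vs 20s.
        System.out.println(aliveWhenStepReportedStopped(0, 120_000, 20_000));
        // A shorter, made-up soft-kill timeout with no start delay.
        System.out.println(aliveWhenStepReportedStopped(0, 10_000, 20_000));
        // The same shorter timeout, but the kill starts late (made-up delay).
        System.out.println(aliveWhenStepReportedStopped(15_000, 10_000, 20_000));
    }
}
```

In this model the outcome depends only on which of the two independent timers fires first, so any delay that pushes the forced kill past the step timer reproduces the problem.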