[JIRA] (JENKINS-59152) Jenkins fails to properly abort "bat" step

30 views
Skip to first unread message

marat@slonopotamus.org (JIRA)

unread,
Aug 30, 2019, 12:05:02 PM8/30/19
to jenkinsc...@googlegroups.com
Marat Radchenko created an issue
 
Jenkins / Bug JENKINS-59152
Jenkins fails to properly abort "bat" step
Issue Type: Bug Bug
Assignee: Unassigned
Components: core
Created: 2019-08-30 16:04
Environment: Windows
Jenkins 2.176.1
Priority: Major Major
Reporter: Marat Radchenko
  1. Windows
  2. Jenkins 2.176.1
  3. Create pipeline:
    node() {
      bat "ping 127.0.0.1 -n 100000"
    }
    
  4. Run pipeline
  5. Abort pipeline
  6. View build log

Expected: pipeline aborts fast and without any issues

Actual (66% reproducibility):

  1. It takes pipeline 20s to abort
  2. Build log contains "Click here to forcibly terminate running steps" and "After 20s process did not stop", indicating that Jenkins has issues with stopping the pipeline
  3. "Click here to forcibly terminate running steps" link is still visible even after the build has finished

Issue analysis:

  1. There is a race condition between 2 minute timer in hudson.util.ProcessTree.WindowsOSProcess#killSoftly introduced for JENKINS-17116 and 20s timer in org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.Execution#stop. It is possible for DurableTaskStep to pretend that step was cancelled while it fact process is still running. Because of this race condition, it is possible to trick Jenkins into thinking that build has finished while if fact there are still processes running in workspace and potentially locking files there (this happens to us in practice).
  2. org.jvnet.winp.WinProcess#sendCtrlC that is used in hudson.util.ProcessTree.WindowsOSProcess#killSoftly is NOT a proper way to terminate processes. Many apps do not interpret CTRL+C as a shutdown signal. (cmd.exe being the most important one here, because running bat in pipeline involved TWO cmd.exe - one running jenkins-wrapper.bat and second running jenkins-main.bat. Why you're not using TerminateProcess function from WinAPI?
  3. There's a race condition between gathering of process list in hudson.util.ProcessTree.Windows#Windows constructor and killing of the processes, during which build can produce new processes that will not be attempted to be killed.
  4. Usage of JENKINS_NODE_COOKIE to find what processes to kill is unreliable because 1) processes are free to alter their environment 2) CreateProcessA allows to pass custom environment variables 3) It has unpredictable order 4) It doesn't match Jenkins behavior on Linux
Add Comment Add Comment
 
This message was sent by Atlassian Jira (v7.11.2#711002-sha1:fdc329d)

marat@slonopotamus.org (JIRA)

unread,
Aug 30, 2019, 12:07:02 PM8/30/19
to jenkinsc...@googlegroups.com
Marat Radchenko updated an issue
Change By: Marat Radchenko
# Windows
# Jenkins 2.176.1
# Create pipeline:
{code}node() {

  bat "ping 127.0.0.1 -n 100000"
}
{code}
# Run pipeline
# Abort pipeline
# View build log


Expected: pipeline aborts fast and without any issues

Actual (66% reproducibility):
# It takes pipeline 20s to abort
# Build log contains "Click here to forcibly terminate running steps" and "After 20s process did not stop", indicating that Jenkins has issues with stopping the pipeline
# "Click here to forcibly terminate running steps" link is still visible even after the build has finished

Issue analysis:
# There is a race condition between 2 minute timer in {{hudson.util.ProcessTree.WindowsOSProcess#killSoftly}} introduced for JENKINS-17116
by [PR#3414|https://github.com/jenkinsci/jenkins/pull/3414] and 20s timer in {{org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.Execution#stop}}. It is possible for {{DurableTaskStep}} to pretend that step was cancelled while it fact process is still running. Because of this race condition, it is possible to trick Jenkins into thinking that build has finished while if fact there are still processes running in workspace and potentially locking files there (this happens to us in practice).
# {{org.jvnet.winp.WinProcess#sendCtrlC}} that is used in {{hudson.util.ProcessTree.WindowsOSProcess#killSoftly}} is NOT a proper way to terminate processes.  Many apps do not interpret CTRL+C as a shutdown signal. ({{cmd.exe}} being the most important one here, because running {{bat}} in pipeline involved TWO {{cmd.exe}} - one running {{jenkins-wrapper.bat}} and second running {{jenkins-main.bat}}. Why you're not using [TerminateProcess function|https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-terminateprocess] from WinAPI?
# There's a race condition between gathering of process list in {{hudson.util.ProcessTree.Windows#Windows}} constructor and killing of the processes, during which build can produce new processes that will not be attempted to be killed.
# Usage of {{JENKINS_NODE_COOKIE}} to find what processes to kill is unreliable because 1) processes are free to alter their environment 2) [CreateProcessA|https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-createprocessa] allows to pass custom environment variables 3) It has unpredictable order 4) It doesn't match Jenkins behavior on Linux

marat@slonopotamus.org (JIRA)

unread,
Sep 3, 2019, 11:04:01 AM9/3/19
to jenkinsc...@googlegroups.com
Marat Radchenko updated an issue
Change By: Marat Radchenko
Component/s: durable-task-plugin

marat@slonopotamus.org (JIRA)

unread,
Sep 23, 2019, 8:40:03 AM9/23/19
to jenkinsc...@googlegroups.com
Marat Radchenko commented on Bug JENKINS-59152
 
Re: Jenkins fails to properly abort "bat" step

All that I described in this issue can be reproduced by running org.jenkinsci.plugins.workflow.steps.durable_task.ShellStepTest#abort test on Windows. Sometimes it quickly passes. Sometimes it idles with 20s timeout. Sometimes it fails to kill ping process.

This message was sent by Atlassian Jira (v7.13.6#713006-sha1:cc4451f)
Atlassian logo

marat@slonopotamus.org (JIRA)

unread,
Sep 23, 2019, 9:35:01 AM9/23/19
to jenkinsc...@googlegroups.com
Marat Radchenko updated an issue
Change By: Marat Radchenko
Component/s: workflow-durable-task-step-plugin
Component/s: core
Component/s: durable-task-plugin

marat@slonopotamus.org (JIRA)

unread,
Sep 23, 2019, 12:08:02 PM9/23/19
to jenkinsc...@googlegroups.com
Marat Radchenko updated an issue
# Windows
# Jenkins 2.176.1
# Create pipeline:
{code}node() {
  bat "ping 127.0.0.1 -n 100000"
}
{code}
# Run pipeline
# Abort pipeline
# View build log

Expected: pipeline aborts fast and without any issues

Actual ( 66% reproducibility is less than 100% ):

# It takes pipeline 20s to abort
# Build log contains "Click here to forcibly terminate running steps" and "After 20s process did not stop", indicating that Jenkins has issues with stopping the pipeline
# "Click here to forcibly terminate running steps" link is still visible even after the build has finished
# Sometimes ping processes are NOT terminated even when build has aborted.

Issue analysis:
# There is a race condition between 2 minute timer in {{hudson.util.ProcessTree.WindowsOSProcess#killSoftly}} introduced for JENKINS-17116 by [PR#3414|https://github.com/jenkinsci/jenkins/pull/3414] and 20s timer in {{org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.Execution#stop}}. It is possible for {{DurableTaskStep}} to pretend that step was cancelled while it fact process is still running. Because of this race condition, it is possible to trick Jenkins into thinking that build has finished while if fact there are still processes running in workspace and potentially locking files there (this happens to us in practice).
# {{org.jvnet.winp.WinProcess#sendCtrlC}} that is used in {{hudson.util.ProcessTree.WindowsOSProcess#killSoftly}} is NOT a proper way to terminate processes.  Many apps do not interpret CTRL+C as a shutdown signal. ({{cmd.exe}} being the most important one here, because running {{bat}} in pipeline involved TWO {{cmd.exe}} - one running {{jenkins-wrapper.bat}} and second running {{jenkins-main.bat}}. Why you're not using [TerminateProcess function|https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-terminateprocess] from WinAPI?
# There's a race condition between gathering of process list in {{hudson.util.ProcessTree.Windows#Windows}} constructor and killing of the processes, during which build can produce new processes that will not be attempted to be killed.
# Usage of {{JENKINS_NODE_COOKIE}} to find what processes to kill is unreliable because 1) processes are free to alter their environment 2) [CreateProcessA|https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-createprocessa] allows to pass custom environment variables 3) It has unpredictable order 4) It doesn't match Jenkins behavior on Linux

marat@slonopotamus.org (JIRA)

unread,
Sep 24, 2019, 10:27:02 AM9/24/19
to jenkinsc...@googlegroups.com
 
Re: Jenkins fails to properly abort "bat" step

See comments to PR#4216 for additional technical analysis.

o.v.nenashev@gmail.com (JIRA)

unread,
Oct 7, 2019, 8:15:02 PM10/7/19
to jenkinsc...@googlegroups.com

o.v.nenashev@gmail.com (JIRA)

unread,
Oct 7, 2019, 8:15:03 PM10/7/19
to jenkinsc...@googlegroups.com
Oleg Nenashev updated Bug JENKINS-59152
 

The fix was released in Jenkins 2.199

Change By: Oleg Nenashev
Status: Open Fixed but Unreleased
Resolution: Fixed
Released As: Jenkins  2.199

o.v.nenashev@gmail.com (JIRA)

unread,
Oct 7, 2019, 8:15:04 PM10/7/19
to jenkinsc...@googlegroups.com

o.v.nenashev@gmail.com (JIRA)

unread,
Oct 7, 2019, 8:16:03 PM10/7/19
to jenkinsc...@googlegroups.com

marat@slonopotamus.org (JIRA)

unread,
Oct 8, 2019, 2:13:02 AM10/8/19
to jenkinsc...@googlegroups.com
 
Re: Jenkins fails to properly abort "bat" step

I do not agree that PR#4225 fully fixed this issue. Race conditions between multiple timers are still there. Shortening of softkill timeout makes issue less often but still possible.

ogondza@gmail.com (JIRA)

unread,
Oct 8, 2019, 8:44:03 AM10/8/19
to jenkinsc...@googlegroups.com
Oliver Gondža updated an issue
 
Change By: Oliver Gondža
Labels: 2.190.2-rejected lts-candidate

ogondza@gmail.com (JIRA)

unread,
Oct 8, 2019, 8:45:03 AM10/8/19
to jenkinsc...@googlegroups.com
Oliver Gondža commented on Bug JENKINS-59152
 
Re: Jenkins fails to properly abort "bat" step

Given the fix is disputed (and far from trivial, IMO), I am postponing the backport to 2.190.3 at least.

ogondza@gmail.com (JIRA)

unread,
Nov 3, 2019, 9:37:03 AM11/3/19
to jenkinsc...@googlegroups.com
Oliver Gondža updated an issue
Change By: Oliver Gondža
Labels: 2.190.2-rejected 2.190.3-rejected lts-candidate

dbeck@cloudbees.com (JIRA)

unread,
Nov 22, 2019, 6:15:05 AM11/22/19
to jenkinsc...@googlegroups.com
Daniel Beck updated an issue
Change By: Daniel Beck
Labels: 2.190.2-rejected 2.190.3-rejected lts-candidate
Reply all
Reply to author
Forward
0 new messages