[JIRA] (JENKINS-55512) Safe shutdown/restart should not block completion of complex jobs (that spawn child jobs)

jimklimov@gmail.com (JIRA)

Jan 10, 2019, 7:32:02 AM
to jenkinsc...@googlegroups.com
Jim Klimov created an issue
 
Jenkins / Improvement JENKINS-55512
Safe shutdown/restart should not block completion of complex jobs (that spawn child jobs)
Issue Type: Improvement
Assignee: Unassigned
Components: core
Created: 2019-01-10 12:31
Environment: from ages ago up to the current Jenkins 2.156
Priority: Minor
Reporter: Jim Klimov

Jenkins supports a "safe" mode for shutting itself down or restarting. It appears to be used by the plugin updater, the thinBackup plugin, and operations that can be requested via a Jenkins URL (and a link in the Manage Jenkins interface), to name a few use cases.

This mode waits for currently running jobs to complete and prevents new jobs from progressing from scheduled to running. The assumption is that the current jobs will complete, the server will become quiet, and it can then be restarted administratively with no severe interruption to its users.

This is problematic, however, when there are complex jobs, such as MultiPhase or entangled pipelines, where one wrapper job calls, as its payload over time, a number of other jobs that implement certain tests or other operations. When safe shutdown mode is enabled, these child jobs cannot be started, the parent job stalls indefinitely waiting for their results, and the Jenkins master is left dysfunctional: it is not restarted (e.g. for an overnight upgrade) and it is not running any new builds.

The expected result is that the currently running wrapper jobs AND any children (and their children) that they spawn are allowed to start and are awaited until completion (this boils down to "completion of all running jobs", as before), after which the safe restart/shutdown takes place normally.
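To make the stall concrete, here is a toy model in plain Java. It is not Jenkins code and every name in it is made up for illustration. A parent build has to run a number of child builds one after another, and the instance can restart only once nothing is running. The model assumes that one child finishes per step whenever it is allowed to start.

```java
/** Toy model of quiet-down; hypothetical names, not part of Jenkins. */
final class QuietDownModel {
    /**
     * Returns true if the parent build (and therefore the instance) becomes
     * idle within maxSteps, so that the safe restart could proceed.
     *
     * @param children         child builds the parent still has to run and wait for
     * @param childrenMayStart whether queued children may start during quiet-down
     * @param maxSteps         how long the model keeps waiting before giving up
     */
    static boolean becomesIdle(int children, boolean childrenMayStart, int maxSteps) {
        int remaining = children;
        for (int step = 0; step < maxSteps && remaining > 0; step++) {
            if (childrenMayStart) {
                remaining--; // the child leaves the queue, runs and finishes
            }
            // otherwise the child stays queued and the parent keeps waiting
        }
        return remaining == 0;
    }
}
```

With `becomesIdle(3, false, 1000)` the parent never finishes, which is the current behavior; with `becomesIdle(3, true, 1000)` it drains after three steps and the restart can go ahead.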

I believe this could be done with a simple check of the build cause in the code that disallows execution of scheduled builds while safe shutdown is enabled: allow builds of jobs triggered by another job, but disallow builds triggered by SCM changes, polling, indexing, or manual triggering (or maybe allow that last one with an option to go through nonetheless?), etc.

This message was sent by Atlassian Jira (v7.11.2#711002-sha1:fdc329d)

jimklimov@gmail.com (JIRA)

Jan 10, 2019, 7:41:02 AM
to jenkinsc...@googlegroups.com
Jim Klimov updated an issue
Change By: Jim Klimov
Component/s: safe-restart
Component/s: saferestart-plugin
Component/s: core

jimklimov@gmail.com (JIRA)

Jan 10, 2019, 7:43:02 AM
to jenkinsc...@googlegroups.com
Jim Klimov updated an issue
Jenkins supports a "safe" mode for shutting itself down or restarting. It appears to be used by the plugin updater, the thinBackup plugin (which does just the quietDown part of this), and operations that can be requested via a Jenkins URL (and a link in the Manage Jenkins interface), to name a few use cases.

jimklimov@gmail.com (JIRA)

Jan 10, 2019, 7:43:02 AM
to jenkinsc...@googlegroups.com
Jim Klimov updated an issue
Jenkins supports a "safe" mode for shutting itself down (safeExit) or restarting (safeRestart). It appears to be used by the plugin updater, the thinBackup plugin (which does just the quietDown part of this), and operations that can be requested via a Jenkins URL (and a link in the Manage Jenkins interface), to name a few use cases.

jimklimov@gmail.com (JIRA)

Jan 10, 2019, 7:44:01 AM
to jenkinsc...@googlegroups.com
Jim Klimov updated an issue
This mode waits for currently running jobs to complete and prevents new jobs from progressing from scheduled to running. The assumption is that the current jobs will complete, the server will become quiet, and it can then be restarted administratively with no severe interruption to its users. It is documented in more detail, for example, at https://support.cloudbees.com/hc/en-us/articles/216118748-How-to-Start-Stop-or-Restart-your-Instance-

jimklimov@gmail.com (JIRA)

Jan 10, 2019, 7:57:02 AM
to jenkinsc...@googlegroups.com
I believe this could be done with a simple check of the build cause in the code that disallows execution of scheduled builds while safe shutdown is enabled: allow builds of jobs triggered by another job, but disallow builds triggered by SCM changes, polling, indexing, or manual triggering (or maybe allow that last one with an option to go through nonetheless?), etc. This seems like the place (or a good starting point): https://github.com/jenkinsci/jenkins/blob/d5eeefd7beb8a00910f8d579ef14124a8be1914c/core/src/main/java/hudson/model/Queue.java#L1786
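A rough sketch of such a check, written against simplified stand-in types rather than the real classes in Queue.java. `CauseKind`, `QueuedBuild` and `CauseBasedCheck` are hypothetical names, and the `quietingDown` flag stands in for the instance state; in Jenkins itself the "triggered by a job" case would correspond to `Cause.UpstreamCause`.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the trigger types named in the issue. */
enum CauseKind { UPSTREAM_JOB, SCM_CHANGE, POLLING, INDEXING, MANUAL }

/** Simplified queued build; not the real hudson.model.Queue.Item. */
final class QueuedBuild {
    final String jobName;
    final List<CauseKind> causes;

    QueuedBuild(String jobName, CauseKind... causes) {
        this.jobName = jobName;
        this.causes = Arrays.asList(causes);
    }
}

final class CauseBasedCheck {
    /** Returns true if the build must stay in the queue. */
    static boolean isBlockedByShutdown(QueuedBuild build, boolean quietingDown) {
        if (!quietingDown) {
            return false; // normal operation: nothing is held back
        }
        // During quiet-down, only builds triggered by another job may start.
        return !build.causes.contains(CauseKind.UPSTREAM_JOB);
    }
}
```

Note that this treats every job-triggered build the same way, whether or not the triggering job is actually waiting for its result.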

jimklimov@gmail.com (JIRA)

Jan 10, 2019, 1:28:03 PM
to jenkinsc...@googlegroups.com
Jim Klimov started work on Improvement JENKINS-55512
 
Change By: Jim Klimov
Status: Open -> In Progress

jimklimov@gmail.com (JIRA)

Feb 5, 2019, 7:56:02 AM
to jenkinsc...@googlegroups.com
 
Re: Safe shutdown/restart should not block completion of complex jobs (that spawn child jobs)

After discussing this with Daniel Beck, it turned out that my initial approach was not quite right, at least not for upstreaming the change. It would unilaterally change behavior for all child jobs, which might not suit pipelines that may or may not be restartable, and it is certainly not intended for triggered downstream jobs (as opposed to the sub-jobs this issue focuses on) that carry the same UpstreamCause: such jobs are independent and can wait in the blocked queue until the Jenkins master has been safely restarted. It would also not be easy to extend with further behaviors (e.g. the idea of a sysadmin forcing some job to run even though the server is shutting down).

The better suggestion was to make use of Actions, which are attachable pieces of metadata: e.g. derive a new nonBlockingJobAction (the name is arbitrary), consider its presence when making decisions in Queue.java, and attach it to jobs scheduled by plugins such as the Freestyle, MultiPhase build and Pipeline (Workflow) build steps whenever the current job would wait for that new child job to complete (async children can also wait until the master has restarted, right?).

For additional features, such as an administrative override to run some job even while the server is quieting down, an extension point should be defined in jenkins-core; new plugins can then take advantage of it according to their purpose, logic and configurability. Effectively, the code in Queue.java would ask all currently loaded implementations of the extension point whether this Queue.Item (and/or Queue.Task) CAN be executed despite the pending shutdown, i.e. "is not blocked by the pending shutdown", probably with a tri-state outcome (permitted, blocked, no_opinion). The current implementation would remain the fallback that gives a definitive result unless some implementation has returned a definite verdict earlier.

PS: Hope I got and relayed it right

dbeck@cloudbees.com (JIRA)

Feb 5, 2019, 9:46:02 AM
to jenkinsc...@googlegroups.com

Looks about right.

The main problem seems to be that isBlockedByShutdown takes a Task argument (which corresponds to a Job), so it would need to be deprecated and replaced by something that takes a Queue.Item (the precursor to a Run/Build). This way, individual queue items/builds of the same job can have different shutdown-blocking characteristics.

Outline of plan:

  1. Replace isBlockedByShutdown with this new implementation.
  2. Create NonBlockedQueueItemAction and check for its presence in the new isBlockedByShutdown
  3. Actually add these new actions to queue items where it makes sense:
    1. pipeline-build-step plugin or parameterized-trigger could add it to scheduleBuild2 if they wait for the result to prevent deadlocking.
    2. Alternatively, any plugin could contribute the action via Queue.QueueDecisionHandler and allow manual admin overrides and similar.
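Steps 1 and 2 of this outline might look roughly as follows. This is a self-contained model, not a patch: `Action` and `QueueItem` here are minimal stand-ins for the Jenkins types, the `quietingDown` flag stands in for the instance state, and `NonBlockedQueueItemAction` is the name proposed above rather than an existing class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Stand-in for the Jenkins Action marker interface. */
interface Action {}

/** Proposed marker: this queue item may start even during quiet-down. */
final class NonBlockedQueueItemAction implements Action {}

/** Minimal stand-in for Queue.Item: a job name plus attached actions. */
final class QueueItem {
    final String jobName;
    private final List<Action> actions = new ArrayList<>();

    QueueItem(String jobName, Action... actions) {
        this.jobName = jobName;
        this.actions.addAll(Arrays.asList(actions));
    }

    /** Returns the first attached action of the given type, or null. */
    <T extends Action> T getAction(Class<T> type) {
        for (Action action : actions) {
            if (type.isInstance(action)) {
                return type.cast(action);
            }
        }
        return null;
    }
}

final class ItemBasedCheck {
    /** Item-based variant of the check: blocked unless the marker is attached. */
    static boolean isBlockedByShutdown(QueueItem item, boolean quietingDown) {
        return quietingDown && item.getAction(NonBlockedQueueItemAction.class) == null;
    }
}
```

Two queue items of the same job can get different answers here, which is the point of moving from Task to Queue.Item: a build step that waits for the child would attach the marker, while an independent trigger of the same job would not.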

dbeck@cloudbees.com (JIRA)

Feb 5, 2019, 9:49:02 AM
to jenkinsc...@googlegroups.com

The other alternative was to add a new extension point specifically for this (perhaps similar to QueueTaskDispatcher?) and have plugins implement it. The problem there might be that we need a tri-state Allow/Deny/Don't Care for proper 'fallthrough' to the default implementation. Or perhaps not; I am unsure. This is a different approach from the one in my previous comment, and it is what Jim's last paragraph partially refers to.
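The fall-through could be modelled like this. It is only a sketch: `Verdict`, `ShutdownBlockDecider` and `ShutdownVerdicts` are invented names, jobs are represented by plain strings, and the `defaultBlocked` argument stands in for whatever the existing check would return. Letting the first definite verdict win is an assumption of this sketch, not something settled in the discussion.

```java
import java.util.List;

/** Hypothetical tri-state answer from a plugin. */
enum Verdict { ALLOW, DENY, DONT_CARE }

/** Hypothetical extension point that plugins would implement. */
interface ShutdownBlockDecider {
    Verdict decide(String jobName);
}

final class ShutdownVerdicts {
    /**
     * Asks each decider in order; the first ALLOW or DENY wins.
     * If every decider answers DONT_CARE, the default result is used.
     */
    static boolean isBlocked(String jobName,
                             List<ShutdownBlockDecider> deciders,
                             boolean defaultBlocked) {
        for (ShutdownBlockDecider decider : deciders) {
            Verdict verdict = decider.decide(jobName);
            if (verdict == Verdict.ALLOW) {
                return false;
            }
            if (verdict == Verdict.DENY) {
                return true;
            }
            // DONT_CARE: ask the next implementation
        }
        return defaultBlocked;
    }
}
```

A boolean return type would force every implementation to take a side; the third value is what lets a plugin that only cares about, say, admin overrides stay out of all other decisions.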

jimklimov@gmail.com (JIRA)

Nov 5, 2019, 9:10:03 AM
to jenkinsc...@googlegroups.com

As another alternative: having recently dug into some other plugins, I wonder whether a solution could come from a cannibalized copy of `lockable-resources-plugin` (adapted by a non-Java developer)?

After all, it already has ways to block jobs from starting (while they wait for "this resource", which has no instances available), so it seems feasible to replace the locked-resource processing logic with what I originally proposed for job/build/cause relationships, and to allow or block builds accordingly. In fact, it would be less work to make this behavior optional and used only by specific jobs rather than server-wide, because lockable-resource abilities are something one configures in a job (and, going this way, it would take some effort to find and rip out the relevant bits).
