[JIRA] (JENKINS-58492) Improve robustness w.r.t. bad slave nodes


pjdarton@gmail.com (JIRA)

Jul 15, 2019, 6:50:02 AM
to jenkinsc...@googlegroups.com
pjdarton created an issue
 
Jenkins / Improvement JENKINS-58492
Improve robustness w.r.t. bad slave nodes
Issue Type: Improvement
Assignee: tspengler
Components: hudson-wsclean-plugin
Created: 2019-07-15 10:49
Environment: Master with multiple static slaves
Labels: robustness
Priority: Major
Reporter: pjdarton

At present (version 1.0.5), the overall stability and performance of every build is only as good as the worst slave.
In an ideal world this isn't a problem, but "in the real world", where not everything is perfect, we need to minimize the impact of a bad slave rather than maximize it. Fortunately, at least for this plugin, that shouldn't be too difficult to achieve.

This plugin should be more robust in its dealing with the other slaves:

  1. Timeouts.
    Surround all calls to the remote slaves with timeouts so that we can ensure that the cleanup stage cannot run indefinitely.
  2. Parallel execution.
    Run each remote deletion in a separate thread so that deletions on different slaves can happen in parallel.
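
As a rough illustration of item 1, here is a minimal Java sketch of wrapping one remote call in a timeout. It is not the plugin's code: the class name TimedRemoteCall and the method runWithTimeout are hypothetical, and the Callable simply stands in for whatever call the plugin makes to delete a workspace on a remote slave.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, for illustration only; not part of the plugin.
final class TimedRemoteCall {

    /**
     * Runs the given task (assumed to be a remote workspace deletion) and waits
     * at most the given time for it to finish.
     *
     * @return true if the task completed in time, false if it failed,
     *         timed out, or the caller was interrupted while waiting
     */
    static boolean runWithTimeout(Callable<Void> remoteCall, long timeout, TimeUnit unit) {
        // Daemon thread, so a call that never returns cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "remote-cleanup");
            t.setDaemon(true);
            return t;
        });
        Future<Void> future = executor.submit(remoteCall);
        try {
            future.get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            // The slave did not answer in time: give up on it and carry on.
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            // The remote call itself threw an exception.
            return false;
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Note that cancelling the Future only interrupts the waiting thread; whether the work on the remote side actually stops depends on the remoting layer, which this sketch does not model.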

Justification:
Scenario one:
If a slave has locked up or is otherwise unresponsive (something we find happens, especially with Windows-based slaves) then all builds that might run on that slave will end up locking up when they attempt to remove their workspace from that slave.
If we had timeouts then, while we can't rescue the build that's running on the locked-up slave, at least all our other builds will continue unaffected, minimizing the impact of that badly-behaved slave node.
Scenario two:
When there are a lot of slaves, deleting each workspace in sequence can take a long time, causing big delays for the builds; the workspace cleanup phase of a build can take significantly longer than all of the rest of the build activity combined.
If we ran the deletions in parallel then all the slaves could delete their workspaces at the same time, ensuring that the overall delay to the currently-running build was only as long as the slowest slave.
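
The parallel case can be sketched along the same lines. The Java below is an assumption-laden illustration, not the plugin's implementation: ParallelCleanup, Outcome and deleteAll are hypothetical names, and each Callable stands in for the remote deletion on one slave. It uses ExecutorService.invokeAll with a timeout, so the whole cleanup waits no longer than the timeout, however many slaves there are.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, for illustration only; not part of the plugin.
final class ParallelCleanup {

    enum Outcome { DELETED, FAILED, TIMED_OUT }

    /**
     * Runs every deletion on its own thread and waits at most the given time
     * for all of them. The map key is an assumed slave name; the value stands
     * in for the remote workspace deletion on that slave.
     */
    static Map<String, Outcome> deleteAll(Map<String, Callable<Void>> deletions,
                                          long timeout, TimeUnit unit) {
        Map<String, Outcome> results = new LinkedHashMap<>();
        if (deletions.isEmpty()) {
            return results; // a fixed thread pool cannot have zero threads
        }
        List<String> names = new ArrayList<>();
        List<Callable<Void>> tasks = new ArrayList<>();
        for (Map.Entry<String, Callable<Void>> e : deletions.entrySet()) {
            names.add(e.getKey());
            tasks.add(e.getValue());
        }
        // One daemon thread per slave, so a hung slave cannot block the others.
        ExecutorService pool = Executors.newFixedThreadPool(tasks.size(), r -> {
            Thread t = new Thread(r, "ws-cleanup");
            t.setDaemon(true);
            return t;
        });
        try {
            // Returns once all tasks are done or the timeout expires;
            // tasks still running at that point are cancelled.
            List<Future<Void>> futures = pool.invokeAll(tasks, timeout, unit);
            for (int i = 0; i < futures.size(); i++) {
                Future<Void> f = futures.get(i);
                Outcome outcome;
                if (f.isCancelled()) {
                    outcome = Outcome.TIMED_OUT;
                } else {
                    try {
                        f.get(); // already finished, so this does not block
                        outcome = Outcome.DELETED;
                    } catch (ExecutionException ex) {
                        outcome = Outcome.FAILED;
                    }
                }
                results.put(names.get(i), outcome);
            }
        } catch (InterruptedException ex) {
            // Caller was interrupted: results may be incomplete.
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return results;
    }
}
```

With this shape, a single hung slave costs the build at most the timeout, and a healthy set of slaves costs only as long as the slowest one.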

Note: We could make this parallel/serial choice configurable, and we could make the timeout configurable too, with the default for existing configurations being "serial, no timeout" to preserve existing behavior. The Jelly code could set the defaults for new users to be "parallel, 5 minutes" or similar.

TL;DR: Ensure that "this build on this slave" is unaffected by problems with "other slaves that this build could've used".

This message was sent by Atlassian Jira (v7.11.2#711002-sha1:fdc329d)

pjdarton@gmail.com (JIRA)

Aug 13, 2019, 11:27:02 AM
to jenkinsc...@googlegroups.com
pjdarton assigned an issue to pjdarton
Change By: pjdarton
Assignee: tspengler -> pjdarton

pjdarton@gmail.com (JIRA)

Aug 13, 2019, 11:31:05 AM
to jenkinsc...@googlegroups.com
pjdarton commented on Improvement JENKINS-58492
 
Re: Improve robustness w.r.t. bad slave nodes

I've created a PR that'll address this issue ... but until the plugin code has a Jenkinsfile (see PR#5) folks will have to build their own copy of the code.

pjdarton@gmail.com (JIRA)

Aug 13, 2019, 11:31:05 AM
to jenkinsc...@googlegroups.com
pjdarton started work on Improvement JENKINS-58492
 
Change By: pjdarton
Status: Open -> In Progress

pjdarton@gmail.com (JIRA)

Aug 21, 2019, 6:57:03 AM
to jenkinsc...@googlegroups.com
pjdarton commented on Improvement JENKINS-58492

Update for anyone watching this:
There's now a Pull Request that contains a fix for this issue (plus other enhancements). Anyone can download the .hpi file of the fixed plugin from there and then upload it (Manage Jenkins -> Manage Plugins -> Advanced -> Upload plugin) to their own Jenkins server(s) to try it out.

Once I'm confident that everything is OK, I'll merge those changes in and release the new plugin officially.

pjdarton@gmail.com (JIRA)

Aug 28, 2019, 9:05:03 AM
to jenkinsc...@googlegroups.com
pjdarton resolved as Fixed
 

Fixed in version 1.0.6, which was released today.

Change By: pjdarton
Status: In Progress -> Resolved
Resolution: Fixed