[JIRA] (JENKINS-47821) vsphere plugin 2.16 not respecting slave disconnect settings


pjdarton@gmail.com (JIRA)

Mar 23, 2018, 9:30:01 AM
to jenkinsc...@googlegroups.com
pjdarton assigned an issue to pjdarton
 
Jenkins / Improvement JENKINS-47821
vsphere plugin 2.16 not respecting slave disconnect settings
Change By: pjdarton
Assignee: pjdarton
 
This message was sent by Atlassian JIRA (v7.3.0#73011-sha1:3c73d0e)

pjdarton@gmail.com (JIRA)

Mar 23, 2018, 9:34:02 AM
pjdarton commented on Improvement JENKINS-47821
 
Re: vsphere plugin 2.16 not respecting slave disconnect settings

You're saying it's a regression since 2.15?  Hmm, ok...  I certainly hadn't intended to cause this behavior but I'll see if I can find the cause and fix it...

If you can provide any further information then that'd greatly simplify the debugging process.

e.g. what do you mean by "opportunistically"?  What's the scenario in which the VM gets re-used (when it shouldn't) vs being disposed of correctly?

John.Mellor@esentire.com (JIRA)

Mar 23, 2018, 9:54:02 AM

Yes, exactly. I never do incremental builds, as they are a severely broken dev practice. I typically set up a node to disconnect after one build and reset back to a VMware snapshot upon restart. That way I can easily debug a build problem, because the machine is left in the state where the build failed, and the next job does not have artifacts left over from the previous build, such as dependency packages, config files or Docker images.

However I am now in a situation where sometimes a queued build runs on the node without going through the reset-back-to-snapshot step, breaking it.

I have a crude workaround of powering the node down after every build, forcing it to go through the power-up steps which will then revert back to snapshot. However, this maximizes the downtime for the node between builds, and prevents some debugging actions because you lose the in-memory structures this way.

pjdarton@gmail.com (JIRA)

Mar 23, 2018, 10:43:01 AM
pjdarton commented on Improvement JENKINS-47821

I share your opinions; I never (intentionally) do incremental builds either.

If you can figure out a reproducible test case that I can follow here to reproduce the issue (i.e. see it reuse a node using plugin version 2.16 where it didn't on 2.15), then that'll greatly assist (and hence speed up) the diagnostic process and dramatically reduce the time-to-fix.  Debugging something that happens "sometimes" is way more difficult than debugging something that happens "every time you do X".

i.e. If you help me to help you, you'll get a solution a lot quicker

pjdarton@gmail.com (JIRA)

Mar 27, 2018, 12:02:02 PM
pjdarton updated an issue
 
Jenkins / Bug JENKINS-47821
Change By: pjdarton
Issue Type: Improvement → Bug

pjdarton@gmail.com (JIRA)

Mar 27, 2018, 12:03:03 PM
pjdarton started work on Bug JENKINS-47821
 
Change By: pjdarton
Status: Open → In Progress

pjdarton@gmail.com (JIRA)

Mar 27, 2018, 12:25:02 PM
pjdarton commented on Bug JENKINS-47821
 
Re: vsphere plugin 2.16 not respecting slave disconnect settings

John Mellor  I've spotted what might have been a race condition in the code, giving it an opportunity to go wrong where it didn't before, but without further information regarding your configuration, I have no means to test whether or not it's fixed the issue.

I've made some changes in vsphere-cloud PR#91 and you can download a built plugin from the ci.jenkins.io Jenkins server vsphere-cloud PR-91 CI build job (see "Last Successful Artifacts" - "vsphere-cloud.hpi").
If you download that file you can then install it using "Manage Jenkins" -> "Manage Plugins" -> "Advanced" -> "Upload Plugin".

Give that version of the plugin a try and see if it makes a difference. If it doesn't help, you'll have to go into way more detail about how you've got things set up, so that I can reproduce the issue locally. If it does help then please let me know.

pjdarton@gmail.com (JIRA)

Mar 27, 2018, 12:26:02 PM
pjdarton edited a comment on Bug JENKINS-47821
[~alt_jmellor]  I've spotted what might have been a race condition in the code, giving it an opportunity to go wrong where it didn't before, but without further information regarding your configuration, I have no means to test whether or not it's fixed the issue.

I've made some changes in [vsphere-cloud PR#91|https://github.com/jenkinsci/vsphere-cloud-plugin/pull/91] and you can download a built plugin from the ci.jenkins.io Jenkins server [vsphere-cloud PR-91|https://ci.jenkins.io/job/Plugins/job/vsphere-cloud-plugin/job/PR-91/] CI build job (see "Last Successful Artifacts" - "[vsphere-cloud.hpi|https://ci.jenkins.io/job/Plugins/job/vsphere-cloud-plugin/job/PR-91/lastSuccessfulBuild/artifact/target/vsphere-cloud.hpi]").

If you download that file you can then install it using "Manage Jenkins" -> "Manage Plugins" -> "Advanced" -> "Upload Plugin".

Give that version of the plugin a try and see if it makes a difference. If it doesn't help, you'll have to go into _way_ more detail about how you've got things set up, so that I can reproduce the issue locally. If it does help then please let me know.

scm_issue_link@java.net (JIRA)

Apr 4, 2018, 12:15:03 PM

Code changed in jenkins
User: Peter Darton
Path:
src/main/java/org/jenkinsci/plugins/vSphereCloudLauncher.java
src/main/java/org/jenkinsci/plugins/vSphereCloudSlave.java
src/main/java/org/jenkinsci/plugins/vSphereCloudSlaveTemplate.java
src/main/java/org/jenkinsci/plugins/vsphere/RunOnceCloudRetentionStrategy.java
src/main/java/org/jenkinsci/plugins/vsphere/VSphereOfflineCause.java
src/main/resources/org/jenkinsci/plugins/Messages.properties
src/main/resources/org/jenkinsci/plugins/vsphere/Messages.properties
http://jenkins-ci.org/commit/vsphere-cloud-plugin/620868e4808f0df6772c11331dc86bd3ea8413eb
Log:
Merge pull request #91 from pjdarton/prevent-reuse-of-single-use-slaves

JENKINS-47821 Prevent run-once slave from accepting more jobs.

Compare: https://github.com/jenkinsci/vsphere-cloud-plugin/compare/6f78bb0aa164...620868e4808f

John.Mellor@esentire.com (JIRA)

Apr 5, 2018, 12:28:02 PM

With the limited testing that I've been able to perform, this change provisionally appears not to be working.
If I queue up multiple jobs for a single node, and configure the node to do nothing upon end-of-job and to reset back to snapshot upon startup, then I do not see the expected revert-to-snapshot between jobs. It looks like it just starts the next job on the already-polluted machine and skips the revert-to-snapshot for some reason.

pjdarton@gmail.com (JIRA)

Apr 6, 2018, 5:37:02 AM
pjdarton commented on Bug JENKINS-47821

In that case then I'm going to need you to describe your setup, as that's not what I see here (but then, I mostly use the plugin's "Cloud" functionality and am unfamiliar with its other functionality, which I'm guessing is what you're using).

If you can provide a description of how to set up a Jenkins server (that has the vSphere plugin installed) to reproduce this issue, I'll see if I can reproduce it. If I can reproduce it, there's a chance I might be able to fix it.
(FYI fixing bugs in this plugin is not my official day job, so the easier you can make it for me to see the issue for myself, the better the chances are that I can come up with a fix before my boss tells me to do something that is my official day job)

John.Mellor@esentire.com (JIRA)

Apr 6, 2018, 11:34:02 AM

For some reason, I am unable to screenshot a typical config into this ticket.
When I configure a high-use build node, I generally set it up for:

Availability: Take this agent online when in demand, and offline when idle
Disconnect after limited builds: 1
What to do when the slave is disconnected: Revert and Restart

If it is a low-use node, then I instead configure for:

What to do when the slave is disconnected: Shutdown

pjdarton@gmail.com (JIRA)

Apr 6, 2018, 1:55:03 PM
pjdarton commented on Bug JENKINS-47821

Can we start with where you define the node?
FYI there are multiple ways the plugin can define a slave node, so how you get to the point where you make the choices you've described can make a difference.
I need instructions that start from "I've installed Jenkins and I've installed the plugin". I'm guessing that the next step would be to define a vSphere cloud and tell Jenkins the URL of vSphere and login details, and I presume that there will have to be some stuff in that vSphere server too, but I need to know what it consists of.

John.Mellor@esentire.com (JIRA)

Apr 6, 2018, 2:06:02 PM
Name of this Cloud: QA Cluster
vSphere Host: https://vsphere.internal
Disable Certificate Verification: checked
Credentials: <valid non-interactive user/password in credentials>
Templates: <none>

FYI, there are several types of clouds configured at this site: Google, VMware, k8s, etc.

John.Mellor@esentire.com (JIRA)

Apr 6, 2018, 2:09:02 PM

The target vsphere cloud is running an esxi-5.5 cluster managed by vcenter-5.5, and using unshared local disks in RAID-6 as the VMFS volumes. Not sure what else I can give you.

pjdarton@gmail.com (JIRA)

Apr 6, 2018, 5:31:02 PM
pjdarton commented on Bug JENKINS-47821

How (by which method?) did you define the slave nodes in Jenkins?
(All my vSphere slaves are created from templates defined in the cloud section; I'm aware it's possible to define non-cloud ones by a couple of routes, but I've never done that myself.)

sqa.valentinmarin@gmail.com (JIRA)

May 21, 2018, 1:17:03 PM

Got the same issue here, as in slaves not respecting the disconnect-after-limited-builds setting (Jenkins 2.107.2, vSphere plugin 2.17). Nodes have been defined via Jenkins -> Nodes -> "Slave virtual computer running under vSphere Cloud".

sqa.valentinmarin@gmail.com (JIRA)

May 22, 2018, 5:12:01 AM
Valentin Marin edited a comment on Bug JENKINS-47821
Got the same issue here, as in slaves not respecting the disconnect-after-limited-builds setting (Jenkins 2.107.2, vSphere 2.17). Nodes have been defined via Jenkins -> Nodes -> "Slave virtual computer running under vSphere Cloud".


To add a bit of context, I'm running pipeline projects on those nodes and they do not seem to be treated as 'builds' per se, as no executed instances of those are being displayed in the node's "Build History" section.

pjdarton@gmail.com (JIRA)

May 22, 2018, 6:08:02 AM
pjdarton commented on Bug JENKINS-47821

Valentin Marin So you've got statically-defined slaves...  How are they connecting to Jenkins?  SSH?  JNLP?  If JNLP, which protocol version?  And are you passing in a JNLP_SECRET, or are they allowed in unauthenticated?  Also, what version of slave.jar are you using on the slave VMs?

I've been tracing oddities in my own Jenkins build environment where slaves that start and then connect via JNLP often "stay online" (briefly) after they've gone offline due to a reboot-induced disconnection (long enough to start a new build job, which then fails because the slave had disconnected), but I've yet to get to the bottom of it (race-conditions are always difficult to debug).  It may be that the issue I'm trying to track down and this issue are all related...

 

FYI I don't think that the lack of pipeline history is a vSphere plugin issue.  I've got a pipeline job that reboots my static (non-VM) Windows slaves and that doesn't show up on their build history, so if a pipeline segment doesn't show up on a normal Jenkins slave's build history, I don't think we can expect it to show up on a vSphere slave's history either, as that'd be common code (the vSphere slave code "extends" the Jenkins core Slave code).

sqa.valentinmarin@gmail.com (JIRA)

May 22, 2018, 6:21:03 AM
Valentin Marin edited a comment on Bug JENKINS-47821
Slaves are connected via JNLP (Windows service, while passing the JNLP secret), remoting version 3.17.

sqa.valentinmarin@gmail.com (JIRA)

May 22, 2018, 6:21:03 AM

Slaves are connected via JNLP (windows service with secret), remoting version 3.17.

eub.kansas19@gmail.com (JIRA)

Jul 25, 2018, 2:49:03 PM

Found a ticket regarding build history and pipelines


eub.kansas19@gmail.com (JIRA)

Jul 25, 2018, 3:46:02 PM
Josiah Eubank edited a comment on Bug JENKINS-47821
Found a ticket regarding build history and pipelines JENKINS-38877

eub.kansas19@gmail.com (JIRA)

Jul 25, 2018, 4:43:02 PM
Josiah Eubank edited a comment on Bug JENKINS-47821
Found a ticket regarding build history and pipelines JENKINS-38877


Experiencing this still on 2.18, even though the text "Limited Builds is not currently used" no longer appears in the config help

eub.kansas19@gmail.com (JIRA)

Jul 25, 2018, 5:14:02 PM
Josiah Eubank edited a comment on Bug JENKINS-47821
Found a ticket regarding build history and pipelines JENKINS-38877

Experiencing this still on 2.18, even though the text "Limited Builds is not currently used" no longer appears in the config help. Note this is combined with "Take this agent offline when not in demand...."

oren@chapo.co.il (JIRA)

Jul 29, 2018, 11:35:02 AM

I've seen this issue also with versions 2.16 and 2.18 of the vSphere Cloud plugin; however, it seems like it's not a problem in the plugin, but a limitation of the "cloud" Jenkins interface that the plugin implements.

If you're trying to ensure a slave is always in a "clean" state when allocated, here's my workaround, after hours of painful google-search, trial and error:
1. Node configuration: fill the "Snapshot Name" field (eg "Clean")
2. Node configuration: Availability: "Take this agent online when in demand, and offline when idle"
3. Node configuration: What to do when the slave is disconnected: "Shutdown"
4. Pipeline job configuration: include the following code:

	import jenkins.slaves.*
	import jenkins.model.*
	import hudson.slaves.*
	import hudson.model.*
	
	def SafelyDisposeNode() {
		print "Safely disposing node..."
		def slave = Jenkins.instance.getNode(env.NODE_NAME) as Slave
		if (slave == null) {
			error "ERROR: Could not get slave object for node!"
		}
		try
		{
			slave.getComputer().setTemporarilyOffline(true, null)
			if(isUnix()) {
				sh "(sleep 2; poweroff)&"
			} else {
				bat "shutdown -t 2 -s"
			}
			slave.getComputer().disconnect(null)
			sleep 10
		} catch (err) {
			print "ERROR: could not safely dispose node!"
		} finally {
			slave.getComputer().setTemporarilyOffline(false, null)
		}
		print "...node safely disposed."
		slave = null
	}
	
	def DisposableNode(String nodeLabel, Closure body) {
		node(nodeLabel) {
			try {
				body()
			} catch (err) {
				throw err
			} finally {
				SafelyDisposeNode()
			}
		}
	}

5. When you want to ensure the node will NOT be used by another job (or another run of the same job), use a "DisposableNode" block instead of "node" block:

	DisposableNode('MyNodeLabel') {
		// run your pipeline code here.
		// it will make sure the node is shutdown at the end of the block, even if it fails.
		// no other job or build will be able to use the node in its "dirty" state,
		// and vSphere plugin will revert to "clean" snapshot before starting the node again.
	}

6. If other Jobs are using this node (or node label), they all must use the above workaround, to avoid leaving a "dirty" machine for each other.
7. As for the "why is it so important to have the node in a clean state?" question, my use case is integration tests of kernel-mode drivers (both Windows and Linux O/S) that typically "break" the O/S and leave it in an unstable state (BSODs and kernel panics are common).
8. If your pipeline job is running under a Groovy sandbox, you will need to permit some classes (the job will fail and offer to whitelist a signature; repeat carefully several times).
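Regarding step 8: rather than letting the job fail once per blocked signature, a Jenkins administrator can pre-approve the required signatures from the Script Console. The sketch below is an illustration only, assuming the script-security plugin's ScriptApproval API; the exact signature strings vary by Jenkins/plugin version, so the safest route is still to copy the strings from the "Scripts not permitted" failures as described above:

```groovy
// Run from Manage Jenkins -> Script Console (administrator only).
// Pre-approves the sandbox signatures used by the SafelyDisposeNode() workaround.
// NOTE: the signature strings below are examples and may not exactly match your
// Jenkins version -- copy the real ones from the sandbox failure messages.
import org.jenkinsci.plugins.scriptsecurity.scripts.ScriptApproval

def approval = ScriptApproval.get()
[
    'staticMethod jenkins.model.Jenkins getInstance',
    'method jenkins.model.Jenkins getNode java.lang.String',
    'method hudson.model.Slave getComputer',
    'method hudson.model.Computer setTemporarilyOffline boolean hudson.slaves.OfflineCause',
    'method hudson.model.Computer disconnect hudson.slaves.OfflineCause',
].each { signature ->
    // approveSignature() adds to a set, so re-approving an already-approved
    // signature is harmless
    approval.approveSignature(signature)
}
```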

james.telfer@horiba.com (JIRA)

Apr 24, 2019, 3:53:02 AM

Any progress on this?  I have just come up against what looks like the same issue.  Statically defined Windows slaves connecting via JNLPv4.

They seem to completely ignore the 'Disconnect After Limited Builds' option, which, re-reading the Wiki, seems to be the expected behaviour?

Oren Chapo your work-around doesn't seem to work for me, at least not when using it within declarative pipeline.


werner.mueller8@boschrexroth.de (JIRA)

Jan 2, 2020, 7:23:02 AM

I modified the workaround to reset the vm in the pipeline itself.

Advantages:

  • Shutdown activities are not required in the node configuration.
  • The node is reset to the given snapshot before the pipeline executes.

 

import jenkins.model.Jenkins
import hudson.model.Slave

def ResettedNode(String vm, String serverName, String snapshotName, Closure body) {
    node(vm) {
        // Reset the computer within the context of the node, to avoid other jobs
        // running on this node in the meanwhile
        stage('Reset node') {
            def slave = Jenkins.instance.getNode(env.NODE_NAME) as Slave
            if (slave == null) {
                error "ERROR: Could not get slave object for node!"
            }
            try {
                slave.getComputer().setTemporarilyOffline(true, null)
                vSphere buildStep: [$class: 'PowerOff', vm: vm, evenIfSuspended: true, shutdownGracefully: false, ignoreIfNotExists: false], serverName: serverName
                vSphere buildStep: [$class: 'RevertToSnapshot', vm: vm, snapshotName: snapshotName], serverName: serverName
                vSphere buildStep: [$class: 'PowerOn', timeoutInSeconds: 240, vm: vm], serverName: serverName
                slave.getComputer().disconnect(null)
                sleep 10 // wait while the agent on the slave is starting up
            } catch (err) {
                print "ERROR: could not reset node!"
            } finally {
                slave.getComputer().setTemporarilyOffline(false, null)
            }
            slave = null
        }
    }
    // Wait for the node to come online again
    node(vm) {
        body()
    }
}

ResettedNode('vm', 'vCloud', 'clean') {
    // pipeline code to run on the freshly-reverted node goes here
}
