[JIRA] (JENKINS-51203) Slave.setNodeProperties is unreliable and inefficient

pjdarton@gmail.com (JIRA)

unread,

May 9, 2018, 12:49:03 PM5/9/18

to jenkinsc...@googlegroups.com

pjdarton updated an issue

Jenkins /

JENKINS-51203

Slave.setNodeProperties is unreliable and inefficient

Change By:	pjdarton
Summary:	Slave.setNodeProperties causes spurious call to Nodes.save is unreliable and inefficient

Add Comment

This message was sent by Atlassian JIRA (v7.3.0#73011-sha1:3c73d0e)

pjdarton@gmail.com (JIRA)

unread,

May 9, 2018, 12:52:02 PM5/9/18

to jenkinsc...@googlegroups.com

pjdarton updated an issue

Jenkins /

JENKINS-51203

Slave.setNodeProperties is unreliable and inefficient

Change By:	pjdarton
Environment:	Jenkins with at least one cloud plugin Code that's dynamically- creating multiple Slave nodes , in parallel, that call setNodeProperties have nodeProperties . e.g. cloud plugins or pipelines running parallel jobs in dynamic slaves.

Add Comment

pjdarton@gmail.com (JIRA)

unread,

May 9, 2018, 1:00:02 PM5/9/18

to jenkinsc...@googlegroups.com

pjdarton updated an issue

Jenkins /

JENKINS-51203

Slave.setNodeProperties is unreliable and inefficient

Change By:	pjdarton

TL;DR: {{Slave.setNodeProperties}} is unreliable because it's doing stuff it shouldn't.

When a cloud plugin's {{Cloud.provision}} method gets called, the plugin code has to create one or more instances of {{hudson.model.Slave}} and return them to Jenkins.
Prior to returning them to Jenkins, it has to populate those instances with all the data required. This can (e.g. in the case of the docker-plugin) include a call to {{Slave.setNodeProperties}}.

{{Slave.setNodeProperties}} calls {{nodeProperties.replaceBy()}}. {{nodeProperties}} are a {{hudson.util.PersistedList}} whose owner is {{Jenkins.getInstance().getNodesObject()}} (see [Slave.java line 150|https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/Slave.java#L150]), and {{PersistedList.replaceBy}} calls {{PersistedList.onModified()}} which then calls [{{Nodes.save()}}|https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/jenkins/model/Nodes.java#L277] which writes all nodes to disk (which is a non-trivial operation that isn't thread-safe).

i.e. At present, any change to the node properties for any slave (even those not persisted to disk yet) causes all slaves to get (re-)persisted to disk.

This is very inefficient when there are a lot of slaves - it should, at most, persist just the slave in question, and in the case where a slave doesn't belong to anyone yet, it shouldn't get persisted at all.
Worse still, when there's more than one slave being provisioned at a time (e.g. on a busy Jenkins server with a fast cloud), the persistence functionality can throw an exception and cause the entire operation to fail:
{noformat}
java.nio.file.NoSuchFileException: /var/jenkins_home/nodes/docker-22f02f15d4988b/atomic3648298142656815004tmp
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:409)
at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
at java.nio.file.Files.move(Files.java:1395)
at hudson.util.AtomicFileWriter.commit(AtomicFileWriter.java:206)
at hudson.XmlFile.write(XmlFile.java:198)
at jenkins.model.Nodes.save(Nodes.java:274)
at hudson.util.PersistedList.onModified(PersistedList.java:173)
at hudson.util.PersistedList.replaceBy(PersistedList.java:85)
at hudson.model.Slave.setNodeProperties(Slave.java:299)
at ...
{noformat}

What should happen instead is that changes to {{Slave.nodeProperties}} should only cause those changes to be persisted if the slave itself is persisted. Changes to Slaves that are not (yet) part of Jenkins' list of Nodes should not cause a re-save of the configuration to disk.
e.g. much like the logic done in [{{Node.save()}}|https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/Node.java#L131].

FYI this was discovered while investigating the cause of numerous docker provisioning failures caused by occasional (unpredictable) persistence failures from {{hudson.util.AtomicFileWriter}} (see docker-plugin issue [644|https://github.com/jenkinsci/docker-plugin/issues/644]). While the docker plugin should cope with such failures, the core code shouldn't be causing these problems in the first place!

I suspect that this could be done by having {{Slave}} have {{nodeProperties.owner}} set to {{this}}, because then it'll benefit from the same optimization logic that's already implemented in {{Node.save()}}.
I also suspect that
Note: Elsewhere, calls to {{Nodes.save()}} should be are wrapped in a {{ synchronized Queue.withLock }} as that block to prevent concurrent updates; the call from {{Slave.nodeProperties}} isn ' t wrapped like this, which is why the exception happens, but it'd be better to avoid calling it entirely than to simply make it' s probably also a cause of persistence issues when multiple threads are adding slaves existing functionality thread-safe .

Add Comment

pjdarton@gmail.com (JIRA)

unread,

May 9, 2018, 1:02:02 PM5/9/18

to jenkinsc...@googlegroups.com

FYI this was discovered while investigating the cause of numerous docker provisioning failures caused by occasional (unpredictable) persistence failures from {{hudson.util.AtomicFileWriter}} (see docker-plugin issue [644|https://github.com/jenkinsci/docker-plugin/issues/644]). While the docker that plugin should code is being enhanced to cope with such failures, the core code shouldn't be causing these problems failures in the first place ! .

Suggestion: I suspect that this could be done fixed by having {{Slave}} have {{nodeProperties.owner}} set to {{this}}, because then it'll benefit from the same optimization logic that's already implemented in {{Node.save()}}.

Note: Elsewhere, calls to {{Nodes.save()}} are wrapped in a {{Queue.withLock}} block to prevent concurrent updates; the call from {{Slave.nodeProperties}} isn't wrapped like this, which is why the exception happens, but it'd be better to avoid calling it entirely than to simply make it's existing functionality thread-safe.

Add Comment

kanstantsin.sha@gmail.com (JIRA)

unread,

Jun 20, 2019, 10:44:03 AM6/20/19

to jenkinsc...@googlegroups.com

Kanstantsin Shautsou edited a comment on

JENKINS-51203

Re: Slave.setNodeProperties is unreliable and inefficient

One more possible chain how to get into this error
{ { code} }
04:25:05 Caused by: java.nio.file.NoSuchFileException: /jenkins/jenkins-work/nodes/yadp877105322/atomic113511861761462995tmp
04:25:05  at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
04:25:05  at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
04:25:05  at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
04:25:05  at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:409)
04:25:05  at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
04:25:05  at java.nio.file.Files.move(Files.java:1395)
04:25:05  at hudson.util.AtomicFileWriter.commit(AtomicFileWriter.java:206)
04:25:05  at hudson.XmlFile.write(XmlFile.java:198)
04:25:05  at jenkins.model.Nodes.save(Nodes.java:274)
04:25:05  at hudson.util.PersistedList.onModified(PersistedList.java:173)
04:25:05  at hudson.util.PersistedList.replaceBy(PersistedList.java:85)
04:25:05  at hudson.model.Slave.<init>(Slave.java:198)
04:25:05  at hudson.slaves.AbstractCloudSlave.<init>(AbstractCloudSlave.java:51)
{ { code} }

Add Comment

This message was sent by Atlassian Jira (v7.11.2#711002-sha1:fdc329d)

kanstantsin.sha@gmail.com (JIRA)

unread,

Jun 20, 2019, 10:44:04 AM6/20/19

to jenkinsc...@googlegroups.com

Kanstantsin Shautsou commented on

JENKINS-51203

Reply all

Reply to author

Forward