[JIRA] (JENKINS-51203) Slave.setNodeProperties is unreliable and inefficient

6 views
Skip to first unread message

pjdarton@gmail.com (JIRA)

unread,
May 9, 2018, 12:49:03 PM5/9/18
to jenkinsc...@googlegroups.com
pjdarton updated an issue
 
Jenkins / Bug JENKINS-51203
Slave.setNodeProperties is unreliable and inefficient
Change By: pjdarton
Summary: Slave.setNodeProperties causes spurious call to Nodes.save is unreliable and inefficient
Add Comment Add Comment
 
This message was sent by Atlassian JIRA (v7.3.0#73011-sha1:3c73d0e)
Atlassian logo

pjdarton@gmail.com (JIRA)

unread,
May 9, 2018, 12:52:02 PM5/9/18
to jenkinsc...@googlegroups.com
pjdarton updated an issue
Change By: pjdarton
Environment: Jenkins with at least one cloud plugin Code that's dynamically- creating multiple Slave nodes , in parallel, that call setNodeProperties have nodeProperties .
e.g. cloud plugins or pipelines running parallel jobs in dynamic slaves.

pjdarton@gmail.com (JIRA)

unread,
May 9, 2018, 1:00:02 PM5/9/18
to jenkinsc...@googlegroups.com
pjdarton updated an issue
TL;DR:  {{Slave.setNodeProperties}} is unreliable because it's doing stuff it shouldn't.

 

When a cloud plugin's {{Cloud.provision}} method gets called, the plugin code has to create one or more instances of {{hudson.model.Slave}} and return them to Jenkins.
Prior to returning them to Jenkins, it has to populate those instances with all the data required.  This can (e.g. in the case of the docker-plugin) include a call to {{Slave.setNodeProperties}}.

{{Slave.setNodeProperties}} calls {{nodeProperties.replaceBy()}}.  {{nodeProperties}} are a {{hudson.util.PersistedList}} whose owner is {{Jenkins.getInstance().getNodesObject()}} (see [Slave.java line 150|https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/Slave.java#L150]), and {{PersistedList.replaceBy}} calls {{PersistedList.onModified()}} which then calls [{{Nodes.save()}}|https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/jenkins/model/Nodes.java#L277] which writes all nodes to disk (which is a non-trivial operation that isn't thread-safe).

i.e. At present, any change to the node properties for any slave (even those not persisted to disk yet) causes all slaves to get (re-)persisted to disk.

This is very inefficient when there are a lot of slaves - it should, at most, persist just the slave in question, and in the case where a slave doesn't belong to anyone yet, it shouldn't get persisted at all.
Worse still, when there's more than one slave being provisioned at a time (e.g. on a busy Jenkins server with a fast cloud), the persistence functionality can throw an exception and cause the entire operation to fail:
{noformat}
java.nio.file.NoSuchFileException: /var/jenkins_home/nodes/docker-22f02f15d4988b/atomic3648298142656815004tmp
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:409)
at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
at java.nio.file.Files.move(Files.java:1395)
at hudson.util.AtomicFileWriter.commit(AtomicFileWriter.java:206)
at hudson.XmlFile.write(XmlFile.java:198)
at jenkins.model.Nodes.save(Nodes.java:274)
at hudson.util.PersistedList.onModified(PersistedList.java:173)
at hudson.util.PersistedList.replaceBy(PersistedList.java:85)
at hudson.model.Slave.setNodeProperties(Slave.java:299)
at ...
{noformat}
 

What should happen instead is that changes to {{Slave.nodeProperties}} should only cause those changes to be persisted if the slave itself is persisted.  Changes to Slaves that are not (yet) part of Jenkins' list of Nodes should not cause a re-save of the configuration to disk.
e.g. much like the logic done in [{{Node.save()}}|https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/Node.java#L131].

FYI this was discovered while investigating the cause of numerous docker provisioning failures caused by occasional (unpredictable) persistence failures from {{hudson.util.AtomicFileWriter}} (see docker-plugin issue [644|https://github.com/jenkinsci/docker-plugin/issues/644]). While the docker plugin should cope with such failures, the core code shouldn't be causing these problems in the first place!

I suspect that this could be done by having {{Slave}} have {{nodeProperties.owner}} set to {{this}}, because then it'll benefit from the same optimization logic that's already implemented in {{Node.save()}}.
I also suspect that
Note: Elsewhere, calls to
{{Nodes.save()}} should be are wrapped in a {{ synchronized Queue.withLock }} as that block to prevent concurrent updates; the call from {{Slave.nodeProperties}} isn ' t wrapped like this, which is why the exception happens, but it'd be better to avoid calling it entirely than to simply make it' s probably also a cause of persistence issues when multiple threads are adding slaves existing functionality thread-safe .

pjdarton@gmail.com (JIRA)

unread,
May 9, 2018, 1:02:02 PM5/9/18
to jenkinsc...@googlegroups.com
FYI this was discovered while investigating the cause of numerous docker provisioning failures caused by occasional (unpredictable) persistence failures from {{hudson.util.AtomicFileWriter}} (see docker-plugin issue [644|https://github.com/jenkinsci/docker-plugin/issues/644]). While the docker that plugin should code is being enhanced to cope with such failures, the core code shouldn't be causing these problems failures in the first place ! .

 

Suggestion: 
I suspect that this could be done fixed by having {{Slave}} have {{nodeProperties.owner}} set to {{this}}, because then it'll benefit from the same optimization logic that's already implemented in {{Node.save()}}.

Note: Elsewhere, calls to {{Nodes.save()}} are wrapped in a {{Queue.withLock}} block to prevent concurrent updates; the call from {{Slave.nodeProperties}} isn't wrapped like this, which is why the exception happens, but it'd be better to avoid calling it entirely than to simply make it's existing functionality thread-safe.

kanstantsin.sha@gmail.com (JIRA)

unread,
Jun 20, 2019, 10:44:03 AM6/20/19
to jenkinsc...@googlegroups.com
Kanstantsin Shautsou edited a comment on Bug JENKINS-51203
 
Re: Slave.setNodeProperties is unreliable and inefficient
One more possible chain how to get into this error
{
{ code} }
04:25:05 Caused by: java.nio.file.NoSuchFileException: /jenkins/jenkins-work/nodes/yadp877105322/atomic113511861761462995tmp
04:25:05  at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
04:25:05  at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
04:25:05  at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
04:25:05  at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:409)
04:25:05  at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
04:25:05  at java.nio.file.Files.move(Files.java:1395)
04:25:05  at hudson.util.AtomicFileWriter.commit(AtomicFileWriter.java:206)
04:25:05  at hudson.XmlFile.write(XmlFile.java:198)
04:25:05  at jenkins.model.Nodes.save(Nodes.java:274)
04:25:05  at hudson.util.PersistedList.onModified(PersistedList.java:173)
04:25:05  at hudson.util.PersistedList.replaceBy(PersistedList.java:85)
04:25:05  at hudson.model.Slave.<init>(Slave.java:198)
04:25:05  at hudson.slaves.AbstractCloudSlave.<init>(AbstractCloudSlave.java:51)
{
{ code} }
This message was sent by Atlassian Jira (v7.11.2#711002-sha1:fdc329d)

kanstantsin.sha@gmail.com (JIRA)

unread,
Jun 20, 2019, 10:44:04 AM6/20/19
to jenkinsc...@googlegroups.com
Reply all
Reply to author
Forward
0 new messages