TL;DR: {{Slave.setNodeProperties}} is unreliable because it's doing stuff it shouldn't.
When a cloud plugin's {{Cloud.provision}} method gets called, the plugin code has to create one or more instances of {{hudson.model.Slave}} and return them to Jenkins. Prior to returning them to Jenkins, it has to populate those instances with all the data required. This can (e.g. in the case of the docker-plugin) include a call to {{Slave.setNodeProperties}}.
{{Slave.setNodeProperties}} calls {{nodeProperties.replaceBy()}}. {{nodeProperties}} are a {{hudson.util.PersistedList}} whose owner is {{Jenkins.getInstance().getNodesObject()}} (see [Slave.java line 150|https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/Slave.java#L150]), and {{PersistedList.replaceBy}} calls {{PersistedList.onModified()}} which then calls [{{Nodes.save()}}|https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/jenkins/model/Nodes.java#L277] which writes all nodes to disk (which is a non-trivial operation that isn't thread-safe).
i.e. At present, any change to the node properties for any slave (even those not persisted to disk yet) causes all slaves to get (re-)persisted to disk.
This is very inefficient when there are a lot of slaves - it should, at most, persist just the slave in question, and in the case where a slave doesn't belong to anyone yet, it shouldn't get persisted at all. Worse still, when there's more than one slave being provisioned at a time (e.g. on a busy Jenkins server with a fast cloud), the persistence functionality can throw an exception and cause the entire operation to fail: {noformat} java.nio.file.NoSuchFileException: /var/jenkins_home/nodes/docker-22f02f15d4988b/atomic3648298142656815004tmp at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:409) at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262) at java.nio.file.Files.move(Files.java:1395) at hudson.util.AtomicFileWriter.commit(AtomicFileWriter.java:206) at hudson.XmlFile.write(XmlFile.java:198) at jenkins.model.Nodes.save(Nodes.java:274) at hudson.util.PersistedList.onModified(PersistedList.java:173) at hudson.util.PersistedList.replaceBy(PersistedList.java:85) at hudson.model.Slave.setNodeProperties(Slave.java:299) at ... {noformat}
What should happen instead is that changes to {{Slave.nodeProperties}} should only cause those changes to be persisted if the slave itself is persisted. Changes to Slaves that are not (yet) part of Jenkins' list of Nodes should not cause a re-save of the configuration to disk. e.g. much like the logic done in [{{Node.save()}}|https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/Node.java#L131].
FYI this was discovered while investigating the cause of numerous docker provisioning failures caused by occasional (unpredictable) persistence failures from {{hudson.util.AtomicFileWriter}} (see docker-plugin issue [644|https://github.com/jenkinsci/docker-plugin/issues/644]). While the docker plugin should cope with such failures, the core code shouldn't be causing these problems in the first place!
I suspect that this could be done by having {{Slave}} have {{nodeProperties.owner}} set to {{this}}, because then it'll benefit from the same optimization logic that's already implemented in {{Node.save()}}. I also suspect that Note: Elsewhere, calls to {{Nodes.save()}} should be are wrapped in a {{ synchronized Queue.withLock }} as that block to prevent concurrent updates; the call from {{Slave.nodeProperties}} isn ' t wrapped like this, which is why the exception happens, but it'd be better to avoid calling it entirely than to simply make it' s probably also a cause of persistence issues when multiple threads are adding slaves existing functionality thread-safe . |
|
|