I'm currently using the Swarm plugin to connect all my slaves to the master. However, whenever the Jenkins service on the master is restarted, the slaves remain offline. They only come back online when I restart the Swarm client process on the slave. The client logs the following:
Nov 11, 2015 10:06:32 AM org.apache.commons.httpclient.HttpMethodBase getResponseBody
WARNING: Going to buffer response body of large or unknown size. Using getResponseBodyAsStream instead is recommended.
Attempting to connect to https://suct2v420.it.mgt:8443/ 98ecac62-d76a-4734-9f9f-9350ee5b4e7d with ID c3c7e53b
Could not obtain CSRF crumb. Response code: 404
javax.net.ssl.SSLHandshakeException: java.security.cert.CertificateException: No name matching suct2v420.it.mgt found
    at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
    at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1904)
    at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:279)
    at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:273)
    at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1446)
    at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:209)
    at sun.security.ssl.Handshaker.processLoop(Handshaker.java:901)
    at sun.security.ssl.Handshaker.process_record(Handshaker.java:837)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1023)
    at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1332)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1359)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1343)
    at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:563)
    at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:153)
    at hudson.remoting.Launcher.parseJnlpArguments(Launcher.java:269)
    at hudson.plugins.swarm.SwarmClient.connect(SwarmClient.java:229)
    at hudson.plugins.swarm.Client.run(Client.java:106)
    at hudson.plugins.swarm.Client.main(Client.java:69)
Caused by: java.security.cert.CertificateException: No name matching suct2v420.it.mgt found
    at sun.security.util.HostnameChecker.matchDNS(HostnameChecker.java:208)
    at sun.security.util.HostnameChecker.match(HostnameChecker.java:93)
    at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:347)
    at sun.security.ssl.AbstractTrustManagerWrapper.checkAdditionalTrust(SSLContextImpl.java:919)
    at sun.security.ssl.AbstractTrustManagerWrapper.checkServerTrusted(SSLContextImpl.java:886)
    at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1428)
    ... 14 more
Failed to establish JNLP connection to https://suct2v420.it.mgt:8443/
Retrying in 10 seconds
You can use a process manager such as "supervisor" to automatically restart the swarm jar on the slave whenever it goes down. That is what we use: the process is retried X times, with a sleep of Y seconds between attempts. That way we can restart our master for maintenance, and once it is back online the agents reconnect on their own.
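The retry behaviour described above can be approximated with a small wrapper script if a process manager is not available. This is only a sketch of the idea, not what supervisor does internally: the function name, its parameters, and the choice to treat exit code 0 as a clean stop are all assumptions made for illustration.

```python
import subprocess
import time


def run_with_retries(cmd, max_attempts, sleep_seconds,
                     runner=subprocess.call, sleeper=time.sleep):
    """Run cmd repeatedly until it exits with 0 or attempts run out.

    Hypothetical helper: `runner` and `sleeper` are injectable so the
    loop can be exercised without starting a real process.
    Returns the exit code of the last attempt.
    """
    exit_code = None
    for attempt in range(1, max_attempts + 1):
        # Blocks until the client process exits.
        exit_code = runner(cmd)
        if exit_code == 0:
            # Assumption: exit code 0 means a deliberate, clean stop.
            break
        if attempt < max_attempts:
            # Wait before the next attempt so the master has time to come back.
            sleeper(sleep_seconds)
    return exit_code
```

To use it, pass the same command line you already start the swarm client with, as a list of arguments, together with your own values for the number of attempts and the sleep interval.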
This should work nowadays. Pass the -deleteExistingClients option so that the Swarm Client can connect after the restart. See PipelineJobTest#buildShellScriptAfterRestart for a working example from a unit test.
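If the client is launched from a script, the option can be appended to the existing command line. The sketch below assumes the command is held as a list of arguments; the helper name and the jar file name are placeholders, and only the -deleteExistingClients option itself comes from the answer above.

```python
def with_delete_existing_clients(cmd):
    """Return a copy of cmd that includes -deleteExistingClients.

    Hypothetical helper: leaves the command unchanged if the option
    is already present, so it is safe to call more than once.
    """
    flag = "-deleteExistingClients"
    if flag in cmd:
        return list(cmd)
    return list(cmd) + [flag]


# Placeholder command; your other client options go after the jar name.
base_cmd = ["java", "-jar", "swarm-client.jar"]
full_cmd = with_delete_existing_clients(base_cmd)
```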