[JIRA] (JENKINS-57831) Remoting susceptible to race between HTTP availability and JNLP availability during master initialization


me@basilcrow.com (JIRA)

Jun 4, 2019, 1:24:07 AM
to jenkinsc...@googlegroups.com
Basil Crow created an issue
 
Jenkins / Bug JENKINS-57831
Remoting susceptible to race between HTTP availability and JNLP availability during master initialization
Issue Type: Bug
Assignee: Jeff Thompson
Components: remoting
Created: 2019-06-04 05:23
Environment: Jenkins 2.150.1
Swarm client 3.17 (Remoting 3.30)
Priority: Major
Reporter: Basil Crow

Hi, I'm the new maintainer of the Swarm Plugin. I encountered an issue tonight after doing a routine restart of a Jenkins master (to perform a plugin update) that resulted in all my Swarm clients losing their connection to that master (but not to my other masters). I explain the details below. I'd welcome your thoughts on my root cause analysis, and I'd be happy to collaborate on a solution with you.

Problem

Typically, my Swarm clients reconnect just fine after a master restarts due to my use of the Swarm client -deleteExistingClients feature. In fact, I even have a unit test for this functionality. And tonight, Swarm clients successfully reconnected when all of my Jenkins masters were restarted, except for one. On that single master (but not the others), all the Swarm clients failed to reconnect. The Swarm client logs on all the failed clients showed messages like the following:

2019-06-04 03:08:24 CONFIG hudson.plugins.swarm.SwarmClient discoverFromMasterUrl Connecting to http://example.com/ to configure swarm client.
2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient createHttpClient createHttpClient() invoked
2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient createHttpClientContext createHttpClientContext() invoked
2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient createHttpClientContext Setting HttpClient credentials based on options passed
2019-06-04 03:08:24 FINE hudson.plugins.swarm.Candidate <init> Candidate constructed with url: http://example.com/, secret: <redacted>
2019-06-04 03:08:24 INFO hudson.plugins.swarm.Client run Attempting to connect to http://example.com/ <redacted> with ID 
2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient createSwarmSlave createSwarmSlave() invoked
2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient createHttpClient createHttpClient() invoked
2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient createHttpClientContext createHttpClientContext() invoked
2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient createHttpClientContext Setting HttpClient credentials based on options passed
2019-06-04 03:08:24 SEVERE hudson.plugins.swarm.SwarmClient getCsrfCrumb Could not obtain CSRF crumb. Response code: 404
2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient connect connect() invoked
2019-06-04 03:08:24 FINE sun.net.www.protocol.http.HttpURLConnection writeRequests sun.net.www.MessageHeader@4973813a6 pairs: {GET /computer/vm.example.com/slave-agent.jnlp HTTP/1.1: null}{Authorization: Basic YmxhY2tib3g6YmxhY2tib3gxMjM=}{User-Agent: Java/1.8.0_202}{Host: example.com}{Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2}{Connection: keep-alive}
2019-06-04 03:08:24 FINE sun.net.www.protocol.http.HttpURLConnection getInputStream0 sun.net.www.MessageHeader@6321e8137 pairs: {null: HTTP/1.1 200 OK}{Date: Tue, 04 Jun 2019 03:08:24 GMT}{Content-Type: application/x-java-jnlp-file}{Content-Length: 772}{Connection: keep-alive}{X-Content-Type-Options: nosniff}{Server: Jetty(9.4.z-SNAPSHOT)}
2019-06-04 03:08:24 INFO hudson.remoting.jnlp.Main createEngine Setting up agent: vm.example.com
2019-06-04 03:08:24 INFO hudson.remoting.jnlp.Main$CuiListener <init> Jenkins agent is running in headless mode.
2019-06-04 03:08:24 INFO hudson.remoting.Engine startEngine Using Remoting version: 3.30
2019-06-04 03:08:24 WARNING hudson.remoting.Engine startEngine No Working Directory. Using the legacy JAR Cache location: /var/tmp/jenkins/.jenkins/cache/jars
2019-06-04 03:08:24 FINE hudson.remoting.Engine startEngine Using standard File System JAR Cache. Root Directory is /var/tmp/jenkins/.jenkins/cache/jars
2019-06-04 03:08:24 FINE org.jenkinsci.remoting.protocol.IOHub create Staring an additional Selector wakeup thread. See JENKINS-47965 for more info
2019-06-04 03:08:24 INFO hudson.remoting.jnlp.Main$CuiListener status Locating server among [http://example.com/]
2019-06-04 03:08:24 FINE sun.net.www.protocol.http.HttpURLConnection writeRequests sun.net.www.MessageHeader@4ae378ec6 pairs: {GET /tcpSlaveAgentListener/ HTTP/1.1: null}{Authorization: Basic YmxhY2tib3g6YmxhY2tib3gxMjM=}{User-Agent: Java/1.8.0_202}{Host: example.com}{Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2}{Connection: keep-alive}
2019-06-04 03:08:24 FINE sun.net.www.protocol.http.HttpURLConnection getInputStream0 sun.net.www.MessageHeader@fe1efd411 pairs: {null: HTTP/1.1 200 OK}{Date: Tue, 04 Jun 2019 03:08:24 GMT}{Content-Type: text/plain;charset=utf-8}{Content-Length: 12}{Connection: keep-alive}{X-Content-Type-Options: nosniff}{X-Hudson-JNLP-Port: 55000}{X-Jenkins-JNLP-Port: 55000}{X-Instance-Identity: <redacted>}{X-Jenkins-Agent-Protocols: JNLP4-connect, Ping}{Server: Jetty(9.4.z-SNAPSHOT)}
2019-06-04 03:08:24 INFO org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve Remoting server accepts the following protocols: [JNLP4-connect, Ping]
2019-06-04 03:08:24 WARNING org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver isPortVisible Connection refused (Connection refused)
2019-06-04 03:08:24 SEVERE hudson.remoting.jnlp.Main$CuiListener error http://example.com/ provided port:55000 is not reachable
java.io.IOException: http://example.com/ provided port:55000 is not reachable
        at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:287)
        at hudson.remoting.Engine.innerRun(Engine.java:523)
        at hudson.remoting.Engine.run(Engine.java:474)

At this point, the Swarm client exited (with unknown exit code) and never recovered. The higher-level Jenkins jobs failed.

Background

Of note is that my masters have a custom JNLP port, which I set from a Groovy initialization script like so:

import jenkins.model.Jenkins

// Read the desired JNLP port from the environment and apply it if it differs.
def env = System.getenv()
int port = env['JENKINS_SLAVE_AGENT_PORT'].toInteger()
if (Jenkins.get().slaveAgentPort != port) {
  Jenkins.get().slaveAgentPort = port
  Jenkins.get().save()
}

So for a short period during Jenkins initialization, before this Groovy initialization script has run, Jenkins is up (and therefore replying to HTTP requests) but the JNLP port setting hasn't been applied yet (so a connection to the JNLP port fails).
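
For context, the reachability check Remoting performs here boils down to a plain TCP connection attempt. A minimal sketch of such a probe (this is not the actual JnlpAgentEndpointResolver code; the host, port, and timeout values are illustrative):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortProbe {
    // Returns true if a TCP connection to host:port succeeds within timeoutMs.
    // During the initialization window described above, this fails with
    // "Connection refused", which is exactly what the agent log shows.
    static boolean isPortVisible(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isPortVisible("example.com", 55000, 2000));
    }
}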

Analysis

Here is my analysis of the situation:

  1. The Jenkins master went down, then started initializing again.
  2. The Swarm client successfully communicated with the master via HTTP to create the new agent (see the hudson.plugins.swarm.SwarmClient createSwarmSlave line above, which shows that request succeeding).
  3. The Swarm client then delegated to Remoting, calling hudson.remoting.jnlp.Main#main, which called _main, which called another main, which called Engine#startEngine. We know this because we see the "Using Remoting version" and JAR cache lines above.
  4. Engine#startEngine started a thread, which invoked Engine#run. We know this because the org.jenkinsci.remoting.protocol.IOHub create log line was printed.
  5. Engine#run got all the way into Engine#innerRun, which got as far as the endpoint = resolver.resolve() call on line 523. We know this because the log statement "Locating server among..." was printed.
  6. In JnlpAgentEndpointResolver#resolve, we successfully made an HTTP call to the server to list the available protocols. Again, we know this because the log statement "Remoting server accepts the following protocols" was printed.
  7. In JnlpAgentEndpointResolver#resolve, we call isPortVisible, and here is where things go haywire. At this point, the JNLP port is not available yet, even though the server is responding to HTTP requests, presumably because my Groovy initialization script hasn't run yet. We get the error http://example.com/ provided port:55000 is not reachable from JnlpAgentEndpointResolver#resolve.
  8. isPortVisible returns false to JnlpAgentEndpointResolver#resolve, which sets firstError to a new IOException, then continues. We have nothing else to loop through, so we get to the bottom of the method and throw firstError, which in this case is the IOException.
  9. The caller of JnlpAgentEndpointResolver#resolve, Engine#innerRun, catches the exception and returns.
  10. Back in Engine#run, innerRun returns and then run returns. At this point the thread dies. We pop the stack all the way back up to Main#main and ultimately back to the Swarm Client itself, which exits.

Possible Solutions

Clearly, this is a suboptimal outcome. (In practice, it took down a bunch of my test automation tonight.) What are your thoughts on how this problem could be solved? Here are some of mine.

  • Ideally, Jenkins core would not respond to HTTP requests until the JNLP port is available. Unfortunately, I don't see a practical way to make this a reality. There doesn't seem to be a way to set the JNLP port early in Jenkins startup today (hence the need for my Groovy initialization script). I'm not sure whether it's feasible to add such an option. Even if it were, I still don't know enough about Jenkins early initialization to guarantee that it would close the race. This seems like the ideal solution in the long term, but it's quite impractical for the short or medium term.
  • Could we have Remoting try a little harder, knowing that there is a race between the Jenkins master being available via HTTP and via JNLP? In practice this race window is very small, and I rarely hit it. Retrying for up to a minute or so, with a bit of backoff along the way, might be "good enough"; a rough sketch of such a retry follows after this list. Would this be a direction you want to go in? This option appeals to me because it seems more realistic to implement, and it would also benefit non-Swarm JNLP clients. This seems to be the best medium-term solution.
  • Should we have Swarm somehow detect this condition and re-invoke hudson.remoting.jnlp.Main#main? Swarm already has command-line options for retries, so we could take advantage of one of those to try to restart the JNLP client if the thread dies for some reason. This seems a bit suboptimal, since it would only benefit Swarm clients and not other JNLP clients. But it could be done as a short-term solution.

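To make the second option concrete, here is a minimal sketch of a bounded retry with backoff around endpoint resolution. The names (Resolver, resolveWithRetry) and the attempt/delay values are hypothetical, not existing Remoting API:

import java.io.IOException;

public class RetryingResolve {

    interface Resolver<T> {
        T resolve() throws IOException;
    }

    // Retry resolution with linearly increasing delays (e.g. 10s, 20s, 30s, ...)
    // before giving up and rethrowing the last error to the caller.
    static <T> T resolveWithRetry(Resolver<T> resolver, int maxAttempts, long baseDelayMs)
            throws IOException, InterruptedException {
        IOException lastError = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return resolver.resolve();
            } catch (IOException e) {
                lastError = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(baseDelayMs * attempt);
                }
            }
        }
        throw lastError;
    }
}
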
Let me know what you think about my analysis and these possible solutions. I'd be happy to collaborate with you to get this fixed.


me@basilcrow.com (JIRA)

Jun 4, 2019, 3:02:02 AM
to jenkinsc...@googlegroups.com
Basil Crow commented on Bug JENKINS-57831
 
Re: Remoting susceptible to race between HTTP availability and JNLP availability during master initialization

I still haven't figured out a way to reproduce the problem with a real Jenkins master, but using the crude method in jenkinsci/remoting#325 I can reliably reproduce this on the Remoting side. There, I simulate a server whose JNLP port isn't reachable by having isPortVisible return a fake value of false the first time, and the real value all other times. I built a Swarm client with these Remoting changes. With that Swarm client connecting to a regular production Jenkins master, I can reliably reproduce the scenario I ran into in production. My analysis was almost completely correct, up to #9. Engine#innerRun does indeed catch the exception, but it never returns. Instead, it calls events.error(e), which calls CuiListener#error, which does this:

LOGGER.log(Level.SEVERE, t.getMessage(), t);
System.exit(-1);

Pretty harsh. (Also note the negative exit code: process exit statuses are limited to the range 0-255, so the -1 actually surfaces to the caller as 255.) You can step through it yourself to get a clearer picture of my analysis if the prose isn't convincing.
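
As a quick aside on exit statuses: only the low 8 bits of the value survive, so a trivial program like the following (purely illustrative) reports status 255 to the shell, not -1:

public class ExitDemo {
    public static void main(String[] args) {
        // On Linux/macOS, `java ExitDemo; echo $?` prints 255: the JVM's exit
        // value is truncated to its low 8 bits by the operating system.
        System.exit(-1);
    }
}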

If I make this simple change to Remoting to downgrade the IOException emanating from JnlpAgentEndpointResolver#resolve from "severe" (i.e., CuiListener#error kills the process) to "warning" (i.e., CuiListener#status, which just logs a message), then everything works properly:

diff --git a/src/main/java/hudson/remoting/Engine.java b/src/main/java/hudson/remoting/Engine.java
index 0f1b92ed..47bc3e7f 100644
--- a/src/main/java/hudson/remoting/Engine.java
+++ b/src/main/java/hudson/remoting/Engine.java
@@ -522,7 +522,7 @@ public class Engine extends Thread {
                 try {
                     endpoint = resolver.resolve();
                 } catch (Exception e) {
-                    events.error(e);
+                    events.status(e.getMessage());
                     return;
                 }
                 if (endpoint == null) {

With the above patch, we pop the stack back up to Swarm, which retries the connection and succeeds on the second attempt.

This isn't necessarily a final fix, since I don't know if we'd want to downgrade all IOExceptions coming from JnlpAgentEndpointResolver#resolve from "severe" to "warning" (maybe just ones coming from JnlpAgentEndpointResolver#isPortVisible?), but it clearly illustrates the practical viability of my second proposed solution.
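
One way to express that narrower variant would be to tag only the port-visibility failure with a dedicated exception type, so the caller can downgrade just that case. This is a hypothetical sketch, not an existing Remoting class:

import java.io.IOException;

// Marker for "the advertised JNLP port isn't reachable (yet)".
class JnlpPortUnreachableException extends IOException {
    JnlpPortUnreachableException(String message) {
        super(message);
    }
}

The resolver would throw this instead of a plain IOException when isPortVisible fails, and Engine#innerRun could catch it separately, calling events.status(...) for this case while keeping events.error(...) for everything else.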

jthompson@cloudbees.com (JIRA)

Jun 11, 2019, 5:57:02 PM
to jenkinsc...@googlegroups.com

You've done some impressive diagnosis and reporting on this one. That will make it a lot easier to understand and see if we can come up with any solutions. I'll need to find time tomorrow or the next day to study this more thoroughly.

I doubt your first suggestion is feasible. The Jenkins init sequence has some cool capabilities but it's pretty limited. Disrupting it can cause lots of problems. It's probably worth investigating, but I'm not hopeful, based on my last attempt to get it to do exactly what I needed.

Your second suggestion has me worried, mostly because of recent events involving JENKINS-46515: @witokondoria made an attempt to improve the connection sequence, but it broke other scenarios (JENKINS-57713).

I'd really like to see us make improvements in this area, but there may be too many different implementations and scenarios relying on the existing sequence. We may have to pursue your third suggestion.

I'll look into it in more detail, but I'm very interested in any further ideas or suggestions.

me@basilcrow.com (JIRA)

Jun 11, 2019, 6:38:03 PM
to jenkinsc...@googlegroups.com

I just discovered jenkinsci/docker#805, which sounds similar to my scenario on the surface. There, Pedro Rodrigues found that the JNLP port number is first set to a random value and then a Groovy initialization script in the Docker image changes it to the desired value. This opens up a race where Jenkins is serving the wrong JNLP port for a short time during initialization. Pedro fixed this in the Docker image by using a Java system property rather than a Groovy initialization script to set the port. The Java system property is used when Jenkins first initializes JNLP, so this theoretically closes the race.
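
For reference, the property in question is jenkins.model.Jenkins.slaveAgentPort, passed on the controller's JVM command line, e.g. (port value illustrative; the exact launch command depends on how Jenkins is packaged):

java -Djenkins.model.Jenkins.slaveAgentPort=55000 -jar jenkins.war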

I wonder if this phenomenon could explain my observations above, where Jenkins was replying to HTTP requests but the JNLP port was not available. It seems like the following sequence of events is possible:

  • Jenkins advertises the old JNLP port
  • Swarm/Remoting picks this up
  • Jenkins executes the Groovy initialization script to change the port
  • Swarm/Remoting tries to connect to the old port

I'll be deploying Pedro's changes to the Docker image next week. I don't know for sure they will fix this problem, but I am hopeful that they might.


jthompson@cloudbees.com (JIRA)

Jun 13, 2019, 5:58:02 PM
to jenkinsc...@googlegroups.com

It looks like that Jenkins system property is the supported mechanism for specifying the JNLP port. Since it works right from the earliest init point, it should be reliable. I don't think we need to add anything additional to Jenkins. With that you shouldn't need your Groovy customization any longer. Hopefully that will work for you.

I'm very interested in what you discover. If the system property doesn't solve your problems, I'm interested in working together to see if we can figure out what would, perhaps something related to your second suggestion: improving Remoting to be more reliable when connecting. We would need to introduce it in such a way that the changes are only active when explicitly requested, and not for everyone else, who might be relying on the existing behavior. Some flag that you could pass to the agent to turn on the new behavior.

This seems to be somewhat the opposite of what Javier Delgado was working on in JENKINS-46515 (Remoting PR #193, https://github.com/jenkinsci/remoting/pull/193). In that case he was trying to exit the agent process more quickly so that things could kick off again, whereas you are trying to get it not to exit so quickly.

It's difficult to get it to satisfy all of the different scenarios, but if they're distinguishable and flags can be set, we might be able to introduce specific sequences to help in certain scenarios.

I'll look through it a little bit more, but I don't have any testing environments or configurations that lead into these interesting cases.

me@basilcrow.com (JIRA)

Jun 13, 2019, 7:16:03 PM
to jenkinsc...@googlegroups.com

This seems to be somewhat the opposite of what Javier Delgado was working on in JENKINS-46515 (Remoting PR #193, https://github.com/jenkinsci/remoting/pull/193). In that case he was trying to exit the agent process more quickly so that things could kick off again, whereas you are trying to get it not to exit so quickly.

Swarm has a number of retry-related options:

 -noRetryAfterConnected                 : Do not retry if a successful
                                          connection gets closed.
 -retry N                               : Number of retries before giving up.
                                          Unlimited if not specified.
 -retryBackOffStrategy RETRY_BACK_OFF_STRATEGY :
                                          The mode controlling retry wait time.
                                          Can be either 'none' (use same
                                          interval between retries) or 'linear'
                                          (increase wait time before each retry
                                          up to maxRetryInterval) or
                                          'exponential' (double wait interval
                                          on each retry up to maxRetryInterval).
                                          Default is 'none'.
 -retryInterval N                       : Time to wait before retry in seconds.
                                          Default is 10 seconds.

When these options are being used, Swarm wants the process to keep running, so Remoting's use of System.exit is problematic. In this case, Swarm really wants Remoting to pass control back to Swarm on failure so that Swarm can retry. Today this takes place mostly by popping the stack back up to Remoting's main() so that Swarm can call Remoting's main again. But this interface is an implementation detail of Swarm/Remoting and we could always redefine that interface (for example, to use some specific exception type) if desired.
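
For illustration, if Remoting signalled failure by returning or throwing instead of calling System.exit, the Swarm-side retry could stay as simple as the following loop. This is a hypothetical sketch, not the actual Swarm code, and it assumes Remoting is on the classpath; retries < 0 meaning "retry forever" mirrors the -retry option above:

import java.io.IOException;

public class SwarmRetryLoop {
    static void runWithRetries(String[] remotingArgs, int retries, int retryIntervalSeconds)
            throws InterruptedException {
        for (int attempt = 0; retries < 0 || attempt <= retries; attempt++) {
            try {
                // Only works if Remoting returns or throws on failure
                // instead of calling System.exit.
                hudson.remoting.jnlp.Main.main(remotingArgs);
                return;
            } catch (IOException e) {
                Thread.sleep(retryIntervalSeconds * 1000L);
            }
        }
    }
}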

Javier's use case seems a bit different; it seems more aligned with a situation where a service manager (e.g. systemd) monitors the process and restarts it on failure, e.g.:

[Unit]
Description=Swarm client
Requires=network.target
After=local-fs.target
After=network.target

[Service]
Type=simple
WorkingDirectory=/var/lib/swarm-client
ExecStart=java -jar swarm-client.jar [...]
Restart=on-failure
RestartSec=5

#
# If the Swarm client loses its connection to the master and
# needs to be restarted, we don't want to interrupt its child
# processes, which the new Swarm client process will find when
# it resumes its connection.
#
KillMode=process

#
# The JVM's exit status when it receives a SIGTERM (15) will be
# 128+15=143, so consider that clean exit.
#
SuccessExitStatus=143

[Install]
WantedBy=multi-user.target

With the above type of setup, you'd be relying on the client to exit non-zero so that systemd could restart it.

I think ultimately the current semantics are a bit muddy and could use some cleaning up. This could be done while preserving backwards compatibility by adding new flags to Remoting/Swarm. Here's a rough sketch of what the new semantics could look like in the long term:

  • "Exit non-zero on failure": Add a new flag that means "fail fast and don't try to keep the process going". Indicate failure with a non-zero exit status and rely on the caller to retry. The caller would be expected to be something like systemd that has its own retry mechanism built-in. Ideally the flag would be named the same for both Swarm and Remoting. If the flag is passed to Swarm, Swarm would plumb it through to Remoting.
  • We'd want the opposite of the above flag, either as its own flag (i.e. "keep trying") or as the default when no flags are specified. Currently things are "mostly" this way by default, with the exception of the harsh System.exit I pointed out above. We could change the default behavior there (and risk some breakage), or go with a truly no-risk solution in the form of a new "keep trying" flag. If we do the latter, there would be three modes: "fail fast" (new flag), "keep trying" (new flag), and default (i.e. maintain the existing confusing semantics); a rough sketch of these modes follows below.
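
A rough way to picture those three modes (the names are placeholders, not existing Swarm/Remoting flags):

// Placeholder names; none of these flags exist today.
enum FailureMode {
    FAIL_FAST,   // exit non-zero immediately; an external supervisor (e.g. systemd) restarts
    KEEP_TRYING, // never give up on connection failures; retry in-process with backoff
    LEGACY       // today's default behavior, including the System.exit path noted above
}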

I don't see a need to implement any of this right away given that Pedro's JNLP port solution is probably going to chase away the problem for me in the short term. But it's worth keeping this in mind as a possible design direction in the long term.

jthompson@cloudbees.com (JIRA)

Jun 14, 2019, 2:06:02 PM
to jenkinsc...@googlegroups.com

You've described the scenarios and status well. As you note, if the system property for the JNLP port works, there probably isn't much need to make a change at this point. I'll keep this in mind and see if any more information or needs come up. I think any change we might make would have to preserve the existing default behavior. It's just too complicated to predict who might be relying on what. If Javier Delgado still has an identified need for "exit non-zero on failure", we could try again to implement that without disruption.

jthompson@cloudbees.com (JIRA)

Jun 20, 2019, 1:03:02 PM
to jenkinsc...@googlegroups.com

Basil Crow, have you gotten any further in validating that your needs are met by the existing functionality?

me@basilcrow.com (JIRA)

Jun 20, 2019, 2:04:02 PM
to jenkinsc...@googlegroups.com

I deployed Pedro's changes from jenkinsci/docker#805 on Monday, and since then I've done one Jenkins master restart which went off without a hitch. So far, use of the jenkins.model.Jenkins.slaveAgentPort Java property seems to have chased away the problem.

jthompson@cloudbees.com (JIRA)

Jun 20, 2019, 2:27:02 PM
to jenkinsc...@googlegroups.com

Great. It looks like there's little value in making any changes at this time. I'll go ahead and close this issue and the related PR. We'll see if another scenario comes up that could drive some init sequence changes.
