[JIRA] [ec2-plugin] (JENKINS-34408) EC2 plugin repeatedly tries to provision an unresponsive slave


mihelich@google.com (JIRA)

Apr 22, 2016, 10:21:01 PM
to jenkinsc...@googlegroups.com
Patrick Mihelich created an issue
 
Jenkins / Bug JENKINS-34408
EC2 plugin repeatedly tries to provision an unresponsive slave
Issue Type: Bug
Assignee: Francis Upton
Components: ec2-plugin
Created: 2016/Apr/23 2:20 AM
Environment: ec2-plugin 1.31
Jenkins 1.642.2
Priority: Critical
Reporter: Patrick Mihelich

Occasionally one of our stopped slaves will not restart, because ec2-plugin is not able to connect to it over SSH. The plugin aborts after the launch timeout, enumerates existing slaves, and selects the exact same unresponsive one to provision, even if there are many other stopped slaves available.

In the system log, I see ec2-plugin repeatedly enumerate all stopped slaves matching the AMI (~20 available) and select the same one: Using existing slave: i-2a021dbe. In the log for that slave, I can see it wait to connect over SSH, abort at the configured launch timeout of 180s, then start attempting to connect again.

Ideally, I would like ec2-plugin to delete any node that fails to launch. When I manually delete the node, the others begin to start up as expected. Marking the node temporarily offline would also be OK, if it doesn't trigger JENKINS-33945. A lesser mitigation would be to select an existing slave at random, instead of deterministically.
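The mitigations suggested above (skip a node that just failed to launch, and choose among the rest at random) could be combined roughly as follows. This is only a minimal sketch under stated assumptions: `StoppedInstancePicker`, `markLaunchFailed`, `pick`, and the quarantine window are hypothetical names invented for illustration and are not part of ec2-plugin, and the instance IDs are passed in as plain strings rather than fetched from EC2.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Random;

/** Hypothetical helper for illustration only; not ec2-plugin code. */
public class StoppedInstancePicker {
    // Instance ID -> time (ms) at which its last launch attempt timed out.
    private final Map<String, Long> failedAtMillis = new HashMap<>();
    private final long quarantineMillis;
    private final Random random;

    public StoppedInstancePicker(long quarantineMillis, Random random) {
        this.quarantineMillis = quarantineMillis;
        this.random = random;
    }

    /** Record that an instance hit the launch timeout. */
    public void markLaunchFailed(String instanceId, long nowMillis) {
        failedAtMillis.put(instanceId, nowMillis);
    }

    /** Pick a random stopped instance that is not currently quarantined. */
    public Optional<String> pick(List<String> stoppedInstanceIds, long nowMillis) {
        List<String> candidates = new ArrayList<>();
        for (String id : stoppedInstanceIds) {
            Long failedAt = failedAtMillis.get(id);
            // Eligible if it never failed, or its quarantine window has passed.
            if (failedAt == null || nowMillis - failedAt >= quarantineMillis) {
                candidates.add(id);
            }
        }
        if (candidates.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(candidates.get(random.nextInt(candidates.size())));
    }
}
```

With something along these lines, a node such as i-2a021dbe would be passed over for the length of the quarantine window after a failed launch, so the other stopped slaves get a chance to start.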

Marking as critical because this can completely prevent any stopped nodes from coming back up.


mihelich@google.com (JIRA)

Apr 24, 2016, 7:09:01 PM
to jenkinsc...@googlegroups.com
Patrick Mihelich commented on Bug JENKINS-34408
 
Re: EC2 plugin repeatedly tries to provision an unresponsive slave

Saw this again today with another node. Here's a snippet from its log, showing a new attempt to connect immediately after hitting the launch timeout:

Apr 24, 2016 11:02:15 PM null
INFO: Waiting for SSH to come up. Sleeping 5.
Apr 24, 2016 11:02:22 PM null
INFO: Connecting to ec2-52-53-185-83.us-west-1.compute.amazonaws.com on port 22, with timeout 10000.
Apr 24, 2016 11:02:32 PM null
INFO: Failed to connect via ssh: The kexTimeout (10000 ms) expired.
Apr 24, 2016 11:02:32 PM null
INFO: Waiting for SSH to come up. Sleeping 5.
ERROR: Timed out after 183 seconds of waiting for ssh to become available. (maximum timeout configured is 180)
com.amazonaws.AmazonClientException: Timed out after 183 seconds of waiting for ssh to become available. (maximum timeout configured is 180)
	at hudson.plugins.ec2.ssh.EC2UnixLauncher.connectToSsh(EC2UnixLauncher.java:315)
	at hudson.plugins.ec2.ssh.EC2UnixLauncher.launch(EC2UnixLauncher.java:126)
	at hudson.plugins.ec2.EC2ComputerLauncher.launch(EC2ComputerLauncher.java:100)
	at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:253)
	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Apr 24, 2016 11:02:44 PM null
FINER: Node linux_2016_04_21 (i-c8d96f7d)(i-c8d96f7d) is ready
Apr 24, 2016 11:02:44 PM null
INFO: Launching instance: i-c8d96f7d
Apr 24, 2016 11:02:44 PM null
INFO: Connecting to ec2-52-53-185-83.us-west-1.compute.amazonaws.com on port 22, with timeout 10000.
Apr 24, 2016 11:02:54 PM null
INFO: Failed to connect via ssh: The kexTimeout (10000 ms) expired.
Apr 24, 2016 11:02:54 PM null
INFO: Waiting for SSH to come up. Sleeping 5.
Apr 24, 2016 11:02:59 PM null
INFO: Connecting to ec2-52-53-185-83.us-west-1.compute.amazonaws.com on port 22, with timeout 10000.
Apr 24, 2016 11:03:09 PM null
INFO: Failed to connect via ssh: The kexTimeout (10000 ms) expired.
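The log above reports a timeout after 183 seconds against a configured maximum of 180. One plausible explanation, offered here as an assumption rather than a reading of the plugin source, is that the overall deadline is only checked between connection attempts, so a 10-second connect plus a 5-second sleep can carry the loop past the limit before the check runs. The sketch below illustrates that pattern; `SshWaitSketch`, `waitForSsh`, and `LaunchTimeoutException` are hypothetical names, and the clock and sleep are injected so the example can run without real waiting.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

/** Hypothetical connect-retry loop for illustration only; not ec2-plugin code. */
public class SshWaitSketch {
    /** Thrown when the overall launch timeout is exceeded; carries the measured wait. */
    public static class LaunchTimeoutException extends RuntimeException {
        public final long waitedMillis;

        public LaunchTimeoutException(long waitedMillis, long timeoutMillis) {
            super("Timed out after " + (waitedMillis / 1000)
                    + " seconds (maximum is " + (timeoutMillis / 1000) + ")");
            this.waitedMillis = waitedMillis;
        }
    }

    /**
     * Retries tryConnect until it succeeds. The deadline is checked only at the
     * top of each iteration, so the reported wait can exceed timeoutMillis.
     */
    public static void waitForSsh(BooleanSupplier tryConnect, LongSupplier clockMillis,
                                  LongConsumer sleepMillis, long timeoutMillis,
                                  long pauseMillis) {
        long start = clockMillis.getAsLong();
        while (true) {
            long waited = clockMillis.getAsLong() - start;
            if (waited > timeoutMillis) {
                throw new LaunchTimeoutException(waited, timeoutMillis);
            }
            if (tryConnect.getAsBoolean()) {
                return; // connected
            }
            sleepMillis.accept(pauseMillis); // pause before the next attempt
        }
    }
}
```

Note that nothing in a loop like this remembers the failure once the exception propagates, which is consistent with the log showing a fresh launch of the same instance a few seconds later.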

jwhitcraft@sugarcrm.com (JIRA)

May 6, 2016, 9:35:01 AM
to jenkinsc...@googlegroups.com

We are also seeing this error. Next time it happens, I'll pull the log file for it.

shields@kkvesper.jp (JIRA)

May 7, 2016, 8:38:01 PM
to jenkinsc...@googlegroups.com

Are you using the latest master version of the plugin (you have to build it with Maven)? The error may be fixed there.

shields@kkvesper.jp (JIRA)

May 16, 2016, 3:37:02 PM
to jenkinsc...@googlegroups.com

mihelich@google.com (JIRA)

May 17, 2016, 4:27:01 PM
to jenkinsc...@googlegroups.com

Not yet. I'll try upgrading to 1.33 later this week. Thanks!

jwhitcraft@sugarcrm.com (JIRA)

May 18, 2016, 8:50:01 AM
to jenkinsc...@googlegroups.com

Patrick Mihelich, my server has been too busy. I should be able to update over the weekend.

jwhitcraft@sugarcrm.com (JIRA)

May 23, 2016, 8:48:02 AM
to jenkinsc...@googlegroups.com

Patrick Mihelich,

I have not seen this error again since upgrading my server last Thursday.

shields@kkvesper.jp (JIRA)

May 25, 2016, 2:13:55 PM
to jenkinsc...@googlegroups.com
Johnny Shields closed an issue as Fixed
 

Closing, as I do not see the issue either. Please reopen if the issue persists.

Change By: Johnny Shields
Status: Open → Closed
Resolution: Fixed

francisu@gmail.com (JIRA)

May 25, 2016, 2:20:01 PM
to jenkinsc...@googlegroups.com
Francis Upton reopened an issue
Change By: Francis Upton
Resolution: Fixed
Status: Closed → Reopened

francisu@gmail.com (JIRA)

May 25, 2016, 2:21:01 PM
to jenkinsc...@googlegroups.com
Francis Upton closed an issue as Cannot Reproduce
 

Reclosing as Cannot Reproduce. Usually Fixed is used only if there is a specific source code fix associated with the issue. If we knew which specific issue was fixed (and I don't think we do in this case), we could close this one as a duplicate of it. If the problem simply no longer happens because of some earlier fix, use Cannot Reproduce.

Change By: Francis Upton
Status: Reopened → Closed
Resolution: Cannot Reproduce