Sorry to hear that, Allen :-( Could you try the SSHJ driver [1]? I know
you mentioned that you're sticking with JSch since Pallet uses it, but
at least this way we could try to localize the problem to the
connection library, or figure out whether the problem is "up in"
(harr, harr) jclouds ;-)
Thanks
ap
[1] http://code.google.com/p/jclouds/wiki/ReleaseNotes110#SSHJ_driver
I'd encourage pallet to support optional usage of sshj, as we stopped
using jsch for many reasons. sshj has much better logging,
exceptions, etc. (and code quality)
Regardless, if you set the logging category "jclouds.ssh" to trace,
you will get more details, even in jsch (presuming v1.2.1+)
The pattern you are seeing is expected in aws-ec2, and there is a set
of properties that help deal with this. They are already set, but may
need to be tuned.
jclouds.ssh.retryable-messages: tells jclouds which jsch errors to
retry. default: "failed to send channel request,channel is not opened,
invalid data,End of IO Stream Read,Connection reset,connection is
closed by foreign host,socket is not established"
jclouds.ssh.retry-auth: whether or not to retry on
AuthorizationExceptions. Note that in jsch, exceptions with "Auth
fail" in the message are converted to AuthorizationException.
default: true
jclouds.ssh.max-retries: how many times to retry on the above
conditions. default: 7 (the period between tries increases from 200ms
-> 2 seconds per math.pow(attempt, 2))
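
As a rough illustration of the three properties above, here is a minimal Java sketch that collects them in a java.util.Properties object and checks a message against the retryable list. The property names and default values are the ones quoted above; the class name SshRetryOverrides and the helper isRetryableMessage are hypothetical, and the substring match is an assumption about how such a list might be consulted, not the jclouds implementation. How the overrides get handed to jclouds (or Pallet) depends on the version in use and is not shown.

```java
import java.util.Properties;

// Hypothetical helper class, for illustration only (not part of jclouds).
final class SshRetryOverrides {

    // Property names and default values as quoted in the message above.
    static Properties defaults() {
        Properties p = new Properties();
        p.setProperty("jclouds.ssh.retryable-messages",
                "failed to send channel request,channel is not opened,invalid data,"
                + "End of IO Stream Read,Connection reset,"
                + "connection is closed by foreign host,socket is not established");
        p.setProperty("jclouds.ssh.retry-auth", "true");
        p.setProperty("jclouds.ssh.max-retries", "7");
        return p;
    }

    // Assumption: a message counts as retryable when it contains any entry
    // of the comma-separated list. The real matching rule may differ.
    static boolean isRetryableMessage(Properties p, String message) {
        if (message == null) {
            return false;
        }
        String list = p.getProperty("jclouds.ssh.retryable-messages", "");
        for (String candidate : list.split(",")) {
            String c = candidate.trim();
            if (!c.isEmpty() && message.contains(c)) {
                return true;
            }
        }
        return false;
    }
}
```

To tune a value, one would call setProperty on the returned object, for example raising "jclouds.ssh.max-retries" before passing the overrides along.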
Judging by your logs, it may be taking longer to get past the
"normal" ec2 auth errors. However, it is hard to tell, as the default
jclouds code will block until port 22 is open before attempting to
connect, which is something I'm not sure you are doing.
> --
> You received this message because you are subscribed to the Google Groups "jclouds-dev" group.
> To post to this group, send email to jclou...@googlegroups.com.
> To unsubscribe from this group, send email to jclouds-dev...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/jclouds-dev?hl=en.
>
These settings are good to know. It means waiting a maximum of
~((n-4)*2 + 3) sec.
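
To sanity-check that estimate, here is a small Java sketch of the arithmetic. It assumes the delay before retry k is min(200 * k^2, 2000) milliseconds, which is one reading of the "200ms -> 2 seconds per math.pow(attempt, 2)" note, and that a delay follows each of the n failed attempts. The class and method names are hypothetical and this is not the jclouds code.

```java
// Hypothetical sketch of the backoff arithmetic (not the jclouds code).
final class SshBackoffEstimate {

    static final long BASE_MILLIS = 200L;   // first delay, from the thread
    static final long MAX_MILLIS = 2000L;   // cap, from the thread

    // Assumed delay before retry number `attempt` (1-based).
    static long delayMillis(int attempt) {
        return Math.min(BASE_MILLIS * attempt * attempt, MAX_MILLIS);
    }

    // Sum of the assumed delays over `maxRetries` attempts.
    static long totalDelayMillis(int maxRetries) {
        long total = 0L;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            total += delayMillis(attempt);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(totalDelayMillis(7));   // prints 10800
        System.out.println(totalDelayMillis(19));  // prints 34800
    }
}
```

Under these assumptions the delays are 200, 800 and 1800 ms, then 2000 ms for every later attempt, giving about 10.8 s for 7 retries and 34.8 s for 19. That is in the same range as the formula above; the time spent in each connection attempt itself comes on top.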
If jclouds.ssh.max-retries is increased to 19, is it possible to run
into sshd_config limits like MaxStartups? (man sez: "Specifies the
maximum number of concurrent unauthenticated connections to the SSH
daemon. Additional connections will be dropped until authentication
succeeds or the LoginGraceTime [120sec. default] expires for a
connection. The default is 10.") If connection attempts are not closed
properly...
Paul
Sounds reasonable. Do you want to try with :jclouds.ssh.max-retries 7
and let us know if it solves things? If so, we can consider making it
default.
Thanks,
-Adrian
The first error was:
2011-10-27 08:54:08,782 [reader] ERROR
net.schmizz.sshj.transport.TransportImpl - Dying because -
net.schmizz.sshj.transport.TransportException: Broken transport;
encountered EOF
The orphaned (not cleaned up) node was probably a result of
BackoffLimitedRetryHandler - Cannot retry after server error,
command has exceeded retry limit 5
caused by
AWSResponseException: RequestLimitExceeded.
I set:
whirr.max-startup-retries=4
jclouds.ssh.max-retries=19
I'll try again tomorrow.
Paul
Hi, Paul.
I'm sorry the cloud gods don't like you! There's nothing you can do about early termination, as that is an ec2 service-side issue, outside of trying a different zone/region/provider. If we aren't failing fast on instance death, that can be addressed.
Thanks for the tenacity!
-A
My experience with EC2 over the last 3 years is that less than 5% of nodes are DOA, so this quantity of early failures is very likely to be a false positive; jclouds must be drawing the wrong conclusion.
Looks good. I wonder, can you find the glitch w/in SshjSshClientTest?
If we can repeat it here, we can assure it won't crop up again.
e.g.

   public void testExceptionClassesRetry() {
      assert ssh.shouldRetry(new SocketTimeoutException("connect timed out"));
      assert ssh.shouldRetry(new TransportException("socket closed"));
      assert ssh.shouldRetry(new ConnectionException("problem"));
      assert ssh.shouldRetry(new ConnectException("Connection refused"));
      assert !ssh.shouldRetry(new IOException("channel %s is not open",
            new NullPointerException()));
   }
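
For anyone who wants to experiment outside the jclouds test suite, the idea behind the test above can be approximated with a standalone Java sketch that uses only JDK exception types. It assumes retryability is decided by walking the causal chain and looking for a known exception class. The class name RetryByClassSketch and the RETRYABLE list are hypothetical, and the sshj types (TransportException, ConnectionException) are left out so the snippet compiles on its own.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;

// Hypothetical sketch, not the SshjSshClient implementation.
final class RetryByClassSketch {

    // Assumed set of retryable exception classes (JDK types only).
    static final Class<?>[] RETRYABLE = {
        ConnectException.class,
        SocketTimeoutException.class
    };

    // Returns true if the throwable, or any of its causes, is an instance
    // of one of the assumed retryable classes.
    static boolean shouldRetry(Throwable from) {
        for (Throwable t = from; t != null; t = t.getCause()) {
            for (Class<?> retryable : RETRYABLE) {
                if (retryable.isInstance(t)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(shouldRetry(new ConnectException("Connection refused")));
        System.out.println(shouldRetry(
                new IOException("channel %s is not open", new NullPointerException())));
    }
}
```

A newly observed failure could then be added as one more call with the exception type and message seen in the logs, to check whether the predicate treats it as retryable.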
Thanks again!
-Adrian