Auth fail

1,863 views
Skip to first unread message

Allen Rohner

unread,
Oct 24, 2011, 4:16:42 PM10/24/11
to jclouds-dev
Over the last two weeks, I've lost several days to a periodic failure
starting nodes using jclouds. I'm using 1.2.1 now (1.2.0 and 1.2.0-
SNAPSHOT, previously), and still experiencing the problem.

My problem is there is a ~50% chance of getting
"com.jcraft.jsch.JSchException: Auth fail" when starting an instance.

I'm starting Ubuntu 11.04 instances on AWS ec2. (AMI us-east-1/
ami-06ad526f, the standard 11.04 public image). Over the weekend, I
wrote my own set of functions to start EC2 instances to troubleshoot
the issue.

My pseudo-code looks like

start ec2 instance (using the AWS java SDK)
busy wait until AWS reports the instance is started
while we haven't connected successfully
try
ssh to instance, exec "echo 'hello'"
catch Exception e
printStackTrace e
sleep for 1 second

The results are:

waiting for instance to start
waiting for instance to be ready for SSH
caught com.jcraft.jsch.JSchException java.net.ConnectException:
Connection refused
caught com.jcraft.jsch.JSchException java.net.ConnectException:
Connection refused
caught com.jcraft.jsch.JSchException Session.connect:
java.io.IOException: End of IO Stream Read
caught com.jcraft.jsch.JSchException Session.connect:
java.io.IOException: End of IO Stream Read
caught com.jcraft.jsch.JSchException Session.connect:
java.io.IOException: End of IO Stream Read
caught com.jcraft.jsch.JSchException Session.connect:
java.io.IOException: End of IO Stream Read
caught com.jcraft.jsch.JSchException Session.connect:
java.io.IOException: End of IO Stream Read
caught com.jcraft.jsch.JSchException Session.connect:
java.io.IOException: End of IO Stream Read
caught com.jcraft.jsch.JSchException Session.connect:
java.io.IOException: End of IO Stream Read
caught com.jcraft.jsch.JSchException Auth fail
caught com.jcraft.jsch.JSchException Auth fail
caught com.jcraft.jsch.JSchException Auth fail
caught com.jcraft.jsch.JSchException Auth fail
caught com.jcraft.jsch.JSchException Auth fail
caught com.jcraft.jsch.JSchException Auth fail
caught com.jcraft.jsch.JSchException Auth fail
{:exit 0, :out "hello", :err }

As you can see, there are several seconds where JSch reports "Auth
Fail". When using JClouds, there was a race. I believe JClouds needs
to catch that exception (and probably all JSch exceptions, until the
timeout expires). I haven't repeated the experiment using JSSH,
largely because I also use pallet, and pallet uses JSch.

Thanks,
Allen

Andrew Phillips

unread,
Oct 24, 2011, 4:41:11 PM10/24/11
to jclou...@googlegroups.com
> Over the last two weeks, I've lost several days to a periodic failure
> starting nodes using jclouds. I'm using 1.2.1 now (1.2.0 and 1.2.0-
> SNAPSHOT, previously), and still experiencing the problem.
>
> My problem is there is a ~50% chance of getting
> "com.jcraft.jsch.JSchException: Auth fail" when starting an instance.

Sorry to hear that, Allen :-( Could you try the SSHJ [1]? I know you
mention that you're sticking with JSch since Pallet uses it, but at
least this way we could try to localize the problem to the connection
library, or figure out whether the problem is "up in" (harr, harr) in
jclouds ;-)

Thanks

ap

[1] http://code.google.com/p/jclouds/wiki/ReleaseNotes110#SSHJ_driver

Adrian Cole

unread,
Oct 24, 2011, 4:55:07 PM10/24/11
to jclou...@googlegroups.com
Hi, Allen.

I'd encourage pallet to support optional usage of sshj, as we stopped
using jsch for many reasons.  sshj has much better logging,
exceptions, etc. (and code quality)

Regardless, if you set the logging category "jclouds.ssh" to trace,
you will get more details, even in jsch (presuming v1.2.1+)

The pattern you are seeing is expected in aws-ec2, and there is set of
properties that help deal with this, already set, but possibly needing
to be tuned.

jclouds.ssh.retryable-messages: a configuration in jclouds to let know
which jsch errors to retry.  default: "failed to send channel
request,channel is not opened,invalid data,End of IO Stream
Read,Connection reset,connection is closed by foreign host,socket is
not established"

jclouds.ssh.retry-auth: whether or not to retry on
AuthorizationExceptions, note that in jsch, exceptions with "Auth
fail" in the message are converted to AuthorizationException default:
true

jclouds.ssh.max-retries: how many times to retry on above states.
default: 7 (period between tries increases from 200ms -> 2 seconds per
math.pow(attempt, 2))

It seems that by your logs, it may be taking longer to get past the
"normal" ec2 auth errors. However, it is hard to tell, as the default
jclouds code will block until port 22 is open before attempting to
connect, something I'm not sure you are doing.

> --
> You received this message because you are subscribed to the Google Groups "jclouds-dev" group.
> To post to this group, send email to jclou...@googlegroups.com.
> To unsubscribe from this group, send email to jclouds-dev...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/jclouds-dev?hl=en.
>

Paul Baclace

unread,
Oct 24, 2011, 7:37:27 PM10/24/11
to jclou...@googlegroups.com
On 20111024 13:55 , Adrian Cole wrote:
> [...]

> The pattern you are seeing is expected in aws-ec2, and there is set of
> properties that help deal with this, already set, but possibly needing
> to be tuned.
>
> jclouds.ssh.retryable-messages: a configuration in jclouds to let know
> which jsch errors to retry. default: "failed to send channel
> request,channel is not opened,invalid data,End of IO Stream
> Read,Connection reset,connection is closed by foreign host,socket is
> not established"
>
> jclouds.ssh.retry-auth: whether or not to retry on
> AuthorizationExceptions, note that in jsch, exceptions with "Auth
> fail" in the message are converted to AuthorizationException default:
> true
>
> jclouds.ssh.max-retries: how many times to retry on above states.
> default: 7 (period between tries increases from 200ms -> 2 seconds per
> math.pow(attempt, 2))

These settings are good to know. It means wait a maximum of ~((n-4)*2 +
3) sec.

If jclouds.ssh.max-retries is increased to 19, is it possible to run
into sshd_config limits like MaxStartups? (man sez: "Specifies the
maximum number of concurrent unauthenticated connections to the SSH
daemon. Additional connections will be dropped until authentication
succeeds or the LoginGraceTime [120sec. default] expires for a
connection. The default is 10.") If connection attempts are not closed
properly...


Paul

Adrian Cole

unread,
Oct 25, 2011, 2:49:10 AM10/25/11
to jclou...@googlegroups.com
Hi, Paul.

Sounds reasonable. Do you want to try with :jclouds.ssh.max-retries 7
and let us know if it solves things? If so, we can consider making it
default.

Thanks,
-Adrian

Paul Baclace

unread,
Oct 27, 2011, 5:26:26 AM10/27/11
to jclou...@googlegroups.com
On 20111024 23:49 , Adrian Cole wrote:
> Hi, Paul.
>
> Sounds reasonable. Do you want to try with :jclouds.ssh.max-retries 7
> and let us know if it solves things? If so, we can consider making it
> default.
My first test (whirr trunk + jclouds 1.2.1) had the unlucky result of
early termination of 10 nodes as DOA, 1 node orphaned, and failing to
build a cluster. I think that is a new record for bad behavior (it is
whirr trunk).

The first error was:
2011-10-27 08:54:08,782 [reader] ERROR
net.schmizz.sshj.transport.TransportImpl - Dying because -
net.schmizz.sshj.transport.TransportException: Broken transport;
encountered EOF

The orphaned (not cleaned up) node was probably a result of
BackoffLimitedRetryHandler - Cannot retry after server error,
command has exceeded retry limit 5
caused by
AWSResponseException: RequestLimitExceeded.

I set:
whirr.max-startup-retries=4
jclouds.ssh.max-retries=19

I'll try again tomorrow.

Paul

Adrian Cole

unread,
Oct 27, 2011, 6:42:10 AM10/27/11
to jclou...@googlegroups.com

Hi, Paul.

I'm sorry the cloud gods don't like you!  There's nothing you can do about early termination as that is an ec2 service side issue, outside trying a different zone/region/provider.  If we arent failing fast on instance death, that can be addressed.

Thanks for the tenacity!
-A

--
You received this message because you are subscribed to the Google Groups "jclouds-dev" group.
To post to this group, send email to jclou...@googlegroups.com.
To unsubscribe from this group, send email to jclouds-dev+unsubscribe@googlegroups.com.

Paul Baclace

unread,
Oct 27, 2011, 5:40:21 PM10/27/11
to jclou...@googlegroups.com
On 20111027 3:42 , Adrian Cole wrote:

Hi, Paul.

I'm sorry the cloud gods don't like you!  There's nothing you can do about early termination as that is an ec2 service side issue, outside trying a different zone/region/provider.  If we arent failing fast on instance death, that can be addressed.

My experience with EC2 over the last 3 years is that less than 5% of nodes are DOA, so this quantity of early failures is very likely to be a false positive; jclouds must be drawing the wrong conclusion. 

Two weeks ago I launched 18 identical clusters in 2 hours (whirr 0.6.0+ganglia edit and jclouds 1.1.1).  I expected many false positive DOAs, but only saw a few. 

Just in case there is a time of day correlation, I re-ran the exact same test just now (12 hours later).  The result is 4 terminated nodes and 2 orphans, which is in the same ballpark.  The first error is below.  I has the error "exception not retryable", so perhaps the patterns for retryable errors are not set right.

There should be a way to definitively prove true or false positive DOA.  Perhaps a jclouds.debug.keep_failed_nodes=true  is needed so the problem nodes can be inspected by hand.  (A warning should be logged "not terminating 1.2.3.4" when at that decision point if keep_failed_nodes is true.)  Point me to the right files and I will add this.


Paul

------------------ first error ---------------
2011-10-27 20:24:32,091 [user thread 0] ERROR net.schmizz.concurrent.Promise -  <<authenticated>> woke to: net.schmizz.sshj.userauth.UserAuthException: publickey auth failed
2011-10-27 20:24:32,093 [user thread 0] ERROR jclouds.ssh -  << (ubuntu:rsa[fingerprint(...),sha1(...)]@50.18.34.160:22) error acquiring SSHClient(timeout=60000): exception not retryable
net.schmizz.sshj.userauth.UserAuthException: Exhausted available authentication methods
        at net.schmizz.sshj.userauth.UserAuthImpl.authenticate(UserAuthImpl.java:114)
        at net.schmizz.sshj.SSHClient.auth(SSHClient.java:204)
        at net.schmizz.sshj.SSHClient.authPublickey(SSHClient.java:304)
        at net.schmizz.sshj.SSHClient.authPublickey(SSHClient.java:323)
        at org.jclouds.sshj.SshjSshClient$1.create(SshjSshClient.java:199)
        at org.jclouds.sshj.SshjSshClient$1.create(SshjSshClient.java:171)
        at org.jclouds.sshj.SshjSshClient.acquire(SshjSshClient.java:220)
        at org.jclouds.sshj.SshjSshClient.connect(SshjSshClient.java:255)
        at org.jclouds.compute.callables.RunScriptOnNodeAsInitScriptUsingSsh.call(RunScriptOnNodeAsInitScriptUsingSsh.java:89)        at org.jclouds.compute.strategy.CustomizeNodeAndAddToGoodMapOrPutExceptionIntoBadMap.call(CustomizeNodeAndAddToGoodMapOrPutExceptionIntoBadMap.java:150)
        at org.jclouds.compute.strategy.CustomizeNodeAndAddToGoodMapOrPutExceptionIntoBadMap.call(CustomizeNodeAndAddToGoodMapOrPutExceptionIntoBadMap.java:57)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: net.schmizz.sshj.userauth.UserAuthException: publickey auth failed
        at net.schmizz.sshj.userauth.UserAuthImpl.handle(UserAuthImpl.java:157)
        at net.schmizz.sshj.transport.TransportImpl.handle(TransportImpl.java:474)
        at net.schmizz.sshj.transport.Decoder.decode(Decoder.java:127)
        at net.schmizz.sshj.transport.Decoder.received(Decoder.java:195)
        at net.schmizz.sshj.transport.Reader.run(Reader.java:72)
-----------------

Paul Baclace

unread,
Oct 27, 2011, 6:54:36 PM10/27/11
to jclou...@googlegroups.com
More on the "exception not retryable":

The method:

  org/jclouds/sshj/SshjSshClient.java::  causalChainHasMessageContaining(final Exception from)

builds an object that obtains Throwable.getMessage() and matches it against retryableMessages split by ','.
One thing to watch out for is that many exceptions have a different toString() and .getMessage()  { I remember
this one well since NPE has or had an empty .getMessage()) } 

Throwable.toString() should be checked too.  If I understand the functional style being used:

               public boolean apply(Throwable arg0) {
                  return arg0.getMessage() != null && arg0.getMessage().indexOf(input) != -1;
               }

becomes:

               public boolean apply(Throwable arg0) {
                  return ( arg0.toString().indexOf(input) != -1)
                             || (arg0.getMessage() != null && arg0.getMessage().indexOf(input) != -1);
               }


Paul

Alex Heneveld

unread,
Oct 27, 2011, 8:52:52 PM10/27/11
to jclou...@googlegroups.com, Paul Baclace

Hi Paul-

Good catch.  I've made an issue (739) and pull request containing this.

Also see 738 [1] re some of the error messages you are seeing.  There is a race in sshd starting up in the latest Ubuntu AMI's which causes normally benign but scary error messages.  Obviously they're not benign in your case but...

Maybe going for more retries will sort the problem out?  I don't think you'll get hit by sshd MaxStartups because the connections are being terminated by the server.

--Alex

[1] http://code.google.com/p/jclouds/issues/detail?id=738
--
You received this message because you are subscribed to the Google Groups "jclouds-dev" group.
To post to this group, send email to jclou...@googlegroups.com.
To unsubscribe from this group, send email to jclouds-dev...@googlegroups.com.

Adrian Cole

unread,
Oct 28, 2011, 3:43:30 AM10/28/11
to jclou...@googlegroups.com
Hi, Paul.

Looks good. I wonder, can you find the glitch w/in SshjSshClientTest?
If we can repeat it here, we can assure it won't crop up again.

ex.

public void testExceptionClassesRetry() {
assert ssh.shouldRetry(new SocketTimeoutException("connect timed out"));
assert ssh.shouldRetry(new TransportException("socket closed"));
assert ssh.shouldRetry(new ConnectionException("problem"));
assert ssh.shouldRetry(new ConnectException("Connection refused"));
assert !ssh.shouldRetry(new IOException("channel %s is not
open", new NullPointerException()));
}

Thanks again!
-Adrian

> --
> You received this message because you are subscribed to the Google Groups
> "jclouds-dev" group.
> To post to this group, send email to jclou...@googlegroups.com.
> To unsubscribe from this group, send email to

> jclouds-dev...@googlegroups.com.

Reply all
Reply to author
Forward
0 new messages