SSH poll failure

Sam Corbett

Apr 4, 2014, 1:06:47 PM
to brookly...@googlegroups.com
Hi all,

I posted this in IRC the other day but am not sure whether my question
was seen.

Every so often I observe Brooklyn fail to poll for a Redis server's
status. The root error, occurring when Brooklyn polls for the server's
stats, is:

Caused by: java.lang.IllegalStateException:
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) ssh not connected!
at
brooklyn.util.internal.ssh.sshj.SshjTool.checkConnected(SshjTool.java:329)
~[brooklyn-core-0.7.0-SNAPSHOT.jar:na]

Brooklyn sets the entity's status to "on fire", but the problem is really
Brooklyn's: the machine and the Redis process are still running. There
are several other entities on the same machine, so a lot of SSH
connections are happening. The failure occurs fairly regularly in
deployments of my app. There are no policies on the entity.

Can anybody suggest what is going wrong? I've included more output from
an instance of the error below. I can put more logs in a gist if useful.

Thanks,

Sam


2014-04-04 17:17:52,475 WARN Execution failed, invocation error for
check-running RedisStoreImpl{id=aelKjh80}:
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22)
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) error acquiring
Put(path=[/tmp/brooklyn-20140404-171728400-OTf6-check-running_RedisStoreImpl_i.sh
195]) (attempt 1/1, in time 24.1s/2m); out of retries:
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22)
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) error acquiring
SFTPClient() (attempt 1/1, in time 24.1s/2m); out of retries:
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) ssh not
connected! (throwing)
2014-04-04 17:17:52,476 INFO STDIN of problem in Task[ssh:
check-running RedisStoreImpl{id=aelKjh80} [Stream[stdin/179B],
TRANSIENT, SUB-TASK,
Wrapped[contextEntity:RedisStoreImpl{id=aelKjh80}]]; EhH9XZkf]:
export
RUN_DIR="/home/users/brooklyn/brooklyn-managed-processes/apps/NMQDtBQA/entities/RedisStore_aelKjh80"
mkdir -p $RUN_DIR
cd $RUN_DIR
./bin/redis-cli -p 6384 ping > /dev/null
2014-04-04 17:17:52,482 WARN Read of
RedisStoreImpl{id=aelKjh80}->Sensor: service.isUp (java.lang.Boolean)
gave exception: brooklyn.util.exceptions.PropagatedRuntimeException:
Execution failed, invocation error for check-running
RedisStoreImpl{id=aelKjh80}:
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22)
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) error acquiring
Put(path=[/tmp/brooklyn-20140404-171728400-OTf6-check-running_RedisStoreImpl_i.sh
195]) (attempt 1/1, in time 24.1s/2m); out of retries:
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22)
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) error acquiring
SFTPClient() (attempt 1/1, in time 24.1s/2m); out of retries:
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) ssh not connected!
2014-04-04 17:17:52,484 WARN Error executing DstJob:Task[service.isUp @
aelKjh80 <- FunctionPollConfig [TRANSIENT,
Wrapped[contextEntity:RedisStoreImpl{id=aelKjh80}]]; NRJi502L]
(scheduled job of Task[MfIjSMW6] - ); cancelling scheduled execution
brooklyn.util.exceptions.PropagatedRuntimeException:
at
brooklyn.util.exceptions.Exceptions.propagate(Exceptions.java:70)
~[brooklyn-utils-common-0.7.0-SNAPSHOT.jar:na]
Caused by: java.util.concurrent.ExecutionException:
java.lang.IllegalStateException: Execution failed, invocation error for
check-running RedisStoreImpl{id=aelKjh80}:
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22)
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) error acquiring
Put(path=[/tmp/brooklyn-20140404-171728400-OTf6-check-running_RedisStoreImpl_i.sh
195]) (attempt 1/1, in time 24.1s/2m); out of retries:
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22)
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) error acquiring
SFTPClient() (attempt 1/1, in time 24.1s/2m); out of retries:
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) ssh not connected!
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
[na:1.7.0_51]
Caused by: java.lang.IllegalStateException: Execution failed, invocation
error for check-running RedisStoreImpl{id=aelKjh80}:
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22)
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) error acquiring
Put(path=[/tmp/brooklyn-20140404-171728400-OTf6-check-running_RedisStoreImpl_i.sh
195]) (attempt 1/1, in time 24.1s/2m); out of retries:
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22)
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) error acquiring
SFTPClient() (attempt 1/1, in time 24.1s/2m); out of retries:
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) ssh not connected!
at
brooklyn.entity.basic.lifecycle.ScriptHelper.logWithDetailsAndThrow(ScriptHelper.java:325)
~[brooklyn-software-base-0.7.0-SNAPSHOT.jar:na]
Caused by: brooklyn.util.internal.ssh.SshException:
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22)
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) error acquiring
Put(path=[/tmp/brooklyn-20140404-171728400-OTf6-check-running_RedisStoreImpl_i.sh
195]) (attempt 1/1, in time 24.1s/2m); out of retries:
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22)
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) error acquiring
SFTPClient() (attempt 1/1, in time 24.1s/2m); out of retries:
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) ssh not connected!
at
brooklyn.util.internal.ssh.SshAbstractTool.propagate(SshAbstractTool.java:148)
~[brooklyn-core-0.7.0-SNAPSHOT.jar:na]
Caused by: brooklyn.util.internal.ssh.SshException:
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22)
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) error acquiring
SFTPClient() (attempt 1/1, in time 24.1s/2m); out of retries:
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) ssh not connected!
at
brooklyn.util.internal.ssh.SshAbstractTool.propagate(SshAbstractTool.java:148)
~[brooklyn-core-0.7.0-SNAPSHOT.jar:na]
Caused by: java.lang.IllegalStateException:
(broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) ssh not connected!
at
brooklyn.util.internal.ssh.sshj.SshjTool.checkConnected(SshjTool.java:329)
~[brooklyn-core-0.7.0-SNAPSHOT.jar:na]

Aled Sage

Apr 4, 2014, 6:59:25 PM
to brookly...@googlegroups.com, brookl...@googlegroups.com
Hi Sam,

This is an interesting one. The key problem is "cancelling scheduled execution". When the error happens, we stop polling! I'm looking at how best to fix this.

I've reproduced something like it in a unit test:
    https://github.com/aledsage/brooklyn/blob/fix/DontAbortTasks/software/base/src/test/java/brooklyn/entity/basic/SoftwareProcessEntityTest.java#L233


--- Techie description for brooklyn-dev below ---

The isRunning method creates a sub-task to execute an ssh command, which then fails. Unfortunately the task was not marked as "inessential", so its failure causes the parent task to abort.

Depending on when it fails, it can do one of two things:
  1. If it fails during SoftwareProcessImpl.waitForEntityStart then it causes the entire entity start to fail
    (even though we should have just retried the isRunning call).
  2. If it fails later during FunctionFeed (see SoftwareProcessImpl.connectServiceUpIsRunning)
    then it causes the polling to abort, so we never subsequently see the service-up go true again.
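
To make case 2 concrete, here is a minimal Java sketch. It is not Brooklyn code: PollSketch, pollFragile, pollResilient and sampleChecks are hypothetical names invented for illustration, and a plain loop stands in for the scheduled feed. It assumes three consecutive polls where the middle one fails with the same "ssh not connected!" error, and it uses lambdas, so it needs Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

public class PollSketch {

    /** Hypothetical poller: the first exception escapes the loop, so no later poll runs. */
    static List<Boolean> pollFragile(List<Callable<Boolean>> checks) {
        List<Boolean> readings = new ArrayList<>();
        try {
            for (Callable<Boolean> check : checks) {
                readings.add(check.call());
            }
        } catch (Exception e) {
            // Stand-in for "cancelling scheduled execution": the remaining polls are dropped.
        }
        return readings;
    }

    /** Hypothetical poller: a failed check is recorded as "not up" and polling carries on. */
    static List<Boolean> pollResilient(List<Callable<Boolean>> checks) {
        List<Boolean> readings = new ArrayList<>();
        for (Callable<Boolean> check : checks) {
            try {
                readings.add(check.call());
            } catch (Exception e) {
                readings.add(false); // transient failure: report "down" and keep polling
            }
        }
        return readings;
    }

    /** Three simulated polls: up, a transient SSH failure, then up again. */
    static List<Callable<Boolean>> sampleChecks() {
        List<Callable<Boolean>> checks = new ArrayList<>();
        checks.add(() -> true);
        checks.add(() -> { throw new IllegalStateException("ssh not connected!"); });
        checks.add(() -> true);
        return checks;
    }

    public static void main(String[] args) {
        System.out.println("fragile:   " + pollFragile(sampleChecks()));   // fragile:   [true]
        System.out.println("resilient: " + pollResilient(sampleChecks())); // resilient: [true, false, true]
    }
}
```

Under these assumptions the fragile variant never observes the service coming back up, whereas the resilient variant records one "down" reading and then recovers.
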
We need at least two fixes:
  1. Catch exceptions better in SoftwareProcessImpl.waitForEntityStart, to keep waiting.
  2. For calls to AbstractSoftwareProcessSshDriver.newScript for CHECK_RUNNING, mark the task as "inessential".
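
Fix 1 could look roughly like the sketch below. It is only an illustration of "catch the exception and keep waiting": WaitSketch and waitUntilRunning are hypothetical names, not the real SoftwareProcessImpl.waitForEntityStart, and a fixed attempt count with a fixed delay is an assumed stand-in for whatever timeout the real method uses. It needs Java 8 or later for the lambda.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

public class WaitSketch {

    /**
     * Hypothetical wait loop: calls the check until it returns true or the
     * attempts run out. An exception from the check counts as "not running yet".
     */
    static boolean waitUntilRunning(Callable<Boolean> isRunning, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (Boolean.TRUE.equals(isRunning.call())) {
                    return true;
                }
            } catch (Exception e) {
                // A failed check (e.g. a dropped SSH connection) is not fatal; try again.
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status and give up
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Simulated check: the first two calls fail as in the log above, the third succeeds.
        boolean up = waitUntilRunning(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new IllegalStateException("ssh not connected!");
            }
            return true;
        }, 5, 10L);
        System.out.println("up=" + up + " after " + calls.get() + " attempts"); // up=true after 3 attempts
    }
}
```
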

For (2), I'm still thinking about what the general solution is: should it apply only to CHECK_RUNNING, or to everything that doesn't set failOnNonZeroResultCode, for example?

Aled

Martin Harris

Apr 8, 2014, 8:21:53 AM
to brookl...@googlegroups.com, brookly...@googlegroups.com
Hi Folks,

I've implemented Aled's suggestion #2, currently just for CHECK_RUNNING. There's a PR (including a test) for it here: https://github.com/brooklyncentral/brooklyn/pull/1309

Cheers

Martin

--
Martin Harris
Lead Software Engineer
Cloudsoft Corporation Ltd