Re: [brooklyn-users] SSH poll failure

Aled Sage

Apr 4, 2014, 6:59:25 PM
to brookly...@googlegroups.com, brookl...@googlegroups.com
Hi Sam,

This is an interesting one. The key problem is the "cancelling scheduled execution" message in your log: once the error happens, we stop polling! I'm looking at how best to fix this.

I've reproduced something like it in a unit test:
    https://github.com/aledsage/brooklyn/blob/fix/DontAbortTasks/software/base/src/test/java/brooklyn/entity/basic/SoftwareProcessEntityTest.java#L233


--- Techie description for brooklyn-dev below ---

The isRunning method creates a sub-task to execute an ssh command, which then fails. Unfortunately the task was not marked as "inessential", so its failure causes the parent task to abort.
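To illustrate the "inessential" idea, here is a minimal sketch of a parent that runs sub-tasks in order and only aborts when an essential one fails. This is not Brooklyn's task API: the `SubTask` class, the `inessential` flag as a constructor argument, and `runParent` are hypothetical names invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class InessentialTaskSketch {

    /** Hypothetical sub-task: a body plus a flag saying whether the parent may tolerate its failure. */
    static final class SubTask {
        final String name;
        final Runnable body;
        final boolean inessential;

        SubTask(String name, Runnable body, boolean inessential) {
            this.name = name;
            this.body = body;
            this.inessential = inessential;
        }
    }

    /**
     * Runs the sub-tasks in order and returns the names of those that completed.
     * A failing essential sub-task aborts the parent; a failing inessential one is skipped.
     */
    static List<String> runParent(List<SubTask> subTasks) {
        List<String> completed = new ArrayList<>();
        for (SubTask task : subTasks) {
            try {
                task.body.run();
                completed.add(task.name);
            } catch (RuntimeException e) {
                if (!task.inessential) {
                    // Essential failure: propagate, which aborts everything that follows.
                    throw new IllegalStateException("Parent aborted by sub-task " + task.name, e);
                }
                // Inessential failure: note it and carry on with the next sub-task.
                System.out.println("Ignoring failure of inessential sub-task " + task.name + ": " + e.getMessage());
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        Runnable failingSsh = () -> { throw new IllegalStateException("ssh not connected!"); };
        List<SubTask> tasks = Arrays.asList(
                new SubTask("check-running", failingSsh, true),
                new SubTask("next-step", () -> { }, false));
        System.out.println("Completed: " + runParent(tasks));
    }
}
```

With the flag set to false for "check-running", the same call would throw and "next-step" would never run, which is the behaviour described above.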

Depending on when that sub-task fails, one of two things happens:
  1. If it fails during SoftwareProcessImpl.waitForEntityStart then it causes the entire entity start to fail
    (even though we should have just retried the isRunning call).
  2. If it fails later during FunctionFeed (see SoftwareProcessImpl.connectServiceUpIsRunning)
    then it causes the polling to abort, so we never subsequently see the service-up go true again.
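Case 2 has a close analogue in the plain JDK, which may help when reasoning about it. With `ScheduledExecutorService.scheduleWithFixedDelay`, an exception escaping one execution suppresses all later executions. The sketch below uses only the JDK, not Brooklyn's FunctionFeed or its scheduler; the class and method names are made up for the example, and the poll body is rigged to fail on its second run.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PollCancellationSketch {

    /**
     * Schedules a poll every 10ms for roughly 300ms and returns how many times the poll body ran.
     * The poll throws on its second run; catchFailures decides whether that exception escapes.
     */
    static int countPolls(boolean catchFailures) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger attempts = new AtomicInteger();

        // Stand-in for the isRunning check: the second call fails like the ssh error in the log.
        Runnable poll = () -> {
            if (attempts.incrementAndGet() == 2) {
                throw new IllegalStateException("ssh not connected!");
            }
        };
        // Same poll, but a failure is swallowed so the schedule survives.
        Runnable guarded = () -> {
            try {
                poll.run();
            } catch (RuntimeException e) {
                // A real poller would log this and report the sensor as unknown or down.
            }
        };

        try {
            executor.scheduleWithFixedDelay(catchFailures ? guarded : poll, 0, 10, TimeUnit.MILLISECONDS);
            Thread.sleep(300);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            executor.shutdownNow();
        }
        return attempts.get();
    }

    public static void main(String[] args) {
        System.out.println("Polls when the exception escapes: " + countPolls(false));
        System.out.println("Polls when the exception is caught: " + countPolls(true));
    }
}
```

When the exception escapes, the count stops at the failing run; when it is caught inside the job, polling carries on for the whole window.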

We need at least two fixes:
  1. Catch exceptions better in SoftwareProcessImpl.waitForEntityStart, to keep waiting.
  2. For calls to AbstractSoftwareProcessSshDriver.newScript for CHECK_RUNNING, mark the task as "inessential".
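Fix (1) amounts to treating a failed check as "not running yet" and retrying until a deadline. A rough sketch of that loop follows, assuming the check can be passed in as a `Callable<Boolean>`; `waitForRunning`, `failingThenUp` and the timeout parameters are hypothetical and are not the actual SoftwareProcessImpl code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

class WaitForStartSketch {

    /**
     * Calls isRunning repeatedly until it returns true or the timeout elapses.
     * An exception from the check counts as "not running yet" rather than aborting the wait.
     * Returns true if the check succeeded in time, false on timeout or interruption.
     */
    static boolean waitForRunning(Callable<Boolean> isRunning, long timeoutMillis, long pauseMillis) {
        long deadline = System.nanoTime() + TimeUnit_MILLIS_TO_NANOS * timeoutMillis;
        while (true) {
            try {
                if (Boolean.TRUE.equals(isRunning.call())) {
                    return true;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            } catch (Exception e) {
                // For example a transient "ssh not connected!": log it and keep waiting.
                System.out.println("isRunning check failed, will retry: " + e.getMessage());
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    private static final long TimeUnit_MILLIS_TO_NANOS = 1_000_000L;

    /** Test helper: a check that throws for the first {@code failures} calls, then reports true. */
    static Callable<Boolean> failingThenUp(int failures) {
        AtomicInteger calls = new AtomicInteger();
        return () -> {
            if (calls.incrementAndGet() <= failures) {
                throw new IllegalStateException("ssh not connected!");
            }
            return true;
        };
    }

    public static void main(String[] args) {
        boolean up = waitForRunning(failingThenUp(2), 2000, 5);
        System.out.println("Entity reported running: " + up);
    }
}
```

The wait still gives up at the deadline, so a process that never comes up fails the start as before; only transient check failures are tolerated.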

For (2), I'm still thinking about what the general solution is: should it apply only to CHECK_RUNNING, or to every script that doesn't set failOnNonZeroResultCode, for example?

Aled



On 04/04/2014 18:06, Sam Corbett wrote:
Hi all,

I posted this in IRC the other day but am not sure whether my question was seen.

Every so often I observe Brooklyn fail to poll for a Redis server's status. The root error, occurring when Brooklyn polls for the server's stats, is:

Caused by: java.lang.IllegalStateException: (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) ssh not connected!
    at brooklyn.util.internal.ssh.sshj.SshjTool.checkConnected(SshjTool.java:329) ~[brooklyn-core-0.7.0-SNAPSHOT.jar:na]

Brooklyn sets the entity's status to on fire, but really the problem is Brooklyn's. The machine and the Redis process are still running. There are several other entities on the same machine, so a lot of SSH connections are happening. It occurs fairly regularly in deployments of my app. There are no policies on the entity.

Can anybody suggest what is going wrong? I've included more output from an instance of the error below. I can put more logs in a gist if useful.

Thanks,

Sam


2014-04-04 17:17:52,475 WARN  Execution failed, invocation error for check-running RedisStoreImpl{id=aelKjh80}: (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) error acquiring Put(path=[/tmp/brooklyn-20140404-171728400-OTf6-check-running_RedisStoreImpl_i.sh 195]) (attempt 1/1, in time 24.1s/2m); out of retries: (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) error acquiring SFTPClient() (attempt 1/1, in time 24.1s/2m); out of retries: (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) ssh not connected! (throwing)
2014-04-04 17:17:52,476 INFO  STDIN of problem in Task[ssh: check-running RedisStoreImpl{id=aelKjh80} [Stream[stdin/179B], TRANSIENT, SUB-TASK, Wrapped[contextEntity:RedisStoreImpl{id=aelKjh80}]]; EhH9XZkf]:
export RUN_DIR="/home/users/brooklyn/brooklyn-managed-processes/apps/NMQDtBQA/entities/RedisStore_aelKjh80"
mkdir -p $RUN_DIR
cd $RUN_DIR
./bin/redis-cli -p 6384 ping > /dev/null
2014-04-04 17:17:52,482 WARN  Read of RedisStoreImpl{id=aelKjh80}->Sensor: service.isUp (java.lang.Boolean) gave exception: brooklyn.util.exceptions.PropagatedRuntimeException: Execution failed, invocation error for check-running RedisStoreImpl{id=aelKjh80}: (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) error acquiring Put(path=[/tmp/brooklyn-20140404-171728400-OTf6-check-running_RedisStoreImpl_i.sh 195]) (attempt 1/1, in time 24.1s/2m); out of retries: (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) error acquiring SFTPClient() (attempt 1/1, in time 24.1s/2m); out of retries: (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) ssh not connected!
2014-04-04 17:17:52,484 WARN  Error executing DstJob:Task[service.isUp @ aelKjh80 <- FunctionPollConfig [TRANSIENT, Wrapped[contextEntity:RedisStoreImpl{id=aelKjh80}]]; NRJi502L] (scheduled job of Task[MfIjSMW6] - ); cancelling scheduled execution
brooklyn.util.exceptions.PropagatedRuntimeException:
    at brooklyn.util.exceptions.Exceptions.propagate(Exceptions.java:70) ~[brooklyn-utils-common-0.7.0-SNAPSHOT.jar:na]
Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Execution failed, invocation error for check-running RedisStoreImpl{id=aelKjh80}: (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) error acquiring Put(path=[/tmp/brooklyn-20140404-171728400-OTf6-check-running_RedisStoreImpl_i.sh 195]) (attempt 1/1, in time 24.1s/2m); out of retries: (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) error acquiring SFTPClient() (attempt 1/1, in time 24.1s/2m); out of retries: (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) ssh not connected!
    at java.util.concurrent.FutureTask.report(FutureTask.java:122) [na:1.7.0_51]
Caused by: java.lang.IllegalStateException: Execution failed, invocation error for check-running RedisStoreImpl{id=aelKjh80}: (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) error acquiring Put(path=[/tmp/brooklyn-20140404-171728400-OTf6-check-running_RedisStoreImpl_i.sh 195]) (attempt 1/1, in time 24.1s/2m); out of retries: (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) error acquiring SFTPClient() (attempt 1/1, in time 24.1s/2m); out of retries: (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) ssh not connected!
    at brooklyn.entity.basic.lifecycle.ScriptHelper.logWithDetailsAndThrow(ScriptHelper.java:325) ~[brooklyn-software-base-0.7.0-SNAPSHOT.jar:na]
Caused by: brooklyn.util.internal.ssh.SshException: (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) error acquiring Put(path=[/tmp/brooklyn-20140404-171728400-OTf6-check-running_RedisStoreImpl_i.sh 195]) (attempt 1/1, in time 24.1s/2m); out of retries: (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) error acquiring SFTPClient() (attempt 1/1, in time 24.1s/2m); out of retries: (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) ssh not connected!
    at brooklyn.util.internal.ssh.SshAbstractTool.propagate(SshAbstractTool.java:148) ~[brooklyn-core-0.7.0-SNAPSHOT.jar:na]
Caused by: brooklyn.util.internal.ssh.SshException: (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) error acquiring SFTPClient() (attempt 1/1, in time 24.1s/2m); out of retries: (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) ssh not connected!
    at brooklyn.util.internal.ssh.SshAbstractTool.propagate(SshAbstractTool.java:148) ~[brooklyn-core-0.7.0-SNAPSHOT.jar:na]
Caused by: java.lang.IllegalStateException: (broo...@ec2-54-197-39-158.compute-1.amazonaws.com:22) ssh not connected!
    at brooklyn.util.internal.ssh.sshj.SshjTool.checkConnected(SshjTool.java:329) ~[brooklyn-core-0.7.0-SNAPSHOT.jar:na]


Martin Harris

Apr 8, 2014, 8:21:53 AM
to brookl...@googlegroups.com, brookly...@googlegroups.com
Hi Folks,

I've implemented Aled's suggestion #2, currently just for CHECK_RUNNING. There's a PR (including test) for it here: https://github.com/brooklyncentral/brooklyn/pull/1309

Cheers

Martin

--
Martin Harris
Lead Software Engineer
Cloudsoft Corporation Ltd