Hystrix circuit stays OPEN even after the the offending service is healthy

5,643 views
Skip to first unread message

Mukiza Andrew

unread,
Feb 18, 2016, 3:18:15 PM2/18/16
to HystrixOSS
I am having an Issue where when the hystrix circuit breaker trips, it does not close ever again. I have turned logging to debug and I do not see it trying to allow a test request through in which case it seems to me It will never close since its only supposed to close when a test execution goes through successfully indicating that the offending service is now healthy. According to the documentation the Circuit break configuration defaults should be working but I can not seem to tell why the test request is never allowed through.

2016-02-18 09:00:38,782 checkout-service application-akka.actor.default-dispatcher-7 ERROR akka.actor.OneForOneStrategy - ProccessCheckoutCommand short-circuited and fallback failed.
com.netflix.hystrix.exception.HystrixRuntimeException: ProccessCheckoutCommand short-circuited and fallback failed.
at com.netflix.hystrix.AbstractCommand$16.call(AbstractCommand.java:816) ~[com.netflix.hystrix.hystrix-core-1.4.23.jar:1.4.23]
at com.netflix.hystrix.AbstractCommand$16.call(AbstractCommand.java:790) ~[com.netflix.hystrix.hystrix-core-1.4.23.jar:1.4.23]
at rx.internal.operators.OperatorOnErrorResumeNextViaFunction$1.onError(OperatorOnErrorResumeNextViaFunction.java:99) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.internal.operators.OperatorDoOnEach$1.onError(OperatorDoOnEach.java:71) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.internal.operators.OperatorDoOnEach$1.onError(OperatorDoOnEach.java:71) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.internal.operators.OperatorDoOnEach$1.onError(OperatorDoOnEach.java:71) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at com.netflix.hystrix.AbstractCommand$DeprecatedOnFallbackHookApplication$1.onError(AbstractCommand.java:1521) ~[com.netflix.hystrix.hystrix-core-1.4.23.jar:1.4.23]
at com.netflix.hystrix.AbstractCommand$FallbackHookApplication$1.onError(AbstractCommand.java:1411) ~[com.netflix.hystrix.hystrix-core-1.4.23.jar:1.4.23]
at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:314) ~[com.netflix.hystrix.hystrix-core-1.4.23.jar:1.4.23]
at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:306) ~[com.netflix.hystrix.hystrix-core-1.4.23.jar:1.4.23]
at rx.Observable$2.call(Observable.java:162) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:154) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:162) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:154) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:162) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:154) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:162) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:154) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:162) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:154) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:162) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:154) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:162) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:154) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:162) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:154) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:162) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:154) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:162) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:154) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:162) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:154) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable.unsafeSubscribe(Observable.java:8098) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at com.netflix.hystrix.AbstractCommand$1.call(AbstractCommand.java:417) ~[com.netflix.hystrix.hystrix-core-1.4.23.jar:1.4.23]
at com.netflix.hystrix.AbstractCommand$1.call(AbstractCommand.java:363) ~[com.netflix.hystrix.hystrix-core-1.4.23.jar:1.4.23]
at rx.Observable$2.call(Observable.java:162) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:154) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:162) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:154) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:162) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:154) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable.unsafeSubscribe(Observable.java:8098) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at com.netflix.hystrix.AbstractCommand$ObservableCommand$1.call(AbstractCommand.java:1140) ~[com.netflix.hystrix.hystrix-core-1.4.23.jar:1.4.23]
at com.netflix.hystrix.AbstractCommand$ObservableCommand$1.call(AbstractCommand.java:1136) ~[com.netflix.hystrix.hystrix-core-1.4.23.jar:1.4.23]
at rx.Observable$2.call(Observable.java:162) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable$2.call(Observable.java:154) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable.subscribe(Observable.java:8191) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.Observable.subscribe(Observable.java:8158) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.internal.operators.BlockingOperatorToFuture.toFuture(BlockingOperatorToFuture.java:57) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at rx.observables.BlockingObservable.toFuture(BlockingObservable.java:401) ~[io.reactivex.rxjava-1.1.0.jar:1.1.0]
at com.netflix.hystrix.HystrixCommand.queue(HystrixCommand.java:379) ~[com.netflix.hystrix.hystrix-core-1.4.23.jar:1.4.23]
at com.netflix.hystrix.HystrixCommand.execute(HystrixCommand.java:335) ~[com.netflix.hystrix.hystrix-core-1.4.23.jar:1.4.23]
at Global$$anonfun$onStart$1$$anonfun$apply$1.apply(Global.scala:25) ~[checkout-service.checkout-service-0.1-sans-externalized.jar:na]
at Global$$anonfun$onStart$1$$anonfun$apply$1.apply(Global.scala:25) ~[checkout-service.checkout-service-0.1-sans-externalized.jar:na]
at infrastructure.NotificationStreamActor$$anonfun$4.applyOrElse(NotificationStreamActor.scala:71) ~[checkout-service.checkout-service-0.1-sans-externalized.jar:na]
at infrastructure.NotificationStreamActor$$anonfun$4.applyOrElse(NotificationStreamActor.scala:68) ~[checkout-service.checkout-service-0.1-sans-externalized.jar:na]
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) ~[org.scala-lang.scala-library-2.11.7.jar:na]
at akka.actor.FSM$class.processEvent(FSM.scala:604) ~[com.typesafe.akka.akka-actor_2.11-2.3.13.jar:na]
at infrastructure.NotificationStreamActor.processEvent(NotificationStreamActor.scala:26) ~[checkout-service.checkout-service-0.1-sans-externalized.jar:na]
at akka.actor.FSM$class.akka$actor$FSM$$processMsg(FSM.scala:598) ~[com.typesafe.akka.akka-actor_2.11-2.3.13.jar:na]
at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:592) ~[com.typesafe.akka.akka-actor_2.11-2.3.13.jar:na]
at akka.actor.Actor$class.aroundReceive(Actor.scala:467) ~[com.typesafe.akka.akka-actor_2.11-2.3.13.jar:na]
at infrastructure.NotificationStreamActor.aroundReceive(NotificationStreamActor.scala:26) ~[checkout-service.checkout-service-0.1-sans-externalized.jar:na]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) [com.typesafe.akka.akka-actor_2.11-2.3.13.jar:na]
at akka.actor.ActorCell.invoke(ActorCell.scala:487) [com.typesafe.akka.akka-actor_2.11-2.3.13.jar:na]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) [com.typesafe.akka.akka-actor_2.11-2.3.13.jar:na]
at akka.dispatch.Mailbox.run(Mailbox.scala:220) [com.typesafe.akka.akka-actor_2.11-2.3.13.jar:na]
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397) [com.typesafe.akka.akka-actor_2.11-2.3.13.jar:na]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [org.scala-lang.scala-library-2.11.7.jar:na]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [org.scala-lang.scala-library-2.11.7.jar:na]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [org.scala-lang.scala-library-2.11.7.jar:na]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [org.scala-lang.scala-library-2.11.7.jar:na]
Caused by: java.lang.RuntimeException: Hystrix circuit short-circuited and is OPEN
at com.netflix.hystrix.AbstractCommand$1.call(AbstractCommand.java:414) ~[com.netflix.hystrix.hystrix-core-1.4.23.jar:1.4.23]
... 38 common frames omitted


Matt Jacobs

unread,
Feb 23, 2016, 7:18:07 PM2/23/16
to HystrixOSS
The circuit breaker algorithm is designed to, once tripped, stop executing the run() method for a period of time (defaults to 5 seconds).  After that time elapsed, the next command invocation will execute the run() method.

So the 2 things to check are:
* What is your sleep window defined as?
* Are you still creating new command instances and invoking them after this amount of time has elapsed?

-Matt

Mukiza Andrew

unread,
Feb 25, 2016, 5:15:40 PM2/25/16
to HystrixOSS
Thanks Matt.

As you mentioned, 

We are executing these commands in finite state machine (FSM) so when this error was thrown our state machine would go into a state where its idle (unhandled)  consequently no more instances of the command were created and invoked keeping the circuit open for ever. 

One curious thing though was that despite the fact that we have fallBack for all possible exceptions,  including a catch all case (matching on Throwable type) we still had this exception bubble up into the FSM. For some reason our fallback fails yet we have matched on every possible exception in the fall back.

Matt Jacobs

unread,
Feb 29, 2016, 8:47:07 AM2/29/16
to HystrixOSS
Cool, glad that answered your question.  

Your description makes it sound like the fallback mechanism is outside of Hystrix.  If that's the case, then that logic is yours to manage however you want.  The fallback functionality that Hystrix provides is such that the fallback gets executed as part of the command execution, if necessary.  If that fallback works, then you should have no need for a fallback outside of Hystrix.

-Matt

Mukiza Andrew

unread,
Feb 29, 2016, 9:06:54 AM2/29/16
to HystrixOSS
Hi Matt,

The fallback logic is in the hystrix command it's self. We override getFallback and  pattern match on all possible exceptions (both the exceptions we thrown within the code the Hyxtrix command  wraps and hystrix runtime exceptions) and return a command instruction which our FSM is supposed to rely on to decide whether to schedule a retry or drop the event its processing and go to the next one. 

Find the snippet of the fallback below.


override def getFallback: StreamInstruction.Value = {
  getFailedExecutionException match {
    case ex: HystrixRuntimeException =>
      if (recoverableExceptions.contains(ex.getFailureType)) {
        logger.error(s"recoverable hystrixRuntimeException: retrying the event", ex)
        StreamInstruction.Retry
      } else {
        logger.error(s"unrecoverable hystrixRuntimeException: dropping the event", ex)
        StreamInstruction.ProcessNext
      }
    case ex: ConnectException =>
      logger.warn("External service is not available", ex)
      StreamInstruction.Retry

       ...
       // more exceptions here.

    case t: Throwable =>
      logger.error("Unknown error processing event.", t)
      StreamInstruction.Retry
  }


So essentially we never expect any exceptions to leak into our FSM (the actor that executes the hystrix) command. But for some reason this happens. I have not been able to come up with any possible theory that could cause the exceptions to leak into the caller of the command when we have matched all possible exceptions.


If you have any ideas, please share.

Thanks,
Andrew.

Matt Jacobs

unread,
Mar 2, 2016, 9:20:30 AM3/2/16
to HystrixOSS
There are only 2 cases I can think of where your application code would receive an exception:

1) getFailedExecutionException throws, or some other exception occurs in your logic to match on exception type
2) The fallback gets rejected.  There's a concurrency bound on concurrent fallback execution  If this gets exceeded, then the fallback does not even execute and the exception from the run() method gets immediately propagated.

Can you catch the Exception that your application is receiving?  That might help narrow down what's going on.

-Matt
Reply all
Reply to author
Forward
0 new messages