Hystrix short-circuits frequently during peak load


sudhe...@gmail.com

May 4, 2016, 6:32:01 PM5/4/16
to HystrixOSS
We are experiencing short-circuits with no indication of failures. It happens mostly during peak volume hours, and we are stumped as to what's going on. Our setup is such that microservice M1 invokes a Hystrix-wrapped call to microservice M2. During peak times, when M1 is about to call M2, we see short-circuits. There are no errors in either M1 or M2.

We have tuned our coreSize and maxQueueSize to higher values, 24 and 32 respectively. There is no indication of queue-full or timeout errors.
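For reference, we set these through the standard Hystrix thread-pool properties (shown here for the default pool; the property names are from the Hystrix configuration wiki, and the rejection-threshold line is our assumption about what also needs raising, since queueSizeRejectionThreshold defaults to 5 and governs rejections regardless of maxQueueSize):

```properties
# Hystrix thread-pool sizing (default pool; replace "default" with your
# HystrixThreadPoolKey for per-pool tuning).
hystrix.threadpool.default.coreSize=24
hystrix.threadpool.default.maxQueueSize=32
# Rejections are governed by queueSizeRejectionThreshold (default 5), not
# maxQueueSize, so raising maxQueueSize alone may still reject work.
hystrix.threadpool.default.queueSizeRejectionThreshold=32
```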

How can I diagnose this issue? Thanks in advance.

Matt Jacobs

May 5, 2016, 1:27:04 AM5/5/16
to HystrixOSS
Short-circuits only occur when the error percentage rises above a threshold.  This implies that there are errors somewhere.
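As a rough sketch (not the actual Hystrix implementation), the decision to open the circuit looks like this, using the documented defaults of circuitBreaker.requestVolumeThreshold=20 and circuitBreaker.errorThresholdPercentage=50. One important detail: timeouts and thread-pool/semaphore rejections count toward the error percentage too, so the circuit can open even when no exceptions appear in your own application logs.

```java
// Simplified sketch of the circuit-breaker open/close decision.
// "errorCount" includes failures, timeouts, and rejections, all of
// which Hystrix counts as errors in its rolling window.
public class CircuitBreakerSketch {
    static final int REQUEST_VOLUME_THRESHOLD = 20;   // circuitBreaker.requestVolumeThreshold
    static final int ERROR_THRESHOLD_PERCENTAGE = 50; // circuitBreaker.errorThresholdPercentage

    static boolean shouldOpen(int totalRequests, int errorCount) {
        if (totalRequests < REQUEST_VOLUME_THRESHOLD) {
            return false; // too little traffic in the window to judge
        }
        int errorPercentage = errorCount * 100 / totalRequests;
        return errorPercentage >= ERROR_THRESHOLD_PERCENTAGE;
    }

    public static void main(String[] args) {
        System.out.println(shouldOpen(10, 10));  // all errors, but below volume threshold
        System.out.println(shouldOpen(100, 49)); // under 50% errors: stays closed
        System.out.println(shouldOpen(100, 50)); // at 50% errors: opens
    }
}
```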

At Netflix, there are 3 main ways of understanding Hystrix behavior:
1) Historical time-series metrics: We store all of our metrics in Atlas as described here: https://github.com/Netflix/Hystrix/tree/master/hystrix-contrib/hystrix-servo-metrics-publisher.  We can then make queries against the data and visualize Hystrix behavior.
2) Realtime aggregate metrics: We use the metrics event stream described here: https://github.com/Netflix/Hystrix/tree/master/hystrix-contrib/hystrix-metrics-event-stream.  This outputs live data on commands and threadpools that we can view directly.  This can also be consumed by the Hystrix dashboard: https://github.com/Netflix/Hystrix/tree/master/hystrix-dashboard.  Per-instance streams can be directly consumed by Hystrix, or you can aggregate streams using Turbine and consume the aggregated Turbine stream in a dashboard.
3) Realtime per-request metrics: You can set up the request-level stream to understand individual requests. It's described here: https://github.com/Netflix/Hystrix/wiki/Metrics-and-Monitoring#request-streams

Hope that helps,
Matt

Jay Boyd

May 5, 2016, 10:28:45 AM5/5/16
to sudhe...@gmail.com, HystrixOSS
I imagine you reviewed your app's logs and aren't finding any errors?  If exceptions are logged, you should see HystrixRuntimeExceptions indicating failures or timeouts before the circuit opens.  I had a similar experience (metrics reporting high error rates) with no indication of issues in the application logs.  I finally found a caller that was getting errors from a HystrixCommand, silently swallowing the exception, and substituting a default value (they hadn't implemented a fallback).  Our team is generally judicious about logging exceptions, so you can usually see the cause of a short circuit just by grepping the logs for HystrixRuntimeException.

