Hystrix Performance Overhead

Pratyay Pandey

Sep 20, 2016, 8:18:02 PM

to HystrixOSS

I am using Hystrix to wrap a couple of my service calls (the 99th percentile latency of these calls is ~200 ms). My Hystrix configuration looks like this:

- core-size : 80
- executiontimeoutinMilliSeconds : 600
- metricsRollingStatisticalWindowInMilliseconds : 10000
- metricsRollingStatisticalWindowBuckets : 10
(All other settings are defaults.)

I have been observing odd behaviour in my application, though only intermittently. Most of the time, the service calls work fine without any Hystrix timeouts (only a few calls time out in an hour or so).
But occasionally the number of Hystrix timeouts increases many-fold.
On analysing the cause, the only thing I could find was that my execute-latency in Hystrix (the latency of my actual business logic, inside the run() method of my HystrixCommand) is much lower than the total-latency (the total time taken by Hystrix from invoking execute() on the command to getting the actual response).

Questions:
1. Why is there such a huge difference between my execute and total latencies (execute is much lower than total)? What could be the possible reasons for this overhead? (PS: the QPS on my server is barely 10.)
2. Is there any documentation on this overhead? How can I figure out the actual bottleneck here?

Any leads will be appreciated.

Matt Jacobs

Sep 21, 2016, 12:17:05 PM

to HystrixOSS

The total-latency metrics track execution of the entire command, while execute-latency tracks just the run() method. By definition, total-latency always must exceed execute-latency. The delta is basically the amount of time spent in Hystrix bookkeeping/getting a thread from the OS. If you're seeing intermittent problems there, my first guess would be GC that happens to fire between the start of the total-latency timer and the start of the execute-latency timer.

When we profile Java applications, we use Brendan Gregg's Flame Graph methodology. I'd suggest trying this out and seeing what is consuming time on your system. In practice, whenever we've seen those metrics diverge like you described, it's been some pressure elsewhere in the system. We've seen memory leaks, concurrency problems leading to threads stuck at 100% CPU, and other such situations leading to symptoms like you describe.

Hope that helps!

-Matt
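
The gap described above between total-latency and execute-latency can be illustrated without Hystrix. The following is a minimal sketch that uses a plain `ExecutorService` as a stand-in for a thread-isolated command; the class and method names (`LatencyGapSketch`, `timeOneCall`, `timeWithBusyPool`) are hypothetical, and this is not how Hystrix measures its metrics internally. It only shows that anything delaying the start of the work on the worker thread widens the gap while the execute time stays the same.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Illustrative stand-in for a thread-isolated command; not Hystrix code.
class LatencyGapSketch {

    /** Returns {executeMillis, totalMillis} for one call whose work takes workMillis. */
    static long[] timeOneCall(ExecutorService pool, long workMillis) {
        long totalStart = System.nanoTime();          // caller hands off the call
        Future<Long> future = pool.submit(() -> {
            long executeStart = System.nanoTime();    // worker begins the business logic
            Thread.sleep(workMillis);                 // stand-in for the service call
            return System.nanoTime() - executeStart;
        });
        try {
            long executeNanos = future.get();
            long totalNanos = System.nanoTime() - totalStart;
            return new long[] {
                TimeUnit.NANOSECONDS.toMillis(executeNanos),
                TimeUnit.NANOSECONDS.toMillis(totalNanos)
            };
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Occupies the only worker for stallMillis first, so the measured call has to wait. */
    static long[] timeWithBusyPool(long stallMillis, long workMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            pool.submit(() -> { Thread.sleep(stallMillis); return null; });
            return timeOneCall(pool, workMillis);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long[] t = timeWithBusyPool(300, 50);
        System.out.println("execute=" + t[0] + " ms, total=" + t[1] + " ms");
    }
}
```

With a 300 ms stall and 50 ms of work, the execute time stays near 50 ms while the total time grows by roughly the stall, which is the shape of the symptom reported in this thread.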

Pratyay Pandey

Sep 24, 2016, 9:12:53 AM

to HystrixOSS

We are using G1GC on Java 8 and have verified that it is not a GC issue. When the issue happens, the CPU mostly sits idle. Total latency is >= execute latency, as expected, but the delta here is so large that it leads to timeouts, and it creates the confusing impression that the client is timing out when in reality it is the Hystrix overhead.

Matt Jacobs

Sep 26, 2016, 7:10:13 PM

to HystrixOSS

How does this analysis lead you to believe that Hystrix is responsible? This is not a symptom that we run across internally (or get reported to us from external users), so I strongly suspect that it's something in your application and not in Hystrix.

-Matt

Pratyay Pandey

Sep 29, 2016, 12:43:34 AM

to HystrixOSS

Yes, you are right, this was not a Hystrix issue. The issue was in the application: we were using asynchronous logging with a discarding threshold of 0 and a size-based triggering policy. Rotation of the log file led to the subsequent timeouts of Hystrix threads. The async worker thread that reads from the logging queue (a blocking queue) gets blocked during rotation, which causes a pile-up in that queue. With the discarding threshold at 0, the queue fills up and the application/Hystrix threads have to wait on it.
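
The mechanism described above can be reproduced in isolation. The thread does not name the logging framework, so the following is only a rough model that uses a plain `ArrayBlockingQueue`; the class name `BlockedLoggerSketch`, the method names, and the sleep standing in for file rotation are all assumptions made for illustration. It shows a caller stuck in a blocking `put()` while the consumer is stalled, and that a non-blocking `offer()` would return `false` instead of waiting.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative model of an async appender; not any logging framework's code.
class BlockedLoggerSketch {

    /**
     * Fills a bounded queue while the consumer is stalled for stallMillis
     * (standing in for file rotation), then measures how long one more
     * blocking put() keeps the calling thread waiting.
     */
    static long blockedMillis(int capacity, long stallMillis) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(stallMillis);          // "rotation": nothing is drained
                while (true) {
                    queue.take();                   // normal draining resumes
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // exit once interrupted
            }
        });
        worker.setDaemon(true);
        worker.start();
        try {
            for (int i = 0; i < capacity; i++) {
                queue.put("event " + i);            // queue has room, returns at once
            }
            long start = System.nanoTime();
            queue.put("one more event");            // queue is full, caller waits here
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            worker.interrupt();
        }
    }

    /** Non-blocking alternative: offer() reports failure instead of waiting when full. */
    static boolean offerWhenFull(int capacity) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        for (int i = 0; i < capacity; i++) {
            queue.offer("event " + i);
        }
        return queue.offer("dropped event");
    }

    public static void main(String[] args) {
        System.out.println("caller blocked for ~" + blockedMillis(4, 300) + " ms");
        System.out.println("offer on full queue accepted: " + offerWhenFull(4));
    }
}
```

The trade-off is the usual one for bounded logging queues: blocking keeps every event but lets a stalled writer hold up request threads, while dropping keeps callers moving at the cost of lost log events.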

Matt Jacobs

Sep 29, 2016, 12:18:39 PM

to HystrixOSS

Great, glad you found it! Out of curiosity, what was the tool you ended up using to get to the bottom of it? That may be helpful for whoever encounters this thread at some point in the future.

Pratyay Pandey

Oct 4, 2016, 6:33:46 PM

to HystrixOSS

I used Java Flight Recorder to find the thread contention.
