I moved up to the latest hystrix (1.2.17). I also took the advise of Ben Christensen to add the following to my fallback method:
log.error(getExecutionEvents());
try {
throw new Exception();
}
catch (Exception e) {
StringWriter sw = new StringWriter();
PrintWriter pw = new PrintWriter(sw);
e.printStackTrace(pw);
log.error(sw.toString());
}
In the log I saw:
ERROR - 84391 - [http-bio-8080-exec-4] (CreateTokenCommand.java:78) - [TIMEOUT]
ERROR - 84393 - [http-bio-8080-exec-4] (CreateTokenCommand.java:86) - java.lang.Exception
at com.acmeair.web.hystrixcommands.CreateTokenCommand.getFallback(CreateTokenCommand.java:80)
at com.acmeair.web.hystrixcommands.CreateTokenCommand.getFallback(CreateTokenCommand.java:38)
at com.netflix.hystrix.HystrixCommand.getFallbackWithProtection(HystrixCommand.java:868)
at com.netflix.hystrix.HystrixCommand.getFallbackOrThrowException(HystrixCommand.java:1129)
at com.netflix.hystrix.HystrixCommand.access$2000(HystrixCommand.java:81)
at com.netflix.hystrix.HystrixCommand$QueuedExecutionFuture.performActualGet(HystrixCommand.java:1532)
at com.netflix.hystrix.HystrixCommand$QueuedExecutionFuture.get(HystrixCommand.java:1451)
at com.netflix.hystrix.HystrixCommand$QueuedExecutionFuture.get(HystrixCommand.java:1483)
at com.netflix.hystrix.HystrixCommand.execute(HystrixCommand.java:429)
at com.acmeair.web.LoginREST.login(LoginREST.java:52)
I then put log statements around the command execute to understand why on first request to see if it was taking over 1000 ms (the default for the timeout). Surely enough it was. I will attach the log file that contains the first and second request. You can see the first request takes 1013 ms and the second takes only 44 ms.
Of the expensive calls in the first, here is what I see:
DEBUG - 83384 - [http-bio-8080-exec-4] (LoginREST.java:51) - ***** before call to CreateTokenCommand
INFO - 83541 - [hystrix-AcmeAirPool-1] (ConnectionPoolCleaner.java:110) - Initializing ConnectionPoolCleaner for NFHttpClient:acmeair-auth-service-client
This took 157 ms on the first call, and I don't see it on the second call.
DEBUG - 83554 - [acmeair-auth-service-client-ConnectionPoolCleanerThread] (ConnPoolByRoute.java:725) - Closing connections idle longer than 30000 MILLISECONDS
DEBUG - 83743 - [hystrix-AcmeAirPool-1] (BaseLoadBalancer.java:279) - LoadBalancer: pingIntervalSeconds set to 30
This took 189 ms on the first call, and I don't see it on the second call.
DEBUG - 83744 - [hystrix-AcmeAirPool-1] (BaseLoadBalancer.java:298) - LoadBalancer: maxTotalPingTime set to 2
INFO - 83808 - [hystrix-AcmeAirPool-1] (BaseLoadBalancer.java:184) - Client:acmeair-auth-service-client instantiated a LoadBalancer:DynamicServerListLoadBalancer:{NFLoadBalancer:name=acmeair-auth-service-client,current list of Servers=[],Load balancer stats=Zone stats: {},Server stats: []}ServerList:null
This took 64 ms on the first call, and I don't see it on the second call.
DEBUG - 83862 - [hystrix-AcmeAirPool-1] (DefaultClientConnection.java:271) - >> User-Agent: Apache-HttpClient/4.1.2 (java 1.5)
DEBUG - 84251 - [hystrix-AcmeAirPool-1] (Wire.java:63) - << "HTTP/1.1 200 OK[\r][\n]"
This took 389 ms on the first call, but only 15 ms on the second call.
DEBUG - 84266 - [hystrix-AcmeAirPool-1] (CreateTokenCommand.java:62) - responseString = {"id":"fcefca64-23f7-46d9-a927-d15c26b28430","customerid":"
ui...@email.com","lastAccessedTime":1374541545537,"timeoutTime":1374627945537}
ERROR - 84390 - [http-bio-8080-exec-4] (CreateTokenCommand.java:77) - calling CreateTokenCommand fallback
This took 124 ms in the first call and represents the time it took for my app to deserialize the message. On the second call, the fallback isn't called, but in the second call it must have taken 10 ms or less.
So I think this shows that the initial Ribbon (with Eureka)/Hystrix and connection establishment along with slower on first invocation Jackson deserialization just happens to take slightly over 1000 ms. My guess it that this failure happens across threads given I ramp up the users in parallel and the connection pool slowness is likely to happen until the pool grows to the thread pool size.
I talked to others at the bootcamp that saw the same behavior. I'd suggest those people do the same as I did - using debugging to measure the first execution time to ensure it is longer than the timeout and therefore hystrix is "failing" the call as expected.
I wonder if this is rather typical in Netflix usage? I wonder if it's typical that the first request fails. The code I have in my command is pretty standard following the rss recipe sample of ribbon usage. I also wonder if there is some way to let Hystrix be more accepting until warmup has occurred.