The first hystrix command always calls my fallback


Andrew Spyker

Jul 11, 2013, 1:04:41 AM
to hystr...@googlegroups.com
Using the acmeair-netflix sample application [1], I ...

- start up the auth-service microservice and wait for it to show up in the Eureka console
- run a request against the auth-service that forces it to initialize Spring and its connections to the backend data services, ensuring that the next request will be fast
- start up the webapp (consumer of the auth-service) and wait for it to show up in the Eureka console
- try to log in to the webapp, which calls the CreateTokenCommand [2] with calls through Ribbon/Eureka to the auth-service
-- as you can see in the log files, the REST call succeeds (lines 976-998), but then I can see it calling my fallback method (line 999)
-- I can also see in the app behavior that the fallback method data is returned, as the auth token comes back empty
- log in a second time, which calls the CreateTokenCommand again
-- as you can see in the log files, the call succeeds and this time it doesn't (as expected) call the fallback method (lines 1214-1262)
-- I can also see in the app behavior that the real method data is returned, as the auth token is generated as expected

Why is the first call always calling the fallback method?

I have a feeling it has something to do with the service data retrieved from Eureka.  I say this because if I then ...

- stop the webapp
- start the webapp
- log in a third time, which calls the CreateTokenCommand, etc.
-- the fallback method is not called, the auth token is generated correctly, and correct application behavior is observed

hystrix-first-always-fallbacks.zip

Andrew Spyker

Jul 22, 2013, 10:10:09 PM
to hystr...@googlegroups.com
I moved up to the latest Hystrix (1.2.17).  I also took the advice of Ben Christensen and added the following to my fallback method:

log.error(getExecutionEvents());

// throw and catch an exception to capture a stack trace showing
// where the fallback is being invoked from
try {
  throw new Exception();
}
catch (Exception e) {
  StringWriter sw = new StringWriter();
  PrintWriter pw = new PrintWriter(sw);
  e.printStackTrace(pw);
  log.error(sw.toString());
}

In the log I saw:

ERROR - 84391 - [http-bio-8080-exec-4] (CreateTokenCommand.java:78) - [TIMEOUT]
ERROR - 84393 - [http-bio-8080-exec-4] (CreateTokenCommand.java:86) - java.lang.Exception
at com.acmeair.web.hystrixcommands.CreateTokenCommand.getFallback(CreateTokenCommand.java:80)
at com.acmeair.web.hystrixcommands.CreateTokenCommand.getFallback(CreateTokenCommand.java:38)
at com.netflix.hystrix.HystrixCommand.getFallbackWithProtection(HystrixCommand.java:868)
at com.netflix.hystrix.HystrixCommand.getFallbackOrThrowException(HystrixCommand.java:1129)
at com.netflix.hystrix.HystrixCommand.access$2000(HystrixCommand.java:81)
at com.netflix.hystrix.HystrixCommand$QueuedExecutionFuture.performActualGet(HystrixCommand.java:1532)
at com.netflix.hystrix.HystrixCommand$QueuedExecutionFuture.get(HystrixCommand.java:1451)
at com.netflix.hystrix.HystrixCommand$QueuedExecutionFuture.get(HystrixCommand.java:1483)
at com.netflix.hystrix.HystrixCommand.execute(HystrixCommand.java:429)
at com.acmeair.web.LoginREST.login(LoginREST.java:52)

I then put log statements around the command execute to see whether the first request was taking over 1000 ms (the default timeout).  Sure enough, it was.  I will attach the log file that contains the first and second requests.  You can see the first request takes 1013 ms and the second takes only 44 ms.
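The measurement described above can be sketched in a self-contained way; `slowCall` and its delays are stand-ins for `CreateTokenCommand.execute()` and the observed warm-up cost, not the real code:

```java
// Sketch of the measurement approach: bracket the call with timestamps the
// same way the debug log statements bracket CreateTokenCommand.execute().
// slowCall() simulates the command; the delays are illustrative only.
public class TimingSketch {
    static String slowCall(long delayMs) {
        try {
            Thread.sleep(delayMs); // stand-in for Ribbon/Jackson warm-up cost
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "token";
    }

    public static void main(String[] args) {
        for (long delay : new long[]{60, 5}) { // first call slow, second fast
            long start = System.nanoTime();
            slowCall(delay);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("call took " + elapsedMs + " ms");
        }
    }
}
```

Comparing the elapsed time against the configured command timeout tells you whether Hystrix is correct to fail the call.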

Of the expensive calls in the first, here is what I see:

DEBUG - 83384 - [http-bio-8080-exec-4] (LoginREST.java:51) - ***** before call to CreateTokenCommand
 INFO - 83541 - [hystrix-AcmeAirPool-1] (ConnectionPoolCleaner.java:110) - Initializing ConnectionPoolCleaner for NFHttpClient:acmeair-auth-service-client

This took 157 ms on the first call, and I don't see it on the second call. 

DEBUG - 83554 - [acmeair-auth-service-client-ConnectionPoolCleanerThread] (ConnPoolByRoute.java:725) - Closing connections idle longer than 30000 MILLISECONDS
DEBUG - 83743 - [hystrix-AcmeAirPool-1] (BaseLoadBalancer.java:279) - LoadBalancer:  pingIntervalSeconds set to 30

This took 189 ms on the first call, and I don't see it on the second call.

DEBUG - 83744 - [hystrix-AcmeAirPool-1] (BaseLoadBalancer.java:298) - LoadBalancer: maxTotalPingTime set to 2
 INFO - 83808 - [hystrix-AcmeAirPool-1] (BaseLoadBalancer.java:184) - Client:acmeair-auth-service-client instantiated a LoadBalancer:DynamicServerListLoadBalancer:{NFLoadBalancer:name=acmeair-auth-service-client,current list of Servers=[],Load balancer stats=Zone stats: {},Server stats: []}ServerList:null

This took 64 ms on the first call, and I don't see it on the second call.

DEBUG - 83862 - [hystrix-AcmeAirPool-1] (DefaultClientConnection.java:271) - >> User-Agent: Apache-HttpClient/4.1.2 (java 1.5)
DEBUG - 84251 - [hystrix-AcmeAirPool-1] (Wire.java:63) - << "HTTP/1.1 200 OK[\r][\n]"

This took 389 ms on the first call, but only 15 ms on the second call.

DEBUG - 84266 - [hystrix-AcmeAirPool-1] (CreateTokenCommand.java:62) - responseString = {"id":"fcefca64-23f7-46d9-a927-d15c26b28430","customerid":"ui...@email.com","lastAccessedTime":1374541545537,"timeoutTime":1374627945537}
ERROR - 84390 - [http-bio-8080-exec-4] (CreateTokenCommand.java:77) - calling CreateTokenCommand fallback

This took 124 ms on the first call and represents the time it took for my app to deserialize the message.  On the second call the fallback isn't called, but the deserialization must have taken 10 ms or less.

So I think this shows that the initial Ribbon (with Eureka)/Hystrix setup and connection establishment, along with Jackson deserialization being slower on first invocation, just happens to take slightly over 1000 ms.  My guess is that this failure happens across threads, given that I ramp up the users in parallel, and the connection-pool slowness is likely to persist until the pool grows to the thread pool size.

I talked to others at the bootcamp who saw the same behavior.  I'd suggest those people do the same as I did: use debug logging to measure the first execution time and confirm it is longer than the timeout, and therefore that Hystrix is "failing" the call as expected.

I wonder if this is fairly typical in Netflix usage?  I wonder if it's typical that the first request fails.  The code I have in my command is pretty standard, following the RSS recipe sample of Ribbon usage.  I also wonder if there is some way to let Hystrix be more accepting until warm-up has occurred.
first-request-fails-with-more-debugging.zip
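If the warm-up cost can't be avoided, one pragmatic mitigation is to raise the command's timeout above the observed first-call latency. Hystrix reads its timeout from a dynamic property keyed by the command key, so with Archaius-backed configuration (as in the Netflix stack) it could be set like this; the 3000 ms value is purely illustrative:

```properties
# Raise the thread-isolation timeout for this command above the 1000 ms default.
# The property follows Hystrix's hystrix.command.<HystrixCommandKey>.* convention;
# CreateTokenCommand is the command key used in the sample.
hystrix.command.CreateTokenCommand.execution.isolation.thread.timeoutInMilliseconds=3000
```

The trade-off is that a permanently higher timeout also delays detection of genuine slowness after warm-up.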

Patrick Mueller

Jul 24, 2013, 10:08:16 AM
to hystr...@googlegroups.com
On Monday, July 22, 2013 10:10:09 PM UTC-4, Andrew Spyker wrote:
I wonder if this is fairly typical in Netflix usage?  I wonder if it's typical that the first request fails.  The code I have in my command is pretty standard, following the RSS recipe sample of Ribbon usage.  I also wonder if there is some way to let Hystrix be more accepting until warm-up has occurred.

Although "initial startup failures" aren't specifically covered on the Operations wiki page, it does seem like this category of performance issue should be considered "expected jitter and failure".

Given the richness of the Hystrix API framework, I wonder if you could compose some kind of environment that extends the various timeout/count values until the system is warmed up, or just ignores them altogether.  Alternatively, have an additional set of timeout/count limits that are used until "warmed up - achieved!" is reached, but that doesn't smell quite right.

For some definition of "warmed up - achieved!", of course :-)
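The two-phase idea above could be sketched as a small helper that picks a generous timeout until the system is declared warm. All names here are hypothetical illustrations, not part of the Hystrix API:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a "two-phase timeout": a generous limit while cold,
// the normal limit once warm-up is declared complete. How "warm" is decided
// (N successful calls, a startup timer, etc.) is left open, as in the thread.
public class WarmupAwareTimeout {
    private final AtomicBoolean warm = new AtomicBoolean(false);
    private final int coldTimeoutMs;
    private final int warmTimeoutMs;

    public WarmupAwareTimeout(int coldTimeoutMs, int warmTimeoutMs) {
        this.coldTimeoutMs = coldTimeoutMs;
        this.warmTimeoutMs = warmTimeoutMs;
    }

    // Called once the chosen warm-up criterion is met.
    public void markWarm() {
        warm.set(true);
    }

    // The timeout a command should use right now.
    public int currentTimeoutMs() {
        return warm.get() ? warmTimeoutMs : coldTimeoutMs;
    }
}
```

A command could feed `currentTimeoutMs()` into its timeout property at construction time; the awkward part, as noted, is defining "warmed up - achieved!".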

Ben Christensen

Jul 24, 2013, 4:27:14 PM
to Andrew Spyker, hystr...@googlegroups.com
Hi Andrew, 

Thanks for the detailed research on this.

I wonder if this is fairly typical in Netflix usage?  I wonder if it's typical that the first request fails.  The code I have in my command is pretty standard, following the RSS recipe sample of Ribbon usage.  I also wonder if there is some way to let Hystrix be more accepting until warm-up has occurred.

When a new machine comes online at AWS it's fairly typical for errors and timeouts to occur as routes to hundreds or thousands of instances over the network are established. We use some logic with Ribbon to do some pre-warming but in general it's not a black-and-white thing and thus we see and accept increased jitter as a new instance comes online.

We have not pursued making Hystrix aware of this as we consider this an underlying dependency issue (Ribbon doing network access in our case) and Hystrix is just reporting what is happening and correctly timing out until the dependency (typically via a network route) is healthy within the defined constraints. In other words, we have left the concerns separated between the network layer and the Hystrix layer. It also becomes more vague due to the fact that network routes can be dropping and adding all the time as clusters of servers autoscale, so it's not just a startup phenomenon. 

It may be a side-effect of our scale that we are okay just considering this jitter so I am open to the idea that a different behavior may make sense for different use cases and environments. What are your suggested modifications?



--
Ben Christensen - API Team
+1-310-781-5511  @benjchristensen

Renat Bekbolatov

May 21, 2014, 12:23:20 AM
to hystr...@googlegroups.com, Andrew Spyker
If I am not wrong, quite a bit of initialization logic (e.g. thread pool/queue creation) happens on the first call to a HystrixCommand, so an initial request with a small enough timeout threshold will time out and call the fallback method.
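The lazy-initialization effect described above can be illustrated in a self-contained way; the class and fields here are stand-ins, not Hystrix internals:

```java
// Self-contained illustration of lazy initialization: the first call pays a
// one-time setup cost, later calls reuse the result. In Hystrix the analogous
// cost is thread-pool/queue creation plus Ribbon's connection-pool and
// load-balancer setup, which is why only the first request risks a timeout.
public class LazyInitDemo {
    private Object pool;   // stands in for the lazily created thread pool
    int initCount = 0;     // how many times the expensive setup actually ran

    String call() {
        if (pool == null) { // first invocation only: expensive one-time setup
            initCount++;
            pool = new Object();
        }
        return "token";
    }
}
```

This also suggests the common mitigation of issuing a throwaway request during application startup so the setup cost is paid before real traffic arrives.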