Fire-and-forget with timeout


Stuart Sierra

May 2, 2013, 10:41:13 AM
to hystr...@googlegroups.com
Hello, HystrixOSS!

We're starting to use hystrix-clj and have a question about asynchronous, fire-and-forget operations.

I want to send some data to a back-end service. I don't care about the result: my application code is going to fire off the request and forget about it. But I do care about not overwhelming the back-end service. So if the back-end service responds too slowly, I want to back off and not send any requests to it for a while.

We've discovered that the timeout configured for a command is measured from the point where you call `get`, not the point where the command started. If you enqueue a command and never `get` the result, it never times out. Slow responses will never trigger the timeout and never open the circuit breaker. Eventually, the thread pool configuration will start limiting the number of concurrent requests, but the requests never stop completely.

Is there a better way to handle this situation?

Thanks,
-S

benjchr...@netflix.com

May 2, 2013, 12:30:13 PM
to hystr...@googlegroups.com
Hi Stuart, 

In your use case, do you want to back off and drop messages (if the pool is full or the circuit has tripped), or do you want to ensure all messages get delivered, just delayed, if the circuit trips?

Depending on which of these use cases you're looking for I can provide more guidance on how best to proceed.

Ben

Stuart Sierra

May 2, 2013, 12:48:26 PM
to hystr...@googlegroups.com
Hi Ben,

We want to back off and drop messages. Specifically, if a bunch of commands take too long, even if we never look at the result of those commands, we want the circuit breaker to trip and start rejecting new commands.

Thanks,
-S



--
You received this message because you are subscribed to a topic in the Google Groups "HystrixOSS" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/hystrixoss/f7Z__Kv2YfY/unsubscribe?hl=en.
To unsubscribe from this group and all its topics, send an email to hystrixoss+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

benjchr...@netflix.com

May 2, 2013, 5:27:43 PM
to hystr...@googlegroups.com, ma...@stuartsierra.com
(accidentally sent privately to Stuart, re-sending to group)

There are 2 ways we do this at Netflix:

1) Directly using HystrixCommand (within request/response context)

For simple use cases we have applications that use HystrixCommand.queue() directly and never call .get() on it (just like what you said you're doing).

If a network timeout occurs (not a thread timeout, since we don't block on get()) then an exception is thrown and it's seen as a failure that can result in the circuit tripping. Thread-pool size automatically constrains throughput, rejects excess work, and causes the circuit to trip if the backend becomes saturated.

A possible functional problem with this is that generally the HystrixRequestContext is scoped to an HTTP request/response loop and fire-and-forget can result in something executing after the response is returned and thus after the HystrixRequestContext has been cleared. If you're not using RequestContext for anything (most people aren't by default) then this may not matter. It can however have an impact on request caching and request collapsing if you use either of those. If any of these items matter then option #2 below is the approach we take.

Regarding the thread timeout problem you mentioned ... there are a few ways a command can time out (when using thread isolation):

a) queue().get()

The get() itself times out and the underlying Callable is cancelled (either while still in the queue or while it's running on a thread).

b) fire-and-forget: queue() without get()

When the Callable gets picked up by a Thread, if the elapsed time since calling queue() exceeds the timeout, then the work is skipped and metrics are incremented for a timeout occurring.

c) queue().get() with a race condition on timeout

A race condition could occur between the get() timeout and the Callable getting scheduled. In that case both scenarios (a) and (b) race on the same command. The timeout logic and metrics capture are handled atomically for this scenario.
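To make case (b) concrete, here is a rough sketch of that kind of check in plain java.util.concurrent. It is not the Hystrix source; the class name, the `fireAndForget` helper and the two counters are made up for illustration. The task remembers when it was queued and skips itself if it waited longer than the timeout before a thread picked it up.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not Hystrix source code.
class QueueTimeoutSketch {
    static final AtomicInteger executed = new AtomicInteger();
    static final AtomicInteger timedOut = new AtomicInteger(); // stand-in for a timeout metric

    // Submits work without anyone ever calling get() on the returned Future.
    static Future<?> fireAndForget(ExecutorService pool, Runnable work, long timeoutMillis) {
        final long queuedAt = System.nanoTime(); // taken at "queue()" time
        return pool.submit(() -> {
            long waitedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - queuedAt);
            if (waitedMillis > timeoutMillis) {
                timedOut.incrementAndGet(); // record the timeout and skip the work
                return;
            }
            work.run(); // once started, nothing here bounds how long this runs
            executed.incrementAndGet();
        });
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Small scenario: returns {executed, timedOut}.
    static int[] demo() {
        executed.set(0);
        timedOut.set(0);
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // The first task starts at once and holds the only thread for 200 ms.
            fireAndForget(pool, () -> sleepQuietly(200), 50);
            // The second task sits in the queue for about 200 ms, past its 50 ms timeout.
            fireAndForget(pool, () -> { }, 50);
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new int[] { executed.get(), timedOut.get() };
    }

    public static void main(String[] args) {
        int[] result = demo();
        System.out.println("executed=" + result[0] + " timedOut=" + result[1]);
    }
}
```

Note that the first task in the demo runs four times longer than its "timeout" and is still counted as executed, because the check only looks at time spent waiting in the queue.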


However, none of these account for the use case you refer to, which is cancelling the underlying network call itself via a Future.cancel/timeout when get() is not invoked. I am not aware of a way to do that with Futures and Executors since I cannot submit a Callable with a "maximum execution time". I can only time out and cancel the task from a blocking get() call, or by having another thread (such as a timer) call Future.cancel().

The way we handle this for fire-and-forget use cases is that the underlying network activity must still have an applicable timeout of its own, since we will never trigger an interrupt on the thread (via Future.cancel or Future.get(timeout)). In other words, for the fire-and-forget use case the thread timeout value is mostly useless; it only takes effect if the time from queueing to a thread picking it up exceeds the timeout value.

I am not aware of a mechanism to set a timeout value on a Future/ExecutorService that automatically cancels the Future after a certain time has passed without another thread blocking on the get() method. Unless there is one, it is up to the underlying run() method implementation performing the network call to ensure the network timeout is set correctly.
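For reference, the "another thread (such as a timer) calling Future.cancel()" route might look something like the sketch below, using only the JDK. The `submitWithDeadline` helper and the class name are invented for this example. It shows the cost being described: a second thread has to exist just to watch the clock, and the interrupt only helps if the work actually responds to interruption (a Thread.sleep does; a blocking read on a classic java.net socket generally does not, which is why the network timeout still matters).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a watcher thread cancels the Future; nobody calls get().
class TimerCancelSketch {
    // One daemon thread whose only job is to fire cancellations on schedule.
    static final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "timeout-watcher");
                t.setDaemon(true);
                return t;
            });

    static <T> Future<T> submitWithDeadline(ExecutorService pool, Callable<T> task,
                                            long timeoutMillis) {
        Future<T> future = pool.submit(task);
        // cancel(true) interrupts the worker thread if the task is still running;
        // it does nothing if the task has already completed.
        timer.schedule(() -> future.cancel(true), timeoutMillis, TimeUnit.MILLISECONDS);
        return future;
    }

    // Returns true if the slow task was interrupted and the Future reports cancelled.
    static boolean demo() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch interrupted = new CountDownLatch(1);
        try {
            Future<String> future = submitWithDeadline(pool, () -> {
                try {
                    Thread.sleep(5_000); // stands in for a slow, interruptible call
                    return "finished";
                } catch (InterruptedException e) {
                    interrupted.countDown();
                    throw e;
                }
            }, 100);
            boolean sawInterrupt = interrupted.await(2, TimeUnit.SECONDS);
            return sawInterrupt && future.isCancelled();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("cancelled by timer: " + demo());
    }
}
```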

If you want to use the HystrixCommand timeout value for the network timeout you can retrieve it within the run() method via:

     getProperties().executionIsolationThreadTimeoutInMilliseconds() 

That way you don't have two different configs and can leverage the dynamic updates of Hystrix properties. Of course, this assumes that your network client allows you to inject a timeout value on each request.
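As an illustration of injecting the timeout per request, here is a minimal sketch that assumes the network client is java.net.HttpURLConnection (the thread does not say which client is in use, so that choice, the helper name and the URL are all placeholders). The timeout is passed in as a plain int; inside run() it would come from the property accessor shown above, which as far as I know returns a property object, so a trailing .get() is presumably needed to obtain the number.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical sketch: one timeout value drives the network client's timeouts.
class NetworkTimeoutSketch {
    // Opens (but does not yet connect) a connection with both timeouts applied.
    static HttpURLConnection openWithTimeout(String url, int timeoutMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(timeoutMillis); // limit on establishing the connection
        conn.setReadTimeout(timeoutMillis);    // limit on each blocking read, not the whole call
        return conn;
    }

    // Returns {connectTimeout, readTimeout} as configured on the connection.
    static int[] demo() {
        try {
            // Placeholder URL; openConnection() does not touch the network.
            HttpURLConnection conn = openWithTimeout("http://localhost:9/ingest", 1000);
            return new int[] { conn.getConnectTimeout(), conn.getReadTimeout() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] timeouts = demo();
        System.out.println("connect=" + timeouts[0] + "ms read=" + timeouts[1] + "ms");
    }
}
```

When one of these limits is hit, the client throws from inside run(), which is the exception-as-failure path described under option 1.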

2) Via a separate queue (in a separate context)

For use cases where the request context is a problem, or where we want to get closer to ensuring delivery via queuing instead of dropping on the floor, we fire-and-forget into a queue. A background thread (or thread-pool) then picks up the work and executes the command synchronously using HystrixCommand.execute(), with the request context lifecycle managed correctly but decoupled from the user request/response.

It does not sound like you need this but I figured I'd mention it.
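A bare-bones sketch of the option 2 shape, again JDK-only and with invented names: callers hand work to a bounded queue and return immediately, and a single background thread drains it. In a real setup the worker would presumably set up the request context, call HystrixCommand.execute(), and tear the context down around each job; here the job is just a Runnable. This version drops work when the queue is full; a larger or unbounded queue would trade that for delay.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a fire-and-forget queue drained by a background thread.
class BackgroundQueueSketch {
    final AtomicInteger delivered = new AtomicInteger();
    final AtomicInteger dropped = new AtomicInteger();
    private final BlockingQueue<Runnable> queue;
    private final Thread worker;

    BackgroundQueueSketch(int capacity) {
        queue = new ArrayBlockingQueue<>(capacity);
        worker = new Thread(() -> {
            try {
                while (true) {
                    Runnable job = queue.take(); // blocks until work arrives
                    job.run();                   // synchronous execution on this thread
                    delivered.incrementAndGet();
                }
            } catch (InterruptedException e) {
                // interrupted: stop draining
            }
        }, "fire-and-forget-worker");
        worker.setDaemon(true);
        worker.start();
    }

    // Never blocks the caller; returns false if the job was dropped.
    boolean enqueue(Runnable job) {
        boolean accepted = queue.offer(job);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await(2, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns {delivered, dropped} for a queue of capacity 2 and four jobs.
    static int[] demo() {
        BackgroundQueueSketch q = new BackgroundQueueSketch(2);
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            q.enqueue(() -> { started.countDown(); awaitQuietly(release); });
            started.await(2, TimeUnit.SECONDS); // worker is busy, queue is empty
            q.enqueue(() -> { });               // fills slot 1
            q.enqueue(() -> { });               // fills slot 2
            q.enqueue(() -> { });               // queue full: dropped
            release.countDown();
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(2);
            while (q.delivered.get() < 3 && System.nanoTime() < deadline) {
                Thread.sleep(10);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            q.worker.interrupt();
        }
        return new int[] { q.delivered.get(), q.dropped.get() };
    }

    public static void main(String[] args) {
        int[] result = demo();
        System.out.println("delivered=" + result[0] + " dropped=" + result[1]);
    }
}
```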

I hope the above explanation of options helps.

Ben

Stuart Sierra

May 3, 2013, 3:02:42 PM
to hystr...@googlegroups.com
Hi Ben,

Thanks for an awesome response! Fast and great detail. This answers all my questions.

We were coming around to the solution of doing the timeout in the network call, which makes perfect sense. We also considered simply recording the time an operation took and throwing an exception if it took too long.
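The "record the time and throw" idea could be sketched roughly as follows. The wrapper name and the choice of TimeoutException are just for illustration. It does not free the thread any sooner, since the check happens only after the operation returns; it merely turns a slow success into a failure that a circuit breaker could count.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: measure an operation and fail it after the fact if it was slow.
class ElapsedCheckSketch {
    static <T> T callWithElapsedCheck(Callable<T> operation, long limitMillis) throws Exception {
        long start = System.nanoTime();
        T result = operation.call(); // runs to completion, however long that takes
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        if (elapsedMillis > limitMillis) {
            throw new TimeoutException(
                    "took " + elapsedMillis + " ms, limit is " + limitMillis + " ms");
        }
        return result;
    }

    // Runs a fake operation lasting workMillis and reports how the wrapper treated it.
    static String demo(long workMillis, long limitMillis) {
        try {
            return callWithElapsedCheck(() -> {
                Thread.sleep(workMillis);
                return "ok";
            }, limitMillis);
        } catch (TimeoutException e) {
            return "timeout";
        } catch (Exception e) {
            return "error";
        }
    }

    public static void main(String[] args) {
        System.out.println("fast call: " + demo(0, 1000));
        System.out.println("slow call: " + demo(150, 20));
    }
}
```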

It's useful to know that there isn't any other way to enforce "maximum execution time." And I think I understand why -- you need one thread watching the timer in order to signal the thread doing the work.

We're not using HystrixRequestContext, so that's not an issue for us right now.

It was interesting to learn that the thread timeout is taken into consideration when a command is moved from the queue onto a thread (case 1.b. in your description). I didn't know that.

Thanks again,
-S





benjchr...@netflix.com

Aug 13, 2013, 4:14:04 PM
to hystr...@googlegroups.com, ma...@stuartsierra.com
Hystrix 1.3 has been released (https://github.com/Netflix/Hystrix/releases/tag/1.3.0). It is now fully non-blocking internally and supports async timeouts.

You can use the observe/toObservable (eager/lazy) methods of invocation which are completely async and will perform an async timeout.

Ben