timeout behavior of Envoy when a downstream times out on a request


Weita Chen

Nov 8, 2017, 8:07:40 PM
to envoy-users
Hi,

I'd like to understand how Envoy behaves when a downstream service sets a timeout on a request.
The configuration is downstream service -> Envoy -> upstream service, and the traffic can be gRPC unary calls and streaming calls.

Say the downstream service sets a 10-second timeout on a request and Envoy sets a 20-second timeout. What I found is that when the downstream service times out the request, Envoy does not time out the request until the full 20 seconds are used up. Why doesn't Envoy time out the request as soon as the downstream has timed out the original request?

Also, I'd like to confirm that the timeout_ms setting in Envoy does not apply to gRPC streaming calls, which do not necessarily have a response message or streaming responses. That is, the request time for a streaming gRPC call is hard to define, so I assume the upstream_rq_time metric does not count the request time of streaming gRPC calls.
Thanks.

Weita

Matt Klein

Nov 9, 2017, 4:25:11 PM
to Weita Chen, envoy-users
> Say the downstream service sets a 10-second timeout on a request and Envoy sets a 20-second timeout. What I found is that when the downstream service times out the request, Envoy does not time out the request until the full 20 seconds are used up. Why doesn't Envoy time out the request as soon as the downstream has timed out the original request?

Would need more information to help here. Envoy is fully streaming and reactive. That means that if either upstream or downstream disconnects/resets, Envoy will reset the other side. So if the application is timing out and resetting/disconnecting, Envoy should reset/disconnect.

> Also, I'd like to confirm that the timeout_ms setting in Envoy does not apply to gRPC streaming calls, which do not necessarily have a response message or streaming responses. That is, the request time for a streaming gRPC call is hard to define, so I assume the upstream_rq_time metric does not count the request time of streaming gRPC calls.

It does currently apply. I would set timeout to 0. See this issue: https://github.com/envoyproxy/envoy/issues/1778 

--
You received this message because you are subscribed to the Google Groups "envoy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to envoy-users+unsubscribe@googlegroups.com.
To post to this group, send email to envoy...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/envoy-users/902ba99f-e5f5-44b9-8b52-d131bd0d19ff%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Weita Chen

Nov 9, 2017, 6:17:16 PM
to Matt Klein, envoy-users


On Thu, Nov 9, 2017 at 1:25 PM, Matt Klein <mkl...@lyft.com> wrote:
>> Say the downstream service sets a 10-second timeout on a request and Envoy sets a 20-second timeout. What I found is that when the downstream service times out the request, Envoy does not time out the request until the full 20 seconds are used up. Why doesn't Envoy time out the request as soon as the downstream has timed out the original request?
>
> Would need more information to help here. Envoy is fully streaming and reactive. That means that if either upstream or downstream disconnects/resets, Envoy will reset the other side. So if the application is timing out and resetting/disconnecting, Envoy should reset/disconnect.


Specifically, in our non-mesh setup, the downstream service sets a 30-second timeout and Envoy sets a timeout of 0. In that case, the 95th-percentile upstream_rq_time metric on Envoy can go up to 2,000,000 ms when the upstream runs into an issue and replies to the request long after the 30-second limit.

In our mesh setup, both Envoys (one co-hosted with the downstream service and the other co-hosted with the upstream service) set a 30-second timeout. In the access log, I then saw that a streaming RPC can have a duration (%DURATION%) of up to 1 hour. That's why I thought the timeout setting does not apply to streaming RPCs.

Thanks for pointing me to the issue: https://github.com/envoyproxy/envoy/issues/1778. Currently we have just one route for both unary and streaming calls, with the timeout set as I mentioned above, so we will need to separate the streaming route from the unary route. But I wonder if there are good explanations for the behaviors I saw above in the non-mesh and mesh setups.
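
For reference, the route split described above could look roughly like the following. This is only a sketch: it builds route entries in Python and prints them as JSON. The `prefix` and `cluster` keys, the `/example.Uploader/` service path, and the `upstream_service` cluster name are assumptions for illustration; only `timeout_ms` and the suggestion to use 0 for streaming come from this thread.

```python
import json


def build_routes(streaming_prefix, unary_timeout_ms):
    """Build two route entries: one for streaming calls, one for everything else.

    The key names and the cluster name are assumptions for illustration,
    not a verified Envoy schema.
    """
    return [
        # Streaming route listed first so it is matched before the catch-all.
        # timeout_ms = 0 follows the suggestion in this thread for streaming.
        {"prefix": streaming_prefix, "cluster": "upstream_service", "timeout_ms": 0},
        # Catch-all route for unary calls keeps a finite timeout.
        {"prefix": "/", "cluster": "upstream_service", "timeout_ms": unary_timeout_ms},
    ]


# "/example.Uploader/" is a hypothetical gRPC service path.
routes = build_routes("/example.Uploader/", 30000)
print(json.dumps(routes, indent=2))
```

The streaming route is placed ahead of the catch-all on the assumption that the more specific prefix needs to be matched first.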



Matt Klein

Nov 10, 2017, 7:11:47 PM
to Weita Chen, envoy-users
> But I wonder if there are good explanations for the behaviors I saw above in the non-mesh and mesh setups.

My guess is that your streaming API is an "upload" type API where you are sending data from downstream to upstream? Envoy starts the router timeout when the entire request has been received. If you are streaming from downstream to upstream you won't see any complete request. This is why we will need to add data frame timeouts in both the downstream and upstream direction. This isn't very hard to add and we will need this at Lyft soonish so I would expect it to get added in the next 1-2 months. 
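
To make the timing rule above concrete, here is a minimal sketch in Python. It is a toy model, not Envoy code: the function name and its parameters are hypothetical, and it encodes only two things stated in this thread, namely that the router timeout starts once the entire request has been received and that a timeout of 0 is the suggested setting for streaming.

```python
from typing import Optional


def router_timeout_deadline(request_complete_at_ms: Optional[int],
                            timeout_ms: int) -> Optional[int]:
    """Return when the route timeout would fire, in ms since the request began.

    Returns None when the timeout can never fire in this toy model.
    """
    if timeout_ms == 0:
        # Modeled here as "no route timeout".
        return None
    if request_complete_at_ms is None:
        # The request never finished (e.g. an upload stream that stays open),
        # so the timer was never started.
        return None
    return request_complete_at_ms + timeout_ms


# Unary call: the request is fully received after 5 ms, timeout is 30 s.
unary_deadline = router_timeout_deadline(5, 30000)

# Upload-style stream: downstream keeps sending, so the request never completes.
upload_deadline = router_timeout_deadline(None, 30000)

print(unary_deadline, upload_deadline)
```

In this model the upload-style stream never gets a deadline at all, which matches the hour-long durations seen in the access log despite the 30-second setting.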


Weita Chen

Nov 10, 2017, 7:20:10 PM
to Matt Klein, envoy-users
Yep, our streaming APIs send data from downstream to upstream. That explains why timeout_ms didn't apply as expected in our streaming case.