503 errors and empty upstream host url in logs


Weita Chen

Jul 14, 2017, 7:56:19 PM
to envoy-users
Hi,

One of our frontend servers experienced 503 errors for a period of time (~6 hrs). Envoy's access log showed a bunch of 503 errors during that period, and the upstream host URL is "-". A snippet of the logs is shown below (please bear with me; I replaced the request paths with /xxx/xxx):

[2017-07-14T05:00:52.097Z] "POST /xxx/xxx HTTP/2" 503 UO 0 57 0 - "-" "grpc-java-netty/1.3.0" "328403be-8c36-41e6-93fd-1f387cbc31b4" "127.0.0.1:11000" "-"

[2017-07-14T05:00:52.399Z] "POST /xxx/xxx HTTP/2" 503 UO 0 57 0 - "-" "grpc-java-netty/1.3.0" "c01a97b0-e8d5-4ac0-b11c-973f8b0da50e" "127.0.0.1:11000" "-"

[2017-07-14T05:00:52.400Z] "POST /xxx/xxx HTTP/2" 503 UO 0 57 0 - "-" "grpc-java-netty/1.3.0" "4285f245-9cb7-415c-a60d-5b384827d296" "127.0.0.1:11000" "-"

[2017-07-14T05:00:53.040Z] "POST /xxx/xxx HTTP/2" 503 UO 0 57 0 - "-" "grpc-java-netty/1.3.0" "a42994d6-a855-46bb-abfe-2df802fe6b9e" "127.0.0.1:11000" "-"

[2017-07-14T05:00:53.041Z] "POST /xxx/xxx HTTP/2" 503 UO 0 57 0 - "-" "grpc-java-netty/1.3.0" "e3342ff9-2d3f-4fd6-a32a-f34993577fdb" "127.0.0.1:11000" "-"

 
Our Envoy setup is (Frontend -> Envoy) --------> (Backend), and currently we just have Envoy check the health of a backend by simply opening a TCP connection. During that period, the health-check stats showed that all the upstream hosts were healthy, and other frontend servers talked to the backend servers just fine. After that period, everything came back to normal on its own, without restarting the frontend server or the Envoy. Attached is the stats output of that particular Envoy from afterwards.

Any ideas about what happened, or which metrics I should look into?

Thanks a lot.

envoy_stats.txt

Weita Chen

Jul 14, 2017, 9:25:36 PM
to envoy-users
BTW, the issue happened again, and restarting Envoy alone made it go away. I still don't know what the root cause is, though.


Matt Klein

Jul 14, 2017, 9:54:02 PM
to Weita Chen, envoy-users
UO means upstream overflow, which means that circuit breaking is kicking in.

Pertinent stats are:
cluster.snappy.upstream_rq_active: 817
cluster.snappy.upstream_rq_pending_failure_eject: 2614
cluster.snappy.upstream_rq_pending_overflow: 5860
(Also upstream_rq_time, not shown here, sent to statsd)


You may need to tune the default circuit breaking settings for the cluster. The defaults are 1024 concurrent requests, which you are close to here. Also make sure you are not disabling timeouts on the route, in case there are backend issues.
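For reference, a minimal sketch of what a per-cluster override could look like inside the snappy cluster definition in the v1 JSON config; the limits below are illustrative rather than a recommendation, and the field names should be checked against the config docs for the version you run:

"circuit_breakers": {
  "default": {
    "max_connections": 1024,
    "max_pending_requests": 2048,
    "max_requests": 2048,
    "max_retries": 3
  }
}

Raising max_requests and max_pending_requests is what lifts the 1024 ceiling that the UO 503s and upstream_rq_pending_overflow point at.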



Weita Chen

Jul 15, 2017, 2:33:06 AM
to Matt Klein, envoy-users
Thanks a lot for your reply!

Two follow-up questions:
1) For the timeout, do you mean setting "timeout_ms" in Route to some reasonable value, other than 0, so that the timeout is not disabled? Also, it looks like the route setting is under the virtual host, which is used in the mesh case rather than my current case, where a frontend Envoy talks directly to the backend?

2) How many HTTP/2 connections will get established between an Envoy and all its upstream hosts?
In my current case, a frontend server establishes 5 gRPC/HTTP2 connections to its Envoy. An Envoy talks to 7 upstream hosts. Intuitively, I assumed an Envoy would have up to 35 = 5 * 7 upstream connections, but the stats showed cluster.snappy.upstream_cx_active: 64. Why is that?

Thanks.

Matt Klein

Jul 15, 2017, 9:37:06 PM
to Weita Chen, envoy-users
Inline below.

On Fri, Jul 14, 2017 at 11:32 PM, Weita Chen <weit...@gmail.com> wrote:
Thanks a lot for your reply!

Two follow-up questions:
1) For the timeout, do you mean setting "timeout_ms" in Route to some reasonable value, other than 0, so that the timeout is not disabled? Also, it looks like the route setting is under the virtual host, which is used in the mesh case rather than my current case, where a frontend Envoy talks directly to the backend?

If you are using any HTTP support you have at least one virtual host and one route in your configuration. Egress calls in your setup should definitely have a timeout.
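As a sketch of what that looks like in the v1 route config (the cluster name and timeout value are illustrative), the timeout sits on the route inside a virtual host even in a simple front-proxy setup:

"virtual_hosts": [
  {
    "name": "backend",
    "domains": ["*"],
    "routes": [
      { "prefix": "/", "cluster": "snappy", "timeout_ms": 15000 }
    ]
  }
]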
 

2) How many HTTP/2 connections will get established between an Envoy and all its upstream hosts?
In my current case, a frontend server establishes 5 gRPC/HTTP2 connections to its Envoy. An Envoy talks to 7 upstream hosts. Intuitively, I assumed an Envoy would have up to 35 = 5 * 7 upstream connections, but the stats showed cluster.snappy.upstream_cx_active: 64. Why is that?

Envoy creates at most a single h2 connection for each upstream host, per worker thread. (This is independent of draining a connection in the case of GOAWAY or reaching max streams per connection). Thus, if you have 7 upstream hosts and 8 workers (governed by --concurrency which defaults to the # of HW threads), you would expect to see 56 active connections at steady state.
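As a quick sanity check of that formula, you can pin the worker count explicitly and compare it against cluster.snappy.upstream_cx_active; the config path below is illustrative:

envoy -c /etc/envoy/envoy.config.json --concurrency 8
# expect roughly 7 upstream hosts * 8 workers = 56 active h2 connections at steady state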
 


Weita Chen

Aug 18, 2017, 8:13:05 PM
to Matt Klein, envoy-users
Hi Matt,

We still have some issues with our Envoy configs. The active requests are being capped at 1024, even after the circuit breakers' max_requests is set to 4096. Attached are our Envoy configuration and the stats output.

Our current architecture is still an Envoy running alongside a frontend server, with the Envoy forwarding gRPC requests to an upstream cluster, i.e., (frontend -> Envoy) ----> (backend clusters).

Our frontend issues both gRPC streaming and unary calls to the backend. From time to time, an Envoy enters a state where active requests are capped at 1024 and the frontend cannot send new streaming requests to the backend, probably due to flow control kicking in from the backend gRPC servers. When the Envoy/frontend enters this state, there are no 503 errors in the Envoy access_log file, though. Also, as shown by the log file, the frontend can still make unary calls to the backend gRPC servers correctly.

The questions are:
1) Did we misconfigure something? How do we increase the active request cap?

2) Could "timeout_ms": 0 potentially cause the Envoy to enter a state where streaming requests are never completed, and hence active requests do not come down?

Thanks a lot!
Weita
stats.txt
envoy.config.json

Matt Klein

Aug 18, 2017, 10:04:37 PM
to Weita Chen, envoy-users
1) Did we misconfigure something? How do we increase the active request cap?

I'm not sure. What SHA are you running? Can you provide the output of the /clusters admin endpoint as well?
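For anyone following along, the /clusters output comes from Envoy's admin listener, e.g. (the port is whatever the "admin" block in envoy.config.json specifies; 9901 below is just an assumed value):

curl -s http://127.0.0.1:9901/clusters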

2) Could "timeout_ms": 0 potentially cause the Envoy to enter a state where streaming requests are never completed, and hence active requests do not come down?

Having no timeout is risky. If the streams hang on the backend, they will never get cleaned up. (We don't have any concept of a streaming timeout today, for example, reset if no data is received after X milliseconds.) It's possible, but I'm not sure what the problem is.
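One way to avoid a global "timeout_ms": 0 is to split routes by path prefix, since gRPC methods map to /package.Service/Method paths: give the streaming service a very long timeout and keep a finite timeout for everything else. A sketch with made-up service names and illustrative values:

"routes": [
  { "prefix": "/myorg.StreamService/", "cluster": "snappy", "timeout_ms": 3600000 },
  { "prefix": "/", "cluster": "snappy", "timeout_ms": 15000 }
]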





Matt Klein

Aug 20, 2017, 8:58:08 PM
to Weita Chen, envoy-users
One other quick follow-up: I looked at the stats again, and I can say that circuit breaking is not happening here, because the cluster.snappy.upstream_rq_pending_overflow stat is zero. I honestly have no idea what is going on here (I kind of suspect stream limits in either your application or backend), but let's start with the SHA you are running and the output of /clusters, and we can go from there.
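The counters in question can also be watched live on the admin /stats endpoint, for example (again assuming an admin listener on port 9901):

curl -s http://127.0.0.1:9901/stats | grep -E 'cluster\.snappy\.(upstream_rq_active|upstream_rq_pending_overflow|upstream_cx_active)'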

Matt Klein

Aug 21, 2017, 10:54:37 AM
to Weita Chen, Envoy Users
Can you please run with master and see what results you get? 1.3.0 is several months old and there have been changes in this area. (We will tag 1.4.0 by the end of the week).

Thanks,
Matt

On Sun, Aug 20, 2017 at 10:15 PM, Weita Chen <weit...@gmail.com> wrote:
Hi Matt,

My running envoy is:

envoy  version: 3afc7712a04907ffd25ed497626639febfe65735/Modified/DEBUG

Please see the attached file for the output of /clusters and /stats. I attached the output for two Envoys: a bad one (rq_active = 1025) vs. a good one (rq_active = 338).

The two Envoys are talking to the same upstream cluster, so I don't think it is a limit on the backend.
I suspect it is a limit on the application; specifically, we have only one HTTP/2 connection from the application (downstream host) to an Envoy for the particular gRPC streaming method that is experiencing the issue. I suspect the HTTP/2 connection we used (via the grpc/netty library) was configured with 1024 max concurrent streams. I still need to dig into how Netty sets the max concurrent streams internally, though. Thoughts?

Thanks.

Weita


Matt Klein

Aug 21, 2017, 11:14:40 AM
to Weita Chen, Envoy Users
I actually think this is an issue that has been fixed since 1.3.0. IIRC we used to set max concurrent streams to 1024, so if your application makes a single connection, it would only be able to start 1024 streams on it. Anyway, let's see what happens once you test on master.

Thanks,
Matt
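To see what a listener actually advertises, one option is to dump the HTTP/2 SETTINGS frame with a client such as nghttp; the address below is the authority from the access logs earlier in the thread (i.e., the local Envoy listener), and whether a SETTINGS_MAX_CONCURRENT_STREAMS entry appears depends on the Envoy version:

nghttp -nv http://127.0.0.1:11000/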

Weita Chen

Aug 22, 2017, 2:40:35 PM
to Matt Klein, Envoy Users
Hi Matt,

After upgrading Envoy to the tip of the master tree yesterday, we no longer saw the active requests being capped at 1024. Interestingly, the active requests have stayed at around 400 over the past 12 hours. With the old version of Envoy, after running the server for 2-3 hours, the active requests would usually climb to 1024 and be capped there.

Thanks a lot for your help!
Weita

Matt Klein

Aug 22, 2017, 2:41:43 PM
to Weita Chen, Envoy Users
Great, glad it is working now.

