gRPC latency hunting in envoy


Gaurav Poothia

Jul 2, 2021, 12:33:44 PM
to envo...@googlegroups.com

Summary

Hello!

We run envoy in our service mesh in a fairly typical topology: on the request path the client talks to a client envoy sidecar (on the same host), which talks over the network to a server envoy sidecar. Finally the server envoy proxies to the server (on the same host). We use mTLS between the envoys (similar to the diagram below).

I noticed that in this topology unary gRPC load generators see very different envoy latency (and hence different end to end latency) based on whether they run in “closed loop” or “open loop” modes [0] (please see References at the bottom of this email for definitions of closed and open loop).

If you run closed loop and open loop load testers at some carefully chosen identical QPS, the former sees much higher envoy latency than the latter, up to ~5x in some experiments.

The rest of this email is an attempt to share observations, hypotheses and repro steps around this phenomenon. Hopefully the community can confirm these and figure out improvements.


Observations

I was doing some benchmarking with unary gRPC, intentionally dialing just a single H/2 connection on all hops. The perf bench looks like this:

[screenshot: diagram of the perf bench topology]




Noticed some very odd things:

  1. Whether the load generator is open vs closed loop seems to matter a lot with respect to envoy latency. The open loop load gen hits a bottleneck when it saturates the core running envoy's single worker thread, but until then the gRPC median end to end latency is decent (p50 ~2 ms). A closed loop load generator, on the other hand, hits peak QPS because of a latency bottleneck (p50 ~10 ms); in other words, closed loop peak QPS is limited by how many simultaneous requests it can keep in flight. The latency observed in a closed loop test when it is latency bound is much worse than if you ran an open loop test at the same QPS (demonstrated in the repro steps).

  2. Closed loop QPS is not always latency bound. If instead of running the closed loop test in “peak QPS” mode you run it in “target QPS” mode with the target set slightly below peak, it will not be latency bound. In other words, closed loop degrades envoy latency suddenly and dramatically at some QPS inflection point.

  3. Dissecting the latency in a closed loop test (by correlating the H/2 streams for a single gRPC call across hops) shows that the latency comes from the two envoys in the path, split somewhat evenly between request and response proxying.

  4. The amount of latency introduced by the envoys seems directly proportional to how “dense” the arriving packets are. “Density” here is defined as the number of gRPC requests or responses per H/2 packet.

E.g. in the most pathological case a closed loop tester configured with N worker threads will degrade to sending all N gRPC requests in one packet and then waiting for N responses to arrive. All the responses then arrive together from the server, causing a ping pong of dense packets (density approaching N).

While it's not always that extreme, the envoy latency observed is proportional to this gRPC/packet density. So two load tests (open vs closed) that send at the same QPS can see different latency based on this density (open -> low density -> low latency vs closed -> high density -> high latency).

  5. Envoy seems to increase the density of the packets as it proxies them across. In other words, if on average N gRPC/packet arrive at envoy and on average M gRPC/packet leave envoy, then M > N (packets going upstream are denser than packets arriving from downstream). If (4) is true then this implies that the client envoy will worsen the slowdown on the server envoy. (See the repro steps for how to observe the density increase; a rough way to estimate density from a packet capture is sketched right below.)
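
Here is a rough Go sketch of how you could estimate this density yourself from a packet capture of a plaintext hop (e.g. the client -> client envoy hop in the repro below, which runs h2c on 127.0.0.1:10080). It counts HTTP/2 HEADERS frames per TCP segment with gopacket; on the request direction one HEADERS frame is roughly one gRPC call. The connection preface and frames that span segments are not handled, and capture.pcap is a placeholder path, so treat the output as an approximation.

package main

import (
	"fmt"
	"log"

	"github.com/google/gopacket"
	"github.com/google/gopacket/pcap"
)

func main() {
	handle, err := pcap.OpenOffline("capture.pcap") // placeholder capture file
	if err != nil {
		log.Fatal(err)
	}
	defer handle.Close()
	// Only look at the plaintext client -> client envoy hop.
	if err := handle.SetBPFFilter("tcp port 10080"); err != nil {
		log.Fatal(err)
	}

	var segments, headersFrames int
	for pkt := range gopacket.NewPacketSource(handle, handle.LinkType()).Packets() {
		app := pkt.ApplicationLayer()
		if app == nil {
			continue
		}
		if n := countHeadersFrames(app.Payload()); n > 0 {
			segments++
			headersFrames += n
		}
	}
	if segments > 0 {
		fmt.Printf("avg density: %.1f HEADERS frames (~gRPC calls) per segment\n",
			float64(headersFrames)/float64(segments))
	}
}

// countHeadersFrames walks H/2 frame headers (9 bytes: 24-bit length, 8-bit type,
// 8-bit flags, 31-bit stream id) and counts frames of type HEADERS (0x1).
func countHeadersFrames(b []byte) int {
	count := 0
	for len(b) >= 9 {
		length := int(b[0])<<16 | int(b[1])<<8 | int(b[2])
		if b[3] == 0x1 {
			count++
		}
		if len(b) < 9+length {
			break // frame continues in the next segment; stop counting here
		}
		b = b[9+length:]
	}
	return count
}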




Hypothesis

I tried to dig into why these might be the case:


Why do closed loop load testers degrade to being latency bound at some QPS inflection point, and then start sending denser packets than they did before the inflection point (and denser than open loop load testers)?

It’s possible that the closed loop tester sees some chunking behaviour thanks to envoy, and because of that its threading model gets into the groove of ping ponging dense packets. Maybe open loop testers, thanks to their threading model, don't perpetuate any chunking of responses and keep the requests staggered across smaller packets.

But I have no theory to explain why closed loop latency degrades suddenly at the “inflection QPS”.


Why is envoy latency proportional to arriving packet density?

Here is a sample of some envoy trace statements (a mix of private and public) on the server envoy during a high latency experiment, on the request processing path:

[2021-06-22 22:55:06.559][3440254][info][http2] [external/envoy/source/common/http/http2/codec_impl.cc:652] [C2134] dispatching 8383 bytes

[2021-06-22 22:55:06.563][3440254][info][http2] [external/envoy/source/common/http/http2/codec_impl.cc:671] [C2134] took time  4 ms dispatching 8383 bytes

>> snip <<

[2021-06-22 22:55:06.563][3440254][info][connection] [external/envoy/source/common/network/raw_buffer_socket.cc:68] [C2135] write returns: 10012


The first line is emitted just before the server envoy calls nghttp2_session_mem_recv.

The second line is custom tracing I added to measure how long nghttp2_session_mem_recv took.

The third line is when the server envoy actually puts the bits headed upstream on the wire (the size is larger than in the first line because of injected headers).

Note that it took 4 ms to process a high density packet (8k bytes Rx probably means ~64 gRPC requests in the packet; with a separate H/2 HEADERS and DATA frame per request that is roughly 128 frames).

Further note that not a single gRPC request got proxied across until *all* the gRPC requests in the packet had been processed.


This seems to be a result of a confluence of factors.


Here is very high level pseudocode of the relevant parts of the request processing path, as I understood it:


libevent event processing loop {
  if (read event) {
    read event callback {                          // Event::FileReadyType::Read
      read from socket
      nghttp2_session_mem_recv {                   // does not return until the whole Rx buffer is consumed
        callback for processing each H/2 frame {
          process the frame: decode, pass through HTTP filters, route
          find/dial connection to upstream, create stream, encode
          nghttp2_session_send {
            send callback {
              copy bytes destined upstream into the write buffer
              activate a write event               // deferred until the read callback unwinds
            }
          }
        }
      }
    }
  }
  if (write event) {
    write event callback {                         // Event::FileReadyType::Write
      send the accumulated bytes in the write buffer to the socket connecting upstream
    }
  }
}


a) libevent has to wait for the read event callback (Event::FileReadyType::Read) to return before it can process the Event::FileReadyType::Write event and call the write event callback that sends the proxied bits upstream.

b) nghttp2_session_mem_recv will not let the read event callback stack unwind until it has processed all the gRPC requests in the Rx packet. Hence the proxied bits for all the gRPC requests in the Rx packet accumulate in the write buffer, waiting to be sent out together.

c) It seems like each gRPC request's processing must be on the order of ~60 microseconds (under my config) to add up to 4 ms for 64 requests (4000 us / 64 ≈ 62 us). Seems high?
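
As a back-of-the-envelope check of that arithmetic, here is a tiny Go sketch (assuming ~62.5 us per request, i.e. 4 ms / 64, and that nothing is flushed upstream until the read callback returns):

package main

import "fmt"

func main() {
	const perRequestCostUs = 62.5 // assumed per-request processing cost (4 ms / 64 requests)
	for _, density := range []int{1, 8, 64} {
		// Nothing is written upstream until the whole read callback returns,
		// so even the first request in the packet waits for all of them.
		addedMs := float64(density) * perRequestCostUs / 1000
		fmt.Printf("density %2d gRPC/packet -> ~%.1f ms before any byte is proxied out\n",
			density, addedMs)
	}
}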



Note: I just happened to pick traces from request processing on the server envoy, but this density-proportional slowdown happens on all 4 envoy paths: (request, response processing) x (client envoy, server envoy).


Why does envoy seem to increase the density of the packets that it proxies across?

Here is another sample of (private) envoy trace statements :


[2021-06-22 22:55:06.397][3440254][info][http2] [external/envoy/source/common/http/http2/codec_impl.cc:652] [C2134] dispatching 5112 bytes

[2021-06-22 22:55:06.400][3440254][info][http2] [external/envoy/source/common/http/http2/codec_impl.cc:671] [C2134] took time  2 ms dispatching 5112 bytes

[2021-06-22 22:55:06.400][3440254][info][http2] [external/envoy/source/common/http/http2/codec_impl.cc:652] [C2134] dispatching 3613 bytes

[2021-06-22 22:55:06.402][3440254][info][http2] [external/envoy/source/common/http/http2/codec_impl.cc:671] [C2134] took time  1 ms dispatching 3613 bytes

>> snip <<

[2021-06-22 22:55:06.402][3440254][info][connection] [external/envoy/source/common/network/raw_buffer_socket.cc:68] [C2135] write returns: 10372


In this case notice that two Rx packets are processed before libevent calls the write callback to put the proxied bits on the wire. So here the gRPC requests from the first packet had to wait not only for all the other gRPC requests in their own packet but also for those in the next packet to be processed before they were proxied out. I don't understand the envoy dispatcher well enough to know why two read events got processed before a write event. But this kind of behaviour on the client envoy can be a double whammy for latency: a slowdown on this hop, plus proxying out a denser packet than was received, which means more slowdown on the next hop (the server envoy).

 


Questions

  • Do we have possible explanations for why closed loop testers start sending higher density packets than open loop testers? (I have observed this with a closed source gRPC Java load tester too, in addition to ghz.)

  • Does the assertion that envoy latency is proportional to gRPC/H2 packet density, and the code factors suggested for it, all check out?

  • Does the assertion that “envoy increases gRPC/H2 packet density”, and the code factors suggested for it, check out?

  • If all the assertions and code explanations check out, are there any ideas for improvements we could make to envoy to mitigate this?



Repro steps:

You can use any unary gRPC server and load tester that supports open and closed loop modes, but below are some concrete repro bits/configs that you can run over loopback on a single box.


This repro setup will use a single box to make it easier to repro:

Client --> Client envoy (on 127.0.0.1:10080) --> Server envoy (on 127.0.0.1:643) --> Server (on 127.0.0.1:31001)


For server bits:

I modified the hello world go/grpc server: 

https://github.com/gpoothia-stripe/grpc-go/commit/e415d4b6697a4c8349761b831936a996502c676c
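
If you just want the gist without clicking through, the handler looks roughly like the sketch below. The trafficpb package and the message/field names are placeholders for the generated code (only the method name and the two request fields match the ghz commands further down); the real implementation is in the commit above.

package main

import (
	"context"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"

	trafficpb "example.com/traffic/api" // placeholder for the generated proto package
)

type trafficServer struct {
	trafficpb.UnimplementedTrafficTestApiServer
}

func (s *trafficServer) TrafficRequest(ctx context.Context, req *trafficpb.TrafficRequestArgs) (*trafficpb.TrafficResponse, error) {
	// Optional artificial server-side latency (response_latency_ms in the ghz payload).
	if d := req.GetResponseLatencyMs(); d > 0 {
		time.Sleep(time.Duration(d) * time.Millisecond)
	}
	// Return a payload of the requested size (response_size_bytes in the ghz payload).
	return &trafficpb.TrafficResponse{Payload: make([]byte, req.GetResponseSizeBytes())}, nil
}

func main() {
	lis, err := net.Listen("tcp", "127.0.0.1:31001") // server address from the topology above
	if err != nil {
		log.Fatalf("listen: %v", err)
	}
	s := grpc.NewServer()
	trafficpb.RegisterTrafficTestApiServer(s, &trafficServer{})
	if err := s.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}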


For the client bits I used the open source tool ghz as both the open and closed loop tester [1].


For configuring client and server envoy here are some barebones client envoy and server envoy configurations (with TLS, access logging, stats configured)


I used envoy v1.17

Once bits are set up...


Step 1:

Run ghz in closed loop mode (--no-async switch) to find peak QPS and its latency:

./ghz --insecure  --call traffic.api.TrafficTestApi/TrafficRequest -d '{"response_size_bytes": 10, "response_latency_ms": 0}' -n 100000 -c 64 --no-async  127.0.0.1:10080


On my box I observed that peak QPS was ~7k and p50 latency was ~10 ms.

This is clearly latency bound: 64 threads x (1000 ms / ~10 ms per request) ≈ 6,400, just under 7k, i.e. QPS is limited by how many requests the tester can keep outstanding across 64 threads.


Step 2:

Now run ghz in open loop mode at whatever peak QPS you hit with closed loop (--rps 7000 --async switches):

./ghz --insecure  --call traffic.api.TrafficTestApi/TrafficRequest -d '{"response_size_bytes": 10, "response_latency_ms": 0}' -n 100000 -c 64 --rps 7000 --async 127.0.0.1:10080


The latency will drop: e.g. I observed a p50 latency of ~2.5 ms open loop versus ~10 ms closed loop (in step 1) at identical QPS!


Step 3:

Similarly, if you run ghz back in closed loop mode but with a target QPS slightly below the peak QPS observed earlier (in step 1), the p50 latency falls back to ~2 ms.


You can also observe via packet capture that 

  • The packets are less dense (num gRPC per H/2 packet) in the open loop case than in the closed loop case

  • The number of H/2 packets with gRPC requests sent from the client app is greater than the number of H/2 packets with gRPC requests received by the server app, implying that envoy is increasing density in transit. This is also obvious from eyeballing packet sizes in the two captures.




References

[0] In "closed loop" load generators new gRPC requests are generated only on arrival of an earlier gRPC request's response. However N threads can do this independently making thread count an important attribute of closed loop load test.

In "open loop" load generators send new gRPC requests independent of completion of other gRPC requests.

I borrowed the definitions of open vs closed loop load generators from the abstract of this paper (you can ignore everything else in the paper for our purposes here):

https://www.usenix.org/legacy/events/nsdi06/tech/full_papers/schroeder/schroeder.pdf
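
For concreteness, here is a rough Go sketch of the two modes against an arbitrary unary call (invoke here is a placeholder for the actual stub call):

package loadgen

import (
	"context"
	"sync"
	"time"
)

// ClosedLoop: each of n workers issues its next request only after the previous
// response returns, so at most n requests are in flight and achievable QPS is
// bounded by n / latency.
func ClosedLoop(ctx context.Context, n int, invoke func(context.Context) error) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ctx.Err() == nil {
				_ = invoke(ctx) // the next request waits for this response
			}
		}()
	}
	wg.Wait()
}

// OpenLoop: requests are issued on a fixed schedule regardless of whether earlier
// responses have arrived, so the offered QPS is independent of observed latency.
func OpenLoop(ctx context.Context, qps int, invoke func(context.Context) error) {
	ticker := time.NewTicker(time.Second / time.Duration(qps))
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			go func() { _ = invoke(ctx) }() // fire without waiting for completion
		}
	}
}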



[1] ghz uses the --no-async switch to indicate closed loop mode and the --async switch to indicate open loop mode.



Thanks in advance!

Gaurav


Zizon Qiu

Jul 2, 2021, 11:46:11 PM
to Gaurav Poothia, envoy-dev
I ran into a similar performance pattern when benchmarking against our in-house redis proxy filter.

With 4k concurrent clients, each with a pipeline=1 setting (i.e. closed loop mode in the sense of this thread), Envoy uses ~20 CPUs to achieve 200k+ QPS, which works out to ~10k+ QPS per CPU / per event loop. Thus the approximate average cost per request is 100 microseconds.

We also added some timing stats to the filter, to strip out how much time is spent on our own code path. The result shows ~40 us (~20 us decode and ~20 us encode plus some other stats).

Subtracting that from the 100 us leaves ~60 us, which coincidentally matches the data in this gRPC example.

So, my alternative hypothesis is:
1. Envoy/libevent has a relatively fixed cost per event loop iteration.
2. When more requests are packed into a single event loop cycle, each request suffers less because that fixed cost is amortized.
3. So in closed loop mode, since it sticks to a request-response pattern, it certainly generates more I/O than the open loop (which batches more requests into a single call).

By the way, I pointed the same client benchmark suite directly at the redis server, which got ~50k QPS (~20 us per request).


Gaurav Poothia

Jul 7, 2021, 6:42:02 PM
to envoy-dev
Thanks for sharing!

>> Envoy/libevent has a relatively fixed cost per event loop iteration.
At least my particular concern here is not cycles/request but latency/request.
In the section "Why is envoy latency proportional to arriving packet density?" I tried to put forth the hypothesis for why the read event callback will take longer to return for dense packets.

I would appreciate it if one of the maintainers could confirm these theories and suggest next steps.

Thanks in advance!

Zizon Qiu

Jul 8, 2021, 10:50:58 AM
to Gaurav Poothia, envoy-dev
Take, for example, an arbitrary event loop cycle. The total time spent can be modeled as:

Time = sys_cost + read_fd*(read_bytes*cost_per_read_byte) + write_fd*(write_bytes*cost_per_write_byte)

The intuition behind x_bytes*cost_per_x_byte is that each request/response consists of syscall cost plus a variety of business logic costs; wrapping that into a byte-oriented parameter keeps it general.

Further assume that a given proportion of requests gets its response in/after the K-th event loop. Moreover, to keep it simple, treat the above as an average description, i.e. each event loop sees roughly the same x_fd/x_bytes (call it a mean distribution).

Then for the requests packed into read_fd*read_bytes, which amounts to N logical requests, the latency is

K*Time = K*(sys_cost + read_fd*(read_bytes*cost_per_read_byte) + write_fd*(write_bytes*cost_per_write_byte)) / N
      -> a*sys_cost + b*N*cost_per_read_byte + c*N*cost_per_write_byte

where x_fd*x_bytes is replaced by a linear projection of the N requests.

This can be further reduced to

-> const_cost + N*cost_per_hybrid_byte

So, within a single cycle, if N is large (i.e. the packets are dense), it leads to a longer Time, which is the observed latency.

And from the longer form:
1. We can suspect that adding more concurrent connections may also have a negative impact on latency.
2. If cost_per_byte is relatively cheap, the latency may be dominated by const_cost/sys_cost.


Matt Klein

Jul 9, 2021, 12:20:48 PM
to Zizon Qiu, Antonio Vicente, Joshua Marantz, Rozhkov, Dmitry, Otto van der Schaaf, Gaurav Poothia, envoy-dev
Thanks for the great discussion! Very interesting stuff.

+Antonio Vicente +Joshua Marantz +Rozhkov, Dmitry +Otto van der Schaaf (our performance experts who also have a lot of experience looking at this data.)

At a high level though, there is no free lunch here. We can decrease latency at the expense of throughput or vice versa. Denser packets will generally increase throughput, and the code errs on the side of doing that (for example as you point out trying to read up to a limit, and then write large packet sizes.)

We could potentially make more of this configurable than it is today (further limiting the read event loop size to flush writes more frequently), but I would not immediately assume this will lead to better overall performance. This is a tricky topic with a large number of tradeoffs.

Personally, I would step back and ask a different question: what real world problem are you trying to solve? Can you provide performance information about a running system and not a synthetic benchmark? That will allow us to look at the real problem(s) that we need to solve.

Zizon Qiu

Jul 9, 2021, 11:32:31 PM
to Matt Klein, Antonio Vicente, Joshua Marantz, Rozhkov, Dmitry, Otto van der Schaaf, Gaurav Poothia, envoy-dev
We are writing a codis/redis filter that inspects each command in detail, to do some in-depth stats and to apply QoS policy by command type or even by which key it is.

To bring it live, we need some baseline performance data to do resource/deployment/safeguard planning, which led to these ideas.

Our test setting:

memtier_benchmark -t 40 -c 100 --pipeline 1

achieves ~210k QPS at ~18 ms latency with 32 envoy concurrency.
That is a ~20-30% degradation compared to the TCP proxy (~310k QPS at ~13 ms).

However, continually increasing concurrency brought no improvement, for either type of filter.

So later we re-implemented it in a from-scratch Go version, which got ~350k QPS at the ~11 ms level. And by further increasing the number of Go Ps (GOMAXPROCS), we got ~400k QPS at ~10 ms.

The benefit of Go presumably comes from a runtime scheduler that burns CPU in a more meaningful way (offloading blocked syscalls off the P and so on, whereas the event loop model simply blocks).

So, aside from the initial QoS questions, we are now considering: what if Envoy cannot balance requests across event loops, or what if one downstream has an overloaded request rate while the others are underloaded?
For the overload case, Go helps a lot through implicit runtime multiplexing, but in Envoy it needs a bit of thinking on how to balance/migrate overloaded requests to maximize utilization.

Admittedly this may seem over-engineered, since QoS should have prevented it from happening in the first place.

But it still seems worth considering, since concurrency turns out to be a tricky tuning parameter, because:
1. We cannot simply assume N*concurrency to be our upper limit.
2. When load is skewed, utilization may not be that efficient; worse, the latency distribution becomes somewhat fluctuating.

Or perhaps this is just not the right way to use Envoy (as a big gateway), and it should be used as a sidecar instead.



Matt Klein

Jul 10, 2021, 11:13:25 AM
to Zizon Qiu, Antonio Vicente, Joshua Marantz, Rozhkov, Dmitry, Otto van der Schaaf, Gaurav Poothia, envoy-dev
Very interesting, thanks for sharing.

I agree there are tradeoffs of the work stealing scheduler that Go/Tokio/etc. use vs. the non-stealing "simple" scheduler that Envoy uses.

In the Go case I would be curious though about latency outlier due to garbage collection? There must be some impact there? I know they have been continuously improving the GC and scheduler, but there is no free lunch when it comes to that kind of thing either.

Redis is a particularly interesting case for Envoy because the amount of work per op is tiny compared to something like HTTP. Unless you are using all of the other features of envoy (service discovery, etc. etc.) I think it would be tough for Envoy to beat a purpose built solution that does nothing but Redis.

Matt Klein

Jul 10, 2021, 11:18:13 AM
to Zizon Qiu, Antonio Vicente, Joshua Marantz, Rozhkov, Dmitry, Otto van der Schaaf, Gaurav Poothia, envoy-dev
One other question: for your benchmarking did you either use reuse_port or the exact balance connection balancer to make sure connections are spread evenly among workers?

Zizon Qiu

Jul 10, 2021, 12:35:49 PM
to Matt Klein, Antonio Vicente, Joshua Marantz, Rozhkov, Dmitry, Otto van der Schaaf, Gaurav Poothia, envoy-dev
On Sat, Jul 10, 2021 at 11:18 PM Matt Klein <mattkl...@gmail.com> wrote:
One other question: for your benchmarking did you either use reuse_port or the exact balance connection balancer to make sure connections are spread evenly among workers?

Sure. That's the only tuning we did.
Other things like irq/cpu/numa binding are kept at defaults and consistent across benchmarks.
 
On Sat, Jul 10, 2021 at 9:13 AM Matt Klein <mattkl...@gmail.com> wrote:
Very interesting, thanks for sharing.

I agree there are tradeoffs of the work stealing scheduler that Go/Tokio/etc. use vs. the non-stealing "simple" scheduler that Envoy uses.

In the Go case I would be curious though about latency outlier due to garbage collection? There must be some impact there? I know they have been continuously improving the GC and scheduler, but there is no free lunch when it comes to that kind of thing either.

It could be an issue if it keeps running for a longer time.
We made some pass-by-value style changes, aiming to please escape analysis, but have not done any benchmark/comparison yet.

In some limited micro benchmarks, it seems the value receiver method is a bit slower than the pointer receiver method (because of the copy?).

On the Envoy side, we did try offloading some work to a separate dispatcher to minimize cost, which gave positive results.

And we had been considering separating the polling to be more like the Go netpoller, or an architecture like Netty (a Java NIO framework with a similar filter chain design) that uses separate pools of workers for I/O and CPU-intensive work, but realized that it would break too much, such as ThreadLocalObject and Dispatcher thread safety.
Redis is a particularly interesting case for Envoy because the amount of work per op is tiny compared to something like HTTP. Unless you are using all of the other features of envoy (service discovery, etc. etc.) I think it would be tough for Envoy to beat a purpose built solution that does nothing but Redis.

That is the reason why we considered Envoy first.

If we do decide to invest more in the Go version in the end, there will be too many wheels to reinvent.

Sainaath Ramanathan

Mar 15, 2025, 10:55:17 AM
to envoy-dev
Hello everyone,

I'm a newbie to Envoy and currently dealing with higher-than-expected latency for gRPC-Web requests in my setup. The setup is simple: the web client communicates with an Envoy proxy, which then forwards the requests to a gRPC backend service. Here are some details:

  • Envoy Proxy: Acts as the intermediary between the client and the gRPC backend.
  • gRPC Backend: Processes the requests forwarded by Envoy.
I've been monitoring the x-envoy-upstream-service-time header on the web client, which consistently shows 1 ms for the backend processing time. However, the web client observes approximately 6.3 ms of latency, over 6x the upstream service time.

Here's what I've checked so far:

  1. Network Conditions: The network seems stable with no significant fluctuations.
  2. Envoy Configuration: Confirmed that the envoy configuration is correct and the proxy is functioning as expected.
  3. Backend Performance: The backend service isn't experiencing any performance issues.
Given the large discrepancy between the upstream service time and the observed client latency, I am trying to understand what could be causing it. The setup requires high network performance, and I have tried disabling all the configs described here: https://www.envoyproxy.io/docs/envoy/latest/faq/performance/how_to_benchmark_envoy. Could there be additional overhead introduced by the Envoy proxy or the gRPC-Web protocol itself? Are there other factors I should consider?

Any insights or debugging tips would be greatly appreciated!

Thank you in advance for your help.