Hello!
We run Envoy in our service mesh in a fairly typical topology: on the request path the client talks to a client Envoy sidecar (on the same host), which talks over the network to a server Envoy sidecar, and finally the server Envoy proxies to the server (on the same host). We use mTLS between the Envoys (see the sketch just below).
I noticed that in this topology unary gRPC load generators see very different Envoy latency (and hence different end-to-end latency) depending on whether they run in "closed loop" or "open loop" mode [0] (please see the references at the bottom of this email for definitions of closed and open loop).
If you run closed loop and open loop load testers at a carefully chosen identical QPS, the former sees much higher Envoy latency than the latter: up to ~5x in some experiments.
The rest of this email is an attempt to share observations, hypotheses, and repro steps around this phenomenon. Hopefully the community can confirm these and figure out improvements.
I was doing some benchmarking with unary gRPC, intentionally dialing just a single H/2 connection on every hop. The perf bench looks like this: client -> client Envoy sidecar -> (mTLS over the network) -> server Envoy sidecar -> server.
Noticed some very odd things:
Whether the load generator is open or closed loop seems to matter a lot for Envoy latency. The open loop load generator eventually hits a bottleneck when it saturates the core running Envoy's single worker thread, but until then the median end-to-end gRPC latency is decent (p50 ~2 ms). A closed loop load generator, on the other hand, hits its peak QPS because of a latency bottleneck (p50 ~10 ms); in other words, closed loop peak QPS is limited by how many simultaneous requests it can keep in flight. The latency observed in a latency-bound closed loop test is much worse than if you ran an open loop test at the same QPS (demonstrated in the repro steps).
Closed loop QPS was not always latency bound: if instead of running the closed loop test in "peak QPS" mode you run it in "target QPS" mode with the target set slightly below peak, it is not latency bound. In other words, closed loop load degrades Envoy latency suddenly and dramatically at some QPS inflection point.
Dissecting the latency of a closed loop test (by correlating the H/2 streams for a single gRPC request across hops) shows that the latency comes from the two Envoys in the path and is split roughly evenly between request and response proxying.
The amount of latency introduced by the Envoys seems directly proportional to how "dense" the arriving packets are. "Density" here is defined as the number of gRPC requests or responses per H/2 packet.
e.g. in the most pathological case the closed loop tester configured with N worker threads degrades to sending all N gRPC requests in one packet and then waiting for all N responses. The responses then arrive together from the server, causing a ping-pong of dense packets (density approaching N).
While it's not always that extreme, the observed Envoy latency is proportional to this gRPC/packet density. So two load tests (open vs closed) that send at the same QPS can see different latency depending on this density (open -> low density -> low latency vs closed -> high density -> high latency).
Envoy seems to increase the density of the packets as it proxies them across. In other words, if on average N gRPC/packet arrive at Envoy and on average M gRPC/packet leave Envoy, then M > N (packets going upstream are denser than those arriving from downstream). Combined with the previous observation (latency proportional to density), this implies that the client Envoy will worsen the slowdown on the server Envoy. (See the repro steps for how to observe the density increase.)
I tried to dig into why this might be the case:
Why do closed loop load testers degrade to being latency bound at some QPS inflection point, and why do they then start sending denser packets than they did before the inflection point (and than open loop load testers do)?
It's possible that the closed loop tester sees some response chunking caused by Envoy and, because of its threading model, settles into a groove of ping-ponging dense packets. Maybe open loop testers, thanks to their threading model, don't perpetuate any chunking of responses and keep requests staggered across smaller packets.
But I have no theory to explain why closed loop latency degrades suddenly at “inflection QPS”
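To make the open vs closed distinction and the ping-pong hypothesis concrete, here is a minimal Go sketch of the two modes. This is purely illustrative and is not ghz's actual implementation; issue() just stands in for a unary call such as TrafficRequest. In the closed loop all N workers tend to block together waiting for responses, the responses come back in a burst, and the next N requests go out in a burst; the open loop's ticker keeps sends staggered regardless of how responses arrive.

package main

import (
	"sync"
	"time"
)

// issue stands in for a unary gRPC call (e.g. a stub call to TrafficRequest).
func issue() { time.Sleep(2 * time.Millisecond) }

// closedLoop: each of n workers sends its next request only after the previous
// response returns, so at most n requests are in flight and achieved QPS is
// roughly n / latency. If responses arrive in a burst, the next n requests
// also go out in a burst.
func closedLoop(n, total int) {
	var wg sync.WaitGroup
	for w := 0; w < n; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < total/n; i++ {
				issue()
			}
		}()
	}
	wg.Wait()
}

// openLoop: requests are launched on a fixed schedule regardless of whether
// earlier ones have completed, so arrivals stay staggered even if responses
// are slow or bursty.
func openLoop(qps, total int) {
	ticker := time.NewTicker(time.Second / time.Duration(qps))
	defer ticker.Stop()
	var wg sync.WaitGroup
	for i := 0; i < total; i++ {
		<-ticker.C
		wg.Add(1)
		go func() {
			defer wg.Done()
			issue()
		}()
	}
	wg.Wait()
}

func main() {
	closedLoop(64, 100000) // like ghz -c 64 --no-async
	openLoop(7000, 100000) // like ghz --rps 7000 --async
}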
Why is envoy latency proportional to arriving packet density?
Here is a sample of Envoy trace statements (a mix of stock and custom tracing) from the server Envoy on the request processing path during a high latency experiment:
[2021-06-22 22:55:06.559][3440254][info][http2] [external/envoy/source/common/http/http2/codec_impl.cc:652] [C2134] dispatching 8383 bytes
[2021-06-22 22:55:06.563][3440254][info][http2] [external/envoy/source/common/http/http2/codec_impl.cc:671] [C2134] took time 4 ms dispatching 8383 bytes
>> snip <<
[2021-06-22 22:55:06.563][3440254][info][connection] [external/envoy/source/common/network/raw_buffer_socket.cc:68] [C2135] write returns: 10012
The first line is emitted just before the server Envoy calls nghttp2_session_mem_recv.
The second line is custom tracing I added to measure how long nghttp2_session_mem_recv took.
The third line is when the server Envoy actually puts the bytes headed upstream on the wire (the size is larger than in the first line because of injected headers).
Note that it took 4 ms to process one high density packet (~8 KB received probably means ~64 gRPC requests in the packet, each with its own H/2 HEADERS and DATA frame, i.e. ~128 frames).
Further note that not a single gRPC request got proxied across until *all* the gRPC requests in the packet had been processed.
This seems to be a result of a confluence of factors.
Here is very high level pseudocode of the relevant parts of the request processing path as I understand it:
libevent event processing loop {
  if (read event) {
    read event callback {
      read from socket
      nghttp2_session_mem_recv
        callback invoked for each H/2 frame {
          process the frame: decode, pass through HTTP filters, route
          find/dial connection to upstream, create stream, encode
          nghttp2_session_send
            send callback {
              copy bytes destined upstream into the write buffer
              activate a write event
            }
        }
    }
  }
  if (write event) {
    write event callback {
      send the bytes in the write buffer to the socket connecting upstream
    }
  }
}
a) libevent has to wait for the read event callback (Event::FileReadyType::Read) to return before it can process the Event::FileReadyType::Write event and invoke the write event callback that sends the proxied bytes upstream.
b) nghttp2_session_mem_recv does not let the read event callback stack unwind until it has processed all the gRPC requests in the received packet. Hence the proxied bytes for all the gRPC requests in that packet accumulate in the write buffer, waiting to be sent out together.
c) Each gRPC request must therefore take on the order of ~60 microseconds to process (under my config) for this to add up to 4 ms for 64 requests (4000 µs / 64 ≈ 60 µs). Seems high?
Note: I just happened to pick traces from request processing on the server Envoy, but this density-proportional slowdown happens on all four Envoy paths: (request processing, response processing) x (client Envoy, server Envoy).
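Putting (a)-(c) together, here is a tiny toy model of how much latency a single hop adds as a function of density. This is not Envoy code; the 60 µs per-request figure is just the estimate from the trace math above. The point is simply that every gRPC request in a packet waits until the whole packet has been dispatched before anything is flushed:

package main

import (
	"fmt"
	"time"
)

func main() {
	perRequest := 60 * time.Microsecond // assumed per-request processing cost (from the trace math above)
	for _, density := range []int{1, 8, 64} {
		// All requests in the packet are dispatched before the write event can
		// flush any of them, so each request waits roughly density * perRequest.
		added := time.Duration(density) * perRequest
		fmt.Printf("density %2d gRPC/packet -> ~%v added on this hop\n", density, added)
	}
}

At density 64 this lines up with the ~4 ms dispatch time in the trace above; at density 1 or 2 (more like the open loop case) the same mechanism costs only tens of microseconds per hop.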
Why does envoy seem to increase the density of the packets that it proxies across?
Here is another sample of Envoy trace statements (again a mix, including my custom timing trace):
[2021-06-22 22:55:06.397][3440254][info][http2] [external/envoy/source/common/http/http2/codec_impl.cc:652] [C2134] dispatching 5112 bytes
[2021-06-22 22:55:06.400][3440254][info][http2] [external/envoy/source/common/http/http2/codec_impl.cc:671] [C2134] took time 2 ms dispatching 5112 bytes
[2021-06-22 22:55:06.400][3440254][info][http2] [external/envoy/source/common/http/http2/codec_impl.cc:652] [C2134] dispatching 3613 bytes
[2021-06-22 22:55:06.402][3440254][info][http2] [external/envoy/source/common/http/http2/codec_impl.cc:671] [C2134] took time 1 ms dispatching 3613 bytes
>> snip <<
[2021-06-22 22:55:06.402][3440254][info][connection] [external/envoy/source/common/network/raw_buffer_socket.cc:68] [C2135] write returns: 10372
In this case notice that two received packets are processed before libevent calls the write callback to put the proxied bytes on the wire. So every gRPC request from the first packet had to wait not only for all the other gRPC requests in its own packet but also for those in the next packet to be processed before it was proxied out. I don't understand the Envoy dispatcher well enough to know why two read events got processed before a write event. But this kind of behaviour on the client Envoy can be a double whammy for latency: it slows down this hop, and proxying out a denser packet than was received also means more slowdown on the next hop (the server Envoy).
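As a toy illustration of that double whammy (the request counts are rough estimates, only the mechanism matters): if the write event runs only after k read events, the outbound flush carries the sum of the inbound densities.

package main

import "fmt"

func main() {
	// Each inner slice is the set of inbound packet densities (gRPC requests per
	// packet) processed before one write event flushes the write buffer.
	scenarios := [][]int{
		{39, 28}, // two reads coalesced before one flush (rough estimates for the 5112 and 3613 byte packets above)
		{64},     // one dense packet flushed alone
		{1, 1},   // open-loop-like arrivals
	}
	for _, reads := range scenarios {
		out := 0
		for _, n := range reads {
			out += n // the read callback appends proxied requests to the write buffer
		}
		fmt.Printf("inbound densities %v -> one outbound flush of density %d\n", reads, out)
	}
}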
Do we have possible explanations for why closed loop testers start sending higher density packets than open loop testers? (I have observed this with a closed source gRPC Java load tester too, in addition to ghz.)
Does the assertion that Envoy latency is proportional to gRPC/H2 packet density, and the code factors suggested for it, check out?
Does the assertion that Envoy increases gRPC/H2 packet density, and the code factors suggested for it, check out?
If these assertions and code explanations check out, are there any ideas for improvements we could make to Envoy to mitigate this?
You can use any unary gRPC server and load tester that supports open and closed loop modes. But below are some concrete repro bits/configs that you can run over loopback on a single box.
This repro setup uses a single box to make it easier to reproduce:
Client --> Client envoy (127.0.0.1:10080) --> Server envoy (127.0.0.1:643) --> Server (127.0.0.1:31001)
For server bits:
I modified the hello world go/grpc server:
https://github.com/gpoothia-stripe/grpc-go/commit/e415d4b6697a4c8349761b831936a996502c676c
For the client bits I used the open source tool ghz as both the open and closed loop tester [1]
For configuring client and server envoy here are some barebones client envoy and server envoy configurations (with TLS, access logging, stats configured)
Once bits are set up...
Step 1:
Run ghz in closed loop mode (--no-async switch) to find peak QPS and its latency
./ghz --insecure --call traffic.api.TrafficTestApi/TrafficRequest -d '{"response_size_bytes": 10, "response_latency_ms": 0}' -n 100000 -c 64 --no-async 127.0.0.1:10080
I observed on my box peak QPS was ~7k and p50 latency was ~10ms
This is clearly latency bound: 64 threads x (1000 ms / ~10 ms per request) ≈ 6,400 QPS, which is just under 7k, i.e. QPS is limited by how many requests the tester can keep outstanding across 64 threads.
Now run ghz in open loop mode at whatever peak QPS you hit with closed loop (--rps 7000 --async switches):
./ghz --insecure --call traffic.api.TrafficTestApi/TrafficRequest -d '{"response_size_bytes": 10, "response_latency_ms": 0}' -n 100000 -c 64 --rps 7000 --async 127.0.0.1:10080
The latency will drop: e.g. I observed a p50 latency of ~2.5 ms open loop versus ~10 ms closed loop (in step 1) at identical QPS!
Similarly, if you run ghz in closed loop mode again but with a target QPS slightly below the peak QPS observed in step 1, the p50 latency falls back to ~2 ms.
You can also observe via packet capture that:
- The packets are less dense (fewer gRPC requests per H/2 packet) in the open loop case than in the closed loop case.
- The number of H/2 packets carrying gRPC requests sent by the client app is greater than the number of H/2 packets carrying gRPC requests received by the server app, implying Envoy is increasing density in transit. This is also obvious from eyeballing packet sizes in the two captures. (A rough frame-counting sketch follows below.)
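For reference, here is a rough sketch (not what I actually ran) of counting HTTP/2 HEADERS frames per TCP segment with gopacket, which approximates gRPC requests per packet for unary calls. Capture one of the plaintext hops (e.g. client app to client envoy on port 10080, or server envoy to server on port 31001; the envoy-to-envoy hop is TLS). It naively assumes frames do not straddle segment boundaries and does not special-case the connection preface, so treat the numbers as approximate.

package main

import (
	"encoding/binary"
	"fmt"
	"log"
	"os"

	"github.com/google/gopacket"
	"github.com/google/gopacket/layers"
	"github.com/google/gopacket/pcap"
)

func main() {
	handle, err := pcap.OpenOffline(os.Args[1]) // path to a .pcap of one plaintext hop
	if err != nil {
		log.Fatal(err)
	}
	defer handle.Close()

	totalFrames, packetsWithHeaders := 0, 0
	src := gopacket.NewPacketSource(handle, handle.LinkType())
	for pkt := range src.Packets() {
		tcp, ok := pkt.TransportLayer().(*layers.TCP)
		if !ok || len(tcp.Payload) == 0 {
			continue
		}
		// Walk H/2 frame headers (9 bytes: 24-bit length, type, flags, stream id)
		// and count HEADERS frames (type 0x1) in this segment.
		headersInPacket := 0
		payload := tcp.Payload
		for len(payload) >= 9 {
			length := int(binary.BigEndian.Uint32(append([]byte{0}, payload[0:3]...)))
			if payload[3] == 0x1 { // HEADERS frame
				headersInPacket++
			}
			if len(payload) < 9+length {
				break // frame continues in the next segment; stop here
			}
			payload = payload[9+length:]
		}
		if headersInPacket > 0 {
			packetsWithHeaders++
			totalFrames += headersInPacket
			fmt.Printf("packet at %v: %d HEADERS frames\n", pkt.Metadata().Timestamp, headersInPacket)
		}
	}
	if packetsWithHeaders > 0 {
		fmt.Printf("avg density: %.1f HEADERS frames per packet\n",
			float64(totalFrames)/float64(packetsWithHeaders))
	}
}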
[0] In "closed loop" load generators new gRPC requests are generated only on arrival of an earlier gRPC request's response. However N threads can do this independently making thread count an important attribute of closed loop load test.
In "open loop" load generators send new gRPC requests independent of completion of other gRPC requests.
I borrowed the definitions of open vs closed loop load generators from the abstract of this paper (you can ignore everything else in the paper for our purposes here):
https://www.usenix.org/legacy/events/nsdi06/tech/full_papers/schroeder/schroeder.pdf
[1] ghz uses the --no-async switch to indicate closed loop mode and the --async switch to indicate open loop mode
Thanks in advance!
Gaurav
One other question: for your benchmarking did you either use reuse_port or the exact balance connection balancer to make sure connections are spread evenly among workers?
On Sat, Jul 10, 2021 at 9:13 AM Matt Klein <mattkl...@gmail.com> wrote:
Very interesting, thanks for sharing. I agree there are tradeoffs of the work stealing scheduler that Go/Tokio/etc. use vs. the non-stealing "simple" scheduler that Envoy uses. In the Go case I would be curious though about latency outliers due to garbage collection? There must be some impact there? I know they have been continuously improving the GC and scheduler, but there is no free lunch when it comes to that kind of thing either.
Redis is a particularly interesting case for Envoy because the amount of work per op is tiny compared to something like HTTP. Unless you are using all of the other features of envoy (service discovery, etc. etc.) I think it would be tough for Envoy to beat a purpose built solution that does nothing but Redis.