Hey everyone. I'm having difficulty understanding the metrics reported by Envoy (1.28) in a specific scenario. Envoy is deployed standalone, acting as a load balancer between a downstream NodeJS service and an upstream Kotlin API, both speaking gRPC. The path between these two services is highly latency-sensitive.
My current issue is an observed gap between upstream and downstream times when running integration tests with real clients, a gap which I do not observe when running (much heavier) load tests with e.g. Gatling injectors.
The integration test scenario is as follows: 387 downstream pods connect to a pool of 8 Envoy pods which, in turn, connect to a pool of 43 upstream API pods. During the test, Envoy handles around 10K TPS. The attached config.yml is the configuration Envoy used during the test.
TPS progression over the duration of the test:
Overview of the size of responses our upstream API returns:
The following image shows the gap between upstream and downstream P99 times during that test. From the start of the test it hovers around 10ms; when TPS increases (14:35), the gap widens along with the increase in latency from the upstream API, and then levels out at a higher value than the initially observed gap. This tells me that the gap has a tendency to increase with load.
The following image shows the P95 of the same data. The initial gap is around ~2ms, which is what I expected to see at P99, and it confirms my suspicion that the size of the gap scales with load intensity. However, the difference between P99 and P95 also leads me to believe that this isn't affecting a large percentage of requests.
It is important to mention that during load testing, at upwards of 50K TPS, I always observed a P99 of around 2ms of Envoy "overhead" (the difference between downstream and upstream times), which is what I was expecting to see in this scenario too.
During this scenario, Envoy's "envoy_cluster_upstream_cx_rx_bytes_buffered" metric reports a massive increase in bytes buffered from the upstream connections, which leads me to believe that a significant amount of data received from the upstream API is not being immediately offloaded to the downstream. To me, that seems to explain the gap between the upstream and downstream times.
I am failing to understand this pattern. It seems to me that the bottleneck is at the connection to the downstream client, as if Envoy is failing to send all of the data received from the upstream to the downstream in time.
The conditions of my previous load tests, even at far higher TPS, were similar to this scenario; the only difference is the downstream clients (real clients here instead of Gatling injectors). This further confuses me, but leads me to settle on two suspicions:
- Either I have misconfigured Envoy, and there is some buffer-size or window-size setting that should be higher (see the sketch just below this list);
- Or the consumer is slow and is applying some form of backpressure that slows down Envoy's ability to offload the response data, though I am unsure how to prove this.
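To make the first suspicion concrete, these are the knobs I have been looking at. This is a minimal sketch using Envoy v3 API field names, not my actual config.yml; the listener/cluster names are placeholders and the values are just the ones I would experiment with:

```yaml
static_resources:
  listeners:
  - name: grpc_ingress                        # placeholder name
    # Soft cap on bytes buffered per downstream connection (I believe 1 MiB is the default).
    per_connection_buffer_limit_bytes: 1048576
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_grpc           # placeholder
          # HTTP/2 flow-control windows advertised on the downstream (NodeJS) side.
          http2_protocol_options:
            initial_stream_window_size: 268435456      # 256 MiB, which I believe is the documented default
            initial_connection_window_size: 268435456
          # route_config and http_filters as in my existing config (elided)
  clusters:
  - name: upstream_api                        # placeholder name
    # Same soft cap on the upstream side; this should bound how far
    # upstream_cx_rx_bytes_buffered can grow per connection.
    per_connection_buffer_limit_bytes: 1048576
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          # HTTP/2 flow-control windows advertised towards the upstream (Kotlin) API.
          http2_protocol_options:
            initial_stream_window_size: 268435456
            initial_connection_window_size: 268435456
    # type / lb_policy / endpoints as in my existing config (elided)
```

As I understand it, per_connection_buffer_limit_bytes bounds how much Envoy buffers per connection on each side, and the cluster-side window sizes control how much response data the upstream may send before flow control pushes back, but I may well be misreading how these interact, which is part of why I'm asking.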
Some other data that could be useful: during the test, Envoy's concurrency was set to 16. Connections between the downstream clients and the Envoy pool were not evenly balanced; some Envoy pods had around 30 connections while others had up to 60 (these connections were established at the beginning of the test and never closed). Connections between the LB pods and the upstream API pods were fixed at 16 per upstream pod (presumably one per worker thread, given the concurrency of 16), meaning that each Envoy instance established 16 connections to each API upstream, for a total of 16 * 8 (Envoy instances) * 43 (API pods) = 5504 connections.
Each Envoy instance featured 4 CPUs and 1 GB of memory, and no instance went above 35% CPU during the test.
Given this information, I'd like to ask whether there's something I may be doing wrong in Envoy's configuration, whether there are any tips you could provide to improve performance in this scenario, or at least some insight into why I'm seeing this gap between downstream and upstream times.
I would also like to enquire about Envoy's memory usage. Our metrics report that no Envoy instance went over 100 MB of memory usage, which I find very hard to believe. In fact, I have not seen Envoy use much memory at all, which is making me question that metric.
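To cross-check that number, I'm planning to compare the container metric against Envoy's own view via the admin interface (/memory, plus the server.memory_allocated / server.memory_heap_size stats). A minimal sketch of the admin block I'd add for that, with the bind address and port as placeholders:

```yaml
admin:
  # Local-only admin listener, used to read /memory and /stats
  # and compare them against the ~100 MB reported by container metrics.
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 9901
```

If server.memory_allocated also stays around 100 MB under this load I'll accept the metric; otherwise I'd like to understand what our container metric is actually measuring.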
Thank you in advance for your attention; any help is welcome.