High CPU Usage for Double-L3/L4 Proxy


dtehr...@twilio.com

Aug 8, 2017, 2:50:34 PM
to envoy-users
Hi there,

I'm working on some basic performance profiling of Envoy. So far everything looks good, except for one test case: Envoy-to-Envoy L3/L4 proxy. My setup is fairly basic, so I'm wondering if I'm doing something that is unsupported.

The setup of the tests:
  * Using `iperf3` to test two hosts (iperf3 client & server) in the same AZ in AWS. 
  * Single-Envoy: `iperf-client -> Local Envoy -----> iperf-server` yields ~line speed w/ 10% CPU usage on the Local Envoy instance (good).
  * Double-Envoy: `iperf-client -> Local Envoy -----> Remote Envoy -> iperf-server` yields ~line speed w/ 10% CPU usage on the Local Envoy instance (good), but 40-50% CPU usage on the Remote Envoy.


I'm able to reproduce with a configuration for the Remote Envoy as simple as this:
```
{
    "listeners": [{
        "name": "iperf-listener",
        "address": "tcp://0.0.0.0:15201",
        "filters": [{
            "type": "read",
            "name": "tcp_proxy",
            "config": {
                "stat_prefix": "iperf3-listener",
                "route_config": {"routes": [{"cluster": "iperf"}]}
            }
        }]
    }],
    "admin": "...",
    "cluster_manager": {
        "clusters": [{
            "name": "iperf",
            "type": "static",
            "connect_timeout_ms": 250,
            "lb_type": "round_robin",
            "hosts": [{"url": "tcp://127.0.0.1:5201"}]
        }]
    },
    "statsd_udp_ip_address": "127.0.0.1:8126"
}
```

My intention in that configuration is to route traffic from the public :15201 to the local :5201, which the iperf-server process is listening on.

What do you think? Is this a valid use case? If so, how could I provide additional info for debugging?

Thanks,
Dan

Harvey Tuch

Aug 8, 2017, 3:11:23 PM
to dtehr...@twilio.com, envoy-users
Do you have a flame graph or perf profile to look at? (.. and Hi Dan, I remember you from VMware.. )

--
You received this message because you are subscribed to the Google Groups "envoy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to envoy-users...@googlegroups.com.
To post to this group, send email to envoy...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/envoy-users/0b378cb8-02a3-4fe0-92b2-7dd4ed3b93f8%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Matt Klein

Aug 8, 2017, 5:03:05 PM
to Harvey Tuch, dtehr...@twilio.com, envoy-users
Yeah, nothing comes to mind. Seems like CPU usage should be about the same for both instances of Envoy. Per Harvey, a perf profile would be great.

--
Matt Klein
Software Engineer
mkl...@lyft.com

dtehr...@twilio.com

Aug 8, 2017, 5:54:51 PM
to envoy-users, dtehr...@twilio.com
Harvey Tuch!! Great to see a familiar face on here! Are you still in Cambridge, MA or did you move out west too?

re: flame graph or perf profile - I've attached:
   * An SVG flame graph
   * The source for that flame-graph ("envoy-perf-script.txt")
   * An "envoy.prof" file that I generated by toggling "/cpuprofiler?enable". Hope I generated this correctly.

Thanks,
Dan
envoy-perf-script.svg
envoy-perf-script.txt
envoy.prof

Harvey Tuch

Aug 8, 2017, 6:26:22 PM
to dtehr...@twilio.com, envoy-users
Still in Cambridge :) Thanks for the profiles.

Looking at the flame graph, other than the call stack whackness, it seems that >70% of the CPU time is spent in writev to achieve a TCP send. Matt has hypothesized this might be flow-control related; can you post the /stats from the admin endpoint as well?

Some other questions and things to consider:
* Is the iperf workload symmetrical? I.e. is it sending as much as it is receiving?
* What happens when you do `iperf-client  -----> Remote Envoy -> iperf-server` ?
* Does varying the per_connection_buffer_limit_bytes in the Listener or Cluster definition make a difference?
* What is line rate here in gbps?
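
(For reference, `per_connection_buffer_limit_bytes` sits at the top level of a listener or cluster in the v1 config. A sketch against the config posted earlier in the thread; the 1 MB value is illustrative only:)
```
{
    "listeners": [{
        "name": "iperf-listener",
        "address": "tcp://0.0.0.0:15201",
        "per_connection_buffer_limit_bytes": 1048576,
        "filters": [...]
    }],
    "cluster_manager": {
        "clusters": [{
            "name": "iperf",
            "per_connection_buffer_limit_bytes": 1048576,
            ...
        }]
    }
}
```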

Cheers,
Harvey

dtehr...@twilio.com

Aug 8, 2017, 7:34:51 PM
to envoy-users, dtehr...@twilio.com
Hi Harvey,

Output from the "/stats" admin endpoint during an active test is attached as "stats.txt".

re: Your questions:

> * Is the iperf workload symmetrical? I.e. is it sending as much as it is receiving?
Asymmetrical. The client is doing writes and the server is receiving.

> * What happens when you do `iperf-client  -----> Remote Envoy -> iperf-server` ?
Good idea. I just tried, and see the same behavior on the destination server: 40-50% CPU utilization observed.

> * Does varying the per_connection_buffer_limit_bytes in the Listener or Cluster definition make a difference?
Yup, I tried bumping these to 16MB earlier today, before I sent my original message. Wanted to try the obvious config options before asking.

> * What is line rate here in gbps?
Line rate is 1 Gbit/sec. I'm using AWS c3.2xlarge instances.


Thanks,
Dan
stats.txt

Matt Klein

Aug 8, 2017, 7:52:50 PM
to dtehr...@twilio.com, envoy-users
I suspect you are probably hitting this comment but I'm not sure:

I haven't looked at the libevent code in a while, but AFAIK on the write side, it will set up a very large vectored write. On the read side, I think we are artificially limiting the input data being read in. At the kind of line rate you are talking about, I suspect this code will need to get fixed to read in much larger vectored blocks.

Matt Klein

Aug 8, 2017, 7:54:21 PM
to dtehr...@twilio.com, envoy-users
P.S. It's on my list (and there is a GH issue) to get rid of the libevent evbuffer code entirely, but no one has worked on that yet. There are a bunch of perf improvements possible there.

Matt Klein

Aug 8, 2017, 7:58:06 PM
to dtehr...@twilio.com, envoy-users
Actually, on second thought, I'm not sure that makes sense, as we need to read on the egress side too and that seems to work OK. Can you potentially get a flame graph on the 10% side also so we can compare? (The code I referenced does need to be fixed, but that is tangential.)

Matt Klein

Aug 8, 2017, 8:36:36 PM
to dtehr...@twilio.com, envoy-users
Other random thoughts:
- Still possible the read size is somehow implicated, since on the egress side the read is loopback, while on the ingress side it's TCP.
- We universally disable Nagle: https://github.com/lyft/envoy/blob/master/source/common/filter/tcp_proxy.cc#L227 (I would try commenting out that line). It seems like this should affect both Envoys, so it's unclear to me why it would impact ingress more, but it's possibly related to the data chunk sizes that are coming in.

So it's possible all of this is related in some way. TBH I have not run any max-throughput tests like this before. Harvey has some ideas on additional stats to add, and like I said before, I would love to see a flame graph of the egress side also to compare.

Harvey Tuch

Aug 8, 2017, 8:40:36 PM
to Matt Klein, dtehr...@twilio.com, envoy-users
Yeah, I think obtaining from either Envoy or perf an idea of the number of read and write syscalls made over the run would be useful. That way you could divide the data transferred (as obtained from either iperf or the /stats endpoint) by this to get an idea of the average readv/writev size. This can be done in Envoy via its standard stats mechanism, or in perf with some incantation from http://www.brendangregg.com/perf.html. I'm currently suspicious we're doing way more writevs than we need to be, but I can't confirm this from the data we currently have.

dtehr...@twilio.com

Aug 9, 2017, 1:28:00 PM
to envoy-users, mkl...@lyft.com, dtehr...@twilio.com
Hi Harvey & Matt,

Attached please find two files:
   * ingress_syscalls.txt - Taken on the ingress host w/high CPU usage, via `perf stat -e 'syscalls:sys_enter_*' -p <PID> -- sleep 30`. Shows a count of syscalls, as Harvey requested.
   * egress.tar - Flame graph SVG and supporting data files from the egress-Envoy host, for comparison against the previously posted flame graph from the ingress side. "admin.prof" file from Envoy's cpuprofiler also included.

Let me know if there's anything else I can provide. Do you want me to try with "upstream_connection_->noDelay(false)" ?

Thanks,
Dan
ingress_syscalls.txt
egress.tar

Matt Klein

Aug 9, 2017, 3:51:04 PM
to dtehr...@twilio.com, envoy-users
> Do you want me to try with "upstream_connection_->noDelay(false)" ?

If you could just comment out that line and try, that would be great. I'm pretty certain the issue here is a combination of sub-optimal read code along with disabling Nagle.

Dan Tehranian

Aug 9, 2017, 5:13:40 PM
to Matt Klein, envoy-users
Hi Matt,

I've re-tested with "upstream_connection_->noDelay(false)" commented out. CPU usage on the ingress node did drop, down to 30% of 1 CPU.

Dan

Harvey Tuch

Aug 9, 2017, 5:18:01 PM
to Dan Tehranian, Matt Klein, envoy-users
Interesting! I think we may want to disable Nagle in these perf experiments on the downstream connection as well, since this is where we're seeing a lot of readv calls relative to writev: https://github.com/lyft/envoy/blob/master/source/server/connection_handler_impl.cc#L153. Does that improve things further if you try it?




Harvey Tuch

Aug 9, 2017, 5:18:43 PM
to Dan Tehranian, Matt Klein, envoy-users
(and by disable Nagle, I mean re-enable Nagle by setting noDelay(false) ;))

Matt Klein

Aug 9, 2017, 5:48:15 PM
to Harvey Tuch, Dan Tehranian, envoy-users
IIRC on the remote side, that's only for response data, which I think should basically not be happening since it's a write-only test? Could definitely try that, but I don't think it will help.

I'm pretty sure at this point the issue is stupid read handling code reading chunks that are too small, which is causing more writev calls on the other side even before Nagle comes into play.

Dan Tehranian

Aug 9, 2017, 6:28:39 PM
to Matt Klein, Harvey Tuch, envoy-users
Re-tested after commenting out https://github.com/lyft/envoy/blob/master/source/server/connection_handler_impl.cc#L153, per Harvey. Unfortunately, this didn't improve the CPU utilization in any noticeable way.

Dan

Matt Klein

Aug 9, 2017, 6:32:06 PM
to Dan Tehranian, Harvey Tuch, envoy-users
If you feel like hacking, here is the libevent code in question:

Ultimately, as I said before, I want to get rid of using evbuffer entirely. However, I think if you hack it to get rid of the ioctl and potentially allow reading in 128K chunks, I'm guessing that in the mega-throughput case with Nagle enabled you will do much better.


Dan Tehranian

Aug 9, 2017, 7:08:58 PM
to Matt Klein, Harvey Tuch, envoy-users
Unfortunately, I've not touched C code in ~20 years. I'm probably not the right person for hacking on libevent code :)

Matt Klein

Aug 9, 2017, 7:10:03 PM
to Dan Tehranian, Harvey Tuch, envoy-users
Fair enough. I'm confident someone will get to tuning this type of workload at some point, but I can't give any timeframe on that. Sorry it's problematic right now.

Request: Can you open a GH issue with the details so we can track? Thank you.

Dan Tehranian

Aug 9, 2017, 7:30:51 PM
to Matt Klein, Harvey Tuch, envoy-users
Sure thing. Filed https://github.com/lyft/envoy/issues/1424

Thank you,
Dan