TLS termination reverse proxy and performance issues


Lorenzo Villani

unread,
Sep 18, 2014, 1:14:30 PM
to golan...@googlegroups.com
Hi there,

We are trying to build a TLS termination reverse proxy and load balancer in Go, while simultaneously
benchmarking it against Nginx 1.6.1.

Our first test was to configure both Nginx and our reverse proxy as a plain, simple HTTP reverse
proxy (no load balancing yet), with the Go version being a three-line source file that used
httputil.NewSingleHostReverseProxy(). Both Nginx and the Go version performed about the same, as
expected.

Then, we tried to add SSL/TLS to the mix and we observed a significant drop in performance. We
configured both proxies to accept connections only over TLS 1.0, with different cipher suites
including RC4-SHA, using sslyze [1] to ensure that both servers were configured the same way.

Nginx always ran with a single worker process, while we varied GOMAXPROCS between 1, 8, and the
default value.

The test server has a 24-core (real + hyper-threading) Intel Xeon CPU with plenty of RAM. The test
"backend server" that sits behind Nginx and the Go proxy is a 5-line application which simply
replies with '42'.

We ran tests with both 'ab' and blitz.io, which consistently reported a 1.8x to ~3.0x performance
drop for the Go version compared to Nginx. We think this result is probably due to Go's TLS stack,
since with plain HTTP both Nginx and Go performed nearly the same, but we'd like confirmation from
the Go development team.

We tried both Go 1.3 and the current development version from Hg (revision b18ebcb9f236) with
little to no difference (the development tip gives slightly worse, though insignificant, results
compared to 1.3).

Any idea on how we could improve the situation?

Thanks in advance


[1]: https://github.com/iSECPartners/sslyze


Lorenzo Villani

unread,
Sep 19, 2014, 6:53:22 AM
to golan...@googlegroups.com
The profiler shows:

(pprof) top25 -cum
Total: 4466 samples
       0   0.0%   0.0%     4261  95.4% runtime.gosched0
       0   0.0%   0.0%     4203  94.1% net/http.(*conn).serve
       0   0.0%   0.0%     4154  93.0% crypto/tls.(*Conn).Handshake
       0   0.0%   0.0%     4153  93.0% crypto/tls.(*Conn).serverHandshake
       0   0.0%   0.0%     4107  92.0% crypto/tls.(*serverHandshakeState).doFullHandshake
       0   0.0%   0.0%     3359  75.2% crypto/tls.(*ecdheKeyAgreement).generateServerKeyExchange
       0   0.0%   0.0%     3087  69.1% crypto/rsa.SignPKCS1v15
       0   0.0%   0.0%     3086  69.1% crypto/rsa.decrypt
       0   0.0%   0.0%     2649  59.3% math/big.nat.expNN
       0   0.0%   0.0%     2630  58.9% math/big.(*Int).Exp
      12   0.3%   0.3%     2604  58.3% math/big.nat.expNNWindowed
      10   0.2%   0.5%     2215  49.6% math/big.nat.div
     496  11.1%  11.6%     2193  49.1% math/big.nat.divLarge
      32   0.7%  12.3%      735  16.5% math/big.nat.mul
     693  15.5%  27.8%      714  16.0% crypto/elliptic.p256ReduceDegree
       0   0.0%  27.8%      704  15.8% crypto/tls.(*ecdheKeyAgreement).processClientKeyExchange
       0   0.0%  27.8%      702  15.7% crypto/elliptic.p256Curve.ScalarMult
     132   3.0%  30.8%      653  14.6% math/big.basicMul
       1   0.0%  30.8%      642  14.4% crypto/elliptic.p256ScalarMult
     544  12.2%  43.0%      544  12.2% math/big.divWW
     486  10.9%  53.9%      486  10.9% math/big.addMulVVW
       3   0.1%  53.9%      467  10.5% crypto/elliptic.p256PointDouble
     447  10.0%  63.9%      447  10.0% math/big.mulAddVWW
      19   0.4%  64.4%      443   9.9% crypto/elliptic.p256Square
       0   0.0%  64.4%      440   9.9% crypto/rsa.modInverse

James Bardin

unread,
Sep 19, 2014, 10:32:07 AM
to golan...@googlegroups.com


On Thursday, September 18, 2014 1:14:30 PM UTC-4, Lorenzo Villani wrote:

We ran tests with both 'ab' and blitz.io, which consistently reported a 1.8x to ~3.0x performance
drop for the Go version compared to Nginx. We think this result is probably due to Go's TLS stack,
since with plain HTTP both Nginx and Go performed nearly the same, but we'd like confirmation from
the Go development team.


I think that's about what's expected right now. It's partly that OpenSSL has had far more development time for optimization than crypto/tls has, and partly that Go still tends to be slower than C/C++. In my experience, a 1.5-3.0x difference between C and Go is common.

+agl may be able to provide more info, and say whether there are more optimizations that can be done outside of the compiler itself, but I don't think this is a surprising result.

Brad Fitzpatrick

unread,
Sep 19, 2014, 11:24:25 AM
to Lorenzo Villani, golang-nuts
Don't use 'ab' for benchmarking. It's terrible and wastes people's time debugging rather than providing useful insights.

Are you trying to measure connection setup time or HTTP request throughput (using keep-alive connections)?



martin....@shopify.com

unread,
Sep 19, 2014, 11:30:53 AM
to golan...@googlegroups.com
I'd qualify that a bit: I'd be surprised if crypto/tls itself were to blame. Given network latencies and the like, it should be fine to implement a protocol in most higher-level languages (and Go can fly pretty low on that scale). It's the cryptographic primitives that cost the most, and the profile shows it clearly: 70% of the time is spent producing RSA signatures. That's undoubtedly where the OpenSSL libraries are heavily optimized. I believe I've confirmed this with my own experiments (see https://github.com/mkobetic/btls#status); e.g., for BenchmarkRW_AES_256_CBC_SHA256_TLS12 I get an almost 8x speedup by using OpenSSL's libcrypto over the native Go crypto algorithms. Granted, this benchmark does not involve public-key algorithms, but I'm fairly confident the situation there would be similar.

So, to answer the original question: if you want to speed up the Go implementation, you need to optimize the relevant cryptographic primitives. You don't even need to optimize all of them, just the ones that matter to you. The downside is that you need to know how to do this right; doing it wrong could easily mean serious security compromises, so it's best to reuse a trusted, proven code base.

Note also that you need to be careful about what you're measuring in your load tests. It seems that your load is currently heavily skewed towards TLS handshakes. Is that what your real traffic will look like? If so, there are other mechanisms that can help offload some of that cost, e.g. TLS session resumption. On the other hand, if your real traffic will be mostly long-lived connections transferring a lot of data, the TLS handshake cost can be largely amortized over the connection. So make sure you're measuring the right things.

o...@getlantern.org

unread,
Sep 21, 2014, 9:10:28 AM
to golan...@googlegroups.com
It looks like your work is being dominated by handshaking.  Given that the actual data returned by the server is minuscule, and I suspect that the test client isn't using TLS session resumption, this is not surprising.

I would be curious how things look if you:

1. Return a much larger amount of data from the server (ideally something that's realistic given the sort of traffic you're looking to proxy in real life)

2. Enable TLS session resumption on the client (I don't know if ab can even do that)

3. Enable HTTP keep-alives on the client (which will reduce the amount of TLS handshaking even if session resumption is not enabled)

Also, if you're not doing so already, I would be curious how things look when testing over a real network.

Cheers,
Ox