TLS very slow on ARM

362 views
Skip to first unread message

Norbert Roos

unread,
Nov 18, 2011, 12:47:13 PM11/18/11
to golan...@googlegroups.com
Hi,

I am experiencing very slow data transfers over TLS connections on an
ARM based Beagle board (1GHz). Here is what i'm doing:

The board receives unencrypted data from the network, base64 encodes it
and sends it through a TLS encrypted websocket connection back to the
network.

Data rates are at 20KB/s. Doing the same with unencrypted websocket
connections results in rates at 1MB/s. From time to time even the TLS
connections work correctly, giving data rates at 700KB/s.

When the 700KB/s rates are reached, the process takes about 67% CPU
load, the 20KB/s needs - you guessed it - 100%.

I measured the data rate by taking the current time with
time.Nanoseconds() right before the websocket.Write() and once again
right after it, so it shouldn't be interrupted by a parallel go routine[1].

It's not that the board is too weak for TLS encryption - as i said,
sometimes this works as expected, and scp a 50MB file always runs with
5MB/s.

Unfortunately, i can't show a simple example program; i modified the
websocket server and client example to send some large data amounts, and
this is always done with 580KB/s.

Now i don't have an idea any more how to trace this any further. I'm
pretty sure that it's not the rest of the program what causes this,
maybe some of you have an advice?

[1] It seems like Go does not like switching between go routines so far,
only when it's really necessary. But that's another theme..

bye
Norbert

roger peppe

unread,
Nov 18, 2011, 12:50:07 PM11/18/11
to Norbert Roos, golan...@googlegroups.com
have you tried running it with GOMAXPROCS > 1 ?

Norbert Roos

unread,
Nov 21, 2011, 8:20:19 AM11/21/11
to golan...@googlegroups.com
> have you tried running it with GOMAXPROCS> 1 ?

Uh, what is this magic? It really does seem to help. Interesting,
because the actual work is done in just one go routine, and even more
interesting, the Beagle board only has one processor...

Norbert

Bobby Powers

unread,
Nov 21, 2011, 8:47:31 AM11/21/11
to Norbert Roos, golan...@googlegroups.com

What kernel and distribution are you using? GOMAXPROCS tells the go
runtime to split goroutines across multiple threads. If this really
seems to help, it would seem to imply that the kernel is doing better
at scheduling goroutines than the go runtime.

Bobby

Dmitry Vyukov

unread,
Nov 21, 2011, 8:52:42 AM11/21/11
to Bobby Powers, Norbert Roos, golan...@googlegroups.com
 ... or that goroutines execute blocking operations (including page faults) behind runtime's back.

Norbert Roos

unread,
Nov 21, 2011, 9:22:25 AM11/21/11
to golan...@googlegroups.com
On 11/21/2011 02:47 PM, Bobby Powers wrote:

> What kernel and distribution are you using?

Debian 6.0.1, Kernel 2.6.39.1-x1

> GOMAXPROCS tells the go
> runtime to split goroutines across multiple threads.

The documentation says this is the maximum number of CPUs, not threads..

> If this really
> seems to help, it would seem to imply that the kernel is doing better
> at scheduling goroutines than the go runtime.

The strange thing in my case is that though there are some other go
routines, they have nothing to do and shouldn't be get scheduled at all.

Norbert

Bobby Powers

unread,
Nov 21, 2011, 9:25:00 AM11/21/11
to Dmitry Vyukov, Norbert Roos, golan...@googlegroups.com

Ah! yes, that makes more sense to me. (sorry for the double reply)

Norbert Roos

unread,
Nov 21, 2011, 9:35:15 AM11/21/11
to golan...@googlegroups.com
On 11/21/2011 02:52 PM, Dmitry Vyukov wrote:

> ... or that goroutines execute blocking operations (including page faults)
> behind runtime's back.

I disabled swapping (there is no swap space at all). And blocking
operations should delay the data transfer only from time to time, but i
see constant data transfer rates of either 670KB/s or 20KB/s - and no
switching between the rates untill the program gets restarted.

Norbert

roger peppe

unread,
Nov 21, 2011, 10:30:17 AM11/21/11
to Norbert Roos, golan...@googlegroups.com

I have seen this kind of behaviour when there are goroutines doing
a significant amount of computation running concurrently with goroutines
that do a significant amount of communication.

My assumption is that when GOMAXPROCS=1 the computing
goroutines get the lion's share of the cycles while allowing the
communicating goroutines to progress only very slowly.
I've seen this happen even when the computation goroutine
is communicating with the others.

I think this is because the Go scheduler is very simplistic.
By raising GOMAXPROCS above 1, you are allowing the OS
to time slice and let communicating goroutines progress even
when there's significant computational load.

I think this behaviour can be observed without anything happening
"behind runtime's back".

Dmitry Vyukov

unread,
Nov 21, 2011, 11:12:01 AM11/21/11
to roger peppe, Norbert Roos, golan...@googlegroups.com
Agree.
Theoretically it can solved in Go runtime (give a goroutine that is unblocked from IO temporal prio boost, and of course preempt long computations), however perhaps periodic runtime.Gosched's during computations can help again.

roger peppe

unread,
Nov 21, 2011, 11:27:14 AM11/21/11
to Dmitry Vyukov, Norbert Roos, golan...@googlegroups.com

is there something wrong with GOMAXPROCS > 1 ?

letting the OS do the time slicing seems like a reasonable
solution to me, assuming there aren't too many compute-bound
goroutines. perhaps it doesn't work in general - i've only
observed external behaviour in a limited number of cases.

Dmitry Vyukov

unread,
Nov 21, 2011, 11:34:45 AM11/21/11
to roger peppe, Norbert Roos, golan...@googlegroups.com
For native programs it is usually OK to create up 2-4 threads per processor, OS is perfectly able to handle such situations in a graceful way. I had not tried it with Go programs, though.

Norbert Roos

unread,
Nov 23, 2011, 9:37:28 AM11/23/11
to golan...@googlegroups.com
>> have you tried running it with GOMAXPROCS > 1 ?

> Uh, what is this magic? It really does seem to help.

Update: No, it didn't help. On the day of a presentation, it fell back
to the slow transfer again.

I'm afraid i have to trace it somewhere in the go sources...

Norbert

roger peppe

unread,
Nov 23, 2011, 9:50:29 AM11/23/11
to Norbert Roos, golan...@googlegroups.com
On 23 November 2011 14:37, Norbert Roos <nr...@webware-experts.de> wrote:
>>> have you tried running it with GOMAXPROCS > 1 ?
>
>> Uh, what is this magic? It really does seem to help.
>
> Update: No, it didn't help. On the day of a presentation, it fell back to
> the slow transfer again.

how annoying.
what value of GOMAXPROCS were you using, out of interest?

Bobby Powers

unread,
Nov 23, 2011, 10:29:44 AM11/23/11
to Norbert Roos, golan...@googlegroups.com
On Wed, Nov 23, 2011 at 3:37 PM, Norbert Roos <nr...@webware-experts.de> wrote:
>>> have you tried running it with GOMAXPROCS > 1 ?
>
>> Uh, what is this magic? It really does seem to help.
>
> Update: No, it didn't help. On the day of a presentation, it fell back to
> the slow transfer again.

Any chance you could post your code, or a testcase that exhibits the
same behavior?

Norbert Roos

unread,
Nov 23, 2011, 10:35:04 AM11/23/11
to golan...@googlegroups.com
On 11/23/2011 03:50 PM, roger peppe wrote:

> what value of GOMAXPROCS were you using, out of interest?

First i took 4 (absolutely randomly), and now i tried it with 2 again..

Norbert

Norbert Roos

unread,
Nov 23, 2011, 10:38:26 AM11/23/11
to golan...@googlegroups.com
On 11/23/2011 04:29 PM, Bobby Powers wrote:

> Any chance you could post your code, or a testcase that exhibits the
> same behavior?

The original code is too large, and the simple testcase which i created
didn't show this behaviour, unfortunately.

Well, i'm going to do some research and will post any interesting
results here!

Norbert

Norbert Roos

unread,
Nov 24, 2011, 8:11:49 AM11/24/11
to golan...@googlegroups.com
On 11/18/2011 06:47 PM, Norbert Roos wrote:

> I am experiencing very slow data transfers over TLS connections on an
> ARM based Beagle board (1GHz). Here is what i'm doing:

I found the reasons; actually there are two:

The first is that the TLS connection uses different encryptions from
time to time. I don't know why this is the case, as the TLS handshake
negotiates TLS_ECDHE_RSA_WITH_RC4_128_SHA in both cases.

The high data rates are running with an XOR encryption as a Stream. The
slow data rates are running with 3DES in BlockMode (see file
crypto/tls/conn.go in func (hc *halfConn) encrypt(b *block)).

The second - and main - reason is that the 3DES encryption is running
very slow, with about 20KB/s. It seems that the implementation of (3)DES
is ineffecient, as other programs are much faster.

Here are the test results when encrypting 10MB:

System program:

# dd if=/dev/zero of=unenc bs=1000 count=10000
# time mcrypt -k 123456789012345678901234 -a tripledes unenc
Warning: It is insecure to specify keywords in the command line
File unenc was encrypted.

real 0m6.044s
user 0m5.820s
sys 0m0.211s

Go program:

# time ./des

real 7m39.010s
user 7m38.922s
sys 0m0.008s

(similar results on the PC: 0m1.389s vs. 0m37.987s)

Source code of "des":

package main

import (
"crypto/des"
"fmt"
)

func main() {

key := []byte("123456789012345678901234")
c, e := des.NewTripleDESCipher(key)

if e != nil {
fmt.Printf("%s\n", e.String())
return
}

in := make([]byte, c.BlockSize())
out := make([]byte, c.BlockSize())
cnt := 10000000 / c.BlockSize()

for i := 0; i < cnt; i++ {
c.Encrypt(out, in)
}
}


It would be nice if the DES encryption could be reworked to become
usable on ARM, too..

Norbert

Dave Cheney

unread,
Nov 24, 2011, 8:57:49 AM11/24/11
to Norbert Roos, golan...@googlegroups.com
Untested theory, TLS handshaking may be relying on the order of a map that stores names to ciphers. This behavior was previously deterministic, sort of, as long as you didn't mutate the map, but has recently been changed to always be non deterministic.

Like I say, just a theory.

Cheers

Dave

Sent from my iPad

Brad Fitzpatrick

unread,
Nov 24, 2011, 11:42:54 AM11/24/11
to Dave Cheney, Norbert Roos, golan...@googlegroups.com
On Thu, Nov 24, 2011 at 5:57 AM, Dave Cheney <da...@cheney.net> wrote:
Untested theory, TLS handshaking may be relying on the order of a map that stores names to ciphers. This behavior was previously deterministic, sort of, as long as you didn't mutate the map, but has recently been changed to always be non deterministic.

Like I say, just a theory.

Looks like a good theory.

var cipherSuites = map[uint16]*cipherSuite{
        TLS_RSA_WITH_RC4_128_SHA:            &cipherSuite{16, 20, 0, rsaKA, false, cipherRC4, macSHA1},
        TLS_RSA_WITH_3DES_EDE_CBC_SHA:       &cipherSuite{24, 20, 8, rsaKA, false, cipher3DES, macSHA1},
        TLS_RSA_WITH_AES_128_CBC_SHA:        &cipherSuite{16, 20, 16, rsaKA, false, cipherAES, macSHA1},
        TLS_ECDHE_RSA_WITH_RC4_128_SHA:      &cipherSuite{16, 20, 0, ecdheRSAKA, true, cipherRC4, macSHA1},
        TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA: &cipherSuite{24, 20, 8, ecdheRSAKA, true, cipher3DES, macSHA1},
        TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA:  &cipherSuite{16, 20, 16, ecdheRSAKA, true, cipherAES, macSHA1},
}

func initDefaultCipherSuites() {
        varDefaultCipherSuites = make([]uint16, len(cipherSuites))
i := 0
        for id := range cipherSuites {
       varDefaultCipherSuites[i] = id
                i++
}
}
 

Norbert Roos

unread,
Nov 24, 2011, 12:25:23 PM11/24/11
to golan...@googlegroups.com
> Looks like a good theory.

Hm, but still strange that the server selects a RC4 cipher (according to
wireshark in the ServerHello message), and the actual encryption is done
with DES. But it does work somehow. Even though too slow.

Norbert

agl

unread,
Nov 28, 2011, 3:39:24 PM11/28/11
to golang-nuts
On Nov 24, 11:42 am, Brad Fitzpatrick <bradf...@golang.org> wrote:
> Looks like a good theory.

I've fixed this in http://code.google.com/p/go/source/detail?r=532f5e24c3c1

However, this would only affect the TLS client. The TLS server accepts
the client's preferences in cipher suite selection.


Cheers

AGL

Reply all
Reply to author
Forward
0 new messages