What's current status of the crypto/des package (performance-wise)?

171 views
Skip to first unread message

mura

unread,
Aug 27, 2016, 12:43:23 PM8/27/16
to golang-nuts
Hi gophers,

We have some legacy systems which use 3DES cryptos extensively and we are attempting to re-write some parts of it in Go recently.
As an initial experiment, we found that the system overall is about 2x~3x slower than the original C version.
Although the culprit can not be completely singled out yet, the Go pprof tool has shown that about half of the execution time is spent on the DES operation (and the feistel function, specifically).

My question therefore are:

1. Is the algorithm adopted by crypto/des the most performant one among known DES algorithms? I am really not an expert in the area and have no idea whether there are certain 'advanced' algorithms that could make the C version more efficient. 

2. If the algorithm used by crypto/des is already optimal, could it be possible due to the quality of the generated code? (FWIW, the systems run on 133MHZ ARMv6 (arm1136 chips))

I have tried with the Go tip (with ARM SSA enabled), which shows an impressive improvement (11mins -> 7mins overall processing time).
Unfortunately, it's still not enough for replacing an existing system with 2x better performance.


agl

unread,
Aug 27, 2016, 3:53:46 PM8/27/16
to golang-nuts
On Saturday, August 27, 2016 at 9:43:23 AM UTC-7, mura wrote:
We have some legacy systems which use 3DES cryptos extensively and we are attempting to re-write some parts of it in Go recently.
As an initial experiment, we found that the system overall is about 2x~3x slower than the original C version.
Although the culprit can not be completely singled out yet, the Go pprof tool has shown that about half of the execution time is spent on the DES operation (and the feistel function, specifically).

It's likely that you were previously using OpenSSL's DES code if it was written in C. I don't think that OpenSSL has ARM assembly for DES, but it does have a lot of a macro-ridden, hand-unrolled code for implementing DES. It's not a different algorithm, but the Go code is written to be concise rather than to eek out performance.

I'm sure that the Go code could be made faster (at the cost of code size and clarity) but if 3DES is a critical hot spot for you, the quickest solution may be to wrap OpenSSL's DES function in cgo.


Cheers

AGL

mura

unread,
Aug 27, 2016, 9:47:45 PM8/27/16
to golang-nuts
Thank you for your information!

I will probably try to port the C version to Go and see if the C tricks can be applied to the Go code.
I am reluctant to work with cgo, to be honest.

BTW, the C implementation does not come from OpenSSL. I will check its licence and see if I can share publicly.

Sam Whited

unread,
Aug 28, 2016, 12:07:39 AM8/28/16
to mura, golang-nuts
On Sat, Aug 27, 2016 at 11:43 AM, mura <aruma...@gmail.com> wrote:
> Although the culprit can not be completely singled out yet, the Go pprof
> tool has shown that about half of the execution time is spent on the DES
> operation (and the feistel function, specifically).

Warning: mere speculation follows.

With long calculations involving many iterations (such as the
iterations in any block cipher, or, I suspect, the feistel function) I've had
performance issues in the past due to the cost of the scheduler
preempting goroutines. Limiting the number of concurrent workers
running DES calls may be beneficial.

With blowfish I've also had good results turning up the
maximum aggressiveness (specifically by adjusting the "hairiness"
metric) of the inliner; this makes the final binary bigger, but may
also prevent the sheduler from preempting a goroutine performing a DES
operation by inlining the function calls within the iterations. This
may apply to the feistel function as well.

Finally, I haven't tried it, but using runtime.LockOSThread
may have similar results, if my understanding of that function is
correct and it prevents the scheduler from preempting the goroutine
until it completes or the goroutine releases the underlying thread.
Importing the runtime package (to say nothing of building a custom
version of the compiler) should probably be saved as a last resort,
but it may be worth testing to see if it improves your performance at
all.

Chances are, you're better off just using a different implementation
that is optimized for performance.

—Sam


--
Sam Whited
pub 4096R/54083AE104EA7AD3

mura

unread,
Aug 28, 2016, 11:46:48 AM8/28/16
to golang-nuts, aruma...@gmail.com
Interesting. Thanks for sharing.

I used to confront with a similar scenario in the C world, where we had to use sched_setscheduler(2) to prevent the OS scheduler from interfering with our application, which requires a below ~50ms variation IIRC. Yeah, it happened even though the application was written in an unmanaged and non-GC language.

mura

unread,
Aug 28, 2016, 11:53:21 AM8/28/16
to golang-nuts
Here is a good news.

The crypto/des package has some low-hanging-fruit optimizations that are worth trying.

The results of the benchmark have being improved to an extent (I tested it on AMD64 though):

> Old

BenchmarkEncrypt-4       2000000               852 ns/op           9.38 MB/s

BenchmarkDecrypt-4       2000000               896 ns/op           8.92 MB/s


> New

BenchmarkEncrypt-4       3000000               536 ns/op          14.90 MB/s

BenchmarkDecrypt-4       3000000               539 ns/op          14.82 MB/s


Besides unrolling the loop in the feistel function and adding bounds-check hints, I made a change to the the arrangement of the sbox matrix to eliminate the redundant calculations of rows and columns in the feistel function, which fortunately does not make the code uglier. The idea is to move the cost from the hot spot to the run-once initialization function whenever possible.


Hopefully there are other places that can be speeded up without messing up the code.

Reply all
Reply to author
Forward
0 new messages