Gerrit Bot has uploaded this change for review.
crypto/cipher: add optimized assembly xorBytes for ARM (NEON + non-NEON)
xorBytes is used extensively in crypto and can sometimes be the
bottleneck (e.g. in CTR mode). 32-bit ARM is still the most common
architecture for IoT devices, and since it is a comparatively slow
platform, crypto optimization matters there. Unfortunately,
crypto/cipher has no optimized xorBytes for ARM.
This PR adds optimized assembly xorBytes implementations for both
NEON and non-NEON ARM, which can yield very large performance gains
(up to roughly 14x on large buffers in the benchmarks below).
Benchmarks comparing crypto/cipher xorBytes at current tip with the
results after applying this PR, on ARMv7 with NEON:
```
name                   old time/op    new time/op    delta
XORBytes/8Bytes-4         127ns ± 0%      95ns ± 0%    -25.14%
XORBytes/128Bytes-4      1.21µs ± 1%    0.16µs ± 0%    -86.70%
XORBytes/2048Bytes-4     18.4µs ± 0%     1.3µs ± 0%    -93.17%
XORBytes/32768Bytes-4     304µs ± 1%      31µs ± 1%    -89.93%

name                   old speed      new speed       delta
XORBytes/8Bytes-4      63.2MB/s ± 0%  84.4MB/s ± 0%    +33.59%
XORBytes/128Bytes-4     106MB/s ± 1%   797MB/s ± 0%   +651.92%
XORBytes/2048Bytes-4    111MB/s ± 0%  1627MB/s ± 0%  +1364.86%
XORBytes/32768Bytes-4   108MB/s ± 1%  1071MB/s ± 1%   +892.65%
```
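For reference, the rows above come from size sub-benchmarks of roughly
the shape sketched below (an illustrative reconstruction, assumed to sit
inside package cipher so it can call the unexported xorBytes; the real
test file may differ). MB/s comes from b.SetBytes, and before/after runs
are compared with benchstat.
```
package cipher

import (
	"fmt"
	"testing"
)

// Illustrative benchmark shape for the XORBytes/<size> rows above.
func BenchmarkXORBytes(b *testing.B) {
	dst := make([]byte, 1<<15)
	data0 := make([]byte, 1<<15)
	data1 := make([]byte, 1<<15)
	for _, size := range []int{8, 128, 2048, 32768} {
		b.Run(fmt.Sprintf("%dBytes", size), func(b *testing.B) {
			b.SetBytes(int64(size))
			for i := 0; i < b.N; i++ {
				xorBytes(dst[:size], data0[:size], data1[:size])
			}
		})
	}
}
```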
Additionally, this PR improves xor_generic.go to use word-at-a-time
XOR on all arches when the dst and src buffers are all word-aligned,
which in real-world usage is often the case. Previously the word
loop was used only on platforms that support unaligned access, which
was unnecessarily strict and left performance on the table.
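For context, the word loop in question (fastXORBytes) works roughly as
in the sketch below; with the alignment check it becomes safe to use on
any architecture. This is an illustrative reconstruction, not the exact
stdlib code:
```
package xorsketch

import "unsafe"

const wordSize = int(unsafe.Sizeof(uintptr(0)))

// aligned reports whether p is on a machine-word boundary.
func aligned(p *byte) bool {
	return uintptr(unsafe.Pointer(p))%uintptr(wordSize) == 0
}

// xorAlignedWords XORs a and b into dst one uintptr-sized word at a
// time, then finishes the tail byte by byte. Callers must ensure dst,
// a and b all start word-aligned and have at least n bytes.
func xorAlignedWords(dst, a, b []byte, n int) {
	w := n / wordSize
	if w > 0 {
		dw := *(*[]uintptr)(unsafe.Pointer(&dst))
		aw := *(*[]uintptr)(unsafe.Pointer(&a))
		bw := *(*[]uintptr)(unsafe.Pointer(&b))
		for i := 0; i < w; i++ {
			dw[i] = aw[i] ^ bw[i]
		}
	}
	for i := n - n%wordSize; i < n; i++ {
		dst[i] = a[i] ^ b[i]
	}
}
```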
Fixes #53023
Change-Id: Ib16b2b6ee0f648cb9ca15539ce4998efc604eb6a
GitHub-Last-Rev: d5e9cb24b70cd9699bd43ff75a1b990177d94656
GitHub-Pull-Request: golang/go#53154
---
A src/crypto/cipher/xor_arm.go
A src/crypto/cipher/xor_arm.s
M src/crypto/cipher/xor_generic.go
3 files changed, 226 insertions(+), 9 deletions(-)
diff --git a/src/crypto/cipher/xor_arm.go b/src/crypto/cipher/xor_arm.go
new file mode 100644
index 0000000..b24f178
--- /dev/null
+++ b/src/crypto/cipher/xor_arm.go
@@ -0,0 +1,58 @@
+// Copyright 2022 The Go Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+package cipher
+
+import (
+ "internal/cpu"
+ "unsafe"
+)
+
+const wordSize = int(unsafe.Sizeof(uintptr(0)))
+
+var hasNEON = cpu.HWCap&(1<<12) != 0 // bit 12 of the Linux AT_HWCAP auxv word is HWCAP_NEON
+
+func isAligned(a *byte) bool {
+ return uintptr(unsafe.Pointer(a))%uintptr(wordSize) == 0
+}
+
+// xorBytes xors the bytes in a and b. The destination should have enough
+// space, otherwise xorBytes will panic. Returns the number of bytes xor'd.
+func xorBytes(dst, a, b []byte) int {
+ n := len(a)
+ if len(b) < n {
+ n = len(b)
+ }
+ if n == 0 {
+ return 0
+ }
+ // make sure dst has enough space
+ _ = dst[n-1]
+
+ if hasNEON {
+ xorBytesNEON32(&dst[0], &a[0], &b[0], n)
+ } else if isAligned(&dst[0]) && isAligned(&a[0]) && isAligned(&b[0]) {
+ xorBytesARM32(&dst[0], &a[0], &b[0], n)
+ } else {
+ safeXORBytes(dst, a, b, n)
+ }
+ return n
+}
+
+// n needs to be less than or equal to the length of a and b.
+func safeXORBytes(dst, a, b []byte, n int) {
+ for i := 0; i < n; i++ {
+ dst[i] = a[i] ^ b[i]
+ }
+}
+
+func xorWords(dst, a, b []byte) {
+ xorBytes(dst, a, b)
+}
+
+//go:noescape
+func xorBytesARM32(dst, a, b *byte, n int)
+
+//go:noescape
+func xorBytesNEON32(dst, a, b *byte, n int)
diff --git a/src/crypto/cipher/xor_arm.s b/src/crypto/cipher/xor_arm.s
new file mode 100644
index 0000000..7b471f1
--- /dev/null
+++ b/src/crypto/cipher/xor_arm.s
@@ -0,0 +1,114 @@
+// Copyright 2022 The Go Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+#include "textflag.h"
+
+// func xorBytesARM32(dst, a, b *byte, n int)
+TEXT ·xorBytesARM32(SB), NOSPLIT|NOFRAME, $0
+ MOVW dst+0(FP), R0
+ MOVW a+4(FP), R1
+ MOVW b+8(FP), R2
+ MOVW n+12(FP), R3
+ CMP $4, R3
+ BLT less_than4
+
+loop_4:
+ MOVW.P 4(R1), R4
+ MOVW.P 4(R2), R5
+ EOR R4, R5, R5
+ MOVW.P R5, 4(R0)
+
+ SUB $4, R3
+ CMP $4, R3
+ BGE loop_4
+
+less_than4:
+ CMP $2, R3
+ BLT less_than2
+ MOVH.P 2(R1), R4
+ MOVH.P 2(R2), R5
+ EOR R4, R5, R5
+ MOVH.P R5, 2(R0)
+
+ SUB $2, R3
+
+less_than2:
+ CMP $0, R3
+ BEQ end
+ MOVB (R1), R4
+ MOVB (R2), R5
+ EOR R4, R5, R5
+ MOVB R5, (R0)
+end:
+ RET
+
+// func xorBytesNEON32(dst, a, b *byte, n int)
+TEXT ·xorBytesNEON32(SB), NOSPLIT|NOFRAME, $0
+ MOVW dst+0(FP), R0
+ MOVW a+4(FP), R1
+ MOVW b+8(FP), R2
+ MOVW n+12(FP), R3
+ CMP $32, R3
+ BLT less_than32
+
+loop_32:
+ WORD $0xF421020D // vld1.u8 {q0, q1}, [r1]!
+ WORD $0xF422420D // vld1.u8 {q2, q3}, [r2]!
+ WORD $0xF3004154 // veor q2, q0, q2
+ WORD $0xF3026156 // veor q3, q1, q3
+ WORD $0xF400420D // vst1.u8 {q2, q3}, [r0]!
+
+ SUB $32, R3
+ CMP $32, R3
+ BGE loop_32
+
+less_than32:
+ CMP $16, R3
+ BLT less_than16
+ WORD $0xF4210A0D // vld1.u8 q0, [r1]!
+ WORD $0xF4222A0D // vld1.u8 q1, [r2]!
+ WORD $0xF3002152 // veor q1, q0, q1
+ WORD $0xF4002A0D // vst1.u8 {q1}, [r0]!
+
+ SUB $16, R3
+
+less_than16:
+ CMP $8, R3
+ BLT less_than8
+ WORD $0xF421070D // vld1.u8 d0, [r1]!
+ WORD $0xF422170D // vld1.u8 d1, [r2]!
+ WORD $0xF3001111 // veor d1, d0, d1
+ WORD $0xF400170D // vst1.u8 {d1}, [r0]!
+
+ SUB $8, R3
+
+less_than8:
+ CMP $4, R3
+ BLT less_than4
+ MOVW.P 4(R1), R4
+ MOVW.P 4(R2), R5
+ EOR R4, R5, R5
+ MOVW.P R5, 4(R0)
+
+ SUB $4, R3
+
+less_than4:
+ CMP $2, R3
+ BLT less_than2
+ MOVH.P 2(R1), R4
+ MOVH.P 2(R2), R5
+ EOR R4, R5, R5
+ MOVH.P R5, 2(R0)
+
+ SUB $2, R3
+
+less_than2:
+ CMP $0, R3
+ BEQ end
+ MOVB (R1), R4
+ MOVB (R2), R5
+ EOR R4, R5, R5
+ MOVB R5, (R0)
+end:
+ RET
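For reviewers who do not want to decode the WORD-encoded NEON
instructions above: structurally, xorBytesNEON32 XORs 32-byte chunks in
the main loop (two quad registers per iteration) and then finishes with
at most one 16-, 8-, 4-, 2- and 1-byte step. A rough Go-level sketch of
that control flow (illustrative only, not part of the change):
```
package xorsketch

// Sketch of the NEON routine's structure; each tail step runs at most
// once, since at most 31 bytes remain after the 32-byte loop.
func xorBytesNEONSketch(dst, a, b []byte, n int) {
	i := 0
	for n-i >= 32 { // main loop: vld1/veor/vst1 on q0..q3
		for j := 0; j < 32; j++ {
			dst[i+j] = a[i+j] ^ b[i+j]
		}
		i += 32
	}
	for _, step := range []int{16, 8, 4, 2, 1} { // tail: 16+8+4+2+1 = 31
		if n-i >= step {
			for j := 0; j < step; j++ {
				dst[i+j] = a[i+j] ^ b[i+j]
			}
			i += step
		}
	}
}
```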
diff --git a/src/crypto/cipher/xor_generic.go b/src/crypto/cipher/xor_generic.go
index 43517a8..6c26a63 100644
--- a/src/crypto/cipher/xor_generic.go
+++ b/src/crypto/cipher/xor_generic.go
@@ -2,7 +2,7 @@
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
-//go:build !amd64 && !ppc64 && !ppc64le && !arm64
+//go:build !amd64 && !ppc64 && !ppc64le && !arm64 && !arm
package cipher
@@ -23,14 +23,9 @@
}
switch {
- case supportsUnaligned:
+ case supportsUnaligned || (isAligned(&dst[0]) && isAligned(&a[0]) && isAligned(&b[0])):
fastXORBytes(dst, a, b, n)
default:
- // TODO(hanwen): if (dst, a, b) have common alignment
- // we could still try fastXORBytes. It is not clear
- // how often this happens, and it's only worth it if
- // the block encryption itself is hardware
- // accelerated.
safeXORBytes(dst, a, b, n)
}
return n
@@ -39,8 +34,12 @@
const wordSize = int(unsafe.Sizeof(uintptr(0)))
const supportsUnaligned = runtime.GOARCH == "386" || runtime.GOARCH == "ppc64" || runtime.GOARCH == "ppc64le" || runtime.GOARCH == "s390x"
+func isAligned(a *byte) bool {
+ return uintptr(unsafe.Pointer(a))%uintptr(wordSize) == 0
+}
+
// fastXORBytes xors in bulk. It only works on architectures that
-// support unaligned read/writes.
+// support unaligned read/writes, or if dst, a, b are all aligned.
// n needs to be smaller or equal than the length of a and b.
func fastXORBytes(dst, a, b []byte, n int) {
// Assert dst has enough space
@@ -83,7 +82,7 @@
// fastXORWords XORs multiples of 4 or 8 bytes (depending on architecture.)
// The slice arguments a and b are assumed to be of equal length.
func xorWords(dst, a, b []byte) {
- if supportsUnaligned {
+ if supportsUnaligned || (isAligned(&dst[0]) && isAligned(&a[0]) && isAligned(&b[0])) {
fastXORWords(dst, a, b)
} else {
safeXORBytes(dst, a, b, len(b))
Congratulations on opening your first change. Thank you for your contribution!
Next steps:
A maintainer will review your change and provide feedback. See
https://go.dev/doc/contribute#review for more info and tips to get your
patch through code review.
Most changes in the Go project go through a few rounds of revision. This can be
surprising to people new to the project. The careful, iterative review process
is our way of helping mentor contributors and ensuring that their contributions
have a lasting impact.
During May-July and Nov-Jan the Go project is in a code freeze, during which
little code gets reviewed or merged. If a reviewer responds with a comment like
R=go1.11 or adds a tag like "wait-release", it means that this CL will be
reviewed as part of the next development cycle. See https://go.dev/s/release
for more details.
Attention is currently required from: Filippo Valsorda.
1 comment:
Patchset:
... and now some actual crypto benchmarks on ARMv7+NEON, showing a ~5-7% improvement on the AES encrypt/decrypt benchmarks due to this change (a note on why the stream modes benefit follows the numbers):
```
name                old speed      new speed      delta
AESCFBEncrypt1K-4   7.75MB/s ± 0%  8.11MB/s ± 0%  +4.67%  (p=0.008 n=5+5)
AESCFBDecrypt1K-4   7.69MB/s ± 0%  8.12MB/s ± 0%  +5.54%  (p=0.008 n=5+5)
AESCFBDecrypt8K-4   7.71MB/s ± 0%  8.08MB/s ± 1%  +4.83%  (p=0.008 n=5+5)
AESOFB1K-4          8.28MB/s ± 1%  8.88MB/s ± 1%  +7.24%  (p=0.008 n=5+5)
AESCTR1K-4          8.45MB/s ± 1%  9.09MB/s ± 0%  +7.52%  (p=0.008 n=5+5)
AESCTR8K-4          8.47MB/s ± 0%  9.09MB/s ± 0%  +7.37%  (p=0.008 n=5+5)
AESCBCEncrypt1K-4   8.24MB/s ± 0%  8.74MB/s ± 0%  +6.02%  (p=0.008 n=5+5)
AESCBCDecrypt1K-4   8.42MB/s ± 0%  8.96MB/s ± 0%  +6.41%  (p=0.008 n=5+5)
```
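The stream modes benefit because, absent a dedicated hardware CTR/CFB/OFB
path, XORKeyStream ends up XORing the keystream into the data through
xorBytes. A minimal example of the affected call path (standard library
API only; key, IV and sizes are arbitrary):
```
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

func main() {
	key := make([]byte, 16)
	iv := make([]byte, aes.BlockSize)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	if _, err := rand.Read(iv); err != nil {
		panic(err)
	}

	block, err := aes.NewCipher(key)
	if err != nil {
		panic(err)
	}

	// On 32-bit ARM the generic CTR implementation XORs the AES
	// keystream into the plaintext via xorBytes for every chunk,
	// which is why the CTR/OFB/CFB numbers track the xorBytes speedup.
	ctr := cipher.NewCTR(block, iv)
	plaintext := make([]byte, 1024)
	ciphertext := make([]byte, len(plaintext))
	ctr.XORKeyStream(ciphertext, plaintext)
	fmt.Printf("first bytes: %x\n", ciphertext[:8])
}
```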