Gerrit Bot has uploaded this change for review.
crypto/cipher: add optimized assembly xorBytes for ARM (NEON + non-NEON)
xorBytes is used extensively in crypto and can sometimes be the
bottleneck (e.g. in CTR mode). 32-bit ARM is still the most common
architecture for IoT devices, and since it is a comparatively slow
platform, crypto optimization matters there. Unfortunately,
crypto/cipher has no optimized xorBytes for ARM.
This PR adds optimized assembly xorBytes implementations for both
NEON and non-NEON ARM, which can yield very large performance gains
(up to roughly 14x on large buffers in the benchmarks below).
Benchmarks comparing crypto/cipher xorBytes at current tip with the
results after applying this PR, on ARMv7 with NEON:
```
name                   old time/op    new time/op    delta
XORBytes/8Bytes-4         127ns ± 0%      95ns ± 0%    -25.14%
XORBytes/128Bytes-4      1.21µs ± 1%    0.16µs ± 0%    -86.70%
XORBytes/2048Bytes-4     18.4µs ± 0%     1.3µs ± 0%    -93.17%
XORBytes/32768Bytes-4     304µs ± 1%      31µs ± 1%    -89.93%

name                   old speed      new speed       delta
XORBytes/8Bytes-4      63.2MB/s ± 0%  84.4MB/s ± 0%    +33.59%
XORBytes/128Bytes-4     106MB/s ± 1%   797MB/s ± 0%   +651.92%
XORBytes/2048Bytes-4    111MB/s ± 0%  1627MB/s ± 0%  +1364.86%
XORBytes/32768Bytes-4   108MB/s ± 1%  1071MB/s ± 1%   +892.65%
```
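For reference, the rows above come from size sub-benchmarks of roughly
the shape sketched below (an illustrative reconstruction, assumed to sit
inside package cipher so it can call the unexported xorBytes; the real
test file may differ). MB/s comes from b.SetBytes, and before/after runs
are compared with benchstat.
```
package cipher

import (
	"fmt"
	"testing"
)

// Illustrative benchmark shape for the XORBytes/<size> rows above.
func BenchmarkXORBytes(b *testing.B) {
	dst := make([]byte, 1<<15)
	data0 := make([]byte, 1<<15)
	data1 := make([]byte, 1<<15)
	for _, size := range []int{8, 128, 2048, 32768} {
		b.Run(fmt.Sprintf("%dBytes", size), func(b *testing.B) {
			b.SetBytes(int64(size))
			for i := 0; i < b.N; i++ {
				xorBytes(dst[:size], data0[:size], data1[:size])
			}
		})
	}
}
```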
Additionally, this PR improves xor_generic.go to use word-at-a-time
XOR on all arches when the dst and src buffers are all word-aligned,
which in real-world usage is often the case. Previously the word
loop was used only on platforms that support unaligned access, which
was unnecessarily strict and left performance on the table.
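For context, the word loop in question (fastXORBytes) works roughly as
in the sketch below; with the alignment check it becomes safe to use on
any architecture. This is an illustrative reconstruction, not the exact
stdlib code:
```
package xorsketch

import "unsafe"

const wordSize = int(unsafe.Sizeof(uintptr(0)))

// aligned reports whether p is on a machine-word boundary.
func aligned(p *byte) bool {
	return uintptr(unsafe.Pointer(p))%uintptr(wordSize) == 0
}

// xorAlignedWords XORs a and b into dst one uintptr-sized word at a
// time, then finishes the tail byte by byte. Callers must ensure dst,
// a and b all start word-aligned and have at least n bytes.
func xorAlignedWords(dst, a, b []byte, n int) {
	w := n / wordSize
	if w > 0 {
		dw := *(*[]uintptr)(unsafe.Pointer(&dst))
		aw := *(*[]uintptr)(unsafe.Pointer(&a))
		bw := *(*[]uintptr)(unsafe.Pointer(&b))
		for i := 0; i < w; i++ {
			dw[i] = aw[i] ^ bw[i]
		}
	}
	for i := n - n%wordSize; i < n; i++ {
		dst[i] = a[i] ^ b[i]
	}
}
```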
Fixes #53023
Change-Id: Ib16b2b6ee0f648cb9ca15539ce4998efc604eb6a
GitHub-Last-Rev: d5e9cb24b70cd9699bd43ff75a1b990177d94656
GitHub-Pull-Request: golang/go#53154
---
A src/crypto/cipher/xor_arm.go
A src/crypto/cipher/xor_arm.s
M src/crypto/cipher/xor_generic.go
3 files changed, 226 insertions(+), 9 deletions(-)
diff --git a/src/crypto/cipher/xor_arm.go b/src/crypto/cipher/xor_arm.go
new file mode 100644
index 0000000..b24f178
--- /dev/null
+++ b/src/crypto/cipher/xor_arm.go
@@ -0,0 +1,58 @@
+// Copyright 2022 The Go Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+package cipher
+
+import (
+ "internal/cpu"
+ "unsafe"
+)
+
+const wordSize = int(unsafe.Sizeof(uintptr(0)))
+
+var hasNEON = cpu.HWCap&(1<<12) != 0 // bit 12 of the Linux AT_HWCAP auxv word is HWCAP_NEON
+
+func isAligned(a *byte) bool {
+ return uintptr(unsafe.Pointer(a))%uintptr(wordSize) == 0
+}
+
+// xorBytes xors the bytes in a and b. The destination should have enough
+// space, otherwise xorBytes will panic. Returns the number of bytes xor'd.
+func xorBytes(dst, a, b []byte) int {
+ n := len(a)
+ if len(b) < n {
+ n = len(b)
+ }
+ if n == 0 {
+ return 0
+ }
+ // make sure dst has enough space
+ _ = dst[n-1]
+
+ if hasNEON {
+ xorBytesNEON32(&dst[0], &a[0], &b[0], n)
+ } else if isAligned(&dst[0]) && isAligned(&a[0]) && isAligned(&b[0]) {
+ xorBytesARM32(&dst[0], &a[0], &b[0], n)
+ } else {
+ safeXORBytes(dst, a, b, n)
+ }
+ return n
+}
+
+// n needs to be less than or equal to the length of a and b.
+func safeXORBytes(dst, a, b []byte, n int) {
+ for i := 0; i < n; i++ {
+ dst[i] = a[i] ^ b[i]
+ }
+}
+
+func xorWords(dst, a, b []byte) {
+ xorBytes(dst, a, b)
+}
+
+//go:noescape
+func xorBytesARM32(dst, a, b *byte, n int)
+
+//go:noescape
+func xorBytesNEON32(dst, a, b *byte, n int)
diff --git a/src/crypto/cipher/xor_arm.s b/src/crypto/cipher/xor_arm.s
new file mode 100644
index 0000000..7b471f1
--- /dev/null
+++ b/src/crypto/cipher/xor_arm.s
@@ -0,0 +1,114 @@
+// Copyright 2022 The Go Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+#include "textflag.h"
+
+// func xorBytesARM32(dst, a, b *byte, n int)
+TEXT ·xorBytesARM32(SB), NOSPLIT|NOFRAME, $0
+ MOVW dst+0(FP), R0
+ MOVW a+4(FP), R1
+ MOVW b+8(FP), R2
+ MOVW n+12(FP), R3
+ CMP $4, R3
+ BLT less_than4
+
+loop_4:
+ MOVW.P 4(R1), R4
+ MOVW.P 4(R2), R5
+ EOR R4, R5, R5
+ MOVW.P R5, 4(R0)
+
+ SUB $4, R3
+ CMP $4, R3
+ BGE loop_4
+
+less_than4:
+ CMP $2, R3
+ BLT less_than2
+ MOVH.P 2(R1), R4
+ MOVH.P 2(R2), R5
+ EOR R4, R5, R5
+ MOVH.P R5, 2(R0)
+
+ SUB $2, R3
+
+less_than2:
+ CMP $0, R3
+ BEQ end
+ MOVB (R1), R4
+ MOVB (R2), R5
+ EOR R4, R5, R5
+ MOVB R5, (R0)
+end:
+ RET
+
+// func xorBytesNEON32(dst, a, b *byte, n int)
+TEXT ·xorBytesNEON32(SB), NOSPLIT|NOFRAME, $0
+ MOVW dst+0(FP), R0
+ MOVW a+4(FP), R1
+ MOVW b+8(FP), R2
+ MOVW n+12(FP), R3
+ CMP $32, R3
+ BLT less_than32
+
+loop_32:
+ WORD $0xF421020D // vld1.u8 {q0, q1}, [r1]!
+ WORD $0xF422420D // vld1.u8 {q2, q3}, [r2]!
+ WORD $0xF3004154 // veor q2, q0, q2
+ WORD $0xF3026156 // veor q3, q1, q3
+ WORD $0xF400420D // vst1.u8 {q2, q3}, [r0]!
+
+ SUB $32, R3
+ CMP $32, R3
+ BGE loop_32
+
+less_than32:
+ CMP $16, R3
+ BLT less_than16
+ WORD $0xF4210A0D // vld1.u8 q0, [r1]!
+ WORD $0xF4222A0D // vld1.u8 q1, [r2]!
+ WORD $0xF3002152 // veor q1, q0, q1
+ WORD $0xF4002A0D // vst1.u8 {q1}, [r0]!
+
+ SUB $16, R3
+
+less_than16:
+ CMP $8, R3
+ BLT less_than8
+ WORD $0xF421070D // vld1.u8 d0, [r1]!
+ WORD $0xF422170D // vld1.u8 d1, [r2]!
+ WORD $0xF3001111 // veor d1, d0, d1
+ WORD $0xF400170D // vst1.u8 {d1}, [r0]!
+
+ SUB $8, R3
+
+less_than8:
+ CMP $4, R3
+ BLT less_than4
+ MOVW.P 4(R1), R4
+ MOVW.P 4(R2), R5
+ EOR R4, R5, R5
+ MOVW.P R5, 4(R0)
+
+ SUB $4, R3
+
+less_than4:
+ CMP $2, R3
+ BLT less_than2
+ MOVH.P 2(R1), R4
+ MOVH.P 2(R2), R5
+ EOR R4, R5, R5
+ MOVH.P R5, 2(R0)
+
+ SUB $2, R3
+
+less_than2:
+ CMP $0, R3
+ BEQ end
+ MOVB (R1), R4
+ MOVB (R2), R5
+ EOR R4, R5, R5
+ MOVB R5, (R0)
+end:
+ RET
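For reviewers who do not want to decode the WORD-encoded NEON
instructions above: structurally, xorBytesNEON32 XORs 32-byte chunks in
the main loop (two quad registers per iteration) and then finishes with
at most one 16-, 8-, 4-, 2- and 1-byte step. A rough Go-level sketch of
that control flow (illustrative only, not part of the change):
```
package xorsketch

// Sketch of the NEON routine's structure; each tail step runs at most
// once, since at most 31 bytes remain after the 32-byte loop.
func xorBytesNEONSketch(dst, a, b []byte, n int) {
	i := 0
	for n-i >= 32 { // main loop: vld1/veor/vst1 on q0..q3
		for j := 0; j < 32; j++ {
			dst[i+j] = a[i+j] ^ b[i+j]
		}
		i += 32
	}
	for _, step := range []int{16, 8, 4, 2, 1} { // tail: 16+8+4+2+1 = 31
		if n-i >= step {
			for j := 0; j < step; j++ {
				dst[i+j] = a[i+j] ^ b[i+j]
			}
			i += step
		}
	}
}
```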
diff --git a/src/crypto/cipher/xor_generic.go b/src/crypto/cipher/xor_generic.go
index 43517a8..6c26a63 100644
--- a/src/crypto/cipher/xor_generic.go
+++ b/src/crypto/cipher/xor_generic.go
@@ -2,7 +2,7 @@
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
-//go:build !amd64 && !ppc64 && !ppc64le && !arm64
+//go:build !amd64 && !ppc64 && !ppc64le && !arm64 && !arm
package cipher
@@ -23,14 +23,9 @@
}
switch {
- case supportsUnaligned:
+ case supportsUnaligned || (isAligned(&dst[0]) && isAligned(&a[0]) && isAligned(&b[0])):
fastXORBytes(dst, a, b, n)
default:
- // TODO(hanwen): if (dst, a, b) have common alignment
- // we could still try fastXORBytes. It is not clear
- // how often this happens, and it's only worth it if
- // the block encryption itself is hardware
- // accelerated.
safeXORBytes(dst, a, b, n)
}
return n
@@ -39,8 +34,12 @@
const wordSize = int(unsafe.Sizeof(uintptr(0)))
const supportsUnaligned = runtime.GOARCH == "386" || runtime.GOARCH == "ppc64" || runtime.GOARCH == "ppc64le" || runtime.GOARCH == "s390x"
+func isAligned(a *byte) bool {
+ return uintptr(unsafe.Pointer(a))%uintptr(wordSize) == 0
+}
+
// fastXORBytes xors in bulk. It only works on architectures that
-// support unaligned read/writes.
+// support unaligned read/writes, or if dst, a, b are all aligned.
// n needs to be smaller or equal than the length of a and b.
func fastXORBytes(dst, a, b []byte, n int) {
// Assert dst has enough space
@@ -83,7 +82,7 @@
// fastXORWords XORs multiples of 4 or 8 bytes (depending on architecture.)
// The slice arguments a and b are assumed to be of equal length.
func xorWords(dst, a, b []byte) {
- if supportsUnaligned {
+ if supportsUnaligned || (isAligned(&dst[0]) && isAligned(&a[0]) && isAligned(&b[0])) {
fastXORWords(dst, a, b)
} else {
safeXORBytes(dst, a, b, len(b))
Congratulations on opening your first change. Thank you for your contribution!
Next steps:
A maintainer will review your change and provide feedback. See
https://go.dev/doc/contribute#review for more info and tips to get your
patch through code review.
Most changes in the Go project go through a few rounds of revision. This can be
surprising to people new to the project. The careful, iterative review process
is our way of helping mentor contributors and ensuring that their contributions
have a lasting impact.
During May-July and Nov-Jan the Go project is in a code freeze, during which
little code gets reviewed or merged. If a reviewer responds with a comment like
R=go1.11 or adds a tag like "wait-release", it means that this CL will be
reviewed as part of the next development cycle. See https://go.dev/s/release
for more details.
Attention is currently required from: Filippo Valsorda.
1 comment:
Patchset:
... and now some actual crypto benchmarks on ARMv7+NEON, showing a ~5-7% improvement on the AES encrypt/decrypt benchmarks due to this change (a note on why the stream modes benefit follows the numbers):
```
name                old speed      new speed      delta
AESCFBEncrypt1K-4   7.75MB/s ± 0%  8.11MB/s ± 0%  +4.67%  (p=0.008 n=5+5)
AESCFBDecrypt1K-4   7.69MB/s ± 0%  8.12MB/s ± 0%  +5.54%  (p=0.008 n=5+5)
AESCFBDecrypt8K-4   7.71MB/s ± 0%  8.08MB/s ± 1%  +4.83%  (p=0.008 n=5+5)
AESOFB1K-4          8.28MB/s ± 1%  8.88MB/s ± 1%  +7.24%  (p=0.008 n=5+5)
AESCTR1K-4          8.45MB/s ± 1%  9.09MB/s ± 0%  +7.52%  (p=0.008 n=5+5)
AESCTR8K-4          8.47MB/s ± 0%  9.09MB/s ± 0%  +7.37%  (p=0.008 n=5+5)
AESCBCEncrypt1K-4   8.24MB/s ± 0%  8.74MB/s ± 0%  +6.02%  (p=0.008 n=5+5)
AESCBCDecrypt1K-4   8.42MB/s ± 0%  8.96MB/s ± 0%  +6.41%  (p=0.008 n=5+5)
```
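The stream modes benefit because, absent a dedicated hardware CTR/CFB/OFB
path, XORKeyStream ends up XORing the keystream into the data through
xorBytes. A minimal example of the affected call path (standard library
API only; key, IV and sizes are arbitrary):
```
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

func main() {
	key := make([]byte, 16)
	iv := make([]byte, aes.BlockSize)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	if _, err := rand.Read(iv); err != nil {
		panic(err)
	}

	block, err := aes.NewCipher(key)
	if err != nil {
		panic(err)
	}

	// On 32-bit ARM the generic CTR implementation XORs the AES
	// keystream into the plaintext via xorBytes for every chunk,
	// which is why the CTR/OFB/CFB numbers track the xorBytes speedup.
	ctr := cipher.NewCTR(block, iv)
	plaintext := make([]byte, 1024)
	ciphertext := make([]byte, len(plaintext))
	ctr.XORKeyStream(ciphertext, plaintext)
	fmt.Printf("first bytes: %x\n", ciphertext[:8])
}
```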