internal/bytealg: vector implementation of count 1 byte for riscv64
This CL provides a vector implementation of count for a single byte.
goos: linux
goarch: riscv64
pkg: bytes
cpu: Spacemit(R) X60
│ /root/scalar_count.log │ /root/vector_count.log │
│ sec/op │ sec/op vs base │
CountSingle/10 42.12n ± 0% 44.49n ± 0% +5.64% (p=0.000 n=10)
CountSingle/32 85.74n ± 0% 44.49n ± 0% -48.11% (p=0.000 n=10)
CountSingle/4K 8371.5n ± 0% 396.7n ± 0% -95.26% (p=0.000 n=10)
CountSingle/4M 8929.1µ ± 0% 756.9µ ± 0% -91.52% (p=0.000 n=10)
CountSingle/64M 159.78m ± 0% 13.89m ± 0% -91.30% (p=0.000 n=10)
geomean 33.65µ 6.073µ -81.95%
│ /root/scalar_count.log │ /root/vector_count.log │
│ B/s │ B/s vs base │
CountSingle/10 226.4Mi ± 0% 214.3Mi ± 0% -5.34% (p=0.000 n=10)
CountSingle/32 355.9Mi ± 0% 685.9Mi ± 0% +92.71% (p=0.000 n=10)
CountSingle/4K 466.6Mi ± 0% 9846.5Mi ± 0% +2010.19% (p=0.000 n=10)
CountSingle/4M 448.0Mi ± 0% 5284.9Mi ± 0% +1079.74% (p=0.000 n=10)
CountSingle/64M 400.5Mi ± 0% 4606.2Mi ± 0% +1049.98% (p=0.000 n=10)
geomean 368.0Mi 1.991Gi +454.08%
diff --git a/src/internal/bytealg/count_riscv64.s b/src/internal/bytealg/count_riscv64.s
index 3f255cd..344427e 100644
--- a/src/internal/bytealg/count_riscv64.s
+++ b/src/internal/bytealg/count_riscv64.s
@@ -2,48 +2,53 @@
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
+#include "asm_riscv64.h"
#include "go_asm.h"
#include "textflag.h"
-TEXT ·Count<ABIInternal>(SB),NOSPLIT,$0-40
- // X10 = b_base
- // X11 = b_len
- // X12 = b_cap (unused)
- // X13 = byte to count (want in X12)
- AND $0xff, X13, X12
- MOV ZERO, X14 // count
- ADD X10, X11 // end
+TEXT ·Count<ABIInternal>(SB),NOSPLIT|NOFRAME,$0
+ MOV X13, X12 // b_cap not used
+ JMP count<>(SB)
- PCALIGN $16
-loop:
- BEQ X10, X11, done
- MOVBU (X10), X15
- ADD $1, X10
- BNE X12, X15, loop
- ADD $1, X14
- JMP loop
+TEXT ·CountString<ABIInternal>(SB),NOSPLIT|NOFRAME,$0
+ JMP count<>(SB)
-done:
- MOV X14, X10
- RET
-
-TEXT ·CountString<ABIInternal>(SB),NOSPLIT,$0-32
- // X10 = s_base
+TEXT count<>(SB),NOSPLIT|NOFRAME,$0-0
+ // X10 = s_base, return as counter
// X11 = s_len
// X12 = byte to count
- AND $0xff, X12
- MOV ZERO, X14 // count
- ADD X10, X11 // end
+ MOV X10, X14 // src pointer
+ MOV ZERO, X10 // reset counter
+ AND $0xff, X12, X12 // make sure it's a byte to compare
+ SUB $8, X11, X5
+ BLEZ X5, count_scalar
+#ifndef hasV
+ MOVB internal∕cpu·RISCV64+const_offsetRISCV64HasV(SB), X5
+ BEQZ X5, count_scalar
+#endif
+ PCALIGN $16
+count_vector_loop:
+ VSETVLI X11, E8, M8, TA, MA, X5
+ VLE8V (X14), V8
+ VMSEQVX X12, V8, V0
+ VCPOPM V0, X15
+ ADD X15, X10 // add counter
+ ADD X5, X14
+ SUB X5, X11, X11
+ BEQZ X11, done
+ JMP count_vector_loop
PCALIGN $16
-loop:
- BEQ X10, X11, done
- MOVBU (X10), X15
- ADD $1, X10
- BNE X12, X15, loop
+count_scalar:
+ ADD X14, X11 // end pointer
+ PCALIGN $16
+count_scalar_loop:
+ BEQ X14, X11, done
+ MOVBU (X14), X15
ADD $1, X14
- JMP loop
+ BNE X12, X15, count_scalar_loop
+ ADD $1, X10
+ JMP count_scalar_loop
done:
- MOV X14, X10
RET
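The control flow of count<> above can be modelled in plain Go. This is only a rough sketch for readers who do not read RISC-V assembly, not the CL's implementation: `countModel`, `countScalar`, and the `vlmax` constant are hypothetical names, and the chunk size is an assumed value (the real length is whatever VSETVLI returns on the running hardware, which may be smaller than the maximum for the last chunks).

```go
package main

import "fmt"

// vlmax is an ASSUMED maximum chunk size in bytes (for example VLEN=256 bits
// with LMUL=8). It is illustrative only; the assembly asks VSETVLI at run time.
const vlmax = 256

// countScalar mirrors count_scalar_loop: one byte is compared per iteration.
func countScalar(s []byte, c byte) int {
	n := 0
	for _, b := range s {
		if b == c {
			n++
		}
	}
	return n
}

// countModel mirrors the branch structure of count<>: inputs of 8 bytes or
// fewer, or CPUs without the V extension, take the scalar path. Longer inputs
// are consumed in chunks until no bytes remain.
func countModel(s []byte, c byte, hasV bool) int {
	if len(s) <= 8 || !hasV {
		return countScalar(s, c)
	}
	total := 0
	for len(s) > 0 {
		vl := len(s) // stands in for VSETVLI X11, E8, M8, TA, MA, X5
		if vl > vlmax {
			vl = vlmax
		}
		// Stands in for VLE8V + VMSEQVX + VCPOPM on one chunk.
		total += countScalar(s[:vl], c)
		s = s[vl:]
	}
	return total
}

func main() {
	data := []byte("the quick brown fox jumps over the lazy dog")
	fmt.Println(countModel(data, 'o', true))
}
```

Both paths must return the same count for any input, which is the property a test of the assembly would also check.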
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
I tested this locally and it seems to work fine. I also see the expected speedups. We need to wait until https://go-review.googlesource.com/c/go/+/646779/6 and https://go-review.googlesource.com/c/go/+/646775/6 are merged before merging this.
AND $0xff, X12, X12 // make sure it's a byte to compare
AND $0xff, X12
SUB $8, X11, X5
This threshold looks a little low, and the benchmarks in the commit message (the degradation in CountSingle/10) suggest that this might indeed be the case. However, I benchmarked this locally on a Banana Pi with GORISCV64=rva23u64 and didn't see any slowdown in CountSingle/10, so maybe it's okay.
| Commit-Queue | +1 |
I tested this locally and it seems to work fine. I also see the expected speedups. We need to wait until https://go-review.googlesource.com/c/go/+/646779/6 and https://go-review.googlesource.com/c/go/+/646775/6 are merged before merging this.
Also CL 646736, for hasV const.
AND $0xff, X12, X12 // make sure it's a byte to compare
Meng Zhuo
AND $0xff, X12
Done
VMSEQVX X13, V8, V0
We should avoid V0 here according to https://riscv-optimization-guide.riseproject.dev/#_avoid_using_v0_for_non_mask_operations.