runtime: add vector implementation of memclrNoHeapPointers for riscv64
diff --git a/src/runtime/cpuflags.go b/src/runtime/cpuflags.go
index e81e50f..35dce45 100644
--- a/src/runtime/cpuflags.go
+++ b/src/runtime/cpuflags.go
@@ -11,16 +11,18 @@
// Offsets into internal/cpu records for use in assembly.
const (
- offsetX86HasAVX = unsafe.Offsetof(cpu.X86.HasAVX)
- offsetX86HasAVX2 = unsafe.Offsetof(cpu.X86.HasAVX2)
- offsetX86HasERMS = unsafe.Offsetof(cpu.X86.HasERMS)
- offsetX86HasRDTSCP = unsafe.Offsetof(cpu.X86.HasRDTSCP)
-
offsetARMHasIDIVA = unsafe.Offsetof(cpu.ARM.HasIDIVA)
offsetMIPS64XHasMSA = unsafe.Offsetof(cpu.MIPS64X.HasMSA)
offsetLOONG64HasLSX = unsafe.Offsetof(cpu.Loong64.HasLSX)
+
+ offsetRISCV64HasV = unsafe.Offsetof(cpu.RISCV64.HasV)
+
+ offsetX86HasAVX = unsafe.Offsetof(cpu.X86.HasAVX)
+ offsetX86HasAVX2 = unsafe.Offsetof(cpu.X86.HasAVX2)
+ offsetX86HasERMS = unsafe.Offsetof(cpu.X86.HasERMS)
+ offsetX86HasRDTSCP = unsafe.Offsetof(cpu.X86.HasRDTSCP)
)
var (
diff --git a/src/runtime/memclr_riscv64.s b/src/runtime/memclr_riscv64.s
index 16c511c..ead0fe0 100644
--- a/src/runtime/memclr_riscv64.s
+++ b/src/runtime/memclr_riscv64.s
@@ -2,6 +2,8 @@
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
+#include "asm_riscv64.h"
+#include "go_asm.h"
#include "textflag.h"
// See memclrNoHeapPointers Go doc for important implementation constraints.
@@ -15,6 +17,30 @@
MOV $8, X9
BLT X11, X9, check4
+#ifndef hasV
+ MOVB internal∕cpu·RISCV64+const_offsetRISCV64HasV(SB), X5
+ BEQZ X5, memclr_scalar
+#endif
+
+ // Use vector if not 8 byte aligned.
+ AND $7, X10, X5
+ BNEZ X5, vector_loop
+
+ // Use scalar if 8 byte aligned and <= 64 bytes.
+ SUB $64, X11, X6
+ BLEZ X6, aligned
+
+ PCALIGN $16
+vector_loop:
+ VSETVLI X11, E8, M8, TA, MA, X5
+ VMVVI $0, V8
+ VSE8V V8, (X10)
+ ADD X5, X10
+ SUB X5, X11
+ BNEZ X11, vector_loop
+ RET
+
+memclr_scalar:
// Check alignment
AND $7, X10, X5
BEQZ X5, aligned
@@ -37,6 +63,8 @@
BLT X11, X9, zero16
MOV $64, X9
BLT X11, X9, zero32
+
+ PCALIGN $16
loop64:
MOV ZERO, 0(X10)
MOV ZERO, 8(X10)
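To summarise the dispatch logic added in the first memclr_riscv64.s hunk above: for clears of at least 8 bytes the code now checks for the V extension, takes the vector loop for unaligned or larger buffers, and keeps the existing scalar path otherwise. Below is a rough Go sketch of that selection, for illustration only; the function and parameter names are invented and the code is not part of the CL.

```go
// Rough sketch of the path selection in the new assembly, for n >= 8.
// hasV mirrors internal/cpu.RISCV64.HasV; the runtime check is skipped
// entirely when the hasV assembly macro is defined (i.e. when GORISCV64
// already guarantees the V extension).
func useVectorClear(ptr, n uintptr, hasV bool) bool {
	if !hasV {
		return false // fall back to the existing scalar code
	}
	if ptr%8 != 0 {
		return true // unaligned start: take the vector loop directly
	}
	return n > 64 // 8-byte aligned: scalar code handles up to 64 bytes
}
```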
VMVVI $0, V8

The VMVVI can be moved out of the loop.
Maybe... the catch is we would need to make an equivalent `VSETVLI` call, as the M8 is effectively splatting $0 into V8..V15 (with the number of elements also controlled by the current X11 value). I also wanted to confirm what the required behaviour is if we VMVVI with one X11 value, then use it with a different value (not sure if you know the answer to this). At least for now it seemed easier/safer to call VMVVI in the loop (and it's not going to be the slow part of the code).
Code-Review +1
Yes, sorry, ignore my comment. Still trying to get my head around vector.
> I also wanted to confirm what the required behaviour is if we VMVVI with one X11 value, then use it with a different value (not sure if you know the answer to this)

I'd say it should work, but I agree, not worth it. One thing we could consider trying/benchmarking in the future is whole vector register loads and stores. This would allow us to set V8 once outside the loop, but we'd need CSR support for this (to read vlenb), possibly manual unrolling, and one call to VSETVLI to process the tail, so possibly not worth it either.
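As a toy illustration of the question above (splatting with VMVVI under one X11/vl value and then storing under a smaller one), here is a rough Go model of why it is expected to work; this is a sketch of my reading of the behaviour, not authoritative RVV semantics, and the names (`vectorClearModel`, `vlmax`) are invented.

```go
// Toy model: V8..V15 are treated as one byte buffer of vlmax elements
// (vlmax > 0). The remaining length only shrinks across iterations, so
// each store touches only elements zeroed by the one-off splat.
func vectorClearModel(dst []byte, vlmax int) {
	vreg := make([]byte, vlmax)
	vl := min(len(dst), vlmax) // first VSETVLI
	for i := range vreg[:vl] { // VMVVI $0, V8, done once under that vl
		vreg[i] = 0
	}
	for len(dst) > 0 {
		vl = min(len(dst), vlmax) // VSETVLI per iteration
		copy(dst, vreg[:vl])      // VSE8V writes only the first vl bytes
		dst = dst[vl:]
	}
}
```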
Code-Review +1
Tested on a Banana Pi. Seeing a geomean improvement of -10.32% on the Memclr tests.
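The figures quoted here come from the runtime package's Memclr benchmarks. For anyone wanting to reproduce this kind of number, a standalone benchmark along the following lines exercises the same code path, since `clear` on a pointer-free slice is lowered to `memclrNoHeapPointers`; this is an illustrative sketch with invented names, not the benchmark used above.

```go
// Standalone sketch, not the runtime's own Memclr benchmarks.
package memclr

import (
	"fmt"
	"testing"
)

func BenchmarkClear(b *testing.B) {
	for _, size := range []int{16, 64, 4096, 64 << 10} {
		buf := make([]byte, size)
		b.Run(fmt.Sprintf("%dB", size), func(b *testing.B) {
			b.SetBytes(int64(size))
			for i := 0; i < b.N; i++ {
				// clear on a []byte compiles to a memclrNoHeapPointers call.
				clear(buf)
			}
		})
	}
}
```

Comparing runs from toolchains built with and without this change using benchstat should produce the same kind of geomean delta quoted above.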
Code-Review +2
Commit-Queue +1
vector_loop:
VSETVLI X11, E8, M8, TA, MA, X5
VMVVI $0, V8
VSE8V V8, (X10)
ADD X5, X10
SUB X5, X11
BNEZ X11, vector_loop
RET

There is no need to issue VMVVI $0, V8 in every iteration. Perhaps we can initialize V8 once before entering the main vector loop.
```suggestion
vector_loop:
VSETVLI X11, E8, M8, TA, MA, X5
VMVVI $0, V8
vector_loop_body:
VSE8V V8, (X10)
ADD X5, X10
SUB X5, X11
VSETVLI X11, E8, M8, TA, MA, X5
BNEZ X11, vector_loop_body
RET
```
Code-Review +1
@wang...@bytedance.com do you know whether these changes measurably boost the performance of the code? We did discuss doing something like this in the comments above but weren't sure how much of a difference it would make.
Sorry, I didn’t look closely at the comments that were already marked as resolved. I used a similar approach in glibc’s memset implementation, and I can confirm that moving VMVVI out of the loop in this way is functionally correct. Perhaps the gains observed in the glibc memset-zero benchtest can be used as a reference? I can share the results later.
We can save one instruction by hoisting loop invariants out of the loop. This is what LLVM/GCC generate: https://godbolt.org/z/55Ej73ac9.
@mark...@rivosinc.com Based on testing BenchmarkMemclr on SpacemiT X60, moving VMVVI out of the loop provides a geomean performance improvement of around 0.2%.
Thanks for checking this, and thanks also Pengcheng for the llvm and gcc links. Looks like this is a viable optimisation.