runtime: add vector implementation of memclrNoHeapPointers for riscv64
diff --git a/src/runtime/cpuflags.go b/src/runtime/cpuflags.go
index e81e50f..35dce45 100644
--- a/src/runtime/cpuflags.go
+++ b/src/runtime/cpuflags.go
@@ -11,16 +11,18 @@
// Offsets into internal/cpu records for use in assembly.
const (
- offsetX86HasAVX = unsafe.Offsetof(cpu.X86.HasAVX)
- offsetX86HasAVX2 = unsafe.Offsetof(cpu.X86.HasAVX2)
- offsetX86HasERMS = unsafe.Offsetof(cpu.X86.HasERMS)
- offsetX86HasRDTSCP = unsafe.Offsetof(cpu.X86.HasRDTSCP)
-
offsetARMHasIDIVA = unsafe.Offsetof(cpu.ARM.HasIDIVA)
offsetMIPS64XHasMSA = unsafe.Offsetof(cpu.MIPS64X.HasMSA)
offsetLOONG64HasLSX = unsafe.Offsetof(cpu.Loong64.HasLSX)
+
+ offsetRISCV64HasV = unsafe.Offsetof(cpu.RISCV64.HasV)
+
+ offsetX86HasAVX = unsafe.Offsetof(cpu.X86.HasAVX)
+ offsetX86HasAVX2 = unsafe.Offsetof(cpu.X86.HasAVX2)
+ offsetX86HasERMS = unsafe.Offsetof(cpu.X86.HasERMS)
+ offsetX86HasRDTSCP = unsafe.Offsetof(cpu.X86.HasRDTSCP)
)
var (
diff --git a/src/runtime/memclr_riscv64.s b/src/runtime/memclr_riscv64.s
index 16c511c..ead0fe0 100644
--- a/src/runtime/memclr_riscv64.s
+++ b/src/runtime/memclr_riscv64.s
@@ -2,6 +2,8 @@
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
+#include "asm_riscv64.h"
+#include "go_asm.h"
#include "textflag.h"
// See memclrNoHeapPointers Go doc for important implementation constraints.
@@ -15,6 +17,30 @@
MOV $8, X9
BLT X11, X9, check4
+#ifndef hasV
+ MOVB internal∕cpu·RISCV64+const_offsetRISCV64HasV(SB), X5
+ BEQZ X5, memclr_scalar
+#endif
+
+ // Use vector if not 8 byte aligned.
+ AND $7, X10, X5
+ BNEZ X5, vector_loop
+
+ // Use scalar if 8 byte aligned and <= 64 bytes.
+ SUB $64, X11, X6
+ BLEZ X6, aligned
+
+ PCALIGN $16
+vector_loop:
+ VSETVLI X11, E8, M8, TA, MA, X5
+ VMVVI $0, V8
+ VSE8V V8, (X10)
+ ADD X5, X10
+ SUB X5, X11
+ BNEZ X11, vector_loop
+ RET
+
+memclr_scalar:
// Check alignment
AND $7, X10, X5
BEQZ X5, aligned
@@ -37,6 +63,8 @@
BLT X11, X9, zero16
MOV $64, X9
BLT X11, X9, zero32
+
+ PCALIGN $16
loop64:
MOV ZERO, 0(X10)
MOV ZERO, 8(X10)
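To summarise the dispatch logic added in the first memclr_riscv64.s hunk above: for clears of at least 8 bytes the code now checks for the V extension, takes the vector loop for unaligned or larger buffers, and keeps the existing scalar path otherwise. Below is a rough Go sketch of that selection, for illustration only; the function and parameter names are invented and the code is not part of the CL.

```go
// Rough sketch of the path selection in the new assembly, for n >= 8.
// hasV mirrors internal/cpu.RISCV64.HasV; the runtime check is skipped
// entirely when the hasV assembly macro is defined (i.e. when GORISCV64
// already guarantees the V extension).
func useVectorClear(ptr, n uintptr, hasV bool) bool {
	if !hasV {
		return false // fall back to the existing scalar code
	}
	if ptr%8 != 0 {
		return true // unaligned start: take the vector loop directly
	}
	return n > 64 // 8-byte aligned: scalar code handles up to 64 bytes
}
```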
VMVVI $0, V8

The VMVVI can be moved out of the loop.
Maybe... the catch is we would need to make an equivalent `VSETVLI` call, as the M8 is effectively splatting $0 into V8..V15 (with the number of elements also controlled by the current X11 value). I also wanted to confirm what the required behaviour is if we VMVVI with one X11 value, then use it with a different value (not sure if you know the answer to this). At least for now it seemed easier/safer to call VMVVI in the loop (and it's not going to be the slow part of the code).
Code-Review +1
Yes, sorry, ignore my comment. Still trying to get my head around vector.
> I also wanted to confirm what the required behaviour is if we VMVVI with one X11 value, then use it with a different value (not sure if you know the answer to this)

I'd say it should work, but I agree, not worth it. One thing we could consider trying/benchmarking in the future is whole vector register loads and stores. This would allow us to set V8 once outside the loop, but we'd need CSR support for this (to read vlenb), possibly manual unrolling, and one call to VSETVLI to process the tail, so possibly not worth it either.
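As a toy illustration of the question above (splatting with VMVVI under one X11/vl value and then storing under a smaller one), here is a rough Go model of why it is expected to work; this is a sketch of my reading of the behaviour, not authoritative RVV semantics, and the names (`vectorClearModel`, `vlmax`) are invented.

```go
// Toy model: V8..V15 are treated as one byte buffer of vlmax elements
// (vlmax > 0). The remaining length only shrinks across iterations, so
// each store touches only elements zeroed by the one-off splat.
func vectorClearModel(dst []byte, vlmax int) {
	vreg := make([]byte, vlmax)
	vl := min(len(dst), vlmax) // first VSETVLI
	for i := range vreg[:vl] { // VMVVI $0, V8, done once under that vl
		vreg[i] = 0
	}
	for len(dst) > 0 {
		vl = min(len(dst), vlmax) // VSETVLI per iteration
		copy(dst, vreg[:vl])      // VSE8V writes only the first vl bytes
		dst = dst[vl:]
	}
}
```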
Code-Review +1
Tested on a Banana Pi. Seeing a geomean improvement of -10.32% on the Memclr tests.
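The figures quoted here come from the runtime package's Memclr benchmarks. For anyone wanting to reproduce this kind of number, a standalone benchmark along the following lines exercises the same code path, since `clear` on a pointer-free slice is lowered to `memclrNoHeapPointers`; this is an illustrative sketch with invented names, not the benchmark used above.

```go
// Standalone sketch, not the runtime's own Memclr benchmarks.
package memclr

import (
	"fmt"
	"testing"
)

func BenchmarkClear(b *testing.B) {
	for _, size := range []int{16, 64, 4096, 64 << 10} {
		buf := make([]byte, size)
		b.Run(fmt.Sprintf("%dB", size), func(b *testing.B) {
			b.SetBytes(int64(size))
			for i := 0; i < b.N; i++ {
				// clear on a []byte compiles to a memclrNoHeapPointers call.
				clear(buf)
			}
		})
	}
}
```

Comparing runs from toolchains built with and without this change using benchstat should produce the same kind of geomean delta quoted above.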
Code-Review +2
Commit-Queue +1
vector_loop:
VSETVLI X11, E8, M8, TA, MA, X5
VMVVI $0, V8
VSE8V V8, (X10)
ADD X5, X10
SUB X5, X11
BNEZ X11, vector_loop
RET

There is no need to issue VMVVI $0, V8 in every iteration. Perhaps we can initialize V8 once before entering the main vector loop.
```suggestion
vector_loop:
VSETVLI X11, E8, M8, TA, MA, X5
VMVVI $0, V8
vector_loop_body:
VSE8V V8, (X10)
ADD X5, X10
SUB X5, X11
VSETVLI X11, E8, M8, TA, MA, X5
BNEZ X11, vector_loop_body
RET
```
Code-Review +1
@wang...@bytedance.com do you know whether these changes measurably boost the performance of the code? We did discuss doing something like this in the comments above but weren't sure how much of a difference it would make.
Sorry, I didn’t look closely at the comments that were already marked as resolved. I used a similar approach in glibc’s memset implementation, and I can confirm that moving VMVVI out of the loop in this way is functionally correct. Perhaps the gains observed in the glibc memset-zero benchtest can be used as a reference? I can share the results later.
We can save one instruction by hoisting loop invariants out of the loop. This is what LLVM/GCC generate: https://godbolt.org/z/55Ej73ac9.
@mark...@rivosinc.com Based on testing BenchmarkMemclr on SpacemiT X60, moving VMVVI out of the loop provides a geomean performance improvement of around 0.2%.
Thanks for checking this, and thanks also Pengcheng for the llvm and gcc links. Looks like this is a viable optimisation.