runtime: optimize zeroing of registers in secret_amd64.s
Use VPXORQ instead of VMOVAPD because the former, when in the form of a zeroing idiom, is handled directly by the renamer.
Tweak also the KXORQs to operate each on a single register, making it trivial to understand what the intent is, and so that all can potentially execute in parallel.
This PR will be imported into Gerrit with the title and first
comment (this text) used to generate the subject and body of
the Gerrit change.
**Please ensure you adhere to every item in this list.**
More info can be found at https://github.com/golang/go/wiki/CommitMessage
+ The PR title is formatted as follows: `net/http: frob the quux before blarfing`
+ The package name goes before the colon
+ The part after the colon uses the verb tense + phrase that completes the blank in,
"This change modifies Go to ___________"
+ Lowercase verb after the colon
+ No trailing period
+ Keep the title as short as possible. ideally under 76 characters or shorter
+ No Markdown
+ The first PR comment (this one) is wrapped at 76 characters, unless it's
really needed (ASCII art, table, or long link)
+ If there is a corresponding issue, add either `Fixes #1234` or `Updates #1234`
(the latter if this is not a complete fix) to this comment
+ If referring to a repo other than `golang/go` you can use the
`owner/repo#issue_number` syntax: `Fixes golang/tools#1234`
+ We do not use Signed-off-by lines in Go. Please don't add them.
Our Gerrit server & GitHub bots enforce CLA compliance instead.
+ Delete these instructions once you have read and applied them
diff --git a/src/runtime/secret_amd64.s b/src/runtime/secret_amd64.s
index 06103d1..519fe27 100644
--- a/src/runtime/secret_amd64.s
+++ b/src/runtime/secret_amd64.s
@@ -71,33 +71,40 @@
JNE noavx512
// Zero X16-X31
- // Note that VZEROALL above already cleared Z0-Z15.
- VMOVAPD Z0, Z16
- VMOVAPD Z0, Z17
- VMOVAPD Z0, Z18
- VMOVAPD Z0, Z19
- VMOVAPD Z0, Z20
- VMOVAPD Z0, Z21
- VMOVAPD Z0, Z22
- VMOVAPD Z0, Z23
- VMOVAPD Z0, Z24
- VMOVAPD Z0, Z25
- VMOVAPD Z0, Z26
- VMOVAPD Z0, Z27
- VMOVAPD Z0, Z28
- VMOVAPD Z0, Z29
- VMOVAPD Z0, Z30
- VMOVAPD Z0, Z31
+ // VPXORQ r, r, r is a zeroing idiom according to section
+ // 3.5.1.7 "Clearing Registers and Dependency Breaking Idioms" in
+ // "Intel® 64 and IA-32 Architectures Optimization Reference Manual: Volume 1"
+ // (April 2024)
+ VPXORQ Z16, Z16, Z16
+ VPXORQ Z17, Z17, Z17
+ VPXORQ Z18, Z18, Z18
+ VPXORQ Z19, Z19, Z19
+ VPXORQ Z20, Z20, Z20
+ VPXORQ Z21, Z21, Z21
+ VPXORQ Z22, Z22, Z22
+ VPXORQ Z23, Z23, Z23
+ VPXORQ Z24, Z24, Z24
+ VPXORQ Z25, Z25, Z25
+ VPXORQ Z26, Z26, Z26
+ VPXORQ Z27, Z27, Z27
+ VPXORQ Z28, Z28, Z28
+ VPXORQ Z29, Z29, Z29
+ VPXORQ Z30, Z30, Z30
+ VPXORQ Z31, Z31, Z31
// Zero k0-k7
+ // While these are not categorized as zeroing idioms, having them
+ // operate on a single register per instruction makes it easy to
+ // understand what each instruction does.
+ // Note: for wider compatibility these could equally also be KXORW.
KXORQ K0, K0, K0
- KXORQ K0, K0, K1
- KXORQ K0, K0, K2
- KXORQ K0, K0, K3
- KXORQ K0, K0, K4
- KXORQ K0, K0, K5
- KXORQ K0, K0, K6
- KXORQ K0, K0, K7
+ KXORQ K1, K1, K1
+ KXORQ K2, K2, K2
+ KXORQ K3, K3, K3
+ KXORQ K4, K4, K4
+ KXORQ K5, K5, K5
+ KXORQ K6, K6, K6
+ KXORQ K7, K7, K7
noavx512:
// misc registers
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
I spotted some possible problems with your PR:
1. Do you still have the GitHub PR instructions in your commit message text? The PR instructions should be deleted once you have applied them.
2. Do you have the right bug reference format? For this repo, the format is usually 'Fixes #12345' or 'Updates #12345' at the end of the commit message.
Please address any problems by updating the GitHub PR.
When complete, mark this comment as 'Done' and click the [blue 'Reply' button](https://go.dev/wiki/GerritBot#i-left-a-reply-to-a-comment-in-gerrit-but-no-one-but-me-can-see-it) above. These findings are based on heuristics; if a finding does not apply, briefly reply here saying so.
To update the commit title or commit message body shown here in Gerrit, you must edit the GitHub PR title and PR description (the first comment) in the GitHub web interface using the 'Edit' button or 'Edit' menu entry there. Note: pushing a new commit to the PR will not automatically update the commit message used by Gerrit.
For more details, see:
(In general for Gerrit code reviews, the change author is expected to [log in to Gerrit](https://go-review.googlesource.com/login/) with a Gmail or other Google account and then close out each piece of feedback by marking it as 'Done' if implemented as suggested or otherwise reply to each review comment. See the [Review](https://go.dev/doc/contribute#review) section of the Contributing Guide for details.)
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
It isn't to obvious why this needs to be changed.
The register renaming unit also handle register register moves.
Is there something that would show why this needs to be done ?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
I spotted some possible problems with your PR:
1. Do you still have the GitHub PR instructions in your commit message text? The PR instructions should be deleted once you have applied them.
2. Do you have the right bug reference format? For this repo, the format is usually 'Fixes #12345' or 'Updates #12345' at the end of the commit message.Please address any problems by updating the GitHub PR.
When complete, mark this comment as 'Done' and click the [blue 'Reply' button](https://go.dev/wiki/GerritBot#i-left-a-reply-to-a-comment-in-gerrit-but-no-one-but-me-can-see-it) above. These findings are based on heuristics; if a finding does not apply, briefly reply here saying so.
To update the commit title or commit message body shown here in Gerrit, you must edit the GitHub PR title and PR description (the first comment) in the GitHub web interface using the 'Edit' button or 'Edit' menu entry there. Note: pushing a new commit to the PR will not automatically update the commit message used by Gerrit.
For more details, see:
- [how to update commit messages](https://go.dev/wiki/GerritBot/#how-does-gerritbot-determine-the-final-commit-message) for PRs imported into Gerrit.
- the Go project's [conventions for commit messages](https://go.dev/doc/contribute#commit_messages) that you should follow.
(In general for Gerrit code reviews, the change author is expected to [log in to Gerrit](https://go-review.googlesource.com/login/) with a Gmail or other Google account and then close out each piece of feedback by marking it as 'Done' if implemented as suggested or otherwise reply to each review comment. See the [Review](https://go.dev/doc/contribute#review) section of the Contributing Guide for details.)
Done
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
It isn't to obvious why this needs to be changed.
The register renaming unit also handle register register moves.Is there something that would show why this needs to be done ?
Fair question. Two reasons:
FWIW the second does not apply to KXORQ, but we're not changing that as there is apparently no better alternative.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Code-Review | +1 |
Generally supportive of this change, but would be interested in seeing some benchmarks. I'm guessing that this all falls into the background compared to the stack clearing and that the process as a whole stalls out-of-order execution because there are no encoded registers for the CPU to execute ahead with.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Carlo Alberto FerrarisIt isn't to obvious why this needs to be changed.
The register renaming unit also handle register register moves.Is there something that would show why this needs to be done ?
Fair question. Two reasons:
- the XOR version does not depend on the semantics of VZEROALL, hence (as it has no preconditions) it's easier to interpret for a reader of the code.
- from the referenced document: "Assembly/Compiler Coding Rule 32. (M impact, ML generality) Use dependency-breaking-idiom instructions to set a register to 0, or to break a false dependence chain resulting from re-use of registers. In contexts where the condition codes must be preserved, move 0 into the register instead. This requires more code space than using XOR and SUB, but avoids setting the condition codes.". AFAICT VMOVAPD is not classified as a dependency-breaking-idiom in the referenced doc, and we do not need to preserve the condition codes.
FWIW the second does not apply to KXORQ, but we're not changing that as there is apparently no better alternative.
I hadn't considered the first point.
For the second point, moves are not dependency breaking since that would mean whatever uses the result of a move could execute before the move's dependencies non speculatively.
My point is that a move is just as fast as a dependency breaking instruction if the dependency is always resolved by the time the consumer of the move needs them.
(the register renaming unit does not need to wait on dependencies and it can handle dependent renames in the same cycle)
Which for a zero value either loaded a long time ago (as part of the ABI), or materialized from a dependency-breaking 0 idiom will always be true.
So I guess your code change is a performance noop except maybe pollute the memory caches a bit since you wrote `VPXORQ Z$, Z$, Z$` rather than `VPXORQ Z0, Z0, Z31` (registers `>= 16` needs a prefix since they weren't originally supported in VEX altho I didn't checked if that cost is paid for each parameter or once if any parameter uses a register `>= 16`)
But I doubt anyone can measure this here.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
A couple points here:
Latency of these instructions is irrelevant. No one should be using the results. If they all were really dependent on each other (which I don't think they current are - the renamer would understand this idiom I think), that would almost not matter. In fact, dependent might be better, as we would do a trickle of zeroing in the background leaving other functional units open for other instructions. (Of course if the renamer does the work, maybe irrelevant anyway.)
Performance in general here really doesn't matter. `secret.Do` is expected to be expensive. I would opt for whatever encoding uses the fewest number of bytes.
If we benchmark this lets use some newer amds and intel to judge impact.
On some systems that handle more moves then alu ops and dont have zeroing optimization this may regress and not always be a win IF performance is the ultimate motivator here. Which there are arguments already made that maybe should not.
Some benchmarks across generations are also here: https://chipsandcheese.com/i/138977313/renameallocate
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
I will replace "optimize" that may give the wrong impression with "canonicalize" that hopefully expresses intent better. The main reasons for this change are as follows:
I am not claiming this will be faster, just that it will be easier to read and more idiomatic. Will make it clear in the CL description.
I will replace "optimize" that may give the wrong impression with "canonicalize" that hopefully expresses intent better. The main reasons for this change are as follows:
- for a human reader understanding of what each instruction does becomes independent of all other instructions (this is not the case currently, for the VMOVs depend on the semantics of the ZEROALL)
- both intel and amd optimization guides describe the VXOR as one of the "zeroing idioms" to be preferred to moving zeros
I am not claiming this will be faster, just that it will be easier to read and more idiomatic. Will make it clear in the CL description.
FWIW, somebody more informed than me may want to also consider whether it makes any sense to have VZEROALL *immediately followed* by other instructions that operate on ZMM registers. My vague recollection is that VZEROALL/VZEROUPPER should be mostly executed to avoid penalties when transitioning from VEX to non-VEX code. Immediately executing more VEX code after the VZEROALL, and then not executing a VZEROUPPER/VZEROALL when done, seems counter to that.