crypto/elliptic P-256 assembly code and AssemblyPolicy updates

923 views
Skip to first unread message

Filippo Valsorda

unread,
Mar 3, 2022, 8:54:59 AM3/3/22
to golang-dev, Security Team
As part of a long-running refactor of the crypto/elliptic package, we are moving the actual implementations of named curves to crypto/elliptic/internal/nistec, which exposes a safe []byte-based API rather than a big.Int one that allows invalid points.

This involves porting the P-256 assembly to implement the nistec API. There are assembly implementations for amd64, arm64, ppc64le, and s390x. They grew organically, so the package structure is a bit of a mess. arm64 was ported from amd64, and ppc64le was ported from s390x, but the two pairs don't share interfaces and types.

I will take care of refactoring amd64 and arm64 as they are supported as first-class ports, but unfortunately we don't have the resources or expertise to port ppc64le and s390x. We also don't want to give up on completing an important refactor that will reduce ongoing maintenance burden because of the assembly cores: we can't let everything with assembly in it "rust shut".

We plan to disable ppc64le and s390x assembly, and ask that the relevant owners plan on doing the work to port it to nistec, ideally for Go 1.19. We will try to make the changes on our side in the first half of March, which leaves six weeks to port ppc64le and s390x.

It looks like both the amd64/arm64 and ppc64le/s390x implementations are 64-bit Montgomery strategies with the same(?) limb schedule, so the difference in interfaces and types might be a historical accident. If that's accurate we'd ask that the ppc64le/s390x assembly be ported to implement the same functions as the amd64/arm64 cores, which should significantly reduce complexity.

I have updated the Assembly Policy to reflect the expectations outlined above (specifically, that code that we don't explicitly take on must have an owner, and that architectures should share structure). I welcome feedback on the changes.

I realize the Assembly Policy might look like a lot of overhead, but it exists to try and make it possible for crypto assembly to exist safely in the standard library. We really appreciate contributions that comply to it, and will do our best to prioritize them. (If you have a CL that is following the policy and is not getting review, please email me directly. Things get lost in the Gerrit sea.)

Michael Munday

unread,
Mar 3, 2022, 10:01:50 AM3/3/22
to Filippo Valsorda, golang-dev, Security Team
Hi Filippo,

Some platforms provide high level implementations of cryptographic algorithms (e.g. newer s390x machines can do full AES-GCM encryption/decryption in a single instruction). When this is the case the assembly is likely to be trivial but a different interface may be required compared to a lower level implementation. Would you consider an amendment to the assembly policy to cover this scenario? For example:

"The interface of the assembly units and of the reference Go implementation must be the same across architectures unless there is a strong technical justification for doing so."

I agree that effort should be put in to make the interfaces as close as possible and avoid arbitrary differences. Allowing code to be refactored freely, including potentially removing assembly, also seems important to me and I think it is acceptable so long as reasonable steps are taken to involve the author/maintainer of the assembly in the discussion.

One other minor thing I'd like to just note is that for some authors it is risky to touch existing architecture specific implementations. If they tweak an existing interface in order to add a new implementation and it causes an performance degradation on a platform their company competes with it could lead to issues even if it was completely unintentional. This can mean there is pressure on them to make the implementations more independent and is perhaps something to be sensitive to when reviewing contributions of high performance assembly.

Thanks,
Michael


--
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-dev/CA%2B2K_Kq%2B_NmhMUo%2BH_--zRXA0qgZUuP_Ts8g3CiQVyQuFm%2B34w%40mail.gmail.com.

Filippo Valsorda

unread,
Mar 3, 2022, 10:27:53 AM3/3/22
to Michael Munday, golang-dev, Security Team
On Thu, Mar 3, 2022 at 10:01 AM Michael Munday <mike....@lowrisc.org> wrote:
Hi Filippo,

Some platforms provide high level implementations of cryptographic algorithms (e.g. newer s390x machines can do full AES-GCM encryption/decryption in a single instruction). When this is the case the assembly is likely to be trivial but a different interface may be required compared to a lower level implementation. Would you consider an amendment to the assembly policy to cover this scenario? For example:

"The interface of the assembly units and of the reference Go implementation must be the same across architectures unless there is a strong technical justification for doing so."

Ah yep, good point. Done!
 
I agree that effort should be put in to make the interfaces as close as possible and avoid arbitrary differences. Allowing code to be refactored freely, including potentially removing assembly, also seems important to me and I think it is acceptable so long as reasonable steps are taken to involve the author/maintainer of the assembly in the discussion.

Definitely. Explicitly identifying maintainers will also help with directing those discussions.

In this case, I'd be happy to add the ppc64le and s390x maintainers as reviewers to my upcoming changes so they can let me know if the interface is going to be a problem for them.
 
One other minor thing I'd like to just note is that for some authors it is risky to touch existing architecture specific implementations. If they tweak an existing interface in order to add a new implementation and it causes an performance degradation on a platform their company competes with it could lead to issues even if it was completely unintentional. This can mean there is pressure on them to make the implementations more independent and is perhaps something to be sensitive to when reviewing contributions of high performance assembly.

I can see that, and I'll keep it in mind, thank you, but unfortunately I don't have a good solution. Go is a collaborative project, and ultimately complexity/performance tradeoffs happen, even across architectures. If anyone wants to discuss such a concern, feel free to reach out privately if that would help.

lab...@linux.vnet.ibm.com

unread,
Mar 3, 2022, 10:55:03 AM3/3/22
to golang-dev

Hi, can this statement be clarified: 
  • The interface of the assembly units and of the reference Go implementation must be the same across architectures.
Because in crypto/aes in aes_gcm.go and cipher_asm.go the assembler interfaces for arm64 and amd64 are not the same as the Go implementation? Or maybe I am misunderstanding what this means.

Filippo Valsorda

unread,
Mar 3, 2022, 11:04:48 AM3/3/22
to lab...@linux.vnet.ibm.com, golang-dev
On Thu, Mar 3, 2022 at 10:55 AM lab...@linux.vnet.ibm.com <lab...@linux.vnet.ibm.com> wrote:

Hi, can this statement be clarified: 
  • The interface of the assembly units and of the reference Go implementation must be the same across architectures.
Because in crypto/aes in aes_gcm.go and cipher_asm.go the assembler interfaces for arm64 and amd64 are not the same as the Go implementation? Or maybe I am misunderstanding what this means.

Indeed, most of the pre-policy assembly does not comply with the policy. Even before the new rule, there was "Any assembly function needs a reference Go implementation, that’s tested side-by-side with the assembly." which the AES-GCM codebase doesn't follow. We introduced the policy because we had found ourselves unable to keep up with maintenance of the assembly we already had. We'd like for things to get more compliant over time, and we definitely don't want the amount of non-compliant assembly to grow. If you need to coordinate package structure changes to move in that direction please let me know! I have limited bandwidth but I am happy to use it to make things easier in the future.
 
On Thursday, March 3, 2022 at 7:54:59 AM UTC-6 fil...@golang.org wrote:
As part of a long-running refactor of the crypto/elliptic package, we are moving the actual implementations of named curves to crypto/elliptic/internal/nistec, which exposes a safe []byte-based API rather than a big.Int one that allows invalid points.

This involves porting the P-256 assembly to implement the nistec API. There are assembly implementations for amd64, arm64, ppc64le, and s390x. They grew organically, so the package structure is a bit of a mess. arm64 was ported from amd64, and ppc64le was ported from s390x, but the two pairs don't share interfaces and types.

I will take care of refactoring amd64 and arm64 as they are supported as first-class ports, but unfortunately we don't have the resources or expertise to port ppc64le and s390x. We also don't want to give up on completing an important refactor that will reduce ongoing maintenance burden because of the assembly cores: we can't let everything with assembly in it "rust shut".

We plan to disable ppc64le and s390x assembly, and ask that the relevant owners plan on doing the work to port it to nistec, ideally for Go 1.19. We will try to make the changes on our side in the first half of March, which leaves six weeks to port ppc64le and s390x.

It looks like both the amd64/arm64 and ppc64le/s390x implementations are 64-bit Montgomery strategies with the same(?) limb schedule, so the difference in interfaces and types might be a historical accident. If that's accurate we'd ask that the ppc64le/s390x assembly be ported to implement the same functions as the amd64/arm64 cores, which should significantly reduce complexity.

I have updated the Assembly Policy to reflect the expectations outlined above (specifically, that code that we don't explicitly take on must have an owner, and that architectures should share structure). I welcome feedback on the changes.

I realize the Assembly Policy might look like a lot of overhead, but it exists to try and make it possible for crypto assembly to exist safely in the standard library. We really appreciate contributions that comply to it, and will do our best to prioritize them. (If you have a CL that is following the policy and is not getting review, please email me directly. Things get lost in the Gerrit sea.)

--
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+...@googlegroups.com.

lab...@linux.vnet.ibm.com

unread,
Mar 4, 2022, 10:37:06 AM3/4/22
to golang-dev
Do you consider the current sha256 and sha512 Go and assembler implementations to comply with this policy?

Paul Murphy

unread,
Mar 9, 2022, 12:59:42 PM3/9/22
to golang-dev
I have a question of clarification regarding this portion of the assembly policy:
  • The code must be tested in our CI. This means there need to be builders that support the instructions, and if there are multiple (or fallback) paths they must be tested separately. (Tip: use GODEBUG=cpu.X=off to disable detection of CPU features.)
Some architectures unconditionally guarantee the existence of certain hardware acceleration features (e.g ppc64/ppc64le require POWER8 or newer, which indirectly mandates the POWER ISA 2.07 vector crypto extensions). If using a shared software infrastructure, such an architecture would always have a vacuous fallback path (The detection criteria for ppc would effectively be  'cpu.PPC64.IsPOWER8', which should always be true).

Would an asm implementation using POWER8 instructions meet this policy since the fallback path is effectively unreachable? Should there be a hook to disable the hardware accelerated path independent of cpu.ARCH feature flags which may not be exposed to GODEBUG?


On Thursday, March 3, 2022 at 7:54:59 AM UTC-6 fil...@golang.org wrote:

Filippo Valsorda

unread,
Mar 9, 2022, 1:03:29 PM3/9/22
to Paul Murphy, golang-dev
On Wed, Mar 9, 2022 at 12:59 PM Paul Murphy <paul.ik...@gmail.com> wrote:
I have a question of clarification regarding this portion of the assembly policy:
  • The code must be tested in our CI. This means there need to be builders that support the instructions, and if there are multiple (or fallback) paths they must be tested separately. (Tip: use GODEBUG=cpu.X=off to disable detection of CPU features.)
Some architectures unconditionally guarantee the existence of certain hardware acceleration features (e.g ppc64/ppc64le require POWER8 or newer, which indirectly mandates the POWER ISA 2.07 vector crypto extensions). If using a shared software infrastructure, such an architecture would always have a vacuous fallback path (The detection criteria for ppc would effectively be  'cpu.PPC64.IsPOWER8', which should always be true).

Would an asm implementation using POWER8 instructions meet this policy since the fallback path is effectively unreachable? Should there be a hook to disable the hardware accelerated path independent of cpu.ARCH feature flags which may not be exposed to GODEBUG?

Oh, definitely no need to test what's effectively an unreachable path because of our minimum requirements. However, it's not clear to me why we'd have that code path? Can it be replaced with a panic()? Can it be removed entirely? If it's not worth testing, it should not be worth keeping.

On Thursday, March 3, 2022 at 7:54:59 AM UTC-6 fil...@golang.org wrote:
As part of a long-running refactor of the crypto/elliptic package, we are moving the actual implementations of named curves to crypto/elliptic/internal/nistec, which exposes a safe []byte-based API rather than a big.Int one that allows invalid points.

This involves porting the P-256 assembly to implement the nistec API. There are assembly implementations for amd64, arm64, ppc64le, and s390x. They grew organically, so the package structure is a bit of a mess. arm64 was ported from amd64, and ppc64le was ported from s390x, but the two pairs don't share interfaces and types.

I will take care of refactoring amd64 and arm64 as they are supported as first-class ports, but unfortunately we don't have the resources or expertise to port ppc64le and s390x. We also don't want to give up on completing an important refactor that will reduce ongoing maintenance burden because of the assembly cores: we can't let everything with assembly in it "rust shut".

We plan to disable ppc64le and s390x assembly, and ask that the relevant owners plan on doing the work to port it to nistec, ideally for Go 1.19. We will try to make the changes on our side in the first half of March, which leaves six weeks to port ppc64le and s390x.

It looks like both the amd64/arm64 and ppc64le/s390x implementations are 64-bit Montgomery strategies with the same(?) limb schedule, so the difference in interfaces and types might be a historical accident. If that's accurate we'd ask that the ppc64le/s390x assembly be ported to implement the same functions as the amd64/arm64 cores, which should significantly reduce complexity.

I have updated the Assembly Policy to reflect the expectations outlined above (specifically, that code that we don't explicitly take on must have an owner, and that architectures should share structure). I welcome feedback on the changes.

I realize the Assembly Policy might look like a lot of overhead, but it exists to try and make it possible for crypto assembly to exist safely in the standard library. We really appreciate contributions that comply to it, and will do our best to prioritize them. (If you have a CL that is following the policy and is not getting review, please email me directly. Things get lost in the Gerrit sea.)

--
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+...@googlegroups.com.

lab...@linux.vnet.ibm.com

unread,
Mar 9, 2022, 2:51:05 PM3/9/22
to golang-dev
>Oh, definitely no need to test what's effectively an unreachable path because of our minimum requirements. However, it's not clear to me why we'd have that code path? Can it be replaced with a panic()? Can it be removed entirely? If it's not worth testing, it should not be worth keeping.

If a cpu flag controls whether an assembler implementation is executed, then one case where this could be useful is if there is a suspected bug in the assembler code. Then setting GODEBUG=cpu.HasAES=off would execute the Go code instead of the assembler code, so the two could be compared to see if the bug is in the assembler code or compare the performance differences between the two.

Paul Murphy

unread,
Mar 9, 2022, 4:15:20 PM3/9/22
to golang-dev
On Wednesday, March 9, 2022 at 12:03:29 PM UTC-6 fil...@golang.org wrote:
On Wed, Mar 9, 2022 at 12:59 PM Paul Murphy <paul.ik...@gmail.com> wrote:
I have a question of clarification regarding this portion of the assembly policy:
  • The code must be tested in our CI. This means there need to be builders that support the instructions, and if there are multiple (or fallback) paths they must be tested separately. (Tip: use GODEBUG=cpu.X=off to disable detection of CPU features.)
Some architectures unconditionally guarantee the existence of certain hardware acceleration features (e.g ppc64/ppc64le require POWER8 or newer, which indirectly mandates the POWER ISA 2.07 vector crypto extensions). If using a shared software infrastructure, such an architecture would always have a vacuous fallback path (The detection criteria for ppc would effectively be  'cpu.PPC64.IsPOWER8', which should always be true).

Would an asm implementation using POWER8 instructions meet this policy since the fallback path is effectively unreachable? Should there be a hook to disable the hardware accelerated path independent of cpu.ARCH feature flags which may not be exposed to GODEBUG?

Oh, definitely no need to test what's effectively an unreachable path because of our minimum requirements. However, it's not clear to me why we'd have that code path? Can it be replaced with a panic()? Can it be removed entirely? If it's not worth testing, it should not be worth keeping.

I noticed this in newCipher of crypto/aes/cipher_asm.go in my attempt to move the ppc64le implementation into the shared assembly wrapper. The fallback makes sense for other architectures. Perhaps it is not best example.



Filippo Valsorda

unread,
Mar 10, 2022, 11:59:53 AM3/10/22
to Paul Murphy, golang-dev
If the fallback code is the generic Go implementation that gets tested on other GOARCHs that don't have any assembly, I don't feel like we need to test generic-but-on-this-GOARCH specifically. What's important is that we don't have 1) an implementation that requires instructions that are unreleased/unavailable on the builders because too new or 2) two assembly cores, one of which doesn't get tested because it's the fallback for instructions we do have on the builders (like the SSE2 assembly not getting tested because we have AVX everywhere).

--
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+...@googlegroups.com.

Paul Murphy

unread,
Mar 25, 2022, 10:08:12 AM3/25/22
to golang-dev
Related to this, I have been working on a patch to improve carrying arithmetic on ppc64le. As someone maintaining, but not well versed in their practical usage, how do the four benchmarks overlap common usage scenarios? Which are the most important? Should I be looking at other benchmarks?

I generated, and enabled the P256 code paths similar to the other curves, and the benchmark results seemed unintuitive. Results below are on a ppc64le/p9 host, both with my pending ppc64le Add64/Sub64 patches. Old numbers are native asm, new numbers are with the tentative fiat code and the pending P256 patches:

name                                old time/op    new time/op    delta
ScalarBaseMult/P256                   46.5µs ± 0%   224.9µs ± 0%  +383.79%
ScalarMult/P256                        200µs ± 0%     226µs ± 0%   +13.03%
MarshalUnmarshal/P256/Uncompressed    2.52µs ± 0%    1.32µs ± 0%   -47.56%
MarshalUnmarshal/P256/Compressed      2.53µs ± 0%    1.32µs ± 0%   -47.93%

On Thursday, March 3, 2022 at 7:54:59 AM UTC-6 fil...@golang.org wrote:

Filippo Valsorda

unread,
Apr 4, 2022, 1:11:14 PM4/4/22
to Paul Murphy, golang-dev
On Fri, Mar 25, 2022 at 10:08 AM Paul Murphy <mu...@ibm.com> wrote:
Related to this, I have been working on a patch to improve carrying arithmetic on ppc64le. As someone maintaining, but not well versed in their practical usage, how do the four benchmarks overlap common usage scenarios? Which are the most important? Should I be looking at other benchmarks?

I generated, and enabled the P256 code paths similar to the other curves, and the benchmark results seemed unintuitive. Results below are on a ppc64le/p9 host, both with my pending ppc64le Add64/Sub64 patches. Old numbers are native asm, new numbers are with the tentative fiat code and the pending P256 patches:

name                                old time/op    new time/op    delta
ScalarBaseMult/P256                   46.5µs ± 0%   224.9µs ± 0%  +383.79%
ScalarMult/P256                        200µs ± 0%     226µs ± 0%   +13.03%
MarshalUnmarshal/P256/Uncompressed    2.52µs ± 0%    1.32µs ± 0%   -47.56%
MarshalUnmarshal/P256/Compressed      2.53µs ± 0%    1.32µs ± 0%   -47.93%

Sorry, I had missed this.

I use the crypto/elliptic and the crypto/ecdsa benchmarks for this. The former (1 ScalarBaseMult + 1 ScalarMult) are representative of an ECDH run, the latter of the higher-level ECDSA ops.

Those numbers actually look pretty good! 13% slowdown on ScalarMult dropping the assembly entirely sounds excellent to me. Note that the Compressed benchmark was broken (https://go.dev/cl/396934) but it doesn't really matter.

ScalarBaseMult uses a large, aggressively precomputed table which the nistec/fiat group logic doesn't have, so that's why you see it jump up to the ScalarMult performance. If that's the only thing stopping you from dropping the assembly for the fiat code, I can implement the precomputation in nistec.P256Point.ScalarBaseMult. Let me know. I really appreciate that you invested in making the Go code faster rather than adapting the assembly, and I'm happy to help finish the work.

(By the way, you might want to base all your benchmarks on top of https://go.dev/cl/382995.)
 
On Thursday, March 3, 2022 at 7:54:59 AM UTC-6 fil...@golang.org wrote:
As part of a long-running refactor of the crypto/elliptic package, we are moving the actual implementations of named curves to crypto/elliptic/internal/nistec, which exposes a safe []byte-based API rather than a big.Int one that allows invalid points.

This involves porting the P-256 assembly to implement the nistec API. There are assembly implementations for amd64, arm64, ppc64le, and s390x. They grew organically, so the package structure is a bit of a mess. arm64 was ported from amd64, and ppc64le was ported from s390x, but the two pairs don't share interfaces and types.

I will take care of refactoring amd64 and arm64 as they are supported as first-class ports, but unfortunately we don't have the resources or expertise to port ppc64le and s390x. We also don't want to give up on completing an important refactor that will reduce ongoing maintenance burden because of the assembly cores: we can't let everything with assembly in it "rust shut".

We plan to disable ppc64le and s390x assembly, and ask that the relevant owners plan on doing the work to port it to nistec, ideally for Go 1.19. We will try to make the changes on our side in the first half of March, which leaves six weeks to port ppc64le and s390x.

It looks like both the amd64/arm64 and ppc64le/s390x implementations are 64-bit Montgomery strategies with the same(?) limb schedule, so the difference in interfaces and types might be a historical accident. If that's accurate we'd ask that the ppc64le/s390x assembly be ported to implement the same functions as the amd64/arm64 cores, which should significantly reduce complexity.

I have updated the Assembly Policy to reflect the expectations outlined above (specifically, that code that we don't explicitly take on must have an owner, and that architectures should share structure). I welcome feedback on the changes.

I realize the Assembly Policy might look like a lot of overhead, but it exists to try and make it possible for crypto assembly to exist safely in the standard library. We really appreciate contributions that comply to it, and will do our best to prioritize them. (If you have a CL that is following the policy and is not getting review, please email me directly. Things get lost in the Gerrit sea.)

--
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+...@googlegroups.com.

Filippo Valsorda

unread,
Apr 4, 2022, 1:19:04 PM4/4/22
to golang-dev, Security Team
As a general update on this, CL 382995 is the top of the stack that makes the announced changes, including moving the amd64 and arm64 assembly to crypto/elliptic/internal/nistec, refactoring the assembly interface, and disabling the ppc64le and s390x assembly.

The stack is still going through review, which will take some time, but you should consider the assembly interface stable, and can start building on top of it.

My plans for the rest of the cycle are to add granular tests for the assembly routines, but I don't have an ETA for that and you should not block on it.

Please reach out if you have any questions.

Paul Murphy

unread,
Apr 6, 2022, 12:49:49 PM4/6/22
to golang-dev


On Monday, April 4, 2022 at 12:11:14 PM UTC-5 fil...@golang.org wrote:
On Fri, Mar 25, 2022 at 10:08 AM Paul Murphy <mu...@ibm.com> wrote:
Related to this, I have been working on a patch to improve carrying arithmetic on ppc64le. As someone maintaining, but not well versed in their practical usage, how do the four benchmarks overlap common usage scenarios? Which are the most important? Should I be looking at other benchmarks?

I generated, and enabled the P256 code paths similar to the other curves, and the benchmark results seemed unintuitive. Results below are on a ppc64le/p9 host, both with my pending ppc64le Add64/Sub64 patches. Old numbers are native asm, new numbers are with the tentative fiat code and the pending P256 patches:

name                                old time/op    new time/op    delta
ScalarBaseMult/P256                   46.5µs ± 0%   224.9µs ± 0%  +383.79%
ScalarMult/P256                        200µs ± 0%     226µs ± 0%   +13.03%
MarshalUnmarshal/P256/Uncompressed    2.52µs ± 0%    1.32µs ± 0%   -47.56%
MarshalUnmarshal/P256/Compressed      2.53µs ± 0%    1.32µs ± 0%   -47.93%

Sorry, I had missed this.

I use the crypto/elliptic and the crypto/ecdsa benchmarks for this. The former (1 ScalarBaseMult + 1 ScalarMult) are representative of an ECDH run, the latter of the higher-level ECDSA ops.

Those numbers actually look pretty good! 13% slowdown on ScalarMult dropping the assembly entirely sounds excellent to me. Note that the Compressed benchmark was broken (https://go.dev/cl/396934) but it doesn't really matter.

ScalarBaseMult uses a large, aggressively precomputed table which the nistec/fiat group logic doesn't have, so that's why you see it jump up to the ScalarMult performance. If that's the only thing stopping you from dropping the assembly for the fiat code, I can implement the precomputation in nistec.P256Point.ScalarBaseMult. Let me know. I really appreciate that you invested in making the Go code faster rather than adapting the assembly, and I'm happy to help finish the work.

(By the way, you might want to base all your benchmarks on top of https://go.dev/cl/382995.)

If it can approach the speedup of the asm code, maybe the asm isn't needed. If you have the bandwidth to spin such a change (using the fiat code), I am curious how close it can get. The assembly port Lynn is working on is still substantially (2-4x) faster under the ecdsa P256 benchmarks when rebased against your CLs.

A lesser observation I noticed is the fiat conditional move function does not optimize very well (on ppc64). Using a more direct implementation (and updating related call sites) to:

func p256CmovznzU64(arg1 p256Uint1, arg2 uint64, arg3 uint64) uint64{
        var out uint64
        if arg1 == 0 {
                out = arg2
        } else {
                out = arg3
        }
        return out
}

Results in a 15% improvement on PPC64. This allows much better code generation.

Filippo Valsorda

unread,
Apr 6, 2022, 1:17:18 PM4/6/22
to Paul Murphy, golang-dev
On Wed, Apr 6, 2022 at 12:49 PM Paul Murphy <mu...@ibm.com> wrote:
On Monday, April 4, 2022 at 12:11:14 PM UTC-5 fil...@golang.org wrote:
On Fri, Mar 25, 2022 at 10:08 AM Paul Murphy <mu...@ibm.com> wrote:
Related to this, I have been working on a patch to improve carrying arithmetic on ppc64le. As someone maintaining, but not well versed in their practical usage, how do the four benchmarks overlap common usage scenarios? Which are the most important? Should I be looking at other benchmarks?

I generated, and enabled the P256 code paths similar to the other curves, and the benchmark results seemed unintuitive. Results below are on a ppc64le/p9 host, both with my pending ppc64le Add64/Sub64 patches. Old numbers are native asm, new numbers are with the tentative fiat code and the pending P256 patches:

name                                old time/op    new time/op    delta
ScalarBaseMult/P256                   46.5µs ± 0%   224.9µs ± 0%  +383.79%
ScalarMult/P256                        200µs ± 0%     226µs ± 0%   +13.03%
MarshalUnmarshal/P256/Uncompressed    2.52µs ± 0%    1.32µs ± 0%   -47.56%
MarshalUnmarshal/P256/Compressed      2.53µs ± 0%    1.32µs ± 0%   -47.93%

Sorry, I had missed this.

I use the crypto/elliptic and the crypto/ecdsa benchmarks for this. The former (1 ScalarBaseMult + 1 ScalarMult) are representative of an ECDH run, the latter of the higher-level ECDSA ops.

Those numbers actually look pretty good! 13% slowdown on ScalarMult dropping the assembly entirely sounds excellent to me. Note that the Compressed benchmark was broken (https://go.dev/cl/396934) but it doesn't really matter.

ScalarBaseMult uses a large, aggressively precomputed table which the nistec/fiat group logic doesn't have, so that's why you see it jump up to the ScalarMult performance. If that's the only thing stopping you from dropping the assembly for the fiat code, I can implement the precomputation in nistec.P256Point.ScalarBaseMult. Let me know. I really appreciate that you invested in making the Go code faster rather than adapting the assembly, and I'm happy to help finish the work.

(By the way, you might want to base all your benchmarks on top of https://go.dev/cl/382995.)

If it can approach the speedup of the asm code, maybe the asm isn't needed. If you have the bandwidth to spin such a change (using the fiat code), I am curious how close it can get. The assembly port Lynn is working on is still substantially (2-4x) faster under the ecdsa P256 benchmarks when rebased against your CLs.

Hmm, I thought the +13.03% above was [fiat with compiler changes] vs assembly, which made removing the assembly sound doable. However, it sounds there is also a [new assembly port] target and that [fiat with compiler changes] vs [new assembly port] is more like +100% to +300%? That makes it sound much less likely we'll get to drop the assembly. Or do I misunderstand?
 
A lesser observation I noticed is the fiat conditional move function does not optimize very well (on ppc64). Using a more direct implementation (and updating related call sites) to:

func p256CmovznzU64(arg1 p256Uint1, arg2 uint64, arg3 uint64) uint64{
        var out uint64
        if arg1 == 0 {
                out = arg2
        } else {
                out = arg3
        }
        return out
}

Results in a 15% improvement on PPC64. This allows much better code generation.

It's not a CMOV anymore though, as the C stands for constant time :) Branches (over secret data, which arg1 here is assumed to be) are not allowed in constant time code.

If it would make such a difference though maybe we should intrinsify and use subtle.ConstantTimeSelect (which annoyingly works over int though, so there will be casts) or add an intrinsic to math/bits like bits.Add64?

The fiat folks are friendly and open to performance improvements, so we can improve the code, that's not a problem.

Paul Murphy

unread,
Apr 6, 2022, 4:27:20 PM4/6/22
to golang-dev
On Wednesday, April 6, 2022 at 12:17:18 PM UTC-5 fil...@golang.org wrote:

Hmm, I thought the +13.03% above was [fiat with compiler changes] vs assembly, which made removing the assembly sound doable. However, it sounds there is also a [new assembly port] target and that [fiat with compiler changes] vs [new assembly port] is more like +100% to +300%? That makes it sound much less likely we'll get to drop the assembly. Or do I misunderstand?

That is correct with respect to P256 on ppc64le; Lynn is updating the asm to work with CL 382995. Though, I think there may be some hope for native code.

I am wondering, if ScalarBaseMult performed similarly to it's asm counterpart, would it close most or all of that performance gap? Native go is still 4x off.
 
 
A lesser observation I noticed is the fiat conditional move function does not optimize very well (on ppc64). Using a more direct implementation (and updating related call sites) to:

func p256CmovznzU64(arg1 p256Uint1, arg2 uint64, arg3 uint64) uint64{
        var out uint64
        if arg1 == 0 {
                out = arg2
        } else {
                out = arg3
        }
        return out
}

Results in a 15% improvement on PPC64. This allows much better code generation.

It's not a CMOV anymore though, as the C stands for constant time :) Branches (over secret data, which arg1 here is assumed to be) are not allowed in constant time code.

If it would make such a difference though maybe we should intrinsify and use subtle.ConstantTimeSelect (which annoyingly works over int though, so there will be casts) or add an intrinsic to math/bits like bits.Add64?

Understood, it was a quick way to get it to lower into more ideal branch and stack free code on ppc64. Despite inlining, go still passes the result through memory which is not ideal on PPC64 (and likely others). Returning out1 by value would likely improve performance by as much. A builtin might make the intent clearer, but lowering does try recognize and optimize some constant time selects. I can probably improve those on ppc64.

Filippo Valsorda

unread,
Apr 6, 2022, 4:38:54 PM4/6/22
to Paul Murphy, golang-dev
On Wed, Apr 6, 2022 at 4:27 PM Paul Murphy <mu...@ibm.com> wrote:
On Wednesday, April 6, 2022 at 12:17:18 PM UTC-5 fil...@golang.org wrote:

Hmm, I thought the +13.03% above was [fiat with compiler changes] vs assembly, which made removing the assembly sound doable. However, it sounds there is also a [new assembly port] target and that [fiat with compiler changes] vs [new assembly port] is more like +100% to +300%? That makes it sound much less likely we'll get to drop the assembly. Or do I misunderstand?

That is correct with respect to P256 on ppc64le; Lynn is updating the asm to work with CL 382995. Though, I think there may be some hope for native code.

I am wondering, if ScalarBaseMult performed similarly to it's asm counterpart, would it close most or all of that performance gap? Native go is still 4x off.

The fundamental benchmark you need to look at is ScalarMult/P256. If you can get that benchmark for the fiat Go code within the ballpark of the assembly, then I can fix ScalarBaseMult/P256 for you by introducing ScalarBaseMult-specific optimizations in the nistec package, but fixing ScalarBaseMult/P256 won't help with ScalarMult/P256.
 
 
A lesser observation I noticed is the fiat conditional move function does not optimize very well (on ppc64). Using a more direct implementation (and updating related call sites) to:

func p256CmovznzU64(arg1 p256Uint1, arg2 uint64, arg3 uint64) uint64{
        var out uint64
        if arg1 == 0 {
                out = arg2
        } else {
                out = arg3
        }
        return out
}

Results in a 15% improvement on PPC64. This allows much better code generation.

It's not a CMOV anymore though, as the C stands for constant time :) Branches (over secret data, which arg1 here is assumed to be) are not allowed in constant time code.

If it would make such a difference though maybe we should intrinsify and use subtle.ConstantTimeSelect (which annoyingly works over int though, so there will be casts) or add an intrinsic to math/bits like bits.Add64?

Understood, it was a quick way to get it to lower into more ideal branch and stack free code on ppc64. Despite inlining, go still passes the result through memory which is not ideal on PPC64 (and likely others). Returning out1 by value would likely improve performance by as much. A builtin might make the intent clearer, but lowering does try recognize and optimize some constant time selects. I can probably improve those on ppc64.
 
The fiat folks are friendly and open to performance improvements, so we can improve the code, that's not a problem.

--
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+...@googlegroups.com.

Paul Murphy

unread,
Apr 6, 2022, 5:22:00 PM4/6/22
to golang-dev
On Wednesday, April 6, 2022 at 3:38:54 PM UTC-5 fil...@golang.org wrote:
On Wed, Apr 6, 2022 at 4:27 PM Paul Murphy <mu...@ibm.com> wrote:
On Wednesday, April 6, 2022 at 12:17:18 PM UTC-5 fil...@golang.org wrote:

Hmm, I thought the +13.03% above was [fiat with compiler changes] vs assembly, which made removing the assembly sound doable. However, it sounds there is also a [new assembly port] target and that [fiat with compiler changes] vs [new assembly port] is more like +100% to +300%? That makes it sound much less likely we'll get to drop the assembly. Or do I misunderstand?

That is correct with respect to P256 on ppc64le; Lynn is updating the asm to work with CL 382995. Though, I think there may be some hope for native code.

I am wondering, if ScalarBaseMult performed similarly to it's asm counterpart, would it close most or all of that performance gap? Native go is still 4x off.

The fundamental benchmark you need to look at is ScalarMult/P256. If you can get that benchmark for the fiat Go code within the ballpark of the assembly, then I can fix ScalarBaseMult/P256 for you by introducing ScalarBaseMult-specific optimizations in the nistec package, but fixing ScalarBaseMult/P256 won't help with ScalarMult/P256.

ScalarMult/P256 benchmarks as well or better than ppc64 asm with compiler+fiat improvements. The P256 tls and ecdsa benchmarks are still substantially slower with compiler+fiat improvements.

 

Filippo Valsorda

unread,
Apr 6, 2022, 7:09:05 PM4/6/22
to Paul Murphy, golang-dev
Fantastic! Then yeah, I can close the ScalarBaseMult/P256 gap, which should almost surely fix the ECDSA and TLS benchmarks, too. They amount to a ScalarBaseMult, a ScalarMult, and for ECDSA a scalar field inversion for which I can pull in a fiat scalar field implementation as I was sort of already planning to.

Is there a builder I can run benchmarks on, and if so what CLs should I patch in?

lab...@linux.vnet.ibm.com

unread,
Apr 8, 2022, 8:51:47 AM4/8/22
to golang-dev
Hi,

I have not created a CL yet for the assembler patch, I was waiting for your changes to be merged. I don't think Paul has one for his either.
You would have to ask Cherry or somebody there on how to run on the Power builders. We don't have access to those. We have plenty of systems to use here so that is usually not a problem.
I could create a CL but could I need some directions on how to build on top of your CL because it is not obvious how to do that. I couldn't find any documentation on it and I couldn't make it work.

Reply all
Reply to author
Forward
0 new messages