--
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-dev/CA%2B2K_Kq%2B_NmhMUo%2BH_--zRXA0qgZUuP_Ts8g3CiQVyQuFm%2B34w%40mail.gmail.com.
Hi Filippo,Some platforms provide high level implementations of cryptographic algorithms (e.g. newer s390x machines can do full AES-GCM encryption/decryption in a single instruction). When this is the case the assembly is likely to be trivial but a different interface may be required compared to a lower level implementation. Would you consider an amendment to the assembly policy to cover this scenario? For example:"The interface of the assembly units and of the reference Go implementation must be the same across architectures unless there is a strong technical justification for doing so."
I agree that effort should be put in to make the interfaces as close as possible and avoid arbitrary differences. Allowing code to be refactored freely, including potentially removing assembly, also seems important to me and I think it is acceptable so long as reasonable steps are taken to involve the author/maintainer of the assembly in the discussion.
One other minor thing I'd like to just note is that for some authors it is risky to touch existing architecture specific implementations. If they tweak an existing interface in order to add a new implementation and it causes an performance degradation on a platform their company competes with it could lead to issues even if it was completely unintentional. This can mean there is pressure on them to make the implementations more independent and is perhaps something to be sensitive to when reviewing contributions of high performance assembly.
Hi, can this statement be clarified:
- The interface of the assembly units and of the reference Go implementation must be the same across architectures.
Because in crypto/aes in aes_gcm.go and cipher_asm.go the assembler interfaces for arm64 and amd64 are not the same as the Go implementation? Or maybe I am misunderstanding what this means.
On Thursday, March 3, 2022 at 7:54:59 AM UTC-6 fil...@golang.org wrote:As part of a long-running refactor of the crypto/elliptic package, we are moving the actual implementations of named curves to crypto/elliptic/internal/nistec, which exposes a safe []byte-based API rather than a big.Int one that allows invalid points.This involves porting the P-256 assembly to implement the nistec API. There are assembly implementations for amd64, arm64, ppc64le, and s390x. They grew organically, so the package structure is a bit of a mess. arm64 was ported from amd64, and ppc64le was ported from s390x, but the two pairs don't share interfaces and types.I will take care of refactoring amd64 and arm64 as they are supported as first-class ports, but unfortunately we don't have the resources or expertise to port ppc64le and s390x. We also don't want to give up on completing an important refactor that will reduce ongoing maintenance burden because of the assembly cores: we can't let everything with assembly in it "rust shut".We plan to disable ppc64le and s390x assembly, and ask that the relevant owners plan on doing the work to port it to nistec, ideally for Go 1.19. We will try to make the changes on our side in the first half of March, which leaves six weeks to port ppc64le and s390x.It looks like both the amd64/arm64 and ppc64le/s390x implementations are 64-bit Montgomery strategies with the same(?) limb schedule, so the difference in interfaces and types might be a historical accident. If that's accurate we'd ask that the ppc64le/s390x assembly be ported to implement the same functions as the amd64/arm64 cores, which should significantly reduce complexity.I have updated the Assembly Policy to reflect the expectations outlined above (specifically, that code that we don't explicitly take on must have an owner, and that architectures should share structure). I welcome feedback on the changes.I realize the Assembly Policy might look like a lot of overhead, but it exists to try and make it possible for crypto assembly to exist safely in the standard library. We really appreciate contributions that comply to it, and will do our best to prioritize them. (If you have a CL that is following the policy and is not getting review, please email me directly. Things get lost in the Gerrit sea.)
--
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-dev/29ff0d1c-b9a8-4a96-ab2e-5f54ec95185bn%40googlegroups.com.
I have a question of clarification regarding this portion of the assembly policy:
- The code must be tested in our CI. This means there need to be builders that support the instructions, and if there are multiple (or fallback) paths they must be tested separately. (Tip: use GODEBUG=cpu.X=off to disable detection of CPU features.)
Some architectures unconditionally guarantee the existence of certain hardware acceleration features (e.g ppc64/ppc64le require POWER8 or newer, which indirectly mandates the POWER ISA 2.07 vector crypto extensions). If using a shared software infrastructure, such an architecture would always have a vacuous fallback path (The detection criteria for ppc would effectively be 'cpu.PPC64.IsPOWER8', which should always be true).Would an asm implementation using POWER8 instructions meet this policy since the fallback path is effectively unreachable? Should there be a hook to disable the hardware accelerated path independent of cpu.ARCH feature flags which may not be exposed to GODEBUG?
On Thursday, March 3, 2022 at 7:54:59 AM UTC-6 fil...@golang.org wrote:As part of a long-running refactor of the crypto/elliptic package, we are moving the actual implementations of named curves to crypto/elliptic/internal/nistec, which exposes a safe []byte-based API rather than a big.Int one that allows invalid points.This involves porting the P-256 assembly to implement the nistec API. There are assembly implementations for amd64, arm64, ppc64le, and s390x. They grew organically, so the package structure is a bit of a mess. arm64 was ported from amd64, and ppc64le was ported from s390x, but the two pairs don't share interfaces and types.I will take care of refactoring amd64 and arm64 as they are supported as first-class ports, but unfortunately we don't have the resources or expertise to port ppc64le and s390x. We also don't want to give up on completing an important refactor that will reduce ongoing maintenance burden because of the assembly cores: we can't let everything with assembly in it "rust shut".We plan to disable ppc64le and s390x assembly, and ask that the relevant owners plan on doing the work to port it to nistec, ideally for Go 1.19. We will try to make the changes on our side in the first half of March, which leaves six weeks to port ppc64le and s390x.It looks like both the amd64/arm64 and ppc64le/s390x implementations are 64-bit Montgomery strategies with the same(?) limb schedule, so the difference in interfaces and types might be a historical accident. If that's accurate we'd ask that the ppc64le/s390x assembly be ported to implement the same functions as the amd64/arm64 cores, which should significantly reduce complexity.I have updated the Assembly Policy to reflect the expectations outlined above (specifically, that code that we don't explicitly take on must have an owner, and that architectures should share structure). I welcome feedback on the changes.I realize the Assembly Policy might look like a lot of overhead, but it exists to try and make it possible for crypto assembly to exist safely in the standard library. We really appreciate contributions that comply to it, and will do our best to prioritize them. (If you have a CL that is following the policy and is not getting review, please email me directly. Things get lost in the Gerrit sea.)
--
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-dev/2dbced91-5815-4cb4-9688-33355cafbfe4n%40googlegroups.com.
On Wed, Mar 9, 2022 at 12:59 PM Paul Murphy <paul.ik...@gmail.com> wrote:I have a question of clarification regarding this portion of the assembly policy:
- The code must be tested in our CI. This means there need to be builders that support the instructions, and if there are multiple (or fallback) paths they must be tested separately. (Tip: use GODEBUG=cpu.X=off to disable detection of CPU features.)
Some architectures unconditionally guarantee the existence of certain hardware acceleration features (e.g ppc64/ppc64le require POWER8 or newer, which indirectly mandates the POWER ISA 2.07 vector crypto extensions). If using a shared software infrastructure, such an architecture would always have a vacuous fallback path (The detection criteria for ppc would effectively be 'cpu.PPC64.IsPOWER8', which should always be true).Would an asm implementation using POWER8 instructions meet this policy since the fallback path is effectively unreachable? Should there be a hook to disable the hardware accelerated path independent of cpu.ARCH feature flags which may not be exposed to GODEBUG?Oh, definitely no need to test what's effectively an unreachable path because of our minimum requirements. However, it's not clear to me why we'd have that code path? Can it be replaced with a panic()? Can it be removed entirely? If it's not worth testing, it should not be worth keeping.
--
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-dev/dae3dd1f-f0b1-4e86-9aa0-262f01a3c184n%40googlegroups.com.
Related to this, I have been working on a patch to improve carrying arithmetic on ppc64le. As someone maintaining, but not well versed in their practical usage, how do the four benchmarks overlap common usage scenarios? Which are the most important? Should I be looking at other benchmarks?I generated, and enabled the P256 code paths similar to the other curves, and the benchmark results seemed unintuitive. Results below are on a ppc64le/p9 host, both with my pending ppc64le Add64/Sub64 patches. Old numbers are native asm, new numbers are with the tentative fiat code and the pending P256 patches:name old time/op new time/op delta
ScalarBaseMult/P256 46.5µs ± 0% 224.9µs ± 0% +383.79%
ScalarMult/P256 200µs ± 0% 226µs ± 0% +13.03%
MarshalUnmarshal/P256/Uncompressed 2.52µs ± 0% 1.32µs ± 0% -47.56%
MarshalUnmarshal/P256/Compressed 2.53µs ± 0% 1.32µs ± 0% -47.93%
On Thursday, March 3, 2022 at 7:54:59 AM UTC-6 fil...@golang.org wrote:As part of a long-running refactor of the crypto/elliptic package, we are moving the actual implementations of named curves to crypto/elliptic/internal/nistec, which exposes a safe []byte-based API rather than a big.Int one that allows invalid points.This involves porting the P-256 assembly to implement the nistec API. There are assembly implementations for amd64, arm64, ppc64le, and s390x. They grew organically, so the package structure is a bit of a mess. arm64 was ported from amd64, and ppc64le was ported from s390x, but the two pairs don't share interfaces and types.I will take care of refactoring amd64 and arm64 as they are supported as first-class ports, but unfortunately we don't have the resources or expertise to port ppc64le and s390x. We also don't want to give up on completing an important refactor that will reduce ongoing maintenance burden because of the assembly cores: we can't let everything with assembly in it "rust shut".We plan to disable ppc64le and s390x assembly, and ask that the relevant owners plan on doing the work to port it to nistec, ideally for Go 1.19. We will try to make the changes on our side in the first half of March, which leaves six weeks to port ppc64le and s390x.It looks like both the amd64/arm64 and ppc64le/s390x implementations are 64-bit Montgomery strategies with the same(?) limb schedule, so the difference in interfaces and types might be a historical accident. If that's accurate we'd ask that the ppc64le/s390x assembly be ported to implement the same functions as the amd64/arm64 cores, which should significantly reduce complexity.I have updated the Assembly Policy to reflect the expectations outlined above (specifically, that code that we don't explicitly take on must have an owner, and that architectures should share structure). I welcome feedback on the changes.I realize the Assembly Policy might look like a lot of overhead, but it exists to try and make it possible for crypto assembly to exist safely in the standard library. We really appreciate contributions that comply to it, and will do our best to prioritize them. (If you have a CL that is following the policy and is not getting review, please email me directly. Things get lost in the Gerrit sea.)
--
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-dev/a33a9e0c-30d0-463b-83bd-5dc0b10800ecn%40googlegroups.com.
On Fri, Mar 25, 2022 at 10:08 AM Paul Murphy <mu...@ibm.com> wrote:Related to this, I have been working on a patch to improve carrying arithmetic on ppc64le. As someone maintaining, but not well versed in their practical usage, how do the four benchmarks overlap common usage scenarios? Which are the most important? Should I be looking at other benchmarks?I generated, and enabled the P256 code paths similar to the other curves, and the benchmark results seemed unintuitive. Results below are on a ppc64le/p9 host, both with my pending ppc64le Add64/Sub64 patches. Old numbers are native asm, new numbers are with the tentative fiat code and the pending P256 patches:name old time/op new time/op delta
ScalarBaseMult/P256 46.5µs ± 0% 224.9µs ± 0% +383.79%
ScalarMult/P256 200µs ± 0% 226µs ± 0% +13.03%
MarshalUnmarshal/P256/Uncompressed 2.52µs ± 0% 1.32µs ± 0% -47.56%
MarshalUnmarshal/P256/Compressed 2.53µs ± 0% 1.32µs ± 0% -47.93%Sorry, I had missed this.I use the crypto/elliptic and the crypto/ecdsa benchmarks for this. The former (1 ScalarBaseMult + 1 ScalarMult) are representative of an ECDH run, the latter of the higher-level ECDSA ops.Those numbers actually look pretty good! 13% slowdown on ScalarMult dropping the assembly entirely sounds excellent to me. Note that the Compressed benchmark was broken (https://go.dev/cl/396934) but it doesn't really matter.ScalarBaseMult uses a large, aggressively precomputed table which the nistec/fiat group logic doesn't have, so that's why you see it jump up to the ScalarMult performance. If that's the only thing stopping you from dropping the assembly for the fiat code, I can implement the precomputation in nistec.P256Point.ScalarBaseMult. Let me know. I really appreciate that you invested in making the Go code faster rather than adapting the assembly, and I'm happy to help finish the work.(By the way, you might want to base all your benchmarks on top of https://go.dev/cl/382995.)
On Monday, April 4, 2022 at 12:11:14 PM UTC-5 fil...@golang.org wrote:On Fri, Mar 25, 2022 at 10:08 AM Paul Murphy <mu...@ibm.com> wrote:Related to this, I have been working on a patch to improve carrying arithmetic on ppc64le. As someone maintaining, but not well versed in their practical usage, how do the four benchmarks overlap common usage scenarios? Which are the most important? Should I be looking at other benchmarks?I generated, and enabled the P256 code paths similar to the other curves, and the benchmark results seemed unintuitive. Results below are on a ppc64le/p9 host, both with my pending ppc64le Add64/Sub64 patches. Old numbers are native asm, new numbers are with the tentative fiat code and the pending P256 patches:name old time/op new time/op delta
ScalarBaseMult/P256 46.5µs ± 0% 224.9µs ± 0% +383.79%
ScalarMult/P256 200µs ± 0% 226µs ± 0% +13.03%
MarshalUnmarshal/P256/Uncompressed 2.52µs ± 0% 1.32µs ± 0% -47.56%
MarshalUnmarshal/P256/Compressed 2.53µs ± 0% 1.32µs ± 0% -47.93%Sorry, I had missed this.I use the crypto/elliptic and the crypto/ecdsa benchmarks for this. The former (1 ScalarBaseMult + 1 ScalarMult) are representative of an ECDH run, the latter of the higher-level ECDSA ops.Those numbers actually look pretty good! 13% slowdown on ScalarMult dropping the assembly entirely sounds excellent to me. Note that the Compressed benchmark was broken (https://go.dev/cl/396934) but it doesn't really matter.ScalarBaseMult uses a large, aggressively precomputed table which the nistec/fiat group logic doesn't have, so that's why you see it jump up to the ScalarMult performance. If that's the only thing stopping you from dropping the assembly for the fiat code, I can implement the precomputation in nistec.P256Point.ScalarBaseMult. Let me know. I really appreciate that you invested in making the Go code faster rather than adapting the assembly, and I'm happy to help finish the work.(By the way, you might want to base all your benchmarks on top of https://go.dev/cl/382995.)If it can approach the speedup of the asm code, maybe the asm isn't needed. If you have the bandwidth to spin such a change (using the fiat code), I am curious how close it can get. The assembly port Lynn is working on is still substantially (2-4x) faster under the ecdsa P256 benchmarks when rebased against your CLs.
A lesser observation I noticed is the fiat conditional move function does not optimize very well (on ppc64). Using a more direct implementation (and updating related call sites) to:func p256CmovznzU64(arg1 p256Uint1, arg2 uint64, arg3 uint64) uint64{
var out uint64
if arg1 == 0 {
out = arg2
} else {
out = arg3
}
return out
}Results in a 15% improvement on PPC64. This allows much better code generation.
Hmm, I thought the +13.03% above was [fiat with compiler changes] vs assembly, which made removing the assembly sound doable. However, it sounds there is also a [new assembly port] target and that [fiat with compiler changes] vs [new assembly port] is more like +100% to +300%? That makes it sound much less likely we'll get to drop the assembly. Or do I misunderstand?
A lesser observation I noticed is the fiat conditional move function does not optimize very well (on ppc64). Using a more direct implementation (and updating related call sites) to:func p256CmovznzU64(arg1 p256Uint1, arg2 uint64, arg3 uint64) uint64{
var out uint64
if arg1 == 0 {
out = arg2
} else {
out = arg3
}
return out
}Results in a 15% improvement on PPC64. This allows much better code generation.It's not a CMOV anymore though, as the C stands for constant time :) Branches (over secret data, which arg1 here is assumed to be) are not allowed in constant time code.
If it would make such a difference though maybe we should intrinsify and use subtle.ConstantTimeSelect (which annoyingly works over int though, so there will be casts) or add an intrinsic to math/bits like bits.Add64?
On Wednesday, April 6, 2022 at 12:17:18 PM UTC-5 fil...@golang.org wrote:Hmm, I thought the +13.03% above was [fiat with compiler changes] vs assembly, which made removing the assembly sound doable. However, it sounds there is also a [new assembly port] target and that [fiat with compiler changes] vs [new assembly port] is more like +100% to +300%? That makes it sound much less likely we'll get to drop the assembly. Or do I misunderstand?That is correct with respect to P256 on ppc64le; Lynn is updating the asm to work with CL 382995. Though, I think there may be some hope for native code.I am wondering, if ScalarBaseMult performed similarly to it's asm counterpart, would it close most or all of that performance gap? Native go is still 4x off.
A lesser observation I noticed is the fiat conditional move function does not optimize very well (on ppc64). Using a more direct implementation (and updating related call sites) to:func p256CmovznzU64(arg1 p256Uint1, arg2 uint64, arg3 uint64) uint64{
var out uint64
if arg1 == 0 {
out = arg2
} else {
out = arg3
}
return out
}Results in a 15% improvement on PPC64. This allows much better code generation.It's not a CMOV anymore though, as the C stands for constant time :) Branches (over secret data, which arg1 here is assumed to be) are not allowed in constant time code.If it would make such a difference though maybe we should intrinsify and use subtle.ConstantTimeSelect (which annoyingly works over int though, so there will be casts) or add an intrinsic to math/bits like bits.Add64?Understood, it was a quick way to get it to lower into more ideal branch and stack free code on ppc64. Despite inlining, go still passes the result through memory which is not ideal on PPC64 (and likely others). Returning out1 by value would likely improve performance by as much. A builtin might make the intent clearer, but lowering does try recognize and optimize some constant time selects. I can probably improve those on ppc64.The fiat folks are friendly and open to performance improvements, so we can improve the code, that's not a problem.
--
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-dev/ac54226a-6641-4b22-88d3-e114b245d017n%40googlegroups.com.
On Wed, Apr 6, 2022 at 4:27 PM Paul Murphy <mu...@ibm.com> wrote:On Wednesday, April 6, 2022 at 12:17:18 PM UTC-5 fil...@golang.org wrote:Hmm, I thought the +13.03% above was [fiat with compiler changes] vs assembly, which made removing the assembly sound doable. However, it sounds there is also a [new assembly port] target and that [fiat with compiler changes] vs [new assembly port] is more like +100% to +300%? That makes it sound much less likely we'll get to drop the assembly. Or do I misunderstand?That is correct with respect to P256 on ppc64le; Lynn is updating the asm to work with CL 382995. Though, I think there may be some hope for native code.I am wondering, if ScalarBaseMult performed similarly to it's asm counterpart, would it close most or all of that performance gap? Native go is still 4x off.The fundamental benchmark you need to look at is ScalarMult/P256. If you can get that benchmark for the fiat Go code within the ballpark of the assembly, then I can fix ScalarBaseMult/P256 for you by introducing ScalarBaseMult-specific optimizations in the nistec package, but fixing ScalarBaseMult/P256 won't help with ScalarMult/P256.