crypto/internal/fips140/edwards25519/field: speed up add chains

Repeated squaring forms the bulk of these add chains.
Doing it in a dedicated routine with local variables is faster.
Introduce SquareN for this, and switch to an add chain geared towards
maximizing runs of repeated squares.
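
For illustration only, here is a minimal sketch of how an add chain built from runs of repeated squares computes an inverse modulo 2^255 - 19. It uses math/big rather than the package's limb representation, and the helpers squareN, mul, invert, and checkInverse are hypothetical names for this sketch, not the API added by this change. The chain shown is the widely used 254-squaring, 11-multiplication chain for x^(p-2); it is not necessarily the chain this change switches to.

```go
package main

import (
	"fmt"
	"math/big"
)

// p is the field prime 2^255 - 19.
var p = new(big.Int).Sub(new(big.Int).Lsh(big.NewInt(1), 255), big.NewInt(19))

// squareN returns x^(2^n) mod p by squaring n times in one loop.
// (Hypothetical helper; it only models the idea of a SquareN routine.)
func squareN(x *big.Int, n int) *big.Int {
	r := new(big.Int).Set(x)
	for i := 0; i < n; i++ {
		r.Mul(r, r)
		r.Mod(r, p)
	}
	return r
}

// mul returns a*b mod p.
func mul(a, b *big.Int) *big.Int {
	r := new(big.Int).Mul(a, b)
	return r.Mod(r, p)
}

// invert returns z^(p-2) mod p, which is the inverse of z for z != 0.
// Comments give the exponent of z held by each intermediate value.
func invert(z *big.Int) *big.Int {
	z2 := squareN(z, 1)                   // 2
	z9 := mul(squareN(z2, 2), z)          // 8 + 1 = 9
	z11 := mul(z9, z2)                    // 11
	t5 := mul(squareN(z11, 1), z9)        // 22 + 9 = 2^5 - 1
	t10 := mul(squareN(t5, 5), t5)        // 2^10 - 1
	t20 := mul(squareN(t10, 10), t10)     // 2^20 - 1
	t40 := mul(squareN(t20, 20), t20)     // 2^40 - 1
	t50 := mul(squareN(t40, 10), t10)     // 2^50 - 1
	t100 := mul(squareN(t50, 50), t50)    // 2^100 - 1
	t200 := mul(squareN(t100, 100), t100) // 2^200 - 1
	t250 := mul(squareN(t200, 50), t50)   // 2^250 - 1
	return mul(squareN(t250, 5), z11)     // 2^255 - 32 + 11 = p - 2
}

// checkInverse reports whether v * invert(v) == 1 mod p.
func checkInverse(v int64) bool {
	x := big.NewInt(v)
	return mul(x, invert(x)).Cmp(big.NewInt(1)) == 0
}

func main() {
	fmt.Println(checkInverse(2), checkInverse(121666))
}
```

Almost all of the work in this sketch happens inside squareN (254 squarings against 11 multiplications), which is why a tight dedicated loop for it pays off.
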

Unrolling the SquareN loop 5x gains a few more percentage points;
that can be done in a follow-up if desired.

A planned follow-up will use this newfound speed to delete the amd64 asm.

Microbenchmarks:
goos: darwin
goarch: arm64
pkg: crypto/internal/fips140/edwards25519
cpu: Apple M3 Max
                               │      a      │                  b                  │
                               │   sec/op    │   sec/op     vs base                │
EncodingDecoding-16              5.842µ ± 0%   4.703µ ± 0%  -19.51% (n=100)
ScalarBaseMult-16                9.157µ ± 0%   9.178µ ± 0%   +0.23% (p=0.000 n=100)
ScalarMult-16                    29.28µ ± 0%   29.25µ ± 0%   -0.09% (n=100)
VarTimeDoubleScalarBaseMult-16   27.46µ ± 0%   27.46µ ± 0%   -0.02% (p=0.002 n=100)
geomean                          14.40µ        13.64µ        -5.25%
pkg: crypto/internal/fips140/edwards25519/field
             │      a      │                  b                  │
             │   sec/op    │   sec/op     vs base                │
Add-16         3.364n ± 0%   3.350n ± 0%   -0.43% (p=0.000 n=100)
Multiply-16    14.15n ± 0%   14.15n ± 0%    0.00% (p=0.000 n=100)
Square-16      10.32n ± 0%   10.25n ± 0%   -0.68% (n=100)
Invert-16      2.734µ ± 0%   2.331µ ± 0%  -14.74% (n=100)
Mult32-16      5.067n ± 0%   4.926n ± 0%   -2.78% (n=100)
Bytes-16       4.595n ± 0%   4.580n ± 0%        ~ (p=0.052 n=100)
geomean        17.75n        17.16n        -3.31%
Macrobenchmarks:
goos: darwin
goarch: arm64
pkg: crypto/ed25519
cpu: Apple M3 Max
                   │   before    │                after                │
                   │   sec/op    │   sec/op     vs base                │
KeyGeneration-16     13.84µ ± 2%   12.92µ ± 1%  -6.65% (p=0.000 n=30)
NewKeyFromSeed-16    13.60µ ± 3%   12.91µ ± 1%  -5.09% (p=0.000 n=30)
Signing-16           16.14µ ± 3%   15.75µ ± 1%  -2.45% (p=0.000 n=30)
Verification-16      35.47µ ± 2%   34.84µ ± 0%  -1.79% (p=0.001 n=30)
geomean              18.12µ        17.39µ       -4.02%