I'm excited about this release. My compile times are small, but execution time matters a lot more, and SSA should be a step in the right direction. However, our benchmarks have been oscillating significantly throughout the cycle. Below is some data comparing 1.6 and tip (+082881d) on darwin/amd64.
The Idamax regression is issue 14995. Ddot currently looks flat relative to 1.6, but as of the fix for issue 14511 some of its benchmarks were 20% faster. The Dgemm regressions are new to me as of this post. I can open an issue, though that code is significantly more complicated, so it will be harder to determine the cause.
Hopefully some of these concerns will be addressed in 1.8. There is some large (though likely not low-hanging) fruit that could have a significant impact on numeric performance.
Thanks for all the work.
To reproduce:
go get -u -t github.com/gonum/blas/native
cd $GOPATH/src/github.com/gonum/blas/native
go test -bench=Dgem -tags=noasm -count=5
(or another -bench regex, such as Ddot or Idamax)
name old time/op new time/op delta
DgemmSmSmSm-8 1.98µs ± 1% 1.93µs ± 2% ~ (p=0.056 n=5+5)
DgemmMedMedMed-8 593µs ± 3% 690µs ±15% ~ (p=0.095 n=5+5)
DgemmMedLgMed-8 4.49ms ±10% 5.24ms ± 3% +16.85% (p=0.008 n=5+5)
DgemmLgLgLg-8 393ms ± 3% 483ms ± 3% +22.78% (p=0.008 n=5+5)
DgemmLgSmLg-8 6.05ms ± 5% 6.91ms ± 3% +14.22% (p=0.008 n=5+5)
DgemmLgLgSm-8 4.67ms ± 5% 5.32ms ± 2% +13.86% (p=0.008 n=5+5)
DgemmHgHgSm-8 427ms ± 4% 391ms ± 6% -8.32% (p=0.016 n=5+5)
DgemmMedMedMedTNT-8 654µs ±12% 767µs ± 2% +17.30% (p=0.008 n=5+5)
DgemmMedMedMedNTT-8 624µs ± 9% 675µs ± 4% ~ (p=0.095 n=5+5)
DgemmMedMedMedTT-8 835µs ± 1% 800µs ± 3% -4.17% (p=0.008 n=5+5)
DgemvSmSmNoTransInc1-8 215ns ± 1% 209ns ±16% ~ (p=0.190 n=4+5)
DgemvSmSmNoTransIncN-8 243ns ± 1% 229ns ± 1% -5.75% (p=0.008 n=5+5)
DgemvSmSmTransInc1-8 218ns ± 0% 215ns ± 5% ~ (p=0.127 n=5+5)
DgemvSmSmTransIncN-8 267ns ± 3% 240ns ± 2% -10.12% (p=0.008 n=5+5)
DgemvMedMedNoTransInc1-8 11.0µs ± 2% 10.7µs ± 1% -2.95% (p=0.008 n=5+5)
DgemvMedMedNoTransIncN-8 13.4µs ± 2% 13.0µs ± 4% ~ (p=0.056 n=5+5)
DgemvMedMedTransInc1-8 13.3µs ± 4% 12.7µs ± 2% -4.28% (p=0.008 n=5+5)
DgemvMedMedTransIncN-8 16.0µs ± 1% 13.3µs ± 3% -16.75% (p=0.008 n=5+5)
DgemvLgLgNoTransInc1-8 971µs ± 3% 940µs ± 2% -3.18% (p=0.008 n=5+5)
DgemvLgLgNoTransIncN-8 1.21ms ± 3% 1.21ms ± 2% ~ (p=0.841 n=5+5)
DgemvLgLgTransInc1-8 1.21ms ± 1% 1.22ms ± 3% ~ (p=0.421 n=5+5)
DgemvLgLgTransIncN-8 1.50ms ± 2% 1.21ms ± 2% -19.44% (p=0.008 n=5+5)
DgemvLgSmNoTransInc1-8 18.5µs ± 2% 16.9µs ± 1% -8.57% (p=0.008 n=5+5)
DgemvLgSmNoTransIncN-8 20.8µs ± 1% 20.0µs ± 2% -4.25% (p=0.008 n=5+5)
DgemvLgSmTransInc1-8 17.4µs ± 2% 17.4µs ± 3% ~ (p=0.690 n=5+5)
DgemvLgSmTransIncN-8 22.5µs ± 2% 20.0µs ± 1% -11.19% (p=0.008 n=5+5)
DgemvSmLgNoTransInc1-8 9.09µs ± 1% 8.98µs ± 2% ~ (p=0.222 n=5+5)
DgemvSmLgNoTransIncN-8 11.7µs ± 1% 11.7µs ± 1% ~ (p=0.548 n=5+5)
DgemvSmLgTransInc1-8 12.7µs ± 1% 12.2µs ± 1% -3.86% (p=0.008 n=5+5)
DgemvSmLgTransIncN-8 15.9µs ± 2% 12.6µs ± 2% -20.54% (p=0.008 n=5+5)
DdotSmallBothUnitary-8 21.0ns ± 1% 19.2ns ± 2% -8.21% (p=0.008 n=5+5)
DdotSmallIncUni-8 23.5ns ± 2% 24.7ns ± 1% +5.37% (p=0.008 n=5+5)
DdotSmallUniInc-8 23.2ns ± 4% 23.1ns ± 4% ~ (p=1.000 n=5+5)
DdotSmallBothInc-8 23.4ns ± 2% 23.2ns ± 1% ~ (p=0.143 n=5+5)
DdotMediumBothUnitary-8 919ns ± 3% 919ns ± 2% ~ (p=0.968 n=5+5)
DdotMediumIncUni-8 1.22µs ± 1% 1.18µs ± 2% -3.66% (p=0.008 n=5+5)
DdotMediumUniInc-8 1.18µs ± 0% 1.17µs ± 1% ~ (p=0.119 n=5+5)
DdotMediumBothInc-8 1.27µs ± 4% 1.22µs ± 3% -4.55% (p=0.008 n=5+5)
DdotLargeBothUnitary-8 94.0µs ± 2% 90.7µs ± 4% ~ (p=0.056 n=5+5)
DdotLargeIncUni-8 194µs ± 6% 204µs ± 7% ~ (p=0.151 n=5+5)
DdotLargeUniInc-8 128µs ± 1% 128µs ± 2% ~ (p=0.548 n=5+5)
DdotLargeBothInc-8 256µs ± 2% 265µs ± 3% +3.34% (p=0.016 n=5+5)
DdotHugeBothUnitary-8 11.4ms ± 4% 11.3ms ± 2% ~ (p=1.000 n=5+5)
DdotHugeIncUni-8 27.6ms ± 1% 28.0ms ± 2% ~ (p=0.095 n=5+5)
DdotHugeUniInc-8 18.8ms ± 2% 19.4ms ± 2% +3.32% (p=0.016 n=5+5)
DdotHugeBothInc-8 35.1ms ± 2% 36.0ms ± 2% +2.57% (p=0.032 n=5+5)
IdamaxSmallUnitaryInc-8 31.7ns ± 7% 44.7ns ±12% +41.19% (p=0.008 n=5+5)
IdamaxSmallPosInc-8 27.8ns ±10% 40.2ns ±10% +44.75% (p=0.008 n=5+5)
IdamaxMediumUnitaryInc-8 1.53µs ± 4% 2.16µs ± 4% +40.95% (p=0.008 n=5+5)
IdamaxMediumPosInc-8 1.79µs ± 1% 2.59µs ± 7% +44.16% (p=0.008 n=5+5)
IdamaxLargeUnitaryInc-8 143µs ± 1% 214µs ± 3% +50.24% (p=0.008 n=5+5)
IdamaxLargePosInc-8 192µs ± 1% 264µs ± 5% +37.53% (p=0.008 n=5+5)
IdamaxHugeUnitaryInc-8 14.4ms ± 1% 20.9ms ± 3% +45.16% (p=0.008 n=5+5)
IdamaxHugePosInc-8 26.2ms ± 1% 30.7ms ± 2% +16.93% (p=0.008 n=5+5)
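
For anyone reading the last numeric column: it is the relative change from the 1.6 time to the tip time, so positive values are regressions. A minimal sketch of that arithmetic follows, assuming the delta is computed as (new - old) / old on the mean times. The name pctDelta is made up for illustration, and the inputs are the rounded means copied from the table, so the results only roughly match the printed deltas, which come from the unrounded data.

```go
package main

import "fmt"

// pctDelta (hypothetical helper) returns the relative change from oldT to
// newT in percent. Positive means tip is slower than 1.6, matching the sign
// convention of the delta column in the table above.
func pctDelta(oldT, newT float64) float64 {
	return (newT - oldT) / oldT * 100
}

func main() {
	// Rounded means copied from the table, in the units shown there.
	rows := []struct {
		name       string
		oldT, newT float64
	}{
		{"DgemmLgLgLg-8 (ms)", 393, 483},
		{"DgemvLgLgTransIncN-8 (ms)", 1.50, 1.21},
		{"IdamaxLargeUnitaryInc-8 (µs)", 143, 214},
	}
	for _, r := range rows {
		fmt.Printf("%-30s %+.1f%%\n", r.name, pctDelta(r.oldT, r.newT))
	}
}
```

Rows marked with ~ are ones where the difference was not statistically significant at the reported p-value, so no delta is printed for them.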