Hi Ben,
That's an interesting question. I've always believed that, for the
in-order ARM cores, fewer instructions was faster, but I'm sure the
real answer is more nuanced. Trying to prove this directly with
something like the go1 benchmarks will probably give you numbers lost
in the noise floor.
Here's what I would try to test your hypothesis.
1. write a go benchmark
package p

import "testing"

func fnMov() // implemented in asm
func fnBFC() // implemented in asm

func BenchmarkFnMov(b *testing.B) {
	for n := 0; n < b.N; n++ {
		fnMov()
	}
}

... etc
then write fnMov and fnBFC in a .s file.
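A sketch of what that .s file might look like for the 32-bit arm port. The bit range, register choices, and whether your assembler version accepts the BFC mnemonic with this operand order are all assumptions to check against your toolchain:

```asm
#include "textflag.h"

// func fnMov()
TEXT ·fnMov(SB), NOSPLIT, $0-0
	MOVW $0xffff00ff, R1 // materialise the mask with a move...
	AND  R1, R0          // ...then clear bits [8,16) by ANDing it
	RET

// func fnBFC()
TEXT ·fnBFC(SB), NOSPLIT, $0-0
	BFC $8, $8, R0       // single-instruction bit-field clear (ARMv6T2+)
	RET
```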
The Go compiler will not attempt to inline or eliminate those calls,
as it cannot prove they don't have side effects, so effectively this
uses the testing package as a harness to run asm functions. Yes,
you'll be paying the function call overhead, but it should be
constant.
This should let you test your code gen hypothesis without needing to
change anything. We can also take this program and try it on a bunch
of different ARM hosts, as I'm sure different microarchitectures
handle this differently.
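One convenient way to do that is to cross-compile the test binary once and copy it around. Host names below are hypothetical, and GOARCH should be adjusted for arm vs arm64:

```shell
# build a standalone test binary for the target architecture
GOOS=linux GOARCH=arm go test -c -o bench.arm .

# run it on each machine under test (hypothetical host)
scp bench.arm user@armhost:
ssh user@armhost ./bench.arm -test.bench=Fn -test.count=10
```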
If that proves useful, I think the next step would be to express the
assembly you wrote in step 1 in pure Go, as a microbenchmark. Then
you can teach the arm backend to do the appropriate peephole
optimisations and use benchstat to compare the differences; they may
be more or less pronounced when this optimisation is mixed into the
general milieu of machine-generated code.
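For the pure-Go microbenchmark, something like the following could stand in for the asm. The bit range and function name here are just illustrative assumptions:

```go
package main

import "fmt"

// clearField zeros bits [16,32) of x. This is the pattern a
// bit-field-clear peephole would target: without it the backend
// materialises the mask with a move and then ANDs it; with it,
// a single instruction could suffice.
func clearField(x uint64) uint64 {
	return x &^ (0xffff << 16)
}

func main() {
	fmt.Printf("%#x\n", clearField(0xdeadbeefcafebabe)) // 0xdeadbeef0000babe
}
```

Comparing the two toolchains is then a matter of running the benchmark under each with a -count high enough for benchstat (e.g. go test -bench=. -count=10 > old.txt, rebuild, repeat into new.txt, then benchstat old.txt new.txt).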
After that, as long as this doesn't produce a regression in binary
size or the go1 benchmarks, I'd say this would be a good candidate
for merging.
Thanks
Dave
> --
> You received this message because you are subscribed to the Google Groups
> "golang-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to golang-dev+...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.