How to prove a minor improvement


Ben Shi

Aug 16, 2017, 4:44:45 AM
to golan...@googlegroups.com

Hello, go developers,


I have a question about how to show a tiny improvement of the arm assembler.


I have an idea to optimize "AND $0xf000f000, Rn". It is originally assembled to

MOVW Offset(PC), R11

AND R11, Rn


But it could be simplified on ARMv7 to 

BFC Rn, #0, #12

BFC Rn, #16, #12
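
That the two sequences compute the same value can be checked with a few lines of Go. This is only a sketch: bfc is a hypothetical helper that emulates what BFC does to a 32-bit register (clear width bits starting at bit lsb); it is not an API from any package, and it says nothing about timing.

```go
package main

import "fmt"

// bfc emulates ARM's BFC (bit field clear) on a 32-bit value: it clears
// width bits starting at bit lsb. Hypothetical helper, for illustration only.
func bfc(x uint32, lsb, width uint) uint32 {
	mask := (uint32(1)<<width - 1) << lsb
	return x &^ mask
}

// viaAND is the original form: AND with the constant 0xf000f000.
func viaAND(x uint32) uint32 { return x & 0xf000f000 }

// viaBFC is the proposed form: clear bits 0-11, then bits 16-27.
func viaBFC(x uint32) uint32 { return bfc(bfc(x, 0, 12), 16, 12) }

func main() {
	for _, x := range []uint32{0, 0xffffffff, 0x12345678, 0xdeadbeef} {
		fmt.Printf("%08x: AND=%08x BFC=%08x equal=%v\n",
			x, viaAND(x), viaBFC(x), viaAND(x) == viaBFC(x))
	}
}
```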


According to ARM's manual, a single BFC is sure to take fewer cycles than an "LDR R11, off(PC)". But the go1 benchmark shows no improvement.


How can I prove that? Should I write a specific test case for it and add it to the go1 tests?


Ben



Dave Cheney

Aug 16, 2017, 5:37:59 AM
to Ben Shi, golang-dev
Hi Ben,

That's an interesting question. I've always believed that, for the
in-order ARM cores, fewer instructions was faster, but I'm sure the real
answer is more nuanced. Probably trying to prove this directly using
something like the go1 benchmarks will give you numbers in the noise
floor.

Here's what I would try to prove your hypothesis.

1. Write a Go benchmark:

package p

import "testing"

func fnMov() // implemented in asm

func fnBFC() // implemented in asm

func BenchmarkFnMov(b *testing.B) {
	for n := 0; n < b.N; n++ {
		fnMov()
	}
}

func BenchmarkFnBFC(b *testing.B) {
	for n := 0; n < b.N; n++ {
		fnBFC()
	}
}

Then write fnMov and fnBFC in a .s file.

The Go compiler will not attempt to inline or eliminate those calls, as
it cannot prove they don't have side effects, so effectively this
uses the testing package as a harness to run asm functions. Yes,
you'll be paying the function dispatch overhead, but it should be a
constant.

This should let you test your code gen hypothesis without needing to
change anything. Also we can take this program and try it on a bunch
of different ARM hosts, as I'm sure there are different microarchitectures
that handle this differently.

If it proves useful, I think the next step would be to try to express
the assembly you wrote in step 1 in pure go, as a microbenchmark. Then
you can teach the arm backend to do the appropriate peephole
optimisations and use benchstat to compare the differences; they may
be more or less pronounced when this optimisation is mixed into the
general milieu of machine generated code.
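
A pure Go microbenchmark for this step could look roughly like the sketch below. The names maskNibbles, sink and BenchmarkMaskNibbles are made up for illustration, and whether the arm backend really emits the constant load plus AND for this function is an assumption that should be checked in a disassembly (for example with go tool objdump).

```go
package main

import (
	"fmt"
	"testing"
)

// maskNibbles is the pure Go form of "AND $0xf000f000, Rn".
// The noinline directive keeps the AND inside a function that is easy
// to find in a disassembly.
//
//go:noinline
func maskNibbles(x uint32) uint32 {
	return x & 0xf000f000
}

// sink stops the compiler from discarding the result as dead code.
var sink uint32

func BenchmarkMaskNibbles(b *testing.B) {
	for n := 0; n < b.N; n++ {
		sink = maskNibbles(uint32(n))
	}
}

func main() {
	// testing.Benchmark lets the benchmark run outside "go test".
	fmt.Println(testing.Benchmark(BenchmarkMaskNibbles))
}
```

In practice the benchmark would live in a _test.go file. Running go test -bench=MaskNibbles -count=10 with the old and the new toolchain, saving each output to a file, and passing both files to benchstat gives the comparison. The call overhead of the non-inlined function is paid in both runs, so only the difference between them is meaningful.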

After that, as long as this doesn't produce a regression in binary
size or go1 benchmarks; I'd say this would be a good candidate for
merging.

Thanks

Dave

Ralph Corderoy

Aug 16, 2017, 6:45:09 AM
to Ben Shi, golan...@googlegroups.com
Hi Ben,

> But it could be simplified on ARMv7 to
> BFC Rn, #0, #12
> BFC Rn, #16, #12
>
> According to ARM's manual, a single BFC is sure to save ticks than a
> "LDR R11, off(PC)". But the go1 benchmark shows no improvements.

The number of instructions executed is going to be the same, so the
count from `perf stat -e instructions ./a.out' shouldn't vary.
Similarly, the "I refs" from `valgrind --tool=cachegrind --cache-sim=yes
--branch-sim=yes ./a.out'.

Both tools should show a difference in data-cache references, since the
BFC form drops the PC-relative load of the constant, though cache misses
would only show up on larger workloads.

You can try `perf stat -e cycles ./a.out', but it often has a displayed
variance anyway. (`-r $n' will run the command n times.) perf-list(1)
shows what your platform provides.
http://infocenter.arm.com/help/topic/com.arm.doc.faqs/ka16403.html says
"DWT_LSUCNT - cycles spent waiting for loads and stores to complete".

If you're interested in just seeing that the instruction choice can be
consistently measured then create a minimal static executable from
assembly that does the LDR before exit(2). Then swap in the BFC and see
if you can measure the difference?

--
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy

David Chase

Aug 16, 2017, 11:07:49 AM
to golang-dev, Ben Shi
Another option besides writing a benchmark is to instrument that transformation and see how often it occurs when compiling general code and existing benchmarks.
That won't tell you if it's faster, but it will tell you if it is likely to matter.
You can also use that information to select or create more realistic benchmarks than an isolated code sequence in a .s file.
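
As a rough sketch of what such instrumentation might count, the Go below tallies AND constants by how many BFC instructions they would need. Everything here is hypothetical: zeroRuns and counter are illustrative names, the sample masks are invented, and where a counter like this would be hooked into the toolchain is left open. It also ignores whether a constant already fits ARM's immediate encoding, in which case no load is needed to begin with.

```go
package main

import (
	"fmt"
	"math/bits"
)

// zeroRuns returns the number of contiguous runs of zero bits in mask.
// Each run could be cleared by one BFC, so "AND $mask, Rn" would need
// zeroRuns(mask) BFC instructions.
func zeroRuns(mask uint32) int {
	z := ^mask              // 1 bits mark the positions to clear
	starts := z &^ (z << 1) // lowest bit of each run of 1 bits in z
	return bits.OnesCount32(starts)
}

// counter is a hypothetical tally of AND constants, keyed by BFC count.
type counter map[int]int

func (c counter) observe(mask uint32) { c[zeroRuns(mask)]++ }

func main() {
	c := counter{}
	// Invented sample; real data would come from the instrumented toolchain.
	for _, m := range []uint32{0xf000f000, 0xffff0000, 0x0000ffff, 0xaaaaaaaa} {
		c.observe(m)
	}
	fmt.Println("masks needing at most two BFCs:", c[1]+c[2])
}
```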
