x86 code alignment guidelines

Russ Cox

Feb 10, 2010, 9:25:11 PM
to golang-nuts
On our test systems we've seen a few instances of code
where nudging it a few bytes one way or the other in
memory has dramatic effects on performance. Here's one:

http://godashboard.appspot.com/benchmarks/math_test.BenchmarkFmod

On Darwin/386, it either takes 9ns or 18ns and oscillates
back and forth as the code moves around in the text segment.
This is an Intel chip but we've seen similar effects on AMD chips.
(For example, the Linux/386 line is an AMD chip.)

Does anyone know of concrete guidelines for how x86 code
should be aligned to avoid this kind of slowdown?

Thanks.
Russ

Ian Lance Taylor

Feb 11, 2010, 12:44:26 AM
to r...@golang.org, golang-nuts
Russ Cox <r...@golang.org> writes:

Obviously the specific guidelines differ from processor to processor.
For Intel processors there is a lot of information in the Intel
Architecture Optimization Reference Manual at
http://www.intel.com/design/pentiumii/manuals/245127.htm .

For AMD processors, take a look at
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF

Some general tips for Intel processors:

* For best performance, align all branch targets at a 16-byte
boundary. In practice aligning the start of loops is probably
sufficient.

* Use conditional moves when possible for unpredictable branches. For
predictable branches, prefer branch instructions.

* Initially a forward branch will be predicted to fall through. A
backward branch will be predicted as taken.

* Do not put more than four branches in one 16-byte chunk.

* Do not put more than two branches out of a loop in one 16-byte
chunk.
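To illustrate the conditional-move tip above, here is a small Go sketch (function names are mine, not from the thread) contrasting a plainly branchy form with a two-armed assignment pattern that the Go compiler on amd64 can typically lower to a CMOV, avoiding misprediction penalties when the condition is unpredictable:

```go
package main

import "fmt"

// branchyMax uses an explicit early return; fine when a > b is
// well predicted by the branch predictor.
func branchyMax(a, b int) int {
	if a > b {
		return a
	}
	return b
}

// branchlessMax computes the same result via a simple two-armed
// assignment. On amd64 the Go compiler can usually turn this
// pattern into a conditional move (CMOV), which has no branch to
// mispredict. Whether CMOV is actually emitted depends on the
// compiler version and surrounding code.
func branchlessMax(a, b int) int {
	m := b
	if a > b { // candidate for CMOV lowering
		m = a
	}
	return m
}

func main() {
	fmt.Println(branchyMax(3, 7), branchlessMax(3, 7))
}
```

Whether the CMOV form actually wins depends on predictability: for a branch that is almost always taken, the branchy form is usually faster, which is why the tip distinguishes the two cases.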

Ian

Lawrence Bakst

Feb 11, 2010, 1:00:29 AM
to r...@golang.org, golang-nuts
On Wed, Feb 10, 2010 at 9:25 PM, Russ Cox <r...@golang.org> wrote:
> Does anyone know of concrete guidelines for how x86 code
> should be aligned to avoid this kind of slowdown?

Concrete guidelines no. Clues, pointers, and a few guesses yes. My
answers are Intel specific because I don't know as much about AMD
architecture.

1. This is a wild guess. I suspect the general rules are:
a. No more than 4-6 instructions (depending on the kind) per aligned
16-byte chunk
b. Branch instructions and the instructions at branch targets
should not span an aligned 16-byte chunk
c. It would be "nice" if no instruction spanned an aligned 16-byte chunk
However, you can break rule "a" if you have a loop that fits into the
loopback buffer, which BTW works differently on "Merom" and "Nehalem".
See #3 and #4 below for justification and #5 for why these rules might
not be the right thing to do in general.

2. I don't believe that Intel publishes guidelines for code alignment,
and it certainly varies from architecture to architecture, so "Intel
chip" means nothing. I am not sure you can optimize across the 3
vendors and the myriad of architectures. I would optimize for
"Merom"/Core2.

3. Many of the answers you want, and most of my information, come from
the following web page:
http://www.agner.org/optimize/

Download the "microarchitecture.pdf" file from that page.

The file has lots of useful info that the vendors either don't publish
or that you have to pull from dozens of presentations. Start reading on
page 74. You need to understand the pre-decoder, decoder, and loopback
buffer, among other things. NB paragraphs 2 and 3 on page 75.

As you can see, the architecture(s) are quite complex.

From the above .pdf file: in general the pre-decoder can only decode 6
instructions or 16 bytes' worth of instructions per cycle, whichever is
less. If you have more than that, the remainder are processed on the
next cycle and, worse, the next 16-byte chunk isn't fetched until the
following cycle. So you can issue 3 ("Merom") or 4 ("Nehalem")
instructions followed by 1 instruction. However, if you have a loop,
the loopback buffer can ameliorate decode-rate limitations.
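The fetch rule described above can be sketched as a toy model in Go. This is my own simplified illustration of the "6 instructions or 16 bytes, whichever is less" limit, not a cycle-accurate simulator; the instruction lengths are made-up inputs, not real decodings:

```go
package main

import "fmt"

// predecodeCycles estimates how many pre-decode cycles a straight-line
// sequence of instructions needs under the simplified rule: each cycle
// consumes at most 6 instructions and at most 16 bytes, and an
// instruction that would exceed either limit waits for the next cycle.
func predecodeCycles(lengths []int) int {
	cycles, insns, bytes := 0, 0, 0
	for _, n := range lengths {
		if insns == 6 || bytes+n > 16 {
			cycles++ // current fetch group is full; start a new cycle
			insns, bytes = 0, 0
		}
		insns++
		bytes += n
	}
	if insns > 0 {
		cycles++ // account for the final, partially filled group
	}
	return cycles
}

func main() {
	// Seven 2-byte instructions: only 14 bytes, but the 6-instruction
	// limit forces a second cycle.
	fmt.Println(predecodeCycles([]int{2, 2, 2, 2, 2, 2, 2})) // 2
	// Three 6-byte instructions: 18 bytes exceeds 16, so two cycles.
	fmt.Println(predecodeCycles([]int{6, 6, 6})) // 2
}
```

Even this toy model shows why shaving an instruction out of a hot loop, or shortening encodings, can change the decode throughput in discrete steps rather than smoothly.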

4. Branch instruction and branch target alignment are clearly an
issue, but there is almost no direct information on that subject.
Do you have instructions that span 16-byte alignment? Specifically, do
the branch instructions or branch targets span aligned 16-byte chunks?
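One quick way to answer that question from a disassembly listing is to check each instruction's start offset and length against aligned 16-byte chunks. A minimal sketch (the function name is mine; feed it offsets from objdump or similar):

```go
package main

import "fmt"

// spansChunk reports whether an instruction of the given size in bytes,
// starting at offset addr in the text segment, crosses an aligned
// 16-byte boundary, i.e. whether its first and last bytes land in
// different 16-byte chunks.
func spansChunk(addr, size uint64) bool {
	return addr&^15 != (addr+size-1)&^15
}

func main() {
	fmt.Println(spansChunk(0x0e, 4))  // bytes 0x0e..0x11 cross 0x10
	fmt.Println(spansChunk(0x10, 16)) // exactly fills one chunk
}
```

Running every branch instruction and branch target through a check like this would mechanically confirm or rule out the spanning hypothesis in #4.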

5. If the above are indeed the issue, fixing it by using the rules in
#1 might not work, because better instruction alignment means lower
instruction density, which might reduce overall performance in the
majority of other instruction traces.

6. Make sure you don't violate any of the rules in Appendix E of this
Intel document:
http://download.intel.com/design/processor/manuals/248966.pdf

7. One other thing. The Intel C compilers are really very good. You
can learn a great deal about what to do and what not to do by looking
at their output. VTune can also help.

Intel has some extremely high-level "support" people who have all the
answers to these kinds of questions, and I assume that Google could
gain access to one of those gurus. We had one when we developed a
custom video CODEC, and he was really very sharp, although he gave
more credit to the memory pre-fetcher than it deserved.
--
~leb

Carl

Feb 14, 2010, 10:07:47 AM
to golang-nuts

I think some much larger issues are also at stake, tangent to this
topic:

The question that comes to my mind is whether the Go compiler coders
will indeed be able to provide all of those in-depth optimizations for
the various processor vendors and architectures. It requires some deep
skill sets and the time to do it all.

Possibly another approach might merit some consideration: the
direction of LLVM. The argument against LLVM has been "too large, too
slow"... however, if you tweak your development methodology, it could
make economic sense to go down that road.

The tweak would be to have a pre-deployment compiler, light and fast,
for development: uncompromising compilation speed and moderately good
native code output.

After you finish your development cycle and fast compile times become
less significant, you then use a highly optimizing deployment compiler
to produce the best possible native code for the specific target CPU
architecture, achieving maximum native code efficiency for that
specific processor. The run time of the compile should not matter much
in that use case because you are not doing it very often; maybe only a
few times during the actual deployment of your application, after
active development has finished. A separate deployment compiler would
also open the door for technologies like code instrumentation and
profile-guided runtime optimization.

The reason I am writing this (loosely worded) proposal is that I have
doubts whether all optimizations for all CPUs can ever make it into
the Go language in the near future if done any other way, because of
the sheer scope of such a task.

Thanks for your time.

Uriel

Feb 14, 2010, 6:09:22 PM
to golang-nuts
On Sun, Feb 14, 2010 at 4:07 PM, Carl <2ca...@googlemail.com> wrote:
>
> I think some much larger issues are also at stake, tangent to this
> topic:
>
> The question that comes to my mind is whether the Go compiler coders
> will indeed be able to provide all of those in-depth optimizations for
> the various processor vendors and architectures. It requires some deep
> skill sets and the time to do it all.
>
> Possibly another approach might merit some consideration: the
> direction of LLVM. The argument against LLVM has been "too large, too
> slow"... however, if you tweak your development methodology, it could
> make economic sense to go down that road.

While a Go front end for LLVM would be great, how is what you suggest
not already possible using gccgo?

uriel

Carl

Feb 15, 2010, 2:56:16 PM
to golang-nuts
I just checked to see whether this has been tried before and discovered
that there was already a discussion about it back in Nov 09 under the
heading "LLVM?".

So my summary for now would be: it seems to make sense to have an
extremely fast developer compiler, which by nature will probably not
be able to be all things to all people, especially in the area of deep
optimization for various architectures, and, augmenting it, a somewhat
heavier deployment compiler with full-depth optimizations for various
architectures, which may be somewhat slower. The latter could very
well be implemented in the context of LLVM.

So it will be interesting to see how this works out. I am pretty sure
the LLVM deployment compiler topic will come up again at some point.

Ian Lance Taylor

Feb 16, 2010, 12:22:22 AM
to Carl, golang-nuts
Carl <2ca...@googlemail.com> writes:

> So my summary for now would be: it seems to make sense to have an
> extremely fast developer compiler, which by nature will probably not
> be able to be all things to all people, especially in the area of deep
> optimization for various architectures, and, augmenting it, a somewhat
> heavier deployment compiler with full-depth optimizations for various
> architectures, which may be somewhat slower. The latter could very
> well be implemented in the context of LLVM.

Nothing against LLVM, but gccgo is already working, and it seems that
it could serve the role of the deployment compiler.

Ian
