tips for instruction scheduling?

Wei Dai

Apr 3, 2009, 3:07:30 AM4/3/09
Does anyone have tips for how to schedule a sequence of assembly language
instructions, in order to maximize the instructions per cycle (IPC) executed
by a CPU?

What I've been doing so far is just simple trial and error. I'd move a few
instructions around more or less at random, benchmark the resulting code
sequence, and keep it as the base for the next round of tweaking if it's
faster than before. Often, which sequences are faster than others makes no
intuitive sense. The only tool I've found so far that helps with
this process is AMD's free CodeAnalyst, which includes a cycle-accurate CPU
pipeline simulator for some AMD CPUs. It can tell you exactly where stalls
are occurring and why, so you can concentrate the tweaking on those parts.
Unfortunately Intel doesn't offer a similar tool, and AMD's tool doesn't
cover its latest CPUs.

I'm curious what solutions others may have found for this problem.

Apr 3, 2009, 5:35:00 AM4/3/09
to crypto-optimization
When I was preparing the reference implementation for my submission
last summer, I found similar results -- and that was coding in C (no
assembly). All my tests were done with GCC 4.x, and I found that tiny
changes in the order of C statements could make a massive difference
in execution speed (although running the resulting binary through
gdb didn't show much difference at all in the resulting assembly).
Changes that one would expect to improve performance were usually
terrible: for example, calling a function with a single pointer to a
struct instead of passing eight or ten parameters on the stack
actually made things worse (very odd)... hence my reference
implementation has parameter lists that look like short stories :-)
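As a rough illustration of the two calling styles being compared above (the struct, its fields, and the function names here are invented for this sketch, not taken from the actual submission):

```c
#include <stdint.h>

/* Hypothetical round-function state; the eight fields stand in for
   the "eight or ten parameters" mentioned above. */
struct state {
    uint32_t a, b, c, d, e, f, g, h;
};

/* Style 1: one pointer argument -- little to pass at the call site,
   but every access inside the callee may go through memory. */
static uint32_t round_ptr(const struct state *s) {
    return s->a + s->b + s->c + s->d + s->e + s->f + s->g + s->h;
}

/* Style 2: a long parameter list -- more to pass, but the compiler is
   free to keep every value in a register throughout. */
static uint32_t round_args(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                           uint32_t e, uint32_t f, uint32_t g, uint32_t h) {
    return a + b + c + d + e + f + g + h;
}
```

Which style wins depends on the compiler, the calling convention, and register pressure, which is presumably why the measured results looked so counter-intuitive.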

To complicate things further, I coded and tested the implementation
on a Pentium 4 CPU, and because of the number of static tables I
used, the cache seemed to really mess up the timings (lots of memory
accesses) - so when I tested it on a 64-bit CPU (an Intel dual-core) with a
much larger cache, the performance was not as expected, which meant I
had to re-optimise every function.

If you're using Intel CPUs, it might be worth coding small parts in C
and using Intel's ICC compiler (if you have a copy, which I
unfortunately don't) to check what output is produced - from what I've
read, ICC will try to reposition register-only
instructions and those involving memory access to be nearer the optimum
(which I think means it essentially interleaves reg-only and reg/mem
instructions so that any memory access can be performed in parallel
with a full CPU instruction).

This is probably straying into territory where I have little or no
knowledge (if you know better, then please correct me, as knowing more
here would be really useful): have you checked the instruction sizes?
The last time I did any asm was on an 80386, and the opcode lengths
were not always a multiple of 4, which I would guess means that any
memory fetch for the next instruction would not be dword-aligned.

What type of code are you optimising? Does it do a lot of S-box
lookups, or is it mainly arithmetic instructions?

Best wishes,


geoffrey park

Apr 4, 2009, 4:44:12 PM4/4/09
Intel does have a code optimization tool, VTune, but it's not free. They also have a new parallel code optimizer in beta. AMD's CodeAnalyst does much the same thing, though, and it's free. I've used them both extensively and can say that CodeAnalyst will work just as well for C or assembly optimization as VTune.
I admit some bias, however, as I work for AMD.

It's hard to give many reliable tips on instruction sequencing because modern processors are so complex. C compiler optimizers have extensive knowledge of processor timings and dependencies built in, so very often a hand-written assembly routine is no faster, and may be slower, than the optimized version the C compiler generates.

Here are some points to keep in mind, though:

* try not to stall pipelining:

   Delay using the result of a calculation long enough for the result to leave the pipe (use CodeAnalyst to figure out how many cycles are required). This may mean slipping an unrelated calculation into the pipe between starting the first calculation and the first use of the result.

* note that floating point and integer units run asynchronously. By interleaving them you can get some gains.

* Intel Hyper-Threading processors have two integer units and one floating-point unit per core. There are other significant differences between Intel and AMD processors in terms of cache sizes and speeds, so in general, code tuned for Intel won't necessarily be tuned optimally for AMD, and vice versa.
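The first point (slipping unrelated work between a calculation and the first use of its result) can be sketched in C. The function names are invented for illustration; in the second version the two accumulators have no dependency on each other, so the pipeline can overlap their add latencies instead of stalling on one serial chain:

```c
#include <stddef.h>
#include <stdint.h>

/* One long dependency chain: each add must wait for the previous one
   to leave the pipe before it can start. */
uint64_t sum_serial(const uint64_t *v, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Two independent accumulators: the adds into s0 and s1 are unrelated
   calculations, so while one result is still in flight the CPU can
   issue the other. */
uint64_t sum_paired(const uint64_t *v, size_t n) {
    uint64_t s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += v[i];
        s1 += v[i + 1];
    }
    if (i < n)               /* odd-length tail */
        s0 += v[i];
    return s0 + s1;
}
```

Whether the second version actually wins depends on the add latency and how many accumulators the register file can hold, which is exactly the kind of thing CodeAnalyst's pipeline view can show.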

If you are trying to develop a fast implementation of any algorithm, remember that modern processors are all multi-core. Four cores is normal; six and eight will be common very soon. If at all possible you should try to use multi-threading. There are tools such as OpenMP to make this easy. OpenCL is coming soon; it will allow easy cross-platform development of massively parallel algorithms running on the 800 or more compute pipes available in modern GPUs.
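A minimal sketch of the OpenMP approach, assuming a loop whose iterations carry no dependencies (the function and its arguments are invented for illustration). Note that if the file is compiled without OpenMP support the pragma is simply ignored and the loop runs serially, so the code stays correct either way:

```c
#include <stddef.h>
#include <stdint.h>

/* Apply an independent per-byte transformation to a buffer.  Because
   no iteration reads another iteration's output, OpenMP may split the
   range across cores; without -fopenmp the pragma is ignored. */
void xor_blocks(uint8_t *out, const uint8_t *in, uint8_t key, size_t n) {
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        out[i] = in[i] ^ key;
}
```

Real cipher modes differ: CTR-style modes parallelize this easily, while chaining modes such as CBC encryption do not, because each block depends on the previous one.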

-Geoff Park


Krystian Matusiewicz

Apr 21, 2009, 2:56:36 AM4/21/09
to crypto-optimization

On Apr 3, 9:07 am, "Wei Dai" <> wrote:
> Does anyone have tips for how to schedule a sequence of assembly language
> instructions, in order to maximize the instructions per cycle (IPC) executed
> by a CPU?

I played with assembler programming a bit, but I don't have any general
tips for that either.
One thing I found very useful is the set of optimization manuals by Agner
Fog. Apart from a very detailed analysis of how instruction pipelines work
on various common CPUs (Core, P4, AMD), he also provides a really nice
set of tables that summarize instruction latencies and some additional
info that helps with scheduling them.

Another thing that could possibly help would be to use performance
counters, but I never actually got to test them in practice...




Starfish

May 3, 2009, 8:00:36 PM5/3/09
to crypto-optimization
Hi! You can check, e.g.: (12.7-8, page 92
through 97)

You need to consider uops and execution ports on most x86-64 CPUs.
You can find such info (assuming x86) here:

Basically, execution time (in clocks) is calculated in this way:

; columns: length in bytes | fused uops | ports p0 p1 p5 p2 p3 p4
; ('x' = the uop can go to any of p0/p1/p5, counted as 1/3 each)

movapd xmm1, [esi+eax]  ;  5 | 1 |   |   |   | 1 |   |   |
mulpd  xmm1, xmm2       ;  4 | 1 | 1 |   |   |   |   |   |
addpd  xmm1, [edi+eax]  ;  5 | 1 |   | 1 |   | 1 |   |   |
movapd [edi+eax], xmm1  ;  5 | 1 |   |   |   |   | 1 | 1 |
add    eax, 16          ;  3 | 1 | x | x | x |   |   |   |
js     L1               ;  2 | 1 |   |   | 1 |   |   |   |
;                 Total | 24 | 6 |1.3|1.3|1.3| 2 | 1 | 1 |
; Bottleneck: port 2 (two memory reads per iteration, ~2 clocks)

I'm a complete crypto newbie btw.


Wei Dai

May 5, 2009, 6:35:40 PM5/5/09
to crypto-optimization
Thanks for these pointers. I think many people on this list are probably
already familiar with these resources, and they are certainly very useful.
But even after reading all of the publicly available information on the x86
microarchitectures, there are still variations in IPC among assembly
language sequences that I can't explain.

I was hoping that others had some extra information that they could share,
but that doesn't seem to be the case. So perhaps a more promising approach
is an optimization tool that will automate the trial-and-error process and
"magically" create the optimal assembly sequence from some specification. I
guess that's what David Harvey and Torbjorn Granlund did for GMP 4.3, and
maybe they will reveal their techniques later.

From: "Starfish" <>
Sent: Sunday, May 03, 2009 5:00 PM
To: "crypto-optimization" <>
Subject: Re: tips for instruction scheduling?