When I was preparing the reference implementation for my submission
last summer I found similar results, and that was coding in plain C (no
assembly). All my tests were done with GCC 4.x. I found that tiny
changes in the order of C statements could make a massive difference
in execution speed (although running the resulting binary through
gdb didn't show much difference at all in the generated assembly).
Changes that one would expect to improve performance were usually
terrible. For example, calling a function with a single pointer to a
struct, instead of passing eight or ten parameters on the stack,
actually made things worse (very odd)... hence my reference
implementation has parameter lists that look like short stories :-)
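
For anyone who wants to try the same comparison, here is a rough sketch
of the two calling styles side by side. This is not code from my
reference implementation: round_state, mix_params, mix_struct and the
two timing helpers are made-up names, and the mixing arithmetic is only
a placeholder. It assumes a C99 compiler.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical state with eight words, purely for illustration. */
typedef struct {
    uint32_t a, b, c, d, e, f, g, h;
} round_state;

/* Variant 1: every value is passed as its own parameter. */
uint32_t mix_params(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                    uint32_t e, uint32_t f, uint32_t g, uint32_t h)
{
    return (a ^ b) + (c ^ d) + (e ^ f) + (g ^ h);
}

/* Variant 2: the same arithmetic, reading the values through a pointer. */
uint32_t mix_struct(const round_state *s)
{
    return (s->a ^ s->b) + (s->c ^ s->d) + (s->e ^ s->f) + (s->g ^ s->h);
}

/* Volatile sink so the calls are not discarded as dead code. */
static volatile uint32_t sink;

/* CPU seconds spent on `iterations` calls of the parameter-list variant. */
double time_params(unsigned long iterations)
{
    clock_t start = clock();
    for (unsigned long i = 0; i < iterations; i++)
        sink = mix_params((uint32_t)i, 1, 2, 3, 4, 5, 6, 7);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* CPU seconds spent on `iterations` calls of the struct-pointer variant. */
double time_struct(round_state *s, unsigned long iterations)
{
    clock_t start = clock();
    for (unsigned long i = 0; i < iterations; i++) {
        s->a = (uint32_t)i;
        sink = mix_struct(s);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

With optimisation on, the compiler may well inline both variants, in
which case the timings say nothing about parameter passing, so it is
worth checking the generated assembly before trusting the numbers.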
To complicate things further, I'd coded and tested the implementation
on a Pentium 4 CPU, and because of the number of static tables I'd
used (lots of memory access), the cache seemed to really mess up the
timings. When I then tested it on a 64-bit CPU (Intel Dual Core) with a
much larger cache, the performance was not what I expected, which meant
I had to re-optimise every function.
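
One way to get a feel for how much the table footprint matters on a
given CPU is to time the same number of lookups over tables of different
sizes. The sketch below is only an illustration under my own
assumptions: table_walk and time_table are made-up helpers, the index
sequence is a simple linear congruential step rather than anything a
real cipher does, and the entry count must be a power of two.

```c
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Perform `steps` data-dependent lookups in a table of `entries` words.
   `entries` must be a power of two so the mask below is valid. */
uint32_t table_walk(const uint32_t *table, size_t entries, unsigned long steps)
{
    uint32_t idx = 1, acc = 0;
    for (unsigned long i = 0; i < steps; i++) {
        acc += table[idx & (entries - 1)];
        /* Next index depends on the value just loaded, so the loads
           cannot all be issued ahead of time. */
        idx = idx * 1664525u + 1013904223u + acc;
    }
    return acc;
}

/* Allocate and fill a table, then time the walk over it.
   Returns CPU seconds, or -1.0 if the allocation fails. */
double time_table(size_t entries, unsigned long steps, uint32_t *checksum)
{
    uint32_t *table = malloc(entries * sizeof *table);
    if (table == NULL)
        return -1.0;
    for (size_t i = 0; i < entries; i++)
        table[i] = (uint32_t)(i * 2654435761u);  /* arbitrary filler values */

    clock_t start = clock();
    *checksum = table_walk(table, entries, steps);
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

    free(table);
    return secs;
}
```

Calling time_table with the same step count and, say, 1024 entries and
then a few million entries should show whether the larger table falls
out of cache on the machine in question. The checksum is returned only
so the compiler cannot drop the loop.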
If you're using Intel CPUs, it might be worth coding small parts in C
and using Intel's ICC compiler (if you have a copy, which I
unfortunately don't) to check what output it produces. From what I've
read, ICC will try to reorder register-only instructions and those
involving memory access into something nearer the optimal order
(which I think means it essentially interleaves reg-only and reg/mem
instructions so that any memory access can be performed in parallel
with a full CPU instruction).
This is probably straying into territory where I have little or no
knowledge (if you know better, please correct me, as knowing more
here would be really useful): have you checked the instruction sizes?
The last time I did any asm was on a 386, and the opcode lengths
were not always a multiple of 4 bytes, which I would guess means that
the memory fetch for the next instruction would not always be dword
aligned.
What type of code are you optimising? Does it do a lot of sbox lookups,
or is it mainly arithmetic instructions?