Re: ARMalyser version 0.55

druck

Apr 21, 2006, 8:14:06 PM
On 21 Apr 2006 druck <ne...@druck.freeuk.com> wrote:
> * Register latency detection for target processors has been corrected and
> warnings are now displayed in all cases. The accuracy of values have been
> improved and knowledge of the ARM9 core added.
>
> * Total cycle count latency counts added to statistics if a target
> processor is specified. A follow up article will be posted on the usage
> of this feature shortly.

Register latencies occur on more recent ARM processors, which can issue an
instruction and continue executing before its result is ready, as long as
the following instructions do not use that result. If the result is required
before it is available, a delay (latency) is introduced. For example, the
LDR instruction on an ARM7 takes 3 cycles, but on ARM9, StrongARM and XScale
it may be issued in one cycle, with subsequent instructions executing as
long as they do not need the loaded value. On the ARM9 and StrongARM the
result is available 1 cycle later, and on the XScale 2 cycles later.
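
To make the load-use delay concrete, here is a minimal sketch in Python. It
is not how ARMalyser itself works; the function name and the assumption that
every intervening instruction takes exactly one cycle are mine, and only the
delay figures come from the paragraph above.

```python
# Extra cycles before a loaded value is usable, per the figures quoted above.
# ARM7 is modelled with no separate latency: its LDR simply takes 3 cycles.
LOAD_RESULT_DELAY = {"ARM7": 0, "ARM9": 1, "StrongARM": 1, "XScale": 2}


def load_use_stall(cpu, independent_between):
    """Stall cycles when a loaded value is first used after
    `independent_between` unrelated instructions.

    Assumption (hypothetical model): each unrelated instruction takes
    exactly one cycle and does not itself stall.
    """
    return max(0, LOAD_RESULT_DELAY[cpu] - independent_between)


if __name__ == "__main__":
    for cpu in LOAD_RESULT_DELAY:
        stalls = [load_use_stall(cpu, n) for n in range(3)]
        print(f"{cpu:10s} stalls with 0, 1, 2 filler instructions: {stalls}")
```

In this model one unrelated instruction is enough to hide the delay on ARM9
and StrongARM, while XScale needs two.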

The presence of latencies means that code can be optimised by rearranging
instruction order to insert non-dependent instructions between the load of
a value and the point where it is used. The latest Norcroft and GCC
compilers support this processor-specific optimisation via the -cpu and
-mtune parameters respectively. ARMalyser's statistics output gives a rough
indication of the success of the optimisations. It is only approximate, as
the sum of the cycle lengths of all instructions in the program and the sum
of all the latencies cannot reflect the number of times each instruction is
repeated when run.
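
A small sketch of why the static totals are only approximate. The program
below is entirely made up (the block names, cycle counts and execution
counts are hypothetical, not ARMalyser output); it only illustrates how a
hot loop can make the run-time latency share differ from the static one.

```python
# Hypothetical program: (block, cycles, latency stalls, times executed).
# All figures are invented for illustration.
program = [
    ("setup", 10, 0, 1),
    ("loop body", 4, 1, 1000),
    ("cleanup", 6, 2, 1),
]


def latency_percent(cycles, latency):
    """Latency as a percentage of the cycle count."""
    return 100.0 * latency / cycles


# Static view: every instruction counted once, as in a disassembly listing.
static_cycles = sum(c for _, c, _, _ in program)
static_latency = sum(l for _, _, l, _ in program)

# Dynamic view: each block weighted by how often it actually runs.
dynamic_cycles = sum(c * n for _, c, _, n in program)
dynamic_latency = sum(l * n for _, _, l, n in program)

if __name__ == "__main__":
    print(f"static : {latency_percent(static_cycles, static_latency):.1f}%")
    print(f"dynamic: {latency_percent(dynamic_cycles, dynamic_latency):.1f}%")
```

Here the loop body dominates at run time, so the dynamic percentage is set
almost entirely by its one stall per iteration, which the static sum cannot
show.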

Below is a table of results from analysing the ARMalyser executable built
with the various compiler options for processor optimisation, and analysed
with respect to each processor target. For each build, the first line is
the total cycle count, the second is the total latency count, and the third
is the latency count as a percentage of the cycle count.
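
For anyone wanting to check the percentages, the calculation can be sketched
as follows. The function name is my own; the figures are the Norcroft 5.61
row of the table in this post.

```python
def latency_percent(total_cycles, total_latency):
    """Latency count as a percentage of the cycle count, to one decimal."""
    return round(100.0 * total_latency / total_cycles, 1)


# Norcroft 5.61 figures from the table: target -> (cycles, latency).
norcroft_561 = {
    "ARM7": (53160, 0),
    "ARM9": (44821, 3896),
    "StrongARM": (44842, 3910),
    "XScale": (47862, 9156),
}

if __name__ == "__main__":
    for target, (cycles, latency) in norcroft_561.items():
        print(f"{target:10s} {latency_percent(cycles, latency)}%")
```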


                                   ARM7    ARM9  StrongARM  XScale
                                -------  ------  ---------  ------
Norcroft 5.61                     53160   44821      44842   47862
                                      0    3896       3910    9156
                                     0%    8.7%       8.7%   19.1%

Norcroft 5.64                     52805   42216      42271   43594
                                      0    2047       2067    5856
                                     0%    4.8%       4.9%   13.4%

Norcroft 5.64 -cpu ARM7           53347   43530      43552   45975
                                      0    2736       2749    7366
                                     0%    6.3%       6.3%   16.0%

Norcroft 5.64 -cpu ARM920t        53187   42549      42552   44669
                                      0    1995       2006    6349
                                     0%    4.7%       4.7%   14.2%

Norcroft 5.64 -cpu StrongARM1     53389   42703      42710   44864
                                      0    2006       2020    6380
                                     0%    4.7%       4.7%   14.2%

Norcroft 5.64 -cpu XScale         52803   42214      42269   43592
                                      0    2047       2067    5856
                                     0%    4.8%       4.9%   13.4%

gcc 3.4.5                        121383   84077      84184   97838
                                      0   14569      13459   32544
                                     0%   17.3%      17.3%   33.3%

gcc 3.4.5 -O2                     62063   44894      44896   44874
                                      0    2180       2196    6015
                                     0%    4.9%       4.9%   13.4%

gcc 3.4.5 -O3                     75349   52995      53265   54397
                                      0    2580       2605    7178
                                     0%    4.9%       4.9%   13.2%

gcc 3.4.5 -O3 -mtune=arm7         75819   54746      55019   57877
                                      0    3574       3607    9484
                                     0%    6.5%       6.6%   16.4%

gcc 3.4.5 -O3 -mtune=arm9         75618   52908      53156   55706
                                      0    2486       2497    8232
                                     0%    4.7%       4.7%   14.8%

gcc 3.4.5 -O3 -mtune=strongarm    75143   52427      52737   55042
                                      0    2488       2499    8237
                                     0%    4.7%       4.7%   15.0%

gcc 3.4.5 -O3 -mtune=xscale       75349   52995      53265   54397
                                      0    2580       2605    7178
                                     0%    4.9%       4.9%   13.2%


As can be seen, Norcroft 5.64 and gcc 3.4.5 both optimise for the processor
type, unlike Norcroft 5.61 and version 2 of gcc (not shown). The default
optimisation for each favours the XScale, but also benefits ARM9 and
StrongARM to almost the same degree as optimising specifically for them. As
expected, the degree of optimisation is less for XScale, given the
difficulty of filling the increased latency periods with useful unrelated
instructions.

Cheers
---Dave

--
_ __ __ __
THE /_\ |__) |\ /| / | | | |__) David J. Ruck
/ \ | \ | \/ | \__ |__ |__| |__) chai...@armclub.org.uk
