
BeagleBoard instruction timings and RAM speed results


Terje Slettebø
Jun 26, 2010, 8:48:35 AM

Hi all.

In the same vein as an earlier posting
(http://groups.google.com/group/comp.sys.acorn.misc/browse_frm/thread/49290fed24d221bd/178fa7fa76eb1df5),
I've now done a bunch of performance testing on my BeagleBoard, which
has a 600 MHz ARM Cortex A8 processor and 256 MB of RAM, running a
1280x720 24-bit colour mode.

Since these timings may be of interest to some people (especially
programmers), I'm posting them here, as well as some comments on the
results at the end.

"Superscalar" and "Not superscalar" refers to whether or not it may do
two of these instructions in parallel (if they are independent), using
the Cortex A8 dual ALU. The difference this makes is noted in the
"Performance" section.

Number of clock cycles per instruction [1]:

ALU (Superscalar)
---------------------------
MOV - 1
MVN - 1

ADD - 1
SUB - 1
RSB - 1
ADC - 1
SBC - 1
RSC - 1

AND - 1
ORR - 1
EOR - 1
BIC - 1

CMP - 0.8
CMN - 0.8
TEQ - 0.8
TST - 0.8

Extending (Superscalar)
-----------------------------------
SXTAB - 1

Misc. ALU instructions (Superscalar)
-----------------------------------------------------
BFC - 1
RBIT - 1
REV - 1
CLZ - 1

Misc. ALU instructions (Not superscalar)
-----------------------------------------------------------
USAD8 - 1

Long multiply (Superscalar)
----------------------------------------
SMULL - 3-7
SMLAL - 7

Multiply (Not superscalar)
-------------------------------------
MUL - 2
MLA - 2
MLS - 2

SMUAD - 1

Packing (Superscalar)
--------------------------------
PKHBT - 1

Saturating (Superscalar)
-----------------------------------
SSAT - 2

Saturated arithmetic (Superscalar)
--------------------------------------------------
QADD - 1

SIMD (Superscalar)
----------------------------
SADD8 - 1

Shifts (Superscalar)
-----------------------------
MOV R0,R1,LSL #1 - 1
MOV R0,R1,LSL R2 - 1

Branch (Not superscalar)
------------------------------------
B - 2.25 (unconditional)

Load/store (cache hit)
--------------------------------
LDR - 1 (PC-relative), 2 (register-relative)
STR - 1 (PC-relative), 2 (register-relative)
LDM - 2 (1 register), 5 (8 registers)
STM - 2 (1 register), 9 (8 registers)

FPA (Software emulated, single precision)
-------------------------------------------------------------
MVF - 116
MNF - 116
ABS - 116
SQT - 225
LOG - 8630
LGN - 8900
EXP - 9230
SIN - 9980
COS - 10780
TAN - 10560
ASN - 10270
ACS - 10980
ATN - 9380
RND - 140
URD - 130
NRM - 116

ADF - 146
SUF - 173
RSF - 173
MUF - 138
DVF - 291
RDF - 286
RMF - 166
FML - 138
FDV - 291
FRD - 286
POW - 21590
RPW - 21460
POL - 12510

CMF - 79
CNF - 84
CMFE - 79
CNFE - 84

LDF - 73
STF - 83
LFM - 77
SFM - 70

FLT - 101
FIX - 123

Advanced SIMD and VFP (Not superscalar)
---------------------------------------------------------------
(Here, "double-precision" and "quad-precision" refer to operations on
the 64-bit D and 128-bit Q NEON registers, not to floating-point
precision.)
VMOV - 1 (SIMD)

VADD.I8/I16/I32/I64 - 1 (SIMD)
VADD.F32 - 1 (double-precision), 2 (quad-precision) (SIMD)
VADD.F32/F64 - 9 (VFP)

VMUL.I8/I16/I32 - 1 (double-precision), 2 (quad-precision) (SIMD)
VMUL.F32 - 1 (double-precision), 2 (quad-precision) (SIMD)
VMUL.F32 - 12 (VFP)
VMUL.F64 - 17 (VFP)

VDIV.F32 - 31 (VFP)
VDIV.F64 - 56 (VFP)

VSQRT.F32 - 36 (VFP)
VSQRT.F64 - 59 (VFP)

VABS.S8/S16/S32 - 1 (SIMD)
VABS.F32 - 1 (double-precision), 2 (quad-precision) (SIMD)

VCMP.F64 - 7 (VFP)

VCVT.S32.F32 - 1 (double-precision), 2 (quad-precision) (SIMD)

VAND - 1 (SIMD)

VSHL - 1 (double-precision), 2 (quad-precision) (SIMD)

VLDR - 3 (single/double-precision) (VFP)
VSTR - 2-3 (single/double-precision) (VFP)
VLDMIA - 5 (8 single-precision registers), 8 (8 double-precision
registers)
VSTMIA - 8 (8 single-precision registers), 22 (8 double-precision
registers)

Performance [2]
-----------------------
L1 instruction cache hit - 496 MIPS (Not superscalar), 921 MIPS
(Superscalar) (Less than 32 KB in instruction loop)
L2 instruction cache hit - 200-350 MIPS (With or without superscalar)
(Between 32 KB and 256 KB in instruction loop)
L2 instruction cache miss - 70-190 MIPS (More than 256 KB in
instruction loop)

Memory speed [2]
--------------------------
L1 data cache: 3/0.6 GB/s (read/write) [3]
L2 data cache: 0.6-1.5/0.4 GB/s (read/write) [3]
RAM: 230/417 MB/s (read/write) [3]
Video RAM: Same as for RAM

Some observations
==============
Some of this was somewhat surprising, and in those cases I double-
checked the timings. For example, MUL can't be executed superscalarly,
while SMULL can.

Compared with the corresponding Iyonix timings, there's no longer any
difference between the following instructions:

MOV R0,R0,LSL #1
MOV R0,R1,LSL #1

Furthermore, LDR/STR write-back no longer makes any difference.

Floating-point operations using Advanced SIMD/NEON are much faster
than the corresponding VFP instructions. For example, adding four
single-precision floating-point values using NEON takes 2 clock
cycles, whereas adding one single-precision value using VFP takes 9
cycles.
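
In instruction terms, the comparison looks like this (an illustrative
pair; the register choices are arbitrary):

  VADD.F32 Q0,Q1,Q2   ; NEON: four single-precision adds, 2 cycles
  VADD.F32 S0,S1,S2   ; VFP: one single-precision add, 9 cycles

For .F32 operations, the register type selects the unit: D and Q
registers go to the NEON pipeline, S registers to the VFP.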

This may well be due to the Cortex A8 having a "VFP Lite"
implementation, whereas the Cortex A9 is supposed to have a full VFP
implementation with at least twice the performance.

The NEON unit is very fast, capable of executing two single-precision
floating point operations (addition, subtraction and multiplication)
per clock cycle, giving in excess of 1 GFLOPS, a rather respectable
performance...

Recommendation: for floating-point operations, use the NEON unit where
possible, and fall back to the VFP unit only when needed (e.g. for
division, square root, or double precision).

This advice may well need revision for Cortex A9.

Despite being a "VFP Lite" implementation, the VFP unit is still much
faster than the software-emulated FPA instructions, and should be used
in preference to them.
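
To give a concrete sense of the gap, using the timings from the tables
above (an illustrative pair; the register choices are arbitrary):

  ADF F0,F1,F2        ; FPA add, software emulated: ~146 cycles
  VADD.F32 S0,S1,S2   ; VFP single-precision add: 9 cycles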

Regards,

Terje

[1] I have "calibrated" these clock-cycle results relative to MOV
R0,R1, which is stated in the documentation as taking 1 clock cycle.

[2] This is difficult to measure without more elaborate methods, such
as using the processor's performance counters, as cache replacement
policies make it difficult to know exactly what is happening.

[3] The much slower write is presumably because it does write-through
to the L2 cache or main memory.

druck
Jun 27, 2010, 11:16:01 AM

Terje Slettebø wrote:
> In the same vein as an earlier posting
> (http://groups.google.com/group/comp.sys.acorn.misc/browse_frm/thread/49290fed24d221bd/178fa7fa76eb1df5),
> I've now done a bunch of performance testing on my BeagleBoard, which
> has a 600 MHz ARM Cortex A8 processor and 256 MB of RAM, running a
> 1280x720 24-bit colour mode.
>
> Since these timings may be of interest to some people (especially
> programmers), I'm posting them here, as well as some comments on the
> results at the end.

[snip]

Thanks for that, it's very interesting. I could try to put those
results into ARMalyser, although it won't currently handle superscalar
considerations. Which reminds me, I did Cortex-safe versions months
ago, but haven't uploaded them yet.

---druck

Terje Slettebø
Jun 28, 2010, 4:42:50 AM

On Jun 27, 5:16 pm, druck <n...@druck.org.uk> wrote:
>
> Thanks for that, it's very interesting. I could try to put those
> results into ARMalyser, although it won't currently handle superscalar
> considerations. Which reminds me, I did Cortex-safe versions months
> ago, but haven't uploaded them yet.

Hi Dave.

Thanks for the feedback.

Hardly anyone uses assembly code nowadays (except RISC OS itself), but
information like this can be used to make intelligent decisions about
things like the implementation of libraries and compiler back-ends.
For example, it indicates that an efficient vector or matrix library
could be implemented using the NEON instructions...
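
As a rough sketch of what the inner loop of such a routine could look
like (my illustration, not tested code; the register assignments are
arbitrary), here's four single-precision additions per iteration:

  ; R0 = destination, R1 = first source, R2 = second source,
  ; R3 = element count / 4 (assumed to be a multiple of four)
loop
  VLD1.32  {D0,D1},[R1]!    ; load four floats from the first source
  VLD1.32  {D2,D3},[R2]!    ; load four floats from the second source
  VADD.F32 Q0,Q0,Q1         ; four single-precision adds at once
  VST1.32  {D0,D1},[R0]!    ; store the four results
  SUBS     R3,R3,#1         ; decrement the block count
  BNE      loop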

It also indicates that unless you need division or square root, NEON
may be preferable to VFP, even if you don't need its SIMD capability.

This also means that if the C/C++ compilers used with RISC OS continue
to produce FPA code, a new FPEmulator (for ARM Cortex A8 computers)
might best be implemented mainly using the NEON instructions.
Alternatively, the compiler back-ends could be changed to produce
VFP/NEON instructions directly.

Regards,

Terje

Theo Markettos
Jun 28, 2010, 10:11:08 AM

In comp.sys.acorn.programmer Terje Slettebø <tsle...@gmail.com> wrote:
> Since these timings may be of interest to some people (especially
> programmers), I'm posting them here, as well as some comments on the
> results at the end.

Thanks for some very interesting results. What was your methodology for
measuring them?

Also, do you have any idea how many cycles an L1/L2 data cache
hit/miss costs? That might be interesting for those trying to optimise
code.

Theo

Terje Slettebø
Jul 4, 2010, 6:01:56 AM

On 28 Jun, 16:11, Theo Markettos <theom+n...@chiark.greenend.org.uk>
wrote:

> In comp.sys.acorn.programmer Terje Slettebø <tslett...@gmail.com> wrote:
>
> > Since these timings may be of interest to some people (especially
> > programmers), I'm posting them here, as well as some comments on the
> > results at the end.
>
> Thanks for some very interesting results.  What was your methodology for
> measuring them?

It was quite simple: I first made a loop executing a large number of
"MOV R0,R1" instructions (I avoided the usual "MOV R0,R0", since
register dependencies may inhibit superscalar execution), with 10
instructions in the loop body (further unrolling proved insignificant
to the results).

I then adjusted the loop count until it took exactly one second (using
OS_ReadMonotonicTime).

Now I had a relative measurement, so that if I substituted e.g. "MUL
R0,R1,R2" for the MOV instruction (with the registers set up with some
suitable test data), and it took e.g. two seconds, then - since I knew
that MOV takes a single cycle (when not executed superscalarly) - it
would mean that MUL takes two cycles. And so on.
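
In outline, the harness looked something like this (a sketch of the
approach rather than the exact code; the iteration count shown is a
placeholder):

  SWI  OS_ReadMonotonicTime   ; centisecond counter, returned in R0
  MOV  R8,R0                  ; save the start time
  MOV  R9,#&1000000           ; iteration count (adjusted per test)
loop
  MOV  R0,R1                  ; the instruction under test...
  MOV  R0,R1                  ; ...10 copies in the loop body
  ; (eight more copies here)
  SUBS R9,R9,#1               ; decrement the iteration count
  BNE  loop
  SWI  OS_ReadMonotonicTime
  SUB  R0,R0,R8               ; elapsed centiseconds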

To test for superscalar execution, I modified the loop body, so that
instead of having instructions like:

MOV R0,R1
MOV R0,R1
MOV R0,R1
...

which won't execute superscalarly, because of register dependencies
between the instructions, I used:

MOV R0,R1
MOV R2,R3
MOV R4,R5
...

which did execute superscalarly.

> Also, do you have any idea how many cycles an L1/L2 data cache
> hit/miss costs?  That might be interesting for those trying to optimise
> code.

That's difficult to say, since I got numbers in a range... However,
the tests indicated that you only get the benefits of superscalar
execution with L1 cache hits (I guess the memory latency drowns out
any execution advantages beyond that cache).

All the timings were done with code fitting in the L1 cache (except
for the tests that explicitly test for cache hits), and it seems to
take roughly twice as long to hit the L2 cache. Hitting main memory
appears to take around 4-8 times as long as hitting the L1 cache (you
can see this both from the reduced MIPS and from the read/write
speeds).
