beagleboard/cortex-a8 performance

koen

Apr 20, 2008, 4:46:36 PM
to Beagle Board, siarhei....@gmail.com
Hi,

I'm trying to do some tests to see how the cortex-a8 performs with
video and I'm getting very strange results with mplayer:

The test:

# wget http://samples.mplayerhq.hu/benchmark/testsuite1/matrixbench_normdivx_vbrmp3.avi
# mplayer -nosound -vo null -quiet -benchmark -loop 12 -lavdopts idct=16 matrixbench_normdivx_vbrmp3.avi | grep BENCHMARK

On a nokia n800 (300MHz omap2420):

BENCHMARKs: VC: 122.543s VO: 0.162s A: 0.000s Sys: 1.416s = 124.120s

So it can decode the complete video in ~2 minutes. The beagle:

BENCHMARKs: VC: 193.856s VO: 0.153s A: 0.000s Sys: 2.718s = 196.727s

Wow! That's a *lot* slower than nokia n800. A CPU with twice the
megahertz is 50% slower!

The mplayer used is the one from https://garage.maemo.org/projects/mplayer/
because that has armv6 simd and armv6 vfp optimizations.

The CFLAGS used:

-march=armv7-a -mtune=cortex-a8 -mfpu=vfp -mfloat-abi=softfp
-fexpensive-optimizations -ftree-vectorize -fomit-frame-pointer -O4
-ffast-math

I wondered why that is and got a hint from this:

"Clocking rate (Crystal/DPLL/ARM core): 26.0/266/381 MHz"

So the cpu is not running at 600MHz, but at 381MHz, is that expected?
But even at 381 MHz it should be faster than an omap2.
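
For a rough sense of how far off this is, the two totals above can be normalised by clock speed. The sketch below assumes that decode work scales as wall time multiplied by core clock, which ignores memory-bound effects, and it uses the 300 MHz N800 figure and the 381 MHz Beagle figure exactly as stated in this post. The helper name is made up for illustration.

```python
# Rough clock-normalised comparison of the two mplayer totals quoted above.
# Assumption: work ~ wall time * core clock (ignores memory-bound effects).

def megacycles(seconds: float, clock_mhz: float) -> float:
    """Approximate number of core cycles spent, in millions."""
    return seconds * clock_mhz

n800 = megacycles(124.120, 300.0)    # N800 total at the 300 MHz stated in the post
beagle = megacycles(196.727, 381.0)  # Beagle total at the 381 MHz from the boot log

print(f"N800:   {n800:.0f} Mcycles")
print(f"Beagle: {beagle:.0f} Mcycles")
print(f"Beagle / N800 cycle ratio: {beagle / n800:.2f}")
```

Under these assumptions the Beagle spends roughly twice as many cycles on the same clip, so the lower clock alone does not explain the gap.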

Does anyone have any ideas or hints on this? I'll try running the
test-idct and test-unquantize programs later this week.


Siarhei Siamashka

Apr 20, 2008, 8:00:57 PM
to koen, Beagle Board
On Sunday 20 April 2008, koen wrote:
> I'm trying to do some tests to see how the cortex-a8 performs with
> video and I'm getting very strange results with mplayer:
>
> The test:
>
> # wget http://samples.mplayerhq.hu/benchmark/testsuite1/matrixbench_normdivx_vbrmp3.avi
> # mplayer -nosound -vo null -quiet -benchmark -loop 12 -lavdopts idct=16 matrixbench_normdivx_vbrmp3.avi | grep BENCHMARK

This command line option forces ARMv5TE IDCT (useful for ARM9E and old XScale
cores without IWMMXT support). ARMv6 IDCT can be enabled using
'-lavdopts idct=17', it may work better.

> On a nokia n800 (300MHz omap2420):

AFAIK N800 runs at 330MHz with OS2007 and at up to 400MHz with OS2008.
In order to disable frequency scaling in OS2008 and keep it running
at 400MHz for more reliable results, you can use:

# echo null > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# echo 0 > /sys/power/op_active



>
> BENCHMARKs: VC: 122.543s VO: 0.162s A: 0.000s Sys: 1.416s =
> 124.120s
>
> So it can decode the complete video in ~2 minutes. The beagle:
>
> BENCHMARKs: VC: 193.856s VO: 0.153s A: 0.000s Sys: 2.718s =
> 196.727s
>
> Wow! That's a *lot* slower than nokia n800. A CPU with twice the
> megahertz is 50% slower!

From the Cortex-A8 TRM, instruction cycle timing:
Halfword: SMULxx and SMLAxx - 2 cycles
but Dual halfword: SMUAD, SMUSD - 1 cycle

ARMv5TE IDCT heavily uses SMULxx and SMLAxx instructions which take 1 cycle on
ARM9E, ARM11 and XScale.

Anyway, I suspect that the best results can be obtained when using NEON SIMD
optimizations :)

>
> The mplayer used is the one from https://garage.maemo.org/projects/mplayer/
> because that has armv6 simd and armv6 vfp optimizations.
>
> The CFLAGS used:
>
> -march=armv7-a -mtune=cortex-a8 -mfpu=vfp -mfloat-abi=softfp
> -fexpensive-optimizations -ftree-vectorize -fomit-frame-pointer -O4
> -ffast-math
>
> I wondered why that is and got a hint from this:
>
> "Clocking rate (Crystal/DPLL/ARM core): 26.0/266/381 MHz"
>
> So the cpu is not running at 600MHz, but at 381MHz, is that expected?
> But even at 381 MHz it should be faster than an omap2.
>
> Does anyone have some idea and/or hints on this? I'll try running the
> test-idct and test-unquantize programs later this week

That would also be interesting. I'm especially interested in 'test-vfp',
because, looking at the TRM, it seems that VFP also got a major slowdown on
Cortex-A8.

But Cortex-A9 claims to double VFP performance compared with the previous
generation :)

--
Best regards,
Siarhei Siamashka

Koen Kooi

Apr 21, 2008, 3:38:23 AM
to Beagle Board, Siarhei Siamashka
On 21 Apr 2008, at 02:00, Siarhei Siamashka wrote:


> On Sunday 20 April 2008, koen wrote:
>> I'm trying to do some tests to see how the cortex-a8 performs with
>> video and I'm getting very strange results with mplayer:
>>
>> The test:
>>
>> # wget http://samples.mplayerhq.hu/benchmark/testsuite1/matrixbench_normdivx_vbrmp3.avi
>> # mplayer -nosound -vo null -quiet -benchmark -loop 12 -lavdopts idct=16 matrixbench_normdivx_vbrmp3.avi | grep BENCHMARK
>
> This command line option forces ARMv5TE IDCT (useful for ARM9E and old XScale
> cores without IWMMXT support). ARMv6 IDCT can be enabled using
> '-lavdopts idct=17', it may work better.

with idct=17:

BENCHMARKs: VC: 186.421s VO: 0.143s A: 0.000s Sys: 2.025s = 188.588s
BENCHMARK%: VC: 98.8504% VO: 0.0760% A: 0.0000% Sys: 1.0736% = 100.0000%

> That would be also interesting. I'm especially interested in 'test-vfp',
> because looking at TRM, seems like VFP also got a major slowdown on
> Cortex-A8.

root@beagleboard:~/test# ./test-vfp --freq=$(dmesg | grep MHz | grep ARM | awk -F/ '{print $5}' | awk '{print $1}')

Function: 'vector_fmul_vfp', time=123.040
Function: 'vector_fmul_reverse_vfp', time=116.570
Function: 'float_to_int16_vfp', time=143.864
Function: 'ff_float_to_int16_c', time=38.269
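
The --freq argument comes from the two-stage awk pipeline run over the "Clocking rate" boot message quoted earlier in the thread. Here is a small Python sketch of what those two awk stages do to that line. It assumes the dmesg line has no extra slashes in front of it (for example in a timestamp prefix), and the function name is made up for illustration.

```python
def arm_core_mhz(line: str) -> str:
    """Mimic: awk -F/ '{print $5}' | awk '{print $1}' on one line of text."""
    fields = line.split("/")   # awk -F/ splits on every slash
    fifth = fields[4]          # awk's $5 is index 4 in a 0-based list
    return fifth.split()[0]    # awk '{print $1}' keeps the first word

# The boot message quoted in the first post of this thread.
sample = "Clocking rate (Crystal/DPLL/ARM core): 26.0/266/381 MHz"
print(arm_core_mhz(sample))
```

The slashes inside "(Crystal/DPLL/ARM core)" count as separators too, which is why the ARM core value is field 5 and not field 3. Unlike awk, which prints an empty string when the field is missing, this sketch raises IndexError on a line with fewer than five fields.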

root@beagleboard:~/test# ./test-unquantize --freq=$(dmesg | grep MHz | grep ARM | awk -F/ '{print $5}' | awk '{print $1}')
no cpu clock frequency specified, trying to autodetect it...
... detected as 469.6MHz
running correctness tests...
running performance tests...
dct_unquantize_h263_helper_c time=0.05625 usec per element, or 26.4 cycles (469.6MHz)
dct_unquantize_h263_special_helper_armv5te time=0.01772 usec per element, or 8.3 cycles (469.6MHz)
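
The cycle counts in this output follow directly from the per-element times and the autodetected clock: microseconds multiplied by MHz gives cycles. A minimal sketch of that conversion, using only the numbers printed above (the function name is made up for illustration):

```python
def cycles_per_element(usec: float, clock_mhz: float) -> float:
    """Convert a time in microseconds to core cycles at the given clock."""
    return usec * clock_mhz  # us * (cycles per us)

c_version = cycles_per_element(0.05625, 469.6)  # dct_unquantize_h263_helper_c
armv5te = cycles_per_element(0.01772, 469.6)    # ..._special_helper_armv5te

print(f"C:       {c_version:.1f} cycles")
print(f"ARMv5TE: {armv5te:.1f} cycles")
print(f"ratio:   {c_version / armv5te:.1f}")
```

Note that the tool autodetected 469.6 MHz here, while the other tests in this message assume 381 MHz, so the absolute cycle counts depend on which figure is right.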

root@beagleboard:~/test# ./test-idct --freq=$(dmesg | grep MHz | grep ARM | awk -F/ '{print $5}' | awk '{print $1}') --enable-armv6
avg=-0.08, stddev=36.96, min=-168.00, max=149.00
Assuming cpu clock frequency 381MHz (ARMv6 enabled)
Please be patient and wait for the results, test requires quite a lot of time to run...
correctness tests passed
--- benchmarking with zero idct coefficients ---
simple_idct_armv5te time=535.2
simple_idct_put_armv5te cache=no, time=668.3
simple_idct_put_armv5te cache=yes, time=662.9
simple_idct_add_armv5te cache=no, time=890.5
simple_idct_add_armv5te cache=yes, time=744.9
simple_idct_armv5te_ref time=935.8
simple_idct_put_armv5te_ref cache=no, time=1190.6
simple_idct_put_armv5te_ref cache=yes, time=1171.2
simple_idct_add_armv5te_ref cache=no, time=1372.2
simple_idct_add_armv5te_ref cache=yes, time=1229.4
simple_idct_armv6 time=665.1
simple_idct_put_armv6 cache=no, time=934.0
simple_idct_put_armv6 cache=yes, time=754.6
simple_idct_add_armv6 cache=no, time=999.4
simple_idct_add_armv6 cache=yes, time=854.8
--- benchmarking with random idct coefficients ---
simple_idct_armv5te time=1235.1
simple_idct_put_armv5te cache=no, time=1375.2
simple_idct_put_armv5te cache=yes, time=1367.0
simple_idct_add_armv5te cache=no, time=1617.9
simple_idct_add_armv5te cache=yes, time=1472.9
simple_idct_armv5te_ref time=1616.1
simple_idct_put_armv5te_ref cache=no, time=1863.3
simple_idct_put_armv5te_ref cache=yes, time=1843.0
simple_idct_add_armv5te_ref cache=no, time=2041.1
simple_idct_add_armv5te_ref cache=yes, time=1899.8
simple_idct_armv6 time=1038.1
simple_idct_put_armv6 cache=no, time=1299.8
simple_idct_put_armv6 cache=yes, time=1119.5
simple_idct_add_armv6 cache=no, time=1383.3
simple_idct_add_armv6 cache=yes, time=1234.1
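
To make the ARMv5TE versus ARMv6 comparison easier to read, here is a small sketch that takes the plain simple_idct "time=" values from the output above and prints their ratio for both coefficient sets. The units are whatever test-idct prints, so only the ratios are meaningful here. The dictionary layout and function name are made up for illustration.

```python
# "time=" values copied from the test-idct output above (units as printed).
RESULTS = {
    "zero":   {"simple_idct_armv5te": 535.2,  "simple_idct_armv6": 665.1},
    "random": {"simple_idct_armv5te": 1235.1, "simple_idct_armv6": 1038.1},
}

def armv6_vs_armv5te(times: dict) -> float:
    """Return armv6 time / armv5te time; below 1.0 means armv6 is faster."""
    return times["simple_idct_armv6"] / times["simple_idct_armv5te"]

for case, times in RESULTS.items():
    ratio = armv6_vs_armv5te(times)
    print(f"{case:>6} coefficients: armv6/armv5te = {ratio:.2f}")
```

On these numbers the ARMv6 IDCT is slower than the ARMv5TE one for all-zero coefficients and faster for random coefficients.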

regards,

Koen

Syed Mohammed, Khasim

Apr 22, 2008, 5:06:05 AM
to beagl...@googlegroups.com
Currently, due to power issues, we have been running the ARM MPU at 381 MHz with the L2 cache off. We have to raise this frequency to 500 MHz or more and also enable the L2 cache.

I should be able to work on a board with the power mods soon and will update you all on this ASAP.
 
Regards,
Khasim
 

koen

Apr 22, 2008, 9:45:28 AM
to Beagle Board

On 22 apr, 11:06, "Syed Mohammed, Khasim " <sm.kha...@gmail.com>
wrote:

> Currently due to power issues we were running ARM MPU at 381Mhz with L2
> cache off. We have to pump up this frequency to 500Mhz and more also enable
> L2 Cache.

What kind of power issues, hardware or software?

> I should be able to work on power mods board soon, will update you all on
> this ASAP.

Great, is there anything I can test on my board?

regards,

Koen

gco...@netportusa.com

Apr 22, 2008, 1:47:03 PM
to beagl...@googlegroups.com, Beagle Board
Koen,

You have all of these mods already on your board.

Gerald

Philip Balister

Apr 22, 2008, 2:16:13 PM
to beagl...@googlegroups.com
My understanding is he needs a new xloader though, is this correct?

Is there a replacement for the DVFlasher to install the xloader easily?

Philip

koen

Apr 22, 2008, 2:19:06 PM
to Beagle Board


On 22 apr, 20:16, "Philip Balister" <philip.balis...@gmail.com> wrote:
> My understanding is he needs a new xloader though, is this correct?
>
> Is there a replacement for the DVFlasher to install the xloader easily?

xloader is loaded from SD by the rom thingy, so I only need a new MLO
(signed xloader) binary.

regards,

Koen

Dirk Behme

Apr 22, 2008, 3:20:39 PM
to beagl...@googlegroups.com

Have you tried to generate the MLO on your own? I haven't tried it yet, but

http://code.google.com/p/beagleboard/wiki/BeagleSourceCode

tells us:

-- cut --
Convert x-load.bin to MLO (required for MMC Boot)

1. Use the "SignGP" tool to sign the x-loader image. (“x-load.bin.ift”
file is generated in the same folder.)

./signGP x-load.bin

2. Rename x-load.bin.ift to MLO
-- cut --
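
Step 2 of the quoted recipe can be scripted. The sketch below only does the rename and assumes that step 1 (./signGP x-load.bin) has already been run in the same directory. The function name and the error message are made up for illustration and are not part of any Beagle tooling.

```python
from pathlib import Path

def rename_signed_image(build_dir: str) -> Path:
    """Rename x-load.bin.ift (as produced by signGP) to MLO in build_dir."""
    signed = Path(build_dir) / "x-load.bin.ift"
    if not signed.is_file():
        # signGP has not been run here, or it wrote its output elsewhere
        raise FileNotFoundError(f"{signed} not found; run signGP first")
    mlo = signed.with_name("MLO")
    signed.rename(mlo)
    return mlo
```

Calling rename_signed_image(".") in the X-Loader build directory would leave a file named MLO there, ready to be copied to the SD card.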

X-Loader source is available via

http://elinux.org/BeagleBoard#Git

As X-Loader is a stripped down U-Boot, its include directory links to
uboot. So you need a recent U-Boot with

http://groups.google.com/group/beagleboard/browse_thread/thread/3473b44af1e6e326#

on top. Have a look at omap3530beagle.h. Currently,
PRCM_CLK_CFG2_266MHZ is configured. Instead of this,
PRCM_CLK_CFG2_332MHZ can be enabled.

Don't know how to enable L2 cache and/or other frequencies, though.
Seems that there is no preparation for other (higher?) frequency
configuration in the public code yet?

Dirk

koen

Apr 22, 2008, 5:12:52 PM
to Beagle Board
On 22 apr, 21:20, Dirk Behme <dirk.be...@googlemail.com> wrote:

> As X-Loader is a stripped down U-Boot, its include directory links to
> uboot. So you need a recent U-Boot with
>
> http://groups.google.com/group/beagleboard/browse_thread/thread/3473b...
>
> on top. Have a look to omap3530beagle.h. Currently, there is
> PRCM_CLK_CFG2_266MHZ configured. Instead of this,
> PRCM_CLK_CFG2_332MHZ can be enabled.
>
> Don't know how to enable L2 cache

CONFIG_L2_OFF looks suspiciously like a cache disable option :)

regards,

Koen

Syed Mohammed, Khasim

Apr 24, 2008, 4:57:26 AM
to beagl...@googlegroups.com
The power mods I was referring to are hardware modifications, and Gerald has confirmed that they are already in place on your boards.
 
For Enabling L2 Cache:
 
1. I have not disabled it in X-Loader, so no X-Loader changes are needed for this. However, it is currently disabled in the kernel; to enable it, you have to deselect the option "Disable L2 Cache".

2. For running the MPU at 500 MHz, I can give out the U-Boot and X-Loader changes, but I am waiting for everyone to get their boards modified first, otherwise it might block others. For now, I have attached the MLO and u-boot.bin for testing. Just try this out: boot the kernel and read out the MPU clock by doing
  cat /proc/omap_clocks | grep "MPU"
 
Regards,
Khasim
 

 
MLO
u-boot.bin

koen

Apr 24, 2008, 6:05:01 AM
to Beagle Board

For future reference:

root@beagleboard:/media/mmcblk0p1# md5sum mlo
6a9f907d630de81f0b8ee8398cf94cf6 mlo
root@beagleboard:/media/mmcblk0p1# md5sum u-boot.bin
2408dd1757856d52e71c110aa653c178 u-boot.bin

> For Enabling L2 Cache:
>
> 1. I have not disabled it in X-loader, so no changes to x-loader for this.
> However in kernel it is disabled currently, to enabled it you have deselect
> the option "Disable L2 Cache"

For 2.6.22-beagle:

koen@lieve:/OE/angstrom-tmp/work/beagleboard-angstrom-linux-gnueabi/2.6_kernel$ grep CACHE ./arch/arm/configs/omap3_beagle_defconfig
CONFIG_CPU_CACHE_V7=y
CONFIG_CPU_CACHE_VIPT=y
# CONFIG_CPU_ICACHE_DISABLE is not set
# CONFIG_CPU_DCACHE_DISABLE is not set
CONFIG_CPU_L2CACHE_DISABLE=y
# CONFIG_OUTER_CACHE is not set

For linux-omap2 2.6.25:
koen@lieve:/OE/angstrom-tmp/work/beagleboard-angstrom-linux-gnueabi/linux-omap2-2.6.25-r4/git$ grep CACHE .config
CONFIG_CPU_CACHE_V7=y
CONFIG_CPU_CACHE_VIPT=y
# CONFIG_CPU_ICACHE_DISABLE is not set
# CONFIG_CPU_DCACHE_DISABLE is not set
# CONFIG_OUTER_CACHE is not set
# CONFIG_CDROM_PKTCDVD_WCACHE is not set
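
The difference between the two listings can be checked mechanically. The sketch below scans config text for an explicit CONFIG_CPU_L2CACHE_DISABLE=y line, using the two grep outputs above as input. It only reports what the config says; whether L2 is actually on at run time also depends on the bootloader and kernel code, which this does not check. The function and variable names are made up for illustration.

```python
def l2_cache_disabled(config_text: str) -> bool:
    """True if the config explicitly sets CONFIG_CPU_L2CACHE_DISABLE=y."""
    for line in config_text.splitlines():
        if line.strip() == "CONFIG_CPU_L2CACHE_DISABLE=y":
            return True
    return False

# grep output for 2.6.22-beagle, copied from above
CONFIG_2_6_22 = """\
CONFIG_CPU_CACHE_V7=y
CONFIG_CPU_CACHE_VIPT=y
# CONFIG_CPU_ICACHE_DISABLE is not set
# CONFIG_CPU_DCACHE_DISABLE is not set
CONFIG_CPU_L2CACHE_DISABLE=y
# CONFIG_OUTER_CACHE is not set
"""

# grep output for linux-omap2 2.6.25, copied from above
CONFIG_2_6_25 = """\
CONFIG_CPU_CACHE_V7=y
CONFIG_CPU_CACHE_VIPT=y
# CONFIG_CPU_ICACHE_DISABLE is not set
# CONFIG_CPU_DCACHE_DISABLE is not set
# CONFIG_OUTER_CACHE is not set
# CONFIG_CDROM_PKTCDVD_WCACHE is not set
"""

print("2.6.22-beagle disables L2:", l2_cache_disabled(CONFIG_2_6_22))
print("linux-omap2 2.6.25 disables L2:", l2_cache_disabled(CONFIG_2_6_25))
```

The 2.6.25 config does not mention the L2 disable symbol at all, not even as "is not set", so the option may simply not exist in that tree.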


> 2. For running at 500 MPU, I can give out u-boot and x-loader changes, but
> just waiting for everyone to get their boards modified otherwise it might
> block others. For now, I have attached the MLO and u-boot.bin for testing.
> Just try this out, boot the kernel and read out the MPU clock by doing
>   cat /proc/omap_clocks | grep "MPU"

With 2.6.22-beagle

root@beagleboard:~# cat /proc/omap_clocks | grep mpu ; uname -a
mpu_ck 0 381000000 0
Linux beagleboard 2.6.22.1-omap1 #2 Wed Mar 26 16:39:33 IST 2008 armv7l unknown unknown GNU/Linux
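
For reading the clock out in a script, the mpu_ck line can be parsed as below. This is a sketch that assumes the third whitespace-separated field is the rate in Hz, which matches the 381000000 shown here but is not confirmed anywhere in this thread; the meaning of the other columns is not assumed. The function name is made up for illustration.

```python
def mpu_rate_mhz(line: str) -> float:
    """Extract the rate from an mpu_ck line of /proc/omap_clocks, in MHz.

    Assumption: the third field is the clock rate in Hz.
    """
    fields = line.split()
    if not fields or fields[0] != "mpu_ck":
        raise ValueError("not an mpu_ck line")
    return int(fields[2]) / 1_000_000

print(mpu_rate_mhz("mpu_ck 0 381000000 0"))
```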

With 2.6.25-omap1:
root@beagleboard:~# cat /proc/cpuinfo ; uname -a
Processor : ARMv7 Processor rev 2 (v7l)
BogoMIPS : 378.14
Features : swp half thumb fastmult vfp edsp
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x1
CPU part : 0xc08
CPU revision : 2
Cache type : write-through
Cache clean : not required
Cache lockdown : not supported
Cache format : Unified
Cache size : 768
Cache assoc : 1
Cache line length : 8
Cache sets : 64

Hardware : OMAP3 Beagle Board
Revision : 34301000
Serial : 0000000000000000
Linux beagleboard 2.6.25-omap1 #3 PREEMPT Mon Apr 21 08:55:10 CEST 2008 armv7l unknown unknown GNU/Linux

So both still run at 381MHz, but 2.6.25 should have L2 enabled.

regards,

Koen

Syed Mohammed, Khasim

Apr 24, 2008, 7:36:01 AM
to beagl...@googlegroups.com
Did you try with my latest u-boot.bin and MLO files?
 
2.6.25 doesn't have the omap_clocks entry in /proc, so try 2.6.22 with my latest u-boot.bin and MLO. You should get the MPU at 500 MHz; then run your demos on 2.6.22.

We can then add the other peripheral set to 2.6.25.
 
Regards,
Khasim

koen

Apr 24, 2008, 8:33:50 AM
to Beagle Board

> Did you try with my latest u-boot.bin and MLO files?

Yes:

root@beagleboard:/media/mmcblk0p1# md5sum mlo
6a9f907d630de81f0b8ee8398cf94cf6 mlo
root@beagleboard:/media/mmcblk0p1# md5sum u-boot.bin
2408dd1757856d52e71c110aa653c178 u-boot.bin

> The 2.6.25 doesnt have the omap-clocks entry in proc, so try 2.6.22 with my
> latest u-boot.bin and MLO you should get MPU at 500 and then run your demos
> on 2.6.22.

root@beagleboard:~# cat /proc/omap_clocks | grep mpu ; uname -a
mpu_ck 0 381000000 0
Linux beagleboard 2.6.22.1-omap1 #2 Wed Mar 26 16:39:33 IST 2008 armv7l unknown unknown GNU/Linux

Still 381MHz :(. Could you md5sum your working MLO and see if it
matches?

koen

May 7, 2008, 6:46:12 AM
to Beagle Board


On 20 apr, 22:46, koen <koen.k...@gmail.com> wrote:
> Hi,
>
> I'm trying to do some tests to see how the cortex-a8 performs with
> video and I'm getting very strange results with mplayer:
>
> The test:
>
> # wget http://samples.mplayerhq.hu/benchmark/testsuite1/matrixbench_normdivx...
> # mplayer -nosound -vo null -quiet -benchmark -loop 12 -lavdopts
> idct=16  matrixbench_normdivx_vbrmp3.avi | grep BENCHMARK
>
> On a nokia n800 (300MHz omap2420):
>
> BENCHMARKs: VC: 122.543s VO:   0.162s A:   0.000s Sys:   1.416s =
> 124.120s
>
> So it can decode the complete video in ~2 minutes. The beagle:
>
> BENCHMARKs: VC: 193.856s VO:   0.153s A:   0.000s Sys:   2.718s =
> 196.727s

root@beagleboard:~# uname -a
Linux beagleboard 2.6.26-rc1-omap1 #1 Wed May 7 10:25:34 CEST 2008 armv7l unknown unknown GNU/Linux
root@beagleboard:/media/mmcblk0p1# mplayer -nosound -vo null -quiet -benchmark -loop 12 matrixbench_normdivx_vbrmp3.avi | grep BENCHMARK
BENCHMARKs: VC: 59.906s VO: 0.067s A: 0.000s Sys: 1.255s = 61.228s
BENCHMARKs: VC: 56.150s VO: 0.133s A: 0.000s Sys: 1.043s = 57.327s

That's 3.5 times faster with L2 cache enabled! That's a nice
improvement :D
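
For reference, the speedup can be worked out from the totals quoted in this thread. The sketch below divides the original Beagle total by each of the two new totals. It treats the runs as comparable, although the new command line omits "-lavdopts idct=16" and the kernel version differs, so the ratio is not purely an L2 effect. The function name is made up for illustration.

```python
def speedup(before_s: float, after_s: float) -> float:
    """How many times faster the 'after' run is than the 'before' run."""
    return before_s / after_s

BEFORE = 196.727          # original Beagle total from the first post
AFTER = [61.228, 57.327]  # the two totals measured on 2.6.26-rc1-omap1

for total in AFTER:
    print(f"{total:.3f}s -> {speedup(BEFORE, total):.2f}x faster")
```

On these numbers the improvement works out to between roughly 3.2 and 3.4 times.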

regards,

Koen

Jason Kridner

May 7, 2008, 7:24:14 AM
to beagl...@googlegroups.com
Have you tried compiling with armv7a or neon instructions yet?

koen

May 7, 2008, 7:45:11 AM
to Beagle Board


On 7 mei, 13:24, Jason Kridner <jkrid...@gmail.com> wrote:
> Have you tried compiling with armv7a or neon instructions yet?

This mplayer was compiled with

-march=armv7-a -mtune=cortex-a8 -mfpu=vfp -mfloat-abi=softfp
-fexpensive-optimizations -ftree-vectorize -fomit-frame-pointer -O4
-ffast-math

I haven't seen any patches that add NEON optimizations to
mplayer yet. This mplayer does have Siarhei's armv6 stuff.

regards,

Koen