Benchmark your PandaBoard


Tom Turbo

Jan 28, 2011, 12:54:52 PM
to pandaboard
For everybody here who already owns a PandaBoard:
I have made a small benchmark for it.
It's phrot, a primality testing program, which relies heavily on
double-precision floating-point performance.
Since I have read that the Cortex-A9 has very good floating-point
performance, I want to see its performance on a task like this.
I've also posted a similar thread in the BeagleBoard group, so we can
compare the performance of those two boards too.
To run the test:
Simply download the benchmark (link is below) and unpack it somewhere
on your Panda (gcc, make and the other basic utilities for
compiling software have to be installed there).
Then you run:
./makeit.sh
Then, if everything went well, you should have an executable called
"phrot" in that directory.
Run it like this:
./phrot -q 10^16384+1
and post everything it prints out.
This way I can both see if it works correctly and see how fast it is.
Thanks!
Link to archive:
http://rapidshare.com/files/444871561/phrot_beagle.tar.gz

Stehle, Vincent

Jan 31, 2011, 5:07:18 AM
to panda...@googlegroups.com
On Fri, Jan 28, 2011 at 6:54 PM, Tom Turbo <ilikelinu...@gmail.com> wrote:
(benchmark)

> Run it like this:
> ./phrot -q 10^16384+1
> and post everything it prints out.
> This way I can both see if it works correctly and see how fast it is.
> Thanks!
> Link to archive:
> http://rapidshare.com/files/444871561/phrot_beagle.tar.gz

Hi,

Here are two runs for you:

vincent@vincent-panda:~/phrot$ ./phrot -q 10^16384+1
Phil Carmody's Phrot (0.72)
Input 10^16384+1 : Actually testing 10000*1000000^2730+1 (witness=3 2731/6144 limbs)
10^16384+1 is composite LLR64=c5ff6a4a68324d5a. (e=0.01953 (0.0411076~3.737...@0.000) t=81.40s)

vincent@vincent-panda:~/phrot$ ./phrot -q 10^16384+1
Phil Carmody's Phrot (0.72)
Input 10^16384+1 : Actually testing 10000*1000000^2730+1 (witness=3 2731/6144 limbs)
10^16384+1 is composite LLR64=c5ff6a4a68324d5a. (e=0.01953 (0.0411076~3.737...@0.000) t=81.20s)
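
When comparing pasted results from different boards, the two fields of interest in each run are the LLR64 residue (a correct run should reproduce the same value) and the t= wall time. Below is a minimal Python sketch for extracting both from a result line. The regular expression is inferred only from the output shown in this thread, and `parse_result` is a hypothetical helper name, not part of phrot; other phrot versions or verdicts may print something different.

```python
import re

# Pattern inferred from the result lines pasted in this thread, e.g.
# "10^16384+1 is composite LLR64=c5ff6a4a68324d5a. (e=0.01953 (...) t=81.40s)"
# This is an assumption about the format, not a documented phrot interface.
RESULT_RE = re.compile(
    r"^(?P<number>\S+) is (?P<verdict>\w+) "
    r"LLR64=(?P<residue>[0-9a-f]{16})\."
    r".*?t=(?P<seconds>\d+(?:\.\d+)?)s\)"
)


def parse_result(text):
    """Return number, verdict, residue and seconds, or None if no match.

    Whitespace (including line breaks added by mail clients) is collapsed
    first, so a result line that was wrapped in transit still parses.
    """
    flat = " ".join(text.split())
    match = RESULT_RE.search(flat)
    if match is None:
        return None
    return {
        "number": match.group("number"),
        "verdict": match.group("verdict"),
        "residue": match.group("residue"),
        "seconds": float(match.group("seconds")),
    }


if __name__ == "__main__":
    wrapped = (
        "10^16384+1 is composite LLR64=c5ff6a4a68324d5a. (e=0.01953\n"
        "(0.0411076~3.737...@0.000) t=81.40s)"
    )
    print(parse_result(wrapped))
```

Two runs can then be checked against each other by comparing their `residue` fields before looking at `seconds`.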

Best regards,

--
Vincent Stehlé
Systems Engineer - TI France

Texas Instruments France SA, 821 Avenue Jack Kilby, 06270 Villeneuve
Loubet. 036 420 040 R.C.S Antibes. Capital de EUR 753.920

Tom Turbo

Jan 31, 2011, 8:31:06 AM
to pandaboard
Thanks! The LLR64 residue is correct, so I can assume that phrot runs
fine on the Panda too.
The speed is about 1/4 of that of a 3.2 GHz P4, which is very good
considering this benchmark is so floating-point intensive.
But I think I can make it even quicker with some compiler optimization
tuning.
The updated file in the link below has compiler optimization flags
changed from:
-O3
to:
-O3 -march=armv7-a -mcpu=cortex-a9 -mfloat-abi=softfp
which should give it some speed boost. Does GCC use emulated floating
point instructions by default on the A9?
And NEON cannot accelerate double-precision float work, right?

Could you rerun the test? Thanks!
Link:
http://rapidshare.com/files/445466345/phrot_beagle_2.tar.gz

Stehle, Vincent

Feb 1, 2011, 4:04:54 AM
to panda...@googlegroups.com
On Mon, Jan 31, 2011 at 2:31 PM, Tom Turbo <ilikelinu...@gmail.com> wrote:
(..)
> But I think I can make it even quicker with some compiler optimization
> tuning.

Sure. No doubt.
 
> Does GCC use emulated floating
> point instructions by default on the A9?

I think this is the case on the Ubuntu I am using. (Some) default gcc options:

-march=armv7-a -mfloat-abi=softfp -mfpu=vfpv3-d16 -mthumb
 
> And NEON cannot accelerate double-precision float work, right?

I think "doubles" can be NEON accelerated on Cortex-A9. See http://infocenter.arm.com/help/topic/com.arm.doc.ddi0409f/Chdceejc.html
 
> Could you rerun the test? Thanks!
> Link: http://rapidshare.com/files/445466345/phrot_beagle_2.tar.gz

Here you go:


vincent@vincent-panda:~/phrot$ ./phrot -q 10^16384+1
Phil Carmody's Phrot (0.72)
Input 10^16384+1 :  Actually testing 10000*1000000^2730+1 (witness=3 2731/6144 limbs)
10^16384+1 is composite LLR64=c5ff6a4a68324d5a. (e=0.01953 (0.0411076~3.737...@0.000) t=79.83s)

vincent@vincent-panda:~/phrot$ ./phrot -q 10^16384+1
Phil Carmody's Phrot (0.72)
Input 10^16384+1 :  Actually testing 10000*1000000^2730+1 (witness=3 2731/6144 limbs)
10^16384+1 is composite LLR64=c5ff6a4a68324d5a. (e=0.01953 (0.0411076~3.737...@0.000) t=79.95s)

Is that a one second improvement?
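
The size of the change can be worked out from the four timings posted in this thread. The following is a small Python sketch that averages the two runs of each build and reports the difference; `speedup` is a hypothetical helper name, and two runs per build is too few to say much about run-to-run noise.

```python
from statistics import mean

# Wall times in seconds, copied from the runs posted in this thread.
before = [81.40, 81.20]  # built with -O3 only
after = [79.83, 79.95]   # built with the added -march/-mcpu/-mfloat-abi flags


def speedup(before_runs, after_runs):
    """Return (seconds saved, percent of the original time saved)."""
    b = mean(before_runs)
    a = mean(after_runs)
    saved = b - a
    return saved, saved / b * 100.0


if __name__ == "__main__":
    saved, percent = speedup(before, after)
    print(f"mean before: {mean(before):.2f} s, mean after: {mean(after):.2f} s")
    print(f"saved {saved:.2f} s, i.e. {percent:.1f} % less time")
```

On these numbers the means are 81.30 s and 79.89 s, so the saving is roughly 1.4 s, or about 1.7 %.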

Best regards,

V.

Måns Rullgård

Feb 1, 2011, 7:29:51 AM
to panda...@googlegroups.com
Tom Turbo <ilikelinu...@gmail.com> writes:

> Thanks! The LLR64 residue is correct, so I can assume that phrot runs
> fine on the Panda too.
> The speed is about 1/4 of that of a 3.2 GHz P4, which is very good
> considering this benchmark is so floating-point intensive.

The P4 is one of the worst micro-architectures in modern times.
Almost anything looks good in comparison. That said, the A9 has a
pretty good floating-point unit.

> But I think I can make it even quicker with some compiler optimization
> tuning.
> The updated file in the link below has compiler optimization flags
> changed from:
> -O3
> to:
> -O3 -march=armv7-a -mcpu=cortex-a9 -mfloat-abi=softfp
> which should give it some speed boost. Does GCC use emulated floating
> point instructions by default on the A9?

The gcc defaults depend on what options were set when it was built.

> And NEON cannot accelerate double-precision float work, right?

NEON vectors are single-precision (or integer) only. Double-precision
operations are always scalar.

--
Måns Rullgård
ma...@mansr.com

Xianghua Xiao

Feb 2, 2011, 9:04:32 AM
to panda...@googlegroups.com
Any initial conclusions from these benchmarks?


Tom Turbo

Feb 4, 2011, 10:51:03 AM
to pandaboard
Yes, the PandaBoard needs ~80 seconds for this number; my 3.2 GHz P4
needs ~26 seconds. That's excellent performance considering the Panda
wasn't really designed for things like primality testing, even though
ARM processors are becoming more and more suitable for this.
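
To put the two figures side by side, here is a rough Python sketch of the arithmetic. The timings and the 3.2 GHz P4 clock come from this thread; the 1.0 GHz PandaBoard clock is an assumption that no poster states, and multiplying time by clock rate is only a crude per-clock comparison that ignores memory, compiler and micro-architecture differences.

```python
# Approximate timings quoted in the thread, in seconds.
panda_seconds = 80.0
p4_seconds = 26.0

# Clock rates in GHz. The P4 value is from the thread; the PandaBoard
# value is an ASSUMPTION made for illustration only.
p4_ghz = 3.2
panda_ghz_assumed = 1.0


def slowdown(slow_seconds, fast_seconds):
    """How many times longer the slower machine takes."""
    return slow_seconds / fast_seconds


def gigacycles(seconds, ghz):
    """Rough count of clock cycles spent, in units of 10^9 cycles."""
    return seconds * ghz


if __name__ == "__main__":
    ratio = slowdown(panda_seconds, p4_seconds)
    print(f"Panda takes about {ratio:.1f}x as long as the P4")
    print(f"Panda: ~{gigacycles(panda_seconds, panda_ghz_assumed):.0f} Gcycles")
    print(f"P4:    ~{gigacycles(p4_seconds, p4_ghz):.0f} Gcycles")
```

With these inputs the Panda takes about 3.1 times as long, and under the assumed clock rate both machines spend a similar number of cycles on the test (about 80 versus 83 billion).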