Being curious about hardfp, I've used the povray benchmark to get some
numbers. I used povray 2.6.1; see
http://www.povray.org/download/benchmark.php for an explanation. I think
that gives an impression of how much applications with heavy
floating-point usage might gain from hardfp.
I ran those tests on a BeagleBoard C4 (not the xM, 720 MHz) with the
same (vanilla) kernel 2.6.37.3. Both systems were on the same USB hard
disk, on different ext4 partitions of the same size.
The whole softfp-system was compiled using
CFLAGS="-Os -pipe -mtune=cortex-a8 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -fomit-frame-pointer"
CXXFLAGS="${CFLAGS} -std=gnu++0x -fvisibility-inlines-hidden"
CFLAGS="${CFLAGS} -std=gnu99"
LDFLAGS="-Wl,-O1 -Wl,--enable-new-dtags -Wl,--sort-common -Wl,--as-needed"
and the hardfp-system was compiled using
CFLAGS="-Os -pipe -mtune=cortex-a8 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=hard -fomit-frame-pointer"
CXXFLAGS="${CFLAGS} -std=gnu++0x -fvisibility-inlines-hidden"
CFLAGS="${CFLAGS} -std=gnu99"
LDFLAGS="-Wl,-O1 -Wl,--enable-new-dtags -Wl,--sort-common -Wl,--as-needed"
All package versions were the same, and the same patches (if any) were
used. The gcc version was 4.5.2, binutils was 2.21 and glibc was 2.11.2.
Here are the times for "time povray benchmark.ini":
softfp:
Total Time: 10 hours 39 minutes 23 seconds (38363 seconds)
real 639m23.292s
user 639m17.914s
sys 0m0.430s
hardfp:
Total Time: 10 hours 3 minutes 25 seconds (36205 seconds)
real 603m24.803s
user 603m21.188s
sys 0m0.422s
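The povray Total Time and the shell's real time agree to within a second, and user time accounts for nearly all of real time (sys is under half a second), which suggests the runs were CPU-bound rather than limited by the USB disk. A small Python sketch to check this from the raw output; the helper names are hypothetical, and it assumes bash's default `time` format (`XmY.YYYs`) and the povray line format shown above.

```python
import re

def time_to_seconds(field: str) -> float:
    """Convert a bash `time` field such as '639m23.292s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", field)
    if match is None:
        raise ValueError(f"unexpected time field: {field!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

def total_time_to_seconds(line: str) -> int:
    """Parse povray's 'Total Time: ...' line and check the h/m/s part
    against the total given in parentheses."""
    match = re.search(
        r"(\d+) hours (\d+) minutes (\d+) seconds \((\d+) seconds\)", line)
    if match is None:
        raise ValueError(f"unexpected Total Time line: {line!r}")
    hours, minutes, seconds, total = map(int, match.groups())
    if hours * 3600 + minutes * 60 + seconds != total:
        raise ValueError(f"inconsistent Total Time line: {line!r}")
    return total

# Raw output copied from the -Os runs above: povray total, `time` real and user.
runs = {
    "softfp": ("Total Time: 10 hours 39 minutes 23 seconds (38363 seconds)",
               "639m23.292s", "639m17.914s"),
    "hardfp": ("Total Time: 10 hours 3 minutes 25 seconds (36205 seconds)",
               "603m24.803s", "603m21.188s"),
}

for abi, (total_line, real, user) in runs.items():
    total = total_time_to_seconds(total_line)
    real_s = time_to_seconds(real)
    user_s = time_to_seconds(user)
    # A user/real ratio close to 1.0 means the run was almost entirely CPU time.
    print(f"{abi}: povray {total} s, real {real_s:.1f} s, "
          f"user/real {user_s / real_s:.4f}")
```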
Being curious about the compiler optimisations, I ran the same
benchmark on the same systems, using -O3 instead of -Os to compile
povray:
softfp:
Total Time: 9 hours 49 minutes 29 seconds (35369 seconds)
real 589m29.634s
user 589m24.016s
sys 0m0.422s
hardfp:
Total Time: 9 hours 22 minutes 13 seconds (33733 seconds)
real 562m12.603s
user 562m9.320s
sys 0m0.469s
So it looks like using hardfp instead of softfp might gain about 5-6%
for applications which make heavy use of floating point (hardfp was
6.0% faster with -Os and 4.8% faster with -O3).
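The gain can be expressed in two ways: as the fraction of run time saved, or as how much faster hardfp runs, and the two give slightly different percentages. A minimal Python sketch computing both from the Total Time values above (the function names are illustrative only):

```python
# Total Time values (seconds) reported by povray in the runs above.
timings = {
    ("-Os", "softfp"): 38363,
    ("-Os", "hardfp"): 36205,
    ("-O3", "softfp"): 35369,
    ("-O3", "hardfp"): 33733,
}

def time_saved_percent(slower: float, faster: float) -> float:
    """Percentage of the slower run's time that the faster run saves."""
    return (slower - faster) / slower * 100

def speedup_percent(slower: float, faster: float) -> float:
    """How much faster (in percent) the faster configuration runs."""
    return (slower / faster - 1) * 100

for opt in ("-Os", "-O3"):
    soft = timings[(opt, "softfp")]
    hard = timings[(opt, "hardfp")]
    print(f"{opt}: hardfp saves {time_saved_percent(soft, hard):.1f}% of the "
          f"run time, i.e. runs {speedup_percent(soft, hard):.1f}% faster")
```

With -Os hardfp saves about 5.6% of the run time (about 6.0% faster); with -O3 it saves about 4.6% (about 4.8% faster).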
I don't want to judge whether -Os, -O2 or -O3 is better for your use
case. Those optimizations can have significant implications, especially
with regard to floating point, and the fastest optimization level won't
always be the right fit.
Regards,
Alexander Holler
PS: Before someone asks why I'm using -std=gnu++0x: I use it because
C++0x offers some nice new features, especially with regard to
"perfect forwarding", and I think almost all C++ programs might benefit
from them if those features are used, e.g. by the STL. I haven't
checked whether they are already used somewhere in the standard
libraries (or templates), but ...
Be aware that using -std=gnu++0x actually breaks compilation of a few
C++ programs.
> Hello,
>
> being curious about hardfp, I've used the povray benchmark to get
> some numbers. I've used povray 2.6.1, see
> http://www.povray.org/download/benchmark.php for an explanation. I
> think that will give an impression about how much applications with
> heavy floating point usage might gain from hardfp.
>
For applications passing floats to and from functions a lot, yes.
Povray appears to be one of these.
> I don't want to interpret if -Os, -O2 or -O3 might be better for your
> use case, those optimizations could have heavy implications,
> especially in regard to floating point and using the fastest
> optimizations won't always fit.
If you want fast, get a Panda:
Total Time: 1 hours 36 minutes 38 seconds (5798 seconds)
For reference, on an Intel Core i7 940 (2.93GHz):
Total Time: 0 hours 19 minutes 8 seconds (1148 seconds)
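As rough ratios of the Total Time figures quoted in this thread, in a short Python sketch (the function name is illustrative; the build flags and setup used for the Panda and i7 runs are not stated, so these are ballpark numbers rather than like-for-like comparisons):

```python
# Total Time values (seconds) quoted in this thread.
beagle_hardfp_o3 = 33733  # BeagleBoard C4, 720 MHz, hardfp, -O3
panda = 5798              # PandaBoard (build flags not stated)
core_i7_940 = 1148        # Intel Core i7 940, 2.93 GHz (build flags not stated)

def ratio(slower: float, faster: float) -> float:
    """How many times faster the faster system finished the benchmark."""
    return slower / faster

print(f"Panda vs BeagleBoard: {ratio(beagle_hardfp_o3, panda):.1f}x")
print(f"Core i7 940 vs Panda: {ratio(panda, core_i7_940):.1f}x")
```

This gives roughly 5.8x for the Panda over the hardfp -O3 BeagleBoard run, and roughly 5.1x for the i7 over the Panda.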
--
Måns Rullgård
ma...@mansr.com
On 12.03.2011 15:20, Måns Rullgård wrote:
> If you want fast, get a Panda:
Besides the fact that I was only interested in some numbers for hardfp
vs. softfp, using a Panda would transport the benchmark back to 1970. ;)
Regards,
Alexander Holler
Mans, can you confirm the Beagle figures? Just out of curiosity, do you
have numbers for an Atom-based system?
Yes, I got similar figures on a Beagle C3. The huge difference from the
Panda is due to the non-pipelined VFP in the Cortex-A8. The hardfp vs.
softfp difference should also be much smaller on the A9.
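Clock speed alone accounts for only part of the Beagle/Panda gap. The Python sketch below scales the BeagleBoard time to 1 GHz under the simplifying assumption that run time is inversely proportional to clock frequency (ignoring memory and cache differences), and it assumes the Panda ran at 1 GHz; the function name is hypothetical.

```python
def scale_to_clock(seconds: float, clock_mhz: float, target_mhz: float) -> float:
    """Estimate the run time at target_mhz, ASSUMING time scales inversely
    with clock frequency (memory speed, caches, etc. are ignored)."""
    return seconds * clock_mhz / target_mhz

beagle_s = 33733  # BeagleBoard C4 at 720 MHz, hardfp -O3 (from this thread)
panda_s = 5798    # Panda; a 1 GHz clock is an assumption here

beagle_at_1ghz = scale_to_clock(beagle_s, 720, 1000)
print(f"clock ratio alone: {1000 / 720:.2f}x")
print(f"remaining per-clock gap: {beagle_at_1ghz / panda_s:.1f}x")
```

Under those assumptions the clock accounts for about 1.39x, leaving a per-clock gap of about 4.2x, which is consistent with the explanation that the Cortex-A8 VFP itself is the bottleneck.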
The Nvidia Tegra2 with VFPv3-D16, also at 1 GHz, gets this result:
Total Time: 1 hours 43 minutes 48 seconds (6228 seconds)
It is only slightly slower than the Panda (about 7%).
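Since both boards run at 1 GHz, their Total Time figures can be compared directly. A one-function Python sketch (the name is illustrative):

```python
panda_s = 5798   # Panda, Total Time in seconds (from this thread)
tegra2_s = 6228  # Nvidia Tegra2 with VFPv3-D16 at 1 GHz (from this thread)

def percent_slower(slow: float, fast: float) -> float:
    """How much longer (in percent) the slower run took than the faster one."""
    return (slow / fast - 1) * 100

print(f"Tegra2 took {percent_slower(tegra2_s, panda_s):.1f}% longer than the Panda")
```

That comes to roughly 7.4% longer for the Tegra2.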
> Just out of curiosity, do you have numbers for an Atom based system ?
I do not.
--
Måns Rullgård
ma...@mansr.com