Update On Floating Point Performance


rtp...@comcast.net

Oct 29, 2008, 1:00:09 AM
to beagl...@googlegroups.com, Måns Rullgård
Hello Folks,

Just following up:

Thanks to everyone for the feedback and suggestions. I tried the suggested options, and I even "hacked" the nbench benchmarks (which use doubles throughout their C code) to use only floats internally. It may be of some interest that neither change had any real effect: FP performance still lagged an old x86 clone running at half the clock speed.

Floating-point performance matters for many of the 3D graphics and robotics applications for which I had been considering the OMAP 3. I often have to write code that handles LU decompositions, 3D transformations, etc. in real time. So the fact that the processor is so slow at FP (relative to its integer performance) seems odd. I'm grateful that the Beagleboard is helping me evaluate it thoroughly.

Any other ideas? Is there a compiler branch somewhere that will let this new "SIMD 128-bit pipelined FP unit" beat out an AMD K6/233 from 12 years ago? With such a touted hardware FP unit (going by ARM's website), you would expect the gap between FP and integer performance not to be so large.

So I'm still a bit puzzled, unless compiler support for NEON is so immature that we're not seeing anything like the real performance.


-Sincerely,
Todd Pack


-------------- Original message ----------------------
From: Måns Rullgård <ma...@mansr.com>
>
> rtp...@comcast.net writes:
>
> > Hello Folks,
> > I built nbench for my beagleboard and compiled with flags that one
> > would be led to believe would enable floating point operation:
> >
> > -mcpu=cortex-a8 -mfloat-abi=softfp -mfpu=neon
>
> Try adding -ffast-math -fno-math-errno
>

> On the Cortex-A8, double-precision floating-point maths is not
> pipelined, and neither is single-precision if full IEEE compliance is
> required. The flags above should let the compiler generate
> floating-point code that can execute in the pipelined NEON unit for
> single-precision maths.
>
>
> Honestly, how often does anyone run code even resembling those
> benchmarks?
>
>
> That baseline is hardly relevant these days.
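
For anyone trying to reproduce this, here is a rough sketch of the combined flags and the kind of single-precision loop they affect. The function below is only an illustration, not nbench code, and the exact options may need adjusting for a given toolchain:

    /* Build with something like:
     *   gcc -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp \
     *       -ffast-math -fno-math-errno scale.c
     * Relaxing IEEE semantics this way is what allows the compiler to put
     * single-precision arithmetic through the pipelined NEON unit rather
     * than the non-pipelined VFP path. */
    void scale(float *dst, const float *src, float k, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            dst[i] = src[i] * k;   /* single precision only: no doubles creep in */
    }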

Måns Rullgård

Oct 29, 2008, 1:12:10 AM
to beagl...@googlegroups.com
rtp...@comcast.net writes:

> Hello Folks,
>
> Just following up:
>
> Thanks to everyone for the feedback and suggestions. I tried the
> suggested options, and I even "hacked" the nbench benchmarks (which
> use doubles throughout their C code) to use only floats internally.
> It may be of some interest that neither change had any real effect:
> FP performance still lagged an old x86 clone running at half the
> clock speed.
>
> Floating-point performance matters for many of the 3D graphics and
> robotics applications for which I had been considering the OMAP 3.
> I often have to write code that handles LU decompositions, 3D
> transformations, etc. in real time. So the fact that the processor
> is so slow at FP (relative to its integer performance) seems odd.
> I'm grateful that the Beagleboard is helping me evaluate it
> thoroughly.
>
> Any other ideas? Is there a compiler branch somewhere that will let
> this new "SIMD 128-bit pipelined FP unit" beat out an AMD K6/233
> from 12 years ago? With such a touted hardware FP unit (going by
> ARM's website), you would expect the gap between FP and integer
> performance not to be so large.

You have to make sure that what ARM calls runfast mode is enabled for
normal FP instructions to execute in the NEON pipeline. This includes
disabling FP exceptions and selecting the proper rounding mode. The
details should be in the manual.
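
A sketch of what that usually looks like in practice, assuming GCC
inline assembly; the bit masks here are taken from commonly circulated
Cortex-A8 examples and should be double-checked against the manual:

    /* Enable RunFast mode by rewriting FPSCR: set flush-to-zero and
     * default-NaN, clear the exception trap enables and select
     * round-to-nearest. Verify the mask values against the ARM docs. */
    static void enable_runfast(void)
    {
        static const unsigned int keep = 0x04086060;  /* bits left untouched */
        static const unsigned int set  = 0x03000000;  /* FZ and DN bits */
        unsigned int tmp;
        asm volatile (
            "fmrx %0, fpscr      \n\t"   /* read FPSCR */
            "and  %0, %0, %1     \n\t"   /* clear trap enables, rounding, stride, len */
            "orr  %0, %0, %2     \n\t"   /* set flush-to-zero and default NaN */
            "fmxr fpscr, %0      \n\t"   /* write it back */
            : "=r"(tmp)
            : "r"(keep), "r"(set)
        );
    }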

> So I'm still a bit puzzled, unless compiler support for NEON is so
> immature that we're not seeing anything like the real performance.

Compilers are certainly not very good at using the vector operations
the NEON unit is capable of.

--
Måns Rullgård
ma...@mansr.com

rtp...@comcast.net

Oct 29, 2008, 1:21:30 AM
to beagl...@googlegroups.com, Måns Rullgård
OK. Thanks again. I'll go digging.

Also, I'm using a Beagleboard Rev. B5 with the "slow NEON" unit in it, right? Will Rev. C boards have the later silicon, and will that make a big difference in this area as well? (I just read up on that issue.)

I know compilers have a hard time with vector units, but the OMAP 3 does *so* well at all the other stuff! 8> I'll keep digging and post back as I have time.


-Sincerely,
Todd Pack

-------------- Original message ----------------------
From: Måns Rullgård <ma...@mansr.com>
>
>
> You have to make sure that what ARM calls runfast mode is enabled for
> normal FP instructions to execute in the NEON pipeline. This includes
> disabling FP exceptions and selecting the proper rounding mode. The
> details should be in the manual.
>

Ian R

Oct 29, 2008, 1:19:19 PM
to Beagle Board

The Cortex-A8 was designed for high-performance vector processing
using the new NEON engine, including vectorized single-precision
floating point, but the NEON instructions have to actually be used.
Using NEON means some extra effort from the software engineer:
identifying the critical sections, then modifying them so that NEON
instructions are generated. Both gcc and armcc offer
auto-vectorization (some source code changes may be required), and
there are other approaches such as intrinsics. These can yield
substantial performance benefits on the Cortex-A8 for typical
applications, where only a small number of routines are critical.
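
For example, a minimal single-precision loop written with NEON
intrinsics might look like this (the function name is made up, and n
is assumed to be a multiple of 4 just to keep the sketch short):

    #include <arm_neon.h>

    /* y[i] += a * x[i]; compile with e.g. -mcpu=cortex-a8 -mfpu=neon
     * -mfloat-abi=softfp so that arm_neon.h and the NEON instructions
     * are available. */
    void saxpy_neon(float *y, const float *x, float a, int n)
    {
        float32x4_t va = vdupq_n_f32(a);        /* broadcast a into all four lanes */
        int i;
        for (i = 0; i < n; i += 4) {
            float32x4_t vx = vld1q_f32(x + i);  /* load four floats from x */
            float32x4_t vy = vld1q_f32(y + i);  /* load four floats from y */
            vy = vmlaq_f32(vy, va, vx);         /* vy = vy + va * vx, lane by lane */
            vst1q_f32(y + i, vy);               /* store the result back to y */
        }
    }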

The next-generation Cortex-A9 improves FP performance with a fully
pipelined scalar floating-point unit, while the NEON unit retains the
same pipeline as on the A8.