I have been maintaining the MeeGo adaptation for the BeagleBoard for quite some time now. MeeGo's
official ARM port recently switched to hardfp. Unfortunately this broke all adaptations relying on the
SGX drivers available from TI (the N900 adaptation, which drove the hardfp switch, has its own SGX libs
made by Nokia which unfortunately won't work on the Beagle).
So is there a chance to get SGX libs with hardfp support? Rumors are that other OSes are also considering
switching to hardfp (e.g. Ubuntu), so the same need will arise there as well ...
Regards,
Till
> So is there a chance to get SGX libs with hardfp support?
arm-foo-blah-strip -R .ARM.attributes libsgx.so
Now you have a magical anyfp library, at least if none of the interfaces
use floating point parameters or return values.
--
Måns Rullgård
ma...@mansr.com
That is not really a solution. There is no support from TI for this, and
it's a real problem. If major ARM Linux distributions are going hardfp,
then we need vendor-supported libraries too.
-- Antti
I do not think the distributed versions of Angstrom are hardfp. It is
possible for you to build a hardfp version of Angstrom yourself
though.
Philip
>
> Did you check that distribution already?
>
> Greetings,
>
> Han
Why would they be going hardfp then if there is no support for it?
The only hardware with supported hardfp atm is the N900, no?
So if I insist on continuing to use softfp, I'd have to compile for softfp
myself (or maintain a separate softfp build in the MeeGo OBS). I don't think
that's the way to go, as any problem introduced by this would need to be
solved by me alone. I'd prefer to use as much common MeeGo userland as
possible, just to take advantage of the fact that it runs through Nokia QA.
Thanks,
Till
Nokia is all winphone now. Is that hard or soft fp?
--
Måns Rullgård
ma...@mansr.com
Does the library interface use floats? It doesn't matter what it does
internally.
--
Måns Rullgård
ma...@mansr.com
So compiling the exact same sources for a different ABI means no QA check from Nokia applies anymore *at all*? I thought being able to change the ABI was the whole point of that OBS thingie.
Till
On Monday, 13 June 2011, Måns Rullgård wrote:
> arm-foo-blah-strip -R .ARM.attributes libsgx.so
>
> Now you have a magical anyfp library, at least if none of the interfaces
> use floating point parameters or return values.
OK, I tried that (since someone else suggested he'd had success with it).
The libs load and the apps run, but no polygons or textures are visible,
which is easily explained by a mismatch in the way floats are passed between
the app and the libs. In the meantime that person also found out that his
setup isn't really working ...
So there doesn't seem to be a way around hardfp compiled SGX libs here.
Anyone from TI listening? We need your support!
Till
If you compile in softfp mode you'll get something that works. Or do you only care about that arbitrary 'meego' label?
On Tuesday, 21 June 2011, Koen Kooi wrote:
> If you compile in softfp mode you'll get something that works. Or do you only care about that arbitrary 'meego' label?
Going back to softfp has several disadvantages I'd like to avoid:
- It may trigger hidden bugs which I'd have to resolve
- I'd have to recompile the entire MeeGo myself
- The resulting system would not be able to use programs from the repositories
- The resulting system would not make use of its hardware floating-point support
So softfp is an option I'd really like to avoid.
Till
> Hi,
>
> On Tuesday, 21 June 2011, Koen Kooi wrote:
>> If you compile in softfp mode you'll get something that works. Or do you only care about that arbitrary 'meego' label?
> Going back to softfp has several disadvantages I'd like to avoid:
[..]
> - The resulting system would not make use of its hardware floating-point support
At this point I really need to tell you to do your homework better instead of spreading nonsense. With -mfpu={vfpv3-d16,neon} -mfloat-abi=softfp you *will* get VFP and/or NEON instructions, and those *will* use the hardware blocks. It will just use a suboptimal calling convention that has no proven real-world benefit besides synthetic benchmarks and povray.
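This is easy to check for yourself; a minimal sketch (the function name
and flags are just an example):

    /* compile with: gcc -O2 -mfpu=neon -mfloat-abi=softfp -S scale.c
       the multiply still comes out as a VFP instruction (vmul.f32);
       only the parameter/return shuffling through r0/r1 differs from
       what -mfloat-abi=hard produces */
    float scale(float x, float y)
    {
        return x * y;   /* executed on the VFP/NEON hardware either way */
    }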
Since you have the basics of the issue all wrong, is it possible that your whole quest for hardfp libs is wrong as well?
On Tuesday, 21 June 2011, Koen Kooi wrote:
> Since you have the basics of the issue all wrong, it is possible that your whole quest for hardfp libs is wrong as well?
Maybe, but I am glad that people like you have a very polite way of educating me.
Till
You could easily create a few wrappers for those functions that need
them.
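A rough sketch of one such wrapper, assuming GCC on ARM (the
glClearColor / softfp_glClearColor names are illustrative, and the
softfp library's real entry point is presumed to have been renamed):
a hardfp caller delivers the four floats in s0-s3, and the wrapper
just moves them to r0-r3 where a softfp-built callee expects them.

    __attribute__((naked)) void glClearColor(float r, float g, float b, float a)
    {
        __asm__ volatile(
            "vmov r0, s0\n\t"
            "vmov r1, s1\n\t"
            "vmov r2, s2\n\t"
            "vmov r3, s3\n\t"
            "b    softfp_glClearColor\n\t"); /* tail-call the softfp entry point */
    }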
--
Måns Rullgård
ma...@mansr.com
> While Måns is right that you could technically create hardfp/softfp
> wrappers with a bit of assembly fancy dancing,
There is an even simpler way. Declaring all functions with floating-point
parameters or return values as variadic will force soft-float parameter
passing when calling these. See the AAPCS (IHI0042D) section 6.4.1:
6.4.1 VFP and Base Standard Compatibility
Code compiled for the VFP calling standard is compatible with the base
standard (and vice-versa) if no floating-point or containerized vector
arguments or results are used, or if the only routines that pass or
return such values are variadic routines.
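As a hedged example with an invented function name (note this is only
safe for parameter types unchanged by the default argument promotions,
such as double, int and pointers; a float argument to a variadic call
would be promoted to double and break the match):

    /* vendor header (illustrative): double sgx_set_scale(double factor); */

    /* redeclared in hardfp-compiled code that must call a softfp library: */
    double sgx_set_scale(double factor, ...);

    /* per AAPCS 6.4.1 the caller now uses the base standard, passing
       'factor' in r0/r1 and expecting the result in r0/r1 -- exactly
       what the softfp-built library does */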
> that would have to be done for all APIs (not just those which pass
> floating point parameters/results) and would have terrible
> performance (especially on Cortex-A8, where moving a VFP register to
> an ARM register stalls the entire ARM for 20 cycles or so).
The performance would be no more terrible than that of a system built
with softfloat calls using the libraries unaltered, and the performance
of such systems is apparently adequate.
> That's because the softfp calling convention permits the callee to
> smash essentially *all* FPU state,
Where did you get that notion? There is nothing in the ARM ABI docs to
support it. In fact, the paragraph quoted above directly contradicts
your claim.
> while the hardfp convention is callee-save for most VFP/NEON registers
> (d8 and up plus a subset of flags).
D16-D31 are caller-saved.
> So those wrappers would have to save all FPU state that the hardfp API
> considers callee-save,
Which is _exactly the same_ as the softfp. The AAPCS defines the
caller/callee-saved aspects independently of parameter passing.
> whether or not the called function uses the FPU at all -- unless, of
> course, you are willing to run the OpenGL libraries through some sort
> of binary static analysis in order to find which FPU state each API
> touches. Ouch!
Nice straw man.
> And while Koen is right that the hardfp calling convention does not
> yet have much in the way of benchmark support
Are you implying there is some not yet benchmarked case where it
performs significantly better?
> -- and is arguably sub-optimal if your floating-point operations are
> concentrated inside innermost C functions --
Using VFP register parameters (i.e. doing nothing) is never less
efficient than moving them to core registers (doing something).
> I expect that will change as GCC gets better at using the NEON unit
> for integer SIMD and vectorized load/store operations.
Are you saying increased use of NEON by gcc will make hardfp calls
slower?
> Especially on Cortex-A9 and later cores -- which don't have the severe
> penalty for inter-pipeline transfers,
The A9 and later indeed make the softfp calls less costly, reducing any
advantage hardfp might have (which is already small in benchmarks on A8).
> and do have dedicated lanes to memory for the NEON unit
No core released to date, including the A15, has dedicated memory lanes
for NEON. All the Cortex-A* cores have a common load/store unit for all
types of instructions. Some can do multiple concurrent accesses, but
that's orthogonal to this discussion.
> -- the compiler can tighten up the execution of rather a lot of code
> by trampolining structure fetches and stores through the NEON.
Do you have any numbers to back this up? I don't see how going through
NEON registers would be faster than direct LDM/STM on any core.
> If, that is, it can schedule them appropriately to account for
> latencies to and from memory as well as the (reduced but non-zero)
> latency of VFP<->ARM transfers.
The out of order issue on A9 and later makes most such tricks unnecessary.
> The softfp ABI interferes with this by denying the compiler the
> privilege of rescheduling NEON instructions across a function call
> -- even one that doesn't actually use any floating point.
To the extent scheduling across function calls is permitted by the C
standard, the manner of passing parameters has no bearing on such
optimisations.
> (Any function call to which the ABI applies, anyway; which doesn't
> include static C functions, I think, but does include all C++ instance
> methods even if they get inlined -- if I remember the spec correctly.)
If a function is fully inlined, the compiler can of course do whatever
it pleases. That is the entire point of inlining.
> I should be able to produce some benchmark data in support of this
> argument in the next month or so.
You must have a unique approach to benchmarking if it produces results
contradicting everybody else's. Have you considered patenting your
methods?
> (don't forget -ffast-math if you really want NEON floating point).
-ffast-math should only be used with extreme caution as it will give
vastly different results in many cases. Specifically, anything relying
on infinities or NaN values becomes unpredictable, and operations with
very large or very small numbers may lose precision.
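A small example of the kind of thing that breaks (a sketch; the exact
behaviour depends on the compiler, here assuming GCC):

    #include <math.h>

    double safe_div(double a, double b)
    {
        double r = a / b;
        if (isnan(r))   /* with -ffast-math GCC may assume NaNs cannot
                           occur and fold this test away entirely */
            return 0.0;
        return r;
    }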
--
Måns Rullgård
ma...@mansr.com
That's a good suggestion.
>> and do have dedicated lanes to memory for the NEON unit
>
> No core released to date, including the A15, has dedicated memory lanes
> for NEON. All the Cortex-A* cores have a common load/store unit for all
> types of instructions. Some can do multiple concurrent accesses, but
> that's orthogonal to this discussion.
Probably he wanted to say that the NEON unit in the Cortex-A8 can
load/store 128 bits of data per cycle when accessing L1 cache *memory*,
while ordinary ARM load/store instructions can't handle more than 64
bits per cycle there. This makes sense in the context of this
discussion because loading data into NEON/VFP registers directly,
without dragging it through ARM registers, is not a bad idea.
>> -- the compiler can tighten up the execution of rather a lot of code
>> by trampolining structure fetches and stores through the NEON.
>
> Do you have any numbers to back this up? I don't see how going through
> NEON registers would be faster than direct LDM/STM on any core.
My understanding is that it's exactly the other way around. Using
hardfp allows floating-point data to avoid going through ARM registers,
which otherwise might be needed for the sole purpose of fulfilling ABI
requirements in some cases. You are going a bit overboard trying to
argue with absolutely everything that Edwards has posted :)
As for NEON vs. LDM/STM: there are indeed no reasons why, for example,
a NEON memcpy should be faster than LDM/STM for large memory buffers
which do not fit in the caches. But still this is the case on OMAP3,
along with some other memory-performance-related WTF questions.
>> If, that is, it can schedule them appropriately to account for
>> latencies to and from memory as well as the (reduced but non-zero)
>> latency of VFP<->ARM transfers.
>
> The out of order issue on A9 and later makes most such tricks unnecessary.
The VFP/NEON unit in the A9 is still in-order.
--
Best regards,
Siarhei Siamashka
>>> and do have dedicated lanes to memory for the NEON unit
>>
>> No core released to date, including the A15, has dedicated memory lanes
>> for NEON. All the Cortex-A* cores have a common load/store unit for all
>> types of instructions. Some can do multiple concurrent accesses, but
>> that's orthogonal to this discussion.
>
> Probably he wanted to say that the NEON unit in the Cortex-A8 can
> load/store 128 bits of data per cycle when accessing L1 cache *memory*,
> while ordinary ARM load/store instructions can't handle more than 64
> bits per cycle there. This makes sense in the context of this
> discussion because loading data into NEON/VFP registers directly,
> without dragging it through ARM registers, is not a bad idea.
That has nothing to do with calling conventions.
>>> -- the compiler can tighten up the execution of rather a lot of code
>>> by trampolining structure fetches and stores through the NEON.
>>
>> Do you have any numbers to back this up? I don't see how going through
>> NEON registers would be faster than direct LDM/STM on any core.
>
> My understanding is that it's exactly the other way around. Using
> hardfp allows floating-point data to avoid going through ARM registers,
> which otherwise might be needed for the sole purpose of fulfilling ABI
> requirements in some cases. You are going a bit overboard trying to
> argue with absolutely everything that Edwards has posted :)
I think he is under the false impression that softfp doesn't have any
callee-saved registers. If that were the case, a leaf function would
avoid the tiny overhead of preserving d8-d15. I can't imagine any
situation where this would make a difference, even if it were true.
> As for NEON vs. LDM/STM: there are indeed no reasons why, for example,
> a NEON memcpy should be faster than LDM/STM for large memory buffers
> which do not fit in the caches. But still this is the case on OMAP3,
> along with some other memory-performance-related WTF questions.
Using NEON for memcpy has the potential of being more efficient simply
because it has enough registers to hold several cache lines of data.
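A sketch of the idea (not a tuned implementation; alignment handling
and PLD placement matter a lot in practice):

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    /* copies one 64-byte cache line per iteration through q registers */
    void copy_neon(uint8_t *dst, const uint8_t *src, size_t n)
    {
        size_t i;
        for (i = 0; i + 64 <= n; i += 64) {
            uint8x16_t a = vld1q_u8(src + i);
            uint8x16_t b = vld1q_u8(src + i + 16);
            uint8x16_t c = vld1q_u8(src + i + 32);
            uint8x16_t d = vld1q_u8(src + i + 48);
            vst1q_u8(dst + i,      a);
            vst1q_u8(dst + i + 16, b);
            vst1q_u8(dst + i + 32, c);
            vst1q_u8(dst + i + 48, d);
        }
        for (; i < n; i++)  /* byte tail */
            dst[i] = src[i];
    }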
Michael seems to be arguing for loading things to NEON registers, then
transferring to ARM rather than loading directly to core registers,
which would be an entirely pointless thing to do.
>>> If, that is, it can schedule them appropriately to account for
>>> latencies to and from memory as well as the (reduced but non-zero)
>>> latency of VFP<->ARM transfers.
>>
>> The out of order issue on A9 and later makes most such tricks unnecessary.
>
> VFP/NEON unit from A9 is still in-order.
The A9 issues normal loads out of order with other integer instructions,
meaning bouncing data through NEON is pointless.
--
Måns Rullgård
ma...@mansr.com
The ARM AAPCS says this:
Registers s16-s31 (d8-d15, q4-q7) must be preserved across
subroutine calls; registers s0-s15 (d0-d7, q0-q3) do not need
to be preserved (and can be used for passing arguments or
returning results in standard procedure-call variants).
Registers d16-d31 (q8-q15), if present, do not need to be
preserved.
http://infocenter.arm.com/help/topic/com.arm.doc.ihi0042d/index.html
Laurent
> 2011/6/24 Måns Rullgård <ma...@mansr.com>:
> [...]
>> I think he is under the false impression that softfp doesn't have any
>> callee-saved registers. If that were the case, a leaf function would
>> avoid the tiny overhead of preserving d8-d15. I can't imagine any
>> situation where this would make a difference, even if it were true.
>
> The ARM AAPCS says this:
>
> Registers s16-s31 (d8-d15, q4-q7) must be preserved across
> subroutine calls; registers s0-s15 (d0-d7, q0-q3) do not need
> to be preserved (and can be used for passing arguments or
> returning results in standard procedure-call variants).
> Registers d16-d31 (q8-q15), if present, do not need to be
> preserved.
Exactly. D8-D15 must always be preserved, no others need ever be
preserved. How arguments are passed does not matter.
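For instance (a sketch; whether d8 actually gets used depends on the
compiler's register allocation):

    /* if the compiler keeps the accumulator in d8, the prologue gains a
       vpush {d8} and the epilogue a vpop {d8} -- identically under
       -mfloat-abi=softfp and -mfloat-abi=hard */
    double sumsq(const double *p, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += p[i] * p[i];
        return s;
    }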
--
Måns Rullgård
ma...@mansr.com
When the hardware has somewhat weird and unpredictable behavior, then
it surely makes sense to try different ways of doing the same thing and
to select whatever appears to provide better results in practice. But
normally holding several cache lines of data is not a very good use
for the registers. All the accesses to memory are buffered, and the
victim buffer can hold multiple cache lines anyway. So if you are
worried about a potential negative impact of interleaving reads and
writes to SDRAM, then this is supposed to be already addressed.
I don't know about the other Cortex-A8 based SoCs, but for example the
Samsung Hummingbird seems to be very good and quite predictable in
everything related to memory performance. It only requires some
prefetching via PLD instructions, but that's enough to fully utilize
memory bandwidth in many cases, regardless of what kind of instructions
are actually used to access the memory. OMAPs are surely more
difficult and may need special tricks in order not to incur an
unexpected performance loss.
> 2011/6/24 Måns Rullgård <ma...@mansr.com>:
>> Siarhei Siamashka <siarhei....@gmail.com> writes:
>>> As for NEON vs. LDM/STM: there are indeed no reasons why, for example,
>>> a NEON memcpy should be faster than LDM/STM for large memory buffers
>>> which do not fit in the caches. But still this is the case on OMAP3,
>>> along with some other memory-performance-related WTF questions.
>>
>> Using NEON for memcpy has the potential of being more efficient simply
>> because it has enough registers to hold several cache lines of data.
>
> When the hardware has somewhat weird and unpredictable behavior, then
> it surely makes sense to try different ways of doing the same thing and
> to select whatever appears to provide better results in practice. But
> normally holding several cache lines of data is not a very good use
> for the registers. All the accesses to memory are buffered, and the
> victim buffer can hold multiple cache lines anyway. So if you are
> worried about a potential negative impact of interleaving reads and
> writes to SDRAM, then this is supposed to be already addressed.
Writing full cache lines can avoid a needless line fill in a
write-allocate cache. The A9 has a 4x64-bit store buffer, which is
exactly one cache line.
--
Måns Rullgård
ma...@mansr.com
If we are speaking about the A8 (that's what is used in BeagleBoards
after all), then write-allocate was not enabled for it in the Linux
kernel by default the last time I checked. And based on my old tests,
enabling write-allocate was a real performance disaster for the
OMAP3430 on memcpy-like workloads. The OMAP3630 was better, but still
suffered from some measurable slowdown. And I could not find any real
use cases where write-allocate could show a clear performance
advantage. If you have some different results which prove
write-allocate's usefulness on the A8, then I'm definitely interested
in this information.
The A9 is a bit of a different beast, and write-allocate is needed
there for SMP. But still, one of the things the OMAP4430 does quite
well is memset, so it does not seem to suffer from write-allocate at all.
Anyway, I'm still waiting for my Origen board to be delivered before
doing an in-depth comparison between OMAP4 and Exynos4 to get a better
understanding of what the ARM Cortex-A9 is actually capable of.
But even theoretically, one store buffer should be enough to eliminate
any needless line fills if the data is written sequentially to a
single destination buffer and no other unrelated memory writes happen
in the same inner loop.
> On Jun 24, 3:38 am, Måns Rullgård <m...@mansr.com> wrote:
>> The performance would be no more terrible than that of a system built
>> with softfloat calls using the libraries unaltered, and the performance
>> of such systems is apparently adequate.
>
> Not for my purposes, it's not; but then I write a lot of heavily
> templatized C++ code, and am willing to go through some fairly ugly
> contortions to tighten it up. However, my statement about having to
> wrap function calls with no floating point parameters was quite wrong,
> and I retract it unconditionally.
>
>> > That's because the softfp calling convention permits the callee to
>> > smash essentially *all* FPU state,
>>
>> Where did you get that notion? There is nothing in the ARM ABI docs to
>> support it. In fact, the paragraph quoted above directly contradicts
>> your claim.
>
> You're absolutely right. Q4-Q7 are just as callee-save under the
> softfp ABI as they are under the hardfp ABI. The only additional
> *explicit* state that the official hardfp convention allows one to
> preserve -- not trivially, but with some effort -- is Q0-Q3. (That
> can be done by systematically altering your otherwise non-floating-
> point-using APIs.)
I fail to make sense of that paragraph. D0-D7 are call-clobbered, no
exceptions. If they are not used for arguments, the callee may still
use them as scratch registers.
>> > while the hardfp convention is callee-save for most VFP/NEON registers
>> > (d8 and up plus a subset of flags).
>>
>> D16-D31 are caller-saved.
>
> Mmm, so they are. This is another thing I was misremembering.
> Largely because I don't permit userland code to use them.
So you've invented your own, crippled ABI, then complain about
performance. Clever.
> I work on embedded systems where I control how all the code is
> compiled, and I compile for a neon-d16-fp16 model that doesn't
> correspond to any real hardware.
Any NEON implementation is required to have the full set of 32 D
registers. If you allow NEON, there is no point in restricting the
number of registers. (For pure VFP code, doing so allows the same code
to be used on both full and reduced register set implementations, at a
slight performance cost.)
> I intend to reserve the upper half of the VFP/NEON register bank for
> use in-kernel, so I can trampoline data moves through D16-D31 without
> having to save userland's content and restore it afterwards. (Not
> because saving and restoring them is expensive, but because it would
> have to be done from a place in the kernel where the FPU context-save
> thingy is handy. I'd rather just use Q8-Q15 as scratch registers
> anywhere in the kernel I want to, with nothing to save/restore but the
> FPSCR.)
I can't imagine the cost of stealing these registers from heavy
float/simd users being compensated by a few minor savings in the kernel.
>> > And while Koen is right that the hardfp calling convention does not
>> > yet have much in the way of benchmark support
>>
>> Are you implying there is some not yet benchmarked case where it
>> performs significantly better?
>
> Oh yes. Presently, only when combined with APIs that sling structures
> opaquely as composite types, and code that uses NEON intrinsics to
> load and store them.
Sounds like poor API design.
> But I am expecting those techniques to become common inside template
> libraries within the next couple of years.
If you are right, that's yet another reason to avoid such libraries.
> And even in some non-template libraries; you might take a look at the
> NEON specializations inside Cairo and libjpeg-turbo, and extrapolate
> those to the hard-float case.
I haven't looked at Cairo, but libjpeg uses NEON for things like IDCT
and colourspace conversions. Nowhere are floats or simd vectors passed
by value to a function, at least not where it matters for performance.
>> > -- and is arguably sub-optimal if your floating-point operations are
>> > concentrated inside innermost C functions --
>>
>> Using VFP register parameters (i.e. doing nothing) is never less
>> efficient than moving them to core registers (doing something).
>
> On the contrary; hardfp can definitely be a net lose on real code.
> Consider cases where the outer function slings structures with mixed
> integers and floats, and the inner function does the actual floating
> point arithmetic. The hardfp convention requires the caller to
> transfer floating point parameters into VFP registers before entering
> the function, rather than leaving them in integer registers (where
> they can be put for free, because they are already in L1).
Sounds like that API really ought to be passing a pointer to a struct,
not passing the struct by value.
> That's probably a trivial effect; but at least on Cortex-A8, there are
> others that hit some code bases much harder. What if the callee does
> no arithmetic, but passes the argument to a variadic function? Or the
> callee returns a value fetched from memory, which happens to be
> floating point, and the caller turns around and sticks it into an
> otherwise integer-filled structure? Either way, you take the full hit
> of the transfer to D0 and back to the integer side, for nothing.
You seem to be missing something about how structs are actually
represented at the backend of a compiler.
>> > I expect that will change as GCC gets better at using the NEON unit
>> > for integer SIMD and vectorized load/store operations.
>>
>> Are you saying increased use of NEON by gcc will make hardfp calls
>> slower?
>
> The reverse; but I can understand your reading my contorted syntax
> that way. I expect that GCC will get better at using the NEON unit
> for non-floating-point purposes. That will make it worthwhile for
> core libraries, from eglibc and libstdc++ on up, to adapt their
> internal calling conventions to permit the sort of "stupid
> rescheduling tricks" that win when building hardfp.
>
> You may say that it shouldn't matter for APIs that aren't "publicly
> visible", and that no human-readable API should do stupid things like
> pass an opaque operand in Q0-Q3 and return it unchanged as its return
> value (still in Q0-Q3).
Such a constraint cannot be expressed in a C API (nor a C++ one AFAIK).
To make that work, you'd have to either:
1. Change the ABI spec.
2. Teach the compiler extended semantics about specific functions in the
same way it already recognises many standard library calls.
3. Write all code by hand in assembler with no standard calling
conventions at all.
None of these seem particularly compelling, nor likely to happen.
>> > Especially on Cortex-A9 and later cores -- which don't have the severe
>> > penalty for inter-pipeline transfers,
>>
>> The A9 and later indeed make the softfp calls less costly, reducing any
>> advantage hardfp might have (which is already small in benchmarks on A8).
>
> Even the idea that A9 is less friendly *overall* to hardfp than A8 is
> debatable, at the current level of compiler implementation.
The A9 is not in any way "less friendly" to hardfp. It is, however,
less hostile to softfp.
>> > -- the compiler can tighten up the execution of rather a lot of code
>> > by trampolining structure fetches and stores through the NEON.
>>
>> Do you have any numbers to back this up? I don't see how going through
>> NEON registers would be faster than direct LDM/STM on any core.
>
> I will produce those numbers within the month, or admit defeat. :-P
> Seriously, I'd better be able to substantiate this by mid-July or so,
> or my team is going to have to rethink certain aspects of one of its
> current development efforts.
I'm glad I'm not invested in that effort.
>> > If, that is, it can schedule them appropriately to account for
>> > latencies to and from memory as well as the (reduced but non-zero)
>> > latency of VFP<->ARM transfers.
>>
>> The out of order issue on A9 and later makes most such tricks unnecessary.
>
> Er, no. Out of order issue helps reduce bubbles in the ALU for
> math-intensive loads whose working set fits in cache.
Out of order issue potentially allows a load to be issued sooner than it
appears in the instruction stream, thus hiding some of the latency
whether it hits L1 or not.
>> > The softfp ABI interferes with this by denying the compiler the
>> > privilege of rescheduling NEON instructions across a function call
>> > -- even one that doesn't actually use any floating point.
>>
>> To the extent scheduling across function calls is permitted by the C
>> standard, the manner of passing parameters has no bearing on such
>> optimisations.
>
> OK, I admit that I'm planning to cheat here. I'm going to keep state
> that the compiler would otherwise allocate to the callee-save
> registers in Q0-Q3, and keep passing this block into and back out of
> mostly non-floating-point-using APIs, which effectively makes it
> callee-save state that doesn't wind up being touched by the callee.
So you've modified the ABI again.
> When combined with the neon-d16-fp16 model, this should induce the
> compiler to use Q4-Q7 as its NEON working set. Since it knows this
> range is callee-save, it's safe to schedule loads with ample provision
> for cache miss latency, even if it has to move them across function/
> method calls.
So you've reduced the number of NEON registers from 32 to 8, and you're
hoping this will somehow improve performance. The mind boggles.
>> > (Any function call to which the ABI applies, anyway; which doesn't
>> > include static C functions, I think, but does include all C++ instance
>> > methods even if they get inlined -- if I remember the spec correctly.)
>>
>> If a function is fully inlined, the compiler can of course do whatever
>> it pleases. That is the entire point of inlining.
>
> I think it's a little subtler than that in C++; but I am no language
> lawyer. Suffice it to say that what the compiler does *in practice*
> appears to be heavily influenced by whether there is any way for the
> method to be called through a "publicly visible" symbol.
A function identifiable as a symbol, public or not, is by definition not
inlined. It is perfectly legal for the compiler to inline some or all
calls to a function while still producing a symbol with a valid entry
point for it. If this happens, this symbol must of course behave
according to ABI rules. For the inlined "calls", there is no ABI-level
call, and thus calling conventions no longer apply.
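A tiny illustration (sketch):

    static inline double twice(double x) { return x * 2.0; }

    double g(double x)
    {
        /* inlined: no call instruction is emitted here, so no calling
           convention -- hardfp or softfp -- is involved at all */
        return twice(x) + 1.0;
    }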
In summary, you have created your own ABI that reserves most of the
VFP/NEON registers for special uses that conflict with how AAPCS/VFP
passes floating-point arguments to functions. You then use this as the
foundation for a series of contradictory arguments for and/or against
the hardfp ABI relative to softfp.
--
Måns Rullgård
ma...@mansr.com
> That's exactly what the datatypes defined in AAPCS are for.
Yeah, this sounds great in theory, and this is what the compiler
people want us to believe. But the reality is rather disappointing:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725
> In many cases, I want to bypass the cache hierarchy entirely in both
> directions, because that data structure probably won't be touched
> again until after it has aged out of L2 anyway. So the fetch and
> store of "blob" are done via NEON intrinsics through a pointer that
> lies in an uncacheable mapping. Currently this is another constraint
> that cannot be expressed in a C or C++ API; but I don't intend to let
> that stop me, either.
Why would you want to read uncached memory? That's already a huge
performance loss. For example, there is a "shadow framebuffer" in the
xf86-video-fbdev driver, which exists specifically to get more or less
reasonable performance when attempting to read pixel data back.
Moreover, you can easily enable write-through caching for the
framebuffer on OMAP3 systems, which can be used instead of the shadow
framebuffer with some really good performance results.
>> So you've reduced the number of NEON registers from 32 to 8, and you're
>> hoping this will somehow improve performance. The mind boggles.
>
> Now who's waving around straw men? The load patterns that I'm worried
> about don't often use NEON for algorithms that need 32 8-byte
> registers. Yes, having that full bank of registers makes libjpeg-
> turbo's iDCT more compact; but I don't much care, because JPEG decode
> latency is not the most critical thing in my system.
If you don't care about having any real NEON optimizations in your
system (for JPEG or anything else), then it's surely your choice. It's
the great freedom of open source, etc. But I seriously doubt that
anyone else would be interested :-)
Your post was very verbose and I'm sorry for not replying to the rest
of it. At least it looks like you can find the relevant documentation,
read it and (mis)interpret it somehow ;-) The question remains whether
you can actually use all of this information in practice to your
advantage. And if you find some really good performance tricks with
hardfp, ARM or VFP/NEON code, then I would surely be very interested
to look at the compilable examples and benchmark numbers.
If the compiler doesn't know your functions are required to preserve
q0-q3, it will have to assume they are clobbered by a call.
>> > I work on embedded systems where I control how all the code is
>> > compiled, and I compile for a neon-d16-fp16 model that doesn't
>> > correspond to any real hardware.
>>
>> Any NEON implementation is required to have the full set of 32 D
>> registers. If you allow NEON, there is no point in restricting the
>> number of registers. (For pure VFP code, doing so allows the same code
>> to be used on both full and reduced register set implementations, at a
>> slight performance cost.)
>
> As I think I explained, the point of restricting the number of
> registers used in userland code is to leave them free for use in
> kernel code,
So either you are right and every kernel developer I've ever heard of is
wrong, or there is nothing significant to be gained from using NEON in
the kernel (outside a few isolated areas like RAID checksumming and some
crypto functions, as was recently discussed).
Are you saying everybody else is imagining their code running orders of
magnitude faster with NEON than without?
>> > I intend to reserve the upper half of the VFP/NEON register bank for
>> > use in-kernel, so I can trampoline data moves through D16-D31 without
>> > having to save userland's content and restore it afterwards. (Not
>> > because saving and restoring them is expensive, but because it would
>> > have to be done from a place in the kernel where the FPU context-save
>> > thingy is handy. I'd rather just use Q8-Q15 as scratch registers
>> > anywhere in the kernel I want to, with nothing to save/restore but the
>> > FPSCR.)
>>
>> I can't imagine the cost of stealing these registers from heavy
>> float/simd users being compensated by a few minor savings in the kernel.
>
> Well, I tried to explain the part about keeping save/restore overhead
> down. I can add a couple of things: unlike ARM-side
> "registers" (which are really just labels in the instruction stream,
> and are allocated from a larger pool of physical registers),
The A9 and up use register renaming from a larger pool. The A8 is fully
in-order and thus has no need for this.
> NEON registers are locked to real hardware locations.
On the A15 NEON registers are allocated from the same pool as core
registers.
> So if the kernel needs to spill userland's values from D16-D31 in
> order to use them for bulk data moves, the store operation is going to
> stall waiting for the completion of any outstanding userland-initiated
> pipeline activity involving them. And on the return to userland, the
> load operation that restores their contents will have to complete
> before the user process can really get going again.
On a context switch it is sometimes necessary to stall in order for any
potential exceptions to be taken in the correct context. Once a store
has cleared all such checks, there is no need to block waiting for it to
hit actual RAM/cache.
> So for short trips into and out of kernel
Short trips into the kernel are generally considered murder for
performance for a number of other reasons, even when the kernel does not
touch the VFP context at all.
> Perhaps someone else could try rephrasing in language Måns might find
> more enlightening -- or correcting me if I'm wrong, which is always
> possible. Otherwise, I guess we're going to have to wait until the
> benchmarks are in. Obviously, if reserving D16-D31 for kernel use
> doesn't prove to be a win in our full system, we won't do it. But my
> measure of "win" may be different from yours. I don't care about
> maximizing the idle fraction of CPU; I care about making my system's
> UI as responsive and jitter-free as possible, even though the bulk of
> the SoC's throughput to DRAM is occupied by video capture/encode/
> decode/display traffic.
The memory system is probably the weakest point in the A8. Having more
registers often means doing fewer loads and stores, which translates
directly into higher throughput.
>> > And even in some non-template libraries; you might take a look at the
>> > NEON specializations inside Cairo and libjpeg-turbo, and extrapolate
>> > those to the hard-float case.
>>
>> I haven't looked at Cairo, but libjpeg uses NEON for things like IDCT
>> and colourspace conversions. Nowhere are floats or simd vectors passed
>> by value to a function, at least not where it matters for performance.
>
> As I wrote, "extrapolate these to the hard-float case".
I still don't understand what you meant by that.
> If you look at the code a bit, perhaps you can see the potential
> benefit of refactoring libjpeg-turbo so that jsimd_idct_ifast_neon()
> is written using compiler intrinsics rather than raw assembly, and
> letting the compiler handle register allocation and load/store
> latencies?
No compiler has ever beaten me at either of those tasks, not even come
close.
> And of rewriting idct_helper and transpose_4x4 as inline functions,
> operating on the 8x8 block of 16-bit coefficients -- i. e., a 128-byte
> chunk of data passed by value?
The existing hand-written IDCT is close to as fast as it can possibly be
done without sacrificing precision. Introducing hundreds of ways for
the compiler to screw up is not going to make it any faster.
>> >> Using VFP register parameters (i.e. doing nothing) is never less
>> >> efficient than moving them to core registers (doing something).
>>
>> > On the contrary; hardfp can definitely be a net lose on real code.
>> > Consider cases where the outer function slings structures with mixed
>> > integers and floats, and the inner function does the actual floating
>> > point arithmetic. The hardfp convention requires the caller to
>> > transfer floating point parameters into VFP registers before entering
>> > the function, rather than leaving them in integer registers (where
>> > they can be put for free, because they are already in L1).
>>
>> Sounds like that API really ought to be passing a pointer to a struct,
>> not passing the struct by value.
>
> The inner function doesn't know anything about the struct; it operates
> on bare floats/doubles. The outer function slings mixed structures,
Can you please provide an accurate technical definition of what it
entails to "sling mixed structures"?
> and as soon as it touches them at all, it has them in L1. Under the
> softfp convention, the outer function can pull the floating point
> operands of the inner function into integer registers any time it's
> convenient, maybe as part of an LDM that pulls in some integer/pointer
> elements of the same struct. Then they just need to be spilled out
> onto the stack for the function call, either in the caller (for
> operands beyond the first 4 words' worth) or in the callee (typically
> in the function preamble).
If using hardfp, the caller can load the values directly into VFP
registers at any convenient time, and there is nothing further to be
done. That cannot be slower in any way.
> The callee loads them into VFP registers; at hardware level, this
> happens via a lookaside to L1, so it's basically free as far as memory
> traffic goes.
Having the values already in registers is also free.
> As long as enough useful work can be scheduled in the callee to cover
> the VLD latency, it's all good.
If the values are loaded by the caller, there is possibly more room to
schedule the loads efficiently.
> Back to the specific example I cited: Under the hardfp convention,
> the floating point operands have to get moved over to the VFP side
> before the function call,
Or loaded directly there with VLDR or VLDM.
> which would involve two VMOVs per 64-bit operand.
A single VMOV can move two 32-bit core registers into one VFP D
register, i.e. one VMOV per 64-bit value, and that's only needed if for
some weird reason the float values were sitting in core registers.
Under a hardfp ABI, there is rarely any reason for them to be there,
rather they'd be loaded directly from memory or transferred from
wherever some prior floating-point computation placed the result.
> That's stupid, so instead it gets done by a spill to stack followed by
> a VLD, or by a separate load from the original structure. This may no
> longer be in L1, of course; so there's an opportunity for the compiler
> to screw up; a well written compiler shouldn't. So basically, there's
> going to be a VLD from stack either right before or right after the
> branch.
The only time there will necessarily be a stack access is if passing
more arguments than fit in registers. For hardfp, that's 8 double
precision or 16 single precision values in addition to any integer or
pointer values. Having functions with that many arguments is rare
indeed. On the other hand, if using softfloat calls, only 2 double (4
single) float values may be passed in registers, and one or more
argument is likely to end up on the stack. To summarise, a hardfloat
call looks like this:
1. Load arguments to VFP registers
2. Call function
A softfloat call looks like this:
1. Load values to registers
2. Store values on stack
3. Call function
4. Load values from stack to registers
You are saying 4 steps are more efficient than 2.
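Roughly, for a function like the following (a simplified sketch, not
actual compiler output; lr saving and stack alignment details are
omitted on the softfp side):

    extern double scale(double x, double y, double z);

    double use(const double *p)
    {
        return scale(p[0], p[1], p[2]);
    }

    /* hardfp:                      softfp:
         vldr d0, [r0]                sub  sp, sp, #8
         vldr d1, [r0, #8]            ldrd r2, r3, [r0, #16]
         vldr d2, [r0, #16]           strd r2, r3, [sp]      @ third double: stack
         b    scale  @ tail call      ldrd r2, r3, [r0, #8]
                                      ldrd r0, r1, [r0]
                                      bl   scale
                                      add  sp, sp, #8        */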
>> > That's probably a trivial effect; but at least on Cortex-A8, there are
>> > others that hit some code bases much harder. What if the callee does
>> > no arithmetic, but passes the argument to a variadic function? Or the
>> > callee returns a value fetched from memory, which happens to be
>> > floating point, and the caller turns around and sticks it into an
>> > otherwise integer-filled structure? Either way, you take the full hit
>> > of the transfer to D0 and back to the integer side, for nothing.
>>
>> You seem to be missing something about how structs are actually
>> represented at the backend of a compiler.
>
> Educate me. I say I have a double X in a struct in memory, which I
> want to pass to non-variadic function A, which then passes it to
> variadic function B. The hardfp convention requires that I pull X
> into D0 before branching to A, which has to move it from D0 to r0+r1
> before passing it to B. What about "how structs are actually
> represented at the backend of a compiler" saves me from the overhead
> of this maneuver, relative to the softfp convention (in which X is in
> r0+r1 for the call to A and needn't be touched before A calls B)?
This situation is hardly common (printf calls for debugging aside), and
certainly shouldn't be in any performance-critical code.
> In the second example in that paragraph, I call function C, which
> returns a double Y (fetched from memory, not computed). I want to
> stick this in a struct along with integer J and pointer Q. The hardfp
> convention requires that Y be returned in D0, and to get it into the
> struct I may need to issue three separate stores (STR, VSTR, STR --
> assuming Y is between J and Q and I'm exploiting address
> post-increment). In the softfp convention, Y will be returned in r0+r1,
> and all I have to do is shuffle it into appropriate registers and
> issue one STM.
This is again a fairly contrived example, and a bad one at that. Any
sane person would order that struct as {int, int, double} to minimise
padding. This would allow storing the int values using strd or stm and
the double with vstr, which sequence takes no longer than an stm with
more registers.
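In code (a sketch; sizes assume a 32-bit ARM EABI build where double
has 8-byte alignment):

    struct padded { int j; double y; void *q; };
    /* 4 bytes of padding after j plus tail padding: 24 bytes */

    struct tight  { int j; void *q; double y; };
    /* no padding at all: 16 bytes; j and q can be stored with a single
       strd/stm and y with a single vstr */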
> This is, of course, all small stuff. All that I'm trying to show is
> that one shouldn't look for system-wide wins from the hardfp ABI in
> the "obvious" places, because 1) real code doesn't often do things
> that cause softfp to lose significantly, and 2) real code does often
> do things that cause hardfp to lose slightly. To make hardfp win, you
> have to exploit its "invisible" benefits, which are mostly about
> covering memory latencies by using Q0-Q3 to pass values into and out
> of functions that are *still in-flight* as cache-line-sized memory
> transactions.
Now you are arguing for hardfp again. A paragraph ago you were going to
great lengths to find examples where it could theoretically make things
slower.
For this to work, the compiler must know that q0-q3 are preserved by the
call, or it will save and restore these registers if they hold live
values, which defeats the purpose of doing the loads early.
>> >> The A9 and later indeed make the softfp calls less costly, reducing any
>> >> advantage hardfp might have (which is already small in benchmarks on A8).
>>
>> > Even the idea that A9 is less friendly *overall* to hardfp than A8 is
>> > debatable, at the current level of compiler implementation.
>>
>> The A9 is not in any way "less friendly" to hardfp. It is, however,
>> less hostile to softfp.
>
> Its cache hierarchy is different,
I don't see how that is relevant whatsoever to the floating-point
calling convention.
>> >> Do you have any numbers to back this up? I don't see how going through
>> >> NEON registers would be faster than direct LDM/STM on any core.
>>
>> > I will produce those numbers within the month, or admit defeat. :-P
>> > Seriously, I'd better be able to substantiate this by mid-July or so,
>> > or my team is going to have to rethink certain aspects of one of its
>> > current development efforts.
>>
>> I'm glad I'm not invested in that effort.
>
> On this I suppose we agree. You have an admirable track record as a
> coder, and clearly also a deep understanding of some aspects of the
> OMAP chip series. But you seem awfully sure that your bag of tricks
> contains all the tricks that matter. That attitude gets tiresome
> after a while.
I get suspicious when someone claims that everybody else is doing it
all wrong without showing any hard data to prove it. That is all.
>> > When combined with the neon-d16-fp16 model, this should induce the
>> > compiler to use Q4-Q7 as its NEON working set. Since it knows this
>> > range is callee-save, it's safe to schedule loads with ample provision
>> > for cache miss latency, even if it has to move them across function/
>> > method calls.
>>
>> So you've reduced the number of NEON registers from 32 to 8, and you're
>> hoping this will somehow improve performance. The mind boggles.
>
> Back-of-the-envelope calculations say that the single most critical
> resource in *my* system is DRAM bandwidth,
Then you should probably not be using an OMAP chip at all.
--
Måns Rullgård
ma...@mansr.com
Does it work much better for you when compiling the test code from
that gcc bug report? I could not see any improvements with
gcc-linaro-4.6-2011.06-0.tar.bz2 myself. Or do you have some other
code examples which show great progress?
I know that computers have already beaten humans at playing chess
(even though it took them quite a long time to achieve this). Maybe
one day they will learn how to schedule code for processor pipelines
better than human software developers can. But at the current rate,
it does not look likely to happen any time soon.
> Still a little buggy in places
> (https://bugs.launchpad.net/gcc-linaro/+bug/803232), but I have great hopes
> for it.
It does not look "a little buggy" to me. You managed to hit a compiler
bug immediately, after just a single attempt at implementing
something not totally trivial with NEON intrinsics. That's more like
an impressive 1/1 failure rate :(
>> > In many cases, I want to bypass the cache hierarchy entirely in both
>> > directions, because that data structure probably won't be touched
>> > again until after it has aged out of L2 anyway. So the fetch and
>> > store of "blob" are done via NEON intrinsics through a pointer that
>> > lies in an uncacheable mapping. Currently this is another constraint
>> > that cannot be expressed in a C or C++ API; but I don't intend to let
>> > that stop me, either.
>>
>> Why would you want to read uncached memory? That's already a huge
>> performance loss. For example, there is a "shadow framebuffer" in the
>> xf86-video-fbdev driver, which exists specifically to get more or less
>> reasonable performance when attempting to read pixel data back.
>> Moreover, you can easily enable write-through caching for the
>> framebuffer on OMAP3 systems, which can be used instead of the shadow
>> framebuffer with some really good performance results.
>
> It doesn't have to be a performance loss *system-wide*, which is what I care
> about. The data has to get out of cache for the GPU to be able to use it --
> or an on-chip DSP core, or an H.264 encode block, or whatever.
The write-through cache is supposed to ensure that the data also
reaches memory (almost) immediately after it gets modified in the cache.
In the other cases, cache flush/invalidate operations can be used to
synchronize the contents of the CPU cache and memory. The Android
people seem to be advocating the use of cached framebuffers too:
http://www.kandroid.org/online-pdk/guide/display_drivers.html
And on OMAP3 it is possible to have multiple framebuffer planes
composited together by the display controller. The GPU can potentially
handle its own overlay, while reserving the GFX plane entirely for the
CPU.
I don't know anything about the OMAP3 DSP or the H.264 block. This is
an area where other people definitely have a lot more experience.
> And even when you're just talking about CPU algorithms, when the data isn't in cache
> -- as is inevitably the case sometimes when your working set is larger than
> cache -- you've got to get it in somehow. You can let the cache controller
> do the work for you, or you can make a conscious distinction between the
> "hot set" and the broader working set, and access the latter through an
> uncacheable mapping to keep it from evicting the former from cache.
This is something where I would prefer benchmark results. Something
like the following code can be a good start:
http://lists.freedesktop.org/archives/pixman/attachments/20110404/89d0c373/attachment.c
NEON in newer Cortex-A8 processors can indeed be used for fast copying
of data in the "cached->cached" or "uncached->cached" cases, even
without explicit prefetch via PLD instructions. But it's not a silver
bullet, and some other limitations apply. Still, it makes a perfect
implementation for a memcpy function which needs to work on the
OMAP3630/DM3730.
> The ARMv7-A+NEON is extraordinarily well suited to the explicit strategy, if you
> put a bit of work into it.
It's not really ARMv7-A+NEON in general, but specifically ARM
Cortex-A8 processors of revision 2 or newer (those which do not
require the L1NEON workaround). The other NEON-capable ARM processors
may be less suited to this strategy.
>> >> So you've reduced the number of NEON registers from 32 to 8, and you're
>> >> hoping this will somehow improve performance. The mind boggles.
>> >
>> > Now who's waving around straw men? The load patterns that I'm worried
>> > about don't often use NEON for algorithms that need 32 8-byte
>> > registers. Yes, having that full bank of registers makes libjpeg-
>> > turbo's iDCT more compact; but I don't much care, because JPEG decode
>> > latency is not the most critical thing in my system.
>>
>> If you don't care about having any real NEON optimizations in your
>> system (for JPEG or anything else), then it's surely your choice. It's
>> the great freedom of open source, etc. But I seriously doubt that
>> anyone else would be interested :-)
>
> Well, I ran a little trial. I converted the libjpeg-turbo trunk
> implementation of 8x8 iDCT from NEON assembly into NEON compiler intrinsics,
> and let GCC do the register / memory management. I have some initial
> benchmark results; they could be totally wrong (as the implementation almost
> certainly contains errors), but they feel about right based on an inspection
> of the compiler-generated assembly code.
>
> Even as sloppy as 4.5.x is at managing the NEON register pool, it only loses
> about 10% on decompression throughput relative to the hand-coded assembly
> version.
I can't verify these numbers because none of the gcc versions that I
have is able to compile your code (because of that NEON intrinsics bug
that you have already reported to Linaro).
Was the 10% loss measured for JPEG decoding performance overall, or for
the iDCT function alone? But in any case, a 10% performance loss is
already bad enough, especially considering that your variant basically
converts NEON instructions directly to the corresponding intrinsics.
In this case gcc does not even need to do much work on its own, and the
"register / memory management" should be pretty simple and
straightforward for it.
> And compiling for my 16-register NEON model only loses 25%
> relative to that.
Do you have some gcc patches for this 16-register NEON model available
somewhere?
> Even if GCC 4.6.x didn't recoup part or all of that
> performance loss -- which I am quite confident that it will -- it would be
> worth it in exchange for the ability to move data around quickly in kernel
> code. Our system does not exist primarily to decompress JPEGs.
The performance loss on decoding JPEGs is clearly bad. You also
sacrifice the possibility of running almost all the existing ARM NEON
code out there. And all of this to gain what?
On Sat, Jul 2, 2011 at 9:35 AM, <m.k.e...@gmail.com> wrote:
> Discussion around the libjpeg-turbo iDCT implementation using NEON
> intrinsics has moved here:
> https://bugzilla.mozilla.org/show_bug.cgi?id=496298 .
But why there? And why not directly contact upstream via the
libjpeg-turbo-devel mailing list or the libjpeg-turbo issue tracker?
Also I hope that you are not planning to go after Chromium or some
other libjpeg-turbo users as the next step...
> My gcc patches are here:
> https://github.com/mkedwards/crosstool-ng/tree/master/patches/gcc/linaro-4.6-bzr
> . They're in the context of a version of crosstool-ng that I've adapted to
> build a Linaro-based toolchain and a reasonably complete sysroot
> environment. It's a little ragged around the edges, but if you would like
> help getting it to work for you, drop in on #linaro (I'm often active there
> when I'm at work).
Thanks for confirming that you are not just after hardfp calling
conventions, but also want your own custom ABI. And in order to make
it less problematic for you, now the whole world has to switch to
using intrinsics.
I must say that I don't like it. And I hope that this neon-d16 variant
never gets accepted into upstream gcc. Diversity and freedom of
choice are good, but not for things like ABIs and standards.