[llvm-dev] NEON FP flags

Renato Golin via llvm-dev

Mar 22, 2016, 7:28:58 AM

to Hal Finkel, James Molloy, LLVM Dev

Hal, James,

My plan to disable vectorization on NEON FP had two steps:

1. Create the infrastructure to detect unsafe FP maths and force NEON
FP via fast-math.
2. Use -mfpmath=neon/sse to fine-tune the flags even further, but this
needs a lot of work in IR.

The expected behaviour is to get the most performance with the fewest
options, but with correctness in mind. So, we can't vectorize FP loops
without either -ffast-math or -mfpmath=neon, but we want to tell users
that they could get more performance out of their compiler if either
option was chosen.

If I force the need for -ffast-math, many other deviations from
IEEE-754 will be allowed, not just denormals, so you're left with
either slow or potentially bad results. Using -mfpmath hits the right
spot, but is less known and is not plugged in yet.

Vectorizing FP loops is a correctness problem in NEON (and I assume
SSE), so it would be good to be safe. But I take it it's not a serious
correctness problem, so we can go about it the right way from the
beginning, which I'm ok with.

So...

If I got it right, we need to tell FP instructions that they allow
denormals. So far, I could only find flags about NaNs, Infs, signed
zeroes and reciprocals, as well as the "fast" flag that turns them all
on.

In the target transform info, we need to add a denormal flag that is
set on all FP instructions when fpmath=neon/sse/etc, and in the
vectorizer we just test for that flag (which should also be set by
fast-math).

The Darwin vs. Linux problem is, then, moved to the target transform
info, only setting the flag on ARM if...

isDarwin OR ARMISA >= v8 OR fastMath OR fpMath == NEON
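
To make the condition above concrete, here is a minimal C++ sketch. It is not LLVM code: `ARMTargetInfo`, `FPMath` and `allowsDenormalFlag` are hypothetical names invented for illustration, and in practice the inputs would come from the target description and the command-line flags.

```cpp
// Hypothetical sketch only; none of these names exist in LLVM.
enum class FPMath { Default, VFP, NEON };

struct ARMTargetInfo {
  bool IsDarwin;     // target OS is Darwin
  unsigned ARMISA;   // architecture version, e.g. 7 or 8
  bool FastMath;     // -ffast-math was given
  FPMath FPMathUnit; // value of -mfpmath, if any
};

// True when the denormal flag from the proposal may be set on ARM:
// isDarwin OR ARMISA >= v8 OR fastMath OR fpMath == NEON
constexpr bool allowsDenormalFlag(const ARMTargetInfo &T) {
  return T.IsDarwin || T.ARMISA >= 8 || T.FastMath ||
         T.FPMathUnit == FPMath::NEON;
}
```

Under these assumptions, a Linux ARMv7 target with no extra flags yields false, while the same target with -ffast-math or -mfpmath=neon yields true.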

Makes sense?

cheers,
--renato
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

James Molloy via llvm-dev

Mar 22, 2016, 7:34:46 AM

to Renato Golin, LLVM Dev

Hi,

On 22 Mar 2016, at 11:28, Renato Golin <renato...@linaro.org> wrote:

> In the target transform info

I don’t think this part is right. The denormal flag would have to be set by whatever code generates the FP instruction, which would be Clang’s codegen layer. So the if (Darwin) would be there, not in TTI.

Cheers,

James

Renato Golin via llvm-dev

Mar 22, 2016, 7:50:49 AM

to James Molloy, LLVM Dev

On 22 March 2016 at 11:34, James Molloy <James....@arm.com> wrote:
> I don’t think this part is right. The denormal flag would have to be set by
> whatever code generates the FP instruction, which would be Clang’s codegen
> layer. So the if (Darwin) would be there, not in TTI.

Right, I meant the information to set/not set would be in TTI, not the
actual setting.

I don't want Clang to have that kind of information independent of LLVM.

Hal Finkel via llvm-dev

Mar 25, 2016, 12:11:25 AM

to Renato Golin, LLVM Dev

Hi Renato,

As I understand it, the fundamental property being addressed here is: Are the semantics of scalar FP math the same as vector FP math? TTI seems like a good place to expose that information. If the semantics are indeed different, then the vectorizer would require fast-math flags in order to vectorize FP operations (similarly, gcc's man page says it requires -funsafe-math-optimizations for vectorization unless -mfpu=neon or similar is specified). In this context, this different-semantics query would return true if:

!(isDarwin OR ARMISA >= v8 OR fpMath == NEON)

and then we need to teach people to use -mfpu=neon ;)
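
A minimal C++ sketch of how such a query and its use in the vectorizer could look. All names here (`TargetFPInfo`, `FPMathMode`, `hasDifferentVectorFPSemantics`, `canVectorizeFP`) are hypothetical and chosen only for illustration; this is not an existing TTI hook.

```cpp
// Hypothetical sketch only; this is not an existing TTI interface.
enum class FPMathMode { Default, VFP, NEON };

struct TargetFPInfo {
  bool IsDarwin;     // target OS is Darwin
  unsigned ARMISA;   // architecture version, e.g. 7 or 8
  FPMathMode FPMath; // value of -mfpmath, if any
};

// The "different semantics" query:
// !(isDarwin OR ARMISA >= v8 OR fpMath == NEON)
constexpr bool hasDifferentVectorFPSemantics(const TargetFPInfo &T) {
  return !(T.IsDarwin || T.ARMISA >= 8 || T.FPMath == FPMathMode::NEON);
}

// The vectorizer would only need fast-math when scalar and vector FP
// semantics differ.
constexpr bool canVectorizeFP(const TargetFPInfo &T, bool FastMath) {
  return !hasDifferentVectorFPSemantics(T) || FastMath;
}
```

Note that fast-math is deliberately kept out of the query itself in this sketch: the target reports whether the semantics differ, and the vectorizer decides whether the fast-math flags make that difference acceptable.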

I think this more-or-less matches what you've proposed. Is that right?

-Hal

P.S. Looking at gcc's man page, gcc seems to use -mfpu for ARM and -mfpmath for x86. Do we use -mfpmath for both?

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Renato Golin via llvm-dev

Mar 25, 2016, 9:23:11 AM

to Hal Finkel, LLVM Dev

On 25 March 2016 at 04:11, Hal Finkel <hfi...@anl.gov> wrote:
> As I understand it, the fundamental property being addressed here is: Are the semantics of scalar FP math the same as vector FP math? TTI seems like a good place to expose that information. If the semantics are indeed different, then the vectorizer would require fast-math flags in order to vectorize FP operations (similarly, gcc's man page says it requires -funsafe-math-optimizations for vectorization unless -mfpu=neon or similar is specified). In this context, this different-semantics query would return true if:

The semantics are indeed different: VFP is IEEE-754 compliant while
NEON is not. We don't want to stop the compiler from using VFP for FP
math, but we want to be cautious when using NEON in the same way.

> !(isDarwin OR ARMISA >= v8 OR fpMath == NEON)
>
> and then we need to teach people to use -mfpu=neon ;)

So, there's the catch. In GCC, -mfpu=neon means to use NEON, which is
not enabled by default, so the compiler assumes that the user is aware
that NEON FP is not IEEE compliant. I don't think that's a safe
assumption, but I also don't want to have a slightly different
behaviour than GCC gratuitously.

Clang defaults to -mfpu=neon when we choose -mcpu=cortex-a* or
-march=armv7a, so our current behaviour is on par with GCC. But I
think that's a dangerous assumption.

Furthermore, the only alternatives we have at the moment are to use
NEON either for everything or for nothing. It would be good to have an
option to use NEON for integer arithmetic and VFP for FP if the user
requires IEEE compliance.

> P.S. Looking at gcc's man page, gcc seems to use -mfpu for ARM and -mfpmath for x86. Do we use -mfpmath for both?

We already support -mfpmath=vfp/neon in Clang, but it's bogus. My
proposal is to make it count.

The best way I can think of is to let -mfpmath=vfp *disable* only FP
NEON and -mfpmath=neon *enable* only FP NEON, both orthogonal to
integer math.

Examples:

Works today:
-mfpu=soft -> Int (ALU), FP (LIB), no VFP/NEON instructions
-mfpu=softfp -> Int (ALU), FP (LIB), VFP/NEON instructions allowed
-mfpu=vfp -> Int (ALU), FP (VFP)
-mfpu=neon -> Int (NEON), FP (NEON)

Change proposed:
-mfpmath=neon -mfpu=vfp -> Int (ALU), FP (NEON)
-mfpmath=vfp -mfpu=neon -> Int (NEON), FP (VFP)

This would be similar enough to GCC, and would allow the small number
of users that care about IEEE-754 compliance to disable FP NEON on
demand.

cheers,
--renato

Hal Finkel via llvm-dev

Mar 25, 2016, 2:15:02 PM

to Renato Golin, LLVM Dev

I think this seems reasonable, although it is somewhat unfortunate, in terms of naming, that "-mfpu" affects non-FP operations too. However, I think we're stuck because of what GCC decided to do.

Thanks again,
Hal

James Greenhalgh via llvm-dev

Mar 29, 2016, 6:09:23 AM

to Renato Golin, LLVM Dev

On Fri, Mar 25, 2016 at 01:23:03PM +0000, Renato Golin via llvm-dev wrote:
> On 25 March 2016 at 04:11, Hal Finkel <hfi...@anl.gov> wrote:
> > As I understand it, the fundamental property being addressed here is: Are
> > the semantics of scalar FP math the same as vector FP math? TTI seems like
> > a good place to expose that information. If the semantics are indeed
> > different, then the vectorizer would require fast-math flags in order to
> > vectorize FP operations (similarly, gcc's man page says it requires
> > -funsafe-math-optimizations for vectorization unless -mfpu=neon or similar
> > is specified). In this context, this different-semantics query would return
> > true if:
>
> The semantics is indeed different, VFP is IEEE-754 compliant while
> NEON is not. We don't want to stop the compiler from using VFP for FP
> math, but we want to be cautious when using NEON in the same way..
>
>
> > !(isDarwin OR ARMISA >= v8 OR fpMath == NEON)
> >
> > and then we need to teach people to use -mfpu=neon ;)
>
> So, there's the catch. In GCC, -mfpu=neon means to use NEON, which is
> not enabled by default, so the compiler assumes that the user is aware
> that NEON FP is not IEEE compliant. I don't think that's a safe
> assumption, but I also don't want to have a slightly different
> behaviour than GCC gratuitously.

Note that my discussion below relates to the AArch32 behaviour (the ARM
port of GCC, not the AArch64 port of GCC).

I can see why the text in the man page might be misleading, but let me quote
the part I think Hal was referring to here (with added emphasis):

    If the selected floating-point hardware includes the NEON extension
    (e.g. -mfpu=neon), note that floating-point operations are **not**
    generated by GCC's auto-vectorization pass **unless**
    -funsafe-math-optimizations is also specified. This is because
    NEON hardware does not fully implement the IEEE 754 standard for
    floating-point arithmetic (in particular denormal values are treated
    as zero), so the use of NEON instructions may lead to a loss of
    precision.
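
To illustrate what "denormal values are treated as zero" means for results, here is a small C++ sketch that models flush-to-zero in software. It is an assumption-laden model, not a description of the hardware: `flushToZero` is a hypothetical helper that flushes only a result value, and the constants are chosen so that the exact IEEE difference is a denormal.

```cpp
#include <limits>

// Software model of flush-to-zero applied to a result: any nonzero value
// whose magnitude is below the smallest normal float becomes a zero of
// the same sign. Everything else (including NaN and infinity) is returned
// unchanged.
constexpr float flushToZero(float X) {
  return (X != 0.0f && X > -std::numeric_limits<float>::min() &&
          X < std::numeric_limits<float>::min())
             ? (X < 0.0f ? -0.0f : 0.0f)
             : X;
}

constexpr float MinNormal = std::numeric_limits<float>::min();
constexpr float A = 1.5f * MinNormal; // a normal number
constexpr float B = MinNormal;        // the smallest normal number

// Under IEEE 754, A - B is half of MinNormal: a denormal, but not zero.
static_assert(A != B, "inputs are distinct");
static_assert(A - B != 0.0f, "IEEE result is a nonzero denormal");
// With the result flushed, two distinct inputs appear to have no difference.
static_assert(flushToZero(A - B) == 0.0f, "flushed result is zero");
```

The point of the sketch is that code relying on "A != B implies A - B != 0" holds under IEEE 754 arithmetic but not once denormal results are flushed, which is the kind of precision loss the man page warns about.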

That is to say, GCC will only auto-vectorize floating-point arithmetic
if both -mfpu=neon AND -funsafe-math-optimizations are given. -mfpu=neon
by itself does not imply that it is OK for GCC to generate non-IEEE
compliant code. The default is safe until explicitly told otherwise.

> Clang defaults to -mfpu=neon when we choose -mcpu=cortex-a* or
> -march=armv7a, so our current behaviour is on par with GCC. But I
> think that's a dangerous assumption.

If your current behaviour is to generate unsafe math when -mfpu=neon
is passed, then I agree this is dangerous. Again, this is *NOT* GCC's
behaviour.

> Furthermore, the only alternatives we have at the moment is to either
> use NEON for everything or nothing. It would be good to have an option
> to use NEON for integer arithmetic and VFP for FP if the user requires
> IEEE compliance.

In GCC, this is -mfpu=neon.

> > P.S. Looking at gcc's man page, gcc seems to use -mfpu for ARM and -mfpmath
> > for x86. Do we use -mfpmath for both?
>
> We already support -mfpmath=vfp/neon in Clang, but it's bogus. My
> proposal is to make it count.
>
> The best way I can think of is to let -mfpmath=vfp *disable* only FP
> NEON and -mfpmath=neon *enable* only FP NEON, both orthogonal from
> integer math.
>
> Examples:
>
> Works today:
> -mfpu=soft -> Int (ALU), FP (LIB), no VFP/NEON instructions
> -mfpu=softfp -> Int (ALU), FP (LIB), VFP/NEON instructions allowed
> -mfpu=vfp -> Int (ALU), FP (VFP)
> -mfpu=neon -> Int (NEON), FP (NEON)
>
> Change proposed:
> -mfpmath=neon -mfpu=vfp -> Int (ALU), FP (NEON)
> -mfpmath=vfp -mfpu=neon -> Int (NEON), FP (VFP)
>
> This would be similar enough to GCC, and would allow the small number
> of users that care about IEEE-754 compliance to disable FP NEON on
> demand.

In GCC today:

-mfpu=vfp is the minimum floating-point instruction set supported. The
choice of which ABI you use (-mfloat-abi) is independent of the choice
of floating-point hardware that exists. -mfpu=soft and -mfpu=softfp are
rejected by GCC.

Starting with that:

-mfloat-abi=soft -> Generate library calls for all floating-point
  operations, do not permit Neon operations.
-mfloat-abi=softfp -> Pass floating point arguments using the softfloat
  abi (i.e. in core registers). Emit floating point instructions as
  appropriate.
-mfloat-abi=hard -> Pass floating point arguments in VFP registers.
  Emit floating point instructions as appropriate.

Independent of this, we have -mfpu:

-mfpu=neon -> Permit generation of Neon instructions (both integer and
floating point) where allowed by the language specification. Note that
this does not by itself allow the generation of non-IEEE compliant code.

And on top of that, -funsafe-math-optimizations to enable generating Neon
instructions for floating point operations.

For your set of use cases:

Int (ALU), FP (LIB), no VFP/NEON instructions
  -mfloat-abi=soft

Int (ALU), FP (LIB), VFP/NEON instructions allowed
  Impossible

Int (ALU), FP (VFP)
  -mfloat-abi=hard or -mfloat-abi=softfp
  + -mfpu=vfp (or other non-neon FPU)

Int (NEON), FP (VFP)
  -mfloat-abi=hard or -mfloat-abi=softfp
  + -mfpu=neon (or greater)

Int (NEON), FP (NEON)
  -mfloat-abi=hard or -mfloat-abi=softfp
  + -mfpu=neon (or greater)
  + -funsafe-math-optimizations (or equivalent)

Int (ALU), FP (NEON)
  Impossible (as far as I know).
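
The mapping above can be summarised as a small decision function. This is only a sketch of the list as written in this mail, not of GCC's implementation: `gccUnits`, `FloatABI`, `Unit` and `UnitChoice` are hypothetical names, and -mfpu is reduced to a single "has NEON" boolean for simplicity.

```cpp
// Hypothetical summary of the flag combinations listed above; this does
// not model GCC internals, and -mfpu is reduced to one boolean.
enum class FloatABI { Soft, SoftFP, Hard };
enum class Unit { ALU, LIB, VFP, NEON };

struct UnitChoice {
  Unit Int; // where auto-vectorizable integer work may end up
  Unit FP;  // where floating-point work may end up
};

constexpr UnitChoice gccUnits(FloatABI ABI, bool FpuHasNeon, bool UnsafeMath) {
  return ABI == FloatABI::Soft
             ? UnitChoice{Unit::ALU, Unit::LIB}    // library calls, no NEON
         : !FpuHasNeon
             ? UnitChoice{Unit::ALU, Unit::VFP}    // e.g. -mfpu=vfp
         : UnsafeMath
             ? UnitChoice{Unit::NEON, Unit::NEON}  // NEON FP explicitly allowed
             : UnitChoice{Unit::NEON, Unit::VFP};  // -mfpu=neon alone stays safe
}
```

The two combinations marked "Impossible" in the list have no input that produces them here, which matches the intent: FP never goes to NEON unless the FPU has NEON and unsafe math is requested.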

Hope this helps,
James

Renato Golin via llvm-dev

Apr 1, 2016, 9:57:01 AM

to James Greenhalgh, LLVM Dev

On 29 March 2016 at 11:09, James Greenhalgh <james.gr...@arm.com> wrote:
> That is to say, GCC will only auto-vectorize floating-point arithmetic
> if both -mfpu=neon AND -funsafe-math-optimizations are given. -mfpu=neon
> by itself does not imply that it is OK for GCC to generate non-IEEE
> compliant code. The default is safe until explicitly told otherwise.

Right, that was what I originally thought from Hal's bug report, but
recent emails on the thread confused me.

I think this is the right behaviour, and I'm glad GCC does it, so we
can follow the correct approach from start.

>> Furthermore, the only alternatives we have at the moment is to either
>> use NEON for everything or nothing. It would be good to have an option
>> to use NEON for integer arithmetic and VFP for FP if the user requires
>> IEEE compliance.
>
> In GCC, this is -mfpu=neon.

This makes my life *so* much easier! :)

> In GCC today:
>
> -mfpu=vfp is the minimum floating-point instruction set supported, the
> choice of which ABI you use (-mfloat-abi) is independent from the choice
> of floating-point hardware that exists. -mfpu=soft and -mfpu=softfp are
> rejected by GCC.

Yes, I mixed mfpu with mfloat-abi, my bad.

> For your set of use cases:
>
> Int (ALU), FP (LIB), no VFP/NEON instructions
> -mfloat-abi=soft
>
> Int (ALU), FP (LIB), VFP/NEON instructions allowed
> Impossible

I meant this as -mfloat-abi=softfp. Now I see that my representation
mixed the Int/FP concepts. Ignore this.

> Int (ALU), FP (VFP)
> -mfloat-abi=hard or -mfloat-abi=softfp
> + -mfpu=vfp (or other non-neon FPU)
>
> Int (NEON), FP (VFP)
> -mfloat-abi=hard or -mfloat-abi=softfp
> + -mfpu=neon (or greater)

Excellent! This means I only need to make the -fsubnormal flags count,
and all else will stay the same.

This was my first approach, but Hal convinced me that we may want a
specific flag that is included by fast/unsafe maths flags. See below.

> Int (NEON), FP (NEON)
> -mfloat-abi=hard or -mfloat-abi=softfp
> + -mfpu=neon (or greater)
> + -funsafe-math-optimizations (or equivalent)

Do you have one specifically for subnormals? -funsafe-math is a bit of
a big hammer and will enable other (potentially unwanted) behaviour
from the vectorizer.

However, -ffast-math / unsafe-math should include subnormal support.

> Int (ALU), FP (NEON)
> Impossible (as far as I know).

Irrelevant, as far as I care. :)

cheers,
--renato

James Greenhalgh via llvm-dev

Apr 1, 2016, 10:27:23 AM

to Renato Golin, LLVM Dev

On Fri, Apr 01, 2016 at 02:56:53PM +0100, Renato Golin wrote:
> On 29 March 2016 at 11:09, James Greenhalgh <james.gr...@arm.com> wrote:
> > That is to say, GCC will only auto-vectorize floating-point arithmetic
> > if both -mfpu=neon AND -funsafe-math-optimizations are given. -mfpu=neon
> > by itself does not imply that it is OK for GCC to generate non-IEEE
> > compliant code. The default is safe until explicitly told otherwise.
>
> Right, that was what I originally thought from Hal's bug report, but
> recent emails on the thread confused me.
>
> I think this is the right behaviour, and I'm glad GCC does it, so we
> can follow the correct approach from start.

Perfect. I think this is sensible.

> > Int (NEON), FP (VFP)
> > -mfloat-abi=hard or -mfloat-abi=softfp
> > + -mfpu=neon (or greater)
>
> Excellent! This means I can only make -fsubnormal flags count, and all
> will be the same.
>
> This was my first approach, but Hal convinced me that we may want a
> specific flag that is included by fast/unsafe maths flags. See below.
>
>
> > Int (NEON), FP (NEON)
> > -mfloat-abi=hard or -mfloat-abi=softfp
> > + -mfpu=neon (or greater)
> > + -funsafe-math-optimizations (or equivalent)
>
> Do you have one specifically for subnormals? -funsafe-math is a bit of
> a big hammer and will enable other (potentially unwanted) behaviour
> from the vectorizer.
>
> However, -ffast-math / unsafe-math should include subnormal support.

No, we only have the big hammer throughout the ARM back-end to
enable/disable support for the RTL IR that the vectorizer looks for
when pattern matching. That means you also get your reduction loops and
friends potentially changing your IEEE-754 expectations. Something more
fine-grained would be feasible, but there'd be a fair bit of work needed to
upgrade the implementation. In GCC you either take the performance hit or
use the big hammer.

> > Int (ALU), FP (NEON)
> > Impossible (as far as I know).
>
> Irrelevant, as far as I care. :)

Having read the bug reports (16275/16274?) I realise I should have
mentioned Neon intrinsics in my original mail. These *are* available
with the appropriate -mfpu/-mfloat-abi/-march flags, no matter whether
you have -funsafe-math-optimizations, and always map to their instruction
(the implementation for this is not neat, essentially we have two backend RTL
patterns, one which is always available for intrinsics, one which is
conditionally available for auto-vectorization).

Thanks,
James

Renato Golin via llvm-dev

Apr 1, 2016, 10:35:09 AM

to James Greenhalgh, LLVM Dev

On 1 April 2016 at 15:27, James Greenhalgh <james.gr...@arm.com> wrote:
>> However, -ffast-math / unsafe-math should include subnormal support.
>
> No, we only have the big hammer throughout the ARM back-end to
> enable/disable support for the RTL IR that the vectorizer looks for
> when pattern matching. That means you also get your reduction loops and
> friends potentially changing your IEEE-754 expectations. Something more
> fine-grained would be feasible, but there'd be a fair bit of work needed to
> upgrade the implementation. In GCC you either take the performance hit or
> use the big hammer.

Ok, so I think it's sensible to take the same approach and discuss the
extra subnormal flag with GCC/Clang folks.

I'll rebase my original patch so that at least our behaviour is
consistent for now.

> Having read the bug reports (16275/16274?) I realise I should have
> mentioned Neon intrinsics in my original mail. These *are* available
> with the appropriate -mfpu/-mfloat-abi/-march flags, no matter whether
> you have -funsafe-math-optimizations, and always map to their instruction
> (the implementation for this is not neat, essentially we have two backend RTL
> patterns, one which is always available for intrinsics, one which is
> conditionally available for auto-vectorization).

That's what I was trying to say with the Int vs FP vs allow VFP
instructions, but failed miserably. But this is not relevant to this
discussion now (and we have similar issues lowering them :).

My change above will make the behaviour identical, so we can discuss
the subnormal flag later.