[LLVMdev] LLVM ARM VMLA instruction


suyog sarda

Dec 18, 2013, 3:01:39 AM
to llv...@cs.uiuc.edu
Hi,

I was going through LLVM's instruction code generation for ARM and came across VMLA instruction hazards (floating point multiply and accumulate). I was comparing the assembly code emitted by LLVM and GCC, where I saw that GCC was happily using the VMLA instruction for floating point while LLVM never used it; instead it used a pair of VMUL and VADD instructions.

I wanted to know if there is any way these VMLA hazards can be ignored so that LLVM emits VMLA instructions. Is there any command line option/compiler switch/flag for doing this? I tried '-ffast-math' but it didn't work.

--
With regards,
Suyog

Tim Northover

Dec 18, 2013, 4:42:37 AM
to suyog sarda, LLVM Developers Mailing List
> I was going through LLVM's instruction code generation for ARM and came
> across VMLA instruction hazards (floating point multiply and accumulate). I
> was comparing the assembly code emitted by LLVM and GCC, where I saw that GCC
> was happily using the VMLA instruction for floating point while LLVM never used
> it; instead it used a pair of VMUL and VADD instructions.

It looks like Clang allows the formation by default, but you need to
be compiling for a CPU that actually supports the instruction (the key
feature is called "VFPv4"). That means one strictly newer than
cortex-a8: cortex-a7 (don't ask), cortex-a9, cortex-a12, cortex-a15 or
krait, I believe. With that I get:

$ cat tmp.c
float foo(float accum, float lhs, float rhs) {
  return accum + lhs*rhs;
}
$ clang -target armv7-linux-gnueabihf -mcpu=cortex-a15 -S -o- -O3 tmp.c
[...]
foo: @ @foo
@ BB#0: @ %entry
vmla.f32 s0, s1, s2
bx lr

Cheers.

Tim.

Renato Golin

Dec 18, 2013, 7:03:41 AM
to Tim Northover, LLVM Developers Mailing List
On 18 December 2013 09:42, Tim Northover <t.p.no...@gmail.com> wrote:
> That means one strictly newer than
> cortex-a8: cortex-a7 (don't ask), cortex-a9, cortex-a12, cortex-a15 or
> krait, I believe.

Hi Tim,

Cortex A8 and A9 use VFPv3. A7, A12 and A15 use VFPv4.

cheers,
--renato

Tim Northover

Dec 18, 2013, 7:31:16 AM
to Renato Golin, LLVM Developers Mailing List
> Cortex A8 and A9 use VFPv3. A7, A12 and A15 use VFPv4.

That's what I thought! But we do seem to generate vfma on Cortex-A9.
Wonder if that's a bug, or Cortex-A9 is "VFPv3, but chuck in vfma
too"?

Renato Golin

Dec 18, 2013, 7:47:12 AM
to Tim Northover, LLVM Developers Mailing List
On 18 December 2013 12:31, Tim Northover <t.p.no...@gmail.com> wrote:
> That's what I thought! But we do seem to generate vfma on Cortex-A9.
> Wonder if that's a bug, or Cortex-A9 is "VFPv3, but chuck in vfma
> too"?

Hi Tim,

I believe that's the NEON VMLA, not the VFP one. There was a discussion in the past about not using NEON and VFP interchangeably due to IEEE assurances (which NEON doesn't have), but the performance gains are too big. I think the conclusion was to only use NEON instead of VFP (when they're semantically similar) when -unsafe-math is on.

cheers,
--renato

Tim Northover

Dec 18, 2013, 8:02:42 AM
to Renato Golin, LLVM Developers Mailing List
> I believe that's the NEON VMLA, not the VFP one.

Turns out I was misreading the assembly. I wish "vmla" and "vfma"
weren't so similar-looking.

For Suyog that means the option "-ffp-contract=fast" is needed to get
vfma when needed. Sorry about the bad information earlier.
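Concretely, reusing the earlier tmp.c (same triple and CPU as above; the exact invocation is my reconstruction, not from the original mail):

$ clang -target armv7-linux-gnueabihf -mcpu=cortex-a15 -ffp-contract=fast -S -o- -O3 tmp.c

should then produce the fused "vfma.f32 s0, s1, s2" in place of the plain vmla.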

Cheers.

Renato Golin

Dec 18, 2013, 12:59:52 PM
to suyog sarda, LLVM Dev
On 18 December 2013 07:55, suyog sarda <sard...@gmail.com> wrote:
> I wanted to know if there is any way these VMLA hazards can be ignored so that LLVM emits VMLA instructions. Is there any command line option/compiler switch/flag for doing this? I tried '-ffast-math' but it didn't work.

I believe the option you're looking for is: -mattr=-vmlx-forwarding

$ llc -mcpu=cortex-a9 -mattr=-vmlx-forwarding file.ll -o file.s
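(If you're starting from C rather than IR, the file.ll input can be generated with something like the following; the target triple here is an assumption, so adjust to your toolchain:)

$ clang -target armv7-linux-gnueabihf -mcpu=cortex-a9 -O3 -S -emit-llvm file.c -o file.ll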

cheers,
--renato

Kay Tiong Khoo

Dec 18, 2013, 1:00:01 PM
to Tim Northover, LLVM Developers Mailing List
> "-ffp-contract=fast" is needed

Correct - clang is different from gcc, icc, msvc, xlc, etc. on this. Still haven't seen any explanation for how this is better, though...

Tim Northover

Dec 18, 2013, 1:17:31 PM
to Kay Tiong Khoo, LLVM Developers Mailing List
> http://llvm.org/bugs/show_bug.cgi?id=17188
> http://llvm.org/bugs/show_bug.cgi?id=17211

Ah, thanks. That makes a lot more sense now.

> Correct - clang is different from gcc, icc, msvc, xlc, etc. on this. Still
> haven't seen any explanation for how this is better, though...

That would be because it follows what C tells us a compiler has to do
by default, but provides overrides in either direction if you know what
you're doing.

The key point is that LLVM (currently) has no notion of statement
boundaries, so it would fuse the operations in this function:

float foo(float accum, float lhs, float rhs) {
  float product = lhs * rhs;
  return accum + product;
}

This isn't allowed even under FP_CONTRACT=on (the multiply and add do
not occur within a single expression), so LLVM can't in good
conscience enable these optimisations by default.
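A minimal illustration of that statement-boundary rule (my own sketch, not Tim's; note also that clang's handling of this pragma at the time may have been incomplete):

/* C99's standard switch for floating-point contraction. */
#pragma STDC FP_CONTRACT ON

float fused_ok(float accum, float lhs, float rhs) {
  return accum + lhs * rhs;   /* one expression: contraction to fma is allowed */
}

float not_fused(float accum, float lhs, float rhs) {
  float product = lhs * rhs;  /* contraction would have to cross this statement */
  return accum + product;     /* boundary, which FP_CONTRACT "on" does not permit */
}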

Kay Tiong Khoo

Dec 18, 2013, 8:02:34 PM
to Tim Northover, LLVM Developers Mailing List
Thanks for the explanation, Tim!

gcc 4.8.1 *does* generate an fma for your code example for an x86 target that supports fma. I'd bet that the HW vendors' compilers do the same, but I don't have any of those installed at the moment to test that theory. So is this a bug in those compilers? Do you know how they justify it?

I see section 6.5 "Expressions" in the C standard, and I can see that 6.5 paragraph 8 would seem to agree with you, assuming that a "floating expression" is a subset of "expression"... is there any other part of the standard that you know of that I can reference?

This is made a little weirder by the fact that gcc and clang have a 'fast' setting for fp-contract, but the C standard that I'm looking at states that it is just an "on-off-switch".
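For reference, the three values clang's flag accepts (my summary, not from the original mail):

$ clang -ffp-contract=off  -S tmp.c   # never form fused multiply-adds
$ clang -ffp-contract=on   -S tmp.c   # contract only within a single expression, as C allows
$ clang -ffp-contract=fast -S tmp.c   # contract across statement boundaries as well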

Kay Tiong Khoo

Dec 18, 2013, 8:11:44 PM
to Tim Northover, LLVM Developers Mailing List
Just to clarify: gcc 4.8.1 generates that fma at -O2; no FP relaxation or other flags specified.

suyog sarda

Dec 19, 2013, 3:00:07 AM
to Kay Tiong Khoo, LLVM Developers Mailing List, Tim Northover
Hi all,

Thanks for the info. A few observations from my side:

LLVM:

cortex-a8 + vfpv3 : no vmla or vfma instruction emitted
cortex-a8 + vfpv4 : no vmla or vfma instruction emitted (this is an invalid combination, though, as cortex-a8 does not have vfpv4)
cortex-a8 + vfpv4 + ffp-contract=fast : vfma instruction emitted (this seems like a bug to me!! If cortex-a8 doesn't come with vfpv4, then the vfma instructions generated will be invalid)
cortex-a15 + vfpv4 : vmla instruction emitted (which is a NEON instruction)
cortex-a15 + vfpv4 + ffp-contract=fast : vfma instruction emitted

GCC:

cortex-a8 + vfpv3 : vmla instruction emitted
cortex-a15 + vfpv4 : vfma instruction emitted
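(For reproducibility, the invocations behind these observations would be along these lines; reconstructed, so the exact flags and triple are assumptions:)

$ clang -target armv7-linux-gnueabihf -mcpu=cortex-a8 -mfpu=vfpv3 -O3 -S tmp.c                      # vmul + vadd
$ clang -target armv7-linux-gnueabihf -mcpu=cortex-a8 -mfpu=vfpv4 -ffp-contract=fast -O3 -S tmp.c   # vfma
$ clang -target armv7-linux-gnueabihf -mcpu=cortex-a15 -mfpu=vfpv4 -O3 -S tmp.c                     # vmla
$ arm-linux-gnueabihf-gcc -mcpu=cortex-a8 -mfpu=vfpv3 -mfloat-abi=hard -O3 -S tmp.c                 # vmla (gcc)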


I agree with the point that NEON and VFP instructions shouldn't be used interchangeably.

However, if gcc emits the vmla (NEON) instruction for cortex-a8, then shouldn't LLVM also emit vmla (NEON) instructions? Can someone please clarify this point? The performance gain with the vmla instruction is huge. Somewhere I read that LLVM prefers precision/accuracy over performance. Is this true, and is that why LLVM is not emitting vmla instructions for cortex-a8?


 

Tim Northover

Dec 19, 2013, 3:35:19 AM
to suyog sarda, LLVM Developers Mailing List
> cortex-a8 + vfpv4 + ffp-contract=fast : vfma instruction emitted (this
> seems like a bug to me!! If cortex-a8 doesn't come with vfpv4, then the
> vfma instructions generated will be invalid)

If I'm understanding correctly, you've specifically told it that this
Cortex-A8 *does* come with vfpv4. Those kinds of odd combinations can
be useful sometimes (if only for tests), so I'm not sure policing them
is a good idea.

> cortex-a15 + vfpv4 : vmla instruction emitted (which is a NEON instruction)

I get a VFP vmla here rather than a NEON one (clang -target
armv7-linux-gnueabihf -mcpu=cortex-a15): "vmla.f32 s0, s1, s2". Are
you seeing something different?

> However, if gcc emits the vmla (NEON) instruction for cortex-a8, then shouldn't
> LLVM also emit vmla (NEON) instructions?

It appears we've decided in the past that vmla just isn't worth it on
Cortex-A8. There's this comment in the source:

// Some processors have FP multiply-accumulate instructions that don't
// play nicely with other VFP / NEON instructions, and it's generally better
// to just not use them.

Sufficient benchmarking evidence could overturn that decision, but I
assume the people who added it in the first place didn't do so on a
whim.

> The performance gain with the vmla instruction is huge.

Is it, on Cortex-A8? The TRM refers to them jumping across pipelines
in odd ways, and that was a very primitive core, so it's almost
certainly not going to be just as good as a vmul (in fact, if I'm
reading correctly, it takes pretty much exactly the same time as
separate vmul and vadd instructions, 10 cycles vs 2 * 5).

suyog sarda

Dec 19, 2013, 3:36:53 AM
to Kay Tiong Khoo, LLVM Developers Mailing List, Tim Northover
Hi,

One more addition to the above observations:

LLVM:

cortex-a15 + vfpv4-d16 + -ffast-math WITHOUT the -ffp-contract=fast option also emits the vfma instruction.


suyog sarda

Dec 19, 2013, 3:50:13 AM
to Tim Northover, LLVM Developers Mailing List
Hi Tim,

>> cortex-a15 + vfpv4 : vmla instruction emitted (which is a NEON instruction)
>
> I get a VFP vmla here rather than a NEON one (clang -target
> armv7-linux-gnueabihf -mcpu=cortex-a15): "vmla.f32 s0, s1, s2". Are
> you seeing something different?

As per Renato's comment above, the vmla instruction is a NEON instruction while vfma is a VFP instruction. Correct me if I am wrong on this.
 

>> However, if gcc emits the vmla (NEON) instruction for cortex-a8, then shouldn't
>> LLVM also emit vmla (NEON) instructions?
>
> It appears we've decided in the past that vmla just isn't worth it on
> Cortex-A8. There's this comment in the source:
>
> // Some processors have FP multiply-accumulate instructions that don't
> // play nicely with other VFP / NEON instructions, and it's generally better
> // to just not use them.
>
> Sufficient benchmarking evidence could overturn that decision, but I
> assume the people who added it in the first place didn't do so on a
> whim.

>> The performance gain with the vmla instruction is huge.
>
> Is it, on Cortex-A8? The TRM refers to them jumping across pipelines
> in odd ways, and that was a very primitive core, so it's almost
> certainly not going to be just as good as a vmul (in fact, if I'm
> reading correctly, it takes pretty much exactly the same time as
> separate vmul and vadd instructions, 10 cycles vs 2 * 5).

It may seem that the total number of cycles is more or less the same for a single vmla and vmul+vadd. However, when the vmul+vadd combination is used instead of vmla, intermediate results will be generated which need to be stored in memory for future access. This will lead to a lot of load/store ops being inserted, which degrades performance. Correct me if I am wrong on this, but my observations to date have shown this.
 


Tim Northover

Dec 19, 2013, 4:13:54 AM
to suyog sarda, LLVM Developers Mailing List
> As per Renato's comment above, the vmla instruction is a NEON instruction while vfma is a VFP instruction. Correct me if I am wrong on this.

My version of the ARM architecture reference manual (v7 A & R) lists
versions requiring NEON and versions requiring VFP. (Section
A8.8.337). Split in just the way you'd expect (SIMD variants need
NEON).

> It may seem that the total number of cycles is more or less the same for a single vmla
> and vmul+vadd. However, when the vmul+vadd combination is used instead of vmla,
> intermediate results will be generated which need to be stored in memory
> for future access.

Well, it increases register pressure slightly I suppose, but there's
no need to store anything to memory unless that gets critical.

> Correct me if I am wrong on this, but my observations to date have shown this.

Perhaps. Actual data is needed, I think, if you seriously want to
change this behaviour in LLVM. The test-suite might be a good place to
start, though it'll give an incomplete picture without the externals
(SPEC & other things).

Of course, if we're just speculating we can carry on.

suyog sarda

Dec 19, 2013, 4:28:46 AM
to Tim Northover, LLVM Developers Mailing List
On Thu, Dec 19, 2013 at 2:43 PM, Tim Northover <t.p.no...@gmail.com> wrote:
>> As per Renato's comment above, the vmla instruction is a NEON instruction while vfma is a VFP instruction. Correct me if I am wrong on this.
>
> My version of the ARM architecture reference manual (v7 A & R) lists
> versions requiring NEON and versions requiring VFP. (Section
> A8.8.337). Split in just the way you'd expect (SIMD variants need
> NEON).

I will check on this part.
 

>> It may seem that the total number of cycles is more or less the same for a single vmla
>> and vmul+vadd. However, when the vmul+vadd combination is used instead of vmla,
>> intermediate results will be generated which need to be stored in memory
>> for future access.
>
> Well, it increases register pressure slightly I suppose, but there's
> no need to store anything to memory unless that gets critical.

>> Correct me if I am wrong on this, but my observations to date have shown this.
>
> Perhaps. Actual data is needed, I think, if you seriously want to
> change this behaviour in LLVM. The test-suite might be a good place to
> start, though it'll give an incomplete picture without the externals
> (SPEC & other things).
>
> Of course, if we're just speculating we can carry on.

I wasn't speculating. Let's take the example of a simple 3*3 matrix multiplication (no loops, all multiplications and additions are hard coded - basically all the operations are expanded, e.g. Result[0][0] = A[0][0]*B[0][0] + A[0][1]*B[1][0] + A[0][2]*B[2][0], and so on for all 9 elements of the result).
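A minimal sketch of the kind of hand-unrolled kernel being described (my reconstruction; the name and signature are illustrative, not the actual test case):

void matmul3x3(float A[3][3], float B[3][3], float R[3][3]) {
  R[0][0] = A[0][0]*B[0][0] + A[0][1]*B[1][0] + A[0][2]*B[2][0];
  R[0][1] = A[0][0]*B[0][1] + A[0][1]*B[1][1] + A[0][2]*B[2][1];
  R[0][2] = A[0][0]*B[0][2] + A[0][1]*B[1][2] + A[0][2]*B[2][2];
  /* ... rows 1 and 2 follow the same pattern, 9 results in total */
}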

If I compile the above code with "clang -O3 -mcpu=cortex-a8 -mfpu=vfpv3-d16" (only 16 floating point registers are present on my ARM, hence vfpv3-d16), there are 27 vmul, 18 vadd, 23 store and 30 load ops in total.
If the same code is compiled with gcc with the same options, there are 9 vmul, 18 vmla, 9 store and 20 load ops. So it's clear that extra load/store ops get added with clang, as it does not emit the vmla instruction. Won't this lead to performance degradation?

I would also like to know about the accuracy of vmla compared to a pair of vmul and vadd ops.

David Tweed

Dec 19, 2013, 5:32:41 AM
to suyog sarda, LLVM Developers Mailing List
On Thu, Dec 19, 2013 at 9:28 AM, suyog sarda <sard...@gmail.com> wrote:
> I wasn't speculating. Let's take the example of a simple 3*3 matrix multiplication (no loops, all multiplications and additions are hard coded - basically all the operations are expanded, e.g. Result[0][0] = A[0][0]*B[0][0] + A[0][1]*B[1][0] + A[0][2]*B[2][0], and so on for all 9 elements of the result).
>
> If I compile the above code with "clang -O3 -mcpu=cortex-a8 -mfpu=vfpv3-d16" (only 16 floating point registers are present on my ARM, hence vfpv3-d16), there are 27 vmul, 18 vadd, 23 store and 30 load ops in total.
> If the same code is compiled with gcc with the same options, there are 9 vmul, 18 vmla, 9 store and 20 load ops. So it's clear that extra load/store ops get added with clang, as it does not emit the vmla instruction. Won't this lead to performance degradation?

I think what Tim is gently suggesting is that it would be informative to actually run the code that clang produces vs the code that gcc produces on some actual hardware, and see if there is a performance difference and whether it is significant. Direct experimentation is often quicker than trying to figure out how some code ought to perform. (In almost every experiment I've run on trying optimizations, the actual performance on hardware has been different from the expectations I had before running the code.) Granted, testing doesn't always show benefits, in that sometimes microbenchmarks are so simple the compiler can hide the deficiencies of inefficient code that it can't in more complex real-world code, but it's still a good first thing to try.

Cheers,
Dave

--
cheers, dave tweed__________________________
high-performance computing and machine vision expert: david...@gmail.com
"while having code so boring anyone can maintain it, use Python." -- attempted insult seen on slashdot
 

Renato Golin

Dec 19, 2013, 6:06:03 AM
to suyog sarda, LLVM Developers Mailing List
On 19 December 2013 08:50, suyog sarda <sard...@gmail.com> wrote:
> It may seem that the total number of cycles is more or less the same for a single vmla and vmul+vadd. However, when the vmul+vadd combination is used instead of vmla, intermediate results will be generated which need to be stored in memory for future access. This will lead to a lot of load/store ops being inserted, which degrades performance. Correct me if I am wrong on this, but my observations to date have shown this.

VMLA.F can be either NEON or VFP on A series and the encoding will determine which will be used. In assembly files, the difference is mainly the type vs. the registers used.
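(Concretely, it is the register class in the assembly that selects the encoding, e.g.:

vmla.f32 s0, s1, s2   @ VFP encoding: scalar S registers, IEEE-compliant
vmla.f32 d0, d1, d2   @ NEON encoding: D registers, SIMD, flushes denormals to zero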

The problem we were trying to avoid a long time ago was well researched by Evan Cheng, and his research showed that there is a pipeline stall between two sequential VMLAs (possibly due to the need to re-use some registers), which made the code much slower than a sequence of VMLA+VMUL+VADD.

Also, please note that, as far as cycle counts go, according to the A9 manual, one VFP VMLA takes almost as long as a pair of VMUL+VADD to provide its results, so a sequence of VMUL+VADD pairs might be faster, in some contexts or on some cores, than the equivalent (half-length) sequence of VMLAs.

As Tim and David said and I agree, without hard data, anything we say might be used against us. ;)

cheers,
--renato

suyog sarda

Dec 19, 2013, 6:16:44 AM
to Renato Golin, LLVM Developers Mailing List


Sorry folks, I didn't specify the actual test case and results in detail previously. The details are as follows:

Test case name : llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c - This is a 4x4 matrix multiplication; we can make small changes to turn it into a 3x3 matrix multiplication to keep things simple to understand.

clang version : trunk (latest as of today, 19 Dec 2013)
GCC version : 4.5 (I checked with 4.8 as well)

Flags passed to both gcc and clang : -march=armv7-a -mfloat-abi=softfp -mfpu=vfpv3-d16 -mcpu=cortex-a8
Optimization level used : O3

No vmla instruction is emitted by clang, but GCC happily emits it.


This was tested on real hardware. Time taken for a 4x4 matrix multiplication:

clang : ~14 secs
gcc : ~9 secs


Time taken for a 3x3 matrix multiplication:

clang : ~6.5 secs
gcc : ~5 secs


When the flag -mcpu=cortex-a8 is changed to -mcpu=cortex-a15, clang emits vmla instructions (gcc emits them by default).

Time for the 4x4 matrix multiplication:

clang : ~8.5 secs
GCC : ~9secs

Time for the 3x3 matrix multiplication:

clang : ~3.8 secs
GCC : ~5 secs

Please let me know if I am missing something. (The -ffast-math option doesn't help in this case.) On examining the assembly code for the various scenarios above, I concluded what I stated above regarding the extra load/store ops.
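(The seconds-scale timings imply the kernel is run in a loop; a driver along these lines, run under the external 'time' command, would reproduce the comparison. This is a sketch: REPEAT and the kernel name are my assumptions, not the actual test-suite harness.)

#include <stdio.h>

void matmul3x3(float A[3][3], float B[3][3], float R[3][3]); /* kernel from the earlier sketch */

#define REPEAT 100000000L  /* assumed iteration count, chosen to reach seconds-scale runtimes */

int main(void) {
  float A[3][3] = {{1,2,3},{4,5,6},{7,8,9}};
  float B[3][3] = {{9,8,7},{6,5,4},{3,2,1}};
  float R[3][3];
  volatile float sink = 0;
  for (long i = 0; i < REPEAT; ++i) {
    matmul3x3(A, B, R);
    sink += R[0][0];  /* keep the result live so the loop isn't optimized away */
  }
  printf("%f\n", (double)sink);
  return 0;
}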
Also, as stated by Renato, "there is a pipeline stall between two sequential VMLAs (possibly due to the need to re-use some registers), which made the code much slower than a sequence of VMLA+VMUL+VADD": when I use -mcpu=cortex-a15, clang emits vmla instructions back to back (sequentially). Is there something different about cortex-a15 regarding pipeline stalls, so that we ignore back-to-back vmla hazards there?

Renato Golin

Dec 19, 2013, 6:27:25 AM
to suyog sarda, LLVM Developers Mailing List
On 19 December 2013 11:16, suyog sarda <sard...@gmail.com> wrote:
> Test case name : llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c - This is a 4x4 matrix multiplication; we can make small changes to turn it into a 3x3 matrix multiplication to keep things simple to understand.

This is one very specific case. How does it behave in all other cases? Normally, every big improvement comes with a cost, and if you only look at the benchmark you're tuning for, you'll never see it. It may be that the cost is small and that we decide to pay the price, but not until we know what the cost is.


> This was tested on real hardware. Time taken for a 4x4 matrix multiplication:

What hardware? A7? A8? A9? A15?


> Also, as stated by Renato, "there is a pipeline stall between two sequential VMLAs (possibly due to the need to re-use some registers), which made the code much slower than a sequence of VMLA+VMUL+VADD": when I use -mcpu=cortex-a15, clang emits vmla instructions back to back (sequentially). Is there something different about cortex-a15 regarding pipeline stalls, so that we ignore back-to-back vmla hazards there?

A8 and A15 are quite different beasts. I haven't read about this hazard in the A15 manual, so I suspect that they have fixed whatever was causing the stall.

cheers,
--renato

suyog sarda

Dec 19, 2013, 8:30:30 AM
to Renato Golin, LLVM Developers Mailing List

>> Test case name : llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c - This is a 4x4 matrix multiplication; we can make small changes to turn it into a 3x3 matrix multiplication to keep things simple to understand.
>
> This is one very specific case. How does it behave in all other cases? Normally, every big improvement comes with a cost, and if you only look at the benchmark you're tuning for, you'll never see it. It may be that the cost is small and that we decide to pay the price, but not until we know what the cost is.


I agree that we should approach this as a whole rather than in bits and pieces. I was basically comparing the performance of clang- and gcc-generated code for the benchmarks listed in the llvm trunk. I found that wherever there were floating point ops (specifically floating point multiplication), performance with clang was bad. On analyzing those cases further, I came across the vmla instruction emitted by gcc. The test cases hit by the bad performance of clang are:

Test Case (under llvm/projects/test-suite/)             vmla instructions emitted by gcc
                                                        (clang emits none for cortex-a8)
===========================================             ================================
SingleSource/Benchmarks/Misc-C++/Large/sphereflake                     55
SingleSource/Benchmarks/Misc-C++/Large/ray.cpp                         40
SingleSource/Benchmarks/Misc/ffbench.c                                  8
SingleSource/Benchmarks/Misc/matmul_f64_4x4.c                          18
SingleSource/Benchmarks/BenchmarkGame/n-body.c                         36

With the vmul+vadd instruction pair comes the extra overhead of load/store ops, as seen in the generated assembly. With the -mcpu=cortex-a15 option clang performs better, as it emits vmla instructions.

>> This was tested on real hardware. Time taken for a 4x4 matrix multiplication:
>
> What hardware? A7? A8? A9? A15?

I tested it on an A15. I don't have access to an A8 right now, but I intend to test on A8 as well. I compiled the code for A8, and as it was working fine on the A15 without any crash, I went ahead with the cortex-a8 option. I don't think I will get A8 hardware soon; can someone please check it on A8 hardware as well (sorry for the trouble)?
 


>> Also, as stated by Renato, "there is a pipeline stall between two sequential VMLAs (possibly due to the need to re-use some registers), which made the code much slower than a sequence of VMLA+VMUL+VADD": when I use -mcpu=cortex-a15, clang emits vmla instructions back to back (sequentially). Is there something different about cortex-a15 regarding pipeline stalls, so that we ignore back-to-back vmla hazards there?
>
> A8 and A15 are quite different beasts. I haven't read about this hazard in the A15 manual, so I suspect that they have fixed whatever was causing the stall.

OK. I couldn't find a reference for this. If the pipeline stall issue was fixed in cortex-a15, then the LLVM developers would definitely know about it (and hence we are emitting vmla for cortex-a15), but I couldn't find any comment related to this in the code. Can someone please point it out? Also, I would be glad to know where in the code we start differentiating between cortex-a8 and cortex-a15 for code generation.

Renato Golin

Dec 19, 2013, 8:42:22 AM
to suyog sarda, LLVM Developers Mailing List
On 19 December 2013 13:30, suyog sarda <sard...@gmail.com> wrote:
> I tested it on an A15. I don't have access to an A8 right now, but I intend to test on A8 as well. I compiled the code for A8, and as it was working fine on the A15 without any crash, I went ahead with the cortex-a8 option. I don't think I will get A8 hardware soon; can someone please check it on A8 hardware as well (sorry for the trouble)?

It's not surprising that the -mcpu=cortex-a15 option performs better on an A15 than -mcpu=cortex-a8. It's also not surprising that you don't see the VMLA hazard we're talking about, since that was (if I recall correctly) specific to the A8 (maybe the A9, too).

We can only talk about disabling the VMLX-fwd feature for the A8 once substantial benchmarks have been run on a Cortex-A8. Not instruction counts, but performance. Emitting more VMLAs doesn't mean the code will go faster; what we found in some cases was actually quite the opposite.

In the meantime, if you're using an A15, just use -mcpu=cortex-a15 and hopefully, the code generated will be as fast as possible.

Having Clang detect that you have an A15 automatically is another topic that we could get into, but it has nothing to do with VMLA.



> OK. I couldn't find a reference for this. If the pipeline stall issue was fixed in cortex-a15, then the LLVM developers would definitely know about it (and hence we are emitting vmla for cortex-a15), but I couldn't find any comment related to this in the code. Can someone please point it out? Also, I would be glad to know where in the code we start differentiating between cortex-a8 and cortex-a15 for code generation.

The link below shows some fragments of the thread (I hate gmane), including Evan's benchmarks and assumptions.
 

cheers,
--renato

Tim Northover

Dec 20, 2013, 8:00:34 AM
to suyog sarda, LLVM Developers Mailing List
Hi Suyog,

> I tested it on an A15. I don't have access to an A8 right now, but I intend to test
> on A8 as well.

That's extremely dodgy, the two processors are very different.

> I don't think I will get A8 hardware soon; can someone please check it on A8
> hardware as well (sorry for the trouble)?

I've got a BeagleBone hanging around, and tested Clang against a
hacked version of itself (without the VMLx disabling on Cortex-A8).
The results (for matmul_f64_4x4, -O3 -mcpu=cortex-a8) were:
1. vfpv3-d16, stock Clang: 96.2s
2. vfpv3-d16, Clang + vmla: 95.7s
3. vfpv3, stock Clang: 82.9s
4. vfpv3, Clang + vmla: 81.1s

Worth investigating more, but as the others have said, nowhere near
enough data on its own. Especially since Evan clearly did some
benchmarking himself before specifically disabling vmla formation.

> Also, I would be glad to know where in the code we start differentiating
> between cortex-a8 and cortex-a15 for code generation.

Probably most relevant is the combination of features given to each
processor in lib/Target/ARM/ARM.td. This vmul/vmla difference comes
from "FeatureHasSlowFPVMLx", via ARMSubtarget.h's useFPVMLx and
ARMInstrInfo.td's UseFPVMLx.
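For reference, the wiring looks roughly like this in the tree of that era (quoted from memory, so the exact strings may differ; check the source):

// lib/Target/ARM/ARM.td
def FeatureHasSlowFPVMLx : SubtargetFeature<"slowfpvmlx", "SlowFPVMLx", "true",
                                            "Disable VFP / NEON MAC instructions">;

// lib/Target/ARM/ARMSubtarget.h
bool useFPVMLx() const { return !SlowFPVMLx; }

// lib/Target/ARM/ARMInstrInfo.td
def UseFPVMLx : Predicate<"Subtarget->useFPVMLx()">;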

Renato Golin

Dec 20, 2013, 8:17:36 AM
to Tim Northover, LLVM Developers Mailing List
On 20 December 2013 13:00, Tim Northover <t.p.no...@gmail.com> wrote:
> Worth investigating more, but as the others have said, nowhere near
> enough data on its own. Especially since Evan clearly did some
> benchmarking himself before specifically disabling vmla formation.

Indeed. Not just specific micro benchmarks. I also did some testing and found similar results.


> Probably most relevant is the combination of features given to each
> processor in lib/Target/ARM/ARM.td. This vmul/vmla difference comes
> from "FeatureHasSlowFPVMLx", via ARMSubtarget.h's useFPVMLx and
> ARMInstrInfo.td's UseFPVMLx.

Yes, there's no way to turn that on/off from the command line, but I think this is a good thing, not a bad one.

Ultimately, using the -mcpu flag to choose the right CPU is the best thing you can do, and LLVM should get it right.

Another thing that comes to my mind is that maybe it's time to set Cortex-A9 as the default ARMv7 target... no?

cheers,
--renato