That means one strictly newer than cortex-a8: cortex-a7 (don't ask), cortex-a9, cortex-a12, cortex-a15 or
krait, I believe.
That's what I thought! But we do seem to generate vfma on Cortex-A9. Wonder if that's a bug, or Cortex-A9 is "VFPv3, but chuck in vfma
too"?
I wanted to know if there is any way in which these VMLA hazards can be ignored, making LLVM emit VMLA instructions. Is there any command-line option/compiler switch/flag for doing this? I tried '-ffast-math' but it didn't work.
_______________________________________________
LLVM Developers mailing list
LLV...@cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> cortex-a15 vfpv4 : vmla instruction emitted (which is a NEON instruction)
I get a VFP vmla here rather than a NEON one (clang -target
armv7-linux-gnueabihf -mcpu=cortex-a15): "vmla.f32 s0, s1, s2". Are
you seeing something different?
> However, if gcc emits vmla (NEON) instruction with cortex-a8 then shouldn't
> LLVM also emit vmla (NEON) instruction?
It appears we've decided in the past that vmla just isn't worth it on
Cortex-A8. There's this comment in the source:
// Some processors have FP multiply-accumulate instructions that don't
// play nicely with other VFP / NEON instructions, and it's generally better
// to just not use them.
Sufficient benchmarking evidence could overturn that decision, but I
assume the people who added it in the first place didn't do so on a
whim.
> The performance gain with vmla instruction is huge.
Is it, on Cortex-A8? The TRM refers to them jumping across pipelines
in odd ways, and that was a very primitive core, so it's almost
certainly not going to be just as good as a vmul (in fact, if I'm
reading correctly, it takes pretty much exactly the same time as
separate vmul and vadd instructions, 10 cycles vs 2 * 5).
Cheers.
Tim.
> As per Renato's comment above, vmla is a NEON instruction while vfma is a VFP instruction. Correct me if I am wrong on this.
My version of the ARM architecture reference manual (v7 A & R) lists
versions requiring NEON and versions requiring VFP. (Section
A8.8.337). Split in just the way you'd expect (SIMD variants need
NEON).
> It may seem that total number of cycles are more or less same for single vmla
> and vmul+vadd. However, when vmul+vadd combination is used instead of vmla,
> then intermediate results will be generated which needs to be stored in memory
> for future access.
Well, it increases register pressure slightly I suppose, but there's
no need to store anything to memory unless that gets critical.
> Correct me if i am wrong on this, but my observation till date have shown this.
Perhaps. Actual data is needed, I think, if you seriously want to
change this behaviour in LLVM. The test-suite might be a good place to
start, though it'll give an incomplete picture without the externals
(SPEC & other things).
Of course, if we're just speculating we can carry on.
I wasn't speculating. Let's take the example of a simple 3x3 matrix multiplication (no loops; all multiplications and additions are hard-coded, i.e. all the operations are expanded,
e.g. Result[0][0] = A[0][0]*B[0][0] + A[0][1]*B[1][0] + A[0][2]*B[2][0], and so on for all 9 elements of the result). If I compile the above code with "clang -O3 -mcpu=cortex-a8 -mfpu=vfpv3-d16" (only 16 floating-point registers are present on my ARM, hence vfpv3-d16), there are 27 vmul, 18 vadd, 23 store and 30 load ops in total.
If the same code is compiled with gcc with the same options, there are 9 vmul, 18 vmla, 9 store and 20 load ops. So it's clear that extra load/store ops get added with clang, as it is not emitting the vmla instruction. Won't this lead to performance degradation?
It may seem that the total number of cycles is more or less the same for a single vmla and for vmul+vadd. However, when the vmul+vadd combination is used instead of vmla, intermediate results are generated which need to be stored in memory for future access. This leads to a lot of load/store ops being inserted, which degrades performance. Correct me if I am wrong on this, but my observations to date have shown this.
Test case name: llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c - this is a 4x4 matrix multiplication; we can make small changes to turn it into a 3x3 matrix multiplication to keep things simple to understand.
This was tested on real hardware. Time taken for a 4x4 matrix multiplication:
Also, as stated by Renato - "there is a pipeline stall between two sequential VMLAs (possibly due to the need of re-use of some registers) and this made code much slower than a sequence of VMLA+VMUL+VADD" - when I use -mcpu=cortex-a15, clang emits vmla instructions back to back (sequentially). Is there something different about cortex-a15 regarding pipeline stalls, such that we are ignoring back-to-back vmla hazards?
> Test case name : llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c - This is a 4x4 matrix multiplication, we can make small changes to make it a 3x3 matrix multiplication for making things simple to understand.
This is one very specific case. How does it behave in all other cases? Normally, every big improvement comes with a cost, and if you only look at the benchmark you're tuning for, you'll never see it. It may be that the cost is small and that we decide to pay the price, but not until we know what the cost is.
> This was tested on real hardware. Time taken for a 4x4 matrix multiplication:
What hardware? A7? A8? A9? A15?
> Also, as stated by Renato - "there is a pipeline stall between two sequential VMLAs (possibly due to the need of re-use of some registers) and this made code much slower than a sequence of VMLA+VMUL+VADD" , when i use -mcpu=cortex-a15 as option, clang emits vmla instructions back to back(sequential) . Is there something different with cortex-a15 regarding pipeline stalls, that we are ignoring back to back vmla hazards?
A8 and A15 are quite different beasts. I haven't read about this hazard in the A15 manual, so I suspect that they have fixed whatever was causing the stall.
I tested it on A15; I don't have access to an A8 right now, but I intend to test it on A8 as well. I compiled the code for A8, and as it was working fine on A15 without any crash, I went ahead with the cortex-a8 option. I don't think I will get A8 hardware soon - can someone please check it on A8 hardware as well (sorry for the trouble)?
Ok. I couldn't find a reference for this. If the pipeline-stall issue was fixed in cortex-a15, then the LLVM developers would definitely know about it (and hence we are emitting vmla for cortex-a15), but I couldn't find any comment related to this in the code. Can someone please point it out? Also, I would be glad to know where in the code we start differentiating between cortex-a8 and cortex-a15 for code generation.
Worth investigating more, but as the others have said nowhere near
enough data on its own. Especially since Evan clearly did some
benchmarking himself before specifically disabling the vmla formation.
Probably most relevant is the combination of features given to each
processor in lib/Target/ARM/ARM.td. This vmul/vmla difference comes
from "FeatureHasSlowFPVMLx", via ARMSubtarget.h's useFPVMLx and
ARMInstrInfo.td's UseFPVMLx.