Surprising code being generated by ARM NEON backend


Niall Douglas

Sep 1, 2016, 10:08:53 AM
to Intel SPMD Program Compiler Users
Hi all,

We've been using ISPC to generate optimised implementations of various math routines to superb effect, typically beating our hand-written intrinsics editions by 5-10%. So firstly, many thanks!

We've seen an odd code-generation pattern in the ARM NEON code generated by ISPC, however:

.text
.globl BinauralReverb_privProcessA4_ARM32HF_NEON_I32X4
.align 2
.type BinauralReverb_privProcessA4_ARM32HF_NEON_I32X4,%function
BinauralReverb_privProcessA4_ARM32HF_NEON_I32X4: @ @BinauralReverb_privProcessA4_ARM32HF_NEON_I32X4
.fnstart
@ BB#0:                                 @ %allocas
push {r4, r5, r11, lr}
vpush {d8, d9, d10, d11, d12, d13}
mov r4, r0
ldr r0, [sp, #68]
vld1.64 {d16, d17}, [r1:128]
mov r5, r3
vld1.64 {d18, d19}, [r2:128]
vmov.i32 q5, #0x0
vldr s0, [r0]
ldr r0, [sp, #72]
vdup.32 d0, d0[0]
vmul.f32 q4, q8, d0[0]
vld1.32 {d16[], d17[]}, [r0:32]
ldr r0, [sp, #64]
vfma.f32 q4, q8, q9
vld1.64 {d12, d13}, [r0:128]
vmov.i32 q9, #0x0
vld1.64 {d16, d17}, [r5:128]
vfma.f32 q9, q8, q6
vst1.64 {d8, d9}, [r4:128]
vpadd.f32 d0, d18, d19
bl add_f32
add r0, r5, #64
vmov.i32 q9, #0x0
vld1.64 {d16, d17}, [r0:128]
vfma.f32 q9, q8, q6
vadd.f32 s0, s0, s16
vstr s0, [r4]
vpadd.f32 d0, d18, d19
bl add_f32
add r0, r5, #128
vmov.i32 q9, #0x0
vld1.64 {d16, d17}, [r0:128]
vfma.f32 q9, q8, q6
vadd.f32 s0, s0, s17
vstr s0, [r4, #4]
vpadd.f32 d0, d18, d19
bl add_f32
add r0, r5, #192
vadd.f32 s0, s0, s18
vstr s0, [r4, #8]
vld1.64 {d16, d17}, [r0:128]
vfma.f32 q5, q8, q6
vpadd.f32 d0, d10, d11
bl add_f32
vadd.f32 s0, s0, s19
mov r0, #0
vstr s0, [r4, #12]
vpop {d8, d9, d10, d11, d12, d13}
pop {r4, r5, r11, pc}
.Lfunc_end_0_6:
.size BinauralReverb_privProcessA4_ARM32HF_NEON_I32X4, .Lfunc_end_0_6-BinauralReverb_privProcessA4_ARM32HF_NEON_I32X4
.cantunwind
.fnend


Note the repeated "bl add_f32" calls. This is implemented by ISPC as:

add_f32:                                @ @add_f32
.fnstart
@ BB#0:
vadd.f32 s0, s0, s1
bx lr
.Lfunc_end_0_0:
.size add_f32, .Lfunc_end_0_0-add_f32
.cantunwind
.fnend


Obviously there is no good reason why the single instruction vadd.f32 isn't simply inlined instead of introducing a branch-and-link call.

Looking into the ISPC source code, we see:

define internal float @add_f32(float, float) {
  %r = fadd float %0, %1
  ret float %r
}

... and ...

define float @__reduce_add_float(<4 x float>) nounwind readnone {
  neon_reduce(float, @llvm.arm.neon.vpadd.v2f32, @add_f32)
}

... and indeed, the function above is using reduce_add().

It looks like reduce_add() causes the NEON backend to generate a non-inlineable add_f32 function. Is there some good reason that this LLVM IR isn't marked alwaysinline?
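
For clarity, the marking we would have expected is something like this (a sketch only; whether alwaysinline by itself is sufficient is exactly the question):

define internal float @add_f32(float, float) alwaysinline {
  %r = fadd float %0, %1
  ret float %r
}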

Niall

Matt Pharr

Sep 1, 2016, 10:49:25 AM
to ispc-...@googlegroups.com
On Thu, Sep 1, 2016 at 4:11 AM, Niall Douglas <nialldo...@gmail.com> wrote:
Hi all,

We've been using ISPC to generate optimised implementations of various math routines to superb effect, typically beating our hand-written intrinsics editions by 5-10%. So firstly, many thanks!

Woo!
 
We've seen an odd code-generation pattern in the ARM NEON code generated by ISPC, however:
[...]
It looks like reduce_add() causes the NEON backend to generate a non-inlineable add_f32 function. Is there some good reason that this LLVM IR isn't marked alwaysinline?

Not that I can recall, and not that I can see from reviewing the code now. More generally, I think(?) that just about all of the functions in builtins/target-* should be marked as alwaysinline; stuff like __half_to_float_uniform also deserves that treatment. As I look through the code for other backends, the 'alwaysinline' stuff is similarly somewhat inconsistent. I assume that most of the time LLVM just inlines the simple stuff anyway, but it'd be nice to make sure there aren't other performance bugs like that one.
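
For example, here is a hypothetical sketch of how such a builtin might look once annotated (the name __half_to_float_uniform is real, but this body and choice of intrinsic are illustrative only and may not match what's actually in builtins/target-*):

; illustrative only - not the actual ISPC definition
declare float @llvm.convert.from.fp16.f32(i16) nounwind readnone

define internal float @__half_to_float_uniform(i16 %v) nounwind readnone alwaysinline {
  %r = call float @llvm.convert.from.fp16.f32(i16 %v)
  ret float %r
}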

Any chance you could make the changes (for NEON at least), make sure things still work, and submit a pull request?

Thanks,
Matt


Niall Douglas

Sep 1, 2016, 12:52:02 PM
to Intel SPMD Program Compiler Users, ma...@pharr.org
 
We've seen an odd code-generation pattern in the ARM NEON code generated by ISPC, however:
[...]
It looks like reduce_add() causes the NEON backend to generate a non-inlineable add_f32 function. Is there some good reason that this LLVM IR isn't marked alwaysinline?

Not that I can recall, and not that I can see from reviewing the code now. More generally, I think(?) that just about all of the functions in builtins/target-* should be marked as alwaysinline; stuff like __half_to_float_uniform also deserves that treatment. As I look through the code for other backends, the 'alwaysinline' stuff is similarly somewhat inconsistent. I assume that most of the time LLVM just inlines the simple stuff anyway, but it'd be nice to make sure there aren't other performance bugs like that one.

Any chance you could make the changes (for NEON at least), make sure things still work, and submit a pull request?

It does seem very odd that LLVM wouldn't automatically inline a function consisting of a single instruction.

I've asked my employer for the time to send a pull request. If it's granted, happy to oblige.

Niall

Niall Douglas

Sep 2, 2016, 10:31:25 AM
to Intel SPMD Program Compiler Users, ma...@pharr.org

It does seem very odd that LLVM wouldn't automatically inline a function consisting of a single instruction.

I've discovered through trial and error that it is the lack of the "readnone" attribute which causes LLVM not to inline the function. After looking up that attribute I can see why that would be the case, and why its absence would penalise optimisation of the generated ARM NEON: LLVM must assume that any function not so marked may read or write global memory state, and so may change outcomes if that state has changed. In particular, that severely restricts the instruction reordering LLVM can do.
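
To illustrate (a hypothetical fragment, not taken from ISPC's output): in the IR below, LLVM must assume the un-annotated call may touch memory, so it cannot reorder the store past the call or treat the call as a pure computation:

; hypothetical example of why a missing readnone blocks reordering
declare float @add_f32(float, float)

define float @example(float %a, float %b, float* %p) {
  ; this store cannot legally move past the call below, because the
  ; callee might read *%p; with readnone on @add_f32 it could
  store float %a, float* %p
  %r = call float @add_f32(float %a, float %b)
  ret float %r
}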

Quite a few of the ARM NEON builtins are missing "readnone"; none of the AVX builtins that I can see is missing it. I am surprised this problem hasn't been raised before, as it's very obvious from the assembler output.
 

I've asked my employer for the time to send a pull request. If it's granted, happy to oblige.

My employer, who wishes to remain anonymous, has allowed me the time. I'll issue a pull request next week which applies nounwind readnone alwaysinline to everything in the NEON builtins, using the AVX builtins as a guide. I should think this will improve the optimisation quality of the NEON output quite a bit wherever the builtins are used.
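
Using the thread's example as the model, the intended change is along these lines (a sketch; the actual patch will cover all the NEON builtins):

; before: no attributes, so LLVM may emit an out-of-line call
define internal float @add_f32(float, float) {
  %r = fadd float %0, %1
  ret float %r
}

; after: matching the attribute usage in the AVX builtins
define internal float @add_f32(float, float) nounwind readnone alwaysinline {
  %r = fadd float %0, %1
  ret float %r
}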

Niall

Niall Douglas

Sep 5, 2016, 11:19:23 AM
to Intel SPMD Program Compiler Users, ma...@pharr.org
Pull request as requested is at https://github.com/ispc/ispc/pull/1227. My thanks to my employer for sponsoring the improvement.

Thanks for the help and the product, guys. For the same effort as hand-porting all our SSE and AVX intrinsics code to ARM NEON, we got an ISPC port instead which supports everything we need now and into the future.

In case anyone is interested, we actually have a Python script call ISPC both on Windows and via the Windows Subsystem for Linux to generate assembler files for ARM NEON float x4, SSE2 float x4, AVX float x8 and AVX2 float x16 for the calling conventions x86, x64-msvc and x64-sysv (the need to call ISPC under both Windows and Linux is to get both the x64-msvc and x64-sysv calling-convention output). Those assembler files are then parsed by the Python script, which translates the armhf output ISPC generates into armel for Android ARM and makes a few other hand tweaks, and they are committed directly to source control as they change rarely. During build, the AT&T-format assembler files are compiled as normal by cmake for Linux/BSD/OS X/Android, but on Windows we abuse the Mingw-w64 GNU as assembler to make it generate an MSVC-compatible .obj file from the AT&T-syntax but MSVC-calling-convention files output by ISPC. That is then linked in by Visual Studio as per normal. Believe it or not, it all works a treat.

Choosing ISPC to generate assembler instead of writing it by hand was a risk that has been very successful, and I'm sad to say we are done with optimisation now and moving on to other topics far removed. Nevertheless, thanks once again. You may like to know that the only reason we heard of your work is because I sit on the Programme Committee for CppCon, where this (now accepted) talk https://cppcon2016.sched.org/event/7nKw/spmd-programming-using-c-and-ispc was one assigned to me for review. So many thanks to that student for bringing ISPC to our attention!

Niall

Dmitry Babokin

Sep 6, 2016, 3:19:53 PM
to ispc-...@googlegroups.com, Matt Pharr
Niall,

Thanks for sharing your story; it's really rewarding to hear that our tool works so well for you!

You've mentioned that ISPC-generated code is 5-10% faster than hand-written intrinsics. Were you talking about ARM only, or x86 as well?

Also, I'm curious: what typical speed-up are you observing on your code using ARM NEON and SSE/AVX versus the scalar implementation?

And thanks for mentioning the CppCon submission; I didn't know about that.

Dmitry.


Niall Douglas

Sep 7, 2016, 9:19:42 AM
to Intel SPMD Program Compiler Users, ma...@pharr.org

You've mentioned that ISPC-generated code is 5-10% faster than hand-written intrinsics. Were you talking about ARM only, or x86 as well?

The claim was based on comparing VS2015 Update 3 (with the new SSA-based optimiser turned on) compiling hand-written x64 AVX intrinsics against the x64 AVX generated by ISPC (based on clang/LLVM 3.8.1), running on an Ivy Bridge CPU. By the way, we use --addressing=64 with ISPC for that; it generated 3-5% faster code in our use cases.

I haven't tested other compilers or other CPUs; in my experience GCC 6's optimiser would likely do a better job than MSVC's. And it wasn't always a win: sometimes the intrinsics beat the ISPC code by 5% or so, but those cases were less than 7% of the total benchmarks compared; in the other 93%, ISPC equalled or beat MSVC, with the wins being by 5-10%. My employer is very seriously considering switching the entire codebase over to the ISPC output on all platforms, pending equal or better success in much wider benchmarking on a good selection of the types of hardware our customers use. For us, worst-case performance is much more important than average performance, and whilst the worst case improved on my Ivy Bridge CPU, it might not on other CPUs.
 

Also, I'm curious: what typical speed-up are you observing on your code using ARM NEON and SSE/AVX versus the scalar implementation?

For the routines written to use SIMD, scaling is nearly linear in the number of SIMD lanes: on my Ivy Bridge, SSE2 gains 3.6x, AVX1 7.0x, and NEON 3.4x. The code paths so optimised were specifically designed to make the best of SIMD, so this is really an embarrassingly parallel problem; AVX-512 ought to approach 13-15x, for example.

As for what the customer sees of product performance (the SIMD parts are buried deep inside the product), on ARM NEON the average improves by about one third over scalar, and the worst case improves by two thirds. The worst case is far more important to our customers than the average case.

I cannot easily give you the improvement over scalar on Intel for the whole product, as our code lost the ability to work without SSE2 a few months ago after we refactored around FTZ/DAZ always being on; that means I can no longer build a scalar edition without rehacking cmake. It was more than 50% for SSE2, however, and I can tell you we gain another 13% on top of that with AVX1 on my Ivy Bridge.
 

And thanks for mentioning the CppCon submission; I didn't know about that.

You may or may not be aware that ISO WG21 is in the process of standardising C++'s support for SIMD. There were three camps of opinion last time I looked: one just wants intrinsics alone, one wants proper SIMD understanding throughout the C++ language and the STL, and the third I can't recall right now. From the attendees listed, all three camps will be present in the audience at that student's CppCon talk, and I am sure all three will have an opinion with regard to WG21's direction given ISPC's prior art (I am not sure the student realises he has brought such eminence upon himself yet, but it should be a fun talk for the audience debate alone). I will certainly publicly declare myself in favour of proper SIMD understanding throughout the C++ language and the STL; I was approaching that position myself before using ISPC in earnest. Now I am sure it is the correct move and that an intrinsics-only approach is wrong.

You may or may not also be aware that the C++ standard template library is to be rebooted from scratch very soon now; it's informally called "STL2". Microsoft Visual Studio 2017 will ship with support for it. I know that SIMD-awareness was considered important for STL2, so in theory it could come to be that std::vector<float[16]> will "just work" in C++ 2020 for all major compilers.

Niall

Niall Douglas

Sep 7, 2016, 11:51:00 AM
to Intel SPMD Program Compiler Users, ma...@pharr.org

You may or may not be aware that ISO WG21 is in the process of standardising C++'s support for SIMD. There were three camps of opinion last time I looked: one just wants intrinsics alone, one wants proper SIMD understanding throughout the C++ language and the STL, and the third I can't recall right now.

For those interested, I've remembered a recent WG21 paper by the well-known engineers JF Bastien and Hans Boehm summarising the current state of the standardisation discussion: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0193r1.html

Niall

Dmitry Babokin

Sep 8, 2016, 2:30:40 PM
to ispc-...@googlegroups.com, Matt Pharr
You should have really parallelization-friendly code to get close to theoretical scaling on all vector units.

As for parallelization approaches, intrinsics are obviously not good enough, as they offer no performance portability, and I think there's quite broad consensus about that in the industry. But all the alternatives are far from ideal. For quite some time auto-vectorization was the way to go, but it's not reliable, and we obviously need a language solution. I'm personally a bit sceptical that the C++ standards committee can converge on something by the C++21 deadline :) So ISPC and other explicit vectorization solutions have some time until C++ offers a viable alternative, though I hope that happens sooner rather than later.

Niall Douglas

Sep 9, 2016, 6:11:00 AM
to Intel SPMD Program Compiler Users, ma...@pharr.org
On Thursday, September 8, 2016 at 7:30:40 PM UTC+1, Dmitry Babokin wrote:
You should have really parallelization-friendly code to get close to theoretical scaling on all vector units.

As for parallelization approaches, intrinsics are obviously not good enough, as they offer no performance portability, and I think there's quite broad consensus about that in the industry. But all the alternatives are far from ideal. For quite some time auto-vectorization was the way to go, but it's not reliable, and we obviously need a language solution. I'm personally a bit sceptical that the C++ standards committee can converge on something by the C++21 deadline :) So ISPC and other explicit vectorization solutions have some time until C++ offers a viable alternative, though I hope that happens sooner rather than later.

I've remembered that third approach I couldn't remember before.

As you may be aware, C++17-21 is gaining Coroutines, an embedded domain-specific sublanguage allowing a large subset of C++ in coroutines. The proposal is that SIMD-optimal code would be generated by the compiler when you apply a coroutine which performs the same [1] operation to each member of some ContiguousIterable, e.g. a std::vector<float> with alignas(64). The compiler would spot that the same [1] operations are being applied to an array of a SIMDable type and "do the right thing", as it were.

[1] The hard part is defining "same". Some branching would be allowed, obviously. But the proposal, if I remember correctly, imposed the same restrictions as constexpr programming, which is much more restrictive than ISPC, e.g. no communication is possible between instances. That might have been loosened since; it's hard to keep up to date with standardisation.

The big advantage of this approach is that, because the coroutine EDSL is not fixed yet and backwards compatibility isn't a problem, nasty surprises with legacy codebases ought not to occur. The big disadvantage is that it makes the already contentious Coroutines TS even more contentious :)

Anyway, Microsoft are the ones leading the charge on the Coroutines TS implementation before standardisation, so I guess watch VS2017 closely and see how much of C++ AMP they merge into their Coroutines implementation.

Niall

Dmitry Babokin

Sep 12, 2016, 5:47:49 AM
to ispc-...@googlegroups.com, Matt Pharr
Interesting: it's effectively exploiting functional programming to extract parallelism. I have always wondered why there are no implementations of functional languages specifically targeted at extracting SIMD-level parallelism; or probably I'm just not aware of such languages. The limitation you've mentioned - no way to express communication between instances - is a severe drawback of the approach. But the potential upside is a unified approach to concurrency and parallelism, as coroutines are primarily targeted at concurrency.

As for the definition of "same" operations, I think it's not a hard problem, and it should not be restricted by the language; it should be up to the compiler to decide what level of control-flow divergence the hardware can handle.
