Re: [LLVMdev] Implementing the ARM NEON Intrinsics for PowerPC

Hal Finkel

unread,

Oct 1, 2013, 11:14:56 PM10/1/13

to Stanislav Manilov, LLVM Developers Mailing List

Stan,

Do you mean that you want to emulate the ARM NEON intrinsics on PowerPC?

-Hal

----- Original Message -----
>
>
> Hello LLVM Devs,
>
>
> Thanks for helping me previously to cross-compile for ARM, I managed
> to get a working toolchain and am currently having fun compiling
> different toy problems and running them on a pandaboard.
>
> As part of my research I am trying to implement the ARM NEON
> Intrinsics in the PowerPC LLVM backend. I am still at the beginning
> of my efforts and am not yet familiar with either the ARM or the
> PowerPC backends. After I started investigating the code and found
> out that in total it is more than 100 kloc for the two backends I
> thought it is a good idea to ask you for some hints of where I
> should start from.
>
> I have written a small unrelated experimental backend for LLVM
> before, so I have some experience with the topic.
>
>
> Thanks,
> - Stan
> _______________________________________________
> LLVM Developers mailing list
> LLV...@cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
_______________________________________________
LLVM Developers mailing list
LLV...@cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Stanislav Manilov

unread,

Oct 1, 2013, 12:14:59 PM10/1/13

to LLVM Developers Mailing List

Stanislav Manilov

unread,

Oct 2, 2013, 4:54:33 AM10/2/13

to Hal Finkel, LLVM Developers Mailing List

Hello Hal,

I am not very familiar with the DSP capabilities of PowerPC, but I imagine there will be instructions for simple vector operations like vector addition, multiplication, etc. so for these I imagine the implementation would consist of just outputting the correct instruction. However, for NEON instructions like the reciprocal step (see http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHDIACI.html) it is unlikely that there is a corresponding PowerPC vector instruction, so these will need to be emulated, yes.

- Stan

Steven Newbury

unread,

Oct 2, 2013, 5:12:36 AM10/2/13

to Stanislav Manilov, LLVM Developers Mailing List

How does this make any sense? NEON intrinsics are there to support code
generation targeting the ARM NEON SIMD unit on the ARM architecture.
Power/PowerPC as it's own AltiVec/VSX SIMD units, which in turn has it's
own intrinsics.

If you want write code that explicitly targets CPU execution units it's
necessarily tied to that specific CPU architecture. If you just want to
test code for written for a different CPU on a development box your best
bet is to use a VM like QEMU with CPU emulation.

If you want to write code that will take advantage of whatever SIMD
hardware is available you might want to try abstracting your
implementation and use one of the many libraries which provide a higher
level API to SIMD optimized functionality.

David Tweed

unread,

Oct 2, 2013, 6:40:53 AM10/2/13

to Steven Newbury, Stanislav Manilov, LLVM Developers Mailing List

(Note: these are personal opinions rather than anything from my employer.)

Although unusual, there might be circumstances in which it would make sense.

| If you want write code that explicitly targets CPU execution units it's
| necessarily tied to that specific CPU architecture. If you just want to
| test code for written for a different CPU on a development box your best
| bet is to use a VM like QEMU with CPU emulation.

It's possible to have either already written code to analyse, or be
intending
to write code that will eventually
be deployed on a particular mobile architecture but wish to develop that on
a desktop
machine. Using an architectural simulation will potentially incur more of a
cost than implementing as much optimization of the emulation via compiler
transformation at compile time. (Whether this is actually enough all the
work of writing
an LLVM backend is another question of course.)

Cheers,
Dave

Konstantin Tokarev

unread,

Oct 2, 2013, 6:57:31 AM10/2/13

to Steven Newbury, Stanislav Manilov, LLVM Developers Mailing List

02.10.2013, 14:46, "Steven Newbury" <st...@snewbury.org.uk>:

> How does this make any sense? NEON intrinsics are there to support code
> generation targeting the ARM NEON SIMD unit on the ARM architecture.
> Power/PowerPC as it's own AltiVec/VSX SIMD units, which in turn has it's
> own intrinsics.
>
> If you want write code that explicitly targets CPU execution units it's
> necessarily tied to that specific CPU architecture. If you just want to
> test code for written for a different CPU on a development box your best
> bet is to use a VM like QEMU with CPU emulation.
>
> If you want to write code that will take advantage of whatever SIMD
> hardware is available you might want to try abstracting your
> implementation and use one of the many libraries which provide a higher
> level API to SIMD optimized functionality.

For example, Eigen library [1] supports both AltiVec and NEON.

[1] http://eigen.tuxfamily.org

--
Regards,
Konstantin

Konstantin Tokarev

unread,

Oct 2, 2013, 7:14:11 AM10/2/13

to Stanislav Manilov, Hal Finkel, LLVM Developers Mailing List

02.10.2013, 13:27, "Stanislav Manilov" <stanisla...@gmail.com>:

> Hello Hal,
>
> I am not very familiar with the DSP capabilities of PowerPC, but I imagine there will be instructions for simple vector operations like vector addition, multiplication, etc. so for these I imagine the implementation would consist of just outputting the correct instruction. However, for NEON instructions like the reciprocal step (see http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHDIACI.html) it is unlikely that there is a corresponding PowerPC vector instruction, so these will need to be emulated, yes.

Here is an example implementation of reciprocal square root with AltiVec intinsics:

http://web.archive.org/web/20090810124308/http://developer.apple.com/hardwaredrivers/ve/algorithms.html

--
Regards,
Konstantin

Renato Golin

unread,

Oct 2, 2013, 7:17:47 AM10/2/13

to Steven Newbury, LLVM Developers Mailing List, Stanislav Manilov

On 2 October 2013 10:12, Steven Newbury <st...@snewbury.org.uk> wrote:

How does this make any sense?

I have to agree with you that this doesn't make much sense, but there is a case where you would want something like that: when the original source uses NEON intrinsics, and there is no alternative in AltiVec, AVX or even plain C.

We encourage people to use NEON intrinsics, as opposed to writing inline NEON assembly, when the compiler cannot vectorize your code properly. This may fix the current problem of under-performing forward-incompatible inline asm, and it does solve the portability issue across ARM sub-architectures (ex. v7 vs v8), but it doesn't help on portability across entirely different architectures. Since it's not easy to vectorize every code, and not desired to have special cases hard-coded in the vectorizer, I don't see another solution to this problem.

Before, you'd have assembly files with NEON specific code, another with AltiVec specific and so on, and now you'd have C files with each intrinsics, which is better. But, as you said yourself, the semantics of NEON instructions are not the same as other SIMD ISAs, so if you only have the NEON file and want to create an AltiVec version, you'll have to understand both pretty well.

Stanislav,

If I got it right above, I think it would be better if you could do that transformation in IR, with a mapping infrastructure between each SIMD ISA. Something that could represent every possible SIMD instruction, and how each target represents them, so in one side you read the intrinsics (and possibly IR operations on vectors), translate to this meta-SIMD language, then export on the SIMD language that you want.

A tool like this, possibly exporting back to C code (so you can add it to your project as an one-off pass), would be valuable to all programs that have legacy hard-coded SSE routines to run on any platform that support SIMD operations.

I have no idea how easy would be to do that, let alone if it's at all possible, but it seems that this is what you want. Correct me if I'm wrong.

cheers,

--renato

Konstantin Tokarev

unread,

Oct 2, 2013, 7:26:13 AM10/2/13

to David Tweed, Steven Newbury, Stanislav Manilov, LLVM Developers Mailing List

02.10.2013, 15:12, "David Tweed" <david...@arm.com>:

> (Note: these are personal opinions rather than anything from my employer.)
>

> Although unusual, there might be circumstances in which it would make sense.

>
> | If you want write code that explicitly targets CPU execution units it's
> | necessarily tied to that specific CPU architecture. If you just want to
> | test code for written for a different CPU on a development box your best
> | bet is to use a VM like QEMU with CPU emulation.
>

> It's possible to have either already written code to analyse, or be
> intending
> to write code that will eventually
> be deployed on a particular mobile architecture but wish to develop that on
> a desktop
> machine. Using an architectural simulation will potentially incur more of a
> cost than implementing as much optimization of the emulation via compiler
> transformation at compile time. (Whether this is actually enough all the
> work of writing
> an LLVM backend is another question of course.)

Or to compile existing code using NEON intrinsics and run it on PowerPC device
without changes.

--
Regards,
Konstantin

Stanislav Manilov

unread,

Oct 2, 2013, 7:34:24 AM10/2/13

to Renato Golin, LLVM Developers Mailing List

On 2 October 2013 12:17, Renato Golin <renato...@linaro.org> wrote:

On 2 October 2013 10:12, Steven Newbury <st...@snewbury.org.uk> wrote:

How does this make any sense?

I have to agree with you that this doesn't make much sense, but there is a case where you would want something like that: when the original source uses NEON intrinsics, and there is no alternative in AltiVec, AVX or even plain C.

This is exactly the case that I am in. I want to make DSP code written in C, but with NEON intrinsics "portable" as it is less feasible to rewrite it.

Stanislav,

If I got it right above, I think it would be better if you could do that transformation in IR, with a mapping infrastructure between each SIMD ISA. Something that could represent every possible SIMD instruction, and how each target represents them, so in one side you read the intrinsics (and possibly IR operations on vectors), translate to this meta-SIMD language, then export on the SIMD language that you want.

A tool like this, possibly exporting back to C code (so you can add it to your project as an one-off pass), would be valuable to all programs that have legacy hard-coded SSE routines to run on any platform that support SIMD operations.

I have no idea how easy would be to do that, let alone if it's at all possible, but it seems that this is what you want. Correct me if I'm wrong.

Again, the tool you describe is exactly what I ultimately want to create. The translation to AltiVec would be a step towards understanding how to manipulate the intrinsics, but it is not a goal on its own.

Do you have any ideas where in the whole LLVM structure would it fit (should it be implemented as a separate optional pass)?

Thanks,

- Stan

Renato Golin

unread,

Oct 2, 2013, 8:07:29 AM10/2/13

to Stanislav Manilov, LLVM Developers Mailing List

On 2 October 2013 12:34, Stanislav Manilov <stanisla...@gmail.com> wrote:

Again, the tool you describe is exactly what I ultimately want to create. The translation to AltiVec would be a step towards understanding how to manipulate the intrinsics, but it is not a goal on its own.

Do you have any ideas where in the whole LLVM structure would it fit (should it be implemented as a separate optional pass)?

I think there are two separate things:

1. A conversion tool, that will read specific SIMD-1 C files and produce SIMD-2 C files. This will need the C back-end to be working well, or implement its own SIMD-specific C backend, which is in itself, quite a big task. This tool would have to use a function pass that would scan for SIMD-1 intrinsics, and convert them to SIMD-2 in the IR level, so your tool would read the SIMD-1 file as if it were targeting arch-2, and the pass would convert automatically, using the function pass below.

2. A function pass, to do the conversion between SIMD-1 intrinsics to SIMD-2, based on their original namespace inside LLVM (AVX, NEON, etc) and the target parameter (for SIMD-2 output). This FP should be off by default, of course, but could be turned on (say -convert-simd-intrinsics) when compiling legacy code.

I'd start with just cataloguing all NEON and AltiVec intrinsics, and trying to map them. You'll probably hit cases where NEON A == AltiVec A + op1 + op2, so you'll have to take head and tail operations around the intrinsics as possible part of an interchangeable SIMD operation.

As a first example, you could write a function pass to get only the ones that map nicely 1-to-1 and see if the concept works, and if people are happy with your changes. It should be able to read a (very simple) NEON C file and produce compatible PowerPC AltiVec assembly code. After the infrastructure is in place, you can continue incrementing it by adding support for more intrinsics, more SIMD ISAs, and more complex patterns (involving surrounding instructions, etc). In parallel, you could try to create the tool that would do the source-to-source transformation, using the pass that you have written.

Of course, adding tests for all known supported conversions to/from would be critical to the success of your project.

cheers,

--renato

Renato Golin

unread,

Oct 2, 2013, 8:10:21 AM10/2/13

to Stanislav Manilov, Sean Silva, LLVM Developers Mailing List

On 2 October 2013 13:07, Renato Golin <renato...@linaro.org> wrote:

Of course, adding tests for all known supported conversions to/from would be critical to the success of your project.

I'm sure Sean (CC'd) would agree, that adding some documentation would be equally valuable. ;)

--renato

Hal Finkel

unread,

Oct 2, 2013, 8:36:47 AM10/2/13

to Stanislav Manilov, LLVM Developers Mailing List

----- Original Message -----
> On 2 October 2013 12:17, Renato Golin < renato...@linaro.org >
> wrote:
>
>
>
>
> On 2 October 2013 10:12, Steven Newbury < st...@snewbury.org.uk >
> wrote:
>
>
>
>
>
> How does this make any sense?
>
>
> I have to agree with you that this doesn't make much sense, but there
> is a case where you would want something like that: when the
> original source uses NEON intrinsics, and there is no alternative in
> AltiVec, AVX or even plain C.
>
>
> This is exactly the case that I am in. I want to make DSP code
> written in C, but with NEON intrinsics "portable" as it is less
> feasible to rewrite it.

Are you using Clang as the frontend? If so, my recommendation would be to start by creating a header file that implements the NEON intrinsics in terms of generic functionality and the Altivec ones. The header file would need to look kind of like this:

#if defined(__powerpc__) || defined(__ppc__)

#define neon_intrinsic1 ppc_neon_intrinsic1
static __inline__ vec_type __attribute__((__always_inline__, __nodebug__))
ppc_neon_intrinsic1(vec_type a1, vec_type a2) {
...
}

...

#endif

If you look in tools/clang/lib/Headers you'll see lots of example intrinsics header files, and if you look in your build directory in tools/clang/lib/Headers you'll find the arm_neon.h.inc file.

You can certainly do this in terms of an LLVM transformation, but I think that creating some kind of header file would be, at least, where I'd start prototyping this.

Also, you'll want to make sure that the endianness of the ARM and PPC environments agree (or that the code is endian-neutral), otherwise you'll likely have bigger problems ;)

-Hal

>
>
>
>
> Stanislav,
>
>
> If I got it right above, I think it would be better if you could do
> that transformation in IR, with a mapping infrastructure between
> each SIMD ISA. Something that could represent every possible SIMD
> instruction, and how each target represents them, so in one side you
> read the intrinsics (and possibly IR operations on vectors),
> translate to this meta-SIMD language, then export on the SIMD
> language that you want.
>
>
> A tool like this, possibly exporting back to C code (so you can add
> it to your project as an one-off pass), would be valuable to all
> programs that have legacy hard-coded SSE routines to run on any
> platform that support SIMD operations.
>
>
> I have no idea how easy would be to do that, let alone if it's at all
> possible, but it seems that this is what you want. Correct me if I'm
> wrong.
>
>
> Again, the tool you describe is exactly what I ultimately want to
> create. The translation to AltiVec would be a step towards
> understanding how to manipulate the intrinsics, but it is not a goal
> on its own.
>
>
>
> Do you have any ideas where in the whole LLVM structure would it fit
> (should it be implemented as a separate optional pass)?
>
>
> Thanks,
> - Stan

> _______________________________________________
> LLVM Developers mailing list
> LLV...@cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Renato Golin

unread,

Oct 2, 2013, 8:45:25 AM10/2/13

to Hal Finkel, LLVM Developers Mailing List, Stanislav Manilov

On 2 October 2013 13:36, Hal Finkel <hfi...@anl.gov> wrote:

You can certainly do this in terms of an LLVM transformation, but I think that creating some kind of header file would be, at least, where I'd start prototyping this.

Yes, this is a good approach to understanding the problem. But I wouldn't use this as a final solution, as it scales quadratically with the number of supported SIMD architectures, including all variations (like NEON v7, v8 and CPU dependent choices).

cheers,

--renato

Stanislav Manilov

unread,

Oct 2, 2013, 9:37:50 AM10/2/13

to Renato Golin, LLVM Developers Mailing List

Thank you all for the help.

Here is my plan of action:

Read up on NEON and AltiVec
Write ((small) parts of) arm_neon.h using AltiVec intrinsics
Write a function pass to convert simple (vector arithmetic) NEON C code to PowerPC AltiVec assembly code and submit for review.
Add NEON intrinsics that map to multiple AltiVec instructions
Add patterns involving surrounding instructions in order to support single complex AltiVec instructions
(not necessarily after 4 and 5, but maybe during): Try producing C code with AltiVec intrinsics as output, when given C code with NEON intrinsics.

Things to be aware of:

Endian-ness
Importance of tests and documentation

I will update you once I have some progress.

Cheers,

- Stan

Alex Rosenberg

unread,

Oct 2, 2013, 9:47:08 AM10/2/13

to Stanislav Manilov, LLVM Developers Mailing List

As crazy as this is, the reverse (AltiVec intrinsics on ARM hardware) was working in tree for a while for the common functions.

Another approach would be to develop a libTooling tool that helps rewrite processor-specific SIMD code to use some generic SIMD library (a C++1y one?) and provide ports of that library.

Alex

Reply all

Reply to author

Forward