AVX and clang code generation

Jonathan Taylor

May 20, 2015, 6:27:26 AM

to perfoptimi...@lists.apple.com

Hi all,

I am revisiting some vector code of mine and exploring whether it can be tweaked for better performance with AVX. In the course of this, I have encountered some rather strange code generation that has me puzzled. Can anybody comment on whether this is to be expected and/or whether there is a good reason for the compiler doing this? This is on Xcode 6.2 - "Apple LLVM version 6.0 (clang-600.0.57) (based on LLVM 3.5svn)".

As part of a more complicated inner loop, I have a class representing a pair of double-precision complex numbers (FAB_ji), and another that I'll call someOtherComplexPair. Behind the scenes I store the pair of complex numbers (four doubles) as a single 256-bit wide AVX vector. I want to extract a single one of those doubles and multiply it with the second object. The code looks something like either:

    double Ar = ((double*)&FAB_ji)[0];    // extract first double as laid out in memory
    sum += Ar * someOtherComplexPair;

or:

    double Ar = FAB_ji.mm256()[0];        // explicitly access the first element of the underlying vector variable
    sum += Ar * someOtherComplexPair;

The problem is that the second option is translated into some seemingly odd machine code:

vextractf128 xmm5, ymm0, 0x1
vpermilpd xmm6, xmm5, 0x0
vinsertf128 ymm6, ymm6, xmm6, 0x1

where the compiler seems to be falling back to an SSE3 instruction and making an awful lot of work for itself (and creating a new and slower critical path through the code as a result).

The first option translates straight into a vbroadcastsd from a memory location, and the complete inner loop is faster as a result.
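For readers without the original class to hand, here is a minimal sketch of what the second form amounts to. It is not the code from the post: ComplexPair, first_element and scale_accumulate are hypothetical names, the accessor is assumed to return a reference, and the GCC/clang vector extension stands in for the AVX type so that the sketch builds without any AVX-specific flags.

```cpp
// Sketch only: ComplexPair, first_element and scale_accumulate are
// hypothetical names, not the classes used in the original code.
typedef double v4d __attribute__((vector_size(32)));  // GCC/clang vector extension

struct ComplexPair {
    v4d v;                                      // lanes: (re0, im0, re1, im1)
    const v4d &mm256() const { return v; }      // accessor named after the one in the post
};

// Second form from the post: subscript the vector itself to read lane 0.
inline double first_element(const ComplexPair &p) {
    return p.mm256()[0];
}

// sum += Ar * other, with the scalar Ar applied to all four lanes.
inline void scale_accumulate(ComplexPair &sum, double Ar, const ComplexPair &other) {
    const v4d broadcast = {Ar, Ar, Ar, Ar};     // explicit broadcast of the scalar
    sum.v += broadcast * other.v;               // element-wise multiply and add
}
```

Written this way, the broadcast of Ar across all four lanes is explicit, which is the step the compiler implements either as a single vbroadcastsd or as the longer extract/permute/insert sequence.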

My reasons for preferring the second version of the C code are twofold:

1. I confess I do not fully understand the intricacies of C aliasing rules, but I have my suspicions that I am being naughty by typecasting that pointer.

2. I feel that the second option expresses my intent more clearly to the compiler, potentially giving it scope to be more intelligent in future, even if not now.
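On the aliasing worry in point 1, one commonly used way to sidestep the question is to copy the bytes out with memcpy rather than dereference a cast pointer. The sketch below is an illustration only: lane_via_memcpy is a hypothetical helper, it uses the GCC/clang vector extension in place of the AVX type, and it assumes lane i sits at byte offset i * sizeof(double), as it does on x86. Whether the compiler turns it into the same broadcast-from-memory would need checking in the disassembly.

```cpp
#include <cstring>

typedef double v4d __attribute__((vector_size(32)));  // GCC/clang vector extension

// Hypothetical helper: copy lane i out of the vector's storage.
// Assumes lane i lives at byte offset i * sizeof(double).
// memcpy copies the object representation, so no double* is ever
// dereferenced on storage that holds a vector object.
inline double lane_via_memcpy(const v4d &vec, int i) {
    double out;
    std::memcpy(&out,
                reinterpret_cast<const char *>(&vec) + i * sizeof(double),
                sizeof(double));
    return out;
}
```

A call such as lane_via_memcpy(vec, 0) would then take the place of the ((double*)&FAB_ji)[0] expression.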

Obviously, though, I am going to stick with the first if it generates faster code! Can anyone comment on whether this behaviour from the compiler seems right for any reason, and whether there is any way I might modify the second option to get what I want?

It may be a little difficult to isolate a compilable example illustrating the full inner loop, but you can view the IACA output at http://pastebin.com/Lz6J5jtr which might serve as a bit of an illustration of what the inner loop looks like.

Cheers

Jonny

Stephen Canon

May 20, 2015, 10:55:55 AM

to Jonathan Taylor, perfoptimi...@lists.apple.com

There’s no obvious reason why this should be happening, especially without more source context. A compilable example of the inner loop would be informative. Absent that, can you show us the definition of the FAB_ji type and its mm256() method?

Thanks,

– Steve


Jonathan Taylor

May 20, 2015, 12:02:04 PM

to Stephen Canon, perfoptimi...@lists.apple.com

Thanks for your reply. I’ve managed to condense it down to a self-contained compilable example (though the intent of the function will be largely obfuscated due to the removal of code irrelevant to this example). See:

http://pastebin.com/ZvsVTniK

Cut out the iaca stuff if you want to look at the disassembly in Xcode, but I’ve been using the IACA tool as part of my analysis of the code.

I’ve made a slight change to the code used for the option that leads to the broadcast instruction being generated, in response to (legitimate) criticism from an unknown person who emailed me offline. The effects are the same though.

Any insight will be most welcome!

Cheers

Jonny


[quoted text removed to meet the stingy 12kb limit for posts to the list!]
