Hi all,
I am revisiting some vector code of mine and exploring whether it can be tweaked for better performance with AVX. In the course of this, I have encountered some rather strange code generation that has me puzzled. Can anybody comment on whether this is to be expected and/or whether there is a good reason for the compiler doing this? This is on Xcode 6.2 - "Apple LLVM version 6.0 (clang-600.0.57) (based on LLVM 3.5svn)".
As part of a more complicated inner loop, I have a class representing a pair of double-precision complex numbers (FAB_ji), and another that I'll call someOtherComplexPair. Behind the scenes I store the pair of complex numbers (four doubles) as a single 256-bit wide AVX vector. I want to extract a single one of those doubles and multiply it with the second object. The code looks either something like:
double Ar = ((double*)&FAB_ji)[0]; // extract first double as laid out in memory
sum += Ar * someOtherComplexPair;
or:
double Ar = FAB_ji.mm256()[0]; // explicitly access the first element of the underlying vector variable
sum += Ar * someOtherComplexPair;
The problem is that the second option is translated into some seemingly odd machine code:
vextractf128 xmm5, ymm0, 0x1
vpermilpd xmm6, xmm5, 0x0
vinsertf128 ymm6, ymm6, xmm6, 0x1
where the compiler seems to be falling back to 128-bit operations on the two halves of the register and making an awful lot of work for itself (and creating a new and slower critical path through the code as a result).
The first option translates straight into a vbroadcastsd from a memory location, and the complete inner loop is faster as a result.
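For concreteness, here is a rough, self-contained sketch of what both variants are meant to compute. It is not my real code: the type v4d and the function names are made up for illustration, and v4d is a GCC/Clang vector-extension type standing in for the underlying AVX vector, so it compiles even without AVX enabled.

```c
/* Hypothetical stand-in for the 256-bit AVX vector: four doubles in one
   32-byte vector, declared with the GCC/Clang vector extension. */
typedef double v4d __attribute__((vector_size(32)));

/* Option 1: reinterpret the vector's storage as an array of doubles. */
double first_by_cast(const v4d *v)
{
    return ((const double *)v)[0];
}

/* Option 2: subscript the vector value directly. */
double first_by_subscript(const v4d *v)
{
    return (*v)[0];
}

/* sum += Ar * other, with the scalar copied into all four lanes first.
   This splat is the step where a single broadcast would be hoped for
   when compiling with AVX enabled. */
void accumulate_scaled(v4d *sum, double ar, const v4d *other)
{
    v4d splat = { ar, ar, ar, ar };
    *sum += splat * *other;
}
```

Both accessors return the same value, so the only difference should be in how the compiler chooses to materialise the splat.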
My reasons for preferring the second version of the C code are twofold:
1. I confess I do not fully understand the intricacies of C aliasing rules, but I have my suspicions that I am being naughty by typecasting that pointer.
2. I feel that the second option expresses my intent more clearly to the compiler, potentially giving it scope to be more intelligent, in future even if not now.
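On the aliasing worry in point 1: copying the bytes out with memcpy is one way to read the element without casting the pointer to double*. The sketch below assumes the same hypothetical v4d stand-in type (a GCC/Clang vector extension, not my real class), and element_by_memcpy is a name invented for illustration. Compilers typically turn a small fixed-size memcpy like this into a plain load, but I have not checked what code it produces in my inner loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the 256-bit AVX vector (GCC/Clang extension). */
typedef double v4d __attribute__((vector_size(32)));

/* Copy element i (0..3) out of the vector's bytes. Going through memcpy
   and a byte pointer avoids reading the vector through a double*. */
double element_by_memcpy(const v4d *v, size_t i)
{
    double out;
    assert(i < 4);
    memcpy(&out, (const unsigned char *)v + i * sizeof(double), sizeof out);
    return out;
}
```

Calling element_by_memcpy(&vec, 0) would then take the place of the ((double*)&FAB_ji)[0] expression in the first option.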
However, I am obviously going to stick with the first option if it generates faster code! Can anyone comment on whether this behaviour from the compiler seems right for any reason, and whether there is any way I might modify the second option to get what I want?
It may be a little difficult to isolate a compilable example illustrating the full inner loop, but you can view the IACA output at
http://pastebin.com/Lz6J5jtr which might serve as a bit of an illustration of what the inner loop looks like.
Cheers
Jonny