Hi there,

I have a packed vector of four 64-bit floating-point values. I would like to get the sum of the vector's elements. With SSE (and using 32-bit floats) I could just do the following:

v_sum = _mm_hadd_ps(v_sum, v_sum);
v_sum = _mm_hadd_ps(v_sum, v_sum);

Unfortunately, even though AVX features a _mm256_hadd_pd instruction, it differs in its result from the SSE version. I believe this is because most AVX instructions work as SSE instructions on the low and high 128-bit halves separately, without ever crossing the 128-bit boundary.

Of course one could use two _mm256_hadd_pd instructions with some shuffling in the middle, but I am trying not to use more than two instructions, like the SSE version.

So what would be the fastest way to get the horizontal vector sum using only AVX instructions?

Thanks a lot.

- Luigi Castelli
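For reference, a minimal sketch of one common approach that splits the 256-bit vector into 128-bit halves (this assumes AVX plus SSE2 intrinsics are available; the helper name hsum256_pd is made up here, and it does take more than two instructions):

#include <immintrin.h>

// Illustrative helper (name invented): horizontal sum of a __m256d holding (v0, v1, v2, v3).
static inline double hsum256_pd(__m256d v)
{
    __m128d lo   = _mm256_castpd256_pd128(v);     // (v0, v1)
    __m128d hi   = _mm256_extractf128_pd(v, 1);   // (v2, v3)
    __m128d sum  = _mm_add_pd(lo, hi);            // (v0+v2, v1+v3)
    __m128d hi64 = _mm_unpackhi_pd(sum, sum);     // broadcast the upper lane
    return _mm_cvtsd_f64(_mm_add_pd(sum, hi64));  // (v0+v2) + (v1+v3)
}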
The penalty for misalignment with AVX can be relatively steep compared to SSE on Sandy Bridge. This is in part because the LSU data throughput is the same for both: you can do one AVX load or two SSE loads per cycle, and one SSE store per cycle but only one AVX store every two cycles. There is also some competition between loads and stores for the resources that do the effective-address calculation. Agner Fog will be happy to tell you all about that:
http://www.agner.org/optimize/microarchitecture.pdf
http://www.agner.org/optimize/instruction_tables.pdf
The situation is further aggravated by the fact that what is aligned for SSE may be misaligned for AVX -- 16-byte aligned but not 32-byte aligned. (malloc is 16-byte aligned, but not necessarily 32-byte aligned, for small allocations.) Consequently, for LSU-bound functions, there will be times where your AVX code is slower than the SSE code it would otherwise replace.
There are some workarounds. The Intel software optimization manual recommends, in a couple of places, actually breaking an AVX load or store up into multiple AVX128 loads or stores. Other times you might just write the AVX code for the aligned case and fall back on SSE / AVX128 for the rest. You can use posix_memalign(&result, align, size) to get 32-byte-aligned blocks. mmap and friends will give you page-aligned allocations, as always.
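One possible sketch of the split-into-AVX128-halves idea (the helper names are invented; this assumes the code is built with -mavx so the 128-bit intrinsics encode as AVX128 instructions):

#include <immintrin.h>

// Load 4 doubles as two 128-bit halves instead of one 256-bit access.
static inline __m256d load256_as_two_halves(const double *p)
{
    __m128d lo = _mm_loadu_pd(p);        // emitted as a 128-bit vmovupd under -mavx
    __m128d hi = _mm_loadu_pd(p + 2);
    return _mm256_insertf128_pd(_mm256_castpd128_pd256(lo), hi, 1);
}

// Store 4 doubles as two 128-bit halves.
static inline void store256_as_two_halves(double *p, __m256d v)
{
    _mm_storeu_pd(p,     _mm256_castpd256_pd128(v));
    _mm_storeu_pd(p + 2, _mm256_extractf128_pd(v, 1));
}

And a minimal posix_memalign sketch for 32-byte-aligned storage (the function name and the count parameter are placeholders):

#include <stdlib.h>

// Allocate count doubles with 32-byte alignment so __m256d accesses are aligned.
static double *alloc_doubles_32(size_t count)
{
    void *p = NULL;
    // Argument order is pointer, alignment, size.
    if (posix_memalign(&p, 32, count * sizeof(double)) != 0)
        return NULL;        // allocation failed
    return (double *)p;     // release later with free()
}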
In my own observations, there is a performance penalty for misaligned AVX loads that cross cacheline boundaries and a larger one for AVX stores that cross page boundaries. When you have to choose, align your stores.
Ian
On Mar 21, 2012, at 9:09 AM, Jan E. Schotsman wrote:

> BTW, what is the status of AVX at the moment? Can it be detected with sysctlbyname? What header file is needed?
> Or are you doing some geekish things?

sysctlbyname should work: "hw.optional.avx1_0".

Some older OS revisions won't have the selector. In that case, please assume AVX isn't present.
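A minimal detection sketch along those lines (the helper name has_avx is made up; any sysctlbyname failure, including a missing selector on older OS revisions, is treated as no AVX):

#include <sys/types.h>
#include <sys/sysctl.h>

static int has_avx(void)
{
    int avx = 0;
    size_t size = sizeof(avx);
    // Missing selector (older OS) or any other failure -> report no AVX.
    if (sysctlbyname("hw.optional.avx1_0", &avx, &size, NULL, 0) != 0)
        return 0;
    return avx;
}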
As for the header, you want #include <immintrin.h>, and pass -mavx to the compiler (clang, a.k.a. the Apple LLVM compiler).
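For instance, a throwaway test might look like this (the file name and exact compile line, e.g. clang -mavx -O2 avx_test.c, are only examples):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    // Lanes are (1, 2, 3, 4) from low to high; _mm256_set_pd takes the high lane first.
    __m256d v = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);
    double out[4];
    _mm256_storeu_pd(out, v);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}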
AVX support is a relatively recent addition to the toolchain, so you'll want to make sure your Xcode is up to date. Some tools might still be confused by AVX. You might need to switch from gdb to lldb to get decent debugger support. That is done in the Edit Scheme pane in Xcode.