Hi there,

I have a packed vector of four 64-bit floating-point values. I would like to get the sum of the vector's elements. With SSE (and using 32-bit floats) I could just do the following:

v_sum = _mm_hadd_ps(v_sum, v_sum);
v_sum = _mm_hadd_ps(v_sum, v_sum);

Unfortunately, even though AVX features a _mm256_hadd_pd instruction, it differs in its result from the SSE version. I believe this is because most AVX instructions work as SSE instructions on the low and high 128-bit halves separately, without ever crossing the 128-bit boundary.

Of course one could use two _mm256_hadd_pd instructions with some shuffling in the middle, but I am trying not to use more than two instructions, like the SSE version.

So what would be the fastest way to get the horizontal vector sum using only AVX instructions?

Thanks a lot.

- Luigi Castelli
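For reference, a minimal sketch of one common approach that splits the 256-bit vector into 128-bit halves (this assumes AVX plus SSE2 intrinsics are available; the helper name hsum256_pd is made up here, and it does take more than two instructions):

#include <immintrin.h>

// Illustrative helper (name invented): horizontal sum of a __m256d holding (v0, v1, v2, v3).
static inline double hsum256_pd(__m256d v)
{
    __m128d lo   = _mm256_castpd256_pd128(v);     // (v0, v1)
    __m128d hi   = _mm256_extractf128_pd(v, 1);   // (v2, v3)
    __m128d sum  = _mm_add_pd(lo, hi);            // (v0+v2, v1+v3)
    __m128d hi64 = _mm_unpackhi_pd(sum, sum);     // broadcast the upper lane
    return _mm_cvtsd_f64(_mm_add_pd(sum, hi64));  // (v0+v2) + (v1+v3)
}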
The penalty for misalignment with AVX can be relatively steep compared to SSE on Sandy Bridge. This is in part because the LSU data throughput is the same for both: you can do one AVX load or two SSE loads per cycle, and one SSE store per cycle but only one AVX store every two cycles. There is also some competition between loads and stores for the resources that do the effective-address calculation. Agner Fog will be happy to tell you all about that:
http://www.agner.org/optimize/microarchitecture.pdf
http://www.agner.org/optimize/instruction_tables.pdf
The situation is further aggravated by the fact that what is aligned for SSE may be misaligned for AVX -- 16-byte aligned but not 32-byte aligned. (malloc is 16-byte aligned, but not necessarily 32-byte aligned, for small allocations.) Consequently, for LSU-bound functions, there will be times where your AVX code is slower than the SSE code it would otherwise replace.
There are some workarounds. The Intel software optimization manual recommends, in a couple of places, actually breaking an AVX load or store up into multiple AVX128 loads or stores. Other times you might just write the AVX code for the aligned case and fall back on SSE / AVX128 for the rest. You can use posix_memalign(&result, align, size) to get 32-byte-aligned blocks. mmap and friends will give you page-aligned allocations, as always.
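One possible sketch of the split-into-AVX128-halves idea (the helper names are invented; this assumes the code is built with -mavx so the 128-bit intrinsics encode as AVX128 instructions):

#include <immintrin.h>

// Load 4 doubles as two 128-bit halves instead of one 256-bit access.
static inline __m256d load256_as_two_halves(const double *p)
{
    __m128d lo = _mm_loadu_pd(p);        // emitted as a 128-bit vmovupd under -mavx
    __m128d hi = _mm_loadu_pd(p + 2);
    return _mm256_insertf128_pd(_mm256_castpd128_pd256(lo), hi, 1);
}

// Store 4 doubles as two 128-bit halves.
static inline void store256_as_two_halves(double *p, __m256d v)
{
    _mm_storeu_pd(p,     _mm256_castpd256_pd128(v));
    _mm_storeu_pd(p + 2, _mm256_extractf128_pd(v, 1));
}

And a minimal posix_memalign sketch for 32-byte-aligned storage (the function name and the count parameter are placeholders):

#include <stdlib.h>

// Allocate count doubles with 32-byte alignment so __m256d accesses are aligned.
static double *alloc_doubles_32(size_t count)
{
    void *p = NULL;
    // Argument order is pointer, alignment, size.
    if (posix_memalign(&p, 32, count * sizeof(double)) != 0)
        return NULL;        // allocation failed
    return (double *)p;     // release later with free()
}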
In my own observations, there is a performance penalty for misaligned AVX loads that cross cacheline boundaries and a larger one for AVX stores that cross page boundaries. When you have to choose, align your stores.
Ian
On Mar 21, 2012, at 9:09 AM, Jan E. Schotsman wrote:

> BTW, what is the status of AVX at the moment? Can it be detected with sysctlbyname? What header file is needed?
> Or are you doing some geekish things?

sysctlbyname should work: "hw.optional.avx1_0".

Some older OS revisions won't have the selector. In that case, please assume AVX isn't present.
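A minimal detection sketch along those lines (the helper name has_avx is made up; any sysctlbyname failure, including a missing selector on older OS revisions, is treated as no AVX):

#include <sys/types.h>
#include <sys/sysctl.h>

static int has_avx(void)
{
    int avx = 0;
    size_t size = sizeof(avx);
    // Missing selector (older OS) or any other failure -> report no AVX.
    if (sysctlbyname("hw.optional.avx1_0", &avx, &size, NULL, 0) != 0)
        return 0;
    return avx;
}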
As for the header, you want #include <immintrin.h>, and pass -mavx to the compiler (clang, a.k.a. the Apple LLVM compiler).
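For instance, a throwaway test might look like this (the file name and exact compile line, e.g. clang -mavx -O2 avx_test.c, are only examples):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    // Lanes are (1, 2, 3, 4) from low to high; _mm256_set_pd takes the high lane first.
    __m256d v = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);
    double out[4];
    _mm256_storeu_pd(out, v);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}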
AVX support is a relatively recent addition to the toolchain, so you'll want to make sure your Xcode is up to date. Some tools might still be confused by AVX. You might need to switch from gdb to lldb to get decent debugger support. That is done in the Edit Scheme pane in Xcode.