
AVX(2) slowdown


Anton Ertl

Nov 14, 2016, 12:30:34 PM
I just experimented a little with manual vectorization in Intel
assembly language, with some (to me) surprising results. The loop I
wanted to vectorize is

where p is a double *, and everything else is a double.

do {
p--;
ThisDist = sqr(*p-ThisX);
} while (ThisDist>CloseDist);

For my benchmark input this loop is entered about 1.4M times, with 50M
iterations of the loop. Looks like a worthwhile target for
vectorization.

Eventually I did an SSE2 and an AVX2 variant (AT&T syntax):

p=%rax, CloseDist=%xmm2, ThisX=%xmm3

SSE2                                 AVX2
        movapd   %xmm2, %xmm4                vbroadcastsd %xmm2, %ymm4
        shufpd   $0, %xmm4, %xmm4            vbroadcastsd %xmm3, %ymm6
        movapd   %xmm3, %xmm6        Linner: subq         $32, %rax
        shufpd   $0, %xmm6, %xmm6            vsubpd       (%rax), %ymm6, %ymm5
Linner: subq     $16, %rax                   vmulpd       %ymm5, %ymm5, %ymm5
        movupd   (%rax), %xmm5               vcmpltpd     %ymm4, %ymm5, %ymm5
        subpd    %xmm6, %xmm5                vpmovmskb    %ymm5, %rcx
        mulpd    %xmm5, %xmm5                test         %rcx, %rcx
        cmpltpd  %xmm4, %xmm5                je           Linner
        pmovmskb %xmm5, %rcx                 bsr          %rcx, %rcx
        test     %rcx, %rcx                  leaq         -7(%rax,%rcx), %rax
        je       Linner
        bsr      %rcx, %rcx
        leaq     -7(%rax,%rcx), %rax

Pretty much the same thing, with SSE2 handling 2 doubles per
iteration, and AVX2 handling 4. Also AVX2 allows encoding some things
in one instruction that need two instructions in SSE2 (the broadcast
and unaligned load-operate instructions).
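
For readers who prefer intrinsics, the AVX2 variant corresponds roughly to
the following sketch (my illustration here, not the actual code in tspe.c,
which uses inline assembly; the sketch uses a vmovmskpd-style per-element
mask instead of vpmovmskb, which simplifies the final pointer fixup):

#include <immintrin.h>

/* Scan downwards in 4-double steps until one squared X-distance is below
   CloseDist; return a pointer to that element. Names are illustrative. */
static double *scan_back(double *p, double ThisX, double CloseDist)
{
    __m256d thisx4 = _mm256_set1_pd(ThisX);
    __m256d close4 = _mm256_set1_pd(CloseDist);
    int mask;
    do {
        p -= 4;                                       /* subq $32, %rax */
        __m256d d = _mm256_sub_pd(_mm256_loadu_pd(p), thisx4);
        d = _mm256_mul_pd(d, d);                      /* (p[i]-ThisX)^2 */
        mask = _mm256_movemask_pd(_mm256_cmp_pd(d, close4, _CMP_LT_OS));
    } while (mask == 0);                              /* je Linner */
    return p + (31 - __builtin_clz(mask));            /* bsr + leaq fixup */
}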

The performance is surprising (to me): I expected some speedup from SSE2, and
more from AVX2, but what I found (on a Haswell Core i7-4690K) is this:

cycles instructions
187,888,737 366,382,169 C original
167,129,257 282,694,918 SSE2
390,340,078 168,337,307 AVX2

So, SSE2 saves quite a number of instructions, and also saves cycles
(not as many, but still). AVX2 saves more instructions (more than a
factor of 2 compared to the C original), but takes much longer (more
than a factor of 2 compared to the C original). Does anybody have an
idea why AVX2 is performing so badly? Branch misses and L1 cache
misses do not change between these programs.

For those who want to do their own experiments:

C original: <http://www.complang.tuwien.ac.at/anton/lvas/effizienz/tspd.c>
SSE2: <http://www.complang.tuwien.ac.at/anton/lvas/effizienz/tspf.c>
AVX2: <http://www.complang.tuwien.ac.at/anton/lvas/effizienz/tspe.c>

Compile and run with

gcc -O3 tsp<x>.c -lm
a.out 10000

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Melzzzzz

Nov 14, 2016, 1:30:45 PM
You forgot zeroupper ;)
Just compile with -march=native and see the difference...

--
press any key to continue or any other to quit

Bernhard Schornak

Nov 14, 2016, 2:09:13 PM
Anton Ertl wrote:


> cycles instructions
> 187,888,737 366,382,169 C original
> 167,129,257 282,694,918 SSE2
> 390,340,078 168,337,307 AVX2
>
> So, SSE2 saves quite a number of instructions, and also saves cycles
> (not as many, but still). AVX2 saves more instructions (more than a
> factor of 2 compared to the C original), but takes much longer (more
> than a factor of 2 compared to the C original). Does anybody have an
> idea why AVX2 is performing so badly? Branch misses and L1 cache
> misses do not change between these programs.


Might depend on the load/store unit of the processor.
Fetching 32 bytes might be split into two consecutive
accesses lasting two clock cycles rather than the one
access required for the SSE version. On AMD machines,
two 64 bit data chunks can be loaded/stored per clock
cycle. Intel probably uses similar hardware solutions
to access data.

BTW:

subq $0x10, %rax
movq (%rax), %xmm5

can be simplified to

mov -0x10(%rax), %xmm5

resolving one dependency (eliminates waiting times for
the result of the subtraction).



Greetings from Augsburg

Bernhard Schornak

timca...@aol.com

Nov 14, 2016, 4:09:42 PM
I would guess that the data has an end marker of some kind so you don't fall off the end of the array?

I don't know what the data blocking is, but you could un-roll both SSE & AVX versions up to 8 times to kick the parallelism up a notch.
- Tim

timca...@aol.com

Nov 14, 2016, 4:43:31 PM
Also, I'm not sure that leaq at the end is doing what you think it is doing....

Bernhard Schornak

Nov 14, 2016, 4:58:13 PM
Bernhard Schornak wrote:


> subq $0x10, %rax
> movq (%rax), %xmm5
>
> can be simplified to
>
> mov -0x10(%rax), %xmm5
>
> resolving one dependency (eliminates waiting times for
> the result of the subtraction).


As RAX has to move downwards in memory, the SUB now
should be inserted between PMOVMSKB and testing RCX
(and thus separating another dependency pair).

Walking upwards in memory should speed up execution
times inside the inner loop, as well.

BTW: Measuring iteration counts where the code's execution
exceeds a time slice (granted by the multitasking
scheduler) will produce results that include task switch
times introduced by the OS. With RDTSC(P), results
for execution times above a few milliseconds might as well
be produced by a random number generator...
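
For illustration, a minimal sketch of such a measurement (assuming GCC or
clang and the __rdtscp() intrinsic from <x86intrin.h>; the measured
function is just a placeholder):

#include <x86intrin.h>

/* Time one short run in TSC ticks. RDTSCP waits for preceding
   instructions to finish, so the code under test is bracketed cleanly. */
static unsigned long long time_once(void (*code_under_test)(void))
{
    unsigned int aux;
    unsigned long long t0 = __rdtscp(&aux);
    code_under_test();
    unsigned long long t1 = __rdtscp(&aux);
    return t1 - t0;   /* TSC ticks, not necessarily core clock cycles */
}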

Bruce Hoult

Nov 14, 2016, 5:10:07 PM
I've found that timing tests on Ubuntu 16.04 give massively more stable results with a lot less variance than on 14.04. I believe this is due to a changed process/core affinity algorithm that is much less likely to decide to move you to another core for no reason, especially on an otherwise lightly loaded machine.

This is of course a Linux kernel thing, not an Ubuntu thing. I don't know which kernel version made this change. I think it's an architecture-neutral change, as I've observed the improvement on both x86 and ARM machines.

already...@yahoo.com

Nov 14, 2016, 7:14:48 PM
It seem, HW prefetch does not work in your AVX case.
Try sw prefetch with various distances.

BTW, do you have explanation for instruction count in AVX case?
A simple arithmetic suggests, it should be 1.4*50M / 4 * 7 = 123M, but you measured much more.



already...@yahoo.com

Nov 14, 2016, 7:51:40 PM
Oh, above I miscounted. 1.4M*50M = 70T, not 70M, so obviously you didn't run a full test. That means I don't know how many iterations of the loop you were running.
I have no idea why I decided that you were running 70M/4 iterations.

Melzzzzz

Nov 14, 2016, 8:43:27 PM
On Mon, 14 Nov 2016 19:45:31 +0100
Bernhard Schornak <scho...@nospicedham.web.de> wrote:

> Anton Ertl wrote:
>
>
> > cycles instructions
> > 187,888,737 366,382,169 C original
> > 167,129,257 282,694,918 SSE2
> > 390,340,078 168,337,307 AVX2
> >
> > So, SSE2 saves quite a number of instructions, and also saves cycles
> > (not as many, but still). AVX2 saves more instructions (more than a
> > factor of 2 compared to the C original), but takes much longer (more
> > than a factor of 2 compared to the C original). Does anybody have
> > an idea why AVX2 is performing so badly? Branch misses and L1 cache
> > misses do not change between these programs.
>
>
> Might depend on the load/store unit of the processor.

Not in this case:
[bmaxa@maxa-pc examples]$ gcc -O3 tspe.c -o tspe -lm
[bmaxa@maxa-pc examples]$ gcc -O3 tspe1.c -o tspe1 -lm
[bmaxa@maxa-pc examples]$ time ./tspe 10000
sumdist = 88.609717

real 0m0.106s
user 0m0.103s
sys 0m0.000s
[bmaxa@maxa-pc examples]$ time ./tspe1 10000
sumdist = 88.609717

real 0m0.037s
user 0m0.037s
sys 0m0.000s
[bmaxa@maxa-pc examples]$ diff tspe.c tspe1.c
65a66
> " vzeroupper\n"
[bmaxa@maxa-pc examples]$

Terje Mathisen

Nov 15, 2016, 2:58:48 AM
Anton Ertl wrote:
> I just experimented a little with manual vectorization in Intel
> assembly language, with some (to me) surprising results. The loop I
> wanted to vectorize is
>
> where p is a double *, and everything else is a double.
>
> do {
> p--;
> ThisDist = sqr(*p-ThisX);
> } while (ThisDist>CloseDist);
>
> For my benchmark input this loop is entered about 1.4M times, with 50M
> iterations of the loop. Looks like a worthwhile target for
> vectorization.

With an average of 35-36 iterations of the loop I would indeed expect
close to 2X/4X speedup, the only problem is the parallel compare which
needs to extract a mask bit per item and move this to an integer
register for the branch exit test.

It might be faster to unroll it more, packing the comparison results
down and then do a final PMOVMSKB to ECX: If you did this then you would
let the multipliers do more work before disturbing the pipeline.

OTOH I have no good idea why AVX could be so slow!?

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Andrew Cooper

Nov 15, 2016, 4:58:56 AM
What is your adjacent code doing, SSE-wise?

On Haswell and Broadwell hardware, there is a massive pipeline stall
switching between VEX-encoded instructions and legacy-encoded
instructions. The Intel software optimisation manual talks about this.

The penalty is much reduced on Skylake, but still non-zero if I remember
correctly.

~Andrew

Melzzzzz

Nov 15, 2016, 7:29:05 AM
On Tue, 15 Nov 2016 08:46:25 +0100
Terje Mathisen <terje.m...@nospicedham.tmsw.no> wrote:

> Anton Ertl wrote:
> > I just experimented a little with manual vectorization in Intel
> > assembly language, with some (to me) surprising results. The loop I
> > wanted to vectorize is
> >
> > where p is a double *, and everything else is a double.
> >
> > do {
> > p--;
> > ThisDist = sqr(*p-ThisX);
> > } while (ThisDist>CloseDist);
> >
> > For my benchmark input this loop is entered about 1.4M times, with
> > 50M iterations of the loop. Looks like a worthwhile target for
> > vectorization.
>
> With an average of 35-36 iterations of the loop I would indeed expect
> close to 2X/4X speedup, the only problem is the parallel compare
> which needs to extract a mask bit per item and move this to an
> integer register for the branch exit test.
>
> It might be faster to unroll it more, packing the comparison results
> down and then do a final PMOVMSKB to ECX: If you did this then you
> would let the multipliers do more work before disturbing the pipeline.
>
> OTOH I have no good idea why AVX could be so slow!?

Partial register stall switching between SSE2 and AVX2.
He just needs to add vzeroupper at the end of the inline assembly block.
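
Something like this self-contained sketch (not tspe.c itself) - the point
is only that the inline assembly ends with vzeroupper before control
returns to SSE code compiled without -mavx:

/* sketch: sum of 4 doubles with AVX inline asm, ending in vzeroupper
   so the surrounding SSE2 code does not pay the transition penalty */
static double sum4(const double *p)
{
    double r;
    __asm__ ("vmovupd      (%1), %%ymm0           \n\t"
             "vextractf128 $1, %%ymm0, %%xmm1     \n\t"
             "vaddpd       %%xmm1, %%xmm0, %%xmm0 \n\t"
             "vhaddpd      %%xmm0, %%xmm0, %%xmm0 \n\t"
             "vmovsd       %%xmm0, %0             \n\t"
             "vzeroupper                          \n\t"
             : "=m" (r) : "r" (p) : "xmm0", "xmm1", "memory");
    return r;
}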

Anton Ertl

Nov 15, 2016, 7:44:07 AM
Thank you, that was it. Melzzzzz' mention
<20161114184331.23726fd9@maxa-pc> of zeroupper was cryptic, but with
the help of Google pointed in the same direction, e.g.,
<https://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf>.
The solution is to compile tspe.c with

gcc -mavx -O3 tspe.c -lm

(-march=native instead of -mavx, as suggested by Melzzzzz, produces code
that executes a little slower, using slightly fewer instructions.)
Using -mavx for the AVX version (and not for the SSE2 version; no
significant difference for the scalar version) gives the following
results:

i7-4690K i7-6700K
Haswell Skylake
cycles cycles instructions
187,875,667 185,533,466 366,386,914 C original (scalar)
167,123,882 167,605,352 282,699,663 SSE2
143,634,118 147,287,618 168,302,112 AVX2
390,329,471 150,178,375 168,342,038 AVX2 without -mavx (SSE2 for scalars)

So on Skylake the SSE/AVX mixing penalty is indeed much reduced, but
it seems that we pay for it with a small slowdown for AVX-optimized
code.

So, this riddle is solved, but it leads to another: when they
introduced the YMM registers, why did the Intel architects specify
that the SSE instructions leave the upper bits alone? They could have
specified that they zero the upper bits, like the AVX128 instructions
are doing (or like the 32-bit instructions are doing when working on
64-bit registers in AMD64, and similarly in every other 32-bit->64-bit
transition); existing code would be unaffected (it does not use the
upper bits); AVX code would have to treat the YMM registers (at least
the upper half) as caller-saved, but that's certainly less of a
problem than the huge penalty we see on Haswell.

Anton Ertl

Nov 15, 2016, 7:59:10 AM
Bernhard Schornak <scho...@nospicedham.web.de> writes:
>BTW:
>
>subq $0x10, %rax
>movq (%rax), %xmm5
>
>can be simplified to
>
>mov -0x10(%rax), %xmm5
>
>resolving one dependency (eliminates waiting times for
>the result of the subtraction).

Replacing

Linner: subq $32, %rax
vsubpd (%rax), %ymm6, %ymm5

with

Linner: vsubpd -32(%rax), %ymm6, %ymm5
subq $32, %rax

reduced the Haswell cycles from 143,675,800 to 140,758,812

Anton Ertl

Nov 15, 2016, 8:19:27 AM
timca...@aol.com writes:
>> SSE2                                 AVX2
>>         movapd   %xmm2, %xmm4                vbroadcastsd %xmm2, %ymm4
>>         shufpd   $0, %xmm4, %xmm4            vbroadcastsd %xmm3, %ymm6
>>         movapd   %xmm3, %xmm6        Linner: subq         $32, %rax
>>         shufpd   $0, %xmm6, %xmm6            vsubpd       (%rax), %ymm6, %ymm5
>> Linner: subq     $16, %rax                   vmulpd       %ymm5, %ymm5, %ymm5
>>         movupd   (%rax), %xmm5               vcmpltpd     %ymm4, %ymm5, %ymm5
>>         subpd    %xmm6, %xmm5                vpmovmskb    %ymm5, %rcx
>>         mulpd    %xmm5, %xmm5                test         %rcx, %rcx
>>         cmpltpd  %xmm4, %xmm5                je           Linner
>>         pmovmskb %xmm5, %rcx                 bsr          %rcx, %rcx
>>         test     %rcx, %rcx                  leaq         -7(%rax,%rcx), %rax
>>         je       Linner
>>         bsr      %rcx, %rcx
>>         leaq     -7(%rax,%rcx), %rax
...
>I would guess that the data has an end marker of some kind so you don't fall off the end of the array?

Yes, this code uses a sentinel. Concerning your other posting, what I
think the leaq is doing is to point p/%rax to the place where it
should be at the end of the inner loop. If it did not do what I think
it does, the code would be unlikely to work.
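
To spell the fixup out with a worked example (my illustration): if the
match is in lane j (j = 0..3) of the YMM register, vcmpltpd sets all 8
bytes of that lane, so vpmovmskb sets bits 8j..8j+7 and bsr leaves 8j+7 in
%rcx. At that point %rax already points at the base of the 4 loaded
doubles, so leaq -7(%rax,%rcx), %rax computes %rax+8j, the address of the
matching double; e.g., a hit in lane 2 gives %rcx = 23 and a byte offset
of 16.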

>I don't know what the data blocking is, but you could un-roll both SSE & AVX versions up to 8 times to kick the parallelism up a notch.

Parallelism is pretty good in this code, because the only loop-carried
dependency is the updates of %rax. Given that this takes one cycle
for every 7 instructions (and these CPUs can only decode 4
instructions or so per cycle, and execution resources may be even more
limited), I doubt that unrolling would help much (ok, fewer
instructions to decode and execute), and, frankly, I am too lazy for
it:-). Also, for the original code some variants executed one or two
sub instructions more per iteration, and it did not change the
performance, so the limit is apparently functional units other than
the integer ALU.

Packing the results of the vcmpltpd instructions as suggested by Terje
Mathisen saves some vpmovmskb, test, and je instructions, but takes
additional instructions for the packing (and I am not familiar enough
with SSE2 and AVX to know which instructions would be good for this
packing); may be a win, or maybe not.

Terje Mathisen

Nov 15, 2016, 8:29:13 AM
Anton Ertl wrote:
> i7-4690K i7-6700K
> Haswell Skylake
> cycles cycles instructions
> 187,875,667 185,533,466 366,386,914 C original (scalar)
> 167,123,882 167,605,352 282,699,663 SSE2
> 143,634,118 147,287,618 168,302,112 AVX2
> 390,329,471 150,178,375 168,342,038 AVX2 without -mavx (SSE2 for scalars)
>
> So on Skylake the SSE/AVX mixing penalty is indeed much reduced, but
> it seems that we pay for it with a small slowdown for AVX-optimized
> code.
>
> So, this riddle is solved, but it leads to another: when they
> introduced the YMM registers, why did the Intel architects specify
> that the SSE instructions leave the upper bits alone? They could have
> specified that they zero the upper bits, like the AVX128 instructions
> are doing (or like the 32-bit instructions are doing when working on
> 64-bit registers in AMD64, and similarly in every other 32-bit->64-bit
> transition); existing code would be unaffected (it does not use the
> upper bits); AVX code would have to treat the YMM registers (at least
> the upper half) as caller-saved, but that's certainly less of a
> problem than the huge penalty we see on Haswell.

I believe I read that they decided to leave the upper halves unmodified
in order to make it easier to write code that took two 128-bit inputs
and merged them, i.e. now you can grab one half, put into the top part
and then load the low half.

PRS stalls is a longtime favorite of mine, dating back to when my very
carefully optimized (close to perfect imho) Pentium code stalled 10-20
cycles every 2 or 3 cycles on a P6. :-(

Terje

Terje Mathisen

Nov 15, 2016, 8:29:14 AM
Anton Ertl wrote:
> Bernhard Schornak <scho...@nospicedham.web.de> writes:
>> BTW:
>>
>> subq $0x10, %rax
>> movq (%rax), %xmm5
>>
>> can be simplified to
>>
>> mov -0x10(%rax), %xmm5
>>
>> resolving one dependency (eliminates waiting times for
>> the result of the subtraction).
>
> Replacing
>
> Linner: subq $32, %rax
> vsubpd (%rax), %ymm6, %ymm5
>
> with
>
> Linner: vsubpd -32(%rax), %ymm6, %ymm5
> subq $32, %rax
>
> reduced the Haswell cycles from 143,675,800 to 140,758,812

If you are allowed to overrun the end of the inputs (you can obviously
do so for at least 3 elements!) by even more, then I would suggest
handling at least 8 elements on each iteration, with a single
compare/compact/test & branch at the end.

As long as your buffers are 64-byte aligned you can always load a
64-byte cache line safely, without causing any protection faults.

This would work with 4 SSE2 loads/squares or 2 AVX2 sets, and a single
input pointer update just after the last load, giving lots of time for
the new pointer to become valid while the exit checks are performed.
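
In intrinsics terms the per-iteration part could look something like this
sketch (the names are made up, and the code that pins down the exact
matching element after the loop exits is left out):

#include <immintrin.h>

/* 8 doubles per iteration, one combined exit test; assumes enough
   sentinel/padding so that the loads cannot fault. */
static double *scan_back8(double *p, double ThisX, double CloseDist)
{
    __m256d thisx4 = _mm256_set1_pd(ThisX);
    __m256d close4 = _mm256_set1_pd(CloseDist);
    __m256d m0, m1;
    do {
        p -= 8;
        __m256d d0 = _mm256_sub_pd(_mm256_loadu_pd(p),     thisx4);
        __m256d d1 = _mm256_sub_pd(_mm256_loadu_pd(p + 4), thisx4);
        m0 = _mm256_cmp_pd(_mm256_mul_pd(d0, d0), close4, _CMP_LT_OS);
        m1 = _mm256_cmp_pd(_mm256_mul_pd(d1, d1), close4, _CMP_LT_OS);
        /* single test for all 8 candidates */
    } while (_mm256_movemask_pd(_mm256_or_pd(m0, m1)) == 0);
    /* after exit, m0/m1 still identify which of p[0..7] matched */
    return p;
}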

Melzzzzz

Nov 15, 2016, 8:44:17 AM
Oh, I regularly add a vzeroupper instruction at the end of every AVX routine,
so I didn't have the slightest idea that this is not commonly known ;)

Anton Ertl

Nov 15, 2016, 8:44:18 AM
Bernhard Schornak <scho...@nospicedham.web.de> writes:
>Walking upwards in memory should speed up execution
>times inside the inner loop, as well.

Why do you think so? Anyway, this would require changing much of the
program, so I am not going to try it.

>BTW: Measuring iteration counts where the code's execution
>exceeds a time slice (granted by the multitasking
>scheduler) will produce results that include task switch
>times introduced by the OS. With RDTSC(P), results
>for execution times above a few milliseconds might as well
>be produced by a random number generator...

If you mean that the results are unreliable because of task switching
or at least the timer interrupt by the OS, that's why I measured
cycles in user mode (cycles:u) using virtualized performance
monitoring counters. I also use the --repeat feature of perf stat to
make multiple runs, and get the arithmetic mean and the variance of
these runs. E.g., for one measurement with 100 runs on the Haswell it
reports:

143,675,800 cycles:u ( +- 0.01% )

i.e., the variance is pretty low for these runs.

Anton Ertl

Nov 15, 2016, 9:21:43 AM
already...@yahoo.com writes:
>It seem, HW prefetch does not work in your AVX case.
>Try sw prefetch with various distances.

I did not expect that to make a difference, because all the data
(160KB) is in the L2 cache, but it does make a difference:

Haswell
cycles instructions
140,742,030 168,297,304 fastest without prefetching
137,903,571 181,400,228 with prefetch

I added prefetching to the fastest version I have, tried PREFETCHT0
and PREFETCHNTA (no significant performance difference for the
distances I tried), and tried different distances; the performance
gets better up to a prefetch distance of 320 bytes (10 iterations),
then is about constant (I also tried a distance of 1024 bytes).
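
The change itself is tiny; in intrinsics terms it amounts to something
like this sketch (not the exact code in my version):

#include <xmmintrin.h>

/* One prefetch per iteration, 320 bytes (10 iterations of 32 bytes)
   ahead of the downward-moving pointer; PREFETCHNTA behaved the same. */
static inline void prefetch_ahead(const double *p)
{
    _mm_prefetch((const char *)p - 320, _MM_HINT_T0);
}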

One reason why I did not expect to see a speedup is that in earlier
optimizations, I moved from an array-of-structs to a struct-of-arrays,
leading to many fewer L1 cache misses, but I did not see a performance
improvement from that. See

<http://www.complang.tuwien.ac.at/anton/lvas/effizienz/tsp9.c>
<http://www.complang.tuwien.ac.at/anton/lvas/effizienz/tspc.c>

>BTW, do you have explanation for instruction count in AVX case?
>A simple arithmetic suggests, it should be 1.4*50M / 4 * 7 = 123M, but you measured much more.

Looking at the difference between the instruction counts above, the
AVX inner loop is executed 13,102,924 times accounting for 91.7M
instructions in the prefetch-less version. So the number of
instructions executed outside the inner loop is around 76M. The
scalar C program executed around 300M instructions (50M*6) in the
inner loop and about 66M instructions elsewhere. I guess that there
are around 7 additional instructions in the AVX version of the outer
loop (we see 4 of these in the code I posted), and with 1.4M
iterations of the outer loop, this accounts for the additional 10M
instructions in the AVX version.

Anton Ertl

Nov 15, 2016, 9:24:03 AM
already...@yahoo.com writes:
>Oh, above I miscounted. 1.4M*50M = 70T,

A misunderstanding. The inner loop has 50M iterations total, not per
outer loop iteration. So on average 35 iterations per outer loop
iteration.

already...@yahoo.com

Nov 15, 2016, 10:30:47 AM
On Tuesday, November 15, 2016 at 4:21:43 PM UTC+2, Anton Ertl wrote:
> already...@yahoo.com writes:
> >It seem, HW prefetch does not work in your AVX case.
> >Try sw prefetch with various distances.

Obviously, I was wrong about the reason for the slowness. HW prefetch has nothing to do with it.
I didn't realize upfront that Melzzzzz was right, because I didn't realize that the average # of iterations in the inner loop is so low.

>
> I did not expect that to make a difference, because all the data
> (160KB) is in the L2 cache, but it does make a difference:
>
> Haswell
> cycles instructions
> 140,742,030 168,297,304 fastest without prefetching
> 137,903,571 181,400,228 with prefetch
>

It makes a difference mostly because ~75% of memory accesses in your inner loop are misaligned. You can make all accesses aligned at the cost of a moderate complication of the middle loop. I'd guess it will make a bigger difference than prefetch, but I still don't expect dramatic improvements.
For dramatic improvements see below.
IMHO, the best way to improve the speed further is by following The First Principle of Vectorization - work harder, not smarter.

That is, you try to be smart by saving the "+= sqr(toury[p-tourx]-ThisY)" calculation in the inner loop. That's a reasonable strategy for scalar code, but not for vector code. In vector code regularity is far more important than saving 1 load, 1 mul and a couple of adds. Get rid of the inner loop altogether!
From here on there are two possibilities:
1. Completely branchless inner loop - for that you will have to maintain a SIMD copy of the loop index in ymm, which is pretty easy with AVX2, but possible even with regular AVX.
2. Inner loop with the branch on sqrDistX+sqrDistY > Thisdist

Looking at statistics, the condition (sqrDistX+sqrDistY > Thisdist) occurs approximately 92,000 times per 12.5M iterations of the inner loop, i.e. ~every 135 iterations. The probability looks rather low, so variant (2) is likely a little faster. But variant (1) is cooler as a coding exercise.


already...@yahoo.com

Nov 15, 2016, 10:40:56 AM
On Tuesday, November 15, 2016 at 4:21:43 PM UTC+2, Anton Ertl wrote:
> already...@yahoo.com writes:
> >It seem, HW prefetch does not work in your AVX case.
> >Try sw prefetch with various distances.
>
> I did not expect that to make a difference, because all the data
> (160KB) is in the L2 cache, but it does make a difference:
>
> Haswell
> cycles instructions
> 140,742,030 168,297,304 fastest without prefetching
> 137,903,571 181,400,228 with prefetch
>

How do you measure cycles?

I measured with __rdtsc() before and after the tsp() call and got much smaller numbers on your original code compiled with -O3 -mavx
Scalar: 140M
SSE: 120M
AVX2: 110M

Measured on Haswell Xeon E3-1271 v3, HT off, two out of four cores busy doing FPGA fitting, so TB is likely not in effect.

Kalle Olavi Niemitalo

Nov 15, 2016, 11:46:55 AM
an...@nospicedham.mips.complang.tuwien.ac.at (Anton Ertl) writes:

> So, this riddle is solved, but it leads to another: when they
> introduced the YMM registers, why did the Intel architects specify
> that the SSE instructions leave the upper bits alone?

Here's an explanation from someone at Intel.
https://software.intel.com/en-us/forums/intel-isa-extensions/topic/301853

They decided to support applications that use AVX, being interrupted
by device drivers that save only XMM registers and use SSE.

Anton Ertl

Nov 15, 2016, 11:56:23 AM
Terje Mathisen <terje.m...@nospicedham.tmsw.no> writes:
>Anton Ertl wrote:
>> So, this riddle is solved, but it leads to another: when they
>> introduced the YMM registers, why did the Intel architects specify
>> that the SSE instructions leave the upper bits alone? They could have
>> specified that they zero the upper bits, like the AVX128 instructions
>> are doing (or like the 32-bit instructions are doing when working on
>> 64-bit registers in AMD64, and similarly in every other 32-bit->64-bit
>> transition); existing code would be unaffected (it does not use the
>> upper bits); AVX code would have to treat the YMM registers (at least
>> the upper half) as caller-saved, but that's certainly less of a
>> problem than the huge penalty we see on Haswell.
>
>I believe I read that they decided to leave the upper halves unmodified
>in order to make it easier to write code that took two 128-bit inputs
>and merged them, i.e. now you can grab one half, put into the top part
>and then load the low half.

You need a separate instruction for putting something in the upper 128
bits anyway (and I think there are such instructions in AVX), so
that's not a good reason to do it.

Another speculation: If the first implementation implemented the YMM
registers as consisting of 2 128-bit registers, then leaving the upper
half alone would make the implementation simpler. But AFAIK the Sandy
Bridge does not do that (unlike some AMD CPUs); according to
<http://www.realworldtech.com/haswell-cpu/4/> it combines FP and SIMD
bypass networks for AVX, but I don't think that keeping the top of the
YMM registers untouched on SSE instructions would help that. Maybe
the designer of AVX had an AMD-like implementation in mind.

>PRS stalls is a longtime favorite of mine, dating back to when my very
>carefully optimized (close to perfect imho) Pentium code stalled 10-20
>cycles every 2 or 3 cycles on a P6. :-(

Yes, given that they had first-hand and painful knowledge that leaving
parts of registers untouched is a mistake, and so many examples of how
to do it right (basically all the 32-bit->64-bit transitions), it's
really surprising that they made this mistake again with the YMM
registers.

Anton Ertl

Nov 15, 2016, 12:02:53 PM
Terje Mathisen <terje.m...@nospicedham.tmsw.no> writes:
>If you are allowed to overrun the end of the inputs (you can obviously
>do so for at least 3 elements!) by even more, then I would suggest
>handling at least 8 elements on each iteration, with a single
>compare/compact/test & branch at the end.

I can arrange enough padding. If you write the code, I am happy to
test it.

Bernhard Schornak

Nov 15, 2016, 12:32:58 PM
I do not understand these cryptic messages, but VZEROUPPER
doesn't access memory, so I don't see how this instruction
could improve Anton's AVX code working with the entire YMM
registers rather than their lower halves.

Anton's results are 'cycles', so I guess he uses RDTSC(P),
while your messages list seconds, which is far too coarse:
One clock on a 4 GHz machine is 0.00000000025 seconds!

Bernhard Schornak

Nov 15, 2016, 12:32:59 PM
Anton Ertl wrote:


> Bernhard Schornak <scho...@nospicedham.web.de> writes:
>> BTW:
>>
>> subq $0x10, %rax
>> movq (%rax), %xmm5
>>
>> can be simplified to
>>
>> mov -0x10(%rax), %xmm5
>>
>> resolving one dependency (eliminates waiting times for
>> the result of the subtraction).
>
> Replacing
>
> Linner: subq $32, %rax
> vsubpd (%rax), %ymm6, %ymm5
>
> with
>
> Linner: vsubpd -32(%rax), %ymm6, %ymm5
> subq $32, %rax
>
> reduced the Haswell cycles from 143,675,800 to 140,758,812


A small step for a few iterations, but a giant leap for
a fully grown program. Whoever needs to fine-tune code will
be happy to gain one microsecond... ;)

Bernhard Schornak

Nov 15, 2016, 12:33:00 PM
Anton Ertl wrote:


> Bernhard Schornak <scho...@nospicedham.web.de> writes:
>> Walking upwards in memory should speed up execution
>> times inside the inner loop, as well.
>
> Why do you think so? Anyway, this would require changing much of the
> program, so I am not going to try it.
>
>> BTW: Measuring iteration counts where the code's execution
>> exceeds a time slice (granted by the multitasking
>> scheduler) will produce results that include task switch
>> times introduced by the OS. With RDTSC(P), results
>> for execution times above a few milliseconds might as well
>> be produced by a random number generator...
>
> If you mean that the results are unreliable because of task switching
> or at least the timer interrupt by the OS, that's why I measured
> cycles in user mode (cycles:u) using virtualized performance
> monitoring counters. I also use the --repeat feature of perf stat to
> make multiple runs, and get the arithmetic mean and the variance of
> these runs. E.g., for one measurement with 100 runs on the Haswell it
> reports:
>
> 143,675,800 cycles:u ( +- 0.01% )
>
> i.e., the variance is pretty low for these runs.


Okay. I prefer using RDTSCP with code running for less than
two milliseconds. In general, this yields the most accurate
results.

already...@yahoo.com

Nov 15, 2016, 12:38:02 PM
Melzzzzz was 100% correct. Read the rest of the thread.

Anton Ertl

Nov 15, 2016, 1:22:37 PM
already...@yahoo.com writes:
>IMHO, the best way to improve the speed further is by following The First
>Principle of Vectorization - work harder, not smarter.
>
>That is, you try to be smart by saving the "+= sqr(toury[p-tourx]-ThisY)"
>calculation in the inner loop. That's a reasonable strategy for scalar code,
>but not for vector code.

Yes, that was one of the optimizations in the scalar version. It
introduces quite a few branch mispredictions, but it saved enough in
the scalar version to be worth that cost. The tradeoff may be
different in the AVX version. So let's try it:

cycles instructions
137,935,640 181,400,228 smarter
95,480,438 206,482,488 harder

Yes, quite a nice speedup. Code in
<http://www.complang.tuwien.ac.at/anton/lvas/effizienz/tspg.c>. What
I wonder about is that this has 120k branch mispredictions, while a
similar scalar version (tsp5.c) has only 77k branch mispredictions. I
would expect the same number of mispredictions; it seems that gcc
unrolls the middle loop of tspg.c, and that may cause the
mispredictions.

>1. Completely branchless inner loop - for that you will have to maintain a
>SIMD copy of the loop index in ymm, which is pretty easy with AVX2, but
>possible even with regular AVX.

AVX2 is already necessary (256-bit vpmovmskb is AVX2).

I am not familiar enough with AVX to implement this approach in
acceptable time. I used the other approach:

>2. Inner loop with the branch on sqrDistX+sqrDistY > Thisdist

Anton Ertl

Nov 15, 2016, 1:24:38 PM
already...@yahoo.com writes:
>On Tuesday, November 15, 2016 at 4:21:43 PM UTC+2, Anton Ertl wrote:
>> already...@yahoo.com writes:
>> >It seem, HW prefetch does not work in your AVX case.
>> >Try sw prefetch with various distances.
>>
>> I did not expect that to make a difference, because all the data
>> (160KB) is in the L2 cache, but it does make a difference:
>>
>> Haswell
>> cycles instructions
>> 140,742,030 168,297,304 fastest without prefetching
>> 137,903,571 181,400,228 with prefetch
>>
>
>How do you measure cycles?

perf stat -B -r100 -e cycles:u -e instructions:u -e branch-misses:u -e L1-dcache-load-misses:u a.out 10000 >/dev/null

Melzzzzz

Nov 15, 2016, 1:52:51 PM
On Tue, 15 Nov 2016 18:22:49 GMT
an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> already...@yahoo.com writes:
> >On Tuesday, November 15, 2016 at 4:21:43 PM UTC+2, Anton Ertl
> >wrote:
> >> already...@yahoo.com writes:
> >> >It seem, HW prefetch does not work in your AVX case.
> >> >Try sw prefetch with various distances.
> >>
> >> I did not expect that to make a difference, because all the data
> >> (160KB) is in the L2 cache, but it does make a difference:
> >>
> >> Haswell
> >> cycles instructions
> >> 140,742,030 168,297,304 fastest without prefetching
> >> 137,903,571 181,400,228 with prefetch
> >>
> >
> >How do you measure cycles?
>
> perf stat -B -r100 -e cycles:u -e instructions:u -e branch-misses:u
> -e L1-dcache-load-misses:u a.out 10000 >/dev/null
>
> - anton

Wow, I didn't know about this useful tool!

Melzzzzz

Nov 15, 2016, 2:18:09 PM
On Tue, 15 Nov 2016 18:20:18 +0100
It is because the partial register stall happens in the SSE2 code later.
Read the whole thread (including the messages from comp.arch that were not cross-posted).
In this example both programs are compiled with -O3 but not with -mavx,
so both programs generate SSE2 code. The inline assembly is AVX, so
placing vzeroupper at the end of the inline assembly block makes the code
almost three times faster.

timca...@aol.com

Nov 15, 2016, 2:32:19 PM
I found out that gcc will use AVX-style encoding for SSE instructions by default, which is a detriment to the SSE test AFAICT. Use:

-mno-avx -msse4.2 -mtune=corei7

to get true SSE-only code in your benchmark.

- Tim

already...@yahoo.com

Nov 15, 2016, 3:05:31 PM
Absolutely not.
vmovmskpd is much more natural tool.
leaq -7(%[p],%%rcx), %[p]
become
leaq 0(%[p],%%rcx, 8), %[p]
The rest is the same.


Other AVX2 instruction in your code is vbroadcastsd. Also unnecessary. It can be replaced by sequence of two instructions. But it is simpler to let compiler to do a job via intrinsic _mm256_set1_pd().
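
For example (a sketch; the exact instruction sequence is left to the compiler):

#include <immintrin.h>

/* Plain-AVX broadcast: the memory form of vbroadcastsd is AVX (only the
   register form is AVX2), and _mm256_set1_pd() lets the compiler choose
   between that and a two-instruction register sequence. */
static __m256d bcast(double x)
{
    return _mm256_set1_pd(x);
}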

The tricky part in the branchless AVX variant is management of the index in the 256-bit floating-point register. My personal solution: make the index floating-point. Good enough up to 2**53.

>
> I am not familiar enough with AVX to implement this approach in
> acceptable time. I used the other approach:
>
> >2. Inner loop with the branch on sqrDistX+sqrDistY > Thisdist
>


As I said above, branchless is interesting as an exercise, but I don't expect it to be the fastest.

Melzzzzz

Nov 15, 2016, 3:05:50 PM
What version?
gcc version 6.2.1 20160830 (GCC)
with -O3 generates only sse2 instructions.

Andrew Cooper

Nov 15, 2016, 4:33:20 PM
Because windows. (isn't that always the answer...)

Windows has %xmm registers in its calling convention, which includes
interrupt handlers. Therefore, unmodified device drivers servicing an
interrupt which interrupts some AVX code would cause state corruption.

The Linux approach is far more sane. Thou shalt under no circumstances
use SIMD in the kernel. If you think you need to use it, you are wrong,
and if you really really need to use it, you are required to save and
restore *all* floating point state as punishment.

~Andrew

already...@yahoo.com

Nov 15, 2016, 4:43:04 PM
On Tuesday, November 15, 2016 at 11:33:20 PM UTC+2, Andrew Cooper wrote:

> Because windows. (isn't that always the answer...)
>
> Windows has %xmm registers in its calling convention, which includes
> interrupt handlers. Therefore, unmodified device drivers servicing an
> interrupt which interrupts some AVX code would cause state corruption.
>

Because you have no idea. (isn't that always the answer...)
How many Windows drivers did you write? Or read? Or at least open the DDK docs?

Terje Mathisen

Nov 16, 2016, 4:18:59 AM
Anton Ertl wrote:
> Terje Mathisen <terje.m...@nospicedham.tmsw.no> writes:
>> If you are allowed to overrun the end of the inputs (you can obviously
>> do so for at least 3 elements!) by even more, then I would suggest
>> handling at least 8 elements on each iteration, with a single
>> compare/compact/test & branch at the end.
>
> I can arrange enough padding. If you write the code, I am happy to
> test it.
...

OK, I downloaded your source code; it is definitely a big win to "work
harder" since that will significantly reduce the number of branches and
branch mispredicts: with random city locations you will have a high
number of cities which are closer in just the X direction without being
closer in the combined (dx,dy) sense.

The fun part was that I didn't even consider writing the code the way
you did since I knew that I could overlap the first and the second sqr()
multiplication, so the actual cost would just be a single cycle plus the
addition to combine them. :-)

Even though the results of the vcmpltpd operation is 2/4 64-bit ints
(either 0 or -1) you can still use vpackusdw to combine two of them into
a single result with twice as many flags, i.e. 8 flags of 32 bits each.

For SSE code you should be able to use three PACKUSDW ops to create a
128-bit result with 8 16-bit flags.

In order to convert this back to an address offset for [p] you just
apply scaling of RCX by 2 or 4 in the LEA at the end.

BTW, I would also start with p=&tourx[ncities-1] in order to avoid the
-32 offset for the vsubpd operations, this means using 0 to ncities-1
for the outer (i) loop.

BTW, you know of course that you can use BSP or similar partitioning to
significantly speed up the search for the closest city, but I assume a
greedy "always grab the closest city first" approach means that this is
more of a CPU benchmark than actual production code. :-)

already...@yahoo.com

Nov 16, 2016, 6:33:48 AM
On Tuesday, November 15, 2016 at 8:22:37 PM UTC+2, Anton Ertl wrote:
>
> I am not familiar enough with AVX to implement this approach in
> acceptable time. I used the other approach:
>

// Needs: #include <float.h>, <string.h>, <immintrin.h>;
// point, sqr() and swap() come from the surrounding tsp program.
// tourx and toury aligned on 32-byte boundaries
void tps(point cities[], double tourx[], double toury[], int ncities)
{
  tourx[0] = cities[ncities-1].x;
  toury[0] = cities[ncities-1].y;
  for (int i=1; i<ncities; i++) {
    tourx[i] = cities[i-1].x;
    toury[i] = cities[i-1].y;
  }

  for (int i=1; i<ncities; i++) {
    double ThisX = tourx[i-1];
    double ThisY = toury[i-1];
    double CloseDist = DBL_MAX;
    int min_k = i;
    int k = i;
    if ((i & 3) != 0 || ncities - i < 8) {
      // scalar prolog: advance k to a multiple of 4 (or do everything
      // scalar when fewer than 8 candidates remain)
      int nlast = ncities - i < 8 ? ncities : (i + 4) & (-4);
      for (; k < nlast; ++k) {
        double ThisDist = sqr(tourx[k]-ThisX)+sqr(toury[k]-ThisY);
        if (ThisDist <= CloseDist) {
          CloseDist = ThisDist;
          min_k = k;
        }
      }
    }

    int ni = (ncities - k) / 4;
    if (ni > 0) {
      // main loop: 4 candidates per iteration, branchless minimum
      __m256d thisx_v = _mm256_set1_pd(ThisX);
      __m256d thisy_v = _mm256_set1_pd(ThisY);
      __m256d minDist_v = _mm256_set1_pd(CloseDist);
      __m256d minK_v = _mm256_set1_pd((double)min_k);
      __m256d k_v = _mm256_setr_pd((double)(k+0), (double)(k+1), (double)(k+2), (double)(k+3));
      __m256d kInc_v = _mm256_set1_pd((double)4);
      const __m256d* tourx_v = (const __m256d*)&tourx[k];
      const __m256d* toury_v = (const __m256d*)&toury[k];
      k += ni*4;
      for (int kk = 0; kk < ni; ++kk) {
        __m256d dx_v = _mm256_sub_pd(tourx_v[kk], thisx_v);
        __m256d dy_v = _mm256_sub_pd(toury_v[kk], thisy_v);
        __m256d dist_v = _mm256_add_pd(_mm256_mul_pd(dx_v, dx_v),_mm256_mul_pd(dy_v, dy_v));
        __m256d le_v = _mm256_cmp_pd(dist_v, minDist_v, _CMP_LE_OQ);
        minDist_v = _mm256_blendv_pd(minDist_v, dist_v, le_v);
        minK_v = _mm256_blendv_pd(minK_v, k_v, le_v);
        k_v = _mm256_add_pd(k_v, kInc_v);
      }
      // fold: reduce the 4 per-lane minima to a single minimum
      __m256d minDist_vh = _mm256_permute2f128_pd(minDist_v, minDist_v, 1);
      __m256d minK_vh = _mm256_permute2f128_pd(minK_v, minK_v, 1);
      __m256d le_v = _mm256_cmp_pd(minDist_v, minDist_vh, _CMP_LE_OQ);
      minDist_v = _mm256_blendv_pd(minDist_vh, minDist_v, le_v);
      minK_v = _mm256_blendv_pd(minK_vh, minK_v, le_v);
      minDist_vh = _mm256_permute_pd(minDist_v, 1);
      minK_vh = _mm256_permute_pd(minK_v, 1);
      le_v = _mm256_cmp_pd(minDist_v, minDist_vh, _CMP_LE_OQ);
      minDist_v = _mm256_blendv_pd(minDist_vh, minDist_v, le_v);
      minK_v = _mm256_blendv_pd(minK_vh, minK_v, le_v);
      // CloseDist = (double)_mm256_castpd256_pd128(minDist_v);
      // min_k = (double)_mm256_castpd256_pd128(minK_v);
      memcpy(&CloseDist, &minDist_v, sizeof(double));
      double min_kd;
      memcpy(&min_kd, &minK_v, sizeof(double));
      min_k = (int)min_kd;
    }
    // scalar epilog
    for (; k < ncities; ++k) {
      double ThisDist = sqr(tourx[k]-ThisX)+sqr(toury[k]-ThisY);
      if (ThisDist <= CloseDist) {
        CloseDist = ThisDist;
        min_k = k;
      }
    }
    swap(&tourx[i],&tourx[min_k]);
    swap(&toury[i],&toury[min_k]);
  }
}



Anton Ertl

Nov 16, 2016, 9:19:16 AM
Thank you. An interesting design problem. Looking back at it from
today's vantage point, have they really chosen well? Looking at how
hard Microsoft is trying to get users to switch to newer Windows
versions (e.g., not supporting new DirectX on old, but supported
Windows versions), I wonder if they would not have welcomed a design
where it would have been necessary for the OS to enable the use of YMM
registers (i.e., you could not use them on a legacy OS); of course,
Intel prefers users to want newer Intel processors independently of
new Windows versions, but given that they are usually sold together,
that would make little difference in practice.

Anton Ertl

Nov 16, 2016, 9:51:42 AM
already...@yahoo.com writes:
>> AVX2 is already necessary (256-bit vpmovmskb is AVX2).
>
>Absolutely not.
>vmovmskpd is much more natural tool.
>leaq -7(%[p],%%rcx), %[p]
>become
>leaq 0(%[p],%%rcx, 8), %[p]
>The rest is the same.

Thanks. Yes, works nicely.

>Other AVX2 instruction in your code is vbroadcastsd. Also unnecessary. It can be replaced by sequence of two instructions.

I chose to use the memory-to-register version (AVX) of this
instruction instead.

The resulting AVX versions have the same or slightly better speed on
Haswell as the AVX2 versions, and they also work on Sandy Bridge:

Xeon E3-1220 i7-4690K
Sandy Bridge Haswell
cycles cycles instructions
240,874,925 192,411,043 413,323,795 scalar smart
153,192,044 136,464,596 181,520,934 AVX smart with prefetch
131,286,268 95,393,579 206,593,195 AVX hard with prefetch
128,785,936 123,517,484 230,002,974 AVX "branchless"

already...@yahoo.com

Nov 16, 2016, 9:54:05 AM
On Wednesday, November 16, 2016 at 4:19:16 PM UTC+2, Anton Ertl wrote:
> Kalle Olavi Niemitalo <k...@nospicedham.iki.fi> writes:
> >an...@nospicedham.mips.complang.tuwien.ac.at (Anton Ertl) writes:
> >
> >> So, this riddle is solved, but it leads to another: when they
> >> introduced the YMM registers, why did the Intel architects specify
> >> that the SSE instructions leave the upper bits alone?
> >
> >Here's an explanation from someone at Intel.
> >https://software.intel.com/en-us/forums/intel-isa-extensions/topic/301853
> >
> >They decided to support applications that use AVX, being interrupted
> >by device drivers that save only XMM registers and use SSE.
>
> Thank you. An interesting design problem.

Not really.
Mark Buxton's arguments are not convincing at all.
No matter how the hardware defines it, when a user process that uses AVX is interrupted, the OS has to save the full AVX state, because there is no guarantee that the interrupt is going to return to the same process. One can do a micro-optimization, a sort of deferred saving, but it's not obvious at all that it will be faster than eager state saving.
I could imagine that Intel's definition for legacy instructions can help something like VxWorks, where the user has full control of the application, but not necessarily of the OS/BSP. But not for "real" OSes, like Windows/Unix/etc.

The case presented by Agner Fog (a function using a full YMM register calls a legacy function which is unaware of the YMM extension but saves the corresponding XMM register before using it, and restores the value before returning) looks more real, but it's not obvious that it is practically important.





> Looking back at it from
> today's vantage point, have they really chosen well? Looking at how
> hard Microsoft is trying to get users to switch to newer Windows
> versions (e.g., not supporting new DirectX on old, but supported
> Windows versions), I wonder if they would not have welcomed a design
> where it would have been necessary for the OS to enable the use of YMM
> registers (i.e., you could not use them on a legacy OS);

That's exactly what happened in reality. One can't use AVX on Windows OSes that are older than Win7 SP1.
What's more, SP1 was not ready until several weeks *after* Sandy Bridge (2011-01-09 vs 2011-02-22). In the meantime, people who bought SandyB couldn't use AVX under Windows.

Quadibloc

Nov 16, 2016, 10:40:42 AM
That is a good answer to another post I saw in this thread, but which I cannot
currently find:

that Intel should have known about the awful hazards of leaving parts of some
registers untouched, having seen them many times, and having done it the
"right" way when they went from 32 bits to 64 bits...

because going from 32 bits to 64 bits involves switching to a different *mode*;
in 64-bit mode, 16-bit software doesn't work any more, for example; whereas
software using AVX, SSE 2, SSE, and MMX all has to run, coexist, and work
perfectly at the same time.

John Savard

Quadibloc

Nov 16, 2016, 10:43:02 AM
Found it now, it was this:

) Yes, given that they had first-hand and painful knowledge that leaving
) parts of registers untouched is a mistake, and so many examples of how
) to do it right (basically all the 32-bit->64-bit transitions), it's
) really surprising that they made this mistake again with the YMM
) registers.

but that was _also_ by Anton Ertl, and I thought I remembered that it was from
a different person than the one to whom you replied.

John Savard

Anton Ertl

Nov 16, 2016, 11:32:34 AM
already...@yahoo.com writes:
> for (int kk = 0; kk < ni; ++kk) {
> __m256d dx_v = _mm256_sub_pd(tourx_v[kk], thisx_v);
> __m256d dy_v = _mm256_sub_pd(toury_v[kk], thisy_v);
> __m256d dist_v = _mm256_add_pd(_mm256_mul_pd(dx_v, dx_v),_mm256_mul_pd(dy_v, dy_v));
> __m256d le_v = _mm256_cmp_pd(dist_v, minDist_v, _CMP_LE_OQ);
> minDist_v = _mm256_blendv_pd(minDist_v, dist_v, le_v);
> minK_v = _mm256_blendv_pd(minK_v, k_v, le_v);
> k_v = _mm256_add_pd(k_v, kInc_v);
> }
> // fold
> __m256d minDist_vh = _mm256_permute2f128_pd(minDist_v, minDist_v, 1);
> __m256d minK_vh = _mm256_permute2f128_pd(minK_v, minK_v, 1);
> __m256d le_v = _mm256_cmp_pd(minDist_v, minDist_vh, _CMP_LE_OQ);
> minDist_v = _mm256_blendv_pd(minDist_vh, minDist_v, le_v);
> minK_v = _mm256_blendv_pd(minK_vh, minK_v, le_v);
> minDist_vh = _mm256_permute_pd(minDist_v, 1);
> minK_vh = _mm256_permute_pd(minK_v, 1);
> le_v = _mm256_cmp_pd(minDist_v, minDist_vh, _CMP_LE_OQ);
> minDist_v = _mm256_blendv_pd(minDist_vh, minDist_v, le_v);
> minK_v = _mm256_blendv_pd(minK_vh, minK_v, le_v);

So you compute four minDists (and corresponding minKs) for the four
subvectors (indexed 0,1,2,3 mod 4) in the inner loop and then take the
global minimum afterwards. Nice trick.

For the k_v computation, why can you not use integers? Is there no
4x64-bit integer support in AVX?

> // CloseDist = (double)_mm256_castpd256_pd128(minDist_v);
> // min_k = (double)_mm256_castpd256_pd128(minK_v);
> memcpy(&CloseDist, &minDist_v, sizeof(double));

In gcc-4.9 this actually stores data to memory and loads it from there
(into the same bits of the same register), for the sole purpose of
satisfying gcc's type system. Ok, it's only executed 10000 times, but
it still is disgusting.

Results (also shown in the other posting):

Xeon E3-1220 i7-4690K
Sandy Bridge Haswell
cycles cycles instructions
240,874,925 192,411,043 413,323,795 scalar smart
153,192,044 136,464,596 181,520,934 AVX smart with prefetch
131,286,268 95,393,579 206,593,195 AVX hard with prefetch
128,785,936 123,517,484 230,002,974 AVX "branchless"

So on Sandy Bridge we actually see a speedup for this; for Haswell
it's slower than the branchy version. Given that there are only 24M
instructions more, and a slight reduction in branch mispredictions
(134k->76k), I am surprised by the 28M additional cycles on Haswell.

Anton Ertl

Nov 16, 2016, 1:04:33 PM
Terje Mathisen <terje.m...@nospicedham.tmsw.no> writes:
>OK, I downloaded your source code, it is definitely a big win to "work
>harder" since that will significantly reduce the number of branches and
>branch mispredicts: With random city locations you will have a high
>number which are closer in just the X direction without being closer in
>the combined (dx,dy) sense.

Just to give numbers to this: With 10k cities, there are 50M
iterations of the inner loop; There are 1.4M cases where the
X-distance is smaller than the smallest total distance up to now;
there are 85k cases where the total distance is smaller than the
smallest distance up to now. The smart way leads to 1.5M branch
mispredictions, the hard way to 0.13M branch mispredictions, but it
takes 25M additional instructions and 6M additional L1 cache misses
(defanged by prefetching).

>The fun part was that I didn't even consider writing the code the way
>you did since I knew that I could overlap the first and the second sqr()
>multiplication, so the actual cost would just be a single cycle plus the
>addition to combine them. :-)

In the scalar version this optimization gives a nice speedup, from
407M cycles to 274M cycles on Haswell.

>Even though the results of the vcmpltpd operation is 2/4 64-bit ints
>(either 0 or -1) you can still use vpackusdw to combine two of them into
>a single result with twice as many flags, i.e. 8 flags of 32 bits each.

I don't think this works, because this instruction converts signed
32-bit integers into unsigned 16-bit integers, with saturation. I.e.,
0 becomes 0, and 11....111, i.e., -1, also becomes 0.

VPACKSSDW (AVX2) might do the trick.
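
For the SSE2 case, something along these lines, I suppose (a sketch with
made-up names; the final lea would then scale %rcx as Terje described):

#include <emmintrin.h>

/* Pack two cmpltpd results (doubles 0-1 in m0, 2-3 in m1) into one
   16-bit mask: PACKSSDW keeps the all-ones/all-zeros pattern intact,
   so each original double contributes 4 mask bits. */
static int pack_mask(__m128d m0, __m128d m1)
{
    __m128i w = _mm_packs_epi32(_mm_castpd_si128(m0), _mm_castpd_si128(m1));
    return _mm_movemask_epi8(w);
}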

>BTW, I would also start with p=&tourx[ncities-1] in order to avoid the
>-32 offset for the vsubpd operations, this means using 0 to ncities-1
>for the outer (i) loop.

But one would have to make adjustments in all other uses of p. And
does it buy anything?

>BTW, you know of course that you can use BSP or similar partitioning to
>significantly speed up the search for the closest city, but I assume a
>greedy "always grab the closest city first" approach means that this is
>more of a CPU benchmark than actual production code. :-)

This is actually the example from Jon Bentley's "Writing Efficient
Programs" (1982), transcribed to C by me in 1999 or so. These days I
wanted to see what could be gotten by going beyond the steps that
Bentley had taken.

What is interesting: Despite the age of this example, and the alleged
huge advances that compilers have made in optimization, that
supposedly make such low-level optimizations as performed in the book
unnecessary, and despite the changes in computer architecture, all of
Bentley's optimizations except inlining still produce speedups even
when using an optimizing compiler (i.e., the compilers do not perform
these optimizations (except inlining) by themselves); and I see a
similar speedup from Bentley's optimizations with optimizing compilers
on current machines (factor 6 on Haswell) as he reported in his book
(factor 5.2 on HP1000 with C, 6.9 on DEC KL10 with Pascal). My
conclusion: Manual optimization is as relevant as ever, and even old
manual optimizations are still relevant.

Looking at the resulting code, the inner loop looked already pretty
good, with hardly any scalar optimization possibilities. The only
scalar optimization I could perform is
array-of-records->record-of-arrays, which reduced cache misses, but
had no performance effect. The code looked vectorizable, and that's
what this thread is about.

already...@yahoo.com

Nov 16, 2016, 6:59:36 PM
There is no integer support in AVX. Period.
Integer is AVX2. And since I coded/tested on IvyB, it was important for me to avoid AVX2.



> > // CloseDist = (double)_mm256_castpd256_pd128(minDist_v);
> > // min_k = (double)_mm256_castpd256_pd128(minK_v);
> > memcpy(&CloseDist, &minDist_v, sizeof(double));
>
> In gcc-4.9 this actually stores data to memory and loads it from there
> (into the same bits of the same register), for the sole purpose of
> satisfying gcc's type system.

My gcc 5.3.0 is no better.
But I prefer to blame Intel for forgetting to provide an official cast intrinsic for casting from __m256d to double.
Of course, the usual pointer-cast trick (CloseDist = *(double*)&minDist_v;) works fine and the "gcc type system" does not stand in the way. If it were in the inner loop, that's what I'd likely do. But I see no reason to code a non-critical part in a non-standard way.

> Ok, it's only executed 10000 times, but
> it still is disgusting.
>
> Results (also shown in the other posting):
>
> Xeon E3-1220 i7-4690K
> Sandy Bridge Haswell
> cycles cycles instructions
> 240,874,925 192,411,043 413,323,795 scalar smart
> 153,192,044 136,464,596 181,520,934 AVX smart with prefetch
> 131,286,268 95,393,579 206,593,195 AVX hard with prefetch
> 128,785,936 123,517,484 230,002,974 AVX "branchless"
>
> So on Sandy bridge we actually see a speedup for this, for Haswell
> it's slower than the branchy version.

On my IvyB a branchless variant is consistently slower than branchy.
Micro-architecturally, IvyB is far closer to SandyB than to HSWL, so I don't know what to say about it.

> Given that there are only 24M
> instructions more, and a slight reduction in branch mispredictions
> (134k->76k), I am surprised by the 28M additional cycles on Haswell.
>

One possibility is a combination of contention over execution port 1 (5 instructions per iteration) with the minDist_v latency chain:
__m256d le_v = _mm256_cmp_pd(dist_v, minDist_v, _CMP_LE_OQ);
minDist_v = _mm256_blendv_pd(minDist_v, dist_v, le_v);
Theoretically, a latency chain is only 4 clocks long, but with contention it could become worse.

Replacing the second mul and add with FMA will ease up the port 1 contention, because FMA can be executed on port 0, but my IvyB has no FMA, so I can't test it.

Another opportunity in the spirit of "work harder not smarter" is shortening a latency chain by means of replacing "easy" single-clock
minDist_v = _mm256_blendv_pd(minDist_v, dist_v, le_v);
with "harder" 3-clock
minDist_v = _mm256_min_pd(minDist_v, dist_v);

On my IvyB it gives a nice speed up. The resulting branchless version is nearly equal in speed to your best "branchy".


Also, are your measurements repeatable?
When I do several invocations in series, I see huge differences between the first invocation and the rest.
Do you see something like that?

Melzzzzz

Nov 17, 2016, 8:58:37 AM
Well, you could do this:

__m128d tmp = _mm256_extractf128_pd(minDist_v,0);
_mm_store_sd(&CloseDist,tmp);
Hopefully gcc is smart enough to optimize away extractf instruction.
Or you can simply cast from 256 to 128.
store_sd uses movsd instruction.

already...@yahoo.com

Nov 17, 2016, 10:13:27 AM
On Thursday, November 17, 2016 at 3:58:37 PM UTC+2, Melzzzzz wrote:
>
> Well, you could do this:
>
> __m128d tmp = _mm256_extractf128_pd(minDist_v,0);
> _mm_store_sd(&CloseDist,tmp);
> Hopefully gcc is smart enough to optimize away extractf instruction.
> Or you can simply cast from 256 to 128.
> store_sd uses movsd instruction.
>

Yes.
_mm_store_sd(&CloseDist, _mm256_castpd256_pd128(minDist_v)) looks like a reasonable compromise between aesthetics (a.k.a. ugliness), readability and standard compliance.
And gcc is smart enough to not generate any code.

But I still don't understand why Intel forgot to define _mm256_castpd256_sd() or at the very least _mm256_store_sd().

With Microsoft compilers that's less of a problem, because we can write
CloseDist = minDist_v.m256_f64[0];
That looks relatively natural.
But with gcc or with portable code the absence of an "official" cast sucks.



already...@yahoo.com

Nov 17, 2016, 11:18:53 AM
On Thursday, November 17, 2016 at 1:59:36 AM UTC+2, already...@yahoo.com wrote:
>
> One possibility is a combination of contention over execution port 1 (5 instructions per iteration) with the minDist_v latency chain:
> __m256d le_v = _mm256_cmp_pd(dist_v, minDist_v, _CMP_LE_OQ);
> minDist_v = _mm256_blendv_pd(minDist_v, dist_v, le_v);
> Theoretically, a latency chain is only 4 clocks long, but with contention it could become worse.

Turns out, I am on the right track, but intervention of compiler further complicates things.

'gcc -O3 -mavx -march=ivybridge' reorders instructions in the inner loop in such way that _mm256_blendv_pd(minDist_vh, minDist_v, le_v) appears after
minK_v = _mm256_blendv_pd(minK_v, k_v, le_v);
k_v = _mm256_add_pd(k_v, kInc_v);

That makes the dependency chain problem on Haswell much harder. Strangely, it only hurts Haswell. On IvyB it is actually a little faster than the order in which I coded it.

'gcc -O1 -mavx -march=ivybridge' does more or less what I told it to do.
As I said above, on IvyB it makes things a little slower (~4M clocks slower), but on HSWL it runs more than a little faster (~20M clocks faster).


>
> Replacing the second mul and add with FMA will ease up the port 1 contention, because FMA can be executed on port 0, but my IvyB has no FMA, so I can't test it.
>

With -O3 it makes no difference.
With -O1 it improves performance just a little over the original with '-O1 -march=ivybridge'. Approximately 1M clocks - quite disappointing, less than 0.1 clocks per iteration.


> Another opportunity in the spirit of "work harder not smarter" is shortening a latency chain by means of replacing "easy" single-clock
> minDist_v = _mm256_blendv_pd(minDist_v, dist_v, le_v);
> with "harder" 3-clock
> minDist_v = _mm256_min_pd(minDist_v, dist_v);
>
> On my IvyB it gives a nice speed up. The resulting branchless version is nearly equal in speed to your best "branchy".
>

On Haswell it also gives a nice speedup over original '-O3'. Approximately 13M clocks. Which still leaves it 7M clocks behind '-O1'.

But when we do both _mm256_min_pd() and _mm256_fmadd_pd() then things suddenly improve: ~25M clocks faster than the original -O3 and only 5M clocks slower than Anton's branchy variant.
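
For reference, the distance/update part of the inner loop then looks roughly like this (a sketch only; the complete functions appear later in the thread):

__m256d dist_v = _mm256_fmadd_pd(dy_v, dy_v, _mm256_mul_pd(dx_v, dx_v)); // one mul folded into the add
__m256d le_v = _mm256_cmp_pd(dist_v, minDist_v, _CMP_LE_OQ);
minDist_v = _mm256_min_pd(minDist_v, dist_v);
minK_v = _mm256_blendv_pd(minK_v, k_v, le_v);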

>
> Also, are your measurements repeatable?
> When I do several invocations in series, I see huge differences between the first invocation and the rest.
> Do you see something like that?

It turns out that I almost certainly see an effect of power management on my home computer. At work, where I have power saving turned off, all measurements are very stable.


already...@yahoo.com

Nov 17, 2016, 12:53:58 PM
On Thursday, November 17, 2016 at 6:18:53 PM UTC+2, already...@yahoo.com wrote:
> On Thursday, November 17, 2016 at 1:59:36 AM UTC+2, already...@yahoo.com wrote:
> >
> > One possibility is a combination of contention over execution port 1 (5 instructions per iteration) with the minDist_v latency chain:
> > __m256d le_v = _mm256_cmp_pd(dist_v, minDist_v, _CMP_LE_OQ);
> > minDist_v = _mm256_blendv_pd(minDist_v, dist_v, le_v);
> > Theoretically, a latency chain is only 4 clocks long, but with contention it could become worse.
>
> It turns out I am on the right track, but the compiler's intervention further complicates things.
>
> 'gcc -O3 -mavx -march=ivybridge' reorders instructions in the inner loop in such a way that _mm256_blendv_pd(minDist_v, dist_v, le_v) appears after
> minK_v = _mm256_blendv_pd(minK_v, k_v, le_v);
> k_v = _mm256_add_pd(k_v, kInc_v);
>
> That makes the dependency chain problem on Haswell much harder. Strangely, it only hurts Haswell. On IvyB it is actually a little faster than the order in which I coded it.
>
> 'gcc -O1 -mavx -march=ivybridge' does more or less what I told it to do.
> As I said above, on IvyB it makes things a little slower (~4M clocks slower), but on HSWL it runs more than a little faster (~20M clocks faster).
>
>
> >
> > Replacing the second mul and add with FMA will ease up the port 1 contention, because FMA can be executed on port 0, but my IvyB has no FMA, so I can't test it.
> >
>
> With -O3 it makes no difference.
> With -O1 it improves performance just a little over the original with '-O1 -march=ivybridge'. Approximately 1M clocks - quite disappointing, less than 0.1 clocks per iteration.
>
>
> > Another opportunity in the spirit of "work harder not smarter" is shortening a latency chain by means of replacing "easy" single-clock
> > minDist_v = _mm256_blendv_pd(minDist_v, dist_v, le_v);
> > with "harder" 3-clock
> > minDist_v = _mm256_min_pd(minDist_v, dist_v);
> >
> > On my IvyB it gives a nice speed up. The resulting branchless version is nearly equal in speed to your best "branchy".
> >
>
> On Haswell it also gives a nice speedup over original '-O3'. Approximately 13M clocks. Which still leaves it 7M clocks behind '-O1'.
>
> But when we do both _mm256_min_pd() and _mm256_fmadd_pd() then things suddenly improve: ~25M clocks faster than the original -O3 and only 5M clocks slower than Anton's branchy variant.
>

The next round of tweaks starts with a question: if we are already using FMA(3), then why can't we use AVX2 as well? AFAIK, all Intel processors that support FMA3 also support AVX2.
So, hello, AVX2! This is the first time I am coding for you!
The only change over the previous version is that the k_v update now uses integer arithmetic. One more instruction off the back of port 1.

And it helps to shave 11M clocks off the execution time! Enough for the branchless variant to become faster than the branchy one.
So, so far, our winner on Haswell looks like this:

// tourx and toury aligned on 32-byte boundary
void tsp(point cities[], double tourx[], double toury[], int ncities)
{
  int i,kk;
  tourx[0] = cities[ncities-1].x;
  toury[0] = cities[ncities-1].y;
  for (i=1; i<ncities; i++) {
    tourx[i]=cities[i-1].x;
    toury[i]=cities[i-1].y;
  }

  for (i=1; i<ncities; i++) {
    double ThisX = tourx[i-1];
    double ThisY = toury[i-1];
    double CloseDist = DBL_MAX;
    int min_k = i;
    int k = i;
    if (ncities - k >= 8) {
      // scalar prologue
      int nlast = (k + 3) & (-4);
      for (; k < nlast; ++k) {
        double ThisDist = sqr(tourx[k]-ThisX)+sqr(toury[k]-ThisY);
        if (ThisDist <= CloseDist) {
          CloseDist = ThisDist;
          min_k = k;
        }
      }

      // main loop
      int ni = (ncities - k) / 4;
      __m256d thisx_v = _mm256_set1_pd(ThisX);
      __m256d thisy_v = _mm256_set1_pd(ThisY);
      __m256d minDist_v = _mm256_set1_pd(CloseDist);
      __m256d minK_v = _mm256_castsi256_pd(_mm256_set1_epi64x(min_k)); // __m256i really, just defined as __m256d to reduce # of casts
      __m256i k_v = _mm256_setr_epi64x((k+0), (k+1), (k+2), (k+3));
      __m256i kInc_v = _mm256_set1_epi64x(4);
      const __m256d* tourx_v = (const __m256d*)&tourx[k];
      const __m256d* toury_v = (const __m256d*)&toury[k];
      k += ni*4;
      for (kk = 0; kk < ni; ++kk) {
        __m256d dx_v = _mm256_sub_pd(tourx_v[kk], thisx_v);
        __m256d dy_v = _mm256_sub_pd(toury_v[kk], thisy_v);
        _mm_prefetch(&tourx_v[kk+10], 0);
        _mm_prefetch(&toury_v[kk+10], 0);
        __m256d dist_v = _mm256_fmadd_pd(dy_v, dy_v, _mm256_mul_pd(dx_v, dx_v));
        __m256d le_v = _mm256_cmp_pd(dist_v, minDist_v, _CMP_LE_OQ);
        minDist_v = _mm256_min_pd(minDist_v, dist_v);
        minK_v = _mm256_blendv_pd(minK_v, _mm256_castsi256_pd(k_v), le_v);
        k_v = _mm256_add_epi64(k_v, kInc_v);
      }
      // fold
      __m256d minDist_vh = _mm256_permute2f128_pd(minDist_v, minDist_v, 1);
      __m256d minK_vh = _mm256_permute2f128_pd(minK_v, minK_v, 1);
      __m256d le_v = _mm256_cmp_pd(minDist_v, minDist_vh, _CMP_LE_OQ);
      minDist_v = _mm256_blendv_pd(minDist_vh, minDist_v, le_v);
      minK_v = _mm256_blendv_pd(minK_vh, minK_v, le_v);
      minDist_vh = _mm256_permute_pd(minDist_v, 1);
      minK_vh = _mm256_permute_pd(minK_v, 1);
      le_v = _mm256_cmp_pd(minDist_v, minDist_vh, _CMP_LE_OQ);
      minDist_v = _mm256_blendv_pd(minDist_vh, minDist_v, le_v);
      minK_v = _mm256_blendv_pd(minK_vh, minK_v, le_v);

      _mm_store_sd(&CloseDist, _mm256_castpd256_pd128(minDist_v));
      uint64_t min_kd;
      memcpy(&min_kd, &minK_v, sizeof(uint64_t));
      min_k = (int)min_kd;
    }

    // scalar epilogue
    for (; k < ncities; ++k) {
      double ThisDist = sqr(tourx[k]-ThisX)+sqr(toury[k]-ThisY);
      if (ThisDist <= CloseDist) {
        CloseDist = ThisDist;
        min_k = k;
      }
    }
    swap(&tourx[i],&tourx[min_k]);
    swap(&toury[i],&toury[min_k]);
  }
}

For reference, here is the best branchless variant for IvyB (on IvyB it runs at about the same speed as the branchy one, but it sucks on HSWL).

// tourx and toury aligned on 32-byte boundary
void tsp(point cities[], double tourx[], double toury[], int ncities)
{
  tourx[0] = cities[ncities-1].x;
  toury[0] = cities[ncities-1].y;
  for (int i=1; i<ncities; i++) {
    tourx[i]=cities[i-1].x;
    toury[i]=cities[i-1].y;
  }

  for (int i=1; i<ncities; i++) {
    double ThisX = tourx[i-1];
    double ThisY = toury[i-1];
    double CloseDist = DBL_MAX;
    int min_k = i;
    int k = i;
    if (ncities - k >= 8) {
      // scalar prologue
      int nlast = (k + 3) & (-4);
      for (; k < nlast; ++k) {
        double ThisDist = sqr(tourx[k]-ThisX)+sqr(toury[k]-ThisY);
        if (ThisDist <= CloseDist) {
          CloseDist = ThisDist;
          min_k = k;
        }
      }

      // main loop
      int ni = (ncities - k) / 4;
      __m256d thisx_v = _mm256_set1_pd(ThisX);
      __m256d thisy_v = _mm256_set1_pd(ThisY);
      __m256d minDist_v = _mm256_set1_pd(CloseDist);
      __m256d minK_v = _mm256_set1_pd((double)min_k);
      __m256d k_v = _mm256_setr_pd((double)(k+0), (double)(k+1), (double)(k+2), (double)(k+3));
      __m256d kInc_v = _mm256_set1_pd((double)4);
      const __m256d* tourx_v = (const __m256d*)&tourx[k];
      const __m256d* toury_v = (const __m256d*)&toury[k];
      k += ni*4;
      for (int kk = 0; kk < ni; ++kk) {
        __m256d dx_v = _mm256_sub_pd(tourx_v[kk], thisx_v);
        __m256d dy_v = _mm256_sub_pd(toury_v[kk], thisy_v);
        _mm_prefetch(&tourx_v[kk+10], 0);
        _mm_prefetch(&toury_v[kk+10], 0);
        __m256d dist_v = _mm256_add_pd(_mm256_mul_pd(dx_v, dx_v),_mm256_mul_pd(dy_v, dy_v));
        __m256d le_v = _mm256_cmp_pd(dist_v, minDist_v, _CMP_LE_OQ);
        minDist_v = _mm256_min_pd(minDist_v, dist_v);
        minK_v = _mm256_blendv_pd(minK_v, k_v, le_v);
        k_v = _mm256_add_pd(k_v, kInc_v);
      }
      // fold
      __m256d minDist_vh = _mm256_permute2f128_pd(minDist_v, minDist_v, 1);
      __m256d minK_vh = _mm256_permute2f128_pd(minK_v, minK_v, 1);
      __m256d le_v = _mm256_cmp_pd(minDist_v, minDist_vh, _CMP_LE_OQ);
      minDist_v = _mm256_blendv_pd(minDist_vh, minDist_v, le_v);
      minK_v = _mm256_blendv_pd(minK_vh, minK_v, le_v);
      minDist_vh = _mm256_permute_pd(minDist_v, 1);
      minK_vh = _mm256_permute_pd(minK_v, 1);
      le_v = _mm256_cmp_pd(minDist_v, minDist_vh, _CMP_LE_OQ);
      minDist_v = _mm256_blendv_pd(minDist_vh, minDist_v, le_v);
      minK_v = _mm256_blendv_pd(minK_vh, minK_v, le_v);

      _mm_store_sd(&CloseDist, _mm256_castpd256_pd128(minDist_v));
      double min_kd;
      _mm_store_sd(&min_kd, _mm256_castpd256_pd128(minK_v));
      min_k = (int)min_kd;
    }

    // scalar epilogue (same as in the Haswell variant above)
    for (; k < ncities; ++k) {
      double ThisDist = sqr(tourx[k]-ThisX)+sqr(toury[k]-ThisY);
      if (ThisDist <= CloseDist) {
        CloseDist = ThisDist;
        min_k = k;
      }
    }
    swap(&tourx[i],&tourx[min_k]);
    swap(&toury[i],&toury[min_k]);
  }
}

Melzzzzz

Nov 17, 2016, 2:52:05 PM
On Thu, 17 Nov 2016 09:53:46 -0800 (PST)
already...@yahoo.com wrote:

> _mm_store_sd(&CloseDist, _mm256_castpd256_pd128(minDist_v));
> uint64_t min_kd;
> memcpy(&min_kd, &minK_v, sizeof(uint64_t));
> min_k = (int)min_kd;
> }
You could use _mm_extract_epi64 in a similar way.
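
For example (a sketch, assuming SSE4.1 is available, which holds for any FMA/AVX2 part; minK_v is the __m256d-typed index vector from your listing):

// reinterpret the low 128 bits as integers and extract lane 0
uint64_t min_kd = (uint64_t)_mm_extract_epi64(
    _mm256_castsi256_si128(_mm256_castpd_si256(minK_v)), 0);
min_k = (int)min_kd;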

already...@yahoo.com

Nov 18, 2016, 8:12:05 AM
And now back to the branchy variant, where we had low-hanging fruit: alignment.
Here is a branchy variant that has all accesses in the main loop aligned on a 32-byte boundary (AVX-only, so it runs fine on my home IvyB).

static __inline
double _mm256_castpd256_sd(__m256d x) {
  double y;
  _mm_store_sd(&y, _mm256_castpd256_pd128(x));
  return y;
}

// tourx and toury aligned on 32-byte boundary.
// Alignment is not necessary for correctness, but very important for execution speed
// Both tourx[] and toury[] should have space for 4 guard values at the end
void tsp(point cities[], double tourx[], double toury[], int ncities)
{
  tourx[0] = cities[ncities-1].x;
  toury[0] = cities[ncities-1].y;
  for (int i=1; i<ncities; i++) {
    tourx[i]=cities[i-1].x;
    toury[i]=cities[i-1].y;
  }
  for (int i=0; i<4; i++) {
    tourx[ncities+i]=0;
    toury[ncities+i]=0;
  }

  for (int i=1; i<ncities; i++) {
    // guard
    tourx[ncities]=tourx[i-1];
    toury[ncities]=toury[i-1];

    __m256d thisx_v = _mm256_broadcast_sd(&tourx[i-1]);
    __m256d thisy_v = _mm256_broadcast_sd(&toury[i-1]);

    int min_k = i;
    int k = i;
    __m256d dist_v;

    double dist_va[4];
    dist_va[0] = DBL_MAX;
    if ((k & 3) != 0) {
      // align k on 4x boundary
      __m256d x = _mm256_loadu_pd(&tourx[k]);
      __m256d y = _mm256_loadu_pd(&toury[k]);
      __m256d dx_v = _mm256_sub_pd(x, thisx_v);
      __m256d dy_v = _mm256_sub_pd(y, thisy_v);
      dist_v = _mm256_add_pd(_mm256_mul_pd(dx_v, dx_v),_mm256_mul_pd(dy_v, dy_v));
      _mm256_storeu_pd(dist_va, dist_v);
      double minDist = DBL_MAX;
      for (; (k & 3) != 0; ++k) {
        if (dist_va[k-i] <= minDist) {
          minDist = dist_va[k-i];
          if (k >= ncities)
            break;
          min_k = k;
        }
      }
      dist_va[0] = minDist;
    }

    double *px = &tourx[k];
    double *py = &toury[k];
    __m256d minDist_v = _mm256_broadcast_sd(&dist_va[0]);
    while (_mm256_castpd256_sd(minDist_v) != 0) // finish when found two cities in the same place
    {
      int le_msk;
      do {
        __m256d x = _mm256_loadu_pd(px);
        __m256d y = _mm256_loadu_pd(py);
        __m256d dx_v = _mm256_sub_pd(x, thisx_v);
        __m256d dy_v = _mm256_sub_pd(y, thisy_v);
        _mm_prefetch(&px[40], 0);
        _mm_prefetch(&py[40], 0);
        px += 4;
        py += 4;
        dist_v = _mm256_add_pd(_mm256_mul_pd(dx_v, dx_v),_mm256_mul_pd(dy_v, dy_v));
        __m256d le_v = _mm256_cmp_pd(dist_v, minDist_v, _CMP_LE_OQ);
        le_msk = _mm256_movemask_pd(le_v);
      } while (le_msk == 0);

      _mm256_storeu_pd(dist_va, dist_v);
      int prev_k = px - tourx - 4;
      do {
#ifdef MSVC
        long unsigned ki;
        _BitScanForward(&ki, le_msk);
#else
        int ki = __builtin_ctz(le_msk);
#endif
        minDist_v = _mm256_broadcast_sd(&dist_va[ki]);

        int nxt_k = prev_k + ki;
        if (nxt_k < ncities)
          min_k = nxt_k;

        __m256d le_v = _mm256_cmp_pd(dist_v, minDist_v, _CMP_LT_OQ); // LT rather than LE, to guarantee a forward progress
        le_msk = _mm256_movemask_pd(le_v);
      } while (le_msk != 0);
    }

    swap(&tourx[i],&tourx[min_k]);
    swap(&toury[i],&toury[min_k]);
  }
}

On IvyBridge this variant is the fastest so far - ~8M clocks faster than the best branchless AVX variant.
I haven't tested it on HSWL yet.

Anton Ertl

Nov 19, 2016, 11:01:58 AM
an...@nospicedham.mips.complang.tuwien.ac.at (Anton Ertl) writes:
>Terje Mathisen <terje.m...@nospicedham.tmsw.no> writes:
>>Even though the results of the vcmpltpd operation is 2/4 64-bit ints
>>(either 0 or -1) you can still use vpackusdw to combine two of them into
>>a single result with twice as many flags, i.e. 8 flags of 32 bits each.
...
>VPACKSSDW (AVX2) might do the trick.

Unfortunately, it does not, at least not alone. The 256-bit version
of that instruction is perverse, and I need some additional
permutation work to get the result in a form appropriate for adjusting
the pointer. What

vpackssdw %ymm9,%ymm5,%ymm5

does, given the register contents

ymm9={A,B,C,D,E,F,G,H}
ymm5={i,j,k,l,m,n,o,p}

is produce the following result:

ymm5={i,j,k,l,A,B,C,D,m,n,o,p,E,F,G,H}

Useful results would have been

ymm5={i,j,k,l,m,n,o,p,A,B,C,D,E,F,G,H} or
ymm5={A,B,C,D,E,F,G,H,i,j,k,l,m,n,o,p}

So I need to insert a "vpermq $216, %ymm5, %ymm5" afterwards (magic
incantation from
<https://software.intel.com/en-us/blogs/2015/01/13/programming-using-avx2-permutations>).
Here's the inner loop (plus exit code) afterwards:

400e02: c5 cd 5c 6f e0 vsubpd -0x20(%rdi),%ymm6,%ymm5
400e07: c4 21 45 5c 44 1f e0 vsubpd -0x20(%rdi,%r11,1),%ymm7,%ymm8
400e0e: 48 83 ef 40 sub $0x40,%rdi
400e12: c5 d5 59 ed vmulpd %ymm5,%ymm5,%ymm5
400e16: c4 41 3d 59 c0 vmulpd %ymm8,%ymm8,%ymm8
400e1b: 0f 18 8f c0 fe ff ff prefetcht0 -0x140(%rdi)
400e22: 42 0f 18 8c 1f c0 fe prefetcht0 -0x140(%rdi,%r11,1)
400e29: ff ff
400e2b: c4 c1 55 58 e8 vaddpd %ymm8,%ymm5,%ymm5
400e30: c5 55 c2 cc 01 vcmpltpd %ymm4,%ymm5,%ymm9
400e35: c5 cd 5c 2f vsubpd (%rdi),%ymm6,%ymm5
400e39: c4 21 45 5c 04 1f vsubpd (%rdi,%r11,1),%ymm7,%ymm8
400e3f: c5 d5 59 ed vmulpd %ymm5,%ymm5,%ymm5
400e43: c4 41 3d 59 c0 vmulpd %ymm8,%ymm8,%ymm8
400e48: c4 c1 55 58 e8 vaddpd %ymm8,%ymm5,%ymm5
400e4d: c5 d5 c2 ec 01 vcmpltpd %ymm4,%ymm5,%ymm5
400e52: c4 c1 55 6b e9 vpackssdw %ymm9,%ymm5,%ymm5
400e57: c5 fc 50 cd vmovmskps %ymm5,%ecx
400e5b: 48 85 c9 test %rcx,%rcx
400e5e: 74 a2 je 400e02 <tsp+0x282>
400e60: c4 e3 fd 00 ed d8 vpermq $0xd8,%ymm5,%ymm5
400e66: c5 fc 50 cd vmovmskps %ymm5,%ecx
400e6a: 48 0f bd c9 bsr %rcx,%rcx
400e6e: 48 8d 3c cf lea (%rdi,%rcx,8),%rdi
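
In intrinsics the pack-and-fixup sequence corresponds roughly to the sketch below; cmp0_v and cmp1_v are made-up names for the first and second vcmpltpd results, i.e. %ymm9 and %ymm5 above:

__m256i packed = _mm256_packs_epi32(_mm256_castpd_si256(cmp1_v),
                                    _mm256_castpd_si256(cmp0_v)); // vpackssdw: packs 32->16 bit within 128-bit lanes
packed = _mm256_permute4x64_epi64(packed, 0xd8);                  // vpermq $0xd8: undo the in-lane interleaving
int mask = _mm256_movemask_ps(_mm256_castsi256_ps(packed));       // vmovmskps: 8 flags, one per double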

This unrolling by a factor of 2 saves a bunch of instructions (only
one set of prefetches and one sub/test/je per two blocks), but the
performance is the same: 176M instructions instead of 207M, yet 92M
Haswell cycles both with and without unrolling. In addition, the
unrolled version does not work on Sandy Bridge and Ivy Bridge, because
it uses AVX2.

The two versions are:

no unrolling: http://www.complang.tuwien.ac.at/anton/lvas/effizienz/tspj.c
with unrolling: http://www.complang.tuwien.ac.at/anton/lvas/effizienz/tspk.c

already...@yahoo.com

Nov 19, 2016, 1:06:44 PM
Do you realize that "the rest" of your program is not as insignificant as it was for the first scalar version?
For example, you reported 131M clocks for "AVX hard with prefetch" on Sandy Bridge. But according to my own measurements (IvyB, I have no Sandy, but IvyB is very similar to SandyB), the tsp routine in this case takes only 70M clocks.

already...@yahoo.com

Nov 19, 2016, 1:25:32 PM
On Saturday, November 19, 2016 at 6:01:58 PM UTC+2, Anton Ertl wrote:
As expected. In fact, I don't understand why Terje was expecting any improvement. After all, the number of mispredicted branches is exactly the same.

BTW, Skylake has FP_ADD units both on port 0 and on port 1. So, I expect that on Skylake the key optimization would be alignment of data accesses; all the rest would have very little effect on the execution time. If everything is aligned, as in the version in my post from 2016-11-18, then the execution time of the tsp() routine should come very close to 3 clocks per iteration = 37.5M clocks.

already...@yahoo.com

Nov 19, 2016, 1:41:24 PM
On Saturday, November 19, 2016 at 6:01:58 PM UTC+2, Anton Ertl wrote:
The same unrolling idea can easily be implemented with plain AVX:

vmovmskps %ymm5,%ecx
vmovmskps %ymm9,%r8d
test %rcx,%r8d
je 400e02 <tsp+0x282>
shl $4, %r8d
or %r8d, %rcx
bsr %rcx,%rcx
etc ...

But that's just not a very good idea. Too much effort for too little gain.

already...@yahoo.com

Nov 19, 2016, 2:44:38 PM
Sorry, it does not work. I forgot what TEST does :(.
And I didn't pay attention to the 's' suffix in vmovmskps.

Should be:
vmovmskpd %ymm5,%ecx
vmovmskpd %ymm9,%r8d
mov %r8d, %r9d
or %ecx, %r8d
je 400e02 <tsp+0x282>
shl $4, %r9d
or %r9d, %ecx
bsr %ecx, %ecx