
SSE2-Sort within a register


Gerd Isenberg

Jan 8, 2005, 12:12:44 PM

Hi,

some kind of bubble sort with eight signed words inside one 128-bit
xmm register, using pmaxsw and pminsw instructions.

It takes about 100 cycles on amd64. Using the same algorithm with xmm
register pairs to sort 16 signed words in the same
unconditional manner takes about 375 cycles.

Here is some sample code using MSVC inline assembly:

#define CACHE_LINE 64 /* assumed cache-line size; definition not shown in the post */
#define CACHE_ALIGN __declspec(align(CACHE_LINE))

void sortXMM8Short(short int *val /*xmm aligned*/ )
{
static const __int64 CACHE_ALIGN consts[10] =
{
0xffff0000ffff0000, 0xffff0000ffff0000,
0x0000ffff0000ffff, 0x0000ffff0000ffff,
0x000000000000ffff, 0xffff000000000000,
0x0000ffff00000000, 0x0000ffff0000ffff,
0xffff0000ffff0000, 0x00000000ffff0000,
};
__asm
{
mov edx, [val] ; get the pointer
lea eax, [consts]
movdqa xmm0, [edx] ; fetch 8 shorts
mov ecx, 4 ; number of runs
align 16
swaprun:
// conditional swap of 4 even pairs 0-1,2-3,4-5,6-7
movdqa xmm1, xmm0
movdqa xmm2, xmm0
pslldq xmm0, 2 ; one word left
psrldq xmm1, 2 ; one word right
pmaxsw xmm0, xmm2
pminsw xmm1, xmm2
pand xmm0, [eax+0*16]; ffff0000ffff0000:ffff0000ffff0000 max
pand xmm1, [eax+1*16]; 0000ffff0000ffff:0000ffff0000ffff min
por xmm0, xmm1 ; combine even max:min
// conditional swap of 3 odd pairs 1-2,3-4,5-6
movdqa xmm1, xmm0
movdqa xmm2, xmm0
pslldq xmm0, 2
psrldq xmm1, 2
pmaxsw xmm0, xmm2
pminsw xmm1, xmm2
pand xmm2, [eax+2*16]; ffff000000000000:000000000000ffff
pand xmm0, [eax+3*16]; 0000ffff0000ffff:0000ffff00000000 max
pand xmm1, [eax+4*16]; 00000000ffff0000:ffff0000ffff0000 min
por xmm0, xmm2 ; combine odd pairs and outer words
por xmm0, xmm1
dec ecx ; next run ?
jnz swaprun
movdqa [edx], xmm0 ; write sorted array
}
}
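
For compilers without inline assembly, roughly the same odd-even
transposition can be written with SSE2 intrinsics. A sketch along the
same lines - the blend masks correspond to the consts table above, but
this translation is mine and untested, so timings may differ:

#include <emmintrin.h>

void sortXMM8ShortC(short int *val /*xmm aligned*/ )
{
    /* keepOdd selects words 1,3,5,7 - the max slots of the even pairs */
    const __m128i keepOdd    = _mm_set_epi16(-1, 0, -1, 0, -1, 0, -1, 0);
    /* in the odd phase, words 0 and 7 pass through unchanged */
    const __m128i keepOuter  = _mm_set_epi16(-1, 0, 0, 0, 0, 0, 0, -1);
    const __m128i keepOddMax = _mm_set_epi16( 0, -1, 0, -1, 0, -1, 0, 0); /* words 2,4,6 */
    const __m128i keepOddMin = _mm_set_epi16( 0, 0, -1, 0, -1, 0, -1, 0); /* words 1,3,5 */
    __m128i v = _mm_load_si128((const __m128i *)val);
    int run;

    for (run = 0; run < 4; run++) {
        /* even phase: compare-exchange pairs (0,1),(2,3),(4,5),(6,7) */
        __m128i up   = _mm_slli_si128(v, 2);          /* word i-1 in slot i */
        __m128i down = _mm_srli_si128(v, 2);          /* word i+1 in slot i */
        __m128i mx   = _mm_max_epi16(up, v);
        __m128i mn   = _mm_min_epi16(down, v);
        v = _mm_or_si128(_mm_and_si128(mx, keepOdd),
                         _mm_andnot_si128(keepOdd, mn));
        /* odd phase: compare-exchange pairs (1,2),(3,4),(5,6) */
        up   = _mm_slli_si128(v, 2);
        down = _mm_srli_si128(v, 2);
        mx   = _mm_max_epi16(up, v);
        mn   = _mm_min_epi16(down, v);
        v = _mm_or_si128(_mm_and_si128(v, keepOuter),
            _mm_or_si128(_mm_and_si128(mx, keepOddMax),
                         _mm_and_si128(mn, keepOddMin)));
    }
    _mm_store_si128((__m128i *)val, v);
}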

Gerd

Terje Mathisen

Jan 9, 2005, 7:55:29 AM

Gerd Isenberg wrote:

> Hi,
>
> some kind of bubble sort with eight signed words inside one 128-bit
> xmm register, using pmaxsw and pminsw instructions.
>
> It takes about 100 cycles on amd64. Using the same algorithm with xmm
> register pairs to sort 16 signed words in the same
> unconditional manner takes about 375 cycles.

Neat code, although with just 16/32 bytes of data, L1 cache accesses
don't cost much either. :-)

Terje

--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Matt

Jan 10, 2005, 12:18:07 AM

"Gerd Isenberg" <spam...@crayne.org> wrote in message
news:ad222fb7.05010...@posting.google.com...

> Hi,
>
> some kind of bubble sort with eight signed words inside one 128-bit
> xmm register, using pmaxsw and pminsw instructions.
>
> It takes about 100 cycles on amd64. Using the same algorithm with xmm
> register pairs to sort 16 signed words in the same
> unconditional manner takes about 375 cycles.
[...]

I timed your algorithm and got precisely 101 cycles without writing the
result back to memory, so 100 is probably right on the money. My
latency-measurement code has a systematic bias of +/- 2 cycles.

Clever to use SSE for that, but it is still possible to do much better than
this using GP registers. The SIMD instruction sets are nearly worthless
unless you are writing specific graphics algorithms; the latency is
generally too high to be tolerable. My first attempt with purely GP regs
yields 62 cycles on an Opteron. I have not bothered to optimize it and have
instead kept it as straightforward as possible. A straightforward
implementation with 16 shorts would sort a pair of 8s (62*2 = 124 cycles)
and then merge sort the two arrays, as sketched below.
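
For the merge step, something straightforward like this would do (a
sketch only; a real version would be made branchless along the same
lines as the sort):

void Merge8x8(const short *a, const short *b, short *out)
{
    int i = 0, j = 0, k = 0;
    /* standard two-way merge of two sorted arrays of 8 shorts */
    while (i < 8 && j < 8)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < 8) out[k++] = a[i++];
    while (j < 8) out[k++] = b[j++];
}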

A simple implementation of bubble sort in C gave ~336 cycles. (This figure
is not hard & fast because the branches are unpredictable, and I could not
coerce Visual C++ into generating a branchless inner loop.)
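
The baseline was along these lines (a sketch, not my exact test code):

void BubbleSort8(short *val)
{
    int i, j;
    for (i = 7; i > 0; i--) {
        for (j = 0; j < i; j++) {
            if (val[j] > val[j + 1]) {    /* unpredictable branch */
                short t    = val[j];
                val[j]     = val[j + 1];
                val[j + 1] = t;
            }
        }
    }
}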

I suspect that optimizing the GP code below could easily squeeze out another
10-20% (maybe even ~40%?), but I won't bother unless someone here is
interested in the numbers.

short sort_val[8]; /* the array being sorted; declaration assumed, not shown in the post */

void SortInt(void)
{
_asm
{
; Convention is to write vectors as <high, low>. So, for example:
; 0x00010002 in memory is <0x0002, 0x0001>.
;
; Input vector is <i7, i6, i5, i4, i3, i2, i1, i0>
; Output vector is <o7, o6, o5, o4, o3, o2, o1, o0>

; Establish relationship eax[1] >= eax[0]
; Therefore eax[1] => {o7, o6, o5, o4, o3, o2, o1}
; eax[0] => {o6, o5, o4, o3, o2, o1, o0}
mov eax, [sort_val] ; eax = <i1, i0>
mov edx, eax ; edx = <i1, i0>
rol eax, 16 ; eax = <i0, i1>
cmp edx, eax
cmovg eax, edx ; eax[1] >= eax[0]

; Establish relationship ecx[1] >= ecx[0]
; Therefore ecx[1] => {o7, o6, o5, o4, o3, o2, o1}
; ecx[0] => {o6, o5, o4, o3, o2, o1, o0}
mov ecx, [sort_val+4] ; ecx = <i3, i2>
mov ebx, ecx ; ebx = <i3, i2>
rol ecx, 16 ; ecx = <i2, i3>
cmp ebx, ecx
cmovg ecx, ebx ; ecx[1] >= ecx[0]

; Establish relationship esi[1] >= esi[0]
; Therefore esi[1] => {o7, o6, o5, o4, o3, o2, o1}
; esi[0] => {o6, o5, o4, o3, o2, o1, o0}
mov esi, [sort_val+8] ; esi = <i5, i4>
mov edx, esi ; edx = <i5, i4>
rol esi, 16 ; esi = <i4, i5>
cmp edx, esi
cmovg esi, edx ; esi[1] >= esi[0]

; Establish relationship edi[1] >= edi[0]
; Therefore edi[1] => {o7, o6, o5, o4, o3, o2, o1}
; edi[0] => {o6, o5, o4, o3, o2, o1, o0}
mov edi, [sort_val+12] ; edi = <i7, i6>
mov ebx, edi ; ebx = <i7, i6>
rol edi, 16 ; edi = <i6, i7>
cmp ebx, edi
cmovg edi, ebx ; edi[1] >= edi[0]

; Establish relationship ecx[1] >= eax[1] >= eax[0]
; Therefore ecx[1] => {o7, o6, o5, o4, o3, o2}
; ecx[0] => {o6, o5, o4, o3, o2, o1, o0}
; eax[1] => {o6, o5, o4, o3, o2, o1}
; eax[0] => {o5, o4, o3, o2, o1, o0}
mov edx, ecx ; edx = <ecx[1], ecx[0]>
cmp eax, ecx
cmovg ecx, eax ; Conditional swap
cmovg eax, edx ; ecx[1] >= eax[1]

; Establish relationship ecx[1] >= ecx[0] >= eax[0]
; Therefore ecx[1] => {o7, o6, o5, o4, o3}
; ecx[0] => {o6, o5, o4, o3, o2, o1}
; eax[1] => {o6, o5, o4, o3, o2, o1}
; eax[0] => {o4, o3, o2, o1, o0}
mov edx, ecx ; edx = <ecx[1], ecx[0]>
cmp ax, cx
cmovg cx, ax ; Conditional swap
cmovg ax, dx ; ecx[1] >= eax[1] >= eax[0]

; Establish relationship ecx[1] >= ecx[0] >= eax[1] >= eax[0]
; Therefore ecx[1] => {o7, o6, o5, o4, o3}
; ecx[0] => {o6, o5, o4, o3, o2}
; eax[1] => {o5, o4, o3, o2, o1}
; eax[0] => {o4, o3, o2, o1, o0}
rol eax, 16 ; eax = <eax[0], eax[1]>
mov edx, ecx ; edx = <ecx[1], ecx[0]>
cmp ax, cx
cmovg cx, ax ; Conditional swap
cmovg ax, dx ; ecx[1] >= ecx[0] >= eax[0] >= eax[1]
rol eax, 16 ; ecx[1] >= ecx[0] >= eax[1] >= eax[0]

; Establish relationship ecx[1] >= eax[1] >= eax[0]
; Therefore ecx[1] => {o7, o6, o5, o4, o3, o2}
; ecx[0] => {o6, o5, o4, o3, o2, o1, o0}
; eax[1] => {o6, o5, o4, o3, o2, o1}
; eax[0] => {o5, o4, o3, o2, o1, o0}
mov edx, ecx ; edx = <ecx[1], ecx[0]>
cmp eax, ecx
cmovg ecx, eax ; Conditional swap
cmovg eax, edx ; ecx[1] >= eax[1]

; Establish relationship ecx[1] >= ecx[0] >= eax[0]
; Therefore ecx[1] => {o7, o6, o5, o4, o3}
; ecx[0] => {o6, o5, o4, o3, o2, o1}
; eax[1] => {o6, o5, o4, o3, o2, o1}
; eax[0] => {o4, o3, o2, o1, o0}
mov edx, ecx ; edx = <ecx[1], ecx[0]>
cmp ax, cx
cmovg cx, ax ; Conditional swap
cmovg ax, dx ; ecx[1] >= eax[1] >= eax[0]

; Establish relationship ecx[1] >= ecx[0] >= eax[1] >= eax[0]
; Therefore ecx[1] => {o7, o6, o5, o4, o3}
; ecx[0] => {o6, o5, o4, o3, o2}
; eax[1] => {o5, o4, o3, o2, o1}
; eax[0] => {o4, o3, o2, o1, o0}
rol eax, 16 ; eax = <eax[0], eax[1]>
mov edx, ecx ; edx = <ecx[1], ecx[0]>
cmp ax, cx
cmovg cx, ax ; Conditional swap
cmovg ax, dx ; ecx[1] >= ecx[0] >= eax[0] >= eax[1]
rol eax, 16 ; ecx[1] >= ecx[0] >= eax[1] >= eax[0]

; Establish relationship edi[1] >= esi[1] >= esi[0]
; Therefore edi[1] => {o7, o6, o5, o4, o3, o2}
; edi[0] => {o6, o5, o4, o3, o2, o1, o0}
; esi[1] => {o6, o5, o4, o3, o2, o1}
; esi[0] => {o5, o4, o3, o2, o1, o0}
mov ebx, edi ; ebx = <edi[1], edi[0]>
cmp esi, edi
cmovg edi, esi ; Conditional swap
cmovg esi, ebx ; edi[1] >= esi[1]

; Establish relationship edi[1] >= edi[0] >= esi[0]
; Therefore edi[1] => {o7, o6, o5, o4, o3}
; edi[0] => {o6, o5, o4, o3, o2, o1}
; esi[1] => {o6, o5, o4, o3, o2, o1}
; esi[0] => {o4, o3, o2, o1, o0}
mov ebx, edi ; ebx = <edi[1], edi[0]>
cmp si, di
cmovg di, si ; Conditional swap
cmovg si, bx ; edi[1] >= esi[1] >= esi[0]

; Establish relationship edi[1] >= edi[0] >= esi[1] >= esi[0]
; Therefore edi[1] => {o7, o6, o5, o4, o3}
; edi[0] => {o6, o5, o4, o3, o2}
; esi[1] => {o5, o4, o3, o2, o1}
; esi[0] => {o4, o3, o2, o1, o0}
rol esi, 16 ; esi = <esi[0], esi[1]>
mov ebx, edi ; ebx = <edi[1], edi[0]>
cmp si, di
cmovg di, si ; Conditional swap
cmovg si, bx ; edi[1] >= edi[0] >= esi[0] >= esi[1]
rol esi, 16 ; edi[1] >= edi[0] >= esi[1] >= esi[0]

#define FINAL_SWAP _asm mov edx, edi    /* edx = <edi[1], edi[0]> */ \
                   _asm mov ebx, esi    /* ebx = <esi[1], esi[0]> */ \
                   _asm cmp ecx, edi \
                   _asm cmovg edi, ecx  /* Conditional swap */ \
                   _asm cmovg esi, eax  /* edi[1] >= edi[0] >= esi[1] >= esi[0] */ \
                   _asm cmovg ecx, edx  /* edi[1] >= ecx[1] */ \
                   _asm cmovg eax, ebx  /* ecx[1] >= ecx[0] >= eax[1] >= eax[0] */

#define FINAL_ROT  _asm shl edi, 16     /* edi[1] = edi[0] */ \
                   _asm rol esi, 16     /* edi[1] >= esi[0] >= esi[1] */ \
                   _asm mov di, si      /* edi[1] >= edi[0] >= esi[1] */ \
                   _asm mov si, 0x8000  /* edi[1] >= edi[0] >= esi[1] >= MIN_INT */

; Conditionally swaps edi:esi and ecx:eax to compute edi[1] = max(edi[1], ecx[1])
; The register pairs are swapped so as to maintain the invariant:
; edi[1] >= edi[0] >= esi[1] >= esi[0]
; ecx[1] >= ecx[0] >= eax[1] >= eax[0]
FINAL_SWAP

; Store edi[1] to sort_val[7]
mov [sort_val+12], edi

; Does a circular swap:
; edi[1] = edi[0]
; edi[0] = esi[1]
; esi[1] = esi[0]
; Stores MIN_INT into esi[0]
FINAL_ROT

; Repeat process seven times to store all 8 words
FINAL_SWAP
mov edx, edi
shr edx, 16
mov [sort_val+12], dx
FINAL_ROT

FINAL_SWAP
mov [sort_val+8], edi
FINAL_ROT

FINAL_SWAP
mov edx, edi
shr edx, 16
mov [sort_val+8], dx
FINAL_ROT

FINAL_SWAP
mov [sort_val+4], edi
FINAL_ROT

FINAL_SWAP
mov edx, edi
shr edx, 16
mov [sort_val+4], dx
FINAL_ROT

FINAL_SWAP
mov [sort_val], edi
FINAL_ROT

FINAL_SWAP
shr edi, 16
mov [sort_val], di
}
}

-Matt

Gerd Isenberg

Jan 10, 2005, 1:30:57 PM

Hi Matt,

thanks for pointing out the GP code. Yes, SSE2 is almost too slow -
OTOH one saves GP resources by using xmm registers - and there is
some hope with future 128-bit ALUs.

Even if there is no practical application for sorting, e.g. moves inside my
chess program, there are probably some other practical applications
of SSE2 that are harder to beat with GP code.

One is to pick the best move out of a move list with up to 64 scored
moves. Moves and their scores are kept in separate arrays. Scores are
in [-512..+511], with the 6 lsbs of each entry containing the index itself.
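
A sketch of how that pick could look with SSE2 intrinsics - assuming
each 16-bit entry is (score << 6) | index; the function name and exact
layout here are just illustrative:

#include <emmintrin.h>

int pickBest64(const short *scores /* 64 entries, xmm aligned */)
{
    __m128i best = _mm_load_si128((const __m128i *)scores);
    int i;

    /* vertical max over all eight vectors of 8 words */
    for (i = 1; i < 8; i++)
        best = _mm_max_epi16(best, _mm_load_si128((const __m128i *)scores + i));

    /* horizontal max of the remaining 8 words */
    best = _mm_max_epi16(best, _mm_srli_si128(best, 8));
    best = _mm_max_epi16(best, _mm_srli_si128(best, 4));
    best = _mm_max_epi16(best, _mm_srli_si128(best, 2));

    return _mm_cvtsi128_si32(best) & 63;  /* 6 lsbs = index of best move */
}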

Other samples are dot products of bit/bool[64]*byte[64] or
bit/bool[64]*short[64]. It is very neat to expand or sign-extend 64
bits, a bitboard, to 64 bytes in four xmm registers.
Multiplication with the -1|0 bytes is then performed by a simple "and"
mask, followed by some vertical and horizontal (PSADBW) adds.
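
A sketch of the byte case - this is my own expansion sequence, assuming
unsigned weights (PSADBW sums unsigned bytes); the actual code may differ:

#include <emmintrin.h>

int dotBitsBytes(unsigned __int64 bb, const unsigned char *w /* 64, xmm aligned */)
{
    /* one bit-select mask per byte lane: 1,2,4,...,128 repeated */
    const __m128i bits = _mm_set_epi8((char)0x80, 0x40, 0x20, 0x10, 8, 4, 2, 1,
                                      (char)0x80, 0x40, 0x20, 0x10, 8, 4, 2, 1);
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = zero;
    int i;

    for (i = 0; i < 4; i++) {
        unsigned int chunk = (unsigned int)(bb >> (16 * i)) & 0xffff;
        __m128i lo = _mm_set1_epi8((char)(chunk & 0xff)); /* bits 0..7  */
        __m128i hi = _mm_set1_epi8((char)(chunk >> 8));   /* bits 8..15 */
        __m128i b  = _mm_unpacklo_epi64(lo, hi);
        /* -1 where the bit is set, 0 elsewhere */
        __m128i m  = _mm_cmpeq_epi8(_mm_and_si128(b, bits), bits);
        __m128i v  = _mm_and_si128(m, _mm_load_si128((const __m128i *)w + i));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(v, zero));  /* PSADBW */
    }
    acc = _mm_add_epi64(acc, _mm_srli_si128(acc, 8));
    return _mm_cvtsi128_si32(acc);
}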

Gerd

Matt

Jan 10, 2005, 11:12:56 PM

"Gerd Isenberg" <spam...@crayne.org> wrote in message
news:1105348184.2...@c13g2000cwb.googlegroups.com...

> Hi Matt,
>
> thanks for pointing out the GP code. Yes, SSE2 is almost too slow -
> OTOH one saves GP resources by using xmm registers - and there is
> some hope with future 128-bit ALUs.

Yes, but once those come along we will be writing x86-64 code. Not only does
it have twice as many regs, but the sort will be even faster since one can
manipulate 4 shorts at a time.

> Even if there is no practical application for sorting, e.g. moves inside my
> chess program, there are probably some other practical applications
> of SSE2 that are harder to beat with GP code.

Yes, but very few. :-(

Even the various 3DNow! and MMX bitboard scanning routines that you had
proposed, though clever, ended up being slower. This is of course in part
because it is slow getting data in/out of the SIMD registers, and SIMD
registers are particularly difficult to work with when you may need to
branch on the data. The biggest factor, I think, is that timings on the GP
set are very tight: I can execute 3 ops/cycle with a latency of 1 cycle for
nearly all important ops (multiplies being the only exception, and still
they are quite fast). P4 makes most MMX/SSE2 instructions ridiculously
expensive; Athlon is much better, but still the latency is at best 2 cycles.
Worse still, the MMX/SSE2 instruction sets are so unorthogonal as to
exaggerate these latencies: where are pminsd and pmaxsd? What happened to
pminus and pmaxus?

> One is to pick the best move out of a move list with up to 64 scored
> moves. Moves and their scores are kept in separate arrays. Scores are
> in [-512..+511], with the 6 lsbs of each entry containing the index itself.

Wouldn't this be the purpose of your bubble sort? I suppose MMX would win
this since it can perform 8 max(a,b) computations per cycle. In 18 cycles
you have it narrowed down to 4 choices. At that point it is probably cheaper
to transfer the results to GP regs for the final computation. GP regs can do
1 max(a,b) in 2 cycles, so the remainder can be computed in less than 6
cycles. MMX would require 8 cycles for the remainder.
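
(The 2-cycle GP max(a,b) is just cmp + cmov - what a compiler emits
for a branchless expression like:

static int maxInt(int a, int b)
{
    return a > b ? a : b;   /* cmp, then cmovg - no branch */
}
)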

> Other samples are dot products of bit/bool[64]*byte[64] or
> bit/bool[64]*short[64]. It is very neat to expand or sign-extend 64
> bits, a bitboard, to 64 bytes in four xmm registers.
> Multiplication with the -1|0 bytes is then performed by a simple "and"
> mask, followed by some vertical and horizontal (PSADBW) adds.

Interesting. This would be more difficult to replicate with GP regs and
would obviously require a significant amount of memory access. The
throughput would be just under 1.5 bit->byte expansions per cycle
(shl/setc/sets/store for 2 bits at a time -- all fully overlappable) for a
total latency of about 50 cycles. A single psadbw is somewhere in the range
of 5-8 cycles, so GP regs would at least give MMX/SSE2 a run for their
money.
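
In plain C the bit->byte expansion is simply this (a sketch; the asm
version uses shl plus setc/sets to produce two mask bytes per shift):

void expandBits(unsigned __int64 bb, signed char out[64])
{
    int i;
    for (i = 0; i < 64; i++)
        out[i] = (signed char)-(int)((bb >> i) & 1);  /* 0 or -1 per bit */
}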

-Matt

Gerd Isenberg

Jan 11, 2005, 1:54:30 PM


Matt wrote:
> "Gerd Isenberg" <spam...@crayne.org> wrote in message
> news:1105348184.2...@c13g2000cwb.googlegroups.com...
> > Hi Matt,
> >
> > thanks for pointing out the GP code. Yes, SSE2 is almost too slow -
> > OTOH one saves GP resources by using xmm registers - and there is
> > some hope with future 128-bit ALUs.
>
> Yes, but once those come along we will be writing x86-64 code. Not only
> does it have twice as many regs, but the sort will be even faster since
> one can manipulate 4 shorts at a time.

Agreed. I have a 128-bit SIMD or bitboard[2] base class for the memory
layout and two derivations, one with SSE2 intrinsics and one in plain
C++. Both derivations have most operators overloaded.
For instance I can write Kogge-Stone parallel-prefix fills in C++ with
template type <XMM> or <GP>, controlling the register incarnations of
local variables. Parameter passing is done via source and target
pointers. One can mix different disjoint direction fills with different
register files. The MSC beta for AMD64, and probably other compilers as
well, has very nice scheduling abilities to mix instructions from
disjoint inlined routines. So one gets something like one double-dispatch
XMM instruction per 2-6 GP instructions. Of course this is all
branchless stuff.
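
For example, a south fill in the plain C++ variant is the standard
Kogge-Stone sequence (a sketch of the well-known formulation, not my
actual template code):

unsigned __int64 southOccludedFill(unsigned __int64 gen, unsigned __int64 empty)
{
    gen   |= empty & (gen >> 8);    /* fill 1 rank  */
    empty &= empty >> 8;
    gen   |= empty & (gen >> 16);   /* fill 2 ranks */
    empty &= empty >> 16;
    gen   |= empty & (gen >> 32);   /* fill 4 ranks */
    return gen;
}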

Yes - PSLLW, PSLLD and PSLLQ shift bitwise and accept a second register
operand, while PSLLDQ shifts bytewise and takes only an immediate
operand - and so on...
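
In intrinsics terms the asymmetry looks like this (a minimal
illustration only; the xor just combines the two results):

#include <emmintrin.h>

__m128i shiftDemo(__m128i v, int n)
{
    __m128i byRegister  = _mm_sll_epi16(v, _mm_cvtsi32_si128(n)); /* PSLLW xmm, xmm */
    __m128i byImmediate = _mm_slli_si128(v, 2);                   /* PSLLDQ xmm, 2  */
    return _mm_xor_si128(byRegister, byImmediate);
}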

I guess G5 Altivec is more orthogonal.

>
> > One is to pick the best move out of a move list with up to 64 scored
> > moves. Moves and their scores are kept in separate arrays. Scores are
> > in [-512..+511], with the 6 lsbs of each entry containing the index itself.
>
> Wouldn't this be the purpose of your bubble sort?

No - since in a lot of nodes you can expect a beta cut after picking
and trying one move, presorting all moves is a waste of time.

The strange thing is that often using twice as many MMX instructions
and MMX registers, with much more code (not exactly double the size,
due to one less prefix byte), is faster than SSE2 on amd64. That
indicates there is something to improve with SSE in the future ;-)

Btw. in Hans de Vries' paper "Understanding the detailed Architecture
of AMD's 64 bit Core" it sounds a bit more optimistic:

-----------------
http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html

1.4 128 bit SSE(2) instructions are split into Doubles
Appendix C of Opteron's Optimization Guide specifies to which class
each and every instruction belongs. Most 128 bit SSE and SSE2
instructions are implemented as double dispatch instructions. Only
those that can not be split into two independent 64 bit operations are
handled as Vector Path (Micro Code) instructions. Those SSE2
instructions that operate on only one half of a 128 bit register are
implemented as a single (Direct Path) instruction.

There are both advantages and disadvantages performance-wise here. A
disadvantage may be that the decode rate of 128 bit SSE2 instructions
is limited to 1.5 per cycle. In general however this is not a performance
limiter, because the maximum throughput is limited by the FP units and
the retirement hardware to a single 128 bit SSE instruction per cycle.
More importantly, the extra cycle of latency that a Pentium 4 style
implementation would bring is avoided.

-----------------

Gerd

Matt

Jan 11, 2005, 2:36:23 PM

"Gerd Isenberg" <spam...@crayne.org> wrote in message
[...]

> The strange thing is that often using twice as many MMX instructions
> and MMX registers, with much more code (not exactly double the size,
> due to one less prefix byte), is faster than SSE2 on amd64. That
> indicates there is something to improve with SSE in the future ;-)

This is something that I have rather come to expect; currently there is no
CPU on the market that can execute a full 128-bit SSE/SSE2 op in a single
cycle. The instructions are always split into a pair of 64-bit ops and
executed that way. Basically, on K7/K8, an SSE2 op is quite literally a pair
of MMX ops.

> Btw. in Hans de Vries paper "Understanding the detailed Architecture
> of AMD's 64 bit Core" it sounds a bit more optimistic:

[...]

I have read that, but it is not quite so optimistic, at least for K8. K8
gets better decode bandwidth through the use of Double decoding, but it will
not get 1.5 SSE/SSE2 ops per cycle. It is limited to 1 SSE/SSE2 op per cycle
and whatever else you can fit in. MMX will still be faster (as you note)
simply because you can overlap operations better for better parallelism.
Also, you have higher decode bandwidth with MMX.

There is indeed room for improvement. Perhaps sometime in the future they
will widen the SIMD ALUs to 128 bits and make SSE ops as cheap as MMX.

-Matt
