My current Python code uses lookup tables through a library written in
C. I have a 3.8 GHz Pentium 4 and get 125 Mbps on the bitwise dot
product (slow, I know). Can somebody write a population count that
uses SSE2 for me?
Many thanks if you can provide me help,
AndrewF
Ah, so _that's_ why people are interested in popcnt!
> My current Python code uses lookup tables through a library written in C.
Bletcherous!
> Can somebody write a population count that operates
> using SSE2 for me?
Not I! But there are much better algorithms for the finding.
I think the preferred algorithm is repeated folding: mask, shift &
add. Norbert wrote something nice, and IIRC AMD has something
in their optimization manual.
-- Robert
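The folding approach Robert mentions can be sketched in portable C (a scalar 32-bit version; an SSE2 variant applies the same masks across a 128-bit register):

```c
#include <stdint.h>

/* Population count by repeated folding: at each step, adjacent
   bit-fields are masked apart, shifted, and added in parallel. */
static inline uint32_t popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit field sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit field sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit field sums */
    return (x * 0x01010101u) >> 24;                   /* add the four bytes */
}
```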
i hope that's the one you are looking for. the basic function is
do_bit_count. it receives pointers to both populations, ands them
together, and counts the bits. it does not store the anded result, but
this can be added easily. the number of elements (integers) *must* be a
multiple of 4. if you can accept some further constraints it can be
optimized even further, since there are a lot of unused registers
available. how about processing 8 elements at once?
note that the variables mask_x need to be aligned to 16-byte
boundaries, so you have to take care with your compiler.
included are the flags for vc++.
the bit-count algorithm is basically the one provided by Sean Eron
Anderson at
http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
greetings from dresden, germany
andre
#include <iostream>
#include <cassert>
inline
int
do_bit_count( const unsigned int* p_ptr_1 , const unsigned int* p_ptr_2
, const unsigned int num_elems ) {
assert( p_ptr_1 && p_ptr_2 && num_elems && !(num_elems%4) );
unsigned int counter = 0;
__declspec(align(16)) unsigned int mask_1[4] =
{ 0x55555555,0x55555555,0x55555555,0x55555555 };
__declspec(align(16)) unsigned int mask_2[4] =
{ 0x33333333,0x33333333,0x33333333,0x33333333 };
__declspec(align(16)) unsigned int mask_3[4] =
{ 0x0F0F0F0F,0x0F0F0F0F,0x0F0F0F0F,0x0F0F0F0F };
__declspec(align(16)) unsigned int mask_4[4] =
{ 0x01010101,0x01010101,0x01010101,0x01010101 };
const unsigned int num_bytes = num_elems * sizeof(*p_ptr_1) - 16;
__asm {
push eax
push ecx
push edx
mov eax , dword ptr [ p_ptr_1 ]
mov edx , dword ptr [ p_ptr_2 ]
mov ecx , dword ptr [ num_bytes ]
pxor xmm7 , xmm7
begin_loop :
// load both populations and and them together
pxor xmm0 , xmm0
movdqu xmm0 , xmmword ptr [ eax + ecx ]
pxor xmm1 , xmm1
movdqu xmm1 , xmmword ptr [ edx + ecx ]
pand xmm1 , xmm0
// copy to xmm0, we continue to work on xmm1
movdqa xmm0 , xmm1
psrld xmm1 , 1
pand xmm1 , xmmword ptr [ mask_1 ]
psubd xmm0 , xmm1
movdqa xmm1 , xmm0
psrld xmm0 , 2
pand xmm1 , xmmword ptr [ mask_2 ]
pand xmm0 , xmmword ptr [ mask_2 ]
paddd xmm1 , xmm0
movdqa xmm0 , xmm1
psrld xmm1 , 4
pand xmm1 , xmmword ptr [ mask_3 ]
paddd xmm0 , xmm1
pshufd xmm1 , xmm0 , 0xf5
pmuludq xmm0 , xmmword ptr [ mask_4 ]
pmuludq xmm1 , xmmword ptr [ mask_4 ]
psrlq xmm0 , 24
psrlq xmm1 , 24
sub ecx , 16
paddq xmm7 , xmm1
paddq xmm7 , xmm0
jnc begin_loop
pshufd xmm6 , xmm7 , 0xfe
// pops must mirror the pushes (eax, ecx, edx) in reverse order
pop edx
pop ecx
pop eax
paddd xmm6 , xmm7
movd dword ptr [ counter ] , xmm6
}
/*
// reference C version of the kernel above:
for ( unsigned int elem_idx = 0 ; elem_idx < num_elems ; ++elem_idx ) {
    unsigned int const b = p_ptr_1[elem_idx] & p_ptr_2[elem_idx];
    unsigned int const w = b - ((b >> 1) & 0x55555555);           // temp
    unsigned int const x = (w & 0x33333333) + ((w >> 2) & 0x33333333); // temp
    counter += (((x + (x >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24; // count
}
*/
return counter;
}
int main(int argc, char* argv[])
{
unsigned int arr[65536];
for ( unsigned int idx = 0 ; idx < 65536 ; ++idx )
{ arr[idx] = 0x03; }
std::cout << do_bit_count( arr , arr , 65536 );
return 0;
}
Cheers,
Gerd
Some more information about the data size might make it easier to
mentally experiment on an algorithm. Is there a maximum data size or
population count you have in mind?
If the popcount is for an array, instead of just a single word, then you
should look into Robert Harley's bitslice full-adder reduction first!
Doing this you'd reduce up to 31 128-bit values into just 5 before doing
a parallel popcount on those 5 registers.
Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
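The full-adder step Terje refers to compresses three words into a carry word and a sum word in five logic operations; a minimal C sketch, with a naive popcount used only to check the invariant it preserves:

```c
#include <stdint.h>

/* Carry-save adder step: compress three words a, b, c into
   (carry, sum) so that, column by column,
   a + b + c == 2*carry + sum. Five logic operations. */
static inline void csa(uint32_t *carry, uint32_t *sum,
                       uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t u = a ^ b;
    *carry = (a & b) | (u & c);
    *sum   = u ^ c;
}

/* naive popcount, only for checking the invariant */
static inline int pop(uint32_t x)
{
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}
```

Consequently pop(a) + pop(b) + pop(c) == 2*pop(carry) + pop(sum), and chaining such steps lets a large group of words be reduced to a few place-value words before any real popcount is spent.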
Hi Andre,
as Terje already mentioned, it is smarter to first compress
powerOfTwo-1 words to log2(powerOfTwo) words, e.g. with 7 words you
get 3 words (with some parallel-prefix and/xor) and then perform
4*popCnt(x2) + 2*popCnt(x1) + popCnt(odd)
I think for P4 and AMD64 nothing beats the SWAR popcount with general
purpose registers - while we have slow double-dispatch
SSE2 instructions with two cycles latency. sorry for posting C.
__forceinline
int popCnt (unsigned __int64 bb)
{
unsigned int w = (unsigned int)(bb >> 32), v = (unsigned int)bb;
v = v - ((v >> 1) & 0x55555555);
w = w - ((w >> 1) & 0x55555555);
v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
w = (w & 0x33333333) + ((w >> 2) & 0x33333333);
v = (v + (v >> 4));
w = (w + (w >> 4));
v &= 0x0F0F0F0F;
w &= 0x0F0F0F0F;
v += w;
v = (v * 0x01010101) >> 24;
return v;
}
For a "weighted" popcount i think the SSE2 solution i introduced is ok.
But abusing it as a usual popcount (without anding the bytes as
sign-extended bits) was about two times slower, iirc.
- Gerd
Each reference vector must be dot-producted with each input vector. I
must store the maximum dot product achieved for each input vector, and
the index of the reference vector that accomplished it.
So... I know that memory bandwidth won't be a problem, because I have
1,000 dot products to accomplish for each 110 bits copied from memory.
Because the vectors are 1 million bits long and each vector gets dot
producted with 10-100 other vectors (depending on whether it is a
reference or an input vector), plenty of computations can fit in cache,
I am sure. I believe that SSE2 is capable of reading 128 bits from L1
cache in a single clock cycle, and in EM64T SSE2 has 16 registers,
possibly exploiting extra parallelism. I know that Harley's method of
only performing "real" population counts after doing many CSA
operations (reducing 3 words to 2 words in only 5 instructions) is
fastest:
#define CSA(h,l, a,b,c) \
{unsigned u = a ^ b; unsigned v = c; \
h = (a & b) | (u & v); l = u ^ v;} /* h is the carry, l is the low bit */
Thus it is possible to dot-product 1 word in an average of about 8
simple ALU instructions (see
http://www.hackersdelight.org/HDcode/newCode/pop_arrayHS.cc for what
the extra work is). I am hoping for 3.8 GHz / 2 clocks per simple SSE2
ALU instruction * 128 dotted bits per 8 ALU instructions = 30.4
Gigabits per second of dot-product performance.
Does this seem right to you guys? Can somebody please help me get
there?
Thanks in advance for your help,
Andrew Felch
you think that using sse2 is slower than general-purpose registers? the
algorithm i implemented is exactly the same one you propose, except for
the pmuludq instruction i had to use. so the only difference would be
instruction latency. now we have to do some profiling ;-)
Here is the compression part of the algorithm written in C (copied from
hacker's delight).
Thanks so much for your help!
Andrew Felch
int pop(unsigned x) {
x = x - ((x >> 1) & 0x55555555);
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
x = (x + (x >> 4)) & 0x0F0F0F0F;
x = x + (x >> 8);
x = x + (x >> 16);
return x & 0x0000003F;
}
/* This is Harley's basic method. It combines groups of three array
elements into two words to which pop(x) is applied. The running time,
ignoring loop control and loads, is 7 elementary ops plus 2 pop counts
for each 3 array elements, i.e., (7 + 2p)/3 ops per word, where p is
the number of ops for one population count. For p = 15 (code of Fig.
5-2, inlined) this is 37/3 = 12.33 ops per word. */
#define CSA(h,l, a,b,c) \
{unsigned u = a ^ b; unsigned v = c; \
h = (a & b) | (u & v); l = u ^ v;}
/* At the risk of being a bore, the function below is similar to that
above, but it brings in array elements 8 at a time and combines them
with "ones", "twos", "fours", and "eights". The array is assumed to
have at least seven elements.
The running time, ignoring loop control and loads, is 36 elementary
ops plus 1 pop count for each 8 array elements, i.e., (36 + p)/8 ops
per word, where p is the number of ops for one population count. For
p = 15 (code of Fig. 5-2, inlined) this is 51/8 = 6.375 ops per word. */
int pop_array5(unsigned A[], int n) {
int tot, i;
unsigned ones, twos, twosA, twosB,
fours, foursA, foursB, eights;
tot = 0; // Initialize.
fours = twos = ones = 0;
for (i = 0; i <= n - 8; i = i + 8) {
CSA(twosA, ones, ones, A[i], A[i+1])
CSA(twosB, ones, ones, A[i+2], A[i+3])
CSA(foursA, twos, twos, twosA, twosB)
CSA(twosA, ones, ones, A[i+4], A[i+5])
CSA(twosB, ones, ones, A[i+6], A[i+7])
CSA(foursB, twos, twos, twosA, twosB)
CSA(eights, fours, fours, foursA, foursB)
tot = tot + pop(eights);
}
tot = 8*tot + 4*pop(fours) + 2*pop(twos) + pop(ones);
for (i = i; i < n; i++) // Simply add in the last
tot = tot + pop(A[i]); // 0 to 7 elements.
return tot;
}
Ok, i was a bit too hasty with my statement ;-)
It was based on the experience with my chess program, where i don't
count huge memory streams but single 64-bit words, most often already
in edx:eax. Yes, with huge memory streams and a lot of parallelism,
things change in favour of SSE2 and MMX.
pmuludq is the only 32-bit integer multiply instruction available; all
the others work on 16-bit words. pmuludq multiplies only the first and
third dword of the xmm register, that's why i had to mangle the value
first using pshufd.
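In intrinsics form (a sketch, assuming a compiler with <emmintrin.h>; mul_all_dwords is a hypothetical helper name), the pmuludq behaviour Gerd describes looks like this, with the pshufd trick used to reach the odd dwords:

```c
#include <emmintrin.h>
#include <stdint.h>

/* Multiply all four 32-bit lanes of a by the corresponding even-lane
   values of m, producing two registers of 64-bit products. pmuludq
   (_mm_mul_epu32) only multiplies dwords 0 and 2, so the odd dwords
   are first shuffled into even positions (pshufd selector 0xF5
   replicates lanes 1 and 3 downward). */
static inline void mul_all_dwords(__m128i a, __m128i m,
                                  __m128i *even, __m128i *odd)
{
    __m128i a_odd = _mm_shuffle_epi32(a, 0xF5); /* lanes 1,1,3,3 */
    *even = _mm_mul_epu32(a, m);     /* 64-bit products of dwords 0 and 2 */
    *odd  = _mm_mul_epu32(a_odd, m); /* 64-bit products of dwords 1 and 3 */
}
```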
> Also, could you please (pretty please) also implement the compression
> part that is used on sets of 8 128-bit memory blocks (two sets of
> these, because we are performing an AND first to get the dot product).
> Of course assume the array lengths are a multiple of 128*8.
work in progress. just two questions.
let's assume i'd store the ones into separate values ones_x, x in [0,3];
is it true that pop(ones) == sum( pop( ones_x ) ) over all x?
you write about 8 128-bit memory blocks. where are they? in the loop
you use 8 32-bit values for computation. or the better question: how
will the pop_array5 function be used?
Wow, sweet!
> let's assume i'd store the ones into separate values ones_x, x in [0,3];
> is it true that pop(ones) == sum( pop( ones_x ) ) over all x?
I'm not sure what you mean. "tot = tot + pop(eights);" accumulates
int( the_hypothetical_sum_of_everything_so_far / 8 ) [note my integer
division], so tot continues to accumulate the "eights place" of the
real sum. "ones" still contains the non-pop-counted ones-place bits,
"twos" contains the non-pop-counted twos place, and "fours" contains
the non-pop-counted fours place. "eights" contains the non-pop-counted
eights place, but it is pop-counted right away and so is not around
for long.
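This place-value bookkeeping can be checked directly in C: for one group of eight words the true bit count equals 8*pop(eights) + 4*pop(fours) + 2*pop(twos) + pop(ones). A sketch (pop8 is a hypothetical single-group version of pop_array5's loop body):

```c
#include <stdint.h>

#define CSA(h, l, a, b, c) \
    { uint32_t u = (a) ^ (b), v = (c); \
      (h) = ((a) & (b)) | (u & v); (l) = u ^ v; }

/* naive popcount, only for the final weighted sum */
static inline int pop(uint32_t x)
{ int n = 0; while (x) { x &= x - 1; n++; } return n; }

/* Count the bits of eight words via a CSA tree; ones/twos/fours/eights
   hold the 1s, 2s, 4s and 8s places of the per-column sums. */
static int pop8(const uint32_t a[8])
{
    uint32_t ones = 0, twos = 0, fours = 0, eights;
    uint32_t twosA, twosB, foursA, foursB;
    CSA(twosA, ones, ones, a[0], a[1])
    CSA(twosB, ones, ones, a[2], a[3])
    CSA(foursA, twos, twos, twosA, twosB)
    CSA(twosA, ones, ones, a[4], a[5])
    CSA(twosB, ones, ones, a[6], a[7])
    CSA(foursB, twos, twos, twosA, twosB)
    CSA(eights, fours, fours, foursA, foursB)
    return 8*pop(eights) + 4*pop(fours) + 2*pop(twos) + pop(ones);
}
```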
> you write about 8 128-bit memory blocks. where are they? in the loop
> you use 8 32-bit values for computation. or the better question: how
> will the pop_array5 function be used?
I'd like all the 32-bit operations to be 128-bit. Incidentally, since
I will be running this on my Prescott in 64-bit mode, there will be 16
SSE2/3 registers, so plenty of room; otherwise it would not be
possible using only 8 registers without loads/stores. Also, I'm not
sure why the loop processes 8 elements at a time; I suppose it is to
emphasize that it should be run in a loop so that the final pop counts
can be deferred as long as possible. It's fine to leave it in; I will
unroll that loop and make several functions, one for each number of
simulated iterations (because they are unrolled) I want to run. Every
instruction counts on this project, and it will be the fastest
bit-matrix multiplier for any SSE2-supporting CPU! (Harley's method is
rarely implemented.) Throughput on SSE2/3 will double on new Intel
CPUs, so it'll run twice as fast very soon.
Thanks for all your work helping me,
AndrewF
I'm afraid that you'll probably have some trouble reaching those 30
Gbits/s, but you should be able to get into the right ballpark by
cache-blocking all your operations.
Anyway, with 16 available 128-bit registers you should be able to
compress 15 input vectors into 4 result vectors, then count those.
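That 15-into-4 compression can be built from a 128-bit-wide CSA step; a sketch using SSE2 intrinsics (csa128 is a hypothetical name), which repeated application reduces 15 registers to four place-value registers before any popcount:

```c
#include <emmintrin.h>
#include <stdint.h>

/* One carry-save step on whole 128-bit registers: compress three xmm
   words into a carry word and a sum word using five logic operations,
   so that per bit column a + b + c == 2*carry + sum. */
static inline void csa128(__m128i *carry, __m128i *sum,
                          __m128i a, __m128i b, __m128i c)
{
    __m128i u = _mm_xor_si128(a, b);
    *carry = _mm_or_si128(_mm_and_si128(a, b), _mm_and_si128(u, c));
    *sum   = _mm_xor_si128(u, c);
}
```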