My current Python code uses lookup tables through a library written in
C. I have a 3.8 GHz Pentium 4 and get 125 Mbps on the bitwise dot
product (slow, I know). Can somebody write a population count that
uses SSE2 for me?
Many thanks if you can provide me help,
AndrewF
Ah, so _that's_ why people are interested in popcnt!
> My current Python code uses lookup tables through a library written in C.
Bletcherous!
> Can somebody write a population count that operates
> using SSE2 for me?
Not I! But there are much better algorithms for the finding.
I think the preferred algorithm is repeated folding: mask, shift &
add. Norbert wrote something nice, and IIRC AMD has something
in their optimization manual.
-- Robert
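The folding approach Robert mentions can be sketched in portable C (a scalar 32-bit version; an SSE2 variant applies the same masks across a 128-bit register):

```c
#include <stdint.h>

/* Population count by repeated folding: at each step, adjacent
   bit-fields are masked apart, shifted, and added in parallel. */
static inline uint32_t popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit field sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit field sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit field sums */
    return (x * 0x01010101u) >> 24;                   /* add the four bytes */
}
```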
i hope that's the one you are looking for. the basic function is
do_bit_count. it receives pointers to both populations, ands them
together, and counts the bits. it does not store the anded result, but
this can be added easily. the number of elements (integers) *must* be a
multiple of 4. if you can accept some further constraints it can be
optimized even further, since there are a lot of unused registers
available. how about processing 8 elements at once?
note that the variables mask_x need to be aligned to 16-byte
boundaries, so you have to take care with your compiler.
included are the flags for vc++.
the bit-count algorithm is basically the one provided by Sean Eron
Anderson at
http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
greetings from dresden, germany
andre
#include <iostream>
#include <cassert>
inline
int
do_bit_count( const unsigned int* p_ptr_1 , const unsigned int* p_ptr_2
, const unsigned int num_elems ) {
assert( p_ptr_1 && p_ptr_2 && num_elems && !(num_elems%4) );
unsigned int counter = 0;
__declspec(align(16)) unsigned int mask_1[4] =
{ 0x55555555,0x55555555,0x55555555,0x55555555 };
__declspec(align(16)) unsigned int mask_2[4] =
{ 0x33333333,0x33333333,0x33333333,0x33333333 };
__declspec(align(16)) unsigned int mask_3[4] =
{ 0x0F0F0F0F,0x0F0F0F0F,0x0F0F0F0F,0x0F0F0F0F };
__declspec(align(16)) unsigned int mask_4[4] =
{ 0x01010101,0x01010101,0x01010101,0x01010101 };
const unsigned int num_bytes = num_elems * sizeof(*p_ptr_1) - 16;
__asm {
push eax
push ecx
push edx
mov eax , dword ptr [ p_ptr_1 ]
mov edx , dword ptr [ p_ptr_2 ]
mov ecx , dword ptr [ num_bytes ]
pxor xmm7 , xmm7
begin_loop :
// load both populations and and them together
pxor xmm0 , xmm0
movdqu xmm0 , xmmword ptr [ eax + ecx ]
pxor xmm1 , xmm1
movdqu xmm1 , xmmword ptr [ edx + ecx ]
pand xmm1 , xmm0
// copy to xmm0, we continue to work on xmm1
movdqa xmm0 , xmm1
psrld xmm1 , 1
pand xmm1 , xmmword ptr [ mask_1 ]
psubd xmm0 , xmm1
movdqa xmm1 , xmm0
psrld xmm0 , 2
pand xmm1 , xmmword ptr [ mask_2 ]
pand xmm0 , xmmword ptr [ mask_2 ]
paddd xmm1 , xmm0
movdqa xmm0 , xmm1
psrld xmm1 , 4
pand xmm1 , xmmword ptr [ mask_3 ]
paddd xmm0 , xmm1
pshufd xmm1 , xmm0 , 0xf5
pmuludq xmm0 , xmmword ptr [ mask_4 ]
pmuludq xmm1 , xmmword ptr [ mask_4 ]
psrlq xmm0 , 24
psrlq xmm1 , 24
sub ecx , 16
paddq xmm7 , xmm1
paddq xmm7 , xmm0
jnc begin_loop
pshufd xmm6 , xmm7 , 0xfe
// pops must mirror the pushes (eax, ecx, edx) in reverse order
pop edx
pop ecx
pop eax
paddd xmm6 , xmm7
movd dword ptr [ counter ] , xmm6
}
/*
// reference C version of the kernel above:
for ( unsigned int elem_idx = 0 ; elem_idx < num_elems ; ++elem_idx ) {
    unsigned int const b = p_ptr_1[elem_idx] & p_ptr_2[elem_idx];
    unsigned int const w = b - ((b >> 1) & 0x55555555);           // temp
    unsigned int const x = (w & 0x33333333) + ((w >> 2) & 0x33333333); // temp
    counter += (((x + (x >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24; // count
}
*/
return counter;
}
int main(int argc, char* argv[])
{
unsigned int arr[65536];
for ( unsigned int idx = 0 ; idx < 65536 ; ++idx )
{ arr[idx] = 0x03; }
std::cout << do_bit_count( arr , arr , 65536 );
return 0;
}
Cheers,
Gerd
Some more information about the data size might make it easier to
mentally experiment on an algorithm. Is there a maximum data size or
population count you have in mind?
If the popcount is for an array, instead of just a single word, then you
should look into Robert Harley's bitslice full-adder reduction first!
Doing this you'd reduce up to 31 128-bit values into just 5 before doing
a parallel popcount on those 5 registers.
Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
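The full-adder step Terje refers to compresses three words into a carry word and a sum word in five logic operations; a minimal C sketch, with a naive popcount used only to check the invariant it preserves:

```c
#include <stdint.h>

/* Carry-save adder step: compress three words a, b, c into
   (carry, sum) so that, column by column,
   a + b + c == 2*carry + sum. Five logic operations. */
static inline void csa(uint32_t *carry, uint32_t *sum,
                       uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t u = a ^ b;
    *carry = (a & b) | (u & c);
    *sum   = u ^ c;
}

/* naive popcount, only for checking the invariant */
static inline int pop(uint32_t x)
{
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}
```

Consequently pop(a) + pop(b) + pop(c) == 2*pop(carry) + pop(sum), and chaining such steps lets a large group of words be reduced to a few place-value words before any real popcount is spent.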
Hi Andre,
as Terje already mentioned, it is smarter to first compress
powerOfTwo-1 words to log2(powerOfTwo) words, e.g. with 7 words you
get 3 words (with some parallel-prefix and/xor) and then perform
4*popCnt(x2) + 2*popCnt(x1) + popCnt(odd)
I think for P4 and AMD64 nothing beats the SWAR popcount with general
purpose registers - while we have slow double-dispatch
SSE2 instructions with two cycles latency. sorry for posting C.
__forceinline
int popCnt (unsigned __int64 bb)
{
unsigned int w = (unsigned int)(bb >> 32), v = (unsigned int)bb;
v = v - ((v >> 1) & 0x55555555);
w = w - ((w >> 1) & 0x55555555);
v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
w = (w & 0x33333333) + ((w >> 2) & 0x33333333);
v = (v + (v >> 4));
w = (w + (w >> 4));
v &= 0x0F0F0F0F;
w &= 0x0F0F0F0F;
v += w;
v = (v * 0x01010101) >> 24;
return v;
}
For a "weighted" popcount i think the SSE2 solution i introduced is ok.
But abusing it as a usual popcount (without anding the bytes as
sign-extended bits) was about two times slower, iirc.
- Gerd
Each reference vector must be dot-producted with each input vector. I
must store the maximum dot product achieved for each input vector, and
the index of the reference vector that accomplished it.
So... I know that memory bandwidth won't be a problem, because I have
1,000 dot products to accomplish for each 110 bits copied from memory.
Because the vectors are 1 million bits long and each vector gets dot
producted with 10-100 other vectors (depending on whether it is a
reference or an input vector), plenty of computations can fit in cache,
I am sure. I believe that SSE2 is capable of reading 128 bits from L1
cache in a single clock cycle, and in EM64T SSE2 has 16 registers,
possibly exploiting extra parallelism. I know that Harley's method of
only performing "real" population counts after doing many CSA
operations (reducing 3 words to 2 words in only 5 instructions) is
fastest:
#define CSA(h,l, a,b,c) \
{unsigned u = a ^ b; unsigned v = c; \
h = (a & b) | (u & v); l = u ^ v;} /* h is the carry, l is the low bit */
Thus it is possible to dot-product 1 word in an average of about 8
simple ALU instructions (see
http://www.hackersdelight.org/HDcode/newCode/pop_arrayHS.cc for what
the extra work is). I am hoping for 3.8 GHz / 2 clocks per simple SSE2
ALU instruction * 128 dotted bits per 8 ALU instructions = 30.4
Gigabits per second of dot-product performance.
Does this seem right to you guys? Can somebody please help me get
there?
Thanks in advance for your help,
Andrew Felch
you think that using sse2 is slower than general-purpose registers? the
algorithm i implemented is exactly the same one you propose, except for
the pmuludq instruction i had to use. so the only difference would be
instruction latency. now we have to do some profiling ;-)
Here is the compression part of the algorithm written in C (copied from
hacker's delight).
Thanks so much for your help!
Andrew Felch
int pop(unsigned x) {
x = x - ((x >> 1) & 0x55555555);
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
x = (x + (x >> 4)) & 0x0F0F0F0F;
x = x + (x >> 8);
x = x + (x >> 16);
return x & 0x0000003F;
}
/* This is Harley's basic method. It combines groups of three array
elements into two words to which pop(x) is applied. The running time,
ignoring loop control and loads, is 7 elementary ops plus 2 pop counts
for each 3 array elements, i.e., (7 + 2p)/3 ops per word, where p is
the number of ops for one population count. For p = 15 (code of Fig.
5-2, inlined) this is 37/3 = 12.33 ops per word. */
#define CSA(h,l, a,b,c) \
{unsigned u = a ^ b; unsigned v = c; \
h = (a & b) | (u & v); l = u ^ v;}
/* At the risk of being a bore, the function below is similar to that
above, but it brings in array elements 8 at a time and combines them
with "ones", "twos", "fours", and "eights". The array is assumed to
have at least seven elements.
The running time, ignoring loop control and loads, is 36 elementary
ops plus 1 pop count for each 8 array elements, i.e., (36 + p)/8 ops
per word, where p is the number of ops for one population count. For
p = 15 (code of Fig. 5-2, inlined) this is 51/8 = 6.375 ops per word. */
int pop_array5(unsigned A[], int n) {
int tot, i;
unsigned ones, twos, twosA, twosB,
fours, foursA, foursB, eights;
tot = 0; // Initialize.
fours = twos = ones = 0;
for (i = 0; i <= n - 8; i = i + 8) {
CSA(twosA, ones, ones, A[i], A[i+1])
CSA(twosB, ones, ones, A[i+2], A[i+3])
CSA(foursA, twos, twos, twosA, twosB)
CSA(twosA, ones, ones, A[i+4], A[i+5])
CSA(twosB, ones, ones, A[i+6], A[i+7])
CSA(foursB, twos, twos, twosA, twosB)
CSA(eights, fours, fours, foursA, foursB)
tot = tot + pop(eights);
}
tot = 8*tot + 4*pop(fours) + 2*pop(twos) + pop(ones);
for (i = i; i < n; i++) // Simply add in the last
tot = tot + pop(A[i]); // 0 to 7 elements.
return tot;
}
Ok, i was a bit too hasty with my statement ;-)
It was based on the experience with my chess program, where i don't
count huge memory streams but single 64-bit words, most often already
in edx:eax. Yes, with huge memory streams and a lot of parallelism,
things change in favour of SSE2 and MMX.
pmuludq is the only 32-bit integer multiply instruction available; all
the others work on 16-bit words. pmuludq multiplies only the first and
third dword of the xmm register, that's why i had to mangle the value
first using pshufd.
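In intrinsics form (a sketch, assuming a compiler with <emmintrin.h>; mul_all_dwords is a hypothetical helper name), the pmuludq behaviour Gerd describes looks like this, with the pshufd trick used to reach the odd dwords:

```c
#include <emmintrin.h>
#include <stdint.h>

/* Multiply all four 32-bit lanes of a by the corresponding even-lane
   values of m, producing two registers of 64-bit products. pmuludq
   (_mm_mul_epu32) only multiplies dwords 0 and 2, so the odd dwords
   are first shuffled into even positions (pshufd selector 0xF5
   replicates lanes 1 and 3 downward). */
static inline void mul_all_dwords(__m128i a, __m128i m,
                                  __m128i *even, __m128i *odd)
{
    __m128i a_odd = _mm_shuffle_epi32(a, 0xF5); /* lanes 1,1,3,3 */
    *even = _mm_mul_epu32(a, m);     /* 64-bit products of dwords 0 and 2 */
    *odd  = _mm_mul_epu32(a_odd, m); /* 64-bit products of dwords 1 and 3 */
}
```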
> Also, could you please (pretty please) also implement the compression
> part that is used on sets of 8 128-bit memory blocks (two sets of
> these, because we are performing an AND first to get the dot product).
> Of course assume the array lengths are a multiple of 128*8.
work in progress. just two questions.
let's assume i'd store the ones into separate values ones_x, x in [0,3];
is it true that pop(ones) == sum( pop( ones_x ) ) over all x?
you write about 8 128-bit memory blocks. where are they? in the loop
you use 8 32-bit values for computation. or the better question: how
will the pop_array5 function be used?
Wow, sweet!
> let's assume i'd store the ones into separate values ones_x, x in [0,3];
> is it true that pop(ones) == sum( pop( ones_x ) ) over all x?
I'm not sure what you mean. "tot = tot + pop(eights);" accumulates
int( the_hypothetical_sum_of_everything_so_far / 8 ) [note my integer
division], so tot continues to accumulate the "eights place" of the
real sum. "ones" still contains the non-pop-counted ones-place bits,
"twos" contains the non-pop-counted twos place, and "fours" contains
the non-pop-counted fours place. "eights" contains the non-pop-counted
eights place, but it is pop-counted right away and so is not around
for long.
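This place-value bookkeeping can be checked directly in C: for one group of eight words the true bit count equals 8*pop(eights) + 4*pop(fours) + 2*pop(twos) + pop(ones). A sketch (pop8 is a hypothetical single-group version of pop_array5's loop body):

```c
#include <stdint.h>

#define CSA(h, l, a, b, c) \
    { uint32_t u = (a) ^ (b), v = (c); \
      (h) = ((a) & (b)) | (u & v); (l) = u ^ v; }

/* naive popcount, only for the final weighted sum */
static inline int pop(uint32_t x)
{ int n = 0; while (x) { x &= x - 1; n++; } return n; }

/* Count the bits of eight words via a CSA tree; ones/twos/fours/eights
   hold the 1s, 2s, 4s and 8s places of the per-column sums. */
static int pop8(const uint32_t a[8])
{
    uint32_t ones = 0, twos = 0, fours = 0, eights;
    uint32_t twosA, twosB, foursA, foursB;
    CSA(twosA, ones, ones, a[0], a[1])
    CSA(twosB, ones, ones, a[2], a[3])
    CSA(foursA, twos, twos, twosA, twosB)
    CSA(twosA, ones, ones, a[4], a[5])
    CSA(twosB, ones, ones, a[6], a[7])
    CSA(foursB, twos, twos, twosA, twosB)
    CSA(eights, fours, fours, foursA, foursB)
    return 8*pop(eights) + 4*pop(fours) + 2*pop(twos) + pop(ones);
}
```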
> you write about 8 128-bit memory blocks. where are they? in the loop
> you use 8 32-bit values for computation. or the better question: how
> will the pop_array5 function be used?
I'd like all the 32-bit operations to be 128-bit. Incidentally, since
I will be running this on my Prescott in 64-bit mode, there will be 16
SSE2/3 registers, so plenty of room; otherwise it would not be
possible using only 8 registers without loads/stores. Also, I'm not
sure why the loop processes 8 elements at a time; I suppose it is to
emphasize that it should be run in a loop so that the final pop counts
can be deferred as long as possible. It's fine to leave it in; I will
unroll that loop and make several functions, one for each number of
simulated iterations (because they are unrolled) I want to run. Every
instruction counts on this project, and it will be the fastest
bit-matrix multiplier for any SSE2-supporting CPU! (Harley's method is
rarely implemented.) Throughput on SSE2/3 will double on new Intel
CPUs, so it'll run twice as fast very soon.
Thanks for all your work helping me,
AndrewF
I'm afraid that you'll probably have some trouble reaching those 30
Gbits/s, but you should be able to get into the right ballpark by
cache-blocking all your operations.
Anyway, with 16 available 128-bit registers you should be able to
compress 15 input vectors into 4 result vectors, then count those.
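That 15-into-4 compression can be built from a 128-bit-wide CSA step; a sketch using SSE2 intrinsics (csa128 is a hypothetical name), which repeated application reduces 15 registers to four place-value registers before any popcount:

```c
#include <emmintrin.h>
#include <stdint.h>

/* One carry-save step on whole 128-bit registers: compress three xmm
   words into a carry word and a sum word using five logic operations,
   so that per bit column a + b + c == 2*carry + sum. */
static inline void csa128(__m128i *carry, __m128i *sum,
                          __m128i a, __m128i b, __m128i c)
{
    __m128i u = _mm_xor_si128(a, b);
    *carry = _mm_or_si128(_mm_and_si128(a, b), _mm_and_si128(u, c));
    *sum   = _mm_xor_si128(u, c);
}
```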