
Population count in SSE2, again


James Van Buskirk

Apr 12, 2008, 5:14:45 AM
There was a thread about population count in comp.lang.fortran but
nobody seemed to like my assembly code over there:

http://groups.google.com/group/comp.lang.fortran/msg/271b9850acfafabc

Looking around, I found an old thread:

http://groups.google.com/group/comp.lang.asm.x86/msg/54d29f732e51de93

which was very similar to the code I posted, so I thought I would
post a revision of my code along with an explanation of how it is
different.

Of course the most obvious difference is that my code is actual
SSE2 code, whereas the older code was 32-bit C code. SSE2 has the
surprising PSADBW instruction which sweeps the floor after 8-bit
sums have been completed in my wrap() subroutine (analogous to the
pop() function in the older code). It's not clear how you are
supposed to tease a compiler into generating this instruction;
Fortran has a way to request it in a single expression but it is
unfathomable code that the compiler probably wouldn't convert in
the desired way, so the normal thing you see is a multiply and
shift as in the AMD Software Optimization Guide for AMD64
Processors.
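
For readers who prefer intrinsics, the PSADBW sweep amounts to
something like the following sketch (the function name is mine;
_mm_sad_epu8 is the SSE2 intrinsic for PSADBW):

#include <emmintrin.h>  /* SSE2 */

/* Sum the 16 per-byte counts in v down to a scalar.  PSADBW against
   zero leaves one 16-bit partial sum per qword. */
static unsigned sum_bytes(__m128i v)
{
    __m128i zero = _mm_setzero_si128();
    __m128i sad  = _mm_sad_epu8(v, zero);         /* PSADBW */
    return (unsigned)_mm_cvtsi128_si32(sad)       /* low qword sum  */
         + (unsigned)_mm_extract_epi16(sad, 4);   /* high qword sum */
}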

But the big difference lies in implementation of the Carry-Save
Adder (CSA) compression scheme. You will recall that the
fundamental sequence of operations, starting with inputs A, B,
and C is:

L = A .XOR. B
H = A .AND. B
H = H .OR. (C .AND. L)
L = C .XOR. L

Now, on a 3-register ISA such as Alpha, this sequence can be carried
out in 5 instructions, but there is a complication in a 2-register
ISA like the topical SSE2 because you need both A .XOR. B and
A .AND. B, but SSE2 always overwrites one of its operands. Same
with C .XOR. L and C .AND. L, so it seems that you will need to
issue 2 copy instructions so that the sequence will take 7
instructions instead of 5.
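
In portable C, where the two-operand restriction doesn't bite, the
canonical step is just a bitwise full adder; a minimal sketch,
following Hacker's Delight:

#include <stdint.h>

/* One carry-save adder step over words a, b, c: l receives the sum
   bits, h the carry bits -- exactly the 4-operation sequence above. */
#define CSA(h, l, a, b, c) do {       \
    uint64_t u = (a) ^ (b);           \
    (h) = ((a) & (b)) | (u & (c));    \
    (l) = u ^ (c);                    \
} while (0)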

I was thinking about this the other day and one of the things that
bugs me about the MMX/SSE ISA is that the PANDN instruction is
backwards. That is, the sequence:

A = A .XOR. B
B = B .AND. (.NOT. A)

would leave A and B in the roles of L and H respectively in the
canonical sequence of Harley (et al?), but alas, the PANDN
instruction negates the destination register rather than the source.
I couldn't see any way to make this work until it occurred to me
that the thing to do was to negate the starting register and then
proceed from there:

L = .TRUE.
...
L = L .XOR. A
A = A .AND. L
L = L .XOR. B
B = B .AND. L
B = A .OR. B

This scheme, as implemented in function popcnt1() in my code,
is completely analogous to the pop_array5() function in Andrew
Felch's old post:

current   old
-------   ------
L         ones
A         A[i]
B         A[i+1]
B         twosA

for the inputs and outputs of the first CSA iteration. The
difference is that I start out with ones = .NOT. .FALSE. so
that on the first iteration I get

L = L .XOR. A = .TRUE. .XOR. A0 = .NOT. A0 ; Negation of current sum
A = A .AND. L = A0 .AND. (.NOT. A0) = .FALSE. ; Correct carry!
L = L .XOR. B = (.NOT. A0) .XOR. B0
= .NOT. (A0 .XOR. B0) ; Negation of current sum!
B = B .AND. L = B0 .AND. ((.NOT. A0) .XOR. B0)
= B0 .AND. (((.NOT. A0) .AND. (.NOT. B0)) .OR. (A0 .AND. B0))
= (B0 .AND. ((.NOT. A0) .AND. (.NOT. B0))) .OR. (B0 .AND. (A0 .AND. B0))
= (.FALSE.) .OR. (A0 .AND. B0) = A0 .AND. B0; Correct carry!
B = A .OR. B = .FALSE. .OR. (A0 .AND. B0)
= A0 .AND. B0; Correct twosA
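
A brute-force check of the negative-logic sequence over all 2**3
input combinations, as a C sketch (here L models the complemented
running sum, so the general step is covered, not just the first
iteration):

#include <assert.h>

int main(void)
{
    for (int a = 0; a <= 1; a++)
    for (int b = 0; b <= 1; b++)
    for (int nl = 0; nl <= 1; nl++) {   /* nl = complement of sum bit */
        int L = nl, A = a, B = b;
        L = L ^ A;                      /* xorps */
        A = A & L;                      /* andps */
        L = L ^ B;                      /* xorps */
        B = B & L;                      /* andps */
        B = A | B;                      /* orps  */
        int total = !nl + a + b;        /* true sum of the three bits */
        assert(L == !(total & 1));      /* L: complemented sum bit    */
        assert(B == (total >> 1));      /* B: correct carry           */
    }
    return 0;
}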

Thinking about this, the reader can envision that simply starting
with fours = twos = ones = ~0 (negative logic) in function pop_array5
in the old code permits us to carry out all those CSA operations in 5
SSE2 instructions each rather than the 7 which were necessary with
positive logic. Of course the line:

tot = 8*tot + 4*pop(fours) + 2*pop(twos) + pop(ones);

is replaced by:

tot = 8*tot - 4*pop(fours) - 2*pop(twos) - pop(ones) + 896;

to correct for the negative logic: each counter register holds the
complement of its true value, and pop(.NOT. x) = 128 - pop(x) for a
128-bit register, so the constant is the total weighted capacity
(4+2+1)*128 = 896. Hence the overhead for eliminating 2 out of 7
inner-loop instructions is only the addition of the total number of
weighted bits in all counters, just one instruction outside the loop.
The implementation of negative logic reduced my test of counting bits
in 32768 bytes from about 9100 clocks to about 6950 clocks on my
Core 2 Duo E6700. Here it is in GAS with a gfortran driver program:

C:\gfortran\clf\popcnt>type big_popcnt3.s
.text
.globl _tm1
.def _tm1; .scl 2; .type 32; .endef
_tm1:
rdtsc
shrq $32, %rdx
orq %rdx, %rax
ret
.globl _popcnt1
.def _popcnt1; .scl 2; .type 32; .endef
.align 16
_popcnt1:
subq $56, %rsp
movaps %xmm6, (%rsp)
movaps %xmm7, 16(%rsp)
movaps %xmm8, 32(%rsp)
movaps %xmm14, 64(%rsp)
movaps %xmm15, 80(%rsp)
movq $0x3333333333333333, %rax
movq %rax, %xmm0
movddup %xmm0, %xmm15
xorps %xmm14, %xmm14

pcmpeqd %xmm1, %xmm1
pcmpeqd %xmm3, %xmm3
pcmpeqd %xmm5, %xmm5
xorps %xmm7, %xmm7

movl $32768, %edx

.align 16
loop1: movaps (%rcx), %xmm6
xorps %xmm6, %xmm1
andps %xmm1, %xmm6
movaps 16(%rcx), %xmm0
xorps %xmm0, %xmm1
andps %xmm1, %xmm0
orps %xmm0, %xmm6
movaps 32(%rcx), %xmm4
xorps %xmm4, %xmm1
andps %xmm1, %xmm4
movaps 48(%rcx), %xmm0
xorps %xmm0, %xmm1
andps %xmm1, %xmm0
orps %xmm0, %xmm4
movaps 64(%rcx), %xmm2
xorps %xmm2, %xmm1
andps %xmm1, %xmm2
movaps 80(%rcx), %xmm0
xorps %xmm0, %xmm1
andps %xmm1, %xmm0
orps %xmm0, %xmm2
movaps 96(%rcx), %xmm0
xorps %xmm0, %xmm1
andps %xmm1, %xmm0
movaps 112(%rcx), %xmm8
xorps %xmm8, %xmm1
andps %xmm1, %xmm8
orps %xmm8, %xmm0

xorps %xmm6, %xmm3
andps %xmm3, %xmm6
xorps %xmm4, %xmm3
andps %xmm3, %xmm4
orps %xmm4, %xmm6
xorps %xmm2, %xmm3
andps %xmm3, %xmm2
xorps %xmm0, %xmm3
andps %xmm3, %xmm0
orps %xmm0, %xmm2
xorps %xmm6, %xmm5
andps %xmm5, %xmm6
xorps %xmm2, %xmm5
andps %xmm5, %xmm2
orps %xmm2, %xmm6

movaps stuff, %xmm0
andps %xmm6, %xmm0
psrld $1, %xmm0
psubd %xmm0, %xmm6
movaps %xmm15, %xmm0
andnps %xmm6, %xmm0
psrld $2, %xmm0
andps %xmm15, %xmm6
paddd %xmm0, %xmm6
movaps %xmm6, %xmm0
psrld $4, %xmm6
paddd %xmm0, %xmm6
andps nonsense, %xmm6
psadbw %xmm14, %xmm6
paddd %xmm6, %xmm7

addq $128, %rcx
subq $128, %rdx
jnz loop1

movaps %xmm5, %xmm6
call wrap
paddq %xmm7, %xmm7
psubq %xmm6, %xmm7
movaps %xmm3, %xmm6
call wrap
paddq %xmm7, %xmm7
psubq %xmm6, %xmm7
movaps %xmm1, %xmm6
call wrap
paddq %xmm7, %xmm7
psubq %xmm6, %xmm7
movhlps %xmm7, %xmm1
paddq %xmm7, %xmm1
movq %xmm1, %rax
addq $896, %rax

movaps (%rsp), %xmm6
movaps 16(%rsp), %xmm7
movaps 32(%rsp), %xmm8
movaps 64(%rsp), %xmm14
movaps 80(%rsp), %xmm15
addq $56, %rsp
ret

.align 16
wrap: movaps stuff, %xmm0
andps %xmm6, %xmm0
psrld $1, %xmm0
psubd %xmm0, %xmm6
movaps %xmm15, %xmm0
andnps %xmm6, %xmm0
psrld $2, %xmm0
andps %xmm15, %xmm6
paddd %xmm0, %xmm6
movaps %xmm6, %xmm0
psrld $4, %xmm6
paddd %xmm0, %xmm6
andps nonsense, %xmm6
psadbw %xmm14, %xmm6
ret

.globl _popcnt3
.def _popcnt3; .scl 2; .type 32; .endef
.align 16
_popcnt3:
movq $0x3333333333333333, %rax
movq %rax, %xmm0
movddup %xmm0, %xmm3
xorps %xmm2, %xmm2

xorps %xmm4, %xmm4

movl $32768, %edx
addq %rdx, %rcx
negq %rdx

.align 16
loop2: movaps (%rcx,%rdx), %xmm1
movapd stuff, %xmm0
andps %xmm1, %xmm0
psrld $1, %xmm0
psubd %xmm0, %xmm1
movapd %xmm3, %xmm0
andnps %xmm1, %xmm0
psrld $2, %xmm0
andpd %xmm3, %xmm1
paddd %xmm0, %xmm1
movaps %xmm1, %xmm0
psrld $4, %xmm1
paddd %xmm0, %xmm1
andpd nonsense, %xmm1
psadbw %xmm2, %xmm1
paddd %xmm1, %xmm4

addq $16, %rdx
jnz loop2
movhlps %xmm4, %xmm1
paddq %xmm4, %xmm1
movq %xmm1, %rax

ret

.data
.align 16
stuff:
.long 0xaaaaaaaa
.long 0xaaaaaaaa
.long 0xaaaaaaaa
.long 0xaaaaaaaa
nonsense:
.long 0x0f0f0f0f
.long 0x0f0f0f0f
.long 0x0f0f0f0f
.long 0x0f0f0f0f

C:\gfortran\clf\popcnt>type big_test1.f90
program big_test1
use ISO_C_BINDING
implicit none
interface
function tm1() bind(C)
import C_INT64_T
implicit none

integer(C_INT64_T) tm1
end function tm1
end interface

interface
function popcnt1(x) bind(C)
import C_INT64_T
import C_PTR
implicit none

integer(C_INT64_T) popcnt1
type(C_PTR), value :: x
end function popcnt1
end interface

interface
function popcnt2(x,n) bind(C)
import C_INT64_T
import C_PTR
import C_INT
implicit none

integer(C_INT64_T) popcnt2
type(C_PTR), value :: x
integer(C_INT) n
end function popcnt2
end interface

interface
function popcnt3(x) bind(C)
import C_INT64_T
import C_PTR
implicit none

integer(C_INT64_T) popcnt3
type(C_PTR), value :: x
end function popcnt3
end interface

interface
subroutine sieve(t, n) bind(C)
import C_PTR
import C_INT
implicit none

type(C_PTR), value :: t
integer(C_INT) n
end subroutine sieve
end interface

integer(C_INT8_T), allocatable, target :: x(:)
integer(C_INT), parameter :: Nbytes = 32768 ! popcnt1 hardwired to this
integer, parameter :: align = 16 ! Will use xmm registers
type(C_PTR) x_ptr
integer(C_INTPTR_T) x_start
integer(C_INTPTR_T) t_start
type(C_PTR) t_ptr
integer i
integer(C_INT64_T) t0, t1, np

allocate(x(Nbytes+align-1))
x_ptr = C_LOC(x(1))
x_start = transfer(x_ptr, x_start)
t_start = iand(x_start+align-1, int(not(align-1),C_INTPTR_T))
t_ptr = transfer(t_start, t_ptr)
call sieve(t_ptr, Nbytes)
do i = 1, 4
t0 = tm1()
np = popcnt1(t_ptr)
t1 = tm1()
write(*,'(2(a,i0))') 'popcnt1 np = ', np, ' clocks = ', t1-t0
t0 = tm1()
np = popcnt2(t_ptr, Nbytes/4)
t1 = tm1()
write(*,'(2(a,i0))') 'popcnt2 np = ', np, ' clocks = ', t1-t0
t0 = tm1()
np = popcnt3(t_ptr)
t1 = tm1()
write(*,'(2(a,i0))') 'popcnt3 np = ', np, ' clocks = ', t1-t0
end do

end program big_test1

subroutine sieve(t, n) bind(C)
use ISO_C_BINDING
implicit none
integer(C_INT) n
integer(C_INT8_T) t(0:n-1)
integer i, lim, j

lim = sqrt(8*n+0.5_C_DOUBLE)
t = -1
t(0) = ibclr(t(0), 0)
do i = 2, lim
if(btest(t((i-1)/8),modulo(i-1,8))) then
do j = i**2, 8*n, i
t((j-1)/8) = ibclr(t((j-1)/8),modulo(j-1,8))
end do
end if
end do
end subroutine sieve

function popcnt2(MAT, Nwords) bind(C)
use ISO_C_BINDING
implicit none
integer(C_INT64_T) popcnt2
integer(C_INT) Nwords
integer(C_INT16_T) MAT(2*Nwords)
integer i, j
integer(C_INT8_T), parameter :: IBITC(0:255) = [( &
ibits(i,0,1)+ibits(i,1,1)+ibits(i,2,1)+ibits(i,3,1)+ &
ibits(i,4,1)+ibits(i,5,1)+ibits(i,6,1)+ibits(i,7,1), &
i=0,255)]
integer(C_INT16_T) K1, K2

popcnt2 = 0
do i = 1, 2*Nwords
K1 = MAT(i)
K2 = ishft(K1, -8)
K1 = iand(K1, int(z'ff', C_INT16_T))
popcnt2 = popcnt2+IBITC(K1)+IBITC(K2)
end do
end function popcnt2

C:\gfortran\clf\popcnt>C:\gfortran\win64\bin\x86_64-pc-mingw32-gfortran -O2 big_test1.f90 big_popcnt3.s -obig_test1

C:\gfortran\clf\popcnt>big_test1 > big_test1.txt

C:\gfortran\clf\popcnt>big_test1 > big_test1.txt

C:\gfortran\clf\popcnt>big_test1 > big_test1.txt

C:\gfortran\clf\popcnt>type big_test1.txt
popcnt1 np = 23000 clocks = 6940
popcnt2 np = 23000 clocks = 82890
popcnt3 np = 23000 clocks = 15180
popcnt1 np = 23000 clocks = 6900
popcnt2 np = 23000 clocks = 82690
popcnt3 np = 23000 clocks = 15120
popcnt1 np = 23000 clocks = 6900
popcnt2 np = 23000 clocks = 82560
popcnt3 np = 23000 clocks = 15070
popcnt1 np = 23000 clocks = 6960
popcnt2 np = 23000 clocks = 82550
popcnt3 np = 23000 clocks = 15060

I hope you don't mind the Fortran driver. The popcnt1 function is
invoked by passing, in rcx, the address of a 16-byte-aligned array of
32768 bytes to be population counted, and returns its result in rax.
It uses the 64-bit Windows calling convention. In the above, popcnt2
was an 8-bit LUT solution and popcnt3 was similar to popcnt1 but
without CSA compression. I hope you enjoy this wrinkle on an old
technique.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end

Terence

Apr 12, 2008, 7:55:40 PM
I built my business on software for bit counting, running since 1976.
I thought I understood the objective of the posts about "popcount" and
"wrap", but I see I don't.

If one wants to count "1" or "0" bits in strings of bytes or words,
you have to access memory anyway for the bytes. So there are a variety
of very simple, very fast ways of doing this using the continuous sum
of table-indexed words or bytes, whose index is the word or byte to
count bits in. There is no need for external fixed/floating point
chips; just four or five machine language instructions found in the
most basic CPU chips. So you call one pre-built function from Fortran
which returns the answer for a vast string.
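
As a minimal C sketch of the table-indexed method Terence describes
(the table and names here are illustrative):

#include <stddef.h>

static unsigned char lut[256];  /* lut[i] = number of set bits in i */

static void init_lut(void)
{
    for (int i = 0; i < 256; i++)
        lut[i] = (unsigned char)((i & 1) + lut[i / 2]);
}

static unsigned long count_bits(const unsigned char *p, size_t n)
{
    unsigned long total = 0;
    while (n--)
        total += lut[*p++];   /* one table-indexed add per byte */
    return total;
}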

Terje Mathisen

Apr 13, 2008, 9:08:36 AM
James Van Buskirk wrote:
> But the big difference lies in implementation of the Carry-Save
> Adder (CSA) compression scheme. You will recall that the
> fundamental sequence of operations, starting with inputs A, B,
> and C is:

[snipped bitslice adder logic using negated input bits]

Nice!!

OTOH, POPCNT is turning up as a regular x86 opcode these days, and I do
assume this will be _very_ hard to beat?

Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

James Van Buskirk

Apr 13, 2008, 11:20:49 AM
"Terje Mathisen" <spam...@crayne.org> wrote in message
news:B5mdncw9LPFJmJ_V...@giganews.com...

> James Van Buskirk wrote:
>> But the big difference lies in implementation of the Carry-Save
>> Adder (CSA) compression scheme. You will recall that the
>> fundamental sequence of operations, starting with inputs A, B,
>> and C is:

> [snipped bitslice adder logic using negated input bits]

> Nice!!

> OTOH, POPCNT is turning up as a regular x86 opcode these days, and I do
> assume this will be _very_ hard to beat?

As written, it's getting about 32768/6950 = 4.7 bytes per clock.
In 32-bit mode it may still be above 4 bytes per clock, but you
would have to be able to perform 2 loads per clock (Core 2 Duo only
has one load port) and 2 POPCNTs per clock (Phenom only has one
POPCNT-capable ALU) to do better than this. Also there may simply
be some not-well-documented throughput limitation on Core 2 Duo
which impairs performance by 33% (see the thread I started entitled
'Throughput of PXOR'.) A processor without that limitation might be
50% faster or about 7 bytes per clock. Also if a processor had
256-bit wide registers available, the speed could double, even
outstripping a POPCNT unit with throughput of 1 per clock cycle in
64-bit mode. Of course, if a POPCNT instruction were available for
256-bit wide registers, it would be game over.
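
(For comparison, the hardware-POPCNT loop Terje has in mind would look
roughly like this in C; a sketch using the GCC builtin, assuming a
POPCNT-capable CPU and a compiler flag such as -mpopcnt:)

#include <stddef.h>
#include <stdint.h>

/* Count set bits in n bytes; n is assumed to be a multiple of 8. */
unsigned long popcnt_hw(const uint64_t *p, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n / 8; i++)
        total += (unsigned long)__builtin_popcountll(p[i]);
    return total;
}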

Maarten Kronenburg

Apr 13, 2008, 3:48:40 PM

"James Van Buskirk" wrote in message

> There was a thread about population count in comp.lang.fortran but
> nobody seemed to like my assembly code over there:

For what it's worth below I give my 32-bit assembler for popcnt.
The format is for lzasm but can easily be changed for nasm.
The two arguments are the pointer and the number of 32-bit words.
The code is from Hacker's Delight, converted to 32-bit assembler.
On Core2Duo I measured 13 cycles/limb, where limb=4bytes=32bits.
Also I peeked for popcnt in GMP 4.2.2, see:
http://gmplib.org/
They have an IA-64 Itanium popcnt with 1 cycle/limb (64-bit), using the
IA-64 popcnt instruction.
Their table lookup popcnt runs 8 cycles/limb (32-bit).
Clearly your xmm popcnt1 is also very fast, about 1 cycle/limb (32-bit)!
But because it seems more complicated you should also test it for all
possible 128-bit bit patterns.
Regards, Maarten.

PROC __int_popcnt
ARG pdata:dword, ndata:dword
push ebp
mov ebp, esp
push edi
push esi
mov edi, [ndata]
mov esi, [pdata]
lea esi, [esi-4]
xor eax, eax
@@1:
mov ecx, [esi+4*edi]
mov edx, ecx
shr edx, 1
and ecx, 55555555h
and edx, 55555555h
add ecx, edx
mov edx, ecx
shr edx, 2
and ecx, 33333333h
and edx, 33333333h
add ecx, edx
mov edx, ecx
shr edx, 4
add ecx, edx
and ecx, 0F0F0F0Fh
mov edx, ecx
shr edx, 8
add ecx, edx
mov edx, ecx
shr edx, 16
add ecx, edx
and ecx, 3Fh
add eax, ecx
dec edi
jnz short @@1
pop esi
pop edi
pop ebp
ret
ENDP __int_popcnt
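
For reference, the loop body above implements the classic SWAR
sequence from Hacker's Delight, which in C reads:

#include <stdint.h>

/* Popcount of one 32-bit word: fold 1-bit counts into 2-bit counts,
   then 4-bit counts, then byte and word sums, as in the asm above. */
static unsigned pop32(uint32_t x)
{
    x = (x & 0x55555555u) + ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    x = x + (x >> 8);
    x = x + (x >> 16);
    return x & 0x3F;
}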

Jake Waskett

Apr 13, 2008, 5:43:32 PM
On Sun, 13 Apr 2008 21:48:40 +0200, Maarten Kronenburg wrote:

> But because it seems more complicated you should also test it for all
> possible 128-bit bit patterns.

Since there are 340282366920938463463374607431768211456 of those, that
might be slightly unrealistic. :-)

Maarten Kronenburg

Apr 13, 2008, 7:55:14 PM

"Jake Waskett" wrote in message

Yes that is true; perhaps a lot of randomly generated ones.
When, say, 2^32 randomly generated patterns count correctly,
then one can say with near certainty that the method is correct
and has no errors.

Terence

Apr 13, 2008, 8:14:55 PM

"James Van Buskirk" wrote in message
(code)

Someone tell me why 21 instructions and a branch, per word, is somehow
better/faster than

Initialization (which all methods need):

zero summing index (e.g. ax or eax)
load address of string to esi
load string count to ecx
load table address to ebx

back: ; 3 instructions per word follow
add to current sum from the count table, indexed by the 8-bit byte
or 16-bit word (one instruction)
increment string pointer (esi) (one instruction)
branch to repeat if not at the end of the byte/word string
(dec-branch instruction)

?
I want to know, if I haven't understood what you all seem to be trying
to do!
Is it that all those instructions are all somehow loaded to the CPU
chip instruction store and processed on chip? Because there is
always the memory access to be considered.

Wolfgang Kern

Apr 14, 2008, 6:42:35 AM

"Terence" wrote:

The shown code reminds me of what Randy Hyde posted a while back
in ALA for counting set bits.
Even though this algo suffers from many dependency stalls, it reads
four bytes and processes 32 bits per iteration, so it reduces the
costly memory accesses which you encounter with a byte-by-byte
indexed table lookup.
Granted, a 256-byte table could easily be held in cache, so it would
be worth benchmarking and comparing the performance of the two methods.

I never needed a bit counter yet, but if I ever need one, I'd try
whether the known-to-be-awfully-slow BSF/BSR instructions won't save on
many iterations, ...sure the worst case would be all bits set then,
so your suggestion may perform quite well:

xor eax,eax ;init count
xor ecx,ecx ;used as dwords_size
mov esi,start_of
mov edi, Byte_LUT ;should be prepared already
prefetchnta [edi]
prefetchnta [edi+40h]
prefetchnta [edi+80h]
prefetchnta [edi+0C0h]
L0:
mov ebx,[esi+ecx*4]
movzx edx,bl ;four times unrolled
shr ebx,8 ;because a loop branch
add AL,[edi+edx] ;may cost more time
adc eax,0x0100 ;
movzx edx,bl
shr ebx,8
add AL,[edi+edx]
adc eax,0x0100
movzx edx,bl
shr ebx,8
add AL,[edi+edx]
adc eax,0x0100
movzx edx,bl
;shr ebx,8 ;not needed in last
add AL,[edi+edx]
adc eax,0x0100
inc ecx
cmp ecx,dword_size
jc L0
; eax = bit count

but I'm not sure that your short variant:

;ecx=byte_size
L0:
movzx edx,byte [esi]
inc esi
add AL,[edi+edx]
adc eax,0x0100
dec ecx
jnz L0

could be faster, because it reads 'bytes' more often from memory.

perhaps I'll type it in one day and post timing results then.
__
wolfgang


Robert Redelmeier

Apr 14, 2008, 10:21:05 AM
Terence <spam...@crayne.org> wrote in part:

> "James Van Buskirk" wrote in message
> (code) [clever XMM bitslice code omitted]


> Someone tell me why 21 instructions and a branch,
> per word, is somehow better/faster than

[byte-wise LUT pseudocode snipped]

Easy, it just is -- 4.7 bytes/clock. Your LUT would
be good at 0.5 and have real trouble breaking 1.0.

Why is equally easy to see -- James' "word" is a 128-bit
wide XMM register (presumably zero padded). That is 16
bytes at a time. Furthermore the code is unrolled x8,
meaning 128 _bytes_ per 61-instruction inner loop.
James' timing gives a reasonable 2.2 ipc.

I still have this nagging feeling there's something
faster with multiplying by magic numbers.
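
Presumably that means the Hacker's Delight variant in which one
multiply sums the per-byte counts; a sketch for a single 64-bit word:

#include <stdint.h>

static unsigned pop64_mul(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ull);
    x = (x & 0x3333333333333333ull) + ((x >> 2) & 0x3333333333333333ull);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0Full;
    /* multiplying by the "magic number" gathers all the byte counts
       into the top byte */
    return (unsigned)((x * 0x0101010101010101ull) >> 56);
}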

-- Robert

Terence

Apr 14, 2008, 8:00:42 AM
I DO understand that if you have millions of very long bits strings
that you have to count bits in or match mbits in (as in DNA or RNA
matching) you might want a very tight efficient loop algorithm.

But then again, from a practical point of view, you still have to
consider the access to the huge strings of bits millions of times. Are
we nit-picking in the wrong place?

Are we not seeing the woods for the trees?

James Van Buskirk

Apr 14, 2008, 4:58:34 AM
"Terence" <spam...@crayne.org> wrote in message
news:25562a29-faf9-4899...@n1g2000prb.googlegroups.com...

> Someone tell me why 21 instructions and a branch, per word, is somehow
> better/faster than

I think you may be referring to Maarten Kronenburg's example, which
has 24 instructions+branch. The point you are missing is that
these complicated methods are counting several words per loop
trip. Well, MK's method is a naive version, and is surely slower
than a well-optimized 8-bit LUT would be. But the point is that
counting 64 bits could be done in just two more operations than
counting 32 bits (actually the same number in a more sophisticated
implementation) whereas the LUT method requires about twice as
many operations to count up twice the bits. The limitation of
LUTs is that they can only get so big before every access is a
cache miss, so the number of bits that can be handled in
parallel is quite limited. Parallel methods are limited by
register width which is up to 128 bits in x86 just now.

A good explanation of Maarten Kronenburg's code may be found
in the Wikipedia entry on Hamming weight,

http://en.wikipedia.org/wiki/Hamming_weight

The compression stage is explained in the old thread referenced
in my original post,

http://groups.google.com:80/group/comp.lang.asm.x86/msg/54d29f732e51de93

Not to mention

http://www.icis.ntu.edu.sg/scs-ijit/91/91-1.pdf

Well, there's so much available that if you're really interested
these references should keep you busy for a while.

James Van Buskirk

Apr 14, 2008, 4:38:07 AM
"Maarten Kronenburg" <spam...@crayne.org> wrote in message
news:4802639c$0$720$7ade...@textreader.nntp.internl.net...

> For what it's worth below I give my 32-bit assembler for popcnt.

This is a crude version of the well-known code in AMD's optimization
manual, and is not as fast as the AMD code.

> Also I peeked for popcnt in GMP 4.2.2, see:
> http://gmplib.org/
> They have an IA-64 Itanium popcnt with 1 cycle/limb (64-bit), using the
> IA-64 popcnt instruction.

I couldn't find where you got that number from. In

http://www.technovelty.org/code/arch/popcount.html

I see a result of 6 clocks per 8 bytes for Itanium 2.

> Their table lookup popcnt runs 8 cycle/limb (32-bit).

Obviously one should be able to get close to 1 byte per cycle
with a Core 2 Duo and 8-bit LUT. Only crappy code would be
short of this mark by a factor of 2.

> Clearly your xmm popcnt1 is also very fast, about 1 cycle/limb (32-bit)!

Also on a machine with throughput of one 64-bit hardware per clock
cycle it may be possible to get about 10 bytes per cycle, if the
other throughput problem I am experiencing is not also a problem in
the code I envision.

> But because it seems more complicated you should also test it for all
> possible 128-bit bit patterns.

No, the thing to do is to follow the references given and assure
oneself that the algorithm is mathematically sound. The compression
stage only needs a truth table for all combinations of three input
bits (2**3 entries) to be exhaustive, and the wrap-up stage can
be thought of as 16 independent bytes in parallel. The only
aspects that I haven't seen in previous code were the use of PSADBW
for the final sweep stage (which is, if anything, more
transparent than the multiply and shift given by AMD, but integer
registers don't have SIMD operations in the x86 ISA) and the
negative logic in the compression stage (which can be established
by an 8-entry truth table if you flunked logic circuits.)

After that, testing with extreme conditions such as all bits set
and all bits clear as well as random bits to guard against gross
errors in coding, while not exhaustive of all possibilities, is
at least persuasive to me. Besides, the algorithm mashes
together 1024 bits in each loop trip, distilling out one 128-bit
drop to count up, not to mention that the dregs of previous
iterations is an input to each successive iteration, so even
an unattainable number like 128 bits doesn't even start to
describe the complexity of the algorithm. It's explained fairly
well in the second link I gave in the seminal post of this
thread. Check it out and see what's going on here.

The Wikipedia entry on Hamming weight is pretty good for the
wrap stage of the algorithm, but the code there is almost
identical to your example, so it probably doesn't help you.

Maarten Kronenburg

Apr 14, 2008, 12:09:21 PM

"Terence" wrote in message

>
> "James Van Buskirk" wrote in message
> (code)
>
> Someone tell me why 21 instructions and a branch, per word, is somehow
> better/faster than
>

First I also didn't believe it, but somehow array elements are combined into
one element, and only that one element is counted, it's from:
http://www.hackersdelight.org/HDcode/newCode/pop_arrayHS.c
The combination function is faster than the counting function, that's the
trick.
Now I'm also trying to make my code into one from pop_array3.
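
Schematically, pop_arrayHS.c combines input words with a carry-save
adder and only counts the compressed output; a sketch of the structure
(not a verbatim copy, and the real code compresses deeper than this):

#include <stdint.h>

#define CSA(h, l, a, b, c) do {       \
    uint64_t u = (a) ^ (b);           \
    (h) = ((a) & (b)) | (u & (c));    \
    (l) = u ^ (c);                    \
} while (0)

/* Any single-word popcount will do for the compressed outputs. */
#define pop(x) ((uint64_t)__builtin_popcountll(x))

uint64_t pop_array(const uint64_t *A, int n)  /* n a multiple of 4 */
{
    uint64_t ones = 0, twos = 0, twosA, twosB, fours, tot = 0;
    for (int i = 0; i < n; i += 4) {
        CSA(twosA, ones, ones, A[i],   A[i+1]);
        CSA(twosB, ones, ones, A[i+2], A[i+3]);
        CSA(fours, twos, twos, twosA, twosB);
        tot += pop(fours);            /* each set bit is worth 4 */
    }
    return 4*tot + 2*pop(twos) + pop(ones);
}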

Jake Waskett

Apr 14, 2008, 7:19:35 AM

That seems reasonable. A word of caution, though: many pseudo-random
number routines will start to repeat their output sequence after 2^32 (or
sometimes 2^31 or even fewer) calls. Some routines also have
semi-predictable bit patterns in their output. So be sure to check that
you're *truly* checking a representative random sample of the required
size.

Maarten Kronenburg

Apr 14, 2008, 2:53:32 PM

"James Van Buskirk" wrote in message
> "Maarten Kronenburg" wrote in message

>
> > For what it's worth below I give my 32-bit assembler for popcnt.
>
> This is a crude version of the well-known code in AMD's optimization
> manual, and is not as fast as the AMD code.
>

Yes it's the basic one; I'm now putting it in xmm to see where I get.
I work on a 32-bit system so the number of registers is a limitation.

> > Also I peeked for popcnt in GMP 4.2.2, see:
> > http://gmplib.org/
> > They have an IA-64 Itanium popcnt with 1 cycle/limb (64-bit), using the
> > IA-64 popcnt instruction.
>
> I couldn't find where you got that number from. In
>

Just go to:
http://gmplib.org
and download gmp-4.2.2.tar.gz.
Then look for the files popcount.asm.
I just copied their timings, perhaps they have already improved.

> http://www.technovelty.org/code/arch/popcount.html
>
> I see a result of 6 clocks per 8 bytes for Itanium 2.
>
> > Their table lookup popcnt runs 8 cycle/limb (32-bit).
>
> Obviously one should be able to get close to 1 byte per cycle
> with a Core 2 Duo and 8-bit LUT. Only crappy code would be
> short of this mark by a factor of 2.
>
> > Clearly your xmm popcnt1 is also very fast, about 1 cycle/limb (32-bit)!
>
> Also on a machine with throughput of one 64-bit hardware per clock
> cycle it may be possible to get about 10 bytes per cycle, if the
> other throughput problem I am experiencing is not also a problem in
> the code I envision.
>
> > But because it seems more complicated you should also test it for all
> > possible 128-bit bit patterns.
>
> No, the thing to do is to follow the references given and assure
> oneself that the algorithm is mathematically sound. The compression
> stage only needs a truth table for all combinations of three input
> bits (2**3 entries) to be exhaustive, and the wrap-up stage can
> be thought of as 16 independent bytes in parallel. The only
> aspects that I haven't seen in previous code were the use of PSADBW
> for the final sweep stage (which is, if anything, more
> transparent than the multiply and shift given by AMD, but integer
> registers don't have SIMD operations in the x86 ISA) and the
> negative logic in the compression stage (which can be established
> by an 8-entry truth table if you flunked logic circuits.)
>

Yes I'm sure you are right, but a typo in assembler is easily made.

> After that, testing with extreme conditions such as all bits set
> and all bits clear as well as random bits to guard against gross
> errors in coding, while not exhaustive of all possibilities, is
> at least persuasive to me. Besides, the algorithm mashes
> together 1024 bits in each loop trip, distilling out one 128-bit
> drop to count up, not to mention that the dregs of previous
> iterations is an input to each successive iteration, so even
> an unattainable number like 128 bits doesn't even start to
> describe the complexity of the algorithm. It's explained fairly
> well in the second link I gave in the seminal post of this
> thread. Check it out and see what's going on here.
>
> The Wikipedia entry on Hamming weight is pretty good for the
> wrap stage of the algorithm, but the code there is almost
> identical to your example, so it probably doesn't help you.
>

Yes I found:
http://www.hackersdelight.org/HDcode/newCode/pop_arrayHS.c
To put those in assembler as you did is quite some work.
Perhaps you may find and eliminate dependency chains in your assembler
code; that's the normal way to optimize existing assembler, but
sometimes it's rather a puzzle.
My popcnt is part of an integer library, just like GMP, and it's linear so
its performance is not so critical, but for some other special applications
it may be critical. Amazing that you can get so far with optimizing tricks.
Regards, Maarten.

James Van Buskirk

Apr 14, 2008, 3:53:21 PM
"Wolfgang Kern" <spam...@crayne.org> wrote in message
news:ftvcf7$umq$1...@newsreader2.utanet.at...

> Even this algo suffer on many dependency stalls, it reads four bytes
> and processes 32 bits per iteration, so it reduces the costly memory
> accesses which you encounter with a byte by byte indexed table lookup.
> Granted, a 256 byte table could be easy held in cache, so it would be
> worth to benchmark and compare the performance of the two methods.

Well, I tried rewriting the 8-bit LUT in assembly to minimize
memory access, also I tried the 4-bit LUT. In the following,
popcnt1 is the topical code of this thread, popcnt2 is the
8-bit LUT, popcnt3 is popcnt1 without compression as in the AMD
manual, and popcnt4 is the 4-bit LUT:

C:\gfortran\clf\popcnt>type big_popcnt4.s

movl $32768, %edx

.globl _popcnt2
.def _popcnt2; .scl 2; .type 32; .endef
.align 16
_popcnt2:
movq %rbx, 8(%rsp)
movq %rsi, 16(%rsp)
movq %rdi, 24(%rsp)
movq %rbp, 32(%rsp)
movl (%rdx), %ebp
xorl %eax, %eax
leaq LUT(%rip), %rdi
byte_loop:
movaps (%rcx), %xmm0
movq %xmm0, %rbx
movzx %bl, %rsi
xorl %edx, %edx
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi
addb (%rsi,%rdi,1), %dl
shrq $16, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi
addb (%rsi,%rdi,1), %dl
shrq $16, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi
addb (%rsi,%rdi,1), %dl
shrq $16, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi
addb (%rsi,%rdi,1), %dl
movhlps %xmm0, %xmm0
movq %xmm0, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi
addb (%rsi,%rdi,1), %dl
shrq $16, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi
addb (%rsi,%rdi,1), %dl
shrq $16, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi
addb (%rsi,%rdi,1), %dl
shrq $16, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi
addb (%rsi,%rdi,1), %dl
movzx %dl, %edx
addq %rdx, %rax
addq $16, %rcx
subq $16, %rbp
jnz byte_loop
movq 8(%rsp), %rbx
movq 16(%rsp), %rsi
movq 24(%rsp), %rdi
movq 32(%rsp), %rbp
ret

xorps %xmm4, %xmm4

ret

.globl _popcnt4
.def _popcnt4; .scl 2; .type 32; .endef
.align 16
_popcnt4:
movaps %xmm6, 8(%rsp)
movl (%rdx), %edx
movaps nonsense(%rip), %xmm2
movaps LUT(%rip), %xmm3
xorps %xmm4, %xmm4
xorps %xmm6, %xmm6
oword_loop:
movaps (%rcx), %xmm0
movaps %xmm0, %xmm1
andps %xmm2, %xmm0
movaps %xmm3, %xmm5
pshufb %xmm0, %xmm5
psadbw %xmm4, %xmm5
paddd %xmm5, %xmm6

psrld $4, %xmm1
andps %xmm2, %xmm1

movaps %xmm3, %xmm5
pshufb %xmm1, %xmm5
psadbw %xmm4, %xmm5
paddd %xmm5, %xmm6

addq $16, %rcx
subq $16, %rdx
jnz oword_loop
movq %xmm6, %rax
movhlps %xmm6, %xmm6
movq %xmm6, %rdx
addq %rdx, %rax
movq 8(%rsp), %xmm6
ret

.data
.align 16
stuff:
.long 0xaaaaaaaa
.long 0xaaaaaaaa
.long 0xaaaaaaaa
.long 0xaaaaaaaa
nonsense:
.long 0x0f0f0f0f
.long 0x0f0f0f0f
.long 0x0f0f0f0f
.long 0x0f0f0f0f

LUT:
.byte 0,1,1,2,1,2,2,3
.byte 1,2,2,3,2,3,3,4
.byte 1,2,2,3,2,3,3,4
.byte 2,3,3,4,3,4,4,5
.byte 1,2,2,3,2,3,3,4
.byte 2,3,3,4,3,4,4,5
.byte 2,3,3,4,3,4,4,5
.byte 3,4,4,5,4,5,5,6
.byte 1,2,2,3,2,3,3,4
.byte 2,3,3,4,3,4,4,5
.byte 2,3,3,4,3,4,4,5
.byte 3,4,4,5,4,5,5,6
.byte 2,3,3,4,3,4,4,5
.byte 3,4,4,5,4,5,5,6
.byte 3,4,4,5,4,5,5,6
.byte 4,5,5,6,5,6,6,7
.byte 1,2,2,3,2,3,3,4
.byte 2,3,3,4,3,4,4,5
.byte 2,3,3,4,3,4,4,5
.byte 3,4,4,5,4,5,5,6
.byte 2,3,3,4,3,4,4,5
.byte 3,4,4,5,4,5,5,6
.byte 3,4,4,5,4,5,5,6
.byte 4,5,5,6,5,6,6,7
.byte 2,3,3,4,3,4,4,5
.byte 3,4,4,5,4,5,5,6
.byte 3,4,4,5,4,5,5,6
.byte 4,5,5,6,5,6,6,7
.byte 3,4,4,5,4,5,5,6
.byte 4,5,5,6,5,6,6,7
.byte 4,5,5,6,5,6,6,7
.byte 5,6,6,7,6,7,7,8

C:\gfortran\clf\popcnt>type big_test4.f90
program big_test4

interface
function popcnt4(x,n) bind(C)


import C_INT64_T
import C_PTR
import C_INT
implicit none

integer(C_INT64_T) popcnt4


type(C_PTR), value :: x
integer(C_INT) n

end function popcnt4
end interface

np = popcnt2(t_ptr, Nbytes)


t1 = tm1()
write(*,'(2(a,i0))') 'popcnt2 np = ', np, ' clocks = ', t1-t0
t0 = tm1()
np = popcnt3(t_ptr)
t1 = tm1()
write(*,'(2(a,i0))') 'popcnt3 np = ', np, ' clocks = ', t1-t0

t0 = tm1()
np = popcnt4(t_ptr, Nbytes)
t1 = tm1()
write(*,'(2(a,i0))') 'popcnt4 np = ', np, ' clocks = ', t1-t0
end do

end program big_test4

subroutine sieve(t, n) bind(C)
use ISO_C_BINDING
implicit none
integer(C_INT) n
integer(C_INT8_T) t(0:n-1)
integer i, lim, j

lim = sqrt(8*n+0.5_C_DOUBLE)
t = -1
t(0) = ibclr(t(0), 0)
do i = 2, lim
if(btest(t((i-1)/8),modulo(i-1,8))) then
do j = i**2, 8*n, i
t((j-1)/8) = ibclr(t((j-1)/8),modulo(j-1,8))
end do
end if
end do
end subroutine sieve

C:\gfortran\clf\popcnt>c:\gfortran\win64\bin\x86_64-pc-mingw32-gfortran -O2 big_test4.f90 big_popcnt4.s -obig_test4

C:\gfortran\clf\popcnt>big_test4 > big_test4.txt

C:\gfortran\clf\popcnt>type big_test4.txt
popcnt1 np = 23000 clocks = 7370
popcnt2 np = 23000 clocks = 47560
popcnt3 np = 23000 clocks = 15190
popcnt4 np = 23000 clocks = 17540
popcnt1 np = 23000 clocks = 7010
popcnt2 np = 23000 clocks = 47480
popcnt3 np = 23000 clocks = 15130
popcnt4 np = 23000 clocks = 17530
popcnt1 np = 23000 clocks = 6920
popcnt2 np = 23000 clocks = 47540
popcnt3 np = 23000 clocks = 15030
popcnt4 np = 23000 clocks = 17460
popcnt1 np = 23000 clocks = 6930
popcnt2 np = 23000 clocks = 47520
popcnt3 np = 23000 clocks = 15040
popcnt4 np = 23000 clocks = 17520

So we got down to about 1 byte per 1.5 clocks for the 8-bit LUT.
Quite disappointing. The 4-bit LUT is close to the AMD manual
method, though. For a processor with fast PSHUFB and no
hardware POPCNT this may be a good thing. PSHUFB unfortunately is
slow on my Core 2 Duo E6700.

Terence

Apr 14, 2008, 6:09:35 PM
On Apr 14, 6:38 pm, "James Van Buskirk" <spamt...@crayne.org> wrote:

> The Wikipedia entry on Hamming weight is pretty good for the
> wrap stage of the algorithm, but the code there is almost
> identical to your example, so it probably doesn't help you.
>

Thanks, James.
Now I see it and understand it.
There is a code width beyond which table look-up summing takes
longer.
That Wikipedia reference makes the principle and suggested algorithms
clear.
All my university time was pre-Hamming.

Terje Mathisen

Apr 15, 2008, 8:28:26 AM
Maarten Kronenburg wrote:
> "Terence" wrote in message
>> "James Van Buskirk" wrote in message
>> (code)
>>
>> Someone tell me why 21 instructions and a branch, per word, is somehow
>> better/faster than
>>
>
> First I also didn't believe it, but somehow array elements are combined into
> one element, and only that one element is counted, it's from:

There's no "somehow" at all about it:

This is a classic bitslice implementation of a FULL_ADD, i.e. 3 inputs
and 2 outputs.

Terence

Apr 15, 2008, 5:29:34 AM
I mean prior to Hamming's work with Shannon, which he became well
known for in about 1974; I graduated in 1955, 1957, 1958, 1963
and 1972.

Gerd Isenberg

Apr 15, 2008, 5:56:39 AM
Another SSE2 alternative for a 64-bit popcount, sign extending 64 bits
to 64 bytes to add them.
May be combined with add/major as well...

U32 popCount(U64 bb) {
static const U64 CACHE_ALIGN masks[8] = {
C64(0x0101010101010101), C64(0x0202020202020202),
C64(0x0404040404040404), C64(0x0808080808080808),
C64(0x1010101010101010), C64(0x2020202020202020),
C64(0x4040404040404040), C64(0x8080808080808080),
};
__m128i x0, x1, x2, x3, zr; U32 cnt;
__m128i * pM = (__m128i*) masks;
x0 = _mm_cvtsi64x_si128 ( bb );
x0 = _mm_unpacklo_epi64 ( x0, x0 );
zr = _mm_setzero_si128();
x3 = _mm_andnot_si128 ( x0, pM[3] );
x2 = _mm_andnot_si128 ( x0, pM[2] );
x1 = _mm_andnot_si128 ( x0, pM[1] );
x0 = _mm_andnot_si128 ( x0, pM[0] );
x3 = _mm_cmpeq_epi8 ( x3, zr );
x2 = _mm_cmpeq_epi8 ( x2, zr );
x1 = _mm_cmpeq_epi8 ( x1, zr );
x0 = _mm_cmpeq_epi8 ( x0, zr );
x2 = _mm_add_epi8 ( x2, x3 );
x0 = _mm_add_epi8 ( x0, x1 );
x0 = _mm_add_epi8 ( x0, x2 );
x0 = _mm_sad_epu8 ( x0, zr );
cnt = -_mm_cvtsi128_si32( x0 )
-_mm_extract_epi16( x0, 4 );
return cnt & 255;
}

bb$ = 8
?popCount@@YAI_K@Z PROC
00000 66 0f ef db pxor xmm3, xmm3
00004 66 48 0f 6e d1 movd xmm2, rcx
00009 66 0f 6c d2 punpcklqdq xmm2, xmm2
0000d 66 0f 6f e2 movdqa xmm4, xmm2
00011 66 0f 6f c2 movdqa xmm0, xmm2
00015 66 0f 6f ca movdqa xmm1, xmm2
00019 66 0f df 15 30 00 00 00 pandn xmm2, XMMWORD PTR ?masks+48
00021 66 0f df 25 00 00 00 00 pandn xmm4, XMMWORD PTR ?masks
00029 66 0f df 0d 20 00 00 00 pandn xmm1, XMMWORD PTR ?masks+32
00031 66 0f df 05 10 00 00 00 pandn xmm0, XMMWORD PTR ?masks+16
00039 66 0f 74 e3 pcmpeqb xmm4, xmm3
0003d 66 0f 74 c3 pcmpeqb xmm0, xmm3
00041 66 0f 74 cb pcmpeqb xmm1, xmm3
00045 66 0f fc e0 paddb xmm4, xmm0
00049 66 0f 74 d3 pcmpeqb xmm2, xmm3
0004d 66 0f fc ca paddb xmm1, xmm2
00051 66 0f fc e1 paddb xmm4, xmm1
00055 66 0f f6 e3 psadbw xmm4, xmm3
00059 66 0f 7e e0 movd eax, xmm4
0005d 66 0f c5 cc 04 pextrw ecx, xmm4, 4
00062 03 c8 add ecx, eax
00064 f7 d9 neg ecx
00066 0f b6 c1 movzx eax, cl
00069 c3 ret 0
?popCount@@YAI_K@Z ENDP

http://chessprogramming.wikispaces.com/SSE2#SSE2dotproduct
http://chessprogramming.wikispaces.com/Population+Count

Gerd

Maarten Kronenburg

Apr 15, 2008, 11:13:38 AM

"Maarten Kronenburg" wrote in message
>
> "James Van Buskirk" wrote in message
> > There was a thread about population count in comp.lang.fortran but
> > nobody seemed to like my assembly code over there:
>
> For what it's worth below I give my 32-bit assembler for popcnt.

The assembler code I gave above, but in mmx and using psadbw as you
suggested, runs 5 cycles/32-bit.
Now I'm trying to put it in xmm and hope to do even a little better;
I will post the result here.
Regards, Maarten

Maarten Kronenburg

Apr 15, 2008, 3:58:21 PM

"Maarten Kronenburg" wrote in message
>

Now I have the xmm version; it runs a little below 3 cycles/32-bit.
Tomorrow after some more testing I will put it here.

James Van Buskirk

Apr 16, 2008, 2:33:16 AM
"Gerd Isenberg" <spam...@crayne.org> wrote in message
news:61cb8b24-6249-49eb...@t54g2000hsg.googlegroups.com...

> Another SSE2 alternative for a 64-bit popcount, sign extending 64 bits
> to 64 bytes to add them.
> May be combined with add/major as well...

> http://chessprogramming.wikispaces.com/Population+Count

This didn't seem too promising, but I tried it anyhow.

interface
function popcnt5(x,n) bind(C)


import C_INT64_T
import C_PTR
import C_INT
implicit none

integer(C_INT64_T) popcnt5


type(C_PTR), value :: x
integer(C_INT) n

end function popcnt5
end interface

.align 16
_popcnt5:
movaps %xmm6, 8(%rsp)
leaq masks(%rip), %rax
movl (%rdx), %edx
xorps %xmm3, %xmm3
xorps %xmm6, %xmm6
.align 16
qword_loop:
movaps (%rcx), %xmm5
movddup %xmm5, %xmm2
movaps %xmm2, %xmm4
movaps %xmm2, %xmm0
movaps %xmm2, %xmm1
andnps 48(%rax), %xmm2
andnps (%rax), %xmm4
andnps 32(%rax), %xmm1
andnps 16(%rax), %xmm0
pcmpeqb %xmm3, %xmm4
pcmpeqb %xmm3, %xmm0
pcmpeqb %xmm3, %xmm1
paddb %xmm0, %xmm4
pcmpeqb %xmm3, %xmm2
paddb %xmm1, %xmm2
paddb %xmm4, %xmm2

movhlps %xmm5, %xmm5
movaps %xmm5, %xmm4
movaps %xmm5, %xmm0
movaps %xmm5, %xmm1
andnps 48(%rax), %xmm5
andnps (%rax), %xmm4
andnps 32(%rax), %xmm1
andnps 16(%rax), %xmm0
pcmpeqb %xmm3, %xmm4
pcmpeqb %xmm3, %xmm0
pcmpeqb %xmm3, %xmm1
paddb %xmm0, %xmm4
pcmpeqb %xmm3, %xmm5
paddb %xmm1, %xmm5
paddb %xmm4, %xmm5

paddb %xmm5, %xmm2
psadbw %xmm3, %xmm2
pslld $24, %xmm2
psrad $24, %xmm2
psubd %xmm2, %xmm6

addq $16, %rcx
subq $16, %rdx

jnz qword_loop

movq %xmm6, %rax
movhlps %xmm6, %xmm6
movq %xmm6, %rdx
addq %rdx, %rax

movaps 8(%rsp), %xmm6
ret

masks:
.long 0x01010101,0x01010101,0x02020202,0x02020202
.long 0x04040404,0x04040404,0x08080808,0x08080808
.long 0x10101010,0x10101010,0x20202020,0x20202020
.long 0x40404040,0x40404040,0x80808080,0x80808080

Also I tweaked the 8-bit LUT:

.align 16
_popcnt2:
movq %rbx, 8(%rsp)
movq %rsi, 16(%rsp)
movq %rdi, 24(%rsp)
movq %rbp, 32(%rsp)
movl (%rdx), %ebp

xorl %r8d, %r8d
leaq LUT(%rip), %rdi
.align 16
byte_loop:
# movaps (%rcx), %xmm0 # 1 oword load
# movq %xmm0, %rbx # 1 oword load
movq (%rcx), %rbx # 2 qword loads
movzx %bl, %rsi
xorq %rdx, %rdx
movb (%rsi,%rdi,1), %dl
movzx %bh, %esi
movb (%rsi,%rdi,1), %al


shrq $16, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi

addb (%rsi,%rdi,1), %al


shrq $16, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi

addb (%rsi,%rdi,1), %al


shrq $16, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi

addb (%rsi,%rdi,1), %al
# movhlps %xmm0, %xmm0 # 1 oword load
# movq %xmm0, %rbx # 1 oword load
movq 8(%rcx), %rbx # 2 qword loads


movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi

addb (%rsi,%rdi,1), %al


shrq $16, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi

addb (%rsi,%rdi,1), %al


shrq $16, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi

addb (%rsi,%rdi,1), %al


shrq $16, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi

addb (%rsi,%rdi,1), %al
addb %al, %dl
addq %rdx, %r8


addq $16, %rcx
subq $16, %rbp
jnz byte_loop

movq %r8, %rax


movq 8(%rsp), %rbx
movq 16(%rsp), %rsi
movq 24(%rsp), %rdi
movq 32(%rsp), %rbp
ret

And the 4-bit LUT:

.align 16
_popcnt4:
movaps %xmm6, 8(%rsp)
movl (%rdx), %edx
movaps nonsense(%rip), %xmm2

xorps %xmm3, %xmm3
xorps %xmm6, %xmm6
.align 16
oword_loop:
movaps (%rcx), %xmm0
movaps %xmm2, %xmm1
andnps %xmm0, %xmm1
movaps LUT(%rip), %xmm5
andps %xmm2, %xmm0
movaps %xmm5, %xmm4
pshufb %xmm0, %xmm5

psrld $4, %xmm1

pshufb %xmm1, %xmm4
paddd %xmm4, %xmm5
psadbw %xmm3, %xmm5
paddd %xmm5, %xmm6

addq $16, %rcx
subq $16, %rdx
jnz oword_loop

movq %xmm6, %rax
movhlps %xmm6, %xmm6
movq %xmm6, %rdx
addq %rdx, %rax

movaps 8(%rsp), %xmm6
ret

(Fit the above into code from
http://groups.google.com/group/comp.lang.asm.x86/msg/0cd4b133fd17c86a
)

With result:

popcnt1 np = 23000 clocks = 7150
popcnt2 np = 23000 clocks = 43240


popcnt3 np = 23000 clocks = 15120

popcnt4 np = 23000 clocks = 15660
popcnt5 np = 23000 clocks = 37870
popcnt1 np = 23000 clocks = 6950
popcnt2 np = 23000 clocks = 43150
popcnt3 np = 23000 clocks = 15140
popcnt4 np = 23000 clocks = 15600
popcnt5 np = 23000 clocks = 38040
popcnt1 np = 23000 clocks = 6930
popcnt2 np = 23000 clocks = 43120
popcnt3 np = 23000 clocks = 14980
popcnt4 np = 23000 clocks = 15570
popcnt5 np = 23000 clocks = 37540


popcnt1 np = 23000 clocks = 6960

popcnt2 np = 23000 clocks = 43100
popcnt3 np = 23000 clocks = 15040
popcnt4 np = 23000 clocks = 15560
popcnt5 np = 23000 clocks = 37490

So [my adaptation of] your code is getting 32768/37500 = 0.87 bytes
per clock. Faster than my improved 8-bit LUT at 0.76 bytes per clock
but still running with the Slowskys. OTOH, I have gotten the 4-bit
LUT up to nearly the speed of the SWAR method (as the above-referenced
web page styles it) so it may be a candidate for the wrap-up stage of
the compression-accelerated code.
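
In intrinsics, the 4-bit-LUT kernel corresponds to something like the
following sketch (SSSE3 is needed for _mm_shuffle_epi8, i.e. PSHUFB;
the per-byte counts still need the PSADBW accumulation shown in the
assembly above):

#include <tmmintrin.h>   /* SSSE3 */

/* Per-byte popcount of v via two PSHUFB nibble lookups. */
static __m128i popcnt_bytes(__m128i v)
{
    const __m128i lut  = _mm_setr_epi8(0,1,1,2,1,2,2,3,
                                       1,2,2,3,2,3,3,4);
    const __m128i mask = _mm_set1_epi8(0x0F);
    __m128i lo = _mm_and_si128(v, mask);
    __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), mask);
    return _mm_add_epi8(_mm_shuffle_epi8(lut, lo),
                        _mm_shuffle_epi8(lut, hi));
}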

Wolfgang Kern

Apr 16, 2008, 9:34:09 AM

"James Van Buskirk" <spam...@crayne.org> schrieb im Newsbeitrag
news:P9adnbQZvO6pK57V...@comcast.com...

> "Wolfgang Kern" <spam...@crayne.org> wrote in message
> news:ftvcf7$umq$1...@newsreader2.utanet.at...

>> Even this algo suffer on many dependency stalls, it reads four bytes
>> and processes 32 bits per iteration, so it reduces the costly memory
>> accesses which you encounter with a byte by byte indexed table lookup.
>> Granted, a 256 byte table could be easy held in cache, so it would be
>> worth to benchmark and compare the performance of the two methods.

> Well, I tried rewriting the 8-bit LUT in assembly to minimize
> memory access, also I tried the 4-bit LUT. In the following,
> popcnt1 is the topical code of this thread, popcnt2 is the
> 8-bit LUT, popcnt3 is popcnt1 without compression as in the AMD
> manual, and popcnt4 is the 4-bit LUT:

I see in your later posts you found already faster solutions ...

[code ..]
when I look at this table ...
> LUT:
> .byte 0,1,1,2,1,2,2,3 00 1.
> .byte 1,2,2,3,2,3,3,4 08 2.
> .byte 1,2,2,3,2,3,3,4 10 =08 [2
> .byte 2,3,3,4,3,4,4,5 18 3.

> .byte 1,2,2,3,2,3,3,4 20 = 08 [2
> .byte 2,3,3,4,3,4,4,5 28 = 18 [3
> .byte 2,3,3,4,3,4,4,5 30 = 18 [3
> .byte 3,4,4,5,4,5,5,6 38 4.

> .byte 1,2,2,3,2,3,3,4 40 = 08 [2
> .byte 2,3,3,4,3,4,4,5 48 = 18 [3
> .byte 2,3,3,4,3,4,4,5 50 = 18 [3
> .byte 3,4,4,5,4,5,5,6 58 = 38 [4

> .byte 2,3,3,4,3,4,4,5 60 = 18 [3
> .byte 3,4,4,5,4,5,5,6 68 = 38 [4
> .byte 3,4,4,5,4,5,5,6 70 = 38 [4
> .byte 4,5,5,6,5,6,6,7 78 5.

> .byte 1,2,2,3,2,3,3,4 80 = 08 [2
> .byte 2,3,3,4,3,4,4,5 88 = 18 [3
> .byte 2,3,3,4,3,4,4,5 90 = 18 [3
> .byte 3,4,4,5,4,5,5,6 98 = 38 [4

> .byte 2,3,3,4,3,4,4,5 a0 = 18 [3
> .byte 3,4,4,5,4,5,5,6 a8 = 38 [4
> .byte 3,4,4,5,4,5,5,6 b0 = 38 [4
> .byte 4,5,5,6,5,6,6,7 b8 = 78 [5

> .byte 2,3,3,4,3,4,4,5 c0 = 18 [3
> .byte 3,4,4,5,4,5,5,6 c8 = 38 [4
> .byte 3,4,4,5,4,5,5,6 d0 = 38 [4
> .byte 4,5,5,6,5,6,6,7 d8 = 78 [5

> .byte 3,4,4,5,4,5,5,6 e0 = 38 [4
> .byte 4,5,5,6,5,6,6,7 e8 = 78 [5
> .byte 4,5,5,6,5,6,6,7 f0 = 78 [5
> .byte 5,6,6,7,6,7,7,8 f8 6.

I wonder if a few calculations won't reduce the LUT to 6*8 bytes
or even smaller with a double '0,1,1,2' index conversion,
which then could be held in registers ...
Haven't checked on the compressed method in detail, so this
may be in use already anyway.

[code..]

> So we got down to about 1 byte per 1.5 clocks for the 8-bit LUT.
> Quite disappointing. The 4-bit LUT is close to the AMD manual
> method, though. For a processor with fast PSHUFB and no
> hardware POPCNT this may be a good thing. PSHUFB unfortunately is
> slow on my Core 2 Duo E6700.

My AMD64 docs show latency/throughput = 2(5)/1 for POPCNT, and I see
this as 0.25 cycles per byte on GPR64 and 0.625 on mem64, .. not bad
and hard to beat with an algo.

__
wolfgang


Maarten Kronenburg

Apr 16, 2008, 8:42:37 AM

"James Van Buskirk" wrote in message
> There was a thread about population count in comp.lang.fortran but
> nobody seemed to like my assembly code over there:
>

Below is my xmm version, it runs a little under 3 cycles/32-bit.
It's not as fast as your code (1 cycle/32-bit), but on a 32-bit system there
are not enough xmm registers to do the CSA trick, so this is how far I get.
As mentioned pdata is the pointer and ndata is the number of 32-bit words.
This is for lzasm, for nasm replace @@ with .label and [pdata] [ndata] with
[esp+16] [esp+20],
and remove PROC and the ARG and ENDP lines.

PROC __int_popcnt
ARG pdata:dword, ndata:dword
push ebp
mov ebp, esp

push esi
push edi
mov esi, [pdata]
mov edi, [ndata]
pxor xmm7, xmm7
mov eax, 0F0F0F0Fh
movd xmm6, eax
punpckldq xmm6, xmm6
unpcklpd xmm6, xmm6
movdqa xmm5, xmm6
pslld xmm5, 2
pxor xmm5, xmm6
movdqa xmm4, xmm5
pslld xmm4, 1
pxor xmm4, xmm5
pxor xmm0, xmm0
sub esi, 16
mov ecx, edi
and edi, 0FFFFFFFCh
mov eax, edi
lea eax, [eax+4]
jz short @@3
@@1:
movdqu xmm1, [esi+4*edi]
@@2:
movdqa xmm2, xmm1
psrld xmm2, 1
pand xmm1, xmm4
pand xmm2, xmm4
paddd xmm1, xmm2
movdqa xmm2, xmm1
psrld xmm2, 2
pand xmm1, xmm5
pand xmm2, xmm5
paddd xmm1, xmm2
movdqa xmm2, xmm1
psrld xmm2, 4
paddd xmm1, xmm2
pand xmm1, xmm6
psadbw xmm1, xmm7
paddd xmm0, xmm1
sub edi, 4
jnz short @@1
@@3:
and ecx, 3
jz short @@5
mov edi, 4
dec ecx
jnz short @@4
movd xmm1, [esi+4*eax]
jmp short @@2
@@4:
movq xmm1, [esi+4*eax]
dec ecx
jz short @@2
movd xmm2, [esi+4*eax+8]
movlhps xmm1, xmm2
dec ecx
jmp short @@2
@@5:
movhlps xmm1, xmm0
paddd xmm0, xmm1
movd eax, xmm0
pop edi
pop esi

James Van Buskirk

Apr 16, 2008, 12:05:48 PM
"Wolfgang Kern" <spam...@crayne.org> wrote in message
news:fu4v9j$b5g$1...@newsreader2.utanet.at...

> "James Van Buskirk" <spam...@crayne.org> schrieb im Newsbeitrag
> news:P9adnbQZvO6pK57V...@comcast.com...

> when I look at this table ...

I think I see what you are driving at. Actually if you look
closer at the 4-bit LUT code it holds the first 16 bytes of the
LUT in one register, and looks up bytes in two stages. Look at
the _popcnt4 function (latest version to be found at
http://groups.google.com/group/comp.lang.asm.x86/msg/ee1ed0bbd6f35874
) and it may already be doing what you are proposing.

> My AMD64 docs show latency/throughput = 2(5)/1 for POPCNT, and I see
> this as 0.25 cycles per byte on GPR64 and 0.625 on mem64, .. not bad
> and hard to beat with an algo.

This doesn't beat the _popcnt1 version, which is getting 4.7 bytes
per clock, but the latency/throughput numbers quoted above for
Phenom imply almost 8 bytes/clock to me:

popcnt (%rcx), %rbx
addq %rbx, %rax
popcnt 8(%rcx), %rbx
addq %rbx, %rax
...

But I think it can go even faster because Phenom should be capable
of 3 DirectPath operations per cycle as well as two loads, and the
above sequence only uses one load and two DirectPath operations
per cycle. Thus every third instruction could be devoted to
a compression algorithm like the one seen in _popcnt1, whose
final counting scheme could be accelerated with hardware popcount.
Of course my attempts to get Core 2 Duo to sustain 3 useful
calculational operations per cycle have been in vain, so there may
also be throughput issues with Phenom that prevent this idea from
becoming reality, but if anyone out there has a Phenom running
64-bit Windows I would be eager to run some tests...

Gerd Isenberg

Apr 16, 2008, 12:58:11 PM
Thanks for giving it a try.

The routine was intrinsically a waste product from a byte[64] times
bit[64] dot-product and was not really considered a serious candidate
for population counting. Anyway, the independent instruction chains
were tempting compared to the dependencies of the SWAR one - if you
count some 64-bit words now and then, but no streams - with a lot of
other memory traffic around polluting the caches. But of course the
popcnt instruction will do much better if available for that purpose.

Btw. you seem to introduce some penalties by mixing float and integer
instructions, e.g. andnps instead of pandn, movaps instead of movdqa
etc. See the intel optimization manual 5.1 General Rules On SIMD
Integer code. Not sure whether this is an issue with pxor versus xorps
as well, related to your other question on pxor throughput on core 2.

Gerd

Maarten Kronenburg

unread,
Apr 16, 2008, 1:38:21 PM4/16/08
to

"Maarten Kronenburg" wrote in message
>

As always there is some room for improvement,
below the loop runs the other way which is a bit better:

PROC __int_popcnt
ARG pdata:dword, ndata:dword
push ebp
mov ebp, esp
push esi
push edi
mov esi, [pdata]
mov edi, [ndata]
pxor xmm7, xmm7
mov eax, 0F0F0F0Fh
movd xmm6, eax
punpckldq xmm6, xmm6
unpcklpd xmm6, xmm6
movdqa xmm5, xmm6
pslld xmm5, 2
pxor xmm5, xmm6
movdqa xmm4, xmm5
pslld xmm4, 1
pxor xmm4, xmm5
pxor xmm0, xmm0

mov ecx, edi
and edi, 0FFFFFFFCh

lea esi, [esi+4*edi]
neg edi


jz short @@3
@@1:
movdqu xmm1, [esi+4*edi]
@@2:
movdqa xmm2, xmm1
psrld xmm2, 1
pand xmm1, xmm4
pand xmm2, xmm4
paddd xmm1, xmm2
movdqa xmm2, xmm1
psrld xmm2, 2
pand xmm1, xmm5
pand xmm2, xmm5
paddd xmm1, xmm2
movdqa xmm2, xmm1
psrld xmm2, 4
paddd xmm1, xmm2
pand xmm1, xmm6
psadbw xmm1, xmm7
paddd xmm0, xmm1

add edi, 4


jnz short @@1
@@3:
and ecx, 3
jz short @@5

mov edi, 0FFFFFFFCh


dec ecx
jnz short @@4

movd xmm1, [esi]
jmp short @@2
@@4:
movq xmm1, [esi]


dec ecx
jz short @@2

movd xmm2, [esi+8]

James Van Buskirk

unread,
Apr 16, 2008, 2:41:36 PM4/16/08
to
"Maarten Kronenburg" <spam...@crayne.org> wrote in message
news:4805f444$0$717$7ade...@textreader.nntp.internl.net...

> Below is my xmm version, it runs a little under 3 cycles/32-bit.
> It's not as fast as your code (1 cycle/32-bit), but on 32-bit system there
> are not enough xmm registers to do the CSA trick, so this is how far I
> get.

> @@1:


> movdqu xmm1, [esi+4*edi]
> @@2:
> movdqa xmm2, xmm1
> psrld xmm2, 1
> pand xmm1, xmm4
> pand xmm2, xmm4
> paddd xmm1, xmm2
> movdqa xmm2, xmm1
> psrld xmm2, 2
> pand xmm1, xmm5
> pand xmm2, xmm5
> paddd xmm1, xmm2
> movdqa xmm2, xmm1
> psrld xmm2, 4
> paddd xmm1, xmm2
> pand xmm1, xmm6
> psadbw xmm1, xmm7
> paddd xmm0, xmm1
> sub edi, 4
> jnz short @@1

3 cycles per 4 bytes is pretty slow. What kind of processor are you
testing with? There are some tweaks that can speed up your code.
Firstly, look at how the first stage where 1-bit counters are
combined to form 2-bits counters is performed in the AMD manual.
Notice that they use a subtraction instead of an addition, a
clever optimization that can save you an AND.
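
In scalar form the subtraction trick looks like this: each 2-bit field
holding bits (a,b) has value 2a+b, and its bit count is a+b = (2a+b) - a,
so one AND can be dropped:

#include <stdint.h>

static uint32_t step1_add(uint32_t x)  /* symmetric form: two ANDs */
{
    return (x & 0x55555555u) + ((x >> 1) & 0x55555555u);
}

static uint32_t step1_sub(uint32_t x)  /* AMD-manual form: one AND */
{
    return x - ((x >> 1) & 0x55555555u);
}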

You might think that it's a good thing to keep all your constants in
registers, but that is more problematic on Intel processors where
you have only limited ROB read bandwidth. What that amounts to is
you can only read at most two registers per cycle than aren't "in
flight" or three if the third is used as an index. See 248966.pdf,
section 3.5.2.1 on ROB read port stalls. Originally my code had all
the constants in registers but I moved some of them to memory with
consequent speed-up. Also you may consider making edi 4X as big
and addressing as movdqu xmm1, [edi+1*esi] so that your dead
register is the index register. Thus it won't participate in ROB
read port stalls.

Also when you have to perform a movaps to copy a value because of
the 2-register ISA we are laboring under (movaps is one byte shorter
than movdqa which is why it is more freqently seen, even in integer
code) copy a constant rather than the in-flight variable. This
allows the processor to issue the copy instruction out of order
well in advance of where it's needed so that it can't get in the way
of the critical path instructions.

After you have moved some of the constants from registers to memory
and tweaked your code a little, you may find that as a side effect
of having constants in memory you have now some free registers so
that you may be able to do a little CSA compression after all.

James Van Buskirk

Apr 16, 2008, 2:59:38 PM
"Gerd Isenberg" <spam...@crayne.org> wrote in message
news:fdc6b0bd-c736-46fc...@m73g2000hsh.googlegroups.com...

> Btw. you seem to introduce some penalties by mixing float and integer
> instructions, e.g. andnps instead of pandn, movaps instead of movdqa
> etc. See the intel optimization manual 5.1 General Rules On SIMD
> Integer code. Not sure whether this is an issue with pxor versus xorps
> as well, related to your other question on pxor throughput on core 2.

Agner Fog seems to think of this as an old wives' tale. If you look
in instruction_tables.pdf on his web site he gives the unit (flt or
int) that the instruction goes to and all logical operations: pandn
andnps, andnpd as well as pxor, xorps, and xorpd, go to the int unit.
This is annoying in a sequence such as

mulpd xmm0, xmm1
xorpd xmm0, xmm2
addpd xmm0, xmm3

The xorpd instruction to negate xmm0 before the addition takes three
clocks because its input came from the flt unit and its output will
go to the flt unit, the stall occurring even though it has 'pd' in its
name. Same with movdqa, movaps, movapd reg, reg: all 6 forms go through
the int unit. May be different on AMD processors, I haven't had the
opportunity to check timing code there. As a result, I normally
use the packed-single rather than the packed-double or integer form
of the instructions because it's one byte shorter. One can always
change things around and time everything again to see if it makes a
positive or negative difference.

Maarten Kronenburg

Apr 16, 2008, 3:39:47 PM

"James Van Buskirk" wrote in message
> "Maarten Kronenburg" wrote in message
>

> 3 cycles per 4 bytes is pretty slow. What kind of processor are you


> testing with? There are some tweaks that can speed up your code.
> Firstly, look at how the first stage where 1-bit counters are
> combined to form 2-bits counters is performed in the AMD manual.
> Notice that they use a subtraction instead of an addition, a
> clever optimization that can save you an AND.
>

Yes I saw that also in your code, I will try to study and understand this.
This would indeed save me one and in the loop.

> You might think that it's a good thing to keep all your constants in
> registers, but that is more problematic on Intel processors where
> you have only limited ROB read bandwidth. What that amounts to is
> you can only read at most two registers per cycle than aren't "in
> flight" or three if the third is used as an index. See 248966.pdf,
> section 3.5.2.1 on ROB read port stalls. Originally my code had all
> the constants in registers but I moved some of them to memory with
> consequent speed-up. Also you may consider making edi 4X as big
> and addressing as movdqu xmm1, [edi+1*esi] so that your dead
> register is the index register. Thus it won't participate in ROB
> read port stalls.

OK I will try that also.

>
> Also when you have to perform a movaps to copy a value because of
> the 2-register ISA we are laboring under (movaps is one byte shorter
> than movdqa which is why it is more freqently seen, even in integer
> code) copy a constant rather than the in-flight variable. This
> allows the processor to issue the copy instruction out of order
> well in advance of where it's needed so that it can't get in the way
> of the critical path instructions.

Yes I understand, then the copy is out of the dependency chain.

>
> After you have moved some of the constants from registers to memory
> and tweaked your code a little, you may find that as a side effect
> of having constants in memory you have now some free registers so
> that you may be able to do a little CSA compression after all.
>
> --

Thanks for your suggestions, I have some work to do.
My library has many other assembler routines for multiplication etc,
optimizing those also means getting instructions out of the dependency
chain. Mostly the first step is getting a working version, then the second
step is this kind of optimization.
Regards, Maarten.

Maarten Kronenburg

Apr 17, 2008, 10:43:31 AM

"Maarten Kronenburg" wrote in message
>
>
Optimizing the loop instructions improves from 2.8 to 2.4 cycles/32-bit.
The first lines in the loop should be:
movdqa xmm2, xmm4
pandn xmm2, xmm1
psrld xmm2, 1
psubd xmm1, xmm2
movdqa xmm2, xmm5
pandn xmm2, xmm1

psrld xmm2, 2
pand xmm1, xmm5
paddd xmm1, xmm2
; ...
Another function that may be optimized this way is bit reversal.
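
For bit reversal, the analogous SWAR sequence would be something like
this sketch (32-bit scalar version; swap fields of increasing width):

#include <stdint.h>

static uint32_t bitrev32(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
    return (x >> 16) | (x << 16);
}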
