
Population count in SSE2, again


James Van Buskirk

Apr 12, 2008, 5:14:45 AM
There was a thread about population count in comp.lang.fortran but
nobody seemed to like my assembly code over there:

http://groups.google.com/group/comp.lang.fortran/msg/271b9850acfafabc

Looking around, I found an old thread:

http://groups.google.com/group/comp.lang.asm.x86/msg/54d29f732e51de93

which was very similar to the code I posted, so I thought I would
post a revision of my code along with an explanation of how it is
different.

Of course the most obvious difference is that my code is actual
SSE2 code, whereas the older code was 32-bit C code. SSE2 has the
surprising PSADBW instruction which sweeps the floor after 8-bit
sums have been completed in my wrap() subroutine (analogous to the
pop() function in the older code). It's not clear how you are
supposed to tease a compiler into generating this instruction;
Fortran has a way to request it in a single expression but it is
unfathomable code that the compiler probably wouldn't convert in
the desired way, so the normal thing you see is a multiply and
shift as in the AMD Software Optimization Guide for AMD64
Processors.
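
For readers who prefer intrinsics, the PSADBW sweep amounts to
something like the following sketch (the function name is mine;
_mm_sad_epu8 is the SSE2 intrinsic for PSADBW):

#include <emmintrin.h>  /* SSE2 */

/* Sum the 16 per-byte counts in v down to a scalar.  PSADBW against
   zero leaves one 16-bit partial sum per qword. */
static unsigned sum_bytes(__m128i v)
{
    __m128i zero = _mm_setzero_si128();
    __m128i sad  = _mm_sad_epu8(v, zero);         /* PSADBW */
    return (unsigned)_mm_cvtsi128_si32(sad)       /* low qword sum  */
         + (unsigned)_mm_extract_epi16(sad, 4);   /* high qword sum */
}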

But the big difference lies in implementation of the Carry-Save
Adder (CSA) compression scheme. You will recall that the
fundamental sequence of operations, starting with inputs A, B,
and C is:

L = A .XOR. B
H = A .AND. B
H = H .OR. (C .AND. L)
L = C .XOR. L

Now, on a 3-register ISA such as Alpha, this sequence can be carried
out in 5 instructions, but there is a complication in a 2-register
ISA like the topical SSE2 because you need both A .XOR. B and
A .AND. B, but SSE2 always overwrites one of its operands. Same
with C .XOR. L and C .AND. L, so it seems that you will need to
issue 2 copy instructions so that the sequence will take 7
instructions instead of 5.
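
In portable C, where the two-operand restriction doesn't bite, the
canonical step is just a bitwise full adder; a minimal sketch,
following Hacker's Delight:

#include <stdint.h>

/* One carry-save adder step over words a, b, c: l receives the sum
   bits, h the carry bits -- exactly the 4-operation sequence above. */
#define CSA(h, l, a, b, c) do {       \
    uint64_t u = (a) ^ (b);           \
    (h) = ((a) & (b)) | (u & (c));    \
    (l) = u ^ (c);                    \
} while (0)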

I was thinking about this the other day and one of the things that
bugs me about the MMX/SSE ISA is that the PANDN instruction is
backwards. That is, the sequence:

A = A .XOR. B
B = B .AND. (.NOT. A)

would leave A and B in the roles of L and H respectively in the
canonical sequence of Harley (et al?), but alas, the PANDN
instruction negates the destination register rather than the source.
I couldn't see any way to make this work until it occurred to me
that the thing to do was to negate the starting register and then
proceed from there:

L = .TRUE.
...
L = L .XOR. A
A = A .AND. L
L = L .XOR. B
B = B .AND. L
B = A .OR. B

This scheme, as implemented in function popcnt1() in my code,
is completely analogous to the pop_array5() function in Andrew
Felch's old post:

current   old
-------   ------
L         ones
A         A[i]
B         A[i+1]
B         twosA

for the inputs and outputs of the first CSA iteration. The
difference is that I start out with ones = .NOT. .FALSE. so
that on the first iteration I get

L = L .XOR. A = .TRUE. .XOR. A0 = .NOT. A0 ; Negation of current sum
A = A .AND. L = A0 .AND. (.NOT. A0) = .FALSE. ; Correct carry!
L = L .XOR. B = (.NOT. A0) .XOR. B0
= .NOT. (A0 .XOR. B0) ; Negation of current sum!
B = B .AND. L = B0 .AND. ((.NOT. A0) .XOR. B0)
= B0 .AND. (((.NOT. A0) .AND. (.NOT. B0)) .OR. (A0 .AND. B0))
= (B0 .AND. ((.NOT. A0) .AND. (.NOT. B0))) .OR. (B0 .AND. (A0 .AND. B0))
= (.FALSE.) .OR. (A0 .AND. B0) = A0 .AND. B0; Correct carry!
B = A .OR. B = .FALSE. .OR. (A0 .AND. B0)
= A0 .AND. B0; Correct twosA
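
A brute-force check of the negative-logic sequence over all 2**3
input combinations, as a C sketch (here L models the complemented
running sum, so the general step is covered, not just the first
iteration):

#include <assert.h>

int main(void)
{
    for (int a = 0; a <= 1; a++)
    for (int b = 0; b <= 1; b++)
    for (int nl = 0; nl <= 1; nl++) {   /* nl = complement of sum bit */
        int L = nl, A = a, B = b;
        L = L ^ A;                      /* xorps */
        A = A & L;                      /* andps */
        L = L ^ B;                      /* xorps */
        B = B & L;                      /* andps */
        B = A | B;                      /* orps  */
        int total = !nl + a + b;        /* true sum of the three bits */
        assert(L == !(total & 1));      /* L: complemented sum bit    */
        assert(B == (total >> 1));      /* B: correct carry           */
    }
    return 0;
}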

Thinking about this, the reader can envision that simply starting
with fours = twos = ones = ~0 (negative logic) in function pop_array5
in the old code permits us to carry out all those CSA operations in 5
SSE2 instructions each rather than the 7 which were necessary with
positive logic. Of course the line:

tot = 8*tot + 4*pop(fours) + 2*pop(twos) + pop(ones);

is replaced by:

tot = 8*tot - 4*pop(fours) - 2*pop(twos) - pop(ones) + 896;

to correct for the negative logic: each counter register holds the
complement of its true value, and pop(.NOT. x) = 128 - pop(x) for a
128-bit register, so the constant is the total weighted capacity
(4+2+1)*128 = 896. Hence the overhead for eliminating 2 out of 7
inner-loop instructions is only the addition of the total number of
weighted bits in all counters, just one instruction outside the loop.
The implementation of negative logic reduced my test of counting bits
in 32768 bytes from about 9100 clocks to about 6950 clocks on my
Core 2 Duo E6700. Here it is in GAS with a gfortran driver program:

C:\gfortran\clf\popcnt>type big_popcnt3.s
.text
.globl _tm1
.def _tm1; .scl 2; .type 32; .endef
_tm1:
rdtsc
shrq $32, %rdx
orq %rdx, %rax
ret
.globl _popcnt1
.def _popcnt1; .scl 2; .type 32; .endef
.align 16
_popcnt1:
subq $56, %rsp
movaps %xmm6, (%rsp)
movaps %xmm7, 16(%rsp)
movaps %xmm8, 32(%rsp)
movaps %xmm14, 64(%rsp)
movaps %xmm15, 80(%rsp)
movq $0x3333333333333333, %rax
movq %rax, %xmm0
movddup %xmm0, %xmm15
xorps %xmm14, %xmm14

pcmpeqd %xmm1, %xmm1
pcmpeqd %xmm3, %xmm3
pcmpeqd %xmm5, %xmm5
xorps %xmm7, %xmm7

movl $32768, %edx

.align 16
loop1: movaps (%rcx), %xmm6
xorps %xmm6, %xmm1
andps %xmm1, %xmm6
movaps 16(%rcx), %xmm0
xorps %xmm0, %xmm1
andps %xmm1, %xmm0
orps %xmm0, %xmm6
movaps 32(%rcx), %xmm4
xorps %xmm4, %xmm1
andps %xmm1, %xmm4
movaps 48(%rcx), %xmm0
xorps %xmm0, %xmm1
andps %xmm1, %xmm0
orps %xmm0, %xmm4
movaps 64(%rcx), %xmm2
xorps %xmm2, %xmm1
andps %xmm1, %xmm2
movaps 80(%rcx), %xmm0
xorps %xmm0, %xmm1
andps %xmm1, %xmm0
orps %xmm0, %xmm2
movaps 96(%rcx), %xmm0
xorps %xmm0, %xmm1
andps %xmm1, %xmm0
movaps 112(%rcx), %xmm8
xorps %xmm8, %xmm1
andps %xmm1, %xmm8
orps %xmm8, %xmm0

xorps %xmm6, %xmm3
andps %xmm3, %xmm6
xorps %xmm4, %xmm3
andps %xmm3, %xmm4
orps %xmm4, %xmm6
xorps %xmm2, %xmm3
andps %xmm3, %xmm2
xorps %xmm0, %xmm3
andps %xmm3, %xmm0
orps %xmm0, %xmm2
xorps %xmm6, %xmm5
andps %xmm5, %xmm6
xorps %xmm2, %xmm5
andps %xmm5, %xmm2
orps %xmm2, %xmm6

movaps stuff, %xmm0
andps %xmm6, %xmm0
psrld $1, %xmm0
psubd %xmm0, %xmm6
movaps %xmm15, %xmm0
andnps %xmm6, %xmm0
psrld $2, %xmm0
andps %xmm15, %xmm6
paddd %xmm0, %xmm6
movaps %xmm6, %xmm0
psrld $4, %xmm6
paddd %xmm0, %xmm6
andps nonsense, %xmm6
psadbw %xmm14, %xmm6
paddd %xmm6, %xmm7

addq $128, %rcx
subq $128, %rdx
jnz loop1

movaps %xmm5, %xmm6
call wrap
paddq %xmm7, %xmm7
psubq %xmm6, %xmm7
movaps %xmm3, %xmm6
call wrap
paddq %xmm7, %xmm7
psubq %xmm6, %xmm7
movaps %xmm1, %xmm6
call wrap
paddq %xmm7, %xmm7
psubq %xmm6, %xmm7
movhlps %xmm7, %xmm1
paddq %xmm7, %xmm1
movq %xmm1, %rax
addq $896, %rax

movaps (%rsp), %xmm6
movaps 16(%rsp), %xmm7
movaps 32(%rsp), %xmm8
movaps 64(%rsp), %xmm14
movaps 80(%rsp), %xmm15
addq $56, %rsp
ret

.align 16
wrap: movaps stuff, %xmm0
andps %xmm6, %xmm0
psrld $1, %xmm0
psubd %xmm0, %xmm6
movaps %xmm15, %xmm0
andnps %xmm6, %xmm0
psrld $2, %xmm0
andps %xmm15, %xmm6
paddd %xmm0, %xmm6
movaps %xmm6, %xmm0
psrld $4, %xmm6
paddd %xmm0, %xmm6
andps nonsense, %xmm6
psadbw %xmm14, %xmm6
ret

.globl _popcnt3
.def _popcnt3; .scl 2; .type 32; .endef
.align 16
_popcnt3:
movq $0x3333333333333333, %rax
movq %rax, %xmm0
movddup %xmm0, %xmm3
xorps %xmm2, %xmm2

xorps %xmm4, %xmm4

movl $32768, %edx
addq %rdx, %rcx
negq %rdx

.align 16
loop2: movaps (%rcx,%rdx), %xmm1
movapd stuff, %xmm0
andps %xmm1, %xmm0
psrld $1, %xmm0
psubd %xmm0, %xmm1
movapd %xmm3, %xmm0
andnps %xmm1, %xmm0
psrld $2, %xmm0
andpd %xmm3, %xmm1
paddd %xmm0, %xmm1
movaps %xmm1, %xmm0
psrld $4, %xmm1
paddd %xmm0, %xmm1
andpd nonsense, %xmm1
psadbw %xmm2, %xmm1
paddd %xmm1, %xmm4

addq $16, %rdx
jnz loop2
movhlps %xmm4, %xmm1
paddq %xmm4, %xmm1
movq %xmm1, %rax

ret

.data
.align 16
stuff:
.long 0xaaaaaaaa
.long 0xaaaaaaaa
.long 0xaaaaaaaa
.long 0xaaaaaaaa
nonsense:
.long 0x0f0f0f0f
.long 0x0f0f0f0f
.long 0x0f0f0f0f
.long 0x0f0f0f0f

C:\gfortran\clf\popcnt>type big_test1.f90
program big_test1
use ISO_C_BINDING
implicit none
interface
function tm1() bind(C)
import C_INT64_T
implicit none

integer(C_INT64_T) tm1
end function tm1
end interface

interface
function popcnt1(x) bind(C)
import C_INT64_T
import C_PTR
implicit none

integer(C_INT64_T) popcnt1
type(C_PTR), value :: x
end function popcnt1
end interface

interface
function popcnt2(x,n) bind(C)
import C_INT64_T
import C_PTR
import C_INT
implicit none

integer(C_INT64_T) popcnt2
type(C_PTR), value :: x
integer(C_INT) n
end function popcnt2
end interface

interface
function popcnt3(x) bind(C)
import C_INT64_T
import C_PTR
implicit none

integer(C_INT64_T) popcnt3
type(C_PTR), value :: x
end function popcnt3
end interface

interface
subroutine sieve(t, n) bind(C)
import C_PTR
import C_INT
implicit none

type(C_PTR), value :: t
integer(C_INT) n
end subroutine sieve
end interface

integer(C_INT8_T), allocatable, target :: x(:)
integer(C_INT), parameter :: Nbytes = 32768 ! popcnt1 hardwired to this
integer, parameter :: align = 16 ! Will use xmm registers
type(C_PTR) x_ptr
integer(C_INTPTR_T) x_start
integer(C_INTPTR_T) t_start
type(C_PTR) t_ptr
integer i
integer(C_INT64_T) t0, t1, np

allocate(x(Nbytes+align-1))
x_ptr = C_LOC(x(1))
x_start = transfer(x_ptr, x_start)
t_start = iand(x_start+align-1, int(not(align-1),C_INTPTR_T))
t_ptr = transfer(t_start, t_ptr)
call sieve(t_ptr, Nbytes)
do i = 1, 4
t0 = tm1()
np = popcnt1(t_ptr)
t1 = tm1()
write(*,'(2(a,i0))') 'popcnt1 np = ', np, ' clocks = ', t1-t0
t0 = tm1()
np = popcnt2(t_ptr, Nbytes/4)
t1 = tm1()
write(*,'(2(a,i0))') 'popcnt2 np = ', np, ' clocks = ', t1-t0
t0 = tm1()
np = popcnt3(t_ptr)
t1 = tm1()
write(*,'(2(a,i0))') 'popcnt3 np = ', np, ' clocks = ', t1-t0
end do

end program big_test1

subroutine sieve(t, n) bind(C)
use ISO_C_BINDING
implicit none
integer(C_INT) n
integer(C_INT8_T) t(0:n-1)
integer i, lim, j

lim = sqrt(8*n+0.5_C_DOUBLE)
t = -1
t(0) = ibclr(t(0), 0)
do i = 2, lim
if(btest(t((i-1)/8),modulo(i-1,8))) then
do j = i**2, 8*n, i
t((j-1)/8) = ibclr(t((j-1)/8),modulo(j-1,8))
end do
end if
end do
end subroutine sieve

function popcnt2(MAT, Nwords) bind(C)
use ISO_C_BINDING
implicit none
integer(C_INT64_T) popcnt2
integer(C_INT) Nwords
integer(C_INT16_T) MAT(2*Nwords)
integer i, j
integer(C_INT8_T), parameter :: IBITC(0:255) = [( &
ibits(i,0,1)+ibits(i,1,1)+ibits(i,2,1)+ibits(i,3,1)+ &
ibits(i,4,1)+ibits(i,5,1)+ibits(i,6,1)+ibits(i,7,1), &
i=0,255)]
integer(C_INT16_T) K1, K2

popcnt2 = 0
do i = 1, 2*Nwords
K1 = MAT(i)
K2 = ishft(K1, -8)
K1 = iand(K1, int(z'ff', C_INT16_T))
popcnt2 = popcnt2+IBITC(K1)+IBITC(K2)
end do
end function popcnt2

C:\gfortran\clf\popcnt>C:\gfortran\win64\bin\x86_64-pc-mingw32-gfortran -O2 big_test1.f90 big_popcnt3.s -obig_test1

C:\gfortran\clf\popcnt>big_test1 > big_test1.txt

C:\gfortran\clf\popcnt>big_test1 > big_test1.txt

C:\gfortran\clf\popcnt>big_test1 > big_test1.txt

C:\gfortran\clf\popcnt>type big_test1.txt
popcnt1 np = 23000 clocks = 6940
popcnt2 np = 23000 clocks = 82890
popcnt3 np = 23000 clocks = 15180
popcnt1 np = 23000 clocks = 6900
popcnt2 np = 23000 clocks = 82690
popcnt3 np = 23000 clocks = 15120
popcnt1 np = 23000 clocks = 6900
popcnt2 np = 23000 clocks = 82560
popcnt3 np = 23000 clocks = 15070
popcnt1 np = 23000 clocks = 6960
popcnt2 np = 23000 clocks = 82550
popcnt3 np = 23000 clocks = 15060

I hope you don't mind the Fortran driver. The popcnt1 function is
invoked by passing, in rcx, the address of a 16-byte-aligned array of
32768 bytes to be population counted, and returns its result in rax.
It uses the 64-bit Windows calling convention. In the above, popcnt2
was an 8-bit LUT solution and popcnt3 was similar to popcnt1 but
without CSA compression. I hope you enjoy this wrinkle on an old
technique.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end

Terence

Apr 12, 2008, 7:55:40 PM
I built my business on software for bit counting, running since 1976.
I thought I understood the objective of the posts about "popcount" and
"wrap", but I see I don't.

If one wants to count "1" or "0" bits in strings of bytes or words,
you have to access memory anyway for the bytes. So there are a variety
of very simple, very fast ways of doing this using the continuous sum
of table-indexed words or bytes, whose index is the word or byte to
count bits in. There is no need for external fixed/floating point
chips; just four or five machine language instructions found in the
most basic CPU chips. So you call one pre-built function from Fortran
which returns the answer for a vast string.
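
As a minimal C sketch of the table-indexed method Terence describes
(the table and names here are illustrative):

#include <stddef.h>

static unsigned char lut[256];  /* lut[i] = number of set bits in i */

static void init_lut(void)
{
    for (int i = 0; i < 256; i++)
        lut[i] = (unsigned char)((i & 1) + lut[i / 2]);
}

static unsigned long count_bits(const unsigned char *p, size_t n)
{
    unsigned long total = 0;
    while (n--)
        total += lut[*p++];   /* one table-indexed add per byte */
    return total;
}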

Terje Mathisen

Apr 13, 2008, 9:08:36 AM
James Van Buskirk wrote:
> But the big difference lies in implementation of the Carry-Save
> Adder (CSA) compression scheme. You will recall that the
> fundamental sequence of operations, starting with inputs A, B,
> and C is:

[snipped bitslice adder logic using negated input bits]

Nice!!

OTOH, POPCNT is turning up as a regular x86 opcode these days, and I do
assume this will be _very_ hard to beat?

Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

James Van Buskirk

Apr 13, 2008, 11:20:49 AM
"Terje Mathisen" <spam...@crayne.org> wrote in message
news:B5mdncw9LPFJmJ_V...@giganews.com...

> James Van Buskirk wrote:
>> But the big difference lies in implementation of the Carry-Save
>> Adder (CSA) compression scheme. You will recall that the
>> fundamental sequence of operations, starting with inputs A, B,
>> and C is:

> [snipped bitslice adder logic using negated input bits]

> Nice!!

> OTOH, POPCNT is turning up as a regular x86 opcode these days, and I do
> assume this will be _very_ hard to beat?

As written, it's getting about 32768/6950 = 4.7 bytes per clock.
In 32-bit mode it may still be above 4 bytes per clock, but you
would have to be able to perform 2 loads per clock (Core 2 Duo only
has one load port) and 2 POPCNTs per clock (Phenom only has one
POPCNT-capable ALU) to do better than this. Also there may simply
be some not-well-documented throughput limitation on Core 2 Duo
which impairs performance by 33% (see the thread I started entitled
'Throughput of PXOR'.) A processor without that limitation might be
50% faster or about 7 bytes per clock. Also if a processor had
256-bit wide registers available, the speed could double, even
outstripping a POPCNT unit with throughput of 1 per clock cycle in
64-bit mode. Of course, if a POPCNT instruction were available for
256-bit wide registers, it would be game over.
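
(For comparison, the hardware-POPCNT loop Terje has in mind would look
roughly like this in C; a sketch using the GCC builtin, assuming a
POPCNT-capable CPU and a compiler flag such as -mpopcnt:)

#include <stddef.h>
#include <stdint.h>

/* Count set bits in n bytes; n is assumed to be a multiple of 8. */
unsigned long popcnt_hw(const uint64_t *p, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n / 8; i++)
        total += (unsigned long)__builtin_popcountll(p[i]);
    return total;
}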

Maarten Kronenburg

Apr 13, 2008, 3:48:40 PM

"James Van Buskirk" wrote in message

> There was a thread about population count in comp.lang.fortran but
> nobody seemed to like my assembly code over there:

For what it's worth below I give my 32-bit assembler for popcnt.
The format is for lzasm but can easily be changed for nasm.
The two arguments are the pointer and the number of 32-bit words.
The code is from Hacker's Delight, converted to 32-bit assembler.
On Core2Duo I measured 13 cycles/limb, where limb=4bytes=32bits.
Also I peeked for popcnt in GMP 4.2.2, see:
http://gmplib.org/
They have an IA-64 Itanium popcnt with 1 cycle/limb (64-bit), using the
IA-64 popcnt instruction.
Their table lookup popcnt runs 8 cycles/limb (32-bit).
Clearly your xmm popcnt1 is also very fast, about 1 cycle/limb (32-bit)!
But because it seems more complicated you should also test it for all
possible 128-bit bit patterns.
Regards, Maarten.

PROC __int_popcnt
ARG pdata:dword, ndata:dword
push ebp
mov ebp, esp
push edi
push esi
mov edi, [ndata]
mov esi, [pdata]
lea esi, [esi-4]
xor eax, eax
@@1:
mov ecx, [esi+4*edi]
mov edx, ecx
shr edx, 1
and ecx, 55555555h
and edx, 55555555h
add ecx, edx
mov edx, ecx
shr edx, 2
and ecx, 33333333h
and edx, 33333333h
add ecx, edx
mov edx, ecx
shr edx, 4
add ecx, edx
and ecx, 0F0F0F0Fh
mov edx, ecx
shr edx, 8
add ecx, edx
mov edx, ecx
shr edx, 16
add ecx, edx
and ecx, 3Fh
add eax, ecx
dec edi
jnz short @@1
pop esi
pop edi
pop ebp
ret
ENDP __int_popcnt
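
For reference, the loop body above implements the classic SWAR
sequence from Hacker's Delight, which in C reads:

#include <stdint.h>

/* Popcount of one 32-bit word: fold 1-bit counts into 2-bit counts,
   then 4-bit counts, then byte and word sums, as in the asm above. */
static unsigned pop32(uint32_t x)
{
    x = (x & 0x55555555u) + ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    x = x + (x >> 8);
    x = x + (x >> 16);
    return x & 0x3F;
}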

Jake Waskett

Apr 13, 2008, 5:43:32 PM
On Sun, 13 Apr 2008 21:48:40 +0200, Maarten Kronenburg wrote:

> But because it seems more complicated you should also test it for all
> possible 128-bit bit patterns.

Since there are 340282366920938463463374607431768211456 of those, that
might be slightly unrealistic. :-)

Maarten Kronenburg

Apr 13, 2008, 7:55:14 PM

"Jake Waskett" wrote in message

Yes that is true; perhaps a lot of randomly generated ones.
When, say, 2^32 randomly generated patterns count correctly,
then one can say with near certainty that the method is correct
and has no errors.

Terence

Apr 13, 2008, 8:14:55 PM

"James Van Buskirk" wrote in message
(code)

Someone tell me why 21 instructions and a branch, per word, is somehow
better/faster than

Initialization (which all methods need):

zero summing index (e.g. ax or eax)
load address of string to esi
load string count to ecx
load table address to ebx

back: ; 3 instructions per word follow
add to current sum from the count table, indexed by the 8-bit byte
or 16-bit word (one instruction)
increment string pointer (esi) (one instruction)
branch to repeat if not at the end of the byte/word string
(dec-branch instruction)

?
I want to know, if I haven't understood what you all seem to be trying
to do!
Is it that all those instructions are all somehow loaded to the CPU
chip instruction store and processed on chip? Because there is
always the memory access to be considered.

Wolfgang Kern

Apr 14, 2008, 6:42:35 AM

"Terence" wrote:

The shown code reminds me of what Randy Hyde posted a while back
in ALA for counting set bits.
Even though this algo suffers from many dependency stalls, it reads
four bytes and processes 32 bits per iteration, so it reduces the
costly memory accesses which you encounter with a byte-by-byte
indexed table lookup.
Granted, a 256-byte table could easily be held in cache, so it would
be worth benchmarking and comparing the performance of the two methods.

I never needed a bit counter yet, but if I ever need one, I'd try
whether the known-to-be-awfully-slow BSF/BSR instructions won't save on
many iterations, ...sure the worst case would be all bits set then,
so your suggestion may perform quite well:

xor eax,eax ;init count
xor ecx,ecx ;used as dwords_size
mov esi,start_of
mov edi, Byte_LUT ;should be prepared already
prefetchnta [edi]
prefetchnta [edi+40h]
prefetchnta [edi+80h]
prefetchnta [edi+0C0h]
L0:
mov ebx,[esi+ecx*4]
movzx edx,bl ;four times unrolled
shr ebx,8 ;because a loop branch
add AL,[edi+edx] ;may cost more time
adc eax,0x0100 ;
movzx edx,bl
shr ebx,8
add AL,[edi+edx]
adc eax,0x0100
movzx edx,bl
shr ebx,8
add AL,[edi+edx]
adc eax,0x0100
movzx edx,bl
;shr ebx,8 ;not needed in last
add AL,[edi+edx]
adc eax,0x0100
inc ecx
cmp ecx,dword_size
jc L0
; eax = bit count

but I'm not sure that your short variant:

;ecx=byte_size
L0:
movzx edx,byte [esi]
inc esi
add AL,[edi+edx]
adc eax,0x0100
dec ecx
jnz L0

could be faster, because it reads 'bytes' more often from memory.

perhaps I'll type it in one day and post timing results then.
__
wolfgang


Robert Redelmeier

Apr 14, 2008, 10:21:05 AM
Terence <spam...@crayne.org> wrote in part:

> "James Van Buskirk" wrote in message
> (code) [clever XMM bitslice code omitted]


> Someone tell me why 21 instructions and a branch,
> per word, is somehow better/faster than

[byte-wise LUT pseudocode snipped]

Easy, it just is -- 4.7 bytes/clock. Your LUT would
be good at 0.5 and have real trouble breaking 1.0.

Why is equally easy to see -- James' "word" is a 128-bit
wide XMM register (presumably zero padded). That is 16
bytes at a time. Furthermore the code is unrolled x8,
meaning 128 _bytes_ per 61-instruction inner loop.
James' timing gives a reasonable 2.2 ipc.

I still have this nagging feeling there's something
faster with multiplying by magic numbers.
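
Presumably that means the Hacker's Delight variant in which one
multiply sums the per-byte counts; a sketch for a single 64-bit word:

#include <stdint.h>

static unsigned pop64_mul(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ull);
    x = (x & 0x3333333333333333ull) + ((x >> 2) & 0x3333333333333333ull);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0Full;
    /* multiplying by the "magic number" gathers all the byte counts
       into the top byte */
    return (unsigned)((x * 0x0101010101010101ull) >> 56);
}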

-- Robert

Terence

Apr 14, 2008, 8:00:42 AM
I DO understand that if you have millions of very long bits strings
that you have to count bits in or match mbits in (as in DNA or RNA
matching) you might want a very tight efficient loop algorithm.

But then again, from a practical point of view, you still have to
consider the access to the huge strings of bits millions of times. Are
we nit-picking in the wrong place?

Are we not seeing the woods for the trees?

James Van Buskirk

Apr 14, 2008, 4:58:34 AM
"Terence" <spam...@crayne.org> wrote in message
news:25562a29-faf9-4899...@n1g2000prb.googlegroups.com...

> Someone tell me why 21 instructions and a branch, per word, is somehow
> better/faster than

I think you may be referring to Maarten Kronenburg's example, which
has 24 instructions+branch. The point you are missing is that
these complicated methods are counting several words per loop
trip. Well, MK's method is a naive version, and is surely slower
than a well-optimized 8-bit LUT would be. But the point is that
counting 64 bits could be done in just two more operations than
counting 32 bits (actually the same number in a more sophisticated
implementation) whereas the LUT method requires about twice as
many operations to count up twice the bits. The limitation of
LUTs is that they can only get so big before every access is a
cache miss, so the number of bits that can be handled in
parallel is quite limited. Parallel methods are limited by
register width which is up to 128 bits in x86 just now.

A good explanation of Maarten Kronenburg's code may be found
in the Wikipedia entry on Hamming weight,

http://en.wikipedia.org/wiki/Hamming_weight

The compression stage is explained in the old thread referenced
in my original post,

http://groups.google.com:80/group/comp.lang.asm.x86/msg/54d29f732e51de93

Not to mention

http://www.icis.ntu.edu.sg/scs-ijit/91/91-1.pdf

Well, there's so much available that if you're really interested
these references should keep you busy for a while.

James Van Buskirk

Apr 14, 2008, 4:38:07 AM
"Maarten Kronenburg" <spam...@crayne.org> wrote in message
news:4802639c$0$720$7ade...@textreader.nntp.internl.net...

> For what it's worth below I give my 32-bit assembler for popcnt.

This is a crude version of the well-known code in AMD's optimization
manual, and is not as fast as the AMD code.

> Also I peeked for popcnt in GMP 4.2.2, see:
> http://gmplib.org/
> They have an IA-64 Itanium popcnt with 1 cycle/limb (64-bit), using the
> IA-64 popcnt instruction.

I couldn't find where you got that number from. In

http://www.technovelty.org/code/arch/popcount.html

I see a result of 6 clocks per 8 bytes for Itanium 2.

> Their table lookup popcnt runs 8 cycle/limb (32-bit).

Obviously one should be able to get close to 1 byte per cycle
with a Core 2 Duo and 8-bit LUT. Only crappy code would be
short of this mark by a factor of 2.

> Clearly your xmm popcnt1 is also very fast, about 1 cycle/limb (32-bit)!

Also on a machine with throughput of one 64-bit hardware per clock
cycle it may be possible to get about 10 bytes per cycle, if the
other throughput problem I am experiencing is not also a problem in
the code I envision.

> But because it seems more complicated you should also test it for all
> possible 128-bit bit patterns.

No, the thing to do is to follow the references given and assure
oneself that the algorithm is mathematically sound. The compression
stage only needs a truth table for all combinations of three input
bits (2**3 entries) to be exhaustive, and the wrap-up stage can
be thought of as 16 independent bytes in parallel. The only
aspects that I haven't seen in previous code were the use of PSADBW
for the final sweep stage (which is, if anything, more
transparent than the multiply and shift given by AMD, but integer
registers don't have SIMD operations in the x86 ISA) and the
negative logic in the compression stage (which can be established
by an 8-entry truth table if you flunked logic circuits.)

After that, testing with extreme conditions such as all bits set
and all bits clear as well as random bits to guard against gross
errors in coding, while not exhaustive of all possibilities, is
at least persuasive to me. Besides, the algorithm mashes
together 1024 bits in each loop trip, distilling out one 128-bit
drop to count up, not to mention that the dregs of previous
iterations is an input to each successive iteration, so even
an unattainable number like 128 bits doesn't even start to
describe the complexity of the algorithm. It's explained fairly
well in the second link I gave in the seminal post of this
thread. Check it out and see what's going on here.

The Wikipedia entry on Hamming weight is pretty good for the
wrap stage of the algorithm, but the code there is almost
identical to your example, so it probably doesn't help you.

Maarten Kronenburg

Apr 14, 2008, 12:09:21 PM

"Terence" wrote in message

>
> "James Van Buskirk" wrote in message
> (code)
>
> Someone tell me why 21 instructions and a branch, per word, is somehow
> better/faster than
>

First I also didn't believe it, but somehow array elements are combined into
one element, and only that one element is counted, it's from:
http://www.hackersdelight.org/HDcode/newCode/pop_arrayHS.c
The combination function is faster than the counting function, that's the
trick.
Now I'm also trying to make my code into one from pop_array3.
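
Schematically, pop_arrayHS.c combines input words with a carry-save
adder and only counts the compressed output; a sketch of the structure
(not a verbatim copy, and the real code compresses deeper than this):

#include <stdint.h>

#define CSA(h, l, a, b, c) do {       \
    uint64_t u = (a) ^ (b);           \
    (h) = ((a) & (b)) | (u & (c));    \
    (l) = u ^ (c);                    \
} while (0)

/* Any single-word popcount will do for the compressed outputs. */
#define pop(x) ((uint64_t)__builtin_popcountll(x))

uint64_t pop_array(const uint64_t *A, int n)  /* n a multiple of 4 */
{
    uint64_t ones = 0, twos = 0, twosA, twosB, fours, tot = 0;
    for (int i = 0; i < n; i += 4) {
        CSA(twosA, ones, ones, A[i],   A[i+1]);
        CSA(twosB, ones, ones, A[i+2], A[i+3]);
        CSA(fours, twos, twos, twosA, twosB);
        tot += pop(fours);            /* each set bit is worth 4 */
    }
    return 4*tot + 2*pop(twos) + pop(ones);
}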

Jake Waskett

Apr 14, 2008, 7:19:35 AM

That seems reasonable. A word of caution, though: many pseudo-random
number routines will start to repeat their output sequence after 2^32 (or
sometimes 2^31 or even fewer) calls. Some routines also have
semi-predictable bit patterns in their output. So be sure to check that
you're *truly* checking a representative random sample of the required
size.

Maarten Kronenburg

Apr 14, 2008, 2:53:32 PM

"James Van Buskirk" wrote in message
> "Maarten Kronenburg" wrote in message

>
> > For what it's worth below I give my 32-bit assembler for popcnt.
>
> This is a crude version of the well-known code in AMD's optimization
> manual, and is not as fast as the AMD code.
>

Yes it's the basic one; I'm now putting it in xmm to see where I get.
I work on a 32-bit system so the number of registers is a limitation.

> > Also I peeked for popcnt in GMP 4.2.2, see:
> > http://gmplib.org/
> > They have an IA-64 Itanium popcnt with 1 cycle/limb (64-bit), using the
> > IA-64 popcnt instruction.
>
> I couldn't find where you got that number from. In
>

Just go to:
http://gmplib.org
and download gmp-4.2.2.tar.gz.
Then look for the files popcount.asm.
I just copied their timings, perhaps they have already improved.

> http://www.technovelty.org/code/arch/popcount.html
>
> I see a result of 6 clocks per 8 bytes for Itanium 2.
>
> > Their table lookup popcnt runs 8 cycle/limb (32-bit).
>
> Obviously one should be able to get close to 1 byte per cycle
> with a Core 2 Duo and 8-bit LUT. Only crappy code would be
> short of this mark by a factor of 2.
>
> > Clearly your xmm popcnt1 is also very fast, about 1 cycle/limb (32-bit)!
>
> Also on a machine with throughput of one 64-bit hardware per clock
> cycle it may be possible to get about 10 bytes per cycle, if the
> other throughput problem I am experiencing is not also a problem in
> the code I envision.
>
> > But because it seems more complicated you should also test it for all
> > possible 128-bit bit patterns.
>
> No, the thing to do is to follow the references given and assure
> oneself that the algorithm is mathematically sound. The compression
> stage only needs a truth table for all combinations of three input
> bits (2**3 entries) to be exhaustive, and the wrap-up stage can
> be thought of as 16 independent bytes in parallel. The only
> aspects that I haven't seen in previous code were the use of PSADBW
> for the final sweep stage (which is, if anything, more
> transparent than the multiply and shift given by AMD, but integer
> registers don't have SIMD operations in the x86 ISA) and the
> negative logic in the compression stage (which can be established
> by an 8-entry truth table if you flunked logic circuits.)
>

Yes I'm sure you are right, but a typo in assembler is easily made.

> After that, testing with extreme conditions such as all bits set
> and all bits clear as well as random bits to guard against gross
> errors in coding, while not exhaustive of all possibilities, is
> at least persuasive to me. Besides, the algorithm mashes
> together 1024 bits in each loop trip, distilling out one 128-bit
> drop to count up, not to mention that the dregs of previous
> iterations is an input to each successive iteration, so even
> an unattainable number like 128 bits doesn't even start to
> describe the complexity of the algorithm. It's explained fairly
> well in the second link I gave in the seminal post of this
> thread. Check it out and see what's going on here.
>
> The Wikipedia entry on Hamming weight is pretty good for the
> wrap stage of the algorithm, but the code there is almost
> identical to your example, so it probably doesn't help you.
>

Yes I found:
http://www.hackersdelight.org/HDcode/newCode/pop_arrayHS.c
To put those in assembler as you did is quite some work.
Perhaps you may find and eliminate dependency chains in your assembler
code; that's the normal way to optimize existing assembler, but
sometimes it's rather a puzzle.
My popcnt is part of an integer library, just like GMP, and it's linear so
its performance is not so critical, but for some other special applications
it may be critical. Amazing that you can get so far with optimizing tricks.
Regards, Maarten.

James Van Buskirk

Apr 14, 2008, 3:53:21 PM
"Wolfgang Kern" <spam...@crayne.org> wrote in message
news:ftvcf7$umq$1...@newsreader2.utanet.at...

> Even this algo suffer on many dependency stalls, it reads four bytes
> and processes 32 bits per iteration, so it reduces the costly memory
> accesses which you encounter with a byte by byte indexed table lookup.
> Granted, a 256 byte table could be easy held in cache, so it would be
> worth to benchmark and compare the performance of the two methods.

Well, I tried rewriting the 8-bit LUT in assembly to minimize
memory access, also I tried the 4-bit LUT. In the following,
popcnt1 is the topical code of this thread, popcnt2 is the
8-bit LUT, popcnt3 is popcnt1 without compression as in the AMD
manual, and popcnt4 is the 4-bit LUT:

C:\gfortran\clf\popcnt>type big_popcnt4.s

movl $32768, %edx

.globl _popcnt2
.def _popcnt2; .scl 2; .type 32; .endef
.align 16
_popcnt2:
movq %rbx, 8(%rsp)
movq %rsi, 16(%rsp)
movq %rdi, 24(%rsp)
movq %rbp, 32(%rsp)
movl (%rdx), %ebp
xorl %eax, %eax
leaq LUT(%rip), %rdi
byte_loop:
movaps (%rcx), %xmm0
movq %xmm0, %rbx
movzx %bl, %rsi
xorl %edx, %edx
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi
addb (%rsi,%rdi,1), %dl
shrq $16, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi
addb (%rsi,%rdi,1), %dl
shrq $16, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi
addb (%rsi,%rdi,1), %dl
shrq $16, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi
addb (%rsi,%rdi,1), %dl
movhlps %xmm0, %xmm0
movq %xmm0, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi
addb (%rsi,%rdi,1), %dl
shrq $16, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi
addb (%rsi,%rdi,1), %dl
shrq $16, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi
addb (%rsi,%rdi,1), %dl
shrq $16, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi
addb (%rsi,%rdi,1), %dl
movzx %dl, %edx
addq %rdx, %rax
addq $16, %rcx
subq $16, %rbp
jnz byte_loop
movq 8(%rsp), %rbx
movq 16(%rsp), %rsi
movq 24(%rsp), %rdi
movq 32(%rsp), %rbp
ret

xorps %xmm4, %xmm4

ret

.globl _popcnt4
.def _popcnt4; .scl 2; .type 32; .endef
.align 16
_popcnt4:
movaps %xmm6, 8(%rsp)
movl (%rdx), %edx
movaps nonsense(%rip), %xmm2
movaps LUT(%rip), %xmm3
xorps %xmm4, %xmm4
xorps %xmm6, %xmm6
oword_loop:
movaps (%rcx), %xmm0
movaps %xmm0, %xmm1
andps %xmm2, %xmm0
movaps %xmm3, %xmm5
pshufb %xmm0, %xmm5
psadbw %xmm4, %xmm5
paddd %xmm5, %xmm6

psrld $4, %xmm1
andps %xmm2, %xmm1

movaps %xmm3, %xmm5
pshufb %xmm1, %xmm5
psadbw %xmm4, %xmm5
paddd %xmm5, %xmm6

addq $16, %rcx
subq $16, %rdx
jnz oword_loop
movq %xmm6, %rax
movhlps %xmm6, %xmm6
movq %xmm6, %rdx
addq %rdx, %rax
movq 8(%rsp), %xmm6
ret

.data
.align 16
stuff:
.long 0xaaaaaaaa
.long 0xaaaaaaaa
.long 0xaaaaaaaa
.long 0xaaaaaaaa
nonsense:
.long 0x0f0f0f0f
.long 0x0f0f0f0f
.long 0x0f0f0f0f
.long 0x0f0f0f0f

LUT:
.byte 0,1,1,2,1,2,2,3
.byte 1,2,2,3,2,3,3,4
.byte 1,2,2,3,2,3,3,4
.byte 2,3,3,4,3,4,4,5
.byte 1,2,2,3,2,3,3,4
.byte 2,3,3,4,3,4,4,5
.byte 2,3,3,4,3,4,4,5
.byte 3,4,4,5,4,5,5,6
.byte 1,2,2,3,2,3,3,4
.byte 2,3,3,4,3,4,4,5
.byte 2,3,3,4,3,4,4,5
.byte 3,4,4,5,4,5,5,6
.byte 2,3,3,4,3,4,4,5
.byte 3,4,4,5,4,5,5,6
.byte 3,4,4,5,4,5,5,6
.byte 4,5,5,6,5,6,6,7
.byte 1,2,2,3,2,3,3,4
.byte 2,3,3,4,3,4,4,5
.byte 2,3,3,4,3,4,4,5
.byte 3,4,4,5,4,5,5,6
.byte 2,3,3,4,3,4,4,5
.byte 3,4,4,5,4,5,5,6
.byte 3,4,4,5,4,5,5,6
.byte 4,5,5,6,5,6,6,7
.byte 2,3,3,4,3,4,4,5
.byte 3,4,4,5,4,5,5,6
.byte 3,4,4,5,4,5,5,6
.byte 4,5,5,6,5,6,6,7
.byte 3,4,4,5,4,5,5,6
.byte 4,5,5,6,5,6,6,7
.byte 4,5,5,6,5,6,6,7
.byte 5,6,6,7,6,7,7,8

C:\gfortran\clf\popcnt>type big_test4.f90
program big_test4

interface
function popcnt4(x,n) bind(C)


import C_INT64_T
import C_PTR
import C_INT
implicit none

integer(C_INT64_T) popcnt4


type(C_PTR), value :: x
integer(C_INT) n

end function popcnt4
end interface

np = popcnt2(t_ptr, Nbytes)


t1 = tm1()
write(*,'(2(a,i0))') 'popcnt2 np = ', np, ' clocks = ', t1-t0
t0 = tm1()
np = popcnt3(t_ptr)
t1 = tm1()
write(*,'(2(a,i0))') 'popcnt3 np = ', np, ' clocks = ', t1-t0

t0 = tm1()
np = popcnt4(t_ptr, Nbytes)
t1 = tm1()
write(*,'(2(a,i0))') 'popcnt4 np = ', np, ' clocks = ', t1-t0
end do

end program big_test4

subroutine sieve(t, n) bind(C)
use ISO_C_BINDING
implicit none
integer(C_INT) n
integer(C_INT8_T) t(0:n-1)
integer i, lim, j

lim = sqrt(8*n+0.5_C_DOUBLE)
t = -1
t(0) = ibclr(t(0), 0)
do i = 2, lim
if(btest(t((i-1)/8),modulo(i-1,8))) then
do j = i**2, 8*n, i
t((j-1)/8) = ibclr(t((j-1)/8),modulo(j-1,8))
end do
end if
end do
end subroutine sieve

C:\gfortran\clf\popcnt>c:\gfortran\win64\bin\x86_64-pc-mingw32-gfortran -O2 big_test4.f90 big_popcnt4.s -obig_test4

C:\gfortran\clf\popcnt>big_test4 > big_test4.txt

C:\gfortran\clf\popcnt>type big_test4.txt
popcnt1 np = 23000 clocks = 7370
popcnt2 np = 23000 clocks = 47560
popcnt3 np = 23000 clocks = 15190
popcnt4 np = 23000 clocks = 17540
popcnt1 np = 23000 clocks = 7010
popcnt2 np = 23000 clocks = 47480
popcnt3 np = 23000 clocks = 15130
popcnt4 np = 23000 clocks = 17530
popcnt1 np = 23000 clocks = 6920
popcnt2 np = 23000 clocks = 47540
popcnt3 np = 23000 clocks = 15030
popcnt4 np = 23000 clocks = 17460
popcnt1 np = 23000 clocks = 6930
popcnt2 np = 23000 clocks = 47520
popcnt3 np = 23000 clocks = 15040
popcnt4 np = 23000 clocks = 17520

So we got down to about 1 byte per 1.5 clocks for the 8-bit LUT.
Quite disappointing. The 4-bit LUT is close to the AMD manual
method, though. For a processor with fast PSHUFB and no
hardware POPCNT this may be a good thing. PSHUFB unfortunately is
slow on my Core 2 Duo E6700.

Terence

Apr 14, 2008, 6:09:35 PM
On Apr 14, 6:38 pm, "James Van Buskirk" <spamt...@crayne.org> wrote:

> The Wikipedia entry on Hamming weight is pretty good for the
> wrap stage of the algorithm, but the code there is almost
> identical to your example, so it probably doesn't help you.
>

Thanks, James.
Now I see it and understand it.
There is a code width beyond which table look-up summing takes
longer.
That Wikipedia reference makes the principle and suggested algorithms
clear.
All my university time was pre-Hamming.

Terje Mathisen

Apr 15, 2008, 8:28:26 AM
Maarten Kronenburg wrote:
> "Terence" wrote in message
>> "James Van Buskirk" wrote in message
>> (code)
>>
>> Someone tell me why 21 instructions and a branch, per word, is somehow
>> better/faster than
>>
>
> First I also didn't believe it, but somehow array elements are combined into
> one element, and only that one element is counted, it's from:

There's no "somehow" at all about it:

This is a classic bitslice implementation of a FULL_ADD, i.e. 3 inputs
and 2 outputs.

Terence

Apr 15, 2008, 5:29:34 AM
I mean prior to Hamming's work with Shannon, which he became well
known for in about 1974; I graduated in 1955, 1957, 1958, 1963
and 1972.

Gerd Isenberg

Apr 15, 2008, 5:56:39 AM
Another SSE2 alternative for a 64-bit popcount, sign extending 64 bits
to 64 bytes to add them.
May be combined with add/major as well...

U32 popCount(U64 bb) {
static const U64 CACHE_ALIGN masks[8] = {
C64(0x0101010101010101), C64(0x0202020202020202),
C64(0x0404040404040404), C64(0x0808080808080808),
C64(0x1010101010101010), C64(0x2020202020202020),
C64(0x4040404040404040), C64(0x8080808080808080),
};
__m128i x0, x1, x2, x3, zr; U32 cnt;
__m128i * pM = (__m128i*) masks;
x0 = _mm_cvtsi64x_si128 ( bb );
x0 = _mm_unpacklo_epi64 ( x0, x0 );
zr = _mm_setzero_si128();
x3 = _mm_andnot_si128 ( x0, pM[3] );
x2 = _mm_andnot_si128 ( x0, pM[2] );
x1 = _mm_andnot_si128 ( x0, pM[1] );
x0 = _mm_andnot_si128 ( x0, pM[0] );
x3 = _mm_cmpeq_epi8 ( x3, zr );
x2 = _mm_cmpeq_epi8 ( x2, zr );
x1 = _mm_cmpeq_epi8 ( x1, zr );
x0 = _mm_cmpeq_epi8 ( x0, zr );
x2 = _mm_add_epi8 ( x2, x3 );
x0 = _mm_add_epi8 ( x0, x1 );
x0 = _mm_add_epi8 ( x0, x2 );
x0 = _mm_sad_epu8 ( x0, zr );
cnt = -_mm_cvtsi128_si32( x0 )
-_mm_extract_epi16( x0, 4 );
return cnt & 255;
}

bb$ = 8
?popCount@@YAI_K@Z PROC
00000 66 0f ef db pxor xmm3, xmm3
00004 66 48 0f 6e d1 movd xmm2, rcx
00009 66 0f 6c d2 punpcklqdq xmm2, xmm2
0000d 66 0f 6f e2 movdqa xmm4, xmm2
00011 66 0f 6f c2 movdqa xmm0, xmm2
00015 66 0f 6f ca movdqa xmm1, xmm2
00019 66 0f df 15 30 00 00 00 pandn xmm2, XMMWORD PTR ?masks+48
00021 66 0f df 25 00 00 00 00 pandn xmm4, XMMWORD PTR ?masks
00029 66 0f df 0d 20 00 00 00 pandn xmm1, XMMWORD PTR ?masks+32
00031 66 0f df 05 10 00 00 00 pandn xmm0, XMMWORD PTR ?masks+16
00039 66 0f 74 e3 pcmpeqb xmm4, xmm3
0003d 66 0f 74 c3 pcmpeqb xmm0, xmm3
00041 66 0f 74 cb pcmpeqb xmm1, xmm3
00045 66 0f fc e0 paddb xmm4, xmm0
00049 66 0f 74 d3 pcmpeqb xmm2, xmm3
0004d 66 0f fc ca paddb xmm1, xmm2
00051 66 0f fc e1 paddb xmm4, xmm1
00055 66 0f f6 e3 psadbw xmm4, xmm3
00059 66 0f 7e e0 movd eax, xmm4
0005d 66 0f c5 cc 04 pextrw ecx, xmm4, 4
00062 03 c8 add ecx, eax
00064 f7 d9 neg ecx
00066 0f b6 c1 movzx eax, cl
00069 c3 ret 0
?popCount@@YAI_K@Z ENDP

http://chessprogramming.wikispaces.com/SSE2#SSE2dotproduct
http://chessprogramming.wikispaces.com/Population+Count

Gerd

Maarten Kronenburg

Apr 15, 2008, 11:13:38 AM

"Maarten Kronenburg" wrote in message
>
> "James Van Buskirk" wrote in message
> > There was a thread about population count in comp.lang.fortran but
> > nobody seemed to like my assembly code over there:
>
> For what it's worth below I give my 32-bit assembler for popcnt.

The assembler code I gave above, but in mmx and using psadbw as you
suggested, runs 5 cycles/32-bit.
Now I'm trying to put it in xmm and hope to do even a little better;
I will post the result here.
Regards, Maarten

Maarten Kronenburg

Apr 15, 2008, 3:58:21 PM

"Maarten Kronenburg" wrote in message
>

Now I have the xmm version; it runs a little below 3 cycles/32-bit.
Tomorrow after some more testing I will put it here.

James Van Buskirk

Apr 16, 2008, 2:33:16 AM
"Gerd Isenberg" <spam...@crayne.org> wrote in message
news:61cb8b24-6249-49eb...@t54g2000hsg.googlegroups.com...

> Another SSE2 alternative for a 64-bit popcount, sign extending 64 bits
> to 64 bytes to add them.
> May be combined with add/major as well...

> http://chessprogramming.wikispaces.com/Population+Count

This didn't seem too promising, but I tried it anyhow.

interface
function popcnt5(x,n) bind(C)


import C_INT64_T
import C_PTR
import C_INT
implicit none

integer(C_INT64_T) popcnt5


type(C_PTR), value :: x
integer(C_INT) n

end function popcnt5
end interface

.align 16
_popcnt5:
movaps %xmm6, 8(%rsp)
leaq masks(%rip), %rax
movl (%rdx), %edx
xorps %xmm3, %xmm3
xorps %xmm6, %xmm6
.align 16
qword_loop:
movaps (%rcx), %xmm5
movddup %xmm5, %xmm2
movaps %xmm2, %xmm4
movaps %xmm2, %xmm0
movaps %xmm2, %xmm1
andnps 48(%rax), %xmm2
andnps (%rax), %xmm4
andnps 32(%rax), %xmm1
andnps 16(%rax), %xmm0
pcmpeqb %xmm3, %xmm4
pcmpeqb %xmm3, %xmm0
pcmpeqb %xmm3, %xmm1
paddb %xmm0, %xmm4
pcmpeqb %xmm3, %xmm2
paddb %xmm1, %xmm2
paddb %xmm4, %xmm2

movhlps %xmm5, %xmm5
movaps %xmm5, %xmm4
movaps %xmm5, %xmm0
movaps %xmm5, %xmm1
andnps 48(%rax), %xmm5
andnps (%rax), %xmm4
andnps 32(%rax), %xmm1
andnps 16(%rax), %xmm0
pcmpeqb %xmm3, %xmm4
pcmpeqb %xmm3, %xmm0
pcmpeqb %xmm3, %xmm1
paddb %xmm0, %xmm4
pcmpeqb %xmm3, %xmm5
paddb %xmm1, %xmm5
paddb %xmm4, %xmm5

paddb %xmm5, %xmm2
psadbw %xmm3, %xmm2
pslld $24, %xmm2
psrad $24, %xmm2
psubd %xmm2, %xmm6

addq $16, %rcx
subq $16, %rdx

jnz qword_loop

movq %xmm6, %rax
movhlps %xmm6, %xmm6
movq %xmm6, %rdx
addq %rdx, %rax

movaps 8(%rsp), %xmm6
ret

masks:
.long 0x01010101,0x01010101,0x02020202,0x02020202
.long 0x04040404,0x04040404,0x08080808,0x08080808
.long 0x10101010,0x10101010,0x20202020,0x20202020
.long 0x40404040,0x40404040,0x80808080,0x80808080

Also I tweaked the 8-bit LUT:

.align 16
_popcnt2:
movq %rbx, 8(%rsp)
movq %rsi, 16(%rsp)
movq %rdi, 24(%rsp)
movq %rbp, 32(%rsp)
movl (%rdx), %ebp

xorl %r8d, %r8d
leaq LUT(%rip), %rdi
.align 16
byte_loop:
# movaps (%rcx), %xmm0 # 1 oword load
# movq %xmm0, %rbx # 1 oword load
movq (%rcx), %rbx # 2 qword loads
movzx %bl, %rsi
xorq %rdx, %rdx
movb (%rsi,%rdi,1), %dl
movzx %bh, %esi
movb (%rsi,%rdi,1), %al


shrq $16, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi

addb (%rsi,%rdi,1), %al


shrq $16, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi

addb (%rsi,%rdi,1), %al


shrq $16, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi

addb (%rsi,%rdi,1), %al
# movhlps %xmm0, %xmm0 # 1 oword load
# movq %xmm0, %rbx # 1 oword load
movq 8(%rcx), %rbx # 2 qword loads


movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi

addb (%rsi,%rdi,1), %al


shrq $16, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi

addb (%rsi,%rdi,1), %al


shrq $16, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi

addb (%rsi,%rdi,1), %al


shrq $16, %rbx
movzx %bl, %esi
addb (%rsi,%rdi,1), %dl
movzx %bh, %esi

addb (%rsi,%rdi,1), %al
addb %al, %dl
addq %rdx, %r8


addq $16, %rcx
subq $16, %rbp
jnz byte_loop

movq %r8, %rax


movq 8(%rsp), %rbx
movq 16(%rsp), %rsi
movq 24(%rsp), %rdi
movq 32(%rsp), %rbp
ret

And the 4-bit LUT:

.align 16
_popcnt4:
movaps %xmm6, 8(%rsp)
movl (%rdx), %edx
movaps nonsense(%rip), %xmm2

xorps %xmm3, %xmm3
xorps %xmm6, %xmm6
.align 16
oword_loop:
movaps (%rcx), %xmm0
movaps %xmm2, %xmm1
andnps %xmm0, %xmm1
movaps LUT(%rip), %xmm5
andps %xmm2, %xmm0
movaps %xmm5, %xmm4
pshufb %xmm0, %xmm5

psrld $4, %xmm1

pshufb %xmm1, %xmm4
paddd %xmm4, %xmm5
psadbw %xmm3, %xmm5
paddd %xmm5, %xmm6

addq $16, %rcx
subq $16, %rdx
jnz oword_loop

movq %xmm6, %rax
movhlps %xmm6, %xmm6
movq %xmm6, %rdx
addq %rdx, %rax

movaps 8(%rsp), %xmm6
ret

(Fit the above into code from
http://groups.google.com/group/comp.lang.asm.x86/msg/0cd4b133fd17c86a
)

With result:

popcnt1 np = 23000 clocks = 7150
popcnt2 np = 23000 clocks = 43240


popcnt3 np = 23000 clocks = 15120

popcnt4 np = 23000 clocks = 15660
popcnt5 np = 23000 clocks = 37870
popcnt1 np = 23000 clocks = 6950
popcnt2 np = 23000 clocks = 43150
popcnt3 np = 23000 clocks = 15140
popcnt4 np = 23000 clocks = 15600
popcnt5 np = 23000 clocks = 38040
popcnt1 np = 23000 clocks = 6930
popcnt2 np = 23000 clocks = 43120
popcnt3 np = 23000 clocks = 14980
popcnt4 np = 23000 clocks = 15570
popcnt5 np = 23000 clocks = 37540


popcnt1 np = 23000 clocks = 6960

popcnt2 np = 23000 clocks = 43100
popcnt3 np = 23000 clocks = 15040
popcnt4 np = 23000 clocks = 15560
popcnt5 np = 23000 clocks = 37490

So [my adaptation of] your code is getting 32768/37500 = 0.87 bytes
per clock. Faster than my improved 8-bit LUT at 0.76 bytes per clock
but still running with the Slowskys. OTOH, I have gotten the 4-bit
LUT up to nearly the speed of the SWAR method (as the above-referenced
web page styles it) so it may be a candidate for the wrap-up stage of
the compression-accelerated code.
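
In intrinsics, the 4-bit-LUT kernel corresponds to something like the
following sketch (SSSE3 is needed for _mm_shuffle_epi8, i.e. PSHUFB;
the per-byte counts still need the PSADBW accumulation shown in the
assembly above):

#include <tmmintrin.h>   /* SSSE3 */

/* Per-byte popcount of v via two PSHUFB nibble lookups. */
static __m128i popcnt_bytes(__m128i v)
{
    const __m128i lut  = _mm_setr_epi8(0,1,1,2,1,2,2,3,
                                       1,2,2,3,2,3,3,4);
    const __m128i mask = _mm_set1_epi8(0x0F);
    __m128i lo = _mm_and_si128(v, mask);
    __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), mask);
    return _mm_add_epi8(_mm_shuffle_epi8(lut, lo),
                        _mm_shuffle_epi8(lut, hi));
}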

Wolfgang Kern

Apr 16, 2008, 9:34:09 AM

"James Van Buskirk" <spam...@crayne.org> schrieb im Newsbeitrag
news:P9adnbQZvO6pK57V...@comcast.com...

> "Wolfgang Kern" <spam...@crayne.org> wrote in message
> news:ftvcf7$umq$1...@newsreader2.utanet.at...

>> Even this algo suffer on many dependency stalls, it reads four bytes
>> and processes 32 bits per iteration, so it reduces the costly memory
>> accesses which you encounter with a byte by byte indexed table lookup.
>> Granted, a 256 byte table could be easy held in cache, so it would be
>> worth to benchmark and compare the performance of the two methods.

> Well, I tried rewriting the 8-bit LUT in assembly to minimize
> memory access, also I tried the 4-bit LUT. In the following,
> popcnt1 is the topical code of this thread, popcnt2 is the
> 8-bit LUT, popcnt3 is popcnt1 without compression as in the AMD
> manual, and popcnt4 is the 4-bit LUT:

I see in your later posts you found already faster solutions ...

[code ..]
when I look at this table ...
> LUT:
> .byte 0,1,1,2,1,2,2,3 00 1.
> .byte 1,2,2,3,2,3,3,4 08 2.
> .byte 1,2,2,3,2,3,3,4 10 =08 [2
> .byte 2,3,3,4,3,4,4,5 18 3.

> .byte 1,2,2,3,2,3,3,4 20 = 08 [2
> .byte 2,3,3,4,3,4,4,5 28 = 18 [3
> .byte 2,3,3,4,3,4,4,5 30 = 18 [3
> .byte 3,4,4,5,4,5,5,6 38 4.

> .byte 1,2,2,3,2,3,3,4 40 = 08 [2
> .byte 2,3,3,4,3,4,4,5 48 = 18 [3
> .byte 2,3,3,4,3,4,4,5 50 = 18 [3
> .byte 3,4,4,5,4,5,5,6 58 = 38 [4

> .byte 2,3,3,4,3,4,4,5 60 = 18 [3
> .byte 3,4,4,5,4,5,5,6 68 = 38 [4
> .byte 3,4,4,5,4,5,5,6 70 = 38 [4
> .byte 4,5,5,6,5,6,6,7 78 5.

> .byte 1,2,2,3,2,3,3,4 80 = 08 [2
> .byte 2,3,3,4,3,4,4,5 88 = 18 [3
> .byte 2,3,3,4,3,4,4,5 90 = 18 [3
> .byte 3,4,4,5,4,5,5,6 98 = 38 [4

> .byte 2,3,3,4,3,4,4,5 a0 = 18 [3
> .byte 3,4,4,5,4,5,5,6 a8 = 38 [4
> .byte 3,4,4,5,4,5,5,6 b0 = 38 [4
> .byte 4,5,5,6,5,6,6,7 b8 = 78 [5

> .byte 2,3,3,4,3,4,4,5 c0 = 18 [3
> .byte 3,4,4,5,4,5,5,6 c8 = 38 [4
> .byte 3,4,4,5,4,5,5,6 d0 = 38 [4
> .byte 4,5,5,6,5,6,6,7 d8 = 78 [5

> .byte 3,4,4,5,4,5,5,6 e0 = 38 [4
> .byte 4,5,5,6,5,6,6,7 e8 = 78 [5
> .byte 4,5,5,6,5,6,6,7 f0 = 78 [5
> .byte 5,6,6,7,6,7,7,8 f8 6.

I wonder if a few calculations won't reduce the LUT to 6*8 bytes
or even smaller with a double '0,1,1,2' index conversion,
which then could be held in registers ...
Haven't checked on the compressed method in detail, so this
may be in use already anyway.

[code..]

> So we got down to about 1 byte per 1.5 clocks for the 8-bit LUT.
> Quite disappointing. The 4-bit LUT is close to the AMD manual
> method, though. For a processor with fast PSHUFB and no
> hardware POPCNT this may be a good thing. PSHUFB unfortunately is
> slow on my Core 2 Duo E6700.

My AMD64 docs show latency/throughput = 2(5)/1 for POPCNT, and I see
this as 0.25 cycles per byte on GPR64 and 0.625 on mem64, .. not bad
and hard to beat with an algo.

__
wolfgang


Maarten Kronenburg

Apr 16, 2008, 8:42:37 AM

"James Van Buskirk" wrote in message
> There was a thread about population count in comp.lang.fortran but
> nobody seemed to like my assembly code over there:
>

Below is my xmm version, it runs a little under 3 cycles/32-bit.
It's not as fast as your code (1 cycle/32-bit), but on a 32-bit system there
are not enough xmm registers to do the CSA trick, so this is how far I get.
As mentioned pdata is the pointer and ndata is the number of 32-bit words.
This is for lzasm, for nasm replace @@ with .label and [pdata] [ndata] with
[esp+16] [esp+20],
and remove PROC and the ARG and ENDP lines.

PROC __int_popcnt
ARG pdata:dword, ndata:dword
push ebp
mov ebp, esp

push esi
push edi
mov esi, [pdata]
mov edi, [ndata]
pxor xmm7, xmm7
mov eax, 0F0F0F0Fh
movd xmm6, eax
punpckldq xmm6, xmm6
unpcklpd xmm6, xmm6
movdqa xmm5, xmm6
pslld xmm5, 2
pxor xmm5, xmm6
movdqa xmm4, xmm5
pslld xmm4, 1
pxor xmm4, xmm5
pxor xmm0, xmm0
sub esi, 16
mov ecx, edi
and edi, 0FFFFFFFCh
mov eax, edi
lea eax, [eax+4]
jz short @@3
@@1:
movdqu xmm1, [esi+4*edi]
@@2:
movdqa xmm2, xmm1
psrld xmm2, 1
pand xmm1, xmm4
pand xmm2, xmm4
paddd xmm1, xmm2
movdqa xmm2, xmm1
psrld xmm2, 2
pand xmm1, xmm5
pand xmm2, xmm5
paddd xmm1, xmm2
movdqa xmm2, xmm1
psrld xmm2, 4
paddd xmm1, xmm2
pand xmm1, xmm6
psadbw xmm1, xmm7
paddd xmm0, xmm1
sub edi, 4
jnz short @@1
@@3:
and ecx, 3
jz short @@5
mov edi, 4
dec ecx
jnz short @@4
movd xmm1, [esi+4*eax]
jmp short @@2
@@4:
movq xmm1, [esi+4*eax]
dec ecx
jz short @@2
movd xmm2, [esi+4*eax+8]
movlhps xmm1, xmm2
dec ecx
jmp short @@2
@@5:
movhlps xmm1, xmm0
paddd xmm0, xmm1
movd eax, xmm0
pop edi
pop esi

James Van Buskirk

Apr 16, 2008, 12:05:48 PM
"Wolfgang Kern" <spam...@crayne.org> wrote in message
news:fu4v9j$b5g$1...@newsreader2.utanet.at...

> "James Van Buskirk" <spam...@crayne.org> schrieb im Newsbeitrag
> news:P9adnbQZvO6pK57V...@comcast.com...

> when I look at this table ...

I think I see what you are driving at. Actually if you look
closer at the 4-bit LUT code it holds the first 16 bytes of the
LUT in one register, and looks up bytes in two stages. Look at
the _popcnt4 function (latest version to be found at
http://groups.google.com/group/comp.lang.asm.x86/msg/ee1ed0bbd6f35874
) and it may already be doing what you are proposing.

> My AMD64 docs show latency/throughput = 2(5)/1 for POPCNT, and I see
> this as 0.25 cycles per byte on GPR64 and 0.625 on mem64, .. not bad
> and hard to beat with an algo.

This doesn't beat the _popcnt1 version, which is getting 4.7 bytes
per clock, but the latency/throughput numbers quoted above for
Phenom imply almost 8 bytes/clock to me:

popcnt (%rcx), %rbx
addq %rbx, %rax
popcnt 8(%rcx), %rbx
addq %rbx, %rax
...

But I think it can go even faster because Phenom should be capable
of 3 DirectPath operations per cycle as well as two loads, and the
above sequence only uses one load and two DirectPath operations
per cycle. Thus every third instruction could be devoted to
a compression algorithm like the one seen in _popcnt1, whose
final counting scheme could be accelerated with hardware popcount.
Of course my attempts to get Core 2 Duo to sustain 3 useful
calculational operations per cycle have been in vain, so there may
also be throughput issues with Phenom that prevent this idea from
becoming reality, but if anyone out there has a Phenom running
64-bit Windows I would be eager to run some tests...

Gerd Isenberg

Apr 16, 2008, 12:58:11 PM
Thanks for giving it a try.

The routine was intrinsically a waste product from a byte[64] times
bit[64] dot-product and was not really considered a serious candidate
for population counting. Anyway, the independent instruction chains
were tempting compared to the dependencies of the SWAR one - if you
count some 64-bit words now and then, but no streams - with a lot of
other memory traffic around polluting the caches. But of course the
popcnt instruction will do much better if available for that purpose.

Btw. you seem to introduce some penalties by mixing float and integer
instructions, e.g. andnps instead of pandn, movaps instead of movdqa
etc. See the intel optimization manual 5.1 General Rules On SIMD
Integer code. Not sure whether this is an issue with pxor versus xorps
as well, related to your other question on pxor throughput on core 2.

Gerd

Maarten Kronenburg

unread,
Apr 16, 2008, 1:38:21 PM4/16/08
to

"Maarten Kronenburg" wrote in message
>

As always there is some room for improvement,
below the loop runs the other way which is a bit better:

PROC __int_popcnt
ARG pdata:dword, ndata:dword
push ebp
mov ebp, esp
push esi
push edi
mov esi, [pdata]
mov edi, [ndata]
pxor xmm7, xmm7
mov eax, 0F0F0F0Fh
movd xmm6, eax
punpckldq xmm6, xmm6
unpcklpd xmm6, xmm6
movdqa xmm5, xmm6
pslld xmm5, 2
pxor xmm5, xmm6
movdqa xmm4, xmm5
pslld xmm4, 1
pxor xmm4, xmm5
pxor xmm0, xmm0

mov ecx, edi
and edi, 0FFFFFFFCh

lea esi, [esi+4*edi]
neg edi


jz short @@3
@@1:
movdqu xmm1, [esi+4*edi]
@@2:
movdqa xmm2, xmm1
psrld xmm2, 1
pand xmm1, xmm4
pand xmm2, xmm4
paddd xmm1, xmm2
movdqa xmm2, xmm1
psrld xmm2, 2
pand xmm1, xmm5
pand xmm2, xmm5
paddd xmm1, xmm2
movdqa xmm2, xmm1
psrld xmm2, 4
paddd xmm1, xmm2
pand xmm1, xmm6
psadbw xmm1, xmm7
paddd xmm0, xmm1

add edi, 4


jnz short @@1
@@3:
and ecx, 3
jz short @@5

mov edi, 0FFFFFFFCh


dec ecx
jnz short @@4

movd xmm1, [esi]
jmp short @@2
@@4:
movq xmm1, [esi]


dec ecx
jz short @@2

movd xmm2, [esi+8]

James Van Buskirk

unread,
Apr 16, 2008, 2:41:36 PM4/16/08
to
"Maarten Kronenburg" <spam...@crayne.org> wrote in message
news:4805f444$0$717$7ade...@textreader.nntp.internl.net...

> Below is my xmm version, it runs a little under 3 cycles/32-bit.
> It's not as fast as your code (1 cycle/32-bit), but on 32-bit system there
> are not enough xmm registers to do the CSA trick, so this is how far I
> get.

> @@1:


> movdqu xmm1, [esi+4*edi]
> @@2:
> movdqa xmm2, xmm1
> psrld xmm2, 1
> pand xmm1, xmm4
> pand xmm2, xmm4
> paddd xmm1, xmm2
> movdqa xmm2, xmm1
> psrld xmm2, 2
> pand xmm1, xmm5
> pand xmm2, xmm5
> paddd xmm1, xmm2
> movdqa xmm2, xmm1
> psrld xmm2, 4
> paddd xmm1, xmm2
> pand xmm1, xmm6
> psadbw xmm1, xmm7
> paddd xmm0, xmm1
> sub edi, 4
> jnz short @@1

3 cycles per 4 bytes is pretty slow. What kind of processor are you
testing with? There are some tweaks that can speed up your code.
Firstly, look at how the first stage where 1-bit counters are
combined to form 2-bits counters is performed in the AMD manual.
Notice that they use a subtraction instead of an addition, a
clever optimization that can save you an AND.
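
In scalar form the subtraction trick looks like this: each 2-bit field
holding bits (a,b) has value 2a+b, and its bit count is a+b = (2a+b) - a,
so one AND can be dropped:

#include <stdint.h>

static uint32_t step1_add(uint32_t x)  /* symmetric form: two ANDs */
{
    return (x & 0x55555555u) + ((x >> 1) & 0x55555555u);
}

static uint32_t step1_sub(uint32_t x)  /* AMD-manual form: one AND */
{
    return x - ((x >> 1) & 0x55555555u);
}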

You might think that it's a good thing to keep all your constants in
registers, but that is more problematic on Intel processors where
you have only limited ROB read bandwidth. What that amounts to is
you can only read at most two registers per cycle than aren't "in
flight" or three if the third is used as an index. See 248966.pdf,
section 3.5.2.1 on ROB read port stalls. Originally my code had all
the constants in registers but I moved some of them to memory with
consequent speed-up. Also you may consider making edi 4X as big
and addressing as movdqu xmm1, [edi+1*esi] so that your dead
register is the index register. Thus it won't participate in ROB
read port stalls.

Also when you have to perform a movaps to copy a value because of
the 2-register ISA we are laboring under (movaps is one byte shorter
than movdqa which is why it is more freqently seen, even in integer
code) copy a constant rather than the in-flight variable. This
allows the processor to issue the copy instruction out of order
well in advance of where it's needed so that it can't get in the way
of the critical path instructions.

After you have moved some of the constants from registers to memory
and tweaked your code a little, you may find that as a side effect
of having constants in memory you have now some free registers so
that you may be able to do a little CSA compression after all.

James Van Buskirk

Apr 16, 2008, 2:59:38 PM
"Gerd Isenberg" <spam...@crayne.org> wrote in message
news:fdc6b0bd-c736-46fc...@m73g2000hsh.googlegroups.com...

> Btw. you seem to introduce some penalties by mixing float and integer
> instructions, e.g. andnps instead of pandn, movaps instead of movdqa
> etc. See the intel optimization manual 5.1 General Rules On SIMD
> Integer code. Not sure whether this is an issue with pxor versus xorps
> as well, related to your other question on pxor throughput on core 2.

Agner Fog seems to think of this as an old wives' tale. If you look
in instruction_tables.pdf on his web site he gives the unit (flt or
int) that the instruction goes to and all logical operations: pandn
andnps, andnpd as well as pxor, xorps, and xorpd, go to the int unit.
This is annoying in a sequence such as

mulpd xmm0, xmm1
xorpd xmm0, xmm2
addpd xmm0, xmm3

The xorpd instruction to negate xmm0 before the addition takes three
clocks because its input came from the flt unit and its output will
go to the flt unit, the stall occurring even though it has 'pd' in its
name. Same with movdqa, movaps, movapd reg, reg: all 6 forms go through
the int unit. May be different on AMD processors, I haven't had the
opportunity to check timing code there. As a result, I normally
use the packed-single rather than the packed-double or integer form
of the instructions because it's one byte shorter. One can always
change things around and time everything again to see if it makes a
positive or negative difference.

Maarten Kronenburg

Apr 16, 2008, 3:39:47 PM

"James Van Buskirk" wrote in message
> "Maarten Kronenburg" wrote in message
>

> 3 cycles per 4 bytes is pretty slow. What kind of processor are you


> testing with? There are some tweaks that can speed up your code.
> Firstly, look at how the first stage where 1-bit counters are
> combined to form 2-bits counters is performed in the AMD manual.
> Notice that they use a subtraction instead of an addition, a
> clever optimization that can save you an AND.
>

Yes I saw that also in your code, I will try to study and understand this.
This would indeed save me one and in the loop.

> You might think that it's a good thing to keep all your constants in
> registers, but that is more problematic on Intel processors where
> you have only limited ROB read bandwidth. What that amounts to is
> you can only read at most two registers per cycle than aren't "in
> flight" or three if the third is used as an index. See 248966.pdf,
> section 3.5.2.1 on ROB read port stalls. Originally my code had all
> the constants in registers but I moved some of them to memory with
> consequent speed-up. Also you may consider making edi 4X as big
> and addressing as movdqu xmm1, [edi+1*esi] so that your dead
> register is the index register. Thus it won't participate in ROB
> read port stalls.

OK I will try that also.

>
> Also when you have to perform a movaps to copy a value because of
> the 2-register ISA we are laboring under (movaps is one byte shorter
> than movdqa which is why it is more freqently seen, even in integer
> code) copy a constant rather than the in-flight variable. This
> allows the processor to issue the copy instruction out of order
> well in advance of where it's needed so that it can't get in the way
> of the critical path instructions.

Yes I understand, then the copy is out of the dependency chain.

>
> After you have moved some of the constants from registers to memory
> and tweaked your code a little, you may find that as a side effect
> of having constants in memory you have now some free registers so
> that you may be able to do a little CSA compression after all.
>
> --

Thanks for your suggestions, I have some work to do.
My library has many other assembler routines for multiplication etc,
optimizing those also means getting instructions out of the dependency
chain. Mostly the first step is getting a working version, then the second
step is this kind of optimization.
Regards, Maarten.

Maarten Kronenburg

Apr 17, 2008, 10:43:31 AM

"Maarten Kronenburg" wrote in message
>
>
Optimizing the loop instructions improves from 2.8 to 2.4 cycles/32-bit.
The first lines in the loop should be:
movdqa xmm2, xmm4
pandn xmm2, xmm1
psrld xmm2, 1
psubd xmm1, xmm2
movdqa xmm2, xmm5
pandn xmm2, xmm1

psrld xmm2, 2
pand xmm1, xmm5
paddd xmm1, xmm2
; ...
Another function that may be optimized this way is bit reversal.
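
For bit reversal, the analogous SWAR sequence would be something like
this sketch (32-bit scalar version; swap fields of increasing width):

#include <stdint.h>

static uint32_t bitrev32(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
    return (x >> 16) | (x << 16);
}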
