Any libraries for vector mask and vector population count?

Dan

unread,

Apr 6, 2008, 11:01:07 AM4/6/08

to

Hi,

I have some legacy code which includes calls to the CAL2 vector mask
and vector population count instructions. Lacking a nearby Cray, does
anyone know of subroutines that implement these instructions?

I've searched comp.lang.fortran, sent email to cray computer (no
reply, probably no one left writing in CAL for speed), searched with
google (all Photoshop hits which I did not understand), checked the
'95, '03, and proposed '08 Fortran standards, but no luck.

The Intel manuals have MMX/SSE{1234} instructions which might work,
and AMD has the "something-now" which are apparently the same, but
I've not seen any demo code emulating the mask and popcount
instructions.

Any suggestions would be welcome.

TIA,
dan

Tim Prince

unread,

Apr 6, 2008, 11:38:30 AM4/6/08

to

Dan wrote:

>
> I have some legacy code which includes calls to the CAL2 vector mask
> and vector population count instructions. Lacking a nearby Cray, does
> anyone know of subroutines that implement these instructions?
>

Maybe it would help to show some of the code, if it isn't secret like much
usage of popcount.
Apparently, GMP supports popcount for those processors which implement it.

> The Intel manuals have MMX/SSE{1234} instructions which might work,
> and AMD has the "something-now" which are apparently the same, but
> I've not seen any demo code emulating the mask and popcount
> instructions.

SSE popcount instruction isn't available in currently released hardware,
but it will be along soon.

SSE auto-vectorization of conditional operations is often done with vector
masks. Fortran MERGE for SSE2 is vectorized with masks; the latest
additions to SSE include more direct merge support.
IAND and IEOR support masking operations. If your Fortran doesn't
auto-vectorize them in the context you require, there are SSE intrinsic
operators, supported by C (and possibly 1 or 2 Fortran) compilers.

Dan Nagle

unread,

Apr 6, 2008, 1:06:09 PM4/6/08

to

Hello,

On 2008-04-06 11:01:07 -0400, Dan <daviso...@gmail.com> said:
>
> I have some legacy code which includes calls to the CAL2 vector mask
> and vector population count instructions. Lacking a nearby Cray, does
> anyone know of subroutines that implement these instructions?

IIRC, vector mask is close to the standard merge().
Perhaps Dick can recall better them I can.

--
Cheers!

Dan Nagle

Dick Hendrickson

unread,

Apr 6, 2008, 1:42:38 PM4/6/08

to

What do you want them for? Is blinding speed necessary? Or do
you just need to get the old code working? If it's the latter,
the, as Dan said, the vector mask calls are similar to the
merge intrinsic.

The vector mask triggered off zero or one bits in the mask,
whereas the merge works on true or false. There were also
a set of functions with names like CVMGP that returned the
first or second argument depending on whether or not the third
was Positive.

Pop count is harder. If the hardware doesn't have that instruction
or if the compiler won't generate it, or if you don't want to
do assembler, you can simulate it with a ton of shifts and ands
or ors. I think I remember a few years ago someone here saying
they had a popcount subroutine. It wasn't just brute force;
I think they split the input into 8 or 16 bit chunks and then
did a table lookup on those and added the results. If your
machine supports 64 bit addressing, you could do it as a single
step table lookup, although creating the table would be a
big deal ;) .

In many cases the Cray functions were actually non-standard
intrinsics and the compiler recognized them and generated
inline code rather than "really" calling a subroutine.

Dick Hendrickson

Dan

unread,

Apr 6, 2008, 1:50:39 PM4/6/08

to

Hi Tim and Dan,

> > I have some legacy code which includes calls to the CAL2 vector mask
> > and vector population count instructions. Lacking a nearby Cray, does
> > anyone know of subroutines that implement these instructions?
>
> IIRC, vector mask is close to the standard merge().
> Perhaps Dick can recall better them I can.

Perhaps I'm wrong, but as I read the standard, both MERGE() and
MERGE_BITS() both want to play with integers. The standard explicitly
says

Arguments (for MERGE_BITS)
I shall be of type integer or a boz-literal-constant.

J shall be of type integer or a boz-literal-constant.
If both I and J are of type integer they shall have the same
kind type parameter.
******** I and J shall not both be boz-literal-
constants.**********************************
(boz == binary, octal, hex)

Result Characteristics.
Same as I if I is of type integer; otherwise, same as J.
Result Value.
If any argument is a boz-literal-constant,
*************it is first converted as if by the intrinsic function INT
**********************
to the type and kind type parameter of the result. The
result has the value of
IOR (IAND (I,MASK), IAND (J, NOT (MASK))).

I'm probably missing something obvious, but this is not good. My data
are all four-bit values, and the original idea was somewhat like a
systolic array. Just keep reloading the second vector (J, in the terms
of the standard) do the mask, do the popcount, reload J with the next
set of bits. (Highly simplified, please let me know if I haven't
conveyed the idea clearly). There is already a tremendous amount of
bookkeeping going on, and I would prefer not to add a format
conversion (the implied INT) into the mix. Am I missing something?

Thanks again,
dan

Tim Prince

unread,

Apr 6, 2008, 3:10:12 PM4/6/08

to

Tim Prince wrote:
> Dan wrote:
>
>>
>> I have some legacy code which includes calls to the CAL2 vector mask
>> and vector population count instructions. Lacking a nearby Cray, does
>> anyone know of subroutines that implement these instructions?
>>
>
> Maybe it would help to show some of the code, if it isn't secret like
> much usage of popcount.
> Apparently, GMP supports popcount for those processors which implement it.
>
>> The Intel manuals have MMX/SSE{1234} instructions which might work,
>> and AMD has the "something-now" which are apparently the same, but
>> I've not seen any demo code emulating the mask and popcount
>> instructions.
>
> SSE popcount instruction isn't available in currently released hardware,
> but it will be along soon.

Wrong, it's in the original SSE4 supported by Penryn processors:
http://softwarecommunity.intel.com/articles/eng/1193.htm
(no compiler support, except for a C compatible library function).

FX

unread,

Apr 6, 2008, 3:48:00 PM4/6/08

to

> no compiler support

I'm not sure exactly what you mean by that:

$ cat a.c
int foo(int i) { return __builtin_popcount (i); }
$ gcc -S a.c -msse4
$ cat a.s
.text
.globl _foo
_foo:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
movl 8(%ebp), %eax
popcntl %eax, %eax
leave
ret
.subsections_via_symbols

There clearly is a popcntl opcode used. If you don't specify -msse4, you get a
call to a function in the GCC support library (libgcc).

$ gcc -v
Using built-in specs.
Target: i386-apple-darwin8.10.1
Configured with: /tmp/gfortran-20080221/ibin/../gcc/configure --prefix=/usr/local/gfortran --enable-languages=c,fortran --with-gmp=/tmp/gfortran-20080221/gfortran_libs --enable-bootstrap
Thread model: posix
gcc version 4.4.0 20080221 (experimental) [trunk revision 132519] (GCC)

--
FX

Dick Hendrickson

unread,

Apr 6, 2008, 5:09:19 PM4/6/08

to

Dan wrote:
> Hi Tim and Dan,
>
>>> I have some legacy code which includes calls to the CAL2 vector mask
>>> and vector population count instructions. Lacking a nearby Cray, does
>>> anyone know of subroutines that implement these instructions?
>> IIRC, vector mask is close to the standard merge().
>> Perhaps Dick can recall better them I can.
>
>
> Perhaps I'm wrong, but as I read the standard, both MERGE() and
> MERGE_BITS() both want to play with integers. The standard explicitly
> says

Yeah, you're wrong ;). MERGE takes any type (or kind) of arguments
for the first two (as long as they are of the same kind) and a logical
for the third. MERGE_BITS is a tough one to use; it's a Fortran 2008
function and it's going to be a while before there's a F2008 compiler.

I'm really lost about what you are trying to do. Could you give a
short code sample? On the one hand it sounds like you are doing
new code, and I thought your original message was about a legacy
code. But, I'm easily confused.

Dick Hendrickson
>
> Thanks again,
> dan

Dan

unread,

Apr 6, 2008, 5:48:55 PM4/6/08

to

Hi all,

Yes, raw speed is needed. That's why we went from fortran to assembler
in the first place. Since the code was written, the amount of data has
increased by three, perhaps four, orders of magnitude.

I just looked over the standard's definition of integer, and it
implies all integers are signed. Is that true? If there are unsigned
integers, then there should be no problem.

For those with serious assembler leanings, appended below is a brief
description of the CAL routine, then the cal routine itself. I added "<
++++" at the vector mask instructions.

For those lucky enough not to have to program in CAL, S registers are
scalar, V are vector, T are transfer, A and B are data registers, This
code was originally written for a Cray-1, so it is simpler than X and
Y assembler. The code is uncommented deliberately. Sorry about that.

thanks all,
dan davison

CAL SUBROUTINES:
c
c CMPRS1N (number-1,element1 array1,element1 array2)
c
c The number of consecutive elements examined is specified,
c as well as the addresses of the first elements in each array.
c If the element of array2 is nonzero, the corresponding
c element of array1 is kept. The kept elements are written
c without space between and in their original order in
c the array1 commencing with the position of the first
c element specified. The value of the first argument is changed
c to the number of kept elements minus one.

* Copyright, 1989, The Regents of the University of California.
* This software was produced under a U. S. Government contract
* (W-7405-ENG-36) by the Los Alamos National Laboratory, which
* is operated by the University of California for the U. S.
* Department of Energy. The U. S. Government is licensed to use,
* reproduce, and distribute this software. Permission is granted
* to the public to copy and use this software without charge,
* provided that this Notice and any statement of authorship are
* reproduced on all copies. Neither the Government nor the
* University makes any warranty, express or implied, or assumes
* any liability or responsibility for the use of this software.
*
IDENT CMPRS1N (ju,iz(0),ii(0))
CMPRS1N ENTER PRELOAD=0,NP=3,ALIGN=ON
ARGADD A1,1,ARGPTR=A6 address of ju
ARGADD A2,2,ARGPTR=A6 address of iz(0)
ARGADD A6,3,ARGPTR=A6 address of ii(0)
B71 A1
A1 0,A1
S2 A1
A1 A1+1
VL A1
A7 VL
S2 S2>6
A1 S2
A1 -A1
A3 A2
B72 A2
JLUP = *
A0 A6
A6 A6+A7
V6 ,A0,1
V1,VM V6,N <++++++++ vector mask
S1 VM <++++++++ vector mask
B77 A7
A7 PS1
A0 A7
VL A7
JAZ SKIP
A0 A2
V2 ,A0,V1
A0 A3
A3 A3+A7
,A0,1 V2
SKIP = *
A7 B77
A2 A2+A7
A0 A1
A1 A1+1
A7 ZS0
VL A7
JAM JLUP
A2 B72
A2 A3-A2
A2 A2-1
A1 B71
,A1 A2
EXIT
END

Tim Prince

unread,

Apr 6, 2008, 6:22:05 PM4/6/08

to

OK, that's interesting. gcc 4.4 leaves other compilers in the dust as far
as popcount is concerned. I take it this won't be accessible directly by
gfortran.

Terence

unread,

Apr 6, 2008, 6:42:59 PM4/6/08

to

I think popcount is just a bit count of '1's in a string of bits.

The standard way in market research, which uses this a lot, is to
generate a table of bit counts for all possible 8-bit words (or use a
pre-calculated one). Better is use an assembler fucntion.

The popcount of a bit string is just the continued sum of the table
entries corresponding to the 8-bit words it comprises (if the stirg is
not an 8-bit multiple in length, the last is zero entended on
capture). The assembler routine is surprisingly short and the main
time loss is the function housekeeping in Fortan itself, very little
occuring in stack control in assembler.

The same applies to a 16 or 32-bit version, which for raw speed can
use one large table, or treat each register word as a set of 8-bit sub-
words and use a 256-word table.

I added these functions and others as object files in an additional
link library used by the Fortran compiler.

FX

unread,

Apr 6, 2008, 7:28:24 PM4/6/08

to

> I take it this won't be accessible directly by gfortran.

Fortran 2008 has it. I think it will be in gfortran-4.4.

--
FX

glen herrmannsfeldt

unread,

Apr 6, 2008, 7:34:22 PM4/6/08

to

Terence wrote:
> I think popcount is just a bit count of '1's in a string of bits.

> The standard way in market research, which uses this a lot, is to
> generate a table of bit counts for all possible 8-bit words (or use a
> pre-calculated one). Better is use an assembler fucntion.

It works pretty well in C, which allows one to reference data
as an array of unsigned char. How fast depends on how fast
the machine can do byte access.

Another way often done in C depends on the properties of twos
complement arithmetic as:

k=0;
while(j=j&(j-1)) k++;

Though neither C nor Fortran guarantee twos complement, it is
popular enough that people use the system dependent code.

k=0
while(j.ne.0)
j=iand(j,j-1)
k=k+1
end while

The time will be somewhat proportional to the number of one
bits in j. For small numbers of one bits it will likely
be faster than the byte table lookup. One could also
unroll the loop.

-- glen

Walter Spector

unread,

Apr 6, 2008, 8:22:05 PM4/6/08

to

Dan wrote:
> ...

> For those lucky enough not to have to program in CAL, S registers are
> scalar, V are vector, T are transfer, A and B are data registers, This
> code was originally written for a Cray-1, so it is simpler than X and
> Y assembler. The code is uncommented deliberately. Sorry about that.

All it does is the same as the Fortran-90 PACK intrinsic. Just use
the intrinsic. If it isn't fast enough, complain to your compiler
vendor.

And it won't work on a Cray-1. It uses the compressed index and hardware
gather instructions. Which means mid-life X-MP, onwards.

The Cray compilers were eventually able to vectorize this idiom
automatically. The CAL routine was unnecessary, and probably slower
than letting the compiler generate inline code.

W.

Dan

unread,

Apr 6, 2008, 8:41:43 PM4/6/08

to

Tim, FX, Terrence, Dick, Glen, Dan, and anyone I missed,

Thank you very, very much. Dick asked me to me clearer on what the
problem is. I have the legacy code (posted earlier) that uses vector
mask (VM) and vector popcount (PS) to compare two 2-bit encoded
strings. The code posted assumes 64 bit vector registers and 2-bit
encoded DNA sequences (00=A, 01=B, 10=C, 11=D). The international
standard for DNA codes has expanded from 4 to 16(*), so the new
version I have to produce must operate on DNA sequences encoded into 4
bits.

The original code did a lot of setup and bookkeeping in an odd mix of
f66 and f77 with lots of CALMATH extensions. I've removed most of the
cruft from the Fortran routines (a late-night collision between f90
and f95 in a dark alley) which compiles and appears to Do The Right
Thing. The other assembler routines (except for two) were easily re-
written into f77 and painfully ('cause I had to learn pieces of f95)
into f95.

I decided I did NOT want to get into this situation - being dependent
on assembler - again. That lead to my asking whether there were
fortran subroutines around that emulated vector mask and vector
popcount. F05 and F08 have cleaned up calling C routines quite a bit,
but there are no F05 and F08 compilers, apparently.

If gcc will let me at the sse{3,4} instructions, then I can go ahead
and keep converting the code. Integers are unsigned or can be declared/
KINDed unsigned, so MERGE() is not a problem.

All y'all have come up with at least two variations on popcount that
I've not seen before and are much better than mine.

My take-home lesson is that I can (1) use MERGE() for the vector
masking and (2) have a few ideas on vector popcount, and (3) there
are no VM and PS libraries available, but some chip I've not heard of
yet has popcount built in (SSE4).

I am mildly confused about why these instructions are already present.
They're used in petrochemical, economics & financial, biological,
encryption/decryption, 3D graphics, image analysis,
crystallography...oh well, so it goes.

Thank you all,
dan davison
(*) The sixteen come from A, G, C, T, not A, not C, not G, not T, (A||
G), (C||T), and so on ...

James Van Buskirk

unread,

Apr 6, 2008, 8:46:32 PM4/6/08

to

"Terence" <tbwr...@cantv.net> wrote in message
news:022f6f0d-6e84-492b...@s13g2000prd.googlegroups.com...

>I think popcount is just a bit count of '1's in a string of bits.

Correct.

> The standard way in market research, which uses this a lot, is to
> generate a table of bit counts for all possible 8-bit words (or use a
> pre-calculated one). Better is use an assembler fucntion.

Well, the lookup table method is quite slow. I recalled that AMD
provided algorithms for efficient POPCNT. In their phenom docs,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf

they just recommend the POPCNT instruction in section 8.8. So we
have to go back in time to

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF

and we find a nifty divide and conquer algorithm in section 8.6.
For bigger bit strings using a PSADBW instruction could clean up
a set of 8-bit sums quickly. There is another trick that involves
using m n-bit registers (actually 2*m-1) as n m-bit adders and knocks
out large bit arrays ridiculously rapidly, but let's not get into
that today.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end

glen herrmannsfeldt

unread,

Apr 6, 2008, 9:19:25 PM4/6/08

to

Dan wrote:

> Thank you very, very much. Dick asked me to me clearer on what the
> problem is. I have the legacy code (posted earlier) that uses vector
> mask (VM) and vector popcount (PS) to compare two 2-bit encoded
> strings. The code posted assumes 64 bit vector registers and 2-bit
> encoded DNA sequences (00=A, 01=B, 10=C, 11=D). The international
> standard for DNA codes has expanded from 4 to 16(*), so the new
> version I have to produce must operate on DNA sequences encoded into 4
> bits.

Yes, including the ambiguity codes for all 2**4 combinations
including or excluding each base. Usually, you won't do a
direct comparison of ambiguity codes, though. But you probably
know that.

Both vector merge and population count can be written in reasonably
standard Fortran. (There are complications due to signed arithmetic,
but they most likely won't cause a problem on real machines.)
The MERGE intrinsic might be slightly faster, but maybe not
a lot faster. Population count without hardware support
also might be only slightly faster with an intrinsic, if
one existed.

The Fortran 2003 C interoperability facility is supported reasonably
well by some current compilers.

As someone mentioned but didn't give the details, there is a fairly
fast way to do vector population count using logical operators
and shift operators, which should work well with Fortran intrinsic
functions. The fastest form for vectors likely depends how fast
memory (hopefully cached) cycles are.

-- glen

Dan

unread,

Apr 6, 2008, 9:18:33 PM4/6/08

to

On Apr 6, 8:22 pm, Walter Spector <w6ws_xthiso...@earthlink.net>
wrote:

> All it does is the same as the Fortran-90 PACK intrinsic. Just use
> the intrinsic. If it isn't fast enough, complain to your compiler
> vendor

Thank you - I will check this out immediately.

> And it won't work on a Cray-1. It uses the compressed index and hardware
> gather instructions. Which means mid-life X-MP, onwards.

True. The code posted was second generation, after the lab replaced
all the Cray-1s with 4-processor X and later 8-processor Ys. The first
generation was from a lab-specific operating system and a lab-specific
compiler and math library. The later machines used UNICOS and
(usually) standard math libraries. The was much wailing and rending of
garments when that happened.

> The Cray compilers were eventually able to vectorize this idiom
> automatically. The CAL routine was unnecessary, and probably slower
> than letting the compiler generate inline code.

I'm not sure. The first generation CAL was much faster than the
fortran subroutine. However, as you noted, that was on the Cray-1. We
just carried over onto the later Crays and I certainly never checked.
Certainly the fortran compiler on the Y did a much better job (judged
by instrumenting the code) than the 1 and X. However, since the
routine did not use both "sides" of the vector array (you probably
remember the V arrays were split between processors) management
reasonably required us to stick to using the 1 or the X.

Again, thanks very much for your comments.

glen herrmannsfeldt

unread,

Apr 6, 2008, 9:24:59 PM4/6/08

to

Walter Spector wrote:
(snip)

> All it does is the same as the Fortran-90 PACK intrinsic. Just use
> the intrinsic. If it isn't fast enough, complain to your compiler
> vendor.

> And it won't work on a Cray-1. It uses the compressed index and hardware
> gather instructions. Which means mid-life X-MP, onwards.

I had thought he meant gather (pack), also, but looking at the CRAY
manual on bitsavers it does look more like merge. Without the
vector registers on Cray machines I wouldn't expect the intrinsic
to be a lot faster. (That is, I would guess less than twice as fast.)

Remember Amdahl's law.

-- glen

glen herrmannsfeldt

unread,

Apr 7, 2008, 5:28:56 AM4/7/08

to

Dan wrote:

> Yes, raw speed is needed. That's why we went from fortran to assembler
> in the first place. Since the code was written, the amount of data has
> increased by three, perhaps four, orders of magnitude.

> I just looked over the standard's definition of integer, and it
> implies all integers are signed. Is that true? If there are unsigned
> integers, then there should be no problem.

They are signed, but you can use them with the bit manipulation
instructions, anyway. Funny things might happen on ones complement
machines, but that should be rare. I offer two population
count methods in Fortran that should be fairly fast.
The second seems to generate lots of moves to/from memory
on g95 without optimization, but keeps intermediates in
registers with -O2.

For g95 on IA32 the -O2 case it takes 29 instructions for the
second method, straight line code with no branches or
intermediate stores. For a vector population count it might
be possible to optimize it a little more to overlap more
of the additions. That might help on a machine with more
registers than IA32 to keep intermediate values.

do i=1,100000000
j=i
k=0
do while(j.ne.0)
j=iand(j,j-1)
k=k+1
end do
l1=iand(i,z'55555555')+ishft(iand(i,z'aaaaaaaa'),-1)
l2=iand(l1,z'33333333')+ishft(iand(l1,z'cccccccc'),-2)
l4=iand(l2,z'0f0f0f0f')+ishft(iand(l2,z'f0f0f0f0'),-4)
l8=iand(l4,z'0f0f0f0f')+ishft(iand(l4,z'f0f0f0f0'),-4)
l16=iand(l8,z'00ff00ff')+ishft(iand(l8,z'ff00ff00'),-8)
l32=iand(l16,z'0000ffff')+ishft(iand(l16,z'ffff0000'),-16)
m=l32
if(k.ne.m) write(*,*) i,k,m
enddo
end

-- glen

Dan

unread,

Apr 7, 2008, 8:18:11 AM4/7/08

to

Glen noted:

>
> Yes, including the ambiguity codes for all 2**4 combinations
> including or excluding each base. Usually, you won't do a
> direct comparison of ambiguity codes, though. But you probably
> know that.

The full code, called d-squared (no, not my initials, "distance
squared"), is a sequence comparison algorithm rather than a sequence
alignment algorithm. Without adding context by segmenting and
windowing the library of sequences, all it tells you is that there is
some region in sequence A whose composition (possibly with inversions,
deletions, and unknown base codes via the full IUPAC set) matches
sequence B within a certain tolerance. Only in extremely rare, or
completely meaningless cases can the two sequences be aligned.

It's a very fast way of finding Alu and L7 sequences in the human
genome (only 1.8 million copies per cell!) which have many internal
insertions, deletions, and inversions. Some newer and faster
sequencing technologies produce an enormous amount of cruft initially,
so this code is getting reworked.

dan

Dan

unread,

Apr 7, 2008, 8:27:19 AM4/7/08

to

On Apr 6, 8:22 pm, Walter Spector <w6ws_xthiso...@earthlink.net>
wrote:

> All it does is the same as the Fortran-90 PACK intrinsic. Just use

> the intrinsic. If it isn't fast enough, complain to your compiler
> vendor.

I've listened to the PACK intrinsic from the 2003 draft standard(*),
and I don't think this is a PACK. It's a merge - the conditional
vector merge in more recent cray fortran compilers is closer to what
we were trying to do.

thanks,
dan
(*) Low vision, so I listen to PDFs because Adobe has a great built-in
text-to-speech function. Most of my posts are done on a Mac for the
same reason, but I can and do still miss things, of course. Listening
to code I haven't written and can't read takes a looonnnngg time.

Walter Spector

unread,

Apr 7, 2008, 10:28:26 AM4/7/08

to

Dan wrote:
> On Apr 6, 8:22 pm, Walter Spector <w6ws_xthiso...@earthlink.net>
> wrote:
>

>> All it does is the same as the Fortran-90 PACK intrinsic....

> I've listened to the PACK intrinsic from the 2003 draft standard(*),
> and I don't think this is a PACK.

Yes, your CAL routine simply does a PACK. It is very straightforward
code.

> It's a merge - the conditional
> vector merge in more recent cray fortran compilers is closer to what
> we were trying to do.

The CVMGx intrinsics are very old (like from early Cray-1 days),
and don't pack/unpack data. They were equivalent to todays
Fortran-90 MERGE intrinsic.

W.

glen herrmannsfeldt

unread,

Apr 8, 2008, 12:52:33 AM4/8/08

to

Dick Hendrickson wrote:

> Pop count is harder. If the hardware doesn't have that instruction
> or if the compiler won't generate it, or if you don't want to
> do assembler, you can simulate it with a ton of shifts and ands
> or ors. I think I remember a few years ago someone here saying
> they had a popcount subroutine. It wasn't just brute force;
> I think they split the input into 8 or 16 bit chunks and then
> did a table lookup on those and added the results.

For machines with fast addressing of eight bit data that might
be pretty fast. Otherwise, it doesn't take that many shifts
and ands. For 32 bit data, here are two ways to do it:

do i=0,100000000
j=i

C Method based on properties of carry in binary arithmetic
C Time is somewhat proportional to the number of one bits.

k=0
do while(j.ne.0)
j=iand(j,j-1)
k=k+1
end do

C Method based on addition in binary arithmetic.
C Time should be about constant.
l1=iand(i,z'49249249')+ishft(iand(i,z'92492492'),-1)+
* ishft(iand(i,z'24924924'),-2)
l3=l1+ishft(l1,-18)
l6=iand(l3,z'71c7')+ishft(iand(l3,z'38e38'),-3)
l18=l6+ishft(l6,-6)+ishft(l6,-12)
m=iand(l18,z'3f')

C compare the two results

if(k.ne.m) write(*,*) i,k,m
enddo
end

It taked O(log n) shifts, ands, and additions.

With IA32 g95 and -O2 the second method is done in 24 instructions
with no branches. That should go pretty fast on modern processors.

-- glen

Terence

unread,

Apr 8, 2008, 7:25:14 PM4/8/08

to

I have an additional comment.
What Dan wants to do is what some early spelling chack programs used
to locate the most probable intended word against the one given, by
assuming only deletions and inclusions can occur between the sample
string and a libaray of related strings. The best match is picked on a
points basis of minimum difference.

Dan is using 2-bit codes for ALUT codons, which I regret, eliminates
the possibility of some valid rare alternatives found in DNA. Even so,
in this coding, the matching and shifting has to be two-bit wide, not
one. That makes the filtering less easy. Speed and size apart I would
have preferred characters to bits. Or hardware that is trinary (it
once existed!).
I still think an assembler routine called from Fortran is the only
hope of a fast functio.
And these (similar functions) are called from (Fortran) market
research programs, to count and filter responses.

Terence

unread,

Apr 8, 2008, 7:45:35 PM4/8/08

to

Am I missing something?
Assuming input data as nwords of 16-bit words in MAT and a table of
bitcounts in IBITC for integers 0 to 255, you calculate the '1' bit
count of such a string as follows:-

INTEGER*4 MAT(Nwords),N
INTEGER*4 IBITC(256)
DATA IBITC/0,1,1,2,1,2,2,3,......16/
C
N=0
DO 1 I=1,nwords
K1=MAT(I)
K2=ISHR(K1,8) +1
K1=IAND(K1,#FF) +1
N=N+IBITC(K1)+IBITC(K2)
1 CONTINUE

If this is done in assembler directly is is even faster.

Gerry Ford

unread,

Apr 8, 2008, 8:30:02 PM4/8/08

to

"Terence" <tbwr...@cantv.net> wrote in message

news:8aef8a66-8d45-4d8a...@p39g2000prm.googlegroups.com...

There's something about #FF that looks strange to me, beyond unfamiliarity
to low-level processes in fortran. I find this a good resource:
http://www.robelle.com/smugbook/ascii.txt

Does #FF mean 256th char in fortran?

--

"That this social order with its pauperism, famines, prisons, gallows,
armies, and wars is necessary to society; that still greater disaster
would ensue if this organization were destroyed; all this is said only
by those who profit by this organization, while those who suffer from it
- and they are ten times as numerous - think and say quite the contrary."

~~ Leo Tolstoy

Gary Scott

unread,

Apr 8, 2008, 10:23:27 PM4/8/08

to

Gerry Ford wrote:
> "Terence" <tbwr...@cantv.net> wrote in message
> news:8aef8a66-8d45-4d8a...@p39g2000prm.googlegroups.com...
>
>>Am I missing something?
>>Assuming input data as nwords of 16-bit words in MAT and a table of
>>bitcounts in IBITC for integers 0 to 255, you calculate the '1' bit
>>count of such a string as follows:-
>>
>> INTEGER*4 MAT(Nwords),N
>> INTEGER*4 IBITC(256)
>> DATA IBITC/0,1,1,2,1,2,2,3,......16/
>>C
>> N=0
>> DO 1 I=1,nwords
>> K1=MAT(I)
>> K2=ISHR(K1,8) +1
>> K1=IAND(K1,#FF) +1
>> N=N+IBITC(K1)+IBITC(K2)
>> 1 CONTINUE
>>
>>If this is done in assembler directly is is even faster.
>>
>
> There's something about #FF that looks strange to me, beyond unfamiliarity
> to low-level processes in fortran. I find this a good resource:
> http://www.robelle.com/smugbook/ascii.txt

That's a nice MS/DEC/Compaq/Intel extension for hexadecimal notation.
You can read it as 16#FF as the 16 is the default implied base. (Z'FF')

>
> Does #FF mean 256th char in fortran?
>
>

--

Gary Scott
mailto:garylscott@sbcglobal dot net

Fortran Library: http://www.fortranlib.com

Support the Original G95 Project: http://www.g95.org
-OR-
Support the GNU GFortran Project: http://gcc.gnu.org/fortran/index.html

If you want to do the impossible, don't hire an expert because he knows
it can't be done.

-- Henry Ford

glen herrmannsfeldt

unread,

Apr 8, 2008, 11:20:00 PM4/8/08

to

Gary Scott wrote:

> That's a nice MS/DEC/Compaq/Intel extension for hexadecimal notation.
> You can read it as 16#FF as the 16 is the default implied base. (Z'FF')

When did DEC do this? The VAX/VMS compilers I knew used the IBM Z
notation. (No apostrophes, only in DATA statements.)

Before VAX, DEC only had octal for non-decimal constants.

-- glen

James Van Buskirk

unread,

Apr 9, 2008, 4:45:38 AM4/9/08

to

"Terence" <tbwr...@cantv.net> wrote in message
news:8aef8a66-8d45-4d8a...@p39g2000prm.googlegroups.com...

> Am I missing something?

Well, the above doesn't seem very fast. Your snippet was
incomplete and it's unclear whether you intended an 8-bit LUT or a
16-bit LUT. gfortran barfed on the 16-bit LUT, so what we are left
with is:

C:\gfortran\clf\popcnt>type big_popcnt.s
.text
.globl _tm1
.def _tm1; .scl 2; .type 32; .endef
_tm1:
rdtsc
shrq $32, %rdx
orq %rdx, %rax
ret
.globl _popcnt1
.def _popcnt1; .scl 2; .type 32; .endef
_popcnt1:
subq $88, %rsp
movaps %xmm6, (%rsp)
movaps %xmm7, 16(%rsp)
movaps %xmm8, 32(%rsp)
movaps %xmm12, 48(%rsp)
movaps %xmm13, 64(%rsp)
movaps %xmm14, 96(%rsp)
movaps %xmm15, 112(%rsp)
movq $0xaaaaaaaaaaaaaaaa, %rax
movq %rax, %xmm0
movddup %xmm0, %xmm15
movq $0x3333333333333333, %rax
movq %rax, %xmm0
movddup %xmm0, %xmm14
movq $0x0f0f0f0f0f0f0f0f, %rax
movq %rax, %xmm0
movddup %xmm0, %xmm13
xorps %xmm12, %xmm12

xorps %xmm2, %xmm2
xorps %xmm3, %xmm3
xorps %xmm4, %xmm4
xorps %xmm5, %xmm5
xorps %xmm6, %xmm6
xorps %xmm7, %xmm7
xorps %xmm8, %xmm8

movl $32768, %edx
addq %rdx, %rcx
negq %rdx
loop1: movaps (%rcx,%rdx), %xmm0
movaps %xmm0, %xmm3
andps %xmm2, %xmm3
xorps %xmm0, %xmm2
movaps 16(%rcx,%rdx), %xmm0
movaps %xmm0, %xmm1
andps %xmm2, %xmm1
xorps %xmm0, %xmm2
orps %xmm1, %xmm3
movaps %xmm3, %xmm5
andps %xmm4, %xmm5
xorps %xmm3, %xmm4
movaps 32(%rcx,%rdx), %xmm0
movaps %xmm0, %xmm3
andps %xmm2, %xmm3
xorps %xmm0, %xmm2
movaps 48(%rcx,%rdx), %xmm0
movaps %xmm0, %xmm1
andps %xmm2, %xmm1
xorps %xmm0, %xmm2
orps %xmm1, %xmm3
movaps %xmm3, %xmm0
andps %xmm4, %xmm0
xorps %xmm3, %xmm4
orps %xmm0, %xmm5
movaps %xmm5, %xmm7
andps %xmm6, %xmm7
xorps %xmm5, %xmm6
movaps 64(%rcx,%rdx), %xmm0
movaps %xmm0, %xmm3
andps %xmm2, %xmm3
xorps %xmm0, %xmm2
movaps 80(%rcx,%rdx), %xmm0
movaps %xmm0, %xmm1
andps %xmm2, %xmm1
xorps %xmm0, %xmm2
orps %xmm1, %xmm3
movaps %xmm3, %xmm5
andps %xmm4, %xmm5
xorps %xmm3, %xmm4
movaps 96(%rcx,%rdx), %xmm0
movaps %xmm0, %xmm3
andps %xmm2, %xmm3
xorps %xmm0, %xmm2
movaps 112(%rcx,%rdx), %xmm0
movaps %xmm0, %xmm1
andps %xmm2, %xmm1
xorps %xmm0, %xmm2
orps %xmm1, %xmm3
movaps %xmm3, %xmm0
andps %xmm4, %xmm0
xorps %xmm3, %xmm4
orps %xmm0, %xmm5
movaps %xmm5, %xmm0
andps %xmm6, %xmm0
xorps %xmm5, %xmm6
orps %xmm0, %xmm7

call wrap
paddd %xmm7, %xmm8

addq $128, %rdx
jnz loop1

movaps %xmm6, %xmm7
call wrap
paddq %xmm8, %xmm8
paddq %xmm7, %xmm8
movaps %xmm4, %xmm7
call wrap
paddq %xmm8, %xmm8
paddq %xmm7, %xmm8
movaps %xmm2, %xmm7
call wrap
paddq %xmm8, %xmm8
paddq %xmm7, %xmm8
movhlps %xmm8, %xmm1
paddq %xmm8, %xmm1
movq %xmm1, %rax

movaps (%rsp), %xmm6
movaps 16(%rsp), %xmm7
movaps 32(%rsp), %xmm8
movaps 48(%rsp), %xmm12
movaps 64(%rsp), %xmm13
movaps 96(%rsp), %xmm14
movaps 112(%rsp), %xmm15
addq $88, %rsp
ret

wrap: movaps %xmm15, %xmm0
andps %xmm7, %xmm0
psrld $1, %xmm0
psubd %xmm0, %xmm7
movaps %xmm14, %xmm0
andnps %xmm7, %xmm0
psrld $2, %xmm0
andps %xmm14, %xmm7
paddd %xmm0, %xmm7
movaps %xmm7, %xmm0
psrld $4, %xmm7
paddd %xmm0, %xmm7
andps %xmm13, %xmm7
psadbw %xmm12, %xmm7
ret

C:\gfortran\clf\popcnt>type big_test.f90
program big_test
use ISO_C_BINDING
implicit none
interface
function tm1() bind(C)
import C_INT64_T
implicit none

integer(C_INT64_T) tm1
end function tm1
end interface

interface
function popcnt1(x) bind(C)
import C_INT64_T
import C_PTR
implicit none

integer(C_INT64_T) popcnt1
type(C_PTR), value :: x
end function popcnt1
end interface

interface
function popcnt2(x,n) bind(C)
import C_INT64_T
import C_PTR
import C_INT
implicit none

integer(C_INT64_T) popcnt2
type(C_PTR), value :: x
integer(C_INT) n
end function popcnt2
end interface

interface
subroutine sieve(t, n) bind(C)
import C_PTR
import C_INT
implicit none

type(C_PTR), value :: t
integer(C_INT) n
end subroutine sieve
end interface

integer(C_INT8_T), allocatable, target :: x(:)
integer(C_INT), parameter :: Nbytes = 32768 ! popcnt1 hardwired to this
integer, parameter :: align = 16 ! Will use xmm registers
type(C_PTR) x_ptr
integer(C_INTPTR_T) x_start
integer(C_INTPTR_T) t_start
type(C_PTR) t_ptr
integer i
integer(C_INT64_T) t0, t1, np

allocate(x(Nbytes+align-1))
x_ptr = C_LOC(x(1))
x_start = transfer(x_ptr, x_start)
t_start = iand(x_start+align-1, int(not(align-1),C_INTPTR_T))
t_ptr = transfer(t_start, t_ptr)
call sieve(t_ptr, Nbytes)
do i = 1, 4
t0 = tm1()
np = popcnt1(t_ptr)
t1 = tm1()
write(*,'(2(a,i0))') 'popcnt1 np = ', np, ' clocks = ', t1-t0
t0 = tm1()
np = popcnt2(t_ptr, Nbytes/4)
t1 = tm1()
write(*,'(2(a,i0))') 'popcnt2 np = ', np, ' clocks = ', t1-t0
end do

end program big_test

subroutine sieve(t, n) bind(C)
use ISO_C_BINDING
implicit none
integer(C_INT) n
integer(C_INT8_T) t(0:n-1)
integer i, lim, j

lim = sqrt(8*n+0.5_C_DOUBLE)
t = -1
t(0) = ibclr(t(0), 0)
do i = 2, lim
if(btest(t((i-1)/8),modulo(i-1,8))) then
do j = i**2, 8*n, i
t((j-1)/8) = ibclr(t((j-1)/8),modulo(j-1,8))
end do
end if
end do
end subroutine sieve

!function popcnt2(MAT, Nwords) bind(C)
! use ISO_C_BINDING
! implicit none
! integer(C_INT64_T) popcnt2
! integer(C_INT) Nwords
! integer(C_INT32_T) MAT(Nwords)
! integer i, j
! integer(C_INT8_T), parameter :: IBITC(0:65535) = [( &
! ibits(i,0,1)+ibits(i,1,1)+ibits(i,2,1)+ibits(i,3,1)+ &
! ibits(i,4,1)+ibits(i,5,1)+ibits(i,6,1)+ibits(i,7,1)+ &
! ibits(i,8,1)+ibits(i,9,1)+ibits(i,10,1)+ibits(i,11,1)+ &
! ibits(i,12,1)+ibits(i,13,1)+ibits(i,14,1)+ibits(i,15,1), &
! i=0,65535)]
! integer(C_INT32_T) K1, K2
!
! popcnt2 = 0
! do i = 1, Nwords
! K1 = MAT(i)
! K2 = ishft(K1, -16)
! K1 = iand(K1, int(z'ffff', C_INT32_T))
! popcnt2 = popcnt2+IBITC(K1)+IBITC(K2)
! end do
!end function popcnt2

function popcnt2(MAT, Nwords) bind(C)
use ISO_C_BINDING
implicit none
integer(C_INT64_T) popcnt2
integer(C_INT) Nwords
integer(C_INT16_T) MAT(2*Nwords)
integer i, j
integer(C_INT8_T), parameter :: IBITC(0:255) = [( &
ibits(i,0,1)+ibits(i,1,1)+ibits(i,2,1)+ibits(i,3,1)+ &
ibits(i,4,1)+ibits(i,5,1)+ibits(i,6,1)+ibits(i,7,1), &
i=0,255)]
integer(C_INT16_T) K1, K2

popcnt2 = 0
do i = 1, 2*Nwords
K1 = MAT(i)
K2 = ishft(K1, -8)
K1 = iand(K1, int(z'ff', C_INT16_T))
popcnt2 = popcnt2+IBITC(K1)+IBITC(K2)
end do
end function popcnt2

C:\gfortran\clf\popcnt>as big_popcnt.s

C:\gfortran\clf\popcnt>c:\gfortran\win64\bin\x86_64-pc-mingw32-gfortran -O2
big_
test.f90 big_popcnt.s -obig_test

C:\gfortran\clf\popcnt>big_test
popcnt1 np = 23000 clocks = 9060
popcnt2 np = 23000 clocks = 82810
popcnt1 np = 23000 clocks = 9230
popcnt2 np = 23000 clocks = 82830
popcnt1 np = 23000 clocks = 9290
popcnt2 np = 23000 clocks = 82770
popcnt1 np = 23000 clocks = 9310
popcnt2 np = 23000 clocks = 82790

Gary Scott

unread,

Apr 9, 2008, 8:46:21 AM4/9/08

to

glen herrmannsfeldt wrote:

> Gary Scott wrote:
>
>> That's a nice MS/DEC/Compaq/Intel extension for hexadecimal notation.
>> You can read it as 16#FF as the 16 is the default implied base. (Z'FF')
>
>
> When did DEC do this? The VAX/VMS compilers I knew used the IBM Z
> notation. (No apostrophes, only in DATA statements.)

When they created DVF. I doubt if it was retrofit to VAX.

>
> Before VAX, DEC only had octal for non-decimal constants.
>
> -- glen
>

Gary Scott

unread,

Apr 9, 2008, 8:49:23 AM4/9/08

to

glen herrmannsfeldt wrote:

> Gary Scott wrote:
>
>> That's a nice MS/DEC/Compaq/Intel extension for hexadecimal notation.
>> You can read it as 16#FF as the 16 is the default implied base. (Z'FF')
>
>
> When did DEC do this? The VAX/VMS compilers I knew used the IBM Z
> notation. (No apostrophes, only in DATA statements.)

Also, I believe it supports bases from 2-36 (or something like that)

2#00000001
3#...
4#...
...
36#...

>
> Before VAX, DEC only had octal for non-decimal constants.
>
> -- glen
>

Terence

unread,

Apr 9, 2008, 5:37:52 PM4/9/08

to

1) OOPS! Should be ...8/ in the data statement.
2) My MS 3.31 compiler does the function in under 20 operations; on
integer value load, one indexed load, one shift, one increment, one
store, one and immediate, one increment, one load, one add, another
add, one store.
With a table, it's just one load, an indexed load, an add, a store.

James Van Buskirk

unread,

Apr 10, 2008, 2:59:16 PM4/10/08

to

"James Van Buskirk" <not_...@comcast.net> wrote in message
news:ybqdnX1x--CqH2Ha...@comcast.com...

> C:\gfortran\clf\popcnt>big_test
> popcnt1 np = 23000 clocks = 9060
> popcnt2 np = 23000 clocks = 82810

Well, the above seemed really slow, so I looked at my assembly code.
I still can't figure out what the big holdup is, but I did manage to
find a subtle code transformation that reduced execution time by 20%.

C:\gfortran\clf\popcnt>type big_popcnt1.s

.text
.globl _tm1
.def _tm1; .scl 2; .type 32; .endef
_tm1:
rdtsc
shrq $32, %rdx
orq %rdx, %rax
ret
.globl _popcnt1
.def _popcnt1; .scl 2; .type 32; .endef

.align 16
_popcnt1:
subq $72, %rsp

movaps %xmm6, (%rsp)
movaps %xmm7, 16(%rsp)

movaps %xmm12, 32(%rsp)
movaps %xmm13, 48(%rsp)
movaps %xmm14, 80(%rsp)
movaps %xmm15, 96(%rsp)

movq $0xaaaaaaaaaaaaaaaa, %rax
movq %rax, %xmm0
movddup %xmm0, %xmm15
movq $0x3333333333333333, %rax
movq %rax, %xmm0
movddup %xmm0, %xmm14
movq $0x0f0f0f0f0f0f0f0f, %rax
movq %rax, %xmm0
movddup %xmm0, %xmm13
xorps %xmm12, %xmm12

pcmpeqd %xmm1, %xmm1
pcmpeqd %xmm3, %xmm3
pcmpeqd %xmm5, %xmm5
xorps %xmm7, %xmm7

movl $32768, %edx
addq %rdx, %rcx
negq %rdx

.align 16
loop1: movapd (%rcx,%rdx), %xmm6
xorps %xmm6, %xmm1
andps %xmm1, %xmm6
movaps 16(%rcx,%rdx), %xmm0
xorps %xmm0, %xmm1
andpd %xmm1, %xmm0
orps %xmm0, %xmm6
xorps %xmm6, %xmm3
andps %xmm3, %xmm6
movaps 32(%rcx,%rdx), %xmm4
xorps %xmm4, %xmm1
andps %xmm1, %xmm4
movaps 48(%rcx,%rdx), %xmm0
xorps %xmm0, %xmm1
andps %xmm1, %xmm0
orps %xmm0, %xmm4
xorps %xmm4, %xmm3
andps %xmm3, %xmm4
orps %xmm4, %xmm6
xorps %xmm6, %xmm5
andps %xmm5, %xmm6
movaps 64(%rcx,%rdx), %xmm4
xorps %xmm4, %xmm1
andps %xmm1, %xmm4
movaps 80(%rcx,%rdx), %xmm0
xorps %xmm0, %xmm1
andps %xmm1, %xmm0
orps %xmm0, %xmm4
xorps %xmm4, %xmm3
andps %xmm3, %xmm4
movaps 96(%rcx,%rdx), %xmm2
xorps %xmm2, %xmm1
andps %xmm1, %xmm2
movaps 112(%rcx,%rdx), %xmm0
xorps %xmm0, %xmm1
andps %xmm1, %xmm0
orps %xmm0, %xmm2
xorps %xmm2, %xmm3
andps %xmm3, %xmm2
orps %xmm2, %xmm4
xorps %xmm4, %xmm5
andps %xmm5, %xmm4
orps %xmm4, %xmm6

movaps %xmm15, %xmm0
andps %xmm6, %xmm0
psrld $1, %xmm0
psubd %xmm0, %xmm6
movaps %xmm14, %xmm0
andnps %xmm6, %xmm0
psrld $2, %xmm0
andps %xmm14, %xmm6
paddd %xmm0, %xmm6
movaps %xmm6, %xmm0
psrld $4, %xmm6
paddd %xmm0, %xmm6
andps %xmm13, %xmm6
psadbw %xmm12, %xmm6
paddd %xmm6, %xmm7

addq $128, %rdx
jnz loop1

movaps %xmm5, %xmm6
call wrap
paddq %xmm7, %xmm7
psubq %xmm6, %xmm7
movaps %xmm3, %xmm6
call wrap
paddq %xmm7, %xmm7
psubq %xmm6, %xmm7
movaps %xmm1, %xmm6
call wrap
paddq %xmm7, %xmm7
psubq %xmm6, %xmm7
movhlps %xmm7, %xmm1
paddq %xmm7, %xmm1
movq %xmm1, %rax
addq $896, %rax

movaps (%rsp), %xmm6
movaps 16(%rsp), %xmm7

movaps 32(%rsp), %xmm12
movaps 48(%rsp), %xmm13
movaps 80(%rsp), %xmm14
movaps 96(%rsp), %xmm15
addq $72, %rsp
ret

.align 16
wrap: movaps %xmm15, %xmm0
andps %xmm6, %xmm0
psrld $1, %xmm0
psubd %xmm0, %xmm6
movaps %xmm14, %xmm0
andnps %xmm6, %xmm0
psrld $2, %xmm0
andps %xmm14, %xmm6
paddd %xmm0, %xmm6
movaps %xmm6, %xmm0
psrld $4, %xmm6
paddd %xmm0, %xmm6
andps %xmm13, %xmm6
psadbw %xmm12, %xmm6
ret

.globl _popcnt3
.def _popcnt3; .scl 2; .type 32; .endef
.align 16
_popcnt3:
movaps %xmm7, 8(%rsp)

movq $0xaaaaaaaaaaaaaaaa, %rax
movq %rax, %xmm0

movddup %xmm0, %xmm5

movq $0x3333333333333333, %rax
movq %rax, %xmm0

movddup %xmm0, %xmm4

movq $0x0f0f0f0f0f0f0f0f, %rax
movq %rax, %xmm0

movddup %xmm0, %xmm3
xorps %xmm2, %xmm2

xorps %xmm7, %xmm7

movl $32768, %edx
addq %rdx, %rcx
negq %rdx

.align 16
loop2: movaps (%rcx,%rdx), %xmm1
movapd %xmm5, %xmm0
andps %xmm1, %xmm0
psrld $1, %xmm0
psubd %xmm0, %xmm1
movapd %xmm4, %xmm0
andnps %xmm1, %xmm0
psrld $2, %xmm0
andps %xmm4, %xmm1
paddd %xmm0, %xmm1
movaps %xmm1, %xmm0
psrld $4, %xmm1
paddd %xmm0, %xmm1
andps %xmm3, %xmm1
psadbw %xmm2, %xmm1
paddd %xmm1, %xmm7

addq $16, %rdx
jnz loop2
movhlps %xmm7, %xmm1
paddq %xmm7, %xmm1
movq %xmm1, %rax

movaps 8(%rsp), %xmm7
ret

C:\gfortran\clf\popcnt>type big_test1.f90
program big_test1

interface
function popcnt3(x) bind(C)

import C_INT64_T
import C_PTR
implicit none

integer(C_INT64_T) popcnt3
type(C_PTR), value :: x
end function popcnt3
end interface

np = popcnt3(t_ptr)
t1 = tm1()
write(*,'(2(a,i0))') 'popcnt3 np = ', np, ' clocks = ', t1-t0
end do

end program big_test1

subroutine sieve(t, n) bind(C)
use ISO_C_BINDING
implicit none
integer(C_INT) n
integer(C_INT8_T) t(0:n-1)
integer i, lim, j

lim = sqrt(8*n+0.5_C_DOUBLE)
t = -1
t(0) = ibclr(t(0), 0)
do i = 2, lim
if(btest(t((i-1)/8),modulo(i-1,8))) then
do j = i**2, 8*n, i
t((j-1)/8) = ibclr(t((j-1)/8),modulo(j-1,8))
end do
end if
end do
end subroutine sieve

function popcnt2(MAT, Nwords) bind(C)

use ISO_C_BINDING
implicit none
integer(C_INT64_T) popcnt2
integer(C_INT) Nwords
integer(C_INT16_T) MAT(2*Nwords)
integer i, j
integer(C_INT8_T), parameter :: IBITC(0:255) = [( &
ibits(i,0,1)+ibits(i,1,1)+ibits(i,2,1)+ibits(i,3,1)+ &
ibits(i,4,1)+ibits(i,5,1)+ibits(i,6,1)+ibits(i,7,1), &
i=0,255)]
integer(C_INT16_T) K1, K2

popcnt2 = 0
do i = 1, 2*Nwords
K1 = MAT(i)
K2 = ishft(K1, -8)
K1 = iand(K1, int(z'ff', C_INT16_T))
popcnt2 = popcnt2+IBITC(K1)+IBITC(K2)
end do
end function popcnt2

C:\gfortran\clf\popcnt>C:\gfortran\win64\bin\x86_64-pc-mingw32-gfortran -O2
big_
test1.f90 big_popcnt1.s -obig_test1

C:\gfortran\clf\popcnt>big_test1 > big_test1.txt

C:\gfortran\clf\popcnt>type big_test1.txt
popcnt1 np = 23000 clocks = 7020
popcnt2 np = 23000 clocks = 82680
popcnt3 np = 23000 clocks = 15470
popcnt1 np = 23000 clocks = 7130
popcnt2 np = 23000 clocks = 82660
popcnt3 np = 23000 clocks = 15570
popcnt1 np = 23000 clocks = 7120
popcnt2 np = 23000 clocks = 82670
popcnt3 np = 23000 clocks = 15470
popcnt1 np = 23000 clocks = 7130
popcnt2 np = 23000 clocks = 82700
popcnt3 np = 23000 clocks = 15470

So now I'm down to 7100 clocks for 32768 bytes or 4.6 bytes per
clock. popcnt2 is a corrected version of the 8-bit LUT function
that Terence posted which gets only 0.40 bytes per clock, and
popcnt3 is like popcnt1 but skips the compression step and gets 2.1
bytes per clock cycle. So I have improved my assembly language
version to the point where it should be faster than hardware POPCNT
on 32-bit machines and is an order of magnitude faster than at least
one plausible competitive algorithm, but I can't help feeling that
I am missing something... shouldn't it be somewhat faster than this?
It's unfortunate that Intel doesn't provide optimization tools like
AMD does. If I could examine a Pipeline View of my code I might be
able to spot something, but it's too much work to generate this by
hand.

Oh yeah, the code that kills gfortran is like this:

C:\gfortran\clf\popcnt>type popcnt16.f90
function popcnt16(MAT, Nwords) bind(C)
use ISO_C_BINDING
implicit none
integer(C_INT64_T) popcnt16
integer(C_INT) Nwords
integer(C_INT32_T) MAT(Nwords)
integer i, j
integer(C_INT8_T), parameter :: IBITC(0:65535) = [( &
ibits(i,0,1)+ibits(i,1,1)+ibits(i,2,1)+ibits(i,3,1)+ &
ibits(i,4,1)+ibits(i,5,1)+ibits(i,6,1)+ibits(i,7,1)+ &
ibits(i,8,1)+ibits(i,9,1)+ibits(i,10,1)+ibits(i,11,1)+ &
ibits(i,12,1)+ibits(i,13,1)+ibits(i,14,1)+ibits(i,15,1), &
i=0,65535)]
integer(C_INT32_T) K1, K2

popcnt16 = 0
do i = 1, Nwords
K1 = MAT(i)
K2 = ishft(K1, -16)

K1 = iand(K1, int(z'ffff', C_INT32_T))

popcnt16 = popcnt16+IBITC(K1)+IBITC(K2)
end do
end function popcnt16

C:\gfortran\clf\popcnt>C:\gcc_equation\bin\x86_64-pc-mingw32-gfortran -v
Built by Equation Solution (http://www.Equation.com).
Using built-in specs.
Target: x86_64-pc-mingw32
Configured with:
../gcc-4.4-20080404-mingw/configure --host=x86_64-pc-mingw32 --
build=x86_64-unknown-linux-gnu --target=x86_64-pc-mingw32 --prefix=/home/gfortra
n/gcc-home/binary/mingw32/native/x86_64/gcc/4.4-20080404 --with-gmp=/home/gfortr
an/gcc-home/binary/mingw32/native/x86_64/gmp --with-mpfr=/home/gfortran/gcc-home
/binary/mingw32/native/x86_64/mpfr --with-sysroot=/home/gfortran/gcc-home/binary
/mingw32/cross/x86_64/gcc/4.4-20080404 --with-gcc --with-gnu-ld --with-gnu-as
--
disable-shared --disable-nls --disable-tls --enable-languages=c,fortran --enable
-threads=win32 --enable-libgomp --disable-win32-registry
Thread model: win32
gcc version 4.4.0 20080404 (experimental) (GCC)

C:\gfortran\clf\popcnt>C:\gcc_equation\bin\x86_64-pc-mingw32-gfortran -c
popcnt1
6.f90
popcnt16.f90: In function 'popcnt16':
popcnt16.f90:20: internal compiler error: in gfc_conv_array_initializer, at
fort
ran/trans-array.c:3882
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.

glen herrmannsfeldt

unread,

Apr 10, 2008, 3:44:53 PM4/10/08

to

James Van Buskirk wrote:

(snip)

> Oh yeah, the code that kills gfortran is like this:

> integer(C_INT8_T), parameter :: IBITC(0:65535) = [( &

> ibits(i,0,1)+ibits(i,1,1)+ibits(i,2,1)+ibits(i,3,1)+ &
> ibits(i,4,1)+ibits(i,5,1)+ibits(i,6,1)+ibits(i,7,1)+ &
> ibits(i,8,1)+ibits(i,9,1)+ibits(i,10,1)+ibits(i,11,1)+ &
> ibits(i,12,1)+ibits(i,13,1)+ibits(i,14,1)+ibits(i,15,1), &
> i=0,65535)]

(snip)

> gcc version 4.4.0 20080404 (experimental) (GCC)

> C:\gfortran\clf\popcnt>C:\gcc_equation\bin\x86_64-pc-mingw32-gfortran -c
> popcnt1
> 6.f90
> popcnt16.f90: In function 'popcnt16':
> popcnt16.f90:20: internal compiler error: in gfc_conv_array_initializer, at
> fort

Maybe a stack overflow trying to store the whole thing.

How about as a DATA statement instead of PARAMETER?

This reminds me of limits on the size of static arrays with a
C compiler on the Alpha. (Probably the DEC compiler, though it
might have been gcc.)

-- glen

glen herrmannsfeldt

unread,

Apr 10, 2008, 5:08:30 PM4/10/08

to

James Van Buskirk wrote:
(snip)

> integer(C_INT8_T), parameter :: IBITC(0:65535) = [( &

> ibits(i,0,1)+ibits(i,1,1)+ibits(i,2,1)+ibits(i,3,1)+ &
> ibits(i,4,1)+ibits(i,5,1)+ibits(i,6,1)+ibits(i,7,1)+ &
> ibits(i,8,1)+ibits(i,9,1)+ibits(i,10,1)+ibits(i,11,1)+ &
> ibits(i,12,1)+ibits(i,13,1)+ibits(i,14,1)+ibits(i,15,1), &
> i=0,65535)]

(snip)

> C:\gfortran\clf\popcnt>C:\gcc_equation\bin\x86_64-pc-mingw32-gfortran -c
> popcnt1
> 6.f90
> popcnt16.f90: In function 'popcnt16':
> popcnt16.f90:20: internal compiler error: in gfc_conv_array_initializer, at
> fort

g95 seems to be able to do it, but it is very slow.
This one statement compiles about 50 times slower than all the other
statements in my program.

(42 seconds on an 800MHz Celeron, 0.8s if initialized to zero.
Don't ask why, but that is the machine that has g95 on it.)

-- glen

Thomas Koenig

unread,

Apr 11, 2008, 3:15:35 PM4/11/08

to

On 2008-04-10, James Van Buskirk <not_...@comcast.net> wrote:

> Oh yeah, the code that kills gfortran is like this:

> integer(C_INT8_T), parameter :: IBITC(0:65535) = [( &

> ibits(i,0,1)+ibits(i,1,1)+ibits(i,2,1)+ibits(i,3,1)+ &
> ibits(i,4,1)+ibits(i,5,1)+ibits(i,6,1)+ibits(i,7,1)+ &
> ibits(i,8,1)+ibits(i,9,1)+ibits(i,10,1)+ibits(i,11,1)+ &
> ibits(i,12,1)+ibits(i,13,1)+ibits(i,14,1)+ibits(i,15,1), &
> i=0,65535)]

This is http://gcc.gnu.org/PR19925 , a known problem.