Popcount (was Re: The PENTIUM PRO 200 Mhz is a be

James Kirkpatrick

unread,

Mar 28, 1996, 3:00:00 AM3/28/96

to

|>In article <trimmDo...@netcom.com>,
|>Gary M. Watson <tr...@netcom.com> wrote:
|>>What was the engineering justification for the Cray team to put
|>>the population count in hardware? What, besides chess, uses this
|>>enough to justify its existance? If there's a good reason, maybe
|>>we can forward it to the P7 team.

There is strong evidence popcount was the "NSA instruction". However,
in our case it came in handy on our CDC Cyber mainframes. These were
octal 60-bit-word ones-complement systems, and a word could contain
either positive zero or negative zero. In Fortran, it was difficult
to distiguish the two zeros, so we used popcount as a fast way to tell
positive zero from negative zero. (The application was bit-mapped
graphics, not numerical computation). Of course this was no longer an
issue with Cray systems, which I assume were/are 64-bit twos-complement.

Jim

Jack Harper

unread,

Mar 28, 1996, 3:00:00 AM3/28/96

to

In article <1996Mar28.0...@roper.uwyo.edu>, jim...@uwyo.edu
(James Kirkpatrick) wrote:

> |>In article <trimmDo...@netcom.com>,
> |>Gary M. Watson <tr...@netcom.com> wrote:
> |>>What was the engineering justification for the Cray team to put
> |>>the population count in hardware? What, besides chess, uses this
> |>>enough to justify its existance? If there's a good reason, maybe
> |>>we can forward it to the P7 team.

Also -- The old Univac 1100 mainframes had an instruction called, I think,
LSC for "Left Shift and Count" which would left shift a word/register
until a '1' bit was encountered and then load another register with the
number of shifts required to find the first non-zero bit.

This is, I think from previous comments, different from the "population
count", however. Back in the early 1970's, I remember wondering about the
use of such a thing and searched through the source code for the Univac
(Exec-8) operating system kernal for usage and, as I recall, found the
instruction used in only one place -- a file segment allocation table
implemented as a bit string.

Could someone confirm that I am correct that the "LSC" would not directly
implement the population count... But, maybe I am wrong. I seem to
remember hearing from the Univac field support guys at the time (20 years
ago) that the NSA had a number of Univacs.

Regards to all...

Jack

---------------------------------------------------------------------
Jack Harper Bank Systems 2000, Inc.
e-mail: jha...@bs2000.com 350 Indiana Street, Suite 350
voice: 303-277-1892 fax: 303-277-1785 Golden, Colorado 80401 USA

"21st Century Banking Applications"
Private Label Optical Bank Card Systems
Visit our Web Page: http://www.bs2000.com/talos
---------------------------------------------------------------------

Don Fong

unread,

Mar 28, 1996, 3:00:00 AM3/28/96

to

In article <1996Mar28.0...@roper.uwyo.edu>,

James Kirkpatrick <jim...@uwyo.edu> wrote:
>There is strong evidence popcount was the "NSA instruction". However,
>in our case it came in handy on our CDC Cyber mainframes. These were
>octal 60-bit-word ones-complement systems, and a word could contain
>either positive zero or negative zero. In Fortran, it was difficult
>to distiguish the two zeros, so we used popcount as a fast way to tell
>positive zero from negative zero.

hard to believe popcount would be faster than a simple boolean test.
i learned to program on a CDC 6400. the fortran compiler used
negative 0 for .TRUE. and positive 0 for .FALSE. (or was it the other
way around?) consequently if you did an arithmetic (rather than logical)
comparison, you could wind up with the surprising result that
.TRUE. = .FALSE. !
--
--- don fong ``i still want the peace dividend''
--

Robert Hyatt

unread,

Mar 28, 1996, 3:00:00 AM3/28/96

to

In article <jharper-2803...@p27.denver1.dialup.csn.net>,
Jack Harper <jha...@bs2000.com> wrote:
-->In article <1996Mar28.0...@roper.uwyo.edu>, jim...@uwyo.edu
-->(James Kirkpatrick) wrote:
-->
-->> |>In article <trimmDo...@netcom.com>,
-->> |>Gary M. Watson <tr...@netcom.com> wrote:
-->> |>>What was the engineering justification for the Cray team to put
-->> |>>the population count in hardware? What, besides chess, uses this
-->> |>>enough to justify its existance? If there's a good reason, maybe
-->> |>>we can forward it to the P7 team.
-->
-->Also -- The old Univac 1100 mainframes had an instruction called, I think,
-->LSC for "Left Shift and Count" which would left shift a word/register
-->until a '1' bit was encountered and then load another register with the
-->number of shifts required to find the first non-zero bit.
-->
-->This is, I think from previous comments, different from the "population
-->count", however. Back in the early 1970's, I remember wondering about the
-->use of such a thing and searched through the source code for the Univac
-->(Exec-8) operating system kernal for usage and, as I recall, found the
-->instruction used in only one place -- a file segment allocation table
-->implemented as a bit string.
-->
-->Could someone confirm that I am correct that the "LSC" would not directly
-->implement the population count... But, maybe I am wrong. I seem to
-->remember hearing from the Univac field support guys at the time (20 years
-->ago) that the NSA had a number of Univacs.
-->
-->Regards to all...
-->
-->Jack
-->

Several machines had similar operations. I don't remember the opcode
any longer, but the "vaxen" had a searching shift instruction that would
halt when the sign bit changed and return the number of bits shifted.

That's not popcnt, although it could be iterated on to produce one of
course.

--
Robert Hyatt Computer and Information Sciences
hy...@cis.uab.edu University of Alabama at Birmingham
(205) 934-2213 115A Campbell Hall, UAB Station
(205) 934-5473 FAX Birmingham, AL 35294-1170

Peter Butler

unread,

Mar 29, 1996, 3:00:00 AM3/29/96

to

[snip discussion of popcount]

>There is strong evidence popcount was the "NSA instruction". However,
>in our case it came in handy on our CDC Cyber mainframes. These were
>octal 60-bit-word ones-complement systems, and a word could contain
>either positive zero or negative zero. In Fortran, it was difficult
>to distiguish the two zeros, so we used popcount as a fast way

[snip to end]

But why not a Compass subroutine? Popcount took 60 minor cycles.
Thats 6 uSec for the kids and IBM 360 people. Floating mul only took
10 minor cycles and I believe integer add (60 bit) took 4.

Any 6000/Cyber 70 left? (I did job card xlate, job init, job advancer
and job terminate, "tape stageing II" and some early "deadstart" code
before SCOPE was moved to MN.)

-pb

Roy Mongiovi

unread,

Mar 29, 1996, 3:00:00 AM3/29/96

to

In article <4jgr24$2...@dfw-ixnews5.ix.netcom.com>,

Peter Butler <pbut...@ix.netcom.com> wrote:
>But why not a Compass subroutine? Popcount took 60 minor cycles.
>Thats 6 uSec for the kids and IBM 360 people. Floating mul only took
>10 minor cycles and I believe integer add (60 bit) took 4.

On the 6400-style CPU (with the unified arithmetic unit), it
took something like 64 minor cycles to do a population count.
That's 4 cycles for setup, and one cycle per bit.

On the original 6600 CPU, with multiple functional units, it
only took something like 6 minor cycles. The 6600 was the
first CDC machine, and it was designed for speed. The 6400
came along later and was designed to sell at a reasonable
price....
--
Roy J. Mongiovi Systems Support Specialist Information Technology
Georgia Institute of Technology
Tough are the souls that tread the knife's edge Atlanta, Georgia 30332-0715
J. Tull r...@prism.gatech.edu

Jerry Leichter

unread,

Mar 29, 1996, 3:00:00 AM3/29/96

to Peter Butler

Peter Butler wrote:
| >There is strong evidence popcount was the "NSA instruction". However,
| >in our case it came in handy on our CDC Cyber mainframes. These were
| >octal 60-bit-word ones-complement systems, and a word could contain
| >either positive zero or negative zero. In Fortran, it was difficult
| >to distiguish the two zeros, so we used popcount as a fast way
|

| But why not a Compass subroutine? Popcount took 60 minor cycles.
| Thats 6 uSec for the kids and IBM 360 people. Floating mul only took
| 10 minor cycles and I believe integer add (60 bit) took 4.

There numbers may have been correct for a 6400 (though I have my doubts), but
they are definitely incorrect for a 6600. I'm 99% sure that popcount took only
a couple of minor cycles - probably 4. (BTW, it was done in the divide unit,
for - apparently - very little extra hardware cost.) I also recally that
integer add took 3 minor cycles on a 6600. (3 minor cycles were the "baseline"
for operations - copying one register to another - done in the boolean unit,
along with all other boolean operations - took 3 minor cycles. When you work
out the timing, the boolean unit requests a reservation on its output register
*before* its operands actually arrive from its input registers!)

-- Jerry

Preston Briggs

unread,

Mar 30, 1996, 3:00:00 AM3/30/96

to

Jack Harper <jha...@bs2000.com> wrote:
>Also -- The old Univac 1100 mainframes had an instruction called, I think,

>LSC for "Left Shift and Count" which would left shift a word/register

>until a '1' bit was encountered and then load another register with the

>number of shifts required to find the first non-zero bit.

LSC would be useful for doing floating-point manipulations.
Normalizes the mantissa.

Which also explains why you wouldn't find it in the OS kernel.

Preston Briggs

francois grieu

unread,

Apr 1, 1996, 3:00:00 AM4/1/96

to

Popcount is usefull in an astonishing number of applications; for example
the number of bits that are equal to 0 (trivial to compute from Popcount)
makes a nice checksum for data embeded in EPROM/EEPROM memory, because any
number of 0-to-1 or 1-to-0 transition will get caught, including changes in
the checksum, provided all transistions are in the same direction.

Doing an efficient Popcount function in software is fun; here are
my favorite C implementations.

/* Popcount1 : count bits in input
optimised for speed on 'sparse' input
execution time grows linearly with the result
no dependancy on size of unsigned long
*/
int Popcount1(unsigned long x)
{ int n = 0;
if (x) do ++n; while ((x &= x-1)!=0);
return n;
}

/* Popcount2 : return number of bits set in x
table-based method, 8 bits at a time
execution time grows with the rank of leftmost non-empty chunk
no dependancy on size of unsigned long
*/
int Popcount2(unsigned long x)
{ int n = 0;
do n += "\0\1\1\2\1\2\2\3\1\2\2\3\2\3\3\4"\
"\1\2\2\3\2\3\3\4\2\3\3\4\3\4\4\5"\
"\1\2\2\3\2\3\3\4\2\3\3\4\3\4\4\5"\
"\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"\
"\1\2\2\3\2\3\3\4\2\3\3\4\3\4\4\5"\
"\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"\
"\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"\
"\3\4\4\5\4\5\5\6\4\5\5\6\5\6\6\7"\
"\1\2\2\3\2\3\3\4\2\3\3\4\3\4\4\5"\
"\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"\
"\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"\
"\3\4\4\5\4\5\5\6\4\5\5\6\5\6\6\7"\
"\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"\
"\3\4\4\5\4\5\5\6\4\5\5\6\5\6\6\7"\
"\3\4\4\5\4\5\5\6\4\5\5\6\5\6\6\7"\
"\4\5\5\6\5\6\6\7\5\6\6\7\6\7\7\10" [x & 0xFF];
while ((x >>= 8)!=0);
return n;
}

/* Popcount3 : return number of bits set in x
optimised for speed on random input, on architectures with fast modulus
execution time is constant (except for variation on execution of final
modulus)
as is, limited to 32 bit unsigned long, but just expanding the
three hex constants will fix this, up to 254 bit architectures
*/
int Popcount3(unsigned long x)
{ x = ((x>>1)&0x55555555) + (x&0x55555555);
x = ((x>>2)&0x33333333) + (x&0x33333333);
return (((x>>4)+x)&0x0F0F0F0F) % 255;
}

/* validation suite for Popcount1 / Popcount2 / Popcount3 above */
#include <stdio.h>
int main(void)
{ unsigned long x,y;
int i1,i2,i3;
long err;
err = 0;
x = 0; err = 0;
do
{ i1 = Popcount1(x); i2 = Popcount2(x); i3 = Popcount3(x);
if (i1!=i2 || i1!=i3)
if (++err <= 12) printf("error :%5d%5d%5d for 0x%lX\n",i1,i2,i3,x);
/* the following will step x thru a reasonable range of values */
y = x; x += x>>14; x += 1;
} while (x>y);
printf(err==0?"test pass\n":"test fails, %d errors\n",err);
return 0;
}

Paul Rubin

unread,

Apr 2, 1996, 3:00:00 AM4/2/96

to

In article <fgrieu-0104...@pppa111.micronet.fr>,

francois grieu <fgr...@micronet.fr> wrote:
>/* Popcount3 : return number of bits set in x
> optimised for speed on random input, on architectures with fast modulus
> execution time is constant (except for variation on execution of final
>modulus)
> as is, limited to 32 bit unsigned long, but just expanding the
> three hex constants will fix this, up to 254 bit architectures
>*/
>int Popcount3(unsigned long x)
>{ x = ((x>>1)&0x55555555) + (x&0x55555555);
> x = ((x>>2)&0x33333333) + (x&0x33333333);
> return (((x>>4)+x)&0x0F0F0F0F) % 255;
>}

You can compute the modulus by adding up all the bytes of the number,
mod 256 (the idea is similar to "casting out nines"). Of course
execution time is then linear in the length of the word and it
is probably slower than the table based method. But it might be
useful if you are short of memory.