There is strong evidence popcount was the "NSA instruction". However,
in our case it came in handy on our CDC Cyber mainframes. These were
octal 60-bit-word ones-complement systems, and a word could contain
either positive zero or negative zero. In Fortran, it was difficult
to distiguish the two zeros, so we used popcount as a fast way to tell
positive zero from negative zero. (The application was bit-mapped
graphics, not numerical computation). Of course this was no longer an
issue with Cray systems, which I assume were/are 64-bit twos-complement.
Jim
> |>In article <trimmDo...@netcom.com>,
> |>Gary M. Watson <tr...@netcom.com> wrote:
> |>>What was the engineering justification for the Cray team to put
> |>>the population count in hardware? What, besides chess, uses this
> |>>enough to justify its existance? If there's a good reason, maybe
> |>>we can forward it to the P7 team.
Also -- The old Univac 1100 mainframes had an instruction called, I think,
LSC for "Left Shift and Count" which would left shift a word/register
until a '1' bit was encountered and then load another register with the
number of shifts required to find the first non-zero bit.
This is, I think from previous comments, different from the "population
count", however. Back in the early 1970's, I remember wondering about the
use of such a thing and searched through the source code for the Univac
(Exec-8) operating system kernal for usage and, as I recall, found the
instruction used in only one place -- a file segment allocation table
implemented as a bit string.
Could someone confirm that I am correct that the "LSC" would not directly
implement the population count... But, maybe I am wrong. I seem to
remember hearing from the Univac field support guys at the time (20 years
ago) that the NSA had a number of Univacs.
Regards to all...
Jack
---------------------------------------------------------------------
Jack Harper Bank Systems 2000, Inc.
e-mail: jha...@bs2000.com 350 Indiana Street, Suite 350
voice: 303-277-1892 fax: 303-277-1785 Golden, Colorado 80401 USA
"21st Century Banking Applications"
Private Label Optical Bank Card Systems
Visit our Web Page: http://www.bs2000.com/talos
---------------------------------------------------------------------
Several machines had similar operations. I don't remember the opcode
any longer, but the "vaxen" had a searching shift instruction that would
halt when the sign bit changed and return the number of bits shifted.
That's not popcnt, although it could be iterated on to produce one of
course.
--
Robert Hyatt Computer and Information Sciences
hy...@cis.uab.edu University of Alabama at Birmingham
(205) 934-2213 115A Campbell Hall, UAB Station
(205) 934-5473 FAX Birmingham, AL 35294-1170
But why not a Compass subroutine? Popcount took 60 minor cycles.
Thats 6 uSec for the kids and IBM 360 people. Floating mul only took
10 minor cycles and I believe integer add (60 bit) took 4.
Any 6000/Cyber 70 left? (I did job card xlate, job init, job advancer
and job terminate, "tape stageing II" and some early "deadstart" code
before SCOPE was moved to MN.)
-pb
On the 6400-style CPU (with the unified arithmetic unit), it
took something like 64 minor cycles to do a population count.
That's 4 cycles for setup, and one cycle per bit.
On the original 6600 CPU, with multiple functional units, it
only took something like 6 minor cycles. The 6600 was the
first CDC machine, and it was designed for speed. The 6400
came along later and was designed to sell at a reasonable
price....
--
Roy J. Mongiovi Systems Support Specialist Information Technology
Georgia Institute of Technology
Tough are the souls that tread the knife's edge Atlanta, Georgia 30332-0715
J. Tull r...@prism.gatech.edu
-- Jerry
LSC would be useful for doing floating-point manipulations.
Normalizes the mantissa.
Which also explains why you wouldn't find it in the OS kernel.
Preston Briggs
Doing an efficient Popcount function in software is fun; here are
my favorite C implementations.
/* Popcount1 : count bits in input
optimised for speed on 'sparse' input
execution time grows linearly with the result
no dependancy on size of unsigned long
*/
int Popcount1(unsigned long x)
{ int n = 0;
if (x) do ++n; while ((x &= x-1)!=0);
return n;
}
/* Popcount2 : return number of bits set in x
table-based method, 8 bits at a time
execution time grows with the rank of leftmost non-empty chunk
no dependancy on size of unsigned long
*/
int Popcount2(unsigned long x)
{ int n = 0;
do n += "\0\1\1\2\1\2\2\3\1\2\2\3\2\3\3\4"\
"\1\2\2\3\2\3\3\4\2\3\3\4\3\4\4\5"\
"\1\2\2\3\2\3\3\4\2\3\3\4\3\4\4\5"\
"\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"\
"\1\2\2\3\2\3\3\4\2\3\3\4\3\4\4\5"\
"\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"\
"\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"\
"\3\4\4\5\4\5\5\6\4\5\5\6\5\6\6\7"\
"\1\2\2\3\2\3\3\4\2\3\3\4\3\4\4\5"\
"\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"\
"\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"\
"\3\4\4\5\4\5\5\6\4\5\5\6\5\6\6\7"\
"\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"\
"\3\4\4\5\4\5\5\6\4\5\5\6\5\6\6\7"\
"\3\4\4\5\4\5\5\6\4\5\5\6\5\6\6\7"\
"\4\5\5\6\5\6\6\7\5\6\6\7\6\7\7\10" [x & 0xFF];
while ((x >>= 8)!=0);
return n;
}
/* Popcount3 : return number of bits set in x
optimised for speed on random input, on architectures with fast modulus
execution time is constant (except for variation on execution of final
modulus)
as is, limited to 32 bit unsigned long, but just expanding the
three hex constants will fix this, up to 254 bit architectures
*/
int Popcount3(unsigned long x)
{ x = ((x>>1)&0x55555555) + (x&0x55555555);
x = ((x>>2)&0x33333333) + (x&0x33333333);
return (((x>>4)+x)&0x0F0F0F0F) % 255;
}
/* validation suite for Popcount1 / Popcount2 / Popcount3 above */
#include <stdio.h>
int main(void)
{ unsigned long x,y;
int i1,i2,i3;
long err;
err = 0;
x = 0; err = 0;
do
{ i1 = Popcount1(x); i2 = Popcount2(x); i3 = Popcount3(x);
if (i1!=i2 || i1!=i3)
if (++err <= 12) printf("error :%5d%5d%5d for 0x%lX\n",i1,i2,i3,x);
/* the following will step x thru a reasonable range of values */
y = x; x += x>>14; x += 1;
} while (x>y);
printf(err==0?"test pass\n":"test fails, %d errors\n",err);
return 0;
}
You can compute the modulus by adding up all the bytes of the number,
mod 256 (the idea is similar to "casting out nines"). Of course
execution time is then linear in the length of the word and it
is probably slower than the table based method. But it might be
useful if you are short of memory.