The NSA Instruction

John Gilmore

unread,

Apr 29, 1992, 4:56:36 PM4/29/92

to

It occurred to me to wonder just how many architectures have the following
instructions:

Population Count -- tell me how many 1-bits are in this word

Find First One (or Zero) -- can speed up emulation
of popcount on machines without it

I have gotten the impression that the Pop Count instruction has been
added to more than one architecture upon direct request from the US National
Security Agency, because it apparently occurs frequently in their inner
loops, but is slow to emulate in software (table lookup on parts of the word,
followed by addition of the table entries, seems best to me).

An informal question at last week's Asilomar Microcomputer Workshop
showed 5 or 10 hands when asked `Who has designed an architecture that
had one of these instructions added to it'. Now I'm looking for actual
information on *which* architectures and *why* the instructions were
added. (I wouldn't expect a RISC arch to have this in the first
rev, due to low frequency of use, but once NSA decided it might be
worth buying the machine, then it might appear in the second rev.)

It's been guessed that the reason they use it so much is for correlation,
e.g. xor two things and count the one-bits and you see roughly how close
they are. Any other ideas?
--
John Gilmore {sun,uunet,pyramid}!hoptoad!gnu g...@toad.com g...@cygnus.com
"It isn't given to us to know those rare moments when people
are wide open and the lightest touch can wither or heal."

Chris Perleberg

unread,

Apr 29, 1992, 5:46:06 PM4/29/92

to

In article <29...@hoptoad.uucp>, g...@hoptoad.uucp (John Gilmore) writes:
> It occurred to me to wonder just how many architectures have the following
> instructions:
>
> Population Count -- tell me how many 1-bits are in this word
>
> Find First One (or Zero) -- can speed up emulation
> of popcount on machines without it
>
> I have gotten the impression that the Pop Count instruction has been
> added to more than one architecture upon direct request from the US National
> Security Agency, because it apparently occurs frequently in their inner
> loops, but is slow to emulate in software (table lookup on parts of the word,
> followed by addition of the table entries, seems best to me).

More than once I've needed this instruction (Population Count) for
text pattern matching algorithms. But I've never encountered a machine
that supported it, and doing table lookup on parts of the word is too
costly.... Then you have the problem of how to specify in C that you
want to use this instruction....

Chris Perleberg
pch...@dcc.uchile.cl

Eliot Moss

unread,

Apr 29, 1992, 8:23:40 PM4/29/92

to

I believe that at least on machine in the Control Data / Cray family has a
Population Count instruction. It is useful in parts of garbage collection
algorithms that work using mark bits, to count the bits quickly. There was
interesting lore developed at the MIT Labs (Computer Science and AI) on the
best sequences for doing this on a PDP-10, which does not support it directly.
As for the Find First instructions, machine that I know provide it include the
PDP-10 (reportedly put in for telephone company folks, who might want to scan
quickly for phones lines with state changes, or something like that) and the
VAX. I believe the Tera computer design (not TeraData!) has both instructions,
and lots of other interesting bit twiddling things I have not seen elsewhere.
The hacked up PDP-1 that was at MIT for a number of years could probably do
these things, too, since one project (probably somebody's MS thesis at the
time) put a fancy collection of shifting/rotating, etc. instructions into the
machine. The 80386 has Find First instructions (for both bit reading
directions), the 68010 and Alpha do not. I don't have any other processor
books handy. I don't think IBM 360/370 series computers have this support
either.

The only way in which the population count and find first instructions might
not be considered RISC-like is there frequency of occurrence, following the
principle of "most bang for the chip real estate used". Of course, that
principle is application dependent: if you're making processors for a
particular domain in which Pop Count / Find First might be important, then
perhaps that should go in. In fact, the exatr logic in a functional unit may
not be that bad. The use of opcodes, and the possibility that these
instructions might be harder than most to make execute in a single short clock
cycle, might also argue against including them, especially if the issue is
scanning a large array, and the cycles used in a reasonable loop body might be
comparable to or less than the load and store costs on cache miss. Just a
thought, anyway.
--

J. Eliot B. Moss, Assistant Professor
Department of Computer Science
Lederle Graduate Research Center
University of Massachusetts
Amherst, MA 01003
(413) 545-4206, 545-1249 (fax); Mo...@cs.umass.edu

Sean Eric Fagan

unread,

Apr 29, 1992, 8:33:48 PM4/29/92

to

In article <1992Apr29.2...@dcc.uchile.cl> pch...@pehuen.dcc.uchile.cl (Chris Perleberg) writes:
>More than once I've needed this instruction (Population Count) for
>text pattern matching algorithms. But I've never encountered a machine
>that supported it, and doing table lookup on parts of the word is too
>costly....

Every supercomputer I've seen had it, the Cyber being the one I'm most
familiar with. The impression I have, however, is that the 6000 did not
have a pop count instruction (it didn't even have parity!), but that both it
and parity were added shortly (because 'farmers buy a lot of computers'
8-)).

>Then you have the problem of how to specify in C that you
>want to use this instruction....

This is the easy part. First, actual FORTRAN code to do a pop count on an
integer on a Cyber:

COUNT = POP(I)

To do it in C, the obvious code would be

count = pop(i);

(Oh, it might be ifdef'd in, and/or define'd, and/or something like
__builtin_pop, etc. But the idea is about the same.)

I'm not aware of any compilers that will take code and replace it with a pop
count instruction. (Might be interesting to ask in comp.compilers, though.)

--
Sean Eric Fagan | "One form to rule them all, one form to find them, one
s...@kithrup.COM | form to bring them all and in the darkness rewrite the
-----------------+ hell out of them" -- sendmail ruleset 3 comment from DEC.
Any opinions expressed are my own, and generally unpopular with others.

Bob Supnik

unread,

Apr 29, 1992, 9:18:45 PM4/29/92

to

In article <29...@hoptoad.uucp>, g...@hoptoad.uucp (John Gilmore) writes...

>It occurred to me to wonder just how many architectures have the following
>instructions:
>
> Population Count -- tell me how many 1-bits are in this word
>
> Find First One (or Zero) -- can speed up emulation
> of popcount on machines without it
>
>I have gotten the impression that the Pop Count instruction has been
>added to more than one architecture upon direct request from the US National
>Security Agency, because it apparently occurs frequently in their inner
>loops, but is slow to emulate in software (table lookup on parts of the word,
>followed by addition of the table entries, seems best to me).

> [more deleted]

PDP-10's (and -20's) had a JFFO - jump if find first one - and VAX's have
a FFS/FFC - find first set/clear. Both were done to facilitate searching
of bit vectors used as allocation maps for disk or memory blocks. Neither
had population count, although recent VAX's have had such a facility
at the micro-machine level to facilitate counting the number of registers
saved (and therefore the size of the stack frame needed) by a CALLx
instruction.

Cray Research systems have both, certainly in the YMP and C90, possibly
earlier.

Bob Supnik >Sup...@human.enet.dec.com
>All opinions expressed are those of a hardline microcoder
>and do not reflect those of Digital Equipment Corporation

Sarr J. Blumson

unread,

Apr 29, 1992, 9:48:10 PM4/29/92

to

> It occurred to me to wonder just how many architectures have the following
> instructions:
>
> Population Count -- tell me how many 1-bits are in this word
>
> Find First One (or Zero) -- can speed up emulation
> of popcount on machines without it
>

The DEC 6/10 so often nostalgized (like that word!) about had a
JumpifFindFirstOne instruction, useful for floating fixed point
numbers and such.

The Philco 2000 machines (I'm REALLY old) had JumpQNEGative and
POSitive instructions which were useful for doing both operations
quickly which tested the sign bit of the Q register AND shifted it
left one bit; they were heavily pipelined machines optimized so that
short loops using these instructions were very (we're talking early
60s here) fast.
(Sarr Blumson)

Robert Hyatt

unread,

Apr 29, 1992, 9:58:02 PM4/29/92

to

In article <1992Apr29.2...@dcc.uchile.cl> pch...@pehuen.dcc.uchile.cl (Chris Perleberg) writes:

Check out the Crays for the Pop count instruction. Several in years past
had a shift searching or some such instruction that would count leading
zeroes or ones, depending on how you coded it. Haven't assem. programmed
a vax in 8 years or so but seem to recall such an instruction was the
reason the current unix kernel uses it's 32 run queues that use a single
bit to indicate that a queue is non-empty, (each queue is 4 or more priority
classes lumped together, you test a bit rather than testing 4 separate linked
lists, but you don't have to test each bit sequentially, but count the
leading zeroes, add one and that is the run queue that is non-empty.)

--
!Robert Hyatt Computer and Information Sciences !
!hy...@cis.uab.edu University of Alabama at Birmingham !

Vadim Antonov

unread,

Apr 29, 1992, 10:52:51 PM4/29/92

to

In article <MOSS.92Ap...@ibis.cs.umass.edu> mo...@cs.umass.edu writes:

"Get least significant one" instruction can be easily simulated as

bitno = table[(n&~(n-1)) % 37]

(modulo 19 or 67 for 16 and 64 bit words correspondinly).
An interesting fact is that this consturction works FASTER than
80386 built-in BSR (bit scan reverse) instruction in the worst case.

There is also a number of ways to count 1s in the word -- say,
by a sequence of shifts, ANDs and ADDs. The best result is something like
15 trivial instructions (can't remember it right now). Anyway,
it's nearly as efficient as hardware implmentations.

The instructions I really feel missed is the bit composition/decomposition
(BESM-6 sborka/razborka). These commands did the following transformations:

(EXAMPLE)
decomposition:
OP1 A B C D E F G H I J K L M N O P
OP2 0 0 1 1 1 0 1 0 0 0 0 0 1 1 0 1

-> RES 0 0 0 0 0 0 0 0 0 C D E G M N P

composition:
OP1 A B C D E F G H I J K L M N O P
OP2 0 0 1 1 1 0 1 0 0 0 0 0 1 1 0 1

-> RES 0 0 J K L 0 M 0 0 0 0 0 N O 0 P

where A-P are some bits.

As you can see these instructions are mutually reversible (i.e.

decomp(comp(X, M), M) == X & M'
comp(decomp(X, M), M) == X & M

(M' is the word with number_of_ones(M) least significant bits set).

As you can see, these instructions includes logical shifts combined
with conjunctions as their particular cases. They were beautiful
for dealing with extarcting characters from BESM-6s 48bit words.

There also were interesting applications in digital search algorithms and
(i believe) in cryptography.

Does anybody know an efficient way to simulate them using vanilla
instruction set? I'd really appreciate it.

Vadim Antonov
Berkeley Software Design, Inc.

Mark Riordan

unread,

Apr 30, 1992, 12:11:55 AM4/30/92

to

g...@hoptoad.uucp (John Gilmore) writes:
: It occurred to me to wonder just how many architectures have the following

: instructions:
:
: Population Count -- tell me how many 1-bits are in this word
:
: Find First One (or Zero) -- can speed up emulation
: of popcount on machines without it

:
: had one of these instructions added to it'. Now I'm looking for actual

: information on *which* architectures and *why* the instructions were
: added.

The CDC 6000 mainframes had a popcount instruction, and a FP normalize
instruction which could find the first 1 in the lower 48 bits
of a 60-bit word. (The result going to a B register.)
Back when I was learning this (mid-70's),
we were told that the popcount had been added to the architecture
(apparently right from the start) by the request of the Government,
possibly for crypto work. But that may have been just a rumor.

I seem to recall that the CDC 6600 was announced in 1964, and
probably first delivered around '66.

Mark Riordan m...@scss3.cl.msu.edu

Don Gillies

unread,

Apr 30, 1992, 3:41:59 AM4/30/92

to

Here is the magic routine, transcribed from DEC-10 assembler into C
source code. This subject comes up every year in comp.lang.c or
elsewhere.

-------------------------------------------------------------------------
/*
* This function counts the number of bits in a long.
* It is limited to 63 bit longs, but a minor mod can cope with 511 bits.
*
* It is so magic, an explanation is required:
* Consider a 3 bit number as being
* 4a+2b+c
* if we shift it right 1 bit, we have
* 2a+b
* subtracting this from the original gives
* 2a+b+c
* if we shift the original 2 bits right we get
* a
* and so with another subtraction we have
* a+b+c
* which is the number of bits in the original number.
* Suitable masking allows the sums of the octal digits in a
* 32 bit number to appear in each octal digit. This isn't much help
* unless we can get all of them summed together.
* This can be done by modulo arithmetic (sum the digits in a number by molulo
* the base of the number minus one) the old "casting out nines" trick they
* taught in school before calculators were invented.
* Now, using mod 7 wont help us, because our number will very likely have
* more than 7 bits set. So add the octal digits together to get base64
* digits, and use modulo 63.
* (Those of you with 64 bit machines need to add 3 octal digits together to
* get base512 digits, and use mod 511.)
*
* This is HACKMEM 169, as used in X11 sources.
* Source: MIT AI Lab memo, late 1970's.
*/
int bitcnt(n)
register unsigned long n;
{
register unsigned long tmp;

tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);
return ((tmp + (tmp >> 3)) & 030707070707) % 63;
}
--

John Nagle

unread,

Apr 30, 1992, 4:34:53 AM4/30/92

to

g...@hoptoad.uucp (John Gilmore) writes:

>It occurred to me to wonder just how many architectures have the following
>instructions:

> Population Count -- tell me how many 1-bits are in this word

> Find First One (or Zero) -- can speed up emulation
> of popcount on machines without it

Find First One is a basic part of doing floating point arithmetic
in software, and is useful when looking for free space in bitmapped
disk allocation tables. Some CPUs used to do this operation by
using the floating point unit's normalizer, but that is rare today.

Population Count is said to be useful in computer chess.
Control Data (remember them?) actually had a "Director of Chess" at
one point, back when only supercomputers could play a decent game.

What does NSA want in a CPU? Considerable information is
available, though dated. NSA had IBM construct a special cryptographic
processing unit for its Stretch machine back in the late 1950s. This became
the Harvest machine. Harvest was a Stretch without the floating point
unit but with a "streaming" unit designed to NSA specs. The
streaming unit could convoke two operand strings by making
repetitive memory fetches, perform tasks repetitively on operand pairs,
and return a string of statistics to memory. The command set included
matching, substitution by table lookup, tallying the occurence of specified
logical relationships, and other operations of interest in cryptanalysis.
Stretch/Harvest was the first byte-oriented machine, and one of the first
pipelined CPU machines. These machines had a slow memory for their
CPU speed and a large (over 700 instructions) instruction set, so that
a single instruction could do as much work as possible without using
up memory bandwidth reading instructions. (The Stretch instruction
set was considered oversized, even by its designers.)

Harvest also had the first automated tape library ("Tractor").
Harvest was installed in February 1962 and operated for fourteen years.
There were eight Stretch machines built with floating point units, but only
one Harvest. Stretch machines started around $13 million, and the
Harvest machine was the biggest one ever built.

Although the Stretch machines were not a commercial success, they
established much of the architecture that later appeared in the IBM
System 360 (and lives on to this day). So the 360 family is the place
to look for Harvest influences. 360s include a "Translate" and
"Translate and Test" instruction, which may well be descended from
the Harvest machine's streaming unit. Translate and Test reads a
string of bytes, looking up each byte in a table and replacing it with
the byte in the table, stopping when a byte with a specific value is
found or when the count of bytes to process runs out. This is not
an instruction of general utility. Every 360 and 370 had it,
although I think some emulated it in software. That may well be
a Harvest-inspired feature.

360s don't have population count, though. So that's probably not an
NSA-inspired innovation. UNIVAC and Burroughs mainframes didn't have
population count through at least the 1980s. The Cray-I was the
first machine of which I am aware that had it.

The Cray-I has a minimalist instruction set. (In fact, the Cray-I
is a simple computer; it just has a lot of copies of key elements.)
If the Cray-I had an instruction, it was there for a good reason.
It was designed by a very small team, headed by Seymour Cray.
So Seymour Cray is probably the one to ask on population count.

I suspect that other CPU designers are copying the Cray when they include
population count.

John Nagle

See "Computer Advances Pioneered by Cryptological Organizations",
by S.S. Snyder, 1980, Annals of the History of Computing, pp. 66-70., and
"IBM's Early Computers", by C.J. Bashe, MIT Press, 1986.

John Gregor

unread,

Apr 30, 1992, 4:50:19 AM4/30/92

to

In article <29...@hoptoad.uucp> g...@hoptoad.uucp (John Gilmore) writes:
|+ Population Count
|+ Find First One (or Zero)
|+ Now I'm looking for actual information on *which* architectures and
|+ *why* the instructions were added.

The CM-5 supports both. Can't say what the main reason was.

JohnG -- Eagerly awaiting a CM-5 to arrive here in December!!!!!
--
John A. Gregor USmail: College of Oceanography
E-mail: jo...@oce.orst.edu Oregon State University
Voice #: +1 503 737-3022 Oceanography Admin Bldg. #104
Fax #: +1 503 737-2064 Corvallis, OR 97331-5503

Torben AEgidius Mogensen

unread,

Apr 30, 1992, 6:56:46 AM4/30/92

to

g...@hoptoad.uucp (John Gilmore) writes:

> Population Count -- tell me how many 1-bits are in this word

>I have gotten the impression that the Pop Count instruction has been

>added to more than one architecture upon direct request from the US National
>Security Agency, because it apparently occurs frequently in their inner
>loops, but is slow to emulate in software (table lookup on parts of the word,
>followed by addition of the table entries, seems best to me).

There was a discussion about similar instructions that are
(apparently) easy to do in hardware and slow to do in software.
The outcome of the discussion was that some of the things that would
seem to require a number of instrunctions that is linear in the number
of bits in a word, can with some thought be done in a logarithmic
number of instructions. The PopCount instruction is one of these. Here
is is written in ARM assembler for 32 bit words:

\ Assume Rm1 contains 10101010101010101010101010101010 (binary),
\ Rm2 contains 11001100110011001100110011001100,
\ Rm3 contains 11110000111100001111000011110000,
\ Rm4 contains 11111111000000001111111100000000,
\ Rm5 contains 11111111111111110000000000000000,
\ R0 contains word to be PopCounted
\ At exit R0 contains PopCount, R1 is destroyed
.popcount
AND R1,R0,Rm1 \ use mask 1 to zero all even bits
SUB R0,R0,R1 \ R0 contains the even bits
ADD R0,R0,R1, LSR 1 \ R0 contains PopCount of every bit pair
AND R1,R0,Rm2 \ use mask 2 to zero all even bit pairs
SUB R0,R0,R1 \ R0 contains the even bit pairs
ADD R0,R0,R1, LSR 2 \ R0 contains PopCount of every nibble
AND R1,R0,Rm3 \ use mask 3 to zero all even nibbles
SUB R0,R0,R1 \ R0 contains the even nibbles
ADD R0,R0,R1, LSR 4 \ R0 contains PopCount of every byte
AND R1,R0,Rm4 \ use mask 4 to zero all even bytes
SUB R0,R0,R1 \ R0 contains the even bytes
ADD R0,R0,R1, LSR 8 \ R0 contains PopCount of every halfword
AND R1,R0,Rm5 \ use mask 5 to zero LS halfword
SUB R0,R0,R1 \ R0 contains the LS halfword
ADD R0,R0,R1, LSR 16 \ R0 contains PopCount of full word

This takes 15 cycles for a 32 bit word. Doing it for a 64 bit word (on
a 64 bit processor) would only requite 3 more cycles (and one more
mask word). By using immediate addressing, the last mask word can be
replaced by a constant (since at most 5 bits can be set in the MS
halfword).

OK, 15 cycles is more than a hardware solution would need, but your
program would have use PopCount a lot to justify the hardware. Using
the find-first-1-bit instruction would tend to make PopCount slower
than the above, unless you have very sparse bitstrings. I assume that
cryptographic processors implement PopCount and possibly bit-reverse
and similar instructions in hardware.

Torben Mogensen (tor...@diku.dk)

P.S.
This is a modified version of a article, which I cancelled and
replaced by this. There was an error in the program in the first
posting.

Steve Masticola

unread,

Apr 30, 1992, 8:53:04 AM4/30/92

to

gil...@m.cs.uiuc.edu (Don Gillies) writes:

]/*

] * This function counts the number of bits in a long.
] * It is limited to 63 bit longs, but a minor mod can cope with 511 bits.

[ Explanation deleted - see previous article ]

] */

]int bitcnt(n)
]register unsigned long n;
]{
] register unsigned long tmp;
]
] tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);
] return ((tmp + (tmp >> 3)) & 030707070707) % 63;

]}

Correct me if I'm wrong, but mod 63 implies a division by 63, doesn't
it? And division is still pretty expensive, more so than doing a
population count directly by shift-add-mask, isn't it? So where's the
savings?

- Steve (mast...@cs.rutgers.edu)

Marco Zagha

unread,

Apr 30, 1992, 9:06:48 AM4/30/92

to

In article <1992Apr30....@kithrup.COM> s...@kithrup.COM (Sean Eric Fagan) writes:
>In article <1992Apr29.2...@dcc.uchile.cl> pch...@pehuen.dcc.uchile.cl (Chris Perleberg) writes:
>>More than once I've needed this instruction (Population Count) for
>>text pattern matching algorithms. But I've never encountered a machine
>>that supported it, and doing table lookup on parts of the word is too
>>costly....
>
>Every supercomputer I've seen had it, the Cyber being the one I'm most
>familiar with. The impression I have, however, is that the 6000 did not
>have a pop count instruction (it didn't even have parity!), but that both it
>and parity were added shortly (because 'farmers buy a lot of computers'
>8-)).

One thing to keep in mind is that these bit instructions are very
useful on vector architectures with a vector mask register (e.g.
Cray). This may be one of the reasons they have been on so many
supercomputers.

For example, if you want to find the first element in an array passing
a logical test, the find-first-one instruction on the vector mask will
give you the index you need. More generally, if you want to execute a
scalar loop over all the elements of a vector register passing a
logical test, find-first-one along with shift instructions gives you a
way to do it in time proportional to the number of elements passing
the test rather than the total number of elements in the vector
register.

Popcount on the vector mask is often used along with the
compress-index (enumerate vector-mask) instruction when packing out
unwanted elements of a vector register; the popcount is needed to
determine how many elements are selected and set the vector register
length.

== Marco
Internet: mar...@cs.cmu.edu Uucp: ...!seismo!cs.cmu.edu!marcoz
Bitnet: marcoz%cs.cmu.edu@carnegie CSnet: marcoz%cs.cm...@relay.cs.net

Gregory R. Travis

unread,

Apr 30, 1992, 10:03:07 AM4/30/92

to

In <1992Apr30....@kithrup.COM> s...@kithrup.COM (Sean Eric Fagan) writes:

>Every supercomputer I've seen had it, the Cyber being the one I'm most
>familiar with. The impression I have, however, is that the 6000 did not
>have a pop count instruction (it didn't even have parity!), but that both it
>and parity were added shortly (because 'farmers buy a lot of computers'

The 6600 certainly had population count; I used it. Note that it was about
the SLOWEST instruction in the whole instruction set. I'll have to dig out
the COMPASS manuals when I get home and find out just HOW slow.

Parity was never added to the 6000 line, that came with the 7000 line.

Why did I use it? Well, I could tell you but I'd have to kill you
afterwards...

greg
--
Gregory Reed Travis D P S I
Data Parallel Systems Incorporated gr...@dpsi.com (For MX mailers only!)
Bloomington, IN gr...@cica.indiana.edu (For the others)

roe...@vxcrna.cern.ch

unread,

Apr 30, 1992, 11:12:21 AM4/30/92

to

In article <Apr.30.08.52....@cadenza.rutgers.edu>, mast...@cadenza.rutgers.edu (Steve Masticola) writes:
> [...]

> Correct me if I'm wrong, but mod 63 implies a division by 63, doesn't it?

a % 63 can be implemented with a few % 64's, >>= 6's, and some glue.
The cost of this would have to be included in the evaluation, of course.
Also remember that the value of 3 in the original (3 as in octal) isn't
magic -- one would want to evaluate a few values.

--
Frederick G. M. Roeber | CERN -- European Center for Nuclear Research
e-mail: roe...@cern.ch or roe...@caltech.edu | work: +41 22 767 31 80

Andy Malis

unread,

Apr 30, 1992, 12:46:21 PM4/30/92

to

na...@netcom.com (John Nagle) writes:
>> Find First One (or Zero) -- can speed up emulation
>> of popcount on machines without it

> Find First One is a basic part of doing floating point arithmetic
>in software, and is useful when looking for free space in bitmapped
>disk allocation tables. Some CPUs used to do this operation by
>using the floating point unit's normalizer, but that is rare today.

FFO is also useful in applications that do their own priority-
based scheduling. You have a list of priortized processes
represented by a bit string, along with a table of process
addresses. When you need to schedule a process, set its bit in
the bit string (having an instruction that can set or clear
individual bits based upon their offset in the word also helps).
Then, when it comes time to run a new process, use FFO to find
the first bit in the bit string (the highest priority process
that needs to run). The returned offset is used to index into
the table of process addresses.

Andy Malis <ma...@bbn.com> UUCP: {harvard,rutgers,uunet}!bbn!malis

James Davies

unread,

Apr 30, 1992, 12:46:29 PM4/30/92

to

In article <1992Apr30....@kithrup.COM> s...@kithrup.COM (Sean Eric Fagan) writes:
>
>I'm not aware of any compilers that will take code and replace it with a pop
>count instruction. (Might be interesting to ask in comp.compilers, though.)

Our C and Fortran compilers use the popcount instructions in a number of
different ways. The one that comes to mind first is to find the vector
length to use for code under an IF. We also use the C _popcnt intrinsic
in the compiler in bitvector manipulation and in computing GCD's, and
probably in other places. I doubt if the NSA had any influence on this ;-)

Peter Desnoyers

unread,

Apr 30, 1992, 2:13:06 PM4/30/92

to

gil...@m.cs.uiuc.edu (Don Gillies) writes:

>Here is the magic routine, transcribed from DEC-10 assembler into C
>source code. This subject comes up every year in comp.lang.c or
>elsewhere.

>-------------------------------------------------------------------------
>/*
> * This function counts the number of bits in a long.
> * It is limited to 63 bit longs, but a minor mod can cope with 511 bits.
> *
> * It is so magic, an explanation is required:
> * Consider a 3 bit number as being
> * 4a+2b+c
> * if we shift it right 1 bit, we have
> * 2a+b
> * subtracting this from the original gives
> * 2a+b+c
> * if we shift the original 2 bits right we get
> * a
> * and so with another subtraction we have
> * a+b+c
> * which is the number of bits in the original number.

[... then you add adjacent octal digits, then sum 6-bit digits by
"casting out 63s" ...]

On many machines the modulo 63 operation is going to be rather slow.
Unfortunately, this algorithm cannot be applied repetitively to
combine octal digits, as it only works in binary. (it requires 1 ==
base-1)

A somewhat different (but related) algorithm:

To add the digits in an N-digit number in base B, add all adjacent
pairs of digits, forming an N/2-digit number in base B^2, and then add
the digits in the resulting base B^2 number.

I.e. for 16 bits:

x = (x&0x5555) + ((x&0xAAAA) >> 1);
x = (x&0x3333) + ((x&0xCCCC) >> 2);
x = (x&0x0F0F) + ((x&0xF0F0) >> 4);
x = (x&0x00FF) + (x >> 8); /* or x = x % 255 */

This takes 4*log2(N) - 1 instructions, where N is the word length in
bits. (3*log2(N)-1 if you have a barrel shifter)

You can apply the modulus trick after 3 steps, so the answer won't
overflow. However, it may not be quicker - I've tested the two
approaches on a decstation 2100 with 32-bit words, and even though
casting out 255s saves 6 instructions, it runs slower.

Note that you could combine these two algorithms - for instance, you
could do 48 bits in 5 steps by using the HAKMEM trick to get 16 octal
counts, then using the binary algorithm above to combine them.

Another observation is that divide-and-conquer algorithms like this,
using shifts and masks, can be used for a number of other bitwise
operations, such as bit reversal:

x = ((x&0x5555)<<1) | ((x&0xAAAA) >> 1);
x = ((x&0x3333)<<2) | ((x&0xCCCC) >> 2);
...

or finding highest bit set:

if (x & 0xFF00) {n += 8; x &= 0xFF00}
if (x & 0xF0F0) {n += 4; x &= 0xF0F0} /* or n += table[x%255] */
...

If you have a fast modulus operator, you could take x after the first
step above, cast out 255s, and then look up in a 256-element table to
get the rest of the answer.

Peter Desnoyers

--

Robert Stewart

unread,

Apr 30, 1992, 2:15:30 PM4/30/92

to

In article <29...@hoptoad.uucp> g...@hoptoad.uucp (John Gilmore) writes:

>It occurred to me to wonder just how many architectures have the following
>instructions:
>
> Population Count -- tell me how many 1-bits are in this word
>
> Find First One (or Zero) -- can speed up emulation
> of popcount on machines without it

From the transputer databook (1st edition, 1989)

IMS T800

OP code Mnemonic Proc. cyc. Name
------- -------- ---------- ----
76 bitcnt b + 2 count bits set in word
77 bitrevword 36 reverse bits in word
78 bitrevnbits n + 4 reverse bottom n bits in word

b = Bit number of the highest bit set in register A. Bit 0 is the least
significant bit.
n = Number of places shifted

Proc. cyc. refers to the number of periods TPCLPL taken by an instruction
executing in internal memory.

Robert Stewart
University of Texas at Austin

leic...@zodiac.rutgers.edu

unread,

Apr 30, 1992, 4:09:52 PM4/30/92

to

Here's an alternative version of the sequence that I put together a while
back. It works in groups of 4 bits - more natural for a 32-bit word than
groups of 3 - and is set up to produce in-line code which, if the arguments
are constants, will be a compile-time constant.

-- Jerry

/*
* Macros to compute a bit value from a bit number, bit number from bit value,
* and the bit count (number of bits on) for 32-bit quantities. The results
* are compile-time constants if the arguments are.
*
* The definition of BITCOUNT is based on an algorithm condensed from one
* extracted from a graph coloring register allocator which manipulates
* bitvectors. It was inspired by an analogous algorithm for counting bits in
* 36 bit words in MIT's HAKMEM. The first VAX version was designed by Chris
* Smith and written by Jane Hoffman. The original conversion to a form
* suitable for C on a VAX was by Brad Merrill; that code was fixed and
* re-written by Jerry Leichter. It assumes a 32-bit word. Note that the
* argument is evaluated multiple times, so should not have side-effects. A
* full explanation of the algorithm is too long to include here; in summary,
* BX_(x) replaces each 4-bit sequence in x by the number of bits in that
* sequence (which will, of course, fit in 4 bits). BITCOUNT folds the 4-bit
* sequences into 8-bit sequences. The 8-bit sequences are then added up by
* recognizing that they can be viewed as a base 256 number, and the sum of
* the digits of a base b number is the same mod b-1 as the number mod b-1.
*
* The original implementation produce results in groups of 3, then 6 bits.
* While natural on a PDP-10's 36-bit word, this requires special case
* handling of the sign bit on a VAX, since the top 3-bit sequence is really
* only 2 bits long.
*/
#define BITVAL(n) (1 << (n))
#define BITNUM(b) BITCOUNT((b)-1)
#define BITCOUNT(x) (((BX_(x)+(BX_(x)>>4)) & 0x0F0F0F0F) % 255)
#define BX_(x) ((x) - (((x)>>1)&0x77777777) \
- (((x)>>2)&0x33333333) \
- (((x)>>3)&0x11111111))

Gregg Townsend

unread,

Apr 30, 1992, 4:27:37 PM4/30/92

to

> The 6600 certainly had population count; I used it. Note that it was about
> the SLOWEST instruction in the whole instruction set. I'll have to dig out
> the COMPASS manuals when I get home and find out just HOW slow.
>

> Gregory Reed Travis D P S I
> Data Parallel Systems Incorporated gr...@dpsi.com (For MX mailers only!)
> Bloomington, IN gr...@cica.indiana.edu (For the others)

It was reasonably fast on the 6600 at 8 cycles. For comparison, an add
or shift took 3 cycles and a floating divide took 29.

But it was the the slowest instruction, at 68 cycles, on the CDC 6400.
A shift or add took 6 cycles there, and a floating divide took 57.

Gregg Townsend / Computer Science Dept / Univ of Arizona / Tucson, AZ 85721
+1 602 621 4325 g...@cs.arizona.edu 110 57 16 W / 32 13 45 N / +758m

Tim Olson

unread,

Apr 30, 1992, 4:59:58 PM4/30/92

to

John Gilmore writes

| It occurred to me to wonder just how many architectures have the following
| instructions:
|
| Population Count -- tell me how many 1-bits are in this word
|
| Find First One (or Zero) -- can speed up emulation
| of popcount on machines without it
|

| An informal question at last week's Asilomar Microcomputer Workshop
| showed 5 or 10 hands when asked `Who has designed an architecture that
| had one of these instructions added to it'. Now I'm looking for actual
| information on *which* architectures and *why* the instructions were
| added. (I wouldn't expect a RISC arch to have this in the first
| rev, due to low frequency of use, but once NSA decided it might be
| worth buying the machine, then it might appear in the second rev.)
|
| It's been guessed that the reason they use it so much is for correlation,
| e.g. xor two things and count the one-bits and you see roughly how close
| they are. Any other ideas?

CLZ (Count Leading Zeros) exists in the Am29000 RISC processor
architecture, although it has nothing to do with NSA. The two main
uses are:

1) speeding up the renormalization stage of the software floating
point emulation routines

2) quick prioritization of external inputs (e.g. for choosing an
interrupt source from a set that are all connected to the same
interrupt request line

--
Tim Olson
Advanced Micro Devices
t...@amd.com

John R. Levine

unread,

Apr 30, 1992, 5:14:58 PM4/30/92

to

>360s include a "Translate" and "Translate and Test" instruction, which may
>well be descended from the Harvest machine's streaming unit. Translate and
Test reads a string of bytes, looking up each byte in a table and replacing it
>with the byte in the table, stopping when a byte with a specific value is
>found or when the count of bytes to process runs out. This is not
>an instruction of general utility.

You're confusing TRT and the Vax MOVTUC instructions here. TRT is
actually fairly useful. It scans a string looking up each byte in the
translation table, until it finds a non-zero translation. At that point
it stores the address and translated value of that byte in registers 1 and
2, but it never modifies memory. It's actually a reasonable way to do
lexical scanning with a state machine, using zero table entries to mean
stay in the same state and non-zero entries to indicate the next state to
move to. The designers probably didn't think of it in quite that way, but
rather for the special case of looking for delimiters.

TR, which reads each byte from a string and replaces it by looking it up
in a translation table, is useful both for the obvious job of translating
among character codes, most often ASCII and EBCDIC, and for some cute
tricks like reversing the byte order in a word and as part of a two
instruction sequence that turns a binary field into printable hex
characters.

TR and TRT, like all 360 string instructions, use a fixed string length no
greater than 256, so you have to write complex loops using "execute" to
handle strings of variable length or longer than 256. TR also has serious
performance problem because of the way it interacts with the 370's virtual
memory. A TR either has to execute completely or not at all, because the
370 has no way to store the state of a partially executed instruction
except for a few "long string" instructions added to the archicture later.
But a TR may cause up to 6 page faults, for the instruction itself, the
input string, and the translation table, each of which may span a page
boundary. Only the translation table entries that are actually used have
to be resident, and the only way to find out which ones will be used is to
do the translation. So each TR is executed twice, a preliminary trial
execution that doesn't store anything to see if it can complete, and a
second real execution that actually stores the results if all the pages
are present. TRT doesn't have this problem. Since it doesn't modify
memory and only stores some condition code and register values when it's
done, the CPU can easily abort it if there's a page fault half way
through.

The VAX MOVTUC does indeed to move translated until you hit a specific
value, which seems useful mostly to implement the C strcpy() and strncpy()
routines, using a degenerate table that translates each byte to itself and
an escape value of 0 to stop at the string's terminating null byte. I
expect it's probably one of those Vax instructions that's slower than the
equivalent simple instructions.

Regards,
John Levine, jo...@iecc.cambridge.ma.us, {spdcc|ima|world}!iecc!johnl

EDWARD CRAIG HYATT

unread,

Apr 30, 1992, 9:11:07 PM4/30/92

to

Why doesn't somebody just type
in the code and compare the
two methods. Not me. I'm
too lazy. :-)

- Craig
ech...@eos.ncsu.edu

Peter David THOMPSON

unread,

May 1, 1992, 12:58:33 AM5/1/92

to

Try the 68030 & 68040 (and maybe the 020). BFFO (bitfield find first one).
All I have is the Motorola 68030 tech ref & I've lent it. Pay your own Aus
$25.00.

pdt

* These opinions belong to p...@mundil.cs.mu.OZ.AU unless otherwise specified.
"GNU Make will no longer go into an infinite loop when fed the horrid trash
that passes for makefiles that `imake' produces (so you can compile X, despite
the extreme stubbornness and irrationality of its maintainers)."-version 3.55

Ted Dunning

unread,

May 1, 1992, 1:32:12 AM5/1/92

to

In article <1992Apr30....@kithrup.COM> s...@kithrup.COM (Sean
Eric Fagan) writes:

I'm not aware of any compilers that will take code and replace it
with a pop count instruction. (Might be interesting to ask in
comp.compilers, though.)

gcc will do very nice inlining of machine language. you can even
write your code to interact with the register allocator nicely.

Eliot Moss

unread,

May 1, 1992, 8:55:31 AM5/1/92

to

Folks have mentioned the "HAKMEM trick/technique" for computing population
count without a pop count instruction, with the approach being to combine
adjacent digits, and then do a reduction mod (2^k)-1, where k is the size of
the digit accumulated. If divide is slow, and multiply is fast and produces a
double length result, you can accomplish the same effect as follows. Let's
suppose that you are counting a 32-bit word and have accumulated 8 digits of 4
bits each. In hexadecimal this would look like:

0p0q0r0s for some digits p, q, r, and s

Now multiply this times 01010101. If we write it out long form as in
grade school, we get the following:

0p0q0r0s
x 01010101
--------
0p0q0r0s
00000000
0p0q0r0s
00000000
0p0q0r0s
00000000
0p0q0r0s
00000000
---------------
zzyyxxwwvvuutt0s

Note that vv is the sum of the digits that we originally desired. Also note
that there can be no overflow into or out of vv, by design. Further, we do not
even need the double length result, so long as we can get the correct low
order bits.

Whether this is faster than oher techniques depends on your hardware of
course, but I thought folks might find it interesting.
--

J. Eliot B. Moss, Assistant Professor
Department of Computer Science
Lederle Graduate Research Center
University of Massachusetts
Amherst, MA 01003
(413) 545-4206, 545-1249 (fax); Mo...@cs.umass.edu

Alan Klietz

unread,

May 1, 1992, 2:51:34 PM5/1/92

to

In article <1992Apr30.0...@cis.uab.edu> hy...@cis.uab.edu (Robert Hyatt) writes:
<In article <1992Apr29.2...@dcc.uchile.cl> pch...@pehuen.dcc.uchile.cl (Chris Perleberg) writes:
<>In article <29...@hoptoad.uucp>, g...@hoptoad.uucp (John Gilmore) writes:
<>> It occurred to me to wonder just how many architectures have the following
<>> instructions:
<>>
<>> Population Count -- tell me how many 1-bits are in this word
<>>
<>> Find First One (or Zero) -- can speed up emulation
<>> of popcount on machines without it
<>>

<>I've never encountered a machine that supported it
<>

<> Chris Perleberg
<> pch...@dcc.uchile.cl
<
<
<Check out the Crays for the Pop count instruction.

Pop Count and Find First are also in the CM-5 vector instruction set.

--
Alan E. Klietz
Minnesota Supercomputer Center, Inc.
1200 Washington Avenue South
Minneapolis, MN 55415
Ph: +1 612 626 1737 Internet: al...@msc.edu

Steve Jay {Ultra Unix SW Mgr}

unread,

May 1, 1992, 8:45:37 PM5/1/92

to

In <1992Apr30....@kithrup.COM> s...@kithrup.COM (Sean Eric Fagan) writes:

>This is the easy part. First, actual FORTRAN code to do a pop count on an
>integer on a Cyber:

> COUNT = POP(I)

>I'm not aware of any compilers that will take code and replace it with a pop
>count instruction. (Might be interesting to ask in comp.compilers, though.)

CDC's Fortran compiler did, starting with FTN version 4, I believe. It
had a whole bunch of "intrinsic" functions, such as SHIFT, COUNT, etc.,
that would turn into the appropriate in-line machine instructions.

Steve Jay
s...@ultra.com ...ames!ultra!shj
Ultra Network Technologies / 101 Dagget Drive / San Jose, CA 95134 / USA
(408) 922-0100 x130 "Home of the 1 Gigabit/Second network"

Sean Eric Fagan

unread,

May 2, 1992, 3:04:14 PM5/2/92

to

In article <1992May2.0...@ultra.com> s...@ultra.com (Steve Jay {Ultra Unix SW Mgr}) writes:
>In <1992Apr30....@kithrup.COM> s...@kithrup.COM (Sean Eric Fagan) writes:
>>I'm not aware of any compilers that will take code and replace it with a pop
>>count instruction. (Might be interesting to ask in comp.compilers, though.)
>CDC's Fortran compiler did, starting with FTN version 4, I believe. It
>had a whole bunch of "intrinsic" functions, such as SHIFT, COUNT, etc.,
>that would turn into the appropriate in-line machine instructions.

Once again, I am not aware of any compilers which will take code, such as

int bitcnt(n)
register unsigned long n;
{
register unsigned long tmp;

tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);
return ((tmp + (tmp >> 3)) & 030707070707) % 63;
}

and turn it into a pop count instruction.

If I'd meant built-in functions that were, essentially, inline assembly, I
would not have brought up the example of the Cyber FORTRAN compiler and
COUNT(), would I've?

--
Sean Eric Fagan | "One form to rule them all, one form to find them, one
s...@kithrup.COM | form to bring them all and in the darkness rewrite the
-----------------+ hell out of them" -- sendmail ruleset 3 comment from DEC.
Any opinions expressed are my own, and generally unpopular with others.

Dan Bernstein

unread,

May 2, 1992, 7:12:45 PM5/2/92

to

In article <1992May02.1...@kithrup.COM> s...@kithrup.COM (Sean Eric Fagan) writes:
> Once again, I am not aware of any compilers which will take code, such as

[ HAKMEM-derived bit count idiom ]

> and turn it into a pop count instruction.

Gee, that complaint sure sounds familiar. As I've said many times
before, the compiler for a machine with such an instruction should
recognize at least one idiom for that instruction---and should
*document* the idiom so that people can use it. (As Hugh LaMaster
observed last time we had this discussion, this is exactly what vector
compilers do already.) With this and judicious #ifdefs you can get code
which is fast for any machine.

It would be even better if the language had a ``quickpick'' structure
(as Q does) to let the compiler choose the fastest code out of several
alternatives provided by the programmer. That way you could write out
several idioms for a critical operation like bit counting. Your code
would run quickly on any machine whose compiler recognized one of the
idioms---even if you had never heard of that machine when you wrote the
code!

The difference between idioms and vendor-specific language extensions is
that idioms are, by definition, portable---a compiler which doesn't
recognize an idiom will still produce correct code for it, just as a
compiler which doesn't recognize Convex's conventions for array aliasing
declarations in C will still produce correct code for arrays. Idioms
give programmers and vendors a chance to work together in producing
efficient code without any painful compatibility hassles.

(Followups to comp.arch---the NSA conspiracy theory is getting stale.)

---Dan

Seiji Kaneko

unread,

May 2, 1992, 10:36:48 PM5/2/92

to

>>>>> In article <29...@hoptoad.uucp>, g...@hoptoad.uucp (John Gilmore) writes:
>> It occurred to me to wonder just how many architectures have the following
>> instructions:
>> Population Count -- tell me how many 1-bits are in this word
>> Find First One (or Zero) -- can speed up emulation
>> of popcount on machines without it
<>I've never encountered a machine that supported it

Hitachi's M series(IBM 370 PCM in japan) has Find first one
(FLM: find first left-most one.) instruction. Export version
(ex. EX** series from HDS) don't.
Though it has been supported for over 15 years, I've never seen
any good examples of this instruction in commercial softwares.
--
Seiji Kaneko kan...@icsi.Berkeley.EDU,kan...@sdl.hitachi.co.jp
..................................... 76270...@compuserve.com

Glenn A Sullivan

unread,

May 3, 1992, 6:22:49 AM5/3/92

to

In article <1992Apr30....@msuinfo.cl.msu.edu>, m...@scss3.cl.msu.edu (Mark Riordan) writes:
> g...@hoptoad.uucp (John Gilmore) writes:
.............. [lots deleted]
> possibly for crypto work. But that may have been just a rumor.

From an article in Fortune magazine, but mainly from an article in
Annals (sp) of Computing [I think], Seymour Cray worked for a northern
company named something like Dynamic Research or Electronic Research ?
which produced code-breaking equipment for the government.
He went to CDC. He founded Cray Research.
He was awarded (by Nixon?) a civilian award for "critical service".
I doubt the award was for the number-crunching power of the Cray-1, but
for its code-bbreaking power.

Allen Sullivan

Jeremy Webber

unread,

May 3, 1992, 7:48:02 PM5/3/92

to

In article <1992May1.0...@odin.diku.dk> tor...@diku.dk (Torben AEgidius Mogensen) writes:

rste...@ccwf.cc.utexas.edu (Robert Stewart) writes:

[timing for Transputer bitcount instructions]

As mentioned in earlier postings by myself and several others, bit
count and bit reverse can be done in ~15 instructions using masks and
shifts. If these take about one cycle each, it would take far less
time than using the predefined instruction.

However, on the Transputer these instructions don't take one cycle each. The
bitcount instruction was added to the T800, it doesn't exist in the T414. It
is used in some graphics algorithms to convert coverage masks into intensity,
as well as for other reasons. The bitcount instruction does result in a useful
increase in speed for these cases.

This is not a defense of the Transputer instruction set though.

-jeremy
--
--
Jeremy Webber Internet: jer...@cs.adelaide.edu.au
Digital Arts Film and Television, jer...@chook.ua.oz.au
3 Milner St, Hindmarsh, SA 5007, Phone: +61 8 346 4532
Australia FAX: +61 8 346 4537

David K. Bradley

unread,

May 4, 1992, 11:26:42 AM5/4/92

to

Population Count -- tell me how many 1-bits are in this word

Find First One (or Zero) -- can speed up emulation
of popcount on machines without it

Such instructions are useful on a hypercube system. The population count
instruction can be used to determine the distance between two nodes in the
hypercube network. The first generation Ncube system had an instruction that
found the rightmost one. This was useful in implementing the "ecube" routing
commonly used in hypercube networks. Presumably subsequent Ncube systems
continue to support this instruction.

As other posters have noted, the population count instruction is very useful
on vector systems that support index compression (gather scatter) and vector
mask instructions. The population count of the vector mask is the vector
length.

--
David Bradley bra...@csrd.uiuc.edu
Center for Supercomputing Research and Development
University of Illinois at Urbana-Champaign

Mike Tighe

unread,

May 4, 1992, 11:43:54 AM5/4/92

to

tor...@diku.dk (Torben AEgidius Mogensen) writes:

>As mentioned in earlier postings by myself and several others, bit
>count and bit reverse can be done in ~15 instructions using masks and
>shifts. If these take about one cycle each, it would take far less
>time than using the predefined instruction.

How do you figure? Even in your best case scenario, it takes 15 clocks to
compute one answer. On most machines that I have used that implement
popcnt(), they can do it in around 2-4 clocks for scalar and 1 result/clock
for vector. That seems to me to be around 4-15 times faster than your
software scenario.

-Mike
--
Mike Tighe, (214) 497-4206
ti...@convex.com

Torben AEgidius Mogensen

unread,

May 5, 1992, 4:52:17 AM5/5/92

to

ti...@convex.COM (Mike Tighe) writes:

I figured this from the table in the posting I replied to, which
stated that on the T800, the bit count instruction takes n+2 (or was
it n+1?) cycles, where n is the position of the MSb, and the bit
reverse instruction takes 32 (34?) cycles. I fully appreciate that it
is possible to do this far faster in hardware, but in a tight loop on
a T800 a software bit count or reverse will be a lot faster than the
"hardware" (microcoded?) instruction (assuming the code I showed runs
in a similar number of cycles on a T800 as on an ARM).

Torben Mogensen (tor...@diku.dk)

Matt Mahoney

unread,

May 5, 1992, 8:49:44 AM5/5/92

to

In article <29...@hoptoad.uucp> g...@hoptoad.uucp (John Gilmore) writes:
>It occurred to me to wonder just how many architectures have the following
>instructions:
>

> Population Count -- tell me how many 1-bits are in this word
>
> Find First One (or Zero) -- can speed up emulation
> of popcount on machines without it

The DEC-20 (a 36-bit mainframe forerunner of the VAX) had a Find First
One and Jump instruction (back in 1979 when I last worked on one).

The 80x86 architecture has a parity flag that can give you the last bit
of a popcount. However, the XLAT instruction (8-bit lookup table)
is probably the fastest implementation.

-------------------------------- _\/_
Matt Mahoney, m...@epg.harris.com |(TV)| Drug of the Nation
#include <disclaimer.h> |____|

jon reeves

unread,

May 5, 1992, 11:19:30 AM5/5/92

to

In article <29...@hoptoad.uucp>, g...@hoptoad.uucp (John Gilmore) writes:
|> It occurred to me to wonder just how many architectures have the following
|> instructions:

|>...

|> Find First One (or Zero) -- can speed up emulation
|> of popcount on machines without it

The venerable old UNIVAC/Sperry Univac/Sperry/Unisys 1100/2200 architecture has
the instructions "load shift and count" and "double LSC" that will circular shift
the target 36-bit word/72-bit double word until the two high-order bits differ.
The resultant shift count is stored in the next higher register.

This was used primarily for normalizing floating point numbers.
--
Jon Reeves Digital Equipment Corporation, Alpha Open Systems Group
5 Wentworth Drive, MS GSF1-1K13, Hudson NH 03051-4929 USA
+1 603 884 5859 ree...@decvax.dec.com FAX: +1 603 884 1685

Robert Bernecky

unread,

May 5, 1992, 12:14:34 PM5/5/92

to

In article <KANEKO.92...@axiom.icsi.Berkeley> kan...@icsi.Berkeley.EDU (Seiji Kaneko) writes:
>>>>>> In article <29...@hoptoad.uucp>, g...@hoptoad.uucp (John Gilmore) writes:
>>> It occurred to me to wonder just how many architectures have the following
>>> instructions:
>>> Population Count -- tell me how many 1-bits are in this word
>>> Find First One (or Zero) -- can speed up emulation
>>> of popcount on machines without it
><>I've never encountered a machine that supported it
>
>Hitachi's M series(IBM 370 PCM in japan) has Find first one
>(FLM: find first left-most one.) instruction. Export version
>(ex. EX** series from HDS) don't.
>Though it has been supported for over 15 years, I've never seen
>any good examples of this instruction in commercial softwares.

I worked with the Hitachi Software Works, as a consultant, helping
them to port SHARP APL to VOS3 (their version of MVS), and to make it
support their vector facility. Some of the APL idioms which I recommended
we special case for speed were "reduction" and "indexof" on Booleans.

In particular:
+/B
where B is a Boolean array (let's just talk vector here, although it
does perform in parallel on higher rank arrays) effectively sticks a
plus sign between elements of B, and evaluates that expression. Therefore,
if B is 0 1 0 0 1 1 1, then +/B is the same as:
0+1+0+0+1+1+1

The reduction code to trap plus, or, and, equal, not equal, max, minn, and
several other reductions on Booleans is trivial. It also runs Very Fast.

Indexof returns the first occurence of each element of the right argument
in the left argument. For example:
'abcdefghijkl' indexof 'deaf'
4 5 1 6

For Booleans, it is simple the "find first zero (or one):
0 0 0 0 1 0 1 indexof 1 1 0
5 5 0

Easy to special case. I wrote on this topic in the APL conference Proceedings
in 1973, about how I sped up indexof and membership in SHARP APL.
I think even IBM picked up my algorithms a bit later( Around 1988...?)
Bob

Robert Bernecky r...@yrloc.ipsa.reuter.com bern...@itrchq.itrc.on.ca
Snake Island Research Inc (416) 368-6944 FAX: (416) 360-4694
18 Fifth Street, Ward's Island
Toronto, Ontario M5J 2B9
Canada

John R. Grout

unread,

May 5, 1992, 10:01:22 PM5/5/92

to

m...@caesun6.harris-atd.com (Matt Mahoney) writes:

>In article <29...@hoptoad.uucp> g...@hoptoad.uucp (John Gilmore) writes:
>>It occurred to me to wonder just how many architectures have the following
>>instructions:
>>
>> Population Count -- tell me how many 1-bits are in this word
>>
>> Find First One (or Zero) -- can speed up emulation
>> of popcount on machines without it

>The DEC-20 (a 36-bit mainframe forerunner of the VAX) had a Find First
>One and Jump instruction (back in 1979 when I last worked on one).

This instruction was Jump if Find First One (JFFO)... JFFO AC,E meant
"if AC is non-zero, load a count of the leading zeros in AC into AC+1
and jump to E; if AC is zero, clear AC+1 and don't jump"... my college
had both a KA10-based DEC-10 and a KL10-based DEC-20 and JFFO was
present on both of them, so if JFFO were added at a potential
customer's request (as I heard back then), it must have been added a
_long_ time ago... To use a tight JFFO loop in disk block allocation,
TENEX and TOPS-20 used 36-bit maps with one-bits for free blocks and
zero-bits for allocated blocks.

--
John R. Grout
University of Illinois, Urbana-Champaign

Center for Supercomputing Research and Development

INTERNET: j-g...@uiuc.edu

Hugh LaMaster -- RCS

unread,

May 5, 1992, 10:26:24 PM5/5/92

to

In article <1992May5.1...@yrloc.ipsa.reuter.COM>, r...@yrloc.ipsa.reuter.COM (Robert Bernecky) writes:
|> In article <KANEKO.92...@axiom.icsi.Berkeley> kan...@icsi.Berkeley.EDU (Seiji Kaneko) writes:
|> >>>>>> In article <29...@hoptoad.uucp>, g...@hoptoad.uucp (John Gilmore) writes:
|> >>> It occurred to me to wonder just how many architectures have the following
|> >>> instructions:
|> >>> Population Count -- tell me how many 1-bits are in this word
|> >>> Find First One (or Zero) -- can speed up emulation
|> >>> of popcount on machines without it

:

|> >Hitachi's M series(IBM 370 PCM in japan) has Find first one
|> >(FLM: find first left-most one.) instruction. Export version
|> >(ex. EX** series from HDS) don't.

:

|> where B is a Boolean array (let's just talk vector here, ...
:

|> Easy to special case. I wrote on this topic in the APL conference Proceedings
|> in 1973, about how I sped up indexof and membership in SHARP APL.
|> I think even IBM picked up my algorithms a bit later( Around 1988...?)

I don't think this is a dead issue, architecturally. Interested parties
should check out the following book:

"Vector Models for Data-Parallel Computing"
Guy E. Blelloch
MIT Press-1990
LC #: QA76.5 B5456 1990

--
Hugh LaMaster, M/S 233-9, UUCP: ames!lamaster
NASA Ames Research Center Internet: lama...@ames.arc.nasa.gov
Moffett Field, CA 94035-1000 Or: lama...@george.arc.nasa.gov
Phone: 415/604-1056 #include <usenet/std_disclaimer.h>

Rob Warnock

unread,

May 6, 1992, 12:29:04 AM5/6/92

to

sup...@human.enet.dec.com (Bob Supnik) writes:
+---------------
| PDP-10's (and -20's) had a JFFO - jump if find first one - and VAX's have
| a FFS/FFC - find first set/clear. Both were done to facilitate searching
| of bit vectors used as allocation maps for disk or memory blocks.
+---------------

While both are useful for searching bit-mapped alloation tables, the VAX's
FFS/FFC is definitely *inferior* to the PDP-10's JFFO (or Am29000's CLZ, etc.)
in that the VAX FFS/FFC counts from the *right*, not the left like everyone
else's priority-encoder instruction (another name for it). I suspect the VAX
designers were led astray by an excessive purist view of the LittleEndian-
ness of the VAX, and forgot that find-MOST-significant-one is the important
fundamental operation of the two.

Find-LEAST-significant-one is easily synthesized from FMSO (a.k.a. JFFO, CLZ),
with FLSO(x) = FMSO(x & (-x)). The reverse is not true. There is no "simple"
combination of basic operations -- *including* FLSO in the "basic" set --
which will give you FMSO. [There was a CACM paper a couple of decades ago
which proved this...]

The designers of the VAX slipped up in not providing a FMSO instruction.

-Rob

-----
Rob Warnock, MS-9U/510 rp...@sgi.com
Silicon Graphics, Inc. (415)335-1673 before 6pm PDT May 8, 1992
2011 N. Shoreline Blvd. (415)390-1673 after 6pm PDT May 8, 1992
Mountain View, CA 94039-7311 "Please make a note of it."

Robin Fairbairns

unread,

May 6, 1992, 7:52:21 AM5/6/92

to

In article <29...@hoptoad.uucp>, g...@hoptoad.uucp (John Gilmore) writes:
> It occurred to me to wonder just how many architectures have the following
> instructions:
>
> Population Count -- tell me how many 1-bits are in this word
>
> Find First One (or Zero) -- can speed up emulation
> of popcount on machines without it

The (extremely) odd Modular One had both, as option bits on its shift
instruction. Popcount was achieved by a Tally/Logical shift - it gave
you the count for the number of bits you shifted out (I don't remember
which of its many curious registers got the result). Find First was a
Justify/Logical shift (again, it told you how far it had gone, but I
don't remember where).

The Modular One was an Iann Barron development - an expensive machine,
but (for its time) very fast (implemented in MECL 2). The company
didn't do too well (the PDP-11 came along and was more friendly to
programmers ;-), but Iann eventually resurfaced launching the
Transputer.

I don't imagine the NSA had anything to do with the instruction set of
the Modular One. The instructions (at least justify) were slated as
being for software emulation of floating point.
--
Robin Fairbairns, Senior Consultant, postmaster and general dogsbody
Laser-Scan Ltd., Science Park, Milton Rd., Cambridge CB4 4FY, UK
Email: ro...@lsl.co.uk --or-- r...@cl.cam.ac.uk

Bob Clements

unread,

May 6, 1992, 9:53:23 AM5/6/92

to

In article <1992May6....@csrd.uiuc.edu> j-g...@uiuc.edu writes:
>>The DEC-20 (a 36-bit mainframe forerunner of the VAX) had a Find First
>>One and Jump instruction (back in 1979 when I last worked on one).
>
>This instruction was Jump if Find First One (JFFO)... JFFO AC,E meant
>"if AC is non-zero, load a count of the leading zeros in AC into AC+1
>and jump to E; if AC is zero, clear AC+1 and don't jump"... my college
>had both a KA10-based DEC-10 and a KL10-based DEC-20 and JFFO was
>present on both of them, so if JFFO were added at a potential
>customer's request (as I heard back then), it must have been added a
>_long_ time ago... To use a tight JFFO loop in disk block allocation,
>TENEX and TOPS-20 used 36-bit maps with one-bits for free blocks and
>zero-bits for allocated blocks.

OK, I've been resisting posting on this thread, since it gets covered
annually over in alt.folklore.computers, but I'll confirm this.

Authority-info: Two of us, Alan Kotok and myself, designed the
logic of the KA-10. Alan was the more senior of us, so I had to
do more of the bureaucratic paperwork and therefore got the
title of "Project Engineer". (There was very little such
paperwork by today's standards, of course.)

The JFFO instruction was added for the KA-10 (the first of the
PDP-10 / DECsystem-10 processors). It was _not_ present
in the PDP-6, as an earlier poster incorrectly said.

JFFO was added very late in the development cycle, at the request
of a potential customer. The customer was a European telephone
exchange maker. They wanted to use it to scan for changed bits
representing phone line occupancies.

This change was, of course, a new incompatibility with the predecessor
machine, the PDP-6, so we were reluctant to add the instruction.
We asked Marketing how many machines we would sell if we added it.
They said "At least 5 for a testbed, and hundreds if they decide
to go ahead with the design."

So we added it. It was 1968, which was a leap year. I wrote the
blurb for the internal Sales Newsletter, proclaiming that in
honor of Leap Year, we were increasing the number of instructions
in the machines instruction set from 365 to 366. I still have
a copy of that newsletter somewhere, I think.

End result: The potential customer never bought a single machine.

But we found lots of uses for the instruction. One of the more
obscure was "Clear register 0 and jump" in one instruction. This
worked because the stack register was essentially always register
17 (0x0f for you youngsters) and register+1 (mod 16) was zero.
The stack register always had a negative count in the left half,
so a JFFO using register 17 would always jump, and always find
the first ONE in bit 0, thereby putting the bit number (0) in
register 17+1=0.

And to close the loop back to the title of this thread, the NSA
did buy some PDP-10s later, of course, but they were not the
instigators of the JFFO instruction.

[For comp.arch: The implementation was a hardwired shift in a
loop, looking for the sign bit to become set. There was a
pre-check for all zeroes in the high 18 bits. In that case, the
low half was copied into the high half and the shift counter was
preset to 18 instead of zero. This saved a lot of unneeded
shifting and was fairly cheap because the laft-half-equals-zero
gate already existed. I once disabled that gate intentionally,
making it always do the extra shifts. This came up with the
right answer, but too slowly. Then I challenged the diagnostic
programmers to find what I had broken. They gave up after a
while.]

>John R. Grout
>University of Illinois, Urbana-Champaign
>Center for Supercomputing Research and Development

Bob Clements, K1BC, clem...@bbn.com

Bob Supnik

unread,

May 6, 1992, 8:12:56 PM5/6/92

to

In article <ke4...@sgi.sgi.com>, rp...@rigden.wpd.sgi.com (Rob Warnock) writes...
> [stuff deleted]

>While both are useful for searching bit-mapped alloation tables, the VAX's
>FFS/FFC is definitely *inferior* to the PDP-10's JFFO (or Am29000's CLZ, etc.)
>in that the VAX FFS/FFC counts from the *right*, not the left like everyone
>else's priority-encoder instruction (another name for it). I suspect the VAX
>designers were led astray by an excessive purist view of the LittleEndian-
>ness of the VAX, and forgot that find-MOST-significant-one is the important
>fundamental operation of the two.

> [stuff deleted]

>The designers of the VAX slipped up in not providing a FMSO instruction.

For their PURPOSE, which was bit vector searching (not priority encoding),
searching from the right, rather than the left, works just as well. In
addition, the right-to-left search allows FFS/FFC to share logic with
other instructions which use a (right-to-left) bit mask, such as CALLx,
PUSHR, POPR, and RET. In these instructions, the correspondence between
bit number and register number allows significant simplifications in
hardware to accelerate these instructions.

[I've had the pleasure (?) of microcoding FFS and FFC on four different
VAX implementations, so this is based on experience.]

Bob Supnik >Sup...@human.enet.dec.com
>All opinions expressed are those of a hardline microcoder
>and do not reflect those of Digital Equipment Corporation

John G Dobnick

unread,

May 6, 1992, 8:35:08 PM5/6/92

to

From article <1992May5.1...@aosg.gsf.dec.com>, by ree...@yauw.aosg.gsf.dec.com (jon reeves):

> In article <29...@hoptoad.uucp>, g...@hoptoad.uucp (John Gilmore) writes:
> |> It occurred to me to wonder just how many architectures have the following
> |> instructions:
> |>...
> |> Find First One (or Zero) -- can speed up emulation
> |> of popcount on machines without it
>
> The venerable old UNIVAC/Sperry Univac/Sperry/Unisys 1100/2200 architecture
> has the instructions "load shift and count" and "double LSC" that will
> circular shift the target 36-bit word/72-bit double word until the two
> high-order bits differ.
> The resultant shift count is stored in the next higher register.
>
> This was used primarily for normalizing floating point numbers.

These instructions are also used, as someone else suggested, for the
purposes of task dispatching (via a word of "flag bits" indicating
eligable tasks and an associated pointer vector), and disk space
allocation (disk allocation tables being stored as bitmaps).

The UNIVAC machines also have what might be considered "single bit"
versions of the "load shift and count" --

Jump Negative (Positive) and Shift -- tests the sign bit of a
register, jumps if the sign bit is set (not set), and then
does a one-bit left circular shift of the register. Useful
for scanning a bit stream.

Univac hardware also has "parity testing" instructions -- which test
whether a word in a register contains an even or odd number of bits.
No "population count" instructions, though.
--
John G Dobnick (Exec Guru, ret.) ATTnet: (414) 229-5727
Computing Services Division INTERNET: j...@uwm.edu
University of Wisconsin - Milwaukee UUCP: uunet!uwm!jgd

"Knowing how things work is the basis for appreciation,
and is thus a source of civilized delight." -- William Safire

Roger Shepherd

unread,

May 8, 1992, 4:00:28 AM5/8/92

to

In article <1992May5.1...@wraxall.inmos.co.uk>, d...@frogland.inmos.co.uk (David Shepherd) writes:
|> In article <1992May5.0...@odin.diku.dk>, tor...@diku.dk (Torben AEgidius Mogensen) writes:
|> >ti...@convex.COM (Mike Tighe) writes:

|> >I figured this from the table in the posting I replied to, which
|> >stated that on the T800, the bit count instruction takes n+2 (or was
|> >it n+1?) cycles, where n is the position of the MSb, and the bit
|> >reverse instruction takes 32 (34?) cycles. I fully appreciate that it
|> >is possible to do this far faster in hardware, but in a tight loop on
|> >a T800 a software bit count or reverse will be a lot faster than the
|> >"hardware" (microcoded?) instruction (assuming the code I showed runs
|> >in a similar number of cycles on a T800 as on an ARM).
|>

|> i think someone here did a fast software solution which
|> involved using the 4 bytes in the word as an offset into
|> a 256 entry lookup table - which was reasonably fast for
|> software.

As far as I know the above is the best implementation of this operation on a
transputer without the using the bitcnt instruction. The code loads the
individual bytes out of the word to be bit-counted, and uses table look-up
to determine the number of bits set in each byte. If careful use is
made of internal RAM this will operate in about the same number of cycles
as the bitcnt instruction, otherwise it is significantly slower.

Now from a purist's point of view the ``correct'' way to speed up this
operation on the transputer would be to make load_byte, load_local and
various other transputer instructions go faster, and add a cache. However,
that is difficult - microcoding a bitcnt instruction on the existing
datapath was comparatively easy.

Was it worthwhile? It is hard to tell - the addition of the instruction has
de-skilled the programming of the bit-count operation and has added a "bullet
feature" to the machine - both of these make it easier to sell.
--
Roger Shepherd, INMOS Ltd JANET: ro...@uk.co.inmos
1000 Aztec West UUCP: ukc!inmos!roger or uunet!inmos-c!roger
Almondsbury INTERNET: ro...@inmos.com
+44 454 616616 ROW: ro...@inmos.com OR ro...@inmos.co.uk

Paul Koning

unread,

May 11, 1992, 12:39:22 PM5/11/92

to

Not quite. That technique only works in machines that use two's complement
binary arithmetic. Since most machines (and all recent ones I can think of)
are in that class, it's easy to forget there are other choices. On a
CDC 6600, the above will NOT work... nor on an IBM 1620!

paul

Phil Howard

unread,

May 11, 1992, 3:43:59 PM5/11/92

to

s...@kithrup.COM (Sean Eric Fagan) writes:

>Once again, I am not aware of any compilers which will take code, such as
>
>int bitcnt(n)
>register unsigned long n;
>{
> register unsigned long tmp;
>
> tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);
> return ((tmp + (tmp >> 3)) & 030707070707) % 63;
>}
>
>and turn it into a pop count instruction.

What a compiler COULD do is define something like a special psuedo-macro
that will generate the pop count instruction. A real macro would be there
in case the program needs to be compiled on a machine without the instruction.

It is just not in the nature of C to take any code like the above (which is
NOT the only way accomplish this in C) and change it to an instruction.

Anyway, I already use macros defined for the purposes. Here is a copy of
my header file (bits.h) to define the macros. Now when you are writing a
C compiler for a machine with instructions that do POP COUNT, make that
compiler have an option to invoke special handling for these macros, and
not use the macro as defined, but generate code with that instruction.

/***********************************************************************\
|macro: COUNTBITS16 |
|purpose: count the bits that are 1 in a 16 bit lvalue |
|usage: COUNTBITS16( bits ); |
| or: count = COUNTBITS16( bits ); |
|note: the lvalue argument always ends up with the result |
\***********************************************************************/
#define COUNTBITS16(b) (\
(b)=(((b)>> 1)&0x5555)+((b)&0x5555),\
(b)=(((b)>> 2)&0x3333)+((b)&0x3333),\
(b)=(((b)>> 4)&0x0707)+((b)&0x0707),\
(b)=(((b)>> 8)&0x000f)+((b)&0x000f),\
(b))

/***********************************************************************\
|macro: COUNTBITS32 |
|purpose: count the bits that are 1 in a 32 bit lvalue |
|usage: COUNTBITS32( bits ); |
| or: count = COUNTBITS32( bits ); |
|note: the lvalue argument always ends up with the result |
\***********************************************************************/
#define COUNTBITS32(b) (\
(b)=(((b)>> 1)&0x55555555)+((b)&0x55555555),\
(b)=(((b)>> 2)&0x33333333)+((b)&0x33333333),\
(b)=(((b)>> 4)&0x07070707)+((b)&0x07070707),\
(b)=(((b)>> 8)&0x000f000f)+((b)&0x000f000f),\
(b)=(((b)>>16)&0x0000001f)+((b)&0x0000001f),\
(b))

/***********************************************************************\
|macro: COUNTBITS64 |
|purpose: count the bits that are 1 in a 64 bit lvalue |
|usage: COUNTBITS64( bits ); |
| or: count = COUNTBITS64( bits ); |
|note: the lvalue argument always ends up with the result |
\***********************************************************************/
#define COUNTBITS64(b) (\
(b)=(((b)>> 1)&0x5555555555555555)+((b)&0x5555555555555555),\
(b)=(((b)>> 2)&0x3333333333333333)+((b)&0x3333333333333333),\
(b)=(((b)>> 4)&0x0707070707070707)+((b)&0x0707070707070707),\
(b)=(((b)>> 8)&0x000f000f000f000f)+((b)&0x000f000f000f000f),\
(b)=(((b)>>16)&0x0000001f0000001f)+((b)&0x0000001f0000001f),\
(b)=(((b)>>32)&0x000000000000003f)+((b)&0x000000000000003f),\
(b))

/***********************************************************************\
|macro: REVERSEBITS16 |
|purpose: reverse the bit order in a 16 bit lvalue |
|usage: REVERSEBITS16( bits ); |
| or: value = REVERSEBITS16( bits ); |
|note: the lvalue argument always ends up with the result |
\***********************************************************************/
#define REVERSEBITS16(b) (\
(b)=(((b)>> 1)&0x5555)|(((b)&0x5555)<< 1),\
(b)=(((b)>> 2)&0x3333)|(((b)&0x3333)<< 2),\
(b)=(((b)>> 4)&0x0f0f)|(((b)&0x0f0f)<< 4),\
(b)=(((b)>> 8)&0x00ff)|(((b)&0x00ff)<< 8),\
(b))

/***********************************************************************\
|macro: REVERSEBITS32 |
|purpose: reverse the bit order in a 32 bit lvalue |
|usage: REVERSEBITS32( bits ); |
| or: value = REVERSEBITS32( bits ); |
|note: the lvalue argument always ends up with the result |
\***********************************************************************/
#define REVERSEBITS32(b) (\
(b)=(((b)>> 1)&0x55555555)|(((b)&0x55555555)<< 1),\
(b)=(((b)>> 2)&0x33333333)|(((b)&0x33333333)<< 2),\
(b)=(((b)>> 4)&0x0f0f0f0f)|(((b)&0x0f0f0f0f)<< 4),\
(b)=(((b)>> 8)&0x00ff00ff)|(((b)&0x00ff00ff)<< 8),\
(b)=(((b)>>16)&0x0000ffff)|(((b)&0x0000ffff)<<16),\
(b))

/***********************************************************************\
|macro: REVERSEBITS64 |
|purpose: reverse the bit order in a 64 bit lvalue |
|usage: REVERSEBITS64( bits ); |
| or: value = REVERSEBITS64( bits ); |
|note: the lvalue argument always ends up with the result |
\***********************************************************************/
#define REVERSEBITS64(b) (\
(b)=(((b)>> 1)&0x5555555555555555)|(((b)&0x5555555555555555)<< 1),\
(b)=(((b)>> 2)&0x3333333333333333)|(((b)&0x3333333333333333)<< 2),\
(b)=(((b)>> 4)&0x0f0f0f0f0f0f0f0f)|(((b)&0x0f0f0f0f0f0f0f0f)<< 4),\
(b)=(((b)>> 8)&0x00ff00ff00ff00ff)|(((b)&0x00ff00ff00ff00ff)<< 8),\
(b)=(((b)>>16)&0x0000ffff0000ffff)|(((b)&0x0000ffff0000ffff)<<16),\
(b)=(((b)>>32)&0x00000000ffffffff)|(((b)&0x00000000ffffffff)<<32),\
(b))
--
/***********************************************************************\
| Phil Howard --- KA9WGN --- p...@netcom.com | "The problem with |
| depending on government is that you cannot depend on it" - Tony Brown |
\***********************************************************************/

Sean Eric Fagan

unread,

May 11, 1992, 8:43:58 PM5/11/92

to

In article <0#lk0=g....@netcom.com> p...@netcom.com (Phil Howard ) writes:
>It is just not in the nature of C to take any code like the above (which is
>NOT the only way accomplish this in C) and change it to an instruction.

"not in the nature of C"?

Huh?

Guess you haven't seen any good compilers, for any language.

Herman Rubin

unread,

May 12, 1992, 10:33:40 AM5/12/92

to

In article <1992May12....@kithrup.COM> s...@kithrup.COM (Sean Eric Fagan) writes:
>In article <0#lk0=g....@netcom.com> p...@netcom.com (Phil Howard ) writes:
>>It is just not in the nature of C to take any code like the above (which is
>>NOT the only way accomplish this in C) and change it to an instruction.

>"not in the nature of C"?

>Huh?

>Guess you haven't seen any good compilers, for any language.

I agree that a compiler CAN look for a specific coding scheme, even
with a FEW modifications permitted, but I doubt that much more can
be done in this direction, even with human interaction.

Instead of limiting computer users to the weak set of constructs of
C (or any other language), it is necessary to expand the set of constructs
so that the programmer can state what is wanted at least in a manner that
a bright human can recognize the substitutability of a machine procedure.

The programmer generally has an idea of what is to be accomplished, and
should be allowed to communicate this to the compiler or whatever. If
required to make a lengthy circumlocution, it may very well be almost
impossible for someone or something to decide what is wanted. How easy
would it be for someone to decide that a procedure to compute trigonometric
functions to thousands of digits, described in detail, could be replaced by
one which uses the complex AGM?
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hru...@pop.stat.purdue.edu (Internet, bitnet)
{purdue,pur-ee}!pop.stat!hrubin(UUCP)

Mark C. Carroll

unread,

May 13, 1992, 3:18:37 PM5/13/92

to

In article <48...@mentor.cc.purdue.edu> hru...@pop.stat.purdue.edu (Herman Rubin) writes:
>In article <1992May12....@kithrup.COM> s...@kithrup.COM (Sean Eric Fagan) writes:
>>In article <0#lk0=g....@netcom.com> p...@netcom.com (Phil Howard ) writes:
>>>It is just not in the nature of C to take any code like the above (which is
>>>NOT the only way accomplish this in C) and change it to an instruction.
>
>>"not in the nature of C"?
>
>>Huh?
>
>>Guess you haven't seen any good compilers, for any language.
>
>I agree that a compiler CAN look for a specific coding scheme, even
>with a FEW modifications permitted, but I doubt that much more can
>be done in this direction, even with human interaction.
>

Herman, why do you refuse to learn the slightest thing about optimization
before you babble???

The compilation process of an optimizer is typically the following:
1> Parse the course code, and translate it into an intermediate form.
2> Gather information on the intermediate form, by performing various
kinds of analysis.
3> Transform the intermediate form based on information gathered during
phase 2.
4> Translate intermediate code into assembly code.
5> Perform peephole optimizations on the generated assembly code.
(Peepholing consists of scanning short segments of assembly,
and determining if there is any code sequence with the same
semantic effect but fewer cycles to execute.)

The point of all this is: the constructs present in the language do
_not_ have a direct affect on the optimizations possible in the
language. You do _not_ have to write a specific statement sequence in
the source code in order to allow the compiler to perform a particular
optimization. I know that you're completely convinced that it's
impossible for a compiler to do optimization based on anything but
pattern matches - but you're wrong.

Semantic analysis allows the compiler to determine the effect of a
code sequence. Based on the semantic analysis, optizations are done.
The information generated from semantic analysis is _not_ limited to
information available from the source code syntax.

>Instead of limiting computer users to the weak set of constructs of
>C (or any other language), it is necessary to expand the set of constructs
>so that the programmer can state what is wanted at least in a manner that
>a bright human can recognize the substitutability of a machine procedure.

In general, when it comes time to optimize code written for any of the
high performance modern architectures (particularly parallel
processors), the code transformations that produce the greatest affect
make the code incomprehensible to a human reader. The compiler can
analyze the code very effectively, and perform transformations that
you couldn't. But the moment you're allowed to start mucking around
with low level details, you'll start to +degrade+ performance, by
making it impossible to perform many of the most significant code
optimizations.

Take, for example, a vectorizing compiler. You can gain _very_
significant improvements in performance by vectorizing array accesses
in loops. To do this, you generally have to reorder the statements
within a loop, and then split the loop into several loops, and then
convert some of the newly generated loops into vector instructions.
The ability to perform these transformations is strongly dependent on
the ability to determine the exact statement to statement dependences
in the original source code.

Now, suppose a hotshot programmer comes along, and says "Hey, if I
could use my nifty-neato-kean XYZ instruction here, it would _really_
speed things up". So, he uses the Rubin language to add the XYZ
instruction directly to the language. Now, the optimizer wasn't
designed for XYZ instructions in its intermediate code - so it's GOT
to assume the worst, and so the loops with an XYZ instruction suddenly
can't be vectorized.

In general, it's the responsibility of the system designer to ensure
that languages for that machine have the ability to perform the
correct optimizations to exploit the machine architecture. Allowing a
hotshot who doesn't understand how the compiler works to try to
hand-optimize is insane.

A modern compiler has to do a LOT of messy work. It's not just
generating the correct instructions anymore. Instruction scheduling is
very messy - do you _really_ want to have to start keeping track of
interlocks, delay slots, etc? Do you really want to have to do the
dependence analysis on a large program by hand? Do you _really_ want
to write programs that are unreadable, so that you can have the
ability to hand-tool them?

>The programmer generally has an idea of what is to be accomplished, and
>should be allowed to communicate this to the compiler or whatever. If
>required to make a lengthy circumlocution, it may very well be almost
>impossible for someone or something to decide what is wanted. How easy
>would it be for someone to decide that a procedure to compute trigonometric
>functions to thousands of digits, described in detail, could be replaced by
>one which uses the complex AGM?

If a programmer who needs to use the values of trig functions to 10
digits chooses to use a language which uses infinite-precision real
numbers, then that programmer is an idiot, and I won't spend any time
worrying about how innefficient er code is. Similarly, if some
programmer decides to write code using constructive real numbers in
Fortran, then I'm not going to feel badly that er implementation comes
out slow. A programmer is a craftsperson - and should choose the tools
that suit the job. If E's writing numerical code, let er use a
programming language intended for numerical computation, and which has
an optimizer that will do a good job on numerical code. If E chooses
to use a jackhammer to pound nails, or a screwdriver to break up
pavement, then E deserves to suffer.

<MC>
--
|| Mark Craig Carroll: <MC> || "Humanity isn't a physical description,
|| Univ of Delaware, Dept of CIS|| it's a spiritual goal. It's not something
|| Grad Student/Labstaff Hacker || you're given, it's something you earn."
|| car...@udel.edu || - _One_, by Richard Bach

Herman Rubin

unread,

May 14, 1992, 7:07:12 AM5/14/92

to

In article <1992May13.1...@udel.edu> car...@hercules.cis.udel.edu (Mark C. Carroll) writes:

[Invective deleted.]

>The compilation process of an optimizer is typically the following:
> 1> Parse the course code, and translate it into an intermediate form.
> 2> Gather information on the intermediate form, by performing various
> kinds of analysis.
> 3> Transform the intermediate form based on information gathered during
> phase 2.
> 4> Translate intermediate code into assembly code.
> 5> Perform peephole optimizations on the generated assembly code.
> (Peepholing consists of scanning short segments of assembly,
> and determining if there is any code sequence with the same
> semantic effect but fewer cycles to execute.)

I am quite aware of all this. I am also aware of what it cannot do
that I can do.

...................

HR>>Instead of limiting computer users to the weak set of constructs of
HR>>C (or any other language), it is necessary to expand the set of constructs
HR>>so that the programmer can state what is wanted at least in a manner that
HR>>a bright human can recognize the substitutability of a machine procedure.

.....................

>Now, suppose a hotshot programmer comes along, and says "Hey, if I
>could use my nifty-neato-kean XYZ instruction here, it would _really_
>speed things up". So, he uses the Rubin language to add the XYZ
>instruction directly to the language. Now, the optimizer wasn't
>designed for XYZ instructions in its intermediate code - so it's GOT
>to assume the worst, and so the loops with an XYZ instruction suddenly
>can't be vectorized.

So you are saying that the compiler can decide that the XYZ instruction
will speed up its code, but cannot handle statements by the programmer
which explicitly use the XYZ instruction?

>In general, it's the responsibility of the system designer to ensure
>that languages for that machine have the ability to perform the
>correct optimizations to exploit the machine architecture. Allowing a
>hotshot who doesn't understand how the compiler works to try to
>hand-optimize is insane.

So the programmer is limited to the constructs the language designers
thought about? Why is exponentiation not in C? It is in the far older
Fortran. The pow() function in C does not do the same thing in many
cases.

......................

HR>>The programmer generally has an idea of what is to be accomplished, and
HR>>should be allowed to communicate this to the compiler or whatever. If
HR>>required to make a lengthy circumlocution, it may very well be almost
HR>>impossible for someone or something to decide what is wanted. How easy
HR>>would it be for someone to decide that a procedure to compute trigonometric
HR>>functions to thousands of digits, described in detail, could be replaced by
HR>>one which uses the complex AGM?

>If a programmer who needs to use the values of trig functions to 10
>digits chooses to use a language which uses infinite-precision real
>numbers, then that programmer is an idiot, and I won't spend any time
>worrying about how innefficient er code is. Similarly, if some
>programmer decides to write code using constructive real numbers in
>Fortran, then I'm not going to feel badly that er implementation comes
>out slow. A programmer is a craftsperson - and should choose the tools
>that suit the job. If E's writing numerical code, let er use a
>programming language intended for numerical computation, and which has
>an optimizer that will do a good job on numerical code. If E chooses
>to use a jackhammer to pound nails, or a screwdriver to break up
>pavement, then E deserves to suffer.

So someone doing numerical analysis can get by with 10 digit accuracy?
Maybe and maybe not. The point which seems to be missed is that the
language may not have the tools in its toolbox. Now ANY programming
language I have seen is adequate to carry out clumsily any job.

I intend to post a moderate-sized program which I do not believe any
optimized can see what is needed on a vector machine. But here is
a simple function computation; see if you can find an algorithm to
do it reasonably efficiently.

We want to compute a function f such that, if x is represented in
the usual manner in binary: x = Sum a_n*2^(-n), the a's being 0
or 1, then f(x) = Sum n*a_n*2^(-n).

This function does occur naturally, and is not merely contrived
to discombobulate.

Ted Dunning

unread,

May 15, 1992, 5:42:17 PM5/15/92

to

damn.... somebody mentioned population count and find first
instructions and not surprisingly, herman rubin starts popping up and
yammering about what a compiler should do even though he hasn't a clue
how to actually do it.

hope he disappears soon so i don't feel compelled to unsubscribe.

p.s. kill files don't work on herman. part of what i want to avoid is
all of the refutations he inspires.

Bill Stewart 908-949-0705 erebus.att.com!wcs

unread,

May 17, 1992, 12:31:30 AM5/17/92

to

In article <1992May12....@kithrup.COM> s...@kithrup.COM (Sean Eric Fagan) writes:
In article <0#lk0=g....@netcom.com> p...@netcom.com (Phil Howard ) writes:
>It is just not in the nature of C to take any code like the above (which is
>NOT the only way accomplish this in C) and change it to an instruction.
"not in the nature of C"?
Huh?

A way that IS in the nature of C is to write code like

#ifdef COMPILER_USES_BITCOUNT_INSTRUCTION
b = _BITCOUNT_INSTRUCTION(n) ;
#else
/* normal code */
... etc. ...
#endif

Or, if you have ANSI C, do something with #pragma, and hope your
compiler doesn't mind.
--
Pray for peace; Bill
#Bill Stewart +1-908-949-0705 erebus.att.com!wcs AT&T Bell Labs 4M312 Holmdel NJ
# I'm trying to get sendmail working right; apologies if mail to me bounces.