Population Count -- tell me how many 1-bits are in this word
Find First One (or Zero) -- can speed up emulation
of popcount on machines without it
I have gotten the impression that the Pop Count instruction has been
added to more than one architecture upon direct request from the US National
Security Agency, because it apparently occurs frequently in their inner
loops, but is slow to emulate in software (table lookup on parts of the word,
followed by addition of the table entries, seems best to me).
An informal question at last week's Asilomar Microcomputer Workshop
showed 5 or 10 hands when asked `Who has designed an architecture that
had one of these instructions added to it'. Now I'm looking for actual
information on *which* architectures and *why* the instructions were
added. (I wouldn't expect a RISC arch to have this in the first
rev, due to low frequency of use, but once NSA decided it might be
worth buying the machine, then it might appear in the second rev.)
It's been guessed that the reason they use it so much is for correlation,
e.g. xor two things and count the one-bits and you see roughly how close
they are. Any other ideas?
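In C, that correlation measure is just a Hamming distance. A minimal
sketch, with bits() standing in as a portable substitute for the
missing instruction:

/* XOR, then count ones: 0 means identical words, and a count near
   the word size means the words are near-complements. */
static unsigned bits(unsigned long n)
{
    unsigned c = 0;
    while (n) { n &= n - 1; c++; }   /* clear lowest set bit each pass */
    return c;
}

unsigned hamming(unsigned long a, unsigned long b)
{
    return bits(a ^ b);
}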
--
John Gilmore {sun,uunet,pyramid}!hoptoad!gnu g...@toad.com g...@cygnus.com
"It isn't given to us to know those rare moments when people
are wide open and the lightest touch can wither or heal."
More than once I've needed this instruction (Population Count) for
text pattern matching algorithms. But I've never encountered a machine
that supported it, and doing table lookup on parts of the word is too
costly.... Then you have the problem of how to specify in C that you
want to use this instruction....
Chris Perleberg
pch...@dcc.uchile.cl
The only way in which the population count and find first instructions might
not be considered RISC-like is their frequency of occurrence, following the
principle of "most bang for the chip real estate used". Of course, that
principle is application dependent: if you're making processors for a
particular domain in which Pop Count / Find First might be important, then
perhaps they should go in. In fact, the extra logic in a functional unit may
not be that bad. The consumption of opcodes, and the possibility that these
instructions might be harder than most to make execute in a single short clock
cycle, might argue against including them, especially if the issue is
scanning a large array, where the cycles used in a reasonable loop body might
be comparable to or less than the load and store costs on a cache miss. Just
a thought, anyway.
--
J. Eliot B. Moss, Assistant Professor
Department of Computer Science
Lederle Graduate Research Center
University of Massachusetts
Amherst, MA 01003
(413) 545-4206, 545-1249 (fax); Mo...@cs.umass.edu
Every supercomputer I've seen had it, the Cyber being the one I'm most
familiar with. The impression I have, however, is that the 6000 did not
have a pop count instruction (it didn't even have parity!), but that both it
and parity were added shortly thereafter (because 'farmers buy a lot of computers'
8-)).
>Then you have the problem of how to specify in C that you
>want to use this instruction....
This is the easy part. First, actual FORTRAN code to do a pop count on an
integer on a Cyber:
COUNT = POP(I)
To do it in C, the obvious code would be
count = pop(i);
(Oh, it might be ifdef'd in, and/or define'd, and/or something like
__builtin_pop, etc. But the idea is about the same.)
I'm not aware of any compilers that will take code and replace it with a pop
count instruction. (Might be interesting to ask in comp.compilers, though.)
--
Sean Eric Fagan | "One form to rule them all, one form to find them, one
s...@kithrup.COM | form to bring them all and in the darkness rewrite the
-----------------+ hell out of them" -- sendmail ruleset 3 comment from DEC.
Any opinions expressed are my own, and generally unpopular with others.
PDP-10's (and -20's) had a JFFO - jump if find first one - and VAX's have
a FFS/FFC - find first set/clear. Both were done to facilitate searching
of bit vectors used as allocation maps for disk or memory blocks. Neither
had population count, although recent VAX's have had such a facility
at the micro-machine level to facilitate counting the number of registers
saved (and therefore the size of the stack frame needed) by a CALLx
instruction.
Cray Research systems have both, certainly in the YMP and C90, possibly
earlier.
Bob Supnik >Sup...@human.enet.dec.com
>All opinions expressed are those of a hardline microcoder
>and do not reflect those of Digital Equipment Corporation
The Philco 2000 machines (I'm REALLY old) had JumpQNEGative and
POSitive instructions, useful for doing both operations quickly:
each tested the sign bit of the Q register AND shifted it left one
bit. They were heavily pipelined machines, optimized so that short
loops using these instructions were very (we're talking early
60s here) fast.
(Sarr Blumson)
Check out the Crays for the Pop count instruction. Several in years past
had a shift-searching or some such instruction that would count leading
zeroes or ones, depending on how you coded it. Haven't assembly-programmed
a VAX in 8 years or so, but I seem to recall such an instruction was the
reason the current Unix kernel uses its 32 run queues, each marked by a
single bit to indicate that the queue is non-empty. (Each queue is 4 or more
priority classes lumped together; you test a bit rather than testing 4
separate linked lists, and you don't have to test each bit sequentially:
count the leading zeroes, add one, and that is the run queue that is
non-empty.)
--
!Robert Hyatt Computer and Information Sciences !
!hy...@cis.uab.edu University of Alabama at Birmingham !
"Get least significant one" instruction can be easily simulated as
bitno = table[(n&~(n-1)) % 37]
(modulo 19 or 67 for 16 and 64 bit words correspondinly).
An interesting fact is that this consturction works FASTER than
80386 built-in BSR (bit scan reverse) instruction in the worst case.
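For concreteness, here is a sketch of the trick for 32-bit words; the
names are mine, and the table is filled in once at startup:

static int lsb_table[37];                /* 37 slots; 5 are never hit */

void init_lsb_table(void)
{
    int k;
    for (k = 0; k < 32; k++)
        lsb_table[(1UL << k) % 37] = k;  /* powers of 2 are distinct mod 37 */
}

int lsb(unsigned long n)                 /* undefined for n == 0 */
{
    return lsb_table[(n & ~(n - 1)) % 37];
}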
There are also a number of ways to count 1s in a word -- say,
by a sequence of shifts, ANDs and ADDs. The best result is something like
15 trivial instructions (can't remember it right now). Anyway,
it's nearly as efficient as hardware implementations.
The instructions I really miss are the bit composition/decomposition
instructions (BESM-6 sborka/razborka). These commands did the following
transformations:
(EXAMPLE)
decomposition:
OP1 A B C D E F G H I J K L M N O P
OP2 0 0 1 1 1 0 1 0 0 0 0 0 1 1 0 1
-> RES 0 0 0 0 0 0 0 0 0 C D E G M N P
composition:
OP1 A B C D E F G H I J K L M N O P
OP2 0 0 1 1 1 0 1 0 0 0 0 0 1 1 0 1
-> RES 0 0 J K L 0 M 0 0 0 0 0 N O 0 P
where A-P are some bits.
As you can see, these instructions are mutually reversible, i.e.
decomp(comp(X, M), M) == X & M'
comp(decomp(X, M), M) == X & M
(M' is the word with number_of_ones(M) least significant bits set).
Note that these instructions include logical shifts combined
with conjunctions as particular cases. They were beautiful
for extracting characters from the BESM-6's 48-bit words.
There also were interesting applications in digital search algorithms and
(i believe) in cryptography.
Does anybody know an efficient way to simulate them using vanilla
instruction set? I'd really appreciate it.
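For reference, the obvious bit-at-a-time emulation (not the efficient
answer Vadim is after, but it pins down the semantics):

unsigned long decomp(unsigned long x, unsigned long m)   /* razborka */
{
    unsigned long res = 0, bit = 1;
    for (; m != 0; m &= m - 1) {     /* walk set mask bits, low to high */
        if (x & (m & ~(m - 1)))      /* test x at lowest remaining mask bit */
            res |= bit;
        bit <<= 1;
    }
    return res;
}

unsigned long comp(unsigned long x, unsigned long m)     /* sborka */
{
    unsigned long res = 0, bit = 1;
    for (; m != 0; m &= m - 1) {
        if (x & bit)
            res |= m & ~(m - 1);     /* deposit at lowest remaining mask bit */
        bit <<= 1;
    }
    return res;
}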
Vadim Antonov
Berkeley Software Design, Inc.
The CDC 6000 mainframes had a popcount instruction, and a FP normalize
instruction which could find the first 1 in the lower 48 bits
of a 60-bit word. (The result going to a B register.)
Back when I was learning this (mid-70's),
we were told that the popcount had been added to the architecture
(apparently right from the start) by the request of the Government,
possibly for crypto work. But that may have been just a rumor.
I seem to recall that the CDC 6600 was announced in 1964, and
probably first delivered around '66.
Mark Riordan m...@scss3.cl.msu.edu
-------------------------------------------------------------------------
/*
* This function counts the number of bits in a long.
* It is limited to 63 bit longs, but a minor mod can cope with 511 bits.
*
* It is so magic, an explanation is required:
* Consider a 3 bit number as being
* 4a+2b+c
* if we shift it right 1 bit, we have
* 2a+b
* subtracting this from the original gives
* 2a+b+c
* if we shift the original 2 bits right we get
* a
* and so with another subtraction we have
* a+b+c
* which is the number of bits in the original number.
* Suitable masking allows the sums of the octal digits in a
* 32 bit number to appear in each octal digit. This isn't much help
* unless we can get all of them summed together.
* This can be done by modulo arithmetic (sum the digits in a number modulo
* the base of the number minus one), the old "casting out nines" trick they
* taught in school before calculators were invented.
* Now, using mod 7 won't help us, because our number will very likely have
* more than 7 bits set. So add the octal digits together to get base64
* digits, and use modulo 63.
* (Those of you with 64 bit machines need to add 3 octal digits together to
* get base512 digits, and use mod 511.)
*
* This is HACKMEM 169, as used in X11 sources.
* Source: MIT AI Lab memo, late 1970's.
*/
int bitcnt(n)
register unsigned long n;
{
register unsigned long tmp;
tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);
return ((tmp + (tmp >> 3)) & 030707070707) % 63;
}
--
>It occurred to me to wonder just how many architectures have the following
>instructions:
> Population Count -- tell me how many 1-bits are in this word
> Find First One (or Zero) -- can speed up emulation
> of popcount on machines without it
Find First One is a basic part of doing floating point arithmetic
in software, and is useful when looking for free space in bitmapped
disk allocation tables. Some CPUs used to do this operation by
using the floating point unit's normalizer, but that is rare today.
Population Count is said to be useful in computer chess.
Control Data (remember them?) actually had a "Director of Chess" at
one point, back when only supercomputers could play a decent game.
What does NSA want in a CPU? Considerable information is
available, though dated. NSA had IBM construct a special cryptographic
processing unit for its Stretch machine back in the late 1950s. This became
the Harvest machine. Harvest was a Stretch without the floating point
unit but with a "streaming" unit designed to NSA specs. The
streaming unit could convoke two operand strings by making
repetitive memory fetches, perform tasks repetitively on operand pairs,
and return a string of statistics to memory. The command set included
matching, substitution by table lookup, tallying the occurrence of specified
logical relationships, and other operations of interest in cryptanalysis.
Stretch/Harvest was the first byte-oriented machine, and one of the first
pipelined CPU machines. These machines had a slow memory for their
CPU speed and a large (over 700 instructions) instruction set, so that
a single instruction could do as much work as possible without using
up memory bandwidth reading instructions. (The Stretch instruction
set was considered oversized, even by its designers.)
Harvest also had the first automated tape library ("Tractor").
Harvest was installed in February 1962 and operated for fourteen years.
There were eight Stretch machines built with floating point units, but only
one Harvest. Stretch machines started around $13 million, and the
Harvest machine was the biggest one ever built.
Although the Stretch machines were not a commercial success, they
established much of the architecture that later appeared in the IBM
System 360 (and lives on to this day). So the 360 family is the place
to look for Harvest influences. 360s include a "Translate" and
"Translate and Test" instruction, which may well be descended from
the Harvest machine's streaming unit. Translate and Test reads a
string of bytes, looking up each byte in a table and replacing it with
the byte in the table, stopping when a byte with a specific value is
found or when the count of bytes to process runs out. This is not
an instruction of general utility. Every 360 and 370 had it,
although I think some emulated it in software. That may well be
a Harvest-inspired feature.
360s don't have population count, though. So that's probably not an
NSA-inspired innovation. UNIVAC and Burroughs mainframes didn't have
population count through at least the 1980s. The Cray-1 was the
first machine of which I am aware that had it.
The Cray-1 has a minimalist instruction set. (In fact, the Cray-1
is a simple computer; it just has a lot of copies of key elements.)
If the Cray-1 had an instruction, it was there for a good reason.
It was designed by a very small team, headed by Seymour Cray.
So Seymour Cray is probably the one to ask on population count.
I suspect that other CPU designers are copying the Cray when they include
population count.
John Nagle
See "Computer Advances Pioneered by Cryptological Organizations",
by S.S. Snyder, 1980, Annals of the History of Computing, pp. 66-70., and
"IBM's Early Computers", by C.J. Bashe, MIT Press, 1986.
The CM-5 supports both. Can't say what the main reason was.
JohnG -- Eagerly awaiting a CM-5 to arrive here in December!!!!!
--
John A. Gregor USmail: College of Oceanography
E-mail: jo...@oce.orst.edu Oregon State University
Voice #: +1 503 737-3022 Oceanography Admin Bldg. #104
Fax #: +1 503 737-2064 Corvallis, OR 97331-5503
> Population Count -- tell me how many 1-bits are in this word
>I have gotten the impression that the Pop Count instruction has been
>added to more than one architecture upon direct request from the US National
>Security Agency, because it apparently occurs frequently in their inner
>loops, but is slow to emulate in software (table lookup on parts of the word,
>followed by addition of the table entries, seems best to me).
There was a discussion about similar instructions that are
(apparently) easy to do in hardware and slow to do in software.
The outcome of the discussion was that some of the things that would
seem to require a number of instructions that is linear in the number
of bits in a word can, with some thought, be done in a logarithmic
number of instructions. The PopCount instruction is one of these. Here
it is written in ARM assembler for 32 bit words:
\ Assume Rm1 contains 10101010101010101010101010101010 (binary),
\ Rm2 contains 11001100110011001100110011001100,
\ Rm3 contains 11110000111100001111000011110000,
\ Rm4 contains 11111111000000001111111100000000,
\ Rm5 contains 11111111111111110000000000000000,
\ R0 contains word to be PopCounted
\ At exit R0 contains PopCount, R1 is destroyed
.popcount
AND R1,R0,Rm1 \ use mask 1 to zero all even bits
SUB R0,R0,R1 \ R0 contains the even bits
ADD R0,R0,R1, LSR #1 \ R0 contains PopCount of every bit pair
AND R1,R0,Rm2 \ use mask 2 to zero all even bit pairs
SUB R0,R0,R1 \ R0 contains the even bit pairs
ADD R0,R0,R1, LSR #2 \ R0 contains PopCount of every nibble
AND R1,R0,Rm3 \ use mask 3 to zero all even nibbles
SUB R0,R0,R1 \ R0 contains the even nibbles
ADD R0,R0,R1, LSR #4 \ R0 contains PopCount of every byte
AND R1,R0,Rm4 \ use mask 4 to zero all even bytes
SUB R0,R0,R1 \ R0 contains the even bytes
ADD R0,R0,R1, LSR #8 \ R0 contains PopCount of every halfword
AND R1,R0,Rm5 \ use mask 5 to zero LS halfword
SUB R0,R0,R1 \ R0 contains the LS halfword
ADD R0,R0,R1, LSR #16 \ R0 contains PopCount of full word
This takes 15 cycles for a 32 bit word. Doing it for a 64 bit word (on
a 64 bit processor) would only require 3 more cycles (and one more
mask word). By using immediate addressing, the last mask word can be
replaced by a constant (since at most 5 bits can be set in the MS
halfword).
OK, 15 cycles is more than a hardware solution would need, but your
program would have to use PopCount a lot to justify the hardware. Using
the find-first-1-bit instruction would tend to make PopCount slower
than the above, unless you have very sparse bitstrings. I assume that
cryptographic processors implement PopCount and possibly bit-reverse
and similar instructions in hardware.
Torben Mogensen (tor...@diku.dk)
P.S.
This is a modified version of an article, which I cancelled and
replaced with this one. There was an error in the program in the first
posting.
]/*
] * This function counts the number of bits in a long.
] * It is limited to 63 bit longs, but a minor mod can cope with 511 bits.
[ Explanation deleted - see previous article ]
] */
]int bitcnt(n)
]register unsigned long n;
]{
] register unsigned long tmp;
]
] tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);
] return ((tmp + (tmp >> 3)) & 030707070707) % 63;
]}
Correct me if I'm wrong, but mod 63 implies a division by 63, doesn't
it? And division is still pretty expensive, more so than doing a
population count directly by shift-add-mask, isn't it? So where's the
savings?
- Steve (mast...@cs.rutgers.edu)
One thing to keep in mind is that these bit instructions are very
useful on vector architectures with a vector mask register (e.g.
Cray). This may be one of the reasons they have been on so many
supercomputers.
For example, if you want to find the first element in an array passing
a logical test, the find-first-one instruction on the vector mask will
give you the index you need. More generally, if you want to execute a
scalar loop over all the elements of a vector register passing a
logical test, find-first-one along with shift instructions gives you a
way to do it in time proportional to the number of elements passing
the test rather than the total number of elements in the vector
register.
Popcount on the vector mask is often used along with the
compress-index (enumerate vector-mask) instruction when packing out
unwanted elements of a vector register; the popcount is needed to
determine how many elements are selected and set the vector register
length.
== Marco
Internet: mar...@cs.cmu.edu Uucp: ...!seismo!cs.cmu.edu!marcoz
Bitnet: marcoz%cs.cmu.edu@carnegie CSnet: marcoz%cs.cm...@relay.cs.net
>Every supercomputer I've seen had it, the Cyber being the one I'm most
>familiar with. The impression I have, however, is that the 6000 did not
>have a pop count instruction (it didn't even have parity!), but that both it
>and parity were added shortly (because 'farmers buy a lot of computers'
The 6600 certainly had population count; I used it. Note that it was about
the SLOWEST instruction in the whole instruction set. I'll have to dig out
the COMPASS manuals when I get home and find out just HOW slow.
Parity was never added to the 6000 line, that came with the 7000 line.
Why did I use it? Well, I could tell you but I'd have to kill you
afterwards...
greg
--
Gregory Reed Travis D P S I
Data Parallel Systems Incorporated gr...@dpsi.com (For MX mailers only!)
Bloomington, IN gr...@cica.indiana.edu (For the others)
a % 63 can be implemented with a few % 64's, >>= 6's, and some glue.
The cost of this would have to be included in the evaluation, of course.
Also remember that the value of 3 in the original (3 as in octal) isn't
magic -- one would want to evaluate a few values.
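Spelled out, one such implementation (a sketch only; since 64 is
congruent to 1 mod 63, folding base-64 digits together preserves the
residue):

unsigned mod63(unsigned x)
{
    while (x > 63)
        x = (x & 63) + (x >> 6);   /* x%64 + x/64, congruent mod 63 */
    return x == 63 ? 0 : x;        /* 63 itself folds to 0 */
}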
--
Frederick G. M. Roeber | CERN -- European Center for Nuclear Research
e-mail: roe...@cern.ch or roe...@caltech.edu | work: +41 22 767 31 80
> Find First One is a basic part of doing floating point arithmetic
>in software, and is useful when looking for free space in bitmapped
>disk allocation tables. Some CPUs used to do this operation by
>using the floating point unit's normalizer, but that is rare today.
FFO is also useful in applications that do their own priority-
based scheduling. You have a list of prioritized processes
represented by a bit string, along with a table of process
addresses. When you need to schedule a process, set its bit in
the bit string (having an instruction that can set or clear
individual bits based upon their offset in the word also helps).
Then, when it comes time to run a new process, use FFO to find
the first bit in the bit string (the highest priority process
that needs to run). The returned offset is used to index into
the table of process addresses.
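A sketch of that dispatcher in C, with ffo() standing in for the
hardware instruction and bit 31 taken as highest priority (all the
names here are made up):

extern void (*proc_table[32])(void); /* handler address per priority bit */
static unsigned ready;               /* bit set => process needs to run */

static int ffo(unsigned x)           /* software stand-in; x must be != 0 */
{
    int n = 31;
    while (!(x & (1u << n)))
        n--;
    return n;
}

void dispatch(void)
{
    while (ready) {
        int p = ffo(ready);          /* highest-priority ready process */
        ready &= ~(1u << p);
        proc_table[p]();             /* index into the address table */
    }
}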
Andy Malis <ma...@bbn.com> UUCP: {harvard,rutgers,uunet}!bbn!malis
Our C and Fortran compilers use the popcount instructions in a number of
different ways. The one that comes to mind first is to find the vector
length to use for code under an IF. We also use the C _popcnt intrinsic
in the compiler in bitvector manipulation and in computing GCD's, and
probably in other places. I doubt if the NSA had any influence on this ;-)
>Here is the magic routine, transcribed from DEC-10 assembler into C
>source code. This subject comes up every year in comp.lang.c or
>elsewhere.
>-------------------------------------------------------------------------
>/*
> * This function counts the number of bits in a long.
> * It is limited to 63 bit longs, but a minor mod can cope with 511 bits.
> *
> * It is so magic, an explanation is required:
> * Consider a 3 bit number as being
> * 4a+2b+c
> * if we shift it right 1 bit, we have
> * 2a+b
> * subtracting this from the original gives
> * 2a+b+c
> * if we shift the original 2 bits right we get
> * a
> * and so with another subtraction we have
> * a+b+c
> * which is the number of bits in the original number.
[... then you add adjacent octal digits, then sum 6-bit digits by
"casting out 63s" ...]
On many machines the modulo 63 operation is going to be rather slow.
Unfortunately, this algorithm cannot be applied repetitively to
combine octal digits, as it only works in binary (it requires
1 == base-1).
A somewhat different (but related) algorithm:
To add the digits in an N-digit number in base B, add all adjacent
pairs of digits, forming an N/2-digit number in base B^2, and then add
the digits in the resulting base B^2 number.
I.e. for 16 bits:
x = (x&0x5555) + ((x&0xAAAA) >> 1);
x = (x&0x3333) + ((x&0xCCCC) >> 2);
x = (x&0x0F0F) + ((x&0xF0F0) >> 4);
x = (x&0x00FF) + (x >> 8); /* or x = x % 255 */
This takes 4*log2(N) - 1 instructions, where N is the word length in
bits. (3*log2(N)-1 if you have a barrel shifter)
You can apply the modulus trick after 3 steps, so the answer won't
overflow. However, it may not be quicker - I've tested the two
approaches on a decstation 2100 with 32-bit words, and even though
casting out 255s saves 6 instructions, it runs slower.
Note that you could combine these two algorithms - for instance, you
could do 48 bits in 5 steps by using the HAKMEM trick to get 16 octal
counts, then using the binary algorithm above to combine them.
Another observation is that divide-and-conquer algorithms like this,
using shifts and masks, can be used for a number of other bitwise
operations, such as bit reversal:
x = ((x&0x5555)<<1) | ((x&0xAAAA) >> 1);
x = ((x&0x3333)<<2) | ((x&0xCCCC) >> 2);
...
or finding highest bit set:
if (x & 0xFF00) {n += 8; x &= 0xFF00}
if (x & 0xF0F0) {n += 4; x &= 0xF0F0} /* or n += table[x%255] */
...
If you have a fast modulus operator, you could take x after the first
step above, cast out 255s, and then look up in a 256-element table to
get the rest of the answer.
Peter Desnoyers
--
From the transputer databook (1st edition, 1989)
IMS T800
OP code Mnemonic Proc. cyc. Name
------- -------- ---------- ----
76 bitcnt b + 2 count bits set in word
77 bitrevword 36 reverse bits in word
78 bitrevnbits n + 4 reverse bottom n bits in word
b = Bit number of the highest bit set in register A. Bit 0 is the least
significant bit.
n = Number of places shifted
Proc. cyc. refers to the number of periods TPCLPL taken by an instruction
executing in internal memory.
Robert Stewart
University of Texas at Austin
-- Jerry
/*
* Macros to compute a bit value from a bit number, bit number from bit value,
* and the bit count (number of bits on) for 32-bit quantities. The results
* are compile-time constants if the arguments are.
*
* The definition of BITCOUNT is based on an algorithm condensed from one
* extracted from a graph coloring register allocator which manipulates
* bitvectors. It was inspired by an analogous algorithm for counting bits in
* 36 bit words in MIT's HAKMEM. The first VAX version was designed by Chris
* Smith and written by Jane Hoffman. The original conversion to a form
* suitable for C on a VAX was by Brad Merrill; that code was fixed and
* re-written by Jerry Leichter. It assumes a 32-bit word. Note that the
* argument is evaluated multiple times, so should not have side-effects. A
* full explanation of the algorithm is too long to include here; in summary,
* BX_(x) replaces each 4-bit sequence in x by the number of bits in that
* sequence (which will, of course, fit in 4 bits). BITCOUNT folds the 4-bit
* sequences into 8-bit sequences. The 8-bit sequences are then added up by
* recognizing that they can be viewed as a base 256 number, and the sum of
* the digits of a base b number is the same mod b-1 as the number mod b-1.
*
* The original implementation produced results in groups of 3, then 6 bits.
* While natural on a PDP-10's 36-bit word, this requires special case
* handling of the sign bit on a VAX, since the top 3-bit sequence is really
* only 2 bits long.
*/
#define BITVAL(n) (1 << (n))
#define BITNUM(b) BITCOUNT((b)-1)
#define BITCOUNT(x) (((BX_(x)+(BX_(x)>>4)) & 0x0F0F0F0F) % 255)
#define BX_(x) ((x) - (((x)>>1)&0x77777777) \
- (((x)>>2)&0x33333333) \
- (((x)>>3)&0x11111111))
It was reasonably fast on the 6600 at 8 cycles. For comparison, an add
or shift took 3 cycles and a floating divide took 29.
But it was the slowest instruction, at 68 cycles, on the CDC 6400.
A shift or add took 6 cycles there, and a floating divide took 57.
Gregg Townsend / Computer Science Dept / Univ of Arizona / Tucson, AZ 85721
+1 602 621 4325 g...@cs.arizona.edu 110 57 16 W / 32 13 45 N / +758m
CLZ (Count Leading Zeros) exists in the Am29000 RISC processor
architecture, although it has nothing to do with NSA. The two main
uses are:
1) speeding up the renormalization stage of the software floating
point emulation routines
2) quick prioritization of external inputs (e.g. for choosing an
interrupt source from a set that are all connected to the same
interrupt request line)
--
Tim Olson
Advanced Micro Devices
t...@amd.com
You're confusing TRT and the Vax MOVTUC instructions here. TRT is
actually fairly useful. It scans a string looking up each byte in the
translation table, until it finds a non-zero translation. At that point
it stores the address and translated value of that byte in registers 1 and
2, but it never modifies memory. It's actually a reasonable way to do
lexical scanning with a state machine, using zero table entries to mean
stay in the same state and non-zero entries to indicate the next state to
move to. The designers probably didn't think of it in quite that way, but
rather for the special case of looking for delimiters.
TR, which reads each byte from a string and replaces it by looking it up
in a translation table, is useful both for the obvious job of translating
among character codes, most often ASCII and EBCDIC, and for some cute
tricks like reversing the byte order in a word and as part of a two
instruction sequence that turns a binary field into printable hex
characters.
TR and TRT, like all 360 string instructions, use a fixed string length no
greater than 256, so you have to write complex loops using "execute" to
handle strings of variable length or longer than 256. TR also has serious
performance problem because of the way it interacts with the 370's virtual
memory. A TR either has to execute completely or not at all, because the
370 has no way to store the state of a partially executed instruction
except for a few "long string" instructions added to the architecture later.
But a TR may cause up to 6 page faults, for the instruction itself, the
input string, and the translation table, each of which may span a page
boundary. Only the translation table entries that are actually used have
to be resident, and the only way to find out which ones will be used is to
do the translation. So each TR is executed twice, a preliminary trial
execution that doesn't store anything to see if it can complete, and a
second real execution that actually stores the results if all the pages
are present. TRT doesn't have this problem. Since it doesn't modify
memory and only stores some condition code and register values when it's
done, the CPU can easily abort it if there's a page fault half way
through.
The VAX MOVTUC does indeed move translated bytes until you hit a specific
value, which seems useful mostly to implement the C strcpy() and strncpy()
routines, using a degenerate table that translates each byte to itself and
an escape value of 0 to stop at the string's terminating null byte. I
expect it's probably one of those Vax instructions that's slower than the
equivalent simple instructions.
Regards,
John Levine, jo...@iecc.cambridge.ma.us, {spdcc|ima|world}!iecc!johnl
>I'm not aware of any compilers that will take code and replace it
>with a pop count instruction. (Might be interesting to ask in
>comp.compilers, though.)
gcc will do very nice inlining of machine language. You can even
write your code to interact with the register allocator nicely.
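For instance (a sketch only; "popc" is a made-up mnemonic, and the
constraint syntax is gcc's extended asm):

static __inline__ unsigned popcount(unsigned x)
{
    unsigned r;
    /* "=r" binds r to an output register, "r" binds x to an input
       register; the register allocator picks both. */
    __asm__("popc %1,%0" : "=r" (r) : "r" (x));
    return r;
}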
Suppose the word has already been reduced to per-byte bit counts, so
that in hex it looks like
0p0q0r0s for some digits p, q, r, and s
Now multiply this times 01010101. If we write it out long form as in
grade school, we get the following:
0p0q0r0s
x 01010101
--------
0p0q0r0s
00000000
0p0q0r0s
00000000
0p0q0r0s
00000000
0p0q0r0s
00000000
---------------
zzyyxxwwvvuutt0s
Note that vv is the sum of the digits that we originally desired. Also note
that there can be no overflow into or out of vv, by design. Further, we do not
even need the double length result, so long as we can get the correct low
order bits.
Whether this is faster than other techniques depends on your hardware,
of course, but I thought folks might find it interesting.
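In C, the whole recipe comes to something like this sketch (assuming a
32-bit unsigned; the final shift extracts the vv byte):

unsigned bitcount(unsigned x)
{
    x = x - ((x >> 1) & 0x55555555);                /* 2-bit counts    */
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333); /* 4-bit counts    */
    x = (x + (x >> 4)) & 0x0F0F0F0F;                /* per-byte counts */
    return (x * 0x01010101) >> 24;                  /* multiply sums the bytes */
}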
--
J. Eliot B. Moss, Assistant Professor
Department of Computer Science
Lederle Graduate Research Center
University of Massachusetts
Amherst, MA 01003
(413) 545-4206, 545-1249 (fax); Mo...@cs.umass.edu
Pop Count and Find First are also in the CM-5 vector instruction set.
--
Alan E. Klietz
Minnesota Supercomputer Center, Inc.
1200 Washington Avenue South
Minneapolis, MN 55415
Ph: +1 612 626 1737 Internet: al...@msc.edu
>This is the easy part. First, actual FORTRAN code to do a pop count on an
>integer on a Cyber:
> COUNT = POP(I)
>I'm not aware of any compilers that will take code and replace it with a pop
>count instruction. (Might be interesting to ask in comp.compilers, though.)
CDC's Fortran compiler did, starting with FTN version 4, I believe. It
had a whole bunch of "intrinsic" functions, such as SHIFT, COUNT, etc.,
that would turn into the appropriate in-line machine instructions.
Steve Jay
s...@ultra.com ...ames!ultra!shj
Ultra Network Technologies / 101 Dagget Drive / San Jose, CA 95134 / USA
(408) 922-0100 x130 "Home of the 1 Gigabit/Second network"
Once again, I am not aware of any compilers which will take code, such as
int bitcnt(n)
register unsigned long n;
{
register unsigned long tmp;
tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);
return ((tmp + (tmp >> 3)) & 030707070707) % 63;
}
and turn it into a pop count instruction.
If I'd meant built-in functions that were, essentially, inline assembly, I
would not have brought up the example of the Cyber FORTRAN compiler and
COUNT(), would I?
--
Sean Eric Fagan | "One form to rule them all, one form to find them, one
s...@kithrup.COM | form to bring them all and in the darkness rewrite the
-----------------+ hell out of them" -- sendmail ruleset 3 comment from DEC.
Any opinions expressed are my own, and generally unpopular with others.
Gee, that complaint sure sounds familiar. As I've said many times
before, the compiler for a machine with such an instruction should
recognize at least one idiom for that instruction---and should
*document* the idiom so that people can use it. (As Hugh LaMaster
observed last time we had this discussion, this is exactly what vector
compilers do already.) With this and judicious #ifdefs you can get code
which is fast for any machine.
It would be even better if the language had a ``quickpick'' structure
(as Q does) to let the compiler choose the fastest code out of several
alternatives provided by the programmer. That way you could write out
several idioms for a critical operation like bit counting. Your code
would run quickly on any machine whose compiler recognized one of the
idioms---even if you had never heard of that machine when you wrote the
code!
The difference between idioms and vendor-specific language extensions is
that idioms are, by definition, portable---a compiler which doesn't
recognize an idiom will still produce correct code for it, just as a
compiler which doesn't recognize Convex's conventions for array aliasing
declarations in C will still produce correct code for arrays. Idioms
give programmers and vendors a chance to work together in producing
efficient code without any painful compatibility hassles.
(Followups to comp.arch---the NSA conspiracy theory is getting stale.)
---Dan
Hitachi's M series (IBM 370 PCM in Japan) has a Find First One
(FLM: find first left-most one) instruction. Export versions
(e.g. the EX** series from HDS) don't.
Though it has been supported for over 15 years, I've never seen
any good examples of this instruction in commercial software.
--
Seiji Kaneko kan...@icsi.Berkeley.EDU,kan...@sdl.hitachi.co.jp
..................................... 76270...@compuserve.com
From an article in Fortune magazine, but mainly from an article in the
Annals of the History of Computing [I think], Seymour Cray worked for a
northern company named something like Dynamic Research or Electronic
Research ? which produced code-breaking equipment for the government.
He went to CDC. He founded Cray Research.
He was awarded (by Nixon?) a civilian award for "critical service".
I doubt the award was for the number-crunching power of the Cray-1, but
for its code-breaking power.
Allen Sullivan
rste...@ccwf.cc.utexas.edu (Robert Stewart) writes:
[timing for Transputer bitcount instructions]
As mentioned in earlier postings by myself and several others, bit
count and bit reverse can be done in ~15 instructions using masks and
shifts. If these take about one cycle each, it would take far less
time than using the predefined instruction.
However, on the Transputer these instructions don't take one cycle each. The
bitcount instruction was added to the T800, it doesn't exist in the T414. It
is used in some graphics algorithms to convert coverage masks into intensity,
as well as for other reasons. The bitcount instruction does result in a useful
increase in speed for these cases.
This is not a defense of the Transputer instruction set though.
-jeremy
--
Jeremy Webber Internet: jer...@cs.adelaide.edu.au
Digital Arts Film and Television, jer...@chook.ua.oz.au
3 Milner St, Hindmarsh, SA 5007, Phone: +61 8 346 4532
Australia FAX: +61 8 346 4537
Find First One (or Zero) -- can speed up emulation
of popcount on machines without it
Such instructions are useful on a hypercube system. The population count
instruction can be used to determine the distance between two nodes in the
hypercube network. The first generation Ncube system had an instruction that
found the rightmost one. This was useful in implementing the "ecube" routing
commonly used in hypercube networks. Presumably subsequent Ncube systems
continue to support this instruction.
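A sketch of that routing step (my names; node addresses are bit
strings, and the rightmost-one operation picks the next dimension to
correct):

int next_dim(unsigned cur, unsigned dst)   /* returns -1 on arrival */
{
    unsigned diff = cur ^ dst;             /* dimensions still wrong */
    int d = 0;
    if (diff == 0)
        return -1;
    diff &= ~(diff - 1);                   /* isolate rightmost one */
    while ((diff >>= 1) != 0)
        d++;
    return d;                              /* bit number, 0 = lowest */
}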
As other posters have noted, the population count instruction is very useful
on vector systems that support index compression (gather/scatter) and vector
mask instructions. The population count of the vector mask is the vector
length.
--
David Bradley bra...@csrd.uiuc.edu
Center for Supercomputing Research and Development
University of Illinois at Urbana-Champaign
>As mentioned in earlier postings by myself and several others, bit
>count and bit reverse can be done in ~15 instructions using masks and
>shifts. If these take about one cycle each, it would take far less
>time than using the predefined instruction.
How do you figure? Even in your best case scenario, it takes 15 clocks to
compute one answer. On most machines that I have used that implement
popcnt(), they can do it in around 2-4 clocks for scalar and 1 result/clock
for vector. That seems to me to be around 4-15 times faster than your
software scenario.
-Mike
--
Mike Tighe, (214) 497-4206
ti...@convex.com
I figured this from the table in the posting I replied to, which
stated that on the T800, the bit count instruction takes n+2 (or was
it n+1?) cycles, where n is the position of the MSb, and the bit
reverse instruction takes 32 (34?) cycles. I fully appreciate that it
is possible to do this far faster in hardware, but in a tight loop on
a T800 a software bit count or reverse will be a lot faster than the
"hardware" (microcoded?) instruction (assuming the code I showed runs
in a similar number of cycles on a T800 as on an ARM).
Torben Mogensen (tor...@diku.dk)
The DEC-20 (a 36-bit mainframe forerunner of the VAX) had a Find First
One and Jump instruction (back in 1979 when I last worked on one).
The 80x86 architecture has a parity flag that can give you the last bit
of a popcount. However, the XLAT instruction (8-bit lookup table)
is probably the fastest implementation.
-------------------------------- _\/_
Matt Mahoney, m...@epg.harris.com |(TV)| Drug of the Nation
#include <disclaimer.h> |____|
The venerable old UNIVAC/Sperry Univac/Sperry/Unisys 1100/2200 architecture has
the instructions "load shift and count" and "double LSC" that will circular shift
the target 36-bit word/72-bit double word until the two high-order bits differ.
The resultant shift count is stored in the next higher register.
This was used primarily for normalizing floating point numbers.
--
Jon Reeves Digital Equipment Corporation, Alpha Open Systems Group
5 Wentworth Drive, MS GSF1-1K13, Hudson NH 03051-4929 USA
+1 603 884 5859 ree...@decvax.dec.com FAX: +1 603 884 1685
I worked with the Hitachi Software Works, as a consultant, helping
them to port SHARP APL to VOS3 (their version of MVS), and to make it
support their vector facility. Some of the APL idioms which I recommended
we special case for speed were "reduction" and "indexof" on Booleans.
In particular:
+/B
where B is a Boolean array (let's just talk vector here, although it
does perform in parallel on higher rank arrays) effectively sticks a
plus sign between elements of B, and evaluates that expression. Therefore,
if B is 0 1 0 0 1 1 1, then +/B is the same as:
0+1+0+0+1+1+1
The reduction code to trap plus, or, and, equal, not equal, max, min, and
several other reductions on Booleans is trivial. It also runs Very Fast.
Indexof returns the first occurrence of each element of the right argument
in the left argument. For example:
'abcdefghijkl' indexof 'deaf'
4 5 1 6
For Booleans, it is simply "find first zero (or one)":
0 0 0 0 1 0 1 indexof 1 1 0
5 5 1
Easy to special case. I wrote on this topic in the APL Conference
Proceedings in 1973, about how I sped up indexof and membership in
SHARP APL. I think even IBM picked up my algorithms a bit later (around 1988?).
Bob
Robert Bernecky r...@yrloc.ipsa.reuter.com bern...@itrchq.itrc.on.ca
Snake Island Research Inc (416) 368-6944 FAX: (416) 360-4694
18 Fifth Street, Ward's Island
Toronto, Ontario M5J 2B9
Canada
>In article <29...@hoptoad.uucp> g...@hoptoad.uucp (John Gilmore) writes:
>>It occurred to me to wonder just how many architectures have the following
>>instructions:
>>
>> Population Count -- tell me how many 1-bits are in this word
>>
>> Find First One (or Zero) -- can speed up emulation
>> of popcount on machines without it
>The DEC-20 (a 36-bit mainframe forerunner of the VAX) had a Find First
>One and Jump instruction (back in 1979 when I last worked on one).
This instruction was Jump if Find First One (JFFO)... JFFO AC,E meant
"if AC is non-zero, load a count of the leading zeros in AC into AC+1
and jump to E; if AC is zero, clear AC+1 and don't jump"... my college
had both a KA10-based DEC-10 and a KL10-based DEC-20 and JFFO was
present on both of them, so if JFFO were added at a potential
customer's request (as I heard back then), it must have been added a
_long_ time ago... To use a tight JFFO loop in disk block allocation,
TENEX and TOPS-20 used 36-bit maps with one-bits for free blocks and
zero-bits for allocated blocks.
--
John R. Grout
University of Illinois, Urbana-Champaign
Center for Supercomputing Research and Development
INTERNET: j-g...@uiuc.edu
I don't think this is a dead issue, architecturally. Interested parties
should check out the following book:
"Vector Models for Data-Parallel Computing"
Guy E. Blelloch
MIT Press-1990
LC #: QA76.5 B5456 1990
--
Hugh LaMaster, M/S 233-9, UUCP: ames!lamaster
NASA Ames Research Center Internet: lama...@ames.arc.nasa.gov
Moffett Field, CA 94035-1000 Or: lama...@george.arc.nasa.gov
Phone: 415/604-1056 #include <usenet/std_disclaimer.h>
While both are useful for searching bit-mapped allocation tables, the VAX's
FFS/FFC is definitely *inferior* to the PDP-10's JFFO (or Am29000's CLZ, etc.)
in that the VAX FFS/FFC counts from the *right*, not the left like everyone
else's priority-encoder instruction (another name for it). I suspect the VAX
designers were led astray by an excessive purist view of the LittleEndian-
ness of the VAX, and forgot that find-MOST-significant-one is the important
fundamental operation of the two.
Find-LEAST-significant-one is easily synthesized from FMSO (a.k.a. JFFO, CLZ),
with FLSO(x) = FMSO(x & (-x)). The reverse is not true. There is no "simple"
combination of basic operations -- *including* FLSO in the "basic" set --
which will give you FMSO. [There was a CACM paper a couple of decades ago
which proved this...]
The designers of the VAX slipped up in not providing a FMSO instruction.
-Rob
-----
Rob Warnock, MS-9U/510 rp...@sgi.com
Silicon Graphics, Inc. (415)335-1673 before 6pm PDT May 8, 1992
2011 N. Shoreline Blvd. (415)390-1673 after 6pm PDT May 8, 1992
Mountain View, CA 94039-7311 "Please make a note of it."
The (extremely) odd Modular One had both, as option bits on its shift
instruction. Popcount was achieved by a Tally/Logical shift - it gave
you the count for the number of bits you shifted out (I don't remember
which of its many curious registers got the result). Find First was a
Justify/Logical shift (again, it told you how far it had gone, but I
don't remember where).
The Modular One was an Iann Barron development - an expensive machine,
but (for its time) very fast (implemented in MECL 2). The company
didn't do too well (the PDP-11 came along and was more friendly to
programmers ;-), but Iann eventually resurfaced launching the
Transputer.
I don't imagine the NSA had anything to do with the instruction set of
the Modular One. The instructions (at least justify) were slated as
being for software emulation of floating point.
--
Robin Fairbairns, Senior Consultant, postmaster and general dogsbody
Laser-Scan Ltd., Science Park, Milton Rd., Cambridge CB4 4FY, UK
Email: ro...@lsl.co.uk --or-- r...@cl.cam.ac.uk
OK, I've been resisting posting on this thread, since it gets covered
annually over in alt.folklore.computers, but I'll confirm this.
Authority-info: Two of us, Alan Kotok and myself, designed the
logic of the KA-10. Alan was the more senior of us, so I had to
do more of the bureaucratic paperwork and therefore got the
title of "Project Engineer". (There was very little such
paperwork by today's standards, of course.)
The JFFO instruction was added for the KA-10 (the first of the
PDP-10 / DECsystem-10 processors). It was _not_ present
in the PDP-6, as an earlier poster incorrectly said.
JFFO was added very late in the development cycle, at the request
of a potential customer. The customer was a European telephone
exchange maker. They wanted to use it to scan for changed bits
representing phone line occupancies.
This change was, of course, a new incompatibility with the predecessor
machine, the PDP-6, so we were reluctant to add the instruction.
We asked Marketing how many machines we would sell if we added it.
They said "At least 5 for a testbed, and hundreds if they decide
to go ahead with the design."
So we added it. It was 1968, which was a leap year. I wrote the
blurb for the internal Sales Newsletter, proclaiming that in
honor of Leap Year, we were increasing the number of instructions
in the machine's instruction set from 365 to 366. I still have
a copy of that newsletter somewhere, I think.
End result: The potential customer never bought a single machine.
But we found lots of uses for the instruction. One of the more
obscure was "Clear register 0 and jump" in one instruction. This
worked because the stack register was essentially always register
17 (0x0f for you youngsters) and register+1 (mod 16) was zero.
The stack register always had a negative count in the left half,
so a JFFO using register 17 would always jump, and always find
the first ONE in bit 0, thereby putting the bit number (0) in
register 17+1=0.
And to close the loop back to the title of this thread, the NSA
did buy some PDP-10s later, of course, but they were not the
instigators of the JFFO instruction.
[For comp.arch: The implementation was a hardwired shift in a
loop, looking for the sign bit to become set. There was a
pre-check for all zeroes in the high 18 bits. In that case, the
low half was copied into the high half and the shift counter was
preset to 18 instead of zero. This saved a lot of unneeded
shifting and was fairly cheap because the left-half-equals-zero
gate already existed. I once disabled that gate intentionally,
making it always do the extra shifts. This came up with the
right answer, but too slowly. Then I challenged the diagnostic
programmers to find what I had broken. They gave up after a
while.]
>John R. Grout
>University of Illinois, Urbana-Champaign
>Center for Supercomputing Research and Development
Bob Clements, K1BC, clem...@bbn.com
For their PURPOSE, which was bit vector searching (not priority encoding),
searching from the right, rather than the left, works just as well. In
addition, the right-to-left search allows FFS/FFC to share logic with
other instructions which use a (right-to-left) bit mask, such as CALLx,
PUSHR, POPR, and RET. In these instructions, the correspondence between
bit number and register number allows significant simplifications in
hardware to accelerate these instructions.
[I've had the pleasure (?) of microcoding FFS and FFC on four different
VAX implementations, so this is based on experience.]
Bob Supnik >Sup...@human.enet.dec.com
>All opinions expressed are those of a hardline microcoder
>and do not reflect those of Digital Equipment Corporation
These instructions are also used, as someone else suggested, for the
purposes of task dispatching (via a word of "flag bits" indicating
eligible tasks and an associated pointer vector), and disk space
allocation (disk allocation tables being stored as bitmaps).
The UNIVAC machines also have what might be considered "single bit"
versions of the "load shift and count" --
Jump Negative (Positive) and Shift -- tests the sign bit of a
register, jumps if the sign bit is set (not set), and then
does a one-bit left circular shift of the register. Useful
for scanning a bit stream.
Univac hardware also has "parity testing" instructions -- which test
whether a word in a register contains an even or odd number of bits.
No "population count" instructions, though.
--
John G Dobnick (Exec Guru, ret.) ATTnet: (414) 229-5727
Computing Services Division INTERNET: j...@uwm.edu
University of Wisconsin - Milwaukee UUCP: uunet!uwm!jgd
"Knowing how things work is the basis for appreciation,
and is thus a source of civilized delight." -- William Safire
|> >I figured this from the table in the posting I replied to, which
|> >stated that on the T800, the bit count instruction takes n+2 (or was
|> >it n+1?) cycles, where n is the position of the MSb, and the bit
|> >reverse instruction takes 32 (34?) cycles. I fully appreciate that it
|> >is possible to do this far faster in hardware, but in a tight loop on
|> >a T800 a software bit count or reverse will be a lot faster than the
|> >"hardware" (microcoded?) instruction (assuming the code I showed runs
|> >in a similar number of cycles on a T800 as on an ARM).
|>
|> i think someone here did a fast software solution which
|> involved using the 4 bytes in the word as an offset into
|> a 256 entry lookup table - which was reasonably fast for
|> software.
As far as I know the above is the best implementation of this operation on a
transputer without using the bitcnt instruction. The code loads the
individual bytes out of the word to be bit-counted, and uses table look-up
to determine the number of bits set in each byte. If careful use is
made of internal RAM this will operate in about the same number of cycles
as the bitcnt instruction, otherwise it is significantly slower.
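In C, the table scheme looks something like this sketch (names are
mine; the table is built once at startup, and a 32-bit word is
assumed):

static unsigned char popc8[256];          /* bit count of each byte value */

void init_popc8(void)
{
    int i;
    for (i = 1; i < 256; i++)
        popc8[i] = popc8[i >> 1] + (i & 1);
}

unsigned popcount_tbl(unsigned long w)
{
    return popc8[w & 0xff] + popc8[(w >> 8) & 0xff]
         + popc8[(w >> 16) & 0xff] + popc8[(w >> 24) & 0xff];
}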
Now from a purist's point of view the ``correct'' way to speed up this
operation on the transputer would be to make load_byte, load_local and
various other transputer instructions go faster, and add a cache. However,
that is difficult - microcoding a bitcnt instruction on the existing
datapath was comparatively easy.
Was it worthwhile? It is hard to tell - the addition of the instruction has
de-skilled the programming of the bit-count operation and has added a "bullet
feature" to the machine - both of these make it easier to sell.
--
Roger Shepherd, INMOS Ltd JANET: ro...@uk.co.inmos
1000 Aztec West UUCP: ukc!inmos!roger or uunet!inmos-c!roger
Almondsbury INTERNET: ro...@inmos.com
+44 454 616616 ROW: ro...@inmos.com OR ro...@inmos.co.uk
Not quite. That technique only works in machines that use two's complement
binary arithmetic. Since most machines (and all recent ones I can think of)
are in that class, it's easy to forget there are other choices. On a
CDC 6600, the above will NOT work... nor on an IBM 1620!
paul
>Once again, I am not aware of any compilers which will take code, such as
>
>int bitcnt(n)
>register unsigned long n;
>{
> register unsigned long tmp;
>
> tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);
> return ((tmp + (tmp >> 3)) & 030707070707) % 63;
>}
>
>and turn it into a pop count instruction.
What a compiler COULD do is define something like a special pseudo-macro
that will generate the pop count instruction. A real macro would be there
in case the program needs to be compiled on a machine without the instruction.
It is just not in the nature of C to take any code like the above (which is
NOT the only way to accomplish this in C) and change it to an instruction.
Anyway, I already use macros defined for the purposes. Here is a copy of
my header file (bits.h) to define the macros. Now when you are writing a
C compiler for a machine with instructions that do POP COUNT, make that
compiler have an option to invoke special handling for these macros, and
not use the macro as defined, but generate code with that instruction.
/***********************************************************************\
|macro: COUNTBITS16 |
|purpose: count the bits that are 1 in a 16 bit lvalue |
|usage: COUNTBITS16( bits ); |
| or: count = COUNTBITS16( bits ); |
|note: the lvalue argument always ends up with the result |
\***********************************************************************/
#define COUNTBITS16(b) (\
(b)=(((b)>> 1)&0x5555)+((b)&0x5555),\
(b)=(((b)>> 2)&0x3333)+((b)&0x3333),\
(b)=(((b)>> 4)&0x0707)+((b)&0x0707),\
(b)=(((b)>> 8)&0x000f)+((b)&0x000f),\
(b))
/***********************************************************************\
|macro: COUNTBITS32 |
|purpose: count the bits that are 1 in a 32 bit lvalue |
|usage: COUNTBITS32( bits ); |
| or: count = COUNTBITS32( bits ); |
|note: the lvalue argument always ends up with the result |
\***********************************************************************/
#define COUNTBITS32(b) (\
(b)=(((b)>> 1)&0x55555555)+((b)&0x55555555),\
(b)=(((b)>> 2)&0x33333333)+((b)&0x33333333),\
(b)=(((b)>> 4)&0x07070707)+((b)&0x07070707),\
(b)=(((b)>> 8)&0x000f000f)+((b)&0x000f000f),\
(b)=(((b)>>16)&0x0000001f)+((b)&0x0000001f),\
(b))
/***********************************************************************\
|macro: COUNTBITS64 |
|purpose: count the bits that are 1 in a 64 bit lvalue |
|usage: COUNTBITS64( bits ); |
| or: count = COUNTBITS64( bits ); |
|note: the lvalue argument always ends up with the result |
\***********************************************************************/
#define COUNTBITS64(b) (\
(b)=(((b)>> 1)&0x5555555555555555)+((b)&0x5555555555555555),\
(b)=(((b)>> 2)&0x3333333333333333)+((b)&0x3333333333333333),\
(b)=(((b)>> 4)&0x0707070707070707)+((b)&0x0707070707070707),\
(b)=(((b)>> 8)&0x000f000f000f000f)+((b)&0x000f000f000f000f),\
(b)=(((b)>>16)&0x0000001f0000001f)+((b)&0x0000001f0000001f),\
(b)=(((b)>>32)&0x000000000000003f)+((b)&0x000000000000003f),\
(b))
/***********************************************************************\
|macro: REVERSEBITS16 |
|purpose: reverse the bit order in a 16 bit lvalue |
|usage: REVERSEBITS16( bits ); |
| or: value = REVERSEBITS16( bits ); |
|note: the lvalue argument always ends up with the result |
\***********************************************************************/
#define REVERSEBITS16(b) (\
(b)=(((b)>> 1)&0x5555)|(((b)&0x5555)<< 1),\
(b)=(((b)>> 2)&0x3333)|(((b)&0x3333)<< 2),\
(b)=(((b)>> 4)&0x0f0f)|(((b)&0x0f0f)<< 4),\
(b)=(((b)>> 8)&0x00ff)|(((b)&0x00ff)<< 8),\
(b))
/***********************************************************************\
|macro: REVERSEBITS32 |
|purpose: reverse the bit order in a 32 bit lvalue |
|usage: REVERSEBITS32( bits ); |
| or: value = REVERSEBITS32( bits ); |
|note: the lvalue argument always ends up with the result |
\***********************************************************************/
#define REVERSEBITS32(b) (\
(b)=(((b)>> 1)&0x55555555)|(((b)&0x55555555)<< 1),\
(b)=(((b)>> 2)&0x33333333)|(((b)&0x33333333)<< 2),\
(b)=(((b)>> 4)&0x0f0f0f0f)|(((b)&0x0f0f0f0f)<< 4),\
(b)=(((b)>> 8)&0x00ff00ff)|(((b)&0x00ff00ff)<< 8),\
(b)=(((b)>>16)&0x0000ffff)|(((b)&0x0000ffff)<<16),\
(b))
/***********************************************************************\
|macro: REVERSEBITS64 |
|purpose: reverse the bit order in a 64 bit lvalue |
|usage: REVERSEBITS64( bits ); |
| or: value = REVERSEBITS64( bits ); |
|note: the lvalue argument always ends up with the result |
\***********************************************************************/
#define REVERSEBITS64(b) (\
(b)=(((b)>> 1)&0x5555555555555555)|(((b)&0x5555555555555555)<< 1),\
(b)=(((b)>> 2)&0x3333333333333333)|(((b)&0x3333333333333333)<< 2),\
(b)=(((b)>> 4)&0x0f0f0f0f0f0f0f0f)|(((b)&0x0f0f0f0f0f0f0f0f)<< 4),\
(b)=(((b)>> 8)&0x00ff00ff00ff00ff)|(((b)&0x00ff00ff00ff00ff)<< 8),\
(b)=(((b)>>16)&0x0000ffff0000ffff)|(((b)&0x0000ffff0000ffff)<<16),\
(b)=(((b)>>32)&0x00000000ffffffff)|(((b)&0x00000000ffffffff)<<32),\
(b))
--
/***********************************************************************\
| Phil Howard --- KA9WGN --- p...@netcom.com | "The problem with |
| depending on government is that you cannot depend on it" - Tony Brown |
\***********************************************************************/
"not in the nature of C"?
Huh?
Guess you haven't seen any good compilers, for any language.
>"not in the nature of C"?
>Huh?
>Guess you haven't seen any good compilers, for any language.
I agree that a compiler CAN look for a specific coding scheme, even
with a FEW modifications permitted, but I doubt that much more can
be done in this direction, even with human interaction.
Instead of limiting computer users to the weak set of constructs of
C (or any other language), it is necessary to expand the set of constructs
so that the programmer can state what is wanted at least in a manner that
a bright human can recognize the substitutability of a machine procedure.
The programmer generally has an idea of what is to be accomplished, and
should be allowed to communicate this to the compiler or whatever. If
required to make a lengthy circumlocution, it may very well be almost
impossible for someone or something to decide what is wanted. How easy
would it be for someone to decide that a procedure to compute trigonometric
functions to thousands of digits, described in detail, could be replaced by
one which uses the complex AGM?
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hru...@pop.stat.purdue.edu (Internet, bitnet)
{purdue,pur-ee}!pop.stat!hrubin(UUCP)
Herman, why do you refuse to learn the slightest thing about optimization
before you babble???
The compilation process of an optimizer is typically the following:
1> Parse the source code, and translate it into an intermediate form.
2> Gather information on the intermediate form, by performing various
kinds of analysis.
3> Transform the intermediate form based on information gathered during
phase 2.
4> Translate intermediate code into assembly code.
5> Perform peephole optimizations on the generated assembly code.
(Peepholing consists of scanning short segments of assembly,
and determining if there is any code sequence with the same
semantic effect but fewer cycles to execute.)
The point of all this is: the constructs present in the language do
_not_ have a direct effect on the optimizations possible in the
language. You do _not_ have to write a specific statement sequence in
the source code in order to allow the compiler to perform a particular
optimization. I know that you're completely convinced that it's
impossible for a compiler to do optimization based on anything but
pattern matches - but you're wrong.
Semantic analysis allows the compiler to determine the effect of a
code sequence. Based on the semantic analysis, optimizations are done.
The information generated from semantic analysis is _not_ limited to
information available from the source code syntax.
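To make that concrete with the very instruction this thread started
from: a compiler that does semantic analysis could, in principle,
recognize the following loop as a population count and emit a single
instruction, even though no pop-count construct appears anywhere in
the source syntax:

/* Classic clear-lowest-bit count: n &= n - 1 removes one 1-bit,
   so the loop body executes once per 1-bit in the original word. */
unsigned bitcount(unsigned long n)
{
    unsigned c = 0;
    while (n != 0) {
        n &= n - 1;     /* clear the lowest set bit */
        c++;
    }
    return c;
}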
>Instead of limiting computer users to the weak set of constructs of
>C (or any other language), it is necessary to expand the set of constructs
>so that the programmer can state what is wanted at least in a manner that
>a bright human can recognize the substitutability of a machine procedure.
In general, when it comes time to optimize code written for any of the
high performance modern architectures (particularly parallel
processors), the code transformations that produce the greatest effect
make the code incomprehensible to a human reader. The compiler can
analyze the code very effectively, and perform transformations that
you couldn't. But the moment you're allowed to start mucking around
with low level details, you'll start to +degrade+ performance, by
making it impossible to perform many of the most significant code
optimizations.
Take, for example, a vectorizing compiler. You can gain _very_
significant improvements in performance by vectorizing array accesses
in loops. To do this, you generally have to reorder the statements
within a loop, and then split the loop into several loops, and then
convert some of the newly generated loops into vector instructions.
The ability to perform these transformations is strongly dependent on
the ability to determine the exact statement-to-statement dependences
in the original source code.
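A simplified sketch of one such transformation, loop distribution: a
loop whose statements have only forward dependences may legally be
split into separate loops, each of which can then map onto vector
instructions.

/* Before: one loop, two statements.  The dependence is forward
   (a[i] is written, then read, in the same iteration). */
void before(double *a, double *b, double *c, double *d, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
        d[i] = a[i] * 2.0;
    }
}

/* After distribution: each loop is now a candidate for a single
   vector operation. */
void after(double *a, double *b, double *c, double *d, int n)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] = b[i] + c[i];
    for (i = 0; i < n; i++)
        d[i] = a[i] * 2.0;
}

Whether the split is legal is exactly what the dependence analysis
establishes, and that goes out the window the moment the compiler
cannot analyze a statement in the loop body.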
Now, suppose a hotshot programmer comes along, and says "Hey, if I
could use my nifty-neato-keen XYZ instruction here, it would _really_
speed things up". So, he uses the Rubin language to add the XYZ
instruction directly to the language. Now, the optimizer wasn't
designed for XYZ instructions in its intermediate code - so it's GOT
to assume the worst, and so the loops with an XYZ instruction suddenly
can't be vectorized.
In general, it's the responsibility of the system designer to ensure
that languages for that machine have the ability to perform the
correct optimizations to exploit the machine architecture. Allowing a
hotshot who doesn't understand how the compiler works to try to
hand-optimize is insane.
A modern compiler has to do a LOT of messy work. It's not just
generating the correct instructions anymore. Instruction scheduling is
very messy - do you _really_ want to have to start keeping track of
interlocks, delay slots, etc? Do you really want to have to do the
dependence analysis on a large program by hand? Do you _really_ want
to write programs that are unreadable, so that you can have the
ability to hand-tool them?
>The programmer generally has an idea of what is to be accomplished, and
>should be allowed to communicate this to the compiler or whatever. If
>required to make a lengthy circumlocution, it may very well be almost
>impossible for someone or something to decide what is wanted. How easy
>would it be for someone to decide that a procedure to compute trigonometric
>functions to thousands of digits, described in detail, could be replaced by
>one which uses the complex AGM?
If a programmer who needs to use the values of trig functions to 10
digits chooses to use a language which uses infinite-precision real
numbers, then that programmer is an idiot, and I won't spend any time
worrying about how inefficient er code is. Similarly, if some
programmer decides to write code using constructive real numbers in
Fortran, then I'm not going to feel badly that er implementation comes
out slow. A programmer is a craftsperson - and should choose the tools
that suit the job. If E's writing numerical code, let er use a
programming language intended for numerical computation, and which has
an optimizer that will do a good job on numerical code. If E chooses
to use a jackhammer to pound nails, or a screwdriver to break up
pavement, then E deserves to suffer.
<MC>
--
|| Mark Craig Carroll: <MC> || "Humanity isn't a physical description,
|| Univ of Delaware, Dept of CIS|| it's a spiritual goal. It's not something
|| Grad Student/Labstaff Hacker || you're given, it's something you earn."
|| car...@udel.edu || - _One_, by Richard Bach
[Invective deleted.]
>The compilation process of an optimizer is typically the following:
> 1> Parse the source code, and translate it into an intermediate form.
> 2> Gather information on the intermediate form, by performing various
> kinds of analysis.
> 3> Transform the intermediate form based on information gathered during
> phase 2.
> 4> Translate intermediate code into assembly code.
> 5> Perform peephole optimizations on the generated assembly code.
> (Peepholing consists of scanning short segments of assembly,
> and determining if there is any code sequence with the same
> semantic effect but fewer cycles to execute.)
I am quite aware of all this. I am also aware of what it cannot do
that I can do.
...................
HR>>Instead of limiting computer users to the weak set of constructs of
HR>>C (or any other language), it is necessary to expand the set of constructs
HR>>so that the programmer can state what is wanted at least in a manner that
HR>>a bright human can recognize the substitutability of a machine procedure.
.....................
>Now, suppose a hotshot programmer comes along, and says "Hey, if I
>could use my nifty-neato-keen XYZ instruction here, it would _really_
>speed things up". So, he uses the Rubin language to add the XYZ
>instruction directly to the language. Now, the optimizer wasn't
>designed for XYZ instructions in its intermediate code - so it's GOT
>to assume the worst, and so the loops with an XYZ instruction suddenly
>can't be vectorized.
So you are saying that the compiler can decide that the XYZ instruction
will speed up its code, but cannot handle statements by the programmer
which explicitly use the XYZ instruction?
>In general, it's the responsibility of the system designer to ensure
>that languages for that machine have the ability to perform the
>correct optimizations to exploit the machine architecture. Allowing a
>hotshot who doesn't understand how the compiler works to try to
>hand-optimize is insane.
So the programmer is limited to the constructs the language designers
thought about? Why is exponentiation not in C? It is in the far older
Fortran. The pow() function in C does not do the same thing in many
cases.
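To spell out the difference: I**J in Fortran is exact integer
arithmetic, while pow() works in floating point (typically by way of
exp(y*log(x))) and need not return exactly 100.0 for pow(10.0, 2.0)
on a careless library.  The exact version is easy to write by
repeated squaring; overflow handling is omitted here for brevity:

/* Exact integer exponentiation by repeated squaring.  Assumes
   e >= 0; ignores overflow. */
long ipow(long x, unsigned e)
{
    long r = 1;
    while (e != 0) {
        if (e & 1)
            r *= x;     /* fold in this bit of the exponent */
        x *= x;
        e >>= 1;
    }
    return r;
}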
......................
HR>>The programmer generally has an idea of what is to be accomplished, and
HR>>should be allowed to communicate this to the compiler or whatever. If
HR>>required to make a lengthy circumlocution, it may very well be almost
HR>>impossible for someone or something to decide what is wanted. How easy
HR>>would it be for someone to decide that a procedure to compute trigonometric
HR>>functions to thousands of digits, described in detail, could be replaced by
HR>>one which uses the complex AGM?
>If a programmer who needs to use the values of trig functions to 10
>digits chooses to use a language which uses infinite-precision real
>numbers, then that programmer is an idiot, and I won't spend any time
>worrying about how inefficient er code is. Similarly, if some
>programmer decides to write code using constructive real numbers in
>Fortran, then I'm not going to feel badly that er implementation comes
>out slow. A programmer is a craftsperson - and should choose the tools
>that suit the job. If E's writing numerical code, let er use a
>programming language intended for numerical computation, and which has
>an optimizer that will do a good job on numerical code. If E chooses
>to use a jackhammer to pound nails, or a screwdriver to break up
>pavement, then E deserves to suffer.
So someone doing numerical analysis can get by with 10 digit accuracy?
Maybe and maybe not. The point which seems to be missed is that the
language may not have the tools in its toolbox. Now ANY programming
language I have seen is adequate to carry out any job, however clumsily.
I intend to post a moderate-sized program for which I do not believe any
optimizer can see what is needed on a vector machine. But here is
a simple function computation; see if you can find an algorithm to
do it reasonably efficiently.
We want to compute a function f such that, if x is represented in
the usual manner in binary: x = Sum a_n*2^(-n), the a's being 0
or 1, then f(x) = Sum n*a_n*2^(-n).
This function does occur naturally, and is not merely contrived
to discombobulate.
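To pin the specification down, here is the obvious bit-serial version,
assuming x is carried as a 32-bit binary fraction with a_1 in the most
significant bit.  The challenge is to do substantially better than
this loop:

/* f(x) = Sum n*a_n*2^(-n), with x held as a 32-bit fraction:
   bit 31 is a_1 (weight 2^-1), ..., bit 0 is a_32 (weight 2^-32). */
double f(unsigned long x)
{
    double r = 0.0, w = 0.5;    /* w = 2^(-n) */
    int n;
    for (n = 1; n <= 32; n++) {
        if (x & (0x80000000 >> (n - 1)))
            r += n * w;
        w *= 0.5;
    }
    return r;
}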
hope he disappears soon so i don't feel compelled to unsubscribe.
p.s. kill files don't work on herman. part of what i want to avoid is
all of the refutations he inspires.
A way that IS in the nature of C is to write code like
#ifdef COMPILER_USES_BITCOUNT_INSTRUCTION
b = _BITCOUNT_INSTRUCTION(n) ;
#else
/* normal code */
... etc. ...
#endif
Or, if you have ANSI C, do something with #pragma, and hope your
compiler doesn't mind.
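For what it's worth, one hypothetical shape for the #else branch is
the byte-table scheme (the names here are made up):

static unsigned char bytecount[256];    /* bit count of each byte value */

void init_bytecount(void)               /* call once at startup */
{
    int i;
    bytecount[0] = 0;
    for (i = 1; i < 256; i++)
        bytecount[i] = bytecount[i >> 1] + (i & 1);
}

/* then, in place of the pop count instruction: */
b = bytecount[ n        & 0xff]
  + bytecount[(n >>  8) & 0xff]
  + bytecount[(n >> 16) & 0xff]
  + bytecount[(n >> 24) & 0xff];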
--
Pray for peace; Bill
#Bill Stewart +1-908-949-0705 erebus.att.com!wcs AT&T Bell Labs 4M312 Holmdel NJ
# I'm trying to get sendmail working right; apologies if mail to me bounces.