Since argument reduction for the trigonometric functions is
done in practice by multiplying by 1/pi, I do not see how it would
benefit from your proposed instruction. I believe this instruction
would have little general utility.
An efficient way to do the fortran "x=dint(y)" is important.
This is lacking on the Risc System 6000 as was pointed out in a Los
Alamos report. By the way, is it true that it is impossible to
reasonably express this in ADA?
Peter Montgomery says:
Yes, most programs are written in these languages.
As Dan says, the language designers made mistakes. During one
review period for Fortran 90, I requested an operation which
takes four nonnegative integers a, b, c, n with
a < n or b < n (c is optional and defaults to 0).
The requested operation returns q and/or r, where
a*b + c = q*n + r and 0 <= r < n
Why didn't you ask for an integer*8 data type? Wouldn't
that give you what you want and have much more general utility?
I would like an instruction which counts the number of ones in
the binary representation of an integer but I find it hard to argue
that this would be widely used.
James B. Shearer
There is somewhat greater loss of accuracy in this, and one still
needs to extract the integer part to an integer register and the
fractional part. Thus, it needs at least three operations, instead
of one. Also, if one is calculating something like x - ln(1+x), a
natural operation in certain problems, the computing problems become
a little larger than one would expect. In fact, to avoid a loss of
accuracy in some quite usual situations, it would even be a good idea
to have Boolean operations on floats, and unnormalized floating
arithmetic. Floating arithmetic, apart from some preliminaries,
and normalization problems, is exactly the same as integer, so why
should there be a separate arithmetic unit even?
> An efficient way to do the fortran "x=dint(y)" is important.
> This is lacking on the Risc System 6000 as was pointed out in a Los
> Alamos report. By the way is it true that it is impossible to
> reasonably express this in ADA?
> Peter Montgomery says:
> Yes, most programs are written in these languages.
> As Dan says, the language designers made mistakes. During one
> review period for Fortran 90, I requested an operation which
> takes four nonnegative integers a, b, c, n with
> a < n or b < n (c is optional and defaults to 0).
> The requested operation returns q and/or r, where
>
> a*b + c = q*n + r and 0 <= r < n
>
> Why didn't you ask for an integer*8 data type? Wouldn't
> that give you what you want and have much more general utility?
Yes and no. It would have much more general utility, but it would
do an abysmally inefficient job in this situation. You would need
to have a way of indicating that the product of two integers*4 is
an integer*8, which I do not know how to do in any language with
which I am familiar without writing it as a function, and I do not
think that one should have to write mult48(a,b) instead of a*b. In
addition, how would you write the operation which, when applied to
a*b+c and n, yields both q and r? Is the reason why it is difficult
to get double length multiplication, and division with quotient and
remainder, that the language designers left these out, and then the
As far as adding this type, why not allow the user to add any type? This
was present for up to three additional types on Fortran VI(?) on the CDC3600.
> I would like an instruction which counts the number of ones in
> the binary representation of an integer but I find it hard to argue
> that this would be widely used.
Cray seems to find this one important enough to include on his machines,
which have a real paucity of instructions.
I think you will find that those who want to add "bizarre" instructions in
both hardware and software to do what the present stuff does not do well
understand the problems of both, and have some idea of what is feasible
at reasonable cost.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hru...@l.cc.purdue.edu (Internet, bitnet) {purdue,pur-ee}!l.cc!hrubin(UUCP)
Then, the only stumbling block would be the flexible syntax for
procedure calls, which could be addressed with a preprocessor rather
than an entire language.
Am I missing something here, or is this problem really as easy to
solve as it seems? -- Darren
--
--- Darren New --- Grad Student --- CIS --- Univ. of Delaware ---
----- Network Protocols, Graphics, Programming Languages,
Formal Description Techniques (esp. Estelle), Coffee, Amigas -----
=+=+=+ Let GROPE be an N-tuple where ... +=+=+=
favorite hand-held calculator. If sin(pi) = 0, the implementation is
wrong!)
No. Since that is what people expect, many library writers go to
special pains to make sure that this case works out "right".
Historically, Suns (for example) provided user selectable pi precision
(see the Numerical Computation Guide for details) .... this hasn't
proved to be very useful because few know or seem to care; but has
caused a bit of grief once or twice.
--
----------------------------------------------------------------
Keith H. Bierman kbie...@Eng.Sun.COM | k...@chiba.Eng.Sun.COM
SMI 2550 Garcia 12-33 | (415 336 2648)
Mountain View, CA 94043
GCC is already. I saw a 68881 floating-point include file for gcc that
was nothing more than
static inline const double sin(register double x)
{
    register double ans;

    asm("fsin %1,%0" : "=f" (ans) : "f" (x));
    return ans;
}
The syntax is a bit cryptic to express everything you can tell the
optimizer about, but basically, the first string is a pattern, and the
colon-separated parts are output and input operands (you can add
another section for clobbered registers). "=f" means it needs an "f"
type operand (which the machine description uses for floating point
args) and it's assigned to. "f" means it's not assigned to. Gcc will
merge your code in with the caller, hoist sin() out of loops it's invariant
in (const in the function type says you can do that with this function,
although once it's inlined, gcc can see that for itself), and generally
behave as if sin() was an intrinsic rather than a library call.
You can also express the idea that a given input is not read until after
the outputs are written, that an input and output have to be in the same
place, and other things.
>Then, the only stumbling block would be the flexible syntax for
>procedure calls, which could be addressed with a preprocessor rather
>than an entire language.
>
>Am I missing something here, or is this problem really as easy to
>solve as it seems? -- Darren
Well, the syntax is clumsy for really large blocks, but it works well for
dropping, for example, an atomic test-and-set or compare-and-swap into
some C code.
I believe the reason it isn't more commonly done is that few compilers
have the strong separation gcc does between the compiler and the machine
description. So you can't easily add a description of an instruction at
run time and have the optimizer work with it smoothly.
--
-Colin
Yes and no. Yes, most users expect sin(pi) to be zero. No, most
users do _not_ expect sin(3.1415927) to be zero (or, whatever PI
approximation you use). Any user who needs trig functions should be
_expected_ to know that PI is not a rational number and that floating
point numbers are a subset of the rationals. So, any user who needs
trig functions should _expect_ that sin(x)==0.0 is only true when
x==0.0 if x is a floating point number.
J. Giles
> approximation you use). Any user who needs trig functions should be
> _expected_ to know that PI is not a rational number and that floating
Tell that to a spreadsheet user. It is impossible to legislate what
people _will_ expect. We can pontificate on what they should expect, but
unless we are working in the education field we can't expect it to work.
I haven't seen Seymour posting here, and I don't often meet him.
Now that I am suitably impressed, would you share with us what
he would tell us if we did ask him?
> --
> dik t. winter, cwi, amsterdam, nederland
> d...@cwi.nl
dan herrick
herr...@iccgcc.decnet.ab.com
Here's a (sort of) working example using gcc 1.39 on a uVAX. "g" is
the "general" operand class that covers all the VAX addressing modes.
int main()
{
    int i, j, k, l;

    i = 12;
    for (k = 0; k < 10; k++) {
        asm("foo %1,%0": "=g" (j) : "g" (i));
        asm("bar %1,%0": "=g" (l) : "g" (i));
    }
    printf("%d\n", j);
    return 0;
}
Note that foo is loop-invariant, while bar is dead. gcc -O -fstrength-reduce
produces:
#NO_APP
gcc_compiled.:
.text
LC0:
.ascii "%d\12\0"
.align 1
.globl _main
_main:
.word 0x0
#APP
foo $12,r1
#NO_APP
movl $9,r0
L4:
decl r0
jgeq L4
pushl r1
pushab LC0
calls $2,_printf
clrl r0
ret
I'm slightly annoyed by gcc's failure to delete the empty loop.
Testing reveals it fails to delete even trivially empty loops. At
least it's not a common case. Interestingly, if we make bar depend on
the loop index k, it's still dead code, and still deleted, but it
inhibits the loop optimisation to count-down form. I guess it's the
order in which optimisations are applied. The code is deleted after
the loop is generated in the count-up form:
int main()
{
    int i, j, k, l;

    i = 12;
    for (k = 0; k < 10; k++) {
        asm("foo %1,%0": "=g" (j) : "g" (i));
        asm("bar %1,%0": "=g" (l) : "g" (k));
    }
    printf("%d\n", j);
    return 0;
}
#NO_APP
gcc_compiled.:
.text
LC0:
.ascii "%d\12\0"
.align 1
.globl _main
_main:
.word 0x0
clrl r0
#APP
foo $12,r1
#NO_APP
L4:
incl r0
cmpl r0,$9
jleq L4
pushl r1
pushab LC0
calls $2,_printf
clrl r0
ret
Oh, well, 2.0 is coming, and there has to be something to fix...
--
-Colin
Pop count. Count the set bits in a word. CXi on the CDC Cyber series.
--
Sean Eric Fagan | "I made the universe, but please don't blame me for it;
s...@kithrup.COM | I had a bellyache at the time."
-----------------+ -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.
In article <KHB.91Fe...@chiba.Eng.Sun.COM> k...@chiba.Eng.Sun.COM (Keith Bierman fpgroup) writes:
> No. Since that is what people expect, many library writers go to
> special pains to make sure that this case works out "right".
At least the current version of Sun's sin does not do that. I do know
of one implementation that did range reduction by a multiplication by 1/pi,
where the actual value of 1/pi used was tweaked in its low order bit to
assure that pi*1/pi = 1.0. I will argue that such an implementation is wrong.
The first consideration is that on a given machine (calculating in finite
precision) the value of pi is not exactly representable. So when we enter pi
one way or the other, the internal value will be a close approximation
(say pi'). Now we can calculate a very good approximation to the mathematical
result of sin(pi'):
sin(pi') = -sin(pi' - pi) ~= pi - pi'
So the mathematical result is on the order of the machine precision (if the
system correctly converts the value of pi).
Given this it is still not clear whether a returned value of 0.0 is allowable
or not. Traditionally numerical mathematics uses two forms of error
analysis: forward and backward.
In forward error analysis, the numerical result of an operation is compared
to the exact mathematical result; if the difference is small there is a small
absolute or relative error.
In backward error analysis (since J. H. Wilkinson in the late 50's and early
60's) inputs to the operation are sought that would give the result as an
exact mathematical result. If those inputs are close to the actual inputs
it is said that the error is small.
Clearly, in terms of backward error analysis sin(pi')=0.0 is allowed, while
it is disallowed in forward error analysis (if we require good relative
precision).
So the question is: should we allow backward error analysis here, or allow
forward error analysis only. This depends on the situation. If it is known
that the inputs to an operation are approximations, backward error analysis
is clearly allowed (the result is the exact mathematical result of one of
the possible inputs that were represented by the actual input). If this is
not known, backward error analysis is generally frowned upon. If we look at
sin (and the other elementary mathematical functions) as black boxes, there
is a tendency to disallow backward error analysis. Furthermore, backward
error analysis can have surprises: suppose we have a machine that represents
floating point numbers with a 48 bit mantissa (that is not a random choice);
assume we have two floating point numbers, x=2^48 and y=2^48-1. Is it
allowed that x-y delivers 0.0 or 2.0 (as is done on some machines)? Using
backward error analysis this is clearly allowed, the result is the exact
mathematical result if we flip the low order bit of one of the inputs; and
that is a small change. The general consensus is however that such imprecise
operations are not allowed; and all basic operations should have a small error
in terms of forward error analysis.
Traditionally, backward error analysis is accepted only in the field of
numerical linear algebra. The reason is simply that forward error analysis
cannot give satisfactory results for the most important algorithms. (Before
Wilkinson it was estimated that the numerical solution of linear systems
with direct methods would not be possible if the order of the system
exceeded 10, due to numerical instability. Nowadays systems of order 1000
and more are solved routinely using direct methods.) This is also the
reason that numerical
algebraists are more willing to allow backward error analysis in other fields
of numerical mathematics. (The Karlsruhe people disagree here, and always
require forward error analysis; they are designing algorithms that allow it.)
There is one additional problem if sin(pi') != 0.0. Most uses of the sine
function in heavy numeric calculations are of the form sin(pi * x). If we
look at the function sinpi(x) == sin(pi * x) as a black-box, implementation
in terms of the standard sine function can only be satisfactory if the
function sinpi already performs the range reduction needed. This has
led, in the forthcoming Ada standard for elementary functions, to the
specification of a sine function with two parameters, the first giving the
x argument, the second giving the scale used for the measuring (with a default
of 2.0*pi %). This allows the user to calculate sinpi(x) [Ada: sin(x,2),
because if x=2 we have a complete circle], but also sin(x) with x in degrees
[sin(x,360)], grads [sin(x,400)] and bams, but I disremember what a bam is
(I think there is more than one definition).
--
% Actually there are two functions sin, one with one argument (and cycle 2.0*pi)
and one with two arguments. There are some reasons for this.
I don't know about Cray, but I'm told that POP count was put into the
Gould machines (which tended to be used for things like telemetry,
i.e. listening to and decoding Russian phone calls and missile
launches) for the security boys.
--
Andy Glew, gl...@ichips.intel.com
Intel Corp., M/S JF1-19, 5200 NE Elam Young Parkway,
Hillsboro, Oregon 97124-6497
I don't understand this comment. This problem can occur
with the divide with remainder instruction as well. The problem is
that approximating x mod pi by x mod y, where y is the machine
representation of pi, may lose all accuracy.
Herman Rubin says:
There is somewhat greater loss of accuracy in this, and one still
needs to extract the integer part to an integer register and the
fractional part. Thus, it needs at least three operations, instead
of one. Also, if one is calculating something like x - ln(1+x), a
natural operation in certain problems, the computing problems become
a little larger than one would expect. In fact, to avoid a loss of
accuracy in some quite usual situations, it would even be a good idea
to have Boolean operations on floats, and unnormalized floating
arithmetic. Floating arithmetic, apart from some preliminaries,
and normalization problems, is exactly the same as integer, so why
should there be a separate arithmetic unit even?
I don't understand the comments about loss of accuracy.
As noted above some or all of the bits of the remainder will be
bogus because pi is not a machine number. Accurate argument
reduction requires first finding the integer part, n, of the quotient,
then computing x-n*pi carefully using pi to more than machine
precision. It is more convenient to do this if n is in floating
format. Counting operations is a poor test of speed when floating
divide typically takes 5 times or more as long as floating
multiply. I don't understand the reference to x-ln(1+x). This will
always lose accuracy if evaluated in this form for x near 0. It
is usually possible to perform boolean operations on floats,
although it may be a little awkward. I presume the reason for a
separate floating point unit is that this leads to faster machines.
Regarding my suggestion for an integer*8 type Herman Rubin
comments:
Yes and no. It would have much more general utility, but it would
do an abysmally inefficient job in this situation. You would need
to have a way of indicating that the product of two integers*4 is
an integer*8, which I do not know how to do in any language with
which I am familiar without writing it as a function, and I do not
think that one should have to write mult48(a,b) instead of a*b. In
addition, how would you write the operation which, when applied to
a*b+c and n, yields both q and r?
Regarding a function to get the 8 byte product of 2
4 byte integers, I don't see why this is any worse than Montgomery's
function. In any case it is not needed. Let i4,j4 be 4 byte
i8,j8,k8 be 8 byte. Then write
i8=i4
j8=j4
k8=i8*j8
A sufficiently intelligent compiler will do the right thing. It may
work to write
k8=i4*j4
This sometimes works in the analogous case where k8 is 4 byte and i4
and j4 are 2 bytes. However I am not sure strictly speaking that it
should. What does the Fortran standard say?
As for getting both q and r this is easy just write
iq=ia/ib
ir=mod(ia,ib)
A reasonable compiler will only generate 1 divide with remainder.
I will confess however that while this works when everything is
4 bytes I don't quite see how to make it work when ia is 8 bytes
(since it seems unreasonable to define the quotient of a 8 byte
integer by a 4 byte integer to be 4 bytes and if it is defined
to be 8 bytes it is then unsafe to use the usual 8 byte by 4 byte
giving 4 byte quotient instruction).
James B. Shearer
Almost right. A scalar population count instruction was not new to the CRAY-1,
since the CDC 6600 and descendents all supported it ("CXi Xk", opcode 47).
What was missing on some CRAY-1As and CRAY-1Bs was a vector population count
instruction. The S and M models do support that.
The CRAY-1 (and X-MP/Y-MP followons) *is* missing a vector leading zero count
instruction, though it's there in scalar form. Need a CRAY-2 for the vector
version.
Both operations are quite useful, as has been noted in comp.arch often in the
past.
It does seem to be widely used indeed; as far as I know the inmos transputer
T800 has this instruction, along with other instructions like inverting
the sequence of the bits in an int etc. I think those are used for
CRC calculation and picture processing, though I'm not sure about that.
--
Volker Herminghaus-Shirai (v...@britesun.radig.de)
panic: 80x86! Trying to vomit...
> Then, the only stumbling block would be the flexible syntax for
> procedure calls, which could be addressed with a preprocessor rather
> than an entire language.
The problem in CALLS would not be that great, but it is great in
replacement statements, inlining, transfer decisions and unusual
tests, allowing user-defined syntax and operators, explicit temporary
variables, and others. There might even have to be a path analysis
within a statement or short block.
The other problem would be that of weak typing. It must be possible
to override types, and that must be machine dependent. The problem is
not with the complexity of the machine hardware, but with the inflexibility
of the assembler. Single machine instructions can be quite complex and
difficult to write in the current assembler notation. But I doubt if
GCC is even close to being able to handle the problem, or even G++.
> Am I missing something here, or is this problem really as easy to
> solve as it seems? -- Darren
The problem is somewhat more difficult, but I believe if one restricts
optimization to the back end it could be done. Front end analysis is
much more difficult, and should not be allowed to swamp the problem.
................
> Regarding a function to get the 8 byte product of 2
> 4 byte integers I don't see why this is any worse than Montgomery's
> function. In any case it is not needed. Let i4,j4 be 4 byte
> i8,j8,k8 be 8 byte. Then write
> i8=i4
> j8=j4
> k8=i8*j8
> A sufficiently intelligent compiler will do the right thing. It may
> work to write
> k8=i4*j4
> This sometimes works in the analogous case where k8 is 4 byte and i4
> and j4 are 2 bytes. However I am not sure strictly speaking that it
> should. What does the Fortran standard say?
The first approach would require a very intelligent compiler, indeed.
The second approach is the "right" one. However, every standard I have
read would try the first. Since i8*j8 might not even exist on the machine,
at best an inlined function would have to be brought in instead of the
single hardware instruction for k8=i4*j4.
I have never seen any language description in which the type of the result
in a replacement statement affected what operation is performed. The closest
I know is the C cast notation, which may or may not work,
k8 = (int8)i4*j4.
> As for getting both q and r this is easy just write
> iq=ia/ib
> ir=mod(ia,ib)
> A reasonable compiler will only generate 1 divide with remainder.
> I will confess however that while this works when everything is
> 4 bytes I don't quite see how to make it work when ia is 8 bytes
> (since it seems unreasonable to define the quotient of a 8 byte
> integer by a 4 byte integer to be 4 bytes and if it is defined
> to be 8 bytes it is then unsafe to use the usual 8 byte by 4 byte
> giving 4 byte quotient instruction).
> James B. Shearer
I do not see why the case of ia being 8 bytes is any worse. Sure, there
is the problem of overflow, but so what? Also, it should not ever be
necessary to do something as complicated as the mod(ia,ib) notation for
basic operators, or even not so basic ones, any more than it should be
necessary to write add(a,b) for a+b. But what is really needed in
notation is something like
iq,ir = ia/ib.
In fact, the old literature (and old machines!) are full of "bizarre"
instructions that proved ill-advised. A few examples:
- "uninterruptible" instructions which complicated the memory
interface to the point of slowing down normal accesses.
- addressing modes which studies could not find a single use of.
(PDP-11 autodecrement deferred - omitted from the VAX.)
- my personal favorite, the Star-100 instruction which took
Fortran source (a vector of characters), and returned a vector
with: comments and sequence numbers stripped; continuation lines
catenated; and non-significant blanks elided. This was done
because of the then-famous observation that the XPL compiler
spent more time in "get next character" than in any other
routine. The drawbacks, however, are serious. How to add new
compiler directives? How to map line numbers back to the
original line numbers, so as to have decent error messages? And
most important, why bother, since modern compilers now spend
their time elsewhere.
- The Symbol machine, which put the entire compiler in hardware.
Do not be fooled by all the laudatory retrospective articles.
They were written by the machine's designers, and the machine's
gross failure is underdescribed in the literature.
I collect these little bird droppings of history. Any other favorites
(that hold lessons)?
--
Don D.C.Lindsay .. temporarily at Carnegie Mellon Robotics
Pop count. Count the set bits in a word. CXi on the CDC Cyber series.
Just how important is this? It's not so hard to duplicate in C.
There are two ways that I know of, one very clever (it's in MIT's
hackmem), the other using a look-up table. The better one uses only
11 instructions, all in registers.
How fast was the CXi instruction compared to other instructions on the
Cyber?
unsigned
Ones(mask) /* HACKMEM 169 */
unsigned long mask;
{
    unsigned long y;

    y = (mask >> 1) & 033333333333;
    y = mask - y - ((y >> 1) & 033333333333);
    return (((y + (y >> 3)) & 030707070707) % 077);
}
(DEC3100, cc -S -O)
Ones:
.option O2
.frame $sp, 0, $31
li $5, -613566757
srl $14, $4, 1
and $3, $14, $5
subu $15, $4, $3
srl $24, $3, 1
and $25, $24, $5
subu $3, $15, $25
srl $8, $3, 3
addu $2, $8, $3
and $2, $2, -954437177
remu $2, $2, 63
j $31
.end Ones
this is what I would probably do if I didn't know about the above; it
uses 23 instructions, including 4 loads.
unsigned
count_ones(x)
unsigned long x;
{
    static int table[256];   /* table[i] = number of set bits in byte i */

    return table[x & 0xff] +
           table[(x & 0xff00) >> 8] +
           table[(x & 0xff0000) >> 16] +
           table[(x & 0xff000000) >> 24];
}
count_ones:
.option O2
.frame $sp, 0, $31
la $3, $$4
and $14, $4, 255
mul $15, $14, 4
addu $24, $3, $15
lw $25, 0($24)
and $8, $4, 65280
srl $9, $8, 8
mul $10, $9, 4
addu $11, $3, $10
lw $12, 0($11)
addu $13, $25, $12
and $14, $4, 16711680
srl $15, $14, 16
mul $24, $15, 4
addu $8, $3, $24
lw $9, 0($8)
addu $10, $13, $9
and $11, $4, -16777216
srl $25, $11, 24
mul $12, $25, 4
addu $14, $3, $12
lw $15, 0($14)
addu $2, $10, $15
j $31
.end count_ones
--
IBM
Scott Draves Intel
sp...@cs.cmu.edu Microsoft
The original machine for this was the CDC6600. Some of this is from memory,
but any minor errors here are unimportant.
The machine had multiple functional units, which operated independently.
There was a scoreboard which kept track of the demands on the units.
The fastest rate at which instructions could be issued was 1/cycle,
although no operation took only one cycle. CXi Xj could be issued
if Xj was not busy, and started its 8 cycles of execution when Xi
was available. A transfer of any kind stopped issuance until the
transfer was carried out. There were other timing considerations.
The shortest competing program in the deleted part of the posting
had 11 instructions. It also used quite a few of the quite scarce
supply of registers, which would interfere with overlapping this
operation with others. Most of these operations took at least 3 cycles.
So a factor of 4-30 would be appropriate for the effect of not having
the instruction.
This insight takes me back to working with CDC 6000-series machines, where
a "population count" instruction permitted some pretty outrageous algorithms
to do some neat things, including some shift and mask tricks which revealed
the highest-order bit set in a (60 bit) word. This was very useful for
dealing with bitstrings.
Of course, at times things got a little carried away; folks used the floating
normalize instruction to, among other things, calculate disk block numbers.
In essence, I have found that this instruction is an optimization; but as
long as it's kept in the "implementation details" (innermost loops of
high level constructs) it's cool.
After all, we're not usually in the business of sweating MIPS instruction
streams or realtime Multiflow (RIP) instruction stuffing.
--
Dave Fuller
Sequent Computer Systems Think of this as the hyper-signature.
(708) 318-0050 (humans) It means all things to all people.
dafu...@sequent.com
On the high end ones, about the same as most of the other instructions:
two to three cycles. On the slowest machine, it took upwards of 30 cycles,
I think.
>The fastest rate at which instructions could be issued was 1/cycle,
>although no operation took only one cycle.
True. Most took two or three. (Multiply took 5, divide took 70.) Note
that I use the CDC Cyber 170/760, since that's the machine I have the most
experience with.
>The shortest competing program in the deleted part of the posting
>had 11 instructions. It also used quite a few of the quite scarce
>supply of registers, which would interfere with overlapping this
>operation with others.
The problem with using the tables is that the Cyber had no byte-access
instruction (it was also a 60-bit machine, with 18-bit addressing, and *no*
paged memory). If it had byte-access, it would be possible to come up with
a table and sequence of instructions that would be a lot slower on the fast
machines, but only a little bit slower on the slow machines. If CDC had
ever gone the route of lots of other manufacturers, it would have been
likely that, at some point, there would have been a machine that could
execute the loop faster than the CXi instruction.
Oh, yeah, one other thing: I've been told that the pop-count instruction
used some of the parity-checking hardware; since it was there anyway, I
wouldn't be surprised if this were true. However, I don't know enough about
the internals of the machine.
Oh, yeah. A point for Herman to consider: the FORTRAN compiler implemented
this instruction in almost exactly the same way gcc would implement it: it
was an intrinsic it knew a little bit about; it knew, for example, that it
would return the same value for the same input. But that's all it knew. In
gcc, I would do
static inline const int
pop_count(int x) {
    int temp;
    asm ("CX%0 %1" : "=X" (temp) : "X" (x));
    return temp;
}
The compiler would never generate code that used it, unless you used the
pseudo-routine COUNT (I think that was it).
> As for getting both q and r this is easy just write
> iq=ia/ib
> ir=mod(ia,ib)
>A reasonable compiler will only generate 1 divide with remainder.
I have done numerical analysis work and multi-precision arithmetic on many
different machines and operating systems.
Only ONCE have I ever seen a compiler bright enough to do the above
optimization. (Burroughs 6700; a stack machine; Algol 68)
Furthermore, what one usually wants is (A*B + C)/D and (A*B + C) mod D.
(see Peter Montgomery's requests)
Even on machines that support double length integer multiplies, one
cannot put the above operations into an HLL because the compiler will not
generate the double length multiply (say 32 x 32 --> 64) nor will it
then do the (64/32 --> 32 bit quotient & remainder). Since A*B can overflow
32 bits one is FORCED to call assembler routines to do this. Calling an
assembler routine for simple arithmetic is EXPENSIVE because of subroutine
call overhead.
>I will confess however that while this works when everything is
>4 bytes I don't quite see how to make it work when ia is 8 bytes
>(since it seems unreasonable to define the quotient of a 8 byte
>integer by a 4 byte integer to be 4 bytes and if it is defined
agreed in general.
>to be 8 bytes it is then unsafe to use the usual 8 byte by 4 byte
>giving 4 byte quotient instruction).
Typically when doing multiple precision work it is easy to guarantee that
(in the above example) D is greater than A and B [D is typically the RADIX],
hence one knows that (AB+C)/D will fit in 4 bytes.
I ** believe ** that as computer security and secure communications become
more important to users that we will see such things in the future. These
sorts of computations are usually essential in cryptographic algorithms.
As the market starts to demand support for cryptographic applications we will
see support for such arithmetic.
--
Bob Silverman
#include <std.disclaimer>
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"
>In fact, the old literature (and old machines!) are full of "bizarre"
>instructions that proved ill-advised. A few examples:
>
> [ some good examples removed]
>
>- addressing modes which studies could not find a single use of.
> (PDP-11 autodecrement deferred - omitted from the VAX.)
This example differs somewhat from his others, which were basically
deliberate design decisions that are not now generally seen as the
wise way to go.
But the autodecrement deferred addressing mode arose more out of
symmetry (== simple instruction decoding) than somebody saying "gee,
I'll bet that would be a useful addressing mode". The pdp-11 had
autoincrement/autodecrement, and it had deferred/non-deferred. The
"ill-advised" combination of autodecrement deferred was just a result
of symmetrical instruction decode logic.
DEC's pdp-10 instruction set had the same oddities. The machine had
a plethora of no-op equivalent instructions (some of which referenced
memory and some of which did not), not because the designers thought
it would be advisable to have lots of no-ops, but because the very
simple, highly gate-efficient instruction decode logic took advantage
of instruction-set symmetries.
(I think I have notes around here that say the instruction decode for
the original instruction set was done in something like 700 gates - but
don't quote this; I may have the number wrong ...)
The whole point is just that unused instructions or opcodes don't
necessarily equate to stupidity or lack of foresight on the part of
the designers. Sometimes they're really a result of cleverness :-).
Terry R. Friedrichsen
te...@venus.sunquest.com (Internet)
uunet!sunquest!terry (Usenet)
te...@sds.sdsc.edu (alternate address; I live in Tucson)
Quote: "Do, or do not. There is no 'try'." - Yoda, The Empire Strikes Back
-- Lennart Augustsson
[This signature is intentionally left blank.]
Ah yes. Clearly the following does not work....
/*
* return quotient and remainder from (a*b + c) divrem d
*/
#if 0
/*
* This is the way we would like to do it, but gcc emits one extra
* instruction, as it is not smart enough to completely eliminate the
* addressing on r (it uses a register for r, rather than a pointer,
* but never quite goes all the way).
*/
static __inline int divrem(int a, int b, int c, int d, int *r) {
int q;
double tmp; /* force reg pair allocation */
asm("emul %1,%2,%3,%0" : "=g"(tmp) : "g"(a), "g"(b), "g"(c));
asm("ediv %3,%2,%0,%1" : "=g"(q), "=g"(*r) : "r"(tmp), "g"(d));
return q;
}
#else
/*
* So instead we will use rather peculiar gcc syntax.
* Note that the macro uses a, b, c, d, q, and r exactly once each,
* and thus side effects (*p++, etc.) are safe.
*/
#define divrem(q, r, a, b, c, d) ({ \
double divrem_tmp; \
asm("emul %1,%2,%3,%0" : "=g"(divrem_tmp) : \
"g"(a), "g"(b), "g"(c)); \
asm("ediv %3,%2,%0,%1" : "=g"(q), "=g"(r) : \
"r"(divrem_tmp), "g"(d)); \
})
#endif
int a[100], b[100], c[100], d[100];
int q[100], r[100];
void doit(int n) {
int i;
for (i = 0; i < n; i++) {
#if 0
q[i] = divrem(a[i], b[i], c[i], d[i], &r[i]);
#else
divrem(q[i], r[i], a[i], b[i], c[i], d[i]);
#endif
}
}
But wait! Maybe, just *maybe*, we should try it out before dismissing it.
Well goll-ee, it seems to work!
When compiled on a Tahoe (the Tahoe is a `RISC'---a `Reused Instruction
Set Computer'; its emul and ediv are just like those on the VAX) with
`gcc -O -S' this compiles to (compiler comments and other drek stripped):
_doit:
.word 0x3c0
movl 4(fp),r4
clrl r2
cmpl r2,r4
jgeq L13
movab _a,r9
movab _b,r8
movab _c,r7
movab _q,r6
movab _r,r5
movab _d,r3
L12:
emul (r9)[r2],(r8)[r2],(r7)[r2],r0
ediv (r3)[r2],r0,(r6)[r2],(r5)[r2]
incl r2
cmpl r2,r4
jlss L12
L13:
ret
(Note that the Tahoe does not have auto-increment addressing modes,
and this is in fact the best that can be done.)
On the VAX the loop changes to (gcc 1.37.1, -fstrength-reduce -mgnu):
L12:
emul (r2)+,(r3)+,(r4)+,r0 # the registers
ediv (r5),r0,r1,r0 # are allocated in
movl r1,(r6)+ # a different order.
movl r0,(r7)+
addl2 $4,r5
jaoblss r9,r8,L12
Apparently the machine-dependent part has not been taught to combine
`ediv' properly; it should be:
L12:
emul (r2)+,(r3)+,(r4)+,r0
ediv (r5)+,r0,(r6)+,(r7)+
jaoblss r9,r8,L12
A bit of work on vax.md should fix it.
This has its drawbacks: the syntax is distinctly un-pretty, and it
requires gcc, and it is machine-dependent. It does, however, work.
--
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA Domain: to...@ee.lbl.gov
I disagree !
Below is the output produced by gcc on a 386. It looks like gcc is
able to handle both of these cases.
It does cheat some, in the multiply it only uses the lower 32 bits of the
result (even though the machine does 64). This would be a fairly simple
problem to fix.
===== sample 1 =====
int i,j,k,l;
main()
{
i = k/l;
j = k%l;
}
.file "showem.c"
gcc_compiled.:
.text
.align 4
.globl main
main:
pushl %ebp
movl %esp,%ebp
subl $4,%esp
movl k,%eax
cltd
movl %eax,k
idivl l
movl %eax,i
movl %edx,j
leave
ret
.comm l,4
.comm k,4
.comm j,4
.comm i,4
==== sample 2 ====
int a,b,c,d,j,k;
main()
{
j = (a*b+c)/d;
k = (a*b+c)%d;
}
.file "showem2.c"
gcc_compiled.:
.text
.align 4
.globl main
main:
pushl %ebp
movl %esp,%ebp
movl a,%eax
imull b,%eax
addl c,%eax
cltd
idivl d
movl %eax,j
movl %edx,k
leave
ret
.comm k,4
.comm j,4
.comm d,4
.comm c,4
.comm b,4
.comm a,4
======
:This has its drawbacks: the syntax is distinctly un-pretty, and it
:requires gcc, and it is machine-dependent. It does, however, work.
Ah yes. A universal solution. Machine dependent. Ugly syntax. Only
works with one compiler.
Furthermore, it does nothing to contradict the assertions I made above
that the arithmetic could not be done in the HLL and that assembler
routines had to be called. It does inline the subroutine, however.
In article <10...@dog.ee.lbl.gov> I demonstrate that the last claim
is false.
>>This has its drawbacks: the syntax is distinctly un-pretty, and it
>>requires gcc, and it is machine-dependent. It does, however, work.
In article <1991Feb25....@linus.mitre.org> b...@faron.mitre.org
(Robert D. Silverman) writes:
>Ah yes. A universal solution. Machine dependent. Ugly syntax. Only
>works with one compiler.
Do you want a machine-independent solution for using the machine
instruction(s) that compute(s) (A * B + C) divrem D?
This is clearly impossible, since some machines have no such instructions.
Do you want a machine-independent (but language-dependent) syntax for
expressing `(A * B + C) divrem D', and one which does not require a
subroutine call per operation?
This is what I provided. If you believe you can improve on the syntax,
go to it; knock yourself out. (Incidentally, the only reason it `only
works with one compiler' is that only one compiler compiles the
particular language GCC implements. GCC is not a C compiler [well, it
is if you run it with `-ansi -pedantic']. There is, however, no
constraint that says that GCC's language may not be adopted by other
compilers.)
I dare say you do not want a language-independent syntax, since `syntax'
and `language' are tightly-interwoven concepts.
Do you want machine-independent semantics, given some syntax, for
(A * B + C) divrem D? I believe this not to be the case (since I
believe that you would prefer 64-bit A,B,C,D with 128-bit intermediate,
which I believe you believe is not available on many machines).
>Furthermore, it does nothing to contradict the assertions I made above
>that the arithmetic could not be done in the HLL and that assembler
>routines had to be called. It does inline the subroutine, however.
I have only a vague idea what `could not be done in the HLL' means.
Did I use a construct that is not provided by the compiler? Or is
`assembler routines had to be called' the key phrase? Do you mean
`machine dependent instructions had to be emitted'? Of course this is
the case. Do you mean `the rules for emitting the machine dependent
instructions were not built into the compiler'? This depends on what
one means by `built in'. Is the following sin() routine `built in'
to the compiler?:
static __inline const double sin(double x) {
double result;
__asm("sin %1,%0" : "=f"(result) : "f"(x));
return result;
}
I claim that if this appears in <math.h>, it *is* built-in to the
compiler, despite the fact that the compiler reads these rules from an
external file rather than having them embedded directly in its
executable. I further claim that if you require an <mp.h> to exist and
to define a mul_add_div_rem operation, and make constructing a proper
<mp.h> part of `building the compiler' (just as constructing a proper
<math.h> and <string.h> and <stdlib.h> is already part of building any
hosted ANSI C compiler), that your mul_add_div_rem operation will be
`built in'.
Or maybe this is not what you wanted? What exactly was it you *did*
want? I know what *I* would want: I would want a syntax I could use to
write mul_add_div_rem operations such that:
- they were not `fragile' (i.e., could be inserted anywhere);
- they could directly use machine instructions that implement the
operation wherever such are available;
- they did not impose extra overhead;
and this is exactly what I constructed. Perhaps I could have done a
better job; but perhaps I would have spent more than an hour on it....
I would not propose this. I would propose defining the type of
an integer*4 x integer*4 multiplication to be integer*8. Then the right
thing will be done whether i8 is 8 bytes or 4 bytes long.
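The same proposal has a direct C analogue: widen one operand before
the multiply and ask for the full double-length product. Whether a
given compiler maps this to a single 32 x 32 -> 64 instruction is
implementation-dependent, but many do; the function name is
illustrative:

```c
#include <stdint.h>

/* C analogue of the integer*4 x integer*4 -> integer*8 proposal:
 * casting one operand to a 64-bit type requests the full 64-bit
 * product, which a compiler can map to a single widening multiply
 * instruction where the machine has one. */
static int64_t wide_mul(int32_t a, int32_t b)
{
    return (int64_t)a * b;   /* promoted to 64 bits before multiplying */
}
```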
Several people have discussed various instruction sequences with
the assumption that all fixed point instructions take the same amount of
time. While on many machines most fixed point instructions are 1 cycle,
this is generally not true of fixed point multiply and especially not
true of fixed point divide. Thus a compiler which uses 2 divides for
iq=ia/ib
iq=mod(ia,ib)
may be wasting the equivalent of 20+ typical fixed point instructions.
In fact if you know the compiler will do this the sequence
iq=ia/ib
ir=ia-iq*ib
will be much faster on many machines. The 11 instruction pop subroutine
used a divide with remainder. On many machines this instruction will be
considerably slower than the other 10 combined. If one is doing many
divides by the same integer p it will often pay to replace them with
multiplies by some "magic constant" based on the binary representation
of 1/p.
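The two-statement sequence above can be packaged as a helper; this is
a sketch of the idea (the name is illustrative), trading the second
expensive divide for a multiply and a subtract:

```c
/* Quotient and remainder with a single divide: on machines where
 * fixed-point divide costs tens of cycles, recovering the remainder
 * by back-multiplying is much cheaper than dividing twice. */
static void quot_rem(int ia, int ib, int *iq, int *ir)
{
    int q = ia / ib;        /* the one expensive divide           */
    *iq = q;
    *ir = ia - q * ib;      /* remainder via multiply and subtract */
}
```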
Regarding wrongheaded instructions I believe there are numerous
examples where a complicated instruction was implemented in such a way
that a sequence of simpler instructions doing exactly the same thing
would execute faster. Of course this may be an implementation problem
rather than an architecture problem.
James B. Shearer
> I further claim that if you require an <mp.h> to exist and
> to define a mul_add_div_rem operation, and make constructing a proper
> <mp.h> part of `building the compiler' (just as constructing a proper
> <math.h> and <string.h> and <stdlib.h> is already part of building any
> hosted ANSI C compiler), that your mul_add_div_rem operation will be
> `built in'.
And this is exactly what Bob Silverman wants (if I read his mind
correctly that is). As it is now, every user has to write his own for
every platform he encounters (and I have written some 40 up to now,
though through the use of subroutine calls, not through inlining in gcc).
And that was also the complaint of Peter Montgomery who has proposed such
a thing for the Fortran standard, and it is not there. The current state
of affairs is that it is not in the standard for most languages.
One restriction: Joe wants to write portable code. He moves between
machines a lot. He can't afford to spend the time to write Hamming
weight and high-word-of-product and ab + c divided by d and dozens more
operations in assembly language---in Cray assembly language, that is,
and VAX assembly language, and SPARC assembly language, and Convex
assembly language, and I think you get the point.
Now there are lots of fundamentally different ways to express the
Hamming weight in Z, none of them particularly direct. A compiler faced
with one of those ways will probably not recognize it, even if the
machine language has a ``pop count'' instruction. In contrast, a + b + c
is so easy to express in the language that nobody would think twice
about how to write it, and any compiler will do well with it.
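One of those fundamentally different ways is the classic parallel
bit-summing trick, sketched here in C; a compiler would have to
recognize this entire pattern before it could substitute a single
pop-count instruction:

```c
#include <stdint.h>

/* A portable expression of the Hamming weight: each step adds
 * neighboring groups of bits in parallel (2-bit sums, then 4-bit,
 * then 8-bit), and the final multiply accumulates the byte sums
 * into the top byte. */
static unsigned popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit sums */
    return (x * 0x01010101u) >> 24;                   /* total      */
}
```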
Observation 1: While Joe uses operations like a + b + c most of the time,
he needs Hamming weights too. He's always going to need some operation
that your chip doesn't provide or that your language doesn't understand.
Joe might be able to take a little time writing one special operation in
assembly language for several machines, but in even one library he finds
too many trouble spots to take all this time.
Language designers pay attention to this, of course. The designer of Z
happened to put in some ``weird'' operations that he liked---finding the
high word of a product, for example. But it takes a very long time for
languages to change. Joe can't just beg for Z1992 to have a feature and
expect to write portable programs using that feature next year. He can
only use features that have been defined for many years. Again, Joe
needs a lot more than Z provides.
Chip designers pay attention too. Operations appear in chips much faster
than they appear in languages. A new chip---call it the FUBAR---is quite
different from previous chips, and the FUBAR assembly language isn't
portable, but Joe can still take advantage of the chip because the FUBAR
compiler for Z hides all the lower-level details.
Observation 2: Whoever writes the optimizer for FUBAR---let's call this
guy Natas---*could* make every single FUBAR instruction available from
Z. All he has to do is make sure that there's *some* language construct,
perhaps ridiculously convoluted, that will compile to each instruction.
Observation 3: Natas rarely does this. I have yet to see any compiler
understand any portable expression of Hamming weights, even though this
wouldn't be very difficult. Even in the occasional case where the
compiler writer shows some evidence of having used the assembly language
he's compiling for, he rarely tells programmers what the optimizer can
and can't do.
Let's assume that Natas is not such a devil, and manages not only to
give his optimizer some way of using FUBAR's CXi operation and some way
of using FUBAR's DREM operation, but also to *document* (gasp) the
corresponding expressions in Z. Now Joe can write portable Z code that
will run fast on FUBAR, taking advantage of the chip's instructions. All
he has to do is follow Natas's instructions.
We've almost solved Joe's problem. He wants to take advantage of FUBAR's
CXi? He just writes his Hamming weight code in a particular way, and it
turns into a single instruction. He wants to take advantage of the VAX's
DREM? He just writes his (a * b + c) / d code in a particular way. He
wants to take advantage of FUBAR's DREM? He rewrites his (a * b + c) / d
code---oops.
Observation 4: FUBAR's Natas will never agree with DEC's Natas. It's
hard enough to convince a compiler writer to document his work; it would
be absolutely inconceivable that a single expression would be optimized
by every compiler. Sure, a few people might hit on the same expression
of Hamming weights independently, and standards might eventually spring
up. But there will always be differences.
In practice this isn't such a big problem. Joe just writes the code in
several different ways (he has to, if he wants a fighting chance) and
uses #ifdef or something similar to select the right solution for each
machine. But this is still a pain. There's no reason that the installer
on every new machine should have to figure out which definitions are
best in a big library with perhaps hundreds of orthogonal choices.
Machine description languages help but will never cover all the
operations that Joe needs.
Instead, why not make it the compiler's problem? After all, the
optimizer knows what operations it recognizes. Joe should just be able
to write ``use <incomprehensible Hamming weights algorithm #1> or
<incomprehensible Hamming weights algorithm #2> or <incomprehensible
Hamming weights algorithm #3>, whichever you [the compiler!] thinks will
be the fastest.'' The FUBAR compiler might recognize #1 and compile it
to a single instruction, throwing out the other choices. A 6502 compiler
will throw up its hands and choose randomly. In any case, Joe can write
the code efficiently, with no portability worries.
I claim that these two changes---having every assembly-language
construct available *somehow* (and documented by the compiler writer!)
from the portable language, and having the portable language support a
construct to choose between several equivalent sections of code---would
completely solve Joe's problem. There's no way that Joe can do better
than this.
Indeed, what would happen if a chip maker decided to support some new
instruction---say inverses mod m? The compiler writer would follow the
rules here, and provide some way in Z to use that instruction. If Joe's
lucky, he's already used exactly the same construct in his programs, and
he won't have to change anything to use the instruction. Otherwise all
he has to do is add an alternative with the right code for the new
compiler. His code will continue to work on other machines while taking
full advantage of the new machine. How can you beat that?
Side comments: I proposed a this-or-that language construct in
comp.lang.misc a few months back, and a few days ago convinced my
greatest critic that it was the right solution to these problems. Obtw,
no offense meant to compiler writers through this article.
---Dan
Yes. That's exactly the problem here: given an operation glorp, we need
to express glorp in a portable way with machine-independent semantics.
Worse than that: when a chip supports glorp, we should be able to write
the code so that it will be compiled the right way for that chip---but
we shouldn't have to lose portability. Worse than that: the language
designer probably never heard of glorp.
No matter how nearsighted the language designer was, the chip designer
has to be able to show his support for glorp. And no matter how
nearsighted the language designer and chip designer were, the programmer
has to be able to write glorp portably yet efficiently.
Is this too much to ask? I don't think so. See my previous article for
my thoughts on how this can and should be done.
> I further claim that if you require an <mp.h> to exist and
> to define a mul_add_div_rem operation, and make constructing a proper
> <mp.h> part of `building the compiler' (just as constructing a proper
> <math.h> and <string.h> and <stdlib.h> is already part of building any
> hosted ANSI C compiler), that your mul_add_div_rem operation will be
> `built in'.
That's exactly the wrong answer. Programmers can't wait for chip
designers, and nobody can wait for language designers. It has to be
possible to set up a good enough structure beforehand that (1) chip
designers can make innovations available without changing the language,
(2) programmers can use those innovations without changing the chips or
the language, and (3) programmers don't have to sacrifice portability
for efficiency. I believe this is possible, and requires only a small
addition to languages and a bit of discipline on the part of compiler
writers.
---Dan
I agree partly with both. The LANGUAGE should probably contain some such
instruction, and it certainly should allow the writing
E,F = (A * B + C) divrem D
or even
E,F = (A * B + C) / D
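C already has a small precedent for a two-result arithmetic operation:
div() in <stdlib.h> returns quotient and remainder in one struct. A
hypothetical (A * B + C) divrem D could take the same shape; the names
and the five-operand form below are illustrative, and the 64-bit
temporary leaves it to the compiler to find an emul/ediv pairing:

```c
/* Sketch of a divrem-style operation in the spirit of C's div():
 * both results come back together, so a compiler (or a hand-coded
 * asm version) can use a combined divide-with-remainder. */
typedef struct { long q, r; } divrem_t;

static divrem_t muldivrem(long a, long b, long c, long d)
{
    long long t = (long long)a * b + c;  /* double-length intermediate */
    divrem_t out;
    out.q = (long)(t / d);
    out.r = (long)(t % d);
    return out;
}
```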
> Do you want a machine-independent (but language-dependent) syntax for
> expressing `(A * B + C) divrem D', and one which does not require a
> subroutine call per operation?
>
> This is what I provided. If you believe you can improve on the syntax,
> go to it; knock yourself out. (Incidentally, the only reason it `only
> works with one compiler' is that only one compiler compiles the
> particular language GCC implements. GCC is not a C compiler [well, it
> is if you run it with `-ansi -pedantic']. There is, however, no
> constraint that says that GCC's language may not be adopted by other
> compilers.)
You have provided a means of producing the result, but not the flexible
allocation of that result in a particular compiler, using divrem for the
operator, but not allowing / (or // if the overloading gets too bad).
Silverman's complaint about the language not having it still stands,
nor is there any method provided for adding it to the language.
On the other hand, I have to agree with Torek about the implementation on
a given machine. This must depend on the hardware. But I can hope, with
Silverman, that if the language allows writing things like this, hardware
people might put them in.
A question for Torek: Suppose I want to do the above for a variety of
types, for example, have A, B, C, D, and F floating, but E integer. Can
I use the same overloaded operator symbol in both cases with his approach?
Can I do even more complicated overloaded situations, where the translation
needs to take into account the types?
[Rest of relevant article deleted.]
Not so fast. Our C3Ada-68K compiler and our Fortran VII-3200 compiler have the
transcendentals built-in. This means much more than just compiling the
calls inline. Built-in means that:
1. Constant evaluation is performed at compile time:
SIN(0.0) is replaced by 0.0, etc.
2. Common subexpressions and loop invariant code motion are
performed on these operations at compile time:
SIN(A) ... SIN(A) replaced by one call using a temporary.
3. Special-case optimizations are performed:
A**0.5 -> SQRT(A)
A**1.0 -> A
4. The compiler knows the argument profile for the routines.
In Fortran, SQRT(2) gets a warning and is converted to SQRT(2.0).
5. The calls generate inline code.
regards,
joe
--
Full-Name: Joseph M. Orost
Email: jo...@tinton.ccur.com
Phone: (908) 758-7284 Fax: (908) 758-7113
US Mail: MS 322; Concurrent Computer Corporation; 106 Apple St
Tinton Falls, NJ 07724
I don't know what Seymour Cray would say, but I have an application
that would really benefit from this.
When performing digital communication over HF (highly error prone
media) major problems are encountered.
1. Most HF modems are synchronous.
2. Due to the error proneness of the media, you will rarely get a hard (exact)
sync match.
3. Can't go to a short sync sequence because of high probability of false sync.
Solution, requires that you write a correlator to count the number of bits in
error in the sync sequence, and only accept it as a valid sync sequence if
the distance is less than a predetermined limit.
The distance is simply the number_of_ones(received_seq XOR sync_pattern).
I must admit that this is a very specific problem, but one that must be
performed once every bit time, and it would benefit greatly from having
such an instruction in hardware.
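The correlator test described above can be sketched as follows; the
function names and the loop-based number_of_ones() are illustrative --
the whole point is that a hardware pop-count would replace the loop:

```c
/* Sync detection by Hamming distance: the bit-error count between
 * the received word and the sync pattern is
 * number_of_ones(received ^ pattern); accept sync only when the
 * distance is within a predetermined error limit. */
static unsigned number_of_ones(unsigned long x)
{
    unsigned n = 0;
    while (x) {          /* one bit per iteration -- slow in software */
        n += x & 1u;
        x >>= 1;
    }
    return n;
}

static int sync_found(unsigned long received, unsigned long pattern,
                      unsigned max_errors)
{
    return number_of_ones(received ^ pattern) <= max_errors;
}
```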
Donald McLachlan
>Yes and no. Yes, most users expect sin(pi) to be zero. No, most
>users do _not_ expect sin(3.1415927) to be zero (or, whatever PI
>approximation you use). Any user who needs trig functions should be
People owning a HP28 or HP48 know this. This calculator indeed delivers 0 for
sin(pi), and 0.0548036651488 for sin(3.14159265359). Note that the latter
value is correct to 12 digits.
>People owning a HP28 or HP48 know this. This calculator indeed delivers 0 for
>sin(pi), and 0.0548036651488 for sin(3.14159265359). Note that the latter
>value is correct to 12 digits.
Oops. I was asleep.
The correct value of sin(3.14159265359) in radians (not degrees!) is of course
-2.06761537357E-13. The calculator uses 32-bit argument reduction, so that it
can compute 12-digit sine values up to 1E20.
I am embarrassed.
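The corrected figure is easy to check in double precision: for x just
above pi, sin(x) is approximately -(x - pi), so the 12-digit argument
must give a tiny negative result, not zero and not 0.0548. A minimal
check (the function name is illustrative):

```c
#include <math.h>

/* sin(3.14159265359) ~= -(3.14159265359 - pi) ~= -2.0676e-13,
 * since the argument exceeds pi by about 2.07e-13. */
static double sin_near_pi(void)
{
    return sin(3.14159265359);
}
```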
Not done by the above; you're one up.
> 2. Common subexpressions and loop invariant code motion are
> performed on these operations at compile time:
> SIN(A) ... SIN(A) replaced by one call using a temporary.
Done by the above; that's what the "const" qualifier on the function does,
and after inlining, gcc can see that there are no side effects. (I did
some experiments with a non-existent "foo" instruction... it works fine.)
> 3. Special-case optimizations are performed:
> A**0.5 -> SQRT(A)
> A**1.0 -> A
Not done by the above. I can see how the A**0.5 is useful; I can't see how
A**1.0 is - do programmers often write this?
> 4. The compiler knows the argument profile for the routines.
> In Fortran, SQRT(2) gets a warning and is converted to SQRT(2.0).
Done by the above. The prototype forces conversion to the right type
(although silently; there may be a warning flag you can turn on).
> 5. The calls generate inline code.
Of course, this is done by the above.
So, the only advantage you have to built-in transcendentals are compile-time
evaluation of constant expressions and some special cases.
I don't deny their usefulness, but claim it's not tremendous, and the cost
of a more complex compiler is not zero. If you support a cross-compiler,
I hope you worked hard to ensure you're doing target arithmetic and not
host arithmetic when evaluating transcendentals. Numerically unstable
applications might die horribly otherwise.
--
-Colin
Marc Auslander <ma...@ibm.com>
Way back in 1979 I had a job writing code for an LSI-11 board with
4K words of core memory and no O/S (it was basically an interrupt-driven
controller for interfacing a couple of array processors together),
and I found use for several of the weird autodecrement-deferred
instructions: including a "program" with one instruction word and
one data word that would zero all of memory and then halt, returning
control to the built-in ODT debugger. Can any PDP-11 programmers
figure out what it was?
--
Joe Buck
jb...@galileo.berkeley.edu {uunet,ucbvax}!galileo.berkeley.edu!jbuck
I am more than willing to write the assembler code for an operation
we will call FOO. FOO takes 3 arguments and returns 2 more.
I would then like to see the following in a HLL:
asm_routine(FOO(a,b,c,d,e))
The compiler will then fetch my assembler routine, and RESOLVE any
register/stack/addressing conflicts that might arise because the
assembler routine uses the same registers/addresses etc. that are
produced by the calling routine.
The compiler will straighten out any conflicts [renaming registers
within the assembler routine as needed etc.] and then in-line the
assembler routine.
It is true that I must rewrite FOO for every new machine, but because
FOO is time critical it is worth my trouble to do just that.
FOO would be written as an ordinary subroutine; it would assume that
the arguments to it had been placed on the stack the same as any other
subroutine call.
There are currently machines which have no multiply or divide instruction
at all, e.g. SPARC. When the compiler sees an operation a*b it generates
a subroutine call. Building in a 'divrem' operator into a language
could be done the same way. For machines that have the instruction there
is no problem. For those that don't, a subroutine call is generated.
The only difficulty in the above paragraph would be in deciding that
a 'divrem' operator was worth adding to the language. If enough people
can agree that it IS worthwhile, then actually doing it is a non-problem.
As for all of you who claim that 'divrem' (or 'foo' or any other special
operation) is not needed let me remind you of one thing: The current
SPARC does not have integer multiply or divide in hardware. The next
release (Version 8) will have it. If, as all of you have been claiming,
multiplication & division are so infrequent as to not be needed,
then why did SUN add it in? Now I can't say of my own personal knowledge,
of course, but ask yourself what user or set of users would have enough
clout [i.e. a big enough market] to convince SUN to add such a thing.
Hint: think 'cryptography'
Many of the instructions under discussion are essential for good performance in
cryptographic applications.
As more and more PC users hook into networks, computer security applications
will become more important. As will the necessary instructions to support
those applications.
Huh??? sin(3.14159265359) = .05480366... ????
This value is so far off as to be unbelievable.
When I punch sin(pi) into my HP I get -2.0676154e-13.
I would not TRUST a calculator that actually returned 0.0
I'm not Chris Torek, but the answer is: yes you can, with g++. g++ accepts
exactly the same code Chris wrote for gcc, and, since it is a C++ compiler,
allows functions and operators to be overloaded based on argument type.
brn...@kramden.acf.nyu.edu (Dan Bernstein) writes:
>Observation 2: Whoever writes the optimizer for FUBAR---let's call this
>guy Natas---*could* make every single FUBAR instruction available from
>Z. All he has to do is make sure that there's *some* language construct,
>perhaps ridiculously convoluted, that will compile to each instruction.
>Observation 3: Natas rarely does this. I have yet to see any compiler
>understand any portable expression of Hamming weights, even though this
>wouldn't be very difficult. Even in the occasional case where the
>compiler writer shows some evidence of having used the assembly language
>he's compiling for, he rarely tells programmers what the optimizer can
>and can't do.
>Let's assume that Natas is not such a devil, and manages not only to
>give his optimizer some way of using FUBAR's CXi operation and some way
>of using FUBAR's DREM operation, but also to *document* (gasp) the
>corresponding expressions in Z. Now Joe can write portable Z code that
>will run fast on FUBAR, taking advantage of the chip's instructions. All
>he has to do is follow Natas's instructions.
There's one problem with this: finite resources (if you or Herman
Rubin wants to pony up a dump-truck full of cash, maybe we can talk,
but I'll bet your resources are finite, too). I think you'll agree
that no compiler writer should spend time optimizing for the
machine-specific case until most of the optimization has been done for
the portable (standard-conforming) case. So, after we've taken care
of reduction in strength, global value numbering, constant
propagation, redundancy elimination, loop invariant code motion,
instruction selection, register allocation, scheduling, loop
unrolling, loop straightening, peephole improvements, software
pipelining, linear function test replacement, loop fusion,
stripmining, blocking, dead code elimination, tail call elimination,
leaf routine optimization -- oops, better algorithms appeared, so we
have to reimplement a couple of those -- oops, new machine, time to
fool around with the scheduling and instruction selection some more --
oops, time to do some fancy code placement to help out the cache --
oops, time to do interprocedural optimization. I think you get the
picture. Always, optimization efforts have to be directed towards
those things that will make the largest number of present and future
customers happy. When all the work is done for C, Fortran, Pascal,
Modula, and C++, is it better to expose non-portable machine-specific
optimizations to the programmer, or should we look at extending the
optimizer to be useful to Lisp, Ada, Cobol, RPG-III, Prolog, and
Jovial? Maybe we'd make people happier if the optimizer ran twice as
fast, or in half the memory. Maybe we should make use of some of the
dataflow algorithms to provide debugging feedback to the user
("there's no exit from this loop; perhaps it won't terminate?")
Another difficulty with the scheme that you describe is that you
really don't want people to be writing their code in strange little
idioms because that will engage some magic widget in the optimizer.
It's portable, but weird. That doesn't help readability or
debuggability, and it may not be portable into the future. If someone
rewrites the optimizer, the last thing they want to do is support
weird little hacks like that. If you must (and some people must),
then use something based on subroutine calls. That has the advantage
that (1) if no machine support exists, it is easy to plug in portable
code and (2) machine support often exists for assembly-language
implementations (for instance, say "man inline" on a Sun).
Of course, there are some things that "ought" to be recognized because
they are portable, but it still isn't clear that they are needed, or
worth the cost. For instance, on the Sparc we *could* spill register
%i7 (return PC) and use that register for other purposes in the main
body of a subroutine. Fine, except that we'd break all the debuggers,
including adb. That adds big costs to the final debugging that goes
on before a product (ours, or some software vendor's) product is
shipped. Probably not worth it.
David Chase
Sun
> 3. Special-case optimizations are performed:
> A**0.5 -> SQRT(A)
> A**1.0 -> A
Not done by the above. I can see how the A**0.5 is useful; I can't see how
A**1.0 is - do programmers often write this?
probably just about never. but that doesn't mean the compiler
shouldn't deal with it. A lot of code is generated by cpp (or lisp
macros), and therefore contains all sorts of stupidity. this is an
important point to keep in mind.
as an example, suppose I wrote the cpp macro
#define GammaCorrect(intensity, gamma) \
(pow((double)(intensity),1.0/(double)(gamma)))
A perfectly sane programmer might write GammaCorrect(i, 1.0).
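To see what the compiler actually receives, here is a minimal C sketch (the macro is straight from the post; the wrapper function and its name are my invention):

```c
/* The GammaCorrect macro from above.  With gamma = 1.0 it expands
   to pow((double)(i), 1.0/(double)(1.0)), i.e. pow(i, 1.0), which
   a pow-aware compiler can reduce to just i. */
#include <math.h>

#define GammaCorrect(intensity, gamma) \
    (pow((double)(intensity), 1.0/(double)(gamma)))

double identity_gamma(double i)   /* hypothetical caller */
{
    return GammaCorrect(i, 1.0);
}
```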
--
                        IBM
Scott Draves            Intel
sp...@cs.cmu.edu        Microsoft
I do not know what the first part means (languages do not contain
instructions: languages define syntax and semantics; compilers
translate programs [`sentences', or sometimes called statements,
particularly in applicative languages] in the language into
semantically-equivalent sequences of machine instructions). Do you
mean you want a language grammar that includes a syntax like the
above?
[me]
>>Do you want a machine-independent (but language-dependent) syntax for
>>expressing `(A * B + C) divrem D', and one which does not require a
>>subroutine call per operation?
>>
>>This is what I provided. ...
>You have provided a means of producing the result, but not the flexible
>allocation of that result in a particular compiler,
I have no idea what the phrase `flexible allocation of that result' means.
GCC will put all of the operands in machine registers.
>using divrem for the operator, but not allowing / (or // if the
>overloading gets too bad).
I.e., you do not like the syntax (which looks like a function call).
I happen to be a fan of functional notation; I think there is nothing
really wrong with
(setq a (multiply b c))
being used in place of
a = b * c
but this is, of course, a matter of taste, and it is true that in most
applicative languages (including C) there are syntax rules for `common'
mathematical operations (+ - * / remainder) and hence such `extended'
operations do not always `fit' aesthetically (even though no one seems
to mind the fact that sine, logarithm, etc, use functional notation).
I attach little importance to these (but then, I never use (a*b+c)/d).
>A question for Torek: Suppose I want to do the above for a variety of
>types, for example, have A, B, C, D, and F floating, but E integer. Can
>I use the same overloaded operator symbol in both cases with his approach?
Not in the language GCC provides. It is, however, possible in the
language G++ provides (a `C++-like' language; G++ uses the exact same
compiler machinery as GCC, hence will allocate registers properly).
(Actually, I am speaking through a metaphorical hat here, as I have
observed GCC to do a horrible job of register allocation on occasion---
whenever there are fewer machine registers than active values, it tends
to choose pessimally. This can be fixed, and no doubt will be in GCC 2.x.)
>Can I do even more complicated overloaded situations, where the translation
>needs to take into account the types?
Possible in G++, not in GCC. Unfortunately, the base C++ mechanisms
for type overloading often induce a combinatorial explosion in the
number of `elemental' routines you must write: e.g., to get
x assign y op z
for some two different types, you need four different operation
inline-functions (all named with the same overloaded operator), one for
(y,z)=type1, one for (y=type1,z=type2), one for (y=type2,z=type1), and
one for (y,z)=type2. (You do not need to worry too much about the
type of x, unless you want to.)
There are techniques to get around this (using `friend' and implicit
conversions) but when you are trying to control each bit and cycle the
compiler may not cooperate if you use them.
Note that G++ does not provide a special syntax for multiple results
(i.e., there is no syntactic analogue for the proposed
q, r leftarrow a divrem b
operation); you must still use functional notation (or use structure
return values and hope that the compiler writer took some care in the
register allocation code).
If you want flexible syntax, there is always Lisp, :-)
In article <61...@masscomp.westford.ccur.com> jo...@tinton.ccur.com
(Joe Orost) writes:
>Not so fast. Our C3Ada-68K compiler and our Fortran VII-3200 compiler has the
>transcendentals built-in. This means much more than just compiling the
>calls inline. Built-in means that:
>
> 1. Constant evaluation is performed at compile time:
> SIN(0.0) is replaced by 0.0, etc.
Not true above.
> 2. Common subexpressions and loop invariant code motion is
> performed on these operations at compile time:
> SIN(A) ... SIN(A) replaced by one call using a tempo.
True above. (Perhaps you were not watching when someone described
GCC's definition of a `const' function.)
> 3. Special-case optimizations are performed:
> A**0.5 -> SQRT(A)
> A**1.0 -> A
Inapplicable above (unless your compiler knows to replace
SIN(X) ** 2 + COS(X) ** 2
with `1' :-) ). [Incidentally, why has Herman Rubin not complained
that F8x does not allow writing `sine-squared x'? :-) ]
For the pow() definition in <math.h>, if it is written as
static __inline const double pow(double x, double y) {
    if (y == 0.5) return sqrt(x);
    if (y == 1.0) return (x);
    if (y == 0.0) {
        if (x == 0.0) return NaN();
        return (1.0);
    }
    ...
}
then optimization 3 is also true in GCC. (Note that most pow()
assembly-coded subroutines check for 0.5, 1.0, etc., so having the check
done in-line for variables does not cost extra time, except to
the extent that the extra code space costs extra time.)
> 4. The compiler knows the argument profile for the routines.
> In Fortran, SQRT(2) gets a warning and is converted to SQRT(2.0).
True above (though the warnings do not occur, even under -Wall).
> 4. The calls generate inline code.
True above. (why is this point 4 again? :-) )
I still haven't been object oriented, so I may be completely misunderstanding
things. However, couldn't one design an object for g++ that performs
the correct calculation, no matter how ugly its internals, and then
persuade the people familiar with the internals of g++ to make this
object a little closer to builtin? Would this bring us closer to
having a language that does the right thing? Wouldn't operator
overloading let us use sensible expressions for the operations?
dan herrick
herr...@iccgcc.decnet.ab.com
Or ML:
% sml
Standard ML of New Jersey, Version 0.66, 15 September 1990
val it = () : unit
- infix 7 //
- fun op // (a:int, b:int): (int*int) = (a div b, a mod b);
- val (x, y) = 15 // 7;
val x = 2 : int
val y = 1 : int
No operator overloading, though.
I submit it's better than most, and allows mostly what you'd like.
Also, gcc is very widely available, and might inspire others.
>Actually I would settle for the following:::
>
>I am more than willing to write the assembler code for an operation
> ----
>we will call FOO. FOO takes 3 arguments and returns 2 more.
>
>I would then like to see the following in a HLL:
>
> asm_routine(FOO(a,b,c,d,e))
>
>The compiler will then fetch my assembler routine, and RESOLVE any
>register/stack/addressing conflicts that might arise because the
>assembler routine uses the same registers/addresses etc. that are
>produced by the calling routine.
>
>FOO would be written as an ordinary subroutine; it would assume that
>the arguments to it had been placed on the stack the same as any other
>subroutine call.
>
>The compiler will straighten out any conflicts [renaming registers
>within the assembler routine as needed etc.] and then in-line the
>assembler routine.
So the compiler has to be able to parse either object files or
assembler source (I'm not sure which you mean), follow the data flow
(with possible aliasing) in the procedure-calling convention, know the
operand restrictions (does it need to be in registers or can it accept
a memory operand?) and implicit operands (HI and LO on the MIPS chip,
r0/r1/r2/... for VAX movc3/movc5 instructions, etc.) of every instruction,
and in-line it? I think this is awfully impractical.
The gcc solution, while syntactically messier, tells the compiler the
things it needs to know for optimisation purposes in a form it can
easily understand, and just supplies a string pattern to emit to the
assembler, so the compiler doesn't have to know how to parse assembler
input.
It can also handle new instructions the compiler has not yet been extended
to handle when a 32732/68050/80586/R5000/whatever comes out.
I respectfully suggest that you wouldn't like the results of what you
suggest either, like the people in Jon Bentley's story who wanted to
type a list of ~200 districts into the computer and get a random sample
(~5) out. He suggested perhaps having the computer emit a few random
numbers and looking them up on the paper list would be a lot easier
than typing in the paper list.
I also suggest that you don't want it embedded in the source that the
code is an assembler routine, so you can supply a high level version
if it becomes necessary.
--
-Colin
I am not even sure that it will be in the next release as a hardware
instruction. It may be noted that under SunOS 4.1 you can already use
the multiply and divide instructions, but they are emulated in software.
So although the architecture defines an instruction it need not be present
as a hardware option. Another example comes from the extended precision
floating point instructions. They are all emulated in software.
As an aside: I rewrote my routines so that they can either use a special
coded sequence or else the multiply and divide instructions. It was
interesting to see that use of the multiply and divide instructions slowed
the program down by a factor of about 3 to 4.
The HP 28 and 48 have a symbolic "pi" symbol that they propagate, only
converting to floating point when necessary. The trig functions all
accept it, returning 0.5 for sin(pi/6), etc.
--
-Colin
2) My time is still better spent doing general-purpose optimizations.
3) For hysterical reasons, an optimizer pre-mutilates the code, so
idioms may get transformed beyond recognition in a context-dependent
way.
4) There is ALREADY a tool that will do most of what you want. Say
"man inline" on some nearby Sun machine running a recent rev of the
operating system. "f77 -v -fast ..." might provide some guidance.
Given that an 80% solution exists, that makes solving the problem
again even less profitable.
5) Any "idiom" designed to trigger use of a specific instruction may
be portable for correctness, but it will not be portable for speed in
future releases of the compiler unless care is taken. Verifying that
it continues to provide the same function originally promised to
customers may well take more time than it took to add it in the first
place.
6) Since we have a multi-language optimizer, we might choose to do it
for some language other than C first. Fortran comes to mind.
>> If you must (and some people must),
>> then use something based on subroutine calls.
>
>That's fine. ``If you write the following routine: int sumones(x) int x;
>{ int a; int b; <blah, blah, blah> } then our optimizer will convert any
>call to sumones() into a CX instruction.
7) No, it is simpler to use the existing inliner tool to inline
assembly language, and the results are more certain. If there
tremendous demand, it goes into a library. Portable into the future,
and not dependent on the behavior of the optimizer.
8) Need is demonstrated with dollars. Other people back up their
requests (sometimes unreasonable requests) with dollars, so they get
serious attention paid to them.
David Chase
Sun
Inside a function that only uses local variables? Stop making up
excuses.
> 4) There is ALREADY a tool that will do most of what you want.
[ automatically inlining assembly language ]
Look, David, I cannot afford to write any large number of routines in
assembly language. Most of my code must work on several different
platforms without excessive rewriting. I have the time to write an
occasional routine in assembly language for a few machines, but this is
simply not cost-effective most of the time.
Now will you stop suggesting assembly language?
As for inlining: Yes, David, we are all aware that inlining is a Good
Thing. I am ignoring the inlining problem because a good language or a
good optimizer will automatically solve it. Enough said.
> Given that an 80% solution exists, that makes solving the problem
> again even less profitable.
Inlining assembly language is a 0% solution. My not-so-mythical
programmer, Joe, wants to write *efficient* yet *portable* code. It is
getting rather sickening to watch you proclaim the virtues of assembly
language as a ``solution'' when assembly language is *not* portable.
> 5) Any "idiom" designed to trigger use of a specific instruction may
> be portable for correctness, but it will not be portable for speed in
> future releases of the compiler unless care is taken.
In the article you responded to, David, I proposed a language construct
to let the compiler choose one of several sections of code. It is f'ing
obvious that this makes the program ``portable for speed.'' It is f'ing
obvious that I proposed the construct *exactly* because I am worried
about efficiency, and because I am trying to *reduce* the work of the
compiler writer and programmer in making code efficient. The only way
now to make portable code efficient is to spend years communicating with
language designers and chip writers. I want to reduce that time.
You are saying that you don't want to allow portable code to be
efficient on Suns in the first place. That's too bad for Sun.
> 7) No, it is simpler to use the existing inliner tool to inline
> assembly language, and the results are more certain. If there is
> tremendous demand, it goes into a library. Portable into the future,
You have a very twisted conception of ``portable.'' I suppose that
within Sun you believe that ``portable into the future'' means ``will
work under SunOS 5.*,'' but portability is defined by the real world.
I can't afford to write SPARC assembler and call it ``portable.'' Nor
can I write code dependent on a Sun library and call that ``portable.''
David, suppose you take my advice. You take even one instruction that
isn't easily usable from C. You rewrite the instruction as a C function,
no matter how slow or ugly that C function is. You teach your compiler
to recognize exactly that one expression of the function and convert it
into the instruction, and you document this optimization.
Now what have you done for the programmer? Joe doesn't have to bother
writing unportable assembly language any more. He can just copy your
function out of the manual into his code. On Suns, his code will be just
as fast as if he had used assembly language. And guess what? His code
will work, that's right, *work* on (gasp) *other machines*!
Before: Joe has to write assembly language, and he has to write the code
again if he even wants a fighting chance on other machines. If he wants
efficiency on other machines, he'll probably have to write assembly
language for them too.
After: Joe just copies what's in a manual, and his code will not only be
efficient on at least one machine, but it will work without change on
any other machine. If he wants efficiency on other machines, he may have
to copy out of their manuals too, but if he's lucky then two machines
will recognize the same expressions and he'll ``accidentally'' get
efficient code with no work. In any case he won't have to learn any
assembly language.
David, surely you agree that the After situation is much better than
Before. Surely you see that assembly language will never be portable,
while the ``idioms'' are not only portable, but may also be efficient on
many different machines.
I can accept that you don't like the prospect of adding a
``special-purpose'' optimization and documenting it. But don't you admit
that this would have tangible advantages?
Given that, do you really think it would take so much work? I don't know
how Sun's optimizer is structured, but at some stage it must have a
pseudo-code representation of a given basic block or other code section.
Can't it recognize a fixed code sequence at that point?
---Dan
(*ouch*) I fell into this trap, too, for a second. Then it hit me:
If your calculator is in "degrees" mode, then of course you will
get .0548037-, which is correct. "Pi degrees" ~= 3.1416/57.296 ~=
.0548311+ radians. And of course, for small x, sin(x) ~= x.
In "radians" mode, my Casio calculator says sin(pi) is 0.
+---------------
| I would not TRUST a calculator that actually returned 0.0
+---------------
Well, some calculators display 0 when all significant digits are zero and
all but the last guard digit are also zero. It keeps from scaring the naive.
I've gotten used to it...
-Rob
Look, Dave. Say you have a chip with 100 instructions, of which you use
90 in your compiler output. Pick one of the ten left: DREM, perhaps. How
long would it take you to write a DREM function in portable C? So you
write it just *one* way, and get your optimizer to understand that *one*
way, and document that *one* way. How long does this take? An hour,
perhaps, depending on the structure of your compiler. Can you spare 10
hours to make your compiler vastly more useful?
> Another difficulty with the scheme that you describe is that you
> really don't want people to be writing their code in strange little
> idioms because that will engage some magic widget in the optimizer.
> It's portable, but weird. That doesn't help readability or
> debuggability, and it may not be portable into the future.
Maybe you didn't understand what I said. By definition, everything done
here is portable. If your compiler understands <shazam!> to be DREM, and
<shazam!> doesn't work the same on another machine, then your compiler
is simply incorrect. Programmers need to write efficient code that
*will* be portable into the future, and my scheme is designed to allow
this.
As for maintainability... If the programmer *needs* Hamming weights,
then he's going to write the code for it anyway. Adding an equivalent
section of code for a different machine won't hurt the maintainability
of the original. In fact, I think a pick-this-or-that construct would
*help* maintainability, as it would let the programmer display his
original, comprehensible code along with each transformed version.
Do you have some religious argument against ``strange little idioms''?
Well, it's the compiler writer's fault---your fault---if the idioms are
strange. What's important to me is that they be portable. If you want to
cure the ``strangeness,'' publish a standard for what you consider to be
the right idiom, as I suggested before.
> If you must (and some people must),
> then use something based on subroutine calls.
That's fine. ``If you write the following routine: int sumones(x) int x;
{ int a; int b; <blah, blah, blah> } then our optimizer will convert any
call to sumones() into a CX instruction. You may only change the names
sumones, x, a, and b; you must use exactly the order of instructions
shown here, or our optimizer will not recognize the routine.''
This is all I'm asking for. Surely your compiler can do fixed
pattern-matching without trouble? Is this such a ``weird little hack''?
---Dan
Actually, this use of output type information to select operators is
relatively common in Lisp compilers, especially Common Lisp. It is
necessitated by the default assumption of indefinite-precision arithmetic.
In order to do fixed precision, you must somehow know the result is a
"small" integer.
The CMU Common Lisp compiler (Python) is (so far as I know) unique among
Lisp compilers in efficiently compiling full-word integer arithmetic (no tag
bits). And our BIGNUM code is implemented with this facility combined with
some additional primitives such as 32x32 -> 64. The 64 bit result is
returned as two 32 bit values, using the Common Lisp multiple value
mechanism.
It would be relatively straightforward to add support for larger
fixed-precision arithmetic.
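Since C has no multiple-value mechanism, the 32x32 -> 64 primitive mentioned above is conventionally sketched with output parameters; the function and parameter names here are mine:

```c
/* 32x32 -> 64 multiply, returning the product as two 32-bit
   halves -- the C analogue of the Common Lisp multiple-value
   return described above. */
#include <stdint.h>

static void mul32x32(uint32_t a, uint32_t b,
                     uint32_t *hi, uint32_t *lo)
{
    uint64_t p = (uint64_t)a * b;   /* full 64-bit product */
    *hi = (uint32_t)(p >> 32);
    *lo = (uint32_t)p;
}
```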
Robert A MacLachlan (r...@cs.cmu.edu)
As long as we are on the subject of numeric computation and accuracy of
math operations, can someone point me to a reference on 32-bit argument
reduction and how to implement routines like cosine to an accuracy of
arbitrary n-bits? ( In English, not Dutch :-) )
-yoshio
-------------------------------------------------------------------------------
Yoshio Nakamura Internet: yos...@ucrmath.ucr.edu
University of Calif., Riverside uucp: {ucsd, uci, ucdavis}!ucrmath!yoshio
I give up. The man just refuses to see that this does NOT produce
portable code. Sun writes one library, and DEC writes another one, and
Convex writes another one, and a programmer who uses these libraries
will NOT be able to run his programs on more than one machine.
> No need to promise that
> in the future, a special chunk of code will result in the use of a
> special instruction.
Nor does he see that the compiler writer NEVER has to make such a
promise. Again it is obvious that his viewpoint is twisted by his
Sun-specific environment. Too bad for him that the rest of the world
uses hundreds of different architectures.
If David opens his eyes he will see that what I am proposing will ALWAYS
improve the portability of efficient code. But because this requires a
bit of work on the compiler-writer's part, his eyes will remain closed.
> In addition, as I said before, there is ample evidence that time is
> better spent doing a good job of optimizing code in a general way. If
> it paid to do the things that you propose, it would have showed up in
> compiler tuning. People do look at the code that is generated, and
> try to come up with ways to make it better. Somehow, nothing like
> your idea even made it onto the list, let alone near the top.
Actually, David, I found out from e-mail that essentially the same ideas
have been published before, in many separate papers.
I still find it amazing that David can draw a line between optimizing
*(p+a) into a single instruction on a Convex, optimizing c = a/b and
d = a%b into a single instruction on most CISC chips, and optimizing a
short Hamming weight function into a single instruction on a Cyber.
---Dan
Dan, you're missing the forest for the trees, several times over.
Even if the general idea that you propose were worthwhile, your scheme
for accessing special-purpose assembly language instructions from
portable code is too complicated. It is gratuitously difficult for
both the compiler writer and the programmer. Compare:
Dan's method:
compiler writer writes recognizer for special code sequences.
compiler writer documents these code sequences.
programmer reads the code sequences.
programmer writes these code sequences.
Simpler method:
compiler writer writes library of inlineable code.
compiler writer documents names of routines that are in this library.
programmer uses those names.
if you insist, compiler writer writes portable versions of these
subroutines, though programmer would have had to do it anyway in
your method (working from documentation).
No need to recognize complicated code sequences, and easier to use.
The inliner sucks in the code automagically. Writing the portable
version is no harder than figuring out what pattern is recognized, but
you don't have to write a pattern recognizer, OR impose restrictions
on what transformations might be performed by the optimizer now or in
the future that might complicate pattern recognition (yes,
straight-line code can get oddly mutilated). No need to promise that
in the future, a special chunk of code will result in the use of a
special instruction.
Now, no doubt in your next twisted missive you will claim that is
exactly what you proposed, since, of course, a procedure call is a
special case of a code sequence. This behavior is already supported
by existing compilers -- there's several people who write inlineable
routines for Fortran -- so if this is what you meant, then you are
ignorant of the state of current tools. Since that cannot possibly be
the case, I conclude that this is not what you meant.
In addition, as I said before, there is ample evidence that time is
better spent doing a good job of optimizing code in a general way. If
it paid to do the things that you propose, it would have showed up in
compiler tuning. People do look at the code that is generated, and
try to come up with ways to make it better. Somehow, nothing like
your idea even made it onto the list, let alone near the top.
Perhaps you have an exaggerated opinion of your intelligence and the
importance of your problems. Just a guess. By the way, is
abbreviating "fucking" into "f'ing" considered politeness on the net?
Thanks for being so considerate.
Your pal,
David Chase
Sun
Just wondering: the usual hack to count bits is a table of, say, 256 entries.
Look up each byte and add the values together. If you know the sync sequence
ahead of time and can afford one table per byte of the sync sequence, you
can get a matched bits count directly by pre-diddling the table.
Another approach, since what you really want is to know whether there are
fewer than N bits set in the word, is to repeat distance &= (distance-1);
N times, and if the result is zero, you have a match. For small n, this
is quite fast.
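Both tricks are short in C. In this sketch (names mine) the table is filled at startup instead of being written out by hand; note that clearing the lowest set bit n times zeroes exactly those words with at most n one-bits:

```c
/* The two bit-counting tricks described above. */
#include <stdint.h>

static unsigned char bits[256];   /* popcount of each byte value */

static void init_bits(void)
{
    /* bits[i >> 1] is always filled in before bits[i] */
    for (int i = 1; i < 256; i++)
        bits[i] = (unsigned char)(bits[i >> 1] + (i & 1));
}

static int popcount_table(uint32_t w)
{
    return bits[w & 0xff] + bits[(w >> 8) & 0xff]
         + bits[(w >> 16) & 0xff] + bits[w >> 24];
}

/* nonzero iff w has at most n one-bits */
static int few_bits_set(uint32_t w, int n)
{
    while (n-- > 0)
        w &= w - 1;               /* clear lowest set bit */
    return w == 0;
}
```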
--
-Colin
I believe you forgot to quote the next line that said "if you insist,
provide portable implementations of the subroutines in that library".
Ergo, portable code. Or, the programmer can write the portable code
him- or herself; that's no harder than your atrocity. No harder for
programmers, simpler for compiler, easier to maintain, means BETTER.
Furthermore, note that your scheme for triggering special instructions
through the use of special code sequences doesn't do anything for all
the code that has already been written. No win means not better.
>Nor does he see that the compiler writer NEVER has to make such a
>promise.
If the documentation says "do this to get the compiler to emit the
splartzfooie instruction", that's a promise. It seems to be the case
that a non-trivial number of people with money will interpret ANY
discovered behavior as a promise, so imagine the significance of
documentation.
>Actually, David, I found out from e-mail that essentially the same ideas
>have been published before, in many separate papers.
References, please. I'd love to be enlightened.
>I still find it amazing that David can draw a line between optimizing
>*(p+a) into a single instruction on a Convex, optimizing c = a/b and
>d = a%b into a single instruction on most CISC chips, and optimizing a
>short Hamming weight function into a single instruction on a Cyber.
It's an easy distinction to make. I'll bet you'd loan someone a dime,
and not be too upset about not getting it back. I'll also bet that
you wouldn't feel the same way about $100. Same difference -- LOTS of
code can benefit from recognizing *(p+a), AND it is easy to recognize
using any number of fast algorithms (it is a simple tree pattern, and
no unification is required). Recognizing that two divisions can be
compressed into one is not so automatic, but it matters sometimes, and
it isn't too hard to catch most cases, given a basic block that's had
value numbers applied to it. This last case occurs far less often,
and I think it's harder to recognize (I'd have to go look up what a
Hamming weight function is. If it involves a loop, it is harder).
Given the low return for high investment, why bother?
So, the line is easy to draw: profitable, probably profitable,
probably unprofitable. You and Herman Rubin are not a large enough
market to justify more than a couple of minutes of work, especially
given that there is an alternative. There are far better wild geese
to chase.
[Lest anyone wonders why I persist, I'm practicing for the day when I
have to deal with a two-year old.]
David Chase
Sun
Good refs that many people should read include papers describing
the ECS project undertaken at IBM Yorktown in the mid-70's.
ECS was Experimental Compiling System. Very roughly, it
worked like this:
They defined an intermediate language (IL).
This was fairly interesting by itself, and rather more
extensive and complete than the norm.
They defined (built?) an extensive IL optimizer
Front-ends produced IL
Libraries were kept in IL form
There was the potential for extensive inlining
with reoptimization (and more inlining, ...)
New operations could be added by providing appropriate IL
definitions. With adequate inlining, constant propagation,
type propagation, etc... these new operations could be
extensively optimized (examples included string concatenation,
which is trickier than many imagine).
I have only a few references.
"A Case Study of a New Code Generation Technique for Compilers"
J. L. Carter
CACM, 1977
"A New Strategy for Code Generation --
The General Purpose Optimizing Compiler"
W. Harrison
IEEE Transactions on Software Engineering, 1977
"The Experimental Compiling System"
F. Allen, Carter, Fabri, Ferrante, Harrison, Loewner, Trevillyan
IBM Journal of R & D, November 1980
Unfortunately, the project apparently died after a few years.
Why? I dunno. Perhaps the success of the PL.8 work. Perhaps
it was too complex. Perhaps it didn't work. Perhaps funding or politics.
Personally, I was frightened by the complexity of their
intermediate language. I also wonder how to control
the huge potential inlining. Can compile-times be controlled?
I think many people suggesting various forms of compiler extensibility
can learn a lot by reading these guys. In particular, the case study
paper illustrates how operations can interact and how difficult it
can be to optimize the results.
>>I still find it amazing that David can draw a line between optimizing
>>*(p+a) into a single instruction on a Convex, optimizing c = a/b and
>>d = a%b into a single instruction on most CISC chips, and optimizing a
>>short Hamming weight function into a single instruction on a Cyber.
There are significant differences between the three cases.
The 1st is simply instruction selection based on a single tree.
The 2nd is much trickier. The 2 instructions don't share a common
root, so we don't have any easy place to start looking (for
efficient search). They do share operands, which can be used as
a big hint. Hence the commonality can often be detected with a
simple extension to value numbering (like Chase suggested).
The extension many people are arguing for:
c, d = a divrem b
is much more significant (at least to languages like C or Fortran).
Basically, you've suddenly defined a new language with significantly
different syntax. The front-end of the compiler will probably
have to be chopped up pretty extensively, though the optimizer
and code generator might survive unscathed.
Better than a massive hack job would be to look into languages
that allow various forms of multiple-assignment
(Mesa, Beta, Lisp, Clu, others?). Do some reading!
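Short of new syntax, the closest C approximation to `c, d = a divrem b` is a structure return -- the same shape as the standard library's ldiv() -- which a compiler can map onto a single divide instruction; a sketch:

```c
/* "c, d = a divrem b" approximated with a struct return, the same
   shape as ldiv() in <stdlib.h>.  Most divide instructions produce
   both quotient and remainder, so a good compiler can turn the two
   operations in the body into one instruction. */
struct divrem { long q, r; };

static struct divrem divrem(long a, long b)
{
    struct divrem dr;
    dr.q = a / b;
    dr.r = a % b;
    return dr;
}
```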
The 3rd is much harder again. Recognizing all the ways
I might code a particularly complex function as being equivalent
to some bizarre instruction is quite difficult -- probably
impossible.
It's also wrong-headed, for many reasons. Chase pointed out
that there's only a small payoff. Perhaps he didn't make his
argument clearly enough. Think of all the things that are wrong
with some particular compiler, and prioritize the list.
First is usually that it produces incorrect code in some
important case. It also compiles too slowly, uses too much memory,
and produces slow code. Why is the object code slow?
Well, we didn't do good enough strength reduction and
the value numbering has bugs. Our dependence analysis
can't handle subscript expressions that complex, and we
don't know how to block loops for the TLB yet.
Or whatever. These matter a lot and occur over and over
in everyone's code.
And there you sit, working on that special hamming-weight
function recognizer so Preston's code will run faster.
Besides being a generally undecidable problem (program equivalence),
Preston suddenly decides to change his code a little
or the architects redefine how the special hamming-instruction
works in the first place (say, how one of the condition
code bits is set).
Besides, is the instruction faster than those cute little
RISCish instructions anyway? Especially when the optimizer
realizes that you've already done half a hamming-thing earlier
and that it needn't repeat some of the instructions.
>>I give up.
Good idea. One of the cool things about the net is that it provides a way
we can ask questions and get quick answers. For example, I can ask
about the MIPS I-cache and usually get a precise answer from someone
who knows. It isn't helpful to tell John Mashey he's wrong about
how his machine works! Similarly, you aren't going to win arguments
with David (or Ben) Chase about how compilers work.
...
If David opens his eyes he will see that what I am proposing will ALWAYS
improve the portability of efficient code. But because this requires a
bit of work on the compiler-writer's part, his eyes will remain
closed.
...
I question whose eyes are closed here. Exactly how many compiler
writers do you expect to conform to your new ad hoc standard? In what
timeframe? If you rely exclusively on the kindness of compiler
writers, you will have zero portability.
Methods which start with class libraries (modules etc.), are
completely portable. Performance can be dialed in over time, and if
need be compilers altered. Portability is maximal; as even old systems
can provide the required functionality.
This is precisely why language features such as modules and operator
overloading were put into ISO DIS 1539 Fortran.
--
----------------------------------------------------------------
Keith H. Bierman kbie...@Eng.Sun.COM | k...@chiba.Eng.Sun.COM
SMI 2550 Garcia 12-33 | (415 336 2648)
Mountain View, CA 94043
[ ... mul-add-div-rem primitive should be defined in some header,
rather than the compiler providing, as GCC does, the primitives to
define it efficiently ... ]
dik> And this is exactly what Bob Silverman wants (if I read his mind
dik> correctly that is). As it is now, every user has to write his own
dik> for every platform he encounters (and I have written some 40 upto
dik> now, though through the use of subroutine calls, not through
dik> inlining in gcc).
This ought to be encouraged. I think that the FSF should be much more
about libraries than about commands. The emphasis on programs/commands
instead of reusable libraries is one of the tragic historic legacies of
Unix.
The technical question is which language architecture best supports
coding efficient libraries and efficient library use; then there is a
practical (economics, politics) question of writing those libraries.
dik> And that was also the complaint of Peter Montgomery who has
dik> proposed such a thing for the Fortran standard, and it is not
dik> there. The current state of affairs is that it is not in the
dik> standard for most languages.
But this is not something we ought to argue about in comp.arch; it is no
longer about how to define a language architecture so that it may take
advantage of specifics of the CPU architecture if possible.
What you want is not a technical solution, it is legislation. "There
ought to be a law..."; in your case it is one that makes compiler
writers provide your favourite primitives on all the platforms you want
to use.
--
Piercarlo Grandi | ARPA: pcg%uk.ac....@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: p...@cs.aber.ac.uk
Huh? It's purely a quality-of-implementation issue. Someone whose
compiler *cannot* produce DREM or CX or any other available instruction
has a poor compiler. And someone with a better compiler had better
document the idioms that his optimizer will understand, or programmers
won't be able to take advantage of the feature.
Are you claiming that a compiler that *can't* produce DREM on a VAX is
better than a compiler that *can*, all else being equal?
Sure, it takes a bit of time for the compiler writer to add this
feature. That's the disadvantage of any improvement to software: someone
has to write the code.
> If you rely exclusively on the kindness of compiler
> writers, you will have zero portability.
Huh? I've seen several responses that make the same claim. But it's
entirely wrong. The recognized idioms will typically be *different*
under different compilers, but since they're written in PORTABLE code
this is entirely irrelevant. If I write code for one idiom and use it
under a compiler that doesn't recognize the idiom, the code will still
work, and I'm certainly better off than if I had used assembly language
on the first machine.
> Methods which start with class libraries (modules etc.), are
> completely portable.
And this is completely useless. I've said this before, and I'll say it
again: I cannot wait for language designers to catch up. I need to be
able to write efficient, portable code NOW, not in ten years when a
better language is widely available.
Sure, it would be nice if C or Fortran had Hamming weights. But they
don't. So stop talking about portable libraries to solve these problems:
they don't exist.
---Dan
.........................
> And this is completely useless. I've said this before, and I'll say it
> again: I cannot wait for language designers to catch up. I need to be
> able to write efficient, portable code NOW, not in ten years when a
> better language is widely available.
I fully sympathize with you, but neither you nor I will be able to write
efficient, portable code NOW. We both WANT to, but I see no reasonable
prospect of this in the near future.
What I believe can be done in the near future is to produce a totally
non-optimizing macro translator, and let the user and compiler/assembler
do the optimizing afterwards. This is within present knowledge. But if
the operations one wishes to use are not even known by the language, and
cannot be added to it, even reasonable code cannot be written.
Followups to comp.lang.misc, please. This is not a hardware problem,
although I suspect that hardware would change to take into account
these instructions, now totally disregarded.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hru...@l.cc.purdue.edu (Internet, bitnet) {purdue,pur-ee}!l.cc!hrubin(UUCP)
All the ways, yes. But many of the ways, perhaps not.
I seem to remember running across a reference, about 10 years ago
(that's when I saw it, not necessarily when it was published),
that dealt with recognizing common sequences of operations in APL
and compiling them into particularly efficient code.
I think they called it "recognition of programming idioms" or something
of the sort. Does this ring a bell with anyone?
> It's also wrong-headed, for many reasons. Chase pointed out
> that there's only a small payoff.
I think that, in the APL case, the payoff was somewhat higher because
it sometimes encourages the use of powerful yet expensive operations
to accomplish relatively simple tasks.
--
Brian Thomson, CSRI Univ. of Toronto
utcsri!uthub!thomson, tho...@hub.toronto.edu
>I seem to remember running across a reference, about 10 years ago
>(that's when I saw it, not necessarily when it was published),
>that dealt with recognizing common sequences of operations in APL
>and compiling them into particularly efficient code.
>I think they called it "recognition of programming idioms" or something
>of the sort. Does this ring a bell with anyone?
I'm not sure we're talking about the same thing, but it rings bells with
me. The Hewlett-Packard APL 3000 compiler did snazzy dynamic compilation,
allowing it to do optimizations of program fragments that actually
get executed, as opposed to working that hard on all of the dynamically
possible sequences. (See Van Dyke, E.J., "A dynamic incremental compiler
for an interpretive language," Hewlett-Packard Journal, July 1977, pp.17-24.)
This is philosophically similar to Chambers & Ungar's recent work on
the Self compiler, which uses aggressive inlining and interprocedural
optimization. It comes up with "customized" versions of code for
different kinds of call sites. This and the inlining allow a lot
more type analysis of code for a (*very*) dynamically-typed
language. That lets them optimize away most of the dynamic type
checks, and (importantly) optimize across things they otherwise
couldn't optimize across. Self is very, very fast for a dynamically-typed
language.
(Self is like Smalltalk, only more so. It doesn't even have classes,
like Smalltalk; it uses prototypes instead. The implementation has
an analogue of classes to get back the efficiency advantages. See
Chambers & Ungar, "Customization: optimizing compiler technology
for Self, a dynamically-typed object-oriented language," Proc. SIGPLAN
'89, pp. 146-160.)
It seems likely that some of the stuff from the APL 3000 compiler could
also be generalized to do a good job optimizing operations on parameterized
types and homogeneous collection types in a strongly-typed polymorphic
language like Modula-3 or Eiffel.
-- Paul
--
Paul R. Wilson
Software Systems Laboratory lab ph.: (312) 996-9216
U. of Illin. at C. EECS Dept. (M/C 154) wil...@bert.eecs.uic.edu
Box 4348 Chicago,IL 60680
Ding.
AUTHOR = "Feng Cheng and Scott Omdahl and George Strawn",
TITLE = "Idiom matching: An optimization technique for an {APL} compiler",
INSTITUTION = "Iowa State University",
YEAR = 1982
(good luck getting a copy)
They used 50 idioms, derived from work by (1) Perlis and Rugaber, (2)
Brown, and (3) Miller. I was under the impression that these idioms
covered a large portion (90-95%?) of most programs written in APL, but
I couldn't find that in a quick scan of the tech report. Cheng et al
give an example of an idiom that could be compiled to run 50 times
faster, if recognized.
The difficulty in their work is the time and space needed to construct
and compress the bottom-up matching tables. Since 1982, there has
been nice progress made in that area, and instead of taking 1000
seconds on a "NAS AS/6" (an IBM-clone mainframe) they can be computed
in about 4 seconds on a Sun 3, or about .3 seconds on a Sparcstation 2.
For APL, this may now be irrelevant -- I don't know much about other
improvements in APL implementation. For pattern matching, Cai, Paige
& Tarjan have come up with further improvements to speed up
the algorithm (n.b. I don't know if they handle "regular" tree
patterns yet).
Note that this belongs in comp.lang.misc, and followups will go there.
David Chase
Sun
I don't think this is practical. I have my favorites, and you have your
favorites, and when you take the overall favorites of programmers around
the world you have way too many operations for any one language to
support.
To put it differently, there will always be *some* primitive that a chip
designer implements or that a programmer wants, and that isn't in the
language. If the chip designer provides X and the programmer wants X but
the language doesn't understand X, the programmer is screwed.
What I do think is practical is having each compiler writer at least
provide a portable idiom for each instruction on his machine. (Portable
doesn't mean everyone will recognize the same idiom the same way; it
just means that the idiom is written portably in the language.) This is
much more manageable, and it ensures that the programmer will never hit
a wall between what the compiler lets him do and what he can accomplish
in assembly.
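As a concrete sketch of what such a documented idiom can look like (the 32-bit width and the shift-count masking here are my assumptions, not anything stated in the thread): the shift-or pattern below is one that several compilers map to a single rotate instruction, and it remains correct, portable C on compilers that don't recognize it.

```c
#include <stdint.h>

/* Portable rotate-left idiom.  The masks keep the shift counts in
   range, avoiding undefined behavior when n == 0 or n >= 32. */
uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}
```

A programmer copying this idiom "straight out of the manual," in Dan's phrase, loses nothing on machines without a rotate: the expression simply compiles to two shifts and an or.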
Note that some compilers go halfway, and provide unportable idioms
(e.g., directives). I find these to be no better than assembly language,
because I can't take a program with directives and use it without change
on another system.
---Dan
Oh, that's not nearly as exciting as autodecrementing the PC. (immediate
and immediate indirect mode were implemented as autoincrement and
autoincrement deferred on the PC)
TST -(PC)
(on the other hand, I would expect that autodecrement deferred would have
cost some to leave out, given the PDP-11 instruction coding).
--
Peter da Silva. `-_-' pe...@ferranti.com
+1 713 274 5180. 'U` "Have you hugged your wolf today?"
Yes. It's even more of a pain than just #ifdefing a bunch of inline
assembly. So why not just do that for all the obscure instructions
and have done with it?
......................
> Yes. It's even more of a pain than just #ifdefing a bunch of inline
> assembly. So why not just do that for all the obscure instructions
> and have done with it?
ALL the obscure instructions? How about the ones which we haven't
discovered yet?
>What I do think is practical is having each compiler writer at least
>provide a portable idiom for each instruction on his machine. (Portable
I'm NOT suggesting there is no use to the various kinds of inlining
and other ways to get access to machine instructions that exist,
as has been discussed.
HOWEVER, I would observe that:
a) People must be careful NOT to assume that function calls
are inordinately expensive. This is simply NOT true these
days. In particular, on most of the current RISCs, especially
when aided and abetted by good register allocators, function
calls only cost a few cycles; in particular, calls to
leaf routines (i.e., those that do not call others) are
usually very cheap, because they almost never save/restore
any registers.
b) Fast function calls are necessary for many purposes.
I cannot think of any popular architecture designed in the
last 5-8 years that hasn't paid at least some attention to
making fast function calls, one way or another.
c) They solve a LOT of the problem being discussed.
d) They do not solve 100% of the problem being discussed.
However, it is always worth doing the cycle count for a given
language and architecture choice between:
1) The BEST that you can possibly do, with hand-coding
2) The BEST you can do by writing leaf-level assembly
programs on the machine.
I believe that in most cases, measured over the complete
program, that case 2) does pretty well, especially because the
instructions you're trying to get to are often high cycle-count
operations, where the % overhead for getting them is low.
The one obvious exception is if you have low-cycle count
operations (like rotate, or population count, or byte-swapping,
or string-primitives, etc) that you'd like to get inlined,
and have no way to express directly. At least C is moving,
albeit slowly in the direction of better support for the
possibility of these things.
However, the bottom line still is: analyze the cycle count differences
to see if mechanisms are worth it. Sometimes they are, sometimes
they're not.
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: ma...@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086
Easy mistake to make -
His calculator was set to _degrees_ mode.
Set it to _radians_ mode, and rational results will
be returned.
I know, because I made the same mistake when this
point was first raised.
Don Corbitt
Microsoft
Mail flames, post apologies
I
Hate
Fascist
News
Soft
Ware
brnstnd> Look, David, I cannot afford to write any large number of routines in
brnstnd> assembly language. Most of my code must work on several different
brnstnd> platforms without excessive rewriting. I have the time to write an
brnstnd> occasional routine in assembly language for a few machines, but this is
brnstnd> simply not cost-effective most of the time.
Look, Dan. Say you have to port your code to 100 machines, of which 90 don't
have machine instructions for DREM anyway. How long would it take you to
write a DREM function in portable C? So you write it just *one* way, and use
it on those 90 machines. For the 10 machines that do have some sort of DREM
instruction, you hand code an assembly routine. How long does this take? An
hour, perhaps, depending on the machine instruction you need to execute. Can
you spare 10 hours to make your application run on 10 machines?
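A minimal sketch of what that one portable fallback might look like, for the combined divide/remainder operation discussed earlier in the thread (the names and the struct layout are mine):

```c
/* Portable q,r = a divrem b for nonnegative operands.
   On the few machines with a combined divide/remainder
   instruction, a short assembly routine replaces this. */
typedef struct { long quot, rem; } divrem_t;

divrem_t divrem(long a, long b) {
    divrem_t qr;
    qr.quot = a / b;
    qr.rem  = a - qr.quot * b;   /* one multiply instead of a second divide */
    return qr;
}
```

Writing it once in this form covers the 90 machines without the instruction; the remaining 10 get their hour of hand coding apiece.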
--
======================================================================
domain: taho...@csd.harris.com USMail: Tom Horsley
uucp: ...!uunet!hcx1!tahorsley 511 Kingbird Circle
Delray Beach, FL 33444
+==== Censorship is the only form of Obscenity ======================+
| (Wait, I forgot government tobacco subsidies...) |
+====================================================================+
Look, so-called RISC chips are hardly the mainstream. Function calls are
relatively expensive on most of the chips I use; the SPARC and the MIPS
are the only exceptions. On the other hand, I consider this whole
function-call issue to be orthogonal to the issue of having the compiler
provide portable access to all machine instructions. I assume that any
good language or optimizer will handle inlining.
> b) Fast function calls are necessary for many purposes.
> I cannot think of any popular architecture designed in the
> last 5-8 years that hasn't paid at least some attention to
> making fast function calls, one way or another.
How about the Astronautics ZS supermini? It's a hellishly fast
machine, and beats the Convex hands-down on non-vectorizable code.
It has many registers, to be sure, and a nice optimizer, but a typical
function call still takes several load-saves. As the machine has no
built-in stack support, and as every memory reference takes two
instructions, function calls are quite expensive. Inlining is a must for
fast code on that machine.
[ some opinions on whether the current situation works ]
> The one obvious exception is if you have low-cycle count
> operations (like rotate, or population count, or byte-swapping,
> or string-primitives, etc) that you'd like to get inlined,
> and have no way to express directly.
That's a pretty big exception.
As Peter points out, fast divrem would noticeably speed up integer
formatting, not to mention multiple-precision arithmetic. Pop count is
useful for algebraic coding (e.g., as in the Goppa public-key system).
Byte reversal (not that I know any chips that have it) is a costly step
in most applications of FFTs. Getting the high word of a multiply is
practically necessary for fast multiple-precision arithmetic and other
number-theoretic applications. And so on.
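For the high-word-of-multiply case, a hedged sketch in portable C (assuming a 64-bit integer type is available, which by no means all compilers of this era provide):

```c
#include <stdint.h>

/* High 32 bits of a 32x32 -> 64 multiply.  A compiler with a
   mulh-style instruction can collapse this to one operation;
   elsewhere it still works, just via a full 64-bit multiply. */
uint32_t mulhi32(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}
```

This is the primitive that multiple-precision packages keep rewriting per platform: the product's top half, which plain C multiplication of 32-bit operands simply throws away.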
---Dan
Oh, great, and when a library routine has a dozen things like this, and
a thousand people are using the library, you can expect that a few
hundred of them are using architectures that I've never seen. You want
each of them to spend dozens of hours just to make a library run fast?
You want the library to get slower and slower as the years go by and as
architectures change? You want to have a dozen ten-way #ifdefs that
every user has to worry about?
Contrast this with the situation I'm proposing. The compiler writers on
those 10 machines with DREM make sure to recognize and document *some*
portable idiom for the instruction. I don't have to learn any assembly
languages; I just have to copy the routines straight out of the manuals.
Since the language has a pick-this-or-that construct, I don't need *any*
#ifdefs. Since there are only so many obvious idioms for a given
operation, my code has a good chance of being perfectly efficient on an
architecture I've never seen. The users don't have to worry about a
thing. Even better, as time goes by, the code will get faster: compiler
writers will recognize more idioms, the idioms will slowly become
standardized, and the code will work efficiently without change.
---Dan
| ...On the other hand, I consider this whole function-call issue to
| be orthogonal to the issue of having the compiler provide portable
| access to all machine instructions. ...
I have been following this discussion quietly (and probably I should
not break that), and I have tried to understand this position, but
it still doesn't make sense to me. It resists definition, and
therefore it is easy to use it to make the discussion infinitely
long. There is something to be said for recommending that the
compilers pick up idioms and builtins and treat them
especially---there is a large opportunity for improvement there.
But it seems obvious that keeping the semantics straight and
guaranteeing a specific mapping to machine code are simultaneously
incompatible if pursued to extremes.
Let's see it this way: the other camp considers "portability" and
"efficiency" desirable, but sees them as non-binary and as
commodities to be traded. An occasional lack of portability to
ensure large local benefit is OK to this camp. Say, no one expects
GCC asm to be portable at all, but it gives a clear interface to
most current---not imaginary or prospective---hardware. Similarly,
a little efficiency can be lost for portability: your compiler can
be out and running on more machines faster if it doesn't recognize
the many special cases of pow() and just treats it as another
function to be called. As a user, you get the tools to make your
own tradeoffs.
How to make the decision? Maximum payoff first. If pow appears
often enough to warrant it, put some effort into it, etc. Don't get
stuck with extremes: maximum portability, maximum efficiency and
your pet syntax to boot. Get a lollipop.
Blah, this has been said a bunch of times already.
| As Peter points out, fast divrem would noticeably speed up integer
| formatting, not to mention multiple-precision arithmetic. Pop count is
| useful for algebraic coding (e.g., as in the Goppa public-key system).
| Byte reversal (not that I know any chips that have it) is a costly step
| in most applications of FFTs. Getting the high word of a multiply is
| practically necessary for fast multiple-precision arithmetic and other
| number-theoretic applications. And so on.
And so on indeed. The list is infinite by rights, should compiler
writers give the same ad-hoc treatment to each and every one of
those algorithms? This departs from giving ad-hoc treatment to each
and every machine instruction, but that confusion seems deeply at
the root of Bernstein's position.
--
Cesar Quiroz
In article <23481:Mar311:05:35...@kramden.acf.nyu.edu> brn...@kramden.acf.nyu.edu (Dan Bernstein) writes:
> Contrast this with the situation I'm proposing. The compiler writers on
> those 10 machines with DREM make sure to recognize and document *some*
> portable idiom for the instruction. I don't have to learn any assembly
> languages; I just have to copy the routines straight out of the manuals.
> Since the language has a pick-this-or-that construct, I don't need *any*
> #ifdefs. Since there are only so many obvious idioms for a given
> operation, my code has a good chance of being perfectly efficient on an
> architecture I've never seen. The users don't have to worry about a
> thing. Even better, as time goes by, the code will get faster: compiler
> writers will recognize more idioms, the idioms will slowly become
> standardized, and the code will work efficiently without change.
>
Compiler writers are very inventive in the idioms they recognize. Do not
expect two compiler writers to support the same idiom. And even then...
Peter Montgomery and Bob Silverman do request two slightly different
routines, in order to do (in essence) the same operation. We have the
BigNum package from INRIA/DEC, which requires a different operation again,
and next we have Henry Cohen from Bordeaux with again different requirements.
And I am doing some stuff in this field myself, and do something different
again. If the users apparently do not have a common view, how do you think
the compiler writers will come to a common view? I think that in view of this
it is not important to standardize on an interface with a DREM instruction
(should it operate on unsigned only, or signed only, or should both options
be present; all occur in existing hardware), it is more important to
standardize on an mp library. (Yes, I am working on such a thing, but at this
stage my package is far from complete; although it has been ported to some
40 platforms for fixed size integers.) More about this in my follow-up to
John Mashey.
--
dik t. winter, cwi, amsterdam, nederland
d...@cwi.nl
> b) Fast function calls are necessary for many purposes.
> I cannot think of any popular architecture designed in the
> last 5-8 years that hasn't paid at least some attention to
> making fast function calls, one way or another.
I have extensive figures about subroutine call overhead (in this case for
leaf routines). The background is a library of routines to do extended
precision integer arithmetic (quite apt in this discussion). I did port it
(both in C and in Fortran) to some 40 different platforms. For most, coding
in assembly was necessary to get performance. The source is extremely
flexible. It knows about different optimization levels, will compile for a
different number of bits used in a word, and so on. There is a completely
portable variant (optimization level 0) that only avoids all the known
compiler bugs (C and Fortran). It is fairly simple to get an optimized
version on a new platform (at least I think so). The problem is of course
that all routines using the library need to be recompiled to use the more
efficient version because the number of bits in a word can change.
Optimization level 1 implies assembly routines for (integer) multiplication
and division. It is however sometimes necessary to code these routines to
use floating-point arithmetic. This only reduces the number of bits in a
word that are used from 30 to 26 (on a 32 bit machine). This is needed on
the i860, which has no integer multiply and no divide at all. It
can be useful on others (e.g. MIPS). The routines vary in size from a few
instructions to quite a lot. E.g. the Vax multiply is under 10 instructions
while the SPARC divide is over 200. In this case the operations are performed
on the available registers, i.e. no registers are saved.
Optimization level 2 means that loops over the multiplications and divisions
are coded in assembly. This is roughly equivalent to inlining the simple
functions (although you would not do that if the simple function is fairly
large). In this case sometimes registers are saved because they are needed
for the loop.
(Level 3 and level 4 optimization are also defined, but not relevant here.)
Below follows a table for a number of machines that indicates the speedup
when optimization level 2 is used. Also in the last two columns it
indicates whether the base multiplication and division routines are small,
medium size or large. I use small when a single instruction could be used,
medium when the hardware provides step instructions (multiply step or divide
step) and large when bit fiddling should be done. Medium is also used if FP
operations are used.
Gould NP1 32.3% small small
Gould PowerNode 32.1% small small
Alliant FX/4 (68020 look-alike) 27.5% small small
Sony (68030) 27.1% small small
Vax 780 26.0% small small
Encore (32532) 24.9% small small
Sequent Balance (32032) 21.9% small small
Sun3 (68020) 20.4% small small
IBM RT/PC 18.6% medium medium
SGI 4D/25 (MIPS) FP 14.5% medium medium
Alliant FX2800 (i860) 13.8% medium medium
SGI 4D/25 (MIPS) INT 10.0% small large
Harris HCX7 (Tahoe) 3.8% small small
Sun4 (SPARC) 3.1% medium large
Two figures are given for the SGI. One for when complete integer arithmetic
is used, one when floating-point is used in part. Floating point is faster
by 12% on the SGI. The reason for the difference between Alliant FX/4 and
Sun3 is that the Alliant uses a caller saves calling sequence while the
Sun3 uses a callee saves sequence.
We see indeed that the subroutine call overhead decreases when subroutines
become larger. What is extremely surprising is the position of the Tahoe
in this table! This is not a modern machine by any means, nor is it RISC
(it is the smaller brother of the Vax). But while the base routines use
only 5 and 4 instructions respectively (4 only because of a bug? in the hardware; it
would have been 3 otherwise), we see nearly no effect of the CALL.
Also interesting is that Sony's 68030 has larger subroutine call overhead
than Sun's 68020. Approximately the same code was run on both machines.
The only differences are due to compiler differences. So this
contradicts John Mashey's remark that modern machines have lower
subroutine call overhead. That depends on your definition of modern.
What does this teach us?
Really nothing. Inlining may or may not help. On the Tahoe inlining will
help a bit, but not much. But on the MIPS with integer arithmetic inlining
helps (this means inlining some 20 instructions for the multiply and some
430 for the divide). In all, it would be good to look at how the Tahoe
does subroutine calls.
>Only ONCE have I ever seen a compiler bright enough to do the above
>optimization. (Burroughs 6700; a stack machine; Algol 68)
The last time I had my hands on a B6700 (a *lovely* machine) was in 1979.
However, I did write a compiler for it and assist with a couple of others,
and I am quite sure that there was no instruction that yielded both
quotient and remainder. There was an instruction with the effect of
Algol 60's integer divide, and an instruction with the effect of Algol
60's 'mod'. Later models in the same architectural family have varied
the instruction set somewhat (such as dropping VECTORMODE), perhaps this
was a later model?
Several years ago I posted a "<q,r> = (a*b+c) <div,mod> d" function for
Sun compilers, using their ".il" files.
I used to use a language (Pop-2) in which the integer division operator
*always* returned both results; oddly enough this was almost always a pain.
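For comparison with the Pop-2 operator just mentioned: C's own standard library already returns both results at once via div() and ldiv(), which hand back quotient and remainder in a struct; a compiler is free to map such a call onto a single divide instruction. A small usage sketch (the wrapper function is mine, div() itself is standard C):

```c
#include <stdlib.h>

/* Thin wrapper over the standard div(), which returns a div_t
   struct holding both quotient and remainder. */
void split(int a, int b, int *q, int *r) {
    div_t qr = div(a, b);
    *q = qr.quot;
    *r = qr.rem;
}
```

Unlike the Pop-2 design, the caller who wants only one of the two results can simply ignore the other field, so "always both" costs nothing at the source level.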
--
The purpose of advertising is to destroy the freedom of the market.
Toon Moene, SARA - Amsterdam (NL)
Internet: TO...@SARA.NL
/usr/lib/sendmail.cf: Do.:%@!=/
I guess I must be confused. For some weird reason, I thought RISCs
were at least part of the mainstream of current computers,
based on the almost-nonexistence of new CISC designs. (Please note, that
in all talks that I give, that doesn't mean I claim CISCs are going
away, especially the X86; just that new architectures look more
RISCy than CISCy).
Clearly, I have somehow missed the idea that the ZS might be the mainstream...
Except that by inlining, separate calls to the same function now
are separate copies of the code, and therefore are more likely to bust
your cache (you may have 5 copies of strcpy() or whatever in the cache
instead of 1, and of course you had to take initial misses on numbers 2-5).
>2. Inlining the code enables certain optimizations and simplifications
> of the code that are impossible in the library routine that should
> be able to catch *all* uses of it.
Yes, and avoids having to assume that all scratch registers are
trashed, a big win in some architectures/compilers, especially for very small
routines like strcpy.
--
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, je...@cbmvax.commodore.com BIX: rjesup
The compiler runs
Like a swift-flowing river
I wait in silence. (From "The Zen of Programming") ;-)
>I have extensive figures about subroutine call overhead (in this case for
>Below follows a table for a number of machines that indicates the speedup
>when optimization level 2 is used. Also in the last two columns it
>indicates whether the base multiplication and division routines are small,
>medium size or large. I use small when a single instruction could be used,
>medium when the hardware provides step instructions (multiply step or divide
>step) and large when bit fiddling should be done. Medium is also used if FP
>operations are used.
(CPU ISA architecture designed in what year? Whole bunch of guesses,
within 1-2 years, maybe.)
Year   Machine                           Speedup  Mul     Div
   ?   Gould NP1                          32.3%   small   small
   ?   Gould PowerNode                    32.1%   small   small
1979   Alliant FX/4 (68020 look-alike)    27.5%   small   small
1979   Sony (68030)                       27.1%   small   small
1975   Vax 780                            26.0%   small   small
1979   Encore (32532)                     24.9%   small   small
1979   Sequent Balance (32032)            21.9%   small   small
1979   Sun3 (68020)                       20.4%   small   small
1981   IBM RT/PC                          18.6%   medium  medium
1985   SGI 4D/25 (MIPS) FP                14.5%   medium  medium
1986   Alliant FX2800 (i860)              13.8%   medium  medium
1985   SGI 4D/25 (MIPS) INT               10.0%   small   large ==> WHY?
<1984  Harris HCX7 (Tahoe)                 3.8%   small   small
1985   Sun4 (SPARC)                        3.1%   medium  large
Just out of curiosity, say more about the nature of the routines for divide,
given that MIPS does have an integer divide, and SPARC doesn't, but both are
called large. Also, just out of curiosity, could we see the two columns
of numbers from which the % was computed, i.e., the measured times?
>We see indeed that the subroutine call overhead decreases when subroutines
>become larger. What is extremely surprising is the position of the Tahoe,
>contradicting John Mashey's remark that modern machines have lower
>subroutine call overhead. That depends on your definition of modern.
Please re-read what I posted. I asserted that one could NOT assume
function calls were expensive these days, that on current RISCs, they
were cheap, and popular architectures designed in the last 5-8 years
had paid attention to function calls. I did NOT assert that NO earlier
machines had fast function calls, nor that modern machines always have
lower function call overhead.... (I would (almost) never assert such a
thing, since all generalizations are false :-)
On the other hand, given the data above,
I think I would claim that MOST modern machines have less subroutine
overhead than MOST older ones....
>430 for the divide). In all, it would be good to look at how the Tahoe
>does subroutine calls.
Can you post something on that? Also, how about cycle counts for
the mul/div code used on the Tahoe?
Actually, some folks at Rice here observed that inlining causes large
modules and thus large numbers of large live ranges and the *register
allocator* drops the ball. In some cases the best thing you can do
is save a bunch of registers, and load them with new values - exactly
what the subroutine calling code does. Most graph coloring allocators
spill *1* value at a time, dribbling them throughout the module. Fast
save-multiple and load-multiple register instructions are used in the
calling sequence, but not in the code that spills one value at a time. A
second effect is that things may spill inside loops inside the subroutines
after inlining, but not before (because they were effectively spilled at
the subroutine entrance).
These defects in register allocation are the targets of a spate of papers
on hierarchical register allocators.
Cliff
--
Cliff Click
cli...@rice.edu
> >In fact, the old literature (and old machines!) are full of "bizarre"
> >instructions that proved ill-advised. A few examples:
> >
> > [ some good examples removed]
> >
> >- addressing modes which studies could not find a single use of.
> > (PDP-11 autodecrement deferred - omitted from the VAX.)
>
> This example differs somewhat from his others, which were basically
> deliberate design decisions that are not now generally seen as the
> wise way to go.
Actually, I've used this instruction to scan backwards in a linked list.
Extremely handy. I've seen other uses for it. It was not, however, used
by any *DEC* compiler. That's what did it in. Apart from that, there was
something about the symmetry of the machine (lacking in the VAX) that was
useful in itself. For one thing, it allowed compiler writers to very quickly
take advantage of the PDP-11's orthogonal architecture: get something working
with one instruction in one addressing mode, then extend it from there with
a flick of the wrist.
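For readers who never met the mode: autodecrement deferred, @-(Rn),
decrements the register first and then uses the word it points to as the
*address* of the operand. Here is a C analogue of that semantics, as a
sketch only (the function and variable names are mine, not DEC's):

```c
#include <assert.h>

/* Operands live somewhere in memory; the table holds their addresses,
 * much as a region of memory full of back-pointers might on a PDP-11. */
int a = 10, b = 20, c = 30;
int *table[] = { &a, &b, &c };

/* C analogue of @-(Rn): decrement the "register" by one word, then
 * use the word it now points to as the ADDRESS of the operand. */
int fetch_autodec_deferred(int ***rn)
{
    return **(--*rn);        /* pre-decrement, then double indirection */
}
```

Calling it repeatedly with a "register" initialized past the end of the
table walks the chain of addresses backwards, which is exactly the
backwards-list scan described above.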
> DEC's pdp-10 instruction set had the same oddities. The machine had
> a plethora of no-op equivalent instructions
My favorite will always be the "JUMP" instruction, which of course does not
jump since there's no condition associated with it.
--Paul
--
This is my address: p...@ama.caltech.edu
This is UUCP: ...!{decwrl,uunet}!
This is my address on UUCP: ...!{decwrl,uunet}!caltech.edu!ama!ph
Any questions?
"Does Emacs have the Buddha nature?" --Paul Hardy "Yow!" --Zippy
Okay, I agree that this is arguable, but only because ``RISC'' is as
meaningful as ``object-oriented.'' Why is the CDC 6600 not a RISC
architecture?
> >> I cannot think of any popular architecture designed in the
> >> last 5-8 years that hasn't paid at least some attention to
> >> making fast function calls, one way or another.
> >How about the Astronautics ZS supermini? It's a hellishly fast
> >machine, and beats the Convex hands-down on non-vectorizable code.
> I guess I must be confused. For some weird reason, I thought RISCs
> were at least part of the mainstream of current computer architecture,
The issue of fast function calls has nothing to do with whether RISC is
in the mainstream. You are confusing the issues.
The ZS has 64 registers. It is highly pipelined. It has no particularly
complex instructions. But I wouldn't say it pays much attention to fast
function calls.
---Dan
And your definition of "lower", apparently. Since your speedups are
achieved both from inlining and nonportable assembler, it's impossible
to tell how much function call overhead contributes. Even if it's
entirely responsible, it doesn't look expensive to me. Since it's
probably not entirely responsible, it looks even cheaper. Finally,
even if you could guarantee me a portable 30% overall speedup, your
code isn't a real program, it's a set of support routines. What
speedups have you measured in real programs? For instance, if you
like, in real floating point intensive programs calling your routines?
-- Jon
--
Jon Krueger, j...@ingres.com
Everybody please re-read what was posted, quoted above. If somebody
wants to challenge what was said in some useful way, we'd all be pleased
to hear of counter-examples.
In this particular case, I'd like to hear why the ZS is 'a popular
architecture' while "so-called RISC chips are hardly the mainstream."
Perhaps I am just ignorant, but I've not come across the
ZS very often ... well, ever.... and I'm not at all sure what issues
are that I'm confusing, but am happy to become educated ...
>In article <30...@charon.cwi.nl> d...@cwi.nl (Dik T. Winter) writes:
>> it is more important to
>> standardize on an mp library.
>Is there anyone else here who understands why suggestions like this are
>so pointless? I can't wait for the language designers to catch up and
>force everyone to provide an mp library. I think it's fine that you're
>writing libraries, and I fully support any efforts to standardize
>libraries. But standardization is SLOW. There is a much larger problem
>here, namely all the chip instructions that *aren't* supported by
>languages or other standards. Why do you refuse to admit that the
>programmer should be able to take advantage of those instructions
>without sacrificing portability?
Do you really suppose that your idiom recognition with your
pick-this-or-that and adequate documentation will become at all
widespread sooner than a new library? Or even that it could be
implemented faster if everyone wanted to?
If the language doesn't do what you want it to, is the hack you're
describing really the way to fix it? Even if it is workable, and
people keep telling you that it isn't, it sure isn't elegant. It's
not that we don't think you should be able to access machine
instructions that can speed up your code, it's that we think that the
method you describe for doing so is bad and impractical. And that it
won't come any faster than a standard library.
\begin{:-)}
Besides, it isn't ``object oriented''. It is ``turbo,'' but turbo
went out a few years ago. ``Hyper'' still has a bit of life left. If
you can find a way to call it hyper, maybe you can sell it. Name it
``fuzzy'' and it will be a hit in Japan. At least this year.
\end{:-)}
--
Michael Pereckas * InterNet: m-per...@uiuc.edu *
just another student... (CI$: 72311,3246)
Jargon Dept.: Decoupled Architecture---sounds like the aftermath of a tornado
True. Since few compiler writers for the VAX have made the effort to
provide documented DREM support, nobody agrees on idioms yet, and we all
have different ideas about what good idioms would be. So what?
There doesn't have to be any agreement on idioms for my plan to work.
That's why it's such a crucial element of the plan that the idioms be
written portably (and why it's helpful to have a pick-this-or-that
control structure).
At first, instead of learning five different assembly languages and
figuring out how to write inlined assembly code under five different
compilers, the programmer will copy five routines straight out of
manuals. Even this reduction in effort would be well worth the compiler
writer's time, but that's not the only advantage. As I explained in my
previous article, as time goes by, idioms for a certain instruction can
only become more standardized, and programs written for those idioms can
only become faster. Everything will be perfectly portable to start.
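For concreteness, here is the sort of portably written idiom I have in
mind (the routine name and the exact sequence are hypothetical, not taken
from any real compiler manual): the high half of a 32x32-bit multiply,
spelled out in 16-bit pieces. A compiler writer who documents this
sequence can collapse the whole body into one widening-multiply
instruction; on every other compiler it still runs, just more slowly.

```c
#include <stdint.h>

/* Portable idiom for the high 32 bits of a 32x32 -> 64 multiply,
 * using only 32-bit operations.  A compiler that recognizes this
 * documented sequence may emit a single widening multiply. */
uint32_t mulhi32(uint32_t x, uint32_t y)
{
    uint32_t xl = x & 0xFFFF, xh = x >> 16;
    uint32_t yl = y & 0xFFFF, yh = y >> 16;

    uint32_t t1 = xh * yl + ((xl * yl) >> 16);   /* carries from low part */
    uint32_t t2 = xl * yh + (t1 & 0xFFFF);       /* middle partial products */
    return xh * yh + (t1 >> 16) + (t2 >> 16);    /* assemble the high word */
}
```

Note that the routine is correct standard C as written, so a program
using it is perfectly portable today, and only gets faster as compilers
learn the idiom.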
> Users apparently do not have a common view; how do you think
> the compiler writers come to a common view?
I don't, though I imagine that standards will spring up eventually. My
whole point is that we should be able to write portable, efficient code
*now*, even with *no* agreement on standards. All it takes is a bit of
effort on the part of each compiler writer.
> it is more important to
> standardize on an mp library.
Is there anyone else here who understands why suggestions like this are
so pointless? I can't wait for the language designers to catch up and
force everyone to provide an mp library. I think it's fine that you're
writing libraries, and I fully support any efforts to standardize
libraries. But standardization is SLOW. There is a much larger problem
here, namely all the chip instructions that *aren't* supported by
languages or other standards. Why do you refuse to admit that the
programmer should be able to take advantage of those instructions
without sacrificing portability?
---Dan
>Dan, you're missing the forest for the trees, several times over.
>Even if the general idea that you propose were worthwhile, your scheme
>for accessing special-purpose assembly language instructions from
>portable code is too complicated. It is gratuitously difficult for
>both the compiler writer and the programmer. Compare:
This may be true, but I question it. Vectorizing compilers have been around
in usable form for about a decade. Before that, some compilers recognized
certain loops and variations to emit specific code sequences (e.g. "Stacklib"
type loops). (In unusable form, they have been around for two decades :-) )
Note that vectorizing compilers DO recognize specific multiple statement
language constructs and, sometimes, optimize these constructs down to
a simple code sequence, or, in some cases, even a single instruction.
>Dan's method:
:
>compiler writer writes recognizer for special code sequences.
>compiler writer documents these code sequences.
>programmer reads the code sequences.
>programmer writes these code sequences.
This is done every day with vectorizing compilers. Programmers write
specific code sequences so that the vectorizer will recognize them.
The compiler documentation, or sometimes a separate document, tells
programmers exactly how to write code so that it is recognized by
the vectorizer.
Sometimes, they even write very, very specific sequences that several
different vectorizers from several different vendors will recognize.
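As a concrete (and illustrative, not vendor-specific) example, the
canonical form most vectorizer documentation tells you to write is a
stride-1 loop with no cross-iteration dependences, no aliasing tricks,
and no early exit:

```c
/* The shape of loop that vectorizing compilers are documented to
 * recognize: unit stride, independent iterations, trip count known
 * at entry.  Written this way, the whole loop can become a handful
 * of vector instructions; written any other way, it may not. */
void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The same source compiles and runs correctly on a scalar machine, which
is the whole point: the "idiom" costs the programmer nothing in
portability.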
Of course, parallelizing compilers are now the rage.
>Simpler method:
>
>compiler writer writes library of inlineable code.
>compiler writer documents names of routines that are in this library.
>programmer uses those names.
>if you insist, compiler writer writes portable versions of these
> subroutines, though programmer would have had to do it anyway in
> your method (working from documentation).
This method is also useful and important. Why do they have to be
mutually exclusive?
BTW, I do insist that portable versions be available. Note that Cray
Research provides documentation of equivalent code for many of its
scientific library subroutines.
In other words, both methods have long been done, and are not news.
What Dan is proposing is that scalar oriented machines could benefit
from the same technique that vector machine users have all along. I
think there probably *are* cases which could benefit from his technique,
even on small workstations. In fact, as I understand it, some compilers
for RISC micros *do* already recognize, for example, special code
sequences that really mean integer divide plus remainder. There are
probably other cases: for example, code sequences doing multiple
precision arithmetic that could take advantage of a double precision
multiply result (anyone remember the CDC 66xx/Cyber/etc multiply?)
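A minimal sketch of the divide-plus-remainder case (the function name is
mine; the point is only that both operations appear with the same
operands, which is exactly the pattern such compilers match):

```c
/* Quotient and remainder of the same operands, side by side.
 * Many compilers spot that a/n and a%n share one division and
 * emit a single divide instruction (or one call to the divide
 * millicode routine) that produces both results. */
void divrem(int a, int n, int *q, int *r)
{
    *q = a / n;   /* compiler sees both uses of a and n ...   */
    *r = a % n;   /* ... and need not divide a second time.   */
}
```

On a machine whose divide instruction (or divide-step routine) delivers
both results anyway, this halves the cost with no change to the source.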
:
>try to come up with ways to make it better. Somehow, nothing like
>your idea even made it onto the list, let alone near the top.
Of course, this is incorrect for vector machines.
For scalar machines, the performance payoff is usually a lot less. But not
always trivial. And, as I mentioned above, a few compilers have a few such
cases. I have seen factors of two increase in performance with older
compilers by recognizing special code sequences and doing the optimal
thing. This sort of thing is a lot more common in numerical simulation
type codes. Running typical kernel code, or pointer intensive stuff,
the potential for payoff is very small with today's much better
optimizers. And, since some readers of this group don't want to talk
about any other kind of applications, it is no wonder that the perceived
benefit is so small :-)
Hugh LaMaster, M/S 233-9, UUCP: ames!lamaster
NASA Ames Research Center Internet: lama...@ames.arc.nasa.gov
Moffett Field, CA 94035 With Good Mailer: lama...@george.arc.nasa.gov
Phone: 415/604-6117 #include <std.disclaimer>
> They defined an intermediate language (IL).
>Unfortunately, the project apparently died after a few years.
:
>Personally, I was frightened by the complexity of their
>intermediate language. I also wonder how to control
>the huge potential inlining. Can compile-times be controlled?
I would be curious to know if any of the IL's for vectorizing compilers,
commercial or not, have been published. I note that several companies,
including Cray and CDC, claim to have vectorized IL's with common
code generators for multiple languages (C, Fortran, et al.)
I believe CDC has been doing this for about 4 years on the Cyber 800/900
series running their own O/S. Cray has done it more recently with
their Fortran and C compilers. It seems that vectorization is generally
viewed as a front-end, language-dependent process, unlike what you might
expect.
>how his machine works! Similarly, you aren't going to win arguments
>with David (or Ben) Chase about how compilers work.
Maybe not. (It wasn't my argument to begin with - pardon my butting my
head in). But I think that there is more to this than is first apparent.
P.S.
I think there is, indeed, a good case for multiple assignment.
*Some* of the funny optimizations that compilers now have to deal with
are an unnatural artifact of the languages with only a single assignment
operator.
> The extension many people are arguing for:
>
> c, d = a divrem b
>
> is much more significant (at least to languages like C or Fortran).
> Basically, you've suddenly defined a new language with significantly
> different syntax. The front-end of the compiler will probably