Low End NeXTs (was Re: Desktop publishing)

Sean Eric Fagan

Apr 3, 1991, 6:24:00 PM
I would suggest that this thread continue in comp.arch, which is more suited
to it.

In article <27fa33...@petunia.CalPoly.EDU> ara...@polyslo.CalPoly.EDU (Alex Raftis) writes:
>What makes you say this? Sure RISC is nice, and it has its place, but
>look at what advantages RISC gives. It has few instructions, so they execute
>quickly. That's nice, but each instruction does very little, so you need
>five instructions to do the same things a CISC processor can do.

Which end up not being used.

Fact: the current generation RISCs are almost invariably "faster" than the
current generation CISCs. Consider 68030 vs. R3000 (I believe those were
about comparable). Consider 68040 vs. R6000. In both cases, the MIPS chip
is "faster" (the SPECmarks are higher, by a considerable margin).

Few compilers use any of the nifty instructions that a CISC has. gcc, for
example, uses a relatively mundane set of instructions for the 68k line.
Most instructions go unused this way on a CISC; on the other hand, most of
the instructions that the MIPS series has *are* used by the compiler (the
exceptions being about a dozen or so that are used to do things like page
control, processor control/setup, etc.).

>A RISC
>processor has a lot a registers to work with for speed. Well, the 68040
>has sixteen registers, which I find more than plenty when programming.

Maybe you don't, but I can come up with cases where it isn't enough. Take
the MIPS again: it has 32 integer registers (one of which is hardwired to 0,
I believe). Some of the registers are, by convention, used to pass in
subroutine arguments. Result? Subroutine calls are *fast*. Also, the
return address goes into r31, not onto a stack; this is also considerably
faster than the 68040 (MIPS does: r31 := pc+1; pc = addr, while the 68k has
to do quite a bit more, including storing into memory, decrementing the sp,
and *then* incrementing the pc, as well as parsing the addressing mode).
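
As a minimal sketch (mine, not Fagan's) in C of the two call sequences just
described -- pc, sp, r31, and mem[] are stand-ins for machine state, and I
follow Fagan's "pc+1" notation rather than real byte addressing:

    unsigned pc, r31, sp;
    unsigned mem[1 << 20];

    void mips_jal(unsigned addr)   /* MIPS: one register write, no memory */
    {
        r31 = pc + 1;              /* return address into the link register */
        pc  = addr;
    }

    void m68k_jsr(unsigned addr)   /* 68k: sp update plus a memory store */
    {
        sp -= 1;                   /* decrement the sp */
        mem[sp] = pc + 1;          /* push the return address into memory */
        pc = addr;                 /* and only then set the pc */
    }

The MIPS version touches no memory at all; the 68k version cannot avoid it.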

>What are some of its disadvantages? Well, floating point work generally
>is more difficult. They're also nearly impossible to work with at the assembly
>level due to the amount of work the programmer has to do to make the
>advantages of RISC, like pipelining work.

So? How often do you program at the assembly level for unix? I rarely do;
when I do, I use gcc to help. The MIPS assembler takes care of pipeline
slots, so you don't have to deal with *that* part of it. In addition, you
*do* realize that the R6000 has essentially the same FP instructions that
the 68040 does? Yeah, that's right. The 68040 got rid of *tons* of
instructions. There goes *that* part of your "CISC is better than RISC"
argument.

>On the other hand, look at the 68040. It executes instructions at around
>1.3 cycles per instruction.

That depends on your instruction stream. The R3000 gets about the same
ratio, although I seem to recall "1.2"; however, I don't have any reference
to that, so that's just hearsay.

>It gives you lots of registers.

So does the R3000.

>It's easy to
>work with at the assembly level while its easy to write compilers for.

Actually, I have no trouble writing in assembly for any of the current RISC
chips. And the R3000 is easier to write a compiler for than the 68k;
there's a lot of stuff you don't have to worry about. (For example, all the
addressing modes. Sure, they make your code smaller, but not necessarily
faster.)

>Its only major problem is
>in clock speed. A 25 MHz 68040 can beat a SPARC at 25MHz, but the SPARC's
>top speed is 40Mhz, which will easily beat the 040. Motorola claims to be
>working on a 50Mhz version of the 040, but I don't have any idea of when
>they claim this will be released.

I don't like the SPARC. It's got an ugly instruction set. How about the
RIOS or the MIPS? They are acknowledged to be the fastest chips, generally
(although the new HP chip[set?] also seems to be pretty fast, as well).

>With their current strategy
>they can work their way into the workstation market with their 25 Mhz
>cube, which requires low cost memory and support hardware, while they wait
>for faster versions of the processor to come onto the market.

When a 50MHz 68040 comes out, I expect MIPS to come out with a 75MHz (or
faster) R4000; this will be able to get 150MIPS max, and should average
around 100MIPS. I know that doesn't mean much; let's just say it will have
such impressive SPECmarks (assuming the system is designed to keep pace)
that Motorola would have to come out with a 120MHz 68040 to keep up.

>Once this
>occurs, they can begin to release faster version of the Cube, while still
>selling their current models as the Workstation for the rest of us.

You can only make things go so fast with any given technology. Using the
same technology that puts all of those transistors on a 68040, the RISC
people will make a much faster chip (see R6000, RIOS, R4000 when it comes out).

Go read _Computer Architecture: A Quantitative Approach_, by Hennessy and
Patterson. Then make your arguments again.

--
Sean Eric Fagan | "I made the universe, but please don't blame me for it;
s...@kithrup.COM | I had a bellyache at the time."
-----------------+ -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.

John Mashey

Apr 3, 1991, 11:21:34 PM
In article <1991Apr03....@kithrup.COM> s...@kithrup.COM (Sean Eric Fagan) writes:
>I would suggest that this thread continue in comp.arch, which is more suited
>to it.

>In article <27fa33...@petunia.CalPoly.EDU> ara...@polyslo.CalPoly.EDU (Alex Raftis) writes:

>>On the other hand, look at the 68040. It executes instructions at around
>>1.3 cycles per instruction.

It is very difficult to verify such a number, or make reasonable
comparisons without being the architects. As noted in an earlier posting,
if you try to use SPECint/MHz, you find that the best CISC micros,
with 128KB external caches, get about .5 - .53 (@ 25MHz), i.e.,
12.9 - 13.3 for 68040 & 486. RISCs get as much as .75-.80
(MIPS, IBM, HP PA). Using that as the inverse of CPI, one gets
cycles/SPECint of 1.88 (at best) for the CISCs, and 1.25 for RISCs.
Of course, FP is a more variable story; the RISCs have, at minimum,
some additional advantage beyond the integer case,
and often a lot more.
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: ma...@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086

Doug Merritt

Apr 5, 1991, 12:25:33 PM
In article <1991Apr4...@capd.jhuapl.edu> wal...@capd.jhuapl.edu writes:
> He observes that "Few compilers use any of the nifty instructions that
> a CISC has." I believe I read recently that RISC was based on the
> observation that, in fact, only about 30% of the instructions in CISC
> computers were used by compilers. The rest of the instructions, for
> all practical purposes, were just excess baggage.

Not only that, but even when the other instructions and addressing modes
*are* used, it doesn't help that much. A few years ago I wrote a compiler
(code generator, actually) that used essentially all of the addressing
modes, and almost all of the instructions in the 68020. The results were
not encouraging. In the absolute best case, static code density improved
maybe 25%, but usually closer to 0% (break-even). This is because the extra
features almost always take as many bytes to encode in a single fancy
instruction as the equivalent multi-instruction sequence. Dynamic code
speed showed similar results.

(The effort wasn't wasted, though, because the original prototype code
generator was highly suboptimal even for simple instructions.)

Anyway, the point is that even if compilers *do* use every possible feature
of a CISC, it still usually won't make the CISC s/w competitive with RISC
s/w. CISC cpus almost never put sufficient h/w optimization into the support
of the fancy instructions. (There are exceptions to this, of course, and I
haven't been watching the 030 and 040 to see how they do in this regard.
CISC may yet rise again, but not until after superscalar RISC has been
completely exploited.)
Doug
--
Doug Merritt do...@eris.berkeley.edu (ucbvax!eris!doug)
or uunet.uu.net!crossck!dougm

Donald Lewine

Apr 5, 1991, 9:34:59 AM
|>
|> >In article <27fa33...@petunia.CalPoly.EDU> ara...@polyslo.CalPoly.EDU (Alex Raftis) writes:
|>
|> >>On the other hand, look at the 68040. It executes instructions at around
|> >>1.3 cycles per instruction.
|> It is very difficult to verify such a number, or make reasonable
|> comparisons without being the architects. As noted in an earlier posting,
|> if you try to use SPECint/MHz, you find that the best CISC micros,
|> with 128KB external caches, get about .5 - .53 (@ 25MHz), i.e.,
|> 12.9 - 13.3 for 68040 & 486. RISCs get as much as .75-.80
|> (MIPS, IBM, HP PA). Using that as the inverse of CPI, one gets
|> cycles/SPECint of 1.88 (at best) for the CISCs, and 1.25 for RISCs.
|> Of course, FP is a more variable story; the RISCs have, at minimum,
|> some additional advantage beyond the integer case,
|> and often a lot more.

ERROR: Invalid mixing of Data in line 9!

When they say that the 68040 takes 1.3 cycles per instruction, they
mean native 68040 instructions. You cannot infer anything about
clocks per native instruction from a SPECint number.

A VAX, a Data General MV/40000, a MIPS R3000 and a 486 all execute
vastly different numbers of native instructions when running the
SPEC suite.
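
To make that concrete with invented numbers (mine, purely illustrative):
CPI = (time x clock rate) / native instructions. If a RISC runs a benchmark
in 5 seconds at 25MHz while retiring 100M native instructions, its CPI is
(5 x 25e6)/100e6 = 1.25. If a CISC runs the *same* benchmark in the same
5 seconds at 25MHz but retires only 60M native instructions, its CPI is
about 2.08. Identical benchmark results, very different clocks per native
instruction.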

I fully agree with your first statement, "It is very difficult to
verify such a number." It is possible to count the instructions
executed and the number of clock cycles required; however, it is
easy only if you have special hardware and/or simulation tools.
The architects tend to have these tools.

--------------------------------------------------------------------
Donald A. Lewine (508) 870-9008 Voice
Data General Corporation (508) 366-0750 FAX
4400 Computer Drive. MS D112A
Westboro, MA 01580 U.S.A.

uucp: uunet!dg!lewine Internet: lew...@cheshirecat.webo.dg.com

David Chase

Apr 5, 1991, 2:20:45 PM
Somewhat mystifying that anyone would speak of "a processor" achieving
a certain number of cycles per instruction. The compiler also has
something to do with this. When including the compiler in the
measurement, be sure that you compare both CPI and total instructions
executed -- all else being equal (which, of course, it isn't), I'll
take 10% higher CPI if the dynamic instruction count is reduced by
20%. If *all* you care about is CPI, I'm sure that the compiler can
get your code asymptotically close to the minimum (how many NOPs would
you like inserted?).
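
To spell out the arithmetic (my illustration, not Chase's): execution time
is proportional to instruction count times CPI, so 10% higher CPI on 20%
fewer instructions gives 1.10 x 0.80 = 0.88 of the original running time --
a 12% win despite the "worse" CPI.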

David Chase
Sun

Henry Spencer

Apr 7, 1991, 1:48:55 AM
In article <1991Apr03....@kithrup.COM> s...@kithrup.COM (Sean Eric Fagan) writes:
>Fact: the current generation RISCs are almost invariably "faster" than the
>current generation CISCs. Consider 68030 vs. R3000 (I believe those were
>about comparable). Consider 68040 vs. R6000. In both cases, the MIPS chip
>is "faster" (the SPECmarks are higher, by a considerable margin).

Or for a *really* straight comparison, due to John Mashey I think, compare
the i860 to the i486: same tools, same process, same chip size, roughly
the same release time... and the RISC machine is faster, much faster, in
every way.
--
"The stories one hears about putting up | Henry Spencer @ U of Toronto Zoology
SunOS 4.1.1 are all true." -D. Harrison| he...@zoo.toronto.edu utzoo!henry

Henry Spencer

Apr 7, 1991, 1:51:05 AM
In article <1991Apr5.1...@agate.berkeley.edu> do...@eris.berkeley.edu (Doug Merritt) writes:
>... CISC cpus almost never put sufficient h/w optimization into the support
>of the fancy instructions. (There are exceptions to this, of course, and I
>haven't been watching the 030 and 040 to see how they do in this regard...

Actually, on the 040, by the published descriptions, the situation is very
simple: the 68000 subset, plus 32-bit absolute addresses, minus indexed
addressing, is fast. Everything else -- all the goo the 020 added -- is slow.

Herman Rubin

Apr 7, 1991, 3:04:12 PM
In article <1991Apr5.1...@agate.berkeley.edu>, do...@eris.berkeley.edu (Doug Merritt) writes:
> In article <1991Apr4...@capd.jhuapl.edu> wal...@capd.jhuapl.edu writes:
> > He observes that "Few compilers use any of the nifty instructions that
> > a CISC has." I believe I read recently that RISC was based on the
> > observation that, in fact, only about 30% of the instructions in CISC
> > computers were used by compilers. The rest of the instructions, for
> > all practical purposes, were just excess baggage.

.........................

> Anyway, the point is that even if compilers *do* use every possible feature
> of a CISC, it still usually won't make the CISC s/w competitive with RISC
> s/w. CISC cpus almost never put sufficient h/w optimization into the support
> of the fancy instructions. (There are exceptions to this, of course, and I
> haven't been watching the 030 and 040 to see how they do in this regard.
> CISC may yet rise again, but not until after superscalar RISC has been
> completely exploited.)

If the languages do not allow the user to communicate the use of those
features of the CISC architecture which can make the object code considerably
faster, the compiler cannot take advantage of them. This IS the situation.
For example, suppose there was an instruction, which could be effectively
used, which would divide a float by a float, obtaining an integer quotient
and a float remainder. Now just how is the compiler to recognize that this
is in fact wanted?
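
A sketch in C (mine; names are illustrative, though the numbers happen to
match the APL example that appears later in this thread) of what Rubin's
single hypothetical instruction costs when synthesized in software:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double x = 20.0, y = 3.141592654;
        double q = floor(x / y);     /* integer quotient: 6 */
        double r = x - q * y;        /* float remainder: 1.150444076 */
        printf("q = %.0f  r = %.9f\n", q, r);
        return 0;
    }

A compiler would have to pattern-match this whole divide/floor/multiply/
subtract sequence before it could emit the one combined instruction.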

There is, at present, no language designed to take into account the collection
of "natural" useful hardware which the present users can find uses for, and
which are far more expensive in software than in hardware.

--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hru...@l.cc.purdue.edu (Internet, bitnet) {purdue,pur-ee}!l.cc!hrubin(UUCP)

Larry McVoy

Apr 7, 1991, 8:42:34 PM
In article <1991Apr7.0...@zoo.toronto.edu> he...@zoo.toronto.edu (Henry Spencer) writes:
>Or for a *really* straight comparison, due to John Mashey I think, compare
>the i860 to the i486: same tools, same process, same chip size, roughly
>the same release time... and the RISC machine is faster, much faster, in
>every way.

I respect Henry, despite his annoying signatures, but I can't agree
with this comparison. The i860 is, to the best of my knowledge, a
clean start. The i486 is carrying baggage from all the way back to
the 8080. (I personally think the i486 is a cool chip; if you look
closely, it is quite RISC-like in the most common instruction uses.
First the 386, then the 486. Hmm. If Intel keeps this up, they might
make a decent CPU one day :-)

Anyway, it is not a fair comparison. Not by a long stretch. Let's see
how the Nth generation SPARC, MIPS, and 88K's do (assuming they last)
compared to some new design from scratch.
---
Larry McVoy, Sun Microsystems (415) 336-7627 ...!sun!lm or l...@sun.com

Rob MacLachlan

Apr 8, 1991, 7:49:16 PM
>From: hru...@pop.stat.purdue.edu (Herman Rubin)
>Subject: Compilers and efficiency
>Date: 7 Apr 91 19:04:12 GMT

>
>For example, suppose there was an instruction, which could be effectively
>used, which would divide a float by a float, obtaining an integer quotient
>and a float remainder. Now just how is the compiler to recognize that this
>is in fact wanted?
>
>There is, at present, no language designed to take into account the collection
>of "natural" useful hardware which the present users can find uses for, and
>which are far more expensive in software than in hardware.

Well, in Common Lisp, (truncate 3.5) returns 3 (an integer) and .5
(a float).

Rob

Rob MacLachlan

Apr 8, 1991, 8:01:44 PM
>From: hru...@pop.stat.purdue.edu (Herman Rubin)
>Subject: Compilers and efficiency
>Date: 7 Apr 91 19:04:12 GMT
>
>For example, suppose there was an instruction, which could be effectively
>used, which would divide a float by a float, obtaining an integer quotient
>and a float remainder. Now just how is the compiler to recognize that this
>is in fact wanted?
>
>There is, at present, no language designed to take into account the collection
>of "natural" useful hardware which the present users can find uses for, and
>which are far more expensive in software than in hardware.

Well, in Common Lisp, (truncate 4.5 2.0) returns 2 (an integer) and .5 (a
float). And you can use CEILING, FLOOR or ROUND to get rounding to +inf,
-inf or nearest. The Common Lisp numeric support was designed by someone
with a pretty good understanding of numerical analysis, and of the IEEE
float standard. [Guy Steele, I believe...] Unfortunately, most Common Lisp
implementations don't implement numeric stuff well enough to be worth using.
In my work on CMU Common Lisp, I have made some major improvements in basic
efficiency, but nobody has yet realized the full potential of numbers in
Common Lisp.

Rob

Robert Firth

Apr 7, 1991, 3:44:36 PM
In article <1991Apr7.0...@zoo.toronto.edu> he...@zoo.toronto.edu (Henry Spencer) writes:

>Actually, on the 040, by the published descriptions, the situation is very
>simple: the 68000 subset, plus 32-bit absolute addresses, minus indexed
>addressing, is fast. Everything else -- all the goo the 020 added -- is slow.

My rough calculations (again, based on the documentation) agree with
Henry's. As with the 68020, the indexed address modes are on balance
slower than the equivalent naive code using only the 68000 address
modes. Once in a while, you'll be able to avoid an extra register
save and restore, for a marginal gain; but in general, forget them.

Guy Harris

Apr 10, 1991, 4:59:45 PM
>If the languages do not allow the user to communicate the use of those
>features of the CISC architecture which can make the object code considerably
>faster, the compiler cannot take advantage of them.

What about those features of the CISC architecture that don't make the
object code *any* faster? Methinks that's what a lot of people are
complaining about here....

Chip Salzenberg

Apr 10, 1991, 5:33:19 PM
According to hru...@pop.stat.purdue.edu (Herman Rubin):

>There is, at present, no language designed to take into account the collection
>of "natural" useful hardware which the present users can find uses for, and
>which are far more expensive in software than in hardware.

If Herman would be so kind as to write a spec for such a
language, we could try to implement it.

We're waiting, Herman...
--
Brand X Industries Custodial, Refurbishing and Containment Service:
When You Never, Ever Want To See It Again [tm]
Chip Salzenberg <ch...@tct.com>, <uunet!pdn!tct!chip>

Michael Turner

Apr 11, 1991, 1:47:26 PM

Guy Harris's example (if I remember correctly) was that no language he
knew of had semantics for retrieving both the integer quotient and the
remainder from a floating divide, the point being that both values are
usually available from any implementation of floating point divide, but
that the language stands in the way of getting them at the same time.
I think a fairly trivial ad hoc common-subexpression analysis could handle
this particular case. I also don't see how this is a CISC/RISC issue--
if anything, one instruction that yielded both quotient and remainder
is more "RISC" than having two that did div and rem separately.
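
A sketch (mine, not Turner's) of the source pattern such an ad hoc
common-subexpression pass would have to recognize -- both results come from
the same divide, so a compiler could fuse them into one combined operation
if the hardware offered it:

    #include <math.h>

    void divmod(double x, double y, long *quot, double *rem)
    {
        *quot = (long)(x / y);   /* integer quotient, truncated */
        *rem  = fmod(x, y);      /* remainder of the very same divide */
    }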

As for CISC-architecture features that don't appear to be faster than
using combinations of other native-code features, this gets slippery.
Yes, those added 68020 addressing modes don't make it faster, but they
do make the corresponding code denser, which might reduce I-cache misses.
If implementing them used up space on the chip that would have been
wasted otherwise, then there has been some good done, and little harm
but for a possible slight decrease in yield. I don't know that that's
the case with the 68020. From what I'm hearing of the 68040, it even
seems doubtful. But that shouldn't get in the way of my point: that
when you're designing an architecture for single-chip implementation,
at some point you've got a little left-over real estate to play around
with. Maybe you can even (*gasp*) add some instructions. If this be
treason, make the most of it, I say.

Just my arrogant, ignorant opinion, as usual.
---
Michael Turner
tur...@tis.llnl.gov

Herman Rubin

Apr 11, 1991, 9:26:28 AM

In some cases this is true. A badly implemented procedure is bad.

If a hardware polynomial evaluation takes longer than an explicit loop,
it is not the fault of the instruction, but of the implementation. Also,
it is important to compare not the object code produced by compilers, but
that produced by intelligent human beings, who can reason out how to use
the features not supported by the languages.

Herman Rubin

Apr 11, 1991, 10:17:56 AM
In article <280384...@tct.com>, ch...@tct.com (Chip Salzenberg) writes:
> According to hru...@pop.stat.purdue.edu (Herman Rubin):
> >There is, at present, no language designed to take into account the collection
> >of "natural" useful hardware which the present users can find uses for, and
> >which are far more expensive in software than in hardware.
>
> If Herman would be so kind as to write a spec for such a
> language, we could try to implement it.
>
> We're waiting, Herman...

1. I am not quite sure what you mean by a spec. If it is what I think it
is, it is far too early to do this.

2. I am willing to provide a list of such instructions and their descriptions,
by no means intended to be complete, written in such a way that a reasonable
undergraduate (or even a high school student who can understand the idea of
mathematical symbolism and operations) with an understanding of binary
representation can understand. Many of these have been posted to this group
by others as well as myself. Such a list will include only those operations
with which I am acquainted, and others will come up with additional suggestions.

cs45...@uc780.umd.edu

Apr 11, 1991, 5:11:46 PM
Michael Turner writes:

>Guy Harris's example (if I remember correctly) was that no language he
>knew of had semantics for retrieving both the integer quotient and the
>remainder from a floating divide ...

You mean like
0 3.141592654 #: 20
6 1.150444076

???

That's been around for ages. It's just in a language that not many
people write compilers for (APL).

Raul Rockwell

Michael S. Pereckas

Apr 11, 1991, 8:13:24 PM

>If implementing them used up space on the chip that would have been
>wasted otherwise, then there has been some good done, and little harm
>but for a possible slight decrease in yield. I don't know that that's
>the case with the 68020. From what I'm hearing of the 68040, it even
>seems doubtful. But that shouldn't get in the way of my point: that
>when you're designing an architecture for single-chip implementation,
>at some point you've got a little left-over real estate to play around
>with. Maybe you can even (*gasp*) add some instructions. If this be
>treason, make the most of it, I say.

The problem is that you will have to include those instructions in the
next version of your processor in order to remain binary compatible,
and if things work out differently, as they probably will, they may be
hard to implement. That happens in microcoded machines: we have some
extra microcode store space, let's add some instructions. Next
version, you have to work like hell to cram all those instructions in
because things worked out differently, and probably almost no one uses
those instructions anyway.

--


Michael Pereckas * InterNet: m-per...@uiuc.edu *
just another student... (CI$: 72311,3246)
Jargon Dept.: Decoupled Architecture---sounds like the aftermath of a tornado

Wm E Davidsen Jr

Apr 12, 1991, 9:16:35 AM
In article <10...@mentor.cc.purdue.edu> hru...@pop.stat.purdue.edu (Herman Rubin) writes:

| If a hardware polynomial evaluation takes longer than an explicit loop,
| it is not the fault of the instruction, but of the implementation. Also,
| it is important to compare not the object code produced by compilers, but
| that produced by intelligent human beings, who can reason out how to use
| the features not supported by the languages.

Obviously a bad algorithm is slow, however you implement it. A good
implementation can be faster than the best code, due to overlap of
instructions. We had someone do an FFT instruction for VAX loadable
control store (master's thesis) and he got about 15-20% over the
hand-coded assembler.

You can get somewhat the same effect on a RISC machine if you feed it
good enough code and it has register scoreboarding or other techniques
which allow overlap. If I wanted maximum speed for some operation I
would still hardcode an instruction, but you need a certain dollar
volume to justify building a special instruction into a CPU instead of
using the real estate for something else. This is why we have
coprocessors, to allow the user to buy the instructions s/he needs.
--
bill davidsen (davi...@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"Most of the VAX instructions are in microcode,
but halt and no-op are in hardware for efficiency"

Herman Rubin

Apr 12, 1991, 6:45:45 AM
In article <11APR91....@uc780.umd.edu>, cs45...@uc780.umd.edu writes:
> Michael Turner writes:
>
> >Guy Harris's example (if I remember correctly) was that no language he
> >knew of had semantics for retrieving both the integer quotient and the
> >remainder from a floating divide ...

> You mean like
> 0 3.141592654 #: 20
> 6 1.150444076

Translate, please. Seeing the results, I still do not have any idea what
the instruction does or how to put the results in a useful place, probably
registers.

The problem with APL and other such weirdos (like 99+% of the assembler
languages) is their highly artificial syntax. I can, even at my age,
learn the machine instructions quickly, but the stupid "mnemonics" of
assembler language are another matter. This is what computer language
people do not realize; they seem to think that memorizing even highly
obscure notation is easy, and understanding slightly unusual operations
is difficult.

APL does have one feature lacking in at least most other languages; it
attempts to have a reasonably compact notation. But there is no
adequate macro processor available. An adequate macro processor would
include the capability of easily translating APL into other languages,
for example, and even having the translation take into account types.

Guido van Rossum

Apr 13, 1991, 8:05:30 AM
hru...@pop.stat.purdue.edu (Herman Rubin) writes:

>[...] An adequate macro processor would
>include the capability of easily translating APL into other languages,
>for example, and even having the translation take into account types.

Can you give some more clues as to why this statement is true? Without any
more context it sounds like what you call a macro processor is
something completely different from what is usually called a macro
processor; for instance, normally macro processors have no notion
whatsoever of the types they are working on.

cs45...@uc780.umd.edu

Apr 12, 1991, 4:52:34 PM
Herman Rubin >
Me >|
Michael Turner >**
>** Guy Harris's example (if I remember correctly) was that no
>** language he knew of had semantics for retrieving both the integer
>** quotient and the remainder from a floating divide ...

>| You mean like
>| 0 3.141592654 #: 20
>| 6 1.150444076

>Translate, please. Seeing the results, I still do not have any idea what
>the instruction does or how to put the results in a useful place, probably
>registers.

Well, (x,y) #: z where x, y, and z are all single numbers returns
two results. The "number on the right" is the remainder from z
divided by y. The "number on the left" is the quotient of that
division modulo x. "N modulo 0" is defined to be N.

Therefore, (0,y) #: z computes the quotient and remainder of z
divided by y.

Finally, note that for constants, the notation (x, y) may be
abbreviated by just writing the two numbers separated by a space.

>The problem with APL and other such weirdos (like 99+% of the
>assembler languages) is their highly artificial syntax.

Anything you understand is simple. Anything you do not understand is
not simple. I suppose I should also point out that in infix notation
a function takes arguments on both the right side and the left side.
I don't know if it is relevant to describe the details of this
mechanism.

Also note that #: is a single symbol.

>APL does have one feature lacking in at least most other languages;
>it attempts to have a reasonably compact notation. But there is no
>adequate macro processor available. An adequate macro processor
>would include the capability of easily translating APL into other
>languages, for example, and even having the translation take into
>account types.

He he...

APL is just now getting around to using ASCII... give it a little time
before it translates to every language under the sun. :-)

Other than that, it looks like all you are asking for is constant
folding.

Raul

Chip Salzenberg

Apr 12, 1991, 1:27:55 PM
According to hru...@pop.stat.purdue.edu (Herman Rubin):
>In article <280384...@tct.com>, ch...@tct.com (Chip Salzenberg) writes:
>> If Herman would be so kind as to write a spec for such a
>> language, we could try to implement it.
>
>1. I am not quite sure what you mean by a spec.

Okay. Let's suppose that holy grail of programming languages, HROL
(Herman Rubin Oriented Language), has been designed, implemented, and
tested, and it's in a box on your desk.

In the box is a tape, a programmer's reference manual, and an
architecture-specific reference manual.

Here's your job, Herman: Write that programmer's reference manual.

If you feel that you can't write it because you're not sure what
should be in it, then obviously you don't know what you want, and I
for one would be happy if you would stop lamenting its absence.

>2. I am willing to provide a list of such instructions ...

Not an instruction reference, Herman, a *language* specification.
The instruction set is way down the line from there.

Randell Jesup

Apr 15, 1991, 2:58:04 AM
In article <23...@as0c.sei.cmu.edu> fi...@sei.cmu.edu (Robert Firth) writes:
>My rough calculations (again, based on the documentation) agree with
>Henry's. As with the 68020, the indexed address modes are on balance
>slower than the equivalent naive code using only the 68000 address
>modes. Once in a while, you'll be able to avoid an extra register
>save and restore, for a marginal gain; but in general, forget them.

I mostly agree, though note that the '020 had more overhead on the
new modes than the '030 did - on an '030, some of the modes are actually
a win (I think d8(An,Xn*scale) becomes a win, for example, in certain cases).
I use a compiler that has different specific '020/'030 optimizations, and
on an '030 it is more willing to use some of the new addressing modes. I
don't know if the '040 has maintained the split-point of utility, but I
wouldn't be surprised if it were in about the same spot.

The 32-bit multiply and divide on the '020 and above are a large
win. For certain types of code (not the general case - cpu blitter code,
for example) the bitfield instructions are a win, and of course the barrel-
shifter.

--
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, je...@cbmvax.commodore.com BIX: rjesup
Disclaimer: Nothing I say is anything other than my personal opinion.
Thus spake the Master Ninjei: "To program a million-line operating system
is easy, to change a man's temperament is more difficult."
(From "The Zen of Programming") ;-)

herr...@iccgcc.decnet.ab.com

Apr 15, 1991, 6:07:41 PM
In article <280384...@tct.com>, ch...@tct.com (Chip Salzenberg) writes:
> According to hru...@pop.stat.purdue.edu (Herman Rubin):
>>There is, at present, no language designed to take into account the collection
>>of "natural" useful hardware which the present users can find uses for, and
>>which are far more expensive in software than in hardware.
>
> If Herman would be so kind as to write a spec for such a
> language, we could try to implement it.
>
> We're waiting, Herman...

Herman, and some others, have already published a requirement spec for
the language, mostly in the "bizarre instructions" thread here quite
recently. Of course, it would help if someone would gather it together
in a form that could be reviewed and resolve some of the contradictions.

dan herrick
herr...@iccgcc.decnet.ab.com

Guy Harris

Apr 15, 1991, 7:44:25 PM
>Guy Harris's example (if I remember correctly) was that no language he
>knew of had semantics for retrieving both the integer quotient and the
>remainder from a floating divide, the point being that both values are
>usually available from any implementation of floating point divide, but
>that the language stands in the way of getting them at the same time.

I don't think you remember correctly; if *I* remember correctly, I
didn't give any particular example, and I certainly didn't give *that*
example. *Herman Rubin* made the claim that no language he knew of had
those semantics, and at least two other people jumped in to claim that
Common LISP *did* have those semantics.

I was mainly thinking of, indeed, such things as the added addressing
modes; they may increase code density, but do they complicate instruction
decoding and slow the machine down there? And what about some of the
more elaborate procedure-calling, procedure-entry, or procedure-exit
instructions?

Admittedly, to some extent, the problems with the more elaborate
features may come from 1) sloppy implementations of them and/or
2) using the elaborate features even when inappropriate, e.g. not
making use of simplifications that can be done at code-generation time,
such as better management of registers. Both seem to come, at least in
part, from the notion that "well, it's *one instruction*, that means it
*has* to be fast!" - i.e., since it's a single instruction, they didn't
worry about making it fast, or making the common case fast, and/or
assumed it was *always* the right thing to do to use that instruction.

As an example of 1), the CCI Power 6/32, as I remember, did a lot better
job at implementing a VAX-like CALLS instruction than did the
VAX-11/780; one trick I remember them doing was to generate the fields
of the stack frame in order, so that the stores that built the stack
frame would work well with interleaved memory. They also stored
*decoded* instructions, rather than code bytes, in the instruction
cache, as a way of avoiding the overhead of decoding VAX-style
instructions.

As an example of 2), consider, say, treating leaf procedures differently
- can you get away with doing less than a full procedure entry?

Guy Harris

Apr 16, 1991, 1:35:34 PM
>If a hardware polynomial evaluation takes longer than an explicit loop,
>it is not the fault of the instruction, but of the implementation.

What if a hardware polynomial evaluation takes longer than an *unrolled*
loop, with the zero-coefficient terms taken out? Or do you have an
implementation in mind that takes zero cycles to skip the terms with
zero coefficients?
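
A sketch (mine, not Harris's) of the two contenders: an unrolled evaluation
with the zero-coefficient terms removed at compile time, against a generic
Horner loop that pays for every term:

    /* c3*x^3 + c1*x, with the x^2 and x^0 terms gone entirely */
    double poly_unrolled(double x, double c3, double c1)
    {
        return (c3 * x * x + c1) * x;
    }

    /* generic Horner loop over all n coefficients, zero or not */
    double poly_generic(double x, const double *c, int n)
    {
        double r = 0.0;
        for (int i = n - 1; i >= 0; i--)
            r = r * x + c[i];    /* one multiply and one add per term */
        return r;
    }

A hardware POLY that marches through the full coefficient table is doing the
work of poly_generic; the question is whether it can beat poly_unrolled.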

Hascall John Paul

Apr 16, 1991, 11:25:15 PM

Doesn't seem so impossible:

POLY CMASK,CTABLE,X,RESULT

where the CMASK operand is a bitmask which indicates which elements
of the CTABLE array are non-zero. [probably this instruction would
still be a waste of real estate for most]
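
A rough C model (mine, with hypothetical names) of what such a masked POLY
might compute: Horner's rule over the full table, with the adds for the
known-zero coefficients skipped under mask control:

    /* mask bit i set  <=>  ctable[i] is non-zero; assumes n <= 32 */
    double poly_masked(unsigned mask, const double *ctable,
                       double x, int n)
    {
        double result = 0.0;
        for (int i = n - 1; i >= 0; i--) {
            result *= x;                 /* the multiply always happens */
            if (mask & (1u << i))
                result += ctable[i];     /* the add only when non-zero */
        }
        return result;
    }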

John

--
John Hascall An ill-chosen word is the fool's messenger.
Project Vincent
Iowa State University Computation Center jo...@iastate.edu
Ames, IA 50011 (515) 294-9551

Michael Goldman

Apr 16, 1991, 3:25:16 PM
msp3...@uxa.cso.uiuc.edu (Michael S. Pereckas) writes:

>In <14...@ncis.tis.llnl.gov> tur...@lance.tis.llnl.gov (Michael Turner) writes:

>>If implementing them used up space on the chip that would have been
>>wasted otherwise, then there has been some good done, and little harm
>>but for a possible slight decrease in yield. I don't know that that's
>>the case with the 68020. From what I'm hearing of the 68040, it even
>>seems doubtful. But that shouldn't get in the way of my point: that
>>when you're designing an architecture for single-chip implementation,
>>at some point you've got a little left-over real estate to play around
>>with. Maybe you can even (*gasp*) add some instructions. If this be
>>treason, make the most of it, I say.

>The problem is that you will have to include those instructions in the
>next version of your processor in order to remain binary compatible,
>and if things work out differently, as they probably will, they may be
>hard to implement. That happens in microcoded machines: we have some
>extra microcode store space, let's add some instructions. Next
>version, you have to work like hell to cram all those instructions in
>because things worked out differently, and probably almost no one uses
>those instructions anyway.

Actually, as I follow the 68020 to the 68030 to the 68040 I find a number
of deleted instructions as well as added instructions at each move. The
added instructions are either essential (MMU support) or not (BKPT).
The deleted instructions are of the same nature. That this requires
rewriting code is just one of those things you're supposed to put up
with to maintain application compatibility. The 680x0 doesn't have a
lot of really complex instructions like the 80x86 family. The string
handling instructions of the 80x86 come to mind as something best
implemented as library functions. The context switching instructions
of the 80286 are sufficiently involved as to be part of a large function.

I think one of the problems with the incredibly complex instructions
for OS stuff in the 80286 and beyond is that you may not WANT to
implement your OS the way they designed the instruction. Another
problem is that it makes learning the instruction set that much
harder and slower. This results in delays in implementation and
bugs. Then, also, as pointed out above, you've got compatibility issues
if you actually wrote your OS around their specific hardware concept
of how context switching should be done.

Wild Rider

Apr 17, 1991, 4:32:48 PM

In article <1991Apr12...@ux1.cso.uiuc.edu>
msp3...@uxa.cso.uiuc.edu (Michael S. Pereckas) writes:

>The problem is that you will have to include those instructions in the
>next version of your processor in order to remain binary compatible,
>and if things work out differently, as they probably will, they may be
>hard to implement. That happens in microcoded machines: we have some
>extra microcode store space, let's add some instructions. Next
>version, you have to work like hell to cram all those instructions in
>because things worked out differently, and probably almost no one uses
>those instructions anyway.

or, you could do like motorola did when they designed the 030:
dump a couple of unused 020 instructions, i.e., the "callm"
("call module") and the corresponding "rtm" ("return from
module") instructions. yes, that's correct, you can have an
executable which runs _only_ on the 020, if it uses those
instructions. i guess motorola called up all their major 020
customers and asked them: "do you use the callm/rtm
instructions?" apparently, the answer was "no" from every
customer so motorola dumped the 2 unused instructions. i just
dug out my "mc68020 user's manual" and looked up the "callm"
and "rtm" instructions... yuk. no wonder they dumped them.
talk about "complex instructions" ... anyway, i'll bet moto's
designers freed up quite a bit of microcode real estate when
they released those 2 losers.
--
Wallace Roberts, AG (formerly GTE) Communication Systems, Phoenix, AZ
UUCP: ...!{ncar!noao!asuvax | uunet!zardoz!hrc | att}!gtephx!robertsw
Internet: gtephx!robe...@asuvax.eas.asu.edu Bike: '82 GS1100L Suz
voice: (602)581-4555 fax: (602)582-7624 Cage: '89 Mustang GT

John Mashey

Apr 18, 1991, 12:40:41 AM
WARNING: you may want to print this one to read it...

Well, there is baggage and there is BAGGAGE.
One must be careful to distinguish between ARCHITECTURE and IMPLEMENTATION:
a) Architectures persist longer than implementations, especially
user-level Instruction-Set Architecture.
b) The first member of an architecture family is usually designed
with the current implementation constraints in mind, and if you're
lucky, software people had some input.
c) If you're really lucky, you anticipate 5-10 years of technology
trends, and that modifies your idea of the ISA you commit to.
d) It's pretty hard to delete anything from an ISA, except where:
1) You can find that NO ONE uses a feature
(the 68020->68030 deletions mentioned by someone
else).
2) You believe that you can trap and emulate the feature
"fast enough".
i.e., microVAX support for decimal ops,
68040 support for transcendentals.
Now, one might claim that the i486 and 68040 are RISC implementations
of CISC architectures .... and I think there is some truth to this,
but I also think that it can confuse things badly:
Anyone who has studied the history of computer design knows that
high-performance designs have used many of the same techniques for years,
for all of the natural reasons, that is:
a) they use as much pipelining as they can; in some cases, if this
means a high gate-count, then so be it.
b) They use caches (separate I & D if convenient).
c) They use hardware rather than micro-code for the simpler
operations.
(For instance, look at the evolution of the S/360 products.
Recall that the 360/85 used caches, back around 1969, and within a few
years, so did any mainframe or supermini.)

So, what difference is there among machines if similar implementation
ideas are used?
A: there is a very specific set of characteristics shared by most
machines labeled RISCs, most of which are not shared by most CISCs.
The RISC characteristics:
a) Are aimed at more performance from current compiler technology
(i.e., enough registers).
OR
b) Are aimed at fast pipelining
in a virtual-memory environment
with the ability to still survive exceptions
without inextricably increasing the number of gate delays
(notice that I say gate delays, NOT just how many
gates).
Even though various RISCs have made various decisions, most of them have
been very careful to omit those things that CPU designers have found
difficult and/or expensive to implement, and especially, things that
are painful, for relatively little gain.

I would claim, that even as RISCs evolve, they may have certain baggage
that they'd wish weren't there .... but not very much.
In particular, there are a bunch of objective characteristics shared by
RISC ARCHITECTURES that clearly distinguish them from CISC architectures.

I'll give a few examples, followed by the detailed analysis from
an upcoming talk:

MOST RISCs:
3a) Have 1 size of instruction in an instruction stream
3b) And that size is 4 bytes
3c) Have a handful (1-4) of addressing modes (* it is VERY
hard to count these things; will discuss later).
3d) Have NO indirect addressing in any form (i.e., where you need
one memory access to get the address of another operand in memory)
4a) Have NO operations that combine load/store with arithmetic,
i.e., like add from memory, or add to memory.
4b) Have no more than 1 memory-addressed operand per instruction
5a) Do NOT support arbitrary alignment of data for loads/stores
5b) Use an MMU for a data address no more than once per instruction
6a) Have >=5 bits per integer register specifier
6b) Have >= 4 bits per FP register specifier
These rules provide a rather distinct dividing line among architectures,
and I think there are rather strong technical reasons for this, such
that there is one more interesting attribute: almost every architecture
whose first instance appeared on the market from 1986 onward obeys the
rules above .....
Note that I didn't say anything about counting the number of
instructions....
So, here's a table:
Age: number of years since first implementation sold in this family
(or the first thing with which this is binary compatible)
3a: # instruction sizes
3b: maximum instruction size in bytes
3c: number of distinct addressing modes for accessing data (not jumps).
I didn't count register or
literal, but only ones that referenced memory, and I counted different
formats with different offset sizes separately. This was hard work...
Also, even when a machine had different modes for register-relative and
PC_relative addressing, I counted them only once.
3d: indirect addressing: 0: no, 1: yes
4a: load/store combined with arithmetic: 0: no, 1:yes
4b: maximum number of memory operands
5a: unaligned addressing of memory references allowed in load/store,
without specific instructions
0: no never (MIPS, SPARC, etc)
1: sometimes (as in RS/6000)
2: just about any time
5b: maximum number of MMU uses for data operands in an instruction
6a: number of bits for integer register specifier
6b: number of bits for 64-bit or more FP register specifier,
distinct from integer registers

Note that all of these are ARCHITECTURE issues, and it is usually quite
difficult to either delete a feature (3a-5b) or increase the number
of real registers (6a-6b) given an initial instruction set design.
(yes, register renaming can help, but...)

Now: items 3a, 3b, and 3c are an indication of the decode complexity
3d-5b hint at the ease or difficulty of pipelining, especially
in the presence of virtual-memory requirements, and need to go
fast while still taking exceptions sanely
items 6a and 6b are more related to ability to take good advantage
of current compilers.
There are some other attributes that can be useful, but I couldn't
imagine how to create metrics for them without being very subjective;
for example "degree of sequential decode", "number of writebacks
that you might want to do in the middle of an instruction, but can't,
because you have to wait to make sure you see all of the instruction
before committing any state, because the last part might cause a
page fault," or "irregularity/asymmetry of register use",
or "irregularity/complexity of instruction formats". I'd love to
use those, but just don't know how to measure them.
Also, I'd be happy to hear corrections for some of these.

So, here's a table of 12 implementations of various architectures,
one per architecture, with the attributes above. Just for fun, I'm
going to leave the architectures coded at first, although I'll identify
them later. I'm going to draw a line between H1 and L4 (obviously,
the RISC-CISC Line), and also, at the head of each column, I'm going
to put a rule, which, in that column, most of the RISCs obey.
Any RISC that does not obey it is marked with a +; any CISC that DOES
obey it is marked with a *. So...

CPU Age 3a 3b 3c 3d 4a 4b 5a 5b 6a 6b # ODD
RULE <6 =1 =4 <5 =0 =0 =1 <2 =1 >4 >3
-------------------------------------------------------------------------
A1 4 1 4 1 0 0 1 0 1 8 3+ 1
B1 5 1 4 1 0 0 1 0 1 5 4 -
C1 2 1 4 2 0 0 1 0 1 5 4 -
D1 2 1 4 3 0 0 1 0 1 5 0+ 1
E1 5 1 4 10+ 0 0 1 0 1 5 4 1
F1 5 2+ 4 1 0 0 1 0 1 4+ 3+ 3
G1 1 1 4 4 0 0 1 1 1 5 5 -
H1 2 1 4 4 0 0 1 0 1 5 4 - RISC
---------------------------------------------------------------
L4 26 4 8 2* 0* 1 2 2 4 4 2 2 CISC
M2 12 12 12 15 0* 1 2 2 4 3 3 1
N1 10 21 21 23 1 1 2 2 4 3 3 -
O3 11 11 22 44 1 1 2 2 8 4 3 -
P3 13 56 56 22 1 1 6 2 24 4 0 -

An interesting exercise is to analyze the ODD cases.
First, observe that of 12 architectures, in only 2 cases does an
architecture have an attribute that puts it on the wrong side of the line.
of the RISCs:
-A1 is slightly unusual in having more integer registers, and less FP
than usual.
-D1 is unusual in sharing integer and FP registers (that's what D1's 6b == 0 means).
-E1 seems odd in having a large number of address modes. I think most of this
is an artifact of the way that I counted, as this architecture really only
has a fundamentally small number of ways to create addresses, but has several
different-sized offsets and combinations, but all within 1 4-byte instruction;
I believe that its addressing mechanisms are fundamentally MUCH simpler
than, for example, M2, or especially N1, O3, or P3, but the specific number
doesn't capture it very well.
-F1 .... is not sold any more.
-H1 one might argue that this processor has 2 sizes of instructions,
but I'd observe that at any point in the instruction stream, the instructions
are either 4-bytes long, or 8-bytes long, with the setting done by a mode bit,
i.e., not dynamically encoded in every instruction.

Of the processors called CISCs:
-L4 happens to be one in which you can tell the length of the instruction
from the first few bits, has a fairly regular instruction decode,
has relatively few addressing modes, no indirect addressing.
In fact, a big subset of its instructions are actually fairly RISC-like,
although another subset is very CISCy.
-M2 has a myriad of instruction formats, but fortunately avoided
indirect addressing, and actually, MOST of its instructions have only 1
address, except for a small set of string operations with 2.
I.e., in this case, the decode complexity may be high, but most instructions
cannot turn into multiple-memory-address-with-side-effects things.
-N1,O3, and P3 are actually fairly clean, orthogonal architectures, in
which most operations can consistently have operands in either memory or
registers, and there are relatively few weirdnesses of special-cased uses
of registers. Unfortunately, they also have indirect addressing,
instruction formats whose very orthogonality almost guarantees sequential
decoding, where it's hard to even know how long an instruction is until
you parse each piece, and that may have side-effects where you'd like to
do a register write-back early, but either:
must wait until you see all of the instruction until you commit state
or
must have "undo" shadow-registers
or
must use instruction-continuation with fairly tricky exception
handling to restore the state of the machine
It is also interesting to note that the original member of the family to
which O3 belongs was rather simpler in some of the critical areas,
with only 5 instruction sizes, of maximum size 10 bytes, and no indirect
addressing, and requiring alignment (i.e., it was a much more RISC-like
design, and it would be a fascinating speculation to know if that
extra complexity was useful in practice).
Now, here's the table again, with the labels:

CPU Age 3a 3b 3c 3d 4a 4b 5a 5b 6a 6b # ODD
RULE <6 =1 =4 <5 =0 =0 =1 <2 =1 >4 >3
-------------------------------------------------------------------------
A1 4 1 4 1 0 0 1 0 1 8 3+ 1 AMD 29K
B1 5 1 4 1 0 0 1 0 1 5 4 - R2000
C1 2 1 4 2 0 0 1 0 1 5 4 - SPARC
D1 2 1 4 3 0 0 1 0 1 5 0+ 1 MC88000
E1 5 1 4 10+ 0 0 1 0 1 5 4 1 HP PA
F1 5 2+ 4 1 0 0 1 0 1 4+ 3+ 3 IBM RT/PC
G1 1 1 4 4 0 0 1 1 1 5 5 - IBM RS/6000
H1 2 1 4 4 0 0 1 0 1 5 4 - Intel i860
---------------------------------------------------------------
L4 26 4 8 2* 0* 1 2 2 4 4 2 2 IBM 3090
M2 12 12 12 15 0* 1 2 2 4 3 3 1 Intel i486
N1 10 21 21 23 1 1 2 2 4 3 3 - NSC 32016
O3 11 11 22 44 1 1 2 2 8 4 3 - MC 68040
P3 13 56 56 22 1 1 6 2 24 4 0 - VAX

General comment: this may sound weird, but in the long term, it might
be easier to deal with a really complicated bunch of instruction
formats, than with a complex set of addressing modes, because at least
the former is more amenable to pre-decoding into a cache of
decoded instructions that can be pipelined reasonably, whereas the pipeline
on the latter can get very tricky (examples to follow). This can lead to
the funny effect that a relatively "clean", orthogonal archiecture may actually
be harder to make run fast than one that is less clean. Obviously, every
weirdness has its penalties.... But consider the fundamental difficulty
of pipelining something like (on a VAX):
ADDL @(R1)+,@(R1)+,@(R2)+

(I.e., something that might theoretically arise from:
int **r1, **r2;
**r2++ = **r1++ + **r1++;
and which a RISC machine would do (most straightforwardly) as:
lw r3,0(r1) *r1
add r1,4 r1++
lw r4,0(r1) *r1 again
add r1,4 r1++
lw r5,0(r2) r5 = *r2
add r6,r3,r4 sum in r6
sw r6,0(r5) **r2 = sum
add r2,4 r2++
(Now, some RISCs might use auto-increment to get rid of, for example,
the last add; in any case, smart compilers are quite likely to generate
something more like:
lw r3,0(r1) *r1
lw r4,4(r1) *(r1+1)
add r1,8 r1 += 8
lw r5,0(r2) r5 = *r2
add r6,r3,r4 sum in r6
sw r6,0(r5) **r2 = sum
add r2,4 r2++
which has no stalls anywhere on most RISCs.)
Now, consider what the VAX has to do:
1) Decode the opcode (ADD)
2) Fetch first operand specifier from I-stream and work on it.
a) Compute the memory address from (r1)
If aligned
run through MMU
if MMU miss, fixup
access cache
if cache miss, do write-back/refill
Elseif unaligned
run through MMU for first part of data
if MMU miss, fixup
access cache for that part of data
if cache miss, do write-back/refill
run through MMU for second part of data
if MMU miss, fixup
access cache for second part of data
if cache miss, do write-back/refill
Now, in either case, we have a longword that has the
address of the actual data.
b) Increment r1 [well, this is where you'd LIKE to do it, or
in parallel with step 2a).] However, see later why not...
c) Now, fetch the actual data from memory, using the address just
obtained, doing everything in step 2a) again, yielding the
actual data, which we need to stick in a temporary buffer, since it
doesn't actually go in a register.
3) Now, decode the second operand specifier, which goes thru everything
that we did in step 2, only again, and leaves the results in a second
temporary buffer. Note that we'd like to be starting this before we get
done with all of 2 (and I THINK the VAX9000 probably does that??) but
you have to be careful to bypass/interlock on potential side-effects to
registers .... actually, you may well have to keep shadow copies of
every register that might get written in the instruction, since every
operand can use auto-increment/decrement. You'd probably want badly to
try to compute the address of the second argument and do the MMU
access interleaved with the memory access of the first, although the
ability of any operand to need 2-4 MMU accesses probably makes this
tricky. [Recall that any MMU access may well cause a page fault....]

4) Now, do the add. [could cause exception]

5) Now, do the third specifier .... only, it might be a little different,
depending on the nature of the cache, that is, you cannot modify cache or
memory, unless you know it will complete. (Why? well, suppose that
the location you are storing into overlaps with one of the indirect-addressing
words pointed to by r1 or 4(r1), and suppose that the store was unaligned,
and suppose that the last byte of the store crossed a page boundary and
caused a page fault, and that you'd already written the first 3 bytes.
If you did this straightforwardly, and then tried to restart the
instruction, it wouldn't do the same thing the second time.)

6) When you're sure all is well, and the store is on its way, then you
can safely update the two registers, but you'd better wait until the end,
or else, keep copies of any modified registers until you're sure it's safe.
(I think both have been done ??)

7) You may say that this code is unlikely, but it is legal, so the CPU must
do it. This style has the following effects:
a) You have to worry about unlikely cases.
b) You'd like to do the work, with predictable uses of functional
units, but instead, they can make unpredictable demands.
c) You'd like to minimize the amount of buffering and state,
but it costs you in both to go fast.
d) Simple pipelining is very, very tough: for example, it is
pretty hard to do much about the next instruction following the
ADDL (except some early decode, perhaps), without a lot of gates
for special-casing.
(I've always been amazed that CVAX chips are fast as they are,
and VAX 9000s are REALLY impressive...)
e) EVERY memory operand can potentially cause 4 MMU uses,
and hence 4 MMU faults that might actually be page faults...
8) Consider how "lazy" RISC designers can be, with the RISC sequence shown:
a) Every load/store uses exactly 1 MMU access.
b) The compilers are often free to re-arrange the order, even across
what would have been the next instruction on a CISC.
This gets rid of some stalls that the CISC may be stuck with
(especially memory accesses).
c) The alignment requirement avoids especially the problem with
sending the first part of a store on the way before you're SURE
that the second part of it is safe to do.
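
(To make 8a)-8c) concrete: here is the same kind of work written as a
C-level sketch of what a RISC compiler emits. Every memory touch is a
separate aligned load or store, one MMU use apiece, and the compiler is
free to reorder them. All the names, and the choice of destination, are
invented for illustration; 4-byte longs assumed.)

void add_indirect(long **r1)
{
	long *p0, *p1;
	long t0, t1;

	p0 = r1[0];	/* load pointer word at (r1): 1 MMU use */
	p1 = r1[1];	/* load pointer word at 4(r1): 1 MMU use */
	t0 = *p0;	/* load first operand: 1 MMU use */
	t1 = *p1;	/* load second operand: 1 MMU use */
	*p1 = t0 + t1;	/* the one store, issued only when known safe */
	/* the auto-increment of r1 becomes a plain register add */
}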

Finally, to be fair, let me add the two cases that I knew of that were more
on the borderline: i960 and Clipper:
CPU Age 3a 3b 3c 3d 4a 4b 5a 5b 6a 6b # ODD
RULE <6 =1 =4 <5 =0 =0 =1 <2 =1 >4 >3
-------------------------------------------------------------------------
J1 5 4+ 8+ 9+ 0 0 1 0 2 4+ 3+ 5 Clipper
K1 3 2+ 8+ 9+ 0 0 1 2+ - 5 3+ 5 Intel 960KB

SUMMARY:
1) RISCs share certain architectural characteristics, although there
are differences, and some of those differences matter a lot.
2) However, the RISCs, as a group, are much more alike than the
CISCs as a group.
3) At least some of these architectural characteristics have fairly
serious consequences on the pipelinability of the ISA, especially
in a virtual-memory, cached environment.
4) Counting instructions turns out to be fairly irrelevant:
a) It's HARD to actually count instructions in a meaningful
way... (if you disagree, I'll claim that the VAX is RISCier
than any RISC, at least for part of its instruction set :-)
b) More instructions aren't what REALLY hurts you, anywhere
near as much as features that are hard to pipeline:
c) RISCs can perfectly well have string-support, or decimal
arithmetic support, or graphics transforms ... or lots of
strange register-register transforms, and it won't cause
problems ..... but compare that with the consequence of
adding a single instruction that has 2-3 memory operands,
each of which can go indirect, with auto-increments,
and unaligned data...
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: ma...@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086

David Chase

unread,
Apr 18, 1991, 2:10:07 PM4/18/91
to
Hey, you left out the Acorn Risc Machine. Not a big name in the
workstation market, but I understand that they sell a good number of
them, and the instruction set is "interesting" (a bit-twiddler/
superoptimizer's wet dream, if you ask me, but probably "risc" by many
definitions of the term). VTI sells this chip in the US; they should
be able to give you a spec if you want it.

David Chase
Sun

Marc Sabatella

unread,
Apr 17, 1991, 12:23:00 PM4/17/91
to
>no language he [ Herman Rubin, actually ]

>knew of had semantics for retrieving both the integer quotient and the
>remainder from a floating divide, the point being that both values are
>usually available from any implementation of floating point divide, but
>that the language stands in the way of getting them at the same time.

In addition to Common LISP, Forth also has this ability. In fact, it comes
about as close to Herman's idealized "user definable syntax macro processor"
as any language I can think of.

Henry Spencer

unread,
Apr 18, 1991, 3:55:34 PM4/18/91
to
In article <24...@spim.mips.COM> ma...@mips.com (John Mashey) writes:
>-A1 [29k] is slightly unusual in having more integer registers, and less FP
>than usual. ...
>-D1 [88k] is unusual in sharing integer and FP registers

This is slightly out of date. AMD appears to have (wisely) decided to ditch
the peculiar 29027 FPU architecture. The 29050's on-chip floating point uses
the integer register bank (disregarding one or two odd instructions), subject
to a constraint that double-precision arithmetic use even-odd pairs.
--
And the bean-counter replied, | Henry Spencer @ U of Toronto Zoology
"beans are more important". | he...@zoo.toronto.edu utzoo!henry

John Mashey

unread,
Apr 18, 1991, 7:24:36 PM4/18/91
to

Sorry, it could have been included, but I just ran out of time and
space, and I thought I had enough data to make the point, which was that
there were noticeable differences between RISC and CISC architectures,
regardless of the implementation.
The ARM would certainly get classified as a RISC, with
32-bit instructions, 1 size
a handful of memory address modes
no indirect addressing
only loads/stores access memory
no more than 1 memory address/instruction
alignment (I think)
1 use of memory control/TLB per instruction for data
4 bits available for integer register specifiers
The manual I have shows it with no FP at all, but then it's 2 years old.

Although I didn't post them, the more complete tables that I was working from
contained multiple implementations of some of the architecture families,
from which one may find several trends:
a) Only occasionally does anyone (RISC or CISC) subtract instructions.
b) Both RISC and CISC often add instructions as time goes on.
c) Sometimes CISCs got CISCier in their ISAs, e.g., by adding
addressing modes the way the 68020 did, or by deleting alignment
requirements the way 360->370 did.
d) Sometimes CISC implementations were done in more RISC-like
fashion (i.e., trap certain opcodes and emulate).
e) I could find no architecture that clearly started as a RISC,
and then seriously evolved into a CISC, or took on any of the
attributes that I used to distinguish CISCs and RISCs.
(So much for this idea that CRISP = Complex Reduced Instruction Set
Processor (not AT&T CRISP) is a merger of RISC and CISC, and that current
RISC and CISC architectures are evolving towards each other.
Nonsense.)

Torben Ægidius Mogensen

unread,
Apr 19, 1991, 9:36:34 AM4/19/91
to
ma...@mips.com (John Mashey) writes:

>Age: number of years since first implementation sold in this family
>(or first thing of which this is binary compatible with)
>3a: # instruction sizes
>3b: maximum instruction size in bytes
>3c: number of distinct addressing modes for accessing data (not jumps)
>3d: indirect addressing: 0: no, 1: yes
>4a: load/store combined with arithmetic: 0: no, 1:yes
>4b: maximum number of memory operands
>5a: unaligned addressing of memory references allowed in load/store,
> without specific instructions
> 0: no never (MIPS, SPARC, etc)
> 1: sometimes (as in RS/6000)
> 2: just about any time
>5b: maximum number of MMU uses for data operands in an instruction
>6a: number of bits for integer register specifier
>6b: number of bits for 64-bit or more FP register specifier,
> distinct from integer registers

You might add ARM to the list

CPU Age 3a 3b 3c 3d 4a 4b 5a 5b 6a 6b # ODD
RULE <6 =1 =4 <5 =0 =0 =1 <2 =1 >4 >3
-------------------------------------------------------------------------

6+ (7?) 1 4 13+ 0 0 1? 0 1? 4+ 0+ 4 ARM

Notes:

There are actually 4 bits that specify addressing mode, but
four of the 16 modes have the same effect. This is due to a very
orthogonal specification.

The (?) in 4b and 5b are due to the load/store multiple register
instructions, which use one memory access per register.

There is no FP unit on the chip, thus no specific FP registers. There
has been an announcement of an FPU to appear this year, but I don't
know anything about it.

I'm not certain about the age, but it was, I think, the first
commercially available RISC processor.

Torben Mogensen (tor...@diku.dk)

Torben Ægidius Mogensen

unread,
Apr 19, 1991, 9:54:27 AM4/19/91
to
I made an error in the number of addressing modes. The stated 13 modes
(16 with 4 identical) should be 10, since in 6 (8) of the modes, one
of the specifying bits has no effect in user mode. You could also say
that there are 20 modes, as each mode can transfer either 8 or 32 bits.

Torben Mogensen (tor...@diku.dk)

Donald Lindsay

unread,
Apr 21, 1991, 2:41:51 PM4/21/91
to
In article <71...@auspex.auspex.com> g...@auspex.auspex.com (Guy Harris) writes:
>And what about some of the
>more elaborate procedure-calling, procedure-entry, or procedure-exit
>instructions?
>
>Admittedly, to some extent, the problems with the more elaborate
>features may come either from 1) sloppy implementations of them and
>2) using the elaborate features even when inappropriate

The problem with an elaborate procedure calling convention is that
you *have* to do it that way - by emulation if not by use of the
fancy instructions. To do otherwise is to confuse debuggers,
exception/longjmp packages, and smart linkers, not to mention
breaking separate compilation.

Actually, the worst calling convention I ever dealt with had none of
those problems. IBM 360 PL/I-F didn't *have* a debugger, smart
linkers hadn't been invented, and the exception package was easily
tacked on. The reason it was easy was because the machine code for a
procedure prologue contained (get this) a *call* to the runtime
package...
--
Don D.C.Lindsay Carnegie Mellon Robotics Institute

Michael Meissner

unread,
Apr 21, 1991, 3:37:14 PM4/21/91
to
In article <12...@pt.cs.cmu.edu> lin...@gandalf.cs.cmu.edu (Donald
Lindsay) writes:

| Actually, the worst calling convention I ever dealt with had none of
| those problems. IBM 360 PL/I-F didn't *have* a debugger, smart
| linkers hadn't been invented, and the exception package was easily
| tacked on. The reason it was easy was because the machine code for a
| procedure prologue contained (get this) a *call* to the runtime
| package...

If I remember correctly, the Unix V6 (Ritchie) compiler did this
too....
--
Michael Meissner email: meis...@osf.org phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

Considering the flames and intolerance, shouldn't USENET be spelled ABUSENET?

Donald Lindsay

unread,
Apr 21, 1991, 5:05:27 PM4/21/91
to
In article <1991Apr12...@ux1.cso.uiuc.edu>
msp3...@uxa.cso.uiuc.edu (Michael S. Pereckas) writes:
>The problem is that you will have to include those instructions in the
>next version of your processor in order to remain binary compatible,
>and if things work out differently, as they probably will, they may be
>hard to implement.

In article <3...@scorpio.gtephx.UUCP> robe...@scorpio.UUCP
(Wild Rider) gives the 68020 "callm"/"rtm" as examples.

An example they weren't able to dump was "cas2". This fetches two
independent words, and then conditionally overwrites them, with one
big R/M/W lock on memory during the whole 4 memory cycles.

Since each of these words could be unaligned, there could be four
page faults just in fetching them. (Let's not even talk about
write-protected/copy-on-write pages.) The 68040 team didn't want to
leave memory locked for a potentially long period, and yet releasing
the lock logically implies a full restart. So, before attempting the
reads, the 68040 ensures that all 1-4 mappings are in the TLB and
valid. That's not as trivial as it may sound: the design has to allow
for (say) the 4th mapping bumping the 3rd one back out of the TLB.

The chip works, but I believe they regret getting themselves into
this.

Shankar Unni

unread,
Apr 20, 1991, 12:43:22 AM4/20/91
to
In comp.arch, robe...@gtephx.UUCP (Wild Rider) writes:

> or, you could do like motorola did when they designed the 030:
> dump a couple of unused 020 instructions, i.e., the "callm"
> ("call module") and the corresponding "rtm" ("return from
> module") instructions. yes, that's correct, you can have an

That's surely an oversimplification. What usually happens is that those
instructions will raise some sort of "unimplemented instruction" (or
so-called "assist") exception, which is different from the run-of-the-mill
"illegal instruction", moving the responsibility for simulating
those instructions to the kernel (or the user program, if you are running
stand-alone embedded programs). This sort of thing is commonly done with
certain types of floating-point instructions.
-----
Shankar Unni E-Mail:
HP India Software Operation, Bangalore Internet: sha...@cup.hp.com
Phone : +91-812-261254 x417 UUCP: ...!hplabs!hpda!shankar

peter da silva

unread,
Apr 22, 1991, 10:11:22 AM4/22/91
to
In article <12...@pt.cs.cmu.edu>, lin...@gandalf.cs.cmu.edu (Donald Lindsay) writes:
> The reason it was easy was because the machine code for a
> procedure prologue, contained (get this) a *call* to the runtime
> package...

The K&R C compiler for the PDP-11 did this, too. And it did returns with
a "jmp cret".

You want a weird calling convention, try the 1802. A subroutine call is made
by changing which register is acting as the PC: all subroutines were
coroutines. To get stack calling you set up a pair of call-return subroutines,
burned a couple registers as their start address, and called them with SEP 4
and SEP 5. The only machine I know for which Forth (particularly token
indirect threaded code) was faster than regular subroutine calls using the
Standard Call/Return technique.

(I know, you're all tired of me flaming about the 1802. But it was a really
CUTE micro. Sort of like a boutique car.)
--
Peter da Silva. `-_-' pe...@ferranti.com
+1 713 274 5180. 'U` "Have you hugged your wolf today?"

Jerry Callen

unread,
Apr 22, 1991, 11:37:16 AM4/22/91
to
In article <12...@pt.cs.cmu.edu> lin...@gandalf.cs.cmu.edu (Donald Lindsay) writes:

>[regarding 68020 instructions that Moto had to carry over to the 68040]
>
>An example they weren't able to dump was "cas2". This fetches two
>independent words, and then conditionally overwrites them, with one
>big R/M/W lock on memory during the whole 4 memory cycles.
>
>Since each of these words could be unaligned, there could be four
>page faults just in fetching them.

cas2 is a horrible case of overkill, but a "reasonable" compare and swap
instruction would be AWFULLY handy. By "reasonable" I mean an instruction
that:

- fetches an aligned word from memory
- compares it to a register
- if the values match, the value in another register is stored back
- a condition code is set somewhere that indicates whether or not the
store was done.
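
In C terms, that is roughly the following (illustrative only: the names
are invented, and the whole body has to happen atomically, which is
exactly what the hardware instruction buys you):

int compare_and_swap(int *addr, int expected, int newval)
{
	if (*addr == expected) {	/* fetch the word and compare */
		*addr = newval;		/* match: do the store */
		return 1;		/* "condition code": store was done */
	}
	return 0;			/* no match: store was not done */
}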

The ancient IBM 370 offered essentially this instruction; the OS used it
extensively, and I used it to implement locks at user level. It was
sure handy!

Yes, this is VERY complex by RISC standards, but it makes a number of
nasty synchronization problems almost trivially easy. You can sorta-kinda
emulate compare and swap using test and set (which the 88K has) and spin
locks, but to do so in complete safety you have to disable interrupts
(nasty from user code...).

What makes a "reasonable" compare and swap so hard to implement that
current RISCs don't do it? It seems to me that implementing test and
set already forces MOST of the hard work to be done anyway.

-- Jerry Callen
jca...@encore.com

Richard Tobin

unread,
Apr 22, 1991, 12:30:35 PM4/22/91
to
In article <1991Apr19.1...@odin.diku.dk> tor...@diku.dk (Torben Ægidius Mogensen) writes:

>You might add ARM to the list

>I'm not certain about the age, but it was, I think, the first
>commercially available RISC processor.

I'm not certain either (surely there's someone from Acorn out there?) but
it was certainly being discussed at Acorn in the summer of 1983.

-- Richard
--
Richard Tobin, JANET: R.T...@uk.ac.ed
AI Applications Institute, ARPA: R.Tobin%uk.a...@nsfnet-relay.ac.uk
Edinburgh University. UUCP: ...!ukc!ed.ac.uk!R.Tobin

Peter Song

unread,
Apr 23, 1991, 2:47:37 PM4/23/91
to
In article <14...@encore.Encore.COM> jca...@encore.Com (Jerry Callen) writes:
| cas2 is a horrible case of overkill, but a "reasonable" compare and swap
| instruction would be AWFULLY handy. By "reasonable" I mean an instruction
| that:
|
| - fetches an aligned word from memory
| - compares it to a register
| - if the values match, the value in another register is stored back
| - a condition code is set somewhere that indicates whether or not the
| store was done.
|
| The ancient IBM 370 offered essentially this instruction; the OS used it
| extensively, and I used it to implement locks at user level. It was
| sure handy!
|
| Yes, this is VERY complex by RISC standards, but it makes a number of
| nasty synchronization problems almost trivially easy. You can sorta-kinda
| emulate compare and swap using test and set (which the 88K has) and spin
| locks, but to do so in complete safety you have to disable interrupts
| (nasty from user code...).
|
| What makes a "reasonable" compare and swap so hard to implement that
| current RISCs don't do it? It seems to me that implementing test and
| set already forces MOST of the hard work to be done anyway.
|
| -- Jerry Callen
| jca...@encore.com

An efficient compare-and-swap (CAS) function can be provided in RISC
architectures without providing one instruction that does it. I'll give
an example using the load linked (LL) and store conditional (SC)
instructions in the MIPS II architecture.

The LL reads from a memory location and sets a status bit. The SC
completes the write only if the status bit is still set; the bit is
cleared by any intervening write to the location it guards. The
following code provides the compare-and-swap(old,new,match) ==
compare-and-swap(T0,T1,T2) function:

L: LL T4,(T0) ;T0 has the address of old
 BNE T4,T2,SKIP ;if old != match, skip
 NOP ;branch delay slot
 SC T1,(T0) ;try to write new atomically
 ;SC writes 1 into T1 on success, 0 on failure
 ;(note that it clobbers T1, the "new" value)
 ...
SKIP: ;store in compare-and-swap didn't happen

If several processes execute the compare-and-swap simultaneously, only
the first one to execute the SC instruction will succeed.

On a side note, I think the only real benefit of compare-and-swap over
the simpler test-and-set is that it semantically defines when the write
takes place. That definition reduces the number of writes in a spinlock
that require coherency actions in cache-based systems. This is not to
say that test-and-set can't be implemented in ways that eliminate the
unnecessary writes. For instance, the Am29K has LOADSET, a test-and-set
type of instruction that defines -1 to be the "set" value. Since the
"set" value is defined in hardware, LOADSET can check the lock and not
start the write if the lock is already set. This is possible only
because the instruction defines the "set" value.
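
The same write-avoiding idea can be had in software around an ordinary
test-and-set ("test-and-test-and-set"); a sketch in C, where
test_and_set() is an assumed atomic primitive that returns the old
value:

extern int test_and_set(volatile int *lock);	/* assumed primitive */

void acquire(volatile int *lock)
{
	for (;;) {
		while (*lock != 0)	/* spin on plain reads: no coherency writes */
			;
		if (test_and_set(lock) == 0)
			return;		/* old value was 0, so the lock is ours */
	}
}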


-Peter
--
S. Peter Song - 29K Advanced Processor Development - Advanced Micro Devices
pe...@nucleus.amd.com (800) 531-5202 x54818

Barry Margolin

unread,
Apr 23, 1991, 12:28:14 PM4/23/91
to
In article <1991Apr23....@zoo.toronto.edu> he...@zoo.toronto.edu (Henry Spencer) writes:
>In article <MEISSNER.91...@curley.osf.org> meis...@osf.org (Michael Meissner) writes:
>>| ... The reason it was easy was because the machine code for a
>>| procedure prologue contained (get this) a *call* to the runtime
>>| package...
>>If I remember correctly, the Unix V6 (Ritchie) compiler did this too...
>All instantiations of the Ritchie compiler did it, in fact. On the whole,
>it made sense: it saved code space, which was usually more of an issue on
>the 11 than execution time. And it wasn't actually that much slower than
>inlining the relevant code, given that we're talking about machines that
>had little or no pipelining.

On Multics, external procedures also make a call to the runtime system
(called "operators", for some reason -- I guess to distinguish the library
routines that the compiler "knows" about from the ones that it doesn't)
during the standard entry sequence.

Besides reducing code space, it is a very useful feature of the software
development environment. The operator procedures are addressed through a
transfer vector. Thus, alternate versions of the operators could be
dynamically linked into the process (similar to the way personal computer
customizations patch trap vectors). The most notable uses of this are the
procedure call tracing mechanism and the metering facility. On most
systems, one must compile a program with a special option in order to make
it maintain metering statistics; on Multics, one just issues a command that
switches operators, runs the program, and then switches back to the regular
operators (there's a program that combines all this). For call tracing,
one switches to versions of the entry and exit operators that display a
message.
--
Barry Margolin, Thinking Machines Corp.

bar...@think.com
{uunet,harvard}!think!barmar

Henry Spencer

unread,
Apr 23, 1991, 11:19:32 AM4/23/91
to
In article <MEISSNER.91...@curley.osf.org> meis...@osf.org (Michael Meissner) writes:
>| ... The reason it was easy was because the machine code for a
>| procedure prologue contained (get this) a *call* to the runtime
>| package...
>
>If I remember correctly, the Unix V6 (Ritchie) compiler did this too...

All instantiations of the Ritchie compiler did it, in fact. On the whole,
it made sense: it saved code space, which was usually more of an issue on
the 11 than execution time. And it wasn't actually that much slower than
inlining the relevant code, given that we're talking about machines that
had little or no pipelining.

John R. Levine

unread,
Apr 23, 1991, 5:57:01 PM4/23/91
to
In article <1991Apr23.1...@dvorak.amd.com> you write:
>[How to implement CAS on a MIPS, and then ...]

>On a side note, I think the only real benefit of the compare-and-swap over
>the more simpler test-and-set is that it semantically defines when the write
>is to take place.

Compare and swap is fundamentally more powerful. With CAS you can implement
wait-free interlocking with arbitrarily many contending processors for such
operations as allocating buffers from a pool or linking request blocks into
a queue. Test and set doesn't let you implement wait-free interlocks at all
for more than two processes; it's mostly only good for spin locks. There is
an interesting article by Maurice Herlihy in the January TOPLAS comparing
the ability of various constructs to implement wait-free synchronization,
and it turns out that nothing he looked at weaker than compare and swap is
adequate to synchronize arbitrarily many processes without waiting. (The
only other one he found was atomically updated queues with peeking at the
top entry, a lot more complex than compare and swap.)
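
(For instance, linking a request block onto a shared list needs no lock
at all, given CAS. A sketch in C: cas() is assumed atomic, returning
nonzero on success, as sketched earlier in the thread. Strictly this
retry loop is lock-free rather than wait-free, but no process ever
blocks holding a lock:)

struct req { struct req *next; /* ... request fields ... */ };
extern int cas(struct req **addr, struct req *expected, struct req *newval);

void enqueue(struct req **head, struct req *r)
{
	struct req *old;

	do {
		old = *head;	/* snapshot the current head */
		r->next = old;	/* link our block in front of it */
	} while (!cas(head, old, r));	/* someone beat us? just retry */
}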

In many cases, the penalty for spin locks is low, because you only lock a
small straight-line piece of code, but not always. For one thing, it's not
always desirable to lock out all interrupts during a locked section, since
the cost of turning interrupts on and off may be considerably greater than
that of the locked code, particularly if the code is running as part of a
user program. If the lock is in a user program, there is also the
possibility that the operating system will time-slice, page, swap, or
otherwise slow down the locked code at an inconvenient time.

In any system, there is always the possibility that one process or processor
will crash, and spin locks make it a lot harder to recover from crashes.

Regards,
John Levine, jo...@iecc.cambridge.ma.us, {spdcc|ima|world}!iecc!johnl

Jeff Kenton OSG/UEG

unread,
Apr 22, 1991, 1:54:10 PM4/22/91
to
In article <14...@encore.Encore.COM>, jca...@Encore.COM (Jerry Callen) writes:

|>
|> Yes, this is VERY complex by RISC standards, but it makes a number of
|> nasty synchronization problems almost trivially easy. You can sorta-kinda
|> emulate compare and swap using test and set (which the 88K has) and spin
|> locks, but to do so in complete safety you have to disable interrupts
|> (nasty from user code...).
|>

|> -- Jerry Callen
|> jca...@encore.com

The 88K has xmem, not test and set, which is guaranteed to be an atomic
operation (the bus is locked between the read and write cycles). This
can be used to create locks or compare and swap without disabling interrupts.

Go look at the locking primitives in Encore's 88K kernel. They are built
around xmem's. (I was there.)
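
(For readers without the 88K manual: xmem atomically exchanges a
register with a memory word. A lock built on such a swap, sketched in
C, with xmem() standing in for the real instruction -- an assumption,
not Encore's actual code:)

extern int xmem(volatile int *addr, int newval);  /* atomic swap, returns old */

void lock(volatile int *l)
{
	while (xmem(l, 1) != 0)	/* swap in "held"; old value 0 means we won */
		;
}

void unlock(volatile int *l)
{
	*l = 0;			/* an ordinary store releases it */
}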


-----------------------------------------------------------------------------
== jeff kenton Consulting at ken...@decvax.dec.com ==
== (617) 894-4508 (603) 881-0011 ==
-----------------------------------------------------------------------------

Donald Lindsay

unread,
Apr 24, 1991, 11:15:20 PM4/24/91
to
In article <1991Apr23.1...@Think.COM> bar...@think.com
(Barry Margolin) writes:
| ... The reason it was easy was because the machine code for a
| procedure prologue contained (get this) a *call* to the runtime
| package...

>...alternate versions of the operators could be


>dynamically linked into the process (similar to the way personal computer
>customizations patch trap vectors). The most notable uses of this are the
>procedure call tracing mechanism and the metering facility.

Another debugging feature available via this trick is break-when-
expression-true. The expression is of course only evaluated on every
procedure entry, not on every cycle. However, my personal experience
(on a DEC 20) was very positive. I was usually trying to find out
which routine was using a wild pointer, so per-call was the perfect
granularity.

Of course, there are other ways to achieve the same effect on vanilla
hardware. Aside from page protection cleverness, one can run a snoop
process on one of the other CPUs. (Multiprocessors *are* vanilla,
right?)

David Halko

unread,
Apr 23, 1991, 11:31:37 AM4/23/91
to
In article <97...@mentor.cc.purdue.edu>, hru...@pop.stat.purdue.edu (Herman Rubin) writes:
> In article <1991Apr5.1...@agate.berkeley.edu>, do...@eris.berkeley.edu (Doug Merritt) writes:
> > In article <1991Apr4...@capd.jhuapl.edu> wal...@capd.jhuapl.edu writes:
> > > He observes that "Few compilers use any of the nifty instructions that
> > > a CISC has." I believe I read recently that RISC was based on the
> > > observation that, in fact, only about 30% of the instructions in CISC
> > > computers were used by compilers. The rest of the instructions, for
> > > all practical purposes, were just excess baggage.
> .........................
> > Anyway, the point is that even if compilers *do* use every possible feature
> > of a CISC, it still usually won't make the CISC s/w competitive with RISC
> > s/w. CISC cpus almost never put sufficient h/w optimization into the support
> > of the fancy instructions. (There are exceptions to this, of course, and I
> > haven't been watching the 030 and 040 to see how they do in this regard.
> > CISC may yet rise again, but not until after superscalar RISC has been
> > completely exploited.)
> .........................
> If the languages do not allow the user to communicate the use of those
> features of the CISC architecture which can make the object code considerably
> faster, the compiler cannot take advantage of them. This IS the situation.
> For example, suppose there was an instruction, which could be effectively
> used, which would divide a float by a float, obtaining an integer quotient
> and a float remainder. Now just how is the compiler to recognize that this
> is in fact wanted?
> .........................
What do you mean how is a compiler supposed to recognize that this is wanted?

AT&T came out with a DSP board which utilized neato CPU features in its
high level languages... since floating point multiply was so much faster
than integer multiply, it would take integers, convert them to float, do
the floating point multiplication, and convert back to integer again. Man,
talk about fast and nifty. This was all handled internally, inside the C
Compiler. It seems quite useful for integer division to do the same, here.
There were, however, some minor flaws in the Compiler which took away from
some of its efficiency. I used to use global search and replace with specific
patterns to optimize the code I produced...
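
(In C terms the trick he describes amounts to something like the line
below -- my illustration, not AT&T's code, and exact only while the
product fits in the float significand, which is presumably why it
worked on that DSP:)

int mul_via_fp(int a, int b)
{
	/* route the integer multiply through the fast FP multiplier */
	return (int)((float)a * (float)b);
}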

> There is, at present, no language designed to take into account the collection
> of "natural" useful hardware which the present users can find uses for, and
> which are far more expensive in software than in hardware.

I think it may be more of a question of "Are CISC features being fully realized by
compiler designers?"

I wonder if RISC architecture is all what it is cracked up to be. I work on
a SUN SS1, and it is not all that much faster than a 3/60. Quite often, I
use the 3/60's we have instead.
--
_______________________________________________________________________
/ \
/ "The use of COBOL cripples the mind; its teaching should, therefore, be \
/ regarded as a criminal offence." E.W.Dijkstra, 18th June 1975. \
+-----------------------------------------------------------------------------+
\ "Have you purchased a multi- hal...@moravian.edu Have you booted your /
\ media machine from IMS yet?" David J. Halko OS-9 computer today?/
\_______________________________________________________________________/

Michael Meissner

unread,
Apr 25, 1991, 1:45:21 PM4/25/91
to
In article <40...@batman.moravian.EDU> hal...@batman.moravian.EDU
(David Halko) writes:

| I wonder if RISC architecture is all what it is cracked up to be. I work on
| a SUN SS1, and it is not all that much faster than a 3/60. Quite often, I
| use the 3/60's we have instead.

From your posting, I would guess that you use a lot of integer
multiplies (non-constant in particular). The initial Sparc hardware
does not have an integer multiply or divide instruction. It has a
multiply step instruction (not sure about divide) to give a partial
answer.

Herman Rubin

unread,
Apr 27, 1991, 8:05:16 AM4/27/91
to

I mean exactly that. I am assuming that the operations are done by means of
a sequence of instructions, and not a call to some (possibly inlined) function.
I know of NO hardware on which I would consider calls to be reasonable where
short sequences of code will do the job.

It is not the case that the algorithm to be used is the same on different
machines. Hardware differences can cause major problems, as was seen when
cache/paging first came in; an algorithm for matrix multiplication which
would seem to be better from the timing standpoint (same operations, fewer
loads, far fewer stores) caused page faults on the great bulk of its loads,
and thus became non-competitive for large matrices.

Consider the following two operations, which comprise more than 25% of the
operations in some highly efficient, from the "picocode" standpoint, methods
for generating non-uniform random numbers.

1. Examine a bit, and take some action depending on it. Also, the next time
this is to be done, this bit effectively no longer exists. It is necessary
to make provision at least somewhere for replenishing the supply of bits.

2. Find the distance to the next one in a bit stream. The same considerations
as above apply.

Try coding this in any present language, including assembler, in such a way
that a compiler can be expected to recognize one of these without being
explicitly instructed.

I am not saying that this should not be the case. But it is the user, not
the language designer, who has to do this. Whatever language, present or
future, is used, human beings are going to come up with extremely simple
operations which the language designers have not thought of, and for which
doing those operations using the language primitives will be clumsy.

The user must be able to instruct the compiler about these operations
in a manner similar to the way it now looks at addition and multiplication.
If there are competing ways to do something, the compiler must know the
available variety. This is even the case for adding vectors in C.

> AT&T came out with a DSP board which utilized neato CPU features in its
> high level languages... since floating point multiply was so much faster
> than integer multiply, it would take integers, convert them to float, do
> the floating point multiplication, and convert back to integer again. Man,
> talk about fast and nifty. This was all handled internally, inside the C
> Compiler. It seems quite useful for integer division to do the same, here.
> There were, however, some minor flaws in the Compiler which took away from
> some of it's efficiency. I used to use global search and replace with specific
> patterns to optimize the code I produced...

If this was the case, the hardware was deficient. To a mathematician, floating
point arithmetic is a complicated version of fixed point arithmetic. This is a
point which the RISC designers do not see; the chip real estate for doing good
integer arithmetic is already there in the floating point stuff, and the design
should be such as to use it. Floating point arithmetic uses fixed point
internally, so this would not be a problem AT THE TIME OF THE DESIGN OF THE
ARCHITECTURE. It is a problem afterward.

> > There is, at present, no language designed to take into account the collection
> > of "natural" useful hardware which the present users can find uses for, and
> > which are far more expensive in software than in hardware.
>
> I think it may be more of a question of "Are CISC features being fully realized by
> compiler designers?"

Have any language designers or hardware designers ever asked for input on the
variety of operations which are readily understood by people like Silverman,
Lehmer, etc.? Have language designers considered that vastly different
algorithms should be used with different architectures? Most optimization
procedures which I have seen assume not.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hru...@l.cc.purdue.edu (Internet, bitnet) {purdue,pur-ee}!l.cc!hrubin(UUCP)

Preston Briggs

unread,
Apr 28, 1991, 11:46:03 AM4/28/91
to
hru...@pop.stat.purdue.edu (Herman Rubin) writes:

> If the languages do not allow the user to communicate the use of those
> features of the CISC architecture which can make the object code considerably
> faster, the compiler cannot take advantage of them. This IS the situation.
> For example, suppose there was an instruction, which could be effectively
> used, which would divide a float by a float, obtaining an integer quotient
> and a float remainder. Now just how is the compiler to recognize that this
> is in fact wanted?

In Fortran, you might write

iq = f1 / f2

and a while later (during which f1 and f2 are not changed)

fr = amod(f1, f2)

The compiler could recognize, during value numbering, that a
quotient and remainder of the same operands was being computed.
During instruction selection (or perhaps peephole optimization),
it could recognize that it could avoid the real2int conversion
of the quotient by using the "hrubin" instruction.

Not everyone does these things the way I've described them, but many
compilers can achieve the same effect, one way or another.

Preston Briggs

Bengt Larsson

unread,
Apr 28, 1991, 11:18:31 AM4/28/91
to
In article <11...@mentor.cc.purdue.edu> hru...@pop.stat.purdue.edu
(Herman Rubin) writes:
>In article <40...@batman.moravian.EDU>, hal...@batman.moravian.EDU (David Halko) writes:
>> What do you mean how is a compiler supposed to recognize that this is wanted?
>
>I mean exactly that. I am assuming that the operations are done by means of
>a sequence of instructions, and not a call to some (possibly inlined) function.
>I know of NO hardware on which I would consider calls to be reasonable where
>short sequences of code will do the job.

What is the difference between an _inlined_ function and a "short sequence
of code"? Explanation, please.

>It is not the case that the algorithm to be used is the same on different
>machines. Hardware differences can cause major problems, as was seen when

>cache/paging first came in; ...

This is no news.

>1. Examine a bit, and take some action depending on it. Also, the next time
>this is to be done, this bit effectively no longer exists. It is necessary
>to make provision at least somewhere for replenishing the supply of bits.

You are being obscure. What is this if not a shift instruction?

>2. Find the distance to the next one in a bit stream. The same considerations
>as above apply.

True. And some hardware has instructions for "find first set bit". The
Vax has, for example. That may be a useful addition, though it's
somewhat specialized. Useful for cryptography, I hear.
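
(For what it's worth, BSD-flavored C libraries expose the operation as
ffs(); whether the hardware instruction gets used underneath is up to
the implementation:)

#include <strings.h>

/* 1-based position of the lowest set bit, 0 if the word is all zeros */
int first_one(int mask)
{
	return ffs(mask);
}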

>Try coding this in any present language, including assembler, in such a way
>that a compiler can be expected to recognize one of these without being
>explicitly instructed.

It isn't doable to make a compiler which can recognize _anything_ without
being instructed. If you want an extensible language, you'll have to show how it
can be made extensible. I can assure you that it isn't quite as easy as it
looks.

>I am not saying that this should not be the case. But it is the user, not
>the language designer, who has to do this. Whatever language, present or
>future, is used, human beings are going to come up with extremely simple
>operations which the language designers have not thought of, and for which
>doing those operations using the language primitives will be clumsy.

This is utterly obvious.

>The user must be able to instruct the compiler about these operations
>in a manner similar to the way it now looks at addition and multiplication.

OK, you like infix notation.

>If there are competing ways to do something, the compiler must know the
>available variety. This is even the case for adding vectors in C.

You don't seem to know anything about the problems in what you are
talking about.

>If this was the case, the hardware was deficient. To a mathematician, floating
>point arithmetic is a complicated version of fixed point arithmetic.

Hardware integer multiply should probably be provided for, yes.

>Have any language designers or hardware designers ever asked for input on the
>variety of operations which are readily understood by people like Silverman,
>Lehmer, etc.?

Why do you assume that people should come and ask you (a mathematician)
about computer language development? What makes you an expert in the
field?

Bengt L.
--
Bengt Larsson - Dep. of Math. Statistics, Lund University, Sweden
Internet: ben...@maths.lth.se SUNET: TYCHE::BENGT_L

Preston Briggs

unread,
Apr 28, 1991, 12:10:32 PM4/28/91
to
hru...@pop.stat.purdue.edu (Herman Rubin) writes:

>Consider the following two operations, which comprise more than 25% of the
>operations in some highly efficient, from the "picocode" standpoint, methods
>for generating non-uniform random numbers.
>
>1. Examine a bit, and take some action depending on it.
> Also, the next time this is to be done, this bit effectively
> no longer exists. It is necessary to make provision at least
> somewhere for replenishing the supply of bits.
>
>2. Find the distance to the next one in a bit stream.
> The same considerations as above apply.

These sorts of things are certainly CISCish and would be difficult to
recognize in a compiler. However, I would argue that they are
possibly misrepresented.

How about a linked-list (oh no!) of integers.
An element in the list represents a "1" bit.
The integer is the number of "zero" bits preceding the 1.
The stream starting with 001010000001...
would look like

(2, 1, 6, ...)

So, a single memory access per 1 bit. The "distance to next 1 bit"
questions can be answered immediately. The "examine a bit and take
an action" can be done with some of those fancy "decrement, test, and branch"
instructions people use for closing loops.
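
A sketch of that representation in C (names invented):

struct run {
	int zeros;		/* number of 0 bits before this 1 bit */
	struct run *next;
};

/* "distance to the next 1" becomes a single field access */
int dist_to_next_one(struct run *s)
{
	return s->zeros;
}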

Alternatively, ...
It seems like there's something producing the bits and something consuming
them and making decisions based on them. Why not have the producer
simply call the alternate choices in the consumer directly?


Generally though, my criticism is that you're thinking very low-level.
You decided how you want these operations represented and you seem
unwilling to talk about them at a level above "bit stream".
If you must think this way, I expect you're going to have to
code in the same fashion; that is, in assembly.

Preston Briggs

Herman Rubin

unread,
Apr 28, 1991, 3:58:23 PM4/28/91
to

Any specific instance of this can be added to the compiler's ken, but
until it is, the compiler cannot take advantage of it. The point is
that the compiler must be instructed to look for them. As to what it
can do with such instructions, this depends very strongly on what the
hardware provides.

The real2int conversion still has to be done on most machines, and also
an integerize of real. But suppose that the machine has integer hardware
comparable to floating hardware, and can also handle unnormalized floats.
Then one could put a shifted significand of f1 in a double register,
divide by the significand of f2, setting iq to be the quotient, and
the remainder with the exponent of f2, and then normalized, as fr.
Now I wonder how many of the readers would think of doing it this way?
And how would the hardware designers recognize that putting the operation
in hardware might be a good idea?

The individual compiler might or might not see it. The community definitely
would not.

Now, how would you write the code for doing a conditional transfer efficiently
on the next bit of a bit stream, which can be extended (usually by a subroutine
call) when necessary? It is unlikely that one will know, without querying, just
where one is in the stream. If you wrote the code, would someone else so
recognize it?

Randell Jesup

unread,
Apr 29, 1991, 12:35:13 AM4/29/91
to
In article <11...@mentor.cc.purdue.edu> hru...@pop.stat.purdue.edu (Herman Rubin) writes:
>If this was the case, the hardware was deficient. To a mathematician, floating
>point arithmetic is a complicated version of fixed point arithmetic. This is a
>point which the RISC designers do not see; the chip real estate for doing good
>integer arithmetic is already there in the floating point stuff, and the design
>should be such as to use it. Floating point arithmetic uses fixed point
>internally, so this would not be a problem AT THE TIME OF THE DESIGN OF THE
>ARCHITECTURE. It is a problem afterward.

The RPM-40 used the floating point HW for integer multiply and divide,
and I'm certain at least a few others have also. However, don't do this if
you have only a couple FP registers (a problem the rpm40 had).

--
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, je...@cbmvax.commodore.com BIX: rjesup
Disclaimer: Nothing I say is anything other than my personal opinion.
Thus spake the Master Ninjei: "To program a million-line operating system
is easy, to change a man's temperament is more difficult."
(From "The Zen of Programming") ;-)

Preston Briggs

unread,
Apr 29, 1991, 10:59:04 PM4/29/91
to
hru...@pop.stat.purdue.edu (Herman Rubin) writes:

>Is it the compiler writer's job to recognize combinations of expressions in
>a language which really represent an operation not contained in that
>language?

It's the compiler writer's job to write code that recognizes
combinations of expressions that are supported by machine instructions.
It's one of the significant tasks of instruction selection.

I don't know how to handle everything (e.g., recognizing a
routine that computes square root), but many of the combinations
you (and others) give as examples _are_ easily handled.

>What is needed for the user to be able to communicate what operations are
>wanted.

I agree that the user should be able to instruct the computer.
However, I don't think I want to specify the assembly operations.
Instead, I want to manipulate numbers, sets, sequences, and objects in a
somewhat high-level fashion. Then I want the compiler to make very good
code out of my specification.

>> If you change the machine, you need to change compilers.
>
>This newsgroup is comp.arch. How are the computer architects to recognize
>these situations and others?

I expect they learn from reading and by experience.
It seems that one of the major accomplishments of the RISC evolution
was a recognition that architects and compiler people should talk
more often than in the past.

>What you have shown is that a compiler can, in effect, extend a language
>by looking for certain code patterns.

I disagree. The language (Fortran 77) didn't change when I added my
optimization. The compiler translates exactly the same set of
programs; some programs simply run slightly faster.

> I fail to see that this procedure,
>which will always be limited by the information given to the compiler
>writers, and which can also sometimes lead to difficulties, is better
>than allowing the user to extend the language and pass the information
>to the compiler in such a way that it can use more of the
>machine capabilities.

The difference is that we (computer science community in general)
know how to do optimizing compilers. You pose a much harder problem.

Preston Briggs

J. Kyle Wilson

unread,
Apr 30, 1991, 4:11:20 PM4/30/91
to
In article <11...@mentor.cc.purdue.edu> hru...@pop.stat.purdue.edu (Herman Rubin) writes:
>In article <1991Apr30.0...@rice.edu>, pre...@ariel.rice.edu (Preston Briggs) writes:
>> hru...@pop.stat.purdue.edu (Herman Rubin) writes:
>
>> >What is needed for the user to be able to communicate what operations are
>> >wanted.
>>
>> I agree that the user should be able to instruct the computer.
>> However, I don't think I want to specify the assembly operations.
>> Instead, I want to manipulate numbers, sets, sequences, and objects in a
>> somewhat high-level fashion. Then I want the compiler to make very good
>> code out of my specification.
>
>But what if I find Boolean operations useful in dealing with floats, which
>I do for certain purposes? There are other operations which I can use for
>useful purposes, but only if they are fast. You do not know these operations,
>and the language community has seen fit to ignore what those who are capable
>of versatility can envision.

The language community has been providing the greatest good for the
greatest number where possible. If you have need of an optimization
that no one has seen fit to add to a compiler you are using your best
bet may be to hire a compiler writer to tweak it to handle that case.
If you can find a large enough segment of the user community to
support your modification you may even be able to get this new twist
adopted as a standard. If your time spent optimizing the interior of
critical loops in your code is worth enough, your employer should
gladly fund the rewriting of whatever compiler you use to optimize in
any way you wish.

Satisfying your requests would require the compiler to recognize a
large number of special cases each of which would likely be used by a
small segment of the user community, while requiring significant work
to support as the compiler incorporates new technology. This would
have an overall greater cost to the community at large than allowing
the user to code oddball cases in #ifdef'd asm statements for each
machine. One could add #pragma statements to the compiler (or
equivalents for those using other languages) containing clues
regarding optimization of the code. Why should you specify a boolean
operation on a float when this is really not the operation you wish to
perform? Have your hired compiler writer add a pragma that indicates
to the compiler that the following line of code may be implemented in
some odd way, and stick with that.

Kyle Wilson
j...@gnerad.com

Kyle Wilson
GenRad Inc.
300 Baker Ave
Concord MA 01742

Herman Rubin

unread,
Apr 30, 1991, 11:34:48 AM4/30/91
to
In article <1991Apr30.0...@rice.edu>, pre...@ariel.rice.edu (Preston Briggs) writes:
> hru...@pop.stat.purdue.edu (Herman Rubin) writes:

......................

> >What is needed for the user to be able to communicate what operations are
> >wanted.
>
> I agree that the user should be able to instruct the computer.
> However, I don't think I want to specify the assembly operations.
> Instead, I want to manipulate numbers, sets, sequences, and objects in a
> somewhat high-level fashion. Then I want the compiler to make very good
> code out of my specification.

But what if I find Boolean operations useful in dealing with floats, which
I do for certain purposes? There are other operations which I can use for
useful purposes, but only if they are fast. You do not know these operations,
and the language community has seen fit to ignore what those who are capable
of versatility can envision.

> >> If you change the machine, you need to change compilers.
> >
> >This newsgroup is comp.arch. How are the computer architects to recognize
> >these situations and others?
>
> I expect they learn from reading and by experience.
> It seems that one of the major accomplishments of the RISC evolution
> was a recognition that architects and compiler people should talk
> more often than in the past.

But what about getting input from those users who do not find what they
want? That multiply and add could be easily combined in hardware was known
BC (before computers); it was routine in desk calculatore (now essentially
extinct). The only reason I can think of that the early computers did not
have this feature is that it would have taken another register to hold the
multiplier.

One of the reasons why, in the late 50s and early 60s, binary was settled on
(there are many others) was that multiplication by powers of 2 was rather
common. Only a few machines have this very important operation in floating
hardware, and instead have to bring in the multiplier and use the relatively
slow multiplication. It is true that with pipelined hardware this is not so
much of a problem, but I know of only a few machines which had it before then.
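
(The operation does exist in C as ldexp(), whatever the hardware does
underneath; conceptually it only adjusts the exponent field rather than
running a full multiply:)

#include <math.h>

/* x * 2^n by exponent adjustment rather than a general multiply */
double scale_by_8(double x)
{
	return ldexp(x, 3);	/* x * 8.0 */
}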

> >What you have shown is that a compiler can, in effect, extend a language
> >by looking for certain code patterns.
>
> I disagree. The language (Fortran 77) didn't change when I added my
> optimization. The compiler translates exactly the same set of
> programs; some programs simply ran slightly faster.
>
> > I fail to see that this procedure,
> >which will always be limited by the information given to the compiler
> >writers, and which can also sometimes lead to difficulties, is better
> >than allowing the user to extend the language and pass the information
> >to the compiler in such a way that it can use more of the
> >machine capabilities.
>
> The difference is that we (computer science community in general)
> know how to do optimizing compilers. You pose a much harder problem.

I agree that you know how to do certain types of optimizations. I suggest
that you make these optimizations available to those who have a more varied
collection of things to optimize. You (the computer science community) know
how to optimize, but have only a limited knowledge of what it is useful to
optimize. I do not criticise you for this lack of knowledge, but only for
failing to realize that you should be looking for input on this from others.

We, and I include all of mankind, do not know enough to design a good
system of communicating between people and computers now. We can design a
fair one by realizing this and allowing the users to extend the language, and
to communicate to the compiler this extension.

Bob Alverson

unread,
Apr 29, 1991, 12:05:41 PM4/29/91
to
In article <11...@mentor.cc.purdue.edu> hru...@pop.stat.purdue.edu (Herman Rubin) writes:
>[stuff deleted]

>The real2int conversion still has to be done on most machines, and also
>an integerize of real. But suppose that the machine has integer hardware
>comparable to floating hardware, and can also handle unnormalized floats.
>Then one could put a shifted significand of f1 in a double register,
>divide by the significand of f2, setting iq to be the quotient, and
>the remainder with the exponent of f2, and then normalized, as fr.
>Now I wonder how many of the readers would think of doing it this way?
>And how would the hardware designers recognize that putting the operation
>in hardware might be a good idea?
>

Gee, and what does the hardware do if the division is 1e300 / 1e-300?
It's got to extract about 2000 bits of quotient before it gets all the
integer bits. I take it you really want

iq = (f1 / f2) % some-power-of-two

Since continuable instructions don't mesh well with pipelined systems,
most hardware would only calculate 53 or maybe 64 quotient bits at a
time. For example, look at the remainder *sequence* used on the 80x87.
Nobody wants to wait for the full calculation to finish before taking
an interrupt or context switching.

Strange as it may seem, the dividers in today's computers can compute
the correct quotient without having the correct remainder around to
return to the user. Fortunately, some can compute the correct remainder
in one operation with a fused multiply add.
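
(In later C terms, with a fused multiply-add like C99's fma(), which
rounds only once: given the rounded quotient q of f1/f2, the exact
remainder comes out of one fused operation. Illustration only:)

#include <math.h>

double remainder_from_fma(double f1, double f2, double q)
{
	return fma(-q, f2, f1);	/* f1 - q*f2 with a single rounding */
}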


Bob Alverson, Tera Computer
Resident arithmetic lunatic

Preston Briggs

unread,
Apr 29, 1991, 11:59:45 AM4/29/91
to
hru...@pop.stat.purdue.edu (Herman Rubin) wrote

> If the languages do not allow the user to communicate the use of those
> features of the CISC architecture which can make the object code considerably
> faster, the compiler cannot take advantage of them. This IS the situation.
> For example, suppose there was an instruction, which could be effectively
> used, which would divide a float by a float, obtaining an integer quotient
> and a float remainder. Now just how is the compiler to recognize that this
> is in fact wanted?

And I described how it could be done, using Fortran 77 and pretty
standard optimization (value numbering and peephole optimization).

hru...@pop.stat.purdue.edu (Herman Rubin) replies:

>Any specific instance of this can be added to the compiler's ken, but
>until it does, the compiler cannot take advantage of it. The point is
>that the compiler must be instructed to look for them.

Yes indeed. That's the job of the compiler writer.
If he does a bad job in using the instruction set of the machine,
you should complain to him.

>The real2int conversion still has to be done on most machines, and also
>an integerize of real. But suppose that the machine has integer hardware
>comparable to floating hardware, and can also handle unnormalized floats.

Well, we can suppose all sorts of things. Nevertheless, my solution
handles your first set of suppositions. If you change the machine,
you need to change compilers.

I thought your original question was posed in an attempt to show that
many languages don't have the facilities you consider necessary
to control the machine. I just tried to show that the effect you
wanted could be achieved.

Preston Briggs

Dik T. Winter

unread,
Apr 30, 1991, 10:55:44 PM4/30/91
to
In article <11...@mentor.cc.purdue.edu> hru...@pop.stat.purdue.edu (Herman Rubin) writes:
> But what if I find Boolean operations useful in dealing with floats, which
> I do for certain purposes?
Can you also specify the floating point format please? Should the language
require a specific format? And if it does *not* depend on the format, what
kind of Boolean operations are you thinking of? Or should there be a
different language for every architecture? Yes, I know you have mumbled about
user extensible languages. Up to now I have seen no specification of the
things you want. Algol 68 is very powerful, you can add all kinds of infix
operators, but I have this idea that that is not what you want.

> One of the reasons why, in the late 50s and early 60s, binary was settled on
> (there are many others) was that multiplication by powers of 2 was rather
> common.

I have heard many reasons, but never this one. Can you give a reference
please?

> Only a few machines have this very important operation in floating
> hardware, and instead have to bring in the multiplier and use the relatively
> slow multiplication. It is true that with pipelined hardware this is not so
> much of a problem, but I know of only a few machines which had it before then.

Mm, some 360's had it I think, but that was completely micro code, and if my
memory is right, later versions did it through a straight multiplication.
There must be some reason. Also, multiplication by 2 is *not* easy on IEEE
conformant machines (think gradual underflow). Kahan had some (good) reasons
to include gradual underflow. But IEEE machines might use the 'adjust exponent'
primitive. On the other hand, IEEE says nothing about what has to be done in
hardware and what can be done in software. And on your favorite machine (Cyber
205) DP multiplication by 2.0 need not even be exact!


>
> I agree that you know how to do certain types of optimizations. I suggest
> that you make these optimizations available to those who have a more varied
> collection of things to optimize. You (the computer science community) know
> how to optimize, but have only a limited knowledge of what it is useful to
> optimize. I do not criticise you for this lack of knowledge, but only for
> failing to realize that you should be looking for input on this from others.

Not so limited in fact. They use several tools to assess the suitability of
different optimizations. There is no lack of knowledge; they use input from
others: the programs that are normally run on their systems.


>
> We, and I include all of mankind, do not know enough to design a good
> system of communicating between people and computers now. We can design a
> fair one by realizing this and allowing the users to extend the language, and
> to communicate to the compiler this extension.

As has been asked before: give examples. How would you like to write in your
favorite language your favorite examples:
1. Check bit in a bit stream and mark as seen.
2. Calculate length of a run of 0's or 1's in a bit stream and mark them as
seen.
HINTS:
They could be done along the lines of:
bitstream b;
bit b1;
int count;
/* another gripe of yours */
external end_of_bits();

trap_at_bits_end(b, end_of_bits);
b1 = nextbit(b); /* I hear you about infix operators */
count = countbits(b); /* should it count the bit just consumed? */
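
(For concreteness, one software rendering of those primitives in plain
C -- the representation is invented, and countbits() here takes the bit
value to count, which is one answer to the parenthetical question:)

    #include <stddef.h>

    typedef struct {
        const unsigned char *buf;   /* packed bits, most significant first */
        size_t nbits;               /* total bits in the stream */
        size_t pos;                 /* next bit to consume */
    } bitstream;

    /* Consume and return the next bit; -1 plays the role of the
     * end_of_bits trap above. */
    int nextbit(bitstream *b)
    {
        int bit;
        if (b->pos >= b->nbits)
            return -1;
        bit = (b->buf[b->pos >> 3] >> (7 - (b->pos & 7))) & 1;
        b->pos++;
        return bit;
    }

    /* Length of the run of bits equal to `bit', marking them as seen. */
    int countbits(bitstream *b, int bit)
    {
        int count = 0;
        while (b->pos < b->nbits &&
               ((b->buf[b->pos >> 3] >> (7 - (b->pos & 7))) & 1) == bit) {
            b->pos++;
            count++;
        }
        return count;
    }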

Next question: how do you envisage an *efficient* hardware implementation?
If there is one, compilers are quite able to translate the above code to
that implementation. If you know of that *efficient* hardware implementation
and if it is present on a current architecture, just convince your compiler
vendor to put that optimization in your compiler. If not, convince your
hardware vendor to put the operations required in the hardware (and show
how it can be done efficiently), and next convince your compiler vendor.

I am not entirely against you. But both hardware designers and compiler
writers have their reasons to do things. At first I was disgusted to hear
that the new SPARC architecture provides an integer division instruction
without at the same time giving a remainder. Now I know the reasons, and
I respect them. Contrary to the WE32000 which also has a division without
remainder, *and* a remainder without quotient, without apparent reasons.
--
dik t. winter, cwi, amsterdam, nederland
d...@cwi.nl

Herman Rubin

unread,
May 1, 1991, 3:34:15 PM5/1/91
to
In article <34...@charon.cwi.nl>, d...@cwi.nl (Dik T. Winter) writes:
> In article <11...@mentor.cc.purdue.edu> hru...@pop.stat.purdue.edu (Herman Rubin) writes:

.......................

> > One of the reasons why, in the late 50s and early 60s, binary was settled on
> > (there are many others) was that multiplication by powers of 2 was rather
> > common.
> I have heard many reasons, but never this one. Can you give a reference
> please?

An explicit place where I saw this was in a paper discussing the design of the
MANIAC 2.

> > Only a few machines have this very important operation in floating
> > hardware, and instead have to bring in the multiplier and use the relatively
> > slow multiplication. It is true that with pipelined hardware this is not so
> > much of a problem, but I know of only a few machines which had it before then.
> Mm, some 360's had it I think, but that was completely micro code, and if my
> memory is right, later versions did it through a straight multiplication.
> There must be some reason. Also, multiplication by 2 is *not* easy on IEEE
> conformant machines (think gradual underflow). Kahan had some (good) reasons
> to include gradual underflow. But IEEE machines might use the 'adjust exponent'
> primitive. On the other hand, IEEE says nothing about what has to be done in
> hardware and what can be done in software. And on your favorite machine (Cyber
> 205) DP multiplication by 2.0 need not even be exact!

On the 360 this would have been difficult, because the machine was base 16
rather than base 2. The machine has a floating instruction to halve.

One machine which had this instruction was the CDC 3600. The instruction was
ADX, add to exponent, and it managed overflows and underflows exactly the way
that they would be for other floating operations. Clearly this can be a fast
operation.

I did not state that the 205 was a really well-designed machine, just that it
is the most versatile vector machine I know. I am even inclined to agree that
the virtual memory is not all that great. Not only may DP multiplication by
2.0 fail to be exact; even multiplication by 1.0 may not be. But compare this
to the total nuisance of doing DP operations on a CRAY, for example! In fact,
it is not common for DP operations to be exact in general on most machines.

.........................

> As has been asked before: give examples. How would you like to write in your
> favorite language your favorite examples:
> 1. Check bit in a bit stream and mark as seen.
> 2. Calculate length of a run of 0's or 1's in a bit stream and mark them as
> seen.
> HINTS:
> They could be done along the lines of:
> bitstream b;
> bit b1;
> int count;
> /* another gripe of you */
> external end_of_bits();
> trap_at_bits_end(b, end_of_bits);
> b1 = nextbit(b); /* I hear you about infix operators */
> count = countbits(b); /* should it count the bit just consumed? */

This is the best I have been able to do for 1. In the envisioned applications,
I would not be counting the bits, but transferring on them. Pardon the
somewhat clumsier notation, as I am not used to trap pragmas. I am assuming
registers are 32 bits.
/* initialize pointers, etc. */
doubleword *end_of_bits, *loc;    /* pointers to 64-bit double words */
int left;                         /* bits still unconsumed in r */
registerpair r;                   /* 64-bit register pair */

if ((left -= 1) < 0) {
    if (loc >= end_of_bits) {
        refill();
        loc = start_of_bits;
    }
    left = 63;                    /* 64 bits loaded, one consumed below */
    r = *loc++;
} else
    r <<= 1;
if (r < 0) goto ...

For 2, I have not done as well even on scalar machines. On vector machines,
allowing collapse of vectors, it can be done fairly easily by having the input
bitstream act as a selection for consecutive integers, with appropriate handling
of ends. It could be done by counting here, but generally there are faster
methods, although they are tricky. I doubt that these would be recognized
even if the counting methods are.

> Next question: how do you envisage an *efficient* hardware implementation?

Have hardware hold bitstreams, and have faults on end_of_bits, just as there
are now page faults. The operations are not at all difficult from a hardware
viewpoint.

> If there is one, compilers are quite able to translate the above code to
> that implementation. If you know of that *efficient* hardware implementation
> and if it is present on a current architecture, just convince your compiler
> vendor to put that optimization in your compiler. If not, convince your
> hardware vendor to put the operations required in the hardware (and show
> how it can be done efficiently), and next convince your compiler vendor.

Do you really see a compiler writer recognizing that sequence of code, or one
of its alternatives, as this operation? Especially since I, knowing that tests
are bad, might use a slightly different code for such situations where I have
some control on how many times the operation will be used between tests.
This would enable the use of
left -= 1;
if ((r <<= 1) < 0) goto ...

and the reloading of registers becomes much more complicated.

Herman Rubin

unread,
May 1, 1991, 2:40:16 PM5/1/91
to
In article <41...@genrad.UUCP>, j...@genrad.com (J. Kyle Wilson) writes:
> In article <11...@mentor.cc.purdue.edu> hru...@pop.stat.purdue.edu (Herman Rubin) writes:
> >In article <1991Apr30.0...@rice.edu>, pre...@ariel.rice.edu (Preston Briggs) writes:
> >> hru...@pop.stat.purdue.edu (Herman Rubin) writes:

> >> >What is needed for the user to be able to communicate what operations are
> >> >wanted.

> >> I agree that the user should be able to instruct the computer.
> >> However, I don't think I want to specify the assembly operations.
> >> Instead, I want to manipulate numbers, sets, sequences, and objects in a
> >> somewhat high-level fashion. Then I want the compiler to make very good
> >> code out of my specification.

> >But what if I find Boolean operations useful in dealing with floats, which
> >I do for certain purposes? There are other operations which I can use for
> >useful purposes, but only if they are fast. You do not know these operations,
> >and the language community has seen fit to ignore what those who are capable
> >of versatility can envision.

> The language community has been providing the greatest good for the
> greatest number where possible. If you have need of an optimization
> that no one has seen fit to add to a compiler you are using your best
> bet may be to hire a compiler writer to tweak it to handle that case.

.......................



> Satisfying your requests would require the compiler to recognize a
> large number of special cases each of which would likely be used by a
> small segment of the user community, while requiring si

...........................

Satisfying my requests would require nothing of the sort. I see no
reason why I should have to hire a compiler writer for each augmentation.
I am asking the compiler writers and language designers to do what is
normally done in other fields; do not freeze the language, and allow the
user to augment it with convenient notation. Essentially this means that
the user can put translations of his method of coding into HLL or machine
instructions, or even into other expressions. In some cases, even alternate
translations should be given, and the optimizer left to decide, again using
user-supplied information.

An optimal route to travel does not have to be restricted to major highways.
Using lesser highways or even city streets may be more efficient. The route
could also be dependent on the time of day and the type of vehicle. Do not
deprive the driver of the gas pedal and the steering wheel.

peter da silva

unread,
May 1, 1991, 11:14:56 AM5/1/91
to
If you want to specify a boolean operation on a float, do the following:

f * 2.0

It's up to the compiler to determine if this multiplication by a constant
two can be handled by incrementing the exponent or if you have to do a real
floating multiply (say, for example, that it's m * 16 ^ e).

Andy Glew

unread,
May 1, 1991, 7:10:28 AM5/1/91
to
Just thought that I would juxtapose two posts to this newsgroup:

Strange as it may seem, the dividers in today's computers can compute
the correct quotient without having the correct remainder around to
return to the user. Fortunately, some can compute the correct remainder
in one operation with a fused multiply-add.

Bob Alverson, Tera Computer
Resident arithmetic lunatic


And:

From the Program for the IEEE Symposium on Computer Arithmetic:

8.1 ``Integer Division Using Reciprocals''
Robert Alverson, Tera Computer Company

Nuff said?

(NB. I'm on Bob Alverson's side. This sort of thing is good.)
--

Andy Glew, gl...@ichips.intel.com
Intel Corp., M/S JF1-19, 5200 NE Elam Young Parkway,
Hillsboro, Oregon 97124-6497

This is a private posting; it does not indicate opinions or positions
of Intel Corp.

Herman Rubin

unread,
May 1, 1991, 9:08:22 PM5/1/91
to
In article <WK=A6...@xds13.ferranti.com>, pe...@ficc.ferranti.com (peter da silva) writes:
> If you want to specify a boolean operation on a float, do the following:
>
> f * 2.0
>
> It's up to the compiler to determine if this multiplication by a constant
> two can be handled by incrementing the exponent or if you have to do a real
> floating multiply (say, for example, that it's m * 16 ^ e).

This is a boolean operation? I do not even see how to express it as one.

It is a useful operation, however. But unless the appropriate hardware is
there, the compiler better ask the user for permission. On most machines,
adding to the exponent is safe only if neither overflow not underflow can
occur, and if floats are automatically normalized, even positive numbers
cannot be added to the exponent of 0.

peter da silva

unread,
May 1, 1991, 4:39:21 PM5/1/91
to
In article <11...@mentor.cc.purdue.edu>, hru...@pop.stat.purdue.edu (Herman Rubin) writes:
> I am asking the compiler writers and language designers to do what is
> normally done in other fields; do not freeze the language, and allow the
> user to augment it with convenient notation.

There have been languages with that capability: Lisp, or Forth, or Smalltalk,
for example. Why not look there for your needs?

> Do not deprive the driver of the gas pedal and the steering wheel.

No problem, but should the driver want to add a supercharger he should be
expected to learn something about the engine.

Herman Rubin

unread,
Apr 29, 1991, 9:55:25 PM4/29/91
to
In article <1991Apr29.1...@rice.edu>, pre...@ariel.rice.edu (Preston Briggs) writes:
> hru...@pop.stat.purdue.edu (Herman Rubin) wrote

.....................

> And I described how it could be done, using Fortran 77 and pretty
> standard optimization (value numbering and peephole optimization).
>
> hru...@pop.stat.purdue.edu (Herman Rubin) replies:
>
> >Any specific instance of this can be added to the compiler's ken, but
> >until it does, the compiler cannot take advantage of it. The point is
> >that the compiler must be instructed to look for them.
>
> Yes indeed. That's the job of the compiler writer.
> If he does a bad job in using the instruction set of the machine,
> you should complain to him.

Is it the compiler writer's job to recognize combinations of expressions in
a language which really represent an operation not contained in that
language? How do you separate a float into the exponent and the
significand?

What is needed is for the user to be able to communicate what operations are
wanted. It is even necessary to do this in cases in which the language
supports the desired effect in multiple ways. Looking for the code and
making sure it has no side effects which would make the supposed improvement
more costly is no simple task, but it is easier if the compiler is told what
is wanted, instead of having to guess.

> >The real2int conversion still has to be done on most machines, and also
> >an integerize of real. But suppose that the machine has integer hardware
> >comparable to floating hardware, and can also handle unnormalized floats.

> Well, we can suppose all sorts of things. Nevertheless, my solution
> handles your first set of suppositions. If you change the machine,
> you need to change compilers.

This newsgroup is comp.arch. How are the computer architects to recognize
these situations and others?

> I thought your original question was posed in an attempt to show that
> many languages don't have the facilities you consider necessary
> to control the machine. I just tried to show that the effect you
> wanted could be achieved.

What you have shown is that a compiler can, in effect, extend a language
by looking for certain code patterns. I fail to see that this procedure,
which will always be limited by the information given to the compiler
writers, and which can also sometimes lead to difficulties, is better
than allowing the user to extend the language and pass the information
to the compiler in such a way that it can use more of the machine capabilities.

It is also by having these situations laid out in the open that subsequent
language designers, compiler writers, and hardware designers will be likely
to know about them and include them. No matter how great these are, there
are plenty of "obvious" things which will be missed.

Jim Patterson

unread,
May 2, 1991, 9:38:19 AM5/2/91
to
In article <WK=A6...@xds13.ferranti.com> pe...@ficc.ferranti.com (peter da silva) writes:
>If you want to specify a boolean operation on a float, do the following:
>
> f * 2.0
>
>It's up to the compiler to determine if this multiplication by a constant
>two can be handled by incrementing the exponent or if you have to do a real
>floating multiply (say, for example, that it's m * 16 ^ e).

The problem with this is that the compiler rarely has enough
information to determine if it's "safe" to just bump the exponent.
Therefore it's left with the choice of doing a number of runtime
checks (is f equal to zero? Is exponent already at upper limit?), or
just punting and generating the naive "multiply by 2.0" instruction.

I suspect most compilers will do the latter because it's ultimately
more important to get the right answer.
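
(Spelled out for an IEEE double, the checks look roughly like this -- a
sketch only; the bit layout and a 64-bit integer type are assumed, and
memcpy stands in for whatever the hardware would really do:)

    #include <stdint.h>
    #include <string.h>

    /* Double f by bumping the biased exponent field, with the runtime
     * checks made explicit.  Zeros, denormals, infinities and NaNs all
     * punt to the naive multiply, which always gets the right answer. */
    double times_two(double f)
    {
        uint64_t bits;
        unsigned exp;

        memcpy(&bits, &f, sizeof bits);
        exp = (unsigned)(bits >> 52) & 0x7ff;
        if (exp == 0 || exp >= 0x7fe)    /* zero/denormal, or would overflow */
            return f * 2.0;
        bits += (uint64_t)1 << 52;       /* add one to the exponent */
        memcpy(&f, &bits, sizeof f);
        return f;
    }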

--
Jim Patterson Cognos Incorporated
UUCP:uunet!mitel!cunews!cognos!jimp P.O. BOX 9707
PHONE:(613)738-1440 x6112 3755 Riverside Drive
Ottawa, Ont K1G 3Z4

Herman Rubin

unread,
May 3, 1991, 10:14:33 AM5/3/91
to
In article <282002...@tct.com>, ch...@tct.com (Chip Salzenberg) writes:
> According to hru...@pop.stat.purdue.edu (Herman Rubin):

> >Satisfying my requests would require nothing of the sort. I see no
> >reason why I should have to hire a compiler writer for each augmentation.
>
> You're right, of course.
>
> Nor can I see a reason why we should endure endless tirades about how
> the all-singing all-dancing Herman Rubin Language Compiler will solve
> the world's problems, when you STILL haven't written ...
>
> [wait for it]
>
> the SPECIFICATION.
>
> Give us a spec, and we'll have something to discuss. Why spend all
> this time yammering about a hypothetical compiler, when that same
> effort expended in spec-writing could produce a REAL IMPLEMENTATION?

I have several times responded to this type of posting in email, and
have not even received clarification of my questions.

1. I am not quite sure what you mean by a specification.

2. To the extent that I understand it, I do not think that one
even should be produced.

Now, I do not mean that it is all a mystery. I will try to clarify the
problem and objectives. I envision a user as having some things to be
done, which for purposes of discussion we shall call transformations.
There are many ways of producing these transformations by sequences of
operations. The compiler cannot be expected to know how to do this; the
user must carefully instruct it by the program. Now this means that the
program may present the compiler with alternatives to assess for speed,
memory requirements, etc.

For convenience, the arguments of these operations may be typed. The
collection of types is at the user's disposal. This allows overloaded
operations. However, it may be convenient to have a few "standard" types.

Next, these operations may be further decomposed into sequences of other
operations, sequences of machine instructions, etc. The user should have
control over this. There may be a default set of operations, but the user
should have essentially free rein over operation symbols, etc. Again, it
is possible that there will be situations in which alternate translation
sequences should be made available to the compiler.

I have not specified a language, but a frame on which to hang a language.
Unlike the present languages, this frame does not declare certain operations,
such as + - * / & | ~ ^ ** << >> as THE primitive operators, but allows free
extension of the notion of primitive operator. If a user wishes to use [) as
an operator symbol, there is no bar to this. If the user wishes to deliberately
override types, likewise. If the user wishes to define an operation in a manner
using assembler instructions, transfer on overflow, etc., which cannot be
conveniently done with the present "portable" primitives, go ahead.

I have given examples of this. To some extent, this was done by the AUGMENT
preprocessor to handle multiple precision arithmetic, but this was limited.
This essentially unrestricted translation phase can be combined with whatever
optimization techniques are to be further implemented. Even here, allowing
the compiler to lay out alternatives to the user and get input can be desirable.

Chip Salzenberg

unread,
May 2, 1991, 8:15:12 AM5/2/91
to
According to hru...@pop.stat.purdue.edu (Herman Rubin):
>Satisfying my requests would require nothing of the sort. I see no
>reason why I should have to hire a compiler writer for each augmentation.

You're right, of course.

Nor can I see a reason why we should endure endless tirades about how
the all-singing all-dancing Herman Rubin Language Compiler will solve
the world's problems, when you STILL haven't written ...

[wait for it]

the SPECIFICATION.

Give us a spec, and we'll have something to discuss. Why spend all
this time yammering about a hypothetical compiler, when that same
effort expended in spec-writing could produce a REAL IMPLEMENTATION?

--
Brand X Industries Sentient and Semi-Sentient Being Resources Department:
Because Sometimes, "Human" Just Isn't Good Enough [tm]
Chip Salzenberg <ch...@tct.com>, <uunet!pdn!tct!chip>

Michael S. Pereckas

unread,
May 3, 1991, 3:58:56 PM5/3/91
to
In <11...@mentor.cc.purdue.edu> hru...@pop.stat.purdue.edu (Herman Rubin) writes:
>I have not specified a language, but a frame on which to hang a language.
>Unlike the present languages, this frame does not declare certain operations,
>such as + - * / & | ~ ^ ** << >> as THE primitive operators, but allows free
>extension of the notion of primitive operator. If a user wishes to use [) as
>an operator symbol, there is no bar to this. If the user wishes to deliberately
>override types, likewise. If the user wishes to define an operation in a manner
>using assembler instructions, transfer on overflow, etc., which cannot be
>conveniently done with the present "portable" primitives, go ahead.

Sounds vaguely (buzzword alert) object oriented to me. Except that
you not only don't want to pay the usual performance penalty of OO
languages, you want a speed boost over plain C or whatever. And you
don't want to write any assembly (although another user might)?

(I'm assuming that this is a modified C. If you like Ada better,
replace `C' with `Ada' in the following text...)

What if you could define new operations
example:

@ is defined here as some complicated function of two variables
using either C or assembler

later on you get to write stuff like this:
x = y @ z

The code for the @ operation is automagically inlined.
Appropriate ifdefs would let you include in the source file several
versions of the @ code, perhaps in portable C and two different
assembly languages.

You (H. Rubin) would like (I think) to be able to write several
different but equivalent versions in portable C, and have the compiler
automagically select the one it can produce the best code for. This
sounds like it could be hard. Perhaps you could include some test
examples, and the compiler could compile each version and benchmark
it?

Am I getting close?

--

< Michael Pereckas <> m-per...@uiuc.edu <> Just another student... >
"This desoldering braid doesn't work. What's this cheap stuff made
of, anyway?" "I don't know, looks like solder to me."

Eric Hamilton

unread,
May 2, 1991, 4:19:17 PM5/2/91
to
In article <1991Apr22....@decvax.dec.com>, ken...@abyss.zk3.dec.com (Jeff Kenton OSG/UEG) writes:
|>
|> The 88K has xmem, not test and set, which is guaranteed to be an atomic
|> operation (the bus is locked between the read and write cycles). This
|> can be used to create locks or compare and swap without disabling itnerrupts.
|>
Now I'm curious. How does one fabricate compare-and-swap out of xmem
without disabling interrupts? Xmem has essentially the same restriction as
test-and-set, which is that the stored value must be a constant independent of
the loaded value.

--
----------------------------------------------------------------------
Eric Hamilton +1 919 248 6172
Data General Corporation hami...@dg-rtp.rtp.dg.com
62 Alexander Drive ...!mcnc!rti!xyzzy!hamilton
Research Triangle Park, NC 27709, USA

Herman Rubin

unread,
May 3, 1991, 6:03:41 PM5/3/91
to
In article <1991May3.1...@ux1.cso.uiuc.edu>, msp3...@uxa.cso.uiuc.edu (Michael S. Pereckas) writes:
> In <11...@mentor.cc.purdue.edu> hru...@pop.stat.purdue.edu (Herman Rubin) writes:

> >I have not specified a language, but a frame on which to hang a language.
> >Unlike the present languages, this frame does not declare certain operations,
> >such as + - * / & | ~ ^ ** << >> as THE primitive operators, but allows free
> >extension of the notion of primitive operator. If a user wishes to use [) as
> >an operator symbol, there is no bar to this. If the user wishes to deliberately
> >override types, likewise. If the user wishes to define an operation in a manner
> >using assembler instructions, transfer on overflow, etc., which cannot be
> >conveniently done with the present "portable" primitives, go ahead.

> Sounds vaguely (buzzword alert) object oriented to me. Except that
> you not only don't want to pay the usual performance penalty of OO
> languages, you want a speed boost over plain C or whatever. And you
> don't want to write any assembly (although another user might)?

I do not mind using machine operations, but I do not like to write the
vague and confusing assembler symbolism, which differs from machine to
machine. On any machine, one of the things I would do is to include
my own notation for the machine operations, and the corresponding translation.

I have no problem whatever in thinking in terms of machine operations where
necessary, and I have no objection to using them. But we no longer need to
write these operations in a form designed to be easy for the assembler to
parse.

> (I'm assuming that this is a modified C. If you like Ada better,
> replace `C' with `Ada' in the following text...)

> What if you could define new operations
> example:

> @ is defined here as some complicated function of two variables
> using either C or assembler

> later on you get to write stuff like this:
> x = y @ z

> The code for the @ operation is automagically inlined.
> Appropriate ifdefs would let you include in the source file several
> versions of the @ code, perhaps in portable C and two different
> assembly languages.

Almost right! I may wish to define @ in terms of other things defined. It
may be desirable to use primitives which even mess up coding, such as overflow.

> You (H. Rubin) would like (I think) to be able to write several
> different but equivalent versions in portable C, and have the compiler
> automagically select the one it can produce the best code for. This
> sounds like it could be hard. Perhaps you could include some test
> examples, and the compiler could compile each version and benchmark
> it?

> Am I getting close?

One of the claims of the posters is that compilers can greatly outdo
humans. They can only outdo humans if they can take into account what
the humans can consider. This they cannot do without being told.

Now how does a compiler operate? It parses the code, translates it into
its primitive operators, decides how to make things more efficient, which in
some cases requires consideration of alternatives (Simple example: on the
CDC6600, there were three "standard" methods of putting 0 in an X-register.
These used different functional units, and each functional unit could only
do one thing at a time, and the units were used for other purposes. So the
compiler would decide which alternative to use.), and then optimize the
resulting hardware instruction code.

This means that the programmer must adequately instruct the compiler on the
possible ways to do things if the compiler is to do well. Automagic selection
is certainly an important part of optimization.

herr...@iccgcc.decnet.ab.com

unread,
May 3, 1991, 5:40:15 PM5/3/91
to
He is a customer. Unfortunately, he does not represent a large enough
subset of their customers.

Unfortunately so, because he is better informed than many other customers,
and the whole customer base might be better served if Herman's ideas
could penetrate the language design community.

Not talking to the mathematicians let the FORTRAN design team at IBM
put the label REAL on a finite subset of the rationals.

We'll never displace floating point numbers now, but what might we
have if the game theorist had spent more time talking with number
theorists back around 1940-1945?

How many engineering juniors believe that

DO I = 1, 50000
SUM = SUM + DATA (I)
END DO

computes a good approximation of the sum of 50000 numbers they measured
somehow?
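
(For the record, the usual patch is compensated summation; a C sketch:)

    /* Kahan's compensated summation: c carries the low-order bits
     * lost by each addition, so the computed sum is nearly
     * independent of the order of the data. */
    double ksum(const double *data, int n)
    {
        double sum = 0.0, c = 0.0, y, t;
        int i;

        for (i = 0; i < n; i++) {
            y = data[i] - c;
            t = sum + y;
            c = (t - sum) - y;   /* the part of y that did not make it in */
            sum = t;
        }
        return sum;
    }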

Could we have realized in hardware a number system that does give us
a commutative addition?

I think there is good reason for the hardware and software architects
to listen to the number theorists.

dan herrick
herr...@iccgcc.decnet.ab.com

herr...@iccgcc.decnet.ab.com

unread,
May 3, 1991, 5:51:28 PM5/3/91
to
In article <4474.2...@iccgcc.decnet.ab.com>, I said:
>
> Could we have realized in hardware a number system that does give us
> a commutative addition?
>
after giving the example that shows floating point arithmetic is
not associative. But the fix for the algorithm is to sort the
input data - doesn't that suggest the word "commutative" to you, too?

I'm still
> dan herrick
> herr...@iccgcc.decnet.ab.com

Jon Krueger

unread,
May 3, 1991, 2:06:10 AM5/3/91
to
hru...@pop.stat.purdue.edu (Herman Rubin) analogizes:

> An optimal route to travel does not have to be restricted to major highways.
> Using lesser highways or even city streets may be more efficient. The route
> could also be dependent on the time of day and the type of vehicle. Do not
> deprive the driver of the gas pedal and the steering wheel.

What! You trust the car to make the optimal decisions for you on
shifting? Surely you could to better with a manual. Oh, and what
about that gas-to-air ratio? Surely you wouldn't drive a car whose
designer deprived you of a choke, so you could tweak it while driving
different territory. And how about timing? You do adjust the timing
for uphill versus down, don't you Herman? Don't tell me you pay more
attention to your route than to these important "efficiency" issues.
Why you might actually get somewhere instead of complaining loudly
about the fascism of auto manufacturers. Kinda boggles the mind.

-- Jon
--

Jon Krueger, j...@ingres.com

John R. Levine

unread,
May 4, 1991, 12:34:52 AM5/4/91
to
In article <1991May2.2...@dg-rtp.dg.com> hami...@siberia.rtp.dg.com (Eric Hamilton) writes:
>Now I'm curious. How does one fabricate compare-and-swap out of xmem
>without disabling interrupts?

You can't, at least not without a scheme that uses spin locks which are at
least as bad as disabling interrupts. See Herlihy's article on "Wait-free
Synchronization" in the January TOPLAS. It turns out that CAS is
fundamentally more powerful than most other synchronization primitives.
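
(Spelled out, the spin-lock fabrication is something like this -- a
sketch, with atomic_exchange() standing in for xmem, which atomically
stores a new value and returns the old contents:)

    extern int atomic_exchange(volatile int *addr, int value); /* xmem stand-in */

    static volatile int cas_lock = 0;       /* 1 = held */

    /* Compare-and-swap fabricated from atomic exchange: correct, but
     * the spinning is at least as bad as disabling interrupts. */
    int compare_and_swap(volatile int *addr, int oldval, int newval)
    {
        int ok;

        while (atomic_exchange(&cas_lock, 1) != 0)
            ;                               /* spin until we hold the lock */
        ok = (*addr == oldval);
        if (ok)
            *addr = newval;
        cas_lock = 0;                       /* release */
        return ok;
    }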

--
John R. Levine, IECC, POB 349, Cambridge MA 02238, +1 617 492 3869
jo...@iecc.cambridge.ma.us, {ima|spdcc|world}!iecc!johnl
Cheap oil is an oxymoron.

Chris Shaw

unread,
May 5, 1991, 6:20:07 PM5/5/91
to
In article <moreWhiningWhiningWhiningWhining> (Herman Rubin) writes:
>What you have shown is that a compiler can, in effect, extend a language
>by looking for certain code patterns. I fail to see that this procedure,
>which will always be limited by the information given to the compiler
>writers, and which can also sometimes lead to difficulties, is better
>than allowing the user to extend the language and pass the information
>to the compiler in such a way that it can use more of the machine capabilities.

>Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399

Yes, Dr. Rubin, we KNOW you fail to see this. You've been failing to see this
for at least 8 years. Why don't you write a compiler that allows all these
"I'm the boss" type operations you keep talking about? This would accomplish
3 things: (1) You'd get a compiler you can appreciate, (2) You'd know what
you're talking about as regards compiler design, and (3) You'd be so busy
writing this compiler that you would have no chance to post to comp.arch
with your complaints, thereby making us all happy.

Now quit whining and start designing!

PS: Item number 1 is meant to be taken seriously.
Items 2 & 3 are half- :-)}, and are not intended to offend.
--
Chris Shaw University of Alberta
cds...@cs.UAlberta.ca Now with new, minty Internet flavour!
CatchPhrase: Bogus as HELL !

Ian Dall

unread,
May 2, 1991, 8:22:09 PM5/2/91
to
In article <11...@mentor.cc.purdue.edu> hru...@pop.stat.purdue.edu (Herman Rubin) writes:
>We, and I include all of mankind, do not know enough to design a good
>system of communicating between people and computers now. We can design a
>fair one by realizing this and allowing the users to extend the language, and
>to communicate to the compiler this extension.

This is available. The best example I can think of is gcc/g++. I seem to
recall someone (Chris Torek?) posting an example of gcc asm's before in
this thread (or one of its many reincarnations).

If the machine has an instruction, you can use it! Certainly the syntax
is a bit arcane, but you can't expect to "extend" a compiler without
some knowledge of both the machine language and the compiler. This
syntax need only be tackled once per function per machine. The results
of your efforts can be hidden inside macros or inline functions in
an include file. If you can find a few other people with like requirements,
you can share your effort.
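
(For example, a find-first-one for the 386, packaged once -- a sketch;
the constraint strings are the arcane part:)

    /* gcc extended asm, hidden in an inline function as described.
     * BSF scans for the lowest 1 bit; its result is undefined for
     * x == 0, so that case is guarded explicitly. */
    static __inline__ int find_first_one(unsigned int x)
    {
        int pos;

        if (x == 0)
            return -1;
        __asm__("bsfl %1, %0" : "=r" (pos) : "rm" (x));
        return pos;
    }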

Eventually your group might be able to settle on a proposal for a formal
standard. In short the tools are available now if you don't mind a little
effort.

--
Ian Dall I'm not into isms, but hedonism is the most
harmless I can think of. -- Phillip Adams
ACSnet: i...@sibyl.eleceng.ua.oz
internet: i...@sibyl.eleceng.ua.oz.au

Chip Salzenberg

unread,
May 6, 1991, 1:19:57 PM5/6/91
to
According to hru...@pop.stat.purdue.edu (Herman Rubin):
>In article <282002...@tct.com>, ch...@tct.com (Chip Salzenberg) writes:
>> Give us a spec, and we'll have something to discuss. Why spend all
>> this time yammering about a hypothetical compiler, when that same
>> effort expended in spec-writing could produce a REAL IMPLEMENTATION?
>
>I have several times responded to this type of posting in email, and
>have not even received clarification of my questions.

Rubin-san has not E-Mailed me on this subject -- but no matter; he
chooses to carry on his correspondence in comp.arch.

>1. I am not quite sure what you mean by a specification.

A detailed and precise description of the desired compiler features
and input format. It would tell me what I need to know to write a
HROL (Herman Rubin Oriented Language) compiler.

Let me give a tiny example.

To define a binary operator <op> to be supported by a machine
instruction <inst>, code the following:

OPERATOR "<op>" ;
{ BINARY | UNARY } ;
PRECEDENCE { AFTER | BEFORE | SAME } <other-op> ;
ARGUMENT 1 (<type>, ...) [ IN ("reg1",...) ] ;
[ ARGUMENT 2 (<type>, ...) [ IN ("reg2",...) ] ; ]
INSTRUCTION "<inst> ARG1,ARG2" ;
RESULT (<type>) IN ("reg") ;

Of course, this tiny example doesn't address indirection, or processor
addressing modes, etc, etc. But it does give the flavor of what I mean
by "specification".

Got the picture yet?

Terry Ingoldsby

unread,
May 7, 1991, 3:19:05 PM5/7/91
to
In article <11...@mentor.cc.purdue.edu>, hru...@pop.stat.purdue.edu (Herman Rubin) writes:
> In article <40...@batman.moravian.EDU>, hal...@batman.moravian.EDU (David Halko) writes:
> > In article <97...@mentor.cc.purdue.edu>, hru...@pop.stat.purdue.edu (Herman Rubin) writes:
> > > In article <1991Apr5.1...@agate.berkeley.edu>, do...@eris.berkeley.edu (Doug Merritt) writes:
> > > > In article <1991Apr4...@capd.jhuapl.edu> wal...@capd.jhuapl.edu writes:

I had to leave the above 4 lines in, just cause it looks so awesome to see
a discussion going so many levels deep :^)
...
> Consider the following two operations, which comprise more than 25% of the
> operations in some highly efficient, from the "picocode" standpoint, methods
> for generating non-uniform random numbers.
>
> 1. Examine a bit, and take some action depending on it. Also, the next time
> this is to be done, this bit effectively no longer exists. It is necessary
> to make provision at least somewhere for replenishing the supply of bits.
>
> 2. Find the distance to the next one in a bit stream. The same considerations
> as above apply.

I have got to ask. Why is it so generally important that the distance between
bits can be determined efficiently? Note that I want to know, `why is it
important to ME, and to the general computing base?'. This is the question
that hardware designers and compiler writers ask themselves.

For the hardware/software designer to incorporate the features you want, it
is not enough to know that they are important to YOU. From
a business case, the only reason that they care if you are happy is if you
give them money. If the desired features are generally useful, then the
probability is high that inclusion of such features will generate lots of
money.

It seems to me that the only way your crusade for bit distance will ever be
successful is to show the general utility of such functions. Otherwise, no
matter how interesting (to you), you will not get these features.

Anyway, please tell us. What is so interesting/useful about non-uniform
random numbers?

--
Terry Ingoldsby ingoldsb%cty...@cpsc.ucalgary.ca
Land Information Services or
The City of Calgary ...{alberta,ubc-cs,utai}!calgary!ctycal!ingoldsb

Stephen P Spackman

unread,
May 8, 1991, 2:17:42 AM5/8/91
to
In article <6...@ctycal.UUCP> ingo...@ctycal.UUCP (Terry Ingoldsby) writes:

   It seems to me that the only way your crusade for bit distance will ever be
   successful is to show the general utility of such functions. Otherwise, no
   matter how interesting (to you), you will not get these features.

Are we talking about "find first one" here?

Find first one was a pretty useful instruction in our lisp compiler.
In fact, we figured we could speed the thing up a LOT by moving to the
Amiga and using the blitter to do bitmask stuff (which we wanted (a)
for closure algorithms and (b) for resource allocation).

A lot of CPU seems to get expended on variable length data encoding
around here. I haven't examined the code myself but it sounds like it
would come up there.

It's a notoriously useful thing for resource allocation in operating
systems, too.

I've surely wanted this more often than, say, floating point addition.

Of course, there are always other data representations, and so long as
the hardware people take the attitude that computers are for fluid
dynamics and to hell with compilers, operating systems and databases
(who uses computers for THAT stuff anyway? :-)....

Lotsa smilies.
----------------------------------------------------------------------
stephen p spackman Center for Information and Language Studies
systems analyst University of Chicago
----------------------------------------------------------------------

Herman Rubin

unread,
May 8, 1991, 11:46:00 AM5/8/91
to
In article <6...@ctycal.UUCP>, ingo...@ctycal.UUCP (Terry Ingoldsby) writes:
> In article <11...@mentor.cc.purdue.edu>, hru...@pop.stat.purdue.edu (Herman Rubin) writes:
> > In article <40...@batman.moravian.EDU>, hal...@batman.moravian.EDU (David Halko) writes:
> > > In article <97...@mentor.cc.purdue.edu>, hru...@pop.stat.purdue.edu (Herman Rubin) writes:
> > > > In article <1991Apr5.1...@agate.berkeley.edu>, do...@eris.berkeley.edu (Doug Merritt) writes:
> > > > > In article <1991Apr4...@capd.jhuapl.edu> wal...@capd.jhuapl.edu writes:

> I had to leave the above 4 lines in, just cause it looks so awesome to see
> a discussion going so many levels deep :^)
...
> > Consider the following two operations, which comprise more than 25% of the
> > operations in some highly efficient, from the "picocode" standpoint, methods
> > for generating non-uniform random numbers.

> > 1. Examine a bit, and take some action depending on it. Also, the next time
> > this is to be done, this bit effectively no longer exists. It is necessary
> > to make provision at least somewhere for replenishing the supply of bits.

> > 2. Find the distance to the next one in a bit stream. The same considerations
> > as above apply.

> I have got to ask. Why is it so generally important that the distance between
> bits can be determined efficiently? Note that I want to know, `why is it
> important to ME, and to the general computing base?'. This is the question
> that hardware designers and compiler writers ask themselves.

I believe most people are aware of the existence of simulation, including
Monte Carlo, or Las Vegas, methods for obtaining answers to otherwise
intractable problems. In these, it is almost always the case that
considerable use must be made of non-uniform random variables; the
quantities of interest are usually exponential, normal, Cauchy, binomial,
Poisson, etc., and not uniform.

In any particular problem, it is quite likely that thousands or millions or
even billions of these random numbers will be used. If this is not a situation
which calls for efficiency, it will do until a real one comes along. :-)
--

Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399

David Wright

unread,
May 8, 1991, 4:51:06 PM5/8/91
to
In article <6...@ctycal.UUCP> ingo...@ctycal.UUCP (Terry Ingoldsby) writes:
>In article <11...@mentor.cc.purdue.edu>, hru...@pop.stat.purdue.edu (Herman Rubin) writes:
>> In article <40...@batman.moravian.EDU>, hal...@batman.moravian.EDU (David Halko) writes:
>> > In article <97...@mentor.cc.purdue.edu>, hru...@pop.stat.purdue.edu (Herman Rubin) writes:
>> Consider the following two operations, which comprise more than 25% of the
>> operations in some highly efficient, from the "picocode" standpoint, methods
>> for generating non-uniform random numbers.
>>
>> 1. Examine a bit, and take some action depending on it. Also, the next time
>> this is to be done, this bit effectively no longer exists. It is necessary
>> to make provision at least somewhere for replenishing the supply of bits.
>>
>> 2. Find the distance to the next one in a bit stream. The same
>> considerations as above apply.
>
>I have got to ask. Why is it so generally important that the distance between
>bits can be determined efficiently? Note that I want to know, `why is it
>important to ME, and to the general computing base?'. This is the question
>that hardware designers and compiler writers ask themselves.

Hear hear. And this leads to a second question of HR, which has been
posed once before but not answered (or I missed it): Why does your
application have to work on a bit stream? What is wrong with
instead computing the data about the bit stream, and passing the
results along as, say, pairs (a,b), where a is the number of
consecutive 0's and b the number of consecutive 1's in the current
part of the stream?

>Anyway, please tell us. What is so interesting/useful about non-uniform
>random numbers?

Well, if he's generating numbers that conform to some other
distribution, the results certainly wouldn't be uniform random numbers
(though I think this is usually done by *starting* with a uniform
distribution and then manipulating that into, say, a Gaussian).

-- David Wright, not officially representing Stardent Computer Inc
wri...@stardent.com or uunet!stardent!wright

Groucho: America is waiting to hear him sing!
Chico: Well, he can sing pretty loud, but not that loud.
-- A Night at the Opera

Chip Salzenberg

unread,
May 9, 1991, 12:43:14 PM5/9/91
to
According to hru...@pop.stat.purdue.edu (Herman Rubin):
>In article <6...@ctycal.UUCP>, ingo...@ctycal.UUCP (Terry Ingoldsby) writes:
>> I have got to ask. Why is it so generally important that the distance
>> between bits can be determined efficiently? Note that I want to know,
>> `why is it important to ME, and to the general computing base?'.
>
>I believe most people are aware of the existence of simulation, including
>Monte Carlo, or Las Vegas, methods for obtaining answers to otherwise
>intractable problems.

That kind of problem is not "generally important" in that it does not
come up in the "general computing base".

Care to try again?

Herman Rubin

unread,
May 10, 1991, 9:47:48 AM5/10/91
to
In article <28297C...@tct.com>, ch...@tct.com (Chip Salzenberg) writes:
> According to hru...@pop.stat.purdue.edu (Herman Rubin):
> >In article <6...@ctycal.UUCP>, ingo...@ctycal.UUCP (Terry Ingoldsby) writes:
> >> I have got to ask. Why is it so generally important that the distance
> >> between bits can be determined efficiently? Note that I want to know,
> >> `why is it important to ME, and to the general computing base?'.

> >I believe most people are aware of the existence of simulation, including
> >Monte Carlo, or Las Vegas, methods for obtaining answers to otherwise
> >intractable problems.

> That kind of problem is not "generally important" in that it does not
> come up in the "general computing base".

That something is not important to everyone is no reason not to make it
available to those who can see a use for it.

How much of the driving public uses the low gears of automatic transmissions?

Guy Harris

unread,
May 10, 1991, 2:43:18 PM5/10/91
to
>> I have got to ask. Why is it so generally important that the distance between
>> bits can be determined efficiently? Note that I want to know, `why is it
>> important to ME, and to the general computing base?'. This is the question
>> that hardware designers and compiler writers ask themselves.
>
>I believe most people are aware of the existence of simulation, including
>Monte Carlo, or Las Vegas, methods for obtaining answers to otherwise
>intractable problems.

Yes, but unless the person who posed the original question does those
kinds of simulations, it's not important to him, and unless a given
member of the general computing base does them, it's not important to
them, either.

I.e., you answered an interesting question, namely "tell me a use for
finding the distance to the next 1 in a bit stream", but you didn't
answer the question that was asked.

Now, there are other places where finding the distance to the next 1 in
a bit stream is useful. It may well be that it's worth providing some
kind of hardware assist for it; I'm curious what forms of "hardware
assist" of that sort exist (other than doing it using the obvious
simple-minded loop, but in microcode).

Some questions then are "what other useful hardware would you have to
sacrifice, in some given implementation, to add the hardware assist for
'find next 1'?" and "will we, at some point, not have to sacrifice
anything really useful to get it?" so that you might want to put such an
operation into the instruction set anyway, and have a non-assisted
implementation early on.

(Is 'find next 1' similar enough to floating-point normalization that
the same hardware assistance can be used for both?)

BTW, I assume that by "most people" in "I believe most people are aware
of the existence of simulation..." you mean "most readers of this
group", which may well be true. I don't know how "find first 1" comes
in handy for those simulations, though.

Ross Alexander

unread,
May 10, 1991, 1:11:53 PM5/10/91
to
hru...@pop.stat.purdue.edu (Herman Rubin) writes:
[much chat about whether Herman's apps are typical elided]

>How much of the driving public uses the low gears of automatic transmissions?

All of them, Herman. Whenever they accelerate from a stop. That's
what an automatic transmission is for.

--
Ross Alexander r...@cs.athabascau.ca (403) 675 6311 ve6pdq
`You were s'posed to laugh!' -- Zippy

Greg Hennessy

unread,
May 10, 1991, 11:00:35 AM5/10/91
to
Chip Salzenberg writes:
#> That kind of problem is not "generally important" in that it does not
#> come up in the "general computing base".

Herman Rubin writes:
#That something is not important to everyone is no reason no to make it
#available to those who can see a use for it.

Chip didn't say something had to be important for everyone before
making it available, Chip said it had to be important to the general
base. Those aren't the same thing.

Herman, you seem to have quite specialized computer needs, and more
knowledge about computers than the average person, but the types of
things you do are not useful to J. Random User, so people who write
compilers for general computer users don't find it cost effective to
add your optimizations into their code.

As a radio-astronomer, I have specialized needs in computing also. I
don't complain that there is no commodity language to reduce my data,
instead I (and other radio-astronomers) write all our code ourselves.
Our market is not large enough to expect off the shelf solutions to
our problems.

Why don't you apply for a grant, and hire someone to write a language
for you?

--
-Greg Hennessy, University of Virginia
USPS Mail: Astronomy Department, Charlottesville, VA 22903-2475 USA
Internet: gs...@virginia.edu
UUCP: ...!uunet!virginia!gsh7w

Zalman Stern

unread,
May 11, 1991, 7:37:44 AM5/11/91
to
In article <77...@auspex.auspex.com> g...@auspex.auspex.com (Guy Harris) writes:
[...]

>Now, there are other places where finding the distance to the next 1 in
>a bit stream is useful. It may well be that it's worth providing some
>kind of hardware assist for it; I'm curious what forms of "hardware
>assist" of that sort exist (other than doing it using the obvious
>simple-minded loop, but in microcode).

The IBM RS/6000 has a single-cycle count-leading-zeros instruction. The
hardware is also used by the multiplier to short-circuit multiplies by
smaller constants. The operation takes from 3 to 5 cycles depending on the
size of the multiplier. The clz hardware is used to decide how much work it
has to do.

>
>Some questions then are "what other useful hardware would you have to
>sacrifice, in some given implementation, to add the hardware assist for
>'find next 1'?" and "will we, at some point, not have to sacrifice
>anything really useful to get it?" so that you might want to put such an
>operation into the instruction set anyway, and have a non-assisted
>implementation early on.

A count-leading-zeros (or ffs if you prefer) instruction is not a big deal
other than that you have to design the hardware to do it. (I.e. it doesn't
place unreasonable demands on the register file, and it doesn't eat lots of
opcode space.) However it's not clear that this single instruction does the
job. You can either look for a 1 or a 0 bit and you can look from most
significant to least significant or vice-versa. Herman probably wants all
of these options...

>
>(Is 'find next 1' similar enough to floating-point normalization that
>the same hardware assistance can be used for both?)

Putting it in the FP unit makes it harder to use the result for shifts and
such on machines with separate FP and integer units. (Hooking the integer
register file up to the FP normalization hardware is going to be very
painful.)

My guess is that special purpose hardware would be useful for this sort of
operation. DMA it in at bus speed and write the values into scratch RAM.
Interrupt the processor when the scratch RAM gets full or the buffer gets
empty. (Or DMA the scratch RAM back out too...) Go ahead and toss a Xilinx
or something into the hardware to make it programmable.
--
Zalman Stern, MIPS Computer Systems, 928 E. Arques 1-03, Sunnyvale, CA 94088
zal...@mips.com OR {ames,decwrl,prls,pyramid}!mips!zalman (408) 524 8395
"Never rub another man's rhubarb" -- the Joker via Pop Will Eat Itself

Herman Rubin

unread,
May 11, 1991, 11:05:14 AM5/11/91
to
In article <rwa.67...@aupair.cs.athabascau.ca>, r...@cs.athabascau.ca (Ross Alexander) writes:
> hru...@pop.stat.purdue.edu (Herman Rubin) writes:
> [much chat about whether Herman's apps are typical elided]
> >How much of the driving public uses the low gears of automatic transmissions?
>
> All of them, Herman. Whenever they accelerate from a stop. That's
> what an automatic transmission is for.

To clarify, I meant the specific ability to place the transmission in low gears.

Current computer hardware can do anything, assuming sufficient external storage,
but not efficiently. Similarly with software.

Raul Rockwell

unread,
May 11, 1991, 12:38:55 AM5/11/91
to
Herman Rubin:
>> Consider the following two operations, ...
>> 1. Examine a bit, ...

>> 2. Find the distance to the next one in a bit stream.

Terry Ingoldsby:


>I have got to ask. Why is it so generally important that the
>distance between bits can be determined efficiently?

David Wright:


   Hear hear. And this leads to a second question of HR, which has been
   posed once before but not answered (or I missed it): Why does your
   application have to work on a bit stream?

Chip Salzenberg:


   That kind of problem is not "generally important" in that it does not
   come up in the "general computing base".

harumph.

At work we have a hand-optimized assembly routine to convert from a
bit array to a list of indices. It gets used a lot.

Please note that bit lists are very handy for dealing with selection
problems. They are very compact, and have very predictable size
requirements. Selection problems include highlighting text for
emphasis and intermediate results during a database query, as well as Herman
Rubin's probabilistic models.

They're also handy in some domains for converting from some arbitrary
list of integers to a unique, sorted list (but otherwise the same
integers), in O(n).

We use "bit lists as integers" for other things too (once you have a
generic feature you tend to use it whenever it's handy -- particularly
if it's fast).
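
(In C rather than assembly, the shape of that routine is roughly this
-- the names are invented:)

    /* Convert a bit array to the list of indices of its 1 bits;
     * returns how many indices were written.  The assembly version
     * earns its keep by doing this a word at a time. */
    int bits_to_indices(const unsigned char *bits, int nbits, int *out)
    {
        int i, n = 0;

        for (i = 0; i < nbits; i++)
            if (bits[i >> 3] & (1 << (7 - (i & 7))))
                out[n++] = i;
        return n;
    }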

As for the problem of "distance between bits" vs "absolute position of
bits", note that if you have a _lot_ of bits you start talking about
very long lists of integers (and possibly even exceeding maxint, or
whatever for the representation you're using). This is particularly
painful if your bitlist is dense. Both have their uses (as does
a list of ordered pairs of the form (position, offset)), and we happen to
use them a lot.

Raul Rockwell

Herman Rubin

unread,
May 12, 1991, 7:20:42 AM5/12/91
to
In article <1991May10.2...@ingres.Ingres.COM>, j...@ingres.com (Jon Krueger) writes:
> In article <12...@mentor.cc.purdue.edu>, Herman Rubin does not address
> the question:

> >> Why is it so generally important that the distance between
> >> bits can be determined efficiently ... why is it

> >> important to ME, and to the general computing base?
>
> Instead he addresses the question "why is it so specifically
> important to Herman Rubin":

>
> > I believe most people are aware of the existence of simulation, including
> > Monte Carlo, or Las Vegas, methods for obtaining answers to otherwise
> > intractable problems.
>
> No one asked that question. We believe it's important to you, Herman.
> Can you answer the question that was asked?
>
> Why is it important to the general computing base that the distance
> between bits can be determined efficiently?

The summary points out the problems. Does one have to go to great lengths
to get a car which has features not in the "general driving base"?

Do film companies only make films for the "general public"?

Should one even consider that universities only teach courses for the
"general student"?

In addition, there is much use of instructions for systems programs and
libraries which the individual programmer will not use. It is unlikely
that the person using the subroutines which would make use of efficient
obtaining of the distance between 1's in a bit stream would be aware of
this, any more than the person who computes a trigonometric or exponential
function is aware that that algorithm has an integer quotient and floating
remainder in it.

Gideon YUVAL

unread,
May 9, 1991, 2:57:41 PM5/9/91
to
In article <STEPHEN.91...@pesto.uchicago.edu> ste...@pesto.uchicago.edu (Stephen P Spackman) writes:
>Are we talking about "find first one" here?
>
>Find first one was a pretty useful instruction in our lisp compiler.

If you start from the low-order bit, you can find the first "one"
bit by:

table[ (x XOR (x-1)) MOD 37]

On a 386, this is 4X as fast (worst-case) as the "hardware" BSF.
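
(The 37-entry table can be built once at startup; a sketch -- note that
x == 0 hashes to the same slot as bit 31, so it is tested separately:)

    static int pos37[37];    /* maps (x ^ (x-1)) % 37 to a bit index */

    /* For x whose lowest 1 is bit k, x ^ (x-1) == 2^(k+1) - 1, and
     * those 32 values are all distinct mod 37. */
    void init_pos37(void)
    {
        int k;
        for (k = 0; k < 32; k++) {
            unsigned long x = 1UL << k;
            pos37[(x ^ (x - 1)) % 37] = k;
        }
    }

    int first_one(unsigned long x)   /* index of lowest set bit, -1 if none */
    {
        return x ? pos37[(x ^ (x - 1)) % 37] : -1;
    }
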
--
Gideon Yuval, gid...@microsof.UUCP, 206-882-8080 (fax:206-883-8101;TWX:160520)

Daniel Mocsny

unread,
May 12, 1991, 12:53:23 PM5/12/91
to
In article <12...@mentor.cc.purdue.edu> hru...@pop.stat.purdue.edu (Herman Rubin) writes:
>In article <1991May10.2...@ingres.Ingres.COM>, j...@ingres.com (Jon Krueger) writes:
>> Why is it important to the general computing base that the distance
>> between bits can be determined efficiently?
>
>The summary points out the problems. Does one have to go to great lengths
>to get a car which has features not in the "general driving base"?

I believe that Herman Rubin could solve his problem with these steps:

1. Develop a computer code which solves a reasonable set of problems,
and which relies extensively on a computer's ability to determine the
distance between bits.

2. Submit the code to SPEC and get it included in the standard
benchmark suite.

3. Watch panic-stricken marketroids scurry to protect their SPEC
ratings by further partitioning SPEC into an additional category:
SPECrubins.


--
Dan Mocsny
Internet: dmo...@minerva.che.uc.edu
