Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

VLIW Architecture

50 views
Skip to first unread message

Custeau RD

unread,
Sep 26, 1989, 3:17:19 PM9/26/89
to

I am looking for references or information on VLIW architectures for
a fourth-year architecture seminar. Any help would be greatly appreciated.

Please reply by e-mail, as I don't often read the news.

Thanks in advance!

Ben Cutler

unread,
Sep 28, 1989, 9:23:13 PM9/28/89
to


A good ``introductory'' text is ``Bulldog: A Compiler for VLIW Architectures'',
by John Ellis, which won the ACM Doctoral Dissertation Award in 1985. It's
available from MIT Press. For information on a commercial VLIW
implementation, contact Multiflow Computer at (203) 488-6090.

Stan Lackey

unread,
Sep 29, 1989, 8:58:01 AM9/29/89
to
In article <10...@m3.mfci.UUCP> cut...@mfci.UUCP (Ben Cutler) writes:
>In article <251FCB3...@maccs.dcss.mcmaster.ca> cs4...@maccs.dcss.mcmaster.ca (Custeau RD) writes:
>> I am looking for references or information on VLIW architectures for
>>a fourth-year architecture seminar. Any help would be greatly appreciated.
>A good ``introductory'' text is ``Bulldog: A Compiler for VLIW Architectures'',

It's a great work. Nearly all of the attention is on the compiler,
and doesn't get into hardware or OS issues. A main difference between
the book and what Multiflow ended up shipping is in the memory
interface. The book expected a separate path from each main data path
in the CPU to its corresponding bank of memory, with relatively slow
transfer paths in between. Thus, for peak performance, the compiler
would have to be able to predict which memory bank a datum is in, and
give the instruction to the corresponding CPU.

In many cases, this is not a problem, as scalar variables (in FORTRAN
anyway) have their address known at compiler time, and of course the
base of all arrays are known. But apparently calculated indices into
arrays are common enough to be a problem. There was much attention
paid to this "memory bank disambiguation" in the book, and even
included a syntax for allowing the user to give hints to the compiler
to aid in disambiguation.

From what I could tell from looking at Multiflow's released
documentation, Multiflow eliminated this problem with a kind of a
"distributed crossbar" (as I would call it) between the CPU's and the
memory banks. This allows the disambiguation to be resolved at
runtime, and answered my main criticism of the technology.

Another difference is that the prototype compiler for the thesis was
written in Lisp. I think Multiflow ended up using C.

-Stan
Not affiliated with Multiflow except as an interested observer. Not
necessarily the views of BBN either.

Herman Rubin

unread,
Oct 1, 1989, 7:43:40 AM10/1/89
to
In article <10...@m3.mfci.UUCP>, cut...@mfci.UUCP (Ben Cutler) writes:
> In article <251FCB3...@maccs.dcss.mcmaster.ca> cs4...@maccs.dcss.mcmaster.ca (Custeau RD) writes:
| >
| > I am looking for references or information on VLIW architectures for
| >a fourth-year architecture seminar. Any help would be greatly appreciated.

> A good ``introductory'' text is ``Bulldog: A Compiler for VLIW Architectures'',
> by John Ellis, which won the ACM Doctoral Dissertation Award in 1985. It's
> available from MIT Press. For information on a commercial VLIW
> implementation, contact Multiflow Computer at (203) 488-6090.

As one who always finds ways to use the architecture that the compiler
writers did not think about, I maintain that this book helps little.

Why is it the case that people in the computing field think that someone
can understand a computer in ignorance of its instruction set and the
temporal operation of those instructions?
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hru...@l.cc.purdue.edu (Internet, bitnet, UUCP)

Paul Rodman

unread,
Oct 2, 1989, 10:18:08 AM10/2/89
to
In article <16...@l.cc.purdue.edu> c...@l.cc.purdue.edu (Herman Rubin) writes:

>
>> A good ``introductory'' text is ``Bulldog: A Compiler for VLIW Architectures'',
>> by John Ellis, which won the ACM Doctoral Dissertation Award in 1985. It's
>> available from MIT Press. For information on a commercial VLIW
>> implementation, contact Multiflow Computer at (203) 488-6090.
>
>As one who always finds ways to use the architecture that the compiler
>writers did not think about, I maintain that this book helps little.

Helps little for what? Compilers ALWAYs involve tradeoffs, as does
hardware, as does any engineering endeavor. I would be suprised if you
_couldn't_ find ways to use computers that the compiler writers didn't
think about..(or were to stupid to realize, or decided to blow off...)

IMHO, this book is _extremely_ well written, and filled with good ideas.
When I first read it there were some things I disagreed with, but I was
very impressed with it overall.

>
>Why is it the case that people in the computing field think that someone
>can understand a computer in ignorance of its instruction set and the
>temporal operation of those instructions?

What do you mean by "understand"? The Bulldog effort broke important
ground, period. Throwing stones at it because it didn't cover everything
is silly. It was a phd thesis, not a product!

Belive me, in order to generate code for a VLIW you _do_ need to understand
the instruction set and the temporal operation of those instructions!
But the kind of stuff Ellis talks about is at a higher level and
has to come first. Obviously the best product will try to integrate
things from the lowest gate level back to the phase-one front end of the
compiler, but it is human nature to try to comparmentalize things.

Too _many_ folks have designed computers and focused on the low-level
instructions set/ pipelining issues without having a clue as to
how the compiler will actually find a way use them. When it comes time
to emit a .o file, hand waving won't work and bad algorithms will
fall apart on somebody's code.

Written any compilers lately?

>--
>Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907

Paul K. Rodman / KA1ZA / rod...@multiflow.com
Multiflow Computer, Inc. Tel. 203 488 6090 x 236
Branford, Ct. 06405

Hank Dietz

unread,
Oct 2, 1989, 3:42:19 PM10/2/89
to
In article <16...@l.cc.purdue.edu> c...@l.cc.purdue.edu (Herman Rubin) writes:
>In article <10...@m3.mfci.UUCP>, cut...@mfci.UUCP (Ben Cutler) writes:
>> In article <251FCB3...@maccs.dcss.mcmaster.ca> cs4...@maccs.dcss.mcmaster.ca (Custeau RD) writes:
>| >
>| > I am looking for references or information on VLIW architectures for
>| >a fourth-year architecture seminar. Any help would be greatly appreciated.
>
>> A good ``introductory'' text is ``Bulldog: A Compiler for VLIW Architectures'',
>> by John Ellis, which won the ACM Doctoral Dissertation Award in 1985. It's
>> available from MIT Press. For information on a commercial VLIW
>> implementation, contact Multiflow Computer at (203) 488-6090.
>
>As one who always finds ways to use the architecture that the compiler
>writers did not think about, I maintain that this book helps little.

VLIW machine compilers essentially search for optimal schedules of
instructions; I don't see how a full-width search could be ommitting things
that Dr. Rubin would want to do. ;-)

>Why is it the case that people in the computing field think that someone
>can understand a computer in ignorance of its instruction set and the
>temporal operation of those instructions?

The primary reason humans have done so well is not that we are so much
better than other life forms, but rather that we build much better tools.
To accomplish any large task, a human must be able to be ignorant of at least
some details -- VLIW compiler technology is a prime example of a mechanism
keeping track of, and optimizing use of, a structure too complex for humans
to manage directly.

I suppose you'd rather not use any tools... but then why do you want to use
the tools called "computers?"

-ha...@ecn.purdue.edu

>--
>Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
>Phone: (317)494-6054
>hru...@l.cc.purdue.edu (Internet, bitnet, UUCP)

PS: Dr. Rubin's views do not represent those of Purdue University. ;-)

Herman Rubin

unread,
Oct 3, 1989, 8:41:20 AM10/3/89
to
In article <13...@pur-ee.UUCP>, ha...@pur-ee.UUCP (Hank Dietz) writes:
> In article <16...@l.cc.purdue.edu> c...@l.cc.purdue.edu (Herman Rubin) writes:

> >As one who always finds ways to use the architecture that the compiler
> >writers did not think about, I maintain that this book helps little.
>
> VLIW machine compilers essentially search for optimal schedules of
> instructions; I don't see how a full-width search could be ommitting things
> that Dr. Rubin would want to do. ;-)
>
> >Why is it the case that people in the computing field think that someone
> >can understand a computer in ignorance of its instruction set and the
> >temporal operation of those instructions?

What kind of integer arithmetic is there in hardware? Examples of the
problem are what kinds of integer products can be formed in one operation.
Some machines can form long x long -> double long in one operation, and
some cannot. Some allow this to be done unsigned and some do not. Some
machines allow double/single getting quotient and remainder in one operation
and some do not. Some allow this to be unsigned and some do not.

Some machines have hardware square root and some do not. A few allow
fixed-point arithmetic, but most do not. Different machines have different
addressing modes; vector operations should be coded differently depending on
that. I have profitably used bit operations on floating-point numbers; this
means that I must know the exact representation.

How do you code the following: look at one bit and transfer if it is a 1,
setting pointers so that that bit will not be looked at again? In the cases
where I envision using this, it is a tool and will be used frequently.
Another such is the spacing between 1's in a bit stream, and in this case
there are alternatives.

What does your HLL have to say about these?

> The primary reason humans have done so well is not that we are so much
> better than other life forms, but rather that we build much better tools.
> To accomplish any large task, a human must be able to be ignorant of at least
> some details -- VLIW compiler technology is a prime example of a mechanism
> keeping track of, and optimizing use of, a structure too complex for humans
> to manage directly.

The human knows when not to use the tools provided. Are you so dead sure that
I cannot manage such structures better than the compiler? The most complicated
instruction set I have seen is MUCH simpler than HLLs. The compiler maps a
route on Interstates; I find a short cut on good roads. Will an autodriver
let you back the car out of the garage so you can clean the garage?

> I suppose you'd rather not use any tools... but then why do you want to use
> the tools called "computers?"

Use the best tool for the user and the job. The crude mechanic must put the
car on the analyzer and use what it tells him. There are a few good mechanics
who can do better listening to the engine.

> -ha...@ecn.purdue.edu


>
> PS: Dr. Rubin's views do not represent those of Purdue University. ;-)

PPS: Neither do those of Dr. Dietz.

Hank Dietz

unread,
Oct 3, 1989, 7:06:57 PM10/3/89
to
In article <16...@l.cc.purdue.edu> c...@l.cc.purdue.edu (Herman Rubin) writes:
...[numerous "why don't HLLs let me say..." flames omitted]...

>What does your HLL have to say about these?

What HLL? What does this have to do with VLIW techniques? Dr. Rubin is
complaining about languages -- but we are talking about VLIW compiler
analysis and program transformation technology, not language design.

I hope that this confusion is not common. Perhaps a lot of compilers do
blindly spit-out the "obvious" code for badly-designed language constructs,
but that certainly isn't the state of the art. I would think that a person
who has spent some time counting T-states would really appreciate the VLIW
work that Ellis presents in his award-winning PhD thesis... I know I do.

>... Are you so dead sure that


>I cannot manage such structures better than the compiler? The most complicated

>instruction set I have seen is MUCH simpler than HLLs. ...

VLIW technology is complex because of its use of parallelism -- it has very
little to do with instruction set complexity issues. Generating good code
for a VLIW is most like microcode scheduling/compaction for a huge,
asymmetric, microcoded machine. You really don't even want to try it by
hand... well, I know I don't. And why bother? VLIW compiler techniques
come very close to optimal schedules every time. I can't match it, let
alone beat it.

-ha...@ecn.purdue.edu

"A good workman is known by his tools."
Intro to Chapter 12, F. Brooks, "The Mythical Man-Month," p. 127.

Herman Rubin

unread,
Oct 4, 1989, 7:57:51 AM10/4/89
to
In article <13...@pur-ee.UUCP>, ha...@pur-ee.UUCP (Hank Dietz) writes:
> In article <16...@l.cc.purdue.edu> c...@l.cc.purdue.edu (Herman Rubin) writes:
> ...[numerous "why don't HLLs let me say..." flames omitted]...
> >What does your HLL have to say about these?
>
> What HLL? What does this have to do with VLIW techniques? Dr. Rubin is
> complaining about languages -- but we are talking about VLIW compiler
> analysis and program transformation technology, not language design.

If the language the compiler compiles does not have the operations to be used
in my program, I cannot use them. I have seen NO language designed for
the efficient use of hardware operations. The choice of the algorithm
depends on the hardware, and there can be lots of different algorithms.

An example: most of the efficient ways of generating non-uniform random
variables are acceptance-rejection. This is only partly vectorizable on
CRAYs, and is especially difficult on the CRAY 1. This is due to the
non-existence of some hardware instructions. It would be necessary to
try out a number of computationally less efficient methods to find the
ones which are best on the machine. I would not consider any of them on
the CYBER 205, as they are obviously worse.

> I hope that this confusion is not common. Perhaps a lot of compilers do
> blindly spit-out the "obvious" code for badly-designed language constructs,
> but that certainly isn't the state of the art. I would think that a person
> who has spent some time counting T-states would really appreciate the VLIW
> work that Ellis presents in his award-winning PhD thesis... I know I do.
>
> >... Are you so dead sure that
> >I cannot manage such structures better than the compiler? The most complicated
> >instruction set I have seen is MUCH simpler than HLLs. ...
>
> VLIW technology is complex because of its use of parallelism -- it has very
> little to do with instruction set complexity issues. Generating good code
> for a VLIW is most like microcode scheduling/compaction for a huge,
> asymmetric, microcoded machine. You really don't even want to try it by
> hand... well, I know I don't. And why bother? VLIW compiler techniques
> come very close to optimal schedules every time. I can't match it, let
> alone beat it.

What you are saying is that an instruction scheduler can do better thqn a
human. This is mostly true. It is necessary to allow human override, as
the scheduler only takes into account what its designers thought was
of value.

But this does not answer my complaint. I need to know these instructions
to make use of them. I certainly can use information about their timing
and overlap to select those algorithms which will run faster. I do not
insist that an assembler carry out the instructions in the order I present
them, as long as the program logic is maintained. This is what an
optimizer does, and I do not oppose this.

I do not even object to the optimizer-compiler-scheduler even suggesting
alternates which are more efficient, but which it cannot be sure will work.
One of our programmers asked me whether some transfers could be deleted
from code. It was possible by essentially duplicating some code, and
making inessential changes in the program. I doubt a compiler could
figure this out.

Preston Briggs

unread,
Oct 4, 1989, 12:36:25 PM10/4/89
to

Steve Carr showed me this transformation, which can be done automatically
(and *is* done, in a local experimental compiler).
This is slightly contrived, for expository simplicity.

A simple DO-loop

DO i = 1, 2*n
x(i) = f(y(i))
z(i) = g(x(i-1))
ENDDO

where x, y, and z are arrays and f and g are arbitrary functions.
Note the reuse of each x(i) value one iteration after it has been set.
We can take advantage of this reuse

x1 = x(0)
DO i = 1, 2*n
x0 = f(y(i))
x(i) = x0
z(i) = g(x1)
x1 = x0
ENDDO

This is nice because we save a memory access per iteration.
(Assuming a smart backend (normal optimizing compiler) that
will allocate x0 and x1 to registers.)
The problem is that we'll end up with an extra reg-reg copy
(x1=x0) at the end of the loop.
This can be avoided by unrolling slightly to subsume the copy.

x1 = x(0)
DO i = 1, n
x0 = f(y(i))
x(i) = x0
z(i) = g(x1)
x1 = f(y(i+1))
x(i+1) = x1
z(i) = g(x0)
ENDDO

The first transformation, dependence-based register allocation,
is due to Randy Allan and Ken Kennedy. The 2nd transform,
unrolling to subsume copies, is in Kennedy's thesis (1971).
Both have been implemented at Rice by Steve Carr.

Of course, this is a simple example; but these transformations
(with others of a similar nature) can give tremendous improvements
in real code.

What's the point? Compilers are getting better. Humans had to
come up with the original ideas (certainly compiler writers learn
from human coders), but a compiler can apply the ideas repeatedly,
without human effort, to amazingly nastly code.

Regards,
Preston Briggs

Preston Briggs

unread,
Oct 4, 1989, 3:33:43 PM10/4/89
to
In article <19...@brazos.Rice.edu>
pre...@titan.rice.edu (Preston Briggs) writes:

> DO i = 1, 2*n
> x(i) = f(y(i))
> z(i) = g(x(i-1))
> ENDDO

becomes

> x1 = x(0)
> DO i = 1, 2*n
> x0 = f(y(i))
> x(i) = x0
> z(i) = g(x1)
> x1 = x0
> ENDDO

becomes

> x1 = x(0)
> DO i = 1, n
> x0 = f(y(i))
> x(i) = x0
> z(i) = g(x1)
> x1 = f(y(i+1))
> x(i+1) = x1
> z(i) = g(x0)
> ENDDO


but this last is wrong.
It should be

x1 = x(0)
DO i = 1, 2*n, 2


x0 = f(y(i))
x(i) = x0
z(i) = g(x1)
x1 = f(y(i+1))
x(i+1) = x1

z(i+1) = g(x0)
ENDDO

I'm sorry for any confusion.
Think of this as an example of why
I'd prefer an optimizer to transform my program.

Preston

John R. Levine

unread,
Oct 4, 1989, 10:58:41 PM10/4/89
to
In article <16...@l.cc.purdue.edu> c...@l.cc.purdue.edu (Herman Rubin) writes:
>I have seen NO language designed for the efficient use of hardware operations.
Gee, I have, albeit for specific computers, e.g.:

Bliss-10 PDP-10
Fortran IBM 709

Neither was very good in the portability department, at least not in its
initial form, as befits a hardware specific language.

To return to the original argument, John Ellis' book doesn't concern itself
with specific instruction sets because at the time he wrote it, VLIWs
existed only on paper. The concrete design of a VLIW happened only after
most of the rest of the Yale VLIW project left to build real hardware at
Multiflow. I was at Yale at the time and the longest instruction word we
had was 36 bits in our DEC-20. Or maybe 48 bits in a PDP-11.
--
John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 492 3869
jo...@esegue.segue.boston.ma.us, {ima|lotus}!esegue!johnl, Lev...@YALE.edu
Massachusetts has 64 licensed drivers who are over 100 years old. -The Globe

Richard O'Keefe

unread,
Oct 5, 1989, 2:24:48 AM10/5/89
to
In article <16...@l.cc.purdue.edu>, c...@l.cc.purdue.edu (Herman Rubin) writes:
> If the language the compiler compiles does not have the operations to be used
> in my program, I cannot use them. I have seen NO language designed for
> the efficient use of hardware operations. The choice of the algorithm
> depends on the hardware, and there can be lots of different algorithms.

How about PL-360 and Bliss-10, both of which let you generate any instruction
you like? Come to that, how about the C compiler that comes with UNIX in
System V.3 for the 386, which lets you write assembly-code routines that
are expanded in-line? Or how about the SunOS C/Fortran/Pascal/Modula2
compilers, which have a pass that replaces apparent procedure calls with
arbitrary assembly code of your choice (specified in .il files) before
final code optimisation?

Herman Rubin clearly enunciates a policy which I believe is responsible
for much of the unreliability of present-day software:
the machine is what it is,
the programmer's job is to exploit the machine,
the language's job is to expose the machine to the programmer.
The view that I take is
the programmer's job is to express his intentions clearly,
the machine's job is to obey the programmer,
the language's job is to provide a *simple* conceptual model.
Rubin is willing to be the servant of his machines. I'm not.
I've exchanged E-mail with a floating-point expert who had some
illuminating things to say about the work being done for Ada; it turns
out that they can do amazing things without needing to know the details
of the hardware instructions, but what does hurt them is the extent to
which Ada already lets hardware vagaries show through.

Herman Rubin

unread,
Oct 5, 1989, 8:42:38 AM10/5/89
to
In article <23...@munnari.oz.au>, o...@cs.mu.oz.au (Richard O'Keefe) writes:
> In article <16...@l.cc.purdue.edu>, c...@l.cc.purdue.edu (Herman Rubin) writes:
> > If the language the compiler compiles does not have the operations to be used
> > in my program, I cannot use them. I have seen NO language designed for
> > the efficient use of hardware operations. The choice of the algorithm
> > depends on the hardware, and there can be lots of different algorithms.
>
> How about PL-360 and Bliss-10, both of which let you generate any instruction
> you like? Come to that, how about the C compiler that comes with UNIX in
> System V.3 for the 386, which lets you write assembly-code routines that
> are expanded in-line? Or how about the SunOS C/Fortran/Pascal/Modula2
> compilers, which have a pass that replaces apparent procedure calls with
> arbitrary assembly code of your choice (specified in .il files) before
> final code optimisation?

I do not believe that the .il approach is quite adequate. As to the others
you mention, I am unfamiliar with them. And even that is inadequate; I want
to be able even to introduce overloaded infix operators in addition to the
usual ones. Another feature missing from most languages since day 1 is to
allow the return of a list (NOT a struct).

> Herman Rubin clearly enunciates a policy which I believe is responsible
> for much of the unreliability of present-day software:
> the machine is what it is,
> the programmer's job is to exploit the machine,
> the language's job is to expose the machine to the programmer.
> The view that I take is
> the programmer's job is to express his intentions clearly,
> the machine's job is to obey the programmer,
> the language's job is to provide a *simple* conceptual model.
> Rubin is willing to be the servant of his machines. I'm not.

I would like to be able to travel to any place in the world at low cost
by stepping into a booth and pressing a button. I would like a computer
which has a rather large number of machine instructions now lacking with
a gigabyte memory, a reasonable screen, good color graphics, a WYSIWYG
editor capable of handling mathematical expressions well, at a cost of $1000.

These machines do not exist. Does that mean that I should not use the ones
that are available? I should not travel by airplane because there are no
teleportation booths? That I should do no computing because the machines
I would like do not exist? That I should write no papers because the
processors I want are not there? What would O'Keefe do in these situations?

My travel intentions depend on what is available and how long it will take.
I would like to go to more meetings than I do. So I state that my intention
is to go to such-and-such meetings for X dollars and return home to sleep
each night. Okay, machines; obey me!

I have clearly stated that the performance of an algorithm depends on the
hardware, so much so that a single hardware instruction can be crucial.
The current languages do not provide a conceptual model for the operations
which I have thought of. I ask O'Keefe to tell me how to find his Utopian
machine and language situation for my problems.

> I've exchanged E-mail with a floating-point expert who had some
> illuminating things to say about the work being done for Ada; it turns
> out that they can do amazing things without needing to know the details
> of the hardware instructions, but what does hurt them is the extent to
> which Ada already lets hardware vagaries show through.

How can it be otherwise?

Richard O'Keefe

unread,
Oct 6, 1989, 1:23:26 AM10/6/89
to
In article <16...@l.cc.purdue.edu>, c...@l.cc.purdue.edu (Herman Rubin)
complained that he had "seen no language designed for the efficient use of
hardware operations".
In article <23...@munnari.oz.au>, o...@cs.mu.oz.au (I) listed several
languages (PL-360, Bliss-10, SunOS C) which provide full access to
all hardware instructions.

In article <16...@l.cc.purdue.edu>, c...@l.cc.purdue.edu (Herman Rubin)
replied

> I do not believe that the .il approach is quite adequate.

What, precisely, can it not do? Admittedly, this approach relies on
the back-end optimiser being able to handle any assembly code you throw
at it, but then, so must any scheme giving you access to all the operations.

> As to the others
> you mention, I am unfamiliar with them. And even that is inadequate;

If you don't know them, how can you know they are inadequate?

> I want
> to be able even to introduce overloaded infix operators in addition to the
> usual ones. Another feature missing from most languages since day 1 is to
> allow the return of a list (NOT a struct).

Having used Fortran, Algol, SNOBOL, Lisp, Prolog, APL, COBOL, PL/I,
and lots of other languages, I gave up thinking that the syntax of
programming languages had any great importance a long time ago.
I should have mentioned Ada as one of the languages which provides
full machine access; it has both assembly code inserts and overloaded
operators. I don't know precisely what Rubin means by returning a list;
Algol 68 permits the return of anything (except pointers into the local
frame). The reason that most languages don't return lists is precisely
that they are designed to let the hardware show through, and you can't
fit much in a register.

I wrote


> Herman Rubin clearly enunciates a policy which I believe is responsible
> for much of the unreliability of present-day software:
> the machine is what it is,
> the programmer's job is to exploit the machine,
> the language's job is to expose the machine to the programmer.
> The view that I take is
> the programmer's job is to express his intentions clearly,
> the machine's job is to obey the programmer,
> the language's job is to provide a *simple* conceptual model.
> Rubin is willing to be the servant of his machines. I'm not.

Rubin replied that he would like transmatters and the ultimate computer.


> These machines do not exist. Does that mean that I should not use the ones
> that are available? I should not travel by airplane because there are no
> teleportation booths? That I should do no computing because the machines
> I would like do not exist? That I should write no papers because the
> processors I want are not there? What would O'Keefe do in these situations?

Why, I would do the same as I do now. Nowhere in the text cited above
do I say that machines should be ideal! The language's job is to
conceal the deficiencies of the computer. The computer can only do
32-bit arithmetic? Then it is the language's job to implement
arbitrary-precision integers on top of that and not bother me with
stupid little details. For a more common example, I really do not want
to be bothered with the MC68010 limitation that a shift-by-literal can
shift by at most 8. (I have used a language where this limit showed
through, and I did _not_ like it.) A C compiler papers over this limit,
and so it should. I normally use a language which runs 2 to 4 times
slower than C; I do this gladly because the reduction in _language_
complexity means that I can manage more complex _algorithms_, and win
back O(n) speed that way. (Yes, such algorithms can then in principle
be recoded in C. But it is much harder to _develop_ complex algorithms
in complex languages.)

I wrote


> I've exchanged E-mail with a floating-point expert who had some
> illuminating things to say about the work being done for Ada; it turns
> out that they can do amazing things without needing to know the details
> of the hardware instructions, but what does hurt them is the extent to
> which Ada already lets hardware vagaries show through.

Rubin replied:


> How can it be otherwise?

But that's _my_ point; letting the hardware show through is precisely
what _he_ is asking for, and _that_ is the problem!

This is not unlike the RISC/CISC debate. I want simple regular
languages (C is a monster of complexity) which I can master completely.
I don't care whether it's a ROMP or a WM or a /360 underneath. I
particularly don't want to _have_ to know.

Martin Lewitt

unread,
Oct 6, 1989, 5:38:24 AM10/6/89
to
In article <1989Oct5.0...@esegue.segue.boston.ma.us> jo...@esegue.segue.boston.ma.us (John R. Levine) writes:
*** much deleted ***

>To return to the original argument, John Ellis' book doesn't concern itself
>with specific instruction sets because at the time he wrote it, VLIWs
>existed only on paper. The concrete design of a VLIW happened only after
>most of the rest of the Yale VLIW project left to build real hardware at
>Multiflow. I was at Yale at the time and the longest instruction word we
>had was 36 bits in our DEC-20. Or maybe 48 bits in a PDP-11.

Ouch!! Why don't the Floating Point System's AP120B and FPS-164 processors
qualify as VLIW? They were commercial products introduced in 1975 and 1981
respectively:
1) Both have 64 bit instructions, with fields controlling
10 operations in parallel, i.e., micro-coded.

2) Both have FORTRAN compilers, the 164 from its
introduction in 1981 and the 120B in 1985 (though
one may have been available from a third party earlier)

3) By 1983, the FORTRAN compiler for the 164 was "software
pipelining", i.e., executing operations from different
iterations of a loop in parallel in an instruction.

4) Both were installed on the Yale campus by 1985
(correct me on this one). 8-)

I've tried to think of possible objections to their classification
as VLIW:
1) The 64 bit instruction word isn't long enough.
This quantitative argument falls to the qualitative,
10 operations controlled by a horizontal micro-instruction.

2) They were only "array processors" or "subroutine
boxes". This is true, especially for the AP120B,
but the 164 was able to run complete FORTRAN jobs
from 1982 on. It is more of a back end or
attached processor, like the CRAY.

3) No virtual memory or multi-tasking. What does this
have to do with VLIW?

4) They were too easy to micro-program, humans could do
it. True! And they were fun to program as well. The
software pipelining compiler was faster though, and
produced fine code. There were only a few hand coding
tricks it couldn't handle, e.g., moving the loop
control calculations to the floating point adder when
the integer/address unit was too busy.

The Multiflow machine was no mystery to FPS sales analysts. It was the
FPS dream machine: UNIX, virtual memory, better compilers, etc. We had
asked FPS to give us these features right from the beginning of the 164
back in 1981. By the time, Multi-flow delivered, it was too, late,
Convex, Alliant and FPS's own ECL machine, 264 were already on the scene,
and the high end workstations arrived in short order.

It will be interesting to see if FPS gets the credit it deserves when the
history is written, I doubt it. Will the historians do any better than
the contemporaries? In 1985, an article about micro-programming appeared
in Scientific American, written by a Stanford professor. He anticipated
the day when there would be commercial machines compiling directly to
micro-code. FPS had sold nearly 150 machines already.

Let's get the history right.
1) The first commercially available VLIW machine was the FPS-164,
not the Multiflow (by 6 years).
2) The "first affordable supercomputer" was the FPS-164, not the
Convex (by 4 years). FPS was using the "affordable
supercomputer" and "departmental supercomputing" phrases
long before the Convex advertisements and literature took
them up.
3) The first commercially available machine to compile complete
HLL applications to micro-code was once again, the FPS-164
(well, actually the VAX since it was cross-compilation).
4) The first commercially available machine to successfully exploit
parallel processors automatically using "dusty deck", serial
FORTRAN, the Alliant FX8, (by 4 years and counting).
5) The first commercially available RISC machine was the FPS-164.
(I'd love to see this one discussed, are VLIWs RISCy?) 8-)

I don't know if I've got the history right, I've only been in the industry
since '81, so feel free to propose "minor" adjustments. I'll gracelessly hide
out when the heavyweights start swinging.

Martin Lewitt

unread,
Oct 6, 1989, 6:02:10 AM10/6/89
to
Apologies to John Levine and the net if the authorship of my last
posting on this subject was unclear. I'm posting without a home
directory for the first time so my signature wasn't automatically
appended. It is manually appended here. Please double check before
flaming John, you possibly really intend to flame me.
---
Phone: (206) 931-8364 Martin E. Lewitt My opinions are
Domain: lew...@alliant.COM 2945 Scenic Dr. SE my own, not my
UUCP: {linus|mit-eddie}!alliant!lewitt Auburn, WA 98002 employer's.

Craig Jackson drilex1

unread,
Oct 6, 1989, 9:28:49 AM10/6/89
to
In article <23...@munnari.oz.au> o...@cs.mu.oz.au (Richard O'Keefe) writes:
>In article <16...@l.cc.purdue.edu>, c...@l.cc.purdue.edu (Herman Rubin) writes:
>> If the language the compiler compiles does not have the operations to be used
>> in my program, I cannot use them. I have seen NO language designed for
>> the efficient use of hardware operations. The choice of the algorithm
>> depends on the hardware, and there can be lots of different algorithms.
>
>Herman Rubin clearly enunciates a policy which I believe is responsible
>for much of the unreliability of present-day software:
> the machine is what it is,
> the programmer's job is to exploit the machine,
> the language's job is to expose the machine to the programmer.
>The view that I take is
> the programmer's job is to express his intentions clearly,
> the machine's job is to obey the programmer,
> the language's job is to provide a *simple* conceptual model.
>Rubin is willing to be the servant of his machines. I'm not.
>I've exchanged E-mail with a floating-point expert who had some
>illuminating things to say about the work being done for Ada; it turns
>out that they can do amazing things without needing to know the details
>of the hardware instructions, but what does hurt them is the extent to
>which Ada already lets hardware vagaries show through.

I think you have probably guess Herman Rubin's view pretty well (I don't
know him). It is a view which was very common up until around 1980,
when computers finally began to be fast enough for many uses. (Before then,
there were few users who really thought the computer that they worked on
was fast enough.) I choose 1980 because that is when the VAX 11/780 and
the 4341 became common; before those machines, few organizations had
more than one 32-bit machine with more than one MIPS or so of power.

Your view is conventional wisdom in computer science today, partially
driven by the common surfeit of computational power available today.
When there is enough computing power so that few tasks run into
compute-hours, and compilers are good enough to translate rather
abstract programming languages to good-enough machine code, this is a
reasonable idea.

I think that the difference is that Mr. Rubin still doesn't see himself
as having a surfeit of computing power available to him. I am not familiar
with his area of work, but I know that there remain many problems which
still cannot be solved in reasonable time by today's fastest computers.
When you are at the bleeding edge of technology, you *do* want to know
about the relative speeds of various instructions, and other capabilities
of the processor. It might make a difference if multiply were 3 or 10
times faster than divide--a differences of minutes, if not hours. If
your compute times were on that order, you would care about how you used
the underlying hardware.

We should be thankful for the people at the bleeding edge of computability
--it is they who keep the MIPS (VUPS, whatever) increasing every year...
--
Craig Jackson
dri...@drilex.dri.mgh.com
{bbn,ll-xn,axiom,redsox,atexnet,ka3ovk}!drilex!{dricej,dricejb}

Herman Rubin

unread,
Oct 6, 1989, 11:39:36 AM10/6/89
to
In article <23...@munnari.oz.au>, o...@cs.mu.oz.au (Richard O'Keefe) writes:
> In article <16...@l.cc.purdue.edu>, c...@l.cc.purdue.edu (Herman Rubin)
> complained that he had "seen no language designed for the efficient use of
> hardware operations".
> In article <23...@munnari.oz.au>, o...@cs.mu.oz.au (I) listed several
> languages (PL-360, Bliss-10, SunOS C) which provide full access to
> all hardware instructions.
> In article <16...@l.cc.purdue.edu>, c...@l.cc.purdue.edu (Herman Rubin)
> replied

< > I do not believe that the .il approach is quite adequate.

> What, precisely, can it not do? Admittedly, this approach relies on
> the back-end optimiser being able to handle any assembly code you throw
> at it, but then, so must any scheme giving you access to all the operations.

I admit I am not familiar with .il. How would you do the following:

fabs(x) implemented as x &~ CLEARSIGN;

where this clears the sign bit, and may depend on the precision of x. No
moving of x to specific registers, etc., should be used.


< > As to the others
< > you mention, I am unfamiliar with them. And even that is inadequate;

> If you don't know them, how can you know they are inadequate?

< > I want
< > to be able even to introduce overloaded infix operators in addition to the
< > usual ones. Another feature missing from most languages since day 1 is to
< > allow the return of a list (NOT a struct).

And suppose I want to use << and >> on double longs? How do I get this in
without using any unnecessary memory references, and keeping things which
are already in registers there?

I want to be able to define a heavily overloaded power operator with a
reasonable notation such as a!b or something similar. I see no more reason
why I should have to use prenex form for this than I should have to write
sum(a,b) instead of a+b. To paraphrase, why should I be a slave to the
compiler's notation rather than using familiar mathematical notation?

> Having used Fortran, Algol, SNOBOL, Lisp, Prolog, APL, COBOL, PL/I,
> and lots of other languages, I gave up thinking that the syntax of
> programming languages had any great importance a long time ago.
> I should have mentioned Ada as one of the languages which provides
> full machine access; it has both assembly code inserts and overloaded
> operators. I don't know precisely what Rubin means by returning a list;
> Algol 68 permits the return of anything (except pointers into the local
> frame). The reason that most languages don't return lists is precisely
> that they are designed to let the hardware show through, and you can't
> fit much in a register.

When computers first came out, most of them would multiply an integer by
an integer, giving a most and least significant part. Most of them would
divide a double length integer by an integer, giving a quotient and a
remainder.

Getting O(n) may or may not be appropriate. Algorithms for generating
non-uniform random variables efficiently tend to be short, but with
a fairly intricate structure. They will be used millions of times.
The most efficient, from the standpoint of computational complexity,
are difficult even knowing the hardware, and may be not even worth
programming. Two of my operations are transferring on a bit and
"removing" the bit whether or not the transfer occurs, and finding the
distance to the next one in a bit stream and "removing" all bits scanned.
These are tools, as basic as addition.

As for not worrying about those stupid little details about multiple
precision arithmetic, it is the difference between factoring a large
number in a month and several years. I can give algorithms which anyone
who understands even what is taught in computing will understand, but
which the HLLs cannot express.

In a university, it is almost impossible to get programming assistance
in mathematics and statistics. Also, these supposedly highly optimizing
compilers do not seem to be available. I doubt that I would have any
more difficulty in writing a decent machine language than any HLL,
because a decent machine language would be weakly typed and overloaded
operator.

> I wrote
< > I've exchanged E-mail with a floating-point expert who had some
< > illuminating things to say about the work being done for Ada; it turns
< > out that they can do amazing things without needing to know the details
< > of the hardware instructions, but what does hurt them is the extent to
< > which Ada already lets hardware vagaries show through.
>
> Rubin replied:
< > How can it be otherwise?
>
> But that's _my_ point; letting the hardware show through is precisely
> what _he_ is asking for, and _that_ is the problem!
>
> This is not unlike the RISC/CISC debate. I want simple regular
> languages (C is a monster of complexity) which I can master completely.
> I don't care whether it's a ROMP or a WM or a /360 underneath. I
> particularly don't want to _have_ to know.

When I design an algorithm, I _have_ to know. O'Keefe wants to master
his language completely, but he is a slave to the language, whether or
not he knows it. I will use different algorithms on different computers,
and have no problem keeping them straight. I find it no problem whatever
in seeing how the hardware interacts with the algorithm and exploiting it.

Earl Culham

unread,
Oct 6, 1989, 12:02:19 PM10/6/89
to
In article <16...@l.cc.purdue.edu>, c...@l.cc.purdue.edu (Herman Rubin) writes:

.... all sort of things about languages accessing the hardware
through computer languages

This is an open letter to Herman Rubin.
========================================================================
Rubin, I realise that what you are trying to say in your postings is
"Don't throw away the power of the computer by limiting yourself to
high level languages."
But, you are going overboard when you say that NO language is designed
for the effecient use of hardware operations. I can name two right off,
Assembler, and PL360.

We have a dichotomy between wanting a program to run as fast as possible
on our current machine, and wanting it to keep running as we migrate
to new machines.

Often, your comments are along the line
"We need access to the nitty-gritty in order make this algorithm fly."
Nice ideas, but they don't belong in comp.arch. Please take them to
a language discussion.

On the other hand, comments along the lines of
"Having feature 'x' in hardware would sure make algorithm 'y' run fast"
are welcome grist.

Scott Schwartz

unread,
Oct 6, 1989, 12:48:01 PM10/6/89
to
In article <23...@munnari.oz.au> o...@cs.mu.oz.au (Richard O'Keefe) writes:

Herman Rubin writes:
> I want
> to be able even to introduce overloaded infix operators in addition to the
> usual ones. Another feature missing from most languages since day 1 is to
> allow the return of a list (NOT a struct).
The reason that most languages don't return lists is precisely
that they are designed to let the hardware show through, and you can't
fit much in a register.

Isn't the idea to return out-parameters on the left side of an
assignment? I.e.
(out1, out2, out3) := foo (in1, in2, in3);
corresponds to
procedure foo (in in1, in2, in3; out out1, out2, out3) ...
foo (in1, in2, in3, out1, out2, out3);

This doesn't seem that bad to me. A byte full of syntactic
sugar helps the eyestrain go down. :-)

--
Scott Schwartz <schw...@shire.cs.psu.edu>
Now back to our regularly scheduled programming....

Richard O'Keefe

unread,
Oct 7, 1989, 2:26:19 AM10/7/89
to
In article <16...@l.cc.purdue.edu>, c...@l.cc.purdue.edu (Herman Rubin) writes:
> I admit I am not familiar with .il. How would you do the following:
> fabs(x) implemented as x &~ CLEARSIGN;
> where this clears the sign bit, and may depend on the precision of x. No
> moving of x to specific registers, etc., should be used.

To start with, we have to realise that this CANNOT be done in a machine-
independent way. For example, some machines (/370, MC68020+MC68881, PR1ME,
80386+80387) have two sets of registers: floating point and integer.
Most such machines do not provide bitwise operations on floating-point
registers. The /370 does not provide moves between integer registers
and floating-point registers, so if you want to do this you have to
store the floating-point value in memory, load part of it into an integer
register, store it back in memory, and reload it into the floating-point
registers. Rather silly, really, when you might have used the
load-absolute-float instruction.

Here's how I would obtain the requisite effect on a 68020 using .il:
.inline _fabs, 8 # function name, #argument bytes
movl sp@+,d0 # argument is assumed to be on the
movl sp@+,d1 # stack, result must be left in (d0,d1)
andl #0x7FFFFFFF,d0 # clear the sign bit
.end
Ah, but Rubin demanded that "no moving of x to specific registers, etc.,
should be used". Haven't I violated that? Not at all. .il expansion
is followed by an assembly-code optimisation pass, which is quite able
to eliminate the use of the stack, and can also optimise some register
operations.

I think that it is worth stressing that bit-twiddling like this can make
the code WORSE. I have already cited the /370. But the 68881 has a set
of fabs<size> instructions and the Sun FPA has fpabs<s,d> instructions.
If you do something like fabs(x*y), you have the result of x*y in a
floating-point register. If you let the compiler generate an
f[p]absd instruction, the operation takes place in a single instruction
without any data movement between CPU and coprocessor. If you insist on
using a bit mask, you force data movement between the 68881 co-processor
(or the FPA) and the 68020 CPU, and that means moving the intermediate
value to memory!

Leaving aside the fact that bit-twiddling may be slower than letting the
compiler do its thing (even if that is a conditional branch around a
load-negative-float), it is worth noting that the location of the sign
bit differs from machine to machine. For example, if you load a VAX-11
floating-point number using integer instructions, you will not find the
sign bit where you would find the sign bit of an integer. Assumptions
about the location of the sign bit are not even portable among machines
which obey the IEEE 754 standard; that says where the bits are in
_registers_, but says nothing about the format of a number stored in
memory.

This kind of consideration is what makes the .il approach (or similar
approaches) so attractive: the machine-specific code is not there in
the main source file. I can hand-tune certain functions for a particular
machine if I wish, without requiring compilers for _other_ machines to
support the particular instructions I use.

> And suppose I want to use << and >> on double longs? How do I get this in
> without using any unnecessary memory references, and keeping things which
> are already in registers there?

On many popular machines it simply can't be done, for the simple reason that
there are no instructions to shift floating-point registers, and to get a
number where you can shift it, you have to use memory references (/370,
PR1ME 50 series, 68020+68881, others). And once you have done that, it's
surprising the number of machines that haven't got double-length integer
shifts. I must admit that it isn't perfectly clear to me what shifting
a double would be good for: shifting a float left one bit would move the
top bit of the biassed exponent into the sign bit, and the second bit of
the significand (after the hidden bit) into the exponent. (On a /370 the
effect would be rather different.)

On the assumption that it did make sense, here's how to shift a double
left one bit with the aid of .il:
.inline _lsldbl, 8
movl sp@+,d0
movl sp@+,d1
lsll #1,d1
roxl, #1,d0
.end
Again, the stack movement may be optimised away by later compiler passes.
Other shifts are left as an exercise for the reader.

Suppose that you have the full set of floating-point operations
required, recommended, or suggested in the IEEE 754 and 854 standards.
What bit-twiddling operations are then required, and what are they good for?

> I want to be able to define a heavily overloaded power operator with a
> reasonable notation such as a!b or something similar.

Well then, *DO* it!

> I see no more reason
> why I should have to use prenex form for this than I should have to write
> sum(a,b) instead of a+b. To paraphrase, why should I be a slave to the
> compiler's notation rather than using familiar mathematical notation?

Why, because you *choose* to be! Nut out a syntax you like, knock together
a parser using Lex and Yacc (public-domain versions of which are readily
available) or talk a CS undergraduate into doing it for you, and write in
the language of your dreams, letting the preprocessor take care of
generating sum(_,_) or whatever. The University of Wisconsin-Madison had
a rather interesting Fortran preprocessor which they used to support
interval and triplex arithmetic in a variety of ways. The Linpack algorithms
were developed using a preprocessor called TAMPR. I believe the ToolPack
project has a preprocessor which isn't hard to obtain; I don't know what it
is capable of.

Another possibility: get hold of the source code for gcc or g++ (all it
takes is FTP access or a small cheque to the Free Software Foundation),
and add a few operators to it (or get a CS graduate to do it for you).
Of course, that compiler may not have been ported yet to the machine that
Rubin is using (what machine _is_ that, by the way?), but if it has been,
I think that modifying gcc or g++ would be far and away the most profitable
thing for Rubin to do.

I wrote


> > I don't know precisely what Rubin means by returning a list;

> When computers first came out, most of them would multiply an integer by


> an integer, giving a most and least significant part. Most of them would
> divide a double length integer by an integer, giving a quotient and a
> remainder.

Ah yes. Now I remember. A year or so I posted a file to the net called
muldivrem.il, which yielded in-line code for
mul_div_rem(A, B, C, D, *Q, *R) is
T := A*B+C;
Q, R := T div D, T mod D
That was a follow-up to a posting from Rubin. I believe that I pointed
out in that debate that Common Lisp has multiple-value-return and other
related operations, e.g.
(multiple-value-bind ((Q R) (truncate (+ (* A B) C) D)
;; Q is (A*B+C) div D, R is (A*B+C) mod D
;; no intermediate data-structure was involved
)
There is a program called CGOL which provides you with an extensible
operator-based syntax for Lisp.


I wrote


> > I normally use a language which runs 2 to 4 times
> > slower than C; I do this gladly because the reduction in _language_
> > complexity means that I can manage more complex _algorithms_, and win
> > back O(n) speed that way.

> Getting O(n) may or may not be appropriate.

What I meant was an O(N) **SPEED-UP**. As a very crude example of what I
mean, I once saw a very complicated Fortran subroutine published in a
Journal. After fighting through the details for about half an hour, I
discovered that it was an O(N**3) algorithm for evaluating a polynomial
(and no, it wasn't optimising or taking special precautions to avoid
overflow, underflow, or cancellation). Rewriting it the obvious way
without any attempt at bit-level optimisation gave an O(N) algorithm.
In the same way, I am glad to pay an O(1) cost in speed _when_ that
means that I can manage more complex data structures and handle the
development of an O(N) or O(NlgN) algorithm instead of an O(N**2) one.
It is not an uncommon thing for me to see highly tweaked C code for an
O(N**2) algorithm when a simpler but more expressive language makes it
easy to write an O(N) algorithm for the same problem.


> Two of my operations are transferring on a bit and
> "removing" the bit whether or not the transfer occurs, and finding the
> distance to the next one in a bit stream and "removing" all bits scanned.
> These are tools, as basic as addition.

Hypothetical C code (assume bitno is a constant):

if (x & (1 << bitno)) { x &=~ (1 << bitno); goto Foo; }

Simple-minded 680x0 code:

btst #bitno,x
jnes L1
bclr #bitno,x
jra Foo
L1:

Optimised 680x0 code:
bclr #bitno,x
jeq Foo

This is at most a factor of 2, and the optimisation is well within the
reach of a peep-hole optimiser. If gcc doesn't do it, and it probably
doesn't, it would be about as hard to add the optimisation to the compiler
as it would be to write the hand-optimised code once. (I repeat that I
think it would be very useful to Rubin to get a copy of gcc.)

I hardly think that either of these operations is "as basic as addition".
Bear in mind that while some machines have a "find first one" instruction,
many don't. Also, I do read Applied Statistics and similar journals from
time to time, and it's amazing what people have been able to contribute
in the way of random number generators and such without needing more than
Fortran gives them. (Of course, they were trying to write code that
other people could use even if they didn't have the same kind of computer.)

> In a university, it is almost impossible to get programming assistance
> in mathematics and statistics.

I know.

> Also, these supposedly highly optimizing compilers do not seem to be available.

Try gcc.

> I doubt that I would have any more difficulty in writing a decent
> machine language than any HLL, because a decent machine language would
> be weakly typed and overloaded operator.

There are, then, remarkably few "decent machine languages". Seriously,
the examples that Rubin has given are extremely machine-specific. If
code is not portable, there is no disgrace in writing it in assembler.

> > This is not unlike the RISC/CISC debate. I want simple regular
> > languages (C is a monster of complexity) which I can master completely.
> > I don't care whether it's a ROMP or a WM or a /360 underneath. I
> > particularly don't want to _have_ to know.

> When I design an algorithm, I _have_ to know.

> I will use different algorithms on different computers,
> and have no problem keeping them straight. I find it no problem whatever
> in seeing how the hardware interacts with the algorithm and exploiting it.

But then you can only use the machines you have special hacks for.
If you are content to write different algorithms for different machines,
why should you demand to use the same language to write them?

> O'Keefe wants to master
> his language completely, but he is a slave to the language, whether or
> not he knows it.

As I have written programs in several dozen languages and have written
assembly code on nearly a dozen machines, I don't think I'm a slave to
any one language.

> Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907

^^^^^^^^^^
My statistics programming was done mainly in GLIM.

John R. Levine

unread,
Oct 7, 1989, 12:54:38 PM10/7/89
to
In article <34...@alliant.Alliant.COM> lew...@Alliant.COM (Martin Lewitt) writes:
>Ouch!! Why don't the Floating Point System's AP120B and FPS-164 processors
>qualify as VLIW? ...
>4) Both were installed on the Yale campus by 1985 ...

They were indeed there, excuse me, my mind curdled. (The fact that I never
used either of them while I was there may have something to do with it.)
Ellis discusses the FPS-164 in his book. He calls it an LIW, kind of a
predecessor to a VLIW. It's a horizontally microcoded machine rather than a
RISC with a lot of parallel functional units. The trace scheduling compiler
used for VLIWs is quite different from the one used for the FPS. Someone did
try trace scheduling for the FPS, but gave up because the assymetry of the
architecture made it very hard. There are several pages in Ellis' book that
talk about this.
--

Ben Cutler

unread,
Oct 8, 1989, 12:06:56 AM10/8/89
to
In article <34...@alliant.Alliant.COM> lew...@Alliant.COM (Martin Lewitt) writes:
>Ouch!! Why don't the Floating Point System's AP120B and FPS-164 processors
>qualify as VLIW? They were commercial products introduced in 1975 and 1981
>respectively:
>
> [ lots of deleted text ]

>
>Let's get the history right.
> 1) The first commercially available VLIW machine was the FPS-164,
> not the Multiflow (by 6 years).

VLIWs and Trace Scheduling (TM) Compiler technology go hand-in-hand. Ellis
describes why the FPS-164 doesn't qualify as a good compilation target in
his thesis:

``Many micro-programmed architectures have so-called `hot spots', latches,
that will hold a value for only one cycle or until another value displaces it.
For example, the outputs of the multiplier and adder pipelines in the FPS-164
[ APAL64 Programmer's Guide, FPS 1982 ] are latched and can be fed back to the
register banks or directly to the inputs of the pipelines. There isn't
enough bandwidth into the register banks, so to keep the pipelines full the
programmer is often forced to use the latched pipeline outputs directly as
pipeline inputs. But the programmer must be very careful to synchronize the
pipelines' inputs and outputs, since the next value coming out of the
pipeline will overwrite the previous latched value.''

Ellis goes on to describe how hot spots are unpleasant artifacts for a
compiler to deal with, how it isn't always possible to move a value in a
latch out of the way and into a register bank, or if you can do so by making
room in the register bank, that in turn might require displacing a value in
another latch, and so on. If you want to do a good job on complex loops,
a backtracking compiler (blech) may be in order.

According to Ellis, ''Every value producing operation and every data transfer
reads its operands from a register bank and delivers its result to a register
bank.'' Ellis notes that, ``Architectures with hot spots are easy to build,
but building compilers for them is hard. (It took years to build the FPS-164
compiler.) Better to build the compiler and hardware in parallel, avoiding
hardware features that can't be used easily by the compiler.''

If you want to know more, then read the thesis, which is extremely
well-written and informative.

( My apologies for any typos in the above quotations. )

Ben Cutler

unread,
Oct 8, 1989, 12:16:44 AM10/8/89
to
Oops, in case it wasn't clear, the sentence reproduced below describes a
fundamental characteristic of VLIWs (and NOT FPS machines), that operations
don't leave values sitting around in hot spots waiting to be trashed or
pushed along someplace else:

Martin E. Lewitt

unread,
Oct 8, 1989, 5:15:30 AM10/8/89
to
In article <1989Oct7.1...@esegue.segue.boston.ma.us> jo...@esegue.segue.boston.ma.us (John R. Levine) writes:
>Ellis discusses the FPS-164 in his book. He calls it an LIW, kind of a
>predecessor to a VLIW. It's a horizontally microcoded machine rather than a
>RISC with a lot of parallel functional units. The trace scheduling compiler
>used for VLIWs is quite different from the one used for the FPS. Someone did
>try trace scheduling for the FPS, but gave up because the assymetry of the
>architecture made it very hard. There are several pages in Ellis' book that
>talk about this.

Thanx for the info.

It's my understanding that some of the 64 bit products that FPS introduced
after I left in 1985, had a cleaned up instruction set, e.g., a write to
table memory RAM no longer required the special I/O format instruction and
more registers were directly addressable.

Does anyone out there know:
1) Was anything else "cleaned up" in the instruction set?
2) Did this clean-up address the asymmetries mentioned in the
Ellis book?
3) Did the Fortran compiler ever change to use Table Memory RAM?
4) Did the Fortran compiler ever change to use a trace scheduling
algorithm?
just curious, thanx in advance.
--

Keith Bierman - SPD Advanced Languages

unread,
Oct 8, 1989, 4:35:04 PM10/8/89
to
In article <34...@alliant.Alliant.COM> lew...@Alliant.COM (Martin Lewitt) writes:
... stuff about FPS being first VLIW...

It is my understanding that FPS purchased the basic design from Glen
Culler. Culler continued his design activities and spun up a company
for rev 7 of the basic design (The infamous Culler-7) down in Santa
Barbara.

Glen is still at it, and perhaps the 8th, 9th or 10th edition will
gain a popular following.

Keith H. Bierman |*My thoughts are my own. !! kbie...@sun.com
It's Not My Fault | MTS --Only my work belongs to Sun*
I Voted for Bill & | Advanced Languages/Floating Point Group
Opus | "When the going gets Weird .. the Weird turn PRO"

Martin E. Lewitt

unread,
Oct 9, 1989, 8:16:47 AM10/9/89
to
In article <10...@m3.mfci.UUCP> cut...@mfci.UUCP (Ben Cutler) writes:
*** some deleted ***

>VLIWs and Trace Scheduling (TM) Compiler technology go hand-in-hand. Ellis
>describes why the FPS-164 doesn't qualify as a good compilation target in
>his thesis:

I guess what I find strange is that the definition of a type of computers
and a (trademarked) compiler technology are so closely tied. The definition
doesn't seem as general as the RISC and CISC definitions (which I won't
attempt to give here). It is strange that the definition doesn't encompass
the similarities with the FPS architecture.

Ex-FPS persons I know, immediately saw the similarities with the
Multiflow and correctly anticipated (in a general way) how it would
perform. We would have welcomed the improvements in register unit
bandwidth that Ellis requires, though we would not have given up
the ability to latch the result of one pipeline directly to the
input of another, if there were a performance penalty. A sophisticated
compiler technology should be able to handle this (but why bother
if you don't have to?). From a marketing viewpoint the virtual
memory and multiuser OS were more important features. The Multiflow
architecture just seemed a natural evolution of the FPS architecture.

From your posting, it appears Ellis correctly understood the limitations
of the FPS architecture and his analysis of it seems to acknowledge
some related place for it in history.

*** much deleted ***

>If you want to know more, then read the thesis, which is extremely
>well-written and informative.

Yes, I hope to have the opportunity to read this.

Robert Colwell

unread,
Oct 9, 1989, 9:33:10 AM10/9/89
to
In article <34...@alliant.Alliant.COM> lew...@Alliant.COM (Martin Lewitt) writes:
>In article <1989Oct5.0...@esegue.segue.boston.ma.us> jo...@esegue.segue.boston.ma.us (John R. Levine) writes:
>*** much deleted ***
>>To return to the original argument, John Ellis' book doesn't concern itself
>>with specific instruction sets because at the time he wrote it, VLIWs
>>existed only on paper. The concrete design of a VLIW happened only after
>>most of the rest of the Yale VLIW project left to build real hardware at
>>Multiflow. I was at Yale at the time and the longest instruction word we
>>had was 36 bits in our DEC-20. Or maybe 48 bits in a PDP-11.
>
>Ouch!! Why don't the Floating Point System's AP120B and FPS-164 processors
>qualify as VLIW? They were commercial products introduced in 1975 and 1981
>respectively:

FPS failed to realize a couple of key concepts, and this is why they don't
get to retrofit the "VLIW" label to their boxes (after all, Josh Fisher
invented this technology and came up with that label, so he gets to provide
the definition for VLIW).

The first concept is that you can't just invent whatever architecture and
instruction set comes to mind, or is easiest to build, or is cheapest. You
must start with what the compiler wants to see as a target. A trace-scheduling
compiler is hard enough to do that you certainly do not want to add to its
burdens without a compelling reason.

If you have only a wide-word (and I won't quibble about the actual width)
machine with parallel functional units then you're not even halfway there.

>The Multiflow machine was no mystery to FPS sales analysts. It was the
>FPS dream machine: UNIX, virtual memory, better compilers, etc. We had
>asked FPS to give us these features right from the beginning of the 164
>back in 1981. By the time, Multi-flow delivered, it was too, late,
>Convex, Alliant and FPS's own ECL machine, 264 were already on the scene,
>and the high end workstations arrived in short order.

Too late for whom or for what? What a weird paragraph.

>It will be interesting to see if FPS gets the credit it deserves when the
>history is written, I doubt it. Will the historians do any better than
>the contemporaries? In 1985, an article about micro-programming appeared
>in Scientific American, written by a Stanford professor. He anticipated
>the day when there would be commercial machines compiling directly to
>micro-code. FPS had sold nearly 150 machines already.

There *is* a relationship between the FPS boxes and the VLIW work done at
Multiflow and Yale. As Ellis reports, the FPS boxes showed what would be
wrong with a machine that was not designed at the outset to be a good target
for compiled code. There are hot spot registers, the functional units are
not decoupled at the instruction set level, the datapaths are not regular,
and there aren't enough registers to support the flops. When you design a
machine to be completely scheduled at compile-time, it shows in the design.
Retrofitting a compiler to an existing architecture will never work. The
influence that the FPS work had on the Multiflow machines was a negative
one -- we knew that was one approach which would not get us what we needed.

>Let's get the history right.
> 1) The first commercially available VLIW machine was the FPS-164,
> not the Multiflow (by 6 years).

The FPS machines do indeed enjoy a unique place in the computer hall of
fame, but the label under the exhibit won't say VLIW, nor should it.
But the FPS machines were not statically scheduled and
controlled by the compiler. They therefore don't qualify as VLIWs under
Josh's definition of the term. Feel free to make up some other name if
you want; "attached processor" has always seemed pretty appropriate to me.

The FPS designs also try to compress the instruction stream in a sub-optimal way,
pre-combining certain operations (your "parallel micro-operations") in
much the same way (and for the same reasons) that CISCs did. A VLIW,
on the other hand, completely decouples the control of all functional
units, and avoids the baggage of carrying around the whole instruction
word width during sequences of sequential code by using an encoding scheme
described in our IEEE Transactions paper. This makes the compiler much
happier. Ours smiles a lot.

> 2) The "first affordable supercomputer" was the FPS-164, not the
> Convex (by 4 years). FPS was using the "affordable
> supercomputer" and "departmental supercomputing" phrases
> long before the Convex advertisements and literature took
> them up.

Agreed. Although calling an attached processor a "computer" smacks of
marketing hype, to me. And if your references to having a workable compiler
were legit, then why did Convex's first machine beat the 164, which had more
mflops under the hood?

> 3) The first commercially available machine to compile complete
> HLL applications to micro-code was once again, the FPS-164
> (well, actually the VAX since it was cross-compilation).

Yes, although with the cart before the horse this effort was largely
doomed from the start.

> 4) The first commercially available machine to successfully exploit
> parallel processors automatically using "dusty deck", serial
> FORTRAN, the Alliant FX8, (by 4 years and counting).

"And counting"? What does that mean?

> 5) The first commercially available RISC machine was the FPS-164.
> (I'd love to see this one discussed, are VLIWs RISCy?) 8-)

What about the FPS-164 qualified it as a RISC? You must have some interesting
definition of RISC. From Hwang and Briggs: "The AP-120B and the FPS-164
are back-end attached arithmetic processors specially designed to process
large vectors or arrays (matrices) of data. Operationally, these processors
must work with a host computer, which can be either a minicomputer (such as
the VAX-11 series) or a mainframe computer (IBM 308X series." Whatever else
the RISC acronym implies, the "C" stands for computer, and to me, attached
processors aren't computers.

I think discussing VLIWs as RISCs is interesting, partly because it so aptly
illustrates why the early insistence on counting the number of instructions
before declaring something a RISC was so misguided. The important thing is
to be able to implement the most-used ops such that they're as fast as can be,
which usually means they get hard-wired control. If you can manage to implement
your design such that you get fast, hard-wired simple ops, but lots of them
in parallel (controlled by the wide word) then I think you've got the essence
of the RISC approach without ending up hamstrung by religious zealotry.

>I don't know if I've got the history right, I've only been in the industry
>since '81, so feel free to propose "minor" adjustments. I'll gracelessly hide
>out when the heavyweights start swinging.

Aw, you're just a baby. You should have stated this earlier and I'd have
used smaller words. :-)

Bob Colwell ..!uunet!mfci!colwell
Multiflow Computer or col...@multiflow.com
31 Business Park Dr.
Branford, CT 06405 203-488-6090

Hank Dietz

unread,
Oct 9, 1989, 7:03:32 PM10/9/89
to
In article <10...@m3.mfci.UUCP> col...@mfci.UUCP (Robert Colwell) writes:
>In article <34...@alliant.Alliant.COM> lew...@Alliant.COM (Martin Lewitt) writes:
>>Let's get the history right.
>> 1) The first commercially available VLIW machine was the FPS-164,
>> not the Multiflow (by 6 years).
>
>The FPS machines do indeed enjoy a unique place in the computer hall of
>fame, but the label under the exhibit won't say VLIW, nor should it.

I quite agree that the FPS-164 isn't trace-scheduled etc. as a VLIW, but I'm
not sure that means the hardware isn't VLIW.... However, it isn't the first
in either case. Way back, Burroughs had some nifty WCS stuff that probably
qualifies as the first "VLIW hardware," but Multiflow's Trace clearly
deserves the title "first VLIW computer system."

>> 2) The "first affordable supercomputer" was the FPS-164, not the
>> Convex (by 4 years). FPS was using the "affordable
>> supercomputer" and "departmental supercomputing" phrases
>> long before the Convex advertisements and literature took
>> them up.
>
>Agreed. Although calling an attached processor a "computer" smacks of
>marketing hype, to me.

Define "supercomputer." Intel's IPSC tried to claim the first title, and
there are plenty more if you define "affordable" the right way. ;-)

>> 3) The first commercially available machine to compile complete
>> HLL applications to micro-code was once again, the FPS-164

Were "complete HLL applications" really compiled to microcode? I thought
they ran partly on the host machine?

>> 4) The first commercially available machine to successfully exploit
>> parallel processors automatically using "dusty deck", serial

^^^^^^^^


>> FORTRAN, the Alliant FX8, (by 4 years and counting).
>
>"And counting"? What does that mean?

Are you defining "parallel" to be MIMD, but not SIMD/Vector? In either
case, doesn't Cray count? I don't know what compilers have been marketed,
but I know of more than a few "dusty deck" compilers for MIMD targets....

>> 5) The first commercially available RISC machine was the FPS-164.
>> (I'd love to see this one discussed, are VLIWs RISCy?) 8-)

>... The important thing is


>to be able to implement the most-used ops such that they're as fast as can be,
>which usually means they get hard-wired control.

Most early machines looked pretty RISCy... in exactly the sense stated
above. The main difference in recent RISC design is the use of compiled-code
statistics rather than applications algorithms in determining which ops are
frequently used. I see no reason to single-out the relatively-modern
FPS-164; of course, just as the "first VLIW" title goes to whoever coined
the name, "first RISC" really belongs to UCB.

VLIW does tend to imply RISC... and RISC implies VLIW if you really think
about it. After all, parallelism is the *OBVIOUS* way to speed-up things,
and the compiler emphasis of RISC + parallelism = VLIW. Not that the full
VLIW concept is trivial, rather that it is natural & elegant. Best of all,
it even seems to work pretty well. ;-)

-ha...@ecn.purdue.edu

Martin E. Lewitt

unread,
Oct 10, 1989, 6:08:14 AM10/10/89
to
In article <13...@pur-ee.UUCP> ha...@pur-ee.UUCP (Hank Dietz) writes:
= In article <10...@m3.mfci.UUCP> col...@mfci.UUCP (Robert Colwell) writes:
= >In article <34...@alliant.Alliant.COM> lew...@Alliant.COM (Martin Lewitt) writes:
= >> 3) The first commercially available machine to compile complete
= >> HLL applications to micro-code was once again, the FPS-164
=
= Were "complete HLL applications" really compiled to microcode? I thought
= they ran partly on the host machine?

The FPS-164 and its successors almost always ran complete applications, the
"subroutine box" mode was not used by most customers. This was made possible
by an advanced (for FPS) OS called the Single Job Executive (SJE).

= >> 4) The first commercially available machine to successfully exploit
= >> parallel processors automatically using "dusty deck", serial
= ^^^^^^^^
= >> FORTRAN, the Alliant FX8, (by 4 years and counting).
= >
= >"And counting"? What does that mean?
=
= Are you defining "parallel" to be MIMD, but not SIMD/Vector? In either
= case, doesn't Cray count? I don't know what compilers have been marketed,
= but I know of more than a few "dusty deck" compilers for MIMD targets....

I was thinking MIMD. Until recently didn't Cray MIMD require
micro-tasking directives inserted in the code on the based on user analysis
of the code?

= -ha...@ecn.purdue.edu

Martin E. Lewitt

unread,
Oct 10, 1989, 7:12:10 AM10/10/89
to
In article <10...@m3.mfci.UUCP> col...@mfci.UUCP (Robert Colwell) writes:
>In article <34...@alliant.Alliant.COM> lew...@Alliant.COM (Martin Lewitt) writes:
>>The Multiflow machine was no mystery to FPS sales analysts. It was the
>>FPS dream machine: UNIX, virtual memory, better compilers, etc. We had
>>asked FPS to give us these features right from the beginning of the 164
>>back in 1981. By the time, Multi-flow delivered, it was too, late,
>>Convex, Alliant and FPS's own ECL machine, 264 were already on the scene,
>>and the high end workstations arrived in short order.
>
>Too late for whom or for what? What a weird paragraph.

I agree, what a weird paragraph! By "too late", I meant that Multiflow
was too late to exploit the wide open market opportunities that the
FPS analysts saw when they were specifying the machine they wanted FPS
to build, i.e., a wide instruction machine supporting virtual memory
and UNIX. FPS's analysts saw these requirements of the market place
in plenty of time to preempt the later arrivals of Convex and Alliant.

I was with Alliant when the Multi-flow Trace 7 first started shipping.
While we didn't fear the Trace 7 in most application areas, the Trace 7
did turn out to be a tough competitor in certain areas, e.g.,
the integer (non-floating point) ECAD codes. However, after some initial
competitive success for the Trace 7, I noticed that Alliant was no longer
losing this market to the Multi-flow, but to the SUN-4 Sparcstation instead.
While this was bad news for Alliant, it had to be terrible news for Multi-flow.
A key area of competitive strength against the product that Alliant was
shipping at that time, was being undercut by cheap integer mips from SUN.
So this is another sense in which I meant "too late".

*** stuff deleted ***


>marketing hype, to me. And if your references to having a workable compiler
>were legit, then why did Convex's first machine beat the 164, which had more
>mflops under the hood?

Convex's first machine the C1-XL had more megaflops, 40 single and
20 double peak megaflops vs the FPS's 11 peak megaflops. Even so
the Convex rarely beat the FPS-164 on benchmarks unless the codes
were highly vectorizable. The XL had poor scalar performance and
the FPS-164 often beat it by a factor of two. Good scalar performance
is a strength of these wide instruction machines. The C1-XP had
much improved scalar performance but still faced tough competition
from the 164. Of course, performance is only one purchase criterion
and FPS could lose sales because of the lack of virtual memory,
lack of a multi-user operating system, reliability concerns, etc.

>> 4) The first commercially available machine to successfully exploit
>> parallel processors automatically using "dusty deck", serial
>> FORTRAN, the Alliant FX8, (by 4 years and counting).
>
>"And counting"? What does that mean?

No competitors seem to have replicated Alliant's success in automatically
producing code for a MIMD machine. So Alliant's lead is 4 years and growing.
By success, I mean that this is the standard way most customers use
the machine. Most Cray jobs still only use one CPU.

*** more stuff deleted ***
>Bob Colwell ..!uunet!mfci!colwell

Ben Cutler

unread,
Oct 10, 1989, 8:00:19 PM10/10/89
to
In article <34...@alliant.Alliant.COM> lew...@alliant.Alliant.COM (Martin E. Lewitt) writes:
>In article <10...@m3.mfci.UUCP> cut...@mfci.UUCP (Ben Cutler) writes:
>*** some deleted ***
>>VLIWs and Trace Scheduling (TM) Compiler technology go hand-in-hand. Ellis
>>describes why the FPS-164 doesn't qualify as a good compilation target in
>>his thesis:
>
>I guess what I find strange is that the definition of a type of computers
>and a (trademarked) compiler technology are so closely tied.

That's precisely the point. If you want to design the best high level
language possible, you don't have one team design the control structures and
a second team design the data structures and hope the two fit together well!

By analogy, if you want a high performance computer system, you want the
fastest overall system, not a superfast cpu for which you cannot generate a
compiler (and hence fast code). So you design all the pieces together, in
this case VLIW hardware and Trace Scheduling compilation. And by extension,
you want a real computer, not an awkward attached processor.

FPS never saw the big picture.

>From your posting, it appears Ellis correctly understood the limitations
>of the FPS architecture and his analysis of it seems to acknowledge
>some related place for it in history.

Historical footnote: The FPS machine was an attached numerical processor.
It bit the dust when general purpose computers with as much or more power and
greater ease of use appeared on the scene for many fewer dollars.

Eugene Brooks

unread,
Oct 11, 1989, 11:50:55 PM10/11/89
to
In article <34...@alliant.Alliant.COM> lew...@alliant.Alliant.COM (Martin E. Lewitt) writes:
>I was with Alliant when the Multi-flow Trace 7 first started shipping.
>While we didn't fear the Trace 7 in most application areas, the Trace 7
>did turn out to be a tough competitor in certain areas, e.g.,
>the integer (non-floating point) ECAD codes. However, after some initial
>competitive success for the Trace 7, I noticed that Alliant was no longer
>losing this market to the Multi-flow, but to the SUN-4 Sparcstation instead.
You vendors of custom processors has better take note of this statement. It
is really amusing to see the Multi-flow and Alliant fellows bragging at how
well they are doing when single chip microprocessors run circles around their
entire multiprocessor complexes. The SUN-4 Sparcstation which did all this
"damage" is an absolute DOG compared to the most recent Sparc, MIPS, or i860
chip sets. No doubt a real terror from Motorola is not far away. The lifetime
of custom processors, even processors like the Cray line, are real short.
Imagine the surprise of CRI executives when they realize that they are no
longer losing market share to the Japanese super computer companies, and are
now losing market share to microprocessors. We are watching this happen
internally at LLNL and do not doubt that it is happening elsewhere.

There is no defense against the ATTACK OF THE KILLER MICROS!


bro...@maddog.llnl.gov, bro...@maddog.uucp

John Mashey

unread,
Oct 13, 1989, 4:59:58 AM10/13/89
to
In article <35...@lll-winken.LLNL.GOV> bro...@maddog.llnl.gov (Eugene Brooks) writes:
....

>There is no defense against the ATTACK OF THE KILLER MICROS!

Well, even if I agree with the fundamental point, this might be a teeny bit
strong :-)

1) Supers and mini-supers and mainframes still do things that micros
don't yet, like huge memories, and very fast I/O, or very high
connectivity, or very high bandwidths in more places. Those things
cost money. One can hope that having fast cheap micros will induce people to
build more fast cheap peripheral chips to help some parts of the problems.
Some things will always cost money, no matter what.

2) It's hard to believe that there will not always be a niche for:
a) The fastest box, at almost any cost
b) Very fast boxes with good cost/performance

3) The issue raised really is:
How big are those niches? How many companies can thrive in them?

4) OPINION: 1-2 companies each in both a) and b).
This is a high-stakes game, and the ante to play keeps rising.

5) OBSERVATION: RISC micros are munching away at the lower levels of the
minisuper business; going up (i.e., CONVEX) was a good strategy.
Trying to fight it out with them where they can do the job, is like the
fight that huge horde of 1970s mini- makers had with CISC micros:
once the latter got to be reasonably competitive on performance,
the former just got run over, except for the very biggest players.

6) An interesting paper is: "SUPERCOMPUTING, Myth & Reality", by
George J. Luste, of the Physics Department at U. Toronto.
(I think this was in Supercomputing Symposium '89). This was a nice
intro to some of the issues of vector code versus scalar code,
and cost/performance issues of supercomputers versus RISC micros.
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR ma...@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

Shahin Kahn

unread,
Nov 4, 1989, 1:35:15 AM11/4/89
to
In article <10...@m3.mfci.UUCP> cut...@mfci.UUCP (Ben Cutler) writes:
>don't leave values sitting around in hot spots waiting to be trashed or
>pushed along someplace else:
>
> According to Ellis, ''Every value producing operation and every data
> transfer reads its operands from a register bank and delivers its result
> to a register bank.''


This could end-up being a constraint. Perhaps more speed can be obtained
if one doesn't have to take every operand from and every result to
a register. Some reduction operations produce intermediates that
dont need to go back to a register until the end. That's why you
can do a one-cycle dot-product on the FPS attached processors (they now make
unix machines too).
You could leave a partial sum come out of the adder and go right back to the
adder, completely bypassing registers. Pretty powerful if you ask me
though it probably made compiler writing a pain, (but FPS has a pretty good
compiler).
I am not arguing with how you define VLIW and whether FPS attached processors
are VLIW or not.

Rob Peglar x615

unread,
Nov 6, 1989, 10:21:28 AM11/6/89
to
In article <92...@batcomputer.tn.cornell.edu>, ka...@batcomputer.tn.cornell.edu (Shahin Kahn) writes:
> In article <10...@m3.mfci.UUCP> cut...@mfci.UUCP (Ben Cutler) writes:
> >don't leave values sitting around in hot spots waiting to be trashed or
> >pushed along someplace else:
> >
> > According to Ellis, ''Every value producing operation and every data
> > transfer reads its operands from a register bank and delivers its result
> > to a register bank.''
>
>
> This could end-up being a constraint. Perhaps more speed can be obtained
> if one doesn't have to take every operand from and every result to
> a register. Some reduction operations produce intermediates that
> dont need to go back to a register until the end. That's why you
> can do a one-cycle dot-product on the FPS attached processors (they now make
> unix machines too).

This is old news. Since the days of the CDC Star-100, the "catching" of
operands before they "hit" the register file has been done, in the now-
classic (i.e. they don't make 'em anymore) Star-100/Cyber 20x/ETA-10 line
of vector supers. The process was called "shortstop". Omitting vector
shortstop (vectors are M-M on these machines) and looking at the scalar
shortstop (R-R or R-M), Kahn's note that "more speed can be obtained" is
true. A typical scalar (64-bit) R-R add issued in 1 clock; you could
get the shortstop in 2 (more) clocks; you could get the register file
in 3(more) clocks. Turns out the ETA-10 had a fairly powerful vector
shortstop as well - unlike predecessors.

> You could leave a partial sum come out of the adder and go right back to the
> adder, completely bypassing registers. Pretty powerful if you ask me
> though it probably made compiler writing a pain, (but FPS has a pretty good
> compiler).

Not really. The pain was trying to make really smart compilers, ones whose
code instruction sequences were as-optimal-as-could-be given the shortstop
capability. It turned out that trying to catch the shortstop was one of
the last things to worry about in writing such compilers (i.e. the marginal
gain was small). Shortstop was one of those things you just "took for
granted" as part of the architecture, and didn't worry about trying to
hit each and every instruction issue.


Rob

Rob Peglar Control Systems, Inc. 2675 Patton Rd., St. Paul MN 55113
...uunet!csinc!rpeglar 612-631-7800

The posting above does not necessarily represent the policies of my employer.
--
Rob Peglar Control Systems, Inc. 2675 Patton Rd., St. Paul MN 55113
...uunet!csinc!rpeglar 612-631-7800

The posting above does not necessarily represent the policies of my employer.

Paul Rodman

unread,
Nov 7, 1989, 9:40:17 AM11/7/89
to
In article <1...@csinc.UUCP> rpe...@csinc.UUCP (Rob Peglar x615) writes:
>>
>> This could end-up being a constraint. Perhaps more speed can be obtained
>> if one doesn't have to take every operand from and every result to
>> a register. Some reduction operations produce intermediates that
>> dont need to go back to a register until the end. That's why you
>> can do a one-cycle dot-product on the FPS attached processors (they now make
>> unix machines too).
>
>This is old news. Since the days of the CDC Star-100, the "catching" of
>operands before they "hit" the register file has been done, in the now-
>classic (i.e. they don't make 'em anymore) Star-100/Cyber 20x/ETA-10 line

It's not clear me from your response how this feature was used, but
clearly an aversion to hot spot registers does not rule out the use of
standard register file bypassing techniques. These are easy to
implement, and are transparent to the compiler writer (i.e. just makes
the register file look faster). In fact, the last machine I worked on that
did NOT have bypassing was in 1979.


Paul K. Rodman / KA1ZA / rod...@multiflow.com
Multiflow Computer, Inc. Tel. 203 488 6090 x 236
Branford, Ct. 06405

Rob Peglar x615

unread,
Nov 8, 1989, 10:10:49 AM11/8/89
to
In article <11...@m3.mfci.UUCP>, rod...@mfci.UUCP (Paul Rodman) writes:
>
> It's not clear me from your response how this feature was used, but
> clearly an aversion to hot spot registers does not rule out the use of
> standard register file bypassing techniques. These are easy to
> implement, and are transparent to the compiler writer (i.e. just makes
> the register file look faster). In fact, the last machine I worked on that
> did NOT have bypassing was in 1979.

"Shortstop" is essentially a register file bypass technique. From the
hardware perspective, the result (from an r-r or r-m operation) would be
available (for the subsequent instruction) some number of cycles before
the (same) result was posted in the register file (or memory). This came
in handy even in the very large register file architectures of the machines
in question (ETA-10 and predecessors).

From the software perspective, shortstop allowed one to write tighter
code than normal. Without shortstop, instruction issue would block
until the operand (usually a register) was freed by a previous instruction.
Shortstop allowed bypass, essentially de-coupling the result from the
register it was bound for. One could save a few cycles per instruction
by carefully selecting registers and sequencing properly to hit the
shortstop. In tight loops, the savings could be as much as 33% at times.

The whole feature helped hot spot registers somewhat. Of course, most
people tended to write code that used (more of) the large register file
(256 regs, most of which were usable by user space code) on these machines.
This negated the architectural advantage of shorstop overcoming hot
spots. One usually had free registers floating around; thus the instruction
issue pipe (usually) never had to block on a hot spot. The compilers
for these machines took this approach (use lots of regs) instead of
trying to sequence code for shortstop.

I won't get into the (previously debated) thread about "how many regs
is enough" - i.e. the tradeoff between large, fast reg files and the
ensuing overhead of saving/restoring them on context switches/subroutine
calls.

Hope this answers the question.

Rob

0 new messages