Optimal Computer Architectures

Norwitz Neal

unread,

Nov 8, 1990, 4:24:12 PM11/8/90

to

I have a recursive algorithm that I would like to run as fast as possible. I am running a part of it on a SPARCstation SLC. But at this rate, it might finnish in a couple of years!! I was wondering if someone could suggest a couple of architectures that may suit this recursive algorithm better than others. The program takes no memory (about 12K). So all I need is speed. There are no floating point operations, except a couple of increments.

Also, any information about speed ratings (in mips) from many different computers would be appreciated.

Thanks a lot.

Neal Norwitz
ne...@tut.fi
Tampere University of Technology

gil...@m.cs.uiuc.edu

unread,

Nov 8, 1990, 11:03:00 PM11/8/90

to

> I have a recursive algorithm that I would like to run as fast as
> possible. I am running a part of it on a SPARCstation SLC. But at
> this rate, it might finnish in a couple of years!! I was wondering if
> someone could suggest a couple of architectures that may suit this
> recursive algorithm better than others.

Oh that's easy. Any RISC BESIDES the SPARC should probably run much
faster. In a recent benchmark, I found that, on deeply recursive code
the SPARCstation 1 was 2.8 times faster than 16Mhz 68020/Mac II, but 6
times faster on non-recursive code. SPARC seems to suck the big wazoo
on deep recursion, presumably because of its register-windowing
design.

Tom Reingold

unread,

Nov 9, 1990, 10:32:55 AM11/9/90

to

In article <1990Nov8.2...@funet.fi> ne...@uikku.tut.fi (Norwitz
Neal) writes:

$
$ I have a recursive algorithm that I would like to run as fast as possible. I am running a part of it on a SPARCstation SLC. But at this rate, it might finnish in a couple of years!! I was wondering if someone could suggest a couple of architectures tha
$ t may suit this recursive algorithm better than others. The program takes no memory (about 12K). So all I need is speed. There are no floating point operations, except a couple of increments.
$
$ Also, any information about speed ratings (in mips) from many different computers would be appreciated.

You pose a vague question, so I don't know what we (the net) can
suggest. If it's really going to take years with the current
algorithm, it sounds worthwhile changing it to an iterative one.

Have you looked at lisp systems? Some of them handle recursion pretty
efficiently. I am not talking about lisp running under UNIX. I am
talking about machines with lisp semantics in the processor.
--
Tom Reingold
t...@samadams.princeton.edu OR ...!princeton!samadams!tr
"Warning: Do not drive with Auto-Shade in place. Remove
from windshield before starting ignition."

Bruce Carneal

unread,

Nov 10, 1990, 7:10:19 AM11/10/90

to

For tail calls at least, the responsibility/blame for
excessive window overflows lies with the compiler, not
the architecture. Hopefully the newer compilers
for the SPARC do/will handle tail calls without stack
growth.

Sean Fagan

unread,

Nov 10, 1990, 6:47:12 PM11/10/90

to

In article <330...@m.cs.uiuc.edu> gil...@m.cs.uiuc.edu writes:
>SPARC seems to suck the big wazoo
>on deep recursion, presumably because of its register-windowing
>design.

More to the point, *any* processor with register windows is not going to do
too well on deeply recursive code.

There are tricks a programmer can do to get rid of some recursion, and some
that the compiler can do, of course.

Some people at Berkeley did some studies with their RISC chip and LISP;
since LISP is generally very recursive, and the Berkeley people seem to like
register windows, those are probably worth reading.

--
-----------------+
Sean Eric Fagan | "*Never* knock on Death's door: ring the bell and
se...@sco.COM | run away! Death hates that!"
uunet!sco!seanf | -- Dr. Mike Stratford (Matt Frewer, "Doctor, Doctor")
(408) 458-1422 | Any opinions expressed are my own, not my employers'.

Keith Bierman fpgroup

unread,

Nov 11, 1990, 1:28:55 AM11/11/90

to

In article <86...@scolex.sco.COM> se...@sco.COM (Sean Fagan) writes:
....

>SPARC seems to suck the big wazoo
>on deep recursion, presumably because of its register-windowing
>design.

More to the point, *any* processor with register windows is not going to do
too well on deeply recursive code.

....

Lest we forget, the SPARC ISA provides register windows, but their use
is up to sw conventions (viz. CALL does not result in a window spin).
--
----------------------------------------------------------------
Keith H. Bierman kbie...@Eng.Sun.COM | k...@chiba.Eng.Sun.COM
SMI 2550 Garcia 12-33 | (415 336 2648)
Mountain View, CA 94043

Scott Draves

unread,

Nov 11, 1990, 5:29:07 AM11/11/90

to

In article <3...@mentat.COM>, b...@mentat.COM (Bruce Carneal) writes:
|> Hopefully the newer compilers
|> for the SPARC do/will handle tail calls without stack
|> growth.

It is my understanding that the SunOS 4.x compiler does
detect and eliminate tail recursion. This is verified by
compiling (-O4 -S)

ENUM(i) { if (i==0) return 1; else return ENUM(i-1); }

and seeing

_ENUM:
L77001:
tst %o0
bne L77003
nop
retl
add %g0,1,%o0
L77003:
b L77001
dec %o0

It seems odd that it would generate
a branch to another branch. can anyone
who knows comment?

Consume
Scott Draves Be Silent
sp...@cs.cmu.edu Die

Bruce Hoult

unread,

Nov 12, 1990, 12:47:23 AM11/12/90

to

>More to the point, *any* processor with register windows is not going
>to do too well on deeply recursive code.

I don't know much about this, but it seems to me that a machine without
register windows is oiggoing to have to put those values on the stack
in any case. A machine with lots of registers that have to be stored
and loaded in deep recursion should at least be able to take advantage
of any burst mode that the memory subsystem supports, thus giving an
advantage over normal architectures.

Robert Firth

unread,

Nov 12, 1990, 3:55:13 PM11/12/90

to

In article <1990Nov11....@cs.cmu.edu> sp...@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) writes:

>It is my understanding that the SunOS 4.x compiler does
>detect and eliminate tail recursion. This is verified by
>compiling (-O4 -S)
>
>ENUM(i) { if (i==0) return 1; else return ENUM(i-1); }
>
>and seeing
>
>_ENUM:
>L77001:
> tst %o0
> bne L77003
> nop
> retl
> add %g0,1,%o0
>L77003:
> b L77001
> dec %o0
>
>
>It seems odd that it would generate
>a branch to another branch. can anyone
>who knows comment?

As a preliminary note: the program mentioned by the original poster
does not lend itself to tail-recursion elimination, so this won't
work for him. Another option, since the code is fairly short, would
be to hack an assembler version that didn't use the register windows
(as was rightly pointed out, they are optional, not an ineradicable
part of the call protocol). I actually tried that on Ackermann's
Function (what else?) and got a factor of 7 speedup.

Turning to the above code - yes there does seem to be a defect in
the low-level optimiser here. There are problems in eliding branch
chains on machines with delay slots, but that doesn't apply above:
the delay slot of the BNE is empty, so one could close the chain and
move the decrement of o0 to the label destination.

However, one can do better. Since register o0 is dead on the
fall-through (it is destroyed by the add), one can float up both
the instructions at L77003:

bne L77001
dec %o0

thus closing the chain and filling the delay slot. This reduces
the five-instruction loop to three instructions, for a 65% speed up.

Henry Spencer

unread,

Nov 13, 1990, 3:00:51 PM11/13/90

to

In article <86...@scolex.sco.COM> se...@sco.COM (Sean Fagan) writes:
>>SPARC seems to suck the big wazoo
>>on deep recursion, presumably because of its register-windowing
>>design.
>
>More to the point, *any* processor with register windows is not going to do
>too well on deeply recursive code.

Actually, the AMD 29k ought to do all right. The problem is not register
windows vs. recursion, it is *fixed-size* register windows vs. recursion.
SPARC ends up saving and restoring a lot of registers that aren't actually
in use, because the allocation quantum is 16 registers. On the 29k the
quantum is 2 registers (1 if you don't care about being floating-point
compatible with the 29050) and it should almost never be necessary to
save and restore unused registers.
--
"I don't *want* to be normal!" | Henry Spencer at U of Toronto Zoology
"Not to worry." | he...@zoo.toronto.edu utzoo!henry

Preston Briggs

unread,

Nov 13, 1990, 3:40:59 PM11/13/90

to

In article <1990Nov12.0...@actrix.co.nz> Bruce...@actrix.co.nz (Bruce Hoult) writes:
>>More to the point, *any* processor with register windows is not going
>>to do too well on deeply recursive code.

>I don't know much about this, but it seems to me that a machine without
>register windows is oiggoing to have to put those values on the stack

The point is _fixed-size_ register windows.
A window-less machine might have to save 3 registers per call.
A fixed-size window machine might save 16 or 32 registers per call.
Even with burtst-mode saves, the volume will overwhelm the cache,
and perhaps virtual memory.

Preston

Donald Lindsay

unread,

Nov 13, 1990, 8:38:15 PM11/13/90

to

In article <1990Nov13.2...@zoo.toronto.edu>

he...@zoo.toronto.edu (Henry Spencer) writes:
>In article <86...@scolex.sco.COM> se...@sco.COM (Sean Fagan) writes:
>>More to the point, *any* processor with register windows is not going to do
>>too well on deeply recursive code.

>The problem is not register
>windows vs. recursion, it is *fixed-size* register windows vs. recursion.
>SPARC ends up saving and restoring a lot of registers that aren't actually
>in use, because the allocation quantum is 16 registers.

That's a reasonable argument, but not the only one.

First, compilers for non-window machines try to dribble values out to
memory, avoiding bursts that could cause processor stalls. (The
infamous VAX "calls" instruction writes a burst, and that seriously
stalled the VAX 11/780.) Obviously, writing a bunch of windows is a
burst.

Second, it wasn't clear from the original posting that the SPARC was
actually running slower than some other machine - the poster just
said it was slow. Maybe he had a big computation? Did he try the
-O switch?

Third, programs that recurse can often be written as iterations.
All machines, SPARC included, run the iterative form faster.

I'm unconvinced that register windows are a good idea, but I'm also
unconvinced that SPARC is some sort of disaster area.

--
Don D.C.Lindsay

Bruce Carneal

unread,

Nov 13, 1990, 9:36:30 PM11/13/90

to

In article <1990Nov11....@cs.cmu.edu> sp...@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) writes:

>It is my understanding that the SunOS 4.x compiler does
>detect and eliminate tail recursion. This is verified by
>compiling (-O4 -S)

> (example of intra procedural tail recursion deleted)

Unfortunately this doesn't work in the general case.
Compiling under SunOS 4.1 with 'cc -O4 -S ...' the trivial
program fragment:

func1(arg) int arg; { return func2(arg); }
func2(arg) int arg; { return func3(arg); }
func3(arg) int arg; { return arg; }

you get (with some junk removed):

_func1:
save %sp,-96,%sp
call _func2,1
mov %i0,%o0
ret
restore %g0,%o0,%o0
_func2:
save %sp,-96,%sp
call _func3,1
mov %i0,%o0
ret
restore %g0,%o0,%o0
_func3:
retl
nop

As shown, no tail call optimization occurs.
This form of call is used along the main paths
of some Streams modules (and elsewhere of course).
Later compilers can/do optimize those calls.

Jerry Callen

unread,

Nov 14, 1990, 10:35:01 AM11/14/90

to

In article <11...@pt.cs.cmu.edu> lin...@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>First, compilers for non-window machines try to dribble values out to
>memory, avoiding bursts that could cause processor stalls. (The
>infamous VAX "calls" instruction writes a burst, and that seriously
>stalled the VAX 11/780.) Obviously, writing a bunch of windows is a
>burst.

I'm curious what's wrong with a burst write of registers out to memory?
It seem to me that an intelligent memory system (maybe with a hint
from the processor that a burst is coming) could accept one word per
cycle and get those registers out to memory very fast. If the processor
is "stalled" while the registers go out, how is this different from
procedure prologue code in a RISC that does a bunch of stores in a row?
In fact, the RISC has to fetch the instructions, so the RISC should take
MORE time to save the registers.

I'l take a stab at answering my own question (I like talking to myself :-)).
Conceivably the RISC could NOT do the saves as a block in the prologue, but
(as Donald says) dribble them out as the register is needed. One complication
here is that the epilogue has to have some way of knowing which registers
actually have to be restored (in the case of multiple returns, or, for
Ada, exceptions).

-- Jerry Callen
jca...@encore.com

Henry Spencer

unread,

Nov 14, 1990, 1:03:45 PM11/14/90

to

In article <11...@pt.cs.cmu.edu> lin...@gandalf.cs.cmu.edu (Donald Lindsay) writes:

>First, compilers for non-window machines try to dribble values out to
>memory, avoiding bursts that could cause processor stalls. (The
>infamous VAX "calls" instruction writes a burst, and that seriously
>stalled the VAX 11/780.) Obviously, writing a bunch of windows is a
>burst.

The tradeoffs here are subtle. Many memory systems have much higher
bandwidth for sequential bursts than for random accesses, although your
bus and processor have to be designed to exploit this. There is also
the question of stalls waiting for data to be *read* from memory, while
the window machine can charge ahead because the data is probably in a
register instead.

>I'm unconvinced that register windows are a good idea...

The MIPSCo people, who measure almost everything, say that for ordinary
C code it doesn't matter much either way, *if* you're willing to assume
sophisticated compilers. (Of course, they are perhaps not entirely
unbiased... :-)) They also say that for languages with highly dynamic
calling patterns (e.g. C++ and Smalltalk), where the compiler has less
of a handle on what's happening, register windows would be a modest win.
They're definitely a big win if your compiler isn't too bright.

>but I'm also
>unconvinced that SPARC is some sort of disaster area.

I wouldn't call it a disaster area exactly, at least not for this reason :-),
but rapid large excursions in stack depth are definitely expensive on it.

Marc Brandis

unread,

Nov 15, 1990, 5:03:27 AM11/15/90

to

In article <1990Nov12.0...@actrix.co.nz> Bruce...@actrix.co.nz (Bruce Hoult) writes:

>>More to the point, *any* processor with register windows is not going
>>to do too well on deeply recursive code.
>
>I don't know much about this, but it seems to me that a machine without
>register windows is oiggoing to have to put those values on the stack
>in any case. A machine with lots of registers that have to be stored

When you look at typical recursive procedures, you will notice that they
have very often only a small number of local variables, that can be held
in registers. In an architecture with register windows, a trap occurs on
stack underflow or overflow and causes the register windows to be dumped to
memory. Most of the time, too much is stored and restored.

In an architecture without register windows, a smart compiler can remove
much of this overhead.

There are also quite a lot of cases, where recursive procedures use variables
in one path through the procedure but not in another. In this case it may
be better to have the variables in memory anyway, so nothing would have to
be stored at all.

(* I speak only for myself.
Marc-Michael Brandis, Institut fuer Computersysteme, ETH-Zentrum
CH-8092 Zurich
email: bra...@inf.ethz.ch bra...@iis.ethz.ch
*)

AKA Solitair

unread,

Nov 16, 1990, 6:41:26 AM11/16/90

to

Norwitz Neal writes:
>I have a recursive algorithm that I would like to run as fast as possible.
>I am running a part of it on a SPARCstation SLC. But at this rate, it might
>finnish in a couple of years!!

Why don't you have a look at the FORTH-chip (don't remember the manufacturer,
the part number should be something like 4016 if I recall correctly), the
processor has no general purpose registers, just a stack cache, the thing
was designed to execute *native* FORTH (a medium to high level threaded stack
based language) code. Last I heard about it, they we're designing a
new one. This processor was practically designed for recursive algorithms.
--
Sincerely, berg%cip-s01.informat...@unido.bitnet
Stephen R. van den Berg.
"I code it in 5 min, optimize it in 90 min, because it's so well optimized:
it runs in only 5 min. Actually, most of the time I optimize programs."

H.Shrikumar

unread,

Nov 18, 1990, 11:20:09 PM11/18/90

to

In article <37...@rwthinf.UUCP>

>Norwitz Neal writes:
>>I have a recursive algorithm that I would like to run as fast as possible.
>>I am running a part of it on a SPARCstation SLC. But at this rate, it might
>>finnish in a couple of years!!

and berg%cip-s01.informat...@unido.bitnet writes:

>Why don't you have a look at the FORTH-chip (don't remember the manufacturer,
>the part number should be something like 4016 if I recall correctly), the
>processor has no general purpose registers, just a stack cache, the thing

The start-up that thought up the 4016 had start-up problems, and
sold the technology to Harris. Harris cleaned up a lot of the hardware
bugs in the Novix 4016, christened it the RTX 2000 (*R*eal *T*ime e*X*press),
played with it for a little while, and decided to discontinue the line.

Sad. See other posts in this newgroup and comp.lan.forth for more
details.

But your point on Stack Caching machines and recursive routines is
well taken.

Just to comment, I think the original poster was talking about
mature machines, and the Novix/RTX was yet no more than a chip and
an architecture concept. (25ns RAM, three bank of it, no OS yet ...)

-- shrikumar ( sh...@ncst.in )

Martin Huber

unread,

Dec 13, 1990, 5:20:14 AM12/13/90

to

I forward this for friend. *Don't* reply to me, reply to
the address in the Reply-To header.

Here goes:

Another comment: I have been at the November '90 trade show ELECTRONICA
in Munich, Germany. I have visited HARRIS. They were perfectly willing
to sell the RTX2000. I got a databook, infos ... There was a Mr. Klaus
Flesch which is obviously something like general manager of the company
FORTH SYSTEMS, Germany. He *sells* RTX2000-based products. IMHO, the
RTX2000 is available. It *is* mature: 10MFIPS (FORTH micro-instructions
per second) is definitively state of the art in RISCS. The hardware is
capable of being micro-codable. I think this is better than many of
today's machines. Prices should be in the $500-$5000 range (please ex-
cuse my bad main memory, it is *not* a standard memory), in any case,
they *are* cheaper than a big useless number cruncher.

Take it or leave it, it's up to you!

Martin (ma...@ee.uni-sb.de)
Disclaimer: Mostly i wait for technology to improve, but sometimes
improvements go by unnoticed. Moral: Watch out!