
Registers vs. Stack


Tom Locke

Aug 21, 2003, 12:03:35 PM
to perl6-i...@perl.org
Hi All

Is there somewhere you can point me to a discussion about the choice for a
register VM rather than a stack VM? If not, let's have it now - I'll
volunteer to tidy the end result into a postable form.

The FAQ briefly mentions:

we're already running with a faster opcode dispatch
than [Perl, Python, and Ruby] are, and having registers just
decreases the amount of stack thrash we get.

Can I for one ask for some more detail?

The FAQ also points to the 68k emulator as a successful example. But the 68k
emulator *had* to be register based. This example shows it can be done, but
gives no evidence that it *should* be done for a script VM.

Note that I have *absolutely* no opinion on this (I lack the knowledge).
It's just that with Perl, Python, Ruby, the JVM and the CLR all stack based,
Parrot seems out on a limb. That's fine by me -- innovation is not about
following the crowd, but I feel it does warrant stronger justification.

Tom.

p.s. (and at the risk of being controversial :) Why did Miguel de Icaza say
Parrot was "based on religion"? Was it realted to this issue? Why is he
wrong?


Sean O'Rourke

Aug 21, 2003, 1:56:22 PM
to Tom Locke, perl6-i...@perl.org
"Tom Locke" <t...@livelogix.com> writes:
> p.s. (and at the risk of being controversial :) Why did Miguel de
> Icaza say Parrot was "based on religion"? Was it related to this
> issue? Why is he wrong?

IIRC it is -- his take is that stack VM code provides useful
information about variable lifetimes, which can help the JIT produce
good code. If everything's jitted, then his contention is that
bytecode opcount doesn't matter that much.

I'm not going to stick my neck out and offer a technical opinion, but
I'll just note that religion has been the driving force behind some
great things in the past -- witness Europe's cathedrals ;). Oh,
wait... we're supposed to be "the bazaar".

/s

Gordon Henriksen

Aug 21, 2003, 2:44:05 PM
to Tom Locke, perl6-i...@perl.org
Sean O'Rourke wrote:

More specifically, if IMCC spills a register into the frame or a lexical
pad or who-knows-where, it's hard for the JIT compiler to "unspill" that
value. Not having a register file in the bytecode leaves register spill
up to the JIT compiler, which knows more about the native platform.

For instance, IA-64 (Itanium) has a native register file large enough to
dedicate one native register to each parrot string register, integer
register, and PMC register--and still have plenty left over. It'd be
hard for the hypothetical IA-64 JIT to fully utilize that register file.
(OTOH, it would be hard to write even a C function which is complex
enough to fully utilize that particular register file without giving the
optimizer a seizure.)

By contrast, x86 has nowhere near enough native GP registers for the 96
total IRs, SRs, and PRs--the x86 JIT is likely spilling registers left
and right. (OTOH, C code is spilling left and right, too.)

The ideal stack-based JIT might be able to more easily emit better code
for both architectures.

Stack-based operators are an unnatural paradigm for most assembly-level
programmers, though. It's easier to give a register a name than to
remember how deep on the stack such-and-such value is. What else is
there to recommend registers? Well--Parrot's pretty married to them by
now, for one.

--

Gordon Henriksen
IT Manager
ICLUBcentral Inc.
gor...@iclub.com


Brent Dax

Aug 21, 2003, 9:40:33 PM
to Tom Locke, perl6-i...@perl.org
Dan should really be answering this, but I'll try...

Tom Locke:
# The FAQ briefly mentions:
#
# we're already running with a faster opcode dispatch
# than [Perl, Python, and Ruby] are, and having registers just
# decreases the amount of stack thrash we get.
#
# Can I for one ask for some more detail?

Pushing values onto and popping them off a stack--especially the fairly
complicated stacks Parrot uses (and needs)--is an expensive operation;
by having registers, we avoid that expense. The speed hit from the
stack operations would negate much of our advantage over those other
languages. That's pretty much all we're saying there.

As I remember, the decision to use registers was based on a few things:

1. Speed.
   a. The above: Register accesses are faster than stack accesses,
      in several significant ways.
   b. Fewer opcodes: Using registers means less pushing/popping,
      which means that opcode dispatch is less critical and ops
      that actually *do* stuff will stay in the cache longer.

2. Less bytecode: Any way you slice it, "print X" is smaller on
   disk than "push X, print".

3. Optimization: Optimizing register-based code is a problem that's
   been heavily studied for decades. We can leverage all that
   knowledge to make Parrot bytecode run faster.

4. Ease of hand hacking: It's easier to hand-write and hand-debug
   register-based assembler than stack-based assembler. This isn't
   that big an issue, but it is a factor.

Basically it comes down to this: Register architectures have some Hard
Problems, but once they're overcome the code will run faster. Stack
architectures have fewer Hard Problems, but run slower.

All things considered, we'd rather tackle a few Hard Problems than run
slower.

--Brent Dax <br...@brentdax.com>
Perl and Parrot hacker

"Yeah, and my underwear is flame-retardant--that doesn't mean I'm gonna
set myself on fire to prove it."

K Stol

Aug 22, 2003, 2:58:37 PM
to Brent Dax, Tom Locke, perl6-i...@perl.org
See the entry on Dan's blog, "Registers vs stacks for interpreter
design". It's on this page:
http://www.sidhe.org/~dan/blog/archives/2003_05.html


klaas-jan

Paolo Molaro

Aug 22, 2003, 9:18:48 AM
to perl6-i...@perl.org
On 08/21/03 Tom Locke wrote:
> Note that I have *absolutely* no opinion on this (I lack the knowledge).
> It's just that with Perl, Python, Ruby, the JVM and the CLR all stack based,
> Parrot seems out on a limb. That's fine by me -- innovation is not about
> following the crowd, but I feel it does warrant stronger justification.

A well-designed register-based interpreter can be faster than a
stack-based interpreter (because of the reduced opcode dispatch
overhead), and doing a simple JIT for it may also be easier, if you
ignore some advanced optimizations. That seems to be the main reason
for Parrot to go for it.

Perl/Python/Ruby don't have (opcode dispatch) speed as their main aim
(they use coarse instructions), so they use a stack-based design
because it's much simpler.

The JVM and the CLR use a stack-based instruction representation, but
they are intended for JIT compilation in the end, and in that case a
register-based representation doesn't buy you anything (and complicates
things such as the calling convention).

That said, it will be interesting to see how Parrot turns out once it's
reasonably feature complete: it may end up providing interesting
research results (whether good or bad, we don't know yet:-).
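
To put a rough number on the "coarse instructions" point, here's a toy
micro-benchmark (my own sketch, not code from any real VM): it performs
the same total work either as many tiny ops or as fewer fat ops, each
dispatched through an indirect call, so the measured difference is
essentially pure dispatch overhead. That's why interpreters built
around coarse ops worry much less about dispatch speed.

    /* Toy sketch: same total work, many fine ops vs. few coarse ops. */
    #include <stdio.h>
    #include <time.h>

    #define TOTAL 100000000L      /* total additions to perform       */
    #define CHUNK 100L            /* work done by one "coarse" op     */

    static volatile long acc;

    static void fine_op(void)   { acc += 1; }
    static void coarse_op(void) { for (long i = 0; i < CHUNK; i++) acc += 1; }

    static double run(void (*op)(void), long dispatches)
    {
        clock_t t0 = clock();
        for (long i = 0; i < dispatches; i++)
            op();                 /* one indirect call = one dispatch */
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    int main(void)
    {
        printf("fine   (%ld dispatches): %.2fs\n", TOTAL,
               run(fine_op, TOTAL));
        printf("coarse (%ld dispatches): %.2fs\n", TOTAL / CHUNK,
               run(coarse_op, TOTAL / CHUNK));
        return 0;
    }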

> p.s. (and at the risk of being controversial :) Why did Miguel de Icaza say
> Parrot was "based on religion"? Was it related to this issue? Why is he
> wrong?

See his side of the story at:
http://primates.ximian.com/~miguel/activity-log.php (22 July 2003).
There are also a few short comments on some blogs. But the main point
is: at the end of the day, numbers are what count, though everyone is
free to give more weight to some numbers than others, such as:

* execution speed (in mini, macro and bogus benchmarks:-)
* number of languages that can be reasonably run on the VM
* number of languages that _cannot_ reasonably run on the VM:-)
* memory usage overhead
* runtime safety features
* number of platforms supported
* number of developers working on the VM
* number of users of (programmers for) the VM

So, you can't really say someone is wrong until you measure at least
some of the above quantities and decide which ones matter most to you.
Alas, measuring is hard, and someone's priorities may not match the
priorities of whoever implements/designs the virtual machines:-)

I guess the first reasonably good numbers will come out of the Python
on Parrot pie-fest; I can't wait a year for it, though:-)

lupus

--
-----------------------------------------------------------------
lu...@debian.org debian/rules
lu...@ximian.com Monkeys do it better

Michael Maraist

Aug 22, 2003, 12:46:50 PM
to perl6-i...@perl.org
On Thursday 21 August 2003 21:40, Brent Dax wrote:
> # we're already running with a faster opcode dispatch

Man, I wish I had the time to keep up with Parrot development. Though,
as others have pointed out, the core architecture is somewhat solidified
by this point, I thought I'd put in my two cents' worth.

I completely agree that stack machines are for wimps ;) But I have a
problem with some people's representation of stack machines. When was
the last modern real CPU that actually performed push/pop operations
for its stack? That entire argument is moot, in my opinion.

Look at the SPARC chip as an example. You have a set of pre-defined,
directly mappable registers which are appended to the stack; then you
have your input parameters, your worst-case output parameters, and your
local spill variables, all of which are pre-calculated at compile time,
and a single number is computed. At the entry and exit of each function
call, that number is added to and subtracted from the stack pointer.
All subsequent "stack operations" are simply "ld/st [sp + offset], reg".
If you were ballsy enough, you could do global variable allocation, but
depending on whether you're producing relocatable code, you might still
have to add the address to your instruction pointer. Thus, short of
always having enough registers, you have to perform offset calculations,
which is not much different from stack pushes/pops. But the paradigm is
different.

But there's another issue that I've seen brought up. By statically
allocating spill/input/output variables to an offset from the stack
pointer, you rid yourself of the issue of "where was that variable in
the mix of pushes and pops". You're guaranteed that a variable is at a
specific address, albeit a relative one.

There is no difference between performing

    add R1, 5          # R1 += 5

and

    add [SP+1], 5

especially if, at the opcode-execution level, R1 is defined as
SP + R1_OFFSET.
Taking the register-spill analogy back to JITting: we don't know how
big the CPU register set is at Parrot compile time, so we don't know
what a good register-set size is. x86s are sadly still treated as
accumulator machines (even with x86-64); there are just too many cool
compiler techniques that don't work unless you have 32+ GPRs, so it's
hardly worth the effort to test for possible optimizations with only 8.

On the other hand, IA-64 with 100+ GPRs can unroll loops and map temporaries
like there's no tomorrow.

The end result is that a dynamically sized register set is probably the
ideal for a VM. If the compiler can assume that you have as many
registers as you need, but is given the constraint of "please try not
to use any more than you absolutely need" (a la generic Chaitin, or
Chow (basic-block based)), then in the rare case that an Itanium is in
use, a full register mapping can occur. If we need to fall back to an
accumulator style, then you can use a raw vmStackFrame + offset, where
the VM stack frame pointer itself lives in a register. It's also
possible (albeit not as obvious) to have a hybrid that maps the first n
variables to physical registers, for the common case of 32-register
machines.

Now in the case of Parrot, our stack (last I checked) was not
homogeneous, so this simplistic case wouldn't work well. But there are
two solutions that immediately occur to me.

Soln A)
Treat the datatype as trusted-opaque, and large enough to handle the
largest data-type, e.g.

    iadd R1 <= R2, R3
    sconcat R4 <= R5, R6
    etc.

We merely trust that the compiler won't mix and match data-types into
offset assignments. We would still, of course, need to properly handle
GC/DOD through the stack, so we couldn't be completely opaque.

Input parameters to functions would have to either be statically sized,
or there would have to be a special opcode to access dynamically sized
input parameters of unknown types. A simple opcode

    regAlloc(numInputRegs, numLocalRegs)

would shift the frame pointer such that the input parameters become
regs 1..numInputRegs, and the locals become regs
numInputRegs+1..numInputRegs+numLocalRegs. This is somewhat similar to
the Itanium register-allocation style.
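
Here's roughly what that frame-pointer shift could look like in C
(names are mine, registers are 0-based here for brevity, and overflow
checking is ignored): one big register array, and a call just slides
the frame pointer forward so the caller's outgoing-argument slots
become the callee's first registers, register-window style.

    #include <stdio.h>

    typedef long cell;

    static cell  reg_mem[65536];
    static cell *fp = reg_mem;        /* base of the current frame */

    #define R(n) (fp[(n)])

    int main(void)
    {
        /* caller: its frame is R(0)..R(3); the two outgoing args go in
           the top two slots, R(2) and R(3) */
        R(2) = 40;
        R(3) = 2;

        /* "call" with regAlloc(2, 1): slide fp so those args become the
           callee's R(0) and R(1); one local, R(2), sits above them */
        cell *saved_fp = fp;
        fp += 2;

        R(2) = R(0) + R(1);           /* callee computes into its local */
        printf("callee sees %ld + %ld = %ld\n", R(0), R(1), R(2));

        /* "return": restore the caller's window */
        fp = saved_fp;
        printf("caller's R(2) is still %ld\n", R(2));
        return 0;
    }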

Soln B)
Have a multitude of homogeneous stacks. This is identical to solution
A, but trades complexity for performance. Namely, there would be:

    intStack
    fpStack
    strStack
    objStack

The reg-allocation opcode would also require 4 pairs of sizes.
Additionally, the compiler must maintain 4 separate input/output/local
variable->register mappings.

The advantages are:
* no typecasting problems with parameters
* GC is more efficient (it's guaranteed that all non-null refs found in
  the str/obj stacks need DODing, and we don't need to test the
  stack-element type on each iteration)
* maps more directly onto the integer/floating-point register sets; the
  str/obj stacks need external referencing anyway
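
For what it's worth, a sketch of what such a frame could look like in C
(names and types invented, just to show why the DOD pass gets simpler
when the pointer-bearing registers live in their own banks):

    /* Sketch of Soln B: each call frame carries four homogeneous
     * register banks, so the DOD/GC pass only walks the string and
     * object banks and never needs a per-slot type tag. */
    #define NUM_REGS 32

    typedef struct Frame {
        long    int_regs[NUM_REGS];   /* intStack slice for this frame */
        double  num_regs[NUM_REGS];   /* fpStack                       */
        void   *str_regs[NUM_REGS];   /* strStack: GC-traced           */
        void   *obj_regs[NUM_REGS];   /* objStack: GC-traced           */
    } Frame;

    /* The DOD pass only looks at the two pointer banks. */
    void mark_frame(Frame *f, void (*mark)(void *))
    {
        for (int i = 0; i < NUM_REGS; i++) {
            if (f->str_regs[i]) mark(f->str_regs[i]);
            if (f->obj_regs[i]) mark(f->obj_regs[i]);
        }
    }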


Well, again, just my $0.02. But I just felt the need to defend
"practical" stack computing.
