
Looking for help with Rattlesnake (more bite for Python)


sk...@calendar.com

Nov 22, 1998, 3:00:00 AM
I've been working on a peephole optimizer for Python for some time. I
presented some results at the Python conference. For those interested, the
slides from my talk are at

http://www.automatrix.com/~skip/public_html/python/spam7/peephole1.html

and the paper is at

http://www.automatrix.com/~skip/public_html/python/spam7/optimizer.html

One of the basic limitations of the work I reported on is that the
stack-based virtual machine does an awful lot of data shuffling, much of
which can't be eliminated by your basic peephole optimizer. What is needed
is an entirely different virtual machine architecture.

A couple things Guido said at the conference clicked with me. First, the
byte code compiler knows how big the stack gets during execution of any code
object. Second, because the maximum stack size is known, the local variables
array and the stack for each frame can be (and are) allocated as a single
chunk of storage. Consequently, the LOAD_FAST and STORE_FAST instructions
only move a PyObject pointer a few bytes away in memory. Fast, but fairly
wasteful of a CPU's time.
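Both facts are easy to check from Python itself. This quick illustration (not from the original post) pokes at a code object's attributes; attribute names are as in modern CPython:

```python
def f(a, b):
    c = a + b
    return c

code = f.__code__
# The compiler records the local-variable names and the maximum
# stack depth it computed for this code object.
print(code.co_varnames)   # ('a', 'b', 'c')
print(code.co_stacksize)  # a small fixed number, typically 2 here
```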

Guido's comments raised the interesting possibility of analyzing the byte
code and generating code for a new virtual machine that treats the stack and
local variables as a randomly accessible register file. The obvious reason
for doing this is that loading objects from local variables onto the stack
becomes unnecessary. Also, if you are interested in generating C code
automatically, it should be easier to generate C code from a register-based
machine that the C compiler can optimize. (See Jonathan Riehl's pyfront
work, also reported on at IPC7.)

Here's a very simple example culled from a couple of lines of Marc-Andre
Lemburg's pybench test suite. (Apologies to the couple of people who've seen
this in earlier form in private email.) In the complex math benchmark he has

c = 1.2 + 6.2j
c = a + b

Ignoring SET_LINENO, this yields the following byte code

44 LOAD_CONST 6 (1.2)
47 LOAD_CONST 7 (6.2j)
50 BINARY_ADD
51 STORE_FAST 4 (c)
54 LOAD_FAST 2 (a)
57 LOAD_FAST 3 (b)
60 BINARY_ADD
61 STORE_FAST 4 (c)

Assuming the stack size is 0 at the point this code segment is reached, the
LOAD_CONST results will be stuck in slots 0 & 1 of the stack and the add
output will be placed in slot 0 (two pops to expose slot 0 again, followed
by a push back into slot 0). Similar results are found for the second
operation.
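The listing above can be reproduced with the standard dis module (a modern Python shows different opcode names and folds 1.2 + 6.2j into a single constant at compile time, but the load/store shuffling is the same):

```python
import dis

def bench(a, b):
    c = 1.2 + 6.2j
    c = a + b
    return c

# Disassemble to see the LOAD_*/STORE_FAST traffic for two
# simple assignments; exact opcode names vary by Python version.
dis.dis(bench)
```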

Treating the stack and the local variables as a single register file we can
load the constants into specific "registers" (n+0 & n+1 in this case, where
n is the size of the local variables array containing three variables, a, b
and c):

# load 1.2 (6) & 6.2j (7) into slots 0 & 1, respectively
44 LOAD_CONST_REG 6,n+0
47 LOAD_CONST_REG 7,n+1

# add slots 0 & 1, saving the results into c (2)
50 BINARY_ADD_REG n+0,n+1,n+0
54 LOAD_FAST_REG n+0,2

# add a (0) and b (1), storing the results in c (2)
57 LOAD_FAST_REG 0,n+0
60 LOAD_FAST_REG 1,n+1
63 BINARY_ADD_REG n+0,n+1,n+0
67 LOAD_FAST_REG n+0,2

This isn't any better than the original code. In fact, it's bigger and does
just as much work. However, all the loads from local variables to the stack can
be eliminated, as can the use of slots in the stack as temporary storage
before writing results out:

44 LOAD_CONST_REG 6,n+0
47 LOAD_CONST_REG 7,n+1
50 BINARY_ADD_REG n+0,n+1,2
57 BINARY_ADD_REG 0,1,2

Existing peephole optimization I've done for the stack-based VM can
precalculate the first add, resulting in (assuming the new constant 1.2+6.2j
gets dumped into slot 8)

44 LOAD_CONST_REG 8,n+0
47 LOAD_FAST_REG n+0,2
50 BINARY_ADD_REG 0,1,2

which can then be reduced to

44 LOAD_CONST_REG 8,2
47 BINARY_ADD_REG 0,1,2

Since c isn't used before being overwritten by the add, the load may be
eliminated as well, leaving us with

44 BINARY_ADD_REG 0,1,2

which is about as good as we're going to get with this little chunk of code
at the moment. (Depending on whether or not we can determine that a and b
are numbers and whether c gets used later before being overwritten again, we
may even be able to eliminate that...)
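The core translation step can be sketched in a few lines of Python. This is a toy illustration, not Rattlesnake's actual code: it handles just the one pattern discussed above, collapsing the four-instruction stack sequence into a single three-address instruction.

```python
def to_register_code(stack_code):
    """Peephole-translate (opname, arg...) tuples of stack code,
    collapsing LOAD_FAST x; LOAD_FAST y; BINARY_ADD; STORE_FAST z
    into one three-address BINARY_ADD_REG x,y,z instruction."""
    out = []
    i = 0
    while i < len(stack_code):
        window = stack_code[i:i + 4]
        if (len(window) == 4
                and window[0][0] == "LOAD_FAST"
                and window[1][0] == "LOAD_FAST"
                and window[2][0] == "BINARY_ADD"
                and window[3][0] == "STORE_FAST"):
            out.append(("BINARY_ADD_REG",
                        window[0][1], window[1][1], window[3][1]))
            i += 4
        else:
            # Anything unrecognized passes through unchanged.
            out.append(stack_code[i])
            i += 1
    return out

code = [("LOAD_FAST", 0), ("LOAD_FAST", 1),
        ("BINARY_ADD",), ("STORE_FAST", 2)]
print(to_register_code(code))  # [('BINARY_ADD_REG', 0, 1, 2)]
```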

There are some limitations to what I expect to implement in the new virtual
machine. IMPORT_FROM is a toughie because "from spam import *" can expand
the local variable array by an arbitrary amount and suppresses the use of
the local variable array for any items that were imported. Consequently,
code that contains IMPORT_FROM isn't converted. Also, for the time being,
code that uses try/except or try/finally isn't converted, because I
couldn't easily figure out the exception handling code at the bottom of
the eval_code2 main loop.

At this point, here's what I have:

* most of an entirely new virtual machine, implemented as a clone of
ceval.c's eval_code2, covering most of the necessary instructions.
This is, however, *completely untested*. It has literally never been
fed a chunk of VM code to execute.

* a peephole optimizer filter that generates instructions for the new
machine given a code object for the original VM.

* modifications to dis.py to allow disassembly of both the stack and
register VMs (mostly defining classes to handle both architectures
and converting existing dis module functions to class methods).

* mods to compile.c to pass off code objects to the peephole optimizer
when python is invoked with -O.

I need to get back to my real work (which isn't Python internals
development). Since returning from the conference, pretty much all I've been
working on is this new virtual machine. I've given it the rather
presumptuous name of Rattlesnake, because I believe it will give Python more
bite. Under the assumption that more eyes means fewer bugs, I'm enlisting
help to complete the project. If you would like to help and don't mind
working on incomplete code that *doesn't run*, let me know.

--
Skip Montanaro
Mojam: http://www.mojam.com/
Musi-Cal: http://concerts.calendar.com/

-----------== Posted via Deja News, The Discussion Network ==----------
http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own

sk...@calendar.com

Nov 22, 1998, 3:00:00 AM
Such an idiot I am... The two URLs in my previous message should be

http://www.automatrix.com/~skip/python/spam7/optimizer.html
http://www.automatrix.com/~skip/python/spam7/peephole01.html

Thanks to George Herbert for bringing the error to my attention.

M.-A. Lemburg

Nov 22, 1998, 3:00:00 AM
[Turning the pure stack based Python VM into a full-blown register
VM... Rattlesnake]

I like the approach. What I would change, though, is to make the
optimizer run as an external program and not (only) hook it into
a running Python process.

To get a group working on the code, I'd suggest creating an eList
mailing list and inviting as many gurus as you can get :-)

I would also be especially interested in ways to inline
functions/methods. Function calls are still too slow in Python.

Aside: I have a hacked up version of ceval.c too, so you might want
to integrate those patches as well (mostly breaking the big switch
statement in smaller ones and restructuring the calling scheme).

--
Marc-Andre Lemburg Y2000: 404 days left
---------------------------------------------------------------------
: Python Pages >>> http://starship.skyport.net/~lemburg/ :
---------------------------------------------------------

William Tanksley

Nov 22, 1998, 3:00:00 AM
On Sun, 22 Nov 1998 15:10:47 GMT, sk...@calendar.com wrote:
>I've been working on a peephole optimizer for Python for some time. I

Cool.

>One of the basic limitations of the work I reported on is that the
>stack-based virtual machine does an awful lot of data shuffling, much of
>which can't be eliminated by your basic peephole optimizer. What is needed
>is an entirely different virtual machine architecture.

This isn't really true. Stack-based optimizers have been less studied than
register-based ones, but there has been enough to give good results. Forth
is one of the classical uses of such optimizers (although, of course, Forth
isn't crippled by being limited to 256 bytecodes; it uses wordcodes, so the
peephole optimizer can insert 'anonymous' wordcodes to replace an
inefficient sequence).

Um... Let's see. Check out ThisForth, at www.taygeta.com; at that site
you'll also find a lot of papers on stack optimizations.

Stack machines vs. register machines does involve some interesting
tradeoffs, but there's absolutely nothing which makes one superior to the
other.

Interesting question: if changing the format of the bytecode to
register-based is interesting, how interesting would it be to change the
format to wordcode? Every instruction would take up 16 bits -- that's 16K
instructions available to the optimizer.

That might be even better than the memory consumption for the register
machine, since the stack wordcodes don't need to store register numbers.

>Skip Montanaro

-Billy

sk...@calendar.com

Nov 24, 1998, 3:00:00 AM
to wtan...@cx930311-b.ocnsd1.sdca.home.com

> == Billy Tanksley
> > == me

> >One of the basic limitations of the work I reported on is that the
> >stack-based virtual machine does an awful lot of data shuffling, much of
> >which can't be eliminated by your basic peephole optimizer. What is needed
> >is an entirely different virtual machine architecture.
>
> This isn't really true. Stack-based optimizers have been less studied than
> register-based ones, but there has been enough to give good results....
>
> Um... Let's see. Check out ThisForth, at www.taygeta.com; at that site
> you'll also find a lot of papers on stack optimizations.
>
> Stack machines vs. register machines does involve some interesting
> tradeoffs, but there's absolutely nothing which makes one superior to the
> other.
>
> Interesting question: if changing the format of the bytecode to
> register-based is interesting, how interesting would it be to change the
> format to wordcode? Every instruction would take up 16 bits -- that's 16K
> instructions available to the optimizer.

Billy,

Thanks for the pointer to the Forth page. I didn't look at the ThisForth
source code, though I did locate a copy of Koopman's "A Preliminary
Exploration of Optimized Stack Code Generation" at

http://www.cs.cmu.edu/~koopman/stack_compiler/index.html

One thing that's worth noting is that Koopman appears to be generating code
for a real stack machine (Harris RTX 2000). Presumably access to that
machine's stack is going to be substantially faster than main memory access.
Such is not the case with the Python VM, since the stack and the local
variables are right next to one another in memory. A LOAD_FAST instruction
does about as much work as a DUP_TOP or ROT_THREE instruction, so the effort
of making a bottom-of-the-stack copy of a variable you want to reuse so you
can recall it with a stack operation isn't going to gain much, if anything.
Saving a copy of a global variable might be useful, but with multi-threading
and the possibility of function calls it's not clear that the global
variable's value won't be invalidated.

I'm not sure what to make of the wordcode/bytecode distinction, but I'm not a
Forth person, so perhaps I'm missing something. I would appreciate a bit
more elaboration on that point. I don't feel particularly limited by only
having 256 byte codes, especially since the two VMs are completely separate.

Darius Bacon

Nov 25, 1998, 3:00:00 AM
sk...@calendar.com writes:

>Billy,

>Thanks for the pointer to the Forth page. I didn't look at the ThisForth
>source code, though I did locate a copy of Koopman's "A Preliminary
>Exploration of Optimized Stack Code Generation" at

> http://www.cs.cmu.edu/~koopman/stack_compiler/index.html

Anton Ertl has written more about these issues; see

http://www.complang.tuwien.ac.at/projects/forth.html
http://www.complang.tuwien.ac.at/projects/backends.html

under the subjects

Code generation for stack machines
Forth native code generation
Stack caching for interpreters
(This one in particular deals with register-machine versus
stack-machine tradeoffs)
Patterns in interpreted Forth execution

Also, Rob Pike and Phil Winterbottom's ``Design of the Inferno Virtual
Machine'' agrees with you about register machines being a better idea
today:

http://plan9.bell-labs.com/cm/cs/who/rob/hotchips.html

(I dunno about that and think there's room for more work.)

>One thing that's worth noting is that Koopman appears to be generating code
>for a real stack machine (Harris RTX 2000). Presumably access to that
>machine's stack is going to be substantially faster than main memory access.
>Such is not the case with the Python VM, since the stack and the local
>variables are right next to one another in memory. A LOAD_FAST instruction
>does about as much work as a DUP_TOP or ROT_THREE instruction, so the effort
>of making a bottom-of-the-stack copy of a variable you want to reuse so you
>can recall it with a stack operation isn't going to gain much, if anything.
>Saving a copy of a global variable might be useful, but with multi-threading
>and the possibility of function calls it's not clear that the global
>variable's value won't be invalidated.

>I'm not sure what to make of the wordcode/bytecode distinction, but I'm not a
>Forth person, so perhaps I'm missing something. I would appreciate a bit
>more elaboration on that point. I don't feel particularly limited by only
>having 256 byte codes, especially since the two VMs are completely separate.

I've worked on two 16-bit wordcode interpreters that others designed,
and don't think it's an especially great idea. One of them only went
for 16 bits because the language used Unicode and it was convenient to
put bytecode programs in string objects -- though it also meant less
pressure to have short special-case opcodes for branches, etc. The
other was in Java and had over 1000 opcodes, mostly combinations of
simpler actions, but we ended up using other tricks for real speed
(like direct JVM code generation). Er, that's real speed by Java
standards, anyway.

my-first-c.l.p.-post'ly yrs,
Darius

--
Darius Bacon http://www.well.com/~djello

Ken Starks

Nov 25, 1998, 3:00:00 AM
dje...@well.com (Darius Bacon) wrote:

>> I'm not sure what to make of the wordcode/bytecode distinction, but I'm
>> not a Forth person, so perhaps I'm missing something. I would
>> appreciate a bit more elaboration on that point. I don't feel
>> particularly limited by only having 256 byte codes, especially since
>> the two VMs are completely separate.

The bytecode/wordcode thing is not part of the ANS Forth specification.
My version is on a 32 bit machine and uses 32 bits for its dictionary
spaces and for its data stack.
The top of the data stack is actually a register, which makes a definition
very quick if its input and output are both a single stack value.

There is essentially only one thing I really miss when programming
in Python rather than in Forth. That is the ability to define
fragments of machine code using Forth as a macro-assembler, and
test them literally 2 seconds later.
My Forth system leaves about 7 of the 32-bit registers unused, so
I don't even have to wrap my machine code in a 'save registers / restore
registers' packet.
Whatever way you look at it, it's a heck of a lot quicker than using
C.

Or is a macro-assembler actually possible in Python? It would be nice to
be able to save the fragment of code as a normal Pythonish object.
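The assembling half, at least, works fine in pure Python: the fragment ends up as an ordinary bytes object you can inspect, save, or combine (actually executing it would still need platform-specific tricks). A minimal sketch, with a made-up mini instruction set of two mnemonics plus one immediate form:

```python
import struct

# Hypothetical mini-assembler: mnemonics map straight to x86
# machine-code bytes; the result is a normal Python bytes object.
OPCODES = {
    "nop": b"\x90",   # no-op
    "ret": b"\xc3",   # near return
}

def assemble(program):
    code = bytearray()
    for ins in program:
        if isinstance(ins, tuple) and ins[0] == "mov_eax":
            # B8 imm32 -- mov eax, immediate (little-endian)
            code += b"\xb8" + struct.pack("<I", ins[1])
        else:
            code += OPCODES[ins]
    return bytes(code)

fragment = assemble([("mov_eax", 7), "ret"])
print(fragment.hex())  # b807000000c3
```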


Ken, __O
_-\<,_
(_)/ (_)
Virtuale Saluton.


Guido van Rossum

Nov 25, 1998, 3:00:00 AM
> Also, Rob Pike and Phil Winterbottom's ``Design of the Inferno Virtual
> Machine'' agrees with you about register machines being a better idea
> today:
>
> http://plan9.bell-labs.com/cm/cs/who/rob/hotchips.html
>
> (I dunno about that and think there's room for more work.)

Note that they are talking about a different kind of VM though --
their operands are more traditional data types like 32-bit ints. (They
talk a lot about the JVM and JIT in their paper.)

I'm not yet convinced that it makes much sense to apply the usual CISC
machine optimization techniques to Python and expect to get similar
results. I don't really have the time to look into the Forth papers
but I suspect that they might provide a more fertile source of ideas.

I do think that reducing the number of instructions needed to execute
a given piece of Python makes a lot of sense though, and that's why I
still think that Skip's plan to use 3-address opcodes makes sense.

--Guido van Rossum (home page: http://www.python.org/~guido/)

Michael Scharf

Nov 25, 1998, 3:00:00 AM
Guido van Rossum wrote:
> I do think that reducing the number of instructions needed to execute
> a given piece of Python makes a lot of sense though, and that's why I
> still think that Skip's plan to use 3-address opcodes makes sense.

What are '3-address opcodes'?

Michael
--
''''\ Michael Scharf
` c-@@ TakeFive Software
` > http://www.TakeFive.com
\_ V mailto:Michael...@TakeFive.co.at
