Optimizing stack access in the Python interpreter


sk...@calendar.com

Jul 6, 1998

It occurred to me yesterday while examining the interpreter code that access
to the Python run-time stack isn't as efficient as it might be. For
instance, although the stack is accessible by the interpreter as an array,
all the accesses are through PUSH and POP macros. The ROT_TWO instruction
is a case in point:

v = POP();
w = POP();
PUSH(v);
PUSH(w);

The top two elements are popped off the stack using two temporary variables,
then pushed back on in reverse order. Setting aside the exclusive-or bitwise
swap trick, we can perform the above operation with only a single temporary:

v = TOP();
TOP() = TOP2();
TOP2() = v;

where TOP2 is defined as accessing the next-to-top element of the stack.
ROT_THREE benefits similarly.
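
For concreteness, here is roughly what the macros and the rewritten ROT_THREE
might look like; the macro definitions and the exact rotation direction are my
reconstruction from the description above, not a quote from the patch:

    /* TOP() already exists in ceval.c; TOP2()/TOP3() are the proposed
       deeper-access macros (definitions assumed, not quoted) */
    #define TOP()   (stack_pointer[-1])
    #define TOP2()  (stack_pointer[-2])
    #define TOP3()  (stack_pointer[-3])

    case ROT_THREE:
            /* lift the second and third items one place up and move the
               old top down into third place, using a single temporary */
            v = TOP();
            TOP() = TOP2();
            TOP2() = TOP3();
            TOP3() = v;
            continue;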

Unfortunately, ROT_TWO and ROT_THREE are not heavily used instructions in
the Python virtual machine. Fortunately, the same principle can be applied to
many more common instructions. Here's the old BINARY_MULTIPLY:

w = POP();
v = POP();
x = PyNumber_Multiply(v, w);
Py_DECREF(v);
Py_DECREF(w);
PUSH(x);
if (x != NULL) continue;

and the new one:

w = POP();
v = TOP();
TOP() = PyNumber_Multiply(v, w);
Py_DECREF(v);
Py_DECREF(w);
if (TOP() != NULL) continue;
x = NULL; /* set for error handler */

The general idea to decrease pushes and pops is to pop down to the level you
want at the completion, then operate on the top of the stack from then on.
Note that assignment to x above is deferred until we know there is an error,
a relatively rare occurrence one would hope.
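
The same idea applied to a unary opcode means not touching the stack pointer
at all; a sketch (not from the actual patch, with the error path simplified):

    case UNARY_NEGATIVE:
            v = TOP();
            x = PyNumber_Negative(v);
            Py_DECREF(v);
            TOP() = x;              /* overwrite in place; net stack depth unchanged */
            if (x != NULL) continue;
            break;                  /* x is already NULL for the error handler */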

Using Marc-Andre Lemburg's pybench 0.6 as a measuring stick, I found a small
but pretty much across-the-board improvement (negative numbers are faster):

Tests: diff
--------------------------------------
BuiltinFunctionCalls: -10.71%
BuiltinMethodLookup: -2.94%
Comparisons: -10.84%
ConcatStrings: -4.03%
CreateInstances: +0.21%
CreateStringsWithConcat: -1.18%
DictCreation: -8.44%
ForLoops: -3.50%
IfThenElse: -14.44%
ListSlicing: -1.15%
NestedForLoops: -7.02%
NormalClassAttribute: -14.45%
NormalInstanceAttribute: -1.78%
PythonFunctionCalls: -3.13%
PythonMethodCalls: -7.93%
Random: -5.30%
Recursion: -1.71%
SecondImport: -3.83%
SecondPackageImport: -3.13%
SecondSubmoduleImport: -2.71%
SimpleComplexArithmetic: -8.55%
SimpleDictManipulation: -18.05%
SimpleFloatArithmetic: -4.76%
SimpleIntFloatArithmetic: -3.42%
SimpleIntegerArithmetic: -1.78%
SimpleListManipulation: -9.73%
SimpleLongArithmetic: -4.25%
SmallLists: -7.56%
SmallTuples: -5.20%
SpecialClassAttribute: -12.36%
SpecialInstanceAttribute: -13.00%
StringSlicing: +6.83%
TryExcept: +0.92%
TryRaiseExcept: -0.73%
TupleSlicing: -2.48%

Only three tests slowed down, and only string slicing enough to notice.
Some common operations were sped up significantly.

I made a couple other changes as well which should probably be factored into
the interpreter even if the stack-as-array changes are not adopted. I added
a macro to Include/listobject.h that's analogous to PyTuple_SET_ITEM - to be
used only to initialize list elements. It is used by BUILD_LIST. In the
UNPACK_TUPLE/UNPACK_LIST opcodes PyTuple_Size and PyList_Size calls were
replaced by their macro versions, since at the point they were called their
object types had already been checked. Also, there were several opcodes
(LOAD_CONST, LOAD_NAME, STORE_ATTR, DELETE_ATTR, STORE_GLOBAL, LOAD_GLOBAL,
LOAD_FAST) that either still used break to exit the opcode switch instead of
continue to skip the error checks or made a superfluous error test, even
though no error was possible at that point. Finally, strictly as a matter
of personal taste, I replaced a couple direct object accesses with their
PyAPI equivalents, e.g.

PyList_GET_ITEM(v, i)

instead of

((PyListObject *)v)->ob_item[i].
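
For illustration, the list analogue of PyTuple_SET_ITEM and its use in
BUILD_LIST would look something like the sketch below; the exact macro text is
assumed from the description, not copied from the diff:

    /* Include/listobject.h -- like PyTuple_SET_ITEM, for filling in a
       freshly created list only: it stores without DECREF'ing any old
       item and it steals the reference to v */
    #define PyList_SET_ITEM(op, i, v) (((PyListObject *)(op))->ob_item[i] = (v))

    /* ceval.c -- BUILD_LIST using the macro instead of PyList_SetItem */
    case BUILD_LIST:
            x = PyList_New(oparg);
            if (x != NULL) {
                    for (; --oparg >= 0;) {
                            w = POP();
                            PyList_SET_ITEM(x, oparg, w);
                    }
                    PUSH(x);
                    continue;
            }
            break;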

I placed a context diff in my Python bits web site:

http://www.automatrix.com/~skip/python/

It was a bit challenging to get the context diff right. I tried to create a
patch that was just this change, and didn't include the peephole optimizer
stuff I'm still working on. If I've muffed it up, let me know.

--
Skip Montanaro | Musi-Cal: http://concerts.calendar.com/
sk...@calendar.com | Conference Calendar: http://conferences.calendar.com/


Tim Peters

Jul 6, 1998

[sk...@calendar.com]
> ... Fortunately, the same principle can be applied to

> many more common instructions. Here's the old
> BINARY_MULTIPLY:
>
> w = POP();
> v = POP();
> x = PyNumber_Multiply(v, w);
> Py_DECREF(v);
> Py_DECREF(w);
> PUSH(x);
> if (x != NULL) continue;
>
> and the new one:
>
> w = POP();
> v = TOP();
> TOP() = PyNumber_Multiply(v, w);
> Py_DECREF(v);
> Py_DECREF(w);
> if (TOP() != NULL) continue;
> x = NULL; /* set for error handler */
>
> The general idea to decrease pushes and pops is to pop down to
> the level you want at the completion, then operate on the top
> of the stack from then on.

Cool, Skip! Makes sense. It also takes you half-way to an old
assembly-language interpreter trick <wink>: don't materialize top-of-stack
in memory, but carry it around in a register instead. Then BINARY_MULTIPLY
could look more like

/* top of stack always in topOfStack */
w = topOfStack;
v = POP();
topOfStack = PyNumber_Multiply(v, w);
Py_DECREF(v);
Py_DECREF(w);
if (topOfStack != NULL) continue;
x = NULL; /* set for error handler */

which saves a few more memory references if the machine has enough
registers, and allows the unary ops (including the popular JUMP_IFs) to not
touch the memory stack at all. This is aggressive, though, and requires
carefully changing *all* of eval_code2's cases; needs care to push & pop
topOfStack around (Python-level) calls and context switches too.
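
A sketch of what the stack macros might look like with the top of stack cached
this way (names invented; the empty-stack case and frame entry/exit need the
extra care mentioned above):

    register PyObject *tos;     /* cached top of stack; the memory stack
                                   holds everything *below* it            */
    PyObject *tos_tmp;          /* scratch slot used by POP()             */

    #define PUSH(v) (*stack_pointer++ = tos, tos = (v))   /* spill old top */
    #define POP()   (tos_tmp = tos, tos = *--stack_pointer, tos_tmp)
    #define TOP()   (tos)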

crazy-is-as-crazy-does-ly y'rs - tim

sk...@automatrix.com

Jul 6, 1998

Tim> don't materialize top-of-stack in memory, but carry it around in a
Tim> register instead. Then BINARY_MULTIPLY could look more like

Tim> /* top of stack always in topOfStack */
Tim> w = topOfStack;
Tim> v = POP();
Tim> topOfStack = PyNumber_Multiply(v, w);
Tim> Py_DECREF(v);
Tim> Py_DECREF(w);
Tim> if (topOfStack != NULL) continue;
Tim> x = NULL; /* set for error handler */

Tim> which saves a few more memory references if the machine has enough
Tim> registers, ...

That requires that PUSH and POP have the side effect of copying the new
top-of-stack into a register variable. This wouldn't be too difficult. I
wonder if that would have any effect on my register-starved Pentium (has
Intel ever managed to add more registers to the 8086 architecture? I don't
follow that stuff very closely). Keeping the top-of-stack in a register would
probably cause the compiler to boot something else out of a register in my
case, but it might be fairly useful on a RISC architecture with plenty of
registers. I'd be interested to see whether the changes I posted already have
any benefit on a register-studly RISC machine. I suspect not.

I was a bit worried about replacing register variable accesses with array
accesses (stack_pointer[-1]), but the top-of-stack is likely to be in the
processor's cache all the time anyway, so I figured that penalty wasn't
going to be huge, and reducing the number of operations on the primary
register variables x, v, and w probably allowed the compiler to keep
something else in a register.

Tim Peters

Jul 7, 1998

Tim> don't materialize top-of-stack in memory, but carry it
Tim> around in a register instead. ... saves a few more memory
Tim> references if the machine has enough registers, ...

[sk...@automatrix.com]


> That requires PUSH and POP have the side-effect of copying
> the new top-of-stack into a register variable.

Yes, something like that <wink>.

> This wouldn't be too difficult.

Right, it's really mechanical, but delicate because so tedious! The thing
to remember that will stop most bugs cold is that the *net* change in the
*memory* stack size is the same either way. BINARY_MULTIPLY left the memory
stack one smaller before, so it must leave it one smaller after too. Same
everywhere else, PUSH and POP included.

> I wonder if that would have any effect on my register-starved
> Pentium

There's no predicting what the compiler du jour will do about any change,
but this is a good one & will eventually pay off: even if it saves it to
memory, it gets to save it at a fixed location the compiler chooses for its
own addressing convenience.

> (has Intel ever managed to add more registers to the 8086
> architecture? I don't follow that stuff very closely).

You'll never win this game by trying to out-think the architecture du
jour -- by the time you have any sort of solid handle on it, it will have
changed. The Pentium architecture is moving down the same road the IBM 360
took, with increasingly sophisticated "hidden" hardware to get code much of
the benefits of a larger register set even though the programming model
doesn't expose more registers. But it's not worth thinking about this
unless you're writing a Pentium-specific chunk of code, *and* intend to
rewrite it for every new flavor of Pentium that comes out.

I recently reposted an essay on "reduce the operation count!", and that is--
I'm not kidding --the only way to win cross-platform optimization over time.
Everybody makes this sooooo hard by twisting things to favor some stupid
transient architecture detail (which they most often don't understand
correctly to begin with <snarl/wink>), and the end result is they *lose* the
game over time -- when they could have won it with a fifth the effort <0.2
wink> and a tenth the fear <wink>.

> Keeping top-of-stack in a register would probably cause the
> compiler to boot something else out of a register in my
> case,

It doesn't matter. Can you think of something more valuable than
top-of-stack *to* carry in a register? It's beat on constantly. It's never
a long-term loss to create more opportunities for a compiler to carry things
in registers; the particular release of the particular compiler with the
particular optimization settings you're using today may make suboptimal
choices, and there's nothing you can do about that. So don't worry about it
<wink>.

> but might be fairly useful in a RISC architecture with plenty
> of registers. I'd be interested to see if the changes I posted
> already have any benefit on register-studly RISC machine. I
> suspect not.

It may actually hurt without pushing on to the top-of-stack trick: consider
that BINARY_MULTIPLY made 3 memory stack references before (POP, POP, and
PUSH), but 4 after (a POP and 3 TOPs). On a RISC machine it's *especially*
valuable to eliminate a memory operation. Their ticks are set to the speed
of the fastest register->register ALU operations, and even in-cache loads
will take a cycle-or-two dead-time hit (& worse the RISCier the machine,
like an Alpha) waiting for the cache to figure out it has the beast. And
that's on top of the addressing calculations needed to get the load started
to begin with. As a cross-platform rule of thumb, getting rid of a load is
at least as good as getting rid of three adds -- even if you think it's in
cache.

> I was a bit worried about replacing register variable accesses with
> array accesses (stack_pointer[-1]), but the top-of-stack is likely
> to be in the processor's cache all the time anyway, so I figured
> that penalty wasn't going to be huge,

See above for an opposing view <wink>.

> and reducing the number of operations on the primary register
> variables x, v, and w probably allowed the compiler to keep
> something else in a register.

Yes, it probably wasted a register to hold the computed address of TOP()
across statements (a low-level common subexpression); better to carry the
*value* of TOP() in a register and avoid the more-expensive redundant load.

let-the-compiler-do-its-job-even-when-you're-sure-you-
know-better-because-it-gets-smarter-while-we-just-
get-older<wink>-ly y'rs - tim

John (Max) Skaller

Jul 7, 1998

On Tue, 7 Jul 1998 06:12:49 GMT, "Tim Peters" <tim...@email.msn.com> wrote:
>I recently reposted an essay on "reduce the operation count!", and that is--
>I'm not kidding --the only way to win cross-platform optimization over time.

I agree. And the corollary: make it smaller and simpler
and it will run faster and do more.

This also applies to language design, not just low
level optimisation. The worst case I'm aware of is C++,
in which the useless concept of an lvalue is preserved
just to ensure a few 'errors' are trapped. In fact,
those 'errors' are useful, and if they were permitted,
then the C++ Standard could be reduced in size by 20%,
and many of the rules simplified by 50%. Here are the cases:

x + 1 = 2;
&(x+1);

The same applies to Python. Have a look at the source
code (both C and Python code) for some of the facilities,
and you will sometimes find contorted code there, to implement
'special cases and features'. Removing the special cases and
features would simplify Python, make it more flexible,
and, as a side effect, make it faster.



>let-the-compiler-do-its-job-even-when-you're-sure-you-
> know-better-because-it-gets-smarter-while-we-just-
> get-older<wink>-ly y'rs - tim

:-)

John Max Skaller ph:61-2-96600850
mailto:ska...@maxtal.com.au 10/1 Toxteth Rd
http://www.maxtal.com.au/~skaller Glebe 2037 NSW AUSTRALIA

Tim Peters

Jul 8, 1998

[about carrying top-of-stack "in a register"]

[Vladimir Marangozov]
> Right, and last time I played with this, it wasn't very tedious,
> but as far as I remember, I won nothing by doing so (keeping the
> TOS in a variable).

Ack! Vladimir, you're forcing me to make a pseudo-bold claim: Python would
be a good deal quicker today if the people who have worked on speeding it up
had never undercut their educated intuitions by timing anything <0.43 wink>.
That's just pseudo-bold because it's non-falsifiable <wink>, but I'm less
than half joking (in fact, I believe I was exactly 43% joking, to judge from
the infallible wink-o-meter).

> The main reason being that a lot of opcodes are "unbalanced"
> wrt PUSH and POP. They were mostly 0-1, 1-0, 1-2, 2-1, etc. so
> keeping the TOS in a register required to update that variable for
> every opcode that changes the stack size.

Whether TOS is in memory or a register, it needs to be updated regardless.
That's neutral. The point of keeping it in a register is that, whether the
stack signature is 1->1 or 2->1 or 64->517, it saves one load at the start
and one store at the end of each of those opcode sequences (and for the
common 2->1, that saves half the stack loads and all the stack stores!). If
you saw no change on your particular compiler + optimization + hardware
combination at the time, that's a win: it just means the compiler stored
TOS in memory anyway, so it did no harm and *would* win on a combo that used
a register instead. Getting rid of a load is Major Goodness!

So you don't put in a sensible change that was speed-neutral on your combo,
and Guido doesn't put in sensible changes that are speed-neutral on his
combo, & so on & so on: the end result is that no combo speeds up, despite
that all likely *would* speed up if they got the benefit of the *other*
fellows' timidly rejected good changes.

A *good* change should be made even if timing shows it slowing down on some
combo; if the operation count has been reduced, the timing result is a fluke
due to bad compiler or load-map luck that's probably specific to every
detail of the combo in use, and that will go away by magic anyway.

And if you don't *know* whether a change is good, just ask me: I'll make
something up off the top of my head, and because I'll probably guess right
more than half the time (I used to get paid a lot for this <wink>), Python
will speed up over time without anybody needing to think at all <ahem>.

viewers-encouraged-to-try-this-at-home-ly y'rs - tim

sk...@calendar.com

Jul 9, 1998

[Tim]
...

> A *good* change should be made even if timing shows it slowing down on some
> combo; if the operation count has been reduced, the timing result is a fluke
> due to bad compiler or load-map luck that's probably specific to every
> detail of the combo in use, and that will go away by magic anyway.
>
> And if you don't *know* whether a change is good, just ask me: I'll make
> something up off the top of my head, and because I'll probably guess right
> more than half the time (I used to get paid a lot for this <wink>), Python
> will speed up over time without anybody needing to think at all <ahem>.
...

Tim's message belongs in the FAQ (I have no idea what the FAQ password is
anymore). Optimizers, charge!

--
Skip

Guido van Rossum

Jul 10, 1998

> [Tim]
> ...
> > A *good* change should be made even if timing shows it slowing down on some
> > combo; if the operation count has been reduced, the timing result is a fluke
> > due to bad compiler or load-map luck that's probably specific to every
> > detail of the combo in use, and that will go away by magic anyway.
> >
> > And if you don't *know* whether a change is good, just ask me: I'll make
> > something up off the top of my head, and because I'll probably guess right
> > more than half the time (I used to get paid a lot for this <wink>), Python
> > will speed up over time without anybody needing to think at all <ahem>.
> ...
>
[skip]

> Tim's message belongs in the FAQ (I have no idea what the FAQ password is
> anymore). Optimizers, charge!

It's Spam.

But I'm still not comfortable. I just spent an afternoon trying
various things along these lines. I played with your patches, I moved
LOAD_FAST out of the switch (clearly reducing the operation count), I
inlined every list and tuple operation in the entire source tree that
could safely be inlined, and I ended up with an interpreter that had
roughly the same speed as before. (At one time it was even a lot
slower, but I think that's a glitch.)

Then I disabled the gcc optimizer, and something dawned upon me.
Switching from -O0 to -O2 gives an almost twofold speedup. We all
know that optimizers are very unpredictable. Which means that small
tweaks in the source can easily change what the optimizer does. So, a
more objective way to measure the effect of speedups would be to
measure *without* the C optimizer. But even then I still saw only a
very meagre speedup after all my changes!

At the moment, I think I'm not interested in massive patches that
attempt optimization, even if they have Tim's blessing. I'm even
going to back out of the changes that I made. Why? Because I think
the changes make the code look ugly. Given the marginal speedup, I'd
rather keep the sources stable -- this makes it much easier to accept
other patches that are semantically interesting (like Marc-Andre's
callable-methods patch).

Here's another take. Looking at the dynamic opcode frequencies of
various programs, I see the following. Consistently, the following 9
opcodes make up the top 9: LOAD_FAST, LOAD_CONST, LOAD_ATTR,
LOAD_GLOBAL, STORE_FAST, POP_TOP, SET_LINENO, CALL_FUNCTION, and
JUMP_IF_FALSE. Number 10 is either BINARY_ADD (pystone) or
RETURN_VALUE (other samples).

Looking at the first 9, several are very quick opcodes: LOAD_FAST,
STORE_FAST, LOAD_CONST, POP_TOP, SET_LINENO, and JUMP_IF_FALSE --
these are totally inline, and only involve stack and reference count
operations. The remaining ones can involve real work: LOAD_GLOBAL
does one or two dictionary lookups, CALL_FUNCTION does a *lot* of work
(before it even hits the first instruction of the called function, if
it's a Python function), and LOAD_ATTR is in between -- anything from
a single string compare (list.append) to a function call, but most
likely it's a bunch of dictionary lookups.
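
For reference, LOAD_GLOBAL is roughly the following (paraphrased from memory,
so the macro and field names may not match ceval.c exactly):

    case LOAD_GLOBAL:
            w = GETNAMEV(oparg);                 /* the name object (macro
                                                    name approximate)       */
            x = PyDict_GetItem(f->f_globals, w); /* first dict lookup       */
            if (x == NULL) {
                    x = PyDict_GetItem(f->f_builtins, w); /* second lookup  */
                    if (x == NULL) {
                            PyErr_SetObject(PyExc_NameError, w);
                            break;
                    }
            }
            Py_INCREF(x);
            PUSH(x);
            continue;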

I should also note that these first 10 opcodes account for 80-85% of
the total.

My conclusion: in the interpreter, we should focus on making dictionary
lookups (even!) faster, or on making fewer of them (e.g. through
caching), and on making function calls faster. Another focus area
would be a bytecode optimizer, of course, since it can reduce the
operation count in ways that nothing else can come close to!

In order to understand better where the time goes, we need to use a
good C profiler. Some platforms have very sophisticated profilers
that show you exactly how many times each line (really: each basic
block) was executed during a particular run. This should tell us a
lot on where to focus our efforts next (e.g. which path through
CALL_FUNCTION is most important).

--Guido van Rossum (home page: http://www.python.org/~guido/)


sk...@calendar.com

Jul 11, 1998

">" == Guido

> But I'm still not comfortable. I just spent an afternoon trying various
> things along these lines. I played with your patches, I moved LOAD_FAST
> out of the switch (clearly reducing the operation count), I inlined every
> list and tuple operation in the entire source tree that could safely be
> inlined, and I ended up with an interpreter that had roughly the same
> speed as before. (At one time it was even a lot slower, but I think
> that's a glitch.)

I agree with Guido that performance improvements have to be balanced with
all the other needs of code maintenance and extension. To that end, I see a
couple potential paths:

1. There are some things like removing SET_LINENO opcodes and avoiding all
the error-checking machinery at the bottom of the loop that have been
demonstrated to improve performance without (I think) hurting
readability. These changes should not be discarded. In fact, I think
some of the error-checking avoidance has only been half-done and needs to
be completed. (There should be an addition to the comment at the top of
the opcode switch that indicates when it's safe to 'continue' the loop
instead of 'break'ing from the switch. In the current interpreter there
are still some breaks that can be eliminated, and also some spurious
tests before continues...)

2. Tinkering with experimental optimizations should continue. Obviously,
not all optimization attempts will succeed, or they will improve things
to such a small degree that their obfuscatory properties overshadow their
studliness properties.

Still, these experiments can yield useful insights. I'll divulge a
couple here in hopes that they don't disqualify the paper I'm writing for
the next Python conference. In my experiments I've been performing a
number of different byte code and interpreter optimizations. Constant
expression evaluation turned out to be a big win for complex number
arithmetic. It turns out that the byte code compiler doesn't generate
complex constants but complex expressions, so even if my optimizations
never see the light of day, I can say with some certainty that people
doing complex math should assign complex constants to variables and use
the variables instead, especially if they appear in loops:

    import cmath
    x = 3+2j
    sum = 0
    for i in xrange(10000):
        sum = sum + cmath.sin(x*i)

I've also been fiddling with stack manipulations to try and reduce them
to the minimum needed for any given operation, and also to keep a copy of
the top of stack in a register variable instead of always accessing it in
the array that implements the stack. This has yielded good benefits
across the board, and I haven't gone quite all the way possible. I'm
currently caching the top of stack in a register variable, but I could
store it in a register and never on the stack itself. (This makes for
some problems with people trying to do garbage collection, since they
can't get at the top of the stack during the mark phase.)

This approach doesn't necessarily make the code any harder to
maintain. It just requires a slightly different mental approach to stack
maintenance.

3. I still suspect, though Tim will probably chide me, that some things are
going to work well on some architectures and not so well on others. I'm
doing all my current fiddling on a 100MHz Pentium. Guido has probably
been doing most of his experimentation on MIPS or SPARC processors. We
know time marches forward and things change, but today at least, I
suspect Guido's toys have more registers than my toys. It's possible
that my optimizations will work well on Pentiums but not on more RISCy
architectures. Who knows? Maybe there should be a configure
switch... ;-)

Also, compilers' abilities to optimize things have been advancing. I'm
using the Cygnus EGCS compiler, but still with fairly piddly -O2
optimization. I know that for some programs, like XEmacs, people have been
using much more aggressive optimization with good results.

I've taken a (hopefully short) hiatus from my experimentation to try and
build a Tk-based front-end to Marc-Andre Lemburg's nice pybench benchmarking
suite. One of the things I noticed was that I found I was losing track of
just what settings I was using from run-to-run. Sometimes I needed to
recompile Python with other flags. Sometimes I needed to run it with
different optimization levels. Sometimes I needed to blast the .pyo files
so different optimization settings would take effect. My thought is that by
generating a uniform interface to all this machinery I can improve my test
repeatability, and allow others to reproduce my experiments on other
architectures, simply by sending them my configuration file and having them
load it into their version of pybench.

> Another focus area would be a bytecode optimizer, of course, since it can
> reduce the operation count in ways that nothing else can come close to!

Another observation about that. I've been trying out various optimizations
based upon patterns I see in the byte code, on what other people (like Tim)
have suggested, and from what I've found in previous efforts like xfreeze.
I'm certainly no pro compiler optimization whiz. When I make my peephole
optimizer available, my main hope is that it will make it easy for others to
add new optimizations. Who knows? It may even provide a few senior
undergraduates with the basis for a senior thesis or three.

--
Skip Montanaro | Did you know you can advertise on Musi-Cal for as little
sk...@calendar.com | as $20? http://concerts.calendar.com/cgi-bin/genad
(518)372-5583

Tim Peters

Jul 11, 1998

[Guido, to Skip]

> But I'm still not comfortable. I just spent an afternoon
> trying various things along these lines.

The reader can already guess the outcome at this point <wink>, but I'll say
again that for cross-platform optimization it doesn't have to speed up on
your particular platform(s?) to be a win. It should be enough that it
doesn't slow down for you -- Marc-Andre & Vladimir & Skip & ... wouldn't be
treating you to an endless sequence of patches unless it *were* speeding
things up on *their* platforms.

> I played with your patches, I moved LOAD_FAST out of the
> switch (clearly reducing the operation count),

Not at all clear to me: Yes, you get to LOAD_FAST quicker, but every other
opcode slows down by the amount of time taken to deduce it's not a LOAD_FAST
and then jump over LOAD_FAST's case. Right? Depending on whether or not -O
was used, I saw LOAD_FAST account for 20-30% of dynamic opcodes, so 70-80%
were not LOAD_FAST. If so, LOAD_FAST would have to speed up a lot more than
all the others slowed down just to break even. E.g., if LOAD_FAST were sped
10 cycles, and all the others slowed by 4 for the test-and-jump
special-casing, it's a net loss at the 20% LOAD_FAST level and just a tiny
gain at the 30% level.

Seemed more likely M-A was seeing a combo-specific cache anomaly with this
one -- it didn't sound like a "good change" to me. For fighting cache rot
it's probably more useful to rearrange the cases so that the pairs commonly
executed back-to-back are adjacent, and then sort the cases by decreasing
order of frequency (just to increase the chance that the popular ones will
be in the same cache line as the switch jump, and decrease the chance that
the popular ones will end up in conflicting cache lines).

> I inlined every list and tuple operation in the entire source tree
> that could safely be inlined, and I ended up with an interpreter that
> had roughly the same speed as before. (At one time it was even a
> lot slower, but I think that's a glitch.)

I certainly hope so <wink>.

> Then I disabled the gcc optimizer, and something dawned upon me.
> Switching from -O0 to -O2 gives an almost twofold speedup. We all
> know that optimizers are very unpredictable. Which means that
> small tweaks in the source can easily change what the optimizer
> does.

By the same token, the next release of the compiler is likely to do quite
different things too even if you don't change anything, unless they've given
up working on their compiler entirely <0.9 wink>. From a long-term
cross-platform view, timings on a specific combo aren't that interesting
apart from their value as sanity checks.

> So, a more objective way to measure the effect of speedups would
> be to measure *without* the C optimizer.

Not a good idea unless you want to release Python with optimization off;
artificially forced stability is not objectivity <0.7 wink>.

> But even then I still saw only a very meagre speedup after
> all my changes!

Would be interesting to ship those changes to one of the others and see what
difference it made on their platforms -- but with optimization on. *That's*
objective <0.5 wink>.

> At the moment, I think I'm not interested in massive patches that
> attempt optimization, even if they have Tim's blessing. I'm even
> going to back out of the changes that I made. Why? Because I think
> the changes make the code look ugly.

Perhaps in aggregate, but is every single one ugly? E.g., what Skip
recently did amounts to a mechanical recasting of how the stack is
implemented, and from what I've seen of his changes the "after" version
looked, if anything, a bit cleaner (a little less code, a little less
clutter).

> Given the marginal speedup, I'd rather keep the sources stable --

Certainly attractions to that too!

> this makes it much easier to accept other patches that are
> semantically interesting (like Marc-Andre's callable-
> methods patch).

M-A, hold on to your wallet <wink>.

> Here's another take. Looking at the dynamic opcode frequencies
> of various programs, I see the following. Consistently, the
> following 9 opcodes make up the top 9: LOAD_FAST, LOAD_CONST,
> LOAD_ATTR, LOAD_GLOBAL, STORE_FAST, POP_TOP, SET_LINENO,
> CALL_FUNCTION, and JUMP_IF_FALSE.
>
> Number 10 is either BINARY_ADD (pystone) or
> RETURN_VALUE (other samples).
>
> Looking at the first 9, several are very quick opcodes: LOAD_FAST,
> STORE_FAST, LOAD_CONST, POP_TOP, SET_LINENO, and JUMP_IF_FALSE --
> these are totally inline, and only involve stack and reference
> count operations.

Indeed, the loop-trip overhead swamps the "real work" they do: mostly
overhead. Note that both branches of a JUMP_IF_FALSE usually go straight to
a POP_TOP, though, so POP_TOP could be gotten off the list by changing
codegen (and new "consuming" flavors of JUMP_IF_xxx could be introduced so
that old .pyc files would continue to run unchanged for some transitional
period). Not needing to execute POP_TOP at all is even better than speeding
it up <wink>.
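
A sketch of such a consuming jump; the opcode name is invented and the error
handling simplified:

    case JUMP_IF_FALSE_AND_POP:         /* hypothetical "consuming" variant */
            w = POP();                  /* consume the condition            */
            err = PyObject_IsTrue(w);
            Py_DECREF(w);
            if (err > 0)
                    ;                   /* true: fall through               */
            else if (err == 0)
                    JUMPBY(oparg);      /* false: take the branch           */
            else
                    break;              /* error from PyObject_IsTrue       */
            continue;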

> The remaining ones can involve real work: LOAD_GLOBAL does one
> or two dictionary lookups, CALL_FUNCTION does a *lot* of work
> (before it even hits the first instruction of the called function, if
> it's a Python function), and LOAD_ATTR is in between -- anything
> from a single string compare (list.append) to a function call, but
> most likely it's a bunch of dictionary lookups.
>
> I should also note that these first 10 opcodes account for 80-85% of
> the total.

It is a marvel, isn't it?

> My conclusion: in the interpreter, we should focus making dictionary
> lookups (even!) faster, or on making fewer of them (e.g. through
> caching),

LOAD_ATTR and LOAD_GLOBAL aren't exploiting interned strings now, correct?

> and on making function calls faster.

I sure hope that doesn't mean someone first has to *understand* all the code
in the CALL_FUNCTION case <wink>. Seriously, function/method calls in
Python are seriously slow, but also seriously flexible (default args,
keyword args, bound method objects, ...) -- usually can't call a function
without incurring a LOAD_ATTR or LOAD_GLOBAL first, too, so it's a double
whammy.

> Another focus area would be a bytecode optimizer, of course,
> since it can reduce the operation count in ways that nothing
> else can come close to!

Incidentally, believe I've mentioned before that changing JUMP_IF_FALSE to
consume its argument would expose more opportunities for a peepholer to
collapse jump-to-jump chains too (the POP_TOP at the branch target now
prevents that).

> In order to understand better where the time goes, we need to use a
> good C profiler. Some platforms have very sophisticated profilers
> that show you exactly how many times each line (really: each basic
> block) was executed during a particular run. This should tell us a
> lot on where to focus our efforts next (e.g. which path through
> CALL_FUNCTION is most important).

Perfectly reasonable. Just don't waste effort timing it afterwards <grin>.

time-is-a-deep-illusion-ly y'rs - tim

Tim Peters

Jul 11, 1998

[Skip]
> [much sensible argument elided throughout]
> ...

> 3. I still suspect, though Tim will probably chide me, that some
> things are going to work well on some architectures and not so
> well on others.

No chide! That's absolutely right. What I've been talking about is
"low-level, cross-platform optimization, over time", and every qualifier
there has crucial effect on the flavor of my recent posts.

The best optimization is total redesign of a subsystem, such as when Guido
replaced LOAD_LOCAL with LOAD_FAST; those can yield major speedups on every
platform, but are hard to sell because delicate and potentially disruptive.

The next best is "mere" algorithm refinement, such as the kinds of things
already done to speed dict lookups. It sounds like Guido wants to keep his
focus on this level, and that's fine. Changes at this level often yield
speedups more platform-dependent than one hopes, though; e.g., the relative
speed of a successful strcmp still varies widely across platforms, so the
benefits of further exploiting string interning will be a mixed bag.

Low-level, cross-platform optimization, over time, is the bottom-feeder of
optimization approaches, and is most platform-dependent in *outcome* of all.
Changes that help on a Pentium today may not help at all on a SPARC today,
or vice versa, or even wrt a Pentium clone. Thinking about "operation
count" generally saves you from fiddles that will *hurt* anyone
significantly, though, and over time the differences among architectures and
compilers and libraries tend to even out -- but toward the best, not toward
the least common denominator. So a change in this category is a success if
it speeds up on *some* platform -- if the change was made for a solid "op
count" reason, other platforms will eventually benefit too (or vanish! in
which case who cares <wink>).

All approaches should be considered. The last one is the easiest to pursue,
so anyone can play (and it does appear that absolutely everyone who takes a
whack at it quickly finds *some* little fiddle that helps on their
platform -- except for Guido, who seems to have gotten his platform stuck in
a deep local minimum about a year ago <0.9 wink>).

> ...


> It's possible that my optimizations will work well on Pentiums but
> not on more RISCy architectures. Who knows?

My answer to that is a non-cynical "who cares?". The essence of
"cross-platform" is *ignoring* the platform du jour, creating better
opportunities for the best platforms over time.

> Maybe there should be a configure switch... ;-)

Then you're into platform-*specific* fiddling, and that's the mother of all
bottomless pits. OTOH, in a world where 95% of all CPUs are Pentiums, maybe
we should just ignore CNRI's machines <wink>.

or-forget-optimization-and-go-back-to-something-interesting-ly y'rs - tim

Vladimir Marangozov

Jul 11, 1998, to Tim Peters

[Tim]

>
> or-forget-optimization-and-go-back-to-something-interesting-ly y'rs - tim

I almost did. In memory of ceval.c, and out of sympathy for the main
loop, I made a small nostalgic collection of the ceval.c tweaks I played
with in the past. If a ceval.c committee is ever formed to rewrite this
file, some of that stuff will be handy; it's at:

http://starship.skyport.net/~vlad/cevalc/

[Guido]


>
> In order to understand better where the time goes, we need to use a
> good C profiler. Some platforms have very sophisticated profilers
> that show you exactly how many times each line (really: each basic
> block) was executed during a particular run. This should tell us a
> lot on where to focus our efforts next (e.g. which path through
> CALL_FUNCTION is most important).

There's some data at: http://starship.skyport.net/~vlad/tprof/

I'm willing to profile custom scripts with custom patches applied to
the distribution if people are interested in the results.
--
Vladimir MARANGOZOV | Vladimir....@inrialpes.fr
http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252

Vladimir Marangozov

Jul 12, 1998, to Guido van Rossum

[Guido]

>
> My conclusion: in the interpreter, we should focus making dictionary
> lookups (even!) faster, or on making fewer of them (e.g. through
> caching), and on making function calls faster. Another focus area

> would be a bytecode optimizer, of course, since it can reduce the
> operation count in ways that nothing else can come close to!
>
> In order to understand better where the time goes, we need to use a
> good C profiler.

Indeed. Here's a last comment and a hunch from me on this topic.

The micro-profiler shows that most of the time is spent in the main loop
and that the latter has 3 main bottlenecks:

1) the tstate->ticker test
2) the next opcode/oparg scan, notably oparg = NEXTARG()
3) the switch(opcode) statement

After several attempts to tweak the above, the conclusion is that
without some algorithmic changes, we cannot achieve worthwhile optimizations.

That said, here's another possible focus for optimization for volunteers
with some free time (like Guido ;-). It is relevant to 2).

The idea: encode opcode arguments whose value is less than 256
in one byte only (instead of two bytes).

Benefits: a) reduced size of .pyc files
          b) a *lot* of memory-lookup savings in 2)

I'll leave the reader to deduce the fairly radical changes that this
(algorithmic) change would imply.
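
One conceivable way to decode such a stream, purely as an illustration (this
is not necessarily the scheme Vladimir has in mind): arguments default to a
single byte, and an invented escape opcode supplies a high byte for the rare
values of 256 or more:

    opcode = *next_instr++;
    oparg = 0;
    if (opcode == BIG_ARG) {            /* invented escape opcode        */
            oparg = (*next_instr++) << 8;
            opcode = *next_instr++;
    }
    if (HAS_ARG(opcode))
            oparg |= *next_instr++;     /* the common one-byte argument  */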

Now to endorse the idea, here are some stats from a script that I ran
over the standard precompiled library files (.pyc) in order to have a
first impression on what we can expect:

Total: 134 .pyc files processed
1194798 bytes vs. 1108907 bytes if small args were one byte, gain: 7.2%
136936 opcodes, 95067 have args (69.4%), 85891 fit in one byte (90.3%)

This is only a linear scan of the opcodes in the code objects stored
in the .pyc files. Thus, at execution time, we can expect some good
speedup, due to fewer memory accesses, especially when we know that
most of the frequently used opcodes have an argument.

For those who are interested, the script (argbyte.py) is at:
http://starship.skyport.net/~vlad/argbyte/

sk...@calendar.com

Jul 13, 1998

[Vladimir]

> The idea: encode opcode arguments whose value is less than 256
> in one byte only (instead of two bytes).

In the peephole optimizer I've been doing something similar. Since LOAD_FAST
and STORE_FAST are such common opcodes, I smash consecutive pairs of them
into LOAD_TWO_FAST or STORE_TWO_FAST opcodes if their opargs fit into a
single byte (they will unless you have more than 256 local variables). This
doesn't speed things up dramatically, since you do as much work as before,
but it does save a trip around the interpreter loop, and as Vladimir said,
reduces the size of the resulting .pyc file.
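
A sketch of the interpreter case for such a combined opcode; how the two
one-byte indices are packed into the usual two-byte argument is my assumption,
and the error path is simplified:

    case LOAD_TWO_FAST:
            x = GETLOCAL(oparg & 0xff);     /* first local index           */
            w = GETLOCAL(oparg >> 8);       /* second local index          */
            if (x == NULL || w == NULL) {
                    PyErr_SetString(PyExc_NameError,
                                    "undefined local variable");
                    break;
            }
            Py_INCREF(x);
            PUSH(x);
            Py_INCREF(w);
            PUSH(w);
            continue;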

--
Skip Montanaro
(sk...@calendar.com)
