Comments on ProjectPlan etc


Paul Biggar

Mar 30, 2009, 1:24:51 PM
to unladen...@googlegroups.com
Hi folks,

I have some questions/comments on the wiki docs. I've tried to read
everything on the mailing list and in the Wiki, but please excuse me
if I've missed obvious answers. Also, there are an absolute ton of
questions here, so sorry about that too :) It might be best to discuss
them in separate threads, though I didn't want to send 1 email per
question.


Simple questions first:

- I wonder if you can expand on "Moving to a JIT will also allow us to
move Python from a stack-based machine to a register machine". It isn't
obvious why moving to a JIT makes it any easier (though I am not
familiar with Python's code)

- For PICs, when calling functions written in C, do you intend to add
the method headers to the C extension functions using LLVM, or to
simply not use PICs when calling those functions?

- The range example (in ProjectPlan) seems to require analysis at the
bytecode/AST level, but I saw no discussion of what tools you intend
to use for this.

- I'm no Python expert, but the transformation to post-increments from
range() seems unsafe without non-peephole optimizations. What if the
loop body contains i++?

- Why require gcc 4.4 for FDO? FDO has been in gcc for years. What's
special about 4.4?

- I know you're focusing on making a register-based VM soon, but have
you looked at stack caching?

- In detailed plans part 10, you answer your own questions from part
6: profile for expensive functions. Sampling (like in profilers) or
counting iterations of backward branches or function calls are cheap,
accurate and pretty easy, from what I understand.


Longer questions:

- I feel that you haven't defined exactly what you mean by JIT. At
times, I think you mean essentially ahead-of-time native code
generation from bytecode, which occurs Just-in-time for the first
execution. But sometimes I get the impression you're looking at a
mixed-mode interpreter. And occasionally I think profiling for types
and values is involved. And I'm not at all clear how much optimization
is involved. Can you clarify in some way?

- I am very confused about how to use LLVM for optimizing python, or
any scripting language. But my impression is that only very local
optimizations can be performed at the bitcode level, due to the
complexity of Python's bytecode handlers. I would be very surprised if
even type inference or constant propagation would work. I would expect
that analyzing the bytecode would be required instead, and the results
propagated downwards.

- I feel the approach in DirectLlvmIrGeneration is misguided. This is
the approach we used in phc, and it's very error-prone. If I were to do
it again, I'd go directly from the bytecode.
- If I understand correctly, can you not generate bitcode from each
bytecode handler at unladen-swallow build-time, and patch it together
at run-time, to avoid this expense described here?

- Can you discuss how Python2C compares to your plans? It seems very
similar. Their result, if the anecdotal evidence I've read on the
internet is to be believed, is that after a 2x speedup (mostly from
interpreter overhead - but this was 15 years ago) they lost the battle
with the instruction cache and couldn't improve any more.

To comment on what I feel is useful from my work on PHP with phc:
- You seem to have 2 use cases, apps and webapps. Long running apps
obviously benefit from running the JIT alongside the code like
Hotspot. However, for the webapps (ie google app engine), it seems
insane to run one instance of the JIT alongside each web request. If
the plan is to have one instance per machine, then you essentially
have an ahead-of-time compiler with FDO, and plans like inlining based
on the actual run-time types, dynamic deoptimization, Gal-style
tracing, etc are not possible in their standard form.


Sorry again for the length of these questions. Your project sounds
great, and I'm very enthusiastic about its future.

Thanks,
Paul


--
Paul Biggar
paul....@gmail.com

Thomas Wouters

Mar 31, 2009, 1:49:49 PM
to paul....@gmail.com, unladen...@googlegroups.com
On Mon, Mar 30, 2009 at 19:24, Paul Biggar <paul....@gmail.com> wrote:


Hi folks,

I have some questions/comments on the wiki docs. I've tried to read
everything on the mailing list and in the Wiki, but please excuse me
if I've missed obvious answers. Also, there are an absolute ton of
questions here, so sorry about that too :) It might be best to discuss
them in separate threads, though I didn't want to send 1 email per
question.

Then excuse me for not answering all of them :)
 


Simple questions first:

- I wonder if you can expand on "Moving to a JIT will also allow us to
move Python from a stack-based machine to a register machine". It isn't
obvious why moving to a JIT makes it any easier (though I am not
familiar with Python's code)

I think (but I may be mistaken) it meant to say "moving to LLVM codegen will
also allow us ...". We could write our own stack->register translation
(people keep telling me it's easy) but LLVM's already got one. Once we move
off of the CPython stack & local variables (which our current LLVM IR
transcriptions still use) and to LLVM's stack, those functions can
automatically be converted to a register-machine implementation.
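
To make the two shapes concrete (the stack form is what CPython's dis
module actually prints on an interpreter of this era; the register form
is just a hand-written sketch of what an LLVM-style translation might
produce, not anything we generate today):

    import dis

    def f(a, b):
        return a + b

    # On CPython of this era, dis.dis(f) prints the stack-machine form:
    #   LOAD_FAST a; LOAD_FAST b; BINARY_ADD; RETURN_VALUE
    dis.dis(f)

    # A register-machine rendering of the same function collapses the
    # pushes and pops into one three-address instruction:
    #   r2 = add r0, r1
    #   ret r2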

 
- For PICs, when calling functions written in C, do you intend to add
the method headers to the C extension functions using LLVM, or to
simply not use PICs when calling those functions?

I don't know what PIC means here, so I'll leave this for someone else.
 
- The range example (in ProjectPlan) seems to require analysis at the
bytecode/AST level, but I saw no discussion of what tools you intend
to use for this.

It doesn't really require AST-level analysis, since the main problem is
detecting whether range() is still range(). We haven't looked into specific
optimizations like that, though, since first we'll have to see how much we
can get LLVM to do for us (for instance, by rewriting range() in Python or
compiling the existing one to LLVM IR with clang, then letting LLVM optimize
it all away.) I suspect the Python for loop is a *leeetle* too complicated
for LLVM optimizers to grok, but we won't know until we try, and we need to
do the more general work anyway.
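
A minimal sketch of the guard such a specialization would have to sit
behind (Python 3 spelling; this is the idea, not our implementation):

    import builtins

    def sum_to(n):
        # The fast path is only valid while the name 'range' still
        # resolves to the builtin; any shadowing (a module global or a
        # rebound builtin) fails the identity test and takes the
        # generic loop.
        if range is builtins.range:
            total, i = 0, 0
            while i < n:               # specialized counting loop
                total += i
                i += 1
            return total
        total = 0
        for i in range(n):             # generic fallback
            total += i
        return total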

 
- I'm no Python expert, but the transformation to post-increments from
range() seems unsafe without non-peephole optimizations. What if the
loop body contains i++?

Python doesn't have i++, and integers are immutable. Or, if I misunderstand
your confusion, for loops don't work the way you think they do :-) The
loopvar is always newly assigned; we'd just use a temp value (which may be
optimized away by LLVM if the loopvar isn't assigned to in the loop.)
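
Spelled out, this is all a for loop does, which is why rebinding the
loopvar in the body can't perturb the iteration:

    seq, body = range(3), print

    # 'for i in seq: body(i)' is equivalent to:
    it = iter(seq)
    while True:
        try:
            i = next(it)       # the loopvar is freshly rebound each pass
        except StopIteration:
            break
        body(i)                # reassigning i here has no effect on 'it'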


 
- Why require gcc 4.4 for FDO? FDO has been in gcc for years. What's
special about 4.4?

 I haven't been paying attention to the details, but FDO received a lot of
love in 4.4.

- I know you're focusing on making a register-based VM soon, but have
you looked at stack caching?

The huge switch that comprises the standard CPython bytecode dispatcher
isn't exactly amenable to something like that. The vmgen threaded dispatcher
probably is (I haven't looked at it) but we expect to get a much bigger win
just by emitting LLVM IR, which can then do stack caching or full-blown
register-machine at its leisure (or perhaps based on the optimization 
settings we use.) For what it's worth, while implementing some of the opcodes
in LLVM IR I noticed LLVM is already doing some threading and stack caching
for us, even though we haven't spent any time properly configuring the optimization
passes for that.
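
For anyone unfamiliar with the technique, here's a toy dispatcher
(nothing like CPython's real one) with one-register stack caching: the
top of stack lives in a local, so balanced push/pop pairs never touch
the memory stack:

    def run(code, consts):
        # 'tos' caches the top of stack; the list only holds what's
        # beneath it. The initial None sentinel is never consumed by a
        # balanced program.
        stack, tos = [], None
        for op, arg in code:
            if op == "PUSH_CONST":
                stack.append(tos)
                tos = consts[arg]
            elif op == "ADD":
                tos = stack.pop() + tos   # result stays in the cache
        return tos

    assert run([("PUSH_CONST", 0), ("PUSH_CONST", 1), ("ADD", None)],
               [1, 2]) == 3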
 

I'll leave these questions for Collin or Jeffrey, except to point out that we
are actually currently converting the bytecode to LLVM IR, as the first
step. If the later AST-based generation turns out fragile, annoying or crap,
we can stop working on that :)
 
- Can you discuss how Python2C compares to your plans? It seems very
similar. Their result, if the anecdotal evidence I've read on the
internet is to be believed, is that after a 2x speedup (mostly from
interpreter overhead - but this was 15 years ago) they lost the battle
with the instruction cache and couldn't improve any more.

Python2C is essentially what phc is (from what I've seen at your talk.) The
problem with Python2C is that it felt forced to do a naive translation to 
Python/C API calls, and (as far as I remember) did nothing to discover types
or do any kind of dynamic specialization. The LLVM approach is
superficially the same, converting from a bytecode dispatcher to a
lower-level implementation of the actual bytecodes, but LLVM then does more
work for us, and gives us a much more flexible optimization framework.
 
To comment on what I feel is useful from my work on PHP with phc:
 - You seem to have 2 use cases, apps and webapps. Long running apps
obviously benefit from running the JIT alongside the code like
Hotspot. However, for the webapps (ie google app engine), it seems
insane to run one instance of the JIT alongside each web request. If
the plan is to have one instance per machine, then you essentially
have an ahead-of-time compiler with FDO, and plans like inlining based
on the actual run-time types, dynamic deoptimization, Gal-style
tracing, etc are not possible in their standard form.

At Google we basically have just one use-case for Unladen Swallow,
which is 'long running apps', be they webapp or some other kind. Python
in general has a lot of use-cases, but since the lifetime of anything
short-lived will almost certainly be dominated by startup time and code
run once, I doubt short-lived programs will benefit at all (unless we
manage to significantly reduce startup time, which I doubt.)

You seem to think, though, that webapps are necessarily short-lived. In
fact, in Python they are very often not. The WSGI model, which is pretty
prevalent now, makes long-running processes handling requests convenient.
Even so, we're looking at how much of the specialized code and type feedback
we can cache for future execution. It's just too early to tell what
will pay off and what won't -- not only because we have no idea what the
actual costs and benefits will turn out to be (and I don't think anyone
does), but also because the underlying architecture (meaning LLVM) is likely
to evolve in the meantime, hopefully (partly) based on our feedback :)

--
Thomas Wouters <twou...@google.com>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!

Collin Winter

Mar 31, 2009, 6:22:43 PM
to paul....@gmail.com, unladen...@googlegroups.com
On Mon, Mar 30, 2009 at 10:24 AM, Paul Biggar <paul....@gmail.com> wrote:
>
>
> Hi folks,
>
> I have some questions/comments on the wiki docs. I've tried to read
> everything on the mailing list and in the Wiki, but please excuse me
> if I've missed obvious answers. Also, there are an absolute ton of
> questions here, so sorry about that too :) It might be best to discuss
> them in separate threads, though I didn't want to send 1 email per
> question.

No worries :) Happy to answer any and all questions.

> Simple questions first:
>
> - I wonder if you can expand on "Moving to a JIT will also allow us to
> move Python from a stack-based machine to a register machine". It isn't
> obvious why moving to a JIT makes it any easier (though I am not
> familiar with Python's code)

One of the things I initially looked at before LLVM was suggested was
to change CPython's VM from a stack machine to a register machine;
Shi, Gregg, Beatty & Ertl, 2005 saw good results in their experiments
converting a Java VM to a register machine, and we're interested to
see if we can replicate that result with Python. So, LLVM isn't
essential for the stack->register switch, but it makes it more
attractive: CPython's VM has a lot of fetch/dispatch overhead, and
we'd still be paying that cost with a simple conversion to a register
machine. It seemed that the only thing we'd gain (which isn't
inconsiderable) is access to the larger range of compiler
optimizations that a register representation helps open up. With LLVM,
we hope to get that, plus less dispatch overhead.

I suppose you have easier access to David Gregg, though, so I'd be
interested in any experience you have with stack caching.

> - For PICs, when calling functions written in C, do you intend to add
> the method headers to the C extension functions using LLVM, or to
> simply not use PICs when calling those functions?

To be honest, I don't think we've thought about it yet.

> - The range example (in ProjectPlan) seems to require analysis at the
> bytecode/AST level, but I saw no discussion of what tools you intend
> to use for this.

There was some work done for CPython to improve globals/builtins
lookup performance by versioning those dictionaries (see
http://bugs.python.org/issue1616125 if you're curious). We'd like to
go further than that approach and actually use the version numbers to
enable inlining or other Python-specific optimizations (like knowing
what to do with range(some_literal)).
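
As a sketch of the mechanism (the 'version' field is hypothetical here;
the actual patch threads a counter through dict mutation in C):

    class VersionedDict(dict):
        # Toy stand-in for a dict that bumps a counter on mutation.
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.version = 0
        def __setitem__(self, key, value):
            self.version += 1
            super().__setitem__(key, value)

    def cached_global(d, key, cache):
        entry = cache.get(key)
        if entry is not None and entry[0] == d.version:
            return entry[1]              # fast path: one integer compare
        value = d[key]                   # slow path: real dict lookup
        cache[key] = (d.version, value)
        return value

    g = VersionedDict(range=range)
    cache = {}
    cached_global(g, 'range', cache)     # miss: fills the cache
    cached_global(g, 'range', cache)     # hit: version unchanged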

> - I'm no Python expert, but the transformation to post-increments from
> range() seems unsafe without non-peephole optimizations. What if the
> loop body contains i++?

Sure, that translation was more an example of what we could do, rather
than an exact spec of what we will do. The optimization will obviously
have to compensate for that; I originally wrote the example to fit on
a slide and wanted to keep it simple.

> - Why require gcc 4.4 for FDO? FDO has been in gcc for years. What's
> special about 4.4?

FDO in gcc 4.4 has been significantly improved, from what I
understand. I may be misremembering, in which case I'll update the
docs.

> - I know you're focusing on making a register-based VM soon, but have
> you looked at stack caching?

I looked at Ertl's paper on stack caching a few months ago. At the
time I was looking at simple incremental improvements to CPython, and
stack caching looked like it would be fairly tricky to implement given
the constraints of the CPython VM. I'll reread the paper, though, in
light of the new context of our LLVM work. If you know of any other
implementations/papers (Ertl 1995 is the one I've seen), I'd be happy
to add those to my reading list.

> - In detailed plans part 10, you answer your own questions from part
> 6: profile for expensive functions. Sampling (like in profilers) or
> counting iterations of backward branches or function calls are cheap,
> accurate and pretty easy, from what I understand.

Heh, part of that comes from multiple people writing the document at
different points in time. Part 10 is indeed the answer; we're trying
to talk this out as iteratively as possible and not assume too much,
mostly for our own sanity.

> Longer questions:
>
> - I feel that you haven't defined exactly what you mean by JIT.  At
> times, I think you mean essentially ahead-of-time native code
> generation from bytecode, which occurs Just-in-time for the first
> execution. But sometimes I get the impression you're looking at a
> mixed-mode interpreter. And occasionally I think profiling for types
> and values is involved. And I'm not at all clear how much optimization
> is involved. Can you clarify in some way?

Our initial thought was to use LLVM to codegen to machine code always
-- purely as a baseline to improve on -- recompiling as we obtain more
information at runtime. That looks like it will be prohibitively slow
(as we had predicted), so we're now thinking of using either the LLVM
interpreter or the custom Python VM for the first few runs while we
gather information, then using LLVM to produce machine code for hot
functions (aka, the right way). PyPy has done a bunch of work on
generating JITs from interpreters, so there may be ideas there we can
leverage.

I need to push more of this into the project plan.
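
In caricature, the structure we're converging on looks like this
(compile_to_native() is entirely hypothetical; the point is only the
count-then-compile shape):

    HOT_THRESHOLD = 1000
    _counts = {}
    _compiled = {}

    def call_maybe_jitted(fn, *args):
        native = _compiled.get(fn)
        if native is not None:
            return native(*args)             # steady state: machine code
        _counts[fn] = _counts.get(fn, 0) + 1
        if _counts[fn] >= HOT_THRESHOLD:
            # hypothetical entry point into the LLVM-backed compiler
            _compiled[fn] = compile_to_native(fn)
        return fn(*args)                     # cold: plain interpretation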

> - I am very confused about how to use LLVM for optimizing python, or
> any scripting language. But my impression is that only very local
> optimizations can be performed at the bitcode level, due to the
> complexity of Python's bytecode handlers. I would be very surprised if
> even type inference or constant propagation would work. I would expect
> that analyzing the bytecode would be required instead, and the results
> propagated downwards.

We're not particularly looking at type inference, more at type
feedback. That may be a bias on our part since Urs Hoelzle, Jeff Dean,
Craig Chambers et al all work at Google, and I'll admit that proximity
may have unduly influenced which ideas and papers we're looking at.
Brett Cannon tried type inference on Python and it more or less went
nowhere (see http://www.ocf.berkeley.edu/~bac/thesis.pdf). Type
feedback or tracing seem to have much more promise, particularly given
how frustratingly (from our point of view) flexible Python is.
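
The distinction in miniature: feedback just records what actually flows
through a site, with no whole-program analysis (a toy profile, not our
implementation):

    from collections import Counter

    add_site = Counter()    # one profile per call site

    def observed_add(a, b):
        add_site[(type(a), type(b))] += 1   # record, don't infer
        return a + b

    # After warm-up, add_site.most_common(1) names the type pair, e.g.
    # (int, int), worth specializing for -- behind a runtime type guard.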

We may also write some higher-level optimization passes that take
advantage of knowledge about Python, rather than the LLVM bitcode.
That's not to say that the included LLVM passes are useless: Jeffrey
was experimenting with turning them on and off, and even without
giving LLVM more knowledge of Python, the optimization passes he tried
were able to shrink the generated IR by more than half. We'll see what
that does to performance, but we'll now give the LLVM optimization
passes more attention after this result.

(Jeffrey can send you the exact before/after code if you're interested.)

> - I feel the approach in DirectLlvmIrGeneration is misguided. This is
> the approach we used in phc, and it's very error-prone. If I were to do
> it again, I'd go directly from the bytecode.

Interesting. Where does the tendency to err come from? Can you
elaborate more on that? We're not wedded to this approach, especially
if your experience indicates it's a dead end.

>  - If I understand correctly, can you not generate bitcode from each
> bytecode handler at unladen-swallow build-time, and patch it together
> at run-time, to avoid this expense described here?
>
> - Can you discuss how Python2C compares to your plans? It seems very
> similar. Their result, if the anecdotal evidence I've read on the
> internet is to be believed, is that after a 2x speedup (mostly from
> interpreter overhead - but this was 15 years ago) they lost the battle
> with the instruction cache and couldn't improve any more.

FWIW, the anecdotal evidence I've seen pointed to a more modest speed
boost in the range of 15-30%, but I can also believe 2x in certain
workloads. In any case, our initial work will be fairly similar to the
python2c approach; we feel that sort of straightforward translation is
a good way to get a functionally-equivalent version working to use as
a performance/correctness baseline. Where we hope to do better than
python2c:
- gcc doesn't do very well on machine-generated code. From talking to
Maciej Fijalkowski at the PyCon VM summit this past week, PyPy is
running into issues with gcc not optimizing their generated C code as
fully as they'd like, to the point of completely choking on very long
functions.
- My memory of python2c is that beyond the straightforward
translation, it attempted to eliminate some unnecessary stack
operations, but didn't do much more in terms of optimization. Because
it's an upfront compilation system, it's limited to static analysis,
which we don't feel is going to get you very far with Python. We want
to use runtime information to enable dynamic recompilation, in
particular to inline builtins and other operations where possible.

> To comment on what I feel is useful from my work on PHP with phc:
>  - You seem to have 2 use cases, apps and webapps. Long running apps
> obviously benefit from running the JIT alongside the code like
> Hotspot. However, for the webapps (ie google app engine), it seems
> insane to run one instance of the JIT alongside each web request. If
> the plan is to have one instance per machine, then you essentially
> have an ahead-of-time compiler with FDO, and plans like inlining based
> on the actual run-time types, dynamic deoptimization, Gal-style
> tracing, etc are not possible in their standard form.

I think there's a difference in how Python webapps work vs PHP
webapps. My experience of writing PHP websites in a former life is
that the web server actually starts up a new PHP interpreter for every
request, and so in that respect PHP is more akin to a Javascript
engine in its requirements. By contrast, the Python web applications I
have the most experience with generally live for hours, days or weeks
at a time. We believe that this makes it more acceptable to gradually
reoptimize throughout the lifetime of the process.

About PHP, I would have thought that mod_php or similar makes it such
that the PHP interpreter persists from request to request. Does
mod_php reload the source for every request? I know how these Python
webapps work, so I'd be curious how PHP deployments differ and how
that changes your requirements.

> Sorry again for the length of these questions. Your project sounds
> great, and I'm very enthusiastic about its future.

Thanks for your questions! I'm very happy to talk about our plans at
whatever length you want, especially given what you're working on. If
nothing else, it helps sharpen our thinking, and our docs.

Thanks again,
Collin Winter

Alex_Gaynor

Mar 31, 2009, 10:13:53 PM
to Unladen Swallow


On Mar 31, 6:22 pm, Collin Winter <collinwin...@google.com> wrote:
> [...]

The primary issue Brett found with type inferencing was that it was
messing with the CPU cache; is it possible that this is less of an
issue with threaded dispatch?

Alex

Fredrik Lundh

Apr 1, 2009, 4:20:46 AM
to alex....@gmail.com, Unladen Swallow
On Wed, Apr 1, 2009 at 4:13 AM, Alex_Gaynor <alex....@gmail.com> wrote:
>
> The primary issue Brett found with type inferencing was that it was
> messing with the CPU cache; is it possible that this is less of an
> issue with threaded dispatch?

His implementation of the extended bytecode vocabulary messed with
the cache, not the type inference per se.

I agree with Collin that type feedback is the way to go, though.

</F>

Paul Biggar

Apr 4, 2009, 6:31:47 PM
to Thomas Wouters, Collin Winter, unladen...@googlegroups.com
Hi Collin, Thomas,

Thank you both for your very thorough answers. For brevity, I've
combined your emails, and snipped sections where I have nothing
interesting to add.

On Tue, Mar 31, 2009 at 11:22 PM, Collin Winter <collin...@google.com> wrote:
> I suppose you have easier access to David Gregg, though, so I'd be
> interested in any experience you have with stack caching.

I'm not working on interpreters, so I don't know a great deal more
than comes from reading the same papers you have cited. But my
impression is that your approach will not benefit from stack caching.

>> - For PICs, when calling functions written in C, do you intend to add
>> the method headers to the C extension functions using LLVM, or to
>> simply not use PICs when calling those functions?

On Tue, Mar 31, 2009 at 11:22 PM, Collin Winter <collin...@google.com> wrote:
> To be honest, I don't think we've thought about it yet.

(Thomas: Polymorphic Inline Cache, from SELF -
PICs replace calls into the system's method dispatch with a direct
call to the last called method at that point. The called method has a
header added, which checks that the parameters are of the correct
type. If they are not, it calls the system's dispatcher.)

The idea of using PICs came up recently for PHP; however, it's not
really possible because we can't add headers to methods in the
extension libraries. But with LLVM you have the opportunity to do
exactly that, so I would be very interested in seeing how they
work out for ye.
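
For concreteness, here's a toy monomorphic inline cache (a PIC chains a
few of these entries); the "header" is just the type guard at the top,
and 'method_name' is purely illustrative:

    class CallSite:
        def __init__(self, method_name):
            self.method_name = method_name
            self.cached_type = None
            self.cached_method = None

        def call(self, obj, *args):
            if type(obj) is self.cached_type:          # guard ("header")
                return self.cached_method(obj, *args)  # direct call
            # Miss: full dispatch, then re-prime the cache.
            self.cached_type = type(obj)
            self.cached_method = getattr(type(obj), self.method_name)
            return self.cached_method(obj, *args)

    site = CallSite('upper')
    site.call('abc')    # miss: primes the cache with str.upper
    site.call('def')    # hit: one type check, then a direct call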

>> Longer questions:
>>
>> - I feel that you haven't defined exactly what you mean by JIT.  At
>> times, I think you mean essentially ahead-of-time native code
>> generation from bytecode, which occurs Just-in-time for the first
>> execution. But sometimes I get the impression you're looking at a
>> mixed-mode interpreter. And occasionally I think profiling for types
>> and values is involved. And I'm not at all clear how much optimization
>> is involved. Can you clarify in some way?

On Tue, Mar 31, 2009 at 11:22 PM, Collin Winter <collin...@google.com> wrote:
> Our initial thought was to use LLVM to codegen to machine code always
> -- purely as a baseline to improve on -- recompiling as we obtain more
> information at runtime. That looks like it will be prohibitively slow
> (as we had predicted), so we're now thinking of using either the LLVM
> interpreter or the custom Python VM for the first few runs while we
> gather information, then using LLVM to produce machine code for hot
> functions (aka, the right way).

The right way indeed. Mixed-mode interpreters seem ideal for this:
you might need some HotSpot or Jikes papers on your list, though I
don't remember which off the top of my head.


>> - I am very confused about how to use LLVM for optimizing python, or
>> any scripting language. But my impression is that only very local
>> optimizations can be performed at the bitcode level, due to the
>> complexity of Python's bytecode handlers. I would be very surprised if
>> even type inference or constant propagation would work. I would expect
>> that analyzing the bytecode would be required instead, and the results
>> propagated downwards.

> We're not particularly looking at type inference, more at type


> feedback. That may be a bias on our part since Urs Hoelzle, Jeff Dean,
> Craig Chambers et al all work at Google, and I'll admit that proximity
> may have unduly influenced which ideas and papers we're looking at.

If I had them in my vicinity, that's exactly what I'd do. But I
thought that the SELF work used a large amount of inference as well as
feedback? Isn't that where the CPA algorithm came from? I suspect the
answers might be in "Ole Agesen, Urs Hölzle: Type Feedback vs.
Concrete Type Inference: A Comparison of Optimization Techniques for
Object-Oriented Languages. OOPSLA 1995: 91-107", which I haven't read.


> Brett Cannon tried type inference on Python and it more or less went
> nowhere (see http://www.ocf.berkeley.edu/~bac/thesis.pdf).

If I understand Cannon correctly (I hope I'm not confusing his work
with StarKiller or ShedSkin), his major problem was that he was trying
to perform intra-procedural analysis without the types of the
parameters. Given those types from type feedback, I would expect much
better results. My colleague is getting good results from using type
feedback to get the types of parameters to a method (in Lua), and
performing intra-procedural analysis on that basis.


> Type
> feedback or tracing seem to have much more promise, particularly given
> how frustratingly (from our point of view) flexible Python is.

The problem with pure tracing is the need to check your results.
Combining tracing and inference would, in my mind, lead to better
results. (I should qualify this by saying I am not an expert in either
JITs or run-time type inference/feedback).


> We may also write some higher-level optimization passes that take
> advantage of knowledge about Python, rather than the LLVM bitcode.

I think this is important. All of my own work is at this level, and
while it can be tricky since the high-level operations can do so much,
the level of information it gives is far greater than that which can
be gained from LLVM analysis.


> That's not to say that the included LLVM passes are useless: Jeffrey
> was experimenting with turning them on and off, and even without
> giving LLVM more knowledge of Python, the optimization passes he tried
> were able to shrink the generated IR by more than half. We'll see what
> that does to performance, but we'll now give the LLVM optimization
> passes more attention after this result.

That is very cool. You should publish those numbers, if even just on a
blog. There are a lot of people talking about LLVM+scripting, and I
think numbers like this go a long way to validate the approach.


>> - I feel the approach in DirectLlvmIrGeneration is misguided. This is
>> the approach we used in phc, and it's very error-prone. If I were to do
>> it again, I'd go directly from the bytecode.

> Interesting. Where does the tendency to err come from? Can you


> elaborate more on that? We're not wedded to this approach, especially
> if your experience indicates it's a dead end.

I would not say that it's a dead end. After all, I'm still using it :)
However, I don't think going from the AST gains anything compared to
going from the bytecode.

When translating from the AST to an easier-to-analyse IR, we made many
errors and incorrect assumptions about language semantics. There were
a ton of edge cases which made things more difficult. I imagine you
know Python far better than we knew PHP at that point, though.
However, the bytecode is already at the right level to do analysis on,
so I see little to gain in analysing the AST.


>>  - If I understand correctly, can you not generate bitcode from each
>> bytecode handler at unladen-swallow build-time, and patch it together
>> at run-time, to avoid this expense described here?

I think everyone skipped this question?


>> - Can you discuss how Python2C compares to your plans? It seems very
>> similar. Their result, if the anecdotal evidence I've read on the
>> internet is to be believed, is that after a 2x speedup (mostly from
>> interpreter overhead - but this was 15 years ago) they lost the battle
>> with the instruction cache and couldn't improve any more.

On Tue, Mar 31, 2009 at 6:49 PM, Thomas Wouters <twou...@google.com> wrote:
> Python2C is essentially what phc is (from what I've seen at your talk.) The
> problem with Python2C is that it felt forced to do a naive translation to
> Python/C API calls, and (as far as I remember) did nothing to discover types
> or do any kind of dynamic specialization. The LLVM approach is
> superficially the same, converting from a bytecode dispatcher to a
> lower-level implementation of the actual bytecodes, but LLVM then does more
> work for us, and gives us a much more flexible optimization framework.

What more does it do, other than optimizing the resulting code?


On Tue, Mar 31, 2009 at 11:22 PM, Collin Winter <collin...@google.com> wrote:
> Where we hope to do better than
> python2c:
> - gcc doesn't do very well on machine-generated code. From talking to
> Maciej Fijalkowski at the PyCon VM summit this past week, PyPy is
> running into issues with gcc not optimizing their generated C code as
> fully as they'd like, to the point of completely choking on very long
> functions.

I hear GHC experiences the long-function problem too. We haven't yet,
but I'm sure inlining will lead to problems. I wasn't aware that gcc
did badly on generated code, though - have you any examples?


> You seem to think, though, that webapps are necessarily short-lived. In
> fact, in Python they are very often not. The WSGI model, which is pretty
> prevalent now, makes long-running processes handling requests convenient.

> I think there's a difference in how Python webapps work vs PHP
> webapps.

Ah, that does change the game. But I presume you must have many of
these processes running at once on the web-server? Does it change
things to the point that you're willing to have a JIT running
alongside each Python process?


> By contrast, the Python web applications I
> have the most experience with generally live for hours, days or weeks

> at a time. We believe that this makes it more acceptable to gradually
> reoptimize throughout the lifetime of the process.

Do you expect the feedback from one web request to be useful to the
next? I'm thinking of cases where the next user has a different
localisation setting, or is performing a different operation (or I may
totally misunderstand the model). I expect you can't know the answer at
this stage, but your thoughts would interest me.


> About PHP, I would have thought that mod_php or similar makes it such
> that the PHP interpreter persists from request to request. Does
> mod_php reload the source for every request? I know how these Python
> webapps work, so I'd be curious how PHP deployments differ and how
> that changes your requirements.

There is a persistent interpreter, but mod_php is threaded (not having
a GIL), and so does not need something like WSGI (if I understand what
you've said correctly). mod_php typically reloads the source for every
request. However, bytecode caches like APC exist to avoid the overhead
of parsing on each request.

phc's model is to call the __MAIN__ function for each PHP page, when
it is requested. This is compiled in advance, so it doesn't need a lot
of the things that PHP does, but it changes the deployment model
slightly.

Thanks again, Collin and Thomas, for your detailed answers. The
differences in the problems we're solving are particularly
interesting.

Collin Winter

Apr 6, 2009, 4:42:24 PM
to Paul Biggar, Thomas Wouters, unladen...@googlegroups.com
On Sat, Apr 4, 2009 at 3:31 PM, Paul Biggar <paul....@gmail.com> wrote:
> Hi Colin, Thomas,
>
> Thank you both for your very thorough answers. For brevity, I've
> combined your emails, and snipped sections where I have nothing
> interesting to add.
>
>>> - I am very confused about how to use LLVM for optimizing python, or
>>> any scripting language. But my impression is that only very local
>>> optimizations can be performed at the bitcode level, due to the
>>> complexity of Python's bytecode handlers. I would be very surprised if
>>> even type inference or constant propagation would work. I would expect
>>> that analyzing the bytecode would be required instead, and the results
>>> propagated downwards.
>
>> We're not particularly looking at type inference, more at type
>> feedback. That may be a bias on our part since Urs Hoelzle, Jeff Dean,
>> Craig Chambers et al all work at Google, and I'll admit that proximity
>> may have unduly influenced which ideas and papers we're looking at.
>
> If I had them in my vicinity, that's exactly what I'd do. But I
> thought that the SELF work used a large amount of inference as well as
> feedback? Isn't that where the CPA algorithm came from? I suspect the
> answers might be in "Ole Agesen, Urs Hölzle: Type Feedback vs.
> Concrete Type Inference: A Comparison of Optimization Techniques for
> Object-Oriented Languages. OOPSLA 1995: 91-107", which I haven't read.

I know SELF-91 used a lot of type analysis, whereas SELF-93 opted for
feedback and recompilation. In the realm of inference vs feedback, I'm
going mostly off "Adaptive Optimization in SELF" Hoelzle 1994 (section
7.5.4 is relevant here). I've just now skimmed the Agesen & Hoelzle
paper, and it looks interesting; I'll try to sit down and read the
whole thing this afternoon.

>> Brett Cannon tried type inference on Python and it more or less went
>> nowhere (see http://www.ocf.berkeley.edu/~bac/thesis.pdf).
>
> If I understand Cannon correctly (I hope I'm not confusing his work
> with StarKiller or ShedSkin), his major problem was that he was trying
> to perform intra-procedural analysis without the types of the
> parameters. Given those types from type feedback, I would expect much
> better results. My colleague is getting good results from using type
> feedback to get the types of parameters to a method (in Lua), and
> performing intra-procedural analysis on that basis.

Yes, you're pretty much right. Cannon's other hurdle was that
CPython's bytecode required custom opcodes for each class of
inference, which increases icache pressure. We observed this during
our 2009Q1 phase: the LIST_APPEND opcode was implemented to make list
comprehensions faster, and while this made microbenchmarks faster, it
slowed down overall program execution by 3-5%. It's interesting to
note that removing opcodes doesn't make the situation better: I
removed 12 rarely-used opcodes, replacing them with function calls,
and this was performance-neutral, if not very slightly
performance-negative (see
http://code.google.com/p/unladen-swallow/source/detail?r=85). MAO
(http://code.google.com/p/mao) may help this somewhat by massaging the
generated assembly to make the hardware happier.

Brett is subscribed to the list, so he may be able to chime in on this issue.

I predict that we'll end up using both inference and feedback, but
implement feedback first.

>> That's not to say that the included LLVM passes are useless: Jeffrey
>> was experimenting with turning them on and off, and even without
>> giving LLVM more knowledge of Python, the optimization passes he tried
>> were able to shrink the generated IR by more than half. We'll see what
>> that does to performance, but we'll now give the LLVM optimization
>> passes more attention after this result.
>
> That is very cool. You should publish those numbers, if even just on a
> blog. There are a lot of people talking about LLVM+scripting, and I
> think numbers like this go a long way to validate the approach.

I'm in the process of setting up a blog for the project. I'll be sure
to include this. I also need to look at what optimization passes
MacRuby is using, since they're using LLVM, too.

>>>  - If I understand correctly, can you not generate bitcode from each
>>> bytecode handler at unladen-swallow build-time, and patch it together
>>> at run-time, to avoid this expense described here?

Conceivably, yes, we could do that. I looked at Ertl & Gregg's dynamic
superinstructions for our phase 1 work, which sounds like a variant of
what you're proposing (see "Optimizing indirect branch prediction
accuracy in virtual machine interpreters", Ertl & Gregg, 2003, esp.
section five). We're hoping that we can generate better code than that
approach would yield.
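
Sketched in Python rather than bitcode, the stitching you describe looks
roughly like this: each handler becomes a code template, and "compiling"
a function is pasting the templates for its bytecode together, so all
dispatch overhead disappears (a toy, not our planned implementation):

    TEMPLATES = {
        "LOAD_CONST":   "stack.append(consts[{arg}])",
        "BINARY_ADD":   "b = stack.pop(); a = stack.pop(); stack.append(a + b)",
        "RETURN_VALUE": "return stack.pop()",
    }

    def stitch(bytecode):
        lines = ["def compiled(consts):", "    stack = []"]
        for op, arg in bytecode:
            lines.append("    " + TEMPLATES[op].format(arg=arg))
        namespace = {}
        exec("\n".join(lines), namespace)
        return namespace["compiled"]

    add = stitch([("LOAD_CONST", 0), ("LOAD_CONST", 1),
                  ("BINARY_ADD", None), ("RETURN_VALUE", None)])
    assert add([1, 2]) == 3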

>>> - Can you discuss how Python2C compares to your plans? It seems very
>>> similar. Their result, if the anecdotal evidence I've read on the
>>> internet is to be believed, is that after a 2x speedup (mostly from
>>> interpreter overhead - but this was 15 years ago) they lost the battle
>>> with the instruction cache and couldn't improve any more.
>
> On Tue, Mar 31, 2009 at 6:49 PM, Thomas Wouters <twou...@google.com> wrote:
>> Python2C is essentially what phc is (from what I've seen at your talk.) The
>> problem with Python2C is that it felt forced to do a naive translation to
>> Python/C API calls, and (as far as I remember) did nothing to discover types
>> or do any kind of dynamic specialization. The LLVM approach is
>> superficially the same, converting from a bytecode dispatcher to a
>> lower-level implementation of the actual bytecodes, but LLVM then does more
>> work for us, and gives us a much more flexible optimization framework.
>
> What more does it do, other than optimizing the resulting code?

The biggest thing is that it permits dynamic recompilation.

> On Tue, Mar 31, 2009 at 11:22 PM, Collin Winter <collin...@google.com> wrote:
>> Where we hope to do better than
>> python2c:
>> - gcc doesn't do very well on machine-generated code. From talking to
>> Maciej Fijalkowski at the PyCon VM summit this past week, PyPy is
>> running into issues with gcc not optimizing their generated C code as
>> fully as they'd like, to the point of completely choking on very long
>> functions.
>
> I hear GHC experiences the long-function problem too. We haven't yet,
> but I'm sure inlining will lead to problems. I wasn't aware that gcc
> did badly on generated code, though - have you any examples?

I don't, but Maciej probably does.

>> You seem to think, though, that webapps are necessarily short-lived. In
>> fact, in Python they are very often not. The WSGI model, which is pretty
>> prevalent now, makes long-running processes handling requests convenient.
>
>> I think there's a difference in how Python webapps work vs PHP
>> webapps.
>
> Ah, that does change the game. But I presume you must have many of
> these processes running at once on the web-server? Does it change
> things to the point that you're willing to have a JIT running
> alongside each Python process?

Looking at Intel and AMD's roadmaps, the industry is going to find the
number of cores per machine increasing faster than other resources
like RAM, disk, and network connectivity. One way of taking advantage
of these extra cores would be to run a mix of applications on the same
machine: some that are CPU-hungry/RAM-light, some that are
RAM-heavy/CPU-light, etc; beyond the sophistication needed in your
scheduler, you'd need to actually possess a suitably heterogeneous
application mix.

Another option is to make existing applications better utilize the
available cores; this is the option we find attractive. We obviously
don't want to suck up extra cores gratuitously, but if we can offload
GC or hotspot optimization to another core, I think that's a win.
Depending on application characteristics, we could make the offloading
adaptive: if the application needs that core, we can dial down how
much we offload, or even defer some of this work to idle (or
relatively idle) times.

>> By contrast, the Python web applications I
>> have the most experience with generally live for hours, days or weeks
>> at a time. We believe that this makes it more acceptable to gradually
>> reoptimize throughout the lifetime of the process.
>
> Do you expect the feedback from one web request to be useful to the
> next? I'm thinking of cases where the next user has a different
> localisation setting, or is performing a different operation (or I may
> totally misunderstand the model). I expect you can't know the answer at
> this stage, but your thoughts would interest me.

I believe that the feedback will be useful from one request to the
next. While the contents of data structures and the values of object
attributes may change from request to request, the types of those
objects probably don't change. The closest experience I have is with
training gcc's FDO for a Python binary: training the FDO with a small
subset of our benchmarks produced speedups for unrelated benchmarks
and workloads. (I can send you the complete data on those experiments
if you're interested.) I would expect us to see similar applicability
for runtime feedback: identical requests benefit the most, but all
requests benefit some.

A lot of that will depend on how homogeneous the application's
requests are, though. On the last project I worked on at Google
(Mondrian: http://www.niallkennedy.com/blog/2006/11/google-mondrian.html),
we had some requests that were CPU-bound, others that were RAM-bound,
and others that were IO-bound. That said, they all shared common
serialization code for going to and from protocol buffers, and that
sort of code will probably benefit the most, since it will show up as
remarkably hot. Even in the face of fairly heterogeneous requests, if
QPS gets high enough, all requests become important.

Thanks,
Collin
