LLVM compilation and codegen speed

Jeffrey Yasskin

Mar 26, 2009, 1:32:52 AM
to Unladen Swallow
After presenting at the PyCon VM Summit today, several people
expressed concern that LLVM's codegen wasn't nearly fast enough to be
our only backend. So I figured out how to get a very rough idea of how
much of a problem that'll be. First, stock Python 2.5:

jyasskin-macbookpro15:~/src/unladen-swallow/llvm_bench/obj jyasskin$
python -mtimeit 'exec "def foo(): pass\nfoo()"'
10000 loops, best of 3: 34.5 usec per loop

Stock Python 2.6:
jyasskin-macbookpro15:~/src/unladen-swallow/llvm_bench/obj jyasskin$
python2.6 -mtimeit 'exec "def foo(): pass\nfoo()"'
10000 loops, best of 3: 56.1 usec per loop

Next, the code checked in to the llvm-working branch can JIT a really
simple for loop ("for i in r: pass"), without any of the tracing,
signal handling, or even stack popping on exit that slows down the
EvalFrameEx loop. On an opt build (no --with-pydebug, which builds
LLVM in Release-Asserts mode), with a few bugs corrected (_Py_Dealloc
is just a macro in opt mode; oops), this runs the for loop about 3x as
fast as the stock eval loop. I'm using the JIT's "slow" mode and the
following optimization passes:

quick_optimizations_.add(new llvm::TargetData(*engine_->getTargetData()));
// Lw: ...1; br Lx ; Lx: ...2 --> Lw: ...1 ...2
quick_optimizations_.add(llvm::createCFGSimplificationPass());
// -> SSA form.
quick_optimizations_.add(llvm::createPromoteMemoryToRegisterPass());
quick_optimizations_.add(llvm::createInstructionCombiningPass());
// Lw: br %cond Lx, Ly ; Lx: br %cond Lz, Lv --> Lw: br %cond Lz, Ly
quick_optimizations_.add(llvm::createJumpThreadingPass());
quick_optimizations_.add(llvm::createDeadStoreEliminationPass());
// Make block ordering a bit less dependent on how the C++ is arranged.
quick_optimizations_.add(llvm::createBlockPlacementPass());
// Make sure the output is still good.
quick_optimizations_.add(llvm::createVerifierPass());

("quick_optimizations" may be wrong. We'll see shortly.)

Just compiling to both bytecode and LLVM IR, but running through the standard interpreter:
jyasskin-macbookpro15:~/src/unladen-swallow/llvm_bench/obj jyasskin$
./python.exe -mtimeit 'exec "def foo(): pass\nfoo()"'
1000 loops, best of 3: 542 usec per loop

And actually running through LLVM's JIT:
jyasskin-macbookpro15:~/src/unladen-swallow/llvm_bench/obj jyasskin$
./python.exe -mtimeit 'exec "def foo():
pass\nfoo.__code__.__use_llvm__=True\nfoo()"'
1000 loops, best of 3: 884 usec per loop

(ick, that'll make startup suck)

Fast codegen; still optimized:
jyasskin-macbookpro15:~/src/unladen-swallow/llvm_bench/obj jyasskin$
./python.exe -mtimeit 'exec "def foo():
pass\nfoo.__code__.__use_llvm__=True\nfoo()"'
1000 loops, best of 3: 825 usec per loop

No optimizations and the fast codegen:
jyasskin-macbookpro15:~/src/unladen-swallow/llvm_bench/obj jyasskin$
./python.exe -mtimeit 'exec "def foo(): pass\nfoo()"'
1000 loops, best of 3: 271 usec per loop
jyasskin-macbookpro15:~/src/unladen-swallow/llvm_bench/obj jyasskin$
./python.exe -mtimeit 'exec "def foo():
pass\nfoo.__code__.__use_llvm__=True\nfoo()"'
1000 loops, best of 3: 667 usec per loop

No optimizations, fast codegen, and omit verifying that the bytecode
is sane (verifyFunction):
jyasskin-macbookpro15:~/src/unladen-swallow/llvm_bench/obj jyasskin$
./python.exe -mtimeit 'exec "def foo(): pass\nfoo()"'
10000 loops, best of 3: 182 usec per loop
jyasskin-macbookpro15:~/src/unladen-swallow/llvm_bench/obj jyasskin$
./python.exe -mtimeit 'exec "def foo():
pass\nfoo.__code__.__use_llvm__=True\nfoo()"'
1000 loops, best of 3: 562 usec per loop
(zomg, who knew that checking for sanity was so expensive)


With no optimizations and fast codegen, we only run the empty
loop about 2x as fast as stock Python:
jyasskin-macbookpro15:~/src/unladen-swallow/llvm_bench/obj jyasskin$
python2.6 -mtimeit -s 'r=range(5000)' -s 'def foo(x):' -s ' for i in
x: pass' 'foo(r)'
10000 loops, best of 3: 159 usec per loop
jyasskin-macbookpro15:~/src/unladen-swallow/llvm_bench/obj jyasskin$
./python.exe -mtimeit -s 'r=range(5000)' -s 'def foo(x):' -s ' for i
in x: pass' 'foo(r)'
10000 loops, best of 3: 155 usec per loop
jyasskin-macbookpro15:~/src/unladen-swallow/llvm_bench/obj jyasskin$
./python.exe -mtimeit -s 'r=range(5000)' -s 'def foo(x):' -s ' for i
in x: pass' -s 'foo.__code__.__use_llvm__=True' 'foo(r)'
10000 loops, best of 3: 77.6 usec per loop

(le sigh)

Once we have .pyc files working, we'll only have to pay the
compilation time once. However, we have to pay for codegen the first
time we run any function (to find where the loader has put Python's
global symbols, at least). We'll probably need to poke the LLVM folks
to build a faster "fast" codegen, or do it ourselves.
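
For reference, the "fast" vs. "slow" codegen toggle used in the timings
above is, if I'm reading the LLVM 2.5 headers right, just the trailing
Fast flag when the execution engine is created. A rough, untested
sketch; module_provider here is a hypothetical ModuleProvider* for our
module:

#include <string>
#include "llvm/ExecutionEngine/ExecutionEngine.h"
#include "llvm/ModuleProvider.h"

// Illustrative only: ask for the fast, less-optimizing code generator.
llvm::ExecutionEngine *CreateFastEngine(llvm::ModuleProvider *module_provider) {
  std::string error;
  return llvm::ExecutionEngine::create(module_provider,
                                       /*ForceInterpreter=*/false,
                                       &error,
                                       /*Fast=*/true);
}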

Jeffrey

Rafael Espindola

Mar 26, 2009, 6:04:55 AM
to jyas...@google.com, Unladen Swallow
> Once we have .pyc files working, we'll only have to pay the
> compilation time once. However, we have to pay for codegen the first
> time we run any function (to find where the loader has put Python's
> global symbols, at least). We'll probably need to poke the LLVM folks
> to build a faster "fast" codegen, or do it ourselves.

Does the benchmark include the time to load the LLVM libraries?
They can be slow to load. Nick did some tests with the GNU hash
and got some improvement. To use it you must pass
--hash-style=gnu or --hash-style=both to the linker.

> Jeffrey

Cheers,
--
Rafael Avila de Espindola

Google | Gordon House | Barrow Street | Dublin 4 | Ireland
Registered in Dublin, Ireland | Registration Number: 368047

Jeffrey Yasskin

Mar 26, 2009, 9:52:12 AM
to Rafael Espindola, Unladen Swallow
On Thu, Mar 26, 2009 at 5:04 AM, Rafael Espindola <espi...@google.com> wrote:
>> Once we have .pyc files working, we'll only have to pay the
>> compilation time once. However, we have to pay for codegen the first
>> time we run any function (to find where the loader has put Python's
>> global symbols, at least). We'll probably need to poke the LLVM folks
>> to build a faster "fast" codegen, or do it ourselves.
>
> Does the benchmark include the time to load the LLVM libraries?
> They can be slow to load. Nick did some tests with the GNU hash
> and got some improvement. To use it you must pass
> --hash-style=gnu or --hash-style=both to the linker.

No, it doesn't. timeit loads everything and then compiles the string
you send it and runs it in a loop (with some details to avoid
measuring loop overhead much). So unless LLVM's libraries reload every
time I create a Module, my measurements won't see loading time.

On the other hand, Jython gets away with even slower startup:
jyasskin-macbookpro15:~/src/unladen-swallow/llvm_bench/obj jyasskin$
~/opt/jython2.5b3/jython -mtimeit 'exec "def foo(): pass\nfoo()"'
100 loops, best of 3: 1.69 msec per loop

And the PyPy guys just said that 10x slower than CPython wasn't a big deal.

Maybe we can get away with it too. At the very least, I won't spend
time worrying about it until the rest of the system is working.

Thanks,
Jeffrey

Thomas Wouters

Mar 26, 2009, 10:19:26 AM
to unladen...@googlegroups.com


On Thu, Mar 26, 2009 at 06:32, Jeffrey Yasskin <jyas...@google.com> wrote:
Just compiling to both, but running through the standard interpreter:
jyasskin-macbookpro15:~/src/unladen-swallow/llvm_bench/obj jyasskin$
./python.exe -mtimeit 'exec "def foo(): pass\nfoo()"'
1000 loops, best of 3: 542 usec per loop

Any idea how much of this is ramp-up time? Can you time a larger block of code, even if it's just a couple hundred empty function definitions?

--
Thomas Wouters <tho...@python.org>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!

Jeffrey Yasskin

Mar 26, 2009, 10:36:29 AM
to unladen...@googlegroups.com
On Thu, Mar 26, 2009 at 9:19 AM, Thomas Wouters <tho...@python.org> wrote:
>
>
> On Thu, Mar 26, 2009 at 06:32, Jeffrey Yasskin <jyas...@google.com> wrote:
>>
>> Just compiling to both, but running through the standard interpreter:
>> jyasskin-macbookpro15:~/src/unladen-swallow/llvm_bench/obj jyasskin$
>> ./python.exe -mtimeit 'exec "def foo(): pass\nfoo()"'
>> 1000 loops, best of 3: 542 usec per loop
>
> Any idea how much of this is ramp-up time? Can you time a larger block of
> code, even if it's just a couple hundred empty function definitions?

Very little should be ramp-up time. Recall that timeit replicates the
code you give it several times and loops over it. I can run the exec
several times before the main loop without changing the results:

jyasskin-macbookpro15:~/src/unladen-swallow/llvm_bench/obj jyasskin$
./python.exe -mtimeit 'exec "def foo(): pass\nfoo()"'

10000 loops, best of 3: 180 usec per loop
jyasskin-macbookpro15:~/src/unladen-swallow/llvm_bench/obj jyasskin$
./python.exe -mtimeit -s 'for i in range(1000): exec "def foo():
pass\nfoo()"' 'exec "def foo(): pass\nfoo()"'
10000 loops, best of 3: 183 usec per loop

If that's not right, could you give me the timeit command line you're
thinking of?

Thomas Wouters

Mar 26, 2009, 10:42:02 AM
to unladen...@googlegroups.com
I'm just wondering if any part of LLVM's compilation phase is constant overhead, which would be less pronounced for larger blocks of code. I'm asking for the speed of:

./python.exe -m timeit -s 's = "def foo(): pass\nfoo()\n" * 1000' 'exec s'

with different values of 1000 :) But I'm in the process of building your stuff on my laptop, so I can play with it myself :)

Jeffrey Yasskin

Mar 26, 2009, 10:47:14 AM
to tho...@python.org, unladen...@googlegroups.com

I'd have said very little was constant overhead, but it turns out to
be quite a bit:

jyasskin-macbookpro15:~/src/unladen-swallow/llvm_bench/obj jyasskin$
./python.exe -m timeit -s 's = "def foo(): pass\nfoo()\n" * 1000' 'exec s'
10 loops, best of 3: 180 msec per loop
jyasskin-macbookpro15:~/src/unladen-swallow/llvm_bench/obj jyasskin$
python -m timeit -s 's = "def foo(): pass\nfoo()\n" * 1000' 'exec s'
10 loops, best of 3: 125 msec per loop
jyasskin-macbookpro15:~/src/unladen-swallow/llvm_bench/obj jyasskin$
~/opt/jython2.5b3/jython -m timeit -s 's = "def foo(): pass\nfoo()\n"
* 1000' 'exec s'
10 loops, best of 3: 240 msec per loop

Never mind then.

Thomas Wouters

Mar 26, 2009, 12:46:38 PM
to Jeffrey Yasskin, unladen...@googlegroups.com
Of course, I forgot the __use_llvm__ setting. My timings are a little different; I'm wondering if I'm doing something wrong. I just did configure; make in the llvm-working branch, and here are the timings:

twouters@phoenixhawk:~/python/unladen/llvm-working$ ./python -mtimeit 's = "def foo(): pass\nfoo.__code__.__use_llvm__=False\nfoo()\n"' 'exec s'
1000 loops, best of 3: 373 usec per loop
twouters@phoenixhawk:~/python/unladen/llvm-working$ ./python -mtimeit 's = "def foo(): pass\nfoo.__code__.__use_llvm__=True\nfoo()\n"' 'exec s'
1000 loops, best of 3: 631 usec per loop

So it's only half the speed, and it seems to be just under linear in code size, although I assume it won't be linear in complexity; we just can't test that yet :)

twouters@phoenixhawk:~/python/unladen/llvm-working$ ./python -mtimeit 's = "def foo(): pass\nfoo.__code__.__use_llvm__ = False\nfoo()\n" * 1000' 'exec s'
10 loops, best of 3: 309 msec per loop
twouters@phoenixhawk:~/python/unladen/llvm-working$ ./python -mtimeit 's = "def foo(): pass\nfoo.__code__.__use_llvm__ = True\nfoo()\n" * 1000' 'exec s'
10 loops, best of 3: 552 msec per loop

(note msec, not usec.)

Talin

Mar 26, 2009, 12:47:01 PM
to jyas...@google.com, Unladen Swallow
If compilation speed is a problem, this might be an area where we can easily take advantage of multiple processors: Have the interpreter continue to interpret the bytecode normally, and at the same time spin off the compilation to a worker thread.

-- Talin

Thomas Wouters

Mar 26, 2009, 1:03:16 PM
to Jeffrey Yasskin, unladen...@googlegroups.com
On Thu, Mar 26, 2009 at 17:46, Thomas Wouters <tho...@python.org> wrote:

twouters@phoenixhawk:~/python/unladen/llvm-working$ ./python -mtimeit 's = "def foo(): pass\nfoo.__code__.__use_llvm__=False\nfoo()\n"' 'exec s'
1000 loops, best of 3: 373 usec per loop
twouters@phoenixhawk:~/python/unladen/llvm-working$ ./python -mtimeit 's = "def foo(): pass\nfoo.__code__.__use_llvm__=True\nfoo()\n"' 'exec s'
1000 loops, best of 3: 631 usec per loop

So it's only half the speed, and it seems to be just under linear in code size, although I assume it won't be linear in complexity; we just can't test that yet :)

twouters@phoenixhawk:~/python/unladen/llvm-working$ ./python -mtimeit 's = "def foo(): pass\nfoo.__code__.__use_llvm__ = False\nfoo()\n" * 1000' 'exec s'
10 loops, best of 3: 309 msec per loop
twouters@phoenixhawk:~/python/unladen/llvm-working$ ./python -mtimeit 's = "def foo(): pass\nfoo.__code__.__use_llvm__ = True\nfoo()\n" * 1000' 'exec s'
10 loops, best of 3: 552 msec per loop

D'oh, I got bitten by the timeit misfeature where you can give it multiple statement strings, so forgetting to say -s doesn't produce an error. The actual timings are:

twouters@phoenixhawk:~/python/unladen/llvm-working$ ./python -mtimeit -s 's = "def foo(): pass\nfoo.__code__.__use_llvm__ = False\nfoo()\n"' 'exec s'
1000 loops, best of 3: 375 usec per loop
twouters@phoenixhawk:~/python/unladen/llvm-working$ ./python -mtimeit -s 's = "def foo(): pass\nfoo.__code__.__use_llvm__ = True\nfoo()\n"' 'exec s'
1000 loops, best of 3: 630 usec per loop

twouters@phoenixhawk:~/python/unladen/llvm-working$ ./python -mtimeit -s 's = "def foo(): pass\nfoo.__code__.__use_llvm__ = False\nfoo()\n" * 1000' 'exec s'
10 loops, best of 3: 285 msec per loop
twouters@phoenixhawk:~/python/unladen/llvm-working$ ./python -mtimeit -s 's = "def foo(): pass\nfoo.__code__.__use_llvm__ = True\nfoo()\n" * 1000' 'exec s'
10 loops, best of 3: 530 msec per loop

So it's a little more linear than I thought.
 

(note msec, not usec.)

--
Thomas Wouters <tho...@python.org>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!

Nick Lewycky

Mar 26, 2009, 1:04:53 PM
to jyas...@google.com, tho...@python.org, unladen...@googlegroups.com
2009/3/26 Jeffrey Yasskin <jyas...@google.com>

So what's taking time? Can you run this through oprofile on Linux or shark on Mac? Or maybe set the "-time-passes" flag (not sure if you're passing LLVM flags through or not; note that this flag applies both to IR optz'ns and codegen).

Nick

Jeffrey Yasskin

Mar 27, 2009, 12:26:26 AM
to nlew...@google.com, tho...@python.org, unladen...@googlegroups.com

Ah, right. The useful data I forgot to collect. We're not passing
flags through to LLVM yet, so here's the shark data. This is with
CFGSimplification and mem2reg running before the fast JIT, which is
the fastest combination I've found.

Overall, compilation takes 38% and JITting takes 53.4% (running is
0.0%). It's hard to distinguish the parts of compilation LLVM is
responsible for from the parts the CPython bytecode is responsible
for, but I suspect that 50% extra time from adding a whole separate
output is pretty reasonable. Furthermore, my current calls to the
IRBuilder are probably highly unoptimized, so it should be possible to
lower the overhead in the compiler further, even ignoring the speedups
.pyc files will give us here. CFGSimplification and mem2reg combined
take about 5% of the total time.

Inside of JITting, FPPassManager::runOnFunction takes nearly all of
the time (52.7%) with MachineFunctionPass and SelectionDAGISel taking
24.9% and 18.8% respectively inside of that.

MachineFunctionPass's major components (>=1% of total runtime) are:
RALinScan: 6.1%
LiveVariables: 4.6%
SimpleRegisterCoalescing: 4.2%
LiveIntervals: 2.1%
DeadMachineInstructionElim: 1.7%
Deleter: 1.7%
Emitter: 1.4%

SelectionDAGISel's major components are:
SelectionDAGISel::CodeGenAndEmitDAG: 5.2%
SelectionDAGISel::SelectBasicBlock: 4.9%
FastISel::SelectInstruction: 2.4%
X86FastISel::TargetSelectInstruction: 2.0%
SelectionDAGISel::LowerArguments: 1.7%


Let me know if I should go back and re-run with -time-passes set.

Jeffrey Yasskin

Mar 27, 2009, 12:41:47 AM
to nlew...@google.com, tho...@python.org, unladen...@googlegroups.com

Keep in mind, this is an extremely simple and unrepresentative
function. I only ran the benchmark to see if codegen time was going to
totally sink us. Since all three other major implementations are
significantly slower than this, I don't think it's an urgent problem,
just something to be aware of if faster codegen comes up anyway on the
LLVM side.

Owen Anderson

Mar 27, 2009, 3:32:01 AM
to unladen...@googlegroups.com
On Mar 26, 2009, at 9:26 PM, Jeffrey Yasskin wrote:
> MachineFunctionPass's major components (>=1% of total runtime) are:
> RALinScan: 6.1%
> LiveVariables: 4.6%
> SimpleRegisterCoalescing: 4.2%
> LiveIntervals: 2.1%
> DeadMachineInstructionElim: 1.7%
> Deleter: 1.7%
> Emitter: 1.4%

You can also try using the local register allocator. I'm not sure
offhand how to enable this programmatically in the JIT, but it's
equivalent to passing -regalloc=local on the llc command line. Doing
so should eliminate the vast majority of this chunk of time.

FWIW, the equivalent of -O0 (codegen-wise, at least) in llvm-gcc is
-fast -regalloc=local.

> SelectionDAGISel's major components are:
> SelectionDAGISel::CodeGenAndEmitDAG: 5.2%
> SelectionDAGISel::SelectBasicBlock: 4.9%
> FastISel::SelectInstruction: 2.4%
> X86FastISel::TargetSelectInstruction: 2.0%
> SelectionDAGISel::LowerArguments: 1.7%

It looks like you're not generating FastISel-friendly IR. Basically,
for a given target, FastISel only supports a subset of all IR
operations, those that can be code-generated for the target in an
absolutely braindead manner. When it encounters something it can't
codegen, FastISel falls back to SelectionDAGISel, which is the slow
path. Obviously from this timing data, it's having to fall back a
lot. I don't have a good summary of what it supports offhand, but you
might try asking on LLVMdev.

At any rate, engineering your IR to use FastISel-supported operations
could radically cut down on this time as well.

--Owen

Jeffrey Yasskin

Mar 27, 2009, 4:15:16 PM
to resi...@mac.com, unladen...@googlegroups.com
On Fri, Mar 27, 2009 at 2:32 AM, Owen Anderson <resi...@mac.com> wrote:
>
>
> On Mar 26, 2009, at 9:26 PM, Jeffrey Yasskin wrote:
>>
>> MachineFunctionPass's major components (>=1% of total runtime) are:
>> RALinScan: 6.1%
>> LiveVariables: 4.6%
>> SimpleRegisterCoalescing: 4.2%
>> LiveIntervals: 2.1%
>> DeadMachineInstructionElim: 1.7%
>> Deleter: 1.7%
>> Emitter: 1.4%
>
> You can also try using the local register allocator.  I'm not sure offhand
> how to enable this programmatically in the JIT, but it's equivalent to
> passing -regalloc=local on the LLC commandline.  Doing so should eliminate
> the vast majority of this chunk of time.
>
> FWIW, the equivalent of -O0 (codegen-wise, at least) in llvm-gcc is -fast
> -regalloc=local.

Thanks, I'll look into enabling this.
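
One cheap approach that might work (untested): feed the flag through
LLVM's own command-line machinery at startup, before the
ExecutionEngine is created, rather than hunting for a real programmatic
API. The argv strings below are made up for illustration:

#include "llvm/Support/CommandLine.h"

// Sketch only: make -regalloc=local the default by pushing it through
// LLVM's option parser. Must run once, before any code is generated.
void ForceLocalRegisterAllocator() {
  static const char *fake_argv[] = { "unladen-swallow", "-regalloc=local" };
  llvm::cl::ParseCommandLineOptions(2, const_cast<char **>(fake_argv));
}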

>> SelectionDAGISel's major components are:
>> SelectionDAGISel::CodeGenAndEmitDAG: 5.2%
>> SelectionDAGISel::SelectBasicBlock: 4.9%
>> FastISel::SelectInstruction: 2.4%
>> X86FastISel::TargetSelectInstruction: 2.0%
>> SelectionDAGISel::LowerArguments: 1.7%
>
> It looks like you're not generating FastISel-friendly IR.  Basically,  for a
> given target, FastISel only supports a subset of all IR operations, those
> that can be code-generated for the target in an absolutely braindead manner.
>  When it encounters something it can't codegen, FastISel falls back to
> SelectionDAGISel, which is the slow path.  Obviously from this timing data,
> it's having to fall back a lot.  I don't have a good summary of what it
> supports offhand, but you might try asking on LLVMdev.
>
> At any rate, engineering your IR to use FastISel-supported operations could
> radically cut down on this time as well.

Thanks for pointing that out. I'll look at FastISel's code before
bugging LLVMdev; I can probably find the list there.

I've started http://wiki.llvm.org/HowTo:_JIT_slowly-running_machine_code_really_quickly
to collect the wisdom you've provided above and any I manage to find
or get from LLVMdev.

Much appreciated!
Jeffrey

Owen Anderson

Mar 27, 2009, 6:03:28 PM
to unladen...@googlegroups.com
On Mar 27, 2009, at 1:15 PM, Jeffrey Yasskin wrote:
>
> Thanks for pointing that out. I'll look at FastISel's code before
> bugging LLVMdev; I can probably find the list there.
>
> I've started http://wiki.llvm.org/HowTo:_JIT_slowly-running_machine_code_really_quickly
> to collect the wisdom you've provided above and any I manage to find
> or get from LLVMdev.

I'm not necessarily advocating that this is an advisable strategy. I
was merely pointing out how, if you really want to, you can generate
code even more quickly at the cost of really crappy generated code.

To be honest, I suspect it will be an overall better strategy to use a
mixed-mode interpreter/JIT. Push hot functions into a workqueue and
JIT them in a background thread, interpreting them until the JITing is
finished. Then you don't have to worry about compile time!
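
Very roughly, I'm picturing something like the sketch below: one
background compiler thread drains a queue of hot code objects while the
eval loop keeps interpreting the bytecode until the machine code shows
up. None of these names exist anywhere in the tree; JitCompile, the
hotness threshold, and the GIL handling are all hypothetical and
hand-waved:

#include <pthread.h>
#include <deque>

struct PyCodeObject;  // stand-in for the real CPython struct

// Hypothetical: generate LLVM IR for the code object, run codegen, then
// flip a flag so the eval loop starts calling the machine code. Elided.
void JitCompile(PyCodeObject *code) { (void)code; }

static std::deque<PyCodeObject *> hot_queue;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t queue_nonempty = PTHREAD_COND_INITIALIZER;

// Called from the eval loop when a function crosses some hotness threshold.
void EnqueueForJit(PyCodeObject *code) {
  pthread_mutex_lock(&queue_lock);
  hot_queue.push_back(code);
  pthread_cond_signal(&queue_nonempty);
  pthread_mutex_unlock(&queue_lock);
}

// Body of the background compiler thread.
void *JitWorkerThread(void *) {
  for (;;) {
    pthread_mutex_lock(&queue_lock);
    while (hot_queue.empty())
      pthread_cond_wait(&queue_nonempty, &queue_lock);
    PyCodeObject *code = hot_queue.front();
    hot_queue.pop_front();
    pthread_mutex_unlock(&queue_lock);
    JitCompile(code);  // interpreter keeps running the bytecode meanwhile
  }
  return 0;
}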

--Owen

Jeffrey Yasskin

Mar 27, 2009, 6:15:04 PM
to resi...@mac.com, unladen...@googlegroups.com

But we do have to worry about maintaining the interpreter and its
input. If we can use LLVM's interpreter for this, those concerns go
away, but I've been told that it doesn't yet support calling into C
functions, and that it's so slow that even the codegen I have is
probably faster. Both of those are fixable, of course, but it'd be
less work if we can just tweak the codegen and get something that's
fast enough. Anyway, that's why I'm chasing this option first. Keeping
the CPython bytecode/interpreter and optimizing LLVM's interpreter are
still on the table.
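
For completeness, picking LLVM's interpreter instead of the JIT is, as
far as I can tell, just the ForceInterpreter flag at engine creation.
A rough, untested sketch; module_provider is again a hypothetical
ModuleProvider* for our module:

#include <string>
#include "llvm/ExecutionEngine/ExecutionEngine.h"
#include "llvm/ModuleProvider.h"

// Illustrative only: build LLVM's IR interpreter rather than a JIT.
llvm::ExecutionEngine *CreateInterpreter(llvm::ModuleProvider *module_provider) {
  std::string error;
  return llvm::ExecutionEngine::create(module_provider,
                                       /*ForceInterpreter=*/true,
                                       &error);
}

The missing piece would still be the interpreter's handling of calls
into arbitrary C functions, which is the limitation I mentioned above.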

Jeffrey
