My experiments with Python AOT compilation with LLVM


Giuseppe Ottaviano

Apr 5, 2009, 12:18:21 PM
to Unladen Swallow
Hi all,
lately I have been experimenting with an idea similar to the one you
are implementing, i.e. replacing the CPython interpreter loop with an
LLVM JIT. Since I have no more time to work on this, I'd like to share
my results with you, in case they are of any help.

What I've done so far is just a compiler that compiles a code object
when PyEval_EvalFrameEx is going to execute it (and then caches the
machine code), by translating the sequence of opcodes into a sequence
of function calls in LLVM. Every function implements an opcode. So it
was just a matter of turning the switch cases of the interpreter loop
into normal C functions in a separate translation unit, and compiling
it with llvm-gcc to a bitcode file that is later loaded by the compiler.
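
[Editor's note: this translation step can be sketched in modern Python
with the dis module — walk a code object's instructions and emit one
call per opcode. The opcode_* handler names are hypothetical, purely to
illustrate the mapping described above.]

```python
import dis

def translate(code):
    # Emit one (handler-name, argument) pair per bytecode instruction,
    # mimicking the opcode-to-function-call translation described above.
    # The "opcode_" prefix mirrors the call targets in the generated IR.
    return [("opcode_" + instr.opname, instr.arg)
            for instr in dis.get_instructions(code)]

def example(a, b):
    return a + b

calls = translate(example.__code__)
names = [name for name, _ in calls]
assert "opcode_RETURN_VALUE" in names  # every function ends in a return
```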

All the opcode functions have the same prototype: they take as an
argument a structure describing the interpreter state, and return an
error code which is used for exception handling (I experimented with
DWARF exceptions, but they gave a ~40x slowdown when soft exceptions
were thrown, i.e. always). An example of the generated code is
appended at the end of this email.
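
[Editor's note: a toy model of that calling convention, with made-up
names — each handler takes the interpreter state plus its oparg and
returns an error code, and the dispatch code's only exception handling
is a check on that return value.]

```python
class InterpState(object):
    # Stand-in for the interpreter_state struct passed to every opcode.
    def __init__(self, consts):
        self.stack = []
        self.consts = consts
        self.exc = None

def opcode_LOAD_CONST(st, arg):
    st.stack.append(st.consts[arg])
    return 0  # 0 = no error

def opcode_BINARY_DIVIDE(st, arg):
    b, a = st.stack.pop(), st.stack.pop()
    if b == 0:
        st.exc = "ZeroDivisionError"
        return 1  # nonzero error code stands in for raising
    st.stack.append(a / b)
    return 0

def run(program, st):
    for op, arg in program:
        if op(st, arg):      # the only exception check: the return code
            return None      # unwind on any nonzero error code
    return st.stack.pop()

st = InterpState(consts=[10, 2])
program = [(opcode_LOAD_CONST, 0), (opcode_LOAD_CONST, 1),
           (opcode_BINARY_DIVIDE, None)]
assert run(program, st) == 5.0
```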

Smaller opcodes are inlined: I noticed that inlining everything gives
huge compilation times, while inlining nothing produces code slower
than the interpreter loop. Your approach of writing LLVM IR from C is
certainly more efficient, because in my "functional" approach a lot of
time for small opcodes is spent (I think) extracting the frame, code,
etc. from the interpreter_state structure. However, I think you should
be careful with "big" opcodes, which could cause the same problems I
had: long compilation times, and slowdowns caused by instruction cache
misses. Calling a function would probably be better (if the opcode is
"big", the function call overhead is negligible).

With the current implementation almost everything works (except
tracing and some opcodes I haven't implemented), including generators,
exceptions, etc. Most unit tests don't run because, if I understood
correctly, doctest needs tracing to work, but I hand-tested the test
cases and they pass (I have also tried some scripts and
applications).
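
[Editor's note: "tracing" here is CPython's sys.settrace hook, which
the stock eval loop checks on every call and line; any replacement
loop has to reproduce it for debuggers (and, per the above, doctest)
to work. A minimal illustration of the hook:]

```python
import sys

events = []

def tracer(frame, event, arg):
    # The interpreter reports 'call', 'line', 'return', ... events here.
    events.append(event)
    return tracer  # returning the tracer enables per-line events

def f(x):
    y = x + 1
    return y * 2

sys.settrace(tracer)
try:
    result = f(3)
finally:
    sys.settrace(None)

assert result == 8
assert "call" in events and "line" in events
```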

However, this implementation is not intended to be developed further:
my idea was to write a JIT extension à la Psyco, but inspired by
LuaJIT: write just a compilation pipeline in C/C++ with an LLVM
backend, and the JIT optimizer in pure Python to keep the code
maintainable (and easy to experiment with) and avoid the Psyco mess.
This is more or less what PyPy is trying to do, but without the
RPython translation chain, and designed for CPython, which still
desperately needs a good JIT.
So this code was just a playground to learn LLVM and evaluate
compilation times; please don't be too picky if you read it :)

Now the benchmarks: with pybench the gain is around 11% on an Intel
Core 2 Duo compiled at 32 bit, somewhat more if compiled at 64 bit on
Linux. Nothing impressive; it is more or less the same gain given by a
threaded interpreter (and I don't think we can shave off anything else
without resorting to runtime optimization). Of course, the gain is
only in the "minimum run-time": on average we lose 10% because of
compilation time (remember that it compiles EVERY function, even the
ones executed only once).
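
[Editor's note: the minimum-vs-average gap above is compile-time
amortization. A toy model — illustrative numbers only, not
measurements from the patch — shows why a function run once loses
while a hot one wins:]

```python
def average_speedup(compile_ms, runs, interp_ms, jit_ms):
    # Ratio of total interpreted time to total JIT time (compile + run).
    # > 1.0 means the up-front compilation has paid for itself.
    interp_total = runs * interp_ms
    jit_total = compile_ms + runs * jit_ms
    return interp_total / jit_total

# A function executed once never amortizes its compilation...
assert average_speedup(compile_ms=5.0, runs=1, interp_ms=1.0, jit_ms=0.89) < 1.0
# ...while a frequently executed one does.
assert average_speedup(compile_ms=5.0, runs=10000, interp_ms=1.0, jit_ms=0.89) > 1.0
```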

On long-running applications I have experienced greater speedups.

The code is available at http://www.di.unipi.it/~ottavian/python-jit-0.01.tar.bz2.
It is based on Python 2.5.2, and was tested on Leopard and Ubuntu
Intrepid 64-bit, both with LLVM 2.5.

Hope this helps,
Giuseppe


-------------------------------------------------------------------------------
PYBENCH 2.0
-------------------------------------------------------------------------------
* using Python 2.5.2
* disabled garbage collection
* system check interval set to maximum: 2147483647
* using timer: time.time

-------------------------------------------------------------------------------
Benchmark: pyjit34.pybench
-------------------------------------------------------------------------------

Rounds: 10
Warp: 10
Timer: time.time

Machine Details:
Platform ID: Darwin-9.6.0-i386-32bit
Processor: i386

Python:
Executable: /Users/ot/work/experimental/Python-2.5.2-exp/python.exe
Version: 2.5.2
Compiler: GCC 4.0.1 (Apple Inc. build 5465)
Bits: 32bit
Build: Mar 9 2009 17:01:42 (#r252:60911)
Unicode: UCS2


-------------------------------------------------------------------------------
Comparing with: py25_vanilla.pybench
-------------------------------------------------------------------------------

Rounds: 10
Warp: 10
Timer: time.time

Machine Details:
Platform ID: Darwin-9.6.0-i386-32bit
Processor: i386

Python:
Executable: /tmp/sandbox_b47ad064-e6fd-11dd-996f-001ec21bf2c7/bin/python
Version: 2.5.2
Compiler: GCC 4.0.1 (Apple Inc. build 5465)
Bits: 32bit
Build: Jan 20 2009 15:31:21 (#r252:60911)
Unicode: UCS2


Test                            minimum run-time           average run-time
                              this    other     diff     this    other     diff
-------------------------------------------------------------------------------
BuiltinFunctionCalls:        108ms   120ms   -10.3%    148ms   135ms   +10.1%
BuiltinMethodLookup:          91ms   105ms   -13.4%    131ms   112ms   +16.4%
CompareFloats:                69ms    78ms   -12.1%     93ms    82ms   +14.0%
CompareFloatsIntegers:        70ms    85ms   -17.6%     93ms    89ms    +4.7%
CompareIntegers:              34ms    81ms   -58.0%     57ms    84ms   -31.8%
CompareInternedStrings:       73ms    85ms   -14.1%     94ms    89ms    +5.8%
CompareLongs:                 63ms    78ms   -19.0%     87ms    80ms    +8.3%
CompareStrings:               53ms    70ms   -24.2%     75ms    73ms    +2.6%
CompareUnicode:               81ms    96ms   -15.1%    103ms    98ms    +5.1%
ConcatStrings:               119ms   120ms    -1.1%    140ms   128ms    +9.7%
ConcatUnicode:                72ms    70ms    +2.6%     94ms    72ms   +29.4%
CreateInstances:             120ms   134ms   -10.6%    134ms   138ms    -2.7%
CreateNewInstances:          108ms   117ms    -7.8%    121ms   120ms    +0.8%
CreateStringsWithConcat:     110ms   117ms    -6.2%    126ms   119ms    +6.0%
CreateUnicodeWithConcat:     121ms   120ms    +0.9%    138ms   123ms   +11.5%
DictCreation:                 79ms    83ms    -4.4%     96ms    84ms   +14.7%
DictWithFloatKeys:            74ms    93ms   -20.6%     90ms    98ms    -7.7%
DictWithIntegerKeys:          67ms    88ms   -23.3%     84ms    89ms    -5.7%
DictWithStringKeys:           61ms    80ms   -23.5%     78ms    82ms    -4.7%
ForLoops:                     37ms    69ms   -46.9%     52ms    70ms   -26.4%
IfThenElse:                   57ms    80ms   -28.6%    211ms    81ms  +160.5%
ListSlicing:                 181ms   111ms   +63.6%    187ms   117ms   +60.2%
NestedForLoops:               64ms    96ms   -33.7%     72ms   111ms   -35.8%
NormalClassAttribute:         95ms   118ms   -19.6%    131ms   121ms    +8.7%
NormalInstanceAttribute:      82ms   117ms   -30.0%    120ms   120ms    -0.2%
PythonFunctionCalls:         101ms    99ms    +1.6%    133ms   102ms   +29.9%
PythonMethodCalls:           127ms   141ms   -10.4%    169ms   145ms   +16.1%
Recursion:                   123ms   130ms    -5.1%    129ms   139ms    -7.0%
SecondImport:                 82ms    79ms    +4.2%     89ms    81ms   +10.6%
SecondPackageImport:          87ms    83ms    +4.6%     93ms    88ms    +6.5%
SecondSubmoduleImport:       110ms   107ms    +2.2%    117ms   115ms    +1.8%
SimpleComplexArithmetic:      78ms   101ms   -22.5%    128ms   103ms   +23.6%
SimpleDictManipulation:       79ms    88ms    -9.9%    113ms    90ms   +25.1%
SimpleFloatArithmetic:        64ms    94ms   -31.8%    115ms    97ms   +18.6%
SimpleIntFloatArithmetic:     54ms    78ms   -30.5%    104ms    80ms   +29.3%
SimpleIntegerArithmetic:      56ms    78ms   -28.3%    105ms    82ms   +29.2%
SimpleListManipulation:       69ms    83ms   -16.5%     96ms    85ms   +12.3%
SimpleLongArithmetic:         87ms   104ms   -16.5%    137ms   108ms   +26.2%
SmallLists:                  103ms   122ms   -14.9%    134ms   130ms    +2.7%
SmallTuples:                 101ms   105ms    -3.8%    135ms   110ms   +23.1%
SpecialClassAttribute:        92ms   116ms   -20.5%    130ms   119ms    +9.3%
SpecialInstanceAttribute:    151ms   179ms   -15.5%    189ms   183ms    +3.1%
StringMappings:              155ms   160ms    -2.9%    168ms   164ms    +2.2%
StringPredicates:            138ms   143ms    -3.7%    159ms   146ms    +8.4%
StringSlicing:               105ms   109ms    -4.0%    117ms   117ms    -0.1%
TryExcept:                    54ms    71ms   -24.1%     95ms    72ms   +31.8%
TryRaiseExcept:               93ms    94ms    -0.1%     99ms    96ms    +3.0%
TupleSlicing:                113ms   112ms    +0.9%    214ms   119ms   +80.4%
UnicodeMappings:              99ms   100ms    -0.3%    112ms   103ms    +9.0%
UnicodePredicates:           107ms   106ms    +0.5%    121ms   109ms   +10.4%
UnicodeProperties:            91ms    98ms    -7.2%    112ms   102ms    +9.9%
UnicodeSlicing:              109ms   117ms    -6.7%    121ms   122ms    -0.9%
-------------------------------------------------------------------------------
Totals:                     4718ms  5308ms   -11.1%   6186ms  5522ms   +12.0%

(this=pyjit34.pybench, other=py25_vanilla.pybench)

GENERATED CODE
==============

for i in [1, 2, 3]:
    if i == 2:
        print i


1 0 SETUP_LOOP 45 (to 48)
3 LOAD_CONST 0 (1)
6 LOAD_CONST 1 (2)
9 LOAD_CONST 2 (3)
12 BUILD_LIST 3
15 GET_ITER
>> 16 FOR_ITER 28 (to 47)
19 STORE_NAME 0 (i)

2 22 LOAD_NAME 0 (i)
25 LOAD_CONST 1 (2)
28 COMPARE_OP 2 (==)
31 JUMP_IF_FALSE 9 (to 43)
34 POP_TOP

3 35 LOAD_NAME 0 (i)
38 PRINT_ITEM
39 PRINT_NEWLINE
40 JUMP_ABSOLUTE 16
>> 43 POP_TOP
44 JUMP_ABSOLUTE 16
>> 47 POP_BLOCK
>> 48 LOAD_CONST 3 (None)
51 RETURN_VALUE

define %struct.PyObject* @bytecode_at_12102064(%struct.PyFrameObject*
%f, %struct.PyThreadState* %tstate, i32 %throwflag) {
entry:
%st = alloca %struct.interpreter_state ; <
%struct.interpreter_state*> [#uses=20]
%0 = getelementptr %struct.interpreter_state* %st, i32 0, i32 0 ; <
%struct.PyFrameObject**> [#uses=4]
store %struct.PyFrameObject* %f, %struct.PyFrameObject** %0, align 8
%1 = getelementptr %struct.PyFrameObject* %f, i32 0, i32 9 ; <
%struct.PyObject***> [#uses=2]
%2 = load %struct.PyObject*** %1, align 4 ; <%struct.PyObject**>
[#uses=1]
%3 = getelementptr %struct.interpreter_state* %st, i32 0, i32 1 ; <
%struct.PyObject***> [#uses=23]
store %struct.PyObject** %2, %struct.PyObject*** %3, align 4
store %struct.PyObject** null, %struct.PyObject*** %1, align 4
%4 = getelementptr %struct.interpreter_state* %st, i32 0, i32 2 ; <
%struct.PyThreadState**> [#uses=2]
store %struct.PyThreadState* %tstate, %struct.PyThreadState** %4,
align 8
%5 = getelementptr %struct.PyFrameObject* %f, i32 0, i32 4 ; <
%struct.PyCodeObject**> [#uses=1]
%6 = load %struct.PyCodeObject** %5, align 4 ; <
%struct.PyCodeObject*> [#uses=3]
%7 = getelementptr %struct.interpreter_state* %st, i32 0, i32 3 ; <
%struct.PyCodeObject**> [#uses=1]
store %struct.PyCodeObject* %6, %struct.PyCodeObject** %7, align 4
%8 = getelementptr %struct.PyCodeObject* %6, i32 0, i32 8 ; <
%struct.PyObject**> [#uses=1]
%9 = load %struct.PyObject** %8, align 4 ; <%struct.PyObject*>
[#uses=1]
%10 = getelementptr %struct.interpreter_state* %st, i32 0, i32 4 ; <
%struct.PyObject**> [#uses=1]
store %struct.PyObject* %9, %struct.PyObject** %10, align 8
%11 = getelementptr %struct.PyCodeObject* %6, i32 0, i32 7 ; <
%struct.PyObject**> [#uses=1]
%12 = load %struct.PyObject** %11, align 4 ; <%struct.PyObject*>
[#uses=1]
%13 = getelementptr %struct.interpreter_state* %st, i32 0, i32 5 ; <
%struct.PyObject**> [#uses=6]
store %struct.PyObject* %12, %struct.PyObject** %13, align 4
%14 = getelementptr %struct.PyFrameObject* %f, i32 0, i32 19, i32
0 ; <%struct.PyObject**> [#uses=1]
%15 = getelementptr %struct.interpreter_state* %st, i32 0, i32 6 ; <
%struct.PyObject***> [#uses=1]
store %struct.PyObject** %14, %struct.PyObject*** %15, align 8
%16 = getelementptr %struct.interpreter_state* %st, i32 0, i32 7 ;
<i32*> [#uses=4]
store i32 1, i32* %16, align 4
%17 = getelementptr %struct.interpreter_state* %st, i32 0, i32 8 ; <
%struct.PyObject**> [#uses=3]
store %struct.PyObject* null, %struct.PyObject** %17, align 8
%dispatch_to_instr = alloca i32 ; <i32*> [#uses=3]
%18 = getelementptr %struct.PyFrameObject* %f, i32 0, i32 15 ;
<i32*> [#uses=1]
%19 = load i32* %18, align 4 ; <i32> [#uses=1]
%20 = add i32 %19, 1 ; <i32> [#uses=1]
store i32 %20, i32* %dispatch_to_instr
%21 = icmp eq i32 %throwflag, 0 ; <i1> [#uses=1]
br i1 %21, label %dispatch, label %gen_throw

gen_throw: ; preds = %entry
store i32 2, i32* %16, align 4
br label %block_end

end: ; preds = %block_end, %dispatch
%22 = load %struct.PyObject** %17, align 8 ; <%struct.PyObject*>
[#uses=1]
ret %struct.PyObject* %22

dispatch: ; preds = %block_end, %entry
%23 = load i32* %dispatch_to_instr ; <i32> [#uses=1]
switch i32 %23, label %end [
i32 0, label %__op_0
i32 3, label %__op_3
i32 6, label %__op_6
i32 9, label %__op_9
i32 12, label %__op_12
i32 15, label %__op_15
i32 16, label %__op_16
i32 19, label %__op_19
i32 22, label %__op_22
i32 25, label %__op_25
i32 28, label %__op_28
i32 31, label %__op_31
i32 34, label %__op_34
i32 35, label %__op_35
i32 38, label %__op_38
i32 39, label %__op_39
i32 40, label %__op_16
i32 43, label %__op_43
i32 44, label %__op_16
i32 47, label %__op_47
i32 48, label %__op_48
i32 51, label %__op_51
]

__op_0: ; preds = %dispatch
%24 = call fastcc i32 @opcode_SETUP_LOOP(%struct.interpreter_state*
%st, i32 0, i32 120, i32 45) ; <i32> [#uses=1]
%25 = icmp eq i32 %24, 0 ; <i1> [#uses=1]
br i1 %25, label %__op_3, label %block_end

__op_3: ; preds = %__op_0, %dispatch
%26 = load %struct.PyObject** %13, align 4 ; <%struct.PyObject*>
[#uses=1]
%27 = bitcast %struct.PyObject* %26 to %struct.PyTupleObject* ; <
%struct.PyTupleObject*> [#uses=1]
%28 = getelementptr %struct.PyTupleObject* %27, i32 0, i32 3, i32
0 ; <%struct.PyObject**> [#uses=1]
%29 = load %struct.PyObject** %28, align 4 ; <%struct.PyObject*>
[#uses=2]
%30 = getelementptr %struct.PyObject* %29, i32 0, i32 0 ; <i32*>
[#uses=2]
%31 = load i32* %30, align 4 ; <i32> [#uses=1]
%32 = add i32 %31, 1 ; <i32> [#uses=1]
store i32 %32, i32* %30, align 4
%33 = load %struct.PyObject*** %3, align 4 ; <%struct.PyObject**>
[#uses=2]
store %struct.PyObject* %29, %struct.PyObject** %33, align 4
%34 = getelementptr %struct.PyObject** %33, i32 1 ; <
%struct.PyObject**> [#uses=1]
store %struct.PyObject** %34, %struct.PyObject*** %3, align 4
br label %__op_6

__op_6: ; preds = %__op_3, %dispatch
%35 = load %struct.PyObject** %13, align 4 ; <%struct.PyObject*>
[#uses=1]
%36 = bitcast %struct.PyObject* %35 to %struct.PyTupleObject* ; <
%struct.PyTupleObject*> [#uses=1]
%37 = getelementptr %struct.PyTupleObject* %36, i32 0, i32 3, i32
1 ; <%struct.PyObject**> [#uses=1]
%38 = load %struct.PyObject** %37, align 4 ; <%struct.PyObject*>
[#uses=2]
%39 = getelementptr %struct.PyObject* %38, i32 0, i32 0 ; <i32*>
[#uses=2]
%40 = load i32* %39, align 4 ; <i32> [#uses=1]
%41 = add i32 %40, 1 ; <i32> [#uses=1]
store i32 %41, i32* %39, align 4
%42 = load %struct.PyObject*** %3, align 4 ; <%struct.PyObject**>
[#uses=2]
store %struct.PyObject* %38, %struct.PyObject** %42, align 4
%43 = getelementptr %struct.PyObject** %42, i32 1 ; <
%struct.PyObject**> [#uses=1]
store %struct.PyObject** %43, %struct.PyObject*** %3, align 4
br label %__op_9

__op_9: ; preds = %__op_6, %dispatch
%44 = load %struct.PyObject** %13, align 4 ; <%struct.PyObject*>
[#uses=1]
%45 = bitcast %struct.PyObject* %44 to %struct.PyTupleObject* ; <
%struct.PyTupleObject*> [#uses=1]
%46 = getelementptr %struct.PyTupleObject* %45, i32 0, i32 3, i32
2 ; <%struct.PyObject**> [#uses=1]
%47 = load %struct.PyObject** %46, align 4 ; <%struct.PyObject*>
[#uses=2]
%48 = getelementptr %struct.PyObject* %47, i32 0, i32 0 ; <i32*>
[#uses=2]
%49 = load i32* %48, align 4 ; <i32> [#uses=1]
%50 = add i32 %49, 1 ; <i32> [#uses=1]
store i32 %50, i32* %48, align 4
%51 = load %struct.PyObject*** %3, align 4 ; <%struct.PyObject**>
[#uses=2]
store %struct.PyObject* %47, %struct.PyObject** %51, align 4
%52 = getelementptr %struct.PyObject** %51, i32 1 ; <
%struct.PyObject**> [#uses=1]
store %struct.PyObject** %52, %struct.PyObject*** %3, align 4
br label %__op_12

__op_12: ; preds = %__op_9, %dispatch
%53 = call fastcc i32 @opcode_BUILD_LIST(%struct.interpreter_state*
%st, i32 12, i32 103, i32 3) ; <i32> [#uses=1]
%54 = icmp eq i32 %53, 0 ; <i1> [#uses=1]
br i1 %54, label %__op_15, label %block_end

__op_15: ; preds = %__op_12, %dispatch
%55 = call fastcc i32 @opcode_GET_ITER(%struct.interpreter_state*
%st, i32 15, i32 68, i32 0) ; <i32> [#uses=1]
%56 = icmp eq i32 %55, 0 ; <i1> [#uses=1]
br i1 %56, label %__op_16, label %block_end

__op_16: ; preds = %dispatch, %__op_43, %bb.i5, %dispatch, %__op_39,
%__op_15, %dispatch
%57 = volatile load i32* @_Py_Ticker, align 4 ; <i32> [#uses=1]
%58 = add i32 %57, -1 ; <i32> [#uses=1]
volatile store i32 %58, i32* @_Py_Ticker, align 4
%59 = volatile load i32* @_Py_Ticker, align 4 ; <i32> [#uses=1]
%.not.i = icmp sgt i32 %59, -1 ; <i1> [#uses=1]
br i1 %.not.i, label %bb3.i, label %bb1.i

bb1.i: ; preds = %__op_16
%60 = load %struct.PyThreadState** %4, align 8 ; <
%struct.PyThreadState*> [#uses=1]
%61 = call fastcc i32 @do_periodic_things(%struct.PyThreadState* %60)
nounwind ; <i32> [#uses=1]
%62 = icmp eq i32 %61, 0 ; <i1> [#uses=1]
br i1 %62, label %bb2.i, label %bb3.i

bb2.i: ; preds = %bb1.i
store i32 2, i32* %16, align 4
%63 = load %struct.PyFrameObject** %0, align 8 ; <
%struct.PyFrameObject*> [#uses=1]
%64 = getelementptr %struct.PyFrameObject* %63, i32 0, i32 15 ;
<i32*> [#uses=1]
store i32 16, i32* %64, align 4
br label %opcode_FOR_ITER.exit

bb3.i: ; preds = %bb1.i, %__op_16
%65 = load %struct.PyObject*** %3, align 4 ; <%struct.PyObject**>
[#uses=1]
%66 = getelementptr %struct.PyObject** %65, i32 -1 ; <
%struct.PyObject**> [#uses=1]
%67 = load %struct.PyObject** %66, align 4 ; <%struct.PyObject*>
[#uses=2]
%68 = getelementptr %struct.PyObject* %67, i32 0, i32 1 ; <
%struct._typeobject**> [#uses=1]
%69 = load %struct._typeobject** %68, align 4 ; <
%struct._typeobject*> [#uses=1]
%70 = getelementptr %struct._typeobject* %69, i32 0, i32 28 ; <
%struct.PyObject* (%struct.PyObject*)**> [#uses=1]
%71 = load %struct.PyObject* (%struct.PyObject*)** %70, align 4 ; <
%struct.PyObject* (%struct.PyObject*)*> [#uses=1]
%72 = call %struct.PyObject* %71(%struct.PyObject* %67) nounwind ; <
%struct.PyObject*> [#uses=2]
%73 = icmp eq %struct.PyObject* %72, null ; <i1> [#uses=1]
br i1 %73, label %bb5.i, label %bb4.i

bb4.i: ; preds = %bb3.i
%74 = load %struct.PyObject*** %3, align 4 ; <%struct.PyObject**>
[#uses=2]
store %struct.PyObject* %72, %struct.PyObject** %74, align 4
%75 = getelementptr %struct.PyObject** %74, i32 1 ; <
%struct.PyObject**> [#uses=1]
store %struct.PyObject** %75, %struct.PyObject*** %3, align 4
br label %opcode_FOR_ITER.exit

bb5.i: ; preds = %bb3.i
%76 = call %struct.PyObject* @PyErr_Occurred() nounwind ; <
%struct.PyObject*> [#uses=1]
%77 = icmp eq %struct.PyObject* %76, null ; <i1> [#uses=1]
br i1 %77, label %bb9.i, label %bb6.i

bb6.i: ; preds = %bb5.i
%78 = load %struct.PyObject** @PyExc_StopIteration, align 4 ; <
%struct.PyObject*> [#uses=1]
%79 = call i32 @PyErr_ExceptionMatches(%struct.PyObject* %78)
nounwind ; <i32> [#uses=1]
%80 = icmp eq i32 %79, 0 ; <i1> [#uses=1]
br i1 %80, label %bb7.i, label %bb8.i

bb7.i: ; preds = %bb6.i
%81 = load %struct.PyFrameObject** %0, align 8 ; <
%struct.PyFrameObject*> [#uses=1]
%82 = getelementptr %struct.PyFrameObject* %81, i32 0, i32 15 ;
<i32*> [#uses=1]
store i32 16, i32* %82, align 4
br label %opcode_FOR_ITER.exit

bb8.i: ; preds = %bb6.i
call void @PyErr_Clear() nounwind
br label %bb9.i

bb9.i: ; preds = %bb8.i, %bb5.i
%83 = load %struct.PyObject*** %3, align 4 ; <%struct.PyObject**>
[#uses=1]
%84 = getelementptr %struct.PyObject** %83, i32 -1 ; <
%struct.PyObject**> [#uses=2]
store %struct.PyObject** %84, %struct.PyObject*** %3, align 4
%85 = load %struct.PyObject** %84, align 4 ; <%struct.PyObject*>
[#uses=3]
%86 = getelementptr %struct.PyObject* %85, i32 0, i32 0 ; <i32*>
[#uses=2]
%87 = load i32* %86, align 4 ; <i32> [#uses=1]
%88 = add i32 %87, -1 ; <i32> [#uses=2]
store i32 %88, i32* %86, align 4
%89 = icmp eq i32 %88, 0 ; <i1> [#uses=1]
br i1 %89, label %bb10.i, label %opcode_FOR_ITER.exit

bb10.i: ; preds = %bb9.i
%90 = getelementptr %struct.PyObject* %85, i32 0, i32 1 ; <
%struct._typeobject**> [#uses=1]
%91 = load %struct._typeobject** %90, align 4 ; <
%struct._typeobject*> [#uses=1]
%92 = getelementptr %struct._typeobject* %91, i32 0, i32 6 ; <void
(%struct.PyObject*)**> [#uses=1]
%93 = load void (%struct.PyObject*)** %92, align 4 ; <void
(%struct.PyObject*)*> [#uses=1]
call void %93(%struct.PyObject* %85) nounwind
br label %opcode_FOR_ITER.exit

opcode_FOR_ITER.exit: ; preds = %bb9.i, %bb2.i, %bb7.i, %bb10.i,
%bb4.i
%94 = phi i32 [ 0, %bb4.i ], [ 2, %bb10.i ], [ 1, %bb7.i ], [ 1,
%bb2.i ], [ 2, %bb9.i ] ; <i32> [#uses=1]
switch i32 %94, label %block_end [
i32 0, label %__op_19
i32 2, label %__op_47
]

__op_19: ; preds = %opcode_FOR_ITER.exit, %dispatch
%95 = call fastcc i32 @opcode_STORE_NAME(%struct.interpreter_state*
%st, i32 19, i32 90, i32 0) ; <i32> [#uses=1]
%96 = icmp eq i32 %95, 0 ; <i1> [#uses=1]
br i1 %96, label %__op_22, label %block_end

__op_22: ; preds = %__op_19, %dispatch
%97 = call fastcc i32 @opcode_LOAD_NAME(%struct.interpreter_state*
%st, i32 22, i32 101, i32 0) ; <i32> [#uses=1]
%98 = icmp eq i32 %97, 0 ; <i1> [#uses=1]
br i1 %98, label %__op_25, label %block_end

__op_25: ; preds = %__op_22, %dispatch
%99 = load %struct.PyObject** %13, align 4 ; <%struct.PyObject*>
[#uses=1]
%100 = bitcast %struct.PyObject* %99 to %struct.PyTupleObject* ; <
%struct.PyTupleObject*> [#uses=1]
%101 = getelementptr %struct.PyTupleObject* %100, i32 0, i32 3, i32
1 ; <%struct.PyObject**> [#uses=1]
%102 = load %struct.PyObject** %101, align 4 ; <%struct.PyObject*>
[#uses=2]
%103 = getelementptr %struct.PyObject* %102, i32 0, i32 0 ; <i32*>
[#uses=2]
%104 = load i32* %103, align 4 ; <i32> [#uses=1]
%105 = add i32 %104, 1 ; <i32> [#uses=1]
store i32 %105, i32* %103, align 4
%106 = load %struct.PyObject*** %3, align 4 ; <%struct.PyObject**>
[#uses=2]
store %struct.PyObject* %102, %struct.PyObject** %106, align 4
%107 = getelementptr %struct.PyObject** %106, i32 1 ; <
%struct.PyObject**> [#uses=1]
store %struct.PyObject** %107, %struct.PyObject*** %3, align 4
br label %__op_28

__op_28: ; preds = %__op_25, %dispatch
%108 = call fastcc i32 @opcode_COMPARE_OP(%struct.interpreter_state*
%st, i32 28, i32 106, i32 2) ; <i32> [#uses=1]
%109 = icmp eq i32 %108, 0 ; <i1> [#uses=1]
br i1 %109, label %__op_31, label %block_end

__op_31: ; preds = %__op_28, %dispatch
%110 = load %struct.PyObject*** %3, align 4 ; <%struct.PyObject**>
[#uses=1]
%111 = getelementptr %struct.PyObject** %110, i32 -1 ; <
%struct.PyObject**> [#uses=1]
%112 = load %struct.PyObject** %111, align 4 ; <%struct.PyObject*>
[#uses=3]
%113 = icmp eq %struct.PyObject* %112, bitcast (%struct.PyIntObject*
@_Py_TrueStruct to %struct.PyObject*) ; <i1> [#uses=1]
br i1 %113, label %is_top_true.exit, label %bb1.i1

bb1.i1: ; preds = %__op_31
%114 = icmp eq %struct.PyObject* %112, bitcast (%struct.PyIntObject*
@_Py_ZeroStruct to %struct.PyObject*) ; <i1> [#uses=1]
br i1 %114, label %is_top_true.exit, label %bb3.i2

bb3.i2: ; preds = %bb1.i1
%115 = call i32 @PyObject_IsTrue(%struct.PyObject* %112) nounwind ;
<i32> [#uses=1]
%116 = icmp sgt i32 %115, 0 ; <i1> [#uses=1]
%retval.i = zext i1 %116 to i32 ; <i32> [#uses=1]
br label %is_top_true.exit

is_top_true.exit: ; preds = %__op_31, %bb1.i1, %bb3.i2
%117 = phi i32 [ %retval.i, %bb3.i2 ], [ 1, %__op_31 ], [ 0,
%bb1.i1 ] ; <i32> [#uses=1]
%118 = icmp eq i32 %117, 0 ; <i1> [#uses=1]
br i1 %118, label %__op_43, label %__op_34

__op_34: ; preds = %is_top_true.exit, %dispatch
%119 = load %struct.PyObject*** %3, align 4 ; <%struct.PyObject**>
[#uses=1]
%120 = getelementptr %struct.PyObject** %119, i32 -1 ; <
%struct.PyObject**> [#uses=2]
store %struct.PyObject** %120, %struct.PyObject*** %3, align 4
%121 = load %struct.PyObject** %120, align 4 ; <%struct.PyObject*>
[#uses=3]
%122 = getelementptr %struct.PyObject* %121, i32 0, i32 0 ; <i32*>
[#uses=2]
%123 = load i32* %122, align 4 ; <i32> [#uses=1]
%124 = add i32 %123, -1 ; <i32> [#uses=2]
store i32 %124, i32* %122, align 4
%125 = icmp eq i32 %124, 0 ; <i1> [#uses=1]
br i1 %125, label %bb.i, label %__op_35

bb.i: ; preds = %__op_34
%126 = getelementptr %struct.PyObject* %121, i32 0, i32 1 ; <
%struct._typeobject**> [#uses=1]
%127 = load %struct._typeobject** %126, align 4 ; <
%struct._typeobject*> [#uses=1]
%128 = getelementptr %struct._typeobject* %127, i32 0, i32 6 ; <void
(%struct.PyObject*)**> [#uses=1]
%129 = load void (%struct.PyObject*)** %128, align 4 ; <void
(%struct.PyObject*)*> [#uses=1]
call void %129(%struct.PyObject* %121) nounwind
br label %__op_35

__op_35: ; preds = %bb.i, %__op_34, %dispatch
%130 = call fastcc i32 @opcode_LOAD_NAME(%struct.interpreter_state*
%st, i32 35, i32 101, i32 0) ; <i32> [#uses=1]
%131 = icmp eq i32 %130, 0 ; <i1> [#uses=1]
br i1 %131, label %__op_38, label %block_end

__op_38: ; preds = %__op_35, %dispatch
%132 = call fastcc i32 @opcode_PRINT_ITEM(%struct.interpreter_state*
%st, i32 38, i32 71, i32 0) ; <i32> [#uses=1]
%133 = icmp eq i32 %132, 0 ; <i1> [#uses=1]
br i1 %133, label %__op_39, label %block_end

__op_39: ; preds = %__op_38, %dispatch
%134 = call fastcc i32 @opcode_PRINT_NEWLINE
(%struct.interpreter_state* %st, i32 39, i32 72, i32 0) ; <i32>
[#uses=1]
%135 = icmp eq i32 %134, 0 ; <i1> [#uses=1]
br i1 %135, label %__op_16, label %block_end

__op_43: ; preds = %is_top_true.exit, %dispatch
%136 = load %struct.PyObject*** %3, align 4 ; <%struct.PyObject**>
[#uses=1]
%137 = getelementptr %struct.PyObject** %136, i32 -1 ; <
%struct.PyObject**> [#uses=2]
store %struct.PyObject** %137, %struct.PyObject*** %3, align 4
%138 = load %struct.PyObject** %137, align 4 ; <%struct.PyObject*>
[#uses=3]
%139 = getelementptr %struct.PyObject* %138, i32 0, i32 0 ; <i32*>
[#uses=2]
%140 = load i32* %139, align 4 ; <i32> [#uses=1]
%141 = add i32 %140, -1 ; <i32> [#uses=2]
store i32 %141, i32* %139, align 4
%142 = icmp eq i32 %141, 0 ; <i1> [#uses=1]
br i1 %142, label %bb.i5, label %__op_16

bb.i5: ; preds = %__op_43
%143 = getelementptr %struct.PyObject* %138, i32 0, i32 1 ; <
%struct._typeobject**> [#uses=1]
%144 = load %struct._typeobject** %143, align 4 ; <
%struct._typeobject*> [#uses=1]
%145 = getelementptr %struct._typeobject* %144, i32 0, i32 6 ; <void
(%struct.PyObject*)**> [#uses=1]
%146 = load void (%struct.PyObject*)** %145, align 4 ; <void
(%struct.PyObject*)*> [#uses=1]
call void %146(%struct.PyObject* %138) nounwind
br label %__op_16

__op_47: ; preds = %opcode_FOR_ITER.exit, %dispatch
%147 = call fastcc i32 @opcode_POP_BLOCK(%struct.interpreter_state*
%st, i32 47, i32 87, i32 0) ; <i32> [#uses=1]
%148 = icmp eq i32 %147, 0 ; <i1> [#uses=1]
br i1 %148, label %__op_48, label %block_end

__op_48: ; preds = %__op_47, %dispatch
%149 = load %struct.PyObject** %13, align 4 ; <%struct.PyObject*>
[#uses=1]
%150 = bitcast %struct.PyObject* %149 to %struct.PyTupleObject* ; <
%struct.PyTupleObject*> [#uses=1]
%151 = getelementptr %struct.PyTupleObject* %150, i32 0, i32 3, i32
3 ; <%struct.PyObject**> [#uses=1]
%152 = load %struct.PyObject** %151, align 4 ; <%struct.PyObject*>
[#uses=2]
%153 = getelementptr %struct.PyObject* %152, i32 0, i32 0 ; <i32*>
[#uses=2]
%154 = load i32* %153, align 4 ; <i32> [#uses=1]
%155 = add i32 %154, 1 ; <i32> [#uses=1]
store i32 %155, i32* %153, align 4
%156 = load %struct.PyObject*** %3, align 4 ; <%struct.PyObject**>
[#uses=2]
store %struct.PyObject* %152, %struct.PyObject** %156, align 4
%157 = getelementptr %struct.PyObject** %156, i32 1 ; <
%struct.PyObject**> [#uses=1]
store %struct.PyObject** %157, %struct.PyObject*** %3, align 4
br label %__op_51

__op_51: ; preds = %__op_48, %dispatch
%158 = load %struct.PyObject*** %3, align 4 ; <%struct.PyObject**>
[#uses=1]
%159 = getelementptr %struct.PyObject** %158, i32 -1 ; <
%struct.PyObject**> [#uses=2]
store %struct.PyObject** %159, %struct.PyObject*** %3, align 4
%160 = load %struct.PyObject** %159, align 4 ; <%struct.PyObject*>
[#uses=1]
store %struct.PyObject* %160, %struct.PyObject** %17, align 8
store i32 8, i32* %16, align 4
%161 = load %struct.PyFrameObject** %0, align 8 ; <
%struct.PyFrameObject*> [#uses=1]
%162 = getelementptr %struct.PyFrameObject* %161, i32 0, i32 15 ;
<i32*> [#uses=1]
store i32 51, i32* %162, align 4
br label %block_end

block_end: ; preds = %__op_51, %__op_47, %__op_39, %__op_38,
%__op_35, %__op_28, %__op_22, %__op_19, %opcode_FOR_ITER.exit,
%__op_15, %__op_12, %__op_0, %gen_throw
%163 = call fastcc i32 @unwind_stack(%struct.interpreter_state* %st,
i32* %dispatch_to_instr) ; <i32> [#uses=1]
%164 = icmp eq i32 %163, 0 ; <i1> [#uses=1]
br i1 %164, label %dispatch, label %end
}

Collin Winter

Apr 6, 2009, 6:09:44 PM
to giu...@gmail.com, Unladen Swallow
Hi Giuseppe,

On Sun, Apr 5, 2009 at 9:18 AM, Giuseppe Ottaviano <giu...@gmail.com> wrote:
>
>
> Hi all,
> lately I have been experimenting with an idea similar to the one you
> are implementing, i.e. replacing the CPython interpreter loop with an
> LLVM JIT. Since I have no more time to work on this, I'd like to share
> my results with you, in case they are of any help.
>
> What I've done so far is just a compiler that compiles a code object
> when PyEval_EvalFrameEx is going to execute it (and then caches the
> machine code), by translating the sequence of opcodes into a sequence
> of function calls in LLVM. Every function implements an opcode. So it
> was just a matter of turning the switch cases of the interpreter loop
> into normal C functions in a separate translation unit, and compiling
> it with llvm-gcc to a bitcode file that is later loaded by the compiler.

This is how we're proceeding for now, just to get a baseline. Once we
get the baseline feature-complete, we'll start getting fancy and aim
for better code generation.

> All the opcode functions have the same prototype: they take as an
> argument a structure describing the interpreter state, and return an
> error code which is used for exception handling (I experimented with
> DWARF exceptions, but they gave a ~40x slowdown when soft exceptions
> were thrown, i.e. always). An example of the generated code is
> appended at the end of this email.

Yes, we're planning to stick with Python's current exception strategy,
both for compatibility and because Python throws exceptions all the
time, making "zero-cost" exceptions too costly.

> Smaller opcodes are inlined: I noticed that inlining everything gives
> huge compilation times, while inlining nothing produces code slower
> than the interpreter loop. Your approach of writing LLVM IR from C is
> certainly more efficient, because in my "functional" approach a lot of
> time for small opcodes is spent (I think) extracting the frame, code,
> etc. from the interpreter_state structure. However, I think you should
> be careful with "big" opcodes, which could cause the same problems I
> had: long compilation times, and slowdowns caused by instruction cache
> misses. Calling a function would probably be better (if the opcode is
> "big", the function call overhead is negligible).

We've taken steps to ameliorate this somewhat in Unladen Swallow by
converting some of the least-used/most-expensive opcodes into
functions, so those will be function calls already (this was included
in our 2009Q1 release).

What benefit did you see from LLVM's optimization passes? Did you find
any passes to be particularly helpful or harmful?

Thanks,
Collin

Giuseppe Ottaviano

Apr 10, 2009, 10:13:01 AM
to Collin Winter, Unladen Swallow
>> What I've done so far is just a compiler that compiles a code object
>> when PyEval_EvalFrameEx is going to execute it (and then caches the
>> machine code), by translating the sequence of opcodes into a sequence
>> of function calls in LLVM. Every function implements an opcode. So it
>> was just a matter of turning the switch cases of the interpreter loop
>> into normal C functions in a separate translation unit, and compiling
>> it with llvm-gcc to a bitcode file that is later loaded by the
>> compiler.
>
> This is how we're proceeding for now, just to get a baseline. Once we
> get the baseline feature-complete, we'll start getting fancy and aim
> for better code generation.

Yes, the difference is that you write the IR directly, without using
llvm-gcc to obtain it from C code. Also, I may not have read your code
thoroughly, but it seems that it has no periodic checks (needed, for
example, to switch to other threads), nor the ability to start from an
arbitrary point in the bytecode (needed, I think, to resume
generators) by reading the frame's f_lasti (I am not sure you set
f_lasti at all). In my case, implementing those features had a
significant impact on both compilation time and the efficiency of the
generated code.
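
[Editor's note: the f_lasti resume point is observable from Python
itself — CPython records the offset of the last executed instruction
on the frame, and resuming a generator means dispatching to the
instruction after it, exactly what the dispatch_to_instr switch in the
generated IR above does. A quick check:]

```python
def gen():
    yield "first"
    yield "second"

g = gen()
next(g)
lasti_after_first = g.gi_frame.f_lasti   # offset of the first yield
next(g)
lasti_after_second = g.gi_frame.f_lasti  # offset of the second yield

# The recorded resume point advances from one suspension to the next.
assert lasti_after_first >= 0
assert lasti_after_second != lasti_after_first
```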

When you complete the baseline version, which plans have the highest
priority? Removing the GIL, or evolving the JIT into a specializing
compiler?

>> All the opcode functions have the same prototype: they take as an
>> argument a structure describing the interpreter state, and return an
>> error code which is used for exception handling (I experimented with
>> DWARF exceptions, but they gave a ~40x slowdown when soft exceptions
>> were thrown, i.e. always). An example of the generated code is
>> appended at the end of this email.
>
> Yes, we're planning to stick with Python's current exception strategy,
> both for compatibility and because Python throws exceptions all the
> time, making "zero-cost" exceptions too costly.

To be clear, the zero-cost exceptions didn't change the semantics of
exceptions: in my code I needed some early-exit mechanism because my
opcodes are implemented as functions (I later implemented it by
checking the return value of each opcode). In your code everything is
in the body of the function, so you can just use a jump.

>> Smaller opcodes are inlined: I noticed that inlining everything gives
>> huge compilation times, while inlining nothing produces code slower
>> than the interpreter loop. Your approach of writing LLVM IR from C is
>> certainly more efficient, because in my "functional" approach a lot of
>> time for small opcodes is spent (I think) extracting the frame, code,
>> etc. from the interpreter_state structure. However, I think you should
>> be careful with "big" opcodes, which could cause the same problems I
>> had: long compilation times, and slowdowns caused by instruction cache
>> misses. Calling a function would probably be better (if the opcode is
>> "big", the function call overhead is negligible).
>
> We've taken steps to ameliorate this somewhat in Unladen Swallow by
> converting some of the least-used/most-expensive opcodes into
> functions, so those will be function calls already (this was included
> in our 2009Q1 release).

Did the 2009Q1 release already use LLVM, or did it just call functions
from the interpreter loop? (Which is what Perl does, IIRC.)

> What benefit did you see from LLVM's optimization passes? Did you find
> any passes to be particularly helpful or harmful?

I haven't had the time to evaluate different combinations of passes,
but since the opcode bodies were already optimized by llvm-gcc, all
that was needed was to optimize the code produced by inlining (some
of) the opcodes. I found that at least InstructionCombiningPass,
DeadCodeEliminationPass, GVNPass and CFGSimplificationPass were
essential. PromoteMemoryToRegisterPass was very useful when I inlined
ALL the opcodes.

BTW, I found that most of the time is spent in backend code
generation, not in the optimization passes (unless I used a really
large number of exotic passes). Do you think your approach of caching
the LLVM IR in the .pyc will be sufficient to make the compilation
overhead negligible?

Bye,
Giuseppe
