LuaJIT 2 beta 3 is out: Support both x32 & x64 (Why is it so fast?)

Chunlin Zhang

Mar 9, 2010, 10:02:47 PM

to lua_cn

http://mryufeng.javaeye.com/blog/610519

Sent to you by Chunlin Zhang via Google Reader:

LuaJIT 2 beta 3 is out: Support both x32 & x64 (Why is it so fast?)

via erlang非业余研究 on 3/8/10

LuaJIT's interpreter is fast, because:

• It's written in assembler.
• It keeps all important state in registers. No C compiler manages to do that on x86.
• It uses indirect threading (aka labeled goto in C).
• It has a very small I-cache footprint (the core of the interpreter fits in 6K).
• The parser generates a register-based bytecode.
• The bytecode is really a word-code (32 bit/ins) and designed for fast decoding.
• Bytecode decode and dispatch is heavily optimized for superscalar CPUs.
• The bytecode is type-specialized and patched on-the-fly.
• The dispatch table is patched to allow for debug hooks and trace recording. No need to check for these cases in the fast paths.
• It uses NaN tagging for object references. This allows unboxed FP numbers with a minimal cache footprint for stacks/arrays. FP stores are auto-tagging.
• It inlines all fast paths.
• It uses special calling conventions for built-ins (fast functions).
• Tons more tuning in the VM ... and the JIT compiler has its own bag of tricks.
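
To make the NaN tagging item a bit more concrete, here is a minimal sketch in C of how an 8-byte slot can hold either an unboxed double or a tagged reference. The tag values, the Value type and all function names are assumptions made up for this illustration, not LuaJIT's actual definitions. The only detail taken from the post is the idea that a number check is a single unsigned compare on the upper 32-bit word.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tags for illustration only. An upper word above TAG_ISNUM
 * (unsigned compare) marks a non-number; everything else is a plain double. */
#define TAG_ISNUM 0xfffffff3u   /* same bit pattern as a 32-bit -13 */
#define TAG_STR   0xfffffffbu   /* made-up tag for a string reference */
#define TAG_NIL   0xffffffffu   /* made-up tag for nil */

typedef struct { uint64_t bits; } Value;   /* one 8-byte stack/array slot */

static uint32_t value_tag(Value v)     { return (uint32_t)(v.bits >> 32); }
static uint32_t value_payload(Value v) { return (uint32_t)v.bits; }

/* One unsigned compare on the upper word decides "number or not". */
static int value_is_number(Value v) { return value_tag(v) <= TAG_ISNUM; }

/* Storing a double needs no extra tagging step: its own bits are the value.
 * This assumes NaNs produced by arithmetic are canonical, so their upper
 * word never lands in the tag range. */
static Value value_from_number(double d) {
    Value v;
    memcpy(&v.bits, &d, sizeof d);
    assert(value_is_number(v));
    return v;
}

static double value_to_number(Value v) {
    double d;
    assert(value_is_number(v));
    memcpy(&d, &v.bits, sizeof d);
    return d;
}

/* A reference is a tag in the upper word plus a 32-bit payload (for example
 * a 32-bit pointer) in the lower word; the bit pattern is a NaN as a double. */
static Value value_from_ref(uint32_t tag, uint32_t payload) {
    Value v;
    assert(tag > TAG_ISNUM);
    v.bits = ((uint64_t)tag << 32) | payload;
    return v;
}
```

With this layout a stack or array of values is just a flat array of 8-byte slots, which is where the small cache footprint comes from.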
For example, x=x+1 is turned into the ADDVN instruction, which is specialized for the second operand being a constant. Here's the x86 code (with SSE2 enabled) for this instruction:

// Prologue for type ABC instructions (others have a zero prologue).
movzx  ebp, ah                  // Decode RB (split of RD)
movzx  eax, al                  // Decode RC (split of RD)

// The instruction itself.
cmp    [edx+ebp*8+0x4], -13     // Type check of [RB]
ja     ->lj_vmeta_arith_vn
movsd  xmm0, [edx+ebp*8]        // Load of [RB]
addsd  xmm0, [edi+eax*8]        // Add [RC]
movsd  [edx+ecx*8], xmm0        // Store in [RA]

// Standard epilogue: decode + dispatch the next instruction.
mov    eax, [esi]               // Load next bytecode
movzx  ecx, ah                  // Decode RA
movzx  ebp, al                  // Decode opcode
add    esi, 0x4                 // Increment PC
shr    eax, 0x10                // Decode RD
jmp    [ebx+ebp*4]              // Dispatch to next instruction
Yes, that's all of it. I don't think you can do this with fewer instructions. This code reaches up to 2.5 IPC on a Core2 and takes 5-6 cycles (about 2 nanoseconds on a 3 GHz machine).
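
For readers who don't read x86 assembler, the epilogue above (load a 32-bit word, split off the opcode, A and D fields, jump through a table) can be sketched in C with labeled goto. This is only an illustration: the three opcodes and the names ins_ad and run are hypothetical, and the &&label syntax is a GNU C extension supported by GCC and Clang, not standard C. A C compiler will also not keep the state in registers the way the hand-written assembler does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcodes for this sketch; not LuaJIT's real instruction set. */
enum { OP_LOADN, OP_ADDN, OP_RET, OP__MAX };

/* Pack one 32-bit instruction: opcode in bits 0-7, A in bits 8-15,
 * D in bits 16-31 (the field positions the epilogue decodes). */
static uint32_t ins_ad(unsigned op, unsigned a, unsigned d) {
    assert(op < OP__MAX && a <= 0xffu && d <= 0xffffu);
    return (uint32_t)op | ((uint32_t)a << 8) | ((uint32_t)d << 16);
}

/* Run a code sequence over an array of number slots and return the slot
 * named by OP_RET. The bytecode is trusted: opcodes are not range-checked. */
static double run(const uint32_t *pc, double *slots) {
    /* Dispatch table of label addresses (GNU C extension). */
    static void *const dispatch[OP__MAX] = { &&do_loadn, &&do_addn, &&do_ret };
    uint32_t ins;
    unsigned a, d;

    /* Decode + dispatch, repeated at the end of every instruction body:
     * load the next word, bump the PC, split the fields, jump via the table. */
#define NEXT() do {                   \
        ins = *pc++;                  \
        a = (ins >> 8) & 0xffu;       \
        d = ins >> 16;                \
        goto *dispatch[ins & 0xffu];  \
    } while (0)

    NEXT();

do_loadn:                      /* slots[A] = D */
    slots[a] = (double)d;
    NEXT();

do_addn:                       /* slots[A] = slots[A] + D */
    slots[a] += (double)d;
    NEXT();

do_ret:                        /* return slots[A] */
    return slots[a];

#undef NEXT
}
```

Running the sequence LOADN 0 41; ADDN 0 1; RET 0 through run leaves 42 in slot 0. Because the decode and dispatch code is duplicated into every instruction body, there is no central loop and each instruction ends in its own indirect jump.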
