LuaJIT's interpreter is fast because:
• It's written in assembler.
• It keeps all important state in registers. No C compiler manages to do that on x86.
• It uses indirect threading (aka labeled goto in C).
• It has a very small I-cache footprint (the core of the interpreter fits in 6K).
• The parser generates a register-based bytecode.
• The bytecode is really a word-code (32 bits/instruction) and designed for fast decoding.
• Bytecode decode and dispatch are heavily optimized for superscalar CPUs.
• The bytecode is type-specialized and patched on the fly.
• The dispatch table is patched to allow for debug hooks and trace recording, so there's no need to check for these cases in the fast paths.
• It uses NaN tagging for object references. This allows unboxed FP numbers with a minimal cache footprint for stacks/arrays. FP stores are auto-tagging.
• It inlines all fast paths.
• It uses special calling conventions for built-ins (fast functions).
• Tons more tuning in the VM ... and the JIT compiler has its own bag of tricks.
E.g. x=x+1 is turned into the ADDVN instruction. This means it's specialized for the second operand being a constant. Here's the x86 code (with SSE2 enabled) for this instruction:
// Prologue for type ABC instructions (others have a zero prologue).
movzx ebp, ah               ; Decode RC (split of RD)
movzx eax, al               ; Decode RB (split of RD)
// The instruction itself.
cmp [edx+ebp*8+0x4], -13    ; Type check of [RB]
ja ->lj_vmeta_arith_vn
movsd xmm0, [edx+ebp*8]     ; Load of [RB]
addsd xmm0, [edi+eax*8]     ; Add to [RC]
movsd [edx+ecx*8], xmm0     ; Store in [RA]
// Standard epilogue: decode + dispatch the next instruction.
mov eax, [esi]              ; Load next bytecode
movzx ecx, ah               ; Decode RA
movzx ebp, al               ; Decode opcode
add esi, 0x4                ; Increment PC
shr eax, 0x10               ; Decode RD
jmp [ebx+ebp*4]             ; Dispatch to next instruction
Yes, that's all of it. I don't think you can do this with fewer instructions. This code reaches up to 2.5 IPC on a Core 2 and takes 5-6 cycles (2 nanoseconds on a 3 GHz machine).