Hi,
As some of you may know, I have been working on an alternative RISC-V ISA simulator, with the eventual goal of exploring RISC-V binary translation to x86_64.
- rv-sim - ABI Proxy Simulator: small Linux ABI subset for emulation of simple static executables
- rv-sys - Privileged System Emulator: similar to Spike; can boot riscv-linux to a busybox shell
- rv-jit - Proof-of-concept ABI Proxy Simulator with JIT binary translation (alpha code)
The idea has always been to explore a monomorphism (or perhaps epimorphism) between the RISC-V ISA and a well-known legacy 16-register 64-bit CISC ISA, potentially coalescing RISC sequences into CISC sequences to achieve relatively dense packing in real time. The principle is to explore a unidirectional mapping, in contrast to mapping to a target-independent IR. The internal representation is the RISC-V ISA plus fused RISC-V pseudo-instructions (call, tail, la, li, etc.). In any case, there has been substantial work to get a framework and scaffolding in place for “return to interpreter” before starting on trace-based binary translation. A priv-1.9 interpreter was completed a couple of months ago and rv-sys can now boot riscv-linux. Next up is binary translation…
I have just added a new program, “rv-jit”, and am now exploring how to map from the RISC-V ISA to x86_64. The intention is to implement a “fusion matcher” to coalesce RISC-V instruction sequences and translate them into CISC instruction sequences, for example, potentially merging sequences into CISC memory operands (displacement, base, index, scale), only to later be broken back into micro-ops by the CISC micro-architecture. The principal idea is to maintain instruction density so that instruction-level parallelism can exist in the translation.
There are many interesting challenges:
- Lifting RISC-V assembler back into pseudos
call, tail, la, li, plus more complex patterns (PLT+GOT indirect jumps)
- Register allocation (31 registers to 16 registers)
dynamic spill slots used as CISC memory operands to ALU ops (L1 latency)
- Mapping 3-operand instructions to 2-operand destructive ALU ops (ADD, SUB, IMUL, IDIV)
potentially performing lookahead liveness analysis to eliminate temporaries
- Branches and mapping flags to registers: CMP, SETcc (zero-extending for SLT[i][u])
- Indirect jumps and call stack emulation
- Efficient register save restore when jumping to/from JIT traces
only save and restore registers used in the trace
- Efficient JIT of instret and cycle counters at trace tails
potentially a cycle-accurate translation mode
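To illustrate the 3-operand to 2-operand mapping challenge above, here is a minimal sketch (hypothetical helper name, not the actual rv-jit code) of the decision a translator has to make when lowering a RISC-V `add rd, rs1, rs2` to destructive x86_64 ALU ops:

```cpp
#include <string>
#include <vector>

// Sketch: lower RISC-V "add rd, rs1, rs2" to x86_64 text. x86_64 ADD is
// destructive (dst += src), so when rd aliases neither source register a
// MOV must be emitted first; lookahead liveness analysis could sometimes
// avoid the copy by clobbering a dead source instead.
std::vector<std::string> lower_add(const std::string& rd,
                                   const std::string& rs1,
                                   const std::string& rs2)
{
    if (rd == rs1) return { "add " + rd + ", " + rs2 };
    if (rd == rs2) return { "add " + rd + ", " + rs1 };  // add is commutative
    return { "mov " + rd + ", " + rs1,
             "add " + rd + ", " + rs2 };
}
```

When rd overlaps a source, the translation stays a single ALU op; otherwise the extra mov is exactly the pattern that shows up in the traces below (mov r8, r11; add r8, r12).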
However, I am now at the early stage of having a very simple working proof of concept that can execute native code for simple loops. A reasonable amount of work went into getting the framework in place.
The interpreter (similarly to Spike) has a memory-backed array for registers, e.g.
add r[rd] = r[rs1] + r[rs2]
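For reference, a minimal sketch of that style of interpreter core (illustrative names, not the actual rv-sim sources), which shows why every ALU op costs a decode plus memory traffic to the register array:

```cpp
#include <cstdint>

// Sketch of a memory-backed register file in the style of
// "add r[rd] = r[rs1] + r[rs2]". x0 is hardwired to zero in RISC-V,
// so writes to register 0 are dropped.
struct Hart {
    uint64_t pc = 0;
    uint64_t r[32] = {};

    void add(unsigned rd, unsigned rs1, unsigned rs2) {
        if (rd != 0) r[rd] = r[rs1] + r[rs2];
    }
};
```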
The JIT assigns commonly used target registers to host registers (currently using a static mapping) and exploits CISC memory operands for spilled registers to maintain instruction density. The interpreter maintains a per-program-counter execution count history (not optimised yet - it uses std::map internally), and once an instruction meets a given threshold, the interpreter switches to trace-and-JIT-compile mode, concurrently interpreting and emitting JIT code. The tracer exits on loop termination or when it finds an instruction it can’t translate (currently the proof of concept has just the minimal instructions needed to optimise simple loops). Before executing any instruction, the JIT trace cache is looked up by program counter, and if a JIT trace exists, the native trace is executed from the trace cache.
- look up the trace cache for the program counter; execute the native trace if it exists
- fetch the instruction, update program counter stats, execute
- if the execution threshold is met, run the JIT codegen tracer and cache the trace
The abstraction is that instruction fetch throws an exception (internally similar to a misaligned fetch or any other fetch exception) when the instruction count threshold is met. The tracer is then invoked, concurrently interprets and JIT compiles the code, and exits when it detects loop termination. It could potentially do non-interpretive, opportunistic JIT of branch tails. Something to explore…
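A minimal sketch of that exception-driven hand-off (illustrative types only, under the assumption of a per-pc std::map counter as described above):

```cpp
#include <cstddef>
#include <cstdint>
#include <map>

// Sketch: instruction fetch counts executions per pc and throws once a pc
// crosses the hotness threshold; the dispatch loop catches the exception,
// which is where the tracer/JIT would be invoked at that pc.
struct hotness_exception { uint64_t pc; };

struct Profiler {
    std::map<uint64_t, uint64_t> counts;  // pc -> execution count
    uint64_t threshold;

    explicit Profiler(uint64_t t) : threshold(t) {}

    void on_fetch(uint64_t pc) {
        if (++counts[pc] == threshold) throw hotness_exception{pc};
    }
};

// Returns the pc that first became hot, or 0 if none did.
uint64_t run_loop(Profiler& prof, const uint64_t* pcs, size_t n) {
    try {
        for (size_t i = 0; i < n; i++) prof.on_fetch(pcs[i]);
    } catch (const hotness_exception& e) {
        return e.pc;  // here the tracer would start at e.pc
    }
    return 0;
}
```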
The proof of concept works, and there is now scaffolding to explore RISC-V to x86_64 translation. The POC has just enough instructions to optimise simple loops (add, addi, bne, ld). Encouraging early results show a 40X speed improvement over interpretation on a synthetic loop sum benchmark. This is not surprising, as interpretation is dominated by decode and memory accesses to registers.
Interestingly, the test loop gets translated 3 times, as the code does not yet index on basic block boundaries. The interpreter simply starts a trace at any instruction that reaches the execution threshold, and on tail exit from the JIT another instruction entry point might hit the threshold and cause re-translation. Even so, it achieves a 40X speed boost, likely with a reasonable amount of instruction-level parallelism.
I still need to enable the fusion matcher and translate many more instructions… and once I have a more complete JIT, the JIT engine will be integrated with the full system emulator to accelerate riscv-linux on x86_64 hosts.
Note: in the asm below, [rbp + (reg+1) * 8] refers to spilled registers, with [rbp + 0] holding the program counter. In the trace you can see the tails of branches storing the emulated program counter for re-entry into the interpreter. The translated code below happens to use only hot registers, so the trace doesn’t show any [rbp + offset] accesses to spilled registers.
rbp -> u64 pc
u64 regs[32] /* translator drops writes to zero */
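In other words, register x&lt;n&gt; spills to displacement (n+1)*8 from rbp, with the pc at displacement 0. A trivial sketch of that layout calculation (hypothetical helper names):

```cpp
// Spill frame layout sketch: [rbp + 0] holds the emulated pc and
// [rbp + (reg + 1) * 8] holds spilled register 'reg'.
constexpr int pc_disp() { return 0; }
constexpr int spill_disp(unsigned reg) { return (reg + 1) * 8; }
```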
test-loop-2 (see below) has 24 million loop iterations or 128,000,176 RISC-V instructions retired
- interp 97.5 MIPS
- jit-interp 4000 MIPS
Here is the simple loop sum benchmark:
The benchmark is run on a single core of a 3.1GHz Broadwell i7-5557U, so the JITed RISC-V x86_64 code is retiring approximately 1.3 instructions per cycle. The prolog/epilog of the trace can be further optimised. This is version 0.0-prealpha. It will still be some time before the translator is able to run useful workloads (like the Linux kernel and gcc), but the initial results are promising.
$ time build/linux_x86_64/bin/rv-jit build/riscv64-unknown-elf/bin/test-loop-2
rv-jit-0.0.0-prealpha-0
total=300000000
real 0m1.312s
user 0m1.308s
sys 0m0.000s
$ time build/linux_x86_64/bin/rv-jit build/riscv64-unknown-elf/bin/test-loop-2 --trace
rv-jit-0.0.0-prealpha-0
total=300000000
real 0m0.032s
user 0m0.032s
sys 0m0.000s
$ time build/linux_x86_64/bin/rv-jit build/riscv64-unknown-elf/bin/test-loop-2 --trace --trace-iters 100 --log-jit-trace
rv-jit-0.0.0-prealpha-0
jit-trace-begin pc=0x00000000000100d4
# 0x00000000000100d4 add a0, a3, a4
mov r8, r11
add r8, r12
# 0x00000000000100d8 ld a0, 0(a0)
mov r8, qword ptr [r8 + 0]
# 0x00000000000100dc addi a4, a4, 8
add r12, 8
# 0x00000000000100e0 add a1, a1, a0
add r9, r8
# 0x00000000000100e4 bne a4, a2, pc - 16
cmp r12, r10
jne 1f
mov [rbp + 0], 0x100e8
jmp term
1:
mov [rbp + 0], 0x100d4
jmp term
# 0x00000000000100e8 addi a5, a5, -1
add r13, -1
# 0x00000000000100ec bne a5, zero, pc - 28
cmp r13, 0
jne 1f
mov [rbp + 0], 0x100f0
jmp term
1:
mov [rbp + 0], 0x100d0
jmp term
# 0x00000000000100d0 addi a4, zero, 0
mov r12, 0
term:
jit-trace-end pc=0x00000000000100d4
jit-trace-begin pc=0x00000000000100d0
# 0x00000000000100d0 addi a4, zero, 0
mov r12, 0
# 0x00000000000100d4 add a0, a3, a4
mov r8, r11
add r8, r12
# 0x00000000000100d8 ld a0, 0(a0)
mov r8, qword ptr [r8 + 0]
# 0x00000000000100dc addi a4, a4, 8
add r12, 8
# 0x00000000000100e0 add a1, a1, a0
add r9, r8
# 0x00000000000100e4 bne a4, a2, pc - 16
cmp r12, r10
jne 0x00000000000100d4
mov [rbp + 0], 0x100e8
jmp term
term:
jit-trace-end pc=0x00000000000100d4
jit-trace-begin pc=0x00000000000100e8
# 0x00000000000100e8 addi a5, a5, -1
add r13, -1
# 0x00000000000100ec bne a5, zero, pc - 28
cmp r13, 0
jne 1f
mov [rbp + 0], 0x100f0
jmp term
1:
mov [rbp + 0], 0x100d0
jmp term
# 0x00000000000100d0 addi a4, zero, 0
mov r12, 0
# 0x00000000000100d4 add a0, a3, a4
mov r8, r11
add r8, r12
# 0x00000000000100d8 ld a0, 0(a0)
mov r8, qword ptr [r8 + 0]
# 0x00000000000100dc addi a4, a4, 8
add r12, 8
# 0x00000000000100e0 add a1, a1, a0
add r9, r8
# 0x00000000000100e4 bne a4, a2, pc - 16
cmp r12, r10
jne 0x00000000000100d4
jmp 0x00000000000100e8
term:
jit-trace-end pc=0x00000000000100d4
total=300000000
real 0m0.060s
user 0m0.056s
sys 0m0.000s