Hi,
I added some new benchmark stats for the RISC-V to x86-64 binary translation engine. We are now at roughly 2.5x the performance of QEMU user-mode binary translation on a small set of benchmarks.
I’m now exploring more detailed statistics such as retired instructions versus retired micro-ops, in an attempt to factor out instruction set and compiler efficiency: to see how much performance is lost to binary translation, how much to the maturity of the compiler port, and how much to missing instructions in the ISA (e.g. BSWAP, ROR). The stats might also be useful for isolating performance issues in RISC-V GCC. I also intend to add stats on the number of macro-op fusions the JIT engine performs, to assess the benefit of various macro-op fusion patterns.
I've been able to reach a peak single-thread simulation performance of ~6.5 billion instructions per second on dhrystone, on a single core of a dual-core 3.4GHz Intel NUC (5th-gen Broadwell Core i7-5557U). The native x86 code for dhrystone retires ~8.6 billion instructions per second and ~11.5 billion micro-ops per second; overall, the benchmark runtime for rv-jit is ~4.38 times slower than the native code.
So while the gross performance of dhrystone is ~4.38 times slower, when comparing on retired micro-ops per second, the JIT engine is issuing only ~1.78 times fewer instructions per second than native x86-64 micro-ops. That suggests the x86-64 compiler or ISA has a bigger advantage on this particular benchmark: we could potentially attribute a factor of ~2.46 (4.38 / 1.78) to compiler and/or instruction set efficiency (for that particular benchmark). Something to root cause… In other benchmarks it goes the other way…
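For clarity, the attribution arithmetic above can be sketched as follows, using the dhrystone figures quoted in this mail (small differences from the ~1.78x and ~2.46x quoted are just rounding of the inputs):

```python
# Attribution of the dhrystone slowdown, using the figures quoted above.
runtime_ratio = 4.38               # rv-jit wall-clock / native x86-64 wall-clock
uop_rate_ratio = 11.5e9 / 6.5e9    # native retired uops/sec / rv-jit retired insns/sec

# The portion of the slowdown not explained by the issue-rate gap is
# attributed to compiler and/or ISA efficiency (dynamic instruction count).
compiler_isa_factor = runtime_ratio / uop_rate_ratio

print(f"uop rate ratio:   ~{uop_rate_ratio:.2f}x")
print(f"compiler/ISA gap: ~{compiler_isa_factor:.2f}x")
```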
I have measured total retired RISC-V instructions, total retired x86 instructions and total retired micro-ops, in an attempt to factor out compiler efficiency, i.e. work done per operation versus total operations. A RISC-V instruction is more comparable to an x86 micro-op in terms of work done; however, RISC-V misses out on quite a few operations such as BSWAP and ROR/ROL, which are heavily used in the digest and cipher benchmarks, as well as fused add-and-load micro-ops and compiler coalescing of loads and stores. I can’t quantify some of these performance issues until we add comparable instructions to the RISC-V compiler and then measure the difference in the number of dynamic instructions executed.
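To illustrate the dynamic-instruction-count cost of the missing operations: a 32-bit byte swap that x86 retires as a single BSWAP (and a rotate that retires as a single ROR) expands into a sequence of shifts, masks and ORs on base RISC-V. A minimal sketch of those expansions:

```python
def bswap32(x: int) -> int:
    """32-bit byte swap. x86 retires this as one BSWAP; base RISC-V
    (without a bit-manipulation extension) needs a sequence of
    shifts, ANDs and ORs, inflating the dynamic instruction count."""
    return (((x & 0x000000ff) << 24) |
            ((x & 0x0000ff00) << 8)  |
            ((x & 0x00ff0000) >> 8)  |
            ((x & 0xff000000) >> 24))

def ror32(x: int, n: int) -> int:
    """32-bit rotate right. One ROR on x86; shift/shift/OR on base RISC-V."""
    n &= 31
    return ((x >> n) | (x << (32 - n))) & 0xffffffff
```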
I have results for:
- AES
- SHA-2
- NORX
- dhrystone
- miniz (compression)
- qsort
The stats include:
- Runtime (rv-jit, qemu-user, native x86-64)
- Runtime ratios
- Total retired instructions (x86, RISC-V)
- Total retired micro-ops (x86)
- MIPS (Millions of Instructions Per Second)
- MOPS aka Mμops/sec (Millions of micro-ops per second)
- MIPS:MIPS ratio
- MIPS:MOPS ratio
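The rate stats above reduce to simple ratios over the raw counters; a minimal sketch of the reduction (function and field names are my own, not from the measurement harness):

```python
def rate_stats(runtime_sec, retired_insns, retired_uops=None):
    """Reduce raw counters to MIPS (millions of instructions/sec)
    and, when micro-op counts are available (x86 only), MOPS
    (millions of micro-ops/sec)."""
    mips = retired_insns / runtime_sec / 1e6
    mops = retired_uops / runtime_sec / 1e6 if retired_uops is not None else None
    return mips, mops

# Hypothetical example: 6.5e9 instructions retired over 1.0 seconds.
mips, mops = rate_stats(1.0, 6.5e9)
```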
I’ve also spent some time on infrastructure for recording retired instructions, dynamic register usage and dynamic instruction usage. Of course it is not cycle accurate, but it can aid in assessment. If we come up with BSWAP and ROR/ROL encodings and add them to GCC, then we can see how much they impact performance. I suspect the performance issue with the cipher and digest benchmarks is multifactorial. The dynamic instruction execution counts are interesting. I also need to update my compiler baseline and re-run with -O3 instead of -Os (to quantify how much that impacts performance). In any case, I’m now gathering scripts to automate this…
I intend to gather more detailed data on the number of dynamic branches, as well as macro-op fusions, to look at the frequency of the various macro-op pattern matches (e.g. c.bnez, c.mv). I should also add cycles to the model. In any case, I’ve made a step forward since the last time I reported.
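One way the fusion-pattern frequencies could be tallied is a single pass over a dynamic instruction trace; this is just a sketch, and the particular opcode pairs and trace format are my own assumptions, not the JIT's actual pattern table:

```python
from collections import Counter

# Hypothetical fusable opcode pairs: adjacent RISC-V instructions the
# JIT could combine into one host macro-op.
FUSION_PATTERNS = {
    ("c.bnez", "c.j"),
    ("c.mv", "c.add"),
}

def count_fusions(trace):
    """Tally how often each fusable pair occurs adjacently in a
    dynamic instruction trace (a list of opcode mnemonics)."""
    counts = Counter()
    i = 0
    while i < len(trace) - 1:
        pair = (trace[i], trace[i + 1])
        if pair in FUSION_PATTERNS:
            counts[pair] += 1
            i += 2  # a fused pair consumes both instructions
        else:
            i += 1
    return counts
```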
Michael.