binary translation update, measuring retired instructions/micro-ops


Michael Clark

Jul 13, 2017, 6:51:43 PM
to RISC-V SW Dev
Hi,

I added some new benchmark stats for the RISC-V to x86-64 binary translation engine. We are now in the order of ~2.5X QEMU user mode binary translation performance on a small set of benchmarks.

I’m now exploring more detailed statistics, such as retired instructions vs retired micro-ops, in an attempt to factor out instruction set and compiler efficiency and to see how much performance is lost due to binary translation, how much is lost due to the maturity of the compiler port, and how much is even due to missing instructions in the ISA (BSWAP, ROR). The stats might be useful for isolating performance issues with RISC-V GCC. I also intend to add stats on the number of macro-op fusions the JIT engine is performing, to assess the benefit of various macro-op fusion patterns.

I've been able to reach a peak single-thread simulation performance of ~6.5 billion instructions per second on dhrystone, on a single core of a dual-core 3.4GHz Intel NUC (5th-gen Broadwell Core i7-5557U). The native x86 code for dhrystone retires ~8.6 billion instructions per second and ~11.5 billion micro-ops per second, however the benchmark runtime for rv-jit is ~4.38 times slower than the native code. 

So while the gross performance of dhrystone is ~4.38 times slower, when comparing retirement rates the JIT engine is only retiring ~1.78 times fewer RISC-V instructions per second than native x86-64 retires micro-ops per second. This means the x86-64 compiler or ISA has a bigger advantage on this particular benchmark, and it means we could potentially attribute a factor of ~2.46 to compiler and/or instruction set efficiency (for that particular benchmark). Something to root cause. In other benchmarks it goes the other way.
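
To make the arithmetic explicit, here is a rough sketch of that decomposition using the rounded figures above (the inputs are approximate, so the output only matches the quoted ratios to within rounding):

/* Back-of-the-envelope decomposition of the dhrystone ratios quoted above. */
#include <stdio.h>

int main(void) {
    double runtime_ratio   = 4.38; /* rv-jit runtime / native runtime        */
    double jit_insn_rate   = 6.5;  /* billion RISC-V instructions/s (rv-jit) */
    double native_uop_rate = 11.5; /* billion x86-64 micro-ops/s (native)    */

    /* How much slower the JIT retires work, op for op (quoted as ~1.78x). */
    double throughput_ratio = native_uop_rate / jit_insn_rate;

    /* What remains is extra dynamic work: more RISC-V instructions than
     * native micro-ops for the same benchmark, i.e. compiler and/or ISA
     * efficiency on this workload (quoted as ~2.46x). */
    double work_ratio = runtime_ratio / throughput_ratio;

    printf("throughput ratio: ~%.2f\n", throughput_ratio);
    printf("work ratio:       ~%.2f\n", work_ratio);
    return 0;
}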

I have measured total retired RISC-V instructions, total retired x86 instructions and total retired micro-ops in an attempt to factor out compiler efficiency, i.e. work done per operation versus total operations. A RISC-V instruction is more comparable to an x86 micro-op in terms of work done, however RISC-V misses out on quite a few micro-ops, such as BSWAP and ROR/ROL, that are heavily used in the digest and cipher benchmarks; there are also other issues such as fused add-and-load micro-ops and compiler coalescing of loads and stores. I can’t quantify some of these performance issues until we add comparable instructions to the RISC-V compiler and then measure the difference in the number of dynamic instructions executed.
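
To make that concrete, here is the kind of pattern I mean (illustrative C only, not code lifted from the benchmarks): a rotate or byte swap is a single micro-op on x86-64 but expands into several instructions on RISC-V without dedicated encodings.

#include <stdint.h>

/* 32-bit rotate right: a single ROR micro-op on x86-64, but a
 * shift/shift/or sequence on RISC-V without a rotate instruction. */
static inline uint32_t rotr32(uint32_t x, unsigned c) {
    c &= 31;
    return (x >> c) | (x << ((32 - c) & 31));
}

/* Byte swap: a single BSWAP on x86-64, but a series of shifts,
 * masks and ors on RISC-V without a dedicated instruction. */
static inline uint32_t bswap32(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
}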

See updates here: https://rv8.io/#benchmarks 

I have results for:

- AES
- SHA-2
- NORX
- dhrystone
- miniz (compression)
- qsort 
- sieve

The stats include: 

- Runtime (rv-jit, qemu-user, native x86-64)
- Runtime ratios
- Total retired instructions (x86, RISC-V) 
- Total retired micro-ops (x86) 
- MIPS (Millions of Instructions Per Second) 
- MOPS aka Mμops/sec (Millions of micro-ops per second) 
- MIPS:MIPS ratio
- MIPS:MOPS ratio

I’ve also spent some time on infrastructure for recording retired instructions, dynamic register usage and dynamic instruction usage. Of course it is not cycle accurate, however it can aid in assessment. If we come up with BSWAP and ROR/ROL encodings and add them to GCC, then we can see how much they impact performance. I suspect the performance issue with the cipher and digest benchmarks is multifactorial. The dynamic instruction execution counts are interesting. I also need to update my compiler baseline and re-run with -O3 instead of -Os (to quantify how much that impacts performance). In any case, I’m now gathering scripts to automate this…
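
A minimal sketch of the kind of counting this involves (not rv8’s actual code; the opcode index and helper names are hypothetical):

#include <stdint.h>
#include <stdio.h>

#define NUM_OPCODES 256                  /* hypothetical decoded opcode space */

static uint64_t insn_count[NUM_OPCODES]; /* dynamic instruction usage */
static uint64_t reg_count[32];           /* dynamic register usage    */

/* Called once per retired instruction with its decoded opcode index
 * and source register numbers. */
static inline void record_retire(unsigned op, unsigned rs1, unsigned rs2) {
    insn_count[op & (NUM_OPCODES - 1)]++;
    reg_count[rs1 & 31]++;
    reg_count[rs2 & 31]++;
}

void dump_counts(void) {
    for (unsigned i = 0; i < NUM_OPCODES; i++)
        if (insn_count[i])
            printf("op %3u: %llu\n", i, (unsigned long long)insn_count[i]);
}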

I intend to gather more detailed data on the number of dynamic branches, as well as macro-op fusions, to look at the frequency of the various macro-op pattern matches (e.g. c.bnez, c.mv). I should also add cycles into the model. In any case I’ve made a step forward since the last time I reported.

Michael. 

Andrew Waterman

Jul 14, 2017, 5:02:43 AM
to Michael Clark, RISC-V SW Dev
On Thu, Jul 13, 2017 at 3:51 PM Michael Clark <michae...@mac.com> wrote:
So while the gross performance of dhrystone is ~4.38 times slower, when comparing retirement rates the JIT engine is only retiring ~1.78 times fewer RISC-V instructions per second than native x86-64 retires micro-ops per second. This means the x86-64 compiler or ISA has a bigger advantage on this particular benchmark, and it means we could potentially attribute a factor of ~2.46 to compiler and/or instruction set efficiency (for that particular benchmark). Something to root cause. In other benchmarks it goes the other way.

Dhrystone is short enough that the best way to analyze it is by hand. No need to hypothesize.



Michael Clark

Jul 14, 2017, 6:53:10 PM
to Andrew Waterman, RISC-V SW Dev
On 14 Jul 2017, at 9:02 PM, Andrew Waterman <and...@sifive.com> wrote:



Dhrystone is short enough that the best way to analyze it is by hand. No need to hypothesize.

Yes we should.

You do set a high standard for proof, but of course it is required for fixing compiler codegen. In fact the only references I can find to benchmarks showing retired micro-ops are from Patterson et al. It has really been a case of incrementally improving my previously rather crude measurement practices. Up until this point I had been eyeballing asm to spot sequences for which I could improve the mapping to x86, and testing whether the changes made the JIT go faster. I had to dig through the Intel architecture tomes to find the model-specific performance counter for retired micro-ops, as it is not exposed (yet) in the large set of named performance events visible in the Linux perf tool. There are 90 named events, and retired micro-ops surprisingly is not among them. This is the command:

perf stat -e instructions,r1c2 <prog>
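(If I’m decoding the raw event syntax correctly, r1c2 is umask 0x01, event 0xC2, i.e. UOPS_RETIRED.ALL on this microarchitecture.)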

Also to note, it’s dhry1.1-mc, not dhry1.2. When I started on the JIT it was the only version I could get to compile and run with newlib, and when I changed the K&R function declarations to C99 (i.e. to make them typesafe), the compiler spotted a longstanding bug in a pointer dereference, which I fixed; the compiler previously didn’t have type information due to the K&R declarations. I don’t think it was substantive. The difference between dhry1.1 and dhry1.2 may be bigger. In any case dhrystone is a terrible piece of code. It’s amazing that it became such a popular benchmark.
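
As an aside, this contrived example (not the actual Dhrystone source) shows why the prototypes matter: with a K&R-style declaration the compiler has no parameter types to check, so a wrong pointer argument is accepted silently, whereas a C99 prototype turns the same call into a diagnostic.

int Proc_old();            /* K&R declaration: arguments unchecked in older C dialects */
int Proc_new(int *p);      /* C99 prototype: arguments type-checked                    */

void caller(int **pp) {
    Proc_old(pp);          /* wrong level of indirection, accepted silently     */
    /* Proc_new(pp); */    /* the same call is rejected once a prototype exists */
}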

The best thing would be to run a sampling profiler to see whether it’s in Proc1, 2, 3, 4, 5, 6, 7… or Proc8. I can easily arrange for program counter sample to symbol mappings, as I have both of those parts in the codebase already, and I’ll take a closer look at the codegen at some point so we don’t have to speculate.

It’s probably quite important that I rebase to a well-known compiler version, e.g. gcc 7.1. However, when I started working on the JIT I pinned a particular baseline compiler that was in git at that time, so that I had a stable reference point for any optimisations I made, and so I ended up with a particular compiler from a particular date and a particular set of flags. The other issue is the use of -Os vs -O3; I need to test with both. The use of -Os is a historical personal practice based on recommendations and defaults set by Apple for ARM and x86-64 codegen in LLVM. Xcode defaults to -Os, and in my past testing with Clang it has been just as fast as -O3. That’s why I had -Os for the RISC-V benchmarks. It seems, however, that gcc -O3 could be in the order of 20% faster than -Os (at least with the JIT engine, from some previous experiments). I’ll quantify this in the next report so we don’t have to base it off anecdotes. I’ll also work on my scripts a bit more so that I can generate charts for -O3 and -Os (for both RISC-V and x86-64), and also perhaps surface RV32 and x86-32 stats. 

At least now I have some JIT-neutral stats, now that we record retired instructions and retired micro-ops, and they point to a few cases worth exploring where the retired op counts are much higher on RISC-V. I also want to increase the runtime of the sieve and qsort test cases so there is less noise from short runs. The other constraint I have on the benchmarks is that they are single file, which makes them easier to handle. Earlier on I was using bzip2 and gzip compiled with newlib, however qemu-user was not able to run them.

BTW I don’t have access to SPEC, and even if I did, I think there is some exorbitant sum that needs to be paid if I wish to publish their figures, so I’m assembling my own little suite. It’s scalar-integer centric at the moment as I haven’t implemented JIT support for F and D. Mapping V to AVX-512 might be quite interesting too: covering the combinatorial explosion of bad opcode space management with a modern, forward-looking, variable-length vector extension encoding.

In any case it’s interesting to compare retired CISC micro-ops versus RISC-V instructions, however there are still some flaws in my methodology. I couldn’t see this before; now I can see which benchmarks the JIT has to work harder on due to the number of ops… I should commit my scripts after tidying them up so the results can be reproduced.


Rakesh H S

Jan 30, 2019, 1:12:04 AM
to RISC-V SW Dev
Hi,

I need to run the benchmarks, so first I installed coremark and rv8-bench. While running npm start bench for the qemu, native and size targets I’m getting undefined as the runtime. How should I run the benchmarks using npm start?

Thank you

Regards,
Rakesh

Michael Clark

Jan 30, 2019, 7:25:39 PM
to Rakesh H S, RISC-V SW Dev


On 30/01/2019, at 7:12 PM, Rakesh H S <rakes...@gmail.com> wrote:

Hi,

I need to run the benchmarks, so first I installed coremark and rv8-bench. While running npm start bench for the qemu, native and size targets I’m getting undefined as the runtime. How should I run the benchmarks using npm start?

I suspect you will need to do some debugging.

The scripts are provided “as is”, and are just an artefact of an otherwise tedious process of tabulating dozens of metrics for comparison of interpreters and binary translators while I was optimising the binary translation engine in rv8. Read the license. I shared the scripts and templates because I thought others might find them useful, as when folk make presentations, the scripts and methods used to create the tables and plots are often omitted.

I deliberately do not use any proprietary benchmarks in this open source repo, as the proprietary benchmarks usually have restrictions on redistribution and publishing of results, so if you want to add coremark, you will need to edit the scripts and debug the changes.

Note one caveat: for file sizes I chose to use the stripped binary size, although one should probably use riscv-none-elf-size. Both have pros and cons, as do static vs dynamic linking, since one should account for the size of dynamic section relocs and data sections because of inline vs out-of-line constants. There are quite a few nuances with respect to the configuration of the C library. RISC-V uses float128 for long double, and this impacts the size of printf, which can dominate small benchmarks when using static linking.

You are welcome to use the scripts and benchmarks however you wish, but if you want to edit the scripts to add other benchmarks you will need an understanding of nodejs, JavaScript, GNU make, etc. I chose JavaScript because at the time I was more familiar with JS than Python. The JS code just launches the benchmarks, saves the results and substitutes them into markdown table templates. At some point I might rewrite the scripts in Python. You are welcome to make a pull request if you make any enhancements that are possible to share; I don’t think we can add coremark as the license may not be compatible.

Michael


Bruce Hoult

Jan 30, 2019, 7:33:33 PM
to Michael Clark, Rakesh H S, RISC-V SW Dev
The GitHub copy of coremark says it has an Apache license and you can redistribute it.

https://github.com/eembc/coremark/blob/master/LICENSE.md

There is still a presumably out-of-date PDF that says you can't.

https://www.eembc.org/coremark/coremark_license.pdf

Michael Clark

Jan 30, 2019, 8:18:46 PM
to Bruce Hoult, Rakesh H S, RISC-V SW Dev


> On 31/01/2019, at 1:33 PM, Bruce Hoult <bruce...@sifive.com> wrote:
>
> The github copy of coremark says it has an Apache license and you can
> redistribute it.
>
> https://github.com/eembc/coremark/blob/master/LICENSE.md

Oh that’s good to know.

Rakesh H S

Feb 6, 2019, 12:54:35 AM
to RISC-V SW Dev
Hi,

I'm not able to create the native-i386 and native-x86_64 binaries in rv8-bench. How do I use the perf tool to check the CPU usage? Please also guide me step by step to create the native x86 binaries. While running the benchmark it shows the runtime as undefined.

Rakesh
