RFC: QEMU RISC-V modular ISA decoding

Bastian Koppelmann

Jul 25, 2017, 9:05:19 AM
to RISC-V SW Dev, QEMU Developers
Hi QEMU devs, hi risc-v-sw devs,

I'm posting this across both mailing lists since I'd like to get
feedback from both sides.

Right now the RISC-V port for QEMU uses the classic decoding scheme of
one function decoding the first opcode (and prefixes) and then branches
to different functions for decoding the rest (as in target/arm or most
of the other targets). This is all done using switch/case statements.

This is a little bit tricky to extend, especially for third parties. I
don't think it's too bad, but it can definitely be improved. I really
like the way target/s390x does it, but I'm not sure about its drawbacks.

I see three options to proceed from here:

1) Completely move to a decoding scheme similar to target/s390x in
QEMU. On the plus side it makes it super easy to add new
instructions and/or new instruction formats, and reduces decoding
errors. I don't know the major drawbacks of this approach; maybe
performance. Does anyone know? Other than that it needs a major
rewrite of the decoder, which will take some time and thus delay
the development of RISC-V QEMU upstream. (I think RV32/64I can
be left as is, since everybody has to implement it anyways)

2) The compromise: Leave the core as is, i.e. RV32GC, and provide a
nice interface for any other extension, similar to target/s390x.
The big plus here is that no code needs to be changed and only
the interface needs to be added. We don't add any performance
overhead (if s390x-style decoding adds any), but this might
result in nobody using it, since they don't know about the
interface and just hack their stuff in, in which case implementing
the interface was a waste of our time.

3) The status quo. Just leave it as is.

Any comments?

Cheers,
Bastian



Samuel Falvo II

Jul 25, 2017, 10:31:45 AM
to Bastian Koppelmann, RISC-V SW Dev, QEMU Developers
For those of us who are not in the know, how does target/s390 decoding work?

I've maintained a 65816 emulator
(https://bitbucket.org/kc5tja/lib65816/src) which also uses a giant
case construct. This method is used because it's fast. Any
alternative approaches you decide to take might well work and satisfy
extensibility requirements, but they'll likely take a performance hit as
well.



--
Samuel A. Falvo II

Bruce Hoult

Jul 25, 2017, 12:37:23 PM
to Bastian Koppelmann, RISC-V SW Dev, QEMU Developers
Do you have any good estimates for how much of the execution time is typically spent in instruction decode?

RISC-V qemu is twice as fast as ARM or Aarch64 qemu, so it's doing something right!

(I suspect it's probably mostly the lack of needing to emulate condition codes)



Corey Richardson

Jul 25, 2017, 1:52:32 PM
to sw-...@groups.riscv.org
And rv8 faster still :)


--
cmr
http://octayn.net/
+16038524272

Michael Clark

Jul 25, 2017, 2:22:22 PM
to Corey Richardson, sw-...@groups.riscv.org
However rv8 instruction decode and interpretation speed is very similar to spike.

spike has a fast and interesting decoder. Instead of using a big switch, which is necessarily static, spike linearly searches an array of mask and match values, bubbling recently found entries towards the top, so that recently used entries are found quickly, e.g. it is optimised for loops. This proves to be very fast in practice. In addition, spike hashes decoded instructions modulo 8191 (a Mersenne prime) into a cache, to short-circuit re-decoding. spike's scheme has the benefit of allowing new opcodes to be defined at runtime; spike can load ISA extensions as shared objects. That simply won't work with a static switch-style decoder.
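
As a rough sketch of the idea (this is not spike's actual code; the types, table size and exact bubbling are invented for illustration):

#include <stddef.h>
#include <stdint.h>

typedef struct cpu cpu_t;                /* hypothetical CPU state */

typedef struct {
    uint32_t match;                      /* fixed opcode bits */
    uint32_t mask;                       /* which bits are fixed */
    void (*exec)(cpu_t *, uint32_t);     /* per-instruction handler */
} insn_desc_t;

#define NUM_INSNS 256                    /* made-up table size */
static insn_desc_t table[NUM_INSNS];

static insn_desc_t *decode_slow(uint32_t inst)
{
    for (size_t i = 0; i < NUM_INSNS; i++) {
        if ((inst & table[i].mask) == table[i].match) {
            if (i > 0) {                 /* bubble the hit one slot up */
                insn_desc_t tmp = table[i - 1];
                table[i - 1] = table[i];
                table[i] = tmp;
                return &table[i - 1];
            }
            return &table[i];
        }
    }
    return NULL;                         /* illegal instruction */
}

On top of that, the decoded result can be cached in an array indexed by a hash of the instruction bits (modulo 8191 in spike's case) so that hot instructions skip the search entirely.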

I see qemu s390x has target-s390x/insn-data.def, target-s390x/insn-format.def, #include “insn-data.def” and #include “insn-format.def” so it must take a similar approach to spike. I would say that would be preferable and easier to maintain than a manually edited switch statement.

I tried making a dynamic decoder but it performed poorly. The rv8 switch decoder is machine generated from a derivative of riscv-opcodes, which is what spike uses for its dynamic decoder, so at the very least we don’t have to manually edit a big switch statement. JITing the decoder at runtime would be the next step (perhaps using BMI instructions to extract fields), but decode isn’t really high on the profile once the code has been translated.

In fact spike’s decoder is somewhat like a log(n) software emulation of a ternary CAM. It’s largely agnostic to RISC-V decoding, and could handle any fixed size RISC style encoding with length markers (2, 4, 6, 8-bytes).

Cesar Eduardo Barros

Jul 25, 2017, 6:57:48 PM
to Michael Clark, Corey Richardson, sw-...@groups.riscv.org
I've thought of making a branchless table-driven "shift and mask"
decoder for RISC-V:

1. Shift the instruction right a number of bits, then AND it with a mask;
2. Use the result as the index into a table;
3. The table contains the next shift amount, the next mask, and the next
table; repeat steps 1 and 2 using them;
4. After doing this a few times, the last level of the table contains
the opcode and whatever other information the emulator requires.

This needs at least three levels of tables (major opcode, funct3,
funct7), perhaps four or five levels to decode in more detail without
bloating the tables. For instructions with only one level of decoding
(like LUI), the other levels of the table can use a shift of 0, a mask
of 0, and a single-entry table (since doing an AND with 0 always results
in 0).

With a few tricks (use one byte each for shift amount and mask, plus an
offset into the next-level table instead of a pointer, reuse identical
entries), this should be small enough to fit into the low-latency L1D
cache with plenty to spare.
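
A minimal sketch of what the lookup could look like using those tricks,
with an invented table layout (the tables themselves would be generated
offline; 32-bit forms only, RVC ignored):

#include <stdint.h>

typedef struct {
    uint8_t  shift;    /* how far to shift the instruction next */
    uint8_t  mask;     /* mask to apply after the shift */
    uint16_t next;     /* offset of the next-level table */
    uint16_t opcode;   /* meaningful only in the last level */
} level_t;

extern const level_t tables[];           /* generated offline */

static uint16_t decode(uint32_t inst)
{
    /* level 1: major opcode, bits [6:2] */
    level_t e = tables[(inst >> 2) & 0x1f];
    /* level 2: e.g. funct3 (or a 0-shift/0-mask dummy for LUI) */
    e = tables[e.next + ((inst >> e.shift) & e.mask)];
    /* level 3: e.g. funct7 */
    e = tables[e.next + ((inst >> e.shift) & e.mask)];
    return e.opcode;
}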

In the same way, decoding an immediate could also be done with only a
fixed (and small) number of shifts and masks (only one of the shifts
being "negative" or zero, and with only one of the shifts
sign-extending), with an OR at the end. These shift amounts and masks
could be stored together with the opcode in the last level of the table,
or even (since the instruction format depends only on the major opcode,
at least for non-RVC) in the first level of the table, so the immediate
could be decoded while waiting for the L1D for the opcode tables.
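
For example, the S-type immediate needs only two shifts, two masks and
an OR (assuming the host does an arithmetic right shift on signed
values):

#include <stdint.h>

/* S-type: imm[11:5] comes from inst[31:25], imm[4:0] from inst[11:7]. */
static int32_t s_imm(uint32_t inst)
{
    return (((int32_t)inst >> 20) & ~0x1f)   /* imm[11:5], sign-extended */
         | (int32_t)((inst >> 7) & 0x1f);    /* imm[4:0] */
}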

I don't know how fast that would be. It would avoid branch misprediction
penalties, at the cost of wasted work for "simple" instructions like LUI
or AUIPC, and possibly bubbles waiting for the L1D cache. Also, using it
for RVC would need more levels (the tables can easily ignore the upper
16 bits, however), and generating optimal decoding tables is a pain.
Cesar Eduardo Barros
ces...@cesarb.eti.br

Alex Suykov

Jul 25, 2017, 7:28:22 PM
to Cesar Eduardo Barros
Tue, Jul 25, 2017 at 07:57:39PM -0300, Cesar Eduardo Barros wrote:

> I've thought of making a branchless table-driven "shift and mask" decoder
> for RISC-V:
>
> 1. Shift the instruction right a number of bits, then AND it with a mask;
> 2. Use the result as the index into a table;
> 3. The table contains the next shift amount, the next mask, and the next
> table, repeat steps 1 and 2 using them;
> 4. After doing this a few times, the last level of the table contains the
> opcode and whatever other information the emulator requires.

https://github.com/arsv/riscv-qemu/blob/devel/target-riscv/translate.c#L186
https://github.com/arsv/riscv-qemu/blob/devel/target-riscv/translate_rvi.c#L84

And the rest of the code there. That's assuming the compiler turns switch()es
with nice case values like 0...n into jump tables, which it often does.

> This needs at least three levels of tables (major opcode, funct3, funct7),
> perhaps four or five levels to decode in more detail without bloating the
> tables.

It works nicely to a certain level, then the tables become too small.
Merging non-contiguous fields worked for me in a couple of places.
Take a look, maybe you'll come up with something better.

https://github.com/arsv/riscv-qemu/blob/devel/target-riscv/translate_rvi.c#L177
https://github.com/arsv/riscv-qemu/blob/devel/target-riscv/translate_rvi.c#L216

Michael Clark

Jul 25, 2017, 8:21:10 PM
to Cesar Eduardo Barros, Corey Richardson, sw-...@groups.riscv.org
On 26 Jul 2017, at 10:57 AM, Cesar Eduardo Barros <ces...@cesarb.eti.br> wrote:

> I've thought of making a branchless table-driven "shift and mask" decoder for RISC-V:
>
> 1. Shift the instruction right a number of bits, then AND it with a mask;
> 2. Use the result as the index into a table;
> 3. The table contains the next shift amount, the next mask, and the next table, repeat steps 1 and 2 using them;
> 4. After doing this a few times, the last level of the table contains the opcode and whatever other information the emulator requires.
>
> This needs at least three levels of tables (major opcode, funct3, funct7), perhaps four or five levels to decode in more detail without bloating the tables. For instructions with only one level of decoding (like LUI), the other levels of the table can use a shift of 0, a mask of 0, and a single-entry table (since doing an AND with 0 always results in 0).
>
> With a few tricks (use one byte each for shift amount and mask, plus an offset into the next-level table instead of a pointer, reuse identical entries), this should be small enough to fit into the low-latency L1D cache with plenty to spare.

Yes. icache vs dcache. It’s hard to know what is best without measuring. A decoder using PEXT/PDEP would be interesting. I don’t know how much it would speed things up. It would need to be quantified.

> In the same way, decoding an immediate could also be done with only a fixed (and small) number of shifts and masks (only one of the shifts being "negative" or zero, and with only one of the shifts sign-extending), with an OR at the end. These shift amounts and masks could be stored together with the opcode in the last level of the table, or even (since the instruction format depends only on the major opcode, at least for non-RVC) in the first level of the table, so the immediate could be decoded while waiting for the L1D for the opcode tables.

> I don't know how fast that would be. It would avoid branch misprediction penalties, at the cost of wasted work for "simple" instructions like LUI or AUIPC, and possibly bubbles waiting for the L1D cache. Also, using it for RVC would need more levels (the tables can easily ignore the upper 16 bits, however), and generating optimal decoding tables is a pain.

You can have an early out bit in the state table for major opcodes i.e. a terminal node in the state machine.

There are also a lot of bits you can ignore because they are only ever used for registers or immediate values.

I used a recursive greedy algorithm that naturally discovers the RVC prefixes, then the majors, then functions and sub-functions. It works by scanning the ternary grid for a group of columns with the most cover, then partitioning and recursing over subsets of nodes with common prefixes; prefix in the abstract sense, as it groups columns with equivalent cover, effectively sorting columns with the most cover first. If it finds discontiguous columns with the same cover, it bunches them together (with shift and or) so they can be switched on as one unit. We could use the groupings of bits with equivalent cover to make PEXT patterns; however, PEXT can't handle bits crossing over, i.e. it requires more than one pass if bits are re-ordered as well as extracted. In fact we can just switch on the alternate order; the order of the bits is only significant for immediate decoding.


It makes this:


It almost looks like code a human would write.
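
Not rv8's actual output, but to give the flavour, a hand-written guess at that style of generated decoder:

#include <stdint.h>

typedef enum { rv_op_illegal, rv_op_lui, rv_op_add, rv_op_sub, rv_op_sll } rv_op;

static rv_op decode_op(uint32_t inst)
{
    switch ((inst >> 2) & 0x1f) {                /* major opcode */
    case 0x0d:
        return rv_op_lui;
    case 0x0c:                                   /* OP */
        /* discontiguous funct7/funct3 bits bunched into one value */
        switch (((inst >> 22) & 0x3f8) | ((inst >> 12) & 0x7)) {
        case 0x000: return rv_op_add;
        case 0x100: return rv_op_sub;
        case 0x001: return rv_op_sll;
        }
        break;
    }
    return rv_op_illegal;
}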


Stefan O'Rear

Jul 26, 2017, 2:56:27 AM
to Bruce Hoult, Bastian Koppelmann, RISC-V SW Dev, QEMU Developers
On Tue, Jul 25, 2017 at 9:37 AM, Bruce Hoult <br...@hoult.org> wrote:
> Do you have any good estimates for how much of the execution time is
> typically spent in instruction decode?
>
> RISC-V qemu is twice as fast as ARM or Aarch64 qemu, so it's doing something
> right!
>
> (I suspect it's probably mostly the lack of needing to emulate condition
> codes)

The last time I tried to profile qemu (system mode, running the Go
bootstrap, I think), I didn't get very far because there was no JIT map
and I couldn't get frame pointers working, but as far as I got none of
the translate functions showed up.

Most of the time was spent in translated code, in the trampolines for
entering and exiting translated code, in TLB maintenance, and in the
code that chooses which basic block to run next.

Making the instruction decoder a bit slower is not likely to have much
effect (but do take before and after measurements to be sure).

Significant wins would come from reducing the number of switches
between translations (e.g. by translating larger units: all code on a
page at once, functions, traces), by making switches between
translations cheaper (e.g. with inline caches), or by reducing the
cost of access translation (e.g. two accesses relative to the same
base register in the same translation often hit the same virtual page
and can share translation effort, or more speculatively by using host
translation hardware). (I am willing to discuss any of these further
OFF-LIST ONLY.)

-s

Bastian Koppelmann

Jul 26, 2017, 7:46:10 AM
to Bruce Hoult, RISC-V SW Dev, QEMU Developers
On 07/25/2017 06:37 PM, Bruce Hoult wrote:
> Do you have any good estimates for how much of the execution time is
> typically spent in instruction decode?
>
> RISC-V qemu is twice as fast as ARM or Aarch64 qemu, so it's doing
> something right!
>
> (I suspect it's probably mostly the lack of needing to emulate condition
> codes)

And most likely due to no overflow calculations. I don't expect
performance to be too big of an issue (I don't have hard data on that);
this was just the first thing that came to mind. I was rather hoping to
get some feedback from the s390x maintainers/QEMU devs on other problems
I'm not seeing.

The important question to me is: is it worth refactoring the code to
allow easy extensibility or not?

Cheers,
Bastian

Bastian Koppelmann

Jul 26, 2017, 8:01:13 AM
to Samuel Falvo II, RISC-V SW Dev, QEMU Developers
Hi Samuel,

On 07/25/2017 04:31 PM, Samuel Falvo II wrote:
> For those of us who are not in the know, how does target/s390 decoding work?

sorry about that. I was going into this with a QEMU-dev mindset :)

The basic idea of s390x is to have every instruction + instruction
formats specified in files that are parsed by the preprocessor and then
used through preprocessor magic to generate switch-case statements for
insn selection and data structures filled with the decoded data.

s390x has two files:
- insn-data.def -> contains all the instructions, including opcodes,
name, ref to insn specific translate function,
ref to insn format, and some more
- insn-format.def -> contains all the instruction formats

these are then used to automatically generate the switch-case statements
and decoding code.

If you want to extend this, you add your own insn format to
insn-format.def and add functions for decoding its parameters in
translate.c. Then you add your insns, referencing the new format, to
insn-data.def and add translation functions for each of them.

The main benefit here is that you don't have to bother with writing all
that boring glue code.
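
To give a rough idea of the shape (the macro names and fields below are
invented for RISC-V, not the actual s390x ones), it boils down to
something like:

#include <stdint.h>

/* One line per instruction: name, opcode, funct3, funct7, format,
 * translate function.  In s390x this lives in insn-data.def and is
 * #included several times with different macro definitions. */
#define INSN_LIST(X)                             \
    X(ADD,  0x33, 0x0, 0x00, FMT_R, trans_add)   \
    X(ADDI, 0x13, 0x0, 0x00, FMT_I, trans_addi)  \
    X(LUI,  0x37, 0x0, 0x00, FMT_U, trans_lui)

typedef enum { FMT_R, FMT_I, FMT_U } InsnFormat;
typedef struct DisasContext DisasContext;
typedef void TransFn(DisasContext *s, uint32_t insn);

/* per-insn translate functions, defined elsewhere */
void trans_add(DisasContext *s, uint32_t insn);
void trans_addi(DisasContext *s, uint32_t insn);
void trans_lui(DisasContext *s, uint32_t insn);

typedef struct {
    const char *name;
    uint8_t opcode, funct3, funct7;
    InsnFormat fmt;
    TransFn *trans;
} InsnDesc;

#define MAKE_ENTRY(name, opc, f3, f7, fmt, fn) { #name, opc, f3, f7, fmt, fn },
static const InsnDesc insn_table[] = { INSN_LIST(MAKE_ENTRY) };
#undef MAKE_ENTRY

Adding an extension then means adding lines to the list (or a new
format) instead of touching hand-written decoder code.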

I hope that helped :)

Cheers,
Bastian

Richard W.M. Jones

Jul 26, 2017, 1:36:33 PM
to Bastian Koppelmann, Samuel Falvo II, RISC-V SW Dev, QEMU Developers
On Wed, Jul 26, 2017 at 02:00:14PM +0200, Bastian Koppelmann wrote:
> Hi Samuel,
>
> On 07/25/2017 04:31 PM, Samuel Falvo II wrote:
> > For those of us who are not in the know, how does target/s390 decoding work?
>
> sorry about that. I was going into this with a QEMU-dev mindset :)
>
> The basic idea of s390x is to have every instruction + instruction
> formats specified in files that are parsed by the preprocessor and then
> used through preprocessor magic to generate switch-case statements for
> insn selection and data structures filled with the decoded data.
>
> s390x has two files:
> - insn-data.def -> contains all the instructions, including opcodes,
> name, ref to insn specific translate function,
> ref to insn format, and some more
> - insn-format.def -> contains all the instruction formats
>
> these are then used to automatically generate the switch-case statements
> and decoding code.

I've just looked at the s390x TCG code for the first time, and this is a
far more sensible way of doing it. We should do it for all the arches :-)
I wonder if there's a performance penalty?

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW

kr...@berkeley.edu

Jul 26, 2017, 4:58:42 PM
to Richard W.M. Jones, Bastian Koppelmann, Samuel Falvo II, RISC-V SW Dev, QEMU Developers

Given that one of the goals of RISC-V is extensibility, it would be
nice if the QEMU port was done in a way to make it easier to extend by
third parties, including other automated tools. I'm sure that, over
time, the preprocessor can be improved to automatically incorporate
optimizations for better performance.

Krste

Michael Clark

Jul 26, 2017, 5:25:18 PM
to kr...@berkeley.edu, Richard W.M. Jones, Bastian Koppelmann, Samuel Falvo II, RISC-V SW Dev, QEMU Developers
On 27 Jul 2017, at 8:58 AM, kr...@berkeley.edu wrote:


> Given that one of the goals of RISC-V is extensibility, it would be
> nice if the QEMU port was done in a way to make it easier to extend by
> third parties, including other automated tools.  I'm sure that, over
> time, the preprocessor can be improved to automatically incorporate
> optimizations for better performance.

I had a look at the s390x code in more detail.

It essentially does some length and major opcode unpacking to construct/linearise the opcode space, and would be equivalent to constructing a 15-bit opcode for RISC-V. It does some pre-decoding to find the length and the majors, has a switch on the majors to find the instruction type, and then appends the minors, if any.


It would be akin to something like this (at a high level):

len = inst_len(inst)        /* bits[1:0] */
op  = major_opcode(inst)    /* bits[6:2] */

switch op
  - rvc
      /* hairy bit */
  - r-type
      op2 = funct7 << 3 | funct3
  - i-type
  - s-type
      op2 = funct3
  - u-type
      op2 = 0

op = op | op2 << 5
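
A more concrete (but still invented) sketch, for the 32-bit forms only:

#include <stdint.h>

static uint32_t linear_opcode(uint32_t inst)
{
    uint32_t op = (inst >> 2) & 0x1f;    /* major opcode, bits [6:2] */
    uint32_t f3 = (inst >> 12) & 0x7;    /* funct3 */
    uint32_t f7 = (inst >> 25) & 0x7f;   /* funct7 */
    uint32_t op2;

    switch (op) {
    case 0x0c: case 0x0e:                /* OP, OP-32: R-type */
        op2 = (f7 << 3) | f3;
        break;
    case 0x0d: case 0x05:                /* LUI, AUIPC: U-type */
        op2 = 0;
        break;
    default:                             /* I-type, S-type, ... */
        op2 = f3;
        break;
    }
    return op | (op2 << 5);
}

RVC would still need its own pre-decode on bits[1:0] before any of this.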

