Dirty RVC if/then/else trick


Bruce Hoult

Dec 26, 2017, 6:32:18 AM
to RISC-V SW Dev
If you are compiling an IF/THEN/ELSE and it happens that the ELSE part can be done with a single 16-bit instruction, then instead of ending the THEN part with a branch around the ELSE, you could emit the 16-bit value 0x0037.

Together with the following 16 bits, this forms an "lui x0,#imm20" instruction (which is a no-op) whose imm20 field contains the single RVC instruction of the ELSE part.
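
A minimal sketch of the encoding described above, assuming little-endian instruction fetch and using `c.li a0, 1` (0x4505) as a stand-in for the ELSE-part instruction. The helper names `overlay` and `decode_lui` are made up for illustration.

```python
# Sketch only: place a 16-bit RVC instruction right after the parcel 0x0037
# and decode the resulting 32-bit word as an LUI.

LUI_OPCODE = 0x37
C_LI_A0_1 = 0x4505  # c.li a0, 1 (illustrative ELSE-part instruction)


def overlay(rvc: int) -> int:
    """Return the 32-bit word formed by parcel 0x0037 followed by `rvc`."""
    assert 0 <= rvc <= 0xFFFF
    # Little-endian: the first parcel in memory is the low half of the word.
    return (rvc << 16) | 0x0037


def decode_lui(word: int):
    """Return (rd, imm20) if `word` is an LUI, otherwise None."""
    if word & 0x7F != LUI_OPCODE:
        return None
    return (word >> 7) & 0x1F, word >> 12


word = overlay(C_LI_A0_1)
rd, imm20 = decode_lui(word)
print(hex(word), rd, hex(imm20))  # 0x45050037 0 0x45050
```

Falling out of the THEN part, the core sees `lui x0, 0x45050` and moves on; branching to the address two bytes later, it sees `c.li a0, 1`.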

Useful? Too dirty to live? Will it break and/or slow down any likely microarchitecture?

Michael Chapman

Dec 26, 2017, 8:40:40 AM
to sw-...@groups.riscv.org

Binary-to-binary translators won't like it, e.g. when translating RISC-V binaries to x86.

--
You received this message because you are subscribed to the Google Groups "RISC-V SW Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sw-dev+un...@groups.riscv.org.
To post to this group, send email to sw-...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/sw-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/sw-dev/CAMU%2BEkyBCMxV8NSFjucMFrkneV2vFffk-QZc%2Brz-091bArh-ew%40mail.gmail.com.


Bruce Hoult

Dec 26, 2017, 12:11:26 PM
to Michael Chapman, RISC-V SW Dev
Having written one of those, I don't see why. Static compilation, ok, sure I agree, but that kind of thing is exactly why dynamic translators such as qemu or rv8 exist and work on basically anything a real CPU can handle.



Michael Clark

Dec 26, 2017, 3:37:29 PM
to Bruce Hoult, Michael Chapman, RISC-V SW Dev
rv8 will end up with separate translations for each direction of the first branch, one that parses the LUI nop and another that starts parsing the instruction stream at the start of the “embedded” 16-bit instruction.

It would make disassembly and static analysis more difficult. Perhaps a good trick for an obfuscating compiler.

What is the predicted-branch latency on Rocket? 1 cycle? The branch predictor has to re-steer fetches from the I$. Given that most branches are predicted and the nop takes 1 cycle, it's only a win if predicted branches take longer than 1 cycle. I don't think it is worth the dirtiness unless it's a substantial performance win in the general case, given that most branches in typical code (i.e., loops) are predicted.


Cesar Eduardo Barros

Dec 26, 2017, 5:16:52 PM
to Bruce Hoult, RISC-V SW Dev
First of all: as far as I know, there's no rule forbidding it. If it
breaks, the core is defective. It might be a good idea to add some tests
to the riscv-tests repository to make sure (I didn't find this case on a
quick look over rvc.S).

I'd expect simpler cores which decode one instruction at a time to not
have any slowdown, and little if any speedup (unconditional jumps can be
predicted taken). I'd expect cores which decode several instructions in
parallel, or cores which convert short forward jumps into predicated
micro-instructions (page 18 of RISC-V ISA v2.2) to have to throw away
work and start again at the decode step.

To decode several instructions in parallel, a core has to know the
length of the instructions. This can be done with a boolean formula over
the first two bits of every 16-bit parcel. Jumping to the middle of an
instruction defeats that mechanism.
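
A rough sketch of the length rule described above, assuming only 16-bit and 32-bit encodings (longer formats are ignored). The function names are hypothetical. It shows that sweeping the same bytes from two entry points yields different instruction boundaries.

```python
def parcel_len(parcel: int) -> int:
    """Byte length of the instruction starting at this 16-bit parcel.

    Low two bits == 0b11 means a 32-bit instruction; anything else is RVC.
    Only 16- and 32-bit encodings are modeled in this sketch.
    """
    return 4 if parcel & 0b11 == 0b11 else 2


def boundaries(parcels, start=0):
    """Byte offsets where instructions begin when decoding from `start`."""
    out = []
    i = start // 2
    while i < len(parcels):
        out.append(2 * i)
        i += parcel_len(parcels[i]) // 2
    return out


# 0x0037 (start of the LUI), 0x4505 (c.li a0, 1), 0x0001 (c.nop)
code = [0x0037, 0x4505, 0x0001]
print(boundaries(code, 0))  # [0, 4]  fall-through path sees the LUI
print(boundaries(code, 2))  # [2, 4]  branch target sees the c.li
```

A parallel decoder that has already marked offset 0 as a 32-bit instruction has no boundary at offset 2, which is the work it would have to throw away.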

So, I'd put this trick together with self-modifying code in the "please
don't" pile.

--
Cesar Eduardo Barros
ces...@cesarb.eti.br

Samuel Falvo II

Dec 26, 2017, 5:51:04 PM
to Cesar Eduardo Barros, Bruce Hoult, RISC-V SW Dev
On Tue, Dec 26, 2017 at 2:16 PM, Cesar Eduardo Barros
<ces...@cesarb.eti.br> wrote:
> I'd expect simpler cores which decode one instruction at a time to not have
> any slowdown, and little if any speedup (unconditional jumps can be

Without defending the idea, historically, the point of this trick is
to save space, not to make things faster. This is a *VERY* commonly
used technique to pack the maximum amount of code into limited ROM space
for Z80 and 6502 systems (e.g.,
http://6502.org/tutorials/6502opcodes.html#BIT). It's at least
partially responsible for enabling complete BASIC implementations,
with floating point support, on Z80 systems with only 16KB of ROM.

On something like the E310 chip, with only 16KB of scratchpad RAM,
this might be a valuable trick to apply for code placed in that RAM,
especially considering the code density of RISC-V is about 2x-3x
poorer than a Z80.

> So, I'd put this trick together with self-modifying code in the "please
> don't" pile.

The success of this approach as a space optimization depends entirely
on how often this substitution makes sense. On a 6502 or Z80, this
can be used to effectively predicate things like loading a register
based on which entry point you use to a procedure (see the CLOSE1, CLOSE2,
and CLOSE3 labels at the site referenced above), or to select between
incrementing or decrementing a register in response to some flag
condition, etc. It's a technique which worked well for CISC
processors; I'm not sure how well it'd work for RISC in general, and
RISC-V in particular. I'd be interested in seeing a more formal study
on it.

--
Samuel A. Falvo II

Stefan O'Rear

Dec 26, 2017, 6:11:41 PM
to Bruce Hoult, RISC-V SW Dev
lui x0 is not a noop, it's an undocumented reserved hint encoding:
https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/B9r5BdKOHNk/09kO_VCUAQAJ

-s

Jim Wilson

Dec 26, 2017, 6:45:24 PM
to Stefan O'Rear, Bruce Hoult, RISC-V SW Dev
On Tue, Dec 26, 2017 at 3:11 PM, Stefan O'Rear <sor...@gmail.com> wrote:
> On Tue, Dec 26, 2017 at 3:32 AM, Bruce Hoult <br...@hoult.org> wrote:
>> This creates an "lui x0,#imm20" instruction (which is a no-op) that includes
>> the single RVC instruction in the ELSE part inside the imm20.
>
> lui x0 is not a noop, it's an undocumented reserved hint encoding:
> https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/B9r5BdKOHNk/09kO_VCUAQAJ

4-byte instructions aren't hints. Only 2-byte instructions are hints.
c.lui x0 is a hint, but that isn't being used here.

This is documented in the published v2.2 of the user level ISA spec in
section 12.7 RVC Instruction Set Listings. Though there are a few more
hints that are in the spec source that haven't been formally published
yet. You can see them by checking out the spec sources and building
them. It is not documented what hints do though. Perhaps that is
what you meant.

Jim

Bruce Hoult

Dec 26, 2017, 7:09:22 PM
to Samuel Falvo II, Cesar Eduardo Barros, RISC-V SW Dev
On Wed, Dec 27, 2017 at 1:51 AM, Samuel Falvo II <sam....@gmail.com> wrote:
> On Tue, Dec 26, 2017 at 2:16 PM, Cesar Eduardo Barros
> <ces...@cesarb.eti.br> wrote:
> > I'd expect simpler cores which decode one instruction at a time to not have
> > any slowdown, and little if any speedup (unconditional jumps can be
>
> Without defending the idea, historically, the point of this trick is
> to save space, not to make things faster. This is a *VERY* commonly
> used technique to pack the maximum amount of code into limited ROM space
> for Z80 and 6502 systems (e.g.,
> http://6502.org/tutorials/6502opcodes.html#BIT). It's at least
> partially responsible for enabling complete BASIC implementations,
> with floating point support, on Z80 systems with only 16KB of ROM.

Yes, well aware of that. I learned to program on 6502 and Z80, in machine code, as I didn't have a compiler or even an assembler. I then moved on to PDP11, VAX, and M68k.

> On something like the E310 chip, with only 16KB of scratchpad RAM,
> this might be a valuable trick to apply for code placed in that RAM,
> especially considering the code density of RISC-V is about 2x-3x
> poorer than a Z80.

I'd like to see some evidence for that!

On https://github.com/deater/ll_asm Z80 comes in 8% smaller than E310 code (RV32IMC). 6502 is 18% bigger than RISC-V.

Hand-coded Z80 might be a little smaller than RISC-V, but nothing like 2x-3x for real code, as opposed to string-copy micro-benchmarks. I'd expect generic C code to compile to smaller code on RISC-V than on Z80, even if your variables are all char and short. If you need 32-bit or larger ints then there will be no comparison.

Stefan O'Rear

Dec 26, 2017, 7:16:43 PM
to Jim Wilson, Bruce Hoult, RISC-V SW Dev
In the linked message, Andrew Waterman indicated that the spec source
was wrong and 4-byte instructions are also hints. I'd like to see a
clear retraction if a retraction is intended.

-s

Samuel Falvo II

Dec 26, 2017, 7:19:56 PM
to Bruce Hoult, Cesar Eduardo Barros, RISC-V SW Dev
On Tue, Dec 26, 2017 at 4:09 PM, Bruce Hoult <br...@hoult.org> wrote:
> Yes, well aware of that. I learned to program on 6502 and Z80, in machine

I'm not responding to you.

> I'd like to see some evidence for that!

This is based on my port of eForth to the RV64I ISA. eForth for Z80
can readily fit in 16KB of space, while mine requires 34KB.

> On https://github.com/deater/ll_asm Z80 comes in 8% smaller than E310 code
> (RV32IMC). 6502 is 18% bigger than RISC-V.

Cute, but I'd prefer to see more serious programs.

Jim Wilson

Dec 26, 2017, 7:57:25 PM
to Stefan O'Rear, Bruce Hoult, RISC-V SW Dev
On Tue, Dec 26, 2017 at 4:16 PM, Stefan O'Rear <sor...@gmail.com> wrote:
> In the linked message, Andrew Waterman indicated that the spec source
> was wrong and 4-byte instructions are also hints. I'd like to see a
> clear retraction if a retraction is intended.

Changing the meaning of existing valid 4-byte instructions would cause
trouble. The 2-byte instruction hints are just changing formerly
reserved encodings to be hints, which does not break anything.
Perhaps Andrew was just speculating on something that might be nice to
have, but did not get formally approved, because it would have caused
too much trouble for too many people.

Jim

Vince Weaver

Dec 26, 2017, 8:42:06 PM
to Bruce Hoult, Samuel Falvo II, Cesar Eduardo Barros, RISC-V SW Dev
On Wed, 27 Dec 2017, Bruce Hoult wrote:

> I'd like to see some evidence for that!
>
> On https://github.com/deater/ll_asm Z80 comes in 8% smaller than E310 code
> (RV32IMC). 6502 is 18% bigger than RISC-V.

To be fair, a team of 6502 programmers has made a huge improvement in the
6502 results; I just haven't had the time to update things yet.

Vince

Bruce Hoult

Dec 27, 2017, 4:59:54 AM
to Samuel Falvo II, Cesar Eduardo Barros, RISC-V SW Dev
On Wed, Dec 27, 2017 at 3:19 AM, Samuel Falvo II <sam....@gmail.com> wrote:
> On Tue, Dec 26, 2017 at 4:09 PM, Bruce Hoult <br...@hoult.org> wrote:
> > I'd like to see some evidence for that!
>
> This is based on my port of eForth to the RV64I ISA. eForth for Z80
> can readily fit in 16KB of space, while mine requires 34KB.

You were talking about the E310 core, which is 32-bit and implements the C extension. Based on general experience, that would probably knock your 34KB down to 24KB or so. So quite far from 2x-3x.

FORTH is a very special case, much less representative of "normal" programs than is the linux_logo example. Typical simple FORTH compilation to subroutine-threaded code is indeed a bad fit for typical RISC ISAs because it depends heavily on manipulating the stack for individual items (needing autoincrement/decrement addressing) and ignores the wealth of registers available.

Something as simple as inlining small words and gathering all stack pointer adjustment into a single instruction at the start and/or end of each word helps a lot. Storing temporary values consumed within the word in registers instead of on the stack also helps a lot.

But it seems no one is bothering to do even these simple things.

Token-threaded or address-threaded FORTH should take exactly the same space on RISC-V as on anything else. That needs a small interpreter, but it will run pretty fast.

> > On https://github.com/deater/ll_asm Z80 comes in 8% smaller than E310 code
> > (RV32IMC). 6502 is 18% bigger than RISC-V.
>
> Cute, but I'd prefer to see more serious programs.

Let's compare the output of C or Pascal compilers on serious programs then.
 

kr...@berkeley.edu

Jan 1, 2018, 9:37:16 AM
to Stefan O'Rear, Jim Wilson, Bruce Hoult, RISC-V SW Dev

I think we should keep hint meaning on equivalent 4-byte instructions,
as otherwise non-C implementations wouldn't be able to use the same hints,
and we'd be breaking the general rule that all C instructions expand
to a single regular instruction (implying no new functionality in
C-only instructions).

Krste

kr...@berkeley.edu

Jan 1, 2018, 9:40:06 AM
to Jim Wilson, Stefan O'Rear, Bruce Hoult, RISC-V SW Dev

Hints wouldn't change the meaning of any valid 4-byte instruction.

Hints are just reusing instruction encodings that have no
architectural effect (NOPs), and hints themselves must not imply any
architectural state change so that they can be ignored by a valid
implementation.
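
A minimal sketch of why such encodings carry no architectural effect, using a toy register-file model. The names `RegFile` and `exec_lui` are hypothetical, and the pc advance is not modeled.

```python
class RegFile:
    """Toy RV32 integer register file; x0 is hardwired to zero."""

    def __init__(self):
        self.x = [0] * 32

    def write(self, rd: int, value: int) -> None:
        if rd != 0:  # writes to x0 are discarded
            self.x[rd] = value & 0xFFFFFFFF


def exec_lui(regs: RegFile, word: int) -> None:
    """Execute a 32-bit LUI: rd = imm20 << 12."""
    rd = (word >> 7) & 0x1F
    regs.write(rd, word & 0xFFFFF000)


regs = RegFile()
before = list(regs.x)
exec_lui(regs, 0x45050037)   # lui x0, 0x45050
print(regs.x == before)      # True: no register state changed
exec_lui(regs, 0x450500B7)   # lui x1, 0x45050
print(hex(regs.x[1]))        # 0x45050000
```

Because the x0 form leaves state untouched, an implementation that ignores any hint meaning attached to it still executes the program correctly.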

Krste

kr...@berkeley.edu

Jan 1, 2018, 9:48:25 AM
to Cesar Eduardo Barros, Bruce Hoult, RISC-V SW Dev

This trick could provide faster execution even on cores with BTBs, if
the BTB latency was >1 cycle, or if the fetch width was >1. It will
also avoid occupying a BTB entry for the taken branch, thus saving BTB
capacity for other taken branches.

That said, I think it's a bad idea to use this trick, as more
intelligent microarchitectures can get the same effects without
obfuscating the code.

Krste

Richard W.M. Jones

Jan 2, 2018, 8:38:57 AM
to kr...@berkeley.edu, Cesar Eduardo Barros, Bruce Hoult, RISC-V SW Dev
I thought this thread was quite interesting but no one mentioned
(what I think is) the obvious question yet: Should overlapping
instructions be explicitly outlawed by the specification?

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines. Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v

Jim Wilson

Jan 2, 2018, 1:41:12 PM
to Richard W.M. Jones, kr...@berkeley.edu, Cesar Eduardo Barros, Bruce Hoult, RISC-V SW Dev
On Tue, Jan 2, 2018 at 5:38 AM, Richard W.M. Jones <rjo...@redhat.com> wrote:
> I thought this thread was quite interesting but no one mentioned
> (what I think is) the obvious question yet: Should overlapping
> instructions be explicitly outlawed by the specification?

Forbidding deliberate use of overlapping instructions would not stop a
black hat hacker from trying to take advantage of accidental instances
via return/jump oriented programming tricks. This seems to make
forbidding them pointless. It might be better to add architecture
level defenses, similar to what Intel has recently done with its
Control-flow Enforcement Technology.

Jim

Cesar Eduardo Barros

Jan 2, 2018, 5:32:38 PM
to Richard W.M. Jones, kr...@berkeley.edu, Bruce Hoult, RISC-V SW Dev
On 02-01-2018 11:38, Richard W.M. Jones wrote:
> I thought this thread was quite interesting but no one mentioned
> (what I think is) the obvious question yet: Should overlapping
> instructions be explicitly outlawed by the specification?

No. The opposite: they should be explicitly allowed.

The hardware can't forbid them in the general case. They are only a bad
idea, performance-wise.

In the example that started this thread, the program jumped into the
middle of an instruction in the same (or in the next) cacheline. Now
consider what happens if the jump into the middle of the instruction is
from a farther location. Consider also what happens if, in the meantime,
enough code was executed that the original cacheline was already gone
from all caches. The hardware has no idea whether it is jumping into the
middle of an instruction.

It would be inconsistent if the result of jumping into the middle of an
instruction depended on how many other instructions executed in the
meantime, or even on ISRs or other unpredictable factors. Therefore, the
only option which makes sense is to explicitly allow overlapping
instructions, be it using half of a 32-bit instruction as a 16-bit
instruction, or overlapping a 32-bit instruction with another 32-bit
instruction.

Of course, the specification cannot promise anything about the
performance impact of overlapping instructions.

My opinion is:

- The specification should explicitly allow overlapping instructions;
- The commentary on the specification should mention that overlapping
instructions might be slower in some situations on some implementations;
- The riscv-tests repository should have tests that make sure
overlapping instructions work, especially in corner cases like
jumping into the middle of the current, the preceding, or the following
instruction, the overlapping instruction being either 16-bit or 32-bit;
- Those who worry about overlapping instructions being used to bypass
binary validators should use validation tricks similar to what NaCl did.
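
A rough sketch in the spirit of the last item above: do one linear sweep to find instruction starts, then reject any direct jump whose target is not one of them. This is only an illustration. The real NaCl validator also constrains indirect jumps with aligned bundles, which is not modeled here, and the function names and the `jump_targets` input are hypothetical.

```python
def parcel_len(parcel: int) -> int:
    """2 for an RVC parcel, 4 for a 32-bit instruction (longer forms ignored)."""
    return 4 if parcel & 0b11 == 0b11 else 2


def instruction_starts(parcels):
    """Byte offsets of instruction starts found by a linear sweep from 0."""
    starts, i = set(), 0
    while i < len(parcels):
        starts.add(2 * i)
        i += parcel_len(parcels[i]) // 2
    return starts


def validate(parcels, jump_targets):
    """Return the jump targets that land inside an instruction."""
    starts = instruction_starts(parcels)
    return sorted(t for t in jump_targets if t not in starts)


# 0x0037 + 0x4505 form one LUI; 0x0001 is c.nop
code = [0x0037, 0x4505, 0x0001]
print(validate(code, [4]))     # []   every target is an instruction start
print(validate(code, [2, 4]))  # [2]  offset 2 is the middle of the LUI
```

A validator built this way would reject the trick from the start of this thread, which is the point: such rules belong in the validator, not in the ISA.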

Andrew Waterman

Jan 2, 2018, 5:53:38 PM
to Cesar Eduardo Barros, Richard W.M. Jones, Krste Asanovic, Bruce Hoult, RISC-V SW Dev
I agree with all this. It's already implicitly mandatory to support
overlapping instructions, but for clarity it should be made explicit.

I'd happily review pull requests to the spec (the commentary should
also reference the NaCl approach), and to the riscv-tests, if someone
wants to lend a hand.


Richard W.M. Jones

Jan 3, 2018, 5:27:33 AM
to Cesar Eduardo Barros, kr...@berkeley.edu, Bruce Hoult, RISC-V SW Dev
On Tue, Jan 02, 2018 at 08:32:27PM -0200, Cesar Eduardo Barros wrote:
> On 02-01-2018 11:38, Richard W.M. Jones wrote:
> >I thought this thread was quite interesting but no one mentioned
> >(what I think is) the obvious question yet: Should overlapping
> >instructions be explicitly outlawed by the specification?
>
> No. The opposite: they should be explicitly allowed.
>
> The hardware can't forbid them in the general case. They are only a
> bad idea, performance wise.
>
> In the example that started this thread, the program jumped into the
> middle of an instruction in the same (or in the next) cacheline. Now
> consider what happens if the jump into the middle of the instruction
> is from a farther location. Consider also what happens if, in the
> meantime, enough code was executed that the original cacheline was
> already gone from all caches. The hardware has no idea whether it is
> jumping into the middle of an instruction.

To be clear, I didn't mean that the hardware would enforce this (which
is obviously impossible), I meant that the result would be explicitly
undefined so that it would be clear to programmers that using these
tricks is wrong and undefined, and easier for hardware designers to
implement optimizations.

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html

Stefan O'Rear

Jan 3, 2018, 5:31:53 AM
to Richard W.M. Jones, Cesar Eduardo Barros, Krste Asanovic, Bruce Hoult, RISC-V SW Dev
On Wed, Jan 3, 2018 at 2:27 AM, Richard W.M. Jones <rjo...@redhat.com> wrote:
> On Tue, Jan 02, 2018 at 08:32:27PM -0200, Cesar Eduardo Barros wrote:
>> On 02-01-2018 11:38, Richard W.M. Jones wrote:
>> >I thought this thread was quite interesting but no one mentioned
>> >(what I think is) the obvious question yet: Should overlapping
>> >instructions be explicitly outlawed by the specification?
>>
>> No. The opposite: they should be explicitly allowed.
>>
>> The hardware can't forbid them in the general case. They are only a
>> bad idea, performance wise.
>>
>> In the example that started this thread, the program jumped into the
>> middle of an instruction in the same (or in the next) cacheline. Now
>> consider what happens if the jump into the middle of the instruction
>> is from a farther location. Consider also what happens if, in the
>> meantime, enough code was executed that the original cacheline was
>> already gone from all caches. The hardware has no idea whether it is
>> jumping into the middle of an instruction.
>
> To be clear, I didn't mean that the hardware would enforce this (which
> is obviously impossible), I meant that the result would be explicitly
> undefined so that it would be clear to programmers that using these
> tricks is wrong and undefined, and easier for hardware designers to
> implement optimizations.

It's far from clear how you would even _specify_ such a thing.

-s

Richard W.M. Jones

Jan 3, 2018, 5:36:17 AM
to Stefan O'Rear, Cesar Eduardo Barros, Krste Asanovic, Bruce Hoult, RISC-V SW Dev
"If an instruction has been previously decoded by a hart at address X
with length N bytes, then it is undefined what happens if the hart is
subsequently asked to decode an instruction beginning at addresses
X+1 thru X+N-1."

Richard W.M. Jones

Jan 3, 2018, 5:39:00 AM
to Stefan O'Rear, Cesar Eduardo Barros, Krste Asanovic, Bruce Hoult, RISC-V SW Dev
... with a list of fence-type operations which clear the
state of a page of memory allowing new code to be loaded.
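
To make the proposed wording concrete, here is a small sketch of a checker that follows it literally. The class name `DecodeTracker`, its methods, and the 4096-byte page size are assumptions for illustration, not anything from the spec.

```python
class DecodeTracker:
    """Tracks decoded instructions and flags decodes that the proposed
    rule would make undefined."""

    def __init__(self):
        self.decoded = {}  # start address -> length in bytes

    def decode(self, addr: int, length: int) -> bool:
        """Record a decode at `addr`. Return False if `addr` falls in
        X+1 .. X+N-1 of a previously decoded instruction at X."""
        for start, n in self.decoded.items():
            if start < addr < start + n:
                return False
        self.decoded[addr] = length
        return True

    def fence(self, page_base: int, page_size: int = 4096) -> None:
        """Fence-type operation: forget all decodes on one page."""
        self.decoded = {
            a: n for a, n in self.decoded.items()
            if not (page_base <= a < page_base + page_size)
        }


t = DecodeTracker()
print(t.decode(0x1000, 4))  # True:  32-bit instruction at 0x1000
print(t.decode(0x1002, 2))  # False: starts inside the earlier instruction
t.fence(0x1000)
print(t.decode(0x1002, 2))  # True:  page state was cleared
```

Read literally, the wording is order-dependent: decoding the 16-bit instruction at 0x1002 first and the 32-bit one at 0x1000 afterwards is not flagged by this check.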

Cesar Eduardo Barros

Jan 3, 2018, 5:43:17 AM
to Richard W.M. Jones, kr...@berkeley.edu, Bruce Hoult, RISC-V SW Dev
On 03-01-2018 08:27, Richard W.M. Jones wrote:
> On Tue, Jan 02, 2018 at 08:32:27PM -0200, Cesar Eduardo Barros wrote:
>> On 02-01-2018 11:38, Richard W.M. Jones wrote:
>>> I thought this thread was quite interesting but no one mentioned
>>> (what I think is) the obvious question yet: Should overlapping
>>> instructions be explicitly outlawed by the specification?
>>
>> No. The opposite: they should be explicitly allowed.
>>
>> The hardware can't forbid them in the general case. They are only a
>> bad idea, performance wise.
>>
>> In the example that started this thread, the program jumped into the
>> middle of an instruction in the same (or in the next) cacheline. Now
>> consider what happens if the jump into the middle of the instruction
>> is from a farther location. Consider also what happens if, in the
>> meantime, enough code was executed that the original cacheline was
>> already gone from all caches. The hardware has no idea whether it is
>> jumping into the middle of an instruction.
>
> To be clear, I didn't mean that the hardware would enforce this (which
> is obviously impossible), I meant that the result would be explicitly
> undefined so that it would be clear to programmers that using these
> tricks is wrong and undefined, and easier for hardware designers to
> implement optimizations.

As a software developer, I like that user-mode RISC-V currently has no
undefined behavior (though LR/SC outside the forward progress guarantees
pushes the line a bit). There's precedent elsewhere: the spec defined
the precise behavior of divide by zero, which bits should be set on a
NaN instead of leaving it as implementation-defined, and allowed
unaligned loads/stores even though this might complicate hardware
implementations.

So in my opinion, the result should be explicitly _defined_, and
precisely specified. However, it doesn't have to be optimized; bailing
out, flushing the pipeline, and restarting from the decode step would be
a valid approach.
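
The divide-by-zero precedent mentioned above can be sketched as follows for RV32 signed DIV and REM. This is a toy model with made-up helper names, operating on 32-bit patterns. It follows the M-extension rules: division by zero yields all ones for the quotient and the dividend for the remainder, and signed overflow yields the dividend and a zero remainder.

```python
XLEN = 32
MASK = (1 << XLEN) - 1
INT_MIN = -(1 << (XLEN - 1))


def to_signed(v: int) -> int:
    """Interpret an XLEN-bit pattern as a two's-complement integer."""
    v &= MASK
    return v - (1 << XLEN) if v >> (XLEN - 1) else v


def div(a: int, b: int) -> int:
    """Signed DIV on XLEN-bit patterns; never traps."""
    sa, sb = to_signed(a), to_signed(b)
    if sb == 0:
        return MASK                      # quotient is -1 (all ones)
    if sa == INT_MIN and sb == -1:
        return a & MASK                  # overflow: quotient is the dividend
    q = abs(sa) // abs(sb)               # round toward zero
    return (-q if (sa < 0) != (sb < 0) else q) & MASK


def rem(a: int, b: int) -> int:
    """Signed REM on XLEN-bit patterns; never traps."""
    sa, sb = to_signed(a), to_signed(b)
    if sb == 0:
        return a & MASK                  # remainder is the dividend
    if sa == INT_MIN and sb == -1:
        return 0
    r = abs(sa) % abs(sb)
    return (-r if sa < 0 else r) & MASK  # sign follows the dividend


print(hex(div(7, 0)), rem(7, 0))         # 0xffffffff 7
print(hex(div(0x80000000, 0xFFFFFFFF)))  # 0x80000000
```

Every input has exactly one specified result, which is the property being argued for in the overlapping-instruction case.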