Re: Comment on: RISC-V Semihosting specification

Anup Patel

Oct 10, 2024, 1:09:56 AM

to Frank K. Gurkaynak, RISC-V ISA Dev, apa...@ventanamicro.com

Hi,

On Thu, Sep 26, 2024 at 11:06 AM Frank K. Gurkaynak <k...@iis.ee.ethz.ch> wrote:
>
> Hello,
>
> I would appreciate if a short explanation is added to explain how this specific command sequence
>
> slli x0, x0, 0x1f
> ebreak
> srai x0, x0, 7
>
> has been selected/designed. This could be a footnote, or explained inline. I believe this is important for people 'learning' about ISAs and studying them. The base ISA explains almost all of its choices very clearly, something that is highly appreciated when used in teaching or studied by beginners. Even something like "we needed a distinctive unlikely NOP instruction to mark the beginning and end, and slli and srai have been chosen because.." would help.
>
> As a HW engineer, in some smaller implementations where a barrel shifter is expensive, I can think of cases where the slli with non power of two actually takes several clock cycles as opposed to say any simple boolean logic function.

The shift-based NOP instructions and the EBREAK instruction are part
of RV32I and RV64I (the base integer instruction sets), which are
mandatory for all RISC-V implementations (including microcontrollers),
so these instructions are assumed to be always present. I will add
some non-normative text along these lines.

Regards,
Anup

Bruce Hoult

Oct 10, 2024, 1:55:34 AM

to Anup Patel, Frank K. Gurkaynak, RISC-V ISA Dev, apa...@ventanamicro.com

I missed the original message in my mail.

> As a HW engineer, in some smaller implementations where a barrel shifter is expensive, I can think of cases where the slli with non power of two actually takes several clock cycles as opposed to say any simple boolean logic function.

I'm curious about this, from several angles.

First of all, RISC-V software commonly assumes that arbitrary shifts
are cheap e.g. `slli a0,a0,32-off-size; srai a0,a0,32-size` to extract
and sign-extend an arbitrary bitfield.
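
For readers who want to check the bitfield idiom, here is a minimal Python sketch that models the two instructions on a 32-bit register. The helper names (`slli32`, `srai32`, `extract_signed_field`) are made up for illustration, and `off` is assumed to be the bit position of the field's least significant bit.

```python
MASK32 = 0xFFFFFFFF

def slli32(x, shamt):
    """Model of SLLI on a 32-bit register value."""
    return (x << shamt) & MASK32

def srai32(x, shamt):
    """Model of SRAI on a 32-bit register value (sign bit is replicated)."""
    signed = x - (1 << 32) if x & 0x80000000 else x
    return (signed >> shamt) & MASK32

def extract_signed_field(x, off, size):
    """Sign-extended field of `size` bits whose LSB sits at bit `off`."""
    t = slli32(x, 32 - off - size)   # slli a0,a0,32-off-size
    return srai32(t, 32 - size)      # srai a0,a0,32-size

# Field bits [11:4] of 0xF50 are 0xF5, which sign-extends to 0xFFFFFFF5.
print(hex(extract_signed_field(0x00000F50, 4, 8)))  # prints 0xfffffff5
```

The left shift pushes the field's top bit into bit 31, so the arithmetic right shift both moves the field down and fills the upper bits with its sign.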

I'm not aware of any implementations where shift is slow, except SeRV
where shifts are 64 cycles while logic functions or add/sub are 32
cycles.

Certainly it's possible someone has done a 1 bit per cycle shifter in
an otherwise single-cycle core. But you appear to be implying that
there are cores that iterate shifts, but do it by powers of 2. Does
that really save much circuitry? Or gate delays, for that matter.

I could be wrong, but I think a 5 or 6 layer barrel shifter isn't a
lot more latency than a 32 or 64 bit adder? Sure, both have a lot more
gate delays than AND / OR / XOR. Does something out there have
single-cycle boolean operations, but multi-cycle add (and shift)? That
certainly seems *possible*, but is it sensible to make that effort
given the rarity of AND / OR / XOR in most code?

And finally, given everything else that is implied by semihosting e.g.
communicating with a remote host system, is it even remotely important
if the shift NOPs are a little slow? Also, I note that the trap
handler can easily arrange to skip execution of the `srai` by bumping
the PC by 8 instead of 4 before returning. It obviously has to fetch
and examine that instruction, but it doesn't have to execute it.
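
A rough Python sketch of that handler logic follows. It assumes all three instructions use their 32-bit encodings, that the words around the trapping PC are readable, and that a plain EBREAK is simply stepped over; `read_word` is a hypothetical callback that returns the 32-bit word at an address, and the function names are made up.

```python
SLLI_MARK = 0x01F01013  # slli x0, x0, 0x1f
EBREAK    = 0x00100073  # ebreak
SRAI_MARK = 0x40705013  # srai x0, x0, 7

def is_semihosting_trap(read_word, epc):
    """True if the EBREAK at `epc` sits between the two marker NOPs."""
    return (read_word(epc - 4) == SLLI_MARK
            and read_word(epc) == EBREAK
            and read_word(epc + 4) == SRAI_MARK)

def resume_pc(read_word, epc):
    """Return address: skip EBREAK plus the trailing srai for a
    semihosting call (epc + 8), otherwise skip only the EBREAK (epc + 4)."""
    return epc + 8 if is_semihosting_trap(read_word, epc) else epc + 4
```

The srai is fetched and compared but never executed, so its latency on a slow shifter does not matter.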

> --
> You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
> To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/CAAhSdy1Av3g9%3DojERdMe81tGt0i2jRt%3D78iPBMdV9WyUxqVeQw%40mail.gmail.com.

Guy Lemieux

Oct 10, 2024, 2:24:37 AM

to Bruce Hoult, Anup Patel, Frank K. Gurkaynak, RISC-V ISA Dev, apa...@ventanamicro.com

The official NOP instruction in RISC-V is ADDI x0,x0,0

The SLLI and SRAI instructions in this command sequence write to x0,
and use x0 as a source. Hence, they are essentially NOPs. Some
instructions like this (which are essentially NOPs) are also defined
as microarchitectural HINT instructions. I don't know if these two
encodings are HINTs or not.

I believe many bit encodings that write x0 are "reserved" for future HINTs.

A variable shifter often works with log(XLEN) layers, where each layer
shifts by 1b, 2b, 4b, 8b, etc, respectively. I do not know of any
implementation where a variable shift that happens to be a power-of-2
is easier/faster than a non-power-of-2. Shift instructions can also be
replaced by an integer multiply. For left-shift, it's obvious that you
just multiply by 2^shamt. For right-shift, you need to multiply by
2^(XLEN-shamt) and then take the MULHU (high) portion of the result.
However, I don't see how any of this is relevant to the code sequence
that was selected.
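
As a side note, the shift-by-multiply idea can be checked with a small Python model. This is only a sketch for logical shifts with XLEN assumed to be 32; it reads the right shift as a multiply by 2^(XLEN-shamt) keeping the upper half of the product, and the function names are made up.

```python
XLEN = 32
MASK = (1 << XLEN) - 1

def mul_lo(a, b):
    """Low XLEN bits of the product, as MUL would return."""
    return (a * b) & MASK

def mulhu(a, b):
    """High XLEN bits of the unsigned product, as MULHU would return."""
    return ((a * b) >> XLEN) & MASK

def sll_via_mul(x, shamt):
    """Left shift as a multiply by 2^shamt."""
    return mul_lo(x, 1 << shamt)

def srl_via_mul(x, shamt):
    """Logical right shift as the high half of x * 2^(XLEN - shamt)."""
    if shamt == 0:
        return x & MASK          # 2^XLEN does not fit in a register
    return mulhu(x, 1 << (XLEN - shamt))
```

An arithmetic right shift would need a signed high multiply instead; that case is left out here.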

I hope some non-normative text can clear up precisely why these
NOP-like instructions (with precise shift amounts of 0x1f and 7) were
selected -- do they behave as official microarchitectural HINTs? are
they simply markers to software to make a unique 96b signature in the
code to find these points? etc.
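
For reference, the three 32-bit words that make up that 96b pattern can be derived from the standard I-type field layout. The Python below is a sketch; `encode_i_type` is a made-up helper name.

```python
OP_IMM = 0x13   # opcode shared by ADDI/SLLI/SRAI
SYSTEM = 0x73   # opcode for ECALL/EBREAK

def encode_i_type(imm12, rs1, funct3, rd, opcode):
    """Pack the I-type fields: imm[11:0] | rs1 | funct3 | rd | opcode."""
    return ((imm12 & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode

slli = encode_i_type(0x01F, 0, 0b001, 0, OP_IMM)        # slli x0, x0, 0x1f
ebreak = encode_i_type(0x001, 0, 0b000, 0, SYSTEM)      # ebreak
srai = encode_i_type(0x400 | 7, 0, 0b101, 0, OP_IMM)    # srai x0, x0, 7 (bit 30 set)

signature = [slli, ebreak, srai]
print([f"{w:08x}" for w in signature])
# prints ['01f01013', '00100073', '40705013']
```

Both shifts have rd = x0 and rs1 = x0, so they change no architectural state; only the shift amounts distinguish them from other x0-writing shifts.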

Guy

Anup Patel

Oct 10, 2024, 2:34:07 AM

to Guy Lemieux, Bruce Hoult, Frank K. Gurkaynak, RISC-V ISA Dev, apa...@ventanamicro.com

On Thu, Oct 10, 2024 at 11:54 AM Guy Lemieux <guy.l...@gmail.com> wrote:
>
> The official NOP instruction in RISC-V is ADDI x0,x0,0
>
> The SLLI and SRAI instructions in this command sequence write to x0,
> and use x0 as a source. Hence, they are essentially NOPs. Some
> instructions like this (which are essentially NOPs) are also defined
> as microarchitectural HINT instructions. I don't know if these two
> encodings are HINTs or not.
>
> I believe many bit encodings that write x0 are "reserved" for future HINTs.
>
> A variable shifter often works with log(XLEN) layers, where each layer
> shifts by 1b, 2b, 4b, 8b, etc, respectively. I do not know of any
> implementation where a variable shift that happens to be a power-of-2
> is easier/faster than a non-power-of-2. Shift instructions can also be
> replaced by an integer multiply. For left-shift, it's obvious that you
> just multiply by 2^shamt. For right-shift, you need to multiply by
> 2^(XLEN-shamt) and then take the MULHU (high) portion of the result.
> However, I don't see how any of this is relevant to the code sequence
> that was selected.
>
> I hope some non-normative text can clear up precisely why these
> NOP-like instructions (with precise shift amounts of 0x1f and 7) were
> selected -- do they behave as official microarchitectural HINTs? are
> they simply markers to software to make a unique 96b signature in the
> code to find these points? etc.

The SLLI- and SRAI-based HINTs are designated for custom
use as per the ratified RV32I and RV64I specifications.

One simple optimization could be that implementations can
always skip the shifting work for SLLI and SRAI instructions
whenever rd=x0.
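
To make the idea concrete, here is a toy Python model of an interpreter step for SLLI (not a hardware description). The function name, the 32-entry `regs` list, and the `stats` counters are all assumptions made for this sketch.

```python
MASK32 = 0xFFFFFFFF

def exec_slli(regs, rd, rs1, shamt, stats):
    """Execute SLLI; x0 stays hard-wired to zero."""
    if rd == 0:
        stats["skipped"] += 1    # result would be discarded, so do no shifter work
        return
    stats["shifted"] += 1
    regs[rd] = (regs[rs1] << shamt) & MASK32
```

With this check in place, `slli x0, x0, 0x1f` costs no more than any other NOP, whatever the shifter looks like.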

Regards,
Anup

Tommy Murphy

Oct 10, 2024, 2:46:03 AM

to Anup Patel, Guy Lemieux, Bruce Hoult, Frank K. Gurkaynak, RISC-V ISA Dev, apa...@ventanamicro.com

In case it helps at all, there is some previous discussion of this semihosting instruction sequence on the old mailing list/Google Group...

https://groups.google.com/a/groups.riscv.org/g/sw-dev/c/n-5VQ9PHZ4w/m/W6BLkpTRBwAJ

https://groups.google.com/a/groups.riscv.org/g/sw-dev/c/M7LDRtBtxrk

Unfortunately, so far, I haven't been able to locate the original long discussion that led to the instruction sequence in question being settled on in the first place...

Bruce Hoult

Oct 10, 2024, 8:32:21 AM

to Guy Lemieux, Anup Patel, Frank K. Gurkaynak, RISC-V ISA Dev, apa...@ventanamicro.com

On Thu, Oct 10, 2024 at 7:24 PM Guy Lemieux <guy.l...@gmail.com> wrote:
> A variable shifter often works with log(XLEN) layers, where each layer
> shifts by 1b, 2b, 4b, 8b, etc, respectively.

Yes, my previous message was written under this assumption.

The point is that this doesn't -- on all implementations I know of --
take log(XLEN) clock cycles, but rather all log(XLEN) stages cascade
within 1 clock cycle. This is the case because integer add (which is
far more common and important) also takes O(log(XLEN)) layers of logic
to propagate the carry (in the best implementation), and the big O
constant is relatively similar in both cases.

So, yes, there is no advantage to power of two shifts, despite each
layer of the shift network (optionally) shifting by a power of two.
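
A small Python model of such a layered shifter, for illustration only: XLEN is assumed to be 32, the function name is made up, and the loop stands for layers of combinational logic, not clock cycles.

```python
XLEN = 32
MASK = (1 << XLEN) - 1

def barrel_shift_left(x, shamt):
    """Left shift built from log2(XLEN) mux layers.
    Layer k either shifts by 2**k or passes its input through."""
    layers = XLEN.bit_length() - 1       # 5 layers for XLEN = 32
    for k in range(layers):
        if (shamt >> k) & 1:             # bit k of shamt selects this layer
            x = (x << (1 << k)) & MASK
    return x
```

The signal passes through all five layers for every shift amount, so a shift by 16 (one layer active) and a shift by 31 (all layers active) see the same logic depth.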

Tommy Murphy

Oct 10, 2024, 8:50:29 AM

to Bruce Hoult, Guy Lemieux, Anup Patel, Frank K. Gurkaynak, RISC-V ISA Dev, apa...@ventanamicro.com

> In case it helps at all, there is some previous discussion of this semihosting instruction sequence on the old mailing list/Google Group...
>
> https://groups.google.com/a/groups.riscv.org/g/sw-dev/c/n-5VQ9PHZ4w/m/W6BLkpTRBwAJ
>
> https://groups.google.com/a/groups.riscv.org/g/sw-dev/c/M7LDRtBtxrk
>
> Unfortunately, so far, I haven't been able to locate the original long discussion that led to the instruction sequence in question being settled on in the first place...

FWIW I think that this may be the original discussion but I'm not sure if it explains the rationale for the selection of the specific instructions used:

https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/1Su9Z7L18qM/m/ePSk4rulAQAJ

Bruce Hoult

Oct 10, 2024, 9:26:52 AM

to Tommy Murphy, Guy Lemieux, Anup Patel, Frank K. Gurkaynak, RISC-V ISA Dev, apa...@ventanamicro.com

This discussion started in November 2017 (and I was part of it then).
Ideally, Liviu wanted a new EBREAK instruction, possibly with a few
bits of literal in it, but anyway one code point would serve the
semi-hosting need.

There was a reluctance in 2017, 18 months before the existing ISA was
ratified, to add new instructions. Andrew: "I can see your point, but
changing the ISA is all the more messy."

Since ratification we have had dozens of ISA extensions proposed and
ratified. Changing the ISA.

Why not one for semi-hosting? It could easily be a fast track extension.

If it's a currently illegal instruction then it can be made to work on
old hardware too. It could even be ECALL followed by the currently
illegal instruction, rather than EBREAK surrounded by NOPs.

Regardless, the original "We don't want to modify the ISA" reason for
going down this current track seems to be outdated.

Anup Patel

Oct 10, 2024, 10:28:59 AM

to Bruce Hoult, Tommy Murphy, Guy Lemieux, Frank K. Gurkaynak, RISC-V ISA Dev, apa...@ventanamicro.com

On Thu, Oct 10, 2024 at 6:56 PM Bruce Hoult <br...@hoult.org> wrote:
>
> This discussion started in November 2017 (and I was part of it then).
> Ideally, Liviu wanted a new EBREAK instruction, possibly with a few
> bits of literal in it, but anyway one code point would serve the
> semi-hosting need.
>
> There was a reluctance in 2017, 18 months before the existing ISA was
> ratified, to add new instructions. Andrew: "I can see your point, but
> changing the ISA is all the more messy."
>
> Since ratification we have had dozens of ISA extensions proposed and
> ratified. Changing the ISA.
>
> Why not one for semi-hosting? It could easily be a fast track extension.
>
> If it's a currently illegal instruction then it can be made to work on
> old hardware too. It could even be ECALL followed by the currently
> illegal instruction, rather than EBREAK surrounded by NOPs.
>
> Regardless, the original "We don't want to modify the ISA" reason for
> going down this current track seems to be outdated.

Clearly, the selection of the semihosting sequence was done 6+ years
back and a lot of us were not part of those discussions.

The semihosting v1.0 (this spec) only documents what many upstream
open-source projects (such as OpenOCD, QEMU, OpenSBI, U-Boot, etc)
have already accepted as the RISC-V semihosting sequence. The
software ecosystem has to continue supporting this RISC-V
semihosting sequence, and accordingly we should ratify this spec so
that upstream software is not using any non-ratified / non-documented
sequence.

If a new semihosting-specific instruction is required, then it should
be done as a separate effort. Once such an instruction is ratified,
another effort should follow to come up with a semihosting v2.0
specification.

Regards,
Anup

BGB

Oct 10, 2024, 3:02:15 PM

to isa...@groups.riscv.org

On 10/10/2024 7:32 AM, Bruce Hoult wrote:
> On Thu, Oct 10, 2024 at 7:24 PM Guy Lemieux <guy.l...@gmail.com> wrote:
>> A variable shifter often works with log(XLEN) layers, where each layer
>> shifts by 1b, 2b, 4b, 8b, etc, respectively.
>
> Yes, my previous message was written under this assumption.
>
> The point is that this doesn't -- on all implementations I know of --
> take log(XLEN) clock cycles, but rather all log(XLEN) stages cascade
> within 1 clock cycle. This is the case because integer add (which is
> far more common and important) also takes O(log(XLEN)) layers of logic
> to propagate the carry (in the best implementation), and the big O
> constant is relatively similar in both cases.
>
> So, yes, there is no advantage to power of two shifts, despite each
> layer of the shift network (optionally) shifting by a power of two.
>

Hmm, there could be an RV32I- or RV32E-, with additional cost cutting:
  Only constant shifts are allowed, and only power of 2;
    The shift amount is treated as part of the opcode;
  For operations like BEQ and friends, Rs2 is required to be X0;
    Effectively, only comparing Rs1 with zero.
  Unaligned load/store is disallowed;
  JAL and JALR may only have X0 and X1 as a destination;
  ...

The smallest CPU core I had pulled off in the past was around 4 kLUT,
and had used an SH-2 like subset of the SuperH ISA.
  No integer multiply;
  No variable shift;
  ...
The way one would implement variable shifts being to branch into a table
of 1-bit shift operations.
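
Roughly, the table trick looks like this when modelled in Python. This is only an illustration: the list slice stands in for a computed branch into straight-line code, and the names are made up.

```python
MASK32 = 0xFFFFFFFF

def shl1(x):
    """One 1-bit left shift on a 32-bit value."""
    return (x << 1) & MASK32

# Straight-line "table" of 31 single-bit shifts. Entering at index (31 - n)
# leaves exactly n of them to run before falling out of the bottom.
SHIFT_TABLE = [shl1] * 31

def shift_left_var(x, n):
    """Variable left shift using only 1-bit shift operations."""
    n &= 31
    for op in SHIFT_TABLE[31 - n:]:   # stands in for the branch into the table
        x = op(x)
    return x
```

The cost is one shift operation per bit of shift amount, which is the trade being made for not having a variable shifter at all.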

In past attempts, was not able to get RV32I quite this small, at least
assuming a core that is pipelined and uses full-width registers.

My past attempts at a basic RV64I style core were closer to 9 kLUT, with
around 7 for 32-bit.

For a smaller core, things like shift are fairly expensive, ...
Also expensive:
  Supporting misaligned load/store;
    Can nearly double the cost of the L1 D$.
  Dealing efficiently with things like memory RAW/WAW hazards;
    Forwarding = expensive;
    Stall = slow.
  ...

Though, at this point, most of the semi-mainline FPGA dev-boards
(excluding ICE40 and similar) don't come with anything much smaller than
an XC7S25 or XC7A35T, which can handle such a core (though, bigger core
does mean less space left for peripheral logic).

Though, most of the still-available boards with these FPGAs also tend
to lack external RAM modules, which severely limits their utility
(without external RAM, can't do that much more than a small
microcontroller).

And, for boards with an XC7S50 or XC7A100T or similar, one can afford a
bigger core.

And, many tiny cores don't seem particularly useful for most purposes.

Dunno about ASIC space though.

Well, vs my current core:
  64 bit, SIMD, etc;
  64x 64-bit registers;
    Split into two sets of 32 for RV;
  3 lanes, 6R3W register file;
  SIMD is done in the main registers in my case.
  Two ISAs;
  Roughly 8 decoders:
    3x for my own ISA;
    3x for RV;
    1 for the 16-bit variant of my ISA;
    1 for RVC.

Which weighs in at around 40 kLUT.

As-is, the decoders weigh in at roughly 18% of the LUT cost, but could
try something to reduce this (I am internally debating possible ways to
reduce decoder cost, to reduce the amount of internal MUX'ing needed,
but decided not to go into the specifics here).

Note that the 3rd lane doesn't see much traffic from actual
instructions, but the cost difference between 2 and 3 wide wasn't that
large, and the 3rd lane still serves a use for providing extra register
ports for some cases. Functionally, all it really does at this point is
basic ALU ops (ADD/SUB/AND/OR/XOR, constant load, sign/zero extension).

As-is, much of the cost difference seems to be in the decoders and
register file.

Things like FPU and FP-SIMD also eat a lot of LUTs, but the single
biggest consumer of the LUT budget is the L1 D$ (around 27% of the total
LUT budget for the core).

Note that this is with the L1 D$ still only supporting a single memory
port...

...

Allen Baum

Oct 10, 2024, 8:25:14 PM

to Anup Patel, Bruce Hoult, Tommy Murphy, Guy Lemieux, Frank K. Gurkaynak, RISC-V ISA Dev, apa...@ventanamicro.com

I suspect that the selection of those instructions was made because

- they are noops (of the right class),
- they are each unlikely to show up in real code, and
- they are even less likely to both show up in real code,

thus making it easily recognizable as a semihosting marker.

kr...@sifive.com

Oct 10, 2024, 9:55:34 PM

to Bruce Hoult, Tommy Murphy, Guy Lemieux, Anup Patel, Frank K. Gurkaynak, RISC-V ISA Dev, apa...@ventanamicro.com

We're just trying to ratify the way it is currently working.

It isn't broke, so we won't "fix" it by adding new instructions at
this point,

Krste

kr...@sifive.com

Oct 14, 2024, 10:00:20 PM

to Frank K. Gurkaynak, kr...@sifive.com, Bruce Hoult, Tommy Murphy, Guy Lemieux, Anup Patel, RISC-V ISA Dev, apa...@ventanamicro.com

I was reacting to Bruce's suggestion to add a new ebreak.

I agree it would be useful to add a longer non-normative note
explaining the choice when the extension is folded in. The current
note in unpriv doesn't quite say why this was chosen, although the
historical email threads capture the train of thought,

Krste

>>>>> On Fri, 11 Oct 2024 08:05:06 +0200, "Frank K. Gurkaynak" <k...@iis.ee.ethz.ch> said:

| The initial comment was just to add a clarification as to how this particular sequence was chosen to the document. It seems like it is really random (i.e. it is not following a convention of using a particular set of instructions described somewhere else), so a footnote that says:
| "These instructions which are effectively NOPs have been randomly selected from the base ISA as an unlikely sequence of instructions to appear in real life code"

| would do the trick. I was not suggesting that it be modified, just clarified.

| Cheers,
| Frank
