RFC: RISC-V microcontroller profile

Liviu Ionescu

Mar 14, 2018, 6:29:51 AM

to RISC-V ISA Dev

(I apologise to the Linux audience, but currently I could not find an
email group dedicated to embedded RISC-V devices).

I did a study on using the RISC-V ISA in embedded devices and I
managed to identify several issues together with possible solutions.

The result is a proposal for a 'RISC-V microcontroller profile',
intended to complement the existing privileged profile:

https://github.com/emb-riscv/specs-markdown/blob/master/README.md

The main issues identified are interrupt latency and lack of C/C++ friendliness.

For microcontrollers, the main solutions to minimise latency are:

- a lighter EABI (preferably designed for only 16 registers, and
making use of the other 16 as a bonus, to look similar for both E and
I cores) together with
- hardware stacking/unstacking the interrupt context (which also
greatly improves C friendliness).

Otherwise RISC-V cores will never be able to compete with ARM Cortex-M
cores, which not only have nested interrupts, but also advertise a
maximum of 12 cycles from the interrupt to the final C handler (!) and
10 cycles to fully exit the interrupt (not to mention tail-chaining and,
when FP is present, lazy saving of the FP registers).

Regards,

Liviu

p.s. feel free to reply directly to my email address; if this topic
triggers any interest for embedded devices, I suggest creating a
separate group, like embe...@groups.riscv.org, and moving the
discussions there.

Alex Bradbury

Mar 14, 2018, 7:04:23 AM

to Liviu Ionescu, RISC-V ISA Dev

On 14 March 2018 at 10:29, Liviu Ionescu <i...@livius.net> wrote:
> (I apologise to the Linux audience, but currently I could not find an
> email group dedicated to embedded RISC-V devices).
>
>
> I did a study on using the RISC-V ISA in embedded devices and I
> managed to identify several issues together with possible solutions.
>
> The result is a proposal for a 'RISC-V microcontroller profile',
> intended to complement the existing privileged profile:
>
> https://github.com/emb-riscv/specs-markdown/blob/master/README.md

Thanks for sharing your thoughts here Liviu.

>
> The main issues identified are interrupt latency and lack of C/C++ friendliness.
>
> For microcontrollers, the main solutions to minimise latency are:
>
> - a lighter EABI (preferably designed for only 16 registers, and
> making use of the other 16 as a bonus, to look similar for both E and
> I cores) together with

If this is desirable, I'd be strongly in favour of this being done as
a modification of the not-yet-finalised RV32E ABI rather than having
it be yet another ABI to live alongside ilp32, ilp32e, ilp32f, ilp32d,
lp64, lp64f and lp64d.

I think it's important to get some performance numbers on the options
here. How were you hoping to evaluate them - are there real-world
workloads we might analyse, or do you think that targeted
microbenchmarks are sufficient?

Best,

Alex

Liviu Ionescu

Mar 14, 2018, 8:04:59 AM

to Alex Bradbury, RISC-V ISA Dev

On 14 March 2018 at 13:04:22, Alex Bradbury (a...@asbradbury.org) wrote:

> ... strongly in favour of this being done as

> a modification of the not-yet-finalised RV32E ABI

fully agree.

> rather than having
> it be yet another ABI to live alongside ilp32, ilp32e, ilp32f, ilp32d,
> lp64, lp64f and lp64d.

that would be nice, but I'm afraid not enough.

my proposal for an RV32E ABI would be to save no more than 6 registers
(and no fewer than 4), plus the return address, the status register and
minimal status.

then preferably save the same registers on RV32I/RV64I. at the limit
we can save only 4 registers for RV32E, and 6 for the larger cores,
but the 16 registers now marked as 'saved by the caller' are far
too many.

> I think it's important to get some performance numbers on the options
> here. How were you hoping to evaluate them - are there real-world
> workloads we might analyse, or do you think that targeted
> microbenchmarks are sufficient?

that's a good question.

as prerequisites, I would assume that the stacking/unstacking mechanism
uses only internal fast RAM, which ideally requires 1 cycle per word,
otherwise the results are not comparable.

if Cortex-M can stack/unstack a total of 8 (eight) 32-bit words in
only 12/10 cycles, matching it (at least) would be a good challenge to
start with.

right now, looking at the entry.S code used in Linux, I see that it
saves all 32 registers plus 6 CSRs. I don't have actual measurements,
but I would expect this to take at least 38 cycles, plus a good few
more cycles in the assembly logic used before/after calling the C
code. an inaccurate estimate would be that we are now somewhere in the
50-60 cycle range, without handling nesting and interrupt pre-emption.
adding an extra software stack for nesting might take a good few more
cycles, probably raising the total stacking time to 80-100 cycles, and
slightly less for unstacking.

regards,

Liviu

p.s. any chance to get a separate embe...@groups.riscv.org group?

Watson Ladd

Mar 14, 2018, 4:12:32 PM

to Liviu Ionescu, Alex Bradbury, RISC-V ISA Dev

On Wed, Mar 14, 2018 at 5:04 AM, Liviu Ionescu <i...@livius.net> wrote:
> On 14 March 2018 at 13:04:22, Alex Bradbury (a...@asbradbury.org) wrote:
>
>> ... strongly in favour of this being done as
>> a modification of the not-yet-finalised RV32E ABI
>
> fully agree.
>
>> rather than having
>> it be yet another ABI to live alongside ilp32, ilp32e, ilp32f, ilp32d,
>> lp64, lp64f and lp64d.
>
> that would be nice, but I'm afraid not enough.
>
> my proposal for a RV32E ABI would be to save no more than 6 registers
> (and no less than 4), plus the return address, the status register and
> minimal status.
>
> then preferably save the same registers by RV32I/RV64I. at the limit
> we can save only 4 registers for RV32E, and 6 for the larger cores,
> but 16 registers, as are now marked as 'saved by the caller', are way
> too many.

If the Cortex-M0 can save eight 32-bit words in 12 cycles, we should be
able to save 16 in 24. This is the only work that needs to be done in
hardware on an interrupt: after that we are ready to jump to
an interrupt handler that looks just like a C function.

Liviu Ionescu

Mar 14, 2018, 4:42:38 PM

to Watson Ladd, RISC-V ISA Dev, Alex Bradbury

On 14 March 2018 at 22:12:31, Watson Ladd (watso...@gmail.com) wrote:

> If the Cortex-M0 can save 8 32 bit words in 12 cycles,

I double checked and for Cortex-M3/M4/M7 the latency is 12/10
(entry/exit). Cortex-M0+ is specified at 15 cycles and Cortex-M0 at 16
cycles. the non-FP stack frame is 8 words in all cases. Cortex-M0/M0+
also have an optional zero jitter feature.
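For reference, the 8-word non-FP frame mentioned above can be written down as a C struct. This is only a sketch: the type name is made up for illustration, and the register order is the one documented for Cortex-M hardware stacking (lowest address first).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Basic (non-FP) exception frame stacked by Cortex-M hardware,
 * lowest address first. The type name is illustrative only. */
typedef struct {
    uint32_t r0;
    uint32_t r1;
    uint32_t r2;
    uint32_t r3;
    uint32_t r12;
    uint32_t lr;    /* link register of the interrupted code */
    uint32_t pc;    /* address to return to */
    uint32_t xpsr;  /* program status register */
} cortex_m_basic_frame;

/* Eight 32-bit words, i.e. 32 bytes. */
static_assert(sizeof(cortex_m_basic_frame) == 8 * sizeof(uint32_t),
              "basic frame is eight 32-bit words");
```

These eight words are exactly the registers a plain C function is allowed to clobber under the ARM calling convention, which is why the handler can be an ordinary C function.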

> we should be able to save 16 in 24.

with the current ABI, I estimated at least 18 registers to be saved.
the same rule gives 27 cycles.
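The "same rule" is plain linear scaling from the Cortex-M3/M4/M7 entry figure (8 words in 12 cycles). A minimal sketch of that arithmetic, with a hypothetical function name, and with no claim that real hardware scales linearly:

```c
#include <assert.h>

/* Scale "8 words in 12 cycles" linearly to another number of stacked
 * words, rounding up. Pure arithmetic, not a hardware model. */
unsigned scaled_entry_cycles(unsigned words)
{
    assert(words > 0);
    return (words * 12u + 7u) / 8u;
}
```

With this rule, 16 words give 24 cycles and 18 words give 27 cycles, matching the figures quoted in the thread.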

> This is the only work that needs to be done in
> an interrupt in hardware: now we are ready to jump to
> an interrupt handler that looks just like a C function.

that would be nice.

for compatibility reasons, my proposal also supports the current ABI,
but a lighter ABI would further reduce latency to values comparable
with Cortex-M.

regards,

Liviu

Torbjørn Viem Ness

Mar 14, 2018, 5:20:27 PM

to RISC-V ISA Dev, watso...@gmail.com, a...@asbradbury.org

Hi,

This is probably an unconventional suggestion but I'm still an inexperienced student, so too young to know better I guess. =)

What if we added one more set of registers "behind" the ones that need to be saved upon entering an interrupt handler?
This way the entire status could just be "shifted out" in one cycle before jumping to the routine, and propagating the data to RAM or cache could be handled in the background to prepare for another context swap if necessary, and the latency would be invisible to the user (unless a new call occurs before it's done saving the data from the previous one).
Then after the handler completes, the previous context can simply be shifted back and be ready to go one cycle later.

Does this sound like a good idea (or even feasible), or would it be too expensive in terms of area and complexity seeing as we're talking about microcontrollers?

--
Torbjørn Ness
M.Sc. student, NTNU

Rogier Brussee

Mar 14, 2018, 5:30:20 PM

to RISC-V ISA Dev, watso...@gmail.com, a...@asbradbury.org

On Wednesday, 14 March 2018 at 21:42:38 UTC+1, Liviu Ionescu wrote:

If I understand correctly, the following should save ra plus N additional (word-sized) registers (a0, a1, a2, .. a5, t1, t2) while keeping the stack on a 16-byte boundary

(assuming the C extension is defined[1])

    c.addi16sp sp, -(((N+4)>>2)<<4)     # room for ra + N words, rounded up to 16 bytes
    jalr t0, zero, Csavew - (N<<1)

where Csavew is the bit of millicode

    Csavew-16   c.swsp t2, 32(sp)
    Csavew-14   c.swsp t1, 28(sp)
    Csavew-12   c.swsp a5, 24(sp)
    Csavew-10   c.swsp a4, 20(sp)
    Csavew-8    c.swsp a3, 16(sp)
    Csavew-6    c.swsp a2, 12(sp)
    Csavew-4    c.swsp a1, 8(sp)
    Csavew-2    c.swsp a0, 4(sp)
    Csavew:     c.swsp ra, 0(sp)
                jr t0

If one architecturally fixes the address of Csavew, just as you defined addresses for the memory-mapped CSRs (representable in 12 bits to avoid an additional lui, or in 11 bits if one wants to stay out of the negative range of the immediate), then that jalr t0, zero, Csavew - (N<<1) could be _allowed_ to be implemented in hardware without demanding it.

[1] Mutatis mutandis, the same can be done without the C extension, but the addresses corresponding to the registers would be different.
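The two quantities the caller needs (the stack adjustment and the millicode entry offset) can be computed as below. This is a sketch under stated assumptions: the function names are hypothetical, the frame is counted as ra plus N registers (N + 1 words of 4 bytes), and each millicode store is a 2-byte compressed instruction.

```c
#include <assert.h>

/* Bytes to subtract from sp to hold ra plus n extra word-sized
 * registers, rounded up to `align` bytes (a power of two). */
unsigned save_frame_bytes(unsigned n, unsigned align)
{
    assert(align != 0 && (align & (align - 1u)) == 0);
    unsigned bytes = (n + 1u) * 4u;              /* ra + n registers */
    return (bytes + align - 1u) & ~(align - 1u); /* round up */
}

/* Byte offset of the millicode entry point relative to the Csavew
 * label, assuming one 2-byte store per saved register. */
int save_entry_offset(unsigned n)
{
    return -(int)(n * 2u);
}
```

For all eight extra registers with 16-byte alignment this gives a 48-byte frame and an entry point 16 bytes before Csavew.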


Liviu Ionescu

Mar 14, 2018, 5:33:36 PM

to Torbjørn Viem Ness, RISC-V ISA Dev, watso...@gmail.com, a...@asbradbury.org

On 14 March 2018 at 23:20:30, Torbjørn Viem Ness (tbn...@gmail.com) wrote:

> This is probably an unconventional suggestion but I'm still an unexperienced

> student, so too young to know better I guess. =)

don't worry, creativity has no age ;-)

> What if we added one more set of registers "behind" the ones that need to
> be saved upon entering an interrupt handler?

if I'm not terribly wrong, this technique is called 'shadow register
set', and it is used by some other architectures more concerned with
latency (MIPS, PIC32, maybe SPARC, possibly others).

it is probably the fastest solution.

> Does this sound like a good idea (or even feasible), or would it be too
> expensive in terms of area and complexity seeing as we're talking about
> microcontrollers?

I would say it is not cheap.

regards,

Liviu

Liviu Ionescu

Mar 14, 2018, 5:39:40 PM

to Rogier Brussee, RISC-V ISA Dev, watso...@gmail.com, a...@asbradbury.org

On 14 March 2018 at 23:30:22, Rogier Brussee (rogier....@gmail.com) wrote:

> ... to save ra + N additional (word sized) registers ... on a 16 byte boundary

in my proposal I considered an 8-byte alignment enough, but if a
16-byte boundary simplifies the logic or makes things faster, I see no
problem in updating the specs to support it (the number of extra words
added must be preserved somewhere to de-adjust the stack pointer on
exit).

regards,

Liviu

Jacob Bachmeyer

Mar 14, 2018, 5:41:20 PM

to Torbjørn Viem Ness, RISC-V ISA Dev, watso...@gmail.com, a...@asbradbury.org

Torbjørn Viem Ness wrote:
> This is probaby an unconventional suggestion but I'm still an
> unexperienced student, so too young to know better I guess. =)
>
> What if we added one more set of registers "behind" the ones that need
> to be saved upon entering an interrupt handler?

You are asking for shadow registers.

> This way the entire status could just be "shifted out" in one cycle
> before jumping to the routine, and propagating the data to RAM or
> cache could be handled in the background to prepare for another
> context swap if necessary, and the latency would be invisible to the
> user (unless a new call occurs before it's done saving the data from
> the previous one).
> Then after the handler completes, the previous context can simply be
> shifted back and be ready to go one cycle later.
>
> Does this sound like a good idea (or even feasible), or would it be
> too expensive in terms of area and complexity seeing as we're talking
> about microcontrollers?

I am working on a similar proposal (there are some points of
disagreement between Liviu Ionescu and myself) that uses a shadow
register bank almost exactly as you suggest, including
spilling/reloading the inactive shadow registers into stack frames. The
main difference is that I will also propose an EABI with a minimum of
caller-saved registers, and only those EABI caller-saved registers are
shadowed.

-- Jacob

Liviu Ionescu

Mar 14, 2018, 5:52:16 PM

to Torbjørn Viem Ness, jcb6...@gmail.com, a...@asbradbury.org, RISC-V ISA Dev, watso...@gmail.com

On 14 March 2018 at 23:41:20, Jacob Bachmeyer (jcb6...@gmail.com) wrote:

> ... a similar proposal ... that uses a shadow register bank

any solution that provides better latency and/or ease of use, and
has an acceptable cost, is welcome.

from my point of view, for a real life device, that includes flash,
ram and lots of complex peripherals, architecture C/C++ friendliness
and performance (like latency) are more important than the number of
transistors in the core (within reasonable limits).

regards,

Liviu

Jacob Bachmeyer

Mar 14, 2018, 6:30:31 PM

to Liviu Ionescu, RISC-V ISA Dev

Liviu Ionescu wrote:
> (I apologise to the Linux audience, but currently I could not find an
> email group dedicated to embedded RISC-V devices).
>

This is the RISC-V ISA mailing list; while there may be many people here
interested primarily in Linux, I am quite certain that wider discussions
are appropriate.

> I did a study on using the RISC-V ISA in embedded devices and I
> managed to identify several issues together with possible solutions.
>

I have been working on a similar proposal, taking a different approach
to some of the issues Liviu Ionescu has raised. I have attached a
working draft and also seek comments and comparisons between our proposals.

-- Jacob

risc-v-microcontroller-system-isa.org

Rogier Brussee

Mar 14, 2018, 7:13:14 PM

to RISC-V ISA Dev, rogier....@gmail.com, watso...@gmail.com, a...@asbradbury.org

On Wednesday, 14 March 2018 at 22:39:40 UTC+1, Liviu Ionescu wrote:

The 16-byte alignment is not essential, but it allows the use of c.addi16sp.

Essentially the same idea can be used with 4-byte alignment

    addi sp, sp, -((N+1) << 2)
    jalr t0, zero, Csavew - (N<<1)

or 8-byte alignment

    addi sp, sp, -(((N+2)>>1) << 3)
    jalr t0, zero, Csavew - (N<<1)

Regards,

Rogier


Richard Herveille

Mar 15, 2018, 4:55:13 AM

to Liviu Ionescu, Torbjørn Viem Ness, RISC-V ISA Dev, watso...@gmail.com, a...@asbradbury.org, Richard Herveille

The problem with shadow registers is that you always run out and you still need to spill to main memory.

For an RVE implementation, which halves the register file (RF) to save gates, it would be weird to double the storage again just to implement a shadow register set.

Richard

Richard Herveille

Managing Director

Phone +31 (45) 405 5681

Cell +31 (6) 5207 2230

richard....@roalogic.com

--

You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.

To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.

To post to this group, send email to isa...@groups.riscv.org.

Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/CAG7hfcJRmvBgfMCg2rtnAh-3xfoahRiF225Yvu9k%3DBbNjy_xDA%40mail.gmail.com.

Liviu Ionescu

Mar 15, 2018, 5:41:50 AM

to Torbjørn Viem Ness, Richard Herveille, RISC-V ISA Dev, watso...@gmail.com, a...@asbradbury.org

On 15 March 2018 at 10:55:11, Richard Herveille

(richard....@roalogic.com) wrote:

> The problem with shadow registers is that you always run out and you still need to spill
> to main memory.

you run out when the interrupt nesting gets deeper than the available
register banks; in most cases, the depth is 1, rarely 2, even more
rarely 3, and so on.

spilling can be done in parallel, while starting the handler.

but this method reveals a possible latency problem: if a high priority
interrupt occurs right after a series of other interrupts, and there
are no more register banks, it must wait for a previous spill to
complete, to free a register bank, leading to a jitter on the high
priority interrupt latency.
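The jitter described above can be illustrated with a toy model. Everything here is an assumption made for the sketch (the struct, the field names, the one-cycle bank swap, and the simplification that an interrupt arriving with no free bank waits for one full spill); it is not taken from any of the proposals.

```c
#include <assert.h>

/* Toy model of a core with a few shadow register banks. */
typedef struct {
    unsigned banks;         /* shadow banks available in hardware */
    unsigned in_use;        /* banks holding a context not yet spilled */
    unsigned spill_cycles;  /* cycles to spill one bank to memory */
} shadow_model;

/* Cycles before the handler can start for one more nested interrupt.
 * A free bank costs one cycle (the swap); with no free bank the
 * interrupt first waits for a spill to free one, and the freed bank
 * is reused, so in_use stays unchanged. */
unsigned entry_latency(shadow_model *m)
{
    assert(m->banks > 0);
    if (m->in_use < m->banks) {
        m->in_use++;
        return 1u;
    }
    return 1u + m->spill_cycles;
}
```

With one bank and an 8-cycle spill, the first interrupt starts after 1 cycle and a second nested one after 9; that difference is the jitter a control loop would see.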

most applications tolerate a small jitter, but for applications that
implement control loops we might need a way to disable this mechanism
and provide constant latency (even if it is slightly higher).
Cortex-M0 has such a configuration bit to prevent jitter.

> For an RVE implementation, which reduces the RF in half to save gates, it would be weird
> to double the memory now, just to implement a shadow register.

yes, this mechanism is not cheap. however, as Jacob suggested, only
the caller-saved ABI registers need to be shadowed/spilled, so, with a
lighter EABI, the extra cost may be kept to a minimum.

---

using shadow registers seems attractive at first sight, but I'm afraid
it brings more problems than it solves.

for the moment, regardless of the implementation, my conclusion is that
a light EABI with a small number of caller-saved registers, plus fast
hardware stacking/unstacking, seems to be required anyway.

regards,

Liviu

Michael Clark

Mar 15, 2018, 6:29:52 AM

to jcb6...@gmail.com, Torbjørn Viem Ness, RISC-V ISA Dev, watso...@gmail.com, a...@asbradbury.org

> On 14/03/2018, at 2:41 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> Torbjørn Viem Ness wrote:
>> This is probaby an unconventional suggestion but I'm still an unexperienced student, so too young to know better I guess. =)
>>
>> What if we added one more set of registers "behind" the ones that need to be saved upon entering an interrupt handler?
>
> You are asking for shadow registers.

I had asked for a similar feature called “Privileged Register Windows” which, unlike “Register Windows” on SPARC, are only windowed on privilege mode changes, not on all procedure calls. Only a0-a7 would be shared between different privilege modes, and the remainder of the registers would be in a per-mode window. On an OoO core, this could be handled in the renamer, and in a single-issue core, access to the larger register file could be muxed by privilege mode.

This is very different from “Register Windows”, which have fallen out of favour because compiler register allocators can do the job better than register windows on regular procedure calls. Rather, “Privileged Register Windows” are a feature to minimise save/restore when handling synchronous or asynchronous traps between privilege levels.

Of course they don’t help in the case of a nested trap in the same mode, but they could reduce syscall and asynchronous trap latency in a system where code tends to be operating in U mode and the interrupt bottom half is running in S mode.
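The muxing by privilege mode could look roughly like the mapping below. This is a sketch only: the function name, the physical layout (shared registers first, then one 32-entry window per mode) and the mode numbering are assumptions; the only fact used from the ABI is that a0-a7 are x10-x17.

```c
#include <assert.h>

#define NUM_REGS      32u
#define SHARED_FIRST  10u   /* a0 = x10 */
#define SHARED_LAST   17u   /* a7 = x17 */

/* Map (privilege mode, architectural register) to an index in a
 * larger physical register file. a0-a7 map to the same physical
 * register in every mode; all other registers get one copy per mode,
 * placed above the shared region. */
unsigned phys_reg(unsigned mode, unsigned reg)
{
    assert(reg < NUM_REGS);
    if (reg >= SHARED_FIRST && reg <= SHARED_LAST)
        return reg;
    return (mode + 1u) * NUM_REGS + reg;
}
```

A trap from mode 0 to mode 1 then sees the same a0-a7 (so arguments pass through unchanged) but a fresh set of every other register, with nothing to save or restore.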

>> This way the entire status could just be "shifted out" in one cycle before jumping to the routine, and propagating the data to RAM or cache could be handled in the background to prepare for another context swap if necessary, and the latency would be invisible to the user (unless a new call occurs before it's done saving the data from the previous one).
>> Then after the handler completes, the previous context can simply be shifted back and be ready to go one cycle later.
>>
>> Does this sound like a good idea (or even feasible), or would it be too expensive in terms of area and complexity seeing as we're talking about microcontrollers?
>
> I am working on a similar proposal (there are some points of disagreement between Liviu Ionescu and myself) that uses a shadow register bank almost exactly as you suggest, including spilling/reloading the inactive shadow registers into stack frames. The main difference is that I will also propose an EABI with a minimum of caller-saved registers, and only those EABI caller-saved registers are shadowed.
>
>
>
> -- Jacob
>


Rogier Brussee

Mar 15, 2018, 6:32:05 AM

to RISC-V ISA Dev, tbn...@gmail.com, richard....@roalogic.com, watso...@gmail.com, a...@asbradbury.org

On Thursday, 15 March 2018 at 10:41:50 UTC+1, Liviu Ionescu wrote:

> On 15 March 2018 at 10:55:11, Richard Herveille
> (richard....@roalogic.com) wrote:
>
> > The problem with shadow registers is that you always run out and you still need to spill
> > to main memory.
>
> you run out when the interrupt nesting gets deeper than the available
> register banks; in most cases, the depth is 1, rarely 2, even rarely
> 3, and so on.
>
> spilling can be done in parallel, while starting the handler.
>
> but this method reveals a possible latency problem: if a high priority
> interrupt occurs right after a series of other interrupts, and there
> are no more register banks, it must wait for a previous spill to
> complete, to free a register bank, leading to a jitter on the high
> priority interrupt latency.
>
> most applications tolerate a small jitter, but for applications that
> implement control loops we might need a way to disable this mechanism
> and provide constant latency (even if it is slightly higher).
> Cortex-M0 has such a configuration bit to prevent jitter.
>
> > For an RVE implementation, which reduces the RF in half to save gates, it would be weird
> > to double the memory now, just to implement a shadow register.
>
> yes, this mechanism is not cheap. however, as Jacob suggested, only
> the ABI caller registers need to be shadowed/spilled, so, with a
> lighter EABI, the extra cost may be kept to a minimum.

_If_ you have all 31 + 1 registers available, what stops you from defining an ABI that sets aside, say, registers x22-x31 exclusively for the highest-level interrupts, e.g. using

x22 -> irq_ra
x23 -> irq_sp
x24 -> irq_tp
x25 -> irq_s0
x26 -> irq_t0
x27 -> irq_a0
x28 -> irq_a1
..
x31 -> irq_a4

(this assumes x3 = gp can still be used as a global pointer)

Liviu Ionescu

Mar 15, 2018, 6:38:07 AM

to Rogier Brussee, RISC-V ISA Dev, tbn...@gmail.com, richard....@roalogic.com, watso...@gmail.com, a...@asbradbury.org

On 15 March 2018 at 12:32:07, Rogier Brussee (rogier....@gmail.com) wrote:

> If_ you have all 31 + 1 registers available, what stops you from
> defining an ABI that that sets aside, say, registers 22--31

> exclusively for the highest level interrupts e.g. using ...

can you elaborate?

how would this work with nested interrupts?

regards,

Liviu

Rogier Brussee

Mar 15, 2018, 7:23:43 AM

to RISC-V ISA Dev, rogier....@gmail.com, tbn...@gmail.com, richard....@roalogic.com, watso...@gmail.com, a...@asbradbury.org

On Thursday, 15 March 2018 at 11:38:07 UTC+1, Liviu Ionescu wrote:

It would not work, at least not without spilling the (fewer) reserved registers, exactly as with all other shadow schemes.

I wrote highest level (highest priority) interrupts assuming the interrupt necessarily runs uninterrupted to completion. In any case, I just wanted to point out that I think you could use the 32-register ISA as 16 registers and 16 "shadowy" registers, or with any other split, like 22 registers and 10 "shadowy" registers for interrupts, or for that matter 16 registers for normal use, 10 registers for nestable interrupts that have to spill on entry, and 6 for non-nestable, uninterruptible highest-priority interrupts.


Liviu Ionescu

Mar 15, 2018, 7:34:08 AM

to Rogier Brussee, RISC-V ISA Dev, tbn...@gmail.com, richard....@roalogic.com, watso...@gmail.com, a...@asbradbury.org

On 15 March 2018 at 13:23:45, Rogier Brussee (rogier....@gmail.com) wrote:

> ... interrupts assuming the interrupt necessarily
> runs uninterrupted to completion.

real-time systems **need** nested interrupts, this is one of the main
requirements for the microcontroller profile.

> ... you could use the 32 register ISA as 16 registers and 16
> "shadowy" registers

I still think we should design the basic RISC-V EABI with a set of 16
registers (for very small RV32E devices), and then extend it to a
sibling that has more registers (for RV32I/RV64I), but make sure the
extra registers have no special meaning, so the compiler can use them
only for more local variables.

regards,

Liviu

kr...@berkeley.edu

Mar 15, 2018, 7:48:55 PM

to Liviu Ionescu, Rogier Brussee, RISC-V ISA Dev, tbn...@gmail.com, richard....@roalogic.com, watso...@gmail.com, a...@asbradbury.org

I'll shortly be sending out an invite to a new Foundation Task Group
we have formed to address adding fast interrupts to RISC-V.

Germane to this thread, one feature of the proposal under development
is to standardize interrupt attribute annotations so C compilers can
generate interrupt handlers that only save registers as needed. This
effectively changes the calling conventions just for the handlers but
leaves the rest of the ABI unchanged.

/* Not real code, just a sketch. */
extern volatile int *DEVICE;
extern volatile int *COUNT;

void __attribute__ ((interrupt))
foo(void) {
    *DEVICE = 0;
    (*COUNT)++;   /* increment the counter, not the pointer */
}

A rough sketch of what a generated handler looks like is:

# Small ISR that pokes device to clear interrupt, and increments in-memory counter.

    .align 3                    # Has to be 8-byte aligned.
foo:
    addi sp, sp, -16            # Create a frame on stack.
    sw s0, 0(sp)                # Save working register.
    sw x0, DEVICE, s0           # Clear interrupt flag.
    sw s1, 4(sp)                # Save working register.
    la s0, COUNT                # Get counter address.
    li s1, 1
    amoadd.w x0, s1, (s0)       # Increment counter in memory.
    lw s1, 4(sp)                # Restore registers.
    lw s0, 0(sp)
    addi sp, sp, 16             # Free stack frame.
    mret                        # Return from handler using saved mepc.

This change will be useful even with the existing interrupt architecture,
but the TG will be looking at a new design that supports nested
interrupts. Our initial studies show a small core can take an interrupt,
enter, execute, and exit the handler above in fewer than 20 cycles,
while supporting preemption on any clock cycle (i.e., only a few cycles, ~3,
to get to the first instruction).
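A variant of the C sketch that also builds on a desktop host, so the handler logic can be exercised without hardware, might look like this. The macro name ISR_ATTR and the two stand-in variables are assumptions for illustration; on a RISC-V GCC target the macro expands to the interrupt attribute, elsewhere to nothing.

```c
#include <assert.h>  /* only needed for host-side checks */

/* Expand to the interrupt attribute only when targeting RISC-V, so
 * the same source can be compiled and tested on another host. */
#if defined(__riscv)
#define ISR_ATTR __attribute__((interrupt))
#else
#define ISR_ATTR
#endif

/* Stand-ins for the memory-mapped device register and the counter. */
volatile int device_flag = 1;
volatile int irq_count = 0;

volatile int *const DEVICE = &device_flag;
volatile int *const COUNT = &irq_count;

ISR_ATTR void foo(void)
{
    *DEVICE = 0;    /* clear the interrupt flag */
    (*COUNT)++;     /* parentheses matter: *COUNT++ would advance the pointer */
}
```

On a host build, calling foo() once should leave device_flag at 0 and irq_count at 1. On a RISC-V target the function must only be entered through the trap mechanism, because the attribute makes it return with mret.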

Krste


Tommy Thorn

Mar 15, 2018, 7:58:17 PM

to kr...@berkeley.edu, Liviu Ionescu, Rogier Brussee, RISC-V ISA Dev, tbn...@gmail.com, richard....@roalogic.com, watso...@gmail.com, a...@asbradbury.org

> On Mar 15, 2018, at 16:48 , kr...@berkeley.edu wrote:
> I'll shortly be sending out an invite to a new Foundation Task Group
> we have formed to address adding fast interrupts to RISC-V.
>
> Germane to this thread, one feature of the proposal under development
> is to standardize interrupt attribute annotations so C compilers can
> generate interrupt handlers that only save registers as needed. This
> effectively changes the calling conventions just for the handlers but
> leaves the rest of the ABI unchanged.
>
> /* Not real code, just a sketch. */
> extern volatile int *DEVICE;
> extern volatile int *COUNT;
>
> void __attribute__ ((interrupt))
> foo() {
> *DEVICE = 0;
> *COUNT++;
> }
>
> A rough sketch of what a generated handler looks like is:
>
> # Small ISR that pokes device to clear interrupt, and increments in-memory counter.
>
> .align 3 # Has to be 8-byte aligned.
> foo:
> addi sp, sp, -16 # Create a frame on stack.

If the ABI had included a stack "red zone" with a small reservation for interrupts,
then the two "addi sp, " instructions could have been avoided in most cases.

> sw s0, 0(sp) # Save working register.

Presumably you meant to load s0 with a global pointer?

> sw x0, DEVICE, s0 # Clear interrupt flag.
> sw s1, 4(sp) # Save working register.
> la s0, COUNT # Get counter address.
> li s1, 1
> amoadd.w x0, (s0), s1 # Increment counter in memory.
> lw s1, 4(sp) # Restore registers.
> lw s0, 0(sp)
> addi sp, sp, 16 # Free stack frame.
> mret # Return from handler using saved mepc.

Tommy

Andrew Waterman

Mar 15, 2018, 8:16:12 PM

to Tommy Thorn, Krste Asanovic, Liviu Ionescu, Rogier Brussee, RISC-V ISA Dev, tbn...@gmail.com, Richard Herveille, watso...@gmail.com, Alex Bradbury

On Thu, Mar 15, 2018 at 4:58 PM, Tommy Thorn
<tommy...@esperantotech.com> wrote:
>> On Mar 15, 2018, at 16:48 , kr...@berkeley.edu wrote:
>> I'll shortly be sending out an invite to a new Foundation Task Group
>> we have formed to address adding fast interrupts to RISC-V.
>>
>> Germane to this thread, one feature of the proposal under development
>> is to standardize interrupt attribute annotations so C compilers can
>> generate interrupt handlers that only save registers as needed. This
>> effectively changes the calling conventions just for the handlers but
>> leaves the rest of the ABI unchanged.
>>
>> /* Not real code, just a sketch. */
>> extern volatile int *DEVICE;
>> extern volatile int *COUNT;
>>
>> void __attribute__ ((interrupt))
>> foo() {
>> *DEVICE = 0;
>> *COUNT++;
>> }
>>
>> A rough sketch of what a generated handler looks like is:
>>
>> # Small ISR that pokes device to clear interrupt, and increments in-memory counter.
>>
>> .align 3 # Has to be 8-byte aligned.
>> foo:
>> addi sp, sp, -16 # Create a frame on stack.
>
> If the ABI had included a stack "red zone" with a small reservation for interrupts,
> then the two "addi sp, " instructions could have been avoided in most cases.

In the non-preemptible case, the addi instructions can be elided
as-is. This example works for the (forthcoming) preemptible case, as
well.

>
>> sw s0, 0(sp) # Save working register.
>
> Presumedly you meant to load s0 with a global pointer?

That's what's going on here. "sw x0, DEVICE, s0" is "store x0 to
global symbol DEVICE using s0 as a temporary", i.e., it's syntactic
sugar for "1: auipc s0, %pcrel_hi(DEVICE); sw x0, %pcrel_lo(1b)(s0)"

>
>> sw x0, DEVICE, s0 # Clear interrupt flag.
>> sw s1, 4(sp) # Save working register.
>> la s0, COUNT # Get counter address.
>> li s1, 1
>> amoadd.w x0, (s0), s1 # Increment counter in memory.
>> lw s1, 4(sp) # Restore registers.
>> lw s0, 0(sp)
>> addi sp, sp, 16 # Free stack frame.
>> mret # Return from handler using saved mepc.
>
> Tommy
>
>
>>
>> This change will be useful even with existing interrupt architecture,
>> but TG will be looking at a new design that supports nested
>> interrupts. Our initial studies show a small core can take interrupt,
>> enter, execute, and exit the handler above in less than 20 cycles,
>> while supporting preemption on any clock cycle (i.e., only a few cycles ~3
>> to get to first instruction).
>>
>> Krste
>


Bruce Hoult

Mar 15, 2018, 8:26:18 PM

to Tommy Thorn, Krste Asanovic, Liviu Ionescu, Rogier Brussee, RISC-V ISA Dev, tbn...@gmail.com, Richard Herveille, watso...@gmail.com, Alex Bradbury

On Thu, Mar 15, 2018 at 4:58 PM, Tommy Thorn <tommy...@esperantotech.com> wrote:

> On Mar 15, 2018, at 16:48 , kr...@berkeley.edu wrote:
> I'll shortly be sending out an invite to a new Foundation Task Group
> we have formed to address adding fast interrupts to RISC-V.
>
> Germane to this thread, one feature of the proposal under development
> is to standardize interrupt attribute annotations so C compilers can
> generate interrupt handlers that only save registers as needed. This
> effectively changes the calling conventions just for the handlers but
> leaves the rest of the ABI unchanged.
>
> /* Not real code, just a sketch. */
> extern volatile int *DEVICE;
> extern volatile int *COUNT;
>
> void __attribute__ ((interrupt))
> foo() {
> *DEVICE = 0;
> (*COUNT)++;
> }
>
> A rough sketch of what a generated handler looks like is:
>
> # Small ISR that pokes device to clear interrupt, and increments in-memory counter.
>
> .align 3 # Has to be 8-byte aligned.
> foo:
> addi sp, sp, -16 # Create a frame on stack.

> If the ABI had included a stack "red zone" with a small reservation for interrupts,
> then the two "addi sp, " instructions could have been avoided in most cases.

I believe you've got that backwards.

A "red zone" is stack space below the Stack Pointer that may be used by normal leaf functions without adjusting the Stack Pointer.

If the ABI has a Red Zone then all interrupt service routines must subtract the size of the Red Zone from the Stack Pointer *in addition* to whatever space the interrupt routine will use.
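
As a rough arithmetic sketch of that cost, assume a hypothetical ABI with a 128-byte red zone (the size the x86-64 System V ABI uses; the RISC-V ABI defines none) and the 16-byte stack alignment of the standard RISC-V calling convention. The helper name isr_sp_adjust is made up for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: how many bytes an ISR must subtract from sp on entry.
   The interrupted code may be using the red zone below sp, so the ISR has to
   step over it before reserving its own frame. */
static size_t isr_sp_adjust(size_t red_zone_bytes, size_t isr_frame_bytes)
{
    size_t total = red_zone_bytes + isr_frame_bytes;

    /* Round up to a 16-byte boundary (assumed stack alignment). */
    return (total + 15u) & ~(size_t)15u;
}
```

With no red zone the 16-byte frame of the handler above costs 16 bytes of stack; with the assumed 128-byte red zone the same handler has to move sp by 144 bytes.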

> sw s0, 0(sp) # Save working register.

> Presumably you meant to load s0 with a global pointer?

> sw x0, DEVICE, s0 # Clear interrupt flag.
> sw s1, 4(sp) # Save working register.
> la s0, COUNT # Get counter address.
> li s1, 1
> amoadd.w x0, (s0), s1 # Increment counter in memory.
> lw s1, 4(sp) # Restore registers.
> lw s0, 0(sp)
> addi sp, sp, 16 # Free stack frame.
> mret # Return from handler using saved mepc.

Tommy

>
> This change will be useful even with existing interrupt architecture,
> but TG will be looking at a new design that supports nested
> interrupts. Our initial studies show a small core can take interrupt,
> enter, execute, and exit the handler above in less than 20 cycles,
> while supporting preemption on any clock cycle (i.e., only a few cycles ~3
> to get to first instruction).
>
> Krste


Alex Bradbury

Mar 15, 2018, 11:41:50 PM

to Krste Asanovic, Liviu Ionescu, Rogier Brussee, RISC-V ISA Dev, tbn...@gmail.com, Richard Herveille, watso...@gmail.com

Why is 8-byte function alignment required?

Standardising an interrupt attribute similar to those supported by
compilers for most other targets would definitely be worthwhile.
Liviu's document raises the concern that you have to spill the
caller-saved registers in the case where your interrupt handler calls
a function compiled for the standard calling convention. Of course if
your ISR is calling functions where inlining can't be justified, your
interrupt handling is fairly heavyweight already, meaning the extra
overhead of saving caller-saved registers should be a smaller
percentage of execution time.

Better understanding and characterising the workloads people are
struggling with would really help in defining the best solution here.

Best,

Alex

Krste Asanovic

Mar 16, 2018, 1:48:34 AM

to Alex Bradbury, Liviu Ionescu, Rogier Brussee, RISC-V ISA Dev, tbn...@gmail.com, Richard Herveille, watso...@gmail.com

That follows from the trap vector alignment constraint in a particular proposal.

> Standardising an interrupt attribute similar to those supported by
> compilers for most other targets would definitely be worthwhile.
> Liviu's document raises the concern that you have to spill the
> caller-saved registers in the case where your interrupt handler calls
> a function compiled for the standard calling convention. Of course if
> your ISR is calling functions where inlining can't be justified, your
> interrupt handling is fairly heavyweight already, meaning the extra
> overhead of saving caller-saved registers should be a smaller
> percentage of execution time.

Yes. Also, at some point regardless of calling convention, you should schedule long-running compute to a background thread scheduled at a more appropriate time, both to reduce interrupt latency and to avoid wasting processor time shuffling registers in ISR routines.

> Better understanding and characterising the workloads people are
> struggling with would really help in defining the best solution here.

Of course. The difficult part is persuading owners to share their workload.
Any offers?

Krste

> Best,
>
> Alex

Liviu Ionescu

Mar 16, 2018, 3:37:48 AM

to kr...@berkeley.edu, RISC-V ISA Dev, tbn...@gmail.com, Rogier Brussee, richard....@roalogic.com, watso...@gmail.com, a...@asbradbury.org

On 16 March 2018 at 01:48:53, kr...@berkeley.edu (kr...@berkeley.edu) wrote:

> > void __attribute__ ((interrupt))
> foo() {
> *DEVICE = 0;
> (*COUNT)++;
> }

...

> > Our initial studies show a small core can take interrupt,
> enter, execute, and exit the handler above in less than 20 cycles,
> while supporting preemption on any clock cycle (i.e., only a
> few cycles ~3
> to get to first instruction).

you can make a design like this and claim less than 20 cycles latency,
but most real applications need to call a plain C function from the
interrupt handler.

to be correct, you must measure latency from the interrupt to the
moment execution enters the plain C function.

can you estimate latency in this case? both entry and exit latencies
are important.

for the final design you must also consider cases like tail chaining
and late arrival.

regards,

Liviu

Albert Cahalan

Mar 16, 2018, 3:45:07 AM

to Liviu Ionescu, Rogier Brussee, RISC-V ISA Dev, tbn...@gmail.com, richard....@roalogic.com, watso...@gmail.com, a...@asbradbury.org

On 3/15/18, Liviu Ionescu <i...@livius.net> wrote:

> real-time systems **need** nested interrupts, this is one of the
> main requirements for the microcontroller profile.

No they don't. I've supported real-time systems as an OS developer
for two different OSes, MC/OS and RedHawk. Customers had a
fondness for running with interrupts disabled entirely. This caused
all sorts of fun for the OS's internal housekeeping tasks. There could
be no clock tick, yet the customer expects the clock to keep working!

In one case, the customer was trying to warp a mirror to aim a laser.
This had to overcome turbulent air bending the beam away from the
intended target. Failure is literally fatal, due to an incoming missile.

For one of those OSes, real-time tasks would run on cores that were
being babysat by the other cores. The cores with real-time tasks would
simply not take any interrupts at all.

The fastest way to get data is to spin waiting for it. That is, you poll
for just one thing continuously. You don't mess around with interrupts.

> I still think we should design the basic RISC-V EABI with a set of 16
> registers (for very small RV32E devices), and, then extend it to a

Normal compilers hardly even use 16 when that is what is available.
I think going past 16 registers was not good, but this is water under
the bridge now. Dropping to 16 is a completely different architecture.
There comes a time to push ahead with what you have, flaws and all.

Liviu Ionescu

Mar 16, 2018, 4:02:18 AM

to Albert Cahalan, RISC-V ISA Dev, tbn...@gmail.com, Rogier Brussee, richard....@roalogic.com, watso...@gmail.com, a...@asbradbury.org

On 16 March 2018 at 09:45:06, Albert Cahalan (acah...@gmail.com) wrote:

> real-time tasks would run on cores that were
> being babysat by the other cores. The cores with real-time tasks
> would simply not take any interrupts at all.

yes, for hard real-time tasks this is probably the case, but I'm not
sure all applications can afford multi-core devices plus the added
cost of writing multi-core software; I would estimate that only the
top 10% of real-time applications are that extreme, and the rest of
them can still use simpler solutions, if properly designed.

regards,

Liviu

Liviu Ionescu

Mar 16, 2018, 4:31:22 AM

to Krste Asanovic, Alex Bradbury, RISC-V ISA Dev, tbn...@gmail.com, Rogier Brussee, Richard Herveille, watso...@gmail.com

On 16 March 2018 at 05:41:49, Alex Bradbury (a...@asbradbury.org) wrote:

> Standardising an interrupt attribute similar to those supported
> by
> compilers for most other targets would definitely be worthwhile.

interrupt attributes are common to the architectures of yesterday, if
RISC-V wants to be the architecture of the future, it should not look
only to the past.

modern microcontroller architectures use no attributes at all, the
hardware is able to call plain C functions directly.

for the privileged profile you can invent any attributes you like and
try to enforce inlining for the entire interrupt handler, but for the
microcontroller profile I think that hardware stacking/unstacking
coupled with a lite ABI can provide the best performance (with a target
of 12+10 cycles for the total entry/exit to a C function). plus,
it is unbeatable in terms of ease of use.

regards,

Liviu

kr...@berkeley.edu

Mar 16, 2018, 4:38:59 AM

to Liviu Ionescu, kr...@berkeley.edu, RISC-V ISA Dev, tbn...@gmail.com, Rogier Brussee, richard....@roalogic.com, watso...@gmail.com, a...@asbradbury.org

>>>>> On Fri, 16 Mar 2018 00:37:45 -0700, Liviu Ionescu <i...@livius.net> said:
| you can make a design like this and claim less than 20 cycles latency,
| but most real applications need to call a plain C function from the
| interrupt handler.

It is less than 20 cycles for this use case. There are many different
use cases, including common patterns represented by this example that
explicitly avoid complex code in the ISR itself.

| to be correct, you must measure latency from the interrupt to the
| moment execution enters the plain C function.

No, that is not the standard definition of interrupt latency. The
measure you propose is only interesting for a particular use case
where ISRs are compiled as standard C functions. I understand this
use case exists, but it is not the only one.

| can you estimate latency in this case? both entry and exit latencies
| are important.

Obviously this depends on the ABI when calling a standard compiled C
function, and for the standard ABI you will have to save/restore a lot
of registers.
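
To put a rough number on "a lot of registers": under the standard RISC-V integer calling convention the caller-saved registers are ra, t0-t6 and a0-a7, so a handler that calls an ordinary compiled C function has to preserve all of them. The sketch below only counts them. The macro and function names are illustrative, and the RV32E figure assumes the same register roles restricted to x0-x15.

```c
#include <stdint.h>

#define REG(n) (UINT32_C(1) << (n))

/* Caller-saved integer registers in the standard calling convention:
   ra (x1), t0-t2 (x5-x7), a0-a7 (x10-x17), t3-t6 (x28-x31). */
#define CALLER_SAVED_RV32I \
    (REG(1) | REG(5) | REG(6) | REG(7) | \
     (UINT32_C(0xFF) << 10) | (UINT32_C(0xF) << 28))

/* Assumption: an RV32E ABI keeps the same roles but only has x0-x15. */
#define CALLER_SAVED_RV32E (CALLER_SAVED_RV32I & UINT32_C(0xFFFF))

/* Count the set bits, i.e. the number of registers in the mask. */
static unsigned count_regs(uint32_t mask)
{
    unsigned n = 0;
    for (; mask != 0; mask &= mask - 1u)
        n++;
    return n;
}

/* Stack bytes needed to spill every register in the mask. */
static unsigned spill_bytes(uint32_t mask, unsigned bytes_per_reg)
{
    return count_regs(mask) * bytes_per_reg;
}
```

On RV32 that is 16 words (64 bytes) for the full register set and 10 words (40 bytes) for the 16-register subset, before any floating-point state is considered.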

| for the final design you must also consider cases like tail chaining
| and late arrival.

Yes - by providing comparable performance for the situations that led
to these architecture-specific optimizations, but not by necessarily
copying an existing design.

Krste

| regards,

| Liviu

Liviu Ionescu

Mar 16, 2018, 4:46:44 AM

to Krste Asanovic, Alex Bradbury, RISC-V ISA Dev, tbn...@gmail.com, Rogier Brussee, Richard Herveille, watso...@gmail.com

On 16 March 2018 at 07:48:33, Krste Asanovic (kr...@berkeley.edu) wrote:

> at some point regardless of calling convention, you should
> schedule long-running compute to a background thread scheduled
> at a more appropriate time, both to reduce interrupt latency
> and to avoid wasting processor time shuffling registers in ISR
> routines.

yes, two-tiered interrupt processing is the ideal textbook solution,
but few RTOSes/applications do it.

how do you suggest notifying the background thread to wake up and pick
up the job from where the ISR left it?

regards,

Liviu

kr...@berkeley.edu

Mar 16, 2018, 4:50:44 AM

to Liviu Ionescu, Krste Asanovic, Alex Bradbury, RISC-V ISA Dev, tbn...@gmail.com, Rogier Brussee, Richard Herveille, watso...@gmail.com

If the background thread was sleeping on WFI, it will now wake up and
check for work when control returns from ISR. If it was not sleeping,
it'll find work on end of its queue.
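
A minimal sketch of this pattern in C11, assuming a single ISR producer and a single background consumer sharing a small ring buffer. All names (isr_enqueue, background_pass, process, QUEUE_SLOTS) are hypothetical, and wait_for_interrupt() compiles to nothing outside a RISC-V build so the logic can be exercised on a host.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_SLOTS 8u          /* hypothetical size, must be a power of two */

static uint32_t queue[QUEUE_SLOTS];
static atomic_uint head;        /* advanced only by the ISR */
static atomic_uint tail;        /* advanced only by the background thread */
static uint32_t processed_sum;  /* stand-in for the real long-running work */

/* Called from the ISR: record one unit of work; returns false if full. */
static bool isr_enqueue(uint32_t item)
{
    unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&tail, memory_order_acquire);
    if (h - t == QUEUE_SLOTS)
        return false;
    queue[h % QUEUE_SLOTS] = item;
    atomic_store_explicit(&head, h + 1u, memory_order_release);
    return true;
}

/* Called from the background thread: take one item if there is one. */
static bool background_dequeue(uint32_t *item)
{
    unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&head, memory_order_acquire);
    if (h == t)
        return false;
    *item = queue[t % QUEUE_SLOTS];
    atomic_store_explicit(&tail, t + 1u, memory_order_release);
    return true;
}

static void process(uint32_t item)
{
    processed_sum += item;
}

static void wait_for_interrupt(void)
{
#if defined(__riscv)
    __asm__ volatile ("wfi");
#endif
}

/* One pass of the background loop: drain the queue, sleep if it was empty.
   Returns the number of items handled. */
static unsigned background_pass(void)
{
    uint32_t item;
    unsigned n = 0;
    while (background_dequeue(&item)) {
        process(item);
        n++;
    }
    if (n == 0)
        wait_for_interrupt();
    return n;
}
```

A real loop would also need to close the window between the empty check and the WFI, for example by checking with interrupts globally disabled; that detail is left out of the sketch.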

Krste

kr...@berkeley.edu

Mar 16, 2018, 4:53:32 AM

to Liviu Ionescu, Albert Cahalan, RISC-V ISA Dev, tbn...@gmail.com, Rogier Brussee, richard....@roalogic.com, watso...@gmail.com, a...@asbradbury.org

It is common to throw hardware at a problem to save software design
effort.

Krste

| regards,

| Liviu


kr...@berkeley.edu

Mar 16, 2018, 5:01:07 AM

to Liviu Ionescu, Krste Asanovic, Alex Bradbury, RISC-V ISA Dev, tbn...@gmail.com, Rogier Brussee, Richard Herveille, watso...@gmail.com

I agree this model is convenient to use, but there is a cost for the
convenience, and some would prefer to have the option to choose faster
interrupts, simpler hardware, and faster regular code instead in some
cases.

Krste

| regards,

| Liviu

kr...@berkeley.edu

Mar 16, 2018, 5:03:45 AM

to Albert Cahalan, Liviu Ionescu, Rogier Brussee, RISC-V ISA Dev, tbn...@gmail.com, richard....@roalogic.com, watso...@gmail.com, a...@asbradbury.org

>>>>> On Fri, 16 Mar 2018 03:45:04 -0400, Albert Cahalan <acah...@gmail.com> said:
|| I still think we should design the basic RISC-V EABI with a set of 16
|| registers (for very small RV32E devices), and, then extend it to a

| Normal compilers hardly even use 16 when that is what is available.
| I think going past 16 registers was not good, but this is water under
| the bridge now. Dropping to 16 is a completely different
| architecture.

Yes, the E variant exists.

| There comes a time to push ahead with what you have, flaws and all.

More than 16 registers is noticeably superior for high-performance
code using floating-point or a vector unit, and will also be very
useful in smaller systems using the P extension out of the x
registers.

Krste

Liviu Ionescu

Mar 16, 2018, 5:38:32 AM

to kr...@berkeley.edu, a...@asbradbury.org, tbn...@gmail.com, Rogier Brussee, watso...@gmail.com, RISC-V ISA Dev, richard....@roalogic.com

On 16 March 2018 at 10:38:58, kr...@berkeley.edu (kr...@berkeley.edu) wrote:

> No, that is not the standard definition of interrupt latency. The
> measure you propose is only interesting for a particular use case
> where ISRs are compiled as standard C functions. I understand this
> use case exists, but it is not the only one.

agree, it is not the only one.

but in today's microcontroller world it is by far the most common one.

> ... but not by necessarily copying an existing design.

sure, we should struggle to do better than existing designs.

but when this is not achievable, we should do at least as good as
existing designs, not go one generation behind (yes, the current
M-only privileged profile with the current non-nested PLIC, is one
generation behind Cortex-M, both in terms of interrupt performance and
ease of use).

On 16 March 2018 at 10:50:43, kr...@berkeley.edu (kr...@berkeley.edu) wrote:

> If the background thread was sleeping on WFI, it will now wake
> up and
> check for work when control returns from ISR. If it was not sleeping,
> it'll find work on end of its queue.

right. so you enqueued an element to a queue.

well, in all RTOSes that I had to do, this is done with a plain C
function, complicated enough to be non-inlineable. and requiring some
kind of protection, a critical section implemented via modifying the
interrupt priority on single cores, or atomics on multi core.

you might ask all RTOS maintainers to modify their designs to run all
this code in an inlined handler, but I doubt this will happen any time
soon.

(the same as you might ask all debug and IDE maintainers to add
special processing and views for the many RISC-V CSRs, but I also
doubt this will happen any time soon)

On 16 March 2018 at 10:53:31, kr...@berkeley.edu (kr...@berkeley.edu) wrote:

> It is common to throw hardware at a problem to save software design
> effort.

hardware stacking/unstacking is one such case. a little bit of hardware
greatly helps the software.

On 16 March 2018 at 11:01:06, kr...@berkeley.edu (kr...@berkeley.edu) wrote:

> I agree this model is convenient to use, but there is a cost for the
> convenience, and some would prefer to have the option to choose faster
> interrupts, simpler hardware, and faster regular code instead in some
> cases.

that's why I suggested a separate microcontroller profile. one of the
main design requirements is ease of use via C/C++ friendliness.

if the privileged profile has other requirements, like minimising
hardware, fine, but don't enforce it on all, and allow for other
profiles to exist.

---

for those who did not take the time to read my proposal, I copy here
that this is actually the first step to be considered by the
foundation, to acknowledge that the world does not turn around Linux
and privileged profiles, and to make possible the existence of other
profiles too (first and foremost by not mandating the privileged
profile).

regards,

Liviu

Alex Bradbury

Mar 16, 2018, 6:50:59 AM

to Liviu Ionescu, Krste Asanovic, Torbjørn Viem Ness, Rogier Brussee, Watson Ladd, RISC-V ISA Dev, Richard Herveille

On 16 March 2018 at 09:38, Liviu Ionescu <i...@livius.net> wrote:
> for those who did not take the time to read my proposal, I copy here
> that this is actually the first step to be considered by the
> foundation, to acknowledge that the world does not turn around Linux
> and privileged profiles, and to make possible the existence of other
> profiles too (first and foremost by not mandating the privileged
> profile).

I thought the intent was that implementers could pick and choose which
standards to implement, as long as they are clear on what they _do_
conform to. e.g. an implementer might ship a device implementing the
standard RV32IMA instruction set (as described in
https://content.riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf),
plus their own non-standard privileged implementation, their own
non-standard debug, and other non-standard extensions. I understood
they would still be able to advertise as something like "Compliant
with the RV32IMA ISA described in version 2.2 of the RISC-V User-Level
ISA", despite not being compliant with other specifications published
by the RISC-V Foundation.

Perhaps someone could confirm whether I am correct in this?

Thanks,

Alex

Liviu Ionescu

Mar 16, 2018, 7:28:17 AM

to Alex Bradbury, RISC-V ISA Dev, Torbjørn Viem Ness, Krste Asanovic, Rogier Brussee, Watson Ladd, Richard Herveille

On 16 March 2018 at 12:50:58, Alex Bradbury (a...@asbradbury.org) wrote:

> I thought the intent was that implementers could pick and choose which
> standards to implement, as long as they are clear on what they _do_
> conform to. e.g. an implementer might ship a device implementing the
> standard RV32IMA instruction set (as described in
> https://content.riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf),
> plus their own non-standard privileged implementation, their own
> non-standard debug, and other non-standard extensions. I understood
> they would still be able to advertise as something like "Compliant
> with the RV32IMA ISA described in version 2.2 of the RISC-V User-Level
> ISA", despite not being compliant with other specifications published
> by the RISC-V Foundation.
>
> Perhaps someone could confirm whether I am correct in this?

ISA Volume I, v2.2, at page 3, states:

"The RISC-V manual is structured in two volumes. This volume covers
the user-level ISA design, including optional ISA extensions. The
second volume provides the privileged architecture."

I suggested to Krste that Volume II be made optional, but he insisted
that Volume II is always needed.

https://github.com/riscv/riscv-isa-manual/commit/a439dada57fe6c1ed426351742a5ba7dd2cace37#commitcomment-27447508

my microcontroller profile does not use 'mstatus' and can do very well
without any of the requirements of the privileged specs.

it is ironic that the RISC-V specs allow any weird custom
extensions, but do not allow designs to deviate from the privileged
specs.

de-entangling the privileged specs from the instruction set is the
main request addressed by my microcontroller profile proposal.

https://github.com/emb-riscv/specs-markdown/blob/master/improvements-upon-privileged.md#de-entangle-the-privileged-specs

even the name itself is biased towards applications running an
operating system, "User-level ISA" automatically implies a non-user
level ISA.

no, the first volume, the only one that should be mandatory for
compliance reasons (Allen?), should not be related to any user or
privileged mode, and should cover only the instruction set, as neutral
as possible.

I suggest the name: "The RISC-V Architecture Manual: Volume I: The
Instruction Set".

(if anyone from the Foundation board reads this, please consider this
name suggestion a formal proposal to be analysed by the board).

regards,

Liviu

Alex Bradbury

Mar 16, 2018, 8:06:34 AM

to Liviu Ionescu, RISC-V ISA Dev, Torbjørn Viem Ness, Krste Asanovic, Rogier Brussee, Watson Ladd, Richard Herveille

I think this is important, and actually would make the RISC-V
Foundation's job much easier. If compliance with the privileged spec
is required to badge something as 'RISC-V', this puts additional
pressure on defining something that can keep every potential user
happy. As you point out, the contrast with the flexibility available
to implementers for ISA extensions is jarring. You could produce a
core that implements RV32IM but deviates hugely from the standard
extensions in other ways, maybe providing a more application-tuned
alternative to the compressed extension, a novel take on atomics, and
posits from that IEEE floating point. Such a core could be marketed as
a RISC-V implementation with no problems, but you'd lose that ability
if you felt that the privileged spec didn't meet your requirements and
deviated in some areas?

It seems inconsistent to embrace and encourage innovation in the
user-level ISA, but discourage it for the privileged spec. I'd point
out that custom user-level extensions can have just as much impact on
compatibility with 'standard' RISC-V software stacks as deviations
from the privileged spec. e.g. if an extension introduces new
relocation types, or introduces extra user state which must be saved
when context switching.

For anyone who didn't click through and read Liviu's work at
https://github.com/emb-riscv/specs-markdown/blob/master/README.md I'd
strongly recommend doing so - I think Liviu perhaps undersold the
amount of content there in the opening to this thread.

Best,

Alex

Alex Bradbury

Mar 16, 2018, 8:21:06 AM

to Liviu Ionescu, RISC-V ISA Dev, Torbjørn Viem Ness, Krste Asanovic, Rogier Brussee, Watson Ladd, Richard Herveille

"posits rather than IEEE floating point".

Sorry, that sentence got mangled.

Guy Lemieux

Mar 16, 2018, 10:46:46 AM

to Krste Asanovic, RISC-V ISA Dev

Krste,

I'm glad to see a new task group being spun out. However, I believe
the focus should be broader than just fast interrupts. The
microcontroller environment needs to be better specified, as the
current Privileged ISA Specification is inappropriate for that domain.

My small RISC-V processor survey done for DAC2017 was able to capture
26+ RISC-V CPU designs, of which 35% were low-performance
microcontrollers and 54% were high-performance microcontrollers/embedded
CPUs. Only 3 of these designs were RV32E.

In this email thread, we have already seen two entirely different
proposals for a microcontroller environment. This will quickly
fracture even more as users are forced to "roll their own". The RISC-V
Foundation needs to address this immediately to protect the brand from
excessive variation at the microcontroller level, which arguably is
the most popular level and the one with the quickest impact.

Guy

Liviu Ionescu

Mar 16, 2018, 11:36:23 AM

to RISC-V ISA Dev, Guy Lemieux, Krste Asanovic

On 16 March 2018 at 16:46:46, Guy Lemieux (glem...@vectorblox.com) wrote:

> I'm glad to see a new task group being spun out. However, I believe
> the focus should be broader than just fast interrupts.

+1

> The
> microcontroller environment needs to be better specified, as the
> current Privileged ISA Specification is inappropriate for that domain.

Fully agree.

I mentioned this in many of my previous posts, sometimes with
arguments, but the mainly Linux audience was generally annoyed and
dismissed them.

> My small RISC-V processor survey done for DAC2017 was able to capture
> 26+ RISC-V CPU designs, of which 35% were low-performance
> microcontrollers and 54% were high-performance microcontrollers/embedded
> CPUs. Only 3 of these designs were RV32E.

In my proposal I identified three sub-classes of microcontrollers:

https://github.com/emb-riscv/specs-markdown/blob/master/introduction.md#sub-profiles

RV32E is tentatively used only in the S (small) sub-profile.

> In this email thread, we have already seen two entirely different
> proposals for a microcontroller environment. This will quickly
> fracture even more as users are forced to "roll their own". The RISC-V
> Foundation needs to address this immediately to protect the brand from
> excessive variation at the microcontroller level, which arguably is
> the most popular level and the one with the quickest impact.

... and received the least amount of attention... so far...

Thank you, Guy.

Liviu

Samuel Falvo II

Mar 16, 2018, 2:58:15 PM

to Liviu Ionescu, Krste Asanovic, Alex Bradbury, RISC-V ISA Dev, tbn...@gmail.com, Rogier Brussee, Richard Herveille, watso...@gmail.com

On Fri, Mar 16, 2018 at 1:46 AM, Liviu Ionescu <i...@livius.net> wrote:
> yes, two-tiered interrupt processing is the ideal textbook solution,
> but few RTOSes/applications do it.

Citation needed? This is such a common pattern that I find this
statement questionable. Perhaps it's rare for a certain profile of
embedded applications; however, in my embedded experience, I've *only*
used two-tiered interrupt processing systems.

> how do you suggest notifying the background thread to wake up and pick
> up the job from where the ISR left it?

AmigaOS and VMS both use "event flags" (AmigaOS calls them "signals",
so if I use this term, please understand that I'm not referring to
POSIX signals). (DISCLAIMER: Neither of these OSes are hard
real-time; however, they have nonetheless been used in hard real-time
projects.) L4 treats interrupts as messages delivered to message
queues; if a task waits on an interrupt queue, it'll block until the
interrupt fires. Then, part of the kernel's job is to reschedule any
waiting tasks. If no task is waiting, it sets an "interrupt has
happened" flag, so that next time a task tries to wait, it just
doesn't bother. I could be wrong, but if memory serves me right, QNX
maps interrupts to messages sent to message queues as well.
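
As a minimal sketch of the "interrupt has happened" flag idea, assuming a 32-bit set of event flags shared between ISRs and tasks: the ISR sets a bit, and a task that later asks for it finds it already set and does not need to block. The names signal_from_isr and poll_flags are hypothetical and do not correspond to the AmigaOS, VMS or L4 APIs; a real kernel would block and reschedule where this sketch simply returns 0.

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t pending_flags;

/* Called from an ISR: remember that the events in mask have happened. */
static void signal_from_isr(uint32_t mask)
{
    atomic_fetch_or_explicit(&pending_flags, mask, memory_order_release);
}

/* Called from a task: atomically collect and clear any of the wanted flags.
   Returns the subset of wanted that was set (0 means the task would block). */
static uint32_t poll_flags(uint32_t wanted)
{
    uint32_t before = atomic_fetch_and_explicit(&pending_flags, ~wanted,
                                                memory_order_acquire);
    return before & wanted;
}
```

Because the flag is sticky, an interrupt that fires before the task starts waiting is not lost; it is only cleared when the task consumes it.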

I'm thinking that Liviu's concern will manifest most when projects
cannot drive the processor clock at a faster rate to compensate for a
scheduler's overhead. This could be due to battery life concerns,
etc. However, I've personally never worked on an embedded project
where this was an issue. What kinds of projects would cause this to
become important?

--
Samuel A. Falvo II

Liviu Ionescu

Mar 16, 2018, 4:57:59 PM

to Samuel Falvo II, Alex Bradbury, tbn...@gmail.com, Rogier Brussee, watso...@gmail.com, RISC-V ISA Dev, Richard Herveille, Krste Asanovic

On 16 March 2018 at 20:58:13, Samuel Falvo II (sam....@gmail.com) wrote:

> On Fri, Mar 16, 2018 at 1:46 AM, Liviu Ionescu wrote:
> > yes, two-tiered interrupt processing is the ideal textbook solution,
> > but few RTOSes/applications do it.
>
> Citation needed?

sure.

the venerable eCos calls them ISRs and DSRs; the µC/OS-III calls them
direct and deferred interrupts; FreeRTOS has an optional Deferred
Interrupt Handling.

> This is such a common pattern that I find this
> statement questionable.

I'm not sure we are talking about the same thing.

the traditional interrupt processing use case is to register an ISR,
which does all the operations required by the peripheral, then
notifies an associated user thread (via a semaphore, queue, etc), to
continue processing.

the two-tiered interrupt processing, also called deferred interrupt
processing, uses **two** routines, one direct and one deferred.

the direct routine (sometimes shared by multiple peripherals)
practically does the absolute minimum, possibly does not even touch
the peripheral, it only identifies the peripheral, arranges for the
associated DSR to be executed and returns as soon as possible.

immediately after the interrupt returns, the system executes the DSR,
usually on the context of a special system thread, running with a
priority higher than all application threads.

the DSR accesses the peripheral and does all first hand related
processing, then notifies the user thread, similarly as in the
traditional use case.

the big difference is that the DSR does not run on an interrupt
context, but on a regular thread context. in other words, the core is
free to accept further interrupts.
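
a minimal sketch of the split described above, assuming two peripherals and a bitmask of pending DSRs. all names (direct_isr, run_pending_dsrs, uart_dsr, timer_dsr) are hypothetical and not taken from eCos or any other RTOS; the DSR bodies only count their runs where real ones would service the peripheral and notify the user thread.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NUM_SOURCES 2u

static unsigned uart_dsr_runs;
static unsigned timer_dsr_runs;

/* Deferred routines: run in thread context, with interrupts enabled. */
static void uart_dsr(void)  { uart_dsr_runs++; }
static void timer_dsr(void) { timer_dsr_runs++; }

static void (*const dsr_table[NUM_SOURCES])(void) = { uart_dsr, timer_dsr };
static _Atomic uint32_t dsr_pending;

/* Direct routine: runs in interrupt context and does the absolute minimum,
   it only marks the DSR of the interrupting source as pending. */
static void direct_isr(unsigned source)
{
    atomic_fetch_or_explicit(&dsr_pending, UINT32_C(1) << source,
                             memory_order_release);
}

/* Body of the high-priority system thread: run every pending DSR once.
   Returns how many DSRs were executed. */
static unsigned run_pending_dsrs(void)
{
    uint32_t pending = atomic_exchange_explicit(&dsr_pending, 0,
                                                memory_order_acquire);
    unsigned ran = 0;
    for (unsigned s = 0; s < NUM_SOURCES; s++) {
        if (pending & (UINT32_C(1) << s)) {
            dsr_table[s]();
            ran++;
        }
    }
    return ran;
}
```

note that in this sketch two interrupts from the same source that arrive before the DSR runs are coalesced into a single DSR invocation.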

this mechanism was probably fashionable with some legacy architectures
(MIPS?), that used a single interrupt handler, and executed the
interrupt handlers with interrupts disabled (I know it since the 90s).

modern architectures went for nested interrupts and running the
handlers with the interrupts enabled, so the need for innovative tricks
to return from the handler as soon as possible was no longer so
paramount.

when it arrived in 2004, Cortex-M had already adopted this modern mechanism
and eclipsed most other microcontrollers in terms of ease of use (since
then I don't remember hearing about the need for deferred interrupts).

imagine my surprise in 2017 to see that RISC-V opted for a single trap
handler and running with interrupts disabled... :-(

> Perhaps it's rare for a certain profile of
> embedded applications; however, in my embedded experience, I've *only*
> used two-tiered interrupt processing systems.

if you really used two-tiered interrupts, congratulations!

> I'm thinking that Liviu's concern will manifest most when projects
> cannot drive the processor clock at a faster rate to compensate for a
> scheduler's overhead. This could be due to battery life concerns,
> etc. However, I've personally never worked on an embedded project
> where this was an issue. What kinds of projects would cause this to
> become important?

I'm not sure I understand the question, but I had a project where the
top priority event was hooked to NMI.

regards,

Liviu

Alex Elsayed

Mar 16, 2018, 5:01:04 PM

to RISC-V ISA Dev

On Mar 16, 2018 13:57, "Liviu Ionescu" <i...@livius.net> wrote:

> On 16 March 2018 at 20:58:13, Samuel Falvo II (sam....@gmail.com) wrote:
>
> > On Fri, Mar 16, 2018 at 1:46 AM, Liviu Ionescu wrote:
> > > yes, two-tiered interrupt processing is the ideal textbook solution,
> > > but few RTOSes/applications do it.
> >
> > Citation needed?
>
> sure.
>
> the venerable eCos calls them ISRs and DSRs; the µC/OS-III calls them
> direct and deferred interrupts; FreeRTOS has an optional Deferred
> Interrupt Handling.

I believe you misunderstood the question - I'm quite sure Samuel wanted you to justify your claim that they are _rare_. He's quite aware of what they are, as can be seen in the rest of his mail.

Samuel Falvo II

Mar 16, 2018, 5:14:53 PM

to Alex Elsayed, RISC-V ISA Dev

On Fri, Mar 16, 2018 at 2:01 PM, Alex Elsayed <etern...@gmail.com> wrote:
> I believe you misunderstood the question - I'm quite sure Samuel wanted you
> to justify your claim that they are _rare_. He's quite aware of what they
> are, as can be seen in the rest of his mail.

Correct; that these alternative interrupt handling techniques are
supported by a given RTOS is not, prima facie, evidence that they are
widely used; only that they have proven valuable to "enough" clients
of the RTOSes to be a useful addition to their API.

I'm not intending to say direct interrupt handling is a bad design
choice, or that it is somehow "better" to choose two-level interrupt
structures. Rather, I just want to know when one would want to choose
to go that route, and see if there are general patterns between
projects that influence that decision accordingly. I think it's a
valuable data-point that would contribute to the discussion, and it's
not something I've seen mentioned before.

Hope that clarifies my position.

Liviu Ionescu

Mar 16, 2018, 5:17:09 PM

to RISC-V ISA Dev, Alex Elsayed

On 16 March 2018 at 23:01:04, Alex Elsayed (etern...@gmail.com) wrote:

> I believe you misunderstood the question - I'm quite sure Samuel wanted you
> to justify your claim that they are _rare_.

ah, sorry for the misunderstanding.

maybe there are other RTOSes that, for historical reasons
(compatibility with legacy architectures), still implement deferred
interrupts, but I don't remember seeing this feature in the young
RTOSes, that came to market in the Cortex-M age, and, for RTOSes that
implement it, I don't remember seeing Cortex-M projects using it.

I accept that I may be wrong, and on other architectures this feature
might be still in use, but in my world these other architectures
became irrelevant.

regards,

Liviu

Liviu Ionescu

Mar 16, 2018, 5:30:36 PM

to Alex Elsayed, Samuel Falvo II, RISC-V ISA Dev

On 16 March 2018 at 23:14:53, Samuel Falvo II (sam....@gmail.com) wrote:

> ... only that they have proven valuable to "enough" clients
> of the RTOSes to be a useful addition to their API.

yes, having "enough" client requests is a good reason for extending an
API, but my personal guess is that these features were added when the
RTOSes were ported on architectures with non-preemptive interrupts.

this might also be one reason why deferred interrupts are not present
in recent RTOSes, because their designers focused on Cortex-M and
decided to ignore legacy architectures.

regards,

Liviu

kr...@berkeley.edu

Mar 16, 2018, 8:16:24 PM

to Alex Bradbury, Liviu Ionescu, RISC-V ISA Dev, Torbjørn Viem Ness, Krste Asanovic, Rogier Brussee, Watson Ladd, Richard Herveille

I want to clear up the misconception that we don't encourage
experimentation or standardization of alternative privileged
architectures.

Part of the docs clearly states that people can completely replace the
privileged architecture, and in fact, one of our goals in writing
specs this way was to enable experimentation with new OS models.
While I agree microcontrollers are a big near-term market for RISC-V,
RISC-V Unix cores are also a large upcoming market that has a vastly
larger software base that needs time to port and mature, and so a lot
of effort has gone into stable standards for conventional OS ports.

I agree we also need a standard "rich" microcontroller profile and
that this should support C ISRs and preemption/nesting efficiently, as
one of several use cases. I disagree that hardware stacking is the
only solution for this, and prefer more flexible primitives to achieve
the same goal while meeting other goals. I agree modifications to the
ABI can also help when interrupt latency is more important than
straight line performance.

In my github thread response to Liviu, what I am trying to get across
is that for standard profiles, you want to minimize changes to
existing mstatus/privileged, not that they can never be extended. The
new task group is looking at extending interrupt behavior, but with a
view to maintaining backwards compatibility and to support dual-use
cores that run either real-time or virtual-memory code.

Krste

Jacob Bachmeyer

Mar 16, 2018, 9:34:32 PM

to Guy Lemieux, Krste Asanovic, RISC-V ISA Dev

Guy Lemieux wrote:
> In this email thread, we have already seen two entirely different
> proposals for a microcontroller environment.

Part of the reason that I wrote a separate proposal is an expectation
that the two (or more?) microcontroller proposals will "cross-pollinate"
to some extent and yield a better proposal than either Liviu Ionescu or
myself could have produced alone.

-- Jacob

Liviu Ionescu

Mar 17, 2018, 3:45:01 AM

to Guy Lemieux, jcb6...@gmail.com, RISC-V ISA Dev

On 17 March 2018 at 03:34:32, Jacob Bachmeyer (jcb6...@gmail.com) wrote:

> ... an expectation
> that the two (or more?) microcontroller proposals will "cross-pollinate"
> to some extent and yield a better proposal than either Liviu Ionescu or
> myself could have produced alone.

which, I think, is a reasonable expectation.

I already included some of Jacob's comments in my pages, and I'm open
to any comments and new proposals.

I think that the goal of designing a performant and easy to use RISC-V
microcontroller profile is too important to stumble on personal
preferences.

any proposals that come with use cases that show they are simpler to
use, and performance estimates that show better latency or better
performance in general, are welcome.

regards,

Liviu

Guy Lemieux

Mar 17, 2018, 4:05:32 AM

to Liviu Ionescu, RISC-V ISA Dev, jcb6...@gmail.com

On Sat, Mar 17, 2018 at 12:44 AM Liviu Ionescu <i...@livius.net> wrote:

> any proposals that come with use cases that show they are simpler to
> use, and performance estimates that show better latency or better
> performance in general, are welcomed.

I'd like to add that cost of implementation must also be considered, particularly for fpga targets.

hence, I would support a modular approach, much like how the whole ISA is designed, where there is a minimal core and then optional extensions.

for example, adding counters and exception support adds considerable area to FPGA implementations. adding more CSRs makes it worse. yet, the reduced register file size of RV32E produces no savings on FPGAs.

Guy

Liviu Ionescu

Mar 17, 2018, 12:31:56 PM

to Guy Lemieux, RISC-V ISA Dev, jcb6...@gmail.com

On 17 March 2018 at 10:05:31, Guy Lemieux (glem...@vectorblox.com) wrote:

> On Sat, Mar 17, 2018 at 12:44 AM Liviu Ionescu wrote:
>
> > any proposals that come with use cases that show they are simpler to
> > use, and performance estimates that show better latency or better
> > performance in general, are welcomed.
>
>
> I'd like to add that cost of implementation must also be considered,
> particularly for fpga targets.

yes, sure, the cost is important, and if any of my proposals prove
unreasonably expensive to implement, we'll redesign them.

> hence, I would support a modular approach, much like how the whole ISA is
> designed, where there is a minimal core and then optional extensions.

yes, if possible, this would be a reasonable and consistent approach.

> for example, adding counters and exception support adds considerable area
> to FPGA implementations.

if I remember right, there are two counters in my proposal, the system
counter and the rtc counter, both having the same behaviour as the
mtimer in the privileged specs. are those counters too heavy for you?

also, since you obviously have experience with those things, can you
estimate how much extra space the logic to automatically
stack/unstack registers would require?

> adding more CSRs makes it worse.

in my proposal I reduced the CSRs to a minimum of 10. maybe not all
are needed as CSRs, we can reconsider some of them.

> yet, the reduced
> register file size of RV32E produces no savings on FPGAs.

you mean the reduction is not significant compared to the total
size? since, in my naive understanding of these things, 16 registers
of 32-bits should save some space anyway, like at least some 512
simple cells (again, I'm totally out-of-date with new FPGAs, I only
used some old Xilinx chips more than ten years ago, so I might be
terribly wrong).

anyway, my proposal includes three sub-profiles:

https://github.com/emb-riscv/specs-markdown/blob/master/introduction.md

for highly optimised softcores I would prioritise footprint
optimisations only on the 'small' sub-profile (similarly to ARM
Cortex-M1, which is optimised for synthesised cores).

for the medium sub-profile, although we obviously should not waste
unnecessary resources, I would accept to trade a few resources for
some more gains in terms of 'ease of use'.

ARM had a similar approach with Cortex-M0/M0+/M1 vs Cortex-M3/M4/M7, I
think this two-fold approach is a good starting point, and, given the
RISC-V modularity, we should try to do it even better.

regards,

Liviu

Liviu Ionescu

Mar 17, 2018, 2:27:55 PM

to kr...@berkeley.edu, RISC-V ISA Dev

On 17 March 2018 at 02:16:23, kr...@berkeley.edu (kr...@berkeley.edu) wrote:

> I want to clear up the misconception that we don't encourage
> experimentation or standardization of alternative privileged
> architectures.

That's actually the point, you encourage alternate 'privileged'
architectures and completely prevent experimentation or
standardisation on non-privileged architectures.

> Part of the docs clearly states that people can completely replace the
> privileged architecture, and in fact, one of our goals in writing
> specs this way was to enable experimentation with new OS models.

Based on the prerequisites in my proposal, microcontrollers are not
expected to not run any 'new OS models', they are not expected to run
any true OS at all, they should run bare-metal or at most they should
run a multi-threaded scheduler.

https://github.com/emb-riscv/specs-markdown/blob/master/introduction.md#limitations

> While I agree microcontrollers are a big near-term market for RISC-V,
> RISC-V Unix cores are also a large upcoming market that has a vastly
> larger software base that needs time to port and mature, and so a lot
> of effort has gone into stable standards for conventional OS ports.

I think that everyone on this list will agree that the results are
outstanding and these efforts should continue.

However, those with a minimum of experience with microcontrollers will
agree with Guy's statement:

"The microcontroller environment needs to be better specified, as the
current Privileged ISA Specification is inappropriate for that domain."

As someone who actually put a lot of effort into writing software
components for the HiFive1 and S31Arty/S51Arty devices (see the
Eclipse RISC-V project template), I would say that the current
privileged ISA specifications are totally inappropriate for wide usage
in microcontroller applications.

> I agree we also need a standard "rich" microcontroller profile and
> that this should support C ISRs and preemption/nesting efficiently, as
> one of several use cases.

Ok, great, we'll further explore this line.

> I disagree that hardware stacking is the
> only solution for this, and prefer more flexible primitives to achieve
> the same goal while meeting other goals.

It is not the only solution, we also have a proposal to use shadow registers.

If your team comes up with a feasible solution to use a special compiler
semantic for interrupt routines, which saves only used registers as
long as everything is inlined, I see no problem to support this
mechanism too in the microcontroller profile.

In practical terms, I would add a bit in the interrupt status
registers (per each interrupt source), and allow the user to select
this alternate mode for really fast interrupts, that have very simple
handlers.
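
A rough sketch of what such a per-source mode bit could look like from software is shown below. Everything in it is an assumption made for illustration: the register block layout, the name `IRQ_STATUS_FAST` and the bit position are hypothetical, since the thread does not assign any of them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit position; nothing in the proposal fixes it. */
#define IRQ_STATUS_FAST (UINT32_C(1) << 31)

/* Hypothetical per-interrupt register block (memory mapped on a real
 * device; an ordinary struct here so the sketch runs on a host). */
typedef struct {
    volatile uint32_t prio;    /* assumed priority field */
    volatile uint32_t status;  /* per-source status word carrying the mode bit */
} irq_regs_t;

/* Select the alternate "fast" mode (handler saves only what it uses)
 * or the default C-handler mode for one interrupt source. */
static inline void irq_set_fast(irq_regs_t *irq, bool fast)
{
    if (fast)
        irq->status |= IRQ_STATUS_FAST;
    else
        irq->status &= ~IRQ_STATUS_FAST;
}

static inline bool irq_is_fast(const irq_regs_t *irq)
{
    return (irq->status & IRQ_STATUS_FAST) != 0;
}
```

Keeping the bit per source means a single application could mix a few hand-tuned fast handlers with ordinary C handlers for everything else.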

Personally I doubt that many applications will be able to inline
everything in the handler, but if someone takes the time to do it, I
think he/she deserves a prize, and this prize can be to have this
feature available, especially since it should be no problem to
implement it.

However, the default interrupt behaviour should use C handlers,
regardless of how we decide to implement this.

> I agree modifications to the
> ABI can also help when interrupt latency is more important than
> straight line performance.

Ok, great.

We'll work on a proposal for a RISC-V EABI. Actually two, one for
RV32E and the other for RV32I/RV64I.

> The
> new task group is looking at extending interrupt behavior, but with a
> view to maintaining backwards compatibility and to support dual-use
> cores that run either real-time or virtual-memory code.

Sure, feel free to extend the privileged profile in Volume II as you
think appropriate, but please do not enforce it on the whole RISC-V
community.

The microcontroller profile simply cannot be a subset of the privileged profile.

Ideally the Foundation should acknowledge this, and possibly support
the design of a new microcontroller profile, complementary to the
privileged profile.

As the first step, I repeat the proposal sent a few messages ago:

---
The first volume, the only one that should be mandatory for compliance
reasons, should not be related to any user or privileged mode, and
should cover only the instruction set, as neutral as possible.

I suggest the name: "The RISC-V Architecture Manual: Volume I: The
Instruction Set".

---

If the Foundation decides to not acknowledge that the microcontroller
profile should be a separate profile, it is fine too, we can always
make it a community effort, as I already started with my proposal, and
hopefully get some support from companies interested in creating
microcontrollers based on RISC-V cores.

But this would not be a fortunate situation...

Regards,

Liviu

Liviu Ionescu

Mar 17, 2018, 2:31:01 PM

to kr...@berkeley.edu, RISC-V ISA Dev

On 17 March 2018 at 20:27:52, Liviu Ionescu (i...@livius.net) wrote:

> microcontrollers are not expected to not run any 'new OS models'

microcontrollers are not expected to run any 'new OS models'

Michael Clark

Mar 17, 2018, 3:25:21 PM

to Liviu Ionescu, Krste Asanovic, RISC-V ISA Dev

> On 17/03/2018, at 11:27 AM, Liviu Ionescu <i...@livius.net> wrote:
>
> On 17 March 2018 at 02:16:23, kr...@berkeley.edu (kr...@berkeley.edu) wrote:
>
>> I want to clear up the misconception that we don't encourage
>> experimentation or standardization of alternative privileged
>> architectures.
>
> That's actually the point, you encourage alternate 'privileged'
> architectures and completely prevent experimentation or
> standardisation on non-privileged architectures.

I think that there is a general expectation that a considered design would minimize unnecessary deltas from the current privileged architecture, versus making radical changes by defining a completely new privileged architecture. There is a difference between privileged architecture experimentation and a wholesale replacement of the existing privileged architecture.

There are some valuable insights from your proposal with respect to lightweight recursive interrupts, however there are also many unnecessary changes where existing lightweight mechanisms are already provided for by the current privileged architecture. These unnecessary changes will only lead to fragmentation and unnecessary complexity for common code. Well, in fact, common code is somewhat thrown out the window with your proposal.

There are some very clear misconceptions in your section on “RISC-V compatibility CSRs”. The specification is quite clear regarding unsupported CSRs (which can be hardwired to zero). Defining alternate mechanisms for well-defined existing mechanisms is what I would call pointless complexity, unless there is a genuine advantage from the alternate approach.

The crux of the issue is mstatus ie and pie bits, mip and mie CSRs, and defining a new mechanism for lightweight recursive interrupts. The other changes are unnecessary. I think baking the stack into the ISA with respect to interrupt handling is a big mistake. You’re adding unnecessary load/store latency where an alternative considered approach would allow an ISR to recurse while only performing register operations. The RISC-V architecture is carefully designed such that one instruction results in one unit of work. i.e. an instruction is a micro-op. Your proposal in its current form is very CISCy. I suspect you might find it challenging to get your proposal adopted as it stands currently.

This is not to say that an alternative experimental lightweight recursive interrupt system, as a considered delta to the current privileged ISA, wouldn’t achieve your high-level objective, and that is compiler annotated irq handlers with support for recursion. i.e. minimize assembly, minimize interrupt latency, and support recursive interrupts.

A better approach may be a requirements or use-case driven approach, to help instruction set architects make a considered delta based on your feedback. e.g.

- no assembly needed to define IRQs. i.e. compiler attributes
- handle prioritized recursive interrupts
- reduce interrupt latency (save/restore overhead)
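
As a sketch of the first requirement only (no assembly, just a compiler attribute), the block below marks a handler with GCC's `interrupt` function attribute, which later GCC releases accept for RISC-V targets. The macro `IRQ_HANDLER`, the handler name and the counter are hypothetical; on non-RISC-V hosts the macro expands to nothing so the code still compiles as an ordinary function. It does not address prioritisation or nesting.

```c
#include <stdint.h>

/* On RISC-V, let the compiler generate the handler prologue/epilogue
 * (saving the registers it uses and returning with mret).
 * On other targets this is an ordinary function, for host testing. */
#if defined(__riscv)
#define IRQ_HANDLER __attribute__((interrupt))
#else
#define IRQ_HANDLER
#endif

/* Hypothetical state updated by the handler. */
static volatile uint32_t tick_count;

/* Hypothetical timer handler: plain C, no assembly wrapper. */
IRQ_HANDLER void timer_irq_handler(void)
{
    tick_count = tick_count + 1;
}
```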

Adding a complex interrupt state machine for save/restore that bakes a stack into an architecture when the stack is for the most part an ABI convention is perhaps an antithesis to an improvement to interrupt handling. The lowest possible latency approach will perform register swaps, and may use register aliasing or other technique.

My personal opinion is that while Register Windows have proven to be less than optimal as a mechanism for procedure calls, because compilers handle register allocation better, they may very well have some specific use cases with respect to changes in privilege modes or interrupt recursion level. This however would need some research and experimentation…

Liviu, my advice, build a simulator and modify the tools (binutils/gcc) and then you’ll see that some of the unnecessary changes will only hurt (you, if you are the one modifying binutils/gcc, the simulators and RTL, etc)

Liviu Ionescu

Mar 17, 2018, 5:02:16 PM

to Michael Clark, Krste Asanovic, RISC-V ISA Dev

On 17 March 2018 at 21:25:21, Michael Clark (michae...@mac.com) wrote:

> ... The RISC-V architecture
> is carefully designed such that one instruction results in one unit of work. i.e. an instruction
> is a micro-op. Your proposal in its current form is very CISCy.

you can name it as you like, but the current proposal does not include
any change to the instruction set, and no such changes are planned,
nor encouraged.

the microcontroller profile will be compliant with the current specs
in Volume I, except:
- the list of CSRs defined in Table 19.3 will be shortened, by
removing the cycle, time and instret registers (or making them return 0)
- the related instructions rdcycle, rdtime and rdinstret defined in
Chapter 2.8 will either be removed, or they will return 0
- and the ABI, which will be replaced with a lighter EABI, when available

otherwise it'll use exactly the current CSR instructions, but for a
limited set of registers. the rest of the system registers will be
memory mapped.

the microcontroller profile will be as RISCy as the current privileged
profile, which does exactly the same: some (many) registers are CSRs,
some are memory mapped (mtime, mtimecmp, plic, etc).
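
to illustrate what "memory mapped" means for software, here is a minimal sketch of reading a 64-bit memory-mapped timer on a 32-bit core, using the usual high/low/high retry loop. the function name `read_mtime32` is hypothetical, and the register addresses are platform-specific, so they are passed in as pointers instead of being hard-coded.

```c
#include <stdint.h>

/* Read a 64-bit counter exposed as two 32-bit memory-mapped words.
 * The high word is sampled before and after the low word; if it
 * changed, the low word rolled over in between and the read is retried. */
uint64_t read_mtime32(const volatile uint32_t *lo, const volatile uint32_t *hi)
{
    uint32_t h, l;
    do {
        h = *hi;
        l = *lo;
    } while (*hi != h);
    return ((uint64_t)h << 32) | l;
}
```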

> Liviu, my advice, build a simulator

yes, it is part of the plan to update QEMU to support the
microcontroller profile, and have a development platform to port the
software to it (architecture packages, device packages, RTOS, etc).

> and modify the tools (binutils/gcc) and then you’ll
> see that some of the unnecessary changes will only hurt

in the first phase it is not necessary to modify any tools, we can use
the current ABI and the current tools, just that the interrupt latency
will probably be at least twice the latency provided by Cortex-M
devices, due to the large ABI caller register set that needs to be
saved when entering/exiting interrupts.
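
the "at least twice" estimate can be checked against register counts alone. the sketch below lists the caller-saved integer registers of the standard RISC-V calling convention next to the eight words a Cortex-M core stacks in hardware on exception entry (without FP). the helper names are hypothetical, and the comparison counts words to save, not cycles; it also ignores any extra state (such as mepc/mstatus) a nested RISC-V handler may need to preserve.

```c
#include <stddef.h>

/* Caller-saved integer registers in the standard RISC-V ABI. */
static const char *const rv_caller_saved[] = {
    "ra",
    "t0", "t1", "t2",
    "a0", "a1", "a2", "a3", "a4", "a5", "a6", "a7",
    "t3", "t4", "t5", "t6",
};

/* Words pushed by hardware on Cortex-M exception entry (no FP context). */
static const char *const cm_stacked[] = {
    "r0", "r1", "r2", "r3", "r12", "lr", "pc", "xpsr",
};

#define COUNT_OF(a) (sizeof(a) / sizeof((a)[0]))

/* Number of words a C-callable handler must preserve on each side. */
size_t rv_words(void) { return COUNT_OF(rv_caller_saved); }
size_t cm_words(void) { return COUNT_OF(cm_stacked); }
```

with 16 words against 8, the ratio is 2 before any other overhead is counted.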

once the new EABI is clearly defined, yes, the tools will need an
update, but this is hardly a surprise for anyone.

regards,

Liviu

kr...@berkeley.edu

Mar 17, 2018, 5:42:42 PM

to Liviu Ionescu, kr...@berkeley.edu, RISC-V ISA Dev

In RISC-V ISA, we use "privileged" to signify features that are
expected to be protected from some execution environments. The user
manual name could perhaps be changed to the "unprivileged" manual to
indicate these are instructions that are expected to be made available
in most execution environments. Of course, there is no such thing as
a simple and precise definition, so these are necessarily a bit fuzzy.

Modern microcontrollers often include some notion of privilege level
or security compartment, which limit access to the whole machine.
While not running anything as heavyweight as Unix, there are
microcontroller runtimes that proclaim themselves an OS. Any
microcontroller connected to the internet, or any microcontroller
holding secrets, will probably have some notion of "privileged"
architecture. Simple bare-metal microcontrollers are also common, and
while we could create a separate fork to only consider their needs, we
want to minimize incompatibility in modules that can be shared across
these use cases.

So, "privileged" does not mean "Unix". We agree that the current
privileged architecture can be improved for some microcontroller use
cases (well, everything can always be improved). The fast interrupts
group is one activity here.

Krste

Jacob Bachmeyer

Mar 17, 2018, 6:19:27 PM

to Liviu Ionescu, kr...@berkeley.edu, RISC-V ISA Dev

Liviu Ionescu wrote:
> On 17 March 2018 at 02:16:23, kr...@berkeley.edu (kr...@berkeley.edu) wrote:
>
>
>> I disagree that hardware stacking is the
>> only solution for this, and prefer more flexible primitives to achieve
>> the same goal while meeting other goals.
>>
>
> It is not the only solution, we also have a proposal to use shadow registers.
>

My proposal only requires two sets of shadow registers (main registers
for thread context; shadow odd and shadow even for nested trap
contexts). Hardware may implement any number of shadow sets, but must
spill the oldest shadow set to the trap stack when taking a nested trap
with all-but-one shadow sets used. The stack pointer is a special case:
there are exactly two of them, one for thread context and one for trap
context. The trap stack pointer is initialized from the trap stack base
when a trap is taken from thread mode. (The processor starts in thread
mode with all interrupt channels masked and no interrupt sources mapped.)

> If your team comes with a feasible solution to use a special compiler
> semantic for interrupt routines, which saves only used registers as
> long as everything is inlined, I see no problem to support this
> mechanism too in the microcontroller profile.
>
> In practical terms, I would add a bit in the interrupt status
> registers (per each interrupt source), and allow the user to select
> this alternate mode for really fast interrupts, that have very simple
> handlers.
>

Easiest way to do that is to add another CSR.

> Personally I doubt that many applications will be able to inline
> everything in the handler, but if someone takes the time to do it, I
> think he/she deserves a prize, and this prize can be to have this
> feature available, especially since it should be no problem to
> implement it.
>
> However, the default interrupt behaviour should use C handlers,
> regardless how we decide to implement this.
>

What do you think of my proposal to use trap a0 as epc?

>> I agree modifications to the
>> ABI can also help when interrupt latency is more important than
>> straight line performance.
>>
>
> Ok, great.
>
> We'll work on a proposal for a RISC-V EABI. Actually two, one for
> RV32E and the other for RV32I/RV64I.
>

For the eABI that I propose, RVE base simply omits the high 16 integer
and all FP registers. All of the omitted integer registers are
callee-saved in my proposal.

>> The
>> new task group is looking at extending interrupt behavior, but
>> with a
>> view to maintaining backwards compatibility and to support
>> dual-use
>> cores that run either real-time or virtual-memory code.
>>
>
> Sure, feel free to extend the privileged profile in Volume II as you
> think appropriate, but please do not enforce it on the whole RISC-V
> community.
>
> The microcontroller profile simply cannot be a subset of the privileged profile.
>
> Ideally the Foundation should acknowledge this, and possibly support
> the design of a new microcontroller profile, complementary to the
> privileged profile.
>

This is a major reason that I believe the microcontroller environment
ISA should eschew privilege levels entirely. That forces fundamental
changes that clearly require the microcontroller environment ISA to be
very (but not gratuitously) different from the POSIX environment ISA.
Very advanced microcontrollers could even use the standard PLIC on one
or more of their local interrupt inputs.

> As the first step, I repeat the proposal sent a few messages ago:
>
> ---
> The first volume, the only one that should be mandatory for compliance
> reasons, should not be related to any user or privileged mode, and
> should cover only the instruction set, as neutral as possible.
>
> I suggest the name: "The RISC-V Architecture Manual: Volume I: The
> Instruction Set".
> ---
>

I suggest adding "Application", producing "The RISC-V Instruction Set
Manual: Volume I: Application ISA". The current privileged ISA would
then become "The RISC-V Instruction Set Manual: Volume II-1: POSIX
Environment ISA" as it is intended to support POSIX operating systems.

> If the Foundation decides to not acknowledge that the microcontroller
> profile should be a separate profile, it is fine too, we can always
> make it a community effort, as I already started with my proposal, and
> hopefully get some support from companies interested in creating
> microcontrollers based on RISC-V cores.
>

The Foundation would not be entirely out of line to wait until a
community effort produces an implementable and useful profile before
beginning a standardization process if the Foundation desires to focus
on POSIX-capable systems at this time. (Which are probably much more
interesting to academics than microcontrollers.)

> But this would not be a fortunate situation...
>

As long as the "fork" is more of a "development branch" with a goal of
being merged someday, I see nothing wrong.

-- Jacob

Jacob Bachmeyer

Mar 17, 2018, 6:23:41 PM

to Guy Lemieux, Liviu Ionescu, RISC-V ISA Dev

Guy Lemieux wrote:
> On Sat, Mar 17, 2018 at 12:44 AM Liviu Ionescu <i...@livius.net> wrote:
>
> > any proposals that come with use cases that show they are simpler to
> > use, and performance estimates that show better latency or better
> > performance in general, are welcomed.
>
>
> I'd like to add that cost of implementation must also be considered,
> particularly for fpga targets.

This is one of the reasons that my proposal defines a microcontroller as
having a simple, synchronous memory interface, after someone complained
on hw-dev about TileLink taking up more LUTs than the actual Rocket core.

> hence, I would support a modular approach, much like how the whole ISA
> is designed, where there is a minimal core and then optional extensions.
>
> for example, adding counters and exception support adds considerable
> area to FPGA implementations. adding more CSRs makes it worse. yet,
> the reduced register file size of RV32E produces no savings on FPGAs.

Can you compare the two current proposals on this? Are shadow register
sets effectively "free" on FPGA implementations, presumably up to some
limit? Where is that limit?

-- Jacob

Krste Asanovic

Mar 17, 2018, 6:25:27 PM

to jcb6...@gmail.com, Guy Lemieux, Liviu Ionescu, RISC-V ISA Dev

Type of memory interface does not belong in the microcontroller profile, wrong level of detail.

Krste


Liviu Ionescu

Mar 17, 2018, 6:43:27 PM

to jcb6...@gmail.com, RISC-V ISA Dev

On 18 March 2018 at 00:19:26, Jacob Bachmeyer (jcb6...@gmail.com) wrote:

> My proposal only requires two sets of shadow registers (main registers
> for thread context; shadow odd and shadow even for nested trap
> contexts). Hardware may implement any number of shadow sets, but must
> spill the oldest shadow set to the trap stack when taking a nested trap
> with all-but-one shadow sets used.

as Richard mentioned,

> The problem with shadow registers is that you always run out and you still need to spill
> to main memory.

https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/U5ZXz0lZB-g/h25om5LiAQAJ

this means it is not possible to ensure a constant low latency;
sometimes the latency may include the time to spill, which is
problematic, since latency jitter is not acceptable in control loops,
as you also mentioned in a previous message.
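
a toy model makes the concern concrete. the sketch below encodes one possible reading of the quoted rule (a trap taken while all but one shadow set is in use triggers a spill of the oldest set) and only reports which trap entries trigger a spill; it does not model cycles, trap exit or refilling, and the names `shadow_model_t` and `trap_enter` are hypothetical.

```c
#include <stdbool.h>

/* Toy model of shadow register sets; an interpretation, not the proposal. */
typedef struct {
    unsigned nsets;   /* shadow sets implemented (assumed >= 2) */
    unsigned in_use;  /* shadow sets currently holding a trap context */
    unsigned spills;  /* sets written out to the trap stack so far */
} shadow_model_t;

/* Model one trap entry. Returns true if this entry triggered a spill,
 * i.e. if its latency may differ from a spill-free entry. */
bool trap_enter(shadow_model_t *m)
{
    if (m->in_use == m->nsets - 1) {
        /* The new trap takes the last free set and the oldest set is
         * spilled, so the number of occupied sets stays the same. */
        m->spills++;
        return true;
    }
    m->in_use++;
    return false;
}
```

under this reading, a core with 4 shadow sets takes the first three nesting levels without spilling and triggers a spill on every deeper level, which is where the non-constant latency would come from.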

> > In practical terms, I would add a bit in the interrupt status
> > registers (per each interrupt source), and allow the user to select
> > this alternate mode for really fast interrupts, that have very simple
> > handlers.
> >
>
> Easiest way to do that is to add another CSR.

the bit may be set for each interrupt source, so its location is
probably in the `status` word

https://github.com/emb-riscv/specs-markdown/blob/master/interrupt-controller.md#per-interrupt-registers

> What do you think of my proposal to use trap a0 as epc?

I'm not sure I understood what this means.

> For the eABI that I propose, RVE base simply omits the high 16 integer
> and all FP registers. All of the omitted integer registers are
> callee-saved in my proposal.

isn't it the same as:

https://github.com/emb-riscv/specs-markdown/blob/master/eabi.md#rv32e

regards,

Liviu

Jacob Bachmeyer

Mar 17, 2018, 6:45:16 PM

to Liviu Ionescu, Guy Lemieux, RISC-V ISA Dev

Liviu Ionescu wrote:
> On 17 March 2018 at 10:05:31, Guy Lemieux (glem...@vectorblox.com) wrote:
>
>> On Sat, Mar 17, 2018 at 12:44 AM Liviu Ionescu wrote:
>>
>>> any proposals that come with use cases that show they are simpler to
>>> use, and performance estimates that show better latency or better
>>> performance in general, are welcomed.
>>>
>> I'd like to add that cost of implementation must also be considered,
>> particularly for fpga targets.
>>
>
> yes, sure, the cost is important, and if any of my proposals prove
> unreasonably expensive to implement, we'll redesign them.
>

This is one of the categories where Rocket more-or-less falls flat on
its face, if recent complaints on hw-dev are accurate. I suspect that
the modular architecture Rocket uses, that is capable of scaling up to
meet the needs of large systems, is the source of this high minimum
complexity.

>> hence, I would support a modular approach, much like how the whole ISA is
>> designed, where there is a minimal core and then optional extensions.
>>
>
> yes, if possible, this would be a reasonable and consistent approach.
>

I see this as in tension with the goal of keeping the maximum complexity
on microcontrollers as low as possible. Balancing these is interesting.

>> for example, adding counters and exception support adds considerable area
>> to FPGA implementations.
>>
>
> if I remember right, there are two counters in my proposal, the system
> counter and the rtc counter, both having the same behaviour as the
> mtimer in the privileged specs. are those counters too heavy for you?
>

My proposal drops those counters as well, moving them to "timer"
peripherals, if they are implemented at all.

> also, since you obviously have experience with those things, can you
> estimate how much extra space the logic to automatically
> stack/unstack registers would require?
>

I would like to see such an estimate as well, since my proposal requires
the shadow sets to be spilled if traps nest more deeply than the number
of shadow sets.

>> adding more CSRs makes it worse.
>>
>
> in my proposal I reduced the CSRs to a minimum of 10. maybe not all
> are needed as CSRs, we can reconsider some of them.
>

My proposal currently has 4 "hard" CSRs: status, trap vector base,
interrupt assert, interrupt enable. An additional four CSRs provide the
stack ranges, and another five are windows into the shadow register sets
(the windows are needed for context-switching).

>> yet, the reduced
>> register file size of RV32E produces no savings on FPGAs.
>>
>
> you mean the reduction is not significant compared to the total
> size? since, in my naive understanding of these things, 16 registers
> of 32-bits should save some space anyway, like at least some 512
> simple cells (again, I'm totally out-of-date with new FPGAs, I only
> used some old Xilinx chips more than ten years ago, so I might be
> terribly wrong).
>

Even old Xilinx FPGAs have block RAM elements. If your 32-entry
register file fits in a block RAM, one block RAM will be allocated for
the register file. If you then move to a 16-entry register file, one
block RAM will *still* be allocated for the register file. All that you
save is a tiny amount of routing resources to handle the fifth register
bit. Similarly, if the 32-entry+N-shadows register file fits in a block
RAM, one block RAM will be allocated for the register file.

If you can actually save 512 simple cells by dropping 16 registers from
the register file, either your FPGA does not have block RAM or your
synthesis tools are garbage. Some Xilinx FPGAs also had the ability to
use LUTs as RAM during operation. IIRC these were 4-input LUTs, so if
you are targeting one of these FPGAs, and your register file is using
"cell RAM", you could save half the space, but only about XLEN cells,
since each group of 16 registers would need XLEN "RAM cells" and some
logic cells for address decoding.
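
As a rough illustration of this arithmetic (a sketch, not a synthesis
result), the helper below counts the RAM primitives needed for one
single-port copy of a register file. The function name and the 512x36
block RAM geometry used in the note are assumptions for illustration;
extra copies for additional read ports and the address decode logic are
ignored.

```c
#include <assert.h>

/* Number of RAM primitives needed for one single-port copy of a register
 * file with `entries` registers of `xlen` bits, when each primitive stores
 * `prim_depth` words of `prim_width` bits. Hypothetical helper: it ignores
 * replication for extra read ports and any address decode logic. */
unsigned regfile_primitives(unsigned entries, unsigned xlen,
                            unsigned prim_depth, unsigned prim_width)
{
    assert(prim_depth > 0 && prim_width > 0);
    unsigned rows = (entries + prim_depth - 1) / prim_depth; /* round up */
    unsigned cols = (xlen + prim_width - 1) / prim_width;    /* round up */
    return rows * cols;
}
```

With 16x1 "cell RAM" and XLEN = 32, regfile_primitives(32, 32, 16, 1)
gives 64 and regfile_primitives(16, 32, 16, 1) gives 32, so the saving is
XLEN cells. With a block RAM of, say, 512x36, both sizes need exactly one
primitive.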

-- Jacob

Liviu Ionescu

Mar 17, 2018, 7:32:11 PM

to jcb6...@gmail.com, Guy Lemieux, RISC-V ISA Dev

On 18 March 2018 at 00:45:15, Jacob Bachmeyer (jcb6...@gmail.com) wrote:

> My proposal drops those counters as well, moving them to "timer"
> peripherals, if they are implemented at all.

for software reasons, it is highly beneficial to have the system
timers completely specified by the architecture and available in all
implementations.

> My proposal currently has 4 "hard" CSRs: status, trap vector base,
> interrupt assert, interrupt enable. An additional four CSRs provide the
> stack ranges, and another five are windows into the shadow register sets
> (the windows are needed for context-switching).

in a microcontroller profile, the main reason for using CSRs is speed.

there are two registers used for implementing interrupt critical
sections, one register with the interrupts enabled/disabled flag, and
one register with the interrupts threshold.
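
a minimal sketch of how such a critical section might look in C, modelled
on the host with a plain variable standing in for the threshold register.
all names here are hypothetical, and the "never lower the threshold on
entry" rule is an assumption, not something the proposal fixes; on a real
core each accessor would be a single CSR instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side stand-in for the interrupt threshold CSR (hypothetical).
 * On a real core the two accessors would each be one CSR instruction. */
static uint32_t model_threshold_csr;

uint32_t threshold_read(void)        { return model_threshold_csr; }
void     threshold_write(uint32_t v) { model_threshold_csr = v; }

/* Enter a critical section: raise the threshold to `level` if that is
 * higher than the current value, and return the previous value. */
uint32_t critical_enter(uint32_t level)
{
    uint32_t previous = threshold_read();
    if (level > previous)   /* never lower the threshold on entry */
        threshold_write(level);
    return previous;
}

/* Leave a critical section by restoring the saved threshold. */
void critical_exit(uint32_t previous)
{
    /* Properly nested sections only ever restore a lower or equal value. */
    assert(previous <= threshold_read());
    threshold_write(previous);
}
```

because the previous value is returned and restored, these sections nest
without any extra bookkeeping.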

other speed sensitive registers are those that must be accessed during
thread context switches.

in a properly written multi-threaded RTOS, assuming the normal
interrupts are plain C functions, the context switch interrupt, with
the lowest possible priority, has the only handler where assembly code
is required, to save/restore the registers, and this code should be as
fast as possible.

in the current proposal, the status register and the stack registers
(base and limit) will probably need to be saved during context
switches, so they got the privilege of having associated CSRs.

there may be small details to fix after writing the context switch
code, but this is the general idea.

> Even old Xilinx FPGAs have block RAM elements. ...

thank you for the details, I'll try to update my knowledge of modern
FPGAs, but this will take some time, and for the moment I'll probably
prioritise software issues, so I'll rely on professional advice
for hardware synthesis questions.

but once the design is ready, qemu has validated it, and we have
functional applications that include an rtos running on qemu, the next
logical step is to work on a verilog version of the microcontroller
profile, so by that time I'll dedicate more time to FPGAs.

regards,

Liviu

Jacob Bachmeyer

Mar 17, 2018, 10:10:33 PM

to Liviu Ionescu, RISC-V ISA Dev

Liviu Ionescu wrote:
> On 18 March 2018 at 00:19:26, Jacob Bachmeyer (jcb6...@gmail.com) wrote:
>
>> My proposal only requires two sets of shadow registers (main registers
>> for thread context; shadow odd and shadow even for nested trap
>> contexts). Hardware may implement any number of shadow sets, but must
>> spill the oldest shadow set to the trap stack when taking a nested trap
>> with all-but-one shadow sets used.
>>
>
> as Richard mentioned,
>
>
>> The problem with shadow registers is that you always run out and you still need to spill
>> to main memory.
>>
>
> https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/U5ZXz0lZB-g/h25om5LiAQAJ
>
> this means it is not possible to ensure a constant low latency;
> sometimes the latency may include the time to spill, which is
> problematic, since latency jitter is not acceptable in control loops,
> as you also mentioned in a previous message.
>

The trick is that hardware can spill the oldest shadow set in the
background while running the ISR. All you need are four instructions in
the ISR that do not access memory.

>>> In practical terms, I would add a bit in the interrupt status
>>> registers (per each interrupt source), and allow the user to select
>>> this alternate mode for really fast interrupts, that have very simple
>>> handlers.
>> Easiest way to do that is to add another CSR.
>>
>
> the bit may be set for each interrupt source, so its location is
> probably in the `status` word
>
> https://github.com/emb-riscv/specs-markdown/blob/master/interrupt-controller.md#per-interrupt-registers
>

In my proposal, the only per-interrupt-source register is the MMIO
interrupt channel map. Since fast interrupts require different ISRs,
and I propose that ISRs be tied to interrupt channels, the "use fast
interrupt" flag is per-interrupt-channel. There are XLEN interrupt
channels in my proposal. A "use fast interrupt" flag CSR therefore
makes sense.
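
To make the lookup concrete, here is a small sketch under stated
assumptions: RV32 (so 32 channels), a byte-per-source channel map, and a
single flag word with one bit per channel. The names and layouts are
hypothetical and are not taken from either proposal.

```c
#include <assert.h>
#include <stdint.h>

#define XLEN 32u  /* assuming RV32: one flag bit per interrupt channel */

/* Returns 1 if the channel that `source` is mapped to has its
 * "use fast interrupt" bit set, 0 otherwise. `channel_map` stands in for
 * the MMIO interrupt channel map and `fast_flags` for the proposed flag
 * CSR; both names and layouts are hypothetical. */
int source_uses_fast_interrupt(const uint8_t *channel_map, uint32_t fast_flags,
                               unsigned source)
{
    unsigned channel = channel_map[source];
    assert(channel < XLEN);
    return (int)((fast_flags >> channel) & 1u);
}
```

Several sources mapped to the same channel share one flag, which matches
the idea that the ISR, not the source, determines whether the fast path
is used.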

>> What do you think of my proposal to use trap a0 as epc?
>>
>
> I'm not sure I understood what this means.
>

The a0 register is shadowed. C trap handlers are called with the
interrupted program counter value as their first argument and return the
address at which execution is to resume. The CSRs loaded upon trap
entry are thus eliminated. The value that would be in tval is instead
provided as the second argument in a1, which is also shadowed.
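
A sketch of what such a handler could look like in C. The handler name is
hypothetical, and it assumes (this is not stated above) that the
tval-like second argument holds the bits of the trapping instruction.
Encodings longer than 32 bits are ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Length in bytes of a RISC-V instruction given its lowest 16 bits:
 * encodings whose two low bits are not 11 are 16-bit (compressed). */
unsigned insn_length(uint32_t insn)
{
    return ((insn & 0x3u) == 0x3u) ? 4u : 2u;
}

/* Hypothetical C trap handler in the style described above: the
 * interrupted pc arrives as the first argument (a0), the tval-like value
 * as the second (a1), and the return value is the address at which to
 * resume. This one assumes `tval` holds the bits of the trapping
 * instruction and simply skips over it. */
uintptr_t trap_skip_insn(uintptr_t epc, uintptr_t tval)
{
    assert((epc & 0x1u) == 0);  /* instructions are at least 2-byte aligned */
    return epc + insn_length((uint32_t)tval);
}
```

A handler that wants to resume the interrupted code unchanged would
simply return its first argument.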

>> For the eABI that I propose, RVE base simply omits the high 16 integer
>> and all FP registers. All of the omitted integer registers are
>> callee-saved in my proposal.
>>
>
> isn't it the same as:
>
> https://github.com/emb-riscv/specs-markdown/blob/master/eabi.md#rv32e
>

Not quite. I propose putting the stack limit in a CSR, retaining a4 and
a5 instead of changing them to t3 and t4, and making the argument
registers callee-saved; only the return values in a0 and a1 are
caller-saved.

-- Jacob

Jacob Bachmeyer

Mar 17, 2018, 10:24:38 PM

to Liviu Ionescu, Guy Lemieux, RISC-V ISA Dev

Liviu Ionescu wrote:
> On 18 March 2018 at 00:45:15, Jacob Bachmeyer (jcb6...@gmail.com) wrote:
>
>> My proposal drops those counters as well, moving them to "timer"
>> peripherals, if they are implemented at all.
>>
>
> for software reasons, it is highly beneficial to have the system
> timers completely specified by the architecture and available in all
> implementations.
>

This would ease writing a "universal" scheduler, but counters as CSRs
are rather expensive in FPGAs, as others have pointed out. Since
microcontrollers should have a standard memory map, ("it does not have
to be present, but if it is present, it should be at *this* base
address") and timers are relatively simple and basic peripherals, I
would prefer to standardize a timer interface and an MMIO base address
for timer0 (if present at all) and let implementations be flexible
beyond that.
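
As a sketch of what a standardized timer interface could mean for
software, assume a 64-bit counter exposed as two 32-bit MMIO words. The
register layout and names below are hypothetical; the re-read loop is the
usual way to read a 64-bit counter with 32-bit accesses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register layout for a standardized timer0 peripheral. */
struct timer_regs {
    volatile uint32_t count_lo;  /* low 32 bits of the counter  */
    volatile uint32_t count_hi;  /* high 32 bits of the counter */
};

/* Read the 64-bit count with 32-bit accesses. Re-reading the high word
 * guards against a carry from low to high between the two reads. */
uint64_t timer_read64(const struct timer_regs *t)
{
    uint32_t hi, lo;
    assert(t != 0);
    do {
        hi = t->count_hi;
        lo = t->count_lo;
    } while (hi != t->count_hi);
    return ((uint64_t)hi << 32) | lo;
}
```

On real hardware the pointer would be the standardized timer0 base
address cast to this struct type; a "universal" scheduler would then need
only that one address and this layout.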

>> My proposal currently has 4 "hard" CSRs: status, trap vector base,
>> interrupt assert, interrupt enable. An additional four CSRs provide the
>> stack ranges, and another five are windows into the shadow register sets
>> (the windows are needed for context-switching).
>>
>
> in a microcontroller profile, the main reason for using CSRs is speed.
>

This is not actually relevant to a microcontroller, since
microcontrollers operate with synchronous memory buses: a CSR and an
MMIO control register are equally fast. (This *is* relevant on
application processors and is one reason why MMIO CSRs are generally a
non-starter, even though microcontrollers can use them.)

> there are two registers used for implementing interrupt critical
> sections, one register with the interrupts enabled/disabled flag, and
> one register with the interrupts threshold.
>
> other speed sensitive registers are those that must be accessed during
> thread context switches.
>
> in a properly written multi-threaded RTOS, assuming the normal
> interrupts are plain C functions, the context switch interrupt, with
> the lowest possible priority, has the only handler where assembly code
> is required, to save/restore the registers, and this code should be as
> fast as possible.
>
> in the current proposal, the status register and the stack registers
> (base and limit) will probably need to be saved during context
> switches, so they got the privilege of having associated CSRs.
>
> there may be small details to fix after writing the context switch
> code, but this is the general idea.
>

I plan to include MMIO CSR space in the standard memory map. ("CSRs are
not required to be MMIO, but may be and should be *here* if they are MMIO.")

-- Jacob

Jacob Bachmeyer

Mar 17, 2018, 10:33:05 PM

to Krste Asanovic, Guy Lemieux, Liviu Ionescu, RISC-V ISA Dev

Krste Asanovic wrote:
> Type of memory interface does not belong in the micro controller profile, wrong level of detail.

On this I disagree. As I see it, many of the trade-offs that Liviu
Ionescu was complaining about that led to this discussion are *driven*
*by* the more complex memory interfaces used in modern application
processors. These complex memory hierarchies are not wrong -- they are,
after all, how GHz+ processors and multi-GiB RAM are reasonable for most
large systems.

They are also, well, complex, and therefore a good candidate for drawing
a dividing line between simple microcontrollers and more complex
application processors. If the complex memory hierarchies are not
excluded, then the general POSIX environment ISA really is optimal or
very close. Excluding these system architectures enables the
microcontroller environment ISA to take different choices to better fit
a specific niche that the general environment ISA serves badly.

I envision RISC-V microcontrollers being commonly used as peripherals in
larger systems for real-time subtasks, particularly related to real-time
I/O, such as sensors.

-- Jacob

Jacob Bachmeyer

Mar 17, 2018, 10:46:48 PM

to Jan Gray, Liviu Ionescu, Guy Lemieux, RISC-V ISA Dev

Jan Gray wrote:
> To amplify on what Guy Lemieux wrote: "for example, adding counters and exception support adds considerable area to FPGA implementations. adding more CSRs makes it worse. yet, the reduced register file size of RV32E produces no savings on FPGAs",
>
> Guy is right! For more context, here are some modern FPGAs' small, fast embedded SRAMs:
>
> 1. Xilinx Virtex-[567], UltraScale[+], Artix-[67], Kintex-[67], Spartan-[67] are 6-LUT FPGAs. About half of the LUTs can be configured as "distributed LUT RAM" which is 32x2 or 64x1 per LUT. There are various single port and dual port configurations but the minimum depth is 32 entries. A (simple, single context, single cycle) 16-entry register file implemented in this LUT RAM consumes the same area as a 32-entry register file.
>
> 2. Intel/Altera Cyclone-V, Stratix-V, Stratix-10, Arria-V, Arria-10 have (8-input) 6-LUT ALMs, some of which can implement a 32x2 or 64x1 SRAM. These are collected in MLABs of 10 ALMs, which implement 64x10 or 32x20 true dual port SRAMs. Once again a simple 16-entry register file has the same area as a 32-entry one. MAX10 FPGAs have no LUT RAM, but have M9K blocks with a minimum depth of 256 (x36).
>
> 3. Lattice ice40 FPGAs are 4-LUT devices. They lack LUT RAM but have RAM4K blocks with a minimum depth of 256 (x16).
>
> 4. MicroSemi PolarFire and IGLOO2 FPGAs are 4-LUT devices. They lack LUT RAM but have 64xn micro-SRAM blocks.
>
> So in these target devices, a 16-entry register file saves no area. (In single cycle per instruction microarchitectures. If you implement RV32I with an 8- or 16-bit datapath and multi-cycle operation, that's different.)
>

Do I understand correctly that this also means that, once a single set
of four shadow registers is added to the base register file (the
separate thread stack pointer fills the "x0" slot), adding enough
to go all the way to a 64-entry physical register file is essentially
free on modern FPGAs? (And that on some FPGA series going all the way to
a 256-entry register file with 56 shadow sets costs only additional
control logic?)
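
The counts follow from simple arithmetic, sketched here with a
hypothetical helper. It assumes 32 slots are taken by the base registers
(with the thread stack pointer in the otherwise unused "x0" slot) and
that every shadow set occupies the same number of slots.

```c
#include <assert.h>

/* Shadow sets that fit in a physical register file of `physical_entries`
 * slots, assuming 32 slots hold the base registers (with the thread stack
 * pointer occupying the otherwise unused x0 slot) and each shadow set
 * takes `set_size` slots. Illustrative helper only. */
unsigned shadow_sets(unsigned physical_entries, unsigned set_size)
{
    assert(set_size > 0 && physical_entries >= 32);
    return (physical_entries - 32u) / set_size;
}
```

With 4-register sets, shadow_sets(64, 4) is 8 and shadow_sets(256, 4)
is 56.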

> As for RV32I counters and exceptions, so far I have been loath to add them to GRVI RV32I-subset parallel PEs. Just for the counters, you need to keep a 64b INSTRET per core. (You can share/amortize TIME and perhaps CYCLE across cores.) In a Xilinx FPGA, a 64b counter + a 32b 6:1 mux are larger than a 2R1W register file + an ALU. A ~30% larger core can mean ~30% fewer cores per die...
>

How much of this cost is incurred by "implementing" a counter CSR by
hardwiring it to zero?

> In some prior FPGA RISCs I've built, basic interrupt support employed a reserved register (e.g. x31) as an interrupt return address, and required injecting an interrupt handler call (e.g. jal x31,irqhander(x0)) into the instruction fetch stream. This added ~0 LUTS to the datapath, and the overhead to enter and return from an empty interrupt handler call was 3+3 = 6 cycles. RISC-V exceptions/interrupts/CSRs are less parsimonious. (I understand why.)
>

With shadowed registers, such that an interrupt observes (and can
change) its return address in a0, could similar performance be within
reach of a RISC-V implementation if the decoder can generate references
to physical registers that are about to become visible when the ISR is
entered?

-- Jacob

kr...@berkeley.edu

Mar 18, 2018, 1:41:27 AM

to jcb6...@gmail.com, Krste Asanovic, Guy Lemieux, Liviu Ionescu, RISC-V ISA Dev

>>>>> On Sat, 17 Mar 2018 21:33:02 -0500, Jacob Bachmeyer <jcb6...@gmail.com> said:
| Krste Asanovic wrote:
|| Type of memory interface does not belong in the micro controller profile, wrong level of detail.

| On this I disagree. As I see it, many of the trade-offs that Liviu
| Ionescu was complaining about that led to this discussion are *driven*
| *by* the more complex memory interfaces used in modern application
| processors.

No - the big difference comes from virtualization requirements and the
assumption of fast processors and smart devices, versus bare-metal
access with slow processors and dumb devices. Nothing directly
related to complex memory hierarchies, and definitely nothing about
complex memory *interfaces*.

| These complex memory hierarchies are not wrong -- they are,
| after all, how GHz+ processors and multi-GiB RAM are reasonable for most
| large systems.

Memory interface != memory hierarchy.

| They are also, well, complex, and therefore a good candidate for drawing
| a dividing line between simple microcontrollers and more complex
| application processors. If the complex memory hierarchies are not
| excluded, then the general POSIX environment ISA really is optimal or
| very close. Excluding these system architectures enables the
| microcontroller environment ISA to take different choices to better fit
| a specific niche that the general environment ISA serves badly.

There are certainly differences in platform features, but few
necessary differences at user ISA level.

| I envision RISC-V microcontrollers being commonly used as peripherals in
| larger systems for real-time subtasks, particularly related to real-time
| I/O, such as sensors.

Microcontrollers can have complex memory systems, particularly when
placed inside larger SoCs and/or when using various power control
tricks, even without considering caches, which are also reasonably
common for microcontrollers. Some microcontrollers have both L1 I &
L1 D caches, and multiple local L1 SRAMs, and shared L2 cache, and
multiple shared L2 SRAMs.

Krste

| -- Jacob

Liviu Ionescu

Mar 18, 2018, 1:44:02 AM

to Krste Asanovic, jcb6...@gmail.com, Guy Lemieux, RISC-V ISA Dev

On 18 March 2018 at 04:33:04, Jacob Bachmeyer (jcb6...@gmail.com) wrote:

> Krste Asanovic wrote:
> > Type of memory interface does not belong in the micro controller profile, wrong level of detail.
>
> On this I disagree. As I see it, many of the trade-offs that Liviu
> Ionescu was complaining about that led to this discussion are *driven*
> *by* the more complex memory interfaces used in modern application
> processors.

On this I agree with Krste.

Complex memory interfaces are irrelevant for the discussion.

Most current Cortex-M implementations have ample amounts of internal
flash/ram, which for slow devices makes them synchronous, but they
also include I-Cache & D-Cache, allowing the cores to reach 200-300
MHz.

And most high-end implementations include a memory controller, that
can connect external static/dynamic ram and all kinds of flash
memories.

I see no reason for not allowing such configurations for RISC-V
microcontrollers.

> I envision RISC-V microcontrollers being commonly used as peripherals in
> larger systems for real-time subtasks, particularly related to real-time
> I/O, such as sensors.

There are currently many other use cases for microcontrollers.

Regards,

Liviu

Liviu Ionescu

Mar 18, 2018, 2:17:41 AM

to kr...@berkeley.edu, jcb6...@gmail.com, RISC-V ISA Dev, Guy Lemieux

On 18 March 2018 at 07:41:25, kr...@berkeley.edu (kr...@berkeley.edu) wrote:

> No - the big difference comes from virtualization requirements and the
> assumption of fast processors and smart devices, versus bare-metal
> access with slow processors and dumb devices.

To paraphrase Neil deGrasse Tyson, "by my time, people were smart, not devices".

Since when are Linux devices smart and non-Linux devices dumb?

Nope. The virtualization requirements may be interesting, but for
microcontrollers they are totally irrelevant.

Hardware guys tend to forget one small detail, that all the devices
they invent need software to run.

The big difference between these devices is how you write software for
them: do you expect a full blown multi-process operating system, with
virtual memory, supervisor modes and everything?

Or is a single process, with no virtual memory, no supervisor modes,
and maybe no system at all, enough?

By an abuse of language, the latter run-time environments are also called
'operating systems', as in RTOS, but they are only a caricature of an
operating system, not to mention that in most cases they are included
in the same binary image with the application, and run in the same
mode.

I think this is the dividing line between the two.

See also:

https://github.com/emb-riscv/specs-markdown/blob/master/improvements-upon-privileged.md#the-dividing-line

Regards,

Liviu

Liviu Ionescu

Mar 18, 2018, 2:45:19 AM

to Jan Gray, jcb6...@gmail.com, Guy Lemieux, RISC-V ISA Dev

On 18 March 2018 at 03:16:18, Jan Gray (jsg...@acm.org) wrote:

> A ~30% larger core can mean ~30% fewer cores per die...

that's true, but I was always puzzled by such statements.

assuming a hypothetical physical device, which includes a single-hart
RISC-V core, with, let's say, 128 KiB ram, 512 KiB flash, possibly
caches, a reasonable assortment of peripherals (gpios, timers, spis,
i2c, adc/dac, usb, etc), debug logic, connecting pads, etc, can you
estimate what percentage of the total die area is occupied by
the core itself?

based on public photos I saw, my totally uneducated guess is: not much.

it would also be interesting to see how this percentage compares with
existing Cortex-M3/M4 devices.

regards,

Liviu

Michael Chapman

Mar 18, 2018, 3:28:14 AM

to Liviu Ionescu, Jan Gray, jcb6...@gmail.com, Guy Lemieux, RISC-V ISA Dev

128KiB of SRAM is huge for many embedded systems. 512KiB is also on the
large side.
Some of the systems we have delivered cores for have 2KiB of SRAM and
16KiB of NVM.

We have delivered cores where the end selling price (including
packaging and test) is 10c (including all royalty payments for IPs).

You can do the calculation as to how many gates that buys you and you
will see that there is a sensitivity to die size. This is why the ARM
Cortex M0 exists vs the ARM Cortex M3.

In that context, the debug logic and large number of CSRs as currently
defined for RISC-V are relatively expensive. The relatively large
register file (compared to Cortex M0 and competing cores), is also a
handicap for a standard full scan EDA flow. There are workarounds to
reduce this.

You are right, that pads are expensive. Particularly in smaller
geometries. That is why some chips we make have only 6 pads including
power, ground and test.

Liviu Ionescu

Mar 18, 2018, 3:45:04 AM

to jcb6...@gmail.com, RISC-V ISA Dev

On 18 March 2018 at 04:24:37, Jacob Bachmeyer (jcb6...@gmail.com) wrote:

> a CSR and an
> MMIO control register are equally fast.

not exactly, I did a small test and the csr read takes one
instruction, while the mmio read takes two instructions:

f(riscv_csr_read_mstatus(), SYSPERIPH->cmd);

20400238: 30002573 csrr a0,mstatus

2040023c: f00007b7 lui a5,0xf0000
20400240: 43cc lw a1,4(a5)

20400242: 307000ef jal ra,20400d48 <f(unsigned long, unsigned long)>

for this reason I prefer all registers needed in interrupt critical
sections and context switches to be accessed with CSR instructions.
all other non critical registers can be memory mapped.

regards,

Liviu

Alex Bradbury

Mar 18, 2018, 5:26:42 AM

to Krste Asanovic, Liviu Ionescu, RISC-V ISA Dev, Torbjørn Viem Ness, Rogier Brussee, Watson Ladd, Richard Herveille

On 17 March 2018 at 00:16, <kr...@berkeley.edu> wrote:
>
> I want to clear up the misconception that we don't encourage
> experimentation or standardization of alternative privileged
> architectures.
>

> Part of the docs clearly states that people can completely replace the
> privileged architecture, and in fact, one of our goals in writing
> specs this way was to enable experimentation with new OS models.

> While I agree microcontrollers are a big near-term market for RISC-V,
> RISC-V Unix cores are also a large upcoming market that has a vastly
> larger software base that needs time to port and mature, and so a lot
> of effort has gone into stable standards for conventional OS ports.

Thanks for clarifying that, Krste, but I think the point of
confusion is slightly different. What really needs confirmation is
whether an implementer who adopts the user-level ISA but jettisons the
privileged architecture in favour of an alternative approach can still
claim to be RISC-V (obviously clearly stating that it is the RISC-V
user-level spec that it is compliant with). Is that the case? If so,
Liviu's suggestion to add the word "optional" to the reference to the
privileged spec seems very sensible.

Thanks,

Alex

Liviu Ionescu

Mar 18, 2018, 12:22:39 PM

to jcb6...@gmail.com, RISC-V ISA Dev

On 18 March 2018 at 04:10:32, Jacob Bachmeyer (jcb6...@gmail.com) wrote:

> The trick is that hardware can spill the oldest shadow set in the
> background while running the ISR. All you need are four instructions in
> the ISR that do not access memory.

four? I think this number depends on the ABI, which has not yet been decided.

I think the general design should not depend on specifics like this,
the mechanism should work for any ABI, including the current one, with
16 registers to be saved.

I'm afraid I don't know how you can guarantee that the spill is
completed when a subsequent interrupt comes.

regards,

Liviu

Krste Asanovic

Mar 18, 2018, 3:07:27 PM

to Alex Bradbury, Liviu Ionescu, RISC-V ISA Dev, Torbjørn Viem Ness, Rogier Brussee, Watson Ladd, Richard Herveille

For ISA compliance, without any platform compliance, the main goal would be to ensure that standard compiler and assembly code executes correctly.
These ISA compliance tests are being designed so they don’t bake in any platform assumptions. My view is that if the core can run the ISA tests, it can claim ISA compliance, even if it is not compliant with any platform standard. In this context I would say “unprivileged” ISA rather than “user” ISA, since in bare microcontrollers there is no user mode. I believe privileged ISA components can really only be tested as part of platform compliance.

Most of this confusion is because the current manuals are not split up into exactly the modules we want to standardize, and the platform profiles don’t exist yet.
Of course, despite this, a lot of work has managed to be done. Having the current manuals, spike, and a now upstream single version of gcc and binutils, and same for Linux, has propagated a de facto standard, but we are moving ahead with more formal standards at the Foundation.

Krste

Liviu Ionescu

Mar 18, 2018, 4:14:22 PM

to Krste Asanovic, Alex Bradbury, RISC-V ISA Dev

On 18 March 2018 at 21:07:25, Krste Asanovic (kr...@berkeley.edu) wrote:

> “unprivileged” ISA versus “user” ISA ... privileged ISA

I still have a problem with understanding the semantics behind these names.

According to Volume I, Introduction, the first line, ISA means
"instruction set architecture".

By expanding the initials, we get "user instruction set architecture"
and "privileged instruction set architecture".

For me, a non-native English speaker, mixing these words together is confusing:

- the 'architecture' is one thing (the overall umbrella that includes
the instruction set plus everything else)
- the 'instruction set' is another thing, that generally has nothing
to do with any running modes, privileged or unprivileged,
- the 'user' or 'privileged' are modes, let's say specific definitions
related to different execution environments

I don't think that a few extra instructions available in the
privileged mode are the main differentiator between user and
privileged, probably the reasons for the differences are somewhere
else.

Personally I'd expect the official documentation to be a bit more
careful with the terminology; in particular 'privileged ISA' seems
troubling, since it puts the emphasis on the instruction set.

Aren't other terms more suitable for this? For example 'profile'?

This would separate the RISC-V Architecture into:

- the instruction set, with encodings and everything, like 'The RISC-V
Instruction Set'
- a set of extensions, with additional instruction sets, like 'The
RISC-V Instruction Set Extension A - Atomics'
- a set of profiles (like privileged/microcontroller/etc), for
different execution environments, like 'The RISC-V Privileged profile'
(or 'The RISC-V POSIX profile', 'The RISC-V Application profile', or
any better name that you may find)

If, for compliance reasons, you decide to reorganise/split the manuals
based on better criteria, perhaps you can consider some more carefully
chosen names.

And, of course, decouple compliance for the instruction set from the
extensions and the profiles.

Regards,

Liviu

Jacob Bachmeyer

Mar 18, 2018, 10:29:02 PM

to Liviu Ionescu, RISC-V ISA Dev

Liviu Ionescu wrote:
> On 18 March 2018 at 04:10:32, Jacob Bachmeyer (jcb6...@gmail.com) wrote:
>
>
>> The trick is that hardware can spill the oldest shadow set in the
>> background while running the ISR. All you need are four instructions in
>> the ISR that do not access memory.
>>
>
> four? I think this number depends on the ABI, which has not yet been decided.
>

My proposal includes an eABI that declares all registers except for link
registers (x1, x5) and return values (x10, x11) to be callee-saved.
These caller-saved registers are included in shadow sets.

> I think the general design should not depend on specifics like this,
> the mechanism should work for any ABI, including the current one, with
> 16 registers to be saved.
>

It can still work, at the price of increased latency.

> I'm afraid I don't know how you can guarantee that the spill is
> completed when a subsequent interrupt comes.
>

If interrupts are arriving faster than you can spill shadow sets for any
significant length of time, you are screwed no matter what.

If shadow sets are cheap (as they appear to be in at least some FPGA
implementations), one good option is to set a threshold and start
spilling shadow sets in the background when some fraction of the shadow
sets are in use. This leaves a buffer, such that a few shadow sets are
available if a burst of interrupts arrives.

Another option is to spill the shadow set while refilling the pipeline
after taking a trap. This is the primary argument I see for hardware
interrupt stacking. For the case of a higher-priority interrupt
arriving exactly as a lower-priority ISR is starting, but before any
instructions from the lower-priority ISR have committed, simply clear
the pipeline and take the higher priority interrupt, leaving the
lower-priority interrupt asserted; it will be handled after the
higher-priority ISR has returned. For a five-stage IF/ID/EX/MEM/WB
pipeline, spilling a 4-register shadow set fits nicely in the pipeline's
inherent 4-cycle latency, after which either an original or
higher-priority interrupt (that arrived during the latency period) can
be taken with either the original 4-cycle latency or a new 4-cycle
latency to service the higher-priority interrupt after clearing the
pipeline. The lower-priority interrupt sees its 4-cycle latency only to
itself be interrupted; the higher-priority interrupt sees exactly 4
cycles of latency.
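
The fit between spill time and refill time can be written down as a
small model. It is only a sketch under two assumptions not stated
above: the refill takes one cycle fewer than the number of pipeline
stages, and the spill logic stores one register per cycle. The function
name is hypothetical.

```c
#include <assert.h>

/* Cycles of extra trap latency if a shadow set of `set_regs` registers is
 * spilled during pipeline refill. Assumptions: the refill takes
 * (stages - 1) cycles and the spill logic stores one register per cycle. */
unsigned spill_extra_latency(unsigned stages, unsigned set_regs)
{
    assert(stages >= 1);
    unsigned refill = stages - 1;
    return (set_regs > refill) ? set_regs - refill : 0u;
}
```

For the five-stage pipeline and a 4-register shadow set this gives 0
extra cycles. Under the same model, saving 16 registers would add 12
cycles, which is the increased latency mentioned earlier for the current
ABI.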

-- Jacob

Jacob Bachmeyer

Mar 18, 2018, 10:35:24 PM

to Liviu Ionescu, RISC-V ISA Dev

Liviu Ionescu wrote:
> On 18 March 2018 at 04:24:37, Jacob Bachmeyer (jcb6...@gmail.com) wrote:
>
>> a CSR and an
>> MMIO control register are equally fast.
>>
>
> not exactly, I did a small test and the csr read takes one
> instruction, while the mmio read takes two instructions:
>
> f(riscv_csr_read_mstatus(), SYSPERIPH->cmd);
>
> 20400238: 30002573 csrr a0,mstatus
>
> 2040023c: f00007b7 lui a5,0xf0000
> 20400240: 43cc lw a1,4(a5)
>
> 20400242: 307000ef jal ra,20400d48 <f(unsigned long, unsigned long)>
>

For hardware simple enough to be in-scope for my microcontroller
proposal, the CSR instructions can themselves be executed as (possibly
RVA) memory accesses to MMIO control registers.
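
For reference, the two accesses in the quoted test might be written along these lines in C. This is a sketch under assumptions: the register block layout is inferred only from the disassembly above (base 0xF0000000 from `lui a5,0xf0000`, `cmd` at offset 4 from `lw a1,4(a5)`); the `ctrl` field name and the host fallback values are hypothetical and exist only so the fragment also compiles off-target.

```c
#include <stdint.h>

typedef struct {
    volatile uint32_t ctrl; /* offset 0: hypothetical placeholder field */
    volatile uint32_t cmd;  /* offset 4: matches "lw a1,4(a5)" above */
} sysperiph_t;

#if defined(__riscv)
/* On target: one csrr instruction for the CSR; lui + lw for the MMIO read. */
#define SYSPERIPH ((sysperiph_t *)0xF0000000UL)
static inline unsigned long riscv_csr_read_mstatus(void) {
    unsigned long value;
    __asm__ volatile("csrr %0, mstatus" : "=r"(value));
    return value;
}
#else
/* Host stand-ins (assumed values) so the sketch builds without hardware. */
static sysperiph_t host_sysperiph = { 0u, 0x1234u };
#define SYSPERIPH (&host_sysperiph)
static inline unsigned long riscv_csr_read_mstatus(void) {
    return 0x1800UL;
}
#endif
```

The extra instruction on the MMIO side is the address materialisation; the load itself is still a single instruction.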

> for this reason I prefer all registers needed in interrupt critical
> sections and context switches to be accessed with CSR instructions.
> all other non critical registers can be memory mapped.
>

On this I agree, and those registers are CSRs in my proposal. Although,
if hardware handles nested interrupts properly, there should be no
interrupt critical sections.

-- Jacob

Jacob Bachmeyer

Mar 18, 2018, 10:59:58 PM

to kr...@berkeley.edu, Guy Lemieux, Liviu Ionescu, RISC-V ISA Dev

kr...@berkeley.edu wrote:
>>>>>> On Sat, 17 Mar 2018 21:33:02 -0500, Jacob Bachmeyer <jcb6...@gmail.com> said:
>>>>>>
> | Krste Asanovic wrote:
> || Type of memory interface does not belong in the micro controller profile, wrong level of detail.
>
> | On this I disagree. As I see it, many of the trade-offs that Liviu
> | Ionescu was complaining about that led to this discussion are *driven*
> | *by* the more complex memory interfaces used in modern application
> | processors.
>
> No - the big difference comes from virtualization requirements and the
> assumption of fast processors and smart devices, versus bare-metal
> access with slow processors and dumb devices. Nothing directly
> related to complex memory hierarchies, and definitely nothing about
> complex memory *interfaces*.
>

Perhaps we are not sharing the same concept of memory subsystem
complexity. The lines that I see as relevant for microcontrollers are
whether (1) the memory subsystem can process requests in something other
than program order, (2) the core can continue executing instructions
while the memory subsystem is processing a request, (3) the memory can
"keep pace" with the core. Item 3 is particularly important for
microcontrollers that are intended to control some physical process, as
this greatly improves the predictability of execution timing.

> | These complex memory hierarchies are not wrong -- they are,
> | after all, how GHz+ processors and multi-GiB RAM are reasonable for most
> | large systems.
>
> Memory interface != memory hierarchy.
>

An admitted imprecision in words. The behavior of the memory hierarchy
(particularly timing) is part of the memory interface, however, and I
envision "constant memory latency" as characteristic of microcontrollers.

> | They are also, well, complex, and therefore a good candidate for drawing
> | a dividing line between simple microcontrollers and more complex
> | application processors. If the complex memory hierarchies are not
> | excluded, then the general POSIX environment ISA really is optimal or
> | very close. Excluding these system architectures enables the
> | microcontroller environment ISA to take different choices to better fit
> | a specific niche that the general environment ISA serves badly.
>
> There are certainly differences in platform features, but few
> necessary differences at user ISA level.
>

I envision *no* differences at user ISA level between microcontrollers
and general processors. (The POSIX ABI is not properly part of the user
ISA, although some details, such as link registers, do belong in the
user ISA.) In fact, a larger RISC-V system should be able to simulate
(in U-mode) a microcontroller environment for testing microcontroller
firmware by using "trap-and-emulate" for the instructions from the
microcontroller environment ISA (which requires that those instruction
encodings be either illegal or privileged in the general profile).

> | I envision RISC-V microcontrollers being commonly used as peripherals in
> | larger systems for real-time subtasks, particularly related to real-time
> | I/O, such as sensors.
>
> Microcontrollers can have complex memory systems, particularly when
> placed inside larger SoCs and/or when using various power control
> tricks, even without considering caches, which are also reasonably
> common for microcontrollers. Some microcontrollers have both L1 I &
> L2 D caches, and multiple local L1 SRAMs, and shared L2 cache, and
> multiple shared L2 SRAMs.
>

This is where we start to encounter a dividing line between what is a
proper application for a microcontroller and what is a proper
application for a general processor. I expect that most applications
using such complex SoCs should actually be using the general profile;
the microcontroller profile I envision is intended for smaller systems,
or as a component in a larger system, perhaps as one of the "smart
devices" you mentioned being used with application processors, probably
even embedded inside one of those SoCs as a peripheral to an application
processor.

-- Jacob

Krste Asanovic

Mar 19, 2018, 12:17:43 AM

to jcb6...@gmail.com, Guy Lemieux, Liviu Ionescu, RISC-V ISA Dev

On Mar 18, 2018, at 7:59 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

kr...@berkeley.edu wrote:

Microcontrollers can have complex memory systems, particularly when
placed inside larger SoCs and/or when using various power control
tricks, even without considering caches, which are also reasonably
common for microcontrollers. Some microcontrollers have both L1 I &
L2 D caches, and multiple local L1 SRAMs, and shared L2 cache, and
multiple shared L2 SRAMs.

This is where we start to encounter a dividing line between what is a proper application for a microcontroller and what is a proper application for a general processor. I expect that most applications using such complex SoCs should actually be using the general profile; the microcontroller profile I envision is intended for smaller systems, or as a component in a larger system, perhaps as one of the "smart devices" you mentioned being used with application processors, probably even embedded inside one of those SoCs as a peripheral to an application processor.

I think you’re ignoring many “proper applications” of microcontrollers. These do not need/want demand-paged virtual memory, but can have a much more complex memory system (in terms of types/levels of memory) than a same-generation application processor, and also with fast interrupts. There is an important market in tiny microcontrollers (e.g., 8051 replacements) but these are only one piece of a much bigger universe of non-Unix-capable embedded processors.

Krste

Liviu Ionescu

Mar 19, 2018, 4:11:53 AM

to jcb6...@gmail.com, RISC-V ISA Dev

On 19 March 2018 at 04:35:22, Jacob Bachmeyer (jcb6...@gmail.com) wrote:

> Although, if hardware handles nested interrupts properly, there
> should be no interrupt critical sections.

interrupt critical sections have nothing to do with nested interrupts,
but with scheduler integrity.

the scheduler has at least one critical data structure, the ready
list, where all ready-to-run threads are linked. elements are removed
during context switches and added when new threads become ready (for
example when an ISR posts a semaphore or adds data to a user queue).

on a single hart device, adding/removing elements from the ready list
must be done within an interrupt critical section, to preserve the
list integrity.
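
a minimal sketch of such a protected insertion, written as a host-side model. everything here is an assumption for illustration: `sim_mie` stands in for the global interrupt enable bit, and `irq_swap()` / `irq_write()` stand in for whatever single-instruction CSR accesses the target provides.

```c
#include <stddef.h>

typedef struct thread {
    struct thread *next;
    int priority;
} thread_t;

static unsigned sim_mie = 1;          /* host stand-in for the enable bit */
static thread_t *ready_head = NULL;   /* ready list, highest priority first */

/* Stand-in for an atomic CSR swap: write new value, return the old one. */
static unsigned irq_swap(unsigned new_value) {
    unsigned old = sim_mie;
    sim_mie = new_value;
    return old;
}

/* Stand-in for a plain CSR write. */
static void irq_write(unsigned value) { sim_mie = value; }

/* Link a thread into the ready list in descending priority order.
   The walk and the two pointer updates run with interrupts disabled,
   so an ISR can never observe a half-linked list. */
static void ready_list_add(thread_t *t) {
    unsigned saved = irq_swap(0);
    thread_t **link = &ready_head;
    while (*link != NULL && (*link)->priority >= t->priority)
        link = &(*link)->next;
    t->next = *link;
    *link = t;
    irq_write(saved);   /* restore, rather than unconditionally enable */
}
```

restoring the saved value instead of writing 1 keeps the function safe to call from code that already runs with interrupts disabled.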

similarly, there are other data structures in the scheduler that must
be protected (list of timeouts, list of devices, etc).

things start to become even more interesting on a multi-hart device,
since disabling the interrupts is no longer enough, and other
mechanisms must be used (like atomics).

thus, access to the interrupt enable bit, and, separately, to the
interrupt threshold, should be as direct and as fast as possible.

in the current privileged profile, access to the MIE bit can be done
in one CSR instruction, but the threshold register is implementation
specific.

in the microcontroller profile I brought the threshold register into a
CSR, and accessing it is also done with a single instruction, not to
mention that having a common definition for all devices greatly
simplifies software.

in this respect, moving the mtime and mtimecmp registers to
implementation specific addresses instead of fixed architecture
addresses was an uninspired change, since it greatly complicates the
system software, requiring per-device definitions, instead of unique
per-architecture definitions.

in the microcontroller profile, the timer registers have fixed
per-architecture memory mapped addresses, common to all devices; once
implemented in the architecture support functions, application
developers need not worry about any implementation details.

btw, these kinds of small design 'features', actually design
inconsistencies and glitches, make writing embedded software for the
privileged profile quite an unpleasant experience.

regards,

Liviu

Liviu Ionescu

Mar 19, 2018, 4:37:36 AM

to kr...@berkeley.edu, jcb6...@gmail.com, Guy Lemieux, RISC-V ISA Dev

On 19 March 2018 at 04:59:57, Jacob Bachmeyer (jcb6...@gmail.com) wrote:

> Item 3 is particularly important for microcontrollers that are
> intended to control some physical process, as this greatly improves
> the predictability of execution timing.

yes, just that microcontrollers are not used exclusively for control
loops (not to mention that control loops should use accurate hardware
timers, and do not depend on core speed).

predictability of execution timing should be achievable, but should
not be mandatory for all devices; for most embedded applications, the
faster they run, the better, since they can go to sleep and save
power.

for example, Cortex-M devices, which have various 3-6 stage pipelines
and allow complex memory configurations, allow the caches to be
temporarily disabled, to achieve predictability of execution timings,
when necessary. or Cortex-M0 has a bit that controls if interrupt latency
accepts some jitter, or not (and in this case some operations are
artificially delayed to provide constant timings).

regards,

Liviu

Richard Herveille

Mar 19, 2018, 7:13:25 AM

to jcb6...@gmail.com, Liviu Ionescu, RISC-V ISA Dev, Richard Herveille

On 19/03/2018, 03:29, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:

Liviu Ionescu wrote:

On 18 March 2018 at 04:10:32, Jacob Bachmeyer (jcb6...@gmail.com) wrote:


The trick is that hardware can spill the oldest shadow set in the background while running the ISR. All you need are four instructions in the ISR that do not access memory.

four? I think this number depends on the ABI, which was not yet decided.

My proposal includes an eABI that declares all registers except for link registers (x1, x5) and return values (x10, x11) to be callee-saved. These caller-saved registers are included in shadow sets.

For simplicity purposes, if you go with shadow registers, then the entire register file should swap.

The ARM7 and ARM9 have partial shadow registers. That allowed fast arguments in and out of routines, but is a hardware nightmare.

In an FPGA implementation that would, most likely, mean splitting the RF into multiple block RAMS or a combination of block RAMs and flip flops. So, to keep this simple, you’ll need a full set of shadow registers. How then do you pass arguments?

Spilling Shadow Registers in the background means that a separate statemachine needs to keep track of the shadow-register-stack. What happens if the memory system can’t keep up? You’ll need to stall the CPU. This, and the actual spilling, leads to different interrupt latencies and non-deterministic behavior.

If you don’t want the extra gates for the statemachine, then SW controls the spilling and the advantages seem void.

For small micros area is a main concern. Adding shadow registers adds quite a bit of logic (maybe not in an FPGA, but definitely in an ASIC). The method you describe (as I understand it), requires additional hardware to handle the push/pop of the shadow registers into main memory.

Taken all of that into account, I am not in favour of it.

Again, SPARC had a big set of full shadow registers and they still spilled.

Richard

David Chisnall

Mar 19, 2018, 7:18:25 AM

to Krste Asanovic, jcb6...@gmail.com, Guy Lemieux, Liviu Ionescu, RISC-V ISA Dev

On 19 Mar 2018, at 04:17, Krste Asanovic <kr...@berkeley.edu> wrote:
>
>> This is where we start to encounter a dividing line between what is a proper application for a microcontroller and what is a proper application for a general processor. I expect that most applications using such complex SoCs should actually be using the general profile; the microcontroller profile I envision is intended for smaller systems, or as a component in a larger system, perhaps as one of the "smart devices" you mentioned being used with application processors, probably even embedded inside one of those SoCs as a peripheral to an application processor.
>>
>
> I think you’re ignoring many “proper applications” of microcontrollers. These do not need/want demand-paged virtual memory, but can have a much more complex memory system (in terms of types/levels of memory) than a same-generation application processor, and also with fast interrupts. There is an important market in tiny microcontrollers (e.g., 8051 replacements) but these are only one piece of a much bigger universe of non-Unix-capable embedded processors.

I completely agree. There are a lot of devices in the category of ‘microcontroller’ and I suspect that it’s a mistake putting them in the same platform definition. Some of the discussion so far has assumed devices are aggressively resource constrained, far below the size where your fixed costs for an IC completely dominate and there’s no point trying to save transistors. Presumably these constraints are assuming other things on die. I think that we need to split this into various categories, for example:

- Embedded controllers on a large SoC (perhaps including ARM A-profile application processors or larger RISC-V cores)

- Stand-alone RISC-V cores with cost / power as the primary constraint, typically with less than 8KB of SRAM.

- Larger microcontrollers with some kind of memory protection (but not paging), longer pipelines, typically with 32+KB SRAM and perhaps some custom accelerators

Within each of these, hard and soft realtime requirements are an orthogonal issue.

Trying to define a single profile for all of these is going to involve compromises that harm some of the use cases.

David

Liviu Ionescu

Mar 19, 2018, 7:47:13 AM

to Krste Asanovic, David Chisnall, RISC-V ISA Dev, Guy Lemieux, jcb6...@gmail.com

On 19 March 2018 at 13:18:22, David Chisnall (david.c...@cl.cam.ac.uk) wrote:

> Some of the discussion so far has assumed devices are aggressively
> resource constrained, far below the size where your fixed costs
> for an IC completely dominate and there’s no point trying to save
> transistors.

Well, yes, some proposals assume microcontrollers are very simple/cheap devices.

Such small devices should not be ignored, but generally
microcontrollers can be more powerful than that, even
multi-core/multi-hart, use floating point, memory protection, and
still do not support running full unix-like operating systems, which
is the dividing line I used
(https://github.com/emb-riscv/specs-markdown/blob/develop/improvements-upon-privileged.md#the-dividing-line)

> Trying to define a single profile for all of these is going to
> involve compromises that harm some of the use cases.

Do you envision devices that are not in one of the 3 sub-profiles
defined in the microcontroller profile proposal?

https://github.com/emb-riscv/specs-markdown/blob/develop/introduction.md#sub-profiles

If so, do you have a better proposal for the sub-profiles?

Regards,

Liviu

Jacob Bachmeyer

Mar 19, 2018, 7:37:16 PM

to Richard Herveille, Liviu Ionescu, RISC-V ISA Dev

Richard Herveille wrote:
>
> On 19/03/2018, 03:29, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
>
> Liviu Ionescu wrote:
>
> On 18 March 2018 at 04:10:32, Jacob Bachmeyer (jcb6...@gmail.com) wrote:
>
> The trick is that hardware can spill the oldest shadow set in the
> background while running the ISR. All you need are four instructions
> in the ISR that do not access memory.
>
> four? I think this number depends on the ABI, which was not yet
> decided.
>
> My proposal includes an eABI that declares all registers except for link
> registers (x1, x5) and return values (x10, x11) to be callee-saved.
> These caller-saved registers are included in shadow sets.
>
> For simplicity purposes, if you go with shadow registers, then the
> entire register file should swap.
>
> The ARM7 and ARM9 have partial shadow registers. That allowed fast
> arguments in and out of routines, but is a hardware nightmare.
>
> In an FPGA implementation that would, most likely, mean splitting the
> RF into multiple block RAMs or a combination of block RAMs and flip flops.
>

PRF in a single block RAM, using RAM rows beyond the 32 required for the
primary register file for shadow sets. The "rename table" is thus a
simple function of register number and status.LEVELS.
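
A sketch of what such a mapping could look like, assuming the four shadowed registers named earlier (x1, x5, x10, x11) and treating status.LEVELS as the current trap nesting depth. The row layout (shadow sets packed four rows each directly after the 32 primary rows) and the names `shadow_slot` and `prf_row` are assumptions for illustration only.

```c
#include <assert.h>

enum { PRIMARY_ROWS = 32, SHADOWED_PER_SET = 4 };

/* Position of a register inside a shadow set, or -1 if it is not shadowed. */
static int shadow_slot(unsigned reg) {
    switch (reg) {
    case 1:  return 0;   /* x1: link register */
    case 5:  return 1;   /* x5: alternate (millicode) link register */
    case 10: return 2;   /* x10: return value */
    case 11: return 3;   /* x11: return value */
    default: return -1;
    }
}

/* Map an architectural register and nesting level to a block-RAM row.
   Level 0 means no trap is active, so every register uses its primary row. */
static unsigned prf_row(unsigned reg, unsigned level) {
    assert(reg < 32);
    int slot = shadow_slot(reg);
    if (level == 0 || slot < 0)
        return reg;
    return PRIMARY_ROWS + (level - 1) * SHADOWED_PER_SET + (unsigned)slot;
}
```

Under this layout, x11 at nesting level 1 lands in row 35, and a non-shadowed register such as x2 always stays in row 2.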

> So, to keep this simple, you’ll need a full set of shadow registers.
> How then do you pass arguments?
>

For traps, you do not pass arguments; the hardware does. Shadow
registers are used only for trap handlers; normal function calls use the
existing convention with nearly all registers callee-saved.

> Spilling Shadow Registers in the background means that a separate
> statemachine needs to keep track of the shadow-register-stack. What
> happens if the memory system can’t keep up? You’ll need to stall the
> CPU. This, and the actual spilling, leads to different interrupt
> latencies and non-deterministic behavior.
>

Ideally, the register spill is performed during an otherwise unavoidable
latency period, such as the pipeline refill latency in a five-stage
pipeline.

> If you don’t want the extra gates for the statemachine, then SW
> controls the spilling and the advantages seem void.
>

They would be void and there would be no reason for shadow registers.
Software register spill is one of Liviu Ionescu's big complaints about
the general profile.

>
> For small micros area is a main concern. Adding shadow registers adds
> quite a bit of logic (maybe not in an FPGA, but definitely in an
> ASIC). The method you describe (as I understand it), requires
> additional hardware to handle the push/pop of the shadow registers
> into main memory.
>

The additional hardware is fairly small, as it drives the MEM stage
while the ISR instructions are moving into the pipeline. The shadow
sets themselves are presumed to be "spilled" into an internal (4*XLEN)xN
SRAM first; the oldest row from this SRAM is split up into 4 XLEN-bit
pieces and spilled to the stack in the background. Note that only the
top-most shadow registers are accessible to software in my proposal.
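
The last step, splitting one row into four XLEN-bit stores, can be modelled as below. This is a sketch assuming XLEN = 32 and a full-descending stack; the names `shadow_row` and `spill_row` and the store order are hypothetical, not taken from the proposal.

```c
#include <stdint.h>

typedef uint32_t xlen_t;                   /* assuming XLEN = 32 */
typedef struct { xlen_t reg[4]; } shadow_row;

/* Push one (4*XLEN)-bit row onto a full-descending stack, one XLEN-bit
   word per step, the way a background engine might issue one store per
   free MEM-stage slot. Returns the updated stack pointer. */
static xlen_t *spill_row(xlen_t *sp, const shadow_row *row) {
    for (int i = 0; i < 4; i++)
        *--sp = row->reg[i];
    return sp;
}
```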

> Taken all of that into account, I am not in favour of it.
>
> Again, SPARC had a big set of full shadow registers and they still
> spilled.
>

Will you suggest a better idea?

-- Jacob

Jacob Bachmeyer

Mar 19, 2018, 9:37:05 PM

to Liviu Ionescu, RISC-V ISA Dev

Liviu Ionescu wrote:
> On 19 March 2018 at 04:35:22, Jacob Bachmeyer (jcb6...@gmail.com) wrote:
>
>> Although, if hardware handles nested interrupts properly, there
>> should be no interrupt critical sections.
>>
>
> interrupt critical sections have nothing to do with nested interrupts,
> but with scheduler integrity.
>
> the scheduler has at least one critical data structure, the ready
> list, where all ready-to-run threads are linked. elements are removed
> during context switches and added when new threads become ready (for
> example when an ISR posts a semaphore or adds data to a user queue).
>
> on a single hart device, adding/removing elements from the ready list
> must be done within an interrupt critical section, to preserve the
> list integrity.
>
> similarly, there are other data structures in the scheduler that must
> be protected (list of timeouts, list of devices, etc).
>
> things start to become even more interesting on a multi-hart device,
> since disabling the interrupts is no longer enough, and other
> mechanisms must be used (like atomics).
>

If interrupt jitter must be avoided, then atomics are necessary even on
a single-hart system in order to avoid delaying interrupts.
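
As a rough illustration of that approach, here is a lock-free push onto a pending list using C11 atomics, which never masks interrupts. The list and the names (`node_t`, `pending_push`, `pending_take_all`) are hypothetical; on RISC-V a compiler would typically implement these operations with instructions from the A extension.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct node {
    struct node *next;
    int value;
} node_t;

static _Atomic(node_t *) pending_head = NULL;

/* Lock-free push: if an ISR (or another hart) changes the head between
   the load and the compare-exchange, the loop retries with the new head.
   Interrupts stay enabled throughout, so no latency is added. */
static void pending_push(node_t *n) {
    node_t *old = atomic_load_explicit(&pending_head, memory_order_relaxed);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak_explicit(
                 &pending_head, &old, n,
                 memory_order_release, memory_order_relaxed));
}

/* Detach the whole list with a single atomic exchange. */
static node_t *pending_take_all(void) {
    return atomic_exchange_explicit(&pending_head, NULL,
                                    memory_order_acquire);
}
```

Taking the whole list at once, instead of popping single nodes, sidesteps the ABA problem that a lock-free pop would have to handle.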

> thus, access to the interrupt enable bit, and, separately, to the
> interrupt threshold, should be as direct and as fast as possible.
>
> in the current privileged profile, access to the MIE bit can be done
> in one CSR instruction, but the threshold register is implementation
> specific.
>
> in the microcontroller profile I brought the threshold register into a
> CSR, and accessing it is also done with a single instruction, not to
> mention that having a common definition for all devices greatly
> simplifies software.
>

In my variant of the microcontroller profile, the interrupt enable bits
are per-vector and are in the "ie" CSR. Interrupts can be completely
disabled using "CSRRW x5, ie, x0" to load that CSR with zero and save
its previous value in the millicode link register. By first forming a
mask in x5 and using "CSRRW x5, ie, x5" a few high-priority interrupts
can remain enabled, avoiding the need to connect a critical interrupt
source to NMI.
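
A host-side model of those two uses of CSRRW, as a sketch only. `sim_ie` stands in for the proposed "ie" CSR, and `csrrw_ie`, `critical_enter` and `critical_exit` are hypothetical helper names; which bits count as high priority is likewise an assumption.

```c
#include <stdint.h>

static uint32_t sim_ie = 0xFFFFFFFFu;  /* host stand-in for the "ie" CSR */

/* Models "CSRRW rd, ie, rs": write rs to the CSR, return the old value. */
static uint32_t csrrw_ie(uint32_t rs) {
    uint32_t old = sim_ie;
    sim_ie = rs;
    return old;
}

/* Enter a critical section. keep_mask = 0 disables every vector;
   a non-zero mask leaves the selected vectors enabled. */
static uint32_t critical_enter(uint32_t keep_mask) {
    return csrrw_ie(keep_mask);
}

/* Leave the critical section by writing back the saved enable bits. */
static void critical_exit(uint32_t saved) {
    (void)csrrw_ie(saved);
}
```

One detail worth noting: CSRRW replaces the whole register, so a vector named in the mask ends up enabled even if it was disabled beforehand. Where that matters, CSRRC with the complement of the mask clears only the unwanted bits and still returns the previous value.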

> in this respect, moving the mtime and mtimecmp registers to
> implementation specific addresses instead of fixed architecture
> addresses was an uninspired change, since it greatly complicates the
> system software, requiring per-device definitions, instead of unique
> per-architecture definitions.
>
> in the microcontroller profile, the timer registers have fixed
> per-architecture memory mapped addresses, common to all devices; once
> implemented in the architecture support functions, application
> developers need not worry about any implementation details.
>

Agreed. While I do not propose to make timers mandatory, I do support
requiring that at least one timer, if implemented, use a standard MMIO
interface at a standard base address.

-- Jacob

Jacob Bachmeyer

Mar 19, 2018, 9:51:01 PM

to Krste Asanovic, Guy Lemieux, Liviu Ionescu, RISC-V ISA Dev

Krste Asanovic wrote:
>> On Mar 18, 2018, at 7:59 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

Correct. The tiny microcontrollers are also the niche where non-RVS
implementations of the general environment ISA fit poorly. I envision a
microcontroller environment ISA for that tiny microcontroller niche,
while larger microcontroller systems can use non-RVS implementations of
the general environment ISA. To be clear, I consider the general
environment ISA optimal for most systems, but suboptimal for the "tiny
microcontroller" niche, for which I suggest an alternate environment
ISA. Another goal I have here is simply the push to *have* some
workable process for the development and standardization of alternate
environment ISAs, for which I see "tiny microcontrollers" as a
reasonable motivation and suitable first alternate environment ISA.

I suspect that the proper scope for a specialized microcontroller
environment ISA is another point of disagreement between Liviu Ionescu
and myself. I currently believe that non-RVS implementations of the
general environment ISA *are* appropriate for larger microcontroller
applications. Perhaps there are better terms that could be used to more
clearly indicate this? Or is this an area where current computer
science simply does not have precise terminology, the once-precise terms
having become fuzzy due to technological advances? Should we call them
"nanocontrollers"? :-)

-- Jacob

Jacob Bachmeyer

Mar 19, 2018, 10:00:58 PM

to David Chisnall, Krste Asanovic, Guy Lemieux, Liviu Ionescu, RISC-V ISA Dev

David Chisnall wrote:
> On 19 Mar 2018, at 04:17, Krste Asanovic <kr...@berkeley.edu> wrote:
>
>>> This is where we start to encounter a dividing line between what is a proper application for a microcontroller and what is a proper application for a general processor. I expect that most applications using such complex SoCs should actually be using the general profile; the microcontroller profile I envision is intended for smaller systems, or as a component in a larger system, perhaps as one of the "smart devices" you mentioned being used with application processors, probably even embedded inside one of those SoCs as a peripheral to an application processor.
>>>
>>>
>> I think you’re ignoring many “proper applications” of microcontrollers. These do not need/want demand-paged virtual memory, but can have a much more complex memory system (in terms of types/levels of memory) than a same-generation application processor, and also with fast interrupts. There is an important market in tiny microcontrollers (e.g., 8051 replacements) but these are only one piece of a much bigger universe of non-Unix-capable embedded processors.
>>
>
> I completely agree. There are a lot of devices in the category of ‘microcontroller’ and I suspect that it’s a mistake putting them in the same platform definition. Some of the discussion so far has assumed devices are aggressively resource constrained, far below the size where your fixed costs for an IC completely dominate and there’s no point trying to save transistors. Presumably these constraints are assuming other things on die. I think that we need to split this into various categories, for example:
>
> - Embedded controllers on a large SoC (perhaps including ARM A-profile application processors or larger RISC-V cores)
>

If you are referring to processors embedded as peripherals to a larger
core, such devices may be within scope for the specialized
microcontroller profile I propose. I expect this type of SoC to have a
"main core" (or many such) that implements the general environment ISA,
but some "smart peripherals" might internally use the microcontroller
environment ISA.

> - Stand-alone RISC-V cores with cost / power as the primary constraint, typically with less than 8KB of SRAM.
>

This is another target niche for the specialized microcontroller
environment ISA.

> - Larger microcontrollers with some kind of memory protection (but not paging), longer pipelines, typically with 32+KB SRAM and perhaps some custom accelerators
>

I believe that this category should use the general environment ISA,
probably without support for RVS.

> Within each of these, hard and soft realtime requirements are an orthogonal issue.
>

Agreed. I also believe that meeting realtime requirements becomes more
difficult as overall system complexity increases, so having an
ultra-simple variant for hard realtime seems beneficial to me. Having
that variant be simple enough that it can easily be embedded into a
larger system as a "realtime peripheral element" is simply another step.

> Trying to define a single profile for all of these is going to involve compromises that harm some of the use cases.
>

Also agreed, which is a reason that my proposal focuses on the low-end
category.

-- Jacob

Jacob Bachmeyer

Mar 19, 2018, 10:08:03 PM

to Liviu Ionescu, kr...@berkeley.edu, Guy Lemieux, RISC-V ISA Dev

Liviu Ionescu wrote:
> On 19 March 2018 at 04:59:57, Jacob Bachmeyer (jcb6...@gmail.com) wrote:
>
>> Item 3 is particularly important for microcontrollers that are
>> intended to control some physical process, as this greatly improves
>> the predictability of execution timing.
>>
>
> yes, just that microcontrollers are not used exclusively for control
> loops (not to mention that control loops should use accurate hardware
> timers, and do not depend on core speed).
>

A hardware timer driven from an external cesium clock still must fire
events on which the core acts. For this part, a control loop inevitably
depends on core speed, not in an absolute sense (within some range) but
that the core speed must be *constant*. If the timer fires an event and
the core can respond in 16~16384 cycles, the control loop is likely to
have problems, regardless of timer accuracy.
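
To put rough numbers on that range, the cycle counts can be converted to time. The 16 MHz core clock used in the note below is an assumption chosen purely for round figures, and `cycles_to_ns` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a cycle count to nanoseconds at the given core clock. */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t clock_hz) {
    assert(clock_hz > 0);
    return cycles * 1000000000ull / clock_hz;
}
```

At the assumed 16 MHz, 16 cycles is 1 microsecond while 16384 cycles is 1.024 milliseconds, so the response time would vary by a factor of about a thousand.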

> predictability of execution timing should be achievable, but should
> not be mandatory for all devices; for most embedded applications, the
> faster they run, the better, since they can go to sleep and save
> power.
>
> for example, Cortex-M devices, which have various 3-6 stage pipelines
> and allow complex memory configurations, allow the caches to be
> temporarily disabled, to achieve predictability of execution timings,
> when necessary. or Cortex-M0 has a bit that controls if interrupt latency
> accepts some jitter, or not (and in this case some operations are
> artificially delayed to provide constant timings).
>

Then timing predictability is a hardware feature. Implementations with
predictable timing are likely to omit some features like caches, instead
using a smaller amount of SRAM that can "keep pace" with the core directly.

This is another reason that I see a microcontroller environment ISA as
somewhat specialized: even larger microcontrollers are likely to do
well with the general environment ISA.

-- Jacob

Allen J. Baum

Mar 20, 2018, 12:49:25 AM

to jcb6...@gmail.com, Richard Herveille, Liviu Ionescu, RISC-V ISA Dev

Someone (attribution is too confused) wrote:
>>Taken all of that into account, I am not in favour of it.
>>
>>Again, SPARC had a big set of full shadow registers and they still spilled.
>>

Note that the application of the shadow register sets here is for fast interrupt response. SPARC used its register windows for general function calling, which is much more frequent, so you can't really use that argument to say it won't work.

Both ARM and Intel (in their late-lamented i960) have or had shadow register sets for fast interrupt handling.
--
**************************************************
* Allen Baum tel. (908)BIT-BAUM *
* 248-2286 *
**************************************************

Liviu Ionescu

Mar 20, 2018, 3:24:50 AM

to Richard Herveille, Allen J. Baum, jcb6...@gmail.com, RISC-V ISA Dev

On 20 March 2018 at 06:49:24, Allen J. Baum (allen...@esperantotech.com) wrote:

> Intel (in their late-lamented i960)

ah, don't mention it, in the early 2000s I used it in a project and I
had to port a scheduler to it; it was awkward, even the stack grew in
the other direction. :-(

regards,

Liviu

Liviu Ionescu

Mar 20, 2018, 3:32:17 AM

to jcb6...@gmail.com, RISC-V ISA Dev, kr...@berkeley.edu, Guy Lemieux

On 20 March 2018 at 04:08:02, Jacob Bachmeyer (jcb6...@gmail.com) wrote:

> ... "nanocontrollers"?

If we compare your desired configurations with the
multi-core/multi-hart configurations allowed by the microcontroller
profile I proposed, yes, I think 'nanocontrollers' is more accurate.
How about "The RISC-V nanocontroller profile"?

Regards,

Liviu

Liviu Ionescu

Mar 20, 2018, 4:05:42 AM

to jcb6...@gmail.com, RISC-V ISA Dev

On 20 March 2018 at 03:37:04, Jacob Bachmeyer (jcb6...@gmail.com) wrote:

> In my variant of the microcontroller profile, the interrupt enable
> bits are per-vector and are in the "ie" CSR. Interrupts can be
> completely disabled using "CSRRW x5, ie, x0" to load that CSR with
> zero and save its previous value in the millicode link register.

the implementation of an interrupt critical section looks like this:

{
xxx_t saved = swap_xxx(new_value);

// critical section

write_xxx(saved);
}

in other words, at section entry it makes some changes to the
interrupt registers while preserving the previous value, and at
section exit it restores the initial value.

if in your design you can implement each of the two functions as a
single instruction, it should be ok.

in terms of functionality, there are two variants:

- very simple devices completely disable/enable interrupts
- more advanced devices set the interrupt threshold register. the
threshold can be set either to the top priority, or to an intermediate
priority

> By first forming a
> mask in x5 and using "CSRRW x5, ie, x5" a few high-priority interrupts
> can remain enabled, avoiding the need to connect a critical interrupt
> source to NMI.

yes, this is a common practice; if the scheduler implements interrupt
critical sections by using an intermediate priority level threshold,
the top priorities remain enabled and can preempt the scheduler at any
time. this is fine, since these fast interrupts should not use any
scheduler calls.
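
The critical-section pattern and the mask variant discussed above can be sketched in C as follows. This is only a host-side model under stated assumptions: `ie_csr` is an ordinary variable standing in for the proposed "ie" CSR, `swap_ie()` models the CSRRW behaviour (write the new value, return the old one), and `FAST_IRQ_MASK`, `run_critical()` and `record_enabled_vectors()` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-vector interrupt-enable CSR ("ie").
 * On real hardware this would be a CSR access, not a C variable. */
static volatile uint32_t ie_csr = 0xFFFFFFFFu;

/* Models "CSRRW rd, ie, rs": write new_value, return the previous value. */
static uint32_t swap_ie(uint32_t new_value)
{
    uint32_t previous = ie_csr;
    ie_csr = new_value;
    return previous;
}

/* Models a plain CSR write, used to restore the saved state. */
static void write_ie(uint32_t value)
{
    ie_csr = value;
}

/* Assumed for illustration: vectors 30 and 31 are the fast interrupts
 * that stay enabled inside scheduler critical sections. */
#define FAST_IRQ_MASK 0xC0000000u

/* Runs body() with only the vectors in ie_during enabled, then restores
 * the exact previous enable state. */
static void run_critical(uint32_t ie_during, void (*body)(void))
{
    assert(body != NULL);
    uint32_t saved = swap_ie(ie_during);  /* section entry */
    body();                               /* critical section */
    write_ie(saved);                      /* section exit */
}

/* Example body: records which vectors were enabled while it ran. */
static uint32_t observed_ie;
static void record_enabled_vectors(void)
{
    observed_ie = ie_csr;
}
```

Calling `run_critical(0u, record_enabled_vectors)` models the "disable everything" variant, while `run_critical(FAST_IRQ_MASK, record_enabled_vectors)` keeps the fast vectors enabled. Note that writing a fixed mask also enables fast vectors that were disabled before entry; ANDing the mask with the previous value would avoid that, at the cost of an extra read.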

regards,

Liviu

Richard Herveille

Mar 20, 2018, 5:40:58 AM

to jcb6...@gmail.com, Liviu Ionescu, RISC-V ISA Dev, Richard Herveille

Again, there are more technologies than Linux and FPGAs.

For FPGAs that do provide block RAMs, the smallest size would be enough to hold 8 full sets of 32-bit shadow registers (so 8 x 32 x 32 bits).

So, to keep this simple, you’ll need a full set of shadow registers.
How then do you pass arguments?

For traps, you do not pass arguments; the hardware does. Shadow registers are used only for trap handlers; normal function calls use the existing convention with nearly all registers callee-saved.

Being pedantic … interrupts. Traps would be software exceptions and they typically do pass arguments.

Spilling Shadow Registers in the background means that a separate
state machine needs to keep track of the shadow register stack. What
happens if the memory system can’t keep up? You’ll need to stall the
CPU. This, and the actual spilling, leads to different interrupt
latencies and non-deterministic behavior.

Ideally, the register spill is performed during an otherwise unavoidable latency period, such as the pipeline refill latency in a five-stage pipeline.

We’re talking (tiny) microcontrollers here. I doubt most would have a 5-stage pipeline.

If you don’t want the extra gates for the state machine, then SW
controls the spilling and the advantages seem void.

They would be void and there would be no reason for shadow registers.

Software register spill is one of Liviu Ionescu's big complaints about the general profile.

Yes, hence his proposal to reduce the number of registers that must be pushed/popped, and the suggested ‘movem’ instruction.

For small micros area is a main concern. Adding shadow registers adds
quite a bit of logic (maybe not in an FPGA, but definitely in an
ASIC). The method you describe (as I understand it), requires
additional hardware to handle the push/pop of the shadow registers
into main memory.

The additional hardware is fairly small, as it drives the MEM stage while the ISR instructions are moving into the pipeline. The shadow sets themselves are presumed to be "spilled" into an internal (4*XLEN)xN SRAM first; the oldest row from this SRAM is split up into 4 XLEN-bit pieces and spilled to the stack in the background. Note that only the top-most shadow registers are accessible to software in my proposal.
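
A minimal host-side sketch of the spill path described in the paragraph above, under loudly stated assumptions: XLEN is taken as 32, N is taken as 2, the memory stack is a small array that grows downward, and the oldest row is spilled only when the internal SRAM is full (the proposal says "in the background" without fixing the trigger). All names (`spill_model`, `shadow_push`, and so on) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define REGS_PER_ROW 4   /* each SRAM row holds 4 XLEN-bit registers */
#define SRAM_ROWS    2   /* "N" in the description; value assumed */
#define STACK_WORDS  64  /* size of the modelled memory stack; assumed */

typedef struct {
    uint32_t rows[SRAM_ROWS][REGS_PER_ROW]; /* internal (4*XLEN) x N SRAM */
    int oldest;                              /* index of the oldest row */
    int count;                               /* rows currently held */
    uint32_t stack[STACK_WORDS];             /* memory stack, grows down */
    int sp;                                  /* index of the last pushed word */
} spill_model;

static void spill_init(spill_model *m)
{
    memset(m, 0, sizeof *m);
    m->sp = STACK_WORDS;                     /* empty stack */
}

/* Splits the oldest row into XLEN-bit words and pushes them to the stack. */
static void spill_oldest_row(spill_model *m)
{
    assert(m->count > 0 && m->sp >= REGS_PER_ROW);
    for (int i = 0; i < REGS_PER_ROW; i++) {
        m->stack[--m->sp] = m->rows[m->oldest][i];
    }
    m->oldest = (m->oldest + 1) % SRAM_ROWS;
    m->count--;
}

/* Models interrupt entry: stores one shadow set, spilling first if full. */
static void shadow_push(spill_model *m, const uint32_t regs[REGS_PER_ROW])
{
    if (m->count == SRAM_ROWS) {
        spill_oldest_row(m);
    }
    int slot = (m->oldest + m->count) % SRAM_ROWS;
    memcpy(m->rows[slot], regs, sizeof m->rows[slot]);
    m->count++;
}
```

With N = 2, the first two nested interrupts stay entirely in the internal SRAM; only the third forces four word-sized writes to the memory stack, which is the traffic whose timing the surrounding discussion is concerned with.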

We’re talking tiny microcontrollers (‘nanocontrollers’, as they have been called).

The additional hardware is not fairly small.

There might not be a pipeline.

Your proposal adds memory for the shadow registers. You also seem to suggest only parts of the RF are shadowed (like the older ARMs), which requires breaking the RF into sections or adding address decoding.

Now you’re adding an additional internal SRAM. This seems overly complex: shadow registers->internal memory->external memory. Why not simply reduce the number of registers to push/pop? For speed purposes the stack would be in internal memory.

Taken all of that into account, I am not in favour of it.

Again, SPARC had a big set of full shadow registers and they still
spilled.

Will you suggest a better idea?

Reduce the number of registers that must be saved.

Either use regular load/store (only a few registers, stack in internal memory) or provide a ‘movem’. The disadvantage of a ‘movem’ is that it’s a big instruction, meaning it requires many bits. Full flexibility requires 31 bits to specify which registers to move (x0 is always zero), which does not fit in a 32-bit instruction alongside the opcode. So that’s not possible. For RVE there would only be 15 registers to specify, but that still means there won’t be an RVC version of movem. You could further reduce the registers that can be moved, but at what point does it still make sense to add this instruction?
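
The bit-budget argument above can be checked with a small sketch. The ‘movem’ instruction has no defined encoding, so everything about it here is hypothetical: one mask bit per register except x0, and no base-register or offset field counted (which makes the check optimistic). The 7-bit major opcode of 32-bit instructions and the 2-bit quadrant field of 16-bit compressed instructions are the standard RISC-V field widths.

```c
#include <assert.h>
#include <stdint.h>

enum {
    RV32_INSN_BITS   = 32, /* standard-length RISC-V instruction */
    RV32_OPCODE_BITS = 7,  /* major opcode field of a 32-bit instruction */
    RVC_INSN_BITS    = 16, /* compressed instruction */
    RVC_OP_BITS      = 2   /* low two bits select the compressed quadrant */
};

/* One mask bit per register except x0, which is hard-wired to zero. */
static int movem_mask_bits(int num_regs)
{
    return num_regs - 1;
}

/* Returns 1 if opcode + mask fit in the instruction, 0 otherwise.
 * Optimistic: a base-register or offset field is not counted. */
static int movem_fits(int insn_bits, int opcode_bits, int num_regs)
{
    return opcode_bits + movem_mask_bits(num_regs) <= insn_bits;
}

/* Adds register x<reg> to a hypothetical movem mask: bit (reg-1) selects it. */
static uint32_t movem_mask_add(uint32_t mask, int reg)
{
    assert(reg >= 1 && reg <= 31);
    return mask | (UINT32_C(1) << (reg - 1));
}
```

Under these assumptions `movem_fits(RV32_INSN_BITS, RV32_OPCODE_BITS, 32)` is false (7 + 31 bits), `movem_fits(RV32_INSN_BITS, RV32_OPCODE_BITS, 16)` is true (7 + 15 bits), and `movem_fits(RVC_INSN_BITS, RVC_OP_BITS, 16)` is false (2 + 15 bits), matching the three cases in the paragraph.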

If the overhead can be reduced to 4 registers that must be saved, then I suggest simply using regular load/store instructions, for which there are RVC variants.

Richard

-- Jacob

Richard Herveille

Mar 20, 2018, 5:44:45 AM

to Allen J. Baum, jcb6...@gmail.com, Liviu Ionescu, RISC-V ISA Dev, Richard Herveille

On 20/03/2018, 05:49, "Allen J. Baum" <allen...@esperantotech.com> wrote:

Someone (attribution is too confused) wrote:

Taken all of that into account, I am not in favour of it.

Again, SPARC had a big set of full shadow registers and they still spilled.

Note that the application of the shadow register sets here is for fast interrupt response. SPARC used its register windows for general function calling, which is much more frequent, so you can't really use that argument to say it won't work.

Fair enough. But the argument that there are never enough is still valid.

Both ARM and Intel (in their late-lamented i960) have or had shadow register sets for fast interrupt handling.

ARM had a partial shadow register set, which could be used for both.

How many shadow registers do you want to support? At some point it is not enough and you still need to spill. So just always spill, but make that fast and predictable.

It all costs area (and hence cost), which in this market segment is a driving factor, if not THE driving factor.

Richard

Liviu Ionescu

Mar 20, 2018, 6:09:31 AM

to jcb6...@gmail.com, Richard Herveille, Allen J. Baum, RISC-V ISA Dev

On 20 March 2018 at 11:44:43, Richard Herveille (richard....@roalogic.com) wrote:

> or provide a ‘movem’. The disadvantage of a ‘movem’ is that it’s
> a big instruction; meaning it requires many bits. Full flexibility
> requires 31bits to specify which registers to move (x0 is always
> zero). So that’s not possible. For RVE there would only be 15 registers
> to specify. But that means there won’t be an RVC version of movem.

I thought of a 'movem' instruction; it would make the scheduler context
switch assembly routine shorter, but if the implementation takes
exactly N cycles for N registers, and separate moves always take 1
cycle, there is no real gain: a sequence of single-register moves is
as good as a single 'movem'.
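
The comparison above can be written down as a rough cost model. It assumes exactly what the paragraph assumes (one cycle per register transferred in both cases) and additionally assumes 32-bit encodings for both the individual stores and the hypothetical 'movem'; fetch and memory wait states are ignored. The `save_cost` type and both functions are illustrative names only.

```c
#include <assert.h>

typedef struct {
    int cycles;      /* register-transfer cycles only */
    int code_bytes;  /* instruction bytes in the routine */
} save_cost;

/* N individual 32-bit stores, one cycle each (assumed). */
static save_cost cost_single_stores(int num_regs)
{
    assert(num_regs >= 0);
    save_cost c = { num_regs * 1, num_regs * 4 };
    return c;
}

/* One hypothetical 32-bit movem taking one cycle per register (assumed). */
static save_cost cost_movem(int num_regs)
{
    assert(num_regs >= 0);
    save_cost c = { num_regs, 4 };
    return c;
}
```

For 15 registers both variants cost 15 cycles in this model; the only difference is code size (60 bytes against 4), which is the "shorter routine, no real gain in time" point.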

a question still remains unanswered: how complex is the logic to
automatically save/restore a sequence of registers? is it reasonable
to add it to the microcontroller profile? my guess is that if ARM
added it to the small Cortex-M0, the extra cost should be worth the
advantage of having interrupt handlers as plain C functions.

I received a comment that this logic requires the presence of the
'movem' instruction. I agree that both would benefit from sharing some
logic, but I don't think that 'movem' is mandatory.

Richard, do you have any thoughts on this?

Regards,

Liviu