RFC: RISC-V microcontroller profile


Liviu Ionescu

Mar 14, 2018, 6:29:51 AM
to RISC-V ISA Dev
(I apologise to the Linux audience, but currently I could not find an
email group dedicated to embedded RISC-V devices).


I did a study on using the RISC-V ISA in embedded devices and I
managed to identify several issues together with possible solutions.

The result is a proposal for a 'RISC-V microcontroller profile',
intended to complement the existing privileged profile:

https://github.com/emb-riscv/specs-markdown/blob/master/README.md


The main issues identified are interrupt latency and lack of C/C++ friendliness.

For microcontrollers, the main solutions to minimise latency are:

- a lighter EABI (preferably designed for only 16 registers, and
making use of the other 16 as a bonus, to look similar for both E and
I cores) together with
- hardware stacking/unstacking the interrupt context (which also
greatly improves C friendliness).

Otherwise RISC-V cores will never be able to compete with ARM Cortex-M
cores, which not only have nested interrupts, but advertise a maximum
of 12 cycles from the interrupt to the final C handler (!) and 10 cycles
to fully exit interrupts (not to mention tail-chaining and, when FP is
present, lazy saving of the FP registers).


Regards,

Liviu


p.s. feel free to reply directly to my email address; if this topic
triggers any interest for embedded devices, I suggest creating a
separate group, like embe...@groups.riscv.org, and moving the
discussions there.

Alex Bradbury

Mar 14, 2018, 7:04:23 AM
to Liviu Ionescu, RISC-V ISA Dev
On 14 March 2018 at 10:29, Liviu Ionescu <i...@livius.net> wrote:
> (I apologise to the Linux audience, but currently I could not find an
> email group dedicated to embedded RISC-V devices).
>
>
> I did a study on using the RISC-V ISA in embedded devices and I
> managed to identify several issues together with possible solutions.
>
> The result is a proposal for a 'RISC-V microcontroller profile',
> intended to complement the existing privileged profile:
>
> https://github.com/emb-riscv/specs-markdown/blob/master/README.md

Thanks for sharing your thoughts here Liviu.

>
> The main issues identified are interrupt latency and lack of C/C++ friendliness.
>
> For microcontrollers, the main solutions to minimise latency are:
>
> - a lighter EABI (preferably designed for only 16 registers, and
> making use of the other 16 as a bonus, to look similar for both E and
> I cores) together with

If this is desirable, I'd be strongly in favour of this being done as
a modification of the not-yet-finalised RV32E ABI rather than having
it be yet another ABI to live alongside ilp32, ilp32e, ilp32f, ilp32d,
lp64, lp64f and lp64d.

I think it's important to get some performance numbers on the options
here. How were you hoping to evaluate them - are there real-world
workloads we might analyse, or do you think that targeted
microbenchmarks are sufficient?

Best,

Alex

Liviu Ionescu

Mar 14, 2018, 8:04:59 AM
to Alex Bradbury, RISC-V ISA Dev
On 14 March 2018 at 13:04:22, Alex Bradbury (a...@asbradbury.org) wrote:

> ... strongly in favour of this being done as
> a modification of the not-yet-finalised RV32E ABI

fully agree.

> rather than having
> it be yet another ABI to live alongside ilp32, ilp32e, ilp32f, ilp32d,
> lp64, lp64f and lp64d.

that would be nice, but I'm afraid not enough.

my proposal for an RV32E ABI would be to save no more than 6 registers
(and no fewer than 4), plus the return address, the status register and
minimal status.

then preferably save the same registers on RV32I/RV64I. at the limit
we can save only 4 registers for RV32E, and 6 for the larger cores,
but the 16 registers now marked as 'saved by the caller' are way
too many.

> I think it's important to get some performance numbers on the options
> here. How were you hoping to evaluate them - are there real-world
> workloads we might analyse, or do you think that targeted
> microbenchmarks are sufficient?

that's a good question.

as prerequisites, I would assume that the stacking/unstacking mechanism
uses only internal fast RAM, which ideally requires 1 cycle per word,
otherwise results are not comparable.

if Cortex-M can stack/unstack a total of 8 (eight) 32-bit words in
only 12/10 cycles, matching it (at least) would be a good challenge to
start with.

right now, looking at the entry.S code used in Linux, I see that it
saves all 32 registers plus 6 CSRs. I don't have actual measurements,
but I would expect this to take at least 38 cycles, plus a good few
more cycles in the assembly logic used before/after calling the C
code. a rough estimate would be that now we are somewhere in the
50-60 range, without handling nesting and interrupt pre-emption.
adding an extra software stack for nesting might take a good few more
cycles, probably raising the total stacking time to 80-100 cycles, and
slightly less for unstacking.


regards,

Liviu


p.s. any chance to get a separate embe...@groups.riscv.org group?

Watson Ladd

Mar 14, 2018, 4:12:32 PM
to Liviu Ionescu, Alex Bradbury, RISC-V ISA Dev
On Wed, Mar 14, 2018 at 5:04 AM, Liviu Ionescu <i...@livius.net> wrote:
> On 14 March 2018 at 13:04:22, Alex Bradbury (a...@asbradbury.org) wrote:
>
>> ... strongly in favour of this being done as
>> a modification of the not-yet-finalised RV32E ABI
>
> fully agree.
>
>> rather than having
>> it be yet another ABI to live alongside ilp32, ilp32e, ilp32f, ilp32d,
>> lp64, lp64f and lp64d.
>
> that would be nice, but I'm afraid not enough.
>
> my proposal for a RV32E ABI would be to save no more than 6 registers
> (and no less than 4), plus the return address, the status register and
> minimal status.
>
> then preferably save the same registers by RV32I/RV64I. at the limit
> we can save only 4 registers for RV32E, and 6 for the larger cores,
> but 16 registers, as are now marked as 'saved by the caller', are way
> too many.

If the Cortex-M0 can save 8 32 bit words in 12 cycles, we should be
able to save 16 in 24. This is the only work that needs to be done in
an interrupt in hardware: now we are ready to jump to
an interrupt handler that looks just like a C function.

Liviu Ionescu

Mar 14, 2018, 4:42:38 PM
to Watson Ladd, RISC-V ISA Dev, Alex Bradbury
On 14 March 2018 at 22:12:31, Watson Ladd (watso...@gmail.com) wrote:

> If the Cortex-M0 can save 8 32 bit words in 12 cycles,

I double checked and for Cortex-M3/M4/M7 the latency is 12/10
(entry/exit). Cortex-M0+ is specified at 15 cycles and Cortex-M0 at 16
cycles. the non-FP stack frame is 8 words in all cases. Cortex-M0/M0+
also have an optional zero jitter feature.

> we should be able to save 16 in 24.

with the current ABI, I estimated at least 18 registers to be saved.
the same rule gives 27 cycles.
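
as a rough sketch, the proportional rule behind the 24 and 27 cycle figures (8 words stacked in 12 cycles on Cortex-M3/M4/M7) could be written in C as below. the function name `scaled_entry_cycles` is made up for illustration, and linear scaling is only the assumption used in this thread, not a measurement:

```c
#include <stdint.h>

/* Hypothetical helper: scale the Cortex-M3/M4/M7 entry figure quoted in the
   thread (8 words stacked in 12 cycles) linearly to `words` registers.
   Linear scaling is an assumption, not a measured result. */
static uint32_t scaled_entry_cycles(uint32_t words)
{
    const uint32_t ref_words  = 8u;   /* Cortex-M non-FP stack frame size */
    const uint32_t ref_cycles = 12u;  /* Cortex-M3/M4/M7 entry latency    */

    /* Round up, so that a partial cycle counts as a full one. */
    return (words * ref_cycles + ref_words - 1u) / ref_words;
}
```

under this assumption, `scaled_entry_cycles(16)` gives 24 and `scaled_entry_cycles(18)` gives 27, matching the two estimates above.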

> This is the only work that needs to be done in
> an interrupt in hardware: now we are ready to jump to
> an interrupt handler that looks just like a C function.

that would be nice.

for compatibility reasons, my proposal also supports the current ABI,
but a lighter ABI would further reduce latency to values comparable
with Cortex-M.


regards,

Liviu

Torbjørn Viem Ness

Mar 14, 2018, 5:20:27 PM
to RISC-V ISA Dev, watso...@gmail.com, a...@asbradbury.org
Hi,

This is probably an unconventional suggestion but I'm still an inexperienced student, so too young to know better I guess. =)

What if we added one more set of registers "behind" the ones that need to be saved upon entering an interrupt handler?
This way the entire status could just be "shifted out" in one cycle before jumping to the routine, and propagating the data to RAM or cache could be handled in the background to prepare for another context swap if necessary, and the latency would be invisible to the user (unless a new call occurs before it's done saving the data from the previous one).
Then after the handler completes, the previous context can simply be shifted back and be ready to go one cycle later.

Does this sound like a good idea (or even feasible), or would it be too expensive in terms of area and complexity seeing as we're talking about microcontrollers?

--
Torbjørn Ness
M.Sc. student, NTNU

Rogier Brussee

Mar 14, 2018, 5:30:20 PM
to RISC-V ISA Dev, watso...@gmail.com, a...@asbradbury.org


On Wednesday 14 March 2018 21:42:38 UTC+1, Liviu Ionescu wrote:
If I understand correctly, to save ra + N additional (word-sized) registers a0, a1, a2, ..., a5, t1, t2 on a 16-byte boundary, the following should do
(assuming the C extension is available[1]):

C.addi16sp ((N +3)>>2)                  # i.e. addi sp sp (((N +3)>>2) << 4)

jalr   t0 zero  Csavew - (N<<1)


where Csavew is the bit of millicode


Csavew-16:   C.swsp t2  (+32)
Csavew-14:   C.swsp t1  (+28)
Csavew-12:   C.swsp a5  (+24)
Csavew-10:   C.swsp a4  (+20)
Csavew-8:    C.swsp a3  (+16)
Csavew-6:    C.swsp a2  (+12)
Csavew-4:    C.swsp a1  (+8)
Csavew-2:    C.swsp a0  (+4)
Csavew:      C.swsp ra  (0)
             jr t0


If one architecturally fixes the address of Csavew, just as you defined addresses for memory-mapped CSRs (representable in 12 bits to avoid
an additional lui, or in 11 bits if one wants to avoid the negative half of the immediate range),
then that jalr  t0 zero  Csavew - (N<<1) could be _allowed_ to be implemented in hardware without demanding it.


[1] Mutatis mutandis, the same can be done without the C extension, but the addresses corresponding to registers would be different.





Liviu Ionescu

Mar 14, 2018, 5:33:36 PM
to Torbjørn Viem Ness, RISC-V ISA Dev, watso...@gmail.com, a...@asbradbury.org
On 14 March 2018 at 23:20:30, Torbjørn Viem Ness (tbn...@gmail.com) wrote:

> This is probably an unconventional suggestion but I'm still an unexperienced
> student, so too young to know better I guess. =)

don't worry, creativity has no age ;-)

> What if we added one more set of registers "behind" the ones that need to
> be saved upon entering an interrupt handler?

if I'm not terribly wrong, this technique is called 'shadow register
set', and it is used by some other architectures more concerned with
latency (MIPS, PIC32, maybe SPARC, possibly others).

it is probably the fastest solution.

> Does this sound like a good idea (or even feasible), or would it be too
> expensive in terms of area and complexity seeing as we're talking about
> microcontrollers?

I would say it is not cheap.


regards,

Liviu

Liviu Ionescu

Mar 14, 2018, 5:39:40 PM
to Rogier Brussee, RISC-V ISA Dev, watso...@gmail.com, a...@asbradbury.org
On 14 March 2018 at 23:30:22, Rogier Brussee (rogier....@gmail.com) wrote:

> ... to save ra + N additional (word sized) registers ... on a 16 byte boundary

in my proposal I considered an 8-byte alignment enough, but if a
16-byte boundary simplifies the logic or makes things faster, I see no
problem updating the specs to support it (the number of extra words
added must be preserved somewhere to de-adjust the stack pointer on
exit).

regards,

Liviu

Jacob Bachmeyer

Mar 14, 2018, 5:41:20 PM
to Torbjørn Viem Ness, RISC-V ISA Dev, watso...@gmail.com, a...@asbradbury.org
Torbjørn Viem Ness wrote:
> This is probaby an unconventional suggestion but I'm still an
> unexperienced student, so too young to know better I guess. =)
>
> What if we added one more set of registers "behind" the ones that need
> to be saved upon entering an interrupt handler?

You are asking for shadow registers.

> This way the entire status could just be "shifted out" in one cycle
> before jumping to the routine, and propagating the data to RAM or
> cache could be handled in the background to prepare for another
> context swap if necessary, and the latency would be invisible to the
> user (unless a new call occurs before it's done saving the data from
> the previous one).
> Then after the handler completes, the previous context can simply be
> shifted back and be ready to go one cycle later.
>
> Does this sound like a good idea (or even feasible), or would it be
> too expensive in terms of area and complexity seeing as we're talking
> about microcontrollers?

I am working on a similar proposal (there are some points of
disagreement between Liviu Ionescu and myself) that uses a shadow
register bank almost exactly as you suggest, including
spilling/reloading the inactive shadow registers into stack frames. The
main difference is that I will also propose an EABI with a minimum of
caller-saved registers, and only those EABI caller-saved registers are
shadowed.



-- Jacob

Liviu Ionescu

Mar 14, 2018, 5:52:16 PM
to Torbjørn Viem Ness, jcb6...@gmail.com, a...@asbradbury.org, RISC-V ISA Dev, watso...@gmail.com
On 14 March 2018 at 23:41:20, Jacob Bachmeyer (jcb6...@gmail.com) wrote:

> ... a similar proposal ... that uses a shadow register bank

any solution that provides better latency and/or ease of use, and
has an acceptable cost, is welcome.

from my point of view, for a real life device, that includes flash,
ram and lots of complex peripherals, architecture C/C++ friendliness
and performance (like latency) are more important than the number of
transistors in the core (within reasonable limits).


regards,

Liviu

Jacob Bachmeyer

Mar 14, 2018, 6:30:31 PM
to Liviu Ionescu, RISC-V ISA Dev
Liviu Ionescu wrote:
> (I apologise to the Linux audience, but currently I could not find an
> email group dedicated to embedded RISC-V devices).
>

This is the RISC-V ISA mailing list; while there may be many people here
interested primarily in Linux, I am quite certain that wider discussions
are appropriate.

> I did a study on using the RISC-V ISA in embedded devices and I
> managed to identify several issues together with possible solutions.
>

I have been working on a similar proposal, taking a different approach
to some of the issues Liviu Ionescu has raised. I have attached a
working draft and also seek comments and comparisons between our proposals.


-- Jacob
risc-v-microcontroller-system-isa.org

Rogier Brussee

Mar 14, 2018, 7:13:14 PM
to RISC-V ISA Dev, rogier....@gmail.com, watso...@gmail.com, a...@asbradbury.org


On Wednesday 14 March 2018 22:39:40 UTC+1, Liviu Ionescu wrote:
The 16-byte alignment is not essential, but it allows the use of addi16sp.

Essentially the same idea can be used with 4-byte alignment

addi sp sp N << 2

jalr   t0 zero  Csavew - (N<<1)



or 8 byte alignment

addi sp sp (((N +1)>>1) << 3)                       

jalr   t0 zero  Csavew - (N<<1)
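
As a cross-check of the three stack adjustments above, here is a minimal C sketch that rounds a frame of 32-bit words up to a given alignment. The name `frame_bytes` is hypothetical, and the sketch only models the size arithmetic, not the millicode itself:

```c
#include <stdint.h>

/* Hypothetical helper: bytes needed to hold `words` 32-bit words,
   rounded up to `align` bytes. `align` must be a power of two. */
static uint32_t frame_bytes(uint32_t words, uint32_t align)
{
    uint32_t raw = words * 4u;               /* 4 bytes per saved word   */
    return (raw + align - 1u) & ~(align - 1u); /* round up to alignment  */
}
```

For a count of N words this agrees with the expressions in this message: `frame_bytes(N, 4)` equals `N << 2`, `frame_bytes(N, 8)` equals `((N + 1) >> 1) << 3`, and `frame_bytes(N, 16)` equals `((N + 3) >> 2) << 4`. If ra is counted separately from the N registers, as the listing with ra at offset 0 suggests, the count to pass would be N + 1; that reading is an assumption on my part.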



Regards,
Rogier
 

Richard Herveille

Mar 15, 2018, 4:55:13 AM
to Liviu Ionescu, Torbjørn Viem Ness, RISC-V ISA Dev, watso...@gmail.com, a...@asbradbury.org, Richard Herveille

The problem with shadow registers is that you always run out and you still need to spill to main memory.

For an RVE implementation, which reduces the RF in half to save gates, it would be weird to double the memory now, just to implement a shadow register.

 

Richard

 

 

Richard Herveille

Managing Director

Phone +31 (45) 405 5681

Cell +31 (6) 5207 2230

richard....@roalogic.com

 

--

You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.

To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.

To post to this group, send email to isa...@groups.riscv.org.

 

Liviu Ionescu

Mar 15, 2018, 5:41:50 AM
to Torbjørn Viem Ness, Richard Herveille, RISC-V ISA Dev, watso...@gmail.com, a...@asbradbury.org
On 15 March 2018 at 10:55:11, Richard Herveille
(richard....@roalogic.com) wrote:

> The problem with shadow registers is that you always run out and you still need to spill
> to main memory.

you run out when the interrupt nesting gets deeper than the available
register banks; in most cases, the depth is 1, rarely 2, even more
rarely 3, and so on.

spilling can be done in parallel, while starting the handler.

but this method reveals a possible latency problem: if a high priority
interrupt occurs right after a series of other interrupts, and there
are no more register banks, it must wait for a previous spill to
complete, to free a register bank, leading to a jitter on the high
priority interrupt latency.

most applications tolerate a small jitter, but for applications that
implement control loops we might need a way to disable this mechanism
and provide constant latency (even if it is slightly higher).
Cortex-M0 has such a configuration bit to prevent jitter.
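
to make the jitter argument concrete, here is a toy C model. all names and cycle counts in it are assumptions for illustration only; it simply says that entry is cheap when a shadow bank is free, and otherwise has to wait for the spill still in progress:

```c
#include <stdint.h>

/* Toy model, all numbers are assumptions: entering a handler costs
   `switch_cycles` when a shadow bank is free; otherwise the core must
   first wait for the spill still in progress to finish. */
static uint32_t entry_latency(uint32_t free_banks,
                              uint32_t spill_cycles_left,
                              uint32_t switch_cycles)
{
    if (free_banks > 0u) {
        return switch_cycles;                 /* bank available: no waiting */
    }
    return spill_cycles_left + switch_cycles; /* wait for a bank to be freed */
}

/* Jitter is the spread between the worst and best case entry latency. */
static uint32_t entry_jitter(uint32_t worst_spill_cycles,
                             uint32_t switch_cycles)
{
    return entry_latency(0u, worst_spill_cycles, switch_cycles)
         - entry_latency(1u, 0u, switch_cycles);
}
```

in this model the jitter equals the longest spill that can still be pending, which is why a mode that always pays the worst-case latency would give a constant, if slightly higher, figure.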

> For an RVE implementation, which reduces the RF in half to save gates, it would be weird
> to double the memory now, just to implement a shadow register.

yes, this mechanism is not cheap. however, as Jacob suggested, only
the ABI caller-saved registers need to be shadowed/spilled, so, with a
lighter EABI, the extra cost may be kept to a minimum.

---

using shadow registers seems attractive at first sight, but I'm afraid
it brings more problems than it solves.

for the moment, regardless of the implementation, my conclusion is that
a light EABI with a small number of caller-saved registers, plus fast
hardware stacking/unstacking, seems required anyway.


regards,

Liviu

Michael Clark

Mar 15, 2018, 6:29:52 AM
to jcb6...@gmail.com, Torbjørn Viem Ness, RISC-V ISA Dev, watso...@gmail.com, a...@asbradbury.org

> On 14/03/2018, at 2:41 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> Torbjørn Viem Ness wrote:
>> This is probaby an unconventional suggestion but I'm still an unexperienced student, so too young to know better I guess. =)
>>
>> What if we added one more set of registers "behind" the ones that need to be saved upon entering an interrupt handler?
>
> You are asking for shadow registers.

I had asked for a similar feature called “Privileged Register Windows” that, unlike “Register Windows” on SPARC, are windowed only on privilege mode changes, not on all procedure calls. Only a0-a7 would be shared between different privilege modes and the remainder of the registers would be in a per-mode window. On an OoO core, this could be handled in the renamer, and in a single-issue core, access to the larger register file could be muxed by privilege mode.

This is very different from “Register Windows”, which have fallen out of favour because compiler register allocators can do the job better than windowing on regular procedure calls. Rather, “Privileged Register Windows” are a feature to minimise save/restore when handling synchronous or asynchronous traps between privilege levels.

Of course they don’t help the case of a nested trap in the same mode, but they could reduce syscall and asynchronous trap latency in a system where code tends to be operating in U mode and the interrupt bottom half is running in S mode.

>> This way the entire status could just be "shifted out" in one cycle before jumping to the routine, and propagating the data to RAM or cache could be handled in the background to prepare for another context swap if necessary, and the latency would be invisible to the user (unless a new call occurs before it's done saving the data from the previous one).
>> Then after the handler completes, the previous context can simply be shifted back and be ready to go one cycle later.
>>
>> Does this sound like a good idea (or even feasible), or would it be too expensive in terms of area and complexity seeing as we're talking about microcontrollers?
>
> I am working on a similar proposal (there are some points of disagreement between Liviu Ionescu and myself) that uses a shadow register bank almost exactly as you suggest, including spilling/reloading the inactive shadow registers into stack frames. The main difference is that I will also propose an EABI with a minimum of caller-saved registers, and only those EABI caller-saved registers are shadowed.
>
>
>
> -- Jacob

Rogier Brussee

Mar 15, 2018, 6:32:05 AM
to RISC-V ISA Dev, tbn...@gmail.com, richard....@roalogic.com, watso...@gmail.com, a...@asbradbury.org


On Thursday 15 March 2018 10:41:50 UTC+1, Liviu Ionescu wrote:
> On 15 March 2018 at 10:55:11, Richard Herveille
> (richard....@roalogic.com) wrote:
>
>> The problem with shadow registers is that you always run out and you still need to spill
>> to main memory.
>
> you run out when the interrupt nesting gets deeper than the available
> register banks; in most cases, the depth is 1, rarely 2, even rarely
> 3, and so on.
>
> spilling can be done in parallel, while starting the handler.
>
> but this method reveals a possible latency problem: if a high priority
> interrupt occurs right after a series of other interrupts, and there
> are no more register banks, it must wait for a previous spill to
> complete, to free a register bank, leading to a jitter on the high
> priority interrupt latency.
>
> most applications tolerate a small jitter, but for applications that
> implement control loops we might need a way to disable this mechanism
> and provide constant latency (even if it is slightly higher).
> Cortex-M0 has such a configuration bit to prevent jitter.
>
>> For an RVE implementation, which reduces the RF in half to save gates, it would be weird
>> to double the memory now, just to implement a shadow register.
>
> yes, this mechanism is not cheap. however, as Jacob suggested, only
> the ABI caller registers need to be shadowed/spilled, so, with a
> lighter EABI, the extra cost may be kept to a minimum.


_If_ you have all 31 + 1 registers available, what stops you from defining an ABI that sets aside, say, registers 22--31
exclusively for the highest level interrupts, e.g. using

x22 -> irq_ra
x23 -> irq_sp
x24 -> irq_tp
x25 -> irq_s0
x26 -> irq_t0
x27 -> irq_a0
x28 -> irq_a1
..
x31 -> irq_a4

 
(this assumes x3 = gp can still be used as a global pointer)

Liviu Ionescu

Mar 15, 2018, 6:38:07 AM
to Rogier Brussee, RISC-V ISA Dev, tbn...@gmail.com, richard....@roalogic.com, watso...@gmail.com, a...@asbradbury.org
On 15 March 2018 at 12:32:07, Rogier Brussee (rogier....@gmail.com) wrote:

> If_ you have all 31 + 1 registers available, what stops you from
> defining an ABI that that sets aside, say, registers 22--31
> exclusively for the highest level interrupts e.g. using ...

can you elaborate?

how would this work with nested interrupts?


regards,

Liviu

Rogier Brussee

Mar 15, 2018, 7:23:43 AM
to RISC-V ISA Dev, rogier....@gmail.com, tbn...@gmail.com, richard....@roalogic.com, watso...@gmail.com, a...@asbradbury.org


On Thursday 15 March 2018 11:38:07 UTC+1, Liviu Ionescu wrote:
It would not work, at least not without spilling the (fewer) reserved registers, exactly as with all other shadow schemes.

I wrote highest level (highest priority) interrupts assuming the interrupt necessarily runs uninterrupted to completion. In any case, I just wanted to point out that I think you could use the 32-register ISA as 16 registers and 16 "shadowy" registers, or with any other split, like 22 registers and 10 "shadowy" registers for interrupts or, for that matter, 16 registers for normal use, 10 registers for nestable interrupts that have to spill on entry, and 6 for non-nestable, uninterruptible highest priority interrupts.

 

Liviu Ionescu

Mar 15, 2018, 7:34:08 AM
to Rogier Brussee, RISC-V ISA Dev, tbn...@gmail.com, richard....@roalogic.com, watso...@gmail.com, a...@asbradbury.org
On 15 March 2018 at 13:23:45, Rogier Brussee (rogier....@gmail.com) wrote:

> ... interrupts assuming the interrupt necessarily
> runs uninterrupted to completion.

real-time systems **need** nested interrupts, this is one of the main
requirements for the microcontroller profile.

> ... you could use the 32 register ISA as 16 registers and 16
> "shadowy" registers

I still think we should design the basic RISC-V EABI with a set of 16
registers (for very small RV32E devices), and then extend it to a
sibling that has more registers (for RV32I/RV64I), making sure the
extra registers have no special meaning, so the compiler can use them
only for more local variables.

regards,

Liviu

kr...@berkeley.edu

Mar 15, 2018, 7:48:55 PM
to Liviu Ionescu, Rogier Brussee, RISC-V ISA Dev, tbn...@gmail.com, richard....@roalogic.com, watso...@gmail.com, a...@asbradbury.org

I'll shortly be sending out an invite to a new Foundation Task Group
we have formed to address adding fast interrupts to RISC-V.

Germane to this thread, one feature of the proposal under development
is to standardize interrupt attribute annotations so C compilers can
generate interrupt handlers that only save registers as needed. This
effectively changes the calling conventions just for the handlers but
leaves the rest of the ABI unchanged.

/* Not real code, just a sketch. */
extern volatile int *DEVICE;
extern volatile int *COUNT;

void __attribute__ ((interrupt))
foo(void) {
    *DEVICE = 0;   /* Poke the device to clear the interrupt. */
    (*COUNT)++;    /* Increment the counter, not the pointer. */
}

A rough sketch of what a generated handler looks like is:

# Small ISR that pokes device to clear interrupt, and increments in-memory counter.

    .align 3                   # Has to be 8-byte aligned.
foo:
    addi sp, sp, -16           # Create a frame on stack.
    sw s0, 0(sp)               # Save working register.
    sw x0, DEVICE, s0          # Clear interrupt flag.
    sw s1, 4(sp)               # Save working register.
    la s0, COUNT               # Get counter address.
    li s1, 1
    amoadd.w x0, s1, (s0)      # Increment counter in memory.
    lw s1, 4(sp)               # Restore registers.
    lw s0, 0(sp)
    addi sp, sp, 16            # Free stack frame.
    mret                       # Return from handler using saved mepc.

This change will be useful even with the existing interrupt architecture,
but the TG will be looking at a new design that supports nested
interrupts. Our initial studies show a small core can take an interrupt,
enter, execute, and exit the handler above in less than 20 cycles,
while supporting preemption on any clock cycle (i.e., only a few cycles,
~3, to get to the first instruction).

Krste

Tommy Thorn

Mar 15, 2018, 7:58:17 PM
to kr...@berkeley.edu, Liviu Ionescu, Rogier Brussee, RISC-V ISA Dev, tbn...@gmail.com, richard....@roalogic.com, watso...@gmail.com, a...@asbradbury.org
> On Mar 15, 2018, at 16:48 , kr...@berkeley.edu wrote:
> I'll shortly be sending out an invite to a new Foundation Task Group
> we have formed to address adding fast interrupts to RISC-V.
>
> Germane to this thread, one feature of the proposal under development
> is to standardize interrupt attribute annotations so C compilers can
> generate interrupt handlers that only save registers as needed. This
> effectively changes the calling conventions just for the handlers but
> leaves the rest of the ABI unchanged.
>
> /* Not real code, just a sketch. */
> extern volatile int *DEVICE;
> extern volatile int *COUNT;
>
> void __attribute__ ((interrupt))
> foo() {
> *DEVICE = 0;
> *COUNT++;
> }
>
> A rough sketch of what a generated handler looks like is:
>
> # Small ISR that pokes device to clear interrupt, and increments in-memory counter.
>
> .align 3 # Has to be 8-byte aligned.
> foo:
> addi sp, sp, -16 # Create a frame on stack.

If the ABI had included a stack "red zone" with a small reservation for interrupts,
then the two "addi sp, " instructions could have been avoided in most cases.

> sw s0, 0(sp) # Save working register.

Presumably you meant to load s0 with a global pointer?

> sw x0, DEVICE, s0 # Clear interrupt flag.
> sw s1, 4(sp) # Save working register.
> la s0, COUNT # Get counter address.
> li s1, 1
> amoadd.w x0, (s0), s1 # Increment counter in memory.
> lw s1, 4(sp) # Restore registers.
> lw s0, 0(sp)
> addi sp, sp, 16 # Free stack frame.
> mret # Return from handler using saved mepc.

Tommy

Andrew Waterman

Mar 15, 2018, 8:16:12 PM
to Tommy Thorn, Krste Asanovic, Liviu Ionescu, Rogier Brussee, RISC-V ISA Dev, tbn...@gmail.com, Richard Herveille, watso...@gmail.com, Alex Bradbury
On Thu, Mar 15, 2018 at 4:58 PM, Tommy Thorn
<tommy...@esperantotech.com> wrote:
>> On Mar 15, 2018, at 16:48 , kr...@berkeley.edu wrote:
>> I'll shortly be sending out an invite to a new Foundation Task Group
>> we have formed to address adding fast interrupts to RISC-V.
>>
>> Germane to this thread, one feature of the proposal under development
>> is to standardize interrupt attribute annotations so C compilers can
>> generate interrupt handlers that only save registers as needed. This
>> effectively changes the calling conventions just for the handlers but
>> leaves the rest of the ABI unchanged.
>>
>> /* Not real code, just a sketch. */
>> extern volatile int *DEVICE;
>> extern volatile int *COUNT;
>>
>> void __attribute__ ((interrupt))
>> foo() {
>> *DEVICE = 0;
>> *COUNT++;
>> }
>>
>> A rough sketch of what a generated handler looks like is:
>>
>> # Small ISR that pokes device to clear interrupt, and increments in-memory counter.
>>
>> .align 3 # Has to be 8-byte aligned.
>> foo:
>> addi sp, sp, -16 # Create a frame on stack.
>
> If the ABI had included a stack "red zone" with a small reservation for interrupts,
> then the two "addi sp, " instructions could have been avoided in most cases.

In the non-preemptible case, the addi instructions can be elided
as-is. This example works for the (forthcoming) preemptible case, as
well.

>
>> sw s0, 0(sp) # Save working register.
>
> Presumedly you meant to load s0 with a global pointer?

That's what's going on here. "sw x0, DEVICE, s0" is "store x0 to
global symbol DEVICE using s0 as a temporary", i.e., it's syntactic
sugar for "1: auipc s0, %pcrel_hi(DEVICE); sw x0, %pcrel_lo(1b)(s0)"

>
>> sw x0, DEVICE, s0 # Clear interrupt flag.
>> sw s1, 4(sp) # Save working register.
>> la s0, COUNT # Get counter address.
>> li s1, 1
>> amoadd.w x0, (s0), s1 # Increment counter in memory.
>> lw s1, 4(sp) # Restore registers.
>> lw s0, 0(sp)
>> addi sp, sp, 16 # Free stack frame.
>> mret # Return from handler using saved mepc.
>
> Tommy
>
>
>>
>> This change will be useful even with existing interrupt architecture,
>> but TG will be looking at a new design that supports nested
>> interrupts. Our initial studies show a small core can take interrupt,
>> enter, execute, and exit the handler above in less than 20 cycles,
>> while supporting preemption on any clock cycle (i.e., only a few cycles ~3
>> to get to first instruction).
>>
>> Krste

Bruce Hoult

Mar 15, 2018, 8:26:18 PM
to Tommy Thorn, Krste Asanovic, Liviu Ionescu, Rogier Brussee, RISC-V ISA Dev, tbn...@gmail.com, Richard Herveille, watso...@gmail.com, Alex Bradbury
On Thu, Mar 15, 2018 at 4:58 PM, Tommy Thorn <tommy...@esperantotech.com> wrote:
> On Mar 15, 2018, at 16:48 , kr...@berkeley.edu wrote:
> I'll shortly be sending out an invite to a new Foundation Task Group
> we have formed to address adding fast interrupts to RISC-V.
>
> Germane to this thread, one feature of the proposal under development
> is to standardize interrupt attribute annotations so C compilers can
> generate interrupt handlers that only save registers as needed.  This
> effectively changes the calling conventions just for the handlers but
> leaves the rest of the ABI unchanged.
>
> /* Not real code, just a sketch. */
> extern volatile int *DEVICE;
> extern volatile int *COUNT;
>
> void __attribute__ ((interrupt))
> foo() {
>      *DEVICE = 0;
>      (*COUNT)++;
> }
>
> A rough sketch of what a generated handler looks like is:
>
> # Small ISR that pokes device to clear interrupt, and increments in-memory counter.
>
>        .align 3 # Has to be 8-byte aligned.
>     foo:
>        addi sp, sp, -16             # Create a frame on stack.

> If the ABI had included a stack "red zone" with a small reservation for interrupts,
> then the two "addi sp, " instructions could have been avoided in most cases.

I believe you've got that backwards.

A "red zone" is stack space below the Stack Pointer that may be used by normal leaf functions without adjusting the Stack Pointer.

If the ABI has a Red Zone then all interrupt service routines must subtract the size of the Red Zone from the Stack Pointer *in addition* to whatever space the interrupt routine will use.
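
The arithmetic behind this can be sketched in C. The helper name isr_stack_adjust is hypothetical; the 16-byte alignment matches the standard RISC-V calling convention, and the 128-byte red zone is borrowed from the x86-64 System V ABI purely as an illustration, since the RISC-V ABI defines no red zone.

```c
#include <stddef.h>

/* Bytes by which an interrupt handler must lower sp before its first store.
 *   red_zone: bytes below sp that interrupted leaf code may be using
 *   frame:    bytes the handler itself needs for saved registers
 *   align:    required stack alignment, must be a power of two
 */
size_t isr_stack_adjust(size_t red_zone, size_t frame, size_t align)
{
    size_t total = red_zone + frame;

    /* Round up to the next multiple of align. */
    return (total + align - 1) & ~(align - 1);
}
```

With no red zone, isr_stack_adjust(0, 8, 16) gives the 16 bytes used by the handler quoted above; with a 128-byte red zone, isr_stack_adjust(128, 8, 16) gives 144, so the red zone makes the adjustment larger rather than removing it.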
 
 
>        sw s0, 0(sp)                 # Save working register.

> Presumedly you meant to load s0 with a global pointer?

>        sw x0, DEVICE, s0            # Clear interrupt flag.
>        sw s1, 4(sp)                 # Save working register.
>        la s0, COUNT                 # Get counter address.
>        li s1, 1
>        amoadd.w x0, s1, (s0)        # Increment counter in memory.
>        lw s1, 4(sp)                 # Restore registers.
>        lw s0, 0(sp)
>        addi sp, sp, 16              # Free stack frame.
>        mret                         # Return from handler using saved mepc.

> Tommy


>
> This change will be useful even with existing interrupt architecture,
> but TG will be looking at a new design that supports nested
> interrupts.  Our initial studies show a small core can take interrupt,
> enter, execute, and exit the handler above in less than 20 cycles,
> while supporting preemption on any clock cycle (i.e., only a few cycles ~3
> to get to first instruction).
>
> Krste


Alex Bradbury

Mar 15, 2018, 11:41:50 PM
to Krste Asanovic, Liviu Ionescu, Rogier Brussee, RISC-V ISA Dev, tbn...@gmail.com, Richard Herveille, watso...@gmail.com
Why is 8-byte function alignment required?

Standardising an interrupt attribute similar to those supported by
compilers for most other targets would definitely be worthwhile.
Liviu's document raises the concern that you have to spill the
caller-saved registers in the case where your interrupt handler calls
a function compiled for the standard calling convention. Of course if
your ISR is calling functions where inlining can't be justified, your
interrupt handling is fairly heavyweight already, meaning the extra
overhead of saving caller-saved registers should be a smaller
percentage of execution time.
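
To put a rough number on that overhead: in the standard RISC-V integer calling convention the caller-saved registers are ra, t0-t6 and a0-a7. The C sketch below counts them and the stack bytes needed to spill them. The names (CALLER_SAVED_INT, count_regs, spill_bytes) are illustrative only, and floating-point registers are ignored.

```c
#include <stdint.h>

#define REG(n) (UINT32_C(1) << (n))

/* One bit per x-register that a callee may clobber:
 * x1 = ra, x5-x7 = t0-t2, x10-x17 = a0-a7, x28-x31 = t3-t6. */
const uint32_t CALLER_SAVED_INT =
    REG(1) | REG(5) | REG(6) | REG(7) |
    (UINT32_C(0xFF) << 10) | (UINT32_C(0xF) << 28);

/* Count the registers selected in mask. */
unsigned count_regs(uint32_t mask)
{
    unsigned n = 0;
    while (mask != 0) {
        n += mask & 1u;
        mask >>= 1;
    }
    return n;
}

/* Stack bytes needed to spill the selected registers (xlen_bits is 32 or 64). */
unsigned spill_bytes(uint32_t mask, unsigned xlen_bits)
{
    return count_regs(mask) * (xlen_bits / 8u);
}
```

That is 16 registers, or 64 bytes on RV32, each stored on entry and reloaded on exit, versus the two working registers saved by the small handler quoted earlier in the thread.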

Better understanding and characterising the workloads people are
struggling with would really help in defining the best solution here.

Best,

Alex

Krste Asanovic

Mar 16, 2018, 1:48:34 AM
to Alex Bradbury, Liviu Ionescu, Rogier Brussee, RISC-V ISA Dev, tbn...@gmail.com, Richard Herveille, watso...@gmail.com
That follows from the trap vector alignment constraint in a particular proposal.
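
For reference, ".align 3" in the listing means 2^3 = 8 bytes. With GCC or Clang the same constraint can be requested from C, as sketched below. The names tick_handler, tick_count and is_aligned are hypothetical, and the exact alignment the proposal requires is not stated in this thread beyond the 8 bytes shown in the listing.

```c
#include <stdint.h>

#if defined(__GNUC__)
#define ISR_ALIGN8 __attribute__((aligned(8)))  /* GCC/Clang extension */
#else
#define ISR_ALIGN8                              /* no portable equivalent */
#endif

static volatile int tick_count;  /* hypothetical counter touched by the handler */

/* Handler whose entry point is placed on an 8-byte boundary. */
ISR_ALIGN8 void tick_handler(void)
{
    tick_count++;
}

/* Returns 1 if addr is a multiple of 2^log2_align, 0 otherwise. */
int is_aligned(uintptr_t addr, unsigned log2_align)
{
    uintptr_t mask = ((uintptr_t)1 << log2_align) - 1;
    return (addr & mask) == 0;
}
```

A start-up check such as is_aligned((uintptr_t)tick_handler, 3) can then confirm the placement before the address is written to a vector table.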

> Standardising an interrupt attribute similar to those supported by
> compilers for most other targets would definitely be worthwhile.
> Liviu's document raises the concern that you have to spill the
> caller-saved registers in the case where your interrupt handler calls
> a function compiled for the standard calling convention. Of course if
> your ISR is calling functions where inlining can't be justified, your
> interrupt handling is fairly heavyweight already, meaning the extra
> overhead of saving caller-saved registers should be a smaller
> percentage of execution time.

Yes. Also, at some point regardless of calling convention, you should schedule long-running compute to a background thread scheduled at a more appropriate time, both to reduce interrupt latency and to avoid wasting processor time shuffling registers in ISR routines.

> Better understanding and characterising the workloads people are
> struggling with would really help in defining the best solution here.

Of course. The difficult part is persuading owners to share their workload.
Any offers?

Krste

> Best,
>
> Alex

Liviu Ionescu

Mar 16, 2018, 3:37:48 AM
to kr...@berkeley.edu, RISC-V ISA Dev, tbn...@gmail.com, Rogier Brussee, richard....@roalogic.com, watso...@gmail.com, a...@asbradbury.org
On 16 March 2018 at 01:48:53, kr...@berkeley.edu (kr...@berkeley.edu) wrote:

> void __attribute__ ((interrupt))
> foo() {
>     *DEVICE = 0;
>     (*COUNT)++;
> }
...
> Our initial studies show a small core can take interrupt,
> enter, execute, and exit the handler above in less than 20 cycles,
> while supporting preemption on any clock cycle (i.e., only a few cycles ~3
> to get to first instruction).

you can make a design like this and claim less than 20 cycles latency,
but most real applications need to call a plain C function from the
interrupt handler.

to be correct, you must measure latency from the interrupt to the
moment execution enters the plain C function.

can you estimate latency in this case? both entry and exit latencies
are important.


for the final design you must also consider cases like tail chaining
and late arrival.


regards,

Liviu

Albert Cahalan

Mar 16, 2018, 3:45:07 AM
to Liviu Ionescu, Rogier Brussee, RISC-V ISA Dev, tbn...@gmail.com, richard....@roalogic.com, watso...@gmail.com, a...@asbradbury.org
On 3/15/18, Liviu Ionescu <i...@livius.net> wrote:

> real-time systems **need** nested interrupts, this is one of the
> main requirements for the microcontroller profile.

No they don't. I've supported real-time systems as an OS developer
for two different OSes, MC/OS and RedHawk. Customers had a
fondness for running with interrupts disabled entirely. This caused
all sorts of fun for the OS's internal housekeeping tasks. There could
be no clock tick, yet the customer expects the clock to keep working!

In one case, the customer was trying to warp a mirror to aim a laser.
This had to overcome turbulent air bending the beam away from the
intended target. Failure is literally fatal, due to an incoming missile.

For one of those OSes, real-time tasks would run on cores that were
being babysat by the other cores. The cores with real-time tasks would
simply not take any interrupts at all.

The fastest way to get data is to spin waiting for it. That is, you poll
for just one thing continuously. You don't mess around with interrupts.
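
A minimal C sketch of such a polling loop follows. The function name poll_until_set and the spin limit are assumptions added for illustration; a real hard real-time loop might spin without any limit.

```c
#include <stdint.h>

/* Spin on a single status word until any bit in mask is set.
 * Returns 1 when the condition was seen, 0 if max_spins polls
 * passed without it (the limit is an illustrative safety net). */
int poll_until_set(const volatile uint32_t *status, uint32_t mask,
                   uint32_t max_spins)
{
    uint32_t i;

    for (i = 0; i < max_spins; i++) {
        /* volatile forces a fresh read of the device register each time */
        if ((*status & mask) != 0)
            return 1;
    }
    return 0;
}
```

The worst-case reaction time is then one pass through the loop, with no register saving or restoring involved, at the price of dedicating the core to that one task.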

> I still think we should design the basic RISC-V EABI with a set of 16
> registers (for very small RV32E devices), and, then extend it to a

Normal compilers hardly even use 16 when that is what is available.
I think going past 16 registers was not good, but this is water under
the bridge now. Dropping to 16 is a completely different architecture.
There comes a time to push ahead with what you have, flaws and all.

Liviu Ionescu

Mar 16, 2018, 4:02:18 AM
to Albert Cahalan, RISC-V ISA Dev, tbn...@gmail.com, Rogier Brussee, richard....@roalogic.com, watso...@gmail.com, a...@asbradbury.org
On 16 March 2018 at 09:45:06, Albert Cahalan (acah...@gmail.com) wrote:

> real-time tasks would run on cores that were
> being babysat by the other cores. The cores with real-time tasks
> would simply not take any interrupts at all.

yes, for hard real-time tasks this is probably the case, but I'm not
sure all applications can afford multi-core devices plus the added
cost of writing multi-core software; I would estimate that only the
top 10% of real-time applications are that extreme, and the rest of
them can still use simpler solutions, if properly designed.


regards,

Liviu

Liviu Ionescu

Mar 16, 2018, 4:31:22 AM
to Krste Asanovic, Alex Bradbury, RISC-V ISA Dev, tbn...@gmail.com, Rogier Brussee, Richard Herveille, watso...@gmail.com
On 16 March 2018 at 05:41:49, Alex Bradbury (a...@asbradbury.org) wrote:

> Standardising an interrupt attribute similar to those supported by
> compilers for most other targets would definitely be worthwhile.

interrupt attributes are common to the architectures of yesterday, if
RISC-V wants to be the architecture of the future, it should not look
only to the past.

modern microcontroller architectures use no attributes at all, the
hardware is able to call plain C functions directly.

for the privileged profile you can invent any attributes you like and
try to enforce inlining for the entire interrupt handler, but for the
microcontroller profile I think that hardware stacking/unstacking
coupled with a lite ABI can provide the best performance (with a target
of 12+10 cycles for the total entry/exit to a C function). plus, it
is unbeatable in terms of ease of use.


regards,

Liviu

kr...@berkeley.edu

Mar 16, 2018, 4:38:59 AM
to Liviu Ionescu, kr...@berkeley.edu, RISC-V ISA Dev, tbn...@gmail.com, Rogier Brussee, richard....@roalogic.com, watso...@gmail.com, a...@asbradbury.org

>>>>> On Fri, 16 Mar 2018 00:37:45 -0700, Liviu Ionescu <i...@livius.net> said:
| you can make a design like this and claim less than 20 cycles latency,
| but most real applications need to call a plain C function from the
| interrupt handler.

It is less than 20 cycles for this use case. There are many different
use cases, including common patterns represented by this example that
explicitly avoid complex code in the ISR itself.

| to be correct, you must measure latency from the interrupt to the
| moment execution enters the plain C function.

No, that is not the standard definition of interrupt latency. The
measure you propose is only interesting for a particular use case
where ISRs are compiled as standard C functions. I understand this
use case exists, but it is not the only one.

| can you estimate latency in this case? both entry and exit latencies
| are important.

Obviously this depends on the ABI when calling a standard compiled C
function, and for the standard ABI you will have to save/restore a lot
of registers.

| for the final design you must also consider cases like tail chaining
| and late arrival.

Yes - by providing comparable performance for the situations that led
to these architecture-specific optimizations, but not by necessarily
copying an existing design.

Krste


| regards,

| Liviu

Liviu Ionescu

Mar 16, 2018, 4:46:44 AM
to Krste Asanovic, Alex Bradbury, RISC-V ISA Dev, tbn...@gmail.com, Rogier Brussee, Richard Herveille, watso...@gmail.com
On 16 March 2018 at 07:48:33, Krste Asanovic (kr...@berkeley.edu) wrote:

> at some point regardless of calling convention, you should
> schedule long-running compute to a background thread scheduled
> at a more appropriate time, both to reduce interrupt latency
> and to avoid wasting processor time shuffling registers in ISR
> routines.

yes, two-tiered interrupt processing is the ideal textbook solution,
but few RTOSes/applications do it.

how do you suggest notifying the background thread to wake up and pick
up the job from where the ISR left it?


regards,

Liviu

kr...@berkeley.edu

Mar 16, 2018, 4:50:44 AM
to Liviu Ionescu, Krste Asanovic, Alex Bradbury, RISC-V ISA Dev, tbn...@gmail.com, Rogier Brussee, Richard Herveille, watso...@gmail.com
If the background thread was sleeping on WFI, it will now wake up and
check for work when control returns from ISR. If it was not sleeping,
it'll find the work at the end of its queue.
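
A minimal C sketch of this hand-off, assuming a single hart, one interrupt source posting work and one background loop consuming it. All names (isr_post, background_take, background_step, QUEUE_SLOTS) are hypothetical, and on non-RISC-V hosts the wfi is compiled out so the code can be tried on a desktop.

```c
#include <stdint.h>

#define QUEUE_SLOTS 8u  /* illustrative size, must be a power of two */

static volatile uint32_t queue[QUEUE_SLOTS];
static volatile uint32_t head;  /* advanced only by the ISR             */
static volatile uint32_t tail;  /* advanced only by the background loop */

/* Called from the ISR: record the work item and return quickly.
 * Returns 0 if the queue is full and the item was dropped. */
int isr_post(uint32_t item)
{
    if (head - tail == QUEUE_SLOTS)
        return 0;
    queue[head % QUEUE_SLOTS] = item;
    head = head + 1;
    return 1;
}

/* Called from the background loop: fetch the oldest item, if any. */
int background_take(uint32_t *item)
{
    if (head == tail)
        return 0;
    *item = queue[tail % QUEUE_SLOTS];
    tail = tail + 1;
    return 1;
}

static void wait_for_interrupt(void)
{
#if defined(__riscv)
    __asm__ volatile ("wfi");  /* sleep until an interrupt is pending */
#endif
}

/* One pass of the background loop: drain the queue, sleep if it was empty.
 * Adding into *sum stands in for the real long-running work. */
uint32_t background_step(uint32_t *sum)
{
    uint32_t item, handled = 0;

    while (background_take(&item)) {
        *sum += item;
        handled++;
    }
    if (handled == 0)
        wait_for_interrupt();
    return handled;
}
```

A production version would also have to close the window between finding the queue empty and executing wfi, for example by masking interrupts around the check; that detail is left out of the sketch.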

Krste

kr...@berkeley.edu

Mar 16, 2018, 4:53:32 AM
to Liviu Ionescu, Albert Cahalan, RISC-V ISA Dev, tbn...@gmail.com, Rogier Brussee, richard....@roalogic.com, watso...@gmail.com, a...@asbradbury.org
It is common to throw hardware at a problem to save software design
effort.

Krste


| regards,

| Liviu


kr...@berkeley.edu

Mar 16, 2018, 5:01:07 AM
to Liviu Ionescu, Krste Asanovic, Alex Bradbury, RISC-V ISA Dev, tbn...@gmail.com, Rogier Brussee, Richard Herveille, watso...@gmail.com
I agree this model is convenient to use, but there is a cost for the
convenience, and some would prefer to have the option to choose faster
interrupts, simpler hardware, and faster regular code instead in some
cases.

Krste


| regards,

| Liviu

kr...@berkeley.edu

Mar 16, 2018, 5:03:45 AM
to Albert Cahalan, Liviu Ionescu, Rogier Brussee, RISC-V ISA Dev, tbn...@gmail.com, richard....@roalogic.com, watso...@gmail.com, a...@asbradbury.org

>>>>> On Fri, 16 Mar 2018 03:45:04 -0400, Albert Cahalan <acah...@gmail.com> said:
|| I still think we should design the basic RISC-V EABI with a set of 16
|| registers (for very small RV32E devices), and, then extend it to a

| Normal compilers hardly even use 16 when that is what is available.
| I think going past 16 registers was not good, but this is water under
| the bridge now. Dropping to 16 is a completely different
| architecture.

Yes, the E variant exists.

| There comes a time to push ahead with what you have, flaws and all.

More than 16 registers is noticeably superior for high-performance
code using floating-point or a vector unit, and will also be very
useful in smaller systems using the P extension out of the x
registers.

Krste

Liviu Ionescu

Mar 16, 2018, 5:38:32 AM