Some thoughts on privilege changes, exceptions, and interrupts


Andrew Lutomirski

Oct 4, 2016, 4:20:13 PM10/4/16
to RISC-V ISA Dev
Hi all-

Background: I've never written a line of RISC-V anything in my life, but I have recently rewritten a large fraction of Linux's x86 entry code, and I thought I'd give my two cents on the privileged spec 1.9.

When a synchronous exception is delivered, it seems to me that it would be very helpful to record the faulting instruction into a CSR.  This way faults can be handled (and instructions can be emulated if needed) by higher-privileged code without needing to re-fetch the instruction.  This avoids races as well as any need to worry about memory protection that would allow, say, a user-executable instruction that is not readable by supervisor code without fiddling with the memory protection control bits.  Both VMX and SVM on x86 can do something like this for VM monitors only, and it's quite handy.  It seems like it would be straightforward for RISC-V to support it for all modes, especially given that instruction words are short.

The overall privilege exception mechanism looks like it will make it quite difficult for kernels to handle cases where an exception may occur during exception entry.  This can happen for a couple of reasons:
  • "perf"-style intentional NMI.  By design, these should be able to happen with as few restrictions as practical.
  • Recoverable memory failure.  Admittedly, Linux can't usefully recover from memory failure that triggers during exception entry.
  • Stack overflow.  (I implemented mostly reliable kernel stack overflow handling for Linux on x86 starting in version 4.9.)

It doesn't seem very pleasant to write the glue to handle these cases on RISC-V.  In particular, the exception entry code needs to save away all of the control state (sepc, scause, sbadaddr, etc), and if an exception nests before this is completed, then the outer exception's state is irretrievably lost.  I can see a way to work around this (save all the state to a non-stack location, then set some flag so that a nested exception will know that it's nested, then save again to the stack), but in my opinion it would be quite nice to have some form of hardware support.
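
A minimal sketch of that workaround (everything here is illustrative: sscratch is assumed to hold a per-hart pointer, and the SAVED_*/NESTED_FLAG offsets into that area are made up):

trap_entry:
    csrrw t0, sscratch, t0        # t0 <- per-hart area, old t0 parked in sscratch
    sd    t1, SAVED_T1(t0)        # free up a second working register
    ld    t1, NESTED_FLAG(t0)
    bnez  t1, nested_entry        # an earlier prologue was interrupted: handle specially
    li    t1, 1
    sd    t1, NESTED_FLAG(t0)     # mark "prologue in progress"
    csrr  t1, sepc
    sd    t1, SAVED_SEPC(t0)      # stash control state somewhere that is not the stack
    csrr  t1, scause
    sd    t1, SAVED_SCAUSE(t0)
    # ... recover old t0 from sscratch, pick a stack, build the real frame ...
    sd    x0, NESTED_FLAG(t0)     # clear the flag once the frame is complete

Even this leaves a window right after the first csrrw where sscratch does not hold the per-hart pointer, which is part of why some hardware support would be welcome.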


x86 avoids this issue entirely by pushing everything to the stack in microcode, but I can see why RISC-V would want to avoid this.

Michael Clark

Oct 4, 2016, 4:56:48 PM10/4/16
to Andrew Lutomirski, RISC-V ISA Dev
Makes sense to me: avoid re-parsing instructions when things are not atomic (a change of protection domain), and RISC-V is a register architecture. There is no concept of a stack other than the convention in the C ABI of the kernel that would typically be running (at the top of memory), e.g. the convention that x2 = sp.

Constraint: this limits the width of VLIW encodings depending on operating mode, i.e. 32-bit on RV32 up to 128-bit on RV128. It doesn't strictly limit instruction widths, but it limits the width at which an actual re-parse is required to the register width (versus the instruction fetch width, which could be higher than the register width). Not a major limitation in the common case of RV32/RV64.

An implementation in a custom mode where it executes dependency-free bundles of 8 instructions (256-bit VLIW bundles with RV64G 32-bit instruction encodings and no re-order logic in hardware) would have to do something pragmatic like store just the first 64 bits of the instruction fetch buffer in the faulting-instruction register. Such an implementation probably has interrupts disabled, so it's likely less of an issue.

Sent from my iPhone

Christopher Celio

Oct 4, 2016, 5:24:08 PM10/4/16
to Andrew Lutomirski, RISC-V ISA Dev
Very interesting feedback!

As a CPU guy (and I in no way speak for the RISC-V devs), I have a few concerns about the hw costs of this.

1) Although there are no standard RISC-V examples, RISC-V is a variable-length ISA that supports instructions > 4 bytes in size.  I believe instructions can in fact be arbitrarily large.

2) For something like an out-of-order core, I throw away the instruction bits after decode. I may have 200 instructions "in flight", and that's a lot of data to maintain. Even for small cores, the cost of maintaining 4 bytes through every pipeline stage may be noticeable. Conceivably, the core could refetch the excepting instruction and write it into a CSR before handing control over to the kernel, but that feels unsettling.

-Chris


Stefan O'Rear

Oct 4, 2016, 5:25:36 PM10/4/16
to Andrew Lutomirski, RISC-V ISA Dev
On Tue, Oct 4, 2016 at 1:20 PM, Andrew Lutomirski <aml...@gmail.com> wrote:
> Hi all-
>
> Background: I've never written a line of RISC-V anything in my life, but I
> have recently rewritten a large fraction of Linux's x86 entry code, and I
> thought I'd give my two cents on the privileged spec 1.9.

Thank you for taking the time; I am very pleased with the attention
RISC-V S-mode is getting recently.

> When a synchronous exception is delivered, it seems to me that it would be
> very helpful to record the faulting instruction into a CSR. This way faults
> can be handled (and instructions can be emulated if needed) by
> higher-privileged code without needing to re-fetch the instruction. This
> avoids races as well as any need to worry about memory protection that would
> allow, say, a user-executable instruction that is not readable by supervisor
> code without fiddling with the memory protection control bits. Both VMX and
> SVM on x86 can do something like this for VM monitors only, and it's quite
> handy. It seems like it would be straightforward for RISC-V to support it
> for all modes, especially given that instruction words are short.

RISC-V core instructions are short but the ISA allows (does not
require) arbitrarily long instructions. As such it's probably
necessary to have a mechanism to fetch whatever part of the
instruction is not captured in the buffer; there are already
mechanisms for S-mode to fake a U-mode data access, but not to fake a
U-mode instruction access, which is a problem since RISC-V supports
execute-only pages.

> The overall privilege exception mechanism looks like it will make it quite
> difficult for kernels to handle cases where an exception may occur during
> exception entry. This can happen for a couple of reasons:
>
> "perf"-style intentional NMI. By design, these should be able to happen
> with as few restrictions as practical.
> Recoverable memory failure. Admittedly, Linux can't usefully recover from
> memory failure that triggers during exception entry.
> Stack overflow.  (I implemented mostly reliable kernel stack overflow
> handling for Linux on x86 starting in version 4.9.)
>
> It doesn't seem very pleasant to write the glue to handle these cases on
> RISC-V. In particular, the exception entry code needs to save away all of
> the control state (sepc, scause, sbadaddr, etc), and if an exception nests
> before this is completed, then the outer exception's state is irretrievably
> lost. I can see a way to work around this (save all the state to a
> non-stack location, then set some flag so that a nested exception will know
> that it's nested, then save again to the stack), but in my opinion it would
> be quite nice to have some form of hardware support.
>
>
> x86 avoids this issue entirely by pushing everything to the stack in
> microcode, but I can see why RISC-V would want to avoid this.

The privileged spec as currently written does not allow exception
handlers to nest at a given privilege level. The act of taking an
exception at S-mode clears SIE, which prevents any S-mode exceptions
from being taken until SIE is re-enabled, which either happens
automatically on SRET or can be done by entry code after saving
SEPC/etc somewhere. If an exception cannot be deferred, then it would
have to be handled in M-mode; MEPC is a separate register from SEPC so
you have an implicit stack that way.
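
For concreteness, a minimal sketch of that deferral, assuming the entry code has already switched to a valid kernel stack (frame offsets are illustrative; sstatus.SIE is bit 1 in the 1.9 draft):

    csrr  t0, sepc
    sd    t0, 0(sp)
    csrr  t0, scause
    sd    t0, 8(sp)
    csrr  t0, sbadaddr
    sd    t0, 16(sp)
    csrr  t0, sstatus
    sd    t0, 24(sp)
    csrsi sstatus, 0x2            # set SIE: S-mode interrupts may nest from here on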

It seems as though there are some issues here but I don't understand
them well enough to make a proposal.

RISC-V has a concept of a "non-maskable interrupt" which, unlike the
x86 NMI, is literal: entering the NMI handler does not block NMI. It
has been stated earlier on the list that NMIs are not intended to ever
be recoverable.

"Just use a stack" is a half-answer because there is always going to
be a finite amount of memory available to the stack, so there must be
some means to prevent unbounded trap recursion.

-s

Samuel Falvo II

Oct 4, 2016, 5:30:02 PM10/4/16
to Stefan O'Rear, Andrew Lutomirski, RISC-V ISA Dev
On Tue, Oct 4, 2016 at 2:25 PM, Stefan O'Rear <sor...@gmail.com> wrote:
> The privileged spec as currently written does not allow exception
> handlers to nest at a given privilege level. The act of taking an
> exception at S-mode clears SIE, which prevents any S-mode exceptions
> from being taken until SIE is re-enabled, which either happens
> automatically on SRET or can be done by entry code after saving
> SEPC/etc somewhere.

I think this needs to be clarified: SIE works for interrupts, but will
not work for things like instruction alignment faults, page faults,
privilege violations, etc. Maybe I'm not understanding correctly, but
I interpreted Andrew's original question as talking about nested
exceptions in general, and not interrupts in particular. Since
supervisor mode runs in a translated address space, it's entirely
conceivable (perhaps through a bug) for a page fault handler to itself
generate a page fault in kernel-space. That's how I took his
question. Am I mistaken?

--
Samuel A. Falvo II

Andrew Waterman

Oct 4, 2016, 5:45:13 PM10/4/16
to Samuel Falvo II, Stefan O'Rear, Andrew Lutomirski, RISC-V ISA Dev
Nested exceptions (even interrupts) at the same privilege level are
possible -- Stefan meant they aren't provided by a hardware mechanism.
It's up to the kernel to save the register state (including sstatus)
to the stack before enabling interrupts or taking a synchronous
exception.

Also, it is certainly possible to handle kernel stack overflows in
RISC-V in a non-fatal manner; it just requires some software
chicanery.

>
> --
> Samuel A. Falvo II
>

Stefan O'Rear

Oct 4, 2016, 5:46:59 PM10/4/16
to Samuel Falvo II, Andrew Lutomirski, RISC-V ISA Dev
On Tue, Oct 4, 2016 at 2:30 PM, Samuel Falvo II <sam....@gmail.com> wrote:
Yes, what I gave above is a half-answer because I don't have a full
answer. Taking a page fault in "ordinary" kernel code is not a
problem; the problem is entry stubs, the code which saves registers to
the kernel stack immediately after a fault is taken. Now the OP is
currently/recently in a project to add stack overflow guard pages to
the Linux kernel (right now stacks are 16KB and often have unrelated
allocations contiguously below them); what happens if the register
save code hits a page fault? You now have a half-written register
save frame, and you might also have lost the original SEPC. How do
you recover?

-s

Andrew Lutomirski

Oct 4, 2016, 7:18:30 PM10/4/16
to RISC-V ISA Dev, aml...@gmail.com


On Tuesday, October 4, 2016 at 2:25:36 PM UTC-7, sorear2 wrote:
On Tue, Oct 4, 2016 at 1:20 PM, Andrew Lutomirski <aml...@gmail.com> wrote:
> Hi all-
>
> Background: I've never written a line of RISC-V anything in my life, but I
> have recently rewritten a large fraction of Linux's x86 entry code, and I
> thought I'd give my two cents on the privileged spec 1.9.

Thank you for taking the time; I am very pleased with the attention
RISC-V S-mode is getting recently.

:)  RISC-V looks like a very nice architecture.
 

> When a synchronous exception is delivered, it seems to me that it would be
> very helpful to record the faulting instruction into a CSR.  This way faults
> can be handled (and instructions can be emulated if needed) by
> higher-privileged code without needing to re-fetch the instruction.  This
> avoids races as well as any need to worry about memory protection that would
> allow, say, a user-executable instruction that is not readable by supervisor
> code without fiddling with the memory protection control bits.  Both VMX and
> SVM on x86 can do something like this for VM monitors only, and it's quite
> handy.  It seems like it would be straightforward for RISC-V to support it
> for all modes, especially given that instruction words are short.

RISC-V core instructions are short but the ISA allows (does not
require) arbitrarily long instructions.  As such it's probably
necessary to have a mechanism to fetch whatever part of the
instruction is not captured in the buffer; there are already
mechanisms for S-mode to fake a U-mode data access, but not to fake a
U-mode instruction access, which is a problem since RISC-V supports
execute-only pages.


That might be worth adding.  In general, if I were working on a RISC-V kernel or hypervisor, I think I'd be okay if only a bounded number of instruction bytes were reported.  The really long instructions are probably much less useful to emulate anyway.
 
As a very minimal proposal, what if there were just a few more scratch registers along with a move-CSR-to-CSR instruction (which may already exist -- I haven't paid enough attention to the encoding)?  Then the page fault entry could just stash the CSRs it cares about (which, in practice, might be just SEPC, but kernels could do whatever they need) into the extra scratch CSRs along with a flag saying "I'm in the page fault prologue".  Then a nested page fault could notice this flag and handle the stack overflow.

Alternatively, the STVEC register has the two low bits free.  What if one of those bits meant "don't update SEPC, etc on entry" or perhaps "write to SEPC2, etc on entry" where SEPC2 was an extra CSR?  Then the page fault entry could switch STVEC to a special nested-entry handler with that bit set and then restore it back to normal once it's done saving all of its state away.
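
A sketch of how this alternative might look in the prologue (SEPC2 and the meaning of the stvec low bit are purely the hypothetical ones proposed above, and t1 is assumed to have been freed up already, e.g. via sscratch):

    la    t1, nested_entry        # handler written to expect its state in SEPC2 etc.
    ori   t1, t1, 1               # proposed meaning: "write to SEPC2 on entry"
    csrrw t1, stvec, t1           # install it, keeping the normal vector in t1
    # ... save sepc/scause/sstatus and build the frame; a nested trap now lands in nested_entry ...
    csrw  stvec, t1               # restore the normal vector once the prologue is done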

--Andy

Stefan O'Rear

Oct 4, 2016, 7:41:52 PM10/4/16
to Andrew Lutomirski, RISC-V ISA Dev
On Tue, Oct 4, 2016 at 4:18 PM, Andrew Lutomirski <aml...@gmail.com> wrote:
> That might be worth adding. In general, if I were working on a RISC-V
> kernel or hypervisor, I think I'd be okay if only a bounded number of
> instruction bytes were reported. The really long instructions are probably
> much less useful to emulate anyway.

> As a very minimal proposal, what if there were just a few more scratch
> registers along with a move-CSR-to-CSR instruction (which may already exist
> -- I haven't paid enough attention to the encoding)? Then the page fault
> entry could just stash the CSRs it cares about (which, in practice, might be
> just SEPC, but kernels could do whatever they need) into the extra scratch
> CSRs along with a flag saying "I'm in the page fault prologue". Then a
> nested page fault could notice this flag and handle the stack overflow.

Yes, part of our problem is that the very beginning of the trap
handler is constrained by a lack of free registers. sscratch isn't
truly free because it's used to locate the spill area IIRC.

One potential complication is that we might want to minimize the
number of CSR accesses - BOOM treats all CSR accesses as serializing
instructions. There's speculation in the ISA spec that
implementations might be able to rename sscratch and mscratch, but I
don't know how that will work in practice, and I imagine renaming
stvec is out of the question.

I'm not sure a CSR-CSR move is needed (it definitely does not exist);
you can move through the GPR file once you've moved one GPR to the
spill area.

If a hardware interrupt arrives when the kernel stack is full, how do
you process the stack overflow without losing the interrupt whose
entry code faulted? The entry code presumably needs to look at scause
before deciding which stack to use.

> Alternatively, the STVEC register has the two low bits free. What if one of
> those bits meant "don't update SEPC, etc on entry" or perhaps "write to
> SEPC2, etc on entry" where SEPC2 was an extra CSR? Then the page fault
> entry could switch STVEC to a special nested-entry handler with that bit set
> and then restore it back to normal once it's done saving all of its state
> away.

I think only one of those bits is free if the chip implements RVC or
another odd-length instruction set.

-s

Andrew Lutomirski

Oct 4, 2016, 7:56:13 PM10/4/16
to RISC-V ISA Dev, aml...@gmail.com


On Tuesday, October 4, 2016 at 4:41:52 PM UTC-7, sorear2 wrote:
On Tue, Oct 4, 2016 at 4:18 PM, Andrew Lutomirski <aml...@gmail.com> wrote:
> That might be worth adding.  In general, if I were working on a RISC-V
> kernel or hypervisor, I think I'd be okay if only a bounded number of
> instruction bytes were reported.  The really long instructions are probably
> much less useful to emulate anyway.

> As a very minimal proposal, what if there were just a few more scratch
> registers along with a move-CSR-to-CSR instruction (which may already exist
> -- I haven't paid enough attention to the encoding)?  Then the page fault
> entry could just stash the CSRs it cares about (which, in practice, might be
> just SEPC, but kernels could do whatever they need) into the extra scratch
> CSRs along with a flag saying "I'm in the page fault prologue".  Then a
> nested page fault could notice this flag and handle the stack overflow.

Yes, part of our problem is that the very beginning of the trap
handler is constrained by a lack of free registers.  sscratch isn't
truly free because it's used to locate the spill area IIRC.

One potential complication is that we might want to minimize the
number of CSR accesses - BOOM treats all CSR accesses as serializing
instructions.  There's speculation in the ISA spec that
implementations might be able to rename sscratch and mscratch, but I
don't know how that will work in practice, and I imagine renaming
stvec is out of the question.

That sounds unfortunate.  Even the current Linux code does quite a few CSR accesses when handling traps (read all the interesting state on entry and write it back on exit), and if serializing instructions are anywhere near as expensive as they are on x86, this will destroy exception handling performance.

x86 can currently handle system calls without any serialization at all, and this ability is very nice.
 

I'm not sure a CSR-CSR move is needed (it definitely does not exist);
you can move through the GPR file once you've moved one GPR to the
spill area.

That may already be too late if that spill area is on the stack.
 

If a hardware interrupt arrives when the kernel stack is full, how do
you process the stack overflow without losing the interrupt whose
entry code faulted?  The entry code presumably needs to look at scause
before deciding which stack to use.

Off the top of my head, I'd look at scause or whatever scratch register I used to figure out the cause.  FWIW, I don't love this solution.
 

> Alternatively, the STVEC register has the two low bits free.  What if one of
> those bits meant "don't update SEPC, etc on entry" or perhaps "write to
> SEPC2, etc on entry" where SEPC2 was an extra CSR?  Then the page fault
> entry could switch STVEC to a special nested-entry handler with that bit set
> and then restore it back to normal once it's done saving all of its state
> away.

I think only one of those bits is free if the chip implements RVC or another odd-length
instruction set.


The 1.9 draft explicitly says it's 4-byte aligned.
 
-s

Stefan O'Rear

Oct 4, 2016, 8:05:19 PM10/4/16
to Andrew Lutomirski, RISC-V ISA Dev
On Tue, Oct 4, 2016 at 4:56 PM, Andrew Lutomirski <aml...@gmail.com> wrote:
> That sounds unfortunate. Even the current Linux code does quite a few CSR
> accesses when handling traps (read all the interesting state on entry and
> write it back on exit), and if serializing instructions are anywhere near as
> expensive as they are on x86, this will destroy exception handling
> performance.

Chris can comment more on whether we should be concerned. I have no
idea and it's clearly more "punted" than "impossible".

>> I'm not sure a CSR-CSR move is needed (it definitely does not exist);
>> you can move through the GPR file once you've moved one GPR to the
>> spill area.
>
> That may already be too late if that spill area is on the stack.

I'm thinking spill a "few" registers to a fixed area, then look at
scause to find out which stack you're supposed to use (system calls
and page faults both jump to stvec...), then once you have a stack and
have verified sufficient space, spill the rest of the registers and
copy out the contents of the fixed area.

-s

Andrew Lutomirski

Oct 4, 2016, 8:14:12 PM10/4/16
to RISC-V ISA Dev, aml...@gmail.com

I think this would require an ability to access per-cpu data without any prior setup.  This can be done with OS-provided per-hart vectors, I suppose.  But it could also be done with improved paging, which I'll post about in a sec.
 

-s

Stefan O'Rear

Oct 4, 2016, 8:17:06 PM10/4/16
to Andrew Lutomirski, RISC-V ISA Dev
;; sscratch=per-cpu
csrrw t0,sscratch,t0
sd t1,8(t0)
mv t1,t0
csrrw t0,sscratch,t0
sd t0,0(t1)
;; old t0 and t1 are now saved in per-cpu, t1 is a pointer to per-cpu
sd t2,16(t1) ;; if you need more registers for the "find my stack" code, repeat this as needed
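
(A possible second stage, added here to make the idea concrete -- not part of the original snippet; PCPU_KSTACK, the FRAME_* offsets and the labels are illustrative:)

csrr t2, scause              ;; safe: the old t2 was just saved at 16(t1)
bltz t2, use_irq_stack       ;; scause MSB set => interrupt => take the per-cpu irq stack
;; synchronous trap: switch to the task's kernel stack after parking the old sp
sd   sp, 24(t1)
ld   t2, PCPU_KSTACK(t1)     ;; per-cpu field holding the kernel stack top
;; ... verify there is room below t2 (else branch to an overflow handler) ...
mv   sp, t2
finish_spill:
ld   t2, 0(t1)               ;; copy the registers parked in the fixed area into the frame
sd   t2, FRAME_T0(sp)
ld   t2, 8(t1)
sd   t2, FRAME_T1(sp)
;; ... the remaining GPRs can now be stored straight to the stack ...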

-s

Andrew Waterman

Oct 4, 2016, 8:18:40 PM10/4/16
to Stefan O'Rear, Andrew Lutomirski, RISC-V ISA Dev
On Tue, Oct 4, 2016 at 5:05 PM, Stefan O'Rear <sor...@gmail.com> wrote:
> On Tue, Oct 4, 2016 at 4:56 PM, Andrew Lutomirski <aml...@gmail.com> wrote:
>> That sounds unfortunate. Even the current Linux code does quite a few CSR
>> accesses when handling traps (read all the interesting state on entry and
>> write it back on exit), and if serializing instructions are anywhere near as
>> expensive as they are on x86, this will destroy exception handling
>> performance.
>
> Chris can comment more on whether we should be concerned. I have no
> idea and it's clearly more "punted" than "impossible".

It's very possible -- as Andrew points out, Intel's x86 implementations
can, in certain circumstances, transition between privilege modes
without serializing, and they've got quite a bit more baggage to deal
with in such an operation. It's a mere matter of microarchitectural
complexity.

I think you'd get a lot of the benefit from renaming *scratch and
pattern-matching a few *status writes that don't have visible side
effects (e.g., no need to flush the pipeline when twiddling
interrupt-enables, if you're careful). Also, no need to serialize
when reading *epc/*cause, as long as you serialize on exceptions and
*epc/*cause writes.

By comparison, avoiding serializing on privilege transfers would
provide less improvement than the above, and would be substantially
more complex. Still doable, though.

>
>>> I'm not sure a CSR-CSR move is needed (it definitely does not exist);
>>> you can move through the GPR file once you've moved one GPR to the
>>> spill area.
>>
>> That may already be too late if that spill area is on the stack.
>
> I'm thinking spill a "few" registers to a fixed area, then look at
> scause to find out which stack you're supposed to use (system calls
> and page faults both jump to stvec...), then once you have a stack and
> have verified sufficient space, spill the rest of the registers and
> copy out the contents of the fixed area.
>
> -s
>

Christopher Celio

Oct 4, 2016, 8:22:11 PM10/4/16
to Stefan O'Rear, Andrew Lutomirski, RISC-V ISA Dev
One potential complication is that we might want to minimize the
number of CSR accesses - BOOM treats all CSR accesses as serializing
instructions.  There's speculation in the ISA spec that
implementations might be able to rename sscratch and mscratch, but I
don't know how that will work in practice, and I imagine renaming
stvec is out of the question.

Renaming CSRs like sscratch and mscratch is relatively straightforward -- you lie to the decoder, treat accesses to them as general-purpose registers, and make the rename map table slightly bigger (although such complexity could increase decode and rename latency).  In short, they become registers x32 and x33.

But that only works cleanly for CSRs that can be treated as general-purpose XLEN registers -- no quirky semantics, no "bit 2 does side-effect Y under privilege mode Z and also changes the base TLB address".  So there is a distinction between "go fast using extra HW complexity" CSRs and "handle carefully" CSRs.  Of course, you can brute-force your way through some of the more common scenarios, like the *status registers that sometimes don't have any side effects.

Chris can comment more on whether we should be concerned.  I have no
idea and it's clearly more "punted" than "impossible".

I haven't done this in BOOM yet because I didn't want to increase the "surface area" of bugs without A) a good set of torture tests that hit the privilege modes harder and B) good micro-benchmarks that require faster CSR accesses (both would be really cool for us to have as a community).


-Chris





Stefan O'Rear

Oct 4, 2016, 8:25:15 PM10/4/16
to Christopher Celio, Andrew Lutomirski, RISC-V ISA Dev
On Tue, Oct 4, 2016 at 5:22 PM, Christopher Celio
<ce...@eecs.berkeley.edu> wrote:
> One potential complication is that we might want to minimize the
> number of CSR accesses - BOOM treats all CSR accesses as serializing
> instructions. There's speculation in the ISA spec that
> implementations might be able to rename sscratch and mscratch, but I
> don't know how that will work in practice, and I imagine renaming
> stvec is out of the question.
>
>
> Renaming CSRs like sscratch and mscratch is relatively straight-forward -
> you lie to the decoder and treat accesses to them as general-purpose
> registers and make the rename map table slightly bigger (although such
> complexity could increase decode and rename latency). In short, they become
> registers x32 and x33.

How much complexity does the second write-port of csrrw t0,sscratch,t0 add?

-s

Krste Asanovic

Oct 4, 2016, 8:37:12 PM10/4/16
to Stefan O'Rear, Christopher Celio, Andrew Lutomirski, RISC-V ISA Dev
Doesn’t need any write ports at all.
Just a rename table update.

Krste

Christopher Celio

Oct 4, 2016, 9:05:10 PM10/4/16
to Krste Asanovic, Stefan O'Rear, Andrew Lutomirski, RISC-V ISA Dev
How much complexity does the second write-port of csrrw t0,sscratch,t0 add?

Great question.

As Krste said, the most clever way is to simply swap the map table entries.  The other way would be to break it into two micro-ops.  There may be a slight complication to make sure that the "atomic swap" part is semantically maintained, but if there is, it's not immediately obvious to me.

-Chris

Andrew Lutomirski

Oct 4, 2016, 9:31:12 PM10/4/16
to RISC-V ISA Dev, sor...@gmail.com, aml...@gmail.com


On Tuesday, October 4, 2016 at 5:18:40 PM UTC-7, waterman wrote:
On Tue, Oct 4, 2016 at 5:05 PM, Stefan O'Rear <sor...@gmail.com> wrote:
> On Tue, Oct 4, 2016 at 4:56 PM, Andrew Lutomirski <aml...@gmail.com> wrote:
>> That sounds unfortunate.  Even the current Linux code does quite a few CSR
>> accesses when handling traps (read all the interesting state on entry and
>> write it back on exit), and if serializing instructions are anywhere near as
>> expensive as they are on x86, this will destroy exception handling
>> performance.
>
> Chris can comment more on whether we should be concerned.  I have no
> idea and it's clearly more "punted" than "impossible".

It's very possible -- as Andrew points out, Intel's x86 implementations
can, in certain circumstances, transition between privilege modes
without serializing, and they've got quite a bit more baggage to deal
with in such an operation.  It's a mere matter of microarchitectural
complexity.

I think you'd get a lot of the benefit from renaming *scratch and
pattern-matching a few *status writes that don't have visible side
effects (e.g., no need to flush the pipeline when twiddling
interrupt-enables, if you're careful).  Also, no need to serialize
when reading *epc/*cause, as long as you serialize on exceptions and
*epc/*cause writes.

By comparison, avoiding serializing on privilege transfers would
provide less improvement than the above, and would be substantially
more complex.  Still doable, though.

AFAIK x86 doesn't serialize on privilege transfers via SYSCALL and SYSRET, and ECALL is even simpler.

Just to check, though: do you mean the same thing by "serialize"?  On x86, it means that the pipelines are flushed, nothing is fetched until the serializing instruction retires, and all pending stores are flushed as well.  This implies a full fence and more.  (I have no idea why.  On x86, having stores flushed out from the CPU into the cache coherency domain is not observable as far as I know.)  The upshot is that interrupt return takes some 1000 cycles.

Michael Clark

Oct 4, 2016, 9:46:12 PM10/4/16
to Stefan O'Rear, Andrew Lutomirski, RISC-V ISA Dev



Sent from my iPhone
> On 5/10/2016, at 12:41 PM, Stefan O'Rear <sor...@gmail.com> wrote:
>
> If a hardware interrupt arrives when the kernel stack is full, how do
> you process the stack overflow without losing the interrupt whose
> entry code faulted? The entry code presumably needs to look at scause
> before deciding which stack to use.

Which kernel stack is "the kernel stack"?

Isn't this very much based on architecture-dependent trap mechanisms (an architecture that switches stacks on interrupts)? If the architecture doesn't switch stacks in microcode, isn't there more freedom?

One would ask why we are processing interrupts on the kernel stack of a user task. Is this done? If it is, the interrupts could be for completely unrelated tasks, e.g. problems when receiving an interrupt on a core that is in a syscall on the current task's kernel stack.

My understanding of Linux x86 dual kernel/user stack handling is limited (some knowledge of the current task struct, i.e. thread, and the current mm struct {sptbr}, i.e. process).

The question would be: isn't the interrupt stack going to be tied to the IRQ, not the kernel stack of some random task that happens to be running when the interrupt is received?

Then one would have a buffer per interrupt source, and load sp from an interrupt vector in the scratch space? No stack would be in use while this is being done and interrupts are masked?

I need to learn: is there a stack per interrupt number (MSI table), or are interrupts processed on the kernel stack of the task that is currently running? Sorry if I sound dumb.

Andrew Waterman

Oct 4, 2016, 9:50:22 PM10/4/16
to Andrew Lutomirski, RISC-V ISA Dev, Stefan O'Rear
Most likely, a legacy OS from an Important Vendor depends on that
behavior. I meant just the pipeline flush.


Stefan O'Rear

Oct 4, 2016, 10:04:04 PM10/4/16
to Michael Clark, Andrew Lutomirski, RISC-V ISA Dev
Linux and PK currently use sscratch for the kernel stack pointer
itself, which implies there can only be one. Having one kernel stack
is a problem; the purpose of this thread is to workshop solutions.

I found the solution in my last post though: By using sscratch for a
per-cpu pointer and using the two-stage spill approach, you can have
as many kernel stacks as you need.

-s

Jacob Bachmeyer

Oct 5, 2016, 8:18:38 PM10/5/16
to Andrew Lutomirski, RISC-V ISA Dev
Andrew Lutomirski wrote:
> When a synchronous exception is delivered, it seems to me that it
> would be very helpful to record the faulting instruction into a CSR.
> This way faults can be handled (and instructions can be emulated if
> needed) by higher-privileged code without needing to re-fetch the
> instruction. This avoids races as well as any need to worry about
> memory protection that would allow, say, a user-executable instruction
> that is not readable by supervisor code without fiddling with the
> memory protection control bits.

An easy solution to the memory-protection issue would be for all user
pages to be implicitly read-write in S-mode. (Note that this does not
permit the supervisor to execute from a user page, but I believe that
the most common use for S-mode execution from a user page is exploiting
the supervisor.)

Race conditions could be a bit harder to address, but what would the
S-mode parsing of a user instruction on the same hart race with?

> It seems like it would be straightforward for RISC-V to support it for
> all modes, especially given that instruction words are short.

This is the fly in the ointment: currently instruction words are short,
but the instruction length is extensible and can exceed XLEN as others
have noted.

> The overall privilege exception mechanism looks like it will make it
> quite difficult for kernels to handle cases where an exception may
> occur during exception entry. This can happen for a couple of reasons:
>
> * "perf"-style intentional NMI. By design, these should be able
> to happen with as few restrictions as practical.
> * Recoverable memory failure. Admittedly, Linux can't usefully
> recover from memory failure that triggers during exception entry.
> * Stack overflow. (I implemented mostly reliable kernel stack
> overflow handling for Linux on x86 starting in version 4.9.)
>
> It doesn't seem very pleasant to write the glue to handle these cases
> on RISC-V. In particular, the exception entry code needs to save away
> all of the control state (sepc, scause, sbadaddr, etc), and if an
> exception nests before this is completed, then the outer exception's
> state is irretrievably lost. I can see a way to work around this
> (save all the state to a non-stack location, then set some flag so
> that a nested exception will know that it's nested, then save again to
> the stack), but in my opinion it would be quite nice to have some form
> of hardware support.
>
> x86 avoids this issue entirely by pushing everything to the stack in
> microcode, but I can see why RISC-V would want to avoid this.
>

I believe that x86 also has a notion of a separate "double fault"
handler, but that works on x86 because x86 exceptions are more-or-less
software interrupts, whereas RISC-V uses "fast traps" to a common handler
for everything including interrupts. Section 3.1.12 in the privileged
ISA draft gives some information about trap handling, but it seems that
you are right--if a page fault occurs while the supervisor is saving the
exception context, the exception context is trashed.

Currently, all traps are notionally delivered first to machine mode.
Traps (and interrupts) can be delegated in hardware to less-privileged
modes, in which case the M-mode trap handler need not run, or in
software, by an M-mode trap handler that sets the CSRs and executes xRET
to "return" to a less-privileged trap handler. However, trap handlers
are always entered with interrupts disabled.

This leads to a possible workaround: inhibit delegation to S-mode if
SIE is clear--the SEE trap handler can save the pending trap state
somewhere and deliver the nested trap, but I am not certain how the SEE
would regain control and redeliver the original trap. This would seem
to require nested trap flags that cause xRET to trap if set.
Analogously, hardware trap delegation to U-mode and H-mode would also be
inhibited if UIE and HIE are clear. There is, of course, no way out for
M-mode, which lacks a more-privileged mode to handle double faults. If
the monitor faults on trap entry, the system crashes.

Another limitation is that this would require trap handlers to either be
idempotent or to run with interrupts enabled--if a fault occurs and the
trap handler is running with interrupts disabled, the execution
environment will restart the trap handler from the beginning. This
could be overcome, but at the price of adding yet another flag to signal
"trap acknowledged". There are exactly six bits left in RV32 mstatus as
of the current draft and this seems to require six new flags, since the
"nested trap" and "trap acknowledged" flags would need to be replicated
for each of H/S/U-mode.

-- Jacob

Paolo Bonzini

Oct 6, 2016, 1:26:20 AM10/6/16
to Jacob Bachmeyer, Andrew Lutomirski, RISC-V ISA Dev

On 06/Oct/2016 02:18, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
> An easy solution to the memory-protection issue would be
> for all user pages to be implicitly read-write in S-mode.  (Note that
> this does not permit the supervisor to execute from a user page,
> but I believe that the most common use for S-mode execution
> from a user page is exploiting the supervisor.)

This would work, but you don't want the kernel to be able to read certain pages except during code sections that copy from/to userspace, in order to block attacks such as return-oriented programming. So this would require a bit such as SPRV (mimicking MPRV) in sstatus.

Paolo

Stefan O'Rear

Oct 6, 2016, 2:10:00 AM10/6/16
to Paolo Bonzini, Jacob Bachmeyer, Andrew Lutomirski, RISC-V ISA Dev
I've long wondered (in the context of SMAP, before I discovered
RISC-V, but still relevant here) why this kind of thing is handled
with a status bit instead of special instructions. Userspace might
try to pass a negative address to read(), and you can check that in
copy_from_user, but wouldn't it be better to have a sb_user
instruction that refused to write to kernel pages and was impossible
to misuse?

(fairly ignorant question)

-s

Michael Chapman

Oct 6, 2016, 12:19:02 PM10/6/16
to isa...@groups.riscv.org

Is there any particular reason for specifying the result of an integer divide by zero?
(Other than specifying that the operation terminates and produces an undefined bit pattern in the register).

In particular, a signed value divided by zero using signed division is specified as giving the result of -1 for the quotient.
In at least one standard implementation method, the value resulting from a signed divide of a negative number by 0, if no additional logic is spent detecting this case, is +1.

A slightly more natural value than -1 would be either the minimum possible signed value (i.e. the negative value 0x80000000 for a 32-bit operation) or the maximum possible signed value (i.e. 0x7fffffff for a 32-bit operation), depending on the sign of the dividend, if we are going to the trouble of implementing a specific value for these cases.

Specifying -1 for the result of a signed division of a number by 0 does not seem logical.
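
(For reference, a small example of what the current draft appears to specify, as I read it -- including the remainder, which is defined as the dividend:)

li   t0, -42
li   t1, 0
div  t2, t0, t1     # t2 = -1 (all bits set), whatever the sign of the dividend
divu t3, t0, t1     # t3 = 2^XLEN - 1 (the same all-ones bit pattern)
rem  t4, t0, t1     # t4 = -42 (the dividend comes back as the remainder)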



Samuel Falvo II

Oct 6, 2016, 12:38:25 PM10/6/16
to Michael Chapman, RISC-V ISA Dev
On Thu, Oct 6, 2016 at 9:18 AM, Michael Chapman
<michael.c...@gmail.com> wrote:
> Specifying -1 for the result of a signed division of a number by 0 does not
> seem logical.

Not an expert; just guessing. But, it probably takes less logic to
detect that error condition and produce the -1 result.

It seems like it could perhaps be easier to detect in software as
well, especially if you're register limited:

; do division here; result in x2
addi x2, x2, 1
beq x2, x0, _possibleOverflowUnderflow
addi x2, x2, -1
; use quotient here.

If you had to check for 0x8000...0000 and 0x7FFF...FFF explicitly,
it'd take more code, and unless these were pre-loaded into registers,
you'd almost certainly incur a fetch from program space to quickly
load them for comparison purposes. (Unless you synthesize them from
addi/slli instructions, which could take more time than a simple
fetch.)
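
For comparison, a sketch of the explicit check (RV64 shown, with the constants synthesized from li/slli as mentioned):

li   t0, 1
slli t0, t0, 63        ; t0 = INT64_MIN
beq  x2, t0, _possibleOverflowUnderflow
not  t0, t0            ; t0 = INT64_MAX (not is a pseudo-op for xori with -1)
beq  x2, t0, _possibleOverflowUnderflow
; use quotient in x2 here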

Andrew Lutomirski

Oct 6, 2016, 12:40:45 PM10/6/16
to RISC-V ISA Dev, aml...@gmail.com, jcb6...@gmail.com


On Wednesday, October 5, 2016 at 5:18:38 PM UTC-7, Jacob Bachmeyer wrote:
Andrew Lutomirski wrote:
> When a synchronous exception is delivered, it seems to me that it
> would be very helpful to record the faulting instruction into a CSR.  
> This way faults can be handled (and instructions can be emulated if
> needed) by higher-privileged code without needing to re-fetch the
> instruction.  This avoids races as well as any need to worry about
> memory protection that would allow, say, a user-executable instruction
> that is not readable by supervisor code without fiddling with the
> memory protection control bits.

An easy solution to the memory-protection issue would be for all user
pages to be implicitly read-write in S-mode.  (Note that this does not
permit the supervisor to execute from a user page, but I believe that
the most common use for S-mode execution from a user page is exploiting
the supervisor.)

Race conditions could be a bit harder to address, but what would the
S-mode parsing of a user instruction on the same hart race with?

I think this would cause problems.  In Linux, at least, there are two or three modes for access to user pages that the kernel would want to use:

1. No access to user pages at all.  For kernel hardening, this should be the normal state of affairs.  When the kernel wants to access user memory, it will change the mode.  (On very new x86, this is available using SMAP.  ARM and ARM64 have similar mechanisms.)  Switching in and out of this mode would ideally be very fast.

2. Access using the same restrictions that user code uses.  When the kernel wants to access user variables (__get_user(), __put_user(), copy_from_user(), copy_to_user(), etc), it wants normal user access semantics to apply.  This may result in page faults, and the kernel will handle those page faults appropriately.  A user should not be able to call a function like gettimeofday() with a pointer to read-only memory as the output argument and get that read-only memory overwritten.

3. (special case, rare) The kernel occasionally wants some way to read user instruction memory.  This isn't very common and could be done by manually walking the page tables, but it would be somewhat nice if the kernel could just do it.  Maybe a CSR bit could be set causing reads to user memory to use instruction fetch semantics.  For x86, this hasn't mattered much in the past because reads were more permissive than instruction fetches, but memory protection keys (PKRU) changed this and it's a minor mess now.
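
To make modes 1 and 2 concrete, here is roughly what the toggling could look like with the existing sstatus.PUM bit from the 1.9 draft (bit 18, if I'm reading the draft right; the uaccess framing is only a sketch, not actual Linux code):

li   t0, 0x40000      # sstatus.PUM (bit 18 in the 1.9 draft)
csrs sstatus, t0      # mode 1: stray dereferences of user pointers now fault

csrc sstatus, t0      # mode 2: user pages accessible, still subject to their R/W bits
# ... __copy_to_user-style loads/stores to the user buffer; faults get fixed up as usual ...
csrs sstatus, t0      # back to mode 1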
 

> It seems like it would be straightforward for RISC-V to support it for
> all modes, especially given that instruction words are short.

This is the fly in the ointment:  currently instruction words are short,
but the instruction length is extensible and can exceed XLEN as others
have noted.

There could be multiple registers for this, I suppose.
 

This seems like it would require a spec for how the supervisor coordinates this with M-mode, which seems overcomplicated to me.  Also, traps in Linux (at least) with interrupts disabled aren't necessarily indications of a major problem.  Linux sometimes expects non-fatal page faults when interrupts are disabled.
 

Another limitation is that this would require trap handlers to either be
idempotent or to run with interrupts enabled

This is not currently so easy.  All proposals I've seen (in this thread and in the Linux port) use csrrw to swap a register into sscratch, and that can't be idempotent.

I'd been hoping that per-hart memory or maybe even a per-hart trap vector would do the trick so that ordinary memory could be used as scratch space, but there don't seem to be direct pc-relative addressing modes or addressing modes with large enough absolute offsets to make this work cleanly.

sorear2's solution looks fairly good, although I expect it will be slow unless sscratch access is well-optimized by the hardware implementation.

Here's one more straw-man proposal: have trap entry set a bit in sstatus and make traps that happen while that bit is still set be called "double faults" and enter via a different vector.  At the very least, this would let the kernel print a nice error instead of infinite looping if something goes wrong very early in exception entry.

Samuel Falvo II

Oct 6, 2016, 1:01:45 PM10/6/16
to Andrew Lutomirski, RISC-V ISA Dev, jcb6...@gmail.com
On Thu, Oct 6, 2016 at 9:40 AM, Andrew Lutomirski <aml...@gmail.com> wrote:
> I'd been hoping that per-hart memory or maybe even a per-hart trap vector
> would do the trick so that ordinary memory could be used as scratch space,
> but there don't seem to be direct pc-relative addressing modes or addressing
> modes with large enough absolute offsets to make this work cleanly.


Why not place a table of HART-specific memory pointers in page 0 of
memory? Then you could do something like this:

csrr x2, mhartid
slli x2, x2, 3 ; or whatever
ld x2, 0(x2) ; X2 now points to HART-specific region of memory.

User-space is typically restricted from using page 0 for fear of null
pointers anyway; theoretically, supervisor space could be more
trustworthy.

Or, if you have finite-sized blocks (e.g., a slab of HART-specific
regions), you can slli x2, x2, log2size of the slab, and just add a
constant to get to the correct address. Sure, it's an effective
address calculation, but you only have to do it once.
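
(A sketch of that variant; the shift amount and slab base are illustrative, and note it only stays at one register if the slab base fits in a 12-bit immediate:)

csrr x2, mhartid
slli x2, x2, 12          ; e.g. one 4 KiB slab per hart
addi x2, x2, 0x400       ; add the slab base -- fine only while it fits in 12 bits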

(The advantage of the table of pointers approach above, though, is you
can get away with using only one register.)

Stefan O'Rear

Oct 6, 2016, 1:22:15 PM10/6/16
to Andrew Lutomirski, RISC-V ISA Dev, Jacob Bachmeyer
On Thu, Oct 6, 2016 at 9:40 AM, Andrew Lutomirski <aml...@gmail.com> wrote:
> sorear2's solution looks fairly good, although I expect it will be slow unless
> sscratch access is well-optimized by the hardware implementation.

OT: Does anyone know what keeps going wrong with my name? Going on
Google Groups and clicking "show original" from the action menu
reports the correct name in the "From:" message header, so I'm not
sure where my name is being stripped out. (It's Stefan, by the way)

-s

Tommy Thorn

Oct 6, 2016, 1:30:58 PM10/6/16
to Michael Chapman, isa...@groups.riscv.org
On Oct 6, 2016, at 09:18 , Michael Chapman <michael.c...@gmail.com> wrote:

Is there any particular reason for specifying the result of an integer divide by zero?
(Other than specifying that the operation terminates and produces an undefined bit pattern in the register).


0. In particular, in the context of a processor ISA, *everything* should be defined
by the spec. Leaving things undefined means it *will* be defined by implementations
and not necessarily consistently.  Poorly defined behavior leads to misunderstandings,
leading to bugs, leading to security issues.

1. Exactly *what* the behavior should be is a different issue. Personally, I prefer the
principle of least surprise, followed by programming convenience, and only last,
implementation convenience.

Tommy




In particular, a signed value divided by zero using signed division is specified as giving the result of -1 for the quotient.
In at least one standard implementation method, the value resulting from a signed divide of a negative number by 0, if no additional logic is spent detecting this case, is +1.

A slightly more natural value than -1 would be either the minimum possible signed value (i.e. the negative value 0x80000000 for a 32-bit operation) or the maximum possible signed value (i.e. 0x7fffffff for a 32-bit operation), depending on the sign of the dividend, if we are going to the trouble of implementing a specific value for these cases.

Specifying -1 for the result of a signed division of a number by 0 does not seem logical.





Michael Chapman

Oct 6, 2016, 2:07:33 PM10/6/16
to Tommy Thorn, isa...@groups.riscv.org

On 06/10/2016 19:30, Tommy Thorn wrote:
0. In particular, in the context of a processor ISA, *everything* should be defined
by the spec. Leaving things undefined means it *will* be defined by implementations
and not necessarily consistently.  Poorly defined behavior leads to misunderstandings,
leading to bugs, leading to security issues.

I can accept that argument.


1. Exactly *what* the behavior should be is a different issue. Personally, I prefer the
principle of least surprise, followed by programming convenience, and only last,
implementation convenience.

The principle of least surprise for me is that for a signed divide by zero, the result should be either INT_MAX or INT_MIN depending on the sign of the dividend.
Currently -1 is specified, which I find surprising -- it is neither logical, nor particularly convenient, nor what "falls" out of a typical hardware implementation.

For an unsigned divide by zero, the value specified is UINT_MAX which seems reasonable.
(It is of course the same bit pattern as -1 in the signed divide case)

Mike

Andrew Waterman

Oct 6, 2016, 3:05:44 PM10/6/16
to Stefan O'Rear, Paolo Bonzini, Jacob Bachmeyer, Andrew Lutomirski, RISC-V ISA Dev
Opcode space is a precious resource, and loads & stores consume lots
of it. So you'd either end up burning 1/16 of the opcode space on
these instructions, or adding a new addressing mode. It's also more to
design and verify in every processor design, vs. getting
copy_to/from_user correct just the once.

You're also still stuck with the opposite problem of preventing
regular S-mode loads & stores from touching user memory when they
aren't supposed to. I guess in your proposal, you could forbid the
S-mode loads & stores from accessing user memory altogether. That
could work fine for Linux, but other OSes might not cope so well with
that restriction.

>
> (fairly ignorant question)
>
> -s
>

Jacob Bachmeyer

Oct 7, 2016, 6:43:10 PM10/7/16
to bon...@gnu.org, Andrew Lutomirski, RISC-V ISA Dev
Paolo Bonzini wrote:
>
> On 06/Oct/2016 02:18, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
We already have that control: PUM ("protect user memory") in sstatus.
If PUM is set, all supervisor accesses to user pages fault, regardless
of permissions. I am suggesting that when PUM is clear, all user pages
are read-write for the supervisor.

-- Jacob

Jacob Bachmeyer

Oct 7, 2016, 7:08:20 PM10/7/16
to Andrew Lutomirski, RISC-V ISA Dev
PUM currently provides this.

> 2. Access using the same restrictions that user code uses. When the
> kernel wants to access user variables (__get_user(), __put_user(),
> copy_from_user(), copy_to_user(), etc), it wants normal user access
> semantics to apply. This may result in page faults, and the kernel
> will handle those page faults appropriately. A user should not be
> able to call a function like gettimeofday() with a pointer to
> read-only memory as the output argument and get that read-only memory
> overwritten.

The kernel should be able to verify addresses using its own
memory-tracking structures. I admit that these checks would not benefit
from the TLB. Could a "verify user address" instruction be worth adding
even though it would not have encoding space to support an immediate offset?

> 3. (special case, rare) The kernel occasionally wants some way to read
> user instruction memory. This isn't very common and could be done by
> manually walking the page tables, but it would be somewhat nice if the
> kernel could just do it. Maybe a CSR bit could be set causing reads
> to user memory to use instruction fetch semantics. For x86, this
> hasn't mattered much in the past because reads were more permissive
> than instruction fetches, but memory protection keys (PKRU) changed
> this and it's a minor mess now.

I suggest that the supervisor always be able to read all user memory.

You have missed the more common case of the kernel needing to *write*
user instruction memory, as when paging in program text.

> > It seems like it would be straightforward for RISC-V to support
> it for
> > all modes, especially given that instruction words are short.
>
> This is the fly in the ointment: currently instruction words are
> short,
> but the instruction length is extensible and can exceed XLEN as
> others
> have noted.
>
>
> There could be multiple registers for this, I suppose.

Fair enough, but that could quickly lead to spiraling complexity.
I have written a proposal to move those register swaps into hardware and
use special context-save instructions to work around the idempotency
problem.

> I'd been hoping that per-hart memory or maybe even a per-hart trap
> vector would do the trick so that ordinary memory could be used as
> scratch space, but there don't seem to be direct pc-relative
> addressing modes or addressing modes with large enough absolute
> offsets to make this work cleanly.

I think that *all* CSRs are per-hart, so we already have per-hart stvec
and sscratch.

> sorear2's solution looks fairly good, although I expect it will be slow
> unless sscratch access is well-optimized by the hardware implementation.
>
> Here's one more straw-man proposal: have trap entry set a bit in
> sstatus and make traps that happen while that bit is still set be
> called "double faults" and enter via a different vector. At the very
> least, this would let the kernel print a nice error instead of
> infinite looping if something goes wrong very early in exception entry.

That is the "trap acknowledged" flag I suggested, but with different
effects and an additional trap vector. What happens if the double fault
handler causes a fault?

-- Jacob

Andrew Lutomirski

Oct 7, 2016, 7:08:24 PM10/7/16
to RISC-V ISA Dev, bon...@gnu.org, aml...@gmail.com, jcb6...@gmail.com

I don't like this at all.  Suppose a user program does:

int fd = open("/etc/passwd", O_RDONLY);
void *ptr = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
read(1, ptr, 4);

Now the kernel does:

... most of the read syscall...
PUM = 0;                  /* clear PUM: S-mode may now touch user pages */
store something to *ptr   /* i.e., the read() syscall writing into the user buffer */

To prevent this from being a trivial privilege escalation, the kernel needs to either manually validate the store or make sure it's going to generate a page fault.  On x86, it manually checks that ptr < 0x7fffffffffffffff, then sets AC (x86's equivalent of PUM, with inverted sense) and writes to the pointer.  The hardware checks that the pointer is actually writable.  Your suggestion would totally break this and require the kernel to manually walk the page tables.
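For comparison, the RISC-V version of that pattern under the v1.9 PUM bit might look roughly like the sketch below.  This is only a sketch: SSTATUS_PUM's position is taken from the v1.9 draft, the user-space limit is an arbitrary illustrative constant, and the exception-table fixup a real kernel needs when the store legitimately faults is omitted.

#include <stdint.h>
#include <stdbool.h>

#define SSTATUS_PUM     (1UL << 18)            /* PUM bit position per priv v1.9 */
#define USER_SPACE_END  0x0000004000000000UL   /* illustrative upper bound on user addresses */

static inline void pum_clear(void)   /* open the user-access window */
{
        __asm__ volatile ("csrc sstatus, %0" : : "r" (SSTATUS_PUM) : "memory");
}

static inline void pum_set(void)     /* close it again */
{
        __asm__ volatile ("csrs sstatus, %0" : : "r" (SSTATUS_PUM) : "memory");
}

static bool put_user_u8(uint8_t val, uint8_t *uptr)
{
        if ((uintptr_t)uptr >= USER_SPACE_END)
                return false;        /* cheap range check, analogous to the x86 one */

        pum_clear();
        *uptr = val;                 /* hardware still enforces the user PTE's W bit */
        pum_set();
        return true;
}

The point is the same as on x86: the range check keeps the pointer out of kernel space, and the page-table permissions do the rest.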

If you want to be fancy, how about having three status bits:

PUM: if set, S-mode can't access user pages at all.
PSM ("protect S mode"): if PSM is set and PUM is clear, then S-mode can't access *supervisor* pages.
SAXR: "supervisor allow X reads": if set, S-mode reads to X, !R pages act as though the R permission was set.

Now the kernel would normally leave PSM set.  When accessing user memory, the kernel could skip the address check entirely and instead just clear PUM and do the access.  When reading an instruction for emulation, the kernel would temporarily set SAXR.  When running with the truly awful "set_fs(KERNEL_DS)" hack in effect, the kernel would clear PSM.
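Restated as code, a data-access check with these bits might look like the following.  This is purely hypothetical (PSM and SAXR exist only in this proposal) and just encodes the behavior described above.

#include <stdbool.h>

struct pte_perms { bool user, r, w, x; };   /* leaf-PTE permission bits */

/* Hypothetical: whether an S-mode load or store would be allowed if
 * PUM/PSM/SAXR behaved as proposed above. */
static bool s_mode_data_access_ok(struct pte_perms p, bool pum, bool psm,
                                  bool saxr, bool is_write)
{
        if (p.user && pum)
                return false;             /* PUM set: user pages are off-limits */
        if (!p.user && psm && !pum)
                return false;             /* user-access window open: supervisor pages are off-limits */
        if (is_write)
                return p.w;               /* writes always need W, so read-only user memory stays safe */
        return p.r || (saxr && p.x);      /* SAXR makes X,!R pages readable */
}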

If this actually happened, I'd be tempted to ask the compiler folks to implement an "address space" that controls the PUM bit.

Stefan O'Rear

unread,
Oct 7, 2016, 7:12:44 PM10/7/16
to Andrew Lutomirski, RISC-V ISA Dev, Paolo Bonzini, Jacob Bachmeyer
On Fri, Oct 7, 2016 at 4:08 PM, Andrew Lutomirski <aml...@gmail.com> wrote:
> To prevent this from being a trivial privilege escalation, the kernel needs
> to either manually validate the store or make sure it's going to generate a
> page fault. On x86, it manually checks that ptr < 0x7fffffffffffffff, then
> clears AC (x86's equivalent of PUM) and writes to the pointer. The hardware
> checks that the pointer is actually writable. Your suggestion would totally
> break this and require the kernel to manually walk the page tables.

Checking if this processor honours the WP bit even in supervisor mode...

It's been done before.

-s

Andrew Lutomirski

unread,
Oct 7, 2016, 7:25:44 PM10/7/16
to RISC-V ISA Dev, aml...@gmail.com, jcb6...@gmail.com

True, and a per-hart trap vector would do it.  What I meant was: if there were a reasonably clean way to address per-hart in-memory scratch space without needing any integer registers free, it would solve most of the problem at very little hardware cost.
 

> > sorear2's solution looks fairly good, although I expect will be slow
> > unless sscratch access is well-optimized by the hardware implementation.
> >
> > Here's one more straw-man proposal: have trap entry set a bit in
> > sstatus and make traps that happen while that bit is still set be
> > called "double faults" and enter via a different vector.  At the very
> > least, this would let the kernel print a nice error instead of
> > infinite looping if something goes wrong very early in exception entry.
>
> That is the "trap acknowledged" flag I suggested, but with different
> effects and an additional trap vector.  What happens if the double fault
> handler causes a fault?

A triple fault!  It could either force an immediate reset or trap out to M-mode.

Andrew Lutomirski

unread,
Oct 7, 2016, 7:33:55 PM10/7/16
to RISC-V ISA Dev, aml...@gmail.com, jcb6...@gmail.com


On Friday, October 7, 2016 at 4:08:20 PM UTC-7, Jacob Bachmeyer wrote:
Andrew Lutomirski wrote:

> > I think this would cause problems.  In Linux, at least, there are two
> > or three modes for access to user pages that the kernel would want to use:
> >
> > 1. No access to user pages at all.  For kernel hardening, this should
> > be the normal state of affairs.  When the kernel wants to access user
> > memory, it will change the mode.  (On very new x86, this is available
> > using SMAP.  ARM and ARM64 have similar mechanisms.)  Switching in and
> > out of this mode would ideally be very fast.
>
> PUM currently provides this.

Indeed, and I like that.
 

> > 2. Access using the same restrictions that user code uses.  When the
> > kernel wants to access user variables (__get_user(), __put_user(),
> > copy_from_user(), copy_to_user(), etc), it wants normal user access
> > semantics to apply.  This may result in page faults, and the kernel
> > will handle those page faults appropriately.  A user should not be
> > able to call a function like gettimeofday() with a pointer to
> > read-only memory as the output argument and get that read-only memory
> > overwritten.
>
> The kernel should be able to verify addresses using its own
> memory-tracking structures.  I admit that these checks would not benefit
> from the TLB.  Could a "verify user address" instruction be worth adding
> even though it would not have encoding space to support an immediate offset?

Maybe, but it's even easier if it just uses the normal access control mechanism.  This also avoids races where one hart verifies access, then another hart edits the page tables to point elsewhere with different permissions, and then the first hart chases the new PTE and clobbers something it shouldn't have clobbered.
 

> > 3. (special case, rare) The kernel occasionally wants some way to read
> > user instruction memory.  This isn't very common and could be done by
> > manually walking the page tables, but it would be somewhat nice if the
> > kernel could just do it.  Maybe a CSR bit could be set causing reads
> > to user memory to use instruction fetch semantics.  For x86, this
> > hasn't mattered much in the past because reads were more permissive
> > than instruction fetches, but memory protection keys (PKRU) changed
> > this and it's a minor mess now.
>
> I suggest that the supervisor always be able to read all user memory.
>
> You have missed the more common case of the kernel needing to *write*
> user instruction memory as when paging in program text.

I can't speak for other OS's, but on Linux this essentially doesn't happen.  Linux will allocate a physical page, fill it in using an alias in the kernel virtual address range, and only map it into the user address range once it's fully populated.  (And, on RISC-V, the kernel will have to put a FENCE w,w in there as well.)
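In rough code, that sequence is something like the sketch below.  The types are placeholders, error handling is omitted, the SFENCE.VM mnemonic is the v1.9 one (later specs renamed it SFENCE.VMA), and instruction-cache synchronization (FENCE.I and any remote equivalent) is deliberately left out; whether these are the right fences is exactly what the P.S. and the replies that follow are trying to pin down.

#include <stddef.h>
#include <string.h>

typedef unsigned long pte_t;    /* placeholder leaf-PTE type */

static void page_in_text(void *kernel_alias, const void *src, size_t len,
                         volatile pte_t *pte, pte_t new_pte)
{
        memcpy(kernel_alias, src, len);                /* fill the page via the kernel alias */

        /* Order the fill before the mapping is published. */
        __asm__ volatile ("fence w, w" : : : "memory");

        *pte = new_pte;                                /* now make the user mapping valid */

        /* The TLB may still hold the old, invalid translation. */
        __asm__ volatile ("sfence.vm" : : : "memory");
}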

P.S. Am I understanding the memory model right?  One CPU doing FENCE w,w; store to address A will synchronize against another CPU doing load from address A; FENCE r,r, right?

Jacob Bachmeyer

unread,
Oct 7, 2016, 7:36:14 PM10/7/16
to Samuel Falvo II, Andrew Lutomirski, RISC-V ISA Dev
It's actually even easier than that--sscratch itself is per-hart, so you
can simply put a pointer to a hart-specific region in there. Also,
mhartid is not accessible to S-mode. Supervisors get the current hart
ID with the sbi_hart_id() SBI call.

-- Jacob

Jacob Bachmeyer

unread,
Oct 7, 2016, 7:42:03 PM10/7/16
to Michael Chapman, Tommy Thorn, isa...@groups.riscv.org
Michael Chapman wrote:
>
> The principle of least surprise for me is that for a signed divide by
> zero, the result should be either INT_MAX or INT_MIN depending on the
> sign of the dividend.
> Currently -1 is specified which I find surprising - it is neither
> logical, nor particularly convenient, nor what "falls" out of a
> typical hardware implementation.
>
> For an unsigned divide by zero, the value specified is UINT_MAX which
> seems reasonable.
> (It is of course the same bit pattern as -1 in the signed divide case)

All bits set is easy to produce in hardware, and this allows signed and
unsigned division to use the same circuitry to detect division-by-zero
and (quickly) produce the same error result.
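For reference, the specified results are easy to demonstrate.  Since division by zero is undefined behaviour at the C level, this small sketch (assuming RV64 with the M extension) issues the instructions directly:

#include <stdint.h>
#include <stdio.h>

static int64_t rv_div(int64_t a, int64_t b)
{
        int64_t q;
        __asm__ ("div %0, %1, %2" : "=r" (q) : "r" (a), "r" (b));
        return q;
}

static uint64_t rv_divu(uint64_t a, uint64_t b)
{
        uint64_t q;
        __asm__ ("divu %0, %1, %2" : "=r" (q) : "r" (a), "r" (b));
        return q;
}

int main(void)
{
        /* DIV by zero: all bits set, i.e. -1 as a signed value. */
        printf("%lld\n", (long long)rv_div(42, 0));
        /* DIVU by zero: the same bit pattern, i.e. 2^64 - 1. */
        printf("%llu\n", (unsigned long long)rv_divu(42, 0));
        return 0;
}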

-- Jacob

Jacob Bachmeyer

unread,
Oct 7, 2016, 8:54:27 PM10/7/16
to Andrew Lutomirski, RISC-V ISA Dev, bon...@gnu.org
Andrew Lutomirski wrote:
> On Friday, October 7, 2016 at 3:43:10 PM UTC-7, Jacob Bachmeyer wrote:
>
> Paolo Bonzini wrote:
> >
> > Il 06/ott/2016 02:18, "Jacob Bachmeyer" <jcb6...@gmail.com
> <javascript:>
> > <mailto:jcb6...@gmail.com <javascript:>>> ha scritto:
While it would not be able to write to /etc/passwd itself, because the kernel would still think the page is clean (it is mapped read-only), it could corrupt the copy in the page cache, which would be silently discarded later, leaving no trace of the exploit.  Ouch... you are
correct.

> To prevent this from being a trivial privilege escalation, the kernel
> needs to either manually validate the store or make sure it's going to
> generate a page fault. On x86, it manually checks that ptr <
> 0x7fffffffffffffff, then clears AC (x86's equivalent of PUM) and
> writes to the pointer. The hardware checks that the pointer is
> actually writable. Your suggestion would totally break this and
> require the kernel to manually walk the page tables.

I revise my suggestion to: "User pages are always readable in S-mode,
writable in S-mode if and only if they are writable in U-mode, and never
executable in S-mode. If PUM is set, any S-mode access to a user page
faults, regardless of permissions."

Supervisor writes to pages that are read-only in U-mode must be done
through a separate (writable, supervisor) mapping that aliases the same
physical page. This is what I was trying to avoid, but I now admit that
it is necessary.

> If you want to be fancy, how about having three status bits:
>
> PUM: if set, S-mode can't access user pages at all.
> PSM ("protect S mode"): if PSM && PUM, then S-mode can't access
> *supervisor* pages.
> SAXR: "supervisor allow X reads": if set, S-mode reads to X, !R pages
> act as though the R permission was set.
>
> Now the kernel would normally leave PSM set. When accessing user
> memory, the kernel could skip the address check entirely and instead
> just clear PUM and do the access. When reading an instruction for
> emulation, the kernel temporarily set SAXR. When running with the
> truly awful "set_fs(KERNEL_DS)" hack in effect, the kernel would clear
> PSM.
>
> If this actually happened, I'd be tempted to ask the compiler folks to
> implement a "address space" that controls the PUM bit.

The problem is that there are only a few bits left in mstatus, and they
may end up needed for nested trap handling. Since the supervisor
presumably loaded an X,!R page, I think that S-mode should always be
able (if PUM is clear) to read user pages, but never be able to execute
from user pages. You have convinced me that user-read-only mappings
should also be supervisor-read-only.

-- Jacob

Jacob Bachmeyer

unread,
Oct 7, 2016, 9:21:12 PM10/7/16
to Andrew Lutomirski, RISC-V ISA Dev
I now agree with treating the user-write permission bit also as a
supervisor-write permission bit, but maintain that user pages should
always be readable and never be executable in S-mode.

Races between page tables and the underlying memory are an issue I still
have questions about.  You are correct that software PTE walks are
inherently vulnerable to races, and that this would also affect a
"verify user address" instruction.  I withdraw that suggestion.

I still have an unanswered question: What combination of FENCE,
SFENCE.VM, remote fences, etc. is required to globally kill a page
mapping before reusing that physical page?

> > > 3. (special case, rare) The kernel occasionally wants some way to read
> > > user instruction memory.  This isn't very common and could be done by
> > > manually walking the page tables, but it would be somewhat nice if the
> > > kernel could just do it.  Maybe a CSR bit could be set causing reads
> > > to user memory to use instruction fetch semantics.  For x86, this
> > > hasn't mattered much in the past because reads were more permissive
> > > than instruction fetches, but memory protection keys (PKRU) changed
> > > this and it's a minor mess now.
> >
> > I suggest that the supervisor always be able to read all user memory.
> >
> > You have missed the more common case of the kernel needing to *write*
> > user instruction memory as when paging in program text.
>
> I can't speak for other OS's, but on Linux this essentially doesn't
> happen.  Linux will allocate a physical page, fill it in using
> an alias in the kernel virtual address range, and only map it into the
> user address range once it's fully populated.  (And, on RISC-V, the
> kernel will have to put a FENCE w,w in there as well.)
>
> P.S. Am I understanding the memory model right?  One CPU doing FENCE
> w,w; store to address A will synchronize against another CPU doing
> load from address A; FENCE r,r, right?

Unless I also misunderstand the memory model, your FENCEs are
incorrect. FENCE is defined informally as "no other RISC-V thread or
external device can observe any operation in the successor set following
a FENCE before any operation in the predecessor set preceding the FENCE"
(RISC-V user spec v2.1, sec. 2.7 "Memory Model"). Assuming that FENCE
is "FENCE <pred>, <succ>", paging in program text would seem to require
a "FENCE w,r; FENCE.I" sequence, but I am uncertain what must be done to
ensure that other harts will actually see the new page contents.
Mapping it to user space will require SFENCE.VM, since the TLB may have
cached the invalid translation that produced the original page fault for
that address.


-- Jacob

Jacob Bachmeyer

unread,
Oct 7, 2016, 9:28:36 PM10/7/16
to Andrew Lutomirski, RISC-V ISA Dev
I was proposing that the SEE handle double faults by stashing S-mode
state somewhere and then reflecting the nested trap into S-mode.
Obviously at some point, the SEE must detect that the S-mode trap
handler is itself faulting consistently and do ... something.

-- Jacob

Andrew Lutomirski

unread,
Oct 7, 2016, 10:58:36 PM10/7/16
to RISC-V ISA Dev, aml...@gmail.com, bon...@gnu.org, jcb6...@gmail.com

On an architecture with limited paging access control like x86, this would be fine (in fact, it's more or less what x86 does -- pages are pretty much always readable).  But RISC-V is better and gives individual R, W, and X control.  This means that the problem exists in the other direction.

uint8_t *ptr = mmap(NULL, 4096, PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);  /* any write-only mapping */
*ptr = 1;          /* okay */
uint8_t x = *ptr;  /* segfaults, as it should */
write(1, ptr, 1);  /* should fail */

This means that the kernel still wants supervisor access to work like user access.
 


> The problem is that there are only a few bits left in mstatus, and they
> may end up needed for nested trap handling.  Since the supervisor
> presumably loaded an X,!R page, I think that S-mode should always be
> able (if PUM is clear) to read user pages, but never be able to execute
> from user pages.  You have convinced me that user-read-only mappings
> should also be supervisor-read-only.

It's not really a big deal (IMO) if the supervisor has to jump through hoops to read a user X,!R page.  That case is rare.  I think it would only be worth optimizing in hardware if it were essentially free.  What I was getting at with "protect supervisor mode" is that some architectures have totally separate user and kernel address spaces, and this actually works out quite nicely and gives some security benefits.

In RISC-V, this model could look like having two entirely separate paging hierarchies along with a set of magic load and store instructions that would let supervisor code access user memory.  Admittedly, this chews up opcode space, which would be unfortunate.  But it makes everything conceptually very simple.

(If RISC-V does not accept Paolo's suggestion to get rid of H mode, then the explosion of opcodes could be even worse.)

--Andy

Jacob Bachmeyer

unread,
Oct 9, 2016, 8:28:15 PM10/9/16
to Andrew Lutomirski, RISC-V ISA Dev, bon...@gnu.org
The current draft privileged ISA does not permit write-only pages (sec
4.5.1).  The permission-bit combinations that would indicate
write-but-no-read are reserved and invalid, so your example would fault
every time the pointer is used: hardware treats the "write-only" mapping
as an invalid page mapping.
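Stated as a one-line check (just a restatement of the v1.9 rule, for reference):

#include <stdbool.h>

/* v1.9 reserves every leaf-PTE permission encoding with W set but R clear,
 * so "write implies read" is the entire rule. */
static bool leaf_pte_rwx_valid(bool r, bool w, bool x)
{
        (void)x;        /* X does not affect this particular restriction */
        return r || !w;
}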

The only case I know of where supervisor-can-always-read would open a
loophole is an execute-only mapping, but I do not see an actual loophole
there, because any program could simply open the same file for reading.
The only possible uses for an execute-only file are exec(2) and
dlopen(), but dlopen() must be able to read the ELF headers, so read
permission is also required.

> > The problem is that there are only a few bits left in mstatus, and they
> > may end up needed for nested trap handling.  Since the supervisor
> > presumably loaded an X,!R page, I think that S-mode should always be
> > able (if PUM is clear) to read user pages, but never be able to execute
> > from user pages.  You have convinced me that user-read-only mappings
> > should also be supervisor-read-only.
>
> It's not really a big deal (IMO) if the supervisor has to jump through
> hoops to read a user X,!R page.  That case is rare.  I think it would
> only be worth optimizing in hardware if it were essentially free.
> What I was getting at with "protect supervisor mode" is that some
> architectures have totally separate user and kernel address spaces,
> and this actually works out quite nicely and gives some security benefits.
>
> In RISC-V, this model could look like having two entirely separate
> paging hierarchies along with a set of magic load and store
> instructions that would let supervisor code access user memory.
> Admittedly, this chews up opcode space, which would be unfortunate.
> But it makes everything conceptually very simple.

I did consider proposing essentially this earlier, splitting sptbr into
ssptbr and suptbr, with the supervisor ASID hardwired to all-1, but
decided not to write it up because load and store already fill an entire
major opcode each, so there really is no room for new
access-other-address-space instructions. Essentially, the only access
to user space the supervisor would have would require aliasing pages
into the supervisor address space. Great for a Mach-like microkernel
that is already built around VM tricks, not so great for a monolithic
kernel like Linux.

> (If RISC-V does not accept Paolo's suggestion to get rid of H mode,
> then the explosion of opcodes could be even worse.)

Not necessarily, since a hypervisor does not need the same close access
to supervisor memory that current supervisors need to user memory.
Requiring page aliasing, for example, is far less of a burden for a
hypervisor than for a supervisor. Further, according to (IIRC) previous
discussions on this list, most traps are expected to be system calls, so
H-mode has less need for very fast hypercalls and (supervisor) context
switch than S-mode does for fast syscalls and (user) context switch.



-- Jacob