A simple proposal to improve interrupt/exception nesting


Andrew Lutomirski

Dec 1, 2016, 5:59:20 PM12/1/16
to RISC-V ISA Dev
Currently, RISC-V cannot usefully handle exceptions that happen during exception entry or exit.  To summarize earlier discussions, here are some concrete problems:

  • If a kernel wants to cleanly detect stack overflow and a stack overflow is possible during early entry, the assembly code gets very hairy.
  • If someone wants to use a tool like Linux's perf on RISC-V, there will be demand for as much of the kernel as possible to be profilable, which means that sampling interrupts should be enabled almost all the time, but RISC-V does not have a usable recoverable NMI mechanism.  Bolting something on later may be rather messy.
  • Big iron machines can recover from accesses to bad DRAM.  But what happens if bad DRAM is accessed while SIE=0?

I have a minimal (IMO) proposal that could substantially improve the situation:

  1. Rename sstatus.SIE to sstatus.STE.  STE stands for Supervisor Trap Enabled.  (Having SIE and sie be different things is IMO rather confusing, too.)  Similarly, SPIE is renamed to SPTE.
  2. STE's semantics change a bit.  Currently, sstatus.SIE=0 (presumably - it's not quite clear in the spec) causes asynchronous interrupts to be deferred instead of causing traps.  It keeps doing this, but it gains a new function: the supervisor entry point (stvec) will not be invoked under any circumstances at all if STE=0.  If a synchronous exception would cause a trap to supervisor mode, it traps to machine mode instead, and machine mode code is expected to kill the supervisor or, if configured via SBI, notify the supervisor by some means other than calling stvec.
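The routing rule in point 2 can be sketched as a walk up the privilege ladder. This is a model only: the mode numbering and the generalization to UTE/HTE/MTE bits for modes other than S are extrapolations for illustration, not part of the proposal as written.

```c
#include <assert.h>

/* Sketch only: models the proposed "Trap Enabled" routing, not any
 * real hardware.  Privilege modes are numbered U=0, S=1, H=2, M=3. */
enum { PRV_U = 0, PRV_S = 1, PRV_H = 2, PRV_M = 3 };

/* te[x] stands for a per-mode Trap Enabled bit (UTE/STE/HTE/MTE --
 * hypothetical names for modes other than S).  A synchronous exception
 * that would be delivered to mode `target` instead escalates to the
 * nearest more-privileged mode whose TE bit is set; -1 means even
 * M-mode had MTE=0, i.e. halt or reset. */
int trap_destination(int target, const int te[4])
{
    for (int m = target; m <= PRV_M; m++)
        if (te[m])
            return m;   /* first mode willing to accept the trap */
    return -1;
}
```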

To use this, kernels change their behavior slightly.  When a kernel wants to turn off interrupts to be temporarily atomic (Linux's local_irq_disable(), for example), it will do so using the SIE CSR instead of using sstatus.SIE.  Kernels would normally run with STE=0 only during entry or exit.


This gets some nice benefits.  Instead of using NMIs, a perf-like tool would allocate an interrupt vector and would just not mask that particular vector in local_irq_disable().  This would allow profiling of everything except the entry and exit code itself.
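A minimal sketch of what local_irq_disable()/local_irq_enable() could look like under this scheme. The bit positions and mask values are invented for illustration, and a real kernel would use csr_clear/csr_set on the sie CSR rather than a plain variable.

```c
#include <stdint.h>

/* sie_csr stands in for the sie CSR (hypothetical bit assignments). */
static uint64_t sie_csr;

#define SIE_PERF  (1ULL << 9)   /* invented: vector claimed by perf */
#define SIE_ALL   0x3ffULL      /* invented: every implemented vector */

/* Critical sections mask everything *except* the perf vector, so
 * sampling keeps firing inside them; only the entry/exit code itself,
 * which runs with STE=0, remains invisible to the profiler. */
static void local_irq_disable(void) { sie_csr &= SIE_PERF; }
static void local_irq_enable(void)  { sie_csr |= SIE_ALL; }
```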


Handling stack overflow becomes simpler.  If a stack overflow occurs during entry (with STE=1), then either the supervisor is killed cleanly or it would register using SBI for some fancier handling.


Auditing or formally verifying entry code becomes *much* easier.  The CSR swap right at the beginning is nasty with respect to traps that occur right after it.  If the whole critical sequence runs with STE=0, then there cannot be any nested trap at all, so verification doesn't need to worry about reentrancy.


Thoughts?



Stefan O'Rear

Dec 1, 2016, 6:14:26 PM12/1/16
to Andrew Lutomirski, RISC-V ISA Dev
I like it. If there is no higher power, would a trap in the entry
code functionally be a triple fault?

-s

Jacob Bachmeyer

Dec 1, 2016, 7:03:26 PM12/1/16
to Andrew Lutomirski, RISC-V ISA Dev
Andrew Lutomirski wrote:
> I have a minimal (IMO) proposal that could substantially improve the
> situation:
>
> 1. Rename sstatus.SIE to sstatus.STE. STE stands for Supervisor
> Trap Enabled. (Having SIE and sie be different things is IMO
> rather confusing, too.) Similarly, SPIE is renamed to SPTE.
> 2. STE's semantics change a bit. Currently, sstatus.SIE=0
> (presumably - it's not quite clear in the spec) causes
> asynchronous interrupts to be deferred instead of causing
> traps. It keeps doing this, but it gains a new function: the
> supervisor entry point (stvec) will not be invoked under any
> circumstances at all if STE=0. If a synchronous exception would
> cause a trap to supervisor mode, it traps to machine mode
> instead, and machine mode code is expected to kill the
> supervisor or, if configured via SBI, notify the supervisor by
> some means other than calling stvec.
>
> To use this, kernels change their behavior slightly. When a kernel
> wants to turn off interrupts to be temporarily atomic (Linux's
> local_irq_disable(), for example), it will do so using the SIE CSR
> instead of using sstatus.SIE. Kernels would normally run with STE=0
> /only/ during entry or exit.
>

What about other privilege modes? A hypervisor faces the same problems.

> Auditing or formally verifying entry code becomes *much* easier. The
> CSR swap right at the beginning is nasty with respect to traps that
> occur right after it. If the whole critical sequence runs with STE=0,
> then there cannot be any nested trap at all, so verification doesn't
> need to worry about reentrancy.

This is also one of the reasons that I proposed *swap CSRs--by
performing the swap in hardware, it is possible for the SEE to determine
afterwards what value should go where and deliver a nested trap.

> Thoughts?
>

I like the concept and offer another that I have been formulating: add
virtual trap flags for execution environments to use. This would
require two or three (three if U-mode virtual traps are supported)
additional flags. These would be HVT, SVT, UVT and each would be
controlled by a higher privilege level than it affects. When the
virtual trap flag is set, xRET traps to the execution environment. All
trap delegation to a privilege level is inhibited while that privilege
level is handling a virtual trap.

In the example case of a supervisor taking a trap during trap entry, the
system would trap to the SEE, which saves enough state to resume the
original trap and "fakes" a new (double-fault?) trap, by returning to
stvec with SVT set. After the supervisor has handled the virtual trap
and executes SRET, the SEE regains control and can re-deliver the
original trap.

This also allows for virtual interrupts to support channel I/O: the SEE
can "fake" a trap entry, either to stvec or another prearranged address,
with a cause code identifying a virtual device.

Thoughts?


-- Jacob

Andrew Lutomirski

Dec 1, 2016, 7:12:18 PM12/1/16
to Stefan O'Rear, RISC-V ISA Dev
It would be equivalent to a triple fault in x86 speak, but it would
really only be a double fault in this context.

I assume you're asking what the effect should be. I'd say halt or
reset, depending on the whims of the system builder.

--Andy

Andrew Lutomirski

Dec 1, 2016, 7:17:41 PM12/1/16
to RISC-V ISA Dev, aml...@gmail.com, jcb6...@gmail.com


On Thursday, December 1, 2016 at 4:03:26 PM UTC-8, Jacob Bachmeyer wrote:
> Andrew Lutomirski wrote:
> > I have a minimal (IMO) proposal that could substantially improve the
> > situation:
> >
> >    1. Rename sstatus.SIE to sstatus.STE.  STE stands for Supervisor
> >       Trap Enabled.  (Having SIE and sie be different things is IMO
> >       rather confusing, too.)  Similarly, SPIE is renamed to SPTE.
> >    2. STE's semantics change a bit.  Currently, sstatus.SIE=0
> >       (presumably - it's not quite clear in the spec) causes
> >       asynchronous interrupts to be deferred instead of causing
> >       traps.  It keeps doing this, but it gains a new function: the
> >       supervisor entry point (stvec) will not be invoked under any
> >       circumstances at all if STE=0.  If a synchronous exception would
> >       cause a trap to supervisor mode, it traps to machine mode
> >       instead, and machine mode code is expected to kill the
> >       supervisor or, if configured via SBI, notify the supervisor by
> >       some means other than calling stvec.
> >
> > To use this, kernels change their behavior slightly.  When a kernel
> > wants to turn off interrupts to be temporarily atomic (Linux's
> > local_irq_disable(), for example), it will do so using the SIE CSR
> > instead of using sstatus.SIE.  Kernels would normally run with STE=0
> > /only/ during entry or exit.
>
> What about other privilege modes?  A hypervisor faces the same problems.

The behavior is the same for all modes.  A user-mode trap with UTE=0 goes to the supervisor instead.  A supervisor-mode trap with STE=0 goes to H-mode (assuming H-mode actually becomes a real thing) or to M-mode.  An H-mode trap with HTE=0 goes to M-mode.  An M-mode trap with MTE=0 kills the system.

I haven't tried to specify what happens if you're running in U-mode with STE=0.  Presumably you shouldn't be allowed to do this.
 

> > Auditing or formally verifying entry code becomes *much* easier.  The
> > CSR swap right at the beginning is nasty with respect to traps that
> > occur right after it.  If the whole critical sequence runs with STE=0,
> > then there cannot be any nested trap at all, so verification doesn't
> > need to worry about reentrancy.
>
> This is also one of the reasons that I proposed *swap CSRs--by
> performing the swap in hardware, it is possible for the SEE to determine
> afterwards what value should go where and deliver a nested trap.

It was indeed intended to solve the same problem.
 

> > Thoughts?
> >
>
> I like the concept and offer another that I have been formulating:  add
> virtual trap flags for execution environments to use.  This would
> require two or three (three if U-mode virtual traps are supported)
> additional flags.  These would be HVT, SVT, UVT and each would be
> controlled by a higher privilege level than it affects.  When the
> virtual trap flag is set, xRET traps to the execution environment.  All
> trap delegation to a privilege level is inhibited while that privilege
> level is handling a virtual trap.
>
> In the example case of a supervisor taking a trap during trap entry, the
> system would trap to the SEE, which saves enough state to resume the
> original trap and "fakes" a new (double-fault?) trap, by returning to
> stvec with SVT set.  After the supervisor has handled the virtual trap
> and executes SRET, the SEE regains control and can re-deliver the
> original trap.

I'm not sure I understand why this needs new flags.  The SEE is essentially a very fancy microcode that can access memory and such.  Why not have it save all the trap-related CSRs, load some supervisor-configured stack pointer, and enter at a supervisor-configured address to handle a double fault?  None of this kind of complexity has any hardware cost at all.
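The flow Andrew describes can be sketched as pure SEE-side bookkeeping. Every name below (df_config, see_deliver_double_fault, the field set) is invented for illustration; it is not a proposed SBI interface.

```c
#include <stdint.h>

/* Sketch (all names invented): on a trap taken with STE=0, the SEE
 * saves the live S-mode trap CSRs, then enters the supervisor at an
 * address and stack the supervisor registered earlier via SBI. */
struct df_config {                /* registered by the supervisor */
    uint64_t handler_pc;
    uint64_t stack_ptr;
};

struct scsr_snapshot {            /* S-mode trap CSRs the SEE preserves */
    uint64_t sepc, scause, stval, sstatus;
};

/* Returns the PC to enter and fills *sp_out; the snapshot lets the
 * double-fault handler inspect (and later re-deliver) the inner trap. */
uint64_t see_deliver_double_fault(const struct df_config *cfg,
                                  const struct scsr_snapshot *live,
                                  struct scsr_snapshot *saved,
                                  uint64_t *sp_out)
{
    *saved = *live;               /* the SEE, not hardware, saves state */
    *sp_out = cfg->stack_ptr;     /* known-good emergency stack */
    return cfg->handler_pc;
}
```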
 

Andrew Waterman

Dec 1, 2016, 8:00:33 PM12/1/16
to Andrew Lutomirski, RISC-V ISA Dev
A very old version of the privileged architecture, prior to the
introduction of M-mode, had a similar design. We changed it because
it made less sense with only two privilege modes, but there is more
appeal now.

It can't be turtles all the way down, so the same strategy makes less
sense for M-mode.

On Thu, Dec 1, 2016 at 2:59 PM, Andrew Lutomirski <aml...@gmail.com> wrote:

Andrew Lutomirski

Dec 1, 2016, 8:06:35 PM12/1/16
to Andrew Waterman, RISC-V ISA Dev
True, although at some point if M-mode messes up badly enough,
resetting is probably the only option. If M-mode blows up, I'd rather
see it reset than somehow get stuck in an infinite loop of traps that
all trap again early on. M-mode is running unpaged anyway, so it
seems to me that there's very little scope for failure in the M-mode
entry path. And I really can imagine someone trying to write a
formally verified M-mode stack, where a mechanism like this would be
nice.

Jacob Bachmeyer

Dec 1, 2016, 10:31:10 PM12/1/16
to Andrew Lutomirski, RISC-V ISA Dev
The new flag is needed so that the SEE gets control and can restore the
old supervisor stack and the various trap-related CSRs before resuming
execution. Although I suppose that the double fault handler could end
with ECALL instead of SRET. This could also work for virtual
interrupts, not unlike using a system call to return from a signal handler.

-- Jacob

Stefan O'Rear

Dec 1, 2016, 11:29:12 PM12/1/16
to Jacob Bachmeyer, Andrew Lutomirski, RISC-V ISA Dev
On Thu, Dec 1, 2016 at 7:31 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> The new flag is needed so that the SEE gets control and can restore the old
> supervisor stack and the various trap-related CSRs before resuming
> execution. Although I suppose that the double fault handler could end with
> ECALL instead of SRET. This could also work for virtual interrupts, not
> unlike using a system call to return from a signal handler.

Exit from an S-mode double fault handler needs to restore all registers
including sepc and then jump somewhere else. This can't be done
entirely within S-mode but for systems that can support it at all I'd
much rather have a new ecall than new architectural state.

-s

Andrew Lutomirski

Dec 1, 2016, 11:47:36 PM12/1/16
to Stefan O'Rear, Jacob Bachmeyer, RISC-V ISA Dev

All that's really needed is an ecall or an instruction that takes a pointer to a structure containing the relevant CSRs, PC, and at least one GPR and loads them all.

Doing this with an instruction would involve fancy microcode that's otherwise unnecessary.  Ecall seems simpler.
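A sketch of the structure such an ecall might take a pointer to. The layout and field set are assumptions for illustration, not a proposed ABI; the point is only that everything must be reloaded atomically from the S-mode handler's perspective.

```c
#include <stdint.h>

/* Invented layout for the resume block: everything the SEE must
 * reload, as one atomic operation from S-mode's point of view, before
 * returning to the interrupted code. */
struct resume_block {
    uint64_t pc;        /* where execution resumes (goes into sepc) */
    uint64_t sstatus;   /* including the saved SPTE/SPP fields */
    uint64_t sscratch;
    uint64_t gpr;       /* at least one GPR, reloaded last */
};

/* Modeled on plain variables so it is testable; a real SEE would
 * write the CSRs and then execute sret. */
void see_resume(const struct resume_block *b, uint64_t *sepc,
                uint64_t *sstatus, uint64_t *sscratch, uint64_t *x31)
{
    *sepc = b->pc;
    *sstatus = b->sstatus;
    *sscratch = b->sscratch;
    *x31 = b->gpr;      /* last: after this, no scratch register remains */
}
```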

Andrew Waterman

Feb 6, 2017, 9:34:41 PM2/6/17
to Andrew Lutomirski, RISC-V ISA Dev
On Thu, Dec 1, 2016 at 2:59 PM, Andrew Lutomirski <aml...@gmail.com> wrote:
This would work naturally if local_irq_disable returned an opaque
value that was passed to local_irq_enable, as with local_irq_save and
local_irq_restore. Alas, the former APIs are based around the notion
that there is a single interrupt enable bit. It's not clear how to
implement them without additional state (e.g. an in-memory flag) or
restrictions on SIE usage (e.g. either all interrupts are enabled or
all interrupts are disabled).
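The save/restore pair, sketched against a simulated sie mask (bit positions invented): the old mask itself is the opaque cookie, which is why those APIs cope naturally where the plain enable/disable pair does not.

```c
#include <stdint.h>

/* sie_mask stands in for the sie CSR; SIE_PERF is an invented bit
 * kept enabled so sampling continues inside critical sections. */
static uint64_t sie_mask;
#define SIE_PERF (1ULL << 9)

typedef uint64_t irq_flags_t;

static irq_flags_t local_irq_save(void)
{
    irq_flags_t old = sie_mask;   /* opaque cookie: the full old mask */
    sie_mask &= SIE_PERF;         /* keep only the sampling vector */
    return old;
}

static void local_irq_restore(irq_flags_t flags)
{
    sie_mask = flags;             /* whatever mix of vectors was enabled */
}
```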

>
>
> This gets some nice benefits. Instead of using NMIs, a perf-like tool would
> allocate an interrupt vector and would just not mask that particular vector
> in local_irq_disable(). This would allow profiling of everything except the
> entry and exit code itself.
>
>
> Handling stack overflow becomes simpler. If a stack overflow occurs during
> entry (with STE=1), then either the supervisor is killed cleanly or it would
> register using SBI for some fancier handling.
>
>
> Auditing or formally verifying entry code becomes *much* easier. The CSR
> swap right at the beginning is nasty with respect to traps that occur right
> after it. If the whole critical sequence runs with STE=0, then there cannot
> be any nested trap at all, so verification doesn't need to worry about
> reentrancy.
>
>
> Thoughts?
>
>
>

Andrew Lutomirski

Feb 6, 2017, 10:01:42 PM2/6/17
to RISC-V ISA Dev, aml...@gmail.com

I'm not sure I see the issue.  local_irq_disable() would clear SIE and local_irq_enable() would set SIE.  STE would only be affected by specialized RISC-V-specific entry code.

Andrew Waterman

Feb 6, 2017, 10:03:55 PM2/6/17
to Andrew Lutomirski, RISC-V ISA Dev
The SIE CSR is a mask of currently enabled interrupts. Which ones
would local_irq_enable turn back on?


Andrew Lutomirski

Feb 6, 2017, 10:44:02 PM2/6/17
to Andrew Waterman, RISC-V ISA Dev
All of them, because Linux doesn't use interrupt priorities :)

Even if Linux did use fancy masks, just storing the mask in a percpu variable ought to do the trick.

--Andy


Andrew Waterman

Feb 6, 2017, 11:04:35 PM2/6/17
to Andrew Lutomirski, RISC-V ISA Dev
Anyone want to comment on other OSes?

Samuel Falvo II

Feb 8, 2017, 11:54:17 AM2/8/17
to Andrew Lutomirski, Stefan O'Rear, Jacob Bachmeyer, RISC-V ISA Dev
On Thu, Dec 1, 2016 at 8:47 PM, Andrew Lutomirski <aml...@gmail.com> wrote:
> All that's really needed is an ecall or an instruction that takes a pointer
> to a structure containing the relevant CSRs, PC, and at least one GPR and
> loads them all.


And this is called SIEBK in zArchitecture.  ;)  Enter Start
Interpretive Execution discussions now, if they haven't already.

--
Samuel A. Falvo II

Andrew Lutomirski

Feb 8, 2017, 11:58:16 AM2/8/17
to Samuel Falvo II, Stefan O'Rear, Jacob Bachmeyer, RISC-V ISA Dev
It's called IRET on x86, and it's a turd, so YMMV.

--Andy

Samuel Falvo II

Feb 8, 2017, 1:19:28 PM2/8/17
to Andrew Lutomirski, Stefan O'Rear, Jacob Bachmeyer, RISC-V ISA Dev
On Wed, Feb 8, 2017 at 8:57 AM, Andrew Lutomirski <aml...@gmail.com> wrote:
> On Wed, Feb 8, 2017 at 8:54 AM, Samuel Falvo II <sam....@gmail.com> wrote:
> It's called IRET on x86, and it's a turd, so YMMV.

Indeed it will, considering IRET doesn't reload CRs at all, let alone
atomically.

Jacob Bachmeyer

Feb 8, 2017, 6:49:16 PM2/8/17
to Andrew Lutomirski, RISC-V ISA Dev
Andrew Lutomirski wrote:
> Currently, RISC-V cannot usefully handle exceptions that happen during
> exception entry or exit. To summarize earlier discussions, here are
> some concrete problems:
>
> * If a kernel wants to cleanly detect stack overflow and a stack
> overflow is possible during early entry, the assembly code gets
> very hairy.
> * If someone wants to use a tool like Linux's perf on RISC-V,
> there will be demand for as much of the kernel as possible to be
> profilable, which means that sampling interrupts should be
> enabled almost all the time, but RISC-V does not have a usable
> recoverable NMI mechanism. Bolting something on later may be
> rather messy.
> * Big iron machines can recover from accesses to bad DRAM. But
> what happens if bad DRAM is accessed while SIE=0?
>

Access to bad DRAM causes an NMI, which goes straight to M-mode and does
not use stvec in any case.  The only traps that go to stvec are listed in
Table 4.1 of the privileged ISA spec.  NMIs are described in section 3.4
and are restricted to hardware error conditions.  I am surprised that I
did not remember this earlier.

> This gets some nice benefits. Instead of using NMIs, a perf-like tool
> would allocate an interrupt vector and would just not mask that
> particular vector in local_irq_disable(). This would allow profiling
> of everything except the entry and exit code itself.
>

I think that the current model can do the same--as long as context save
can be guaranteed to complete without causing a fault, the perf-like
tool can do this with an interrupt vector that never gets masked and the
kernel simply does not actually clear SIE to disable interrupts, instead
masking all-but-the-perf-timer. Also, disabling interrupts merely
delays them--the timer interrupt is not lost because the kernel held it
off to complete some critical section. While using SIE for critical
sections limits the granularity of the sampling interrupts (a sampling
interrupt that occurs during a critical section will appear to have
occurred at the end of the critical section), that is an argument for
minimizing critical sections, not creating landmines that blow up the
supervisor, as STE would be.

The even better option: have the SEE implement and handle the sampling
interrupts and provide the information in some way. This way, even the
critical trap-entry code can be profiled, since the SEE can take and
handle a profiling sample timer interrupt regardless of SIE or any other
supervisor state.

> Handling stack overflow becomes simpler. If a stack overflow occurs
> during entry (with STE=1), then either the supervisor is killed
> cleanly or it would register using SBI for some fancier handling.
>

The supervisor can guarantee that trap-entry code will never take a
fault if a per-task save area is always present. In another discussion,
efforts towards nested traps led to (message-id
<CADJ6UvP=rmCWeXu3g_u2u7KCKfr6n...@mail.gmail.com>
from Stefan O'Rear) which I will quote here:
> On Thu, Nov 3, 2016 at 12:37 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>> > I think that the problem can be avoided, if the user context save area for
>> > the current task is guaranteed to always be present, but I was trying to
>> > find a general solution.
>>
>
> If the problem you are trying to solve is "I swapped out my percpu
> struct, how do I swap it back in", I think you made a mistake several
> steps ago.

I think that this applies here as well and creating the situation where
a synchronous trap (*any* synchronous trap) is effectively a
triple-fault is probably a bad idea.

The save area obviously must be in a region that can be accessed without
causing stack faults, but that should be easy to arrange. The stack
fault handler simply must be carefully written to avoid causing nested
stack faults. I think that this is a reasonable restriction.

-- Jacob

Andrew Lutomirski

Feb 10, 2017, 6:15:19 PM2/10/17
to RISC-V ISA Dev, aml...@gmail.com, jcb6...@gmail.com


On Wednesday, February 8, 2017 at 3:49:16 PM UTC-8, Jacob Bachmeyer wrote:
> Andrew Lutomirski wrote:
>
> > This gets some nice benefits.  Instead of using NMIs, a perf-like tool
> > would allocate an interrupt vector and would just not mask that
> > particular vector in local_irq_disable().  This would allow profiling
> > of everything except the entry and exit code itself.
> >
>
> I think that the current model can do the same--as long as context save
> can be guaranteed to complete without causing a fault, the perf-like
> tool can do this with an interrupt vector that never gets masked and the
> kernel simply does not actually clear SIE to disable interrupts, instead
> masking all-but-the-perf-timer.

True, but what's the benefit over my proposal?  The kernel code is essentially the same but without protection against error conditions that cause the entry code to fault.

> The even better option:  have the SEE implement and handle the sampling
> interrupts and provide the information in some way.  This way, even the
> critical trap-entry code can be profiled, since the SEE can take and
> handle a profiling sample timer interrupt regardless of SIE or any other
> supervisor state.

Then the SEE will have to be updated whenever the kernel wants to change what it does when a sample fires.
 

> > Handling stack overflow becomes simpler.  If a stack overflow occurs
> > during entry (with STE=1), then either the supervisor is killed
> > cleanly or it would register using SBI for some fancier handling.
>
> The supervisor can guarantee that trap-entry code will never take a
> fault if a per-task save area is always present.

The whole point is that, if something goes wrong and an entry nests, the current approach will result in corruption (infinite loops, messy crashes, or even privilege escalation).  Having a clean error would IMO be much nicer.
 
> In another discussion, efforts towards nested traps led to (message-id
> from Stefan O'Rear) which I will quote here:
> > On Thu, Nov 3, 2016 at 12:37 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> >
> >> > I think that the problem can be avoided, if the user context save area for
> >> > the current task is guaranteed to always be present, but I was trying to
> >> > find a general solution.
> >>
> >
> > If the problem you are trying to solve is "I swapped out my percpu
> > struct, how do I swap it back in", I think you made a mistake several
> > steps ago.
>
> I think that this applies here as well and creating the situation where
> a synchronous trap (*any* synchronous trap) is effectively a
> triple-fault is probably a bad idea.

A triple-fault is much better than privilege escalation.
 

> The save area obviously must be in a region that can be accessed without
> causing stack faults, but that should be easy to arrange.  The stack
> fault handler simply must be carefully written to avoid causing nested
> stack faults.  I think that this is a reasonable restriction.

This code may be distinctly non-trivial to write in a Linux-like kernel.  There can be multiple nested kernel entries in the same task, and *all* of them need to save state.  Most (all?) arch ports handle this by saving the state on the stack, and stacks can overflow.

Jacob Bachmeyer

Feb 10, 2017, 9:24:08 PM2/10/17
to Andrew Lutomirski, RISC-V ISA Dev
Andrew Lutomirski wrote:
> On Wednesday, February 8, 2017 at 3:49:16 PM UTC-8, Jacob Bachmeyer wrote:
>
> Andrew Lutomirski wrote:
>
> > This gets some nice benefits.  Instead of using NMIs, a perf-like tool
> > would allocate an interrupt vector and would just not mask that
> > particular vector in local_irq_disable().  This would allow profiling
> > of everything except the entry and exit code itself.
> >
>
> I think that the current model can do the same--as long as context save
> can be guaranteed to complete without causing a fault, the perf-like
> tool can do this with an interrupt vector that never gets masked and the
> kernel simply does not actually clear SIE to disable interrupts, instead
> masking all-but-the-perf-timer.
>
>
> True, but what's the benefit over my proposal? The kernel code is
> essentially the same but without protection against error conditions
> that cause the entry code to fault.

The benefit is avoiding the creation of a landmine that can blow up the
supervisor.

What error conditions can cause the entry code to fault and what
prevents the kernel from guaranteeing that those conditions never
occur?  The currently defined scause values are:

  • (0) instruction address misaligned:  not possible at trap
    entry--stvec cannot store an unaligned address.
  • (1) instruction access fault:  "I swapped out my trap handler, how
    do I swap it back in?"
  • (2) illegal instruction, (6) AMO address misaligned, and (8)
    environment call:  a kernel can easily prevent these from occurring
    during the trap entry critical section.
  • (3) breakpoint:  also easily prevented; simply disallow setting a
    breakpoint within the trap entry sequence, or use an alternate
    breakpoint strategy that branches directly to the debugger code from
    within the handler.  The debugger can even transparently refer to
    the saved context.
  • (5) load access fault and (7) store access fault:  these are the
    page fault cause codes and can be prevented by ensuring that the
    context save area is always present.

In practice, the location of the context save area must be derivable
from sscratch, and it probably makes sense to store the address of the
save area in sscratch and place a pointer to the current task struct
immediately prior to the context save area in memory.  This avoids
waiting for a LOAD to complete before the context save can begin and
should improve the performance of the trap handler on superscalar
implementations significantly.
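That layout can be sketched as follows; all structure and function names are invented, and the simulated sscratch is just a pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch (names invented): sscratch points directly at the context
 * save area, and the current task pointer sits immediately *before*
 * it, so the trap handler can start storing registers at
 * sscratch-relative offsets without first waiting on a dependent load. */
struct task;                    /* opaque to this sketch */

struct save_area {
    uint64_t gpr[31];           /* x1..x31 */
    uint64_t sepc, sstatus;
};

struct entry_frame {
    struct task *task;          /* lives just below the save area */
    struct save_area ctx;       /* a simulated sscratch points here */
};

/* Recover the task pointer from the (simulated) sscratch value. */
struct task *task_from_sscratch(struct save_area *sscratch)
{
    return ((struct entry_frame *)
            ((char *)sscratch - offsetof(struct entry_frame, ctx)))->task;
}
```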

> The even better option:  have the SEE implement and handle the sampling
> interrupts and provide the information in some way.  This way, even the
> critical trap-entry code can be profiled, since the SEE can take and
> handle a profiling sample timer interrupt regardless of SIE or any other
> supervisor state.
>
>
> Then the SEE will have to be updated whenever the kernel wants to
> change what it does when a sample fires.

For profiling, there is very little to do when a sample fires: save
some amount of context into a supervisor-provided log buffer somewhere,
set up the next sample, and continue execution. If you want to get
really fancy, the SEE could use a BPF variant for this. :)

> > Handling stack overflow becomes simpler.  If a stack overflow occurs
> > during entry (with STE=1), then either the supervisor is killed
> > cleanly or it would register using SBI for some fancier handling.
> >
>
> The supervisor can guarantee that trap-entry code will never take a
> fault if a per-task save area is always present.
>
>
> The whole point is that, if something goes wrong and an entry nests,
> the current approach will result in corruption (infinite loops, messy
> crashes, or even privilege escalation). Having a clean error would
> IMO be much nicer.

How can trap entry nest in a way that is not a bug in the supervisor?  (I was
once arguing for a different solution to the same problem as you are
here; I was convinced that this problem is illusory when I could not
produce a satisfactory answer to that question.)

> In another discussion, efforts towards nested traps led to (message-id
> <CADJ6UvP=rmCWeXu3g_u2u7KCKfr6n...@mail.gmail.com>
> from Stefan O'Rear) which I will quote here:
> > On Thu, Nov 3, 2016 at 12:37 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> >
> >> > I think that the problem can be avoided, if the user context save area for
> >> > the current task is guaranteed to always be present, but I was trying to
> >> > find a general solution.
> >>
> >
> > If the problem you are trying to solve is "I swapped out my percpu
> > struct, how do I swap it back in", I think you made a mistake several
> > steps ago.
>
> I think that this applies here as well and creating the situation where
> a synchronous trap (*any* synchronous trap) is effectively a
> triple-fault is probably a bad idea.
>
>
> A triple-fault is much better than privilege escalation.

Agreed, but avoiding the problem in the first place is better than either.

> The save area obviously must be in a region that can be accessed without
> causing stack faults, but that should be easy to arrange.  The stack
> fault handler simply must be carefully written to avoid causing nested
> stack faults.  I think that this is a reasonable restriction.
>
>
> This code may be distinctly non-trivial to write in a Linux-like
> kernel. There can be multiple nested kernel entries in the same task,
> and *all* of them need to save state. Most (all?) arch ports handle
> this by saving the state on the stack, and stacks can overflow.

A simple solution: upon kernel entry, allocate the space for the *next*
nested kernel entry's saved context on the stack, branching to the stack
overrun handler (which must have its own stack) if this would cause an
overflow.

The initial save area is near the base of the task kernel stack and
contains the user-mode context for the task; sscratch points here during
user-mode execution and is updated to point to the next save area upon
allocation of same.  (The sscratch update ends the critical section.)
Each save area is preceded by a pointer to the current task struct.
(The kernel thread pointer would point to the current task struct after
trap entry.)  This would require per-task kernel stacks, but that should
not be much of a problem, especially since only the kernel stacks for
currently-running tasks really need to be pinned.

Each save area contains a pointer to the next save area; this pointer is
NULL in the unused innermost save area.  Task-switch requires walking
the list to find the last used save area, which contains the saved
context for the innermost kernel entry for that task.  This simple
approach has the disadvantage that schedule() must "fake" a nested
kernel entry on the stack to save its caller's state, but that may not
actually be a problem and can be avoided with a more complex save area
chain that allows the innermost save area to be used if the task was
suspended synchronously.
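The chain walk described above, sketched on a minimal structure: names are invented and the saved context itself is elided, but the invariant is as stated -- the one unused innermost area terminates the chain with a NULL next pointer, and its predecessor holds the innermost saved context.

```c
#include <stddef.h>

/* Sketch (names invented) of the chained save areas: each kernel entry
 * pre-allocates the area for the *next* nested entry, areas link
 * inward, and the single unused innermost area has next == NULL. */
struct nested_save {
    struct nested_save *next;   /* deeper save area; NULL if this one
                                   is the pre-allocated, unused area */
    /* ...saved GPRs and CSRs would live here... */
};

/* Task switch walks inward: the predecessor of the unused terminal
 * area holds the context of the innermost kernel entry.  Returns NULL
 * if no entry has saved context yet. */
struct nested_save *innermost_context(struct nested_save *outermost)
{
    struct nested_save *prev = NULL;
    for (struct nested_save *s = outermost; s->next; s = s->next)
        prev = s;
    return prev;
}
```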


There is another possibility: inhibit hardware delegation of horizontal
traps when xIE is clear. (A particularly lazy supervisor could always
have SIE clear--S-mode interrupts are unconditionally enabled when the
processor is in U-mode.) The execution environment is then responsible
for saving enough state to resume the outer trap. This should be
relatively rare--possibly even rare enough for some SEEs to consider it
a fatal error in a supervisor. (But not-so-rare in a user-mode trap
handler--the supervisor can swap out any user page at will, and so must
be prepared to handle page faults when the U-mode context save area has
been swapped out. Likewise for a hypervisor, if virtual memory is
used.) Obviously, this does not help M-mode, but M-mode is the lowest
implementation level. Alternately, a synchronous trap in M-mode with
MIE clear could be promoted to an NMI, but that rubs me the wrong way
just like your xTE proposal does.

-- Jacob
