I have a minimal (IMO) proposal that could substantially improve the situation:
1. Rename sstatus.SIE to sstatus.STE. STE stands for Supervisor Trap Enabled. (Having SIE and sie be different things is IMO rather confusing, too.) Similarly, SPIE is renamed to SPTE.
2. STE's semantics change a bit. Currently, sstatus.SIE=0 (presumably - it's not quite clear in the spec) causes asynchronous interrupts to be deferred instead of causing traps. It keeps doing this, but it gains a new function: the supervisor entry point (stvec) will not be invoked under any circumstances at all if STE=0. If a synchronous exception would cause a trap to supervisor mode, it traps to machine mode instead, and machine mode code is expected to kill the supervisor or, if configured via SBI, notify the supervisor by some means other than calling stvec.
To use this, kernels change their behavior slightly. When a kernel wants to turn off interrupts to be temporarily atomic (Linux's local_irq_disable(), for example), it will do so using the SIE CSR instead of using sstatus.SIE. Kernels would normally run with STE=0 only during entry or exit.
This gets some nice benefits. Instead of using NMIs, a perf-like tool would allocate an interrupt vector and would just not mask that particular vector in local_irq_disable(). This would allow profiling of everything except the entry and exit code itself.
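A minimal sketch of what local_irq_disable()/local_irq_restore() could look like under this scheme, assuming a made-up PERF_IRQ bit in the sie CSR and simple inline-asm accessors (none of these names are real Linux or RISC-V definitions):

/* Interrupt masking via the sie CSR; sstatus.STE is left alone, so the
 * unmasked perf vector can still be delivered while "interrupts are off". */
#define PERF_IRQ_BIT    (1UL << 16)     /* made-up vector number reserved for perf */
#define ALL_IRQ_BITS    (~0UL)

static inline unsigned long sie_read_clear(unsigned long mask)
{
    unsigned long prev;
    /* csrrc: read the old sie value, then clear the given bits */
    __asm__ volatile("csrrc %0, sie, %1" : "=r"(prev) : "r"(mask) : "memory");
    return prev;
}

static inline void sie_set(unsigned long mask)
{
    __asm__ volatile("csrs sie, %0" : : "r"(mask) : "memory");
}

/* "Disable interrupts": mask everything except the perf vector.  STE stays 1,
 * so only entry/exit code ever runs with traps fully disabled. */
static inline unsigned long local_irq_disable_save(void)
{
    return sie_read_clear(ALL_IRQ_BITS & ~PERF_IRQ_BIT);
}

static inline void local_irq_restore(unsigned long flags)
{
    sie_set(flags);     /* re-enable whatever was enabled before */
}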
Handling stack overflow becomes simpler. If a stack overflow occurs during entry (with STE=0), then either the supervisor is killed cleanly or it would register using SBI for some fancier handling.
Auditing or formally verifying entry code becomes *much* easier. The CSR swap right at the beginning is nasty with respect to traps that occur right after it. If the whole critical sequence runs with STE=0, then there cannot be any nested trap at all, so verification doesn't need to worry about reentrancy.
Thoughts?
Andrew Lutomirski wrote:
> I have a minimal (IMO) proposal that could substantially improve the
> situation:
>
> 1. Rename sstatus.SIE to sstatus.STE. STE stands for Supervisor
> Trap Enabled. (Having SIE and sie be different things is IMO
> rather confusing, too.) Similarly, SPIE is renamed to SPTE.
> 2. STE's semantics change a bit. Currently, sstatus.SIE=0
> (presumably - it's not quite clear in the spec) causes
> asynchronous interrupts to be deferred instead of causing
> traps. It keeps doing this, but it gains a new function: the
> supervisor entry point (stvec) will not be invoked under any
> circumstances at all if STE=0. If a synchronous exception would
> cause a trap to supervisor mode, it traps to machine mode
> instead, and machine mode code is expected to kill the
> supervisor or, if configured via SBI, notify the supervisor by
> some means other than calling stvec.
>
> To use this, kernels change their behavior slightly. When a kernel
> wants to turn off interrupts to be temporarily atomic (Linux's
> local_irq_disable(), for example), it will do so using the SIE CSR
> instead of using sstatus.SIE. Kernels would normally run with STE=0
> /only/ during entry or exit.
>
What about other privilege modes? A hypervisor faces the same problems.
> Auditing or formally verifying entry code becomes *much* easier. The
> CSR swap right at the beginning is nasty with respect to traps that
> occur right after it. If the whole critical sequence runs with STE=0,
> then there cannot be any nested trap at all, so verification doesn't
> need to worry about reentrancy.
This is also one of the reasons that I proposed the *swap CSRs -- by
performing the swap in hardware, the SEE can determine afterwards which
value belongs where and deliver a nested trap.
> Thoughts?
>
I like the concept and offer another that I have been formulating: add
virtual trap flags for execution environments to use. This would
require two or three (three if U-mode virtual traps are supported)
additional flags. These would be HVT, SVT, and UVT, each controlled by
a higher privilege level than the one it affects. When a privilege
level's virtual trap flag is set, xRET from that level traps to the
execution environment. All
trap delegation to a privilege level is inhibited while that privilege
level is handling a virtual trap.
In the example case of a supervisor taking a trap during trap entry, the
system would trap to the SEE, which saves enough state to resume the
original trap and "fakes" a new (double-fault?) trap, by returning to
stvec with SVT set. After the supervisor has handled the virtual trap
and executes SRET, the SEE regains control and can re-deliver the
original trap.
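A minimal sketch of what the SEE/M-mode side of this could look like, assuming made-up names throughout (the MSTATUS_SVT bit, the synthetic cause code, the csr_* macros, mret_to, and the struct layout are all invented for illustration; none of this is an existing interface):

#define MSTATUS_SVT                 (1UL << 25)  /* hypothetical SVT flag */
#define CAUSE_VIRTUAL_DOUBLE_FAULT  24UL         /* hypothetical cause code */

#define csr_read(csr)        ({ unsigned long v; \
    __asm__ volatile("csrr %0, " #csr : "=r"(v)); v; })
#define csr_write(csr, val)  __asm__ volatile("csrw " #csr ", %0" : : "r"(val))
#define csr_set(csr, val)    __asm__ volatile("csrs " #csr ", %0" : : "r"(val))
#define csr_clear(csr, val)  __asm__ volatile("csrc " #csr ", %0" : : "r"(val))

struct saved_trap {
    unsigned long sepc, scause, stval, sstatus;
    int pending;                        /* original trap waiting to be redelivered */
};

static struct saved_trap pending_trap;  /* would be per-hart in practice */

static void mret_to(unsigned long pc)
{
    csr_write(mepc, pc);                /* assumes mstatus.MPP already selects S-mode */
    __asm__ volatile("mret");
}

/* A trap bound for S-mode arrived while S-mode was still in its trap-entry
 * critical section: stash it, fake a "double fault", and set SVT. */
static void deliver_virtual_trap(void)
{
    pending_trap.sepc    = csr_read(sepc);
    pending_trap.scause  = csr_read(scause);
    pending_trap.stval   = csr_read(stval);
    pending_trap.sstatus = csr_read(sstatus);
    pending_trap.pending = 1;

    csr_write(scause, CAUSE_VIRTUAL_DOUBLE_FAULT);
    csr_set(mstatus, MSTATUS_SVT);      /* make the eventual SRET trap back to the SEE */
    mret_to(csr_read(stvec));           /* enter the supervisor handler */
}

/* SRET executed while SVT was set: the virtual trap is finished, so restore
 * the saved state and redeliver the original trap through stvec. */
static void finish_virtual_trap(void)
{
    csr_clear(mstatus, MSTATUS_SVT);
    if (pending_trap.pending) {
        pending_trap.pending = 0;
        csr_write(sepc,    pending_trap.sepc);
        csr_write(scause,  pending_trap.scause);
        csr_write(stval,   pending_trap.stval);
        csr_write(sstatus, pending_trap.sstatus);
        mret_to(csr_read(stvec));
    }
}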
All that's really needed is an ecall or an instruction that takes a pointer to a structure containing the relevant CSRs, PC, and at least one GPR and loads them all.
Doing this with an instruction would involve fancy microcode that's otherwise unnecessary. Ecall seems simpler.
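A minimal sketch of such a structure and an ecall wrapper, assuming a made-up SBI function ID and calling convention (the structure layout, SBI_LOAD_RESUME_FRAME, and sbi_call1 are all invented for illustration):

#include <stdint.h>

/* Everything the supervisor needs reloaded in one shot to resume a trap:
 * the relevant CSRs, the PC, and at least one GPR (here, the stack pointer). */
struct trap_resume_frame {
    uint64_t pc;        /* where execution resumes */
    uint64_t sstatus;
    uint64_t scause;
    uint64_t stval;
    uint64_t sscratch;
    uint64_t sp;        /* the one GPR needed to find everything else */
};

#define SBI_LOAD_RESUME_FRAME 0x0A000001UL   /* made-up function ID */

/* Made-up SBI convention: function ID in a7, argument in a0. */
static inline long sbi_call1(unsigned long fid, unsigned long arg0)
{
    register unsigned long a0 __asm__("a0") = arg0;
    register unsigned long a7 __asm__("a7") = fid;
    __asm__ volatile("ecall" : "+r"(a0) : "r"(a7) : "memory");
    return (long)a0;
}

/* Ask the SEE to load *frame in one indivisible step, so the supervisor never
 * runs a multi-CSR restore sequence that could itself be interrupted by a
 * trap.  If the call succeeds, it does not return here. */
static inline void load_resume_frame(const struct trap_resume_frame *frame)
{
    sbi_call1(SBI_LOAD_RESUME_FRAME, (unsigned long)frame);
}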
Andrew Lutomirski wrote:
> This gets some nice benefits. Instead of using NMIs, a perf-like tool
> would allocate an interrupt vector and would just not mask that
> particular vector in local_irq_disable(). This would allow profiling
> of everything except the entry and exit code itself.
>
I think that the current model can do the same -- as long as context
save can be guaranteed to complete without causing a fault, the
perf-like tool can use an interrupt vector that is never masked: the
kernel simply does not actually clear SIE to disable interrupts and
instead masks everything except the perf timer.
The even better option: have the SEE implement and handle the sampling
interrupts and provide the information in some way. This way, even the
critical trap-entry code can be profiled, since the SEE can take and
handle a profiling sample timer interrupt regardless of SIE or any other
supervisor state.
> Handling stack overflow becomes simpler. If a stack overflow occurs
> during entry (with STE=0), then either the supervisor is killed
> cleanly or it would register using SBI for some fancier handling.
>
The supervisor can guarantee that trap-entry code will never take a
fault if a per-task save area is always present.
In another discussion, efforts towards nested traps led to a reply from
Stefan O'Rear (message-id elided), which I will quote here:
> On Thu, Nov 3, 2016 at 12:37 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>> > I think that the problem can be avoided, if the user context save area for
>> > the current task is guaranteed to always be present, but I was trying to
>> > find a general solution.
>>
>
> If the problem you are trying to solve is "I swapped out my percpu
> struct, how do I swap it back in", I think you made a mistake several
> steps ago.
I think that this applies here as well, and that creating a situation
where a synchronous trap (*any* synchronous trap) is effectively a
triple-fault is probably a bad idea.
The save area obviously must be in a region that can be accessed without
causing stack faults, but that should be easy to arrange. The stack
fault handler simply must be carefully written to avoid causing nested
stack faults. I think that this is a reasonable restriction.