> * Add a new status bit SATOMIC. SATOMIC is never set when running
> U-mode code. It has no particular effect when running in H-mode
> or M-mode (except that its value is preserved when returning to
> S-mode).
> * On a trap to S-mode, if SATOMIC is clear: x1 is saved to a new
> CSR called SX1. Another new CSR called SX1 is copied to X1. A
> new flag SATOMIC is set. (Presumably this flag lives in a
> status CSR. See farther down for its exact effect.)
> * On a trap to S-mode if SATOMIC is set: nothing is saved at all.
> The sole effect of the trap is to change PC to point to the
> double-fault handler.
> * SRET is illegal if SATOMIC is not set.
>
> Other than SATOMIC, this has relatively little effect except that
> handlers no longer have any need to begin with a CSR swap. This makes
> it easier to write a handler that can survive a nested trap.
SATOMIC is basically the invisible "trap handler critical section" flag
I suggested, but UATOMIC and HATOMIC will also be needed for nested
traps in U-mode and H-mode, which is why I made the flag invisible to
save bits in mstatus. The other major difference is that my proposal
enables execution to continue after a nested trap: the trap handler is
simply resumed at the instruction where the inner trap occurred. How
does execution resume after the double-fault handler when the address of
the faulting instruction is not known? Or are we back to idempotent
critical sections?
The SX1 register you suggest is essentially sswap hardwired to the value
0x01 and a single sscratch register. Simpler, I admit, but I did not
want to hardwire which registers a trap swaps out. Also, the
save-full-context instruction I propose would save the value of SX1 in
the x1 slot in the context object, since that is the value x1 had when
the trap was taken.
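To make the comparison concrete, here is a rough C model of the quoted
trap-entry rules. It is only a sketch under my own assumptions: the
struct, the function names, and the vector addresses are made up; I read
the two SX1 sentences as a swap of x1 with a single scratch CSR; and I
assume SRET clears SATOMIC, since the flag is never set in U-mode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Made-up handler addresses, for illustration only. */
#define TRAP_VECTOR         0x1000u
#define DOUBLE_FAULT_VECTOR 0x2000u

/* Hypothetical hart state; only the fields this model needs. */
struct hart {
    uint64_t pc, x1;
    uint64_t sx1;      /* scratch CSR exchanged with x1 on trap entry */
    uint64_t sepc;
    bool     satomic;
};

/* Trap to S-mode. Returns false if this was a double fault. */
bool trap_to_smode(struct hart *h)
{
    assert(h != NULL);
    if (h->satomic) {
        /* Nothing is saved: the PC of the inner fault is lost. */
        h->pc = DOUBLE_FAULT_VECTOR;
        return false;
    }
    uint64_t interrupted_x1 = h->x1;
    h->x1 = h->sx1;            /* handler gets its base pointer in x1 */
    h->sx1 = interrupted_x1;   /* SX1 now holds the pre-trap x1 */
    h->sepc = h->pc;
    h->satomic = true;
    h->pc = TRAP_VECTOR;
    return true;
}

/* SRET. Returns false where the proposal makes it illegal. */
bool sret(struct hart *h)
{
    assert(h != NULL);
    if (!h->satomic)
        return false;
    h->satomic = false;        /* assumption: cleared on return */
    h->pc = h->sepc;
    return true;
}
```

In this model a nested trap overwrites pc without recording it anywhere,
which is the resumption problem I am asking about. After a normal entry,
sx1 holds the value x1 had when the trap was taken, which is the value a
full context save would want in the x1 slot.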
> A lightweight entry (syscall on some microkernels, etc) can save
> whatever registers it needs for scratch space, do some work, restore
> registers, and return. Any nested trap is an error or has special
> handling.
Same as my proposal--the critical section is the entire trap handler if
context-save is never executed.
> A normal entry will likely start with a short instruction sequence to
> save SEPC, SCAUSE, STP, etc to memory (using X1-relative stores), will
> update SX1 to point wherever a nested entry wants it, and then will
> clear SATOMIC.
This is equivalent, except I propose a special save-minimal-context
instruction to save those registers and save-full-context instructions
to store the complete less-privileged context. The latter is intended
to improve performance for a user context switch; see below.
> It may also improve performance, judging by my x86 experience. Having
> an exception handler start with a special instruction to load a
> pointer into some register and then doing everything relative to that
> pointer means that you have to wait for the first entry instruction to
> retire before making any further progress at all. I did an experiment
> once in which i got rid of the SWAPGS at the beginning of the x86
> syscall handler, and it saved several cycles. (I did it with a gross
> hack in which I assumed only one CPU and stored everything at a fixed
> absolute address.) SWAPGS on x86 is conceptually like a swap to
> SSCRATCH. If the CPU were responsible for populating a base register,
> the microcode could do it early, so the value would be available
> without pipeline hazards right away.
The reason to add special instructions is that hardware may be able to
use higher-performance burst-mode memory accesses not otherwise
available in RISC-V. This is also the reason that the context object
layout is required to be (machine-readably) documented but is left
unspecified--the exact layout is a matter of microarchitectural
convenience. For example, an implementation that internally divides the
register file into banks may find it easier to save the registers in
some interleaved pattern and this is allowed, provided that each
register's value is contiguous and naturally-aligned. (The context
object itself must be aligned on at least an XLEN-bit boundary.) I draw
the line at dumping a rename table, though, since that would trade too
much software complexity (every access to a saved context would need to
use the dumped rename table) for very little benefit (hardware could
easily do rename table lookups while saving the context). Similarly,
shuffling register contents is not allowed, since software may have to
untangle them, and permutations are essentially free in hardware, no
matter how the registers are stored internally. The save-context
instructions are simply a faster version of a sequence of stores that
also clear the internal trap critical section flag, although they do
perform another neat trick, directly saving certain CSRs without
clobbering an architectural register.
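As an illustration of what a machine-readable layout description could
look like to software, here is a minimal C sketch. Everything in it is
hypothetical: the descriptor format, the names, the RV64 slot size, and
the two-bank interleaving are my own assumptions, not part of any
specification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NREGS      31   /* x1..x31 */
#define XLEN_BYTES 8    /* assuming RV64 */

/* Hypothetical layout descriptor: byte offset of each register's slot. */
struct ctx_layout {
    size_t size;            /* total size of the context object */
    size_t offset[NREGS];   /* offset[i] is the slot for x(i+1) */
};

/* Example: two banks, odd-numbered registers first, then even. */
void ctx_layout_interleaved(struct ctx_layout *l)
{
    l->size = NREGS * XLEN_BYTES;
    for (size_t i = 0; i < NREGS; i++) {
        size_t bank = i % 2, idx = i / 2;
        l->offset[i] = (bank * 16 + idx) * XLEN_BYTES;
    }
}

/* Check the stated rules: each slot naturally aligned, inside the
 * object, and not shared with another register. */
bool ctx_layout_valid(const struct ctx_layout *l)
{
    for (size_t i = 0; i < NREGS; i++) {
        if (l->offset[i] % XLEN_BYTES != 0)
            return false;
        if (l->offset[i] + XLEN_BYTES > l->size)
            return false;
        for (size_t j = 0; j < i; j++)
            if (l->offset[i] == l->offset[j])
                return false;
    }
    return true;
}

/* Read saved register x<reg> out of a context object. */
uint64_t ctx_get(const struct ctx_layout *l, const void *ctx, int reg)
{
    assert(reg >= 1 && reg <= NREGS);
    uint64_t v;
    memcpy(&v, (const unsigned char *)ctx + l->offset[reg - 1], sizeof v);
    return v;
}
```

The point is that software needs only one table lookup per access, no
matter how the hardware chose to order the slots. A dumped rename table
would add a second level of indirection to every such access.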
The broader purpose was to allow SRET to also perform a (user thread)
context switch. In the case of a process switch, the supervisor would
first load a new *ptbr value. I have seen the pages of assembler code
in Linux to do context switches, and thought "it would be nice to have
hardware context-save/context-restore instructions that are actually
faster than doing it in software". Again, I expect this to be an
operation that can complete in less than 50 cycles, including the trap
return, which is a perfectly predictable jump.
Linux could, at first, always save a full user context on trap entry and
later optimize for particular cases where this can be delayed or avoided.
I expect the save, including the trap itself, to require less than 50
cycles, absent memory latency, which is plausible if L1 cachelines are
available to hold the writes and the trap entry is cached.
The instructions I propose can be implemented without actual microcode,
using a bit of additional logic in the decoder to synthesize the
required operation sequence. (Okay, so that is basically microcode
generated by hardware as needed instead of read from ROM. Po-tay-to,
po-tah-to, I admit.) But a more advanced implementation actually could
implement context-save in hardware, perhaps with a special double-width
memory path that uses both register read ports at once and requires less
than 20 cycles to save trap CSRs and the less-privileged values of all
31 integer registers. An out-of-order implementation could simply mark
the data hazards and queue the stores for context-save, and shuffle the
loads for context-restore to satisfy earlier dependencies first. If I
understand correctly, even a full user context-save on trap entry should
be faster on RISC-V than merely raising an exception on x86 in all or
nearly all implementations.
Note that there is no requirement that a saved context actually be
restored--if the supervisor does not clobber any registers not swapped
with sscratch CSRs, then plain SRET will correctly continue user
execution. An implementation is, of course, allowed to hold a copy of
the most-recently saved context in some internal buffer (or a set of
shadow registers), and to use it if a context-restore is executed from
the same address with no intervening stores (or to snoop such writes and
update the copy), but the supervisor can freely "take snapshots" and
throw them away if they are not needed. This is important, because
context-save may be faster than context-restore. The register file has
two read ports, but may have only one write port, although banked
register files may have secondary access ports useful for
context-restore and context-save.
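The internal-buffer optimization can be modeled in software roughly as
follows. This is a sketch of the idea only, with invented names, and it
uses the simpler "invalidate on an intervening store" policy rather than
snooping writes and updating the copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NREGS     31
#define CTX_BYTES (NREGS * sizeof(uint64_t))

/* Hypothetical shadow copy of the most recently saved context. */
struct shadow {
    bool      valid;
    uintptr_t addr;          /* where that context was saved */
    uint64_t  regs[NREGS];
};

/* context-save: write to memory and remember a copy. */
void context_save(struct shadow *s, void *mem, const uint64_t regs[NREGS])
{
    assert(mem != NULL);
    memcpy(mem, regs, CTX_BYTES);
    memcpy(s->regs, regs, CTX_BYTES);
    s->addr = (uintptr_t)mem;
    s->valid = true;
}

/* Any store that overlaps the saved context invalidates the copy. */
void observe_store(struct shadow *s, const void *p, size_t len)
{
    uintptr_t a = (uintptr_t)p;
    if (s->valid && a < s->addr + CTX_BYTES && a + len > s->addr)
        s->valid = false;
}

/* context-restore: returns true if the shadow copy could be used. */
bool context_restore(struct shadow *s, const void *mem, uint64_t regs[NREGS])
{
    if (s->valid && s->addr == (uintptr_t)mem) {
        memcpy(regs, s->regs, CTX_BYTES);   /* fast path, no loads */
        return true;
    }
    memcpy(regs, mem, CTX_BYTES);           /* slow path, from memory */
    return false;
}
```

A snapshot that is never restored costs nothing further in this model:
the next context-save simply replaces the copy.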
> Aside: All of the packing of status bits and "previous" status bits
> into the same register is nice from a space-saving perspective, but it
> might be nicer in the long run to have them be the same bit positions
> in different registers. That would reduce the amount of thought
> needed for save-and-restore. It would also allow SATOMIC, HATOMIC,
> all the various interrupt enable bits, etc to be consolidated into
> just the active copy and the copy that's saved for restoration by
> xRET. But maybe this doesn't matter so much.
It seems to be intended for performance as well--only one status
register to save/restore on context switch, at any privilege level.
-- Jacob