Proposal: swap CSRs, additional scratch CSRs and fast context save/restore instructions

Jacob Bachmeyer

Oct 7, 2016, 6:38:53 PM

to RISC-V ISA Dev

The recent discussion about nested traps got me thinking and I have a
possible improvement for trap handling: add hardware support for
exchanging scratch CSRs and fast context save/restore instructions that
recognize the scratch CSRs. Fortunately, there is plenty of space under
the PRIV minor opcode. This proposal uses at most 16 instruction slots
under the PRIV minor opcode, but more probably eight if the preferred
option to combine restore-context and trap-return is chosen.

The new trap model I propose begins with a critical section until a
save-context instruction is executed, which writes the necessary state
to cover a nested trap to a designated memory address. The trap handler
then executes and ends with a second critical section containing
restore-context followed by xRET. Trap handlers that can guarantee that
they will not take nested traps can omit the
save-context/restore-context and simply run in a critical section with
interrupts disabled. Another option, which eliminates the second
critical section, is to combine restore-context and trap-return into a
single instruction. The critical sections in a trap handler must be
idempotent, since a nested trap during a critical section will cause the
execution environment to restart the critical section. Hardware trap
delegation is suppressed during a critical section and any synchronous
traps taken are instead delivered to the execution environment.

For concreteness and to simplify the discussion, I will use supervisor
traps and supervisor CSRs as an example; this proposal includes
analogous changes for other privilege levels. Configuration string
examples assume that user-mode traps and hypervisor mode are not
implemented; the extensions needed for them should be obvious. Machine mode
is a bit different, as explained at the end.

I propose moving the sip CSR to a different group to simplify CSR
decoding, since sscratch, sepc, scause, and sbadaddr are all essentially
XLEN-bit general purpose registers and their contents do not affect the
operation of the processor. Implementations with register renaming
could usefully place these CSRs in an expanded register file and decode
CSR access opcodes for these registers accordingly. This decoding is
simplified if these "side-effect-free" CSRs are easier to distinguish
from CSRs that do have side-effects. (To be clear: of the current
"supervisor trap handling" CSRs, only sip can have side-effects while
the others are more-or-less general-purpose, some of which are written
by hardware.)

==New sswap CSR and additional sscratch CSRs==

The proposed new sswap CSR would be in the supervisor trap setup group
and would contain an implementation-defined number of 5-bit WARL fields,
up to the number of supervisor scratch registers available in the
implementation. The "swap register" fields each can contain a 5-bit
register number or zero and are numbered starting from the
least-significant end of the sswap CSR and must contain strictly
increasing values except for fields containing zero. Unimplemented
fields in sswap are hardwired to zero. All implemented fields must be
contiguous on the least-significant end of the sswap CSR. On trap
entry, each field in sswap causes the numbered architectural register to
be exchanged with the corresponding scratch CSR. Fields requesting a
swap with x0 result in no swap at all for that scratch CSR. The
ordering constraint simplifies the implementation of the context save
instructions.
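
The swap-on-trap-entry rule above can be sketched as a small software model. This is only an illustration, not part of the proposal text: the function names are made up here, and raising an error on an out-of-order sswap value is a simplification (a real WARL field would instead hold some legal value after the write).

```python
def decode_sswap(sswap, nfields):
    """Split sswap into nfields 5-bit register numbers, least-significant first."""
    return [(sswap >> (5 * i)) & 0x1F for i in range(nfields)]


def is_legal_sswap(fields):
    """Nonzero fields must be strictly increasing; zero fields are ignored."""
    nonzero = [f for f in fields if f != 0]
    return all(a < b for a, b in zip(nonzero, nonzero[1:]))


def trap_entry_swap(xregs, sscratch, sswap):
    """Exchange x[SSRn] with sscratch[n] for every nonzero field SSRn."""
    fields = decode_sswap(sswap, len(sscratch))
    if not is_legal_sswap(fields):
        # Simplification for the model only; hardware would never hold such a value.
        raise ValueError("sswap fields are not strictly increasing")
    for n, reg in enumerate(fields):
        if reg != 0:  # a field naming x0 requests no swap
            xregs[reg], sscratch[n] = sscratch[n], xregs[reg]


# Example: SSR0 = 5 (x5), SSR1 = 0 (no swap), SSR2 = 10 (x10).
xregs = [0] + [100 + i for i in range(1, 32)]  # stand-in user values
sscratch = [0xA0, 0xA1, 0xA2]                  # stand-in supervisor values
sswap = 5 | (0 << 5) | (10 << 10)
trap_entry_swap(xregs, sscratch, sswap)
```

After the call, x5 and x10 hold the supervisor's values from sscratch0 and sscratch2, the user's x5 and x10 sit in those scratch CSRs, and sscratch1 is untouched.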

The supervisor trap handling CSRs would be renumbered to put sepc,
scause, and sbadaddr ahead of an implementation-defined number of
sscratch registers, neither less than 1 nor more than 29. The upper
limit ensures that the total number of trap handling side-effect-free
CSRs does not exceed 32. Providing more scratch registers to S-mode
than to higher privilege levels may make sense, since most traps are
expected to be to S-mode. I expect most implementations to provide one
to six sscratch CSRs. The exact number of scratch registers available
to each privilege level must be in the configuration string and may
differ by privilege level. For example, in a "core" resource,
"scratch-csr-count { M 2; S 4; };" for an implementation that has 2
mscratch CSRs and 4 sscratch CSRs and supports neither user traps nor
H-mode.

A diagram for sswap, on RV32:

 31  30 29    25 24    20 19    15 14    10 9      5 4      0
|  --  |  SSR5  |  SSR4  |  SSR3  |  SSR2  |  SSR1  |  SSR0  |
   2        5        5        5        5        5        5

All fields in sswap are WARL and are initialized to zero when a
supervisor begins execution. Suppose that an implementation has only
four sscratch registers (sscratch0-sscratch3) and only supports three
automatic swaps. Then SSR5, SSR4, and SSR3 will all be hardwired to
zero and SSR0, SSR1, and SSR2 will control which registers are exchanged
with sscratch0, sscratch1, and sscratch2, respectively. On RV64 and
RV128, sswap may contain up to SSR11 and SSR24, respectively, but
implementations are not required to actually support this many swaps.
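
The field counts quoted above follow from packing 5-bit fields into an XLEN-bit CSR. A quick check of that arithmetic, with helper names invented for this sketch:

```python
def max_swap_fields(xlen):
    """Number of whole 5-bit SSRn fields that fit in an XLEN-bit sswap CSR."""
    return xlen // 5


def field_bits(n):
    """(most-significant bit, least-significant bit) occupied by field SSRn."""
    return (5 * n + 4, 5 * n)


# Highest field index available for each base ISA width.
highest = {xlen: max_swap_fields(xlen) - 1 for xlen in (32, 64, 128)}
```

On RV32 this gives six fields (SSR0 to SSR5) with bits 31:30 left over, matching the diagram; RV64 and RV128 top out at SSR11 and SSR24.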

These proposed automatic CSR swaps serve two purposes. First, they can
fill in part of the latency of taking a trap on implementations where
the writes to sepc, scause, and (possibly) sbadaddr still leave empty
cycles before the first instruction of the trap handler is available.
Second, and perhaps more importantly, this permits a trap handler to
immediately have working registers, instead of doing a very careful
dance to preserve state, similar to the long-standing concept of a
shadow register file.

The additional hardware burden on implementations is small and mostly
optional--sswap can be entirely hardwired to zero if this feature is not
supported. To implement this, the decoder must be able to construct a
sequence of CSRRW operations with the csr field counting through
sscratch registers and rs1=rd=SSRn or a no-op if SSRn is zero. Assuming
that trap latency is fixed, this can be a fixed sequence.
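
As a rough illustration of the fixed sequence described above, the following sketch lists the operations a decoder might synthesize for a given sswap value. The textual "sscratchN" CSR names are placeholders, since this proposal does not assign CSR numbers.

```python
def swap_sequence(sswap, nfields):
    """Return one synthesized operation per implemented SSRn field."""
    ops = []
    for n in range(nfields):
        reg = (sswap >> (5 * n)) & 0x1F
        if reg == 0:
            ops.append("nop")  # SSRn = 0: this slot does nothing
        else:
            # CSRRW with rd = rs1 = SSRn exchanges the register and the CSR.
            ops.append(f"csrrw x{reg}, sscratch{n}, x{reg}")
    return ops


ops = swap_sequence(5 | (10 << 10), 3)
```

The sequence always has one slot per implemented field, so its length does not depend on the value written to sswap.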

The additional burden on supervisor software is also small, since the
use of this feature is optional and sswap is initially zero at the
supervisor's entry point. To make full use of this, supervisors will
need to generate their trap entry and return stubs during
initialization, once the number of available scratch swaps is known. A
supervisor need only be aware of the changes to CSR numbering that are
part of this proposal.

==Fast context save/restore instructions==

There are eight forms of saved context in this proposal, corresponding
to minimal and full contexts at each privilege level. Contexts can only
be saved at an address aligned to an XLEN-bit boundary and context
save/restore may be more efficient at still wider alignments. The
recommended alignment for saving each type of context and the size of
each type of context must appear in the configuration string. For
example, in a "core" resource, "base-context-size {M 16 176 32; S 16 144
32;};" would indicate that both supervisor and monitor minimal contexts
are 16 bytes, a full context saved in M-mode (covering the supervisor
context) is 176 bytes, a full context saved in S-mode (covering the user
context is 144 bytes, and all types of contexts have a recommended
alignment of 32 bytes.
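
For illustration, software might read that entry as follows. This is a sketch that handles only the single example entry above; it is not a general configuration-string parser, and the function and key names are invented here.

```python
import re


def parse_context_sizes(entry):
    """Parse 'base-context-size {M 16 176 32; S 16 144 32;};' into a dict."""
    body = re.search(r"\{(.*)\}", entry, re.S).group(1)
    sizes = {}
    for item in body.split(";"):
        parts = item.split()
        if not parts:
            continue  # skip the empty piece after the last ';'
        # Field order as described above: minimal size, full size, alignment.
        mode, minimal, full, align = parts[0], *map(int, parts[1:4])
        sizes[mode] = {"minimal": minimal, "full": full, "align": align}
    return sizes


sizes = parse_context_sizes("base-context-size {M 16 176 32; S 16 144 32;};")
```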

A minimal context includes only enough information to permit a nested
trap at the same privilege level. This is at least sstatus, sepc,
scause, and sbadaddr for the supervisor and possibly some additional
state. A full context includes all of the information that must be
preserved to effect a context switch, at least for the main integer
unit. Floating-point context is managed separately. The layout of the
saved context must be documented in machine-readable form in the
configuration string.

This requires either 16 instructions (minimal/full context save/restore
for each privilege level) or 9 instructions (minimal/full context save
for each privilege level and one common context restore) if all contexts
begin with the relevant status register and the SD and XS fields (which
are read-only even in mstatus because they are generated "live" by
hardware) are reused to indicate minimal/full and the privilege level of
the saved context. Obviously, privilege levels can only restore
less-privileged contexts.

Supervisor fast full context save, which writes sstatus, the trap state
(sepc, scause, sbadaddr) CSRs, and all user state, uses the sswap CSR to
retrieve the user-visible contents of the register file from the
sscratch CSRs. Supervisor context restore also uses the sswap CSR to
write the user-visible contents of the appropriate registers to the
sscratch CSRs instead of the actual register file. Effectively,
registers listed in the SSRn fields "belong" to the supervisor.
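
A small model may make the role of sswap in save and restore clearer. It is a sketch under assumptions: the context is a plain dictionary rather than an implementation-defined memory layout, only the integer registers are restored, and all names are hypothetical.

```python
def save_full_context(xregs, sscratch, swap_fields, trap_csrs):
    """Return a saved context holding the user-visible register values."""
    user_x = list(xregs)
    for n, reg in enumerate(swap_fields):
        if reg != 0:
            user_x[reg] = sscratch[n]  # the user's value lives in the scratch CSR
    ctx = dict(trap_csrs)              # e.g. sstatus, sepc, scause, sbadaddr
    ctx["x"] = user_x
    return ctx


def restore_context(ctx, xregs, sscratch, swap_fields):
    """Write user values back, routing swapped registers to the scratch CSRs."""
    owned = {reg: n for n, reg in enumerate(swap_fields) if reg != 0}
    for reg in range(1, 32):
        if reg in owned:
            sscratch[owned[reg]] = ctx["x"][reg]  # register "belongs" to the supervisor
        else:
            xregs[reg] = ctx["x"][reg]


# x5 was swapped with sscratch0 on trap entry: x5 now holds a supervisor value.
xregs = [0] + [100 + i for i in range(1, 32)]
xregs[5] = 0xBEEF
sscratch = [105, 0]
ctx = save_full_context(xregs, sscratch, [5, 0], {"sepc": 0x1000})

new_x, new_scratch = [0] * 32, [0, 0]
restore_context(ctx, new_x, new_scratch, [5, 0])
```

The saved x5 slot holds the user's value (105), not the supervisor's 0xBEEF, and restoring places that value in sscratch0 rather than in x5.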

A preferred option is to have only the fast context save instructions
that tag the saved context and integrate context restore with xRET. In
this approach, xRET with rs1=x0 is the current simple trap return, while
xRET with rs1 as some other register also restores the context at the
address in rs1.

==Differences in machine mode==

In machine mode, there is no execution environment to pick up the pieces
and sort out nested traps. Fortunately, machine mode can easily avoid
synchronous traps during critical sections, and all interrupts are
disabled during a critical section in machine mode. This requires that
MPRV be clear on entry to a machine mode trap handler from a
less-privileged mode, since save-context is a data access and must not
fault.

-- Jacob

Andrew Lutomirski

Oct 7, 2016, 7:22:13 PM

to RISC-V ISA Dev, jcb6...@gmail.com

Nifty. I don't suppose that the minimum number of sscratch registers could be 2 or 3 instead of 1? Although I suppose that each OS will end up publishing its minimum requirements.

> These proposed automatic CSR swaps serve two purposes. First, they can
> fill in part of the latency of taking a trap on implementations where
> the writes to sepc, scause, and (possibly) sbadaddr still leave empty
> cycles before the first instruction of the trap handler is available.
> Second, and perhaps more importantly, this permits a trap handler to
> immediately have working registers, instead of doing a very careful
> dance to preserve state, similar to the long-standing concept of a
> shadow register file.

Do I understand correctly that the automatic swap is purely an optimization? In other words, a supervisor could get exactly the same effect by simply placing a series of CSR swaps at the beginning of its trap handler?

> ==Fast context save/restore instructions==

I'm confused. Is there a concrete proposal that I'm too inept to find? I'm not sure I understand exactly what these instructions do.

Jacob Bachmeyer

Oct 9, 2016, 7:59:12 PM

to Andrew Lutomirski, RISC-V ISA Dev

The minimum number of sscratch registers could be increased. I chose to
start off with "not less than 1" because the current draft spec contains
only one and one is the hard minimum to make the model work at all.

> These proposed automatic CSR swaps serve two purposes. First,
> they can
> fill in part of the latency of taking a trap on implementations where
> the writes to sepc, scause, and (possibly) sbadaddr still leave empty
> cycles before the first instruction of the trap handler is
> available.
> Second, and perhaps more importantly, this permits a trap handler to
> immediately have working registers, instead of doing a very careful
> dance to preserve state, similar to the long-standing concept of a
> shadow register file.
>
>
> Do I understand correctly that the automatic swap is purely an
> optimization? In other words, a supervisor could get exactly the same
> effect by simply placing a series of CSR swaps at the beginning of its
> trap handler?

I first thought of the automatic swap as an optimization because any
trap handler *must* begin with at least one CSR swap. The automatic
swaps become more important with the context save/restore instructions
which can save the "user values" that were swapped into the scratch
registers.

> ==Fast context save/restore instructions==
>
> I'm confused. Is there a concrete proposal that I'm too inept to
> find? I'm not sure I understand exactly what these instructions do.

If you are confused, then I goofed defining them, although offering
three different variants of the proposal probably did not help. There
is not a fully-concrete proposal yet, since the exact encoding and
mnemonics for these instructions are yet to be determined.

The idea is to have hardware support for quickly saving user state for a
context switch. The supervisor simply provides a buffer and says "put
the user register contents here". This is another effort to work around
the nasty race conditions with trap entry and return. Since context
save is a single instruction, it is atomic with respect to nested traps
and interrupts. The save-context instruction is essentially setjmp(),
while restore-context-and-return-from-trap is essentially longjmp() that
can jump into a less-privileged context. These could also serve as
performance optimizations, since the entire register file could be
written to memory in a multi-word burst in an advanced implementation.

The broader concept is that a trap handler begins in a critical section
(indicated by an internal flag not visible to software) and exits the
critical section after saving enough information that a nested trap will
not destroy anything needed for execution to continue. In a critical
section, trap delegation is inhibited and the execution environment will
receive any traps. The execution environment is then expected to save
the context somewhere, switch into a "double fault" context and deliver
the nested trap. After the nested trap is handled, the original trap
handler is resumed. Earlier, I expected that the original trap handler
would need to be restarted, but I have since realized that nested traps
actually are fully recoverable except in machine mode. (They can be
made recoverable in machine mode with some extra CSRs, but prohibiting
monitor trap handlers from taking synchronous traps or interrupts is
probably easier.)

Exiting the critical section is indicated by executing a save-context
instruction that records information needed to resume after a nested
trap. The size and layout of a context buffer is defined by the
implementation and documented in the configuration string. Linux, for
example, would generate the accessor functions for various user
registers and such during early init. The user context structure is
implementation-defined rather than compiled into the kernel. This need
not impede system calls, since syscall arguments would be in registers
that could be preserved during trap entry and passed directly into the
kernel.
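
As a sketch of what generating accessors at init time could look like, the following builds getter functions from a table of byte offsets. The layout table, the little-endian 64-bit slot size, and the names are all assumptions made for illustration; the proposal leaves the actual layout to the implementation.

```python
import struct


def make_accessors(layout, xlen_bytes=8):
    """Build one getter per saved register from a name -> byte-offset table."""
    fmt = "<Q" if xlen_bytes == 8 else "<I"  # little-endian XLEN-sized slot

    def getter(offset):
        return lambda buf: struct.unpack_from(fmt, buf, offset)[0]

    return {name: getter(off) for name, off in layout.items()}


# Hypothetical layout, as it might be read from the configuration string.
layout = {"sepc": 8, "x5": 16}
acc = make_accessors(layout)

buf = bytearray(32)                      # stand-in for a saved context buffer
struct.pack_into("<Q", buf, 16, 0x1234)  # pretend the hardware saved x5 here
x5_value = acc["x5"](buf)
```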

The execution environment knows that a nested trap was delivered to a
more privileged mode when it receives a trap that "should have" been
delegated by hardware. Implementations that do not support hardware
trap delegation would be required to handle all traps as nested traps.

The "minimal" context save instructions save less information and
therefore should be faster, and are intended for use when handling traps
that cannot cause a context switch. Very simple trap handlers may
eschew saving context at all. (The automatic CSR swaps are important
for this, since the hardware can thus know which scratch registers
currently hold less-privileged values--if x5 was swapped with sscratch0,
then save-full-user-context will write the contents of sscratch0 into
the x5 slot in the context object, instead of the supervisor's contents
of x5.)

-- Jacob

Andrew Lutomirski

Oct 11, 2016, 12:41:53 AM

to RISC-V ISA Dev, aml...@gmail.com, jcb6...@gmail.com

On Sunday, October 9, 2016 at 4:59:12 PM UTC-7, Jacob Bachmeyer wrote:

> Andrew Lutomirski wrote:
>
> > ==Fast context save/restore instructions==
> >
> > I'm confused. Is there a concrete proposal that I'm too inept to
> > find? I'm not sure I understand exactly what these instructions do.
>
> If you are confused, then I goofed defining them, although offering
> three different variants of the proposal probably did not help. There
> is not a fully-concrete proposal yet, since the exact encoding and
> mnemonics for these instructions are yet to be determined.
>
> The idea is to have hardware support for quickly saving user state for a
> context switch. The supervisor simply provides a buffer and says "put
> the user register contents here". This is another effort to work around
> the nasty race conditions with trap entry and return. Since context
> save is a single instruction, it is atomic with respect to nested traps
> and interrupts. The save-context instruction is essentially setjmp(),
> while restore-context-and-return-from-trap is essentially longjmp() that
> can jump into a less-privileged context. These could also serve as
> performance optimizations, since the entire register file could be
> written to memory in a multi-word burst in an advanced implementation.

I think that all of this could work, but I wanted to offer a less dramatic change that could maybe work just as well. Everything works like it does in 1.9, except for two sets of changes. First:

* Add a new status bit SATOMIC. SATOMIC is never set when running U-mode code. It has no particular effect when running in H-mode or M-mode (except that its value is preserved when returning to S-mode).
* On a trap to S-mode, if SATOMIC is clear: x1 is saved to a new CSR called SX1. Another new CSR called SX1 is copied to X1. A new flag SATOMIC is set. (Presumably this flag lives in a status CSR. See farther down for its exact effect.)
* On a trap to S-mode if SATOMIC is set: nothing is saved at all. The sole effect of the trap is to change PC to point to the double-fault handler.
* SRET is illegal if SATOMIC is not set.

Other than SATOMIC, this has relatively little effect except that handlers no longer have any need to begin with a CSR swap. This makes it easier to write a handler that can survive a nested trap.

A lightweight entry (syscall on some microkernels, etc) can save whatever registers it needs for scratch space, do some work, restore registers, and return. Any nested trap is an error or has special handling.

A normal entry will likely start with a short instruction sequence to save SEPC, SCAUSE, STP, etc to memory (using X1-relative stores), will update SX1 to point wherever a nested entry wants it, and then will clear SATOMIC.
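
The trap-entry rules of this alternative can be modeled roughly as below. Several points are assumptions: the two SX1 references are treated as a single exchange of x1 with SX1, sepc and scause are written as in 1.9, and SRET is assumed to clear SATOMIC on the way out. The class and function names are invented for this sketch.

```python
class Hart:
    """Minimal state needed to model the SATOMIC proposal."""

    def __init__(self):
        self.pc = self.x1 = self.sx1 = self.sepc = self.scause = 0
        self.satomic = False


def trap_to_s(h, cause, handler, double_fault_handler):
    if h.satomic:
        # Nested trap inside the atomic window: nothing is saved at all.
        h.pc = double_fault_handler
        return
    h.sepc, h.scause = h.pc, cause  # as in 1.9
    h.x1, h.sx1 = h.sx1, h.x1       # assumed: x1 and SX1 are exchanged
    h.satomic = True
    h.pc = handler


def sret(h):
    if not h.satomic:
        raise RuntimeError("illegal instruction: SRET with SATOMIC clear")
    h.satomic = False               # assumed: never set while in U-mode
    h.pc = h.sepc


h = Hart()
h.pc, h.x1, h.sx1 = 0x1000, 111, 0x8000
trap_to_s(h, 8, 0x100, 0x200)       # first trap: state saved, SATOMIC set
first = (h.pc, h.x1, h.sx1, h.sepc, h.satomic)
trap_to_s(h, 13, 0x100, 0x200)      # nested trap: only the PC changes
```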

It may also improve performance, judging by my x86 experience. Having an exception handler start with a special instruction to load a pointer into some register and then doing everything relative to that pointer means that you have to wait for the first entry instruction to retire before making any further progress at all. I did an experiment once in which I got rid of the SWAPGS at the beginning of the x86 syscall handler, and it saved several cycles. (I did it with a gross hack in which I assumed only one CPU and stored everything at a fixed absolute address.) SWAPGS on x86 is conceptually like a swap to SSCRATCH. If the CPU were responsible for populating a base register, the microcode could do it early, so the value would be available without pipeline hazards right away.

Aside: All of the packing of status bits and "previous" status bits into the same register is nice from a space-saving perspective, but it might be nicer in the long run to have them be the same bit positions in different registers. That would reduce the amount of thought needed for save-and-restore. It would also allow SATOMIC, HATOMIC, all the various interrupt enable bits, etc to be consolidated into just the active copy and the copy that's saved for restoration by xRET. But maybe this doesn't matter so much.

Jacob Bachmeyer

Oct 11, 2016, 7:04:48 PM

to Andrew Lutomirski, RISC-V ISA Dev

> * Add a new status bit SATOMIC. SATOMIC is never set when running
>   U-mode code. It has no particular effect when running in H-mode
>   or M-mode (except that its value is preserved when returning to
>   S-mode).
> * On a trap to S-mode, if SATOMIC is clear: x1 is saved to a new
>   CSR called SX1. Another new CSR called SX1 is copied to X1. A
>   new flag SATOMIC is set. (Presumably this flag lives in a
>   status CSR. See farther down for its exact effect.)
> * On a trap to S-mode if SATOMIC is set: nothing is saved at all.
>   The sole effect of the trap is to change PC to point to the
>   double-fault handler.
> * SRET is illegal if SATOMIC is not set.

>
> Other than SATOMIC, this has relatively little effect except that
> handlers no longer have any need to begin with a CSR swap. This makes
> it easier to write a handler that can survive a nested trap.

SATOMIC is basically the invisible "trap handler critical section" flag
I suggested, but UATOMIC and HATOMIC will also be needed for nested
traps in U-mode and H-mode, which is why I made the flag invisible to
save bits in mstatus. The other major difference is that my proposal
enables execution to continue after a nested trap: the trap handler is
simply resumed at the instruction where the inner trap occurred. How
does execution resume after the double-fault handler when the address of
the faulting instruction is not known? Or are we back to idempotent
critical sections?

The SX1 register you suggest is essentially sswap hardwired to the value
0x01 and a single sscratch register. Simpler, I admit, but I did not
want to hardwire which registers a trap swaps out. Also, the
save-full-context instruction I propose would save the value of SX1 in
the x1 slot in the context object, since that is the value x1 had when
the trap was taken.

> A lightweight entry (syscall on some microkernels, etc) can save
> whatever registers it needs for scratch space, do some work, restore
> registers, and return. Any nested trap is an error or has special
> handling.

Same as my proposal--the critical section is the entire trap handler if
context-save is never executed.

> A normal entry will likely start with a short instruction sequence to
> save SEPC, SCAUSE, STP, etc to memory (using X1-relative stores), will
> update SX1 to point wherever a nested entry wants it, and then will
> clear SATOMIC.

This is equivalent, except I propose a special save-minimal-context
instruction to save those registers and save-full-context instructions
to store the complete less-privileged context. The latter is intended
to improve performance for a user context switch, see below.

> It may also improve performance, judging by my x86 experience. Having
> an exception handler start with a special instruction to load a
> pointer into some register and then doing everything relative to that
> pointer means that you have to wait for the first entry instruction to
> retire before making any further progress at all. I did an experiment
> once in which I got rid of the SWAPGS at the beginning of the x86
> syscall handler, and it saved several cycles. (I did it with a gross
> hack in which I assumed only one CPU and stored everything at a fixed
> absolute address.) SWAPGS on x86 is conceptually like a swap to
> SSCRATCH. If the CPU were responsible for populating a base register,
> the microcode could do it early, so the value would be available
> without pipeline hazards right away.

The reason to add special instructions is that hardware may be able to
use higher-performance burst-mode memory accesses not otherwise
available in RISC-V. This is also the reason that the context object
layout is required to be (machine-readably) documented but is left
unspecified--the exact layout is a matter of microarchitectural
convenience. For example, an implementation that internally divides the
register file into banks may find it easier to save the registers in
some interleaved pattern and this is allowed, provided that each
register's value is contiguous and naturally-aligned. (The context
object itself must be aligned on at least an XLEN-bit boundary.) I draw
the line at dumping a rename table, though, since that would trade too
much software complexity (every access to a saved context would need to
use the dumped rename table) for very little benefit (hardware could
easily do rename table lookups while saving the context). Similarly,
shuffling register contents is not allowed, since software may have to
untangle them, and permutations are essentially free in hardware, no
matter how the registers are stored internally. The save-context
instructions are simply a faster version of a sequence of stores that
also clear the internal trap critical section flag, although they do
perform another neat trick, directly saving certain CSRs without
clobbering an architectural register.

The broader purpose was to allow SRET to also perform a (user thread)
context switch. In the case of a process switch, the supervisor would
first load a new *ptbr value. I have seen the pages of assembler code
in Linux to do context switches, and thought "it would be nice to have
hardware context-save/context-restore instructions that are actually
faster than doing it in software". I expect this to be an
operation that can complete in fewer than 50 cycles, including the trap
return, which is a perfectly predictable jump.

Linux could, at first, always save a full user context on trap entry (an
operation (including the trap itself) that I expect to require less than
50 cycles, absent memory latency, which is plausible if L1 cachelines
are available to hold the writes and the trap entry is cached) and later
optimize for particular cases where this can be delayed or avoided.

The instructions I propose can be implemented without actual microcode,
using a bit of additional logic in the decoder to synthesize the
required operation sequence. (Okay, so that is basically microcode
generated by hardware as needed instead of read from ROM. Po-tay-to,
po-tah-to, I admit.) But a more advanced implementation actually could
implement context-save in hardware, perhaps with a special double-width
memory path that uses both register read ports at once and requires less
than 20 cycles to save trap CSRs and the less-privileged values of all
31 integer registers. An out-of-order implementation could simply mark
the data hazards and queue the stores for context-save, and shuffle the
loads for context-restore to satisfy earlier dependencies first. If I
understand correctly, even a full user context-save on trap entry should
be faster on RISC-V than merely raising an exception on x86 in all or
nearly all implementations.

Note that there is no requirement that a saved context actually be
restored--if the supervisor does not clobber any registers not swapped
with sscratch CSRs, then plain SRET will correctly continue user
execution. An implementation is, of course, allowed to hold a copy of
the most-recently saved context in some internal buffer (or a set of
shadow registers), and to use it if a context-restore is executed from
the same address with no intervening stores (or to snoop such writes and
update the copy), but the supervisor can freely "take snapshots" and
throw them away if they are not needed. This is important, because
context-save may be faster than context-restore. The register file has
two read ports, but may only have one write port, although banked
register files may have secondary access ports useful for
context-restore and context-save.

> Aside: All of the packing of status bits and "previous" status bits
> into the same register is nice from a space-saving perspective, but it
> might be nicer in the long run to have them be the same bit positions
> in different registers. That would reduce the amount of thought
> needed for save-and-restore. It would also allow SATOMIC, HATOMIC,
> all the various interrupt enable bits, etc to be consolidated into
> just the active copy and the copy that's saved for restoration by
> xRET. But maybe this doesn't matter so much.

It seems to be intended for performance as well--only one status
register to save/restore on context switch, at any privilege level.

-- Jacob
