Mar 26, 2018, 1:56:41 AM
Problems that M-mode virtualization solves:
The complexity of the M-mode ISA being different from the S-mode ISA.
The need for two different access control mechanisms (privilege level 0/1/3 and virtualization level 0/1).
The need for an HS-specific hstatus CSR with VTSR, VTW, VTVM, SPV, STL, and SPRV fields, and the various other HS-specific h* and bs* CSRs, and MPV and MTL in mstatus.
The need to modify the ISA to work with hardware that supports exotic virtualization configurations.
The inability to use virtualization as just a software composition mechanism, without having to waste resources for unnecessary security mechanisms.
The inability to transparently (without mucking up the customer's M-mode binaries by making him link with the manufacturer's emulation code) provide trap-and-emulate for features not implemented in hardware, e.g. Allen's DIV.
The possible need to complicate the compliance test suite, as Cesar and Allen described, to deal with partial implementation of extensions, due to the lack of transparent emulation.
This message is long, not because virtualizing M-mode is complicated, but because I'm describing it relative to the current privileged arch manual, and describing implications of the differences. At first glance, it will look like I'm removing essential CSRs and proliferating superfluous ones, but in the end the only major changes are the addition of a handful of meta-CSRs and the addition of a mechanism resembling CSR number translation, analogous to address translation.
§1: Combine M, HS, S, and U into a single mode
Move the f* CSRs to M-mode.
Move satp to M-mode, and the description of it and virtual memory to the Machine-Level ISA chapter.
Remove the rest of the Supervisor-Level ISA chapter, the hypervisor proposal, the tables for U-mode and S-mode CSR numbers, and the concepts of HS, S, and U modes from the manual.
Rename satp to “atp”.
Remove the “m” prefix from all the M-mode CSR names.
Remove the concept of privilege level/mode, and replace it by virtualization level, with level 0 meaning unvirtualized.
Recombine MRET, HRET, SRET, and URET all into ERET (which was removed from the current version of the manual), with virtualization level as an immediate argument.
§2: Virtualize that single mode to arbitrary nesting levels
Virtualization level is objective, but each level subjectively sees itself as level 0, and sees the next nested level as 1, etc. E.g. level 2 sees itself as 0, and sees 3 as 1. So, add the current virt level to the argument to ERET to get the target level.
Add a new 32-bit meta-CSR: vconf (virtualization configuration), explained below.
Separate the concepts of CSR, CSR type, and CSR view. For each CSR (including vconf), create an array of duplicates of it; they're all the same type, but each element of the array is a distinct CSR. The array length is implementation-defined, will generally be different for each type, and determines the hardware virtualization nesting depth that's supported for that type. The length of any particular array in a typical implementation will be no more than 4. Some of the arrays will have length 1 (i.e. only 1 CSR of that type), and some length 0 (for an unsupported type). E.g. typically debug CSR arrays will have length 1, and if address translation isn't supported, the atp CSR array will have length 0. But a Unix-capable processor will have trap setup/handling CSR arrays of at least length 2 and atp at least length 1, and a hypervisor-capable processor will have lengths at least 3 and 2, respectively (with atp for SLAT).
Change the M-mode CSR table in the spec to say it's a table of CSRviews, not of CSRs. Each view, e.g. «status», always uses a fixed number, e.g. 0x300, but the CSR that it accesses depends on the current virtualization level. For CSRs that are always hardware virtualized from one virt level to the next, the current virt level is an index into the array. E.g. at level 0, status accesses status[0], corresponding to traditional mstatus. At level 1, status accesses status[1], corresponding to sstatus, and at level 2, status[2], corresponding to ustatus. An implementation can have arbitrarily long arrays, to efficiently support arbitrary nesting depths, with no need to change the ISA or assign new CSRview names or numbers for every new level. This is academically useful, but typical hardware won't go very deep.
For each CSRtype, add a second CSRview, with a different number than the first, with “g” (guest) prefixed to the name. When CSRview x accesses CSR x[i], gx accesses x[i+1]. This is a bit like the background CSRs in the current hypervisor spec proposal. But x and gx don't get swapped, and in general the array length can be greater than 2.
Optionally also add g2x and g3x for x[i+2] and x[i+3], etc, up to the length of the x CSRarray, though generally there's no need for this. Since g* CSRs never affect machine behavior and don't require atomic access (until a higher virt level that uses them is activated, either fully, or partially via use of ML (formerly MPRV; see below) or MXR), and proliferation of CSRviews would put pressure on the 12-bit number space, another option would be to map the g* CSRviews to some standard memory addresses. But the CSRs themselves must still be hart-local real CSRs, not just stored in main memory, because they (along with lower level CSRs) do affect machine behavior at higher virt levels.
CSRs aren't always hardware-virtualized going from one virt level to the next. They might not be virtualized at all, and instead access the host's (i.e. previous level's) own CSR directly, e.g. typically fflags accesses the host's CSR. Or they might be software-virtualized. Or they might be omitted; e.g. typically virt levels above 0 have no debug CSRs. So, support of m virt levels in typical usage doesn't require every CSRarray to have length m. And at virt level n, a particular CSRview might access a CSR at an array index less than n. But for simplicity of specification and implementation, CSRviews x and gx never access the same CSR; when x accesses x[i], gx always accesses x[i+1], g2x accesses x[i+2], etc.
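As a rough illustration of the numbering described above (only a sketch; the function names are hypothetical and not part of any spec), the subjective level arithmetic for ERET and the x/gx/g2x indexing could be written as:

```python
def eret_target_level(current_level, eret_arg):
    """Objective level that ERET targets.

    eret_arg is the subjective level as seen by the caller (0 = itself,
    1 = its guest, ...); adding the current objective level converts it.
    """
    return current_level + eret_arg


def guest_view_index(i, g_depth):
    """Array index reached by a g-prefixed CSRview.

    i is the index that the plain view x reaches at the current level;
    g_depth is 1 for gx, 2 for g2x, and so on (0 means x itself).
    """
    return i + g_depth
```

E.g. eret_target_level(2, 1) is 3, matching "level 2 sees 3 as 1", and if x reaches x[i] then guest_view_index(i, 1) is the element that gx reaches.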
Controlling which CSRviews access which CSRs is what the vconf CSRs are for. Each vconf CSR has a two-bit field for each of the following groups of CSRviews:
info registers (vendorid, etc are all one group)
trap setup (status, ie, and tvec, but not isa, trap delegation, or counteren)
trap delegation (edeleg and ideleg, but not counteren)
pmp* (pmpcfg0..15 and pmpaddr0..15 all together are one group)
Virt level n-1 configures its guest, n, by writing to vconf[n]. For n, for each field f in vconf[n], for each CSRview x in the group controlled by f, the value of f specifies which (and how) CSR x_ is accessed by x.
f value and x access:
0: None. And n is told x_ doesn't exist. Do horizontal trap on access to x, so n itself can optionally emulate x_. If n's host (i.e. virt level n-1) provides no trap setup/handling CSRs to n, then do vertical trap if x is accessed, and n-1 can kill n.
1: None. But n is told x_ does exist. Do vertical trap on access to x, and n-1 emulates x_.
2: x[i-1], the CSR that x accesses for the host.
3: x[i], i.e. the next one in the array.
IOW, bit 0 of the field is high if the CSR is virtualized, and bit 1 is high if the CSR is implemented in hardware (i.e. implemented by the host of virt level 0; of course the hardware itself might be emulated in software, but that's irrelevant here).
If the atp field value in vconf[n] is 0 or 1, then SFENCE.VMA in virt level n is also trapped, in addition to the atp CSR.
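To make the two-bit encoding above concrete, here's a small sketch. The enum and helper names are made up for illustration; only the numeric values and the bit meanings come from the description above.

```python
from enum import IntEnum


class VconfField(IntEnum):
    # Names are hypothetical; values follow the table above.
    ABSENT = 0          # no CSR; guest is told it doesn't exist
    EMULATED = 1        # no CSR; guest is told it exists, host emulates it
    PASS_THROUGH = 2    # guest accesses the host's own CSR
    HW_VIRTUALIZED = 3  # guest gets the next element in the array


def is_virtualized(field):
    """Bit 0: the CSR is virtualized (in software or hardware)."""
    return bool(field & 0b01)


def is_hardware(field):
    """Bit 1: accesses reach a hardware CSR rather than trapping."""
    return bool(field & 0b10)


def sfence_vma_traps(atp_field):
    """SFENCE.VMA traps whenever the atp field is 0 or 1."""
    return atp_field in (VconfField.ABSENT, VconfField.EMULATED)
```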
The privileged arch manual's trap setup and handling CSRs tangle up all privilege levels. To untangle them:
Rename the MIE bit to IE, and MPIE to PIE, in status (remember, mstatus was renamed above to status).
Rename MPP to PL (previous level; besides dropping “M” prefix, rename it because it records previous virt level, not previous privilege mode).
Remove the SIE, SPIE, SPP, UIE, and UPIE bits. They're replaced respectively by IE, PIE, and PL in gstatus, and IE and PIE in g2status.
Rename MPRV to ML (modify level).
Rename TSR to TER (Trap ERET).
Rename MXL to XL in isa (renamed from misa).
Remove the SXL and UXL fields from status. They're replaced by XL in gisa and g2isa, respectively.
Remove the TVM bit; it's replaced by the atp field in vconf.
Remove the “M” prefix from bit names in ip (renamed from mip) and ie (renamed from mie).
Remove S* and U* bits from ip and ie; they're replaced by the un-prefixed bits in gip, g2ip, gie, and g2ie.
PL field width could be implementation-defined (and thus specified in a read-only meta-CSR), since it must be at least log2 of the length of the vconf CSRarray, i.e. of the number of hardware virt levels supported, which in general can be greater than 4. But in typical implementations, it will be no greater than 4, so PL will need only 2 bits. Or, instead of the complexity of having the width be implementation-defined, it could be fixed at e.g. 3 or 4 bits, which creates an ISA-level limit on the hardware-supported virtualization nesting depth, on the assumption that no reasonable hardware will ever exceed that limit.
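The width arithmetic for PL is just a ceiling of log2. A sketch, with a hypothetical helper name:

```python
def pl_width(num_levels):
    """Minimum number of PL bits needed to name num_levels virt levels.

    Equivalent to ceil(log2(num_levels)) for num_levels >= 1.
    """
    if num_levels < 1:
        raise ValueError("need at least one virt level")
    return (num_levels - 1).bit_length()
```

So 4 hardware virt levels fit in 2 bits, while a fixed 3-bit or 4-bit PL would cap the hardware nesting depth at 8 or 16 levels.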
Behavioral changes for trap setup and handling:
Before writing value v to status.PL for virt level n, the processor first decrements v by n, to maintain n's subjective view of virt levels.
ML and MXR in status for virt level n affect only n and higher. TER, TW, and SUM affect n+1.
Hardware-implemented CSRs inherit the host's access type (read-only vs. read/write). I.e. if a CSR x[i] is read-only at level n, and the field for x in vconf[n+1] has value 2 or 3, then x[i+1] is read-only at level n+1. However, x[i+1] is read/write at level n (and is accessed using CSRview gx), which enables n to modify the read-only value seen by n+1. Changing read-only vs. read/write access (i.e. preventing inheritance) isn't a feature provided by hardware virtualization; to do it, use field value 1, i.e. trap and emulate in software.
An exception is the counters/timers. If the field value in vconf[n+1] for counters/timers is 0, 2, or 3, then it works as described above, and the counteren CSR at virt level n is ignored. But if the field value is 1, then counteren bit c specifies whether to always trap counter c of level n+1 or grant read-only access to n's own counter c (and trap only the writes). But regardless of configuration, the CSRview numbers for counters for all virt levels are the same as for virt level 0; they don't change as they do for different privilege levels in the privileged arch manual.
Another exception to inheritance is vconf itself. vconf[0] is always hardwired, but vconf[n] for n>0 will be hardwired or read/write depending on whether the processor supports reconfigurable virtualization. Since hypervisor support is indicated by the processor having CSRarrays of sufficient length (e.g. at least length 2 for atp), the H bit in the isa CSR is no longer needed for that, so the H bit is used instead to indicate virtualization reconfigurability.
Each vconf CSR that specifies value 3 in the field for CSRview x requires an additional element in the CSRarray of x's type. For values 0, 1, and 2, no additional element is required. IOW, virtualizing CSRs in hardware requires more of them; virtualizing them in software (or not at all) doesn't.
There's no field for vconf itself, because it's always implicitly 0, i.e. a program running at virt level n can never access the vconf[n] CSR. Write access obviously must be disallowed, because it would allow the program to change which CSRs it has access to. But read access also must be disallowed, because it would be a virtualization hole, by allowing the program to see whether its CSRs have vconf field values 1, 2, or 3.
Thus, although there's an array of vconf CSRs, the vconf CSRview can never be used, so it doesn't even have a number assigned for it in the CSRview table in the manual; it's only an explanatory mechanism. Instead, another CSRview, hconf (host configuration), is a read-only, masked view of vconf. hconf tells virt level n which CSRs are available, i.e. what configuration n's host has set up for n. vconf CSRs are always hardware virtualized from one virt level to the next, so hconf always accesses vconf[n] for virt level n (and gvconf always accesses vconf[n+1]). So, the gvconf field in every vconf[n] must have value 3, except 0 for the last vconf array element that's used.
For each field f in hconf[n], bit 1 is the logical «or» of the two bits of f in vconf[n]. IOW, where f controls CSRview x, bit 1 indicates whether the CSR accessed by x exists for virt level n. hconf[n] bit 0 is used to indicate whether the processor supports hardware virtualization (from the point of view of virt level n) of the CSR accessed by x. If x in virt level n accesses the last CSR in the array of that type, then (further) hardware virtualization isn't supported, because it would require an additional element beyond the last one.
The value of bit 0 is 1 if another element in the hardware CSRarray of x's type is available, and no vconf[p] for p<n has f value 0 or 1. IOW, if any virt level disables the CSR for the next level, or virtualizes it in software, then hardware virtualization at higher levels is illegal, even if more elements in the array are available. That's because lower-level CSRs generally affect machine behavior, but disabling or software-virtualizing them would require trapping even when the processor itself tries to read them to implement that behavior, and that's making my head hurt.
Thus, f values in hconf mean:
0: the CSR accessed by x doesn't exist
2: it does exist, but it isn't hardware virtualizable
3: it does exist, and is hardware virtualizable
Value 1 never appears in an hconf field.
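Here's one way the hconf derivation above might be modeled for a single field. This is a sketch under assumptions: vconf_fields[p] stands for the two-bit value of that field in vconf[p], array_len is the length of the CSRarrays in the group, and I'm reading the "no 0 or 1 at lower levels" condition as also covering vconf[n] itself (otherwise a software-emulated CSR would be reported as hardware virtualizable). The function name is hypothetical.

```python
def hconf_field(vconf_fields, n, array_len):
    """Value (0, 2, or 3) of one hconf field as seen by virt level n."""
    f = vconf_fields[n]
    exists = 1 if f != 0 else 0  # bit 1: OR of the two bits of f

    # Index of the array element that level n's view reaches: one step
    # per hardware-virtualized (value 3) level on the way up to n.
    q = sum(1 for v in vconf_fields[: n + 1] if v == 3)
    another_available = q + 1 < array_len

    # Assumption: any level up to and including n that disables (0) or
    # software-virtualizes (1) the CSR rules out hardware virtualization.
    chain_intact = all(v in (2, 3) for v in vconf_fields[: n + 1])

    hw_virtualizable = 1 if (exists and another_available and chain_intact) else 0
    return (exists << 1) | hw_virtualizable
```

With vconf_fields = [2, 3, 3] and array_len = 3 (the M/S/U shape for trap setup), levels 0 and 1 read 3 and level 2 reads 2, since level 2 already uses the last element.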
For each field, given a value in hconf[n], the allowed values in gvconf[n] (i.e. vconf[n+1]) are:
0: 0, 1
2: 0, 1, 2
3: 0, 1, 2, 3
IOW, it's illegal for n to hardware-virtualize a CSR that isn't hardware virtualizable, or to grant to n+1 pass-through access to a CSR that's nonexistent for n.
Trying to write an illegal value to a field in gvconf (or g2vconf, etc) results in a trap. And when f in hconf isn't 3, trying to read or write to gx results in a trap.
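The legality rules above amount to a small lookup table. A sketch, where the VirtTrap exception and the function names are just illustrative stand-ins for whatever trap the hardware would raise:

```python
class VirtTrap(Exception):
    """Stand-in for the trap raised on an illegal access."""


# hconf field value -> values the level may write to the same gvconf field.
# Value 1 never appears in hconf, so it has no entry.
ALLOWED_GVCONF = {
    0: {0, 1},
    2: {0, 1, 2},
    3: {0, 1, 2, 3},
}


def check_gvconf_write(hconf_value, new_value):
    """Return new_value if it's a legal gvconf field value, else trap."""
    if new_value not in ALLOWED_GVCONF[hconf_value]:
        raise VirtTrap(f"gvconf value {new_value} illegal when hconf is {hconf_value}")
    return new_value


def check_gx_access(hconf_value):
    """Reading or writing gx traps unless the hconf field is 3."""
    if hconf_value != 3:
        raise VirtTrap("gx has no backing CSR at this level")
```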
Since CSRviews are logically grouped, rather than each CSRview getting its own field in vconf (which would make vconf excessively wide), and hconf indicates availability of additional CSRarray elements, all of the CSRarrays in a given group must therefore have the same length.
The processor hardwires the vconf[0] fields for all unsupported CSRs to 0. Fields for supported CSRs are hardwired to 2, and therefore will be read as 2 or 3 in hconf depending on whether the processor has more than just 1 CSR in each of the arrays of the types controlled by the field.
The CSRR* instructions take a CSRview number, not a CSR number. To find the CSR for CSRview x for virt level n, count the quantity q of CSRs in the vconf array in the range 0..n for which the field for x has value 3 (meaning hardware virtualized). Then, the CSR accessed by x is x[q]. But if the field for x in any of those vconf CSRs has value 0 or 1, then trap.
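The lookup just described is basically CSR number translation. A minimal sketch for one CSRview group, assuming vconf_fields[p] holds that group's two-bit field from vconf[p]; the names are hypothetical and the trap is modeled as an exception:

```python
class CsrTrap(Exception):
    """Stand-in for the trap taken when the view has no hardware CSR."""


def resolve_csr_index(vconf_fields, n):
    """Return q such that CSRview x at virt level n accesses x[q]."""
    window = vconf_fields[: n + 1]  # vconf[0..n]
    if any(v in (0, 1) for v in window):
        # Disabled or software-virtualized somewhere along the chain.
        raise CsrTrap("view is not backed by a hardware CSR at this level")
    # Each hardware-virtualized level (value 3) steps to the next element;
    # pass-through levels (value 2) stay on the host's element.
    return sum(1 for v in window if v == 3)
```

E.g. with vconf_fields = [2, 2, 3], levels 0 and 1 both reach x[0] (level 1 is pass-through) and level 2 reaches x[1]. A real implementation would presumably do this count only when the virt level changes, not on every CSR access.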
§3: Some other implications
With M-mode virtualization, the standard M/S/U structure that's currently inflexibly specified in the privileged architecture manual is still a programmable option, if the relevant CSRarrays have at least length 2, corresponding to provision of the traditional s* CSRs, or length 3, for s* and u*. The program running at virt level 0 would assign zero to all the fields in gvconf, except the following:
trap setup: 3
trap delegation: 3
trap handling: 3
Virt level 1 (traditionally the supervisor) would assign zero to all, except:
trap setup: 3 (to support the RISC-V «N» extension)
trap handling: 3 (ditto)
In this example, since virt level 1 assigns 0 to the gvconf field in its gvconf CSR, level 2 doesn't get a gvconf CSR, because level 2 is traditionally what's called “user-level” and doesn't present an environment to a nested VM.
The ABI difference is that virt levels 1 and 2 use the same CSRview numbers that virt level 0 does, instead of different (e.g. s* and u*) ones. Not only programs at virt level 0, but also programs at all other levels, perceive themselves to be running at level 0. If their hosts set fields in gvconf to 0 to disable access for some CSRs, this is indistinguishable from running on a processor with omitted CSRs.
Besides the standard structure, this enables other options. E.g.
A. Address translation can be managed at virt level 0.
B. Virt level 0 can virtualize itself efficiently, by copying hconf to gvconf, then changing gvconf fields with value 2 to 1, and doing trap-and-emulate only for the non-hardware-virtualizable CSRs (i.e. the ones with hconf value 2).
C. Virtualization can be used just for software composition, without any hardware security mechanisms at all, such as PMP or address translation.
D. Virt level 0 might do nothing except trap and emulate for omitted hardware (e.g. Allen's divider) and isa (to report that the M extension is fully implemented), and fully delegate everything else to virt level 1. To do this, copy hconf to gvconf, then change all gvconf fields with value 3 to 2 (except isa and trap setup/handling/delegation), copy isa to gisa, set the M bit in gisa to 1, set all ideleg bits to 1, set all edeleg bits to 1 except illegal-instruction, and handle just DIV, and delegate all other illegals. Such a program could run from on-board ROM at reset, so that the first externally loaded program (such as the compliance test suite) has no idea that it's not running at virt level 0, besides the fact that DIV is slow and the platform reserves a small memory range for use by the real virt level 0. Thus, the DIV-emulation program wouldn't be a compliance headache. This is an example of using virtualization for software composition without needing any security mechanisms.
Though in this example, adding the ROM would kind of defeat the purpose of removing the hardware divider. And, um, Allen just said Krste just said it's ok if MULT doesn't trap, so that kind of moots the point of this exercise. Well, I already wrote this, so I might as well spam all your inboxes with it.
E. Add hardware support for second-level PMP, so programs running on a small RTOS subdivide their own memory, with no change needed to the ISA.
F. Add hardware support for third-level address translation, with no change needed to the ISA. Why just waste energy running Linux on Xen, when you can waste even more running Windows on KVM on Xen?
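The gvconf setup steps in options B and D above are simple field rewrites. A sketch, modeling hconf and gvconf as dicts from group name to two-bit value; the dict representation, the group names used as keys, and the sample values are assumptions for illustration only:

```python
def option_b_gvconf(hconf):
    """Self-virtualization: hardware where possible, trap-and-emulate elsewhere."""
    # hconf 2 (exists, not hardware virtualizable) becomes 1 (software emulated).
    return {name: (1 if value == 2 else value) for name, value in hconf.items()}


# Groups that option D keeps hardware virtualized so level 0 can still trap.
KEEP_VIRTUALIZED = {"isa", "trap setup", "trap handling", "trap delegation"}


def option_d_gvconf(hconf):
    """Thin emulation layer: pass everything else through to virt level 1."""
    return {
        name: (2 if value == 3 and name not in KEEP_VIRTUALIZED else value)
        for name, value in hconf.items()
    }


# Made-up hconf contents for a small core.
sample_hconf = {
    "isa": 3, "trap setup": 3, "trap handling": 3, "trap delegation": 3,
    "atp": 3, "pmp": 2, "debug": 0,
}
```

For the sample, option B only changes pmp (2 to 1) and option D only changes atp (3 to 2); both results stay within the allowed gvconf values for the corresponding hconf values.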
§4: Microarchitecture changes
An array of vconf CSRs must be provided, with length m+1. vconf[0] is hardwired. vconf[1..m] can be hardwired to encode a standard structure (M, M/U, M/S/U, or M/HS/S/U), or be read/write to enable virtualization reconfiguration, typically with fewer than 100 extra bits of state.
The CSRR* instructions require a bit counter to find CSRs based on the CSRview argument and the current virt level. But this counter only has to count as high as m, where m is the maximum virt level, which will be no more than 3 for typical processors. And like with the fg/bg swaps in the current hypervisor proposal, this overhead is only incurred when the current virt level changes; it doesn't have to be incurred every time a CSR is accessed.
Adders are required for ERET and status.PL.
Circuitry must be added for the traps described above.
I think that's about it. Current RISC-V core designs would require just a minimal amount of change to be compatible with this ISA.