On 19/09/2016 19:55, Andrew Waterman wrote:
> Thanks for taking the time to think about these issues.
>
> On Mon, Sep 19, 2016 at 8:05 AM, Paolo Bonzini <bon...@gnu.org> wrote:
>> One of the aspects of the RISC-V privileged architecture that puzzled me
>> the most is that VM is not part of sstatus. Yet, there's no way to set
>> it via SBI, so the OS must know how to deal with whatever page table
>> format has been set by M-mode code.
>
> Our expectation is that the loader sets the VM field for the OS, and
> then it would remain constant. A field in the ELF header could carry
> this information.

Oh, I found it now: "once the machine is set up, the OS kernel is mapped
into virtual memory, and its entry point is called".

But where does the loader end and the OS start? For example, is U-Boot
or a UEFI environment a loader? Both are complicated enough that they
might want to set up paging, even if it's only to place a guard page on
the bottom of the stack. (On x86, we did have cases where the guard
page caught UEFI bugs due to an excessively small stack.) What if their
page table format is different from that of the kernel they load?

Also, Linux is able to start another kernel with kexec. Why limit that
to a kernel with the same page table format?

Yeah, there's always the SBI route, but... it's really a hack. Chapter
9 says "Several features that might normally handled by the supervisor
operating system (OS) directly are handled via SBI calls to the SEE in
RISC-V" but it doesn't provide any rationale, and SBI right now is
mostly providing multi-processor services. There are a few exceptions
(console get/put, shutdown, mask/unmask interrupt---which is probably
not really a necessary part of the SBI either) but tweaking the page
table format doesn't fit any of them.

>> In addition, M-mode and S-mode translation are not stackable. Some
>> type-1 hypervisors in the past (e.g. Xen on 32-bit x86, which ran the
>> guest kernel in x86 ring 1) have used single base-and-bound to protect
>> themselves from the guests, so I'm introducing a similar mechanism in
>> this proposal to allow the M-mode monitor to protect itself from
>> H/S/U-mode code.
>
> Protection only, rather than an additional translation step, should
> suffice for this purpose. PMAs can handle this.

If stacking segmentation with paging is unnecessary, why even make
segmentation an M-mode concept, and paging an S-mode concept? Both
could simply be S-mode concepts that naturally move up to M-mode if
S-mode doesn't exist.

(IOW, by allowing an M-mode monitor to set up sptbr page tables for
U-mode code, paging can be made available on M,U implementations, rather
than only M,S,U as in Table 3.3. This removes an unnecessary difference
between the various RISC-V translation modes).

So with this observation mstatus.VM would be reduced to 2 bits only
(let's call them mstatus.SVM if they're then made visible in sstatus):

   00 = Mbare or Mbb depending on mbase/mbound (well, sbase/sbound?...)
   01 = Mbbid
   10 = 32-bit virtual addresses with RV32 sptbr format (Sv32)
   11 = RV64 sptbr format, page table depth contained in sptbr

For 00 and 01, segmentation would only be applied to U-mode and S-mode
would run with physical addresses just like M-mode. If S-mode exists,
M-mode protection would be left to PMA/PMP.

Actually I like this even more than the previous one! :)
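
For concreteness, the two-bit encoding above could be decoded roughly as
follows. This is only a sketch in Python: the name decode_svm and the
has_base_bound flag (standing in for "mbase/mbound are set up") are made
up for illustration, and how 00 really chooses between Mbare and Mbb is
left open above.

```python
def decode_svm(svm, has_base_bound=False):
    """Map the proposed 2-bit mstatus.SVM value to a translation mode name.

    has_base_bound is a hypothetical stand-in for "mbase/mbound are in use";
    the proposal does not say how that is detected.
    """
    if not 0 <= svm <= 0b11:
        raise ValueError("SVM is a 2-bit field")
    if svm == 0b00:
        # 00 covers both Mbare and Mbb, depending on mbase/mbound.
        return "Mbb" if has_base_bound else "Mbare"
    if svm == 0b01:
        return "Mbbid"
    if svm == 0b10:
        return "Sv32"
    # 11: RV64 sptbr format; the page table depth lives in sptbr itself.
    return "RV64 sptbr format"
```

The point of the last case is that SVM no longer has to enumerate every
RV64 page table depth.
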
>> - a new representation for the RV64 sptbr:
>>
>> bit 0-39 PPN (WARL)
>> bit 40-55 ASID (WARL)
>> bit 56-57 SVSZ (supervisor virtual address size, WLRL)
>> bit 58-63 reserved (WLRL)
>>
>> SVSZ is defined as:
>>
>> 00 39-bit virtual addresses (Sv39, three-level page tables)
>> 01 48-bit virtual addresses (Sv48, four-level page tables)
>> 10 Reserved for 57-bit virtual addresses
>> 11 Reserved for 64-bit virtual addresses
>
> This seems like a reasonable design to me, but I'm not yet convinced
> supervisor software needs direct control over the page table depth
> (and, if it does, whether an SBI call would suffice).
>
> Does exposing this option directly to the OS complicate hypervisor
> implementation at all?

Maybe I've not understood the question. It's _not_ exposing this option
that complicates the hypervisor, even without considering type-2
hypervisors, i.e. purely within the H-mode framework envisioned by the
1.9 spec.

The hypervisor has to deal with _three_ kinds of address translations:
the hypervisor page tables (H->M) and the two translation levels for the
guest page tables (Svirtual->Sphysical from sptbr, and Sphysical->M).

Just to make things simple let's say you use the same page format for
H->M and Sphysical->M (it is an arbitrary limitation---but it's one that
I could live with).

But different guests may well demand different sizes for the
Svirtual->Sphysical translation.

Therefore the hypervisor _really_ needs control over that one, and it has
to change it fast because it is part of the hypervisor context switch;
so you really need to configure Svirtual->Sphysical with a knob that
isn't mstatus.VM.

The good news is that you won't have any problem fitting the page
formats in mstatus.VM (there are 8 values in use out of 32, and you'll
only need 5 more if the same format controls hptbr and hsptbr).

The ugly news is that the sptbr page table format would be defined by
mstatus.VM on bare-metal and by something else when running under
H-mode. This something else would presumably be some field in hstatus
and it would waste several precious bits in mstatus (this is the bad news).

It's frankly very messy, in English and probably in the HDL too; it
sounds even messier when the solution is obvious, i.e. let whoever sets
up the page tables pick the format they desire. So for sptbr that's the
S-mode code itself; and then putting the format in sptbr rather than
sstatus is the obvious way to save on mstatus bits.
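
To illustrate why this keeps the context switch cheap, here is a rough
Python model of the RV64 sptbr layout quoted earlier in this message
(PPN in bits 0-39, ASID in bits 40-55, SVSZ in bits 56-57). The helper
names pack_sptbr/unpack_sptbr and the sample PPN/ASID values are
invented for the example; only the bit positions and SVSZ encodings come
from the proposal.

```python
PPN_BITS, ASID_BITS, SVSZ_BITS = 40, 16, 2
ASID_SHIFT = PPN_BITS                # ASID occupies bits 40-55
SVSZ_SHIFT = PPN_BITS + ASID_BITS    # SVSZ occupies bits 56-57

# SVSZ encodings from the proposal; 10 and 11 are reserved.
SVSZ_VA_BITS = {0b00: 39, 0b01: 48}


def pack_sptbr(ppn, asid, svsz):
    """Build an sptbr value from its three fields (hypothetical helper)."""
    if svsz not in SVSZ_VA_BITS:
        raise ValueError("reserved SVSZ encoding")
    if ppn >> PPN_BITS or asid >> ASID_BITS:
        raise ValueError("field out of range")
    return ppn | (asid << ASID_SHIFT) | (svsz << SVSZ_SHIFT)


def unpack_sptbr(value):
    """Split an sptbr value back into (ppn, asid, svsz)."""
    ppn = value & ((1 << PPN_BITS) - 1)
    asid = (value >> ASID_SHIFT) & ((1 << ASID_BITS) - 1)
    svsz = (value >> SVSZ_SHIFT) & ((1 << SVSZ_BITS) - 1)
    return ppn, asid, svsz


# Two guests with different page table depths (made-up PPN/ASID values).
guest_a = pack_sptbr(ppn=0x12345, asid=1, svsz=0b00)  # Sv39
guest_b = pack_sptbr(ppn=0x6789A, asid=2, svsz=0b01)  # Sv48
```

In this model, switching from guest_a to guest_b means loading one
different sptbr value; the format travels with the page table pointer
and no mstatus or hstatus field has to change.
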
> We're still amenable to changes prior to the next version, provided
> compelling justification. We still haven't had time to think through
> the implications of your hypervisor proposal from a few weeks ago, but
> hope to do so soon.

Great, thanks. FWIW, I would already be very happy if you just removed
all traces of H mode from the spec. :)

(And sorry if I sound excessively negative. There's plenty of great
stuff of course!).

Paolo

ps: yes, I have thought of how to apply this in the no-H-mode proposal.
Assuming the 2-bit definition of mstatus.VM from this email, the
extension could be:
- VM[3:2] becomes a new field HPVM, defining the hpptbr format
- VM[4] is HVM: 0 for two-level translation with 32-bit page-table
entries, 1 for 64-bit page-table entries.
- both mstatus.HPVM and mstatus.HVM are visible in hstatus
- both HRET and hypervisor traps swap mstatus.SVM with mstatus.HPVM
With the separate MVM/SVM from the previous attempt, you'd need a sixth
bit for HVM.
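
As a sanity check on the swap in the last bullet, here is a small Python
sketch. It assumes SVM sits in VM[1:0] (my reading of "the 2-bit
definition" plus VM[3:2] and VM[4] being the new fields); the function
name swap_svm_hpvm is made up.

```python
SVM_MASK = 0b00011    # VM[1:0], assumed position of SVM
HPVM_MASK = 0b01100   # VM[3:2]
HVM_MASK = 0b10000    # VM[4]


def swap_svm_hpvm(vm):
    """Swap VM[1:0] (SVM) with VM[3:2] (HPVM), leaving VM[4] (HVM) alone."""
    svm = vm & SVM_MASK
    hpvm = (vm & HPVM_MASK) >> 2
    return (vm & HVM_MASK) | (svm << 2) | hpvm
```

Applying it twice gives back the original value, which is what you want
if the same swap happens on hypervisor traps and again on HRET.
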