Hi Anup,
>> 1. Figure 5.46 on page 117 (the transformed HSV/HLV/HLVX instructions)
>> does not properly distinguish between hypervisor loads and stores. HLV
>> instructions do not have an rs2 field, and HSV instructions do not have
>> an rd.
>
> The func7 of the instruction encoding can be used to distinguish between
> HLV and HSV instructions.
I agree that the instructions can be distinguished by a hypervisor; my
objection is only that figure 5.46 is not correct, and should be split
into two separate figures.
> The misaligned load/store emulation is only done by the M-mode runtime
> firmware for both Hypervisor and Guest/VM. If there is a page fault while
> emulating misaligned load/store in M-mode firmware then it is forwarded
> to the hypervisor or Guest/VM where it came from. This means misaligned
> load/store faults don't reach the hypervisors and Guest/VM because the
> M-mode firmware tries to compensate for missing HW features using
> software emulation. In fact, the upcoming OS-A platform specs puts
> misaligned load/store emulation as a required feature from M-mode
> runtime firmware.
>
> Currently, hypervisors don't require the "Addr. offset" information
> though it is useful. Most hypervisors determine the exact faulting
> address from a combination of htval and stval CSRs.
I'm referring to guest page faults where the underlying address was
misaligned (that is, scause = 20, 21, or 23), not misaligned access
exceptions (scause = 4 and 6).
To give a concrete example: a hypervisor exposes memory range
0x10000-0x2ffff to a guest, which consists of two contiguous guest
pages. Page #1, which comprises bytes 0x10000-0x1ffff, is configured
as regular memory. That is, the G-stage page table entry for this page
is set with V = 1 and RWX = 111. Page #2 spans bytes 0x20000-0x2ffff,
and the hypervisor intends to use this to emulate an MMIO driver, so
the G-stage PTE for this page is set with V = 0.
The guest then sets up its own page tables, which don't matter a whole
lot for the purposes of this argument as long as the pages stay
contiguous, but to be concrete let's say that the guest pushes
everything into negative addresses, so page #1 is now mapped to
0xfff10000-0xfff1ffff (guest virtual address) and page #2 is mapped to
0xfff20000-0xfff2ffff.
Now: suppose the guest operating system attempts a four-byte load from
guest-virtual address 0xfff1fffe. The VS-stage page tables do not cause
any problems here, so this is converted into a request to load four
bytes from guest-physical address 0x1fffe. There are two possible valid
outcomes: either the bytes at guest-physical addresses 0x1fffe, 0x1ffff,
0x20000, and 0x20001 are concatenated and placed in rd (which is what
would happen if both pages were in normal memory), or the guest
receives an access fault for trying to do a misaligned access that
overlaps two memory regions with different PMA attributes. (The base
spec says that this is an option if the execution environment defines
it as one. If you don't think this should be allowed to happen, then
that's fine: you believe that the only possible outcome is that the
four bytes are read and concatenated in rd.)
My claim is that the M-mode firmware and the HS-mode hypervisor,
working together, will fail to do either of the correct behaviors, and
what will instead happen is that the hypervisor will incorrectly load
bytes 0x20000, 0x20001, 0x20002, and 0x20003 into rd.
Here is how that happens in the normal case:
1. The hardware detects a misaligned memory access and delivers a
load address misaligned fault to M-mode. It sets mcause = 4 and
mtval = 0xfff1fffe.
2. The M-mode firmware sets mstatus.MPRV = 1 so that the virtual
address in mtval undergoes two-stage page table translation, and it
uses 'lbu' instructions to manually load the four bytes from virtual
addresses 0xfff1fffe, 0xfff1ffff, 0xfff20000, and 0xfff20001.
3. The first two lbu instructions succeed but the third (the attempt
to load 0xfff20000) fails with a "load guest-page fault" exception
with:
mcause = 21 [Load guest-page fault]
mtval = 0xfff20000
mtval2 = 0x8000 [Guest-physical address 0x20000 shifted two bits right]
mtinst = transformation of an "lbu" instruction with "Addr. offset" = 0
4. M-mode firmware forwards this error to the HS-mode hypervisor. But
the guest attempted to load a word, not a byte, and the M-mode
firmware adjusts htinst appropriately, setting:
scause = 21
stval = 0xfff20000
htval = 0x8000
htinst = transformation of an "lwu" instruction with "Addr. offset" = 2
Note that section 5.2.3 of the spec says that the address in stval must
be the faulting portion of the virtual address, which may be a page
boundary that is higher than the original misaligned address that the
guest tried to load from.
Now here is what's *supposed* to happen:
5a. The hypervisor uses stval and htval to compute the faulting
guest-physical address as (htval << 2) | (stval & 3) = 0x20000. But it
should also recover the *original* guest-physical address by
subtracting "Addr. offset" from the faulting address. Since Addr offset
is 2, the original guest-physical address is correctly recovered as
0x1fffe, and the hypervisor can deal with this appropriately.
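As a sketch of the arithmetic in step 5a (the helper names are mine, not from any spec; the "Addr. offset" field sits in the rs1 position, bits 19:15, of the transformed instruction in htinst):

```c
#include <stdint.h>

/* Faulting guest-physical address: htval holds the GPA shifted right
 * by 2, and the low two bits come from stval. */
static uint64_t faulting_gpa(uint64_t htval, uint64_t stval)
{
    return (htval << 2) | (stval & 3);
}

/* Original guest-physical address: subtract the "Addr. offset" field
 * (bits 19:15 of the transformed instruction) from the faulting GPA. */
static uint64_t original_gpa(uint64_t htval, uint64_t stval, uint32_t htinst)
{
    return faulting_gpa(htval, stval) - ((htinst >> 15) & 0x1f);
}
```

With the example's values (htval = 0x8000, stval = 0xfff20000, Addr. offset = 2), this recovers 0x20000 as the faulting GPA and 0x1fffe as the original GPA.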
Here is what a hypervisor is *supposed* to do if htinst is zero:
5b. The hypervisor computes the faulting guest-physical address as
0x20000 as before. It notices that htinst is empty so it loads the
faulting instruction from sepc. It sees that the original instruction
was an "lwu." It must *also* extract rs1 and the immediate offset from
this lwu instruction and add them together to recover the original
guest-virtual address (which was 0xfff1fffe). It then compares this to
the contents of stval (which is 0xfff20000) and realizes that the
originally-requested address is *not* the address that faulted. (Note
that the hypervisor should be able to skip this check if the value in
stval is not a multiple of 0x10000, because the problem I'm talking
about only occurs when stval points to the beginning of a guest page.)
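To sketch the recomputation in step 5b (the helper is mine; it assumes the faulting instruction has already been fetched from guest memory at sepc, is an uncompressed I-type load such as "lwu," and that the caller has already read the guest's rs1 register value out of its saved context):

```c
#include <stdint.h>

/* Sign-extend the 12-bit I-type immediate (bits 31:20). */
static int64_t sext12(uint32_t imm12)
{
    return (int64_t)((int32_t)(imm12 << 20)) >> 20;
}

/* Recompute the guest-virtual address an uncompressed I-type load
 * actually requested: rs1 value plus the sign-extended immediate.
 * Comparing the result against stval reveals whether the original
 * access was misaligned across the page boundary. */
static uint64_t original_gva(uint32_t insn, uint64_t rs1_val)
{
    return rs1_val + (uint64_t)sext12(insn >> 20);
}
```

For example, "lwu x5, -2(x10)" encodes as 0xFFE56283; if the guest's x10 holds 0xfff20000, the recomputed address is 0xfff1fffe, which differs from the stval of 0xfff20000 and exposes the misalignment.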
Here is what happens in v19 of the KVM patch:
5c. The hypervisor computes the faulting guest-physical address as
0x20000 just like the other examples. It checks htinst to determine
the faulting instruction type, and if htinst is zero it loads the
faulting instruction from sepc; either way it determines that the
faulting instruction was "lwu." It then emulates a word-sized load
from guest-physical address 0x20000 (failing to detect that the guest
actually tried to load from 0x1fffe), and returns the emulated contents
of bytes 0x20000-0x20003 from the emulated MMIO driver into rd.
That's the core problem: if htinst is zero on a guest page fault, the
only way for the hypervisor to determine that the original address was
misaligned is to recompute that address from the decoded rs1 and
immediate in the instruction itself.
Furthermore: I believe that guest-page faults (and access faults caused
by PMP/PMA checks, which should never occur in practice for
correctly-written hypervisors) are the only use for the "Addr. offset"
field. To respond to your note that:
> Like I mentioned above, "Addr. offset" is mostly a hint but misaligned
> load/store traps never reach Hypervisor or Guest/VM because M-mode
> runtime firmware treats this as missing HW feature and emulates it for
> all lower privilege modes.
Even if the OS-A platform specs didn't require the firmware to deal
with misaligned load/store traps, the "Addr. offset" field would always
be zero in htinst for misaligned load/store traps because the original
load address would already be in stval, and the offset is what you're
supposed to subtract from stval to get the original address. So the
only use of the Addr. offset field is to catch this sort of situation
where a misaligned access spans permission boundaries and the second
half of the load fails.
Per my original email, hypervisors could avoid this problem by mapping
a guard page before any emulated MMIO memory. The scenario described
above was possible only because the MMIO page at 0x20000-0x2ffff was
immediately preceded by a regular memory page at 0x10000-0x1ffff. If
that preceding page had *also* been unmapped, then any sane firmware
implementation would report a page fault for address 0x1fffe instead
of 0x20000, and the problem would go away.
To protect against *that* case, I do think (per my previous email) that
the spec should still be tightened up to require that the firmware try
to read the bytes in order. That is, if both memory pages are unmapped
and unprivileged software tries to do a four-byte load from address
0x1fffe, the page fault for 0x1fffe should take precedence over the
page fault from 0x20000, to guarantee that the hypervisor sees the
misaligned address resulting from the guard page.
That said, the situation *does* leave me with complicated feelings
surrounding the "Addr. offset" field. Stepping back and thinking about
why things are the way they are: The behavior of stval for ordinary
page faults makes sense, because ordinary supervisors aren't generally
emulating anything---when a regular operating system receives a page
fault, it is more interested in knowing which page faulted (so it can
decide whether to terminate the user process, make a writable copy of
a copy-on-write page, swap the page back in from the hard disk, etc.)
but generally speaking it will handle the situation by *either* fixing
up the page table and letting the user process reattempt the access,
*or* killing the user process. In neither of these common use cases
does the supervisor care whether the user process was attempting a
misaligned access.
Hypervisors, on the other hand, are far more likely to want to emulate
virtualized-mode memory accesses. While hypervisors are free to reject
misaligned accesses in emulated device-space (the RISC-V unprivileged
spec says that execution environments are allowed to return access
faults in this situation), they still need a fast, reliable, and clean
way to *detect* misaligned accesses. The "Addr. offset" field is not
reliable because htinst might be zero. Loading the original instruction
from memory and recomputing the faulting address is not fast. Loading
the original instruction from memory only when
(htinst == 0 && (stval & 0xffff) == 0) will be fast *on average*
because it only happens when stval points to the first word in a page,
but it doesn't seem clean: hypervisors now need a bunch of code to
decode the immediates in compressed load/store instructions and such,
but it's only used in this one quirky situation that *feels*
like a specification hole.
I can see two ways to modify the hypervisor specification (or to
tighten the requirements for the hypervisor spec in a platform
specification like "OS-A" that Anup mentioned earlier) but one is
weird and the other is heavyweight.
The heavyweight approach is to require that htinst be populated
with a transformed memory instruction when both (1) the underlying
memory fault was caused by a scalar integer memory operation (that
is, a load, store, or AMO operation from the integer registers) that
triggered a guest-page load or store fault, and (2) the value in
stval is not the original address. We would also declare a new
"Supported access type PMA" which says that certain non-main memory
regions cannot be manipulated from the floating point or vector
registers.
This requirement would not apply to mtinst (since M-mode firmware
might be responsible for populating htinst manually), it would not
apply to instruction guest-page faults (since hypervisors don't usually
want to emulate those), and it would not apply to access faults (since
this is caused by a misconfigured hypervisor, not a legitimate attempt to
use emulated MMIO). It would not apply to vector loads/stores/AMOs
(which do not currently have "standard transformation" formats to put
in htinst) and it would not apply to floating-point loads/stores (which
are not emulated in v19 of Anup's KVM patch, in any event).
This approach is heavyweight because htinst would need enough
flip-flops to store transformed integer load and store instructions
(as well as misaligned AMOs if they are included). But if we put it in
a platform spec then it becomes a "deluxe feature" (non-deluxe
hypervisors can just use guard pages, like I suggested earlier) and it
is easy for M-mode firmware to populate htinst properly. If M-mode is
emulating a misaligned load/store, it will already know the original
instruction as well as Addr. offset, so it can easily compute the
appropriate value of htinst.
A hypervisor would then handle load/store guest page faults like so:
- If the faulting page in stval does not correspond to device memory,
then handle this like a non-emulated page fault (e.g. deliver an
access violation to the guest operating system, or map a valid page
in the PTE and let the supervisor reattempt the access). In this
situation the hypervisor does not care whether the access was
misaligned, and these operations will still succeed even for
floating-point and vector accesses.
- Otherwise we are emulating device memory. First we check that the
original memory operation was a scalar integer operation: if it was
a float or vector load/store, then we reject it with an access
violation, which is legal because it violates the new "scalar integer
only" PMA classification.
- If we're still here, then it was a scalar integer memory operation,
so *either*
stval corresponds to the original address, *or* stval points to the
start of a guest page and htinst's Addr. offset field is populated.
Then the original address was:
(htval << 2) + (stval & 3) - ((htinst >> 15) & 31)
and we can test this for alignment.
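A minimal sketch of that last bullet (names are mine; it assumes the access width can be taken from funct3[1:0], bits 13:12, of the transformed instruction, which encodes log2 of the size for the standard scalar loads and stores):

```c
#include <stdint.h>

/* Recover the original guest-physical address from htval, stval, and
 * the "Addr. offset" field (bits 19:15) of the transformed
 * instruction in htinst. */
static uint64_t recovered_gpa(uint64_t htval, uint64_t stval, uint32_t htinst)
{
    return (htval << 2) + (stval & 3) - ((htinst >> 15) & 0x1f);
}

/* Test the recovered address for alignment; funct3[1:0] gives
 * log2(access size) for the standard scalar loads and stores
 * (e.g. "lwu" has funct3 = 6, so size = 4). */
static int gpa_is_misaligned(uint64_t htval, uint64_t stval, uint32_t htinst)
{
    uint64_t size = 1u << ((htinst >> 12) & 3);
    return (recovered_gpa(htval, stval, htinst) & (size - 1)) != 0;
}
```

With the running example (htval = 0x8000, stval = 0xfff20000, and a transformed lwu with Addr. offset = 2), the recovered address is 0x1fffe, which is not word-aligned, so the hypervisor would detect the misalignment.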
===
There's a lot here and I'm sorry it's so long, but to summarize my
points on the misaligned access issue:
1. v19 of the KVM patch fails to detect misaligned accesses
masquerading as guest-page faults.
2. Hypervisors can avoid this issue in practice by preceding emulated
MMIO memory with unmapped guest pages.
3. The privileged spec should explicitly require that when both pages
touched by a misaligned memory access would trigger guest-page
faults, the addresses in stval and htval should correspond to the
"first" page fault, not the "second" page-aligned fault.
4. A platform spec should require that M-mode firmware populate htinst
in this particular situation, and should require the hardware to
support enough bits in htinst to make this possible.
5. The list of supported access type PMAs in section 3.6 should include
reference to the "scalar integer only" restriction, so that
hypervisors aren't required to emulate loads and stores from vector
or floating-point registers.
Anthony