Hypervisor spec review


Anthony Coulter

Sep 19, 2021, 3:45:01 PM
to isa...@groups.riscv.org
1. Figure 5.46 on page 117 (the transformed HSV/HLV/HLVX instructions)
does not properly distinguish between hypervisor loads and stores. HLV
instructions do not have an rs2 field, and HSV instructions do not have
an rd.

2. Section 5.2.3 ("Hypervisor Interrupt Registers") defines hip as a
writable CSR, but the only writable bit is VSSIP, which is an alias of
the corresponding bit in hvip. Why not just make hip read-only? Is it
so that the CSR address can end in 0x44 like mip and sip?

3. Section 5.6.3 ("Transformed Instruction or Pseudoinstruction for
mtinst or htinst") on page 116 does not mention the FLH and FSH
instructions from the Zfh proposal, nor does the Zfh proposal mention
the issues for hypervisor emulation. If both specifications are
approved, I would expect FLH and FSH to be added to the list of
transformable memory instructions alongside FLW, FLD, FLQ, FSW, FSD,
and FSQ.

4. Also in section 5.6.3, transformed instructions now contain an
"Addr. offset" field in bits 15-19 which seems really handy for one
particular situation, but since this field is not guaranteed to be
available (because htinst might be set to zero) I don't see how
hypervisor software can be written to benefit from it.

To elaborate on the "Addr. offset" problem: the spec says that this
field will be nonzero only for misaligned memory accesses. That's true,
but I would go a step further: I believe it can be
nonzero only for misaligned memory accesses that straddle page
boundaries. This field contains the difference between the "faulting
virtual address (written to mtval or stval) and the original virtual
address." If part of a memory access is valid and another part is not,
the access must straddle either a page boundary or a PMP/PMA boundary,
but in practice I would expect PMP/PMA boundaries to be page-aligned.

So "Addr. offset" only contains useful information when a memory access
straddles a page boundary, the first page was valid, and the second
page was not. Hypervisor software can use this Addr. offset field to
determine that stval does not contain the original virtual address,
which is important information if you are emulating the memory
accesses. (Even if you refuse to emulate a misaligned memory access to
an emulated device driver, you want to at least detect the issue and
forward the exception to the guest.)
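Concretely, the recovery the spec intends is a single subtraction. A
minimal sketch in C, treating the CSR values as already read into
like-named variables:

    /* "Addr. offset" lives in bits 19:15 of the transformed
     * instruction in htinst; subtracting it from the faulting address
     * in stval recovers the address the guest originally used. */
    unsigned long addr_offset = (htinst >> 15) & 0x1f;
    unsigned long original_va = stval - addr_offset;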

But if htinst is sometimes zero, hypervisor software cannot count on
this information being available. Suppose the hypervisor gets a guest
page fault, that the virtual address in stval points to the beginning
of a guest page, and that htinst is zero. How does the hypervisor
determine whether this is a misaligned access and that the address in
stval is not the one that the guest tried to access? I can think of two
ways to address the situation:

1. Recompute the original virtual address by reading the original
instruction from memory, decoding the immediate, finding a copy of
rs1, and adding them (sketched in code after this list). This is a
lot of hassle and I suspect most hypervisors will forget to do this.
(I don't see any code like this in the latest RISC-V KVM patch, for
example.)

2. Follow a policy of leaving an unmapped guest page before any guest
page that is being used to emulate a device driver, so that the
address in stval is guaranteed to be the original virtual address,
and so can be tested for alignment.
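
For concreteness, here is a minimal sketch of the recomputation in
option (1) for an uncompressed load or store, assuming the raw
instruction bits and the guest's rs1 value have already been fetched
(the helper name is mine, and compressed encodings would need their
own decoder):

    #include <stdint.h>

    /* Recompute the virtual address targeted by a trapped 32-bit
     * base-ISA load or store, given the instruction bits and the
     * value of its rs1 register (recovered from the saved guest
     * GPRs). */
    static uint64_t recompute_target_va(uint32_t inst, uint64_t rs1_val)
    {
        uint32_t opcode = inst & 0x7f;
        int32_t imm;

        if (opcode == 0x03 || opcode == 0x07) {
            /* LOAD / LOAD-FP: I-type immediate in bits 31:20 */
            imm = (int32_t)inst >> 20;
        } else {
            /* STORE / STORE-FP: S-type immediate split across
             * bits 31:25 (imm[11:5]) and bits 11:7 (imm[4:0]) */
            imm = ((int32_t)(inst & 0xfe000000) >> 20) |
                  ((inst >> 7) & 0x1f);
        }
        return rs1_val + (uint64_t)(int64_t)imm;
    }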

If a hypervisor follows policy (2), then the addr_offset field becomes
useless because it will always be zero in situations where the
hypervisor wants to emulate guest memory. (It might be nonzero when
VU-mode software is trying to do a misaligned access in regular memory
but one of the pages happens to be swapped out, but in this situation
the hypervisor doesn't care about the offset; it just wants to know
which page it needs to remap.)

To ensure that (2) works properly, I would like to tighten some
exception priorities. Section 16.1 of the unprivileged spec
("Definition of the RVWMO memory order") says that misaligned loads and
stores may be decomposed into a set of memory operations of any
granularity that are mutually unordered (with respect to the memory
model). But Table 3.7 of the privileged spec (Synchronous exception
priorities) refers to the "first encountered page fault or access fault"
without defining it. There is some language in section 5.5.2
("Guest-Page Faults") that claims that "two different address
translations are involved" for misaligned memory accesses but does not
impose any ordering on them.

Now, section 4.3.2 ("Virtual Address Translation Process") establishes
relative priorities between page fault and access violation exceptions
within a single stage of translation for a single virtual address. It's
also clear that in two-stage translation of a single guest-virtual
address, the VS-stage must occur before the G-stage. But for a
misaligned virtual address we need to lock down a few things:

1. Move the "two different address translations are involved" language
from section 5.5.2 (where it applies only to hypervisors) to 4.3.2
so it also applies to supervisors.

2. Clarify that the language applies only to "scalar memory operations"
so that it is not construed to imply exception priorities for vector
memory operations.

3. Add a sentence indicating that if both address translations result
in page faults, the page fault corresponding to the lower-numbered
address will be encountered first for the purposes of table 3.7.
(Thus, the value in stval is guaranteed to be the original virtual
address in these situations; implementations are not allowed to
decompose the virtual address translation into two parts and do the
second translation first.)

4. Make a decision for the situation where both parts of the address
have faults, but they are not both page faults. Suppose the first
page has an access fault but the second page has a page fault. Is
an implementation required to "encounter" faults on the first page
before it encounters faults on the second page? Likewise, in
two-stage address translation, there are three possible exceptions
for each virtual address: VS page fault, guest page fault, and
access fault. What orderings are forbidden?

I think everyone can agree that if any supervisor-mode software
receives a page fault exception for a misaligned address, and both
parts of the access would trigger page faults, then the address in
stval ought to point to the original address, which is equivalent to
saying that the lower address is translated first for the purposes of
table 3.7. My opinion here is that (1) it should be made explicit
because the rest of the RISC-V spec takes an "anything goes" attitude
towards misaligned addresses and (2) we should be explicit about what
other sorts of ordering are or are not implied by the phrase "first
encountered page fault or access fault" in table 3.7.

That change will at least allow hypervisors to use unmapped guard
pages to catch misaligned addresses straddling guest page boundaries.

But that leaves my discomfort about the "Addr. offset" field, since
I'm not sure what it's good for now. Questions:

1. Will "Addr. offset" ever be nonzero *other* than situations where
a misaligned memory access straddles either a guest page boundary,
or a PMP/PMA boundary?

2. Can anyone think of a way to warn the hypervisor that the value in
stval is not the original virtual address, using a mandatory
mechanism instead of an optional one like htinst?

Regards,
Anthony

Anup Patel

Sep 22, 2021, 1:16:30 AM
to Anthony Coulter, RISC-V ISA Dev
Hi,

I have tried to answer some of your comments below, based on my
KVM RISC-V experience. I will let John answer all other comments.

On Mon, Sep 20, 2021 at 1:15 AM Anthony Coulter
<ri...@anthonycoulter.name> wrote:
>
> 1. Figure 5.46 on page 117 (the transformed HSV/HLV/HLVX instructions)
> does not properly distinguish between hypervisor loads and stores. HLV
> instructions do not have an rs2 field, and HSV instructions do not have
> an rd.

The funct7 field of the instruction encoding can be used to distinguish
between HLV and HSV instructions.

Misaligned load/store emulation is done only by the M-mode runtime
firmware, for both the hypervisor and the Guest/VM. If there is a page
fault while emulating a misaligned load/store in M-mode firmware, it is
forwarded to the hypervisor or Guest/VM it came from. This means
misaligned load/store faults don't reach the hypervisor or Guest/VM,
because the M-mode firmware compensates for the missing HW feature
using software emulation. In fact, the upcoming OS-A platform specs
make misaligned load/store emulation a required feature of the M-mode
runtime firmware.

Currently, hypervisors don't require the "Addr. offset" information
though it is useful. Most hypervisors determine the exact faulting
address from a combination of htval and stval CSRs.
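
That computation is a single line: htval holds the guest-physical
address shifted right by two, and the low two bits come from stval.
A sketch, with csr_read() standing in for whatever CSR accessor the
hypervisor provides:

    /* Recover the faulting guest-physical address on a guest-page
     * fault. */
    unsigned long gpa = (csr_read(CSR_HTVAL) << 2) |
                        (csr_read(CSR_STVAL) & 3);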

>
> But if htinst is sometimes zero, hypervisor software cannot count on
> this information being available. Suppose the hypervisor gets a guest
> page fault, that the virtual address in stval points to the beginning
> of a guest page, and that htinst is zero. How does the hypervisor
> determine whether this is a misaligned access and that the address in
> stval is not the one that the guest tried to access? I can think of two
> ways to address the situation:
>
> 1. Recompute the original virtual address by reading the original
> instruction from memory, decoding the immediate, finding a copy of
> rs1, and adding them. This is a lot of hassle and I suspect most
> hypervisors will forget to do this. (I don't see any code like this
> in the latest RISC-V KVM patch, for example.)

This is incorrect. The KVM RISC-V implementation will fall back to an
unprivileged (HLVX) instruction-based approach when htinst is zero.

Please check emulate_load() and emulate_store() in PATCH7 of the
latest KVM RISC-V patches
(see https://lkml.org/lkml/2021/7/27/83).
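
For reference, the core of that fallback is an HLVX-based fetch of the
trapping instruction from guest memory. A rough sketch, assuming an
assembler that knows the H-extension mnemonics, and eliding the trap
handling and the compressed/page-straddling fixups that the real code
needs:

    /* Read one halfword of the trapping instruction at a guest
     * virtual address, using the execute-permission-checked HLVX.HU
     * hypervisor load. */
    static inline unsigned long hlvx_hu(unsigned long vaddr)
    {
        unsigned long val;

        asm volatile ("hlvx.hu %0, (%1)"
                      : "=&r" (val) : "r" (vaddr) : "memory");
        return val;
    }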

>
> 2. Follow a policy of leaving an unmapped guest page before any guest
> page that is being used to emulate a device driver, so that the
> address in stval is guaranteed to be the original virtual address,
> and so can be tested for alignment.

No existing RISC-V hypervisor (KVM, Xvisor, etc.) maps anything
in the G-stage for emulated MMIO devices. This ensures that the
faulting address can be retrieved using the htval and stval CSRs.

>
> If a hypervisor follows policy (2), then the addr_offset field becomes
> useless because it will always be zero in situations where the
> hypervisor wants to emulate guest memory. (It might be nonzero when
> VU-mode software is trying to do a misaligned access in regular memory
> but one of the pages happens to be swapped out, but in this situation
> the hypervisor doesn't care about the offset; it just wants to know
> which page it needs to remap.)

As I mentioned above, "Addr. offset" is mostly a hint; misaligned
load/store traps never reach the hypervisor or Guest/VM because the
M-mode runtime firmware treats misaligned access support as a missing
HW feature and emulates it for all lower privilege modes.

Also, VS-stage pages being swapped out at runtime while misaligned
load/store emulation is in progress is already handled by the M-mode
runtime firmware (such as OpenSBI).

Regards,
Anup

Anthony Coulter

Sep 22, 2021, 11:38:50 AM
to an...@brainfault.org, ri...@anthonycoulter.name, isa...@groups.riscv.org
Hi Anup,

>> 1. Figure 5.46 on page 117 (the transformed HSV/HLV/HLVX instructions)
>> does not properly distinguish between hypervisor loads and stores. HLV
>> instructions do not have an rs2 field, and HSV instructions do not have
>> an rd.
>
> The funct7 field of the instruction encoding can be used to distinguish
> between HLV and HSV instructions.

I agree that the instructions can be distinguished by a hypervisor; my
objection is only that figure 5.46 is not correct, and should be split
into two separate figures.


> Misaligned load/store emulation is done only by the M-mode runtime
> firmware, for both the hypervisor and the Guest/VM. If there is a page
> fault while emulating a misaligned load/store in M-mode firmware, it is
> forwarded to the hypervisor or Guest/VM it came from. This means
> misaligned load/store faults don't reach the hypervisor or Guest/VM,
> because the M-mode firmware compensates for the missing HW feature
> using software emulation. In fact, the upcoming OS-A platform specs
> make misaligned load/store emulation a required feature of the M-mode
> runtime firmware.
>
> Currently, hypervisors don't require the "Addr. offset" information
> though it is useful. Most hypervisors determine the exact faulting
> address from a combination of htval and stval CSRs.

I'm referring to guest-page faults where the underlying address was
misaligned (that is, scause = 20, 21, or 23), not misaligned-access
exceptions (scause = 4 or 6).

To give a concrete example: a hypervisor presents memory range
0x10000-0x2ffff to a guest, consisting of two contiguous guest pages.
Page #1, which comprises bytes 0x10000-0x1ffff, is configured
as regular memory. That is, the G-stage page table entry for this page
is set with V = 1 and RWX = 111. Page #2 spans bytes 0x20000-0x2ffff,
and the hypervisor intends to use this to emulate an MMIO driver, so
the G-stage PTE for this page is set with V = 0.

The guest then sets up its own page tables, which don't matter a whole
lot for the purposes of this argument as long as the pages stay
contiguous, but to be concrete let's say that the guest pushes
everything into negative addresses, so page #1 is now mapped to
0xfff10000-0xfff1ffff (guest virtual address) and page #2 is mapped to
0xfff20000-0xfff2ffff.

Now: suppose the guest operating system attempts a four-byte load from
guest-virtual address 0xfff1fffe. The VS-stage page tables do not cause
any problems here, so this is converted into a request to load four
bytes from guest-physical address 0x1fffe. There are two possible valid
outcomes: either the bytes at guest-physical addresses 0x1fffe, 0x1ffff,
0x20000, and 0x20001 are concatenated and placed in rd (which is what
would happen if both pages were in normal memory), or the guest
receives an access fault for trying to do a misaligned access that
overlaps two memory regions with different PMA attributes. (The base
spec says that this is an option if the execution environment defines
it as one. If you don't think this should be allowed to happen, then
that's fine: you believe that the only possible outcome is that the
four bytes are read and concatenated in rd.)

My claim is that the M-mode firmware and the HS-mode hypervisor,
working together, will fail to do either of the correct behaviors, and
what will instead happen is that the hypervisor will incorrectly load
bytes 0x20000, 0x20001, 0x20002, and 0x20003 into rd.

Here is how that happens in the normal case:

1. The hardware detects a misaligned memory access and delivers a
load address misaligned fault to M-mode. It sets mcause = 4 and
mtval = 0xfff1fffe.

2. The M-mode firmware sets mstatus.MPRV = 1 so that the virtual
address in mtval undergoes two-stage page table translation, and it
uses 'lbu' instructions to manually load the four bytes from virtual
addresses 0xfff1fffe, 0xfff1ffff, 0xfff20000, and 0xfff20001. (A
sketch of this MPRV trick appears after this list.)

3. The first two lbu instructions succeed but the third (the attempt
to load 0xfff20000) fails with a "load guest-page fault" exception
with:

mcause = 21 [Load guest-page fault]
mtval = 0xfff20000
mtval2 = 0x8000 [Guest-physical address 0x20000 shifted two bits right]
mtinst = transformation of an "lbu" instruction with "Addr. offset" = 0

4. M-mode firmware forwards this error to the HS-mode hypervisor. But
the guest attempted to load a word, not a byte, so the M-mode
firmware adjusts htinst appropriately, setting:

scause = 21
stval = 0xfff20000
htval = 0x8000
htinst = transformation of an "lwu" instruction with "Addr. offset" = 2

Note that section 5.2.3 of the spec says that the address in stval must
be the faulting portion of the virtual address, which may be a page
boundary that is higher than the original misaligned address that the
guest tried to load from.
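
Here is the sketch promised in step 2, written as M-mode C and modeled
loosely on what OpenSBI's unprivileged-access helpers do (real
firmware also swaps mtvec so that a fault taken during the load can be
caught and redirected, which I elide here):

    /* Load one byte with mstatus.MPRV = 1, so the address is
     * translated and permission-checked as if by the privilege mode
     * the trap came from (MPP and MPV were set by hardware on trap
     * entry). */
    static uint8_t load_u8_as_trapped_mode(const uint8_t *addr)
    {
        const unsigned long mprv = 1UL << 17;    /* MSTATUS_MPRV */
        uint8_t val;

        asm volatile ("csrrs t0, mstatus, %2\n\t" /* set MPRV        */
                      "lbu   %0, 0(%1)\n\t"       /* translated load */
                      "csrw  mstatus, t0"         /* restore mstatus */
                      : "=&r" (val)
                      : "r" (addr), "r" (mprv)
                      : "t0", "memory");
        return val;
    }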

Now here is what's *supposed* to happen:

5a. The hypervisor uses stval and htval to compute the faulting
guest-physical address as (htval << 2) | (stval & 3) = 0x20000. But it
should also recover the *original* guest-physical address by
subtracting "Addr. offset" from the faulting address. Since Addr. offset
is 2, the original guest-physical address is correctly recovered as
0x1fffe, and the hypervisor can deal with this appropriately.

Here is what a hypervisor is *supposed* to do if htinst is zero:

5b. The hypervisor computes the faulting guest-physical address as
0x20000 as before. It notices that htinst is empty so it loads the
faulting instruction from sepc. It sees that the original instruction
was an "lwu." It must *also* extract rs1 and the immediate offset from
this lwu instruction and add them together to recover the original
guest-virtual address (which was 0xfff1fffe). It then compares this to
the contents of stval (which is 0xfff20000) and realizes that the
originally-requested address is *not* the address that faulted. (Note
that the hypervisor should be able to skip this check if the value in
stval is not a multiple of 0x10000, because the problem I'm talking
about only occurs when stval points to the beginning of a guest page.)

Here is what happens in v19 of the KVM patch:

5c. The hypervisor computes the faulting guest-physical address as
0x20000 just like the other examples. It checks htinst to determine
the faulting instruction type, and if htinst is zero it loads the
faulting instruction from sepc; either way it determines that the
faulting instruction was "lwu." It then emulates a word-sized load
from guest-physical address 0x20000 (failing to detect that the guest
actually tried to load from 0x1fffe), and returns the emulated contents
of bytes 0x20000-0x20003 from the emulated MMIO driver into rd.

That's the core problem: if htinst is zero on a guest page fault, the
only way for the hypervisor to determine that the original address was
misaligned is to recompute that address from the decoded rs1 and
immediate in the instruction itself.

Furthermore: I believe that guest-page faults (and access faults caused
by PMP/PMA checks, which should never occur in practice for
correctly-written hypervisors) are the only use for the "Addr. offset"
field. To respond to your note that:

> As I mentioned above, "Addr. offset" is mostly a hint; misaligned
> load/store traps never reach the hypervisor or Guest/VM because the
> M-mode runtime firmware treats misaligned access support as a missing
> HW feature and emulates it for all lower privilege modes.

Even if the OS-A platform specs didn't require the firmware to deal
with misaligned load/store traps, the "Addr. offset" field would always
be zero in htinst for misaligned load/store traps because the original
load address would already be in stval, and the offset is what you're
supposed to subtract from stval to get the original address. So the
only use of the Addr. offset field is to catch this sort of situation
where a misaligned access spans permission boundaries and the second
half of the load fails.


Per my original email, hypervisors could avoid this problem by leaving
an unmapped guard page before any emulated MMIO memory. The scenario described
above was possible only because the MMIO page at 0x20000-0x2ffff was
immediately preceded by a regular memory page at 0x10000-0x1ffff. If
that preceding page had *also* been unmapped, then any sane firmware
implementation would report a page fault for address 0x1fffe instead
of 0x20000, and the problem would go away.

To protect against *that* case, I do think (per my previous email) that
the spec should still be tightened up to require that the firmware try
to read the bytes in order. That is, if both memory pages are unmapped
and unprivileged software tries to do a four-byte load from address
0x1fffe, the page fault for 0x1fffe should take precedence over the
page fault from 0x20000, to guarantee that the hypervisor sees the
misaligned address resulting from the guard page.


That said, the situation *does* leave me with complicated feelings
surrounding the "Addr. offset" field. Stepping back and thinking about
why things are the way they are: the behavior of stval for ordinary
page faults makes sense, because ordinary supervisors aren't generally
emulating anything---when a regular operating system receives a page
fault, it is more interested in knowing which page faulted (so it can
decide whether to terminate the user process, make a writable copy of
a copy-on-write page, swap the page back in from the hard disk, etc.)
but generally speaking it will handle the situation by *either* fixing
up the page table and letting the user process reattempt the access,
*or* killing the user process. In neither of these common use cases
does the supervisor care whether the user process was attempting a
misaligned access.

Hypervisors, on the other hand, are far more likely to want to emulate
virtualized-mode memory accesses. While hypervisors are free to reject
misaligned accesses in emulated device-space (the RISC-V unprivileged
spec says that execution environments are allowed to return access
faults in this situation), they still need a fast, reliable, and clean
way to *detect* misaligned accesses. The "Addr. offset" field is not
reliable because htinst might be zero. Loading the original instruction
from memory and recomputing the faulting address is not fast. Loading
the original instruction from memory only when
(htinst == 0 && (stval & 0xffff) == 0) will be fast *on average*
because it only happens when stval points to the first word in a page,
but it doesn't seem clean: hypervisors now need a bunch of code to
decode the immediates in compressed load/store instructions and such,
but it's only used in this really quirky situation that really *feels*
like a specification hole.
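
In code, that average-case-fast check would look something like this
(page size 0x10000 as in my running example; decode_and_recompute()
is the hypothetical slow path from my earlier email):

    unsigned long orig_va;

    if (htinst != 0) {
        /* Addr. offset (bits 19:15) is the distance back from the
         * faulting address to the original one */
        orig_va = stval - ((htinst >> 15) & 0x1f);
    } else if ((stval & 0xffff) != 0) {
        /* stval is not page-aligned, so it cannot be the second
         * half of a straddling access */
        orig_va = stval;
    } else {
        /* slow path: fetch and decode the instruction at sepc */
        orig_va = decode_and_recompute(sepc);
    }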

I can see two ways to modify the hypervisor specification (or to
tighten the requirements for the hypervisor spec in a platform
specification like "OS-A" that Anup mentioned earlier) but one is
weird and the other is heavyweight.

The heavyweight approach is to require that htinst be populated
with a transformed memory instruction when both (1) the underlying
memory fault was caused by a scalar integer memory operation (that
is, a load, store, or AMO operation from the integer registers) that
triggered a guest-page load or store fault, and (2) the value in
stval is not the original address. We would also declare a new
"Supported access type PMA" which says that certain non-main memory
regions cannot be manipulated from the floating point or vector
registers.

This requirement would not apply to mtinst (since M-mode firmware
might be responsible for populating htinst manually), it would not
apply to instruction guest-page faults (since hypervisors don't usually
want to emulate those), and it would not apply to access faults (since
these are caused by a misconfigured hypervisor, not a legitimate attempt to
use emulated MMIO). It would not apply to vector loads/stores/AMOs
(which do not currently have "standard transformation" formats to put
in htinst) and it would not apply to floating-point loads/stores (which
are not emulated in v19 of Anup's KVM patch, in any event).

This approach is heavyweight because htinst would need enough
flip-flops to store transformed integer load and store instructions
(as well as misaligned AMOs if they are included). But if we put it in
a platform spec then it becomes a "deluxe feature" (non-deluxe
hypervisors can just use guard pages, like I suggested earlier) and it
is easy for M-mode firmware to populate htinst properly. If M-mode is
emulating a misaligned load/store, it will already know the original
instruction as well as Addr. offset, so it can easily compute the
appropriate value of htinst.

A hypervisor would then handle load/store guest page faults like so:

- If the faulting page in stval does not correspond to device memory,
then handle this like a non-emulated page fault (e.g. deliver an
access violation to the guest operating system, or map a valid page
in the PTE and let the supervisor reattempt the access). In this
situation the hypervisor does not care whether the access was
misaligned, and these operations will still succeed even for
floating-point and vector accesses.

- Otherwise we are emulating device memory. First we check that the
original memory operation was a scalar integer operation: if it was
a float or vector load/store, then we reject it with an access
violation, which is legal because it violates the new "scalar integer
only" PMA classification.

- If we're still here, then it was a scalar integer memory operation,
so *either*
stval corresponds to the original address, *or* stval points to the
start of a guest page and htinst's Addr. offset field is populated.
Then the original address was:

(htval << 2) + (stval & 3) - ((htinst >> 15) & 31)

and we can test this for alignment.

===

There's a lot here and I'm sorry it's so long, but to summarize my
points on the misaligned access issue:

1. v19 of the KVM patch fails to detect misaligned accesses
masquerading as guest-page faults.

2. Hypervisors can avoid this issue in practice by preceding emulated
MMIO memory with unmapped guest pages.

3. The privileged spec should explicitly require that when both pages
touched by a misaligned memory access would trigger guest-page
faults, the addresses in stval and htval should correspond to the
"first" page fault, not the "second" page-aligned fault.

4. A platform spec should require that M-mode firmware populate htinst
in this particular situation, and should require the hardware to
support enough bits in htinst to make this possible.

5. The list of supported access type PMAs in section 3.6 should include
reference to the "scalar integer only" restriction, so that
hypervisors aren't required to emulate loads and stores from vector
or floating-point registers.

Anthony

Anup Patel

Sep 23, 2021, 12:27:10 PM
to Anthony Coulter, RISC-V ISA Dev
On Wed, Sep 22, 2021 at 9:08 PM Anthony Coulter
<ri...@anthonycoulter.name> wrote:
>
> Hi Anup,
>
> >> 1. Figure 5.46 on page 117 (the transformed HSV/HLV/HLVX instructions)
> >> does not properly distinguish between hypervisor loads and stores. HLV
> >> instructions do not have an rs2 field, and HSV instructions do not have
> >> an rd.
> >
> > The funct7 field of the instruction encoding can be used to distinguish
> > between HLV and HSV instructions.
>
> I agree that the instructions can be distinguished by a hypervisor; my
> objection is only that figure 5.46 is not correct, and should be split
> into two separate figures.

I see your point but (like you said) this can be easily corrected by
splitting the figure into two.

This is certainly an interesting example. It should be noted that
most hypervisors generally don't place Guest/VM RAM and emulated
I/O next to each other, so to replicate this case we would have to
create a synthetic Guest, with software running inside it that does a
misaligned access at the RAM boundary.

Yes, this issue has nothing to do with where misaligned load/store
traps are emulated.

>
>
> Per my original email, hypervisors could avoid this problem by leaving
> an unmapped guard page before any emulated MMIO memory. The scenario described
> above was possible only because the MMIO page at 0x20000-0x2ffff was
> immediately preceded by a regular memory page at 0x10000-0x1ffff. If
> that preceding page had *also* been unmapped, then any sane firmware
> implementation would report a page fault for address 0x1fffe instead
> of 0x20000, and the problem would go away.

Yes, for most hypervisors we have an empty/unmapped area around
Guest MMIO space.
Over time there will be more fixes and hardening of the KVM RISC-V
implementation, so I will put this on my TODO list. Thanks for
pointing it out.

I will also evaluate the OpenSBI (M-mode) firmware in the context
of your example.

>
> 2. Hypervisors can avoid this issue in practice by preceding emulated
> MMIO memory with unmapped guest pages.

As mentioned previously, most hypervisors will generally have
empty/unmapped space before and after the Guest MMIO space.

>
> 3. The privileged spec should explicitly require that when both pages
> touched by a misaligned memory access would trigger guest-page
> faults, the addresses in stval and htval should correspond to the
> "first" page fault, not the "second" page-aligned fault.
>
> 4. A platform spec should require that M-mode firmware populate htinst
> in this particular situation, and should require the hardware to
> support enough bits in htinst to make this possible.

Currently, the platform spec says the following for mtinst/htinst:
"htinst and mtinst must not be hardwired to 0 and must be written with
a transformed instruction (versus zero) when defined and allowed
architecturally."

>
> 5. The list of supported access type PMAs in section 3.6 should include
> reference to the "scalar integer only" restriction, so that
> hypervisors aren't required to emulate loads and stores from vector
> or floating-point registers.
>
> Anthony
>

Regards,
Anup

Anthony Coulter

Sep 23, 2021, 2:52:23 PM
to and...@sifive.com, an...@brainfault.org, isa...@groups.riscv.org
Hi Anup! Thanks for replying again. I suppose I should mention that I
don't mean to criticize your KVM patches so much as to show that the
"gotchas" in the hypervisor spec are easily overlooked even by RISC-V
hypervisor experts like yourself. I often look to your Linux and
OpenSBI code online to refine my understanding of how the features of
the privileged spec are intended to be used---so when I say "the KVM
patch doesn't do such-and-such" what I mean is "this issue will trip
up even the experts!"


>> [My description of the two contiguous pages]
>>
>
> This is certainly an interesting example. It should be noted that
> most hypervisors generally don't place Guest/VM RAM and emulated
> I/O next to each other so to replicate this case we will have to create
> synthetic Guest and software running inside Guest which will do
> misaligned access at the RAM boundary.

I'll grant that most hypervisors will have what I'm calling a "guard
page" without even thinking about this particular issue, but one day
someone working in RV32 might find themselves packing their address
space pretty tightly and this issue could become possible. Since it's
subtle and nonobvious, I would propose the following normative text
for inclusion in the hypervisor spec:

"If, on a guest-page fault, the address in stval points to the
beginning of a guest page, hypervisor software must check the Addr.
offset field in htinst to verify that the trapping instruction was
not a misaligned memory access pointing to the guest-page immediately
preceding the one pointed to by stval. If htinst is zero and the
Addr. offset field is not available, the hypervisor must compute the
original virtual address by decoding the original instruction. To
avoid this situation, hypervisors can avoid placing valid guest
memory pages immediately before unmapped guest pages that are used
to emulate device memory."

The most logical place to put this would be just under the paragraph
explaining the "Addr. offset" field.


>> 4. A platform spec should require that M-mode firmware populate htinst
>> in this particular situation, and should require the hardware to
>> support enough bits in htinst to make this possible.
>
> Currently, the platform spec says the following for mtinst/htinst:
> "htinst and mtinst must not be hardwired to 0 and must be written with
> a transformed instruction (versus zero) when defined and allowed
> architecturally."

Oh, wow, my copy of riscv-platform-specs has been following "master"
and not "main" all this time, so I didn't realize stuff has been added
since 2018. I consider this problem solved, then.


I'm also CC'ing Andrew Waterman to draw his attention to two points
that, I think, have implications beyond the hypervisor spec.


>> 3. The privileged spec should explicitly require that when both pages
>> touched by a misaligned memory access would trigger guest-page
>> faults, the addresses in stval and htval should correspond to the
>> "first" page fault, not the "second" page-aligned fault.

Question for Andrew: Does the privileged spec already require this for
regular (one-stage) virtual address translations and I'm just missing
it? To restate the question so you don't have to read the whole
discussion: A misaligned memory access that straddles a page boundary
causes two separate virtual address translations, each of which can
raise either a page fault or a PMP/PMA access fault. Supposing that
both of these memory accesses would cause faults, does the privileged
spec currently guarantee that the exception delivered to S-mode will
be the fault corresponding to the first part of the virtual address?
I see lots of warnings all throughout the specs which insist that
misaligned memory accesses should be thought of as being implemented
by a sequence of unordered byte stores, which leads me to believe that
if M-mode firmware wanted to emulate a misaligned memory access by
starting with the highest address and working its way down, it is free
to do so. But that would mean that even in the non-hypervisor case, if
a supervisor gets a page fault and stval points to the beginning of a
page, the only way to tell if the original access was misaligned is to
decode the instruction manually, even if the faulting page in stval
is preceded by another unmapped page.

I don't necessarily consider this behavior to be a problem for regular
supervisors since they aren't usually emulating things, but I'm still
curious---is this an allowable behavior under the current privileged
spec? I have not been able to find any text that forbids it.


>> 5. The list of supported access type PMAs in section 3.6 should include
>> reference to the "scalar integer only" restriction, so that
>> hypervisors aren't required to emulate loads and stores from vector
>> or floating-point registers.

This is another question for which I'm interested in Andrew's thoughts,
since while the issue was never really important before the hypervisor
and vector specs, the concept of a "PMA" is a fairly generic thing in
chapter 3 of the supervisor spec.

Right now the vector spec proposal says that misaligned vector memory
accesses should be covered by a separate PMA from misaligned scalar
memory accesses. But even aligned vector memory accesses would be a
pain to emulate. It's easy to think of ways to use the vector extension
to speed up a device driver: memcpy is the most obvious, but maybe you
could also use a zero-stride store to write a bunch of words to the
same address for a device that benefits from that. Such accesses are
not misaligned, so in the absence of a separate PMA for vector memory
accesses (even when they're aligned), hypervisors are technically
obligated to emulate vector memory operations for any memory range that
supports any sort of emulation.

This could be seen as a vector-specific concern, but I'll note that
floating-point loads and stores are also of interest. True, they're
much easier to emulate than vector instructions, but they're also much
less likely to be used in practice. I can imagine someone using a
vector-enhanced memcpy or a zero-stride load or store within a device
driver, but who's using the floating point registers for that? (To be
fair, old x86 libraries sometimes implemented memcpy with the floating
point register stack because it worked faster that way on some chips.)

My thoughts here are:

1. Defining a "scalar accesses only" PMA allows hypervisors that run on
hardware which supports the vector spec to decline to emulate vector
memory accesses to device drivers.

2. If we're going so far as to define a "scalar accesses only" PMA for
use in device I/O memory, we might as well go a step further and
define an "integer scalar accesses only" PMA.

3. If we have an "integer scalar accesses only" PMA, I suspect that
nobody will ever use the "scalar accesses only" PMA in practice, so
we don't actually need to define the latter.

4. We can define this PMA regardless of the details of the hypervisor
and vector specs; the justification for "integer scalar accesses
only" is simply that hypervisors (whether they're using the
hypervisor extension or they're written in raw M-mode) can't be
expected to keep up with all RISC-V extensions (which currently
include floating-point and vector memory accesses, but might include
more in the future).

Anthony

John Hauser

Oct 1, 2021, 9:56:49 PM
to RISC-V ISA Dev
Thanks for your review of the hypervisor extension, Anthony.  I'll
answer your first three points here, and join the long discussion about
your fourth point separately.


> 1. Figure 5.46 on page 117 (the transformed HSV/HLV/HLVX instructions)
> does not properly distinguish between hypervisor loads and stores. HLV
> instructions do not have an rs2 field, and HSV instructions do not have
> an rd.

If we look to the instruction formats at the top of Table 6.1 of the
Privileged Architectures manual, it appears that HSV instructions are
meant to be R-type instructions, which have an rd field.  The HSV
instructions just require this field to be zero, so the destination
register is always x0.  Hence, I don't believe Figure 5.46 is really
ambiguous for HSV.

On the other hand, your observation appears valid for HLV and HLVX, as
these may reasonably be interpreted as I-type instructions.  We should
modify the document in some way to account for this.


> 2. Section 5.2.3 ("Hypervisor Interrupt Registers") defines hip as a
> writable CSR, but the only writable bit is VSSIP, which is an alias of
> the corresponding bit in hvip. Why not just make hip read-only? Is it
> so that the CSR address can end in 0x44 like mip and sip?

When they are not forced to be read-only zeros, bits SEIP, STIP, and
SSIP of vsip are defined as aliases of hip's VSEIP, VSTIP, and VSSIP
respectively.  When implemented, sip.SSIP is required to be writable
(see Section 4.1.3, "Supervisor Interrupt Registers (sip and sie)"),
which implies that vsip.SSIP must be writable, which implies that
hip.VSSIP is also writable.

Given that vsip.SEIP and vsip.STIP are specified naturally as aliases
of hip.VSEIP and hip.VSTIP, we don't want to be forced to define
vsip.SSIP irregularly as an alias of hvip.VSSIP instead of hip.VSSIP,
assuming hip.VSSIP were read-only as you've proposed.


> 3. Section 5.6.3 ("Transformed Instruction or Pseudoinstruction for
> mtinst or htinst") on page 116 does not mention the FLH and FSH
> instructions from the Zfh proposal, nor does the Zfh proposal mention
> the issues for hypervisor emulation. If both specifications are
> approved, I would expect FLH and FSH to be added to the list of
> transformable memory instructions alongside FLW, FLD, FLQ, FSW, FSD,
> and FSQ.

Of course.

But this will be an ongoing issue for every new extension that adds
memory access instructions, not just Zfh.  I'm sure we can expect
the documentation to get updated eventaully for new memory access
instructions.  In the meantime, if RISC-V International is inordinately
slow to define transformations for some new memory access instructions
(e.g. FLH and FSH, or the memory access instructions of the vector
extension), let's not forget that the effect on software is solely
performative, never functional, assuming software is written correctly
to handle a substitute of zero as required.

I understand there are plans to require that "RVA" platforms never
write zero to mtinst/htinst.  I'm not sure that's a viable choice,
but I'll attempt to address that question in the separate, longer
discussion.

    - John Hauser

Andrew Waterman

Oct 2, 2021, 7:59:28 AM
to John Hauser, RISC-V ISA Dev
FWIW, I don't think that's a viable choice, either.  For expediency, I'll hold off on explaining why unless this topic comes to a head.
 



Greg Favor

Oct 2, 2021, 1:06:38 PM
to Andrew Waterman, John Hauser, RISC-V ISA Dev
On Sat, Oct 2, 2021 at 4:59 AM Andrew Waterman <and...@sifive.com> wrote:
>> I understand there are plans to require that "RVA" platforms never
>> write zero to mtinst/htinst.  I'm not sure that's a viable choice,
>> but I'll attempt to address that question in the separate, longer
>> discussion.
>
> FWIW, I don't think that's a viable choice, either.  For expediency, I'll hold off on explaining why unless this topic comes to a head.

The proposed platform plan is not to require only non-zero values be written.  Besides there of course being many arch exceptions where zero is architecturally required to be written, the Priv spec (H chapter) only defines specific transformed instruction formats for a limited set of load/store instructions (since it isn't "aware" of new load/store instructions introduced by new extensions).  The spec also says that a "future standard or custom extension may permit other values to be written, chosen from the set of allowed values established earlier".

In other words it is the responsibility of a new extension to define the transformed instruction formats, for all, some, or none of the new load/store instructions that it creates.  The CMO extensions, for example, do this.  Without a new arch spec defining the specific transformed instruction format for a new instruction, zero must be written for that instruction.

Greg

Greg Favor

Oct 2, 2021, 1:17:01 PM
to Greg Favor, Andrew Waterman, John Hauser, RISC-V ISA Dev
On Sat, Oct 2, 2021 at 10:06 AM Greg Favor <gfa...@ventanamicro.com> wrote:
> In other words it is the responsibility of a new extension to define the transformed instruction formats, for all, some, or none of the new load/store instructions that it creates.  The CMO extensions, for example, do this.  Without a new arch spec defining the specific transformed instruction format for a new instruction, zero must be written for that instruction.

P.S. This also addresses the issue of a new extension wanting transformed instruction support but having to wait for a new version of the H extension to be ratified, as well as retaining the ISA modularity of RISC-V in a clear and easy-to-manage way.

Greg
 

John Hauser

Oct 2, 2021, 3:59:57 PM
to RISC-V ISA Dev
Anthony Coulter wrote:
> 4. Also in section 5.6.3, transformed instructions now contain an
> "Addr. offset" field in bits 15-19 which seems really handy for one
> particular situation, but since this field is not guaranteed to be
> available (because htinst might be set to zero) I don't see how
> hypervisor software can be written to benefit from it.
>
> To elaborate on the "Addr. offset" problem: the spec says that this
> field will be nonzero only for misaligned memory accesses. That's true,
> but I would go a step farther: I believe it can believe it can be
> nonzero only for misaligned memory accesses that straddle page
> boundaries. This field contains the difference between the "faulting
> virtual address (written to mtval or stval) and the original virtual
> address." If part of a memory access is valid and another part is not,
> the access must straddle either a page boundary or a PMP/PMA boundary,
> but in practice I would expect PMP/PMA boundaries to be page-aligned.

While this is likely to be true in practice, the RISC-V Privileged ISA
allows PMP to be 4-byte granularity even when the hypervisor extension
is implemented.  So the ISA manual can't promise that "Addr. offset"
will be nonzero only on a 4-kB page boundary.


> So "Addr. offset" only contains useful information when a memory access
> straddles a page boundary, the first page was valid, and the second
> page was not. Hypervisor software can use this Addr. offset field to
> determine that stval does not contain the original virtual address,
> which is important information if you are emulating the memory
> accesses. (Even if you refuse to emulate a misaligned memory access to
> an emulated device driver, you want to at least detect the issue and
> forward the exception to the guest.)
>
> But if htinst is sometimes zero, hypervisor software cannot count on
> this information being available. Suppose the hypervisor gets a guest
> page fault, that the virtual address in stval points to the beginning
> of a guest page, and that htinst is zero. How does the hypervisor
> determine whether this is a misaligned access and that the address in
> stval is not the one that the guest tried to access? I can think of two
> ways to address the situation:
>
> 1. Recompute the original virtual address by reading the original
> instruction from memory, decoding the immediate, finding a copy of
> rs1, and adding them.

Complete emulation of a trapping memory access requires knowing at a
minimum the type of access (size, signedness, etc.) and the source or
destination register for the data to load/store, all of which requires
decoding the original instruction.  If information about the original
instruction isn't provided by htinst because htinst is zero, then the
original instruction must be read from memory.  In addition, for a
store, the data to store must be obtained from the correct source
register, while for a load, the loaded data must be written to the
correct destination register.

Among all of that, obtaining the values of the base address register
and offset and adding them together is just a few more steps.  If
htinst is nonzero, these extra steps can be avoided thanks to the
"Addr. offset" field of the transformed instruction.  But that's solely
a performance enhancement.

> This is a lot of hassle and I suspect most
> hypervisors will forget to do this. (I don't see any code like this
> in the latest RISC-V KVM patch, for example.)

Hassle or not, hypervisors must do it.  I don't accept that we should
be concerned they will forget.  I trust that Anup Patel is correct when
he says you've overlooked the KVM code that handles this for RISC-V.


> 2. Follow a policy of leaving an unmapped guest page before any guest
> page that is being used to emulate a device driver, so that the
> address in stval is guaranteed to be the original virtual address,
> and so can be tested for alignment.
>
> If a hypervisor follows policy (2), then the addr_offset field becomes
> useless because it will always be zero in situations where the
> hypervisor wants to emulate guest memory. (It might be nonzero when
> VU-mode software is trying to do a misaligned access in regular memory
> but one of the pages happens to be swapped out, but in this situation
> the hypervisor doesn't care about the offset; it just wants to know
> which page it needs to remap.)
>
> To ensure that (2) works properly, I would like to tighten some
> exception priorities.  [...]

Your option 2 (and all that follows that I've elided) isn't sufficient
to emulate a memory access when htinst is zero.  It's still necessary
to decode the original instruction, which in this case must be read
from memory.


> But that leaves my discomfort about the "Addr. offset" field, since
> I'm not sure what it's good for now. Questions:
>
> 1. Will "Addr. offset" ever be nonzero *other* than situations where
> a misaligned memory access straddles either a guest page boundary,
> or a PMP/PMA boundary?

Since your proposed solution isn't actually sufficient to allow memory
accesses to be emulated, I don't believe it renders the "Addr. offset"
field useless.


> 2. Can anyone think of a way to warn the hypervisor that the value in
> stval is not the original virtual address, using a mandatory
> mechanism instead of an optional one like htinst?

Doing so wouldn't be sufficient.

    - John Hauser

Anthony Coulter

Oct 2, 2021, 4:50:39 PM
to isa...@groups.riscv.org, jhause...@gmail.com
> Complete emulation of a trapping memory access requires knowing at a
> minimum the type of access (size, signedness, etc.) and the source or
> destination register for the data to load/store, all of which requires
> decoding the original instruction. If information about the original
> instruction isn't provided by htinst because htinst is zero, then the
> original instruction must be read from memory. In addition, for a
> store, the data to store must be obtained from the correct source
> register, while for a load, the loaded data must be written to the
> correct destination register.
>
> Among all of that, obtaining the values of the base address register
> and offset and adding them together is just a few more steps. If
> htinst is nonzero, these extra steps can be avoided thanks to the
> "Addr. offset" field of the transformed instruction. But that's solely
> a performance enhancement.
>
> [...]
>
> Your option 2 (and all that follows that I've elided) isn't sufficient
> to emulate a memory access when htinst is zero. It's still necessary
> to decode the original instruction, which in this case must be read
> from memory.

This makes sense; I can't disagree. Objection withdrawn.

Anthony