Public review for standard extension Svpbmt


Stephano Cetola

Sep 16, 2021, 8:07:55 PM
to isa...@groups.riscv.org, Dan Lustig, Andrea Mondelli
We are delighted to announce the start of the public review period for
the following proposed standard extension to the RISC-V ISA:
Svpbmt - Page-based Memory Types

The review period begins today, Thursday Sept 16, and ends on Sunday
Oct 31 (inclusive).

This extension is part of the Privileged Specification.

This extension is described in the PDF spec available at:
https://github.com/riscv/virtual-memory/blob/main/specs/663-Svpbmt.pdf

To respond to the public review, please either email comments to the
public isa-dev mailing list or add issues to the Virtual Memory GitHub
repo: https://github.com/riscv/virtual-memory. We welcome all input
and appreciate your time and effort in helping us by reviewing the
specification.

During the public review period, corrections, comments, and
suggestions will be gathered for review by the Virtual Memory Task
Group. Any minor corrections and/or uncontroversial changes will be
incorporated into the specification. Any remaining issues or proposed
changes will be addressed in the public review summary report. If
there are no issues that require incompatible changes to the public
review specification, the Privileged ISA Committee will recommend the
updated specifications be approved and ratified by the RISC-V
Technical Steering Committee and the RISC-V Board of Directors.

Thanks to all the contributors for all their hard work.

Kind Regards,
Stephano
--
Stephano Cetola
Director of Technical Programs
RISC-V International

Paul Donahue

Oct 18, 2021, 2:31:38 PM
to RISC-V ISA Dev, step...@riscv.org, Daniel Lustig, Andrea Mondelli
Svpbmt says:
When Svpbmt is used with non-zero PBMT encodings, it is possible for multiple virtual aliases of the same physical page to exist simultaneously with different memory attributes. It is also possible for a U-mode or S-mode mapping through a PTE with Svpbmt enabled to observe different memory attributes for a given region of physical memory than a concurrent access to the same page performed by M-mode or when MODE=Bare. In such cases, the behaviors dictated by the attributes may be violated, and platform-specific mechanisms must be used to restore the expected behaviors.

My understanding is that "the attributes" in the last sentence refers only to cacheability, idempotency, memory ordering, and main memory vs. I/O.  Couldn't there be a loss of coherency and thus a violation of the coherency attribute?


Thanks,

-Paul

Greg Favor

Oct 18, 2021, 5:25:17 PM
to Paul Donahue, RISC-V ISA Dev, step...@riscv.org, Daniel Lustig, Andrea Mondelli
On Mon, Oct 18, 2021 at 11:31 AM Paul Donahue <pdon...@ventanamicro.com> wrote:
In such cases, the behaviors dictated by the attributes may be violated, and platform-specific mechanisms must be used to restore the expected behaviors.

My understanding is that "the attributes" in the last sentence refers only to cacheability, idempotency, memory ordering, and main memory vs. I/O.  Couldn't there be a loss of coherency and thus a violation of the coherency attribute?

Yes.  Although pbmt's only directly affect certain attributes, if software allows "mismatched attributes" in how it uses pbmt's, then coherency between accesses using different cacheability attributes is "lost".  Accesses using the same cacheability attribute will have coherency wrt each other.  So in that sense coherency is maintained.  But if software creates mismatched attributes, then global coherency across all types of accesses is "lost".
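Greg's answer separates two things: virtual aliasing itself (benign) and mismatched attributes across aliases (where coherency is "lost"). The aliasing half is easy to demonstrate from userspace with ordinary POSIX shared memory; PBMT values live in page-table entries and are not reachable from user code, so the sketch below (the `/svpbmt-alias-demo` name and the helper function are mine, purely for illustration) shows only the setup the spec text describes: one physical page visible at two virtual addresses.

```c
/* One physical page, two virtual aliases -- the situation the Svpbmt text
 * describes, minus the per-mapping memory attributes, which only
 * privileged software can set.  Returns 0 on success. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int alias_roundtrip(void) {
    const char *name = "/svpbmt-alias-demo";        /* hypothetical name */
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, 4096) != 0)
        return -1;

    /* Map the same physical page at two distinct virtual addresses. */
    char *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *b = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (a == MAP_FAILED || b == MAP_FAILED || a == b)
        return -1;

    strcpy(a, "written via alias A");
    /* Visible through the other alias because both map one physical page.
     * Under Svpbmt each mapping could additionally carry its own PBMT
     * value; a cacheable/non-cacheable mismatch between the two aliases
     * is exactly the case where this round trip is no longer guaranteed. */
    int ok = (strcmp(b, "written via alias A") == 0);

    munmap(a, 4096);
    munmap(b, 4096);
    close(fd);
    shm_unlink(name);
    return ok ? 0 : -1;
}
```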

Greg

Greg Favor

Oct 18, 2021, 5:29:42 PM
to Paul Donahue, RISC-V ISA Dev, step...@riscv.org, Daniel Lustig, Andrea Mondelli
On Mon, Oct 18, 2021 at 2:25 PM Greg Favor <gfa...@ventanamicro.com> wrote:
Yes.  Although pbmt's only directly affect certain attributes, if software allows "mismatched attributes" in how it uses pbmt's, then coherency between accesses using different cacheability attributes is "lost".  Accesses using the same cacheability attribute will have coherency wrt each other.  So in that sense coherency is maintained.  But if software creates mismatched attributes, then global coherency across all types of accesses is "lost".

P.S. I agree that the spec should be clarified regarding this possible cause of "loss" of coherency.
 

Dan Lustig

Oct 19, 2021, 12:53:32 PM
to Greg Favor, Paul Donahue, RISC-V ISA Dev, step...@riscv.org, Andrea Mondelli
On 10/18/2021 5:29 PM, Greg Favor wrote:
> On Mon, Oct 18, 2021 at 2:25 PM Greg Favor <gfa...@ventanamicro.com> wrote:
>
>> Yes. Although pbmt's only directly affect certain attributes, if software
>> allows "mismatched attributes" in how it uses pbmt's, then coherency
>> *between* accesses using different cacheability attributes is "lost".
>> Accesses using the same cacheability attribute will have coherency wrt each
>> other. So in that sense coherency is maintained. But if software creates
>> mismatched attributes, then global coherency across all types of accesses
>> is "lost".
>>
>
> P.S. I agree that the spec should be clarified regarding this possible
> cause of "loss" of coherency.

Thanks Paul (and Greg), I've noted this down as part of the public review
feedback.

Perhaps there's also a clarification to be made in the PMA section text?

https://github.com/riscv/riscv-isa-manual/blob/399c9a759eb4540a65c60e2cc236164821ff2346/src/machine.tex#L3178
> Where a platform supports configurable cacheability settings for a
> memory region, a platform-specific machine-mode routine will change
> the settings and flush caches if necessary, so the system is only
> incoherent during the transition between cacheability settings. This
> transitory state should not be visible to lower privilege levels.

Specifically, "the system is only incoherent during the transition
between cacheability settings" could be interpreted as saying the
opposite of Greg's answer above, so maybe we should also find a way
to tweak this text with that in mind.

Dan

Greg Favor

Oct 20, 2021, 3:16:07 PM
to Dan Lustig, Paul Donahue, RISC-V ISA Dev, step...@riscv.org, Andrea Mondelli
On Tue, Oct 19, 2021 at 9:53 AM Dan Lustig <dlu...@nvidia.com> wrote:
Perhaps there's also a clarification to be made in the PMA section text?

> Where a platform supports configurable cacheability settings for a
> memory region, a platform-specific machine-mode routine will change
> the settings and flush caches if necessary, so the system is only
> incoherent during the transition between cacheability settings.  This
> transitory state should not be visible to lower privilege levels.

Specifically, "the system is only incoherent during the transition
between cacheability settings" could be interpreted as saying the
opposite of Greg's answer above, so maybe we should also find a way
to tweak this text with that in mind.

The existing text of course was written way back when there was no possibility of "mismatched attributes" (nor even page-based attribute overrides).  Also note that mismatched attributes isn't just a transitory thing (or at least not in the sense of the existing text).  I think one can view the existing text as just talking about changing PMA registers by M-mode software (which literally is the case).

In contrast, this "loss of coherency due to mismatched attributes" issue is a very different animal that foremost has to do with S-mode software and what it does in its page tables (i.e. it is S-mode software that chooses to cause this issue).  Hence it seems like clarifying text should be part of the Svpbmt extension chapter of the Priv spec.

Greg

John Hauser

Oct 22, 2021, 4:10:27 PM
to RISC-V ISA Dev
I'm concerned about how the relabelling of physical I/O space as main
memory may undermine the correct functioning of device drivers.  The
Svpbmt chapter acknowledges this matter in a comment:

    A device driver written to rely on I/O strong ordering rules will
    not operate correctly if the address range is mapped as main memory
    by the page-based memory types.  As such, this configuration is
    discouraged.

    It will often still be useful to map physical I/O regions using
    PBMT=NC so that write combining and speculative accesses can be
    performed.  Such optimizations will likely improve performance when
    applied with adequate care.

However, I'm not convinced that's the best we can do.

I'd like to propose that, with Svpbmt, it's possible for a memory
region to be considered _both_ I/O and main memory for ordering
purposes (FENCE, .aq, and .rl).  If the I/O or main memory
characteristic of a page is nominally changed by a PTE's PBMT field,
then the page is considered both I/O and main memory for ordering.
When two-stage address translation is in effect, a page becomes both
I/O and main memory for ordering purposes if its I/O or main memory
characteristic is nominally changed at either translation stage,
VS-stage or G-stage.

Rather than rely on "adequate care" which I don't have confidence will
always materialize, I think that forcing the most conservative ordering
for such pages is safest.
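The rule can be stated mechanically. Here is a small executable model of the proposal as I read it; the class names, the treatment of PBMT=NC as main-memory class and PBMT=IO as I/O class, and the way the two stages compose are my own assumptions for illustration, not spec text:

```c
/* Toy model of the proposed rule: a page orders as BOTH I/O and main
 * memory (for FENCE, .aq, .rl) if a PBMT override nominally changes its
 * class at either translation stage. */
typedef enum { MEM_MAIN, MEM_IO } mem_class;
typedef enum { PBMT_PMA, PBMT_NC, PBMT_IO } pbmt_val;

/* Assumed mapping of PBMT encodings to nominal class. */
static mem_class apply_pbmt(mem_class underlying, pbmt_val p) {
    if (p == PBMT_NC) return MEM_MAIN;   /* noncacheable main memory */
    if (p == PBMT_IO) return MEM_IO;     /* I/O */
    return underlying;                   /* PBMT=PMA: no override */
}

/* For single-stage translation, pass PBMT_PMA for the G-stage. */
int orders_as_both(mem_class pma, pbmt_val g_stage, pbmt_val vs_stage) {
    mem_class after_g = apply_pbmt(pma, g_stage);
    return apply_pbmt(pma, g_stage) != pma ||        /* changed at G-stage */
           apply_pbmt(after_g, vs_stage) != after_g; /* changed at VS-stage */
}
```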

Does anyone see a flaw in my thinking?

    - John Hauser

John Ingalls

Oct 22, 2021, 4:15:57 PM
to John Hauser, RISC-V ISA Dev
John --
  1. Question: What about device drivers that were written expecting Strong Ordering and thus do not use FENCE, .aq, or .rl?
  2. Concern: If the answer to my question is "apply the requested Strong Ordering to normal loads and stores to main memory", then that can be slower than applying Strong Ordering to I/O memory.
-- John


Greg Favor

Oct 22, 2021, 5:02:40 PM
to John Ingalls, John Hauser, RISC-V ISA Dev
On Fri, Oct 22, 2021 at 1:15 PM John Ingalls <john.i...@sifive.com> wrote:
  1. Question: What about device drivers written expecting Strong Ordering, and thus do not use FENCE, .aq, or .rl?
If an OS runs a device driver that talks to a device expecting/assuming it is mapped as "I/O", but changes the mapping to "main memory", then that is a big mistake/bug by the OS.  In practice the OS is responding to requests (e.g. from device drivers) to map pages with a suitable memory type.  A driver, understanding the device it is written to talk to, will recognize that that device has, for example, certain pages of memory-mapped "I/O" registers and certain pages of RAM that want to be accessed as noncacheable "main memory".  So it will make the appropriate remapping requests to the OS as needed.

Greg

John Hauser

Oct 23, 2021, 9:21:38 PM
to RISC-V ISA Dev
I'm also concerned about what might be some other unintended
consequences of Svpbmt.  The Svpbmt chapter says:

    It is also possible for a U-mode or S-mode mapping through a PTE
    with Svpbmt enabled to observe different memory attributes for a
    given region of physical memory than a concurrent access to the
    same page performed by M-mode or when MODE=Bare. In such cases, the
    behaviors dictated by the attributes may be violated, and platform-
    specific mechanisms must be used to restore the expected behaviors.

And then in a comment:

    The forthcoming Zicbom extension and the FENCE instruction will
    collectively form a standard mechanism for restoring coherence in
    scenarios with mismatched page attributes.

But I don't think that comment is sufficient.

Consider this case where an I/O device is emulated by a hypervisor for
a guest virtual machine.  Assume the following:

  - The device being emulated reads from memory by DMA.

  - The hypervisor must emulate this device DMA by reading the memory
    from software.  (For other emulated devices, the hypervisor may
    delegate much of the work to some actual device that does the DMA
    itself, but this is not one of those times.)

  - Thinking that it's interacting with an actual device, the guest VM
    has configured its access to the DMA memory buffer to be noncached,
    by setting PBMT = NC or IO in its VS-stage page tables.

  - To perform an output operation through the (emulated) device, the
    guest VM first fills the (noncached) DMA buffer in main memory,
    then writes the starting address of this buffer to an (emulated)
    device register, which we'll call "OUTADDR".

In this scenario, when the hypervisor gets the trap for the VM's write
to OUTADDR, this may be the first evidence the hypervisor has of the
location of the VM's DMA buffer in main memory.  At this point, the
hypervisor is supposed to read the DMA buffer to emulate the device's
output.  However, without taking special action first, the hypervisor
can't safely access the DMA buffer unless its own page tables have PBMT
settings matching that of the VM at the time it filled the DMA buffer.
But since the hypervisor has no way to know through what guest virtual
addresses the VM wrote this memory, the hypervisor has no easy (or
reliable) way to know what the PBMT settings were for those writes.

This situation forces the hypervisor to depend on the CMO instructions
provided by Zicbom to first get the DMA buffer into a known safe state
in the caches before reading from it.  If we knew the VM wrote the
buffer noncached (PBMT = NC or IO), the obvious operation to use would
be CBO.INVAL.  However, given that it's just as possible the VM wrote
the buffer cached (PBMT = 0), CBO.INVAL risks discarding the VM's
output data.  That leaves CBO.FLUSH as the only option, which leads me
finally to the big question:

Is CBO.FLUSH safe for the hypervisor to use in this situation, in all
cases?  Or, if the VM actually wrote the DMA buffer noncached, would
CBO.FLUSH risk overwriting the VM's output data in memory with stale
cache contents?

My guess is, if the guest OS did all it should when it configured the
DMA buffer to have a noncached PBMT setting, then it should be safe for
the hypervisor to execute CBO.FLUSH for the DMA buffer.  But I can't be
sure, in large part because the guidance provided by the Svpbmt chapter
on this subject is inadequate.  There's no attempt in the spec to say
what CMOs the VM should execute when configuring the DMA buffer as
noncached.  So I don't know.
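For what it's worth, the defensive path being described is mechanical: before touching the guest's DMA buffer, the hypervisor would CBO.FLUSH every cache line overlapping it. A sketch follows; `cbo_flush_line()` is a stand-in stub for the real instruction, and the function name, the instrumentation counter, and the 64-byte line size are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u            /* assumed cache-block size */

static unsigned lines_flushed;    /* instrumentation for the stub */

/* Stand-in for the Zicbom CBO.FLUSH instruction; real code would use
 * inline assembly or a kernel-provided primitive here. */
static void cbo_flush_line(uintptr_t line) {
    (void)line;
    lines_flushed++;
}

/* Flush every cache line overlapping [addr, addr + size); returns the
 * number of lines touched. */
unsigned flush_dma_buffer(uintptr_t addr, size_t size) {
    lines_flushed = 0;
    if (size == 0) return 0;
    uintptr_t first = addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t last  = (addr + size - 1) & ~(uintptr_t)(CACHE_LINE - 1);
    for (uintptr_t line = first; line <= last; line += CACHE_LINE)
        cbo_flush_line(line);
    return lines_flushed;
}
```

A real implementation would discover the cache-block size from the platform rather than hard-coding it, but the shape of the loop, and the cost of walking it defensively on every emulated DMA, is the same.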

Even if this all gets smoothed out, I note that, after reading from the
DMA buffer, the hypervisor can't safely return back to the VM without
first executing _another_ set of CBO.FLUSH instructions for the same
memory.  And we have all these cache flushes being done entirely
defensively, without knowing whether they're really needed.  It
turns out, just having Svpbmt implemented by the machine can incur a
performance penalty for this scenario, even if Svpbmt is never actually
used.  It's enough that it _might_ be used by a guest VM, and the
hypervisor has no way to know whether it is.

Thoughts?

    - John Hauser

Dan Lustig

Oct 26, 2021, 3:54:19 PM
to John Hauser, RISC-V ISA Dev, David Kruckemyer, Greg Favor
How do hypervisors on other architectures avoid the issue?

My first (probably naive) question is to ask why the emulated device
has to figure it out at all instead of just doing what the original
device would have done, i.e., just use non-cacheable accesses? In
other words, why can't the hypervisor just expect the guest to also
use non-cacheable accesses to write the DMA buffer? If the guest
uses PBMT incorrectly, is that the hypervisor's fault?

cc'ing David K to address the CMO questions.

Thanks,
Dan

> Thoughts?
>
> - John Hauser
>

John Hauser

Oct 26, 2021, 5:19:56 PM
to RISC-V ISA Dev
Daniel Lustig wrote:
> My first (probably naive) question is to ask why the emulated device
> has to figure it out at all instead of just doing what the original
> device would have done, i.e., just use non-cacheable accesses? In
> other words, why can't the hypervisor just expect the guest to also
> use non-cacheable accesses to write the DMA buffer?

In an attempt to begin answering your question, let me start with a
related question:

If a machine implements Svpbmt, is it also still allowed to have
regions of main memory that are fully coherent with all agents that
can access the memory, including devices that do DMA?  I don't see
anything in the Svpbmt chapter that says "no", so I'm assuming "yes".
In fact, I'm assuming a machine may have _all_ memory be fully coherent
(the "Berkeley RISC-V tradition", let's say), yet also support Svpbmt.
Is that not true?

    - John Hauser

Dan Lustig

Oct 29, 2021, 11:41:14 AM
to John Hauser, RISC-V ISA Dev, Greg Favor
Yes, it's legal under Svpbmt to always keep everything fully coherent,
as long as the cacheability attribute is still properly respected.

Does that somehow help answer your original question about whether
Svpbmt implies a need for Zicbom, or the performance penalty of that?
I'm still not clear about the premise that it's needed in the first
place, i.e., why it's the hypervisor's job to fix seemingly incorrect
guest configuration of PBMTs in the first place.

Dan

>
> - John Hauser
>

John Hauser

Oct 29, 2021, 2:52:58 PM
to RISC-V ISA Dev
Dan Lustig wrote:
> Yes, it's legal under Svpbmt to always keep everything fully coherent,
> as long as the cacheability attribute is still properly respected.

Okay, next question:  If all of main memory is fully coherent with all
agents, including I/O devices, and if an OS has assigned a single hart
to control/manage a device that does DMA to read from main memory, is
it valid for the hart to configure its access to the DMA buffer for
this device with PBMT = 0?  Alternatively, is it equally valid for the
hart to configure its access to the DMA buffer for this device with
PBMT = NC or IO?

My understanding of the spec (which maybe is wrong) says the OS is
free to have the hart configure either PBMT = 0 or PBMT = NC/IO for the
hart's access to this DMA buffer.

Now assume this same OS is running as a guest in a virtual machine on
the same machine, and the hypervisor is responsible for emulating this
device for the VM.  When a trap is taken to the hypervisor to initiate
the (emulated) DMA, the guest OS may, as I understand it (see above),
have validly configured its access to the DMA buffer with any PBMT
value, 0, NC, or IO, and the hypervisor has no way to learn which.  Is
that not correct?

> I'm still not clear about the premise that it's needed in the first
> place, i.e., why it's the hypervisor's job to fix seemingly incorrect
> guest configuration of PBMTs in the first place.

Which guest configuration of the PBMTs is incorrect under these
circumstances?:  PBMT = 0, NC, or IO?

    - John Hauser

Ray Van De Walker

Nov 1, 2021, 8:15:13 PM
to John Hauser, RISC-V ISA Dev

It’s a good idea to be able to force strong ordering on memory mapped I/O.

