[RFC] separate page table roots for U-mode vs S-mode

Christoph Hellwig

Nov 29, 2017, 4:54:28 PM

to RISC-V ISA Dev

=== Proposal to add separate virtual memory translations for U-mode.

Currently RISC-V shares the page tables between S-mode and U-mode,
and while it has a lot of protections for separating U-mode and S-mode,
an even better approach would be to separate the virtual memory
translations for both modes entirely. This follows a model used by
sparc64 and s390 in Linux, which is emulated on x86 using the KAISER
approach described in the "KASLR is Dead: Long Live KASLR" ESSoS 2017
paper by Gruss, Lipp, Schwarz, Fellner, Maurice and Mangard [1] and is
now being prepared for mainline Linux.

To avoid the KASLR issues and generally provide more secure isolation,
this proposal attempts to add support for separate S-mode vs U-mode
page tables to the RISC-V privileged instruction set.

== Add a new "User Address Translation and Protection (uatp) Register"

The register is defined to have the same format as satp, and is also
only accessible from S-mode. Its behavior is the same as that of satp,
except that uatp defines the page table root for U-mode loads, stores,
and instruction fetches, limiting the satp-defined page table root to
S-mode loads, stores, and instruction fetches.
If the MODE field is set to 0 (MODE=Bare in satp), a separate translation
for U-mode is not used.
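
A minimal sketch of the selection rule just described, written as a Python model rather than as hardware or kernel code. The field layout is the RV64 satp format (MODE in bits 63:60, ASID in bits 59:44, PPN in bits 43:0); the helper names `make_atp`, `atp_mode` and `translation_root` are hypothetical and exist only for illustration.

```python
# Illustrative model only. Field positions follow the RV64 satp layout;
# uatp is the CSR proposed in this thread and is assumed to share it.
MODE_BARE = 0
MODE_SV39 = 8


def make_atp(mode, asid, ppn):
    """Pack MODE, ASID and PPN into an satp-format value (RV64)."""
    return (mode << 60) | ((asid & 0xFFFF) << 44) | (ppn & ((1 << 44) - 1))


def atp_mode(value):
    """Extract the MODE field from an satp-format value."""
    return (value >> 60) & 0xF


def translation_root(priv, satp, uatp):
    """Return the register value that governs translation at `priv`.

    `priv` is "S" or "U". When uatp.MODE is Bare, no separate U-mode
    translation is in use and U-mode falls back to satp.
    """
    if priv == "U" and atp_mode(uatp) != MODE_BARE:
        return uatp
    return satp
```

With uatp left at zero the model behaves like hardware without the extension, which is the property the MODE=0 rule is meant to give.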

== New Load/Store instructions

To allow S-mode software to access U-mode virtual memory, new load and
store instructions are needed that explicitly access the U-mode virtual
memory. Such load and store instructions are only needed for integer
accesses, not for floating-point accesses.

The new U-mode Load/Store instructions shall be supported inside
LL/SC regions to support shared synchronization between S-mode and
U-mode (e.g. futexes in Linux). Supporting other atomic operations
might be worth considering.

[1] https://gruss.cc/files/kaiser.pdf

Feel free to grab me at the Workshop and Members meeting if you want
to discuss the proposal in person.

p.s. sorry if anyone got multiple copies, but it seems like google
groups rejected mail from my other two addresses

Jacob Bachmeyer

Nov 29, 2017, 7:44:03 PM

to Christoph Hellwig, RISC-V ISA Dev

I proposed something similar about a year ago (search the list archives
for "Sdas").

I have a few comments/changes:

-- The new uatp CSR is a supervisor-controlled register, so its proper
name would be "suatp".

-- Accordingly, the satp CSR should be renamed "ssatp".

-- No new instructions are needed if we instead make the MPRV bit
available to the supervisor. This would require making MPRV sensitive
to the current privilege level -- in M-mode, MPRV uses MPP, in HS-mode,
MPRV is ignored (or hstatus.SPRV is an alias of mstatus.MPRV and MPRV
has those semantics), and in S-mode, MPRV uses SPP.

-- No new instructions are needed if remapping pages is sufficiently
efficient (need an intermediate TLB that stores intermediate steps
during page table walks, so page aliasing does not need to traverse the
entire tree) that the supervisor can simply map user pages into its own
address space as needed.

-- For backwards compatibility, ssatp replaces satp, suatp is a new CSR,
and a write to ssatp also writes the same value to suatp. (Supervisors
are expected to change user page roots for every task switch, but to
retain their own address space across all tasks.)

-- Implementations should be allowed to have a limited set of ASIDs
available for ssatp if suatp.ASID != ssatp.ASID. I would expect many
implementations to only support a single supervisor ASID, but many user
ASIDs.

Overall, I like this idea because it improves isolation between the
supervisor and user tasks, eliminating the need for any kind of "magic
dance" during task switch. It also eliminates the ability for the
supervisor program counter to point to user memory, but the prohibition
on S-mode instruction fetch from user pages is still relevant, since the
supervisor can map user pages as data within its own address space. (I
expect this sort of "map user memory for supervisor access" to involve
aliasing of higher-level page tables rather than individual pages; as
such, the U bit in the PTE remains relevant.)

-- Jacob

Christoph Hellwig

Nov 29, 2017, 7:58:21 PM

to Jacob Bachmeyer, Christoph Hellwig, RISC-V ISA Dev

On Wed, Nov 29, 2017 at 06:43:59PM -0600, Jacob Bachmeyer wrote:
> I proposed something similar about a year ago (search the list archives for
> "Sdas").

Ok, I'll try to find it.

>
> I have a few comments/changes:
>
> -- The new uatp CSR is a supervisor-controlled register, so its proper name
> would be "suatp".
>
> -- Accordingly, the satp CSR should be renamed "ssatp".

Makes sense.

> -- No new instructions are needed if we instead make the MPRV bit available
> to the supervisor. This would require making MPRV sensitive to the current
> privilege level -- in M-mode, MPRV uses MPP, in HS-mode, MPRV is ignored
> (or hstatus.SPRV is an alias of mstatus.MPRV and MPRV has those semantics),
> and in S-mode, MPRV uses SPP.

I don't think that is enough. A typical routine that needs to access
the user address space copies kernel memory to user memory or vice
versa. So it will have to do loads from supervisor virtual memory and
stores to user virtual memory, or the other way around.

> -- No new instructions are needed if remapping pages is sufficiently
> efficient (need an intermediate TLB that stores intermediate steps during
> page table walks, so page aliasing does not need to traverse the entire
> tree) that the supervisor can simply map user pages into its own address
> space as needed.

In general it is not fast enough - mapping memory has a high overhead
especially on SMP systems.

> -- For backwards compatibility, ssatp replaces satp, suatp is a new CSR,
> and a write to ssatp also writes the same value to suatp.

I don't understand how this helps backwards compatibility. For a CPU
or an OS not supporting the extension you would simply never touch
suatp. An OS knowing about the extension would probe for its existence
and, if suatp is supported, maintain it.

> (Supervisors are
> expected to change user page roots for every task switch, but to retain
> their own address space across all tasks.)

Yes.

>
> -- Implementations should be allowed to have a limited set of ASIDs
> available for ssatp if suatp.ASID != ssatp.ASID. I would expect many
> implementations to only support a single supervisor ASID, but many user
> ASIDs.

Agreed.

Jacob Bachmeyer

Nov 29, 2017, 11:29:09 PM

to Christoph Hellwig, RISC-V ISA Dev

Christoph Hellwig wrote:
> On Wed, Nov 29, 2017 at 06:43:59PM -0600, Jacob Bachmeyer wrote:
>
>> -- No new instructions are needed if we instead make the MPRV bit available
>> to the supervisor. This would require making MPRV sensitive to the current
>> privilege level -- in M-mode, MPRV uses MPP, in HS-mode, MPRV is ignored
>> (or hstatus.SPRV is an alias of mstatus.MPRV and MPRV has those semantics),
>> and in S-mode, MPRV uses SPP.
>>
>
> I don't think that is enough. A typical routine that needs to access
> user space address memory copies kernel memory to user memory or vice
> versa. So it will have to do loads from supervisor virtual memory and
> stores to user virtual memory or the other way around.
>

This would require a large amount of bit toggling, or splitting MPRV
into separate flags for load and store.
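
To make the "bit toggling" cost concrete, here is a toy Python model of a hart with a single MPRV-like redirect bit that affects both loads and stores. Everything in it (the class `ToggleHart`, the dictionary-backed memories, the word-at-a-time loop) is a hypothetical illustration, not a description of any real implementation.

```python
class ToggleHart:
    """Toy hart: one bit redirects ALL data accesses to user space."""

    def __init__(self):
        self.smem = {}          # supervisor virtual memory (addr -> word)
        self.umem = {}          # user virtual memory (addr -> word)
        self.redirect = False   # MPRV-like bit, hypothetical S-mode use
        self.csr_writes = 0     # how many times the bit was written

    def set_redirect(self, value):
        self.redirect = value
        self.csr_writes += 1

    def load(self, addr):
        return (self.umem if self.redirect else self.smem)[addr]

    def store(self, addr, word):
        (self.umem if self.redirect else self.smem)[addr] = word


def copy_to_user(hart, dst, src, n_words):
    """Copy supervisor words at src.. to user words at dst.."""
    for i in range(n_words):
        word = hart.load(src + i)    # bit clear: supervisor space
        hart.set_redirect(True)
        hart.store(dst + i, word)    # bit set: user space
        hart.set_redirect(False)
```

In this model the bit is written twice per word copied. Batching several loads into registers before flipping the bit would lower the constant, but the number of CSR writes still grows with the length of the copy.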

>> -- No new instructions are needed if remapping pages is sufficiently
>> efficient (need an intermediate TLB that stores intermediate steps during
>> page table walks, so page aliasing does not need to traverse the entire
>> tree) that the supervisor can simply map user pages into its own address
>> space as needed.
>>
>
> In general it is not fast enough - mapping memory has a high overhead
> especially on SMP systems.
>

Can hardware be designed to work around this? I was suggesting, for
example, an extra TLB that stores page table descriptors instead of
individual page mappings, providing a shortcut if a recently-accessed
page mapping is aliased. This should reduce overhead significantly. In
this description, I will call it the "PTLB", for "paging TLB".

Suppose user code calls write(2) -- the supervisor must read from a user
buffer. User code has already accessed the page recently, so the
current hart already has a mapping for the actual page in its DTLB and
has the relevant PTEs in its PTLB. The supervisor aliases the relevant
page or page table from user space into its own address space, either at
a hart-specific virtual address or using hart-specific supervisor page
tables. Either way, the alias does not require synchronization with
other harts and is meaningful only on the local hart. (Only the
top-level page table needs to be hart-specific and the supervisor can
then easily use a top-level PTE for 1/1024 of the user address space at
will. The rest of the supervisor page tables can be shared between all
harts.) The supervisor executes SFENCE.VMA on the mapping and the
hardware flushes any current supervisor DTLB entries for the aliased
page. The previous proposal for ranged fences would also be useful
here. In RISC-V, SFENCE.VMA *only* affects the hart that executes it --
other harts are unaffected, so, again, no SMP burden. When the
supervisor accesses the aliased page, the PTLB starts to return hits as
soon as the aliased user page tables are encountered, so aliasing larger
portions of user space is more efficient. (Presumably, the PTLB returns
"hit" in a dozen cycles or so, while memory accesses require a few
hundred cycles each. The path through the user page tables is in the
PTLB because the user code has accessed the intended page before making
the system call.) The supervisor copies data from the user buffer,
removes the alias mapping, executes SFENCE.VMA (or skips it -- flushing
the no-longer-valid alias only matters if the supervisor later
incorrectly accesses it), clears SUM, and continues.
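
A rough cost model for the walk described above, as a Python sketch. The cycle counts (12 for a PTLB hit, 300 for a memory access) are assumptions standing in for "a dozen cycles or so" and "a few hundred cycles"; the function name `walk_cycles` is hypothetical.

```python
def walk_cycles(levels, ptlb_hits, mem_cycles=300, ptlb_cycles=12):
    """Estimate the cost of one hardware page table walk.

    levels     -- number of page table levels (3 for Sv39, 2 for Sv32)
    ptlb_hits  -- how many of those levels are served by the PTLB
    The remaining levels each cost one memory access.
    """
    if not 0 <= ptlb_hits <= levels:
        raise ValueError("ptlb_hits must be between 0 and levels")
    return (levels - ptlb_hits) * mem_cycles + ptlb_hits * ptlb_cycles
```

Under these assumed numbers, a three-level walk that reads the freshly written top-level alias PTE from memory and then hits in the PTLB for the two aliased user levels costs `walk_cycles(3, 2)`, compared with `walk_cycles(3, 0)` for a walk with no PTLB help.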

Now suppose user code calls read(2) -- the supervisor must write to a
user buffer. Further, presume that the buffer has been freshly
allocated and does not actually exist yet. The supervisor sets up the
alias mapping similarly, then takes a page fault upon writing to the
buffer. The supervisor page fault handler detects that the fault
occurred while accessing user space, examines the supervisor paging
structures for the current user task, determines that this should be a
writable page, allocates and maps an available page, and returns. The
hardware performs the normal PTE walk and places the path into the PTLB
and the final mapping into the supervisor DTLB. The supervisor copies
data to user memory. User execution continues. User code accesses the
new buffer and the hardware begins the PTE walk. The PTLB returns hits
as soon as the formerly-aliased user page tables are encountered and the
mapping is quickly loaded into the user DTLB. User code can then use
the buffer.

Zero-copy I/O is a bit different, since the supervisor must look up the
physical address backing a user page and hand that off to the I/O
subsystem. An "SPROBE.PTE" instruction that can use the DTLB would be
advantageous here, avoiding the need for the supervisor to walk the page
tables in software.

Note that these short-lived aliases are only visible on the local hart,
and therefore have *no* SMP overhead.

>> -- For backwards compatibility, ssatp replaces satp, suatp is a new CSR,
>> and a write to ssatp also writes the same value to suatp.
>>
>
> I don't understand how this helps backwards compatibility. For a CPU
> or an OS not supporting the extension you would simply never touch
> suatp. An OS knowing about the extension would probe the existence and
> if suatp is supported maintain it.
>

The hardware must be able to run a supervisor that is ignorant of suatp,
and for virtualization, must be able to switch between supervisors that
do and do not use suatp. Existing supervisors will write only to satp,
so writes "to satp" must cause hardware that has both ssatp and suatp to
operate correctly. The ssatp CSR would be at the same address as the
existing satp CSR. Writes to ssatp would also atomically write the same
value to suatp, allowing existing supervisors to change both by writing
to the only CSR that they know about. Hardware then always uses suatp
for user PTE walks and ssatp for supervisor PTE walks, and works
correctly in both cases.
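
The write-aliasing rule can be sketched as a small Python model. The CSR names are the ones used in this thread; the class `AtpCsrs` and its methods are hypothetical.

```python
class AtpCsrs:
    """Toy CSR file holding the proposed ssatp/suatp pair."""

    def __init__(self):
        self.ssatp = 0
        self.suatp = 0

    def write_ssatp(self, value):
        # ssatp sits at the existing satp address, so a legacy
        # supervisor's "write to satp" lands here and updates both roots.
        self.ssatp = value
        self.suatp = value

    def write_suatp(self, value):
        # Only a supervisor that knows about the extension does this.
        self.suatp = value

    def root_for(self, priv):
        """Register used for page table walks at privilege "S" or "U"."""
        return self.suatp if priv == "U" else self.ssatp
```

A supervisor that only ever calls `write_ssatp` sees the same root in both modes, while one that follows it with `write_suatp` gets separate roots, and the walker logic in `root_for` is the same in both cases.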

I would add one more requirement: User and supervisor code must use
different ASIDs and suatp.ASID != ssatp.ASID must hold. If ssatp.ASID
== suatp.ASID, then ssatp == suatp, or an instruction access fault is
immediately raised upon entering U-mode and any other U-mode accesses
will also fault. The MODE field in suatp is reserved and must be set to
"Bare" unless ssatp == suatp (until, of course, further extension
permits user and supervisor modes to use different paging depths).

-- Jacob

Albert Cahalan

Nov 30, 2017, 2:59:46 AM

to Christoph Hellwig, RISC-V ISA Dev

On 11/29/17, Christoph Hellwig <h...@lst.de> wrote:

> === Proposal to add separate virtual memory translations for U-mode.

This is nice. You get an extra bit of address space.

One could also separate code/data translations.
Hints of compatibility with this can be seen in the
ptrace interface.

The now-free permission bits could be repurposed to
enforce something like Linux's might_sleep check.
They could also be used for sandboxing.

> == New Load/Store instructions

Unusually large load/store can be useful for memcpy.
This is commonly done with the FPU, sometimes by
doing load/store of the entire FPU register state at once
and sometimes by using just a single register. Vector
registers can also see use for this.

Christoph Hellwig

Nov 30, 2017, 9:40:10 AM

to Jacob Bachmeyer, Christoph Hellwig, RISC-V ISA Dev

On Wed, Nov 29, 2017 at 10:29:03PM -0600, Jacob Bachmeyer wrote:
> The supervisor aliases the relevant page or
> page table from user space into its own address space, either at a
> hart-specific virtual address or using hart-specific supervisor page
> tables.
>
> Either way, the alias does not require synchronization with other
> harts and is meaningful only on the local hart. (Only the top-level page
> table needs to be hart-specific and the supervisor can then easily use a
> top-level PTE for 1/1024 of the user address space at will. The rest of
> the supervisor page tables can be shared between all harts.) The

It will require some form of synchronization. Walking the user pagetables
can and often does cause page faults, which will lead to a context switch,
which can move the execution of the software thread to a different heart.

Even without that, many supervisors (e.g. RTOSes or Linux PREEMPT_RT)
are fully preemptible, so a context switch and move to a different heart
could happen any time.

> The hardware must be able to run a supervisor that is ignorant of suatp,
> and for virtualization, must be able to switch between supervisors that do
> and do not use suatp. Existing supervisors will write only to satp, so
> writes "to satp" must cause hardware that has both ssatp and suatp to
> operate correctly. The ssatp CSR would be at the same address as the
> existing satp CSR. Writes to ssatp would also atomically write the same
> value to suatp, allowing existing supervisors to change both by writing to
> the only CSR that they know about. Hardware then always uses suatp for
> user PTE walks and ssatp for supervisor PTE walks, and works correctly in
> both cases.

Ok.

> I would add one more requirement: User and supervisor code must use
> different ASIDs and suatp.ASID != ssatp.ASID must hold. If ssatp.ASID ==
> suatp.ASID, then ssatp == suatp, or an instruction access fault is
> immediately raised upon entering U-mode and any other U-mode accesses will
> also fault. The MODE field in suatp is reserved and must be set to "Bare"
> unless ssatp == suatp (until, of course, further extension permits user and
> supervisor modes to use different paging depths).

Yes, that is a good invariant.

Another issue to think about: can suatp and ssatp use different virtual
addressing modes? It might be useful to always use the mode with the
lowest overhead for the supervisor, but it might be really annoying to
implement in hardware.

Christoph Hellwig

Nov 30, 2017, 9:41:35 AM

to Albert Cahalan, Christoph Hellwig, RISC-V ISA Dev

On Thu, Nov 30, 2017 at 02:59:44AM -0500, Albert Cahalan wrote:
> Unusually large load/store can be useful for memcpy.
> This is commonly done with the FPU, sometimes by
> doing load/store of the entire FPU register state at once
> and sometimes by using just a single register. Vector
> registers can also see use for this.

I don't really know of anyone using the FPU for kernel to user copies,
but I do know of a few examples using large SIMD registers.

Jacob Bachmeyer

Nov 30, 2017, 11:35:02 PM

to Christoph Hellwig, RISC-V ISA Dev

Christoph Hellwig wrote:
> On Wed, Nov 29, 2017 at 10:29:03PM -0600, Jacob Bachmeyer wrote:
>
>> The supervisor aliases the relevant page or
>> page table from user space into its own address space, either at a
>> hart-specific virtual address or using hart-specific supervisor page
>> tables.
>>
>> Either way, the alias does not require synchronization with other
>> harts and is meaningful only on the local hart. (Only the top-level page
>> table needs to be hart-specific and the supervisor can then easily use a
>> top-level PTE for 1/1024 of the user address space at will. The rest of
>> the supervisor page tables can be shared between all harts.) The
>>
>
> It will require some form of synchronization. Walking the user pagetables
> can and often does cause page faults, which will lead to a context switch,
> which can move the execution of the software thread to a different heart.
>
> Even without that, many supervisors (e.g. RTOSes or Linux PREEMPT_RT)
> are fully preemptible, so a context switch and move to a different heart
> could happen any time.
>

Obviously, supervisors would need to ensure that tasks using
hart-specific aliases are not migrated to other harts while the alias is
active. What performance cost would come from pinning a task currently
in copy_{from,to}_user() to its current hart? This could be as simple
as never migrating a task that has SUM set in its saved sstatus.

Alternately, when migrating a task, also copy its hart-specific
top-level alias PTE to the destination hart's root page table and
perform SFENCE.VMA on the destination hart if the alias is currently
valid. (I envision the "userspace alias" region being at a fixed
virtual address for all harts. The virtual address is constant, but its
mapping varies between harts.) This could be as bad as a full TLB flush
(due to migration to a hart with different TLBs), but that at worst only
costs an extra (hardware) walk through the page tables, which costs
LEVELS times memory read latency. Since a PTE is a single word in the
current Sv* models, this alias PTE could easily be also stored in the
task's saved context area. On RV64, this sequence handles it:

    LD   t0, TASK_ALIAS_PTE(task_context)
    SD   t0, ALIAS_PTE_SLOT(root_page_table)
    ANDI t0, t0, 1
    BEQ  zero, t0, alias_not_active
    SFENCE.VMA
alias_not_active:
    ...

On RV32, the LD and SD become LW and SW.
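
The same sequence expressed as a Python model, for readers who want to trace it. The root page table is a plain list and `install_alias_on_migration` is a hypothetical name; the only architectural fact relied on is that bit 0 of a PTE is the valid (V) bit in the Sv* formats.

```python
PTE_V = 1  # valid bit, bit 0 of a PTE


def install_alias_on_migration(task_alias_pte, root_page_table, alias_slot):
    """Mirror of the LD / SD / ANDI / BEQ / SFENCE.VMA sequence.

    Copies the task's saved alias PTE into the destination hart's root
    page table and returns True when an SFENCE.VMA would be executed,
    that is, when the alias is currently valid.
    """
    root_page_table[alias_slot] = task_alias_pte   # LD + SD
    return (task_alias_pte & PTE_V) != 0           # ANDI + BEQ
```

The store is unconditional, so an inactive alias also overwrites whatever the previous task on that hart left in the slot; only the fence is skipped.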

[Minor nit for the benefit of someone finding this in the archives in
the future: "hart" is HARdware Thread in RISC-V, pronounced "heart" but
properly written "hart". Autocorrect is likely to cause errors with the
term.]

>> I would add one more requirement: User and supervisor code must use
>> different ASIDs and suatp.ASID != ssatp.ASID must hold. If ssatp.ASID ==
>> suatp.ASID, then ssatp == suatp, or an instruction access fault is
>> immediately raised upon entering U-mode and any other U-mode accesses will
>> also fault. The MODE field in suatp is reserved and must be set to "Bare"
>> unless ssatp == suatp (until, of course, further extension permits user and
>> supervisor modes to use different paging depths).
>>
>
> Yes, that is a good invariant.
>
> Another issue to think about: can suatp and ssatp use different virtual
> addressing modes? It might be useful to always use the mode with the
> lowest overhead for the supervisor, but it might be really annoying to
> implement in hardware.

Initially, no, but I would like to see that reserved for further
extension in the future. So for now, we have these possibilities:

I. Hardware that supports only unified address space mode: has the
current satp CSR, which affects both user and supervisor modes.
Attempts to access suatp raise illegal instruction.

II. Hardware that supports disjoint address space mode: has both ssatp
and suatp CSRs, with ssatp at the same CSR address as satp. Writes to
ssatp atomically write the same value to suatp and select unified
address space mode, which is trivial to support if disjoint address
space mode is implemented. The combinations of values in ssatp and
suatp produce these possibilities:

IIa. ssatp == suatp: system is running in unified address space mode;
none of the following constraints apply

IIb. ssatp != suatp: system is running in disjoint address space mode;
cases:

IIb1. ssatp.ASID == suatp.ASID or ssatp.PPN == suatp.PPN: invalid, all
U-mode accesses fault.

IIb2. suatp.ASID != ssatp.ASID and suatp.PPN != ssatp.PPN and
suatp.MODE == Bare: system validly configured in disjoint address space
mode; U-mode accesses use the same paging depth and translation model as
S-mode accesses, but a different ASID and a different root PPN; paging
depth and translation model is determined by ssatp.MODE.

IIb3. suatp.ASID != ssatp.ASID and suatp.PPN != ssatp.PPN and
suatp.MODE != Bare: reserved for using different paging depths or
translation schemes for U-mode and S-mode in future extension, currently
invalid, all U-mode accesses fault; MODE values in suatp may have
different meanings depending on ssatp.MODE.
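
The case analysis above can be written down as a small decision function. This is a sketch in Python; the `Atp` tuple and the function `classify` are hypothetical names, and the returned strings are simply the case labels used in the list.

```python
from collections import namedtuple

# Decoded view of an satp-format register: MODE, ASID and root PPN.
Atp = namedtuple("Atp", "mode asid ppn")

MODE_BARE = 0


def classify(ssatp, suatp):
    """Return the case label (IIa, IIb1, IIb2, IIb3) for a register pair."""
    if ssatp == suatp:
        return "IIa"    # unified address space mode
    if ssatp.asid == suatp.asid or ssatp.ppn == suatp.ppn:
        return "IIb1"   # invalid: all U-mode accesses fault
    if suatp.mode == MODE_BARE:
        return "IIb2"   # valid disjoint mode, depth taken from ssatp.MODE
    return "IIb3"       # reserved: all U-mode accesses fault for now
```

Only "IIa" and "IIb2" describe configurations in which U-mode code can run.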

In all cases under case IIb, there is a small requirement on hypervisors
to restore ssatp before restoring suatp when switching between
virtualized supervisors. This should be explicitly documented.
Attempting to run a supervisor that uses disjoint address space mode
under a hypervisor that is not aware of disjoint address space mode will
crash the guest.
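
Why the restore order matters can be seen in a few lines of Python. The sketch assumes the write-aliasing rule proposed earlier in the thread (a write to ssatp also writes suatp); the dictionary-based CSR file and the function names are hypothetical.

```python
def write_ssatp(csrs, value):
    """A write to ssatp also overwrites suatp (proposed aliasing rule)."""
    csrs["ssatp"] = value
    csrs["suatp"] = value


def write_suatp(csrs, value):
    csrs["suatp"] = value


def restore_guest(csrs, saved, ssatp_first=True):
    """Restore a guest supervisor's translation CSRs from `saved`."""
    if ssatp_first:
        write_ssatp(csrs, saved["ssatp"])
        write_suatp(csrs, saved["suatp"])
    else:
        # Wrong order: the ssatp write clobbers the suatp just restored.
        write_suatp(csrs, saved["suatp"])
        write_ssatp(csrs, saved["ssatp"])
```

With `ssatp_first=False` the guest comes back in unified mode with its user root lost, which is the failure the documentation requirement is meant to prevent.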

-- Jacob

Christoph Hellwig

Dec 1, 2017, 12:02:04 PM

to Jacob Bachmeyer, Christoph Hellwig, RISC-V ISA Dev

On Thu, Nov 30, 2017 at 10:34:55PM -0600, Jacob Bachmeyer wrote:
> Obviously, supervisors would need to ensure that tasks using hart-specific
> aliases are not migrated to other harts while the alias is active. What
> performance cost would come from pinning a task currently in
> copy_{from,to}_user() to its current hart? This could be as simple as
> never migrating a task that has SUM set in its saved sstatus.

Having non-preemptible regions in Linux is fairly easy, but we really try
to avoid them as much as possible. The problem is that they create
worst-case latencies, and page fault handling can take extremely long -
the page fault might have to read in data from a disk or a remote network
server.

That's why having a non-preemptible region over page faults simply isn't
acceptable.

>
> Alternately, when migrating a task, also copy its hart-specific top-level
> alias PTE to the destination hart's root page table and perform SFENCE.VMA
> on the destination hart if the alias is currently valid.

You could do this, but this would defeat the other nice thing splitting
the user and supervisor page tables would allow you to do: nice ways
for per-cpu (aka per-hart) data in the kernel.

Btw, you really seem to want to avoid new load and store variants at all
costs, is there a good reason for that which I'm missing or is it just
an academic discussion?

Andrew Waterman

Dec 1, 2017, 2:49:41 PM

to Christoph Hellwig, Jacob Bachmeyer, RISC-V ISA Dev

Two non-academic reasons:
- Loads & stores consume substantial opcode space, so additional loads
& stores must have very limited addressing modes (e.g.
register-indirect only)
- Hardware is already in existence that lacks these loads & stores,
and emulating them would be so slow that you'd need two variants of
the kernel

Christoph Hellwig

Dec 1, 2017, 2:53:11 PM

to Andrew Waterman, Christoph Hellwig, Jacob Bachmeyer, RISC-V ISA Dev

On Fri, Dec 01, 2017 at 11:49:19AM -0800, Andrew Waterman wrote:
> Two non-academic reasons:
> - Loads & stores consume substantial opcode space, so additional loads
> & stores must have very limited addressing modes (e.g.
> register-indirect only)

That generally is a good reason. But an easy enough workaround would be
a CSR that switches to using the user page tables for either loads only
or stores only.
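
A toy Python model of such a CSR, with separate "user loads" and "user stores" flags. The class `SplitHart`, the flag names and the dictionary-backed memories are all hypothetical; the point is only to show how many CSR writes a copy needs.

```python
class SplitHart:
    """Toy hart with independent redirect flags for loads and stores."""

    def __init__(self):
        self.smem = {}            # supervisor virtual memory
        self.umem = {}            # user virtual memory
        self.user_loads = False   # loads go through the user page tables
        self.user_stores = False  # stores go through the user page tables
        self.csr_writes = 0

    def write_redirect(self, loads, stores):
        self.user_loads = loads
        self.user_stores = stores
        self.csr_writes += 1

    def load(self, addr):
        return (self.umem if self.user_loads else self.smem)[addr]

    def store(self, addr, word):
        (self.umem if self.user_stores else self.smem)[addr] = word


def copy_to_user(hart, dst, src, n_words):
    """Copy supervisor words to user space with one CSR write each side."""
    hart.write_redirect(loads=False, stores=True)
    for i in range(n_words):
        hart.store(dst + i, hart.load(src + i))
    hart.write_redirect(loads=False, stores=False)
```

Here the CSR is written twice per copy regardless of its length, and ordinary load and store instructions are used inside the loop.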

> - Hardware is already in existence that lacks these loads & stores,
> and emulating them would be so slow that you'd need two variants of
> the kernel

Not that good a reason - OSes do boot-time patching to select variants
of timing-critical routines anyway, so we'd patch in the right version
either way.

Jacob Bachmeyer

Dec 1, 2017, 10:15:32 PM

to Christoph Hellwig, RISC-V ISA Dev

Christoph Hellwig wrote:
> On Thu, Nov 30, 2017 at 10:34:55PM -0600, Jacob Bachmeyer wrote:
>
>> Obviously, supervisors would need to ensure that tasks using hart-specific
>> aliases are not migrated to other harts while the alias is active. What
>> performance cost would come from pinning a task currently in
>> copy_{from,to}_user() to its current hart? This could be as simple as
>> never migrating a task that has SUM set in its saved sstatus.
>>
>
> Having non-preemptible regions in Linux is fairly easy, but we really try
> to avoid them as much as possible. The problem is that they create worst-
> case latencies, and page fault handling can take extremely long - the page
> fault might have to read in data from a disk or a remote network server.
>
> That's why having a non-preemptible region over pagefaults simply isn't
> acceptable.
>

Fair enough, although I was not proposing to prevent preemption during
page faults, only to prevent migrating a task that is currently
preempted while accessing user memory. The supervisor can do something
else while waiting for I/O to complete while handling a page fault, but
must resume the faulting task on the same hart it was previously on
after the page fault is handled. The only added latency is to the
specific task that took the page fault during copy_{from,to}_user(),
since it must wait until that specific hart is again available, instead
of being able to resume on another hart. This restriction only applies
to scheduling tasks that were interrupted during copy_{from,to}_user(),
since at other times, including during U-mode page faults, the VM alias
is not in use.

In the simplest case, the scheduler must track on which hart each task
most recently ran, and must ensure that a task can only be resumed on
that same hart if SUM is set in the saved sstatus.......

... and this is wrong. If another task enters copy_{from,to}_user() on
the same hart, the alias will be replaced. The userspace access alias
must be part of the saved task context, be installed during task switch,
and therefore carried with the task when the scheduler migrates the task
to another hart. Replacing the hart-specific top-level alias PTE is the
only option that actually works.
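
The failure and the fix can be traced with a small Python model in which each hart has a single alias slot and each task carries its alias PTE in its saved context. The classes and the `switch_out` / `switch_in` helpers are hypothetical names for illustration.

```python
class AliasHart:
    """Toy hart: holds the one top-level alias PTE slot."""

    def __init__(self):
        self.alias_pte = 0


class Task:
    """Toy task: the alias PTE is part of its saved context."""

    def __init__(self, name):
        self.name = name
        self.saved_alias_pte = 0


def switch_out(hart, task):
    """Save the task's alias and clear the hart's slot."""
    task.saved_alias_pte = hart.alias_pte
    hart.alias_pte = 0


def switch_in(hart, task):
    """Install the task's saved alias on whichever hart runs it next."""
    hart.alias_pte = task.saved_alias_pte
```

Because the alias travels with the task, a second task entering copy_{from,to}_user() on the same hart cannot destroy the first task's mapping, and the first task can resume on any hart.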

On a side note, these problems do *not* apply to zero-copy I/O, since
that will need shared mappings elsewhere in the supervisor address
space, especially if an implementation uses an IOMMU to protect against
rogue or malfunctioning I/O hardware.

>> Alternately, when migrating a task, also copy its hart-specific top-level
>> alias PTE to the destination hart's root page table and perform SFENCE.VMA
>> on the destination hart if the alias is currently valid.
>>
>
> You could do this, but this would defeat the other nice thing splitting
> the user and supervisor page tables would allow you to do: nice ways
> for per-cpu (aka per-hart) data in the kernel.
>

Why would it prevent that? You only need one PTE in each supervisor
root page table for aliasing 1/1024 (ok, 1/512 on RV64) of the user
address space. Only that PTE would be copied as the task is migrated,
so 1023/1024 (511/512 on RV64) of the address space remains for the
supervisor to use however it wants. Some of those entries could be used
for shared supervisor code, others for per-hart/per-module/per-node
supervisor regions. Each hart could still have a unique ssatp.PPN, so
the supervisor ASID will need to be local to each hart.
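
The fractions quoted above follow from the page table geometry: Sv32 uses two levels with 10 index bits each, Sv39 uses three levels with 9 index bits each, and both use 4 KiB base pages. A short Python calculation, with hypothetical function names:

```python
PAGE_SHIFT = 12  # 4 KiB base pages in Sv32 and Sv39


def top_level_entries(vpn_bits_per_level):
    """Number of PTEs in the root page table."""
    return 1 << vpn_bits_per_level


def top_level_pte_span(levels, vpn_bits_per_level):
    """Bytes of virtual address space covered by one root PTE."""
    return 1 << (PAGE_SHIFT + (levels - 1) * vpn_bits_per_level)
```

For Sv32, `top_level_entries(10)` gives the 1024 root entries and `top_level_pte_span(2, 10)` gives the size of the window one alias PTE opens; for Sv39 the corresponding calls are `top_level_entries(9)` and `top_level_pte_span(3, 9)`.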

[The current spec seems unclear whether ASIDs are global or local to
each hart.]

> Btw, you really seem to want to avoid new load and store variants at all
> costs, is there a good reason for that which I'm missing or is it just
> an academic discussion?
>

I previously proposed "Sdas" (essentially your proposal in a different
context; RISC-V has changed somewhat since then) complete with new
(64-bit) load and store instructions and it went down in flames. I
suspect that avoiding new instructions increases the chances that the
proposal will be accepted.

That, and I like the idea of using VM aliasing instead of needing more
instructions. Also, while space remains in OP-STORE, the base OP-LOAD
opcode is full, so full that the RV128 LQ instruction will be encoded in
MISC-MEM instead. Finding room for new load/store variants in the
32-bit instruction space will be extremely hard and I believe that this
proposal will have much better chances of success if we can avoid
needing new instructions.

-- Jacob

Rogier Brussee

Dec 2, 2017, 4:23:31 PM

to RISC-V ISA Dev, h...@lst.de, jcb6...@gmail.com

On Friday, December 1, 2017 at 20:49:41 UTC+1, waterman wrote:

Just having register indirect loads and stores does not seem like a real problem.

In the kernel, isn't a read or write to user memory supposed to go through a copy_{from/to}_user() function anyway?

Christoph Hellwig

Dec 3, 2017, 1:45:33 PM

to Rogier Brussee, RISC-V ISA Dev, h...@lst.de, jcb6...@gmail.com

On Sat, Dec 02, 2017 at 01:23:31PM -0800, Rogier Brussee wrote:
> Just having register indirect loads and stores does not seem like a real
> problem.

I think so.

> In the kernel, isn't a read or write to user memory supposed to go
> through a copy_{from/to}_user() function
> anyway ?

copy_{from/to}_user() (or copyin/copyout in other Unix variants) for
larger copies and get/put_user for small, single-register accesses are
the main workhorses.

The only other interesting operation would be wrong address space
atomics for optimized futexes (not even implemented in the current
RISC-V Linux port). They aren't implemented on sparc either, but
on S/390 they are implemented by the equivalent of using a CSR to
switch entirely to user page table access, similar to the idea suggested
by Jacob in his first reply.

Christoph Hellwig

Dec 3, 2017, 2:05:36 PM

to Jacob Bachmeyer, Christoph Hellwig, RISC-V ISA Dev

On Fri, Dec 01, 2017 at 09:15:26PM -0600, Jacob Bachmeyer wrote:
> Fair enough, although I was not proposing to prevent preemption during page
> faults, only to prevent migrating a task that is currently preempted while
> accessing user memory. The supervisor can do something else while waiting
> for I/O to complete while handling a page fault, but must resume the
> faulting task on the same hart it was previously on after the page fault is
> handled. The only added latency is to the specific task that took the page
> fault during copy_{from,to}_user(), since it must wait until that specific
> hart is again available, instead of being able to resume on another hart.
> This restriction only applies to scheduling tasks that were interrupted
> during copy_{from,to}_user(), since at other times, including during U-mode
> page faults, the VM alias is not in use.

Yes. And realtime Linux has some mechanisms to do that (not in mainline
IIRC), but building these limits into the design for something new is
not what I'd call clean - though if there is no other way it probably
could be made to work.

> On a side note, these problems do *not* apply to zero-copy I/O, since that
> will need shared mappings elsewhere in the supervisor address space,
> especially if an implementation uses an IOMMU to protect against rogue or
> malfunctioning I/O hardware.

Zero copy I/O does not need any kernel mapping at all in Linux. We did
extensive work about 15 years ago to ensure that, as it was essential for the
x86 PAE (larger physical than virtual address space) case, and still helps
with similar extensions on ARM and MIPS, as well as security schemes like
XPFO (https://lwn.net/Articles/699116/,
http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf), which btw
is also defeated by having memory mapped into kernel and user address
space if we want to be able to use normal load and store instructions.

IOMMUs are not required for any of that - IOMMUs just translate between
CPU physical and bus physical in general (a weird HP design that can
do virtual address based scatter/gather excluded).

XPFO btw being another workaround for the lack of entirely separate
pagetables in user vs kernel space. Mapping parts of the user address
space into the kernel address space for copying would open up holes for
ret2dir attacks again, which would be unfortunate.

> Why would it prevent that? You only need one PTE in each supervisor root
> page table for aliasing 1/1024 (ok, 1/512 on RV64) of the user address
> space. Only that PTE would be copied as the task is migrated, so 1023/1024
> (511/512 on RV64) of the address space remains for the supervisor to use
> however it wants. Some of those entries could be used for shared
> supervisor code, others for per-hart/per-module/per-node supervisor
> regions. Each hart could still have a unique ssatp.PPN, so the supervisor
> ASID will need to be local to each hart.
>
>
> [The current spec seems unclear whether ASIDs are global or local to each
> hart.]

In general I would expect ASIDs not to be hart-local, both because it wasn't
stated, and also because sharing ASIDs between harts on the same die
inherently makes sense.

Jacob Bachmeyer

Dec 4, 2017, 1:41:43 AM

to Christoph Hellwig, RISC-V ISA Dev

Christoph Hellwig wrote:
> On Fri, Dec 01, 2017 at 09:15:26PM -0600, Jacob Bachmeyer wrote:
>
>> On a side note, these problems do *not* apply to zero-copy I/O, since that
>> will need shared mappings elsewhere in the supervisor address space,
>> especially if an implementation uses an IOMMU to protect against rogue or
>> malfunctioning I/O hardware.
>>
>
> Zero copy I/O does not need any kernel mapping at all in Linux. We did
> extensive work about 15 years ago to ensure that, as it was essential for the
> x86 PAE (larger physical than virtual address space) case, and still helps
> with similar extensions on ARM and MIPS, as well as security schemes like
> XPFO (https://lwn.net/Articles/699116/,
> http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf), which btw
> is also defeated by having memory mapped into kernel and user address
> space if we want to be able to use normal load and store instructions.
>

Unless I am misreading that paper, it will *not* open up holes in
RISC-V. Aliasing a large region of user space by sharing the page
tables would mean that the leaf PTEs would still have the U bit set. I
am happy to say that my efforts to get an absolute prohibition of S-mode
instruction fetch from user page mappings into the RISC-V ISA were
successful, so directing control flow into such a mapping only raises an
instruction fetch fault. (This was the kind of attack that I was hoping
to prevent.)

ROP chain attacks are a bit harder to prevent, but can be made much
harder to perform if the alias is only valid *during* execution of
copy_{from,to}_user(). All these functions need to do before returning
is clear sstatus.SUM, clear the V bit in the alias PTE in the root page
table, and flush the user page mappings they touched using SFENCE.VMA.
In fact, since the aliased pages from user space are still *user* pages,
clearing SUM will be enough to make those mappings no longer accessible
to the supervisor. Clearing V simply ensures that the alias will not be
carried over during a task switch. Flushing is probably unnecessary.

> IOMMUs are not required for any of that - IOMMUs just translate between
> CPU physical and bus physical in general (a weird HP design that can
> do virtual address based scatter/gather excluded).
>

I was anticipating an IOMMU design that uses either the supervisor page
tables or separate I/O page tables to provide virtual addressing to I/O
hardware. I believe that some newer GPUs have similar capabilities and
expect continued developments in that direction.

> XPFO btw being another workaround for the lack of entirely separate
> pagetables in user vs kernel space. Mapping parts of the user address
> space into the kernel address space for copying would open up holes for
> ret2dir attacks again, which would be unfortunate.
>

The physmap region is for memory management, not storing kernel text, so
the existence of executable physmap pages can be considered a bug in
Linux on architectures that do not conflate "readable page" and
"executable page". RISC-V has separate R and X permission bits, so we
are good on this front.

XPFO on RISC-V could be as simple as clearing the V bit in the physmap
PTE when a page is allocated to userspace and executing SFENCE.VMA on
the physmap address. None of this affects the user space aliasing that
I propose because the key property of physmap that enables ret2dir
attacks is that physmap page mappings are supervisor mappings. I
propose aliasing part of the user address space by directly reusing the
user page tables, which contain user mappings. Address validity checks
will also reliably detect abuses of the user alias region, since it has
a specific and fixed virtual address range not otherwise used in the
supervisor. In particular, a user alias cannot be confused for a kernel
stack.

>> Why would it prevent that? You only need one PTE in each supervisor root
>> page table for aliasing 1/1024 (ok, 1/512 on RV64) of the user address
>> space. Only that PTE would be copied as the task is migrated, so 1023/1024
>> (511/512 on RV64) of the address space remains for the supervisor to use
>> however it wants. Some of those entries could be used for shared
>> supervisor code, others for per-hart/per-module/per-node supervisor
>> regions. Each hart could still have a unique ssatp.PPN, so the supervisor
>> ASID will need to be local to each hart.
>>
>>
>> [The current spec seems unclear whether ASIDs are global or local to each
>> hart.]
>>
>
> In general I would expect ASIDs not be hart-local, both because it wasn't
> stated, and also because sharing ASIDs between harts on the same die
> inherently makes sense.
>

It makes sense for user code, but if only one ASID is available for the
supervisor, making user ASIDs global and the supervisor ASID hart-local
seems to me to be the best way to handle them for this proposal.

On another note, would there be any problem with requiring *user* root
page tables to be aligned on 32 KiB (8-page) boundaries? That would
allow the low three bits of suatp to be used to store independent
address space selection flags for load/store/AMO that, if set, would cause
load/store/AMO to use the user translations if sstatus.SUM is set and
sstatus.SPP is U. The low three bits were not chosen randomly -- they
are within the reach of the immediate-form CSR access instructions.

-- Jacob

Christoph Hellwig

Dec 4, 2017, 5:21:52 PM

to Jacob Bachmeyer, Christoph Hellwig, RISC-V ISA Dev

On Mon, Dec 04, 2017 at 12:41:38AM -0600, Jacob Bachmeyer wrote:
>> x86 PAE (larger physical than virtual address space) case, and still helps
>> with similar extensions on ARM and MIPS, as well as security schemes like
>> XPFO (https://lwn.net/Articles/699116/,
>> http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf), which btw
>> is also defeated by having memory mapped into kernel and user address
>> space if we want to be able to use normal load and store instructions.
>>
>
> Unless I am misreading that paper, it will *not* open up holes in RISC-V.
> Aliasing a large region of user space by sharing the page tables would mean
> that the leaf PTEs would still have the U bit set.

If the U bit is set in the leaf pagetables that also means SUM needs to
be set in sstatus while doing the copy.

> I am happy to say that
> my efforts to get an absolute prohibition of S-mode instruction fetch from
> user page mappings into the RISC-V ISA were successful, so directing
> control flow into such a mapping only raises an instruction fetch fault.
> (This was the kind of attack that I was hoping to prevent.)

That only helps when directly placing the executable code in the
double mapped pages, it does not help with indirection data structures.
(edit: you went on to that below anyway)

Not that I want to downplay your effort - I'm absolutely grateful you
got this feature in!

> ROP chain attacks are a bit harder to prevent, but can be made much harder
> to perform if the alias is only valid *during* execution of
> copy_{from,to}_user(). All these functions need to do before returning is
> clear sstatus.SUM, clear the V bit in the alias PTE in the root page table,
> and flush the user page mappings they touched using SFENCE.VMA. In fact,
> since the aliased pages from user space are still *user* pages, clearing
> SUM will be enough to make those mappings no longer accessible to the
> supervisor. Clearing V simply ensures that the alias will not be carried
> over during a task switch. Flushing is probably unnecessary.

We still touch all these bits and do frequent mapping operations.

A get/put_user currently is two instructions, and now we'll
need to walk and modify page tables for it. I just can't see how
this is ever going to perform reasonably.

>> IOMMUs are not required for any of that - IOMMUs just translate between
>> CPU physical and bus physical in general (a weird HP design that can
>> do virtual address based scatter/gather excluded).
>>
>
> I was anticipating an IOMMU design that uses either the supervisor page
> tables or separate I/O page tables to provide virtual addressing to I/O
> hardware. I believe that some newer GPUs have similar capabilities and
> expect continued developments in that direction.

In x86-land that is called SVM or 'shared virtual memory'; it's a pretty
new thing and mostly useful for user page tables and/or virtualization
guests. For normal O_DIRECT style I/O it could in theory be used, but
it would be a very hard retrofit into common OSes. It probably wouldn't
perform any faster either, as unmapping the I/O pagetables incurs a massive,
massive performance penalty, e.g. see here for a paper:

https://www.kernel.org/doc/ols/2007/ols2007v1-pages-9-20.pdf

Things have gotten a little better since, but the overhead is still huge
on the unmap side. We have seen pretty similar numbers for RDMA memory
registrations (which are basically an on-chip IOMMU) as recently as last
year.

> On another note, would there be any problem with requiring *user* root page
> tables to be aligned on 32 KiB (8-page) boundaries? That would allow the
> low three bits of suatp to be used to store independent address space
> selection flags for load/store/AMO that if set would cause load/store/AMO
> to use the user translations if sstatus.SUM is set and sstatus.SPP is U.
> The low three bits were not chosen randomly -- they are within the reach of
> the immediate-form CSR access instructions.

I think that should be fine - in the worst case this will lead to a few
wasted slots in the next higher level.

Jacob Bachmeyer

Dec 4, 2017, 11:22:37 PM

to Christoph Hellwig, RISC-V ISA Dev

Christoph Hellwig wrote:
> On Mon, Dec 04, 2017 at 12:41:38AM -0600, Jacob Bachmeyer wrote:
>
>>> x86 PAE (larger physical than virtual address space) case, and still helps
>>> with similar extensions on ARM and MIPS, as well as security schemes like
>>> XPFO (https://lwn.net/Articles/699116/,
>>> http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf), which btw
>>> is also defeated by having memory mapped into kernel and user address
>>> space if we want to be able to use normal load and store instructions.
>>>
>>>
>> Unless I am misreading that paper, it will *not* open up holes in RISC-V.
>> Aliasing a large region of user space by sharing the page tables would mean
>> that the leaf PTEs would still have the U bit set.
>>
>
> If the U bit is set in the leaf pagetables that also means SUM needs to
> be set in sstatus while doing the copy.
>

Yes, that is exactly the idea -- copy_{from,to}_user() sets SUM, does
its work, and clears SUM. The same model applies to {get,put}_user(),
although using page aliasing may add more overhead here than is feasible
-- bulk copying a few pages' worth of data amortizes page setup nicely,
but one word at a time is more of a problem.

>> I am happy to say that
>> my efforts to get an absolute prohibition of S-mode instruction fetch from
>> user page mappings into the RISC-V ISA were successful, so directing
>> control flow into such a mapping only raises an instruction fetch fault.
>> (This was the kind of attack that I was hoping to prevent.)
>>
>
> That only helps when directly placing the executable code in the
> double mapped pages, it does not help with indirection data structures.
> (edit: you went on to that below anyway)
>

This was the original purpose of SUM (then called "PUM" with opposite
meaning) as I understand it -- most of the supervisor runs with no
access to user memory permitted and only the handful of locations that
actually need to access user memory enable such access. Originally,
clearing PUM (now setting SUM) also enabled the supervisor to execute
code from user memory. That last part is what I successfully campaigned
to change.
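
To make the rule concrete, here is a minimal sketch in Python of the S-mode permission check as described above. It is a model for discussion only, not the normative text of the privileged spec: the function and parameter names are made up for illustration, and the R/W/X bits are deliberately ignored.

```python
def supervisor_access_allowed(page_is_user: bool, sum_set: bool, access: str) -> bool:
    """Toy model of whether S-mode may touch a page (R/W/X bits not modeled).

    access is one of "load", "store", or "fetch".
    """
    if access not in ("load", "store", "fetch"):
        raise ValueError(f"unknown access type: {access}")
    if not page_is_user:
        # Supervisor pages (U clear) are reachable from S-mode.
        return True
    if access == "fetch":
        # S-mode never executes from a user page, whatever SUM says.
        return False
    # Loads and stores to user pages are permitted only while SUM is set.
    return sum_set
```

With this model, setting SUM opens user pages for data access only; an instruction fetch from a user page is refused in either state, which is what defeats ret2usr.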

> Not that I want to downplay your effort - I'm absolutely grateful you
> got this feature in!
>

It has its limits, some of them intentional -- supervisor and user
executable mappings that alias the same physical page are permitted,
partly because detecting them is infeasible in the general case and
partly to support obscure (non-POSIX) systems that actually *want* to
execute certain parts of user programs in supervisor mode. A POSIX
supervisor that sets up such mappings is defectively naive, however, and
the change does ensure that even corrupting a saved state to set SUM
will not permit ret2usr attacks.

And making enough noise to be noticed was quite a bit of effort; you're
welcome. :-)

>> ROP chain attacks are a bit harder to prevent, but can be made much harder
>> to perform if the alias is only valid *during* execution of
>> copy_{from,to}_user(). All these functions need to do before returning is
>> clear sstatus.SUM, clear the V bit in the alias PTE in the root page table,
>> and flush the user page mappings they touched using SFENCE.VMA. In fact,
>> since the aliased pages from user space are still *user* pages, clearing
>> SUM will be enough to make those mappings no longer accessible to the
>> supervisor. Clearing V simply ensures that the alias will not be carried
>> over during a task switch. Flushing is probably unnecessary.
>>
>
> We still touch all these bits and do frequent mapping operations.
>
> A get/put_user currently is two instructions, and now we'll
> need to walk and modify page tables for it. I just can't see how
> this is ever going to perform reasonably.
>

The {get,put}_user() will be a minimum of four instructions in RISC-V
with unified paging (the current model): load bitmask for SUM into a
register ("LUI" can do this), "CSRS sstatus, <SUM bitmask register>",
{load,store} user word, "CSRC sstatus, <SUM bitmask register>". Care
must be taken that usable gadgets for leaving SUM set are not created.
The supervisor might even want to proactively clear SUM just before
accessing "sensitive" pointers in supervisor structures, just in case a
saved context was somehow corrupted to set SUM. (Exactly which
structures deserve such treatment in Linux is for people more
experienced than I to figure out.)
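
The four-instruction sequence can be sketched as a small Python model. This is illustrative only: the `Hart` class, `user_access()` and the dictionary standing in for user memory are assumptions made for the example, and clearing SUM on the fault path is a choice of this model, not something the hardware does by itself. The SUM bit position (bit 18 of sstatus) is taken from the privileged spec.

```python
from contextlib import contextmanager

SSTATUS_SUM = 1 << 18  # SUM bit in sstatus


def lui(imm20: int) -> int:
    """Value an RV32 LUI places in a register: the 20-bit immediate shifted left by 12."""
    return (imm20 & 0xFFFFF) << 12


class Hart:
    """Toy hart that holds only the sstatus CSR."""

    def __init__(self) -> None:
        self.sstatus = 0

    def csrs(self, mask: int) -> None:
        self.sstatus |= mask       # CSRS: set the bits in mask

    def csrc(self, mask: int) -> None:
        self.sstatus &= ~mask      # CSRC: clear the bits in mask


@contextmanager
def user_access(hart: Hart):
    mask = lui(0x40)               # (1) LUI builds the SUM bitmask
    hart.csrs(mask)                # (2) CSRS sstatus, mask
    try:
        yield                      # (3) the single user load or store
    finally:
        hart.csrc(mask)            # (4) CSRC sstatus, mask, also on the fault path


def get_user(hart: Hart, user_memory: dict, addr: int) -> int:
    with user_access(hart):
        return user_memory[addr]   # a KeyError stands in for a page fault
```

Keeping the set and clear in one bracket is one way to avoid leaving a path on which SUM stays set after the access.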

Walking page tables, at least in software, can be avoided by always
aliasing the largest region of user space possible and reserving a
single PTE in the supervisor's root page table for the alias. For
{get,put}_user() on RV32 with dual address spaces ("LP" is "load
pointer": LW on RV32, LD on RV64; "SP", "store pointer", analogous):

 (1) shift user address right 22 bits to extract VPN[1] {SRLI}
 (2) load the base address of the user root page table {LP}
 (3) extract the relevant PTE from the user root page table {LP}
 (4) store the PTE in the task context (to allow for task switch) {SP}
 (5) store the PTE into the designated alias slot in the supervisor
     root page table (this is a constant address for each hart) {SP}
 (6) construct the alias address by combining the fixed VPN[1] for the
     supervisor alias with the low 22 bits of the user address
     {LUI/NOT/AND/LUI/OR}
 (7) SFENCE.VMA for the alias {SFENCE.VMA}
 (8) set SUM {LUI/CSRS}
 (9) load/store alias address {one instruction}
(10) clear SUM {CSRC}
(11) (optional, to avoid carrying the alias on task switch) store zero
     into the task context alias PTE slot {SP}
(12) return {JALR}

The hardware can use a paging TLB that stores intermediate translations
for a small improvement in RV32 and a much larger improvement in RV64.
(The PTLB benefits also affect sequential access to virtual addresses,
so are not specific to this scenario.)
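
Steps (1) through (6) amount to a little address arithmetic, sketched here in Python for Sv32. The alias slot number `ALIAS_VPN1`, the list-based page tables and the `task_context` dictionary are hypothetical stand-ins chosen for the example; the SFENCE.VMA and SUM handling of steps (7) through (10) are not modeled.

```python
VPN1_SHIFT = 22                      # Sv32: VPN[1] is virtual address bits 31:22
LOW22_MASK = (1 << VPN1_SHIFT) - 1   # offset within one 4 MiB root-level region
ALIAS_VPN1 = 0x3FF                   # hypothetical root slot reserved for the alias


def install_alias(user_root, supervisor_root, task_context, user_addr):
    """Alias the 4 MiB user region containing user_addr; return the alias address."""
    vpn1 = (user_addr >> VPN1_SHIFT) & 0x3FF      # step (1)
    pte = user_root[vpn1]                         # steps (2) and (3)
    task_context["alias_pte"] = pte               # step (4)
    supervisor_root[ALIAS_VPN1] = pte             # step (5)
    # step (6): fixed VPN[1] of the alias combined with the low 22 bits
    return (ALIAS_VPN1 << VPN1_SHIFT) | (user_addr & LOW22_MASK)
```

For example, with the slot above, user address 0x12345678 (VPN[1] = 0x48) is reached through alias address 0xFFF45678.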

While the above is reasonable for copying bulk data, I see the problem
with doing this entire 16-instruction dance with four extra memory
operations just to access a single word. While this probably will not
swamp the performance benefits RISC-V obtains from more efficient trap
handling, it will make dual address spaces on RISC-V significantly
slower than unified address spaces on RISC-V. That leaves either new
instructions or mode bits, see below.

>>> IOMMUs are not required for any of that - IOMMUs just translate between
>>> CPU physical and bus physical in general (a weird HP design that can
>>> do virtual address based scatter/gather excluded).
>>>
>>>
>> I was anticipating an IOMMU design that uses either the supervisor page
>> tables or separate I/O page tables to provide virtual addressing to I/O
>> hardware. I believe that some newer GPUs have similar capabilities and
>> expect continued developments in that direction.
>>
>
> In x86-land that is called SVM or 'shared virtual memory', it's a pretty
> new thing and mostly useful for user page tables and/or virtualization
> guests. For normal O_DIRECT style I/O it could in theory be used, but
> it would be a very hard retrofit into common OSes. It probably wouldn't
> perform any faster either, as unmapping the I/O pagetables is a massive,
> massive performance penalty, e.g. see here for a paper:
>
> https://www.kernel.org/doc/ols/2007/ols2007v1-pages-9-20.pdf
>
> things have gotten a little better since, but the overhead is still huge
> on the unmap side. We have pretty similar numbers for the RDMA memory
> registrations (which are basically an on-chip IOMMU) as recent as last
> year.
>

I expect an IOMMU to be handled similarly to a remote hart, using MMIO
to access an "ioatp" register and the same (implementation-specific)
mechanisms for remote SFENCE (TLB shootdown) as are used for other
harts. I also expect the RISC-V community to develop its own IOMMUs and
learn from previous mistakes. (Calgary can only flush the entire
IOTLB? What were they thinking? *Were* they thinking?)

>> On another note, would there be any problem with requiring *user* root page
>> tables to be aligned on 32 KiB (8-page) boundaries? That would allow the
>> low three bits of suatp to be used to store independent address space
>> selection flags for load/store/AMO that if set would cause load/store/AMO
>> to use the user translations if sstatus.SUM is set and sstatus.SPP is U.
>> The low three bits were not chosen randomly -- they are within the reach of
>> the immediate-form CSR access instructions.
>>
>
> I think that should be fine - in the worst case this will lead to a few
> wasted slots in the next higher level.
>

Huh? I was referring to user root page tables -- the structures that
get their addresses loaded into suatp. There is no higher level, but
this means that the root page table of a user address space would have
seven physical pages following it that could be allocated for other
parts of the user address space, like intermediate page tables for exec
text, exec data+heap, user stack, user mmap region, etc. A very minimal
Sv32 process could fit entirely in those eight pages (1 root page table,
1 exec region page table, 1 user stack page table, 1 user mmap page
table, 1 text page, 1 heap page, 1 stack page with room for another
text, heap, or stack page).
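
The arithmetic behind the alignment and the page tally can be checked with a short Python sketch. It assumes the standard 4 KiB base page; the constant and function names are invented for the example.

```python
PAGE_SIZE = 4096           # base page size, assumed 4 KiB
STOLEN_PPN_BITS = 3        # low bits of suatp.PPN reused as flags

ALIGNMENT_PAGES = 1 << STOLEN_PPN_BITS            # pages per aligned block
ALIGNMENT_BYTES = ALIGNMENT_PAGES * PAGE_SIZE     # alignment in bytes


def root_table_is_aligned(phys_addr: int) -> bool:
    """True if a user root page table at phys_addr leaves the flag bits zero."""
    return phys_addr % ALIGNMENT_BYTES == 0


# Page tally for the very minimal Sv32 process described above.
MINIMAL_PROCESS_PAGES = {
    "root page table": 1,
    "exec region page table": 1,
    "user stack page table": 1,
    "user mmap page table": 1,
    "text page": 1,
    "heap page": 1,
    "stack page": 1,
}
SPARE_PAGES = ALIGNMENT_PAGES - sum(MINIMAL_PROCESS_PAGES.values())
```

Three stolen bits give blocks of eight pages, that is 32 KiB with 4 KiB pages, and the seven pages listed leave one spare page in the block.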

The reason to impose this alignment is to "steal" three bits from
suatp.PPN for three flags: SAU ("Supervisor AMO to User"), SLU
("Supervisor Load from User"), and SSU ("Supervisor Store to User").
These flags are effective if and only if (1) sstatus.SUM is set, (2)
sstatus.SPP is U, and (3) ssatp != suatp. If set and effective,
suatp.SAU causes all atomic memory operations (including LR/SC) to use
the user page translations, suatp.SLU causes OP-LOAD (and LQ on RV128)
and "load-like" operations to use the user page translations, and
suatp.SSU causes OP-STORE and "store-like" operations to use the user
page translations. The prohibition on the effects of these flags when
SPP is S is needed to ensure that the supervisor can safely take a page
fault or interrupt, otherwise the trap handler could inadvertently
direct some accesses to user space while saving the interrupted (S-mode)
context. The requirement that SUM also be set prevents setting these
flags from causing page faults. These flags are the three
least-significant bits in suatp, placing them within reach of the
CSRSI/CSRCI instructions to set or clear them in a single instruction.
Systems without the RVA extension hardwire SAU clear.
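
The conditions above can be written down as a predicate. The following Python sketch is only a model of the proposal: the assignment of SSU, SLU and SAU to bits 0, 1 and 2 is an assumption (the text only says they are the three least-significant bits), and comparing ssatp against suatp with the flag bits masked off is likewise a guess at what "ssatp != suatp" should mean once the flags live in suatp.

```python
SSU = 1 << 0   # Supervisor Store to User (bit position assumed)
SLU = 1 << 1   # Supervisor Load from User (bit position assumed)
SAU = 1 << 2   # Supervisor AMO to User (bit position assumed)
FLAG_MASK = SSU | SLU | SAU

FLAG_FOR_ACCESS = {"store": SSU, "load": SLU, "amo": SAU}


def flags_effective(sum_set: bool, spp_is_user: bool, ssatp: int, suatp: int) -> bool:
    """Conditions (1) to (3): SUM set, SPP is U, and distinct address spaces."""
    distinct_roots = ssatp != (suatp & ~FLAG_MASK)
    return sum_set and spp_is_user and distinct_roots


def uses_user_translation(access: str, sum_set: bool, spp_is_user: bool,
                          ssatp: int, suatp: int) -> bool:
    """True if an S-mode access of this kind would go through the user page tables."""
    selected = bool(suatp & FLAG_FOR_ACCESS[access])
    return flags_effective(sum_set, spp_is_user, ssatp, suatp) and selected
```

Because each flag covers only one class of access, setting SLU alone redirects loads while stores and atomics keep using the supervisor translations.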

The use of these address space selection flags allows {get,put}_user()
to be implemented for dual address space systems with six instructions:
(1) load bitmask for SUM {LUI}, (2) set SUM {CSRS}, (3) set {SLU,SSU}
{CSRSI}, (4) {load,store} to user space {one instruction}, (5) clear
{SLU,SSU} {CSRCI}, (6) clear SUM {CSRC}. Since these flags individually
each affect only one of atomics, loads, or stores, copy_{from,to}_user()
can also easily use them, replacing step (4) with a copy loop. Since a
distinct flag is available for atomics, that could be used for
optimizing futexes while still having other access to supervisor data
structures if needed.
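
Why the flags need one instruction each while SUM needs a register can be seen from the immediate width. A small Python check, assuming the 5-bit zero-extended immediate of the CSRRSI/CSRRCI encodings and bit 18 for sstatus.SUM; the helper name is made up for the example.

```python
CSR_IMM_BITS = 5            # CSRRSI/CSRRCI carry a 5-bit zero-extended immediate
SSTATUS_SUM = 1 << 18       # SUM bit in sstatus
SUATP_FLAGS = 0b111         # the three proposed flags in the low bits of suatp


def fits_csr_immediate(mask: int) -> bool:
    """True if mask can be set or cleared with a single CSRSI/CSRCI."""
    return 0 <= mask < (1 << CSR_IMM_BITS)
```

Any combination of the three flags fits in the immediate, whereas the SUM mask does not, which is why steps (1) and (2) use LUI followed by CSRS.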

I am assuming that user address spaces will contain only user mappings
(PTE U bit set); what to do with supervisor mappings in user address
spaces in a system with dual address spaces is an interesting question.
Of course, user mappings are permitted in the supervisor address space,
both for compatibility (remember that the same hardware must support
unified address space supervisors) and for an easy XPFO implementation
-- simply set the U bit in the physmap page tables when the page frame
is given to user space. If the RSW bits in the physmap page tables
(which are obviously not swappable) can be assigned to the XPFO S and Z
flags, then the xpfo_flags field can be eliminated from struct page on
RISC-V, since the U bit in the physmap page tables effectively serves as
the XPFO T flag. Since RISC-V enforces isolation between user and
supervisor in both directions when SUM is clear (and SUM is clear most
of the time), this reduces XPFO overhead in both unified and dual
address space systems and effectively converts ret2dir attacks to
ret2usr attacks, which the existing protections mitigate. (Strict
isolation in dual address space systems is, of course, also possible by
using the PTE V bit as a complement of the XPFO T flag in the physmap
page tables.)
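
A sketch of the PTE manipulation this implies, in Python. The V and U bit positions and the two RSW bits follow the Sv32/Sv39 PTE layout; which RSW bit would hold the XPFO S flag and which the Z flag is an assumption of this example, as are the function names.

```python
PTE_V = 1 << 0      # valid
PTE_U = 1 << 4      # user page; doubles as the XPFO T flag in this scheme
PTE_RSW0 = 1 << 8   # reserved for software; assumed here to hold XPFO S
PTE_RSW1 = 1 << 9   # reserved for software; assumed here to hold XPFO Z


def give_frame_to_user(physmap_pte: int) -> int:
    """Mark the physmap mapping as a user page when the frame goes to user space."""
    return physmap_pte | PTE_U


def return_frame_to_kernel(physmap_pte: int) -> int:
    """Make the physmap mapping a supervisor page again."""
    return physmap_pte & ~PTE_U


def frame_is_user_owned(physmap_pte: int) -> bool:
    return bool(physmap_pte & PTE_U)


def strict_give_frame_to_user(physmap_pte: int) -> int:
    """Strict variant: invalidate the physmap mapping (V as complement of T)."""
    return physmap_pte & ~PTE_V
```

In either variant the supervisor would still need an SFENCE.VMA on the physmap address after changing the PTE, which this model leaves out.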

-- Jacob
