Christoph Hellwig wrote:
> On Mon, Dec 04, 2017 at 12:41:38AM -0600, Jacob Bachmeyer wrote:
>
>>> x86 PAE (larger physical than virtual address space) case, and still helps
>>> with similar extensions on ARM and MIPS, as well as security schemes like
>>> XPFO (https://lwn.net/Articles/699116/,
>>> http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf), which btw
>>> is also defeated by having memory mapped into kernel and user address
>>> space if we want to be able to use normal load and store instructions.
>>>
>>>
>> Unless I am misreading that paper, it will *not* open up holes in RISC-V.
>> Aliasing a large region of user space by sharing the page tables would mean
>> that the leaf PTEs would still have the U bit set.
>>
>
> If the U bit is set in the leaf pagetables that also means SUM needs to
> be set in sstatus while doing the copy.
>
Yes, that is exactly the idea -- copy_{from,to}_user() sets SUM, does
its work, and clears SUM. The same model applies to {get,put}_user(),
although page aliasing may cost more here than is feasible
-- bulk copying a few pages' worth of data amortizes page setup nicely,
but one word at a time is more of a problem.

>> I am happy to say that
>> my efforts to get an absolute prohibition of S-mode instruction fetch from
>> user page mappings into the RISC-V ISA were successful, so directing
>> control flow into such a mapping only raises an instruction fetch fault.
>> (This was the kind of attack that I was hoping to prevent.)
>>
>
> That only helps when directly placing the executable code in the
> double mapped pages, it does not help with indirection data structures.
> (edit: you went on to that below anyway)
>
This was the original purpose of SUM (then called "PUM" with opposite
meaning) as I understand it -- most of the supervisor runs with no
access to user memory permitted and only the handful of locations that
actually need to access user memory enable such access. Originally,
clearing PUM (now setting SUM) also enabled the supervisor to execute
code from user memory. That last part is what I successfully campaigned
to change.

> Not that I want to downplay your effort - I'm absolutely grateful you
> got this feature in!
>
It has its limits, some of them intentional -- supervisor and user
executable mappings that alias the same physical page are permitted,
partly because detecting them is infeasible in the general case and
partly to support obscure (non-POSIX) systems that actually *want* to
execute certain parts of user programs in supervisor mode. A POSIX
supervisor that sets up such mappings is defectively naive, however, and
the change does ensure that even corrupting a saved state to set SUM
will not permit ret2usr attacks.

And making enough noise to be noticed was quite a bit of effort; you're
welcome. :-)

>> ROP chain attacks are a bit harder to prevent, but can be made much harder
>> to perform if the alias is only valid *during* execution of
>> copy_{from,to}_user(). All these functions need to do before returning is
>> clear sstatus.SUM, clear the V bit in the alias PTE in the root page table,
>> and flush the user page mappings they touched using SFENCE.VMA. In fact,
>> since the aliased pages from user space are still *user* pages, clearing
>> SUM will be enough to make those mappings no longer accessible to the
>> supervisor. Clearing V simply ensures that the alias will not be carried
>> over during a task switch. Flushing is probably unnecessary.
>>
>
> We still touch all these bits and do frequent mapping operations.
>
> A get/put_user currently is two instructions, and now we'll
> need to walk and modify page tables for it. I just can't think how
> this is going to ever perform reasonably.
>
The {get,put}_user() will be a minimum of four instructions in RISC-V
with unified paging (the current model): load bitmask for SUM into a
register ("LUI" can do this), "CSRS sstatus, <SUM bitmask register>",
{load,store} user word, "CSRC sstatus, <SUM bitmask register>". Care
must be taken that usable gadgets for leaving SUM set are not created.
The supervisor might even want to proactively clear SUM just before
accessing "sensitive" pointers in supervisor structures, just in case a
saved context was somehow corrupted to set SUM. (Exactly which
structures deserve such treatment in Linux is for people more
experienced than I to figure out.)
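The four-instruction bracket above can be modelled in plain C to make the ordering explicit. This is only a sketch under stated assumptions, not kernel code: `sstatus` is an ordinary variable standing in for the CSR, `user_word` stands in for a word on a user page, and the helper names are hypothetical. SUM is bit 18 of sstatus, which is why a single LUI can materialise the mask.

```c
#include <assert.h>
#include <stdint.h>

/* sstatus.SUM is bit 18: 0x40000 == 0x40 << 12, so a single LUI loads it. */
#define SSTATUS_SUM (UINT32_C(1) << 18)

static uint32_t sstatus;         /* stand-in for the sstatus CSR */
static uint32_t user_word = 42;  /* stand-in for one word of user memory */

/* Model of the S-mode check on a data access to a page with PTE.U set. */
static int smode_may_touch_user(void)
{
    return (sstatus & SSTATUS_SUM) != 0;
}

/* Returns 0 on success, -1 where real hardware would raise a page fault. */
static int model_get_user(uint32_t *dst)
{
    int ret = -1;

    assert(dst != 0);
    sstatus |= SSTATUS_SUM;          /* CSRS sstatus, mask */
    if (smode_may_touch_user()) {
        *dst = user_word;            /* the single load */
        ret = 0;
    }
    sstatus &= ~SSTATUS_SUM;         /* CSRC sstatus, mask */
    return ret;
}
```

The point of the model is that SUM is clear on every path out of the function, so no exit from it can serve as a gadget that leaves SUM set.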

Walking page tables, at least in software, can be avoided by always
aliasing the largest region of user space possible and reserving a
single PTE in the supervisor's root page table for the alias. For
{get,put}_user() on RV32 with dual address spaces ("LP" is "load
pointer": LW on RV32, LD on RV64; "SP", "store pointer", analogous):

 (1) shift user address right 22 bits to extract VPN[1] {SRLI}
 (2) load the base address of the user root page table {LP}
 (3) extract the relevant PTE from the user root page table {LP}
 (4) store the PTE in the task context (to allow for task switch) {SP}
 (5) store the PTE into the designated alias slot in the supervisor
     root page table (this is a constant address for each hart) {SP}
 (6) construct the alias address by combining the fixed VPN[1] for the
     supervisor alias with the low 22 bits of the user address
     {LUI/NOT/AND/LUI/OR}
 (7) SFENCE.VMA for the alias {SFENCE.VMA}
 (8) set SUM {LUI/CSRS}
 (9) load/store alias address {one instruction}
(10) clear SUM {CSRC}
(11) (optional, to avoid carrying the alias on task switch) store zero
     into the task context alias PTE slot {SP}
(12) return {JALR}

The hardware can use a paging TLB that stores intermediate translations
for a small improvement in RV32 and a much larger improvement in RV64.
(The PTLB benefits also affect sequential access to virtual addresses,
so are not specific to this scenario.)
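The address arithmetic in steps (1), (3), (5), and (6) can be sketched in C for Sv32. This is an illustration only: the alias slot number `ALIAS_VPN1` is a made-up value, the root page tables are modelled as plain arrays of 1024 32-bit PTEs, and the function names are hypothetical. The SFENCE.VMA, the SUM bracket, and the task-context bookkeeping are left out.

```c
#include <assert.h>
#include <stdint.h>

#define SV32_VPN1_SHIFT 22
#define SV32_LOW_MASK   ((UINT32_C(1) << SV32_VPN1_SHIFT) - 1)  /* low 22 bits */

/* Hypothetical: the root-table slot the supervisor reserves for the alias. */
#define ALIAS_VPN1      UINT32_C(0x3FE)

/* Step (1): index of the user root PTE that covers uaddr. */
static uint32_t sv32_vpn1(uint32_t uaddr)
{
    return uaddr >> SV32_VPN1_SHIFT;
}

/* Step (6): same offset within the 4 MiB region, under the alias slot. */
static uint32_t sv32_alias_addr(uint32_t uaddr)
{
    return (ALIAS_VPN1 << SV32_VPN1_SHIFT) | (uaddr & SV32_LOW_MASK);
}

/* Steps (3) and (5): copy the covering user root PTE into the alias slot. */
static void sv32_install_alias(uint32_t *super_root,
                               const uint32_t *user_root, uint32_t uaddr)
{
    assert(super_root != 0 && user_root != 0);
    super_root[ALIAS_VPN1] = user_root[sv32_vpn1(uaddr)];
}
```

With this slot choice, user address 0x00400123 falls under user root PTE 1 and is reached through the alias at 0xFF800123.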

While the above is reasonable for copying bulk data, I see the problem
with doing this entire 16-instruction dance with four extra memory
operations just to access a single word. While this probably will not
swamp the performance benefits RISC-V obtains from more efficient trap
handling, it will make dual address spaces on RISC-V significantly
slower than unified address spaces on RISC-V. That leaves either new
instructions or mode bits; see below.

>>> IOMMUs not required for any of that - IOMMUs just translated between
>>> CPU physical and bus physical in general (a weird HP design that can
>>> do virtual address based scatter/gather excluded).
>>>
>>>
>> I was anticipating an IOMMU design that uses either the supervisor page
>> tables or separate I/O page tables to provide virtual addressing to I/O
>> hardware. I believe that some newer GPUs have similar capabilities and
>> expect continued developments in that direction.
>>
>
> In x86-land that is called SVM or 'shared virtual memory', it's a pretty
> new thing and mostly useful for user page tables and/or virtualization
> guests. For normal O_DIRECT style I/O it could in theory be used, but
> it would be a very hard retrofit into common OSes. It probably wouldn't
> perform any faster either, as unmapping the I/O pagetables is a massive,
> massive performance penalty, e.g. see here for a paper:
>
> https://www.kernel.org/doc/ols/2007/ols2007v1-pages-9-20.pdf
>
> things have gotten a little better since, but the overhead is still huge
> on the unmap side. We have pretty similar numbers for the RDMA memory
> registrations (which are basically an on-chip IOMMU) as recent as last
> year.
>
I expect an IOMMU to be handled similarly to a remote hart, using MMIO
to access an "ioatp" register and the same (implementation-specific)
mechanisms for remote SFENCE (TLB shootdown) as are used for other
harts. I also expect the RISC-V community to develop its own IOMMUs and
learn from previous mistakes. (Calgary can only flush the entire
IOTLB? What were they thinking? *Were* they thinking?)

>> On another note, would there be any problem with requiring *user* root page
>> tables to be aligned on 32KiB (8 page) boundaries? That would allow the
>> low three bits of suatp to be used to store independent address space
>> selection flags for load/store/AMO that if set would cause load/store/AMO
>> to use the user translations if sstatus.SUM is set and sstatus.SPP is U.
>> The low three bits were not chosen randomly -- they are within the reach of
>> the immediate-form CSR access instructions.
>>
>
> I think that should be fine - in the worst case this will lead to a few
> wasted slots in the next higher level.
>
Huh? I was referring to user root page tables -- the structures that
get their addresses loaded into suatp. There is no higher level, but
this means that the root page table of a user address space would have
seven physical pages following it that could be allocated for other
parts of the user address space, like intermediate page tables for exec
text, exec data+heap, user stack, user mmap region, etc. A very minimal
Sv32 process could fit entirely in those eight pages (1 root page table,
1 exec region page table, 1 user stack page table, 1 user mmap page
table, 1 text page, 1 heap page, 1 stack page with room for another
text, heap, or stack page).
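To make the alignment arithmetic concrete, here is a small C sketch. It assumes 4 KiB base pages and that suatp keeps its PPN in the low bits the way satp does. suatp itself is only a proposal, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT      12   /* 4 KiB base pages */
#define SUATP_FLAG_BITS 3    /* low PPN bits reused as flags */

/* Required alignment in bytes: 2^3 pages of 4 KiB each. */
static uint64_t user_root_alignment(void)
{
    return UINT64_C(1) << (PAGE_SHIFT + SUATP_FLAG_BITS);
}

/* True if the root table sits on an eight-page boundary. */
static bool user_root_is_aligned(uint64_t root_phys)
{
    return (root_phys & (user_root_alignment() - 1)) == 0;
}

/* PPN for suatp; its low three bits are zero when the table is aligned. */
static uint64_t user_root_ppn(uint64_t root_phys)
{
    return root_phys >> PAGE_SHIFT;
}
```

An aligned root table therefore always yields a PPN whose low three bits are free to carry the flags.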

The reason to impose this alignment is to "steal" three bits from
suatp.PPN for three flags: SAU ("Supervisor AMO to User"), SLU
("Supervisor Load from User"), and SSU ("Supervisor Store to User").
These flags are effective if and only if (1) sstatus.SUM is set, (2)
sstatus.SPP is U, and (3) ssatp != suatp. If set and effective,
suatp.SAU causes all atomic memory operations (including LR/SC) to use
the user page translations, suatp.SLU causes OP-LOAD (and LQ on RV128)
and "load-like" operations to use the user page translations, and
suatp.SSU causes OP-STORE and "store-like" operations to use the user
page translations. The prohibition on the effects of these flags when
SPP is S is needed to ensure that the supervisor can safely take a page
fault or interrupt, otherwise the trap handler could inadvertently
direct some accesses to user space while saving the interrupted (S-mode)
context. The requirement that SUM also be set prevents setting these
flags from causing page faults. These flags are the three
least-significant bits in suatp, placing them within reach of the
CSRSI/CSRCI instructions to set or clear them in a single instruction.
Systems without the RVA extension hardwire SAU clear.
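The three conditions can be written down as a predicate. The sketch below is in C and rests on several assumptions: the order of SSU/SLU/SAU within the low three bits is my own choice for illustration, the comparison for condition (3) masks the flag bits off suatp first, and the function and enum names are hypothetical. SPP (bit 8) and SUM (bit 18) are the existing sstatus positions.

```c
#include <stdbool.h>
#include <stdint.h>

#define SSTATUS_SPP (UINT32_C(1) << 8)   /* previous privilege: 0 = U, 1 = S */
#define SSTATUS_SUM (UINT32_C(1) << 18)

/* Hypothetical bit assignment within the low three bits of suatp. */
#define SUATP_SSU   UINT32_C(1)
#define SUATP_SLU   UINT32_C(2)
#define SUATP_SAU   UINT32_C(4)
#define SUATP_FLAGS (SUATP_SSU | SUATP_SLU | SUATP_SAU)

enum access_kind { ACCESS_LOAD, ACCESS_STORE, ACCESS_AMO };

/* Does this S-mode access go through the user page translations? */
static bool uses_user_translation(uint32_t sstatus, uint32_t ssatp,
                                  uint32_t suatp, enum access_kind kind)
{
    uint32_t flag;

    if (!(sstatus & SSTATUS_SUM))            /* condition (1) */
        return false;
    if (sstatus & SSTATUS_SPP)               /* condition (2): SPP must be U */
        return false;
    if ((suatp & ~SUATP_FLAGS) == ssatp)     /* condition (3), flags masked */
        return false;

    switch (kind) {
    case ACCESS_LOAD:  flag = SUATP_SLU; break;
    case ACCESS_STORE: flag = SUATP_SSU; break;
    default:           flag = SUATP_SAU; break;
    }
    return (suatp & flag) != 0;
}
```

Each flag redirects only its own class of access, so setting SLU alone leaves stores and atomics on the supervisor translations.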

The use of these address space selection flags allows {get,put}_user()
to be implemented for dual address space systems with six instructions:
(1) load bitmask for SUM {LUI}, (2) set SUM {CSRS}, (3) set {SLU,SSU}
{CSRSI}, (4) {load,store} to user space {one instruction}, (5) clear
{SLU,SSU} {CSRCI}, (6) clear SUM {CSRC}. Since these flags individually
each affect only one of atomics, loads, or stores, copy_{from,to}_user()
can also easily use them, replacing step (4) with a copy loop. Since a
distinct flag is available for atomics, that could be used for
optimizing futexes while still having other access to supervisor data
structures if needed.

I am assuming that user address spaces will contain only user mappings
(PTE U bit set); what to do with supervisor mappings in user address
spaces in a system with dual address spaces is an interesting question.
Of course, user mappings are permitted in the supervisor address space,
both for compatibility (remember that the same hardware must support
unified address space supervisors) and for an easy XPFO implementation
-- simply set the U bit in the physmap page tables when the page frame
is given to user space. If the RSW bits in the physmap page tables
(which are obviously not swappable) can be assigned to the XPFO S and Z
flags, then the xpfo_flags field can be eliminated from struct page on
RISC-V, since the U bit in the physmap page tables effectively serves as
the XPFO T flag. Since RISC-V enforces isolation between user and
supervisor in both directions when SUM is clear (and SUM is clear most
of the time), this reduces XPFO overhead in both unified and dual
address space systems and effectively converts ret2dir attacks to
ret2usr attacks, which the existing protections mitigate. (Strict
isolation in dual address space systems is, of course, also possible by
using the PTE V bit as a complement of the XPFO T flag in the physmap
page tables.)
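The way the U bit stands in for the XPFO T flag can be sketched as a small C model of a physmap PTE. The V, U, and RSW bit positions are the standard RISC-V PTE layout. Which RSW bit holds S and which holds Z is a hypothetical assignment, as are the function names, and the R/W/X permission bits are ignored for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_V      (UINT32_C(1) << 0)
#define PTE_U      (UINT32_C(1) << 4)
/* Hypothetical use of the two RSW bits (9:8) for the XPFO S and Z flags. */
#define PTE_XPFO_S (UINT32_C(1) << 8)
#define PTE_XPFO_Z (UINT32_C(1) << 9)

/* Can S-mode load/store through this physmap PTE? (R/W bits ignored.) */
static bool smode_data_access_ok(uint32_t pte, bool sum)
{
    if (!(pte & PTE_V))
        return false;
    if (pte & PTE_U)
        return sum;          /* user page: reachable only while SUM is set */
    return true;
}

/* Page frame handed to user space: mark its physmap alias as a user page. */
static uint32_t physmap_give_to_user(uint32_t pte)
{
    return pte | PTE_U;
}

/* Page frame returned to the kernel: make the physmap alias usable again. */
static uint32_t physmap_take_from_user(uint32_t pte)
{
    return pte & ~PTE_U;
}
```

In this model a physmap alias of a user-owned frame is unreachable from S-mode whenever SUM is clear, which is the property the paragraph above relies on.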

-- Jacob