On 12/15/17, Jonas Oberhauser <s9jo...@gmail.com> wrote:
> On Friday, 15 December 2017 at 06:27:22 UTC+1, Jacob Bachmeyer wrote:
>> the ISA spec is ambiguous about whether SC performs address translation...
>> if the hart already knows that SC must fail, for example, after a
>> preemptive context switch.
> translations and have a page fault? Why not specify "the translation is
> done iff the reservations are still held (e.g., not after an interrupt)" or
If that interrupt is handled by a more-privileged level such as
a hypervisor, can that more-privileged level restore the reservation?
If not, then the less-privileged level can see into the more-privileged
level to some degree.
it seems to cause
nondeterministic behavior that would break exact replay.
I propose that SC should unconditionally translate its address for write and raise a page fault if one would occur were the SC to attempt its store.
Your approach seems to be at odds with the memory model as currently specified, since it allows SCs to fail fairly early (e.g., before address calculation). For example, letting an SC fail due to a TLB miss would be a perfectly legal implementation,
as long as you can achieve the forward progress guarantee specified in the LR/SC section of the spec.
On Dec 15, 2017 22:08, "Andy Wright" <acwr...@mit.edu> wrote:
> Your approach seems to be at odds with the memory model as currently specified, since it allows SCs to fail fairly early (e.g., before address calculation). For example, letting an SC fail due to a TLB miss would be a perfectly legal implementation,
By TLB miss you mean page fault?
I'm opening a sidetrack now. What happens if the SC is to an address to which the OS will never give write permission?
Andy Wright wrote:
This is an interesting corner case of the spec.
On Fri, Dec 15, 2017 at 12:27 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
I propose that SC should unconditionally translate its address for
write and raise a page fault if one would occur were the SC to
attempt its store.
Is there any reason you don't want an SC to fail before address translation?
Efficiency: if the SC writes to a page that is currently copy-on-write, the supervisor must copy the page, map the copy, and make the copy actually writable. SC cannot succeed when this occurs, in any case, since the supervisor has written to the reserved location. The reason to require that SC take page faults even if the reservation has been lost is to ensure that only one pass through an LR/SC sequence is enough for all pages to be present and usable, so the second iteration can succeed.
If SC translates its address conditionally, then an LR/SC to a copy-on-write region that has been swapped out will require at least three iterations to succeed. First, LR faults and the page is swapped in, but marked read-only in the PTE since it is copy-on-write. The SC fails silently and the sequence is retried. Second, LR succeeds, SC would succeed, but takes a page fault because the page is copy-on-write. The supervisor copies the page, etc. Third, the page is finally now writable and the LR/SC succeeds.
If SC translates its address unconditionally, the same situation succeeds on the first retry: LR faults, page swapped in, SC faults, page copied, SC fails; LR/SC retried, success.
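The iteration counts above can be reproduced with a toy model. This is only a sketch: the `Page` class and `passes_until_success` function are hypothetical names, and the model assumes, following the narration above, that any page fault taken during a pass causes the SC in that pass to fail.

```python
from dataclasses import dataclass


@dataclass
class Page:
    present: bool = False   # starts swapped out
    writable: bool = False  # copy-on-write: read-only until the supervisor copies it


def passes_until_success(unconditional: bool, limit: int = 10) -> int:
    """Count passes through the LR/SC sequence until the SC succeeds.

    unconditional=True  -> SC translates its address even if it will fail.
    unconditional=False -> SC translates only while the reservation is held.
    """
    page = Page()
    for n in range(1, limit + 1):
        trapped = False

        # LR: needs the page to be present (readable).
        if not page.present:
            page.present = True   # supervisor swaps the page in, still read-only
            trapped = True

        # Assumption of this model: a trap during the pass loses the reservation.
        reservation_held = not trapped

        # SC: translate for write either always, or only if the reservation is held.
        if reservation_held or unconditional:
            if not page.writable:
                page.writable = True      # supervisor copies the page
                reservation_held = False  # the trap yields the reservation

        if reservation_held:
            return n
    raise RuntimeError("no forward progress within the pass limit")
```

Under these assumptions, `passes_until_success(False)` returns 3 and `passes_until_success(True)` returns 2, matching the three-pass and two-pass walkthroughs.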
To clarify: I propose that hardware be required to perform address translation for a failed SC.
LR/SC on copy-on-write must be supported: copy-on-write is transparent to user code and synchronization variables used with LR/SC are unknown to the supervisor.
You are correct that the page fault will eventually happen as long as a "can-succeed" SC translates its address, which it logically must do. If even a "will-fail" SC (after an LR that "faulted-in" the page) translates its address, the page fault indicating a need for a writable page happens sooner, so a wasted execution of the LR/SC sequence can be avoided. This also applies to implementations that use software management of A/D bits -- if the first write to a page is an SC, that SC will fault, even if the page was already mapped writable, to set the D bit.
I like the idea of modifying the forward progress guarantee to permit exceptions, but this would need to be carefully worded to avoid being an excuse to invalidate the entire guarantee -- an "LR/SC timeout" exception instead of eventual forward progress is not acceptable.
Jonas Oberhauser wrote:
Sounds like a bug (or wording issue) in the current spec extensions, which still guarantee eventual success without consideration to interrupts that they introduce. The problem already occurs if you do a misaligned jalr in the LR/SC sequence. Extensions that introduce interrupts should mention that some of these interrupts can break the forward guarantee.
PS: in my last mail I erroneously wrote about overflow. I instructions do not raise interrupts, so there is no problem with them.
As I read the current spec, LR/SC to an address in an I/O region that does not support LR/SC is a PMA violation (section 3.5.3 "Atomicity PMAs" in the current draft) and therefore must trap.
Also in that section: "Implementations must guarantee that all load reservations are yielded when any trap is taken."
On the main topic, what do you think of requiring SC to always translate
its address?
-- Jacob
SC translating its address for write (and raising page
fault if that translation fails) regardless of current reservation
state. At least some implementations (specifically, those that choose
to take reservations on physical addresses) will need to translate the
address before they know if a reservation exists. I am asking for a
clarification that a failed SC either (1) always translates its address
and raises page fault if a "successful" SC would raise a page fault, or
(2) never raises a page fault, even if the address translation fails.
Case (2) requires a minor nuance to preserve forward progress: a
failing SC must ignore page faults *unless* the page fault is the cause
for the SC to fail. Case (1) has no such edges.
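The two candidate rules can be written down as a small decision function. This is one possible reading of cases (1) and (2), not spec text; `Outcome` and `sc_outcome` are hypothetical names, and `translation_faults` stands for "a successful SC to this address would raise a page fault".

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    FAIL = "fail"              # SC writes a nonzero result, no trap
    PAGE_FAULT = "page fault"  # SC traps


def sc_outcome(case: int, reservation_held: bool, translation_faults: bool) -> Outcome:
    if case == 1:
        # Case (1): always translate; trap whenever a successful SC would trap.
        if translation_faults:
            return Outcome.PAGE_FAULT
        return Outcome.SUCCESS if reservation_held else Outcome.FAIL
    if case == 2:
        # Case (2): an SC that fails for any other reason ignores page faults.
        if not reservation_held:
            return Outcome.FAIL
        # The nuance: the page fault is the only reason this SC cannot succeed,
        # so it must trap, or the sequence would never make progress.
        return Outcome.PAGE_FAULT if translation_faults else Outcome.SUCCESS
    raise ValueError("case must be 1 or 2")
```

The two cases differ only when the reservation is already lost and the translation would fault: case (1) traps, case (2) fails silently.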
Jonas Oberhauser wrote:
If an SC that otherwise could have succeeded failed because of a page fault, take the page fault trap.
On Thursday, 21 December 2017 at 00:48:00 UTC+1, Jacob Bachmeyer wrote:
> SC translating its address for write (and raising page
> fault if that translation fails) regardless of current reservation
> state. At least some implementations (specifically, those that choose
> to take reservations on physical addresses) will need to translate the
> address before they know if a reservation exists. I am asking for a
> clarification that a failed SC either (1) always translates its address
> and raises page fault if a "successful" SC would raise a page fault, or
> (2) never raises a page fault, even if the address translation fails.
> Case (2) requires a minor nuance to preserve forward progress: a
> failing SC must ignore page faults *unless* the page fault is the cause
> for the SC to fail. Case (1) has no such edges.
If the SC can not be translated, how do you distinguish between "SC failed because of page fault" and "SC failed because I have a reservation to a different physical address"?
Only if hardware monitors PTE writes or directly implements remote SFENCE.VMA, otherwise the remote SFENCE.VMA required for that PTE change to be effective on the local hart has canceled the reservation by delivering an IPI.
Note that the standard only gives a forward guarantee for same virtual addresses, but a remote store to the PTEs used for the LR and SC could cause the SC to be untranslatable even if you use the same virtual address (and successfully translated the LR).
Skipping the translation because the reservation was given up is a false economy: software will iterate and retry the LR/SC sequence.
-- Jacob
On 28-12-2017 21:42, Jacob Bachmeyer wrote:
Since the forward progress guarantee forbids any other memory accesses, and the DTLB must have at least one slot, the TLB entry for the reservation must still be present for SC to ever succeed. Good point.
There's only one case where the LR succeeds, the reservation was not lost, but the SC gets a page fault: the page was read-only.
In that case, the TLB entry is still there (otherwise, it would be treated as the reservation being lost), so either the virtual address or the physical address can be used.
That is, I propose the following algorithm for SC:
- Check the reservation on the virtual address (if the implementation uses it), return failure if not found;
- Read the cached TLB entry, return failure if not found;
- Check the reservation on the physical address (if the implementation uses it), return failure if not found;
- Check if the page is writeable, trap if read-only;
- Store the value and return success.
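The five steps above can be sketched as follows. This is a simplified illustration, not a description of any real implementation: `Hart`, `TlbEntry`, `ScResult`, and the `track_va`/`track_pa` flags are hypothetical names, and the TLB is keyed directly by the full virtual address rather than by page number.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class ScResult(Enum):
    SUCCESS = 0
    FAILURE = 1
    PAGE_FAULT = 2


@dataclass
class TlbEntry:
    pa: int
    writable: bool


@dataclass
class Hart:
    tlb: Dict[int, TlbEntry] = field(default_factory=dict)
    memory: Dict[int, int] = field(default_factory=dict)
    track_va: bool = True                 # implementation reserves on virtual addresses
    track_pa: bool = True                 # implementation reserves on physical addresses
    va_reservation: Optional[int] = None
    pa_reservation: Optional[int] = None


def store_conditional(hart: Hart, va: int, value: int) -> ScResult:
    # 1. Check the reservation on the virtual address, if the implementation uses it.
    if hart.track_va and hart.va_reservation != va:
        return ScResult.FAILURE
    # 2. Read the cached TLB entry; a miss is treated as a lost reservation.
    entry = hart.tlb.get(va)
    if entry is None:
        return ScResult.FAILURE
    # 3. Check the reservation on the physical address, if the implementation uses it.
    if hart.track_pa and hart.pa_reservation != entry.pa:
        return ScResult.FAILURE
    # 4. Trap if the page is read-only.
    if not entry.writable:
        return ScResult.PAGE_FAULT
    # 5. Store the value and report success.
    hart.memory[entry.pa] = value
    return ScResult.SUCCESS
```

In this sketch a page fault is only reachable once every reservation check has passed, which is the property the proposal relies on.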
Jonas Oberhauser wrote:
This rule can be simplified since the SC must fail if there is no preceding (still valid) LR: An SC is translated iff it is not the case that the SC "must fail".
I therefore suggest that an SC is translated iff there is an immediately preceding LR and it is not the case that the SC "must fail".
These are the kinds of edge cases that lead me to suggest that SC should always translate its address and trap if that raises page fault.
Note also that if the reservation is lost due to anything else --- TLB miss, timeout, different va, or such --- the SC is translated and may trap, even though HW can "prove" that the SC will fail before attempting the translation (I put prove in quotation marks because it can not always prove that the SC will fail based on the spec, but it can prove it based on the implementation).
Cesar Eduardo Barros made a good point that SC should be more lenient and I like the idea of programs being able to use his scenario and work correctly in case (3), but for his scenario to be usable, this behavior needs to be part of the spec.
Unconditional translation is simpler to implement, and I am asking for either some more advanced model (like Barros proposed) to be standardized or the spec to be clarified that portable programs must assume that SC unconditionally translates its address. A program that assumes failed SCs can still raise page fault will run (but may enter an infinite loop on an SC that can never succeed) even if failed SCs never trap. If failed SCs can trap, case (3) must be avoided -- it will crash the program if the race occurs.
Why? The TLB entry could be invalidated by a remote store to the PTE used by the LR. The SC may still succeed after retranslation. Your observation is correct in the specific case we discussed here, where the write to the PTE is preceded by a write to the reservation, but not correct in general.
I have not found any requirement that remote stores to a PTE be visible to the MMU until a local SFENCE.VMA is executed.
Can you tell me where the spec says that I am allowed to assume that a remote store replacing a PTE will be visible locally?
As I understand the spec, PTE writes are not guaranteed to be visible to remote harts until FENCE is executed (to push the stores to main memory) ...
... and IPIs signaled to cause the remote harts to execute SFENCE.VMA.
As I understand the current spec, swapping the page out is not guaranteed to be visible to the local MMU until SFENCE.VMA is executed (likely in an IPI handler).
No, the page could have been swapped out but no new page swapped in. I don't know if existing OSs do this to a running user but on RISC-V I don't see why not.
The IPI breaks the reservation. Presumably, similar hardware-accelerated TLB shootdown would also break reservations.
Jonas Oberhauser wrote:
There is no trap in case (2): the LR saw the read-only flag clear and the SC saw the value unchanged -- the SC succeeds, since the page is still writable. (Unless, of course, the supervisor has made the page copy-on-write, then SC raises page fault.)
Under my suggestion, SCs can never trap under case 3, but they will always trap under case 2 (unless the hart was interrupted. Under the assumption that the SC will eventually be reached without interruptions, i.e., there are not too many interrupts, it will trap then.)
That appears to be the correct behavior.
The point I am trying to make is that you do not need to worry about all failed SCs, only those that "must fail".
So this is a way to determine which SCs should ignore page faults and which should trap on page faults. [...] It needs to be standardized so programs can rely on it.
However, the translation process has to appear to be done anew for each access, so in case of an LR/SC where the SC has a control-dependency on the LR, a remote store to the PTE after the LR is translated but before it is executed would always be visible to the translation of the SC, since the translation of the SC will likely not begin before the branch is evaluated (the "likely" here is due to the memory model not yet having been integrated with virtual memory, so it may be that this translation will in fact be allowed.).
This is where I think that you misunderstand: if the translation must appear to be done anew for each access, what is the purpose of SFENCE.VMA?
I think that this is similar to the lack of explicit caches and lack of ability to flush/prefetch/etc. cachelines. There are still the FENCE and FENCE.I instructions that implicitly flush/synchronize caches.
Simple enough, the IPI handler executes FENCE/SFENCE.VMA. The FENCE ensures the remote stores (pushed to main memory by a remote FENCE) are visible to the local data load/store unit. The SFENCE.VMA then ensures that the same are visible to the MMU.
This is indeed suggested in the commentary of the spec, but since an SFENCE does not by itself order a remote store to the PTE with the local translation, with the current spec, this only has an effect if taking an IPI counts as a store for SFENCE.
By "similar hardware-accelerated TLB shootdown" you mean an implicit one by remote store to PTE? I don't think so, but possibly "if one of the PTEs used by the LR..."
Implicit TLB shootdown by MMU bus snooping is one option, but the spec seems to envision implementations that might use an MMIO store in a hart's control region to perform the same effect as that hart executing SFENCE.VMA, but asynchronously and without actually interrupting the target hart.
Jonas Oberhauser wrote:
On Jan 1, 2018 01:14, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
> Jonas Oberhauser wrote:
>> However, the translation process has to appear to be done anew
>> for each access, so in case of an LR/SC where the SC has a
>> control-dependency on the LR, a remote store to the PTE after
>> the LR is translated but before it is executed would always be
>> visible to the translation of the SC, since the translation of
>> the SC will likely not begin before the branch is evaluated
>> (the "likely" here is due to the memory model not yet having
>> been integrated with virtual memory, so it may be that this
>> translation will in fact be allowed.).
> This is where I think that you misunderstand: if the translation
> must appear to be done anew for each access, what is the purpose
> of SFENCE.VMA?
The translation is not ordered in memory order with the other stores in the instruction stream, and unlike normal local loads does not forward from stores in the instruction stream.
SFENCE makes sure that the local stores in the instruction stream are in memory order before any translations caused by subsequent instructions, and it does nothing else (unless I overlooked some lines in the spec/a more up to date version has changed the definition of SFENCE).
A commentary block in section 4.2.1 "Supervisor Memory-Management Fence Instruction" explains:
"""
Note the instruction has no effect on the translations of other harts, which must be notified separately. One approach is to use 1) a local data fence to ensure local writes are visible globally, then 2) an interprocessor interrupt to the other thread, then 3) a local SFENCE.VMA in the interrupt handler of the remote thread, and finally 4) signal back to originating thread that operation is complete. This is, of course, the RISC-V analog to a TLB shootdown. Alternatively, implementations might provide direct hardware support for remote TLB invalidation. TLB shootdowns are handled by an SBI call to hide implementation details.
"""
If we assume that the commentary gives a valid implementation and that RISC-V does actually need an analog to a TLB shootdown, then remote SFENCE.VMA must be executed to guarantee that local PTE writes affect translations on a remote hart, which means that a remote hart is allowed to use "stale" PTEs until the remote SFENCE.VMA completes.
[...] All caches are nevertheless as per my understanding kept coherent w.r.t. MO.
Cache coherency is an implementation option in RISC-V as I understand.
The local FENCE has no such powers, and neither does the local SFENCE.
Perhaps in the new memory model the local FENCE is unneeded, but as I understand the original memory model, a remote (store) FENCE is needed to ensure that the write is visible to other harts *and* a local (load) FENCE is needed to ensure that the local hart will see the most-recent updates to main memory. The overall sequence is/was: (on hart A) store PTE, FENCE w, MMIO write for IPI to hart B, (on hart B) take IPI, FENCE r, SFENCE.VMA, return from IPI, use new PTE. Non-coherent caches are discouraged but permitted in RISC-V. (RISC-V privileged ISA spec sec. 3.5.5 "Coherence and Cacheability PMAs")
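The stale-PTE behaviour this sequence guards against can be illustrated with a toy model. It is a sketch only: `Hart`, `translate`, `sfence_vma`, and `take_ipi` are hypothetical names, the page table is a plain shared dictionary, and the FENCE w / FENCE r steps are not modelled because a Python dictionary is always coherent.

```python
from typing import Dict, Optional


class Hart:
    def __init__(self, page_table: Dict[int, int]) -> None:
        self.page_table = page_table        # shared "main memory" page table: vpn -> ppn
        self.tlb: Dict[int, int] = {}       # private cache of translations
        self.reservation: Optional[int] = None

    def translate(self, vpn: int) -> int:
        # A cached translation is used without re-reading the page table.
        if vpn not in self.tlb:
            self.tlb[vpn] = self.page_table[vpn]
        return self.tlb[vpn]

    def lr(self, vpn: int) -> None:
        self.reservation = self.translate(vpn)

    def sfence_vma(self) -> None:
        # Discard cached translations so later accesses re-read the page table.
        self.tlb.clear()

    def take_ipi(self) -> None:
        # Taking any trap yields the reservation; the handler then runs SFENCE.VMA.
        self.reservation = None
        self.sfence_vma()


# Hart A rewrites a PTE while hart B holds a reservation made through the old one.
table = {0x10: 0xA}
hart_b = Hart(table)
hart_b.lr(0x10)
table[0x10] = 0xB                   # hart A: store PTE
stale = hart_b.translate(0x10)      # hart B still sees the old mapping (0xA)
hart_b.take_ipi()                   # hart A's IPI arrives at hart B
fresh = hart_b.translate(0x10)      # hart B now sees the new mapping (0xB)
```

In this model the reservation cannot survive the point at which the new PTE becomes visible, because both happen in the IPI handler.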
Note that the IPI trap handler will unavoidably need to store the interrupted context somewhere, so you have local stores when the IPI trap is taken.
The local FENCE can only order local memory operations, and the SFENCE only orders local stores and local translations.
Either one of the following is true:
1) taking the IPI orders all subsequent local translations behind the IPI, which itself is ordered w.r.t. all preceding remote operations, thus ordering local translations with the remote stores. No SFENCE necessary.
2) taking the IPI does not order the subsequent local translations behind the remote stores, but counts as a local store which is ordered with the remote stores. An SFENCE works, but for "magical" reasons: it orders the local translations with the IPI "store", and thus transitively with the remote stores.
3) taking the IPI does not order the subsequent local translations behind the remote stores, and neither will FENCE or SFENCE. To order the remote stores and the translations, one needs an extra local store with no purpose except to have a local store that the SFENCE can order.
The spec seems to imply 1), but the commentary matches none of these. Local SFENCE and FENCE simply can not impose on their own the necessary ordering with their current semantics.
Quirk aside, have we found another erratum in the spec here?