Clarification request: SC always translates its address


Jacob Bachmeyer

Dec 15, 2017, 12:27:22 AM
to RISC-V ISA Dev
While planning a proposal for a multi-word LR/SC option, I realized that
the ISA spec is ambiguous about whether SC performs address translation
if the hart already knows that SC must fail, for example, after a
preemptive context switch.

I propose that SC should unconditionally translate its address for write
and raise a page fault if one would occur were the SC to attempt its
store. An SC that should succeed can still raise a page fault if the
page holding the synchronization variable is currently copy-on-write.


-- Jacob

Jonas Oberhauser

Dec 15, 2017, 6:59:57 AM
to RISC-V ISA Dev, jcb6...@gmail.com
Why? I agree there should be no ambiguity, but why do the additional translations and have a page fault? Why not specify "the translation is done iff the reservations are still held (e.g., not after an interrupt)" or something like that?

Albert Cahalan

Dec 15, 2017, 2:08:41 PM
to Jonas Oberhauser, RISC-V ISA Dev, jcb6...@gmail.com
On 12/15/17, Jonas Oberhauser <s9jo...@gmail.com> wrote:
> On Friday, December 15, 2017 at 06:27:22 UTC+1, Jacob Bachmeyer wrote:

>> the ISA spec is ambiguous about whether SC performs address translation
>> if the hart already knows that SC must fail, for example, after a
>> preemptive context switch.
...
> translations and have a page fault? Why not specify "the translation is
> done iff the reservations are still held (e.g., not after an interrupt)" or

If that interrupt is handled by a more-privileged level such as
a hypervisor, can that more-privileged level restore the reservation?

If not, then the less-privileged level can see into the more-privileged
level to some degree. It's a minor security leak, and it seems to cause
nondeterministic behavior that would break exact replay.

Jonas Oberhauser

Dec 15, 2017, 3:44:24 PM
to Albert Cahalan, RISC-V ISA Dev, Jacob Bachmeyer


On Dec 15, 2017 20:08, "Albert Cahalan" <acah...@gmail.com> wrote:
> On 12/15/17, Jonas Oberhauser <s9jo...@gmail.com> wrote:
>> On Friday, December 15, 2017 at 06:27:22 UTC+1, Jacob Bachmeyer wrote:
>>> the ISA spec is ambiguous about whether SC performs address translation
>>> if the hart already knows that SC must fail, for example, after a
>>> preemptive context switch.
> ...
>> translations and have a page fault? Why not specify "the translation is
>> done iff the reservations are still held (e.g., not after an interrupt)" or
>
> If that interrupt is handled by a more-privileged level such as
> a hypervisor, can that more-privileged level restore the reservation?

No.

> If not, then the less-privileged level can see into the more-privileged
> level to some degree.

How? You already know that your reservations were lost from the SC response. What additional info do you get?

> it seems to cause
> nondeterministic behavior that would break exact replay.

How? You replay the interrupt. That seems to be exactly the type of nondeterminism you have solved.

Andy Wright

Dec 15, 2017, 4:08:14 PM
to jcb6...@gmail.com, RISC-V ISA Dev
This is an interesting corner case of the spec.

On Fri, Dec 15, 2017 at 12:27 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> I propose that SC should unconditionally translate its address for write and raise a page fault if one would occur were the SC to attempt its store.

Is there any reason you don't want an SC to fail before address translation? Also, why do you want page faults to have priority over failed SCs?

Your approach seems to be at odds with the memory model as currently specified, since it allows SCs to fail fairly early (e.g., before address calculation). For example, letting an SC fail due to a TLB miss would be a perfectly legal implementation, as long as you can achieve the forward progress guarantee specified in the LR/SC section of the spec.

Andy

Jonas Oberhauser

Dec 15, 2017, 4:20:48 PM
to Andy Wright, Jacob Bachmeyer, RISC-V ISA Dev


On Dec 15, 2017 22:08, "Andy Wright" <acwr...@mit.edu> wrote:
> Your approach seems to be at odds with the memory model as currently specified, since it allows SCs to fail fairly early (e.g., before address calculation). For example, letting an SC fail due to a TLB miss would be a perfectly legal implementation,

By TLB miss you mean page fault?

> as long as you can achieve the forward progress guarantee specified in the LR/SC section of the spec.

I'm opening a sidetrack now.

What happens if the SC is to an address to which the OS will never give write permission?

Andy Wright

Dec 15, 2017, 7:10:13 PM
to Jonas Oberhauser, Jacob Bachmeyer, RISC-V ISA Dev
Responses inline...

On Fri, Dec 15, 2017 at 4:20 PM Jonas Oberhauser <s9jo...@gmail.com> wrote:
> On Dec 15, 2017 22:08, "Andy Wright" <acwr...@mit.edu> wrote:
>> Your approach seems to be at odds with the memory model as currently specified, since it allows SCs to fail fairly early (e.g., before address calculation). For example, letting an SC fail due to a TLB miss would be a perfectly legal implementation,
>
> By TLB miss you mean page fault?

No, I actually mean TLB miss. It’s just an example of an event in the architecture causing an SC to fail other than a trap or a cache invalidation. SC instructions can fail arbitrarily as long as they meet the forward progress guarantee.

> I'm opening a sidetrack now.
>
> What happens if the SC is to an address to which the OS will never give write permission?

I’m not sure. I think the SC can either fail or result in a page fault, but I could also understand the argument that the store conditional should just return failure because LR/SC is not supported on a read-only page. It would make sense to me for SC to always fail (without causing an exception) in memory regions that don’t support LR/SC such as an accelerator’s scratch pad. Also, I don’t think this is a sidetrack at all. I think this is directly related to Jacob’s original email.


Jacob Bachmeyer

Dec 15, 2017, 7:11:32 PM
to Jonas Oberhauser, RISC-V ISA Dev
The reason to do the translation unconditionally is to get all of the
page faults "out of the way" on the first iteration through an LR/SC
sequence. For example, copy-on-write memory becomes actually writable
after the first page fault.


-- Jacob

Jacob Bachmeyer

Dec 15, 2017, 7:13:38 PM
to Albert Cahalan, Jonas Oberhauser, RISC-V ISA Dev
Albert Cahalan wrote:
> On 12/15/17, Jonas Oberhauser <s9jo...@gmail.com> wrote:
>
>> On Friday, December 15, 2017 at 06:27:22 UTC+1, Jacob Bachmeyer wrote:
>>
>>> the ISA spec is ambiguous about whether SC performs address translation
>>> if the hart already knows that SC must fail, for example, after a
>>> preemptive context switch.
>>>
> ...
>
>> translations and have a page fault? Why not specify "the translation is
>> done iff the reservations are still held (e.g., not after an interrupt)" or
>>
>
> If that interrupt is handled by a more-privileged level such as
> a hypervisor, can that more-privileged level restore the reservation?
>

No, an SC that takes a page fault will never succeed. The memory must
either actually become writable at some point or the hypervisor must
emulate SC itself. The hypervisor does have the option of emulating SC,
writing "success" to the saved context and resuming the guest after the
faulting SC.


-- Jacob

Andy Wright

Dec 15, 2017, 7:29:34 PM
to jcb6...@gmail.com, Jonas Oberhauser, RISC-V ISA Dev
Thank you for the copy-on-write example; it makes a lot of sense. I see why you want the translation to always happen.

If LR/SC on copy-on-write pages is important, then you need to eventually get a page fault when executing the LR/SC pair (it doesn't have to be the first time you execute SC). If you want to ensure the page fault will happen, it could be included in the forward progress guarantee. For example, if certain conditions are met (see the spec for the exact conditions), then the LR/SC will eventually succeed or trigger an exception. This prevents requiring address translations for all SC instructions, but it will allow for LR/SC on copy-on-write.

Andy


Jacob Bachmeyer

Dec 15, 2017, 7:31:10 PM
to Andy Wright, RISC-V ISA Dev
Andy Wright wrote:
> This is an interesting corner case of the spec.
>
> On Fri, Dec 15, 2017 at 12:27 AM, Jacob Bachmeyer <jcb6...@gmail.com
> <mailto:jcb6...@gmail.com>> wrote:
>
> I propose that SC should unconditionally translate its address for
> write and raise a page fault if one would occur were the SC to
> attempt its store.
>
>
> Is there any reason you don't want an SC to fail before address
> translation?

Efficiency: if the SC writes to a page that is currently copy-on-write,
the supervisor must copy the page, map the copy, and make the copy
actually writable. SC cannot succeed when this occurs, in any case,
since the supervisor has written to the reserved location. The reason
to require that SC take page faults even if the reservation has been
lost is to ensure that only one pass through an LR/SC sequence is enough
for all pages to be present and usable, so the second iteration can succeed.

If SC translates its address conditionally, then an LR/SC to a
copy-on-write region that has been swapped out will require at least
three iterations to succeed. First, LR faults and the page is swapped
in, but marked read-only in the PTE since it is copy-on-write. The SC
fails silently and the sequence is retried. Second, LR succeeds, SC
would succeed, but takes a page fault because the page is
copy-on-write. The supervisor copies the page, etc. Third, the page is
finally now writable and the LR/SC succeeds.

If SC translates its address unconditionally, the same situation
succeeds on the first retry: LR faults, page swapped in, SC faults,
page copied, SC fails; LR/SC retried, success.
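
As a rough illustration of the two cases above, the following toy model counts passes through an LR/SC retry loop. It is only a sketch of the scenario as described in this message, not a model of any real implementation: `Page`, `passes_until_success`, and the `translate_when_failing` flag are made-up names, and the model assumes that the pass in which LR faults can no longer succeed.

```python
from dataclasses import dataclass


@dataclass
class Page:
    present: bool = False    # starts swapped out
    writable: bool = False   # copy-on-write: read-only until the page is copied


def passes_until_success(translate_when_failing, max_passes=10):
    """Count passes through the LR/SC loop until the SC succeeds."""
    page = Page()
    for this_pass in range(1, max_passes + 1):
        reservation = True
        # LR: the page must be present for a read.
        if not page.present:
            page.present = True      # supervisor swaps the page in, still read-only
            reservation = False      # scenario assumption: this pass cannot succeed
        # SC: translate for write if it could succeed, or always if required.
        if reservation or translate_when_failing:
            if not page.writable:
                page.writable = True  # supervisor copies the page, maps it writable
                reservation = False   # taking the trap yields the reservation
        if reservation:
            return this_pass
    raise RuntimeError("no forward progress in this model")


conditional = passes_until_success(translate_when_failing=False)
unconditional = passes_until_success(translate_when_failing=True)
```

Under these assumptions `conditional` is 3 and `unconditional` is 2, matching the counts given above.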

To clarify: I propose that hardware be required to perform address
translation for a failed SC.

> Also, why do you want page faults to have priority over failed SCs?

The SC still fails after returning from the page fault handler.

> Your approach seems to be at odds with the memory model as currently specified,
> since it allows SCs to fail fairly early (e.g. before address
> calculation). For example, letting an SC fail due to a TLB miss would
> be a perfectly legal implementation, as long as you can achieve the
> forward progress guarantee specified in the LR/SC section of the spec.

In order to achieve forward progress, a TLB miss that fails an SC must
still cause a TLB fill. There is no requirement that software make any
other access to the target of an SC. If an SC fails due to TLB miss and
does not initiate a TLB fill, then the SC will fail every time because
there will never be a TLB hit. TLB fills can cause page faults. If a
TLB fill initiated due to an SC causes a page fault, a page fault must
be taken at SC. I propose that this page fault be taken even if SC
fails, in order to "clear the obstacle" without an extra iteration of
the LR/SC retry loop.


-- Jacob

Jacob Bachmeyer

Dec 15, 2017, 7:32:42 PM
to Jonas Oberhauser, Andy Wright, RISC-V ISA Dev
Jonas Oberhauser wrote:
> What happens if the SC is to an address to which the OS will never
> give write permission?

The supervisor aborts the program. In POSIX, SIGSEGV is delivered, the
same as for a normal store attempted to a read-only page.



-- Jacob

Albert Cahalan

Dec 15, 2017, 9:31:32 PM
to Jonas Oberhauser, RISC-V ISA Dev, Jacob Bachmeyer
On 12/15/17, Jonas Oberhauser <s9jo...@gmail.com> wrote:
> On Dec 15, 2017 20:08, "Albert Cahalan" <acah...@gmail.com> wrote:

> If that interrupt is handled by a more-privileged level such as
> a hypervisor, can that more-privileged level restore the reservation?
>
> No.
>
> If not, then the less-privileged level can see into the more-privileged
> level to some degree.
>
> How? You already know that your reservations were lost from the SC
> response. What additional info do you get?

You know that there is activity elsewhere on the system.
You are able to measure the rate of interrupts.

> it seems to cause
> nondeterministic behavior that would break exact replay.
>
> How? You replay the interrupt. That seems to be exactly the type of
> nondeterminism you have solved.

Machine mode takes a thermal-related interrupt and adjusts the fan.
This breaks the reservation. Machine mode then returns all the way
down to user mode. The hypervisor never sees the interrupt, but the
reservation has been broken. On replay, the hypervisor fails to insert
the interrupt because it was never recorded. The reservation is not
broken as it should be.
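
A toy sketch of the divergence described above, under the assumption that any trap taken between LR and SC yields the reservation. The function names and the dictionary layout are hypothetical and exist only for this illustration.

```python
def sc_succeeds(traps_between_lr_and_sc):
    """Any trap taken between LR and SC yields the reservation, so SC fails."""
    return len(traps_between_lr_and_sc) == 0


def hypervisor_log(traps):
    """The hypervisor can record only the traps it actually sees."""
    return [t for t in traps if t["handled_by"] != "machine"]


# Original run: a thermal interrupt is handled entirely in machine mode.
original_run = [{"cause": "thermal", "handled_by": "machine"}]

# Replay: the hypervisor re-inserts only what it recorded.
replayed_run = hypervisor_log(original_run)

original_result = sc_succeeds(original_run)   # reservation was broken
replayed_result = sc_succeeds(replayed_run)   # nothing breaks it on replay
```

In this model the original SC fails while the replayed SC succeeds, so the two executions diverge at that point.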

Jacob Bachmeyer

Dec 15, 2017, 10:10:32 PM
to Albert Cahalan, Jonas Oberhauser, RISC-V ISA Dev
This sounds like a serious problem regardless of SC's behavior with
respect to page faults on failure. Maybe we need a special
"replay-capable" profile that forbids taking monitor interrupts on harts
that run user code?


-- Jacob

Jacob Bachmeyer

Dec 15, 2017, 10:15:50 PM
to Andy Wright, Jonas Oberhauser, RISC-V ISA Dev
This conflicts with the architectural guarantee of eventual forward
progress, which allows programs to iterate on SC-failure with no further
conditions. A program could have an LR/SC sequence that meets the
requirements and therefore must eventually succeed, except the address
passed in is in a region that does not allow LR/SC. Raising an
exception is the only correct answer, otherwise the program goes into an
infinite loop.


-- Jacob

Jacob Bachmeyer

Dec 15, 2017, 10:37:06 PM
to Andy Wright, Jonas Oberhauser, RISC-V ISA Dev
Andy Wright wrote:
> Thank you for the copy-on-write example; it makes a lot of sense. I
> see why you want the translation to always happen.
>
> If LR/SC on copy-on-write pages is important, then you need to
> eventually get a page fault when executing the LR/SC pair (it doesn't
> have to be the first time you execute SC). If you want to ensure the
> page fault will happen, it could be included in the forward progress
> guarantee. For example, if certain conditions are met (see the spec
> for the exact conditions), then the LR/SC will eventually succeed or
> trigger an exception. This prevents requiring address translations for
> all SC instructions, but it will allow for LR/SC on copy-on-write.

LR/SC on copy-on-write must be supported: copy-on-write is transparent
to user code and synchronization variables used with LR/SC are unknown
to the supervisor.

You are correct that the page fault will eventually happen as long as a
"can-succeed" SC translates its address, which it logically must do. If
even a "will-fail" SC (after an LR that "faulted-in" the page)
translates its address, the page fault indicating a need for a writable
page happens sooner, so a wasted execution of the LR/SC sequence can be
avoided. This also applies to implementations that use software
management of A/D bits -- if the first write to a page is an SC, that SC
will fault, even if the page was already mapped writable, to set the D bit.

I like the idea of modifying the forward progress guarantee to permit
exceptions, but this would need to be carefully worded to avoid being an
excuse to invalidate the entire guarantee -- an "LR/SC timeout"
exception instead of eventual forward progress is not acceptable.


-- Jacob

Bruce Hoult

Dec 16, 2017, 12:49:53 AM
to Jacob Bachmeyer, Andy Wright, RISC-V ISA Dev
On Sat, Dec 16, 2017 at 3:31 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Andy Wright wrote:
>> This is an interesting corner case of the spec.
>>
>> On Fri, Dec 15, 2017 at 12:27 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>> I propose that SC should unconditionally translate its address for
>>> write and raise a page fault if one would occur were the SC to
>>> attempt its store.
>>
>> Is there any reason you don't want an SC to fail before address translation?
>
> Efficiency: if the SC writes to a page that is currently copy-on-write, the supervisor must copy the page, map the copy, and make the copy actually writable. SC cannot succeed when this occurs, in any case, since the supervisor has written to the reserved location. The reason to require that SC take page faults even if the reservation has been lost is to ensure that only one pass through an LR/SC sequence is enough for all pages to be present and usable, so the second iteration can succeed.
>
> If SC translates its address conditionally, then an LR/SC to a copy-on-write region that has been swapped out will require at least three iterations to succeed. First, LR faults and the page is swapped in, but marked read-only in the PTE since it is copy-on-write. The SC fails silently and the sequence is retried. Second, LR succeeds, SC would succeed, but takes a page fault because the page is copy-on-write. The supervisor copies the page, etc. Third, the page is finally now writable and the LR/SC succeeds.
>
> If SC translates its address unconditionally, the same situation succeeds on the first retry: LR faults, page swapped in, SC faults, page copied, SC fails; LR/SC retried, success.
>
> To clarify: I propose that hardware be required to perform address translation for a failed SC.

I don't see why that's a problem worthy of making a "hardware is REQUIRED" rule.

It's purely a performance issue, not a correctness issue. Different implementations rightly have different trade-offs.

It's not even much of a performance issue, as there is a strict limit on the number of instructions (and thus execution time) between LR and SC which is many orders of magnitude smaller than the time used by a page fault. The cost of retrying twice instead of once will be *utterly* down in the noise.

Jonas Oberhauser

Dec 16, 2017, 4:43:52 AM
to Jacob Bachmeyer, Andy Wright, RISC-V ISA Dev
I know this is a radical suggestion, but what do you think about LR requiring write permission?


On Dec 16, 2017 04:37, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
> LR/SC on copy-on-write must be supported: copy-on-write is transparent to user code and synchronization variables used with LR/SC are unknown to the supervisor.
>
> You are correct that the page fault will eventually happen as long as a "can-succeed" SC translates its address, which it logically must do. If even a "will-fail" SC (after an LR that "faulted-in" the page) translates its address, the page fault indicating a need for a writable page happens sooner, so a wasted execution of the LR/SC sequence can be avoided. This also applies to implementations that use software management of A/D bits -- if the first write to a page is an SC, that SC will fault, even if the page was already mapped writable, to set the D bit.
>
> I like the idea of modifying the forward progress guarantee to permit exceptions, but this would need to be carefully worded to avoid being an excuse to invalidate the entire guarantee -- an "LR/SC timeout" exception instead of eventual forward progress is not acceptable.

The only exceptions that seem acceptable are the ones caused by instructions in the block such as 1) overflow 2) misalignment 3) protection faults.

The thing is, protection faults can be invisible to the user (copy on write) or visible (e.g., writing to its code page).

On the one hand it would work quite well (if you are not aborted, you enter the LR/SC loop and have a new chance to get the forward guarantee, this time without protection faults). On the other hand it feels wrong to have the spec of LR/SC depend on something the user cannot see.

Jonas Oberhauser

Dec 16, 2017, 4:56:01 AM
to Jacob Bachmeyer, Andy Wright, RISC-V ISA Dev
Sounds like a bug (or wording issue) in the current spec extensions, which still guarantee eventual success without consideration to interrupts that they introduce. The problem already occurs if you do a misaligned jalr in the LR/SC sequence. Extensions that introduce interrupts should mention that some of these interrupts can break the forward guarantee.

PS: In my last mail I erroneously wrote about overflow. "I" instructions do not raise interrupts, so there is no problem with them.

Cesar Eduardo Barros

Dec 16, 2017, 7:32:02 AM
to Jonas Oberhauser, Jacob Bachmeyer, Andy Wright, RISC-V ISA Dev
On 16-12-2017 07:43, Jonas Oberhauser wrote:
> I know this is a radical suggestion, but what do you think about LR
> requiring write permission?

That would be confusing (LR should behave like a load, not a store), and
would make otherwise valid programs fail.

Consider the following example: a multi-threaded library has an
immutable, reference-counted data structure, which can be allocated
either on the heap, or on the read-only data section of the executable
(for a more concrete example, think of immutable strings in a functional
language). When on the heap, the reference counter is incremented and
decremented as normal; when on the read-only data section, the reference
counter is "frozen" as all-bits-set.

The sequence to atomically update the reference counter would be: LR the
counter; branch if negative to the "success" label; add/sub immediate 1
from the counter; SC the counter; branch if failure back to the LR;
"success" label.

As you can see, this will work for the read-only case if only the SC
requires write permission, since the SC is never executed in that case;
but if the LR also requires write permission, it will fail.
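
The sequence above can be sketched with a small single-hart model. This is an illustration only: `Cell`, `WriteFault`, `ref_add`, and the `lr_needs_write` switch are hypothetical names, and the reservation logic is far simpler than real hardware. The SC result follows the RISC-V convention of zero for success and nonzero for failure.

```python
class WriteFault(Exception):
    """Raised when an access needs write permission the page does not grant."""


class Cell:
    """One word of memory with a toy LR/SC reservation (single hart only)."""

    def __init__(self, value, writable):
        self.value = value
        self.writable = writable
        self.reserved = False

    def lr(self, needs_write=False):
        # needs_write=True models the suggestion that LR require write permission.
        if needs_write and not self.writable:
            raise WriteFault("LR to read-only page")
        self.reserved = True
        return self.value

    def sc(self, new_value):
        if not self.writable:
            raise WriteFault("SC to read-only page")
        if not self.reserved:
            return 1          # nonzero result: SC failed
        self.value = new_value
        self.reserved = False
        return 0              # zero result: SC succeeded


def ref_add(cell, delta, lr_needs_write=False):
    """Adjust a reference count unless it is frozen (negative)."""
    while True:
        count = cell.lr(needs_write=lr_needs_write)
        if count < 0:
            return count      # frozen counter: branch past the SC entirely
        if cell.sc(count + delta) == 0:
            return count + delta
```

With `heap = Cell(1, writable=True)` the count is updated as usual. With `frozen = Cell(-1, writable=False)` the default call returns without ever executing the SC, while `ref_add(frozen, 1, lr_needs_write=True)` raises `WriteFault` at the LR.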

You might think this is not a realistic example, but it's actually the
trick behind Qt's QStringLiteral. QString is a reference-counted string,
and QStringLiteral is a macro which creates a QString pointing to
read-only data. The QString created by QStringLiteral has a reference
count of -1. See https://woboq.com/blog/qstringliteral.html if you want
the details.

--
Cesar Eduardo Barros
ces...@cesarb.eti.br

Cesar Eduardo Barros

Dec 16, 2017, 8:06:01 AM
to Albert Cahalan, Jonas Oberhauser, RISC-V ISA Dev, jcb6...@gmail.com
That could be fixed by "holding" any external interrupts to the hart
before executing the LR, until either the SC is executed or X
instructions have executed (where X is at least the 16 instructions from
the forward progress guarantee). That would add only a small amount of
extra latency to the interrupt handling.

And then, to prevent a less-privileged program from just waiting to do
the SC until after X instructions have passed, the reservation should be
broken at the same time the external interrupts are re-enabled.

These two rules together would make all LR/SC pairs atomic with respect
to external interrupts. The only things which could break the
reservations then would be either from the hart itself, or from another
hart doing a LR/SC to the same address or cacheline.

As a bonus for software developers, the forward progress guarantee
becomes more deterministic: if we mistakenly write a LR/SC sequence
that's too long to always work, it will fail early.
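
A minimal sketch of the two rules, assuming instructions are counted from the one after LR and that the SC is the last instruction of the sequence. `run_sequence` and its parameters are hypothetical names chosen for this illustration; `hold_limit` plays the role of X.

```python
def run_sequence(seq_len, interrupt_at=None, hold_limit=16):
    """Model one LR/SC sequence under the two proposed rules.

    seq_len      -- instructions executed after LR; the last one is the SC
    interrupt_at -- the interrupt becomes pending just before this instruction
    hold_limit   -- X, the number of instructions interrupts may be held

    Returns (sc_succeeded, delivered_at); delivered_at is None if no interrupt.
    """
    reservation = True
    held = True                  # rule 1: hold external interrupts from LR onward
    delivered_at = None
    for n in range(1, seq_len + 1):
        if held and n > hold_limit:
            held = False         # window expired: interrupts are re-enabled ...
            reservation = False  # ... and rule 2 breaks the reservation with it
        pending = interrupt_at is not None and interrupt_at <= n
        if pending and not held and delivered_at is None:
            delivered_at = n     # interrupt taken before instruction n runs
            reservation = False  # taking a trap yields the reservation
    sc_succeeded = reservation
    if interrupt_at is not None and delivered_at is None:
        # The hold ends once the SC has executed; deliver at the next instruction.
        delivered_at = max(interrupt_at, seq_len + 1)
    return sc_succeeded, delivered_at
```

In this model a 4-instruction sequence interrupted at instruction 2 still succeeds, with the interrupt delivered right after the SC, while a 20-instruction sequence fails whether or not an interrupt arrives.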

Jacob Bachmeyer

Dec 16, 2017, 6:37:22 PM
to Jonas Oberhauser, Andy Wright, RISC-V ISA Dev
Jonas Oberhauser wrote:
> On Dec 16, 2017 04:15, "Jacob Bachmeyer" <jcb6...@gmail.com
> <mailto:jcb6...@gmail.com>> wrote:
>
> Andy Wright wrote:
>
> On Fri, Dec 15, 2017 at 4:20 PM Jonas Oberhauser
> <s9jo...@gmail.com <mailto:s9jo...@gmail.com>
As I read the current spec, LR/SC to an address in an I/O region that
does not support LR/SC is a PMA violation (section 3.5.3 "Atomicity
PMAs" in the current draft) and therefore must trap.

Also in that section: "Implementations must guarantee that all load
reservations are yielded when any trap is taken."


-- Jacob

Jacob Bachmeyer

Dec 16, 2017, 7:14:29 PM
to Bruce Hoult, Andy Wright, RISC-V ISA Dev
Bruce Hoult wrote:
> On Sat, Dec 16, 2017 at 3:31 AM, Jacob Bachmeyer <jcb6...@gmail.com
> <mailto:jcb6...@gmail.com>> wrote:
>
> Andy Wright wrote:
>
> This is an interesting corner case of the spec.
>
> On Fri, Dec 15, 2017 at 12:27 AM, Jacob Bachmeyer
> <jcb6...@gmail.com <mailto:jcb6...@gmail.com>
If the program is to ever make forward progress past the LR/SC, the SC
address must be translated eventually.

> It's purely a performance issue, not a correctness issue. Different
> implementations rightly have different trade-offs.

Skipping translation for a "will-fail" SC is also a performance issue.
The difference that I see is that it is a false economy -- the SC must
succeed eventually, so the translation must happen eventually. Why wait?

Further, there is a subtle correctness issue here: a "wild SC" to some
wrong address may be missed if the program only generates a "wild SC"
when atomicity is broken and the SC will fail. Then the program is run
on a different implementation that does unconditionally translate SC
addresses (perhaps because reservations are on physical addresses in the
second implementation) and crashes. We really should state one way or
the other; even requiring portable programs to avoid "wild SCs" will not
work in practice if they do not fault. The other option is to require
that page faults on failed SCs are ignored, but that almost ensures that
"wild SCs" will go undetected in development, until they almost succeed
(and fault) in production.

> It's not even much of a performance issue, as there is a strict limit
> on the number of instructions (and thus execution time) between LR and
> SC which is many orders of magnitude smaller than the time used by a
> page fault. The cost of retrying twice instead of once will be
> *utterly* down in the noise.

Traps on RISC-V have the potential to be orders of magnitude faster than
on x86, so the cost of a page fault may not be that high if all the
supervisor needs to do is set the PTE D bit and resume. Traps
themselves may be little more expensive than JALR, and a future
extension could easily add an instruction to get the physical address of
a PTE for a virtual address or the PTE where the translation faults for
that address. Since this could be a TLB or PTLB lookup, it would be
very fast, using the same TLB columns as are needed for
self-invalidating translations (another extension). The supervisor
knows where its page tables are mapped and can quickly produce a virtual
address. A carefully tuned page fault handler might be able to set a
PTE A/D bit and resume in a few dozen cycles. (Avoid saving all
registers, use extension for fast PTE lookup, etc.) This is comparable
to the time allowed for an LR/SC sequence, not orders of magnitude more.

Assuming the reserved subset is a page or smaller, LR must have already
"faulted-in" the page, so the longest page fault times (swap from disk)
are not at issue here.

Also, to use a famous quote: "Beware of little expenses, a small leak
will sink a great ship." Even though it is a small cost, it is an
unnecessary cost and implementations differing on this point may cause
subtle portability issues.


-- Jacob

Jonas Oberhauser

Dec 17, 2017, 5:28:05 AM
to Jacob Bachmeyer, Andy Wright, RISC-V ISA Dev


On Dec 17, 2017 12:37 AM, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
> Jonas Oberhauser wrote:
>> Sounds like a bug (or wording issue) in the current spec extensions, which still guarantee eventual success without consideration to interrupts that they introduce. The problem already occurs if you do a misaligned jalr in the LR/SC sequence. Extensions that introduce interrupts should mention that some of these interrupts can break the forward guarantee.
>>
>> PS: In my last mail I erroneously wrote about overflow. "I" instructions do not raise interrupts, so there is no problem with them.
>
> As I read the current spec, LR/SC to an address in an I/O region that does not support LR/SC is a PMA violation (section 3.5.3 "Atomicity PMAs" in the current draft) and therefore must trap.
>
> Also in that section: "Implementations must guarantee that all load reservations are yielded when any trap is taken."


I don't see how that contradicts what I have said.

The current forward guarantee says
if (conditions) then SC will eventually succeed.

The conditions that I read do not say anything about traps  (nor reservations). But because of traps (PMA or misalignment) the SC may never succeed.

That is why I suggest either adding to the conditions "and none of the instructions in the block cause traps", or pointing out to me where it is already included in the conditions and I overlooked it.

Jacob Bachmeyer

Dec 18, 2017, 7:09:57 PM
to Jonas Oberhauser, Andy Wright, RISC-V ISA Dev
Jonas Oberhauser wrote:
>
>
> On Dec 17, 2017 12:37 AM, "Jacob Bachmeyer" <jcb6...@gmail.com
Now I see it: the restriction to base RVI, excluding loads, stores,
FENCE, FENCE.I, and SYSTEM is supposed to exclude every instruction that
could possibly trap -- the result of an invalid LR or SC was forgotten.
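
The restriction can be sketched as a small checker. This is an approximation under stated assumptions: the mnemonic list is the RV32I base set minus the excluded classes, the 16-instruction limit is counted here as including the LR and SC themselves, and other conditions in the spec (such as rules about branches) are not checked. `ALLOWED` and `check_sequence` are hypothetical names.

```python
# Base "I" instructions other than loads, stores, FENCE, FENCE.I, and SYSTEM.
ALLOWED = {
    "lui", "auipc", "jal", "jalr",
    "beq", "bne", "blt", "bge", "bltu", "bgeu",
    "addi", "slti", "sltiu", "xori", "ori", "andi", "slli", "srli", "srai",
    "add", "sub", "sll", "slt", "sltu", "xor", "srl", "sra", "or", "and",
}
MAX_INSTRUCTIONS = 16  # assumption: the count includes the LR and the SC


def check_sequence(mnemonics):
    """Return a list of problems; an empty list means the checks here pass."""
    problems = []
    if len(mnemonics) > MAX_INSTRUCTIONS:
        problems.append("sequence longer than 16 instructions")
    starts_with_lr = bool(mnemonics) and mnemonics[0].startswith("lr.")
    ends_with_sc = bool(mnemonics) and mnemonics[-1].startswith("sc.")
    if not (starts_with_lr and ends_with_sc):
        problems.append("sequence must start with LR and end with SC")
    # Only the instructions strictly between LR and SC are restricted.
    for mnemonic in mnemonics[1:-1]:
        if mnemonic not in ALLOWED:
            problems.append(f"{mnemonic}: not an allowed base I instruction")
    return problems
```

For example, `check_sequence(["lr.w", "blt", "addi", "sc.w"])` reports nothing, while a sequence containing `lw` is flagged. Note that the checker says nothing about whether the LR or SC itself traps, which is the gap pointed out above.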


-- Jacob

Jonas Oberhauser

Dec 19, 2017, 2:38:55 AM
to Jacob Bachmeyer, Andy Wright, RISC-V ISA Dev
Ah, that's possible. It sounds like the type of clever shortcut I might try myself -- that is, the type that only works if you don't look too close ;)

One more question though -- is the base ISA excluding jalr? Those can trigger misalignment. 

And with Albert's "trigger on h/scounter" one might want to discuss whether that counts as a trap caused by that instruction or not; possibly it does not make a difference because it is replay, so forward guarantees do not matter.

Jonas Oberhauser

Dec 19, 2017, 3:26:03 AM
to Cesar Eduardo Barros, Jacob Bachmeyer, Andy Wright, RISC-V ISA Dev
Good example, thanks. It would indeed be a different philosophy than "LR behaves like load", and, as you point out, one would need an additional load instruction in the program you mentioned. 
I assume that makes it non-negotiable.
I'd just like to point out that from my view, a reservation is a write-intent/reservation to write, and arguably one should not be able to obtain a write reservation to a location one cannot write.

Jacob Bachmeyer

Dec 19, 2017, 6:28:45 PM
to Jonas Oberhauser, Andy Wright, RISC-V ISA Dev
Jonas Oberhauser wrote:
> On Dec 19, 2017 01:09, "Jacob Bachmeyer" <jcb6...@gmail.com
> <mailto:jcb6...@gmail.com>> wrote:
>
> Jonas Oberhauser wrote:
>
>
> The current forward guarantee says
> if (conditions) then SC will eventually succeed.
>
> The conditions that I read do not say anything about traps
> (nor reservations). But because of traps (PMA or misalignment)
> the SC may never succeed.
>
> That is why I suggest either adding to the conditions "and
> none of the instructions in the block cause traps", or
> pointing out to me where it is already included in the
> conditions and I overlooked it.
>
>
> Now I see it: the restriction to base RVI, excluding loads,
> stores, FENCE, FENCE.I, and SYSTEM is supposed to exclude every
> instruction that could possibly trap -- the result of an invalid
> LR or SC was forgotten.
>
>
> Ah, that's possible. It sounds like the type of clever shortcut I
> might try myself -- that is, the type that only works if you don't
> look too close ;)
>
> One more question though -- is the base ISA excluding jalr? Those can
> trigger misalignment.

I think that you have found another oversight.

> And with Albert's "trigger on h/scounter" one might want to discuss
> whether that counts as a trap caused by that instruction or not;
> possibly it does not make a difference because it is replay, so
> forward guarantees do not matter.

It looks like the hypervisor will need some way to determine if a
reservation was broken, or to force LR/SC to always trap. This could be
part of a "replay" extension.

-- Jacob

Andrew Waterman

Dec 19, 2017, 8:30:01 PM
to Jacob Bachmeyer, Jonas Oberhauser, Andy Wright, RISC-V ISA Dev
On Wed, Dec 20, 2017 at 8:28 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Jonas Oberhauser wrote:
>>
>> On Dec 19, 2017 01:09, "Jacob Bachmeyer" <jcb6...@gmail.com
>> <mailto:jcb6...@gmail.com>> wrote:
>>
>> Jonas Oberhauser wrote:
>>
>>
>> The current forward guarantee says
>> if (conditions) then SC will eventually succeed.
>>
>> The conditions that I read do not say anything about traps
>> (nor reservations). But because of traps (PMA or misalignment)
>> the SC may never succeed.
>>
>> That is why I suggest either adding to the conditions "and
>> none of the instructions in the block cause traps", or
>> pointing out to me where it is already included in the
>> conditions and I overlooked it.
>>
>>
>> Now I see it: the restriction to base RVI, excluding loads,
>> stores, FENCE, FENCE.I, and SYSTEM is supposed to exclude every
>> instruction that could possibly trap -- the result of an invalid
>> LR or SC was forgotten.
>>
>>
>> Ah, that's possible. It sounds like the type of clever shortcut I might
>> try myself -- that is, the type that only works if you don't look too close
>> ;)
>>
>> One more question though -- is the base ISA excluding jalr? Those can
>> trigger misalignment.
>
>
> I think that you have found another oversight.

Misaligned instruction fetch exceptions aren't resumable.

Anyway, the privileged architecture says that exceptions yield load
reservations, so the behavior here is specified.

>
>> And with Albert's "trigger on h/scounter" one might want to discuss whether
>> that counts as a trap caused by that instruction or not; possibly it does
>> not make a difference because it is replay, so forward guarantees do not
>> matter.
>
>
> It looks like the hypervisor will need some way to determine if a
> reservation was broken, or to force LR/SC to always trap. This could be
> part of a "replay" extension.
>
> -- Jacob
>

Jacob Bachmeyer

Dec 19, 2017, 10:00:04 PM
to Andrew Waterman, Jonas Oberhauser, Andy Wright, RISC-V ISA Dev
Correct -- the program crashes in that case. I guess a crash should
obviously void the guarantee of forward progress. :-)

> Anyway, the privileged architecture says that exceptions yield load
> reservations, so the behavior here is specified.
>

While the behavior is specified for hardware, the applications
programmer who only reads the user ISA spec will not be aware of this.
An explicit statement in the user ISA spec that any trap yields
reservations would probably help. We already have the requirement that
SC fail after a preemptive context switch. The user ISA defines ECALL
and EBREAK, so the existence of traps is mentioned.

On the main topic, what do you think of requiring SC to always translate
its address?


-- Jacob

Andrew Waterman

Dec 20, 2017, 1:27:46 AM
to jcb6...@gmail.com, Andy Wright, Jonas Oberhauser, RISC-V ISA Dev
Seems reasonable.



> On the main topic, what do you think of requiring SC to always translate
> its address?

Well, requiring it not translate materially complicates some reasonable microarchitectures, so that’s out. It’s also likely less performant, as you mentioned earlier.

So it’s between requiring it translate and leaving it implementation-defined (or platform-defined). I’m inclined to nail it down as required behavior but don’t feel strongly about it. It would be good to hear from other implementors who think otherwise.




> -- Jacob

Stefan O'Rear

Dec 20, 2017, 1:32:32 AM
to Jacob Bachmeyer, Jonas Oberhauser, Andy Wright, RISC-V ISA Dev
On Tue, Dec 19, 2017 at 3:28 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> It looks like the hypervisor will need some way to determine if a
> reservation was broken, or to force LR/SC to always trap. This could be
> part of a "replay" extension.

Under Rocket, instruction cache misses clear the reservation if they
take more than 32 cycles to resolve, which makes the behavior of
cache-line crossing LR/SC sequences potentially dependent on LR state,
other traffic sharing an arbiter, etc, etc.

What you are trying to do is blatantly impossible and I'd rather this
thread stop soon.

-s

Jonas Oberhauser

Dec 20, 2017, 1:41:18 AM
to Stefan O'Rear, Jacob Bachmeyer, Andy Wright, RISC-V ISA Dev
I'm a bit lost.
Why does that make it impossible to translate the SC? If the reservation is lost due to a cache miss -- and Andrew reminded me that this is valid behaviour -- the SC will still be fetched and decoded anyways.

To allow an implementation to always trap on SC also doesn't seem impossible.

Can you clarify?

Jacob Bachmeyer

Dec 20, 2017, 6:48:00 PM
to Stefan O'Rear, Jonas Oberhauser, Andy Wright, RISC-V ISA Dev
Replay support is something someone who has not participated in this
thread is requesting. It is a minor sidetrack in this thread and I
agree that this sidetrack should be dropped. I will not seek to make
replay support impossible, but I have my doubts regarding its generality.

Replay is a different "kettle of fish" from what this thread is supposed
to be about: SC translating its address for write (and raising page
fault if that translation fails) regardless of current reservation
state. At least some implementations (specifically, those that choose
to take reservations on physical addresses) will need to translate the
address before they know if a reservation exists. I am asking for a
clarification that a failed SC either (1) always translates its address
and raises page fault if a "successful" SC would raise a page fault, or
(2) never raises a page fault, even if the address translation fails.
Case (2) requires a minor nuance to preserve forward progress: a
failing SC must ignore page faults *unless* the page fault is the cause
for the SC to fail. Case (1) has no such edges.
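
The two options can be compared in a toy model. This is only a minimal sketch under stated assumptions, not a description of any real implementation: the dictionary page table (virtual address to `(paddr, writable)`), the reservation held as a physical address or `None`, and the names `sc_case1`, `sc_case2`, and `outcome` are all hypothetical.

```python
class PageFault(Exception):
    """Raised when the toy translation-for-write fails."""


def sc_case1(page_table, memory, reservation, vaddr, value):
    """Case (1): always translate; raise any fault a successful SC would."""
    entry = page_table.get(vaddr)          # (paddr, writable) or None
    if entry is None or not entry[1]:
        raise PageFault(hex(vaddr))
    paddr = entry[0]
    if reservation != paddr:
        return 1                           # nonzero means the SC failed
    memory[paddr] = value
    return 0


def sc_case2(page_table, memory, reservation, vaddr, value):
    """Case (2): a failing SC ignores page faults unless the fault itself
    is what prevents an otherwise successful SC."""
    if reservation is None:
        return 1                           # known to fail: never trap
    entry = page_table.get(vaddr)
    if entry is None:
        raise PageFault(hex(vaddr))        # cannot prove the SC would fail
    paddr, writable = entry
    if reservation != paddr:
        return 1                           # different address: fail silently
    if not writable:
        raise PageFault(hex(vaddr))        # e.g. a copy-on-write page
    memory[paddr] = value
    return 0


def outcome(sc, *args):
    """Run one SC variant and report 'fault', 'fail', or 'success'."""
    try:
        return "success" if sc(*args) == 0 else "fail"
    except PageFault:
        return "fault"
```

With no reservation held and a read-only page, `sc_case1` reports a fault while `sc_case2` reports a plain failure. With the reservation held, both must fault, which is the edge that case (2) has to special-case.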


-- Jacob

Jonas Oberhauser

Dec 21, 2017, 10:21:05 AM
to RISC-V ISA Dev, sor...@gmail.com, s9jo...@gmail.com, acwr...@mit.edu, jcb6...@gmail.com


On Thursday, 21 December 2017 00:48:00 UTC+1, Jacob Bachmeyer wrote:
> SC translating its address for write (and raising page
> fault if that translation fails) regardless of current reservation
> state.  At least some implementations (specifically, those that choose
> to take reservations on physical addresses) will need to translate the
> address before they know if a reservation exists.  I am asking for a
> clarification that a failed SC either (1) always translates its address
> and raises page fault if a "successful" SC would raise a page fault, or
> (2) never raises a page fault, even if the address translation fails.
> Case (2) requires a minor nuance to preserve forward progress:  a
> failing SC must ignore page faults *unless* the page fault is the cause
> for the SC to fail.  Case (1) has no such edges.

If the SC can not be translated, how do you distinguish between "SC failed because of page fault" and "SC failed because I have a reservation to a different physical address"? 
Note that the standard only gives a forward guarantee for same virtual addresses, but a remote store to the PTEs used for the LR and SC could cause the SC to be untranslatable even if you use the same virtual address (and successfully translated the LR).

I think this makes the minor nuance very hard to actually implement.

It sounds like a pity to have to go through the motions for an SC that I know will fail (because I gave up the reservation), but it may be the best option. FWIW, I support your proposal.

Jacob Bachmeyer

Dec 21, 2017, 7:40:28 PM
to Jonas Oberhauser, RISC-V ISA Dev, sor...@gmail.com, acwr...@mit.edu
Jonas Oberhauser wrote:
> On Thursday, 21 December 2017 00:48:00 UTC+1, Jacob Bachmeyer wrote:
>
> SC translating its address for write (and raising page
> fault if that translation fails) regardless of current reservation
> state. At least some implementations (specifically, those that
> choose to take reservations on physical addresses) will need to
> translate the address before they know if a reservation exists. I am
> asking for a clarification that a failed SC either (1) always
> translates its address and raises page fault if a "successful" SC
> would raise a page fault, or (2) never raises a page fault, even if
> the address translation fails.
> Case (2) requires a minor nuance to preserve forward progress: a
> failing SC must ignore page faults *unless* the page fault is the
> cause for the SC to fail. Case (1) has no such edges.
>
>
> If the SC can not be translated, how do you distinguish between "SC
> failed because of page fault" and "SC failed because I have a
> reservation to a different physical address"?

If an SC that otherwise could have succeeded failed because of a page
fault, take the page fault trap.

> Note that the standard only gives a forward guarantee for same virtual
> addresses, but a remote store to the PTEs used for the LR and SC could
> cause the SC to be untranslatable even if you use the same virtual
> address (and successfully translated the LR).

Only if hardware monitors PTE writes or directly implements remote
SFENCE.VMA, otherwise the remote SFENCE.VMA required for that PTE change
to be effective on the local hart has canceled the reservation by
delivering an IPI.

> I think this makes the minor nuance very hard to actually implement.

It certainly makes it more complex.


> It sounds like a pity to have to go through the motions for an SC that
> I know will fail (because I gave up the reservation), but it may be
> the best option. FWIW, I support your proposal.

In most cases "go through the motions" should be nothing more than
"check that the existing TLB entry (from the previous LR) allows
writes". It really is necessary anyway: either on this loop or the
next, since forward progress will eventually require the SC address to
be translated.

Skipping the translation because the reservation was given up is a false
economy: software will iterate and retry the LR/SC sequence. If we
were migrated to another hart, translating the SC address will load the
TLB entry for LR to use at the next iteration, a cost that would
otherwise simply move to that LR instruction. If we were migrated and
our page swapped out, again the page fault at SC merely moves an event
that would inevitably occur at the next iteration ahead in time a bit,
but if the page is swapped in by SC, the supervisor knows that the page
needs to be writable and will not set up copy-on-write.



-- Jacob

Jonas Oberhauser

Dec 22, 2017, 2:17:20 AM
to Jacob Bachmeyer, RISC-V ISA Dev, Stefan O'Rear, acwr...@mit.edu


On Dec 22, 2017 1:40 AM, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
> Jonas Oberhauser wrote:
>> On Thursday, 21 December 2017 00:48:00 UTC+1, Jacob Bachmeyer wrote:
>>
>>> SC translating its address for write (and raising page
>>> fault if that translation fails) regardless of current reservation
>>> state.  At least some implementations (specifically, those that
>>> choose to take reservations on physical addresses) will need to
>>> translate the address before they know if a reservation exists.
>>> I am asking for a clarification that a failed SC either (1) always
>>> translates its address and raises page fault if a "successful" SC
>>> would raise a page fault, or (2) never raises a page fault, even if
>>> the address translation fails.  Case (2) requires a minor nuance to
>>> preserve forward progress:  a failing SC must ignore page faults
>>> *unless* the page fault is the cause for the SC to fail.  Case (1)
>>> has no such edges.
>>
>> If the SC can not be translated, how do you distinguish between "SC failed because of page fault" and "SC failed because I have a reservation to a different physical address"?
>
> If an SC that otherwise could have succeeded failed because of a page fault, take the page fault trap.

But how would you know? You have a reservation to the same va, but the SC can not be translated (for example, because of a PMP fault in a non-leaf PTE). If there would have been no page fault, the SC may or may not have been translated to the physical address of the reservation; it may or may not have succeeded. You can not tell because you haven't fully translated the va (and maybe you can not).
Does that count as page fault worthy or not? 


>> Note that the standard only gives a forward guarantee for same virtual addresses, but a remote store to the PTEs used for the LR and SC could cause the SC to be untranslatable even if you use the same virtual address (and successfully translated the LR).
>
> Only if hardware monitors PTE writes or directly implements remote SFENCE.VMA, otherwise the remote SFENCE.VMA required for that PTE change to be effective on the local hart has canceled the reservation by delivering an IPI.

I don't think that the second one is really quite possible -- as far as I understand the spec (4.3.2), once a remote store is performed, it is visible to the MMU, there is no visible TLB.


> Skipping the translation because the reservation was given up is a false economy:  software will iterate and retry the LR/SC sequence.

Well, there are the (rare?) cases where you have a lonely SC, the translation/core changes before you reach the LR again, the SW gives up, and so on.
But in general you are right.

Cesar Eduardo Barros

Dec 22, 2017, 5:47:26 AM
to jcb6...@gmail.com, RISC-V ISA Dev
On 15-12-2017 03:27, Jacob Bachmeyer wrote:
> While planning a proposal for a multi-word LR/SC option, I realized that
> the ISA spec is ambiguous about whether SC performs address translation
> if the hart already knows that SC must fail, for example, after a
> preemptive context switch.
>
> I propose that SC should unconditionally translate its address for write
> and raise a page fault if one would occur were the SC to attempt its
> store.  An SC that should succeed can still raise a page fault if the
> page holding the synchronization variable is currently copy-on-write.

I just thought of something: for correctness, it's better that a SC that
would fail (because it lost the reservation from the corresponding LR)
_do not_ cause a page fault.

Consider the following (admittedly artificial) scenario: the LR/SC is
being done to a "flags" word in some structure. One of these flags is
"this structure is read-only".

Hart 1 is doing a LR/SC loop to change something else in that flags
word. Hart 2, meanwhile, does a LR/SC loop or AMOOR of its own to set
the "read-only" flag, followed by an mprotect() to set the page as
read-only.

Hart 1 lost its reservation because of hart 2 changing the flags word.
The correct behavior would be: the hart 1 SC fails because its
reservation was lost (caused either by the mprotect IPI, the cacheline
being written, or even an unrelated interrupt), it loops back to the LR,
and a test following the LR finds the read-only flag set and branches
out of the loop before the SC. It would be an incorrect behavior to trap
at the SC, because even though the page is read-only, hart 1 is
guaranteed to not write to it. The acquire/release semantics of hart 2's
SC should guarantee that the change to the flags word is visible to hart
1 before the change to the page's PTE.

This example could be used as sort of a litmus test for the interaction
between LR/SC and the corresponding PTE's read-only/writeable flag.

Therefore, my opinion is: the SC should check the reservation state
first. Only if it would succeed should the SC check the TLB entry (which
should have been filled by the LR) and trap to the page fault handler.

(But this leads to another question: what should be done if the TLB
entry filled by the LR was lost before the SC? My opinion is, that this
should be treated identically to the reservation being lost, so the SC
should fail without trapping.)
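
The flags-word scenario above can be replayed as a toy, single-threaded model. This is a rough sketch, not real LR/SC semantics: `RO_FLAG`, `OTHER_FLAG`, `sc_outcome`, and `run_hart1` are hypothetical names, and hart 2's actions are simply inlined between hart 1's LR and SC.

```python
RO_FLAG = 0x1      # hypothetical "this structure is read-only" bit
OTHER_FLAG = 0x2   # hypothetical bit that hart 1 wants to set


def sc_outcome(reservation_held, page_writable, reservation_first):
    """Return 'fail', 'page-fault', or 'success' for one SC attempt."""
    if reservation_first and not reservation_held:
        return "fail"                  # no translation is attempted
    if not page_writable:
        return "page-fault"
    return "success" if reservation_held else "fail"


def run_hart1(reservation_first):
    """Replay the scenario once and return the events hart 1 observes."""
    flags = 0
    page_writable = True
    events = []

    # Hart 1: LR reads the flags word; the read-only flag is still clear.
    wanted = flags | OTHER_FLAG
    reservation_held = True
    events.append("lr")

    # Hart 2: sets the read-only flag (breaking hart 1's reservation),
    # then mprotect() makes the page read-only.
    flags |= RO_FLAG
    reservation_held = False
    page_writable = False

    # Hart 1: SC tries to store the value computed from the stale load.
    result = sc_outcome(reservation_held, page_writable, reservation_first)
    events.append("sc-" + result)
    if result == "page-fault":
        return events                  # trap taken although no store was due
    if result == "success":
        flags = wanted

    # Hart 1: retry; the new LR sees the flag and leaves the loop before SC.
    events.append("lr")
    if flags & RO_FLAG:
        events.append("exit-read-only")
    return events
```

In this model `run_hart1(True)` ends with hart 1 leaving the loop quietly, while `run_hart1(False)` ends in a page fault at the SC, which is the behavior the scenario argues against.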

Jacob Bachmeyer

Dec 22, 2017, 4:38:50 PM
to Jonas Oberhauser, RISC-V ISA Dev, Stefan O'Rear, acwr...@mit.edu
Jonas Oberhauser wrote:
> On Dec 22, 2017 1:40 AM, "Jacob Bachmeyer" <jcb6...@gmail.com
You cannot prove that the SC would fail if the address had been
translated, so the translation failure is a page fault.

> Note that the standard only gives a forward guarantee for same
> virtual addresses, but a remote store to the PTEs used for the
> LR and SC could cause the SC to be untranslatable even if you
> use the same virtual address (and successfully translated the LR).
>
>
> Only if hardware monitors PTE writes or directly implements remote
> SFENCE.VMA, otherwise the remote SFENCE.VMA required for that PTE
> change to be effective on the local hart has canceled the
> reservation by delivering an IPI.
>
>
> I don't think that the second one is really quite possible -- as far
> as I understand the spec (4.3.2), once a remote store is performed, it
> is visible to the MMU, there is no visible TLB.

Then why does SFENCE.VMA exist?


-- Jacob

Jonas Oberhauser

Dec 22, 2017, 5:24:26 PM
to Jacob Bachmeyer, RISC-V ISA Dev, Stefan O'Rear, acwr...@mit.edu
Because MMU translations do not forward from local writes in the way reads do. 
In other words, the load value axiom for translations would not have the second rule.




> -- Jacob

Jonas Oberhauser

Dec 22, 2017, 5:53:09 PM
to Jacob Bachmeyer, RISC-V ISA Dev
(I think you meant SC failure, if not, correct me)

Would you say the only reason not to signal the page fault would be a lack of reservations? Or would it be more complicated than that?

Jacob Bachmeyer

Dec 27, 2017, 8:14:20 PM
to Jonas Oberhauser, RISC-V ISA Dev
Jonas Oberhauser wrote:
>
>
> On Dec 22, 2017 22:38, "Jacob Bachmeyer" <jcb6...@gmail.com
> <mailto:jcb6...@gmail.com>> wrote:
>
> Jonas Oberhauser wrote:
>
> On Dec 22, 2017 1:40 AM, "Jacob Bachmeyer" <jcb6...@gmail.com
> <mailto:jcb6...@gmail.com> <mailto:jcb6...@gmail.com
If the implementation can prove that the SC could not have succeeded in
any case, then the page fault can be ignored. In case (2), page faults
caused by SC are generally ignored, but if an SC would have succeeded
except for the leaf PTE being a read-only mapping, the page fault *must*
be taken -- there is no other way for LR/SC to a copy-on-write page to
make forward progress.

I argue for the alternative case (1), where page faults that *could*
occur in executing an SC are *always* taken.


-- Jacob

Jacob Bachmeyer

Dec 27, 2017, 8:23:25 PM
to Cesar Eduardo Barros, RISC-V ISA Dev
I like the litmus test idea.

> Therefore, my opinion is: the SC should check the reservation state
> first. Only if it would succeed should the SC check the TLB entry
> (which should have been filled by the LR) and trap to the page fault
> handler.
>
> (But this leads to another question: what should be done if the TLB
> entry filled by the LR was lost before the SC? My opinion is, that
> this should be treated identically to the reservation being lost, so
> the SC should fail without trapping.)

Since the forward progress guarantee forbids any other memory accesses,
and the DTLB must have at least one slot, the TLB entry for the
reservation must still be present for SC to ever succeed. Good point.

How would this interact in an implementation that holds reservations on
physical addresses and therefore must translate the address to determine
if the reservation is still valid? Or do we effectively forbid SC to an
alias of the address reserved by LR?


-- Jacob

Cesar Eduardo Barros

Dec 28, 2017, 5:21:24 AM
to jcb6...@gmail.com, RISC-V ISA Dev
On 27-12-2017 23:23, Jacob Bachmeyer wrote:
> Cesar Eduardo Barros wrote:
>> Therefore, my opinion is: the SC should check the reservation state
>> first. Only if it would succeed should the SC check the TLB entry
>> (which should have been filled by the LR) and trap to the page fault
>> handler.
>>
>> (But this leads to another question: what should be done if the TLB
>> entry filled by the LR was lost before the SC? My opinion is, that
>> this should be treated identically to the reservation being lost, so
>> the SC should fail without trapping.)
>
> Since the forward progress guarantee forbids any other memory accesses,
> and the DTLB must have at least one slot, the TLB entry for the
> reservation must still be present for SC to ever succeed.  Good point.
>
> How would this interact in an implementation that holds reservations on
> physical addresses and therefore must translate the address to determine
> if the reservation is still valid?  Or do we effectively forbid SC to an
> alias of the address reserved by LR?

A possible implementation of reservations would be to store the virtual
address passed to the LR in a hidden register, "lock the bus" between
the LR and SC, and compare the virtual address passed to the SC with the
stored address. That implementation, and any other implementation which
compares the address passed to the SC with the address passed to the LR,
will never work with aliases.

The comparison can even be indirect, for instance by having a
reservation flag on something indexed by the virtual address.

Therefore, doing SC to an alias of the address reserved by the LR is not
guaranteed to work. The address passed to the SC by the programmer must
always be the same virtual address passed to the LR. Whether we should
require implementations to detect this case and fail the SC, or we
should allow this situation to work "by accident" sometimes, is another
question.

From a software developer point of view, it's best if SC to a "wrong"
address never works, since this allows finding incorrect code early.
However, from a hardware point of view, this might require more
circuitry. An extremely simple implementation could put a whole cache
line in "exclusive" mode on the LR, and allow the SC as long as the
cache line was still in the "exclusive" mode (plus a flag cleared on
traps), but that simple implementation would allow the SC as long as the
LR was to anywhere on the same cacheline, which is unexpected.
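
As a rough illustration of that simple implementation, here is a toy model that tracks only the reserved cache line. The class name `LineReservation`, the 64-byte `LINE_BYTES`, and the choice that every SC clears the reservation are assumptions made for the sketch.

```python
LINE_BYTES = 64  # assumed cache-line size


class LineReservation:
    """Toy reservation that remembers only which cache line was reserved."""

    def __init__(self):
        self.line = None                 # index of the "exclusive" line

    def lr(self, memory, addr):
        self.line = addr // LINE_BYTES   # put the whole line in exclusive mode
        return memory.get(addr, 0)

    def trap(self):
        self.line = None                 # the flag cleared on traps

    def sc(self, memory, addr, value):
        ok = self.line == addr // LINE_BYTES
        self.line = None                 # assumed: any SC ends the reservation
        if ok:
            memory[addr] = value
            return 0                     # success
        return 1                         # failure
```

After an `lr` at `0x100`, an `sc` at `0x108` succeeds in this model because both addresses share a line, while an `sc` at `0x140` fails.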

So having something to compare the address is almost required, and the
only question is whether the virtual or the physical address is
compared. The current standard appears to allow comparing the physical
address:

"The SC must be to the same address and of the same data size as the
latest LR executed. LR/SC sequences that do not meet these constraints
might complete on some attempts on some implementations, but there is no
guarantee of eventual success."

Since this is not the privileged specification, I assume "same address"
means "same virtual address" here, so comparing physical addresses falls
into the "might complete on some implementations" case.

Jacob Bachmeyer

Dec 28, 2017, 6:42:06 PM
to Cesar Eduardo Barros, RISC-V ISA Dev
Which is the question I was asking. The current spec allows SC to an
alias to succeed but does not guarantee success. Your solution requires
reservations be strictly held on virtual addresses and thus precludes SC
to an alias. You have a good point, but adopting that assurance appears
to require changing the spec.

> From a software developer point of view, it's best if SC to a "wrong"
> address never works, since this allows finding incorrect code early.
> However, from a hardware point of view, this might require more
> circuitry. An extremely simple implementation could put a whole cache
> line in "exclusive" mode on the LR, and allow the SC as long as the
> cache line was still in the "exclusive" mode (plus a flag cleared on
> traps), but that simple implementation would allow the SC as long as
> the LR was to anywhere on the same cacheline, which is unexpected.

The current spec specifically permits this: an "implementation can
reserve an arbitrary subset of the memory space on each LR".

> So having something to compare the address is almost required, and the
> only question is whether the virtual or the physical address is
> compared. The current standard appears to allow comparing the physical
> address:
>
> "The SC must be to the same address and of the same data size as the
> latest LR executed. LR/SC sequences that do not meet these constraints
> might complete on some attempts on some implementations, but there is
> no guarantee of eventual success."
>
> Since this is not the privileged specification, I assume "same
> address" means "same virtual address" here, so comparing physical
> addresses falls into the "might complete on some implementations" case.

Exactly. If a failed SC can ignore page faults, we will need to
acknowledge that implementations must hold reservations on virtual
addresses, while the current spec permits different behavior.

You have presented a good argument for SC ignoring page faults under
certain conditions, but such behavior really needs to be part of the
spec or portable software will not be able to depend on a failed SC not
raising page fault.


-- Jacob

Cesar Eduardo Barros

Dec 29, 2017, 4:58:08 AM
to jcb6...@gmail.com, RISC-V ISA Dev
>> So having something to compare the address is almost required, and the
>> only question is whether the virtual or the physical address is
>> compared. The current standard appears to allow comparing the physical
>> address:
>>
>> "The SC must be to the same address and of the same data size as the
>> latest LR executed. LR/SC sequences that do not meet these constraints
>> might complete on some attempts on some implementations, but there is
>> no guarantee of eventual success."
>>
>> Since this is not the privileged specification, I assume "same
>> address" means "same virtual address" here, so comparing physical
>> addresses falls into the "might complete on some implementations" case.
>
> Exactly.  If a failed SC can ignore page faults, we will need to
> acknowledge that implementations must hold reservations on virtual
> addresses, while the current spec permits different behavior.

Not really. There's only one case where the LR succeeds, the reservation
was not lost, but the SC gets a page fault: the page was read-only. In
that case, the TLB entry is still there (otherwise, it would be treated
as the reservation being lost), so either the virtual address or the
physical address can be used.

That is, I propose the following algorithm for SC:

- Check the reservation on the virtual address (if the implementation
uses it), return failure if not found;
- Read the cached TLB entry, return failure if not found;
- Check the reservation on the physical address (if the implementation
uses it), return failure if not found;
- Check if the page is writeable, trap if read-only;
- Store the value and return success.
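
The steps above can be sketched as a toy function. This is a hedged illustration only: the `tlb` dictionary (virtual address to `(paddr, writable)`), the single `reservation` value, and the `reserve_on` switch between "virtual" and "physical" are hypothetical modeling choices, and a real TLB is not visible to software like this.

```python
class PageFault(Exception):
    """Raised when the SC would write to a read-only page."""


def store_conditional(vaddr, value, *, tlb, memory, reservation, reserve_on):
    """Return 0 on success, 1 on failure; raise PageFault on a read-only page.

    reserve_on is "virtual" or "physical", naming the kind of address the
    (hypothetical) implementation holds its reservation on.
    """
    # Step 1: check the reservation on the virtual address, if used.
    if reserve_on == "virtual" and reservation != vaddr:
        return 1
    # Step 2: read the cached TLB entry; a miss is treated as failure.
    entry = tlb.get(vaddr)
    if entry is None:
        return 1
    paddr, writable = entry
    # Step 3: check the reservation on the physical address, if used.
    if reserve_on == "physical" and reservation != paddr:
        return 1
    # Step 4: trap only now, when the SC would otherwise succeed.
    if not writable:
        raise PageFault(hex(vaddr))
    # Step 5: perform the store and report success.
    memory[paddr] = value
    return 0
```

In this model a lost reservation or a missing TLB entry always yields a plain failure, and the page fault is reachable only when both are still present.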

>
> You have presented a good argument for SC ignoring page faults under
> certain conditions, but such behavior really needs to be part of the
> spec or portable software will not be able to depend on a failed SC not
> raising page fault.

--
Cesar Eduardo Barros
ces...@cesarb.eti.br

Andrew Waterman

Dec 29, 2017, 6:28:35 AM
to Cesar Eduardo Barros, Jacob Bachmeyer, RISC-V ISA Dev
Requiring that SC not trap when the reservation is not held--as I
believe you are proposing--is too inflexible. It's a legitimate
implementation strategy to unconditionally check permissions before
checking the (physical-address) reservation. It fits naturally into a
virtually indexed, physically tagged cache with pipelined stores,
which is why the Rocket-style cores work this way.

(As another data point, the MIPS ISA appears to mandate checking
permissions before checking the reservation.)

I still think it's OK to stick with the status quo, i.e., leaving it
implementation-defined whether and when an SC to an impermissible
address fails instead of trapping.

>
>>
>> You have presented a good argument for SC ignoring page faults under
>> certain conditions, but such behavior really needs to be part of the spec or
>> portable software will not be able to depend on a failed SC not raising page
>> fault.
>
>
> --
> Cesar Eduardo Barros
> ces...@cesarb.eti.br
>

Jonas Oberhauser

Dec 29, 2017, 6:36:53 AM
to Cesar Eduardo Barros, Jacob Bachmeyer, RISC-V ISA Dev
The "right" way to write the program Cesar suggested is to set the read-only flag before (in MO) the page becomes read-only.

Here is an exhaustive case distinction of program executions:
1) LR sees read-only,  SC is skipped
2) LR sees writable, value unchanged at SC; therefore SC is before (in MO) the setting of the flag and by transitivity also before (in MO) the page becomes read-only. 
3) LR sees writable, value changed at SC. Therefore a write to the same address as the LR appeared between (in MO) LR and SC.

Only case 3 is worrisome. As Cesar points out, it makes a certain sense for the translation not to trap.

I therefore suggest that an SC is translated iff there is an immediately preceding LR and it is not the case that the SC "must fail".

Recall that the spec defines "must fail" as (roughly) a remote store between (in MO) LR and SC, or a context switch/privileged-level xRET between (in PO) LR and SC; an SC is not in the "must fail" case when neither occurred.

This covers case 3 above.

Note that I am including interrupts here, because  it makes the definition more natural. I believe the cost for Albert is zero because LR/SC have to be emulated anyways.

Note also that if the reservation is lost due to anything else --- TLB miss, timeout, different va, or such --- the SC is translated and may trap, even though HW can "prove" that the SC will fail before attempting the translation (I put prove in quotation marks because it can not always prove that the SC will fail based on the spec, but it can prove it based on the implementation).
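
A small sketch of the suggested rule, assuming "must fail" holds exactly when a remote store or a context switch intervened. The `ScContext` fields and both function names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScContext:
    preceding_lr: bool                    # an LR immediately precedes the SC
    remote_store_between: bool            # in memory order, between LR and SC
    context_switch_between: bool          # in program order (xRET, interrupt)
    reservation_lost_other: bool = False  # TLB miss, timeout, different va


def must_fail(ctx):
    """Spec-level condition, as assumed for this sketch."""
    return ctx.remote_store_between or ctx.context_switch_between


def sc_is_translated(ctx):
    """Translate iff an LR precedes and the SC is not in the must-fail case."""
    return ctx.preceding_lr and not must_fail(ctx)
```

Note that `reservation_lost_other` deliberately does not feed into `sc_is_translated`: such an SC is still translated and may trap, even though the hardware already knows it will fail.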


On Dec 29, 2017 10:58 AM, "Cesar Eduardo Barros" <ces...@cesarb.eti.br> wrote:
> On 28-12-2017 21:42, Jacob Bachmeyer wrote:
>
>> Since the forward progress guarantee forbids any other memory accesses, and the DTLB must have at least one slot, the TLB entry for the reservation must still be present for SC to ever succeed.  Good point.

Why? The TLB entry could be invalidated by a remote store to the PTE used by the LR. The SC may still succeed after retranslation. Your observation is correct in the specific case we discussed here, where the write to the PTE is preceded by a write to the reservation, but not correct in general.

> There's only one case where the LR succeeds, the reservation was not lost, but the SC gets a page fault: the page was read-only.

No, the page could have been swapped out with no new page swapped in. I don't know if existing OSs do this to a running user process, but on RISC-V I don't see why not.

In that case, the TLB entry is still there (otherwise, it would be treated as the reservation being lost), so either the virtual address or the physical address can be used.

That is, I propose the following algorithm for SC:

- Check the reservation on the virtual address (if the implementation uses it), return failure if not found;
- Read the cached TLB entry, return failure if not found;
- Check the reservation on the physical address (if the implementation uses it), return failure if not found;
- Check if the page is writeable, trap if read-only;
- Store the value and return success.
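
The steps above can be sketched as a small Python model. This is only an illustration of the proposed order of checks, not of any real implementation: the names `Hart`, `TlbEntry`, `PageFault`, and `store_conditional` are hypothetical, the 4 KiB page size is assumed, and the model assumes the implementation tracks the reservation by both virtual and physical address.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

PAGE_SIZE = 4096  # assumed page size for this toy model


class PageFault(Exception):
    """Models the trap taken when the SC targets a read-only page."""


@dataclass
class TlbEntry:
    ppn: int        # physical page number
    writable: bool


@dataclass
class Hart:
    tlb: Dict[int, TlbEntry] = field(default_factory=dict)  # vpn -> cached entry
    memory: Dict[int, int] = field(default_factory=dict)    # pa -> value
    reserved_va: Optional[int] = None
    reserved_pa: Optional[int] = None

    def _fail(self) -> int:
        # Assumption of this sketch: a completed SC always drops the reservation.
        self.reserved_va = self.reserved_pa = None
        return 1

    def store_conditional(self, va: int, value: int) -> int:
        """Return 0 on success, nonzero on failure; may raise PageFault."""
        # Step 1: check the reservation on the virtual address.
        if self.reserved_va != va:
            return self._fail()
        # Step 2: read the cached TLB entry; a miss counts as failure.
        entry = self.tlb.get(va // PAGE_SIZE)
        if entry is None:
            return self._fail()
        # Step 3: check the reservation on the physical address.
        pa = entry.ppn * PAGE_SIZE + va % PAGE_SIZE
        if self.reserved_pa != pa:
            return self._fail()
        # Step 4: trap if the page is read-only.
        if not entry.writable:
            raise PageFault(hex(va))
        # Step 5: perform the store and report success.
        self.memory[pa] = value
        self.reserved_va = self.reserved_pa = None
        return 0
```

In this model the outcome for a read-only page depends on whether step 2 finds a cached entry: the same SC returns failure on a TLB miss and traps on a hit.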

You cannot specify this because the TLB is invisible. More precisely, in the spec you cannot distinguish between "TLB miss" and "read-only" when the page is read-only, while your algorithm does; so in the spec, you would nondeterministically sometimes trap and sometimes not, which is exactly the problem we are trying to solve (if you want to convince me the TLB is visible in the spec, please find me a line where "TLB", its spelled-out form, or something similar appears in the spec outside of a note/implementation description).

Jacob Bachmeyer

Dec 30, 2017, 7:43:29 PM12/30/17
to Jonas Oberhauser, Cesar Eduardo Barros, RISC-V ISA Dev
Jonas Oberhauser wrote:
> The "right" way to write the program Cesar suggested is to set the
> read-only flag before (in MO) the page becomes read-only.
>
> Here is an exhaustive case distinction of program executions:
> 1) LR sees read-only, SC is skipped
> 2) LR sees writable, value unchanged at SC; therefore SC is before (in
> MO) the setting of the flag and by transitivity also before (in MO)
> the page becomes read-only.
> 3) LR sees writable, value changed at SC. Therefore a write to the
> same address as the LR appeared between (in MO) LR and SC.
>
> Only case 3 is worrisome. As Cesar points out, it makes a certain
> sense for the translation not to trap.
>
> I therefore suggest that an SC is translated iff there is an
> immediately preceding LR and it is not the case that the SC "must fail".

This rule can be simplified since the SC must fail if there is no
preceding (still valid) LR: An SC is translated iff it is not the case
that the SC "must fail".
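
A minimal sketch of this rule as a Python model, under stated assumptions: the `ScContext` fields and the function names are hypothetical, `translate` stands in for address translation and is assumed to raise `PageFault` when the store would not be permitted, and `reservation_held` models implementation-specific reservation loss (timeout, TLB miss, and so on) that the spec's "must fail" conditions do not cover.

```python
from dataclasses import dataclass
from typing import Callable


class PageFault(Exception):
    """Models the trap raised by a failing address translation."""


@dataclass
class ScContext:
    has_valid_lr: bool            # a preceding LR set up a reservation
    remote_access_observed: bool  # observable access from another hart to the address
    context_switched: bool        # intervening context switch on this hart
    executed_xret: bool           # privileged exception-return executed in the meantime


def must_fail(ctx: ScContext) -> bool:
    """Spec-level "must fail" conditions, plus the missing-LR case."""
    return (not ctx.has_valid_lr
            or ctx.remote_access_observed
            or ctx.context_switched
            or ctx.executed_xret)


def execute_sc(ctx: ScContext,
               translate: Callable[[], None],
               reservation_held: bool) -> int:
    """Return 0 on success, 1 on failure; may raise PageFault."""
    if must_fail(ctx):
        return 1          # not translated, so it cannot page fault
    translate()           # translated: may trap even if the SC then fails
    return 0 if reservation_held else 1
```

Under this model an SC in case (3) returns failure without trapping, while an SC whose reservation was lost for an implementation-specific reason is still translated and may trap.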

> Recall that "must fail" is defined in the spec as (roughly): there is a
> remote store between (in MO) LR and SC, or a context switch/privileged-level
> xRET between (in PO) LR and SC.
>
> This covers case 3 above.
>
> Note that I am including interrupts here, because it makes the
> definition more natural. I believe the cost for Albert is zero because
> LR/SC have to be emulated anyways.

I agree that replay will require emulating LR/SC and that interrupts
should break reservations.

> Note also that if the reservation is lost due to anything else --- TLB
> miss, timeout, different va, or such --- the SC is translated and may
> trap, even though HW can "prove" that the SC will fail before
> attempting the translation (I put prove in quotation marks because it
> can not always prove that the SC will fail based on the spec, but it
> can prove it based on the implementation).

These are the kinds of edge cases that lead me to suggest that SC should
always translate its address and trap if that raises a page fault.

Cesar Eduardo Barros made a good point that SC should be more lenient
and I like the idea of programs being able to use his scenario and work
correctly in case (3), but for his scenario to be usable, this behavior
needs to be part of the spec.

Unconditional translation is simpler to implement, and I am asking for
either some more advanced model (like Barros proposed) to be
standardized or the spec to be clarified that portable programs must
assume that SC unconditionally translates its address. A program that
assumes failed SCs can still raise a page fault will run (but may enter an
infinite loop on an SC that can never succeed) even if failed SCs never
trap. If failed SCs can trap, case (3) must be avoided -- it will crash
the program if the race occurs.

> On Dec 29, 2017 10:58 AM, "Cesar Eduardo Barros" <ces...@cesarb.eti.br> wrote:
>
> On 28-12-2017 21:42, Jacob Bachmeyer wrote:
>
> Since the forward progress guarantee forbids any other
> memory accesses, and the DTLB must have at least one
> slot, the TLB entry for the reservation must still be
> present for SC to ever succeed. Good point.
>
>
> Why? The TLB entry could be invalidated by a remote store to the PTE
> used by the LR. The SC may still succeed after retranslation. Your
> observation is correct in the specific case we discussed here, where
> the write to the PTE is preceded by a write to the reservation, but
> not correct in general.

I have not found any requirement that remote stores to a PTE be visible
to the MMU until a local SFENCE.VMA is executed. Can you tell me where
the spec says that I am allowed to assume that a remote store replacing
a PTE will be visible locally?

I have suggested that an implementation *could* do this, but that was in
the context of a performance optimization to reduce the costs of PTE
updates. As I understand the spec, PTE writes are not guaranteed to be
visible to remote harts until FENCE is executed (to push the stores to
main memory) and IPIs signaled to cause the remote harts to execute
SFENCE.VMA.

> There's only one case where the LR succeeds, the reservation was
> not lost, but the SC gets a page fault: the page was read-only.
>
>
> No, the page could have been swapped out but no new page swapped in. I
> don't know if existing OSs do this to a running user but on RISC-V I
> don't see why not.

As I understand the current spec, swapping the page out is not
guaranteed to be visible to the local MMU until SFENCE.VMA is executed
(likely in an IPI handler). The IPI breaks the reservation.
Presumably, similar hardware-accelerated TLB shootdown would also break
reservations.


-- Jacob

Jonas Oberhauser

Dec 31, 2017, 4:19:03 PM12/31/17
to Jacob Bachmeyer, Cesar Eduardo Barros, RISC-V ISA Dev
On Dec 31, 2017 1:43 AM, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
Jonas Oberhauser wrote:

I therefore suggest that an SC is translated iff there is an immediately preceding LR and it is not the case that the SC "must fail".

This rule can be simplified since the SC must fail if there is no preceding (still valid) LR:  An SC is translated iff it is not the case that the SC "must fail".

Intuitively that is the case but with the current spec wording I preferred adding it explicitly to make it clear; I am referring to
"The SC must fail if there is an observable memory access from another hart to the address, or
if there is an intervening context switch on this hart, or if in the meantime the hart executed a
privileged exception-return instruction."


Note also that if the reservation is lost due to anything else --- TLB miss, timeout, different va, or such --- the SC is translated and may trap, even though HW can "prove" that the SC will fail before attempting the translation (I put prove in quotation marks because it can not always prove that the SC will fail based on the spec, but it can prove it based on the implementation).

These are the kinds of edge cases that lead me to suggest that SC should always translate its address and trap if that raises a page fault.

I am under the impression that all of these edge cases work fine under my suggestion. Is there a specific situation that worries you?


Cesar Eduardo Barros made a good point that SC should be more lenient and I like the idea of programs being able to use his scenario and work correctly in case (3), but for his scenario to be usable, this behavior needs to be part of the spec.

Unconditional translation is simpler to implement, and I am asking for either some more advanced model (like Barros proposed) to be standardized or the spec to be clarified that portable programs must assume that SC unconditionally translates its address.  A program that assumes failed SCs can still raise a page fault will run (but may enter an infinite loop on an SC that can never succeed) even if failed SCs never trap.  If failed SCs can trap, case (3) must be avoided -- it will crash the program if the race occurs.

Under my suggestion, SCs can never trap under case 3, but they will always trap under case 2 (unless the hart was interrupted. Under the assumption that the SC will eventually be reached without interruptions, i.e., there are not too many interrupts, it will trap then.)
That appears to be the correct behavior.

The point I am trying to make is that you do not need to worry about all failed SCs, only those that "must fail".





Why? The TLB entry could be invalidated by a remote store to the PTE used by the LR. The SC may still succeed after retranslation. Your observation is correct in the specific case we discussed here, where the write to the PTE is preceded by a write to the reservation, but not correct in general.

I have not found any requirement that remote stores to a PTE be visible to the MMU until a local SFENCE.VMA is executed. 

The local SFENCE can not make remote stores visible at all. It only makes the local stores visible (see also excerpt of spec a little below).

Can you tell me where the spec says that I am allowed to assume that a remote store replacing a PTE will be visible locally?

I don't think the spec (I am currently stuck with version 2.2, I can check a more up-to-date version when I'm back at the office next week) clearly says this but I also think that the spec would be broken if that was not the intent.

To be honest I think the remote store is not always visible; but in the case mentioned above it will be visible.

To be more precise, I think that the translation process does not have to appear to be atomic (although I still thought so yesterday), so the following, where a remote store falls between the translation and the load that caused the translation, is valid:
MMU1: translate va using PTE, obtain pa = PTEa
CPU2: overwrite PTE, make non-present
CPU1: read PTE, see non-present
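
The interleaving above can be replayed with a toy Python model in which the translation and the access it serves are separate steps. This is only a sketch: it assumes that `PTEa` denotes the physical address of the PTE itself (so the load reads the very PTE that translated it), and `PTE_ADDR`, `memory`, and `translate` are hypothetical names.

```python
# Shared memory, modeled as a dictionary. The single PTE maps the virtual
# address of interest onto the PTE's own physical address (assumption).
PTE_ADDR = 0x1000
memory = {PTE_ADDR: {"present": True, "pa": PTE_ADDR}}


def translate(pte_addr: int) -> int:
    """Walk one PTE and return the physical address it maps to."""
    pte = memory[pte_addr]
    if not pte["present"]:
        raise LookupError("page fault")
    return pte["pa"]


# MMU1: translate va using the PTE, obtaining pa = PTE_ADDR.
pa = translate(PTE_ADDR)

# CPU2: overwrite the PTE, making it non-present.
memory[PTE_ADDR] = {"present": False, "pa": 0}

# CPU1: perform the load with the already translated address. It reads the
# PTE and sees it non-present, although the translation saw it present.
loaded = memory[pa]
```

The load completes without a fault and yet observes a value that is inconsistent with the translation that enabled it, which is what non-atomic translation means here.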

However, the translation process has to appear to be done anew for each access, so in case of an LR/SC where the SC has a control-dependency on the LR, a remote store to the PTE after the LR is translated but before it is executed would always be visible to the translation of the SC, since the translation of the SC will likely not begin before the branch is evaluated (the "likely" here is due to the memory model not yet having been integrated with virtual memory, so it may be that this translation will in fact be allowed.).


What leads me to believe that this will be visible is a combination of the following: 
The spec on page 48:
"Note that a single instruction may generate multiple accesses, which may not be mutually atomic. [...] Notably, instructions that reference virtual memory are decomposed into multiple accesses."

The description of SFENCE (4.2.1) which says "Instruction execution causes implicit reads and writes to these data structures; however, these implicit references are ordinarily not ordered with respect to loads and stores in the instruction stream. Executing an SFENCE.VMA instruction guarantees that any stores in the instruction stream prior to the SFENCE.VMA are ordered before all implicit references subsequent to the SFENCE.VMA."
The second sentence is useless if in spec addresses can be translated ahead of time for one instruction and then used later, because then you could translate something before the SFENCE and use it afterwards, without getting any ordering guarantees.

The lack of an explicit TLB and lack of ability to flush it, which seems to imply that the translation process has to be done anew for each instruction. This is supported by the snippets above, which say that instruction execution causes additional loads and reads, not that it may cause them. Of course hardware can still use a TLB to avoid doing the translations, as long as it invalidates on remote store and translations with PTEs that have non-idempotent side effects on loads will never be buffered; in this case the TLB is simply a coherent cache (which does not forward local stores -- for that you need sfence).


As I understand the spec, PTE writes are not guaranteed to be visible to remote harts until FENCE is executed (to push the stores to main memory) ...

Sure.

... and IPIs signaled to cause the remote harts to execute SFENCE.VMA.

This is indeed suggested in the commentary of the spec, but since an SFENCE does not by itself order a remote store to the PTE with the local translation, with the current spec, this only has an effect if taking an IPI counts as a store for SFENCE.

I don't think that the correctness of a page fault handler should depend on a technicality like that.



No, the page could have been swapped out but no new page swapped in. I don't know if existing OSs do this to a running user but on RISC-V I don't see why not.

As I understand the current spec, swapping the page out is not guaranteed to be visible to the local MMU until SFENCE.VMA is executed (likely in an IPI handler). 

It does not have to be guaranteed for my objection to be correct, only to be possible :)

The IPI breaks the reservation.  Presumably, similar hardware-accelerated TLB shootdown would also break reservations.

By "similar hardware-accelerated TLB shootdown" you mean an implicit one by remote store to PTE? I don't think so, but possibly "if one of the PTEs used by the LR..."

Jacob Bachmeyer

Dec 31, 2017, 7:14:11 PM12/31/17
to Jonas Oberhauser, Cesar Eduardo Barros, RISC-V ISA Dev
Jonas Oberhauser wrote:
> On Dec 31, 2017 1:43 AM, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
>
> Jonas Oberhauser wrote:
>
>
> I therefore suggest that an SC is translated iff there is an
> immediately preceding LR and it is not the case that the SC
> "must fail".
>
>
> This rule can be simplified since the SC must fail if there is no
> preceding (still valid) LR: An SC is translated iff it is not the
> case that the SC "must fail".
>
>
> Intuitively that is the case but with the current spec wording I
> preferred adding it explicitly to make it clear; I am referring to
> "The SC must fail if there is an observable memory access from another
> hart to the address, or
> if there is an intervening context switch on this hart, or if in the
> meantime the hart executed a
> privileged exception-return instruction."

Fair enough.

> Note also that if the reservation is lost due to anything else ---
> TLB miss, timeout, different va, or such --- the SC is translated
> and may trap, even though HW can "prove" that the SC will fail
> before attempting the translation (I put prove in quotation marks
> because it can not always prove that the SC will fail based on the
> spec, but it can prove it based on the implementation).
>
>
> These are the kinds of edge cases that lead me to suggest that SC
> should always translate its address and trap if that raises a page
> fault.
>
>
> I am under the impression that all of these edge cases work fine under my
> suggestion. Is there a specific situation that worries you?

No specific situation, only a general sense of "fuzziness" on the issue.

> Cesar Eduardo Barros made a good point that SC should be more
> lenient and I like the idea of programs being able to use his
> scenario and work correctly in case (3), but for his scenario to
> be usable, this behavior needs to be part of the spec.
>
> Unconditional translation is simpler to implement, and I am asking
> for either some more advanced model (like Barros proposed) to be
> standardized or the spec to be clarified that portable programs
> must assume that SC unconditionally translates its address. A
> program that assumes failed SCs can still raise a page fault will
> run (but may enter an infinite loop on an SC that can never
> succeed) even if failed SCs never trap. If failed SCs can trap,
> case (3) must be avoided -- it will crash the program if the race
> occurs.
>
>
> Under my suggestion, SCs can never trap under case 3, but they will
> always trap under case 2 (unless the hart was interrupted. Under the
> assumption that the SC will eventually be reached without
> interruptions, i.e., there are not too many interrupts, it will trap
> then.)
> That appears to be the correct behavior.

There is no trap in case (2): the LR saw the read-only flag clear and
the SC saw the value unchanged -- the SC succeeds, since the page is
still writable. (Unless, of course, the supervisor has made the page
copy-on-write, then SC raises a page fault.)

> The point I am trying to make is that you do not need to worry about
> all failed SCs, only those that "must fail".

So this is a way to determine which SCs should ignore page faults and
which should trap on page faults. I like it. It needs to be
standardized so programs can rely on it.

> Why? The TLB entry could be invalidated by a remote store to
> the PTE used by the LR. The SC may still succeed after
> retranslation. Your observation is correct in the specific
> case we discussed here, where the write to the PTE is preceded
> by a write to the reservation, but not correct in general.
>
>
> I have not found any requirement that remote stores to a PTE be
> visible to the MMU until a local SFENCE.VMA is executed.
>
>
> The local SFENCE can not make remote stores visible at all. It only
> makes the local stores visible (see also excerpt of spec a little below).

A remote FENCE is required to make the remote stores visible to the
local data load/store unit. As I understand, SFENCE.VMA is then
required to make the MMU see the same view as the data load/store unit.

> Can you tell me where the spec says that I am allowed to assume
> that a remote store replacing a PTE will be visible locally?
>
>
> I don't think the spec (I am currently stuck with version 2.2, I can
> check a more up-to-date version when I'm back at the office next week)
> clearly says this but I also think that the spec would be broken if
> that was not the intent.
>
> To be honest I think the remote store is not always visible; but in
> the case mentioned above it will be visible.

There was previously an SBI call for remote TLB shootdown; commentary
explained that one implementation would be an IPI where the IPI handler
executes SFENCE.VMA. There was also an SBI call for remote FENCE.I.

> To be more precise, I think that the translation process does not have
> to appear to be atomic (although I still thought so yesterday), so the
> following, where a remote store falls between the translation and the
> load that caused the translation, is valid:
> MMU1: translate va using PTE, obtain pa = PTEa
> CPU2: overwrite PTE, make non-present
> CPU1: read PTE, see non-present
>
> However, the translation process has to appear to be done anew for
> each access, so in case of an LR/SC where the SC has a
> control-dependency on the LR, a remote store to the PTE after the LR
> is translated but before it is executed would always be visible to the
> translation of the SC, since the translation of the SC will likely not
> begin before the branch is evaluated (the "likely" here is due to the
> memory model not yet having been integrated with virtual memory, so it
> may be that this translation will in fact be allowed.).

This is where I think that you misunderstand: if the translation must
appear to be done anew for each access, what is the purpose of SFENCE.VMA?
I think that this is similar to the lack of explicit caches and lack of
ability to flush/prefetch/etc. cachelines. There are still the FENCE
and FENCE.I instructions that implicitly flush/synchronize caches.

> As I understand the spec, PTE writes are not guaranteed to be
> visible to remote harts until FENCE is executed (to push the
> stores to main memory) ...
>
>
> Sure.
>
> ... and IPIs signaled to cause the remote harts to execute SFENCE.VMA.
>
>
> This is indeed suggested in the commentary of the spec, but since an
> SFENCE does not by itself order a remote store to the PTE with the
> local translation, with the current spec, this only has an effect if
> taking an IPI counts as a store for SFENCE.

Simple enough, the IPI handler executes FENCE/SFENCE.VMA. The FENCE
ensures the remote stores (pushed to main memory by a remote FENCE) are
visible to the local data load/store unit. The SFENCE.VMA then ensures
that the same are visible to the MMU.

> I don't think that the correctness of a page fault handler should
> depend on a technicality like that.

I agree that such a dependency is bad.

> No, the page could have been swapped out but no new page
> swapped in. I don't know if existing OSs do this to a running
> user but on RISC-V I don't see why not.
>
>
> As I understand the current spec, swapping the page out is not
> guaranteed to be visible to the local MMU until SFENCE.VMA is
> executed (likely in an IPI handler).
>
>
> It does not have to be guaranteed for my objection to be correct, only
> to be possible :)

It is worse: until SFENCE.VMA is executed, the local MMU may retain
(and use!) a mapping for the now-removed page.

> The IPI breaks the reservation. Presumably, similar
> hardware-accelerated TLB shootdown would also break reservations.
>
>
> By "similar hardware-accelerated TLB shootdown" you mean an implicit
> one by remote store to PTE? I don't think so, but possibly "if one of
> the PTEs used by the LR..."

Implicit TLB shootdown by MMU bus snooping is one option, but the spec
seems to envision implementations that might use an MMIO store in a
hart's control region to perform the same effect as that hart executing
SFENCE.VMA, but asynchronously and without actually interrupting the
target hart.


-- Jacob

Jonas Oberhauser

Dec 31, 2017, 8:06:11 PM12/31/17
to Jacob Bachmeyer, Cesar Eduardo Barros, RISC-V ISA Dev


On Jan 1, 2018 01:14, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
Jonas Oberhauser wrote:


Under my suggestion, SCs can never trap under case 3, but they will always trap under case 2 (unless the hart was interrupted. Under the assumption that the SC will eventually be reached without interruptions, i.e., there are not too many interrupts, it will trap then.)
That appears to be the correct behavior.

There is no trap in case (2):  the LR saw the read-only flag clear and the SC saw the value unchanged -- the SC succeeds, since the page is still writable.  (Unless, of course, the supervisor has made the page copy-on-write, then SC raises a page fault.)

Sorry for the imprecise language. What you wrote is what I meant though -- it will always trap *in case the translation would page fault*.



The point I am trying to make is that you do not need to worry about all failed SCs, only those that "must fail".

So this is a way to determine which SCs should ignore page faults and which should trap on page faults. [...]  It needs to be standardized so programs can rely on it.

Yes.


However, the translation process has to appear to be done anew for each access, so in case of an LR/SC where the SC has a control-dependency on the LR, a remote store to the PTE after the LR is translated but before it is executed would always be visible to the translation of the SC, since the translation of the SC will likely not begin before the branch is evaluated (the "likely" here is due to the memory model not yet having been integrated with virtual memory, so it may be that this translation will in fact be allowed.).

This is where I think that you misunderstand:  if the translation must appear to be done anew for each access, what is the purpose of SFENCE.VMA?

The translation is not ordered in memory order with the other stores in the instruction stream, and unlike normal local loads does not forward from stores in the instruction stream.

SFENCE makes sure that the local stores in the instruction stream are in memory order before any translations caused by subsequent instructions, and it does nothing else (unless I overlooked some lines in the spec/a more up to date version has changed the definition of SFENCE).

I think that this is similar to the lack of explicit caches and lack of ability to flush/prefetch/etc. cachelines.  There are still the FENCE and FENCE.I instructions that implicitly flush/synchronize caches.

I think it is nearly the same, and loads also appear to be done anew. 
In particular, there is no load buffer that can be filled by one load and then supply a stale value to a much later load of the same address; a remote store in MO between the two loads will be visible to the second load.
I think it is almost the same with the translations.
The central difference is that except for the A/D bits, these accesses (and fetch) do not forward from local stores in program order, while the normal loads do.
All caches are nevertheless as per my understanding kept coherent w.r.t. MO.



This is indeed suggested in the commentary of the spec, but since an SFENCE does not by itself order a remote store to the PTE with the local translation, with the current spec, this only has an effect if taking an IPI counts as a store for SFENCE.

Simple enough, the IPI handler executes FENCE/SFENCE.VMA.  The FENCE ensures the remote stores (pushed to main memory by a remote FENCE) are visible to the local data load/store unit.  The SFENCE.VMA then ensures that the same are visible to the MMU.

The local FENCE has no such powers, and neither does the local SFENCE.

The local FENCE can only order local memory operations, and the SFENCE only orders local stores and local translations.

Either one of the following is true:
1) taking the IPI orders all subsequent local translations behind the IPI, which itself is ordered w.r.t. all preceding remote operations, thus ordering local translations with the remote stores. No SFENCE necessary.
2) taking the IPI does not order the subsequent local translations behind the remote stores, but counts as a local store which is ordered with the remote stores. An SFENCE works, but for "magical" reasons: it orders the local translations with the IPI "store", and thus transitively with the remote stores.
3) taking the IPI does not order the subsequent local translations behind the remote stores, and neither will FENCE or SFENCE. To order the remote stores and the translations, one needs an extra local store with no purpose except to have a local store that the SFENCE can order.

The spec seems to imply 1), but the commentary matches none of these. Local SFENCE and FENCE simply can not impose on their own the necessary ordering with their current semantics. 
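
The three alternatives can be compared by treating each ordering guarantee as an edge in a directed graph and asking whether the remote PTE store is ordered before the local translation. The Python sketch below is an informal model, not the RISC-V memory model: the event names, the function names, and the edge sets chosen for each option are assumptions made for illustration. In particular, it assumes the IPI is always ordered after the preceding remote store, and that a local store executed in the handler is ordered after the IPI.

```python
from collections import defaultdict
from typing import Iterable, Tuple

Edge = Tuple[str, str]


def reachable(edges: Iterable[Edge], src: str, dst: str) -> bool:
    """Depth-first search: is dst ordered after src by some chain of edges?"""
    graph = defaultdict(list)
    for a, b in edges:
        graph[a].append(b)
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph[node])
    return False


def translation_ordered(option: int, sfence: bool, extra_store: bool) -> bool:
    """Is the remote store ordered before the local translation?"""
    edges = [("remote_store", "ipi")]          # assumed in all three options
    if option == 1:
        edges.append(("ipi", "translation"))   # the IPI itself orders translations
    local_stores = []
    if option == 2:
        local_stores.append("ipi")             # the IPI counts as a local store
    if extra_store:
        edges.append(("ipi", "local_store"))   # store executed in the handler
        local_stores.append("local_store")
    if sfence:
        # SFENCE orders prior local stores before subsequent translations.
        edges += [(store, "translation") for store in local_stores]
    return reachable(edges, "remote_store", "translation")
```

Under these assumptions, option 1 needs no SFENCE, option 2 needs only the SFENCE, and option 3 needs both an extra local store and the SFENCE.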

By "similar hardware-accelerated TLB shootdown" you mean an implicit one by remote store to PTE? I don't think so, but possibly "if one of the PTEs used by the LR..."

Implicit TLB shootdown by MMU bus snooping is one option, but the spec seems to envision implementations that might use an MMIO store in a hart's control region to perform the same effect as that hart executing SFENCE.VMA, but asynchronously and without actually interrupting the target hart.

Ah, thanks :) I wasn't aware.

PS: Happy new year ;)

Jacob Bachmeyer

Dec 31, 2017, 11:32:03 PM12/31/17
to Jonas Oberhauser, Cesar Eduardo Barros, RISC-V ISA Dev
Jonas Oberhauser wrote:
> On Jan 1, 2018 01:14, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
>
> Jonas Oberhauser wrote:
>
> However, the translation process has to appear to be done anew
> for each access, so in case of an LR/SC where the SC has a
> control-dependency on the LR, a remote store to the PTE after
> the LR is translated but before it is executed would always be
> visible to the translation of the SC, since the translation of
> the SC will likely not begin before the branch is evaluated
> (the "likely" here is due to the memory model not yet having
> been integrated with virtual memory, so it may be that this
> translation will in fact be allowed.).
>
>
> This is where I think that you misunderstand: if the translation
> must appear to be done anew for each access, what is the purpose
> of SFENCE.VMA?
>
>
> The translation is not ordered in memory order with the other stores
> in the instruction stream, and unlike normal local loads does not
> forward from stores in the instruction stream.
>
> SFENCE makes sure that the local stores in the instruction stream are
> in memory order before any translations caused by subsequent
> instructions, and it does nothing else (unless I overlooked some lines
> in the spec/a more up to date version has changed the definition of
> SFENCE).

A commentary block in section 4.2.1 "Supervisor Memory-Management Fence
Instruction" explains:
"""
Note the instruction has no effect on the translations of other harts,
which must be notified separately. One approach is to use 1) a local
data fence to ensure local writes are visible globally, then 2) an
interprocessor interrupt to the other thread, then 3) a local SFENCE.VMA
in the interrupt handler of the remote thread, and finally 4) signal
back to originating thread that operation is complete. This is, of
course, the RISC-V analog to a TLB shootdown. Alternatively,
implementations might provide direct hardware support for remote TLB
invalidation. TLB shootdowns are handled by an SBI call to hide
implementation details.
"""

If we assume that the commentary gives a valid implementation and that
RISC-V does actually need an analog to a TLB shootdown, then remote
SFENCE.VMA must be executed to guarantee that local PTE writes affect
translations on a remote hart, which means that a remote hart is allowed
to use "stale" PTEs until the remote SFENCE.VMA completes.
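
The window in which a remote hart may keep using a "stale" PTE can be illustrated with a toy Python model of the sequence in the commentary. This is a sketch under assumptions: `Hart`, `sfence_vma`, and `shootdown` are hypothetical names, the page table is a shared dictionary, memory is sequentially consistent (so the data fence in step 1 is a no-op), and each hart caches translations until its own SFENCE.VMA, which is one permitted behavior rather than a required one.

```python
from typing import Dict, List, Optional


class Hart:
    def __init__(self, page_table: Dict[int, Optional[int]]):
        self.page_table = page_table               # shared: vpn -> ppn, or None
        self.tlb: Dict[int, Optional[int]] = {}    # private cached translations

    def translate(self, vpn: int) -> Optional[int]:
        # Walk the page table only on a TLB miss; otherwise reuse the cache.
        if vpn not in self.tlb:
            self.tlb[vpn] = self.page_table.get(vpn)
        return self.tlb[vpn]

    def sfence_vma(self) -> None:
        # Model SFENCE.VMA as discarding every cached translation.
        self.tlb.clear()


def shootdown(targets: List[Hart]) -> int:
    """Steps 2 to 4 of the commentary; returns the number of acknowledgements."""
    acks = 0
    for hart in targets:
        hart.sfence_vma()   # steps 2 and 3: IPI whose handler runs SFENCE.VMA
        acks += 1           # step 4: signal completion back to the originator
    return acks
```

If hart B has translated a page and hart A then clears the mapping in the shared table, `B.translate` keeps returning the old physical page until `shootdown([B])` has run.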

> [...] All caches are nevertheless as per my understanding kept
> coherent w.r.t. MO.

Cache coherency is an implementation option in RISC-V as I understand.

> This is indeed suggested in the commentary of the spec, but
> since an SFENCE does not by itself order a remote store to the
> PTE with the local translation, with the current spec, this
> only has an effect if taking an IPI counts as a store for SFENCE.
>
>
> Simple enough, the IPI handler executes FENCE/SFENCE.VMA. The
> FENCE ensures the remote stores (pushed to main memory by a remote
> FENCE) are visible to the local data load/store unit. The
> SFENCE.VMA then ensures that the same are visible to the MMU.
>
>
> The local FENCE has no such powers, and neither does the local SFENCE.

Perhaps in the new memory model the local FENCE is unneeded, but as I
understand the original memory model, a remote (store) FENCE is needed
to ensure that the write is visible to other harts *and* a local (load)
FENCE is needed to ensure that the local hart will see the most-recent
updates to main memory. The overall sequence is/was: (on hart A) store
PTE, FENCE w, MMIO write for IPI to hart B, (on hart B) take IPI, FENCE
r, SFENCE.VMA, return from IPI, use new PTE. Non-coherent caches are
discouraged but permitted in RISC-V. (RISC-V privileged ISA spec sec.
3.5.5 "Coherence and Cacheability PMAs")

> The local FENCE can only order local memory operations, and the SFENCE
> only orders local stores and local translations.
>
> Either one of the following is true:
> 1) taking the IPI orders all subsequent local translations behind the
> IPI, which itself is ordered w.r.t. all preceding remote operations,
> thus ordering local translations with the remote stores. No SFENCE
> necessary.
> 2) taking the IPI does not order the subsequent local translations
> behind the remote stores, but counts as a local store which is ordered
> with the remote stores. An SFENCE works, but for "magical" reasons: it
> orders the local translations with the IPI "store", and thus
> transitively with the remote stores.
> 3) taking the IPI does not order the subsequent local translations
> behind the remote stores, and neither will FENCE or SFENCE. To order
> the remote stores and the translations, one needs an extra local store
> with no purpose except to have a local store that the SFENCE can order.
>
> The spec seems to imply 1), but the commentary matches none of these.
> Local SFENCE and FENCE simply can not impose on their own the
> necessary ordering with their current semantics.

Note that the IPI trap handler will unavoidably need to store the
interrupted context somewhere, so you have local stores when the IPI
trap is taken.

Quirk aside, have we found another erratum in the spec here?


-- Jacob

> PS: Happy new year ;)
PS: Happy New Year from 2017 ;-)

Jonas Oberhauser

Jan 1, 2018, 3:55:13 AM
to Jacob Bachmeyer, Cesar Eduardo Barros, RISC-V ISA Dev


On Jan 1, 2018 05:32, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
> Jonas Oberhauser wrote:
> > On Jan 1, 2018 01:14, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
> > >
> > > Jonas Oberhauser wrote:
> > > >
> > > > However, the translation process has to appear to be done anew
> > > > for each access, so in case of an LR/SC where the SC has a
> > > > control-dependency on the LR, a remote store to the PTE after
> > > > the LR is translated but before it is executed would always be
> > > > visible to the translation of the SC, since the translation of
> > > > the SC will likely not begin before the branch is evaluated
> > > > (the "likely" here is due to the memory model not yet having
> > > > been integrated with virtual memory, so it may be that this
> > > > translation will in fact be allowed).
> > >
> > >
> > > This is where I think that you misunderstand: if the translation
> > > must appear to be done anew for each access, what is the purpose
> > > of SFENCE.VMA?


> > The translation is not ordered in memory order with the other stores in the instruction stream, and unlike normal local loads does not forward from stores in the instruction stream.
> >
> > SFENCE makes sure that the local stores in the instruction stream are in memory order before any translations caused by subsequent instructions, and it does nothing else (unless I overlooked some lines in the spec/a more up to date version has changed the definition of SFENCE).
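
A toy sketch of that distinction, with hypothetical names and no claim to match the spec's formal semantics: an ordinary load forwards from the hart's own pending stores, a translation does not, and SFENCE.VMA places the pending stores in memory order before later translations.

```python
# Toy model; NOT the spec's formal semantics. All names are hypothetical.

class Hart:
    def __init__(self, memory):
        self.memory = memory       # memory as seen in memory order
        self.pending = []          # local stores not yet in memory order

    def store(self, addr, value):
        self.pending.append((addr, value))

    def load(self, addr):
        # Ordinary loads forward from the youngest matching local store.
        for a, v in reversed(self.pending):
            if a == addr:
                return v
        return self.memory[addr]

    def translate(self, pte_addr):
        # Implicit page-table reads do not forward from pending stores,
        # so in this model they may still see the old PTE.
        return self.memory[pte_addr]

    def sfence_vma(self):
        # Earlier local stores become ordered before later translations.
        for a, v in self.pending:
            self.memory[a] = v
        self.pending.clear()


PTE = 0x2000
hart = Hart({PTE: "old"})
hart.store(PTE, "new")
before = (hart.load(PTE), hart.translate(PTE))   # load forwards, walk does not
hart.sfence_vma()
after = (hart.load(PTE), hart.translate(PTE))    # both see the new PTE
print(before, after)
```

Note that the model has a single hart only: nothing in it lets a local SFENCE.VMA order translations against stores made by another hart.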

> A commentary block in section 4.2.1 "Supervisor Memory-Management Fence Instruction" explains:
> """
> Note the instruction has no effect on the translations of other harts, which must be notified separately. One approach is to use 1) a local data fence to ensure local writes are visible globally, then 2) an interprocessor interrupt to the other thread, then 3) a local SFENCE.VMA in the interrupt handler of the remote thread, and finally 4) signal back to originating thread that operation is complete. This is, of course, the RISC-V analog to a TLB shootdown. Alternatively, implementations might provide direct hardware support for remote TLB invalidation. TLB shootdowns are handled by an SBI call to hide implementation details.
> """
>
> If we assume that the commentary gives a valid implementation and that RISC-V does actually need an analog to a TLB shootdown, then remote SFENCE.VMA must be executed to guarantee that local PTE writes affect translations on a remote hart, which means that a remote hart is allowed to use "stale" PTEs until the remote SFENCE.VMA completes.

I agree, under these assumptions.


> > [...] All caches are nevertheless as per my understanding kept coherent w.r.t. MO.
>
> Cache coherency is an implementation option in RISC-V as I understand.
>
> > The local FENCE has no such powers, and neither does the local SFENCE.
>
> Perhaps in the new memory model the local FENCE is unneeded, but as I understand the original memory model, a remote (store) FENCE is needed to ensure that the write is visible to other harts *and* a local (load) FENCE is needed to ensure that the local hart will see the most-recent updates to main memory. The overall sequence is/was: (on hart A) store PTE, FENCE w, MMIO write for IPI to hart B, (on hart B) take IPI, FENCE r, SFENCE.VMA, return from IPI, use new PTE. Non-coherent caches are discouraged but permitted in RISC-V. (RISC-V privileged ISA spec sec. 3.5.5 "Coherence and Cacheability PMAs")

That is correct, but coherence is a per-memory-region setting, not a per-cache setting. I think my comments are still valid on coherent memory regions.


> > The local FENCE can only order local memory operations, and the SFENCE only orders local stores and local translations.
> >
> > Either one of the following is true:
> > 1) taking the IPI orders all subsequent local translations behind the IPI, which itself is ordered w.r.t. all preceding remote operations, thus ordering local translations with the remote stores. No SFENCE necessary.
> > 2) taking the IPI does not order the subsequent local translations behind the remote stores, but counts as a local store which is ordered with the remote stores. An SFENCE works, but for "magical" reasons: it orders the local translations with the IPI "store", and thus transitively with the remote stores.
> > 3) taking the IPI does not order the subsequent local translations behind the remote stores, and neither will FENCE or SFENCE. To order the remote stores and the translations, one needs an extra local store with no purpose except to have a local store that the SFENCE can order.
> >
> > The spec seems to imply 1), but the commentary matches none of these. Local SFENCE and FENCE simply cannot impose on their own the necessary ordering with their current semantics.

> Note that the IPI trap handler will unavoidably need to store the interrupted context somewhere, so you have local stores when the IPI trap is taken.

If the only thing you do in the handler is execute SFENCE, you can probably avoid storing a context, at least if you have a HW dispatcher.

> Quirk aside, have we found another erratum in the spec here?

Possibly, but without knowing what the intended behaviour is, I don't know what the error is. I personally think that the error is in the commentary and possibly in SFENCE, and that the rest of the MMU spec works well. There is no need for a TLB shootdown; it is only that changing translations for a running user can make the user take the old or the new translation, depending on when the translation began. SFENCE may be needed by the local core, but that is not clear until we know the memory model (maybe xRET will be strong enough for that); possibly it will also be needed for incoherent caches.


You could also argue that the spec should work as the commentary suggests: translations can be buffered and stale, and SFENCE should have the power to make a TLB shootdown work. In this case the spec of SFENCE has to be strengthened, and it may be necessary to introduce an explicit TLB (not sure yet).

