Misaligned AMOs


Cesar Eduardo Barros

Dec 16, 2017, 9:13:28 AM
to isa...@groups.riscv.org
I have just noticed a commit to the spec
(https://github.com/riscv/riscv-isa-manual/commit/243563bb4faec7f7b9c704d15678bf457d27b64b)
to allow misaligned LR/SC and AMOs. I think that's a bad change.

The situation before that commit: misaligned LR/SC and AMO are never
allowed, while other misaligned loads/stores can be emulated by the
hardware.

The situation after that commit: either an implementation *must* be able
to do misaligned LR/SC in hardware, or an implementation *must not* be
able to optimize any misaligned loads/stores in hardware.

While it's very common for software to want to operate on misaligned
data (network protocols, file formats, packed data structures), it's
less common for software to want to do atomic operations on misaligned
data. From the other side, I believe it's simpler to emulate a
non-atomic misaligned access in hardware (by breaking it into multiple
loads/stores), than to emulate an atomic misaligned access in hardware,
especially if the misaligned access straddles cache lines (for instance:
if your reservations are on the cache lines, you now must be able to
reserve twice as many).
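
A minimal sketch of the cache-line point, assuming a 64-byte line size. The function names are hypothetical and are not taken from the spec. It only counts how many lines a given access touches, which is the number of reservations a line-granular implementation would need:

```python
def is_misaligned(addr: int, size: int) -> bool:
    """True if an access of `size` bytes at `addr` is not naturally aligned."""
    return addr % size != 0


def lines_touched(addr: int, size: int, line_size: int = 64) -> int:
    """Number of cache lines covered by the byte range [addr, addr + size).

    line_size=64 is an assumption made for illustration only.
    """
    first = addr // line_size
    last = (addr + size - 1) // line_size
    return last - first + 1


# An aligned 8-byte access stays within one line.
print(lines_touched(0x1000, 8))   # 1
# A misaligned 8-byte access starting 4 bytes before a line boundary spans two.
print(lines_touched(0x103C, 8))   # 2
```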

Therefore, these all-or-nothing rules forbid the most useful
implementation option: avoiding the extra complexity of misaligned
atomic accesses, while optimizing in hardware the very common case of
misaligned non-atomic accesses.

Not only that, but that change also imposes extra costs for hardware
that traps to M-mode on misaligned accesses. While before the emulation
code for normal load/store could simply decode the faulting instruction
and do the loads/stores (since misaligned normal loads/stores are not
guaranteed to be atomic), now the emulation code is also required to
acquire a mutex. Either this uses a single global mutex (with all its
performance problems), or this wastes memory with many mutexes and needs
extra code to select the correct mutex for each memory address (and what
if the address straddles more than one mutex region? Now you have to
lock two mutexes).

Also, interrupts could previously be enabled for most of the misaligned
access emulation code; now they can't, since misaligned loads/stores now
have to be atomic against even interrupts, even when emulated.


My opinion is, this whole change is bad and should be reverted.
Misaligned non-AMO loads and stores should never be required to be
atomic. If you want an atomic load of a misaligned address, use LR; if
you want an atomic store of a misaligned address, use AMOSWAP (both are
suggested at the end of the AMO chapter, though here they would be
"relaxed" accesses without aq or rl).

--
Cesar Eduardo Barros
ces...@cesarb.eti.br

Allen J. Baum

Dec 16, 2017, 9:51:06 PM
to Cesar Eduardo Barros, isa...@groups.riscv.org
I totally agree with this. I think
LR/SC to misaligned addresses should either trap or always fail, and
AMOs to misaligned addresses should either trap or not be guaranteed to be atomic.

Intel used to allow misaligned atomic ops, and did away with them (which was a long, tortuous procedure - easy to add features, painful to remove them) - but it was decided that it was worth it, as the complexity (especially in OOO superscalar multiprocessor implementations) was too painful to implement.
(The procedure was amusing - basically they just made them work really, really slowly. Eventually, everybody got the hint and made sure that the address was aligned - at which point it could be safely removed from the architecture).

If it hurts when you do that, don't do that.
And it will hurt.


--
**************************************************
* Allen Baum tel. (908)BIT-BAUM *
* 248-2286 *
**************************************************

Jacob Bachmeyer

Dec 16, 2017, 10:25:06 PM
to Cesar Eduardo Barros, isa...@groups.riscv.org
Cesar Eduardo Barros wrote:
> I have just noticed a commit to the spec
> (https://github.com/riscv/riscv-isa-manual/commit/243563bb4faec7f7b9c704d15678bf457d27b64b)
> to allow misaligned LR/SC and AMOs. I think that's a bad change.

I mostly agree.

> The situation before that commit: misaligned LR/SC and AMO are never
> allowed, while other misaligned loads/stores can be emulated by the
> hardware.

Including any subset, which is what made that really nice -- an
implementation could handle misaligned accesses within a cacheline or
page, but trap to the monitor if a misaligned access spans some critical
microarchitectural boundary.

> The situation after that commit: either an implementation *must* be
> able to do misaligned LR/SC in hardware, or an implementation *must
> not* be able to optimize any misaligned loads/stores in hardware.

The same "subset" flexibility remains, since the restriction is only
with respect to other accesses to the same address and of the same
size. Implementations are still allowed to outright prohibit misaligned
LR/SC and AMO as I read the new text. These limitations only arise if
the implementation allows misaligned atomic access at all.

> While it's very common for software to want to operate on misaligned
> data (network protocols, file formats, packed data structures), it's
> less common for software to want to do atomic operations on misaligned
> data. From the other side, I believe it's simpler to emulate a
> non-atomic misaligned access in hardware (by breaking it into multiple
> loads/stores), than to emulate an atomic misaligned access in
> hardware, especially if the misaligned access straddles cache lines
> (for instance: if your reservations are on the cache lines, you now
> must be able to reserve twice as many).
>
> Therefore, these all-or-nothing rules forbid the most useful
> implementation option: avoiding the extra complexity of misaligned
> atomic accesses, while optimizing in hardware the very common case of
> misaligned non-atomic accesses.

I respectfully disagree: the option to prohibit misaligned atomic
accesses entirely remains, as was previously mandatory.

> Not only that, but that change also imposes extra costs for hardware
> that traps to M-mode on misaligned accesses. While before the
> emulation code for normal load/store could simply decode the faulting
> instruction and do the loads/stores (since misaligned normal
> loads/stores are not guaranteed to be atomic), now the emulation code
> is also required to acquire a mutex. Either this uses a single global
> mutex (with all its performance problems), or this wastes memory with
> many mutexes and needs extra code to select the correct mutex for each
> memory address (and what if the address straddles more than one mutex
> region? Now you have to lock two mutexes).

Now we are getting into some actual problems -- these atomicity
requirements make implementing misaligned AMOs in a supervisor (which
*can* provide atomicity guarantees in a preemptive multitasking
environment with virtual memory) on a system that implements only some
misaligned accesses in hardware very difficult if not impossible. The
best way to fix this is to retain the previous explicit non-guarantee of
atomicity for ordinary unaligned accesses, which were previously
permitted to produce or store "torn values". This allows the monitor to
use its "plain" emulation for unaligned LOAD and STORE, while a
supervisor that chooses to emulate unaligned atomics can still do so.

> Also, interrupts could previously be enabled for most of the
> misaligned access emulation code; now they can't, since misaligned
> loads/stores now have to be atomic against even interrupts, even when
> emulated.

This is a problem for implementations that permit misaligned atomics.
The big problem I have is that this prevents a supervisor from providing
misaligned atomic emulation without special monitor cooperation.

> My opinion is, this whole change is bad and should be reverted.
> Misaligned non-AMO loads and stores should never be required to be
> atomic. If you want an atomic load of a misaligned address, use LR; if
> you want an atomic store of a misaligned address, use AMOSWAP (both
> are suggested at the end of the AMO chapter, though here they would be
> "relaxed" accesses without aq or rl).

I have a mixed view. On one hand, implementations are now allowed to
permit misaligned AMOs, on the other, the details introduced with this
can make software emulation of misaligned AMOs impossible for a
supervisor to accomplish, where that previously was possible.


-- Jacob

David Chisnall

Dec 17, 2017, 6:27:33 AM
to Allen J. Baum, Cesar Eduardo Barros, isa...@groups.riscv.org
On 17 Dec 2017, at 02:51, Allen J. Baum <allen...@esperantotech.com> wrote:
>
> AMOs to misaligned addresses should either trap or not be guaranteed to be atomic.

Please don’t do the latter. Intel does this, and it causes horrible pain debugging: a bug in your mutex alignment code means that your lock and unlock operations are no longer atomic (real-world example!). The symptom: some unrelated code now has races that should be impossible. How do you debug this issue? If they trap, then debugging is simple. If you silently lose the atomicity, then debugging costs huge amounts of developer time.

David

Cesar Eduardo Barros

Dec 17, 2017, 8:04:28 AM
to jcb6...@gmail.com, isa...@groups.riscv.org
On 17-12-2017 01:25, Jacob Bachmeyer wrote:
> Cesar Eduardo Barros wrote:
>> I have just noticed a commit to the spec
>> (https://github.com/riscv/riscv-isa-manual/commit/243563bb4faec7f7b9c704d15678bf457d27b64b)
>> to allow misaligned LR/SC and AMOs. I think that's a bad change.
>
> I mostly agree.
>
>> The situation before that commit: misaligned LR/SC and AMO are never
>> allowed, while other misaligned loads/stores can be emulated by the
>> hardware.
>
> Including any subset, which is what made that really nice -- an
> implementation could handle misaligned accesses within a cacheline or
> page, but trap to the monitor if a misaligned access spans some critical
> microarchitectural boundary.
>
>> The situation after that commit: either an implementation *must* be
>> able to do misaligned LR/SC in hardware, or an implementation *must
>> not* be able to optimize any misaligned loads/stores in hardware.
>
> The same "subset" flexibility remains, since the restriction is only
> with respect to other accesses to the same address and of the same
> size.  Implementations are still allowed to outright prohibit misaligned
> LR/SC and AMO as I read the new text.  These limitations only arise if
> the implementation allows misaligned atomic access at all.

That's not what the change in machine.tex says:

"Memory regions that support aligned LR/SC or aligned AMOs might also
support misaligned LR/SC or misaligned AMOs for some addresses and
access widths. If, for a given address and access width, a misaligned
LR/SC or AMO generates a misaligned address exception, then {\em all}
loads, stores, LRs/SCs, and AMOs using that address and access width
must generate misaligned address exceptions."

There's no conditional here: if misaligned atomic access to an address
traps, _all_ access to that address, atomic or not, must trap. Nothing
in that paragraph allows for a mixed implementation, where misaligned
atomic access is forbidden (that is, traps) while misaligned non-atomic
access is allowed (that is, doesn't trap).

And the explanation following that paragraph shows that the intention is
really to force all access to a misaligned address to either work or trap:

"\begin{commentary}
Mandating that misaligned loads and stores trap wherever misaligned AMOs
trap permits the emulation of misaligned AMOs in an M-mode trap handler.
The handler guarantees atomicity by acquiring a global mutex and
emulating the access within the critical section. Provided that the
handler for misaligned loads and stores uses the same mutex, all
accesses to a given address that use the same word size will be mutually
atomic.
\end{commentary}"

The mutex trick can't work if some of the misaligned accesses are
implemented in hardware, and again there's no exception here: "wherever
misaligned AMOs trap" includes also implementations where misaligned
AMOs aren't allowed at all. Therefore, either you do misaligned AMOs in
hardware, or you aren't allowed to do misaligned normal loads/stores in
hardware.
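
The global-mutex scheme from the commentary can be sketched as follows. This is an illustration under stated assumptions, not the actual M-mode handler: memory is modelled as a Python bytearray, the helper names are hypothetical, and the load returns an unsigned value for simplicity.

```python
import threading

memory = bytearray(64)            # stand-in for physical memory
_global_mutex = threading.Lock()  # the single mutex from the commentary


def emulated_amoadd_w(addr: int, value: int) -> int:
    """Emulate a 32-bit atomic add at any byte address; returns the old value."""
    with _global_mutex:
        old = int.from_bytes(memory[addr:addr + 4], "little")
        new = (old + value) & 0xFFFFFFFF
        memory[addr:addr + 4] = new.to_bytes(4, "little")
        return old


def emulated_lw(addr: int) -> int:
    """Emulate a misaligned 32-bit load under the same mutex."""
    with _global_mutex:
        return int.from_bytes(memory[addr:addr + 4], "little")
```

Atomicity here holds only between accesses that go through `_global_mutex`. A misaligned load or store completed directly by hardware never takes the mutex, which is why the scheme needs those accesses to trap as well.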

>> While it's very common for software to want to operate on misaligned
>> data (network protocols, file formats, packed data structures), it's
>> less common for software to want to do atomic operations on misaligned
>> data. From the other side, I believe it's simpler to emulate a
>> non-atomic misaligned access in hardware (by breaking it into multiple
>> loads/stores), than to emulate an atomic misaligned access in
>> hardware, especially if the misaligned access straddles cache lines
>> (for instance: if your reservations are on the cache lines, you now
>> must be able to reserve twice as many).
>>
>> Therefore, these all-or-nothing rules forbid the most useful
>> implementation option: avoiding the extra complexity of misaligned
>> atomic accesses, while optimizing in hardware the very common case of
>> misaligned non-atomic accesses.
>
> I respectfully disagree:  the option to prohibit misaligned atomic
> accesses entirely remains, as was previously mandatory.

From my reading of the machine.tex changes above, the option to
prohibit misaligned atomic accesses remains, *but* now if you prohibit
misaligned atomic accesses, you must prohibit *all* misaligned accesses,
atomic or not, to the same address.

I agree that the a.tex changes work like you say, that is, from a user
mode or supervisor mode point of view, misaligned atomic access can
still be forbidden (by just not emulating them in M mode). However, from
the hardware point of view, as shown by the machine.tex changes, that
option is not available anymore.

Allen J. Baum

Dec 17, 2017, 7:29:11 PM
to David Chisnall, Cesar Eduardo Barros, isa...@groups.riscv.org
Yeah, I understand.
I have no problem with trapping - but not if it means that non-AMO must trap (as opposed to "may trap"). I can see implementations that can deal with unaligned accesses - but not necessarily one that can guarantee atomicity between the two separate accesses without a lot of work.

I could be talked out of that opinion, but it sounds like either a lot of extra work that isn't necessary or a cut in performance that isn't necessary.

Jacob Bachmeyer

Dec 17, 2017, 8:21:08 PM
to Cesar Eduardo Barros, isa...@groups.riscv.org
Then the change in machine.tex is inconsistent and needs to be fixed.
Two changes are needed: (1) if *all* misaligned AMOs trap to the
supervisor, then no constraints are applied on the handling of
misaligned load/store, and (2) misaligned load/store is *never*
guaranteed to be atomic. The user ISA already specifies: (section 2.6
"Load and Store Instructions") "Furthermore, naturally aligned loads
and stores are guaranteed to execute atomically, whereas misaligned
loads and stores might not, and hence require additional synchronization
to ensure atomicity."

I propose: A misaligned AMO must either be executed atomically or trap,
depending on implementation support. If the monitor cannot ensure that
emulating a specific misaligned AMO will be atomic with respect to all
other accesses, the monitor *must* delegate that specific misaligned AMO
trap to the supervisor. (The supervisor can either emulate the AMO or
abort the program. The supervisor is in a position to guarantee atomic
execution, albeit possibly at significant performance cost.)
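
Read as a decision rule for the monitor's misaligned-access trap handler, the proposal might look like this sketch. The enum names and the `can_guarantee_atomicity` flag are hypothetical; how a monitor would actually determine that flag is left open here.

```python
from enum import Enum


class Access(Enum):
    LOAD = "load"
    STORE = "store"
    AMO = "amo"


class Action(Enum):
    EMULATE_NON_ATOMIC = "emulate without atomicity guarantee"
    EMULATE_ATOMIC = "emulate atomically in the monitor"
    DELEGATE_TO_SUPERVISOR = "delegate trap to the supervisor"


def monitor_policy(access: Access, can_guarantee_atomicity: bool) -> Action:
    """Action for one misaligned access that trapped to the monitor."""
    if access in (Access.LOAD, Access.STORE):
        # Misaligned load/store is never guaranteed to be atomic.
        return Action.EMULATE_NON_ATOMIC
    if can_guarantee_atomicity:
        return Action.EMULATE_ATOMIC
    # The monitor must not emulate an AMO it cannot make atomic.
    return Action.DELEGATE_TO_SUPERVISOR
```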

> I agree that the a.tex changes work like you say, that is, from a user
> mode or supervisor mode point of view, misaligned atomic access can
> still be forbidden (by just not emulating them in M mode). However,
> from the hardware point of view, as shown by the machine.tex changes,
> that option is not available anymore.

You are correct. I had read the a.tex changes and foolishly assumed
that the revision was consistent and actually did what the commit
summary ("Describe optional support for misaligned AMOs (#117)") said.


-- Jacob

Jacob Bachmeyer

Dec 17, 2017, 8:24:50 PM
to Allen J. Baum, David Chisnall, Cesar Eduardo Barros, isa...@groups.riscv.org
Allen J. Baum wrote:
> Yeah, I understand.
> I have no problem with trapping - but not if it means that non-AMO must trap (as opposed to "may trap"). I can see implementations that can deal with unaligned accesses - but not necessarily one that can guarantee atomicity between the two separate accesses without a lot of work.
>
> I could be talked out of that opinion, but it sounds like either a lot of extra work that isn't necessary or a cut in performance that isn't necessary.
>

The simple answer: misaligned load/store always works, but is not
guaranteed to be atomic and may produce or store a "torn value".
Misaligned AMOs either work atomically, including with respect to all
other accesses as far as the AMO is concerned (the AMO will not produce
or store a "torn value") or trap to the supervisor. The monitor may
emulate some misaligned AMOs if it can guarantee atomicity, but *must*
delegate each misaligned AMO trap where atomicity cannot be guaranteed.


-- Jacob

Jose Renau

Dec 18, 2017, 12:39:36 PM
to RISC-V ISA Dev, ces...@cesarb.eti.br

On Dec 17, 2017 10:20 PM, "Andrew Waterman" <wate...@eecs.berkeley.edu> wrote:
Can you go defend our proposal? I’m on vacation and don’t have time.

---------- Forwarded message ---------
From: Cesar Eduardo Barros <ces...@cesarb.eti.br>
Date: Sat, Dec 16, 2017 at 11:13 PM
Subject: [isa-dev] Misaligned AMOs
To: <isa...@groups.riscv.org>


> I have just noticed a commit to the spec
> (https://github.com/riscv/riscv-isa-manual/commit/243563bb4faec7f7b9c704d15678bf457d27b64b)
> to allow misaligned LR/SC and AMOs. I think that's a bad change.
>
> The situation before that commit: misaligned LR/SC and AMO are never
> allowed, while other misaligned loads/stores can be emulated by the
> hardware.
>
> The situation after that commit: either an implementation *must* be able
> to do misaligned LR/SC in hardware, or an implementation *must not* be
> able to optimize any misaligned loads/stores in hardware.

The reason for the change is that some applications use misaligned atomics (I do not know about misaligned LR/SC).
The fact that ARMv8 changed its spec to support misaligned atomics is a good hint that this is a real-world problem. X86 also
supports them. In all these cases, the performance does not need to be great.

The way we wanted to handle this is to support them in software, but with simpler
semantics. The exception handler can emulate misaligned LR/SC and atomics, but
with a slightly more relaxed definition of atomicity: atomicity is maintained only if the location is always accessed at the same
word granularity.

If the CPU designer thinks that this is important for performance, we provide the option of avoiding the exception and supporting
it in hardware.

> While it's very common for software to want to operate on misaligned
> data (network protocols, file formats, packed data structures), it's
> less common for software to want to do atomic operations on misaligned
> data. From the other side, I believe it's simpler to emulate a
> non-atomic misaligned access in hardware (by breaking it into multiple
> loads/stores), than to emulate an atomic misaligned access in hardware,
> especially if the misaligned access straddles cache lines (for instance:
> if your reservations are on the cache lines, you now must be able to
> reserve twice as many).

You are right: different degrees of misaligned support are more or less complex to implement. This is why the
software exception handler could handle some subcases. E.g., a CPU may support misaligned atomics within a cache line, but
raise an exception so that software handles misaligned atomics that cross cache lines.

> Therefore, these all-or-nothing rules forbid the most useful
> implementation option: avoiding the extra complexity of misaligned
> atomic accesses, while optimizing in hardware the very common case of
> misaligned non-atomic accesses.

In hardware, supporting misaligned accesses within a cache line is not necessarily that complex. It depends on the implementation.

> Not only that, but that change also imposes extra costs for hardware
> that traps to M-mode on misaligned accesses. While before the emulation
> code for normal load/store could simply decode the faulting instruction
> and do the loads/stores (since misaligned normal loads/stores are not
> guaranteed to be atomic), now the emulation code is also required to
> acquire a mutex.

The support should be similar to the one required for misaligned non-atomics.

> Either this uses a single global mutex (with all its
> performance problems), or this wastes memory with many mutexes and needs
> extra code to select the correct mutex for each memory address (and what
> if the address straddles more than one mutex region? Now you have to
> lock two mutexes).

This is implementation dependent. At the beginning, I would go with just a single lock. If
it becomes common, it can be optimized to use an array/hash of mutexes as you mention.

> Also, interrupts could previously be enabled for most of the misaligned
> access emulation code; now they can't, since misaligned loads/stores now
> have to be atomic against even interrupts, even when emulated.

Not sure that I follow. The misaligned exception still needs to be supported for non-atomic accesses or an LR/SC pair.
I do not see how to avoid the exception support anyway.

Jose Renau

Dec 18, 2017, 12:42:23 PM
to RISC-V ISA Dev, ces...@cesarb.eti.br

X86 supports misaligned atomics, and ARMv8 recently (8.4) changed its spec to support them. Before 8.4, the spec said exception. The fact that they changed
and that there are some apps out there is a strong hint that we should provide some way to support it, because apps need it.

The current solution says that hardware support is optional. Software can handle it, so there is no extra complexity for a CPU designer who wants to avoid it, apart from
the exception handler, which needs practically the same hardware as misaligned non-atomic accesses.

Jose Renau

Dec 18, 2017, 12:44:31 PM
to RISC-V ISA Dev, allen...@esperantotech.com, ces...@cesarb.eti.br, David.C...@cl.cam.ac.uk

The proposal is to allow software to handle it, and just give the option of providing hardware support if the CPU architect decides that it is good for their applications.

Jose Renau

Dec 18, 2017, 12:49:48 PM
to RISC-V ISA Dev, jcb6...@gmail.com, ces...@cesarb.eti.br

I think that the text may need to be improved, because the key idea is not that hardware must support misaligned atomics and LR/SC. The idea
is that either an exception is raised and handled in software, or the hardware supports the access, in which case no exception is raised.

 Since emulating atomics in software may not be possible, the memory consistency model says that the atomicity is not as strong. It only respects
atomicity if the same word granularity is used. Then, a simple lock with plain LD/ST can handle it in the exception handler.

I think that this provides a simple solution that allows hardware to support it if wanted, and provides more consistency with other ISAs that support
misaligned atomics (X86 and newer ARMv8). We do not want to have apps crash/fault in RISC-V while running correctly in all the other ISAs.

Jacob Bachmeyer

Dec 18, 2017, 7:24:39 PM
to Jose Renau, RISC-V ISA Dev, ces...@cesarb.eti.br
Jose Renau wrote:
> X86 supports misaligned and ARMv8 recently changed (8.4) the spec to
> support misaligned. Before 8.4, the spec said exception. The fact that
> they changed
> and that there are some apps around there is a strong hint that we
> should provide some way to support it because apps need it.

We had a way to support misaligned AMO: the misaligned AMO trap can be
delegated to the supervisor. No program truly needs misaligned AMOs --
synchronization variables can always be aligned by simply inserting
padding. Find a counterexample to that before saying "apps need it".
Lazy developers may *like* misaligned AMOs, but apps do not *need*
misaligned AMOs and the current wording removes an option for efficient
handling of ordinary unaligned load/store that many programs actually do
need for hot code paths that must process packed structures such as
network packet headers. Efficient hardware support for unaligned load
replaces a seven instruction sequence: adjust pointer, aligned load,
shift, mask, second aligned load, mask, OR. Trapping to the monitor is
much slower, but that gives an incentive to develop hardware that can
handle (most) unaligned accesses, while leaving the tough edge cases
(like spanning pages) for the monitor to handle.
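
The software sequence for an unaligned load can be sketched as below, following the steps listed above. This assumes a 32-bit little-endian word and models memory as a Python bytes object; the function names are hypothetical, and the step-to-instruction mapping is only approximate.

```python
memory = bytes(range(16))  # sample memory: byte i holds the value i


def aligned_lw(addr: int) -> int:
    """32-bit little-endian load; only naturally aligned addresses allowed."""
    assert addr % 4 == 0, "aligned_lw requires a 4-byte aligned address"
    return int.from_bytes(memory[addr:addr + 4], "little")


def unaligned_lw(addr: int) -> int:
    """Assemble a 32-bit value at any byte address from two aligned loads."""
    offset = addr & 3
    base = addr - offset                      # adjust pointer
    lo = aligned_lw(base)                     # first aligned load
    if offset == 0:
        return lo                             # already aligned, one load suffices
    shift = 8 * offset
    lo >>= shift                              # shift out the unwanted low bytes
    hi = aligned_lw(base + 4)                 # second aligned load
    hi = (hi << (32 - shift)) & 0xFFFFFFFF    # position and mask the high part
    return lo | hi                            # OR the two halves together


print(hex(unaligned_lw(1)))  # 0x4030201
```

Nothing in this sequence is atomic: another hart can modify either word between the two aligned loads.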

Handling misaligned AMOs in hardware is *far* more complex than handling
unaligned (and non-atomic!) loads and stores. The current wording
*requires* *all* unaligned accesses to a location to trap if AMOs would
trap. This eliminates an entire class of useful implementations.

> The current solution says that hardware support is optional. The
> software solution can handle it, so not extra complexity if the CPU
> designer wants to avoid besides
> the exception handler which is practically the same hardware as the
> misaligned for not atomics.

The problem is that the current wording attempts to make unaligned
load/store atomic with respect to misaligned AMO. This results in
forbidding the previously encouraged behavior of implementing some
(fast) subset of unaligned load/store in hardware while leaving
misaligned AMOs to trap. Worse, this results in a significant
performance penalty for the monitor code path that handles unaligned
load/store, since it now must serialize access and contain a critical
section. The correct change to this is to restore allowing unaligned
load/store to be entirely non-atomic.


-- Jacob

Jacob Bachmeyer

Dec 18, 2017, 7:36:41 PM
to Jose Renau, RISC-V ISA Dev, ces...@cesarb.eti.br
Jose Renau wrote:
> I think that the text may need to be improved because the key idea is
> not that hardware must support misaligned atomics and LR/SC. The idea
> is that either exception to handle in software or the hardware
> supports it in which case no exception is raised.

That idea is sound for the RVA instructions. What is *not* sound is
requiring RVI load/store to trap if an AMO to the same location would
trap. Unaligned load/store is *not* required to be atomic in RISC-V,
but this change sneaks a far larger change (atomic unaligned load/store)
through the back door. It is that larger change that has prompted
objections.

> Since emulating atomics in software may not be possible, the memory
> consistency model says that the atomicity is not as strong. It only
> respects
> atomicity if the same word granularity is used. Then, a simple lock
> with plain LD/ST can handle it in the exception handler.

The problem is that making unaligned RVI load/store atomic with respect to
misaligned AMOs prevents implementing unaligned load/store in
hardware while still trapping for misaligned AMO, which is much harder
to implement in hardware.

> I think that this provides a simple solution that allows hardware to
> support if wanted, and provides more consistency with other ISAs that
> support
> misaligned (X86 and newer ARMv8). We do not want to have apps
> crash/fault in RISC-V while running correctly in all the other ISAs.

If David Chisnall's earlier message is accurate, we very much *do* want
programs to fault in RISC-V if they have misaligned synchronization
variables -- x86 allows such operations _but_ _does_ _not_ _guarantee_
_that_ _misaligned_ _"atomics"_ _are_ _actually_ _atomic_ any more.
This is a far more fertile source of bugs than simply immediately
crashing a program that asks for something that the hardware cannot deliver.

Further, even if we allow that misaligned AMOs should be permitted, why
should they make ordinary unaligned load/store atomic and therefore much
more expensive? If you have a synchronization variable that you cannot
ensure is aligned, use only AMOs to access it.


-- Jacob

David Chisnall

Dec 19, 2017, 6:08:44 AM
to jcb6...@gmail.com, Jose Renau, RISC-V ISA Dev, ces...@cesarb.eti.br
On 19 Dec 2017, at 00:36, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> If David Chisnall's earlier message is accurate, we very much *do* want programs to fault in RISC-V if they have misaligned synchronization variables -- x86 allows such operations _but_ _does_ _not_ _guarantee_ _that_ _misaligned_ _"atomics"_ _are_ _actually_ _atomic_ any more. This is a far more fertile source of bugs than simply immediately crashing a program that asks for something that the hardware cannot deliver.

I don’t know that this is the case with recent x86, but it certainly was around 5-6 years ago. Given that they now have hardware transactional memory, it’s entirely possible that they crack atomic RMW instructions into micro-ops that use the transactional hardware if they span multiple cache lines / pages (if they’re in the same page but different cache lines then it’s possible that they might simply lock two cache lines in the exclusive state, though that adds some complexity to the cache coherency mechanism).

> Further, even if we allow that misaligned AMOs should be permitted, why should they make ordinary unaligned load/store atomic and therefore much more expensive? If you have a synchronization variable that you cannot ensure is aligned, use only AMOs to access it.

If other loads and stores are not (relaxed consistency) atomic then you are going to hit a lot of fun corner cases in trying to implement the atomic versions, because even doing a piecewise atomic compare and exchange can interact with the lack of atomicity in the other loads and stores.

As a colleague recently pointed out to me, it’s not very helpful to think of atomicity in the abstract, without defining what it is atomic with respect to. If an atomic increment is not atomic with respect to a non-atomic load, then this is problematic. There is a lot of C code that assumes that it can do non-atomic loads and stores of variables and get either the ‘before’ or ‘after’ versions[1], so being able to do an atomic increment but load a value from another thread that is neither the incremented nor the non-incremented value will cause confusion.
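
A deterministic sketch of such a torn value, assuming a misaligned 32-bit store that is split at the aligned word boundary into two partial writes. The generator is a hypothetical stand-in for "another hart gets to run between the two halves"; no real hardware behaviour is being reproduced.

```python
memory = bytearray(8)


def store_misaligned_in_two_steps(addr: int, value: int):
    """Write a 32-bit value in two parts, yielding between them."""
    data = value.to_bytes(4, "little")
    split = 4 - (addr % 4)                          # bytes that fit in the first word
    memory[addr:addr + split] = data[:split]        # first partial write
    yield                                           # an observer may run here
    memory[addr + split:addr + 4] = data[split:]    # second partial write


def load4(addr: int) -> int:
    return int.from_bytes(memory[addr:addr + 4], "little")


# 'Before' value 0xFF at misaligned address 3; the increment stores 0x100.
memory[3:7] = (0xFF).to_bytes(4, "little")
step = store_misaligned_in_two_steps(3, 0x100)
next(step)            # only the first partial write has happened
torn = load4(3)       # 0x0: neither 0xFF nor 0x100
for _ in step:        # let the store finish
    pass
final = load4(3)      # 0x100
print(hex(torn), hex(final))  # 0x0 0x100
```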

It is far better to simply trap if you can’t do this safely than to subtly break code. There are basically three options here:

- Impose very complex requirements on the hardware that will add cost for little benefit to software.
- Make a small amount of software fail in subtle and very difficult to debug ways.
- Make a small amount of software trap and fail with a useful error message.

My vote would be for the third one. Anyone who really wants to do atomic operations spanning cache lines should wait for the transactional memory extension and use that.

David

[1] Yes, this is undefined behaviour in C. If you can find one nontrivial C program that doesn’t rely on undefined behaviour and is not seL4, then I’ll accept this as a counter argument.

Andrew Waterman

Dec 19, 2017, 9:37:49 AM
to David Chisnall, Jose Renau, RISC-V ISA Dev, ces...@cesarb.eti.br, jcb6...@gmail.com
On Tue, Dec 19, 2017 at 8:08 PM David Chisnall <David.C...@cl.cam.ac.uk> wrote:
On 19 Dec 2017, at 00:36, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> If David Chisnall's earlier message is accurate, we very much *do* want programs to fault in RISC-V if they have misaligned synchronization variables -- x86 allows such operations _but_ _does_ _not_ _guarantee_ _that_ _misaligned_ _"atomics"_ _are_ _actually_ _atomic_ any more.  This is a far more fertile source of bugs than simply immediately crashing a program that asks for something that the hardware cannot deliver.

I don’t know that this is the case with recent x86, but it certainly was around 5-6 years ago.  Given that they now have hardware transactional memory, it’s entirely possible that they crack atomic RMW instructions into micro-ops that use the transactional hardware if they span multiple cache lines / pages (if they’re in the same page but different cache lines then it’s possible that they simply lock two cache lines in the exclusive state, though that adds some complexity to the cache coherency mechanism).

> Further, even if we allow that misaligned AMOs should be permitted, why should they make ordinary unaligned load/store atomic and therefore much more expensive?  If you have a synchronization variable that you cannot ensure is aligned, use only AMOs to access it.

If other loads and stores are not (relaxed consistency) atomic then you are going to hit a lot of fun corner cases in trying to implement the atomic versions, because even doing a piecewise atomic compare and exchange can interact with the lack of atomicity in the other loads and stores.

Yeah, it’s just not realistic to expect that all accesses to a variable use the appropriate atomicity annotations. Permitting misaligned AMOs but allowing misaligned loads to be non-atomic would be a mess.



As a colleague recently pointed out to me, it’s not very helpful to think of atomicity in the abstract, without defining what it is atomic with respect to.  If an atomic increment is not atomic with respect to a non-atomic load, then this is problematic.  There is a lot of C code that assumes that it can do non-atomic loads and stores of variables and get either the ‘before’ or ‘after’ versions[1], so being able to do an atomic increment but load a value from another thread that is neither the incremented or non-incremented value will cause confusion.

It is far better to simply trap if you can’t do this safely than to subtly break code.  There are basically three options here:

 - Impose very complex requirements on the hardware that will add cost for little benefit to software.
 - Make a small amount of software fail in subtle and very difficult to debug ways.
 - Make a small amount of software trap and fail with a useful error message.

Door #3 remains our proposal for the standard Unix-like platforms.



My vote would be for the third one.  Anyone who really wants to do atomic operations spanning cache lines should wait for the transactional memory extension and use that.

David

[1] Yes, this is undefined behaviour in C.  If you can find one nontrivial C program that doesn’t rely on undefined behaviour and is not seL4, then I’ll accept this as a counter argument.

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

Bruce Hoult

Dec 19, 2017, 9:45:23 AM12/19/17
to Andrew Waterman, David Chisnall, Jose Renau, RISC-V ISA Dev, Cesar Eduardo Barros, Jacob Bachmeyer
On Tue, Dec 19, 2017 at 5:37 PM, Andrew Waterman <wate...@eecs.berkeley.edu> wrote:
It is far better to simply trap if you can’t do this safely than to subtly break code.  There are basically three options here:

 - Impose very complex requirements on the hardware that will add cost for little benefit to software.
 - Make a small amount of software fail in subtle and very difficult to debug ways.
 - Make a small amount of software trap and fail with a useful error message.

Door #3 remains our proposal for the standard Unix-like platforms.

This is sensible.

The one fly in the ointment is if people want to JIT or otherwise translate x86 binaries to run on RISC-V.

Maybe it's an even bigger problem with porting C code, as at least an emulator will *know* it's dealing with x86 assumptions and can insert extra fences or runtime tests or whatever.

Not being TSO is a related challenge, and can also affect both source code ports and binary emulation.


 

Jose Renau

Dec 19, 2017, 2:47:00 PM12/19/17
to RISC-V ISA Dev, re...@ucsc.edu, ces...@cesarb.eti.br, jcb6...@gmail.com

 (see inline)


On Monday, December 18, 2017 at 4:24:39 PM UTC-8, Jacob Bachmeyer wrote:
Jose Renau wrote:
>  X86 supports misaligned and ARMv8 recently changed (8.4) the spec to
> support misaligned. Before 8.4, the spec said exception. The fact that
> they changed
> and that there are some apps around there is a strong hint that we
> should provide some way to support it because apps need it.

We had a way to support misaligned AMO:  the misaligned AMO trap can be
delegated to the supervisor.  No program truly needs misaligned AMOs --
synchronization variables can always be aligned by simply inserting
padding.  Find a counterexample to that before saying "apps need it".

 By "apps need it", I mean that some current apps will run on x86 and ARMv8 with just a recompile, but will require code changes to run on RISC-V. I think that anything that requires code changes just because it is a new ISA is not a good thing to have.

 
 
Lazy developers may *like* misaligned AMOs, but apps do not *need*
misaligned AMOs and the current wording removes an option for efficient
handling of ordinary unaligned load/store that many programs actually do
need for hot code paths that must process packed structures such as
network packet headers.  Efficient hardware support for unaligned load
replaces a seven instruction sequence:  adjust pointer, aligned load,
shift, mask, second aligned load, mask, OR.  Trapping to the monitor is
much slower, but that gives an incentive to develop hardware that can
handle (most) unaligned accesses, while leaving the tough edge cases
(like spanning pages) for the monitor to handle.
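
A minimal sketch of the software sequence listed above (adjust pointer, two aligned loads, shifts, OR), assuming little-endian byte order as on RISC-V. The function name `load_unaligned_u32` and the word-array interface are hypothetical, chosen only so the example is self-contained; it is not taken from any existing monitor.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit value starting at byte offset byte_off of a word array,
 * using only aligned 32-bit loads. Assumes little-endian byte order and
 * that the word following the first one may be read when byte_off is
 * not a multiple of 4. */
uint32_t load_unaligned_u32(const uint32_t *words, size_t byte_off)
{
    const uint32_t *p = words + byte_off / 4;      /* adjust pointer */
    unsigned shift = (unsigned)(byte_off % 4) * 8; /* misalignment in bits */
    uint32_t lo = p[0];                            /* first aligned load */

    if (shift == 0)
        return lo;  /* already aligned; also avoids an undefined 32-bit shift */

    uint32_t hi = p[1];                            /* second aligned load */
    return (lo >> shift) | (hi << (32 - shift));   /* discard, position, OR */
}
```

Nothing here is atomic: another hart can store between the two aligned loads, which is exactly the torn-result behaviour the user ISA permits for misaligned loads.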

Handling misaligned AMOs in hardware is *far* more complex than handling
unaligned (and non-atomic!) loads and stores.  The current wording
*requires* *all* unaligned accesses to a location to trap if AMOs would
trap.  This eliminates an entire class of useful implementations.

 It depends on the hardware. I cannot provide details (NDA), but I have worked
on an ARMv8 implementation that handles non-aligned atomics, and it was not an issue, nor
more complex than handling non-aligned LD/ST (using the alignment constraints that ARMv8.4 has).
Also, as long as the misaligned access stays inside the cache line, there is no performance impact.
 
> The current solution says that hardware support is optional. The
> software solution can handle it, so not extra complexity if the CPU
> designer wants to avoid besides
> the exception handler which is practically the same hardware as the
> misaligned for not atomics.

The problem is that the current wording attempts to make unaligned
load/store atomic with respect to misaligned AMO.  This results in
forbidding the previously encouraged behavior of implementing some
(fast) subset of unaligned load/store in hardware while leaving
misaligned AMOs to trap.  Worse, this results in a significant
performance penalty for the monitor code path that handles unaligned
load/store, since it now must serialize access and contain a critical
section.  The correct change to this is to restore allowing unaligned
load/store to be entirely non-atomic.

 Ok, I think that I see your point. Is it because of this sentence?

"regular loads and stores using misaligned addresses also execute atomically with respect
to other accesses to the same address and of the same size"

 Let me rephrase. Your concern is that now the plain LD/ST misaligned exception handler will have to make
the misaligned access atomic with respect to that address. To do so, it needs a global lock or an array of locks
so that the 2 loads are performed atomically. It will also require a fence, as in the atomic misaligned
code (or something with similar functionality in the exception handler). Otherwise, you cannot make
the operation happen atomically.
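
As a rough illustration of the "array of locks" alternative to one global lock, here is a sketch of how an emulation handler might pick which locks an access needs. Everything in it is an assumption made for the example: the 64-byte line size, the table of 256 locks, and the names `lock_index` and `locks_for_access`.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64u   /* assumed cache-line size */
#define NUM_LOCKS  256u  /* assumed size of the lock table */

/* Map an address to a lock slot by hashing its cache-line number. */
static size_t lock_index(uintptr_t addr)
{
    return (size_t)((addr / LINE_BYTES) % NUM_LOCKS);
}

/* Fill idx[] with the lock slots covering [addr, addr + width) and return
 * how many there are (1 or 2). Slots are reported in ascending order so
 * that every handler acquires them in the same order. */
size_t locks_for_access(uintptr_t addr, size_t width, size_t idx[2])
{
    size_t a = lock_index(addr);
    size_t b = lock_index(addr + width - 1);

    if (a == b) {
        idx[0] = a;
        return 1;
    }
    idx[0] = a < b ? a : b;
    idx[1] = a < b ? b : a;
    return 2;
}
```

An access that straddles two lines needs both locks, and taking them in a fixed order is what keeps two handlers from deadlocking against each other.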

 Is that your concern?

 I agree that unless some additional hardware support is provided, this will slow down the misaligned (non-atomic)
loads and stores. In my non-optimized example, instead of 2 loads, it could have a LR/SC pair, 2 loads, and a fence.

 My reason for wanting such support is to allow "applications" to run on RISC-V without code changes. If they have
many misaligned accesses, the logical thing would be to "optimize" by adding padding, as you say. Notice that
the goal is to make it a performance optimization, not a correctness optimization.


-- Jacob

Cesar Eduardo Barros

Dec 19, 2017, 4:17:49 PM12/19/17
to Jose Renau, RISC-V ISA Dev, jcb6...@gmail.com
That's not the main concern. The main concern is that now it's required
to always call the plain LD/ST misaligned exception handler, even when
it could otherwise be done in hardware. Making the plain LD/ST
misaligned exception handler harder to write and slower is just the
icing on the cake.

>  I agree that unless some additional hardware support is provided, this
> will slow down the misaligned (non-atomic)
> loads and stores. In my non-optimized example, instead of 2 loads, it
> could have a LR/SC pair, 2 loads, and a fence.
>
>  My concern to have such support is to allow "applications" to run in
> RISC-V without code change. if they have
> many misaligned accesses, the logical thing would be to "optimize" by
> adding padding as you say. Notice that
> the goal is to make it a performance optimization not a correctness
> optimization.

For many situations where misaligned non-atomic access is important, you
can't add padding. For instance, network protocols. As an example, the
Ethernet header is 14 bytes (6+6+2); the IPv4 header has all fields
naturally aligned; but when an IPv4 header follows an Ethernet header (a
very common situation), all of its 32-bit fields are now misaligned by
two bytes.
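
The arithmetic can be checked with a few lines of C. This is only a sketch: the constant and function names are made up for the example, while the sizes are the ones quoted above and the IPv4 source and destination address offsets (12 and 16 bytes into the IPv4 header) come from the standard header layout.

```c
#include <stddef.h>

enum {
    ETH_HEADER_BYTES = 14, /* 6 (dst MAC) + 6 (src MAC) + 2 (EtherType) */
    IPV4_SRC_OFFSET  = 12, /* source address, offset within the IPv4 header */
    IPV4_DST_OFFSET  = 16  /* destination address, offset within the IPv4 header */
};

/* Misalignment, in bytes, of an IPv4 field of size field_size located
 * field_offset bytes into the IPv4 header, for a frame that starts at
 * address frame_base. A result of 0 means naturally aligned. */
size_t misalignment(size_t frame_base, size_t field_offset, size_t field_size)
{
    return (frame_base + ETH_HEADER_BYTES + field_offset) % field_size;
}
```

For a frame that starts on a 4-byte boundary, `misalignment(0, IPV4_SRC_OFFSET, 4)` and `misalignment(0, IPV4_DST_OFFSET, 4)` both evaluate to 2.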

Jose Renau

Dec 19, 2017, 4:28:59 PM12/19/17
to Cesar Eduardo Barros, RISC-V ISA Dev, jcb6...@gmail.com

If the hardware can handle misaligned LD/ST atomically in hardware,
there is no need to call the exception.
If the two parts of the LD/ST are globally visible without intervening
load/stores, there is no potential problem.

The only issue would be if the misaligned LD/ST hardware support
globally performed the two non-aligned
operations sequentially allowing remote LD/STs to happen in between.




On 12/19/2017 1:17:39 PM, "Cesar Eduardo Barros" <ces...@cesarb.eti.br>
wrote:

Cesar Eduardo Barros

Dec 19, 2017, 5:42:32 PM12/19/17
to Jose Renau, RISC-V ISA Dev, jcb6...@gmail.com
Em 19-12-2017 19:28, Jose Renau escreveu:
>
>  If the hardware can handle misaligned LD/ST atomically in hardware,
> there is no need to call the exception.
> If the two parts of the LD/ST are globally visible without intervening
> load/stores, there is no potential problem.
>
>  The only issue would be if the misaligned LD/ST hardware support
> globally performed the two non-aligned
> operations sequentially allowing remote LD/STs to happen in between.

Interesting. That might work, as long as the emulation for misaligned
AMOs does plain LD/ST within the mutex, instead of loading/storing each
piece by hand.

That is, we have the following options for the hardware:

A. No misaligned access at all, all misaligned accesses trap.
B. Misaligned LD/ST, split into two or more accesses (non-atomic).
C. Misaligned LD/ST, indivisible (atomic).
D. Misaligned LD/ST/LR/SC/AMO.

The problematic one is option B, which used to be allowed but is now
forbidden. What you just suggested is option C.

From the user space point of view, the options are:

1. All misaligned accesses trap.
2. Misaligned atomics trap, misaligned non-atomics don't.
3. No misaligned data traps.

Of course, hardware option D leads to user space option 3, and user
space option 1 needs hardware option A.

For user space option 2, and only it, hardware option B is a valid
option. That used to be the RISC-V way (through emulation if necessary).
You want to allow option 3, but that rules out option B, leaving only
option A (costly when unaligned non-atomic accesses are common) or
option D (more complex hardware). This new option C looks like a good
compromise, and the cache tricks to make it work probably aren't that
complicated (perhaps something as simple as holding both cache lines as
shared/exclusive at the same time while doing the load/store, being
careful to not fall into an AB/BA deadlock).

There's just one wart. The ideal misaligned atomic access emulation for
option A holds the mutex and loads/stores the pieces by hand, but that
won't work for option C. Consider the following sequence:

hart 1                  hart 2
lock mutex
load first piece
..                      store new value
load second piece
unlock mutex

That is, for option C, the emulation code must allow the hardware to do
the misaligned load:

hart 1                  hart 2
lock mutex
load value
..                      store new value
unlock mutex

But for option A, doing it that way requires a recursive mutex:

hart 1
lock mutex
load value (traps)
(lock mutex)
(store first piece)
(store second piece)
(unlock mutex)
(return from trap)
unlock mutex
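
For reference, a minimal sketch of the option A style of emulation, written as ordinary C rather than as real M-mode trap-handler code. The names (`emu_lock`, `emu_load_u32`, `emu_amoadd_u32`) are hypothetical, a C11 `atomic_flag` spinlock stands in for the mutex, and `memcpy` stands in for accessing the pieces by hand. Because the handlers never issue a misaligned access themselves, they cannot re-trap and the lock does not need to be recursive.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Single global lock shared by all emulation paths (the simplest choice). */
static atomic_flag emu_lock = ATOMIC_FLAG_INIT;

static void emu_lock_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&emu_lock, memory_order_acquire)) {
        /* spin until the holder releases the lock */
    }
}

static void emu_lock_release(void)
{
    atomic_flag_clear_explicit(&emu_lock, memory_order_release);
}

/* Emulated misaligned 32-bit load: pieces are read under the lock. */
uint32_t emu_load_u32(const void *addr)
{
    uint32_t value;
    emu_lock_acquire();
    memcpy(&value, addr, sizeof value);
    emu_lock_release();
    return value;
}

/* Emulated misaligned AMOADD.W: read, add, and write back under the lock.
 * Returns the old value, as the real instruction does. */
uint32_t emu_amoadd_u32(void *addr, uint32_t increment)
{
    uint32_t old, updated;
    emu_lock_acquire();
    memcpy(&old, addr, sizeof old);
    updated = old + increment;
    memcpy(addr, &updated, sizeof updated);
    emu_lock_release();
    return old;
}
```

This is only atomic with respect to accesses that go through the same lock. A store performed directly by hardware on another hart (option C) never takes `emu_lock`, which is the problem shown in the first sequence above.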

Jacob Bachmeyer

Dec 19, 2017, 6:24:33 PM12/19/17
to Jose Renau, Cesar Eduardo Barros, RISC-V ISA Dev
Jose Renau wrote:
> If the hardware can handle misaligned LD/ST atomically in hardware,
> there is no need to call the exception.
> If the two parts of the LD/ST are globally visible without intervening
> load/stores, there is no potential problem.
>
> The only issue would be if the misaligned LD/ST hardware support
> globally performed the two non-aligned
> operations sequentially allowing remote LD/STs to happen in between.

Then the current wording (privileged ISA spec section 3.5.3 "Atomicity
PMAs") is unacceptably unclear on the matter: "If, for a given address
and access width, a misaligned LR/SC or AMO generates a misaligned
address exception, then all loads, stores, LRs/SCs, and AMOs using that
address and access width must generate misaligned address exceptions."

This does not allow even for atomic hardware unaligned load/store
support -- this (to me at least) clearly mandates that "all loads,
stores, ... generate misaligned address exceptions." The user ISA
specifically states (section 2.6, "Load and Store Instructions"):
"Furthermore, naturally aligned loads and stores are guaranteed to
execute atomically, whereas misaligned loads and stores might not, and
hence require additional synchronization to ensure atomicity."

The user ISA thus permits hardware that performs unaligned load/store by
breaking up the operations non-atomically to adjacent aligned
locations. For the cases Cesar Eduardo Barros describes, this is fine;
IP network code will be the only hart looking at the packet headers. In
general, this fits well with RVWMO, although explicitly stating that
intermediate values produced by an unaligned store may be visible and
are undefined is probably a good idea. (One possible implementation
splits an unaligned store into either a wider AMOAND/AMOOR pair or two
same-width pairs. Multiple harts executing these simultaneously could
produce a wide variety of intermediate values and torn results. The
RISC-V user ISA allows this for RVI LOAD/STORE as I read it.)


-- Jacob

Jacob Bachmeyer

Dec 19, 2017, 6:25:10 PM12/19/17
to David Chisnall, Jose Renau, RISC-V ISA Dev, ces...@cesarb.eti.br
David Chisnall wrote:
> On 19 Dec 2017, at 00:36, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>> Further, even if we allow that misaligned AMOs should be permitted, why should they make ordinary unaligned load/store atomic and therefore much more expensive? If you have a synchronization variable that you cannot ensure is aligned, use only AMOs to access it.
>>
>
> If other loads and stores are not (relaxed consistency) atomic then you are going to hit a lot of fun corner cases in trying to implement the atomic versions, because even doing a piecewise atomic compare and exchange can interact with the lack of atomicity in the other loads and stores.
>

RISC-V does not have compare and swap, and I have been working on a
proposal for multiple-word LR/SC. If you can take reservations on two
adjacent addresses and retry if either changes before updating both,
implementing misaligned AMO should become easy. (I am also planning a
more powerful RVT proposal to put multiple-word LR/SC in perspective.
:-) )

> As a colleague recently pointed out to me, it’s not very helpful to think of atomicity in the abstract, without defining what it is atomic with respect to. If an atomic increment is not atomic with respect to a non-atomic load, then this is problematic. There is a lot of C code that assumes that it can do non-atomic loads and stores of variables and get either the ‘before’ or ‘after’ versions[1], so being able to do an atomic increment but load a value from another thread that is neither the incremented or non-incremented value will cause confusion.
>

So AMOs need to be atomic with respect to loads, such that an unaligned
load cannot see an intermediate value produced by an AMO. Does the
existing user ISA caveat that unaligned load/store can produce torn
results need to be changed to provide for atomic unaligned memory access
in all cases? That could require exposing unaligned access traps to the
supervisor, where currently the supervisor sees only misaligned AMO traps.

When you mention atomic increment, do you mean specifically AMOADD, or
are we talking about common "var++" where "var" is not marked
"volatile"? The latter is a much harder problem.

> It is far better to simply trap if you can’t do this safely than to subtly break code. There are basically three options here:
>
> - Impose very complex requirements on the hardware that will add cost for little benefit to software.
> - Make a small amount of software fail in subtle and very difficult to debug ways.
> - Make a small amount of software trap and fail with a useful error message.
>
> My vote would be for the third one.

I agree with that goal but am uncertain what will be required to
implement it.


-- Jacob

Andrew Waterman

Dec 19, 2017, 8:18:06 PM12/19/17
to Cesar Eduardo Barros, Jose Renau, RISC-V ISA Dev, Jacob Bachmeyer
On Wed, Dec 20, 2017 at 7:42 AM, Cesar Eduardo Barros
<ces...@cesarb.eti.br> wrote:
> Em 19-12-2017 19:28, Jose Renau escreveu:
>>
>>
>> If the hardware can handle misaligned LD/ST atomically in hardware,
>> there is no need to call the exception.
>> If the two parts of the LD/ST are globally visible without intervening
>> load/stores, there is no potential problem.
>>
>> The only issue would be if the misaligned LD/ST hardware support
>> globally performed the two non-aligned
>> operations sequentially allowing remote LD/STs to happen in between.
>
>
> Interesting. That might work, as long as the emulation for misaligned AMOs
> does plain LD/ST within the mutex, instead of loading/storing each piece by
> hand.
>
> That is, we have the following options for the hardware:
>
> A. No misaligned access at all, all misaligned accesses trap.
> B. Misaligned LD/ST, split into two or more accesses (non-atomic).
> C. Misaligned LD/ST, indivisible (atomic).
> D. Misaligned LD/ST/LR/SC/AMO.
>
> The problematic one is option B, which used to be allowed but is now
> forbidden. What you just suggested is option C.

Option B is still possible, depending on the PMAs.

If the PMAs for a given region forbid misaligned AMOs (i.e., access
exception, rather than misaligned exception), then misaligned loads &
stores don't need to be atomic.

Jacob Bachmeyer

Dec 19, 2017, 9:53:05 PM12/19/17
to Andrew Waterman, Cesar Eduardo Barros, Jose Renau, RISC-V ISA Dev
Andrew Waterman wrote:
> On Wed, Dec 20, 2017 at 7:42 AM, Cesar Eduardo Barros
> <ces...@cesarb.eti.br> wrote:
>
>> That is, we have the following options for the hardware:
>>
>> A. No misaligned access at all, all misaligned accesses trap.
>> B. Misaligned LD/ST, split into two or more accesses (non-atomic).
>> C. Misaligned LD/ST, indivisible (atomic).
>> D. Misaligned LD/ST/LR/SC/AMO.
>>
>> The problematic one is option B, which used to be allowed but is now
>> forbidden. What you just suggested is option C.
>>
>
> Option B is still possible, depending on the PMAs.
>
> If the PMAs for a given region forbid misaligned AMOs (i.e., access
> exception, rather than misaligned exception), then misaligned loads &
> stores don't need to be atomic.

If this is the case, is the monitor still permitted to reflect that trap
to the supervisor as "misaligned AMO"? Also, is main memory permitted
to forbid misaligned AMOs?



-- Jacob

Andrew Waterman

Dec 20, 2017, 1:48:40 AM12/20/17
to jcb6...@gmail.com, Cesar Eduardo Barros, Jose Renau, RISC-V ISA Dev
Both of these questions, I think, need to be answered by a given hardware platform.

If the operation is truly unsupported then I think it should be reported to the OS as an access exception. Either way, it’s likely a SIGBUS.

I do think some main memory regions should be allowed to forbid misaligned AMOs.





-- Jacob

David Chisnall

Dec 20, 2017, 4:37:43 AM12/20/17
to jcb6...@gmail.com, Jose Renau, RISC-V ISA Dev, ces...@cesarb.eti.br

> On 19 Dec 2017, at 23:25, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> David Chisnall wrote:
>> On 19 Dec 2017, at 00:36, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>
>>> Further, even if we allow that misaligned AMOs should be permitted, why should they make ordinary unaligned load/store atomic and therefore much more expensive? If you have a synchronization variable that you cannot ensure is aligned, use only AMOs to access it.
>>>
>>
>> If other loads and stores are not (relaxed consistency) atomic then you are going to hit a lot of fun corner cases in trying to implement the atomic versions, because even doing a piecewise atomic compare and exchange can interact with the lack of atomicity in the other loads and stores.
>>
>
> RISC-V does not have compare and swap, and I have been working on a proposal for multiple-word LR/SC. If you can take reservations on two adjacent addresses and retry if either changes before updating both, implementing misaligned AMO should become easy. (I am also planning a more powerful RVT proposal to put multiple-word LR/SC in perspective. :-) )

I was going to say that I’m not aware of any existing hardware (excluding systems with transactional memory) that allows multi-word CAS (whether as a CAS instruction or as a ll/sc pair). Then I ran a simple test[1] on Haswell, and it turns out that at least `lock incl` does work even when spanning page boundaries on x86, so presumably the multi-word instructions also work. Interestingly, there didn’t appear to be any performance penalty from having the target of a `lock incl` span two cache lines, but when it spanned two pages the performance dropped by a factor of 32.

A bunch of RCU patterns in the Linux kernel require an atomic compare and exchange of a pair of pointers, but they require that these pointers be aligned on a 16-byte boundary, so won’t hit this case: atomics spanning cache lines are broken (or painfully slow) on so many microarchitectures that no one writes code assuming that it works.

Trying to implement a ll/sc that supports multiple words that can span a pair of cache lines on different pages is really hard. A few years ago, I proposed a small tweak to ll/sc that made the reservation an entire cache line and would simply discard the entire cache line on failure (ll moves cache line to exclusive state, sc either makes it globally visible or discards it if there have been concurrent writes), giving the simplest possible case of transactional memory. This is enough for a bunch of optimisations to existing concurrent data structures (and something similar was requested by some STM papers).

>> As a colleague recently pointed out to me, it’s not very helpful to think of atomicity in the abstract, without defining what it is atomic with respect to. If an atomic increment is not atomic with respect to a non-atomic load, then this is problematic. There is a lot of C code that assumes that it can do non-atomic loads and stores of variables and get either the ‘before’ or ‘after’ versions[1], so being able to do an atomic increment but load a value from another thread that is neither the incremented or non-incremented value will cause confusion.
>>
>
> So AMOs need to be atomic with respect to loads, such that an unaligned load cannot see an intermediate value produced by an AMO. Does the existing user ISA caveat that unaligned load/store can produce torn results need to be changed to provide for atomic unaligned memory access in all cases? That could require exposing unaligned access traps to the supervisor, where currently the supervisor sees only misaligned AMO traps.

Consider: One thread performs a load of an address, one thread performs an AMOADD of the same address. The address is unaligned and spans a cache-line boundary.

Case 1: Both go to different trap handlers. The first trap handler performs two loads, shifts and masks the results together, and returns. The second acquires a lock, performs two loads, shifts and masks the result together, performs the increment, and then performs the two piecewise stores and releases the lock. The AMOADD is atomic with respect to other atomic instructions, but not atomic with respect to non-atomic instructions.

Case 2: The hardware performs a slower but still piecewise (non-atomic) load for the first core, the second goes to a trap handler and acquires a lock. There is no atomicity guarantee between the two.

Case 3: Both go to the same trap handlers, or ones that are aware of concurrency. Both acquire locks and run as before. Now we have (relaxed consistency) atomic behaviour.

Case 4: The hardware performs a slower but still piecewise (non-atomic) load for the first core and the second goes to a trap handler that delivers a signal up to userspace, reporting that the AMOADD instruction was invalid on an unaligned value.

All four of these are valid according to the C[++]11 memory model, which makes it undefined behaviour to perform both atomic and non-atomic accesses to a variable. That doesn’t help much, because most real-world C/C++ code predates 2011 and so either uses ad-hoc atomics or has slowly been moving towards C++11 atomics and contains a bunch of casts (it doesn’t help that the C11 spec is completely broken and managed to confuse `volatile` and `_Atomic` in the `stdatomic.h` header, making it basically impossible to use correctly).

Cases 3 and 4 will give the guarantees that software expects: either your atomic RMW operation succeeds and is atomic with respect to things that are implicitly relaxed atomics, or it fails in a detectable way.
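
To make the failure in Cases 1 and 2 concrete, here is a small single-threaded sketch that replays one bad interleaving deterministically: the reader's first piece is loaded, the writer's complete update lands, then the reader's second piece is loaded. The helper names are hypothetical, and the two 16-bit pieces stand in for the two halves of an access that spans a cache-line boundary.

```c
#include <stdint.h>

/* Little-endian helpers operating on a byte array standing in for memory. */
static uint32_t load_u16_le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8);
}

static void store_u32_le(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Replay: memory holds `before`; the reader loads the low half, the writer
 * then replaces the whole value with `after`, and the reader finally loads
 * the high half. Returns what the reader observes. */
uint32_t torn_read_demo(uint32_t before, uint32_t after)
{
    uint8_t mem[4];
    store_u32_le(mem, before);

    uint32_t lo = load_u16_le(mem);      /* reader: first piece (old data) */
    store_u32_le(mem, after);            /* writer: full update in between */
    uint32_t hi = load_u16_le(mem + 2);  /* reader: second piece (new data) */

    return lo | (hi << 16);
}
```

With `before = 0x0000FFFF` and `after = 0x00010000` (an increment that carries across the halves), the reader sees `0x0001FFFF`, which is neither value.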

Note that the correct behaviour is really part of the ABI, not the ISA. As long as you trap on unaligned AMO accesses, you can decide whether to forward the trap to userspace or to try to emulate it. If you wish to emulate it, then you need a trap handler that can run atomically with respect to all other non-atomic loads and stores. If we want x86-compatibility, then simply losing the atomicity property is fine, but is a policy for the software stack (and can be delegated to userspace: if you get an unaligned access trap, deliver it as a signal and let the signal handler emulate the instruction).

> When you mention atomic increment, do you mean specifically AMOADD, or are we talking about common "var++" where "var" is not marked "volatile"? The latter is a much harder problem.

Note: volatile does not give any atomicity guarantees with respect to other threads, or with respect to other memory locations, it is solely intended for device memory.

That said, this is a fairly common idiom in C code, where you don’t care if a few updates are missed. If you do `var++` on a non-_Atomic variable, the general expectation is that if two threads race then either `var` will be incremented once or twice. This is probably not worth worrying about, because you have to try quite hard to persuade a C compiler not to naturally align `var`.

>
>> It is far better to simply trap if you can’t do this safely than to subtly break code. There are basically three options here:
>>
>> - Impose very complex requirements on the hardware that will add cost for little benefit to software.
>> - Make a small amount of software fail in subtle and very difficult to debug ways.
>> - Make a small amount of software trap and fail with a useful error message.
>>
>> My vote would be for the third one.
>
> I agree with that goal but am uncertain what will be required to implement it.

From an ISA perspective, we are somewhat conflating two things: unaligned accesses within a cache line and unaligned accesses that span a cache line. It’s fairly common to make the CPU handle ones within a cache line in hardware and ones that span a cache line boundary in software / firmware / microcode.

For CHERI, we have followed this and log to the console every time we hit an unaligned access spanning a cache line boundary. These are very rare and are almost always the result of string processing optimisations (compiler-generated inline versions of standard string functions that make stronger assumptions about alignment).

The problem for the ecosystem is what happens if one implementation supports unaligned atomic RMW operations in hardware at any location, whereas others only support them within a cache line. This, as with the optional TSO thing, will provide pressure on other implementers to support them at any granularity. This requires either that we trap on all unaligned accesses that would trap for atomic RMW instructions, or we expect all implementers to implement atomic RMW even when spanning a cache line.

Note that even this isn’t the whole story, because a compiler is free to compile this code in two ways:

```c
_Atomic(int) x;

x++;
```

It can either emit the AMOADD instruction, or a ll/sc loop. Even if the compiler always picks AMOADD for this pattern, it is well-defined C for the programmer to write an atomic or, or some more complex operation involving an explicit compare and exchange, which must be compiled as an ll/sc loop. As such, we must guarantee that all of the AMO* instructions are also atomic with respect to ll/sc, and implementing these in trap handlers is even more annoying and I’d hope that we weren’t expecting people to implement them correctly for performance (or we provide a formally verified reference implementation).
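
As a sketch of the two shapes being contrasted, here are a plain atomic increment and an operation that has to be written as an explicit compare-and-exchange loop, both in standard C11. The function names are made up for the example. On RISC-V with the A extension a compiler would typically lower the first to a single AMOADD and the second to an LR/SC loop, but that lowering is the compiler's choice and is stated here as an assumption.

```c
#include <limits.h>
#include <stdatomic.h>

/* Simple read-modify-write: a candidate for a single AMO instruction.
 * Returns the value before the increment. */
int increment(_Atomic int *x)
{
    return atomic_fetch_add(x, 1);
}

/* Increment that stops at INT_MAX. No single AMO expresses this, so it is
 * written as a compare-and-exchange retry loop. Returns the value observed
 * before the (possibly skipped) increment. */
int saturating_increment(_Atomic int *x)
{
    int old = atomic_load(x);
    /* On failure, atomic_compare_exchange_weak reloads `old` and we retry. */
    while (old != INT_MAX &&
           !atomic_compare_exchange_weak(x, &old, old + 1)) {
    }
    return old;
}
```

Both functions may be applied to the same variable, so whatever implements them, in hardware or in a trap handler, has to keep them atomic with respect to each other.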

I’d much rather that we simply didn’t allow ll/sc/amo* on any location spanning a cache line boundary (by requiring in the ABI that we don’t trap and emulate these). People who need this should wait for the T extension.

David

[1] Simple test program for atomics spanning different boundaries:

```c
#include <pthread.h>
#include <stdio.h>

// Edit this number to control the number of loop iterations
static int loops = 10000000;

void inc(_Atomic(int) *a)
{
    (*a)++;
}

void *run(void *ptr)
{
    _Atomic(int) *a = (_Atomic(int) *)ptr;
    for (int i = 0; i < loops; i++)
    {
        inc(a);
    }
    return ptr;
}

struct
__attribute__((packed, aligned(4096)))
{
    // Edit this number to control the alignment of the variable.
    char y[62];
    int z;
} evil;

int main(void)
{
    pthread_t thr1, thr2;
    pthread_create(&thr1, NULL, run, &evil.z);
    pthread_create(&thr2, NULL, run, &evil.z);
    pthread_join(thr1, NULL);
    pthread_join(thr2, NULL);
    // %p expects a void pointer, so cast explicitly.
    fprintf(stderr, "%d (expected %d) (%p)\n", evil.z, loops * 2, (void *)&evil.z);
    return 0;
}
```

Cesar Eduardo Barros

Dec 20, 2017, 5:38:36 AM12/20/17
to Andrew Waterman, Jose Renau, RISC-V ISA Dev, Jacob Bachmeyer
Then we go back to my initial complaint: both option B and option C are
explicitly forbidden by the change to machine.tex.

"If, for a given address and access width, a misaligned LR/SC or AMO
generates a misaligned address exception, then {\em all} loads, stores,
LRs/SCs, and AMOs using that address and access width must generate
misaligned address exceptions."

That sentence plainly mandates either option A or option D.

Jonas Oberhauser

Dec 20, 2017, 11:04:39 AM12/20/17
to RISC-V ISA Dev, jcb6...@gmail.com, re...@ucsc.edu, ces...@cesarb.eti.br, David.C...@cl.cam.ac.uk
It's a very good discussion and many good points have been raised.
I think one thing should also be said, which wasn't immediately clear to me: it is nearly impossible to emulate AMOs w.r.t. all memory accesses.
Consider:
Thread 1 executes an *aligned* load to bytes x0 ... x3
Thread 2 executes a misaligned AMO to bytes x1 ... x4

Clearly the aligned load should not cause an interrupt and should be executed atomically (w.r.t. other HW memory steps). If HW does not support misaligned LD/ST, Thread 2's emulation clearly would need to write some of the bytes x1 ... x3 before others (with at least two memory instructions, since there is no "write three bytes" as far as I know).
But if Thread 1 is allowed to run in parallel with Thread 2, then the trap handler of Thread 2 would update some of the bytes in x1, ..., x3 before the others and Thread 1's aligned load could therefore be "torn".
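A deterministic, single-threaded model of this tearing, as a sketch only: memory is a byte array, Thread 2's write to x1 ... x4 is split into naturally aligned pieces (byte, halfword, byte), and after each piece we record what Thread 1's aligned load of x0 ... x3 would return. The little-endian layout, the piece order, and the function names are assumptions made for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Little-endian read of the aligned word at bytes 0..3 (Thread 1's load). */
uint32_t aligned_load(const uint8_t mem[8])
{
    return (uint32_t)mem[0] | (uint32_t)mem[1] << 8 |
           (uint32_t)mem[2] << 16 | (uint32_t)mem[3] << 24;
}

/*
 * Emulate a 4-byte store to bytes 1..4 (the write half of Thread 2's AMO)
 * using naturally aligned pieces, recording what an aligned load of bytes
 * 0..3 on another hart could observe after each piece.
 */
void emulated_misaligned_store(uint8_t mem[8], uint32_t value,
                               uint32_t observed[3])
{
    mem[1] = (uint8_t)value;            /* byte store to x1 */
    observed[0] = aligned_load(mem);
    mem[2] = (uint8_t)(value >> 8);     /* halfword store to x2..x3 */
    mem[3] = (uint8_t)(value >> 16);
    observed[1] = aligned_load(mem);
    mem[4] = (uint8_t)(value >> 24);    /* byte store to x4 */
    observed[2] = aligned_load(mem);
    /* The last piece lies outside the aligned word, so it changes nothing
     * that the aligned load can see. */
    assert(observed[1] == observed[2]);
}
```

Starting from all-zero memory and storing 0xAABBCCDD, the first snapshot is 0x0000DD00, which is neither the old aligned word (0) nor the new one (0xBBCCDD00): that intermediate value is the torn load.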

My assumption right now is that (reasonable?) legacy code never mixes misaligned accesses like I did in the example above. I think that is also the assumption for the patch that is being discussed. Is that assumption flawed?

Now as far as I understand there are really two core questions:
1) what are the options for emulating AMOs for legacy code, and what features do they need
2) should the ISA have the features that make 1) above possible

Now here is my view for 1). I know of two ways to emulate AMOs. 
The first one is to prevent races that involve AMOs. Doing that seems easy but slow. Support for it already exists: either mark tasks that could race as "not runnable" while emulating the AMO, or mark pages on which you emulate AMOs as non-present on other cores.
The second one (discussed by David as well, and the one for which presumably the patch was introduced) is to also trap non-AMOs if the SW emulation of an AMO would not be atomic w.r.t. the HW implementation of these non-AMOs.
For this the patch is a tiny bit too strong for loads: if loads and stores can be implemented with atomicity in HW, then a SW AMO emulation can race with a HW load without loss of atomicity: the SW AMO emulator executes a HW store, thus only has two states, the one before the HW store and the one after. However, an atomic HW store that races the SW AMO could appear between the HW load and HW store used by the AMO and destroy the atomicity.
To summarize, one would need, *during emulation*, at least one of the following from HW for each address and access width:
1) AMO and LD/ST are atomic, or
2) AMO traps, LD is atomic, ST is atomic but traps in user mode, or
3) AMO is illegal (PMA or something)

(I think these are the necessary conditions, you can of course strengthen them, as the patch has done.)
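The reason condition 2 lets loads run in hardware but has to trap stores can be replayed deterministically. The sketch below models a software AMOADD emulation as one atomic load followed by one atomic store and lets another hart's store land in between; all names are hypothetical and the model is single-threaded on purpose.

```c
#include <assert.h>
#include <stdint.h>

/* A software AMOADD emulation that has done its load but not its store. */
struct amo_in_flight {
    uint32_t loaded; /* value read by the emulator's load */
    uint32_t addend;
};

/* First half of the emulation: a single (atomic) hardware load. */
struct amo_in_flight amo_begin(const uint32_t *mem, uint32_t addend)
{
    struct amo_in_flight s = { *mem, addend };
    return s;
}

/* Second half: a single (atomic) hardware store of the computed result. */
void amo_finish(uint32_t *mem, struct amo_in_flight s)
{
    *mem = s.loaded + s.addend;
}

/* Replay of the bad interleaving; returns the final memory value. */
uint32_t replay_store_race(uint32_t initial, uint32_t addend,
                           uint32_t racing_store)
{
    uint32_t mem = initial;
    struct amo_in_flight s = amo_begin(&mem, addend);
    mem = racing_store; /* another hart's hardware store lands here */
    amo_finish(&mem, s);
    /* An atomic outcome would be racing_store or racing_store + addend. */
    return mem;
}
```

With an initial value of 10, an addend of 1 and a racing store of 100, the result is 11: the store is silently lost. A racing load at the same point would only ever see 10 or 11, both of which are legal.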

I think this also works for LR and SC, but for a simple implementation of a reservation release it would be good to have a trap on instret or something similar (so that you can take a lock and release it after 16 instructions or something).


Now to question 2).
Is legacy software important enough to make the decision to keep the patch as is? I think the answer is no. In my eyes there is very little legacy software that needs emulation, but we are throttling all misaligned accesses for the vast rest of the code.
However if we could have an "AMO emulation mode" in HW in which the necessary traps are generated, that might be neat. The HW cost would not be big. The trap on instret was discussed already for exact replay and might find its way into the spec anyways.

I wonder if the other way of emulating misaligned AMOs -- via IPIs and page bouncing -- is fast enough to run the legacy software. In this case there is no need for a HW "emulation mode" at all, and the patch should just be reverted. Sadly my intuition says that software which has misaligned AMOs has them for a reason, namely, that there is a lot of shared-state concurrency going on and there is some performance criticality; as a result, an approach that locks out other harts may not be fast enough. I could be wrong though. Maybe Albert or David can tell us something more educated.

To summarize:
Emulation is always possible, but might be very slow. Therefore
1) revert the patch, and
2) see whether it is worth it to make the patch available as a flag, to be turned on when running legacy software that uses misaligned AMOs

What do you think?

David wrote:
> The problem for the ecosystem is what happens if one implementation supports unaligned atomic RMW operations in hardware at any location, whereas others only support them within a cache line. This, as with the optional TSO thing, will provide pressure on other implementers to support them at any granularity.

Do you mean that once people start writing programs that make use of a feature, other HW will be forced to follow suit? I see the appeal to allow for some special purpose hardware to not have to implement AMOs at any location if the software intended to run on them does not need that.

Allen Baum

Dec 20, 2017, 3:37:35 PM12/20/17
to Jonas Oberhauser, RISC-V ISA Dev, Jacob Bachmeyer, Jose Renau, Cesar Eduardo Barros, David Chisnall
I am missing something; thread1 reads bytes3..0. Thread2 writes bytes3..1 atomically (it also writes byte0)
If thread 1 reads first - it's OK
If thread 1 reads after thread2 has updated 3..1 - that's OK also - regardless of whether it has also updated byte 4. It gets the same answer.

If Thread 1 wrote instead of read, then there could possibly be a tear, 
  thread2 reads old bytes3..0
  thread1 writes new bytes3..0
  thread2 writes newer bytes3..1, and old byte0. But the only legal combinations are newer3..1, new0 or new3..0
But, can't that be fixed with a simple LR/SC combination?

If unaligned Ld/St is implemented in HW, but non-atomic, the only example I can come up with that can't be emulated is when unaligned St and AMOs overlap in both halves - because you need to reserve two addresses. I think the LRM proposal could solve this if it can be made to work.

(maybe even possible if two aligned stores coverlap both halves of an AMO?)

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+unsubscribe@groups.riscv.org.

To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

Jonas Oberhauser

Dec 20, 2017, 4:44:33 PM12/20/17
to Allen J. Baum, RISC-V ISA Dev, Jacob Bachmeyer, Jose Renau, Cesar Eduardo Barros, David Chisnall


On Dec 20, 2017 9:37 PM, "Allen Baum" <allen...@esperantotech.com> wrote:

> If Thread 1 wrote instead of read, then there could possibly be a tear,
>   thread2 reads old bytes3..0
>   thread1 writes new bytes3..0
>   thread2 writes newer bytes3..1, and old byte0. But the only legal combinations are newer3..1, new0 or new3..0
> But, can't that be fixed with a simple LR/SC combination?

Good point, thanks! I hadn't thought of that. It seems like that should work for the normal AMOs w.r.t. *single* aligned accesses.
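A sketch of that LR/SC fix in portable C, with a compare-and-exchange loop standing in for the LR/SC pair: the emulator updates bytes 1..3 of one aligned word and keeps byte 0, retrying if any other store hit the word in between. It only covers the part of the AMO that lies inside a single aligned word; the byte in the neighbouring word is the part that would need a second reservation. The function name and the little-endian field layout are assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Atomically add `addend` to the 24-bit little-endian field held in bytes
 * 1..3 of an aligned 32-bit word, leaving byte 0 untouched. If another hart
 * stores to the word between the load and the exchange, the exchange fails
 * and the update is recomputed from the fresh value.
 */
uint32_t add_to_upper_bytes(_Atomic(uint32_t) *word, uint32_t addend)
{
    uint32_t old = atomic_load_explicit(word, memory_order_relaxed);
    uint32_t desired;
    do {
        uint32_t field = (old >> 8) + addend;   /* bytes 1..3 */
        desired = (field << 8) | (old & 0xFFu); /* keep byte 0 as loaded */
    } while (!atomic_compare_exchange_weak_explicit(
        word, &old, desired, memory_order_relaxed, memory_order_relaxed));
    return old >> 8; /* previous field value, as an AMO would return */
}
```

Because a failed exchange reloads `old`, an aligned store from another thread can no longer be half-overwritten: the final word is either that store's value or the AMO applied on top of it.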

> If unaligned Ld/St is implemented in HW, but non-atomic, the only example I can come up with that can't be emulated is when unaligned St and AMOs overlap in both halves - because you need to reserve two addresses. I think the LRM proposal could solve this if it can be made to work.

Yes -- LRM should do the trick.

> (maybe even possible if two aligned stores coverlap both halves of an AMO?)

Is coverlap a technical term or a typo? Please explain if it is the former.

Do you mean something like "store x; strong memory barrier; store y" || "AMO x,y" where the store to x can overwrite the AMO but the store to y is overwritten by the AMO? (wlog x is written first by the SW AMO)

I think also a misaligned LR/SC would still be hard to emulate (as David points out, they exist as compiler mappings for CAS) -- the forward guarantee would be lost.

In any case, my suggestion still stands (revert the patch, then think as a separate issue about an "emulation mode")

David Chisnall

Dec 21, 2017, 2:08:52 AM12/21/17
to Jonas Oberhauser, Allen J. Baum, RISC-V ISA Dev, Jacob Bachmeyer, Jose Renau, Cesar Eduardo Barros
On 20 Dec 2017, at 21:44, Jonas Oberhauser <s9jo...@gmail.com> wrote:
>
> In any case, my suggestion still stands (revert the patch, then think as a separate issue about an "emulation mode")

The root cause of this issue is conflating platform-level guarantees (operations on atomic variables must be atomic, even with respect to non-atomic accesses, when they span pages and cannot be atomic in a given microarchitecture) with ISA-level guarantees.

The change to the ISA reference appears to have been done to permit a specific (and, in my opinion, not very useful) platform-level guarantee, without documenting what that guarantee was or the rationale for picking it.

David

Jonas Oberhauser

Dec 21, 2017, 2:09:57 AM12/21/17
to David Chisnall, Allen J. Baum, RISC-V ISA Dev, Jacob Bachmeyer, Jose Renau, Cesar Eduardo Barros
I agree with you. 

Andrew Waterman

Dec 21, 2017, 4:29:15 AM12/21/17
to Cesar Eduardo Barros, Jacob Bachmeyer, Jose Renau, RISC-V ISA Dev
For PMAs that forbid misaligned AMOs altogether (which, for some platforms, could be for all addresses), these new constraints need not apply. I agree the spec does not permit this as written, but I also agree there’s no reason to forbid option B on platforms that have no need for misaligned AMOs.

Jonas Oberhauser

Dec 21, 2017, 6:15:19 AM12/21/17
to Andrew Waterman, Cesar Eduardo Barros, Jacob Bachmeyer, Jose Renau, RISC-V ISA Dev
2017-12-21 10:29 GMT+01:00 Andrew Waterman <wate...@eecs.berkeley.edu>:
> On Wed, Dec 20, 2017 at 7:38 PM Cesar Eduardo Barros <ces...@cesarb.eti.br> wrote:
>> "If, for a given address and access width, a misaligned LR/SC or AMO
>> generates a misaligned address exception, then {\em all} loads, stores,
>> LRs/SCs, and AMOs using that address and access width must generate
>> misaligned address exceptions."
>
> For PMAs that forbid misaligned AMOs altogether (which, for some platforms, could be for all addresses), these new constraints need not apply. I agree the spec does not permit this as written, but I also agree there’s no reason to forbid option B on platforms that have no need for misaligned AMOs.

What parts exactly does the spec currently forbid?
As far as I understand, there are two ways of looking at it:
1) misalignment interrupts *are* the PMA interrupts that forbid misaligned AMOs, and these new constraints always apply
2) PMAs do not have this level of granularity -- from the PMA side, AMOs are either allowed or forbidden, but never allowed when aligned and forbidden otherwise. Misalignment interrupts are not a PMA issue and are kind of "below" the PMA.

Is that what you mean, or is there something else? Are you taking view 2) and suggesting that on top of "normal" misalignment interrupts, there should be PMA misalignment interrupts, which render the clause invalid?

Jacob Bachmeyer

Dec 21, 2017, 7:03:11 PM12/21/17
to Jonas Oberhauser, Allen J. Baum, RISC-V ISA Dev, Jose Renau, Cesar Eduardo Barros, David Chisnall
Jonas Oberhauser wrote:
> I think also a misaligned LR/SC would still be hard to emulate (as
> David points out, they exist as compiler mappings for CAS) -- the
> forward guarantee would be lost.

If misaligned LR/SC trap to the supervisor, they can be emulated, albeit
at a high cost in performance. (Supervisor marks relevant pages
"no-access" and records the reservation in software, then waits for the
page fault from SC. Page faults from other threads block those threads
for some bounded period.)



-- Jacob

Jacob Bachmeyer

Dec 21, 2017, 7:27:19 PM12/21/17
to David Chisnall, Jose Renau, RISC-V ISA Dev, ces...@cesarb.eti.br
David Chisnall wrote:
>> On 19 Dec 2017, at 23:25, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>
>> David Chisnall wrote:
>>
>>> On 19 Dec 2017, at 00:36, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>>
>>>
>>>> Further, even if we allow that misaligned AMOs should be permitted, why should they make ordinary unaligned load/store atomic and therefore much more expensive? If you have a synchronization variable that you cannot ensure is aligned, use only AMOs to access it.
>>>>
>>>>
>>> If other loads and stores are not (relaxed consistency) atomic then you are going to hit a lot of fun corner cases in trying to implement the atomic versions, because even doing a piecewise atomic compare and exchange can interact with the lack of atomicity in the other loads and stores.
>>>
>>>
>> RISC-V does not have compare and swap, and I have been working on a proposal for multiple-word LR/SC. If you can take reservations on two adjacent addresses and retry if either changes before updating both, implementing misaligned AMO should become easy. (I am also planning a more powerful RVT proposal to put multiple-word LR/SC in perspective. :-) )
>>
>
> I was going to say that I’m not aware of any existing hardware (excluding systems with transactional memory) that allows multi-word CAS (whether as a CAS instruction or as a ll/sc pair). Then I ran a simple test[1] on Haswell, and it turns out that at least `lock incl` does work even when spanning page boundaries on x86, so presumably the multi-word instructions also work. Interestingly, there didn’t appear to be any performance penalty from having the target of a `lock incl` span two cache lines, but when it spanned two pages the performance dropped by a factor of 32.
>
> A bunch of RCU patterns in the Linux kernel require an atomic compare and exchange of a pair of pointers, but they require that these pointers be aligned on a 16-byte boundary, so won’t hit this case: atomics spanning cache lines are broken (or painfully slow) on so many microarchitectures that no one writes code assuming that it works.
>
> Trying to implement a ll/sc that supports multiple words that can span a pair of cache lines on different pages is really hard. A few years ago, I proposed a small tweak to ll/sc that made the reservation an entire cache line and would simply discard the entire cache line on failure (ll moves cache line to exclusive state, sc either makes it globally visible or discards it if there have been concurrent writes), giving the simplest possible case of transactional memory. This is enough for a bunch of optimisations to existing concurrent data structures (and something similar was requested by some STM papers).
>

I am working on a proposal for multiple-word LR/SC; the discussion is in
the "Double-wide LR/SC" thread. Helpful input would be appreciated.

>>> As a colleague recently pointed out to me, it’s not very helpful to think of atomicity in the abstract, without defining what it is atomic with respect to. If an atomic increment is not atomic with respect to a non-atomic load, then this is problematic. There is a lot of C code that assumes that it can do non-atomic loads and stores of variables and get either the ‘before’ or ‘after’ versions[1], so being able to do an atomic increment but load a value from another thread that is neither the incremented nor the non-incremented value will cause confusion.
>>>
>>>
>> So AMOs need to be atomic with respect to loads, such that an unaligned load cannot see an intermediate value produced by an AMO. Does the existing user ISA caveat that unaligned load/store can produce torn results need to be changed to provide for atomic unaligned memory access in all cases? That could require exposing unaligned access traps to the supervisor, where currently the supervisor sees only misaligned AMO traps.
>>
>
> Consider: One thread performs a load of an address, one thread performs an AMOADD of the same address. The address is unaligned and spans a cache-line boundary.
>
> Case 1: Both go to different trap handlers. The first trap handler performs two loads, shifts and masks the results together, and returns. The second acquires a lock, performs two loads, shifts and masks the result together, performs the increment, and then performs the two piecewise stores and releases the lock. The AMOADD is atomic with respect to other atomic instructions, but not atomic with respect to non-atomic instructions.
>
> Case 2: The hardware performs a slower but still piecewise (non-atomic) load for the first core, the second goes to a trap handler and acquires a lock. There is no atomicity guarantee between the two.
>
> Case 3: Both go to the same trap handlers, or ones that are aware of concurrency. Both acquire locks and run as before. Now we have (relaxed consistency) atomic behaviour.
>
> Case 4: The hardware performs a slower but still piecewise (non-atomic) load for the first core and the second goes to a trap handler that delivers a signal up to userspace, reporting that the AMOADD instruction was invalid on an unaligned value.
>
> All four of these are valid according to the C[++]11 memory model, which makes it undefined behaviour to perform both atomic and non-atomic accesses to a variable. That doesn’t help much, because most real-world C/C++ code predates 2011 and so either uses ad-hoc atomics or has slowly been moving towards C++11 atomics and contains a bunch of casts (it doesn’t help that the C11 spec is completely broken and managed to confuse `volatile` and `_Atomic` in the `stdatomic.h` header, making it basically impossible to use correctly).
>
> Cases 3 and 4 will give the guarantees that software expects: either your atomic RMW operation succeeds and is atomic with respect to things that are implicitly relaxed atomics, or it fails in a detectable way.
>
> Note that the correct behaviour is really part of the ABI, not the ISA. As long as you trap on unaligned AMO accesses, you can decide whether to forward the trap to userspace or to try to emulate it. If you wish to emulate it, then you need a trap handler that can run atomically with respect to all other non-atomic loads and stores. If we want x86-compatibility, then simply losing the atomicity property is fine, but is a policy for the software stack (and can be delegated to userspace: if you get an unaligned access trap, deliver it as a signal and let the signal handler emulate the instruction).
>

This is another reason that I do not like this change: previously, the
supervisor could still emulate AMOs in case 4, and can do so correctly.
With this change, only cases 1 and 3 are allowed (hardware behavior is
the same for cases 1 and 3), but I believe that case 4 is the best solution.
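For reference, case 3 from the quoted list can be modelled in user space roughly as follows: the emulated misaligned load and the emulated misaligned AMOADD both take the same lock, so neither can observe the other's intermediate state. This is a sketch only. The single global spinlock, the function names, and the use of `memcpy` for the piecewise accesses are assumptions, and a real trap handler would look quite different.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* One global lock shared by every emulation path (hypothetical; a real
 * handler might hash the address to one of several locks). */
static atomic_flag emu_lock = ATOMIC_FLAG_INIT;

static void emu_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&emu_lock, memory_order_acquire))
        ; /* spin until the flag was previously clear */
}

static void emu_release(void)
{
    atomic_flag_clear_explicit(&emu_lock, memory_order_release);
}

/* Emulated misaligned 32-bit load: piecewise copy under the lock. */
uint32_t emu_load32(const uint8_t *addr)
{
    uint32_t v;
    emu_acquire();
    memcpy(&v, addr, sizeof v);
    emu_release();
    return v;
}

/* Emulated misaligned AMOADD.W: load, add, store under the same lock. */
uint32_t emu_amoadd32(uint8_t *addr, uint32_t addend)
{
    uint32_t old, new_value;
    emu_acquire();
    memcpy(&old, addr, sizeof old);
    new_value = old + addend;
    memcpy(addr, &new_value, sizeof new_value);
    emu_release();
    return old;
}
```

The guarantee only holds if every access to the address goes through one of these two paths, which is exactly why case 3 requires that plain misaligned loads and stores trap as well.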

>>> It is far better to simply trap if you can’t do this safely than to subtly break code. There are basically three options here:
>>>
>>> - Impose very complex requirements on the hardware that will add cost for little benefit to software.
>>> - Make a small amount of software fail in subtle and very difficult to debug ways.
>>> - Make a small amount of software trap and fail with a useful error message.
>>>
>>> My vote would be for the third one.
>>>
>> I agree with that goal but am uncertain what will be required to implement it.
>>
>
> From an ISA perspective, we are somewhat conflating two things: unaligned accesses within a cache line and unaligned accesses that span a cache line. It’s fairly common to make the CPU handle ones within a cache line in hardware and ones that span a cache line boundary in software / firmware / microcode.
>

As I understand it, the original intent for RISC-V's unaligned access
handling is exactly that: hardware can handle (some) unaligned accesses
and trap to the monitor if the access spans some microarchitectural
boundary. High-performance hardware might be able to handle unaligned
accesses spanning cache lines, but still trap for accesses spanning pages.

The monitor in RISC-V is effectively microcode itself written in the
RISC-V ISA.

> For CHERI, we have followed this and log to the console every time we hit an unaligned access spanning a cache line boundary. These are very rare and are almost always the result of string processing optimisations (compiler-generated inline versions of standard string functions that make stronger assumptions about alignment).
>
> The problem for the ecosystem is what happens if one implementation supports unaligned atomic RMW operations in hardware at any location, whereas others only support them within a cache line. This, as with the optional TSO thing, will provide pressure on other implementers to support them at any granularity. This requires either that we trap on all unaligned accesses that would trap for atomic RMW instructions, or we expect all implementers to implement atomic RMW even when spanning a cache line.
>

A preemptive multitasking supervisor can emulate atomic operations
(slowly), so this is a performance issue, rather than a correctness issue.

> Note that even this isn’t the whole story, because a compiler is free to compile this code in two ways:
>
> ```c
> _Atomic(int) x;
>
> x++;
> ```
>
> It can either emit the AMOADD instruction, or a ll/sc loop. Even if the compiler always picks AMOADD for this pattern, it is well-defined C for the programmer to write an atomic or, or some more complex operation involving an explicit compare and exchange, which must be compiled as an ll/sc loop. As such, we must guarantee that all of the AMO* instructions are also atomic with respect to ll/sc, and implementing these in trap handlers is even more annoying and I’d hope that we weren’t expecting people to implement them correctly for performance (or we provide a formally verified reference implementation).
>
> I’d much rather that we simply didn’t allow ll/sc/amo* on any location spanning a cache line boundary (by requiring in the ABI that we don’t trap and emulate these).

I agree that these should not be emulated by the monitor, but disagree
that "trap and emulate" should be forbidden for misaligned AMOs. The
supervisor can correctly emulate misaligned AMOs, although the monitor
cannot, because emulating AMOs requires knowledge of the scheduler and
control of the page tables, both of which the supervisor has and the
monitor lacks. (Well, the monitor *can* alter the page tables, but
could easily crash the supervisor by doing that. Baking assumptions
about the supervisor's scheduler into the monitor is even more fragile.)



-- Jacob

Michael Clark

Dec 21, 2017, 11:19:08 PM12/21/17
to jcb6...@gmail.com, David Chisnall, Jose Renau, RISC-V ISA Dev, ces...@cesarb.eti.br
Misaligned AMOs seem like a bit of a misfeature to me. A reasonable amount of code that uses atomics not only aligns them to their natural word size, but often aligns them to the cache line size.

An example is a single-producer, single-consumer lock-free queue. The head and tail indices, along with copies, are put on separate cache lines so that queue push stores and queue pop stores don’t contend for the same cache line. e.g.

__attribute__((aligned(64))) size_t head, tail_copy;
__attribute__((aligned(64))) size_t tail, head_copy;

push revalidates tail_copy from tail on queue full, normally just reads tail_copy and increments head i.e. only contends for tail on queue full, at which point it revalidates tail_copy and may find there is space and either pushes or returns full, or sleeps.

pop revalidates head_copy from head on queue empty, normally just reads head_copy and increments tail i.e. only contends for head on queue empty, at which point it revalidates head_copy, and may find there are new items and either pops, returns empty, or sleeps.

Interprocessor cache synchronisation messages only occur on “queue possibly full” or “queue possibly empty” as the producer and consumer have their own cache lines.
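A compact sketch of the queue described above, in C11. The capacity, the 64-byte line size, the `int` payload and all identifiers are assumptions for illustration; the sleeping path is left out, so both operations simply report full or empty.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAPACITY 4u /* illustrative; must be a power of two */
#define LINE 64           /* assumed cache line size */

struct spsc_queue {
    /* Producer's line: only push() writes these. */
    alignas(LINE) _Atomic(size_t) head;
    size_t tail_copy;
    /* Consumer's line: only pop() writes these. */
    alignas(LINE) _Atomic(size_t) tail;
    size_t head_copy;
    alignas(LINE) int slots[QUEUE_CAPACITY];
};

static_assert(offsetof(struct spsc_queue, tail) % LINE == 0,
              "tail starts its own cache line");

bool spsc_push(struct spsc_queue *q, int value)
{
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    if (head - q->tail_copy == QUEUE_CAPACITY) {
        /* Looks full: revalidate the cached tail from the consumer's line. */
        q->tail_copy = atomic_load_explicit(&q->tail, memory_order_acquire);
        if (head - q->tail_copy == QUEUE_CAPACITY)
            return false;
    }
    q->slots[head % QUEUE_CAPACITY] = value;
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}

bool spsc_pop(struct spsc_queue *q, int *value)
{
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    if (tail == q->head_copy) {
        /* Looks empty: revalidate the cached head from the producer's line. */
        q->head_copy = atomic_load_explicit(&q->head, memory_order_acquire);
        if (tail == q->head_copy)
            return false;
    }
    *value = q->slots[tail % QUEUE_CAPACITY];
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}
```

In the common case each side touches only its own line; the other side's index is read only on the "possibly full" or "possibly empty" path.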

I’d seriously question who would be using unaligned atomics. That said, it appears that x86 allows misaligned atomic accesses as long as they don’t cross a cache line or page boundary.

One assumes that the C and C++ atomic types could have __attribute__((aligned(sizeof(T)))) i.e. ensure they are naturally aligned. This is the only way to guarantee they don’t cross a cache line.
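To make the layout difference concrete, here is a small sketch contrasting a naturally aligned atomic member with a packed one. It relies on the GCC/Clang `packed` attribute, and the struct names are made up for the example. A power-of-two-sized object aligned to its own size cannot straddle a cache line whose size is a larger power of two, whereas the packed member's position depends entirely on where the enclosing struct lands.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

/* Naturally aligned: the compiler pads after `c` so that `value` starts on
 * a multiple of its own size. */
struct natural {
    char c;
    alignas(sizeof(long)) atomic_long value;
};

/* Packed: `value` starts at offset 1 and may straddle a line boundary,
 * depending on where the struct itself is placed. */
struct __attribute__((packed)) packed_layout {
    char c;
    long value; /* plain long: only the layout is of interest here */
};

static_assert(offsetof(struct natural, value) % sizeof(long) == 0,
              "naturally aligned member");
static_assert(offsetof(struct packed_layout, value) == 1,
              "packed member is misaligned");
```

Both assertions are checked at compile time, so simply compiling the fragment confirms the layouts on a given ABI.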


Now if a processor wanted to allow misaligned atomics, I don’t think we should prevent it, but mandating it would be providing a stronger atomicity guarantee than x86. Unaligned atomics spanning cache lines seems like a lot of coherency logic for little gains given 99.99% of atomic users would be naturally aligned and the other 0.01% are likely bugs, including on x86, where behaviour is undefined if they cross cache lines. The only way to ensure atomics don’t cross cache lines is to make them at minimum naturally aligned.

Alex Solomatnikov

Dec 22, 2017, 12:10:01 AM12/22/17
to Michael Clark, jcb6...@gmail.com, David Chisnall, Jose Renau, RISC-V ISA Dev, ces...@cesarb.eti.br
x86 does support unaligned atomics and the manual does not say anything about cache line or page boundaries:


Page 8-4 Vol. 3A:

"The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed for as many bus cycles as necessary to update the entire operand. However, it is recommend that locked accesses be aligned on their natural boundaries for better system performance:
  • Any boundary for an 8-bit access (locked or otherwise).
  • 16-bit boundary for locked word accesses.
  • 32-bit boundary for locked doubleword accesses.
  • 64-bit boundary for locked quadword accesses.

Locked operations are atomic with respect to all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor.

For the P6 family processors, locked operations serialize all outstanding load and store operations (that is, wait for them to complete). This rule is also true for the Pentium 4 and Intel Xeon processors, with one exception. Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.

Locked instructions should not be used to ensure that data written can be fetched as instructions."

Of course, in modern CPUs there is no bus and there is no bus lock but it does not matter from SW point of view.


Michael Clark

Dec 22, 2017, 12:48:56 AM12/22/17
to Alex Solomatnikov, Jacob Bachmeyer, David Chisnall, Jose Renau, RISC-V ISA Dev, ces...@cesarb.eti.br
Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide, Part 1

8.1.1 Guaranteed Atomic Operations … The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically: >>> Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line <<<. Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be avoided.

David Chisnall

Dec 22, 2017, 2:58:03 AM12/22/17
to Michael Clark, jcb6...@gmail.com, Jose Renau, RISC-V ISA Dev, ces...@cesarb.eti.br
On 22 Dec 2017, at 04:18, Michael Clark <michae...@mac.com> wrote:
>
> One assumes that the C and C++ atomic types could have __attribute__((aligned(sizeof(T)))) i.e. ensure they are naturally aligned. This is the only way to guarantee they don’t cross a cache line.

The initial design of C/C++ atomics was to allow _Atomic(T) to have stricter ABI requirements than T, and even a different size. For example, _Atomic(short) might be 64 bits on an Alpha, where atomic values less than 64 bits are problematic. Unfortunately, the combination of the mess that WG14 made of stdatomic.h and the GCC implementation make this somewhat difficult for ABIs to take advantage of in practice.

David

Michael Clark

Dec 22, 2017, 3:02:02 AM12/22/17
to RISC-V ISA Dev, Alex Solomatnikov, Jacob Bachmeyer, David Chisnall, Jose Renau, Cesar Eduardo Barros
It’s academic as to what happens to an atomic operation to a misaligned address with the LOCK prefix on x86, because relaxed atomic accesses on x86 don’t use the LOCK prefix so according to the Intel docs they are guaranteed to be ub when the “word” (I won’t say atomic) happens to cross a cache line boundary, so in x86 this is ub:

$ cat a.c
#include <stdio.h>
#include <stdatomic.h>

typedef struct
{
    char c;
    atomic_long ub;
} __attribute__((packed, aligned(1))) fun_with_atomics;

void relaxed(fun_with_atomics *loon)
{
    /* if this is in another thread we could see a torn load because there is no LOCK prefix on relaxed atomic loads that are otherwise guaranteed to be atomic, except if they cross cache line boundaries, this is why nobody will ever do this in actual code, except as an example of ub */
    printf("val=%ld\n", atomic_load(&loon->ub));
}

int main()
{
    /* exercise for the reader, get loon to span a cache line */
    fun_with_atomics loon = { 'a', 1 };

    printf("%p\n", &loon.ub);
    atomic_fetch_add_explicit(&loon.ub, 1, memory_order_relaxed);
    relaxed(&loon);
}

# gcc silently defies my attempt at lunacy. odd. I like it. it’s different

$ gcc a.c
$ ./a.out
0x7fff981230d8
val=2

# clang complies but warns me three times (this is on macOS)

$ clang a.c
a.c:12:35: warning: taking address of packed member 'ub' of class or structure 'fun_with_atomics' may result in an unaligned
pointer value [-Waddress-of-packed-member]
printf("val=%ld\n", atomic_load(&loon->ub));
^~~~~~~~
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/9.0.0/include/stdatomic.h:134:47: note:
expanded from macro 'atomic_load'
#define atomic_load(object) __c11_atomic_load(object, __ATOMIC_SEQ_CST)
^~~~~~
a.c:19:25: warning: taking address of packed member 'ub' of class or structure 'fun_with_atomics' may result in an unaligned
pointer value [-Waddress-of-packed-member]
printf("%p\n", &loon.ub);
^~~~~~~
a.c:20:36: warning: taking address of packed member 'ub' of class or structure 'fun_with_atomics' may result in an unaligned
pointer value [-Waddress-of-packed-member]
atomic_fetch_add_explicit(&loon.ub, 1, memory_order_relaxed);
^~~~~~~
3 warnings generated.

$ ./a.out
0x7ffee1094741
val=2


It’s really not a good idea, misaligned accesses in general, but misaligned atomics

Jonas Oberhauser

Dec 22, 2017, 3:27:32 AM
to Michael Clark, RISC-V ISA Dev, Alex Solomatnikov, Jacob Bachmeyer, David Chisnall, Jose Renau, Cesar Eduardo Barros
Not compiling with LOCK here may be a compiler bug.
This is what I get with clang:

mov qword ptr [rbp - 24], 1
mov rcx, qword ptr [rbp - 24]
lock
xadd qword ptr [rbp - 15], rcx
mov qword ptr [rbp - 32], rcx


In any case the question is not just "what would we do" but whether there is existing code out there that a RISCV user might want to run which needs emulation of misaligned atomics.
As far as I understand, the answer to that question is (sadly) yes.
As you know, we are not discussing adding misaligned atomics. We are discussing removing one type of emulation support for them in order to speed up misaligned loads and stores.



Michael Clark

Dec 22, 2017, 7:13:03 AM
to Jonas Oberhauser, RISC-V ISA Dev, Alex Solomatnikov, Jacob Bachmeyer, David Chisnall, Jose Renau, Cesar Eduardo Barros


> On 22/12/2017, at 9:26 PM, Jonas Oberhauser <s9jo...@gmail.com> wrote:
>
> Not compiling with LOCK here may be a compiler bug.
> This is what I get with clang:
>
> mov qword ptr [rbp - 24], 1
> mov rcx, qword ptr [rbp - 24]
> lock
> xadd qword ptr [rbp - 15], rcx
> mov qword ptr [rbp - 32], rcx

This is what I get from clang, for the relaxed atomics, which in the case of misalignment is ub on x86 (“Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors”) hence the compiler warnings:

_relaxed:
0000000100000f10 pushq %rbp
0000000100000f11 movq %rsp, %rbp
0000000100000f14 movq 0x1(%rdi), %rsi
0000000100000f18 leaq 0x7f(%rip), %rdi
0000000100000f1f xorl %eax, %eax
0000000100000f21 popq %rbp
0000000100000f22 jmp 0x100000f7e
0000000100000f27 nopw (%rax,%rax)

> In any case the question is not just "what would we do" but whether there is existing code out there that a RISCV user might want to run which needs emulation of misaligned atomics.
> As far as I understand, the answer to that question is (sadly) yes.

Do you have evidence for this? I find this hard to believe, and if so the buggy code should be fixed. Any pointers to real code with this property?

Trapping seems to be the most sensible thing to do. On macOS the minimum alignment for allocated objects is 16 bytes, and the compiler will generate code that traps due to its use of the MOVDQA, MOVAPD and MOVAPS SSE instructions for inline memcpy. Objects larger than 16 bytes will trap if not 16-byte aligned due to SSE, even if their components’ natural alignment is only 4 or 8 bytes. I spent quite a while debugging “traps” when using the TLSF allocator on macOS; it simply won’t work because clang/compiler-rt generates code that assumes 16-byte alignment.

It takes a very deliberate attempt to create code that performs regular misaligned loads (e.g. for packed structures such as network packets), let alone the additional step of combining packed structures with atomics, which could only be categorised as a twisted attempt at invoking UB. I find it very hard to believe that there is code that uses C11/C++11 atomics combined with __attribute__((packed)).

I’m actually surprised by what gcc does with __attribute__((packed)). It foils my attempt at misaligning. I might have to heap allocate with malloc and cast.

#include <stdio.h>

struct fun_with_atomics
{
    char c;
    long ub;
} __attribute__((packed));

int main()
{
    /* exercise for the reader, get loon to span a cache line */
    struct fun_with_atomics loon = { 'a', 1 };

    /* %p expects a void pointer, so cast explicitly */
    printf("%p\n", (void *)&loon.c);
    printf("%p\n", (void *)&loon.ub);
}

$ ./a.out
0x7ffc3b75eb47
0x7ffc3b75eb48

Anyone who combines __attribute__((packed)) and atomic types is asking for traps.

Jonas Oberhauser

Dec 22, 2017, 8:05:00 AM
to Michael Clark, RISC-V ISA Dev, Alex Solomatnikov, Jacob Bachmeyer, David Chisnall, Jose Renau, Cesar Eduardo Barros
2017-12-22 13:12 GMT+01:00 Michael Clark <michae...@mac.com>:


>> On 22/12/2017, at 9:26 PM, Jonas Oberhauser <s9jo...@gmail.com> wrote:
>>
>> Not compiling with LOCK here may be a compiler bug.
>> This is what I get with clang:
>>
>>   mov qword ptr [rbp - 24], 1
>>   mov rcx, qword ptr [rbp - 24]
>>   lock
>>   xadd qword ptr [rbp - 15], rcx
>>   mov qword ptr [rbp - 32], rcx
>
> This is what I get from clang, for the relaxed atomics, which in the case of misalignment is ub on x86 (“Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors”) hence the compiler warnings:

While I don't quite think that that's UB on the x86 level -- it sounds rather like "implementation defined" -- I agree with you that either on the C level it is UB or it is a compiler bug. 
However, as far as I understand, some compilers like GCC will try very hard not to misalign your stuff (as you noticed yourself, and as described on https://gcc.gnu.org/wiki/Atomic/GCCMM/UnalignedPolicy).
 
>> In any case the question is not just "what would we do" but whether there is existing code out there that a RISCV user might want to run which needs emulation of misaligned atomics.
>> As far as I understand, the answer to that question is (sadly) yes.
>
> Do you have evidence for this? I find this hard to believe, and if so the buggy code should be fixed. Any pointers to real code with this property?

I am taking the word of several people on this mailing list for it, e.g., David, who said that he has encountered a small handful of applications that have misaligned, cache-line-spanning AMOs in optimized code related to (IIRC) string manipulation. I personally have not looked for code like that. Maybe someone else can give concrete examples.

> Trapping seems to be the most sensible thing to do.

We will and do trap. The question raised in this thread is just: should misaligned loads and stores trap as well, so that the trap handler can potentially emulate the AMOs?

Jose Renau

Dec 22, 2017, 12:18:35 PM
to Jonas Oberhauser, Michael Clark, RISC-V ISA Dev, Alex Solomatnikov, Jacob Bachmeyer, David Chisnall, Cesar Eduardo Barros

 What I have heard is misaligned within the cache line, not across cache lines. The reason 
is that this works fine with ARMv8 and x86.

 It should be fine if we could allow this to work with exceptions (or optional hardware).

Allen Baum

Dec 22, 2017, 3:08:03 PM
to Jose Renau, Jonas Oberhauser, Michael Clark, RISC-V ISA Dev, Alex Solomatnikov, Jacob Bachmeyer, David Chisnall, Cesar Eduardo Barros
I actually like the Intel approach of "make it work, but as slowly as possible" - which is enabled if misaligned AMOs (but not misaligned normal accesses) trap.
I am a bit concerned about mixing normal loads and AMOs (that trap). Normally, I would expect a trapped AMO to grab a global lock, and perform the operation using AMOs for each half of the op.
But if another thread (possibly on another core) performs an unaligned store that overlaps the AMO (but isn't atomic and doesn't observe the global lock), could it result in a torn value? Or is that torn value always a legal interpretation, because the ordering of the normal store and the AMO isn't guaranteed?
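
The global-lock emulation described above can be modelled in ordinary C. This is only a sketch under stated assumptions: `amo_lock` and `emulate_amoadd_w` are hypothetical names, the lock is a simple spinlock, and the two halves are accessed bytewise with memcpy rather than with one aligned AMO per half, which is a simplification of what a real trap handler would do.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical global lock shared by every emulated misaligned AMO. */
static atomic_flag amo_lock = ATOMIC_FLAG_INIT;

/* Model of a trap handler emulating AMOADD.W on a misaligned address.
 * Returns the old value, as the real instruction would. */
static uint32_t emulate_amoadd_w(void *addr, uint32_t addend)
{
    uint32_t old, new_val;

    assert(addr != NULL);

    /* Spin until the global lock is free. */
    while (atomic_flag_test_and_set_explicit(&amo_lock, memory_order_acquire))
        ;

    memcpy(&old, addr, sizeof old);         /* bytewise read, alignment-safe */
    new_val = old + addend;
    memcpy(addr, &new_val, sizeof new_val); /* bytewise write-back */

    atomic_flag_clear_explicit(&amo_lock, memory_order_release);
    return old;
}
```

Two emulated AMOs serialise against each other through `amo_lock`. A plain store from another thread never takes the lock, so it can land between the two memcpy calls, which is exactly the torn-value case the question is about.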


Michael Clark

Dec 22, 2017, 4:25:19 PM
to RISC-V ISA Dev, Jose Renau, Jonas Oberhauser, Alex Solomatnikov, Jacob Bachmeyer, David Chisnall, Cesar Eduardo Barros, Allen Baum
I can reproduce a torn “atomic access” on x86 with misaligned atomics. Try this. It takes a few milliseconds on an Ivy Bridge Core i7 in my MacBook Pro. Basically, misaligned atomics are not guaranteed to be atomic on x86. It’s pretty neat that we have an actual reproducer for it.


#include <stdio.h>
#include <stdlib.h>
#include <stdatomic.h>
#include <pthread.h>
#include <limits.h>

#define ROUND(x,s) ( (((long)(x)) + s-1) & (~(s-1)) )

typedef struct
{
    atomic_ulong ub;
}__attribute__((packed, aligned(1))) foo;

void thr_start(void *arg)
{
    foo *p = arg;
    for (;;) {
        long val = atomic_load(&p->ub);
        if (val != 0 && val != ULONG_MAX) {
            printf("%016lx\n", val);
            exit(1);
        }
    }
}

int main()
{
    int ret;
    pthread_t thr;
    long m = (long)malloc(128);
    foo *p = (foo*)(ROUND(m, 64) - (sizeof(long)/2));

    atomic_store(&p->ub, 0);

    printf("%p\n", &p->ub);
    ret = pthread_create(&thr, NULL, thr_start, p);
    if (ret < 0) {
        perror("pthread_create");
        exit(1);
    }

    for (;;) {
        atomic_fetch_add_explicit(&p->ub, ULONG_MAX, memory_order_relaxed);
        atomic_fetch_add_explicit(&p->ub, -ULONG_MAX, memory_order_relaxed);
    }
}


$ gcc a.c
a.c:18:27: warning: taking address of packed member 'ub' of class or structure 'foo' may result in an unaligned pointer value
[-Waddress-of-packed-member]
long val = atomic_load(&p->ub);
^~~~~
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/9.0.0/include/stdatomic.h:134:47: note:
expanded from macro 'atomic_load'
#define atomic_load(object) __c11_atomic_load(object, __ATOMIC_SEQ_CST)
^~~~~~
a.c:33:16: warning: taking address of packed member 'ub' of class or structure 'foo' may result in an unaligned pointer value
[-Waddress-of-packed-member]
atomic_store(&p->ub, 0);
^~~~~
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/9.0.0/include/stdatomic.h:131:58: note:
expanded from macro 'atomic_store'
#define atomic_store(object, desired) __c11_atomic_store(object, desired, __ATOMIC_SEQ_CST)
^~~~~~
a.c:35:18: warning: taking address of packed member 'ub' of class or structure 'foo' may result in an unaligned pointer value
[-Waddress-of-packed-member]
printf("%p\n", &p->ub);
^~~~~
a.c:36:35: warning: incompatible pointer types passing 'void (void *)' to parameter of type 'void * _Nullable (* _Nonnull)(void
* _Nullable)' [-Wincompatible-pointer-types]
ret = pthread_create(&thr, NULL, thr_start, p);
^~~~~~~~~
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.13.sdk/usr/include/pthread.h:328:31: note:
passing argument to parameter here
void * _Nullable (* _Nonnull)(void * _Nullable),
^
a.c:43:30: warning: taking address of packed member 'ub' of class or structure 'foo' may result in an unaligned pointer value
[-Waddress-of-packed-member]
atomic_fetch_add_explicit(&p->ub, ULONG_MAX, memory_order_relaxed);
^~~~~
a.c:44:30: warning: taking address of packed member 'ub' of class or structure 'foo' may result in an unaligned pointer value
[-Waddress-of-packed-member]
atomic_fetch_add_explicit(&p->ub, -ULONG_MAX, memory_order_relaxed);
^~~~~
6 warnings generated.

$ ./a.out
0x7fcc2cc028bc
ffffffff00000000

Michael Clark

Dec 22, 2017, 4:30:28 PM
to RISC-V ISA Dev, Jose Renau, Jonas Oberhauser, Alex Solomatnikov, Jacob Bachmeyer, David Chisnall, Cesar Eduardo Barros, Allen Baum

Michael Clark

Dec 23, 2017, 3:02:49 PM
to RISC-V ISA Dev, Jose Renau, Jonas Oberhauser, Alex Solomatnikov, Jacob Bachmeyer, David Chisnall, Cesar Eduardo Barros, Allen Baum
BTW I’m fine with the text in the specification with respect to misaligned loads and stores. Misaligned loads and stores are not guaranteed to be atomic. Misaligned loads and stores are typically used for access to “packed” data structures, which makes sense. AMOs on the other hand are used for synchronisation primitives and there is no overlap between the use cases. I don’t see how anyone could possibly ever need misaligned AMOs.

2.6 Load and Store Instructions

"For best performance, the effective address for all loads and stores should be naturally aligned
for each data type (i.e., on a four-byte boundary for 32-bit accesses, and a two-byte boundary for
16-bit accesses). The base ISA supports misaligned accesses, but these might run extremely slowly
depending on the implementation. Furthermore, naturally aligned loads and stores are guaranteed
to execute atomically, whereas misaligned loads and stores might not, and hence require additional
synchronization to ensure atomicity.”

These are logically mutually exclusive use cases:

1) loads/stores for packed data - misaligned loads and stores are allowed but are not guaranteed to be atomic
2) AMOs for synchronisation primitives - based on 1), alignment is mandatory to guarantee atomicity

Therefore misaligned AMOs are self invalidating - atomics with no guaranteed atomicity

It is completely reasonable for misaligned AMOs to trap. There is no overlap between the use cases.

BTW, if you experiment with this “demonstrator” on x86, you can see that inter-processor synchronisation is done on a cache-line basis, so misaligned atomic operations work in some situations. This is in no way a rationale for misaligned AMOs; rather, it is a byproduct of misaligned load and store support and the cache-coherency quantum. If you perform misaligned loads and stores within a cache line, this program will not exit:

- https://gist.github.com/michaeljclark/31fc67fe41d233a83e9ec8e3702398e8
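
The distinction between "misaligned" and "misaligned across a cache line" can be written down as two small predicates. This is a sketch, not taken from the gist: the function names are made up, and the 64-byte `CACHE_LINE` is an assumption that matches the `ROUND(m, 64)` in the reproducer but is not universal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size in bytes; typical for x86, not guaranteed. */
#define CACHE_LINE 64u

/* Nonzero if addr is naturally aligned for a power-of-two access width. */
static int is_naturally_aligned(uintptr_t addr, size_t width)
{
    assert(width != 0 && (width & (width - 1)) == 0);
    return (addr & (width - 1)) == 0;
}

/* Nonzero if the bytes [addr, addr + width) touch two cache lines. */
static int crosses_cache_line(uintptr_t addr, size_t width)
{
    assert(width != 0);
    return (addr / CACHE_LINE) != ((addr + width - 1) / CACHE_LINE);
}
```

Applied to the addresses printed earlier in the thread with an 8-byte width: 0x7fcc2cc028bc sits at offset 60 within its line, so it is both misaligned and line-crossing, which is the case that tore. 0x7ffee1094741 sits at offset 1, so it is misaligned but stays within one line.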

RISC-V AMOs, however, are very distinct from the way atomics work on x86, i.e. a LOCK prefix which can be used on any operation that performs loads, stores or read-modify-write operations, such as XCHG, CMPXCHG, INC, DEC, etc.

I think the correct way to support atomic operations on misaligned data is via the T extension, with the use of BEGIN/COMMIT/ABORT. Also RISC-V doesn’t need any explicit hardware lock elision support given it has native AMOs which already convey enough information for the cache system to avoid LLC cache traffic for AMOs. I don’t like the LR/LRM proposal. Based on RISC principles, one would avoid complicating the ISA with partial transactional memory support in favor of actually implementing the T extension. It might even be simpler to implement RV128 using a 64-bit pointer ABI. i.e. no new instructions are necessary at all.

Jacob Bachmeyer

Dec 27, 2017, 8:33:13 PM
to Allen Baum, Jose Renau, Jonas Oberhauser, Michael Clark, RISC-V ISA Dev, Alex Solomatnikov, David Chisnall, Cesar Eduardo Barros
Allen Baum wrote:
> I actually like the Intel approach of "make it work, but as slowly as
> possible" - which is enabled if misaligned AMOs, (but not misaligned
> normal accesses) trap.
> I am a bit concerned about mixing normal loads and AMOs (that trap).
> Normally, I would expect a trapped AMO to grab a global lock, and
> perform the operation using AMOs for each half of the op.

I favor delegating misaligned AMO traps to the supervisor, which has the
ability to correctly (but slowly) emulate them. The monitor might be
able to handle a subset of misaligned AMOs, but the current wording
requires hardware to either handle all unaligned accesses or trap for
all unaligned accesses. While this permits the monitor to emulate
misaligned AMOs, it does so at significant cost: the monitor must also
handle RVI unaligned accesses that the hardware could otherwise easily
split -- and it requires that RVI unaligned accesses be atomic.

> But if another thread (possibly in another core) performs an unaligned
> store that overlaps the AMO (but isn't atomic and doesn't observe the
> global lock), then could it result in a torn value? OR is that torn
> value always a legal interpretation, because the ordering of the
> normal store and AMO isn't guaranteed.

I argue that the torn value is a permitted result, because the RVI store
is not guaranteed to be atomic if misaligned. The AMO must, however,
produce a result atomically with respect to other AMOs and aligned
accesses to either word.

-- Jacob

Andrew Waterman

Jan 2, 2018, 2:33:59 PM
to Jonas Oberhauser, Cesar Eduardo Barros, Jacob Bachmeyer, Jose Renau, RISC-V ISA Dev
On Thu, Dec 21, 2017 at 3:14 AM, Jonas Oberhauser <s9jo...@gmail.com> wrote:
>
>
> 2017-12-21 10:29 GMT+01:00 Andrew Waterman <wate...@eecs.berkeley.edu>:
>>
>>
>> On Wed, Dec 20, 2017 at 7:38 PM Cesar Eduardo Barros
>> <ces...@cesarb.eti.br> wrote:
>>>
>>> "If, for a given address and access width, a misaligned LR/SC or AMO
>>> generates a misaligned address exception, then {\em all} loads, stores,
>>> LRs/SCs, and AMOs using that address and access width must generate
>>> misaligned address exceptions."
>>
>>
>> For PMAs that forbid misaligned AMOs altogether (which, for some
>> platforms, could be for all addresses), these new constraints need not
>> apply. I agree the spec does not permit this as written, but I also agree
>> there’s no reason to forbid option B on platforms that have no need for
>> misaligned AMOs.
>
>
> What parts exactly does the spec currently forbid?
> As far as I understand, there are two ways of looking at it
> 1) misalignment interrupts *are* the PMA interrupts that forbid misaligned
> AMOs, and these new constraints always apply
> 2) PMAs do not have this level of granularity -- from the PMA side, AMOs are
> either allowed or forbidden, but never allowed when aligned and forbidden
> otherwise. Misalignment interrupts are not a PMA issue and are kind of
> "below" the PMA.

I was thinking neither. Misalignment exceptions indicate that the
access can/should be emulated; access exceptions indicate the access
is invalid so should not be emulated. For a given memory region, the
hardware platform has a few choices:

1) The PMA forbids all misaligned accesses; they raise access
exceptions. (This is usually the case for MMIO regions.)
2) The PMA forbids misaligned AMOs; they raise store access
exceptions. Misaligned loads & stores to this PMA are supported, and
can either be executed non-atomically in HW, or can raise misalignment
exceptions and be emulated non-atomically.
3) The PMA permits misaligned AMOs. Misaligned AMOs, loads, and
stores to this PMA all execute atomically in HW.
4) The PMA permits misaligned AMOs, but does not support them in HW.
Misaligned AMOs, loads, and stores all raise misaligned address
exceptions and are emulated atomically in SW.
5) A hybrid of options 3 and 4 (e.g., based upon whether the access
crosses a cache line or page boundary).

Option 2 is more or less the status quo.
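
The five choices above can be read as a decision table for what happens to a misaligned access. The sketch below encodes that reading. All identifiers are invented for illustration, `hw_splits` stands for the implementation choice left open in option 2, and treating "crosses a boundary" as the case that falls back to emulation in option 5 is an assumption based on the example given there.

```c
#include <assert.h>

enum access_kind { ACC_LOAD_STORE, ACC_AMO };

/* The five per-region choices, with hypothetical names. */
enum pma_policy {
    PMA_NO_MISALIGNED,         /* 1: all misaligned accesses forbidden     */
    PMA_NO_MISALIGNED_AMO,     /* 2: only misaligned AMOs forbidden        */
    PMA_MISALIGNED_AMO_HW,     /* 3: misaligned AMOs done atomically in HW */
    PMA_MISALIGNED_AMO_SW,     /* 4: misaligned AMOs emulated in SW        */
    PMA_MISALIGNED_AMO_HYBRID  /* 5: mix of 3 and 4                        */
};

enum outcome { EXEC_IN_HW, MISALIGNED_EXCEPTION, ACCESS_EXCEPTION };

/* Outcome of a misaligned access of kind k in a region with policy p. */
static enum outcome misaligned_outcome(enum pma_policy p, enum access_kind k,
                                       int hw_splits, int crosses_boundary)
{
    switch (p) {
    case PMA_NO_MISALIGNED:
        return ACCESS_EXCEPTION;
    case PMA_NO_MISALIGNED_AMO:
        if (k == ACC_AMO)
            return ACCESS_EXCEPTION;
        /* Loads/stores: split in HW, or trap and emulate non-atomically. */
        return hw_splits ? EXEC_IN_HW : MISALIGNED_EXCEPTION;
    case PMA_MISALIGNED_AMO_HW:
        return EXEC_IN_HW;
    case PMA_MISALIGNED_AMO_SW:
        return MISALIGNED_EXCEPTION;
    case PMA_MISALIGNED_AMO_HYBRID:
        return crosses_boundary ? MISALIGNED_EXCEPTION : EXEC_IN_HW;
    }
    assert(0 && "unknown PMA policy");
    return ACCESS_EXCEPTION;
}
```

In this reading, options 3 to 5 never distinguish loads and stores from AMOs, which is the all-or-nothing coupling the thread started with. Only option 2 lets hardware split misaligned loads and stores while refusing misaligned AMOs outright.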

Jonas Oberhauser

Jan 2, 2018, 3:06:30 PM
to Andrew Waterman, Cesar Eduardo Barros, Jacob Bachmeyer, Jose Renau, RISC-V ISA Dev
But if I understood the spec correctly, access exceptions do not consider  alignment. In other words, if you use access exceptions to signal that the misaligned AMO should not be emulated, then aligned AMOs will be illegal too.

So, do you suggest there should be misaligned access exceptions which are different from the misalignment exceptions we already have?


> 2) The PMA forbids [all] AMOs; they raise store access
> exceptions.  Misaligned loads & stores to this PMA are supported, and
> can either be executed non-atomically in HW, or can raise misalignment
> exceptions and be emulated non-atomically.
>
> Option 2 is more or less the status quo.

As far as I understand, only if you replace "misaligned" by "all", as I have done above.

Andrew Waterman

Jan 2, 2018, 6:08:04 PM
to Jonas Oberhauser, Cesar Eduardo Barros, Jacob Bachmeyer, Jose Renau, RISC-V ISA Dev
Any PMA violation should cause an access exception. Whether
misaligned accesses should be supported/emulated in a given address
range is one of the PMAs.

Allen J. Baum

Jan 2, 2018, 10:48:37 PM
to Andrew Waterman, Jonas Oberhauser, Cesar Eduardo Barros, Jacob Bachmeyer, Jose Renau, RISC-V ISA Dev
At 3:07 PM -0800 1/2/18, Andrew Waterman wrote:
>
>Any PMA violation should cause an access exception. Whether
>misaligned accesses should be supported/emulated in a given address
>range is one of the PMAs.

On my things-to-do list is defining the PMA of our chip, so I have some skin in this game.

Part of the discussion here is whether the trap indication can specify the exception type (access vs. alignment). The spec says:
"PMA violations manifest as load, store, or instruction-fetch access
exceptions, distinct from virtual-memory page-fault exceptions."

which certainly suggests that more than one kind of exception can be raised, though not precisely which ones can be raised.

While I think of a PMA as a single table with entries, the spec divides it into 5 PMA "checkers" that look at different properties:
- Supported_Types (which is only width and alignment restrictions),
- Atomicity (the level of AMOx ops that is supported)
- Ordering (I/O vs. memory, including the channel concept)
- Coherency and Cacheability
- Idempotency (whether Rd or Wt have side effects)

So when you say one of the PMAs indicates whether misaligned accesses are supported, you are talking about the first checker above.

I don't know if this table is intended to be exhaustive - that is another aspect that should be clarified.

My initial interpretation of the PMA structure is simply, from the outside, there are inputs and there are outputs that depend on them:
- inputs: address+width+access_type, configuration bits for any region
that could have variable attribute, (possibly processor mode?).

- outputs: either the message type to be sent to the memory system,
or an exception indication.

and that the outputs depend on the inputs without restriction,
BUT the separation of the PMAs above suggest something different; they suggest that the PMA attributes are independent.

So you might report that a region should get a misaligned exception based on the address+width, and that the same region doesn't raise invalid exception for a specific AMO based on address+access_type (or does, which would raise both invalid and unaligned exceptions??), but you can't say that you should
- raise no exception for aligned AMOs & normal accesses
- raise no exceptions for unaligned normal accesses,
- raise unaligned exception for unaligned AMOs.

Or maybe not. The Supported_Type checker can certainly raise either unaligned, or invalid if the width is not supported (or both?), so why can't an AMO checker raise either of them also?: unaligned if AMO type is unaligned for some range, and invalid if the AMO type is not supported in that range.

Note that Supported_Types could even raise unaligned only if the access crossed a cache line boundary for normal operations in some address range, but not otherwise (and could do that for AMOs as well - or not).

So each checker could raise invalid or unaligned exceptions (and others?), and can raise them simultaneously with other checkers.

This is just an interpretation based on the way the PMA is currently described;
it may not be the intent, in which case that section should probably be rewritten or clarified (with examples!) to make it clear that for each property being checked, each checker could raise exceptions simultaneously, that each exception type is ORed with the corresponding exception types from other checkers, and that only the most significant exception is reported (so precisely which ones can be signalled, and their order of precedence, need to be defined).
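
One way to picture this interpretation is as a bitmask that every checker contributes to, followed by a priority pick. The sketch below is only an illustration of that idea: the bit names are invented, and the precedence used here (access over misaligned) is an arbitrary assumption, since the order is precisely what is said above to be undefined.

```c
#include <assert.h>

/* Hypothetical exception bits that any PMA checker may raise. */
enum {
    EXC_NONE       = 0,
    EXC_MISALIGNED = 1 << 0,
    EXC_ACCESS     = 1 << 1
};

/* OR together what each of the n checkers reports for one access. */
static unsigned combine_checkers(const unsigned *results, int n)
{
    unsigned all = EXC_NONE;

    assert(results != 0 || n == 0);
    for (int i = 0; i < n; i++)
        all |= results[i];
    return all;
}

/* Report a single exception. Assumed precedence: access, then misaligned. */
static unsigned reported_exception(unsigned all)
{
    if (all & EXC_ACCESS)
        return EXC_ACCESS;
    if (all & EXC_MISALIGNED)
        return EXC_MISALIGNED;
    return EXC_NONE;
}
```

With this shape, a Supported_Types checker reporting `EXC_MISALIGNED` and an Atomicity checker reporting `EXC_ACCESS` for the same access would yield a single access exception. Swapping the two tests in `reported_exception` gives the opposite policy, which is why the precedence needs to be written down.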

>> So, do you suggest there should be misaligned access exceptions which are
>> different from the misalignment exceptions we already have?
>>
>>
>> 2) The PMA forbids [all] AMOs; they raise store access
>> exceptions. Misaligned loads & stores to this PMA are supported, and
>> can either be executed non-atomically in HW, or can raise misalignment
>> exceptions and be emulated non-atomically.
>>
>> Option 2 is more or less the status quo.
>>
>>
>> As far as I understand, only if you replace "misaligned" by "all", as I have
>> done above.


--
**************************************************
* Allen Baum tel. (908)BIT-BAUM *
* 248-2286 *
**************************************************

Stefan O'Rear

Jan 2, 2018, 10:53:33 PM
to Andrew Waterman, Jonas Oberhauser, Cesar Eduardo Barros, Jacob Bachmeyer, Jose Renau, RISC-V ISA Dev
On Tue, Jan 2, 2018 at 11:33 AM, Andrew Waterman
<wate...@eecs.berkeley.edu> wrote:
> I was thinking neither. Misalignment exceptions indicate that the
> access can/should be emulated; access exceptions indicate the access
> is invalid so should not be emulated. For a given memory region, the
> hardware platform has a few choices:
>
> 1) The PMA forbids all misaligned accesses; they raise access
> exceptions. (This is usually the case for MMIO regions.)
> 2) The PMA forbids misaligned AMOs; they raise store access
> exceptions. Misaligned loads & stores to this PMA are supported, and
> can either be executed non-atomically in HW, or can raise misalignment
> exceptions and be emulated non-atomically.
> 3) The PMA permits misaligned AMOs. Misaligned AMOs, loads, and
> stores to this PMA all execute atomically in HW.
> 4) The PMA permits misaligned AMOs, but does not support them in HW.
> Misaligned AMOs, loads, and stores all raise misaligned address
> exceptions and are emulated atomically in SW.
> 5) A hybrid of options 3 and 4 (e.g., based upon whether the access
> crosses a cache line or page boundary).
>
> Option 2 is more or less the status quo.

I like this breakdown. Can it be made explicit in the spec?

-s

Andrew Waterman

Jan 2, 2018, 11:04:30 PM
to Stefan O'Rear, Jonas Oberhauser, Cesar Eduardo Barros, Jacob Bachmeyer, Jose Renau, RISC-V ISA Dev
I think that's a good idea. Also, the ambiguities that Allen raised
should be specifically addressed.