RISC-V Memory Consistency Model Draft

Daniel Lustig

Dec 1, 2017, 8:40:17 PM

to isa...@groups.riscv.org

Hi everyone,

We in the RISC-V memory model task group are ready to release the first
public draft of the memory consistency model specification that we've
been working on over the past few months. For those of you who
attended the workshop this week, this document will fill in some of the
details. For those of you who couldn't make it, I've attached my
presentation slides as well. The video of my talk (and of all the other
talks) should be posted online within a week or so.

If anyone has any comments, questions, or feedback, feel free to respond
here, to reach out to us in the memory model task group, or even just to
respond to me directly. I'm more than happy to take the feedback.

Over the next few weeks, assuming nobody uncovers any glaring errors,
we'll start working to merge this into the rest of the user-level ISA
spec (in some way or other, details TBD) so that we can aim to put forth
both together for official ratification in the coming months. We'll
also of course fix any typos, bugs, or discrepancies that are found in
the meantime.

We're also actively communicating with the Linux maintainers, the gcc
and LLVM maintainers, and more so that we make sure that the memory
model interacts properly with all of the above.

Let us know what you think!

Dan

Tue0954-RISC-V_Memory_Model-Lustig.pdf

memory-model-spec.pdf

Cesar Eduardo Barros

Dec 2, 2017, 9:26:10 AM

to Daniel Lustig, isa...@groups.riscv.org

Em 01-12-2017 23:40, Daniel Lustig escreveu:
> Hi everyone,
>
> We in the RISC-V memory model task group are ready to release the first
> public draft of the memory consistency model specification that we've
> been working on over the past few months. For those of you who
> attended the workshop this week, this document will fill in some of the
> details. For those of you who couldn't make it, I've attached my
> presentation slides as well. The video of my talk (and of all the other
> talks) should be posted online within a week or so.
>
> If anyone has any comments, questions, or feedback, feel free to respond
> here, to reach out to us in the memory model task group, or even just to
> respond to me directly. I'm more than happy to take the feedback.

I'm not understanding the need for the RVTSO option. According to 1.2 in
the draft, everything RVTSO does is equivalent to the instruction
decoder acting as if the .aq and/or .rl bits were set in the
instruction; and for the currently non-existent l?.aq/s?.rl, adding the
corresponding fence after/before the instruction. Nothing prevents the
exact same changes from being made by the compiler and/or the assembler.
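
A minimal sketch of that assembler-level rewrite, assuming the fence-based mapping described above (a `fence r,rw` after each plain load and a `fence rw,w` before each plain store). The function name `tso_rewrite`, the mnemonic sets, and the one-instruction-per-line input format are illustrative assumptions, not part of the draft or of any existing assembler:

```python
# Mnemonics treated as plain integer loads and stores in this sketch.
LOADS = {"lb", "lh", "lw", "ld", "lbu", "lhu", "lwu"}
STORES = {"sb", "sh", "sw", "sd"}

def tso_rewrite(asm_lines):
    """Return a copy of asm_lines with fences added around loads and stores.

    Hypothetical helper: assumes one instruction per line, no labels or
    comments, and that the mnemonic is the first whitespace-separated token.
    """
    out = []
    for line in asm_lines:
        parts = line.split()
        mnemonic = parts[0] if parts else ""
        if mnemonic in STORES:
            # Release-style fence: earlier accesses are ordered before the store.
            out.append("fence rw,w")
        out.append(line)
        if mnemonic in LOADS:
            # Acquire-style fence: the load is ordered before later accesses.
            out.append("fence r,rw")
    return out
```

Because a pass like this works on source text, a module that assumes TSO could in principle be rewritten this way and linked with unmodified RVWMO modules.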

That is, the RVTSO extension would gain nothing over RVWMO with a
hypothetical l?.aq/s?.rl extension, and only have a minor code size gain
over RVWMO without it (the fences can be macro-op fused with the
load/store), at the cost of potentially fragmenting the ecosystem. Since
software written for RVTSO can't run on RVWMO, there would be pressure
for cores to implement RVTSO, and once a core implements RVTSO, software
written for it can start accidentally depending on TSO, even if it's
tagged as being portable to RVWMO.

Also, having RVTSO be only a compiler and/or assembler option would
allow mixing TSO and WMO modules; for instance, a program ported from
x86 which assumed its stronger memory ordering could be linked to an
optimized openssl library which benefits from a weaker memory ordering.
The program could even be gradually migrated to the weaker memory
ordering, module by module.

Therefore, my opinion is that RVTSO would be better as a compiler and
assembler option, instead of an extension. If extensions are needed, it
would be better to specify them as extra instructions which can be
used by the compiler/assembler in their TSO mode. As a bonus, the extra
instructions could also be used by RVWMO programs.

--
Cesar Eduardo Barros
ces...@cesarb.eti.br

Anthony Coulter

Dec 2, 2017, 10:39:21 AM

to isa...@groups.riscv.org

All of my feedback is related to readability. I freely admit that I'm
not qualified to judge the technical merit of this memory model, and
in fact I'm reading this document more because I want to understand
memory models in general than because I care specifically about RISC-V.

Section 1.1.1: Memory Model Primitives
Table 1.1's caption introduces ".sc" as a synonym for ".aqrl" but this
synonym is not used anywhere except section 3.1 (the formal
specification in Alloy).

The definition of LR/SC pairs in the third paragraph of this section
can be cleaned up a bit:

"A successful sc instruction is said to be paired with the last lr
instruction that precedes it in program order; their corresponding
memory operations are also said to be paired."

Section 2.4.2: I/O Ordering
This section states that the preserved program order rules don't
apply to I/O regions, and then "informally" introduces ten new rules
which are written rigorously. Would it be fair to call this a formal
specification of the "preserved program order rules for I/O accesses"?
Or are these ten new rules not enforced by RVWMO?

Also: What's the difference between rules 2 and 4 in this section? More
specifically, what's the difference between strongly-ordered I/O
regions and their corresponding channel numbers? I/O channels are
introduced in section 3.5.4 of the RISC-V privileged spec but it isn't
really clear to me what exactly they are, whether they're visible to
software, etc. (I had always assumed they were *not* software-visible,
and were just an informal labeling of different strongly-ordered I/O
regions. But my understanding doesn't really explain why channels zero
and one are special; it would seem that every memory map is required to
support these two I/O regions with special synchronization properties
even if there are no underlying devices that require those properties.
I'm almost certainly wrong.)

Section 2.6: Code Porting Guidelines
Table 2.2 uses the phrase "fence-based equivalents" in both a column
header and the caption. Since the fences are strictly stronger than
their .aq/.rl counterparts, I would lean against the word "equivalent."
Could this be renamed to "fence-based implementation"?

Section 2.7: Implementation guidelines
The litmus test explaining write subsumption says that "(a) must follow
(f) in the global memory order." I believe "follow" should be replaced
with "precede."

Sections 2.3.9 and 2.4.2: The phrase "is ordered before"
Most of the document uses the verb "precedes" but these two sections
use the clumsier "is ordered before." I would prefer a uniform use of
the verb "precede" in all of these places, with the convention that
the word "precedes" always refers to global memory order unless
otherwise specified, e.g. in the phrase "precedes in program order."
This convention could be stated explicitly at the beginning of the
document.

Section 2.8: Summary of New/Modified ISA Features
The spec makes a few references to "fence.tso" as though it is an
assembler pseudoinstruction, but fence.tso isn't mentioned here or
in the RISC-V user spec. Will it be added?

Section ???: The Flat RISC-V Operational Model
This document/section has its own title, authors, and page numbers.
It also ends mid-sentence. Is this part of the official draft?

Regards,
Anthony Coulter

Mark Friedenbach

Dec 2, 2017, 2:11:08 PM

to Daniel Lustig, isa...@groups.riscv.org

I find it extremely concerning that it is proposed for there to be two memory consistency models that would result in incompatible binaries, particularly when the failure mode is perhaps subtly-different implementation-defined results. When properly accounted for, this fragments the RISC-V compiled software ecosystem, even when the best intentions are otherwise. When not properly managed, this results in data corruption and different-in-testing-and-production bugs.

I’m not sure in what way this is written down as an official foundation goal, but in practice I’ve observed that preventing ecosystem fragmentation is a strong goal of this community. As was pointed out in another thread, it is for example the case that RV32 code can now be run on RV64 by configuring the SXL and UXL fields in mstatus / sstatus, removing a terrible wart in the original ISA where RV32 couldn’t be run on RV64. RV32E, the microcontroller profile with a smaller register file, is a proper subset of the user-mode RV32G profile, allowing RV32E to be run unmodified on more powerful 32-bit chips, or even RV64 when care is taken to configure SXL/UXL appropriately.

This forked memory consistency model would break that very fundamental feature of the RISC-V ecosystem. Furthermore, if I understand it correctly it breaks compatibility in the “wrong” direction. It would now be the case that binaries compiled for less capable in-order profiles would be unable to guarantee the same semantics when run on faster, more capable systems that make use of the optimizations a more weakly ordered memory model provides. You can’t take the binary firmware compiled for TSO and then choose to run it on a more powerful WMO chip requiring explicit fences.

I apologize that this feedback comes so late in the process, but I wonder if there is an ISA approach that can be taken to ensure that code written for either profile will at least run correctly on the other. You have that compatibility in one direction now: WMO code will run unmodified on TSO platforms, with the extra fences treated as a NOP. Could you not achieve compatibility in the other direction by introducing a global consistency flag field in the status registers that instructs all loads and stores to be automatically fenced? Obviously that would be an inefficient use of silicon, but it allows the binary to run and produce correct results even on a different architectural profile than it was designed for, which is an extremely important and pragmatic fallback option for real-world engineering projects, both in testing and development and in operational support of existing deployments that might receive hardware upgrades while maintaining software compatibility. And critically, while ELF binaries will still have a consistency profile bit, any single system will be able to execute either binary, the difference being only which memory model was optimized for (code size for TSO, explicit fencing for WMO).

(An even better solution would be to allow this configuration on a per-page level so that TSO programs can be linked to WMO libraries and vice-versa to run in the same process, but the details of the privilege spec are not something I’ve kept abreast of. I leave it to others to make specific suggestions there.)

Mark Friedenbach
Blockstream

Christopher Celio

Dec 2, 2017, 3:13:47 PM

to Mark Friedenbach, Daniel Lustig, isa...@groups.riscv.org

Unfortunately, there's not really a way to win here.

In the past, the ISA was controlled by one company. And the memory model was "whatever our first chip did" and later chips had to follow (or patch fixes into the binaries somehow).

For RISC-V, even if the only ratified memory model was WMO, there's nothing stopping somebody shipping a TSO chip, or a sequentially consistent (SC) chip, or a WMO chip, as they all would obey the WMO model.
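
One way to see why all three chips would obey WMO is to describe each model, very coarsely, by the program-order pairs (to different addresses) that it lets the hardware reorder. This is a textbook simplification, not the RVWMO or RVTSO definition from the draft: it ignores dependencies, fences, atomics and store forwarding, and every name below is made up for illustration:

```python
# Each model is the set of (earlier op, later op) pairs that may be reordered.
SC = frozenset()
TSO = frozenset({("store", "load")})
WMO = frozenset({
    ("store", "load"), ("store", "store"),
    ("load", "load"), ("load", "store"),
})

def hardware_obeys(hardware, model):
    """Hardware obeys a model if it performs no reordering the model forbids."""
    return hardware <= model

def software_safe_on(assumed_model, hardware):
    """Software written against a model is safe if the hardware obeys that model."""
    return hardware_obeys(hardware, assumed_model)
```

Under this view, `software_safe_on(WMO, TSO)` holds but `software_safe_on(TSO, WMO)` does not, which is the one-way compatibility discussed in this thread.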

If a particular chip becomes successful in the RISC-V marketplace that happened to be TSO, then TSO could become the de facto standard anyways. Or companies struggling to port their x86-TSO code may go shopping for a RISC-V core and mandate to the cpu dev that it be TSO. So I think there's a fair argument for encoding which MCM the software expects if we're going to end up in this mess anyways.

The important part of a RISC-V standard memory model is that all upstreamed software must abide by WMO (Linux, gcc, etc.). That's the value. And they already do. Groups like Fedora are not going to ship two versions, they're going to do WMO. Most programmers are calling into libraries and using compilers anyways, so MCM is hidden away.

> You have that compatibility in one direction now: WMO code will run unmodified on TSO platforms

As a cpu guy, it's annoying to target two MCMs, but frankly, WMO is more project risk so you'll always want a fallback to TSO anyways. So that's how you'll run TSO-code on "WMO-platforms": you'll simply force the CPU to drain stores in-order and get explicit acks on each store before proceeding to the next.
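
A toy illustration of why draining stores in order matters, assuming a store buffer that holds (address, value) pairs to distinct addresses and another hart that can observe memory between any two drains. `visible_states` is a hypothetical helper for this sketch and does not describe any real core:

```python
from itertools import permutations

def visible_states(stores, in_order):
    """Return every memory snapshot an observer could see while the buffer drains.

    stores   -- list of (address, value) pairs in program order
    in_order -- True: drain strictly FIFO; False: drain in any order
    """
    drain_orders = [tuple(stores)] if in_order else set(permutations(stores))
    snapshots = set()
    for order in drain_orders:
        mem = {}
        snapshots.add(tuple(sorted(mem.items())))      # before any store lands
        for addr, value in order:
            mem[addr] = value
            snapshots.add(tuple(sorted(mem.items())))  # after each store lands
    return snapshots
```

With the buffer `[("data", 1), ("flag", 1)]`, the snapshot in which only `flag` has been written can appear when draining out of order but not when draining in order.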

-Chris

Andrew Waterman

Dec 2, 2017, 3:46:38 PM

to Mark Friedenbach, Daniel Lustig, isa...@groups.riscv.org

On Sat, Dec 2, 2017 at 11:11 AM Mark Friedenbach <ma...@friedenbach.org> wrote:

> I find it extremely concerning that it is proposed for there to be two memory consistency models that would result in incompatible binaries, particularly when the failure mode is perhaps subtly-different implementation-defined results. When properly accounted for, this fragments the RISC-V compiled software ecosystem, even when the best intentions are otherwise. When not properly managed, this results in data corruption and different-in-testing-and-production bugs.

TSO binaries will have different ELF flags, so the failure won’t be subtle: the OS/loader will refuse to execute the binary. In this respect, it’s like other ISA/ABI extensions.
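
A rough sketch of such a loader check. The header offsets and the RISC-V machine number are standard ELF facts; the flag value `0x0010` is an assumption here (it is, to my knowledge, what the RISC-V ELF psABI uses for `EF_RISCV_TSO`), and `loader_accepts` and `fake_header` are hypothetical helpers, not any real loader's API:

```python
import struct

EM_RISCV = 243          # ELF e_machine value for RISC-V
EF_RISCV_TSO = 0x0010   # assumed e_flags bit marking a binary that needs TSO

def elf_requires_tso(header: bytes) -> bool:
    """Inspect an ELF header and report whether the TSO flag bit is set."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64 = header[4] == 2                    # EI_CLASS: 1 = ELF32, 2 = ELF64
    endian = "<" if header[5] == 1 else ">"   # EI_DATA: 1 = little endian
    (machine,) = struct.unpack_from(endian + "H", header, 18)
    if machine != EM_RISCV:
        raise ValueError("not a RISC-V binary")
    flags_offset = 48 if is_64 else 36        # position of e_flags in the header
    (flags,) = struct.unpack_from(endian + "I", header, flags_offset)
    return bool(flags & EF_RISCV_TSO)

def loader_accepts(header: bytes, hardware_is_tso: bool) -> bool:
    """Hypothetical policy: refuse TSO binaries on hardware that is not TSO."""
    return hardware_is_tso or not elf_requires_tso(header)

def fake_header(flags: int) -> bytes:
    """Build a minimal little-endian ELF64 RISC-V header for demonstration."""
    ident = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9)
    return ident + struct.pack("<HHIQQQI", 2, EM_RISCV, 1, 0, 0, 0, flags)
```

A check like this only helps when the binary was actually tagged and the environment loads ELF files at all.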

All standard software will be WMO and thus will run on either hardware. At the moment there aren’t even any compilers that target TSO, and as one of the GCC maintainers, my preference is to never support TSO code generation. So, while I share your concerns about fragmentation, I think we can succeed in mitigating it.

Andrew Waterman

Dec 2, 2017, 3:53:04 PM

to Cesar Eduardo Barros, Daniel Lustig, isa...@groups.riscv.org

The main reason for RVTSO is that some vendors insist on straight porting of x86 code, or want to have additional/easier-to-prove guarantees in safety-critical systems. I’m not one of these vendors, but I acknowledge that no matter what we ISA designers do, some people will build and advertise TSO hardware. So it makes sense to me to standardize what that means.

Mark Friedenbach

Dec 2, 2017, 4:02:56 PM

to Andrew Waterman, Daniel Lustig, isa...@groups.riscv.org

On Dec 2, 2017, at 12:46 PM, Andrew Waterman <wate...@eecs.berkeley.edu> wrote:

> TSO binaries will have different ELF flags, so the failure won’t be subtle: the OS/loader will refuse to execute the binary. In this respect, it’s like other ISA/ABI extensions.

You’re assuming a standard UNIX-y environment with ELF binaries.

> All standard software will be WMO and thus will run on either hardware. At the moment there aren’t even any compilers that target TSO, and as one of the GCC maintainers, my preference is to never support TSO code generation. So, while I share your concerns about fragmentation, I think we can succeed in mitigating it.

That’s wishful thinking. On the other hand, TSO code is smaller (that’s why it’s even being officially supported by the ISA, right?) so I guarantee you’ll see performance-critical, size-constrained micro distributions compiling TSO code without fences. People will use these to make products. Years later, people supporting or inheriting these products will do a hardware refresh and for whatever reason (I can think of many) want to source a WMO design instead. Maybe they want to emulate the microcontroller inside a virtualized thread on a single chip, and don’t have the original source code, just a flat binary firmware file. They’re stuck: they have to source a TSO chip or pick a different approach.

It’s a hypothetical, but I wouldn't consider this an unrealistic or implausible outcome at all.

Andrew Waterman

Dec 2, 2017, 4:17:28 PM

to Mark Friedenbach, Daniel Lustig, isa...@groups.riscv.org

On Sat, Dec 2, 2017 at 1:02 PM Mark Friedenbach <ma...@friedenbach.org> wrote:

> On Dec 2, 2017, at 12:46 PM, Andrew Waterman <wate...@eecs.berkeley.edu> wrote:

>> TSO binaries will have different ELF flags, so the failure won’t be subtle: the OS/loader will refuse to execute the binary. In this respect, it’s like other ISA/ABI extensions.

> You’re assuming a standard UNIX-y environment with ELF binaries.

>> All standard software will be WMO and thus will run on either hardware. At the moment there aren’t even any compilers that target TSO, and as one of the GCC maintainers, my preference is to never support TSO code generation. So, while I share your concerns about fragmentation, I think we can succeed in mitigating it.

> That’s wishful thinking. On the other

Let’s call it hopeful thinking. It doesn’t require a magical outcome: it just requires staying the current course.

> hand, TSO code is smaller (that’s why it’s even being officially supported by the ISA, right?) so I guarantee you’ll see performance-critical, size-constrained micro distributions compiling TSO code without fences.

The memory model has nearly zero impact on code size, since synchronization is usually statically uncommon, contained in library code, or both. In embedded systems where code size matters the most, most synchronization is I/O, not memory, and thus unaffected by this debate.

> People will use these to make products. Years later, people supporting or inheriting these products will do a hardware refresh and for whatever reason (I can think of many) want to source a WMO design instead. Maybe they want to emulate the microcontroller inside a virtualized thread on a single chip, and don’t have the original source code, just a flat binary firmware file. They’re stuck: they have to source a TSO chip or pick a different approach.

Yep. No way around this sort of lock-in. However, it will happen whether or not we standardize RVTSO. As I said in another email, vendors are building and advertising TSO as a feature. We can’t stop that. We made the strategic decision to standardize how vendors should do this, as a means of preventing yet more fragmentation (e.g., RISC-VI).

Christoph Hellwig

Dec 3, 2017, 2:10:01 PM

to Cesar Eduardo Barros, Daniel Lustig, isa...@groups.riscv.org

On Sat, Dec 02, 2017 at 12:26:03PM -0200, Cesar Eduardo Barros wrote:
> Also, having RVTSO be only a compiler and/or assembler option would allow
> mixing TSO and WMO modules; for instance, a program ported from x86 which
> assumed its stronger memory ordering could be linked to an optimized
> openssl library which benefits from a weaker memory ordering. The program
> could even be gradually migrated to the weaker memory ordering, module by
> module.

I agree. Having two different memory models offered by hardware is
a very bad idea that will lead to compatibility nightmares. The right
way to support RVTSO in RISC-V is to define it at the assembler layer.

> Therefore, my opinion is that RVTSO would be better as a compiler and
> assembler option, instead of an extension. If extensions are needed, it
> would be better to specify them as extra instructions which can be used
> by the compiler/assembler in their TSO mode. As a bonus, the extra
> instructions could also be used by RVWMO programs.

Agreed, although I think no extra instructions should be required.

Christoph Hellwig

Dec 3, 2017, 2:12:29 PM

to Andrew Waterman, Mark Friedenbach, Daniel Lustig, isa...@groups.riscv.org

On Sat, Dec 02, 2017 at 08:46:24PM +0000, Andrew Waterman wrote:
> TSO binaries will have different ELF flags, so the failure won’t be subtle:
> the OS/loader will refuse to execute the binary. In this respect, it’s like
> other ISA/ABI extensions.

The problem isn't TSO binaries. The problem is people writing code that
is not aware of weak memory order, only testing on a chip that implements
TSO and then running into problems when run elsewhere.

An assembler/compiler option to generate acquire/release fences for
every load and store is the much better way to go, and chips that handle
this efficiently will surely get some good market traction.

Christoph Hellwig

Dec 3, 2017, 2:15:38 PM

to Andrew Waterman, Mark Friedenbach, Daniel Lustig, isa...@groups.riscv.org

On Sat, Dec 02, 2017 at 09:17:13PM +0000, Andrew Waterman wrote:
> The memory model has nearly zero impact on code size, since
> synchronization is usually statically uncommon, contained in library code,
> or both. In embedded systems where code size matters the most, most
> synchronization is I/O, not memory, and thus unaffected by this debate.

That's not always the case for a lot of system software, or highly optimized
parallel software.

This whole idea of "only experts write synchronization primitives and they
are in libraries" is wrong. Not only do a lot of programmers that
shouldn't write synchronization primitives do so, but also a lot of well
written synchronization code is either available inline (e.g. GCC builtins,
Linux kernel arch functions and their copies in lots of userspace packages)
or inlined on demand by JIT compilers.

David Chisnall

Dec 3, 2017, 2:34:15 PM

to Andrew Waterman, Mark Friedenbach, Daniel Lustig, isa...@groups.riscv.org

On 2 Dec 2017, at 20:46, Andrew Waterman <wate...@eecs.berkeley.edu> wrote:
>
> TSO binaries will have different ELF flags, so the failure won’t be subtle: the OS/loader will refuse to execute the binary. In this respect, it’s like other ISA/ABI extensions.

Unfortunately, that’s not how it is likely to work. There is currently no way for a compiler to statically analyse code to ensure that it is safe in a WMO mode (though some of my colleagues are working in that direction). This means that people are likely to compile code that implicitly assumes TSO without specifying the extra flag that says ‘this needs a TSO CPU’, test it on their TSO CPU, see that it runs fine, and ship it.

Worse, code that requires TSO will often work fine on WMO processors with subtle and intermittent failures. Unless you have a very well-written test suite that hammers the concurrent parts of a program, it’s very easy to have something that appears to work fine on a WMO system, but fails in subtle ways, and works fine on a TSO system.
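
The kind of failure meant here can be shown with a toy enumeration of the classic message-passing pattern. This is a deliberately crude sketch, not the RVWMO or RVTSO definition from the draft: "wmo" lets any two accesses to different addresses in a thread reorder, "tso" only lets a later load pass an earlier store, and fences block all reordering. All names are invented for illustration:

```python
from itertools import permutations

# An op is ("st", addr, value), ("ld", addr, register) or ("fence",).

def may_reorder(first, second, model):
    """May `second` execute before `first`, which precedes it in program order?"""
    if "fence" in (first[0], second[0]):
        return False
    if first[1] == second[1]:          # same address: keep program order
        return False
    if model == "wmo":
        return True
    if model == "tso":
        return first[0] == "st" and second[0] == "ld"
    return False                       # anything else is treated as SC

def thread_orders(thread, model):
    """All execution orders of one thread that the toy model allows."""
    n = len(thread)
    allowed = []
    for perm in permutations(range(n)):
        ok = all(
            may_reorder(thread[perm[j]], thread[perm[i]], model)
            for i in range(n) for j in range(i + 1, n)
            if perm[i] > perm[j]       # this pair executes out of program order
        )
        if ok:
            allowed.append([thread[k] for k in perm])
    return allowed

def interleavings(a, b):
    """All merges of two op sequences that keep each sequence's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def outcomes(t0, t1, model):
    """Set of final register values reachable for two threads (memory starts at 0)."""
    seen = set()
    for o0 in thread_orders(t0, model):
        for o1 in thread_orders(t1, model):
            for trace in interleavings(o0, o1):
                mem, regs = {}, {}
                for op in trace:
                    if op[0] == "st":
                        mem[op[1]] = op[2]
                    elif op[0] == "ld":
                        regs[op[2]] = mem.get(op[1], 0)
                seen.add(tuple(sorted(regs.items())))
    return seen

writer = [("st", "data", 1), ("st", "flag", 1)]
reader = [("ld", "flag", "r1"), ("ld", "data", "r2")]
stale = (("r1", 1), ("r2", 0))         # flag seen as set, but data still old
```

In this toy model the `stale` outcome is unreachable under "tso" and reachable under "wmo", and it disappears again once a `("fence",)` is placed between the two accesses in each thread. On real hardware the bad interleaving may be rare, which is why testing alone tends not to catch it.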

I agree with Christopher, the end result of this is very likely to be a convoluted way of defining that RISC-V needs TSO. You are going to end up with some chips that are TSO, some that aren’t, code that is not properly tested with the weak ordering and the only way for a new RISC-V core to guarantee that code will work is to implement TSO.

If the end goal is to end up in this state, then skipping the WMO spec makes sense - we should just define TSO and be done with it.

The only place where this distinction makes sense is for code that’s written assuming something like the C++11 memory model and can, with a single compiler flag, be compiled to use a weaker memory ordering. The implicit hypothesis here is that a CPU implementing the WMO with explicit barriers will be faster / lower power than an equivalent CPU implementing TSO running the same code compiled without the barriers. This is still something of an open research question.

David

Alex Bradbury

Dec 3, 2017, 3:05:44 PM

to David Chisnall, Andrew Waterman, Mark Friedenbach, Daniel Lustig, RISC-V ISA Dev

On 3 December 2017 at 19:34, David Chisnall <David.C...@cl.cam.ac.uk> wrote:
> On 2 Dec 2017, at 20:46, Andrew Waterman <wate...@eecs.berkeley.edu> wrote:
>>
>> TSO binaries will have different ELF flags, so the failure won’t be subtle: the OS/loader will refuse to execute the binary. In this respect, it’s like other ISA/ABI extensions.
>
> Unfortunately, that’s not how it is likely to work. There is currently no way for a compiler to statically analyse code to ensure that it is safe in a WMO mode (though some of my colleagues are working in that direction). This means that people are likely to compile code that implicitly assumes TSO without specifying the extra flag that says ‘this needs a TSO CPU’, test it on their TSO CPU, see that it runs fine, and ship it.
>
> Worse, code that requires TSO will often work fine on WMO processors with subtle and intermittent failures. Unless you have a very well-written test suite that hammers the concurrent parts of a program, it’s very easy to have something that appears to work fine on a WMO system, but fails in subtle ways, and works fine on a TSO system.

A given RISC-V implementation can always choose to implement a
stronger memory model than WMO: either something quite a lot
stronger such as TSO or SC, or just neglecting to exploit some of the
opportunities for memory reordering. As you point out, code might have
hidden bugs that are difficult/impossible to trigger on a particular
implementation. I think this concern is orthogonal to the RVTSO and RVWMO
memory models. Or at least, it would exist even if RISC-V only defined
RVWMO, or if it only defined RVTSO (if an implementation actually
enforced something stronger).

The only way to avoid your concern seems to be to define the standard
RISC-V memory model as the strongest we believe any vendor is likely
to implement.

Best,

Alex

Mark Friedenbach

Dec 3, 2017, 3:15:19 PM

to Andrew Waterman, Daniel Lustig, isa...@groups.riscv.org

> On Sat, Dec 2, 2017 at 1:02 PM Mark Friedenbach <ma...@friedenbach.org> wrote:

>>> All standard software will be WMO and thus will run on either hardware. At the moment there aren’t even any compilers that target TSO, and as one of the GCC maintainers, my preference is to never support TSO code generation. So, while I share your concerns about fragmentation, I think we can succeed in mitigating it.

>> That’s wishful thinking. On the other

> Let’s call it hopeful thinking. It doesn’t require a magical outcome: it just requires staying the current course.

It requires incentives different than we have in reality. The current course is not leading toward a future where TSO and WMO are clearly distinguished, TSO used only in application specific code by those who need it, and all software targets WMO by default. The reality, as I and others are pointing out, is that it will result in ecosystem fragmentation as otherwise well meaning but busy people take the path of least resistance and make decisions that help product delivery and bottom line without considering broader ecosystem issues.

>> hand, TSO code is smaller (that’s why it’s even being officially supported by the ISA, right?) so I guarantee you’ll see performance-critical, size-constrained micro distributions compiling TSO code without fences.

> The memory model has nearly zero impact on code size, since synchronization is usually statically uncommon, contained in library code, or both. In embedded systems where code size matters the most, most synchronization is I/O, not memory, and thus unaffected by this debate.

Then why support TSO at all? Define WMO as *the* memory consistency model and be done with it, since having excess memory barrier opcodes isn’t a code size or performance issue. Code that assumes a stronger consistency model and lacks these barriers would either be out of compliance with the ISA or be considered to be making use of a non-portable vendor extension.

Putting the seal of approval on TSO, and (critically) standardizing a different executable format for TSO code makes ecosystem fragmentation inevitable, no matter the intentions. And what benefit is had for the RISC-V community as a result? Compilers can target WMO code; that is not a problem. The only issue seems to be vendors wanting to lazily translate x86 code with TSO assumptions. Is platform fragmentation an acceptable trade off for saving a few vendors engineering time in translating legacy code (which, as you point out, mostly is a concern in isolated library code)? It doesn’t seem like a win or good long term thinking from my perspective, and it is entirely reasonable for the foundation to just say “no.” We are charged with safeguarding the RISC-V ISA and ecosystem, after all, even from those who just want a short cut to using it.

I was operating under the assumption that there were application domains where TSO vs WMO makes a large enough difference that there are tangible benefits from officially supporting one over the other in an ISA-visible way. If that is not the case, then we should really just pick one — presumably WMO as weakly ordered assumptions work on TSO architectures but not vice versa, but sticking to one model matters more than which one is selected.

>> People will use these to make products. Years later, people supporting or inheriting these products will do a hardware refresh and for whatever reason (I can think of many) want to source a WMO design instead. Maybe they want to emulate the microcontroller inside a virtualized thread on a single chip, and don’t have the original source code, just a flat binary firmware file. They’re stuck: they have to source a TSO chip or pick a different approach.

> Yep. No way around this sort of lock-in. However, it will happen whether or not we standardize RVTSO. As I said in another email, vendors are building and advertising TSO as a feature. We can’t stop that. We made the strategic decision to standardize how vendors should do this, as a means of preventing yet more fragmentation (e.g., RISC-VI).

Vendors are also including all sorts of custom extensions of the base ISA for their specific applications. Code using these application-specific features is not compatible with other RISC-V products. It means that code written to make use of these features is not necessarily portable across implementations. This is all fine and expected.

A vendor producing and advertising a chip with a stronger memory consistency model isn’t really any different than one which is producing a chip with custom opcodes. It’s just making promises about what would otherwise be implementation defined behavior, and software making use of these guarantees would be just as incompatible as software using otherwise invalid opcode extensions.

If we standardized on WMO as the only memory consistency model, then we are likely to see compilers and all non-application-specific general software binaries adhere to that model. We will not have ecosystem fragmentation any more than you see in, say, Intel vs AMD vector extensions—only people who know what they are doing produce incompatible code, and often only in application-specific contexts.

But standardizing TSO in such a way that precludes binary compatibility is a vastly different proposition. You *will* get TSO-optimized Linux distributions, or TSO-optimized vendor binaries, because the original deployments were all TSO and some engineer there read online that TSO optimization cuts code size, etc. etc.

This seems a very dangerous road to go down.

Christopher Celio

Dec 3, 2017, 3:17:59 PM

to David Chisnall, Andrew Waterman, Mark Friedenbach, Daniel Lustig, isa...@groups.riscv.org

Re-reading what I wrote, I did not mean to imply that I support TSO as the base standard.

TSO is completely unacceptable for many use-cases, particularly for cores that are connected to a large, shared memory system but have no caches (which is where most RISC-V cores will be). Many RISC-V companies will be selling core IP that plugs into a customer's existing memory system which will have been based on ARM's WMO. RISC-V will be completely unable to displace ARM if it mandates TSO.

-Chris


Cesar Eduardo Barros

Dec 3, 2017, 4:50:01 PM

to Andrew Waterman, Daniel Lustig, isa...@groups.riscv.org

On 02-12-2017 18:52, Andrew Waterman wrote:
> The main reason for RVTSO is that some vendors insist on straight
> porting of x86 code, or want to have additional/easier-to-prove
> guarantees in safety-critical systems. I’m not one of these vendors, but
> I acknowledge that no matter what we ISA designers do, some people will
> build and advertise TSO hardware. So it makes sense to me to standardize
> what that means.

If the main reason for RVTSO is porting of x86 code, is there any reason
this can't be done by the compiler, the assembler, or even the linker
(similar to linker relaxation)? The transformations given in the draft
are very simple, just set the .aq and/or .rl bits on some of the
instructions, and add fences near the other stores/loads. With macro-op
fusion, the performance loss over hardware-based TSO would be only due
to the small code size increase from the fences.

For safety-critical systems, both the compiler/assembler/linker approach
(perhaps with a post-processing step which verifies that no instructions
have been left unmodified) and the hardware approach would work equally
well. In fact, the hardware approach is equivalent to the instruction
decoder doing the same transformations the compiler/assembler/linker
could have done.

Yes, some people will build and advertise TSO hardware. Without a RVTSO
standard, they will be in the minority. With an official RVTSO standard,
however, there would be not only two sets of cores (RVTSO and RVWMO),
but also two sets of software (RVTSO and RVWMO). Since RVTSO cores could
run both sets of software, they would crowd out RVWMO; once RVTSO is in
the majority, most software (even software marked as RVWMO) would depend
on RVTSO. Therefore, my opinion is that making RVTSO a "strong" standard
will be harmful to the RISC-V ecosystem.

In my opinion, it will be better to have TSO be nothing more than a
footnote in the standard, listing which transformations in the
instruction stream would give a program the stronger TSO ordering. If
someone wants to advertise TSO hardware, they would just have to say
that their front-end does these same transformations in hardware, or
their equivalent. However, and this is the important point, standard
software *can't* depend on that.
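
The instruction-stream transformation described above can be sketched roughly as follows. This is only an illustration in C: the fence placement (`fence r,rw` after each load, `fence rw,w` before each store) is an assumed mapping in the spirit of the draft's porting tables, not text quoted from it, and `tso_op` and `emit_tso_as_rvwmo` are made-up names. Word-sized `lw`/`sw` stand in for all access widths.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Kinds of memory operations found in TSO-assuming code. */
typedef enum { OP_LOAD, OP_STORE, OP_FENCE } tso_op;

/*
 * Write the RVWMO instruction sequence for one TSO operation into `out`.
 * The fence choices are an assumption (see the lead-in); `reg` and `addr`
 * are plain operand strings and are ignored for OP_FENCE.
 * Returns the number of characters written, excluding the terminator.
 */
int emit_tso_as_rvwmo(tso_op op, const char *reg, const char *addr,
                      char *out, size_t out_len)
{
    assert(out != NULL && out_len > 0);
    switch (op) {
    case OP_LOAD:   /* later accesses may not move above the load */
        return snprintf(out, out_len, "lw %s, %s\nfence r,rw\n", reg, addr);
    case OP_STORE:  /* earlier accesses may not move below the store */
        return snprintf(out, out_len, "fence rw,w\nsw %s, %s\n", reg, addr);
    case OP_FENCE:  /* a full TSO fence becomes a full RVWMO fence */
        return snprintf(out, out_len, "fence rw,rw\n");
    }
    return -1;
}
```

Note that nothing here orders a store before a later load, which is exactly the one relaxation TSO itself permits. A post-processing verifier of the kind suggested above would only need to check that every load and store in the binary is wrapped this way.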

Daniel Lustig

Dec 3, 2017, 9:40:12 PM

to Cesar Eduardo Barros, Andrew Waterman, isa...@groups.riscv.org

Hi all,

Thanks for the feedback. Obviously this has been and will continue to
be a contentious topic, and we recognize that. Just look at how many
people have reacted with "obviously it should be just TSO" and also
"obviously it should be just RVWMO". And believe me, we discussed
pretty much all of these points over the past few months.

Sorry for the long response, but I want to address all of the points
you all are bringing up.

So, a few comments:

- RVWMO isn't going away. It matches the intention of the original
spec, and it supports much of the hardware that's already out there
today. As Andrew said, RVTSO is being formalized because there are
vendors out there (who are free to identify themselves) who want a
path to port their legacy x86 code and have it "just work", and in
such a way that it won't accidentally get run on RVWMO hardware which
doesn't support it. There are also people who will super-optimize
for some of their TSO microarchitectures anyway, or want to formally
verify their cores, etc. But those are all special cases. RVWMO is
the memory model for all standard, portable RISC-V code. If we as a
community want to converge around a single model, it should be RVWMO.

The debate is this: if people are going to just write TSO-dependent
software anyway, and if they're going to just build TSO hardware
anyway, do we just keep it hush-hush and hope nobody tries to go
back later and run that code on RVWMO hardware (where it will break)?
Or do we find a way to label this situation so TSO-dependent software
will simply refuse to run on hardware that doesn't support it?

Again, the following should be the debate here: should Ztso be
defined as a standard extension, or not? The rest is either not
going to change or part of a separate later debate (see below).
As a task group, in conjunction with the Foundation, we thought
it would be better to standardize Ztso than to leave it quietly
unsupported, or supported with unusable performance, or to simply
alienate that segment of the market altogether.

- Personally speaking, I'm not one of the TSO proponents myself, so I
can't personally vouch for everything those vendors have in mind.

- Nothing in our proposal implies that hardware will have to support
two models simultaneously. We don't want to impose a burden for all
hardware to support any kind of mode switch, particularly not in the
base ISA. So hardware is free to only implement RVWMO, or to only
support RVTSO.

- Nothing in our proposal is meant to rule out or discourage people
from proposing new extensions with new l{b|h|w|d}.aq + s{b|h|w|d}.rl,
per-page modes, dynamically reconfigurable "TSO mode", and/or anything
else. We're simply aiming to propose a model for the base ISA as it's
currently defined, with the instructions it currently has, because
that's what's out there today, and that's what software is being built
around. If you want to propose any of the above, that's great, but
let's consider it a separate discussion about a future extension.
In the meantime let's discuss the base ISA memory model here. We want
to keep those decoupled because we want to move forward with ratifying
the base ISA in the coming weeks or months, and don't want to delay it
to include lots of things which won't be in the base ISA spec.

- The problem with just setting every load and store to be just .aq and
.rl today is that the opcodes just don't exist. Like I said, it's
fine if someone wants to propose adding them, but the Foundation wants
us to treat that as a separate task group and a separate future
extension. We are proposing to add these as assembler
pseudoinstructions though, for similar reasons, and for forwards
compatibility if/when they do get added.

- People who write C code that implicitly assumes TSO are writing
invalid code. C has its own memory model, and it's not TSO, and it's
not going to be TSO regardless of what RISC-V does. That said, we
hear the feedback that people do this a lot anyway, and/or that people
have lots of such code lying around for legacy reasons. We're looking
into a compiler option to emit such code as an RVWMO binary
which has enough .aq/.rl/fences to make it TSO, but since this code is
already in violation of the C spec anyway, it's not an easy thing to
just build and put into production.

- Plus, race-free code (i.e., a good chunk of the code out there) really
doesn't need to be TSO everywhere. That's the magic of "SC-for-DRF".
The hard part is racy code, and/or code used to write synchronization
primitives. Unfortunately there's no easy way that I know of to say
"compile this block of code as TSO" and "compile the rest as DRF".

- It's also not the compiler's job to statically analyze C code to
determine whether it's safe under RVWMO or not. It's the programmer's
job. We could debate the merits of DRF and undefined behavior for a
long time, but that's how C works. So C is perfectly happy to
compile to RVWMO, just as it's perfectly happy compiling to ARM,
Power, Itanium, and anything else that isn't TSO.

- In some sense this fragmentation is no different than having hardware
which does or doesn't support any number of the other standard or
non-standard extensions. The most important use cases will
standardize around common extensions, such as "RV64GC" for the Unix
platform spec, and software will have to match that to be compatible.
If you add your own custom stuff to that, it won't work on standard
cores that don't have that custom stuff. Same thing with Ztso really.
RISC-V is designed to make it easy for people to add custom stuff,
but if you want to be fully compatible with the rest of the broader
ecosystem, you can't use the custom stuff in your software, Ztso or
otherwise.
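
The "label it so TSO-dependent software refuses to run on hardware that doesn't support it" idea from the list above could look roughly like the check below. This is a sketch under assumptions: it uses the `EF_RISCV_TSO` value (0x0010) that the RISC-V ELF psABI later assigned, which nothing in this thread fixes, and `hw_model` and `may_run` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed encoding: the later RISC-V ELF psABI defines EF_RISCV_TSO as
 * 0x0010 in e_flags; this thread itself does not specify a value. */
#define EF_RISCV_TSO 0x0010u

/* Hypothetical description of the memory model the hardware implements. */
typedef enum { HW_RVWMO_ONLY, HW_RVTSO } hw_model;

/* Decide whether a binary with the given ELF e_flags may run on `hw`. */
bool may_run(uint32_t e_flags, hw_model hw)
{
    assert(hw == HW_RVWMO_ONLY || hw == HW_RVTSO);
    bool needs_tso = (e_flags & EF_RISCV_TSO) != 0;
    /* RVWMO code runs anywhere; TSO-dependent code needs Ztso hardware. */
    return !needs_tso || hw == HW_RVTSO;
}
```

The asymmetry is the point: portable RVWMO binaries are accepted by both kinds of hardware, while a binary carrying the flag is rejected up front instead of misbehaving silently.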

Dan

Daniel Lustig

Dec 3, 2017, 9:57:22 PM

to Anthony Coulter, isa...@groups.riscv.org

Hi Anthony,

On 12/2/2017 7:39 AM, Anthony Coulter wrote:
> All of my feedback is related to readability. I freely admit that I'm
> not qualified to judge the technical merit of this memory model, and
> in fact I'm reading this document more because I want to understand
> memory models in general than because I care specifically about RISC-V.

We're happy to have this kind of feedback as well!

>
> Section 1.1.1: Memory Model Primitives
> Table 1.1's caption introduces ".sc" as a synonym for ".aqrl" but this
> synonym is not used anywhere except section 3.1 (the formal
> specification in Alloy).

This is mostly a vestige of an earlier discussion of whether we should
use ".aqrl" or ".sc" for RCsc atomics. More than likely we'll settle
on just one notation.

>
> The definition of LR/SC pairs in the third paragraph of this section
> can be cleaned up a bit:
>
> "A successful sc instruction is said to be paired with the last lr
> instruction that precedes it in program order; their corresponding
> memory operations are also said to be paired."

Sure, that works, thanks.

>
> Section 2.4.2: I/O Ordering
> This section states that the preserved program order rules don't
> apply to I/O regions, and then "informally" introduces ten new rules
> which are written rigorously. Would it be fair to call this a formal
> specification of the "preserved program order rules for I/O accesses"?
> Or are these ten new rules not enforced by RVWMO?

They're kind of a "formalization in progress", in some sense. But I/O
hasn't been nearly as well studied from a memory model formalization
perspective as normal memory, so we're not super inclined to make
strong formal guarantees about I/O yet.

>
> Also: What's the difference between rules 2 and 4 in this section? More
> specifically, what's the difference between strongly-ordered I/O
> regions and their corresponding channel numbers? I/O channels are
> introduced in section 3.5.4 of the RISC-V privileged spec but it isn't
> really clear to me what exactly they are, whether they're visible to
> software, etc. (I had always assumed they were *not* software-visible,
> and were just an informal labeling of different strongly-ordered I/O
> regions. But my understanding doesn't really explain why channels zero
> and one are special; it would seem that every memory map is required to
> support these two I/O regions with special synchronization properties
> even if there are no underlying devices that require those properties.
> I'm almost certainly wrong.)

Andrew or someone else who knows the privileged spec better than I do
should answer this one.

>
> Section 2.6: Code Porting Guidelines
> Table 2.2 uses the phrase "fence-based equivalents" in both a column
> header and the caption. Since the fences are strictly stronger than
> their .aq/.rl counterparts, I would lean against the word "equivalent."
> Could this be renamed to "fence-based implementation" ?

Sure.

>
> Section 2.7: Implementation guidelines
> The litmus test explaining write subsumption says that "(a) must follow
> (f) in the global memory order." I believe "follow" should be replaced
> with "precede."

Yes, you're right, thank you!

>
> Sections 2.3.9 and 2.4.2: The phrase "is ordered before"
> Most of the document uses the verb "precedes" but these two sections
> use the clumsier "is ordered before." I would prefer a uniform use of
> the verb "precede" in all of these places, with the convention that
> the word "precedes" always refers to global memory order unless
> otherwise specified, e.g. in the phrase "precedes in program order."
> This convention could be stated explicitly at the beginning of the
> document.
>
> Section 2.8: Summary of New/Modified ISA Features
> The spec makes a few references to "fence.tso" as though it is an
> assembler pseudoinstruction, but fence.tso isn't mentioned here or
> in the RISC-V user spec. Will it be added?

Yes, there was some late discussion from the opcode management task
group on how fence.tso might be easily encoded inside the current
FENCE opcode. But details TBD here. We may or may not define that
prior to official ratification of the base ISA.

>
> Section ???: The Flat RISC-V Operational Model
> This document/section has its own title, authors, and page numbers.
> It also ends mid-sentence. Is this part of the official draft?

The formalizations are all still in progress, and so these are just
the snapshots I happened to have available at the time I sent this
draft out.

>
> Regards,
> Anthony Coulter

Albert Cahalan

Dec 4, 2017, 1:57:50 AM

to Daniel Lustig, Cesar Eduardo Barros, Andrew Waterman, isa...@groups.riscv.org

On 12/3/17, Daniel Lustig <dlu...@nvidia.com> wrote:

> - Nothing in our proposal implies that hardware will have to support
> two models simultaneously. We don't want to impose a burden for all
> hardware to support any kind of mode switch, particularly not in the
> base ISA. So hardware is free to only implement RVWMO, or to only
> support RVTSO.

Even if hardware only supports one or the other, it would be useful
and not burdensome to standardize the way to control and indicate it.

For example, a bit in a register is set if TSO is active. If the processor
can switch modes, then the bit is writable and defaults to TSO.
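
A small software model of that register bit, as a sketch only: the register, its bit position, and the names `model_reg`, `model_reg_reset`, `model_reg_write`, and `tso_active` are all hypothetical, and no such CSR is defined by the draft.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TSO_ACTIVE_BIT (1u << 0)   /* hypothetical bit position */

typedef struct {
    uint32_t value;     /* register contents */
    bool     writable;  /* true only when the core can switch models */
} model_reg;

/* Reset state: the bit is set whenever TSO is available (so a switchable
 * core defaults to TSO) and is writable only if both models exist. */
model_reg model_reg_reset(bool has_tso, bool has_wmo)
{
    assert(has_tso || has_wmo);   /* a core implements at least one model */
    model_reg r;
    r.value = has_tso ? TSO_ACTIVE_BIT : 0u;
    r.writable = has_tso && has_wmo;
    return r;
}

/* Software write: honored on switchable cores, ignored otherwise. */
void model_reg_write(model_reg *r, uint32_t v)
{
    if (r->writable)
        r->value = (r->value & ~TSO_ACTIVE_BIT) | (v & TSO_ACTIVE_BIT);
}

bool tso_active(const model_reg *r)
{
    return (r->value & TSO_ACTIVE_BIT) != 0;
}
```

Single-model hardware pays only for a hard-wired bit, which is the "not burdensome" part of the suggestion.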

David Chisnall

Dec 4, 2017, 4:40:15 AM

to Daniel Lustig, Cesar Eduardo Barros, Andrew Waterman, isa...@groups.riscv.org

On 4 Dec 2017, at 02:40, Daniel Lustig <dlu...@nvidia.com> wrote:
>
> - Plus, race-free code (i.e., a good chunk of the code out there) really
> doesn't need to be TSO everywhere. That's the magic of "SC-for-DRF".

I would love to see the research that backs up this assertion. Our preliminary investigations in this area indicate that the biggest problem with the C11 memory model is that there are very few nontrivial C programs that are data-race free and therefore all concurrent C code depends on undefined behaviour. This has been one of the key driving forces in the proposal for the OCaml memory model to include a notion of local data-race freedom.

David

Daniel Lustig

Dec 5, 2017, 12:43:56 PM

to David Chisnall, Cesar Eduardo Barros, Andrew Waterman, isa...@groups.riscv.org

Are you suggesting that most C code is buggy? (believable) Or are you
suggesting that C should change its memory model to not be DRF, or to
always compile to be TSO? (pretty unlikely...)

Assuming C is not going to fundamentally change its memory model in
such a way, we need to design the RISC-V memory model to support C as
it's currently designed and following current standard practice for C
compilers. And UB is one of the key things enabling lots of standard
compiler optimizations. It's also way outside the scope of this
RISC-V memory model task group to propose eliminating UB in C.

Thanks for the pointer to the OCaml memory model though.

Dan

Daniel Lustig

Dec 5, 2017, 12:53:54 PM

to Albert Cahalan, Cesar Eduardo Barros, Andrew Waterman, isa...@groups.riscv.org

Yes, that will more than likely tie in to whatever solution ends up
getting proposed here. But there are also questions such as: will this
be exposed/accessible to user and/or supervisor mode, as opposed to
just machine mode (since it'll likely be in the misa register)? Will
there be a way to switch this per-process or per VM? Will there be a
per-page mechanism of some kind too? All are good questions that people
have proposed and for which we'd welcome discussion. We just didn't see
a compelling need to spec this all out before we ratify the rest of the
memory model and the ISA spec. We'd rather wait and get that right
as part of a future extension, if that's what ends up being proposed.
That's all.

Dan

Palmer Dabbelt

Dec 5, 2017, 4:33:35 PM

to Daniel Lustig, Anthony Coulter, isa...@groups.riscv.org

On Sun, Dec 3, 2017 at 6:57 PM, Daniel Lustig <dlu...@nvidia.com> wrote:

> Hi Anthony,
>
> On 12/2/2017 7:39 AM, Anthony Coulter wrote:
> > Also: What's the difference between rules 2 and 4 in this section? More
> > specifically, what's the difference between strongly-ordered I/O
> > regions and their corresponding channel numbers? I/O channels are
> > introduced in section 3.5.4 of the RISC-V privileged spec but it isn't
> > really clear to me what exactly they are, whether they're visible to
> > software, etc. (I had always assumed they were *not* software-visible,
> > and were just an informal labeling of different strongly-ordered I/O
> > regions. But my understanding doesn't really explain why channels zero
> > and one are special; it would seem that every memory map is required to
> > support these two I/O regions with special synchronization properties
> > even if there are no underlying devices that require those properties.
> > I'm almost certainly wrong.)
>
> Andrew or someone else who knows the privileged spec better than I do
> should answer this one.

The RISC-V privileged spec uses I/O channels as a mechanism to describe constraints on ordered regions. Each ordered I/O region is associated with one channel, and the options are

* Channel 0: Hart-to-region strong ordering.

* Channel 1: Global strong ordering.

* Channel 2 and above: Hart-to-channel strong ordering.

These are implicitly software visible: for example, the draft of the Unix-class platform spec says

Unless otherwise specified by a given I/O device, I/O regions are at least point-to-point strongly ordered. All devices attached to a given PCIe root complex are on the same ordered channel (numbered 2 or above), though different root complexes may not be on the same ordering channel.

This is written to match what Linux expects. While there's no way to query or modify I/O channels from supervisor mode, this wording allows us to avoid the gratuitous fences that would be required to implement Linux's MMIO ordering requirements on RISC-V.
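
One way to read the three channel classes listed above is as a predicate over two accesses made by the same hart. The sketch below encodes that reading in C; it is an interpretation and not wording from the privileged spec, and `io_region`, `region_id`, and `same_hart_io_ordered` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int region_id;   /* hypothetical identifier of a strongly ordered region */
    int channel;     /* ordering channel number attached to that region */
} io_region;

/* Are two accesses by the same hart, one to `a` and one to `b`,
 * kept in program order under the channel rules as read above? */
bool same_hart_io_ordered(io_region a, io_region b)
{
    assert(a.channel >= 0 && b.channel >= 0);
    if (a.channel == 1 || b.channel == 1)
        return true;                      /* channel 1: global ordering */
    if (a.region_id == b.region_id)
        return true;                      /* same ordered region (covers channel 0) */
    /* channels 2 and above: ordered across regions sharing the channel */
    return a.channel >= 2 && a.channel == b.channel;
}
```

Under this reading, two devices behind the same PCIe root complex (same channel, numbered 2 or above) are mutually ordered without fences, while two channel-0 regions are each ordered only with themselves.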

Jonas Oberhauser

Dec 6, 2017, 8:39:09 AM

to RISC-V ISA Dev

Overall it looks like a sane memory model. Good job to everyone who was involved, I'm sure it was a lot of effort.

In 1.1.2 Dependencies

"A register r read by an instruction b has a syntactic dependency on an instruction a if [...] There is some other instruction i such that a register read by i has a dependency on a, and

a register read by b has a dependency on i"

Do you mean the following:

"A register r read by an instruction b has a syntactic dependency on an instruction a if [...] There is some other instruction i such that a register read by i has a dependency on a, and

*register r* read by b has a dependency on i"

Otherwise I don't quite understand, and I'd appreciate if a short explanation would be added as a comment.

(on a slightly related note, I suggest sticking either to the term "syntactic dependency", "dependency", "register dependency", or disambiguating between them. They seem to mean the same thing but it's not quite clear. If they are the same thing I suggest calling it "dependency" because it is the shortest.)

Also, and I'm sure you have discussed this and can answer me quickly, do current processors implement these strong dependencies? To specify my concern, imagine a "swap r1 r2" instruction which swaps the values of two registers.

now if you have

a = load to r1 ; i=swap r1 r2 ; b = store from r1

so r1 of i has a dependency on a, and r1 of b has a dependency on i, but the value of r1 of b does not really depend on a.

In your definition, r1 of b depends on a.

But in hardware, there is no dependency; the dependency has to be maintained artificially.

Next point: in PPO, Rule 13

"and there is no other store to an overlapping memory

location(s) between m and b" --- this should probably be in local (processor) order. The way it is written it looks as if it was in global order, but it's not clear why a global store should imply this ordering (while a local store means that you can forward from it without waiting for m's address to be resolved).

I didn't have time to read part 2 yet, so feel free to point to it if something is explained in there.

Now if I understood correctly, you haven't yet addressed 1) mixed size/misaligned accesses, 2) memory management 3) interrupts?

Jonas Oberhauser

Dec 6, 2017, 8:42:22 AM

to RISC-V ISA Dev, ces...@cesarb.eti.br, wate...@eecs.berkeley.edu

On Monday, December 4, 2017 at 03:40:12 UTC+1, Daniel Lustig wrote:

> - Again, the following should be the debate here: should Ztso be
> defined as a standard extension, or not? The rest is either not
> going to change or part of a separate later debate (see below).
> As a task group, in conjunction with the Foundation, we thought
> it would be better to standardize Ztso than to leave it quietly
> unsupported, or supported with unusable performance, or to simply
> alienate that segment of the market altogether.

Sure. I think this makes perfect sense.

> - Nothing in our proposal is meant to rule out or discourage people
> from proposing new extensions with new l{b|h|w|d}.aq + s{b|h|w|d}.rl,
> per-page modes, dynamically reconfigurable "TSO mode", and/or anything
> else. We're simply aiming to propose a model for the base ISA as it's
> currently defined, with the instructions it currently has, because
> that's what's out there today, and that's what software is being built
> around. If you want to propose any of the above, that's great, but
> let's consider it a separate discussion about a future extension.
> In the meantime let's discuss the base ISA memory model here. We want
> to keep those decoupled because we want to move forward with ratifying
> the base ISA in the coming weeks or months, and don't want to delay it
> to include lots of things which won't be in the base ISA spec.

[...]

> The problem with just setting every load and store to be just .aq and
> .rl today is that the opcodes just don't exist. Like I said, it's
> fine if someone wants to propose adding them, but the Foundation wants
> us to treat that as a separate task group and a separate future
> extension. We are proposing to add these as assembler
> pseudoinstructions though, for similar reasons, and for forwards
> compatibility if/when they do get added.

Ummm... Ok. I somewhat understand the rationale behind not wishing to open up all of this jazz if you want to ratify the base ISA. But they are very useful, and in particular the "strong" aq/rl bits (.sc) are very dear to my heart. I want them available for normal stores/loads as well. The macros are good enough as a temporary workaround, and for the time being I'll pretend that there are aq/rl bits in these instructions :)

> - People who write C code that implicitly assumes TSO are writing
> invalid code. C has its own memory model, and it's not TSO, and it's
> not going to be TSO regardless of what RISC-V does.

True. But intuitively I would assume that a properly annotated C program where all annotated stores&loads are compiled with .rl and .aq (and remaining stores/loads are compiled to normal s/l and never moved by the compiler across aq/rl mops) is going to have only TSO behaviors on RVWMO.

In fact I conjecture that you get only SC behaviors if you compile all of the annotated stores&loads to .sc (and the remaining stores and loads to normal s/l, which are never reordered with earlier .sc loads and later .sc stores).

> We're looking into a compiler option to emit such code as an RVWMO binary
> which has enough .aq/.rl/fences to make it TSO, but since this code is
> already in violation of the C spec anyway, it's not an easy thing to
> just build and put into production.

Yes, I'm afraid you'd have to put .rl and .aq bits everywhere. Finding out which accesses should have been annotated to my understanding is hard for finite programs and impossible in general.

> - Plus, race-free code (i.e., a good chunk of the code out there) really
> doesn't need to be TSO everywhere. That's the magic of "SC-for-DRF".
> The hard part is racy code, and/or code used to write synchronization
> primitives. Unfortunately there's no easy way that I know of to say
> "compile this block of code as TSO" and "compile the rest as DRF".

Okay, I'm not sure what exactly you mean, and my answer depends a little on what you mean.

1) race-free and DRF can mean (as often in academic circles) "the only shared state are locks, which are accessed with lock() and unlock()". It can also mean (as in C or Java) "all accesses to shared state are annotated" (which is a much weaker condition)

2) "compile this block of code as TSO", do you mean "the semantics of this code should be TSO"? Or do you mean "this code lacks proper annotations, compile it correctly anyways"?

3) "compile this block of code as DRF", do you mean "the semantics of this code should be sequential consistency"? Or do you mean "this code only accesses local state"? Or do you mean "the accesses to shared state in this code are annotated"?

The thing that makes this complicated is (as you know) that code which lacks proper annotations normally has UB. In particular the loop (while (x != 1);) has undefined behavior and can be transformed to (among other things) x=1; At this point you no longer have TSO as a memory model even if you compile to a SC CPU because the compiler itself will give you a completely absurd memory model.
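
For contrast with the un-annotated loop, here is a sketch of what "all accesses to shared state are annotated" looks like for a simple message-passing pattern, using C11 atomics and POSIX threads. The names `payload`, `ready`, `producer`, and `run_message_passing` are made up for the illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;        /* plain data, published via the flag below */
static atomic_int ready;   /* annotated synchronization variable */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;                                            /* plain store */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* publish */
    return NULL;
}

/* Spawns the producer, waits for the flag, and returns the payload seen. */
int run_message_passing(void)
{
    pthread_t t;
    payload = 0;
    atomic_store_explicit(&ready, 0, memory_order_relaxed);
    int rc = pthread_create(&t, NULL, producer, NULL);
    assert(rc == 0);
    (void)rc;
    /* Annotated spin loop: each iteration performs a real acquire load,
     * so the compiler may not hoist or delete the read of `ready`. */
    while (atomic_load_explicit(&ready, memory_order_acquire) != 1)
        ;
    int seen = payload;   /* ordered after the acquire load that saw 1 */
    pthread_join(t, NULL);
    return seen;
}
```

Because the only conflicting accesses go through `ready`, the program is data-race free and the read of `payload` is guaranteed to observe the producer's store on any conforming target, TSO or not.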

But I'll try to give you one useful answer.

Assume that you have received a C program where some parts have been annotated correctly, and other parts have not, but you know which are which.

In particular, accesses to state which is shared between the annotated code and the un-annotated code are annotated in the annotated code (but likely not in the un-annotated code)

1) place a fence at the borders of the un-annotated code, one fence before one fence after

2) compile the un-annotated code naively, i.e., do not do optimizations that might affect the memory model (so while(x!=0) becomes an actual loop which polls x)

3) compile the annotated code like you normally would

Then the compiled program on a TSO CPU will behave as if

1) the code in the annotated section has no store buffer; its loads and stores go directly and immediately to the shared memory

2) the code in the un-annotated section has a store buffer; its stores will go to the buffer before changing anything, its loads will first look into the buffer, then into memory

Is that what you meant?

(Similarly, if you have code that is annotated correctly everywhere, but you want some portion of the code to have a store buffer and another part of the code not to,

1) place fences around the code with the store buffers

2) compile the code with the store buffers without placing fences that order stores and loads

3) compile the code without the store buffers as you normally would)

There have been suggestions that the fence at the end of the un-annotated/TSO code is not necessary, but I have never seen a proof for that claim. The fence in the other direction is definitely necessary (in general), already the typical TSO litmus test shows this (assume both reads are in unannotated code).
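
The store-buffer picture used above, and the store-buffering litmus test it alludes to, can be simulated in a few lines of C. This is a deliberately simplified sketch: the buffer keeps only the newest pending value per variable instead of a FIFO, only one fixed interleaving is explored, and all names (`store_buffer`, `sb_store`, `sb_load`, `sb_fence`, `sb_litmus_both_zero`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum { X = 0, Y = 1, NVARS = 2 };

typedef struct {
    bool pending[NVARS];   /* is a store to this variable still buffered? */
    int  value[NVARS];     /* newest buffered value per variable */
} store_buffer;

static int memory[NVARS];  /* the shared memory */

/* Stores go to the local buffer first and are invisible to other threads. */
void sb_store(store_buffer *sb, int var, int v)
{
    assert(var >= 0 && var < NVARS);
    sb->pending[var] = true;
    sb->value[var] = v;
}

/* Loads look into the local buffer first, then into memory. */
int sb_load(const store_buffer *sb, int var)
{
    assert(var >= 0 && var < NVARS);
    return sb->pending[var] ? sb->value[var] : memory[var];
}

/* A fence drains the buffer into shared memory. */
void sb_fence(store_buffer *sb)
{
    for (int v = 0; v < NVARS; v++) {
        if (sb->pending[v]) {
            memory[v] = sb->value[v];
            sb->pending[v] = false;
        }
    }
}

/* Litmus test  T0: x = 1; r0 = y   ||   T1: y = 1; r1 = x
 * run in the order: both stores, then both loads.
 * Returns true if the outcome r0 == 0 && r1 == 0 is observed. */
bool sb_litmus_both_zero(bool fence_after_store)
{
    store_buffer t0 = {0}, t1 = {0};
    memory[X] = memory[Y] = 0;

    sb_store(&t0, X, 1);
    if (fence_after_store) sb_fence(&t0);
    sb_store(&t1, Y, 1);
    if (fence_after_store) sb_fence(&t1);

    int r0 = sb_load(&t0, Y);
    int r1 = sb_load(&t1, X);
    return r0 == 0 && r1 == 0;
}
```

Without the fences both loads read stale memory, which is the outcome TSO allows and sequential consistency forbids; with a fence between each store and the following load, that outcome disappears in this interleaving.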

Jonas Oberhauser

Dec 6, 2017, 8:52:23 AM

to RISC-V ISA Dev, David.C...@cl.cam.ac.uk

Really? Is there something I can read?

Daniel Lustig

Dec 6, 2017, 8:28:01 PM

to Jonas Oberhauser, RISC-V ISA Dev

Hi Jonas,

Thanks for the feedback.

On 12/6/2017 5:39 AM, Jonas Oberhauser wrote:
> Overall it looks like a sane memory model. Good job to everyone who was
> involved, I'm sure it was a lot of effort.
>
>
> In 1.1.2 Dependencies
>
> "A register r read by an instruction b has a syntactic dependency on an
> instruction a if [...] There is some other instruction i such that a
> register read by i has a dependency on a, and
> a register read by b has a dependency on i"
>
> Do you mean the following:
>
> "A register r read by an instruction b has a syntactic dependency on an
> instruction a if [...] There is some other instruction i such that a
> register read by i has a dependency on a, and
> *register r* read by b has a dependency on i"

Yes, you're right, the second part there doesn't really talk about r.
We'll clean that up for the next draft. See also below.

>
> Otherwise I don't quite understand, and I'd appreciate if a short
> explanation would be added as a comment.
>
> (on a slightly related note, I suggest sticking either to the term
> "syntactic dependency", "dependency", "register dependency", or
> disambiguating between them. They seem to mean the same thing but it's not
> quite clear. If they are the same thing I suggest calling it "dependency"
> because it is the shortest.)

The distinction between syntactic and semantic dependency is important,
so we're using that particular term very intentionally. We can clean up
the term "register dependency" though.

So maybe something like this?

A register r read by an instruction b has a syntactic dependency on an
instruction a if a precedes b in program order, r is not x0, and
either of the following hold:

1. r is written by a, and no other instruction between a and b in
program order writes r
2. There is some other instruction i between a and b in program order
such that a register read by i has a syntactic dependency on a, and
r has a syntactic dependency on i

<snip>

Finally, b has a syntactic success dependency on a if a is a
store-conditional and a register read by b has a syntactic dependency
on the store conditional success register written by a.

(Also note that the pc register is implicitly but intentionally also
excluded from this definition. We don't really say that now, since
RISC-V treats pc as a somewhat special register in the current spec,
but we should probably clarify nevertheless...)

>
>
> Also, and I'm sure you have discussed this and can answer me quickly, do
> current processors implement these strong dependencies? To specify my
> concern, imagine a "swap r1 r2" instruction which swaps the values of two
> registers.
> now if you have
> a = load to r1 ; i=swap r1 r2 ; b = store from r1
> so r1 of i has a dependency on a, and r1 of b has a dependency on i, but
> the value of r1 of b does not really depend on a.
> In your definition, r1 of b depends on a.
> But in hardware, there is no dependency; the dependency has to be
> maintained artificially.

In general, the notion that the hardware has to respect dependencies
that hardware might otherwise be able to "optimize away" is the key
distinction between syntactic and semantic dependencies. Part 2
gives some examples. And yes, hardware is charged with maintaining
those artificially. Real-world code depends on them working.

A hypothetical swap instruction would be weird in the way you describe,
but I think the RISC nature of this ISA means we don't have to worry
about weird cases like that. Do you see any current RISC-V
instructions where this actually comes up? I haven't surveyed them
myself one-by-one to see if something like this does actually come up,
but we probably should do that.

I'm also making somewhat of an implicit/informal assumption that rs1/rs2
are read and rd is written, but really we should work with the ISA
formalization task group to formalize the notions of "register is read/
written by", "used to calculate the address of", etc.

>
> Next point: in PPO, Rule 13
> "and there is no other store to an overlapping memory
> location(s) between m and b" --- this should probably be in local
> (processor) order. The way it is written it looks as if it was in global
> order, but it's not clear why a global store should imply this ordering
> (while a local store means that you can forward from it without waiting for
> m's address to be resolved).

Yes, in program order, not in global memory order. Fixed.

>
>
> I didn't have time to read part 2 yet, so feel free to point to it if
> something is explained in there.
>
> Now if I understood correctly, you haven't yet addressed 1) mixed
> size/misaligned accesses, 2) memory management 3) interrupts?
>

#1 is addressed in the draft. Did you miss it? Or did we miss
something else that you're looking for?

I'm not sure what you mean by #2.

We have not formalized #3, no.

Thanks,

Daniel Lustig

Dec 6, 2017, 8:55:59 PM12/6/17

to Jonas Oberhauser, RISC-V ISA Dev, ces...@cesarb.eti.br, wate...@eecs.berkeley.edu

The memory model is already designed to accommodate those instructions,
whether or not opcode space is allocated for them in some future ISA
extension. The burden for that extension will be convincing everyone it's
worth the opcode encoding space, not convincing us that it's a good
idea to capture in the memory model.

>
> - People who write C code that implicitly assumes TSO are writing
>> invalid code. C has its own memory model, and it's not TSO, and it's
>> not going to be TSO regardless of what RISC-V does.
>
>
> True. But intuitively I would assume that a properly annotated C program
> where all annotated stores&loads are compiled with .rl and .aq (and
> remaining stores/loads are compiled to normal s/l and never moved by the
> compiler across aq/rl mops) is going to have only TSO behaviors on RVWMO.
> In fact I conjecture that you get only SC behaviors if you compile all of
> the annotated stores&loads to .sc (and the remaining stores and loads to
> normal s/l, which are never reordered with earlier .sc loads and later .sc
> stores).

Yes, that's exactly what the SC-for-DRF theorem states. C11/C++11 uses the
term "atomic" for what you call "annotated".

>
>
>> We're looking into a compiler option to emit such code as an RVWMO
>> binary
>> which has enough .aq/.rl/fences to make it TSO, but since this code is
>> already in violation of the C spec anyway, it's not an easy thing to
>> just build and put into production.
>>
>
> Yes, I'm afraid you'd have to put .rl and .aq bits everywhere. Finding out
> which accesses should have been annotated to my understanding is hard for
> finite programs and impossible in general.
>
> - Plus, race-free code (i.e., a good chunk of the code out there) really
>> doesn't need to be TSO everywhere. That's the magic of "SC-for-DRF".
>> The hard part is racy code, and/or code used to write synchronization
>> primitives. Unfortunately there's no easy way that I know of to say
>> "compile this block of code as TSO" and "compile the rest as DRF".
>>
>
> Okay, I'm not sure what exactly you mean, and my answer depends a little on
> what you mean.
> 1) race-free and DRF can mean (as often in academic circles) "the only
> shared state are locks, which are accessed with lock() and unlock()". It
> can also mean (as in C or Java) "all accesses to shared state are
> annotated" (which is a much weaker condition)

The C/C++ standard has a well-defined notion of race-free, and it doesn't
require locks. It (informally) requires that for any conflicting accesses
(accesses to the same object, at least one of which is a write), either one
happens-before the other (another well-defined term), or both are atomic
(in the sense of what you called "annotated").

Locks are one way to enforce that, but they're not the only way.
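
As a rough illustration of that informal rule, the Python sketch below flags conflicting accesses from different threads that are neither ordered by happens-before nor both atomic. It is only a toy checker under stated assumptions: the `Access` record, the hand-written happens-before edges, and the function names are all hypothetical, and the real C/C++ definitions are considerably more detailed.

```python
from collections import namedtuple
from itertools import combinations

# Hypothetical record: kind is "R" or "W"; atomic marks an annotated access.
Access = namedtuple("Access", "name thread loc kind atomic")

def transitive_closure(edges):
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

def find_races(accesses, hb_edges):
    """Return pairs of conflicting accesses that are unordered and not both atomic."""
    hb = transitive_closure(hb_edges)
    races = []
    for a, b in combinations(accesses, 2):
        conflicting = a.loc == b.loc and "W" in (a.kind, b.kind)
        ordered = (a.name, b.name) in hb or (b.name, a.name) in hb
        if (a.thread != b.thread and conflicting
                and not ordered and not (a.atomic and b.atomic)):
            races.append((a.name, b.name))
    return races

# Message passing without locks: plain data, atomic flag.
accesses = [
    Access("Wdata", 0, "data", "W", False),
    Access("Wflag", 0, "flag", "W", True),
    Access("Rflag", 1, "flag", "R", True),
    Access("Rdata", 1, "data", "R", False),
]
program_order = {("Wdata", "Wflag"), ("Rflag", "Rdata")}
synchronizes_with = {("Wflag", "Rflag")}  # assumes the flag load saw the flag store

print(find_races(accesses, program_order | synchronizes_with))  # []
print(find_races(accesses, program_order))  # [('Wdata', 'Rdata')]
```

With the synchronization edge in place, the plain accesses to `data` are ordered and the program is race-free with no lock in sight. Without that edge, the two accesses to `data` race, while the two atomic accesses to `flag` do not.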

> 2) "compile this block of code as TSO", do you mean "the semantics of this
> code should be TSO"? Or do you mean "this code lacks proper annotations,
> compile it correctly anyways"?

Both, kind of. "This code lacks proper annotations, but since people had
this code running on x86, I can probably assume it was safe to run this
under TSO, so compile with enough added synchronization that the resulting
behaviors are only TSO even if the code is run on an RVWMO machine."

Note that even this isn't going to prevent source-to-source transformations,
but a compiler targeting x86 doesn't do that either. It would only ensure
that the behaviors of the emitted binary are all only TSO behaviors, not
that the compiler itself refrains from reordering any loads or stores in
the program. That's a concept that lives only in -O0 or in academic papers
like "End-to-End Sequential Consistency".

> 3) "compile this block of code as DRF", do you mean "the semantics of this
> code should be sequential consistency"? Or do you mean "this code only
> accesses local state"? Or do you mean "the accesses to shared state in this
> code are annotated"?

Valid C code must be race free by assumption. Then SC-for-DRF says that
it doesn't matter what you do to optimize the straight line code, because
the end result is going to be equivalent to some SC execution anyway.
Therefore, the compiler is completely free to reorder or elide accesses
or do whatever crazy things it wants to do for straight line code, until
it sees an atomic access or a fence or some other synchronization operation.

>
> The thing that makes this complicated is (as you know) that code which
> lacks proper annotations normally has UB. In particular the loop (while (x
> != 1);) has undefined behavior and can be transformed to (among other
> things) x=1; At this point you no longer have TSO as a memory model even if
> you compile to a SC CPU because the compiler itself will give you a
> completely absurd memory model.

Like I said, the fact that racy code leads to UB is exactly the thing that
gives the compiler permission to optimize non-synchronization code as
much as it wants to, as people would intuitively expect any compiler to do.
UB doesn't kill the memory model; it actually enables it!

>
> But I'll try to give you one useful answer.
> Assume that you have received a C program where some parts have been
> annotated correctly, and other parts have not, but you know which are which.
> In particular, accesses to state which is shared between the annotated code
> and the un-annotated code are annotated in the annotated code (but likely
> not in the un-annotated code)
> 1) place a fence at the borders of the un-annotated code, one fence before
> one fence after
> 2) compile the un-annotated code naively, i.e., do not do optimizations that
> might affect the memory model (so while(x!=0) becomes an actual loop which
> polls x)
> 3) compile the annotated code like you normally would
>
> Then the compiled program on a TSO CPU will behave as if
> 1) the code in the annotated section has no store buffer; its loads and
> stores go directly and immediately to the shared memory
> 2) the code in the un-annotated section has a store buffer; its stores will
> go to the buffer before changing anything, its loads will first look into
> the buffer, then into memory
>
> Is that what you meant?

This is all hypothetical. Someone would have to prove this is even
possible before we can talk about specifying it as an officially sanctioned
option.

>
> (Similarly, if you have code that is annotated correctly everywhere, but
> you want some portion of the code to have a store buffer and another part
> of the code not to,
> 1) place fences around the code with the store buffers
> 2) compile the code with the store buffers without placing fences that
> order stores and loads
> 3) compile the code without the store buffers as you normally would)

Likewise.

>
> There have been suggestions that the fence at the end of the
> un-annotated/TSO code is not necessary, but I have never seen a proof for
> that claim. The fence in the other direction is definitely necessary (in
> general), already the typical TSO litmus test shows this (assume both reads
> are in unannotated code).
>

Jonas Oberhauser

Dec 7, 2017, 1:17:04 AM12/7/17

to Daniel Lustig, RISC-V ISA Dev

On Dec 7, 2017 02:28, "Daniel Lustig" <dlu...@nvidia.com> wrote:

Hi Jonas,

Thanks for the feedback.

On 12/6/2017 5:39 AM, Jonas Oberhauser wrote:
>
The distinction between syntactic and semantic dependency is important,
so we're using that particular term very intentionally. We can clean up
the term "register dependency" though.

Where is the definition of semantic dependency?

(Also note that the pc register is implicitly but intentionally also
excluded from this definition. We don't really say that now, since
RISC-V treats pc as a somewhat special register in the current spec,
but we should probably clarify nevertheless...)

I noticed, but it did cause me to stop and think for a second. Clarifying this may be worthwhile.

Do you see any current RISC-V
instructions where this actually comes up? I haven't surveyed them
myself one-by-one to see if something like this does actually come up,
but we probably should do that.

But who has the time for that ;)

> Now if I understood correctly, you haven't yet addressed 1) mixed
> size/misaligned accesses, 2) memory management 3) interrupts?
>

#1 is addressed in the draft. Did you miss it? Or did we miss
something else that you're looking for?

Does not TSO preserve program order for partial hits? I did not see that for Ztso.

A partial hit occurs if a is a store, b is a load, accessed regions overlap, but b loads from addresses not changed by a.

I'm not sure what you mean by #2.

I meant address translation and access and dirty bits. The memory operations of the MMU behave on their own and w.r.t. memory operations of the CPU how?

Cesar Eduardo Barros

Dec 7, 2017, 5:37:33 AM12/7/17

to Daniel Lustig, Jonas Oberhauser, RISC-V ISA Dev

Em 06-12-2017 23:27, Daniel Lustig escreveu:
> On 12/6/2017 5:39 AM, Jonas Oberhauser wrote:
>> Also, and I'm sure you have discussed this and can answer me quickly, do
>> current processors implement these strong dependencies? To specify my
>> concern, imagine a "swap r1 r2" instruction which swaps the values of two
>> registers.
>> now if you have
>> a = load to r1 ; i=swap r1 r2 ; b = store from r1
>> so r1 of i has a dependency on a, and r1 of b has a dependency on i, but
>> the value of r1 of b does not really depend on a.
>> In your definition, r1 of b depends on a.
>> But in hardware, there is no dependency; the dependency has to be
>> maintained artificially.
>
> In general, the notion that the hardware has to respect dependencies
> that hardware might otherwise be able to "optimize away" is the key
> distinction between syntactic and semantic dependencies. Part 2
> gives some examples. And yes, hardware is charged with maintaining
> those artificially. Real-world code depends on them working.
>
> A hypothetical swap instruction would be weird in the way you describe,
> but I think the RISC nature of this ISA means we don't have to worry
> about weird cases like that. Do you see any current RISC-V
> instructions where this actually comes up? I haven't surveyed them
> myself one-by-one to see if something like this does actually come up,
> but we probably should do that.

RISC-V does have a register swap instruction, CSRRW, used for instance
on trap entry (swap x4 and mscratch/sscratch). You might think this is
not a problem, since the CSRs are special, but consider the sequence
"swap r1 scratch; swap scratch r2", which moves the value from r1 to r2
in a roundabout way.

You either have to treat the CSRs as normal registers (but once you do
that, you have a swap instruction), or treat CSR instructions as
serializing (which might be less acceptable to some).

> I'm also making somewhat of an implicit/informal assumption that rs1/rs2
> are read and rd is written, but really we should work with the ISA
> formalization task group to formalize the notions of "register is read/
> written by", "used to calculate the address of", etc.

That assumption might even be important for the hardware, since it
allows reading the registers in parallel with decoding most of the
instruction (only the major opcode needs to be decoded first).

And yet again, the CSR instructions are weird: some of them treat rs1
not as a register number, but as an immediate. I believe that should be
mentioned in the memory model, in the following way: "these particular
instructions don't create a dependency on rs1, but it's okay if a
particular implementation has a false dependency (due to treating it as
a dummy register read)". False dependencies are generally ok, but in
this particular case I believe it's better to be explicit that it's not
a problem.

Jonas Oberhauser

Dec 7, 2017, 8:56:18 AM12/7/17

to Cesar Eduardo Barros, Daniel Lustig, RISC-V ISA Dev

Thanks for being watchful!

I'm also making somewhat of an implicit/informal assumption that rs1/rs2
are read and rd is written, but really we should work with the ISA
formalization task group to formalize the notions of "register is read/
written by", "used to calculate the address of", etc.

That assumption might even be important for the hardware, since it allows reading the registers in parallel with decoding most of the instruction (only the major opcode needs to be decoded first).

Reading what is stated above again I want to make sure we are all in sync: Dan, you were only stating an example for a specific instruction class, right? You do not imply, for example, that an add with an immediate constant reads rs2, or that a store instruction writes rd.

And yet again, the CSR instructions are weird: some of them treat rs1 not as a register number, but as an immediate. I believe that should be mentioned in the memory model, in the following way: "these particular instructions don't create a dependency on rs1, but it's okay if a particular implementation has a false dependency (due to treating it as a dummy register read)". False dependencies are generally ok, but in this particular case I believe it's better to be explicit that it's not a problem.

There is a general way to define these dependencies, but I think for the sake of the readers it makes sense to not be general and define it by a case by case analysis in the obvious way.

The general way is this: Take an instruction word I.

Let now for a register configuration c (i.e., a record with values for all registers) the register configuration after executing instruction I and possibly using a load result lr that matches I starting with that configuration be labeled as c'.

A register r is not written by I if for all load results lr and register configurations c the value of r is the same in c and c'.

A register r can be computed using a set of registers D if for all load results lr and all register configurations c and d in which the registers in D have the same values, r has the same value in c' and d'.

In short if you know the values of registers in D (and the load result if any) you can immediately compute the value of r.

Register r uses registers D if D is the minimum set of registers that can be used to compute r.

(it is not obvious that there is only one such minimal set ... but there is)

Note that this does not create dependencies for semantic nops like "add x1 x1 x0" - which is fine for our purposes. The PPO is the same either way.

Anyway, just thought I'd drop that out there.
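
The general definition above can be tried out by brute force on a toy machine. The Python sketch below is an illustration under loud assumptions: three registers of two bits each, no load results, and hypothetical helper names (`can_compute`, `uses`). It searches for the smallest register set D from which a register's new value can be computed.

```python
from itertools import combinations, product

REGS = ("a1", "a2", "s1")
VALUES = range(4)  # 2-bit registers keep the brute force tiny

def all_configs():
    for vals in product(VALUES, repeat=len(REGS)):
        yield dict(zip(REGS, vals))

def can_compute(instr, r, D):
    """True if r after instr is determined by the values of D before it."""
    seen = {}
    for c in all_configs():
        key = tuple(c[x] for x in D)
        out = instr(c)[r]
        if seen.setdefault(key, out) != out:
            return False
    return True

def uses(instr, r):
    """Smallest set D from which r can be computed."""
    for size in range(len(REGS) + 1):
        for D in combinations(REGS, size):
            if can_compute(instr, r, D):
                return set(D)

def xor_a2_a1_a1(c):
    return {**c, "a2": c["a1"] ^ c["a1"]}

def add_s1_s1_a2(c):
    return {**c, "s1": (c["s1"] + c["a2"]) % 4}

print(uses(xor_a2_a1_a1, "a2"))          # set()
print(sorted(uses(add_s1_s1_a2, "s1")))  # ['a2', 's1']
```

Under this value-based notion, a2 after `xor a2,a1,a1` uses no registers at all, since its result is always zero. Any dependency on a1 through such an instruction is therefore invisible to this definition.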

Andrew Waterman

Dec 7, 2017, 3:17:07 PM12/7/17

to Cesar Eduardo Barros, Daniel Lustig, Jonas Oberhauser, RISC-V ISA Dev

My reading of the syntactic dependence rules is that this case is
already covered, as the rules just refer to "registers," which CSRs
are. So for CSRRx, rd depends on the CSR, and (subsequently) the CSR
depends on rs1. That's sufficient to make r2 depend on r1 in your
example.
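
That reading can be sketched in a few lines of Python. The encoding below is hypothetical: CSRs are tracked like ordinary registers, each destination lists the registers it is computed from, and x5/x6 merely stand in for r1/r2. It is an illustration of the argument, not wording from the draft.

```python
# Hypothetical encoding: (text, registers read, {destination: source registers}).
PROGRAM = [
    ("ld x5,0(s0)",          {"s0"},             {"x5": {"s0"}}),
    ("csrrw x5,mscratch,x5", {"x5", "mscratch"},
     {"x5": {"mscratch"}, "mscratch": {"x5"}}),
    ("csrrw x6,mscratch,x6", {"x6", "mscratch"},
     {"x6": {"mscratch"}, "mscratch": {"x6"}}),
    ("sd x6,0(s1)",          {"x6", "s1"},       {}),
]

def track(program):
    """deps[b][r] = earlier instructions that r, as read by b, depends on."""
    carried = {}  # register or CSR -> instruction indices it depends on
    deps = []
    for index, (_text, reads, writes) in enumerate(program):
        deps.append({r: set(carried.get(r, ())) for r in reads})
        # Compute all destinations from the pre-instruction state so that a
        # swap does not observe its own partial update.
        updates = {}
        for dest, sources in writes.items():
            inherited = set()
            for s in sources:
                inherited |= carried.get(s, set())
            updates[dest] = {index} | inherited
        carried.update(updates)
    return deps

deps = track(PROGRAM)
print(sorted(deps[3]["x6"]))  # [0, 1, 2]
```

The register read by the final store depends on the initial load through the two CSR swaps. The coarser instruction-level wording, where anything written depends on anything read, would yield a superset of these dependencies, so the conclusion holds under either reading.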

>
>> I'm also making somewhat of an implicit/informal assumption that rs1/rs2
>> are read and rd is written, but really we should work with the ISA
>> formalization task group to formalize the notions of "register is read/
>> written by", "used to calculate the address of", etc.
>
>
> That assumption might even be important for the hardware, since it allows
> reading the registers in parallel with decoding most of the instruction
> (only the major opcode needs to be decoded first).
>
> And yet again, the CSR instructions are weird: some of them treat rs1 not as
> a register number, but as an immediate. I believe that should be mentioned
> in the memory model, in the following way: "these particular instructions
> don't create a dependency on rs1, but it's okay if a particular
> implementation has a false dependency (due to treating it as a dummy
> register read)". False dependencies are generally ok, but in this particular
> case I believe it's better to be explicit that it's not a problem.
>
> --
> Cesar Eduardo Barros
> ces...@cesarb.eti.br
>

pranith kumar

Dec 7, 2017, 6:36:52 PM12/7/17

to RISC-V ISA Dev

Hi Daniel,

In section 2.3.4, there is this text:

By default, the fence instruction ensures that all memory accesses from instructions preceding
the fence in program order (the “predecessor set”) appear later in the global memory order than
memory accesses from instructions appearing after the fence in program order (the “successor set”).

I think that later should be "before"?

Thanks,

Muralidaran Vijayaraghavan

Dec 8, 2017, 10:53:24 AM12/8/17

to RISC-V ISA Dev, ces...@cesarb.eti.br, dlu...@nvidia.com

While no additional PPO edges are created for "add x1 x1 x0", we intentionally want a PPO edge to be created to "xor a2 a1 a1" from the instruction that produces a1, even though a2 is always going to be 0. I am copying Figure 2.5 from the document here.

ld a1,0(s0)

xor a2,a1,a1

add s1,s1,a2

ld a5,0(s1)

In particular, we want to intentionally create an address-dependency/PPO-edge between the two loads.

Your notion of dependency does not capture such cases.

Jonas Oberhauser

Dec 8, 2017, 11:15:06 AM12/8/17

to Muralidaran Vijayaraghavan, RISC-V ISA Dev, Cesar Eduardo Barros, dlu...@nvidia.com

On Dec 8, 2017 16:53, "Muralidaran Vijayaraghavan" <murali....@gmail.com> wrote:

we intentionally want a PPO edge to be created to "xor a2 a1 a1" from the instruction that produces a1, even though a2 is always going to be 0. I am copying Figure 2.5 from the document here.

ld a1,0(s0)
xor a2,a1,a1
add s1,s1,a2
ld a5,0(s1)

In particular, we want to intentionally create an address-dependency/PPO-edge between the two loads.

I see! That's interesting and I did not quite expect that. Is this something code relies on?

Anyway, just thought I'd drop that out there.

Muralidaran Vijayaraghavan

Dec 8, 2017, 1:08:46 PM12/8/17

to RISC-V ISA Dev

We introduced these syntactic dependencies for some kernel code requirements. My understanding is that only the Alpha ISA Port of the Linux kernel relaxed these dependencies and that port has not been kept up to date.

Daniel Lustig

Dec 8, 2017, 2:02:10 PM12/8/17

to Murali Vijayaraghavan, Jonas Oberhauser, RISC-V ISA Dev, Cesar Eduardo Barros

On 12/8/2017 10:04 AM, Murali Vijayaraghavan wrote:

> On Dec 8, 2017 11:15 AM, "Jonas Oberhauser" <s9jo...@gmail.com> wrote:
>
>
>
> On Dec 8, 2017 16:53, "Muralidaran Vijayaraghavan" <murali....@gmail.com>
> wrote:
>>
>> we intentionally want a PPO edge to be created to "xor a2 a1 a1" from the
>> instruction that produces a1, even though a2 is always going to be 0. I am
>> copying Figure 2.5 from the document here.
>>
>> ld a1,0(s0)
>> xor a2,a1,a1
>> add s1,s1,a2
>> ld a5,0(s1)
>>
>> In particular, we want to intentionally create an
>> address-dependency/PPO-edge between the two loads.
>>
>>
>> I see! That's interesting and I did not quite expect that. Is this
>> something code relies on?
>
>

Yes, Linux requires these dependencies, as do some implementations of
Java final fields and some parallel garbage collection (i.e., pointer-
chasing) algorithms, to give a few other examples. So yes, it's intentional.

Daniel Lustig

Dec 8, 2017, 2:09:44 PM12/8/17

to pranith kumar, RISC-V ISA Dev

On 12/7/2017 3:36 PM, pranith kumar wrote:
> Hi Daniel,
>
> In section 2.3.4, there is this text:
>
>> By default, the fence instruction ensures that all memory accesses from
>> instructions preceding the fence in program order (the “predecessor set”)
>> appear *later* in the global memory order than memory accesses from
>> instructions appearing after the fence in program order (the “successor
>> set”).
>>
>
> I think that later should be "before"?

Fixed, thanks!

Daniel Lustig

Dec 8, 2017, 2:13:18 PM12/8/17

to Jonas Oberhauser, RISC-V ISA Dev

On 12/6/2017 10:17 PM, Jonas Oberhauser wrote:
> Does not TSO preserve program order for partial hits? I did not see that for Ztso.
> A partial hit occurs if a is a store, b is a load, accessed regions overlap, but
> b loads from addresses not changed by a.

TSO does not order a store before a load in what we call global memory order,
regardless of address. The load value axiom makes sure that the load returns
the proper value nevertheless.

For microarchitectural intuition: the load b can still return its value before
a is released from the store buffer into globally-visible memory, so it would
seem to every other hart in the system that b happened before a.

Microarchitectures in practice might be conservative and not handle partial
hits out of the store buffer in this way. That's fine too.
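
The intuition can be sketched with a toy store-buffer model in Python. Everything here is a simplifying assumption for illustration (the `Hart` class, whole-word accesses, a FIFO buffer drained by hand); it models neither partial hits nor any real microarchitecture.

```python
from collections import deque

class Hart:
    def __init__(self, memory):
        self.memory = memory         # shared dict: address -> value
        self.store_buffer = deque()  # FIFO of (address, value), oldest first

    def store(self, addr, value):
        # The store is buffered locally and is not yet globally visible.
        self.store_buffer.append((addr, value))

    def load(self, addr):
        # Forward from the youngest matching buffered store, else read memory.
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def drain_one(self):
        # Release the oldest buffered store into globally visible memory.
        addr, value = self.store_buffer.popleft()
        self.memory[addr] = value

memory = {}
h0, h1 = Hart(memory), Hart(memory)

h0.store("x", 1)              # a: store, still sitting in h0's buffer
own = h0.load("x")            # b: load returns 1 by forwarding
other_before = h1.load("x")   # h1 still sees 0
h0.drain_one()                # a becomes globally visible
other_after = h1.load("x")    # now h1 sees 1

print(own, other_before, other_after)  # 1 0 1
```

From h1's point of view the load b completed while the store a was not yet visible, which is the sense in which b can appear before a in global memory order.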

>
> I'm not sure what you mean by #2.
>
>
> I meant address translation and access and dirty bits. The memory operations of
> the MMU behave on their own and w.r.t. memory operations of the CPU how?
>

At the moment they're completely unordered, unless there's a sfence.vma, and
except that accessed/dirty bit updates should be done atomically.

Jonas Oberhauser

Dec 8, 2017, 4:09:13 PM12/8/17

to Daniel Lustig, Murali Vijayaraghavan, RISC-V ISA Dev, Cesar Eduardo Barros

2017-12-08 20:02 GMT+01:00 Daniel Lustig <dlu...@nvidia.com>:

On 12/8/2017 10:04 AM, Murali Vijayaraghavan wrote:
> On Dec 8, 2017 11:15 AM, "Jonas Oberhauser" <s9jo...@gmail.com> wrote:
>
>
>
> On Dec 8, 2017 16:53, "Muralidaran Vijayaraghavan" <murali....@gmail.com>
> wrote:
>
> We introduced these syntactic dependencies for some kernel code
> requirements. My understanding is that only the Alpha ISA Port of the Linux
> kernel relaxed these dependencies and that port has not been kept up to
> date.

Yes, Linux requires these dependencies, as do some implementations of
Java final fields and some parallel garbage collection (i.e., pointer-
chasing) algorithms, to give a few other examples. So yes, it's intentional.

Dan

Thanks! I learn something new every day.

Jonas Oberhauser

Dec 8, 2017, 4:36:06 PM12/8/17

to Daniel Lustig, RISC-V ISA Dev

2017-12-08 20:13 GMT+01:00 Daniel Lustig <dlu...@nvidia.com>:

On 12/6/2017 10:17 PM, Jonas Oberhauser wrote:
> Does not TSO preserve program order for partial hits? I did not see that for Ztso.
> A partial hit occurs if a is a store, b is a load, accessed regions overlap, but
> b loads from addresses not changed by a.

TSO does not order a store before a load in what we call global memory order,
regardless of address.

This is due to the rl/aq bits but I'm not sure what it has to do with this problem, where a load is ordered before a store.

The load value axiom makes sure that the load returns
the proper value nevertheless.

No, because the load can be in global memory order long before the store.

For microarchitectural intuition: the load b can still return its value before
a is released from the store buffer into globally-visible memory, so it would
seem to every other hart in the system that b happened before a.

In a TSO implementation they can not, because then the local order of stores would not coincide with the global order of stores.

In particular the local hart would observe no store between the stores from which the load result of b is pieced together, but while store a remains in the local buffer another hart may commit a store to the same address region.

Therefore in a TSO implementation a has to be released before b and no other thread can observe the other order.

For the axiomatic memory model, here is a global memory order:

T1: S x
T1: L x,y
T2: S x
T1: S y

Here is the program order:
T1: S x
T1: S y
T1: L x,y

Using the load value axiom, we obtain the following stores from which the load reads:
T1: S x
T1: S y

Thus the load observes no store to x between T1's store to x and y, which contradicts the global memory order.

Do you agree?

>
> I'm not sure what you mean by #2.
>
>
> I meant address translation and access and dirty bits. The memory operations of
> the MMU behave on their own and w.r.t. memory operations of the CPU how?
>

At the moment they're completely unordered, unless there's a sfence.vma, and
except that accessed/dirty bit updates should be done atomically.

Thanks. In another thread I was told that in the current RISC-V spec, A/D bit updates also happen in program order and are "exact". Is that still valid?

Daniel Lustig

Dec 8, 2017, 6:18:00 PM12/8/17

to Jonas Oberhauser, RISC-V ISA Dev

On 12/8/2017 1:36 PM, Jonas Oberhauser wrote:
> 2017-12-08 20:13 GMT+01:00 Daniel Lustig <dlu...@nvidia.com>:
>
>> On 12/6/2017 10:17 PM, Jonas Oberhauser wrote:
>>> Does not TSO preserve program order for partial hits? I did not see that
>> for Ztso.
>>> A partial hit occurs if a is a store, b is a load, accessed regions
>> overlap, but
>>> b loads from addresses not changed by a.
>>
>> TSO does not order a store before a load in what we call global memory
>> order,
>> regardless of address.
>
>
> This is due to the rl/aq bits but I'm not sure what it has to do with this
> problem, where a load is ordered before a store.

The store is before the load in program order, but the load is before the
store in PPO and global memory order. That's legal under TSO. This is
just a wording mixup.

>
>
>> The load value axiom makes sure that the load returns
>> the proper value nevertheless.
>>
>
> No, because the load can be in global memory order long before the store.

Even if the load appears before the store in global memory order, it can
still return a value from that store. That's what part two of the load
value axiom says.

>
>
>> For microarchitectural intuition: the load b can still return its value
>> before
>> a is released from the store buffer into globally-visible memory, so it
>> would
>> seem to every other hart in the system that b happened before a.
>>
>
> In a TSO implementation they can not, because then the local order of
> stores would not coincide with the global order of stores.
> In particular the local hart would observe no store between the stores from
> which the load result of b is pieced together, but while store a remains in
> the local buffer another hart may commit a store to the same address
> region.

There's no such rule under TSO. The order of the hart's own stores entering
the store buffer doesn't have to (and in general, does not) match the global
memory order. What matters is that every hart agrees on the order in which
things do become globally visible.

Another good test to consider to understand TSO store-load reordering
subtleties is n7, from here:
https://www.cl.cam.ac.uk/~pes20/weakmemory/x86tso-paper.pdf

>
> Therefore in a TSO implementation a has to be released before b and no
> other thread can observe the other order.
>
> For the axiomatic memory model, here is a global memory order:
>
> T1: S x
> T1: L x,y
> T2: S x
> T1: S y
>
> Here is the program order:
> T1: S x
> T1: S y
> T1: L x,y
>
> Using the load value axiom, we obtain the following stores from which the
> load reads:
> T1: S x
> T1: S y
>
> Thus the load observes no store to x between T1's store to x and y, which
> contradicts the global memory order.

If you prefer, here's another legal global memory order which also produces
exactly that same outcome:

T1: S x
T1: S y
T1: L x,y

T2: S x

With that, I can justify the original execution.

>
> Do you agree?
>
>
>
>>>
>>> I'm not sure what you mean by #2.
>>>
>>>
>>> I meant address translation and access and dirty bits. The memory
>> operations of
>>> the MMU behave on their own and w.r.t. memory operations of the CPU how?
>>>
>>
>> At the moment they're completely unordered, unless there's a sfence.vma,
>> and
>> except that accessed/dirty bit updates should be done atomically.
>>
>
> Thanks. In another thread I was told that in the current RISC-V spec, A/D
> bit updates also happen in program order and are "exact". Is that still
> valid?

Do you mean happen in program order before the access that triggered them?
What does "exact" mean here?

Jonas Oberhauser

Dec 9, 2017, 5:34:24 AM

to Daniel Lustig, RISC-V ISA Dev

On Dec 9, 2017 00:17, "Daniel Lustig" <dlu...@nvidia.com> wrote:

On 12/8/2017 1:36 PM, Jonas Oberhauser wrote:
> 2017-12-08 20:13 GMT+01:00 Daniel Lustig <dlu...@nvidia.com>:
>
>> On 12/6/2017 10:17 PM, Jonas Oberhauser wrote:
>>> Does not TSO preserve program order for partial hits? I did not see that
>> for Ztso.
>>> A partial hit occurs if a is a store, b is a load, accessed regions
>> overlap, but
>>> b loads from addresses not changed by a.
>>
>> TSO does not order a store before a load in what we call global memory
>> order,
>> regardless of address.
>
>
> This is due to the rl/aq bits but I'm not sure what it has to do with this
> problem, where a load is ordered before a store.

The store is before the load in program order, but the load is before the
store in PPO and global memory order. That's legal under TSO. This is
just a wording mixup.

o.k.

>
>
>> The load value axiom makes sure that the load returns
>> the proper value nevertheless.
>>
>
> No, because the load can be in global memory order long before the store.

Even if the load appears before the store in global memory order, it can
still return a value from that store. That's what part two of the load
value axiom says.

Yes, but I was not worried about that store (by the same thread), only about the remote store. Sorry for the imprecise language.

>
>
>> For microarchitectural intuition: the load b can still return its value
>> before
>> a is released from the store buffer into globally-visible memory, so it
>> would
>> seem to every other hart in the system that b happened before a.
>>
>
> In a TSO implementation they can not, because then the local order of
> stores would not coincide with the global order of stores.
> In particular the local hart would observe no store between the stores from
> which the load result of b is pieced together, but while store a remains in
> the local buffer another hart may commit a store to the same address
> region.

There's no such rule under TSO. The order of the hart's own stores entering
the store buffer doesn't have to (and in general, does not) match the global
memory order. What matters is that every hart agrees on the order in which
things do become globally visible.

Another good test to consider to understand TSO store-load reordering
subtleties is n7, from here:
https://www.cl.cam.ac.uk/~pes20/weakmemory/x86tso-paper.pdf

This was not quite what I meant, but I agree I have made a mistake here.

Do note that no existing Intel x86 multi-core implementation forwards on a partial hit (as can be seen in https://www.intel.de/content/www/de/de/architecture-and-technology/64-ia-32-architectures-optimization-manual.html).

At some point (before my time) this was erroneously introduced into our formal models as part of the spec, rather than just as an implementation detail. I never noticed until now that this is explicitly not part of the Intel/AMD spec.

So, thanks :)

If you prefer, here's another legal global memory order which also produces
exactly that same outcome:

T1: S x
T1: S y
T1: L x,y

T2: S x

With that, I can justify the original execution.

The execution should not be considered in isolation. Here's a bigger context, in which I also make T2's store bigger:

T1: S x

T1: L x,y

T2: S x,y

T1: S y

T1: L x,y

In this case the first load sees the stores of T1 in sequence, but the second load sees the store of T2 "retroactively" appear in between.
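
As a sanity check, here is a small Python sketch that applies the load value axiom per location to the global memory order above. It is an illustration under assumptions: I take T1's program order to be S x, S y, L x,y, L x,y, I treat x and y as two separate locations, and the names Op and reads_from are invented for this example.

```python
from typing import NamedTuple, Tuple

class Op(NamedTuple):
    name: str
    hart: int
    po: int                 # position in program order (assumed, see above)
    kind: str               # "S" or "L"
    addrs: Tuple[str, ...]  # locations accessed

S1x = Op("T1: S x", 1, 0, "S", ("x",))
S1y = Op("T1: S y", 1, 1, "S", ("y",))
L1 = Op("T1: L x,y (first)", 1, 2, "L", ("x", "y"))
L2 = Op("T1: L x,y (second)", 1, 3, "L", ("x", "y"))
S2 = Op("T2: S x,y", 2, 0, "S", ("x", "y"))

# The global memory order from the example above.
gmo = [S1x, L1, S2, S1y, L2]

def reads_from(load, gmo):
    """For each location, name the store the load reads from."""
    pos = {op: i for i, op in enumerate(gmo)}
    result = {}
    for addr in load.addrs:
        candidates = [
            s for s in gmo
            if s.kind == "S" and addr in s.addrs and (
                pos[s] < pos[load]                           # earlier in global memory order
                or (s.hart == load.hart and s.po < load.po)  # earlier in program order
            )
        ]
        latest = max(candidates, key=lambda s: pos[s], default=None)
        result[addr] = latest.name if latest else "initial value"
    return result

first = reads_from(L1, gmo)   # x from "T1: S x", y from "T1: S y"
second = reads_from(L2, gmo)  # x from "T2: S x,y", y from "T1: S y"
```

Under these assumptions the first load reads both locations from T1's stores, while the second load reads x from T2's store and y from T1's store, which is the "retroactive" appearance described above.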

Note that I'm no longer claiming that this is not a valid x86-TSO execution (even though no Intel x86 CPU to date might have this behavior).

>
> Thanks. In another thread I was told that in the current RISC-V spec, A/D
> bit updates also happen in program order and are "exact". Is that still
> valid?

Do you mean happen in program order before the access that triggered them?
What does "exact" mean here?

I am just replaying a sound bite ;). I don't know the original intention. But as far as I understood, "in program order" means "directly before the instruction with the access that triggered them".

"exact" means "Iff there is an actual access".

In particular, as far as I understand, a load from the PTE followed by the first access that uses the PTE would never see A/D bits.

How this ties in with speculation and quashing is not clear to me.

I think if you want to honor that part of the spec -- and I personally don't -- you might have to provide a few stronger guarantees, e.g., that a load is not translated unless we are sure both that no previous memory operation accesses a PTE for which HW will set A/D bits and that the load will not be quashed.
