I find it extremely concerning that two memory consistency models are being proposed that would result in incompatible binaries, particularly when the failure mode is potentially subtly different, implementation-defined results. When properly accounted for, this fragments the compiled RISC-V software ecosystem, even with the best of intentions. When not properly managed, it results in data corruption and bugs that appear in production but not in testing.
On Dec 2, 2017, at 12:46 PM, Andrew Waterman <wate...@eecs.berkeley.edu> wrote:

> TSO binaries will have different ELF flags, so the failure won’t be subtle: the OS/loader will refuse to execute the binary. In this respect, it’s like other ISA/ABI extensions.

You’re assuming a standard UNIX-y environment with ELF binaries.

> All standard software will be WMO and thus will run on either hardware. At the moment there aren’t even any compilers that target TSO, and as one of the GCC maintainers, my preference is to never support TSO code generation. So, while I share your concerns about fragmentation, I think we can succeed in mitigating it.

That’s wishful thinking. On the other hand, TSO code is smaller (that’s why it’s even being officially supported by the ISA, right?), so I guarantee you’ll see performance-critical, size-constrained micro distributions compiling TSO code without fences.

People will use these to make products. Years later, people supporting or inheriting these products will do a hardware refresh and for whatever reason (I can think of many) want to source a WMO design instead. Maybe they want to emulate the microcontroller inside a virtualized thread on a single chip, and don’t have the original source code, just a flat binary firmware file. They’re stuck: they have to source a TSO chip or pick a different approach.
On Sat, Dec 2, 2017 at 1:02 PM Mark Friedenbach <ma...@friedenbach.org> wrote:

>> All standard software will be WMO and thus will run on either hardware. At the moment there aren’t even any compilers that target TSO, and as one of the GCC maintainers, my preference is to never support TSO code generation. So, while I share your concerns about fragmentation, I think we can succeed in mitigating it.
>
> That’s wishful thinking.

Let’s call it hopeful thinking. It doesn’t require a magical outcome: it just requires staying the current course.

> On the other hand, TSO code is smaller (that’s why it’s even being officially supported by the ISA, right?), so I guarantee you’ll see performance-critical, size-constrained micro distributions compiling TSO code without fences.

The memory model has nearly zero impact on code size, since synchronization is usually statically uncommon, contained in library code, or both. In embedded systems, where code size matters most, most synchronization is with I/O, not memory, and is thus unaffected by this debate.

> People will use these to make products. Years later, people supporting or inheriting these products will do a hardware refresh and for whatever reason (I can think of many) want to source a WMO design instead. Maybe they want to emulate the microcontroller inside a virtualized thread on a single chip, and don’t have the original source code, just a flat binary firmware file. They’re stuck: they have to source a TSO chip or pick a different approach.

Yep. There’s no way around this sort of lock-in. However, it will happen whether or not we standardize RVTSO. As I said in another email, vendors are building and advertising TSO as a feature. We can’t stop that. We made the strategic decision to standardize how vendors should do this, as a means of preventing yet more fragmentation (e.g., RISC-VI).
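Coming back to the code-size point, here is a minimal sketch of what I mean, assuming a conventional compiler mapping along the lines of the one in the spec's porting/mapping appendix (the variable names are purely illustrative). Ordinary loads and stores compile identically under RVWMO and RVTSO; only the atomic accesses pick up fences, and only under RVWMO:

    #include <stdatomic.h>

    int data;          /* plain accesses: identical code under either model */
    atomic_int flag;

    void producer(void)
    {
        data = 42;                                    /* sw                     */
        atomic_store_explicit(&flag, 1,
                              memory_order_release);  /* RVWMO: fence rw,w ; sw */
                                                      /* RVTSO: sw              */
    }

    int consumer(void)
    {
        while (atomic_load_explicit(&flag,
                                    memory_order_acquire) == 0)
            ;                                         /* RVWMO: lw ; fence r,rw */
                                                      /* RVTSO: lw              */
        return data;                                  /* lw                     */
    }

In race-free code like this, the fences are confined to the handful of atomic accesses, which is why the overall code-size delta is in the noise.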
Hi Anthony,
On 12/2/2017 7:39 AM, Anthony Coulter wrote:
> Also: What's the difference between rules 2 and 4 in this section? More
> specifically, what's the difference between strongly-ordered I/O
> regions and their corresponding channel numbers? I/O channels are
> introduced in section 3.5.4 of the RISC-V privileged spec but it isn't
> really clear to me what exactly they are, whether they're visible to
> software, etc. (I had always assumed they were *not* software-visible,
> and were just an informal labeling of different strongly-ordered I/O
> regions. But my understanding doesn't really explain why channels zero
> and one are special; it would seem that every memory map is required to
> support these two I/O regions with special synchronization properties
> even if there are no underlying devices that require those properties.
> I'm almost certainly wrong.)
Andrew or someone else who knows the privileged spec better than I do
should answer this one.
- Again, the debate here should be the following: should Ztso be
defined as a standard extension, or not? The rest is either not
going to change or is part of a separate, later debate (see below).
As a task group, in conjunction with the Foundation, we thought
it would be better to standardize Ztso than to leave it quietly
unsupported, or supported with unusable performance, or to simply
alienate that segment of the market altogether.
- Nothing in our proposal is meant to rule out or discourage people
from proposing new extensions with new l{b|h|w|d}.aq + s{b|h|w|d}.rl,
per-page modes, dynamically reconfigurable "TSO mode", and/or anything
else. We're simply aiming to propose a model for the base ISA as it's
currently defined, with the instructions it currently has, because
that's what's out there today, and that's what software is being built
around. If you want to propose any of the above, that's great, but
let's consider it a separate discussion about a future extension.
In the meantime let's discuss the base ISA memory model here. We want
to keep those decoupled because we want to move forward with ratifying
the base ISA in the coming weeks or months, and don't want to delay it
to include lots of things which won't be in the base ISA spec.
[...]
The problem with simply defining every load and store to be .aq and
.rl today is that the opcodes just don't exist. Like I said, it's
fine if someone wants to propose adding them, but the Foundation wants
us to treat that as a separate task group and a separate future
extension. We are proposing to add these as assembler
pseudoinstructions, though, for similar reasons, and for forward
compatibility if/when they do get added.
- People who write C code that implicitly assumes TSO are writing
invalid code. C has its own memory model, and it's not TSO, and it's
not going to be TSO regardless of what RISC-V does.
We're looking into a compiler option to emit such code as an RVWMO binary
which has enough .aq/.rl/fences to make it behave as TSO, but since this
code is already in violation of the C spec anyway, it's not an easy thing
to just build and put into production. (There's a sketch of the kind of
code in question after this list.)
- Plus, race-free code (i.e., a good chunk of the code out there) really
doesn't need to be TSO everywhere. That's the magic of "SC-for-DRF".
The hard part is racy code, and/or code used to write synchronization
primitives. Unfortunately there's no easy way that I know of to say
"compile this block of code as TSO" and "compile the rest as DRF".
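As promised above, here is a rough sketch of the kind of C code that "implicitly assumes TSO" (the names are made up for illustration). Under the C11 memory model this is a data race, so it is invalid regardless of which RISC-V memory model the hardware implements:

    /* Flag-based handoff written with plain variables. */
    int payload;
    int ready;

    void producer(void)
    {
        payload = 42;
        ready = 1;        /* nothing constrains the compiler or an RVWMO core
                             to make this visible after the payload store */
    }

    int consumer(void)
    {
        while (ready == 0)
            ;             /* the compiler may even hoist this load out of the loop */
        return payload;
    }

The data-race-free fix is to make the flag a C11 atomic (or use a lock), at which point the compiler inserts whatever fences the target memory model actually needs.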
Hi Jonas,

Thanks for the feedback.

On 12/6/2017 5:39 AM, Jonas Oberhauser wrote:

The distinction between syntactic and semantic dependency is important,
so we're using that particular term very intentionally. We can clean up
the term "register dependency" though.
(Also note that the pc register is implicitly but intentionally also
excluded from this definition. We don't really say that now, since
RISC-V treats pc as a somewhat special register in the current spec,
but we should probably clarify nevertheless...)
Do you see any current RISC-V instructions where this actually comes up?
I haven't surveyed them myself one by one to check, but we probably
should do that.
> Now if I understood correctly, you haven't yet addressed 1) mixed
> size/misaligned accesses, 2) memory management 3) interrupts?

#1 is addressed in the draft. Did you miss it? Or did we miss
something else that you're looking for?

I'm not sure what you mean by #2.
I'm also making somewhat of an implicit/informal assumption that rs1/rs2
are read and rd is written, but really we should work with the ISA
formalization task group to formalize the notions of "register is read/
written by", "used to calculate the address of", etc.
That assumption might even be important for the hardware, since it allows reading the registers in parallel with decoding most of the instruction (only the major opcode needs to be decoded first).
And yet again, the CSR instructions are weird: some of them treat rs1 not as a register number, but as an immediate. I believe that should be mentioned in the memory model, in the following way: "these particular instructions don't create a dependency on rs1, but it's okay if a particular implementation has a false dependency (due to treating it as a dummy register read)". False dependencies are generally ok, but in this particular case I believe it's better to be explicit that it's not a problem.
By default, the fence instruction ensures that all memory accesses from instructions preceding
the fence in program order (the “predecessor set”) appear earlier in the global memory order than
memory accesses from instructions appearing after the fence in program order (the “successor set”).
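As a concrete illustration of that predecessor/successor wording, here is a sketch only, assuming the conventional mapping of C11 fences to RISC-V under which a seq_cst thread fence becomes fence rw,rw:

    #include <stdatomic.h>

    atomic_int a, b;

    void example(void)
    {
        atomic_store_explicit(&a, 1, memory_order_relaxed);  /* predecessor set */
        atomic_thread_fence(memory_order_seq_cst);           /* fence rw,rw     */
        atomic_store_explicit(&b, 1, memory_order_relaxed);  /* successor set   */
    }

Here the store to a must appear earlier in the global memory order than the store to b, precisely because one is in the fence's predecessor set and the other in its successor set.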
Hi everyone,
We in the RISC-V memory model task group are ready to release the first
public draft of the memory consistency model specification that we've
been working on over the past few months. For those of you who
attended the workshop this week, this document will fill in some of the
details. For those of you who couldn't make it, I've attached my
presentation slides as well. The video of my talk (and of all the other
talks) should be posted online within a week or so.
If anyone has any comments, questions, or feedback, feel free to respond
here, to reach out to us in the memory model task group, or even just to
respond to me directly. I'm more than happy to take the feedback.
Over the next few weeks, assuming nobody uncovers any glaring errors,
we'll start working to merge this into the rest of the user-level ISA
spec (in some way or other, details TBD) so that we can aim to put forth
both together for official ratification in the coming months. We'll
also of course fix any typos, bugs, or discrepancies that are found in
the meantime.
We're also actively communicating with the Linux maintainers, the gcc
and LLVM maintainers, and more so that we make sure that the memory
model interacts properly with all of the above.
Let us know what you think!
Dan
We intentionally want a PPO edge to be created to "xor a2,a1,a1" from the
instruction that produces a1, even though a2 is always going to be 0. I am
copying Figure 2.5 from the document here:

    ld  a1,0(s0)
    xor a2,a1,a1
    add s1,s1,a2
    ld  a5,0(s1)

In particular, we want to intentionally create an address-dependency/PPO-edge
between the two loads.
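For what it's worth, the same two-load pattern is roughly what a pointer-chasing read looks like at the C level. A hypothetical sketch (the struct and names are made up; memory_order_consume is the C11 way of asking for dependency ordering, though compilers today generally promote it to acquire):

    #include <stdatomic.h>

    struct node { int payload; };
    _Atomic(struct node *) head;

    int reader(void)
    {
        /* The address of the second load depends on the value returned by
           the first load, mirroring the ld/xor/add/ld chain in Figure 2.5. */
        struct node *p = atomic_load_explicit(&head, memory_order_consume);
        return p->payload;
    }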
Anyway, just thought I'd drop that out there.
On 12/8/2017 10:04 AM, Murali Vijayaraghavan wrote:
> On Dec 8, 2017 11:15 AM, "Jonas Oberhauser" <s9jo...@gmail.com> wrote:
>
> On Dec 8, 2017 16:53, "Muralidaran Vijayaraghavan" <murali....@gmail.com> wrote:
>
> We introduced these syntactic dependencies for some kernel code
> requirements. My understanding is that only the Alpha ISA port of the Linux
> kernel relaxed these dependencies, and that port has not been kept up to
> date.
Yes, Linux requires these dependencies, as do some implementations of
Java final fields and some parallel garbage collection (i.e., pointer-
chasing) algorithms, to give a few other examples. So yes, it's intentional.
Dan
On 12/6/2017 10:17 PM, Jonas Oberhauser wrote:
> Does not TSO preserve program order for partial hits? I did not see that for Ztso.
> A partial hit occurs if a is a store, b is a load, accessed regions overlap, but
> b loads from addresses not changed by a.
TSO does not order a store before a load in what we call global memory order,
regardless of address.
The load value axiom makes sure that the load returns
the proper value nevertheless.
For microarchitectural intuition: the load b can still return its value before
a is released from the store buffer into globally-visible memory, so it would
seem to every other hart in the system that b happened before a.
>> I'm not sure what you mean by #2.
>
> I meant address translation and accessed/dirty bits. How do the memory
> operations of the MMU behave on their own and with respect to the memory
> operations of the CPU?

At the moment they're completely unordered, unless there's an sfence.vma,
and except that accessed/dirty bit updates should be done atomically.
On 12/8/2017 1:36 PM, Jonas Oberhauser wrote:
> 2017-12-08 20:13 GMT+01:00 Daniel Lustig <dlu...@nvidia.com>:
>
>> On 12/6/2017 10:17 PM, Jonas Oberhauser wrote:
>>> Does not TSO preserve program order for partial hits? I did not see
>>> that for Ztso.
>>> A partial hit occurs if a is a store, b is a load, accessed regions
>>> overlap, but b loads from addresses not changed by a.
>>
>> TSO does not order a store before a load in what we call global memory
>> order, regardless of address.
>
> This is due to the rl/aq bits but I'm not sure what it has to do with this
> problem, where a load is ordered before a store.

The store is before the load in program order, but the load is before the
store in PPO and global memory order. That's legal under TSO. This is
just a wording mixup.

>> The load value axiom makes sure that the load returns
>> the proper value nevertheless.
>
> No, because the load can be in global memory order long before the store.

Even if the load appears before the store in global memory order, it can
still return a value from that store. That's what part two of the load
value axiom says.

>> For microarchitectural intuition: the load b can still return its value
>> before a is released from the store buffer into globally-visible memory,
>> so it would seem to every other hart in the system that b happened before a.
>
> In a TSO implementation they can not, because then the local order of
> stores would not coincide with the global order of stores.
> In particular the local hart would observe no store between the stores from
> which the load result of b is pieced together, but while store a remains in
> the local buffer another hart may commit a store to the same address
> region.

There's no such rule under TSO. The order of the hart's own stores entering
the store buffer doesn't have to (and in general, does not) match the global
memory order. What matters is that every hart agrees on the order in which
things do become globally visible.
Another good test for understanding the subtleties of TSO store-load
reordering is n7, from here:
https://www.cl.cam.ac.uk/~pes20/weakmemory/x86tso-paper.pdf
If you prefer, here's another legal global memory order which also produces
exactly that same outcome:
T2: S x
T1: S x
T1: S y
T1: L x,y
With that, I can justify the original execution.
> Thanks. In another thread I was told that in the current RISC-V spec, A/D
> bit updates also happen in program order and are "exact". Is that still
> valid?

Do you mean they happen in program order before the access that triggered them?
What does "exact" mean here?