In case porting software which depends on DW-CAS becomes a problem, I
don't think it would be too hard to create a "double-wide LR/SC"
extension. Here's my back-of-the-napkin design for it.
I would use 4 instructions, which I call LRL, LRH, SCL, SCH. LRL and LRH
are encoded similarly to LR; SCL and SCH are encoded similarly to SC.
The word pair is required to be aligned as if it were a single larger
word (that is, 16-byte alignment for a pair of 8-byte words). This also
guarantees that both words are in the same cacheline.
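For example, in assembler the pair might be laid out like this so that the required alignment holds (a sketch, assuming GNU as directives; the label is arbitrary):
        .balign 16        # align the pair as one 16-byte unit (RV64)
pair:
        .dword 0          # first word, loaded by LRL
        .dword 0          # second word, loaded by LRH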
The sequence of operations for a DW-CAS would be:
- LRL of the first word to a register;
- LRH of the second word to a register;
- compare both registers and branch if either differs;
- SCL to the first word;
- SCH to the second word.
The LRL reserves the two words (or their cacheline). The LRH is
guaranteed to not break the reservation, as long as it's for the word
following the one just loaded by the LRL (which it can read from the L1
cache, since it's guaranteed to be in the same cacheline).
The SCL does not store anything; it "buffers" the word and its address.
The SCH "releases" the store from the SCL and stores its own word, as
long as the reservation is intact and the address is the expected one
(the word following the SCL's), and writes the success result to rd.
Each of these instructions still reads at most two registers and writes
at most one register.
The complexity is all in the SCL, since it has to buffer the word; this
might be a simple modification of a buffer which already exists to
combine multiple writes to the same cacheline, some out-of-order
bookkeeping structure, or a dedicated internal register.
Thoughts?
--
Cesar Eduardo Barros
ces...@cesarb.eti.br
So the CAS example in Figure 5.1 would become, for DW-CAS:
dw_cas:
        mv     t2, a0
        lrl.w  t0, (t2)
        lrh.w  t1, (t2)
        li     a0, 1
        bne    t0, a2, return
        bne    t1, a3, return
        scl.w  a4, (t2)
        sch.w  a0, a5, (t2)
return:
        jr     ra
An alternative would be to make a more general multi-LR/SC support, but allow implementations to support it only in special cases. Off the top of my head, there could be "LR additional" (LRA) and "SC buffer" (SCB) instructions -- basically a nested LR/SC. LRA loads reserved without breaking existing reservations. SCB buffers an SC for writing, but it doesn't actually write until the next SC. If you overrun the hardware's transactional memory support, the SC certainly fails.
To make this usable, you'd need some sort of hardware profiles beyond whether the instruction exists or not. So if a certain CPU flag is present, the pair is guaranteed to be suitable to emulate DW-CAS. If another flag is present, it can emulate DCAS, etc. That way we only need 2 additional instructions, and they cover DW-CAS, CAS2 and other use cases.
Or maybe the better approach is just for the group to design RISCV-T, but again with the same semantics, in that it might only be strong enough for limited use cases.
Cheers,
-- Mike
Hm...
So my dw_cas example would become:
dw_cas:
        addi   t2, a0, 0
        addi   a1, a0, 4
        lr.w   t0, (t2)
        lra.w  t1, (a1)
        li     a0, 1
        bne    t0, a2, return
        bne    t1, a3, return
        scb.w  a4, (t2)
        sc.w   a0, a5, (a1)
return:
        jr     ra
> To make this usable, you'd need some sort of hardware profiles beyond whether the instruction exists or not. So if a certain CPU flag is present, the pair is guaranteed to be suitable to emulate DW-CAS. If another flag is present, it can emulate DCAS, etc. That way we only need 2 additional instructions, and they cover DW-CAS, CAS2 and other use cases.
I can see two parameters to specify the hardware profile for LRA/SCB:
- The number of cachelines it can reserve;
- The number of buffers it can use to hold the result (each SCB uses one
of them).
Even the most restrictive implementation of LRA/SCB (one cacheline and
one buffer) would be able to emulate DW-CAS, as long as the cacheline is
at least two words long (16 bytes on RV64, a reasonable requirement),
and both values are in the same cacheline (just align them naturally for
their combined size and it'll work). Therefore, if these instructions
are present, they can always do the DW-CAS trick.
For DCAS/CAS2, you'd need only two cachelines and one buffer. For
transactional memory, you'd need more buffers, more cachelines, and
relaxed requirements on the retry loop contents.
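For instance, the DCAS case might be sketched like this, reusing the hypothetical lra.w/scb.w mnemonics and register conventions of the dw_cas example above (an illustration of the two-cacheline/one-buffer profile only, not a worked-out proposal):
dcas:
        lr.w   t0, (a0)      # reserve the first cacheline, load old value
        lra.w  t1, (a1)      # additional reservation, second cacheline
        li     t2, 1
        bne    t0, a2, done
        bne    t1, a3, done
        scb.w  a4, (a0)      # buffered store, consumes the single buffer
        sc.w   t2, a5, (a1)  # releases the buffered store, commits both
done:
        mv     a0, t2
        jr     ra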
The "relaxed requirements" would be the sticky part for transactional
memory. For LR/SC or even LR/LRA/SCB/SC, you can have the "as long as it
fits within X instructions, no taken backwards branches on the success
path, and only simple integer instructions" rules, and guarantee
progress if these conditions are met. For more general transactional
memory, you might need a more general "always have a fallback path for
when it fails which does not use transactional memory" rule.
> Or maybe the better approach is just for the group to design RISCV-T, but again with the same semantics, in that it might only be strong enough for limited use cases.
A design focused on transactional memory might look very different than
a simple LR/SC extension; for instance, it might have separate ABORT and
COMMIT instructions, it might have different memory ordering semantics,
it might have a different way to specify a "fallback" path (LR/SC waits
until the SC to return success or failure; a transactional memory
extension might not want to wait until the COMMIT, which might be
hundreds of instructions ahead), and so on.
Trying to have a single design for both might lead to a worse design for
both.
On Monday, 15 February 2016 at 20:21:44 UTC+1, mike wrote:
> An alternative would be to make a more general multi-LR/SC support, but allow implementations to support it only in special cases. [snip]
Something like this makes a lot of sense, but why not simply stack the LRs up to some implementation-defined (and discoverable) depth (greater than or equal to zero)? Then you simply write "LR/LR/LR operation SC/SC/SC retry" and 1) only the last SC commits, and 2) only if all reservations are still held, which is stored in the result register for the last SC. Hardware could guarantee that if you do a successful non-committing SC, the reservation will continue to be held as long as you only do successful SCs (and similarly that while you do LRs, the "timer" for previous reservations does not advance).
An interrupt would delete all existing reservations, clear the non-committed outstanding stores, clear the reservation stack, and consequently fail all remaining SCs (in particular the last one).
There are a few other things that would have to be specified; e.g., doing more SCs than there have been LRs, doing more LRs than the implementation can handle, doing LRs after a non-committing SC, or possibly even doing *anything except an SC* after a non-committing SC. All of these should probably simply raise an illegal-instruction exception.
dw_cas:
        mv    t2, a0
        li    a0, 1
        li    t0, 4
        lr.w  t0, (t2), t0    # reserve 4 + sizeof(w) = 8 bytes, load first word
        lw    t1, 4(t2)
        bne   t0, a1, return
        bne   t1, a2, return
        sw    a3, 4(t2)
        sc.w  a0, a4, (t2)
return:
        jr    ra
(Or, avoiding two jumps in the middle of the critical section and making use of C instructions if possible:)
d{wd}_cas:
        mv       t0, a3
        li       a3, sizeof({wd})
        lr.{wd}  a3, (a0), a3
        l{wd}    a5, sizeof({wd})(a0)
        xor      a3, a3, a1
        xor      a5, a5, a2
        or       a5, a5, a3
        bnez     a5, fail
        s{wd}    a4, sizeof({wd})(a0)
        sc.{wd}  a0, t0, (a0)
        jr       ra
fail:
        li       a0, 1
        jr       ra
I am reminded of a recent talk by Bob Tarjan where he talked about the necessity of an (IIRC) non-adjacent double-CAS in order to solve some graph problem efficiently. I feel that a nested LR/SC is much better for this task.
Thanks for your input. I don't think it will be that much easier on the hardware, considering how easy it is to build a small HW stack, in particular since the depth would typically be two or less.
I think the biggest downside to your suggestion right now is that you can lose the lock due to an interrupt but your regular store would still get executed; so your current implementation is not atomic. Once you give buffered SC semantics to the interior store you lose all the advantages you may have had over the stack.
Of course I like the fact that you need only a single instruction for the locking, but I don't think it would be useful beyond doubling the address range, which makes it a big weapon for a small target. Also, you will still need the additional loads and stores to get the values across :(
The other downside is that you'd need to communicate a second address over the bus to lock the range at once.
What do you think?
On Monday, 4 December 2017 at 22:31:29 UTC+1, Jonas Oberhauser wrote:
> Thanks for your input. I don't think it will be that much easier on the hardware, considering how easy it is to build a small HW stack, in particular since the depth would typically be two or less.
The idea is that the implementation-defined maximum is pretty small and you typically would lock just one cacheline, and fail if you ask for a size larger than a cacheline.
> I think the biggest downside to your suggestion right now is that you can lose the lock due to an interrupt but your regular store would still get executed; so your current implementation is not atomic. Once you give buffered SC semantics to the interior store you lose all the advantages you may have had over the stack.
You would know that the operation failed, and that the first write would have taken place. I agree, though, that as written it is pretty fatal :-( You would indeed have to commit the whole segment atomically on the final SC. Since that segment is known well in advance and is supposed to be part of a cacheline, that still seems a lot easier (perhaps naively) than finally committing a (small) stack of values at potentially unrelated locations in memory.
> Of course I like the fact that you need only a single instruction for the locking, but I don't think it would be useful beyond doubling the address range,
I probably misunderstand you. You don't double the address range, you define a range of addresses. E.g., if rs1 = 0xF000 and rs2 = 4, then "lr.w rd, (rs1), rs2" is supposed to _read_ the 4 bytes at 0xF000...0xF003 and _lock_ the 4 + sizeof(w) = 8 addresses 0xF000, 0xF001, ..., 0xF006, 0xF007.
> which makes it a big weapon for a small target. Also, you will still need the additional loads and stores to get the values across :( The other downside is that you'd need to communicate a second address over the bus to lock the range at once.
The register rs2 is supposed to be not more than a few bits' worth of information (and in the common case would be rs2 == zero). In fact, using the 5 bits of the rs2 field as an immediate to indicate the number of additional words/doublewords (i.e., the size of the range minus 1) would seem enough (famous last words), but it would break the format of the instructions.
-- Jacob
Jonas Oberhauser wrote:
I think that RVT is what you really want; the keyword to me was "non-committing SC". That sounds like asking for transactional memory to me. This might be an interesting possibility for building RVT on RVA, however, by using the existing LR opcode to begin a transaction and the existing SC opcode to attempt to commit a transaction.
Could RVT redefine RVA LR/SC as a transaction containing a single word?
-- Jacob
[snip]
> I probably misunderstand you. You don't double the address range, you define a range of addresses. E.g., if rs1 = 0xF000 and rs2 = 4, then "lr.w rd, (rs1), rs2" is supposed to _read_ the 4 bytes at 0xF000...0xF003 and _lock_ the 4 + sizeof(w) = 8 addresses 0xF000, 0xF001, ..., 0xF006, 0xF007.
In my mind, you always access an address range, whether that range includes only one byte, two bytes, four bytes, or eight bytes :) You always access the range {base address, ..., base address + (access width - 1)}. So when instead of four bytes you access eight, in my mind, you doubled the address range :) What I meant with "don't think it would be useful beyond doubling the address range" is that I doubt there will be use cases that do more with it than double the address range (but of course I can't yet tell for sure). I hope we understand each other now?
> The other downside is that you'd need to communicate a second address over the bus to lock the range at once.
> The register rs2 is supposed to be not more than a few bits' worth of information (and in the common case would be rs2 == zero). In fact, using the 5 bits of the rs2 field as an immediate to indicate the number of additional words/doublewords (i.e., the size of the range minus 1) would seem enough (famous last words), but it would break the format of the instructions.
Yeah, you're right. It's not nearly as bad as I thought. In case you're only interested in taking a cache line, why not (for now at least) say rs2 != x0 means "whole cache line"?
In the other direction: what happens if the programmer starts an LR/SC pair, but gives up before the SC (probably conditional on the value loaded by the LR), and before the reservation clears, another piece of code starts a nested LR/LR/SC/SC sequence? The last SC, which should be committing (and thus return the result), has just become a non-committing SC.
To avoid confusion, we do need a distinction between starting the reservation (LR) and adding to it (LRA), and between a non-committing store (SCB) and a committing store (SC).
Jonas Oberhauser wrote:
2017-12-05 5:31 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
Jonas Oberhauser wrote:
> I think that RVT is what you really want; the keyword to me was "non-committing SC". That sounds like asking for transactional memory to me. This might be an interesting possibility for building RVT on RVA, however, by using the existing LR opcode to begin a transaction and the existing SC opcode to attempt to commit a transaction.
I don't really want RVT (which I assume means transactional RISC-V?); I want an efficient and small tool for generating small atomic operations.
RVT is transactional memory for RISC-V. As I see it, as soon as you can have an SC that "succeeds but does not commit", you are actually talking about transactional memory.
Before I continue, I have to mention that I don't know whether there are any features that a transactional memory has to provide to be useful in practice.
Having said that, I assume a full transactional memory would be somewhat overkill for this, and I don't think that the people who really want transactional memory would be happy with this little tool. On the other hand I'm not sure if a transactional memory would want to maintain the reservations for long.
A two-phase approach seems ideal: reservations are "hard" for a small number of cycles, then become "soft" until either the transaction completes, a trap is taken, or another processor in the system wants to access that address, at which time the reservation is released entirely. Another option (for RVN systems) is to add a "transaction aborted" interrupt.
It may still -- as you suggested -- be possible to implement a full transactional memory on top of such an LR and SC if the stacking depth provided by the hardware implementation is big and the hardware reserves memory for something more than 16 integer instructions (which it may. It might need to allow some loads and stores to other memory regions). It hadn't come to my mind before. If RVT only has to implement the operations from the Herlihy paper (http://cs.brown.edu/~mph/HerlihyM93/herlihy93transactional.pdf), I can imagine that it is possible to put all LT/LTX to the front as LRs and all STs to the end as SCs (or for an RVT-certified implementation to provide additional guarantees about LRs/SCs, so that they don't have to be in a chain). Addresses for which you do not want to actually modify the value can use SC with the original value (or x0 or something) and could theoretically be optimized away in hardware. VALIDATE can simply check whether the locks are still held. COMMIT is the last SC.
Thanks for the paper; it has been enlightening. There is sufficient space in the AMO opcode to easily fit the six instructions from that paper.
I like the idea of reusing the SC opcode for ST, with the exact function determined by whether a transaction or simple memory reservation was active. An LR introduces an LR/SC sequence, while an LT or LTX introduces a transaction.
Do note that my suggestion either way is a natural extension of the current spec, since 1) an ISA version that provides no LR/SC is just the special case of my suggestion where the maximum stack depth is zero,
2) an ISA version that provides "normal" LR/SC is just the special case of my suggestion where the maximum stack depth is one,
and for most purposes it would be enough to provide a maximum stack depth of two.
The problem with a stack depth greater than one is ambiguity in the case of interrupts and context switching. One possible solution, if we can limit the LR/SC stack depth to two and require more complex applications to use RVT instead, is to require that the two LRs be both statically and dynamically contiguous. This would mean that an LR immediately after a trap return must silently fail. In this case, an interrupt between the LR instructions causes the second LR to never acquire a reservation, thus both SCs will fail. The SCs would also need to be together, so the hardware can recognize the double SC and internally queue the first SC before committing both at the second SC. Otherwise, the first SC encountered succeeds and all reservations are released.
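Under those rules, a depth-2 DW-CAS would have to be shaped roughly as follows (a sketch only; the register conventions are made up for illustration):
dw_cas2:
        lr.w  t0, (a0)       # both LRs contiguous: an interrupt between them
        lr.w  t1, (a1)       # makes the second LR silently take no reservation
        li    t2, 1
        bne   t0, a2, done
        bne   t1, a3, done
        sc.w  t2, a4, (a0)   # both SCs contiguous: the first is queued
        sc.w  t2, a5, (a1)   # internally, the second commits the pair
done:
        mv    a0, t2
        jr    ra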
> Could RVT redefine RVA LR/SC as a transaction containing a single word?
I guess so. It might be the special case where the addresses are the same. But is the transactional memory willing to provide the same reservation promises that RVA gives to LR/SC pairs? I.e., that the only way for the SC to fail (if it is close enough to the LR) is for an interrupt to have happened?
Reading the Herlihy paper, transactional memory is a generalization of LR/SC (then called LL/SC), so RVT will need to provide a similar guarantee of forward progress in any case. I have since backed away from that idea; there is plenty of available encoding space in the AMO major opcode.
Actually, this reasoning generalizes: for any maximum stack depth, confusion is only possible if execution resumes in the middle of an LR/.../LR sequence. This can be caught with an internal "reservations forbidden" flag that is set by xRET and cleared by any instruction *except* LR. LR silently fails to acquire a reservation if the "reservations forbidden" flag is set, but must still read memory. Skipping memory reads is *not* safe -- a program may attempt to LR a pointer and follow that link through a structure; this will cause SIGSEGV if LR produces a bogus value. As there are no reservations, *all* following SCs must fail. All reservations are released (and the LR/SC stack cleared) when a trap is taken.
Jonas Oberhauser wrote:
2017-12-06 5:12 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
Jonas Oberhauser wrote:
2017-12-05 5:31 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
At least about a non-trivial subset of it. If "transactional memory" is a superset of the Herlihy paper, then I am not yet talking about transactional memory :), and the people who want a transactional memory will not like the stack nature of my suggestion and in particular will want stronger guarantees.
This is another reason to keep the maximum LR/SC stack depth very small -- while the hardware resources for nested LR/SC and full RVT can probably be shared to some extent, a large LR/SC stack is about as complex as RVT, but lacks the "safety net" that VALIDATE provides. As a building block for intermediate-complexity atomic primitives (like multi-word CAS) nested LR/SC could be useful, but as "almost transactional memory access", nested LR/SC is a landmine for software developers to stumble upon, creating nasty race-condition bugs. Abusing nested LR/SC will probably work just enough to get past testing (if this sort of developer even does testing at all) but blow up in production.
Could nested LR/SC be part of RVT, along with the full transactional instructions? The RVT hardware support would be easily able to support nested LR/SC.
Is the hardware required for nested LR/SC any simpler than that required for full transactional memory support?
> A two-phase approach seems ideal: reservations are "hard" for a small number of cycles, then become "soft" until either the transaction completes, a trap is taken, or another processor in the system wants to access that address, at which time the reservation is released entirely.
That makes sense.
In the case of my suggestion, it would be necessary to turn a soft reservation into a hard one again in case of a successful non-committing SC, so that you can be sure that on the committing SC, the previous reservations actually still hold and your operation can be completed.
Why? A non-committing SC effectively *is* ST; if one ST fails, all subsequent STs will fail and nothing commits until all STs have succeeded. I am currently considering (obvious in the context of RISC-V as I see it) one "store-transactional-and-validate" instruction and one "store-transactional-and-commit" instruction -- every transactional store returns a transaction state value (zero == transaction valid or successfully committed; non-zero == error) just like SC does. (Of course they do -- I plan to propose using the SC opcode for both forms of ST!)
Like I said I am not sure if those six operations are enough to make transactional memory useful in practice. I'd wait for input of someone who actually uses transactional memory :) (at least in a model).
The authors of that paper used transactional memory in a model. :-)
> I like the idea of reusing the SC opcode for ST, with the exact function determined by whether a transaction or simple memory reservation was active. An LR introduces an LR/SC sequence, while an LT or LTX introduces a transaction.
That sounds neat. I think it may not even be necessary to keep track of the "mode" between interrupts, because both of them have the same semantics after an interrupt (fail).
Exactly, taking a trap invalidates all reservations, including transactions. (Taking a trap *must* erase the internal atomic operation state -- the trap handler may need to use RVA.)
The problem is that hardware does not see software threads as distinct. Thread A: LR/LR/LR/(preempted). Thread B: LR/LR/LR/SC/SC/SC. The 3rd SC in thread B does *not* commit because there are still three LRs on the stack and reservations are invalid. Eventually, thread A resumes: LR/SC/SC/SC/SC. The fourth SC *finally* drains the LR/SC stack and resets the state, meanwhile, all uses of LR/SC either fail or (worse!) appear to succeed but do not actually commit. And what happens if thread A is migrated and resumed on a different hart? Resetting the whole LR/SC stack on a trap only moves the problem: now, when thread A resumes, the *first* SC commits and the rest fail ... oops!
That is not compatible with the current definition of LR. Consider the following instruction sequence: an ECALL immediately followed by a LR. Under the current rules, this will work as expected. Under your proposed "LR fails immediately after xRET" rule, this will never work, and a perfectly valid LR...SC loop can now loop forever, just because the compiler happened to move an ECALL to immediately before it.
That needs saving more state than the very restricted state RISC-V currently saves on a trap. That would be a hard sell.
That's not compatible with current uses of LR/SC. Even the ISA spec itself, in its CAS sample code, abandons the LR without doing a corresponding SC in case the compare fails; I expect most CAS emulation code in the wild to do the same.
So that is the difference between LL/SC and LR/SC? I thought it was just a different name.
The problem is the reverse one. Imagine every SC succeeds, but by the time you do the last SC, an earlier reservation has been given up and someone else wrote to that address. You would have to validate again on the last SC.
But you cannot commit anything until the last SC has succeeded, so you already must validate on each SC. This would not preclude an implementation from going back to hard reservations on the first SC in order to improve the chance that a nested LR/SC succeeds. [...]
With a small maximum stack depth, the existing RVA forward progress guarantee can apply. The first SC in the set could function as another LR (renewing the reservations and "resetting the clock" so to speak), provided that the reservations have all remained valid.
Ah but there is a difference between "I created a model and prove how great it is with these three select examples that work great" and "these other guys created a model that is really great and I have built some real stuff in it" (which can turn into "these other guys created a model which is cool but to implement real stuff in it, I also needed xyz, which I added") :)
That sounds like I need to write up an RVT proposal so someone can try that with RISC-V. :-)
I don't know what RVA stands for.
RVA is the atomic memory access extension "A". The trap handler may itself need atomic memory operations.
There is an easier solution: add one LRM ("Load Reserved Multiple") opcode that is used for subsequent loads. There is space in the current LR opcode. The overall sequence is LR/LRM/LRM/LRM/.../SC/SC/SC/SC. LR always begins a sequence, while LRM only acquires a reservation if the previous instruction executed was either LR or LRM. A trap handler returning into the middle of a sequence causes the sequence to fail, while a return-from-trap to the beginning of the sequence can succeed on the first try. All reservations are canceled by executing LR to start a new sequence, by taking any trap, or by executing LRM that does not acquire a reservation.
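As an illustration, an atomic two-word update under these semantics might look like the following sketch (lrm.w is the hypothetical mnemonic; the increment/decrement stands in for an arbitrary short operation):
retry:
        lr.w   t0, (a0)      # begins the sequence, takes the first reservation
        lrm.w  t1, (a1)      # acquires a reservation only because the
                             # previous instruction was LR
        addi   t0, t0, 1     # the operation itself
        addi   t1, t1, -1
        sc.w   t2, t0, (a0)  # all SCs fail if any reservation was lost
        sc.w   t2, t1, (a1)  # the last SC commits both stores
        bnez   t2, retry     # a failed LRM surfaces here as failed SCs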
Also, how many reservations should be allowed? The number needs to be more than 1, obviously, but also kept small to make this a useful option to implementors.
There should be clear distinct purposes and distinct implementation costs between base RVA, RVA with [nested LR], and RVT, with RVT most powerful but most expensive to implement. The cost of RVA with [nested LR] should be much closer to base RVA than RVT.
There have been many interesting ideas here.
* commit/begin with non-subsequent stores/loads
I think this is too much to ask for in hardware and should go to RVT. In this case you need to recheck your reservations on the commit, and possibly other things.
* commit/begin with only subsequent stores/loads
I assume the reason to use LD and not LR is to maintain consistency with existing code? I think begin;LDs;op;SCs;commit and LR;LRBs;op;SCBs;SC are in essence the same.
In case of begin, you know whether you are at the very beginning of the sequence (and thus no fail after xRET) because you are before begin; if you xRET at some point after begin, the LDs will not take reservations and the SCs will fail. In case of LRB, you know whether you are at the very beginning of the sequence (and thus no fail after xRET) because you are before an LR; if you xRET at some point after LR, the LRBs will fail (this needs a status flag, but doesn't depend on the context) and the SCs will fail. The upside of LRB of course is that you don't have to do any loads with the status flag, and as a small bonus you save two local instructions.
* LRM, using a set of reservations and not forcing pops
I love this idea. In fact, I believe you can do everything we need with exactly that at a very low cost. All you need is a way to distinguish, as you suggest, after xRET between the first LR in a sequence and an LR in the middle. This can be done with LR;LRMs.
Jonas Oberhauser wrote:
There is also the problem that more privileged interrupts are supposed to always be enabled.
> * LRM, using a set of reservations and not forcing pops
> I love this idea. In fact, I believe you can do everything we need with exactly that at a very low cost. All you need is a way to distinguish, as you suggest, after xRET between the first LR in a sequence and an LR in the middle. This can be done with LR;LRMs. I'm wondering if there is another simple way to make this distinction or to circumvent the problem altogether. HW interrupts could be masked, but you could still have page faults in the middle of the sequence and (benign) interrupts right before the sequence.
Hardware can, at least for traps taken *during* the LR/LRM group, do this invisibly to software by effectively speculating all LRM instructions and rolling back to the LR if any traps are taken. This is done by simply storing the address of the LR (er, "previous non-LRM instruction") in ?epc if a trap is taken while executing LRM. This is invisible to normal software -- the interrupt simply appears to have arrived before the sequence began. Hardware that does this *must* document it, however -- the supervisor (and debugger) will see any page faults caused by LRM as having occurred at the previous LR.
Making an interrupt at the beginning of the sequence automatically go back before the first LR in the sequence seems to be complicated as well, but might not be so bad in HW -- you anyways need to store possible epcs somewhere in the pipeline, and you could store as epcs the pcs before the first LR...
> A failed LRM must still load from memory -- the program might use that value as a pointer.
I don't know how implementations currently handle reservations, but I could imagine the following implementation for the whole thing.
In each cache line remember reservation and dirty (possibly implicitly in the cache state, so reservation+M/O could mean dirty in MOESI). For the whole cache remember in a single bit hard/soft.
Hard reservation stalls write and LR/LRM accesses to the same cacheline, hard dirty reservation stalls normal read accesses to the same cacheline (by stall I mean "try again later" without keeping the bus busy).
Soft reservations are destroyed on another access and send an "invalidate all reservations" message.
An LR and a valid*, non-stalled LRM is just a normal load + takes a hard reservation.
An invalid or stalled LRM is a NOP and invalidates all reservations.
When reservations are freed on successful commit, the cachelines must remain dirty -- they now need to be written back.
A successful SC:
1) to a non-dirty cacheline, flushes the current cacheline to memory, unless this is the last reserved non-dirty cacheline;
2) changes the value of the cacheline and makes it dirty, invalidating all copies of the cacheline in other caches in the process;
3) if all reserved cachelines are now dirty, all reservations are freed.
When reservations are invalidated, dirty cachelines are invalidated and reservations are freed.
When reservations are freed, reservation bits and dirty bits go away.
Why? Two LRMs to the same cacheline could simply take two reservations on the same cacheline.
When a reserved cacheline is flushed except due to a successful SC, we have to "invalidate all reservations". In particular if the cacheline was dirty, the value is discarded, not stored back to memory.
The upside of this is that it is very economical, and that an implementation offering only normal LR/SC does not need a dirty bit (if any implementation even does...) because there is at most one reserved cacheline, and once this cacheline becomes dirty, it is freed immediately.
The "downside" of this is that reservations can not be held in case of cache collisions. This would need to be reflected in the ISA somehow. "valid LRMs can only fail to acquire a reservation in case of a cache collision with an existing reservation, in which case all reservations are invalidated.".
A small increase in state information is required -- per-byte reserved/dirty bits. The cacheline is dirty when all reserved bytes in the cacheline are dirty. The maximum number of reservations must be small enough that whatever cache is used *appears* to be fully-associative. In practice, that requirement for per-byte reserved/dirty bits will probably mean that the "multi-LR" cache will be a small fully associative cache with only as many lines as the guaranteed maximum number of reservations.
The rule I propose is simple: LRM must fail (and invalidate all reservations) unless the previous instruction executed was either LR or LRM. Executing LR or taking a trap invalidates all reservations in any case. Upon return-from-interrupt, the previous instruction executed is xRET, so LRM will fail.
*by valid I mean "not after an interrupt"
2017-12-08 6:56 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
In RISC-V, there is more of a preference for putting information like that in the platform configuration. An implementation that only supports depth 1 should simply not implement LRM, leaving it to raise illegal instruction. If LRM is implemented, there should be some reasonable minimum that portable software can assume. We have 15 caller-saved registers in the ABI; that is one short of the 16 needed for a fully general quadword compare-and-swap. Compare-and-swap of four words is also the longest sequence that meets the base RVA constraints for guaranteed eventual success. For these reasons, I propose that implementations supporting LRM be required to support at least four reservations.
-- Jacob
This seems like an unneeded constraint on software, and very different behavior from normal loads, plain LR, and LT/LTX (from the RVT proposal I am planning).
> Making an interrupt at the beginning of the sequence automatically go back before the first LR in the sequence seems to be complicated as well, but might not be so bad in HW -- you anyways need to store possible epcs somewhere in the pipeline, and you could store as epcs the pcs before the first LR...
> Hardware can, at least for traps taken *during* the LR/LRM group, do this invisibly to software by effectively speculating all LRM instructions and rolling back to the LR if any traps are taken. This is done by simply storing the address of the LR (er, "previous non-LRM instruction") in ?epc if a trap is taken while executing LRM. This is invisible to normal software -- the interrupt simply appears to have arrived before the sequence began. Hardware that does this *must* document it, however -- the supervisor (and debugger) will see any page faults caused by LRM as having occurred at the previous LR.
> When a reserved cacheline is flushed except due to a successful SC, we have to "invalidate all reservations". In particular, if the cacheline was dirty, the value is discarded, not stored back to memory.
> The upside of this is that it is very economical, and that an implementation offering only normal LR/SC does not need a dirty bit (if any implementation even does...) because there is at most one reserved cacheline, and once this cacheline becomes dirty, it is freed immediately.
The "downside" of this is that reservations can not be held in
case of cache collisions. This would need to be reflected in
the ISA somehow. "valid LRMs can only fail to acquire a
reservation in case of a cache collision with an existing
reservation, in which case all reservations are invalidated.".
> Why? Two LRMs to the same cacheline could simply take two reservations on the same cacheline.
The problem is not two LRMs to the same cacheline, but an LRM to another cacheline that collides with the first one; the two cannot be kept in the cache together.
So a set-associative cache, where two reservations would map to the same physical cacheline?
Could this be handled by requiring the cache to have at least 4 ways if 4 reservations are permitted? In other words, LRM requires that the cache can store any 4 cachelines.
> A small increase in state information is required -- per-byte reserved/dirty bits. The cacheline is dirty when all reserved bytes in the cacheline are dirty. The maximum number of reservations must be small enough that whatever cache is used *appears* to be fully-associative. In practice, that requirement for per-byte reserved/dirty bits will probably mean that the "multi-LR" cache will be a small fully associative cache with only as many lines as the guaranteed maximum number of reservations.
That is also possible, but I like to use existing HW :(
You are right; there are easier implementations for LRM. Since only four reservations will be guaranteed, a 4-way set associative cache is sufficient. RVT as I plan to propose will permit unlimited use of a cacheline once it is in the transaction's memory set -- a second LT from the same cacheline does not consume an additional memory set slot, nor does a second LTX or ST to a cacheline previously loaded for write consume an additional slot. But RVT does not need to track when loaded-for-write cachelines have been written -- commit in RVT will be explicit.
LRM does need to track this state information, but really only needs a small buffer that stores the addresses loaded and "checks them off" as the SCs are executed. Per-byte reserved/dirty bits in the cache are not needed after all.
Jonas Oberhauser wrote:
Do we want to require that all values loaded by an LR/LRM/SC sequence are dead afterwards?
I'm not sure I understand what you mean by that.
Simply that a program must not rely on any values loaded from LR/LRM if SC fails.
After a successful LR/LRM/SC sequence, the values are fine and can be used normally.
In fact, already after a successful LR/LRM sequence the values are fine. So if you implement the DCAS by
"LR;LRM; retry in case of LRM failure; if values match compare data then SC;SC"
you know that the loaded values at the end of this block are always correct in the sense that they appear to have been read atomically from coherent memory (i.e., even in case of a failed SC or comparison, this code behaves like an atomic double-word load). (to be able to check that the LRM failed, we would require for LRMs r2!=x0 and put a non-zero value into r2 in case of a failure).
This is a non-starter because it would require two write ports on the register file. There is no way to both read a value and report success at LRM in RISC-V.
It is only in case the LR/LRM sequence fails that the LRMs in the sequence might load garbage. So if you wrongly implemented the DCAS as "LR;LRM; if values match compare data then SC;SC",
the cas does not appear to be an atomic double-wide load in case the LRM fails. Note that the SCs will fail if the LRM failed even if the garbage values match the compare data, so that is not the problem -- the problem is if you want to continue using the load results.
Why is that wrong for DCAS? A problem is seen only if LRM returns garbage instead of reading memory without acquiring a reservation. If LRM returns garbage, then DCAS will almost certainly fail ("wrong values") instead of retrying ("serialization failure").
The reason for this is simple: you cannot get a reservation on that address, so the LRM has to either fail or be stalled.
But if you stall an LRM while you have hard reservations, you can get into a HW deadlock where CPU1 is trying to execute an LRM to a hard reservation of CPU2 and CPU2 is trying to execute an LRM to a hard reservation of CPU1, e.g., by running "LR x; LRM y" and "LR y; LRM x" in parallel.
Either one CPU has priority or both must stall for the hard reservation timeout, then continue.
If both stall, CPU1 now has "y" reserved and CPU2 has "x" reserved, but both have lost their earlier reservations and will fail at SC. The simple (Ethernet collision recovery) strategy is to stall for a random small number of cycles upon an attempt to reacquire a reservation that has previously timed out. On the first attempt, both CPUs fail at SC, they both jump back to retry, they stall for different amounts of time at the first LR, and one (I will say CPU1) clearly acquires the reservations and reaches the SCs before the other. CPU2 then fails at SC and retries, succeeding on the third attempt, or stalls at LR (observing CPU1's reservations) and waits for CPU1 to complete or the reservations to again expire.
However, until one CPU has reached the SC sequence, no reservations are dirty. The guarantee of forward progress requires that all SCs be contiguous, so renewing a hard reservation just long enough to complete four SC operations avoids this with bounded delay.
The first CPU to execute SC "wins" and all reservations held by other CPUs on the addresses written are broken.
If two CPUs enter the SC group at the same time, either one has priority, or both fail and delay retry by different amounts.
If you fail the LRM and the reservation is dirty, the LRM cannot really return a "real" value. It either has to return the dirty value (which might never be written to memory and might make the remote LR/LRM/SC sequence seem non-atomic), take a stale value from the main memory, or make up a value. In each case LRM potentially returns garbage (which I think would be OK; I'd prefer if it always returned the same garbage, though, such as all zeros or just the current register content).
This cannot happen -- if the reservation is dirty, then the remote LR/LRM/SC has completed and committed.
The "stale" value must be read until that time, since the atomic writes have not yet occurred.
LR/LRM simply expresses interest in an address; there is no requirement that SC ever be executed. A "failed" LRM reads the same value that a successful LRM reads, the only difference is that a failed LRM was executed without the LR to begin the sequence, so no reservation is acquired and SC must fail, causing the sequence to be retried.
I assume that in case of a failed LRM the loaded values will only be used by LRMs which fail, SCs which fail, and local operations which do not leak the garbage values, and that the computation and its values are either abandoned or the whole thing is restarted.
Note that you can architecturally provide the guarantee that the LR will always succeed (and I guess that is currently the case).
LRM differs from LR only in that LRM must occur immediately after LR or LRM, to ensure that the LR/LRM/SC sequence is retried if a trap return lands at LRM.
> This seems like an unneeded constraint on software, and very different behavior from normal loads, plain LR, and LT/LTX (from the RVT proposal I am planning).
One alternative is to turn the hard reservations of CPUi into a soft one while CPUi issues an LRM to a hard reservation of another CPU (you can go back to a hard reservation when the LRM eventually succeeds and the older reservations were not lost). In this case LRMs can always provide real memory values, but the LR/LRM sequence is not guaranteed to be atomic (since the soft reservations could be lost before the LRM returns its real value) -- still some way to detect whether the LR/LRM sequence was atomic or not would be needed.
If the LR/LRM sequence was not atomic, then the subsequent SC sequence must fail.
If CPUi issues an LRM to a hard reservation of CPUj, CPUi must wait until the reservation either expires or is released. If the reservation is released due to SC succeeding, CPUi may succeed in its sequence, or may fail if CPUj has updated an address where CPUi held a reservation.
We might need a rule that LR/LRM instructions must be independent of each other in a sequence.
Another alternative is to "restart" the LR/LRM sequence in case of a failure *during* the LRM sequence, e.g., by jumping back before the LR.
I wrote before:
> Making an interrupt at the beginning of the sequence automatically go back before the first LR in the sequence seems to be complicated as well, but might not be so bad in HW -- you anyways need to store possible epcs somewhere in the pipeline, and you could store as epcs the pcs before the first LR...
To this you responded:
> Hardware can, at least for traps taken *during* the LR/LRM group, do this invisibly to software by effectively speculating all LRM instructions and rolling back to the LR if any traps are taken. This is done by simply storing the address of the LR (er, "previous non-LRM instruction") in ?epc if a trap is taken while executing LRM. This is invisible to normal software -- the interrupt simply appears to have arrived before the sequence began. Hardware that does this *must* document it, however -- the supervisor (and debugger) will see any page faults caused by LRM as having occurred at the previous LR.
In case we do something like this, the alternative with the reset sounds very charming.
Note that at this point it is no longer necessary to distinguish in the opcode LR and LRM: after an interrupt, you can never be in the middle of the LR/LRM sequence, only at the beginning, and LRMs can only fail because they access a reserved cacheline, not due to interrupts, not due to previous failed LRMs.
The problem is that this requires some degree of speculative execution. While it could be a significant performance improvement, implementations cannot be required to do it.
Worse, this is only possible with speculative execution -- the register writes from earlier LR/LRM must be undone if the page fault is moved back to the LR, since the program could reuse the same register if atomically chasing a pointer chain and replacing it. Or should portable programs be required to ensure that all addresses used in LR/LRM loads are preserved, making the LR/LRM group idempotent?
I think we have this requirement already -- the SC group must write exactly to every address that was read in LR/LRM and clobbering the addresses makes this impossible within the guaranteed forward progress constraints.
> We might need a rule that LR/LRM instructions must be independent of each other in a sequence.
No. This is neither necessary nor practical. Loading a pointer and dereferencing it atomically should be fine. For example, an LR-optimized MCS lock might look like this:
struct MCS_LOCK { MCS_LOCK *owner; };
MCS_LOCK x;
/* x is free iff x.owner == &x */
bool LOCK (MCS_LOCK *lp, MCS_LOCK *me) {
    l = dereference lp       // LR
    o = dereference l.owner  // LRM
    l.owner = me             // SC
    release lp               // SC
}
> I think we have this requirement already -- the SC group must write exactly to every address that was read in LR/LRM and clobbering the addresses makes this impossible within the guaranteed forward progress constraints.
I don't think I quite understand. Why does this make it impossible? I also don't know what "clobbering" means in this context; is that a formal term?
I'll still use the distinction LR-opcode for "an instruction with the LR opcode", LRM for "LR-opcode directly after LR-opcode", and LR for "LR-opcode but not LRM".
The semantics could be something along the following lines:
1) Keep in a status register a "restart code address", which is either a program counter or an invalid value.
2) An LR sets the restart code address to the pc.
3) Any instruction which is not an LR-opcode invalidates the restart code address.
4) An interrupt where the restart code address is not an invalid value sets the epc to the restart code address.
5) A failed LRM sets the pc to the restart code address.
Using a distinct LRM opcode makes me feel much better -- the above should work (and the "restart code address" register can be invisible) to prevent a trap return or preemptive task switch from landing in the middle of an LR/LR/SC/SC sequence, but it will not catch other incorrect transfers of control. A distinct LRM opcode ensures that any invalid landing will be caught instead of attempting a partial "atomic" sequence. I do not see any direct exploits here, but I prefer being safe over living on the edge.
In this case you never have to worry about garbage values after an LRM, because you never get beyond a failed LRM. This makes software simpler and reduces bugs where you forget to check validity and use the values for something else, like the bad DCAS above.
Simple hardware can immediately fail an LRM that tries to access a reserved cacheline; more complex hardware can first turn local reservations into soft reservations and stall the LRM, and only fail the LRM in case reservations are lost before the LRM manages to acquire the reservation itself and local reservations are turned back into hard reservations. Both approaches implement this spec and guarantee that the LR/LRMs sequence appears to be atomic.
I see this as premature optimization: permitting an LRM that cannot lead to a successful SC to avoid accessing memory at all. I believe that even an LRM that "fails" should still access memory -- if nothing else, it will "pull" the requested data nearer to "this" hart in the cache hierarchy, and some dynamic priority schemes could use how many caches need to be traversed to break ties.
It's not a perfect approach yet because you can starve if the LR/LRMs of two threads are not symmetrical, such as "LR x, LRM y" and "while (1) { LR y }" -- the latter LR might always be scheduled right between LR x and LRM y, thus starving the other thread. In case of symmetry there is no problem because the first one to get a reservation will always get the remaining reservations, and the existing mechanism to prevent starvation for LRs saves the day.
Both succeed loading "y" and both now have reservations on "y". The first one to start executing SC "wins" and breaks the other reservation. The latter LR, lacking an SC, is useless and conflicts with nothing. Alternately, if reservations cause other LRs to stall to reduce retries and guarantee forward progress, a round-robin dynamic priority scheme solves this -- on one cycle, the LRM breaks the other reservation and the first thread completes, on the next, the LRM stalls. Another option for this case, since the LR is constantly taking the same reservation, is to yield if an LR is executed to an address that the same hart already has reserved.
If instead of handling the restart in HW we handle it in SW, we can eventually give up -- e.g., we can have something like "while tasks_available > 0 do { try to take reservations for task queue using LR;LRM; if LRM succeeded, pop a task from the queue; if it didn't, retry the loop }"
which eventually stops trying to reserve the cachelines if the reason to take the reservations disappears.
These are good issues for RVT, but are beyond the scope of a simple multi-word LR/SC. That said, the approach I am leaning towards for RVT is "longest-running transaction wins" with the intent that this should generally push transactions into a FIFO-like order. If you find yourself asking "did LRM succeed?", that means you need the T.VALIDATE instruction from RVT. LRM does not give that information. You know that LRM succeeded when the subsequent SC group succeeds.
If this is useful in practice, I suggest instead of the above:
An LR is an LR-opcode with r2=x0, and an LRM is an LR-opcode with r2!=x0.
The r2 is used for the restart code address, so
1') a failed LRM jumps to gpr(r2) (and loads nothing)
2') an interrupt during an LRM sets the epc to gpr(r2)
and
3') an LR clears the reservations and is guaranteed to succeed
other than that, LRs and LRMs can now be the same -- no need to keep a status flag for LR/LRM which is cleared on interrupts. To avoid the ambiguity from before, just point your r2 in front of the LR/LRM sequence.
This is not much harder to program against, does not have the starvation problem, does not have the "garbage by accident" problem, does not need an extra restart code address register, but "costs" one gpr register in which we put the restart code address.
And it seems very easy to implement.
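As a sketch of what programming against this encoding might look like (hypothetical; using the three-operand notation from the range example above for an LR-opcode with rs2 != x0):
        la    t3, retry      # restart code address, kept in a register
retry:
        lr.w  t0, (a0)       # rs2 = x0: plain LR, clears old reservations
        lr.w  t1, (a1), t3   # rs2 = t3: LRM; on failure, pc <- t3
        li    t2, 1
        bne   t0, a2, done
        bne   t1, a3, done
        sc.w  t2, a4, (a0)
        sc.w  t2, a5, (a1)
done:
        mv    a0, t2
        jr    ra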
I have three problems with this: (1) it introduces a register-indirect conditional branch (which RISC-V currently does not have), (2) it uses the entire space in the LR opcode, where I was hoping :-( to put more instructions (RVT LT/LTX/ABORT/VALIDATE/COMMIT) by using rs2 as an extended function code field, and (3) (fatal, in my view) it introduces an abusable JR-alike ROP gadget, since an improper control transfer to an LRM would cause a jump to the address in whatever register that LRM instruction selects in rs2.
Jonas Oberhauser wrote:
2017-12-11 6:21 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
Jonas Oberhauser wrote:
> It is only in case the LR/LRM sequence fails that the LRMs in the sequence might load garbage. So if you wrongly implemented the DCAS as "LR;LRM; if values match compare data then SC;SC", the cas does not appear to be an atomic double-wide load in case the LRM fails. Note that the SCs will fail if the LRM failed even if the garbage values match the compare data, so that is not the problem -- the problem is if you want to continue using the load results.
> Why is that wrong for DCAS? A problem is seen only if LRM returns garbage instead of reading memory without acquiring a reservation. If LRM returns garbage, then DCAS will almost certainly fail ("wrong values") instead of retrying ("serialization failure").
In a *normal* DCAS, it is totally necessary to have the old values, since they may be used in the next iteration of the loop
a, b = R x, y;
do {
    c, d = op(a, b);
    a, b, fail = DCAS x (a -> c) y (b -> d);
} while (fail);
Note the lack of a reload in the loop.
There are two failure cases: (1) DCAS reads memory and finds that the values in memory do not match the expected values, and (2) the LR/LRM/SC/SC that implements DCAS does not execute atomically. In case (2), the correct response is to retry the LR/LRM/SC/SC sequence. In case (1), DCAS itself was successfully executed, but failed. The loop given in the example is an infinite loop if locations x and y do not eventually return to the values initially read.
Or are you saying that DCAS returns the values actually read, thus updating a and b?
In this latter case, LR/LRM must never return garbage: the value read by LR/LRM will be used in the calculation if DCAS fails.
This is a point that needs to be clarified: LR/LRM is *not* LT. There is no instruction that confirms that an LR/LRM sequence has read a consistent set of values. The only guarantee is that all SCs will fail if any value read by LR/LRM was changed by another hart after being read.
With LR/LRM, such a loop does not need the values, because the operation is written between the LR/LRMs and the SCs, not "before" them.
But I do not know if there are other examples where even with LR/LRM one would like to continue using the values.
This will vary -- the operation must meet the RVA forward progress guarantee constraints if it appears in the LR/LRM/.../SC/SC sequence. Some operations may need to use non-atomic reads followed by a multi-word CAS to perform an atomic update.
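That last pattern might look like the following sketch, reusing the hypothetical dcas subroutine from above (s0/s1 holding the addresses and the computation are illustrative). Since this dcas does not return the values it read, the loop takes a fresh non-atomic snapshot on each iteration:
retry:
    lw      a2, 0(s0)       # non-atomic reads of x and y
    lw      a3, 0(s1)
    addi    a4, a2, 1       # compute new values from the snapshot
    addi    a5, a3, -1      # (could be arbitrarily long-running code)
    mv      a0, s0
    mv      a1, s1
    jal     dcas            # commit only if x and y still hold a2 and a3
    bnez    a0, retry       # mismatch: take a fresh snapshot and recompute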
Either one CPU has priority or both must stall for the hard
reservation timeout, then continue.
I thought that the reservation clock is only ticking during CPU cycles, of which there are none during the stall. That is why I suggested flipping the reservation state temporarily.
The reservation clock ticks on memory bus cycles, or some other clock that continues to run unless the entire system is halted. Otherwise, suspending a hart that happens to hold a reservation could cause deadlock.
If both stall, CPU1 now has "y" reserved and CPU2 has "x"
reserved, but both have lost their earlier reservations and will
fail at SC. The simple (Ethernet collision recovery) strategy is
to stall for a random small number of cycles upon an attempt to
reacquire a reservation that has previously timed out. On the
first attempt, both CPUs fail at SC, they both jump back to retry,
they stall for different amounts of time at the first LR, and one
(I will say CPU1) clearly acquires the reservations and reaches
the SCs before the other. CPU2 then fails at SC and retries,
succeeding on the third attempt, or stalls at LR (observing CPU1's
reservations) and waits for CPU1 to complete or the reservations
to again expire.
Yes, but I'd rather not have to put that into software. I'd like to maintain a forward guarantee for one of the threads.
The only thing put into software in this example is the "retry on serialization failure" that LR/SC already requires. The stall for a random small number of cycles is a hardware operation. Hardware knows to insert the random backoff stall when attempting to reacquire a reservation that this hart previously held, but that timed out while this hart was stalled acquiring another reservation. The required state information can be invisible to software, since this is only needed to break lock-step if two harts arrive at (and thus will retry) an LR/SC simultaneously and a preemptive context switch on either hart will certainly break that lock-step.
"On the first attempt, both CPUs fail at SC, they both jump back to retry, they stall for different amounts of time at the first LR"
However, until one CPU has reached the SC sequence, no
reservations are dirty. The guarantee of forward progress
requires that all SCs be contiguous, so renewing a hard
reservation just long enough to complete four SC operations avoids
this with bounded delay.
Hm. I didn't think it would be a good idea to allow multiple CPUs to have reservations to the same cacheline, but I don't have a good reason against it.
The first CPU to execute SC "wins" and all reservations held by
other CPUs on the addresses written are broken.
Which would also invalidate all other reservations of those other CPUs, preventing them from entering the SC groups.
They will enter the SC groups, but the SCs will fail. Software then retries the sequence.
You can guarantee in the absence of interrupts:
1) if multiple LR;LRM;SC sequences compete, one of them will reach a successful SC
2) once a successful SC is executed, the remaining SCs in the sequence will also be successful and the operation will be committed.
As long as at least one of those sequences is not interrupted, this will hold in the presence of interrupts as well, if interrupts cause an interrupted LR/LRM/SC to be abandoned and retried in software.
If you fail the LRM and the reservation is dirty, the LRM can not really return a "real" value. It either has to return the dirty value (which might never be written to memory and might make the remote LR/LRM/SC sequence seem non-atomic), take a stale value from main memory, or make up a value. In each case LRM potentially returns garbage (which I think would be ok; I'd prefer if it always returned the same garbage though, such as all zeros or just the current register content).
This cannot happen -- if the reservation is dirty, then the remote
LR/LRM/SC has completed and committed.
No, because an interrupt could still happen that fails the SC sequence.
No, because the remote LR/LRM/SC has not committed until its *last* SC has succeeded. The SCs in an LR/LRM/SC sequence are an atomic group, until they *all* succeed, *no* changes are visible.
The "stale" value must be read until that time, since the atomic
writes have not yet occurred.
In the cache you probably already have the dirty value. I think the LRM to an exclusive remote reservation should be stalled until 1) the local reservations are invalidated from the outside or 2) the remote reservation is freed (whether because of an interrupt or because the remote side commits).
If the remote is currently committing, then yes, LR/LRM *should* stall and wait for the new value or for remote commit to fail. Otherwise, this hart has just acquired a reservation that it knows will be broken before local commit is possible.
LR/LRM simply expresses interest in an address; there is no
requirement that SC ever be executed. A "failed" LRM reads the
same value that a successful LRM reads, the only difference is
that a failed LRM was executed without the LR to begin the
sequence, so no reservation is acquired and SC must fail, causing
the sequence to be retried.
But an LD or LR to an exclusive remote reservation definitely has to be stalled. The question is just whether the LRM should stall like the LR or whether it is allowed to abort.
In case we do not need a meaningful, atomic value in case the SC fails, we can abort and everything is fine.
LOAD is non-atomic, so there is no reason for it to stall.
No guarantees are made, and loading the old values is correct -- the transaction has not committed yet.
For LR/LRM, the only reason to stall is to avoid loading a value and acquiring a reservation that we know will not remain valid when we reach SC. Stalling at LR/LRM is a performance optimization, as it avoids a software retry of the sequence by waiting for an almost-complete competing sequence to actually complete.
I also realized this later. Take the LR/LRM sequence and take all registers which are used by an LR/LRM but not overwritten by a previous LR/LRM in the sequence.
That set of registers (the addresses) has to be restored to the point before the LR, either in SW (by not overwriting such a register) or in HW. Both sound complicated and error-prone.
LR/LRM takes one input (an address) and produces one output (a value read from memory) with the side effect of acquiring a reservation. As long as restarting the LR/LRM group produces the same results, this will work. [...]
Another way: Destination registers within an LR/LRM group must contain dead values when the group is entered.
For software, the rule would be that the LR/LRM group must not use the initial value of any register that is used as a destination. [...] A third way: An LR/LRM group must not write to a register after reading that register.
I think we have this requirement already -- the SC group must
write exactly to every address that was read in LR/LRM and
clobbering the addresses makes this impossible within the
guaranteed forward progress constraints.
I don't think I quite understand. Why does this make it impossible?
The RVA forward progress guarantee places a severe limit on code size: only 16 baseline RVI instructions, including the LR/LRM/SC and the branch to retry the sequence, and excluding loads, stores, and some other instructions. There is no time to reconstruct the addresses -- they must be preserved in registers.
I also don't know what "clobbering" means in this context, is that a formal term?
I do not know how formal "clobber" is; I think I first saw it in the GCC manual. To "clobber" a value is to overwrite the location where that value is stored.
The latter is a specific solution to this instance of the problem, but not a generic solution (nest more LRMs in the first thread and alternate LRs in the second thread).
Using shared and exclusive reservations as you suggested solves the problem: both threads take shared reservations without conflict and the first thread to reach SC now has exclusive reservations that cause the other thread to fail SC and stall when it tries to load the value again. The first thread's SCs complete and the other thread continues.
Because the repeat until success loops appear to be atomic, you don't have to bother with a bad interleaving where the lock is allocated while someone else frees the lock (which is what makes MCS locks so complicated).
To see why this algorithm works, imagine a singly linked queue of locks. There are three invariants:
1) The oldest element in the queue currently has the lock.
2) Iff l_p points to itself, the queue is empty.
3) Otherwise l_p points to the newest element in the queue, which points to itself.
On LOCK we enqueue a new element, which means atomically pointing l_p and the previous newest element in the queue to myself.
If l_p pointed to itself the queue was empty and I now have the oldest lock in the queue; otherwise, I have to wait for the previously newest element in the queue to release the lock.
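Under these invariants, the enqueue step of LOCK might be sketched as follows (hypothetical lrm.d mnemonic; a0 = &l_p, a1 = me, and the labels are illustrative; me->owner == me is assumed to have been set up beforehand, which is safe since nobody else can see the new element yet). Note that when the queue is empty, l_p points to itself, so the LRM takes a second reservation on the same address and the two SCs store the same value twice -- a case that comes up again later in the thread:
enqueue:
    lr.d    t0, (a0)        # t0 = newest element (l_p itself if queue is empty)
    lrm.d   t1, (t0)        # value unused; the LRM exists to reserve the element
    sc.d    t2, a1, (t0)    # previous newest element now points to me
    sc.d    t2, a1, (a0)    # l_p now points to me
    bnez    t2, enqueue     # retry on serialization failure
    beq     t0, a0, acquired # l_p pointed to itself: queue was empty, lock is mine
    # otherwise spin until the previous newest element (t0) releases the lock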
On UNLOCK I release the lock by pointing at myself, thus removing myself from the queue. The only difficulty is if I was the last element in the queue, in which case the queue is now empty.
I'm not saying this is a great algorithm -- unlike the normal MCS lock this one might not be fair in the sense that one of the threads can attempt an infinite number of LR/LRMs without ever reaching the SCs -- I was just giving an example of an algorithm where LRMing from a previous LRed pointer makes sense.
The RVA forward progress guarantee says otherwise. :-) I think this may be a good example and argument for multiple LR/SC, although I do not know how this compares to using the MCS locks from <URL:https://lwn.net/Articles/590243/> and AMOSWAP. It looks to me like LR/LRM/SC provides an MCS lock in a single pointer, while using AMOSWAP requires having a separate "lock-held" flag.
PS: I now enabled "undo send" in gmail, I hope this will prevent future mistakes...

Your mistakes helpfully trifurcated a thread that was becoming an exchange of essays. :-)
Jonas Oberhauser wrote:
2017-12-12 6:09 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
I like the UNLOCK. I don't know why you think that an LR will help in the LOCK_wait.
Baseline RVI LOAD does not have atomic access semantics. Only LR can have acquire or sequential consistency specified.
The following are all true:
1) all SCs in the same group become visible at once
2) the SCs will eventually become visible, at which point both the LR and LD (which are executed in a loop) would see them
3) fences will not make the SCs visible any sooner, neither local fences nor remote fences. Local fences have no effect on the remote SCs, and remote fences will just delay processor execution but probably not make the SCs happen any faster
4) it is impossible to see from the local hart whether the SCs take a long time because they are being buffered, or because the remote is just terribly slow.
5) the remote can probably not hurry up even if it sees that we have a reservation.
The RISC-V memory model is quite weak, and I do not think the revised memory model gives any better guarantees for RVI LOAD. Implementations are also allowed, as I understand, to have loose cache-coherency, where RVI LOAD may *never* see an update without a FENCE.
LR/LRM/SC, of course, must be their own FENCEs.
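Under that pessimistic reading, a wait loop on a flag word sketches the difference (lr.w.aq is standard RVA; whether a FENCE between polls is actually required depends on how loose the implementation's coherency is, as discussed above):
wait_ld:
    lw      t0, 0(a0)       # baseline load: under loose coherency this may
    fence   r, r            # keep returning a stale value without a FENCE
    bne     t0, a1, wait_ld

wait_lr:
    lr.w.aq t0, (a0)        # LR can carry acquire ordering itself
    bne     t0, a1, wait_lr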
There are also no cacheline bounces because the only thing you execute is that one LD, and the only thread that will also access that cacheline is the one that currently holds the lock -- in which case it will only access it once, when it releases the lock.
The loop is reading the original lock, not the local queue entry. Any other threads wanting to take the lock will also need to write that location. I have probably misunderstood MCS locks.
Another try with pseudo-assembler:
[...]
but consistency in LOCK requires the LOCK_enqueue AMOSWAP to break a reservation and force UNLOCK to retry.
The first BEQ instruction in the wait loop handles the case where a thread moving to take the lock is preempted between extending the queue and adding itself to the queue and the lock is released by another thread, although (oops!) the other thread's queue slot is still overwritten in this case. The second BEQ is written that way to handle the case where the other thread's queue ceases to be valid after the other thread passes the lock to this thread, but this thread remains preempted until the remote queue slot now contains garbage; failure (infinite loop) is 1-in-2^XLEN in this case, assuming a random value in the no-longer-valid queue slot that never changes. Note that UNLOCK writes a value that will break the loop, so failure is unlikely. I think that a safe LOCK requires LR/LRM/SC, since LOCK needs to atomically update both the lock itself and a queue entry. We cannot race with other threads to get into the queue, but we can race with UNLOCK.
Without the forward progress guarantee, UNLOCK can get into livelock, competing with LOCK operations adding to the queue. This means that AMO instructions to an exclusive reservation must also stall.
This leads to a key advantage for LR/LRM/SC: since taking any trap breaks the reservations, an LR/LRM/SC sequence remains atomic even with arbitrary preemption latencies.
Interestingly, the only reason for UNLOCK to take a reservation on the local queue slot is to ensure that another thread cannot add itself to the queue after we have decided to release the lock. The value read from the local queue slot is not actually used.
Should LR/LRM to x0 be defined specially (to skip actually reading memory, but still take the reservation) for this situation, where a value must not change, but the actual value is not used?
Clever, but that would be a form of conditional move. Conditional moves have been considered but rejected for RISC-V (see chapter 2.5 of riscv-spec-v2.2).
For out-of-order implementations which use register renaming, you would need three input registers (the load address, the original x, and the original y/z) and two output registers (the renamed x and the renamed y/z). Conditionally writing to a register doesn't reduce the number of outputs, since you can't conditionally rename a register, because the renaming happens before the instruction is executed. If the instruction doesn't write to the register, it still has to copy the old value from its original physical register to its new physical register, so you also increase the number of inputs.
That is, what we have today for LR is one input (the address) and one output (the destination register), and for SC two inputs (the address and the source register) and one output (the "success" register). What you proposed would need three inputs, more than anything other than FMA (which is in a different register file), and two outputs, more than anything other than CSR swaps (though the CSRs are more of an I/O space than registers, so in the likely case they are treated more like a strange sort of memory, there is again only one register output for them).
No, because the LR/LRM implementation looks different.
try:
    a = LR x
    b = LRM y
    c,d = op(a,b)
    *x = SC c
    *y = SC d    // sets fail
    retry on fail
In this particular example, we neither 1) reuse the old values (it would be fatal) nor 2) check whether the loads returned garbage, we just retry on a failed SC.
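Rendered with the hypothetical mnemonics used elsewhere in this thread (the add/sub stand in for an arbitrary op placed between the loads and the stores):
try:
    lr.w    t0, (a0)        # a = LR x
    lrm.w   t1, (a1)        # b = LRM y
    add     t2, t0, t1      # c = op(a,b), computed inside the sequence
    sub     t3, t0, t1      # d = op(a,b)
    sc.w    t4, t2, (a0)    # *x = SC c
    sc.w    t4, t3, (a1)    # *y = SC d, t4 = fail
    bnez    t4, try         # retry on fail; the old values are never reused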
Correct, but LR/LRM/SC can be used to implement DCAS, and portable synchronization libraries might choose to rely on DCAS, so the first loop with DCAS should work if DCAS is a subroutine implemented using LR/LRM/SC.
If LR/LRM can return garbage instead of "old" values, then those portable libraries would need to accommodate a quasi-DCAS that does not return usable values upon failure.
Power management -- shut down a hart to save power. If that hart was holding reservations (it should not be, but...) something must ensure that those reservations do not deadlock other harts later. (Releasing reservations upon entering low-power mode should also work, but the reservations are tracked in the memory subsystem anyway, so why not expire them on the memory clock?)
The required status information is the same information needed to track reservations in the first place plus one bit per reservation for an expiration flag -- if taking a reservation that previously expired, stall.

Ah. What confused me was the following sentence:
"On the first attempt, both CPUs fail at SC, they both jump back to retry, they stall for different amounts of time at the first LR"
I don't think it's practical to keep this status information, but I haven't thought enough about it.
By the way, you do not have to stall for a random amount, but for a hopefully unique amount. In a local system you might have different IDs for each CPU, which you can use to wait.
Correct, random is just an easy way to get a hopefully unique amount with dynamic priority where using hart IDs would produce a static priority model.
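As a software illustration of the static-priority variant (the thread assumes the backoff happens in hardware, and mhartid is readable only in M-mode, so this is purely a sketch with an arbitrary scale factor):
backoff:
    csrr    t0, mhartid     # distinct ID per hart gives a distinct, static delay
    slli    t0, t0, 2       # arbitrary scale factor
spin:
    addi    t0, t0, -1
    bgez    t0, spin        # burn cycles, then retry the LR/LRM/SC sequence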
At some point, interrupt load is simply too high for any forward progress guarantee to hold. This is a system design problem if the hart never executes 32 instructions uninterrupted.
The solution in the second case could be to delay interrupts for up to 4 cycles to complete the SC group.
There is also the Herlihy implementation, which uses a pair of cachelines for each value, discarding one upon successful commit and the other upon abort. Commit in Herlihy's model is atomic -- the cache can update the "pending" cachelines to "current" and discard the "old" cachelines in the same cycle.

In my implementation, a successful SC already replaces the cacheline with the new value, and thus has to stall remote accesses.
The advantage of this is that the eventual commit is atomic and you don't need to buffer the new values somewhere else, and you can take an interrupt anywhere in the sequence without problems.
In an implementation where the SC values are buffered in a CPU-side buffer and the CPU pushes all of its SCs through on the last SC, I agree with you.
Note that this means that while this buffer is drained, no interrupts can be taken (not sure about claimed).
In this case the SC sequence still stalls remote accesses.
In my answers to the next lines, I will refer to the two implementations as the "dirty cache implementation" for the one that changes the cache lines before the commit, and as the "buffer implementation" for the one that has an extra buffer and grabs the memory bus while draining the buffer.
While the local hart is executing an SC group, a remote LR/LRM must either read the "old" value (taking a reservation that will almost certainly be broken in a few cycles) or stall until the "new" value is available (when the SC group completes). In a dirty cache implementation, the "old" value may not be available, but the cache is not dirty until an exclusive reservation has been taken, so having LR/LRM stall and wait for the "new" value is appropriate. If remote LR/LRM stalls and the local commit fails, invalidating the dirty cacheline, then the remote hart must go fetch the value from an outer cache or main memory.
So two more cache-coherence messages are needed: "wait" and "oops, I lost it; go fish". (The phrase "go fish" refers to a card game.)
What I agree on is that aborting the LRM and loading nothing is a performance optimization, which may or may not bite us later. But it is easier to go from "loads nothing" to "behaves like a LOAD" than from "behaves like a LOAD" to "loads nothing".
The problem is that loading nothing on LRM *is* going from "behaves like a LOAD" to "loads nothing" -- the prototype for LRM is LR, which "behaves like a LOAD" currently.
LR/LRM takes one input (an address) and produces one output (a
value read from memory) with the side effect of acquiring a
reservation. As long as restarting the LR/LRM group produces the
same results, this will work. [...]
Another way: Destination registers within an LR/LRM group must
contain dead values when the group is entered.
Yep.
For software, the rule would be that the LR/LRM group must not use
the initial value of any register that is used as a destination. [...] A third way: An LR/LRM group must not write to a register
after reading that register.
These work but are too strong. Consider:
t1 = LR a1
t1 = LRM t1
This is fine, but prohibited by both rules above.
What? That is allowed under both rules -- the initial value of t1 is dead and not used before it is loaded from (a1) and the LRM is considered to write t1 on the same step (not after) that it reads t1 to get a memory address.
That's why I think it's a bit error prone, it's a bit subtle.
It looks to me that the most error prone part will be writing the rules clearly. :-)
There is. Consider the following:
try:
    t1 = a1 + 4
    t2 = LR a1
    t1 = LRM t1
    retry if t2 != t1
    t3 = a1 + 4
    *a1 = SC 0
    *t3 = SC 0
Does it make sense to program like this? No. Are there programs where it makes sense? I doubt it, but I don't know. But for sure it is possible to program like this (and by Murphy's law, someone will probably do it).
If you want this condition, you need to make it explicit, it's not a consequence of existing rules.
Maybe a rule that addresses must be preserved should be made explicit, but that sequence does violate the proposed requirements for LR/LRM to be restartable -- the value of t1 prior to LR is both used and clobbered in the LR/LRM group.
Jonas Oberhauser wrote:
That is right... rats... DCAS does not attempt to store if the values do not match, so the serialization failure is not detected.
This will be hard, because it would mean either 1) LR/LRM groups always need to appear to be atomic or 2) you need a way to check for LR/LRM failure so that you can retry in case they were not, but leave the loop in case they were atomic but unequal to the compare data.
(the DCAS is an atomic load even in case of failure...)
Implementing atomic LR/LRM can be done in HW by restarting the sequence in case it would otherwise not appear to be atomic (as we discussed before, with the rules below), or by waiting to load values until you can make sure that the sequence will appear to be atomic.
I think the latter would be preferable, but we'd have to come up with a cheap implementation.
I am somewhat inclined towards that; in any case, if the LR/LRM group could be observed to be non-atomic, the SC group must fail, so ensuring that the LR/LRM group is atomic would help an implementation support the forward progress guarantee.

Well, not quite, because there is no software that uses LRM yet. If we make a decision either way, we will have to deal with software later if we change our mind.
But if we decide that the LR/LRM group has to act as a big atomic load -- and I definitely see the reason now for why that might be a good idea --, this discussion does not matter.
This sequence is restartable, since the restart can only begin at LR. At the end, t1 == ***(a1) and t2 == **(a1). Yet this does not meet the "third way" version of the rule.

For the third way, well, in this particular example it depends on what you mean by "after" (for me, the LRM first reads t1, then writes it, but that's a matter of view).
But you can give a similar example:
t1 = LR a1
t2 = LRM t1
t1 = LRM t2
Right, the "third way" version could be made correct, but would be much longer and somewhat awkward.
Furthermore, this definition of "after" would not have been strong enough:
a1 = LR a1
a2 = LRM a2
Neither of these wrote a register "after" reading it (in your definition), but the values are now "clobbered" and the sequence cannot be restarted.
Is the first version of the rule ("The LR/LRM group must not use the initial value of any register that is used as a destination.") workable alone, then?
You are right that it makes no sense, but I admit that in sequences with fewer than four reservations, it is inefficient but possible to clobber and reconstruct addresses. Even with four reservations, one address can be reconstructed. There is no reason to do this and it is a great way to introduce bugs by missing one of the required SCs, but it is possible. I think strongly-worded advice not to do this is appropriate, but an actual requirement to preserve addresses is probably unneeded.
Jonas Oberhauser wrote:
2017-12-13 4:57 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
I like the UNLOCK. I don't know why you think that an LR will
help in the LOCK_wait.
Baseline RVI LOAD does not have atomic access semantics. Only LR
can have acquire or sequential consistency specified.
You can use the current pseudo-instruction which puts fences in the right place.
I'm not sure what you mean with "atomic access semantics". A load loads the most up-to-date value for each byte at the point that it is executed. Most up-to-date takes into consideration 1) the state of the memory and 2) local stores that have been issued before the load but not yet reached local memory. (This is the "load value axiom").
For the purposes of the lock, this is strong enough.
(cf also: "Every aligned load or store instruction gives rise to exactly one memory access that executes in a single-copy atomic fashion [...]")
That statement was written under the old memory model, which did not define those pseudo-instructions. It still stands, however: the baseline LOAD opcode does not guarantee ordering without an additional FENCE instruction, while LR combines those effects into a single instruction.
Global memory order is loose: "A store is said to have performed not when it has executed inside the pipeline, but rather only when its value has been propagated to globally
visible memory." I have not found any requirements for when that propagation occurs.
The new memory model has a special "atomicity axiom", so I expect some special behavior from LR that LOAD may not have.

LR/LRM/SC, of course, must be their own FENCEs.
By that I assume you mean "all LR/LRMs are executed before any of the corresponding SCs", nothing more? I don't think there's an inherent need to have more fences than this.
Jonas Oberhauser wrote:
I have a feeling that we are slowly converging to a common line.
Here's what I think we agree on:
1) LR;LRM groups without a page fault appear atomic,
2) can take multiple reservations on same address
3) forward guarantee is maintained
4) SCs to all reservations = commit
Agree on (2), (3), (4); qualified agree on (1): if a value is changed after being read by LR/LRM, then the following SC group must fail. Making LR/LRM atomic will avoid one window for serialization failure, but is not strictly required. The last LRM may read a value that was stored by another hart after LR was executed, but the SC group must fail if any stores have occurred to any location X read by the LR/LRM group after that location X was read.
I do not believe that database-like "snapshot consistency", where all values read are as if from a snapshot taken when LR is executed, is feasible to implement.
Things that are unclear to me yet:
a) page faults in an LR/LRM group -- what happens? Options I know of:
- provide in HW epc = address of LR and suggest that "no initial value ...";
- let the OS take care of it (e.g., a system call to provide an LRM-pf-restart address to the OS, which is used on an LRM page fault);
- make it "invisible" by translating the whole LR/LRM block before the LR is executed (this means that the suggestion from before is not needed).
The current draft proposal I am writing requires that software be able to cope with an LR/LRM group being restarted at LR. Letting the OS take care of it is a large amount of very special complexity in the page fault handler -- the Linux devs will not be amused. Translating the whole block is impossible if LR/LRM chases pointers atomically, as MCS locks seem to need with their requirement for a value loaded from a pointer that is guaranteed not to have changed since the pointer itself was loaded.
The fourth option (also permitted to implementations) is to simply take the page fault, drop the reservations, go through the LR/LRM/SC sequence, possibly taking more page faults at LRM or SC, and fail the SC group, requiring software to retry the sequence. (Page fault at SC can occur if a synchronization variable is in a page that is currently copy-on-write.)
b) when multiple reservations are taken on an address, do I need multiple SCs or one SC? If I need multiple, does the last SC win (in case they write different values)?
Each LR/LRM adds one entry to the pending update set; each successful SC removes one entry from the pending update set. This is needed because MCS locks may take two reservations on the same address or on different addresses.
Should writing different values to the same address use the last value written or fail? Multiple identical stores can be treated as a single store, but writing two different values probably indicates a bug.
c) can we allow an LRM that is not part of the LR;LRM group (without guarantee of atomicity)? For example, if we follow a pointer and want to compare for null first, or if we want to follow a pointer in a struct and have to add some offset? We would only guarantee that the LRM is atomic with the previous LR/LRMs when we do SCs and the SCs succeed, but (unlike for an LR/LRM group) the values are not guaranteed to be atomic in case of an SC failure.
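Concretely, the question is whether something like the following would be permitted (hypothetical mnemonics; the BEQZ and ADDI are ordinary instructions between the LR and the LRM, and the bail label is illustrative):
    lr.d    t0, (a0)        # load the pointer
    beqz    t0, bail        # compare for null first
    addi    t0, t0, 8       # add a field offset within the struct
    lrm.d   t1, (t0)        # then LRM, no longer adjacent to the LR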
No, the purpose of LRM is to ensure that we have started a multiple-LR at its correct starting point, otherwise LR/LR/SC/SC would suffice. Those situations call for full RVT and its LT instruction, not multiple-LR.
Jonas Oberhauser wrote:
2017-12-15 6:30 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
Jonas Oberhauser wrote:
I have a feeling that we are slowly converging to a common line.
Here's what I think we agree on:
1) LR;LRM groups without a page fault appear atomic,
2) can take multiple reservations on same address
3) forward guarantee is maintained
4) SCs to all reservations = commit
Agree on (2), (3), (4); qualified agree on (1): if a value is
changed after being read by LR/LRM, then the following SC group
must fail. Making LR/LRM atomic will avoid one window for
serialization failure, but is not strictly required. The last LRM
may read a value that was stored by another hart after LR was
executed, but the SC group must fail if any stores have occurred
to any location X read by the LR/LRM group after that location X
was read.
Oh, ok. I'm fine either way.
So you are thinking of something like
1') an implementation may provide an atomicity guarantee for the LR/LRM block. Such an implementation may restart the LR/LRM block if the LR/LRM block would otherwise not appear to be atomic.
?
Wouldn't it be better to have this as an extension to the extension to the extension? It seems to me like I might write different software for implementations that have this extra feature than for those that only have LR/LRM. (sure, I can have software that works for both, but that software will not make use of the feature and pretend on SC failure that the values were not atomic -- then what's the point of the feature?)
Atomic LR/LRM is really a performance optimization. When SC fails, all software really knows is that the entire sequence must be retried. Requiring portable software to make LR/LRM blocks restartable enables this performance optimization at essentially no cost to any implementation, since RISC-V has more registers than an LR/SC sequence can use while still meeting the forward progress guarantee constraints.
Now that I think about it, the requirements around reservations have the effect that a successful LR/LRM/SC sequence acts as if a snapshot was taken at the last LRM, but this "snapshot" never has to be materialized, so it costs nothing.

I do not believe that database-like "snapshot consistency", where all values read are as if from a snapshot taken when LR is executed, is feasible to implement.
Maybe not, especially during pointer chasing, as you reminded me below.
Atomic LR/LRM is an implementation option; the extension permits either the first or fourth option.
Elegance is important here: multi-processor synchronization is hard anyway, making it any more obscure than needed is asking for more bugs.

No, since you could condition on whether you read the same address twice. It's just more elegant this way.
The reservation subsystem may not care, but then LRM is a synonym for LR.

c) can we allow an LRM that is not part of the LR;LRM group
(without guarantee of atomicity)? For example, if we follow a
pointer and want to compare for null first, or if we want to
follow a pointer in a struct and have to add some offset? We
would only guarantee that the LRM is atomic with the previous
LR/LRMs when we do SCs and the SCs succeed, but (unlike for an
LR/LRM group) the values are not guaranteed to be atomic in
case of an SC failure.
No, the purpose of LRM is to ensure that we have started a
multiple-LR at its correct starting point, otherwise LR/LR/SC/SC
would suffice. Those situations call for full RVT and its LT
instruction, not multiple-LR.
Why? I think all implementations of LR;LRM also work for LR;(integer instructions);LRM (provided the same conditions about distance between LR and SC, number of LR/LRMs, etc., still hold). The reservation subsystem doesn't really care that you did a few extra integer instructions since your last LR/LRM.
and you have the same problems as with LR/LR/SC/SC, particularly the inability to easily determine if a sequence has been abandoned.
Portable software must not assume that LR/LRM groups are atomic.
> What do you mean by "materialized"?
It actually takes up storage somewhere. Snapshotting the entire memory
at LR requires either copying all of memory or tracking an "undo buffer"
that allows previous values to be retrieved. Alternately, if we only care
about consistency, the "undo buffer" need only record the address of
every store in the global memory order, to allow a future LRM to
recognize that it is reading from an address that has been updated and
fail the sequence. This is a potentially unbounded buffer, since it
must record the actions of other harts in anticipation of future
instructions on the local hart.
This might actually work with preemptive threads, unlike stack-based
multi-LR. Thread A: LR/LRM/LRM/(preempted). Thread B:
LR/LRM/LRM/SC/SC/SC. The sequence in thread B commits. Eventually,
thread A resumes: LRM/SC/SC/SC/SC. The LRM fails to acquire a
reservation, the preemption invalidated previous reservations, and the
SC group fails.
I see two problems with this, however: first, the "sparse" LRM
instructions complicate the atomic LR/LRM group semantics. Either
resuming at LR is not permitted after any instruction other than LRM is
executed, or all instructions in the sequence must ensure that no
register updated in the sequence has a meaningful initial value, rather
than only an LR/LRM group that is at most four instructions long having
this constraint.
The second problem I see is that allowing "sparse" LRM instructions
comes very close to needing the VALIDATE instruction, and that is
supposed to be one of the lines between multi-LR and full RVT. While
LR.VALIDATE could be encoded as an LRM from (x0), I think that this
would be uncomfortably blurring the line between the simpler multi-LR,
where the RVA forward progress guarantee is maintained, and RVT, which
will have different (and probably weaker) guarantees.
I am considering proposing to allow "sparse" LRM but have the
architectural forward progress guarantee only apply if LR/LRM are used
in a single group at the beginning of the sequence.
Michael Clark wrote:
On 28/12/2017, at 2:48 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
Michael Clark wrote:
This approach to double wide LR/SC would require a 128-bit scalar register file, or a register pairing scheme, but no ISA changes:
- https://github.com/michaeljclark/riscv-meta/blob/master/opcodes#L149-L150
On x86, SP and DP scalar floating point operations and some scalar integer operations use the SSE/AVX xmm/ymm/zmm registers, which are now 512 bits wide in Skylake. The extra width does not affect the performance of the narrower 32-bit or 64-bit scalar operations. x86-64 compilers use SSE FP almost exclusively; compilers only emit x87 code in some rare circumstances, such as use of transcendentals. There is precedent for wide register files that are used frequently for scalar operations.
In fact, 128-bit scalar ops, especially bitwise logical ops, could have a lot of potential uses now. However, one could define a "register pair" extension to avoid growing the register file.
A much simpler "non-standard" extension could allow a subset of RV128 ops with 15 registers instead of 31, using register pairing/aggregation.
A register pairing/aggregation scheme has been used for wide operations in other ISAs. One could re-use the existing RV128 LR.Q/SC.Q ISA encoding, with a non-standard extension that allows a subset of 128-bit operations on a subset of register pairs. Essentially "loose instructions" for 128-bit LR.Q/SC.Q and a "register pair" scheme.
This is an interesting idea, but the goal is a standard extension, and register pairing would require more ports on the register file, so it would be a very hard sell. Also, the LR/LRM/SC proposal allows up to four _independent_ words to be updated, providing a significantly more powerful primitive than a simple 128-bit LR/SC.
Register ports are an implementation detail.
If the implementation reads double-width operands from the register pairs in two cycles and writes double-width operands to the register pairs in two cycles, then there is only the necessity to have 128-bit operand buffers in the load/store unit for LR.Q/SC.Q and two read ports and one write port on the register file. An OoO implementation would of course need to handle dependencies and wakeup for opcodes that read or write register pairs.
This is starting to introduce a CISC-like execution sequencer to handle breaking an atomic instruction into multiple micro-ops. I believe that RISC-V opcodes are supposed to *be* micro-ops and that this is one of the differences between RISC and CISC architectures.
I believe that LR/LRM/SC as a "light" transaction support will be easier to implement than full RVT, which I am also planning to propose, using the Herlihy model. The full RVT will support much more complex programming, but software will have some responsibility for ensuring forward progress. LR/LRM/SC will still have the RVA forward progress guarantee. RVT will have its own, weaker, guarantee.

Two transactional memory extensions sounds complex.
Not really, LRM is a simple extension to the RVA extension and can be easily implemented with small changes to hardware. The planned RVT proposal more-or-less requires an additional transaction cache to implement. LRM more closely resembles what passes for transactional memory on x86, while the RVT proposal I am planning is more in line with Herlihy's original proposal and more closely resembles database transactions.
On Jan 1, 2018 03:52, "Michael Clark" <michae...@mac.com> wrote:
One would assume a Double Wide LR/SC, as an atomic is exactly that, access to aligned contiguous quadwords.
>> I’m referring to RISC in terms of reduced instruction set, i.e. we already have LR.Q and SC.Q
>
> Both of which are already reserved for use on RV128 for atomic access to aligned contiguous quadwords.
The discussion some time ago moved on from the double wide LR/SC to nested LR/SC, which is needed for non-adjacent DCAS, which itself appears to be needed for some concurrent data structures. I say appears because it might be possible to use an adjacent one if one carefully massages the data in the data structures, but I have not checked.
Either way nested LR/SC can also be used for other things double wide LR/SC can not be used for, such as the example Jacob pointed out.
What you describe sounds like a perfect use-case for the T extension. Non-contiguous atomic operations across multiple loads and stores are transactions and require transaction support on existing platforms that support this.
I think one of the key RISC principles is adding instructions that are commonly emitted by compilers.

These particular use cases you are talking about are clearly not features that are in use by any proportion of existing code besides code that relies on full transactional memory.
I personally believe there would be more utility in providing a mechanism to support double-wide LR/SC, as this would be a requirement for RISC-V binary translation of CMPXCHG16B and LDXP/STLXP, i.e., fast x86 and AArch64 emulation.
I’ve been following the discussion and the matched recursive LR/LRM and SC sound problematic. The interface is not ideal given the return code on each SC.
There needs to be a delineation of the beginning and end of a transaction and the number of loads may not necessarily match the number of stores. One wonders the utility of such a limited form of transaction support.
One would presume there would be a single standardised T extension, not T-lite and T-full. One would possibly also support regular loads and stores in transactions with BEGIN, COMMIT, ABORT vs trying to tack something on to LR/SC.
Linux also integrated the MCS lock only a few years ago, and MCS locks could profit from nested LR/SC (one might be able to replace ticket spinlocks with MCS locks with nested LR/SC, but certainly not with T or DCAS, even non-adjacent).