Double-wide LR/SC


Cesar Eduardo Barros

Feb 14, 2016, 20:22:44
to isa...@lists.riscv.org
Given the recent LR/SC thread and
https://github.com/rust-lang/rust/issues/24564, I've wondered about a
double-wide LR/SC, which could be used to emulate DW-CAS.
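
For concreteness, this is the operation being emulated, written as a sketch
with the GCC/Clang __atomic builtins on a 128-bit integer (this assumes a
64-bit target with unsigned __int128, and depending on the target it may be
lowered to a libatomic call rather than to inline code):

#include <stdbool.h>

/* The DW-CAS that the proposed instructions would let us emulate inline. */
typedef unsigned __int128 u128;

static bool dw_cas(u128 *addr, u128 *expected, u128 desired)
{
    return __atomic_compare_exchange_n(addr, expected, desired,
                                       /* weak */ false,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}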

In case porting software which depends on DW-CAS becomes a problem, I
don't think it would be too hard to create a "double-wide LR/SC"
extension. Here's my back-of-the-napkin design for it.

I would use 4 instructions, which I call LRL, LRH, SCL, SCH. LRL and LRH
are encoded similarly to LR; SCL and SCH are encoded similarly to SC.

The word pair is required to be aligned as if it were a single larger
word (that is, 16-byte alignment for a pair of 8-byte words). This also
guarantees that both words are in the same cacheline.
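
In C terms, the alignment requirement amounts to something like this sketch
(_Alignas is standard C11; the struct itself is only illustrative):

#include <stdint.h>

/* An RV64 word pair aligned as if it were a single 16-byte object, so a
 * reservation on it is contained in one cacheline. */
typedef struct {
    _Alignas(16) uint64_t lo;   /* first word, 16-byte aligned  */
    uint64_t hi;                /* second word, same cacheline  */
} dw_pair;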

The sequence of operations for a DW-CAS would be:

- LRL of the first word to a register;
- LRH of the second word to a register;
- compare both registers and branch if any is different;
- SCL to the first word;
- SCH to the second word.

The LRL reserves the two words (or their cacheline). The LRH is
guaranteed to not break the reservation, as long as it's for the word
following the one just loaded by the LRL (which it can read from the L1
cache, since it's guaranteed to be in the same cacheline).

The SCL does not store anything; it "buffers" the word and its address.
The SCH "releases" the store from the SCL and stores its own word, as
long as the reservation is intact and the word is the correct one, and
writes the result to rd.

Each of these instructions still reads at most two registers and writes
at most one register.

The complexity is all in the SCL, since it has to buffer the word; this
might be a simple modification of a buffer which already exists to
combine multiple writes to the same cacheline, or some out-of-order
bookkeeping structure, or might be its own dedicated internal register.

Thoughts?

--
Cesar Eduardo Barros
ces...@cesarb.eti.br

Cesar Eduardo Barros

Feb 15, 2016, 04:44:39
to isa...@lists.riscv.org
Thinking about it some more, to allow for implementations to "fuse" the
LRL/LRH and SCL/SCH pairs, they would have to follow each other without
any intermediate instructions, and they would have to use the same
address register. Therefore, both LRH and SCH would have to increment
the address internally; since it's always aligned, it's as simple as
flipping one bit in the address from 0 to 1.

So, for DW-CAS, the CAS example in Figure 5.1 would become:

dw_cas:
    mv    t2, a0           # t2 = address of the aligned word pair
    lrl.w t0, (t2)         # load-reserve the first word
    lrh.w t1, (t2)         # load the second word (address+4 derived internally)
    li    a0, 1            # default result: failure
    bne   t0, a2, return   # first word != expected?
    bne   t1, a3, return   # second word != expected?
    scl.w a4, (t2)         # buffer the store of the new first word
    sch.w a0, a5, (t2)     # store the second word and commit both; a0 = 0 on success
return:
    jr    ra
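
As an aside, for a pair of 32-bit words a 64-bit implementation can already
get this effect today by treating the naturally aligned pair as one 64-bit
object; the sketch below mirrors the register contract above (pointer,
expected pair, new pair, 0 on success) and assumes a little-endian target.
The proposal matters for a pair of XLEN-wide words, where no wider LR/SC
exists.

#include <stdbool.h>
#include <stdint.h>

static int dw_cas32(uint64_t *pair,
                    uint32_t expected_lo, uint32_t expected_hi,
                    uint32_t new_lo, uint32_t new_hi)
{
    /* On a little-endian target the word at the lower address is the low
     * half of the 64-bit object. */
    uint64_t expected = ((uint64_t)expected_hi << 32) | expected_lo;
    uint64_t desired  = ((uint64_t)new_hi      << 32) | new_lo;
    return __atomic_compare_exchange_n(pair, &expected, desired, false,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)
               ? 0 : 1;            /* match the asm convention: 0 = success */
}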

Mike Hamburg

Feb 15, 2016, 14:21:44
to Cesar Eduardo Barros, isa...@lists.riscv.org
An alternative would be to make a more general multi-LR/SC support, but allow implementations to support it only in special cases. Off the top of my head, there could be "LR additional" (LRA) and "SC buffer" (SCB) instructions -- basically a nested LR/SC. LRA loads reserved without breaking existing reservations. SCB buffers an SC for writing, but it doesn't actually write until the next SC. If you overrun the hardware's transactional memory support, the SC certainly fails.

To make this usable, you'd need some sort of hardware profiles beyond whether the instruction exists or not. So if a certain CPU flag is present, the pair is guaranteed to be suitable to emulate DW-CAS. If another flag is present, it can emulate DCAS, etc. That way we only need 2 additional instructions, and they cover DW-CAS, CAS2 and other use cases.

Or maybe the better approach is just for the group to design RISCV-T, but again with the same caveat that it might only be strong enough for limited use cases.

Cheers,
-- Mike

Cesar Eduardo Barros

Feb 15, 2016, 17:44:42
to Mike Hamburg, isa...@lists.riscv.org
On 2016-02-15 17:21, Mike Hamburg wrote:
> An alternative would be to make a more general multi-LR/SC support, but allow implementations to support it only in special cases. Off the top of my head, there could be "LR additional" (LRA) and "SC buffer" (SCB) instructions -- basically a nested LR/SC. LRA loads reserved without breaking existing reservations. SCB buffers an SC for writing, but it doesn't actually write until the next SC. If you overrun the hardware's transactional memory support, the SC certainly fails.

Hm...

So my dw_cas example would become:

dw_cas:
    addi  t2, a0, 0        # t2 = address of the first word
    addi  a1, a0, 4        # a1 = address of the second word
    lr.w  t0, (t2)         # load-reserve the first word
    lra.w t1, (a1)         # load the second word without breaking the reservation

    li    a0, 1            # default result: failure
    bne   t0, a2, return
    bne   t1, a3, return

    scb.w a4, (t2)         # buffer the store of the new first word
    sc.w  a0, a5, (a1)     # store the second word and commit both; a0 = 0 on success
return:
    jr    ra

> To make this usable, you'd need some sort of hardware profiles beyond whether the instruction exists or not. So if a certain CPU flag is present, the pair is guaranteed to be suitable to emulate DW-CAS. If another flag is present, it can emulate DCAS, etc. That way we only need 2 additional instructions, and they cover DW-CAS, CAS2 and other use cases.

I can see two parameters to specify the hardware profile for LRA/SCB:

- The number of cachelines it can reserve;
- The number of buffers it can use to hold the result (each SCB uses one
of them).

Even the most restrictive implementation of LRA/SCB (one cacheline and
one buffer) would be able to emulate DW-CAS, as long as the cacheline is
at least two words long (16 bytes on RV64, a reasonable requirement),
and both values are in the same cacheline (just align them naturally for
their combined size and it'll work). Therefore, if these instructions
are present, they can always do the DW-CAS trick.

For DCAS/CAS2, you'd need only two cachelines and one buffer. For
transactional memory, you'd need more buffers, more cachelines, and
relaxed requirements on the retry loop contents.
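
For clarity of terminology: DCAS/CAS2 here means a compare-and-swap on two
independently chosen addresses, unlike the adjacent pair of DW-CAS. A
lock-based reference definition of its semantics (only to pin down what the
primitive means, not how LRA/SCB would implement it):

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

static pthread_mutex_t dcas_lock = PTHREAD_MUTEX_INITIALIZER;

/* Atomically: if *a == expect_a and *b == expect_b, replace both values. */
static bool dcas(uint64_t *a, uint64_t expect_a, uint64_t new_a,
                 uint64_t *b, uint64_t expect_b, uint64_t new_b)
{
    pthread_mutex_lock(&dcas_lock);
    bool ok = (*a == expect_a) && (*b == expect_b);
    if (ok) {
        *a = new_a;
        *b = new_b;
    }
    pthread_mutex_unlock(&dcas_lock);
    return ok;
}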

The "relaxed requirements" would be the sticky part for transactional
memory. For LR/SC or even LR/LRA/SCB/SC, you can have the "as long as it
fits within X instructions, no taken backwards branches on the success
path, and only simple integer instructions" rules, and guarantee
progress if these conditions are met. For more general transactional
memory, you might need a more general "always have a fallback path for
when it fails which does not use transactional memory" rule.
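
A sketch of that fallback rule in C; tm_begin()/tm_commit() are hypothetical
placeholders for whatever a transactional extension would expose, stubbed out
here so the example is self-contained (and therefore always takes the lock
path):

#include <pthread.h>
#include <stdbool.h>

/* Hypothetical transactional primitives, stubbed to "always fail". */
static inline bool tm_begin(void)  { return false; }
static inline void tm_commit(void) { }

static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;

/* Bounded retries on the transactional fast path, then a plain lock that
 * needs no hardware guarantee of progress. */
static void increment(long *counter)
{
    for (int i = 0; i < 8; i++) {
        if (tm_begin()) {
            (*counter)++;
            tm_commit();
            return;
        }
    }
    pthread_mutex_lock(&fallback_lock);
    (*counter)++;
    pthread_mutex_unlock(&fallback_lock);
}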

> Or maybe the better approach is just for the group to design RISCV-T, but again with the same caveat that it might only be strong enough for limited use cases.

A design focused on transactional memory might look very different than
a simple LR/SC extension; for instance, it might have separate ABORT and
COMMIT instructions, it might have different memory ordering semantics,
it might have a different way to specify a "fallback" path (LR/SC waits
until the SC to return success or failure; a transactional memory
extension might not want to wait until the COMMIT, which might be
hundreds of instructions ahead), and so on.

Trying to have a single design for both might lead to a worse design for
both.

Jonas Oberhauser

Dec 4, 2017, 08:38:03
to RISC-V ISA Dev, ces...@cesarb.eti.br, isa...@lists.riscv.org


On Monday, February 15, 2016 at 20:21:44 UTC+1, mike wrote:
An alternative would be to make a more general multi-LR/SC support, but allow implementations to support it only in special cases. Off the top of my head, there could be "LR additional" (LRA) and "SC buffer" (SCB) instructions -- basically a nested LR/SC. LRA loads reserved without breaking existing reservations. SCB buffers an SC for writing, but it doesn't actually write until the next SC. If you overrun the hardware's transactional memory support, the SC certainly fails.

To make this usable, you'd need some sort of hardware profiles beyond whether the instruction exists or not. So if a certain CPU flag is present, the pair is guaranteed to be suitable to emulate DW-CAS. If another flag is present, it can emulate DCAS, etc. That way we only need 2 additional instructions, and they cover DW-CAS, CAS2 and other use cases.

Or maybe the better approach is just for the group to design RISCV-T, but again with the same caveat that it might only be strong enough for limited use cases.

Cheers,
-- Mike


Something like this makes a lot of sense, but why not simply stack the LRs up to some implementation-defined (and discoverable) depth (greater or equal to zero)? Then you simply write "LR/LR/LR operation SC/SC/SC retry", and 1) only the last SC commits, and 2) only if all reservations are still held, which is stored in the result register for the last SC. Hardware could guarantee that if you do a successful non-committing SC, the reservation will continue to be held as long as you only do successful SCs (and similarly that while you do LRs the "timer" for previous reservations does not advance).

An interrupt would delete all existing reservations, clear the non-committed outstanding stores, clear the reservation stack, and consequently fail all remaining SCs (in particular the last one).

There are a few other things that would have to be specified; e.g., doing more SCs than there have been LRs, doing more LRs than the implementation can handle, doing LRs after a non-committing SC, or possibly even doing *anything except for an SC* after a non-committing SC. All of this should probably simply raise an illegal-instruction exception.
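
To make the "only the last SC commits" rule concrete, here is a toy
single-hart software model (not a normative part of the proposal: real
concurrency is ignored, and the unspecified cases above are simply asserted
away):

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DEPTH 4                      /* implementation-defined depth */

static struct { uint64_t *addr; bool valid; } res_stack[MAX_DEPTH];
static struct { uint64_t *addr; uint64_t val; } sc_buf[MAX_DEPTH];
static size_t lr_depth, sc_count;

/* LR: push a reservation and return the loaded value. */
static uint64_t model_lr(uint64_t *addr)
{
    assert(lr_depth < MAX_DEPTH);        /* deeper nesting is unspecified */
    res_stack[lr_depth].addr  = addr;
    res_stack[lr_depth].valid = true;
    lr_depth++;
    return *addr;
}

/* Interrupt: invalidate every reservation; the stack size is kept so the
 * later SCs still pair up with their LRs -- they just all fail. */
static void model_interrupt(void)
{
    for (size_t i = 0; i < lr_depth; i++)
        res_stack[i].valid = false;
}

/* SC: buffer the store.  The SC matching the first LR commits the whole
 * buffer iff every reservation is still valid.  Returns 0 on success and
 * 1 on failure, mirroring the real SC result register. */
static int model_sc(uint64_t *addr, uint64_t val)
{
    assert(sc_count < lr_depth);         /* an SC without an LR is unspecified */
    sc_buf[sc_count].addr = addr;
    sc_buf[sc_count].val  = val;
    sc_count++;
    if (sc_count < lr_depth)
        return 0;                        /* non-committing SC */

    bool ok = true;
    for (size_t i = 0; i < lr_depth; i++)
        ok = ok && res_stack[i].valid;
    if (ok)
        for (size_t i = 0; i < sc_count; i++)
            *sc_buf[i].addr = sc_buf[i].val;
    lr_depth = 0;
    sc_count = 0;
    return ok ? 0 : 1;
}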

I am reminded of a recent talk by Bob Tarjan where he talked about the necessity of an (IIRC) non-adjacent double-cas in order to solve some graph problem efficiently. I feel that a nested LR/SC is much better for this task.

PS: Sorry for reviving an ancient thread, is this considered bad etiquette? What would have been the proper protocol?

Rogier Brussee

Dec 4, 2017, 16:09:48
to RISC-V ISA Dev, ces...@cesarb.eti.br, isa...@lists.riscv.org


On Monday, December 4, 2017 at 14:38:03 UTC+1, Jonas Oberhauser wrote:


On Monday, February 15, 2016 at 20:21:44 UTC+1, mike wrote:
An alternative would be to make a more general multi-LR/SC support, but allow implementations to support it only in special cases. Off the top of my head, there could be "LR additional" (LRA) and "SC buffer" (SCB) instructions -- basically a nested LR/SC. LRA loads reserved without breaking existing reservations. SCB buffers an SC for writing, but it doesn't actually write until the next SC. If you overrun the hardware's transactional memory support, the SC certainly fails.

To make this usable, you'd need some sort of hardware profiles beyond whether the instruction exists or not. So if a certain CPU flag is present, the pair is guaranteed to be suitable to emulate DW-CAS. If another flag is present, it can emulate DCAS, etc. That way we only need 2 additional instructions, and they cover DW-CAS, CAS2 and other use cases.

Or maybe the better approach is just for the group to design RISCV-T, but again with the same caveat that it might only be strong enough for limited use cases.

Cheers,
-- Mike


Something like this makes a lot of sense, but why not simply stack the LRs up to some implementation-defined (and discoverable) depth (greater or equal to zero)? Then you simply write "LR/LR/LR operation SC/SC/SC retry", and 1) only the last SC commits, and 2) only if all reservations are still held, which is stored in the result register for the last SC. Hardware could guarantee that if you do a successful non-committing SC, the reservation will continue to be held as long as you only do successful SCs (and similarly that while you do LRs the "timer" for previous reservations does not advance).

An interrupt would delete all existing reservations, clear the non-committed outstanding stores, clear the reservation stack, and consequently fail all remaining SCs (in particular the last one).

There are a few other things that would have to be specified; e.g., doing more SCs than there have been LRs, doing more LRs than the implementation can handle, doing LRs after a non-committing SC, or possibly even doing *anything except for an SC* after a non-committing SC. All of this should probably simply raise an illegal-instruction exception.
 
This seems really elegant, but it does mean a stack of reservation state and dependencies between the LRs as well as the SCs
that has to be built up at runtime.

Here is a (possibly naive) proposal that avoids implicit state and allows upfront reservation. It also avoids introducing new instructions,
although it slightly extends lr.{wd}.

The current LR instruction has rs2 == zero. If LR is made into a proper 3-register instruction, it allows us to do the following:

LR.{WD} rd rs1 rs2 : l{wd} rd, 0(rs1), and reserve and monitor _at least_ the memory segment [rs1, rs1 + rs2 + sizeof({wd}))
(this convention assures that rs2 == zero reserves one word/doubleword, as in the current definition).
Here we must have 0 <= rs2 <= an implementation-defined maximum (typically a cacheline).

In the critical section, in addition to the usual I instructions, it is allowed to load and store
once in the interior range (rs1, rs1 + rs2 + sizeof({wd})).

SC.{WD} rd rs1 rs2 : s{wd} rs2, 0(rs1), and rd is 0 iff the reservation was successful and ALL loads and stores in the critical section succeeded
without interference. Otherwise rd equals an error code (1 by default).

The dw_cas example then becomes (if I understand it correctly):

# a0: address of the two words (a0 and a0+4)
# a1: expected value @  (a0)
# a2: expected value @ 4(a0)
# a3: value to write @  (a0) if the read values are as expected
# a4: value to write @ 4(a0) if the read values are as expected

dw_cas:
     mv   t2, a0
     li   a0, 1             # default result: failure

     li   t0, 4
     lr.w t0, (t2), t0      # load the first word, reserve [t2, t2+8)
     lw   t1, 4(t2)         # plain load of the second word

     bne  t0, a1, return
     bne  t1, a2, return

     sw   a4, 4(t2)         # interior store of the new second value
     sc.w a0, a3, (t2)      # store the new first value; a0 = 0 iff all succeeded
return:
     jr   ra



Or, avoiding two jumps in the middle of the critical section and making use of C instructions if possible:


d{wd}_cas:
     mv      t0, a3                     # save the new first value (a3 is reused)

     li      a3, sizeof({wd})
     lr.{wd} a3, (a0), a3               # load the first word, reserve the pair
     l{wd}   a5, sizeof({wd})(a0)       # interior load of the second word

     xor     a3, a3, a1                 # compare against the expected values
     xor     a5, a5, a2
     or      a5, a5, a3
     bnez    a5, fail

     s{wd}   a4, sizeof({wd})(a0)       # interior store of the new second value
     sc.{wd} a0, t0, (a0)               # store the new first value; a0 = 0 on success
     jr      ra

fail:
     li      a0, 1
     jr      ra






 
I am reminded of a recent talk by Bob Tarjan where he talked about the necessity of an (IIRC) non-adjacent double-cas in order to solve some graph problem efficiently. I feel that a nested LR/SC is much better for this task.


This proposal does not allow non-adjacent double CAS :-(, but is hopefully easier on the hardware :-).

Jonas Oberhauser

Dec 4, 2017, 16:31:29
to Rogier Brussee, RISC-V ISA Dev, ces...@cesarb.eti.br, isa...@lists.riscv.org
Thanks for your input. I don't think it will be that much easier on the hardware, considering how easy it is to build a small HW stack, in particular since depth would typically be two or less.

I think the biggest downside to your suggestion right now is that you can lose the lock due to an interrupt but your regular store would still get executed; so your current implementation is not atomic. Once you give buffered SC semantics to the interior store you lose all the advantages you may have had over the stack.

Of course I like the fact that you need only a single instruction for the locking, but I don't think it would be useful beyond doubling the address range, which makes it a big weapon for a small target. Also you will still need the additional loads and stores to get the values across :(

The other downside is that you'd need to communicate a second address over the bus to lock the range at once.

What do you think?



Rogier Brussee

Dec 4, 2017, 18:17:19
to RISC-V ISA Dev, rogier....@gmail.com, ces...@cesarb.eti.br, isa...@lists.riscv.org


On Monday, December 4, 2017 at 22:31:29 UTC+1, Jonas Oberhauser wrote:
Thanks for your input. I don't think it will be that much easier on the hardware, considering how easy it is to build a small HW stack, in particular since depth would typically be two or less.


The idea is that the implementation defined maximum is pretty small and you typically would lock just one cacheline, and fail if you ask for a size larger than a cacheline.
 

I think the biggest downside to your suggestion right now is that you can lose the lock due to an interrupt but your regular store would still get executed; so your current implementation is not atomic. Once you give buffered SC semantics to the interior store you lose all the advantages you may have had over the stack.


You would know that the operation failed, and that the first write would have taken place. I agree though, that as written it is pretty fatal :-(.. You would indeed have to commit the whole segment atomically on the final sc. Since that segment is known well in advance and supposed to be part of a cacheline that still seems a lot easier (perhaps naively) than finally committing a (small) stack of values at potentially unrelated locations in memory .   

 

Of course I like the fact that you need only a single instruction for the locking, but I don't think it would be useful beyond doubling the address range,

I probably misunderstand you. You don't double the address range; you define a range of addresses. E.g. if rs1 = 0xF000 and rs2 = 4, then lr.w rd rs1 rs2 is supposed to _read_ the 4 bytes at 0xF000..0xF003 and _lock_
the 4 + sizeof(w) = 8 addresses 0xF000, 0xF001, ..., 0xF006, 0xF007
 
which makes it a big weapon for a small target. Also you will still need the additional loads and stores to get the values across :(

The other downside is that you'd need to communicate a second address over the bus to lock the range at once.

The register rs2 is supposed to carry no more than a few bits' worth of information (and in the common case would be rs2 == zero). In fact, using the 5 bits of the rs2 field as an immediate to indicate the number of additional words/doublewords (i.e. the size of the range - 1) would seem enough (famous last words), but it would break the format of the instructions.
  
 
What do you think?


Thanks for your reaction!

Rogier

Jacob Bachmeyer

Dec 4, 2017, 23:31:54
to Jonas Oberhauser, RISC-V ISA Dev, ces...@cesarb.eti.br, isa...@lists.riscv.org
Jonas Oberhauser wrote:
> Something like this makes a lot of sense, but why not simply stack the
> LRs up to some implementation defined (and discoverable) depth
> (greater or equal to zero)? Then you simply write "LR/LR/LR
> operation SC/SC/SC retry" and 1) only the last SC commits, and 2)
> only if all reservations are still held, which is stored in the result
> register for the last SC. Hardware could guarantee that if you do a
> successful non-committing SC, the reservation will continue to be held
> as long as you only do successful SCs. (and similarly that while you
> do LRs the "timer" for previous reservations does not advance).
>
> An interrupt would delete all existing reservations, clear the
> non-committed outstanding stores, clear the reservation stack, and
> consequently fail all remaining SCs (in particular the last one).
>
> There are a few other things that would have to be specified; e.g.,
> doing more SCs than there have been LRs, doing more LRs than the
> implementation can handle, doing LRs after a non-committing SC, or
> possibly even doing *anything except for an SC* after a non-committing
> SC. All of this should probably simply raise an illegal-instruction exception.

I think that RVT is what you really want; the keyword to me was
"non-committing SC". That sounds like asking for transactional memory
to me. This might be an interesting possibility for building RVT on
RVA, however, by using the existing LR opcode to begin a transaction and
the existing SC opcode to attempt to commit a transaction. Could RVT
redefine RVAlrsc as a transaction containing a single word?


-- Jacob

Jonas Oberhauser

Dec 5, 2017, 03:20:56
to Rogier Brussee, RISC-V ISA Dev, ces...@cesarb.eti.br


On Dec 5, 2017 12:17 AM, "Rogier Brussee" <rogier....@gmail.com> wrote:


On Monday, December 4, 2017 at 22:31:29 UTC+1, Jonas Oberhauser wrote:
Thanks for your input. I don't think it will be that much easier on the hardware, considering how easy it is to build a small HW stack, in particular since depth would typically be two or less.


The idea is that the implementation defined maximum is pretty small and you typically would lock just one cacheline, and fail if you ask for a size larger than a cacheline.

Exactly :)

 

I think the biggest downside to your suggestion right now is that you can lose the lock due to an interrupt but your regular store would still get executed; so your current implementation is not atomic. Once you give buffered SC semantics to the interior store you lose all the advantages you may have had over the stack.


You would know that the operation failed, and that the first write would have taken place. I agree though, that as written it is pretty fatal :-(.. You would indeed have to commit the whole segment atomically on the final sc. Since that segment is known well in advance and supposed to be part of a cacheline that still seems a lot easier (perhaps naively) than finally committing a (small) stack of values at potentially unrelated locations in memory .   

 

Of course I like the fact that you need only a single instruction for the locking, but I don't think it would be useful beyond doubling the address range,

I probably misunderstand you. You don't double the address range; you define a range of addresses. E.g. if rs1 = 0xF000 and rs2 = 4, then lr.w rd rs1 rs2 is supposed to _read_ the 4 bytes at 0xF000..0xF003 and _lock_
the 4 + sizeof(w) = 8 addresses 0xF000, 0xF001, ..., 0xF006, 0xF007

In my mind, you always access an address range. Whether that range includes only one byte, two bytes, four bytes, or eight bytes :) You always access the range {base address, ..., base address+(access width-1)}
So when instead of four bytes you access eight, in my mind, you doubled the address range :)
What I meant with "don't think it would be useful beyond doubling the address range", I meant that I doubt there will be use cases that do more with it than double the address range (but of course I can't yet tell for sure).

I hope we understand each other now?

 
which makes it a big weapon for a small target. Also you will still need the additional loads and stores to get the values across :(

The other downside is that you'd need to communicate a second address over the bus to lock the range at once.

The register rs2 is supposed to be not more than a few bits worth of information (and in the common case would be  rs2 == zero)  In fact, using the 5 bit of the rs2 field as an immediate to indicate the number of additional words/ doublewords  ( i.e. size of the range - 1) would  seem enough (famous last words), but it would break the format of the instructions.

Yeah, you're right; it's not nearly as bad as I thought.
In case you're only interested in taking a cache line, why not (for now at least) say rs2 != x0 means the whole cache line?


Rogier Brussee

Dec 5, 2017, 03:54:15
to RISC-V ISA Dev, s9jo...@gmail.com, ces...@cesarb.eti.br, isa...@lists.riscv.org, jcb6...@gmail.com


On Tuesday, December 5, 2017 at 05:31:54 UTC+1, Jacob Bachmeyer wrote:
That is basically what I suggested.



-- Jacob

Jonas Oberhauser

Dec 5, 2017, 04:01:10
to Jacob Bachmeyer, RISC-V ISA Dev, ces...@cesarb.eti.br, isa...@lists.riscv.org


2017-12-05 5:31 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:

Jonas Oberhauser wrote:

I think that RVT is what you really want; the keyword to me was "non-committing SC".  That sounds like asking for transactional memory to me.  This might be an interesting possibility for building RVT on RVA, however, by using the existing LR opcode to begin a transaction and the existing SC opcode to attempt to commit a transaction. 

I don't really want RVT (which I assume means transactional RISC-V?), I want an efficient and small tool for generating small atomic operations.
In particular, to implement Bob Tarjan's non-adjacent double CAS (slide 25 of the talk) and similar tiny operations.

Before I continue, I have to mention that I don't know whether there are any features that a transactional memory has to provide to be useful in practice.
Having said that, I assume a full transactional memory would be somewhat overkill for this, and I don't think that the people who really want transactional memory would be happy with this little tool. 
On the other hand I'm not sure if a transactional memory would want to maintain the reservations for long.

It may still -- as you suggested -- be possible to implement a full transactional memory on top of such an LR and SC if the stacking depth provided by the hardware implementation is big and the hardware reserves memory for something more than 16 integer instructions (which it may. It might need to allow some loads and stores to other memory regions). It hadn't come to my mind before. If RVT only has to implement the operations from the Herlihy paper (http://cs.brown.edu/~mph/HerlihyM93/herlihy93transactional.pdf), I can imagine that it is possible to put all LT/LTX to the front as LRs  and all STs to the end as SCs (or for an RVT-certified implementation to provide additional guarantees about LRs/SCs, so that they don't have to be in a chain). Addresses for which you do not want to actually modify the value can use SC with the original value (or x0 or something) and could theoretically be optimized away in hardware. VALIDATE can simply check whether the locks are still held. COMMIT is the last SC. 

Do note that my suggestion either way is a natural extension of the current spec, since 
1) an ISA version that provides no LR/SC is just the special case of my suggestion where the maximum stack depth is zero.
2) an ISA version that provides "normal" LR/SC is just the special case of my suggestion where the maximum stack depth is one.
and for most purposes, it would be enough to provide a maximum stack depth of two.

For some of the examples in the Herlihy paper, you already need a stack depth of three or guarantees about non-LR loads (e.g, in the dequeue in the consumer/release example, you need to load the item after you LR'd the head and tail pointers).


 
Could RVT redefine RVAlrsc as a transaction containing a single word?
 
I guess so. It might be the special case where the addresses are the same. But is the transactional memory willing to provide the same reservation promises that RVA gives to LR/SC pairs? I.e., that the only way for the SC to fail (if it is close enough to the LR) is for an interrupt to have happened?


-- Jacob

Rogier Brussee

Dec 5, 2017, 04:44:40
to RISC-V ISA Dev, rogier....@gmail.com, ces...@cesarb.eti.br


On Tuesday, December 5, 2017 at 09:20:56 UTC+1, Jonas Oberhauser wrote:

[snip]
 
I probably misunderstand you. You don't double the address range; you define a range of addresses. E.g. if rs1 = 0xF000 and rs2 = 4, then lr.w rd rs1 rs2 is supposed to _read_ the 4 bytes at 0xF000..0xF003 and _lock_
the 4 + sizeof(w) = 8 addresses 0xF000, 0xF001, ..., 0xF006, 0xF007

In my mind, you always access an address range. Whether that range includes only one byte, two bytes, four bytes, or eight bytes :) You always access the range {base address, ..., base address+(access width-1)}
So when instead of four bytes you access eight, in my mind, you doubled the address range :)
What I meant with "don't think it would be useful beyond doubling the address range", I meant that I doubt there will be use cases that do more with it than double the address range (but of course I can't yet tell for sure).

I hope we understand each other now?

Yes :-) 
[snip]
 
The other downside is that you'd need to communicate a second address over the bus to lock the range at once.

The register rs2 is supposed to be not more than a few bits worth of information (and in the common case would be  rs2 == zero)  In fact, using the 5 bit of the rs2 field as an immediate to indicate the number of additional words/ doublewords  ( i.e. size of the range - 1) would  seem enough (famous last words), but it would break the format of the instructions.

Yeah, you're right; it's not nearly as bad as I thought.
In case you're only interested in taking a cache line, why not (for now at least) say rs2 != x0 means the whole cache line?


 
Because you don't know what the cacheline size is, and whether people in 2050 want to atomically commit 128k to their 2Mbit Quantum computer.   Perhaps it is a good idea to define rs2 in units of words/doublewords etc. though.

Cesar Eduardo Barros

Dec 5, 2017, 05:13:20
to Jonas Oberhauser, RISC-V ISA Dev, isa...@lists.riscv.org
What if an interrupt happens between the first and the second LR? Now
the stack length is wrong, and the commit will be done by the wrong SC.

In the other direction, what happens if the programmer starts an LR/SC
pair, but gives up before the SC (probably conditional on the value
loaded by the LR), and before the reservation clears another piece of
code starts a nested LR/LR/SC/SC sequence? The last SC, which should be
committing (and thus return the result), has just become a non-committing SC.

To avoid confusion, we do need a distinction between starting the
reservation (LR) and adding to it (LRA), and between a non-committing store
(SCB) and a committing store (SC).

Jonas Oberhauser

Dec 5, 2017, 05:39:05
to Cesar Eduardo Barros, RISC-V ISA Dev, isa...@lists.riscv.org
I was very imprecise when I said "clear the reservation stack". You do not resize it to zero (that would also cause problems on the later SCs). You invalidate it in a sense (and the later LRs do not reserve anything and are free to return garbage). Note that you only need to restore the size, which only uses a few bits, and the fact that it is invalid. The rest of the stack can be made garbage by the interrupt.




On the other direction, what happens if the programmer starts a LR/SC pair, but gives up before the SC (probably conditional on the value loaded by the LR), and before the reservation clears another piece of code starts a nested LR/LR/SC/SC sequence? The last SC, which should be committing (and thus return the result) has just become a non-commiting SC.

If you want to give up, you can simply do your SCs to empty the stack and write back the previous load results. HW can optimize these writes away (or they are single cycle anyway, in a normal cache system).

Every malloc needs its free and every LR needs its SC ;)


To avoid confusion, we do need a distinction between starting the reservation (LR) and adding to it (LRA), and between non-commiting store (SCB) and commiting store (SC).

 I'm not convinced. Can you

Jacob Bachmeyer

Dec 5, 2017, 23:13:05
to Jonas Oberhauser, RISC-V ISA Dev, ces...@cesarb.eti.br, isa...@lists.riscv.org
Jonas Oberhauser wrote:
> 2017-12-05 5:31 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
>
> Jonas Oberhauser wrote:
>
> I think that RVT is what you really want; the keyword to me was
> "non-committing SC". That sounds like asking for transactional
> memory to me. This might be an interesting possibility for
> building RVT on RVA, however, by using the existing LR opcode to
> begin a transaction and the existing SC opcode to attempt to
> commit a transaction.
>
>
> I don't really want RVT (which I assume means transactional RISC-V?),
> I want an efficient and small tool for generating small atomic operations.

RVT is transactional memory for RISC-V. As I see it, as soon as you can
have an SC that "succeeds but does not commit", you are actually talking
about transactional memory.

> Before I continue, I have to mention that I don't know whether there
> are any features that a transactional memory has to provide to be
> useful in practice.
> Having said that, I assume a full transactional memory would be
> somewhat overkill for this, and I don't think that the people who
> really want transactional memory would be happy with this little tool.
> On the other hand I'm not sure if a transactional memory would want to
> maintain the reservations for long.

A two-phase approach seems ideal: reservations are "hard" for a small
number of cycles, then become "soft" until either the transaction
completes, a trap is taken, or another processor in the system wants to
access that address, at which time the reservation is released
entirely. Another option (for RVN systems) is to add a "transaction
aborted" interrupt.

> It may still -- as you suggested -- be possible to implement a full
> transactional memory on top of such an LR and SC if the stacking depth
> provided by the hardware implementation is big and the hardware
> reserves memory for something more than 16 integer instructions (which
> it may. It might need to allow some loads and stores to other memory
> regions). It hadn't come to my mind before. If RVT only has to
> implement the operations from the Herlihy paper
> (http://cs.brown.edu/~mph/HerlihyM93/herlihy93transactional.pdf), I
> can imagine that it is possible to put all LT/LTX to the front as LRs
> and all STs to the end as SCs (or for an RVT-certified implementation
> to provide additional guarantees about LRs/SCs, so that they don't
> have to be in a chain). Addresses for which you do not want to
> actually modify the value can use SC with the original value (or x0 or
> something) and could theoretically be optimized away in hardware.
> VALIDATE can simply check whether the locks are still held. COMMIT is
> the last SC.

Thanks for the paper; it has been enlightening. There is sufficient
space in the AMO opcode to easily fit the six instructions from that
paper. I need to find time to read the new memory model draft spec --
if store-acquire really is to be deprecated, then the SC.aq opcode could
be reused for ST in RVT. The LR opcode would remain LR, but it has a
5-bit function code in the rs2 field, and other values in that field
would be sufficient to encode LT, LTX, and VALIDATE. At some cost in
logical grouping, ABORT and COMMIT could also be encoded with LR
variants. I like the idea of reusing the SC opcode for ST, with the
exact function determined by whether a transaction or simple memory
reservation was active. An LR introduces an LR/SC sequence, while an LT
or LTX introduces a transaction.

> Do note that my suggestion either way is a natural extension of the
> current spec, since
> 1) an ISA version that provides no LR/SC is just the special case of
> my suggestion where the maximum stack depth is zero.
> 2) an ISA version that provides "normal" LR/SC is just the special
> case of my suggestion where the maximum stack depth is one.
> and for most purposes, it would be enough to provide a maximum stack
> depth of two.

The problem with a stack depth greater than one is ambiguity in the case
of interrupts and context switching. One possible solution, if we can
limit the LR/SC stack depth to two and require more complex applications
to use RVT instead, is to require that the two LRs be both statically
and dynamically contiguous. This would mean that an LR immediately
after a trap return must silently fail. In this case, an interrupt
between the LR instructions causes the second LR to never acquire a
reservation, thus both SCs will fail. The SCs would also need to be
together, so the hardware can recognize the double SC and internally
queue the first SC before committing both at the second SC. Otherwise,
the first SC encountered succeeds and all reservations are released.


>
> Could RVT redefine RVAlrsc as a transaction containing a single word?
>
>
> I guess so. It might be the special case where the addresses are the
> same. But is the transactional memory willing to provide the same
> reservation promises that RVA gives to LR/SC pairs? I.e., that the
> only way for the SC to fail (if it is close enough to the LR) is for
> an interrupt to have happened?

Reading the Herlihy paper, transactional memory is a generalization of
LR/SC (then called LL/SC), so RVT will need to provide a similar
guarantee of forward progress in any case. I have since backed away
from that idea; there is plenty of available encoding space in the AMO
major opcode.


-- Jacob

Jonas Oberhauser

Dec 6, 2017, 03:13:43
to Jacob Bachmeyer, RISC-V ISA Dev, Cesar Eduardo Barros, isa...@lists.riscv.org

Jonas Oberhauser

Dec 6, 2017, 03:39:15
to Jacob Bachmeyer, RISC-V ISA Dev, Cesar Eduardo Barros
2017-12-06 5:12 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
Jonas Oberhauser wrote:
2017-12-05 5:31 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:

    Jonas Oberhauser wrote:

    I think that RVT is what you really want; the keyword to me was
    "non-committing SC".  That sounds like asking for transactional
    memory to me.  This might be an interesting possibility for
    building RVT on RVA, however, by using the existing LR opcode to
    begin a transaction and the existing SC opcode to attempt to
    commit a transaction. 

I don't really want RVT (which I assume means transactional RISC-V?), I want an efficient and small tool for generating small atomic operations.

RVT is transactional memory for RISC-V.  As I see it, as soon as you can have an SC that "succeeds but does not commit", you are actually talking about transactional memory.

At least about a non-trivial subset of it. If "transactional memory" is a superset of the Herlihy paper, then I am not yet talking about transactional memory :), and the people who want a transactional memory will not like the stack nature of my suggestion and in particular will want stronger guarantees.
 
Before I continue, I have to mention that I don't know whether there are any features that a transactional memory has to provide to be useful in practice.
Having said that, I assume a full transactional memory would be somewhat overkill for this, and I don't think that the people who really want transactional memory would be happy with this little tool. On the other hand I'm not sure if a transactional memory would want to maintain the reservations for long.

A two-phase approach seems ideal:  reservations are "hard" for a small number of cycles, then become "soft" until either the transaction completes, a trap is taken, or another processor in the system wants to access that address, at which time the reservation is released entirely.  Another option (for RVN systems) is to add a "transaction aborted" interrupt.

That makes sense.

In the case of my suggestion, it would be necessary to turn a soft reservation into a hard one again in case of a successful non-committing SC, so that you can be sure that on the committing SC, the previous reservations actually still hold and your operation can be completed.

 

It may still -- as you suggested -- be possible to implement a full transactional memory on top of such an LR and SC if the stacking depth provided by the hardware implementation is big and the hardware reserves memory for something more than 16 integer instructions (which it may. It might need to allow some loads and stores to other memory regions). It hadn't come to my mind before. If RVT only has to implement the operations from the Herlihy paper (http://cs.brown.edu/~mph/HerlihyM93/herlihy93transactional.pdf <http://cs.brown.edu/%7Emph/HerlihyM93/herlihy93transactional.pdf>), I can imagine that it is possible to put all LT/LTX to the front as LRs  and all STs to the end as SCs (or for an RVT-certified implementation to provide additional guarantees about LRs/SCs, so that they don't have to be in a chain). Addresses for which you do not want to actually modify the value can use SC with the original value (or x0 or something) and could theoretically be optimized away in hardware. VALIDATE can simply check whether the locks are still held. COMMIT is the last SC.

Thanks for the paper; it has been enlightening.  There is sufficient space in the AMO opcode to easily fit the six instructions from that paper. 

Like I said I am not sure if those six operations are enough to make transactional memory useful in practice. I'd wait for input of someone who actually uses transactional memory :) (at least in a model).
 
I like the idea of reusing the SC opcode for ST, with the exact function determined by whether a transaction or simple memory reservation was active.  An LR introduces an LR/SC sequence, while an LT or LTX introduces a transaction.

That sounds neat. I think it may not even be necessary to keep track of the "mode" between interrupts, because both of them have the same semantics after an interrupt (fail).
 
Do note that my suggestion either way is a natural extension of the current spec, since 1) an ISA version that provides no LR/SC is just the special case of my suggestion where the maximum stack depth is zero.
2) an ISA version that provides "normal" LR/SC is just the special case of my suggestion where the maximum stack depth is one.
and for most purposes, it would be enough to provide a maximum stack depth of two.

The problem with a stack depth greater than one is ambiguity in the case of interrupts and context switching.  One possible solution, if we can limit the LR/SC stack depth to two and require more complex applications to use RVT instead, is to require that the two LRs be both statically and dynamically contiguous.  This would mean that an LR immediately after a trap return must silently fail.  In this case, an interrupt between the LR instructions causes the second LR to never acquire a reservation, thus both SCs will fail.  The SCs would also need to be together, so the hardware can recognize the double SC and internally queue the first SC before committing both at the second SC.  Otherwise, the first SC encountered succeeds and all reservations are released.

Well, to have all the LRs and SCs contiguous was part of the idea. 
I don't understand though where you think the ambiguity comes from. Already in the current spec you can have 
LR -> context switch -> SC to same virtual and physical address

(for example if you reschedule a different thread of the same multithreaded user)

In any case you need to remember only the current stack depth and that the reservation has been invalidated.
Maybe with some careful thought and definition of semantics you can even reset the stack depth to zero, but I'm not sure. (For example, LRs while a "reservation invalidated" signal is active would fail and not even emit memory operations, SCs would clear the invalidation bit and fail if stack depth is zero. In this case LRLR (interrupt) LR .... SCSCSC would still work the way you expect...)

 
    Could RVT redefine RVAlrsc as a transaction containing a single word?

 I guess so. It might be the special case where the addresses are the same. But is the transactional memory willing to provide the same reservation promises that RVA gives to LR/SC pairs? I.e., that the only way for the SC to fail (if it is close enough to the LR) is for an interrupt to have happened?

Reading the Herlihy paper, transactional memory is a generalization of LR/SC (then called LL/SC), so RVT will need to provide a similar guarantee of forward progress in any case.  I have since backed away from that idea; there is plenty of available encoding space in the AMO major opcode.

Does LL/SC traditionally have a forward-progress guarantee? I didn't know that.


PS: Sorry for the spam with the previous mail. I hit the send button by accident again :(

Jacob Bachmeyer

Dec 6, 2017, 19:16:21
to Jonas Oberhauser, RISC-V ISA Dev, Cesar Eduardo Barros
Jonas Oberhauser wrote:
> 2017-12-06 5:12 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
>
> Jonas Oberhauser wrote:
>
> 2017-12-05 5:31 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
>
> Jonas Oberhauser wrote:
>
> I think that RVT is what you really want; the keyword to
> me was
> "non-committing SC". That sounds like asking for
> transactional
> memory to me. This might be an interesting possibility for
> building RVT on RVA, however, by using the existing LR
> opcode to
> begin a transaction and the existing SC opcode to attempt to
> commit a transaction.
>
> I don't really want RVT (which I assume means transactional
> RISC-V?), I want an efficient and small tool for generating
> small atomic operations.
>
>
> RVT is transactional memory for RISC-V. As I see it, as soon as
> you can have an SC that "succeeds but does not commit", you are
> actually talking about transactional memory.
>
>
> At least about a non-trivial subset of it. If "transactional memory"
> is a superset of the herlihy paper, then I am not yet talking about
> transactional memory :), and the people who want a transactional
> memory will not like the stack nature of my suggestion and in
> particular will want stronger guarantees.

This is another reason to keep the maximum LR/SC stack depth very small
-- while the hardware resources for nested LR/SC and full RVT can
probably be shared to some extent, a large LR/SC stack is about as
complex as RVT, but lacks the "safety net" that VALIDATE provides. As a
building block for intermediate-complexity atomic primitives (like
multi-word CAS) nested LR/SC could be useful, but as "almost
transactional memory access", nested LR/SC is a landmine for software
developers to stumble upon, creating nasty race-condition bugs. Abusing
nested LR/SC will probably work just enough to get past testing (if this
sort of developer even does testing at all) but blow up in production.

Could nested LR/SC be part of RVT, along with the full transactional
instructions? The RVT hardware support would be easily able to support
nested LR/SC. Is the hardware required for nested LR/SC any simpler
than that required for full transactional memory support?

>
> Before I continue, I have to mention that I don't know whether
> there are any features that a transactional memory has to
> provide to be useful in practice.
> Having said that, I assume a full transactional memory would
> be somewhat overkill for this, and I don't think that the
> people who really want transactional memory would be happy
> with this little tool. On the other hand I'm not sure if a
> transactional memory would want to maintain the reservations
> for long.
>
>
> A two-phase approach seems ideal: reservations are "hard" for a
> small number of cycles, then become "soft" until either the
> transaction completes, a trap is taken, or another processor in
> the system wants to access that address, at which time the
> reservation is released entirely. Another option (for RVN
> systems) is to add a "transaction aborted" interrupt.
>
>
> That makes sense.
>
> In the case of my suggestion, it would be necessary to turn a soft
> reservation into a hard one again in case of a successful
> non-committing SC, so that you can be sure that on the committing SC,
> the previous reservations actually still hold and your operation can
> be completed.

Why? A non-committing SC effectively *is* ST; if one ST fails, all
subsequent STs will fail and nothing commits until all STs have
succeeded. I am currently considering (obvious in the context of RISC-V
as I see it) one "store-transactional-and-validate" instruction and one
"store-transactional-and-commit" instruction -- every transactional
store returns a transaction state value (zero == transaction valid or
successfully committed; non-zero == error) just like SC does. (Of
course they do -- I plan to propose using the SC opcode for both forms
of ST!)

> It may still -- as you suggested -- be possible to implement a
> full transactional memory on top of such an LR and SC if the
> stacking depth provided by the hardware implementation is big
> and the hardware reserves memory for something more than 16
> integer instructions (which it may. It might need to allow
> some loads and stores to other memory regions). It hadn't come
> to my mind before. If RVT only has to implement the operations
> from the Herlihy paper
> (http://cs.brown.edu/~mph/HerlihyM93/herlihy93transactional.pdf),
> I can imagine that it is possible to put all LT/LTX to the
> front as LRs and all STs to the end as SCs (or for an
> RVT-certified implementation to provide additional guarantees
> about LRs/SCs, so that they don't have to be in a chain).
> Addresses for which you do not want to actually modify the
> value can use SC with the original value (or x0 or something)
> and could theoretically be optimized away in hardware.
> VALIDATE can simply check whether the locks are still held.
> COMMIT is the last SC.
>
>
> Thanks for the paper; it has been enlightening. There is
> sufficient space in the AMO opcode to easily fit the six
> instructions from that paper.
>
>
> Like I said I am not sure if those six operations are enough to make
> transactional memory useful in practice. I'd wait for input of someone
> who actually uses transactional memory :) (at least in a model).

The authors of that paper used transactional memory in a model. :-)

> I like the idea of reusing the SC opcode for ST, with the exact
> function determined by whether a transaction or simple memory
> reservation was active. An LR introduces an LR/SC sequence, while
> an LT or LTX introduces a transaction.
>
>
> That sounds neat. I think it may not even be necessary to keep track
> of the "mode" between interrupts, because both of them have the same
> semantics after an interrupt (fail).

Exactly, taking a trap invalidates all reservations, including
transactions. (Taking a trap *must* erase the internal atomic operation
state -- the trap handler may need to use RVA.)
The problem is that hardware does not see software threads as distinct.
Thread A: LR/LR/LR/(preempted). Thread B: LR/LR/LR/SC/SC/SC. The 3rd
SC in thread B does *not* commit because there are still three LRs on
the stack and reservations are invalid. Eventually, thread A resumes:
LR/SC/SC/SC/SC. The fourth SC *finally* drains the LR/SC stack and
resets the state, meanwhile, all uses of LR/SC either fail or (worse!)
appear to succeed but do not actually commit. And what happens if
thread A is migrated and resumed on a different hart? Resetting the
whole LR/SC stack on a trap only moves the problem: now, when thread A
resumes, the *first* SC commits and the rest fail ... oops!

If the stack depth is limited to two, we can avoid this problem by
requiring LR immediately after a trap return to silently fail to acquire
a reservation. Since the maximum length of an LR/LR sequence is two,
the only possible place where confusion can arise is if a trap is taken
in the middle of a pair. After xRET, hardware does not know if LR
stands alone or is the second instruction in a pair. Failing the
reservation ensures that a retry will arrive at the correct starting
point, regardless of whether the LR was standalone or part of a pair.

Actually, this reasoning generalizes: for any maximum stack depth,
confusion is only possible if execution resumes in the middle of an
LR///LR sequence. This can be caught with an internal "reservations
forbidden" flag that is set by xRET and cleared by any instruction
*except* LR. LR silently fails to acquire a reservation if the
"reservations forbidden" flag is set, but must still read memory.
Skipping memory reads is *not* safe -- a program may attempt to LR a
pointer and follow that link through a structure; this will cause
SIGSEGV if LR produces a bogus value. As there are no reservations,
*all* following SCs must fail. All reservations are released (and the
LR/SC stack cleared) when a trap is taken.

> Could RVT redefine RVAlrsc as a transaction containing a
> single word?
>
> I guess so. It might be the special case where the addresses
> are the same. But is the transactional memory willing to
> provide the same reservation promises that RVA gives to LR/SC
> pairs? I.e., that the only way for the SC to fail (if it is
> close enough to the LR) is for an interrupt to have happened?
>
>
> Reading the Herlihy paper, transactional memory is a
> generalization of LR/SC (then called LL/SC), so RVT will need to
> provide a similar guarantee of forward progress in any case. I
> have since backed away from that idea; there is plenty of
> available encoding space in the AMO major opcode.
>
>
> Does LL/SC traditionally have a forward guarantee? I didn't know that.

So that is the difference between LL/SC and LR/SC? I thought it was
just a different name.


-- Jacob

Bruce Hoult

Dec 7, 2017, 05:04:19
to Jacob Bachmeyer, Jonas Oberhauser, RISC-V ISA Dev, Cesar Eduardo Barros
On Thu, Dec 7, 2017 at 3:16 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
Actually, this reasoning generalizes:  for any maximum stack depth, confusion is only possible if execution resumes in the middle of an LR///LR sequence.  This can be caught with an internal "reservations forbidden" flag that is set by xRET and cleared by any instruction *except* LR.  LR silently fails to acquire a reservation if the "reservations forbidden" flag is set, but must still read memory.  Skipping memory reads is *not* safe -- a program may attempt to LR a pointer and follow that link through a structure; this will cause SIGSEGV if LR produces a bogus value.  As there are no reservations, *all* following SCs must fail.  All reservations are released (and the LR/SC stack cleared) when a trap is taken.

The "reservations forbidden" flag should be set by xRET iff reservations existed when the trap was taken. Otherwise taking a trap just before the first (or only) LR will needlessly fail.

Seems to me a small state machine is needed:

0) NORMAL     no reservations exist.  LR -> 1
1) RESERVING  reservations exist.  LR -> 1, SC -> 0 or 3, trap/ret -> 4, else -> 2
2) INCRITICAL reservations exist.  LR -> ILLEGAL, SC -> 0 or 3, trap/ret -> 0, else -> 2
3) COMMITTING reservations exist.  SC -> 0 or 3, trap/ret -> 0, else -> ILLEGAL
4) FORBIDRES  no reservations exist.  LR -> ILLEGAL, trap/ret -> 4, else -> 0
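
Transcribed into C for clarity, with two assumptions: the "0 or 3" outcome of
an SC is resolved by whether that SC drains the reservation stack (which the
hardware knows), and events not listed for a state leave the state unchanged;
ILLEGAL is modeled as -1.

#include <stdbool.h>

enum lrsc_state { NORMAL, RESERVING, INCRITICAL, COMMITTING, FORBIDRES };
enum lrsc_event { EV_LR, EV_SC, EV_TRAP_OR_RET, EV_OTHER };

#define ILLEGAL (-1)

static int next_state(enum lrsc_state s, enum lrsc_event e, bool last_sc)
{
    switch (s) {
    case NORMAL:                         /* no reservations exist */
        return e == EV_LR ? RESERVING : NORMAL;
    case RESERVING:                      /* reservations exist */
        if (e == EV_LR)          return RESERVING;
        if (e == EV_SC)          return last_sc ? NORMAL : COMMITTING;
        if (e == EV_TRAP_OR_RET) return FORBIDRES;
        return INCRITICAL;
    case INCRITICAL:                     /* reservations exist */
        if (e == EV_LR)          return ILLEGAL;
        if (e == EV_SC)          return last_sc ? NORMAL : COMMITTING;
        if (e == EV_TRAP_OR_RET) return NORMAL;
        return INCRITICAL;
    case COMMITTING:                     /* reservations exist */
        if (e == EV_SC)          return last_sc ? NORMAL : COMMITTING;
        if (e == EV_TRAP_OR_RET) return NORMAL;
        return ILLEGAL;
    case FORBIDRES:                      /* no reservations exist */
        if (e == EV_LR)          return ILLEGAL;
        if (e == EV_TRAP_OR_RET) return FORBIDRES;
        return NORMAL;
    }
    return ILLEGAL;
}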

Cesar Eduardo Barros

Dec 7, 2017, 05:49:04
to jcb6...@gmail.com, Jonas Oberhauser, RISC-V ISA Dev
On 2017-12-06 22:16, Jacob Bachmeyer wrote:
That is not compatible with the current definition of LR. Consider the
following instruction sequence: an ECALL immediately followed by an LR.
Under the current rules, this will work as expected. Under your proposed
"LR fails immediately after xRET" rule, this will never work, and a
perfectly valid LR...SC loop can now loop forever, just because the
compiler happened to move an ECALL to immediately before it.

> Actually, this reasoning generalizes:  for any maximum stack depth,
> confusion is only possible if execution resumes in the middle of an
> LR///LR sequence.  This can be caught with an internal "reservations
> forbidden" flag that is set by xRET and cleared by any instruction
> *except* LR.  LR silently fails to acquire a reservation if the
> "reservations forbidden" flag is set, but must still read memory.
> Skipping memory reads is *not* safe -- a program may attempt to LR a
> pointer and follow that link through a structure; this will cause
> SIGSEGV if LR produces a bogus value.  As there are no reservations,
> *all* following SCs must fail.  All reservations are released (and the
> LR/SC stack cleared) when a trap is taken.

Cesar Eduardo Barros

Dec 7, 2017, 06:04:32
to Jonas Oberhauser, RISC-V ISA Dev, isa...@lists.riscv.org
On 2017-12-05 08:39, Jonas Oberhauser wrote:
>
> On Dec 5, 2017 11:13 AM, "Cesar Eduardo Barros" <ces...@cesarb.eti.br
> <mailto:ces...@cesarb.eti.br>> wrote:
>
> What if an interrupt happens between the first and the second LR?
> Now the stack length is wrong, and the commit will be done by the
> wrong SC.
>
>
> I was very imprecise when I said "clear the reservation stack". You do
> not resize it to zero (that would also cause problems on the later SCs).
> You invalidate it in a sense (and the later LR do not reserve anything
> and are free to return garbage). Note that you only need to restore the
> size, which only uses a few bits, and the fact that it is invalid. The
> rest of the stack can be made garbage by the interrupt.

That needs saving more state than the very restricted state RISC-V
currently saves on a trap. That would be a hard sell.

> On the other direction, what happens if the programmer starts a
> LR/SC pair, but gives up before the SC (probably conditional on the
> value loaded by the LR), and before the reservation clears another
> piece of code starts a nested LR/LR/SC/SC sequence? The last SC,
> which should be committing (and thus return the result) has just
> become a non-committing SC.
>
>
> If you want to give up, you can simply do your SCs to empty the stack
> and write back the previous load results. HW can optimize these writes
> away (or they are single cycle anyway, in a normal cache system).
>
> Every malloc needs its free and every LR needs its SC ;)

That's not compatible with current uses of LR/SC. Even the ISA spec
itself, in its CAS sample code, abandons the LR without doing a
corresponding SC in case the compare fails; I expect most CAS emulation
code in the wild to do the same.

Your proposal needs LR to be treated as a strong reservation, which can
only be cleared by a corresponding SC. However, the current ISA treats
LR as a fragile thing, which can be cleared by nearly everything, and
current users of the ISA expect and depend on that.

Jonas Oberhauser

non lue,
7 déc. 2017, 06:38:1507/12/2017
à Jacob Bachmeyer,RISC-V ISA Dev,Cesar Eduardo Barros
2017-12-07 1:16 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
Jonas Oberhauser wrote:
2017-12-06 5:12 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com <mailto:jcb6...@gmail.com>>:

    Jonas Oberhauser wrote:

        2017-12-05 5:31 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com
        <mailto:jcb6...@gmail.com> <mailto:jcb6...@gmail.com

        <mailto:jcb6...@gmail.com>>>:

At least about a non-trivial subset of it. If "transactional memory" is a superset of the Herlihy paper, then I am not yet talking about transactional memory :), and the people who want a transactional memory will not like the stack nature of my suggestion and in particular will want stronger guarantees.

This is another reason to keep the maximum LR/SC stack depth very small -- while the hardware resources for nested LR/SC and full RVT can probably be shared to some extent, a large LR/SC stack is about as complex as RVT, but lacks the "safety net" that VALIDATE provides.  As a building block for intermediate-complexity atomic primitives (like multi-word CAS) nested LR/SC could be useful, but as "almost transactional memory access", nested LR/SC is a landmine for software developers to stumble upon, creating nasty race-condition bugs.  Abusing nested LR/SC will probably work just enough to get past testing (if this sort of developer even does testing at all) but blow up in production.

I completely agree.

 
Could nested LR/SC be part of RVT, along with the full transactional instructions?  The RVT hardware support would be easily able to support nested LR/SC. 

Sure.
 
Is the hardware required for nested LR/SC any simpler than that required for full transactional memory support?

For a small stack depth definitely. For example, the validation is much simpler, since you can (see below) validate one-by-one on the SCs, rather than having to do a complete validation at some point.
 
   
    A two-phase approach seems ideal:  reservations are "hard" for a
    small number of cycles, then become "soft" until either the
    transaction completes, a trap is taken, or another processor in
    the system wants to access that address, at which time the
    reservation is released entirely.


That makes sense.

In the case of my suggestion, it would be necessary to turn a soft reservation into a hard one again in case of a successful non-committing SC, so that you can be sure that on the committing SC, the previous reservations actually still hold and your operation can be completed.
Why?  A non-committing SC effectively *is* ST; if one ST fails, all subsequent STs will fail and nothing commits until all STs have succeeded.  I am currently considering (obvious in the context of RISC-V as I see it) one "store-transactional-and-validate" instruction and one "store-transactional-and-commit" instruction -- every transactional store returns a transaction state value (zero == transaction valid or successfully committed; non-zero == error) just like SC does.  (Of course they do -- I plan to propose using the SC opcode for both forms of ST!)


The problem is the reverse one. Imagine every SC succeeds, but by the time you do the last SC, an earlier reservation has been given up and someone else wrote to that address. You would have to validate again on the last SC.

If you turn the soft res. into a hard res. which is guaranteed to remain hard until you execute something except a (successful) SC, you do not have to validate again on the last SC: all reservations are guaranteed to still hold and you can just commit immediately.

Like I said I am not sure if those six operations are enough to make transactional memory useful in practice. I'd wait for input of someone who actually uses transactional memory :) (at least in a model).

The authors of that paper used transactional memory in a model.  :-)

Ah but there is a difference between "I created a model and prove how great it is with these three select examples that work great" and "these other guys created a model that is really great and I have built some real stuff in it" (which can turn into "these other guys created a model which is cool but to implement real stuff in it, I also needed xyz, which I added") :)

 
    I like the idea of reusing the SC opcode for ST, with the exact
    function determined by whether a transaction or simple memory
    reservation was active.  An LR introduces an LR/SC sequence, while
    an LT or LTX introduces a transaction.


That sounds neat. I think it may not even be necessary to keep track of the "mode" between interrupts, because both of them have the same semantics after an interrupt (fail).

Exactly, taking a trap invalidates all reservations, including transactions.  (Taking a trap *must* erase the internal atomic operation state -- the trap handler may need to use RVA.)


I don't know what RVA stands for, but I agree with the rest.

 The problem is that hardware does not see software threads as distinct.  Thread A: LR/LR/LR/(preempted).  Thread B: LR/LR/LR/SC/SC/SC.  The 3rd SC in thread B does *not* commit because there are still three LRs on the stack and reservations are invalid.  Eventually, thread A resumes:  LR/SC/SC/SC/SC.  The fourth SC *finally* drains the LR/SC stack and resets the state, meanwhile, all uses of LR/SC either fail or (worse!) appear to succeed but do not actually commit.  And what happens if thread A is migrated and resumed on a different hart?  Resetting the whole LR/SC stack on a trap only moves the problem:  now, when thread A resumes, the *first* SC commits and the rest fail ... oops!

Nice example, thanks. In the semantics I had in mind, all LR/SC beyond the first preemption fail. Including (and I hadn't thought about that) those of thread B, which should be safe to execute.

I am now copy&pasting Cesar's posts (so please excuse any out-of-order experiences you may have in your mail clients):
on having LR fail after xRET:
That is not compatible with the current definition of LR. Consider the following instruction sequence: an ECALL immediately followed by a LR. Under the current rules, this will work as expected. Under your proposed "LR fails immediately after xRET" rule, this will never work, and a perfectly valid LR...SC loop can now loop forever, just because the compiler happened to move an ECALL to immediately before it.
on storing enough context not to have the LR fail after xRET:
That needs saving more state than the very restricted state RISC-V currently saves on a trap. That would be a hard sell. 

Well, I think not for levels 0 and 1, in which case you do not even need an explicit stack -- you can probably maintain the "stack" in terms of the reservations you have. Once the stack is slightly bigger (2 or more), you probably do need a special purpose register to include the current stack depth, which you can save and restore in case of nested interrupts.
 
on using SC to reset the reservation stack:
That's not compatible with current uses of LR/SC. Even the ISA spec itself, in its CAS sample code, abandons the LR without doing a corresponding SC in case the compare fails; I expect most CAS emulation code in the wild to do the same.

Hmmm... Maybe we can reset the stack before the first non-consecutive LR or AMO? e.g., LR, operation code, (stack is reset) LR? In this case you can abandon the LR without an SC. Since I only need nested LRs of the form LR/LR/LR, that policy would work for me.

If that doesn't work, I think that makes the idea infeasible unless people generally agree that this spec is worth the "hard sell" and the backwards incompatibility. Too bad, I thought it could be elegant :)

Let's discuss again LR/LRA + SCB/SC.
First note that I do not think that LRA is necessary, since LR can already just not break previous reservations.
The distinction SCB/SC is necessary without a reservation stack. But I am afraid that the only advantage that SCB/SC brings over a reservation stack is that you no longer need to pop the stack with SCBs/SCs.
You still need to store a few extra bits on an interrupt, which have to be context stored/restored, such as "did I have a pending LR?" (to prevent the "first LR after xRET" problem).



So that is the difference between LL/SC and LR/SC?  I thought it was just a different name.

Don't ask me, I only know what an ARM guy told me in passing at a conference three years ago :)) But if Wikipedia is to be believed, yes, that's a guarantee practically unique to RISC-V's LR/SC.

Cesar Eduardo Barros

non lue,
7 déc. 2017, 18:12:1607/12/2017
à Jonas Oberhauser,Jacob Bachmeyer,RISC-V ISA Dev
I've thought about it a bit more today while walking from work.

The reason you would need a stack (actually a depth counter) is to know
which is the last SC. Any SC before that one is just a preparation; the
last SC is the one that actually commits, unless any of the SCs had lost
its matching reservation.

That is, the last SC is actually doing two separate things: buffering a
store, like the SCs preceding it, and committing all the writes to the
memory system. This second part could be made into a separate
instruction under the SYSTEM opcode, so you would have SC/SC/SC/COMMIT.
This wastes less opcode space than adding a new SC variant, and could be
macro-op fused in case it immediately follows the final SC.

This still leaves a compatibility problem: how can the processor know
whether the first SC is from a multiple SC sequence, in which case it
should be buffered until either the COMMIT or the loss of the
reservation (COMMIT has to return success/failure like SC), or from a
simple LR/SC pair, in which case it should be committed immediately?

So I thought of a cool trick: have another SYSTEM instruction, let's
call it BEGIN because I've been using SQL at work, which sets a volatile
bit of state saying "we're now in a multiple SC sequence". This bit is
as volatile as the reservations, that is, traps, xRET, and anything else
that could break a reservation would unset that bit.

Since that bit can be unset at any time, we have to make sure any SC
fails if that happens, even if an interrupt comes in the middle of a
sequence of loads. So we use the following tricks:

- Any BEGIN or COMMIT will clear all the reservations (COMMIT obviously
also clears that bit);
- We don't use LR. Instead, we make normal loads act as LR when that bit
is set. If that bit (and any reservations) were lost, we won't acquire
any new reservations.
- We make LR clear that bit, and acquire a reservation as normal.
- We make normal stores also clear that bit.

The result is that we have a sequence like BEGIN/LD/LD/SC/SC/COMMIT for
double-wide LR/SC. None of these instructions have to be adjacent
(unless you want to benefit from macro-op fusion). A trap anywhere in
the middle of that sequence will clear both the reservations and the
bit, ensuring that the following SC and/or COMMIT fails, and that no new
reservations are acquired until then (because LD acquires no reservation
when that bit is unset). We can abandon the whole sequence without fear,
since the start of a new sequence will clear that bit and any leftover
reservations; we can even SC less than we had reserved. And we don't
need to save any extra state; the retry after the COMMIT will take care
of setting it up again for us. All that at the cost of a pair of new
SYSTEM instructions.

(Thinking about it a bit more now, the new instructions could be in the
AMO group instead of the SYSTEM group. Also, both BEGIN and COMMIT would
naturally need to have .aq and .rl bits in the instruction encoding, and
COMMIT also needs a destination register.)
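
To make the shape of such a sequence concrete, here is a rough sketch of a DW-CAS under this scheme. Everything in it is hypothetical: "begin" and "commit" stand in for the proposed new instructions (with COMMIT writing its success code to a destination register, as noted above), and the mnemonics, operand forms and .aq/.rl handling are placeholders rather than a worked-out encoding.

# a0 = address of the aligned word pair, a2/a3 = expected old values,
# a4/a5 = new values. Returns 0 in a0 on success, 1 on compare mismatch.
dw_cas_sketch:
    addi t3, a0, 4      # address of the second word
retry:
    begin               # start a multiple-SC sequence: sets the mode bit
                        # and drops any stale reservations
    lw t0, 0(a0)        # ordinary loads acquire reservations while the bit is set
    lw t1, 4(a0)
    bne t0, a2, fail
    bne t1, a3, fail
    sc.w t2, a4, (a0)   # buffered, not yet visible to other harts
    sc.w t2, a5, (t3)   # also buffered
    commit t2           # stores both words iff the reservations held; t2 = 0 on success
    bnez t2, retry      # a reservation was lost somewhere: start over
    li a0, 0
    ret
fail:
    li a0, 1            # abandoning mid-sequence is fine: the next BEGIN
    ret                 # clears the bit and any leftover reservations

(The failure path simply abandons the sequence, relying on the property described above that a later BEGIN clears the bit and any leftover reservations.)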

Jacob Bachmeyer

non lue,
7 déc. 2017, 23:34:0807/12/2017
à Bruce Hoult,Jonas Oberhauser,RISC-V ISA Dev,Cesar Eduardo Barros
Bruce Hoult wrote:
> On Thu, Dec 7, 2017 at 3:16 AM, Jacob Bachmeyer <jcb6...@gmail.com
> <mailto:jcb6...@gmail.com>> wrote:
>
> Actually, this reasoning generalizes: for any maximum stack
> depth, confusion is only possible if execution resumes in the
> middle of an LR///LR sequence. This can be caught with an
> internal "reservations forbidden" flag that is set by xRET and
> cleared by any instruction *except* LR. LR silently fails to
> acquire a reservation if the "reservations forbidden" flag is set,
> but must still read memory. Skipping memory reads is *not* safe
> -- a program may attempt to LR a pointer and follow that link
> through a structure; this will cause SIGSEGV if LR produces a
> bogus value. As there are no reservations, *all* following SCs
> must fail. All reservations are released (and the LR/SC stack
> cleared) when a trap is taken.
>
>
> The "reservations forbidden" flag should be set by xRET iff
> reservations existed when the trap was taken. Otherwise taking a trap
> just before the first (or only) LR will needlessly fail.

The cost of that one failure is trivial: the LR/SC sequence is
retried. On the other hand, the state information needed to avoid that
is non-trivial: the processor must track whether a reservation existed
at the time the trap occurred. This has to be visible state, since it
must be saved/restored during a task switch. Permitting the failure
avoids needing bits in *status.


-- Jacob

Jacob Bachmeyer

non lue,
8 déc. 2017, 00:56:1608/12/2017
à Jonas Oberhauser,RISC-V ISA Dev,Cesar Eduardo Barros
Jonas Oberhauser wrote:
> 2017-12-07 1:16 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com
> <mailto:jcb6...@gmail.com>>:
>
> Jonas Oberhauser wrote:
>
> 2017-12-06 5:12 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com
> <mailto:jcb6...@gmail.com> <mailto:jcb6...@gmail.com
> <mailto:jcb6...@gmail.com>>>:
>
> Jonas Oberhauser wrote:
>
> 2017-12-05 5:31 GMT+01:00 Jacob Bachmeyer
> <jcb6...@gmail.com <mailto:jcb6...@gmail.com>
> <mailto:jcb6...@gmail.com
> <mailto:jcb6...@gmail.com>> <mailto:jcb6...@gmail.com
> <mailto:jcb6...@gmail.com>
> <mailto:jcb6...@gmail.com <mailto:jcb6...@gmail.com>>>>:
>
>
>
> Is the hardware required for nested LR/SC any simpler than that
> required for full transactional memory support?
>
>
> For a small stack depth definitely. For example, the validation is
> much simpler, since you can (see below) validate one-by-one on the
> SCs, rather than having to do a complete validation at some point.

This means that nested LR/SC should not be part of RVT, then. So it
will be an extension to the RVA extension.
But you cannot commit anything until the last SC has succeeded, so you
already must validate on each SC. This would not preclude an
implementation from going back to hard reservations on the first SC in
order to improve the chance that a nested LR/SC succeeds.

> If you turn the soft res. into a hard res. which is guaranteed to
> remain hard until you execute something except a (successful) SC, you
> do not have to validate again on the last SC: all reservations are
> guaranteed to still hold and you can just commit immediately.

With a small maximum stack depth, the existing RVA forward progress
guarantee can apply. The first SC in the set could function as another
LR (renewing the reservations and "resetting the clock" so to speak),
provided that the reservations have all remained valid.

> Like I said I am not sure if those six operations are enough
> to make transactional memory useful in practice. I'd wait for
> input of someone who actually uses transactional memory :) (at
> least in a model).
>
>
> The authors of that paper used transactional memory in a model. :-)
>
>
> Ah but there is a difference between "I created a model and prove how
> great it is with these three select examples that work great" and
> "these other guys created a model that is really great and I have
> built some real stuff in it" (which can turn into "these other guys
> created a model which is cool but to implement real stuff in it, I
> also needed xyz, which I added") :)

That sounds like I need to write up an RVT proposal so someone can try
that with RISC-V. :-)

> I like the idea of reusing the SC opcode for ST, with the
> exact
> function determined by whether a transaction or simple memory
> reservation was active. An LR introduces an LR/SC
> sequence, while
> an LT or LTX introduces a transaction.
>
>
> That sounds neat. I think it may not even be necessary to keep
> track of the "mode" between interrupts, because both of them
> have the same semantics after an interrupt (fail).
>
>
> Exactly, taking a trap invalidates all reservations, including
> transactions. (Taking a trap *must* erase the internal atomic
> operation state -- the trap handler may need to use RVA.)
>
>
> I don't know RVA stands for, but I agree with the rest.

RVA is the atomic memory access extension "A". The trap handler may
itself need atomic memory operations.
There is an easier solution: add one LRM ("Load Reserved Multiple")
opcode that is used for subsequent loads. There is space in the current
LR opcode. The overall sequence is LR/LRM/LRM/LRM/.../SC/SC/SC/SC. LR
always begins a sequence, while LRM only acquires a reservation if the
previous instruction executed was either LR or LRM. A trap handler
returning into the middle of a sequence causes the sequence to fail,
while a return-from-trap to the beginning of the sequence can succeed on
the first try. All reservations are canceled by executing LR to start a
new sequence, by taking any trap, or by executing LRM that does not
acquire a reservation.
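
For illustration, a DW-CAS under this LR/LRM proposal might look something like the sketch below. "lrm.w" is of course not an existing instruction and its operand form is only a guess; the point is the shape of the LR/LRM/.../SC/SC sequence, with the last SC acting as the commit.

# a0 = address of the aligned word pair, a2/a3 = expected old values,
# a4/a5 = new values. Returns 0 in a0 on success, 1 on compare mismatch.
dw_cas_lrm:
    addi t3, a0, 4
retry:
    lr.w t0, (a0)       # starts the sequence, takes the first reservation
    lrm.w t1, (t3)      # must directly follow LR/LRM; if it acquires nothing,
                        # every following SC is guaranteed to fail
    bne t0, a2, fail
    bne t1, a3, fail
    sc.w t2, a4, (a0)   # validates; nothing is visible to other harts yet
    sc.w t2, a5, (t3)   # last SC of the set: commits both stores if still valid
    bnez t2, retry      # any lost reservation fails the whole set
    li a0, 0
    ret
fail:
    li a0, 1
    ret

As with plain LR/SC, the failure path simply abandons the reservations; the fresh LR on the next attempt cancels them.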

Does every use of nested LR/SC write to all and exactly the addresses
read using LR? If so, we do not need a stack depth counter, only a
means to track if any of the read set has not yet been updated. The
entire mini-transaction commits when the last item in the read set is
updated. This last SC acts as "store transactional and commit" while
previous SCs act as "store transactional and validate".

Also, how many reservations should be allowed? The number needs to be
more than 1, obviously, but also kept small to make this a useful option
to implementors. There should be clear distinct purposes and distinct
implementation costs between base RVA, RVA with LRM, and RVT, with RVT
most powerful but most expensive to implement. The cost of RVA with LRM
should be much closer to base RVA than RVT.

> Let's discuss again LR/LRA + SCB/SC.
> First note that I do not think that LRA is necessary, since LR can
> already just not break previous reservations.
> The distinction SCB/SC is necessary without a reservation stack. But I
> am afraid that the only advantage that SCB/SC brings over a
> reservation stack is that you no longer need to pop the stack with
> SCBs/SCs.
> You still need to store a few extra bits on an interrupt, which have
> to be context stored/restored, such as "did I have a pending LR?" (to
> prevent the "first LR after xRET" problem).

I say the opposite: SCB is unneeded, but separate "first LR" and
"multiple LR" instructions are needed. A new LRM opcode also ensures
that programs that expect to use nested LR/SC will fail quickly and
obviously on implementations that do not have nested LR/SC.

Lastly, the incremental encoding space required for LRM is less than for
LRA/SCB or even just SCB.



-- Jacob

Bruce Hoult

non lue,
8 déc. 2017, 02:20:5208/12/2017
à Cesar Eduardo Barros,Jonas Oberhauser,Jacob Bachmeyer,RISC-V ISA Dev
You need a stack to record what the locked addresses *are*, so that writes to those addresses by other harts remove the reservation.

You can call it a "set of reservations" instead of a stack, if you want, but I think it would be good practice to always have the SCs in the reverse order of the LRs. 

Jonas Oberhauser

non lue,
8 déc. 2017, 07:58:3408/12/2017
à Jacob Bachmeyer,RISC-V ISA Dev,Cesar Eduardo Barros
There have been many interesting ideas here. 
*commit/begin with non-subsequent store/loads
I think this is too much too ask for in hardware and should go to RVT.
In this case you need to be recheck your reservations on the commit and possibly other things.

*commit/begin with only subsequent store/loads
I assume the reason to use LD and not LR is to maintain consistency with existing code?
I think   begin;LDs;op;SCs;commit   and  LR;LRBs;op;SCBs;SC   are in essence the same.
In case of begin you know whether you are at the very beginning of the sequence (and thus no fail after xRET) because you are before begin; if you xRET at some point after begin, the LDs will not take reservations and the SCs will fail.
In case of LRB you know whether you are at the very beginning of the sequence (and thus no fail after xRET) because you are before an LR; if you xRET at some point after LR, the LRBs will fail (this needs a status flag, but doesn't depend on the context) and the SCs will fail.
The upside of LRB of course is that you don't have to do any loads with the status flag, and as a small bonus you save two local instructions.

*LRM, using a set of reservations and not forcing pops
I love this idea. In fact, I believe you can do everything we need with exactly that at a very low cost.
All you need is a way to distinguish, as you suggest, after xRET between the first LR in a sequence and an LR in the middle. This can be done with LR;LRMs. I'm wondering if there is another simple way to make this distinction or to circumvent the problem altogether. HW interrupts could be masked, but you could still have page faults in the middle of the sequence and (benign) interrupts right before the sequence. Making an interrupt at the beginning of the sequence automatically go back before the first LR in the sequence seems to be complicated as well, but might not be so bad in HW -- you anyways need to store possible epcs somewhere in the pipeline, and you could store as epcs the pcs before the first LR... But it's not clear any of this is really worthwhile.


I don't know how implementations currently handle reservations, but I could imagine the following implementation for the whole thing.
In each cache line remember reservation and dirty (possibly implicitly in the cache state, so reservation+M/O could mean dirty in MOESI). 
For the whole cache remember in a single bit hard/soft.

Hard reservation stalls write and LR/LRM accesses to the same cacheline, hard dirty reservation stalls normal read accesses to the same cacheline (by stall I mean "try again later" without keeping the bus busy).
Soft reservations are destroyed on another access and send an "invalidate all reservations" message.

An LR and a valid*, non-stalled LRM is just a normal load + takes a hard reservation.
An invalid or stalled LRM is a NOP and invalidates all reservations.

A successful SC 
1) to a non-dirty cacheline flushes the current cacheline to memory, unless this is the last reserved non-dirty cacheline
2) changes the value of the cache line and makes it dirty, invalidating all copies of the cache line in other caches in the process. 
3) if all reserved cachelines are now dirty, all reservations are freed.

When reservations are invalidated, dirty cachelines are invalidated and reservations are freed.
When reservations are freed, reservation bits and dirty bits go away.

When a reserved cacheline is flushed except due to a successful SC, we have to "invalidate all reservations". In particular if the cacheline was dirty, the value is discarded, not stored back to memory.

The upside of this is that it is very economical, and that an implementation offering only normal LR/SC does not need a dirty bit (if any implementation even does...) because there is at most one reserved cacheline, and once this cacheline becomes dirty, it is freed immediately.
The "downside" of this is that reservations can not be held in case of cache collisions. This would need to be reflected in the ISA somehow. "valid LRMs can only fail to acquire a reservation in case of a cache collision with an existing reservation, in which case all reservations are invalidated.".

*by valid I mean "not after an interrupt" 

2017-12-08 6:56 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
The problem is the reverse one. Imagine every SC succeeds, but by the time you do the last SC,an earlier reservation has been given up and someone else wrote to that address. You would have to validate again on the last SC.

But you cannot commit anything until the last SC has succeeded, so you already must validate on each SC.  This would not preclude an implementation from going back to hard reservations on the first SC in order to improve the chance that a nested LR/SC succeeds. [...]
With a small maximum stack depth, the existing RVA forward progress guarantee can apply.  The first SC in the set could function as another LR (renewing the reservations and "resetting the clock" so to speak), provided that the reservations have all remained valid.

You have to validate on every SC, but if you write to every address in the set, an implementation that turns soft reservations into hard ones will only need to validate the reservation on the current address.
In other words:
an SC to an address
1) checks if any reservation exists to *that address* and fails if there isn't one
2) turns that reservation into a hard one until the first instruction which is not a successful SC

and that is it. No instruction has to check anything else, and in particular, no SC has to check or renew reservations for other addresses.
This implementation already guarantees:
1) on the last SC, all previous reservations are still valid
2) in particular, no other thread has seen or modified the current address
 
Ah but there is a difference between "I created a model and prove how great it is with these three select examples that work great" and "these other guys created a model that is really great and I have built some real stuff in it" (which can turn into "these other guys created a model which is cool but to implement real stuff in it, I also needed xyz, which I added") :)

That sounds like I need to write up an RVT proposal so someone can try that with RISC-V.  :-)

:)
 
I don't know what RVA stands for

RVA is the atomic memory access extension "A".  The trap handler may itself need atomic memory operations.

Thanks. I agree. 

There is an easier solution:  add one LRM ("Load Reserved Multiple") opcode that is used for subsequent loads.  There is space in the current LR opcode.  The overall sequence is LR/LRM/LRM/LRM/.../SC/SC/SC/SC.  LR always begins a sequence, while LRM only acquires a reservation if the previous instruction executed was either LR or LRM.  A trap handler returning into the middle of a sequence causes the sequence to fail, while a return-from-trap to the beginning of the sequence can succeed on the first try.  All reservations are canceled by executing LR to start a new sequence, by taking any trap, or by executing LRM that does not acquire a reservation.

Also, how many reservations should be allowed?  The number needs to be more than 1, obviously, but also kept small to make this a useful option to implementors.

Like I said, I think that the size should be implementation defined and discoverable in some register. In particular, RVA with only LR/SC should have a 1 in that register. In this case RVA is one implementation of RVA+ nested LR where depth is 1.
 
There should be clear distinct purposes and distinct implementation costs between base RVA, RVA with [nested LR], and RVT, with RVT most powerful but most expensive to implement.  The cost of RVA with [nested LR] should be much closer to base RVA than RVT.

I completely agree.

Rogier Brussee

non lue,
8 déc. 2017, 11:25:5208/12/2017
à RISC-V ISA Dev,jcb6...@gmail.com,ces...@cesarb.eti.br


On Friday, 8 December 2017 at 13:58:34 UTC+1, Jonas Oberhauser wrote:
There have been many interesting ideas here. 
*commit/begin with non-subsequent store/loads
I think this is too much too ask for in hardware and should go to RVT.
In this case you need to be recheck your reservations on the commit and possibly other things.

*commit/begin with only subsequent store/loads
I assume the reason to use LD and not LR is to maintain consistency with existing code?
I think   begin;LDs;op;SCs;commit   and  LR;LRBs;op;SCBs;SC   are in essence the same.


for LRB and SD for SCB. Of course that requires knowing
 
In case of begin you know whether you are at the very beginning of the sequence (and thus no fail after xRET) because you are before begin; if you xRET at some point after begin, the LDs will not take reservations, the SCs will fail.
In case of LRB you know whether you are the very beginning of the sequence (and thus no fail after xRET) because you are before an LR; if you xRET at some point after LR, the LRBs will fail (this needs a status flag, but doesn't depend on the context) the SCs will fail.
The upside of LRB of course is that you don't have to do any loads with the status flag, and as a small bonus you save two local instructions.

*LRM, using a set of reservations and not forcing pops
I love this idea. In fact, I believe you can do everything we need with exactly that at a very low cost.
All you need is a way to distinguish, as you suggest, after xRET between the first LR in a sequence and an LR in the middle. This can be done with LR;LRMs.

Using the unused rs2 register in the LR instruction obviously allows you to make a distinction between LRs and LRMs. What if you push that idea a little further and
use it to say which reservation set the LR is in, by naming the register you expect the _final_ SC to put its failure code in (i.e. rd)? Hence
only the reservation of LR .., .., a0 is committed by SC a0, .., .., and in stacked sequences you only allow such matched LR sequences.

Example 
(nonsensical registers for easy matching)

LR t1, a1, zero
LR t2, a2, a0    # LRM in reserve set a0 = 10
LR t3, a3, a0

(t4, t5, t6) = op(t1, t2, t3, ....)

SC a0, a5, t6    # or SC zero, a5, t6 to say SCB??
SC a0, a4, t5    # or SC zero, a4, t5 to say SCB??
SC a0, a3, t4

Jacob Bachmeyer

non lue,
8 déc. 2017, 18:57:1208/12/2017
à Rogier Brussee,RISC-V ISA Dev,ces...@cesarb.eti.br
Rogier Brussee wrote:
> On Friday, 8 December 2017 at 13:58:34 UTC+1, Jonas Oberhauser wrote:
>
> *LRM, using a set of reservations and not forcing pops
> I love this idea. In fact, I believe you can do everything we need
> with exactly that at a very low cost.
> All you need is a way to distinguish, as you suggest, after
> xRET between the first LR in a sequence and an LR in the middle.
> This can be done with LR;LRMs.
>
>
> Using the unused rs2 register in the LR instruction obviously allows
> you to make distinction between LR's and LRM's.

I was actually planning to use LR's rs2 field as a function code to
distinguish LR/LRM/LT/LTX/ABORT/COMMIT/VALIDATE, leaving room for 28
more two-operand instructions. (ABORT/COMMIT/VALIDATE are encoded using
special cases of LR/LTX/LT.)

> What if you push that idea a little further and
> use it to say which reservation set the LR's in by you expect the
> _final_ SC to put its failure code in (i.e. rd). Hence
> only the reservation of LR .., .., a0 is comitted by SC a0, .., ..
> and in stacked sequences you only allow such matched LR sequences.

The fundamental problem I have with this is the idea of making general
registers special.

> Example
> (nonsensical registers for easy matching)
>
> LR t1, a1, zero #
> LR t2, a2, a0 # LRM in reserve set a0 = 10
> LR t3, a3, a0
>
> (t4, t5, t6) = op(t1, t2, t3, ....)
>
> SC a0 a5 t6 # or SC zero a5 t6 to say SCB??
> SC a0 a4 t5 # or SC zero a4 t5 to say SCB??
> SC a0 a3 t4

This example seems nonsensical to me. Were the later LRs supposed to
indicate that the matching SC would use a0 as destination? What is
gained by then using a0 for all three SCs?



-- Jacob

Jacob Bachmeyer

non lue,
8 déc. 2017, 19:34:5908/12/2017
à Jonas Oberhauser,RISC-V ISA Dev,Cesar Eduardo Barros
Jonas Oberhauser wrote:
> *LRM, using a set of reservations and not forcing pops
> I love this idea. In fact, I believe you can do everything we need
> with exactly that at a very low cost.
> All you need is a way to distinguish, as you suggest, after
> xRET between the first LR in a sequence and an LR in the middle. This
> can be done with LR;LRMs. I'm wondering if there is another simple way
> to make this distinction or to circumvent the problem altogether. HW
> interrupts could be masked, but you could still have page faults in
> the middle of the sequence and (benign) interrupts right before the
> sequence.

There is also the problem that more privileged interrupts are supposed
to always be enabled.

> Making an interrupt at the beginning of the sequence automatically go
> back before the first LR in the sequence seems to be complicated as
> well, but might not be so bad in HW -- you anyways need to store
> possible epcs somewhere in the pipeline, and you could store as epcs
> the pcs before the first LR...

Hardware can, at least for traps taken *during* the LR/LRM group, do
this invisibly to software by effectively speculating all LRM
instructions and rolling back to the LR if any traps are taken. This is
done by simply storing the address of the LR (er, "previous non-LRM
instruction") in ?epc if a trap is taken while executing LRM. This is
invisible to normal software -- the interrupt simply appears to have
arrived before the sequence began. Hardware that does this *must*
document it, however -- the supervisor (and debugger) will see any page
faults caused by LRM as having occurred at the previous LR.

> I don't know how implementations currently handle reservations, but I
> could imagine the following implementation for the whole thing.
> In each cache line remember reservation and dirty (possibly implicitly
> in the cache state, so reservation+M/O could mean dirty in MOESI).
> For the whole cache remember in a single bit hard/soft.
>
> Hard reservation stalls write and LR/LRM accesses to the same
> cacheline, hard dirty reservation stalls normal read accesses to the
> same cacheline (by stall I mean "try again later" without keeping the
> bus busy).
> Soft reservations are destroyed on another access and send an
> "invalidate all reservations" message.
>
> An LR and a valid*, non-stalled LRM is just a normal load + takes a
> hard reservation.
> An invalid or stalled LRM is a NOP and invalidates all reservations.

A failed LRM must still load from memory -- the program might use that
value as a pointer.

> A successful SC
> 1) to a non-dirty cacheline flushes the current cacheline to memory,
> unless this is the last reserved non-dirty cacheline
> 2) changes the value of the cache line and makes it dirty,
> invalidating all copies of the cache line in other caches in the process.
> 3) if all reserved cachelines are now dirty, all reservations are freed.
>
> When reservations are invalidated, dirty cachelines are invalidated
> and reservations are freed.
> When reservations are freed, reservation bits and dirty bits go away.

When reservations are freed on successful commit, the cachelines must
remain dirty -- they now need to be written back.

> When a reserved cacheline is flushed except due to a successful SC, we
> have to "invalidate all reservations". In particular if the cacheline
> was dirty, the value is discarded, not stored back to memory.
>
> The upside of this is that it is very economical, and that an
> implementation offering only normal LR/SC does not need a dirty bit
> (if any implementation even does...) because there is at most one
> reserved cacheline, and once this cacheline becomes dirty, it is freed
> immediately.
> The "downside" of this is that reservations can not be held in case of
> cache collisions. This would need to be reflected in the ISA somehow.
> "valid LRMs can only fail to acquire a reservation in case of a cache
> collision with an existing reservation, in which case all reservations
> are invalidated.".

Why? Two LRMs to the same cacheline could simply take two reservations
on the same cacheline. A small increase in state information is
required -- per-byte reserved/dirty bits. The cacheline is dirty when
all reserved bytes in the cacheline are dirty. The maximum number of
reservations must be small enough that whatever cache is used *appears*
to be fully-associative. In practice, that requirement for per-byte
reserved/dirty bits will probably mean that the "multi-LR" cache will be
a small fully associative cache with only as many lines as the
guaranteed maximum number of reservations.

> *by valid I mean "not after an interrupt"

The rule I propose is simple: LRM must fail (and invalidate all
reservations) unless the previous instruction executed was either LR or
LRM. Executing LR or taking a trap invalidates all reservations in any
case. Upon return-from-interrupt, the previous instruction executed is
xRET, so LRM will fail.

> 2017-12-08 6:56 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com
> <mailto:jcb6...@gmail.com>>:
> There is an easier solution: add one LRM ("Load Reserved
> Multiple") opcode that is used for subsequent loads. There is
> space in the current LR opcode. The overall sequence is
> LR/LRM/LRM/LRM/.../SC/SC/SC/SC. LR always begins a sequence,
> while LRM only acquires a reservation if the previous
> instruction executed was either LR or LRM. A trap handler
> returning into the middle of a sequence causes the sequence to
> fail, while a return-from-trap to the beginning of the
> sequence can succeed on the first try. All reservations are
> canceled by executing LR to start a new sequence, by taking
> any trap, or by executing LRM that does not acquire a reservation.
>
>
> Also, how many reservations should be allowed? The number needs
> to be more than 1, obviously, but also kept small to make this a
> useful option to implementors.
>
>
> Like I said, I think that the size should be implementation defined
> and discoverable in some register. In particular, RVA with only LR/SC
> should have a 1 in that register. In this case RVA is one
> implementation of RVA+ nested LR where depth is 1.

In RISC-V, there is more of a preference for putting information like that
in the platform configuration. An implementation that only supports
depth 1 should simply not implement LRM, leaving it to raise illegal
instruction. If LRM is implemented, there should be some reasonable
minimum that portable software can assume. We have 15 caller-saved
registers in the ABI; that is one short of the 16 needed for a fully
general quadword compare and swap. Compare and swap of four words is
also the longest sequence that meets the base RVA constraints for
guaranteed eventual success. For these reasons, I propose that
implementations supporting LRM be required to support at least four
reservations.



-- Jacob

Jonas Oberhauser

non lue,
9 déc. 2017, 02:13:4509/12/2017
à Jacob Bachmeyer,RISC-V ISA Dev,Cesar Eduardo Barros


On Dec 9, 2017 01:34, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
Jonas Oberhauser wrote:
*LRM, using a set of reservations and not forcing pops
I love this idea. In fact, I believe you can do everything we need with exactly that at a very low cost.
All you need is a way to distinguish, as you suggest, after xRET between the first LR in a sequence and an LR in the middle. This can be done with LR;LRMs. I'm wondering if there is another simple way to make this distinction or to circumvent the problem altogether. HW interrupts could be masked, but you could still have page faults in the middle of the sequence and (benign) interrupts right before the sequence.

There is also the problem that more privileged interrupts are supposed to always be enabled.

They are enabled but not claimed in that interval.



Making an interrupt at the beginning of the sequence automatically go back before the first LR in the sequence seems to be complicated as well, but might not be so bad in HW -- you anyways need to store possible epcs somewhere in the pipeline, and you could store as epcs the pcs before the first LR...

Hardware can, at least for traps taken *during* the LR/LRM group, do this invisibly to software by effectively speculating all LRM instructions and rolling back to the LR if any traps are taken.  This is done by simply storing the address of the LR (er, "previous non-LRM instruction") in ?epc if a trap is taken while executing LRM.  This is invisible to normal software -- the interrupt simply appears to have arrived before the sequence began.  Hardware that does this *must* document it, however -- the supervisor (and debugger) will see any page faults caused by LRM as having occurred at the previous LR.

Yes.

I don't know how implementations currently handle reservations, but I could imagine the following implementation for the whole thing.
In each cache line remember reservation and dirty (possibly implicitly in the cache state, so reservation+M/O could mean dirty in MOESI). For the whole cache remember in a single bit hard/soft.

Hard reservation stalls write and LR/LRM accesses to the same cacheline, hard dirty reservation stalls normal read accesses to the same cacheline (by stall I mean "try again later" without keeping the bus busy).
Soft reservations are destroyed on another access and send an "invalidate all reservations" message.

An LR and a valid*, non-stalled LRM is just a normal load + takes a hard reservation.
An invalid or stalled LRM is a NOP and invalidates all reservations.

A failed LRM must still load from memory -- the program might use that value as a pointer.

I assumed that if a failed LRM loads a pointer, that pointer will only be used by failed LRMs and SCs.



A successful SC 1) to a non-dirty cacheline flushes the current cacheline to memory, unless this is the last reserved non-dirty cacheline
2) changes the value of the cache line and makes it dirty, invalidating all copies of the cache line in other caches in the process. 3) if all reserved cachelines are now dirty, all reservations are freed.

When reservations are invalidated, dirty cachelines are invalidated and reservations are freed.
When reservations are freed, reservation bits and dirty bits go away.

When reservations are freed on successful commit, the cachelines must remain dirty -- they now need to be written back.

No. Dirty in the sense above means "to be discarded on invalidation".
The caches are coherent the whole time, for the simple reason that there are no other caches with the same cacheline. The cacheline remains "dirty w.r.t. main memory" because it stays in M/O, but it is no longer "dirty w.r.t. reservation".


When a reserved cacheline is flushed except due to a successful SC, we have to "invalidate all reservations". In particular if the cacheline was dirty, the value is discarded, not stored back to memory.

The upside of this is that it is very economical, and that an implementation offering only normal LR/SC does not need a dirty bit (if any implementation even does...) because there is at most one reserved cacheline, and once this cacheline becomes dirty, it is freed immediately.
The "downside" of this is that reservations can not be held in case of cache collisions. This would need to be reflected in the ISA somehow. "valid LRMs can only fail to acquire a reservation in case of a cache collision with an existing reservation, in which case all reservations are invalidated.".

Why?  Two LRMs to the same cacheline could simply take two reservations on the same cacheline.

The problem is not two LRMs to the same cacheline, but an LRM to another cacheline that collides with the first one, so the two cannot both be kept in the cache.

A small increase in state information is required -- per-byte reserved/dirty bits.  The cacheline is dirty when all reserved bytes in the cacheline are dirty.  The maximum number of reservations must be small enough that whatever cache is used *appears* to be fully-associative.  In practice, that requirement for per-byte reserved/dirty bits will probably mean that the "multi-LR" cache will be a small fully associative cache with only as many lines as the guaranteed maximum number of reservations.

That is also possible, but I like to use existing HW :(


*by valid I mean "not after an interrupt"

The rule I propose is simple:  LRM must fail (and invalidate all reservations) unless the previous instruction executed was either LR or LRM.  Executing LR or taking a trap invalidates all reservations in any case.  Upon return-from-interrupt, the previous instruction executed is xRET, so LRM will fail.

Sure. 


2017-12-08 6:56 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com <mailto:jcb6...@gmail.com>>:

In RISC-V, there is more a preference for putting information like that in the platform configuration.  An implementation that only supports depth 1 should simply not implement LRM, leaving it to raise illegal instruction.  If LRM is implemented, there should be some reasonable minimum that portable software can assume.  We have 15 caller-saved registers in the ABI; that is one short of the 16 needed for a fully general quadword compare and swap.  Compare and swap of four words is also the longest sequence that meets the base RVA constraints for guaranteed eventual success.  For these reasons, I propose that implementations supporting LRM be required to support at least four reservations.

-- Jacob

Ok.

Jacob Bachmeyer

non lue,
9 déc. 2017, 22:53:5809/12/2017
à Jonas Oberhauser,RISC-V ISA Dev,Cesar Eduardo Barros
Jonas Oberhauser wrote:
> On Dec 9, 2017 01:34, "Jacob Bachmeyer" <jcb6...@gmail.com
> <mailto:jcb6...@gmail.com>> wrote:
>
> Jonas Oberhauser wrote:
>
>
> I don't know how implementations currently handle
> reservations, but I could imagine the following implementation
> for the whole thing.
> In each cache line remember reservation and dirty (possibly
> implicitly in the cache state, so reservation+M/O could mean
> dirty in MOESI). For the whole cache remember in a single bit
> hard/soft.
>
> Hard reservation stalls write and LR/LRM accesses to the same
> cacheline, hard dirty reservation stalls normal read accesses
> to the same cacheline (by stall I mean "try again later"
> without keeping the bus busy).
> Soft reservations are destroyed on another access and send an
> "invalidate all reservations" message.
>
> An LR and a valid*, non-stalled LRM is just a normal load +
> takes a hard reservation.
> An invalid or stalled LRM is a NOP and invalidates all
> reservations.
>
>
> A failed LRM must still load from memory -- the program might use
> that value as a pointer.
>
>
> I assumed that if a failed LRM loads a pointer, that pointer will be
> used by failed LRMs and SCs

Do we want to require that all values loaded by an LR/LRM/SC sequence
are dead afterwards? This seems like an unneeded constraint on
software, and very different behavior from normal loads, plain LR, and
LT/LTX (from the RVT proposal I am planning).

> When a reserved cacheline is flushed except due to a
> successful SC, we have to "invalidate all reservations". In
> particular if the cacheline was dirty, the value is discarded,
> not stored back to memory.
>
> The upside of this is that it is very economical, and that an
> implementation offering only normal LR/SC does not need a
> dirty bit (if any implementation even does...) because there
> is at most one reserved cacheline, and once this cacheline
> becomes dirty, it is freed immediately.
> The "downside" of this is that reservations can not be held in
> case of cache collisions. This would need to be reflected in
> the ISA somehow. "valid LRMs can only fail to acquire a
> reservation in case of a cache collision with an existing
> reservation, in which case all reservations are invalidated.".
>
>
> Why? Two LRMs to the same cacheline could simply take two
> reservations on the same cacheline.
>
>
> The problem are not to the same cacheline, but to another cacheline
> that collides with the first cacheline. They will not be kept together
> in memory.

So a set-associative cache, where two reservations would map to the same
physical cacheline? Could this be handled by requiring the cache to
have at least 4 ways if 4 reservations are permitted? In other words,
LRM requires that the cache can store any 4 cachelines.

> A small increase in state information is required -- per-byte
> reserved/dirty bits. The cacheline is dirty when all reserved
> bytes in the cacheline are dirty. The maximum number of
> reservations must be small enough that whatever cache is used
> *appears* to be fully-associative. In practice, that requirement
> for per-byte reserved/dirty bits will probably mean that the
> "multi-LR" cache will be a small fully associative cache with only
> as many lines as the guaranteed maximum number of reservations.
>
>
> That is also possible, but I like to use existing HW :(

You are right; there are easier implementations for LRM. Since only
four reservations will be guaranteed, a 4-way set associative cache is
sufficient. RVT as I plan to propose will permit unlimited use of a
cacheline once it is in the transaction's memory set -- a second LT from
the same cacheline does not consume an additional memory set slot, nor
does a second LTX or ST to a cacheline previously loaded for write
consume an additional slot. But RVT does not need to track when
loaded-for-write cachelines have been written -- commit in RVT will be
explicit.

LRM does need to track this state information, but really only needs a
small buffer that stores the addresses loaded and "checks them off" as
the SCs are executed. Per-byte reserved/dirty bits in the cache are not
needed after all. This will require that all LR/LRM loads are exactly
matched by SC stores, but that should not be a problem.


-- Jacob

Jonas Oberhauser

non lue,
10 déc. 2017, 14:12:3410/12/2017
à Jacob Bachmeyer,RISC-V ISA Dev,Cesar Eduardo Barros
I'm not sure I understand what you mean by that.

After a successful LR/LRM/SC sequence, the values are fine and can be used normally.
In fact, already after a successful LR/LRM sequence the values are fine. 
So if you implement the DCAS by
"LR;LRM; retry in case of LRM failure; if values match compare data then SC;SC"
you know that the loaded values at the end of this block are always correct in the sense that they appear to have been read atomically from coherent memory (i.e., even in case of a failed SC or comparison, this code behaves like an atomic double-word load). (To be able to check that the LRM failed, we would require r2 != x0 for LRMs and have them put a non-zero value into r2 in case of failure.)
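
As a concrete illustration of that claim, the same pattern gives an atomic double-word read, assuming the (hypothetical) convention just described that an LRM with r2 != x0 writes a non-zero value to r2 when it fails to acquire a reservation:

# a0 = address of the aligned word pair; returns the pair in a0/a1.
atomic_ld_pair:
    addi t3, a0, 4
retry:
    lr.w t0, (a0)
    lrm.w t1, (t3), t2  # hypothetical form: t2 != 0 if no reservation was acquired
    bnez t2, retry      # retry before t0/t1 are ever used, so the pair always
                        # appears to have been read atomically
    mv a0, t0
    mv a1, t1
    ret

The incorrectly implemented DCAS discussed below differs exactly in that it omits this check before using the loaded values.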

It is only in case the LR/LRM sequence fails that the LRMs in the sequence might load garbage. So if you wrongly implemented the DCAS as 
"LR;LRM; if values match compare data then SC;SC",
the CAS does not appear to be an atomic double-wide load in case the LRM fails. Note that the SCs will fail if the LRM failed even if the garbage values match the compare data, so that is not the problem -- the problem is if you want to continue using the load results.

The reason for this is simple: you cannot get a reservation on that address, so the LRM has to either fail or be stalled.
But if you stall an LRM while you have hard reservations, you can get into a HW deadlock where CPU1 is trying to execute an LRM to a hard reservation of CPU2 and CPU2 is trying to execute an LRM to a hard reservation of CPU1, e.g., by running "LR x; LRM y" and "LR y; LRM x" in parallel.
If you fail the LRM and the reservation is dirty, the LRM cannot really return a "real" value. It has to either return the dirty value (which might never be written to memory and might make the remote LR/LRM/SC sequence seem non-atomic), take a stale value from main memory, or make up a value. In each case LRM potentially returns garbage (which I think would be OK; I'd prefer if it always returned the same garbage though, such as all zeros or just the current register content).

I assume that in case of a failed LRM the loaded values will only be used by LRMs which fail, SCs which fail, and local operations which do not leak the garbage values, and that the computation and its values are either abandoned or the whole thing is restarted.
Note that you can architecturally provide the guarantee that the LR will always succeed (and I guess that is currently the case).
 
This seems like an unneeded constraint on software, and very different behavior from normal loads, plain LR, and LT/LTX (from the RVT proposal I am planning).

One alternative is to turn the hard reservations of CPUi into a soft one while CPUi issues an LRM to a hard reservation of another CPU (you can go back to a hard reservation when the LRM eventually succeeds and the older reservations were not lost). In this case LRMs can always provide real memory values, but the LR/LRM sequence is not guaranteed to be atomic (since the soft reservations could be lost before the LRM returns its real value) -- still some way to detect whether the LR/LRM sequence was atomic or not would be needed.

Another alternative is to "restart" the LR/LRM sequence in case of a failure *during* the LRM sequence, e.g., by jumping back before the LR.
I wrote before: 
Making an interrupt at the beginning of the sequence automatically go back before the first LR in the sequence seems to be complicated as well, but might not be so bad in HW -- you anyways need to store possible epcs somewhere in the pipeline, and you could store as epcs the pcs before the first LR...  
To this you responded: 
Hardware can, at least for traps taken *during* the LR/LRM group, do this invisibly to software by effectively speculating all LRM instructions and rolling back to the LR if any traps are taken.  This is done by simply storing the address of the LR (er, "previous non-LRM instruction") in ?epc if a trap is taken while executing LRM.  This is invisible to normal software -- the interrupt simply appears to have arrived before the sequence began.  Hardware that does this *must* document it, however -- the supervisor (and debugger) will see any page faults caused by LRM as having occurred at the previous LR.

In case we do something like this, the alternative with the reset sounds very charming.
Note that at this point it is no longer necessary to distinguish in the opcode LR and LRM: after an interrupt, you can never be in the middle of the LR/LRM sequence, only at the beginning, and LRMs can only fail because they access a reserved cacheline, not due to interrupts, not due to previous failed LRMs.
I'll still use the distinction LR-opcode for "an instruction with the LR opcode", LRM for "LR-opcode directly after LR-opcode", and LR for "LR-opcode but not LRM".
The semantics could be something along the following lines:
1) Keep in a status register a "restart code address", which is either a program counter or an invalid value.
2) An LR sets the restart code address to the pc. 
3) Any instruction which is not an LR-opcode invalidates the restart code address.
4) An interrupt where the restart code address is not an invalid value sets the epc to the restart code address.
5) A failed LRM will set the pc to the restart code address.
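To make this concrete, here is a rough sketch of a DW-CAS under semantics 1)-5), using the existing lr.d/sc.d mnemonics with the extended group semantics discussed here (the per-SC result convention is only an assumption); the point is that no explicit "did the LRM fail?" branch is needed, since a failed LRM restarts at the LR by itself:

dw_cas:                        # compare (a2,a3), swap in (a4,a5) at (a0) and (a0)+8
        addi    a6, a0, 8      # second word address, computed before the group
retry:
        lr.d    t0, (a0)       # LR: restart code address := pc of this instruction
        lr.d    t1, (a6)       # LR-opcode directly after an LR-opcode, i.e. an LRM;
                               #   on failure, hardware jumps back to the lr.d above
        li      a1, 1          # default result: "wrong values"
        bne     t0, a2, done
        bne     t1, a3, done
        sc.d    a1, a4, (a0)   # SC group: both stores commit or neither does
        sc.d    a1, a5, (a6)   # assumed: a1 == 0 iff the whole group committed
        bnez    a1, retry      # serialization failure: retry the sequence
done:
        ret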

In this case you never have to worry about garbage values after an LRM, because you never get beyond a failed LRM. This makes software simpler and reduces bugs where you forget to check validity and use the values for something else, like the bad DCAS above.

Simple hardware can immediately fail an LRM that tries to access a reserved cacheline; more complex hardware can first turn local reservations into soft reservations and stall the LRM, and only fail the LRM in case reservations are lost before the LRM manages to acquire the reservation itself and local reservations are turned back into hard reservations. Both approaches implement this spec and guarantee that the LR/LRMs sequence appears to be atomic.

It's not a perfect approach yet because you can starve if the LR/LRMs of two threads are not symmetrical, such as "LR x, LRM y" and "while (1) { LR y }" -- the latter  LR might always be scheduled right between LR x and LRM y, thus starving the other thread. 
In case of symmetry there is no problem because the first one to get a reservation will always get the remaining reservations, and the existing mechanism to prevent starvation for LRs saves the day.
If instead of handling the restart in HW we handle it in SW, we can eventually give up -- e.g., we can have something like 
"while tasks_available > 0 do { try to take reservations for task queue using LR;LRM; if LRM succeeded, pop a task from the queue; if it didn't, retry the loop }"
which eventually stops trying to reserve the cachelines if the reason to take the reservations disappears.

If this is useful in practice, I suggest instead of the above:
An LR is an LR-opcode with r2=x0, and an LRM is an LR-opcode with r2!=x0.
The r2 is used for the restart code address, so
1') a failed LRM jumps to gpr(r2) (and loads nothing)
2') an interrupt during an LRM sets the epc to gpr(r2)
and
3') an LR clears the reservations and is guaranteed to succeed
other than that, LRs and LRMs can now be the same -- no need to keep a status flag for LR/LRM which is cleared on interrupts. To avoid the ambiguity from before, just point your r2 in front of the LR/LRM sequence.

This is not much harder to program against, does not have the starvation problem, does not have the "garbage by accident" problem, does not need an extra restart code address register, but "costs" one gpr register in which we put the restart code address.
And it seems very easy to implement.
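A minimal sketch of how the group would look under this r2-based variant; (a0) and (a6) hold the two word addresses as in the earlier sketch, and the three-operand lr.d form and its operand order are made up purely for illustration:

        la      t6, seq              # restart code address, held in an ordinary gpr
seq:
        lr.d    t0, (a0)             # rs2 = x0: plain LR, clears reservations, always succeeds
        lr.d    t1, (a6), t6         # rs2 = t6 != x0: LRM; on failure or interrupt, control continues at gpr(t6)
        ...                          # compare and SC group exactly as before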
 
        When a reserved cacheline is flushed except due to a
        successful SC, we have to "invalidate all reservations". In
        particular if the cacheline was dirty, the value is discarded,
        not stored back to memory.

        The upside of this is that it is very economical, and that an
        implementation offering only normal LR/SC does not need a
        dirty bit (if any implementation even does...) because there
        is at most one reserved cacheline, and once this cacheline
        becomes dirty, it is freed immediately.
        The "downside" of this is that reservations can not be held in
        case of cache collisions. This would need to be reflected in
        the ISA somehow. "valid LRMs can only fail to acquire a
        reservation in case of a cache collision with an existing
        reservation, in which case all reservations are invalidated.".


    Why?  Two LRMs to the same cacheline could simply take two
    reservations on the same cacheline.


The problem is not two LRMs to the same cacheline, but an LRM to another cacheline that collides with the first one. The two cachelines cannot both be kept in the cache.

So a set-associative cache, where two reservations would map to the same physical cacheline?

Yes, if by "physical cacheline" you mean "same slot in cache".
 
Could this be handled by requiring the cache to have at least 4 ways if 4 reservations are permitted? In other words, LRM requires that the cache can store any 4 cachelines.

Yes. Somehow I wrongly remembered that the slot (the way within a set) in a 2^k-way set-associative cache is chosen based on the address, when in reality it is not. So this works completely fine.

 
    A small increase in state information is required -- per-byte
    reserved/dirty bits.  The cacheline is dirty when all reserved
    bytes in the cacheline are dirty.  The maximum number of
    reservations must be small enough that whatever cache is used
    *appears* to be fully-associative.  In practice, that requirement
    for per-byte reserved/dirty bits will probably mean that the
    "multi-LR" cache will be a small fully associative cache with only
    as many lines as the guaranteed maximum number of reservations.


That is also possible, but I like to use existing HW :(

You are right; there are easier implementations for LRM.  Since only four reservations will be guaranteed, a 4-way set associative cache is sufficient.  RVT as I plan to propose will permit unlimited use of a cacheline once it is in the transaction's memory set -- a second LT from the same cacheline does not consume an additional memory set slot, nor does a second LTX or ST to a cacheline previously loaded for write consume an additional slot.  But RVT does not need to track when loaded-for-write cachelines have been written -- commit in RVT will be explicit.

LRM does need to track this state information, but really only needs a small buffer that stores the addresses loaded and "checks them off" as the SCs are executed.  Per-byte reserved/dirty bits in the cache are not needed after all. 

 You're right.
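For illustration, a minimal C sketch of such a check-off buffer (the structure, the names, the fixed limit of 4, and the omission of duplicate handling are all assumptions, not part of any proposal):

#define MAX_RES 4                    /* guaranteed number of reservations (assumed) */

struct sc_tracker {
    unsigned long addr[MAX_RES];     /* cacheline addresses reserved by LR/LRM  */
    int reserved;                    /* how many were reserved                  */
    int written;                     /* how many an SC has already checked off  */
};

/* record an LR/LRM; returns 0 if the implementation limit is exceeded */
static int track_lr(struct sc_tracker *t, unsigned long addr)
{
    if (t->reserved == MAX_RES) return 0;
    t->addr[t->reserved++] = addr;
    return 1;
}

/* check off an SC; returns 0 for an SC to an address that was never reserved */
static int track_sc(struct sc_tracker *t, unsigned long addr)
{
    for (int i = 0; i < t->reserved; i++) {
        if (t->addr[i] == addr) {
            t->written++;
            return 1;
        }
    }
    return 0;
}

/* the group may commit only once written == reserved */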

Jacob Bachmeyer

non lue,
11 déc. 2017, 00:21:5211/12/2017
à Jonas Oberhauser,RISC-V ISA Dev,Cesar Eduardo Barros
Jonas Oberhauser wrote:
> 2017-12-10 4:53 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
>
> Jonas Oberhauser wrote:
>
> On Dec 9, 2017 01:34, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
>
>     Do we want to require that all values loaded by an LR/LRM/SC
>     sequence are dead afterwards?
>
> I'm not sure I understand what you mean by that.
Simply that a program must not rely on any values loaded from LR/LRM if
SC fails.

> After a successful LR/LRM/SC sequence, the values are fine and can be
> used normally.
> In fact, already after a successful LR/LRM sequence the values are fine.
> So if you implement the DCAS by
> "LR;LRM; retry in case of LRM failure; if values match compare data
> then SC;SC"
> you know that the loaded values at the end of this block are always
> correct in the sense that they appear to have been read atomically
> from coherent memory (i.e., even in case of a failed SC or comparison,
> this code behaves like an atomic double-word load). (to be able to
> check that the LRM failed, we would require for LRMs r2!=x0 and put a
> non-zero value into r2 in case of a failure).

This is a non-starter because it would require two write ports on the
register file. There is no way to both read a value and report success
at LRM in RISC-V.

> It is only in case the LR/LRM sequence fails that the LRMs in the
> sequence might load garbage. So if you wrongly implemented the DCAS as
> "LR;LRM; if values match compare data then SC;SC",
> the cas does not appear to be an atomic double-wide load in case the
> LRM fails. Note that the SCs will fail if the LRM failed even if the
> garbage values match the compare data, so that is not the problem --
> the problem is if you want to continue using the load results.

Why is that wrong for DCAS? A problem is seen only if LRM returns
garbage instead of reading memory without acquiring a reservation. If
LRM returns garbage, then DCAS will almost certainly fail ("wrong
values") instead of retrying ("serialization failure").

> The reason for this is simple: you can not get a reservation on that
> address, so the LRM has to either fail or be stalled.
> But if you stall an LRM while you have hard reservations, you can get
> into a HW dead-lock where CPU1 is trying to execute an LRM to a hard
> reservation of CPU2 and CPU2 is trying to execute an LRM to a hard
> reservation of CPU1, e.g., by running "LR x; LRM y" and "LR y; LRM x"
> in parallel.

Either one CPU has priority or both must stall for the hard reservation
timeout, then continue. If both stall, CPU1 now has "y" reserved and
CPU2 has "x" reserved, but both have lost their earlier reservations and
will fail at SC. The simple (Ethernet collision recovery) strategy is
to stall for a random small number of cycles upon an attempt to
reacquire a reservation that has previously timed out. On the first
attempt, both CPUs fail at SC, they both jump back to retry, they stall
for different amounts of time at the first LR, and one (I will say CPU1)
clearly acquires the reservations and reaches the SCs before the other.
CPU2 then fails at SC and retries, succeeding on the third attempt, or
stalls at LR (observing CPU1's reservations) and waits for CPU1 to
complete or the reservations to again expire.

However, until one CPU has reached the SC sequence, no reservations are
dirty. The guarantee of forward progress requires that all SCs be
contiguous, so renewing a hard reservation just long enough to complete
four SC operations avoids this with bounded delay. The first CPU to
execute SC "wins" and all reservations held by other CPUs on the
addresses written are broken. If two CPUs enter the SC group at the
same time, either one has priority, or both fail and delay retry by
different amounts.

> If you fail the LRM and the reservation is dirty, the LRM can not
> really return a "real" value. It either has to return the dirty value
> -- which might never be written to memory and might make the remote
> LR/LRM/SC sequence seem non-atomic, take a stale value from the main
> memory, or make up a value. In each case LRM potentially returns
> garbage (which I think would be ok. I'd prefer if it always returned
> the same garbage though, such as all zeros or just the current
> register content.).

This cannot happen -- if the reservation is dirty, then the remote
LR/LRM/SC has completed and committed. The "stale" value must be read
until that time, since the atomic writes have not yet occurred. LR/LRM
simply expresses interest in an address; there is no requirement that SC
ever be executed. A "failed" LRM reads the same value that a successful
LRM reads, the only difference is that a failed LRM was executed without
the LR to begin the sequence, so no reservation is acquired and SC must
fail, causing the sequence to be retried.

> I assume that in case of a failed LRM the loaded values will only be
> used by LRMs which fail, SCs which fail, and local operations which do
> not leak the garbage values, and that the computation and its values
> are either abandoned or the whole thing is restarted.
> Note that you can architecturally provide the guarantee that the LR
> will always succeed (and I guess that is currently the case).

LRM differs from LR only in that LRM must occur immediately after LR or
LRM, to ensure that the LR/LRM/SC sequence is retried if a trap return
lands at LRM.

> This seems like an unneeded constraint on software, and very
> different behavior from normal loads, plain LR, and LT/LTX (from
> the RVT proposal I am planning).
>
>
> One alternative is to turn the hard reservations of CPUi into a soft
> one while CPUi issues an LRM to a hard reservation of another CPU (you
> can go back to a hard reservation when the LRM eventually succeeds and
> the older reservations were not lost). In this case LRMs can always
> provide real memory values, but the LR/LRM sequence is not guaranteed
> to be atomic (since the soft reservations could be lost before the LRM
> returns its real value) -- still some way to detect whether the LR/LRM
> sequence was atomic or not would be needed.

If the LR/LRM sequence was not atomic, then the subsequent SC sequence
must fail. If CPUi issues an LRM to a hard reservation of CPUj, CPUi
must wait until the reservation either expires or is released. If the
reservation is released due to SC succeeding, CPUi may succeed in its
sequence, or may fail if CPUj has updated an address where CPUi held a
reservation.

We might need a rule that LR/LRM instructions must be independent of
each other in a sequence.
The problem is that this requires some degree of speculative execution.
While it could be a significant performance improvement, implementations
cannot be required to do it.

Worse, this is only possible with speculative execution -- the register
writes from earlier LR/LRM must be undone if the page fault is moved
back to the LR, since the program could reuse the same register if
atomically chasing a pointer chain and replacing it. Or should portable
programs be required to ensure that all addresses used in LR/LRM loads
are preserved, making the LR/LRM group idempotent? I think we have this
requirement already -- the SC group must write exactly to every address
that was read in LR/LRM and clobbering the addresses makes this
impossible within the guaranteed forward progress constraints.

> I'll still use the distinction LR-opcode for "an instruction with the
> LR opcode", LRM for "LR-opcode directly after LR-opcode", and LR for
> "LR-opcode but not LRM".
> The semantics could be something along the following lines:
> 1) Keep in a status register a "restart code address", which is either
> a program counter or an invalid value.
> 2) An LR sets the restart code address to the pc.
> 3) Any instruction which is not an LR-opcode invalidates the restart
> code address.
> 4) An interrupt where the restart code address is not an invalid value
> sets the epc to restart code address.
> 5) A failed LRM will set the the pc to the restart code address.

Using a distinct LRM opcode makes me feel much better -- the above
should work (and the "restart code address" register can be invisible)
to prevent a trap return or preemptive task switch from landing in the
middle of an LR/LR/SC/SC sequence, but it will not catch other incorrect
transfers of control. A distinct LRM opcode ensures that any invalid
landing will be caught instead of attempting a partial "atomic"
sequence. I do not see any direct exploits here, but I prefer being
safe over living on the edge.

> In this case you never have to worry about garbage values after an
> LRM, because you never get beyond a failed LRM. This makes software
> simpler and reduces bugs where you forget to check validity and use
> the values for something else, like the bad DCAS above.
>
> Simple hardware can immediately fail an LRM that tries to access a
> reserved cacheline; more complex hardware can first turn local
> reservations into soft reservations and stall the LRM, and only fail
> the LRM in case reservations are lost before the LRM manages to
> acquire the reservation itself and local reservations are turned back
> into hard reservations. Both approaches implement this spec and
> guarantee that the LR/LRMs sequence appears to be atomic.

I see this as premature optimization: permitting an LRM that cannot
lead to a successful SC to avoid accessing memory at all. I believe
that even an LRM that "fails" should still access memory -- if nothing
else, it will "pull" the requested data nearer to "this" hart in the
cache hierarchy, and some dynamic priority schemes could use how many
caches need to be traversed to break ties.

> It's not a perfect approach yet because you can starve if the LR/LRMs
> of two threads are not symmetrical, such as "LR x, LRM y" and "while
> (1) { LR y }" -- the latter LR might always be scheduled right
> between LR x and LRM y, thus starving the other thread.
> In case of symmetry there is no problem because the first one to get a
> reservation will always get the remaining reservations, and the
> existing mechanism to prevent starvation for LRs saves the day.

Both succeed loading "y" and both now have reservations on "y". The
first one to start executing SC "wins" and breaks the other
reservation. The latter LR, lacking an SC, is useless and conflicts
with nothing. Alternately, if reservations cause other LRs to stall to
reduce retries and guarantee forward progress, a round-robin dynamic
priority scheme solves this -- on one cycle, the LRM breaks the other
reservation and the first thread completes, on the next, the LRM
stalls. Another option for this case, since the LR is constantly taking
the same reservation, is to yield if an LR is executed to an address
that the same hart already has reserved.

> If instead of handling the restart in HW we handle it in SW, we can
> eventually give up -- e.g., we can have something like
> "while tasks_available > 0 do { try to take reservations for task
> queue using LR;LRM; if LRM succeeded, pop a task from the queue; if it
> didn't, retry the loop }"
> which eventually stops trying to reserve the cachelines if the reason
> to take the reservations disappears.

These are good issues for RVT, but are beyond the scope of a simple
multi-word LR/SC. That said, the approach I am leaning towards for RVT
is "longest-running transaction wins" with the intent that this should
generally push transactions into a FIFO-like order. If you find
yourself asking "did LRM succeed?", that means you need the T.VALIDATE
instruction from RVT. LRM does not give that information. You know
that LRM succeeded when the subsequent SC group succeeds.

> If this is useful in practice, I suggest instead of the above:
> An LR is an LR-opcode with r2=x0, and an LRM is an LR-opcode with r2!=x0.
> The r2 is used for the restart code address, so
> 1') a failed LRM jumps to gpr(r2) (and loads nothing)
> 2') an interrupt during an LRM sets the epc to gpr(r2)
> and
> 3') an LR clears the reservations and is guaranteed to succeed
> other than that, LRs and LRMs can now be the same -- no need to keep a
> status flag for LR/LRM which is cleared on interrupts. To avoid the
> ambiguity from before, just point your r2 in front of the LR/LRM sequence.
>
> This is not much harder to program against, does not have the
> starvation problem, does not have the "garbage by accident" problem,
> does not need an extra restart code address register, but "costs" one
> gpr register in which we put the restart code address.
> And it seems very easy to implement.

I have three problems with this: (1) it introduces a register-indirect
conditional branch (which RISC-V currently does not have), (2) it uses
the entire space in the LR opcode, where I was hoping :-( to put more
instructions (RVT LT/LTX/ABORT/VALIDATE/COMMIT) by using rs2 as an
extended function code field, and (3) (fatal, in my view) it introduces
an abusable JR-alike ROP gadget, since an improper control transfer to
an LRM would cause a jump to the address in whatever register that LRM
instruction selects in rs2.


-- Jacob

Jonas Oberhauser

non lue,
11 déc. 2017, 08:08:0511/12/2017
à Jacob Bachmeyer,RISC-V ISA Dev,Cesar Eduardo Barros


2017-12-11 6:21 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:

Jonas Oberhauser wrote:
    Do we want to require that all values loaded by an LR/LRM/SC
    sequence are dead afterwards? 
I'm not sure I understand what you mean by that.

Simply that a program must not rely on any values loaded from LR/LRM if SC fails.

Okay, good.
 
After a successful LR/LRM/SC sequence, the values are fine and can be used normally.
In fact, already after a successful LR/LRM sequence the values are fine. So if you implement the DCAS by
"LR;LRM; retry in case of LRM failure; if values match compare data then SC;SC"
you know that the loaded values at the end of this block are always correct in the sense that they appear to have been read atomically from coherent memory (i.e., even in case of a failed SC or comparison, this code behaves like an atomic double-word load). (to be able to check that the LRM failed, we would require for LRMs r2!=x0 and put a non-zero value into r2 in case of a failure).

This is a non-starter because it would require two write ports on the register file.  There is no way to both read a value and report success at LRM in RISC-V.

It would not have to report success, only failure, in which case it does not have to read a value :) 

It is only in case the LR/LRM sequence fails that the LRMs in the sequence might load garbage. So if you wrongly implemented the DCAS as "LR;LRM; if values match compare data then SC;SC",
the cas does not appear to be an atomic double-wide load in case the LRM fails. Note that the SCs will fail if the LRM failed even if the garbage values match the compare data, so that is not the problem -- the problem is if you want to continue using the load results.

Why is that wrong for DCAS?  A problem is seen only if LRM returns garbage instead of reading memory without acquiring a reservation.  If LRM returns garbage, then DCAS will almost certainly fail ("wrong values") instead of retrying ("serialization failure").

In a *normal* DCAS, it is totally necessary to have the old values, since they may be used in the next iteration of the loop

a,b = R x,y;  do {  c,d = op(a,b);   a,b, fail = DCAS x (a -> c) y (b -> d)   } while (fail)

Note the lack of a reload in the loop.
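For example, a 128-bit add built on a hypothetical dcas() primitive (the name and signature are invented for illustration) keeps reusing the values the DCAS reports back, so they must never be garbage:

#include <stdint.h>
#include <stdbool.h>

/* Non-atomic stand-in, for illustration only; a real DCAS would do this
 * atomically.  If (*x,*y) == (*a,*b), store (c,d) and return true;
 * otherwise copy the observed values into (*a,*b) and return false. */
static bool dcas(uint64_t *x, uint64_t *y,
                 uint64_t *a, uint64_t *b, uint64_t c, uint64_t d)
{
    if (*x == *a && *y == *b) { *x = c; *y = d; return true; }
    *a = *x; *b = *y; return false;
}

void add_128(uint64_t *lo, uint64_t *hi, uint64_t inc)
{
    uint64_t a = *lo, b = *hi;              /* load once, before the loop */
    uint64_t c, d;
    do {
        c = a + inc;                        /* c,d = op(a,b)             */
        d = b + (c < a);                    /* carry into the high word  */
    } while (!dcas(lo, hi, &a, &b, c, d));  /* on failure, a and b are refreshed */
}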

With LR/LRM, such a loop does not need the values, because the operation is written between the LR/LRMs and the SCs, not "before" them.
But I do not know if there are other examples where even with LR/LRM one would like to continue using the values.

 
The reason for this is simple: you can not get a reservation on that address, so the LRM has to either fail or be stalled.
But if you stall an LRM while you have hard reservations, you can get into a HW dead-lock where CPU1 is trying to execute an LRM to a hard reservation of CPU2 and CPU2 is trying to execute an LRM to a hard reservation of CPU1, e.g., by running "LR x; LRM y" and "LR y; LRM x" in parallel.

Either one CPU has priority or both must stall for the hard reservation timeout, then continue. 

I thought that the reservation clock is only ticking during CPU cycles, of which there are none during the stall. That is why I suggested flipping the reservation state temporarily.
 
If both stall, CPU1 now has "y" reserved and CPU2 has "x" reserved, but both have lost their earlier reservations and will fail at SC. The simple (Ethernet collision recovery) strategy is to stall for a random small number of cycles upon an attempt to reacquire a reservation that has previously timed out.  On the first attempt, both CPUs fail at SC, they both jump back to retry, they stall for different amounts of time at the first LR, and one (I will say CPU1) clearly acquires the reservations and reaches the SCs before the other.  CPU2 then fails at SC and retries, succeeding on the third attempt, or stalls at LR (observing CPU1's reservations) and waits for CPU1 to complete or the reservations to again expire. 

Yes, but I'd rather not have to put that into software. I'd like to maintain a forward guarantee for one of the threads.
 
However, until one CPU has reached the SC sequence, no reservations are dirty.  The guarantee of forward progress requires that all SCs be contiguous, so renewing a hard reservation just long enough to complete four SC operations avoids this with bounded delay. 

Hm. I didn't think it would be a good idea to allow multiple CPUs to have reservations to the same cacheline, but I don't have a good reason against it.
 
The first CPU to execute SC "wins" and all reservations held by other CPUs on the addresses written are broken. 

Which would also invalidate all other reservations of those other CPUs, preventing them from entering the SC groups.
 
If two CPUs enter the SC group at the same time, either one has priority, or both fail and delay retry by different amounts.

Therefore it is not possible for two CPUs with overlapping reservations to enter the SC group at the same time. It is not possible to enter SC groups "at the same time" anyways, but now it is also impossible for one to enter, and then for another to enter.

I love this! I have to think about it more but it sounds great.
Thinking about it some more (:P), I think you might need three reservation modes.
1) exclusive
2) shared
3) soft
What we previously called hard now splits into shared and exclusive.

LR/LRM takes shared reservations; the first successful SC turns the reservations exclusive.
The difference between exclusive and shared is:
* multiple CPUs can have shared reservations for the same cacheline, but if you have exclusive reservations, you are the only one
* shared reservations can be broken by a successful remote SC

This means LRM to an exclusive remote reservation can be stalled in HW without fear of deadlock, because the remote thread will not acquire more addresses.

You can guarantee in the absence of interrupts:
1) if multiple LR;LRM;SC sequences compete, one of them will reach a successful SC
2) once a successful SC is executed, the remaining SCs in the sequence will also be successful and the operation will be committed.
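Purely to pin the transitions down, a small C model of one hart's reservation on one cacheline (names and events are mine, and the soft mode is left out for brevity):

enum res_mode { RES_NONE, RES_SHARED, RES_EXCLUSIVE };

enum res_event {
    EV_LOCAL_LR_OR_LRM,    /* this hart executes LR/LRM to the line               */
    EV_LOCAL_FIRST_SC,     /* the first SC of this hart's group succeeds          */
    EV_REMOTE_SC_COMMIT    /* another hart's SC group commits a write to the line */
};

static enum res_mode step(enum res_mode cur, enum res_event ev)
{
    switch (ev) {
    case EV_LOCAL_LR_OR_LRM:
        return RES_SHARED;                               /* LR/LRM only take shared reservations */
    case EV_LOCAL_FIRST_SC:
        return cur == RES_SHARED ? RES_EXCLUSIVE : cur;  /* first successful SC upgrades them    */
    case EV_REMOTE_SC_COMMIT:
        /* shared reservations are broken; a remote SC cannot commit while we
         * hold the line exclusively, so that case never arises */
        return cur == RES_SHARED ? RES_NONE : cur;
    }
    return cur;
}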

 
If you fail the LRM and the reservation is dirty, the LRM can not really return a "real" value. It either has to return the dirty value -- which might never be written to memory and might make the remote LR/LRM/SC sequence seem non-atomic, take a stale value from the main memory, or make up a value. In each case LRM potentially returns garbage (which I think would be ok. I'd prefer if it always returned the same garbage though, such as all zeros or just the current register content.).

This cannot happen -- if the reservation is dirty, then the remote LR/LRM/SC has completed and committed. 

No, because an interrupt could still happen that fails the SC sequence.
 
The "stale" value must be read until that time, since the atomic writes have not yet occurred.

In the cache you probably have already the dirty value. I think the LRM to an exclusive remote reservation should be stalled until 1) the local reservations are invalidated from the outside or 2) the remote reservation is freed (whether because of an interrupt or because the remote side commits).

 
  LR/LRM simply expresses interest in an address; there is no requirement that SC ever be executed.  A "failed" LRM reads the same value that a successful LRM reads, the only difference is that a failed LRM was executed without the LR to begin the sequence, so no reservation is acquired and SC must fail, causing the sequence to be retried.

But an LD or LR to an exclusive remote reservation definitely has to be stalled. The question is just whether the LRM should stall like the LR or whether it is allowed to abort.
If we do not need a meaningful, atomic value when the SC fails, we can abort and everything is fine.
 
I assume that in case of a failed LRM the loaded values will only be used by LRMs which fail, SCs which fail, and local operations which do not leak the garbage values, and that the computation and its values are either abandoned or the whole thing is restarted.
Note that you can architecturally provide the guarantee that the LR will always succeed (and I guess that is currently the case).

LRM differs from LR only in that LRM must occur immediately after LR or LRM, to ensure that the LR/LRM/SC sequence is retried if a trap return lands at LRM.

    This seems like an unneeded constraint on software, and very
    different behavior from normal loads, plain LR, and LT/LTX (from
    the RVT proposal I am planning).


One alternative is to turn the hard reservations of CPUi into a soft one while CPUi issues an LRM to a hard reservation of another CPU (you can go back to a hard reservation when the LRM eventually succeeds and the older reservations were not lost). In this case LRMs can always provide real memory values, but the LR/LRM sequence is not guaranteed to be atomic (since the soft reservations could be lost before the LRM returns its real value) -- still some way to detect whether the LR/LRM sequence was atomic or not would be needed.

If the LR/LRM sequence was not atomic, then the subsequent SC sequence must fail.

Indeed.
 
If CPUi issues an LRM to a hard reservation of CPUj, CPUi must wait until the reservation either expires or is released.  If the reservation is released due to SC succeeding, CPUi may succeed in its sequence, or may fail if CPUj has updated an address where CPUi held a reservation.

Yes. 

We might need a rule that LR/LRM instructions must be independent of each other in a sequence.

No. This is neither necessary nor practical. Loading a pointer and dereferencing it atomically should be fine.
For example, an LR-optimized MCS lock might look like this:

struct MCS_LOCK { MCS_LOCK *owner };

MCS_LOCK x;

x is free iff x.owner == &x .

bool LOCK (MCS_LOCK *lp, MCS_LOCK *me) {

  l = dereference lp // LR
  o = dereference l.owner // LRM
  l.owner = me // SC
  release lp // SC
  

}




 
Another alternative is to "restart" the LR/LRM sequence in case of a failure *during* the LRM sequence, e.g., by jumping back before the LR.
I wrote before:
        Making an interrupt at the beginning of the sequence
        automatically go back before the first LR in the sequence
        seems to be complicated as well, but might not be so bad in HW
        -- you anyways need to store possible epcs somewhere in the
        pipeline, and you could store as epcs the pcs before the first
        LR... 
To this you responded:
    Hardware can, at least for traps taken *during* the LR/LRM group,
    do this invisibly to software by effectively speculating all LRM
    instructions and rolling back to the LR if any traps are taken.     This is done by simply storing the address of the LR (er,
    "previous non-LRM instruction") in ?epc if a trap is taken while
    executing LRM.  This is invisible to normal software -- the
    interrupt simply appears to have arrived before the sequence
    began.  Hardware that does this *must* document it, however -- the
    supervisor (and debugger) will see any page faults caused by LRM
    as having occurred at the previous LR.


In case we do something like this, the alternative with the reset sounds very charming.
Note that at this point it is no longer necessary to distinguish in the opcode LR and LRM: after an interrupt, you can never be in the middle of the LR/LRM sequence, only at the beginning, and LRMs can only fail because they access a reserved cacheline, not due to interrupts, not due to previous failed LRMs.

The problem is that this requires some degree of speculative execution.  While it could be a significant performance improvement, implementations cannot be required to do it.

It's not really speculative, it's just a conditional branch. 

Worse, this is only possible with speculative execution -- the register writes from earlier LR/LRM must be undone if the page fault is moved back to the LR, since the program could reuse the same register if atomically chasing a pointer chain and replacing it.  Or should portable programs be required to ensure that all addresses used in LR/LRM loads are preserved, making the LR/LRM group idempotent?

I also realized this later. Take the LR/LRMs sequence and take all registers which are used by an LR/LRM but not overwritten by a previous LR/LRM in the sequence.
That set of registers has to be restored to its values from before the first LR, either in SW (by never overwriting such a register) or in HW.
Both sound complicated and error-prone.

I think we have this requirement already -- the SC group must write exactly to every address that was read in LR/LRM and clobbering the addresses makes this impossible within the guaranteed forward progress constraints.
 
I don't think I quite understand. Why does this make it impossible? I also don't know what "clobbering" means in this context, is that a formal term?

Jonas Oberhauser

non lue,
11 déc. 2017, 08:22:1911/12/2017
à Jacob Bachmeyer,RISC-V ISA Dev,Cesar Eduardo Barros
Sorry, I accidentally sent the e-mail in the middle again.
I'll update below the parts of my e-mail that were garbage.


Garbage 1:
2017-12-11 14:08 GMT+01:00 Jonas Oberhauser <s9jo...@gmail.com>:
We might need a rule that LR/LRM instructions must be independent of each other in a sequence.

No. This is neither necessary nor practical. Loading a pointer and dereferencing it atomically should be fine.
For example, an LR-optimized MCS lock might look like this:

struct MCS_LOCK { MCS_LOCK *owner };

MCS_LOCK x;

x is free iff x.owner == &x .

bool LOCK (MCS_LOCK *lp, MCS_LOCK *me) {

  l = dereference lp // LR
  o = dereference l.owner // LRM
  l.owner = me // SC
  release lp // SC
  

}

The code above is complete garbage (I wasn't finished typing). It will take me some time to fix it.

 Garbage 2:
I think we have this requirement already -- the SC group must write exactly to every address that was read in LR/LRM and clobbering the addresses makes this impossible within the guaranteed forward progress constraints.
 
I don't think I quite understand. Why does this make it impossible? I also don't know what "clobbering" means in this context, is that a formal term?


The reason why I think this doesn't make it impossible -- only hard -- is that you could use a few integer operations to reconstruct the addresses.



Here are some missing responses: 
I'll still use the distinction LR-opcode for "an instruction with the LR opcode", LRM for "LR-opcode directly after LR-opcode", and LR for "LR-opcode but not LRM".
The semantics could be something along the following lines:
1) Keep in a status register a "restart code address", which is either a program counter or an invalid value.
2) An LR sets the restart code address to the pc.
3) Any instruction which is not an LR-opcode invalidates the restart code address.
4) An interrupt where the restart code address is not an invalid value sets the epc to the restart code address.
5) A failed LRM will set the pc to the restart code address.

Using a distinct LRM opcode makes me feel much better -- the above should work (and the "restart code address" register can be invisible) to prevent a trap return or preemptive task switch from landing in the middle of an LR/LR/SC/SC sequence, but it will not catch other incorrect transfers of control.  A distinct LRM opcode ensures that any invalid landing will be caught instead of attempting a partial "atomic" sequence.  I do not see any direct exploits here, but I prefer being safe over living on the edge.

I am not worried about that -- jumping incorrectly into a cas loop will cause incorrect results, and so will jumping incorrectly into an LRM. It's not the duty of LRM to find bugs like that.

However, I now think that a separate LRM opcode is useful for the simple reason of extensibility: if you want to later allow instructions to appear between LR and LRM, you can -- but only if they have a separate opcode.

 


In this case you never have to worry about garbage values after an LRM, because you never get beyond a failed LRM. This makes software simpler and reduces bugs where you forget to check validity and use the values for something else, like the bad DCAS above.

Simple hardware can immediately fail an LRM that tries to access a reserved cacheline; more complex hardware can first turn local reservations into soft reservations and stall the LRM, and only fail the LRM in case reservations are lost before the LRM manages to acquire the reservation itself and local reservations are turned back into hard reservations. Both approaches implement this spec and guarantee that the LR/LRMs sequence appears to be atomic.

I see this as premature optimization:  permitting an LRM that cannot lead to a successful SC to avoid accessing memory at all.  I believe that even an LRM that "fails" should still access memory -- if nothing else, it will "pull" the requested data nearer to "this" hart in the cache hierarchy, and some dynamic priority schemes could use how many caches need to be traversed to break ties.

If you think so.
 


It's not a perfect approach yet because you can starve if the LR/LRMs of two threads are not symmetrical, such as "LR x, LRM y" and "while (1) { LR y }" -- the latter  LR might always be scheduled right between LR x and LRM y, thus starving the other thread. In case of symmetry there is no problem because the first one to get a reservation will always get the remaining reservations, and the existing mechanism to prevent starvation for LRs saves the day.

Both succeed loading "y" and both now have reservations on "y".  The first one to start executing SC "wins" and breaks the other reservation.  The latter LR, lacking an SC, is useless and conflicts with nothing.  Alternately, if reservations cause other LRs to stall to reduce retries and guarantee forward progress, a round-robin dynamic priority scheme solves this -- on one cycle, the LRM breaks the other reservation and the first thread completes, on the next, the LRM stalls.  Another option for this case, since the LR is constantly taking the same reservation, is to yield if an LR is executed to an address that the same hart already has reserved.

The latter is a specific solution to this instance of the problem, but not a generic solution (nest more LRMs in the first thread and alternate LRs in the second thread).

 
If instead of handling the restart in HW we handle it in SW, we can eventually give up -- e.g., we can have something like "while tasks_available > 0 do { try to take reservations for task queue using LR;LRM; if LRM succeeded, pop a task from the queue; if it didn't, retry the loop }"
which eventually stops trying to reserve the cachelines if the reason to take the reservations disappears.

These are good issues for RVT, but are beyond the scope of a simple multi-word LR/SC.  That said, the approach I am leaning towards for RVT is "longest-running transaction wins" with the intent that this should generally push transactions into a FIFO-like order.  If you find yourself asking "did LRM succeed?", that means you need the T.VALIDATE instruction from RVT.  LRM does not give that information.  You know that LRM succeeded when the subsequent SC group succeeds.

You may be right.
 


If this is useful in practice, I suggest instead of the above:
An LR is an LR-opcode with r2=x0, and an LRM is an LR-opcode with r2!=x0.
The r2 is used for the restart code address, so
1') a failed LRM jumps to gpr(r2) (and loads nothing)
2') an interrupt during an LRM sets the epc to gpr(r2)
and
3') an LR clears the reservations and is guaranteed to succeed
other than that, LRs and LRMs can now be the same -- no need to keep a status flag for LR/LRM which is cleared on interrupts. To avoid the ambiguity from before, just point your r2 in front of the LR/LRM sequence.

This is not much harder to program against, does not have the starvation problem, does not have the "garbage by accident" problem, does not need an extra restart code address register, but "costs" one gpr register in which we put the restart code address.
And it seems very easy to implement.

I have three problems with this:  (1) it introduces a register-indirect conditional branch (which RISC-V currently does not have), (2) it uses the entire space in the LR opcode, where I was hoping :-( to put more instructions (RVT LT/LTX/ABORT/VALIDATE/COMMIT) by using rs2 as an extended function code field, and (3) (fatal, in my view) it introduces an abusable JR-alike ROP gadget, since an improper control transfer to an LRM would cause a jump to the address in whatever register that LRM instruction selects in rs2.

Hm. 

Jonas Oberhauser

non lue,
11 déc. 2017, 09:52:5811/12/2017
à Jacob Bachmeyer,RISC-V ISA Dev,Cesar Eduardo Barros
2017-12-11 14:22 GMT+01:00 Jonas Oberhauser <s9jo...@gmail.com>:
Garbage 1:
2017-12-11 14:08 GMT+01:00 Jonas Oberhauser <s9jo...@gmail.com>:
We might need a rule that LR/LRM instructions must be independent of each other in a sequence.

No. This is neither necessary nor practical. Loading a pointer and dereferencing it atomically should be fine.
For example, an LR-optimized MCS lock might look like this:

struct MCS_LOCK { MCS_LOCK *owner };

MCS_LOCK x;

x is free iff x.owner == &x .

bool LOCK (MCS_LOCK *lp, MCS_LOCK *me) {

  l = dereference lp // LR
  o = dereference l.owner // LRM
  l.owner = me // SC
  release lp // SC
  

}

Here is the correct code (using Pseudo C, to the best of my dusty C abilities)

typedef struct MCS_LOCK { struct MCS_LOCK *owner; } MCS_LOCK;
// I assume that the address of l.owner is the same as the address of l.

MCS_LOCK x;

#define free(l_p) ((l_p)->owner == (l_p))

void LOCK (MCS_LOCK *l_p, MCS_LOCK *me_p) {
    MCS_LOCK l;
    repeat until success {
        l = *l_p;                 // using LR
        *(l.owner);               // using LRM; value discarded, the load only reserves the previous queue tail
        // note: I possibly just took the same reservation twice

        l_p->owner = me_p;        // SC
        l.owner->owner = me_p;    // SC
    }

    if (l.owner == l_p) return;   // The lock was free, I am now the lock holder

    wait (free(l.owner));         // I didn't take a free lock, have to wait for the old owner to become free
}

void UNLOCK (MCS_LOCK *l_p, MCS_LOCK *me_p) {
    MCS_LOCK l, me;
    repeat until success {
        l = *l_p;                 // using LR
        me = *(me_p);             // using LRM

        if (me.owner == me_p) {
            // I am the last lock holder, the lock becomes free
            l_p->owner = l_p;     // SC
            me_p;                 // SC to me_p, rewriting the unchanged value (every reserved address must be written)
        } else {
            // I am not the last lock holder, I free the lock for my successor
            l_p;                  // SC to l_p, rewriting the unchanged value
            me_p->owner = me_p;   // SC
        }
    }
}




Because the repeat until success loops appear to be atomic, you don't have to bother with a bad interleaving where one thread enqueues itself while someone else frees the lock (which is what makes MCS locks so complicated).

To see why this algorithm works, imagine a singly linked queue of locks. 
There are three invariants:
1) The oldest element in the queue currently has the lock. 
2) Iff l_p points to itself, the queue is empty.
3) Otherwise l_p points to the newest element in the queue, which points to itself.
On LOCK we enqueue a new element, which means atomically pointing l_p and the previously newest element in the queue to myself.
If l_p pointed to itself, the queue was empty and I am now the oldest element in the queue, so I hold the lock; otherwise, I have to wait for the previously newest element in the queue to release the lock.

On UNLOCK I release the lock by pointing at myself, thus removing myself from the queue. The only difficulty is if I was the last element in the queue, in which case the queue is now empty.


I'm not saying this is a great algorithm -- unlike the normal MCS lock this one might not be fair in the sense that one of the threads can attempt an infinite number of LR/LRMs without ever reaching the SCs -- I was just giving an example of an algorithm where LRMing from a previous LRed pointer makes sense.


PS: I now enabled "undo send" in gmail, I hope this will prevent future mistakes...

Rogier Brussee

non lue,
11 déc. 2017, 12:00:2211/12/2017
à RISC-V ISA Dev,s9jo...@gmail.com,ces...@cesarb.eti.br,jcb6...@gmail.com


On Monday, December 11, 2017 at 06:21:52 UTC+1, Jacob Bachmeyer wrote:
Jonas Oberhauser wrote:
> 2017-12-10 4:53 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
>
>     Jonas Oberhauser wrote:

[SNIP]
I have three problems with this:  (1) it introduces a register-indirect
conditional branch (which RISC-V currently does not have), 
(2) it uses
the entire space in the LR opcode, where I was hoping :-( to put more
instructions (RVT LT/LTX/ABORT/VALIDATE/COMMIT) by using rs2 as an
extended function code field,

The regularity of the RV ISA instruction formats is very nice.  The AMO opcode 
space is R-type and not really crowded even if you insist on using the 3 bit func field for
the size (W, D, Q) only. Using rs2 for a function opcode seems like an as yet
unneeded desperate measure to save opcode space. 

(I.m.o. an additional reason to leave the rs2 field alone for now is the idea that 
the rs2 register in the LR instruction is most naturally used to define a range of 
reserved addresses, with the first W/D/Q in the range being the value loaded. I previously wanted to
define rs2 as the number of additional bytes that are reserved. Perhaps a more natural
parametrisation is that e.g. LR.{WDQ} rd rs1 rs2 reserves the range [rs1, rs1 + (1 << (rs2 + log2(sizeof({WDQ}))))),
assuming rs1 is (1 << (rs2 + log2(sizeof({WDQ}))))-byte aligned.

i.e.  assuming register r contains a (small) positive value and register q contains a 2^{r + 3} == (1 <<  (r + 3))  byte aligned address,     

LW a, imm(b)  hits the reserved range of an instruction LR.D p, q, r

iff 

(b + imm) & ~ ((1 << (3 + r)) -1)    ==  q & ~ ((1 <<  (3 + r)) - 1)

or equivalently

(b + imm) >> (3 + r) == q >> (3 + r)


This obviously extends the current behaviour of LR.{WDQ} for rs2 == zero at least for 
aligned addresses and should be cheap.)
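For what it's worth, the hit test in C (function and parameter names are mine):

#include <stdint.h>
#include <stdbool.h>

/* Does a load at address addr (i.e. b + imm) hit the range reserved by
 * LR.D p, q, r as parametrised above?  q is assumed to be
 * (1 << (r + 3))-byte aligned and r small and non-negative. */
static bool hits_reserved_range(uint64_t addr, uint64_t q, unsigned r)
{
    return (addr >> (3 + r)) == (q >> (3 + r));
}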

Cesar Eduardo Barros

non lue,
11 déc. 2017, 15:50:4311/12/2017
à Jonas Oberhauser,Jacob Bachmeyer,RISC-V ISA Dev
On 11-12-2017 11:08, Jonas Oberhauser wrote:
>
>
> 2017-12-11 6:21 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
>
> Jonas Oberhauser wrote:
>
> After a successful LR/LRM/SC sequence, the values are fine and
> can be used normally.
> In fact, already after a successful LR/LRM sequence the values
> are fine. So if you implement the DCAS by
> "LR;LRM; retry in case of LRM failure; if values match compare
> data then SC;SC"
> you know that the loaded values at the end of this block are
> always correct in the sense that they appear to have been read
> atomically from coherent memory (i.e., even in case of a failed
> SC or comparison, this code behaves like an atomic double-word
> load). (to be able to check that the LRM failed, we would
> require for LRMs r2!=x0 and put a non-zero value into r2 in case
> of a failure).
>
>
> This is a non-starter because it would require two write ports on
> the register file.  There is no way to both read a value and report
> success at LRM in RISC-V.
>
>
> It would not have to report success, only failure, in which case it does
> not have to read a value :)

But in case of success, it would have to write two registers: one
register with the successfully read value, and another register with the
"success" flag.

Think about it: the instruction always has to report success/failure, so
we can branch on it, no matter whether it failed or succeeded. Since
RISC-V has no flags register, it can only be in a general register. But
in the sucess case, we're already using one of these for the loaded
value. Therefore, you'd need to be able to write _two_ registers, just
for this one instruction (everything else writes zero or one registers).
You cannot use a special "flag" value in the output register to mean
"the load actually failed" because, for XLEN-sized loads, all possible
register values can be the result of the load.

Jacob Bachmeyer

non lue,
11 déc. 2017, 22:18:1011/12/2017
à Jonas Oberhauser,RISC-V ISA Dev,Cesar Eduardo Barros
Jonas Oberhauser wrote:
> 2017-12-11 6:21 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
>
> Jonas Oberhauser wrote:
>
> It is only in case the LR/LRM sequence fails that the LRMs in
> the sequence might load garbage. So if you wrongly implemented
> the DCAS as "LR;LRM; if values match compare data then SC;SC",
> the cas does not appear to be an atomic double-wide load in
> case the LRM fails. Note that the SCs will fail if the LRM
> failed even if the garbage values match the compare data, so
> that is not the problem -- the problem is if you want to
> continue using the load results.
>
>
> Why is that wrong for DCAS? A problem is seen only if LRM returns
> garbage instead of reading memory without acquiring a
> reservation. If LRM returns garbage, then DCAS will almost
> certainly fail ("wrong values") instead of retrying
> ("serialization failure").
>
>
> In a *normal* DCAS, it is totally necessary to have the old values,
> since they may be used in the next iteration of the loop
>
> a,b = R x,y; do { c,d = op(a,b); a,b, fail = DCAS x (a -> c) y (b
> -> d) } while (fail)
>
> Note the lack of a reload in the loop.

There are two failure cases: (1) DCAS reads memory and finds that the
values in memory do not match the expected values, and (2) the
LR/LRM/SC/SC that implements DCAS does not execute atomically. In case
(2), the correct response is to retry the LR/LRM/SC/SC sequence. In
case (1), DCAS itself was successfully executed, but failed. The loop
given in the example is an infinite loop if locations x and y do not
eventually return to the values initially read. Or are you saying that
DCAS returns the values actually read thus updating a and b? In this
latter case, LR/LRM must never return garbage: the value read by LR/LRM
will be used in the calculation if DCAS fails.

This is a point that needs to be clarified: LR/LRM is *not* LT. There
is no instruction that confirms that an LR/LRM sequence has read a
consistent set of values. The only guarantee is that all SCs will fail
if any value read by LR/LRM was changed by another hart after being read.

> With LR/LRM, such a loop does not need the values, because the
> operation is written between the LR/LRMs and the SCs, not "before" them,
> But I do not know if there are other examples where even with LR/LRM
> one would like to continue using the values.

This will vary -- the operation must meet the RVA forward progress
guarantee constraints if it appears in the LR/LRM/.../SC/SC sequence.
Some operations may need to use non-atomic reads followed by a
multi-word CAS to perform an atomic update.

>
> The reason for this is simple: you can not get a reservation
> on that address, so the LRM has to either fail or be stalled.
> But if you stall an LRM while you have hard reservations, you
> can get into a HW dead-lock where CPU1 is trying to execute an
> LRM to a hard reservation of CPU2 and CPU2 is trying to
> execute an LRM to a hard reservation of CPU1, e.g., by running
> "LR x; LRM y" and "LR y; LRM x" in parallel.
>
>
> Either one CPU has priority or both must stall for the hard
> reservation timeout, then continue.
>
>
> I thought that the reservation clock is only ticking during CPU
> cycles, of which there are none during the stall. That is why I
> suggested flipping the reservation state temporarily.

The reservation clock ticks on memory bus cycles, or some other clock
that continues to run unless the entire system is halted. Otherwise,
suspending a hart that happens to hold a reservation could cause deadlock.

>
> If both stall, CPU1 now has "y" reserved and CPU2 has "x"
> reserved, but both have lost their earlier reservations and will
> fail at SC. The simple (Ethernet collision recovery) strategy is
> to stall for a random small number of cycles upon an attempt to
> reacquire a reservation that has previously timed out. On the
> first attempt, both CPUs fail at SC, they both jump back to retry,
> they stall for different amounts of time at the first LR, and one
> (I will say CPU1) clearly acquires the reservations and reaches
> the SCs before the other. CPU2 then fails at SC and retries,
> succeeding on the third attempt, or stalls at LR (observing CPU1's
> reservations) and waits for CPU1 to complete or the reservations
> to again expire.
>
>
> Yes, but I'd rather not have to put that into software. I'd like to
> maintain a forward guarantee for one of the threads.

The only thing put into software in this example is the "retry on
serialization failure" that LR/SC already requires. The stall for a
random small number of cycles is a hardware operation. Hardware knows
to insert the random backoff stall when attempting to reacquire a
reservation that this hart previously held, but that timed out while
this hart was stalled acquiring another reservation. The required state
information can be invisible to software, since this is only needed to
break lock-step if two harts arrive at (and thus will retry) an LR/SC
simultaneously and a preemptive context switch on either hart will
certainly break that lock-step.

>
> However, until one CPU has reached the SC sequence, no
> reservations are dirty. The guarantee of forward progress
> requires that all SCs be contiguous, so renewing a hard
> reservation just long enough to complete four SC operations avoids
> this with bounded delay.
>
>
> Hm. I didn't think it would be a good idea to allow multiple CPUs to
> have reservations to the same cacheline, but I don't have a good
> reason against it.
>
>
> The first CPU to execute SC "wins" and all reservations held by
> other CPUs on the addresses written are broken.
>
>
> Which would also invalidate all other reservations of those other
> CPUs, preventing them from entering the SC groups.

They will enter the SC groups, but the SCs will fail. Software then
retries the sequence.

>
> If two CPUs enter the SC group at the same time, either one has
> priority, or both fail and delay retry by different amounts.
>
>
> Therefore it is not possible for two CPUs with overlapping
> reservations to enter the SC group at the same time. It is not
> possible to enter SC groups "at the same time" anyways, but now it is
> also impossible for one to enter, and then for another to enter.
>
> I love this! I have to think about it more but it sounds great.
> Thinking about it some more (:P), I think you might need three
> reservation modes.
> 1) exclusive
> 2) shared
> 3) soft
> What we previously called hard is shared&exclusive.
>
> LR/LRM takes shared reservations. first successful SC turns
> reservations exclusive.
> The difference between exclusive and shared is:
> * multiple CPUs can have shared reservations for the same cacheline,
> but if you have exclusive reservations, you are the only one
> * shared reservations can be broken by a successful remote
>
> This means LRM to an exclusive remote reservation can be stalled in HW
> without fear of deadlock, because the remote thread will not acquire
> more addresses.

I think this provides an existence proof that LR/LRM/SC can be
implemented. :-)

> You can guarantee in the absence of interrupts:
> 1) if multiple LR;LRM;SC sequences compete, one of them will reach a
> successful SC
> 2) once a successful SC is executed, the remaining SCs in the sequence
> will also be successful and the operation will be committed.

As long as at least one of those sequences is not interrupted, this will
hold in the presence of interrupts as well, if interrupts cause an
interrupted LR/LRM/SC to be abandoned and retried in software.

> If you fail the LRM and the reservation is dirty, the LRM can
> not really return a "real" value. It either has to return the
> dirty value -- which might never be written to memory and
> might make the remote LR/LRM/SC sequence seem non-atomic, take
> a stale value from the main memory, or make up a value. In
> each case LRM potentially returns garbage (which I think would
> be ok. I'd prefer if it always returned the same garbage
> though, such as all zeros or just the current register content.).
>
>
> This cannot happen -- if the reservation is dirty, then the remote
> LR/LRM/SC has completed and committed.
>
>
> No, because an interrupt could still happen that fails the SC sequence.

No, because the remote LR/LRM/SC has not committed until its *last* SC
has succeeded. The SCs in an LR/LRM/SC sequence are an atomic group,
until they *all* succeed, *no* changes are visible.

>
> The "stale" value must be read until that time, since the atomic
> writes have not yet occurred.
>
>
> In the cache you probably have already the dirty value. I think the
> LRM to an exclusive remote reservation should be stalled until 1) the
> local reservations are invalidated from the outside or 2) the remote
> reservation is freed (whether because of an interrupt or because the
> remote side commits).

If the remote is currently committing, then yes, LR/LRM *should* stall
and wait for the new value or for remote commit to fail. Otherwise,
this hart has just acquired a reservation that it knows will be broken
before local commit is possible.

>
> LR/LRM simply expresses interest in an address; there is no
> requirement that SC ever be executed. A "failed" LRM reads the
> same value that a successful LRM reads, the only difference is
> that a failed LRM was executed without the LR to begin the
> sequence, so no reservation is acquired and SC must fail, causing
> the sequence to be retried.
>
>
> But an LD or LR to an exlusive remote reservation definitely have to
> be stalled. The question is just whether the LRM should stall like the
> LR or whether it is allowed to abort.
> In case we do not need a meaningful, atomic value in case the SC
> fails, we can abort and everything is fine.

LOAD is non-atomic, so there is no reason for it to stall. No
guarantees are made, and loading the old values is correct -- the
transaction has not committed yet. (I plan to specify this in my RVT
proposal -- non-transactional, non-atomic memory accesses read or update
the "original" value of a location in a transaction's read or write
set. A non-transactional STORE to a location in a transaction's read
set causes the transaction to fail, but a non-transactional STORE to a
location in a transaction's write set that the transaction has not read
simply updates the "original" value of that location. A
non-transactional LOAD from a location in a transaction's write set
returns the "original" value, since the "tentative" value has not yet
been committed. An implementation is allowed to make the transaction
read set a superset of the transaction write set, such that a
non-transactional STORE to any location used by a transaction will cause
the transaction to fail. Portable software must not rely on "blind ST"
to actually ignore other writes.)

For LR/LRM, the only reason to stall is to avoid loading a value and
acquiring a reservation that we know will not remain valid when we reach
SC. Stalling at LR/LRM is a performance optimization, as it avoids a
software retry of the sequence by waiting for an almost-complete
competing sequence to actually complete.
LR/LRM takes one input (an address) and produces one output (a value
read from memory) with the side effect of acquiring a reservation. As
long as restarting the LR/LRM group produces the same results, this will
work. For software, the rule would be that the LR/LRM group must not
use the initial value of any register that is used as a destination.
Another way: Destination registers within an LR/LRM group must contain
dead values when the group is entered. A third way: An LR/LRM group
must not write to a register after reading that register.
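
As a concrete illustration (a sketch only, using the hypothetical LRM and an
SC group as discussed), the following sequence satisfies all three
formulations: t0 through t3 hold dead values on entry and the address
registers a0 and a1 are never written, so restarting the group from the top
reads the same locations and produces the same results:

retry:
LR t0, (a0) # t0 was dead on entry; a0 is preserved
LRM t1, (a1) # t1 was dead on entry; a1 is preserved
ADD t2, t0, t1 # new value computed only from the loaded values
SC x0, t2, (a0)
SC t3, t2, (a1)
BNE zero, t3, retry # serialization failure: restart the whole group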

> I think we have this requirement already -- the SC group must
> write exactly to every address that was read in LR/LRM and
> clobbering the addresses makes this impossible within the
> guaranteed forward progress constraints.
>
>
> I don't think I quite understand. Why does this make it impossible?

The RVA forward progress guarantee places a severe limit on code size:
only 16 baseline RVI instructions, including the LR/LRM/SC and the
branch to retry the sequence, and excluding loads, stores, and some
other instructions. There is no time to reconstruct the addresses --
they must be preserved in registers.

> I also don't know what "clobbering" means in this context, is that a
> formal term?

I do not know how formal "clobber" is; I think I first saw it in the GCC
manual. To "clobber" a value is to overwrite the location where that
value is stored.
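
For a concrete instance in the present context (illustration only):

LR t0, (a0) # t0 now holds the value loaded from (a0)
LRM t0, (a1) # clobbers t0: the value just loaded by the LR is overwritten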


-- Jacob

Jacob Bachmeyer

non lue,
11 déc. 2017, 22:29:3911/12/2017
à Jonas Oberhauser,RISC-V ISA Dev,Cesar Eduardo Barros
Jumping to an LRM will cause a *different* atomic operation than
intended to be executed, with one or more trailing SCs ignored; jumping
into a DCAS loop still executes DCAS, not CAS. This is why LRM must be
preceded by LRM or LR. Permitting otherwise just gives me an uneasy
feeling that someone more clever than I will be able to exploit this as
a security hole.

> It's not a perfect approach yet because you can starve if
> the LR/LRMs of two threads are not symmetrical, such as
> "LR x, LRM y" and "while (1) { LR y }" -- the latter LR
> might always be scheduled right between LR x and LRM y,
> thus starving the other thread. In case of symmetry there
> is no problem because the first one to get a reservation
> will always get the remaining reservations, and the
> existing mechanism to prevent starvation for LRs saves the
> day.
>
>
> Both succeed loading "y" and both now have reservations on
> "y". The first one to start executing SC "wins" and breaks
> the other reservation. The latter LR, lacking an SC, is
> useless and conflicts with nothing. Alternately, if
> reservations cause other LRs to stall to reduce retries and
> guarantee forward progress, a round-robin dynamic priority
> scheme solves this -- on one cycle, the LRM breaks the other
> reservation and the first thread completes, on the next, the
> LRM stalls. Another option for this case, since the LR is
> constantly taking the same reservation, is to yield if an LR
> is executed to an address that the same hart already has reserved.
>
>
> The latter is a specific solution to this instance of the problem, but
> not a generic solution (nest more LRMs in the first thread and
> alternate LRs in the second thread).

Using shared and exclusive reservations as you suggested solves the
problem: both threads take shared reservations without conflict and the
first thread to reach SC now has exclusive reservations that cause the
other thread to fail SC and stall when it tries to load the value
again. The first thread's SCs complete and the other thread continues.


-- Jacob

Jacob Bachmeyer

non lue,
12 déc. 2017, 00:09:5612/12/2017
à Jonas Oberhauser,RISC-V ISA Dev,Cesar Eduardo Barros
Jonas Oberhauser wrote:
> 2017-12-11 14:22 GMT+01:00 Jonas Oberhauser <s9jo...@gmail.com>:
>
> Garbage 1:
>
> 2017-12-11 14:08 GMT+01:00 Jonas Oberhauser <s9jo...@gmail.com>:
>
> We might need a rule that LR/LRM instructions must be
> independent of each other in a sequence.
>
>
> No. This is neither necessary nor practical. Loading a pointer
> and dereferencing it atomically should be fine.
> For example, an LR-optimized MCS lock might look like this:
>
>
I will try to write these in pseudo-assembler (width of memory accesses
ignored) to give us something more concrete to discuss:

# an MCS lock is a single pointer
# when free, the lock points to itself

# a0 -- address: lock to acquire
# a1 -- address: local lock queue entry
LOCK:
LR t0, (a0)
LRM t1, (t0)
SC x0, a1, (a0)
SC t2, a1, (t0)
BNE zero, t2, LOCK # retry on serialization failure
BNE t0, a0, LOCK_wait # spin if lock held; a0 == t0 -> lock was free
JR ra # return; we now hold the lock
LOCK_wait:
LR t2, (a0) # check if we hold the lock
BNE t2, a1, LOCK_wait # spin until lock is ours
JR ra # return; we now hold the lock


# a0 -- address: lock to release
# a1 -- address: local lock queue entry
UNLOCK:
LR t0, (a0)
LRM t1, (a1)
BNE t0, a1, UNLOCK_error
BNE t1, a1, UNLOCK_pass
SC x0, a0, (a0)
SC t2, a1, (a1)
BNE zero, t2, UNLOCK # retry on serialization failure
JR ra # return; lock is now free
UNLOCK_pass: # pass lock to next waiter in queue
SC x0, t1, (a0)
SC t2, a1, (a1)
BNE zero, t2, UNLOCK # retry on serialization failure
JR ra # return; lock is passed to next waiter
UNLOCK_error: # but we were not holding the lock...
...


The UNLOCK code in particular is interesting. I believe that it meets
the RVA forward progress guarantee constraints, even though it has two
SC groups! The LOCK code has another trick: an LR with no
corresponding SC. This was chosen because I assume that remote SCs will
be immediately visible to local LR, even if FENCE is needed to make them
visible to ordinary LOAD. Further, the LOCK_wait LR can take a shared
reservation, avoiding cacheline bounces until that shared reservation is
broken and the local cacheline reloaded at the next LR.

> Because the repeat until success loops appear to be atomic, you don't
> have to bother with a bad interleaving where the lock is allocated
> while someone else frees the lock (which is what makes MCS locks so
> complicated).
>
> To see why this algorithm works, imagine a singly linked queue of locks.
> There are three invariants:
> 1) The oldest element in the queue currently has the lock.
> 2) Iff l_p points to itself, the queue is empty.
> 3) Otherwise l_p points to the newest element in the queue, which
> points to itself.
> On LOCK we enqueue a new element, which means atomically pointing l_p
> and previous newest element in the queue to myself.
> If l_p pointed to itself the queue was empty and I now have the oldest
> lock in the queue; otherwise, I have to wait for the previously newest
> element in the queue to release the lock.
>
> On UNLOCK I release the lock by pointing at myself, thus removing
> myself from the queue. The only difficulty is if I was the last
> element in the queue, in which case the queue is now empty.
>
>
> I'm not saying this is a great algorithm -- unlike the normal MCS lock
> this one might not be fair in the sense that one of the threads can
> attempt an infinite number of LR/LRMs without ever reaching the SCs --
> I was just giving an example of an algorithm where LRMing from a
> previous LRed pointer makes sense.

The RVA forward progress guarantee says otherwise. :-) I think this
may be a good example and argument for multiple LR/SC, although I do not
know how this compares to using the MCS locks from
<URL:https://lwn.net/Articles/590243/> and AMOSWAP. It looks to me like
LR/LRM/SC provides an MCS lock in a single pointer, while using AMOSWAP
requires having a separate "lock-held" flag.

> PS: I now enabled "undo send" in gmail, I hope this will prevent
> future mistakes...
Your mistakes helpfully trifurcated a thread that was becoming an
exchange of essays. :-)


-- Jacob

Jonas Oberhauser

non lue,
12 déc. 2017, 04:01:3412/12/2017
à Cesar Eduardo Barros,Jacob Bachmeyer,RISC-V ISA Dev

No, you never have to report success.

Here's a simple example

x = 0
LR (on failure, write x=1, on success, write y = loadresult)
LRM (on failure, write x=1, on success, write z = loadresult)

You only write one register per LR/LRM, but you can always tell whether any LR/LRM failed
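
Rendered in the thread's pseudo-assembler (a purely hypothetical encoding,
only to make the idea above concrete; the conditional write of the failure
flag is what a later reply identifies as a form of conditional move):

retry:
li t0, 0 # x := 0
LR t1, (a0) # hypothetical: on failure writes t0 := 1, on success t1 := load result
LRM t2, (a1) # hypothetical: same failure behaviour for t0
BNE zero, t0, retry # t0 != 0 iff any LR/LRM in the group failed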

Jonas Oberhauser

non lue,
12 déc. 2017, 04:59:5312/12/2017
à Jacob Bachmeyer,RISC-V ISA Dev,Cesar Eduardo Barros
2017-12-12 4:18 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
Jonas Oberhauser wrote:
2017-12-11 6:21 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:

    Jonas Oberhauser wrote:

        It is only in case the LR/LRM sequence fails that the LRMs in
        the sequence might load garbage. So if you wrongly implemented
        the DCAS as "LR;LRM; if values match compare data then SC;SC",
        the cas does not appear to be an atomic double-wide load in
        case the LRM fails. Note that the SCs will fail if the LRM
        failed even if the garbage values match the compare data, so
        that is not the problem -- the problem is if you want to
        continue using the load results.


    Why is that wrong for DCAS?  A problem is seen only if LRM returns
    garbage instead of reading memory without acquiring a
    reservation.  If LRM returns garbage, then DCAS will almost
    certainly fail ("wrong values") instead of retrying
    ("serialization failure").


In a *normal* DCAS, it is totally necessary to have the old values, since they may be used in the next iteration of the loop

a,b = R x,y;  do {  c,d = op(a,b);   a,b, fail = DCAS x (a -> c) y (b -> d)   } while (fail)

Note the lack of a reload in the loop.

There are two failure cases:  (1) DCAS reads memory and finds that the values in memory do not match the expected values, and (2) the LR/LRM/SC/SC that implements DCAS does not execute atomically.  In case (2), the correct response is to retry the LR/LRM/SC/SC sequence.  In case (1), DCAS itself was successfully executed, but failed.  The loop given in the example is an infinite loop if locations x and y do not eventually return to the values initially read. 
 
Or are you saying that DCAS returns the values actually read thus updating a and b? 

That is what I am saying.
 
In this latter case, LR/LRM must never return garbage:  the value read by LR/LRM will be used in the calculation if DCAS fails.

No, because the LR/LRM implementation looks different.

try:
a = LR x
b = LRM y
c,d = op(a,b)
*x = SC c
*y = SC d fail
retry on fail 

In this particular example, we neither 1) reuse the old values (it would be fatal) nor 2) check whether the loads returned garbage, we just retry on a failed SC.

This is a point that needs to be clarified:  LR/LRM is *not* LT.  There is no instruction that confirms that an LR/LRM sequence has read a consistent set of values.  The only guarantee is that all SCs will fail if any value read by LR/LRM was changed by another hart after being read.

I'm fine with that. 

With LR/LRM, such a loop does not need the values, because the operation is written between the LR/LRMs and the SCs, not "before" them,
But I do not know if there are other examples where even with LR/LRM one would like to continue using the values.

This will vary -- the operation must meet the RVA forward progress guarantee constraints if it appears in the LR/LRM/.../SC/SC sequence.  Some operations may need to use non-atomic reads followed by a multi-word CAS to perform an atomic update.

Possibly. 

    Either one CPU has priority or both must stall for the hard
    reservation timeout, then continue. 

I thought that the reservation clock is only ticking during CPU cycles, of which there are none during the stall. That is why I suggested flipping the reservation state temporarily.

The reservation clock ticks on memory bus cycles, or some other clock that continues to run unless the entire system is halted.  Otherwise, suspending a hart that happens to hold a reservation could cause deadlock.

What does suspending mean here?
* By an IPI (e.g., INIT in x86): give up reservations, no deadlock
* By a non-integer, non-LRM/SC operation: no forward guarantee, so give up reservations, no deadlock
* By an LRM: well, this was the problem I was thinking about, but using the exclusive/shared/soft scheme this can be avoided.

What other reasons for suspending the hart are there?

 
    If both stall, CPU1 now has "y" reserved and CPU2 has "x"
    reserved, but both have lost their earlier reservations and will
    fail at SC. The simple (Ethernet collision recovery) strategy is
    to stall for a random small number of cycles upon an attempt to
    reacquire a reservation that has previously timed out.  On the
    first attempt, both CPUs fail at SC, they both jump back to retry,
    they stall for different amounts of time at the first LR, and one
    (I will say CPU1) clearly acquires the reservations and reaches
    the SCs before the other.  CPU2 then fails at SC and retries,
    succeeding on the third attempt, or stalls at LR (observing CPU1's
    reservations) and waits for CPU1 to complete or the reservations
    to again expire.

Yes, but I'd rather not have to put that into software. I'd like to maintain a forward guarantee for one of the threads.

The only thing put into software in this example is the "retry on serialization failure" that LR/SC already requires.  The stall for a random small number of cycles is a hardware operation.  Hardware knows to insert the random backoff stall when attempting to reacquire a reservation that this hart previously held, but that timed out while this hart was stalled acquiring another reservation.  The required state information can be invisible to software, since this is only needed to break lock-step if two harts arrive at (and thus will retry) an LR/SC simultaneously and a preemptive context switch on either hart will certainly break that lock-step.

Ah. What confused me was the following sentence:
"On the first attempt, both CPUs fail at SC, they both jump back to retry, they stall for different amounts of time at the first LR"

I don't think it's practical to keep this status information, but I haven't thought enough about it.

By the way, you do not have to stall for a random amount, but for a hopefully unique amount. In a local system you might have different IDs for each CPU, which you can use to wait.

 
    However, until one CPU has reached the SC sequence, no
    reservations are dirty.  The guarantee of forward progress
    requires that all SCs be contiguous, so renewing a hard
    reservation just long enough to complete four SC operations avoids
    this with bounded delay. 

Hm. I didn't think it would be a good idea to allow multiple CPUs to have reservations to the same cacheline, but I don't have a good reason against it.
 
    The first CPU to execute SC "wins" and all reservations held by
    other CPUs on the addresses written are broken. 

Which would also invalidate all other reservations of those other CPUs, preventing them from entering the SC groups.

They will enter the SC groups, but the SCs will fail.  Software then retries the sequence.

Sorry, yeah, that is what I meant. I should have said "successfully entering the SC groups". From now on I will try to make this distinction.
 
You can guarantee in the absence of interrupts:
1) if multiple LR;LRM;SC sequences compete, one of them will reach a successful SC
2) once a successful SC is executed, the remaining SCs in the sequence will also be successful and the operation will be committed.

As long as at least one of those sequences is not interrupted, this will hold in the presence of interrupts as well, if interrupts cause an interrupted LR/LRM/SC to be abandoned and retried in software.

I meant to say: "one of them will reach a successful SC on the first try".
But even in their current wording neither of the guarantees can be satisfied if there are too many interrupts (i.e., there is an interrupt between LR and SC every single damn time).
Another bad example is "after both enter the SC groups, the one that successfully entered is interrupted."


        If you fail the LRM and the reservation is dirty, the LRM can
        not really return a "real" value. It either has to return the
        dirty value -- which might never be written to memory and
        might make the remote LR/LRM/SC sequence seem non-atomic, take
        a stale value from the main memory, or make up a value. In
        each case LRM potentially returns garbage (which I think would
        be ok. I'd prefer if it always returned the same garbage
        though, such as all zeros or just the current register content.).


    This cannot happen -- if the reservation is dirty, then the remote
    LR/LRM/SC has completed and committed. 

No, because an interrupt could still happen that fails the SC sequence.

No, because the remote LR/LRM/SC has not committed until its *last* SC has succeeded.  The SCs in an LR/LRM/SC sequence are an atomic group, until they *all* succeed, *no* changes are visible.

Exactly. But the reservation can only be made dirty by a successful SC which is not the last in the sequence. Thus "if the reservation is dirty, then the remote LR/LRM/SC has not completed or committed, but is in the process of committing".

I think the problem here (and with the next lines) is that we are not in sync.
In my implementation, a successful SC already replaces the cacheline with the new value, thus has to stall remote accesses.
The advantage of this is that the eventual commit is atomic and you don't need to buffer the new values somewhere else, and you can take an interrupt anywhere in the sequence without problems.

In an implementation where the SC-values are buffered in a CPU-side buffer, the CPU on the last SC pushes all of its SCs through, I agree with you.
Note that this means that while this buffer is drained, no interrupts can be taken (not sure about claimed).
In this case the SC sequence still stalls remote accesses.

In my answers to the next lines, I will refer to the two implementations as the "dirty cache implementation" for the one that changes the cache lines before the commit, and as the "buffer implementation" for the one that has an extra buffer and grabs the memory bus while draining the buffer.

    The "stale" value must be read until that time, since the atomic
    writes have not yet occurred.

In the cache you probably have already the dirty value. I think the LRM to an exclusive remote reservation should be stalled until 1) the local reservations are invalidated from the outside or 2) the remote reservation is freed (whether because of an interrupt or because the remote side commits).

If the remote is currently committing, then yes, LR/LRM *should* stall and wait for the new value or for remote commit to fail.  Otherwise, this hart has just acquired a reservation that it knows will be broken before local commit is possible.

Yes, but in the dirty cache implementation, a dirty reservation means that the remote is currently committing (but might still fail due to an interrupt).
If the reservation is not dirty, I completely agree that you can take the reservation (and it might be broken), all of that is fine.

 
      LR/LRM simply expresses interest in an address; there is no
    requirement that SC ever be executed.  A "failed" LRM reads the
    same value that a successful LRM reads, the only difference is
    that a failed LRM was executed without the LR to begin the
    sequence, so no reservation is acquired and SC must fail, causing
    the sequence to be retried.


But an LD or LR to an exclusive remote reservation definitely has to be stalled. The question is just whether the LRM should stall like the LR or whether it is allowed to abort.
In case we do not need a meaningful, atomic value in case the SC fails, we can abort and everything is fine.

LOAD is non-atomic, so there is no reason for it to stall.  
No guarantees are made, and loading the old values is correct -- the transaction has not committed yet. 

The problem is if you have two LOADS and you do not stall, the first one might see a new value, and the second one an old value (even with .aq bits set), and thus the LR/LRM/SC is not atomic.
This is true in both implementations for the following execution:
T1: SC x;
T2: LOAD x (with .aq);
T2: LOAD y;
T1: SC y;

in the buffer implementation, the sequence is currently committing and the value of x has been written but the value of y has not been written yet.
in the dirty cache implementation, the cache has the value of SCx which would be put on the bus for x, but not the value of SCy for y.

In each case you have to stall the loads to the area to which you are currently committing, or you need to build a second buffer which keeps the stale values and puts them on the bus.

 For LR/LRM, the only reason to stall is to avoid loading a value and acquiring a reservation that we know will not remain valid when we reach SC.  Stalling at LR/LRM is a performance optimization, as it avoids a software retry of the sequence by waiting for an almost-complete competing sequence to actually complete.

I don't quite agree. In HW, the only way not to stall is to add extra HW. The performance optimization is actually the simpler and cheaper option.

What I agree on is that aborting the LRM and loading nothing is a performance optimization, which may or may not bite us later. But it is easier to go from "loads nothing" to "behaves like a LOAD" than from "behaves like a LOAD" to "loads nothing".

I also realized this later. Take the LR/LRM sequence and take all registers which are used by an LR/LRM but not overwritten by a previous LR/LRM in the sequence.
The addresses in that set of registers have to be restored to their values from before the LR, either in SW (by not overwriting such a register) or in HW. Both sound complicated and error-prone.

LR/LRM takes one input (an address) and produces one output (a value read from memory) with the side effect of acquiring a reservation.  As long as restarting the LR/LRM group produces the same results, this will work.    [...] 
Another way:  Destination registers within an LR/LRM group must contain dead values when the group is entered. 
Yep. 
For software, the rule would be that the LR/LRM group must not use the initial value of any register that is used as a destination.  [...]  A third way:  An LR/LRM group must not write to a register after reading that register.
These work but are too strong. Consider:
t1 = LR a1
t1 = LRM t1

This is fine, but prohibited by both rules above.

That's why I think it's a bit error-prone; it's a bit subtle.
 
    I think we have this requirement already -- the SC group must
    write exactly to every address that was read in LR/LRM and
    clobbering the addresses makes this impossible within the
    guaranteed forward progress constraints.

 I don't think I quite understand. Why does this make it impossible?

The RVA forward progress guarantee places a severe limit on code size:  only 16 baseline RVI instructions, including the LR/LRM/SC and the branch to retry the sequence, and excluding loads, stores, and some other instructions.  There is no time to reconstruct the addresses -- they must be preserved in registers.

There is. Consider the following

try:
t1 = a1 + 4
t2 = LR a1
t1 = LRM t1
retry if t2 != t1 
t3 = a1 + 4
*a1 = SC 0
*t3 = SC 0

Does it make sense to program like this? No. Are there programs where it makes sense? I doubt it, but I don't know. But for sure it is possible to program like this (and by Murphy's law, someone will probably do it).
If you want this condition, you need to make it explicit, it's not a consequence of existing rules.

I also don't know what "clobbering" means in this context, is that a formal term?

I do not know how formal "clobber" is; I think I first saw it in the GCC manual.  To "clobber" a value is to overwrite the location where that value is stored.

Okay :) Thanks.

Jonas Oberhauser

non lue,
12 déc. 2017, 05:04:0112/12/2017
à Jacob Bachmeyer,RISC-V ISA Dev,Cesar Eduardo Barros
Yes, but incorrectly jumping into a cas loop will mess up the thread and all shared memory that the thread can access. If you incorrectly restore a thread, the thread is going to be garbage.

In case of a DCAS, it might enqueue a garbage command into a command queue that happens to be interpreted as "rm -r *". It's not our job to fix that.
 
The latter is a specific solution to this instance of the problem, but not a generic solution (nest more LRMs in the first thread and alternate LRs in the second thread).

Using shared and exclusive reservations as you suggested solves the problem:  both threads take shared reservations without conflict and the first thread to reach SC now has exclusive reservations that cause the other thread to fail SC and stall when it tries to load the value again.  The first thread's SCs complete and the other thread continues.

Yep. 

Jonas Oberhauser

non lue,
12 déc. 2017, 05:31:1112/12/2017
à Jacob Bachmeyer,RISC-V ISA Dev,Cesar Eduardo Barros
I like the UNLOCK. I don't know why you think that an LR will help in the LOCK_wait.
The following are all true:
1) all SCs in the same group become visible at once
2) the SCs will eventually become visible, at which point both the LR and LD (which are executed in a loop) would see them
3) fences will not make the SCs visible any sooner, neither local fences nor remote fences. Local fences have no effect on the remote SCs, and remote fences will just delay processor execution but probably not make the SCs happen any faster
4) it is impossible to see from the local hart whether the SCs take a long time because they are being buffered, or because the remote is just terribly slow.
5) the remote can probably not hurry up even if it sees that we have a reservation.

There are also no cacheline bounces because the only thing you execute is that one LD, and the only thread that will also access that cacheline is the one that currently holds the lock -- in which case it will only access it once, when it releases the lock.

 
Because the repeat until success loops appear to be atomic, you don't have to bother with a bad interleaving where the lock is allocated while someone else frees the lock (which is what makes MCS locks so complicated).

To see why this algorithm works, imagine a singly linked queue of locks. There are three invariants:
1) The oldest element in the queue currently has the lock. 2) Iff l_p points to itself, the queue is empty.
3) Otherwise l_p points to the newest element in the queue, which points to itself.
On LOCK we enqueue a new element, which means atomically pointing l_p and previous newest element in the queue to myself.
If l_p pointed to itself the queue was empty and I now have the oldest lock in the queue; otherwise, I have to wait for the previously newest element in the queue to release the lock.

On UNLOCK I release the lock by pointing at myself, thus removing myself from the queue. The only difficulty is if I was the last element in the queue, in which case the queue is now empty.


I'm not saying this is a great algorithm -- unlike the normal MCS lock this one might not be fair in the sense that one of the threads can attempt an infinite number of LR/LRMs without ever reaching the SCs -- I was just giving an example of an algorithm where LRMing from a previous LRed pointer makes sense.

The RVA forward progress guarantee says otherwise.  :-)  I think this may be a good example and argument for multiple LR/SC, although I do not know how this compares to using the MCS locks from <URL:https://lwn.net/Articles/590243/> and AMOSWAP.  It looks to me like LR/LRM/SC provides an MCS lock in a single pointer, while using AMOSWAP requires having a separate "lock-held" flag.

At VSTTE15 I also presented a single-pointer MCS LOCK which only uses AMOSWAP (see attachment).

The real simplification that LRM gives you is that you don't need the weird (but insanely clever) wait for the next one in the UNLOCK (page 2), because you know that nothing can happen between loading your me.next and checking the next one.


PS: I now enabled "undo send" in gmail, I hope this will prevent future mistakes...
Your mistakes helpfully trifurcated a thread that was becoming an exchange of essays.  :-)

Well, looks like we brought it back to essays :) I'll try to split it into more parts in the future.

MCS_LOCK.pdf

Cesar Eduardo Barros

non lue,
12 déc. 2017, 18:08:1712/12/2017
à Jonas Oberhauser,Jacob Bachmeyer,RISC-V ISA Dev
On 12-12-2017 07:00, Jonas Oberhauser wrote:
>
>
> 2017-12-11 21:50 GMT+01:00 Cesar Eduardo Barros <ces...@cesarb.eti.br>:
Clever, but that would be a form of conditional move. Conditional moves
have been considered but rejected for RISC-V (see chapter 2.5 of
riscv-spec-v2.2).

For out-of-order implementations which use register renaming, you would
need three input registers (the load address, the original x, and the
original y/z) and two output registers (the renamed x and the renamed
y/z). Conditionally writing to a register doesn't reduce the number of
outputs, since you can't conditionally rename a register, because the
renaming happens before the instruction is executed. If the instruction
doesn't write to the register, it still has to copy the old value from
its original physical register to its new physical register, so you also
increase the number of inputs.

That is, what we have today for LR is one input (the address) and one
output (the destination register), and for SC two inputs (the address
and the source register) and one output (the "success" register). What
you proposed would need three inputs, more than anything other than FMA
(which is in a different register file), and two outputs, more than
anything other than CSR swaps (though the CSRs are more of an I/O space
than registers, so in the likely case they are treated more like a
strange sort of memory, there is again only one register output for them).

Jacob Bachmeyer

non lue,
12 déc. 2017, 22:37:4212/12/2017
à Jonas Oberhauser,RISC-V ISA Dev,Cesar Eduardo Barros
Jonas Oberhauser wrote:
> 2017-12-12 4:18 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
>
> Jonas Oberhauser wrote:
>
> 2017-12-11 6:21 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
Correct, but LR/LRM/SC can be used to implement DCAS, and portable
synchronization libraries might choose to rely on DCAS, so the first
loop with DCAS should work if DCAS is a subroutine implemented using
LR/LRM/SC. If LR/LRM can return garbage instead of "old" values, then
those portable libraries would need to accommodate a quasi-DCAS that
does not return usable values upon failure.
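
For concreteness, a minimal sketch of such a quasi-DCAS as a subroutine,
written in the same pseudo-assembler as the earlier examples and assuming the
hypothetical LRM and SC-group semantics under discussion; it reports only
success or failure and returns no old values, so a caller written for a real
DCAS would have to re-read the locations itself:

# a0, a1 -- addresses of the two words
# a2, a3 -- expected values; a4, a5 -- new values
quasi_dcas:
mv t3, a0 # keep the first address; a0 will return the status
retry:
LR t0, (t3)
LRM t1, (a1)
li a0, 1 # assume failure
BNE t0, a2, done # mismatch (or the LRM read garbage); reservations simply abandoned
BNE t1, a3, done
SC x0, a4, (t3)
SC a0, a5, (a1) # a0 := 0 on success, nonzero on serialization failure
BNE zero, a0, retry # retry the whole group on serialization failure
done:
JR ra # a0 = 0: success; a0 = 1: failure, the caller must re-read the values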

> Either one CPU has priority or both must stall for the hard
> reservation timeout, then continue.
>
> I thought that the reservation clock is only ticking during
> CPU cycles, of which there are none during the stall. That is
> why I suggested flipping the reservation state temporarily.
>
>
> The reservation clock ticks on memory bus cycles, or some other
> clock that continues to run unless the entire system is halted.
> Otherwise, suspending a hart that happens to hold a reservation
> could cause deadlock.
>
>
> What does suspending mean here?
> * By an IPI (e.g., INIT in x86): give up reservations, no deadlock
> * By a non-integer, non-LRM/SC operation: no forward guarantee, so
> give up reservations, no deadlock
> * By an LRM: well, this was the problem I was thinking about, but
> using the exclusive/shared/soft scheme this can be avoided.
>
> What other reasons for suspending the hart are there?

Power management -- shut down a hart to save power. If that hart was
holding reservations (it should not be, but...) something must ensure
that those reservations do not deadlock other harts later. (Releasing
reservations upon entering low-power mode should also work, but the
reservations are tracked in the memory subsystem anyway, so why not
expire them on the memory clock?)
The required status information is the same information needed to track
reservations in the first place plus one bit per reservation for an
expiration flag -- if taking a reservation that previously expired, stall.

> By the way, you do not have to stall for a random amount, but for a
> hopefully unique amount. In a local system you might have different
> IDs for each CPU, which you can use to wait.

Correct, random is just an easy way to get a hopefully unique amount
with dynamic priority where using hart IDs would produce a static
priority model.

> You can guarantee in the absence of interrupts:
>
> 1) if multiple LR;LRM;SC sequences compete, one of them will
> reach a successful SC
> 2) once a successful SC is executed, the remaining SCs in the
> sequence will also be successful and the operation will be
> committed.
>
>
> As long as at least one of those sequences is not interrupted,
> this will hold in the presence of interrupts as well, if
> interrupts cause an interrupted LR/LRM/SC to be abandoned and
> retried in software.
>
>
> I meant to say: "one of them will reach a successful SC on the first try".
> But even in their current wording neither of the guarantees can be
> satisfied if there are too many interrupts (i.e., there is an
> interrupt between LR and SC every single damn time).
> Another bad example is "after both enter the SC groups, the one that
> successfully entered is interrupted."

At some point, interrupt load is simply too high for any forward
progress guarantee to hold. This is a system design problem if the hart
never executes 32 instructions uninterrupted.

The solution in the second case could be to delay interrupts for up to 4
cycles to complete the SC group.
There is also the Herlihy implementation, which uses a pair of
cachelines for each value, discarding one upon successful commit and the
other upon abort. Commit in Herlihy's model is atomic -- the cache can
update the "pending" cachelines to "current" and discard the "old"
cachelines in the same cycle.

> The "stale" value must be read until that time, since the
> atomic
> writes have not yet occurred.
>
> In the cache you probably have already the dirty value. I
> think the LRM to an exclusive remote reservation should be
> stalled until 1) the local reservations are invalidated from
> the outside or 2) the remote reservation is freed (whether
> because of an interrupt or because the remote side commits).
>
>
> If the remote is currently committing, then yes, LR/LRM *should*
> stall and wait for the new value or for remote commit to fail.
> Otherwise, this hart has just acquired a reservation that it knows
> will be broken before local commit is possible.
>
>
> Yes, but in the dirty cache implementation, a dirty reservation means
> that the remote is currently committing (but might still fail due to
> an interrupt).
> If the reservation is not dirty, I completely agree that you can take
> the reservation (and it might be broken), all of that is fine.

While the local hart is executing an SC group, a remote LR/LRM must
either read the "old" value (taking a reservation that will almost
certainly be broken in a few cycles) or stall until the "new" value is
available (when the SC group completes). In a dirty cache
implementation, the "old" value may not be available, but the cache is
not dirty until an exclusive reservation has been taken, so having
LR/LRM stall and wait for the "new" value is appropriate. If remote
LR/LRM stalls and the local commit fails, invalidating the dirty
cacheline, then the remote hart must go fetch the value from an outer
cache or main memory.

So two more cache-coherence messages are needed: "wait" and "oops, I
lost it; go fish". (The phrase "go fish" refers to a card game.)

> LR/LRM simply expresses interest in an address; there is no
> requirement that SC ever be executed. A "failed" LRM
> reads the
> same value that a successful LRM reads, the only difference is
> that a failed LRM was executed without the LR to begin the
> sequence, so no reservation is acquired and SC must fail,
> causing
> the sequence to be retried.
>
>
> But an LD or LR to an exclusive remote reservation definitely
> has to be stalled. The question is just whether the LRM
> should stall like the LR or whether it is allowed to abort.
> In case we do not need a meaningful, atomic value in case the
> SC fails, we can abort and everything is fine.
>
>
> LOAD is non-atomic, so there is no reason for it to stall.
>
> No guarantees are made, and loading the old values is correct --
> the transaction has not committed yet.
>
>
> The problem is if you have two LOADS and you do not stall, the first
> one might see a new value, and the second one an old value (even with
> .aq bits set), and thus the LR/LRM/SC is not atomic.
> This is true in both implementations for the following execution:
> T1: SC x;
> T2: LOAD x (with .aq);
> T2: LOAD y;
> T1: SC y;

In this execution trace, T2 must either read old values for both x and y
or stall until the new values are known -- none of the SCs are visible
until the last SC completes successfully. Since RVI LOAD does not have
an .aq bit (that is a possible future extension mentioned in the new
memory model), the acquire semantics require the RVA LR instruction,
which should stall in this case.

> in the buffer implementation, the sequence is currently committing and
> the value of x has been written but the value of y has not been
> written yet.

The new value of x is in a local buffer; the new value of y will be
placed in the buffer on the next cycle. The buffer does not commit
until it has both values. The buffer implementation grabs the memory
bus while flushing the buffer, so no other memory accesses are possible
while the buffer is committing.

> in the dirty cache implementation, the cache has the value of SCx
> which would be put on the bus for x, but not the value of SCy for y.

But the cache knows the exclusive reservations on x and y, so it can
produce the "wait" message for both x and y.

> In each case you have to stall the loads to the area to which you are
> currently committing, or you need to build a second buffer which keeps
> the stale values and puts them on the bus.

Yes, until the commit is done, loads must either return the old values
or stall for the new values. The new values are *not* visible until the
last SC succeeds.

> For LR/LRM, the only reason to stall is to avoid loading a value and
> acquiring a reservation that we know will not remain valid when we
> reach SC. Stalling at LR/LRM is a performance optimization, as it
> avoids a software retry of the sequence by waiting for an
> almost-complete competing sequence to actually complete.
>
>
> I don't quite agree. In HW, the only way not to stall is to add extra
> HW. The performance optimization is actually the simpler and cheaper
> option.
>
> What I agree on is that aborting the LRM and loading nothing is a
> performance optimization, which may or may not bite us later. But it
> is easier to go from "loads nothing" to "behaves like a LOAD" than
> from "behaves like a LOAD" to "loads nothing".

The problem is that loading nothing on LRM *is* going from "behaves like
a LOAD" to "loads nothing" -- the prototype for LRM is LR, which
"behaves like a LOAD" currently.

> I also realized this later. Take the LR/LRMs sequence and take
> all registers which are used by an LR/LRM but not overwritten
> by a previous LR/LRM in the sequence.
> That set of addresses has to be restored to the point before
> the LR, either in SW (by not overwriting such a register) or
> in HW. Both sound complicated and error-prone.
>
>
> LR/LRM takes one input (an address) and produces one output (a
> value read from memory) with the side effect of acquiring a
> reservation. As long as restarting the LR/LRM group produces the
> same results, this will work. [...]
>
> Another way: Destination registers within an LR/LRM group must
> contain dead values when the group is entered.
>
> Yep.
>
> For software, the rule would be that the LR/LRM group must not use
> the initial value of any register that is used as a destination.
> [...] A third way: An LR/LRM group must not write to a register
> after reading that register.
>
> These work but are too strong. Consider:
> t1 = LR a1
> t1 = LRM t1
>
> This is fine, but prohibited by both rules above.

What? That is allowed under both rules -- the initial value of t1 is
dead and not used before it is loaded from (a1) and the LRM is
considered to write t1 on the same step (not after) that it reads t1 to
get a memory address.

> That's why I think it's a bit error prone, it's a bit subtle.

It looks to me that the most error prone part will be writing the rules
clearly. :-)
Maybe a rule that addresses must be preserved should be made explicit,
but that sequence does violate the proposed requirements for LR/LRM to
be restartable -- the value of t1 prior to LR is both used and clobbered
in the LR/LRM group.
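
A variant of that sequence which keeps the loaded address live for the SC
group (again only a sketch with the hypothetical instructions) would be:

retry:
LR t1, (a1) # t1 := pointer read from (a1)
LRM t2, (t1) # dereference into a separate register; t1 keeps the address
# ... compute new values into t4 and t5 ...
SC x0, t4, (a1)
SC t3, t5, (t1) # t1 still holds the address the LRM used
BNE zero, t3, retry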


-- Jacob

Jacob Bachmeyer

non lue,
12 déc. 2017, 22:57:4512/12/2017
à Jonas Oberhauser,RISC-V ISA Dev,Cesar Eduardo Barros
Jonas Oberhauser wrote:
> 2017-12-12 6:09 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
>
> Jonas Oberhauser wrote:
>
> 2017-12-11 14:22 GMT+01:00 Jonas Oberhauser <s9jo...@gmail.com>:
>
> Garbage 1:
>
> 2017-12-11 14:08 GMT+01:00 Jonas Oberhauser <s9jo...@gmail.com>:
Baseline RVI LOAD does not have atomic access semantics. Only LR can
have acquire or sequential consistency specified.

> The following are all true:
> 1) all SCs in the same group become visible at once
> 2) the SCs will eventually become visible, at which point both the LR
> and LD (which are executed in a loop) would see them
> 3) fences will not make the SCs visible any sooner, neither local
> fences nor remote fences. Local fences have no effect on the remote
> SCs, and remote fences will just delay processor execution but
> probably not make the SCs happen any faster
> 4) it is impossible to see from the local hart whether the SCs take a
> long time because they are being buffered, or because the remote is
> just terribly slow.
> 5) the remote can probably not hurry up even if it sees that we have a
> reservation.

The RISC-V memory model is quite weak, and I do not think the revised
memory model gives any better guarantees for RVI LOAD. Implementations
are also allowed, as I understand, to have loose cache-coherency, where
RVI LOAD may *never* see an update without a FENCE. LR/LRM/SC, of
course, must be their own FENCEs.

> There are also no cacheline bounces because the only thing you execute
> is that one LD, and the only thread that will also access that
> cacheline is the one that currently holds the lock -- in which case it
> will only accesses it once, when it releases the lock.

The loop is reading the original lock, not the local queue entry. Any
other threads wanting to take the lock will also need to write that
location. I have probably misunderstood MCS locks.


-- Jacob

Jacob Bachmeyer

non lue,
13 déc. 2017, 00:39:4313/12/2017
à Jonas Oberhauser,RISC-V ISA Dev,Cesar Eduardo Barros
Jonas Oberhauser wrote:
> 2017-12-12 6:09 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
>
> Jonas Oberhauser wrote:
>
> [...]
(A side note, from the last page of the chapter on RVA in the ISA spec
draft: <<The instructions in the "A" extension can also be used to
provide sequentially consistent loads and stores. A sequentially
consistent load can be implemented as an LR with both aq and rl set. A
sequentially consistent store can be implemented as an AMOSWAP that
writes the old value to x0 and has both aq and rl set.>>)
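
(In instruction form, following that note -- the .aqrl suffix as in standard
RVA assembler, shown only as an illustration:

# sequentially consistent load of (a0) into t0:
lr.w.aqrl t0, (a0)
# sequentially consistent store of t1 to (a0), discarding the old value:
amoswap.w.aqrl x0, t1, (a0)
)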


Could LOCK be simplified to using AMOSWAP while UNLOCK uses LR/LRM/SC/SC?

Another try with pseudo-assembler:

# an MCS lock is a single pointer
# a free lock points to itself
# a held lock points to the tail of a lock queue
# taking the lock requires supplying a queue slot

# a0: address: lock
# a1: address: local queue slot
LOCK:
AMOSWAP x0, a0, (a1) # clear queue slot
AMOSWAP t0, a1, (a0) # extend queue
BNE t0, a0, LOCK_enqueue
JR ra
LOCK_enqueue:
AMOSWAP x0, a1, (t0) # add self to queue
LOCK_wait:
LR t1, (t0) # load next-in-line
BEQ t1, a0, LOCK # start over if lock freed
BEQ t1, a1, LOCK_wait # spin as long as it is us
JR ra

# a0: address: lock
# a1: address: local queue slot
UNLOCK:
LR t0, (a0) # load tail-of-queue
LRM t1, (a1) # load next-in-line
BNE t0, a1, UNLOCK_pass
SC x0, a0, (a0) # release lock
SC t2, a1, (a1) # release our queue slot
BNE zero, t2, UNLOCK # retry serialization
JR ra
UNLOCK_pass:
SC x0, t0, (a0) # preserve tail-of-queue
SC t2, a1, (a1) # release our queue slot
BNE zero, t2, UNLOCK # retry serialization
JR ra


I find the result that taking an MCS lock can be less expensive than
releasing it even without contention quite interesting, although the
reason for this (that the current holder must resolve contention for the
lock) should not surprise me.

Without the forward progress guarantee, UNLOCK can get into livelock,
competing with LOCK operations adding to the queue. This means that AMO
instructions to an exclusive reservation must also stall, but
consistency in LOCK requires the LOCK_enqueue AMOSWAP to break a
reservation and force UNLOCK to retry.

The first BEQ instruction in the wait loop handles the case where a
thread moving to take the lock is preempted between extending the queue
and adding itself to the queue and the lock is released by another
thread, although (oops!) the other thread's queue slot is still
overwritten in this case. The second BEQ is written that way to handle
the case where the other thread's queue slot ceases to be valid after the
other thread passes the lock to this thread, but this thread remains
preempted long enough that the remote queue slot contains garbage; failure
(infinite loop) is 1-in-2^XLEN in this case, assuming a random value in
the no-longer-valid queue slot that never changes. Note that UNLOCK
writes a value that will break the loop, so failure is unlikely. I
think that a safe LOCK requires LR/LRM/SC, since LOCK needs to
atomically update both the lock itself and a queue entry. We cannot
race with other threads to get into the queue, but we can race with UNLOCK.

This leads to a key advantage for LR/LRM/SC: since taking any trap
breaks the reservations, an LR/LRM/SC sequence remains atomic even with
arbitrary preemption latencies.

Interestingly, the only reason for UNLOCK to take a reservation on the
local queue slot is to ensure that another thread cannot add itself to
the queue after we have decided to release the lock. The value read
from the local queue slot is not actually used. Should LR/LRM to x0 be
defined specially (to skip actually reading memory, but still take the
reservation) for this situation, where a value must not change, but the
actual value is not used?


-- Jacob

Jonas Oberhauser

non lue,
13 déc. 2017, 04:54:0713/12/2017
à Jacob Bachmeyer,RISC-V ISA Dev,Cesar Eduardo Barros
2017-12-13 4:57 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
Jonas Oberhauser wrote:
2017-12-12 6:09 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
I like the UNLOCK. I don't know why you think that an LR will help in the LOCK_wait.

Baseline RVI LOAD does not have atomic access semantics.  Only LR can have acquire or sequential consistency specified.

You can use the current pseudo-instruction which puts fences in the right place.
I'm not sure what you mean with "atomic access semantics". A load loads the most up-to-date value for each byte at the point that it is executed. Most up-to-date takes into consideration 1) the state of the memory and 2) local stores that have been issued before the load but not yet reached local memory. (This is the "load value axiom").
For the purposes of the lock, this is strong enough.
(cf also: "Every aligned load or store instruction gives rise to exactly one memory access that executes in a single-copy atomic fashion [...]")

 
The following are all true:
1) all SCs in the same group become visible at once
2) the SCs will eventually become visible, at which point both the LR and LD (which are executed in a loop) would see them
3) fences will not make the SCs visible any sooner, neither local fences nor remote fences. Local fences have no effect on the remote SCs, and remote fences will just delay processor execution but probably not make the SCs happen any faster
4) it is impossible to see from the local hart whether the SCs take a long time because they are being buffered, or because the remote is just terribly slow.
5) the remote can probably not hurry up even if it sees that we have a reservation.

The RISC-V memory model is quite weak, and I do not think the revised memory model gives any better guarantees for RVI LOAD.  Implementations are also allowed, as I understand, to have loose cache-coherency, where RVI LOAD may *never* see an update without a FENCE.  

Quoting again:
"Every aligned load or store instruction *gives rise to exactly one* memory access [...]"
Every store eventually becomes visible in the global memory order.
Because of the load-value axiom, the store is then immediately seen by all other threads.

LR/LRM/SC, of course, must be their own FENCEs.


By that I assume you mean "all LR/LRMs are executed before any of the corresponding SCs", nothing more? I don't think there's an inherent need to have more fences than this.

There are also no cacheline bounces because the only thing you execute is that one LD, and the only thread that will also access that cacheline is the one that currently holds the lock -- in which case it will only access it once, when it releases the lock.

The loop is reading the original lock, not the local queue entry.  Any other threads wanting to take the lock will also need to write that location.  I have probably misunderstood MCS locks.

We are talking about the "spinlock" loop, not the SC loop.

In the original MCS lock, the loop is reading only the local queue entry's "locked" flag, which is only modified by the predecessor on UNLOCK.
In the single-pointer MCS locks, the loop is reading only the next pointer of the predecessor, which is only modified by the predecessor on UNLOCK.
To see why that is the case, observe that other threads wanting to take the lock only write 1) the global lock's next pointer and 2) the youngest queue entry's next pointer.

When I am in the LOCK_wait, I am already in the queue, my predecessor is not the global lock, and I am newer than my predecessor. Therefore, my predecessor's next pointer -- which I am spinning on -- is not modified by other LOCKs (it is neither the global lock, nor the newest entry in the queue).

Observe also that the UNLOCK only modifies my own next pointer and the global lock's next pointer.

Therefore, when I am in the LOCK_wait, an UNLOCK will only modify my predecessor's next pointer if it is executed by my predecessor.

Jonas Oberhauser

non lue,
13 déc. 2017, 05:09:4513/12/2017
à Jacob Bachmeyer,RISC-V ISA Dev,Cesar Eduardo Barros
That sounds like a very cool idea!

 
Another try with pseudo-assembler: 
[...]
but consistency in LOCK requires the LOCK_enqueue AMOSWAP to break a reservation and force UNLOCK to retry. 
 
The first BEQ instruction in the wait loop handles the case where a thread moving to take the lock is preempted between extending the queue and adding itself to the queue and the lock is released by another thread, although (oops!) the other thread's queue slot is still overwritten in this case.  The second BEQ is written that way to handle the case where the other thread's queue ceases to be valid after the other thread passes the lock to this thread, but this thread remains preempted until the remote queue slot now contains garbage; failure (infinite loop) is 1-in-2^XLEN in this case, assuming a random value in the no-longer-valid queue slot that never changes.  Note that UNLOCK writes a value that will break the loop, so failure is unlikely.  I think that a safe LOCK requires LR/LRM/SC, since LOCK needs to atomically update both the lock itself and a queue entry.  We cannot race with other threads to get into the queue, but we can race with UNLOCK.

Please forgive me if I did not try to read the assembler yet :( Reading concurrent assembler is not easy. I'll try a high-level implementation and compare later. I can then read the comments on your code.
 
Without the forward progress guarantee, UNLOCK can get into livelock, competing with LOCK operations adding to the queue.  This means that AMO instructions to an exclusive reservation must also stall,

Yep.
 
This leads to a key advantage for LR/LRM/SC:  since taking any trap breaks the reservations, an LR/LRM/SC sequence remains atomic even with arbitrary preemption latencies.

Interestingly, the only reason for UNLOCK to take a reservation on the local queue slot is to ensure that another thread cannot add itself to the queue after we have decided to release the lock.  The value read from the local queue slot is not actually used.
 
Yep.
 
Should LR/LRM to x0 be defined specially (to skip actually reading memory, but still take the reservation) for this situation, where a value must not change, but the actual value is not used?

Unless you access transport-triggered devices (which, e.g., count the number of accesses), this is an implementation detail.

Jonas Oberhauser

non lue,
13 déc. 2017, 05:43:5613/12/2017
à Cesar Eduardo Barros,RISC-V ISA Dev
2017-12-13 0:08 GMT+01:00 Cesar Eduardo Barros <ces...@cesarb.eti.br>:

Clever, but that would be a form of conditional move. Conditional moves have been considered but rejected for RISC-V (see chapter 2.5 of riscv-spec-v2.2).

For out-of-order implementations which use register renaming, you would need three input registers (the load address, the original x, and the original y/z) and two output registers (the renamed x and the renamed y/z). Conditionally writing to a register doesn't reduce the number of outputs, since you can't conditionally rename a register, because the renaming happens before the instruction is executed. If the instruction doesn't write to the register, it still has to copy the old value from its original physical register to its new physical register, so you also increase the number of inputs.

That is, what we have today for LR is one input (the address) and one output (the destination register), and for SC two inputs (the address and the source register) and one output (the "success" register). What you proposed would need three inputs, more than anything other than FMA (which is in a different register file), and two outputs, more than anything other than CSR swaps (though the CSRs are more of an I/O space than registers, so in the likely case they are treated more like a strange sort of memory, there is again only one register output for them).

ok.

Jonas Oberhauser

non lue,
13 déc. 2017, 06:19:4313/12/2017
à Jacob Bachmeyer,RISC-V ISA Dev,Cesar Eduardo Barros
No, because the LR/LRM implementation looks different.

try:
a = LR x
b = LRM y
c,d = op(a,b)
*x = SC c
*y = SC d fail
retry on fail
In this particular example, we neither 1) reuse the old values (it would be fatal) nor 2) check whether the loads returned garbage, we just retry on a failed SC.

Correct, but LR/LRM/SC can be used to implement DCAS, and portable synchronization libraries might choose to rely on DCAS, so the first loop with DCAS should work if DCAS is a subroutine implemented using LR/LRM/SC. 

This will be hard, because it would mean either 1) LR/LRM groups always need to appear to be atomic or 2) you need a way to check for LR/LRM failure so that you can retry in case they were not, but leave the loop in case they were atomic but unequal to the compare data.
(the DCAS is an atomic load even in case of failure...)

Implementing atomic LR/LRM can be done in HW by restarting the sequence in case it would otherwise not appear to be atomic (as we discussed before, with the rules below), or if you wait with loading values until you can make sure that the sequence will appear to be atomic.

I think the latter would be preferable, but we'd have to come up with a cheap implementation.

 
If LR/LRM can return garbage instead of "old" values, then those portable libraries would need to accommodate a quasi-DCAS that does not return usable values upon failure.

Yes.
 
Power management -- shut down a hart to save power.  If that hart was holding reservations (it should not be, but...) something must ensure that those reservations do not deadlock other harts later.  (Releasing reservations upon entering low-power mode should also work, but the reservations are tracked in the memory subsystem anyway, so why not expire them on the memory clock?)

Because sending an "invalidate reservations" message is not expensive and is more efficient.
 
Ah. What confused me was the following sentence:

    "On the first attempt, both CPUs fail at SC, they both jump back to
    retry, they stall for different amounts of time at the first LR."


I don't think it's practical to keep this status information, but I haven't thought enough about it.

The required status information is the same information needed to track reservations in the first place plus one bit per reservation for an expiration flag -- if taking a reservation that previously expired, stall.

Probably.
 
By the way, you do not have to stall for a random amount, but for a hopefully unique amount. In a local system you might have different IDs for each CPU, which you can use to wait.

Correct, random is just an easy way to get a hopefully unique amount with dynamic priority where using hart IDs would produce a static priority model.

Yep.
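
As a concrete (and purely illustrative) form of that ID-based wait: a minimal sketch, assuming a register already holds a small unique per-hart identifier, however the platform provides it; the scaling factor is arbitrary and "retry" stands for the LR at the top of the sequence.

    # t6 = unique per-hart ID (assumption); back off proportionally before retrying
    slli    t5, t6, 4       # delay = ID * 16 loop iterations (arbitrary scale)
1:  addi    t5, t5, -1
    bnez    t5, 1b          # simple busy-wait
    j       retry           # then jump back to the LR of the sequence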
 
At some point, interrupt load is simply too high for any forward progress guarantee to hold.  This is a system design problem if the hart never executes 32 instructions uninterrupted.
The solution in the second case could be to delay interrupts for up to 4 cycles to complete the SC group.

Ok.
 
In my implementation, a successful SC already replaces the cacheline with the new value, thus has to stall remote accesses.
The advantage of this is that the eventual commit is atomic and you don't need to buffer the new values somewhere else, and you can take an interrupt anywhere in the sequence without problems.

In an implementation where the SC values are buffered in a CPU-side buffer and the CPU pushes all of its SCs through on the last SC, I agree with you.
Note that this means that while this buffer is drained, no interrupts can be taken (not sure about claimed).
In this case the SC sequence still stalls remote accesses.

In my answers to the next lines, I will refer to the two implementations as "dirty cache implementation" for the one that changes the cache lines before the commit, and as "buffer implementation" the one that has an extra buffer and grabs the memory bus while draining the buffer.

There is also the Herlihy implementation, which uses a pair of cachelines for each value, discarding one upon successful commit and the other upon abort.  Commit in Herlihy's model is atomic -- the cache can update the "pending" cachelines to "current" and discard the "old" cachelines in the same cycle.

I thought briefly about that option, but it sounded too expensive.
I thought you either needed
1) four address decoders, or
2) double the number of cache lines.
Now I realize all you need is three additional bits in each cache line, which correspond to the lines in the small reservation buffer that should be clocked into that cacheline on a commit.
The downside is that it will add a lot of wiring either way, and will add to the delay of the caches (a few additional OR gates).


 
While the local hart is executing an SC group, a remote LR/LRM must either read the "old" value (taking a reservation that will almost certainly be broken in a few cycles) or stall until the "new" value is available (when the SC group completes).  In a dirty cache implementation, the "old" value may not be available, but the cache is not dirty until an exclusive reservation has been taken, so having LR/LRM stall and wait for the "new" value is appropriate.  If remote LR/LRM stalls and the local commit fails, invalidating the dirty cacheline, then the remote hart must go fetch the value from an outer cache or main memory.

So two more cache-coherence messages are needed:  "wait" and "oops, I lost it; go fish".  (The phrase "go fish" refers to a card game.)

Now we are in sync again :)
 
What I agree on is that aborting the LRM and loading nothing is a performance optimization, which may or may not bite us later. But it is easier to go from "loads nothing" to "behaves like a LOAD" than from "behaves like a LOAD" to "loads nothing".

The problem is that loading nothing on LRM *is* going from "behaves like a LOAD" to "loads nothing" -- the prototype for LRM is LR, which "behaves like a LOAD" currently.

Well not quite because there is no software that uses LRM yet. If we make a decision either way, we will have to deal with software later if we change our mind.
But if we decide that the LR/LRM group has to act as a big atomic load -- and I definitely see the reason now for why that might be a good idea --, this discussion does not matter.


 
    LR/LRM takes one input (an address) and produces one output (a
    value read from memory) with the side effect of acquiring a
    reservation.  As long as restarting the LR/LRM group produces the
    same results, this will work.    [...]
    Another way:  Destination registers within an LR/LRM group must
    contain dead values when the group is entered. 
Yep.
    For software, the rule would be that the LR/LRM group must not use
    the initial value of any register that is used as a destination.  [...]
    A third way:  An LR/LRM group must not write to a register
    after reading that register.

These work but are too strong. Consider:
t1 = LR a1
t1 = LRM t1

This is fine, but prohibited by both rules above.

What?  That is allowed under both rules -- the initial value of t1 is dead and not used before it is loaded from (a1) and the LRM is considered to write t1 on the same step (not after) that it reads t1 to get a memory address.

Maybe I misunderstood the rules then :) 
I think I overlooked the "initial" on my first try.
For the third way, well, in this particular example it depends on what you mean by "after" (for me, the LRM first reads t1, then writes it, but that's a matter of view).
But you can give a similar example:
t1 = LR a1
t2 = LRM t1
t1 = LRM t2
Furthermore, this definition of "after" would not have been strong enough:
a1 = LR a1
a2 = LRM a2

Neither of these wrote a register "after" reading it (in your definition), but the values are now "clobbered" and the sequence can not be restarted.

 
That's why I think it's a bit error prone, it's a bit subtle.

It looks to me that the most error prone part will be writing the rules clearly.  :-)

:)) 


There is. Consider the following

try:
t1 = a1 + 4
t2 = LR a1
t1 = LRM t1
retry if t2 != t1
t3 = a1 + 4
*a1 = SC 0
*t3 = SC 0

Does it make sense to program like this? No. Are there programs where it makes sense? I doubt it, but I don't know. But for sure it is possible to program like this (and by Murphy's law, someone will probably do it).
If you want this condition, you need to make it explicit, it's not a consequence of existing rules.

Maybe a rule that addresses must be preserved should be made explicit, but that sequence does violate the proposed requirements for LR/LRM to be restartable -- the value of t1 prior to LR is both used and clobbered in the LR/LRM group.

I suppose so but it's not inherent in the program :)
try:
t1 = a1 + 4
t2 = LR a1
t3 = LRM t1
t1 = 0

retry if t2 != t1 
t1 = a1 + 4
*a1 = SC 0
*t3 = SC 0

This makes even less sense but right now it's valid code even with the LR/LRM condition.

Jacob Bachmeyer

Dec 13, 2017, 23:10:11
to Jonas Oberhauser, RISC-V ISA Dev, Cesar Eduardo Barros
Jonas Oberhauser wrote:
>
> No, because the LR/LRM implementation looks different.
>
> try:
> a = LR x
> b = LRM y
> c,d = op(a,b)
> *x = SC c
> *y = SC d fail
> retry on fail
> In this particular example, we neither 1) reuse the old values
> (it would be fatal) nor 2) check whether the loads returned
> garbage, we just retry on a failed SC.
>
>
> Correct, but LR/LRM/SC can be used to implement DCAS, and portable
> synchronization libraries might choose to rely on DCAS, so the
> first loop with DCAS should work if DCAS is a subroutine
> implemented using LR/LRM/SC.
>
>
> This will be hard, because it would mean either 1) LR/LRM groups
> always need to appear to be atomic or 2) you need a way to check for
> LR/LRM failure so that you can retry in case they were not, but leave
> the loop in case they were atomic but unequal to the compare data.
> (the DCAS is an atomic load even in case of failure...)

That is right... rats... DCAS does not attempt to store if the values do
not match, so the serialization failure is not detected.

> Implementing atomic LR/LRM can be done in HW by restarting the
> sequence in case it would otherwise not appear to be atomic (as we
> discussed before, with the rules below), or if you wait with loading
> values until you can make sure that the sequence will appear to be atomic.
>
> I think the latter would be preferable, but we'd have to come up with
> a cheap implementation.
>
>
> Power management -- shut down a hart to save power. If that hart
> was holding reservations (it should not be, but...) something must
> ensure that those reservations do not deadlock other harts later.
> (Releasing reservations upon entering low-power mode should also
> work, but the reservations are tracked in the memory subsystem
> anyway, so why not expire them on the memory clock?)
>
>
> Because sending an "invalidate reservations" message is not expensive
> and is more efficient.

The problem then is ensuring that reservations time out even if the hart
holding them is stalled. Alternately, would the "invalidate
reservations" message go out on the cache-coherent memory bus? If so,
having the memory subsystem send it should not be difficult. Or we
could have two ways to send the message: a "fast path" driven by
counting retired instructions and a "slow watchdog" that breaks deadlocks.
Herlihy's solution as I read it is to effectively add one address column
and allow the cache to store two versions of the same dirty cacheline.
A reservation evicts a second cacheline to make room for the "pending"
version. SC (ST in Herlihy's paper) writes to the "pending" cacheline.
Upon commit, the "pending" cachelines become "valid" and their "twins"
are invalidated atomically. If remote interventions have invalidated a
cacheline on which a reservation is held, SC fails and all "pending"
cachelines are invalidated.

This means that a set-associative cache needs to be at least 8-way to
support four reservations, but using caches is not the only possible
implementation, so I think that this should be an acceptable constraint.

>
> What I agree on is that aborting the LRM and loading nothing
> is a performance optimization, which may or may not bite us
> later. But it is easier to go from "loads nothing" to "behaves
> like a LOAD" than from "behaves like a LOAD" to "loads nothing".
>
>
> The problem is that loading nothing on LRM *is* going from
> "behaves like a LOAD" to "loads nothing" -- the prototype for LRM
> is LR, which "behaves like a LOAD" currently.
>
>
> Well not quite because there is no software that uses LRM yet. If we
> make a decision either way, we will have to deal with software later
> if we change our mind.
> But if we decide that the LR/LRM group has to act as a big atomic load
> -- and I definitely see the reason now for why that might be a good
> idea --, this discussion does not matter.

I am somewhat inclined towards that, in any case, if the LR/LRM group
could be observed to be non-atomic, the SC group must fail, so ensuring
that the LR/LRM group is atomic would help an implementation support the
forward progress guarantee.
This sequence is restartable, since the restart can only begin at LR.
At the end, t1 == ***(a1) and t2 == **(a1). Yet this does not meet the
"third way" version of the rule.

> Furthermore, this definition of "after" would not have been strong enough:
> a1 = LR a1
> a2 = LRM a2
>
> Neither of these wrote a register "after" reading it (in your
> definition), but the values are now "clobbered" and the sequence can
> not be restarted.

Right, the "third way" version could be made correct, but would be much
longer and somewhat awkward.

>
> That's why I think it's a bit error prone, it's a bit subtle.
>
>
> It looks to me that the most error prone part will be writing the
> rules clearly. :-)
>
>
> :))

Is the first version of the rule ("The LR/LRM group must not use the
initial value of any register that is used as a destination.") workable
alone, then?
You are right that it makes no sense, but I admit that in sequences with
fewer than four reservations, it is inefficient but possible to clobber
and reconstruct addresses. Even with four reservations, one address can
be reconstructed. There is no reason to do this and it is a great way
to introduce bugs by missing one of the required SCs, but it is
possible. I think strongly-worded advice not to do this is appropriate,
but an actual requirement to preserve addresses is probably unneeded.


-- Jacob

Jonas Oberhauser

Dec 14, 2017, 02:12:55
to Jacob Bachmeyer, RISC-V ISA Dev
On Dec 14, 2017 5:10 AM, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
Jonas Oberhauser wrote:

         
This will be hard, because it would mean either 1) LR/LRM groups always need to appear to be atomic or 2) you need a way to check for LR/LRM failure so that you can retry in case they were not, but leave the loop in case they were atomic but unequal to the compare data.
(the DCAS is an atomic load even in case of failure...)

That is right... rats... DCAS does not attempt to store if the values do not match, so the serialization failure is not detected.

You could even attempt to store the loaded values if the test fails, but on failure you wouldn't know whether you loaded atomic values and lost the reservation afterwards, or you did not load atomic values. This might add unnecessary iterations.
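
A sketch of that variant, under the same assumptions as before (hypothetical lrm.w, the last SC of the group reporting whether the whole group committed): on a compare mismatch it stores back the values it just loaded, so a torn read still surfaces as an SC failure, at the cost of the extra iterations mentioned here.

# dcas(a0 = addr1, a1 = addr2, a2 = exp1, a3 = exp2, a4 = new1, a5 = new2)
# Returns a0 = 0 if the SC group committed (the loads were atomic), else nonzero;
# the loaded values come back in a2/a3 so the caller can tell a clean compare
# failure (a0 = 0, values differ) from "learned nothing, retry" (a0 != 0).
dcas:
    lr.w    t0, (a0)        # reserve and load both words
    lrm.w   t1, (a1)        # hypothetical LRM
    mv      t2, a4          # plan to store the new values
    mv      t3, a5
    bne     t0, a2, 1f      # first word mismatch
    beq     t1, a3, 2f      # both match: store the new values
1:  mv      t2, t0          # compare failed: store back the loaded values so a
    mv      t3, t1          # torn read still makes the SC group fail
2:  sc.w    x0, t2, (a0)    # assumed buffered until the last SC of the group
    sc.w    a0, t3, (a1)    # a0 = 0 only if the whole group committed
    mv      a2, t0          # hand the loaded values back to the caller
    mv      a3, t1
    ret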


Implementing atomic LR/LRM can be done in HW by restarting the sequence in case it would otherwise not appear to be atomic (as we discussed before, with the rules below), or if you wait with loading values until you can make sure that the sequence will appear to be atomic.

I think the latter would be preferable, but we'd have to come up with a cheap implementation.

What do you think?
Good point! You can "double" the number of cachelines, but use them like normal cachelines when you're not LRing. Smart!


Well not quite because there is no software that uses LRM yet. If we make a decision either way, we will have to deal with software later if we change our mind.
But if we decide that the LR/LRM group has to act as a big atomic load -- and I definitely see the reason now for why that might be a good idea --, this discussion does not matter.

I am somewhat inclined towards that, in any case, if the LR/LRM group could be observed to be non-atomic, the SC group must fail, so ensuring that the LR/LRM group is atomic would help an implementation support the forward progress guarantee.

Possibly.

For the third way, well, in this particular example it depends on what you mean by "after" (for me, the LRM first reads t1, then writes it, but that's a matter of view).
But you can give a similar example:
t1 = LR a1
t2 = LRM t1
t1 = LRM t2

This sequence is restartable, since the restart can only begin at LR.  At the end, t1 == ***(a1) and t2 == **(a1).  Yet this does not meet the "third way" version of the rule.

Exactly :)


Furthermore, this definition of "after" would not have been strong enough:
a1 = LR a1
a2 = LRM a2

Neither of these wrote a register "after" reading it (in your definition), but the values are now "clobbered" and the sequence can not be restarted.

Right, the "third way" version could be made correct, but would be much longer and somewhat awkward.

Yep.

Is the first version of the rule ("The LR/LRM group must not use the initial value of any register that is used as a destination.") workable alone, then?

Yes but if we do end up needing such a rule, I would highlight the "initial" ;)

You are right that it makes no sense, but I admit that in sequences with fewer than four reservations, it is inefficient but possible to clobber and reconstruct addresses.  Even with four reservations, one address can be reconstructed.  There is no reason to do this and it is a great way to introduce bugs by missing one of the required SCs, but it is possible.  I think strongly-worded advice not to do this is appropriate, but an actual requirement to preserve addresses is probably unneeded.

Yeah.

Jonas Oberhauser

Dec 14, 2017, 05:07:48
to Jacob Bachmeyer, RISC-V ISA Dev, Cesar Eduardo Barros

Consider the following interleaving (T1 enters lock, T2 is the tail of queue)
T1: amoswap on tail of queue
T2: unlock 
T1: amoswap on predecessor

You are now waiting for a lock that is no longer held -- deadlock.

If you try to put in a test for that, you get bad behaviors that look like this:
T1: amoswap on tail of queue
T2: unlock 
T2: lock using same queue element
T1: amoswap on "predecessor"

I think if you prohibit reuse of queue elements this can be solved, but allocating memory each time you want to enter a lock does not seem great.
It seems like it's not possible to implement a loop-free MCS lock without using LR/LRM in both LOCK and UNLOCK.

Jonas Oberhauser

Dec 14, 2017, 06:57:47
to Jacob Bachmeyer, RISC-V ISA Dev
I have a feeling that we are slowly converging to a common line.

Here's what I think we agree on:
1) LR;LRM groups without a page fault appear atomic,
2) can take multiple reservations on same address
3) forward guarantee is maintained
4) SCs to all reservations = commit

Things that are unclear to me yet:
a) page faults in an LR/LRM group -- what happens? Options I know of: provide in HW ?epc = address of LR and suggest that "no initial value ..."; let the OS take care of it (e.g., a system call to provide an LRM-pf-restart address to the OS, which is used on an LRM page fault); or make it "invisible" by translating the whole LR/LRM block before the LR is executed (in which case the earlier suggestion is not needed)

b) when multiple reservations are taken on an address, do I need multiple SCs or one SC? If I need multiple, does the last SC win (in case they write different values)?

c) can we allow an LRM that is not part of the LR;LRM group (without guarantee of atomicity)? For example, if we follow a pointer and want to compare for null first, or if we want to follow a pointer in a struct and have to add some offset? We would only guarantee that the LRM is atomic with the previous LR/LRMs when we do SCs and the SCs succeed, but (unlike for an LR/LRM group) the values are not guaranteed to be atomic in case of an SC failure.


Jacob Bachmeyer

Dec 14, 2017, 22:28:42
to Jonas Oberhauser, RISC-V ISA Dev
Jonas Oberhauser wrote:
> On Dec 14, 2017 5:10 AM, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
>
> Jonas Oberhauser wrote:
>
>
>
> This will be hard, because it would mean either 1) LR/LRM
> groups always need to appear to be atomic or 2) you need a way
> to check for LR/LRM failure so that you can retry in case they
> were not, but leave the loop in case they were atomic but
> unequal to the compare data.
> (the DCAS is an atomic load even in case of failure...)
>
>
> That is right... rats... DCAS does not attempt to store if the
> values do not match, so the serialization failure is not detected.
>
>
> You even could attempt to store the loaded values if the test fails,
> but on failure you wouldn't know whether you loaded atomic values and
> lost the reservation afterwards, or you did not load atomic values.
> This might add unnecessary iterations.

Wait... that is all that it does. A "torn read" ensures that the SC
group will fail if the expected values are produced, while other values
will cause the outer DCAS loop to iterate. Of course, an iteration that
calculates with a "torn read" will fail, since the expected values will
not match at a later atomic read, so DCAS is retried until an atomic
read occurs and the SC group succeeds. The only false match problem
here is the well-known ABA problem, but that is a known limitation of
CAS in general.

> Implementing atomic LR/LRM can be done in HW by restarting the
> sequence in case it would otherwise not appear to be atomic
> (as we discussed before, with the rules below), or if you wait
> with loading values until you can make sure that the sequence
> will appear to be atomic.
>
> I think the latter would be preferable, but we'd have to come
> up with a cheap implementation.
>
>
> What do you think?

Both of these are good implementation details and software will need to
permit LRM to jump back to LR and restart the sequence, but I think that
even DCAS serialization failures will simply cause a loop to iterate,
rather than a hard failure.
Herlihy used an additional cache for this, but that implementation was
far more ambitious than an (at most) quad-word atomic operation.
Integrating multi-LR support into an otherwise ordinary coherent L1
D-cache needs to be possible. It is not the only implementation
strategy, but it needs to be a valid strategy.

> Well not quite because there is no software that uses LRM yet.
> If we make a decision either way, we will have to deal with
> software later if we change our mind.
> But if we decide that the LR/LRM group has to act as a big
> atomic load -- and I definitely see the reason now for why
> that might be a good idea --, this discussion does not matter.
>
>
> I am somewhat inclined towards that, in any case, if the LR/LRM
> group could be observed to be non-atomic, the SC group must fail,
> so ensuring that the LR/LRM group is atomic would help an
> implementation support the forward progress guarantee.
>
>
> Possibly.

I am now convinced that software must behave correctly if the LR/LRM
group is restarted, but I do not see a good reason to make such behavior
a hard requirement: if LR/LRM fails to perform a multiple-word atomic
read, then at least one reservation must have been broken, so the SC
group must fail.

> Is the first version of the rule ("The LR/LRM group must not use
> the initial value of any register that is used as a destination.")
> workable alone, then?
>
>
> Yes but if we do end up needing such a rule, I would highlight the
> "initial" ;)

Got any better ways to word it? Something specifically stating that the
values at LR of any register written in the LR/LRM group must be unused?


-- Jacob

Jacob Bachmeyer

Dec 14, 2017, 23:44:56
to Jonas Oberhauser, RISC-V ISA Dev, Cesar Eduardo Barros
Jonas Oberhauser wrote:
> 2017-12-13 4:57 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
>
> Jonas Oberhauser wrote:
>
> 2017-12-12 6:09 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
> I like the UNLOCK. I don't know why you think that an LR will
> help in the LOCK_wait.
>
>
> Baseline RVI LOAD does not have atomic access semantics. Only LR
> can have acquire or sequential consistency specified.
>
>
> You can use the current pseudo-instruction which puts fences in the
> right place.
> I'm not sure what you mean with "atomic access semantics". A load
> loads the most up-to-date value for each byte at the point that it is
> executed. Most up-to-date takes into consideration 1) the state of the
> memory and 2) local stores that have been issued before the load but
> not yet reached local memory. (This is the "load value axiom").
> For the purposes of the lock, this is strong enough.
> (cf also: "Every aligned load or store instruction gives rise to
> exactly one memory access that executes in a single-copy atomic
> fashion [...]")

That statement was written under the old memory model, which did not
define those pseudo-instructions. It still stands, however: the
baseline LOAD opcode does not guarantee ordering without an additional
FENCE instruction, while LR combines those effects into a single
instruction.
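
For concreteness, the two forms being contrasted look roughly like this in the wait loop (only a sketch; the polarity of the flag and the surrounding lock structure are assumed):

# Variant 1: baseline load in the spin loop, explicit acquire fence afterwards.
wait_ld:
    lw      t0, (a0)        # a0 = address of the word being waited on
    bnez    t0, wait_ld     # spin until it reads zero (assumed convention)
    fence   r, rw           # order the observing load before the critical section

# Variant 2: LR with the acquire bit, folding the ordering into the load itself.
wait_lr:
    lr.w.aq t0, (a0)
    bnez    t0, wait_lr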

> The following are all true:
> 1) all SCs in the same group become visible at once
> 2) the SCs will eventually become visible, at which point both
> the LR and LD (which are executed in a loop) would see them
> 3) fences will not make the SCs visible any sooner, neither
> local fences nor remote fences. Local fences have no effect on
> the remote SCs, and remote fences will just delay processor
> execution but probably not make the SCs happen any faster
> 4) it is impossible to see from the local hart whether the SCs
> take a long time because they are being buffered, or because
> the remote is just terribly slow.
> 5) the remote can probably not hurry up even if it sees that
> we have a reservation.
>
>
> The RISC-V memory model is quite weak, and I do not think the
> revised memory model gives any better guarantees for RVI LOAD.
> Implementations are also allowed, as I understand, to have loose
> cache-coherency, where RVI LOAD may *never* see an update without
> a FENCE.
>
>
> Quoting again:
> "Every aligned load or store instruction *gives rise to exactly one*
> memory access [...]"
> Every store eventually becomes visible in the global memory order.
> Because of the load-value axiom, the store is then immediately seen by
> all other threads.

Global memory order is loose: "A store is said to have performed not
when it has executed inside the pipeline, but rather only when its value
has been propagated to globally
visible memory." I have not found any requirements for when that
propagation occurs.

> LR/LRM/SC, of course, must be their own FENCEs.
>
>
>
> By that I assume you mean "all LR/LRMs are executed before any of the
> corresponding SCs", nothing more? I don't think there's an inherent
> need to have more fences than this.

The new memory model has a special "atomicity axiom", so I expect some
special behavior from LR that LOAD may not have.


-- Jacob

Jacob Bachmeyer

Dec 15, 2017, 00:31:00
to Jonas Oberhauser, RISC-V ISA Dev
Jonas Oberhauser wrote:
> I have a feeling that we are slowly converging to a common line.
>
> Here's what I think we agree on:
> 1) LR;LRM groups without a page fault appear atomic,
> 2) can take multiple reservations on same address
> 3) forward guarantee is maintained
> 4) SCs to all reservations = commit

Agree on (2), (3), (4); qualified agree on (1): if a value is changed
after being read by LR/LRM, then the following SC group must fail.
Making LR/LRM atomic will avoid one window for serialization failure,
but is not strictly required. The last LRM may read a value that was
stored by another hart after LR was executed, but the SC group must fail
if any stores have occurred to any location X read by the LR/LRM group
after that location X was read.

I do not believe that database-like "snapshot consistency", where all
values read are as if from a snapshot taken when LR is executed, is
feasible to implement.

> Things that are unclear to me yet:
> a) page faults in an LR/LRM group -- what happens? Options I know of:
> provide in HW ?epc = address of LR and suggest that "no initial value
> ...", let the OS take care of it (e.g., system call to provide an
> LRM-pf-restart address to the OS, which is used on LRM-pf), make it
> "invisible" by translating whole LR/LRM block before the LR is
> executed (this means that the suggestion from before is not needed)

The current draft proposal I am writing requires that software be able
to cope with an LR/LRM group being restarted at LR. Letting the OS take
care of it is a large amount of very special complexity in the page
fault handler -- the Linux devs will not be amused. Translating the
whole block is impossible if LR/LRM chases pointers atomically, as MCS
locks seem to need with their requirement for a value loaded from a
pointer that is guaranteed not to have changed since the pointer itself
was loaded.

The fourth option (also permitted to implementations) is to simply take
the page fault, drop the reservations, go through the LR/LRM/SC
sequence, possibly taking more page faults at LRM or SC, and fail the SC
group, requiring software to retry the sequence. (Page fault at SC can
occur if a synchronization variable is in a page that is currently
copy-on-write.)

> b) when multiple reservations are taken on an address, do I need
> multiple SCs or one SC? If I need multiple, does the last SC win (in
> case they write different values)?

Each LR/LRM adds one entry to the pending update set; each successful SC
removes one entry from the pending update set. This is needed because
MCS locks may take two reservations on the same address or on different
addresses.

Should writing different values to the same address use the last value
written or fail? Multiple identical stores can be treated as a single
store, but writing two different values probably indicates a bug.

> c) can we allow an LRM that is not part of the LR;LRM group (without
> guarantee of atomicity)? For example, if we follow a pointer and want
> to compare for null first, or if we want to follow a pointer in a
> struct and have to add some offset? We would only guarantee that the
> LRM is atomic with the previous LR/LRMs when we do SCs and the SCs
> succeed, but (unlike for an LR/LRM group) the values are not
> guaranteed to be atomic in case of an SC failure.

No, the purpose of LRM is to ensure that we have started a multiple-LR
at its correct starting point, otherwise LR/LR/SC/SC would suffice.
Those situations call for full RVT and its LT instruction, not multiple-LR.


-- Jacob

Jonas Oberhauser

Dec 15, 2017, 01:39:10
to Jacob Bachmeyer, RISC-V ISA Dev


On Dec 15, 2017 05:44, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
Jonas Oberhauser wrote:
2017-12-13 4:57 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:


    Jonas Oberhauser wrote:

        2017-12-12 6:09 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
        I like the UNLOCK. I don't know why you think that an LR will
        help in the LOCK_wait.


    Baseline RVI LOAD does not have atomic access semantics.  Only LR
    can have acquire or sequential consistency specified.


You can use the current pseudo-instruction which puts fences in the right place.
I'm not sure what you mean with "atomic access semantics". A load loads the most up-to-date value for each byte at the point that it is executed. Most up-to-date takes into consideration 1) the state of the memory and 2) local stores that have been issued before the load but not yet reached local memory. (This is the "load value axiom").
For the purposes of the lock, this is strong enough.
(cf also: "Every aligned load or store instruction gives rise to exactly one memory access that executes in a single-copy atomic fashion [...]")

That statement was written under the old memory model, which did not define those pseudo-instructions.  It still stands, however:  the baseline LOAD opcode does not guarantee ordering without an additional FENCE instruction, while LR combines those effects into a single instruction.

My point is that you do not need a fence.
The load does not observe any value from before the local SC. Afterwards it sees nothing until the remote SC becomes globally visible (which may be long after it is issued, but that's ok).


Global memory order is loose:  "A store is said to have performed not when it has executed inside the pipeline, but rather only when its value has been propagated to globally
visible memory."  I have not found any requirements for when that propagation occurs.

Exactly.
Observe that the time of propagation does not matter in this program, only that it eventually propagates.


    LR/LRM/SC, of course, must be their own FENCEs.



By that I assume you mean "all LR/LRMs are executed before any of the corresponding SCs", nothing more? I don't think there's an inherent need to have more fences than this.

The new memory model has a special "atomicity axiom", so I expect some special behavior from LR that LOAD may not have.

The special behaviors of LR are all related to its paired SC (and possibly that it already has aqrl bits)

Jonas Oberhauser

Dec 15, 2017, 06:56:05
to Jacob Bachmeyer, RISC-V ISA Dev
2017-12-15 6:30 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
Jonas Oberhauser wrote:
I have a feeling that we are slowly converging to a common line.

Here's what I think we agree on:
1) LR;LRM groups without a page fault appear atomic,
2) can take multiple reservations on same address
3) forward guarantee is maintained
4) SCs to all reservations = commit

Agree on (2), (3), (4); qualified agree on (1):  if a value is changed after being read by LR/LRM, then the following SC group must fail.  Making LR/LRM atomic will avoid one window for serialization failure, but is not strictly required.  The last LRM may read a value that was stored by another hart after LR was executed, but the SC group must fail if any stores have occurred to any location X read by the LR/LRM group after that location X was read.

Oh, ok. I'm fine either way. 

So you are thinking of something like
1') an implementation may provide an atomicity guarantee for the LR/LRM block. Such an implementation may restart the LR/LRM block if the LR/LRM block would otherwise not appear to be atomic.
?
Wouldn't it be better to have this as an extension to the extension to the extension? It seems to me like I might write different software for implementations that have this extra feature than for those that only have LR/LRM. (sure, I can have software that works for both, but that software will not make use of the feature and pretend on SC failure that the values were not atomic -- then what's the point of the feature?)
 
I do not believe that database-like "snapshot consistency", where all values read are as if from a snapshot taken when LR is executed, is feasible to implement.

Maybe not, especially during pointer chasing, as you reminded me below.
 
Things that are unclear to me yet:
a) page faults in an LR/LRM group -- what happens? Options I know of: provide in HW ?epc = address of LR and suggest that "no initial value ...", let the OS take care of it (e.g., system call to provide an LRM-pf-restart address to the OS, which is used on LRM-pf), make it "invisible" by translating whole LR/LRM block before the LR is executed (this means that the suggestion from before is not needed)

The current draft proposal I am writing requires that software be able to cope with an LR/LRM group being restarted at LR.  Letting the OS take care of it is a large amount of very special complexity in the page fault handler -- the Linux devs will not be amused.  Translating the whole block is impossible if LR/LRM chases pointers atomically, as MCS locks seem to need with their requirement for a value loaded from a pointer that is guaranteed not to have changed since the pointer itself was loaded.

Ugh, you're right.
 
The fourth option (also permitted to implementations) is to simply take the page fault, drop the reservations, go through the LR/LRM/SC sequence, possibly taking more page faults at LRM or SC, and fail the SC group, requiring software to retry the sequence.  (Page fault at SC can occur if a synchronization variable is in a page that is currently copy-on-write.)

Yes, what I wrote above was under the assumption that we want atomic LR/LRM. If they are not atomic this behavior is the right behavior.
 
b) when multiple reservations are taken on an address, do I need multiple SCs or one SC? If I need multiple, does the last SC win (in case they write different values)?

Each LR/LRM adds one entry to the pending update set; each successful SC removes one entry from the pending update set.  This is needed because MCS locks may take two reservations on the same address or on different addresses.

No, since you could condition on whether you read the same address twice. It's just more elegant this way.
 
Should writing different values to the same address use the last value written or fail?  Multiple identical stores can be treated as a single store, but writing two different values probably indicates a bug.

It can indicate a bug, but
A) if it is a bug, the SC sequence will not work correctly either way -- either it will probably go into an infinite loop (if the SC fails), or it will incorrectly overwrite the value with a newer value. Which makes sense.
B) if it is intentional, you would fail the sequence erroneously.

The other thing is that checking equality is more expensive in HW. 
I think it should just overwrite the value and that's it.

c) can we allow an LRM that is not part of the LR;LRM group (without guarantee of atomicity)? For example, if we follow a pointer and want to compare for null first, or if we want to follow a pointer in a struct and have to add some offset? We would only guarantee that the LRM is atomic with the previous LR/LRMs when we do SCs and the SCs succeed, but (unlike for an LR/LRM group) the values are not guaranteed to be atomic in case of an SC failure.

No, the purpose of LRM is to ensure that we have started a multiple-LR at its correct starting point, otherwise LR/LR/SC/SC would suffice.  Those situations call for full RVT and its LT instruction, not multiple-LR.

Why? I think all implementations of LR;LRM also work for LR;(integer instructions);LRM (provided the same conditions about the distance between LR and SC, the number of LR/LRMs, etc., still hold). The reservation subsystem doesn't really care that you did a few extra integer instructions since your last LR/LRM.
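
For example, the pointer-chasing case from (c) might look like this -- again only a sketch, with the hypothetical lrm.d mnemonic, an assumed field offset of 8 bytes, a placeholder update, and ordinary instructions interleaved between the reserved loads:

retry:
    lr.d    t0, (a0)        # reserve and load the pointer itself
    beqz    t0, out         # ordinary instruction inside the sequence: null check
    addi    t1, t0, 8       # ordinary instruction: compute the field address
    lrm.d   t2, (t1)        # hypothetical "sparse" LRM on the pointee's field
    addi    t3, t2, 1       # placeholder update of that field
    sc.d    x0, t0, (a0)    # write the pointer back unchanged (validates it under
                            # the proposed group semantics)
    sc.d    t4, t3, (t1)    # last SC reports whether the whole group committed
    bnez    t4, retry
out: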


Jacob Bachmeyer

Dec 15, 2017, 19:57:14
to Jonas Oberhauser, RISC-V ISA Dev
Jonas Oberhauser wrote:
> 2017-12-15 6:30 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
Atomic LR/LRM is really a performance optimization. When SC fails, all
software really knows is that the entire sequence must be retried.
Requiring portable software to make LR/LRM blocks restartable enables
this performance optimization at essentially no cost to any
implementation, since RISC-V has more registers than an LR/SC sequence
can use while still meeting the forward progress guarantee constraints.

> I do not believe that database-like "snapshot consistency", where
> all values read are as if from a snapshot taken when LR is
> executed, is feasible to implement.
>
>
> Maybe not, especially during pointer chasing, as you reminded me below.

Now that I think about it, the requirements around reservations have the
effect that a successful LR/LRM/SC sequence acts as if a snapshot was
taken at the last LRM, but this "snapshot" never has to be materialized,
so it costs nothing.

> Things that are unclear to me yet:
> a) page faults in an LR/LRM group -- what happens? Options I
> know of: provide in HW ?epc = address of LR and suggest that
> "no initial value ...", let the OS take care of it (e.g.,
> system call to provide an LRM-pf-restart address to the OS,
> which is used on LRM-pf), make it "invisible" by translating
> whole LR/LRM block before the LR is executed (this means that
> the suggestion from before is not needed)
>
>
> [...]
>
> The fourth option (also permitted to implementations) is to simply
> take the page fault, drop the reservations, go through the
> LR/LRM/SC sequence, possibly taking more page faults at LRM or SC,
> and fail the SC group, requiring software to retry the sequence.
> (Page fault at SC can occur if a synchronization variable is in a
> page that is currently copy-on-write.)
>
>
> Yes, what I wrote above was under the assumption that we want atomic
> LR/LRM. If they are not atomic this behavior is the right behavior.

Atomic LR/LRM is an implementation option; the extension permits either
the first or fourth option.

> b) when multiple reservations are taken on an address, do I
> need multiple SCs or one SC? If I need multiple, does the last
> SC win (in case they write different values)?
>
>
> Each LR/LRM adds one entry to the pending update set; each
> successful SC removes one entry from the pending update set. This
> is needed because MCS locks may take two reservations on the same
> address or on different addresses.
>
>
> No, since you could condition on whether you read the same address
> twice. It's just more elegant this way.

Elegance is important here: multi-processor synchronization is hard
anyway, making it any more obscure than needed is asking for more bugs.

> Should writing different values to the same address use the last
> value written or fail? Multiple identical stores can be treated
> as a single store, but writing two different values probably
> indicates a bug.
>
>
> It can indicate a bug, but
> A) if it is a bug, the SC sequence will not work correctly either way
> -- either it will probably go into an infinite loop (if the SC fails),
> or it will incorrectly overwrite the value with a newer value. Which
> makes sense.
> B) if it is intentional, you would fail the sequence erronously.
>
> The other thing is that checking equality is more expensive in HW.
> I think it should just overwrite the value and that's it.

Very well, the final value written is the value given to the last SC to
that address in program order, then.

> c) can we allow an LRM that is not part of the LR;LRM group
> (without guarantee of atomicity)? For example, if we follow a
> pointer and want to compare for null first, or if we want to
> follow a pointer in a struct and have to add some offset? We
> would only guarantee that the LRM is atomic with the previous
> LR/LRMs when we do SCs and the SCs succeed, but (unlike for an
> LR/LRM group) the values are not guaranteed to be atomic in
> case of an SC failure.
>
>
> No, the purpose of LRM is to ensure that we have started a
> multiple-LR at its correct starting point, otherwise LR/LR/SC/SC
> would suffice. Those situations call for full RVT and its LT
> instruction, not multiple-LR.
>
>
> Why? I think all implementations of LR;LRM also work for LR;(integer
> instructions);LRM, (provided the same conditions about distance
> between LR and SC, number of LR/LRMs, etc, still hold). The
> reservation subsystem doesn't really care that you did a few extra
> integer instructions since your last LR/LRM.

The reservation subsystem may not care, but then LRM is a synonym for LR
and you have the same problems as with LR/LR/SC/SC, particularly the
inability to easily determine if a sequence has been abandoned. If
these problems can be solved, a future revision could always relax the
constraints on LRM's dynamic context, but I prefer to start with
something that I know will work.


-- Jacob

Jonas Oberhauser

Dec 16, 2017, 05:34:29
to Jacob Bachmeyer, RISC-V ISA Dev


On Dec 16, 2017 01:57, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
Jonas Oberhauser wrote:
2017-12-15 6:30 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:


    Jonas Oberhauser wrote:

        I have a feeling that we are slowly converging to a common line.

        Here's what I think we agree on:
        1) LR;LRM groups without a page fault appear atomic,
        2) can take multiple reservations on same address
        3) forward guarantee is maintained
        4) SCs to all reservations = commit


    Agree on (2), (3), (4); qualified agree on (1):  if a value is
    changed after being read by LR/LRM, then the following SC group
    must fail.  Making LR/LRM atomic will avoid one window for
    serialization failure, but is not strictly required.  The last LRM
    may read a value that was stored by another hart after LR was
    executed, but the SC group must fail if any stores have occurred
    to any location X read by the LR/LRM group after that location X
    was read.


Oh, ok. I'm fine either way.
So you are thinking of something like
1') an implementation may provide an atomicity guarantee for the LR/LRM block. Such an implementation may restart the LR/LRM block if the LR/LRM block would otherwise not appear to be atomic.
?
Wouldn't it be better to have this as an extension to the extension to the extension? It seems to me like I might write different software for implementations that have this extra feature than for those that only have LR/LRM. (sure, I can have software that works for both, but that software will not make use of the feature and pretend on SC failure that the values were not atomic -- then what's the point of the feature?)

Atomic LR/LRM is really a performance optimization.  When SC fails, all software really knows is that the entire sequence must be retried.  Requiring portable software to make LR/LRM blocks restartable enables this performance optimization at essentially no cost to any implementation, since RISC-V has more registers than an LR/SC sequence can use while still meeting the forward progress guarantee constraints.

How is it a performance optimization?
All I can see is how it would allow a DCAS to leave the loop after one iteration if the test fails. I.e., I can write more efficient software if the feature is present.
On a HW level, I see no performance gain in an atomic LR/LRM when the SCs fail.


    I do not believe that database-like "snapshot consistency", where
    all values read are as if from a snapshot taken when LR is
    executed, is feasible to implement.


Maybe not, especially during pointer chasing, as you reminded me below.

Now that I think about it, the requirements around reservations have the effect that a successful LR/LRM/SC sequence acts as if a snapshot was taken at the last LRM, but this "snapshot" never has to be materialized, so it costs nothing.

What do you mean by "materialized"?

Atomic LR/LRM is an implementation option; the extension permits either the first or fourth option.

Under that assumption I agree.

No, since you could condition on whether you read the same address twice. It's just more elegant this way.

Elegance is important here:  multi-processor synchronization is hard anyway, making it any more obscure than needed is asking for more bugs.

Sure, I agree.

        c) can we allow an LRM that is not part of the LR;LRM group
        (without guarantee of atomicity)? For example, if we follow a
        pointer and want to compare for null first, or if we want to
        follow a pointer in a struct and have to add some offset? We
        would only guarantee that the LRM is atomic with the previous
        LR/LRMs when we do SCs and the SCs succeed, but (unlike for an
        LR/LRM group) the values are not guaranteed to be atomic in
        case of an SC failure.


    No, the purpose of LRM is to ensure that we have started a
    multiple-LR at its correct starting point, otherwise LR/LR/SC/SC
    would suffice.  Those situations call for full RVT and its LT
    instruction, not multiple-LR.


Why? I think all implementations of LR;LRM also work for LR;(integer instructions);LRM, (provided the same conditions about distance between LR and SC, number of LR/LRMs, etc, still hold). The reservation subsystem doesn't really care that you did a few extra integer instructions since your last LR/LRM.

The reservation subsystem may not care, but then LRM is a synonym for LR

No, LRM and LR have different opcodes (you know what I mean) and different semantics (LR abandons previous reservations, and LRM does not and only takes a reservation if other reservations are present).

and you have the same problems as with LR/LR/SC/SC, particularly the inability to easily determine if a sequence has been abandoned.

I don't see how? If you make an LRM after an interrupt it doesn't matter whether the last instruction before the interrupt was an integer/jump instruction or an LR/LRM. If you make an LRM on its own, the reservation subsystem knows whether you have reservations or not independent of whether you just executed an LR/LRM or some other instruction.

Jacob Bachmeyer

Dec 16, 2017, 20:15:51
to Jonas Oberhauser, RISC-V ISA Dev
Jonas Oberhauser wrote:
> On Dec 16, 2017 01:57, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
>
> Jonas Oberhauser wrote:
>
> 2017-12-15 6:30 GMT+01:00 Jacob Bachmeyer <jcb6...@gmail.com>:
The atomic LR/LRM removes one possible cause for the SC group to fail:
a store to a previously-reserved address occurring while acquiring
another reservation. An implementation with atomic LR/LRM can
(effectively) forward that store to the LR/LRM load, moving all of the
LR/LRM loads to after that store in the global memory order, instead of
repeating the entire LR/LRM/SC sequence. Note that this can either be
by actually replacing the value loaded in the register file if the
implementation can prove that the program has not observed the value
yet, or simply by restarting the LR/LRM group and repeating the loads.

Portable software must not assume that LR/LRM groups are atomic.

> I do not believe that database-like "snapshot
> consistency", where
> all values read are as if from a snapshot taken when LR is
> executed, is feasible to implement.
>
>
> Maybe not, especially during pointer chasing, as you reminded
> me below.
>
>
> Now that I think about it, the requirements around reservations
> have the effect that a successful LR/LRM/SC sequence acts as if a
> snapshot was taken at the last LRM, but this "snapshot" never has
> to be materialized, so it costs nothing.
>
>
> What do you mean by "materialized"?

It actually takes up storage somewhere. Snapshotting the entire memory
at LR requires either copying all of memory or tracking an "undo buffer"
that allows to retrieve previous values. Alternately, if we only care
about consistency, the "undo buffer" need only record the address of
every store in the global memory order, to allow a future LRM to
recognize that it is reading from an address that has been updated and
fail the sequence. This is a potentially unbounded buffer, since it
must record the actions of other harts in anticipation of future
instructions on the local hart.

The "snapshot" taken at the last LRM requires none of this, only that
the reservations are broken if a *future* store changes any of the
values that have already been read. The "buffer" in this case need only
hold the four actual reservations and need not anticipate future
instructions on the local hart.
This might actually work with preemptive threads, unlike stack-based
multi-LR. Thread A: LR/LRM/LRM/(preempted). Thread B:
LR/LRM/LRM/SC/SC/SC. The sequence in thread B commits. Eventually,
thread A resumes: LRM/SC/SC/SC/SC. The LRM fails to acquire a
reservation, the preemption invalidated previous reservations, and the
SC group fails.

I see two problems with this, however: first, the "sparse" LRM
instructions complicate the atomic LR/LRM group semantics. Either
resuming at LR is not permitted after any instruction other than LRM is
executed, or all instructions in the sequence must ensure that no
register updated in the sequence has a meaningful initial value, rather
than only an LR/LRM group that is at most four instructions long having
this constraint.

The second problem I see is that allowing "sparse" LRM instructions
comes very close to needing the VALIDATE instruction, and that is
supposed to be one of the lines between multi-LR and full RVT. While
LR.VALIDATE could be encoded as an LRM from (x0), I think that this
would be uncomfortably blurring the line between the simpler multi-LR,
where the RVA forward progress guarantee is maintained, and RVT, which
will have different (and probably weaker) guarantees.

I am considering proposing to allow "sparse" LRM but have the
architectural forward progress guarantee only apply if LR/LRM are used
in a single group at the beginning of the sequence.


-- Jacob

Jonas Oberhauser

Dec 21, 2017, 10:41:03
to RISC-V ISA Dev, s9jo...@gmail.com, jcb6...@gmail.com

Hah, that's cool, I like it. Not so useful for pointer chasing, but cool nevertheless.
 
Portable software must not assume that LR/LRM groups are atomic.

Now I wonder: does it make sense to guarantee atomicity for an LR/LRM group where values are not used (e.g., in the DCAS)? It would make some code a lot simpler (and thus faster). I think it may be too costly in HW though; imagine you have multiple LR/LRM to the same address but loading to a different register, or you have a misaligned store that has to be forwarded to multiple LR/LRMs... You'd need to write multiple gpr ports at once or have a small buffer or something. Probably not worth it.
 
> What do you mean by "materialized"?

It actually takes up storage somewhere.  Snapshotting the entire memory
at LR requires either copying all of memory or tracking an "undo buffer"
that allows to retrieve previous values.  Alternately, if we only care
about consistency, the "undo buffer" need only record the address of
every store in the global memory order, to allow a future LRM to
recognize that it is reading from an address that has been updated and
fail the sequence.  This is a potentially unbounded buffer, since it
must record the actions of other harts in anticipation of future
instructions on the local hart.

Okay, thanks :)
 
This might actually work with preemptive threads, unlike stack-based
multi-LR.  Thread A:  LR/LRM/LRM/(preempted).  Thread B:  
LR/LRM/LRM/SC/SC/SC.  The sequence in thread B commits.  Eventually,
thread A resumes: LRM/SC/SC/SC/SC.  The LRM fails to acquire a
reservation, the preemption invalidated previous reservations, and the
SC group fails.

Yep :)
 
I see two problems with this, however:  first, the "sparse" LRM
instructions complicate the atomic LR/LRM group semantics.  Either
resuming at LR is not permitted after any instruction other than LRM is
executed, or all instructions in the sequence must ensure that no
register updated in the sequence has a meaningful initial value, rather
than only an LR/LRM group that is at most four instructions long having
this constraint.

Well, we don't have atomic semantics, but we have the requirement of restartability. I think it is reasonable to require that the whole sequence is restartable, for the reasons that you mentioned previously -- it makes little sense to clobber the addresses of the LR/LRMs and then reconstruct them using expensive cycles.

 
The second problem I see is that allowing "sparse" LRM instructions
comes very close to needing the VALIDATE instruction, and that is
supposed to be one of the lines between multi-LR and full RVT.  While
LR.VALIDATE could be encoded as an LRM from (x0), I think that this
would be uncomfortably blurring the line between the simpler multi-LR,
where the RVA forward progress guarantee is maintained, and RVT, which
will have different (and probably weaker) guarantees.

Where do you need the validate instruction? I think the current semantics, where the SCs are the validation, are enough even with sparse LRM. Can you make the problem more precise?
 
I am considering proposing to allow "sparse" LRM but have the
architectural forward progress guarantee only apply if LR/LRM are used
in a single group at the beginning of the sequence.

Can you explain how implementations might be able to give one guarantee but not the other? I don't see it yet.

Jacob Bachmeyer

Dec 21, 2017, 19:57:34
to Jonas Oberhauser, RISC-V ISA Dev
Jonas Oberhauser wrote:
> On Sunday, December 17, 2017 at 02:15:51 UTC+1, Jacob Bachmeyer wrote:
>
> Jonas Oberhauser wrote:
> > On Dec 16, 2017 01:57, "Jacob Bachmeyer" <jcb6...@gmail.com> wrote:
That only occurs if the hardware uses the option to speculate LR/LRM and
retroactively "fix-up" the results. It is an option; hardware can just
as easily restart execution at LR, or simply fail the SC group and let
software iterate back to LR.

>
> I see two problems with this, however: first, the "sparse" LRM
> instructions complicate the atomic LR/LRM group semantics. Either
> resuming at LR is not permitted after any instruction other than
> LRM is
> executed, or all instructions in the sequence must ensure that no
> register updated in the sequence has a meaningful initial value,
> rather
> than only an LR/LRM group that is at most four instructions long
> having
> this constraint.
>
>
> Well, we don't have atomic semantics, but we have the requirement of
> restartability. I think it is reasonable to require that the whole
> sequence is restartable, for the reasons that you mentioned previously
> -- it makes little sense to clobber the addresses of the LR/LRMs and
> then reconstruct them using expensive cycles.

How best to word this? "All registers used as destinations in an
LR/LRM/SC sequence must contain dead values at the initial LR.
Implementations are permitted, but *never* required to restart an
LR/LRM/SC sequence at the initial LR from any point until the final SC."?

> The second problem I see is that allowing "sparse" LRM instructions
> comes very close to needing the VALIDATE instruction, and that is
> supposed to be one of the lines between multi-LR and full RVT. While
> LR.VALIDATE could be encoded as an LRM from (x0), I think that this
> would be uncomfortably blurring the line between the simpler
> multi-LR,
> where the RVA forward progress guarantee is maintained, and RVT,
> which
> will have different (and probably weaker) guarantees.
>
>
> Where do you need the validate instruction? I think the current
> semantics that the SCs are validation is enough even with sparse LRM.
> Can you make the problem more precise?

Herlihy's paper suggested that some more complex transactions would need
to alternate LT and VALIDATE, notably when chasing pointers. Doing too
much between reserved loads increases the chance that a pointer
previously loaded may be stale by the time it is used. If it is used
for SC, that does not matter -- the SC will fail. If it is used to
calculate an address, the value loaded may or may not remain valid --
and preemptive multitasking can insert very long delays between any two
instructions.
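
To make the pointer-chasing hazard concrete, here is a minimal fragment
(the lrm.d mnemonic, the register assignments, and the data layout are
illustrative assumptions, not any ratified encoding):

  lr.d   t0, 0(a0)      # a0 = &head; reserve the head pointer (assume non-null)
  lrm.d  t1, 0(t0)      # chase the pointer: a reserved load *through* t0
  # ... compute an update, then an SC group writes back and validates ...

If another hart replaces or frees the node between the lr.d and the
lrm.d, t0 is already stale at the moment it is used as an address, which
is exactly the situation Herlihy's VALIDATE is meant to catch; the
concern is that t0 is dereferenced before any SC has had a chance to
validate it.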

> I am considering proposing to allow "sparse" LRM but have the
> architectural forward progress guarantee only apply if LR/LRM are
> used
> in a single group at the beginning of the sequence.
>
>
> Can you explain how implementations might be able to give one
> guarantee but not the other? I don't see it yet.

It is simpler for hardware to recognize, although the need for a
contiguous LR/LRM group ("dense LRM") may be simply a path-dependent
artifact. LRM originated as an answer to problems with a multiple-LR
proposal that did need a dense group. Requiring dense LRM for the
forward progress guarantee seemed like a viable "split-the-difference" at
the time of writing. I will think about it -- it may be unneeded.

Dense LRM does, however, reduce the "weirdness" level of restarting at
LR if a subsequent LRM fails.


-- Jacob

Michael Clark

non lue,
23 déc. 2017, 15:40:2223/12/2017
à Cesar Eduardo Barros,RISC-V ISA Dev
This approach to double wide LR/SC would require a 128-bit scalar register file, or a register pairing scheme, but no ISA changes:

- https://github.com/michaeljclark/riscv-meta/blob/master/opcodes#L149-L150

On x86, SP and DP scalar floating-point operations and some scalar integer operations use the SSE/AVX xmm/ymm/zmm registers, which are now 512 bits wide in Skylake. The extra width does not affect the performance of the narrower 32-bit or 64-bit scalar operations. x86-64 compilers use SSE FP almost exclusively. Compilers only emit x87 code in some rare circumstances, such as use of transcendentals. There is precedent for wide register files that are used frequently for scalar operations.

In fact, 128-bit scalar ops, especially bitwise logical ops, could have a lot of potential uses now. However, one could define a “register pair” extension to avoid growing the register file.

A much simpler “non-standard” extension could allow a subset of RV128 ops with 15 registers instead of 31, using register pairing/aggregation.

A register pairing/aggregation scheme has been used for wide operations in other ISAs. One could re-use the existing RV128 LR.Q/SC.Q ISA encoding, with a non-standard extension that allows a subset of 128-bit operations on a subset of register pairs. Essentially “loose instructions” for 128-bit LR.Q/SC.Q and a “register pair” scheme.
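
A sketch of how such a paired LR.Q/SC.Q might be used on RV64 for a
double-wide CAS (the even/odd pairing rule, the status convention, and
the use of these encodings outside RV128 are all assumptions about the
non-standard extension described above, not existing behaviour):

dw_cas_q:                  # a0 = 16-byte-aligned address
                           # a2:a3 = expected pair, a4:a5 = desired pair
  lr.q   t0, (a0)          # assumed: loads 128 bits into the pair t0:t1
  bne    t0, a2, 1f
  bne    t1, a3, 1f
  sc.q   t2, a4, (a0)      # assumed: stores the pair a4:a5, status in t2
  bnez   t2, dw_cas_q      # reservation lost, retry
  li     a0, 0             # success
  jr     ra
1:
  li     a0, 1             # value mismatch
  jr     ra

The appeal is that no new opcodes are needed; the cost is the pairing
constraint on rd and rs2.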

Jacob Bachmeyer

non lue,
27 déc. 2017, 20:48:1727/12/2017
à Michael Clark,Cesar Eduardo Barros,RISC-V ISA Dev
This is an interesting idea, but the goal is a standard extension, and
register pairing would require more ports on the register file, so would
be a very hard sell. Also, the LR/LRM/SC proposal allows up to four
_independent_ words to be updated, providing a significantly more
powerful primitive than a simple 128-bit LR/SC.
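
For a concrete comparison, a back-of-the-napkin non-adjacent DCAS under
the LR/LRM/SC proposal might look like the following (lrm.d is the
proposed mnemonic, and the sketch assumes the group semantics discussed
in this thread, i.e. the SC group only takes effect if every SC in it
succeeds):

dcas:                      # a0, a1 = two independent aligned addresses
                           # a2, a3 = expected values; a4, a5 = new values
  lr.d   t0, (a0)          # reserve the first word
  lrm.d  t1, (a1)          # reserve the second, unrelated word
  li     t3, 1             # assume comparison failure
  bne    t0, a2, 1f
  bne    t1, a3, 1f
  sc.d   t2, a4, (a0)      # start of the SC group
  bnez   t2, dcas          # reservation lost: nothing committed, retry
  sc.d   t3, a5, (a1)      # the group commits only when this SC also succeeds
  bnez   t3, dcas
1:
  mv     a0, t3            # 0 = swapped, nonzero = comparison failed
  jr     ra

The two addresses need not be adjacent, which is what distinguishes this
from both the double-wide and the register-pair variants.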

I believe that LR/LRM/SC as a "light" transaction support will be easier
to implement than full RVT, which I am also planning to propose, using
the Herlihy model. The full RVT will support much more complex
programming, but software will have some responsibility for ensuring
forward progress. LR/LRM/SC will still have the RVA forward progress
guarantee. RVT will have its own, weaker, guarantee.


-- Jacob

Michael Clark

non lue,
28 déc. 2017, 02:32:2328/12/2017
à Jacob Bachmeyer,Cesar Eduardo Barros,RISC-V ISA Dev


> On 28/12/2017, at 2:48 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> Michael Clark wrote:
>> This approach to double wide LR/SC would require a 128-bit scalar register file, or a register pairing scheme, but no ISA changes:
>>
>> - https://github.com/michaeljclark/riscv-meta/blob/master/opcodes#L149-L150
>>
>> On x86 SP and DP scalar floating point operations and some scalar integer operations use the SSE/AVX xmm/ymm/zmm registers which are now 512-bits wide in Skylake. The extra width does not affect the performance of the narrower 32-bit or 64-bit scalar operations. x86-64 compilers uses SSE FP almost exclusively. Compilers only emit x87 code in some rare circumstances such as use of transcendentals. There is precedence for wide register files that are used frequently for scalar operations.
>>
>> In fact 128-bit scalar ops, especially bitwise logical ops, could have a lot of potential uses now. However one could define a “register pair” extension to avoid growing the register file.
>>
>> A much simpler “non-standard” extension could allow a subset of RV128 ops with 15 registers instead of 31, using register pairing/aggregation.
>>
>> A register pairing/aggregation scheme has been used for wide operations in other ISAs. One could re-use the existing RV128 LR.Q/SC.Q ISA encoding, with a non-standard extension that allows a subset of 128-bit operations on a subset of register pairs. Essentially “loose instructions” for 128-bit LR.Q/SC.Q and a “register pair” scheme.
>>
>
> This is an interesting idea, but the goal is a standard extension, and register pairing would require more ports on the register file, so would be a very hard sell. Also, the LR/LRM/SC proposal allows up to four _independent_ words to be updated, providing a significantly more powerful primitive than a simple 128-bit LR/SC.

Register ports are an implementation detail.

If the implementation reads double width operands from the register pairs in two cycles and writes double width operands to the register pairs in two cycles then there is only the necessity to have 128-bit operand buffers in the load store unit for LR.Q/SC.Q and two read ports and one write port on the register file. An OoO would of course need to handle dependencies and wakeup for opcodes that read or write register pairs.

If a 128-bit scalar register file is used, there is also no need for more register file ports; however, the width of the ports would double and hence take up a similar amount of area.

> I believe that LR/LRM/SC as a "light" transaction support will be easier to implement than full RVT, which I am also planning to propose, using the Herlihy model. The full RVT will support much more complex programming, but software will have some responsibility for ensuring forward progress. LR/LRM/SC will still have the RVA forward progress guarantee. RVT will have its own, weaker, guarantee.

Two transactional memory extensions sounds complex.

Jacob Bachmeyer

non lue,
28 déc. 2017, 18:52:3728/12/2017
à Michael Clark,Cesar Eduardo Barros,RISC-V ISA Dev
Michael Clark wrote:
>> On 28/12/2017, at 2:48 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>
>> Michael Clark wrote:
>>
>>> This approach to double wide LR/SC would require a 128-bit scalar register file, or a register pairing scheme, but no ISA changes:
>>>
>>> - https://github.com/michaeljclark/riscv-meta/blob/master/opcodes#L149-L150
>>>
>>> On x86 SP and DP scalar floating point operations and some scalar integer operations use the SSE/AVX xmm/ymm/zmm registers which are now 512-bits wide in Skylake. The extra width does not affect the performance of the narrower 32-bit or 64-bit scalar operations. x86-64 compilers uses SSE FP almost exclusively. Compilers only emit x87 code in some rare circumstances such as use of transcendentals. There is precedence for wide register files that are used frequently for scalar operations.
>>>
>>> In fact 128-bit scalar ops, especially bitwise logical ops, could have a lot of potential uses now. However one could define a “register pair” extension to avoid growing the register file.
>>>
>>> A much simpler “non-standard” extension could allow a subset of RV128 ops with 15 registers instead of 31, using register pairing/aggregation.
>>>
>>> A register pairing/aggregation scheme has been used for wide operations in other ISAs. One could re-use the existing RV128 LR.Q/SC.Q ISA encoding, with a non-standard extension that allows a subset of 128-bit operations on a subset of register pairs. Essentially “loose instructions” for 128-bit LR.Q/SC.Q and a “register pair” scheme.
>>>
>>>
>> This is an interesting idea, but the goal is a standard extension, and register pairing would require more ports on the register file, so would be a very hard sell. Also, the LR/LRM/SC proposal allows up to four _independent_ words to be updated, providing a significantly more powerful primitive than a simple 128-bit LR/SC.
>>
>
> Register ports are an implementation detail.
>
> If the implementation reads double width operands from the register pairs in two cycles and writes double width operands to the register pairs in two cycles then there is only the necessity to have 128-bit operand buffers in the load store unit for LR.Q/SC.Q and two read ports and one write port on the register file. An OoO would of course need to handle dependencies and wakeup for opcodes that read or write register pairs.
>

This is starting to introduce a CISC-like execution sequencer to handle
breaking an atomic instruction into multiple micro-ops. I believe that
RISC-V opcodes are supposed to *be* micro-ops and that this is one of
the differences between RISC and CISC architectures.

>> I believe that LR/LRM/SC as a "light" transaction support will be easier to implement than full RVT, which I am also planning to propose, using the Herlihy model. The full RVT will support much more complex programming, but software will have some responsibility for ensuring forward progress. LR/LRM/SC will still have the RVA forward progress guarantee. RVT will have its own, weaker, guarantee.
>>
>
> Two transactional memory extensions sounds complex.

Not really, LRM is a simple extension to the RVA extension and can be
easily implemented with small changes to hardware. The planned RVT
proposal more-or-less requires an additional transaction cache to
implement. LRM more closely resembles what passes for transactional
memory on x86, while the RVT proposal I am planning is more in line with
Herlihy's original proposal and more closely resembles database
transactions.


-- Jacob

Michael Clark

non lue,
28 déc. 2017, 19:09:3928/12/2017
à jcb6...@gmail.com,Cesar Eduardo Barros,RISC-V ISA Dev

On 29/12/2017, at 12:52 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

Michael Clark wrote:
On 28/12/2017, at 2:48 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

Michael Clark wrote:
   
This approach to double wide LR/SC would require a 128-bit scalar register file, or a register pairing scheme, but no ISA changes:

- https://github.com/michaeljclark/riscv-meta/blob/master/opcodes#L149-L150

On x86 SP and DP scalar floating point operations and some scalar integer operations use the SSE/AVX xmm/ymm/zmm registers which are now 512-bits wide in Skylake. The extra width does not affect the performance of the narrower 32-bit or 64-bit scalar operations. x86-64 compilers uses SSE FP almost exclusively. Compilers only emit x87 code in some rare circumstances such as use of transcendentals. There is precedence for wide register files that are used frequently for scalar operations.

In fact 128-bit scalar ops, especially bitwise logical ops, could have a lot of potential uses now. However one could define a “register pair” extension to avoid growing the register file.

A much simpler “non-standard” extension could allow a subset of RV128 ops with 15 registers instead of 31, using register pairing/aggregation.

A register pairing/aggregation scheme has been used for wide operations in other ISAs. One could re-use the existing RV128 LR.Q/SC.Q ISA encoding, with a non-standard extension that allows a subset of 128-bit operations on a subset of register pairs. Essentially “loose instructions” for 128-bit LR.Q/SC.Q and a “register pair” scheme.
      
This is an interesting idea, but the goal is a standard extension, and register pairing would require more ports on the register file, so would be a very hard sell.  Also, the LR/LRM/SC proposal allows up to four _independent_ words to be updated, providing a significantly more powerful primitive than a simple 128-bit LR/SC.
   

Register ports are an implementation detail.

If the implementation reads double width operands from the register pairs in two cycles and writes double width operands to the register pairs in two cycles then there is only the necessity to have 128-bit operand buffers in the load store unit for LR.Q/SC.Q and two read ports and one write port on the register file. An OoO would of course need to handle dependencies and wakeup for opcodes that read or write register pairs.
 

This is starting to introduce a CISC-like execution sequencer to handle breaking an atomic instruction into multiple micro-ops.  I believe that RISC-V opcodes are supposed to *be* micro-ops and that this is one of the differences between RISC and CISC architectures.

That’s a relatively minor implementation detail. FPU instructions take multiple cycles and have various steps i.e. renormalisation.

I’m referring to RISC in terms of a reduced instruction set, i.e. we already have LR.Q and SC.Q

I believe that LR/LRM/SC as a "light" transaction support will be easier to implement than full RVT, which I am also planning to propose, using the Herlihy model.  The full RVT will support much more complex programming, but software will have some responsibility for ensuring forward progress.  LR/LRM/SC will still have the RVA forward progress guarantee.  RVT will have its own, weaker, guarantee.
   

Two transactional memory extensions sounds complex.

Not really, LRM is a simple extension to the RVA extension and can be easily implemented with small changes to hardware.  The planned RVT proposal more-or-less requires an additional transaction cache to implement.  LRM more closely resembles what passes for transactional memory on x86, while the RVT proposal I am planning is more in line with Herlihy's original proposal and more closely resembles database transactions.

Two instruction set extensions to perform similar if not the same thing is not exactly what I would call RISC.

Register pairs have been implemented on several RISC processors. SPARC does this for double word accesses IIRC, and it’s based on RISC I and II [1].

[1] http://www.gaisler.com/doc/sparcv8.pdf

Jacob Bachmeyer

non lue,
28 déc. 2017, 22:06:4628/12/2017
à Michael Clark,Cesar Eduardo Barros,RISC-V ISA Dev
Michael Clark wrote:
> On 29/12/2017, at 12:52 PM, Jacob Bachmeyer <jcb6...@gmail.com
> <mailto:jcb6...@gmail.com>> wrote:
>> Michael Clark wrote:
>>>> On 28/12/2017, at 2:48 PM, Jacob Bachmeyer <jcb6...@gmail.com
The FPU has a separate register file and quite likely a separate
execution pipeline from the scalar integer unit.

> I’m referring to RISC in term of reduced instruction set. i.e. we
> already have LR.Q and SC.Q

Both of which are already reserved for use on RV128 for atomic access to
aligned contiguous quadwords.

>>>> I believe that LR/LRM/SC as a "light" transaction support will be
>>>> easier to implement than full RVT, which I am also planning to
>>>> propose, using the Herlihy model. The full RVT will support much
>>>> more complex programming, but software will have some
>>>> responsibility for ensuring forward progress. LR/LRM/SC will still
>>>> have the RVA forward progress guarantee. RVT will have its own,
>>>> weaker, guarantee.
>>> Two transactional memory extensions sounds complex.
>>
>> Not really, LRM is a simple extension to the RVA extension and can be
>> easily implemented with small changes to hardware. The planned RVT
>> proposal more-or-less requires an additional transaction cache to
>> implement. LRM more closely resembles what passes for transactional
>> memory on x86, while the RVT proposal I am planning is more in line
>> with Herlihy's original proposal and more closely resembles database
>> transactions.
>
> Two instruction set extensions to perform similar if not the same
> thing is not exactly what I would call RISC.

This is an interesting point, but is something I am considering. The
two are not quite that similar: LRM is really only useful for
synchronization primitives, due to its limited reach of only four words,
while RVT will be enough to traverse and update a large data structure
with atomic semantics.

> Register pairs have been implemented on several RISC processors. SPARC
> does this for double word accesses IIRC and it’s based on RISC I and
> II [1].
>
> [1] http://www.gaisler.com/doc/sparcv8.pdf

Interesting, but RISC-V is not SPARC and seems to be trying to avoid
such complexities.


-- Jacob

Michael Chapman

non lue,
29 déc. 2017, 08:47:1329/12/2017
à jcb6...@gmail.com,Michael Clark,Cesar Eduardo Barros,RISC-V ISA Dev

On 29-Dec-17 04:06, Jacob Bachmeyer wrote:
> The FPU has a separate register file and quite likely a separate
> execution pipeline from the scalar integer unit.

The FPU register file does not have to be separate. And in some cases
there are no advantages in it being separate.


Jacob Bachmeyer

non lue,
30 déc. 2017, 19:19:3530/12/2017
à Michael Chapman,Michael Clark,Cesar Eduardo Barros,RISC-V ISA Dev
You are correct that hardware implementations are free to combine the
FPU into the main scalar unit, but the ISA considers the FPU registers
(f0 - f31) to be a distinct set from the main registers (x0 - x31).


-- Jacob

Michael Clark

non lue,
31 déc. 2017, 21:52:0631/12/2017
à jcb6...@gmail.com,Cesar Eduardo Barros,RISC-V ISA Dev


> On 29/12/2017, at 4:06 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> Michael Clark wrote:
>> On 29/12/2017, at 12:52 PM, Jacob Bachmeyer <jcb6...@gmail.com <mailto:jcb6...@gmail.com>> wrote:
>>> Michael Clark wrote:
>>>>> On 28/12/2017, at 2:48 PM, Jacob Bachmeyer <jcb6...@gmail.com <mailto:jcb6...@gmail.com>> wrote:
>>>>> Michael Clark wrote:
>>>>>> This approach to double wide LR/SC would require a 128-bit scalar register file, or a register pairing scheme, but no ISA changes:
>>>>>>
>>>>>> - https://github.com/michaeljclark/riscv-meta/blob/master/opcodes#L149-L150
>>>>>>
>>>>>> On x86 SP and DP scalar floating point operations and some scalar integer operations use the SSE/AVX xmm/ymm/zmm registers which are now 512-bits wide in Skylake. The extra width does not affect the performance of the narrower 32-bit or 64-bit scalar operations. x86-64 compilers uses SSE FP almost exclusively. Compilers only emit x87 code in some rare circumstances such as use of transcendentals. There is precedence for wide register files that are used frequently for scalar operations.
>>>>>>
>>>>>> In fact 128-bit scalar ops, especially bitwise logical ops, could have a lot of potential uses now. However one could define a “register pair” extension to avoid growing the register file.
>>>>>>
>>>>>> A much simpler “non-standard” extension could allow a subset of RV128 ops with 15 registers instead of 31, using register pairing/aggregation.
>>>>>>
>>>>>> A register pairing/aggregation scheme has been used for wide operations in other ISAs. One could re-use the existing RV128 LR.Q/SC.Q ISA encoding, with a non-standard extension that allows a subset of 128-bit operations on a subset of register pairs. Essentially “loose instructions” for 128-bit LR.Q/SC.Q and a “register pair” scheme.
>>>>>>
>>>>> This is an interesting idea, but the goal is a standard extension, and register pairing would require more ports on the register file, so would be a very hard sell. Also, the LR/LRM/SC proposal allows up to four _independent_ words to be updated, providing a significantly more powerful primitive than a simple 128-bit LR/SC.
>>>>>
>>>> Register ports are an implementation detail.
>>>>
>>>> If the implementation reads double width operands from the register pairs in two cycles and writes double width operands to the register pairs in two cycles then there is only the necessity to have 128-bit operand buffers in the load store unit for LR.Q/SC.Q and two read ports and one write port on the register file. An OoO would of course need to handle dependencies and wakeup for opcodes that read or write register pairs.
>>>
>>> This is starting to introduce a CISC-like execution sequencer to handle breaking an atomic instruction into multiple micro-ops. I believe that RISC-V opcodes are supposed to *be* micro-ops and that this is one of the differences between RISC and CISC architectures.
>>
>> That’s a relatively minor implementation detail. FPU instructions take multiple cycles and have various steps i.e. renormalisation.
>
> The FPU has a separate register file and quite likely a separate execution pipeline from the scalar integer unit.
>
>> I’m referring to RISC in term of reduced instruction set. i.e. we already have LR.Q and SC.Q
>
> Both of which are already reserved for use on RV128 for atomic access to aligned contiguous quadwords.

One would assume a Double Wide LR/SC, as an atomic, is exactly that: access to aligned contiguous quadwords.

>>>>> I believe that LR/LRM/SC as a "light" transaction support will be easier to implement than full RVT, which I am also planning to propose, using the Herlihy model. The full RVT will support much more complex programming, but software will have some responsibility for ensuring forward progress. LR/LRM/SC will still have the RVA forward progress guarantee. RVT will have its own, weaker, guarantee.
>>>> Two transactional memory extensions sounds complex.
>>>
>>> Not really, LRM is a simple extension to the RVA extension and can be easily implemented with small changes to hardware. The planned RVT proposal more-or-less requires an additional transaction cache to implement. LRM more closely resembles what passes for transactional memory on x86, while the RVT proposal I am planning is more in line with Herlihy's original proposal and more closely resembles database transactions.
>>
>> Two instruction set extensions to perform similar if not the same thing is not exactly what I would call RISC.
>
> This is an interesting point, but is something I am considering. The two are not quite that similar: LRM is really only useful for synchronization primitives, due to its limited reach of only four words, while RVT will be enough to traverse and update a large data structure with atomic semantics.

I don’t know if you’ll get much support for two T extensions: T-lite and T-full.

>> Register pairs have been implemented on several RISC processors. SPARC does this for double word accesses IIRC and it’s based on RISC I and II [1].
>>
>> [1] http://www.gaisler.com/doc/sparcv8.pdf
>
> Interesting, but RISC-V is not SPARC and seems to be trying to avoid such complexities.

Interestingly, Arm has made the pair of registers explicit with LDXP (Load exclusive pair of registers) and STLXP (Store-release exclusive pair of registers, returning status). That approach, however, definitely needs additional register file ports.

There are two approaches I outlined: a 128-bit scalar register file, or a 64-bit scalar register file with 128-bit ops using (implicit) register pairs. There are implementation approaches for the latter that don’t require extra ports on the register file but may require a couple of bits of state in the load store unit, along with a 128-bit operand buffer and 128-bit access to the L1 cache.

I suspect both approaches may be simpler to implement than T-lite, and would match the expectations of existing software; i.e. the purpose of a double-wide LR/SC is to allow emulation of CMPXCHG16B (which, by the way, requires aligned operands), for example:

https://github.com/ivmai/libatomic_ops/blob/93611e38f041b3befbb14462a1dca1d12656636c/src/atomic_ops/sysdeps/gcc/aarch64.h#L119-L140

Jacob Bachmeyer

non lue,
31 déc. 2017, 22:13:0931/12/2017
à Michael Clark,Cesar Eduardo Barros,RISC-V ISA Dev
Michael Clark wrote:
>> On 29/12/2017, at 4:06 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>
>> Michael Clark wrote:
>>
>>> On 29/12/2017, at 12:52 PM, Jacob Bachmeyer <jcb6...@gmail.com <mailto:jcb6...@gmail.com>> wrote:
>>>
>>>> Michael Clark wrote:
>>>>
>>>>>> On 28/12/2017, at 2:48 PM, Jacob Bachmeyer <jcb6...@gmail.com <mailto:jcb6...@gmail.com>> wrote:
>>>>>> Michael Clark wrote:
>>>>>>
>>>>>>> This approach to double wide LR/SC would require a 128-bit scalar register file, or a register pairing scheme, but no ISA changes:
>>>>>>>
>>>>>>> - https://github.com/michaeljclark/riscv-meta/blob/master/opcodes#L149-L150
>>>>>>>
>>>>>>> On x86 SP and DP scalar floating point operations and some scalar integer operations use the SSE/AVX xmm/ymm/zmm registers which are now 512-bits wide in Skylake. The extra width does not affect the performance of the narrower 32-bit or 64-bit scalar operations. x86-64 compilers uses SSE FP almost exclusively. Compilers only emit x87 code in some rare circumstances such as use of transcendentals. There is precedence for wide register files that are used frequently for scalar operations.
>>>>>>>
>>>>>>> In fact 128-bit scalar ops, especially bitwise logical ops, could have a lot of potential uses now. However one could define a “register pair” extension to avoid growing the register file.
>>>>>>>
>>>>>>> A much simpler “non-standard” extension could allow a subset of RV128 ops with 15 registers instead of 31, using register pairing/aggregation.
>>>>>>>
>>>>>>> A register pairing/aggregation scheme has been used for wide operations in other ISAs. One could re-use the existing RV128 LR.Q/SC.Q ISA encoding, with a non-standard extension that allows a subset of 128-bit operations on a subset of register pairs. Essentially “loose instructions” for 128-bit LR.Q/SC.Q and a “register pair” scheme.
>>>>>>>
>>>>>>>
>>>>>> This is an interesting idea, but the goal is a standard extension, and register pairing would require more ports on the register file, so would be a very hard sell. Also, the LR/LRM/SC proposal allows up to four _independent_ words to be updated, providing a significantly more powerful primitive than a simple 128-bit LR/SC.
>>>>>>
>>>>>>
>>>>> Register ports are an implementation detail.
>>>>>
>>>>> If the implementation reads double width operands from the register pairs in two cycles and writes double width operands to the register pairs in two cycles then there is only the necessity to have 128-bit operand buffers in the load store unit for LR.Q/SC.Q and two read ports and one write port on the register file. An OoO would of course need to handle dependencies and wakeup for opcodes that read or write register pairs.
>>>>>
>>>> This is starting to introduce a CISC-like execution sequencer to handle breaking an atomic instruction into multiple micro-ops. I believe that RISC-V opcodes are supposed to *be* micro-ops and that this is one of the differences between RISC and CISC architectures.
>>>>
>>> That’s a relatively minor implementation detail. FPU instructions take multiple cycles and have various steps i.e. renormalisation.
>>>
>> The FPU has a separate register file and quite likely a separate execution pipeline from the scalar integer unit.
>>
>>
>>> I’m referring to RISC in term of reduced instruction set. i.e. we already have LR.Q and SC.Q
>>>
>> Both of which are already reserved for use on RV128 for atomic access to aligned contiguous quadwords.
>>
>
> One would assume a Double Wide LR/SC, as an atomic is exactly that, access to aligned contiguous quadwords.
>

That is not what LR/LRM/SC offers. LRM adds the ability to atomically
update up to four _independent_ (aligned) words.

Using LR.Q/SC.Q for aligned contiguous quadword <-> register pair on
RV64 will make a mess later with RV128, where a quadword fits into a
single register.

>>>>>> I believe that LR/LRM/SC as a "light" transaction support will be easier to implement than full RVT, which I am also planning to propose, using the Herlihy model. The full RVT will support much more complex programming, but software will have some responsibility for ensuring forward progress. LR/LRM/SC will still have the RVA forward progress guarantee. RVT will have its own, weaker, guarantee.
>>>>>>
>>>>> Two transactional memory extensions sounds complex.
>>>>>
>>>> Not really, LRM is a simple extension to the RVA extension and can be easily implemented with small changes to hardware. The planned RVT proposal more-or-less requires an additional transaction cache to implement. LRM more closely resembles what passes for transactional memory on x86, while the RVT proposal I am planning is more in line with Herlihy's original proposal and more closely resembles database transactions.
>>>>
>>> Two instruction set extensions to perform similar if not the same thing is not exactly what I would call RISC.
>>>
>> This is an interesting point, but is something I am considering. The two are not quite that similar: LRM is really only useful for synchronization primitives, due to its limited reach of only four words, while RVT will be enough to traverse and update a large data structure with atomic semantics.
>>
>
> I don’t know if you’ll get much support for two T extensions. T-lite and T-full.
>

I do not think of them as "T-lite" and "T-full", but as "RVAmlr" and
"RVT" -- LRM is an extension to the base RVA extension. LRM is a
building block for synchronization primitives, while RVT is for walking
and updating a large in-memory data structure with transactional semantics.

>>> Register pairs have been implemented on several RISC processors. SPARC does this for double word accesses IIRC and it’s based on RISC I and II [1].
>>>
>>> [1] http://www.gaisler.com/doc/sparcv8.pdf
>>>
>> Interesting, but RISC-V is not SPARC and seems to be trying to avoid such complexities.
>>
>
> Interestingly arm has made the pair of registers explicit with LDXP (Load exclusive pair of registers) and STLXP (Store-release exclusive pair of registers, returning status). That approach however definitely needs additional register file ports.
>
> There are two approaches I outlined, 128-bit scalar register file, or 64-bit scalar register file with 128-bit ops using (implicit) register pairs. There are implementation approaches for the latter that don’t require extra ports on the register file but may require a couple of bits of state in the load store unit along with 128-bit width operand buffer and 128-bit access to the L1 cache.
>

Neither of which allows the use of non-adjacent words, which is the
direction this discussion quickly went.

> I suspect both approaches may be simpler to implement than T-lite, and would match the expectation of existing software. i.e. the purpose of a double wide LR/SC is to allow emulation of CMPXCHG16B (which btw requires aligned operands). i.e.
>
> https://github.com/ivmai/libatomic_ops/blob/93611e38f041b3befbb14462a1dca1d12656636c/src/atomic_ops/sysdeps/gcc/aarch64.h#L119-L140
>

They may be simpler, but they only give you compare-and-swap. LRM can
do much more, such as a straightforward implementation of MCS locks.
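
For illustration, an MCS acquire in which the tail update and the
predecessor link are published as one group might look like this (lrm.d,
the NEXT and LOCKED field offsets, and the node layout are hypothetical;
memory-ordering fences are omitted for brevity):

mcs_acquire:               # a0 = &lock->tail, a1 = my node
                           # (a1->next = 0 and a1->locked = 1 set beforehand)
  lr.d   t0, (a0)          # reserve the tail pointer; t0 = predecessor
  beqz   t0, 2f            # empty queue: a plain LR/SC pair is enough
  lrm.d  t1, NEXT(t0)      # reserve predecessor->next; the loaded value is
                           # unused (cf. the earlier question about groups
                           # whose loaded values are not used)
  sc.d   t2, a1, (a0)      # group: tail = my node ...
  bnez   t2, mcs_acquire
  sc.d   t2, a1, NEXT(t0)  # ... and predecessor->next = my node, in one step
  bnez   t2, mcs_acquire
1:
  ld     t3, LOCKED(a1)    # spin until the predecessor hands the lock over
  bnez   t3, 1b
  jr     ra
2:
  sc.d   t2, a1, (a0)      # no predecessor: just claim the tail
  bnez   t2, mcs_acquire
  jr     ra

With plain RVA the same enqueue needs an AMOSWAP on the tail followed by
a separate store to predecessor->next, leaving a window that the release
path has to cope with.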



-- Jacob

Jonas Oberhauser

non lue,
1 janv. 2018, 04:06:1501/01/2018
à Michael Clark,Jacob Bachmeyer,Cesar Eduardo Barros,RISC-V ISA Dev


On Jan 1, 2018 03:52, "Michael Clark" <michae...@mac.com> wrote:

>> I’m referring to RISC in term of reduced instruction set. i.e. we already have LR.Q and SC.Q
>
> Both of which are already reserved for use on RV128 for atomic access to aligned contiguous quadwords.

One would assume a Double Wide LR/SC, as an atomic is exactly that, access to aligned contiguous quadwords.

The discussion some time ago moved on from the double wide LR/SC to nested LR/SC, which is needed for non-adjacent DCAS, which itself appears to be needed for some concurrent data structures. I say appears because it might be possible to use an adjacent one if one carefully massages the data in the data structures, but I have not checked.

Either way, nested LR/SC can also be used for other things that double-wide LR/SC cannot be used for, such as the example Jacob pointed out.

Michael Clark

non lue,
1 janv. 2018, 15:33:0301/01/2018
à Jonas Oberhauser,Jacob Bachmeyer,Cesar Eduardo Barros,RISC-V ISA Dev

On 1/01/2018, at 10:06 PM, Jonas Oberhauser <s9jo...@gmail.com> wrote:



On Jan 1, 2018 03:52, "Michael Clark" <michae...@mac.com> wrote:

>> I’m referring to RISC in term of reduced instruction set. i.e. we already have LR.Q and SC.Q
>
> Both of which are already reserved for use on RV128 for atomic access to aligned contiguous quadwords.

One would assume a Double Wide LR/SC, as an atomic is exactly that, access to aligned contiguous quadwords.

The discussion some time ago moved on from the double wide LR/SC to nested LR/SC, which is needed for non-adjacent DCAS, which itself appears to be needed for some concurrent data structures. I say appears because it might be possible to use an adjacent one if one carefully massages the data in the data structures, but I have not checked.

What you describe sounds like a perfect use-case for the T extension. Non-contiguous atomic operations across multiple loads and stores are transactions, and they require transaction support on the existing platforms that provide it. I think one of the key RISC principles is adding instructions that are commonly emitted by compilers.

These particular use cases you are talking about are clearly not features in use by any significant proportion of existing code, aside from code that relies on full transactional memory.

I personally believe there would be more utility in providing a mechanism to support Double-wide LR/SC as this would be a requirement for RISC-V support for binary translation of CMPXCHG16B and LDXP/STLXP. i.e. Fast x86 and aarch64 emulation.

Either way nested LR/SC can also be used for other  things double wide LR/SC can not be used for, such as the example Jacob pointed out.

Understand. I’ve been following the discussion and the matched recursive LR/LRM and SC sound problematic. The interface is not ideal given the return code on each SC. There needs to be a delineation of the beginning and end of a transaction and the number of loads may not necessarily match the number of stores. One wonders the utility of such a limited form of transaction support.

One would presume there would be a single standardised T extension, not T-lite and T-full. One would possibly also support regular loads and stores in transactions with BEGIN, COMMIT, ABORT vs trying to tack something on to LR/SC. 
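
For contrast, the shape of such a BEGIN/COMMIT/ABORT style transaction
might be roughly the following (tbegin and tcommit are mnemonics invented
here purely to illustrate that shape, with the assumption that an abort
transfers control back to the tbegin with a nonzero code in its
destination register):

xact_shape:                # atomically increment *a0 and *a1
  tbegin t0                # assumed: t0 = 0 on entry, nonzero after an abort
  bnez   t0, aborted
  ld     t1, 0(a0)         # ordinary loads and stores inside the transaction
  addi   t1, t1, 1
  sd     t1, 0(a0)
  ld     t1, 0(a1)
  addi   t1, t1, 1
  sd     t1, 0(a1)
  tcommit                  # both stores become visible atomically, or not at all
  jr     ra
aborted:
  # no architectural forward-progress guarantee here: software is expected
  # to bound its retries and eventually fall back to a lock or some other
  # non-transactional path (the responsibility noted earlier in the thread)
  j      xact_shape        # naive unconditional retry, for brevity only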



Jonas Oberhauser

non lue,
2 janv. 2018, 03:49:2702/01/2018
à Michael Clark,Jacob Bachmeyer,Cesar Eduardo Barros,RISC-V ISA Dev


On Jan 1, 2018 9:33 PM, "Michael Clark" <michae...@mac.com> wrote:


What you describe sounds like a perfect use-case for the T extension. Non-contiguous atomic operations across multiple loads and stores are transactions and require transaction support on existing platforms that support this.

They do not require full transactional memory.


I think one of the key RISC principles is adding instructions that are commonly emitted by compilers.

These particular use cases you are talking about are clearly not features that are in use by any proportion of existing code besides code that relies on full transactional memory.

They are not part of existing code because the data structures that use them are new, and there are no HW implementations. For example, concurrent disjoint set union using non-adjacent DCAS was, I think, work completed in 2016 (and in that case it is not possible to use adjacent DCAS).

So I think the current lack of such data structures in existing code is not a good argument for not supporting these data structures.
Linux also integrated the MCS lock only a few years ago, and MCS locks could profit from nested LR/SC (one might be able to replace ticket spinlocks with MCS locks with nested LR/SC, but certainly not with T or DCAS (even non-adjacent)).


I personally believe there would be more utility in providing a mechanism to support Double-wide LR/SC as this would be a requirement for RISC-V support for binary translation of CMPXCHG16B and LDXP/STLXP. i.e. Fast x86 and aarch64 emulation.

As far as I understand, any utility provided by double wide LR/SC is also provided by nested LR/SC at very little cost.

I’ve been following the discussion and the matched recursive LR/LRM and SC sound problematic. The interface is not ideal given the return code on each SC.

I kind of agree on that last part, since multiple SCs could each fail for a different reason, but I did not see it as a big issue since one can always tell whether the sequence succeeded or not, and that is enough for the code I have seen. I am not sure if there is code that uses more specific error codes, so if someone can link to any, that would be appreciated.

There needs to be a delineation of the beginning and end of a transaction and the number of loads may not necessarily match the number of stores. One wonders the utility of such a limited form of transaction support.

While for general transactions your comments are apt, there are several small atomic operations that can be usefully and efficiently implemented with nested LR/SC, where a full transactional memory is not needed but where the forward progress guarantee that LR/SC can provide, and T cannot, makes a huge difference.


One would presume there would be a single standardised T extension, not T-lite and T-full. One would possibly also support regular loads and stores in transactions with BEGIN, COMMIT, ABORT vs trying to tack something on to LR/SC.

The problem is that all of this is quite expensive and overkill for the use cases that really only need a DCAS or a "follow a non-null pointer and update the referenced data before the pointer is changed remotely".

I am under the impression that many programs in the future will make use of such small operations (well hidden inside a library, of course, which implements locks or concurrent wait-free data structures like queues, search trees, etc), but that these programs will not need T. So why pay the full price of T and lose the forward guarantee? In particular, implementing DCAS on top of T destroys fairness (so a thread that attempts to do DCAS could stop making progress). 

The only concern I have is that we may well be moving towards a future where transactions become the main mode of programming concurrent systems, but I think there is still some time until that becomes a reality (if it ever does) and even then the operations provided by nested LR/SC can be useful for the remaining software.

Jonas Oberhauser

non lue,
2 janv. 2018, 03:54:4202/01/2018
à Michael Clark,Jacob Bachmeyer,Cesar Eduardo Barros,RISC-V ISA Dev


On Jan 2, 2018 9:49 AM, "Jonas Oberhauser" <s9jo...@gmail.com> wrote:



Linux has also only a few years ago integrated the MCS lock, and MCS locks could profit from nested LR/SC (one might be able to replace ticket spinlocks with MCS locks with nested LR/SC, but certainly not with T or DCAS (even non-adjacent)).

PS: Note that I am not a Linux developer and I have not closely been following Linux development, so there may have been developments in the last two years that may have made the comment above obsolete. I would be thankful if someone could weigh in.

Christoph Hellwig

non lue,
2 janv. 2018, 05:10:1402/01/2018
à Jonas Oberhauser,Michael Clark,Jacob Bachmeyer,Cesar Eduardo Barros,RISC-V ISA Dev
On Tue, Jan 02, 2018 at 09:54:40AM +0100, Jonas Oberhauser wrote:
> On Jan 2, 2018 9:49 AM, "Jonas Oberhauser" <s9jo...@gmail.com> wrote:
>
>
>
>
> Linux has also only a few years ago integrated the MCS lock, and MCS locks
> could profit from nested LR/SC (one might be able to replace ticket
> spinlocks with MCS locks with nested LR/SC, but certainly not with T or
> DCAS (even non-adjacent)).

Linux has an optional cmpxchg_double() arch primitive, which does
a cmpxchg on two different values up to 64 bits wide each. The primary
user of that is the slub memory allocator, which uses it for a lock-free
fast path.
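
For reference, a sketch of what cmpxchg_double() could lower to with the
primitives discussed in this thread, treating the two values as adjacent
doublewords at a 16-byte-aligned address (the lrm.d mnemonic and the
calling convention are illustrative assumptions, not any kernel port):

cmpxchg_double:            # a0 = 16-byte-aligned address of the adjacent pair
                           # a2, a3 = expected; a4, a5 = new
  lr.d   t0, 0(a0)
  lrm.d  t1, 8(a0)         # adjacent word; a double-wide LR/SC would also do
  bne    t0, a2, 1f
  bne    t1, a3, 1f
  sc.d   t2, a4, 0(a0)
  bnez   t2, cmpxchg_double
  sc.d   t2, a5, 8(a0)
  bnez   t2, cmpxchg_double
  li     a0, 1             # success
  jr     ra
1:
  li     a0, 0             # comparison failed
  jr     ra

The slub fast path only needs this adjacent form, so either the
double-wide or the nested flavour discussed here would appear to cover it.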