128-bit CAS for RISC V

130 views
Skip to first unread message

Ruslan Nikolaev

unread,
Dec 9, 2022, 2:12:19 PM12/9/22
to RISC-V ISA Dev
Hi everyone,

I am doing research in lock- and wait-free algorithms. I was also keeping an eye on RISC V more recently (and even experimented with some lock-/wait-free code on a RISC V board).

I know that this topic has been discussed heavily ( e.g., https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/0y3ndO_Sne0/m/biTIIwRfBgAJ ), but it is not clear to me whether there is any clear resolution.

I just want to mention that double-width (128-bit) CAS or 128-bit LL/SC would *really* be useful for building both lock-free and wait-free algorithms. Both x86-64 and AArch64 support it. Windows 10 requires cmpxchg16b, I believe that Linux also uses cmpxchg16b optionally.

Currently, the RISC V manual plays down the lack of double-width (128-bit) atomics and refers to transactional memory (if I remember correctly). As a researcher in this area, I do not agree with the presented arguments there and think that having support for 128-bit CAS or 128-bit LL/SC is really a must for modern 64-bit architectures. In fact, nanoMIPS added support for double-width LL/SC, which is called LLWP/SCWP there if I remember correctly. (Perhaps a somewhat similar instruction style can be adopted by RISC V?)

Having double-width CAS and regular CAS directly as in more recent AArch64 is even better (since it can never fail sporadically due to an OS interrupt) but having double-width LL/SC would certainly be better than nothing.

I also want to mention that in my recent publications, I specifically explored the use of double-width CAS operations to guarantee wait-freedom. There is a reason to believe that even more wait-free algorithms can benefit from it in the future. For example, see my two recent publications:
1. A Fast Wait-Free Queue with Bounded Memory Usage https://dl.acm.org/doi/pdf/10.1145/3490148.3538572
2. Universal Wait-Free Memory Reclamation

The above two papers can be implemented on both x86-64 and AArch64 without any issues. RISC V has almost everything except double-width CAS, which is really sad given the growing popularity of this architecture.

Please let me know if there is any additional information I can provide that can be helpful.

Sincerely,
Ruslan Nikolaev

MitchAlsup

unread,
Jan 3, 2023, 4:06:56 PM1/3/23
to RISC-V ISA Dev, Ruslan Nikolaev
If you are able to sign an NDA I can send you a document describing the features and functionality of an 
ATOMIC supporting architecture that can support up to 8 cache lines participating in a single ATOMIC event. 
With this you can implement every industrial and academic atomic primitive in literature; including things like 
move a multi-linked structure from one place to another place within a concurrent data structure in a single 
atomic event--such that all 3rd parties see the concurrent data structure entirely in the before or entirely in 
the after states with no intermediate state being visible.

Albert Cahalan

unread,
Jan 3, 2023, 4:26:39 PM1/3/23
to MitchAlsup, RISC-V ISA Dev, Ruslan Nikolaev
On Friday, December 9, 2022 at 1:12:19 PM UTC-6 Ruslan Nikolaev wrote:

> I just want to mention that double-width (128-bit) CAS or 128-bit LL/SC
> would *really* be useful for building both lock-free and wait-free
> algorithms. Both x86-64 and AArch64 support it. Windows 10 requires
> cmpxchg16b, I believe that Linux also uses cmpxchg16b optionally.
...
> Having double-width CAS and regular CAS directly as in
> more recent AArch64 is even better (since it can never fail
> sporadically due to an OS interrupt) but having double-width
> LL/SC would certainly be better than nothing.

Another strike against the LL/SC approach:

The CAS approach is compatible with hardware-assisted
replay debuggers, while the LL/SC approach is not. These
debuggers need an instruction sequence that is repeatable.

The LL/SC approach should not be implemented at all.

Ruslan Nikolaev

unread,
Jan 4, 2023, 9:18:26 AM1/4/23
to RISC-V ISA Dev, acahalan, RISC-V ISA Dev, Ruslan Nikolaev, MitchAlsup
I have additional thoughts, which I summarize below. I agree with 'acahalan' that having CAS would be highly desirable (more elaborated below). I understand that RISC-V has limitations on the number of registers for inputs and outputs. Therefore I focus on some realistic approaches for both LR/SC and (single-width and double-width) CAS, even though they are not perfect. Probably, if any extension is proposed, it would be good both LR/SC and CAS for completeness. (Attention: I discuss the approaches that were implemented in some architectures per their architecture manuals, I do not know if there is any patent(s) or any other potential legal issues for RISC-V. That would have to be investigated by people who are knowledgeable in that area, so please do not cite me on that :) )

1. For double-width LR/SC, my guess that both rs2 and rd fields in the instructions would need be used as a destination for LR. For SC, both registers must be a source, and one of the registers will also be a destination (status). Also, if I remember correctly, that is what nanoMIPS is doing, but I do not have their manual right now to check. Currently, single-width LR does not use rs2 at all, that will need to be changed. SC uses rs2 for source, rd for destination. That will need to be changed also, the same register (rd) should be used also as a source. 

My guess is that the biggest challenge is for SC where one of the registers is *both* a source and a destination. Alternatively, it can be just a hard-coded register for the *status* which is likely to be an ephemeral value anyway. This value is unlikely needed after a couple of instructions, where the branch will be taken.

2. CAS is worthy of discussion. To guarantee strict lock-freedom, let alone wait-freedom, we still need to have a single instruction which cannot be interrupted somewhere in the middle. I see how RISC-V is trying to guarantee livelock-freedom, but it is still not sufficient for per-data structure lock-freedom or wait-freedom. It is only about overall (system-wide) livelock-freedom, which is much weaker.

For a regular CAS, there should not be any big problem. It can just follow AMOADD (fetch-and-add) and similar instructions. I know that the instruction does not return the status, but that is OK - the status can be obtained by simply comparing the value that was returned and the value that was supplied. (It succeeds if and only if the old value matches the supplied value.)

For double-width CAS, it is tricky, but something similar to cmp8xchg16 (as in Itanium) can be implemented. Here, we do something opposite - only the status (no old value) is returned. That is typically OK considering that it is a corner case anyway.  cmp8xchg16 would still need 3 input registers (2 for the new value pair even though only one register is compared, 1 for the memory address) and 1 output register for the result. So, it is very similar to double-width SC, where one of the source registers has to be used for status OR the status has to be a hard-coded register.

One downside of cmp8xchg16 is that you also need *atomic* ld16 and st16 (especially ld16!) to work. Both ld16 and st16 are implemented in Itanium. Still, these instructions can be implemented since we have 3 registers per instruction. The other downside is that the instruction is non-standard unlike cmpxchg16b in x86-64, but can still be used to implement *most* ABA-safe code since comparing pointer tags alone is sufficient (assuming that you also *atomically* loaded pointers with ld16). You can argue that cmpxchg16b is also somewhat non-standard since there is no official 16-byte *atomic* load in x86-64.

As a programmer, I would still prefer cmpxchg16b with no 16-byte loads vs. cmp8xchg16 + ld16, since it compares 16 bytes and can *potentially* be more useful, but having cmp8xchg16 is certainly better than nothing (cmp8xchg16 would also nicely complement double-width LR/SC, which would still be used in all other cases that are not covered by cmp8xchg16). This is a corner case anyway, it can be handled manually through inline assembly or compiler extension. Also, it is one word comparison (cmp8) - it is probably easier to implement in hardware.

Albert Cahalan

unread,
Jan 4, 2023, 4:11:37 PM1/4/23
to Ruslan Nikolaev, RISC-V ISA Dev, MitchAlsup
On 1/4/23, Ruslan Nikolaev <nrusl...@gmail.com> wrote:

> I have additional thoughts, which I summarize below. I agree with
> 'acahalan' that having CAS would be highly desirable (more elaborated
> below). I understand that RISC-V has limitations on the number of registers
> for inputs and outputs. Therefore I focus on some realistic approaches for
> both LR/SC and (single-width and double-width) CAS, even though they are
> not perfect. Probably, if any extension is proposed, it would be good both
> LR/SC and CAS for completeness.

No, because this "completeness" prevents replay debuggers
from working. Simply having LL/SC is an architectural flaw.
The instructions should not exist at all.

Limitations on the number of registers is not a matter of the
instruction decoding. It's a physical hardware clock speed
and chip area issue, combined with microcode avoidance.
Instruction decoding is an easy problem. Simply flip the low
bit of a register number to get the second register.

> 2. CAS is worthy of discussion. To guarantee strict lock-freedom,
> let alone wait-freedom, we still need to have a single instruction
> which cannot be interrupted somewhere in the middle.

Replay requires that all instructions are this way. LL/SC must
not exist. They are an architectural flaw. It should be possible
to specify a maximum number of instructions to execute and to
measure how many instructions did execute, and there shouldn't
be any hidden strangeness like a LL/SC loop running a different
number of times.

> One downside of cmp8xchg16 is that you also need *atomic*
> ld16 and st16 (especially ld16!) to work.

Those are valuable anyway.

MitchAlsup

unread,
Jan 4, 2023, 5:21:42 PM1/4/23
to RISC-V ISA Dev, acahalan, RISC-V ISA Dev, MitchAlsup, Ruslan Nikolaev
On Wednesday, January 4, 2023 at 3:11:37 PM UTC-6 acahalan wrote:
On 1/4/23, Ruslan Nikolaev <nrusl...@gmail.com> wrote:

> I have additional thoughts, which I summarize below. I agree with
> 'acahalan' that having CAS would be highly desirable (more elaborated
> below). I understand that RISC-V has limitations on the number of registers
> for inputs and outputs. Therefore I focus on some realistic approaches for
> both LR/SC and (single-width and double-width) CAS, even though they are
> not perfect. Probably, if any extension is proposed, it would be good both
> LR/SC and CAS for completeness.

No, because this "completeness" prevents replay debuggers
from working. Simply having LL/SC is an architectural flaw.
The instructions should not exist at all.

We (architects) are caught between a rock and a hard place::

Either a) we invent and implement new ATOMIC instructions of
increasing complexity every new generation, or b) we invent a
programatic way of allowing software people to implement new
ways of performing ATOMIC events, use them for several years
(or several core generations), and when SW if finally satisfied that
these do the trick, then (and only then) are these bundles of work
ready to be cast into instructions.

LL/SC is in category (b) AMD's ASF is also in category (b).
Transactional memory attempts to be in category (b), but.........


Limitations on the number of registers is not a matter of the
instruction decoding. It's a physical hardware clock speed
and chip area issue, combined with microcode avoidance.
 
Red Herring:: microcode is simply a means of sequencing.
Such sequencing could be done with random logic and be
just as capable and take just as many cycles, juggle just as
much "stuff" and not be "microcode" (read out of a ROM).

Instruction decoding is an easy problem. Simply flip the low
bit of a register number to get the second register.

> 2. CAS is worthy of discussion. To guarantee strict lock-freedom,
> let alone wait-freedom, we still need to have a single instruction
> which cannot be interrupted somewhere in the middle.

Replay requires that all instructions are this way.

Alternatives exist: while it is true that one cannot single step through
an ATOMIC event and have any semblance of ATOMICity; it is false
that one cannot suppress interrupts for the duration of an ATOMIC
event (subject to timeouts and having a plan to deal with exceptions).

So, for example, say you had something like LL/SC and that the
core would suppress interrupts between LL and SC, so your single
stepping debugger would single step up to the LL and then you single 
again and end up at the instruction following SC. Here, at least the
ATOMICity of the event is not harmed with the "debugging" of the
event.

For this to work, however, there must be a place of control that receives
control if an exception is thrown before the exception control transfer is 
performed. Also, there must be a way that SW can query why the event 
failed.

With these two principles, one can make ATOMIC events perform
significant work--far greater than could be performed with CAS, DCAS,
TCADS (3 compares 2 writes) and several other academic variants. 
For example:: My 66000 ISA allows for::

BOOLEAN MoveElement( Element *fr, Element *to )
{
     esmLOCK( fn = fr->next );
     esmLOCK( fp = fr->prev );
     esmLOCK( tn = to->next );
     esmLOCK( fn );
     esmLOCK( fp );
     esmLOCK( tn );
     if( !esmINTERFERENCE() )
     {
              fp->next = fn;
              fn->prev = fp;
              to->next = fr;
              tn->prev = fr;
              fr->prev = to;
     esmLOCK( fr->next = tn );
              return TRUE;
     }
     return FALSE;
}

Here an element of a concurrent data structure is moved from one 
location in a CDS to another location in that CDS with no 3rd party 
thread being able to see that the element being moved is never NOT
in the CDS. You cannot achieve this property with the CAS family of
instructions.

When there is no interference observed during the event, the event
takes exactly the same number of cycles as if none of the semantic
decorations (above) existed.

The esmLOCK "macro" simply tells the compiler to set the LOCK bit on
the associated memory reference instruction. The Lock bit on an 
inbound memory reference denotes "participation" in the event, 
the Lock bit on the outbound memory reference denotes the end
of the event.

My conditional branch instruction has the ability to ask the cache-miss 
buffers of a core if any interference has been detected.

LL/SC must
not exist. They are an architectural flaw. It should be possible
to specify a maximum number of instructions to execute and to
measure how many instructions did execute, and there shouldn't
be any hidden strangeness like a LL/SC loop running a different
number of times.

LL/SC runs different numbers of times BECAUSE there are other
threads attempting to do some kind of access on the same data
this thread is attempting. Smashing LL/SC into a single instruction
just bounds the amount of interference that can happen (but does
nothing to suppress that interference.) Your test-and-set <typical>
primitive can take a cache miss and be subject to all sorts of 
latency that allows for other 3rd party threads to touch the same 
data, and ultimately cause the T&S to fail--even when the instantaneous
state of memory at the instant of T&S decode was in a state where it
could succeed. The general rule of thumb, with a multi-core chip
implementation, is "whoever gets to the memory controller first"
will end up "getting" the lock and there is nothing a core can do about
that.


> One downside of cmp8xchg16 is that you also need *atomic*
> ld16 and st16 (especially ld16!) to work.

Those are valuable anyway.

The upside, of figuring out how to allow software to construct any and
all kinds of ATOMIC events, is that we the CPU architects and designers
get out of the game of instruction addition each generation until SW
settles on one or two uses are are so fundamental to their needs that
these "units of work" are now OK-enough to be bundled into actual
instructions. 

Albert Cahalan

unread,
Jan 4, 2023, 6:26:25 PM1/4/23
to MitchAlsup, RISC-V ISA Dev, Ruslan Nikolaev
On 1/4/23, 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
> On Wednesday, January 4, 2023 at 3:11:37 PM UTC-6 acahalan wrote:
>> On 1/4/23, Ruslan Nikolaev <nrusl...@gmail.com> wrote:

>> No, because this "completeness" prevents replay debuggers
>> from working. Simply having LL/SC is an architectural flaw.
>> The instructions should not exist at all.
>
> We (architects) are caught between a rock and a hard place::
>
> Either a) we invent and implement new ATOMIC instructions of
> increasing complexity every new generation,

I don't really think so. Compare and swap is pretty solid.
You can vary the width, or support multiple pointers, or
decide on a status to provide... but none of that changes
the operation in any deep or fundamental way.

>>> 2. CAS is worthy of discussion. To guarantee strict lock-freedom,
>>> let alone wait-freedom, we still need to have a single instruction
>>> which cannot be interrupted somewhere in the middle.
>>
>> Replay requires that all instructions are this way.
>
> Alternatives exist: while it is true that one cannot single step through
> an ATOMIC event and have any semblance of ATOMICity; it is false
> that one cannot suppress interrupts for the duration of an ATOMIC
> event (subject to timeouts and having a plan to deal with exceptions).
>
> So, for example, say you had something like LL/SC and that the
> core would suppress interrupts between LL and SC, so your single
> stepping debugger would single step up to the LL and then you single
> again and end up at the instruction following SC. Here, at least the
> ATOMICity of the event is not harmed with the "debugging" of the
> event.

Oh no, not a single-stepping debugger. The idea is to let the
processor run at full speed, pipelines all going as they should,
until a certain count of executed instructions is reached or an
interrupt happens. If an interrupt happens, then it should be
possible to note the number of instructions that had run, so
that execution can be restarted. After the desired number of
instructions is completed, the processor should be in a fully
determined state, having executed the exact instructions that
were desired.

For example, consider a replay debugger built as a hypervisor.
A guest OS is booted to a GUI, and the user interacts with it.
The hypervisor records information about interrupts and
about input from things like the mouse. It is then possible to
replay the whole thing at relatively normal speed, stopping at
any desired point in time. The sequence of instructions does
not vary during the replay. Register content, memory content,
and everything else is fully reproducible. The user can jump
to any point in time. The user can fork the history at any point
in time, continuing execution with interaction, with the only
troubles being that network connections to other systems
will be rejected by those other systems.

> For this to work, however, there must be a place of control that receives
> control if an exception is thrown before the exception control transfer is
> performed. Also, there must be a way that SW can query why the event
> failed.

When the hypervisor gains control due to the exception,
it takes note of how many instructions have completed.
It can subtract that from the desired number, then restart
the guest with the new value as the number to execute.
Maybe you wouldn't call it CAS, but otherwise it works fine.
You just need more than one pointer for the instruction, and
of course you hit the usual complaints about complexity.

Example: 6 consecutive register pairs to hold comparison values,
6 consecutive register pairs to hold replacement values, and
another 6 consecutive registers to hold pointers.

> When there is no interference observed during the event, the event
> takes exactly the same number of cycles as if none of the semantic
> decorations (above) existed.

And if there is interference...?

Taking a different number of cycles is fine. Taking a different
number of instructions is trouble. Branching, including loops,
must not be unpredictable.

> My conditional branch instruction has the ability to ask the cache-miss
> buffers of a core if any interference has been detected.

The answer to that question must be deterministic,
or it will have to be recorded and then recreated.
It is very bad to need to frequently record things
like this; the worst-case situation becomes single-step.
The performance of single-step is useless.

>> LL/SC must
>> not exist. They are an architectural flaw. It should be possible
>> to specify a maximum number of instructions to execute and to
>> measure how many instructions did execute, and there shouldn't
>> be any hidden strangeness like a LL/SC loop running a different
>> number of times.
>
> LL/SC runs different numbers of times BECAUSE there are other
> threads attempting to do some kind of access on the same data
> this thread is attempting. Smashing LL/SC into a single instruction
> just bounds the amount of interference that can happen (but does
> nothing to suppress that interference.) Your test-and-set <typical>
> primitive can take a cache miss and be subject to all sorts of
> latency that allows for other 3rd party threads to touch the same
> data, and ultimately cause the T&S to fail--even when the instantaneous
> state of memory at the instant of T&S decode was in a state where it
> could succeed. The general rule of thumb, with a multi-core chip
> implementation, is "whoever gets to the memory controller first"
> will end up "getting" the lock and there is nothing a core can do about
> that.

Unpredictable latency is fine. Unpredictable instruction count
is a problem. Failure should not be happening... and I do not
consider instruction roll-back for an interrupt to be failure.

MitchAlsup

unread,
Jan 4, 2023, 7:01:36 PM1/4/23
to RISC-V ISA Dev, acahalan, RISC-V ISA Dev, Ruslan Nikolaev, MitchAlsup
Sorry for the confusion.

For example, consider a replay debugger built as a hypervisor.
A guest OS is booted to a GUI, and the user interacts with it.
The hypervisor records information about interrupts and
about input from things like the mouse. It is then possible to
replay the whole thing at relatively normal speed, stopping at
any desired point in time. The sequence of instructions does
not vary during the replay. Register content, memory content,
and everything else is fully reproducible. The user can jump
to any point in time. The user can fork the history at any point
in time, continuing execution with interaction, with the only
troubles being that network connections to other systems
will be rejected by those other systems.

Ok, you are replaying a sequence of <possibly external> events to
a "system" such that you can do it over and over again--possibly
stopping at different "places" each time to examine "different"
things.
It is certainly IN the CAS family, but reading 3 locations, prefetching
3 more, and writing 6; if it allowed to proceed is; far above the kinds
of CASs generally implemented. This illustration shows that there IS
a way to get out of the game of inventing new ATOMIC primitives
every new core generation. 

Otherwise you start with T&S, migrate to T&T&S, add CAS, then add
DCAS, then add TCADS, and on and on and on. LL/SC is/was a first
step in trying to get out of that game.

I might also note: I added all of this with a single addition to my ISA
(not used in the above illustration).


Example: 6 consecutive register pairs to hold comparison values,
6 consecutive register pairs to hold replacement values, and
another 6 consecutive registers to hold pointers.

> When there is no interference observed during the event, the event
> takes exactly the same number of cycles as if none of the semantic
> decorations (above) existed.

And if there is interference...?

Control is transferred to a place where a) either the event is retried, or
b) SW understands control only arrives is interference was detected. 
Generally the more primitive uses (like T&S) are simply 2 instructions
back to back--so control returns to LL when interference transpires and
the whole T&S is replayed again (likely to fail.) In higher level primitives, 
control is transferred as illustrated (return FALSE) where SW knows the 
event failed (and does not have to even check if it did fail !!)

Taking a different number of cycles is fine. Taking a different
number of instructions is trouble. Branching, including loops,
must not be unpredictable.

> My conditional branch instruction has the ability to ask the cache-miss
> buffers of a core if any interference has been detected.

The answer to that question must be deterministic,
or it will have to be recorded and then recreated.

How are you going to get 3 other cores to all interfere on physical address
X, 97 clocks after core[12] performs a cache-miss LD on virtual address Y ?? 

That is: I can see setting up a single core such that a series of events
is played out in a very precise way, but I cannot see how you can perform
this across a number of cores (IPIs being as unpredictable as they are.)
Failure is meant only as the point at which HW understands that the 
ATOMIC event will fail, but this may be considerable number of
clocks until SW understands that the event DID fail.  

For example, HW understand that an even will fail when some portion
of the core's memory interface "sees" a memory transaction* that
interferes with the event. SW does not "see" that event until it asks
to see if its allowed to continue.

(*) a SNOOP or a Coherent Invalidate or possibly some other kind of
memory access at a higher priority than this core is currently operating.

Ruslan Nikolaev

unread,
Jan 4, 2023, 9:16:16 PM1/4/23
to RISC-V ISA Dev, MitchAlsup, acahalan, RISC-V ISA Dev, Ruslan Nikolaev
Thanks for your responses! I do not have any strong opinion about LL/SC itself, that can be a topic for a separate discussion (though I do prefer CAS for many reasons -- see below). I can see that LL/SC is also problematic for replay debugger but this instruction already exists (it can be deprecated in the future though). That is especially true because RISC-V already provides specialized atomics.

My larger point is that RISC-V needs to have double-width atomics. I agree that having double-width CAS is even more preferable than LL/SC, especially if we take into account wait-free algorithms, where it would be impossible to use a strong version of CAS (via LL/SC) due to potential security issues (imagine a malicious actor who continuously invalidates the same cache line in a tight loop), and unclear how to use a weak version of CAS since the bound cannot be clearly determined. So, CAS is of course preferable due to security reasons as well. Also, CAS + specialized instructions (FAA, etc) will likely provide better overall performance and scalability.

Coming back to double-width atomics, I see several solutions:
1. cmp8xchg16 + ld16 + st16. cmp8xchg16 would be implemented without returning the old value, only status. We need 3 input registers, 1 output register (status). For ld16, st16, we need 3 registers.
2. cmpxchg16b + ld16. cmpxchg16b would be implemented without returning the old value, only status. We still need 3 input registers, 1 output register (status). But we now compare two words rather than just one. st16 is unnecessary because it can be implemented through ld16+cmpxchg16b.
3. Double-width LL/SC. SC would require  3 input registers, 1 output register (status) just like in two previous approaches.
4. cmpxchg16b like in x86-64. cmpxchg16b would be returning the old value as well as the status. That would require to have 3 input registers, 3 output registers (status, old value pair). Technically, feasible to encode but have to deal with many registers. ld16 and st16 are not required since everything can be implemented through cmpxchg16b.

Again, I agree that deprecating LL/SC and seriously considering CAS (both single-width and double-width) is probably the right direction but my bigger concern is lack of ANY double-width atomics at all.

With respect to double-width CAS, number 1 is the probably the easiest but is non-standard and less compatible than number 2 (which is only slightly more complicated). Finally, number 4 is probably the best but it can always be emulated by number 2 when compatibility with x86-64/AArch64 is required, e.g., call ld16 after CAS failures.

Albert Cahalan

unread,
Jan 5, 2023, 4:18:12 AM1/5/23
to MitchAlsup, RISC-V ISA Dev, Ruslan Nikolaev
On 1/4/23, 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
> On Wednesday, January 4, 2023 at 5:26:25 PM UTC-6 acahalan wrote:
>> On 1/4/23, 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org>
>> > On Wednesday, January 4, 2023 at 3:11:37 PM UTC-6 acahalan wrote:
>> >> On 1/4/23, Ruslan Nikolaev <nrusl...@gmail.com> wrote:

> Ok, you are replaying a sequence of <possibly external> events to
> a "system" such that you can do it over and over again--possibly
> stopping at different "places" each time to examine "different"
> things.

Yes.

>> Maybe you wouldn't call it CAS, but otherwise it works fine.
>> You just need more than one pointer for the instruction, and
>> of course you hit the usual complaints about complexity.
>
> It is certainly IN the CAS family, but reading 3 locations, prefetching
> 3 more, and writing 6; if it allowed to proceed is; far above the kinds
> of CASs generally implemented. This illustration shows that there IS
> a way to get out of the game of inventing new ATOMIC primitives
> every new core generation.
>
> Otherwise you start with T&S, migrate to T&T&S, add CAS, then add
> DCAS, then add TCADS, and on and on and on. LL/SC is/was a first
> step in trying to get out of that game.

One CAS instruction could handle everything. Of course it would be
a beast. My example seems to handle your use case:

>> Example: 6 consecutive register pairs to hold comparison values,
>> 6 consecutive register pairs to hold replacement values, and
>> another 6 consecutive registers to hold pointers.
>>
>> > When there is no interference observed during the event, the event
>> > takes exactly the same number of cycles as if none of the semantic
>> > decorations (above) existed.
>>
>> And if there is interference...?
>
> Control is transferred to a place where a) either the event is retried, or
> b) SW understands control only arrives is interference was detected.
> Generally the more primitive uses (like T&S) are simply 2 instructions
> back to back--so control returns to LL when interference transpires and
> the whole T&S is replayed again (likely to fail.) In higher level
> primitives,
> control is transferred as illustrated (return FALSE) where SW knows the
> event failed (and does not have to even check if it did fail !!)

The unpredictable "Control is transferred" is the problem, along
with "SW knows the event failed".

>> Taking a different number of cycles is fine. Taking a different
>> number of instructions is trouble. Branching, including loops,
>> must not be unpredictable.
>>
>> > My conditional branch instruction has the ability to ask the cache-miss
>> >
>> > buffers of a core if any interference has been detected.
>>
>> The answer to that question must be deterministic,
>> or it will have to be recorded and then recreated.
>
> How are you going to get 3 other cores to all interfere on physical address
> X, 97 clocks after core[12] performs a cache-miss LD on virtual address Y
> ??
>
> That is: I can see setting up a single core such that a series of events
> is played out in a very precise way, but I cannot see how you can perform
> this across a number of cores (IPIs being as unpredictable as they are.)

Physical address interaction is an extremely painful problem.
Generally the solution is to avoid hitting the problem. Typically
this hurts performance.

IPIs are simple to record and replay. When the IPI arrives, take
note of the number of instructions executed prior to it. Make use
of that number later, when continuing on from the IPI and when
replaying the IPI.

MitchAlsup

unread,
Jan 5, 2023, 3:23:52 PM1/5/23
to RISC-V ISA Dev, acahalan, RISC-V ISA Dev, Ruslan Nikolaev, MitchAlsup
It is not unpredictable at all ! When you start a multi-instruction ATOMIC 
event, you record the location of the first instruction. If an exception 
(or allowed interrupt) is "taken" then the IP is reset to that location
so that when control returns to this thread, you are starting the event
at its beginning. Later, if you want, a query on whether interference has
been detected will alter the atomic_control_point to point at the label
supplied by the branch. If SW arrives at this label, it knows that the 
event has failed.

Using this means you have the property that an exception or interrupt control
transfer out of the ATOMIC event returns at a known location (start or
failure point); you do not return to the middle of an event which could 
not be considered ATOMIC at that point. So, rather than crunge up a 
way to fail the event later, you just erase the event from having happened.
{You have to have this ability anyway to make failing events do no damage
to the concurrent data structure.}


>> Taking a different number of cycles is fine. Taking a different
>> number of instructions is trouble. Branching, including loops,
>> must not be unpredictable.
>>
>> > My conditional branch instruction has the ability to ask the cache-miss
>> >
>> > buffers of a core if any interference has been detected.
>>
>> The answer to that question must be deterministic,
>> or it will have to be recorded and then recreated.
>
> How are you going to get 3 other cores to all interfere on physical address
> X, 97 clocks after core[12] performs a cache-miss LD on virtual address Y
> ??
>
> That is: I can see setting up a single core such that a series of events
> is played out in a very precise way, but I cannot see how you can perform
> this across a number of cores (IPIs being as unpredictable as they are.)

Physical address interaction is an extremely painful problem.
Generally the solution is to avoid hitting the problem. Typically
this hurts performance.

But checking if ATOMICity worked or failed REQUIRES interaction at
physical address level. That is where the interference IS.

And this is why the term "coherence point" was mentioned above.
Reply all
Reply to author
Forward
0 new messages