
Why isn't there an XCHG mem,mem instruction?


Rick C. Hodgin

Jul 1, 2018, 4:09:29 PM
In designing my ISA, I've come to realize that there should be an atomic
XCHGN instruction that allows a memory-to-memory exchange locked at the
hardware level.

mov rdi,left_value
mov rsi,right_value
lock xchg byte_count ; Exchange N bytes

Why doesn't this exist? Is it relegated to something like enforcement
through OS-level semaphores?

--
Rick C. Hodgin

MitchAlsup

Jul 1, 2018, 5:03:18 PM
The basic problem, when I looked at this in '04-'05, was that there was
no way to guarantee that more than one cache line was simultaneously in the
cache. Thus anything ATOMIC is restricted to a single cache line.

This also led to my ASF (Advanced Synchronization Facility) at AMD that
I have further refined into ESM (Exotic...Method) that allows up to 8
cache lines to participate in an ATOMIC event, with guarantees of forward
progress, and ways to avoid future interference. One of these has the
property that it converts a BigO( n**3 ) problem into a BigO( 3 ) problem
{yes, there is no n in the second BigO number}.

Say a timer goes off and the OS enables K threads. Each of these threads
accesses the front of the work queue and all see interference; on the
second pass they also all see interference, but go out and ask why they
are seeing interference. This time they all get back a unique number, and
can use this unique number to access the work queue without interference.

This problem is intertwined with the cache coherence protocol, bus protocol,
and bisection bandwidth of the "bus". So it cannot be managed by the processor
at a fundamental level. Besides, does any GBOoO processor want to sit in a
loop asking "Am I done yet?"

Quadibloc

Jul 1, 2018, 9:54:58 PM
On Sunday, July 1, 2018 at 3:03:18 PM UTC-6, MitchAlsup wrote:

> The basic problem, when I looked at this in '04-'05, was that there was
> no way to guarantee that more than one cache line was simultaneously in the
> cache. Thus anything ATOMIC is restricted to a single cache line.

I guess there are two questions to ask here, which basically boil down to:

- Is it worth doing, and
- Can it be done?

For the first: an atomic memory to memory exchange might be nice to have. But is
it needed for any common use case? Presumably, any atomic instruction is aimed
at doing something like what "Test and Set" did for the System/360 - allowing
multiple processes to communicate in a useful way to share access to some single
resource.

For the second: a naive person might not see any problem; the cache is pretty
big, so why should it be hard to ensure that two little memory locations are
both in it?

A less naive person might realize the issue that's probably the concern: there
are tasks in process that need various things that have been brought into cache
for them _in a certain order_, and so one might not be able to toss things out
of the cache to make room for something new while a certain old thing is still
there.

That, in itself, shouldn't be insuperable, though. The instruction that needs two
things just marks the one that's already in cache as to be kept there, and then
waits until there's room to bring the second one in...

Oops. What if there are a *whole bunch* of these instructions in execution? Say
there are 5,000 locations in cache, and 5,000 such instructions have been
executed by various processes... each one locking one cache location, thus
leaving none to be freed up to let any of those instructions complete.

Given that we want this instruction to be non-privileged, I should think the
answer is obvious. Instead of actually locking a cache location, we just give it
a higher priority, so that if congestion does occur, a suitable process is
invoked to clear things up in a relatively intelligent manner.

Perhaps this even has to do with why the idea of "love threading" has not set
the world on fire: people want their computers to actually work, even if some of
the programs on them have been written by programmers who were either stupid or
malicious, as in real life those eventualities can't be reliably avoided.

John Savard

MitchAlsup

Jul 1, 2018, 10:26:47 PM
I came to the conclusion that::
a) one wanted to be able to denote several cache lines as participating in
an ATOMIC event
b) if and when the cache lines did become present, one needed to delay
snoops upon them so that the ATOMIC event could transpire.

So there are multiple events HW can monitor to decide if an ATOMIC event
takes place or not. In the case where nothing happens, the ATOMIC event
happens. In the case where interference happens, the ATOMIC event fails
to a knowable control point.

The My 66000 ISA defines all of this crap.

Rick C. Hodgin

Jul 2, 2018, 8:31:57 AM
On Sunday, July 1, 2018 at 5:03:18 PM UTC-4, MitchAlsup wrote:
>
> The basic problem, when I looked at this in '04-'05, was that there was
> no way to guarantee that more than one cache line was simultaneously in the
> cache. Thus anything ATOMIC is restricted to a single cache line.

Wouldn't this be overcome by invalidating the cache line(s) associated
with the addresses involved, and then issuing a read/write directly
from memory?

It would be slower, but I think having atomic operations at the ISA
level in this case is of greater significance.

> This also led to my ASF (Advanced Synchronization Facility) at AMD that
> I have further refined into ESM (Exotic...Method) that allows up to 8
> cache lines to participate in an ATOMIC event, with guarantees of forward
> progress, and ways to avoid future interference. One of these has the
> property that it converts a BigO( n**3 ) problem into a BigO( 3 ) problem
> {yes, there is no n in the second BigO number}.
>
> Say a timer goes off and the OS enables K threads, each of these threads
> accesses the front of the work queue and all see interference, on the
> second pass they also all see interference, but go out and ask why they
> are seeing interference. This time they all get back a unique number, and
> can use this unique number to access the work queue without interference.
>
> This problem is intertwined with the cache coherence protocol, bus protocol,
> and bisection bandwidth of the "bus". So it cannot be managed by the processor
> at a fundamental level. Besides, does any GBOoO processor want to sit in a
> loop asking "Am I done yet?"

What's GBOoO stand for? I've seen the term used, but do not know
the GB part.

I also don't think it's inappropriate to make the CPU wait when a
need exists in software to perform that task / require that function.
I view the CPU as the servant to all code it runs, having everything
it does dictated by that code, ultimately in that order, and without
distortion or corruption of the dictated intent.

--
Rick C. Hodgin

MitchAlsup

Jul 2, 2018, 9:37:24 AM
On Monday, July 2, 2018 at 7:31:57 AM UTC-5, Rick C. Hodgin wrote:
> On Sunday, July 1, 2018 at 5:03:18 PM UTC-4, MitchAlsup wrote:
> >
> > The basic problem, when I looked at this in '04-'05, was that there was
> > no way to guarantee that more than one cache line was simultaneously in the
> > cache. Thus anything ATOMIC is restricted to a single cache line.
>
> Wouldn't this be overcome by invalidating the cache line(s) associated
> with the addresses involved, and then issuing a read/write directly
> from memory?

The cycle after a cache line arrives, a new request that needs to invalidate
the same cache line can arrive. So, no, your scheme fails mightily.
>
> It would be slower, but I think having atomic operations at the ISA
> level in this case is of greater significance.
>
> > This problem is intertwined with the cache coherence protocol, bus protocol,
> > and bisection bandwidth of the "bus". So it cannot be managed by the processor
> > at a fundamental level. Besides, does any GBOoO processor want to sit in a
> > loop asking "Am I done yet?"
>
> What's GBOoO stand for? I've seen the term used, but do not know
> the GB part.

Great Big.
>
> I also don't think it's inappropriate to make the CPU wait when a
> need exists in software to perform that task / require that function.

In anything ATOMIC, getting it right is vastly more important than getting
it fast.

Rick C. Hodgin

Jul 2, 2018, 10:03:48 AM
On Monday, July 2, 2018 at 9:37:24 AM UTC-4, MitchAlsup wrote:
> On Monday, July 2, 2018 at 7:31:57 AM UTC-5, Rick C. Hodgin wrote:
> >
> > Wouldn't this be overcome by invalidating the cache line(s) associated
> > with the addresses involved, and then issuing a read/write directly
> > from memory?
>
> The cycle after a cache line arrives, a new request that needs to invalidate
> the same cache line can arrive. So, no, your scheme fails mightily.

It sounds to me more like a bug in the caching system design. Any
self-respecting caching system should monitor requests to invalidate
cache lines and do a post-load check to see if the line just loaded
now also needs to be invalidated.

I think the "mighty failure" would exist on existing architectures,
and not new ones being created with this need in mind.

--
Rick C. Hodgin

Stephen Fuld

Jul 2, 2018, 2:50:29 PM
On 7/1/2018 6:54 PM, Quadibloc wrote:
> On Sunday, July 1, 2018 at 3:03:18 PM UTC-6, MitchAlsup wrote:
>
>> The basic problem, when I looked at this in '04-'05, was that there was
>> no way to guarantee that more than one cache line was simultaneously in the
>> cache. Thus anything ATOMIC is restricted to a single cache line.
>
> I guess there are two questions to ask here, which basically boil down to:
>
> - Is it worth doing, and
> - Can it be done?
>
> For the first: an atomic memory to memory exchange might be nice to have. But is
> it needed for any common use case? Presumably, any atomic instruction is aimed
> at doing something like what "Test and Set" did for the System/360 - allowing
> multiple processes to communicate in a useful way to share access to some single
> resource.
>
> For the second: a naive person might not see any problem; the cache is pretty
> big, so why should it be hard to ensure that two little memory locations are
> both in it?


Not just two. It isn't clear what the limit on the byte count is, and
no alignment restrictions are mentioned, so you could need at least
four, and perhaps more. Now think about what is required if you have to
back out a partially completed instruction due to an error. :-(

And remember, with a large byte count, the instruction could take a long
time to execute, especially with potentially several cache misses, but
you want the instruction to be uninterruptible, so interrupts would be
locked out for a long time. This could be a real mess.



--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Rick C. Hodgin

Jul 2, 2018, 2:55:00 PM
On Monday, July 2, 2018 at 2:50:29 PM UTC-4, Stephen Fuld wrote:
> It isn't clear what the limit on the byte count is, and no alignment
> restrictions are mentioned...

It could be limited to a fixed size of up to 64 bytes per atomic
operation. I would say that such a feature could be paragraph-
aligned.

--
Rick C. Hodgin

MitchAlsup

Jul 2, 2018, 7:32:46 PM
Once you figure out (like I did) how to make soft guarantees on the
number of cache lines participating in an ATOMIC event, there is no
rationale left to restrict the number to 2! In my case, I set the
number equal to the number of outstanding memory requests the CPUs
of that era would support (8).

Now with 5 cache lines participating in an ATOMIC event, you can do
some pretty large amount of work in a single event--like moving an
element on a doubly linked list from one place in a concurrent data
structure to another place with no possibility that any interested
third party could ever see the CDS without the element being a part
of it! Doing this amount of work decreases the number of ATOMIC events
and thus reduces (quadratically) the amount of ATOMIC interference.
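
To make that concrete, here is a plain-C sketch (an illustration added
here, not ESM code -- plain C provides no atomicity) of the five nodes,
and hence up to five cache lines, that a doubly-linked-list move must
touch as one indivisible event:

struct node { struct node *prev, *next; };

/* Moving n to sit after dest touches five nodes -- n, its two old
 * neighbors, dest, and dest's old successor.  ESM would cover every
 * line touched below in a single ATOMIC event; here the six stores
 * are ordinary and an observer could see a half-moved list.       */
static void move_after(struct node *n, struct node *dest)
{
    n->prev->next = n->next;      /* unlink n from its old place */
    n->next->prev = n->prev;
    n->next = dest->next;         /* splice n in after dest      */
    n->prev = dest;
    dest->next->prev = n;
    dest->next = n;
}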

Rick C. Hodgin

Jul 2, 2018, 8:11:13 PM
That right there's the exact reason I see this need in hardware.

I also think I have a solution that is cache insensitive.

--
Rick C. Hodgin

already...@yahoo.com

Jul 3, 2018, 5:18:29 AM
Sounds like HTM (Hardware Transactional Memory).
I don't like this stuff. Troubles-to-utility ratio is no good, IMHO.

Rick C. Hodgin

Jul 3, 2018, 9:08:11 AM
On 7/3/2018 5:18 AM, already...@yahoo.com wrote:
> Sounds like HTM (Hardware Transactional Memory).
> I don't like this stuff. Troubles-to-utility ratio is no good, IMHO.


I didn't know Hardware Transactional Memory existed. After reading
up on what it is and does, it is exactly what I am looking for. And
I reiterate that it is very good for me to come across things in my
own personal development that have already been identified by expert
designers who realize the same things I see and have implemented them.
It makes me think I might actually be on the right path to getting a
proper understanding of hardware's needs.

--
Rick C. Hodgin

MitchAlsup

Jul 3, 2018, 11:33:13 AM
It is not HTM, but can be used to implement STM. It is simply multiple
simultaneous locks. Multiple in that there are a small number greater
than 1. Simultaneous in that you get them all or get them none, avoiding
several classical multiple locking problems.

I am not sure anything that needs ATOMICity can have a low trouble ratio.
I am sure ESM has a high utility ratio, though.

Chris M. Thomasson

Jul 3, 2018, 5:25:16 PM
Watch out for livelock scenarios in any TM setup.

Bruce Hoult

Jul 3, 2018, 5:47:16 PM
On Monday, July 2, 2018 at 7:03:18 AM UTC+10, MitchAlsup wrote:
>
> The basic problem, when I looked at this in '04-'05, was that there was
> no way to guarantee that more than one cache line was simultaneously in the
> cache. Thus anything ATOMIC is restricted to a single cache line.

In RISC-V land we have LL/SC which does things in the CPU, but we also have a small set of Atomic Memory Operations, including swap, add, min/max, and/or/xor. The memory location is updated using the operation and the supplied data and the original contents are returned as the result of the instruction.

At least in the SiFive (and Berkeley) implementations, the "TileLink" interconnect (both internally, and off-chip) has message types for all those AMOs, which means they can be performed directly at whatever location the data lives -- perhaps in the local cache, perhaps in the cache of another CPU, perhaps directly in a peripheral, or even (for example in the case of the HiFive Unleashed with a VC707 FPGA board or Microsemi expansion board) on a different board attached using the FMC (FPGA Mezzanine Connector) interface.

That's mechanism. Of course, as a matter of policy, if you're frequently updating something in another processor's cache (and it isn't) then you might want to take ownership of that cache line. The hope is that this will scale to a large number of processors better than something that requires cache eviction and ownership transfer every time.

That doesn't solve the "take ownership of five cache lines and atomically transfer an object from one linked list to another" problem, but neither does LL/SC.
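
As a concrete sketch (assuming rv64 with the A extension), the standard
GCC/Clang __atomic builtins lower to exactly these AMOs, with the old
memory contents coming back as the result:

#include <stdint.h>

/* Each of these compiles to a single AMO instruction on RV64+A:
 * amoswap.d, amoadd.d, amoor.d.  The prior contents of *p are
 * returned, as described above.                                */
uint64_t amo_swap(uint64_t *p, uint64_t v)
{ return __atomic_exchange_n(p, v, __ATOMIC_SEQ_CST); }

uint64_t amo_add(uint64_t *p, uint64_t v)
{ return __atomic_fetch_add(p, v, __ATOMIC_SEQ_CST); }

uint64_t amo_or(uint64_t *p, uint64_t v)
{ return __atomic_fetch_or(p, v, __ATOMIC_SEQ_CST); }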

Chris M. Thomasson

Jul 3, 2018, 6:02:08 PM
On 7/1/2018 1:09 PM, Rick C. Hodgin wrote:
> In designing my ISA, I've come to realize that there should be an atomic
> XCHGN instruction that allows a memory-to-memory exchange locked at the
> hardware level.
>
> mov rdi,left_value
> mov rsi,right_value
> lock xchg byte_count ; Exchange N bytes

byte_count would have to be a multiple of a cache line, right? The
locations for the exchange would have to be properly padded and cache
line aligned, right?

Fwiw, wrt extending XCHG, Joe Seigh wanted a double-width version of a
CAS on the SPARC that acts like CMPXCHG8B on a 32-bit x86 system.

https://groups.google.com/d/topic/comp.arch/Bx-OAJTjU-U/discussion
(read all)

Andy Glew is in that thread. Nice. :^)

A relevant response:

https://groups.google.com/d/msg/comp.arch/Bx-OAJTjU-U/LD1Dosd7YF0J


> Why doesn't this exist? Is it relegated to something like enforcement
> through OS-level semaphores?

OS-level semaphores? Not exactly sure what you mean; XCHG is atomic
through the hardware on Intel. Mitch's ESM is very useful for
implementing exotic, largely academic, synchronization primitives. DCAS
comes to mind. Or even KCSS. Atomic updates to a doubly linked list...
Stuff like that. :^)

MitchAlsup

Jul 3, 2018, 6:11:11 PM
In many ways ESM is a pipelined version of LL/SC where each LL identifies a
cache line that is participating in an ATOMIC event, and the final and only
SC ends the event.

Where ESM differs is that most of the checking occurs at the address level,
not at the data level. And there is a branch instruction that can query
whether any interference has taken place (or not) and, if any has, some
metric of how heavy the interference is.

But ESM requires a certain programming style. We have a multiplicity of
instructions performing a single ATOMIC event. The instructions are not to
be used independently (random code generators emitting ESM instructions
will likely fail continuously). The style is straightforward and expressive.
There are certain non-von Neumann effects which can take place visibly.
One cannot single step through an ESM ATOMIC event. Any traps or exceptions
deliver control to the "control point" of the ATOMIC event before taking the
interrupt or exception.

And finally, there is a mechanism by which a programmer can directly avoid
future interference! This changes one BigO( n**3 ) into a BigO( 3 ).
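
Sketched in C, the style has roughly this shape. Every esm_* name below
is an invented placeholder (stubbed so the skeleton compiles), NOT a
real My 66000 instruction or intrinsic:

/* HYPOTHETICAL placeholders for the ESM primitives described above. */
static int  esm_begin(void)        { return 0; }  /* placeholder */
static int  esm_interference(void) { return 0; }  /* placeholder */
static void esm_commit(void)       { }            /* placeholder */

static void atomic_event(void (*updates)(void), void (*get_slot)(void))
{
    for (;;) {
        esm_begin();              /* name the participating lines    */
        if (esm_interference()) { /* the query-branch described above */
            get_slot();           /* e.g. obtain the unique number   */
            continue;             /* retry from the control point    */
        }
        updates();                /* no 3rd party sees partial state */
        esm_commit();             /* whole event becomes visible     */
        return;
    }
}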

Terje Mathisen

Jul 4, 2018, 3:10:25 AM
MitchAlsup wrote:
> But ESM requires a certain programming style. We have a multiplicity
> of instructions performing a single ATOMIC event. The instructions
> are not to be used independently (random code generators emitting ESM
> instructions will likely fail continuously). The style is
> straightforward and expressive. There are certain non-von Neumann
> effects which can take place visibly. One cannot single step through
> an ESM ATOMIC event. Any traps or exceptions deliver control to the
> "control point" of the ATOMIC event before taking the interrupt or
> exception.
>
> And finally, there is a mechanism by which a programmer can directly
> avoid future interference! This changes one BigO( n**3 ) into a BigO(
> 3 ).

This is the big one.

Back when AHUS, the new Oslo-area hospital, was about to officially open
but already had lots of patients, I was asked to trouble-shoot a
life-or-death (literally) bug:

The wireless system that was supposed to gather the surgical crash team
had crashed. :-(

I figured out that this was load-dependent, and as the number of hand
terminals increased, the system would never be able to recover from a
central server power outage because the load of all the terminals trying
to reconnect simultaneously would overload the system and even cause
timeouts and resets among those that had already connected.

The workaround which we implemented over a week or so while I visited
one of the involved vendors (in Boulder, Colorado) was to add one new
message to the (re-)connect protocol: If the central system already had
more than N terminals in the middle of the login/connect process (an
operation that took 10 seconds, but that's another story), then any new
request would cause it to calculate when a free time slot would be
available and simply tell the terminal to "come back and retry in X
seconds". At that point enough resources were guaranteed to be available
so no terminal would ever get more than one of these wait messages.

This simple fix solved all the stability problems and afaik there haven't
been any similar issues in the almost 10 years since then.
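
In skeleton form the admission check was roughly this (a reconstruction
for illustration; the real constants and bookkeeping differed):

#define MAX_LOGINS 16   /* the "N" above; actual value not shown here */
#define LOGIN_SECS 10   /* a connect took about 10 seconds            */

static unsigned in_flight;  /* logins currently being serviced */
static unsigned promised;   /* wait slots already handed out   */

/* Returns 0 = log in now, else "come back and retry in X seconds",
 * with X chosen so resources are guaranteed free at that time --
 * hence no terminal ever receives more than one wait message.    */
static unsigned admit(void)
{
    if (in_flight < MAX_LOGINS) {
        in_flight++;
        return 0;
    }
    return ++promised * LOGIN_SECS / MAX_LOGINS;
}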

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Chris M. Thomasson

Jul 5, 2018, 2:45:21 AM
The worst is when a read invalidates, or interferes with, another
concurrent read. I have seen setups like that before. It basically kind
of destroys the ability to use it for read-mostly data structures. Think
RCU... ;^)

Chris M. Thomasson

Jul 5, 2018, 2:48:47 AM
The nice thing is that RCU is fairly adaptable. One can use it for
readers, and in conjunction with something else for writers. Heck, RCU
for readers and locks for writers can scale when the write load is
scarce, fairly low...
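
A minimal C11 sketch of that split -- lock-free readers, a mutex for
writers; the deferred reclamation that makes it real RCU is omitted:

#include <stdatomic.h>
#include <stdlib.h>
#include <threads.h>

struct cfg { int value; };

static _Atomic(struct cfg *) current;  /* readers load this         */
static mtx_t wlock;  /* writers serialize; mtx_init once at startup */

/* Reader side: one acquire load, no lock taken, never blocks. */
int read_value(void)
{
    return atomic_load_explicit(&current, memory_order_acquire)->value;
}

/* Writer side: copy, publish, defer reclamation. */
void write_value(int v)
{
    struct cfg *fresh = malloc(sizeof *fresh);
    fresh->value = v;
    mtx_lock(&wlock);
    struct cfg *old = atomic_exchange_explicit(&current, fresh,
                                               memory_order_acq_rel);
    mtx_unlock(&wlock);
    (void)old;  /* free(old) only after all readers have quiesced */
}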

Rick C. Hodgin

Jul 5, 2018, 4:26:52 PM
I think an extension of this should include the ability to execute a
series of transactions, each one atomically, one at a time, with
normal interrupt and other processing between, and where each one is
governed by entries in a repeating structure:

qd left_value
qd right_value
qd byte_count

And then in use:

mov rsi,offset transaction_table
xor rcx,rcx ; Start at the beginning
lock xchgn rdx ; Number of transactions in table
; cy? set if transaction did not complete successfully
; rcx cumulative byte where the failure occurred
; rdx pass-thru total transaction count
; rax a code indicating the reason it failed

This would allow a complex data manipulation to be carried out, while
guaranteeing each portion was done atomically, with an overall test
for success or failure being given.

By incorporating the start byte, it would allow a retry, or an
unwinding on failure.

-----
I think these hardware-guaranteed abilities to move data around in
atomic / transactional forms is an absolute necessity in a CPU.

--
Rick C. Hodgin

MitchAlsup

Jul 5, 2018, 5:19:36 PM
ATOMICally means that no interested 3rd party can see any intermediate
state. I see nothing here that prevents DMA devices from reading partially
updated and partially stale data; or from writing on top of your
seemingly ATOMIC event.

Nor do I see any means by which other CPU <threads> are prevented from accessing
intermediate state. Here, the problem is that two CPUs might be performing
an identical ATOMIC event, one a single "bus cycle" ahead of a second CPU.
One CPU writes his first cache line and proceeds to the second, just as the
second CPU reads and writes the first. Both CPUs think they are performing a
single ATOMIC event, but only the second CPU succeeds, leaving the first CPU
to think he performed an ATOMIC event, but failed miserably.
>
> By incorporating the start byte, it would allow a retry, or an
> unwinding on failure.

You CANNOT restart an ATOMIC event in the middle !!!!!

Rick C. Hodgin

Jul 5, 2018, 5:33:45 PM
The DMA controller would have to be modified to receive a base address
and range that it could not write to during the atomic event. If it
seeks to read/write into that address space, it would halt until the
atomic operation was completed.

The same would be true with all access given by other cores.

> Nor do I see any means by which other CPU <threads> are prevented from accessing
> intermediate state. Here, the problem is that two CPUs might be performing
> an identical ATOMIC event, one a single "bus cycle" ahead of a second CPU.
> One CPU writes his first cache line and proceeds to the second, just as the
> second CPU reads and writes the first. Both CPUs think they are performing a
> single ATOMIC event, but only the second CPU succeeds, leaving the first CPU
> to think he performed an ATOMIC event, but failed miserably.

When you prepare to enter this LOCK XCHG/N state, issue a request to
lock down the memory range on the other cores and any other device on
the system that could read/write to that range. Every component on a
system would receive the request at some point in time and respond
with its own immediate yea/nay signal.

The core issuing the request waits for the required N cycles to take
place for the furthermost device to acknowledge. It then either
conducts its workload and releases afterward, or it signals the failure
in locking due to some other device having requested atomic access.

>> By incorporating the start byte, it would allow a retry, or an
>> unwinding on failure.
>
> You CANNOT restart an ATOMIC event in the middle !!!!!

Each entry in the structure would be completed atomically, and by
having a packet, you would receive a final success / failure on the
exchange.

In using this protocol, there should be no reason why anything
would fail in the middle. It would only fail when attempting one
of the structure entries, and since that entry failed as an atomic
unit, it could be restarted once the release was given by the other
core or device.

>> -----
>> I think these hardware-guaranteed abilities to move data around in
>> atomic / transactional forms is an absolute necessity in a CPU.

The whole purpose of having this atomic ability in hardware is to
provide hardware mechanisms which can guarantee atomicity of the
data move, Mitch. As the designer, you would engineer the devices
on that system to meet that need.

-----
Mitch, I've observed in several of your replies that the thing I've
designed wouldn't work because this device or that unit won't allow
it.

I think you've spent too many years working with AMD and x86/AMD64
and needing to support legacy hardware and legacy decisions made on
things. I'm not there at that place. I have a clean slate. I am
not trying to make a compatible 80386-derived chip for these new
abilities. I am working on Arxoda, which is a 40-bit 80386-like
CPU with some ARM-like features as well. I will support a true
32-bit and 40-bit 80386-mode which uses the literal ISA we see today,
but these new features won't be there.

I think you need to step outside of the rigid framework you seem to
think in today, and expand your thinking to re-engineering hardware
that today might make the idea fail. By re-engineering that
hardware, it now accommodates the need and succeeds.

--
Rick C. Hodgin

Rick C. Hodgin

Jul 5, 2018, 9:05:00 PM
On 7/5/2018 5:19 PM, MitchAlsup wrote:
> Nor do I see any means by which other CPU <threads> are prevented from accessing
> intermediate state. Here, the problem is that two CPUs might be performing
> an identical ATOMIC event, one a single "bus cycle" ahead of a second CPU.

I look at the design of a multi-core CPU operating at multi-GHz clock
speeds as more or less an issue of temporal mechanics. Each core is
operating within its frame of reference and knowledge, but as a designer
of each core I recognize that information necessary to maintain certain
types of synchronous events across cores exists N cycles away. Each
core cannot issue a request and continue as though the answer will be
yes. It must issue the request and wait until it receives acknowledgment
from remote resources.

In this way, everything is resolved before processing begins in the case
of sequential atomic operations.

This differs from normal processing on a core, where speculative
execution can proceed with the assumption things will be okay when the
final notifications come in and processing is either accepted or then
rejected.

Each core's frame of reference acknowledges it is in a machine that is
N cycles away from knowing the things around it. There are some things
it can assume and process, and some things it cannot.

When it comes to atomic operations, the need for them being truly atomic
operations mandate that processing does not begin until acknowledgement
is received from all resources, and each one gets the go signal. In this
way, even across resources that are N cycles behind the current core, each
will see things sequentially relative to their point of dispatch in each
of their "temporal frames."

This pause allows a remote resource to flush any information it has in the
cache so the operation proceeds with the most current data, locked down,
and able to be reliable.

> One CPU writes his first cache line and proceeds to the second, just as the
> second CPU reads and writes the first. Both CPUs think they are performing a
> single ATOMIC event, but only the second CPU succeeds, leaving the first CPU
> to think he performed an ATOMIC event, but failed miserably.

It would fail in that design. It's why that design must be supplanted
with a design that accommodates the hard needs of atomic operations across
these types of "temporal cores" where information relative to their needs is
not immediately available.

>> By incorporating the start byte, it would allow a retry, or an
>> unwinding on failure.
>
> You CANNOT restart an ATOMIC event in the middle !!!!!

What would you call it? It's a sequence of atomic operations, none of
which should ever fail, but some of which might be delayed, and there
would be a potential benefit to signaling the application that a delay
larger than some determined value exists, and the application can be
signaled so it can decide what to do when the resource is not immediately
available. In other cases where it just needs to take place at some
point, then the pauses on each core to attain a full lock on the
resources mean it will be slower, but still fulfill the needs of the app.

And in the case of these multiple single atomic operations executed in
succession, they also would never fail unless there is some need for them
to be executed sequentially without delays. And I'm still considering a
possible way to implement that design, so that the full multi-transaction
can be done atomically through a protocol that indicates how many
transactions are in flight, which locks down resources for that duration,
taking a hit on all cores' performance, but ensuring fully atomic
operation when needed.

As I say, I think this is important. I think it's important enough to
add special circuitry to the CPU, bus, cache protocols, and to place
special demands on external hardware to enforce it.

--
Rick C. Hodgin

George Neuner

Jul 5, 2018, 11:56:40 PM
I may have misunderstood completely ...

... but my impression is that Rick is thinking more in terms of
concurrent transactions [though not about conventional HTM per se].
The transactions would be atomic in the sense that if they fail no
state is changed, and might be interfering if more than one is
executed simultaneously.

My sense is that the terminology each of you is using may be leading
the other astray.

YMMV,
George

Chris M. Thomasson

Jul 6, 2018, 4:26:57 PM
I did something roughly kind of like this using pure software and mutexes
for fun, around 13 years ago.

We have a large static table of mutexes

The XCHG works on n places.

We hash the addresses of these n places into the table of locks and
build another table that gets sorted on the index.

We take all the locks in the sorted table in order.

Now, we have full access for each address in the XCHG.

We mutate at will.

Then we unlock the sorted table in reverse order. There is no chance of
deadlock because the sorting takes care of it.

Here is an experimental first version:

https://groups.google.com/d/topic/comp.programming.threads/4MARuvCIRMQ/discussion

It is basically a crude lock based STM.

It is not fast at all, but can work...
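
In compact C11 form the scheme looks something like this (table size,
hash, and the n <= 64 bound are arbitrary choices for the sketch):

#include <stdint.h>
#include <stdlib.h>
#include <threads.h>

#define NLOCKS 256                /* large static table of mutexes  */
static mtx_t lock_table[NLOCKS];  /* mtx_init each one at startup   */

static size_t lock_index(void *p) /* hash an address into the table */
{ return ((uintptr_t)p >> 4) % NLOCKS; }

static int cmp_idx(const void *a, const void *b)
{
    size_t x = *(const size_t *)a, y = *(const size_t *)b;
    return (x > y) - (x < y);
}

/* n-place exchange: hash every address, sort the lock indices, lock
 * in ascending order (skipping duplicates), mutate at will, unlock
 * in reverse order.  The sort is what makes deadlock impossible.   */
void xchg_n(uintptr_t *a[], uintptr_t *b[], size_t n)
{
    size_t idx[128], m = 0;              /* assumes n <= 64 here */
    for (size_t i = 0; i < n; i++) {
        idx[m++] = lock_index(a[i]);
        idx[m++] = lock_index(b[i]);
    }
    qsort(idx, m, sizeof idx[0], cmp_idx);
    for (size_t i = 0; i < m; i++)
        if (i == 0 || idx[i] != idx[i - 1])
            mtx_lock(&lock_table[idx[i]]);
    for (size_t i = 0; i < n; i++) {     /* full access: mutate  */
        uintptr_t t = *a[i]; *a[i] = *b[i]; *b[i] = t;
    }
    for (size_t i = m; i-- > 0; )
        if (i == 0 || idx[i] != idx[i - 1])
            mtx_unlock(&lock_table[idx[i]]);
}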

MitchAlsup

Jul 6, 2018, 4:39:40 PM
On Friday, July 6, 2018 at 3:26:57 PM UTC-5, Chris M. Thomasson wrote:
> It is basically a crude lock based STM.
>
> It is not fast at all, but can work...

When it comes to ATOMIC stuff, it is always better to be correct
than fast.

Chris M. Thomasson

Jul 6, 2018, 5:12:08 PM
Agreed. I know that the little method above works wrt the static mutexes,
for they can never be destroyed and are completely separate from the
addresses they protect. Sorting the mutex list before locking them
prevents deadlock. It is a simple method for pure lock-based STM. So be
it. Wrt speed, perhaps:

1. Get it right! Forget about speed.

2. Get it peer reviewed... Forget about speed.

3. Correct any errors and get it right. Forget about speed.

4. Now, we can start to actually think about how to make it faster while
keeping all of the concrete correctness factors in steps 1, 2 and 3 that
make it work in the first place.



Perform steps 1, 2, 3, 4A

4A is an idea to make things faster. So, we have to repeat steps 1, 2,
and 3 wrt 4A. Then, we can move onto 4B, which should be a little faster
than 4A. Nice pattern.


For instance, a possible very simple and small improvement would be
to turn the static table of mutexes into a table of read/write locks.
So, a sorted mutex table can be for read or write. It should help out
the reads. Get it right, then move onto the next improvement... Get that
right, then move along wrt the improvement process; repeat... ;^)

Side-Note:

Actually, a certain class of reads should be able to observe data that
is out-of-date without affecting anything else. They should get around
having to use any mutex in the static table. Think of the readers in a
RCU scheme. They go full speed ahead, and do not care if they hit
out-of-date items.

Of course, all of those improvements would come after getting a 100%
working model up, as slow as it may be.


Sound a little kosher, Mitch?

MitchAlsup

Jul 6, 2018, 9:48:41 PM
Sure.

> and are completely separate from the
> addresses they protect.

This is one of the reasons ASF and ESM perform mutexing on the address
and not on the bit pattern of the data (avoiding the ABA problem amongst
a myriad of others.)

Chris M. Thomasson

Jul 7, 2018, 3:10:57 AM
Thanks.

>
>> and are completely separate from the
>> addresses they protect.
>
> This is one of the reasons ASF and ESM perform mutexing on the address
> and not on the bit pattern of the data (avoiding the ABA problem amongst
> a myriad of others.)
>

Excellent point wrt ABA. Humm... For some reason I am thinking about the
PLO operation on the IBM z-arch. For instance, take Appendix A-50 into
account:

http://www-01.ibm.com/support/docview.wss?uid=isg26480faec85f44e2385256d5200627dee&aid=1


Chris M. Thomasson

Jul 10, 2018, 3:29:47 AM
On 7/1/2018 1:09 PM, Rick C. Hodgin wrote:
> In designing my ISA, I've come to realize that there should be an atomic
> XCHGN instruction that allows a memory-to-memory exchange locked at the
> hardware level.
>
> mov rdi,left_value
> mov rsi,right_value
> lock xchg byte_count ; Exchange N bytes
>
> Why doesn't this exist? Is it relegated to something like enforcement
> through OS-level semaphores?
>

This just might be a bit relevant wrt extending the size of CAS to a DWCAS,
where DW stands for double-width. So, two adjacent words are atomic!

https://forum.pellesc.de/index.php?topic=7311.0
(read all if interested...)

I found a problem in Pelles C wrt their C11 impl of DWCAS.
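
Fwiw, the portable C11 spelling of a DWCAS is an _Atomic struct of two
words; whether a given compiler makes it genuinely lock-free
(cmpxchg16b / cmpxchg8b) or quietly falls back to a lock is exactly the
kind of thing worth checking:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Two adjacent words treated as one atomic unit. */
typedef struct { uintptr_t lo, hi; } dw_t;

static _Atomic dw_t pair;

/* Compare-and-swap over the whole double-width value. */
bool dwcas(dw_t expected, dw_t desired)
{ return atomic_compare_exchange_strong(&pair, &expected, desired); }

/* False here means the implementation is using a hidden lock. */
bool dwcas_is_lock_free(void)
{ return atomic_is_lock_free(&pair); }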

Chris M. Thomasson

Jul 18, 2018, 3:36:23 AM
Fwiw, I remember breaking up a 32-bit word into 32 separate locks. The
experiment could take all 32 bits, or locks, in an atomic operation;
after that, everything was locked.

It was a multi-level hash-based distributor. Iirc, it would hash
pointers into a word in a static memory table that held the static
custom mutexes, then hash again into one of the bits comprising said
word, or the selected mutex word if you will.

I should implement this again from scratch for fun.

I am pretty sure I mentioned it some years back in comp.programming.threads.
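
A from-scratch sketch might look like this in C11 (hash constants and
table size are arbitrary here):

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define NWORDS 64
static _Atomic uint32_t words[NWORDS];  /* 64 x 32 = 2048 bit-locks */

/* Two-level hash: pointer -> word, then -> one bit of that word. */
static void addr_to_bit(const void *p, size_t *w, uint32_t *bit)
{
    uintptr_t h = (uintptr_t)p * 0x9E3779B97F4A7C15u;  /* a mixer */
    *w   = (size_t)(h >> 32) % NWORDS;
    *bit = 1u << (h & 31);
}

static void bit_lock(const void *p)
{
    size_t w; uint32_t b;
    addr_to_bit(p, &w, &b);
    while (atomic_fetch_or(&words[w], b) & b)
        ;  /* bit already held: spin (real code would back off) */
}

static void bit_unlock(const void *p)
{
    size_t w; uint32_t b;
    addr_to_bit(p, &w, &b);
    atomic_fetch_and(&words[w], ~b);
}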

lk...@lkcl.net

Oct 30, 2018, 11:22:18 PM
On Monday, July 2, 2018 at 3:26:47 AM UTC+1, MitchAlsup wrote:

> > answer is obvious. Instead of actually locking a cache location, we just give it
> > a higher priority, so that if congestion does occur, a suitable process is
> > invoked to clear things up in a relatively intelligent manner.
>
> I came to the conclusion that::
> a) one wanted to be able to denote several cache lines as participating in
> an ATOMIC event
> b) if and when the cache lines did become present, one needed to
> delay snoops upon them so that the ATOMIC event could transpire.
>
> So there are multiple events HW can monitor to decide if an
> ATOMIC event takes place or not. In the case where nothing happens,
> the ATOMIC event happens. In the case where interference happens,
> the ATOMIC event fails to a knowable control point.
>
> The My 66000 ISA defines all of this crap.

awesome.

so (hello, my first post on this newsgroup)

a discussion came up on the RISC-V ISA-dev list:
https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/1wqTcDTU2Sg

where one of the participants is asking if RISC-V can be extended
to cover a minimum of 16-byte atomic reservations, particularly
given how many fundamental algorithms become far more efficient
when such an atomic primitive is available.

at present, RISC-V only has a single (64-bit) load/store-reservation
capability, the semantics of which is that the SC succeeds if and
only if there was no other SMP core / hardware-thread that requested
the exact same memory address in the intervening time. it is
intended that a loop occur around this testing of the return result
from SC, retrying the LR/SC batch until it succeeds.
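
in C terms the conventional single-word retry loop looks like the
sketch below (a hedged illustration using a GCC __atomic builtin,
which on RV64 is lowered to exactly such an LR/SC loop):

#include <stdbool.h>
#include <stdint.h>

/* atomically add v to *p.  RISC-V has no CAS instruction, so the
 * builtin compiles to:  loop: lr.d ; add ; sc.d ; bnez loop --
 * i.e. the LR/SC batch is retried until SC reports success.     */
uint64_t fetch_add_llsc(uint64_t *p, uint64_t v)
{
    uint64_t old = __atomic_load_n(p, __ATOMIC_RELAXED);
    while (!__atomic_compare_exchange_n(p, &old, old + v,
                                        true,  /* weak: may fail spuriously */
                                        __ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
        ;  /* old has been refreshed with the current value; retry */
    return old;
}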

the question therefore came up, well, is it reasonable to have
multiple consecutive LR instructions make multiple cache-line
reservations, followed by multiple consecutive SC instructions?
and for the semantics to be "all-or-nothing" on a global basis.
i.e. a sequence of LR reservations are made, and if another
SMP core / hardware thread makes a reservation on *any* of that
batch, then *all* LRs are invalidated (not just the one).

the same looping would therefore be required: it would just be
the case that several memory addresses in completely different
locations could be reserved simultaneously.

i would be interested to hear peoples' thoughts on whether this
is practical to implement, and, more importantly, if it would
actually be useful.

in particular, the fact that now *multiple* things happen inside
the multi-LR, multi-SC loop has me concerned, as i do not have
enough experience in algorithms to say if that is a problem.

greatly appreciated peoples' insights.

l.

Ivan Godard

Oct 31, 2018, 12:47:09 AM
You'd need an instruction to tell you that you now have done all the SCs
there are (so commit the bunch), and another to tell you that a new
group of LLs is starting (so ignore any outstanding LLs that didn't get
closed). At that point the whole thing is starting to look like
transactional memory, and if you are going to TM then the LL/SC paradigm
just gets in your way.

lk...@lkcl.net

Oct 31, 2018, 1:39:57 AM
On Wednesday, October 31, 2018 at 4:47:09 AM UTC, Ivan Godard wrote:
> On 10/30/2018 8:22 PM, lk...@lkcl.net wrote:


> You'd need an instruction to tell you that you now have done all the SCs
> there are (so commit the bunch), and another to tell you that a new
> group of LLs is starting (so ignore any outstanding LLs that didn't get
> closed). At that point the whole thing is starting to look like
> transactional memory, and if you are going to TM then the LL/SC paradigm
> just gets in your way.

ok, so, correct me if i'm wrong, what you're saying is:

(1) that there is no way that the SCs could be matched up with the
LRs, such that an additional instruction is needed?

(2) there's a "commit" needed i.e. that there should be some action
which does not take place until the group's end is detected.

(3) as the end cannot be detected, another additional instruction is
needed to indicate the start of a new group

if so: many apologies: fundamentally, i feel that this is a
misunderstanding of the way that LR/SC works and also of the
multi-LR/SC proposal.

my understanding of LR/SC is that it does not pause (place in a
buffer pending an atomic completion condition) anything at all. LR
simply makes a "reservation" on a memory address (exactly as mitch
outlines for the 66000 ISA). that's all. the (potential) difference
being between what mitch outlines and what LR/SC does: that
reservation says, "if any other SMP core / hart tries to make a
similar "reservation" on that address, invalidate *MY* reservation".

there's no caching, no buffering, no "operations placed into a queue
which will not happen if a certain condition is not met" of any kind.

so allow me to go through (1) to (3) above, first in the context of
LR/SC (standard RISC-V) and then in the context of a multi-variant

first, standard RV:

(1) there is only one LR, there is only one SC. the standard RV
specification specifically states that if there are two LR
instructions (without a matching SC in between), the second LR
*INVALIDATES* the first.

(2) LR/SC fundamentally by design does not need a "commit". it is
quite literally implemented as "mark this address in the cache with
the SMP core identifier".

(3) there's only one item in the "group", and the specification has
specific rules that reset the reservation (singular) back to "known
reset state" directly after the SC. therefore there is no extra
instruction needed, "start a new group".

now the proposed variant:

(1) there may be multiple consecutive LRs: there will therefore need
to be multiple matching "reservation" entries, just as there are with
mitch's 66000 ISA. thus, if there are multiple LRs (let us say N), it
is reasonable to expect that the code follows up with multiple SCs
(let us REQUIRE N).

thus, with the REQUIREMENT that the number of SCs match the number of
LRs, i believe that the issue that you raise is moot. an additional
requirement could hypothetically be added that the LRs be consecutive
(macro-op fuseable), and likewise the SCs.

context-switches or traps would automatically invalidate the
reservations (requiring that the code go once more round an
"untrapped" loop, once the context-switch / trap returns)

(2) multi-LR/SC again, building on single LR/SC, would not need a
"commit" phase. this is, again, by design. the multi-LR reserves
multiple addresses in the cache. if another SMP core /
hardware-thread is detected to be trying to reserve ANY of them, *ALL* of
them are invalidated (atomically, for both the invalidator AND the
invalidee).

there *is* no action/data that is buffered/stored in a queue, there
*is* nothing to "commit".

(3) the end of the group is detected by the number of multi-issued LRs
precisely and exactly matching the number of multi-issued SCs. thus,
everything remains in sync.

it may help to add to a specification that if the following occurs:

LR LR LR LR
SC SC SC
LR

that this will INVALIDATE the global group (there is one last of four
SCs outstanding in the example above at the time of the 5th LR), and
that, furthermore, this is clearly a programming error, given that the
number of SCs does not match up with the number of LRs.


the thing about LR/SC is that the atomic operation is achieved...
eventually. it may not be achieved immediately. the LR/SC is just a
"signalling" system - a communication system between SMP cores /
hardware-threads - which indicates whether, during the operation that
people WANT to be atomic, there happened to be another core which did
something that could have interfered with that "wish".

and the simplest way to do that is: reserve some memory address(es),
alter them, and check *AFTER* the fact if anyone else tried to modify
the exact same address. if "yes", repeat the exact same operations
until they *do* succeed.

absolutely no commits involved, absolutely no buffering of any kind
involved, at all.

does that help clarify?

the bit that i don't know is: whether a multi-LR/SC is actually
useful. as in, i don't know enough about algorithms to actually write
pseudo-code or assembly code for which multi-LR/SC solves an actual,
real-world problem.

l.

Ivan Godard

Oct 31, 2018, 2:33:09 AM
All you say is true if the program is well-behaved and LLs always match
SCs. My programs do not behave, even when I pay them well. You must
expect that there will be arbitrary state already in place when your
program starts what it expects to be an atomic group, and your semantics
must guarantee to do something sensible in the presence of that state.

Consider what happens if you enter your loop with already existing
dangling LRs from some unrelated code that branched out in the middle?
What if the dangling LRs and your own LRs together exceed the maximum
supported reservation count? What happens if you do a bunch of LRs and
then some but not all SCs and then wander off? Or call a function that
has its own loop? Or want to single-step your loop in a debugger? Or log
the loaded values? In your LR LR LR LR SC SC SC LR example, did the
three SCs happen? How did you prevent them from happening? How about the
user of the data structure who expects none-or-all update semantics?

You may say that no correct program will do that. To which I reply:
"Exactly". The world is full of incorrect programs, and malicious ones
that will pervert your primitives. Yes, all these issue can be
successfully addressed; the lit is full on the subject. The fix is
called "transactional semantics", and is *very* hard to get right.

lk...@lkcl.net

Oct 31, 2018, 3:08:12 AM
On Wed, Oct 31, 2018 at 6:33 AM Ivan Godard <iv...@millcomputing.com> wrote:
>
> On 10/30/2018 10:39 PM, lk...@lkcl.net wrote:

>
> All you say is true if the program is well-behaved and LLs always match
> SCs.

yes. from what i gather, an intrinsic has been added to the riscv
gcc port, which provides the LR/SC semantics without the user having
to call them as assembler, themselves.

thus, the requirement - the contract - to conform with the
specification is moved from the user to the compiler, where it is far
more likely that the compiler will conform correctly with the
specification than the user will.

now, in the case of a multi-LR/SC, that intrinsic could reasonably be
expected to have an augmented equivalent that, instead of a single
memory address, takes an array of memory addresses.

as long as the same array of addresses is passed into both the LR and
to the SC, the specification is adhered to.

stepping *outside* of the specification is, i feel, beyond the scope
of discussion.

> Consider what happens if you enter your loop with already existing
> dangling LRs from some unrelated code that branched out in the middle?

that would be a bug in the program (or, much less likely, the compiler).

this i feel is a style of argument that could be rationally applied
to any assembly instruction, and to any specification. if a program
is not compliant with the specification, all bets are off; discussing
hypothetical scenarios is... well... not useful (to me).

what would be a much more useful thing to do, i feel, would be to
demonstrate that no matter what, there does not exist any possible
multi-LR/multi-SC proposal or specification that could possibly be
useful, even if complied with correctly by both compiler and
application(s).

also i feel that the next most useful thing to do would be to
demonstrate flaws in the current proposal, listing possible weaknesses
and caveats that require implementors at both the hardware and
software level to take special precautions.

for example: this is why i mentioned that any context-switches or
traps would automatically invalidate the entire list of reservations,
as there is no way that atomicity can easily or simply be guaranteed
if an execution thread leaves its current context.

does that sound reasonable?

l.

Ivan Godard

Oct 31, 2018, 3:43:51 AM
On 10/31/2018 12:08 AM, lk...@lkcl.net wrote:
> now, in the case of a multi-LR/SC, that intrinsic could reasonably be
> expected to have an augmented equivalent that, instead of a single
> memory address, takes an array of memory addresses.
>
> as long as the same array of addresses is passed into both the LR and
> to the SC, the specification is adhered to.
>
> stepping *outside* of the specification is, i feel, beyond the scope
> of discussion.

Then you have also stepped out of the domain of architecture, which must
concern itself with the reality of physics and users, including failing
and malicious ones. This is an architecture board, BTW.

>> Consider what happens if you enter your loop with already existing
>> dangling LRs from some unrelated code that branched out in the middle?
>
> that would be a bug in the program (or, much less likely, the compiler).

And your specification excludes bugs :-)

> this i feel is a style of argument that could be rationally applied
> to any assembly instruction, and to any specification. if a program
> is not compliant with the specification, all bets are off; discussing
> hypothetical scenarios is... well... not useful (to me).

No program of interesting size has a specification to comply with.
Programs that have specifications are called "toys".

> what would be a much more useful thing to do, i feel, would be to
> demonstrate that no matter what, there does not exist any possible
> multi-LR/multi-SC proposal or specification that could possibly be
> useful, even if complied with correctly by both compiler and
> application(s).
>
> also i feel that the next most useful thing to do would be to
> demonstrate flaws in the current proposal, listing possible weaknesses
> and caveats that require implementors at both the hardware and
> software level to take special precautions.
>
> for example: this is why i mentioned that any context-switches or
> traps would automatically invalidate the entire list of reservations,
> as there is no way that atomicity can easily or simply be guaranteed
> if an execution thread leaves its current context.
>
> does that sound reasonable?

Depends on what you hope to get out of it. If you hope to define
something that could be implemented and used by people unconnected with
your effort then no, it is not useful. If you hope to get a publishable
paper out of it then perhaps it is useful, given today's state of
academic publishing.

Broadly, you presume to define that a set of updates can be made atomic
as a group, and from that conclude that a proposal incorporating a
set of updates will be successfully atomic. Hmmm.

Incidentally, a context switch is not a machine operation like an add.
It is the OS updating many fields in a rather large data structure,
which must be collectively all-or-none. Would the OS be able to use the
proposal to implement context switch?

Suggested reading:
https://en.wikipedia.org/wiki/Deadlock#Livelock
https://en.wikipedia.org/wiki/Two-phase_commit_protocol
https://en.wikipedia.org/wiki/Exploit_(computer_security)
https://en.wikipedia.org/wiki/Transactional_memory
https://en.wikipedia.org/wiki/Double_compare-and-swap

lk...@lkcl.net

Oct 31, 2018, 4:16:16 AM
On Wednesday, October 31, 2018 at 7:43:51 AM UTC, Ivan Godard wrote:

> > stepping *outside* of the specification is, i feel, beyond the scope
> > of discussion.
>
> Then you have also stepped out of the domain of architecture, which must
> concern itself with the reality of physics and users, including failing
> and malicious ones. This is an architecture board, BTW

ok, so allow me to clarify what i care about, so as to illustrate
precisely and exactly what i mean by "stepping outside of the
scope / specification"

* there is a specification (finalised and implemented in the case
of RV-Base; proposed in the case of Multi-LR/SC)
* if there exists a design or other flaw that allows an arbitrary
user to execute malicious code or adversely affect anything other
than their own program, i care.
* if there exists a design or implementation flaw in the compiler
that allows the same, i care.
* if the USER does NOT CONFORM TO THE SPECIFICATION, stepping OUTSIDE
of that specification, and does NOT cause malicious damage, but
solely and exclusively causes their OWN (one) program to behave
in unexpected and unanticipated ways, i genuinely do not give a
rat's posterior.

so yes of course (and i apologise for assuming that it would go
without saying), if the user manages to corrupt other peoples' data,
that's a problem. but if the sole outcome of failing to comply
with the specification is that their own program crashes? Not Our Problem.

> >> Consider what happens if you enter your loop with already existing
> >> dangling LRs from some unrelated code that branched out in the middle?
> >
> > that would be a bug in the program (or, much less likely, the compiler).
>
> And your specification excludes bugs :-)

nooo, i'm *asking* if there are any bugs in it. and also if there are
any real-world algorithms that would benefit from a multi-LR/SC
instruction.


> > what would be a much more useful thing to do, i feel, would be to
> > demonstrate that no matter what, there does not exist any possible
> > multi-LR/multi-SC proposal or specification that could possibly be
> > useful, even if complied with correctly by both compiler and
> > application(s).
> >
> > also i feel that the next most useful thing to do would be to
> > demonstrate flaws in the current proposal, listing possible weaknesses
> > and caveats that require implementors at both the hardware and
> > software level to take special precautions.
> >
> > for example: this is why i mentioned that any context-switches or
> > traps would automatically invalidate the entire list of reservations,
> > as there is no way that atomicity can easily or simply be guaranteed
> > if an execution thread leaves its current context.
> >
> > does that sound reasonable?
>
> Depends on what you hope to get out of it. If you hope to define
> something that could be implemented and used by people unconnected with
> your effort then no, it is not useful.

respectfully, there are about three separate independent enquiries
over the past few months, for how >=16-byte atomic operations could
be added to RISC-V, so i am (informally, and independently of those
sources) making enquiries here.

those three independent sets of interest would tend to indicate that
the statement you make is an assumption. apologies for having to
point that out.

> Broadly, you presume to define that a set of updates can be made atomic
> as a group,

no, i do not believe that i am, as from how i understand it, that's
not how LR/SC works

LR/SC allows the program to *detect* if anything
(specifically, another SMP core or hardware-thread) *interfered* with
the updates, such that *after* the updates have been carried out and
the interference [guaranteed to have been] detected, the operation
may be repeated again and again until such time as it is detected that
no interference occurred.

the single-LR/SC case has been successfully implemented by multiple RISC-V
vendors, and it does have full, proven, working compiler support in
gcc, as well as libc6 (glibc) support.
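
for concreteness, here is the canonical compare-and-swap loop built
from a single LR/SC, essentially the sample code from the "A"
extension chapter of the RISC-V spec (and roughly what gcc emits
for its __atomic builtins on RV32/RV64):

    # a0 = address, a1 = expected value, a2 = new value
    # returns 0 in a0 on success, nonzero on failure
    cas:
        lr.w  t0, (a0)        # load-reserve the word
        bne   t0, a1, fail    # not the expected value: give up
        sc.w  t0, a2, (a0)    # try to store the new value
        bnez  t0, cas         # reservation was lost: retry
        li    a0, 0           # success
        ret
    fail:
        li    a0, 1           # failure
        ret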

so the question is: can a multi-LR/SC with multiple memory reservations
just like mitch's 66000 ISA work as well?


> and from that conclude that a proposal incorporating a
> set of updates will be successfully atomic. Hmmm.

in light of the clarification above would you be kind enough to
re-evaluate this assessment?


> Incidentally, a context switch is not a machine operation like an add.
> It is the OS updating many fields in a rather large data structure,
> which must be collectively all-or-none.

indeed. in the riscv linux kernel as well as the supervisor-mode riscv
unit tests, such a context switch is done in a traditional way of
saving all registers onto an [altered] stack (and other state). the
unit test code is a lot simpler than a full OS-style context-switch:

https://github.com/riscv/riscv-test-env/blob/master/v/entry.S


> Would the OS be able to use the
> proposal to implement context switch?

very good question. in the question on the isa-dev mailing list
that led to this thread, it was hinted (without detail being
provided), that yes, multi-LR/SC can indeed be used to implement
a context-switch.

another algorithm that was mentioned was this:

"One example I know myself that would benefit quite a
lot from this is the popular Boehm-Demers-Weiser conservative
garbage collector"

i do not have the expertise to evaluate these things, hence
the reason why i am asking for help.

l.

Bruce Hoult

Oct 31, 2018, 4:17:44 AM
On Tuesday, October 30, 2018 at 8:22:18 PM UTC-7, lk...@lkcl.net wrote:
> at present, RISC-V only has a single (64-bit) load/store-reservation
> capability, the semantics of which is that the SC succeeds if and
> only if there was no other SMP core / hardware-thread that requested
> the exact same memory address in the intervening time. it is
> intended that a loop occur around this testing of the return result
> from SC, retrying the LR/SC batch until it succeeds.

That's not quite correct.

Another hardware thread using LR on a sufficiently similar address will also steal the reservation and make the SC fail. The parameters of "sufficient" are not defined by the architecture: it could be the exact same aligned word address, it could be the same cache line, it could be the same VM/TLB page, or anything else. The SC only has to be to some address in this range, not necessarily the identical address:

Section 7.2 in the ISA spec, v2.2:
"An implementation can reserve an arbitrary subset of the memory space on each LR [...] An SC can succeed if no accesses from other harts to the address can be observed to have occurred between the SC and the last LR in this hart to reserve the address. Note this LR might have had a different address argument, but reserved the SC’s address as part of the memory subset."

lk...@lkcl.net

Oct 31, 2018, 5:14:25 AM
On Wednesday, October 31, 2018 at 8:17:44 AM UTC, Bruce Hoult wrote:
> On Tuesday, October 30, 2018 at 8:22:18 PM UTC-7, lk...@lkcl.net wrote:
> > at present, RISC-V only has a single (64-bit) load/store-reservation
> > capability, the semantics of which is that the SC succeeds if and
> > only if there was no other SMP core / hardware-thread that requested
> > the exact same memory address in the intervening time. it is
> > intended that a loop occur around this testing of the return result
> > from SC, retrying the LR/SC batch until it succeeds.
>
> That's not quite correct.
>
> Another hardware thread using LR on a sufficiently similar
> address will also steal the reservation and make the SC fail.

ah, cool, thanks for clarifying, bruce. the basis for this is that
detection of interference with the address the reserver is interested
in may be done at a much coarser granularity than the actual address
of interest.

that actually makes implementation a lot easier (less resources
needed).

hypothetically i imagine that it could even be the entirety of memory
(i.e. that there be *no* LR by *any* other hardware thread); however,
such a crude implementation would likely result in significantly
degraded runtime performance, by virtue of LR/SC loops taking
up a disproportionate amount of execution time.

hm, the only thing about a coarser granularity is that it would mean
that multiple reservations could easily end up trying to reserve the
exact same memory "area". so a little care in the implementation
details would be needed.

l.

Ivan Godard

Oct 31, 2018, 5:18:26 AM
On 10/31/2018 1:16 AM, lk...@lkcl.net wrote:
> On Wednesday, October 31, 2018 at 7:43:51 AM UTC, Ivan Godard wrote:
>
>>> stepping *outside* of the specification is, i feel, beyond the scope
>>> of discussion.
>>
>> Then you have also stepped out of the domain of architecture, which must
>> concern itself with the reality of physics and users, including failing
>> and malicious ones. This is an architecture board, BTW
>
> ok, so allow me to clarify what i care about, so as to illustrate
> precisely and exactly what i mean by "stepping outside of the
> scope / specification"
>
> * there is a specification (finalised and implemented in the case
> of RV-Base; proposed in the case of Multi-LR/SC)
> * if there exists a design or other flaw that allows an arbitrary
> user to execute malicious code or adversely affect anything other
> than their own program, i care.
> * if there exists a design or implementation flaw in the compiler
> that allows the same, i care.
> * if the USER does NOT CONFORM TO THE SPECIFICATION, stepping OUTSIDE
> of that specification, and does NOT cause malicious damage, but
> solely and exclusively causes their OWN (one) program to behave
> in unexpected and unanticipated ways, i genuinely do not give a
> rat's posterior.

You show your naivete. Those of us who practice architecture know that
defining what it does when it works is easy; the hard part is handling
when it doesn't work, both ensuring that system stability is maintained
and precluding deliberate exploit.

> so yes of course (and i apologise for assuming that it would go
> without saying), if the user manages to corrupt other peoples' data,
> that's a problem. but if the sole outcome of failing to comply
> with the specification is that their own program crashes? Not Our Problem.

While not absolutely essential (as so often demonstrated) there is also
an aesthetic aspect: even if it works when used correctly, your
design should not invite error, nor cause unnecessary grief among those
who must support it.

>>>> Consider what happens if you enter your loop with already existing
>>>> dangling LRs from some unrelated code that branched out in the middle?
>>>
>>> that would be a bug in the program (or, much less likely, the compiler).
>>
>> And your specification excludes bugs :-)
>
> nooo, i'm *asking* if there are any bugs in it. and also if there are
> any real-world algorithms that would benefit from a multi-LR/SC
> instruction.
>

There are no bugs in it, because you have defined them away. And
essentially all programs with any parallelism at all could use a
concurrent atomic update. So far you have not shown that multi-LR/SC (as
you have described it) achieves concurrent atomic update.

>>> what would be a much more useful thing to do, i feel, would be to
>>> demonstrate that no matter what, there does not exist any possible
>>> multi-LR/multi-SC proposal or specification that could possibly be
>>> useful, even if complied with correctly by both compiler and
>>> application(s).
>>>
>>> also i feel that the next most useful thing to do would be to
>>> demonstrate flaws in the current proposal, listing possible weaknesses
>>> and caveats that require implementors at both the hardware and
>>> software level to take special precautions.
>>>
>>> for example: this is why i mentioned that any context-switches or
>>> traps would automatically invalidate the entire list of reservations,
>>> as there is no way that atomicity can easily or simply be guaranteed
>>> if an execution thread leaves its current context.
>>>
>>> does that sound reasonable?
>>
>> Depends on what you hope to get out of it. If you hope to define
>> something that could be implemented and used by people unconnected with
>> your effort then no, it is not useful.
>
> respectfully, there are about three separate independent enquiries
> over the past few months, for how >=16-byte atomic operations could
> be added to RISC-V, so i am (informally, and independently of those
> sources) making enquiries here.
>
> those three independent sets of interest would tend to indicate that
> the statement you make is an assumption. apologies for having to
> point that out.

Oh, everyone would like multi-atomic operations. And they are easy to
do: just set a Whopping Big Lock (TM), do everything, and reset it. But
you asked whether your multi-LL/SC is a way to do multi-atomic, and so
far the answer is no. Of course, you may say that implementing it is
also Someone Else's Problem (TM). But if so, then what do they need you
for? It's not like the problems are new - I suspect the first work on
multi-LL/SC is older than you are. It may be older than I am.

>> Broadly, you presume to define that a set of updates can be made atomic
>> as a group,
>
> no, i do not believe that i am, as from how i understand it, that's
> not how LR/SC works
>
> LR/SC allows the program to *detect* if anything
> (specifically, another SMP core or hardware-thread) *interfered* with
> the updates, such that *after* the updates have been carried out and
> the interference [guaranteed to have been] detected, the operation
> may be repeated again and again until such time as it is detected that
> no interference occurred.
>
> the single-LR/SC case has been successfully implemented by multiple RISC-V
> vendors, and it does have full, proven, working compiler support in
> gcc, as well as libc6 (glibc) support.
>
> so the question is: can a multi-LR/SC with multiple memory reservations
> just like mitch's 66000 ISA work as well?

Not without a transaction notion.

Single LL/SC works because the hardware atomicity provides a
transactional notion for single loads and stores, although you need to
experience unaligned memory accesses to appreciate how much a pain that
is. The hardware could also make multi-access atomic, but the cost is
too high in practice.

The known alternatives are bus locking and transactional memory, both of
which are doable for small write sets but scale badly. Your multi-LL/SC
is just an incomplete and poorly thought through version of
transactional memory; Google is your friend.

>
>> and from that conclude that a proposal incorporating a
>> set of updates will be successfully atomic. Hmmm.
>
> in light of the clarification above would you be kind enough to
> re-evaluate this assessment?
>
>
>> Incidentally, a context switch is not a machine operation like an add.
>> It is the OS updating many fields in a rather large data structure,
>> which must be collectively all-or-none.
>
> indeed. in the riscv linux kernel as well as the supervisor-mode riscv
> unit tests, such a context switch is done in a traditional way of
> saving all registers onto an [altered] stack (and other state). the
> unit test code is a lot simpler than a full OS-style context-switch:
>
> https://github.com/riscv/riscv-test-env/blob/master/v/entry.S
>
>
>> Would the OS be able to use the
>> proposal to implement context switch?
>
> very good question. in the question on the isa-dev mailing list
> that led to this thread, it was hinted (without detail being
> provided), that yes, multi-LR/SC can indeed be used to implement
> a context-switch.
>
> another algorithm that was mentioned was this:
>
> "One example I know myself that would benefit quite a
> lot from this is the popular Boehm-Demers-Weiser conservative
> garbage collector"
>
> i do not have the expertise to evaluate these things, hence
> the reason why i am asking for help.

More circularity then: if a context switch invalidates the primitive,
how can it implement the switch without invalidating itself? Can you
guarantee absence of livelock? Show your work.

The problem here is that you are thinking too abstractly. Sure, everyone
would like to have a multi-LLSC. You can add one to your ISA definition.
Whoopee ding dong. But if you haven't dropped under the abstraction and
shown how your abstraction is (or at least could be) implemented in
hardware then it is not "reasonable". Utility is irrelevant to that
which is not technically and economically viable.

Use your own example:
LL LL LL LL
SC SC SC
<interrupt, or some other invalidator>
SC

How do you ensure all-or-none semantics on the SCs, as required by the
others who are using the data structure concurrently with you?

I think I'm done; perhaps over-done. Have fun :-)

lk...@lkcl.net

Oct 31, 2018, 6:47:34 AM
On Wednesday, October 31, 2018 at 9:18:26 AM UTC, Ivan Godard wrote:
> On 10/31/2018 1:16 AM, lk...@lkcl.net wrote:
> > On Wednesday, October 31, 2018 at 7:43:51 AM UTC, Ivan Godard wrote:
> >
> >>> stepping *outside* of the specification is, i feel, beyond the scope
> >>> of discussion.
> >>
> >> Then you have also stepped out of the domain of architecture, which must
> >> concern itself with the reality of physics and users, including failing
> >> and malicious ones. This is an architecture board, BTW
> >
> > ok, so allow me to clarify what i care about, so as to illustrate
> > precisely and exactly what i mean by "stepping outside of the
> > scope / specification"
> >
> > * there is a specification (finalised and implemented in the case
> > of RV-Base; proposed in the case of Multi-LR/SC)
> > * if there exists a design or other flaw that allows an arbitrary
> > user to execute malicious code or adversely affect anything other
> > than their own program, i care.
> > * if there exists a design or implementation flaw in the compiler
> > that allows the same, i care.
> > * if the USER does NOT CONFORM TO THE SPECIFICATION, stepping OUTSIDE
> > of that specification, and does NOT cause malicious damage, but
> > solely and exclusively causes their OWN (one) program to behave
> > in unexpected and unanticipated ways, i genuinely do not give a
> > rat's posterior.
>
> You show your naivete.

as someone who reverse-engineered NT Domains network traffic,
and was a member of ISS X-Force Security Research in 1999,
it is more likely that i display a lack of clarity in my ability
to communicate.

> Those of us who practice architecture know that
> defining what it does when it works is easy; the hard part is handling
> when it doesn't work, both ensuring that system stability is maintained
> and precluding deliberate exploit.

indeed: i apologise for assuming that it would go without saying
that such mission-critical prerequisites are paramount requirements.

> > so yes of course (and i apologise for assuming that it would go
> > without saying), if the user manages to corrupt other peoples' data,
> > that's a problem. but if the sole outcome of failing to comply
> > with the specification is that their own program crashes? Not Our Problem.
>
> While not absolutely essential (as so often demonstrated) there is also
> an aesthetic aspect: even if it works when used correctly, your
> design should not invite error, nor cause unnecessary grief among those
> who must support it.

indeed. this is one of the fascinating aspects of RISC, namely that
RISC systems are on the borderline of providing the absolute bare
minimum primitives that are an absolute pain in the ass to work with
without a compiler (and associated libc6 support) having them "baked in"
on the users' behalf as intrinsics and so on.

where a normal CISC system would have a single instruction, easy,
call it, done, the RISC paradigm minimises and dramatically simplifies
the hardware, and expects the compiler writers, library writers and
assembly-level programmers to obey the contract or suffer the
consequences.

does a multi-LR/SC go even further than that, and impose on the
user even beyond the usual contract that RISC-based systems expect?
honestly i have no idea.

> >>>> Consider what happens if you enter your loop with already existing
> >>>> dangling LRs from some unrelated code that branched out in the middle?
> >>>
> >>> that would be a bug in the program (or, much less likely, the compiler).
> >>
> >> And your specification excludes bugs :-)
> >
> > nooo, i'm *asking* if there are any bugs in it. and also if there are
> > any real-world algorithms that would benefit from a multi-LR/SC
> > instruction.
> >
>
> There are no bugs in it, because you have defined them away.

great! i can get implementing, straight away! [i'm kidding...
simulations first... actually, corroborative constructive
input and review first... ]

> And
> essentially all programs with any parallelism at all could use a
> concurrent atomic update. So far you have not shown that multi-LR/SC (as
> you have described it) achieves concurrent atomic update.

*it* does not. a loop - a *sequence* of instructions - *uses* LR/SC to
test *whether* concurrent atomic update[s] have [retrospectively]
*been* achieved. this is completely different.

i've mentioned this subtle distinction about three times, now, and
each time you have not indicated that you understand this distinction,
and, unfortunately, continue to use the same [ambiguous, unclear]
language that insists, most unfortunately, that "i am saying that
multi-LR/SC *is* an atomic update", when it most certainly is not.
i have *never* said that "it" is.

> Oh, everyone would like multi-atomic operations. And they are easy to
> do: just set a Whopping Big Lock (TM), do everything, and reset it.

indeed: in a cross-over post, where bruce kindly clarified about the
RISC-V spec, he mentioned that it's possible for implementors to
put the reservation on a cache line or on a TLB page, and i mention that
hypothetically the reservation could be on the entirety of memory
[the load-reservation equivalent of a WBL (TM)]

> But
> you asked whether your

... not mine. i typically do not use personal pronouns to discuss
computing concepts. the use of personal pronouns typically indicates
efforts to mentally "claim ownership" (or, worse, to _assign_ ownership),
leading to difficulties in discussions, division and conflict, and, in a
lot of cases, leads to accidental "personal insult" where constructive
criticism of the *idea* leads to misinterpretation as "criticism and
attack of the person".

after 20 years of interacting in free software forums i've generally
found that dropping personal pronouns entirely, "your project",
"your idea", and using "the project" or "the idea", and so on, are
much less likely to be misinterpreted.


> multi-LL/SC is a way to do multi-atomic,

again: i did not say that, at all.

> and so far the answer is no.

... because you believe that i said "multi-LL/SC is a multi-atomic operation"
when i said nothing of the sort.

> > so the question is: can a multi-LR/SC with multiple memory reservations
> > just like mitch's 66000 ISA work as well?
>
> Not without a transaction notion.
>
> Single LL/SC works because the hardware atomicity provides a
> transactional notion for single loads and stores, although you need to
> experience unaligned memory accesses to appreciate how much a pain that
> is.

i believe i may have a way to help you to understand multi-LR/SC,
based as it is on single-LR/SC.

(1) let us assume that there are multiple LRs which happen to
fall on the same cache line
(2) let us assume that the hardware implementation reserves
memory addresses on a cache-line level of granularity

i believe it is quite trivially shown through this example that
even if there are N bytes in the cache line and there are N
single-byte separate and distinct LR instructions, the loop
around multi-LR/SC can trivially be shown to be absolutely
identical to the single-LR/SC case.

now let us expand that to a TLB page. again, the fact
that all of the LRs fall into the same TLB page shows
that, indeed, the multi-LR/SC would be successful, i.e.
directly equivalent to a single-LR/SC.

now we move to there being two TLB pages. now four. now
50. now the entire TLB. now the entirety of memory
(the WBL (TM)).

in each case, the multi-LR/SC primitives may be shown
to be... "moot".

the next phase from there is to go back and say, "ok,
let's get one of the LRs to reserve cache line 1,
while the second reserves cache line 2". and if another
hardware thread issues an overlapping LR **ON EITHER
CACHE LINE ONE ***OR*** CACHE LINE 2**, then **BOTH
ARE INVALIDATED**.

is it _really_ hard to see that the extension of
single-LR/SC to multi-LR/SC semantics is directly
equivalent, where one places a global reservation on
a single area and the other *also* places a global
reservation... just on multiple cache lines (or other
suitable granular areas)?

context, for emphasis: LR/SC are *not* atomicity instructions.
they're primitives that are used to *DETECT* if a memory-based
load/store sequence was violated by another hardware thread / SMP core.


> >> Would the OS be able to use the
> >> proposal to implement context switch?
> >
> > very good question. in the question on the isa-dev mailing list
> > that led to this thread, it was hinted (without detail being
> > provided), that yes, multi-LR/SC can indeed be used to implement
> > a context-switch.
> >
> > another algorithm that was mentioned was this:
> >
> > "One example I know myself that would benefit quite a
> > lot from this is the popular Boehm-Demers-Weiser conservative
> > garbage collector"
> >
> > i do not have the expertise to evaluate these things, hence
> > the reason why i am asking for help.
>
> More circularity then:

not really.

> if a context switch invalidates the primitive,
> how can it implement the switch without invalidating itself?

this is invalid reasoning. once in the different context
(supervisor / machine mode), LR/SC usage in the higher context
level has absolutely nothing to do with the userspace context
from which it just switched.

under no circumstances would LR/SC be used or expected to be used
*over* i.e. *between* the userspace-supervisor or
supervisor-userspace boundary.

whilst in the supervisor space, LR/SC used there would work perfectly
(as long as no traps or context-switches occurred)

whilst in the user space, LR/SC used there would work perfectly
(as long as no traps or context-switches occurred)


> The problem here is that you are thinking too abstractly.

i'm just asking questions, and using this as an opportunity
to clarify.

> Use your own example:
> LL LL LL LL
> SC SC SC
> <interrupt, or some other invalidator>

the 4 load-reservations would be invalidated

> SC

this SC would fail (returning an indication to the program that
failure had occurred), the internal state reset to "zero reservations",
and the loop that goes round the 4-LLs plus 4-SCs would be
re-executed.

on re-execution, the 4 LLs would reserve (anew) the same 4 memory
addresses.

at *some* point, there would be a repetition of the loop that
did not involve an interrupt (or other invalidator).

it might take a while, and that's fine.
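
for concreteness, a minimal sketch of the loop shape being proposed
(hypothetical: accumulating lr.d reservations are *not* ratified
RISC-V semantics), atomically bumping two counters at (a0) and (a1):

    retry:
        lr.d  t0, (a0)       # reserve + load counter 1
        lr.d  t1, (a1)       # reserve + load counter 2 (proposed:
                             #   does NOT cancel the first reservation)
        addi  t0, t0, 1
        addi  t1, t1, 1
        sc.d  t2, t0, (a0)
        sc.d  t3, t1, (a1)   # note: the first store may already be
        or    t2, t2, t3     #   visible by the time this one fails -
        bnez  t2, retry      #   exactly the all-or-none question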

> How do you ensure all-or-none semantics on the SCs, as required by the
> others who are using the data structure concurrently with you?

not with me: i make no personal claim of the data structures,
nor mentally associate myself with them. see above as to why.

the hardware guarantees the invalidation of all four reservations,
just as the hardware guarantees the invalidation of one (in
the single-LR/SC case).

at *no time* would there *ever* be an instance where *only*
one of the memory addresses would be invalidated whilst other
(reserved) addresses were not.

this is required by the hardware. any hardware implementation
that allowed even a single other LR/SC instruction (on *any* hardware
thread / SMP core) to be executed in between the time when the
"invalidating" conditions are detected and the time when the
invalidation of the reservations is completed would be a clear
violation of the specification (i.e. of the load-reservation
contract).

it's really quite simple, ivan - a lot simpler than i believe
you are imagining it is. the multiple reservations are
conceptually treated as a single reservation. and, by
logical inference, if single-LR/SC works, then multi-LR/SC
should as well... AS LONG AS the multiple reservations ARE
indeed treated conceptually the same as a single one.


> I think I'm done; perhaps over-done. Have fun :-)

appreciated your time: you've helped enormously to clarify the
context of the question, so that others do not have to do the
same.

l.

Bruce Hoult

Oct 31, 2018, 6:51:01 AM
On Wednesday, October 31, 2018 at 2:18:26 AM UTC-7, Ivan Godard wrote:
> More circularity then: if a context switch invalidates the primitive,
> how can it implement the switch without invalidating itself? Can you
> guarantee absence of livelock? Show your work.

At least on RISC-V there is no such primitive as a "context switch". The closest thing is the return-from-exception instructions MRET, SRET, URET. An implementation is permitted but not required to clear an LR/SC reservation upon executing an exception return. Portable software should explicitly clear any reservation if required (e.g. if the exception return is actually part of a task switch), for example by executing an SC to a dummy address.

Reasons an implementation may choose not to clear a reservation on exception return are to allow debugger single-stepping or conditional breakpointing within an LR/SC sequence, or to allow trap-and-emulate on some operations within an LR/SC sequence. I guess software TLB refill might be another reason.
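
Concretely, such a dummy SC on the task-switch path might look like this
(a sketch; "dummy_slot" is a hypothetical per-hart scratch word, not
anything defined by the spec):

    la    t0, dummy_slot    # hypothetical per-hart scratch word
    sc.w  zero, zero, (t0)  # success code and stored value both
                            # discarded; any reservation held by
                            # this hart is now cleared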


> The problem here is that you are thinking too abstractly. Sure, everyone
> would like to have a multi-LLSC. You can add one to your ISA definition.
> Whoopee ding dong. But if you haven't dropped under the abstraction and
> shown how your abstraction is (or at least could be) implemented in
> hardware then it is not "reasonable". Utility is irrelevant to that
> which is not technically and economically viable.

To be fair to Luke, multi-LLSC (in fact multiple LLs terminated by a single SC) is something Mitch Alsup describes as being part of his "MY66000" design in an earlier message in this six-month-old thread, as well as on other occasions in comp.arch, and of course in the draft ISA manual that he distributes to interested people.

If your newsreader doesn't easily support it, you might refer to:

https://groups.google.com/forum/#!original/comp.arch/QVl3c9vVDj0/MHSdrebCAQAJ
https://groups.google.com/forum/#!original/comp.arch/QVl3c9vVDj0/UYfENqMZAgAJ
https://groups.google.com/forum/#!original/comp.arch/QVl3c9vVDj0/k85oGGzcAAAJ

lk...@lkcl.net

Oct 31, 2018, 7:30:07 AM
On Wednesday, October 31, 2018 at 10:51:01 AM UTC, Bruce Hoult wrote:

> To be fair to Luke, multi-LLSC (in fact multiple LLs terminated
> by a single SC) is something Mitch Alsup describes as being part
> of his "MY66000" design in an earlier message in this
> six month old thread, as well as on other occasions in
> comp.arch, and of course in the draft ISA manual that he
> distributes to interested people).

ah! in not having full context, i had been unable to determine
from reading the full thread that mitch's design *is* already
multi-LR/SC: i thought it was more exotic.

great! hm, fascinating to note (hello mitch) that only a single
SC would be needed to clear the entire group of LRs. obvious
in retrospect.

ok that's really helpful to have specific references.

so that leaves me wondering *how* multi-LR/SC (multi-LLSC) would
be used?

and also: would a multi-SC actually be useful (now that i
understand the difference, where 66000 only does a single SC
to reset an array of LLs).

here is what the RISC-V spec says [about single-LR/single-SC]:

"LR loads a word from the address in rs1, places the
sign-extended value in rd, and registers a reservation
on the memory address and a range of bytes
including at least all bytes of the addressed word.

SC writes a word in rs2 to the address in rs1,
provided a valid reservation still exists on that
address. SC writes zero to rd on success or a nonzero
code on failure."

so here, we detect "potential issues" with multi-LR/multi-SC,
where it would only be the test of the very last of the multi-SC
sequence that could possibly be trusted, because only when
that very last one had stored (and the answer came back,
"yes the store succeeded") would you be able to conclude that
*all* stores succeeded.

the problem is illustrated with the following sequence:

* LR 1
* LR 2
* LR 3
* LR 4
* SC 1 (succeeds)
-> HART 5 issues LR on address LR2
* SC 2 (FAILS)

here we have a scenario where SC1 succeeded, and if this is not properly taken into account by the software, there could be data corruption because whilst the LR1 address contains "valid" data, the LR2 address would not.

thus it is not quite as straightforward as the single-LR/single-SC case, and there would need to be a "recovery / post-analysis phase" covering the above.

now, if on the other hand, a *range* of contiguous SCs is guaranteed to stall all other SMP cores / hardware-threads pending completion (where there is a hard requirement that the relevant SCs be sequential instructions and thus macro-op-fuseable in effect) then i foresee no problem: the entire block may then be treated as "all-or-nothing", i.e. effectively as a single instruction.

simple implementations could literally put all other hardware threads / SMP cores on hold (up to and including full pipeline flushes), pending completion of the contiguous batch of SCs, in order to guarantee atomicity.

it's drastic, but like mitch said earlier in the thread, correctness is more important than speed.

the other option is, yes, multi-LL/single-SC. how would that work in practice? i.e. how would an algorithm actually use multi-LL/single-SC to good effect?

l.

Terje Mathisen

Oct 31, 2018, 8:21:00 AM
Bruce Hoult wrote:
> On Wednesday, October 31, 2018 at 2:18:26 AM UTC-7, Ivan Godard
>> The problem here is that you are thinking too abstractly. Sure,
>> everyone would like to have a multi-LLSC. You can add one to your
>> ISA definition. Whoopee ding dong. But if you haven't dropped under
>> the abstraction and shown how your abstraction is (or at least
>> could be) implemented in hardware then it is not "reasonable".
>> Utility is irrelevant to that which is not technically and
>> economically viable.
>
> To be fair to Luke, multi-LLSC (in fact multiple LLs terminated by a
> single SC) is something Mitch Alsup describes as being part of his
> "MY66000" design in an earlier message in this six month old thread,
> as well as on other occasions in comp.arch, and of course in the
> draft ISA manual that he distributes to interested people).

In order to be atomic it is pretty obvious that you can have arbitrarily
many setup instructions (with monitoring of interference starting at the
first of them), but you need a single SC to either commit or discard
everything; otherwise you get into Ivan's example of having a ladder of
paired LL/SC instructions and no way to handle a trap/error halfway
through the SC array.
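
Sketched with hypothetical mnemonics (borrowing RISC-V spellings; this
is not any real ISA), that single-commit shape looks like:

    retry:
        lr.d  t0, (a0)       # setup: interference monitoring begins
        lr.d  t1, (a1)       # further monitored setup loads
        addi  t0, t0, 1
        addi  t1, t1, 1
        sd    t0, 0(a0)      # held back, not yet globally visible
        sc.d  t2, t1, (a1)   # the single SC is the commit: both
        bnez  t2, retry      #   stores appear together, or neither

There is no SC ladder to fall off halfway: any trap or interference
before the commit simply discards the held store and retries.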

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Ivan Godard

Oct 31, 2018, 8:32:07 AM
Yes, I know Mitch's stuff. Note that Mitch's design uses "SC" as a commit op,
not as a conditional store. That's fine - the problem is when there
really are multiple real SCs and only half of them get done.

Similarly Mill atomics and the IBM and Intel (when/if they get it
working) flavors also have transactional semantics for multi-update. But
they all have some means to bound the scope of the transaction.
Count-up:count-down won't do because there's no way to distinguish
commit from abort in the presence of buggy control flow, and it's
subject to partial commit even without bugs.

I don't understand the objection to book-end operations; they are cheap
enough and anything to help the sanity is worth it. Perhaps the
objection is RISC-religion.

Ivan Godard

Oct 31, 2018, 8:41:18 AM
Yes. Did you notice that he just invented the counter-example? Odd.

Robert Wessel

Oct 31, 2018, 9:23:07 AM
I think that part of the issue is that most of us are far from
convinced that a non-transactional multi-LL/SC actually solves any
problems worth solving. Contrast that to DCAS (as implemented on some
processors), and multi-word extensions thereof, which are
transactional, and for which we *do* have use cases. So perhaps
there's an assumption that multi-LL/SC does the same thing.

On the other hand the second part of your description ("the hardware
guarantees the invalidation of all four reservations" and "the
multiple reservations are conceptually treated as a single
reservation"), and what follows, seems to imply that either all the
stores will work, or none of them will, IOW, that multi-LL/SC is
transactional.

Consider the sequence:

(a)
LL1
(b)
LL2
(c)
SC1
(d)
SC2
(e)

What happens if the interfering store from another processor happens
after SC1 is executed but before SC2? IOW during (d)? Or an interrupt
happens during (d)? SC1 has already happened, you either need to roll
it back (possibly leaving a visible intermediate state), or leave the
partial update, or prevent any other processor from making updates
during the time the SCs are getting done (which turns the entire point
of LL/SC on its head), or make it transactional. The partial update
and intermediate state visibility cases are simpler, but appear to
have questionable utility.

Now that gets at least somewhat simpler if you make multi-LL and
multi-SC single instructions (and then the implementation issues are
really the same as that of multi-CAS), but it's not very RISC-y.

EricP

Oct 31, 2018, 1:58:50 PM
These are two different scenarios.
The discussion on the RISC-V ISA-dev list is for two updates to
*consecutive* 8-byte data items for a 16-byte aligned update.
This would be a double wide LL/SC within a single cache line.
A different scenario is across multiple cache lines.

I had the same idea for the single line variant quite a while back
after noting that LL/SC can only update a single 4 or 8 byte value
even though it has the whole 64-byte cache line locked/pinned.
LL/SC cannot do an atomic double-wide-compare-and-swap DWCAS.

Now before we get into how to do it, the question is why one wants
this comes up. Normally DWCAS is used to avoid the ABA problem.
However LL/SC does not suffer from the ABA problem,
so a double-wide LL/SC seems somewhat pointless.
But lets leave that aside; maybe "if you build it, they will come".

The DW-LL/SC needs 3 instructions and allows an atomic update
for up to a whole single cache line:
LL to load and lock a 4 or 8 byte value. If a second LL is to the
same cache line, it retains the original lock, otherwise it
releases the old line and locks the new line.
It is the programmers responsibility to get it right.

STCH is STore Conditional and Hold, checks the existing lock,
and if still valid updates a 4 or 8 byte aligned value,
returns a success/fail indicator, and retains the lock.

STCR is STore Conditional and Release, checks the existing lock,
and if still valid updates a 4 or 8 byte aligned value,
returns a success/fail indicator, and releases the lock.

As is usual for LL/SC, locks are also released for an interrupt.
Cache line coherency invalidate to the locked line may be ignored
or held off for a period of time.

Hardware wise, as there are multiple updates possible,
it requires either a before or after image of the locked cache line,
depending on whether you want optimistic or pessimistic commit.
Other than that it is basically the same as LL/SC.
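
As a sketch of how the three might be used together (ll/stch/stcr are
the proposed mnemonics above, not real instructions, and the sc-style
operand order is an assumption), here is a 16-byte update of two
adjacent 8-byte fields within one line:

    retry:
        ll    t0, 0(a0)      # load and lock the line
        ll    t1, 8(a0)      # same line, so the lock is retained
        addi  t0, t0, 1      # compute both new values
        addi  t1, t1, 1
        stch  t2, t0, 0(a0)  # store first field, keep holding
        bnez  t2, retry      # lock already lost: retry
        stcr  t2, t1, 8(a0)  # store second field, release the line
        bnez  t2, retry      # a failure here presumably restores
                             #   the line's before-image, as above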

Now the question is, is there a practical use for that?
Remember, the reason x64 needs a DWCAS is because CAS is
susceptible to ABA while LL/SC is not.

EricP

Oct 31, 2018, 2:14:44 PM
EricP wrote:
>
> The DW-LL/SC needs 3 instructions and allows an atomic update
> for up to a whole single cache line:
> LL to load and lock a 4 or 8 byte value. If a second LL is to the
> same cache line, it retains the original lock, otherwise it
> releases the old line and locks the new line.
> It is the programmers responsibility to get it right.
>
> STCH is STore Conditional and Hold, checks the existing lock,
> and if still valid updates a 4 or 8 byte aligned value,
> returns a success/fail indicator, and retains the lock.
>
> STCR is STore Conditional and Release, checks the existing lock,
> and if still valid updates a 4 or 8 byte aligned value,
> returns a success/fail indicator, and releases the lock.

Oh, and an LR Lock Release instruction releases the cache line lock
if currently held, so as to release the cache line ASAP and unblock
anyone waiting for it. However as with LL/SC now, worst case
is the next timer interrupt will cancel the lock anyway,
or there may be an MMU internal clock counter to limit hold time.



Ivan Godard

Oct 31, 2018, 2:39:14 PM
Thank you; that's the implementation detail I was asking for.

Is it useful: yes, when you can contort your data structure to ensure
that everything you want to update is on the same line. The canonical
example of multi-update is insert to a bi-linked list, for which the
targets are arbitrarily disjoint so a one-line multi won't work.
However, in something like RB tree node rotations all the child pointers
of a node can be in the same line, so one-line multi would work so long
as the node was line aligned.

Of course, that line alignment requirement ties you into legacy
compatibility; produces obscure bugs; suffers from false sharing and
livelock; and leads to unnatural data structures that must be hand
arranged to conform. Personally, I consider a "not my problem; fix your
code" response to these issues to be an unfortunate attitude that I
interpret as "not my problem; buy something else".

As an alternative specification, consider:
1) read and write sets support at least eight arbitrarily disjoint
references (for updating complex structures)
2) locking has byte granularity (removes false sharing)
3) locking can nest (for good program structuring)
4) exponential backoff on failure (avoids livelock ping-ponging)

Of course, any implementation that does not conform to this
specification is buggy :-)

MitchAlsup

Oct 31, 2018, 2:41:41 PM
On Tuesday, October 30, 2018 at 10:22:18 PM UTC-5, lk...@lkcl.net wrote:
> On Monday, July 2, 2018 at 3:26:47 AM UTC+1, MitchAlsup wrote:
>
> > > answer is obvious. Instead of actually locking a cache location, we just give it
> > > a higher priority, so that if congestion does occur, a suitable process is
> > > invoked to clear things up in a relatively intelligent manner.
> >
> > I came to the conclusion that::
> > a) one wanted to be able to denote several cache lines as participating in
> > an ATOMIC event
> > b) if and when the cache lines did become present, one needed to
> > delay snoops upon then so that the ATOMIC event could transpire.
> >
> > So there are multiple events HW can monitor to decide if one
> > ATOMIC event takes place or not. In the case nothing happens
> > the ATOMIC envent happens. In the case interference happens the
> > ATOMIC event fails to a knowable control point.
> >
> > The My 66000 ISA defines all of this crap.
>
> awesome.
>
> so (hello, my first post on this newsgroup)
>
> a discussion came up on the RISC-V ISA-dev list:
> https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/1wqTcDTU2Sg
>
> where one of the participants is asking if RISC-V can be extended
> to cover a minimum of 16-byte atomic reservations, particuarly
> given how many fundamental algorithms become far more efficient
> when such an atomic primitive is available.

Since I am the inventor, I have more insight into this kind of mess.
>
> at present, RISC-V only has a single (64-bit) load/store-reservation
> capability, the semantics of which is that the SC succeeds if and
> only if there was no other SMP core / hardware-thread that requested
> the exact same memory address in the intervening time. it is
> intended that a loop occur around this testing of the return result
> from SC, retrying the LR/SC batch until it succeeds.
>
> the question therefore came up, well, is it reasonable to have
> multiple consecutive LR instructions make multiple cache-line
> reservations, followed by multiple consecutive SC instructions?
> and for the semantics to be "all-or-nothing" on a global basis.
> i.e. a sequence of LR reservations are made, and if another
> SMP core / hardware thread makes a reservation on *any* of that
> batch, then *all* LRs are invalidated (not just the one).

Sort of:: let me explain::

I found it necessary to mark load instructions as participating or not
participating in an ATOMIC event. Participating loads (and prefetches)
are monitored for interference, non-participating loads are performed
normally.

I found it useful to only mark the last participating store to denote
the end of the ATOMIC event.

Then in the middle, I found it almost mandatory to have a branch/predicate
instruction that sampled the state of the interference upon participating
cache lines (e.g., memory locations).

The above enable constructing ATOMIC events with several cache lines of
data being ATOMIC.

There are some additional details: the first participating instruction
becomes the ATOMIC control point. Should interrupts, exceptions, or
interference happen, control is transferred to the ATOMIC control point
before the interrupt or exception is handled. When the interference-
checking branch is encountered, its target (failure point) becomes the
ATOMIC control point. Thus, there is a control point at all times
where one can take interrupts, exceptions, or deal with interference.

I found I had to go one step further and when an ATOMIC event failed
there is a cause recorded in the thread state which can be used to
decrease the probability of interference in future ATOMIC attempts.
To my knowledge, no one else has gotten this far (preventing future
interference.)

Also note: since an exception causes control transfer to the ATOMIC
control point and since a breakpoint trap is an exception, you CANNOT
single step through an ATOMIC event. Either all of the participating
instructions execute, or none of the participating instructions execute.

>
> the same looping would therefore be required: it would just be
> the case that several memory addresses in completely different
> locations could be reserved simultaneously.
>
> i would be interested to hear peoples' thoughts on whether this
> is practical to implement, and, more importantly, if it would
> actually be useful.

It is. Some guys at AMD (?Germany?) ran some simulations and found useful
gains from a slightly inferior model to the above.
>
> in particular, the fact that now *multiple* things happen inside
> the multi-LR, multi-SC loop has me concerned, as i do not have
> enough experience in algorithms to say if that is a problem.

One has to preserve the illusion that no interested third party may see
any intermediate memory (or register) state. In practice, this means
that any exception or interrupt has to cause the ATOMIC event to FAIL.
>
> greatly appreciated peoples' insights.
>
> l.

This "means" is sufficient to construct ::
a) Test and set
b) Test and test and set
c) compare and swap
d) compare double and swap double
e) remove element from doubly linked list
f) insert element in doubly linked list
g) move element from one location in DLL to another
All taking place in a single ATOMIC event.

Each of the above can be simple pointers, or pointers with additional
information (cursors). The limitations are that each unit of participating
memory does not cross a cache line boundary and can be monitored with
a single snoop-sensitive comparator, without things crossing line, page,
or memory-section boundaries (MTRRs, for example).
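
Purely for illustration, here is roughly what (e) could look like under
this model (the lr.d/sd/sc.d spellings are borrowed from RISC-V for
readability and are NOT My 66000 encodings; per the description above,
the participating loads set up monitoring and the last store commits):

    # remove node n (address in a0) from a doubly linked list;
    # assumed layout: offset 0 = next, offset 8 = prev;
    # empty-list/end-of-list checks omitted
    retry:
        addi  t0, a0, 8      # t0 = &(n->prev)
        lr.d  t1, (a0)       # participating load: t1 = n->next
        lr.d  t2, (t0)       # participating load: t2 = n->prev
        addi  t3, t1, 8      # t3 = &(n->next->prev)
        lr.d  t4, (t3)       # values unused: these loads exist only
        lr.d  t5, (t2)       #   to make the two lines participate
        sd    t2, 0(t3)      # n->next->prev = n->prev (held back)
        sc.d  t6, t1, (t2)   # n->prev->next = n->next; the last
        bnez  t6, retry      #   store commits the whole event

Four participating locations, each fitting the one-cache-line,
one-comparator constraint above.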

I have a 20-page description of the matter for your amusement.

MitchAlsup

Oct 31, 2018, 2:54:06 PM
My 66000 ISA can perform this.
>
> (2) there's a "commit" needed i.e. that there should be some action
> which does not take place until the group's end is detected.

I express this commit as a branch instruction and then mark the last
instruction of the ATOMIC event.
>
> (3) as the end cannot be detected, another additional instruction is
> needed to indicate the start of a new group

Only a marker is required.
>
> if so: many apologies: fundamentally, i feel that this is a
> misunderstanding of the way that LR/SC works and also of the
> multi-LR/SC proposal.

One can compose LL and SC within the My 66000 ISA.
>
> my understanding of LR/SC is that it does not pause (place in a
> buffer pending an atomic completion condition) anything at all. LR
> simply makes a "reservation" on a memory address (exactly as mitch
> outlines for the 66000 ISA). that's all. the (potential) difference
> being between what mitch outlines and what LR/SC does: that
> reservation says, "if any other SMP core / hart tries to make a
> similar "reservation" on that address, invalidate *MY* reservation".

This is a reasonable understanding. The problem is that one cannot directly
extend this model without backing up a few steps.
>
> there's no cacheing, no buffering, no "operations placed into a queue
> which will not happen if a certain condition is not met" of any kind.
>
> so allow me to go through (1) to (3) above, first in the context of
> LR/SC (standard RISC-V) and then in the context of a multi-variant
>
> first, standard RV:
>
> (1) there is only one LR, there is only one SC. the standard RV
> specification specifically states that if there are two LR
> instructions (without a matching SC in between), the second LR
> *INVALIDATES* the first.
>
> (2) LR/SC fundamentally by design does not need a "commit". it is
> quite literally implemented as "mark this address in the cache with
> the SMP core identifier".

SC is (IS) the commit.
>
> (3) there's only one item in the "group", and the specification has
> specific rules that reset the reservation (singular) back to "known
> reset state" directly after the SC. therefore there is no extra
> instruction needed, "start a new group".
>
> now the proposed variant:
>
> (1) there may be multiple consecutive LRs: there will therefore need
> to be multiple matching "reservation" entries, just as there are with
> mitch's 66000 ISA. thus, if there are multiple LRs (let us say N), it
> is reasonable to expect that the code follows up with mutltiple SCs
> (let us REQUIRE N).

Only the last participating store needs to be SC; the LLs mark the
cache lines as participating, and all participating cache lines are
monitored (snooped) for interference. There is ONE control point
where control is transferred if/when certain kinds of interference
are detected. Not all interference is necessarily detrimental to all
steps inside an ATOMIC event.
>
> thus, with the REQUIREMENT that the number of SCs match the number of
> LRs, i believe that the issue that you raise is moot. an additional
> requirement could hypothetically be added that the LRs be consecutive
> (macro-op fuseable), and likewise the SCs.

Completely unnecessary (UNNECESSARY) ! Really !!
>
> context-switches or traps would automatically invalidate the
> reservations (requiring that the code go once more round an
> "untrapped" loop, once the context-switch / trap returns)

And you forgot the transfer of control to the control point
which is necessarily <topologically> outside of the ATOMIC event.
>
> (2) multi-LR/SC again, building on single LR/SC, would not need a
> "commit" phase. this is, again, by design. the multi-LR reserves
> multiple addresses in the cache. if another SMP core /
> hardware-thread is detected to be trying to reserve ANY them, *ALL* of
> them are invalidated (atomically, for both the invalidator AND the
> invalidee).
>
> there *is* no action/data that is buffered/stored in a queue, there
> *is* nothing to "commit".

You are lost in minutiae at this point.
>
> (3) the end of the group is detected by the number of multi-issued LRs
> precisely and exactly matching the number of multi-issued SCs. thus,
> everything remains in sync.
>
> it may help to add to a specification that if the following occurs:
>
> LR LR LR LR
> SC SC SC
> LR
>
> that this will INVALIDATE the global group (there is one last of four
> SCs outstanding in the example above at the time of the 5th LR), and
> that, furthermore, this is clearly a programming error, given that the
> number of SCs does not match up with the number of LRs.

I tried to do pipelined LLs and SCs and failed; the above is where I
arrived after figuring out what is inherently mispartitioned with
pipelined LL and SC.
>
>
> the thing about LR/SC is that the atomic operation is achieved...
> eventually. it may not be achieved immediately. the LR/SC is just a
> "signalling" system - a communication system between SMP cores /
> hardware-threads - which indicates whether, during the operation that
> people WANT to be atomic, there happened to be another core which did
> something that could have interfered with that "wish".

This is why My 66000 ISA provides strong guarantees of forward progress
when used properly.
>
> and the simplest way to do that is: reserve some memory address(es),
> alter them, and check *AFTER* the fact if anyone else tried to modify
> the exact same address. if "yes", repeat the exact same operations
> until they *do* succeed.

This is optimistic. The problem is that ATOMIC events are timed at
memory not at the CPU, so the CPU is the least likely place one can
make forward progress on ATOMICs. One has to be able to "effectively
program" the CPU miss buffers.
a) these things are set up to be snooped when external requests arrive
b) they are set up to interact with the Load Buffers and Store Buffers
c) we can amalgamate the "interference happened" across all of them
easily,
d) we can set up a branch to sample this interference
e) then there are 2 more units of magic you need to make it work system wide.

MitchAlsup

Oct 31, 2018, 3:03:41 PM
On Wednesday, October 31, 2018 at 3:16:16 AM UTC-5, lk...@lkcl.net wrote:
> On Wednesday, October 31, 2018 at 7:43:51 AM UTC, Ivan Godard wrote:
>
> > > stepping *outside* of the specification is, i feel, beyond the scope
> > > of discussion.
> >
> > Then you have also stepped out of the domain of architecture, which must
> > concern itself with the reality of physics and users, including failing
> > and malicious ones. This is an architecture board, BTW
>
> ok, so allow me to clarify what i care about, so as to illustrate
> precisely and exactly what i mean by "stepping outside of the
> scope / specification"
>
> * there is a specification (finalised and implemented in the case
> of RV-Base; proposed in the case of Multi-LR/SC)
> * if there exists a design or other flaw that allows an arbitrary
> user to execute malicious code or adversely affect anything other
> than their own program, i care.
> * if there exists a design or implementation flaw in the compiler
> that allows the same, i care.
> * if the USER does NOT CONFORM TO THE SPECIFICATION, stepping OUTSIDE
> of that specification, and does NOT cause malicious damage, but
> solely and exclusively causes their OWN (one) program to behave
> in unexpected and unanticipated ways, i genuinely do not give a
> rat's posterior.
>
The My 66000 ISA can detect improperly formed ATOMIC events and
cause them to fail even in the absence of interference. This means
one can check ATOMIC stuff on single-threaded applications, and
if they fail there is something wrong with the source code.

ATOMIC events have a structure associated with them that the HW checks.

> so yes of course (and i apologise for assuming that it would go
> without saying), if the user manages to corrupt other peoples' data,
> that's a problem. but if the sole outcome of failing to comply
> with the specification is that their own program crashes? Not Our Problem.
>
> > >> Consider what happens if you enter your loop with already existing
> > >> dangling LRs from some unrelated code that branched out in the middle?
> > >
> > > that would be a bug in the program (or, much less likely, the compiler).
> >
> > And your specification excludes bugs :-)
>
> nooo, i'm *asking* if there are any bugs in it. and also if there are
> any real-world algorithms that would benefit from a multi-LR/SC
> instruction.
>
Shitloads, especially if one can prevent ATOMIC interference.
>
> > > what would be a much more useful thing to do, i feel, would be to
> > > demonstrate that no matter what, there does not exist any possible
> > > multi-LR/multi-SC proposal or specification that could possibly be
> > > useful, even if complied with correctly by both compiler and
> > > application(s).
> > >
> > > also i feel that the next most useful thing to do would be to
> > > demonstrate flaws in the current proposal, listing possible weaknesses
> > > and caveats that require implementors at both the hardware and
> > > software level to take special precautions.
> > >
> > > for example: this is why i mentioned that any context-switches or
> > > traps would automatically invalidate the entire list of reservations,
> > > as there is no way that atomicity can easily or simply be guaranteed
> > > if an execution thread leaves its current context.
> > >
> > > does that sound reasonable?
> >
> > Depends on what you hope to get out of it. If you hope to define
> > something that could be implemented and used by people unconnected with
> > your effort then no, it is not useful.
>
> respectfully, there are about three separate independent enquiries
> over the past few months, for how >=16-byte atomic operations could
> be added to RISC-V, so i am (informally, and independently of those
> sources) making enquiries here.

This is where my dive into ATOMIC things started:: "Why can we not give
MS compare double swap double"? It is not 16-byte ATOMICs, it is
pairs of 8-byte ATOMICs that everyone wants!!
>
> those three independent sets of interest would tend to indicate that
> the statement you make is an assumption. apologies for having to
> point that out.
>
> > Broadly, you presume to define that a set of updates can be made atomic
> > as a group,
>
> no, i do not believe that i am, as from how i understand it, that's
> not how LR/SC works

LL/SC works in restricted situations. If you want to expand LL/SC to
cover the general case, you have to step back a bit and then rebuild from there.
>
> LR/SC allows the program to *detect* if anything
> (specifically, another SMP core or hardware-thread) *interfered* with
> the updates, such that *after* the updates have been carried out and
> the interference [guaranteed to have been] detected, the operation
> may be repeated again and again until such time as it is detected that
> no interference occurred.

Which is why you cannot address the exponent of interference with LL/SC.
>
> the single-LR/SC case has been successfully implemented by multiple RISC-V
> vendors, and it does have full, proven, working compiler support in
> gcc, as well as libc6 (glibc) support.
>
> so the question is: can a multi-LR/SC with multiple memory reservations
> just like mitch's 66000 ISA work as well?
>
It works much better than you anticipate. But it does cause people to
have to write ATOMIC events differently. The performance is such they
will eventually want to write them differently.

MitchAlsup

unread,
Oct 31, 2018, 3:05:43 PM10/31/18
to
On Wednesday, October 31, 2018 at 3:17:44 AM UTC-5, Bruce Hoult wrote:
> On Tuesday, October 30, 2018 at 8:22:18 PM UTC-7, lk...@lkcl.net wrote:
> > at present, RISC-V only has a single (64-bit) load/store-reservation
> > capability, the semantics of which is that the SC succeeds if and
> > only if there was no other SMP core / hardware-thread that requested
> > the exact same memory address in the intervening time. it is
> > intended that a loop occur around this testing of the return result
> > from SC, retrying the LR/SC batch until it succeeds.
>
> That's not quite correct.
>
> Another hardware thread using LR on a sufficiently similar address will also steal the reservation and make the SC fail. The parameters of "sufficient" are not defined by the architecture: it could be the exact same aligned word address, it could be the same cache line, it could be the same VM/TLB page, or anything else. The SC only has to be to some address in this range, not necessarily the identical address:

This is why the "AHY" register is in My 66000 ISA. When an ATOMIC event fails
WHY records why the event failed. negative numbers indicate things like
interrupt, spurious, buffer overflow,... positive numbers are a count of
interfering accesses. This can be used to decrease the amount of future interference.
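
As a sketch of how that count might be used (the esmWHY() accessor and
the helper names are invented; only the WHY register itself is in the ISA):

    int why = esmWHY();              // invented accessor for WHY
    if( why > 0 )                    // count of interfering accesses
        Delay( Random( why * T ) );  // back off in proportion
    else
        HandleFailure( why );        // interrupt, spurious, overflow, ...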

MitchAlsup

unread,
Oct 31, 2018, 3:12:22 PM10/31/18
to
In many ways CISCs are easier here: one can simply lob in an instruction
of near-arbitrary complexity {Compare double cursor swap double cursor}
and have microcode perform the steps that My 66000 ISA provides to the user
as thread instructions. In the My 66000 case, HW makes sure the ATOMIC
programming model is obeyed.
>
> does a multi-LR/SC go even further than that, and impose on the
> user even beyond the usual contract that RISC-based systems expect?
> honestly i have no idea.
>
> > >>>> Consider what happens if you enter your loop with already existing
> > >>>> dangling LRs from some unrelated code that branched out in the middle?
> > >>>
> > >>> that would be a bug in the program (or, much less likely, the compiler).
> > >>
> > >> And your specifications excludes bugs :-)
> > >
> > > nooo, i'm *asking* if there are any bugs in it. and also if there are
> > > any real-world algorithms that would benefit from a multi-LR/SC
> > > instruction.
> > >
> >
> > There are no bugs in it, because you have defined them away.
>
> great! i can get implementing, straight away! [i'm kidding...
> simulations first... actually, corroborative constructive
> input and review first... ]
>
> > And
> > essentially all programs with any parallelism at all, which could use a
> > concurrent atomic update. So far you have not shown that multi-LR/SC (as
> > you have described it) achieves concurrent atomic update.
>
> *it* does not. a loop - a *sequence* of instructions - *uses* LR/SC to
> test *whether* concurrent atomic update[s] have [retrospectively]
> *been* achieved. this is completely different.

And exactly why it is inefficient.
You keep asking whether forward progress has been achieved.
>
> i've mentioned this subtle distinction about three times, now, and
> each time you have not indicated that you understand this distinction,
> and, unfortunately, continue to use the same [ambiguous, unclear]
> language that insists, most unfortunately, that "i am saying that
> multi-LR/SC *is* an atomic update", when it most certainly is not.
> i have *never* said that "it" is.

To preserve the illusion of ATOMICity all of the participating memory
locations cannot be seen in any state other than "before the event" or
"after the event". This smells like atomic update to me.
>
> > Oh, everyone would like multi-atomic operations. And they are easy to
> > do: just set a Whopping Big Lock (TM), do everything, and reset it.

This just moves the problem and makes it worse.

MitchAlsup

unread,
Oct 31, 2018, 3:17:47 PM10/31/18
to
On Wednesday, October 31, 2018 at 6:30:07 AM UTC-5, lk...@lkcl.net wrote:
> On Wednesday, October 31, 2018 at 10:51:01 AM UTC, Bruce Hoult wrote:
>
> > To be fair to Luke, multi-LLSC (in fact multiple LLs terminated
> > by a single SC) is something Mitch Alsup describes as being part
> > of his "MY66000" design in an earlier message in this
> > six month old thread, as well as on other occasions in
> > comp.arch, and of course in the draft ISA manual that he
> > distributes to interested people).
>
> ah! in not having full context, i had been unable to determine
> from the reading the full thread that mitch's design *is* already
> multi-LR/SC: i thought it was more exotic.

It is, it enables the SW programmer to avoid <future> interference.
>
> great! hm, fascinating to note (hello mitch) that only a single
> SC would be needed to clear the entire group of LRs. obvious
> in retrospect.
>
> > If your newsreader doesn't easily support it, you might refer to:
> >
> > https://groups.google.com/forum/#!original/comp.arch/QVl3c9vVDj0/MHSdrebCAQAJ
> > https://groups.google.com/forum/#!original/comp.arch/QVl3c9vVDj0/UYfENqMZAgAJ
> > https://groups.google.com/forum/#!original/comp.arch/QVl3c9vVDj0/k85oGGzcAAAJ
>
> ok that's really helpful to have specific references.
>
> so that leaves me wondering *how* multi-LR/SC (multi-LLSC) would
> be used?

BOOLEAN DCAS( unsigned oldp, unsigned oldq,
              unsigned *p,   unsigned *q,
              unsigned newp, unsigned newq )
{
    unsigned t = esmLOCK( *p );   // first participating load
    unsigned r = esmLOCK( *q );   // second participating load
    if( t == oldp && r == oldq )
    {
        *p = newp;
        esmLOCK( *q = newq );     // terminating store commits the event
        return TRUE;
    }
    return FALSE;
}

which compiles to::

DCAS:
    LDD   R1,[Rp]:LOCK
    LDD   R2,[Rq]:LOCK
    CMP   R3,R1,Rop
    CMP   R4,R2,Roq
    PEQ   R3,ATTF
    PEQ   R4,TTF
    STD   Rnp,[Rp]
    STD   Rnq,[Rq]:unLOCK
    MOV   R1,TRUE
    MOV   R1,FALSE
    JMP   R0
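
A typical use is a retry loop around the primitive (a sketch only; the
head/version pair and the retry policy are invented, not part of the ISA):

    unsigned h, v;
    do {                             // advance a {head,version} pair
        h = head;
        v = version;
    } while( !DCAS( h, v, &head, &version, h + 1, v + 1 ) );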

>
> and also: would a multi-SC actually be useful (now that i
> understand the difference, where 66000 only does a single SC
> to reset an array of LLs).
>
> here is what the RISC-V spec says [about single-LR/single-SC):
>
> "LR loads a word from the address in rs1, places the
> sign-extended value in rd, and registers a reservation
> on the memory address and a range of bytes
> including at least all bytes of the addressed word.
>
> SC writes a word in rs2 to the address in rs1,
> provided a valid reservation still exists on that
> address. SC writes zero to rd on success or a nonzero
> code on failure."
>
> so here, we detect "potential issues" with multi-LR/multi-SC,
> where it would only be the test of the very last of the multi-SC
> sequence that could possibly be trusted, because only when
> that very last one had stored (and the answer came back,
> "yes the store succeeded") would you be able to conclude that
> *all* stores succeeded.

You have to put the detection between the inbound participating references
and the outbound participating references. B E T W E E N !
>
> the problem is illustrated with the following sequence:
>
> * LR 1
> * LR 2
> * LR 3
> * LR 4
> * SC 1 (succeeds)
> -> HART 5 issues LR on address LR2
> * SC 2 (FAILS)
>
BOOLEAN MoveElement( Element *fr, Element *to )
{
    Element *fn, *fp, *tn;

    esmLOCK( fn = fr->next );      // participating loads
    fp = fr->prev;
    esmLOCK( tn = to->next );
    esmLOCK( fn );                 // mark the remaining lines as participating
    esmLOCK( fp );
    esmLOCK( tn );
    if( !esmINTERFERENCE() )
    {
        fp->next = fn;             // unlink fr from its old list
        fn->prev = fp;
        to->next = fr;             // splice fr in after to
        tn->prev = fr;
        fr->prev = to;
        esmLOCK( fr->next = tn );  // terminating store commits the event
        return TRUE;
    }
    return FALSE;
}

which compiles to ::

MoveElement:
    LDD   Rfn,[Rfr+next]:LOCK
    LDD   Rfp,[Rfr+prev]
    LDD   Rtn,[Rto+next]:LOCK
    PRE   [Rfp]:LOCK:R:W
    PRE   [Rfn]:LOCK:R:W
    PRE   [Rtn]:LOCK:R:W
    PIN   TTTTTTTF
    STD   Rfn,[Rfp+next]
    STD   Rfp,[Rfn+prev]
    STD   Rfr,[Rto+next]
    STD   Rfr,[Rtn+prev]
    STD   Rto,[Rfr+prev]
    STD   Rtn,[Rfr+next]:unLOCK
    MOV   R1,TRUE
    MOV   R1,FALSE
    JMP   R0

MitchAlsup

unread,
Oct 31, 2018, 3:22:21 PM10/31/18
to
This is the accepted (and in my opinion naive) way of looking at the problem.
You want the interference detection to transpire between the inbound
participating references and the outbound participating references, NOT
AFTER !

If you can detect that::
a) all participating lines are present,
b) none have been interfered with,
you are in a position to NAK snoop requests.
Normally this would be a violation of the cache coherence protocol, but it
is KEY to making this stuff work. The NAK provides time for the request
to be recirculated, giving the ATOMIC event time to complete, strengthening
the guarantees of forward progress.
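
In pseudocode, the snoop-response rule amounts to this (a sketch of the
decision only; all names are invented):

    // on an inbound snoop for line L during an ATOMIC event
    if( all_lines_present && !interference_seen && participating( L ) )
        respond( NAK );          // request recirculates; event completes
    else
        respond_normally( L );   // ordinary coherence behaviour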

Otherwise you are shooting fish in a barrel.

MitchAlsup

unread,
Oct 31, 2018, 3:24:52 PM10/31/18
to
On Wednesday, October 31, 2018 at 7:32:07 AM UTC-5, Ivan Godard wrote:
> On 10/31/2018 3:50 AM, Bruce Hoult wrote:
> > On Wednesday, October 31, 2018 at 2:18:26 AM UTC-7, Ivan Godard wrote:
> >> More circularity then: if a context switch invalidates the primitive,
> >> how can it implement the switch without invalidating itself? Can you
> >> guarantee absence of livelock? Show your work.
> >
> > At least on RISC-V there is no such primitive as a "context switch". The closest thing is the return-from-exception instructions MRET, SRET, URET. An implementation is permitted but not required to clear an LR/SC reservation upon executing an exception return. Portable software should explicitly clear any reservation if required (e.g. if the exception return is actually part of a task switch), for example by executing an SC to a dummy address.
> >
> > Reasons an implementation may choose not to clear a reservation on exception return are to allow debugger single-stepping or conditional breakpointing within an LR/SC sequence, or to allow trap-and-emulate on some operations within an LR/SC sequence. I guess software TLB refill might be another reason.
> >
> >
> >> The problem here is that you are thinking too abstractly. Sure, everyone
> >> would like to have a multi-LLSC. You can add one to your ISA definition.
> >> Whoopee ding dong. But if you haven't dropped under the abstraction and
> >> shown how your abstraction is (or at least could be) implemented in
> >> hardware then it is not "reasonable". Utility is irrelevant to that
> >> which is not technically and economically viable.
> >
> > To be fair to Luke, multi-LLSC (in fact multiple LLs terminated by a single SC) is something Mitch Alsup describes as being part of his "MY66000" design in an earlier message in this six month old thread, as well as on other occasions in comp.arch, and of course in the draft ISA manual that he distributes to interested people).
> >
> > If your newsreader doesn't easily support it, you might refer to:
> >
> > https://groups.google.com/forum/#!original/comp.arch/QVl3c9vVDj0/MHSdrebCAQAJ
> > https://groups.google.com/forum/#!original/comp.arch/QVl3c9vVDj0/UYfENqMZAgAJ
> > https://groups.google.com/forum/#!original/comp.arch/QVl3c9vVDj0/k85oGGzcAAAJ
> >
>
> Yes, I know Mitch's stuff. Note that Mitch's uses a "SC" as a commit op,
> not as a conditional store. That's fine - the problem is when there
> really are multiple real SCs and only half of them get done.

SC is a termination op (marker) and suffices when the ATOMIC event is minuscule::

unsigned char Lock( unsigned char *p )
{
    unsigned char t = esmLOCK( *p );   // participating load
    esmLOCK( *p = 0xFF );              // terminating store
    return t;
}


Lock:
    LDUB  R1,[Rp]:LOCK
    STB   0xFF,[Rp]:unLOCK
    JMP   R0

>
> Similarly Mill atomics and the IBM and Intel (when/if they get it
> working) flavors also have transactional semantics for multi-update. But
> they all have some means to bound the scope of the transaction.
> Count-up:count-down won't do because there's no way to distinguish
> commit from abort in the presence of buggy control flow, and it's
> subject to partial commit even without bugs.

My 66000 ISA allows only up to 8 lines to participate.

MitchAlsup

unread,
Oct 31, 2018, 3:29:14 PM10/31/18
to
Which is why My 66000 does not suffer from this: it was thought
out differently.
a) there are control points for all ATOMIC events
b) there is an optional second control point
b.1) that HW can use to change snoop responses
b.2) that HW can use to propagate interference BACKWARDS through the system
b.3) so the completable ATOMIC event has effective greater priority
than the starting ATOMIC event over there.
b.4) strengthening the guarantee of forward progress.

MitchAlsup

unread,
Oct 31, 2018, 4:48:42 PM10/31/18
to
On Tuesday, October 30, 2018 at 10:22:18 PM UTC-5, lk...@lkcl.net wrote:
> On Monday, July 2, 2018 at 3:26:47 AM UTC+1, MitchAlsup wrote:
>
> > > answer is obvious. Instead of actually locking a cache location, we just give it
> > > a higher priority, so that if congestion does occur, a suitable process is
> > > invoked to clear things up in a relatively intelligent manner.
> >
> > I came to the conclusion that::
> > a) one wanted to be able to denote several cache lines as participating in
> > an ATOMIC event
> > b) if and when the cache lines did become present, one needed to
> > delay snoops upon then so that the ATOMIC event could transpire.
> >
> > So there are multiple events HW can monitor to decide if one
> > ATOMIC event takes place or not. In the case nothing happens
> > the ATOMIC event happens. In the case interference happens the
> > ATOMIC event fails to a knowable control point.
> >
> > The My 66000 ISA defines all of this crap.
>
> awesome.
>
> so (hello, my first post on this newsgroup)
>
> a discussion came up on the RISC-V ISA-dev list:
> https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/1wqTcDTU2Sg
>
> where one of the participants is asking if RISC-V can be extended
> to cover a minimum of 16-byte atomic reservations, particularly
> given how many fundamental algorithms become far more efficient
> when such an atomic primitive is available.
>
> at present, RISC-V only has a single (64-bit) load/store-reservation
> capability, the semantics of which is that the SC succeeds if and
> only if there was no other SMP core / hardware-thread that requested
> the exact same memory address in the intervening time. it is
> intended that a loop occur around this testing of the return result
> from SC, retrying the LR/SC batch until it succeeds.
>
> the question therefore came up, well, is it reasonable to have
> multiple consecutive LR instructions make multiple cache-line
> reservations, followed by multiple consecutive SC instructions?
> and for the semantics to be "all-or-nothing" on a global basis.
> i.e. a sequence of LR reservations are made, and if another
> SMP core / hardware thread makes a reservation on *any* of that
> batch, then *all* LRs are invalidated (not just the one).
>
> the same looping would therefore be required: it would just be
> the case that several memory addresses in completely different
> locations could be reserved simultaneously.
>
> i would be interested to hear peoples' thoughts on whether this
> is practical to implement, and, more importantly, if it would
> actually be useful.
>
> in particular, the fact that now *multiple* things happen inside
> the multi-LR, multi-SC loop has me concerned, as i do not have
> enough experience in algorithms to say if that is a problem.
>
> greatly appreciated peoples' insights.
>
> l.

Would the RISC-V group like to hear a presentation on my ATOMIC mechanisms?

EricP

unread,
Oct 31, 2018, 4:52:46 PM10/31/18
to
lk...@lkcl.net wrote:
>
> my understanding of LR/SC is that it does not pause (place in a
> buffer pending an atomic completion condition) anything at all. LR
> simply makes a "reservation" on a memory address

Yes this is correct, but this approach only works with a single cache line.
Basically, LL translates the virtual address to physical, and fetches
that cache line using a ReadExclusive coherency message.
ReadExclusive causes an Invalidate coherency message to be sent to the
current owner, who writes the line back if modified, and sends an
Ack back to the home directory, which sends the line to the
new owner. (Optimizations on this sequence are possible.)
The line is loaded into the cache as normal,
and the MMU remembers the LL physical address as "pinned".

At this point the MMU can do several things.
It wants to give the LL/SC code sequence time to complete,
but it also has to acknowledge coherency messages ASAP
in order to not hang the whole SMP memory system.
To prevent coherency deadlock, there are 4 message queues:
outbound command, outbound reply, inbound command, inbound reply.

One way is to handle all inbound command messages, sending
their outbound reply messages, until an Invalidate command for
the LL line is seen, then it blocks the inbound command queue.
This effectively "pins" that cache line to this cache and
gives this cpu a chance to finish its sequence.
The lock flag is reset by any newer LL, normal load or store,
interrupts, exceptions, or maybe an MMU clock counter.

The SC checks if the lock flag in the MMU is still set;
if it is, it updates the cache line as usual and resets the flag,
which re-enables inbound command queue message processing
and unpins that cache line.
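
In pseudocode, the SC path just described looks like this (a sketch;
all names are invented):

    // SC: commit only if the reservation survived
    if( mmu.lock_flag && mmu.pinned == line_of( pa ) )
    {
        write_line( pa, value );        // update the cache line as usual
        mmu.lock_flag = FALSE;          // unpin the line
        unblock_inbound_commands();     // resume snoop processing
        return 0;                       // success
    }
    return 1;                           // reservation lost; software retries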

Note that blocking the inbound command queue means no messages
behind it get processed, and can effectively hang other cpu's
even though they are touching different cache lines.
Whether one can do more optimal message reordering depends
on the details of the coherency protocol and the rules for
atomic update ordering.

> so allow me to go through (1) to (3) above, first in the context of
> LR/SC (standard RISC-V) and then in the context of a multi-variant
>
> first, standard RV:
>
> (1) there is only one LR, there is only one SC. the standard RV
> specification specifically states that if there are two LR
> instructions (without a matching SC in between), the second LR
> *INVALIDATES* the first.
>
> (2) LR/SC fundamentally by design does not need a "commit". it is
> quite literally implemented as "mark this address in the cache with
> the SMP core identifier".

SC *IS* the commit as it completes the transaction if it is still valid.
Normal load, store, interrupts, exceptions, MMU counters trigger rollback.

For example, if a LL examines its value and decides not to SC,
then the lock is retained until one of the rollback triggers happens.
It is felt that a trigger will occur so soon that an explicit
LR lock release instruction would probably be redundant.

If one wants to allow normal load and store to NOT trigger a rollback
then a Lock Release serves a useful purpose to allow software to
indicate it is abandoning a lock without waiting for a timer.

> (3) there's only one item in the "group", and the specification has
> specific rules that reset the reservation (singular) back to "known
> reset state" directly after the SC. therefore there is no extra
> instruction needed, "start a new group".
>
> now the proposed variant:
>
> (1) there may be multiple consecutive LRs: there will therefore need
> to be multiple matching "reservation" entries, just as there are with
> mitch's 66000 ISA. thus, if there are multiple LRs (let us say N), it
> is reasonable to expect that the code follows up with mutltiple SCs
> (let us REQUIRE N).

It would be nice if normal loads and stores could also be intermixed
inside the transaction.

> thus, with the REQUIREMENT that the number of SCs match the number of
> LRs, i believe that the issue that you raise is moot. an additional
> requirement could hypothetically be added that the LRs be consecutive
> (macro-op fuseable), and likewise the SCs.

NO. An SC must be preceded by an LL, but an LL can have multiple SCs or none.

>
> context-switches or traps would automatically invalidate the
> reservations (requiring that the code go once more round an
> "untrapped" loop, once the context-switch / trap returns)
>
> (2) multi-LR/SC again, building on single LR/SC, would not need a
> "commit" phase. this is, again, by design. the multi-LR reserves
> multiple addresses in the cache. if another SMP core /
> hardware-thread is detected to be trying to reserve ANY them, *ALL* of
> them are invalidated (atomically, for both the invalidator AND the
> invalidee).

This is a try-fail-retry sequence, which is known to be *AT LEAST* N^2
cost, and can be N^4 cost when the coherency protocol is considered too.
It also does not guarantee forward progress which many people,
myself included, consider a requirement.

> there *is* no action/data that is buffered/stored in a queue, there
> *is* nothing to "commit".
>
> (3) the end of the group is detected by the number of multi-issued LRs
> precisely and exactly matching the number of multi-issued SCs. thus,
> everything remains in sync.

This would not allow transaction code to make decisions as to what to update
based on the content of other transaction variables.

Remember there are two (or three) types of transaction variables:
read-read consistent and read-modify-write consistent (and read-write).

> it may help to add to a specification that if the following occurs:
>
> LR LR LR LR
> SC SC SC
> LR
>
> that this will INVALIDATE the global group (there is one last of four
> SCs outstanding in the example above at the time of the 5th LR), and
> that, furthermore, this is clearly a programming error, given that the
> number of SCs does not match up with the number of LRs.
>
>
> the thing about LR/SC is that the atomic operation is achieved...
> eventually. it may not be achieved immediately. the LR/SC is just a
> "signalling" system - a communication system between SMP cores /
> hardware-threads - which indicates whether, during the operation that
> people WANT to be atomic, there happened to be another core which did
> something that could have interfered with that "wish".
>
> and the simplest way to do that is: reserve some memory address(es),
> alter them, and check *AFTER* the fact if anyone else tried to modify
> the exact same address. if "yes", repeat the exact same operations
> until they *do* succeed.

Again this is try-fail-retry without forward progress.

> absolutely no commits involved, absolutely no buffering of any kind
> involved, at all.

If there are multiple value updates and rollbacks are possible,
you _HAVE TO_ buffer.

MitchAlsup

unread,
Oct 31, 2018, 5:08:55 PM10/31/18
to
I found this mandatory if only for debug purposes. One cannot single step
through an ATOMIC event and retain any sense of ATOMICity. The mere act
of taking the breakpoint trap terminates the ATOMIC event.
>
> > thus, with the REQUIREMENT that the number of SCs match the number of
> > LRs, i believe that the issue that you raise is moot. an additional
> > requirement could hypothetically be added that the LRs be consecutive
> > (macro-op fuseable), and likewise the SCs.
>
> NO. An SC must be preceeded by a LL, but a LL can have multiple SC or none.
>
> >
> > context-switches or traps would automatically invalidate the
> > reservations (requiring that the code go once more round an
> > "untrapped" loop, once the context-switch / trap returns)
> >
> > (2) multi-LR/SC again, building on single LR/SC, would not need a
> > "commit" phase. this is, again, by design. the multi-LR reserves
> > multiple addresses in the cache. if another SMP core /
> > hardware-thread is detected to be trying to reserve ANY them, *ALL* of
> > them are invalidated (atomically, for both the invalidator AND the
> > invalidee).
>
> This is a try-fail-retry sequence, which is known to be *AT LEAST* N^2
> cost, and can be N^4 cost when the coherency protocol is considered too.
> It also does not guarantee forward progress which many people,
> myself included, consider a requirement.

Now, would you like to see how to make it go to BigO( ln(n) ) where n is the
number of contenders?
Right, no <even soft> guarantees of forward progress.
>
> > absolutely no commits involved, absolutely no buffering of any kind
> > involved, at all.
>
> If there are multiple value updates and rollbacks are possible,
> you _HAVE TO_ buffer.

Yes you have to buffer, but most pipelines have resources already available
that can be used as the buffers at very low cost (essentially zero).

Stephen Fuld

unread,
Oct 31, 2018, 6:49:30 PM10/31/18
to
On 10/30/2018 11:33 PM, Ivan Godard wrote:

big snip


> All you say is true if the program is well-behaved and LLs always match
> SCs. My programs do not behave, even when I pay them well.

Really??

I don't pay my programs much, but they are almost always well behaved in
the sense that they do exactly what I tell them to do. :-)

The problem is that they rarely do what I meant to tell them (or perhaps
should have told them) to do, but didn't. :-(

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Bruce Hoult

unread,
Oct 31, 2018, 7:23:49 PM10/31/18
to
On Wednesday, October 31, 2018 at 1:48:42 PM UTC-7, MitchAlsup wrote:
> Would the RISC-V group like to hear a presentation on my ATOMIC mechanisms?

I believe so. I certainly would! Assuming you don't mind if other people use it.

RISC-V currently has simple LL/SC and a set of Atomic Memory Operations that take a register, perform some operation between it and a memory location (swap, add, and, or, xor, min[u], max[u]) and return the old memory value. The TileLink bus supports the same operations, allowing them to be pushed out and executed by L2 cache, cache on another CPU, peripheral devices, or whatever.

There has always been an intention to in future add some more powerful atomic facility, and the "T" ISA extension letter (for "Transactional") has been reserved for this. I think your idea (with well thought-out microarchitectural mechanisms) could well fulfil the intent of this.

I'm sure SiFive would be happy to host, or at a user-group meeting.

By the way, have you read (SiFive co-founder) Yunsup Lee's PhD thesis? It has some ideas you might recognise, such as:

- automatic compiler analysis of SIMT code (e.g. CUDA/OpenCL) to determine operations that can be executed in scalar context instead of vector

- shared registers for the above

- blocks of code that are guaranteed to execute to completion

- a small set (8?) of very temporary registers that can be used within such a block, but not between blocks.

- an "iota" instruction that is very useful for generating effectively SIMT thread IDs across the lanes of a vector. This has been retained in the RISC-V vector extension and enhanced in that it can also count set (or cleared) predication mask bits. I'm happy about this because I found the lack of it very annoying in my AltiVec/NEON programming.

MitchAlsup

unread,
Oct 31, 2018, 7:25:42 PM10/31/18
to
On Wednesday, October 31, 2018 at 5:49:30 PM UTC-5, Stephen Fuld wrote:
> On 10/30/2018 11:33 PM, Ivan Godard wrote:
>
> big snip
>
>
> > All you say is true if the program is well-behaved and LLs always match
> > SCs. My programs do not behave, even when I pay them well.
>
> Really??
>
> I don't pay my programs much, but they are almost always well behaved in
> the sense that they do exactly what I tell them to do. :-)

Once you start considering parity and correctable errors occurring in HW
you get to the point you never trust anything in HW (long term).
>
> The problem is that they rarely do what I meant to tell them (or perhaps
> should have told them) to do, but didn't. :-(

This is why no computer was ever implemented with a GJIL instruction
(Go Jump In the Lake instruction.) Because if they had been so implemented
they would all be in the lake by the time random instruction verification
was complete.

Chris M. Thomasson

unread,
Nov 1, 2018, 12:17:04 AM11/1/18
to
On 10/31/2018 4:25 PM, MitchAlsup wrote:
> On Wednesday, October 31, 2018 at 5:49:30 PM UTC-5, Stephen Fuld wrote:
>> On 10/30/2018 11:33 PM, Ivan Godard wrote:
>>
>> big snip
>>
>>
>>> All you say is true if the program is well-behaved and LLs always match
>>> SCs. My programs do not behave, even when I pay them well.
>>
>> Really??
>>
>> I don't pay my programs much, but they are almost always well behaved in
>> the sense that they do exactly what I tell them to do. :-)
>
> Once you start considering parity and correctable errors occurring in HW
> you get to the point you never trust anything in HW (long term).

Yikes! ;^)

Chris M. Thomasson

unread,
Nov 1, 2018, 12:19:18 AM11/1/18
to
On 7/1/2018 1:09 PM, Rick C. Hodgin wrote:
> In designing my ISA, I've come to realize that there should be an atomic
> XCHGN instruction that allows a memory-to-memory exchange locked at the
> hardware level.
>
> mov rdi,left_value
> mov rsi,right_value
> lock xchg byte_count ; Exchange N bytes
>
> Why doesn't this exist? Is it relegated to something like enforcement
> through OS-level semaphores?
>

XADD is the one true instruction... Heck, we can create some neat things...

https://groups.google.com/forum/#!topic/lock-free/acjQ3-89abE

<pseudo code, membars aside for a moment>
______________________________________________
struct cell { uint32_t ver; double state; };

uint32_t head = 0;
uint32_t tail = 0;
cell cells[N]; // N must be a power of 2

void init() {
    for (uint32_t i = 0; i < N; ++i) cells[i].ver = i;
}

void producer(double state) {
    uint32_t ver = XADD(&head, 1);             // claim a unique ticket
    cell& c = cells[ver & (N - 1)];
    while (LOAD(&c.ver) != ver) backoff();     // wait for our turn on the cell
    c.state = state;
    STORE(&c.ver, ver + 1);                    // hand off to the consumer
}

double consumer() {
    uint32_t ver = XADD(&tail, 1);             // claim a unique ticket
    cell& c = cells[ver & (N - 1)];
    while (LOAD(&c.ver) != ver + 1) backoff(); // wait for the producer
    double state = c.state;
    STORE(&c.ver, ver + N);                    // recycle the cell, next round
    return state;
}
______________________________________________

XADD, the all holy and wonderful. CAS, argh... ;^)
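
The nice property is that each XADD hands its caller a unique ticket, so
slot assignment itself never retries; only the per-cell handoff spins.
Usage is simply (any number of threads on either side):

    producer( 3.14 );        // from any producer thread
    double v = consumer();   // from any consumer thread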

Terje Mathisen

unread,
Nov 1, 2018, 2:47:47 AM11/1/18
to
MitchAlsup wrote:
> On Wednesday, October 31, 2018 at 7:21:00 AM UTC-5, Terje Mathisen
>> In order to be atomic it is pretty obvious that you can have
>> arbitrary many setup instructions (with monitoring of interference
>> starting at the first of them) but you need a single SC to either
>> commit or discard everything, otherwise you get into Ivan's example
>> of having a ladder of paired LL/SC instructions and no way to
>> handle a trap/error halfway through the SC array.
>
> This is the accepted (and in my opinion naive) way of looking at the
> problem. You want the interference detection to transpire between the
> inbound participating references and the outbound participating
> references, NOT AFTER !

I did write that monitoring of interference had to start as soon as the
first line was marked.
>
> If you can detect that:: a) all participating lines are present, b)
> none have been interfered, you are in a position to NAK snoop
> requests. Normally this would be a violation of the cache coherence
> protocol, but it is KEY to making this stuff work. The NAK provides
> time for the request to be recirculated, giving the ATOMIC event time
> to complete, strengthening the guarantees of forward progress.

We're in violent agreement here: the best (only?) way to guarantee (or
at least optimise for) forward progress is by having some help from the
cache protocols:

In order to update N cache lines atomically, we must be able to force
all of them to Exclusive state at the same time. This is far easier to
do if we could delay any requests to take over individual lines for at
least a short while.
>
> Otherwise you are shooting fish in a barrel.

:-)

Terje Mathisen

unread,
Nov 1, 2018, 3:09:27 AM11/1/18
to
MitchAlsup wrote:
> On Wednesday, October 31, 2018 at 3:52:46 PM UTC-5, EricP wrote:
>> This is a try-fail-retry sequence, which is known to be *AT LEAST*
>> N^2 cost, and can be N^4 cost when the coherency protocol is
>> considered too. It also does not guarantee forward progress which
>> many people, myself included, consider a requirement.
>
> Now, would you like to see how to make it go to BigO( ln(n) ) where n
> is the number of contenders?

This is the place where I suggested (a couple of years ago?) to extend
locks to a tree structure, where each thread would use a hash of its
PID/TID to select which leaf lock to try.

Only after getting that initial lock would it be legal to try to get the
primary lock, so this way we get a sqrt(N) reduction in the number of
threads contending for any individual lock, but at a cost of making it
twice as expensive to get access to a lock even when there is no contention.

This can of course be extended to even more levels if you want to,
moving it from O(sqrt(N)) to O(log(N)), but with a linear increase in
time and up to N individual locks needed, so having just a single extra
level is probably the most useful approach.

It is also possible to atomically switch such a structure to/from a
single-level lock to one that uses those sqrt(N) front-end gatekeeper
locks, i.e. this would be done by the first thread/user to notice
excessive contention.
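
A sketch of the two-level variant (lock_t, acquire()/release(), and
hash() are all placeholders):

    enum { GATES = 64 };             // ~sqrt(N) gatekeeper locks
    lock_t gate[GATES], primary;

    lock_t *lock_both( unsigned tid )
    {
        lock_t *g = &gate[ hash( tid ) % GATES ];
        acquire( g );                // contend with ~N/GATES threads
        acquire( &primary );         // then with at most GATES threads
        return g;
    }

    void unlock_both( lock_t *g )
    {
        release( &primary );
        release( g );
    }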

>> If there are multiple value updates and rollbacks are possible, you
>> _HAVE TO_ buffer.
>
> Yes you have to buffer, but most pipelines have resources already
> available that can be used as the buffers at very low cost
> (essentially zero).

The buffers have to be the original cache lines, just with an additional
marker to make them all part of a single set. Either the last update
(with release semantics) made it into the corresponding cache line, or
everything is flushed, right?

As soon as the local cache lines have all been updated, the atomic event
has essentially finished, even if no lines have been written back so
far: Since they are all owned by the current cpu, nobody else can see
any stale data.

MitchAlsup

unread,
Nov 1, 2018, 1:00:18 PM11/1/18
to
On Thursday, November 1, 2018 at 2:09:27 AM UTC-5, Terje Mathisen wrote:
> MitchAlsup wrote:
> > On Wednesday, October 31, 2018 at 3:52:46 PM UTC-5, EricP wrote:
> >> This is a try-fail-retry sequence, which is known to be *AT LEAST*
> >> N^2 cost, and can be N^4 cost when the coherency protocol is
> >> considered too. It also does not guarantee forward progress which
> >> many people, myself included, consider a requirement.
> >
> > Now, would you like to see how to make it go to BigO( ln(n) ) where n
> > is the number of contenders?
>
> This is the place where I suggested (a couple of years ago?) to extend
> locks to a tree structure, where each thread would use a hash of its
> PID/TID to select which leaf lock to try.

I remember well.
>
> Only after getting that initial lock would it be legal to try to get the
> primary lock, so this way we get a sqrt(N) reduction in the number of
> threads contending for any individual lock, but at a cost of making it
> twice as expensive to get access to a lock even when there is no contention.
>
> This can of course be extended to even more levels if you want to,
> moving it from O(sqrt(N)) to O(log(N)), but with a linear increase in
> time and up to N individual locks needed, so having just a single extra
> level is probably the most useful approach.
>
> It is also possible to atomically switch such a structure to/from a
> single-level lock to one that uses those sqrt(N) front-end gatekeeper
> locks, i.e. this would be done by the first thread/user to notice
> excessive contention.
>
> >> If there are multiple value updates and rollbacks are possible, you
> >> _HAVE TO_ buffer.
> >
> > Yes you have to buffer, but most pipelines have resources already
> > available that can be used as the buffers at very low cost
> > (essentially zero).
>
> The buffers have to be the original cache lines, just with an additional
> marker to make them all part of a single set. Either the last update
> (with release semantics) made it into the corresponding cache line, or
> everything is flushed, right?

Not quite:: the buffers have to capture the cache line, but the stores
can be kept in the Store Buffer, the unfinished loads in the load buffer
and cache line status in the L1 miss buffers. Thus no added buffering,
just added states to machines that already have state machines governing
their use.

The store buffer should have all store data ready by the time the ATOMIC
event gains the ability to NaK Snoop requests. The store buffer is also
where the stores are nullified if the ATOMIC event fails.
>
> As soon as the local cache lines have all been updated, the atomic event
> has essentially finished, even if no lines have been written back so
> far: Since they are all owned by the current cpu, nobody else can see
> any stale data.

Yes.

Stephen Fuld

unread,
Nov 1, 2018, 1:02:37 PM11/1/18
to
On 10/31/2018 4:25 PM, MitchAlsup wrote:
> On Wednesday, October 31, 2018 at 5:49:30 PM UTC-5, Stephen Fuld wrote:
>> On 10/30/2018 11:33 PM, Ivan Godard wrote:
>>
>> big snip
>>
>>
>>> All you say is true if the program is well-behaved and LLs always match
>>> SCs. My programs do not behave, even when I pay them well.
>>
>> Really??
>>
>> I don't pay my programs much, but they are almost always well behaved in
>> the sense that they do exactly what I tell them to do. :-)
>
> Once you start considering parity and correctable errors occurring in HW
> you get to the point you never trust anything in HW (long term).

Yes. Note that I did put an "almost" before the always :-). I remember
the first time I encountered a problem that didn't fail hard, but caused
a simple instruction to not execute correctly with some data. As a
software guy, it took me a long time to debug it, then get my head
wrapped around it.

Note that it was in the 1970s on a mainframe system, and the fix was to
replace a card that IIRC had a bad sense amp and was not reading some
correct values from a register.

>> The problem is that they rarely do what I meant to tell them (or perhaps
>> should have told them) to do, but didn't. :-(
>
> This is why no computer was ever implemented with a GJIL instruction
> (Go Jump In the Lake instruction.) Because if they had be so implemented
> they would all be in the lake by the time random instruction verification
> was complete.

:-)

Rick C. Hodgin

unread,
Nov 1, 2018, 2:23:27 PM11/1/18
to
On Thursday, November 1, 2018 at 12:19:18 AM UTC-4, Chris M. Thomasson wrote:
> On 7/1/2018 1:09 PM, Rick C. Hodgin wrote:
> > ... there should be an atomic XCHGN instruction that allows a
> > memory-to-memory exchange locked at the hardware level...
>
> XADD is the one true instruction...

XADD's still only mem/reg, or reg/mem. It's not mem/mem.

--
Rick C. Hodgin

EricP

unread,
Nov 1, 2018, 2:32:55 PM11/1/18
to
Ivan Godard wrote:
>
> As an alternative specification, consider:
> 1) read and write sets support at least eight arbitrarily disjoint
> references (for updating complex structures)
> 2) locking has byte granularity (removes false sharing)
> 3) locking can nest (for good program structuring)
> 4) exponential backoff on failure (avoids livelock ping-ponging)

I hadn't thought of byte locking - yes it would be nice.
That implies that some bytes of a cache line can be updated
but not yet committed, and on a different cpu different bytes
of the same line updated but not yet committed.
So you can't use the Exclusive line ownership state for collision detection.
Seems to imply a central byte lock registry logic, plus a bunch of
fully associative tables sized by #bytes per line * #locked lines * #cpus.
So, a separate node on the SMP network.

I don't like exponential backoff as it doesn't guarantee forward progress
or fairness in collisions. I would prefer a mechanism that did.
If you have a central registry & collision detect node then
it can do fair scheduling for colliders, i.e. FIFO.

EricP

unread,
Nov 1, 2018, 2:55:30 PM11/1/18
to
Yes. I was also considering if the existing D$L1 cache lines
can be used as the before and after image buffers.
It requires fully assoc indexing because we can't tolerate
conflict evictions during a transaction.
If the cost is just a matter of adding an fully assoc index then
we might be able to afford more # of cache lines in the transaction.

Stefan Monnier

unread,
Nov 1, 2018, 3:28:50 PM11/1/18
to
> Since I am the inventor, I have more insight into this kind of mess.

BTW, this kind of opportunistic synchronization seems to have some
similar attributes to speculation (it speculates that no interference
will happen and then rewinds if that proves to be wrong), so I wonder if
it can be abused along the same lines as Spectre/Meltdown.


Stefan

EricP

unread,
Nov 1, 2018, 3:34:45 PM11/1/18
to
MitchAlsup wrote:
>
> This is why the "AHY" register is in My 66000 ISA. When an ATOMIC event fails
> WHY records why the event failed. negative numbers indicate things like
> interrupt, spurious, buffer overflow,... positive numbers are a count of
> interfering accesses. This can be used to decrease the amount of future interference.

Yes it needs a Why reason returned in a register.
However I think it was a mistake for ASF and RTM to restore
the general register set to its starting state on abort,
as it prevents software from passing information to itself
about how the transaction was going.
I understand why they did this, because they were piggy-backing
on the existing exception checkpoint restore mechanism.

But just setting the new program counter and reason register
and leaving the other registers as they are would have been best.
Or make register restore an option rather than mandatory.

Ivan Godard

unread,
Nov 1, 2018, 4:05:03 PM11/1/18
to
On 11/1/2018 11:09 AM, EricP wrote:
> Ivan Godard wrote:
>>
>> As an alternative specification, consider:
>> 1) read and write sets support at least eight arbitrarily disjoint
>> references (for updating complex structures)
>> 2) locking has byte granularity (removes false sharing)
>> 3) locking can nest (for good program structuring)
>> 4) exponential backoff on failure (avoids livelock ping-ponging)
>
> I hadn't thought of byte locking - yes it would be nice.
> That implies that some bytes of a cache line can be updated
> but not yet committed, and on a different cpu different bytes
> of the same line updated but not yet committed.
> So you can't use the Exclusive line ownership state for collision
> detection.

Yes, or dispose of the notion of Exclusive entirely, though that's a
much more pervasive change.

> Seems to imply a central byte lock registry logic,
> plus a bunch of fully associative tables.
> sized by #bytes per line * #locked lines * #cpus
> So a separate node on the SMP network.

Possibly, but central registry anything doesn't scale. Perhaps something
like the directory-based virtual memory systems of building-scale
shared-address supercomputers? Such issues are *much* easier to deal
with if you only have to deal with in-chip. Coherency and atomicity are
major reasons why we don't support shared address space past the pins.

> I don't like exponential backoff as it doesn't guarantee forward progress
> or fairness in collisions. I would prefer a mechanism that did.
> If you have a central registry & collision detect node then
> it can do fair scheduling for colliders, i.e. FIFO.

I think that the quest for provable forward progress is pointless in
practice although a proven source of theses. We accept stochastic
security in crypto and stochastic reliability in hardware, and I don't
see that stochastic progress is any different so long as the fail
probability is an exponential function of time, which is easy to do.

The only real requirement is to avoid convoying, which random
exponential backoff does, and false collisions, which byte lock
granularity does.

Chris M. Thomasson

unread,
Nov 1, 2018, 4:47:05 PM11/1/18
to
I was being a little sarcastic. However, one can build some interesting
algorithms with XADD. Fwiw, one can simulate an XCHGN by hashing
addresses into a table of mutexes, and sticking to a strict locking order.
______________________________________________
Take several addresses as M.

Hash each address in M into a static table of locks T as S.

Sort S in ascending order.

Remove any duplicates, (hash collisions) in S.

*** Acquire each lock in S.
________________________
M is now locked! :^D

Perform mutations on M. Read from M, whatever.
________________________

*** Release each lock in S in reverse order.
______________________________________________


There is no possibility of deadlock.
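
A compact sketch of the scheme (the table size, hash, and helper names
are all invented; the sort supplies the strict order that rules out
deadlock):

    #define NLOCKS 256
    static mutex_t T[NLOCKS];            // static table of locks

    int lock_all( void *M[], int n, int S[] )
    {
        int i, c = 0;
        for( i = 0; i < n; i++ )         // hash each address into T
            S[c++] = hash_ptr( M[i] ) % NLOCKS;
        sort_ascending( S, c );          // strict global order
        c = dedupe( S, c );              // drop hash collisions
        for( i = 0; i < c; i++ )
            mutex_lock( &T[ S[i] ] );    // M is now locked
        return c;
    }

    void unlock_all( int S[], int c )
    {
        while( c-- )
            mutex_unlock( &T[ S[c] ] );  // release in reverse order
    }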

MitchAlsup

unread,
Nov 1, 2018, 5:37:46 PM11/1/18
to
Any interference that could potentially change the data being used causes
the ATOMIC event to fail.
>
>
> Stefan

MitchAlsup

unread,
Nov 1, 2018, 5:43:29 PM11/1/18
to
Please don't try to use the L1 itself.

Use the Load and Store buffers which capture instruction state prior
to being accessed in the L1 (and prior to data arriving in the case of
Store buffer).

Also, use the L1 Miss buffers as these already HAVE to be snooped by
coherence traffic. These are used to monitor that all participating
cache lines remain interference free, and amalgamate same into a CPU
signal accessible via branch or predicate.

The Load buffers manage inbound traffic
The Store buffers manage outbound traffic.

Done properly, the participating cache lines can exceed the associativity
of the L1 cache without architectural harm (may incur additional latency).

lk...@lkcl.net

unread,
Nov 1, 2018, 6:20:57 PM11/1/18
to
On Thursday, November 1, 2018 at 9:43:29 PM UTC, MitchAlsup wrote:

> Please don't try to use the L1 itself.
>
> Use the Load and Store buffers which capture instruction state prior
> to being accessed in the L1 (and prior to data arriving in the case of
> Store buffer).

i have to say this is utterly cool stuff, and very valuable
practical advice. thank you to everyone who's replied, here,
reopening this discussion: to bruce, ivan, robert, eric, and mitch
in particular.

to recap:

* LR/SC is O(N^M) and Multi-LR/SC even worse
* 66000 ISA Multi-LLSC is O(N)
* LR/SC and Multi-LR/SC need analysis of the LOAD/STORE buffers
anyway, so you might as well do 66000-style Multi-LLSC.

what i am now wondering is: could the semantics of RISC-V
LR/SC be changed to that of 66000 whilst at the same time
conforming fully with the RISC-V specification?

RISC-V limits the distance between LR and SC to be 64
instructions. beyond that, the load reservation is
cancelled. branches are prohibited, and a whole stack
of other instructions.

so, let's say instead of another LR *cancelling* the load
reservation, the SMP core / hardware thread *blocks* for
up to 63 further instructions, waiting for the reservation
to clear.

this i believe would strictly conform with the standard
RISC-V LR/SC whilst at the same time ensuring that the
first SMP core / hardware thread to get its LR reservation
in would always complete on the first "loop".

does that sound reasonable?

the next enhancement would be to have the 2nd core block
on a multi-reservation (a 2nd, and a 3rd LR), where,
because the instruction timer was already counting down
it would be waiting only for 62, 61, etc. instructions
before failing (all of) its LR(s).

i believe that this would ensure that one SMP core /
hardware thread always completed in O(N) whilst other
cores that happened to overlap *may* also complete in
O(N), and otherwise would need to go through a few
loops. no live-lock in other words.

thoughts and insights appreciated.

l.

Rick C. Hodgin

unread,
Nov 1, 2018, 7:29:26 PM11/1/18
to
On Thursday, November 1, 2018 at 4:47:05 PM UTC-4, Chris M. Thomasson wrote:
> On 11/1/2018 11:23 AM, Rick C. Hodgin wrote:
> > On Thursday, November 1, 2018 at 12:19:18 AM UTC-4, Chris M. Thomasson
> >> XADD is the one true instruction...
> > XADD's still only mem/reg, or reg/mem. It's not mem/mem.
>
> I was being a little sarcastic ... one can simulate an XCHGN by hashing
> addresses into a table of mutexes, and sticking to a strict locking order.

Still being a little sarcastic?

There's no way to perform hardware-guaranteed XCHGN mem-to-mem ops
on an x86. You can only implement software protocols to enforce a
policy preventing updating during the exchange, but even then there
is still no hardware guarantee because add-on cards can step in and
do whatever, potentially violating the software policy.

--
Rick C. Hodgin

MitchAlsup

unread,
Nov 1, 2018, 8:06:18 PM11/1/18
to
On Thursday, November 1, 2018 at 5:20:57 PM UTC-5, lk...@lkcl.net wrote:
> On Thursday, November 1, 2018 at 9:43:29 PM UTC, MitchAlsup wrote:
>
> > Please don't try to use the L1 itself.
> >
> > Use the Load and Store buffers which capture instruction state prior
> > to being accessed in the L1 (and prior to data arriving in the case of
> > Store buffer).
>
> i have to say this is utterly cool stuff, and very valuable
> practical advice. thank you to everyone who's replied, here,
> reopening this discussion: to bruce, ivan, robert, eric, and mitch
> in particular.
>
> to recap:
>
> * LR/SC is O(N^M) and Multi-LR/SC even worse
> * 66000 ISA Multi-LLSC is O(N)
> * LR/SC and Multi-LR/SC need analysis of the LOAD/STORE buffers
> anyway, so you might as well do 666000-style Multi-LLSC.
>
> what i am now wondering is: could the semantics of RISC-V
> LR/SC be changed to that of 66000 whilst at the same time
> conforming fully with the RISC-V specification?

Probably. ASF (Advanced Synchronization Facility), the precursor
of ESM (Exotic Synchronization Mechanism), was designed for x86-64.
ESM was designed inside a RISC architecture (except for the concept
that immediates and displacements could be variable in length.)

It is ALSO my intellectual property that I am giving to the world.

The only difficult part is a) I'm "not that familiar" with RV,
b) therefore I don't know if you can install a branch/predicate
that uses the signal back from the miss buffers concerning
interference.

Secondly, you may have missed it, but ESM requires the concept of
the ATOMIC control point. The first LL initializes the Acp to be
the virtual address of the first LL instruction. A branch on
interference instruction resets the Acp to the target of the
branch. A predicate instruction resets the Acp to the end of the
predicate shadow.

An interrupt (enabled and raised) or an (enabled and raised)
exception, or a software trap will cause control to transfer
to Acp BEFORE taking <whatever>. This eliminates any transient
state associated with ATOMIC events from the thread state.
When one arrives at the handler, the ATOMIC event either did
not start, or it finished in the failed condition.
>
> RISC-V limits the distance between LR and SC to be 64
> instructions. beyond that, the load reservation is
> cancelled. branches are prohibited, and a whole stack
> of other instructions.

Although My 66000 conceptually has no fixed limit, ATOMIC stuff
becomes less and less efficient as the critical regions get longer
and longer. ESM allows for calling and returning from simple
subroutines, none of which can cause a trap/exception; do I/O;
call the operating system,...

I would do this with a time stamp counter but this is purely
a design choice, one way works about as well as the other.

By the way, the limit of participating lines is chosen to be
the number of buffers in the L1 Miss Buffer.
>
> so, let's say instead of another LR *cancelling* the load
> reservation, the SMP core / hardware thread *blocks* for
> up to 63 further instructions, waiting for the reservation
> to clear.

Can you explain what you mean by this paragraph?

The My 66000 approach allows all cores to simultaneously attempt
ATOMIC events.

Terje Mathisen

unread,
Nov 2, 2018, 2:59:17 AM11/2/18
to
Actually, (pseudo-random) exponential backoff is probably the
simplest algorithm which does guarantee forward progress.

It is after all what every single Ethernet is based on.
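
For concreteness, the usual randomized form (constants and helper
names invented):

    unsigned delay = MIN_DELAY;
    while( !try_update() )               // the failing atomic attempt
    {
        spin_for( rand() % delay );      // desynchronize the contenders
        if( delay < MAX_DELAY )
            delay *= 2;                  // halve the odds of re-collision
    }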

Ivan Godard

unread,
Nov 2, 2018, 6:15:31 AM11/2/18
to
Only gives stochastic forward progress. You can reduce the probability
of stall arbitrarily close to zero, but you cannot reduce it all the way
to zero in the presence of colliding noise. This difference is a
difference only to theoreticians.

lk...@lkcl.net

unread,
Nov 2, 2018, 6:23:00 AM11/2/18
to
On Friday, November 2, 2018 at 12:06:18 AM UTC, MitchAlsup wrote:
> On Thursday, November 1, 2018 at 5:20:57 PM UTC-5, lk...@lkcl.net wrote:

> > what i am now wondering is: could the semantics of RISC-V
> > LR/SC be changed to that of 66000 whilst at the same time
> > conforming fully with the RISC-V specification?
>
> Probably, ASF (Advanced Synchronization Facility) the precursor
> of ESM (Exotic Synchronization Mechanism) was designed for x86-64.
> ESM was designed inside a RISC architecture (except for the concept
> that immediates and displacements could be variable in length.)
>
> It is ALSO my intellectual property that I am giving to the world.

very much appreciated, mitch.

> The only difficult part is a) I'm "not that familiar" with RV,
> b) therefore I don't know if you can instal a branch/predicate
> that uses the signal back for the miss buffers concerning
> interference.

well, the implementors are entirely free to put in whatever
microarchitecture they choose, i.e. "behind the scenes" detail
is unrestricted. the limitations in this hypothetical
exploration are that (a) only the two existing instructions
(LR and SC) may be used and (b) modifications to their behaviour
are required to be backwards-compatible with the RV standard.

which is:

* LR loads a word from a source address into a destination
register, and creates a reservation on the source address
* SC writes a word into a destination address, provided that
the reservation is still valid, and sets rd=0 on success.

> Secondly, you may have missed it, but ESM requires the concept of
> the ATOMIC control point.
> [...]

ok, so as long as augmented-RV-LR does not require any additional
opcodes, this would qualify as microarchitectural "behind
the scenes" detail.

> The first LL initializes the Acp to be
> the virtual address of the first LL instruction. A branch on
> interference instruction resets the Acp to the target of the
> branch. A predicate instruction resets the Acp to the end of the
> predicate shadow.

hmmm... RV specifically does not permit branch operations
in between LR and SC, and does not have predication.
hopefully you are referring here to ESM, not x86-64 ASF?

> > RISC-V limits the distance between LR and SC to be 64
> > instructions. beyond that, the load reservation is
> > cancelled. branches are prohibited, and a whole stack
> > of other instructions.
>
> Although My 66000 conceptually has no fixed limit, ATOMIC stuff
> becomes less and less efficient as the critical regions get longer
> and longer.

understood.

> By the way, the limit of participating lines is chosen to be
> the number of buffers in the L1 Miss Buffer.

ok, good to know.

i'd be really interested to see some pseudo-code implementation
(or an actual emulator's source code for 66000), i tend in most
cases to do better at understanding [pseudo-]code than words.

> >
> > so, let's say instead of another LR *cancelling* the load
> > reservation, the SMP core / hardware thread *blocks* for
> > up to 63 further instructions, waiting for the reservation
> > to clear.
>
> Can you explain what you mean by this paragraph?

best put in sequential events, probably.

<core1> LR <-- 64-instruction countdown starts here
<core1> ... 63
<core1> ... 62
<core2> LR same address <--- notes that core1 is on 61,
so pauses for **UP TO** 61 cycles
<core1> ... 32
<core1> SC <- core1 didn't reach zero, therefore valid, therefore
core2 is now **UNBLOCKED**, is granted the
load-reservation (and begins its **own** 64-cycle
LR instruction countdown)
<core2> ... 63
<core2> ... 62
<core2> ...
<core2> ...
<core2> SC <- also valid

so, instead of core2's LR creating an *invalidation* of the
core1 load-reservation, core2 *freezes* until the core1 SC
is met, or the core2 "waiting" countdown timer expires,
whichever is sooner.
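
expressed as pseudo-C, the proposed policy would be something like
the following. entirely hypothetical: the extern helpers stand in
for hardware mechanisms, and the boolean result is internal state
only, since (as noted below) LR cannot architecturally signal
success or failure to software.

#include <stdbool.h>

typedef unsigned long addr_t;                      /* reserved address */

extern bool reservation_held_elsewhere(addr_t a);  /* hypothetical hw */
extern void stall_one_cycle(void);                 /* hypothetical hw */
extern void take_reservation(addr_t a);            /* hypothetical hw */

bool lr_acquire(addr_t a, unsigned *countdown)
{
    while (reservation_held_elsewhere(a)) { /* another core in LR..SC */
        if (--*countdown == 0)
            return false;   /* gave up waiting: the eventual SC fails */
        stall_one_cycle();
    }
    take_reservation(a);  /* granted: start own 64-instruction window */
    *countdown = 64;
    return true;
}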

> The My 66000 approach allows all cores to simultaneously attempt
> ATOMIC events.

indeed, and there could actually be any number of cores
involved in the above, any one of which would be blocked
if there was an existing LR-SC sequence underway on any
other core.

clearly, on SC, if there were multiple other cores blocked
waiting on a given address, an algorithm would be needed
(round-robin? FIFO?) to select one AND ONLY one core to
unblock, leaving all others to continue.

also, it's important to note, there's no way for the LR
to signal if it succeeded or failed, so it will be necessary,
if the blocking timer expires, to just let the code continue,
even though at that point we know in advance that the SC
will fail. but... tough, we can't have everything: better
luck, next loop, ehn? :)

l.

lk...@lkcl.net

unread,
Nov 2, 2018, 8:21:52 AM11/2/18
to
On Friday, November 2, 2018 at 10:23:00 AM UTC, lk...@lkcl.net wrote:

> clearly, on SC, if there were multiple other cores blocked
> waiting on a given address, an algorithm would be needed
> (round-robin? FIFO?) to select one AND ONLY one core to
> unblock, leaving all others to continue.

sorry, to clarify: to continue *blocking*... with the blocking-timer
still counting down. if really really lucky, if the (2nd) unblocked
core happens to also complete (perform an SC) within the time of
the waiting (blocked) cores, hypothetically a 3rd could unblock,
and a 4th, and a 5th etc.

also, now that i think about it, hypothetically the blocking-countdown-timer
of each waiting core could be reset back to the current (unblocked)
core's limit of 64 instructions, as long as the order in which the
cores blocked is preserved, so that no core will ever be permanently
blocked by a continuous stream of other cores entering an LR-SC
on the same memory reservation.

l.

MitchAlsup

unread,
Nov 2, 2018, 12:16:09 PM11/2/18
to
Looks to me that you could effect the same functionality by simply
holding onto the cache line in core 1 preventing core 2 from
<architecturally> getting past the LR.

On the other hand, the freeze is similar to how the MP CRAYs did
ATOMIC stuff.

But what do you do if core 2 is 150 cycles away from core 1?
(chip to chip over a buffered HT-like network.)

So you are making an assumption that all cores are "close enough"
that you can pass an address to a centralized comparator and
get a response back in contemporaneous time.

Whereas ESM performs as if there were no ATOMIC decoration on the
instructions when interference did not take place.
>
> so, instead of core2's LR creating an *invalidation* of the
> core1 load-reservation, core2 *freezes* until the core1 SC
> is met, or the core2 "waiting" countdown timer expires,
> whichever is sooner.
>
> > The My 66000 approach allows all cores to simultaneously attempt
> > ATOMIC events.
>
> indeed, and, above, there could actually be any number
> of cores involved in the above, any one of which would
> be blocked if there was an existing LR-SC sequence
> underway on any other core.
>
> clearly, on SC, if there were multiple other cores blocked
> waiting on a given address, an algorithm would be needed
> (round-robin? FIFO?) to select one AND ONLY one core to
> unblock, leaving all others to continue.
>
> also, it's important to note, there's no way for the LR
> to signal if it succeeded or failed, so it will be necessary,

This is the notion of the control point: failure transfers control
to the control point. It is also why I used a branch to separate the
setup (loads and PREfetches) from the manifestation (stores.)

During setup, failure leads back to the first instruction of
the ATOMIC event. After the Query, failure leads to the failure
exit point. Thus one does not have to add the semantic of
modifying the store's Rd register (which is nasty in many
pipeline organizations.)
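
A sketch of the resulting source-level shape, in the style of the
esmLOCK() example further down this thread; esmINTERFERENCE() is a
placeholder name for the Query, not necessarily the real spelling:

/* Move n units between two counters that may sit in different cache
   lines, as one ATOMIC event. Failure during setup replays from the
   first esmLOCK; failure after the Query takes the return 0 exit. */
int transfer(long *from, long *to, long n)
{
    long f = esmLOCK( *from );    /* setup: participating loads     */
    long t = esmLOCK( *to );
    if (esmINTERFERENCE())        /* Query: the control point moves */
        return 0;                 /* failure exit; caller may retry */
    *from = f - n;                /* manifestation: the stores...   */
    esmLOCK( *to = t + n );       /* ...final one ends the event    */
    return 1;
}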

lk...@lkcl.net

unread,
Nov 2, 2018, 12:41:51 PM11/2/18
to
On Friday, November 2, 2018 at 4:16:09 PM UTC, MitchAlsup wrote:

> > > Can you explain what you mean by this paragraph?
> >
> > best put in sequential events, probably.
> >
> > <core1> LR <-- 64-instruction countdown starts here
> > <core1> ... 63
> > <core1> ... 62
> > <core2> LR same address <--- notes that core1 is on 61,
> > so pauses for **UP TO** 61 cycles
> > <core1> ... 32
> > <core1> SC <- core1 didn't reach zero, therefore valid, therefore
> > core2 is now **UNBLOCKED**, is granted the
> > load-reservation (and begins its **own** 64-cycle
> > LR instruction countdown)
> > <core2> ... 63
> > <core2> ... 62
> > <core2> ...
> > <core2> ...
> > <core2> SC <- also valid
>
> Looks to me that you could effect the same functionality by simply
> holding onto the cache line in core 1 preventing core 2 from
> <architecturally> getting past the LR.

neat trick!

> On the other hand, the freeze is similar to how the MP CRAYs did
> ATOMIC stuff.
>
> But what do you do if core 2 is 150 cycles away from core 1?
> (chip to chip over a buffered HT-like network.)

hmmm... i'm assuming a tightly-coupled SMP core, maxing out at 4 cores...
*thinks*... that may not be good enough for a generic / general-purpose
design...

hmmm... err, hang on: i thought about this for a few minutes: if there's
genuinely 150 cycles between cores, wouldn't that have a severe impact
on L1 cache coherency?

oh: chip-to-chip. right. definitely out of scope for the architecture
(requirement) i have in mind. good call for a general-purpose
generic implementation.

> So you are making an assumption that all cores are "close enough"
> that you can pass an address to a centralized comparator and
> get a response back in contemporaneous time.

i believe that... (or, better, i _wonder_ if) LR/SC would even work
(unmodified, as in "standard" LR/SC) over such large latencies.
as in: claiming a reservation on one core, where another core needs
150 cycles to invalidate it? the SC should have *completed* (in RISC-V)
within 64 cycles already!

so, as long as that problem is "solved" (or avoided) in the standard
RISC-V (unmodified) LR/SC case, without fully going into detail to
check, i would anticipate that an "augmented" (blocking) variant of
LR/SC would "fall back" to standard LR/SC in cases with such high
(off-chip) latency.

i understand that TileLink has been specifically designed to
communicate full cache-coherency and atomic semantics off-chip.
i do not however know the details.

> Whereas ESM performs as if there were no ATOMIC decoration on the
> instructions when interference did not take place.

iiinterestiiing.


> > also, it's important to note, there's no way for the LR
> > to signal if it succeeded or failed, so it will be necessary,
>
> This is the notion of the control point. Failure transfers control
> to the control point. And also why I used a branch to separate the
> setup (loads and PREfetches) from the manifestation (stores.)

okaaay. so i'm starting to get a handle on the terminology you're
using.

> During setup, failure leads back to the first instruction of
> the ATOMIC event. After Query, Failure leads to the failure
> exit point.

iinteresting. okaaay. so the atomicity is achieved through
branching. which clearly saves a few instructions, and, duh,
you have to do the branches anyway, to retry the operation.

neat.

clearly you've thought about this a *lot*.

l.

MitchAlsup

unread,
Nov 2, 2018, 3:34:22 PM11/2/18
to
unsigned char Lock( unsigned char *p )
{
    unsigned char t = esmLOCK( *p );  /* participating load            */
    esmLOCK( *p = 0xFF );             /* locked store ends the event   */
    return t;                         /* old value: 0 => lock was free */
}

This procedure can be compiled into direct form:

Lock:
    LDUB   R1,[Rp]:LOCK      // participating load (start of event)
    STB    0xFF,[Rp]:unLOCK  // locked store (end of event)
    JMP    R0                // return

And will run in the same cycle count as:

NotLocked:
    LDUB   R1,[Rp]
    STB    0xFF,[Rp]
    JMP    R0

That is, in the optimal case, the overhead is ZERO cycles.
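
For context, one way such a primitive might be wrapped -- a plain
test-and-test-and-set spin, a sketch rather than anything specific
to My 66000:

/* Spin until Lock() observes 0 (free). Between attempts, spin on a
   plain read so the line stays shared instead of ping-ponging. */
void spin_acquire(unsigned char *p)
{
    while (Lock( p ) != 0)                        /* 0xFF => held   */
        while (*(volatile unsigned char *)p != 0)
            ;                                     /* read-only spin */
}

void spin_release(unsigned char *p)
{
    *(volatile unsigned char *)p = 0;             /* free the lock  */
}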
>
>
> > > also, it's important to note, there's no way for the LR
> > > to signal if it succeeded or failed, so it will be necessary,
> >
> > This is the notion of the control point. Failure transfers control
> > to the control point. And also why I used a branch to separate the
> > setup (loads and PREfetches) from the manifestation (stores.)
>
> okaaay. so i'm starting to get a handle on the terminology you're
> using.
>
> > During setup, failure leads back to the first instruction of
> > the ATOMIC event. After Query, Failure leads to the failure
> > exit point.
>
> iinteresting. okaaay. so the atomicity is achieved through
> branching. which clearly saves a few instructions, and, duh,
> you have to do the branches anyway, to retry the operation.

Atomicity is achieved by preventing interested third parties from
seeing any intermediate state. Determining whether an ATOMIC event
succeeded or failed is manifest by where you end up in the code.
>
> neat.
>
> clearly you've thought about this a *lot*.

I spent a bit over 2 complete years at AMD figuring out what does
not work, and what cannot work; leaving me with things which can
be used to make forward progress on the matter.

Terje Mathisen

unread,
Nov 2, 2018, 5:09:06 PM11/2/18
to
MitchAlsup wrote:
> On Friday, November 2, 2018 at 11:41:51 AM UTC-5, lk...@lkcl.net wrote:
> Atomicity is achieved by preventing interested third parties from
> seeing any intermediate state. Determining whether an ATOMIC event
> succeeded or failed is manifest by where you end up in the code.
>>
>> neat.
>>
>> clearly you've thought about this a *lot*.
>
> I spent a bit over 2 complete years at AMD figuring out what does
> not work, and what cannot work; leaving me with things which can
> be used to make forward progress on the matter.
>
In hindsight it seems obvious that you should be able to do this using
the exact same mechanism as for normal OoO processing, i.e. you have an
initial (always predicted to fall through) special form of branch that
starts the operation, then you do the various atomic operations (with
LOCK decorations on the critical ones), and then at commit you
detect interference or not (on any of the LOCKed vars/regions/cache
lines), which is when that setup branch can finally retire, or cause a
full flush and restart at the interference (offline) target.

New instructions needed are that setup op/branch and the "commit
setup"; everything else just follows.

I'm sure your two years have figured out all the gotcha's in my naive
idea. :-)

MitchAlsup

unread,
Nov 2, 2018, 5:24:43 PM11/2/18
to
On Friday, November 2, 2018 at 4:09:06 PM UTC-5, Terje Mathisen wrote:
> MitchAlsup wrote:
> > On Friday, November 2, 2018 at 11:41:51 AM UTC-5, lk...@lkcl.net wrote:
> > Atomicity is achieved by preventing interested third parties from
> > seeing any intermediate state. Determining whether an ATOMIC event
> > succeeded or failed is manifest by where you end up in the code.
> >>
> >> neat.
> >>
> >> clearly you've thought about this a *lot*.
> >
> > I spent a bit over 2 complete years at AMD figuring out what does
> > not work, and what cannot work; leaving me with things which can
> > be used to make forward progress on the matter.
> >
> In hindsight it seems obvious that you should be able to do this using
> the exact same mechanism as for normal OoO processing, i.e. you have an
> initial (always predicted to fall through) special form of branch that
> starts the operation, then you do the various atomic operations (with
> LOCK decorations on the critical ones), and then at commit you
> detect interference or not (on any of the LOCKed vars/regions/cache
> lines), which is when that setup branch can finally retire, or cause a
> full flush and restart at the interference (offline) target.

Hindsight is often 20/20.

> New instructions needed are that setup op/branch and the "commit
> setup"; everything else just follows.
>
> I'm sure your two years have figured out all the gotcha's in my naive
> idea. :-)

Maybe not all of them, but a big bag of them.
>
> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

The thing that took the longest to recognize is that the mechanism works
by address, not the data contained at the address!

Ivan Godard

unread,
Nov 2, 2018, 5:34:56 PM11/2/18
to
Yes, it's an extremely clever re-use of OoO rollback. Of course, that
means it can't be used as-is for I/O or result replay, so we can't use
it. Fortunately there are alternatives.

Bruce Hoult

unread,
Nov 2, 2018, 7:28:38 PM11/2/18
to
On Friday, November 2, 2018 at 2:09:06 PM UTC-7, Terje Mathisen wrote:
> MitchAlsup wrote:
> > On Friday, November 2, 2018 at 11:41:51 AM UTC-5, lk...@lkcl.net wrote:
> > Atomicity is achieved by preventing interested third parties from
> > seeing any intermediate state. Determining whether an ATOMIC event
> > succeeded or failed is manifest by where you end up in the code.
> >>
> >> neat.
> >>
> >> clearly you've thought about this a *lot*.
> >
> > I spent a bit over 2 complete years at AMD figuring out what does
> > not work, and what cannot work; leaving me with things which can
> > be used to make forward progress on the matter.
> >
> In hindsight it seems obvious that you should be able to do this using
> the exact same mechanism as for normal OoO processing

The problem I see is that not everything has OoO processing, but you'd
still like to be able to do interesting atomic operations. I don't want
an ISA where the only reasonable implementation requires OoO or, worse,
requires a lot of the mechanism of OoO without actually getting the
benefits.

Sometimes there is no substitute for single-threaded performance, but
in a lot of applications you can get more bang for the buck by using
the same area and power to implement a number of simple in-order
single- or maybe dual-issue cores.

On October 18th a Russian company called "Cloudbear" announced two
Linux-capable 64-bit RISC-V application processors:

Cloudbear BI-651, 8-10 stage in-order dual-issue
Cloudbear BI-671, 7-9 stage OoO

Sadly, they didn't give the issue width or the number of in-flight instructions possible for the OoO processor.

They gave performance numbers compared to the year-old SiFive U54, a
single-issue, in-order processor:

U54   : Coremark/MHz 2.75, Dhrystone VAX MIPS/MHz 1.7
BI-651: Coremark/MHz 3.02, Dhrystone VAX MIPS/MHz 2.18
BI-671: Coremark/MHz 3.74, Dhrystone VAX MIPS/MHz 2.65

So, some nice improvements over the U54, especially for the OoO CPU.

On October 31 SiFive announced the U74, an 8-stage, dual-issue,
in-order CPU.

U74   : Coremark/MHz 4.9, Dhrystone VAX MIPS/MHz 2.5

This compares to recent i7s with Coremark ~5, Dhrystone ~6.

Dhrystone is a pretty awful benchmark, but it's unfortunately still
often quoted in the embedded world. Coremark is much better, and is
the benchmark recommended by ARM.

Interesting that simple and small dual-issue in-order can get very
close to i7 on Coremark, though all these RISC-V processors run at
only 1.2-1.5 GHz, about 1/3 of Intel's clock speed.

MitchAlsup

unread,
Nov 2, 2018, 7:31:32 PM11/2/18
to
On Friday, November 2, 2018 at 4:09:06 PM UTC-5, Terje Mathisen wrote:
> MitchAlsup wrote:
> > On Friday, November 2, 2018 at 11:41:51 AM UTC-5, lk...@lkcl.net wrote:
> > Atomicity is achieved by preventing interested third parties from
> > seeing any intermediate state. Determining whether an ATOMIC event
> > succeeded or failed is manifest by where you end up in the code.
> >>
> >> neat.
> >>
> >> clearly you've thought about this a *lot*.
> >
> > I spent a bit over 2 complete years at AMD figuring out what does
> > not work, and what cannot work; leaving me with things which can
> > be used to make forward progress on the matter.
> >
> In hindsight it seems obvious that you should be able to do this using
> the exact same mechanism as for normal OoO processing, i.e. you have an
> initial (always predicted to fall through) special form of branch that
> starts the operation, then you do the various atomic operations (with
> LOCK decorations on the critical ones), and then at commit you
> detect interference or not (on any of the LOCKed vars/regions/cache
> lines), which is when that setup branch can finally retire, or cause a
> full flush and restart at the interference (offline) target.
>
> New instructions
variants
> needed are that setup op/branch and the "commit setup";
> everything else just follows.

The standard LD/ST instructions have a bit named "LOCK".
In the LDs and prefetches, it denotes participation,
in STs, it denotes end of ATOMIC event.

Conditional branches have 32 specifiable conditions; I just wired
"interference" up to one of those codes. Since all one can do is
compare an integer or floating-point value to zero, there are lots of
codes left over. I also put Return from Interrupt and Return from
Exception in this instruction (saving precious OpCode space.)

So no NEW instructions, just some decoration on already existing
instructions--improving instruction entropy.
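
Roughly the flavour of that condition space, as a hypothetical sketch
(the names and code assignments are guesses, not the actual My 66000
encoding):

/* A 5-bit condition field gives 32 codes. Compare-with-zero uses a
   dozen or so, leaving spare codes for interference, RTI, RTE --
   hence "no NEW instructions, just decoration". */
enum br_cond {
    BR_EQ0, BR_NE0, BR_LT0, BR_GE0, BR_LE0, BR_GT0,  /* integer vs 0 */
    BR_FEQ0, BR_FNE0, BR_FLT0, BR_FGE0,              /* float vs 0   */
    BR_INTERFERENCE,  /* taken when the ATOMIC event saw interference */
    BR_RTI,           /* return from interrupt                        */
    BR_RTE            /* return from exception                        */
                      /* remaining codes unassigned in this sketch    */
};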

MitchAlsup

unread,
Nov 2, 2018, 7:33:34 PM11/2/18
to
Due to the concept of the control point, and with the query point
separating setup from manifestation, doing an I/O version of this
is "brain-dead easy".

MitchAlsup

unread,
Nov 2, 2018, 7:35:18 PM11/2/18
to
On Friday, November 2, 2018 at 4:34:56 PM UTC-5, Ivan Godard wrote:
Forgot to mention, in Mill one would need to have 2 control points,
one for the execution stream and another for the memory stream. But
there are likely easy means to facilitate such.