
LL/SC (load-linked/store-conditional)


Andy "Krazy" Glew

Oct 1, 2011, 11:33:43 PM
http://semipublic.comp-arch.net/wiki/Load-linked/store-conditional_(LL/SC)

The load-linked/store-conditional instruction pair provides a [[RISC]]
flavor of [[atomic RMW]] [[synchronization]], emphasizing primitives
which can be flexibly composed. It can be viewed as a minimal form of
[[transactional memory]].

Part of the [[RISC philosophy]] was to espouse [[load/store
architecture]] - instruction sets that separated load and store memory
operations
from computational or ALU operations such as add or increment. This
works fine for single processor operations, but runs into problems
for multiprocessor synchronization.

== Pre-LL/SC ==

After years of evolution, prior to the so-called [[RISC revolution]]
multiprocessor synchronization instruction sets were converging on
simple [[atomic RMW]] instructions such as
[[compare-and-swap]], [[atomic increment memory]] or [[fetch-and-add]]
and other [[fetch-and-op]]s.
Now, these atomic RMWs can be seen as composed of fairly simple
primitives, for example:

[[locked increment memory]]
begin-atomic
tmp := load(MEM)
tmp := tmp+1
store( MEM, tmp )
end-atomic
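
As an aside, this packaged [[atomic increment memory]] is essentially
what today's portable software interfaces expose. A minimal C11 sketch,
assuming only the standard atomics library (the begin-atomic/end-atomic
bracketing is hidden inside whatever instruction or sequence the
compiler emits):

   #include <stdatomic.h>

   atomic_int counter;

   void locked_increment(void)
   {
       /* a single atomic RMW; on x86 this typically compiles to a
          LOCK-prefixed instruction */
       atomic_fetch_add(&counter, 1);
   }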

However, the implementation of begin-atomic/end-atomic is not
necessarily simple.

The atomicity can be provided in a simple way:

[[locked increment memory]]
tmp := [[load-locked]](MEM)
tmp := tmp+1
[[store-unlock]]( MEM, tmp )

Where [[load-locked]] may be
* implemented at the memory module
** locking the entire module
** or a limited number of memory locations at the module
** or potentially an arbitrary number of memory locations, using
per-location lock-bit [[memory metadata]], e.g. [[stolen from ECC]]

or
* implemented via a bus lock

or
* implemented via a cache protocol
** e.g. an [[address lock]]: acquiring ownership of the cache line, and
then refusing to respond to [[snoops or probes]] until the [[address
lock]] was released
** or, more primitive: acquiring ownership of the cache line, and then
acquiring a [[cache lock]], locking the entire cache or a fraction thereof

Interestingly, implementing locking at the memory module quite possibly
came first, since many early multiprocessor systems were not snoopy or
bus-based.

So far, so good: [[load-locked]] and [[store-unlock]] are somewhat RISCy.

But they have a problem: as separate instructions, [[load-locked]] and
[[store-unlock]] raise certain security, performance, and reliability
problems in some (but not necessarily all) implementations.

E.g. what happens if a user executes load-locked to lock the bus,
and never unlocks it?
That *might* amount to giving one user exclusive access to the bus,
denying it to all others.

Obviously, one can eliminate these problems by architecture and
microarchitecture.
But doing so is a source of complexity.
Many, probably most, implementations have found it easier to package the
atomic RMW
so that the primitive operations are not exposed to the user.


== [[Load-linked/store-conditional]] ==

The basic idea behind [[LL/SC]] is that, instead of guaranteeing forward
progress with a [[load-locked]] that always succeeds
but which may deadlock,
[[load-linked]] would employ [[optimistic concurrency control]].
Software would assume that the load-link had succeeded, perform the
operation that it wants to be atomic,
and then try to commit the operation using [[store-conditional]].

If the operation was atomic, i.e. if nobody else had accessed the memory
location that was load-link'ed, the store-conditional would succeed.

But if the operation was not atomic, i.e. if somebody else had accessed
the memory location, the store-conditional would fail.
The user software would then be required to handle that failure, e.g. by
looping.

fetch-and-add:
L: oldval := load-linked MEM
newval := oldval+a
store-conditional( MEM, newval )
if SC_failed goto L
return oldval
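
To make that concrete, here is a sketch of the same retry loop using
AArch64's exclusive-access instructions (LDXR/STXR are ARM's
load-linked/store-conditional pair). The function name is mine, and this
assumes GCC/Clang inline assembly on aarch64; other LL/SC architectures
(MIPS LL/SC, Power LWARX/STWCX., RISC-V LR/SC) follow the same shape:

   #include <stdint.h>

   static inline uint64_t fetch_and_add_llsc(uint64_t *mem, uint64_t a)
   {
       uint64_t oldval, newval;
       uint32_t failed;              /* STXR writes 0 on success, 1 on failure */
       do {
           __asm__ volatile(
               "ldxr %0, [%3]\n\t"   /* load-linked: oldval := *mem, arm monitor */
               "add  %1, %0, %4\n\t" /* newval := oldval + a */
               "stxr %w2, %1, [%3]"  /* store-conditional; %w2 = 1 if link broken */
               : "=&r"(oldval), "=&r"(newval), "=&r"(failed)
               : "r"(mem), "r"(a)
               : "memory");
       } while (failed);             /* the "goto L" retry loop */
       return oldval;
   }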

Typically LL/SC are implemented via a snoopy bus protocol:
a memory location is "linked" by putting its address into a snooper.
If another processor writes to that location, the link is broken, so
that the SC will fail.
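
A toy software model of such a snooper (all names illustrative, no real
design implied) makes the mechanism plain:

   #include <stdbool.h>
   #include <stdint.h>

   struct link_monitor {
       uintptr_t linked_addr;    /* address armed by the last load-linked */
       bool      valid;          /* false once the link has been broken */
   };

   /* load-linked arms the monitor */
   void ll_arm(struct link_monitor *m, uintptr_t addr)
   {
       m->linked_addr = addr;
       m->valid = true;
   }

   /* each snooped write from another processor is checked against the link */
   void on_snooped_write(struct link_monitor *m, uintptr_t addr)
   {
       if (m->valid && m->linked_addr == addr)
           m->valid = false;     /* break the link: the next SC will fail */
   }

   /* store-conditional succeeds only if the link survived */
   bool sc_try(struct link_monitor *m, uintptr_t addr)
   {
       bool ok = m->valid && m->linked_addr == addr;
       m->valid = false;         /* an SC always clears the reservation */
       return ok;
   }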

Some implementations did not implement an address snooper - they might
fail if there was ANY other activity on the bus.
Needless to say, this does not exhibit good performance on contended
systems.

Non-snoopy implementations of LL/SC are possible, e.g. at a memory
module. [[TBD]].

As with more extensive forms of transactional memory, LL/SC has issues
with fairness and forward progress. It is not desirable to loop forever
in the LL/SC code.
This is solvable - through much the same mechanisms as one uses to
implement a [[locked atomic RMW]] with [[load-locked]].

One way amounts to converting a [[load-linked]] into a [[load-locked]]
after a certain number of failures.
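
A hedged software sketch of that policy, using C11 compare-exchange as a
stand-in for LL/SC and a mutex as a stand-in for the always-succeeding
locked path (the retry threshold and the pthread fallback are
illustrative choices, not any particular hardware's):

   #include <pthread.h>
   #include <stdatomic.h>
   #include <stdint.h>

   #define MAX_TRIES 16   /* arbitrary; a real design would tune this */

   static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;

   uint32_t fetch_and_add_bounded(_Atomic uint32_t *mem, uint32_t a)
   {
       uint32_t oldval = atomic_load(mem);
       for (int i = 0; i < MAX_TRIES; i++) {
           /* optimistic path: on failure, oldval is refreshed for the retry */
           if (atomic_compare_exchange_weak(mem, &oldval, oldval + a))
               return oldval;
       }
       /* pessimistic path: serialize contenders through a lock.  (In
          hardware, a load-locked would also stall the optimistic path;
          here atomic_fetch_add keeps the sketch correct regardless.) */
       pthread_mutex_lock(&fallback_lock);
       oldval = atomic_fetch_add(mem, a);
       pthread_mutex_unlock(&fallback_lock);
       return oldval;
   }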


== Synthesizing Complex RMWs ==

The nice thing about [[LL/SC]] is that they can be used to implement
many different forms of synchronization. E.g. almost any form of
[[fetch-and-op]].
Say you want [[floating point fetch-and-add]]: you can build that with
LL/SC.
Whereas if you don't have LL/SC, and just have a fixed repertoire of
integer [[atomic RMW]]s, you may just be out of luck.
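
For example, here is [[floating point fetch-and-add]] synthesized in
C11. C has no portable LL/SC, but on LL/SC machines
atomic_compare_exchange_weak typically compiles down to exactly the
LL/SC retry loop shown earlier, so this is a faithful software rendering
of the idea (function name mine):

   #include <stdatomic.h>
   #include <stdint.h>
   #include <string.h>

   float fetch_and_add_float(_Atomic uint32_t *mem, float a)
   {
       uint32_t oldbits = atomic_load(mem);
       uint32_t newbits;
       float oldval, newval;
       do {
           memcpy(&oldval, &oldbits, sizeof oldval);   /* bits -> float */
           newval = oldval + a;
           memcpy(&newbits, &newval, sizeof newbits);  /* float -> bits */
           /* on failure, oldbits is refreshed with the current contents */
       } while (!atomic_compare_exchange_weak(mem, &oldbits, newbits));
       return oldval;
   }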

This *is* one of the big advantages of RISC: provide primitives, rather
than full solutions.

The problem, of course, is what was mentioned above: "plain" LL/SC may
have problems with fairness and forward progress. Mechanisms that solve
these problems
for LL/SC may be more complicated than they would be for non-LL/SC
instructions.

See [[list of possible atomic RMW operations]].

== [[Remote Atomic RMWs]] ==

The "most natural" implementation of [[LL/SC]] is to pull the data into
registers and/or cache of the processor on which the instructions are
executing. By "the instructions" I mean those before the LL, the LL
itself, between the LL and SC, the SC itself, and after the SC.

This is also the most natural implementation for locked implementations
of atomic RMWs:

[[LOCK INC mem]]
tmp := [[load-locked]] M[addr]
dstreg := tmp+1
[[store-unlock]] M[addr] := dstreg

I call this the "local implementation" of the atomic RMW.

However, many systems have found it useful to create "remote
implementations" of the atomic RMW.

For example, special [[bus or interconnect]] transactions could be
created that indicate that the memory controller should do the atomic
RMW operation itself.

For example, the [[NYU Ultracomputer]] allowed many [[fetch-and-add]]
operations to be exported. They could be performed at the ultimate
destination, the memory controller. But they could also be handled
specially at intermediate nodes in the interconnection fabric, where
conflicting atomic RMWs to the same location could be combined,
forwarding only their net effect on towards the destination.

You can imagine such [[remote atomic RMW]]s as being performed by

[[LOCK INC mem]]
send the command "fetch-and-add M[addr],1" to the outside system

This is the advantage of CISCy atomic RMW instructions: they can hide
the implementation.
If the interconnect fabric supports only [[fetch-and-add]] but not
[[fetch-and-OR]], then the two respective [[microoperation expansions]]
might be:

[[LOCK INC mem]]
send the command "fetch-and-add M[addr],1" to the outside system

[[LOCK FETCH-AND-OR mem]]
oldval := [[load-locked]] M[addr]
newtmp := oldval OR src1
[[store-unlock]] M[addr] := newtmp

For that matter, the CISC implementation might well use [[LL/SC]]
optimistic concurrency control within its [[microflow]].

It is more work to create such a [[remote atomic RMW]] with [[LL/SC]],
since the very strength of LL/SC
- that the user can place almost arbitrarily complicated instruction
sequences between the [[LL]] and [[SC]] -
is also its weakness.
If you wanted to create a [[remote atomic RMW]] implementation that
supported just, say, [[fetch-and-add]],
then you would have to create an [[idiom recognizer]] that recognized
code sequences such as:

fetch-and-add:
L: oldval := load-linked MEM
newval := oldval+a
store-conditional( MEM, newval )
if SC_failed goto L
return oldval

Actually, you might not have to recognize the full sequence, complete
with the retry loop.
You would only need to recognize the sequence

oldval := load-linked MEM
newval := oldval+a
store-conditional( MEM, newval )

and convert that into whatever bus operations create the [[remote atomic
RMW]].

Recognizing a 3-instruction idiom is not THAT hard.
But in general [[it is harder to combine separate instructions than to
break up complex instructions in an instruction decoder]].

One might even imagine creating a fairly complete set of instructions
executable on the interconnect.

=== The best of both worlds? ===

Let me end this section by describing a hybrid implementation of the
CISC atomic RMWs that combines the best of both worlds wrt [[local and
remote atomic RMWs]].

[[LOCK FETCH-AND-OP mem]]
oldval := [[load-locked/fetch-and-op]] M[addr]
newtmp := oldval OP src1
[[store-unlock/fetch-and-op]] M[addr] := newtmp
return oldval

If M[addr] is exclusive in the local cache, then the
[[load-locked/fetch-and-op]] and [[store-unlock/fetch-and-op]]
are equivalent to ordinary [[load-locked]] and [[store-unlock]].

If M[addr] is not present in the local cache, then
[[load-locked/fetch-and-op]] is equivalent to sending a remote atomic
RMW fetch-and-op command to the external network. The
[[store-unlock/fetch-and-op]] may then become a NOP (although it may be
used for bookkeeping purposes).

If M[addr] is present in the local cache in a shared state, then we
*COULD* perform the atomic RMW both locally and remotely. The remote
version might invalidate or update other copies. However, if the
operation is supposed to be [[serializing]], then the local update
cannot proceed without coordinating with the remote.
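
The routing decision above can be boiled down to a small decision
function. The cache states and names here are illustrative only, not any
real protocol:

   #include <stdbool.h>

   enum cache_state { INVALID, SHARED, EXCLUSIVE };
   enum rmw_route   { ROUTE_LOCAL, ROUTE_REMOTE, ROUTE_BOTH };

   /* Where should [[LOCK FETCH-AND-OP mem]] be performed? */
   enum rmw_route choose_rmw_route(enum cache_state line, bool serializing)
   {
       if (line == EXCLUSIVE)
           return ROUTE_LOCAL;    /* ordinary load-locked/store-unlock */
       if (line == INVALID)
           return ROUTE_REMOTE;   /* export the op; store-unlock becomes a NOP */
       /* SHARED: the op could be applied both locally and remotely (the
          remote side invalidating or updating other copies), but a
          [[serializing]] op cannot update locally without first
          coordinating with the remote copies. */
       return serializing ? ROUTE_REMOTE : ROUTE_BOTH;
   }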


== Extending the semantics ==

[[PAC]]:

(Paul Clayton added this.)

Because existing LL/SC definitions either fail or have [[undefined
semantics]] if other memory operations are performed between the LL and
the SC, the semantics can be safely extended without sacrificing
[[compatibility]]. (It may be possible in some architectures ([[TBD]]:
check whether this is the case for Alpha, MIPS, ARM, Power, etc.) that
software used a non-SC memory operation to cancel an atomic section,
but such use is generally discouraged and unlikely to have been
common.) Such extensibility could allow increasingly more extensive
forms of transactional memory to be implemented without requiring
additional instructions. While an implementation of an older
architecture would not generate an illegal instruction, an architecture
which guaranteed failure on additional memory accesses would guarantee
either deadlock (which would simplify debugging and fixing) or the use
of an already-in-place software fallback (which, while slower, would
maintain correctness). In architectures which use a full-sized register
to return the success condition and define a single success value, that
register could be used to communicate failure information.

A natural progression for such extensions might be as follows. First,
constrain the transaction to an aligned block of 32 or 64 bytes, or to
the implementation-defined unit of coherence (this last could cause a
significant performance incompatibility but would maintain the
semantics), perhaps with a single write. Then expand the semantics to
something like those presented in Cliff Click's "IWannaBit!", where any
cache miss or eviction cancels the reservation (for many uses this
would require that the cache be warmed up before the transaction can
succeed; if the reservation mechanism is expensive, prefetching data
outside of the atomic section might be preferred over retrying the
transaction). Then perhaps extend writes to an entire cache line, and
finally to an arbitrary number of memory accesses.

Providing non-transactional memory accesses within the atomic section
would be more difficult. It would be possible to define register
numbers which, when used as base pointers for memory accesses, have
non-transactional semantics (see [[Semantic register numbers]]). This
definition would have to be made at the first extension of LL/SC, and
it might be difficult to establish compiler support.

A non-transactional extension of LL/SC would be to guarantee the success
of the store-conditional under certain limited conditions. E.g., a
simple register-register operation might be guaranteed to succeed,
providing the semantics of most RMW instructions. Such a guaranteed
transaction could itself be extended to allow more complex operations,
though it might be desirable to allow a transaction that can be
guaranteed to fail if lock semantics are not desired under contention.
E.g., a thread might work on other tasks rather than wait for the
completion of a heavily contended locked operation.

== See Also ==

* [[locked versus atomic]]
* [[Returning the old versus new value: fetch-and-add versus add-to-mem]]
* [[atomic RMW spinloops and CISCy instructions]]

Brett Davis

Oct 20, 2011, 6:25:07 PM
In article <4E87DB97...@SPAM.comp-arch.net>,
"Andy \"Krazy\" Glew" <an...@SPAM.comp-arch.net> wrote:

> http://semipublic.comp-arch.net/wiki/Load-linked/store-conditional_(LL/SC)
>
> === The best of both worlds? ===
>
> Let me end this section by describing a hybrid implementation of the
> CISC atomic RMWs that combines the best of both worlds wrt [[local and
> remote atomic RMWs]].
>
> [[LOCK FETCH-AND-OP mem]]
> oldval := [[load-locked/fetch-and-op]] M[addr]
> newtmp := oldval OP src1
> [[store-unlock/fetch-and-op]] M[addr] := newtmp
> return oldval
>
> If M[addr] is exclusive in the local cache, then the
> [[load-locked/fetch-and-op]] and [[store-unlock/fetch-and-op]]
> are equivalent to ordinary [[load-locked]] and [[store-unlock]].
>
> If M[addr] is not present in the local cache, then
> [[load-locked/fetch-and-op]] is equivalent to sending a remote atomic
> RMW fetch-and-op command to the external network. The
> [[store-unlock/fetch-and-op]] may then become a NOP (although it may be
> used for bookkeeping purposes).
>
> If M[addr] is present in the local cache in a shared state, then we
> *COULD* perform the atomic RMW both locally and remotely. The remote
> version might invalidate or update other copies. However, if the
> operation is supposed to be [[serializing]], then the local update
> cannot proceed without coordinating with the remote.

I am kind of perplexed that you spent so much time on what seem to
be completely obsolete lock primitives.
AMD has proposed (implemented?) a page locking system that is far
more efficient. In the real world to get work done you will need
several locks (each in its own page often?) and then you need to
access the data that is locked, in yet another page of memory.

By sharing the whole lock page you at least cut in half the number
of pages hit to do work, and half as many pages to copy from CPU to CPU.
Lots of complexities get replaced by page lock/unlock and cache flush.

I see no reason to include complex RMW ops in a new CPU arch.
Or am I missing something?

Andy "Krazy" Glew

Oct 22, 2011, 1:50:26 AM
(1) comp-arch.net is not just supposed to be new ideas. It is also
supposed to be a collection of important old ideas. Especially since the
wheel of reincarnation turns around and around in our business.

One of the big reasons I am trying to write comp-arch.net (which is now
on hold) is that I got tired of seeing old ideas recycled and claimed
to be new. This happens particularly for ideas that went a bit beyond
the state of the art at the last turn of the wheel of reincarnation -
ideas that never got into textbooks, might not have been patented, and
so get reinvented 20 years later.

(2) Although I am not so sure as you are that plain old locking is now
obsolete.

Background: my MS thesis contained a survey of synchronization
primitives back in 1985. People back then told me that RMWs were
obsolete - that LL/SC and its multiple-location versions LLn/SCn,
and transactional memory like the IBM 801 original RISC memory system,
would eliminate locking.

Didn't happen then. If it's going to happen now, my question is "What's
changed? Why will this technology that failed to take off in the past
take off now? What's different?"

And, oh, yes, I have defined at least one transactional memory
protocol that was under serious consideration for implementation, as
well as several that were further away from consideration. And I have
been involved in supercomputing projects at massive scale.

I.e. I have a lot of experience in synchronization. But most of my
experience has been that sexy academic ideas are proposed - and then fail.

E.g. I once had an interesting long discussion with one of the original
IBM database people, who described why the old IBM 801 processor version
of implicitly locked memory failed. As he put it, it was a
dumb-programmer play: any good programmer with locks could beat 801 memory.
I fear the same thing may happen again with transactional memory. So
what has changed? Perhaps the number of dumb programmers writing
parallel code has changed.

By the way, the most promising sign of interest for TM is IBM's placing
it in BlueGene/Q.

It would be even more promising if Oracle(Sun) implemented TM. That
would be real world.

> AMD has proposed (implemented?) a page locking system that is far
> more efficient. In the real world to get work done you will need
> several locks (each in its own page often?) and then you need to
> access the data that is locked, in yet another page of memory.

I am not sure what you are talking about.

ASF - the "Advanced Synchronization Facility" Mitch has discussed here?
Or something else?

ASF is not page locked. ASF amounts to transactional memory, with a
small number of cache lines.

--

By the way, one of the things I observed in my MS thesis survey was that
there were a lot of proposals for "lock pages" and/or special pages
where a lot of locks would live, with optimized memory/cache protocols.
I am not aware of anyone proposing this now, but perhaps I missed
something.

These proposals failed, both back in the 1980s and in more recent
studies, because people like being able to colocate the locks with the
data. Heck, they don't even like the locks being in separate cache lines,
which is pretty much required for performance, let alone in separate pages.

The other thing is that locks often cover complex datastructures. They
don't cover just a few contiguous memory locations.

But, like I said, I am not aware of any "lock pages" proposal, or even
any "page locking" proposal, so I won't continue to discuss this. Cache
line granularity transactional memory, on the other hand, is popular, or
at least was a few years ago.

Brett Davis

Oct 22, 2011, 7:22:35 PM
In article <4EA259A2...@SPAM.comp-arch.net>,
"Andy \"Krazy\" Glew" <an...@SPAM.comp-arch.net> wrote:

> On 10/20/2011 3:25 PM, Brett Davis wrote:
> > In article<4E87DB97...@SPAM.comp-arch.net>,
> > I am kind of perplexed that you spent so much time on what seems to
> > be completely obsolete lock primitives.
>
> (1) comp-arch.net is not just supposed to be new ideas. It is also
> supposed to be a collection of important old ideas. Especially since the
> wheel of reincarnation turns around and around in our business.
>
You are right of course, it was cache lines, and with some limits:
http://developer.amd.com/tools/ASF/Pages/default.aspx

This seems to be better than LLn/SCn and other schemes for complex
situations where you need more than one 32-bit primitive.
I expect that a lot of the work of any lock revolves around cache lines,
and I wondered if ASF was the future for all locks.
Pipelines and RMW opcodes are conflicting concepts; lots of magic is
involved to make them work together. Easy to see why people grasp at
transactional memory.
Interesting to hear that the IBM 801 tried something similar and failed.

> By the way, one of the things I observed in my MS thesis survey was that
> there were a lot of proposals for "lock pages" and/or special pages
> where a lot of locks would live, with optimized memory/cache protocols.
>
> These proposals failed, both back in the 1980s, but also in more recent
> studies, because people like being able to colocate the locks with the
> data. Heck, they don't like the locks being in separate cache lines,
> which is pretty much required for performance, let alone in separate pages.
>
> The other thing is that locks often cover complex datastructures. They
> don't cover just a few contiguous memory locations.

In the last system (or wrapper) I dealt with, you had to request a
special pointer that was returned, so colocating the locks with the
data was not possible.
ASF was like offering a class of water to someone in a desert.

This post of yours is one of the best you have ever made, I thank you.
(Or one of many great posts you have made, to be more accurate.)

Brett

Terje Mathisen

Oct 23, 2011, 7:53:53 AM
Brett Davis wrote:
> In the last system (or wrapper) I dealt with you had to request special
> pointer that was returned, so collocating with data was not possible.
> ASF was like offering a class of water to someone in a desert.

Nice! :-)

A "class of water" is probably most useful for object-oriented
programmers lost in said desert?

>
> This post of yours is one of the best you have ever made, I thank you.
> (Or one of many great posts you have made, to be more accurate.)

Andy is a very nice guy, in several ways.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

George Neuner

Oct 24, 2011, 11:34:30 AM
On Fri, 21 Oct 2011 22:50:26 -0700, "Andy \"Krazy\" Glew"
<an...@SPAM.comp-arch.net> wrote:

>(1) comp-arch.net is not just supposed to be new ideas. It is also
>supposed to be a collection of important old ideas. Especially since the
>wheel of reincarnation turns around and around in our business.
>
>One of the big reasons I am trying to write comp-arch.net (which is now
>on hold) is that I got tired of seeing old ideas be recycled, and
>claimed to be new. This happens particularly for ideas that went a bit
>beyond the state of the art at the last turn of the wheel of
>reincarnation, so that they never got into textbooks, and might not have
>been patented, which get reinvented 20 years later.

Yes. And btw, "Thank you" for your effort!

I, for one, am very tired of seeing patents pop out on things that I
learned about 25 years ago. I'm in software so I don't see too many
hardware patents, but I imagine that the situation in hardware is just
as bad as in software.

George

Terje Mathisen

Oct 24, 2011, 11:47:49 AM
In some ways even worse:

Dating back to the time when sw patents were still impossible in the US,
lots of textbook sw algorithms were documented, right?

With VLSI several companies realized that if they simply implemented one
of those algorithms in hw, they could indeed get a patent on it.

I was once asked to audit a group of 10 patents for a lawsuit. Of those,
4 were _really_ obvious, on the order of "the only answer to be expected
of a first-year CS or EE student"; another 4 were textbook (i.e. Knuth
or similar) algorithms in hw/fw; and only the final pair actually
contained something I was willing to view as an actual idea.

I reported this and never heard back from them, so I presume these
lawyers represented the suing company, not the defendant. :-)

MitchAlsup

Oct 26, 2011, 9:10:52 PM
On Saturday, October 22, 2011 12:50:26 AM UTC-5, Andy Krazy Glew wrote:
> > AMD has proposed (implemented?) a page locking system that is far
> > more efficient. In the real world to get work done you will need
> > several locks (each in its own page often?) and then you need to
> > access the data that is locked, in yet another page of memory.
>
> I am not sure what you are talking about.
>
> ASF? - the Advanced Synchronization Facility" Mitch has discussed here?
> Or something else?
>
> ASF is not page locked. ASF amounts to transactional memory, with a
> small number of cache lines.

ASF is more like multiple simultaneous address-based locks on cache lines of data than a degenerate TM system.

Mitch

jacko

Oct 27, 2011, 10:57:03 AM
Maybe the best place to build in atomic RMW is as an XOR gate just after the DRAM sense amplifier (which is there for regeneration), and a latch of course.