Revival of pessimistic locking

already...@yahoo.com

unread,

Aug 31, 2016, 10:53:19 AM8/31/16

to

[inspired by off-topic discussion with David Brown on comp.lang.c]

ARM added atomic compare-and-swap and various fetch-and-modify to ARMv8.1
IBM added atomic compare-and-swap and various fetch-and-modify to POWER v3.0

On the other hand, I don't see Intel adding LL/SC to x86 anytime soon.
Yes, despite Haswell fiasco they didn't give up on TSX, but TSX
a) substantially different from LL/CC
b) is not sold as panacea for all SMP synchronization problems
c) quite likely will never work well

Rick C. Hodgin

unread,

Aug 31, 2016, 11:58:26 AM8/31/16

to

x86 has CMPXCHG and CMPXCHG8B (and CMPXCHGQ on AMD64)? Prefix those
with LOCK and you have a system-wide atomic LL/SC:

https://courses.engr.illinois.edu/ece390/archive/spr2002/books/labmanual/inst-ref-cmpxchg.html

"The LOCK prefix prevents another processor doing anything in the
middle of this operation: it guarantees atomicity."

Best regards,
Rick C. Hodgin

Chris M. Thomasson

unread,

Aug 31, 2016, 12:24:30 PM8/31/16

to

On 8/31/2016 8:58 AM, Rick C. Hodgin wrote:
> On Wednesday, August 31, 2016 at 10:53:19 AM UTC-4, already...@yahoo.com wrote:
>> [inspired by off-topic discussion with David Brown on comp.lang.c]
>>
>> ARM added atomic compare-and-swap and various fetch-and-modify to ARMv8.1
>> IBM added atomic compare-and-swap and various fetch-and-modify to POWER v3.0
>>
>> On the other hand, I don't see Intel adding LL/SC to x86 anytime soon.
>> Yes, despite Haswell fiasco they didn't give up on TSX, but TSX
>> a) substantially different from LL/CC
>> b) is not sold as panacea for all SMP synchronization problems
>> c) quite likely will never work well
>
> x86 has CMPXCHG and CMPXCHG8B (and CMPXCHGQ on AMD64)? Prefix those
> with LOCK and you have a system-wide atomic LL/SC:

No. LL/SC is behaves differently than CAS! They are different things.

Rick C. Hodgin

unread,

Aug 31, 2016, 12:30:31 PM8/31/16

to

On Wednesday, August 31, 2016 at 12:24:30 PM UTC-4, Chris M. Thomasson wrote:
> On 8/31/2016 8:58 AM, Rick C. Hodgin wrote:
> > On Wednesday, August 31, 2016 at 10:53:19 AM UTC-4, already...@yahoo.com wrote:
> >> [inspired by off-topic discussion with David Brown on comp.lang.c]
> >>
> >> ARM added atomic compare-and-swap and various fetch-and-modify to ARMv8.1
> >> IBM added atomic compare-and-swap and various fetch-and-modify to POWER v3.0
> >>
> >> On the other hand, I don't see Intel adding LL/SC to x86 anytime soon.
> >> Yes, despite Haswell fiasco they didn't give up on TSX, but TSX
> >> a) substantially different from LL/CC
> >> b) is not sold as panacea for all SMP synchronization problems
> >> c) quite likely will never work well
> >
> > x86 has CMPXCHG and CMPXCHG8B (and CMPXCHGQ on AMD64)? Prefix those
> > with LOCK and you have a system-wide atomic LL/SC:
>
> No. LL/SC is behaves differently than CAS! They are different things.

You could offer to teach me the difference rather than just pointing out
where I'm wrong.

> > https://courses.engr.illinois.edu/ece390/archive/spr2002/books/labmanual/inst-ref-cmpxchg.html
> >
> > "The LOCK prefix prevents another processor doing anything in the
> > middle of this operation: it guarantees atomicity."

I just read the mechanics of the operation and I don't see the difference
since they are paired, and since the x86 CAS is bounded by a LOCK operation.

Is the pairing in LL/SC not required to be via sequential instructions?

Anton Ertl

unread,

Aug 31, 2016, 12:53:03 PM8/31/16

to

already...@yahoo.com writes:
>On the other hand, I don't see Intel adding LL/SC to x86 anytime soon.
>Yes, despite Haswell fiasco they didn't give up on TSX, but TSX
>a) substantially different from LL/CC

In what way? When I read the description of TSX, in particular RTM, I
thought that this was a less restricted form of LL/SC.

>b) is not sold as panacea for all SMP synchronization problems

Sure it may only provide a fast path for the uncontended case, but
that's what has been missing.

>c) quite likely will never work well

Why do you think so? It seems bug-prone, but there seem to be CPU
variants where it works (Haswell-EX, non-early Broadwells and
Skylakes). Maybe that's just teething problems of a new technology,
and once they know what to test for, they will get it right on the
first stepping.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Chris M. Thomasson

unread,

Aug 31, 2016, 12:54:29 PM8/31/16

to

CAS behaves differently than LL/SC because it compares the value of the
target memory location with the comparand value, if they are the same
then the update occurs. So, if the target memory location changes from
A, to B and back to A, then a CAS will succeed if the comparand is
looking for A. This is not the case with LL/SC. It is looking for
changes in the reservation granule. So, if you do a LL on a target
location, and it changes from A, to B and back to A, then the SC will
fail. Also, if something else touches something on the reservation
granule, it will fail. It can also spuriously fail. This is very
different behavior than CAS.

LL/SC can be prone to live-lock conditions; great care has to be applied
when working with it.

Anne & Lynn Wheeler

unread,

Aug 31, 2016, 3:10:38 PM8/31/16

to

already...@yahoo.com writes:
> ARM added atomic compare-and-swap and various fetch-and-modify to ARMv8.1
> IBM added atomic compare-and-swap and various fetch-and-modify to POWER v3.0

Charlie invented Compare-And-Swap (name chosen because CAS are Charlie's
initials) at the science center when he was working on fine-grain
multiprocessor locking for CP67. Then attempted to get CAS added to 370
... but was initially rejected because the POK favorite son operating
system people claimed that test-and-set (from 360 multiprocessor) was
more than sufficient. The 370 architecture owners said that in order to
justify Compare-And-Swap addition to 370, uses other than kernel
multiprocessor locking were needed. Thus were born the examples for
multithreaded applications (whether multiprocessor or not) that continue
to exist down thru the ages in numerous generations of principles of
operation appendix.

During the 70s and 80s ... there was increasing use of CAS by
multithreaded applications ... like large DBMS implementations ... and
also started to see it added to other processor architectures.

801/RISC started out claiming single processor only and no instructions
would take more than single cycle ... and totally rejected CAS
implementation. However, with porting of RDBMS to RS/6000 power ... the
lack of CAS significantly hurt throughput. As a temporary stop-gap a CAS
simulation was added to the supervisor call, first-level-interrupt
handler ... that would simulate CAS (while disabled for interrupts and
immediately return/resume, RS/6000 POWER was single processor so didn't
have to consider serializing other processors). Later power/pc with
multiprocessor support and more (multi-threaded) business applications,
CAS become increasingly critical.

trivia ... some place from long ago and far away ... I have the CAS
simulation code from the RS/6000 FLIH.

--
virtualization experience starting Jan1968, online at home since Mar1970

Chris M. Thomasson

unread,

Sep 1, 2016, 12:27:30 AM9/1/16

to

On 8/31/2016 12:10 PM, Anne & Lynn Wheeler wrote:
>
> already...@yahoo.com writes:
>> ARM added atomic compare-and-swap and various fetch-and-modify to ARMv8.1
>> IBM added atomic compare-and-swap and various fetch-and-modify to POWER v3.0
>
> Charlie invented Compare-And-Swap (name chosen because CAS are Charlie's
> initials) at the science center when he was working on fine-grain
> multiprocessor locking for CP67. Then attempted to get CAS added to 370
> ... but was initially rejected because the POK favorite son operating
> system people claimed that test-and-set (from 360 multiprocessor) was
> more than sufficient. The 370 architecture owners said that in order to
> justify Compare-And-Swap addition to 370, uses other than kernel
> multiprocessor locking were needed. Thus were born the examples for
> multithreaded applications (whether multiprocessor or not) that continue
> to exist down thru the ages in numerous generations of principles of
> operation appendix.

Yes, and using the version counter wrt DWCAS for the ABA problem,
(Double-Width Compare-And-Swap) is very useful; its still in the
Appendix A "Multiprogramming and Multiprocessing Examples", Free-Pool
Manipulation, as Compare Double and Swap (CDS) instruction; FWIW, I call
this DWCAS. ;^)

BTW, did Charlie also invent the idea of using an extra counter to make
the CAS fail when an A, to B, back to A encounter occurs? To avoid
destroying the data-structure?

> During the 70s and 80s ... there was increasing use of CAS by
> multithreaded applications ... like large DBMS implementations ... and
> also started to see it added to other processor architectures.
>
> 801/RISC started out claiming single processor only and no instructions
> would take more than single cycle ... and totally rejected CAS
> implementation. However, with porting of RDBMS to RS/6000 power ... the
> lack of CAS significantly hurt throughput. As a temporary stop-gap a CAS
> simulation was added to the supervisor call, first-level-interrupt
> handler ... that would simulate CAS (while disabled for interrupts and
> immediately return/resume, RS/6000 POWER was single processor so didn't
> have to consider serializing other processors). Later power/pc with
> multiprocessor support and more (multi-threaded) business applications,
> CAS become increasingly critical.

Wonderful! Thank you so much for this great information.

>
> trivia ... some place from long ago and far away ... I have the CAS
> simulation code from the RS/6000 FLIH.

If you can find it, I would love to take a look. :^)

Thanks again.

already...@yahoo.com

unread,

Sep 1, 2016, 4:37:00 AM9/1/16

to

Yes, very educating.
Somehow until this post I didn't realize that Power in general and RS/6000 computers in particular were late to cache-coherent shared memory game.
According to Wikipedia:
Servers:
Type 7012 Model G30 PowerPC 601 (2 or 4)1994-10-04 (Tower)
Type 7013 Model J30 PowerPC 601 (2 or 4)1994-10-04 (Deskside)
Type 7013 Model R30 PowerPC 601 (2 or 4)1994-10-04 (Rack mount)
Workstations:
Type 7043 Model 43P-240 (dual PPC 604e) introduced 1996-10-08.

Rick C. Hodgin

unread,

Sep 1, 2016, 9:07:28 AM9/1/16

to

On Wednesday, August 31, 2016 at 10:53:19 AM UTC-4, already...@yahoo.com wrote:

I have offered up this possible solution for LibSF 386-x40:

https://groups.google.com/d/msg/comp.arch/__PFd1_HBLA/Qn2vJJgcAgAJ

nedbrek

unread,

Sep 1, 2016, 9:12:16 AM9/1/16

to

Note that lock will only "prevent" other processors from acting on the same block of memory (in some cases, the processor doing the locked op will need to replay its operations when it detects interference).

IIRC, P5 was the last processor to use a lock bit on the bus. (Again, IIRC) P6 moved to "cache lock" - the locking processor gets ownership of the associated line(s) in its cache and does its operation before allowing anyone else to snoop the line(s). Operations to other lines are not affected.

HTH
Ned

Anne & Lynn Wheeler

unread,

Sep 1, 2016, 2:38:14 PM9/1/16

to

"Chris M. Thomasson" <inv...@invalid.invalid> writes:
> If you can find it, I would love to take a look. :^)

from long ago and far away

#
# svc 1 = compare and swap (see cs below)
# if a page-fault occurs on load or store we back-track
# to .cs in ds_flih. a wild-branch to the middle of the
# code will either take a program exception on the rfsvc
# or look like a storage exception at the beginning of
# .cs .
#
.org real0+0x1020 # svc 1 comes here.
l r6,0(r3) # load the word
cmp 0,r4,r6 # compare the word
bne 0,noswap # branch if different
st r5,0(r3) # swap word
cal r3,0(r0) # r3 = equal retcode
rfsvc # return
noswap: cal r3,1(r0) # r3 = noswap retcode
rfsvc # return

... snip ...

Anne & Lynn Wheeler

unread,

Sep 1, 2016, 7:30:18 PM9/1/16

to

"Chris M. Thomasson" <inv...@invalid.invalid> writes:

> BTW, did Charlie also invent the idea of using an extra counter to
> make the CAS fail when an A, to B, back to A encounter occurs? To
> avoid destroying the data-structure?

the original CAS examples were simpler structures ... that didn't
require version sequencing (single threaded list, double threaded list,
counters, etc).

first time I remember working with similar version sequencing was late
80s for our HA/CMP project ... and cluster scaleup ... working with
national labs on technical and scientific and RDBMS vendors on
commercial scaleup.

Some of the RDBMS vendors had vax/cluster support in the same source
base as the support for other platforms. The easiest way to extend that
support to HA/CMP was doing a DLM that implemented vax/cluster
semantics. However the RDBMS vendors had some thots on how to
significantly improve over vax/cluster implementation ... and I had some
ideas dating back to having worked on the original sql/relational
implementation.

A non-cluster RDBMS improvement was fast commit ... release record-level
lock as soon as the corresponding journal record had been written
... even though the DBMS record hasn't been written (and roll journal
forward in case of recovery). The vax/cluster implementation wouldn't
pass the lock to another cluster processor until the corresponding DBMS
record had been written, and it would then have to read the DBMS from
disk. So I sought to extend "fast commit" to cluster operation
... piggy-backing the record in transmission granting the lock to a
different processor. This then requires merging journal records from the
different cluster processors in the original transaction execution
order. Because didn't have cluster wide synchronized timer, went with
transaction version number ... that was transmitted with granting the
lock and current record value (academic papers sometimes refer to it as
virtual time in recovery with distributed logs/journals).

trivia ... old reference to Jan1992 meeting in Ellison's conference room
regarding cluster scaleup
http://manana.garlic.com/~lynn/95.html#13

then for a number of reasons, over a couple weeks, cluster scaleup was
transferred, announced as supercomputer (for technical and scientific
*ONLY*) and we were told we couldn't work on anything with more than
four processors. Old press item from 17Feb1992:
http://manana.garlic.com/~lynn/2001n.html#6000clusters1
and later from press from 11May1992
http://manana.garlic.com/~lynn/2001n.html#6000clusters2

Chris M. Thomasson

unread,

Sep 1, 2016, 9:47:13 PM9/1/16

to

On 9/1/2016 4:30 PM, Anne & Lynn Wheeler wrote:
> "Chris M. Thomasson" <inv...@invalid.invalid> writes:
>> BTW, did Charlie also invent the idea of using an extra counter to
>> make the CAS fail when an A, to B, back to A encounter occurs? To
>> avoid destroying the data-structure?
>
> the original CAS examples were simpler structures ... that didn't
> require version sequencing (single threaded list, double threaded list,
> counters, etc).

Great info; thank you. FWIW, one can get around version counters in a
singly linked list if one uses a simple exchange instruction for the
consumer, that returns the head of the list, and atomically sets it to NULL.

So the consumer would return the whole list, where AXCHG is atomic
exchange...

Simple Example for the pop method:
___________________________________
struct node
{
node* next;
};

static node g_head = { NULL };

node* pop()
{
node* h = AXCHG(&g_head, NULL);
if (h) MEMBAR(acquire);
return h;
}
___________________________________

Now, no need for version counter!

;^)

Thank you for all of this! I really do appreciate it.

Robert Wessel

unread,

Sep 7, 2016, 6:58:41 PM9/7/16

to

No, they don't. There are, however, usually limits on what you can do
between the LL and SC.