sync/atomic

mau...@murus.org

unread,

Oct 26, 2011, 6:46:45 PM10/26/11

to golang-nuts

Dear developers,

I would like to suggest some extensions to "sync/atomic",
which are useful for synchronisation algorithms by spinlocks.

1.

func TestAndSet (n *int) int {
// semantics:
// if *n == 1 {
// return 1
// }
// *n = 1
// return 0
}

Reason: Nice implementation of Lock/Unlock:

var locked int

func Lock() {
for TestAndSet (locked) == 1 {
// do nothing, maybe sleep some usecs
}

func Unlock() {
locked = 0
}

2.

Modified version of CompareAndSwap (as in IBM series 370):

func CSW (i, k *int, n int) int
// semantics:
// if *i == *k {
// *k = n
// return true
// }
// *k = *i
// return false

3.

Modified version of AddInt:

func FetchAndAdd (k *int, n int) int
// semantics:
// old:= *k
// *k += n
// return old

Reason:
Fair implementation of Lock/Unlock by the
Ticket spinlock algorithm (modulo overflow of counter)

var ticket, turn int

function Lock() {
myturn:= FetchAndAdd(&ticket, 1)
for myturn!= turn {
// do nothing, maybe sleep some usecs
}
}

func Unlock() {
turn++
}

4.

func Exchange (k, n *int)
// semantics:
// *k, *n = *n, *k

Reason:
Another nice spinlock implementation of Lock/Unlock:

var global int

func Lock() {
local:= 1
for local == 1 {
Exchange (&global, &local)
}
}

func Unlock() {
local:= 0
Exchange (&global, &local)
}

With kind regards,

Christian Maurer
http://murus.org

Ian Lance Taylor

unread,

Oct 26, 2011, 8:06:48 PM10/26/11

to mau...@murus.org, golang-nuts

"mau...@murus.org" <mau...@murus.org> writes:

> func TestAndSet (n *int) int {
> // semantics:
> // if *n == 1 {
> // return 1
> // }
> // *n = 1
> // return 0
> }

Isn't this simply
! CompareAndSwapInt32(n, 0, 1)
?

> Modified version of CompareAndSwap (as in IBM series 370):
>
> func CSW (i, k *int, n int) int
> // semantics:
> // if *i == *k {
> // *k = n
> // return true
> // }
> // *k = *i
> // return false

As far as I know, modern processors do not have an atomic implementation
of this operation. I'm actually not sure how to implement it atomically
without using some sort of lock.

> Modified version of AddInt:
>
> func FetchAndAdd (k *int, n int) int
> // semantics:
> // old:= *k
> // *k += n
> // return old

Isn't this just
AddInt32(k, n) - n
?

> func Exchange (k, n *int)
> // semantics:
> // *k, *n = *n, *k

Again, as far as I know, modern processors do not have an atomic
implementation of this operation.

Ian

Paul Borman

unread,

Oct 26, 2011, 8:28:11 PM10/26/11

to Ian Lance Taylor, mau...@murus.org, golang-nuts

On Wed, Oct 26, 2011 at 5:06 PM, Ian Lance Taylor <ia...@google.com> wrote:

"mau...@murus.org" <mau...@murus.org> writes:

> Modified version of CompareAndSwap (as in IBM series 370):
>
> func CSW (i, k *int, n int) int
> // semantics:
> // if *i == *k {
> // *k = n
> // return true
> // }
> // *k = *i
> // return false

As far as I know, modern processors do not have an atomic implementation
of this operation. I'm actually not sure how to implement it atomically
without using some sort of lock.

You shouldn't need to lock the last operation (copying the existing value). You can just say:

for {
tmp := *i

if CompareAndSwapInt32(i, *k, n) {
return true

}
if tmp != *k {

*k = tmp
return false

}
}

The only reason for the loop is so you never return false and leave *k unchanged.

It doesn't matter what value tmp has, really, just that it was *a* value of *i that was not *k. You just start playing games like "well, we will pretend the cas failed when *i was still tmp" meaning that if *i changes between tmp & cas we pretend we were called a little earlier. If it changes after the cas but before we return it is the same as someone else's cas working (which in fact is the case). Time is a relative thing :-)

Dmitry Vyukov

unread,

Oct 27, 2011, 2:34:23 AM10/27/11

to mau...@murus.org, golang-nuts

On Thu, Oct 27, 2011 at 2:46 AM, mau...@murus.org <mau...@murus.org> wrote:
> Dear developers,
>
> I would like to suggest some extensions to "sync/atomic",
> which are useful for synchronisation algorithms by spinlocks.
>
> 1.
>
> func TestAndSet (n *int) int {
> // semantics:
> // if *n == 1 {
> // return 1
> // }
> // *n = 1
> // return 0
> }
>
> Reason: Nice implementation of Lock/Unlock:
>
> var locked int
>
> func Lock() {
> for TestAndSet (locked) == 1 {
> // do nothing, maybe sleep some usecs
> }
>
> func Unlock() {
> locked = 0
> }

It's better to provide more general Exchange operation.

for Exchange(locked, 1) == 1 {

// do nothing, maybe sleep some usecs
}

But it is useful beyond that.

> 2.
>
> Modified version of CompareAndSwap (as in IBM series 370):
>
> func CSW (i, k *int, n int) int
> // semantics:
> // if *i == *k {
> // *k = n
> // return true
> // }
> // *k = *i
> // return false

It is really nice. I would do that in the first place.

> 3.
>
> Modified version of AddInt:
>
> func FetchAndAdd (k *int, n int) int
> // semantics:
> // old:= *k
> // *k += n
> // return old
>
> Reason:
> Fair implementation of Lock/Unlock by the
> Ticket spinlock algorithm (modulo overflow of counter)
>
> var ticket, turn int
>
> function Lock() {
> myturn:= FetchAndAdd(&ticket, 1)
> for myturn!= turn {
> // do nothing, maybe sleep some usecs
> }
> }
>
> func Unlock() {
> turn++
> }

You just do:
myturn := Add(&ticket, 1)-1

> 4.
>
> func Exchange (k, n *int)
> // semantics:
> // *k, *n = *n, *k
>
> Reason:
> Another nice spinlock implementation of Lock/Unlock:
>
> var global int
>
> func Lock() {
> local:= 1
> for local == 1 {
> Exchange (&global, &local)
> }
> }
>
> func Unlock() {
> local:= 0
> Exchange (&global, &local)
> }

OK, then TestAndSet is just not required.

Dmitry Vyukov

unread,

Oct 27, 2011, 2:36:31 AM10/27/11

to Ian Lance Taylor, mau...@murus.org, golang-nuts

On Thu, Oct 27, 2011 at 4:06 AM, Ian Lance Taylor <ia...@google.com> wrote:
> "mau...@murus.org" <mau...@murus.org> writes:
>
>> func TestAndSet (n *int) int {
>> // semantics:
>> // if *n == 1 {
>> // return 1
>> // }
>> // *n = 1
>> // return 0
>> }
>
> Isn't this simply
> ! CompareAndSwapInt32(n, 0, 1)
> ?
>
>
>> Modified version of CompareAndSwap (as in IBM series 370):
>>
>> func CSW (i, k *int, n int) int
>> // semantics:
>> // if *i == *k {
>> // *k = n
>> // return true
>> // }
>> // *k = *i
>> // return false
>
> As far as I know, modern processors do not have an atomic implementation
> of this operation. I'm actually not sure how to implement it atomically
> without using some sort of lock.

IA-32/Intel64 CMPXCHG works exactly this way. And it is the best CAS
interface out there (C++11 uses it).

>> Modified version of AddInt:
>>
>> func FetchAndAdd (k *int, n int) int
>> // semantics:
>> // old:= *k
>> // *k += n
>> // return old
>
> Isn't this just
> AddInt32(k, n) - n
> ?
>
>
>> func Exchange (k, n *int)
>> // semantics:
>> // *k, *n = *n, *k
>
> Again, as far as I know, modern processors do not have an atomic
> implementation of this operation.

IA-32/Intel64 XCHG

David Symonds

unread,

Oct 27, 2011, 2:39:46 AM10/27/11

to Dmitry Vyukov, Ian Lance Taylor, mau...@murus.org, golang-nuts

On Thu, Oct 27, 2011 at 5:36 PM, Dmitry Vyukov <dvy...@google.com> wrote:

>>> func Exchange (k, n *int)
>>> // semantics:
>>> // *k, *n = *n, *k
>>
>> Again, as far as I know, modern processors do not have an atomic
>> implementation of this operation.
>
>
> IA-32/Intel64 XCHG

XCHG only allows one operand to be a memory location, I believe.

Dave.

Dmitry Vyukov

unread,

Oct 27, 2011, 2:48:02 AM10/27/11

to Paul Borman, Ian Lance Taylor, mau...@murus.org, golang-nuts

On Thu, Oct 27, 2011 at 4:28 AM, Paul Borman <bor...@google.com> wrote:
>
>
> On Wed, Oct 26, 2011 at 5:06 PM, Ian Lance Taylor <ia...@google.com> wrote:
>>
>> "mau...@murus.org" <mau...@murus.org> writes:
>>
>> > Modified version of CompareAndSwap (as in IBM series 370):
>> >
>> > func CSW (i, k *int, n int) int
>> > // semantics:
>> > // if *i == *k {
>> > // *k = n
>> > // return true
>> > // }
>> > // *k = *i
>> > // return false
>>
>> As far as I know, modern processors do not have an atomic implementation
>> of this operation. I'm actually not sure how to implement it atomically
>> without using some sort of lock.
>>
>
> You shouldn't need to lock the last operation (copying the existing value).
> You can just say:
> for {
> tmp := *i
> if CompareAndSwapInt32(i, *k, n) {
> return true
> }
> if tmp != *k {
> *k = tmp
> return false
> }
> }

Lock is irrelevant here (processor loads the current value under lock
anyway). What is relevant is that you should never touch any mutable
shared location more than absolute minimum of times and CAS must
return the most actual value. In your implementation you touch the
location 3 times (read - it is now 3 times more expensive), then you
always return 100% outdated value, in the end if tmp != *k why you
execute CAS at all?

The CAS interface must be
CompareAndSwapInt32(addr, cmp *int32, xchg int32) bool
It is supported in hardware (IA-32/Intel64) in this form, and now it
is default interface in C1x/C++11 as well.

Dmitry Vyukov

unread,

Oct 27, 2011, 2:51:39 AM10/27/11

to David Symonds, Ian Lance Taylor, mau...@murus.org, golang-nuts

On Thu, Oct 27, 2011 at 10:39 AM, David Symonds <dsym...@golang.org> wrote:
>>>> func Exchange (k, n *int)
>>>> // semantics:
>>>> // *k, *n = *n, *k
>>>
>>> Again, as far as I know, modern processors do not have an atomic
>>> implementation of this operation.
>>
>>
>> IA-32/Intel64 XCHG
>
> XCHG only allows one operand to be a memory location, I believe.

Definitely. I just understand it as 'n' is a local var, that is, the
interface as if:
func Exchange (k *int, n int) int
If both operands are shared memory locations, then it's indeed not
supported in commodity hardware, and it's not worth adding it to
atomic package.

Russ Cox

unread,

Oct 27, 2011, 11:27:59 AM10/27/11

to Dmitry Vyukov, Ian Lance Taylor, mau...@murus.org, golang-nuts

On Wed, Oct 26, 2011 at 23:36, Dmitry Vyukov <dvy...@google.com> wrote:
>>> Modified version of CompareAndSwap (as in IBM series 370):
>>>
>>> func CSW (i, k *int, n int) int
>>> // semantics:
>>> // if *i == *k {
>>> // *k = n
>>> // return true
>>> // }
>>> // *k = *i
>>> // return false
>>
>> As far as I know, modern processors do not have an atomic implementation
>> of this operation. I'm actually not sure how to implement it atomically
>> without using some sort of lock.
>
> IA-32/Intel64 CMPXCHG works exactly this way.

I don't believe this is true. CMPXCHG uses one memory operand,
while this operation uses two. If you simplify this to just the single
memory operand:

func CSW(i *int, k, n int) int

then it is no different in power than the existing CompareAndSwap.

The original poster may have intended that the access to *k not be
atomic, but that is quite confusing to explain: one memory access
is atomic but this other one is not. The existing routine has the same
power as this half-atomic CSW but is easier to describe.

Russ

Dmitry Vyukov

unread,

Oct 27, 2011, 12:11:17 PM10/27/11

to r...@golang.org, Ian Lance Taylor, mau...@murus.org, golang-nuts

IBM370 CAS makes non-atomic access to *k.
With current CompareAndSwap one has to constantly reload the value
which is inconvenient and slow.

Russ Cox

unread,

Oct 28, 2011, 12:54:54 AM10/28/11

to Dmitry Vyukov, Ian Lance Taylor, mau...@murus.org, golang-nuts

On Thu, Oct 27, 2011 at 09:11, Dmitry Vyukov <dvy...@google.com> wrote:
> IBM370 CAS makes non-atomic access to *k.
> With current CompareAndSwap one has to constantly reload the value
> which is inconvenient and slow.

How slow is it compared to the atomic parts?
It seems like that should be in the noise.

Dmitry Vyukov

unread,

Oct 28, 2011, 6:13:46 AM10/28/11

to r...@golang.org, Ian Lance Taylor, mau...@murus.org, golang-nuts

That's a common mistake. Atomic part has moderate cost (~30 cycles now, and constantly decreased, ultimately to 0) and is perfectly scalable. It's access to mutable shared state part that is costly (hundreds to thousands cycles) and completely non scalable:
http://www.1024cores.net/home/lock-free-algorithms/first-things-first

Here is a simple benchmark - concurrent pop from a stack implemented with re-reading stack head and using what CAS returned as current value:

for(;;) {
#if 1
n = *head;
if(n == 0)
break;
if(__sync_bool_compare_and_swap(head, n, n->next))
sum += n->data;
#else
if(n == 0)
break;
next = n->next;
nn = __sync_val_compare_and_swap(head, n, next);
if(n == nn) {
sum += n->data;
n = next;
} else {
n = nn;
}
#endif
}

With 4 threads I get the following results:

__sync_bool_compare_and_swap:
1.951s
1.936s
1.936s
__sync_val_compare_and_swap:
1.107s
1.120s
1.122s

Note that the compiler generates very nice tight code for __sync_bool_compare_and_swap (in particular it uses Z flag set by CMPXCHG):

0000000000400690 <thrfunc>:

400690: 48 8b 07 mov (%rdi),%rax

400693: 31 f6 xor %esi,%esi

400695: 0f 1f 00 nopl (%rax)

400698: 48 8b 17 mov (%rdi),%rdx

40069b: 48 85 d2 test %rdx,%rdx

40069e: 74 19 je 4006b9 <thrfunc+0x29>

4006a0: 48 8b 0a mov (%rdx),%rcx

4006a3: 48 89 d0 mov %rdx,%rax

4006a6: f0 48 0f b1 0f lock cmpxchg %rcx,(%rdi)

4006ab: 75 eb jne 400698 <thrfunc+0x8> // Use Z flag set by CMPXCHG

4006ad: 48 03 72 08 add 0x8(%rdx),%rsi

4006b1: 48 8b 17 mov (%rdi),%rdx //<---- ROOT OF ALL EVIL

4006b4: 48 85 d2 test %rdx,%rdx

4006b7: 75 e7 jne 4006a0 <thrfunc+0x10>

4006b9: 48 89 f0 mov %rsi,%rax

4006bc: c3 retq

4006bd: 0f 1f 00 nopl (%rax)

And here is code for __sync_val_compare_and_swap:

0000000000400690 <thrfunc>:

400690: 48 8b 17 mov (%rdi),%rdx

400693: 31 c0 xor %eax,%eax

400695: 31 f6 xor %esi,%esi

400697: 48 85 d2 test %rdx,%rdx

40069a: 75 0c jne 4006a8 <thrfunc+0x18>

40069c: eb 36 jmp 4006d4 <thrfunc+0x44>

40069e: 66 90 xchg %ax,%ax

4006a0: 4c 89 c2 mov %r8,%rdx

4006a3: 48 85 d2 test %rdx,%rdx

4006a6: 74 28 je 4006d0 <thrfunc+0x40>

4006a8: 48 8b 0a mov (%rdx),%rcx

4006ab: 48 89 d0 mov %rdx,%rax

4006ae: f0 48 0f b1 0f lock cmpxchg %rcx,(%rdi)

4006b3: 48 39 c2 cmp %rax,%rdx // <- Do additional compare

4006b6: 49 89 c0 mov %rax,%r8 // <--- RAX is the current value

4006b9: 75 e5 jne 4006a0 <thrfunc+0x10>

4006bb: 48 03 72 08 add 0x8(%rdx),%rsi

4006bf: 48 89 ca mov %rcx,%rdx

4006c2: 48 85 d2 test %rdx,%rdx

4006c5: 75 e1 jne 4006a8 <thrfunc+0x18>

4006c7: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)

4006ce: 00 00

4006d0: 48 89 f0 mov %rsi,%rax

4006d3: c3 retq

4006d4: f3 c3 repz retq

4006d6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)

4006dd: 00 00 00

If one implements it in assembly (like we do) then he can take advantage of both Z flag and current value.

Just in case, here is full benchmark code:

#include <pthread.h>

#include <stdlib.h>

#include <stdint.h>

#include <stdio.h>

#include <time.h>

struct node_t

{

struct node_t * volatile next;

uint64_t data;

};

static void* thrfunc(void* p)

{

struct node_t * volatile * head;

struct node_t *n, *nn, *next;

uint64_t sum;

sum = 0;

head = (struct node_t * volatile *)p;

n = *head;

for(;;) {

#if 1

n = *head;

if(n == 0)

break;

if(__sync_bool_compare_and_swap(head, n, n->next))

sum += n->data;

#else

if(n == 0)

break;

next = n->next;

nn = __sync_val_compare_and_swap(head, n, next);

if(n == nn) {

sum += n->data;

n = next;

} else {

n = nn;

}

#endif

}

return (void*)(uintptr_t)sum;

}

#define N 30000000

#define P 4

#define R 3

int main()

{

pthread_t th[P];

struct timespec t1, t2;

uint64_t t, sum1, sum2;

struct node_t *head, *n;

void *res;

int i, r;

for(r=0; r<R; r++) {

sum1 = sum2 = 0;

head = 0;

for(i=0; i<N; i++) {

n = (struct node_t*)malloc(sizeof(struct node_t));

n->data = i;

n->next = head;

head = n;

sum1 += i;

}

clock_gettime(CLOCK_MONOTONIC, &t1);

for(i=0; i<P; i++)

pthread_create(&th[i], 0, thrfunc, &head);

for(i=0; i<P; i++) {

pthread_join(th[i], &res);

sum2 += (uint64_t)(uintptr_t)res;

}

clock_gettime(CLOCK_MONOTONIC, &t2);

t = (t2.tv_sec*1000000000ull+t2.tv_nsec) - (t1.tv_sec*1000000000ull+t1.tv_nsec);

printf("%.3fs\n", t/1000000000.);

if(sum1 != sum2)

printf("wrong sum: %llu/%llu\n", (unsigned long long)sum1, (unsigned long long)sum2);

}

return 0;

}

Reply all

Reply to author

Forward