Re: code review 7455049: runtime: use lock-free ring for work scheduling (issue 7455049)

dvy...@google.com

unread,

Mar 8, 2013, 1:23:27 PM3/8/13

to devon...@gmail.com, r...@golang.org, minu...@gmail.com, da...@cheney.net, remyoud...@gmail.com, golan...@googlegroups.com, re...@codereview-hr.appspotmail.com

https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c
File src/pkg/runtime/proc.c (right):

https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode279
src/pkg/runtime/proc.c:279: if (runqput(m->p, gp) == false) {
Perhaps combine it in a separate function, because it's copied several
times.

https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode281
src/pkg/runtime/proc.c:281: globrunqput(gp);
it important to transfer a batch of G's to globrunq (half?)
consider that a goroutine spawns lots of goroutines in a loop, you do
not want to hit the slowpath every time

https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode355
src/pkg/runtime/proc.c:355:
runtime·fastatomicstore((uint32*)&runtime·gcwaiting, 1);
This needs full memory barrier between store to gcwaiting and load of
p->status, otherwise it can deadlock with entersyscall (both won't
notice the other).

https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode1014
src/pkg/runtime/proc.c:1014: gp = runqget(p);
"steal half" is required here.
consider that a goroutine spawns lots of other goroutines in a loop,
then other worker threads starts stealing 1 goroutine from slowpath each
time, this will lead to unnecessary contention and slowdown.

https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode1214
src/pkg/runtime/proc.c:1214: runtime·fastatomicstore(&m->p->status,
Psyscall);
full memory barrier is required, otherwise it will deadlock with
stoptheworld.

https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode1760
src/pkg/runtime/proc.c:1760: if(old < 0 || old > MaxGomaxprocs || new <=
0 || new > MaxGomaxprocs)
such edits better go in a separate patch

https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2042
src/pkg/runtime/proc.c:2042: if(runqput(p, gp1) == false) {
steal at most runqsize (128), then runqput() will always succeed

https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2087
src/pkg/runtime/proc.c:2087: if((delta & mask) == (start & mask))
Is it intended that you do not full the last slot?
If so, it can be:
if((delta^start)&mask))
the compiler is not particularly good at code optimization.

https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2093
src/pkg/runtime/proc.c:2093: runtime·fastatomicstore(&p->r_tail, delta);
atomicstorerelease should contain the necessary fence

https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2094
src/pkg/runtime/proc.c:2094:
please drop at least some of the empty lines

https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2107
src/pkg/runtime/proc.c:2107: start = runtime·atomicload(&p->r_head);
please use consistent maning: head/head instead of head/start (tail/end)

https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2109
src/pkg/runtime/proc.c:2109: /* TODO: load fence */
atomicload contains the fence

https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2114
src/pkg/runtime/proc.c:2114: /* TODO: load fence */
why is it needed?

https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2116
src/pkg/runtime/proc.c:2116: gp = p->ring[start & mask];
I think you can implement steal in a similar way, just copying a batch
of elements here and incrementing head by N instead on 1.

https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2116
src/pkg/runtime/proc.c:2116: gp = p->ring[start & mask];
I think you can implement steal() in a similar way, just copying a batch
of elements here and incrementing head by N instead of 1.

https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2118
src/pkg/runtime/proc.c:2118: /* TODO: full memory fence */
why is it needed?

https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2133
src/pkg/runtime/proc.c:2133: p.r_size = 8192;
please test overflow as well

https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/runtime.h
File src/pkg/runtime/runtime.h (right):

https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/runtime.h#newcode339
src/pkg/runtime/runtime.h:339: G** ring;
I think size 128 should be enough.
Embed it into P directly:
enum { RunqSize = 128 };
G* ring[RunqSize];
Then you don't need mask.

https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/runtime.h#newcode340
src/pkg/runtime/runtime.h:340: uint32 r_head;
Rename, the codebase does not use underscores.
Please refrain from unnecessary renaming, uint32 runqhead is fine. This
will eliminate lots of edits in proc.c

https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/runtime.h#newcode690
src/pkg/runtime/runtime.h:690: bool runtime·casvalue(uint32*, uint32,
uint32, uint32*);
Change cas(uint32*, uint32, uint32) to cas(uint32*, uint32*, uint32)
similar to cas64. Then you don't need the separate function.
This should go in a separate patch.

https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/runtime.h#newcode697
src/pkg/runtime/runtime.h:697: void runtime·fastatomicstore(uint32
volatile*, uint32);
rename to atomicstorerelease

https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/runtime.h#newcode738
src/pkg/runtime/runtime.h:738: void runtime·setmg(M*, G*);
Such changes better go in a separate patch.

https://codereview.appspot.com/7455049/

Devon H. O'Dell

unread,

Mar 12, 2013, 11:38:42 AM3/12/13

to devon...@gmail.com, dvy...@google.com, r...@golang.org, minu...@gmail.com, da...@cheney.net, remyoud...@gmail.com, golan...@googlegroups.com, re...@codereview-hr.appspotmail.com

Hi,

Haven't had a lot of time to fix this recently but I wanted to address
some of the comments.

2013/3/8 <dvy...@google.com>:

>
> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c
> File src/pkg/runtime/proc.c (right):
>
> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode279
> src/pkg/runtime/proc.c:279: if (runqput(m->p, gp) == false) {
> Perhaps combine it in a separate function, because it's copied several
> times.

That will get inlined, yes? With some of the other comments you made,
the number of times this is copied will probably decrease (as a result
of implementing the work stealing).

> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode281
> src/pkg/runtime/proc.c:281: globrunqput(gp);
> it important to transfer a batch of G's to globrunq (half?)
> consider that a goroutine spawns lots of goroutines in a loop, you do
> not want to hit the slowpath every time

That's fair -- didn't want to make this a bigger change than I needed
to. I'll look at doing this.

> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode355
> src/pkg/runtime/proc.c:355:
> runtime·fastatomicstore((uint32*)&runtime·gcwaiting, 1);
> This needs full memory barrier between store to gcwaiting and load of
> p->status, otherwise it can deadlock with entersyscall (both won't
> notice the other).

Good point. I'm going to remove the other places where I changed this
that aren't specifically related to the run queue. I didn't look at
independed load / store dependencies, and that's clearly showing.

> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode1014
> src/pkg/runtime/proc.c:1014: gp = runqget(p);
> "steal half" is required here.
> consider that a goroutine spawns lots of other goroutines in a loop,
> then other worker threads starts stealing 1 goroutine from slowpath each
> time, this will lead to unnecessary contention and slowdown.
>
> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode1214
> src/pkg/runtime/proc.c:1214: runtime·fastatomicstore(&m->p->status,
> Psyscall);
> full memory barrier is required, otherwise it will deadlock with
> stoptheworld.
>
> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode1760
> src/pkg/runtime/proc.c:1760: if(old < 0 || old > MaxGomaxprocs || new <=
> 0 || new > MaxGomaxprocs)
> such edits better go in a separate patch
>
> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2042
> src/pkg/runtime/proc.c:2042: if(runqput(p, gp1) == false) {
> steal at most runqsize (128), then runqput() will always succeed
>
> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2087
> src/pkg/runtime/proc.c:2087: if((delta & mask) == (start & mask))
> Is it intended that you do not full the last slot?
> If so, it can be:
> if((delta^start)&mask))
> the compiler is not particularly good at code optimization.

This check won't work. We can't enqueue a new value into the ring if
that would overflow the ring. So, if delta == start, we don't insert.
If we have a ring of size 4, the tail is at 0x3 and the head is at
0x0, this will end up being if(0&mask), which is of course always
false.

> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2093
> src/pkg/runtime/proc.c:2093: runtime·fastatomicstore(&p->r_tail, delta);
> atomicstorerelease should contain the necessary fence

The store fence is required to serialize the update of the tail with
respect to putting the entry in the ring, so it's not really a
question of whether this interface provides the fence. It's to make
sure that all other consumers do not read that the tail is updated
without being guaranteed to be able to read the entry at the tail.

> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2094
> src/pkg/runtime/proc.c:2094:
> please drop at least some of the empty lines
>
> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2107
> src/pkg/runtime/proc.c:2107: start = runtime·atomicload(&p->r_head);
> please use consistent maning: head/head instead of head/start (tail/end)

I will change these for consistency / to reduce diff size as suggested later.

> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2109
> src/pkg/runtime/proc.c:2109: /* TODO: load fence */
> atomicload contains the fence

This is to guarantee consistency of our view of the snapshot of the head.

> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2114
> src/pkg/runtime/proc.c:2114: /* TODO: load fence */
> why is it needed?

Same reason, guarantee we're looking at the proper state of the head
when we do the assignment from the ring.

> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2116
> src/pkg/runtime/proc.c:2116: gp = p->ring[start & mask];
> I think you can implement steal in a similar way, just copying a batch
> of elements here and incrementing head by N instead on 1.

It would be possible to do this with a memcpy indeed, but you have to
do the copy every iteration of the loop because head may have changed.
Another way I was thinking of doing it was just to "steal" the blocks
and do the copies once that was successful -- but the problem is that
another producer might invalidate the consistency of the ring by the
time the copy started. I'll do the memcpy approach for now.

> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2116
> src/pkg/runtime/proc.c:2116: gp = p->ring[start & mask];
> I think you can implement steal() in a similar way, just copying a batch
> of elements here and incrementing head by N instead of 1.
>
> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2118
> src/pkg/runtime/proc.c:2118: /* TODO: full memory fence */
> why is it needed?

To serialize loading with respect to updating the head poitner.

> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2133
> src/pkg/runtime/proc.c:2133: p.r_size = 8192;
> please test overflow as well

Good call. Will do.

> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/runtime.h
> File src/pkg/runtime/runtime.h (right):
>
> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/runtime.h#newcode339
> src/pkg/runtime/runtime.h:339: G** ring;
> I think size 128 should be enough.
> Embed it into P directly:
> enum { RunqSize = 128 };
> G* ring[RunqSize];
> Then you don't need mask.

Also good call.

> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/runtime.h#newcode340
> src/pkg/runtime/runtime.h:340: uint32 r_head;
> Rename, the codebase does not use underscores.
> Please refrain from unnecessary renaming, uint32 runqhead is fine. This
> will eliminate lots of edits in proc.c
>
> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/runtime.h#newcode690
> src/pkg/runtime/runtime.h:690: bool runtime·casvalue(uint32*, uint32,
> uint32, uint32*);
> Change cas(uint32*, uint32, uint32) to cas(uint32*, uint32*, uint32)
> similar to cas64. Then you don't need the separate function.
> This should go in a separate patch.

Then we pay the cost for assigning to the old value everywhere, even
when we don't need it. I think this just adds overhead.

Russ Cox

unread,

Mar 12, 2013, 11:45:08 AM3/12/13

to Devon H. O'Dell, Dmitry Vyukov, minux ma, Dave Cheney, Rémy Oudompheng, golang-dev, re...@codereview-hr.appspotmail.com

The C compiler does not inline.

Devon H. O'Dell

unread,

Mar 12, 2013, 11:50:25 AM3/12/13

to Russ Cox, Dmitry Vyukov, minux ma, Dave Cheney, Rémy Oudompheng, golang-dev, re...@codereview-hr.appspotmail.com

2013/3/12 Russ Cox <r...@golang.org>:

> The C compiler does not inline.

Maybe we should revisit this after Go 1.1.

I need help implementing support for the DMB instruction on ARM. I
don't understand the assembler and linker code enough to add the
instruction and this won't work on ARM without that. Additionally,
we'd need an interface for generating memory barriers. If the C
compiler doesn't inline, that's going to be a ton of overhead just to
generate sfence/lfence/dmb.

That in mind, I'm happy to work hard to get this going for the 1.1
release, but I think it's maybe going to take too much time from other
things that really need to go in.

--dho

Russ Cox

unread,

Mar 12, 2013, 11:55:59 AM3/12/13

to Devon H. O'Dell, Dmitry Vyukov, minux ma, Dave Cheney, Rémy Oudompheng, golang-dev, re...@codereview-hr.appspotmail.com

On Tue, Mar 12, 2013 at 11:50 AM, Devon H. O'Dell <devon...@gmail.com> wrote:

2013/3/12 Russ Cox <r...@golang.org>:

> The C compiler does not inline.

Maybe we should revisit this after Go 1.1.

I need help implementing support for the DMB instruction on ARM. I
don't understand the assembler and linker code enough to add the
instruction and this won't work on ARM without that. Additionally,
we'd need an interface for generating memory barriers. If the C
compiler doesn't inline, that's going to be a ton of overhead just to
generate sfence/lfence/dmb.

I am happy to add intrinsics if you demonstrate that the performance win is enough. We did this already with prefetch.

I very much doubt the C compiler will ever inline. For the tiny amount of C we have, it's just not worth the effort. It certainly can't inline calls to assembly functions that it can't see.

Russ

Devon H. O'Dell

unread,

Mar 12, 2013, 12:07:27 PM3/12/13

to Russ Cox, Dmitry Vyukov, minux ma, Dave Cheney, Rémy Oudompheng, golang-dev, re...@codereview-hr.appspotmail.com

2013/3/12 Russ Cox <r...@golang.org>:

> On Tue, Mar 12, 2013 at 11:50 AM, Devon H. O'Dell <devon...@gmail.com>
> wrote:
>>
>> 2013/3/12 Russ Cox <r...@golang.org>:
>> > The C compiler does not inline.
>>
>> Maybe we should revisit this after Go 1.1.
>>
>>
>> I need help implementing support for the DMB instruction on ARM. I
>> don't understand the assembler and linker code enough to add the
>> instruction and this won't work on ARM without that. Additionally,
>> we'd need an interface for generating memory barriers. If the C
>> compiler doesn't inline, that's going to be a ton of overhead just to
>> generate sfence/lfence/dmb.
>
>
> I am happy to add intrinsics if you demonstrate that the performance win is
> enough. We did this already with prefetch.

What benchmarks do you want me to run? I'll only be able to run them
reliably on x86 and AMD64 until there's support for DMB. This
instruction is required for consistency using the lock-free data
structure to guarantee that stores / loads are observed by other
threads; it's not really for performance. I guess in the meantime I
could make an ASM stub using .BYTE?

> I very much doubt the C compiler will ever inline. For the tiny amount of C
> we have, it's just not worth the effort. It certainly can't inline calls to
> assembly functions that it can't see.

That's fair.

> Russ

--dho

Dmitry Vyukov

unread,

Mar 12, 2013, 12:18:01 PM3/12/13

to Devon H. O'Dell, Russ Cox, minux ma, Dave Cheney, Rémy Oudompheng, golang-dev, re...@codereview-hr.appspotmail.com

On Tue, Mar 12, 2013 at 7:38 PM, Devon H. O'Dell <devon...@gmail.com> wrote:

>> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode279
>> src/pkg/runtime/proc.c:279: if (runqput(m->p, gp) == false) {
>> Perhaps combine it in a separate function, because it's copied several
>> times.
>
> That will get inlined, yes?

No, it won't.

>> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2087
>> src/pkg/runtime/proc.c:2087: if((delta & mask) == (start & mask))
>> Is it intended that you do not full the last slot?
>> If so, it can be:
>> if((delta^start)&mask))
>> the compiler is not particularly good at code optimization.
>
> This check won't work. We can't enqueue a new value into the ring if
> that would overflow the ring. So, if delta == start, we don't insert.
> If we have a ring of size 4, the tail is at 0x3 and the head is at
> 0x0, this will end up being if(0&mask), which is of course always
> false.

I meant ((delta^start)&mask) == 0

>> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2093
>> src/pkg/runtime/proc.c:2093: runtime·fastatomicstore(&p->r_tail, delta);
>> atomicstorerelease should contain the necessary fence
>
> The store fence is required to serialize the update of the tail with
> respect to putting the entry in the ring, so it's not really a
> question of whether this interface provides the fence. It's to make
> sure that all other consumers do not read that the tail is updated
> without being guaranteed to be able to read the entry at the tail.

atomicstorerelease should contain the necessary fence, the TODO is not necessary

>> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2109
>> src/pkg/runtime/proc.c:2109: /* TODO: load fence */
>> atomicload contains the fence
>
> This is to guarantee consistency of our view of the snapshot of the head.

Yes, I understand, but please remove the TODO, because it's not
actually the TODO, the fence is already there.

>> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2114
>> src/pkg/runtime/proc.c:2114: /* TODO: load fence */
>> why is it needed?
>
> Same reason, guarantee we're looking at the proper state of the head
> when we do the assignment from the ring.

The necessary fence is in atomicload/cas.

>> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2116
>> src/pkg/runtime/proc.c:2116: gp = p->ring[start & mask];
>> I think you can implement steal in a similar way, just copying a batch
>> of elements here and incrementing head by N instead on 1.
>
> It would be possible to do this with a memcpy indeed, but you have to
> do the copy every iteration of the loop

I think it is fine

> because head may have changed.
> Another way I was thinking of doing it was just to "steal" the blocks
> and do the copies once that was successful -- but the problem is that
> another producer might invalidate the consistency of the ring by the
> time the copy started. I'll do the memcpy approach for now.
>
>> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2116
>> src/pkg/runtime/proc.c:2116: gp = p->ring[start & mask];
>> I think you can implement steal() in a similar way, just copying a batch
>> of elements here and incrementing head by N instead of 1.
>>
>> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode2118
>> src/pkg/runtime/proc.c:2118: /* TODO: full memory fence */
>> why is it needed?
>
> To serialize loading with respect to updating the head poitner.

cas contains the necessary fence, these are *not* relaxed atomic

>> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/runtime.h#newcode690
>> src/pkg/runtime/runtime.h:690: bool runtime·casvalue(uint32*, uint32,
>> uint32, uint32*);
>> Change cas(uint32*, uint32, uint32) to cas(uint32*, uint32*, uint32)
>> similar to cas64. Then you don't need the separate function.
>> This should go in a separate patch.
>
> Then we pay the cost for assigning to the old value everywhere, even
> when we don't need it. I think this just adds overhead.

Local assignment is not a great cost. The cc compiler makes
unnecessary assignments everywhere. And it's needed in mutexes and
will be needed in chans. Most other places are not performance
critical.

Dmitry Vyukov

unread,

Mar 12, 2013, 12:21:32 PM3/12/13

to Devon H. O'Dell, Russ Cox, minux ma, Dave Cheney, Rémy Oudompheng, golang-dev, re...@codereview-hr.appspotmail.com

On Tue, Mar 12, 2013 at 8:07 PM, Devon H. O'Dell <devon...@gmail.com> wrote:
> 2013/3/12 Russ Cox <r...@golang.org>:
>> On Tue, Mar 12, 2013 at 11:50 AM, Devon H. O'Dell <devon...@gmail.com>
>> wrote:
>>>
>>> 2013/3/12 Russ Cox <r...@golang.org>:
>>> > The C compiler does not inline.
>>>
>>> Maybe we should revisit this after Go 1.1.
>>>
>>>
>>> I need help implementing support for the DMB instruction on ARM. I
>>> don't understand the assembler and linker code enough to add the
>>> instruction and this won't work on ARM without that. Additionally,
>>> we'd need an interface for generating memory barriers. If the C
>>> compiler doesn't inline, that's going to be a ton of overhead just to
>>> generate sfence/lfence/dmb.
>>
>>
>> I am happy to add intrinsics if you demonstrate that the performance win is
>> enough. We did this already with prefetch.
>
> What benchmarks do you want me to run? I'll only be able to run them
> reliably on x86 and AMD64 until there's support for DMB. This
> instruction is required for consistency using the lock-free data
> structure to guarantee that stores / loads are observed by other
> threads; it's not really for performance. I guess in the meantime I
> could make an ASM stub using .BYTE?

DMB should not be a separate function, it should be combined with
atomicload/store/cas/etc. So there will be no additional cost for
these.

For ARM you can just use existing atomic functions, they meant to
contain all the necessary fences. You can think of them as C
atomic_foo(memory_order_seq_cst).

Russ Cox

unread,

Mar 12, 2013, 12:26:07 PM3/12/13

to Devon H. O'Dell, Dmitry Vyukov, minux ma, Dave Cheney, Rémy Oudompheng, golang-dev, re...@codereview-hr.appspotmail.com

On Tue, Mar 12, 2013 at 12:07 PM, Devon H. O'Dell <devon...@gmail.com> wrote:

2013/3/12 Russ Cox <r...@golang.org>:

> On Tue, Mar 12, 2013 at 11:50 AM, Devon H. O'Dell <devon...@gmail.com>
> wrote:
>>
>> 2013/3/12 Russ Cox <r...@golang.org>:
>> > The C compiler does not inline.
>>
>> Maybe we should revisit this after Go 1.1.
>>
>>
>> I need help implementing support for the DMB instruction on ARM. I
>> don't understand the assembler and linker code enough to add the
>> instruction and this won't work on ARM without that. Additionally,
>> we'd need an interface for generating memory barriers. If the C
>> compiler doesn't inline, that's going to be a ton of overhead just to
>> generate sfence/lfence/dmb.
>
>
> I am happy to add intrinsics if you demonstrate that the performance win is
> enough. We did this already with prefetch.

What benchmarks do you want me to run? I'll only be able to run them
reliably on x86 and AMD64 until there's support for DMB. This
instruction is required for consistency using the lock-free data
structure to guarantee that stores / loads are observed by other
threads; it's not really for performance. I guess in the meantime I
could make an ASM stub using .BYTE?

I'd like to see a version that calls into assembly (you want WORD $x, not BYTE, and no dot), compared with a version that doesn't. In the case of prefetch, the difference was significant enough to add to the compiler.

To compare with the version that doesn't call into assembly, you can patch in CL 7756043, with two caveats:

1. It's not done well, and it would need redoing if we keep it.

2. The linker emits DSB as 0xFFFFFFFF, which I assume is not the actual encoding. You'll need to tweak that before using.

Russ

Dmitry Vyukov

unread,

Mar 12, 2013, 12:37:36 PM3/12/13

to Russ Cox, Devon H. O'Dell, minux ma, Dave Cheney, Rémy Oudompheng, golang-dev, re...@codereview-hr.appspotmail.com

We do not want stand-alone memory fences, the atomic operations we
have include the memory fences. We can make them finer-grained if
necessary (e.g. load-acquire, store-release, cas-release, etc).
While the atomic operations are in assembly, intrinsic for DMB makes no sense.
If anything, we need ATOMIC_LOAD/ATOMIC_STORE/ETC intrinsics.

matt

unread,

Mar 13, 2013, 5:55:50 PM3/13/13

to golan...@googlegroups.com, Devon H. O'Dell, Dmitry Vyukov, minux ma, Dave Cheney, Rémy Oudompheng, re...@codereview-hr.appspotmail.com

Late to the party, but keen to test this change list. I am trying to use the DSB primitive from inside a linked assembly routine. Is DSB expected to work from assembly? I'm currently getting the following errors:

test_arm.s:19 syntax error, last name: DSB

Russ Cox

unread,

Mar 13, 2013, 6:00:48 PM3/13/13

to matt, golang-dev, Devon H. O'Dell, Dmitry Vyukov, minux ma, Dave Cheney, Rémy Oudompheng, re...@codereview-hr.appspotmail.com

No, it doesn't. You can use WORD $0x12345678 with the constant corrected.

Russ

Russ Cox

unread,

Mar 19, 2013, 2:38:49 PM3/19/13

to matt, golang-dev, Devon H. O'Dell, Dmitry Vyukov, minux ma, Dave Cheney, Rémy Oudompheng, re...@codereview-hr.appspotmail.com

I'd like to leave this until after Go 1.1.

Reply all

Reply to author

Forward