update on "lock-free channels"

2,451 views
Skip to first unread message

Dmitry Vyukov

unread,
Jul 16, 2014, 4:59:14 AM7/16/14
to golang-dev
FYI
As per offline discussion, the "lock-free channels" change:
https://docs.google.com/a/google.com/document/d/1yIAYmbvL3JxOKOjuCyon7JhW4cSv1wy5hC0ApeGMV9s/pub
is deferred as it adds significant complexity w/o visible benefit for
real Go programs.

Some of the sub-changes described in the doc go in:
https://codereview.appspot.com/112990043/
https://codereview.appspot.com/110580043/

But the rest (most notably lock-free buffered channels) need
justification in the form of positive effects on real Go programs.

Henrik Johansson

unread,
Jul 16, 2014, 6:19:01 AM7/16/14
to Dmitry Vyukov, golang-dev

This change was to make the contended case faster right?

What about the non-contended case?

--
You received this message because you are subscribed to the Google Groups "golang-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-dev+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Henrik Johansson

unread,
Jul 16, 2014, 6:26:11 AM7/16/14
to Dmitry Vyukov, golang-dev

Sorry didn't read the doc, my bad.

If I read correctly then the non-contended case is much faster. How can that not be worth adding the change?

On Jul 16, 2014 10:59 AM, "'Dmitry Vyukov' via golang-dev" <golan...@googlegroups.com> wrote:

Dmitry Vyukov

unread,
Jul 16, 2014, 6:28:56 AM7/16/14
to Henrik Johansson, golang-dev
On Wed, Jul 16, 2014 at 2:26 PM, Henrik Johansson <dahan...@gmail.com> wrote:
> Sorry didn't read the doc, my bad.
>
> If I read correctly then the non-contended case is much faster. How can that
> not be worth adding the change?


The performance improvement does not materialize from the air, it
comes with code complexity increase.

Henrik Johansson

unread,
Jul 16, 2014, 6:37:15 AM7/16/14
to Dmitry Vyukov, golang-dev

Indeed and I am not in a position to judge if the maintenance overhead of the change in this particular case is worth it.

I do find the non-contended case attractive though because it increases the performance of code that really tries to use channels for concurrency in the small if that makes sense. ☺

Is it deferred with a new target release or just put on hold indefinitely?

Dmitry Vyukov

unread,
Jul 16, 2014, 6:55:46 AM7/16/14
to Henrik Johansson, golang-dev
On Wed, Jul 16, 2014 at 2:37 PM, Henrik Johansson <dahan...@gmail.com> wrote:
> Indeed and I am not in a position to judge if the maintenance overhead of
> the change in this particular case is worth it.
>
> I do find the non-contended case attractive though because it increases the
> performance of code that really tries to use channels for concurrency in the
> small if that makes sense. ☺
>
> Is it deferred with a new target release or just put on hold indefinitely?

It is deferred until there is a justification.

Henrik Johansson

unread,
Jul 16, 2014, 4:28:49 PM7/16/14
to Dmitry Vyukov, golang-dev

What would constitute a justification in this case if the performance is not enough?

Go encourage the use of channels and the the faster they are, the better.

Would some sort of count and perhaps classification of public usage make up part of an empirical justification?

Is the postponement due to you realizing that it is too complicated to be worth it? Is there a thread I missed somewhere?

Dmitry Vyukov

unread,
Jul 16, 2014, 4:31:46 PM7/16/14
to Henrik Johansson, golang-dev
On Thu, Jul 17, 2014 at 12:28 AM, Henrik Johansson <dahan...@gmail.com> wrote:
> What would constitute a justification in this case if the performance is not
> enough?

Real performance boost for real important Go programs.


> Go encourage the use of channels and the the faster they are, the better.
>
> Would some sort of count and perhaps classification of public usage make up
> part of an empirical justification?
>
> Is the postponement due to you realizing that it is too complicated to be
> worth it? Is there a thread I missed somewhere?

Sorry, I do not understand the question.

minux

unread,
Jul 16, 2014, 4:33:13 PM7/16/14
to Henrik Johansson, Dmitry Vyukov, golang-dev
On Wed, Jul 16, 2014 at 4:28 PM, Henrik Johansson <dahan...@gmail.com> wrote:

What would constitute a justification in this case if the performance is not enough?

Another important factor is that it can't make unbuffered (synchronous) channel slower.

Henrik Johansson

unread,
Jul 16, 2014, 4:43:35 PM7/16/14
to minux, golang-dev, Dmitry Vyukov

Ok, so no or at least acceptable degradation in sync channels and an impact for at least a few large and well used programs or libraries. Perhaps Vitess, Nsqd and Mgo can make up a tiny suite. These I can at least rudimentarily try myself.

What I meant by thread Dmitry was if there was a discussion in an email thread somewhere. I'll look around better when I have a real computer at hand.

Dmitry Vyukov

unread,
Jul 17, 2014, 12:46:37 AM7/17/14
to Henrik Johansson, minux, golang-dev
On Thu, Jul 17, 2014 at 12:43 AM, Henrik Johansson <dahan...@gmail.com> wrote:
> Ok, so no or at least acceptable degradation in sync channels and an impact
> for at least a few large and well used programs or libraries. Perhaps
> Vitess, Nsqd and Mgo can make up a tiny suite. These I can at least
> rudimentarily try myself.
>
> What I meant by thread Dmitry was if there was a discussion in an email
> thread somewhere. I'll look around better when I have a real computer at
> hand.

No, it was offline discussion with Go team.

Philip Hofer

unread,
Sep 26, 2014, 1:17:07 PM9/26/14
to golan...@googlegroups.com
I can't speak for everyone, but I know that personally I would love to see faster channels. And, for what's it's worth, I write Go to make a living.

So I'll give you a (small) example of a program where I had to hack together a bunch of gross atomic stuff to accomplish what *would* have been 
accomplished well with channels, except that the channels implementation is ~4x slower: https://github.com/philhofer/netpewl

Briefly, this is a data structure that maintains a pool of live `net.Conn`s and limits the total number of open connections. One implementation uses 
sync.Pool with atomic guards, and the other uses buffered channels.

BenchmarkTubRoundtrip-4 50000000        60.2 ns/op
BenchmarkChanRoundtrip-4 10000000       228 ns/op
BenchmarkTubRoudtripParallel-4 20000000        96.6 ns/op
BenchmarkChanRoundtripParallel-4 5000000       343 ns/op

(Actually, this would be a could candidate for the runtime semaphores, but I digress. Also, I wrote the channel implementation as an example for this post; it has a race condition on Close() that I didn't bother to fix because it would have meant writing many more lines of code.)

Some of this depends on how you want to see channels used in production Go code. Right now channels are great for managing concurrent i/o-bound operations, which represents a lot of the use cases for Go in the first place. However, if the goal of the language maintainers is for channels to become the de-facto synchronization mechanism in *most* cases, instead of just *some* cases, then I think it makes sense for the implementation to be faster than something that I can put together myself without using the runtime's semaphores and intrinsics. 

I would also argue that it's more efficient for complex synchronization to be implemented in the language runtime. Better to have the difficult stuff implemented in one place than a hundred. 

That all being said, I can't judge what the maintenance overhead will be of the code you have written.

And, truthfully, I don't think you'll find good production Go programs that are actually bottlenecked by channels, but that may be because people have learned that channels shouldn't be in code hot-paths.

TL;DR it would make my life easier.

Øyvind Teig

unread,
Sep 27, 2014, 12:54:45 PM9/27/14
to golan...@googlegroups.com


kl. 10:59:14 UTC+2 onsdag 16. juli 2014 skrev Dmitry Vyukov følgende:
FYI
As per offline discussion, the "lock-free channels" change:
https://docs.google.com/a/google.com/document/d/1yIAYmbvL3JxOKOjuCyon7JhW4cSv1wy5hC0ApeGMV9s/pub
is deferred as it adds significant complexity w/o visible benefit for
real Go programs.

In this document "Go channels on steroids" there is this sentence:

- make non-blocking failing operations (e.g. checking of "stop" channel) faster

I assume the code is seen in asyncchansend_nonblock(?), but I don't understand the ""stop" channel" in the heading.
I probably can't see the wood for only trees, but could you please explain this?

Øyvind

dispo...@dby.me

unread,
Sep 27, 2014, 3:58:45 PM9/27/14
to golan...@googlegroups.com
Maybe I'm missing something here, but I couldn't understand why your channel implementation actually uses two channels. Your bufferd cons channel can already be used as a semaphore. I did a quick hack to remove the sem channel earlier and the benchmark was faster than your non-channel implementation; Something like 167ns vs 135ns. It was slightly slower in the parallel benchmark, although I forget the numbers... it was less than 10ns i.e no difference.

Dmitry Vyukov

unread,
Sep 29, 2014, 2:23:03 AM9/29/14
to dispo...@dby.me, golang-dev
On Sat, Sep 27, 2014 at 11:58 PM, <dispo...@dby.me> wrote:
> Maybe I'm missing something here, but I couldn't understand why your channel implementation actually uses two channels. Your bufferd cons channel can already be used as a semaphore. I did a quick hack to remove the sem channel earlier and the benchmark was faster than your non-channel implementation; Something like 167ns vs 135ns. It was slightly slower in the parallel benchmark, although I forget the numbers... it was less than 10ns i.e no difference.

Test it with high value of GOMAXPROCS as well, e.g. 32. That can give
radically different results than GOMAXPROCS=1.

Dmitry Vyukov

unread,
Sep 29, 2014, 2:24:03 AM9/29/14
to Øyvind Teig, golang-dev
I meant the following pattern:

func someGoroutine() {
for {
select {
case <-stop:
return // asked to stop
default:
}
... do something ...
}
}

Dmitry Vyukov

unread,
Sep 29, 2014, 2:27:23 AM9/29/14
to Philip Hofer, golang-dev
A strong argument would be benchmarking results of a real application
that demonstrate performance improvement with the patch. Nobody
presented it so far.

Øyvind Teig

unread,
Sep 29, 2014, 5:02:59 AM9/29/14
to golan...@googlegroups.com, oyvin...@gmail.com
Thanks. That's what I thought. But I couldn't find that pattern in the code(?). 

Aside1: Also, in that pattern, aren't you supposed to send a stop as well, to help facilitate a join? 

Aside2: According the the Wikipedia article on Non-blocking algorithm (here) all channels would be "non-blocking" per def (and also "lock-free" I assume). I have suggested instead to use "yield" for this, see gloang-nuts here (which also points to a blog note of mine). Then we would have a separate name-space for the word "blocking" (which then can be treated is always bad, since the majority of programmers in the world would mean so anyhow). Then we don't have to explain the for them strange sentence that "..but blocking in Go is fine!". Then we can talk, at least, since in our name space it's "yielding"

Caleb Spare

unread,
Sep 29, 2014, 5:22:11 AM9/29/14
to Øyvind Teig, golang-dev
> Aside1: Also, in that pattern, aren't you supposed to send a stop as well,
> to help facilitate a join?

Typically you'd trigger the stop channel by closing it.

> Aside2: According the the Wikipedia article on Non-blocking algorithm (here)
> all channels would be "non-blocking" per def (and also "lock-free" I
> assume). I have suggested instead to use "yield" for this, see gloang-nuts
> here (which also points to a blog note of mine). Then we would have a
> separate name-space for the word "blocking" (which then can be treated is
> always bad, since the majority of programmers in the world would mean so
> anyhow). Then we don't have to explain the for them strange sentence that
> "..but blocking in Go is fine!". Then we can talk, at least, since in our
> name space it's "yielding"

This doesn't make sense to me. Channels block. The fact that you can
construct a deadlock using channels proves that (and also proves
they're not lock-free in the general case).

Øyvind Teig

unread,
Sep 29, 2014, 6:41:17 AM9/29/14
to golan...@googlegroups.com, oyvin...@gmail.com
It's not the channel in itself that would deadlock (if that's the problem the language coders have something to do), in's the dependency between concurrent programs.

We don't need to use the word "block" for this. In this figure I have tried to explain what might be perceived as difference between implicit yield, block and deadlock. 

saint....@gmail.com

unread,
Sep 29, 2014, 9:20:03 AM9/29/14
to golan...@googlegroups.com, oyvin...@gmail.com
On Monday, 29 September 2014 11:41:17 UTC+1, Øyvind Teig wrote:
In this figure I have tried to explain what might be perceived as difference between implicit yield, block and deadlock. 

Channel operations are not "yielding": this would imply that the goroutine /could/ continue but instead /yields/ (as with runtime.Gosched()).

If a send/receive on a channel occurs, the goroutine cannot continue until the channel operation completes. This is the definition of blocking.

martin...@ft.com

unread,
Sep 29, 2014, 11:01:55 AM9/29/14
to golan...@googlegroups.com, pho...@umich.edu

A strong argument would be benchmarking results of a real application
that demonstrate performance improvement with the patch. Nobody
presented it so far.

I have often re-factored parts of my code away from using channels in order to get a performance gain.  Selfishly speaking, this is far from ideal, as it results in _my_ code being more complex.  Perhaps other applications have changed in the same way.  If so, it would be hard to show a performance gains with faster channels for a real application that already exists.  If they needed that performance, they might have already moved the critical path away from using channels.

I'm all for avoiding complexity without good reason, I just think that in this case, the evidence you are looking for may already have been destroyed through necessity.

In any case, do you have any plans to maintain a branch with these changes, so that others can experiment and benchmark their applications?

-- 
Martin.



This email was sent by a company owned by Pearson plc, registered office at 80 Strand, London WC2R 0RL.  Registered in England and Wales with company number 53723.

Øyvind Teig

unread,
Sep 29, 2014, 12:58:50 PM9/29/14
to golan...@googlegroups.com, oyvin...@gmail.com, saint....@gmail.com


kl. 15:20:03 UTC+2 mandag 29. september 2014 skrev saint....@gmail.com følgende:
On Monday, 29 September 2014 11:41:17 UTC+1, Øyvind Teig wrote:
In this figure I have tried to explain what might be perceived as difference between implicit yield, block and deadlock. 

Channel operations are not "yielding": this would imply that the goroutine /could/ continue but instead /yields/ (as with runtime.Gosched()).

We may talk past each other here. But we can't stop yet:

Yielding is to another process and involves no continuation yet. That's the whole idea.

Informally it's kind of help to the scheduler, not normally used by the application. Modula-2 used transfer and iotransfer for its coroutines (a degradation from Modula) so that several scheduling mechanisms could be implemented. We built a fire detection system with that in 1985-1990. Therefore a channel that "blocks" really "yields" down to the scheduler so that it is free to run another thread. When the yielding process is again scheduled it is in the code-line after the yield (channel operation). Meanwhile the second process on the channel will have arrived and the channel will have moved data, if any. This is for non-buffered point-point one-way channels. The channel then implicitly causes a yield. As you state here:

If a send/receive on a channel occurs, the goroutine cannot continue until the channel operation completes.

Exactly. 

This is the definition of blocking.

That's what I am discussing, and I am not sure any more, because I am not discussing the semantics of the word, but which word. There is a long thread in my mentioned blog note about the history of the the use of the word blocking, by several computer scientist (occam-com) - it's expanding by the minute. I used to think what you state there, but when I said this the only people I could talk with are "us" (we could easily agree), not "them". I have tried with my intitative to define name spaces for "blocking" = yield, block and deadlock.  


Ingo Oeser

unread,
Sep 29, 2014, 2:21:37 PM9/29/14
to golan...@googlegroups.com
Then yield doesn't seem to be the right term and blocking seems better. Yielding means to give up the rest of the timeslice. After that timeslice, what should the scheduler do? Handing over to you, so you yield again?

Makes more sense to me, to explicitly block and tell the scheduler what you need in order to get useful work done instead. When you yield, you basically stop being runnable without telling the scheduler why and it must treat you as runnable and burn CPU on you.

Øyvind Teig

unread,
Sep 29, 2014, 2:49:06 PM9/29/14
to golan...@googlegroups.com


kl. 20:21:37 UTC+2 mandag 29. september 2014 skrev Ingo Oeser følgende:
Then yield doesn't seem to be the right term and blocking seems better. Yielding means to give up the rest of the timeslice. After that timeslice, what should the scheduler do? Handing over to you, so you yield again?

Makes more sense to me, to explicitly block and tell the scheduler what you need in order to get useful work done instead. When you yield, you basically stop being runnable without telling the scheduler why and it must treat you as runnable and burn CPU on you.

There is no timeslice in the context here. The scheduler does not really need to know what to do next, since it's the channels that drive scheduling, with their "implicit yielding". 

Our latest scheduler we called ChanSched for that reason. When a channel blocks/yields that goroutine isn't put in the ready-set because it's not ready to continue yet. The fact that it later needs to be scheduled is kept "in the channel". This happens after the communication. A CSP-type scheduler basically is this simple, except that selective choice is a complicating factor, because when one cause is taken all the other components need to be "torn down". Such a system will never poll, never burn CPU. (The application may of course poll on some external signal, but the channel runtime system does not need to poll. And the application should preferably not time out on internal channels if you want to also do other products after this in your career. Also, there is no wakeup with ..oops it wasn't for you, like in Java's notifyall). When there is nothing more to do: enter sleep. If there is no future event expected (interrupt, timeout) it's a system deadlock.

As I have understood the Go scheduler is not that simple because it has to piggy-back on system threads/processes and it has to juggle the machine for the processes (or vice versa). And in that context I guess timeslices are meaningful. And priority etc. But I think the basic idea is the one above. CSP and CSP can't differ that much.

Dmitry Vyukov

unread,
Sep 30, 2014, 2:19:42 AM9/30/14
to martin...@ft.com, golang-dev, Philip Hofer
On Mon, Sep 29, 2014 at 7:01 PM, <martin...@ft.com> wrote:
>>
>> A strong argument would be benchmarking results of a real application
>> that demonstrate performance improvement with the patch. Nobody
>> presented it so far.
>
>
> I have often re-factored parts of my code away from using channels in order
> to get a performance gain. Selfishly speaking, this is far from ideal, as
> it results in _my_ code being more complex. Perhaps other applications have
> changed in the same way. If so, it would be hard to show a performance
> gains with faster channels for a real application that already exists. If
> they needed that performance, they might have already moved the critical
> path away from using channels.
>
> I'm all for avoiding complexity without good reason, I just think that in
> this case, the evidence you are looking for may already have been destroyed
> through necessity.

That's true. But it also don't prove that my change would make the
apps faster. Maybe it will make them slower.

> In any case, do you have any plans to maintain a branch with these changes,
> so that others can experiment and benchmark their applications?

It's always possible to do on the old revision. 'hg clpatch' will say
what revision you need to update to.

Egon Elbre

unread,
Sep 30, 2014, 4:25:57 AM9/30/14
to golan...@googlegroups.com, martin...@ft.com, pho...@umich.edu


On Tuesday, 30 September 2014 09:19:42 UTC+3, Dmitry Vyukov wrote:
On Mon, Sep 29, 2014 at 7:01 PM,  <martin...@ft.com> wrote:
>>
>> A strong argument would be benchmarking results of a real application
>> that demonstrate performance improvement with the patch. Nobody
>> presented it so far.
>
>
> I have often re-factored parts of my code away from using channels in order
> to get a performance gain.  Selfishly speaking, this is far from ideal, as
> it results in _my_ code being more complex.  Perhaps other applications have
> changed in the same way.  If so, it would be hard to show a performance
> gains with faster channels for a real application that already exists.  If
> they needed that performance, they might have already moved the critical
> path away from using channels.
>
> I'm all for avoiding complexity without good reason, I just think that in
> this case, the evidence you are looking for may already have been destroyed
> through necessity.

That's true. But it also don't prove that my change would make the
apps faster. Maybe it will make them slower.

I've revived one of my earlier approaches to a graph search problem...


At the time, when (~2yrs ago) I wrote that version it was too slow... of course in the mean time there have been several other updates to Go.

Probably there are improvements possible to that program... aand, probably the channel isn't the slowest part of it... of course, converting it from the "locking" version to "channel" version slowed the program down a bit... Anyways, it's the closest thing I have to a real-world heavy channel use.

After testing with or without CL I'm getting about ~5% win in terms of speed, but the results are quite noisy due to the parallel to say definitively:

without CL
 RunParallel     8.31s, 8.35s, 8.58s
 RunParallelChan 10.43s, 10.47s, 10.53s

with CL
 RunParallel     8.58s, 8.43s, 8.33s, 
 RunParallelChan 9.75s, 9.72s, 9.92s, 10.12s, 10.50s, 9.8s

If you want to test the channel mode you need to use the "dev" branch...
git checkout dev
cd _examples
make proteins.30k

It runs a simple test-case. To switch between RunParallel/RunParallelChan there is parameter "chanmode". If the process is taking too little time in proteins/conf.json you can decrease the values inside "Extension.Extendable"; I used "Matches(fore)": 1. NB. it's easy to make the program eat-up several GB of RAM.

+ egon

Egon Elbre

unread,
Sep 30, 2014, 4:32:00 AM9/30/14
to golan...@googlegroups.com, martin...@ft.com, pho...@umich.edu


On Tuesday, 30 September 2014 11:25:57 UTC+3, Egon Elbre wrote:


On Tuesday, 30 September 2014 09:19:42 UTC+3, Dmitry Vyukov wrote:
On Mon, Sep 29, 2014 at 7:01 PM,  <martin...@ft.com> wrote:
>>
>> A strong argument would be benchmarking results of a real application
>> that demonstrate performance improvement with the patch. Nobody
>> presented it so far.
>
>
> I have often re-factored parts of my code away from using channels in order
> to get a performance gain.  Selfishly speaking, this is far from ideal, as
> it results in _my_ code being more complex.  Perhaps other applications have
> changed in the same way.  If so, it would be hard to show a performance
> gains with faster channels for a real application that already exists.  If
> they needed that performance, they might have already moved the critical
> path away from using channels.
>
> I'm all for avoiding complexity without good reason, I just think that in
> this case, the evidence you are looking for may already have been destroyed
> through necessity.

That's true. But it also don't prove that my change would make the
apps faster. Maybe it will make them slower.

I've revived one of my earlier approaches to a graph search problem...


At the time, when (~2yrs ago) I wrote that version it was too slow... of course in the mean time there have been several other updates to Go.

Probably there are improvements possible to that program... aand, probably the channel isn't the slowest part of it... of course, converting it from the "locking" version to "channel" version slowed the program down a bit... Anyways, it's the closest thing I have to a real-world heavy channel use.


I was comparing on 03b003455359b09fff0f1662255dc5fe10b93290 with and without the CL.
 
After testing with or without CL I'm getting about ~5% win in terms of speed, but the results are quite noisy due to the parallel to say definitively:

without CL
 RunParallel     8.31s, 8.35s, 8.58s
 RunParallelChan 10.43s, 10.47s, 10.53s

with CL
 RunParallel     8.58s, 8.43s, 8.33s, 
 RunParallelChan 9.75s, 9.72s, 9.92s, 10.12s, 10.50s, 9.8s

If you want to test the channel mode you need to use the "dev" branch...
git checkout dev
cd _examples
make proteins.30k

It runs a simple test-case. To switch between RunParallel/RunParallelChan there is parameter "chanmode". If the process is taking too little time in proteins/conf.json you can decrease the values inside "Extension.Extendable"; I used "Matches(fore)": 1. NB. it's easy to make the program eat-up several GB of RAM.


Also ignore my language errors, it seems my LPU is fried. (LPU = Language Processing Unit)

Øyvind Teig

unread,
Sep 30, 2014, 4:52:19 AM9/30/14
to golan...@googlegroups.com, martin...@ft.com, pho...@umich.edu


kl. 10:25:57 UTC+2 tirsdag 30. september 2014 skrev Egon Elbre følgende:


On Tuesday, 30 September 2014 09:19:42 UTC+3, Dmitry Vyukov wrote:
On Mon, Sep 29, 2014 at 7:01 PM,  <martin...@ft.com> wrote:
>>
>> A strong argument would be benchmarking results of a real application
>> that demonstrate performance improvement with the patch. Nobody
>> presented it so far.
>
>
> I have often re-factored parts of my code away from using channels in order
> to get a performance gain.  Selfishly speaking, this is far from ideal, as
> it results in _my_ code being more complex.  Perhaps other applications have
> changed in the same way.  If so, it would be hard to show a performance
> gains with faster channels for a real application that already exists.  If
> they needed that performance, they might have already moved the critical
> path away from using channels.
>
> I'm all for avoiding complexity without good reason, I just think that in
> this case, the evidence you are looking for may already have been destroyed
> through necessity.

That's true. But it also don't prove that my change would make the
apps faster. Maybe it will make them slower.

I've revived one of my earlier approaches to a graph search problem...


At the time, when (~2yrs ago) I wrote that version it was too slow... of course in the mean time there have been several other updates to Go.

Probably there are improvements possible to that program... aand, probably the channel isn't the slowest part of it... of course, converting it from the "locking" version to "channel" version slowed the program down a bit... Anyways, it's the closest thing I have to a real-world heavy channel use.

Egon

Just wondering: the inner select's default clause in RunParallelChan, what is the price for that? Isn't this a poll to send, if not, fill more memory? Why is the default needed?

Øyvind

Egon Elbre

unread,
Sep 30, 2014, 5:01:39 AM9/30/14
to golan...@googlegroups.com, martin...@ft.com, pho...@umich.edu
case q := <-queue:
   work <- q:

Let's say work is full, then this will block... a single worker can push upto several hundred items to the queue, which means the queue will be filled very quickly and the workers block. Now, the workers can't get to the work and queue can't progress... i.e. you have a deadlock. Of course, there could be a better solution, that I'm not aware of.

Øyvind Teig

unread,
Sep 30, 2014, 6:50:10 AM9/30/14
to golan...@googlegroups.com, martin...@ft.com, pho...@umich.edu
What's the trace to the deadlock, with all channel ends seen? When I saw the code through a straw I thought that the default was a waste. The sender side of queue could perhaps buffer instead. But this architecture is so dependent on the specification and the architecture, and when I just look at some code I might get i wrong quite fast. But I basically should be able to reason locally, once the protocol is known.

Generally, I would have thought that it were a Go idiom to try to build the architecture so that one avoids a default in a selective choice? Instead include a feedback channel in the outer select and then always succed on the sending? But then, your code is controlled by the outer case q := <-queue: which controls it elegantly..

Øyvind

Egon Elbre

unread,
Sep 30, 2014, 7:25:18 AM9/30/14
to golan...@googlegroups.com, martin...@ft.com, pho...@umich.edu

What's the trace to the deadlock, with all channel ends seen?

Essentially "queue" and "work" would be full.
 
When I saw the code through a straw I thought that the default was a waste. The sender side of queue could perhaps buffer instead. But this architecture is so dependent on the specification and the architecture, and when I just look at some code I might get i wrong quite fast. But I basically should be able to reason locally, once the protocol is known.

Not quite sure what you mean... but, I rewrote a simpler, unparallelized, version of the code:

http://play.golang.org/p/TPyB4fS043

It behaves somewhat similarly, regarding the amount work pushed.
// Of course add more items to the exts array, or alternatively when using buffered queues, use small buffers...

In the actual program Pool, ensures ordering so that the memory usage would be smaller... difference can be something like 4GB vs 16GB. Basically, it tries to choose the most likely Work item not to produce more Work, otherwise the undone Work can start building up. Of course the total amount of Work to do, is still the same.
Output pool, filters some of the elements, and only keeps N best of the results.

Generally, I would have thought that it were a Go idiom to try to build the architecture so that one avoids a default in a selective choice?

Yes, I would like to avoid it... but I guess I'm just not seeing the clean solution.
 
Instead include a feedback channel in the outer select and then always succed on the sending? But then, your code is controlled by the outer case q := <-queue: which controls it elegantly..

Not quite sure what you mean with this.

+ egon

Dmitry Vyukov

unread,
Oct 1, 2014, 12:48:51 AM10/1/14
to Egon Elbre, golang-dev, Martin Garton, Philip Hofer
Thanks! That's already something.
I will run it on a machine with high core count and see it's possible
to reduce flakiness.

Egon Elbre

unread,
Oct 1, 2014, 2:41:14 AM10/1/14
to golan...@googlegroups.com, egon...@gmail.com, martin...@ft.com, pho...@umich.edu

Thanks! That's already something.
I will run it on a machine with high core count and see it's possible
to reduce flakiness.

This morning I remembered one additional case where I was using channels, but dropped them due to slowness.

Basically an iterator here:

I created a benchmarkable solution:

The channel based iterator code looks the nicest, but compared to callback/slice counterpart it's 20x slower (on tip). I didn't test it against the CL.

+ egon

Dmitry Vyukov

unread,
Oct 3, 2014, 7:38:51 AM10/3/14
to Egon Elbre, golang-dev, Martin Garton, Philip Hofer
For chan-based generators I have another change:
https://codereview.appspot.com/74180043/

I've benchmarked it on a slightly modified version of your benchmark
(to play better with testing benchmarks):
http://play.golang.org/p/8DvKsBHcCC

Slice and callback version takes ~6 ns/op.
Sync chan generator takes ~150 ns/op with GOMAXPROCS=1 and ~1050 ns/op
with GOMAXPROCS=2.
With my change sync chan generator takes ~60 ns/op regardless of
GOMAXPROCS value. It's possible to optimize it further (to 40 ns/op or
so).

Egon Elbre

unread,
Oct 3, 2014, 8:28:38 AM10/3/14
to golan...@googlegroups.com, egon...@gmail.com, martin...@ft.com, pho...@umich.edu
Here are my measurements for the CL with a set of 1e6 elements:
https://gist.github.com/egonelbre/27ef700e78e8b423b036

+ egon

Egon Elbre

unread,
Oct 3, 2014, 8:35:21 AM10/3/14
to golan...@googlegroups.com, egon...@gmail.com, martin...@ft.com, pho...@umich.edu


On Friday, 3 October 2014 15:28:38 UTC+3, Egon Elbre wrote:
Here are my measurements for the CL with a set of 1e6 elements:
https://gist.github.com/egonelbre/27ef700e78e8b423b036

Nevermind, just realized that after increasing the set size, I've got too few data points..

Dmitry Vyukov

unread,
Oct 3, 2014, 8:51:38 AM10/3/14
to Egon Elbre, golang-dev, Martin Garton, Philip Hofer
On Tue, Sep 30, 2014 at 12:25 PM, Egon Elbre <egon...@gmail.com> wrote:
I've benchmarked the program with GOMAXPROCS=32 -procs=32 and "Matches(fore)": 1
on revision a2fe41181fe1

w/o -chanmode:
102.78user 40.17system 0:10.67elapsed 1339%CPU (0avgtext+0avgdata
105696maxresident)k
101.25user 39.99system 0:10.52elapsed 1342%CPU (0avgtext+0avgdata
95532maxresident)k
104.55user 38.86system 0:10.68elapsed 1342%CPU (0avgtext+0avgdata
98756maxresident)k
105.94user 39.03system 0:10.72elapsed 1352%CPU (0avgtext+0avgdata
95672maxresident)k
102.13user 38.93system 0:10.49elapsed 1343%CPU (0avgtext+0avgdata
106488maxresident)k

with -chanmode:
80.81user 12.22system 0:09.22elapsed 1008%CPU (0avgtext+0avgdata
84840maxresident)k
81.02user 14.07system 0:09.84elapsed 965%CPU (0avgtext+0avgdata
105128maxresident)k
80.57user 12.39system 0:09.54elapsed 973%CPU (0avgtext+0avgdata
94096maxresident)k
80.51user 12.99system 0:09.42elapsed 991%CPU (0avgtext+0avgdata
103036maxresident)k
80.86user 12.84system 0:09.63elapsed 972%CPU (0avgtext+0avgdata
103720maxresident)k

with -chanmode and cl/12544043:
73.58user 8.65system 0:07.10elapsed 1158%CPU (0avgtext+0avgdata
103488maxresident)k
72.59user 9.29system 0:07.18elapsed 1140%CPU (0avgtext+0avgdata
93940maxresident)k
73.25user 8.88system 0:07.20elapsed 1139%CPU (0avgtext+0avgdata
92712maxresident)k
73.27user 9.39system 0:07.23elapsed 1143%CPU (0avgtext+0avgdata
82944maxresident)k
72.46user 8.90system 0:07.12elapsed 1142%CPU (0avgtext+0avgdata
76656maxresident)k

Profile confirms that it's due to chan synchronization:

old:
5.11% spexs2 spexs2 [.] runtime.xchg # runtime mutex
2.22% spexs2 spexs2 [.] runtime.lock # runtime mutex
2.18% spexs2 spexs2 [.] runtime.procyield # runtime
mutex backoff

new:
2.32% spexs2 spexs2 [.] runtime.xchg # runtime mutex
1.85% spexs2 spexs2 [.] runtime.cas64 # lock-free
chan synchronization

The numbers are quite stable, the speedup is 7.1/9.2*100-100 = -23%
And the chan version is faster than nochan with GOMAXPROCS=32.

Egon Elbre

unread,
Oct 7, 2014, 10:38:20 AM10/7/14
to golan...@googlegroups.com, egon...@gmail.com, martin...@ft.com, pho...@umich.edu
PS. In the meantime

I figured out a way to eliminate one of the channels: https://github.com/egonelbre/spexs2/blob/dev/search/spexs.go#L186

It uses atomics for counting the no. of items to be done.

The previous implementation (in principle) is available in this branch:

The actual commit that Dmitry used for performance measurements was 7c68cf985692c18ee13f50729cf161ed11250c5d

Or a cleaner solution with WaitGroup

But, I don't think it invalidates the results; just clarifying in case someone wants to reproduce the results.

+ egon
Reply all
Reply to author
Forward
0 new messages