Go channels, nice idiom... prohibitively slow

Matt

Jun 21, 2011, 12:35:31 PM
to golang-nuts
So I really like the Go channel idiom, and I was thinking of a
pipelining algorithm where stage A feeds stage B (after some
processing), B feeds stage C (again after some processing), and so
on.

It seems that with a very simple bit of code (below), the fastest my
links between A and B, B and C etc can be is 4 million int messages
per second. That doesn't strike me as an awful lot.

I've seen Russ Cox argue that tests of this ilk are silly because they
include no computation; and that when you add real work into the test
you'd not worry so much about the communication overhead.

I'd disagree... this is a fundamental bound; I know that in Go (right
now), using channels on a single CPU, I cannot pass more than 4 million
ints per second between two goroutines.

But is this go's fault per se?

What is actually happening under the hood? It seems a channel operation
takes on the order of 1000 instructions (a barrier is around, say, 50 on
my machine).

Are there optimizations to be made to the code below that still
maintain the channel passing idiom?

package main

import "log"

func main() {
	link := make(chan int) // channel to pass integers
	kill := make(chan int) // keep the main goroutine alive

	go func() {
		val := 0 // stuff 0 into the link initially
		for {
			link <- val
			val++ // increment the value
		}
	}()

	go func() {
		for r := range link {
			// every 1e6 int passes, print to the log
			if r%1000000 == 0 {
				log.Printf("%08x:\n", r)
			}
		}
	}()

	// block forever
	<-kill
}

Brad Fitzpatrick

Jun 21, 2011, 1:03:05 PM
to Matt, golang-nuts
Much faster if you buffer your channels, btw:

    sync  5000000       393 ns/op
buffered 10000000       174 ns/op

package main

import (
        "fmt"
        "testing"
)

func BenchmarkChannelSync(b *testing.B) {
        ch := make(chan int)
        go func() {
                for i := 0; i < b.N; i++ {
                        ch <- i
                }
                close(ch)
        }()
        for _ = range ch {
        }
}

func BenchmarkChannelBuffered(b *testing.B) {
        ch := make(chan int, 128)
        go func() {
                for i := 0; i < b.N; i++ {
                        ch <- i
                }
                close(ch)
        }()
        for _ = range ch {
        }
}

func main() {
        fmt.Println("    sync", testing.Benchmark(BenchmarkChannelSync).String())
        fmt.Println("buffered", testing.Benchmark(BenchmarkChannelBuffered).String())
}

Kyle Lemons

Jun 21, 2011, 1:03:39 PM
to Matt, golang-nuts
Pipelining works great in hardware but doesn't translate that nicely into software.  In Go, instead of pipelining requests into stages (where a single slow processing step can back up everything behind it), the idiom is to give each request or work unit its own goroutine.  For compute-intensive tasks, this has the added benefit that you might get some speedup by using multiple cores.

~K

On Tue, Jun 21, 2011 at 9:35 AM, Matt <matthew....@gmail.com> wrote:

Nigel Tao

Jun 21, 2011, 8:17:10 PM
to Matt, golang-nuts
On 22 June 2011 02:35, Matt <matthew....@gmail.com> wrote:
> It seems that with a very simple bit of code (below), the fastest my
> links between A and B, B and C etc can be is 4 million int messages
> per second. That doesn't strike me as an awful lot.

If you need to transfer in bulk, pass []int instead of int. For
example, io.Pipe (and io.Reader and io.Writer) work with []byte, not
byte.

ceving

Jun 22, 2011, 3:58:31 AM
to golang-nuts
On Jun 21, 7:03 pm, Kyle Lemons <kev...@google.com> wrote:
> Pipelining works great in hardware but doesn't translate that nicely into
> software.

See continuations for a nice translation.

Matt

Jun 22, 2011, 5:13:38 AM
to golang-nuts
When you say much faster, you really mean 2x, which is still slower
than I'd expect by an order of magnitude. 500 cycles for channel
passing on the same core, when a CAS costs ~60 cycles. Perhaps there
is other overhead here (context switching)?

I wasn't trying to be downbeat on Go, far from it, it's currently my
rapid prototyping language of choice; I just wanted to understand if
there was something fundamental that I was misunderstanding in the
performance of channels. Seems like they are currently just a little
bit slow for my envisaged use case.

I guess future optimizations in Go's runtime may choose to multiplex
goroutines within the same thread for better performance.

Matt

Jun 22, 2011, 5:15:55 AM
to golang-nuts
Think of something akin to SystemC simulations: you can plug multiple
logic/blackbox components together, and then you want to run all the
blackbox components in parallel, synchronizing on a timer event and/
or the data flowing into them.

SystemC is able to achieve a much higher rate of message passing than
Go; it's the same software methodology, just their concept of threads
and messaging is more efficient.

Matt

Jun 22, 2011, 5:41:48 AM
to golang-nuts
Thanks. I'm aware that I can pass around larger data structs, and
indeed I'd like to do just that. However, the lower bound for the
message passing still holds true I believe.

On Jun 22, 1:17 am, Nigel Tao <nigel...@golang.org> wrote:

Matt

Jun 22, 2011, 8:18:20 AM
to golang-nuts
I'd really like to know: can you set up goroutines to coexist within
the same thread? The idiom is nice: create some workers with channels
flowing between them. But the overhead of communication between
goroutines each bound to its own thread is too expensive in some
scenarios, as pointed out above.

Andy Balholm

Jun 22, 2011, 11:36:21 AM
to golang-nuts
If you use the 6g or 8g compiler, all goroutines already share the
same thread unless GOMAXPROCS is set to a value greater than 1. (With
gccgo, each goroutine gets its own thread.) When GOMAXPROCS is greater
than 1, they run on a thread pool with that many threads.

The performance you're seeing is about the same as what I get with 6g
on my iMac with GOMAXPROCS=1. When I increase GOMAXPROCS to 2 or more,
the message-passing overhead increases significantly. I suppose that
is because it is difficult to synchronize between CPU cores.

I feel the same way as you do about this: Go is a great language, but
it would be even better if we could use channels freely without having
to worry about the performance impact.

Andy

Dmitriy Vyukov

Jun 22, 2011, 2:11:49 PM
to golan...@googlegroups.com
I second the concern about channel/goroutine performance.
Definitely there are cases where useful work outweighs communication overheads. However, it's not that simple either. For example, if on a dual-core machine 10,000 cycles of useful work is enough to keep communication overheads under 5%, then on a 4 processor x 16 core machine the useful work must be 500,000 cycles to keep communication overheads under 5%. So one can't say that a program has no problems in isolation; it also depends on the target hardware (so the problem may be discovered too late in the development cycle).
There are also cases where communication overheads determine whether a problem can be expressed in a natural way or not, or even whether it can take advantage of parallel hardware at all. Examples of problems with inherent fine-grained parallelism are:
 - parallel labyrinth exploration
 - parallel game (checkers) simulation
 - dynamic programming
 - parallel tree building
 - modelling systems
Now there are 2 possibilities:
 - It's possible to increase granularity by expressing the problem in an unnatural way (for example, explore several labyrinth nodes in one batch).
 - It's basically impossible/very hard to increase granularity (for example, a modelling system would require running a clustering algorithm for a specific workload up front, the resulting clusters are not necessarily good, and the system must then be expressed in an unnatural way again). In this case, communication overheads are basically a showstopper for the technology (Go).

I am planning to look into Go channels. It's impossible to achieve ultimate scalability because of current user interface, though.

ron minnich

Jun 22, 2011, 2:17:36 PM
to golan...@googlegroups.com
On Wed, Jun 22, 2011 at 11:11 AM, Dmitriy Vyukov <dvy...@google.com> wrote:

> I am planning to look into Go channels. It's impossible to achieve ultimate
> scalability because of current user interface, though.
>

Seems to me you're just recreating earlier discoveries in parallel
codes. The problem of fine grain vs. comm overheads is always there,
esp. as you scale up. Have you looked at languages specifically
designed for parallelism? There are many of them.

I keep wondering why everyone wants Go to be everything. Generics,
HPC, GPU, FPGA ... I saw this same feeding frenzy when Java came out
-- everyone wanted extensions to make java work well in their
particular part of the world. So we continuously saw requests for Java
to implement the union of all feature sets, without a corresponding
realization that it might not be Java once that was done.

Maybe Go is just not right for your app and you need to use something
else? I realize the channel model is very attractive for what you are
doing, but Go is not the only language which supports that sort of
comms.

ron

Paul Borman

Jun 22, 2011, 2:29:47 PM
to ron minnich, golan...@googlegroups.com
Why are you using multiple cores?  This is an important question.  If it is for performance of a single application, I suggest you look into some of the multi-core HPC work that has been going on for the past 25 years.  In super-computing, 2.5% of the machine being used by the OS was considered quite high.  The OS has a significant impact on how well your code will be able to utilize multiple processors.  Fine-grained parallelism has generally only done well when you can actually reserve and lock in the CPUs for your task, but that is not the typical environment available.  Go search for information about Cray's microtasking technology (back from the 80's and 90's).

An anecdotal story is why the important Fortran code NASTRAN was not modified to take advantage of multiple CPUs (well, at least not by the mid-90's).  The argument was that it was more efficient to run 8 copies of NASTRAN on an 8-CPU machine than to try to use 8 CPUs to run a single copy.  And besides, you always made many, many runs of NASTRAN.

    -Paul

Dmitriy Vyukov

Jun 22, 2011, 2:31:27 PM
to ron minnich, golan...@googlegroups.com
On Wed, Jun 22, 2011 at 10:17 PM, ron minnich <rmin...@gmail.com> wrote:
On Wed, Jun 22, 2011 at 11:11 AM, Dmitriy Vyukov <dvy...@google.com> wrote:

> I am planning to look into Go channels. It's impossible to achieve ultimate
> scalability because of current user interface, though.
>

Seems to me you're just recreating earlier discoveries in parallel
codes. The problem of fine grain vs. comm overheads is always there,
esp. as you scale up.


Sure.
But communication overheads determine a boundary of what can be expressed in a natural way and what can't.

 
Have you looked at languages specifically
designed for parallelism? There are many of them.

What languages do you mean?

 

I keep wondering why everyone wants Go to be everything. Generics,
HPC, GPU, FPGA ...

Is there any single bad aspect if Go channels become faster?

Btw, it's equally applicable to network servers. One either has problems with IO/request dispatching, or requests so big that... well, they deserve intra-request parallelization.

Anthony Martin

Jun 22, 2011, 3:45:54 PM
to Paul Borman, ron minnich, golan...@googlegroups.com
Paul Borman <bor...@google.com> once said:
> Why are you using multiple cores? This is an important question. If it is
> for performance of a single application I suggest you look into some the
> multi-core HPC work that has been going on for the past 25 years. In
> super-computing 2.5% of the machine being used by the OS was considered
> quite high. The OS has significant impact on how well your code will be
> able to utilize multiple processors. Fine grained parallelism generally has
> only done well when you can actually reserve and lock in the CPUs for your
> task, but that is not the typical environment available. Go search for
> information about Cray's microtasking technology (back from the 80's and
> 90's).

It's funny that this is directed at Ron, of all people.

Anthony

ron minnich

Jun 22, 2011, 4:05:50 PM
to Dmitriy Vyukov, golang-nuts
On Wed, Jun 22, 2011 at 11:31 AM, Dmitriy Vyukov <dvy...@google.com> wrote:

> What languages do you mean?

I think that ZPL was under-appreciated. You can also look at Cray's
Chapel, it is interesting. But there's a ton of parallel languages out
there, and these two will give you a starting point to find more.

> Btw, it's equally applicable to network servers. One either has problems
> with IO/request dispatching, or requests that big that... well, they deserve
> intra-request parallelization.

That's how we're using it here. I'm more surprised at the continuing
request to make Go do everything, which seems counter to the idea of a
nice tight language that does some things well.

I should not be surprised: it happens with every new language ;-)

ron

Andy Balholm

Jun 22, 2011, 5:16:31 PM
to golang-nuts
On Jun 22, 1:05 pm, ron minnich <rminn...@gmail.com> wrote:
> I'm more surprised at the continuing
> request to make Go do everything, which seems counter to the idea of a
> nice tight language that does some things well.

I see an important difference between the desire for generics and the
desire for better goroutine/channel performance: Adding generics would
make Go more like C++. Increasing channel performance would take what
is already one of Go's key strengths (easy concurrency) and make it
even better.

Of course, it's possible that there's another difference, too. Maybe
there's something inherent in the design of SMP systems that prevents
safe message passing from happening in less than a microsecond. But I
hope not. (I don't know much about the low-level details of threading—
that's one reason I like letting Go take care of them.)

Andy

Russ Cox

Jun 22, 2011, 7:46:24 PM
to Andy Balholm, golang-nuts
channels are very likely to get faster.
most parts of go are.

that said, it's still useless to pretend that
a send will ever be as cheap as an add.

⚛

Jun 23, 2011, 3:26:58 AM
to golang-nuts
On Jun 23, 1:46 am, Russ Cox <r...@golang.org> wrote:
> that said, it's still useless to pretend that
> a send will ever be as cheap as an add.

I am not sure I understand this sentence. Are you saying that "Sending
a single integer over a channel is always slower than adding two
integers"?

Kyle Lemons

Jun 23, 2011, 12:17:19 PM
to ⚛, golang-nuts
I am not sure I understand this sentence. Are you saying that "Sending
a single integer over a channel is always slower than adding two
integers"?

I think that's what he's saying, and I think that's pretty irrefutable...

~K

⚛

Jun 23, 2011, 1:53:11 PM
to golang-nuts
If you know the communication pattern before you run the program - it
is not dynamic, you know it at compile-time - then you can simply
replace send/receive pairs with simple JMP instructions. You jump from
the sender directly to the receiver. A predictable jump is
approximately as fast as adding two integers. This would work, for
example, if you imagine GOMAXPROCS fixed to 1. The next step would
be to try to completely delete some or all of the JMP instructions.

... so it appears the idea is refutable.

The questions are: how many existing and future Go programs fit this
pattern, and how to implement the detection in the compiler.

I believe the Java runtime does something similar when the JIT
removes lock-unlock pairs from Java code.

Michael Jones

Jun 23, 2011, 2:16:22 PM
to ⚛, golang-nuts
I built a graphics engine / game console OS / company along these lines where logical blocks of functions ran as threads, threads were connected by queues, and the high and low water marks on the queues caused the program counter(s) to move from thread to thread. The queues were designed to be cache-line friendly and the whole thing was fantastically fast. Still, our switching was not exactly as fast as a single add. ;-)
--

Michael T. Jones

   Chief Technology Advocate, Google Inc.

   1600 Amphitheatre Parkway, Mountain View, California 94043

   Email: m...@google.com  Mobile: 650-335-5765  Fax: 650-649-1938

   Organizing the world's information to make it universally accessible and useful


bflm

Jun 23, 2011, 3:47:05 PM
to golang-nuts
On Jun 23, 7:53 pm, ⚛ <0xe2.0x9a.0...@gmail.com> wrote:
> If you know the communication pattern before you run the program - it
> is not dynamic, you know it at compile-time - then you can simply
> replace send/receive pairs with simple JMP instructions.

Not generally, IMO. In some special cases probably yes.

> You jump from
> the sender directly to the receiver.

Only where there is no state/context to be switched, I believe. Not
applicable for the general case of a function/method being executed
with local state (private vars, parameters, eval stack, ..., some of
this cached by the compiler in e.g. CPU registers due to what the Go's
memory model guarantees/allows) while doing channel communication.

> A predictable jump is
> approximately as fast as adding two integers. This would work, for
> example, if you imagine GOMAXPROCS fixed to 1.   The next step would
> be to try to completely delete some or all of the JMP instructions.
>
> ... so it appears the idea is refutable.

Can't agree.

> The questions are: how many existing and future Go programs fit this
> pattern, and how to implement the detection in the compiler.

Undecidable and unimportant number. For sure there are programs with
manual memory management which will beat any garbage collector. For
sure there are programs with manual memory management which can never
be proven to be memory safe. Pick speed (sometimes) of memory safety
(always).

> I believe the Java programming is doing a similar task when the JIT is
> removing lock-unlock pairs from Java code.

I don't think those concepts (mutex lock/unlock vs malloc/free) are
related enough for this to be valid.

Let me please also comment on the thread topic per se:

No, the "prohibitively slow" is IMHO a misunderstanding. Roughly
hundreds of nanoseconds for a thread-safe queue write/add, a queue
read/get, and a full user-space CPU context switch is hopefully still
optimizable a bit, but there is no way on today's common CPUs to do
all of this in, say, 10 ns. And even the current performance beats a
full kernel thread context switch by quite a big margin.

Pete Wilson

Jun 25, 2011, 11:29:28 AM
to golang-nuts
One might hope to get the cost of "send a few values to another
goroutine" into the same ballpark as "call a function and pass a
few arguments". This was the case for the now-ancient Inmos transputer
family, with inter-processor communication not being much more
expensive, because message-passing and the suspension of a process on
the arrival of a message were supported in the hardware (the silicon
provided a "simple RTOS").