Channels of functions can be slow


John DeNero

Mar 23, 2011, 2:45:56 AM
to golang-nuts
Channels of functions appear to provide a nice type-safe way to
implement generic behavior. Unfortunately, they are very slow with
GOMAXPROCS>1 in 6g. Is this a known issue? Might performance improve
one day? For the simple example below that runs 1e6 no-op closures, I
see:

$ GOMAXPROCS=1 time gorun slow.go
0.49 real 0.45 user 0.01 sys
$ GOMAXPROCS=4 time gorun slow.go
15.15 real 7.20 user 16.28 sys

package main

import "sync"

func threadPool(work chan (func()), workers int) *sync.WaitGroup {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			for w := range work {
				w()
			}
			wg.Done()
		}()
	}
	return &wg
}

func main() {
	work := make(chan func())
	wg := threadPool(work, 4)
	for i := 0; i < 1e6; i++ {
		work <- func() {
			// actual work would go here
		}
	}
	close(work)
	wg.Wait()
}

John DeNero

Mar 23, 2011, 2:51:26 AM
to golang-nuts
By the way, it looks like much (but not all) of the time comes from
having multiple goroutines all iterating over the same channel. If I
change the call threadPool(work, 4) to threadPool(work, 1), I get

$ GOMAXPROCS=4 time gorun slow.go
3.84 real 2.31 user 2.84 sys

That's still >7x slower than the GOMAXPROCS=1 run, but not 30x.

Jessta

Mar 23, 2011, 3:04:57 AM
to John DeNero, golang-nuts
On Wed, Mar 23, 2011 at 5:45 PM, John DeNero <den...@google.com> wrote:
> Channels of functions appear to provide a nice type safe way to
> implement generic behavior.  Unfortunately, they are very slow with
> GOMAXPROCS>1 in 6g. Is this a known issue?  Might performance improve
> one day?  For the simple example below that runs 1e6 no-op closures, I
> see:

So you have 4 threads that are all waiting on a single thread to feed them.
Since the actual work these threads are doing is nothing, progress
depends on the single thread feeding them.
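
For example, here's a rough (untested) sketch of the same program with the
work sent in batches, so the single feeding goroutine performs far fewer
channel operations per closure (the batch size is arbitrary):

package main

import "sync"

func threadPool(work chan []func(), workers int) *sync.WaitGroup {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			// drain batches of closures rather than one closure per receive
			for batch := range work {
				for _, w := range batch {
					w()
				}
			}
			wg.Done()
		}()
	}
	return &wg
}

func main() {
	const batchSize = 1000 // arbitrary; just needs to amortize the send cost
	work := make(chan []func())
	wg := threadPool(work, 4)
	batch := make([]func(), 0, batchSize)
	for i := 0; i < 1e6; i++ {
		batch = append(batch, func() {
			// actual work would go here
		})
		if len(batch) == batchSize {
			work <- batch
			batch = make([]func(), 0, batchSize)
		}
	}
	if len(batch) > 0 {
		work <- batch
	}
	close(work)
	wg.Wait()
}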

- jessta

--
=====================
http://jessta.id.au

John DeNero

Mar 23, 2011, 10:29:09 AM
to golang-nuts
Yes, but I'd expect the time with GOMAXPROCS=1 and GOMAXPROCS=4 to be
about the same. Instead, we see that adding system threads (without
changing the code) makes the program much slower. I don't understand
why turning on multiple threads should slow things down so much.

In more complex examples, i.e. when work() does real work, this
channel-of-closures design is still just slow. I'd like to understand
why. If it's helpful, I can work up an example.
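
For instance, a rough sketch of the kind of thing I mean (the constants and
the arithmetic are arbitrary, just enough to make each closure non-trivial):

package main

import (
	"fmt"
	"sync"
)

func threadPool(work chan func(), workers int) *sync.WaitGroup {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			for w := range work {
				w()
			}
			wg.Done()
		}()
	}
	return &wg
}

func main() {
	var (
		mu    sync.Mutex
		total float64
	)
	work := make(chan func())
	wg := threadPool(work, 4)
	for i := 0; i < 1e6; i++ {
		n := i
		work <- func() {
			// a small amount of real work per closure
			s := 0.0
			for j := 0; j < 100; j++ {
				s += float64(n * j)
			}
			mu.Lock()
			total += s
			mu.Unlock()
		}
	}
	close(work)
	wg.Wait()
	fmt.Println(total)
}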

Thanks,
John

Russ Cox

Mar 23, 2011, 10:31:46 AM
to John DeNero, golang-nuts
On Wed, Mar 23, 2011 at 10:29, John DeNero <den...@google.com> wrote:
> Yes, but I'd expect the time with GOMAXPROCS= 1 or 4 threads to be
> about the same.  Instead, we see that adding system threads (but not
> changing the code) makes the program much slower.  I don't understand
> why turning on multiple threads should slow things down so much.

http://golang.org/doc/go_faq.html#Why_GOMAXPROCS

andrey mirtchovski

Mar 23, 2011, 11:17:35 AM
to Russ Cox, John DeNero, golang-nuts
To add to what the FAQ says, you can see the effect clearly with a
simple benchmark from the attached code (it's an abridged version of a
larger benchmark that tries to figure out whether it's more efficient
to send strings or pointers to strings down a channel):

$ ./6.out
1: 2000000 960 ns/op
2: 500000 7147 ns/op
$

[attachment: chanstr.go]
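
In case the attachment doesn't come through, here is a rough sketch of what
such a benchmark might look like (a guess at its shape, not the attached file
itself; it only sends plain strings and varies GOMAXPROCS):

package main

import (
	"fmt"
	"runtime"
	"testing"
)

// benchSend times sends of a string down an unbuffered channel to one receiver.
func benchSend(b *testing.B) {
	c := make(chan string)
	go func() {
		for range c {
		}
	}()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		c <- "hello, world"
	}
	close(c)
}

func main() {
	// one result line per GOMAXPROCS setting, as in the output above
	for procs := 1; procs <= 2; procs++ {
		runtime.GOMAXPROCS(procs)
		fmt.Printf("%d: %v\n", procs, testing.Benchmark(benchSend))
	}
}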

roger peppe

Mar 23, 2011, 11:49:15 AM
to andrey mirtchovski, Russ Cox, John DeNero, golang-nuts

i thought it was interesting that even with GOMAXPROCS > 1, multiple
readers caused a slowdown compared to one reader. i guess it's
possible that's due to lock contention.

here's another micro-benchmark that tries out combinations
of buffer size, GOMAXPROCS and reader count.

here are some results, sorted by time-per-channel-send.
while it's clear that GOMAXPROCS=1 has the least channel
overhead, the other tradeoffs are not so clear. the roughly 50x
difference between the fastest and slowest combinations is quite
interesting.


procs 1; readers 1; buffer 100; 10000000 186 ns/op
procs 1; readers 2; buffer 100; 10000000 188 ns/op
procs 1; readers 3; buffer 100; 10000000 194 ns/op
procs 1; readers 4; buffer 100; 10000000 196 ns/op
procs 1; readers 1; buffer 10; 10000000 237 ns/op
procs 1; readers 2; buffer 10; 10000000 269 ns/op
procs 1; readers 3; buffer 10; 10000000 294 ns/op
procs 1; readers 4; buffer 10; 5000000 321 ns/op
procs 1; readers 1; buffer 0; 5000000 462 ns/op
procs 1; readers 3; buffer 0; 5000000 468 ns/op
procs 1; readers 2; buffer 0; 5000000 470 ns/op
procs 1; readers 4; buffer 0; 5000000 496 ns/op
procs 1; readers 1; buffer 1; 5000000 751 ns/op
procs 1; readers 2; buffer 1; 2000000 762 ns/op
procs 1; readers 3; buffer 1; 2000000 777 ns/op
procs 1; readers 4; buffer 1; 2000000 786 ns/op
procs 3; readers 1; buffer 10; 2000000 1234 ns/op
procs 2; readers 1; buffer 10; 1000000 1250 ns/op
procs 4; readers 1; buffer 10; 1000000 1406 ns/op
procs 2; readers 2; buffer 0; 1000000 1577 ns/op
procs 2; readers 3; buffer 0; 1000000 1919 ns/op
procs 2; readers 2; buffer 100; 1000000 1960 ns/op
procs 2; readers 4; buffer 10; 1000000 2001 ns/op
procs 2; readers 3; buffer 100; 2000000 2077 ns/op
procs 2; readers 3; buffer 10; 1000000 2357 ns/op
procs 4; readers 1; buffer 100; 500000 2423 ns/op
procs 2; readers 4; buffer 0; 500000 2441 ns/op
procs 2; readers 4; buffer 100; 500000 2601 ns/op
procs 2; readers 2; buffer 10; 500000 2611 ns/op
procs 2; readers 1; buffer 100; 500000 2872 ns/op
procs 3; readers 1; buffer 100; 500000 3036 ns/op
procs 4; readers 1; buffer 0; 500000 3860 ns/op
procs 2; readers 1; buffer 0; 500000 3879 ns/op
procs 3; readers 2; buffer 0; 500000 3912 ns/op
procs 4; readers 2; buffer 0; 500000 3925 ns/op
procs 3; readers 1; buffer 0; 500000 3963 ns/op
procs 4; readers 3; buffer 0; 500000 5132 ns/op
procs 3; readers 4; buffer 10; 200000 5564 ns/op
procs 4; readers 4; buffer 10; 200000 5785 ns/op
procs 4; readers 4; buffer 100; 200000 5972 ns/op
procs 4; readers 2; buffer 10; 500000 6240 ns/op
procs 2; readers 4; buffer 1; 500000 6358 ns/op
procs 3; readers 3; buffer 10; 200000 6513 ns/op
procs 2; readers 3; buffer 1; 500000 6594 ns/op
procs 4; readers 4; buffer 0; 200000 6718 ns/op
procs 4; readers 3; buffer 10; 200000 6721 ns/op
procs 2; readers 2; buffer 1; 500000 6809 ns/op
procs 3; readers 2; buffer 100; 200000 7044 ns/op
procs 3; readers 3; buffer 100; 500000 7152 ns/op
procs 3; readers 4; buffer 100; 200000 7233 ns/op
procs 4; readers 2; buffer 100; 200000 7288 ns/op
procs 4; readers 1; buffer 1; 500000 7378 ns/op
procs 2; readers 1; buffer 1; 500000 7423 ns/op
procs 3; readers 1; buffer 1; 500000 7442 ns/op
procs 3; readers 2; buffer 10; 500000 7467 ns/op
procs 3; readers 2; buffer 1; 500000 7508 ns/op
procs 3; readers 4; buffer 0; 200000 7574 ns/op
procs 4; readers 2; buffer 1; 500000 7738 ns/op
procs 3; readers 3; buffer 1; 200000 8004 ns/op
procs 3; readers 3; buffer 0; 200000 8298 ns/op
procs 4; readers 4; buffer 1; 500000 8320 ns/op
procs 4; readers 3; buffer 1; 500000 8336 ns/op
procs 4; readers 3; buffer 100; 200000 8483 ns/op
procs 3; readers 4; buffer 1; 200000 9590 ns/op

[attachment: tst.go]
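
in case the attachment doesn't come through, here's a rough sketch of the kind
of benchmark described above (a guess at its shape, not the attached file; it
prints one line per combination, which can then be sorted by ns/op):

package main

import (
	"fmt"
	"runtime"
	"testing"
)

// bench returns a benchmark that sends b.N values down a channel with the
// given buffer size while `readers` goroutines drain it.
func bench(readers, buffer int) func(b *testing.B) {
	return func(b *testing.B) {
		c := make(chan int, buffer)
		done := make(chan bool)
		for i := 0; i < readers; i++ {
			go func() {
				for range c {
				}
				done <- true
			}()
		}
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			c <- i
		}
		b.StopTimer()
		close(c)
		for i := 0; i < readers; i++ {
			<-done
		}
	}
}

func main() {
	for _, procs := range []int{1, 2, 3, 4} {
		runtime.GOMAXPROCS(procs)
		for _, readers := range []int{1, 2, 3, 4} {
			for _, buffer := range []int{0, 1, 10, 100} {
				r := testing.Benchmark(bench(readers, buffer))
				fmt.Printf("procs %d; readers %d; buffer %d; %v\n",
					procs, readers, buffer, r)
			}
		}
	}
}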