Channel iterators -- cool until you time them


Mike Beller

unread,
Dec 1, 2009, 9:10:37 PM12/1/09
to golang-nuts
I have been looking at the Go sources, and I watched Rob's "newsqueak"
talk. I'm really enjoying the mind-bending potential of Go-style
CSP. It's even more fun than Erlang's concurrency model. I wonder,
though, how far it is practical to take this model in Go right now.
For example, channels and goroutines are used to model an iterator in
the standard library's container/vector package.

// Channel iterator for range.
func (p *IntVector) Iter() <-chan int {
	c := make(chan int);
	go p.iterate(c);
	return c;
}

But when I time similar channel/goroutine based solutions, they seem
to come out *way* slower than their old-style counterparts. As a
"pure" case, take this simple counting program (attached).

A straight counting loop in Go can count 450M per second on my
machine. A counter "object" (somewhat inefficiently implemented, I
might add) can count 150M per second. The channel/goroutine counter
"source" can only count 3M per second, and only when GOMAXPROCS=1.
With GOMAXPROCS>1, it drops to 120K counts per second! That's a
factor of 50 to 100 even in the best case. I have to attribute this
to inefficient scheduling.

So with the current state of the scheduler, is it really wise to be
creating standard library code with iterators that are apparently
orders of magnitude slower than other approaches? Is it intended that
the scheduler will soon become so good that it will make sense to use
this kind of approach to iteration? (That would be awesome if the
answer is yes!)

--Mike

// sample "try counting 3 ways" source code:
package main

import (
	"fmt";
	"os";
	"strconv";
)

func counter(from uint64, upto uint64) chan uint64 {
	ch := make(chan uint64);
	go func() {
		for i := from; i < upto; i++ { ch <- i }
		close(ch);
	}();
	return ch;
}

type Counter struct {
	count uint64;
	upto uint64;
}

func NewCounter(from uint64, upto uint64) *Counter {
	return &Counter{from, upto};
}

func (c *Counter) next() uint64 {
	old := c.count;
	c.count++;
	return old;
}

func (c *Counter) end() bool {
	return c.count > c.upto;
}

func main() {
	sum := uint64(0);
	typ, _ := strconv.Atoui64(os.Args[1]);

	switch typ {
	case 1: // simple loop
		for i := uint64(0); i < 1000000000; i++ {
			sum += i;
		}
	case 2: // counter object
		ctr := NewCounter(0, 1000000000);
		for i := ctr.next(); !ctr.end(); i = ctr.next() {
			sum += i;
		}
	case 3: // goroutine
		ctr := counter(0, 1000000000);
		for i := range ctr {
			sum += i;
		}
	}

	fmt.Printf("%d\n", sum);
}

Russ Cox

unread,
Dec 1, 2009, 9:24:17 PM12/1/09
to Mike Beller, golang-nuts
> So with the current state of the scheduler, is it really wise to be
> creating standard library code with iterators that are apparently
> orders of magnitude slower than other approaches?

We'll see. It's an experiment. There's parallelism
possible in the goroutine version that is not possible
in your counter object; it's much more powerful than
just some function calls, but right now you pay for
that power even if you don't use it. The channel
runtime and scheduler will improve as we accumulate
experience with the system.

Russ

Rowan Davies

unread,
Dec 2, 2009, 8:08:30 PM12/2/09
to golang-nuts
In my opinion, this is an experiment worth continuing - it allows you
to do the equivalent of a "yield" or "yield return" in Python, Ruby,
C#, F#, etc., when constructing sequences. Yes, it has a cost, but
I'd hope that goroutines will be well optimised for this kind of thing.

C# manages to reduce the overhead for the corresponding use of "yield
return" in a simple for loop so that there's roughly a factor of two
compared to the version with a simple method call. See, eg:

http://stackoverflow.com/questions/408452/why-enumerable-range-is-faster-than-a-direct-yield-loop

This is done via a transformation to a state machine. I think the
same should be possible for Go - each channel send would be a state,
and a channel could be marked when a goroutine is currently in such a
state. In that case a simple function call can "wake" the goroutine,
which runs until its next send, at which point the value comes back
via a normal function-call return. Of course, in Go you have
complications: the next send might be to the wrong channel, other
goroutines might send on the original, or the previous send might have
been within a function that has since returned. But I think there are
reasonable ways to deal with such things, and regardless this is an
important special case - important enough to try to detect it and
apply a very specific optimization.

Having taught game programming recently, I can say that relying on the
fast "yield return" of C# greatly simplified a lot of event-oriented
code in that context. I really hope Go can manage the same - in fact
I'd be prepared to have a look at what's involved and make an attempt
if that seems useful (my PhD is in PL from CMU).

- Rowan

atomly

unread,
Dec 3, 2009, 12:35:29 AM12/3/09
to Rowan Davies, golang-nuts
I think the new channel/goroutine style is worthwhile, especially
since most of the time the iterator will be doing more heavy lifting
than just incrementing an accumulator and the scheduler will hopefully
become more optimized as time goes by. At that point, this sort of
pattern will become very useful for doing asynchronous iterations that
leverage multiple cores...

--
:: atomly ::


Mike Beller

unread,
Dec 3, 2009, 10:19:51 AM12/3/09
to golang-nuts
Rowan

Any idea how to go about it? Have you looked at the scheduler?

I was thinking that it might be interesting if each channel could have
associated with it a list of the goroutines that are blocked waiting
to receive from (or send to) that particular channel. Then, for
example, if a goroutine is blocked on send, and another goroutine
attempts to read from the same channel, the runtime's channel read
routine could just pick up the first sender from the list of waiting
senders and force the scheduler to schedule that goroutine.

This ignores issues of GOMAXPROCS>1, which complicate matters
greatly. But I imagine that even if this optimization were only
applied to synchronous channels, for the case of GOMAXPROCS=1, and
even if for some reason it had to be limited to the single-reader,
single-writer case (which I would assume is the most common), it could
massively speed up the use of goroutines for the kinds of things that
generators are used for in Python. BTW: I timed Python's generators,
and the generator version of my counting program was about half as
fast as a straight loop in Python. That's a small overhead (a factor
of 2) compared to the Go overhead of a factor of 50. If we could
achieve the same with the above optimization, goroutines as iterators
could become a practical reality.

--Mike


Ryanne Dolan

unread,
Dec 3, 2009, 2:04:37 PM12/3/09
to r...@golang.org, golang-nuts, Mike Beller

Russ,

Is there really parallelism possible in the goroutine version?  It seems to me that the producer and consumer goroutines are related such that one will always be blocking on the other.  If the channel has size 0, no amount of scheduling can make this parallel.  Even if you use a channel with capacity 100, for example, it is likely that either the producer or consumer will end up blocking or finishing first, and even if the scheduler load-balanced perfectly, the iterator could never use more than 2 processors.

Maybe the iterators could use a capacity related to the size of the container, e.g. if the channel had capacity == container size then at least the producer never needs to block.  This ameliorates the problem but does not make it very parallel or fast, since each iteration still needs a copy into the channel and a copy out, which a sane iterator would not need.

So even if you eliminate the blocking problem, you still end up with a slow iterator.  It never seems logical to do with two processors what you can do faster with one.  And since this can't scale to more than two processors at a time, it is never a logical implementation.

I wish this weren't true, because I think the channel iterator concept is rather elegant.  Please enlighten me if I'm missing some voodoo.

Ryanne
- from my phone -


Russ Cox

unread,
Dec 3, 2009, 2:07:03 PM12/3/09
to Ryanne Dolan, golang-nuts, Mike Beller
> Is there really parallelism possible in the goroutine version?  It seems to
> me that the producer and consumer goroutines are related such that one will
> always be blocking on the other.

not necessarily. they do synchronize but
the producer can work on computing the
next value while the consumer is dealing
with the previous one. if there is significant
computation on both sides, it can run in
parallel. this is not true of standard function
call-based iterators, nor of python's generators.

russ

Ryanne Dolan

unread,
Dec 3, 2009, 2:36:20 PM12/3/09
to r...@golang.org, golang-nuts, Mike Beller

True, but if there is significant work on the producer side then you don't really have an iterator...

Certainly we shouldn't design all iterators to be slow in general but possibly faster in the case where a different pattern would make more sense.

For example, if there was some container for which iterating was a big computation, then the programmer can use a generator and read from it in a loop.  That is fine. But it doesn't follow that iterating over a list or vector should follow the same pattern just for the sake of consistency.

In most cases (I'd say all good cases) the iterator is doing something trivial, and that is why it is hidden behind the for...range syntactic sugar.

I guess I am arguing for a way to implement coroutines without being penalized for channel communication.  I'd like to see the for...range syntax work for coroutines, and most iterators replaced with call-based coroutines rather than goroutines.

Perhaps the compiler could be smart and replace goroutines with coroutines magically, but I don't see the point.

- from my phone -


peter.l...@gmail.com

unread,
Jan 3, 2012, 5:46:58 AM1/3/12
to golan...@googlegroups.com, r...@golang.org
The world has improved and iterators are comparable with "inefficient" objects if sufficient channel capacity is allocated. For less trivial computations, iterators are a reasonable design choice, with overhead comparable to function calls.

(I discovered this discussion from https://sites.google.com/site/gopatterns/object-oriented/iterators ... perhaps the gopatterns page should be updated?)

I ran the OP's benchmark program on my MacBook Pro to see if things have improved in 2 years ... the timings for iterators are much better now:
  2.2 sec for counting loop
  7.3 sec for counting "object" (same "inefficient" implementation)
  21.1 sec for iterator

So, only one order of magnitude slower, which is a big improvement (and 3x slower than objects).

BUT ... I then changed the "make (chan uint64)" to "make (chan uint64, <value>)" and ...

  7.7 sec for iterator with capacity of 100
  11.2 sec for iterator with capacity of 10
  37.9 sec for iterator with capacity of 1

Conclusion: with a channel capacity of 10, the iterator is about 50% slower than the counter object; with capacity 50, they're pretty close (increasing capacity beyond 200 yields no improvement).

It seems that any value of GOMAXPROCS other than the default (1) is about an order of magnitude slower, or worse -- for capacity 10, the time increased to 80 sec (149 sec elapsed time, with 124 sec system time) with GOMAXPROCS=2 on a 2-CPU machine.

As with all benchmarks, YMMV. (And I was lazy and just gave some individual times instead of averaging over a number of runs ... they seem reproducible to within +/- 10% on my machine.)

Ryanne Dolan

unread,
Jan 3, 2012, 9:40:01 AM1/3/12
to peter.l...@gmail.com, golang-nuts
Thanks for the new stats.  I updated the article with a link to your results.

Ryanne


Alexey Borzenkov

unread,
Jan 3, 2012, 11:16:01 AM1/3/12
to golan...@googlegroups.com, r...@golang.org
On Tue, Jan 3, 2012 at 2:46 PM, peter.l...@gmail.com
<peter.l...@gmail.com> wrote:
>   7.7 sec for iterator with capacity of 100
>   11.2 sec for iterator with capacity of 10
>   37.9 sec for iterator with capacity of 1

It might be important to understand, though, that with GOMAXPROCS=1,
when you're using a buffered channel and your iterator goroutine is
pure computation (i.e. no syscalls or communication), it won't stop
until the buffer is filled completely. Effectively this means that
with capacity 100 the producer first puts in 100 elements, then the
iterator finally sleeps and the consumer wakes up and consumes all 100
elements (if the consumer is pure computation as well, which in
benchmarks it usually is), and then the process repeats. Because there
are fewer switches this results in less overhead, but you lose
granularity of iteration.

John Asmuth

unread,
Jan 3, 2012, 11:19:05 AM1/3/12
to golan...@googlegroups.com, r...@golang.org
On Tuesday, January 3, 2012 11:16:01 AM UTC-5, Alexey Borzenkov wrote:

> It might be important to understand though, that with GOMAXPROCS=1
> when you're using a buffered channel, and your iterator goroutine is
> pure computation (i.e. no syscalls or communications) then it won't
> stop until buffer is filled completely.

Are you sure about this? I don't know why this would be the case.

Alexey Borzenkov

unread,
Jan 3, 2012, 11:29:46 AM1/3/12
to golan...@googlegroups.com, r...@golang.org

I might be wrong, but I looked at the source code, and in
runtime/chan.c the functions chansend and chanrecv (the heart of the
channel machinery) show that gosched is only called when an operation
cannot be completed immediately. If GOMAXPROCS=1 then only one
goroutine can run at a time, and until some runtime operation calls
gosched or entersyscall, other goroutines will not run. That's why, as
long as a goroutine can send values to a channel, it will. And as long
as a goroutine can receive from the channel, it will as well. Only
when the channel is full (for send) or empty (for recv) will the
goroutine put itself to sleep.

John Asmuth

unread,
Jan 3, 2012, 11:51:20 AM1/3/12
to golan...@googlegroups.com, r...@golang.org
Looks like you're right. Nice detective work.

Kyle Lemons

unread,
Jan 3, 2012, 2:57:18 PM1/3/12
to golan...@googlegroups.com

>> It might be important to understand though, that with GOMAXPROCS=1
>> when you're using a buffered channel, and your iterator goroutine is
>> pure computation (i.e. no syscalls or communications) then it won't
>> stop until buffer is filled completely.
>
> Are you sure about this? I don't know why this would be the case.

It's by design; buffered channels wouldn't be much of a performance improvement otherwise.