go for?


John Asmuth

unread,
Nov 30, 2009, 11:51:00 AM11/30/09
to golang-nuts
Hi all,

I think a "go for" type construct, that runs iterations of a loop in
parallel, would be a good idea.

Now I realize that using this naively could be a bad idea. Consider

go for i := 1; i < 10; i++ {
    A[i] = A[i-1]*2;
}

This loop is not parallelizable, since it has dependencies across
iterations. And I don't know that building the tools to analyze this
into the go compiler is a great idea, but I'm not an expert so who
knows.

More importantly the iterating variable, 'i', can be changed by the
contents of the loop and this one variable has a different value for
every iteration. To run this loop in parallel the "go for" would have
to *know* that 'i' needed to be held fixed for each iteration. This is
complicated both for the developer and the compiler (probably).

Cleaner would be this sort of construct

go for i := range aChannel {
    foo(i);
}

Here, every iteration of the loop explicitly assigns a new value to
'i' that doesn't depend on the previous value (unlike the increment
from earlier). Here the compiler can know that each iteration has its
own copy of 'i', and in the non-parallelized version of the for loop,
nothing done in one iteration can affect the value of 'i' for another
iteration.

This, in conjunction with a library function that creates channels
filled with sequential numbers (similar to range in python), could
make a new go idiom for simple iteration. And when appropriate, the
developer could precede it with "go" to do things in parallel.

I am not sure that the go team will like the ability to put "go"
before a for loop on ranges, but not one that uses the for x;y;z
syntax, since knowing whether or not it was appropriate would require
some look-ahead (and I figure the lack of look-ahead necessary for the
language is one of the reasons it compiles so quickly, but once again
I'm not an expert), but really it's their fault for overloading "for"
like they have.

Or who knows, allow "go for i:=0; i<10; i++" and let it bomb if it
doesn't actually make sense for what the programmer wanted to do.

Thoughts?

atomly

unread,
Nov 30, 2009, 1:19:56 PM11/30/09
to John Asmuth, golang-nuts
This is definitely an interesting idea, but it'd really just be
syntactic sugar for something you can already do with the language--
since go has anonymous functions/closures, you could easily just spawn
one of those inside a for loop...

--
:: atomly ::

[ ato...@atomly.com : www.atomly.com : http://blog.atomly.com/ ...
[ atomiq records : new york city : +1.917.442.9450 ...
[ e-mail atomly-new...@atomly.com for atomly info and updates ...

John Asmuth

unread,
Nov 30, 2009, 1:28:32 PM11/30/09
to golang-nuts
I agree, this is completely sugar. But then, if we didn't like sugar
we could switch to C. I think having something like this would make it
really convenient to parallelize a lot of common routines. For
instance, you want to initialize every element in an array, but this
initialization might take some time. Go for!

So, like you said,

go for i:=0; i<10; i++ {
    foo(i)
}

Is the same as

for i:=0; i<10; i++ {
    go func (i int) {
        foo(i)
    }(i)
}

One issue would be to know which variables to include in the parameter
of this anonymous function. The ones declared in the for statement
seem appropriate. This would mean that

var i int;
go for i=0; i<10; i++ { foo(i) }

is different than the one that declares i in the for statement. This
seems ok to me.

gorgo...@online.de

unread,
Nov 30, 2009, 2:19:47 PM11/30/09
to golan...@googlegroups.com
On Mon, 30 Nov 2009 10:28:32 -0800 (PST)
John Asmuth <jas...@gmail.com> wrote:

>
> go for i:=0; i<10; i++ {
> foo(i)
> }

shouldn't the below work?


package main
import "fmt"

func main () {
    go func (i int) {
        for j:=0; j<10; j++ {
            fmt.Printf("%d\n",i+j);
        }
    }(100)
}

i don't get errors but also no prints. when i drop the "go", i get the
prints!? using current github and 8g.

--
GorgonZola <gorgo...@online.de>

gorgo...@online.de

unread,
Nov 30, 2009, 2:37:09 PM11/30/09
to golan...@googlegroups.com
On Mon, 30 Nov 2009 14:34:29 -0500
Alex Suraci <i....@toogeneric.com> wrote:

> On Nov 30, 2009, at 2:19 PM, gorgo...@online.de wrote:
>
> > i don't get errors but also no prints. when i drop the "go", i get the
> > prints!? using current github and 8g.
>
> You don't get any output because main() does not wait for goroutines to finish before exiting.
>
> Have it wait on a channel at the end.
>
> - Alex

I, stupid!


John Asmuth

unread,
Nov 30, 2009, 2:39:20 PM11/30/09
to golang-nuts
The program exits before the goroutine gets to do anything. This
works:

package main
import "fmt"
func main () {
    c := make(chan bool);
    go func (i int) {
        for j:=0; j<10; j++ {
            fmt.Printf("%d\n",i+j);
        }
        c <- true
    }(100);
    <- c;
}

Peter Froehlich

unread,
Dec 1, 2009, 3:46:54 AM12/1/09
to John Asmuth, golang-nuts
IIRC Robert Griesemer worked on this kind of construct for his thesis,
so we'd have an expert right here! :-)
--
Peter H. Froehlich <http://www.cs.jhu.edu/~phf/>
Senior Lecturer | Director, Johns Hopkins Gaming Lab

Rowan Davies

unread,
Dec 3, 2009, 3:48:25 AM12/3/09
to golang-nuts

I generally think adding sugar is only appropriate when avoiding it
leaves no way of implementing something without it looking quite ugly
("syntactic bitters"). The whole point of having functions is that
you can define new ones instead of adding new features to the language
- most of the time this works very well (no bitters required).

In this case, we can get pretty close to the sugared version using
anonymous functions, which are great for defining functions capturing
control idioms - this is very similar to "parallel_for" in Intel's
Threading Building Blocks and Microsoft's Parallel Patterns Library.
While we're at it, we can optimize the breaking up into chunks once
and for all.

///////////////////////////////////////////////////////////////////////////////
// library file par.go Author: Rowan Davies

package par

import "runtime"

const NTHREADS = 4;           // Consider: NCPU or NCPU*2 with Hyperthreading
const EXTRA_CHUNK_FACTOR = 4; // 1 is ideal, except that chunks may vary.
const NCHUNKS = NTHREADS * EXTRA_CHUNK_FACTOR;

func init() {
runtime.GOMAXPROCS(NTHREADS)
}

func For(start, end int, body func(int)) {
    num := end - start;
    c := make(chan int);

    doChunk := func (chkstart int, chkend int) {
        for i:=chkstart; i<chkend; i++ {
            body(i)
        }
        c <- 0
    };
    for chunk := 0; chunk < NCHUNKS; chunk++ {
        go doChunk(chunk*num/NCHUNKS, (chunk+1)*num/NCHUNKS)
    }
    for chunk := 0; chunk < NCHUNKS; chunk++ {
        <- c
    }
}

////////////////////////////////////////

Then instead of "go for" we can use "par.For" throughout our code:

///////////////////////////////////////////
package main

import ( "fmt"; "par" )

func fib(n int) int {
    if n<=1 {
        return 1
    }
    return fib(n-1) + fib(n-2)
}

const MAXFIB = 40;

func main() {
    par.For(0, 800, func(i int) {    // Was: for i:=0; i<800; i++ {
        res := fib(i % MAXFIB);
        if i%23 == 0 {
            fmt.Printf("fib(%d %% %d) = %d\n", i, MAXFIB, res)
        }
    })
}

///////////////////////////////

So, in my opinion, there's not much need for a built in construct to
do this, just like there's no need for a built in fib function.
(Although my fib does seem rather slow... ;-)

This all seems to work the way it should - I've even timed it on quad
core under Vista - the sequential version takes 1.56s, the parallel
30s. I did this as an exercise to get familiar with Go - if it seems
generally useful I guess I could contribute it as the start of a package
that mirrors some of TBB and PPL. (Although I'm unlikely to do things
like a cache affinity aware partitioner any time soon.)

- Rowan

jqb

unread,
Dec 3, 2009, 2:06:16 PM12/3/09
to golang-nuts
Um, the point is to run the iterations in parallel, not to run the
loop in parallel with main().

On Nov 30, 11:19 am, gorgonz...@online.de wrote:
> On Mon, 30 Nov 2009 10:28:32 -0800 (PST)
>
> John Asmuth <jasm...@gmail.com> wrote:
>
> > go for i:=0; i<10; i++ {
> >   foo(i)
> > }
>
> shouldn't the below work?
>
> package main
> import "fmt"
>
> func main () {
>     go func (i int) {
>         for j:=0; j<10; j++ {
>             fmt.Printf("%d\n",i+j);
>         }
>     }(100)
>
> }
>
> i don't get errors but also no prints. when i drop the "go", i get the
> prints!? using current github and 8g.
>
> --
> GorgonZola <gorgonz...@online.de>

atomly

unread,
Dec 3, 2009, 2:56:01 PM12/3/09
to jqb, golang-nuts
On Thu, Dec 3, 2009 at 2:06 PM, jqb <jqba...@gmail.com> wrote:
>
> Um, the point is to run the iterations in parallel, not to run the
> loop in parallel with main().
>

Yes, but this should work:

package main
import "fmt"

func main () {
for j:=0; j<10; j++ {
go func (i int) {
fmt.Printf("%d\n",i);
}(j);
}
<-make(chan bool);

jqb

unread,
Dec 3, 2009, 3:18:16 PM12/3/09
to golang-nuts
Non sequitur and redundant. Several people have already posted
versions that put the go inside the loop -- I addressed the fact that
GorgonZola's offering missed the whole point.

On Dec 3, 11:56 am, atomly <ato...@atomly.com> wrote:
> On Thu, Dec 3, 2009 at 2:06 PM, jqb <jqbal...@gmail.com> wrote:
>
> > Um, the point is to run the iterations in parallel, not to run the
> > loop in parallel with main().
>
> Yes, but this should work:
>
> package main
> import "fmt"
>
> func main () {
>   for j:=0; j<10; j++ {
>     go func (i int) {
>       fmt.Printf("%d\n",i);
>     }(j);
>   }
>   <-make(chan bool);
>
> }
>

Rowan Davies

unread,
Dec 4, 2009, 3:52:26 PM12/4/09
to golang-nuts
Is something like the par.For I posted generally useful - or am I
barking up the wrong tree?

I guess turning Go into TBB isn't really progress, and the version I
gave is not really optimal. But, I think a parallel for is a pretty
natural thing, and very handy sometimes.

Maybe I'm being overly influenced currently by TBB and PPL due to a
related Masters project I was supervising until recently.

John Asmuth

unread,
Dec 4, 2009, 7:21:51 PM12/4/09
to golang-nuts


On Dec 4, 3:52 pm, Rowan Davies <rd1...@gmail.com> wrote:
> Is something like the par.For I posted generally useful - or am I
> barking up the wrong tree?

Though it may be useful, it's not the kind of thing I was thinking
about. It doesn't allow for the flexibility of Go's for loop, which has
a richer syntax than simply counting a number of iterations on an
integer. A generic "go for" would be nice, if it worked like so:

go for <for-statement> {
    <code-block>
}

turns into:

completion := make(chan bool);
iterations := 0;
for <for-statement> {
    go func(<all-variables-declared-in-the-for-statement>) {
        <code-block>;
        completion <- true;
    }(<the-corresponding-variables>);
    iterations++;
}
for ; iterations > 0; iterations-- {
    <-completion;
}

The only problem I see with this sort of thing is if people did

var i int;
go for i=0; i<max; i++ {
    <code depending on the value of i>
}

since this code could potentially have a different result than if the
for loop were not done in parallel.

Also problematic is

go for i:=0; i<max; i++ {
    <code that changes the value of i>
}

For this reason I'm not sure the benefits merit the extra complexity
(having to watch for this kind of thing and not being able to blindly
put "go" before for loops).

- John

Rowan Davies

unread,
Dec 4, 2009, 8:30:30 PM12/4/09
to golang-nuts
Yes, I understand completely, it's definitely less flexible. But TBB
has become a major player in multicore parallel programming based
largely on people using exactly that idiom.

Most of the time, when loops can be easily made parallel, I'd claim
they normally do fit the par.For idiom (that I stole from Microsoft I
guess - or maybe Intel). For more complicated loops I'd suggest that
it's appropriate to consider them individually because there are
opportunities to optimize them via channels.

A "go for" construct would kinda hide these opportunities.

Pete Wilson

unread,
Dec 5, 2009, 2:10:10 PM12/5/09
to golang-nuts
FWIW, occam provided a version of this, but from a different starting
point. In occam, you can choose how to execute some chunks of code -
run 'em all one after the other (sequentially); run 'em all in some
arbitrary order, including concurrently (parallel); or run one of 'em
depending on conditions (conditional, or alternative if conditions
include listening to channels). Naturally, you can have nested
collections of any mix of these to any depth.

Given these methods of construction of collections of chunks, you
could also "replicate". So a replicated sequential looked like a for
loop; and a replicated parallel looked like the par.for being
discussed.

At least originally, occam didn't provide the 'unstructured
parallelism' provided by goroutines - a parallel block only completed
when all the component chunks had finished - the synchronising join
was implicit.

Historical note only.

-- Pete

Santidhammo

unread,
Dec 5, 2009, 3:31:09 PM12/5/09
to golang-nuts
I think it is rather important for this kind of construct to exist.
Think of a database iterator, which is a result of a query. If one
wants to do calculations on this, and it is possible to have
independency, I would like to have a construct similar to this:

go parallel result := db.Execute("SELECT * FROM my_table").Iterate()
{
    .
    .
    .
    doSomethingWithResult(...);
    .
    .
    .
}

Peter Froehlich

unread,
Dec 5, 2009, 4:07:42 PM12/5/09
to Santidhammo, golang-nuts
Hi all,

On Sat, Dec 5, 2009 at 3:31 PM, Santidhammo <svan...@gmail.com> wrote:
> I think it is rather important for this kind of construct to exist.
> Think of a database iterator, which is a result of a query. If one
> wants to do calculations on this, and it is possible to have
> independency, I would like to have a construct similar to this:
>
> go parallel result := db.Execute("SELECT * FROM my_table").Iterate()
> {
>    .
>    .
>    .
>    doSomethingWithResult(...);
>    .
>    .
>    .
> }

Funny you should say that because I am really busy evaluating the
different ways the generic database API could deal with exactly this.
From what I can tell so far we should use the existing exp/iter idea
to return individual results from the query if there are any. But how
those could be processed in parallel from the client side is a little
beyond me still. Seems to me the client would have to explicitly
create a goroutine for every result coming down the channel without a
construct like this. Anyone have a good feeling whether this is
correct?

Cheers,
Peter

atomly

unread,
Dec 6, 2009, 1:28:52 AM12/6/09
to Peter Froehlich, Santidhammo, golang-nuts
I think you're right, but it seems to me that it should be relatively trivial to do this.  Just have the library generate results on a channel and, on the client side, do a "for range" on the channel.  Then you can do a "go func" on each result inside of the loop.  Powerful code with minimal boilerplate.

Rowan Davies

unread,
Dec 6, 2009, 4:57:06 AM12/6/09
to golang-nuts
Yeah - occam was (is?) pretty cool. It was included when I took
concurrent programming back in 1989... (Occam was in the Ben-Ari
text).

When I taught the same unit in 2005-6, it was still using Ben-Ari, but
Occam had been taken out - and my brief was to focus on Java....
Even worse, we've cancelled the unit permanently, and instead I had to
develop a new unit "Programming Paradigms" which tried to cover
functional and logic programming as well as concurrency in one
semester. Sigh.

- Rowan

Rowan Davies

unread,
Dec 6, 2009, 9:29:00 AM12/6/09
to golang-nuts


On Dec 5, 8:21 am, John Asmuth <jasm...@gmail.com> wrote:

> completion := make(chan bool);
> iterations := 0;
> for <for-statement> {
>     go func(<all-variables-declared-in-the-for-statement>) {
>         <code-block>;
>         completion <- true;
>     }(<the-corresponding-variables>);
>     iterations++;
> }
> for ; iterations > 0; iterations-- {
>     <-completion;
> }

I forgot to say that in Go you don't need the <all-variables-declared-
in-the-for-statement>. That's pretty much exactly what closures do
for you, automatically.

So:

completion := make(chan bool);
iterations := 0;
for <for-statement> {
    go func() {
        <code-block>;
        completion <- true
    }();
    iterations++
}

for ; iterations > 0; iterations-- {
    <-completion;
}

Note that it's pretty easy to avoid some of this boilerplate via
another derived construct:

goKid, waitForKids := Patience();
for <for-block> {
    goKid(func () {
        <code-block>
    });
}
waitForKids();

To me that's not bad in terms of syntactic noise, certainly an
improvement. YMMV. I can send my version of Patience to anyone that
wants one but better to do it yourself. It's a good exercise.

You lose the chunking of par.For though, which is pretty essential for
huge performance-critical loops.

- Rowan

Rowan Davies

unread,
Dec 6, 2009, 9:36:49 AM12/6/09
to golang-nuts

Note that using Kid here is deliberate - to distinguish it from
Child. A Kid is a Child that the Parent waits for. As in:
"That's my Kid."
vs
"Whose Child is that?"

John Asmuth

unread,
Dec 6, 2009, 9:45:46 AM12/6/09
to golang-nuts

>
> I forgot to say that in Go you don't need the <all-variables-declared-
> in-the-for-statement>.   That's pretty much exactly what closures do
> for you, automatically.

I realize that closures would give access to the variables that would
have been available in the for's scope, but this is not the reason I
suggested that. I did that so each iteration would have its *own copy*
of each of those variables. Think of just a regular integer counting
for loop.

for i:=0; i<10; i++ {
    go func() {
        fmt.Printf("%d\n", i)
    }()
}

If that code segment actually prints out 0 through 9 then, well, you
got *lucky*.

Basically, each iteration of the loop does, in essence, have its own
parameters. It's just that in the normal for loop, they happen to be
stored in a reused variable. When the iterations are run in parallel,
you can't reuse that variable for obvious reasons. But we don't want
to make a copy of *every* variable, obviously. The "<all-variables-
declared-in-the-for-statement>" was my attempt to tease out exactly
which variables need to be copied.

- John

Santidhammo

unread,
Dec 6, 2009, 10:48:17 AM12/6/09
to golang-nuts
This was exactly the problem which occurred in my mind. Within a range
statement, the variables would be shared across the different
closures. I think this is not what we want if we want the body of the
for loop to be executed in parallel.

The next attempt:

gowait res := range db.Execute("SELECT * FROM MY_TABLE") {
    .
    .
    .
    mypkg.DoSomethingWithTheResult(res);
    .
    .
    .
}

gowait does all: make goroutines out of the results of range, copy the
results proper, and wait until all results have been processed.

Then, a variant could be used if blocking is not wanted (for example,
the program doesn't care about the results anymore):

goasync res := range db.Execute....

We can still use channels if there is a need for communication after
all.

Sjoerd

atomly

unread,
Dec 6, 2009, 10:53:52 AM12/6/09
to John Asmuth, golang-nuts
On Sun, Dec 6, 2009 at 9:45 AM, John Asmuth <jas...@gmail.com> wrote:
I realize that closures would give access to the variables that would
have been available in the for's scope, but this is not the reason I
suggested that. I did that so each iteration would have its *own copy*
of each of those variables. Think of just a regular integer counting
for loop.

for i:=0; i<10; i++ {
    go func() {
        fmt.Printf("%d\n", i)
    }()
}

If that code segment actually prints out 0 through 9 then, well, you
got *lucky*.

Basically, each iteration of the loop does, in essence, have its own
parameters. It's just that in the normal for loop, they happen to be
stored in a reused variable. When the iterations are run in parallel,
you can't reuse that variable for obvious reasons. But we don't want
to make a copy of *every* variable, obviously. The "<all-variables-
declared-in-the-for-statement>" was my attempt to tease out exactly
which variables need to be copied.

Yes, this is a very important thing to remember.   The first time I wrote the code I posted in this thread, I wrote it that way thinking closures would take care of everything for me, then I saw that my output was a bunch of 10s and quickly realized my mistake.  Guess I got a little too comfortable in the Erlang no-shared-data/copy-everything/message-passing-only/everything-is-immutable world of concurrency.

I do have to say, though, that the more I work with Go, the more I love it.  The concurrency model is so simple and the language really does provide a great middle ground between system languages, academic languages with modern programming concepts, quick to write scripting languages and powerful, workhorse languages with useful libraries.

atomly

unread,
Dec 6, 2009, 8:05:33 PM12/6/09
to Santidhammo, golang-nuts
On Sun, Dec 6, 2009 at 10:48 AM, Santidhammo <svan...@gmail.com> wrote:
This was exactly the problem which occured in my mind. Within a range
statement, the variables would be shared across the different
closures. I think this is not what we want if we want the body of the
for loop to be executed in parallel.

The next attempt:

gowait res := range db.Execute("SELECT * FROM MY_TABLE") {
   .
   .
   .
   mypkg.DoSomethingWithTheResult(res);
   .
   .
   .
}

gowait does all: make goroutines out of the results of range, copy the
results proper, and wait until all results have been processed.

Then, a variant could be used if blocking is not wanted (for example,
the program doesn't care about the results anymore):

goasync res := range db.Execute....

We can still use channels if there is a need for communication after
all.

Sjoerd

I hope I don't come off as a jerk, but I don't understand why you'd want to take something that can currently be built with a few lines of code and build it into the language instead, thus making the language more complex and ultimately giving you less control and flexibility. I much prefer that the language gives the simple building blocks needed so that you can assemble them in exactly the way that fits your given task.

Rowan Davies

unread,
Dec 8, 2009, 12:46:25 AM12/8/09
to golang-nuts
Oh - yes, you're right of course.

Although you can fix this pretty easily (I just ran it, and it works):

func main() {
    goKid, waitForKids := ParentWait();
    for i:=0; i<10; i++ {
        goKid (i, func(i int) {
            fmt.Printf("%d\n", i)
        });
    }
    waitForKids();
}

For the other arguments, it's really not clear that they should always
be copied - if some do need to be, just those can be put in an
anonymous function application. Or, goKid could avoid this extra
syntactic noise by allowing any type to be passed instead of just int
(via interface{}).

- Rowan

roger peppe

unread,
Dec 8, 2009, 2:55:16 AM12/8/09
to Rowan Davies, golang-nuts
2009/12/8 Rowan Davies <rd1...@gmail.com>:
> Oh - yes, you're right of course.
>
> Although you can fix this pretty easily (I just ran it, and it works):
>
> func main() {
>        goKid, waitForKids := ParentWait();
>        for i:=0; i<10; i++ {
>                goKid (i, func(i int) {
>                          fmt.Printf("%d\n", i)
>                });
>        }
>        waitForKids();
> }

i'm not quite sure what your goKid is doing, but
if it's just there to make a copy of i, surely
you can just do this?

for i := 0; i < 10; i++ {
    j := i;
    go func(){fmt.Println(j)};
}

Rowan Davies

unread,
Dec 8, 2009, 8:08:21 AM12/8/09
to golang-nuts
goKid certainly does more than just copy i. It basically increments a
counter, then starts a goroutine for the block that additionally
signals on a channel when it's done. waitForKids then loops until it
hears from the right number of kids.

I do like your way of doing the copy though, so I've removed the i
argument, and instead all arguments, including i, need to be
explicitly copied when used with goKid from ParentWait.

func main() {
    goKid, waitForKids := ParentWait();
    for i:=0; i<10; i++ {
        ii := i;
        goKid (func() {
            fmt.Printf("%d\n", ii)
        });
    }
    waitForKids();
}

Note that ParentWait is intended to be a general derived construct
that allows a parent to spawn kids and then wait for them to
complete. It's not specific to loops.

- Rowan

PS your code doesn't compile. You need an extra () at the end.

roger peppe

unread,
Dec 8, 2009, 8:46:33 AM12/8/09
to Rowan Davies, golang-nuts
2009/12/8 Rowan Davies <rd1...@gmail.com>:
> goKid certainly does more than just copy i.  It basically increments a
> counter, then starts a goroutine for the block that additionally
> signals on a channel when it's done.  waitForKids then loops until it
> hears from the right number of kids.

i like this. i already had a process registering package, but
i prefer your abstraction, so i've just changed my package to
have a similar interface.

the main difference is that instead of a wait function, it
has a wait channel, so you can select on other channels
while waiting for the kids too.

i've attached it. i wonder how similar it is to yours.

sample usage:

start, done := waiter.New();
for i := 0; i < 10; i++ {
    j := i;
    start(func(){print("hello", j)});
}
<-done;
print("done");
waiter.go

Ryanne Dolan

unread,
Dec 8, 2009, 10:14:34 AM12/8/09
to roger peppe, Rowan Davies, golang-nuts
Roger, 

Your waiter seems a lot more complicated than it needs to be.  I like this pattern better:

sem := make(semaphore);
for i,xi := range x {
    go func(i int) {
        ... 
        sem.Signal();
    }(i);
}
sem.Wait(len(x));

To me, abstracting away the channel as a semaphore makes a lot of sense and offers a minimal implementation.  I discuss it a bit here: http://www.golangpatterns.info/concurrency/parallel-for-loop

Thanks.
Ryanne

--
www.ryannedolan.info

Rowan Davies

unread,
Dec 8, 2009, 10:51:38 AM12/8/09
to golang-nuts
I can see the advantage of the channel in Roger's version - I
considered it, but I missed the possibility of selecting on it. I
think the most common case won't require this - so I've left my
version the way it is. You can always wrap it to get a channel:

go func() { WaitForKids(); <- done } ()

Ryanne's version seems less general because you need to know how many
children ahead of time - for some loops you don't, and the versions
Roger and I gave work for things other than loops. Also, my version
uses channels even less in favour of the stack - which may help
efficiency. I do like the use of semaphores though.

Here's my version:

func ParentWait() (func(func()), func()) {
    kidDone := make(chan struct{});
    kidsInAction := 0;
    goKid := func(body func()) {
        kidsInAction++;
        go func() {
            body();
            kidDone <- struct{}{}
        } ();
    };
    waitForKids := func() {
        for ; kidsInAction>0; kidsInAction-- {
            <- kidDone;
        }
    };
    return goKid, waitForKids
}



On Dec 8, 11:14 pm, Ryanne Dolan <ryannedo...@gmail.com> wrote:
> Roger,
>
> Your waiter seems a lot more complicated than it needs to be.  I like this
> pattern better:
>
> sem := make(semaphore);
> for i,xi := range x {
>     go func(i int) {
>         ...
>         sem.Signal();
>     }(i);
> }
> sem.Wait(len(x));
>
> To me, abstracting away the channel as a semaphore makes a lot of sense and
> offers a minimal implementation.  I discuss it a bit here:
> http://www.golangpatterns.info/concurrency/parallel-for-loop
>
> Thanks.
> Ryanne

roger peppe

unread,
Dec 8, 2009, 11:48:05 AM12/8/09
to Rowan Davies, golang-nuts
2009/12/8 Rowan Davies <rd1...@gmail.com>:
> Ryanne's version seems less general because you need to know how many
> children ahead of time - for some loops you don't, and the versions
> Roger and I gave work for things other than loops.

that would be my reply too. that, and that Ryanne's semaphore
implementation uses space proportional to the semaphore count
(unsafe.Sizeof(struct{}) == 4).

>        goKid := func(body func()) {
>                kidsInAction++;

nice and small code.

unfortunately this won't work correctly if goKid
is called in a new goroutine (for example if kids spawn
their own kids).

and, perhaps more seriously, the kids won't
go away until they're waited for - it's easy to
imagine a situation where 100000 intermediate
goroutines are spawned but there are only 3
active routines left to wait for at the end.

i'm sure there are other interesting methods of
doing this.

for instance, see the attached. it dispenses with
the central marshaling proc, and allows many
concurrent waiters. only barely tested!
waiter2.go

Ryanne Dolan

unread,
Dec 8, 2009, 11:53:50 AM12/8/09
to roger peppe, Rowan Davies, golang-nuts
Roger,

Ryanne's semaphore
implementation uses space proportional to the semaphore count
(unsafe.Sizeof(struct{}) == 4).

That is a shame!  I hope that is a short-term problem.  I don't see why [N]struct{} should have size 4*N?

I was gonna suggest to the group that semaphores be added to the sync library, but then realized chan struct{} should do the trick.  Maybe it doesn't.

and, perhaps more seriously, the kids won't
go away until they're waited for - it's easy to
imagine a situation where 100000 intermediate
goroutines are spawned but there are only 3
active routines left to wait for at the end.

What I was thinking.

Thanks.
Ryanne


roger peppe

unread,
Dec 8, 2009, 12:11:31 PM12/8/09
to Ryanne Dolan, golang-nuts
2009/12/8 Ryanne Dolan <ryann...@gmail.com>:
> That is a shame!  I hope that is a short-term problem.  I don't see why
> [N]struct{} should have size 4*N?

sometimes it's useful to divide by the size of a type
without checking for divide-by-zero.

i don't know if that's the case here though. (the
only place i could find was in the map code, which
is probably easy enough to change).

> I was gonna suggest to the group that semaphores be added
> to the sync library, but then realized chan struct{} should do the trick.
> Maybe it doesn't.

i have to say that i have never encountered a need for semaphores
in quite a few years of programming in this kind of style.
maybe that's just my lack of background in conventional
thread programming, but i think it's also that semaphores
go against the mantra - they're rooted in the old "communicate by
sharing memory" paradigm - they don't share memory by communicating.

atomly

unread,
Dec 8, 2009, 12:51:15 PM12/8/09
to roger peppe, Ryanne Dolan, golang-nuts
I don't really see the need for a semaphore, but if you need one,
couldn't you easily build it in a go-like way:

type Semaphore struct {
    sig chan bool;
}

func (self *Semaphore) Signal() {
    self.sig <- true;
}

func (self *Semaphore) Wait(n int) {
    for i := 0; i < n; i++ {
        <-self.sig;
    }
}
Ryanne Dolan

unread,
Dec 8, 2009, 1:14:47 PM12/8/09
to atomly, golang-nuts, roger peppe

Right, that was the idea behind chan struct {}, using a zero-size type rather than bool, since only the length of the channel matters. I still think it would be nice to have semaphores supported by the sync package tho.

- from my phone -

On Dec 8, 2009 11:51 AM, "atomly" <ato...@atomly.com> wrote:

I don't really see the need for a semaphore, but if you need one,
couldn't use easily build it in a go-like way:

type Semaphore struct {
 sig channel bool;
}

func (self *Semaphore) Signal() {
 sig<-true;
}

func (self *Semaphore) Wait(int n) {
 for i := 0; i < n; i++ {
   <-sig;

 }
}

Vish Subramanian

unread,
Dec 8, 2009, 1:21:59 PM12/8/09
to atomly, roger peppe, Ryanne Dolan, golang-nuts
On Tue, Dec 8, 2009 at 9:51 AM, atomly <ato...@atomly.com> wrote:
I don't really see the need for a semaphore, but if you need one,
couldn't use easily build it in a go-like way:

type Semaphore struct {
 sig chan bool;
}

func (self *Semaphore) Signal() {
 self.sig <- true;
}

func (self *Semaphore) Wait(n int) {
 for i := 0; i < n; i++ {
   <-self.sig;
 }
}


For unbuffered channels, the send/receive ops are synchronous (a send will wait until someone receives). For a semaphore you want a buffered channel.

Vish

John Asmuth

unread,
Dec 8, 2009, 1:25:50 PM12/8/09
to golang-nuts


On Dec 8, 10:14 am, Ryanne Dolan <ryannedo...@gmail.com> wrote:

> To me, abstracting away the channel as a semaphore makes a lot of sense and
> offers a minimal implementation.  I discuss it a bit here: http://www.golangpatterns.info/concurrency/parallel-for-loop

The semaphore idea is neat, I think.

I was wondering about something you wrote in that link.

First, you mention the following pattern, which you say you've seen in
several places. I can't help but think that one of those places is
my code for parallel matrix multiplication :)

xch := make(chan float);
out := make(chan float);
// start N goroutines
for _ = range data {
	go func() {
		xi := <-xch;
		out <- doSomething(xi);
	}();
}
// send input to each goroutine
for _, xi := range data {
	xch <- xi;
}
// collect results of each goroutine
for _ = range data {
	res := <-out;
	....
}

But you may not have been referring to the gomatrix code. While
similar, it differs in a few important aspects. I do not launch a
thread for each computation to be done. Instead, I launch a thread for
each worker process, which will receive multiple computations to
perform and will do them in sequence.

In short, the pattern I used looks like this:
in := make(chan int);
quit := make(chan bool);

worker := func() {
	for {
		select {
		case i := <-in:
			dostuff(i)
		case <-quit:
			return
		}
	}
};
for i := 0; i < threads; i++ {
	go worker()
}
for i := 0; i < max; i++ {
	in <- i
}
for i := 0; i < threads; i++ {
	quit <- true
}

My question to the thread is, is there a better way to do this sort of
thing? For N computations on X processors, is it better to launch X
threads and load-balance (using the above pattern or something else)?
Or should I launch N threads and let the OS/language balance
everything?

I tend to think that having X worker threads is better, because you
have a maximum amount of memory in use (temporaries that are inside
dostuff) that scales linearly with X, and if you have N threads each
doing one computation, it is possible that they run in lockstep, and
memory can scale with N (which is typically much much higher than X).

Having something like a parallel for function would be especially
useful if templating became part of the language. We could then do the
following:

inputs := make(chan MyInputType);
wait := parallel.For<MyInputType>(inputs, foo);
wait();

Where the function signature might be
func For<T>(chan T, foo func(t T)) func() { ... }

And returns a function that can be called if you wish to wait for
those computations to finish.

Without templating, any such function would have to be explicit about
the input type. One for integers would be useful, but certainly not
the only useful way.

Also a quick function to fill a channel with a range of numbers...

for i:=0; i<10; i++ {
foo(i);
}

could become

parallel.For(Count(0, 10), foo)();

etc.

- John

Ryanne Dolan

unread,
Dec 8, 2009, 1:36:31 PM12/8/09
to roger peppe, golang-nuts

Roger,

I'd argue that semaphores are the original signal... the simplest message you can send. I'd bet you've used them an awful lot without realizing it. Anytime you send a message but don't actually inspect the message ( done <- true ) you are using the same kind of logic, right?

- from my phone -

On Dec 8, 2009 11:11 AM, "roger peppe" <rogp...@gmail.com> wrote:

2009/12/8 Ryanne Dolan <ryann...@gmail.com>:

> That is a shame!  I hope that is a short-term problem.  I don't see why > [N]struct{} should have ...

sometimes it's useful to divide by the size of a type
without checking for divide-by-zero.

i don't know if that's the case here though. (the
only place i could find was in the map code, which
is probably easy enough to change).

> I was gonna suggest to the group that semaphores be added > to the sync library, but then realize...

Ryanne Dolan

unread,
Dec 8, 2009, 1:52:17 PM12/8/09
to John Asmuth, golang-nuts

John,

I don't mean to pick on you (publicly? anonymously?). But your code is still strange to me. The stranger thing is that the golang.org examples seem to advocate your pattern.  So maybe you are right, or at least more go-ish.  I don't think you are optimal tho.

Consider in the matrix example that you have two input matrices and one output matrix. You never write to the inputs and the output writes are all independent.  This is a data-parallel calculation.  Channels aren't needed at all!  All you need is to sync at the end (or some point in the future), which right now needs a channel (unless native semaphores are introduced at some point).  Other than this one sync step, no sync on the data is required.

Assuming stack is faster than channel communication, i'd say a better way is to use closures like in my example.

An even better way would be to return a closure to call when you need the result of the multiply.  Other examples in this thread do just that.  This thunk-based lazy eval is probably the most parallel way to do a for-loop, though I think it necessarily has more overhead.

- from my phone -

On Dec 8, 2009 12:25 PM, "John Asmuth" <jas...@gmail.com> wrote:


John Asmuth

unread,
Dec 8, 2009, 2:04:49 PM12/8/09
to golang-nuts


On Dec 8, 1:52 pm, Ryanne Dolan <ryannedo...@gmail.com> wrote:

> I don't mean to pick on you (publicly? anonymously?).

I know you didn't, I'm just teasing.

> But your code is still
> strange to me. The stranger thing is that the golang.org examples seem to
> advocate your pattern.  So maybe you are right, or at least more go-ish.  I
> don't think you are optimal tho.
>
> Consider in the matrix example that you have two input matrices and one
> output matrix. You never write to the inputs and the output writes are all
> independent.  This is a data-parallel calculation.  Channels aren't needed
> at all!  All you need is to sync at the end (or some point in the future),
> which right now needs a channel (unless native semaphores are introduced at
> some point).  Other than this one sync step, no sync on the data is
> required.

The channels are to ensure load balancing. Perhaps for parallel matrix
multiplication, it is unlikely one computation will take longer than
the other, since they have exactly the same number of steps. But in
general, the doStuff() method can take a varying amount of time to
complete. So if I want to split the work up into X (the number of
processors) threads, I can't just give each of those workers N/X
computations to do. I use the channel like a queue, so no worker is
left idle until there are no more jobs to pick up.

This is the kind of thing that I would expect a good implementation of
parallel for to do, and I don't think it is good to expect each
developer to write the same pattern each time. Either a "go for" or a
templated "parallel.For<T>(c chan T, func(t T))" is needed to allow
developers to write code that will load balance a for loop and not
have to worry about whether they're waiting on channels properly or
not.


>
> Assuming stack is faster than channel communication, i'd say a better way is
> to use closures like in my example.

I don't see using closures as sufficient for the load-balancing
question, unless I'm missing something.

> An even better way would be to return a closure to call when you need the
> result of the multiply.  Other examples in this thread do just that.  This
> thunk-based lazy eval is probly the most parallel way to do a for-loop, tho
> I think it might have necessarily more overhead.

This is interesting, but I don't know if it fits in the general idea
of a for-loop, which performs a computation, rather than returns a
value, for each iteration.

- John

Ryanne Dolan

unread,
Dec 8, 2009, 2:22:21 PM12/8/09
to John Asmuth, golang-nuts

John,

You are very right about the load balancing problem... which is why golang.org's load balancing examples are so strange to me.  I don't see why you'd try to implement load balancing in Go. I totally agree that channels are the way to go (the only way maybe) to do this in Go, but doesn't the runtime load balance already? Under the hood, the runtime uses traditional sync methods to solve this problem on a grander scale.  I think it is impossible to solve the problem efficiently in Go the language, but Go the runtime does it magically.

That said, I think using N goroutines and letting the runtime multiplex them to X procs is the only way to write efficient parallelism in Go.

I think this is a good thing, but I still wonder why golang.org leads us to believe we need to load balance ourselves. Am I missing something or is the example just really bad?

- from my phone -


John Asmuth

unread,
Dec 8, 2009, 2:31:49 PM12/8/09
to Ryanne Dolan, golang-nuts
I trust the runtime to balance computation. What I don't trust it to do is to manage the memory nicely.

Consider the situation where I want to call foo(i) with the values of 0 to 1000000. Forget the overhead of having 1000000 goroutines existing at the same time. But think about what happens in foo().

func foo(i int) {
	array := make([]int, 1000000);
	doStuffOnTheArray(array, i);
	c += reduceTheArrayToASingleValue(array);
}

The doStuffOnTheArray() function is going to take a long time, so it's very likely that all 1000000 goroutines are going to be in that stage of execution at the same time. This requires that each of the 1000000 goroutines (not too many goroutines for a powerful system) are going to allocate 1000000 integers (not too many integers for a powerful system) at the same time. Ok, we now have 10^12 integers allocated at the same time on a 20 processor computer. Assuming the computer can handle that much data, it doesn't gain any speed by having more than 20x1000000 integers allocated at the same time. The following implementation of parallel.For<int> would do this more efficiently, memory-wise.

func For(inputs chan int, foo func(i int)) (wait func()) {
	workerBlock := make(semaphore);
	numThreads := 20;
	for j := 0; j < numThreads; j++ {
		go func() {
			for i := range inputs {
				foo(i)
			}
			workerBlock.touch();
		}();
	}
	wait = func() { workerBlock.wait(numThreads) };
	return;
}

- John

Jonathan Amsterdam

unread,
Dec 8, 2009, 2:38:52 PM12/8/09
to golang-nuts
> In short, the pattern I used looks like this:
> in := make(chan int);
> quit := make(chan bool);
>
> worker := func() {
>         for {
>                 select {
>                 case i := <-in:
>                         dostuff(i)
>                 case <-quit:
>                         return
>                 }
>         }
> };
>
> for i := 0; i < threads; i++ {
>         go worker()
> }
>
> for i := 0; i < max; i++ {
>         in <- i
> }
>
> for i := 0; i < threads; i++ {
>         quit <- true
> }

Do you really need the quit channel? What if you just closed the in
channel, and had the workers check for that?

atomly

unread,
Dec 8, 2009, 3:08:56 PM12/8/09
to Vish Subramanian, roger peppe, Ryanne Dolan, golang-nuts
On Tue, Dec 8, 2009 at 1:21 PM, Vish Subramanian
<vish.sub...@gmail.com> wrote:
> For unbuffered channels, the send/receive ops are synchronous (the channel
> send will wait till someone calls receive). For a semaphore you  want a
> buffered channel.

Yes. I didn't "make" the channel anywhere in my code, so there's
really no way to tell if it's buffered or unbuffered-- I was working
under the assumption that it would be buffered with a method like the
following:

func NewSemaphore(cap int) *Semaphore {
	return &Semaphore{sig: make(chan bool, cap)};
}

An alternative would be to provide fail-fast semaphores with
unbuffered channels using the comma ok syntax, so that sends never
block but then it's possible to fail to acquire a semaphore (possibly
a useful pattern).

And, Ryanne, I didn't actually look at your Semaphore code until now.
Haha-- kinda funny to realize that this is indeed pretty much how you
implemented it. I was basically trying to say that you don't
really need semaphores in Go, but if you really, really need them,
then they're easy to build with simple Go constructs.

Ben Tilly

unread,
Dec 8, 2009, 5:05:02 PM12/8/09
to John Asmuth, Ryanne Dolan, golang-nuts
On Tue, Dec 8, 2009 at 11:31 AM, John Asmuth <jas...@gmail.com> wrote:
> On Tue, Dec 8, 2009 at 2:22 PM, Ryanne Dolan <ryann...@gmail.com> wrote:
[...]
> I trust the runtime to balance computation. What I don't trust it to do is
> to manage the memory nicely.
> Consider the situation where I want to call foo(i) with the values of 0 to
> 1000000. Forget the overhead of having 1000000 goroutines existing at the
> same time. But think about what happens in foo().
> func foo(i int) {
>   array := make([]int, 1000000);
>   doStuffOnTheArray(i);
>   c += reduceTheArrayToASingleValue();
> }

Note that c += ... is not threadsafe. You really, really want to
communicate c over a channel to an aggregator.

> The doStuffOnTheArray() function is going to take a long time, so it's very
> likely that all 1000000 goroutines are going to be in that stage of
> execution at the same time. This requires that each of the 1000000
> goroutines (not too many goroutines for a powerful system) are going to
> allocate 1000000 integers (not too many integers for a powerful system) at
> the same time. Ok, we now have 10^12 integers allocated at the same time on
> a 20 processor computer. Assuming the computer can handle that much data, it
> doesn't gain any speed by having more than 20x1000000 integers allocated at
> the same time. The following implementation of parallel.For<int> would do
> this more efficiently, memory-wise.
> func For(inputs chan int, foo func(i int)) (wait func()) {
> workerBlock := make(semaphore);
> numThreads := 20;
> for j:=0; j<numThreads; j++ {
> go func() {
> for i := range inputs {
> foo(i)
> }
> workerBlock.touch();
> }();
> }
> wait = func() { workerBlock.wait(numThreads) };
> return;
> }

You can easily implement that without semaphores. Here is some
untested code to do so. Note that I am encapsulating the body of the
loop as an arbitrary function that neither expects to receive nor
return arguments.

func For(parallelCount int, inputs chan func()) {
	done := make(chan bool);
	for i := 0; i < parallelCount; i++ {
		go makeWorker(inputs, done);
	}

	finished := 0;
	for finished < parallelCount {
		<-done; // Value doesn't matter
		finished++;
	}

	return;
}

func makeWorker(inputs chan func(), done chan bool) {
	f := <-inputs;

	for f != nil {
		f();
		f = <-inputs;
	}

	done <- true;
	return;
}

Cheers,
Ben

John Asmuth

unread,
Dec 8, 2009, 5:09:07 PM12/8/09
to Ben Tilly, Ryanne Dolan, golang-nuts
On Tue, Dec 8, 2009 at 5:05 PM, Ben Tilly <bti...@gmail.com> wrote:

Note that c += ... is not threadsafe.  You really, really want to
communicate c over a channel to an aggregator.

Certainly.
 
You can easily implement that without semaphores.  Here is some
untested code to do so.  Note that I am encapsulating the body of the
loop as an arbitrary function that neither expects to receive nor
return arguments.


I hid my implementation of semaphore, which did exactly what you did there with the done channel. In fact it was a typedef for chan bool: touch put a value in and wait(n) took n values out. Same thing.

- John 

Ben Tilly

unread,
Dec 8, 2009, 5:49:34 PM12/8/09
to John Asmuth, Ryanne Dolan, golang-nuts
On Tue, Dec 8, 2009 at 2:09 PM, John Asmuth <jas...@gmail.com> wrote:
> On Tue, Dec 8, 2009 at 5:05 PM, Ben Tilly <bti...@gmail.com> wrote:
[...]
>> You can easily implement that without semaphores.  Here is some
>> untested code to do so.  Note that I am encapsulating the body of the
>> loop as an arbitrary function that neither expects to receive nor
>> return arguments.
>>
>
> I hid my implementation of semaphore, which did exactly what you did there
> with the done channel. In fact it was a typedef to chan bool, touch put a
> value in and wait(n) took n values out. Same thing.

Ah. You are right.

Cheers,
Ben

Ryanne Dolan

unread,
Dec 8, 2009, 7:04:37 PM12/8/09
to John Asmuth, golang-nuts
John,

Do I understand you right?

You mean that launching N goroutines on X procs is a bad idea b/c each goroutine may require a ton of memory.   You say it would be better to launch fewer goroutines if you know you can't run more than X at a time anyway.

I'd agree to a point.  In the current release, Go allocates a big stack with each goroutine even if the goroutine doesn't need one (I think), so there is a memory penalty for using many goroutines even when they don't allocate 10^6 integers.  In many cases, the process of allocating the goroutine's stack may be expensive compared to what the goroutine actually _does_.  In that regard, it is bad to use goroutines when you don't need them, especially with the current implementation.  Agreed.

But your example is a little preposterous...

for i := 0; i < 1e6; i++ {
    data := make ([]huge, many);
    initialize (data);
    process (data, i);
    c += reduce (data)
}

No matter how you parallelize the above for-loop, it isn't a very good for-loop.  Your proposed solution is to do the allocation only 20 times:

for j := 0; j < 20; j++ {
    data := make ([]huge, many);
    initialize (data);
    for i := 0; i < 1e6; i++ {
        process (data, i);
        c += reduce (data);
    }
}
... but that is still 20x worse than it should be.  You are again describing a data-parallel for-loop (the subject of this thread).  Any data-parallel computation doesn't require synchronization, which means:

data := make ([]huge, many);
initialize (data);
for i := 0; i < 1e6; i++ {    
    process (data, i);
}
reduce (data);

... is the proper implementation.  I think the mistake that you and others keep making is to assume that Go doesn't allow shared memory.  It is easy to assume that goroutines can only communicate through channels, because the golang.org articles would have you believe that.  But you can also communicate through closures, which are unsafe, fast, and on the stack.  The unsafe part doesn't matter if you are doing data-parallel computation.

Channels are great from a high-level, but they are slow compared to proper use of shared memory, and you can't get any faster than data-parallel code with shared memory.  This is why the Transputer died long before I was born, but you can get a 128-core x 1.5GHz GPU for $100 (graphics rendering is a data-parallel computation).

I think there are two types of for-loops that people understand well:
  1. count from 1..N, results of iteration i are needed for i+1
  2. count from 1..N, each iteration is independent 
For type-1 for-loops, the goroutine-based iterators used by the standard library are as fast as you can get (with the current language).  For type-2 for-loops, you can use closures and N goroutines, letting the scheduler figure out the rest.  You don't ever need channels (or any synchronization at all) for type-2.

By the way, a smart compiler can figure out when a for-loop is actually type-2.  Go's compiler isn't that smart yet (I don't think), but I'm sure it will be; if not, I volunteer for the patch!  For example:

for _,xi := range array {
    ....
}

Since the iteration index is never used, whatever is in '...' can very likely be thrown into a stackless goroutine automatically.

Thanks.
Ryanne

John Asmuth

unread,
Dec 8, 2009, 7:20:09 PM12/8/09
to Ryanne Dolan, golang-nuts
On Tue, Dec 8, 2009 at 7:04 PM, Ryanne Dolan <ryann...@gmail.com> wrote:
John,

Do I understand you right?

You mean that launching N goroutines on X procs is a bad idea b/c each goroutine may require a ton of memory.   You say it would be better to launch less goroutines if you know you can't run more than X at a time anyway.

That's right.
 

I'd agree to a point.  In the current release, Go allocates a big stack with each goroutine even if the goroutine doesn't need one (I think), so there is a memory penalty for using many goroutines even when they don't allocate 10^6 integers.  In many cases, the process of allocating the goroutine's stack may be expensive compared to what the goroutine actually _does_.  In that regard, it is bad to use goroutines when you don't need them, especially with the current implementation.  Agreed.

But your example is a little preposterous...

No doubt.
 

for i := 0; i < 1e6; i++ {
    data := make ([]huge, many);
    initialize (data);
    process (data, i);
    c += reduce (data)
}

No matter how you parallelize the above for-loop, it isn't a very good for-loop.  Your proposed solution is to do the allocation only 20 times:

for j := 0; j < 20; j++ {
    data := make ([]huge, many);
    initialize (data);
    for i := 0; i < 1e6; i++ {
        process (data, i);
        c += reduce (data);
    }
}

I'm sure someone can think of a context in which that data could not be shared, because it is, in fact, different for each iteration. I didn't bother because it wasn't important.
 

... but that is still 20x worse than it should be.  You are again describing a data-parallel for-loop (the subject of this thread).  Any data-parallel computation doesn't require synchronization, which means:

data := make ([]huge, many);
initialize (data);
for i := 0; i < 1e6; i++ {    
    process (data, i);
}
reduce (data);

.. is the proper implementation.  I think the mistake that you and others keep making is to assume that Go doesn't allow shared memory.  It is easy to assume that goroutines can only communicate thru channels, because the golang.org articles would have you believe that.  But you can also communicate through closures, which are unsafe, fast, and on the stack.  The unsafe part doesn't matter if you are doing data-parallel computation.

Ok, how about this example. You have a function that downloads an image and tries to decide whether or not it is a picture of a dog. I would like to do this in parallel, if possible. There are one million images that I want to take, and I have the processor power to efficiently think about 100 of them at a time.

Downloading all one million images and allowing the scheduler to determine which are worked on is ok only if I don't mind keeping them all in memory at once. I'd prefer to download only as many as is useful for my parallel computation.

Of course I can write code to do what I want here. But it would be nice to have a language construct or library function to make this easy.
 
By the way, a smart compiler can figure out when a for-loop is actually type-2.  Go's compiler isn't that smart yet (I don't think), but I'm sure it will be; if not, I volunteer for the patch!  For example:

This is actually what a class project I should be working on now covers.
 

for _,xi := range array {
    ....
}

Since the iteration index is never used, whatever is in '...' can very likely be thrown into a stackless goroutine automatically.

Often likely, but not necessarily.
 
count := 0;

for _,xi := range array {
    ....
    if success { count = count+1 }
}

It is not safe to parallelize this, since incrementing the counter could go wrong. That is why I think that the programmer should be responsible for telling the program when it should do things in parallel, rather than relying on a compiler that has to be extremely conservative in deciding when to parallelize. As a human, I would know not to parallelize my example just above. But if I had other code in there that, as a human, I am convinced can go in parallel, I'd like to be able to easily tell the compiler that it should be parallelized. A "go for" would be sweet, but it has some issues. A templated parallel.For(input chan, function) is simple enough to use and does all that I am interested in.

- John

SnakE

unread,
Dec 8, 2009, 7:20:12 PM12/8/09
to Ryanne Dolan, John Asmuth, golang-nuts
2009/12/9 Ryanne Dolan <ryann...@gmail.com>

No matter how you parallelize the above for-loop, it isn't a very good for-loop.

I don't think this is the point.  The goroutines in the loop may not be data-parallel, just memory-intensive.  Think serving translated web pages.  The "go for" construct would give the runtime a very useful hint: "Here are things I need to do many times.  I'm OK with sequential, but if you *can* run in parallel, I'm all for it."  Or, in other words, run as many of the "go for" bodies in goroutines as feasible and postpone the others until there is an opportunity to parallelize.  Currently there is no way to express this in Go, and only the runtime can implement it efficiently.

Ben Tilly

unread,
Dec 8, 2009, 7:32:55 PM12/8/09
to SnakE, Ryanne Dolan, John Asmuth, golang-nuts
I disbelieve. I already posted a solution that allows this exact
problem to be solved reasonably efficiently in Go. All you need to do
is spawn a fixed number of worker goroutines, all reading jobs from a
channel. The overhead of this solution is spawning the fixed number
of worker goroutines and communicating over channels. I believe that
the overheads are low enough that it would be hard for a built-in
solution to be significantly more efficient.

Cheers,
Ben

Ryanne Dolan

unread,
Dec 8, 2009, 7:43:21 PM12/8/09
to Ben Tilly, golang-nuts, John Asmuth, SnakE

Right, this is what John's code does also. I still hold that it usually isn't what you mean when you write a for-loop.

- from my phone -


John Asmuth

unread,
Dec 8, 2009, 7:46:46 PM12/8/09
to Ryanne Dolan, Ben Tilly, golang-nuts, SnakE
On Tue, Dec 8, 2009 at 7:43 PM, Ryanne Dolan <ryann...@gmail.com> wrote:

Right, this is what John's code does also. I still hold that it usually isn't what you mean when you write a for-loop.

I can buy that. But using the X worker goroutines for Y iterations pattern can do anything the Y goroutines for Y iterations pattern can (indeed the former has an execution that is one of the possibilities for the latter), and I am not convinced that it is less efficient. If they're both equally efficient for the general case, and one is much more efficient for the special case, you use the one that can handle the special case. 

Ian Lance Taylor

unread,
Dec 8, 2009, 8:38:39 PM12/8/09
to Ryanne Dolan, John Asmuth, golang-nuts
Ryanne Dolan <ryann...@gmail.com> writes:

> In the current release, Go allocates a big stack with
> each goroutine even if the goroutine doesn't need one (I think), so there is
> a memory penalty for using many goroutines even when they don't
> allocate 10^6 integers.

I just want to correct this comment. Go does not allocate a big stack
for each goroutine. The stack starts small, and is grown on demand.
The compilers support a discontiguous stack.

Ian

Ryanne Dolan

unread,
Dec 8, 2009, 8:47:07 PM12/8/09
to Ian Lance Taylor, golang-nuts, John Asmuth

Awesome thanks.  I wonder where I got that impression.  Sorry for the blunder.

I guess small and big are relative tho.  I'd hope for stackless goroutines in some cases.  Or maybe it doesn't matter.

Ryanne

- from my phone -


Rowan Davies

unread,
Dec 9, 2009, 2:32:16 AM12/9/09
to golang-nuts
The par.For I posted much earlier in the thread is a pretty similar
idea.

But, it explicitly is only for integers, and I think this is the right
approach if you really want maximum performance with multiple cores.
The idea is that anything else you can put into an array, and this is
very efficient.

You get good locality for access into this array, because it doesn't
create one goroutine for each iteration. Instead it allocates whole
chunks at a time. To deal with different chunks being faster or
slower, it creates more chunks than are strictly necessary, but not
too many more.

This approach is based on the way Intel's Threading Building Blocks works -
and that gives performance that's hard to beat. I can see that using
channels to distribute the work has advantages, but I think without
something like chunks you won't get ideal performance for locality
reasons.

You could combine my par.For with distribution of chunks via channels,
but it'd be more complex and it's unclear to me what you'd gain - the
goroutines don't compete for the CPU because they aren't preemptive,
and only a small number are created so the overhead is small.

Note that you already have the appropriate number of OS worker threads
- do we really need to match this with the number of worker
goroutines?

- Rowan

Rowan Davies

unread,
Dec 9, 2009, 2:46:48 AM12/9/09
to golang-nuts
I was expecting that once all the OS threads were full with CPU bound
goroutines, that there'd be no chance for the parent to run anymore,
hence no more goroutines created.

But, maybe depending on this isn't wise - it does lead to a nice model
where the go runtime isolates you from some of these issues. The
question is whether that isolation is useful, or just a leaky
abstraction that leads people to rely on it too much.

- Rowan

Rowan Davies

unread,
Dec 9, 2009, 3:05:54 AM12/9/09
to golang-nuts


On Dec 9, 12:48 am, roger peppe <rogpe...@gmail.com> wrote:
> >        goKid := func(body func()) {
> >                kidsInAction++;
>
> nice and small code.
>
> unfortunately this won't work correctly if goKid
> is called in a new goroutine (for example if kids spawn
> their own kids).

In that case, it's hard to guarantee that the spawn happens prior to
the parent calling waitForKids, so I think that's a much harder
problem that complicates things too much to include. Normally in that
situation you wouldn't use "go" but "goKid", and then have the kid
wait for its own kids (the grandkids).


> and, perhaps more seriously, the kids won't
> go away until they're waited for - it's easy to
> imagine a situation where 100000 intermediate
> goroutines are spawned but there are only 3
> active routines left to wait for at the end.

Yes - I think the channel should be buffered to partially fix this,
probably with a hint for the size. It would also allow the parent to
wait more efficiently.

> i'm sure there are other interesting methods of
> doing this.

Yes, certainly. Mine only captures the pattern of a parent who
creates kids and counts them as they are spawned and then waits and
counts them again as they finish. I think it's a very common and
efficient pattern. But, perhaps generalizing to something more is
worthwhile - having a separate goroutine to do the counting so it can
begin before the parent is done spawning. Anyway, interesting to
consider the options.

- Rowan

Ben Tilly

unread,
Dec 9, 2009, 3:35:38 AM
to Rowan Davies, golang-nuts
On Tue, Dec 8, 2009 at 11:46 PM, Rowan Davies <rd1...@gmail.com> wrote:
> I was expecting that once all the OS threads were full with CPU bound
> goroutines, that there'd be no chance for the parent to run anymore,
> hence no more goroutines created.

My assumption would be the opposite, that we should act as if some
version of Go could have preemptive scheduling or a different order of
action, so it is unwise to assume that the parent is ever blocked by
child goroutines.

> But, maybe depending on this isn't wise - it does lead to a nice model
> where the go runtime isolates you from some of these issues.  The
> question is whether that isolation is useful, or just a leaky
> abstraction that leads people to rely on it too much.

I wouldn't rely on it at all.

Cheers,
Ben

roger peppe

unread,
Dec 9, 2009, 9:07:28 AM
to Rowan Davies, golang-nuts
2009/12/9 Rowan Davies <rd1...@gmail.com>:
>> unfortunately this won't work correctly if goKid
>> is called in a new goroutine (for example if kids spawn
>> their own kids).
>
> In that case, it's hard to guarantee that the spawn happens prior to
> the parent calling waitForKids, so I think that's a much harder
> problem that complicates things too much to include.  Normally in that
> situation you wouldn't use "go" but "goKid", and then have the kid
> wait for its own kids (the grandkids).

actually, it doesn't matter if the spawn happens prior to
the parent calling waitForKids, because waitForKids will also be waiting
for the kid that's doing the spawn.

we're fine as long as start() is called by a process
that was also created with start().

i've attached a demo that makes use of this property.
it spawns a tree of processes, each of which fill in their
own subtree of the resulting data.

although the results in the tree are filled in entirely concurrently, when
our wait completes, we know that they're fully complete.

i've used random numbers to model arbitrary computation.
in order to avoid cluttering the code, i've put the thread
safe random number generator in a separate file, ts.go,
which i'm hoping will be accepted as part of the rand package.
for the time being, you'll need to copy ts.go to src/pkg/rand
and add it to the Makefile there.

> Yes - I think the channel should be buffered to partially fix this,
> probably with a hint for the size.  It would also allow the parent to
> wait more efficiently.

i'm still not sure that's enough - you'll still use space
proportional to the total number of goroutines,
and once the buffer is filled, goroutines will still pile up.

> Anyway, interesting to consider the options.

definitely.
proctree.go
ts.go

roger peppe

unread,
Dec 9, 2009, 12:31:05 PM
to Ryanne Dolan, golang-nuts
2009/12/8 Ryanne Dolan <ryann...@gmail.com>:
> I'd argue that semaphores are the original signal...the simplest message you
> can send. I'd bet you've used them an awful lot without realizing it. Anytime
> you send a message but don't actually inspect the message ( done <- true )
> you are using the same kinda logic, right?

i guess i meant that i'd never felt the need to create an
abstraction called Semaphore, with the usual semantics,
P/V, wait/signal operators, or whatever.

if the two abstractions do match, as i think you're
saying, then why not just use channels directly,
and call it synchronisation rather than signalling?

R D

unread,
Dec 9, 2009, 10:57:12 PM
to roger peppe, golang-nuts
On Wed, Dec 9, 2009 at 10:07 PM, roger peppe <rogp...@gmail.com> wrote:
> 2009/12/9 Rowan Davies <rd1...@gmail.com>:
>>> unfortunately this won't work correctly if goKid
>>> is called in a new goroutine (for example if kids spawn
>>> their own kids).
>>
>> In that case, it's hard to guarantee that the spawn happens prior to
>> the parent calling waitForKids, so I think that's a much harder
>> problem that complicates things too much to include.  Normally in that
>> situation you wouldn't use "go" but "goKid", and then have the kid
>> wait for its own kids (the grandkids).
>
> actually, it doesn't matter if the spawn happens prior to
> the parent calling waitForKids, because waitForKids will also be waiting
> for the kid that's doing the spawn.
>
> we're fine as long as start() is called by a process
> that was also created with start().

So you agree there actually isn't an issue with this in the code I gave?

This pattern is very common - it's particularly important in
workstealing paradigms like Cilk and TBB.


>> Yes - I think the channel should be buffered to partially fix this,
>> probably with a hint for the size.  It would also allow the parent to
>> wait more efficiently.
>
> i'm still not sure that's enough - you'll still use space
> proportional to the total number of goroutines,
> and once the buffer is filled, goroutines will still pile up.

Well - I think to write any code that is guaranteed to use space
that's less than proportional to the number of goroutines spawned
would require careful synchronization, even not considering the aspect
of the parent waiting for the children. The parent couldn't just keep
spawning the kids - because then they might all be running at the same
time. Instead, it would need to wait for some to finish before
spawning more. (Of course this only works if an early kid doesn't
need to communicate with a much later kid.)

If you want to do this, something like a semaphore does seem
appropriate, as suggested before. It can be initialized with the
number of simultaneous kids you want to allow, and the parent will
wait before spawning more when the semaphore reaches zero. At the end,
the parent reduces the semaphore by the number it was initialized
with. Actually, this turns out to be really easy to implement.

func ParentWaitSem(maxKids int) (func(func()), func()) {
	sem := make(chan bool, maxKids);
	goKid := func(body func()) {
		sem <- true;
		go func() {
			body();
			<-sem;
		}();
	};
	waitForKids := func() {
		for i := 0; i < maxKids; i++ {
			sem <- true
		}
	};
	return goKid, waitForKids
}

This is almost easy enough that it's not worth having ParentWaitSem
defined at all. But I'd argue it's still good to use something like
this, because goKid avoids the possibility of missing a semaphore
operation - a mistake that tends to be common and tedious to debug in
things like this.

- Rowan

roger peppe

unread,
Dec 10, 2009, 11:59:42 AM
to R D, golang-nuts
2009/12/10 R D <rd1...@gmail.com>:
>> we're fine as long as start() is called by a process
>> that was also created with start().
>
> So you agree there actually isn't an issue with this in the code I gave?

no, sorry - i was talking about my implementation of start().
your code has the unguarded:
kidsInAction++;
which is a race if start() is called outside of the main process.

> Well - I think to write any code that is guaranteed to use space
> that's less than proportional to the number of goroutines spawned
> would require careful synchronization,
[...]
> it would need to wait for some to finish before spawning more.

that's not unusual.
it's also quite common for goroutines to be started according
to some sporadic event - if the goroutines are short lived
and the events aren't too fast, then they won't accumulate
indefinitely, but there might be many millions over time.

>        sem := make(chan bool, maxKids);

this assumes that you know the maximum number
of kids in advance. it's sometimes not possible to do so.
did you look at my concurrent tree example?

R D

unread,
Dec 10, 2009, 10:57:21 PM
to roger peppe, golang-nuts


On Fri, Dec 11, 2009 at 12:59 AM, roger peppe <rogp...@gmail.com> wrote:
2009/12/10 R D <rd1...@gmail.com>:


no, sorry - i was talking about my implementation of start().
your code has the unguarded:
    kidsInAction++;
which is a race if start() is called outside of the main process.

Right - but the intention is that only the parent would ever run that.  goKid is for making kids - only the parent should make its kids.  To wait for grandkids, you wait for your kid, which waits for its kids.

To do otherwise makes it really hard to ensure that no more kids need be created when the parent is exiting waitForKids.
 

> Well - I think to write any code that is guaranteed to use space
> that's less than proportional to the number of goroutines spawned
> would require careful synchronization,
[...]
> it would need to wait for some to finish before spawning more.

that's not unusual.
it's also quite common for goroutines to be started according
to some sporadic event - if the goroutines are short lived
and the events aren't too fast, then they won't accumulate
indefinitely, but there might be many millions over time.

You can't rely on the scheduling order though, without some other mechanism to enforce it.  On a single processor machine, perhaps only the parent will run, and all the spawned goroutines will sit there waiting until it reaches the point where it waits for them.  So, you don't have good bounds on memory in this case - it might run well, but sometimes it will fill up all your memory.  That's not a good way to program.
 

>        sem := make(chan bool, maxKids);

this assumes that you know the maximum number
of kids in advance. it's sometimes not possible to do so.
did you look at my concurrent tree example?

The intention was that you can use maxKids to limit the number of kids running at the same time.  Commonly you might choose maxKids = NCPU, so that there are just enough CPU-bound children running simultaneously.

I agree there's more you could do - but keeping things simple is good too.

- Rowan


