Unless you need some incredibly huge number of semi-concurrent
activities, one thread per concurrent activity works fine. The
use case for a goroutine-like mechanism is when you have a very
large number of things waiting, like persistent HTTP connections.
Google has that problem, hence Go, but most applications don't.
Bounded buffers aren't hard to implement. Here's a
classic implementation, from 1972:
http://www.fourmilab.ch/documents/univac/fang/hsource/scheduler.asm.html
It's cleaner than most of its successors.
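For anyone who hasn't written one, here's a minimal bounded-buffer sketch in Go, in
the same mutex-plus-condition-variable (P/V) style. It illustrates the structure, not
the 1972 code itself; a buffered Go channel is this same structure built in.

    package main

    import (
        "fmt"
        "sync"
    )

    // Bounded is a fixed-capacity FIFO queue, safe for concurrent use.
    type Bounded struct {
        mu       sync.Mutex
        notFull  *sync.Cond // signaled when a slot frees up
        notEmpty *sync.Cond // signaled when an item arrives
        items    []int
        capacity int
    }

    func NewBounded(capacity int) *Bounded {
        b := &Bounded{capacity: capacity}
        b.notFull = sync.NewCond(&b.mu)
        b.notEmpty = sync.NewCond(&b.mu)
        return b
    }

    // Put blocks until a slot is free, then appends v.
    func (b *Bounded) Put(v int) {
        b.mu.Lock()
        defer b.mu.Unlock()
        for len(b.items) == b.capacity {
            b.notFull.Wait()
        }
        b.items = append(b.items, v)
        b.notEmpty.Signal()
    }

    // Get blocks until an item is available, then removes and returns it.
    func (b *Bounded) Get() int {
        b.mu.Lock()
        defer b.mu.Unlock()
        for len(b.items) == 0 {
            b.notEmpty.Wait()
        }
        v := b.items[0]
        b.items = b.items[1:]
        b.notFull.Signal()
        return v
    }

    func main() {
        b := NewBounded(2)
        go func() {
            for i := 0; i < 5; i++ {
                b.Put(i)
            }
        }()
        for i := 0; i < 5; i++ {
            fmt.Println(b.Get())
        }
    }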
The innovation in Go is "select", which waits on any of a set of bounded
buffers. That's hard to do efficiently. Take a look at the library
code for the general N-channel case, which is painful and slow.
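For concreteness, here's what the two shapes look like from the user's side: the
ordinary select statement with its cases fixed in the source, and the fully dynamic
N-channel case via reflect.Select, which builds a slice of cases at run time. This is
a sketch with made-up channel names; as I understand it, both forms end up in the
runtime's general select path, which I take to be the library code meant above.

    package main

    import (
        "fmt"
        "reflect"
    )

    func main() {
        a := make(chan int, 1)
        b := make(chan string, 1)
        a <- 1
        b <- "two"

        // Fixed-arity form: the cases are written out in the source.
        select {
        case v := <-a:
            fmt.Println("static select got", v)
        case s := <-b:
            fmt.Println("static select got", s)
        }

        // General N-channel form: the cases are assembled at run time
        // and handed to reflect.Select, which must scan all of them.
        cases := []reflect.SelectCase{
            {Dir: reflect.SelectRecv, Chan: reflect.ValueOf(a)},
            {Dir: reflect.SelectRecv, Chan: reflect.ValueOf(b)},
        }
        chosen, val, ok := reflect.Select(cases)
        fmt.Println("dynamic select chose case", chosen, "value", val.Interface(), "ok", ok)
    }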
There's a common special case which can be implemented efficiently.
If all the channels read in an N-channel select are read only in that
select, and all the channels written in that select are written only in
that select, there's a big potential simplification. Instead of having
one P/V lock for each channel at the relevant end, there can be one
lock for all those channel ends. When the selecting thread unblocks,
at least one of the channels will have work (read data or a write slot)
available.
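To make that concrete, here's a sketch of the kind of code that would qualify: a
goroutine whose one select is the only place its channel ends are ever used as cases.
The mux function and channel names are invented for illustration; nothing here implies
Go actually performs the single-lock simplification.

    package main

    import "fmt"

    // mux reads from exactly two inputs and forwards to one output. The
    // receive ends of in1, in2, and done appear in no other select, so one
    // lock guarding all of those ends would suffice: whenever this goroutine
    // wakes, at least one of its cases has work ready.
    func mux(in1, in2 <-chan int, out chan<- int, done <-chan struct{}) {
        for {
            select {
            case v := <-in1:
                out <- v
            case v := <-in2:
                out <- v
            case <-done:
                close(out)
                return
            }
        }
    }

    func main() {
        in1 := make(chan int)
        in2 := make(chan int)
        out := make(chan int)
        done := make(chan struct{})

        go mux(in1, in2, out, done)
        go func() { in1 <- 1 }()
        go func() { in2 <- 2 }()

        // Collect the two forwarded values, then shut the mux down.
        fmt.Println(<-out, <-out)
        close(done)
    }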
That single-lock scheme is an optimization Go could do by static analysis. Does it?
John Nagle