Array much faster than Channel

Michael

unread,

Sep 2, 2011, 6:23:30 PM9/2/11

to golang-nuts

Channels become a bottleneck if you pass a lot of individual items
through them.

You can work around this by packing chunks of data into arrays and
then unpacking on the other end. This is literally 10x faster. The
weird thing is, you'd think it would be slower with all those extra
steps. The following code demonstrates this effect:

package main
import "fmt"

var c = make(chan uint64,200)
var fastDone = make(chan int)
var slowDone = make(chan int)
var r uint64
// ********** SLOW ***************
func sum1() {
var x uint64
var ok bool
for {
x,ok = <-c
if ok {
r += x
} else {
slowDone <-1
}
}
}
func slow() {
r=0
go sum1()
go func() {
var i uint64
for i=0; i<100000000; i++ {
c<-i
if i%100000 == 0 {
fmt.Print(".")
}
}
close(c)
}()
<-slowDone
fmt.Println("Sum:",r)
}

// ********** FAST ***************
var cSlice = make(chan []uint64,200)
func sumSlice() {
var xSlice []uint64
var ok bool
for {
xSlice,ok = <-cSlice
if ok {
for i:=0;i<len(xSlice);i++ {
r += xSlice[i]
}
} else {
fastDone <-1
}
}
}
func fast() {
r=0
go sumSlice()
var slice []uint64
go func() {
var i uint64
for i=0; i<100000000; i++ {
if i%100000 == 0 {
fmt.Print(".")
if len(slice) != 0 {
cSlice<-slice
}
slice = make([]uint64, 100000)
}
slice[i%100000]=i
}
close(cSlice)
}()
<-fastDone
fmt.Println("Sum:",r)
}

func main() {
/*
runtime.GOMAXPROCS(1)
f,_ := os.Create("case3.prof")
pprof.StartCPUProfile(f)
defer pprof.StopCPUProfile()
*/
fmt.Println("The fast way:")
fast()

fmt.Println("The slow way:")
slow()
}

Evan Shaw

unread,

Sep 2, 2011, 11:05:03 PM9/2/11

to Michael, golang-nuts

One of the most expensive things you're doing in this program is
sending values on channels. Using arrays decreases the number of
channel sends, which improves throughput. It's kind of like reading a
file in blocks as opposed to reading a byte at a time. Reading a byte
at a time is going to be slower.

This is not to say that channel sends are very inefficient. It's just
that everything else you're doing is pretty fast, so the channel sends
are very noticeable and probably not a good choice in this case if
you're worried about speed.

A few other comments on the code:

Use range where you can. Your sum1 and sumSlice functions can be
shorter and faster by ranging over c and cSlice, respectively.
sumSlice can also range over the slice itself.

Your closure goroutines in slow and fast probably don't need to be
goroutines or closures, either. You can just inline the code, drop the
go, and get the same effect.

- Evan

Dmitry Vyukov

unread,

Sep 3, 2011, 4:36:10 AM9/3/11

to Michael, golang-nuts

On Sat, Sep 3, 2011 at 2:23 AM, Michael <415...@gmail.com> wrote:

Channels become a bottleneck if you pass a lot of individual items
through them.

You can work around this by packing chunks of data into arrays and
then unpacking on the other end. This is literally 10x faster. The
weird thing is, you'd think it would be slower with all those extra
steps. The following code demonstrates this effect:

You are benchmarking chan send/recv vs slice slot write/read. I do expect the latter to be at least an order of magnitude faster. Remove 2 divisions per iteration in the fast case, and performance will skyrocket.

Michael

unread,

Sep 3, 2011, 5:49:15 PM9/3/11

to golang-nuts

Thank you for the range tip. That improves readability and gives about
a 30% performance improvement for the fast case. No performance
improvement for the slow case, though, since that is still channel
limited.

What I'd really like to talk about here is how to make the slow case
competitive by improving the underlying code. I'm not an expert on how
to do that, but allow me to stir the pot.

If you look at the file:
src/pkg/runtime/chan.c

There are two interesting functions:
runtime·chansend
runtime·chanrecv

Each of these does two things:
1. Call enqueue/dequeue -- This code is tight. No room for
improvement.
2. Loads of synchronization stuff to decide which goroutine gets to go
next. -- I'm guessing this is where the time goes.

The above seems right for an unbuffered channel. A separate
implementation that might work better for buffered channels is:
1. Call enqueue/dequeue as before
2. If this makes the buffer full/empty then do the stuff to decide
which goroutine gets to go next. Otherwise, don't.

The effect is to allow the goroutine that just called chansend/
chanrecv keep going until it has filled or emptied the buffer, with
minimal overhead which means you just need to make the channel buffer
big enough to keep synchronization from becoming a bottleneck.

The problem with this is that channels will get stuck if the buffer
isn't getting fully filled/emptied. So you need:
3. A timer interrupt which goes through all of the channels and
switches their direction every 1ms or so.

Dmitry Vyukov

unread,

Sep 3, 2011, 6:17:34 PM9/3/11

to Michael, golang-nuts

It will make this particular synthetic microbenchmark faster. But at the same time it will make most real-life production code both slower and considerably increase latencies. I think it is not the direction we should be moving in.

Michael

unread,

Sep 3, 2011, 9:26:54 PM9/3/11

to golang-nuts

Hi Dmitry. I agree with what you're saying about real-life production
code. But, with the caveat that real-life production code has been
written with an awareness of this issue, and so already applies the
hand-optimization of packing large data streams into chunks before
passing them over channels, or it eschews channels altogether.

I don't pretend that my particular suggestion for re-implementing
channels should direct future development of the language. It was just
the first thing that occurred to me.

What I would like to get from this discussion is some technical
opinions on the what-if question: What if I wanted to make channels
and goroutines the most efficient (and therefore default) way to
organize the data path? How might I do that?

Whether it's useful to do so is another topic. But, to give my
synthetic microbenchmark some plausibility, I'd like to suggest that
channels are the signal feature of Go, in the same way that regular
expressions are the signal feature of Perl. If I was writing a program
that needed to scan gigabytes of text, line-by-line and capture
certain tokens, I would look no further than writing three lines of
Perl, because the syntax and execution are tuned so perfectly for that
domain, that I would never find a better solution in any number of
lines of code in another language. Similarly if I was writing a
molecular-dynamics simulator which was multi-threaded in order to
spread the load across many processors I would find Go channels to
embody the ideal syntax, but it would take a lot of tuning to wring
the performance out of it and meanwhile, I'd be wondering if Boost or
OpenMP might make more sense.

Rich Wareham

unread,

Sep 4, 2011, 7:17:27 AM9/4/11

to Dmitry Vyukov, Michael, golang-nuts

On Sun, Sep 04, 2011 at 02:17:34AM +0400, Dmitry Vyukov wrote:
>
> It will make this particular synthetic microbenchmark faster. But at the
> same time it will make most real-life production code both slower and
> considerably increase latencies. I think it is not the direction we should
> be moving in.

It will also complicate the mental model for channels beyond being a
convenient way to do the co-routine bookkeeping. One of the nicest
things about Go is it 'fits in your head'.

--
Dr Rich Wareham

Dmitry Vyukov

unread,

Sep 5, 2011, 2:08:47 PM9/5/11

to Michael, golang-nuts

On Sun, Sep 4, 2011 at 5:26 AM, Michael <415...@gmail.com> wrote:

Hi Dmitry. I agree with what you're saying about real-life production
code. But, with the caveat that real-life production code has been
written with an awareness of this issue, and so already applies the
hand-optimization of packing large data streams into chunks before
passing them over channels, or it eschews channels altogether.

I did not mean that real-life code packs messages into batches, I meant that real-life code features considerably larger messages, with considerably more work associated with both production and consumption of messages, real-life code rarely produces messages with the rate of:

for i=0; i<N; i++ {
c <- i

}

So, if there are blocked consumers, a producer *must* unblock them.

I don't pretend that my particular suggestion for re-implementing
channels should direct future development of the language. It was just
the first thing that occurred to me.

What I would like to get from this discussion is some technical
opinions on the what-if question: What if I wanted to make channels
and goroutines the most efficient (and therefore default) way to
organize the data path? How might I do that?

Ultimate efficiency and concurrency are conflicting with each other. A concurrent data structure basically always less efficient than according non-concurrent data structure. Moreover, they are solving different problems.

Whether it's useful to do so is another topic. But, to give my
synthetic microbenchmark some plausibility, I'd like to suggest that
channels are the signal feature of Go, in the same way that regular
expressions are the signal feature of Perl. If I was writing a program
that needed to scan gigabytes of text, line-by-line and capture
certain tokens, I would look no further than writing three lines of
Perl, because the syntax and execution are tuned so perfectly for that
domain, that I would never find a better solution in any number of
lines of code in another language. Similarly if I was writing a
molecular-dynamics simulator which was multi-threaded in order to
spread the load across many processors I would find Go channels to
embody the ideal syntax, but it would take a lot of tuning to wring
the performance out of it and meanwhile, I'd be wondering if Boost or
OpenMP might make more sense.

I am sure we can do no worser than boost and OpenMP.

What you are saying makes sense, however it is related to quite specific context and so I think should not be incorporated into fundamental channels. However, there are several possibilities.

You can create chan proxy with required functionality as a separate entity. That is, create an object that collects up to N messages and then enqueues then into underlying chan at once (or after a timeout). No need to incorporated that into channels itself.

Then, you can create a specialized queue (potentially single-producer/single-consumer, or at least multi-producer/single-consumer). A specialized solution is always more efficient than a general one.

Additionally, I think we can incorporate your idea in the reverse direction. That is, if a producer dequeues a message from a full chan, do not unblock producers instantly. Instead, wait until chan len drops below a low water mark (2/3 or 1/2 of chan capacity) or until a timeout. This will lead to so called "cohort scheduling", which is a good thing (batching, hot caches, etc).

Reply all

Reply to author

Forward