Disruptor


Christoph Hack

Sep 21, 2011, 9:57:56 PM
to golan...@googlegroups.com
Hi folks,

fluffle recently sent me a link about the LMAX Architecture, a quite exceptional
concurrent architecture whose main goal is to avoid synchronization primitives and
concurrency as much as possible. Their excellent technical paper also contains a good
introduction to CPU caches in general.

After reading their paper, I immediately wanted to try out the things I had just learned.
My goal was to transfer a lot of integers between two goroutines, which is basically a
simple 1-producer + 1-consumer topology. Please note that the Disruptor design can be
easily scaled up to many-producers + many-consumers, although with slightly different
semantics than the (universally usable) Go channels.

Anyway, some unscientific micro-benchmarks revealed the following result:

Channels: 173 ns/op
Disruptor (Go): 38.4 ns/op
Disruptor (C++): 6.03 ns/op

I was quite surprised by the current speed of channels. It's not easy to be much
faster, even with very specialized solutions. In the beginning, I was also thinking about
offering a highly customizable "disruptor-like" package for Go (different wait strategies,
configurable topologies, etc.), but I think the cost of calling non-inlined functions would be
much higher than that of the more general Go channels. Also, there are probably not
many applications that require such specialized and optimized solutions.

The C++ implementation is currently much faster because it was compiled with "-O3",
its threads were pinned to specific cores, and it uses the new std::atomic library, which
allows fine-grained memory barriers and is highly optimized for specific platforms. Also, a
GCC-specific extension was used to get the cache-line alignment right (I think it's still
wrong in the Go version).

The sources (both versions, .go and .cc) can be found here:

I hope I haven't made any serious mistakes. Concurrent programming and all those
lock-free algorithms are still new to me.

PS: Many thanks to Dmitry, who already helped me get the Go benchmarks right.
His 1024cores website, as well as all the useful comments he has written on various
blog posts about concurrency and std::atomic, were extremely helpful. Thanks for that!

-christoph

John Arbash Meinel

Sep 22, 2011, 5:56:46 AM
to golan...@googlegroups.com, Christoph Hack

On 9/22/2011 3:57 AM, Christoph Hack wrote:
> Hi folks,
>
> fluffle recently sent me a link about the LMAX Architecture
> <http://martinfowler.com/articles/lmax.html>, a quite exceptional
> concurrent architecture whose main goal is to avoid
> synchronization primitives and concurrency as much as possible.
> Their excellent technical paper
> <http://disruptor.googlecode.com/files/Disruptor-1.0.pdf> also
> contains a good introduction to CPU caches in general.
>
> After reading their paper, I immediately wanted to try out the
> things I had just learned. My goal was to transfer a lot of
> integers between two goroutines, which is basically a simple
> 1-producer + 1-consumer topology. Please note that the Disruptor
> <http://code.google.com/p/disruptor/> design can be easily
> scaled up to many-producers + many-consumers, although with
> slightly different semantics than the (universally usable) Go
> channels.
>
> Anyway, some unscientific micro-benchmarks revealed the following
> result:
>
> Channels: 173 ns/op
> Disruptor (Go): 38.4 ns/op
> Disruptor (C++): 6.03 ns/op

One thing to think about. If you are passing a lot of data over
channels, you can just as easily pass arrays of data over channels,
and then work on the content in a loop. So instead of:

func producer(out chan int) {
	for i := 0; i < 10000; i++ {
		// Do real stuff here
		out <- i
	}
}

you can do:

const BUF_SIZE = 100

func producer(out chan []int) {
	buf := make([]int, BUF_SIZE)
	offset := 0
	for i := 0; i < 10000; i++ {
		if offset == BUF_SIZE {
			out <- buf
			buf = make([]int, BUF_SIZE) // plain assignment: ":=" here would shadow buf
			offset = 0
		}
		// Do real stuff here
		buf[offset] = i
		offset++
	}
	if offset > 0 {
		out <- buf[:offset] // flush the final partial batch
	}
}


I know your code does allow buffering in the channel, but you'd be
better off buffering into an array if you know that you are going to
be producing batches.

Also, your low-level operation is a busy loop, which looks great when
you have two of them, two threads, and two CPUs. But what happens if you
wanted to have 100 'channels'? Note that in Go, goroutines are
cooperative multi-taskers (not preemptive). So if you busy-wait in a
goroutine that is multiplexed with another goroutine, you could deadlock.

John
=:->

tiny_dust

Sep 22, 2011, 6:27:34 AM
to golang-nuts
It can be easily fixed: just add runtime.Gosched(), and the performance
will be better.

Dmitry Vyukov

Sep 22, 2011, 5:40:00 PM
to golan...@googlegroups.com
Specialized solutions are always faster than general ones.
I think the main sources of slowdown for Go channels are: (1) they support multi-producer/multi-consumer scenarios, (2) they support selects, (3) they support closing, (4) they are generic (they support different element types), (5) they do proper blocking/unblocking of goroutines, and (6) they are not inlined.

Some further optimizations can be applied to channels, but not too much.

I guess eventually people will create specialized reusable queueing solutions for Go in any case (problematic w/o templates). However, I think there is no place for them in the standard library.

Your code does indeed have a problem with active spinning. I don't think that Gosched() is a good solution: first, it burns CPU cycles anyway (the program will always consume 100% CPU), and second, Gosched() can significantly confuse the scheduler (it's impossible to do efficient scheduling when goroutines constantly call Gosched).