Can't explain Golang benchmarks

zohaib...@gmail.com

Jul 16, 2017, 10:20:21 AM
to golang-nuts
I have been spending my day implementing an efficient PubSub system. I had implemented one before using channels, and I wanted to benchmark that against sync.Cond. Here is the quick and dirty test that I put together: https://gist.github.com/maxpert/f3c405c516ba2d4c8aa8b0695e0e054e. My confusion starts when I change GOMAXPROCS to test how it would perform on my age-old Raspberry Pi. Here are the results:
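
For context, here is a minimal sketch of the two broadcast styles being compared (the gist above is the authoritative version; all names here are illustrative, and the gist's Cond-based variant may differ in detail):

package pubsub

import "sync"

// Style 1: channel fan-out - one channel per subscriber; publish sends to each.
type chanHub struct {
	subs []chan string
}

func (h *chanHub) publish(m string) {
	for _, ch := range h.subs {
		ch <- m
	}
}

// Style 2: sync.Cond broadcast - subscribers sleep on a condition variable
// and re-read the shared slot when woken.
type condHub struct {
	mu  sync.Mutex
	c   *sync.Cond
	seq int
	msg string
}

func newCondHub() *condHub {
	h := &condHub{}
	h.c = sync.NewCond(&h.mu)
	return h
}

func (h *condHub) publish(m string) {
	h.mu.Lock()
	h.msg, h.seq = m, h.seq+1
	h.mu.Unlock()
	h.c.Broadcast()
}

func (h *condHub) wait(lastSeen int) (string, int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for h.seq == lastSeen {
		h.c.Wait()
	}
	return h.msg, h.seq
}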

mxp@carbon:~/repos/raspchat/src/sibte.so/rascore$ GOMAXPROCS=8 go test -run none -bench Multiple -cpuprofile=cpu.out -memprofile=mem.out -benchmem
BenchmarkPubSubPrimitiveChannelsMultiple-8     10000    165419 ns/op      92 B/op       2 allocs/op
BenchmarkPubSubWaitGroupMultiple-8             10000    204685 ns/op      53 B/op       2 allocs/op
PASS
ok   sibte.so/rascore 3.749s
mxp@carbon:~/repos/raspchat/src/sibte.so/rascore$ GOMAXPROCS=4 go test -run none -bench Multiple -cpuprofile=cpu.out -memprofile=mem.out -benchmem
BenchmarkPubSubPrimitiveChannelsMultiple-4     20000    101704 ns/op      60 B/op       2 allocs/op
BenchmarkPubSubWaitGroupMultiple-4             10000    204039 ns/op      52 B/op       2 allocs/op
PASS
ok   sibte.so/rascore 5.087s
mxp@carbon:~/repos/raspchat/src/sibte.so/rascore$ GOMAXPROCS=2 go test -run none -bench Multiple -cpuprofile=cpu.out -memprofile=mem.out -benchmem
BenchmarkPubSubPrimitiveChannelsMultiple-2     30000     51255 ns/op      54 B/op       2 allocs/op
BenchmarkPubSubWaitGroupMultiple-2             20000     60871 ns/op      43 B/op       2 allocs/op
PASS
ok   sibte.so/rascore 4.022s
mxp@carbon:~/repos/raspchat/src/sibte.so/rascore$ GOMAXPROCS=1 go test -run none -bench Multiple -cpuprofile=cpu.out -memprofile=mem.out -benchmem
BenchmarkPubSubPrimitiveChannelsMultiple   20000     79534 ns/op      61 B/op       2 allocs/op
BenchmarkPubSubWaitGroupMultiple          100000     19066 ns/op      40 B/op       2 allocs/op
PASS
ok   sibte.so/rascore 4.502s

I tried multiple times and the results are consistent. I am using Go 1.8, Linux x64, 8GB RAM. I have multiple questions:

  • Why do channels perform worse than sync.Cond in the single-core results? Context switching is the same; if anything, it should be the other way around.
  • As I increase the max procs, the sync.Cond results go down, which might be explainable, but what is up with channels? 20k to 30k to 20k to 10k :( I have an i5 with 4 cores, so it should have peaked at 4 procs (psst: I tried 3 as well; it's consistent).
I am still suspicious that I am making some kind of mistake in the code. Any ideas?

- Thanks

peterGo

Jul 16, 2017, 1:27:04 PM
to golang-nuts
When you have data races, the results are undefined.

$ go version
go version devel +dd81c37 Sat Jul 15 05:43:45 2017 +0000 linux/amd64
$ go test -race -run=! -bench=. -benchmem -cpu=1,2,4,8 pubsub_test.go
==================
WARNING: DATA RACE
Read at 0x00c4200140c0 by goroutine 18:
  command-line-arguments.BenchmarkPubSubPrimitiveChannelsMultiple()
      /home/peter/gopath/src/nuts/pubsub_test.go:59 +0x51d
  testing.(*B).runN()
      /home/peter/go/src/testing/benchmark.go:141 +0x12a
  testing.(*B).run1.func1()
      /home/peter/go/src/testing/benchmark.go:214 +0x6b

Previous write at 0x00c4200140c0 by goroutine 57:
  [failed to restore the stack]

Goroutine 18 (running) created at:
  testing.(*B).run1()
      /home/peter/go/src/testing/benchmark.go:207 +0x8c
  testing.(*B).Run()
      /home/peter/go/src/testing/benchmark.go:513 +0x482
  testing.runBenchmarks.func1()
      /home/peter/go/src/testing/benchmark.go:417 +0xa7
  testing.(*B).runN()
      /home/peter/go/src/testing/benchmark.go:141 +0x12a
  testing.runBenchmarks()
      /home/peter/go/src/testing/benchmark.go:423 +0x86d
  testing.(*M).Run()
      /home/peter/go/src/testing/testing.go:928 +0x51e
  main.main()
      command-line-arguments/_test/_testmain.go:46 +0x1d3

Goroutine 57 (finished) created at:
  command-line-arguments.BenchmarkPubSubPrimitiveChannelsMultiple()
      /home/peter/gopath/src/nuts/pubsub_test.go:40 +0x290
  testing.(*B).runN()
      /home/peter/go/src/testing/benchmark.go:141 +0x12a
  testing.(*B).run1.func1()
      /home/peter/go/src/testing/benchmark.go:214 +0x6b
==================
--- FAIL: BenchmarkPubSubPrimitiveChannelsMultiple
    benchmark.go:147: race detected during execution of benchmark
==================
WARNING: DATA RACE
Read at 0x00c42000c030 by goroutine 1079:
  command-line-arguments.BenchmarkPubSubWaitGroupMultiple.func1()
      /home/peter/gopath/src/nuts/pubsub_test.go:76 +0x9e

Previous write at 0x00c42000c030 by goroutine 7:
  command-line-arguments.BenchmarkPubSubWaitGroupMultiple()
      /home/peter/gopath/src/nuts/pubsub_test.go:101 +0x475
  testing.(*B).runN()
      /home/peter/go/src/testing/benchmark.go:141 +0x12a
  testing.(*B).run1.func1()
      /home/peter/go/src/testing/benchmark.go:214 +0x6b

Goroutine 1079 (running) created at:
  command-line-arguments.BenchmarkPubSubWaitGroupMultiple()
      /home/peter/gopath/src/nuts/pubsub_test.go:93 +0x2e6
  testing.(*B).runN()
      /home/peter/go/src/testing/benchmark.go:141 +0x12a
  testing.(*B).run1.func1()
      /home/peter/go/src/testing/benchmark.go:214 +0x6b

Goroutine 7 (running) created at:
  testing.(*B).run1()
      /home/peter/go/src/testing/benchmark.go:207 +0x8c
  testing.(*B).Run()
      /home/peter/go/src/testing/benchmark.go:513 +0x482
  testing.runBenchmarks.func1()
      /home/peter/go/src/testing/benchmark.go:417 +0xa7
  testing.(*B).runN()
      /home/peter/go/src/testing/benchmark.go:141 +0x12a
  testing.runBenchmarks()
      /home/peter/go/src/testing/benchmark.go:423 +0x86d
  testing.(*M).Run()
      /home/peter/go/src/testing/testing.go:928 +0x51e
  main.main()
      command-line-arguments/_test/_testmain.go:46 +0x1d3
==================
==================
WARNING: DATA RACE
Write at 0x00c42000c030 by goroutine 7:
  command-line-arguments.BenchmarkPubSubWaitGroupMultiple()
      /home/peter/gopath/src/nuts/pubsub_test.go:101 +0x475
  testing.(*B).runN()
      /home/peter/go/src/testing/benchmark.go:141 +0x12a
  testing.(*B).run1.func1()
      /home/peter/go/src/testing/benchmark.go:214 +0x6b

Previous read at 0x00c42000c030 by goroutine 1078:
  command-line-arguments.BenchmarkPubSubWaitGroupMultiple.func1()
      /home/peter/gopath/src/nuts/pubsub_test.go:76 +0x9e

Goroutine 7 (running) created at:
  testing.(*B).run1()
      /home/peter/go/src/testing/benchmark.go:207 +0x8c
  testing.(*B).Run()
      /home/peter/go/src/testing/benchmark.go:513 +0x482
  testing.runBenchmarks.func1()
      /home/peter/go/src/testing/benchmark.go:417 +0xa7
  testing.(*B).runN()
      /home/peter/go/src/testing/benchmark.go:141 +0x12a
  testing.runBenchmarks()
      /home/peter/go/src/testing/benchmark.go:423 +0x86d
  testing.(*M).Run()
      /home/peter/go/src/testing/testing.go:928 +0x51e
  main.main()
      command-line-arguments/_test/_testmain.go:46 +0x1d3

Goroutine 1078 (running) created at:
  command-line-arguments.BenchmarkPubSubWaitGroupMultiple()
      /home/peter/gopath/src/nuts/pubsub_test.go:93 +0x2e6
  testing.(*B).runN()
      /home/peter/go/src/testing/benchmark.go:141 +0x12a
  testing.(*B).run1.func1()
      /home/peter/go/src/testing/benchmark.go:214 +0x6b
==================
==================
WARNING: DATA RACE
Read at 0x00c4200140c8 by goroutine 7:
  command-line-arguments.BenchmarkPubSubWaitGroupMultiple()
      /home/peter/gopath/src/nuts/pubsub_test.go:109 +0x51d
  testing.(*B).runN()
      /home/peter/go/src/testing/benchmark.go:141 +0x12a
  testing.(*B).run1.func1()
      /home/peter/go/src/testing/benchmark.go:214 +0x6b

Previous write at 0x00c4200140c8 by goroutine 175:
  sync/atomic.AddInt64()
      /home/peter/go/src/runtime/race_amd64.s:276 +0xb
  command-line-arguments.BenchmarkPubSubWaitGroupMultiple.func1()
      /home/peter/gopath/src/nuts/pubsub_test.go:88 +0x19a

Goroutine 7 (running) created at:
  testing.(*B).run1()
      /home/peter/go/src/testing/benchmark.go:207 +0x8c
  testing.(*B).Run()
      /home/peter/go/src/testing/benchmark.go:513 +0x482
  testing.runBenchmarks.func1()
      /home/peter/go/src/testing/benchmark.go:417 +0xa7
  testing.(*B).runN()
      /home/peter/go/src/testing/benchmark.go:141 +0x12a
  testing.runBenchmarks()
      /home/peter/go/src/testing/benchmark.go:423 +0x86d
  testing.(*M).Run()
      /home/peter/go/src/testing/testing.go:928 +0x51e
  main.main()
      command-line-arguments/_test/_testmain.go:46 +0x1d3

Goroutine 175 (finished) created at:
  command-line-arguments.BenchmarkPubSubWaitGroupMultiple()
      /home/peter/gopath/src/nuts/pubsub_test.go:93 +0x2e6
  testing.(*B).runN()
      /home/peter/go/src/testing/benchmark.go:141 +0x12a
  testing.(*B).run1.func1()
      /home/peter/go/src/testing/benchmark.go:214 +0x6b
==================
--- FAIL: BenchmarkPubSubWaitGroupMultiple
    benchmark.go:147: race detected during execution of benchmark
FAIL
exit status 1
FAIL    command-line-arguments    0.726s
$

Peter

Zohaib Sibte Hassan

Jul 16, 2017, 9:51:38 PM
to golang-nuts
Thanks for pointing the issues out. I updated my code at https://gist.github.com/maxpert/f3c405c516ba2d4c8aa8b0695e0e054e to get rid of the race conditions (nothing critical; it was always a reader-writer race). That still doesn't explain the new results:

$> go test -race -run=! -bench=. -benchmem -cpu=1,2,4,8 -cpuprofile=cpu.out -memprofile=mem.out pubsub_test.go
BenchmarkPubSubPrimitiveChannelsMultiple          50  21121694 ns/op    8515 B/op      39 allocs/op
BenchmarkPubSubPrimitiveChannelsMultiple-2       100  19302372 ns/op    4277 B/op      20 allocs/op
BenchmarkPubSubPrimitiveChannelsMultiple-4        50  22674769 ns/op    8182 B/op      35 allocs/op
BenchmarkPubSubPrimitiveChannelsMultiple-8        50  21201533 ns/op    8469 B/op      38 allocs/op
BenchmarkPubSubWaitGroupMultiple                3000    501804 ns/op      63 B/op       2 allocs/op
BenchmarkPubSubWaitGroupMultiple-2               200  15417944 ns/op     407 B/op       6 allocs/op
BenchmarkPubSubWaitGroupMultiple-4               300   5010273 ns/op     231 B/op       4 allocs/op
BenchmarkPubSubWaitGroupMultiple-8               200   5444634 ns/op     334 B/op       5 allocs/op
PASS
ok   command-line-arguments 21.775s

So far my testing shows channels are slower for the pubsub scenario. I tried looking into the pprof dumps of memory and CPU, and they are not making sense to me. What am I missing here?

peterGo

Jul 16, 2017, 11:06:44 PM
to golang-nuts
Your latest benchmarks are invalid.

In your benchmarks, replace b.StartTimer() with b.ResetTimer().

Simply run the benchmarks. Don't run the race detector. Don't run profilers. For example,
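
A minimal sketch of the b.ResetTimer() fix, assuming a hypothetical setup helper (newHub and publish are placeholders, not the gist's actual names):

package pubsub

import "testing"

func BenchmarkPubSub(b *testing.B) {
	hub := newHub() // hypothetical setup helper, not from the gist

	// b.ResetTimer() zeroes the elapsed time and allocation counters, so
	// the setup above is excluded from the measurement. b.StartTimer()
	// only resumes timing after a b.StopTimer(); it never zeroes the
	// clock, so using it here leaves the setup cost inside the result.
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		hub.publish("hello")
	}
}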

$ go version
go version devel +504deee Sun Jul 16 03:57:11 2017 +0000 linux/amd64
$ go test -run=! -bench=. -benchmem -cpu=1,2,4,8 pubsub_test.go
goos: linux
goarch: amd64
BenchmarkPubSubPrimitiveChannelsMultiple            20000         86328 ns/op          61 B/op           2 allocs/op
BenchmarkPubSubPrimitiveChannelsMultiple-2          30000         50844 ns/op          54 B/op           2 allocs/op
BenchmarkPubSubPrimitiveChannelsMultiple-4          10000        112833 ns/op          83 B/op           2 allocs/op
BenchmarkPubSubPrimitiveChannelsMultiple-8          10000        160011 ns/op          88 B/op           2 allocs/op
BenchmarkPubSubWaitGroupMultiple                   100000         21231 ns/op          40 B/op           2 allocs/op
BenchmarkPubSubWaitGroupMultiple-2                  10000        107165 ns/op          46 B/op           2 allocs/op
BenchmarkPubSubWaitGroupMultiple-4                  20000         73235 ns/op          43 B/op           2 allocs/op
BenchmarkPubSubWaitGroupMultiple-8                  20000         82917 ns/op          42 B/op           2 allocs/op
PASS
ok      command-line-arguments    15.481s
$

Peter

Zohaib Sibte Hassan

Jul 17, 2017, 1:45:57 AM
to golang-nuts
Awesome :) I got similar results. So channels are slower for pubsub. I am surprised that the efficiency goes down as the cores go up (heck, reminds me of that Node.js single-core meme). Please tell me I am wrong here; is there any other, more efficient approach?

Jesper Louis Andersen

Jul 17, 2017, 1:23:04 PM
to Zohaib Sibte Hassan, golang-nuts
Your benchmarks are not really *doing* anything once they get data. This means that eventual conflicts on (internal or external) locks are going to be hit all the time. In turn, you are likely to be measuring the performance of your hardware under excessive lock contention (this hypothesis can be verified with block profiling).
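
For example, a block profile can be collected straight from the benchmark run and inspected with pprof:

$ go test -run=! -bench=. -blockprofile=block.out pubsub_test.go
$ go tool pprof block.out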

Were you to process something before going back to the barrier or channel, you would be less likely to hit the problem. Though you should be aware that sometimes your cores will coordinate anyway (because the work they do will tend to bring them back to the lock at the same point in time - a problem similar to TCP incast).

As you go from a single core to multiple cores, an important event happens in the system. In a single-core system, you don't need the hardware to coordinate. This is much faster than the case where the hardware might need to tell other physical cores about what is going on. Especially under contention. In short, there is a gap between the single-core system and a multicore system due to communication overhead. In high-performance computing a common mistake is to measure the *same* implementation in the serial and parallel case, even though the serial case can be made to avoid taking locks and so on. A fair benchmark would compare the fastest single-core implementation with the fastest multi-core ditto.

The upshot of having access to multiple cores is that you can get them all to do work and this can lead to a substantial speedup. But there has to be work to do. Your benchmark has almost no work to do, so most of the cores are just competing against each other.

For a PubSub implementation you need to think about the effort involved in the implementation. Up to a point, I think a channel-based or barrier-based solution is adequate. For systems where you need low latency, go for something like the Disruptor pattern[0], which should be doable in Go. This pattern works by a ring buffer of events and two types of "clock hands". One hand is the point of insertion of a new event; the other hand(s) track where the readers currently are in their processing. By tracking the hands through atomics, you can get the messaging overhead down by a lot. The trick is that each thread writes only to its own hand, and other threads can read an eventually consistent snapshot. As long as the buffer doesn't fill up completely, an eventually consistent view of an ever-increasing counter is enough. Getting an older value is not a big deal, as long as the value isn't undefined.
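
A deliberately tiny, single-producer sketch of that idea in Go (not a full Disruptor; it omits back-pressure against the producer lapping slow readers):

package ring

import "sync/atomic"

const size = 1024 // power of two, so a mask can replace the modulo

// The producer owns head; each reader owns its own cursor and polls.
type ring struct {
	buf  [size]string
	head int64 // next sequence to publish; written only by the producer
}

func (r *ring) publish(m string) {
	seq := atomic.LoadInt64(&r.head)
	r.buf[seq&(size-1)] = m
	atomic.StoreInt64(&r.head, seq+1) // release the slot to readers
}

// poll drains everything between the reader's cursor and head, advancing
// the cursor. A real implementation must also stop the producer from
// overwriting a slow reader's unread slots; omitted here for brevity.
func (r *ring) poll(cursor *int64) []string {
	head := atomic.LoadInt64(&r.head)
	var out []string
	for ; *cursor < head; *cursor++ {
		out = append(out, r.buf[*cursor&(size-1)])
	}
	return out
}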

The problem with channels is that they are, inherently, one-shot. Once a message is consumed, it is gone from the channel. In a PubSub solution, you really want multiple readers to process the same (immutable) data and flag it when everyone has consumed it. Message brokers and SMTP servers often optimize delivery of this kind by keeping the payload separate from the header (envelope): better not to pay for the contents more than once. But you still have to send a message to each subscriber, and this gets rather expensive. The Disruptor pattern solves this.
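
A sketch of that envelope idea in Go: construct the payload once and fan out only a pointer, so all subscribers share a single immutable copy (names are illustrative):

package pubsub

// message is treated as immutable once published.
type message struct {
	header  string
	payload []byte
}

func publish(subs []chan *message, m *message) {
	for _, ch := range subs {
		ch <- m // each subscriber receives a pointer, not a payload copy
	}
}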

On the other hand, the Disruptor only works if you happen to have shared memory. In large distributed systems, the pattern tends to break down, and other methods must be used - for instance, a cluster of disruptors. But then keeping consistency in check becomes a challenge of the fun kind :)

[0] http://lmax-exchange.github.io/disruptor/files/Disruptor-1.0.pdf - Thompson, Farley, Barker, Gee, Stewart, 2011

Zohaib Sibte Hassan

Jul 17, 2017, 4:14:50 PM
to golang-nuts, zohaib...@gmail.com
Hey Jesper,

Thanks for such an insightful reply. So the scenario I am trying to implement is basically a chat scenario: with huge chat rooms (thousands of people), all the listener goroutines are going to run into this situation (which is exactly what I am doing here). To elaborate a little more: right now each user in a chat room is a goroutine waiting for a message on a channel (shameless plug: the project is available at http://raspchat.com/). Now, this setup is running on a limited-memory Raspberry Pi with 512MB of RAM and a few cores. What I noticed as the channels were flooded by people from 4chan was excessive usage of CPU and memory, which led me to believe that the channels are probably the bottleneck. So yes, I am running it on a shared-memory system for now, and the Disruptor pattern sounds like an interesting approach I would love to try.

Thanks for the idea; I will follow up with my findings here.

Michael Jones

Jul 17, 2017, 4:15:10 PM
to Jesper Louis Andersen, Zohaib Sibte Hassan, golang-nuts
I get about 3 million channel send/receive pairs per second. If you actually do work in between (database access, computation, etc.), the limit is difficult to reach.
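
That figure is easy to sanity-check with a ping-pong micro-benchmark along these lines (a sketch, not his actual test; roughly 300 ns/op corresponds to about 3 million pairs per second):

package pingpong

import "testing"

func BenchmarkSendRecvPair(b *testing.B) {
	ch := make(chan int)
	done := make(chan struct{})
	go func() {
		for range ch { // one receive per send below
		}
		close(done)
	}()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		ch <- i
	}
	close(ch)
	<-done
}
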
--
Michael T. Jones
michae...@gmail.com

Jesper Louis Andersen

Jul 17, 2017, 6:12:56 PM
to Zohaib Sibte Hassan, golang-nuts
On Mon, Jul 17, 2017 at 10:14 PM Zohaib Sibte Hassan <zohaib...@gmail.com> wrote:
> So the scenario I am trying to implement is basically a chat scenario.

My first approach would be to measure.

* The full system will have a bottleneck as per Amdahl's law. You need to take a holistic view since the bottleneck may be outside Go (ie, the kernel, hardware, ...). Knowing the bottleneck will usually suggest an angle of attack on the problem at hand.

* High memory usage suggests either an overload situation (because you don't have enough resources) or excessive copying of data is going on. Try doing some napkin math. A million active users with 1k of data each takes one gigabyte of memory and so on.

* High CPU can be indicative of real work, of the CPU being bandwidth-constrained toward memory, or of lock contention. Kernels usually will not discriminate between these states, so some investigation is necessary.

* Keeping thousands of connections on small hardware is expensive. Each TCP connection needs some kernel space in addition to userland space. It quickly adds up.

* TCP sending is going to cost a lot of time. In laboratory tests, the network is fast. Inside a datacenter, such as one operated by Google, network transmission is fast. The internet in general is slow, latency-inducing, and brittle. This forces your system to keep data lingering for longer, and this puts more pressure on memory usage.

* Observation: in a noisy chatroom, you want to skip messages if they flow out of the view on the client. You don't need to process every message in this case. Just the messages that are visible. This suggests a polling construction like the disruptor: only care about the K newest messages in the fast path. The slow path does historical lookups. Keep an "epoch" count of where we are in the message flow. When a socket is ready for data, use the epoch count to figure out what happened in the meantime.

* If messages are immutable, they don't change and can be concurrently read with little overhead. Edits to messages can be handled by a patching-construction which overrides an earlier message.

* The *publisher* should take the effort of constructing as much of the payload as possible and place it into a buffer that everyone writes directly into the network socket. If every subscriber has to do work, things get expensive.

* Channels are likely to be fast enough, and resources are likely to run out quickly: especially memory on a 512 megabyte computer.

* Channels should be used to send epochs around. The actual payload ought to live somewhere else, ready and barriered appropriately for read-only consumption. Alternative: pass a reference to the data around. This is simple to do and is likely to be fast (see the sketch after this list).

* Your goal is to get data to the socket so the kernel can do work. If you do this correctly, it is likely the socket is going to be the bottleneck of the system. Also, consider the possibility that you will block a goroutine on data transfer to the outside world while its channel buffer fills up. The publisher should not block in this situation.
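
A minimal sketch of the "send epochs, share payloads" idea from the last few points, assuming an append-only message log whose entries are immutable once written (all names are illustrative):

package epochs

import "sync"

// log is an append-only message log; entries are immutable once written.
type log struct {
	mu   sync.RWMutex
	msgs []string
}

// append stores a message and returns the new epoch (the log length).
func (l *log) append(m string) int {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.msgs = append(l.msgs, m)
	return len(l.msgs)
}

// since returns everything newer than the subscriber's last-seen epoch.
func (l *log) since(epoch int) []string {
	l.mu.RLock()
	defer l.mu.RUnlock()
	return l.msgs[epoch:]
}

// publish appends once, then sends only the small epoch number to each
// subscriber. The non-blocking send keeps the publisher from stalling on
// a slow subscriber; a laggard catches up later via since().
func publish(l *log, subs []chan int, m string) {
	epoch := l.append(m)
	for _, ch := range subs {
		select {
		case ch <- epoch:
		default:
		}
	}
}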

In general, systems engineering tends to trump local tuning. Effort is a constrained resource, so it is usually best spent in the areas where the cost/benefit analysis falls out nicely.