> This is fixed on tip. And Jon says that the issue happens with 1.5
> (presumably tip).
Yes, I'm running tip (as of yesterday).
> Could you post actual performance number and GOMAXPROCS?
+---------+-----------+-----------+
| Threads | Memcached |  cucache  |
+---------+-----------+-----------+
|       1 |   90621/s |   57648/s |
|       2 |  185000/s |   98456/s |
|       4 |  258975/s |  173354/s |
|       8 |  345890/s |  260459/s |
|      16 |  260510/s |  230311/s |
+---------+-----------+-----------+
For memcached, "Threads" means -t <threads>; for cucache (my Go
application), it means GOMAXPROCS=<threads>.
All experiments are run on the same 40-core machine, using memtier_benchmark:
memtier_benchmark -P memcache_binary -n 20000 -t 20 -c 10
This uses 20 client threads, each with 10 clients, each issuing 20k
requests sequentially (write request to socket, wait for reply before
sending the next). The server under test is bound to the first <threads>
CPU cores, and the clients to the last 20 cores. For the 16-thread test
the clients started hitting 100% CPU, so memtier_benchmark was given the
last *24* cores instead.
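For the 8-thread run, for example, the binding amounts to something like
this (the ./cucache binary name is illustrative, and taskset is just one
way of doing the pinning; only the memtier_benchmark flags above are
taken from the actual runs):

  GOMAXPROCS=8 taskset -c 0-7 ./cucache
  taskset -c 20-39 memtier_benchmark -P memcache_binary -n 20000 -t 20 -c 10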
> From the profile it looks like you underload runtime with work.
> Please also collect block profile and scheduler trace.
Hmm... the clients issue requests as fast as the server can reply, but I
suppose that if socket writes dominate, many of the goroutines spend
their time waiting for the write to complete and for the next request to
come in. To check, I gave the clients more cores, and the time spent in
epollwait did indeed go down (though it is still ~9%).
I'm still surprised that memcached achieves higher throughput though, as
it should also be dominated by socket writes.
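To make that concrete, here is a minimal sketch of the per-connection
pattern I mean (an echo-style stand-in: the loop structure is assumed
rather than taken from cucache, the port is made up, and in the real
server the read is a full binary-protocol parse rather than a raw Read):

  package main

  import (
      "bufio"
      "net"
  )

  func handle(c net.Conn) {
      defer c.Close()
      r := bufio.NewReader(c)
      w := bufio.NewWriter(c)
      buf := make([]byte, 4096)
      for {
          // Block until the client sends the next request.
          n, err := r.Read(buf)
          if err != nil {
              return
          }
          // Write the reply and flush it before reading the next request.
          // With synchronous clients, every request therefore pays for at
          // least one write(2) on the socket, which is consistent with
          // syscall.write dominating the CPU profile below.
          w.Write(buf[:n])
          w.Flush()
      }
  }

  func main() {
      ln, err := net.Listen("tcp", ":11212") // port is illustrative
      if err != nil {
          panic(err)
      }
      for {
          c, err := ln.Accept()
          if err != nil {
              panic(err)
          }
          go handle(c)
      }
  }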
Block profile (pprof.Lookup("block").WriteTo("...", 2)):
--- contention:
cycles/second=2400085654
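For reference, this is roughly how that profile is collected in full
(the file handling is illustrative; block profiling is enabled via
runtime.SetBlockProfileRate, which defaults to 0):

  package main

  import (
      "os"
      "runtime"
      "runtime/pprof"
  )

  func init() {
      // A rate of 1 records every blocking event; the default of 0
      // records none.
      runtime.SetBlockProfileRate(1)
  }

  func dumpBlockProfile(path string) error {
      f, err := os.Create(path)
      if err != nil {
          return err
      }
      defer f.Close()
      // debug=2, matching the WriteTo call quoted above.
      return pprof.Lookup("block").WriteTo(f, 2)
  }

  func main() {
      // ... run the workload, then dump the profile ...
      if err := dumpBlockProfile("block.prof"); err != nil {
          panic(err)
      }
  }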
Scheduler trace:
SCHED 0ms: gomaxprocs=8 idleprocs=5 threads=6 spinningthreads=1 idlethreads=0 runqueue=0 [0 0 0 0 0 0 0 0]
SCHED 1001ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=84 [17 0 21 18 0 12 25 14]
SCHED 2010ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=73 [11 14 0 10 11 15 0 13]
SCHED 3019ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=136 [5 0 0 0 0 1 0 0]
SCHED 4028ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=128 [0 0 0 0 0 0 0 0]
SCHED 5036ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=81 [0 5 5 3 5 4 6 8]
SCHED 6045ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=65 [11 14 16 9 13 24 18 21]
SCHED 7053ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=114 [0 16 1 3 0 0 0 1]
SCHED 8061ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=178 [0 0 0 0 0 7 0 0]
SCHED 9070ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=128 [2 0 0 0 0 0 0 0]
SCHED 10079ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=128 [0 0 0 0 0 0 0 0]
SCHED 11087ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=111 [0 0 1 0 0 7 9 0]
SCHED 12096ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=8 runqueue=101 [8 21 0 14 12 17 19 12]
SCHED 13104ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=2 runqueue=78 [18 20 16 20 13 15 0 10]
SCHED 14113ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=2 runqueue=94 [0 16 12 19 0 0 14 22]
SCHED 15121ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=2 runqueue=113 [3 2 4 0 3 1 3 2]
SCHED 16129ms: gomaxprocs=8 idleprocs=8 threads=14 spinningthreads=0 idlethreads=9 runqueue=0 [0 0 0 0 0 0 0 0]
SCHED 17134ms: gomaxprocs=8 idleprocs=8 threads=14 spinningthreads=0 idlethreads=9 runqueue=0 [0 0 0 0 0 0 0 0]
SCHED 18139ms: gomaxprocs=8 idleprocs=8 threads=14 spinningthreads=0 idlethreads=9 runqueue=0 [0 0 0 0 0 0 0 0]
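(For reference, a trace in this form comes from running the server with
the schedtrace GODEBUG option, e.g.

  GODEBUG=schedtrace=1000 GOMAXPROCS=8 ./cucache

where the interval is in milliseconds and ./cucache is again just the
illustrative binary name.)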
CPU profile:
32.29s of 38s total (84.97%)
Dropped 143 nodes (cum <= 0.19s)
flat flat% sum% cum cum%
21.29s 56.03% 56.03% 22.38s 58.89% syscall.Syscall
3.47s 9.13% 65.16% 3.47s 9.13% runtime.epollwait
0.79s 2.08% 67.24% 0.79s 2.08% runtime.cas
0.43s 1.13% 68.37% 0.52s 1.37% runtime.mallocgc
0.39s 1.03% 69.39% 0.44s 1.16% runtime.writebarrierptr
0.36s 0.95% 70.34% 0.36s 0.95% runtime.xchg
0.31s 0.82% 71.16% 1s 2.63% runtime.chanrecv
0.25s 0.66% 71.82% 0.89s 2.34% github.com/dustin/gomemcached.(*MCRequest).Receive
0.25s 0.66% 72.47% 4.58s 12.05% net.(*netFD).Read
0.25s 0.66% 73.13% 0.56s 1.47% runtime.casgstatus
0.25s 0.66% 73.79% 0.25s 0.66% runtime.schedule
0.24s 0.63% 74.42% 0.26s 0.68% runtime.scanobject
0.23s 0.61% 75.03% 0.23s 0.61% runtime.newdefer
0.19s 0.5% 75.53% 4.85s 12.76% bufio.(*Reader).fill
0.19s 0.5% 76.03% 0.19s 0.5% runtime.systemstack
0.16s 0.42% 76.45% 0.31s 0.82% runtime.deferreturn
0.16s 0.42% 76.87% 0.33s 0.87% runtime.lock
0.16s 0.42% 77.29% 0.38s 1.00% sync.(*Pool).Get
0.15s 0.39% 77.68% 0.55s 1.45% runtime.exitsyscall
0.14s 0.37% 78.05% 20.19s 53.13% bufio.(*Writer).flush
0.13s 0.34% 78.39% 0.24s 0.63% runtime.gopark
0.13s 0.34% 78.74% 0.46s 1.21% runtime.reentersyscall
0.12s 0.32% 79.05% 0.23s 0.61% runtime.writebarrierslice
0.11s 0.29% 79.34% 0.22s 0.58% sync.(*Mutex).Unlock
0.10s 0.26% 79.61% 0.25s 0.66% net.(*pollDesc).Prepare
0.10s 0.26% 79.87% 0.25s 0.66% runtime.exitsyscallfast
0.10s 0.26% 80.13% 0.34s 0.89% sync.(*Pool).Put
0.09s 0.24% 80.37% 20.28s 53.37% bufio.(*Writer).Flush
0.09s 0.24% 80.61% 0.50s 1.32% cuckood.(*cmap).get
0.09s 0.24% 80.84% 0.37s 0.97% runtime.makeslice
0.09s 0.24% 81.08% 0.28s 0.74% runtime.netpollblock
0.09s 0.24% 81.32% 0.20s 0.53% runtime.unlock
0.09s 0.24% 81.55% 19.15s 50.39% syscall.write
0.08s 0.21% 81.76% 0.21s 0.55% cuckood.(*cbin).v
0.08s 0.21% 81.97% 20.03s 52.71% net.(*conn).Write
0.08s 0.21% 82.18% 0.23s 0.61% net.(*fdMutex).RWUnlock
0.08s 0.21% 82.39% 0.42s 1.11% runtime.chansend
0.07s 0.18% 82.58% 1.97s 5.18% main.execute
0.07s 0.18% 82.76% 1.17s 3.08% main.req2res
0.07s 0.18% 82.95% 4.65s 12.24% net.(*conn).Read
0.07s 0.18% 83.13% 0.54s 1.42% runtime.entersyscall
0.07s 0.18% 83.32% 0.20s 0.53% runtime.goparkunlock
0.06s 0.16% 83.47% 6.26s 16.47% main.parse
0.06s 0.16% 83.63% 19.95s 52.50% net.(*netFD).Write
0.06s 0.16% 83.79% 0.28s 0.74% runtime.newarray
0.05s 0.13% 83.92% 20.44s 53.79% main.writeback.func1
0.05s 0.13% 84.05% 0.38s 1.00% net.(*pollDesc).Wait
0.05s 0.13% 84.18% 0.33s 0.87% net.runtime_pollWait
0.05s 0.13% 84.32% 19.20s 50.53% syscall.Write
0.04s 0.11% 84.42% 4.89s 12.87% bufio.(*Reader).Peek
0.04s 0.11% 84.53% 0.29s 0.76% main.execute.func1
0.04s 0.11% 84.63% 0.46s 1.21% runtime.chansend1
0.03s 0.079% 84.71% 0.41s 1.08% net.(*pollDesc).WaitRead
0.02s 0.053% 84.76% 0.52s 1.37% cuckood.Cuckoo.Get
0.02s 0.053% 84.82% 1.02s 2.68% runtime.chanrecv2
0.02s 0.053% 84.87% 3.35s 8.82% syscall.Read
0.01s 0.026% 84.89% 0.21s 0.55% cuckood.(*cmap).insert
0.01s 0.026% 84.92% 1.03s 2.71% main.writeback
0.01s 0.026% 84.95% 30.24s 79.58% runtime.goexit
0.01s 0.026% 84.97% 0.31s 0.82% runtime.newobject
0 0% 84.97% 0.29s 0.76% cuckood.Cuckoo.Set
0 0% 84.97% 0.22s 0.58% cuckood.Cuckoo.op
0 0% 84.97% 0.36s 0.95% github.com/dustin/gomemcached.(*MCResponse).Bytes
0 0% 84.97% 6.27s 16.50% main.handleConnection
0 0% 84.97% 6.26s 16.47% main.setup
0 0% 84.97% 7.52s 19.79% runtime._System
0 0% 84.97% 0.22s 0.58% runtime.backgroundgc
0 0% 84.97% 0.22s 0.58% runtime.gc
0 0% 84.97% 0.22s 0.58% runtime.gcDrain
0 0% 84.97% 3.33s 8.76% syscall.read