Profiling shows majority of CPU time spent in System, and runtime.epollwait in particular


j...@tsp.io

Apr 20, 2015, 5:02:52 PM4/20/15
to golan...@googlegroups.com
Hi all!

I've been working on a high-performance memcached-like server in Go: https://github.com/jonhoo/cuckoocache.
It uses three goroutines for each connection (one for reading, one for computation, one for writing).
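
Roughly, the per-connection wiring looks like the sketch below (a simplified illustration, not the actual cuckoocache source; handle() stands in for the cache lookup and response serialization, and the buffer sizes and port are arbitrary):

package main

import (
	"bufio"
	"net"
)

// handle stands in for the real cache lookup and response serialization.
func handle(req []byte) []byte { return req }

// serve wires up the three per-connection goroutines: a reader, a
// compute goroutine, and a writer (this goroutine).
func serve(conn net.Conn) {
	defer conn.Close()
	requests := make(chan []byte, 128)
	responses := make(chan []byte, 128)

	// Reader: pull requests off the socket.
	go func() {
		defer close(requests)
		r := bufio.NewReader(conn)
		for {
			req, err := r.ReadBytes('\n')
			if err != nil {
				return
			}
			requests <- req
		}
	}()

	// Compute: produce a response for each request.
	go func() {
		defer close(responses)
		for req := range requests {
			responses <- handle(req)
		}
	}()

	// Writer: flush responses back to the client.
	w := bufio.NewWriter(conn)
	for res := range responses {
		w.Write(res)
		w.Flush()
	}
}

func main() {
	ln, err := net.Listen("tcp", ":11211")
	if err != nil {
		panic(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		go serve(conn)
	}
}

The point of the split is that reads, computation, and writes for one connection can overlap, at the cost of two channel hand-offs per request.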

When benchmarking with 50 clients issuing back-to-back requests, performance is fairly close to that of memcached.
However, when I increase the number of cores (and GOMAXPROCS), my profiles become dominated by System (65%), and runtime.epollwait (25%) in particular (see attached profiling results). This is the case both for Go 1.4 and Go 1.5.
This was surprising to me, as memcached also uses epollwait, read and write, but achieves higher throughput than my Go program does.

Is this a well-known problem, and are there any obvious knobs I should be tweaking or optimizations I could implement to try and mitigate this cost?

Hope someone can help shed some light on this.
Cheers,
Jon
cucache.svg

Dmitry Vyukov

Apr 20, 2015, 11:43:04 PM4/20/15
to j...@tsp.io, golang-nuts
From the profile, it looks like you are underloading the runtime with work.
Please also collect a block profile and a scheduler trace.

Naoki INADA

Apr 21, 2015, 4:32:24 AM4/21/15
to golan...@googlegroups.com, j...@tsp.io
I wonder if this is the same problem reported in this thread: https://groups.google.com/forum/#!searchin/golang-nuts/benchmarks/golang-nuts/W5yOfB_RF8I/E4u8_sFAQNAJ
I can't reproduce it on an EC2 c4.8xlarge, since network performance in EC2 is worse than that of PEAK hosting.

Could you post actual performance numbers and your GOMAXPROCS setting?

I have a guess about this problem.
Go's scheduler takes the next goroutine from its local queue.
If the local queue is empty, it tries the global queue, and if the global queue is empty, it calls the poller.

So when the global queue is empty, many threads may call the poller at the same time.

To reduce this problem, the monitor thread could call the poller more frequently to fill the global queue:
https://github.com/golang/go/blob/release-branch.go1.4/src/runtime/proc.c#L2911-L2914

now = runtime·nanotime();
unixnow = runtime·unixnanotime();
- if(lastpoll != 0 && lastpoll + 10*1000*1000 < now) {
+ if(lastpoll != 0) {
runtime·cas64(&runtime·sched.lastpoll, lastpoll, now);
gp = runtime·netpoll(false);  // non-blocking
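
To make the lookup order I described concrete, here is a rough, self-contained Go sketch (illustrative only; the real scheduler uses different data structures and also steals work from other Ps before falling back to a blocking poll):

package sched

type g struct{}

type p struct {
	local []*g // per-P local run queue
}

var global []*g // global run queue

// netpoll stands in for the non-blocking poller check (epoll on Linux).
func netpoll() *g { return nil }

// findRunnable mirrors the order described above: local queue first,
// then the global queue, and only then the poller.
func findRunnable(pp *p) *g {
	if n := len(pp.local); n > 0 {
		gp := pp.local[n-1]
		pp.local = pp.local[:n-1]
		return gp
	}
	if n := len(global); n > 0 {
		gp := global[n-1]
		global = global[:n-1]
		return gp
	}
	return netpoll() // may return nil; the real scheduler then steals work or blocks
}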


Dmitry Vyukov

Apr 21, 2015, 4:36:03 AM4/21/15
to Naoki INADA, golang-nuts, jon
On Tue, Apr 21, 2015 at 11:32 AM, Naoki INADA <songof...@gmail.com> wrote:
> I wonder if this is the same problem reported in this thread:
> https://groups.google.com/forum/#!searchin/golang-nuts/benchmarks/golang-nuts/W5yOfB_RF8I/E4u8_sFAQNAJ
> I can't reproduce it on EC2 c4.8xlarge since network performance in EC2 is
> worse than PEAK hosting.
>
> Could you post actual performance number and GOMAXPROCS?
>
> I have a guess of this problem.
> Go's scheduler take next goroutine from local queue.
> If local queue is empty, try global queue, and if global queue is empty,
> call poller.

Only one thread calls netpoll. If one thread is already blocked in
netpoll, other threads don't call it.

> So when global queue is empty, many threads may call poller at same time.
>
> To reduce this problem, monitor thread can call poller more frequently to
> fill global queue.
> https://github.com/golang/go/blob/release-branch.go1.4/src/runtime/proc.c#L2911-L2914
>
> now = runtime·nanotime();
> unixnow = runtime·unixnanotime();
> - if(lastpoll != 0 && lastpoll + 10*1000*1000 < now) {
> + if(lastpoll != 0) {
> runtime·cas64(&runtime·sched.lastpoll, lastpoll, now);
> gp = runtime·netpoll(false); // non-blocking
>
>

Naoki INADA

Apr 21, 2015, 4:46:02 AM4/21/15
to golan...@googlegroups.com, songof...@gmail.com, j...@tsp.io


On Tuesday, April 21, 2015 at 5:36:03 PM UTC+9, Dmitry Vyukov wrote:
On Tue, Apr 21, 2015 at 11:32 AM, Naoki INADA <songof...@gmail.com> wrote:
> I wonder if this is the same problem reported in this thread:
> https://groups.google.com/forum/#!searchin/golang-nuts/benchmarks/golang-nuts/W5yOfB_RF8I/E4u8_sFAQNAJ
> I can't reproduce it on EC2 c4.8xlarge since network performance in EC2 is
> worse than PEAK hosting.
>
> Could you post actual performance number and GOMAXPROCS?
>
> I have a guess of this problem.
> Go's scheduler take next goroutine from local queue.
> If local queue is empty, try global queue, and if global queue is empty,
> call poller.

Only one thread calls netpoll. If one thread is already blocked in
netpoll, other threads don't call it.

Is it true on here?
https://github.com/golang/go/blob/release-branch.go1.4/src/runtime/proc.c#L1406

Dmitry Vyukov

Apr 21, 2015, 6:04:10 AM4/21/15
to Naoki INADA, golang-nuts, jon
On Tue, Apr 21, 2015 at 11:46 AM, Naoki INADA <songof...@gmail.com> wrote:
>
>
> On Tuesday, April 21, 2015 at 5:36:03 PM UTC+9, Dmitry Vyukov wrote:
>>
>> On Tue, Apr 21, 2015 at 11:32 AM, Naoki INADA <songof...@gmail.com> wrote:
>> > I wonder if this is the same problem reported in this thread:
>> >
>> > https://groups.google.com/forum/#!searchin/golang-nuts/benchmarks/golang-nuts/W5yOfB_RF8I/E4u8_sFAQNAJ
>> > I can't reproduce it on EC2 c4.8xlarge since network performance in EC2
>> > is
>> > worse than PEAK hosting.
>> >
>> > Could you post actual performance number and GOMAXPROCS?
>> >
>> > I have a guess of this problem.
>> > Go's scheduler take next goroutine from local queue.
>> > If local queue is empty, try global queue, and if global queue is empty,
>> > call poller.
>>
>> Only one thread calls netpoll. If one thread is already blocked in
>> netpoll, other threads don't call it.
>>
>
> Is it true on here?
> https://github.com/golang/go/blob/release-branch.go1.4/src/runtime/proc.c#L1406


This is fixed on tip. And Jon says that the issue happens with 1.5
(presumably tip).

Jon Gjengset

Apr 21, 2015, 11:38:14 AM4/21/15
to Naoki INADA, Dmitry Vyukov, golan...@googlegroups.com
> This is fixed on tip. And Jon says that the issue happens with 1.5
> (presumably tip).

Yes, I'm running tip (as of yesterday).

> Could you post actual performance number and GOMAXPROCS?

+---------+-----------+----------+
| Threads | Memcached | Cucache  |
+---------+-----------+----------+
|       1 |   90621/s |  57648/s |
|       2 |  185000/s |  98456/s |
|       4 |  258975/s | 173354/s |
|       8 |  345890/s | 260459/s |
|      16 |  260510/s | 230311/s |
+---------+-----------+----------+

Threads for memcached means -t <threads>
Threads for cucache (my Go application) means GOMAXPROCS=<threads>
All experiments are run on the same 40-core machine, using memtier_benchmark:

memtier_benchmark -P memcache_binary -n 20000 -t 20 -c 10

This uses 20 client threads, each with 10 clients, each issuing 20k
requests sequentially (write request to socket, wait for reply before
sending the next). The servers are bound to the first <threads> CPU
cores, the clients are bound to the last 20 cores. For the 16-thread
test, the clients started reaching 100% CPU, so memtier_benchmark was
given the last *24* cores.

> From the profile it looks like you underload runtime with work.
> Please also collect block profile and scheduler trace.

Hmm... Clients issue requests as fast as the server can reply, but I
suppose if the socket writes dominate, many of the goroutines might
spend time waiting around for the write to complete and another request
to come in. I turned up the number of cores available to the client to
make sure, and the time spent in epollwait did indeed go down (still
~9%).

I'm still surprised that memcached achieves higher throughput though, as
it should also be dominated by socket writes.

Block profile (pprof.Lookup("block").WriteTo("...", 2)):

--- contention:
cycles/second=2400085654

Scheduler trace:

SCHED 0ms: gomaxprocs=8 idleprocs=5 threads=6 spinningthreads=1 idlethreads=0 runqueue=0 [0 0 0 0 0 0 0 0]
SCHED 1001ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=84 [17 0 21 18 0 12 25 14]
SCHED 2010ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=73 [11 14 0 10 11 15 0 13]
SCHED 3019ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=136 [5 0 0 0 0 1 0 0]
SCHED 4028ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=128 [0 0 0 0 0 0 0 0]
SCHED 5036ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=81 [0 5 5 3 5 4 6 8]
SCHED 6045ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=65 [11 14 16 9 13 24 18 21]
SCHED 7053ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=114 [0 16 1 3 0 0 0 1]
SCHED 8061ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=178 [0 0 0 0 0 7 0 0]
SCHED 9070ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=128 [2 0 0 0 0 0 0 0]
SCHED 10079ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=128 [0 0 0 0 0 0 0 0]
SCHED 11087ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=111 [0 0 1 0 0 7 9 0]
SCHED 12096ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=8 runqueue=101 [8 21 0 14 12 17 19 12]
SCHED 13104ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=2 runqueue=78 [18 20 16 20 13 15 0 10]
SCHED 14113ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=2 runqueue=94 [0 16 12 19 0 0 14 22]
SCHED 15121ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=2 runqueue=113 [3 2 4 0 3 1 3 2]
SCHED 16129ms: gomaxprocs=8 idleprocs=8 threads=14 spinningthreads=0 idlethreads=9 runqueue=0 [0 0 0 0 0 0 0 0]
SCHED 17134ms: gomaxprocs=8 idleprocs=8 threads=14 spinningthreads=0 idlethreads=9 runqueue=0 [0 0 0 0 0 0 0 0]
SCHED 18139ms: gomaxprocs=8 idleprocs=8 threads=14 spinningthreads=0 idlethreads=9 runqueue=0 [0 0 0 0 0 0 0 0]

CPU profile:

32.29s of 38s total (84.97%)
Dropped 143 nodes (cum <= 0.19s)
flat flat% sum% cum cum%
21.29s 56.03% 56.03% 22.38s 58.89% syscall.Syscall
3.47s 9.13% 65.16% 3.47s 9.13% runtime.epollwait
0.79s 2.08% 67.24% 0.79s 2.08% runtime.cas
0.43s 1.13% 68.37% 0.52s 1.37% runtime.mallocgc
0.39s 1.03% 69.39% 0.44s 1.16% runtime.writebarrierptr
0.36s 0.95% 70.34% 0.36s 0.95% runtime.xchg
0.31s 0.82% 71.16% 1s 2.63% runtime.chanrecv
0.25s 0.66% 71.82% 0.89s 2.34% github.com/dustin/gomemcached.(*MCRequest).Receive
0.25s 0.66% 72.47% 4.58s 12.05% net.(*netFD).Read
0.25s 0.66% 73.13% 0.56s 1.47% runtime.casgstatus
0.25s 0.66% 73.79% 0.25s 0.66% runtime.schedule
0.24s 0.63% 74.42% 0.26s 0.68% runtime.scanobject
0.23s 0.61% 75.03% 0.23s 0.61% runtime.newdefer
0.19s 0.5% 75.53% 4.85s 12.76% bufio.(*Reader).fill
0.19s 0.5% 76.03% 0.19s 0.5% runtime.systemstack
0.16s 0.42% 76.45% 0.31s 0.82% runtime.deferreturn
0.16s 0.42% 76.87% 0.33s 0.87% runtime.lock
0.16s 0.42% 77.29% 0.38s 1.00% sync.(*Pool).Get
0.15s 0.39% 77.68% 0.55s 1.45% runtime.exitsyscall
0.14s 0.37% 78.05% 20.19s 53.13% bufio.(*Writer).flush
0.13s 0.34% 78.39% 0.24s 0.63% runtime.gopark
0.13s 0.34% 78.74% 0.46s 1.21% runtime.reentersyscall
0.12s 0.32% 79.05% 0.23s 0.61% runtime.writebarrierslice
0.11s 0.29% 79.34% 0.22s 0.58% sync.(*Mutex).Unlock
0.10s 0.26% 79.61% 0.25s 0.66% net.(*pollDesc).Prepare
0.10s 0.26% 79.87% 0.25s 0.66% runtime.exitsyscallfast
0.10s 0.26% 80.13% 0.34s 0.89% sync.(*Pool).Put
0.09s 0.24% 80.37% 20.28s 53.37% bufio.(*Writer).Flush
0.09s 0.24% 80.61% 0.50s 1.32% cuckood.(*cmap).get
0.09s 0.24% 80.84% 0.37s 0.97% runtime.makeslice
0.09s 0.24% 81.08% 0.28s 0.74% runtime.netpollblock
0.09s 0.24% 81.32% 0.20s 0.53% runtime.unlock
0.09s 0.24% 81.55% 19.15s 50.39% syscall.write
0.08s 0.21% 81.76% 0.21s 0.55% cuckood.(*cbin).v
0.08s 0.21% 81.97% 20.03s 52.71% net.(*conn).Write
0.08s 0.21% 82.18% 0.23s 0.61% net.(*fdMutex).RWUnlock
0.08s 0.21% 82.39% 0.42s 1.11% runtime.chansend
0.07s 0.18% 82.58% 1.97s 5.18% main.execute
0.07s 0.18% 82.76% 1.17s 3.08% main.req2res
0.07s 0.18% 82.95% 4.65s 12.24% net.(*conn).Read
0.07s 0.18% 83.13% 0.54s 1.42% runtime.entersyscall
0.07s 0.18% 83.32% 0.20s 0.53% runtime.goparkunlock
0.06s 0.16% 83.47% 6.26s 16.47% main.parse
0.06s 0.16% 83.63% 19.95s 52.50% net.(*netFD).Write
0.06s 0.16% 83.79% 0.28s 0.74% runtime.newarray
0.05s 0.13% 83.92% 20.44s 53.79% main.writeback.func1
0.05s 0.13% 84.05% 0.38s 1.00% net.(*pollDesc).Wait
0.05s 0.13% 84.18% 0.33s 0.87% net.runtime_pollWait
0.05s 0.13% 84.32% 19.20s 50.53% syscall.Write
0.04s 0.11% 84.42% 4.89s 12.87% bufio.(*Reader).Peek
0.04s 0.11% 84.53% 0.29s 0.76% main.execute.func1
0.04s 0.11% 84.63% 0.46s 1.21% runtime.chansend1
0.03s 0.079% 84.71% 0.41s 1.08% net.(*pollDesc).WaitRead
0.02s 0.053% 84.76% 0.52s 1.37% cuckood.Cuckoo.Get
0.02s 0.053% 84.82% 1.02s 2.68% runtime.chanrecv2
0.02s 0.053% 84.87% 3.35s 8.82% syscall.Read
0.01s 0.026% 84.89% 0.21s 0.55% cuckood.(*cmap).insert
0.01s 0.026% 84.92% 1.03s 2.71% main.writeback
0.01s 0.026% 84.95% 30.24s 79.58% runtime.goexit
0.01s 0.026% 84.97% 0.31s 0.82% runtime.newobject
0 0% 84.97% 0.29s 0.76% cuckood.Cuckoo.Set
0 0% 84.97% 0.22s 0.58% cuckood.Cuckoo.op
0 0% 84.97% 0.36s 0.95% github.com/dustin/gomemcached.(*MCResponse).Bytes
0 0% 84.97% 6.27s 16.50% main.handleConnection
0 0% 84.97% 6.26s 16.47% main.setup
0 0% 84.97% 7.52s 19.79% runtime._System
0 0% 84.97% 0.22s 0.58% runtime.backgroundgc
0 0% 84.97% 0.22s 0.58% runtime.gc
0 0% 84.97% 0.22s 0.58% runtime.gcDrain
0 0% 84.97% 3.33s 8.76% syscall.read
signature.asc

Dmitry Vyukov

Apr 22, 2015, 2:45:50 AM4/22/15
to Jon Gjengset, Naoki INADA, golang-nuts
On Tue, Apr 21, 2015 at 6:37 PM, Jon Gjengset <j...@thesquareplanet.com> wrote:
>> This is fixed on tip. And Jon says that the issue happens with 1.5
>> (presumably tip).
>
> Yes, I'm running tip (as of yesterday).
>
>> Could you post actual performance number and GOMAXPROCS?
>
> +---------+-----------+----------+
> | Threads | Memcached | Cucache |
> +---------+-----------+----------+
> | 1 | 90621/s | 57648/s |
> | 2 | 185000/s | 98456/s |
> | 4 | 258975/s | 173354/s |
> | 8 | 345890/s | 260459/s |
> | 16 | 260510/s | 230311/s |
> +---------+-----------+----------+

I don't see the issue with Go scalability. If anything, there is an
issue with memcached's scalability, as it scales worse.

Dmitry Vyukov

Apr 22, 2015, 2:52:00 AM4/22/15
to Jon Gjengset, Naoki INADA, golang-nuts
On Tue, Apr 21, 2015 at 6:37 PM, Jon Gjengset <j...@thesquareplanet.com> wrote:
I don't understand the problem with scalability.
Go is 36% slower in single-threaded mode. With 16 threads it is only
11% slower. So it scales much better than the C server.
You should be investigating performance with GOMAXPROCS=1.



> Block profile (pprof.Lookup("block").WriteTo("...", 2)):
>
> --- contention:
> cycles/second=2400085654

You also need to call runtime.SetBlockProfileRate(1)
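
Something along these lines, before the benchmark starts (a minimal sketch; the rate of 1 records every blocking event, and the file path is arbitrary):

package profiling

import (
	"os"
	"runtime"
	"runtime/pprof"
)

// EnableBlockProfiling makes the runtime record every blocking event.
// Call it once at startup, before the load is applied.
func EnableBlockProfiling() {
	runtime.SetBlockProfileRate(1)
}

// DumpBlockProfile writes the accumulated block profile in the
// human-readable (debug=2) form, as in the earlier pprof.Lookup call.
func DumpBlockProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return pprof.Lookup("block").WriteTo(f, 2)
}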

> Scheduler trace:
>
> SCHED 0ms: gomaxprocs=8 idleprocs=5 threads=6 spinningthreads=1 idlethreads=0 runqueue=0 [0 0 0 0 0 0 0 0]
> SCHED 1001ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=84 [17 0 21 18 0 12 25 14]
> SCHED 2010ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=73 [11 14 0 10 11 15 0 13]
> SCHED 3019ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=136 [5 0 0 0 0 1 0 0]
> SCHED 4028ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=128 [0 0 0 0 0 0 0 0]
> SCHED 5036ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=81 [0 5 5 3 5 4 6 8]
> SCHED 6045ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=65 [11 14 16 9 13 24 18 21]
> SCHED 7053ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=114 [0 16 1 3 0 0 0 1]
> SCHED 8061ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=178 [0 0 0 0 0 7 0 0]
> SCHED 9070ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=128 [2 0 0 0 0 0 0 0]
> SCHED 10079ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=128 [0 0 0 0 0 0 0 0]
> SCHED 11087ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=1 runqueue=111 [0 0 1 0 0 7 9 0]
> SCHED 12096ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=8 runqueue=101 [8 21 0 14 12 17 19 12]
> SCHED 13104ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=2 runqueue=78 [18 20 16 20 13 15 0 10]
> SCHED 14113ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=2 runqueue=94 [0 16 12 19 0 0 14 22]
> SCHED 15121ms: gomaxprocs=8 idleprocs=0 threads=14 spinningthreads=0 idlethreads=2 runqueue=113 [3 2 4 0 3 1 3 2]
> SCHED 16129ms: gomaxprocs=8 idleprocs=8 threads=14 spinningthreads=0 idlethreads=9 runqueue=0 [0 0 0 0 0 0 0 0]
> SCHED 17134ms: gomaxprocs=8 idleprocs=8 threads=14 spinningthreads=0 idlethreads=9 runqueue=0 [0 0 0 0 0 0 0 0]
> SCHED 18139ms: gomaxprocs=8 idleprocs=8 threads=14 spinningthreads=0 idlethreads=9 runqueue=0 [0 0 0 0 0 0 0 0]


Looks good to me. I see no issue. Most time is spent in Read/Write,
which is expected. A small fraction is spent in epoll, which is also
reasonable, and it is only called when the thread is otherwise out of
work anyway.

Naoki INADA

Apr 22, 2015, 3:09:16 AM4/22/15
to golan...@googlegroups.com, j...@thesquareplanet.com, songof...@gmail.com, dvy...@google.com
I profiled memcached and cucache on EC2 c4.8xlarge.
I can't find any problem: https://gist.github.com/methane/e15795b9f92471944e6d

Go's (and your program's) performance looks good to me, while it's worse than memcached's.
Go has some overhead around system calls for its scheduler.
Memcached (libevent) is lighter than Go's scheduler.

Dmitry Vyukov

Apr 22, 2015, 3:26:16 AM4/22/15
to Naoki INADA, golang-nuts, Jon Gjengset
On Wed, Apr 22, 2015 at 10:09 AM, Naoki INADA <songof...@gmail.com> wrote:
> I profiled memcached and cucache on EC2 c4.8xlarge.
> I can't find any problem.
> https://gist.github.com/methane/e15795b9f92471944e6d
>
> Go (and your program) performance looks good to me, while it's worse than
> memcached.
> Go has some overhead around system calls for it's schedular.
> Memcached (libevent) is lighter than Go's schedular.

Also, the Go compiler generates machine code that is significantly worse
than what C compilers produce. And we have bounds checks, write barriers,
and all that stuff.

Jon Gjengset

Apr 22, 2015, 11:59:22 AM4/22/15
to Naoki INADA, Dmitry Vyukov, golan...@googlegroups.com
Naoki INADA wrote:
> I profiled memcached and cucache on EC2 c4.8xlarge. I can't find any
> problem. https://gist.github.com/methane/e15795b9f92471944e6d

It's strange that epollwait does not show up there. I suspect it might
(as Dmitry pointed out earlier) be that more clients were simply needed
to keep the runtime busy. One thing that does stand out though is how Go
gets a bunch of read errors -- I don't know why that would be the case
when it's using epollwait?

> Go (and your program) performance looks good to me, while it's worse
> than memcached. Go has some overhead around system calls for it's
> schedular. Memcached (libevent) is lighter than Go's schedular.

It's really the performance difference that I'm interested in. I'm
trying to figure out exactly where it comes from, and if there's
anything at all I can do to minimize the overhead.

Dmitry Vyukov wrote:
> Also, Go compiler generates machine code that is significantly worse
> than C compilers. Also, we have bounds checks, write barriers and all
> that stuff.

Yes, I'm aware of this, and I knew going into it that Go would be slower
because of these things. I'm just trying to figure out which of these
are fundamental to any Go program, and which can be avoided by changing
my application.

> I don't understand the problem with scalability.
> Go is 36% slower in single-threaded mode. With 16 threads it is only
> 11% slower. So it scales much better than the C server.
> You should be investigating performance with GOMAXPROCS=1.

This is true. I've done some further multi-core experiments (see
attached results.png) that show Go scaling much better than memcached
does as the number of cores increases.

My concern at the moment is specifically that of raw performance *per
core* with fewer cores. In particular, when the number of cores is
small, memcached is ~2x faster. The latest profile (see attached
profile.svg) shows that time is divided roughly into:

- 67% syscall.Syscall
- 13% System
- 6% read/write overhead
- 4% channel send/receive
- 3% application code (hash table lookups)
- 2% serialization
- 2% sync.Pool
- 1% GC

The channel send/recv is surprising to me given that each channel has
only a single reader and a single writer. Maybe this is one place where
I can regain some performance by not using channels, but I'm not sure
it's worth the trade-off.
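
Before ripping the channels out, something like the following micro-benchmark could put a number on the per-message hand-off cost (a hypothetical sketch, not part of cucache; put it in a file such as handoff_test.go and run go test -bench=.):

package handoff

import "testing"

// BenchmarkChanHandoff measures pushing a value through a
// single-producer/single-consumer buffered channel pipeline, roughly
// one hand-off stage of the per-connection design.
func BenchmarkChanHandoff(b *testing.B) {
	in := make(chan int, 128)
	out := make(chan int, 128)

	// Consumer: forward every value from in to out.
	go func() {
		for v := range in {
			out <- v
		}
		close(out)
	}()

	b.ResetTimer()

	// Producer: feed b.N values into the pipeline.
	go func() {
		for i := 0; i < b.N; i++ {
			in <- i
		}
		close(in)
	}()

	// Drain, so the timed region covers the full hand-off.
	for range out {
	}
}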

> > Block profile (pprof.Lookup("block").WriteTo("...", 2)):
> >
> > --- contention:
> > cycles/second=2400085654
>
> You also need to call runtime.SetBlockProfileRate(1)

--- contention:
cycles/second=2411098391
12108794021628 1007507 @ 0x40930b 0x40394b 0x45bc51
# 0x40930b runtime.chanrecv2+0x2b /usr/local/src/go15/src/runtime/chan.go:340
# 0x40394b main.writeback+0xfb /home/am2/jfrg/dev/cuckood/src/cuckood/cucache/main.go:166

11194471237690 1007420 @ 0x40930b 0x403576 0x45bc51
# 0x40930b runtime.chanrecv2+0x2b /usr/local/src/go15/src/runtime/chan.go:340
# 0x403576 main.execute+0x76 /home/am2/jfrg/dev/cuckood/src/cuckood/cucache/main.go:131

72334849077 1 @ 0x4092db 0x405788 0x45bc51
# 0x4092db runtime.chanrecv1+0x2b /usr/local/src/go15/src/runtime/chan.go:336
# 0x405788 main.main.func2.1+0x58 /home/am2/jfrg/dev/cuckood/src/cuckood/cucache/main.go:88
profile.svg
results.png
signature.asc

Jon Gjengset

Apr 22, 2015, 8:32:22 PM4/22/15
to Naoki INADA, Dmitry Vyukov, golan...@googlegroups.com
FYI, the repository has moved here: https://github.com/jonhoo/cucache
signature.asc

Naoki INADA

Apr 24, 2015, 4:46:09 AM4/24/15
to golan...@googlegroups.com, songof...@gmail.com, j...@thesquareplanet.com, dvy...@google.com


On Thursday, April 23, 2015 at 12:59:22 AM UTC+9, Jon Gjengset wrote:
Naoki INADA wrote:
> I profiled memcached and cucache on EC2 c4.8xlarge. I can't find any
> problem. https://gist.github.com/methane/e15795b9f92471944e6d

It's strange that epollwait does not show up there. I suspect it might
(as Dmitry pointed out earlier) be that more clients were simply needed
to keep the runtime busy. One thing that does stand out though is how Go
gets a bunch of read errors -- I don't know why that would be the case
when it's using epollwait?

I started strace on cucache manually, then ran memtier_benchmark and stopped strace afterwards, so epoll_wait was simply waiting until I hit Ctrl-C.
I focused on how many times each syscall was called; don't look at the time column.
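
For reference, a per-syscall count summary like the one in the gist can be produced with something like this (an approximate invocation, not the exact command I used; the summary is printed when strace is interrupted):

strace -c -f -p $(pgrep cucache)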
 