high cpu consumption in syscall.Syscall


steve wang

Mar 5, 2014, 5:02:23 AM
to golan...@googlegroups.com
I profiled a program whose cpu consumption was high (up to 200%) and got the following top 10 list:
(pprof) top10
Total: 42868 samples
   12210  28.5%  28.5%    12263  28.6% syscall.Syscall
    3491   8.1%  36.6%     3491   8.1% runtime.futex
    2907   6.8%  43.4%     2907   6.8% runtime.xchg
    2878   6.7%  50.1%     2878   6.7% runtime.epollwait
    1682   3.9%  54.0%     1682   3.9% findrunnable
    1483   3.5%  57.5%     1483   3.5% runqsteal
     865   2.0%  59.5%      865   2.0% code.kingsoft.com/jx1/net.code
     644   1.5%  61.0%     1861   4.3% runtime.chanrecv
     621   1.4%  62.5%     2137   5.0% runtime.lock
     614   1.4%  63.9%     1672   3.9% runtime.mallocgc

Almost all of the syscall.Syscalls are from a function named 'system'.

I dug into the log produced by 'strace' and found many epoll_wait calls that look strange:
epoll_wait(6, {}, 128, 0) = 0

I run my program on linux/amd64 with Go 1.2.1.

Any advice?
Thanks in advance.

The files produced by strace and prof are attached.

20140305_155225.prof
bishop_strace.txt

Dave Cheney

Mar 5, 2014, 8:02:32 AM
to golan...@googlegroups.com
Could you use the `web` command in go tool pprof and post the result? A high syscall.Syscall time may not be a problem, depending on what your program is doing, which from a very cursory look is lots of network I/O and lots of memory allocation. At a guess, I think you've probably been playing with GOMAXPROCS as well.

steve wang

Mar 5, 2014, 8:32:25 AM
to golan...@googlegroups.com
Thanks for your comments. I will post the web graph tomorrow when I'm back at work.
To my surprise, epoll_wait is called in a way I didn't expect, that is, waiting for no events (the second parameter passed to epoll_wait).
Does that make sense?

Dave Cheney

Mar 5, 2014, 8:38:50 AM
to steve wang, golang-nuts
epoll_wait and select, used in those forms, are a convenient form of sleep. They are used in the runtime to drive the implementation of timers. Essentially: sleep for n microseconds (could be nanos), or until there is a wakeup event on a file descriptor, whichever is sooner.



steve wang

Mar 5, 2014, 8:54:05 AM
to golan...@googlegroups.com, steve wang
I'm confused.
An empty events argument (the second parameter) means waiting for no file descriptors, and the zero timeout makes epoll_wait return immediately.
How could this system call be used to drive timers or to wait for a file's readability or writability?
Could you please give more details?

Dmitry Vyukov

Mar 5, 2014, 8:58:17 AM
to steve wang, golang-nuts
On Wed, Mar 5, 2014 at 5:54 PM, steve wang <steve....@gmail.com> wrote:
> I'm confused.
> Empty events(the second parameter) means waiting for no file descriptors and
> the zero timeout parameter makes epoll_wait returns in no time.

epoll_wait waits for up to 128 events (the third parameter);
the second parameter is an output parameter, not an input.

steve wang

Mar 5, 2014, 9:19:06 AM
to golan...@googlegroups.com, steve wang
Thanks for your explanation.
I haven't used this API for a while and misunderstood it.

steve wang

Mar 5, 2014, 8:43:15 PM
to golan...@googlegroups.com
Attached is the .svg file.
The GOMAXPROCS was 24 on a system with 12 cores, each with 2 hyper-threads. 


bishop_2014030501.svg

Dmitry Vyukov

Mar 6, 2014, 1:21:03 AM
to steve wang, golang-nuts
I can't tell what that syscall.Syscall is...
Please also run the program under the perf profiler:
$ perf record -g ./mybin
$ perf report > report.txt
$ less report.txt

steve wang

Mar 6, 2014, 2:23:08 AM
to golan...@googlegroups.com, steve wang
Attached is the zipped report file.
The first command 'perf record -g ./mybin' didn't work on my CentOS 6 machine, so I replaced it with 'perf record -g fp ./mybin'.
The second command reported the message 'Failed to open [vsyscall], continuing without symbols'.
perfreport_20140306.zip

steve wang

Mar 6, 2014, 3:27:53 AM
to golan...@googlegroups.com, steve wang
Here is the profiling file made on Ubuntu 12.04 amd64, where the two commands you offered work well and no symbols are missing.


perfreport_20140306_02.7z

Dmitry Vyukov

Mar 6, 2014, 3:37:49 AM
to steve wang, golang-nuts
That seems to be just the write syscall; the top entry in the profile is:

22.51%  bishop  [e1000]  [k] e1000_tx_queue
        |
        --- e1000_tx_queue
            |
            |--99.99%-- e1000_xmit_frame
                        dev_hard_start_xmit
                        sch_direct_xmit
                        |
                        |--72.76%-- __qdisc_run
                                    |
                                    |--78.77%-- dev_queue_xmit
                                                ip_finish_output
                                                ip_output
                                                ip_local_out
                                                |
                                                |--99.98%-- ip_queue_xmit
                                                            tcp_transmit_skb
                                                            |
                                                            |--75.09%-- tcp_write_xmit
                                                                        __tcp_push_pending_frames
                                                                        |
                                                                        |--74.42%-- tcp_sendmsg
                                                                                    inet_sendmsg
                                                                                    do_sock_write.isra.10
                                                                                    sock_aio_write
                                                                                    do_sync_write
                                                                                    vfs_write
                                                                                    sys_write
                                                                                    system_call_fastpath
                                                                                    syscall.Syscall

22% is roughly the same as what the pprof profile says for syscall.Syscall,
and e1000_tx_queue is also rooted in syscall.Syscall.

switching to the e1000e driver may help to some degree:
http://lwn.net/Articles/278016/

or maybe you are just doing lots of very small writes; if so, try buffering them

steve wang

Mar 6, 2014, 3:50:53 AM
to golan...@googlegroups.com, steve wang
This profile was made on my virtual machine.
The first one I attached today was made on a real machine running CentOS 6, where I think the network card driver is not an issue.

Dmitry Vyukov

Mar 6, 2014, 5:21:50 AM
to steve wang, golang-nuts
On Thu, Mar 6, 2014 at 12:50 PM, steve wang <steve....@gmail.com> wrote:
> This profiling was made on my virtual machine.
> The first I attached today was made on real machine powered by centos6,
> where I think the network card driver is not an issue.


Where was the pprof profile made?

Dmitry Vyukov

Mar 6, 2014, 5:30:20 AM
to steve wang, golang-nuts
The first profile looks pretty reasonable:
     5.89%  bishop_y  bishop_y           [.] runtime.xchg
     3.34%  bishop_y  [kernel.kallsyms]  [k] tcp_sendmsg
     2.80%  bishop_y  bishop_y           [.] findrunnable
     2.40%  bishop_y  bishop_y           [.] runqsteal
     2.17%  bishop_y  [kernel.kallsyms]  [k] _spin_lock
     1.83%  bishop_y  bishop_y           [.] code.kingsoft.com/jx1/net.code
     1.62%  bishop_y  [kernel.kallsyms]  [k] _spin_lock_irqsave
     1.53%  bishop_y  bishop_y           [.] runtime.mallocgc
     1.39%  bishop_y  bishop_y           [.] runtime.chanrecv
     1.27%  bishop_y  [kernel.kallsyms]  [k] fget_light
     1.14%  bishop_y  [kernel.kallsyms]  [k] find_busiest_group
     1.11%  bishop_y  bishop_y           [.] runtime.slicecopy
     1.07%  bishop_y  bishop_y           [.] runtime.lock
     1.04%  bishop_y  [kernel.kallsyms]  [k] sock_aio_write


it looks like there is a shortage of work for 24 threads, so lots of
threads contend for a small amount of work, causing contention

Please run on the real machine:
$ GODEBUG=schedtrace=500 ./mybin

it will show what happens in the Go scheduler.

steve wang

Mar 6, 2014, 7:30:45 AM
to golan...@googlegroups.com, steve wang


On Thursday, March 6, 2014 6:30:20 PM UTC+8, Dmitry Vyukov wrote:

> it looks like there is a shortage of work for 24 threads, so lots of
> threads contend for a small amount of work, causing contention

In this case, what can I do to reduce the cpu consumption?
Should I decrease GOMAXPROCS by half or more?
 

> Please run on the real machine:
> $ GODEBUG=schedtrace=500 ./mybin
>
> it will show what happens in the Go scheduler.
 
I will run the test and post the report next Monday.

Dmitry Vyukov

Mar 6, 2014, 7:41:45 AM
to steve wang, golang-nuts
Let's first look at the schedtrace output.
You've said that it consumes 200%, so setting GOMAXPROCS to 2 or
even 1 may be sufficient.

steve wang

Mar 10, 2014, 3:35:41 AM
to golan...@googlegroups.com, steve wang
Here are the results of my experiments:
1. with GOMAXPROCS=2:
   cpu around 20~30%, occasionally 70~85%
2. with GOMAXPROCS=1:
   cpu around 20~50%, occasionally 80~100%
3. with GOMAXPROCS=24:
   cpu around 50~90%, occasionally over 300%

Schedtrace log for the 3rd case is attached.
schedtrace_141415.txt

Dmitry Vyukov

Mar 11, 2014, 2:28:35 AM
to steve wang, golang-nuts
On Mon, Mar 10, 2014 at 11:35 AM, steve wang <steve....@gmail.com> wrote:
> Here are the results of my experiments:
> 1.runs with 2 GOMAXPROCS:
> cpu around 20~30%, occasionally 70~85%
> 2.runs with 1 GOMAXPROCS:
> cpu around 20~50%, occasionally 80~100%
> 3.runs with 24 GOMAXPROCS:
> cpu around 50~90%, occasionally over 300%
>
> Schedtrace log for the 3rd case is attached.


The program is mostly idle; I would suggest setting GOMAXPROCS to 1 or 2.

steve wang

Mar 11, 2014, 4:34:27 AM
to golan...@googlegroups.com, steve wang
As the program is mostly idle, I can't understand why it consumes so much cpu time.
Is there something wrong in my code, or is it something to be improved in the runtime?
Is there any way to optimize my program?

Dmitry Vyukov

Mar 11, 2014, 4:52:00 AM
to steve wang, golang-nuts
Maybe there are some heuristics to tune.
But there is an unavoidable tradeoff between spending more CPU and
increasing latency. You give the program the right to use 24 CPUs to
speed itself up, and that's what it tries to do. The runtime does not
know whether there will be new work to do in the next nanosecond or
not...

steve wang

Mar 11, 2014, 5:05:43 AM
to golan...@googlegroups.com, steve wang
I once tried buffering the data to be sent until the buffer was full or a 10ms interval timer fired, but that didn't help much.

steve wang

Mar 11, 2014, 5:12:33 AM
to golan...@googlegroups.com, steve wang
In my case, the program needs to broadcast messages quite often (18 times per second) to many clients (700 at maximum in my test cases). My guess is that this behavior might be responsible for the busy cpu.

Dmitry Vyukov

Mar 11, 2014, 5:19:10 AM
to steve wang, golang-nuts
Does setting GOMAXPROCS=24 improve client-perceived latency?

Do you think there is something pathological in the Go runtime? If yes,
why? If no, do you just want to optimize the program?

steve wang

Mar 11, 2014, 5:40:45 AM
to golan...@googlegroups.com, steve wang
Different GOMAXPROCS values make no difference in terms of latency.

> Do you think there is something pathological in the Go runtime? If yes,
> why? If no, do you just want to optimize the program?

I think my program has room for improvement, but I can't find the hot spot in my code by profiling, which confuses me.
I have no answer as to what is responsible for the issue, which also bothers me.
Is setting GOMAXPROCS to 2 the final answer? Could you please give me some advice on further optimization?


Dmitry Vyukov

Mar 11, 2014, 6:15:33 AM
to steve wang, golang-nuts
There may be an issue with goroutine blocking/unblocking; it is difficult
to say without seeing the code.
Try looking at the blocking profile: call runtime.SetBlockProfileRate(1),
then runtime/pprof.Lookup("block").WriteTo(file, 0); you can use the
resulting file with pprof as usual.
In the CPU profile I see that gsCallback.deliver unblocks lots of
goroutines; maybe there is something to optimize.
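
A minimal sketch of how to enable and dump that block profile (the file name and the runServer placeholder are just illustrative):

package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	// A rate of 1 records every blocking event; use a larger value to
	// reduce overhead on a busy server.
	runtime.SetBlockProfileRate(1)

	runServer() // placeholder for the program's real work

	// Write the accumulated block profile; inspect it later with
	// `go tool pprof`.
	f, err := os.Create("block.prof")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	pprof.Lookup("block").WriteTo(f, 0)
}

func runServer() { /* placeholder for the actual workload */ }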

Edward Muller

Mar 11, 2014, 5:52:18 PM
to steve wang, golang-nuts
If you are using lots of high resolution timers and they fire even when there is no work to be done, your program is still doing work to process the timers and re-arm them. Generally speaking, I try to set up timeout timers only after the bit of work is queued, to ensure that my program is not "busy idling" and cycling timers.
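
A minimal sketch of that pattern, with hypothetical names (conn, enqueue, flush); the idea is that the flush timer only exists while there is something queued:

package main

import (
	"io"
	"os"
	"sync"
	"time"
)

type conn struct {
	mu         sync.Mutex
	w          io.Writer
	pending    []byte
	flushTimer *time.Timer
}

// enqueue buffers a message and arms a one-shot flush timer only when
// there is actually work pending; an idle connection keeps no timer
// running at all.
func (c *conn) enqueue(msg []byte) {
	c.mu.Lock()
	c.pending = append(c.pending, msg...)
	if c.flushTimer == nil {
		c.flushTimer = time.AfterFunc(10*time.Millisecond, c.flush)
	}
	c.mu.Unlock()
}

// flush writes out whatever is pending and disarms the timer.
func (c *conn) flush() {
	c.mu.Lock()
	buf := c.pending
	c.pending = nil
	c.flushTimer = nil
	c.mu.Unlock()
	if len(buf) > 0 {
		c.w.Write(buf) // error handling omitted for brevity
	}
}

func main() {
	c := &conn{w: os.Stdout}
	c.enqueue([]byte("hello\n"))
	time.Sleep(20 * time.Millisecond) // give the flush timer time to fire
}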



Dave Cheney

Mar 11, 2014, 6:47:53 PM
to Edward Muller, steve wang, golang-nuts
Is the source of your program available ?

steve wang

Mar 11, 2014, 11:35:29 PM
to golan...@googlegroups.com, steve wang
Attached are the svg files of block and cpu profiling for the same run with GOMAXPROCS set to 24.

> In the CPU profile I see that gsCallback.deliver unblocks lots of
> goroutines; maybe there is something to optimize.

gsCallback.deliver wakes up hundreds of goroutines that are waiting on their respective channels for the message to be sent to the client.
It is exactly this behavior that unblocks lots of goroutines, but it's a normal mode of message broadcast, isn't it?
block_20140312.svg
cpu_20140312.svg

Dave Cheney

Mar 11, 2014, 11:39:11 PM
to steve wang, golang-nuts
There is no buffering when you are writing to network sockets. That
may explain the high syscall.Syscall usage coming from syscall.Write.

steve wang

Mar 11, 2014, 11:43:24 PM
to golan...@googlegroups.com, steve wang
I tried that, but those high resolution timers cost too much cpu time (around 20~30%), so I abandoned the idea.

steve wang

Mar 11, 2014, 11:51:13 PM
to golan...@googlegroups.com, steve wang
I once tried buffering the messages to be sent and using a 10ms timer to trigger the writing action, which made little difference in terms of cpu time.
I kept the total number of timers around 10, which I believe is not too many.

Dave Cheney

Mar 11, 2014, 11:57:33 PM
to steve wang, golan...@googlegroups.com
Still waiting on seeing some code, Steve.

steve wang

Mar 12, 2014, 12:16:36 AM
to golan...@googlegroups.com, steve wang
Sorry, I thought you were speaking to Edward.
The source code is from my company's code base and I'm not allowed to post it, but profiling reports and details of the implementation are available.

Dmitry Vyukov

Mar 12, 2014, 1:41:47 AM
to steve wang, golang-nuts
Yes, it's normal in this case.

I don't know. It all looks pretty reasonable.

Konstantin Khomoutov

Mar 12, 2014, 5:43:07 AM
to steve wang, golan...@googlegroups.com
On Tue, 11 Mar 2014 20:51:13 -0700 (PDT)
steve wang <steve....@gmail.com> wrote:

> > There is no buffering when you are writing to network sockets. That
> > may explain the high syscall.Syscall usage coming from
> > syscall.Write.
> >
> I once tried buffering those messages to be sent and using a timer at
> intervals of 10ms to trigger the writing action, which makes little
> difference in terms of cpu time cost.
> I kept the total number of timers around 10, which I believe is not
> too many.

You could try wrapping your sockets in bufio.ReadWriter-s so that not
every attempt to write something to a socket actually hits the syscall.
Note that this approach requires some tweaking of your code, as you
will have to call Flush() at key moments on your instances of
bufio.ReadWriter, otherwise they could hold on to unsent data
indefinitely. You could do that, say, after every Nth message written,
or try limiting the writer's buffer (using bufio.NewWriterSize).

[...]
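
A minimal sketch of that approach, flushing after every Nth message (the client type, flushEvery constant, and dial address are hypothetical):

package main

import (
	"bufio"
	"fmt"
	"net"
)

const flushEvery = 16 // arbitrary: flush after every 16th message

type client struct {
	w     *bufio.Writer
	count int
}

func newClient(conn net.Conn) *client {
	// 32 KiB is an arbitrary buffer size; tune it to your message sizes.
	return &client{w: bufio.NewWriterSize(conn, 32*1024)}
}

// send buffers the message and only hits the write syscall when the
// buffer fills up or every flushEvery messages.
func (c *client) send(msg []byte) error {
	if _, err := c.w.Write(msg); err != nil {
		return err
	}
	c.count++
	if c.count%flushEvery == 0 {
		return c.w.Flush()
	}
	return nil
}

func main() {
	conn, err := net.Dial("tcp", "localhost:9000") // hypothetical address
	if err != nil {
		fmt.Println("dial:", err)
		return
	}
	defer conn.Close()

	c := newClient(conn)
	for i := 0; i < 100; i++ {
		if err := c.send([]byte("message\n")); err != nil {
			fmt.Println("send:", err)
			return
		}
	}
	c.w.Flush() // flush whatever is left before closing
}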

steve wang

Mar 12, 2014, 8:19:27 AM
to golan...@googlegroups.com, steve wang
It seems that waking up lots of goroutines via channel sends costs a lot of cpu time.
This program[1] takes 60%+ cpu time, which surprises me.

Dave Cheney

Mar 12, 2014, 8:25:55 AM
to steve wang, golan...@googlegroups.com
Of course that has a measurable cost; that program does nothing but wake up channels.

Sending on a channel is cheap, but it is not free. Using channels has a cost, especially when you insist on bumping GOMAXPROCS, so the cost of the computation that you offload to another goroutine must be at least larger than the cost of just doing the work yourself, sequentially.



steve wang

Mar 12, 2014, 8:29:50 AM
to golan...@googlegroups.com, steve wang
Thanks for your advice.
Buffering will definitely improve performance, but for now it doesn't seem to be the hot spot. I tried what you suggested, and it didn't help much.
Perhaps the problem lies in waking up lots of goroutines to broadcast messages.
I have written a simple program at http://play.golang.org/p/NYlhGfsfP1, trying to reproduce the issue.
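
Not the linked playground program, but a rough sketch of the pattern it describes (one broadcaster waking ~700 goroutines 18 times per second), in case it helps frame the discussion:

package main

import (
	"sync"
	"time"
)

func main() {
	const clients = 700    // roughly the number of connections mentioned above
	const rounds = 18 * 60 // broadcast 18 times per second for about a minute

	var wg sync.WaitGroup
	chans := make([]chan []byte, clients)
	for i := range chans {
		ch := make(chan []byte, 1)
		chans[i] = ch
		wg.Add(1)
		go func() {
			defer wg.Done()
			for msg := range ch {
				_ = msg // a real client would write msg to its TCP connection here
			}
		}()
	}

	tick := time.NewTicker(time.Second / 18)
	defer tick.Stop()
	msg := make([]byte, 128)
	for i := 0; i < rounds; i++ {
		<-tick.C
		for _, ch := range chans {
			ch <- msg
		}
	}
	for _, ch := range chans {
		close(ch)
	}
	wg.Wait()
}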

  

steve wang

Mar 12, 2014, 8:44:47 AM
to golan...@googlegroups.com, steve wang
That being the case, what is the idiomatic way to broadcast messages to many tcp connections with the least latency and cpu time?

Henrik Johansson

Mar 12, 2014, 9:04:27 AM
to steve wang, golang-nuts
Using channels seems like the proper way to do it, but if you can do some bucketing it might make it both simpler and faster.
Maybe it can be one channel/goroutine's job to send to a number of connections?

Measure and optimize accordingly, I guess.
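
A minimal sketch of that bucketing idea, with hypothetical names (bucket, broadcast): each goroutine owns a slice of connections, so a broadcast wakes one goroutine per bucket rather than one per connection:

package fanout // hypothetical package name

import (
	"net"
)

// bucket owns a group of connections; a single goroutine drains its
// channel and writes each broadcast message to every connection it owns.
type bucket struct {
	conns []net.Conn
	in    chan []byte
}

func newBucket(conns []net.Conn) *bucket {
	b := &bucket{conns: conns, in: make(chan []byte, 16)}
	go b.run()
	return b
}

func (b *bucket) run() {
	for msg := range b.in {
		for _, c := range b.conns {
			c.Write(msg) // error handling omitted for brevity
		}
	}
}

// broadcast hands the message to every bucket; with 700 connections split
// into, say, 10 buckets, only 10 goroutines are woken per broadcast
// instead of 700.
func broadcast(buckets []*bucket, msg []byte) {
	for _, b := range buckets {
		b.in <- msg
	}
}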



Dmitry Vyukov

Mar 12, 2014, 12:25:58 PM
to steve wang, golang-nuts
There are several 'broadcasting' primitives in Go, e.g. closing a
channel unblocks all receivers, as do sync.WaitGroup.Done/Wait and
sync.RWMutex.Unlock/RLock. But currently they are all implemented as N
independent signals, each waking a single goroutine. So there is
nothing good I can suggest. Probably there is some room for
optimization here.
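
For illustration, here is the close-of-a-channel form of broadcast mentioned above (a small self-contained example; under the hood it still wakes each waiting goroutine individually):

package main

import (
	"fmt"
	"sync"
)

func main() {
	start := make(chan struct{})
	var wg sync.WaitGroup

	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			<-start // every goroutine blocks here until the channel is closed
			fmt.Println("worker", id, "released")
		}(i)
	}

	close(start) // a single close releases all receivers at once
	wg.Wait()
}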

Edward Muller

Mar 12, 2014, 5:41:54 PM
to steve wang, golang-nuts
I would also question why you need to wake up a lot of goroutines. It seems like you are using goroutines/channels/timeouts backwards.

Without actual code around the problem, though, I'm just speculating.



steve wang

Mar 12, 2014, 10:39:01 PM
to golan...@googlegroups.com, steve wang
To broadcast messages to lots of tcp connections, I have come up with three ways so far:
1. send messages to lots of goroutines which are waiting on their respective channels, each of which then calls *net.TCPConn.Write
2. call *net.TCPConn.Write on each connection in a loop
3. write messages to buffers from which they will be sent later

The first way looks, in my opinion, idiomatic in Go, but causes performance issues.
The second may block for an unknown period of time on certain connections, which is unacceptable to me.
The third I am still exploring and have not reached a conclusion on, but it doesn't look idiomatic enough.
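
A minimal sketch of the first approach, with one twist of my own (hypothetical names subscriber/broadcast): give each per-connection channel a small buffer and skip clients whose buffers are full, so the broadcaster never blocks on a slow connection, which also addresses the concern with the second approach:

package pubsub // hypothetical package name

import (
	"net"
)

type subscriber struct {
	conn net.Conn
	ch   chan []byte
}

func newSubscriber(conn net.Conn) *subscriber {
	s := &subscriber{conn: conn, ch: make(chan []byte, 32)}
	go s.writeLoop()
	return s
}

// writeLoop is the only goroutine that touches the connection.
func (s *subscriber) writeLoop() {
	for msg := range s.ch {
		if _, err := s.conn.Write(msg); err != nil {
			return // real code would also unregister the subscriber here
		}
	}
}

// broadcast hands the message to every subscriber without blocking:
// if a subscriber's buffer is full, that message is dropped for it.
func broadcast(subs []*subscriber, msg []byte) {
	for _, s := range subs {
		select {
		case s.ch <- msg:
		default:
			// subscriber too slow; dropping keeps the broadcaster responsive
		}
	}
}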

Dave Cheney

Mar 12, 2014, 10:45:48 PM
to steve wang, golang-nuts
On Thu, Mar 13, 2014 at 1:39 PM, steve wang <steve....@gmail.com> wrote:
> To broadcast messages among lots of tcp connections, I got three ways by
> far:
> 1.send messages to lots of goroutines which are waiting on their respective
> channels and then call *net.TCPConn.Write

I think this is the best solution.

Compared to the cost of sending data over the network, the cost of
waking up the goroutine is very small. I think you are focusing on the
wrong area of your code.