high cpu consumption in syscall.Syscall


steve wang

Mar 5, 2014, 5:02:23 AM
to golan...@googlegroups.com
I profiled a program whose cpu consumption was high (up to 200%) and got the following top 10 list:
(pprof) top10
Total: 42868 samples
   12210  28.5%  28.5%    12263  28.6% syscall.Syscall
    3491   8.1%  36.6%     3491   8.1% runtime.futex
    2907   6.8%  43.4%     2907   6.8% runtime.xchg
    2878   6.7%  50.1%     2878   6.7% runtime.epollwait
    1682   3.9%  54.0%     1682   3.9% findrunnable
    1483   3.5%  57.5%     1483   3.5% runqsteal
     865   2.0%  59.5%      865   2.0% code.kingsoft.com/jx1/net.code
     644   1.5%  61.0%     1861   4.3% runtime.chanrecv
     621   1.4%  62.5%     2137   5.0% runtime.lock
     614   1.4%  63.9%     1672   3.9% runtime.mallocgc

Almost all of the syscall.Syscalls are from a function named 'system'.

I dug into the log produced by 'strace' and found many epoll_wait calls that look strange:
epoll_wait(6, {}, 128, 0) = 0

I run my program on linux/amd64 with Go 1.2.1.

Any advice?
Thanks in advance.

The files produced by strace and prof are attached.

20140305_155225.prof
bishop_strace.txt

Dave Cheney

Mar 5, 2014, 8:02:32 AM
to golan...@googlegroups.com
Could you use the `web` command in go tool pprof and post the result? A high syscall.Syscall time may not be a problem, depending on what your program is doing, which from a very cursory look is lots of network I/O and lots of memory allocation. At a guess, I think you've probably been playing with GOMAXPROCS as well.

steve wang

Mar 5, 2014, 8:32:25 AM
to golan...@googlegroups.com
Thanks for your comments. I will post the web graph tomorrow when I'm back at work.
To my surprise, epoll_wait is called in a way I didn't expect, that is, waiting for no events (the second parameter passed to epoll_wait).
Does that make sense?

Dave Cheney

Mar 5, 2014, 8:38:50 AM
to steve wang, golang-nuts
epoll_wait and select, used in those forms, are a convenient form of sleep. They are used in the runtime to drive the implementation of timers. Essentially: sleep for n microseconds (could be nanos), or until there is a wakeup event on a file descriptor, whichever is sooner.



steve wang

Mar 5, 2014, 8:54:05 AM
to golan...@googlegroups.com, steve wang
I'm confused.
An empty events argument (the second parameter) means waiting for no file descriptors, and the zero timeout makes epoll_wait return immediately.
How could this system call be used to drive timers or to wait for a file's readability or writability?
Could you please give more details?

Dmitry Vyukov

Mar 5, 2014, 8:58:17 AM
to steve wang, golang-nuts
On Wed, Mar 5, 2014 at 5:54 PM, steve wang <steve....@gmail.com> wrote:
> I'm confused.
> Empty events(the second parameter) means waiting for no file descriptors and
> the zero timeout parameter makes epoll_wait returns in no time.

epoll_wait waits for up to 128 events (the third parameter);
the second parameter is an output parameter, not an input.

steve wang

Mar 5, 2014, 9:19:06 AM
to golan...@googlegroups.com, steve wang
Thanks for your explanation.
I haven't used this API for a while and misunderstood it.

steve wang

Mar 5, 2014, 8:43:15 PM
to golan...@googlegroups.com
Attached is the .svg file.
The GOMAXPROCS was 24 on a system with 12 cores, each with 2 hyper-threads. 


bishop_2014030501.svg

Dmitry Vyukov

Mar 6, 2014, 1:21:03 AM
to steve wang, golang-nuts
I can't tell what that syscall.Syscall is...
Please also run the program under the perf profiler:
$ perf record -g ./mybin
$ perf report > report.txt
$ less report.txt

steve wang

Mar 6, 2014, 2:23:08 AM
to golan...@googlegroups.com, steve wang
Attached is the zipped report file.
The first command 'perf record -g ./mybin' didn't work on my CentOS 6 machine, so I replaced it with 'perf record -g fp ./mybin'.
The second command reported the message 'Failed to open [vsyscall], continuing without symbols'.
perfreport_20140306.zip

steve wang

Mar 6, 2014, 3:27:53 AM
to golan...@googlegroups.com, steve wang
Here is the profiling file made on Ubuntu 12.04 amd64, where the two commands you offered work well and no symbols are missing.


perfreport_20140306_02.7z

Dmitry Vyukov

Mar 6, 2014, 3:37:49 AM
to steve wang, golang-nuts
That seems to be just the write syscall; the top entry in the profile is:

22.51%  bishop  [e1000]  [k] e1000_tx_queue
        |
        --- e1000_tx_queue
            |
            |--99.99%-- e1000_xmit_frame
                        dev_hard_start_xmit
                        sch_direct_xmit
                        |
                        |--72.76%-- __qdisc_run
                                    |
                                    |--78.77%-- dev_queue_xmit
                                                ip_finish_output
                                                ip_output
                                                ip_local_out
                                                |
                                                |--99.98%-- ip_queue_xmit
                                                            tcp_transmit_skb
                                                            |
                                                            |--75.09%-- tcp_write_xmit
                                                                        __tcp_push_pending_frames
                                                                        |
                                                                        |--74.42%-- tcp_sendmsg
                                                                                    inet_sendmsg
                                                                                    do_sock_write.isra.10
                                                                                    sock_aio_write
                                                                                    do_sync_write
                                                                                    vfs_write
                                                                                    sys_write
                                                                                    system_call_fastpath
                                                                                    syscall.Syscall

22% is roughly the same as what the pprof profile says for syscall.Syscall,
and e1000_tx_queue is also rooted in syscall.Syscall.

switching to the e1000e driver may help to some degree:
http://lwn.net/Articles/278016/

or maybe you are just doing lots of very small writes; if so, try buffering them

steve wang

Mar 6, 2014, 3:50:53 AM
to golan...@googlegroups.com, steve wang
This profile was made on my virtual machine.
The first one I attached today was made on a real machine running CentOS 6, where I think the network card driver is not an issue.

Dmitry Vyukov

Mar 6, 2014, 5:21:50 AM
to steve wang, golang-nuts
On Thu, Mar 6, 2014 at 12:50 PM, steve wang <steve....@gmail.com> wrote:
> This profiling was made on my virtual machine.
> The first I attached today was made on real machine powered by centos6,
> where I think the network card driver is not an issue.


Where was the pprof profile made?

Dmitry Vyukov

Mar 6, 2014, 5:30:20 AM
to steve wang, golang-nuts
The first profile looks pretty reasonable:
     5.89%  bishop_y  bishop_y           [.] runtime.xchg
     3.34%  bishop_y  [kernel.kallsyms]  [k] tcp_sendmsg
     2.80%  bishop_y  bishop_y           [.] findrunnable
     2.40%  bishop_y  bishop_y           [.] runqsteal
     2.17%  bishop_y  [kernel.kallsyms]  [k] _spin_lock
     1.83%  bishop_y  bishop_y           [.] code.kingsoft.com/jx1/net.code
     1.62%  bishop_y  [kernel.kallsyms]  [k] _spin_lock_irqsave
     1.53%  bishop_y  bishop_y           [.] runtime.mallocgc
     1.39%  bishop_y  bishop_y           [.] runtime.chanrecv
     1.27%  bishop_y  [kernel.kallsyms]  [k] fget_light
     1.14%  bishop_y  [kernel.kallsyms]  [k] find_busiest_group
     1.11%  bishop_y  bishop_y           [.] runtime.slicecopy
     1.07%  bishop_y  bishop_y           [.] runtime.lock
     1.04%  bishop_y  [kernel.kallsyms]  [k] sock_aio_write


it looks like there is a shortage of work for 24 threads, so lots of
threads contend for a small amount of work, causing contention

Please run on the real machine:
$ GODEBUG=schedtrace=500 ./mybin

it will show what happens in the Go scheduler.

steve wang

Mar 6, 2014, 7:30:45 AM
to golan...@googlegroups.com, steve wang


On Thursday, March 6, 2014 6:30:20 PM UTC+8, Dmitry Vyukov wrote:

> it looks like there is a shortage of work for 24 threads, so lots of
> threads contend for a small amount of work, causing contention

In this case, what can I do to reduce the cpu consumption?
Should I decrease GOMAXPROCS by half or more?
 

> Please run on the real machine:
> $ GODEBUG=schedtrace=500 ./mybin
>
> it will show what happens in the Go scheduler.
 
I will run the test and post the report next Monday.

Dmitry Vyukov

Mar 6, 2014, 7:41:45 AM
to steve wang, golang-nuts
Let's first look at the schedtrace output.
You've said that it consumes 200%, so setting GOMAXPROCS to 2 or
even 1 may be sufficient.

steve wang

Mar 10, 2014, 3:35:41 AM
to golan...@googlegroups.com, steve wang
Here are the results of my experiments:
1. with GOMAXPROCS=2:
   cpu around 20~30%, occasionally 70~85%
2. with GOMAXPROCS=1:
   cpu around 20~50%, occasionally 80~100%
3. with GOMAXPROCS=24:
   cpu around 50~90%, occasionally over 300%

Schedtrace log for the 3rd case is attached.
schedtrace_141415.txt

Dmitry Vyukov

Mar 11, 2014, 2:28:35 AM
to steve wang, golang-nuts
On Mon, Mar 10, 2014 at 11:35 AM, steve wang <steve....@gmail.com> wrote:
> Here are the results of my experiments:
> 1.runs with 2 GOMAXPROCS:
> cpu around 20~30%, occasionally 70~85%
> 2.runs with 1 GOMAXPROCS:
> cpu around 20~50%, occasionally 80~100%
> 3.runs with 24 GOMAXPROCS:
> cpu around 50~90%, occasionally over 300%
>
> Schedtrace log for the 3rd case is attached.


The program is mostly idle; I would suggest setting GOMAXPROCS to 1 or 2.

steve wang

Mar 11, 2014, 4:34:27 AM
to golan...@googlegroups.com, steve wang
As the program is mostly idle, I can't understand why it consumes so much cpu time.
Is there something wrong in my code, or is it something to be improved in the runtime?
Is there any way to optimize my program?

Dmitry Vyukov

Mar 11, 2014, 4:52:00 AM
to steve wang, golang-nuts
Maybe there are some heuristics to tune.
But there is an unavoidable tradeoff between spending more CPU and
increasing latency. You give the program the right to use 24 CPUs to
speed itself up, and that's what it tries to do. The runtime does not
know whether there will be new work to do in the next nanosecond or
not...

steve wang

Mar 11, 2014, 5:05:43 AM
to golan...@googlegroups.com, steve wang
I once tried buffering the data to be sent until the buffer was full or a 10ms interval timer fired, but that didn't help much.

steve wang

Mar 11, 2014, 5:12:33 AM
to golan...@googlegroups.com, steve wang
In my case, the program needs to broadcast messages quite often (18 times per second) to many clients (700 at maximum in my test cases). My guess is that this behavior might be responsible for the busy cpu.

Dmitry Vyukov

Mar 11, 2014, 5:19:10 AM
to steve wang, golang-nuts
Does setting GOMAXPROCS=24 improve client-perceived latency?

Do you think there is something pathological in the Go runtime? If yes,
why? If no, do you just want to optimize the program?

steve wang

Mar 11, 2014, 5:40:45 AM
to golan...@googlegroups.com, steve wang
Different GOMAXPROCS values make no difference in terms of latency.

> Do you think there is something pathological in the Go runtime? If yes,
> why? If no, do you just want to optimize the program?

I think my program has room for improvement, but I can't find the hot spot in my code by profiling, which confuses me.
I have no answer as to what is responsible for the issue, which also bothers me.
Is setting GOMAXPROCS to 2 the final answer? Could you please give me some advice on further optimization?


Dmitry Vyukov

Mar 11, 2014, 6:15:33 AM
to steve wang, golang-nuts
There may be an issue with goroutine blocking/unblocking; it is difficult
to say without seeing the code.
Try looking at the blocking profile: call runtime.SetBlockProfileRate(1),
then runtime/pprof.Lookup("block").WriteTo(file, 0); you can use the
resulting file with pprof as usual.
In the CPU profile I see that gsCallback.deliver unblocks lots of
goroutines; maybe there is something to optimize.
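
A minimal sketch of how to enable and dump that block profile (the file name and the runServer placeholder are just illustrative):

package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	// A rate of 1 records every blocking event; use a larger value to
	// reduce overhead on a busy server.
	runtime.SetBlockProfileRate(1)

	runServer() // placeholder for the program's real work

	// Write the accumulated block profile; inspect it later with
	// `go tool pprof`.
	f, err := os.Create("block.prof")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	pprof.Lookup("block").WriteTo(f, 0)
}

func runServer() { /* placeholder for the actual workload */ }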

Edward Muller

Mar 11, 2014, 5:52:18 PM
to steve wang, golang-nuts
If you are using lots of high resolution timers and they fire even when there is no work to be done, your program is still doing work to process the timers and re-arm them. Generally speaking, I try to set up timeout timers only after the bit of work is queued, to ensure that my program is not "busy idling" and cycling timers.
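
A minimal sketch of that pattern, with hypothetical names (conn, enqueue, flush); the idea is that the flush timer only exists while there is something queued:

package main

import (
	"io"
	"os"
	"sync"
	"time"
)

type conn struct {
	mu         sync.Mutex
	w          io.Writer
	pending    []byte
	flushTimer *time.Timer
}

// enqueue buffers a message and arms a one-shot flush timer only when
// there is actually work pending; an idle connection keeps no timer
// running at all.
func (c *conn) enqueue(msg []byte) {
	c.mu.Lock()
	c.pending = append(c.pending, msg...)
	if c.flushTimer == nil {
		c.flushTimer = time.AfterFunc(10*time.Millisecond, c.flush)
	}
	c.mu.Unlock()
}

// flush writes out whatever is pending and disarms the timer.
func (c *conn) flush() {
	c.mu.Lock()
	buf := c.pending
	c.pending = nil
	c.flushTimer = nil
	c.mu.Unlock()
	if len(buf) > 0 {
		c.w.Write(buf) // error handling omitted for brevity
	}
}

func main() {
	c := &conn{w: os.Stdout}
	c.enqueue([]byte("hello\n"))
	time.Sleep(20 * time.Millisecond) // give the flush timer time to fire
}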



Dave Cheney

Mar 11, 2014, 6:47:53 PM
to Edward Muller, steve wang, golang-nuts
Is the source of your program available ?

steve wang

Mar 11, 2014, 11:35:29 PM
to golan...@googlegroups.com, steve wang
Attached are the svg files of block and cpu profiling for the same run with GOMAXPROCS set to 24.

> In the CPU profile I see that gsCallback.deliver unblocks lots of
> goroutines; maybe there is something to optimize.

gsCallback.deliver wakes up hundreds of goroutines that are waiting on their respective channels for the message to be sent to the client.
It is exactly this behavior that unblocks lots of goroutines, but it's a normal mode of message broadcast, isn't it?
block_20140312.svg
cpu_20140312.svg

Dave Cheney

Mar 11, 2014, 11:39:11 PM
to steve wang, golang-nuts
There is no buffering when you are writing to network sockets. That
may explain the high syscall.Syscall usage coming from syscall.Write.

steve wang

Mar 11, 2014, 11:43:24 PM
to golan...@googlegroups.com, steve wang
I tried that, but those high resolution timers cost too much cpu time (around 20~30%), so I abandoned the idea.

steve wang

Mar 11, 2014, 11:51:13 PM
to golan...@googlegroups.com, steve wang
I once tried buffering the messages to be sent and using a 10ms timer to trigger the writing action, which made little difference in terms of cpu time.
I kept the total number of timers around 10, which I believe is not too many.

Dave Cheney

Mar 11, 2014, 11:57:33 PM
to steve wang, golan...@googlegroups.com
Still waiting on seeing some code, Steve.

steve wang

Mar 12, 2014, 12:16:36 AM
to golan...@googlegroups.com, steve wang
Sorry, I thought you were speaking to Edward.
The source code is from my company's code base and I'm not allowed to post it, but profiling reports and details of the implementation are available.

Dmitry Vyukov

Mar 12, 2014, 1:41:47 AM
to steve wang, golang-nuts
Yes, it's normal in this case.

I don't know. It all looks pretty reasonable.

Konstantin Khomoutov

Mar 12, 2014, 5:43:07 AM
to steve wang, golan...@googlegroups.com
On Tue, 11 Mar 2014 20:51:13 -0700 (PDT)
steve wang <steve....@gmail.com> wrote:

> > There is no buffering when you are writing to network sockets. That
> > may explain the high syscall.Syscall usage coming from
> > syscall.Write.
> >
> I once tried buffering those messages to be sent and using a timer at
> intervals of 10ms to trigger the writing action, which makes little
> difference in terms of cpu time cost.
> I kept the total number of timers around 10, which I believe is not
> too many.

You could try wrapping your sockets in bufio.ReadWriter-s so that not
every attempt to write something to a socket actually hits the syscall.
Note that this approach requires some tweaking of your code, as you
will have to call Flush() at key moments on your instances of
bufio.ReadWriter, otherwise they could hold on to unsent data
indefinitely. You could do that, say, after every Nth message written,
or try limiting the writer's buffer (using bufio.NewWriterSize).

[...]
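
A minimal sketch of that approach, flushing after every Nth message (the client type, flushEvery constant, and dial address are hypothetical):

package main

import (
	"bufio"
	"fmt"
	"net"
)

const flushEvery = 16 // arbitrary: flush after every 16th message

type client struct {
	w     *bufio.Writer
	count int
}

func newClient(conn net.Conn) *client {
	// 32 KiB is an arbitrary buffer size; tune it to your message sizes.
	return &client{w: bufio.NewWriterSize(conn, 32*1024)}
}

// send buffers the message and only hits the write syscall when the
// buffer fills up or every flushEvery messages.
func (c *client) send(msg []byte) error {
	if _, err := c.w.Write(msg); err != nil {
		return err
	}
	c.count++
	if c.count%flushEvery == 0 {
		return c.w.Flush()
	}
	return nil
}

func main() {
	conn, err := net.Dial("tcp", "localhost:9000") // hypothetical address
	if err != nil {
		fmt.Println("dial:", err)
		return
	}
	defer conn.Close()

	c := newClient(conn)
	for i := 0; i < 100; i++ {
		if err := c.send([]byte("message\n")); err != nil {
			fmt.Println("send:", err)
			return
		}
	}
	c.w.Flush() // flush whatever is left before closing
}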

steve wang

Mar 12, 2014, 8:19:27 AM
to golan...@googlegroups.com, steve wang
It seems that waking up lots of goroutines via channel sends costs a lot of cpu time.
This program[1] takes 60%+ cpu time, which surprises me.

Dave Cheney

Mar 12, 2014, 8:25:55 AM
to steve wang, golan...@googlegroups.com
Of course that has a measurable cost; that program does nothing but wake up channels.

Sending on a channel is cheap, but it is not free. Using channels has a cost, especially when you insist on bumping GOMAXPROCS, so the cost of the computation that you offload to another goroutine must be at least larger than the cost of just doing the work yourself, sequentially.



steve wang

Mar 12, 2014, 8:29:50 AM
to golan...@googlegroups.com, steve wang
Thanks for your advice.
Buffering will definitely improve performance, but for now it doesn't seem to be the hot spot. I tried what you suggested, and it didn't help much.
Perhaps the problem lies in waking up lots of goroutines to broadcast messages.
I have written a simple program at http://play.golang.org/p/NYlhGfsfP1, trying to reproduce the issue.
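
Not the linked playground program, but a rough sketch of the pattern it describes (one broadcaster waking ~700 goroutines 18 times per second), in case it helps frame the discussion:

package main

import (
	"sync"
	"time"
)

func main() {
	const clients = 700    // roughly the number of connections mentioned above
	const rounds = 18 * 60 // broadcast 18 times per second for about a minute

	var wg sync.WaitGroup
	chans := make([]chan []byte, clients)
	for i := range chans {
		ch := make(chan []byte, 1)
		chans[i] = ch
		wg.Add(1)
		go func() {
			defer wg.Done()
			for msg := range ch {
				_ = msg // a real client would write msg to its TCP connection here
			}
		}()
	}

	tick := time.NewTicker(time.Second / 18)
	defer tick.Stop()
	msg := make([]byte, 128)
	for i := 0; i < rounds; i++ {
		<-tick.C
		for _, ch := range chans {
			ch <- msg
		}
	}
	for _, ch := range chans {
		close(ch)
	}
	wg.Wait()
}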

  

steve wang

Mar 12, 2014, 8:44:47 AM
to golan...@googlegroups.com, steve wang
That being the case, what is the idiomatic way to broadcast messages to many tcp connections with the least latency and cpu time?

Henrik Johansson

Mar 12, 2014, 9:04:27 AM
to steve wang, golang-nuts
Using channels seems like the proper way to do it, but if you can do some bucketing it might make it both simpler and faster.
Maybe it can be one channel/goroutine's job to send to a number of connections?

Measure and optimize accordingly, I guess.
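
A minimal sketch of that bucketing idea, with hypothetical names (bucket, broadcast): each goroutine owns a slice of connections, so a broadcast wakes one goroutine per bucket rather than one per connection:

package fanout // hypothetical package name

import (
	"net"
)

// bucket owns a group of connections; a single goroutine drains its
// channel and writes each broadcast message to every connection it owns.
type bucket struct {
	conns []net.Conn
	in    chan []byte
}

func newBucket(conns []net.Conn) *bucket {
	b := &bucket{conns: conns, in: make(chan []byte, 16)}
	go b.run()
	return b
}

func (b *bucket) run() {
	for msg := range b.in {
		for _, c := range b.conns {
			c.Write(msg) // error handling omitted for brevity
		}
	}
}

// broadcast hands the message to every bucket; with 700 connections split
// into, say, 10 buckets, only 10 goroutines are woken per broadcast
// instead of 700.
func broadcast(buckets []*bucket, msg []byte) {
	for _, b := range buckets {
		b.in <- msg
	}
}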



Dmitry Vyukov

Mar 12, 2014, 12:25:58 PM
to steve wang, golang-nuts
There are several 'broadcasting' primitives in Go, e.g. closing a
channel unblocks all receivers, as do sync.WaitGroup.Done/Wait and
sync.RWMutex.Unlock/RLock. But currently they are all implemented as N
independent signals, each waking a single goroutine. So there is
nothing good I can suggest. Probably there is some room for
optimization here.
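
For illustration, here is the close-of-a-channel form of broadcast mentioned above (a small self-contained example; under the hood it still wakes each waiting goroutine individually):

package main

import (
	"fmt"
	"sync"
)

func main() {
	start := make(chan struct{})
	var wg sync.WaitGroup

	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			<-start // every goroutine blocks here until the channel is closed
			fmt.Println("worker", id, "released")
		}(i)
	}

	close(start) // a single close releases all receivers at once
	wg.Wait()
}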

Edward Muller

Mar 12, 2014, 5:41:54 PM
to steve wang, golang-nuts
I would also question why you need to wake up a lot of goroutines. It seems like you are using goroutines/channels/timeouts backwards.

Without actual code around the problem, though, I'm just speculating.



steve wang

Mar 12, 2014, 10:39:01 PM
to golan...@googlegroups.com, steve wang
To broadcast messages to lots of tcp connections, I have come up with three ways so far:
1. send messages to lots of goroutines which are waiting on their respective channels, each of which then calls *net.TCPConn.Write
2. call *net.TCPConn.Write on each connection in a loop
3. write messages to buffers from which they will be sent later

The first way looks, in my opinion, idiomatic in Go, but causes performance issues.
The second may block for an unknown period of time on certain connections, which is unacceptable to me.
The third I am still exploring and have not reached a conclusion on, but it doesn't look idiomatic enough.
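
A minimal sketch of the first approach, with one twist of my own (hypothetical names subscriber/broadcast): give each per-connection channel a small buffer and skip clients whose buffers are full, so the broadcaster never blocks on a slow connection, which also addresses the concern with the second approach:

package pubsub // hypothetical package name

import (
	"net"
)

type subscriber struct {
	conn net.Conn
	ch   chan []byte
}

func newSubscriber(conn net.Conn) *subscriber {
	s := &subscriber{conn: conn, ch: make(chan []byte, 32)}
	go s.writeLoop()
	return s
}

// writeLoop is the only goroutine that touches the connection.
func (s *subscriber) writeLoop() {
	for msg := range s.ch {
		if _, err := s.conn.Write(msg); err != nil {
			return // real code would also unregister the subscriber here
		}
	}
}

// broadcast hands the message to every subscriber without blocking:
// if a subscriber's buffer is full, that message is dropped for it.
func broadcast(subs []*subscriber, msg []byte) {
	for _, s := range subs {
		select {
		case s.ch <- msg:
		default:
			// subscriber too slow; dropping keeps the broadcaster responsive
		}
	}
}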

Dave Cheney

Mar 12, 2014, 10:45:48 PM
to steve wang, golang-nuts
On Thu, Mar 13, 2014 at 1:39 PM, steve wang <steve....@gmail.com> wrote:
> To broadcast messages among lots of tcp connections, I got three ways by
> far:
> 1.send messages to lots of goroutines which are waiting on their respective
> channels and then call *net.TCPConn.Write

I think this is the best solution.

Compared to the cost of sending data over the network, the cost of
waking up the goroutine is very small. I think you are focusing on the
wrong area of your code.