Why is there such a great performance penalty when combining LockOSThread and channels?


Cia

Nov 22, 2016, 12:45:19 AM
to golang-nuts

Here is the minimal test case 0:

package main

import "fmt"
import "runtime"

type callRet struct {
    ret int
}

type callIn struct {
    ret_chan *chan *callRet
    arg1     int
}

func caller(call_in_c *chan *callIn, arg1 int) int {
    ret_c := make(chan *callRet)
    ci := callIn{&ret_c, arg1}
    *call_in_c <- &ci
    ret := <-ret_c
    return ret.ret
}

func call_srv(call_in_c *chan *callIn) {
    //runtime.LockOSThread()
    for {
        in := <-*call_in_c
        ret_c := in.ret_chan
        ret := callRet{3 + in.arg1}
        *ret_c <- &ret
    }
    //runtime.UnlockOSThread()
}

func main() {
    p := fmt.Println
    runtime.GOMAXPROCS(2)
    call_in_c := make(chan *callIn)

    go call_srv(&call_in_c)

    fp := func(call_in_c chan *callIn) {
        ct := 0
        for ; ct < 10000000; ct = ct + 1 {
            //caller_batch(&call_in_c, 3, 100)
            caller(&call_in_c, 1)
            if ct%1000000 == 0 {
                fmt.Println(ct)
            }
        }
        p("done:)")
    }
    {
        ct := 0
        for ; ct < 0; ct++ {
            go fp(call_in_c)
        }
    }
    fp(call_in_c)
    return
}


Its perf output is about 1,000,000 ops/sec :)

But when you uncomment the runtime.LockOSThread() at the beginning of func call_srv, it slows down to 10,000 ops/sec and the kernel CPU usage goes mad.

Is this a normal penalty when using a channel with LockOSThread?

Thanks! ☺

------------
$ go version
go version go1.7 linux/amd64

Cia

Nov 22, 2016, 1:00:53 AM
to golang-nuts
Sorry :) There is a mistake here:
 
Its perf output is about 1,000,000 ops/sec :)

But when you uncomment the runtime.LockOSThread() at the beginning of func call_srv, it slows down to 100,000 ops/sec and the kernel CPU usage goes mad.


Actually it should be 100,000 ops/sec.

Thus it goes from 1,000,000 ops/sec to 100,000 ops/sec when you uncomment the runtime.LockOSThread() call.

There is a nearly 90% performance penalty when using a channel with LockOSThread. Is that abnormal, or should it be expected to be this slow?

(:Hoping I am wrong:)

Dave Cheney

Nov 22, 2016, 4:46:18 AM
to golang-nuts
Please try profiling your application; if you are on Linux, perf(1) works very well for tracing user and system time.

Cia

Nov 22, 2016, 5:04:45 AM
to golang-nuts


On Tuesday, 22 November 2016 17:46:18 UTC+8, Dave Cheney wrote:
Please try profiling your application, if you are on Linux perf(1) works very well for tracing user and system time.

Thanks for mentioning that :)

Here is the first step result of perf stat

 Performance counter stats for './penalty_with_LockOSThread':

       4893.495113      task-clock (msec)         #    0.583 CPUs utilized
         2,112,910      context-switches          #    0.432 M/sec
                80      cpu-migrations            #    0.016 K/sec
             2,377      page-faults               #    0.486 K/sec
   <not supported>      cycles
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

       8.389103849 seconds time elapsed

 Performance counter stats for './penalty_without_LockOSThread':

        930.442000      task-clock (msec)         #    0.978 CPUs utilized
             2,947      context-switches          #    0.003 M/sec
                41      cpu-migrations            #    0.044 K/sec
             2,389      page-faults               #    0.003 M/sec
   <not supported>      cycles
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

       0.951572314 seconds time elapsed

At first sight, it seems there is something wrong with the context-switch count for penalty_with_LockOSThread.

Dave Cheney

Nov 22, 2016, 5:06:50 AM
to golang-nuts
Perf record / report will highlight the problem.

Dave Cheney

Nov 22, 2016, 5:08:37 AM
to golang-nuts
Related, you don't need a pointer to a chan, channels are already pointers to the private runtime channel type.

Cia

Nov 22, 2016, 5:12:07 AM
to golang-nuts


On Tuesday, 22 November 2016 18:06:50 UTC+8, Dave Cheney wrote:
Perf record / report will highlight the problem.

Thanks Dave. And here is the output of the perf report:

Samples: 48K of event 'cpu-clock', Event count (approx.): 12020250000
Overhead  Command  Shared Object      Symbol
  59.40%  penalty  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
  13.22%  penalty  [kernel.kallsyms]  [k] finish_task_switch
   2.13%  penalty  [kernel.kallsyms]  [k] get_user_pages_fast
   1.87%  penalty  penalty            [.] runtime.mallocgc
   1.05%  penalty  [kernel.kallsyms]  [k] futex_wake
   0.75%  penalty  penalty            [.] runtime.findrunnable
   0.72%  penalty  penalty            [.] runtime.futex
   0.71%  penalty  penalty            [.] runtime/internal/atomic.Xchg
   0.67%  penalty  [kernel.kallsyms]  [k] unlock_page
   0.61%  penalty  [kernel.kallsyms]  [k] wake_q_add
   0.56%  penalty  penalty            [.] runtime.schedule
   0.55%  penalty  penalty            [.] runtime.runqgrab
   0.52%  penalty  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_swapgs
   0.50%  penalty  penalty            [.] runtime.lock
   0.46%  penalty  [kernel.kallsyms]  [k] __wake_up_bit
   0.43%  penalty  [kernel.kallsyms]  [k] smp_call_function_single
   0.39%  penalty  penalty            [.] main.main.func1
   0.37%  penalty  [kernel.kallsyms]  [k] get_futex_key
   0.37%  penalty  penalty            [.] runtime.notesleep
   0.35%  penalty  [kernel.kallsyms]  [k] do_futex
   0.32%  penalty  [kernel.kallsyms]  [k] hash_futex
   0.32%  penalty  [kernel.kallsyms]  [k] sys_futex
   0.32%  penalty  penalty            [.] runtime.chanrecv
   0.32%  penalty  penalty            [.] runtime.startm
   0.31%  penalty  penalty            [.] runtime/internal/atomic.Load
   0.30%  penalty  penalty            [.] runtime/internal/atomic.Cas
   0.30%  penalty  [kernel.kallsyms]  [k] do_select
   0.28%  penalty  [kernel.kallsyms]  [k] core_sys_select
   0.28%  penalty  [kernel.kallsyms]  [k] futex_wait_queue_me
   0.27%  penalty  penalty            [.] runtime.chansend
   0.25%  penalty  penalty            [.] runtime.casgstatus 


Cia

Nov 22, 2016, 5:20:01 AM
to golang-nuts


On Tuesday, 22 November 2016 18:08:37 UTC+8, Dave Cheney wrote:
Related, you don't need a pointer to a chan, channels are already pointers to the private runtime channel type.

Thanks for telling me that ☺ The code will be much cleaner without the pointer '*' notation.

Is there something wrong with the scheduler? Or maybe a really deeply hidden bug in the code. (Honestly doubting the first one....)

Cia

Nov 22, 2016, 5:23:18 AM
to golang-nuts


On Tuesday, 22 November 2016 18:06:50 UTC+8, Dave Cheney wrote:
Perf record / report will highlight the problem.

And here is the perf.data of the perf record for details :)
perf.data

Dave Cheney

Nov 22, 2016, 5:51:46 AM
to golang-nuts
Why do you want to use LockOSThread? You've proved it has a significant performance cost for your application.

Cia

Nov 22, 2016, 6:11:34 AM
to golang-nuts


On Tuesday, 22 November 2016 18:51:46 UTC+8, Dave Cheney wrote:
Why do you want to use LockOSThread? You've proved it has a significant performance cost for your application.

Cgo is the answer.....

(Actually I want to use a C library, which needs LockOSThread to satisfy its thread-affine execution model, through cgo in a Go application. And one of the biggest benefits of Go is the channel's event notification mechanism, which lives in user space and is powerful. It makes it easy to build a synchronous & non-blocking application.)

Is this an expected performance penalty when we use a channel with LockOSThread? (Maybe due to a design tradeoff or something.)

(: Just want to figure it out, and rule out the possibility of a bug hidden in this test code :)

Dave Cheney

Nov 22, 2016, 6:30:55 AM
to golang-nuts
The cgo call is going to cost you a lot more than this. Can you write up your real code and profile it?

Cia

Nov 22, 2016, 6:43:34 AM
to golang-nuts
I have tested cgo's performance, and here is the result below:

chan:
10,000,000 op/s (buffer 100)
4,000,000 op/s (no buffer, blocked)

cgo:
5,544,729 op/s

It shows that the speed of cgo is acceptable, even before considering the batch optimization method :)

Dave Cheney

Nov 22, 2016, 3:37:48 PM
to golang-nuts
Thinking about this overnight, I think this benchmark

https://play.golang.org/p/5d4MUKqPYd

shows the issue you are having. On my machine

% go test -bench=. -benchtime=5s
BenchmarkWithoutLockOSThread-4 5000 1827196 ns/op
BenchmarkWithLockOSThread-4 200 31506649 ns/op

My guess is the LockOSThread version causes a context switch because the thread the sending goroutine runs on must be parked, and the receiving thread locked to the receiving goroutine must be woken to drain the value. 

I don't see an easy workaround for this slowdown.

Ian Lance Taylor

Nov 22, 2016, 3:47:51 PM
to Dave Cheney, golang-nuts
On Tue, Nov 22, 2016 at 12:37 PM, Dave Cheney <da...@cheney.net> wrote:
> Thinking about this overnight I think this benchmark
>
> https://play.golang.org/p/5d4MUKqPYd
>
> Shows the issue you are having. On my machine
>
> % go test -bench=. -benchtime=5s
> BenchmarkWithoutLockOSThread-4 5000 1827196 ns/op
> BenchmarkWithLockOSThread-4 200 31506649 ns/op
>
> My guess is the LockOSThread version causes a context switch because the
> thread the sending goroutine runs on must be parked, and the receiving
> thread locked to the receiving goroutine must be woken to drain the value.
>
> I don't see an easy workaround for this slowdown.

I also have come to think it may be due to thread context switching.

Still, I also wrote up a benchmark, and I filed https://golang.org/issue/18023.

Ian

Dave Cheney

Nov 22, 2016, 4:03:56 PM
to Ian Lance Taylor, golang-nuts

Thanks Ian.

ki...@nexedi.com

Sep 10, 2017, 8:59:58 AM
to golang-nuts
On Tuesday, 22 November 2016 at 23:47:51 UTC+3, Ian Lance Taylor wrote:

For reference: the context switching, and thus the associated slowdown, can potentially be avoided with proper Go scheduler tuning. Moreover, since lockOSThread is used internally by the Go runtime at every cgo call, a similar slowdown might also show up on servers that use cgo when handling requests.


Kirill