Why is there such a great performance penalty when combining LockOSThread and channels?


Cia

Nov 22, 2016, 12:45:19 AM
to golang-nuts

Here is the minimal test case 0:

package main

import "fmt"
import "runtime"

type callRet struct {
    ret int
}

type callIn struct {
    ret_chan *chan *callRet
    arg1     int
}

func caller(call_in_c *chan *callIn, arg1 int) int {
    ret_c := make(chan *callRet)
    ci := callIn{&ret_c, arg1}
    *call_in_c <- &ci
    ret := <-ret_c
    return ret.ret
}

func call_srv(call_in_c *chan *callIn) {
    //runtime.LockOSThread()
    for {
        in := <-*call_in_c
        ret_c := in.ret_chan
        ret := callRet{3 + in.arg1}
        *ret_c <- &ret
    }
    //runtime.UnlockOSThread()
}

func main() {
    p := fmt.Println
    runtime.GOMAXPROCS(2)
    call_in_c := make(chan *callIn)

    go call_srv(&call_in_c)

    fp := func(call_in_c chan *callIn) {
        ct := 0
        for ; ct < 10000000; ct = ct + 1 {
            //caller_batch(&call_in_c, 3, 100)
            caller(&call_in_c, 1)
            if ct%1000000 == 0 {
                fmt.Println(ct)
            }
        }
        p("done:)")
    }
    {
        ct := 0
        for ; ct < 0; ct++ {
            go fp(call_in_c)
        }
    }
    fp(call_in_c)
    return
}


Its perf output is about 1,000,000 ops/sec :)

But when you uncomment the runtime.LockOSThread() at the beginning of func call_srv, it slows down to 10,000 ops/sec and the kernel CPU usage goes mad.

Is this a normal penalty when using a channel with LockOSThread?

Thanks! ☺

------------
$ go version
go version go1.7 linux/amd64

Cia

Nov 22, 2016, 1:00:53 AM
to golang-nuts
Sorry :) There is a mistake here:
 
Its perf output is about 1,000,000 ops/sec :)

But when you uncomment the runtime.LockOSThread() at the beginning of func call_srv, it slows down to 100,000 ops/sec and the kernel CPU usage goes mad.


Actually it should be 100,000 ops/sec.

Thus it goes from 1,000,000 ops/sec to 100,000 ops/sec when you uncomment the runtime.LockOSThread() call.

There is a nearly 90% performance penalty when using a channel with LockOSThread. Is that abnormal, or should it be expected to be this slow?

(:Hoping I am wrong:)

Dave Cheney

Nov 22, 2016, 4:46:18 AM
to golang-nuts
Please try profiling your application; if you are on Linux, perf(1) works very well for tracing user and system time.

Cia

Nov 22, 2016, 5:04:45 AM
to golang-nuts


On Tuesday, 22 November 2016 17:46:18 UTC+8, Dave Cheney wrote:
Please try profiling your application, if you are on Linux perf(1) works very well for tracing user and system time.

Thanks for mentioning that :)

Here is the first step result of perf stat

 Performance counter stats for './penalty_with_LockOSThread':

       4893.495113      task-clock (msec)         #    0.583 CPUs utilized
         2,112,910      context-switches          #    0.432 M/sec
                80      cpu-migrations            #    0.016 K/sec
             2,377      page-faults               #    0.486 K/sec
   <not supported>      cycles
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

       8.389103849 seconds time elapsed

 Performance counter stats for './penalty_without_LockOSThread':

        930.442000      task-clock (msec)         #    0.978 CPUs utilized
             2,947      context-switches          #    0.003 M/sec
                41      cpu-migrations            #    0.044 K/sec
             2,389      page-faults               #    0.003 M/sec
   <not supported>      cycles
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses

       0.951572314 seconds time elapsed

At first sight, it seems there is something wrong with the context-switch count for penalty_with_LockOSThread.

Dave Cheney

Nov 22, 2016, 5:06:50 AM
to golang-nuts
Perf record / report will highlight the problem.

Dave Cheney

Nov 22, 2016, 5:08:37 AM
to golang-nuts
Related, you don't need a pointer to a chan, channels are already pointers to the private runtime channel type.

Cia

Nov 22, 2016, 5:12:07 AM
to golang-nuts


On Tuesday, 22 November 2016 18:06:50 UTC+8, Dave Cheney wrote:
Perf record / report will highlight the problem.

Thanks Dave. And here is the output of the perf report:

Samples: 48K of event 'cpu-clock', Event count (approx.): 12020250000
Overhead  Command  Shared Object      Symbol
  59.40%  penalty  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
  13.22%  penalty  [kernel.kallsyms]  [k] finish_task_switch
   2.13%  penalty  [kernel.kallsyms]  [k] get_user_pages_fast
   1.87%  penalty  penalty            [.] runtime.mallocgc
   1.05%  penalty  [kernel.kallsyms]  [k] futex_wake
   0.75%  penalty  penalty            [.] runtime.findrunnable
   0.72%  penalty  penalty            [.] runtime.futex
   0.71%  penalty  penalty            [.] runtime/internal/atomic.Xchg
   0.67%  penalty  [kernel.kallsyms]  [k] unlock_page
   0.61%  penalty  [kernel.kallsyms]  [k] wake_q_add
   0.56%  penalty  penalty            [.] runtime.schedule
   0.55%  penalty  penalty            [.] runtime.runqgrab
   0.52%  penalty  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_swapgs
   0.50%  penalty  penalty            [.] runtime.lock
   0.46%  penalty  [kernel.kallsyms]  [k] __wake_up_bit
   0.43%  penalty  [kernel.kallsyms]  [k] smp_call_function_single
   0.39%  penalty  penalty            [.] main.main.func1
   0.37%  penalty  [kernel.kallsyms]  [k] get_futex_key
   0.37%  penalty  penalty            [.] runtime.notesleep
   0.35%  penalty  [kernel.kallsyms]  [k] do_futex
   0.32%  penalty  [kernel.kallsyms]  [k] hash_futex
   0.32%  penalty  [kernel.kallsyms]  [k] sys_futex
   0.32%  penalty  penalty            [.] runtime.chanrecv
   0.32%  penalty  penalty            [.] runtime.startm
   0.31%  penalty  penalty            [.] runtime/internal/atomic.Load
   0.30%  penalty  penalty            [.] runtime/internal/atomic.Cas
   0.30%  penalty  [kernel.kallsyms]  [k] do_select
   0.28%  penalty  [kernel.kallsyms]  [k] core_sys_select
   0.28%  penalty  [kernel.kallsyms]  [k] futex_wait_queue_me
   0.27%  penalty  penalty            [.] runtime.chansend
   0.25%  penalty  penalty            [.] runtime.casgstatus 


Cia

Nov 22, 2016, 5:20:01 AM
to golang-nuts


On Tuesday, 22 November 2016 18:08:37 UTC+8, Dave Cheney wrote:
Related, you don't need a pointer to a chan, channels are already pointers to the private runtime channel type.

Thanks for telling me that ☺ The code will be much cleaner without the pointer '*' notation.

Is there something wrong with the scheduler? Or maybe a really deeply hidden bug in the code. (Honestly doubting the first one....)

Cia

Nov 22, 2016, 5:23:18 AM
to golang-nuts


On Tuesday, 22 November 2016 18:06:50 UTC+8, Dave Cheney wrote:
Perf record / report will highlight the problem.

And here is the perf.data of the perf record for details :)
perf.data

Dave Cheney

Nov 22, 2016, 5:51:46 AM
to golang-nuts
Why do you want to use LockOSThread? You've proved it has a significant performance cost for your application.

Cia

Nov 22, 2016, 6:11:34 AM
to golang-nuts


On Tuesday, 22 November 2016 18:51:46 UTC+8, Dave Cheney wrote:
Why do you want to use LockOSThread? You've proved it has a significant performance cost for your application.

Cgo is the answer.....

(Actually I want to use a C library, which needs LockOSThread to satisfy its thread-affine execution model, through cgo in a Go application. And one of the biggest benefits of Go is the channel's event notification mechanism, which lives in user space and is powerful. It makes it easy to build a synchronous & non-blocking application.)

Is this an expected performance penalty when we use a channel with LockOSThread? (Maybe due to a design tradeoff or something.)

(: Just want to figure it out, and rule out the possibility of a bug hidden in this test code :)

Dave Cheney

Nov 22, 2016, 6:30:55 AM
to golang-nuts
The cgo call is going to cost you a lot more than this. Can you write up your real code and profile it?

Cia

Nov 22, 2016, 6:43:34 AM
to golang-nuts
I have tested cgo's performance, and here is the result below:

chan:
10,000,000 op/s (buffer 100)
4,000,000 op/s (no buffer, blocked)

cgo:
5,544,729 op/s

It shows that the speed of cgo is acceptable, even before considering the batch optimization method :)

Dave Cheney

Nov 22, 2016, 3:37:48 PM
to golang-nuts
Thinking about this overnight, I think this benchmark

https://play.golang.org/p/5d4MUKqPYd

shows the issue you are having. On my machine

% go test -bench=. -benchtime=5s
BenchmarkWithoutLockOSThread-4 5000 1827196 ns/op
BenchmarkWithLockOSThread-4 200 31506649 ns/op

My guess is the LockOSThread version causes a context switch because the thread the sending goroutine runs on must be parked, and the receiving thread locked to the receiving goroutine must be woken to drain the value. 

I don't see an easy workaround for this slowdown.

Ian Lance Taylor

Nov 22, 2016, 3:47:51 PM
to Dave Cheney, golang-nuts
On Tue, Nov 22, 2016 at 12:37 PM, Dave Cheney <da...@cheney.net> wrote:
> Thinking about this overnight I think this benchmark
>
> https://play.golang.org/p/5d4MUKqPYd
>
> Shows the issue you are having. On my machine
>
> % go test -bench=. -benchtime=5s
> BenchmarkWithoutLockOSThread-4 5000 1827196 ns/op
> BenchmarkWithLockOSThread-4 200 31506649 ns/op
>
> My guess is the LockOSThread version causes a context switch because the
> thread the sending goroutine runs on must be parked, and the receiving
> thread locked to the receiving goroutine must be woken to drain the value.
>
> I don't see an easy workaround for this slowdown.

I also have come to think it may be due to thread context switching.

Still, I also wrote up a benchmark, and I filed https://golang.org/issue/18023.

Ian

Dave Cheney

Nov 22, 2016, 4:03:56 PM
to Ian Lance Taylor, golang-nuts

Thanks Ian.

ki...@nexedi.com

Sep 10, 2017, 8:59:58 AM
to golang-nuts
On Tuesday, 22 November 2016 at 23:47:51 UTC+3, Ian Lance Taylor wrote:

For reference: the context switching, and thus the associated slowdown, can potentially be avoided with proper Go scheduler tuning. Moreover, since lockOSThread is used internally by the Go runtime at every cgo call, a similar slowdown might also show up on servers that use cgo when handling requests.


Kirill