GOMAXPROCS, LockOSThread and scheduler interactions

Ugorji Nwoke

Dec 5, 2013, 12:56:45 PM

to golan...@googlegroups.com

I want to use cgo, without the performance penalty of scheduler interactions, including creating new OS threads when blocked in cgo.

I have been thinking of an implementation that uses a set of worker goroutines all locked to their own OS threads. Channels are used to send tasks to, and receive results from the worker goroutines.
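The design described above could be sketched roughly as follows. This is only an illustration of the idea under discussion, not code from the thread: the `task` type, `startWorkers`, and `submit` are hypothetical names, and the `fn` field stands in for whatever code would actually call into C.

```go
package main

import (
	"fmt"
	"runtime"
)

// task is a hypothetical unit of work; fn stands in for code that
// would call into C via cgo.
type task struct {
	fn    func() int
	reply chan int
}

// startWorkers launches n goroutines, each locked to its own OS thread,
// all receiving from the same task channel.
func startWorkers(n int, tasks <-chan task) {
	for i := 0; i < n; i++ {
		go func() {
			runtime.LockOSThread() // pin this goroutine to its current thread
			defer runtime.UnlockOSThread()
			for t := range tasks {
				t.reply <- t.fn()
			}
		}()
	}
}

// submit sends one task to the workers and waits for its result.
func submit(tasks chan<- task, fn func() int) int {
	reply := make(chan int, 1)
	tasks <- task{fn: fn, reply: reply}
	return <-reply
}

func main() {
	tasks := make(chan task)
	startWorkers(16, tasks)
	fmt.Println(submit(tasks, func() int { return 6 * 7 }))
	close(tasks) // lets the workers exit their range loops
}
```

Note that every call costs two channel operations (one to send the task, one to return the result) on top of whatever the call itself costs.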

Will this effectively make cgo interactions cheap? As the OS thread is exclusive to the goroutine, I assume there is no need to interact with the scheduler.

If this works, how will this play with GOMAXPROCS? I've been reading the list and docs, but it's not clear. Assume that:

- I have 16 worker goroutines locked to their OS threads

- I want to have 4 OS threads available otherwise to run other goroutines

Do I set GOMAXPROCS in this situation to 4, or 20, or otherwise?

Thanks.

Ian Lance Taylor

Dec 5, 2013, 2:12:39 PM

to Ugorji Nwoke, golang-nuts

On Thu, Dec 5, 2013 at 9:56 AM, Ugorji Nwoke <ugo...@gmail.com> wrote:
>
> I want to use cgo, without the performance penalty of scheduler
> interactions, including creating new OS threads when blocked in cgo.
>
> I have been thinking of an implementation that uses a set of worker
> goroutines all locked to their own OS threads. Channels are used to send
> tasks to, and receive results from the worker goroutines.
>
> Will this effectively make cgo interactions cheap? As the OS thread is
> exclusive to the goroutine, I assume there is no need to interact with the
> scheduler.

I don't think this would make any noticeable difference. When you
make a cgo call, the runtime internally calls LockOSThread for you
while the C code executes. The little you save by calling it
beforehand will be lost by the time it takes for the channel
communication between your Go threads and your C thread.

Ian

Ugorji Nwoke

Dec 5, 2013, 2:27:03 PM

to golan...@googlegroups.com, Ugorji Nwoke

Isn't this something that can be optimized? If I already have the OS thread locked, then I would expect the runtime to assume that the scheduler will not attempt to schedule anything else on the thread, so the runtime doesn't have to do a lock/unlockosthread around the call. It should just switch the stacks before and after.

Also, my code does not have channel communication between Go and C (or callbacks from C to Go). All the channel communication is on the Go side. Go writes to channel, worker goroutine picks up from channel, and calls Go code interleaved with C code.

I am doing this to integrate with leveldb, which I implement a smart secondary index with. During scans, there is a lot of computation that happens to check if a row is a valid result, causing a lot of back and forth between C and Go. I'm looking for a way to eliminate the overhead from that, while gating the number of concurrent threads into leveldb.

I previously opted to write a leveldb server in C++, but that has involved re-writing a fair amount of code I already had in Go, and implementing epoll for edge-triggered I/O, and handling serialization of arguments and return values across the wire. It's mostly done, but I'm hoping there's a better way to solve this using just Go code.

Ian Lance Taylor

Dec 5, 2013, 2:46:12 PM

to Ugorji Nwoke, golang-nuts

On Thu, Dec 5, 2013 at 11:27 AM, Ugorji Nwoke <ugo...@gmail.com> wrote:
> Isn't this something that can be optimized? If I already have the OS thread
> locked, then I would expect the runtime to assume the scheduler will not
> attempt to schedule anything on the thread,
> and the runtime doesn't have to do a lock/unlockosthread around it. It
> should just switch the stacks before and after.

The Go runtime scheduler has a more complex model than I think you are
assuming. I don't think there is a significant scope for optimization
here. I could certainly be wrong. But I would want to see some
evidence.

> Also, my code does not have channel communication between Go and C (or
> callbacks from C to Go). All the channel communication is on the Go side. Go
> writes to channel, worker goroutine picks up from channel, and calls Go code
> interleaved with C code.

Understood. The channel communication from one goroutine to another
is cheap, but it is not free. I'm guessing that the cost of that
channel communication is comparable to the cost of the runtime
scheduler. Both cases involve looking at a couple of data structures
and taking a lock or two.

Maybe a different way to say it is this: in my estimation, the cost of
switching to a different stack, including moving the parameters and
results back and forth, dominates the scheduling cost.

> I am doing this to integrate with leveldb, which I implement a smart
> secondary index with. During scans, there is a lot of computation that
> happens to check if a row is a valid result, causing a lot of back and forth
> between C and Go. I'm looking for a way to eliminate the overhead from that,
> while gating the number of concurrent threads into leveldb.

Eliminating back and forth between C and Go is definitely highly
desirable for performance. Anything you can do on that front will
help.

Ian

Alexey Borzenkov

Dec 5, 2013, 2:51:23 PM

to Ugorji Nwoke, golang-nuts

On Thu, Dec 5, 2013 at 9:56 PM, Ugorji Nwoke <ugo...@gmail.com> wrote:
> I have been thinking of an implementation that uses a set of worker
> goroutines all locked to their own OS threads. Channels are used to send
> tasks to, and receive results from the worker goroutines.
>
> Will this effectively make cgo interactions cheap? As the OS thread is
> exclusive to the goroutine, I assume there is no need to interact with the
> scheduler.

No, the scheduler interaction wouldn't go away: when you enter cgo,
the runtime still needs to check whether more goroutines are scheduled
and whether additional threads are needed. Locking would simply create
that many threads in advance, none of which could do anything else.
You probably don't need that, as concurrent Go programs that do
syscalls tend to have spare threads lying around anyway.

Moreover, using workers and channels would actually add complexity in
your case, for example if you need results of those cgo calls you
would need to create reply channels and communicate results over them.
It would make sense if you need e.g. buffering, but not otherwise, as
cross-thread communication is not free.

If you need to limit cgo concurrency, you should probably use some form
of semaphore. Sadly, runtime semaphores don't seem to be exposed in the
standard library, but one is easy to build from either a buffered
channel or sync.Mutex + sync.Cond + a counter.
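A minimal sketch of the buffered-channel variant might look like the following. The `semaphore` type and the `runLimited` helper are made-up names for illustration, and `work` is a placeholder for the real cgo call. The helper also records the peak number of calls in flight, purely to show that the limit holds.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// semaphore is a counting semaphore built on a buffered channel:
// the channel capacity is the maximum number of holders.
type semaphore chan struct{}

func (s semaphore) acquire() { s <- struct{}{} } // blocks while the channel is full
func (s semaphore) release() { <-s }

// runLimited runs n calls of work with at most limit in flight at once
// and returns the highest concurrency it observed.
func runLimited(n, limit int, work func()) int32 {
	sem := make(semaphore, limit)
	var wg sync.WaitGroup
	var inFlight, peak int32
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem.acquire()
			defer sem.release()
			cur := atomic.AddInt32(&inFlight, 1)
			for { // record a new peak if cur exceeds the old one
				old := atomic.LoadInt32(&peak)
				if cur <= old || atomic.CompareAndSwapInt32(&peak, old, cur) {
					break
				}
			}
			work() // placeholder for the cgo call
			atomic.AddInt32(&inFlight, -1)
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	peak := runLimited(100, 4, func() {})
	fmt.Println("peak concurrency <= 4:", peak <= 4)
}
```

Unlike the worker-pool design, the calling goroutine makes the call itself, so no result has to travel back over a reply channel.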

> Do I set GOMAXPROCS in this situation to 4, or 20, or otherwise?

GOMAXPROCS controls Go parallelism, so if you want 4 goroutines to
execute simultaneously, then it should be 4.
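For reference, the setting can also be read and changed from code. The sketch below uses `runtime.GOMAXPROCS`, which returns the previous value and only queries when its argument is less than 1. The `setParallelism` and `currentParallelism` wrappers are hypothetical helpers added for readability. As the runtime documentation puts it, the limit applies to threads executing Go code; threads blocked in system calls do not count against it.

```go
package main

import (
	"fmt"
	"runtime"
)

// setParallelism sets GOMAXPROCS to n and returns the previous setting.
func setParallelism(n int) int {
	return runtime.GOMAXPROCS(n)
}

// currentParallelism reports GOMAXPROCS without changing it
// (an argument below 1 only queries).
func currentParallelism() int {
	return runtime.GOMAXPROCS(0)
}

func main() {
	prev := setParallelism(4)
	fmt.Println("previous:", prev)
	fmt.Println("current:", currentParallelism())
}
```

The same value can be supplied through the `GOMAXPROCS` environment variable instead of a call in the program.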

Ugorji Nwoke

Dec 5, 2013, 3:53:36 PM

to golan...@googlegroups.com, Ugorji Nwoke

Sounds good. Thanks Ian. I will work around things on my end without using LockOSThread, by doing more work on the C side to reduce the chatter.

Ugorji Nwoke

Dec 5, 2013, 3:53:51 PM

to golan...@googlegroups.com, Ugorji Nwoke

Sounds good. Thanks. I will work around things on my end without using LockOSThread, by doing more work on the C side to reduce the chatter.

Hamish Ogilvy

Dec 5, 2013, 7:17:27 PM

to golan...@googlegroups.com, Ugorji Nwoke

I agree with Ian: the scheduler is pretty good, and it's probably not the first area you should be looking at for performance gains. Any specific reason for "gating the number of concurrent threads into leveldb"?

We have a similar issue where data is stored and accessed via C, but needs to be scanned and then evaluated in Go. From experiments, the channel you are using within Go probably has more overhead than the connection between C and Go (depending on how you've done it). If your scan is incrementally sending lots of data that is then sent over a channel in Go, that will cost you. What types are you sending between C and Go? From what I've seen so far, casting is almost free, but C function calls have extra overhead. That said, we currently have some processing steps that run millions of C functions via cgo and it doesn't hurt us badly. We stripe the access to C (usually by the number of CPUs) and use an RWMutex to control what is accessing each stripe. Channels are brilliant, but if you need to scan a lot of data, you don't want the overhead on each send; you're better off controlling access to that data at a higher level and then scanning/processing it directly.
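One way to read the striping approach is sketched below. The message gives no implementation details, so everything here is an assumption: the `store` and `stripe` types, the FNV hash used to pick a stripe, and the Go map standing in for data that would really live on the C side. The point is only that a scan takes one read lock per stripe and evaluates rows in place, rather than paying for a channel send per row.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// stripe guards one partition of the data with its own RWMutex.
// rows is a stand-in for data that would really live on the C side.
type stripe struct {
	mu   sync.RWMutex
	rows map[string]int
}

// store spreads keys over a fixed number of stripes.
type store struct {
	stripes []stripe
}

func newStore(n int) *store {
	s := &store{stripes: make([]stripe, n)}
	for i := range s.stripes {
		s.stripes[i].rows = make(map[string]int)
	}
	return s
}

// stripeFor hashes the key to pick a stripe.
func (s *store) stripeFor(key string) *stripe {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &s.stripes[h.Sum32()%uint32(len(s.stripes))]
}

// put writes one row under the exclusive lock of its stripe.
func (s *store) put(key string, val int) {
	st := s.stripeFor(key)
	st.mu.Lock()
	st.rows[key] = val
	st.mu.Unlock()
}

// countMatching scans each stripe under a read lock and evaluates rows
// in place, with no channel send per row.
func (s *store) countMatching(pred func(int) bool) int {
	n := 0
	for i := range s.stripes {
		st := &s.stripes[i]
		st.mu.RLock()
		for _, v := range st.rows {
			if pred(v) {
				n++
			}
		}
		st.mu.RUnlock()
	}
	return n
}

func main() {
	s := newStore(4) // the message suggests one stripe per CPU; 4 is arbitrary
	for i := 0; i < 100; i++ {
		s.put(fmt.Sprintf("key-%d", i), i)
	}
	fmt.Println(s.countMatching(func(v int) bool { return v%2 == 0 }))
}
```

Readers of different stripes never contend, and readers of the same stripe only wait when a writer holds that stripe's lock.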

Dmitry Vyukov

Dec 6, 2013, 2:35:56 AM

to Ugorji Nwoke, golang-nuts

On Thu, Dec 5, 2013 at 9:56 PM, Ugorji Nwoke <ugo...@gmail.com> wrote:
>

Hi,

It won't help. The scheduler preserves GOMAXPROCS threads running Go code,
not just GOMAXPROCS threads. Otherwise, if GOMAXPROCS threads called into a
blocking syscall/cgo, your program would deadlock.

What do you do inside of cgo? Short computation? Long computation? DB request?

What does the profile for GOMAXPROCS=4 look like? Do you see any
performance problems there?

Ugorji Nwoke

Dec 6, 2013, 4:24:15 PM

to golan...@googlegroups.com, Ugorji Nwoke

Sorry for the late response.

On Friday, December 6, 2013 2:35:56 AM UTC-5, Dmitry Vyukov wrote:

> On Thu, Dec 5, 2013 at 9:56 PM, Ugorji Nwoke <ugo...@gmail.com> wrote:
>>
>> I want to use cgo, without the performance penalty of scheduler
>> interactions, including creating new OS threads when blocked in cgo.
>>
>> I have been thinking of an implementation that uses a set of worker
>> goroutines all locked to their own OS threads. Channels are used to send
>> tasks to, and receive results from the worker goroutines.
>>
>> Will this effectively make cgo interactions cheap? As the OS thread is
>> exclusive to the goroutine, I assume there is no need to interact with the
>> scheduler.
>>
>> If this works, how will this play with GOMAXPROCS? I've been reading the
>> list and docs, but it's not clear. Assume that:
>> - I have 16 worker goroutines locked to their OS threads
>> - I want to have 4 OS threads available otherwise to run other goroutines
>>
>> Do I set GOMAXPROCS in this situation to 4, or 20, or otherwise?
>
> Hi,
>
> It won't help. The scheduler preserves GOMAXPROCS threads running Go code,
> not just GOMAXPROCS threads. Otherwise, if GOMAXPROCS threads called into a
> blocking syscall/cgo, your program would deadlock.

The response still doesn't show me why it wouldn't work. Currently, the scheduler preserves GOMAXPROCS threads, but shouldn't it treat goroutines that have explicitly locked OS threads differently? I.e., if I set GOMAXPROCS to 8 and lock 6 threads exclusively to goroutines, shouldn't those 6 threads and goroutines be treated separately, since they technically should be outside the purview of the scheduler?

> What do you do inside of cgo? Short computation? Long computation? DB request?

Inside of cgo, they are short computations. In one scenario, they are leveldb scans which are pretty quick, especially if the values are already in (C) memory (think very low microseconds). On each iteration through the scan, we have to determine if it points to a valid object, and when we should stop iterating. Currently, I have code in Go that does that determination. We sometimes scan 1000s of rows at a time.

> What does the profile for GOMAXPROCS=4 look like? Do you see any
> performance problems there?

I haven't tried the profile yet. I moved away from using cgo a few months ago and so wrote the server wrapping leveldb completely in C++. However, I'm having to write too much code which I get for free with Go (epoll edge-triggered connection management, RPC functionality, (de)serialization of objects in C++, etc). So I am evaluating moving it back to Go, and seeing what options are for reducing the chatter overhead.

Dmitry Vyukov

Dec 7, 2013, 5:06:08 AM

to Ugorji Nwoke, golang-nuts

No, it works differently.
If a goroutine locked to a thread blocks, or enters a syscall or cgo, then
the scheduler needs to wake up another thread to run other Go code.
So locked goroutines only increase overhead, because the scheduler needs to
switch threads more often, as opposed to just running what is
available on the current thread.

>> What do you do inside of cgo? Short computation? Long computation? DB
>> request?
>>
> Inside of cgo, they are short computations. In one scenario, they are
> leveldb scans which are pretty quick, especially if the values are already
> in (C) memory (think very low microseconds). On each iteration through the
> scan, we have to determine if it points to a valid object, and when we
> should stop iterating. Currently, I have code in Go that does that
> determination. We sometimes scan 1000s of rows at a time.
>>
>> What does the profile for GOMAXPROCS=4 look like? Do you see any
>> performance problems there?
>
>
> I haven't tried the profile yet. I moved away from using cgo a few months
> ago and so wrote the server wrapping leveldb completely in C++. However, I'm
> having to write too much code which I get for free with Go (epoll
> edge-triggered connection management, RPC functionality, (de)serialization
> of objects in C++, etc). So I am evaluating moving it back to Go, and seeing
> what options are for reducing the chatter overhead.

OK, just try to use cgo in the most straightforward way.
If that turns out to be unacceptably slow, then we can look at profiles.
But there is a good chance that it will work quite fast.

Ugorji Nwoke

Dec 18, 2013, 7:10:28 PM

to golan...@googlegroups.com, Ugorji Nwoke

Just an update that I rewrote the database server code to all be in Go, using cgo to talk to leveldb, and net/rpc to talk to clients.

The cgo interaction is quite fast - as fast as can be expected. It's about the cost of 5 no-op go function calls. No complaints here whatsoever.

My only issue now is that debugging is harder. This is especially true because I've been unable to grab a core dump when I have an issue, even though I run with expected environment parameters and ulimit set.

I don't get a core dump when something goes wrong in the C code. I just get a crash.

I run my command as:

ulimit -c unlimited && GOGCTRACE=1 GOTRACEBACK=crash command commandArgs...

Please let me know if this is a legitimate bug and I will file an issue.

== Sample code: crash-cgo.go ==

package main

// void my_c_crash() { int* i = 0; *i = 2; } // NULL dereference will crash
import "C"

func main() {
	C.my_c_crash()
}

== Command line ==

ulimit -c unlimited; GOTRACEBACK=crash go run crash-cgo.go

Dmitry Vyukov

Dec 19, 2013, 2:26:26 AM

to Ugorji Nwoke, golang-nuts

On Thu, Dec 19, 2013 at 4:10 AM, Ugorji Nwoke <ugo...@gmail.com> wrote:
> Just an update that I rewrote the database server code to all be in Go,
> using cgo to talk to leveldb, and net/rpc to talk to clients.
>
> The cgo interaction is quite fast - as fast as can be expected. It's about
> the cost of 5 no-op go function calls. No complaints here whatsoever.
>
> My only issue now is that debugging is harder. This is especially true
> because I've been unable to grab a core dump when I have an issue, even
> though I run with expected environment parameters and ulimit set.
>
> I don't get a core dump when something goes wrong in the C code. I just get
> a crash.
>
> I run my command as:
> ulimit -c unlimited && GOGCTRACE=1 GOTRACEBACK=crash command commandArgs...
>
> Please let me know if this is a legitimate bug and I will file an issue.

Looks perfectly legitimate to me.
Please file an issue.
