What is the overhead of calling a C function from Go?

adam_smith

Nov 30, 2009, 5:23:05 PM

to golang-nuts

Can C functions be called from Go as efficiently as native Go
functions? In C#, for example, my understanding is that calling
native code has some overhead because you have to switch from a managed
environment to an unmanaged one (whatever that entails).

Could you replace Go functions with C implementations and expect
performance gains, or will calling foreign functions limit the
compiler's optimizations (e.g., by reducing the possibility of
inlining)?

Adam Langley

Nov 30, 2009, 5:26:14 PM

to adam_smith, golang-nuts

On Mon, Nov 30, 2009 at 2:23 PM, adam_smith <erik.e...@gmail.com> wrote:
> Can C functions be called as efficiently from Go as native Go
> functions? E.g. in C# it is from my understanding some overhead in
> calling native code because you have to switch from a managed
> environment to an unmanaged environment (whatever that entails?).

Calling C functions via cgo involves a thread-switch, which is
reasonably expensive.

Calling C functions that are compiled with 6c (and family), like the
runtime functions, is the same as calling a Go function.

AGL

Ostsol

Nov 30, 2009, 8:37:55 PM

to golang-nuts

Does that mean that calling a C function via a C wrapper written in
the cgo file is less expensive than calling the C function directly?
For example:

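package main

// #include <stdio.h>
// #include <stdlib.h>
// int GoPrintf(const char* str) { return printf("%s", str); }
import "C"

import "unsafe"

// PrintStuff copies stuff into C memory, prints it through the C wrapper,
// and frees the copy afterwards.
func PrintStuff(stuff string) {
	str := C.CString(stuff)
	defer C.free(unsafe.Pointer(str))
	C.GoPrintf(str)
}

func main() { PrintStuff("hello from cgo\n") }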

-Ostsol

Adam Langley

Nov 30, 2009, 8:45:32 PM

to Ostsol, golang-nuts

On Mon, Nov 30, 2009 at 5:37 PM, Ostsol <ost...@gmail.com> wrote:
> Does that mean that calling a C function via a C wrapper written in
> the cgo file is less expensive than calling the C function directly?

No. Any time the process is running code with the native ABI, it needs
to do the context switch.

AGL

Russ Cox

Dec 1, 2009, 3:59:06 AM

to adam_smith, golang-nuts

There's always going to be some overhead.
It's more expensive than a simple function call but
significantly less expensive than a context switch
(agl is remembering an earlier implementation;
we cut out the thread switch before the public release).
Right now the expense is basically just having to
do a full register set switch (no kernel involvement).
I'd guess it's comparable to ten function calls.

You could always write some programs to measure
it and let us know what you find. ;-)

Russ

Charlie

Dec 1, 2009, 6:30:07 AM

to golang-nuts

Both functions just return

Averaged time over 1,000,000 iterations
Go subroutine execution time (ns): 10.585000
Extern C execution time (ns): 161.750000

The for loop is factored out of the Go subroutine figure.
The for loop plus the wrapping Go func are factored out of the Extern C
figure (assumed to be the same as the cost of calling simplefunc).
Using nanosecond (valid to microsecond)

Linux (Fedora 11 x64), amd be2300, 6x
Note: not seeing context switching while running
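
A minimal sketch of the kind of harness described above, assuming the method is "time the loop with the call, subtract an empty loop, divide by the iteration count". The names `simpleFunc`, `perCallNanos`, and `sink` are hypothetical, and only a Go function is timed here; a cgo call would be passed in as `fn` in the same way.

```go
package main

import (
	"fmt"
	"time"
)

// sink receives every result so the compiler cannot discard the timed calls.
var sink int

// simpleFunc just returns, like the functions in the measurement above.
//
//go:noinline
func simpleFunc() int { return 42 }

// perCallNanos times n calls of fn, subtracts the cost of an empty loop of
// the same length, and returns the remainder averaged per call.
// The result is clamped at zero because timer noise can exceed the signal.
func perCallNanos(n int, fn func() int) float64 {
	start := time.Now()
	for i := 0; i < n; i++ {
		sink += i
	}
	loopOnly := time.Since(start)

	start = time.Now()
	for i := 0; i < n; i++ {
		sink += fn()
	}
	withCall := time.Since(start)

	if withCall <= loopOnly {
		return 0
	}
	return float64(withCall-loopOnly) / float64(n)
}

func main() {
	fmt.Printf("Go call: %.2f ns\n", perCallNanos(1000000, simpleFunc))
}
```

Because `fn` is a function value, the figure includes an indirect call, so absolute numbers from this sketch are only indicative.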

adam_smith

Dec 1, 2009, 9:43:05 AM

to golang-nuts

I have to say I am a bit confused by what has been said in this thread.
At the moment it is not clear to me whether Adam Langley was right
about the thread switch, or whether Charlie's test involved a thread
switch. Adam says that calls are cheap when using the 6c compiler but
not with cgo. But the cgo documentation says it produces two files, one
for 6c and one for gcc (not that I understand the difference; I
thought a C file was a C file).

Further, it is not clear to me what the difference is between regular C
functions compiled with 6c and those made in the same style as the
runtime C functions. If I understand correctly the runtime C functions
use a number of structs (e.g. String) which correspond to Go types.
But I am not sure what the implications of this are. Are the runtime
functions compiled / linked in another way? Could I basically get low
overhead on calling C functions if I wrote the C code like they are
written in the runtime? Following the same conventions and using the
structs representing standard Go types? Would that avoid a thread
switch?

I am sorry if these are a lot of stupid questions; I just can't find a
single place where all these differences are explained properly. At the
moment my hunch is that this is a bit like the difference between
calling through an FFI and registering functions directly with the
runtime. In Mono, for example, you can use an FFI to call C code, but
you could also register a C function with the runtime if it followed
certain conventions (function signature and types of objects in the
argument list).

Russ Cox

Dec 1, 2009, 11:59:15 AM

to adam_smith, golang-nuts

On Tue, Dec 1, 2009 at 06:43, adam_smith <erik.e...@gmail.com> wrote:
> I got to say I am a bit confused by what has been said in this thread.
> At the moment it is not clear to me whether Adam Langley was right
> about the thread switch. And whether Charlie's test was with thread
> switch or not.

Adam Langley was remembering an earlier version of the system.

In my reply I said:

(agl is remembering an earlier implementation;
we cut out the thread switch before the public release).

(I wrote agl instead of Adam there because your mails
say "adam_smith" and I was trying to avoid confusion.)

> Adam says that when using 6c compiler it is cheap but
> not with cgo. But the cgo documentation says it produces two files one
> for 6c and one for gcc (not that I understand the difference. I
> thought a C file was a C file).

The 6c file is just glue: all the work happens in the gcc-compiled
file. The difference is that code has different calling conventions
depending on which compiler gets invoked. When writing cgo
extensions you're trying to call other C libraries that have been
built with gcc, so you need to compile your C code with gcc.
Cgo provides the bridge between the two different worlds.

> Further it is not clear to me the difference between regular C
> functions compiled with 6c and those made in the same style as the
> runtime C functions. If I understand correctly the runtime C functions
> use a number of structs (e.g. String) which correspond to Go types.
> But I am not sure what the implications of this are. Are the runtime
> functions compiled / linked in another way? Could I basically get low
> overhead on calling C functions if I wrote the C code like they are
> written in the runtime? Following the same conventions and using the
> structs representing standard Go types? Would that avoid a thread
> switch?

It would avoid a few register operations. Unless you need to
be able to do very high frequency calls, it's probably easier
to use cgo than to make your code too familiar with the
runtime data structures.

> I am sorry if this is a lot of stupid questions, I just can't find a
> single place were all these differences are explained properly. At the
> moment my hunch is that this is a bit like the difference between
> calling a FFI and registering functions directly with the runtime. In
> E.g. mono you can use a FFI interface to call C code but you could
> also register a C function with the runtime, if it followed certain
> conventions (function signature and types of objects in argument
> list).

That's about right. One complicating factor in Go is that if
you want to call any code that has been compiled with gcc
(e.g., any C library already installed on your system) you
have to use cgo.

Russ

Message has been deleted

Russ Cox

Dec 1, 2009, 5:13:26 PM

to inspector_jouve, golang-nuts

On Tue, Dec 1, 2009 at 13:36, inspector_jouve <kaush...@gmail.com> wrote:
> Any suggestions about assembler?

The first rule of writing assembly is don't.

> If I write a .s file, can it be compiled and linked in the regular way?
> What is the overhead in this case?

It's an ordinary function call.

> Are platform-optimized versions of low-level functions welcome?

Assembly is really only appropriate if two things are true:

1. There is a fundamental reason you can't get the
same performance out of portable Go.
2. The performance difference, in real programs,
is significant (say, 2x or more).

(More generally, added complexity always has to pay for itself.)

The math package has an assembly square root, because
otherwise you can't get at the hardware implementation,
and there's a big difference between the software and
hardware versions. Switching to assembly made an
actual program doing a reasonable computation
(test/bench/nbody) 3x faster.

The big package has assembly versions of the basic
bignum operations, because otherwise you can't get
at the special hardware instructions (like double-word
multiplications and division), and the software equivalent
slows down real programs using bignums. When we added
386 assembly to bignum, it made RSA operations 7x faster.

> But, for example, string conversions? I played with Itoa method (just
> to get a sense of performance) - I can see it can be easily optimized
> by about 15% just by writing specific version for uints in go (by
> simply cloning existing code with "uints" instead of uint64). A lot
> more can be done with assembler, of course. Question is whether it
> should be done at all or not.

I think strconv fails on both criteria: you can do perfectly well
enough in Go, and well-written real world programs aren't
bottlenecked by itoa speed.
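
For reference, the kind of specialization discussed in the quoted question can be written in portable Go. The sketch below is a hypothetical `utoa32` that uses only 32-bit arithmetic; it is not part of strconv, and whether it is measurably faster than `strconv.FormatUint` depends on the compiler and machine.

```go
package main

import (
	"fmt"
	"strconv"
)

// utoa32 formats v in base 10 using only 32-bit arithmetic.
// Hypothetical helper for illustration; it is not part of strconv.
func utoa32(v uint32) string {
	var buf [10]byte // a uint32 has at most 10 decimal digits
	i := len(buf)
	for v >= 10 {
		i--
		buf[i] = byte('0' + v%10)
		v /= 10
	}
	i--
	buf[i] = byte('0' + v)
	return string(buf[i:])
}

func main() {
	// Compare against the standard library for the largest uint32.
	fmt.Println(utoa32(4294967295) == strconv.FormatUint(4294967295, 10))
}
```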

Russ

Bob Cunningham

Dec 1, 2009, 8:24:37 PM

to golang-nuts

On 12/01/2009 02:13 PM, Russ Cox wrote:
> On Tue, Dec 1, 2009 at 13:36, inspector_jouve<kaush...@gmail.com> wrote:
>> Any suggestions about assembler?
>
> The first rule of writing assembly is don't.

Except, of course, when there is no alternative!

> ...

> Assembly is really only appropriate if two things are true:
>
> 1. There is a fundamental reason you can't get the
> same performance out of portable Go.
> 2. The performance difference, in real programs,
> is significant (say, 2x or more).

3. You need access to instructions that cannot
be emitted by the compiler (common in deeply
embedded programming). The Go compilers
don't (yet) generate SSE and MMX instructions.

> (More generally, added complexity always has to pay for itself.)

>...

>> But, for example, string conversions? I played with Itoa method (just
>> to get a sense of performance) - I can see it can be easily optimized
>> by about 15% just by writing specific version for uints in go (by
>> simply cloning existing code with "uints" instead of uint64). A lot
>> more can be done with assembler, of course. Question is whether it
>> should be done at all or not.
>
> I think strconv fails on both criteria: you can do perfectly well
> enough in Go, and well-written real world programs aren't
> bottlenecked by itoa speed.

Then again, it wasn't that long ago that CPUs contained dedicated string conversion instructions. Some mainframes still do!

-BobC

Dave Cheney

Feb 27, 2013, 11:25:56 PM

to c...@chrisbolton.me, golan...@googlegroups.com, erik.e...@gmail.com

Yes, there will be some overhead.

On 28/02/2013, at 15:15, c...@chrisbolton.me wrote:
> would creating bindings to a physics engine via cgo see any significant decrease in performance?


Philipp Schumann

Feb 28, 2013, 12:14:50 AM

to golan...@googlegroups.com, erik.e...@gmail.com

Do it anyway! Would love to see one. Thinking of Bullet?

No matter what overhead there is, it cannot be as "bad" as rewriting it in Go, or writing your game in C/C++ would be...

In a simple OpenGL+GLFW graphics loop I have, there are some 400-500 CGO calls per frame right now. While I'm certainly going to optimize that down to a fraction of this (not doing any proper batching yet, for instance), whatever tiny overhead there is doesn't seem to bottleneck me. In a 4ms frame, on average 3.9ms are currently spent GPU-side drawing (glfw.SwapBuffers). I haven't properly timed only the CGO "overhead" but it must be minuscule -- here's why:

look at the above stats posted over 3 years ago:

Averaged time over 1,000,000 iterations
Go subroutine execution time (ns): 10.585000
Extern C execution time (ns): 161.750000

So say the budget for physics is 1 or 2 ms. Since 1million cgo calls cost 160ns (on 3-years-ago hardware with a much older Go/CGo implementation), that would be well within the budget.

And if the physics engine required 1 million calls per frame, it would be a terrible API design...

On Thursday, February 28, 2013 12:15:31 PM UTC+8, Chris Bolton wrote:
> would creating bindings to a physics engine via cgo see any significant decrease in performance?

Chris Bolton

Feb 28, 2013, 12:51:45 AM

to golan...@googlegroups.com, erik.e...@gmail.com

Good to hear! And thinking of chipmunk actually :P.

Mikael Gustavsson

Mar 1, 2013, 7:28:22 AM

to golan...@googlegroups.com, erik.e...@gmail.com

I'm pretty sure 160ns is the time per call.

Philipp Schumann

Mar 1, 2013, 10:29:15 PM

to golan...@googlegroups.com, erik.e...@gmail.com

Well right now I do avg 396 cgo calls in avg 0.329ms.

That would be a shocking 830ns but I'm not really benchmarking here --- I'm doing some fair amounts of Go stuff in those 0.329ms and the CGO functions do some "actual work" (GPU driver / OS windowing) rather than being noops.

Not sure if that would be OK for a physics binding, depends on the API, depends on how much data is being calculated etc etc.

I highly doubt a 160ns vs. 10ns "overhead" would be the bottleneck anyway, again depending on how the API works. Ideally one should be able to batch and minimize calls, but not sure if Chipmunk or Bullet support that as well as OpenGL does...

minux

Mar 2, 2013, 11:27:44 AM

to Robert Zaremba, golan...@googlegroups.com, adam_smith, r...@golang.org

On Sat, Mar 2, 2013 at 5:42 PM, Robert Zaremba <robert....@zoho.com> wrote:

> And what happens when I'm using gccgo compiler?
> Do functions from the go compiled code and C libraries have the same calling conventions, thus the overhead is zero?

the overhead for translating between calling conventions and switching stacks is small; the major part of cgo overhead comes from coordination with the goroutine scheduler (for example, the scheduler might need to create a new OS thread to run other ready goroutines).

thus, even with gccgo, the overhead should be roughly the same.

Erwin

Mar 2, 2013, 12:13:47 PM

to minux, Robert Zaremba, golan...@googlegroups.com, adam_smith, r...@golang.org

> the overhead for translating between calling convention and switch stacks is small,
> the major part of cgo overhead comes from coordination with the goroutine scheduler.
> (for example, the scheduler might need to create new OS thread to run other ready
> goroutines)

has it been considered to allow a fast path for calling C functions? no idea if it is possible at all, but suppose one could label a C function as allowed to block the goroutine it is called from, so that the C call need not be run in its own thread. and/or having part of the runtime made so that it forms a zero overhead environment to call C code from?

i agree that with OpenGL and its facilities to reduce the number of library calls, the GO/C overhead isn't worrisome,

but wouldn't it be nice to be able to cooperate with C libraries very efficiently, even those that require many tiny function calls to do things.

bryanturley

Mar 2, 2013, 1:11:45 PM

to golan...@googlegroups.com, minux, Robert Zaremba, adam_smith, r...@golang.org

I think the problem is calling standard c on a goroutine stack. You need to switch to a c stack first.

Dave Cheney

Mar 2, 2013, 3:53:18 PM

to Erwin, minux, Robert Zaremba, golan...@googlegroups.com, adam_smith, r...@golang.org

You have to switch stacks during a cgo crosscall as there is no way to know the stack requirements of the C function you are calling.

Dave

Maxim Khitrov

Mar 2, 2013, 4:02:45 PM

to minux, Robert Zaremba, golan...@googlegroups.com, adam_smith, r...@golang.org

On Sat, Mar 2, 2013 at 11:27 AM, minux <minu...@gmail.com> wrote:
> the major part of cgo overhead comes from coordination with the goroutine
> scheduler.
> (for example, the scheduler might need to create new OS thread to run other
> ready
> goroutines)

I don't really understand why this is necessary. The current scheduler
is non-preemptive, so even in a pure Go program one goroutine can
prevent others from running by not executing any channel operations,
system calls, ... not sure what else may cause a switch. Why not do
the absolute minimum work necessary for a C call? If there are no
threads to run other goroutines, then that's a problem that should be
addressed via runtime.GOMAXPROCS. Am I missing something?

- Max

bryanturley

Mar 2, 2013, 4:08:36 PM

to golan...@googlegroups.com, minux, Robert Zaremba, adam_smith, r...@golang.org

From the FAQ

"To make the stacks small, Go's run-time uses segmented stacks. A newly minted goroutine is given a few kilobytes, which is almost always enough. When it isn't, the run-time allocates (and frees) extension segments automatically. The overhead averages about three cheap instructions per function call. It is practical to create hundreds of thousands of goroutines in the same address space. If goroutines were just threads, system resources would run out at a much smaller number."

Go's segmented stacks != standard C stacks
The C code won't know how to grow the stack if it needs more than "a few kilobytes".

I think the current OS thread that the goroutine is running on has to be locked while you jump in and out of c code for signals as well.
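
The FAQ's point about cheap goroutine stacks can be tried with a short sketch that keeps many goroutines alive at once. `spawn` is a hypothetical helper; note that later Go releases replaced segmented stacks with growable contiguous ones, so the mechanism differs from the FAQ text while the small initial stack remains.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// spawn starts n goroutines, holds them all parked until the last one has
// been created, then releases them and returns how many ran to completion.
func spawn(n int) int64 {
	var wg sync.WaitGroup
	var finished int64
	release := make(chan struct{})
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-release // park until every goroutine exists
			atomic.AddInt64(&finished, 1)
		}()
	}
	close(release)
	wg.Wait()
	return finished
}

func main() {
	fmt.Println(spawn(100000)) // 100000
}
```

A C function called on one of these small stacks has no way to grow it, which is why the call has to move to a separate, conventionally sized stack first.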

Maxim Khitrov

Mar 2, 2013, 4:14:40 PM

to bryanturley, golan...@googlegroups.com, minux, Robert Zaremba, adam_smith, r...@golang.org

I understand that the stack needs to be switched, but minux implied
that this isn't where most of the overhead comes from. The call to a C
function may also spawn a new thread, which would be used for running
other goroutines. Erwin and I are just wondering whether this is
really necessary.

- Max

minux

Mar 2, 2013, 4:37:17 PM

to Maxim Khitrov, bryanturley, golan...@googlegroups.com, Robert Zaremba, adam_smith, r...@golang.org

On Sun, Mar 3, 2013 at 5:14 AM, Maxim Khitrov <m...@mxcrypt.com> wrote:

> I understand that the stack needs to be switched, but minux implied
> that this isn't where most of the overhead comes from. The call to a C
> function may also spawn a new thread, which would be used for running
> other goroutines. Erwin and I are just wondering whether this is
> really necessary.

basically, once a goroutine enters cgo, it's considered blocking, so it is not counted in the $GOMAXPROCS limit, and so the goroutine scheduler might need to create a new OS thread to host other ready goroutines.
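
The two quantities involved here can be inspected from a running program. The sketch below uses `runtime.GOMAXPROCS(0)`, which queries the limit without changing it, and `runtime.NumCgoCall`, which reports the number of cgo calls made so far; `schedulerSnapshot` is a hypothetical wrapper added for this illustration.

```go
package main

import (
	"fmt"
	"runtime"
)

// schedulerSnapshot reports the current GOMAXPROCS limit and the number of
// cgo calls the process has made so far.
func schedulerSnapshot() (procs int, cgoCalls int64) {
	return runtime.GOMAXPROCS(0), runtime.NumCgoCall()
}

func main() {
	procs, cgoCalls := schedulerSnapshot()
	fmt.Println("GOMAXPROCS:", procs)
	fmt.Println("cgo calls so far:", cgoCalls)
}
```

Neither value shows how many OS threads exist; threads parked inside cgo calls sit outside the GOMAXPROCS count, as described above.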

Erwin

Mar 2, 2013, 5:38:23 PM

to minux, Maxim Khitrov, bryanturley, golan...@googlegroups.com, Robert Zaremba, adam_smith, r...@golang.org

see also:
https://groups.google.com/d/msg/golang-nuts/NNaluSgkLSU/0bq1kXZueCwJ

that's an interesting read. a significant performance boost after russ applied the diffs that remove parts of the cgo work that need not be done when the c functions are known to return quickly and don't call back into go. again, what if one could tag such c functions, and have a fast path for them in cgo? seems doable?

Maxim Khitrov

Mar 2, 2013, 6:11:51 PM

to minux, bryanturley, golan...@googlegroups.com, Robert Zaremba, adam_smith, r...@golang.org, Erwin

I tried to test the impact of removing entersyscall()/exitsyscall()
around asmcgocall(fn, arg) in cgocall.c (Russ's first patch), but
all.bat gets stuck when building runtime/cgo. No errors or any other
messages, it just sits there not doing anything.

To get past this, I added a new function to the runtime package that
can disable these calls at run time. The diff is below (for go 1.0.3).
I then ran a quick test that just called a C function, which didn't do
any real work:

http://play.golang.org/p/0Nc6Dlj6lU

Here are the results:

Before the patch: 820.0469ms
After the patch (syscall on): 841.0481ms
After the patch (syscall off): 326.0186ms

That's a pretty big difference. I don't think Russ's second patch can
be applied in the general case, since callbacks have to be supported,
but I would seriously consider either disabling
entersyscall()/exitsyscall() by default, or providing a function like
the one I added in the runtime package for disabling these calls at
run time.

The scheduler interaction makes sense for system calls, but I don't
think this behavior is appropriate when executing any and all C
functions.

- Max

diff -r 2d8bc3c94ecb src/pkg/runtime/cgocall.c
--- a/src/pkg/runtime/cgocall.c Fri Sep 21 17:10:44 2012 -0500
+++ b/src/pkg/runtime/cgocall.c Sat Mar 02 18:05:01 2013 -0500
@@ -84,6 +84,8 @@

void *initcgo; /* filled in by dynamic linker when Cgo is available */

+static bool cgosyscall = true;
+
static void unlockm(void);
static void unwindm(void);

@@ -131,9 +133,11 @@
* so it is safe to call while "in a system call", outside
* the $GOMAXPROCS accounting.
*/
- runtime·entersyscall();
+ if (cgosyscall)
+ runtime·entersyscall();
runtime·asmcgocall(fn, arg);
- runtime·exitsyscall();
+ if (cgosyscall)
+ runtime·exitsyscall();

if(d.nofree) {
if(g->defer != &d || d.fn != (byte*)unlockm)
@@ -151,6 +155,12 @@
}

void
+runtime·CgoAsSyscall(bool enable)
+{
+ cgosyscall = enable;
+}
+
+void
runtime·NumCgoCall(int64 ret)
{
M *m;
diff -r 2d8bc3c94ecb src/pkg/runtime/debug.go
--- a/src/pkg/runtime/debug.go Fri Sep 21 17:10:44 2012 -0500
+++ b/src/pkg/runtime/debug.go Sat Mar 02 18:05:01 2013 -0500
@@ -32,6 +32,9 @@
// NumGoroutine returns the number of goroutines that currently exist.
func NumGoroutine() int

+// CgoAsSyscall determines whether cgo calls are treated the same as syscalls.
+func CgoAsSyscall(enable bool)
+
// MemProfileRate controls the fraction of memory allocations
// that are recorded and reported in the memory profile.
// The profiler aims to sample an average of

Maxim Khitrov

Mar 2, 2013, 6:20:54 PM

to minux, bryanturley, golan...@googlegroups.com, Robert Zaremba, adam_smith, r...@golang.org, Erwin

Forgot to mention that calling the regular Go test() function takes
35.002ms in my example (with inlining disabled), so cgo is still 9x
slower, but that's a lot better than 24x.
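
The "inlining disabled" baseline matters because a trivial Go function would otherwise be inlined and cost almost nothing. A minimal sketch of one way to set up such a baseline in current Go, using the `//go:noinline` directive (building with `-gcflags=-l` disables inlining globally instead). The names `plain`, `pinned`, and `timeLoop` are hypothetical, and this is not the code behind the playground link above.

```go
package main

import (
	"fmt"
	"time"
)

// sink receives every result so the loops cannot be optimized away entirely.
var sink int

// plain is small enough that the compiler is free to inline it.
func plain() int { return 42 }

// pinned is identical, but the directive asks the compiler not to inline it.
//
//go:noinline
func pinned() int { return 42 }

// timeLoop runs n calls of one of the two functions and returns the elapsed time.
func timeLoop(n int, usePinned bool) time.Duration {
	start := time.Now()
	if usePinned {
		for i := 0; i < n; i++ {
			sink += pinned()
		}
	} else {
		for i := 0; i < n; i++ {
			sink += plain()
		}
	}
	return time.Since(start)
}

func main() {
	const n = 10000000
	fmt.Println("inlinable:", timeLoop(n, false))
	fmt.Println("noinline: ", timeLoop(n, true))
}
```

Only the non-inlined figure is a fair baseline for a cgo call, since a call into C can never be inlined.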

- Max

Archos

Mar 2, 2013, 6:54:33 PM

to golan...@googlegroups.com, minux, bryanturley, Robert Zaremba, adam_smith, r...@golang.org, Erwin

On Saturday, March 2, 2013 23:20:54 UTC, Maxim Khitrov wrote:

> http://play.golang.org/p/0Nc6Dlj6lU
>
> Here are the results:
>
> Before the patch: 820.0469ms
> After the patch (syscall on): 841.0481ms
> After the patch (syscall off): 326.0186ms

> Forgot to mention that calling the regular Go test() function takes
> 35.002ms in my example (with inlining disabled), so cgo is still 9x
> slower, but that's a lot better than 24x.

With those numbers in mind, I would argue that a Go library is always going to be faster than a binding to a C library, even for libraries related to maths/graphics, since Go is only about 3x slower than C (http://benchmarksgame.alioth.debian.org/u64/which-programs-are-best.php).

bryanturley

Mar 2, 2013, 7:58:16 PM

to golan...@googlegroups.com, minux, bryanturley, Robert Zaremba, adam_smith, r...@golang.org, Erwin

From that page

Selected and weighted 'how-many-times-more compared to the-program-that-used-least' scores are compressed into one number, the weighted geometric mean, at the risk of being "neat, plausible, and wrong".

They are claiming to be at the risk of being "neat, plausible, and wrong".
Therefore they have a risk of both
* plausible: having an appearance of truth or reason
* wrong: deviating from truth or fact; erroneous
And being neat

So... They might have the appearance of truth (appearances being deceiving), definitely not true, or clean/amusing...
The lack of confidence *the authors* of that data express, leads me to think I should completely ignore that data.

Robert Zaremba

Mar 2, 2013, 8:11:37 PM

to golan...@googlegroups.com, Robert Zaremba, adam_smith, r...@golang.org

Thanks,
The patch sounds great. I think it is better to move responsibility for thread blocking to the user's C code. If the code does a lot of computation, then the user needs to consider running it in a separate OS thread.

I still didn't get an answer to this: do functions from gccgo-compiled code and gcc-compiled C libraries have the same calling conventions?

bryanturley

Mar 2, 2013, 8:26:48 PM

to golan...@googlegroups.com, Robert Zaremba, adam_smith, r...@golang.org

On Saturday, March 2, 2013 7:11:37 PM UTC-6, Robert Zaremba wrote:

> Thanks,
> The patch sounds great. I think It is better to move responsibility for thread blocking to user C code. If the code does a lot of computation, then he needs to consider running it in separate OS thread.

I think it is just simpler to assume the worst case, when you jump into dynamically linked code you never really know what is on the other side.
I try to stay away from cgo with the exception of hardware specific code like opengl.
If you need to have an inner loop calling tons of short c functions perhaps write that inner loop in c as well?
Once you jump into c you are free to jump around c/other without penalty, assuming good code...

> Still didn't get an answer for Do functions from the gccgo compiled code and gcc compiled C libraries have the same calling conventions?

No and partially. C code can't return multiple values for instance.
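
The suggestion above to move a hot inner loop to the C side is about reducing the number of boundary crossings. The sketch below is a pure-Go model of that idea, not real cgo: `boundary` is a hypothetical type that only counts crossings, so the per-element and batched styles can be compared.

```go
package main

import "fmt"

// boundary stands in for a foreign-function interface and counts crossings.
// It is a pure-Go model; no C code is involved.
type boundary struct{ crossings int }

// addOne models a tiny foreign function called once per element.
func (b *boundary) addOne(acc, x int) int {
	b.crossings++
	return acc + x
}

// addAll models a foreign function that runs the whole inner loop itself.
func (b *boundary) addAll(xs []int) int {
	b.crossings++
	total := 0
	for _, x := range xs {
		total += x
	}
	return total
}

func main() {
	xs := []int{1, 2, 3, 4, 5}

	var perElem boundary
	sum := 0
	for _, x := range xs {
		sum = perElem.addOne(sum, x)
	}
	fmt.Println(sum, perElem.crossings) // 15 5

	var batched boundary
	total := batched.addAll(xs)
	fmt.Println(total, batched.crossings) // 15 1
}
```

With a fixed cost per crossing, the batched form pays it once regardless of the slice length.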

Ian Lance Taylor

Mar 2, 2013, 8:36:09 PM

to Robert Zaremba, adam_smith, golang-nuts, r...@golang.org

On Mar 2, 2013 5:11 PM, "Robert Zaremba" <robert....@zoho.com> wrote:
>
> Still didn't get an answer for Do functions from the gccgo compiled code and gcc compiled C libraries have the same calling conventions?

Yes, they do. Multiple return values are handled as a struct.

The cgo code for gccgo is simpler (take a look). But it still has to switch stacks.
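
At the source level, "multiple return values as a struct" corresponds to the pair of functions below. This sketch only shows the equivalence of the two shapes; it makes no claim about the exact layout gccgo generates, and `divmod`, `divmodStruct`, and `divmodResult` are hypothetical names.

```go
package main

import "fmt"

// divmod uses Go's multiple return values.
func divmod(a, b int) (q, r int) {
	return a / b, a % b
}

// divmodResult is the kind of struct a C-style signature would return instead.
type divmodResult struct{ q, r int }

// divmodStruct returns the same two values packed into one struct.
func divmodStruct(a, b int) divmodResult {
	return divmodResult{q: a / b, r: a % b}
}

func main() {
	q, r := divmod(17, 5)
	fmt.Println(q, r)                // 3 2
	fmt.Println(divmodStruct(17, 5)) // {3 2}
}
```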

Ian

Philipp Schumann

Mar 2, 2013, 8:39:17 PM

to golan...@googlegroups.com, Robert Zaremba, adam_smith, r...@golang.org

Here's a vote for a new runtime.CgoSyscallOff bool option... who's with me?

> If you need to have an inner loop calling tons of short c functions perhaps
> write that inner loop in c as well?

That sounds... painful to some of us not-so-hardcore Go-lovers-but-C-haters ;)

bryanturley

Mar 2, 2013, 8:52:47 PM

to golan...@googlegroups.com, Robert Zaremba, adam_smith, r...@golang.org

So in a perfect world of C code running on segmented stacks without thread-local storage, cgo is almost unnecessary?
I do know that is how the C in the runtime is written.

Dmitry Vyukov

Mar 3, 2013, 12:21:07 AM

to Maxim Khitrov, minux, bryanturley, golang-nuts, Robert Zaremba, adam_smith, Russ Cox, Erwin

What hardware are you using? I guess it's an old processor. On newer
processors the difference should be smaller.
Cgo call overhead with and w/o entersyscall is fine for episodic
and/or heavy C functions. And it is big for a C function returning 42
called in a tight loop both with and w/o entersyscall. So the
conclusion seems to be: do not call a C function returning 42 in a
tight loop.

bryanturley

Mar 3, 2013, 12:23:19 AM

to golan...@googlegroups.com, Maxim Khitrov, minux, bryanturley, Robert Zaremba, adam_smith, Russ Cox, Erwin

At least not without a towel.

Ian Lance Taylor

Mar 3, 2013, 12:36:43 AM

to bryanturley, golan...@googlegroups.com, Robert Zaremba, adam_smith, r...@golang.org

Even in that perfect world cgo is currently necessary, because you
need to tell the Go scheduler that you might be about to block in a
way that the Go scheduler does not understand.

When and if the Go scheduler gets smarter, then it is possible to
imagine that cgo would not be necessary for gccgo if you were able to
compile all of your C libraries with -fsplit-stack.

Ian

Archos

Mar 3, 2013, 3:34:10 AM

to golan...@googlegroups.com, minux, bryanturley, Robert Zaremba, adam_smith, r...@golang.org, Erwin

Sure, it is only a reference, but it is useful for getting an idea of the performance of each language there.
In my case, when I have to port a graphics library from Rust to Go (one that can be embedded into other software), tests will be made to check whether a pure Go library performs better than a binding to the Rust one.

minux

Mar 3, 2013, 8:42:31 AM

to Maxim Khitrov, bryanturley, golan...@googlegroups.com, Robert Zaremba, adam_smith, r...@golang.org, Erwin

On Sun, Mar 3, 2013 at 7:11 AM, Maxim Khitrov <m...@mxcrypt.com> wrote:

> The scheduler interaction makes sense for system calls, but I don't
> think this behavior is appropriate when executing any and all C
> functions.

if you agree that the scheduler interaction makes sense for syscalls, and also agree that a foreign C function could make syscalls (the runtime just won't know), why is the interaction not appropriate when executing a C function?

in theory, even if you set GOMAXPROCS high enough so that there are always free OS threads to run the goroutines, you still risk blocking the garbage collector for an indefinite amount of time when code is blocked in the C world (the GC needs to stop the world).

in summary, removing the scheduler interaction might actually work and increase performance for you, but we can't make that the default, nor can we make it optional, as making use of it requires non-trivial prior thought, and people will tend to abuse such a feature to increase performance without knowing its consequences.

minux

Mar 3, 2013, 8:45:01 AM

to Ian Lance Taylor, bryanturley, golan...@googlegroups.com, Robert Zaremba, adam_smith, r...@golang.org

On Sun, Mar 3, 2013 at 1:36 PM, Ian Lance Taylor <ia...@golang.org> wrote:

> Even in that perfect world cgo is currently necessary, because you
> need to tell the Go scheduler that you might be about to block in a
> way that the Go scheduler does not understand.
>
> When and if the Go scheduler gets smarter, then it is possible to
> imagine that cgo would not be necessary for gccgo if you were able to
> compile all of your C libraries with -fsplit-stack.

so the real solution to this problem is to adopt Dmitry's syscall/blocking monitor approach, so that the scheduler can know that a goroutine has blocked an OS thread and that it needs to start new OS threads to host other goroutines.

minux

Mar 3, 2013, 8:52:40 AM

to Philipp Schumann, golan...@googlegroups.com, Robert Zaremba, adam_smith, r...@golang.org

On Sun, Mar 3, 2013 at 9:39 AM, Philipp Schumann <philipp....@gmail.com> wrote:

> Here's a vote for a new runtime.CgoSyscallOff bool option... who's with me?

this short-term incomplete solution is going to hurt more than it cures.

if we adopt this suggestion, how can we document it? Something like this is totally unacceptable:

// CgoSyscallOff might make your cgo calls faster, but it might also deadlock your whole
// process, use it at your own risk.
var CgoSyscallOff bool

we really should fix the scheduler to make every case fast instead of providing a switch to make some cases fast at the expense of making some other cases deadlock.

Maxim Khitrov

Mar 3, 2013, 9:21:58 AM

to Dmitry Vyukov, minux, bryanturley, golang-nuts, Robert Zaremba, adam_smith, Russ Cox, Erwin

Intel i7-920. I got similar results on i7-2600:

After the patch (syscall on): 636.0636ms
After the patch (syscall off): 242.0242ms
Go function (no inlining): 25.0025ms

- Max
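
By these figures, the patched cgo call costs roughly 25x (syscall on) and 10x (syscall off) what the non-inlined Go call costs. A minimal sketch of how the Go-only baseline might be timed is below. It is not the benchmark used in this thread: the function `add`, the helper `timeCalls`, and the iteration count are assumptions made for illustration. A cgo variant would replace `add` with a call into C.

```go
package main

import (
	"fmt"
	"time"
)

// add is a hypothetical stand-in for the function under test.
// It is kept out of line so that every iteration pays for a real call.
//
//go:noinline
func add(a, b int) int {
	return a + b
}

// timeCalls calls add n times and returns the running sum and the elapsed time.
func timeCalls(n int) (int, time.Duration) {
	start := time.Now()
	sum := 0
	for i := 0; i < n; i++ {
		sum = add(sum, 1)
	}
	return sum, time.Since(start)
}

func main() {
	const n = 10000000 // assumed iteration count, not taken from the thread
	sum, elapsed := timeCalls(n)
	perCall := float64(elapsed.Nanoseconds()) / float64(n)
	fmt.Printf("%d calls: %v total, %.2f ns/call\n", sum, elapsed, perCall)
}
```

Returning the sum keeps the loop's result live, so the compiler has less room to discard the work being measured.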

Maxim Khitrov

Mar 3, 2013, 9:32:44 AM3/3/13

to minux, Philipp Schumann, golan...@googlegroups.com, Robert Zaremba, adam_smith, r...@golang.org

If it's possible to have the scheduler monitor what the other threads
are doing, and to launch new threads without blocking any running
ones, then I agree that this would be the optimal solution.

Can you clarify how my patch could lead to a deadlock? I can see it
preventing other goroutines from running while in cgo, but assuming
that cgo returns eventually, wouldn't the scheduler just pick up where
it left off?

- Max

bryanturley

Mar 3, 2013, 1:42:11 PM3/3/13

to golan...@googlegroups.com, minux, Philipp Schumann, Robert Zaremba, adam_smith, r...@golang.org

How many C libraries did you test your patch with?
And does the patch turn off the blocking globally or per goroutine?

A good bit of the core Go runtime is written in C that doesn't block. But it is known not to block because it doesn't call anything external.
If your external C code blocks but doesn't inform the Go scheduler, CPU cycles can/will be wasted.

Say you have 4 cores each running a goroutine, and 2 goroutines that want to be run.
One of the running goroutines enters some cgo code without informing the scheduler and immediately blocks on a slow storage device.
Ignoring other things happening in the OS/system, that CPU could at that point have switched over to another goroutine that wasn't blocked instead of just sitting idle.
So you have potentially cut your overall performance by 1/4 in that time frame.

I believe (and someone who actually knows, correct me) that the runtime is written to notify itself when a goroutine is about to make a system call that can or will block (a file read, for instance). With this notification it can switch to a runnable goroutine.
To get cgo to work at that level you would need perfect knowledge of what you are calling, and you would have to call the same notification mechanisms in your cgo code, making cgo a good bit more complex.
I have been writing a bit of xlib/opengl lately and I have no clue what is in this binary blob opengl driver and no source to look at for it. When will it block or not block doing its weird syscalls? So the safest bet is to just say it always blocks.
Also, it may not block now, but it is a .so and gets updated very frequently to newer versions. The newer version may block where the older version did not.

Like I said in an earlier post, I avoid using cgo directly unless it is for something hardware specific. Not everyone will be able to, though; I wouldn't want to rewrite a C-based database client library, for instance.
Additionally, I am not having a problem with cgo overhead nor am I foreseeing one.

Ian Lance Taylor

Mar 3, 2013, 2:14:23 PM3/3/13

to Maxim Khitrov, minux, Philipp Schumann, golan...@googlegroups.com, Robert Zaremba, adam_smith, r...@golang.org

On Sun, Mar 3, 2013 at 6:32 AM, Maxim Khitrov <m...@mxcrypt.com> wrote:
>
> Can you clarify how my patch could lead to a deadlock? I can see it
> preventing other goroutines from running while in cgo, but assuming
> that cgo returns eventually, wouldn't the scheduler just pick up where
> it left off?

Some C libraries use locks in the C code as well. The C code might
try to acquire a lock and thus not return until some other C code
runs. That other C code might be routinely invoked by a goroutine.
Unfortunately, since the first C code has now blocked the scheduler,
that other goroutine will never run.

That probably sounds abstract, but the gccgo SWIG implementation used
to not inform the scheduler that it was entering C code. This caused
several different SWIG-using programs to deadlock in this way until I
fixed SWIG for gccgo.

Ian

Isaac Gouy

Mar 3, 2013, 7:12:23 PM3/3/13

to golan...@googlegroups.com, bryanturley, r...@golang.org

I find the most tedious part of curating the benchmarks game is responding to instances of mis-reading that seem so egregious they fall somewhere between mischievous and malicious.

Last week, the author of "Why Code in C Anymore?" in Dr Dobb's misread text on the summary page and claimed the benchmarks game measurements "show that C++ (running on 32-bit Linux) runs the series of tests 27% slower than C". That table of summary statistics has now been removed because too many readers seem unable to make sensible use of basic descriptive statistics.

Now, here, we have Archos pointing to an obscure table of geometric means, supposedly as some-kind-of evidence to do with binding to a C library -- instead of actually finding data to do with binding to a C library. (Hint -- regex-dna.c in go/test/bench/shootout uses pcre, maybe compare it with a Go program that uses a pcre binding.)

And, here, we have bryanturley playing dismissive word games with a H.L.Mencken quote shown on that page.

bryanturley

Mar 3, 2013, 7:23:24 PM3/3/13

to golan...@googlegroups.com, bryanturley, r...@golang.org

On Sunday, March 3, 2013 6:12:23 PM UTC-6, Isaac Gouy wrote:
> On Saturday, March 2, 2013 4:58:16 PM UTC-8, bryanturley wrote:
> > On Saturday, March 2, 2013 5:54:33 PM UTC-6, Archos wrote:
> > > Having in mind those data I would argue that a Go library is always going to be faster than a binding to C library, even in libraries related to maths/graphics, since Go is about 3x slower than C (http://benchmarksgame.alioth.debian.org/u64/which-programs-are-best.php).
> >
> > From that page
> >
> > > Selected and weighted 'how-many-times-more compared to the-program-that-used-least' scores are compressed into one number, the weighted geometric mean, at the risk of being "neat, plausible, and wrong".
> >
> > They are claiming to be "at the risk of being \"neat, plausible, and wrong\"."
> > Therefore they have a risk of both
> > * plausible: having an appearance of truth or reason
> > * wrong: deviating from truth or fact; erroneous
> > And being neat
> >
> > So... They might have the appearance of truth (appearances being deceiving), definitely not true, or clean/amusing...
> > The lack of confidence *the authors* of that data express, leads me to think I should completely ignore that data.
>
> I find the most tedious part of curating the benchmarks game is responding to instances of mis-reading that seem so egregious they fall somewhere between mischievous and malicious.
>
> Last week, the author of "Why Code in C Anymore?" in Dr Dobb's misread text on the summary page and claimed the benchmarks game measurements "show that C++ (running on 32-bit Linux) runs the series of tests 27% slower than C". That table of summary statistics has now been removed because too many readers seem unable to make sensible use of basic descriptive statistics.
>
> Now, here, we have Archos pointing to an obscure table of geometric means, supposedly as some-kind-of evidence to do with binding to a C library -- instead of actually finding data to do with binding to a C library. (Hint -- regex-dna.c in go/test/bench/shootout uses pcre, maybe compare it with a Go program that uses a pcre binding.)
>
> And, here, we have bryanturley playing dismissive word games with a H.L.Mencken quote shown on that page.

I dismiss your dismissing of my dismissive word games.
;)

Archos

Mar 4, 2013, 3:21:09 AM3/4/13

to golan...@googlegroups.com, bryanturley, r...@golang.org

If you think that the Shootout benchmark is irrelevant, then you should ask the Go team not to run those tests and to have Go removed from that page, because as long as Go is on that page there will always be comparisons with other languages.

Another thing is that most people can take it as a reference, and you can make your own benchmarks for specific tasks when you need them.

minux

Mar 4, 2013, 1:46:21 PM3/4/13

to Maxim Khitrov, Philipp Schumann, golan...@googlegroups.com, Robert Zaremba, adam_smith, r...@golang.org

On Sun, Mar 3, 2013 at 10:32 PM, Maxim Khitrov <m...@mxcrypt.com> wrote:

> Can you clarify how my patch could lead to a deadlock? I can see it
> preventing other goroutines from running while in cgo, but assuming
> that cgo returns eventually, wouldn't the scheduler just pick up where
> it left off?

what i said is:

> <snip> providing a switch
> to make some cases fast at the expense of making some other cases deadlock.

one of the other cases is when the cgo call invokes callbacks.
you can see this when you unconditionally enable CgoSyscallOff and run
go test $GOROOT/misc/cgo/test -v

there might be other possibilities of deadlocking, but this is the most
obvious one.

Isaac Gouy

Mar 4, 2013, 6:37:03 PM3/4/13

to golan...@googlegroups.com, bryanturley, r...@golang.org

On Monday, March 4, 2013 12:21:09 AM UTC-8, Archos wrote:

> On Monday, March 4, 2013 00:12:23 UTC, Isaac Gouy wrote:
> > Now, here, we have Archos pointing to an obscure table of geometric means, supposedly as some-kind-of evidence to do with binding to a C library -- instead of actually finding data to do with binding to a C library. (Hint -- regex-dna.c in go/test/bench/shootout uses pcre, maybe compare it with a Go program that uses a pcre binding.)
>
> If you think that the Shootout benchmark is irrelevant then ...

I think you should refer to an example that shows binding to a C library.

http://benchmarksgame.alioth.debian.org/u64/program.php?test=regexdna&lang=go&id=7

Archos

Mar 5, 2013, 2:46:13 AM3/5/13

to golan...@googlegroups.com

Irrelevant because the regexp package is not optimized and it is known to be quite slow.
In fact, before it used a third-party library, that benchmark showed that the Go implementation of regex-dna was the slowest one, with a large difference in run time compared with the others.

Jan Mercl

Mar 5, 2013, 4:55:13 AM3/5/13

to Isaac Gouy, golang-nuts, bryanturley, r...@golang.org

On Tue, Mar 5, 2013 at 12:37 AM, Isaac Gouy <igo...@yahoo.com> wrote:
> I think you should refer to an example that shows binding to a C library.
>
> http://benchmarksgame.alioth.debian.org/u64/program.php?test=regexdna&lang=go&id=7

I think the regexp test is totally twisted the other way around. Go
regexp is AFAICS actually faster than pcre. Aka, why one avoids
O(n^2)? It's the worst case that matters, not the simple/easy one:
http://swtch.com/~rsc/regexp/regexp1.html

-j

Isaac Gouy

Mar 5, 2013, 11:17:13 AM3/5/13

to golan...@googlegroups.com

On Monday, March 4, 2013 11:46:13 PM UTC-8, Archos wrote:

> http://benchmarksgame.alioth.debian.org/u64/program.php?test=regexdna&lang=go&id=7
> Irrelevant because the regexp package is not optimized and it is known to be quite slow.

The question was -- "What is the overhead of calling a C function from Go?"

That Go program uses PCRE (not the Go regexp package).

Compare that program with a similar C program that uses PCRE.

Archos

Mar 5, 2013, 12:22:11 PM3/5/13

to golan...@googlegroups.com

You would have to compare a Go program using a binding to C (like that in github.com/glenn-brown) with another one in pure Go. My point is that this comparison is disadvantageous for Go because the C library, PCRE, has been heavily optimized over many years, while the regexp package is very young and known to be quite slow; in fact it is one of the packages with the worst performance in the Go standard library.

Kyle Lemons

Mar 5, 2013, 6:38:28 PM3/5/13

to Archos, golang-nuts

[citation needed]

Sure, it might have room to grow, but that's a completely different matter. As Jan said above, the fact that it runs in linear worst-case time, to me, means that it's a clear winner over anything that runs in quadratic worst-case time.

> in fact it is one of the packages with the worst performance in the Go standard library.

It's pointless to compare performance of packages in the standard library. They are, almost by definition, serving entirely different purposes. Reading a 1TB file from disk takes longer than doing a regexp takes longer than computing an exponent. This is not a surprise.

Robert Zaremba

Mar 9, 2013, 5:52:05 PM3/9/13

to golan...@googlegroups.com, minux, Philipp Schumann, Robert Zaremba, adam_smith, r...@golang.org, bryan...@gmail.com

To clarify the cgo behaviour: it is not worthwhile to use a C library wrapper for a lot of tiny calls.

My use case is to use rrdtool through a Go binding (using cgo - https://github.com/ziutek/rrd) for a high-traffic server.
When it has to handle 1k concurrent requests per second, and on each request I want to fire an event to rrdtool, will this rrd binding force the creation of 1k real OS threads?

Russ Cox

Mar 11, 2013, 10:12:15 AM3/11/13

to Robert Zaremba, golang-nuts, minux, Philipp Schumann, adam_smith, Bryan Turley

On Sat, Mar 9, 2013 at 5:52 PM, Robert Zaremba <robert....@zoho.com> wrote:

> My use case is to use rrdtool through a Go binding (using cgo - https://github.com/ziutek/rrd) for a high-traffic server.
> When it has to handle 1k concurrent requests per second, and on each request I want to fire an event to rrdtool, will this rrd binding force the creation of 1k real OS threads?

Only if each request executes for 1 second. If the calls are lightweight you can use a semaphore to limit the number of parallel cgo calls you have running.

Russ

Robert Zaremba

Mar 11, 2013, 11:18:25 AM3/11/13

to golan...@googlegroups.com, Robert Zaremba, minux, Philipp Schumann, adam_smith, Bryan Turley

Thanks,
So my statement is true?
And the right way to do it (when the request time is shorter than the time to launch a new OS thread) is to limit the number of concurrent cgo calls using an appropriate semaphore.

Russ Cox

Mar 11, 2013, 1:17:12 PM3/11/13

to Robert Zaremba, golang-nuts, minux, Philipp Schumann, adam_smith, Bryan Turley

On Mon, Mar 11, 2013 at 11:18 AM, Robert Zaremba <robert....@zoho.com> wrote:

> Thanks,
> So my statement is true?
> And the right way to do it (when the request time is shorter than the time to launch a new OS thread) is to limit the number of concurrent cgo calls using an appropriate semaphore.

Your statement is only true if enough cgo calls pile up blocked in the C code you are calling.

You only need 1000 threads if the C function you are calling takes long enough for 999 other functions to get into C and get stuck too.

But yes, using a semaphore would be a good way to impose a hard limit on the number of threads.

Russ
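
A minimal sketch of the semaphore approach suggested above, using a buffered channel as a counting semaphore. The names `maxInC`, `cgoSlots`, `limited`, and `updateRRD` are hypothetical, and `updateRRD` is a plain Go stand-in that sleeps rather than a real cgo call into rrdtool, so this shows only the limiting pattern, not the binding itself.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// maxInC is an assumed cap on simultaneous calls into C.
const maxInC = 4

// cgoSlots is a counting semaphore built from a buffered channel.
var cgoSlots = make(chan struct{}, maxInC)

// inFlight and peak record how many callers are inside the guarded section.
var inFlight, peak int32

// updateRRD stands in for a blocking cgo call such as an rrdtool update.
func updateRRD() {
	time.Sleep(time.Millisecond)
}

// limited runs fn while holding one semaphore slot, so at most maxInC
// calls to fn are in progress at any moment.
func limited(fn func()) {
	cgoSlots <- struct{}{}        // acquire a slot, blocking if all are taken
	defer func() { <-cgoSlots }() // release the slot when done

	n := atomic.AddInt32(&inFlight, 1)
	for {
		p := atomic.LoadInt32(&peak)
		if n <= p || atomic.CompareAndSwapInt32(&peak, p, n) {
			break
		}
	}
	fn()
	atomic.AddInt32(&inFlight, -1)
}

// runRequests simulates n concurrent requests and returns the highest
// number of simultaneous guarded calls observed so far.
func runRequests(n int) int32 {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			limited(updateRRD)
		}()
	}
	wg.Wait()
	return atomic.LoadInt32(&peak)
}

func main() {
	fmt.Println("peak concurrent calls:", runRequests(1000))
}
```

Goroutines that cannot get a slot wait in Go, where waiting is cheap, instead of each tying up an OS thread inside C. The cap is a tuning choice: a small value bounds thread creation but also bounds throughput if the C calls really do block.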
