Russ
I agree, but I want to comment that cgo is not the same as libffi.
The libffi library permits you to dynamically build a function call at
runtime. It is more like reflect.Call than like cgo.
In fact gccgo uses libffi to implement reflect.Call.
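To make the distinction concrete, here is a minimal sketch of a call that is assembled at run time through the reflect package (the method is reflect.Value.Call). The add function and the callDynamic helper are made-up names for illustration only. With cgo, by contrast, the call site is fixed when the wrappers are generated.

```go
package main

import (
	"fmt"
	"reflect"
)

// add is a hypothetical target function used only for this illustration.
func add(a, b int) int { return a + b }

// callDynamic builds the argument list at run time and invokes fn through
// reflection, returning the results as plain interface values.
func callDynamic(fn interface{}, args ...interface{}) []interface{} {
	in := make([]reflect.Value, len(args))
	for i, a := range args {
		in[i] = reflect.ValueOf(a)
	}
	out := reflect.ValueOf(fn).Call(in)
	res := make([]interface{}, len(out))
	for i, v := range out {
		res[i] = v.Interface()
	}
	return res
}

func main() {
	fmt.Println(callDynamic(add, 2, 3)...) // prints 5
}
```

Nothing about the callee's signature is known to callDynamic at compile time, which is the property libffi provides for C functions.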
Ian
1. Mainly because of the C preprocessor and typedefs,
it's tricky for cgo to resolve some C.xxx references,
and cgo doesn't do as good a job as it should. This is
the issue that I notice most often, but I haven't had
time to write or find a better implementation.
2. Maybe cgo should have a different interface entirely.
Having written a handful of wrappers for C libraries in a
variety of languages, I find the cgo / Python ctypes / Haskell FFI
approach of "write code in the non-C language" refreshing.
But maybe there's something even better yet. I'd like to hear
what you prefer about the libffi interface and see code
snippets demonstrating the advantages.
3. A single cgo call is slower than an ordinary function call.
I looked into this. On my (two generations ago?) Mac Mini,
a no-op function call takes 5 nanoseconds, and a cgo no-op takes 200
nanoseconds. That agrees with your factor of 40.
A few people have said on this thread that cgo runs C calls
in a separate thread, and thus calling a cgo call involves a
thread switch. That is not true. Cgo does run C calls
on a separate stack, the one that was allocated by the
operating system and is thus assumed to be big enough
to run ordinary C code (the goroutine stacks are not).
The stack switch is cheap - just a couple of register moves - and also required
to write a correct libffi call. Otherwise you'll overflow the
goroutine stack and suffer silent, mysterious memory
corruption. (Just ask Alex about the Windows calls.)
The expense comes from coordinating with the scheduler,
which requires lock(&sched) and unlock(&sched) both
before and after the call, each of which does a ~50ns
atomic memory operation. A very clever implementation
might be able to get it down to one memory operation,
so 100 ns instead of 200 ns, but that doesn't seem like
it would satisfy your particular use case. What's more,
any libffi-based solution has to do the same thing, or else
if the program ever blocks inside the called C code,
the other goroutines will not get a chance to run unless
you've set GOMAXPROCS > # of blocked threads.
Even if you've set GOMAXPROCS big enough to
avoid the deadlock, if you don't coordinate with the scheduler,
the garbage collector is going to sit and wait for the
C calls to finish, which will cause its own deadlock or at
least slowdown, unless they're fairly responsive.
It's not something that can be easily brushed aside
or ignored in the general case.
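For anyone who wants to get a feel for this on their own machine, here is a rough model, not the runtime's actual code: a sync.Mutex stands in for the scheduler lock (an assumption), and a call that locks and unlocks it before and after is timed against a bare no-op. The names sched, noop, coordinatedNoop and nsPerOp are invented for the sketch, and the numbers on current hardware will not match the figures quoted above.

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

// sched is a stand-in for the runtime's scheduler lock (assumption).
var sched sync.Mutex

// noop is kept out of line so the call itself is what gets measured.
//
//go:noinline
func noop() {}

// coordinatedNoop mimics the shape described above: one lock/unlock pair
// before the call and one after it.
func coordinatedNoop() {
	sched.Lock()
	sched.Unlock()
	noop()
	sched.Lock()
	sched.Unlock()
}

// nsPerOp reports the average cost of one call to f, in nanoseconds.
func nsPerOp(f func()) float64 {
	r := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			f()
		}
	})
	return float64(r.T.Nanoseconds()) / float64(r.N)
}

func main() {
	fmt.Printf("plain call:       %.1f ns\n", nsPerOp(noop))
	fmt.Printf("coordinated call: %.1f ns\n", nsPerOp(coordinatedNoop))
}
```

The point is only the ratio between the two lines of output: the atomic operations, not the call, dominate the cost.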
The Windows DLL entry and the cgo entry code are
essentially the same, for what that's worth; it's just
wishful thinking to guess that one is more efficient than
the other.
If you're noticing the overhead, it means that you're
doing very little inside the called C functions.
Anything that does a modicum of work should not notice the
overhead, but I wouldn't put C calls in a tight loop like:
    for i := 0; i < C.vectorlen(v); i++ {
        s += C.vectorat(v, i)
    }
Something like that is definitely going to hurt.
What does your SDL program do in each called C function?
Russ
P.S. If you only call functions that return quickly
(they never block indefinitely), you can apply the first
diff below to cut out the scheduler coordination.
On my system that cuts the C no-op time to about 40 ns.
If you want to be more aggressive, and you never use
callbacks, then you can apply the second diff below,
which cuts out even more things that are needed in
general but perhaps not in your specific case. On my
system that cuts the C no-op time down to 20 ns.
$ hg diff cgocall.c
diff -r 483f23f89563 src/pkg/runtime/cgocall.c
--- a/src/pkg/runtime/cgocall.c Tue Jun 08 22:32:04 2010 -0700
+++ b/src/pkg/runtime/cgocall.c Wed Jun 09 01:45:23 2010 -0700
@@ -34,9 +34,7 @@
* M to run goroutines while we are in the
* foreign code.
*/
- ·entersyscall();
runcgo(fn, arg);
- ·exitsyscall();
m->lockedg = oldlock;
if(oldlock == nil)
$
$ hg diff .
diff -r 483f23f89563 src/pkg/runtime/amd64/asm.s
--- a/src/pkg/runtime/amd64/asm.s Tue Jun 08 22:32:04 2010 -0700
+++ b/src/pkg/runtime/amd64/asm.s Wed Jun 09 01:44:33 2010 -0700
@@ -299,14 +299,6 @@
MOVQ m, 16(SP)
MOVQ CX, 8(SP)
- // Save g and m values for a potential callback. The callback
- // will start running with on the g0 stack and as such should
- // have g set to m->g0.
- MOVQ m, DI // DI = first argument in AMD64 ABI
- // SI, second argument, set above
- MOVQ libcgo_set_scheduler(SB), BX
- CALL BX
-
MOVQ R13, DI // DI = first argument in AMD64 ABI
CALL R12
diff -r 483f23f89563 src/pkg/runtime/cgocall.c
--- a/src/pkg/runtime/cgocall.c Tue Jun 08 22:32:04 2010 -0700
+++ b/src/pkg/runtime/cgocall.c Wed Jun 09 01:44:33 2010 -0700
@@ -13,36 +13,7 @@
void
cgocall(void (*fn)(void*), void *arg)
{
- G *oldlock;
-
- if(initcgo == nil)
- throw("cgocall unavailable");
-
- ncgocall++;
-
- /*
- * Lock g to m to ensure we stay on the same stack if we do a
- * cgo callback.
- */
- oldlock = m->lockedg;
- m->lockedg = g;
- g->lockedm = m;
-
- /*
- * Announce we are entering a system call
- * so that the scheduler knows to create another
- * M to run goroutines while we are in the
- * foreign code.
- */
- ·entersyscall();
runcgo(fn, arg);
- ·exitsyscall();
-
- m->lockedg = oldlock;
- if(oldlock == nil)
- g->lockedm = nil;
-
- return;
}
// When a C function calls back into Go, the wrapper function will
$
Might there be a way to mark C calls that are known to be both cheap
and stack-light, so that they could be called quickly? I'm thinking of
things like fast random number generators or special math functions
where performance really is critical. I agree that in the general case, all this coordination
needs to be done, but wonder whether it might be worthwhile to be able
to flag certain functions as "cheap".
David
Ian replied to this on a different thread yesterday.
The issue is that if you put a / anywhere in the library name
then the linker refuses to use the search path.
If you have to generate a .so for a hierarchical path
(let's say gosqlite.googlecode.com/hg/sqlite) then if
you name the file gosqlite.googlecode.com/hg/sqlite.so
and just record that, the search path gets ignored.
If you name the file sqlite.so then you might collide with
a package named sqlite elsewhere in the tree.
We'll probably have to do something like rewrite / to _
and then install that, so that this example would install
an unrooted gosqlite.googlecode.com_hg_sqlite.so.
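The rewrite itself is trivial; a sketch of what it might look like follows. The function name flatLibName is made up, and nothing here is an existing tool's behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// flatLibName maps a hierarchical import path to a single file name by
// rewriting every "/" to "_" and appending ".so".
func flatLibName(importPath string) string {
	return strings.ReplaceAll(importPath, "/", "_") + ".so"
}

func main() {
	// prints gosqlite.googlecode.com_hg_sqlite.so
	fmt.Println(flatLibName("gosqlite.googlecode.com/hg/sqlite"))
}
```

Note that such a mapping is not one-to-one: the paths a_b/c and a/b_c would produce the same file name, so collisions become less likely rather than impossible.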
> 2) The c compiler supports dynamic imports and exports from and to
> the C world, but I think that certainly the go compiler, and perhaps
> also the assembler should support them. That would already simplify a
> lot of what cgo is doing. Any syntax would be fine for me, as
> long as the support is there.
I don't believe it would save anything. The complexity here
is not in getting the links right but in translating between the
different calling conventions (what registers, stack locations
get used for arguments and return values). 6g/6c/8g/8c have
one, and gcc has a different one.
> 3) Preferably, cgo shouldn't be a preprocessor but part of the syntax
> of the go compiler itself. Or if it should remain a preprocessor it
> should not expand the line count so syntax errors that the go compiler
> encounters in the cgo output are reported in the same place they are
> in the input file.
I don't think it will ever be part of the Go compiler.
But the line count issue is definitely a bug.
We've agreed on a format for the line comments -
//line filename:linenumber
but the gc compilers do not implement it.
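As an illustration of the comment format, the sketch below assumes a toolchain that does honor the directive (current gc releases do, as far as I know). The file name input.go and the line number 100 are invented for the example. The directive has to start in the first column, and it says that the line following it is line 100 of input.go.

```go
package main

import (
	"fmt"
	"path/filepath"
	"runtime"
)

// where reports the file base name and line that the compiler recorded
// for the runtime.Caller call inside it.
//
//line input.go:100
func where() (string, int) {
	_, file, line, _ := runtime.Caller(0)
	return filepath.Base(file), line
}

func main() {
	file, line := where()
	fmt.Printf("%s:%d\n", file, line)
}
```

Under that assumption the func line counts as input.go:100, so the runtime.Caller call is attributed to input.go:101 rather than to its real position in the generated file. The same mechanism lets compile errors point back at the cgo input.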
> 4) Of course, in the long term, go should probably support and allow
> us to generate and use Elf/PE libraries fully, but that's not
> something that can be done overnight.
It's very hard to do this. A big part of what makes Go useful is that
you can create tons of goroutines without worrying about stack
overflow. Go does this by using a very different stack discipline
than the standard ELF or PE ABI does. You can't have both.
Having wasted an hour today trying to figure out how to rewrite
a recursive function to be non-recursive so as to avoid a stack
overflow, I would much rather have the good stacks than
interoperability.
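A small sketch of what the stack discipline buys you: many goroutines, each recursing well past what a small fixed stack would hold, with no stack sizes specified anywhere. The counts (1000 goroutines, 1000 levels) and the function names are arbitrary choices for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// depth recurses n levels; each level needs its own stack frame.
func depth(n int) int {
	if n == 0 {
		return 0
	}
	return 1 + depth(n-1)
}

// deepInMany starts the given number of goroutines, each recursing to the
// given depth, and returns the sum of their results.
func deepInMany(goroutines, levels int) int {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		total int
	)
	for i := 0; i < goroutines; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			d := depth(levels)
			mu.Lock()
			total += d
			mu.Unlock()
		}()
	}
	wg.Wait()
	return total
}

func main() {
	fmt.Println(deepInMany(1000, 1000)) // prints 1000000
}
```

Each goroutine starts with a small stack that grows only as the recursion demands, which is exactly what a standard ELF or PE thread stack does not do.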
That said, very hard does not mean impossible, and Ian has been
working on support in both gcc and the new GNU linker gold to
do both stack growth and automatic conversions for calling
between code using the different stack conventions. But it's not
going to happen overnight. And on Windows, where we have no
control over the tools, it's unlikely to happen ever.
Cgo's goal is to make you not need to worry about any of this,
but it falls short at the moment.
Russ