Foreign Function Interface in Go and Assembly

Beoran

Jun 8, 2010, 6:58:47 AM

to golang-nuts

I was looking into implementing a foreign function interface for Go
using libffi and cgo, and although the first tests worked fine, the
results were not encouraging because the overhead was high. On my
system, a call to an empty function in a DLL through FFI and cgo was
47 times slower than a call to an empty function inside Go. A call
using cgo directly seems to be "only" 42 times slower than a call to
an empty function in Go. (I validated these results using a
statistical test with a 95% confidence interval.)

I want to use Go for game programming, so such delays are unacceptable
to me: the reason I'm considering Go instead of Ruby is that
calling C game libraries like SDL from Ruby is quite sluggish. I
really want to push the pedal to the metal. :)

So, that's why I decided to look into porting libffi to Go and (8a)
assembly. So far, I've had a minor success: on 386 Linux, I can
call an empty void(void) function in a shared library using the
assembly below. The overhead seems to be much lower than that of cgo:
only 5 times slower than a direct call to a do-nothing function inside
cgo.

So, this approach looks promising to me, but what I'd like to ask is:
what do I have to be careful of? I know Go uses a runtime stack, so
are there any things I need to be careful of when calling functions
in DLLs like I'm doing below? Any special care I need to take when
passing Go arguments to C functions? I also saw that cgo starts
another operating system thread to do its calls to C. What is the
reason for this? Is this some type of precaution?

Here is the assembly that worked for me:

TEXT ·CallCDECLVoid(SB),7,$0
	PUSHL BP // save all needed registers on the stack
	MOVL SP, BP
	PUSHL BX
	PUSHL SI
	PUSHL DI

	MOVL (BP), AX
	/* Call function, the pointer to which was on top of the stack. */
	CALL *AX

	POPL DI // restore all needed registers from the stack
	POPL SI
	POPL BX
	POPL BP // pairs with the PUSHL BP at entry
	RET

Kind Regards,

B.

brainman

Jun 8, 2010, 8:12:24 AM

to golang-nuts

Disclaimer: This is just as I saw it. I didn't write the code.

> ... I know Go uses a runtime stack, so

> are there any things I need to be careful of when calling to functions
> in dll's like I'm doing below? Any special care I need to take when
> passing go arguments to C functions?

If you meant to call Windows DLLs, then you could just use
syscall.Syscall and friends (see http://golang.org/src/pkg/syscall/zsyscall_windows_386.go
for samples). You probably know that all Windows DLLs use the 'stdcall'
calling convention. Also, goroutine stacks are small and will most
likely overflow if used to call external (DLL) functions, so you must
switch onto a bigger stack before calling into a DLL.

syscall.Syscall should do all that for you.

> ... I also saw that cgo starts

> another operating system thread to do it's calls to C. What is the
> reason for this? Is this some type of precaution?

The Go runtime will start a new OS thread whenever there is a chance
that the current thread will block while there are goroutines ready to
run. This happens just before a call into a DLL: the Go runtime
doesn't know how long the call will take, so it just starts a new
thread to keep the runtime going.

Alex

Beoran

Jun 8, 2010, 9:45:02 AM

to golang-nuts

Dear Alex,

Thanks for the info. I'm running on Linux, though I hope to make this
FFI portable. Looks like the work will be easier on Windows than on
Linux, where gcc uses a different calling convention (cdecl, aka sysv).
I'll take a look at the Windows syscall code to see how you guys manage
the stack there; I can probably do something similar on Linux.

As for the Go runtime starting a new thread... that's probably the
reason for the overhead, but that also kind of sucks. Is this *really*
necessary? I've called small functions that print "hello world" a
million times without setting up any threads, and the Go program didn't
seem any the worse for wear. When calling foreign functions from the
current thread, I would rather have the Go runtime sleep and yield
control wholly to the foreign function for maximal speed, even if that
is a bit unsafe. By the time the Go runtime and then the SDL game
library are in the right thread, my sprite's wings will have fallen
off. :)

Kind Regards,

B.

Russ Cox

Jun 8, 2010, 2:42:23 PM

to Beoran, golang-nuts

I think a better tack would be to understand what is
going on in the current ffi (aka cgo) and then fix
the performance problems. Porting another library
doesn't necessarily solve anything; it just makes
things more complex.

Russ

Ian Lance Taylor

Jun 8, 2010, 4:22:38 PM

to r...@golang.org, Beoran, golang-nuts

Russ Cox <r...@golang.org> writes:

I agree, but I want to comment that cgo is not the same as libffi.
The libffi library permits you to dynamically build a function call at
runtime. It is more like reflect.Call than like cgo.

In fact gccgo uses libffi to implement reflect.Call.
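
To make the comparison concrete, here is a minimal sketch of what "building a function call at runtime" looks like on the Go side, using the reflect package's Value.Call method. The function add and the helper callDynamically are hypothetical names chosen for illustration; this calls a Go function, not a C one, and says nothing about how gccgo wires libffi underneath.

```go
package main

import (
	"fmt"
	"reflect"
)

// add is a hypothetical target; any function value would do.
func add(a, b int) int { return a + b }

// callDynamically assembles the argument list at run time and invokes fn
// through reflection, so the caller needs no static knowledge of fn's
// signature. It panics if the arguments do not match fn's parameters.
func callDynamically(fn interface{}, args ...interface{}) []interface{} {
	in := make([]reflect.Value, len(args))
	for i, a := range args {
		in[i] = reflect.ValueOf(a)
	}
	out := reflect.ValueOf(fn).Call(in)
	results := make([]interface{}, len(out))
	for i, v := range out {
		results[i] = v.Interface()
	}
	return results
}

func main() {
	fmt.Println(callDynamically(add, 2, 3)[0])
}
```

The trade-off is the same one libffi makes: the call site is flexible, but every argument is boxed and checked at run time rather than at compile time.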

Ian

Beoran

Jun 9, 2010, 3:40:24 AM

to golang-nuts

Dear Ian,

Thanks for your comment, that was also my idea. I want to see if a
different approach than the one cgo takes would be useful.

Dear Russ,

I agree that I should know more about how cgo works. And I'm not
unwilling to contribute to fixing it. So I went and read the sources.
I have been reading through the source code of cgo, the runtime, the
8l linker, the 8c compiler, etc to get a grasp on what CGO is doing
and I must say that what it does is also rather complex, IMO.

I more or less understand all the steps it takes, but I have some
trouble understanding what the runtime's special-purpose functionality
for making a call into cgo is doing. It seems to set up a separate
thread for cgo to run in and also seems to do some stack magic that I
don't understand well. So, my problem with fixing cgo's speed is that
it takes so many different steps in wrapping and calling a C function
that I don't have a clue where the slowness could be. I'm guessing the
threading is doing it, but I'm unsure if that is the case or how I
could fix that. So I'd need the help of someone with more knowledge of
the runtime internals.

On another note, I've written over 5000 lines of cgo code with my
fungo project, and honestly, cgo feels a bit like a hack. It works,
but it's poorly integrated with Go, and sort of a pain to write for.
And now I found out it's a bit slow too, I'm starting to think that
perhaps a different approach is needed.

So, I looked at how another non-C compiler, namely Free Pascal, is
doing it, and they do it much like gccgo is doing it. The language has
the ability to declare that a function is external and should be
loaded from the C world and called using a certain calling convention,
and their runtime can call these C functions correctly. I think that
all in all, this may be a simpler approach. It does mean that the
runtime and Go compiler need to be modified and enhanced, but it also
would mean that we don't need the cgo program and libcgo anymore.

The linker already has support for dynamic imports and exports. And
already, on Windows, the runtime knows how to open DLLs, how to look
up symbols in them, and how to call those functions using the Windows
calling convention (stdcall). This is of course because under Windows,
system calls are performed by calling functions in DLLs. So on
Windows, you get FFI for free. Perhaps it should be the same on Linux
and other platforms?

Again, I'm willing to help fix cgo, but I think it's a good idea to
consider if it wouldn't be better to replace it with a different
approach, or at least complement it with different FFI
functionality.

Kind regards,

B.

Russ Cox

Jun 9, 2010, 4:48:11 AM

to Beoran, golang-nuts

You've raised at least three distinct issues.

1. Mainly because of the C preprocessor and typedefs,
it's tricky for cgo to resolve some C.xxx references,
and cgo doesn't do as good a job as it should. This is
the issue that I notice most often, but I haven't had
time to write or find a better implementation.

2. Maybe cgo should have a different interface entirely.
Having written a handful of wrappers for C libraries in a
variety of languages, I find the cgo/python ctypes/haskell ffi
approach of "write code in the non-C language" refreshing.
But maybe there's something even better yet. I'd like to hear
what you prefer about the libffi interface and see code
snippets demonstrating the advantages.

3. The performance of a single cgo call is slower than a function call.
I looked into this. On my (two generations ago?) Mac Mini,
a no-op takes 5 nanoseconds, and a cgo no-op takes 200
nanoseconds. That agrees with your factor of 40.

A few people have said on this thread that cgo runs C calls
in a separate thread, and thus calling a cgo call involves a
thread switch. That is not true. Cgo does run C calls
on a separate stack, the one that was allocated by the
operating system and is thus assumed to be big enough
to run ordinary C code (the goroutine stacks are not).
It's cheap - just a couple register moves - and also required
to write a correct libffi call. Otherwise you'll overflow the
goroutine stack and suffer silent, mysterious memory
corruption. (Just ask Alex about the Windows calls.)

The expense comes from coordinating with the scheduler,
which requires lock(&sched) and unlock(&sched) both
before and after the call, each of which does a ~50ns
atomic memory operation. A very clever implementation
might be able to get it down to one memory operation,
so 100 ns instead of 200 ns, but that doesn't seem like
it would satisfy your particular use case. What's more,
any libffi-based solution has to do the same thing, or else
if the program ever blocks inside the called C code,
the other goroutines will not get a chance to run unless
you've set GOMAXPROCS > # of blocked threads.
Even if you've set GOMAXPROCS big enough to
avoid the deadlock, if you don't coordinate with the scheduler,
the garbage collector is going to sit and wait for the
C calls to finish, which will cause its own deadlock or at
least slowdown, unless they're fairly responsive.
It's not something that can be easily brushed aside
or ignored in the general case.
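
As a rough illustration of where that time goes, the sketch below wraps a do-nothing call in two lock/unlock pairs and times it against a plain call. It is an analogy only: sched here is an ordinary sync.Mutex standing in for the scheduler lock, coordinatedCall is a hypothetical name, and none of this is the runtime's actual entersyscall/exitsyscall code. Absolute numbers will differ by machine and Go version.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// sched stands in for the scheduler lock (an assumption for illustration;
// it is not the runtime's own lock).
var sched sync.Mutex

//go:noinline
func noop() {}

// plainCall is the baseline: an ordinary function call.
func plainCall() { noop() }

// coordinatedCall takes and releases the lock before and after the call,
// mimicking the two rounds of scheduler coordination described above.
func coordinatedCall() {
	sched.Lock() // announce "entering foreign code"
	sched.Unlock()
	noop()
	sched.Lock() // announce "back from foreign code"
	sched.Unlock()
}

// perCall runs f n times and returns the average duration of one call.
func perCall(n int, f func()) time.Duration {
	start := time.Now()
	for i := 0; i < n; i++ {
		f()
	}
	return time.Since(start) / time.Duration(n)
}

func main() {
	const n = 1000000
	fmt.Println("plain call:      ", perCall(n, plainCall))
	fmt.Println("coordinated call:", perCall(n, coordinatedCall))
}
```

Even uncontended, each lock and unlock is an atomic memory operation, so the coordinated version pays four of them per call regardless of how little the called function does.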

The Windows DLL entry and the cgo entry code are
essentially the same, for what that's worth; it's just
wishful thinking to guess that one is more efficient than
the other.

If you're noticing the overhead, it means that you're
doing very little inside the called C functions.
Anything that does a modicum of work should not notice the
overhead, but I wouldn't put C calls in a tight loop like:

	for i := 0; i < C.vectorlen(v); i++ {
		s += C.vectorat(v, i)
	}

Something like that is definitely going to hurt.
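
A small sketch of why that loop hurts, and of the usual remedy of doing the whole loop on the C side. To keep it runnable without a C toolchain, the C functions are simulated in pure Go and a counter records each simulated Go-to-C transition; vectorLen, vectorAt, vectorSum, and countCrossings are hypothetical stand-ins, not a real library's API.

```go
package main

import "fmt"

// crossings counts simulated Go-to-C transitions. In a real program each
// one would pay the fixed cgo call overhead discussed above.
var crossings int

type vector []float64

// The three functions below stand in for C functions reached through cgo.
func vectorLen(v vector) int           { crossings++; return len(v) }
func vectorAt(v vector, i int) float64 { crossings++; return v[i] }

// vectorSum does the whole loop on the "C side" in a single transition.
func vectorSum(v vector) float64 {
	crossings++
	s := 0.0
	for _, x := range v {
		s += x
	}
	return s
}

// sumPerElement mirrors the loop above: the length is re-queried on every
// iteration and each element is fetched with its own call.
func sumPerElement(v vector) float64 {
	s := 0.0
	for i := 0; i < vectorLen(v); i++ {
		s += vectorAt(v, i)
	}
	return s
}

// countCrossings reports how many transitions each strategy needs for a
// vector of n elements.
func countCrossings(n int) (perElement, batched int) {
	v := make(vector, n)
	crossings = 0
	sumPerElement(v)
	perElement = crossings
	crossings = 0
	vectorSum(v)
	batched = crossings
	return perElement, batched
}

func main() {
	perElement, batched := countCrossings(1000)
	fmt.Println("per-element crossings:", perElement)
	fmt.Println("batched crossings:    ", batched)
}
```

For n elements the per-element loop makes 2n+1 transitions (n+1 length checks plus n element fetches), while the batched version makes one, so the fixed per-call cost is paid once instead of thousands of times.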

What does your SDL program do in each called C function?

Russ

P.S. If you only call functions that return quickly
(they never block indefinitely), you can apply the first
diff below to cut out the scheduler coordination.
On my system that cuts the C no-op time to about 40 ns.

If you want to be more aggressive, and you never use
callbacks, then you can apply the second diff below,
which cuts out even more things that are needed in
general but perhaps not in your specific case. On my
system that cuts the C no-op time down to 20 ns.

$ hg diff cgocall.c
diff -r 483f23f89563 src/pkg/runtime/cgocall.c
--- a/src/pkg/runtime/cgocall.c Tue Jun 08 22:32:04 2010 -0700
+++ b/src/pkg/runtime/cgocall.c Wed Jun 09 01:45:23 2010 -0700
@@ -34,9 +34,7 @@
* M to run goroutines while we are in the
* foreign code.
*/
- ·entersyscall();
runcgo(fn, arg);
- ·exitsyscall();

m->lockedg = oldlock;
if(oldlock == nil)
$

$ hg diff .
diff -r 483f23f89563 src/pkg/runtime/amd64/asm.s
--- a/src/pkg/runtime/amd64/asm.s Tue Jun 08 22:32:04 2010 -0700
+++ b/src/pkg/runtime/amd64/asm.s Wed Jun 09 01:44:33 2010 -0700
@@ -299,14 +299,6 @@
MOVQ m, 16(SP)
MOVQ CX, 8(SP)

- // Save g and m values for a potential callback. The callback
- // will start running with on the g0 stack and as such should
- // have g set to m->g0.
- MOVQ m, DI // DI = first argument in AMD64 ABI
- // SI, second argument, set above
- MOVQ libcgo_set_scheduler(SB), BX
- CALL BX
-
MOVQ R13, DI // DI = first argument in AMD64 ABI
CALL R12

diff -r 483f23f89563 src/pkg/runtime/cgocall.c
--- a/src/pkg/runtime/cgocall.c Tue Jun 08 22:32:04 2010 -0700
+++ b/src/pkg/runtime/cgocall.c Wed Jun 09 01:44:33 2010 -0700
@@ -13,36 +13,7 @@
void
cgocall(void (*fn)(void*), void *arg)
{
- G *oldlock;
-
- if(initcgo == nil)
- throw("cgocall unavailable");
-
- ncgocall++;
-
- /*
- * Lock g to m to ensure we stay on the same stack if we do a
- * cgo callback.
- */
- oldlock = m->lockedg;
- m->lockedg = g;
- g->lockedm = m;
-
- /*
- * Announce we are entering a system call
- * so that the scheduler knows to create another
- * M to run goroutines while we are in the
- * foreign code.
- */
- ·entersyscall();
runcgo(fn, arg);
- ·exitsyscall();
-
- m->lockedg = oldlock;
- if(oldlock == nil)
- g->lockedm = nil;
-
- return;
}

// When a C function calls back into Go, the wrapper function will
$

David Roundy

Jun 9, 2010, 9:51:51 AM

to r...@golang.org, Beoran, golang-nuts

On Wed, Jun 9, 2010 at 1:48 AM, Russ Cox <r...@golang.org> wrote:
> The expense comes from coordinating with the scheduler,
> which requires lock(&sched) and unlock(&sched) both
> before and after the call, each of which does a ~50ns
> atomic memory operation. A very clever implementation
> might be able to get it down to one memory operation,
> so 100 ns instead of 200 ns, but that doesn't seem like
> it would satisfy your particular use case. What's more,
> any libffi-based solution has to do the same thing, or else
> if the program ever blocks inside the called C code,
> the other goroutines will not get a chance to run unless
> you've set GOMAXPROCS > # of blocked threads.
> Even if you've set GOMAXPROCS big enough to
> avoid the deadlock, if you don't coordinate with the scheduler,
> the garbage collector is going to sit and wait for the
> C calls to finish, which will cause its own deadlock or at
> least slowdown, unless they're fairly responsive.
> It's not something that can be easily brushed aside
> or ignored in the general case.

Might there be a way to mark C calls that are known to be both cheap
and stack-light, so that they could be called quickly? I'm thinking of
things like fast random number generators or special math functions,
where performance really is critical. I agree that in the general
case, all this coordination needs to be done, but wonder whether it
might be worthwhile to be able to flag certain functions as "cheap".

David

Beoran

Jun 9, 2010, 2:07:25 PM

to golang-nuts

Dear Russ,

First of all, thanks for all this detailed information and for the
patches. It really gave me a better idea of what's going on behind the
scenes.

You're right, in hindsight I raised several issues about cgo, so let
me go over your points:

1. I know that what you say is the case, but that wasn't what I was
referring to. It's kind of hard to get C types right in any kind of
FFI, so that issue I worked around on my own. What actually annoyed me
the most about cgo is that it's a preprocessor, and not integrated
into the go compiler proper, so, if I made a syntax error in my cgo,
the go compiler would complain about another line, and then I'd have
to open the go file cgo generated to see where the error was, and
refer that back to the original file. This brings me to 2.

2. I fully agree that it's more fun to write interfaces to C in the
language you want to use. And that is probably the main thing I dislike
about cgo. It's still a preprocessor, so it's not integrated smoothly
into Go. It's too much of a halfway house for my taste.

From my previous experience in Ruby, I remembered there are two
different ways to call C from Ruby: one through extensions written
in C, and one through a binding to libffi that allows you to write
calls to C in Ruby itself.
See http://github.com/jacius/ruby-sdl-ffi/blob/master/lib/ruby-sdl-ffi/sdl/cdrom.rb
for one example of a part of the SDL library being wrapped like that
using nothing but Ruby.

So, that's one reason why I started to research if I could port libffi
to Go, as I felt it would allow writing bindings in Go directly
without needing a preprocessor. If you look at what libffi actually
is, it's a big heap of assembly with implementations of all sorts of
different calling conventions for all sorts of different platforms,
with some C on top for enabling these calls to be performed at runtime.
With all the information you gave me, I still think the main
advantage of a libffi-style approach over cgo is that you can load
libraries at run time and use different calling conventions at the
same time. Suppose we want to enable Go compiling down to dynamic
libraries; such a thing could be useful.

However, separately from all of this, my feeling is that we
should probably give the Go compiler and runtime the functionality and
knowledge they need to generate calls to C on their own, more or less
like gccgo is doing it. The cgo syntax could probably be largely kept
for backwards compatibility.

3. I only found out about this speed issue by accident. Thanks for
elucidating what is causing it. I agree that in the general case, you
wouldn't want your C function to block the Go threads and GC, that
you want to be able to do callbacks from C back into Go, and
that a libffi approach will not solve this. But in many specific
cases, like mine or that of David, it's OK, or even preferable, to
stop the Go threads and GC.

Not only in my specific case, but in the case of games in general, I'm
usually dealing with a C API (usually SDL or OpenGL, or perhaps DirectX
on Windows) which doesn't block, doesn't use any callbacks, and will be
used in tight loops. In such cases, I want the minimal overhead
for every call to the C API, I don't want any other threads to
interfere while I'm in the C call, and I certainly don't want the
garbage collector to run. :)

You say I shouldn't call C functions in a tight loop, but taking away
the Go layers I put on top of it, that's exactly what I need to do in
a game. While rendering my 2D game, I constantly need to call the
blitting function SDL_UpperBlit() in a tight loop to draw the tiles of
the map. In my case, calling SDL_UpperBlit through cgo takes about
2400 ns to blit a single 32x32 tile, which may seem fast. But I need to
call SDL_UpperBlit 300 times per layer of the map, for 4 to 6 map
layers, at least 60 times per second, which means that with 6 layers I
spend 259.2 milliseconds per second in blitting. A 100 ns speedup per
call reduces this to 248.4 milliseconds (and a 150 ns speedup to 243
milliseconds), giving me at least 10 more precious milliseconds for AI,
sound, music, physics engine, etc.
So, even your idea of a smart locking mechanism could be helpful, if
only a bit. Unfortunately, I am not smart enough to write this yet. :)
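
The arithmetic above can be checked with a few lines of Go. This is just the multiplication spelled out, assuming the worst case of 6 layers; blitMillisPerSecond is a hypothetical helper name.

```go
package main

import "fmt"

// blitMillisPerSecond returns how many milliseconds of every second are
// spent blitting, given the cost of one blit in nanoseconds, the number of
// tiles per layer, the number of layers, and the frame rate.
func blitMillisPerSecond(nsPerBlit, tilesPerLayer, layers, fps int) float64 {
	totalNs := nsPerBlit * tilesPerLayer * layers * fps
	return float64(totalNs) / 1e6
}

func main() {
	// Figures from the post: 300 tiles per layer, 6 layers, 60 frames/s.
	for _, ns := range []int{2400, 2300, 2250} {
		fmt.Printf("%d ns per blit: %.1f ms per second\n",
			ns, blitMillisPerSecond(ns, 300, 6, 60))
	}
}
```

With 108,000 blits per second, every 100 ns shaved off a call frees about 10.8 ms per second.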

But I also like David's idea. When I'm reading in my data with very
slow calls that may block, it's nice that other Go threads can keep
running (say, to play the music and display the "now loading" screen).
But the functions that do the blitting I'd rather not have slowed down
or interfered with. So, basically, it would be nice indeed to be able
to choose, mix and match which type of call Go uses to call C
functions. What this boils down to is implementing different calling
conventions, which brings us closer to a libffi-style approach
again.

Anyway, I have to work on all these ideas more to come up with some
more concrete proposals.

Kind Regards,

B.

Beoran

Jun 11, 2010, 3:57:12 AM

to golang-nuts, r...@golang.org

I've been looking in more detail at how cgo works, how the C compiler
implements the #pragma dynimport, and how the linker uses it, to see
how I could help with improving cgo or . I tried to see if I could not
extend the go compiler to also emit the dynimport section in the
object files, but I got confused. I feel most of this project is
written in "old style" C code, and I feel like I'm in over my head.

Still I think I have a few suggestions that could perhaps be
implemented by people smarter than me.

1) Currently, dynimport generally requires linking at runtime to a .so
library in a known location. This makes distributing binaries
difficult. As with normal ELF executables, it should use the OS
dynamic linker's search path.

2) The C compiler supports dynamic imports and exports from and to
the C world, but I think that certainly the Go compiler, and perhaps
also the assembler, should support them. That would already simplify a
lot of what cgo is doing. Any syntax would be fine with me, as
long as the support is there.

3) Preferably, cgo shouldn't be a preprocessor but part of the syntax
of the Go compiler itself. Or if it should remain a preprocessor, it
should not expand the line count, so that syntax errors that the Go
compiler encounters in the cgo output are reported in the same place
they are in the input file.

4) Of course, in the long term, Go should probably support and allow
us to generate and use ELF/PE libraries fully, but that's not
something that can be done overnight.

As for my own project, I'll try to do it as a separate package without
modifications to the go core and see if it produces any useful
results.

Russ Cox

Jun 11, 2010, 4:49:46 AM

to Beoran, golang-nuts

> 1) Currently, dynimport generally requires linking at runtime to a .so
> library in a known location. This makes distributing binaries
> difficult. As with normal ELF executables, it should use the OS
> dynamic linker's search path.

Ian replied to this on a different thread yesterday.
The issue is that if you put a / anywhere in the library name
then the linker refuses to use the search path.
If you have to generate a .so for a hierarchical path
(let's say gosqlite.googlecode.com/hg/sqlite) then if
you name the file gosqlite.googlecode.com/hg/sqlite.so
and just record that, the search path gets ignored.
If you name the file sqlite.so then you might collide with
a package named sqlite elsewhere in the tree.
We'll probably have to do something like rewrite / to _
and then install that, so that this example would install
an unrooted gosqlite.googlecode.com_hg_sqlite.so.
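
The rewrite described here is a one-line string transformation. A minimal sketch, assuming the rule is exactly "replace every / with _ and append .so"; flatLibraryName is a hypothetical helper, not something in the Go tree.

```go
package main

import (
	"fmt"
	"strings"
)

// flatLibraryName turns a hierarchical package path into a single file
// name without slashes, so the dynamic linker's search path still applies.
func flatLibraryName(importPath string) string {
	return strings.ReplaceAll(importPath, "/", "_") + ".so"
}

func main() {
	fmt.Println(flatLibraryName("gosqlite.googlecode.com/hg/sqlite"))
	// Prints: gosqlite.googlecode.com_hg_sqlite.so
}
```

Two distinct paths can still collide under this scheme if one of them already contains an underscore where the other has a slash, so a real implementation might need an escaping rule as well.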

> 2) The C compiler supports dynamic imports and exports from and to
> the C world, but I think that certainly the Go compiler, and perhaps
> also the assembler, should support them. That would already simplify a
> lot of what cgo is doing. Any syntax would be fine with me, as
> long as the support is there.

I don't believe it would save anything. The complexity here
is not in getting the links right but in translating between the
different calling conventions (what registers, stack locations
get used for arguments and return values). 6g/6c/8g/8c have
one, and gcc has a different one.

> 3) Preferably, cgo shouldn't be a preprocessor but part of the syntax
> of the Go compiler itself. Or if it should remain a preprocessor, it
> should not expand the line count, so that syntax errors that the Go
> compiler encounters in the cgo output are reported in the same place
> they are in the input file.

I don't think it will ever be part of the Go compiler.
But the line count issue is definitely a bug.
We've agreed on a format for the line comments -

//line filename:linenumber

but the gc compilers do not implement it.
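
To show what that comment format is meant to achieve, here is a small sketch using the go/parser and go/token packages from present-day Go releases, which do honor such comments when reporting positions (this is an assumption about the reader's toolchain, not a statement about the 2010 gc compilers discussed here). The file names generated.go and input.go and the helper originalPosition are made up for the example.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// generated imitates preprocessor output: the //line comment says that the
// line following it corresponds to line 42 of input.go.
const generated = `package p

//line input.go:42
func fromInput() {}
`

// originalPosition parses src and reports the file name and line that the
// first function declaration is attributed to after //line adjustment.
func originalPosition(src string) (string, int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "generated.go", src, 0)
	if err != nil {
		return "", 0, err
	}
	for _, decl := range file.Decls {
		if fn, ok := decl.(*ast.FuncDecl); ok {
			pos := fset.Position(fn.Pos())
			return pos.Filename, pos.Line, nil
		}
	}
	return "", 0, fmt.Errorf("no function declaration found")
}

func main() {
	name, line, err := originalPosition(generated)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%s:%d\n", name, line)
}
```

The declaration sits on line 4 of the generated text, but the position is reported against the preprocessor's input file, which is the behavior an error message would need.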

> 4) Of course, in the long term, Go should probably support and allow
> us to generate and use ELF/PE libraries fully, but that's not
> something that can be done overnight.

It's very hard to do this. A big part of what makes Go useful is that
you can create tons of goroutines without worrying about stack
overflow. Go does this by using a very different stack discipline
than the standard ELF or PE ABI does. You can't have both.
Having wasted an hour today trying to figure out how to rewrite
a recursive function to be non-recursive so as to avoid a stack
overflow, I would much rather have the good stacks than
interoperability.

That said, very hard does not mean impossible, and Ian has been
working on support in both gcc and the new GNU linker gold to
do both stack growth and automatic conversions for calling
between code using the different stack conventions. But it's not
going to happen overnight. And on Windows, where we have no
control over the tools, it's unlikely to happen ever.

Cgo's goal is to make you not need to worry about any of this,
but it falls short at the moment.

Russ
