does assembly pay the cgo transition cost / does runtime.LockOSThread() make CGO calls faster?


Jason E. Aten

Mar 1, 2019, 11:13:28 AM
to golang-nuts
If I include a chunk of assembly .s code in my Go code, does my program pay the CGO transition cost of locking and changing to a C stack?

I'm pretty sure the answer is no. But my knowledge of the Go internals is low enough that I thought I would like to confirm that before I go pulling in .s code.

Then for numerical work, if I need to call some Fortran 90 compiled library routines, does it not just look like assembly? I suppose not... in the sense that it makes certain assumptions about being able to grow the stack it is running on without running out of space. Are there assumptions beyond the stack that are used? Probably some "standard library" runtime dependencies? 

So on to my central question: Can the CGO transition stack switching cost be minimized by telling Go to run my main (single threaded) goroutine on a C thread to start with? In other words, can I minimize the CGO switching cost by doing runtime.LockOSThread() or similar?

It seems like if we are already on a C thread, then perhaps *some large part* of the cost of Go -> f90 can be avoided. Of course the Go compiler doesn't know we are on a C thread in the general case, right(?), so it probably can't optimize those transitions by omitting change-the-stack code.

Returning to the original thought about .s code, what would happen if that code tried to grow the stack too far? Just crash? Is there guidance about how far assembly can grow the stack before it needs to check back with the runtime?

Thanks!

J

Tamás Gulácsi

Mar 1, 2019, 11:54:11 AM
to golang-nuts
AFAIK asm parts are compiled into the binary, and no overhead is paid.

For cgo, it is not just stack growth, but also the differences between the C and Go calling conventions (the "ABI").
So LockOSThread may help, but won't eliminate all the overhead.

Do you really have to pay that overhead a lot of times? Can't you batch the calls?

Tamás Gulácsi

Ian Lance Taylor

Mar 1, 2019, 5:13:49 PM
to Jason E. Aten, golang-nuts
On Fri, Mar 1, 2019 at 8:13 AM Jason E. Aten <j.e....@gmail.com> wrote:
>
> If I include a chunk of assembly .s code in my Go code, does my program pay the CGO transition cost of locking and changing to a C stack?

No. The assembly function is simply called just as a Go function is called.


> I'm pretty sure the answer is no. But my knowledge of the Go internals is low enough that I thought I would like to confirm that before I go pulling in .s code.
>
> Then for numerical work, if I need to call some Fortran 90 compiled library routines, does it not just look like assembly? I suppose not... in the sense that it makes certain assumptions about being able to grow the stack it is running on without running out of space. Are there assumptions beyond the stack that are used? Probably some "standard library" runtime dependencies?

Go assembly code is compiled by the Go assembler, cmd/asm. The author
of the assembly code is required to specify how much stack space the
assembly function requires, in the TEXT pseudo-op that introduces the
function. For more about this, see https://golang.org/doc/asm. The
cmd/asm program will use that user declaration to insert a function
prologue that ensures that enough stack space is available, copying
the stack if necessary. The cmd/asm program will also produce a stack
map that the garbage collector will use when tracing back the stack;
in practice it's quite difficult for the assembler code to define this
stack map as anything other than "this stack contains no pointers",
but see runtime/funcdata.h.
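
To make the TEXT declaration concrete, here is a minimal sketch in Go. The assembly held in sampleAsm is a hypothetical amd64 leaf function (Add is not from this thread), and frameSizes is an illustrative helper, not part of any Go tool. It reads the $frame-args field of the TEXT line: the number before the dash is the local stack frame in bytes, and the number after it is the size of the arguments plus results. NOSPLIT asks the assembler to omit the stack-check prologue, which is only reasonable for a small frame like this one.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sampleAsm is a hypothetical amd64 Go assembly function, shown only as text.
// In a real package it would live in a .s file next to a Go declaration
// such as: func Add(a, b int64) int64
const sampleAsm = `#include "textflag.h"

// func Add(a, b int64) int64
TEXT ·Add(SB), NOSPLIT, $0-24
	MOVQ a+0(FP), AX
	ADDQ b+8(FP), AX
	MOVQ AX, ret+16(FP)
	RET
`

// frameSizes extracts the declared local frame size and argument size
// (both in bytes) from the first TEXT directive found in src.
func frameSizes(src string) (frame, args int, err error) {
	for _, line := range strings.Split(src, "\n") {
		line = strings.TrimSpace(line)
		if !strings.HasPrefix(line, "TEXT") {
			continue
		}
		i := strings.LastIndex(line, "$")
		if i < 0 {
			return 0, 0, fmt.Errorf("no frame size in %q", line)
		}
		// The field looks like "frame" or "frame-args".
		parts := strings.SplitN(line[i+1:], "-", 2)
		if frame, err = strconv.Atoi(parts[0]); err != nil {
			return 0, 0, err
		}
		if len(parts) == 2 {
			if args, err = strconv.Atoi(parts[1]); err != nil {
				return 0, 0, err
			}
		}
		return frame, args, nil
	}
	return 0, 0, fmt.Errorf("no TEXT directive found")
}

func main() {
	frame, args, err := frameSizes(sampleAsm)
	if err != nil {
		panic(err)
	}
	fmt.Printf("frame=%d args=%d\n", frame, args) // prints "frame=0 args=24"
}
```

Here the 24 bytes cover two int64 arguments and one int64 result, and the function declares no local stack space at all.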

In any case, your Fortran 90 code will have none of that information.
Of course, if you know the exact stack usage of your Fortran code, and
if the code never stores pointers on the stack, then you could with
some effort write Go assembly code that defines the appropriate stack
information and then calls the Fortran code. I think that would work, but
it would require a lot of manual hand-holding.


> So on to my central question: Can the CGO transition stack switching cost be minimized by telling Go to run my main (single threaded) goroutine on a C thread to start with? In other words, can I minimize the CGO switching cost by doing runtime.LockOSThread() or similar?
>
> It seems like if we are already on a C thread, then perhaps *some large part* of the cost of Go -> f90 can be avoided. Of course the Go compiler doesn't know we are on a C thread in the general case, right(?), so it probably can't optimize those transitions by omitting change-the-stack code.

This doesn't work, because the runtime expects to be able to use the C
thread stack in some cases to handle scheduling between goroutines.
You can't run general Go code on that stack. Note that LockOSThread
does not cause code to run on the C thread stack. It causes the
goroutine to only be scheduled on that thread.
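
As a small illustration of that last point, here is a sketch of what LockOSThread does in practice. The helper name onLockedThread is made up for this example. The goroutine is pinned to one OS thread for the duration of the call, but it still runs ordinary Go code on its own goroutine stack, so nothing about a later cgo transition changes.

```go
package main

import (
	"fmt"
	"runtime"
)

// onLockedThread runs fn in a new goroutine that is pinned to a single OS
// thread for the duration of the call, and returns fn's result.
// Pinning only restricts where the scheduler may run the goroutine; the
// code still executes on the goroutine's own growable stack.
func onLockedThread(fn func() int) int {
	done := make(chan int)
	go func() {
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		done <- fn()
	}()
	return <-done
}

func main() {
	sum := onLockedThread(func() int {
		total := 0
		for i := 1; i <= 10; i++ {
			total += i
		}
		return total
	})
	fmt.Println(sum) // prints 55
}
```

Pinning like this is useful for libraries that keep thread-local state, but it should not be expected to reduce the per-call cgo cost described here.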

Also, while there is a cost to the fact that cgo changes stack, there
are larger costs to 1) telling the Go scheduler that the code is
leaving Go and entering C; 2) changing from the Go calling convention
to the C calling convention. Eliminating the stack switching costs,
while not entirely negligible, would not make cgo calls as fast as Go
calls.


> Returning to the original thought about .s code, what would happen if that code tried to grow the stack too far? Just crash? Is there guidance about how far assembly can grow the stack before it needs to check back with the runtime?

The assembly can grow the stack exactly as far as the TEXT declaration
said that the stack would grow. Any farther may lead to memory
corruption.

Ian

Jason E. Aten

Mar 1, 2019, 5:37:21 PM
to golang-nuts
Thank you Ian!  This is so helpful.


On Friday, March 1, 2019 at 4:13:49 PM UTC-6, Ian Lance Taylor wrote:
> Go assembly code is compiled by the Go assembler, cmd/asm. The author
> of the assembly code is required to specify how much stack space the
> assembly function requires, in the TEXT pseudo-op that introduces the
> function. For more about this, see https://golang.org/doc/asm. The
> cmd/asm program will use that user declaration to insert a function
> prologue that ensures that enough stack space is available, copying
> the stack if necessary. The cmd/asm program will also produce a stack
> map that the garbage collector will use when tracing back the stack;
> in practice it's quite difficult for the assembler code to define this
> stack map as anything other than "this stack contains no pointers",
> but see runtime/funcdata.h.
>
> In any case, your Fortran 90 code will have none of that information.
> Of course, if you know the exact stack usage of your Fortran code, and
> if the code never stores pointers on the stack, then you could with
> some effort write Go assembly code that defines the appropriate stack
> information and then calls the Fortran code. I think that would work, but
> it would require a lot of manual hand-holding.

I was thinking I could write a prelude and then launch into the .f90 compiled routines that I manually inline, but I see that the assembler is a little higher level than actual amd64 machine code.

But there does appear to be a BYTE escape hatch in the assembler. So if I compile my .f90 to machine code and then run through each byte and generate a BYTE instruction to the Go assembler, would that (at least in theory) let me call the .f90 code? (Ignoring for the moment that I need to get the parameters passed in from Go to .f90 in a way that the .f90 code expects; I think I will have to hand-craft some assembly glue to make that work, of course.)
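
As a rough illustration of the BYTE idea above, here is a minimal Go sketch that turns raw machine code into BYTE directives. The function name byteDirectives and the four example bytes are assumptions for this example. A real generator would also have to emit the TEXT header with an accurate frame size, plus the glue that adapts the calling convention.

```go
package main

import (
	"fmt"
	"strings"
)

// byteDirectives renders raw machine code as Go assembler BYTE directives,
// one directive per line, each indented with a tab.
func byteDirectives(code []byte) string {
	var b strings.Builder
	for _, c := range code {
		fmt.Fprintf(&b, "\tBYTE $0x%02X\n", c)
	}
	return b.String()
}

func main() {
	// Example amd64 bytes: 48 89 F8 encodes "mov rax, rdi" and C3 is "ret".
	// These follow the C convention of passing the first argument in RDI,
	// so they would not work unmodified as the body of a Go function.
	code := []byte{0x48, 0x89, 0xF8, 0xC3}
	fmt.Print(byteDirectives(code))
}
```

The output is only the function body; the stack declaration and argument shuffling still have to be written by hand.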

Ian Lance Taylor

Mar 1, 2019, 6:00:16 PM
to Jason E. Aten, golang-nuts
In theory, sure, as long as you accommodate the difference in calling
convention and accurately record the stack requirements.

Ian

Andrei Tudor Călin

Mar 1, 2019, 6:14:08 PM
to Jason E. Aten, golang-nuts
Perhaps https://github.com/minio/c2goasm might be of interest.

I don't see anything in there about Fortran specifically, but I don't think it would be a huge leap.



--
Andrei Călin

Jason E. Aten

Mar 1, 2019, 6:16:35 PM
to golang-nuts
On Friday, March 1, 2019 at 5:14:08 PM UTC-6, Andrei Tudor Călin wrote:
> Perhaps https://github.com/minio/c2goasm might be of interest.
>
> I don't see anything in there about Fortran specifically, but I don't think it would be a huge leap.

Nice! Thank you, Andrei. That looks very helpful. Fortran can be told to follow C calling conventions; that might do the trick.

Jason E. Aten

Mar 1, 2019, 6:27:46 PM
to golang-nuts
Especially given that Nvidia open sourced a production quality Fortran 2003 front-end for clang called Flang...


Flang is a Fortran compiler targeting LLVM.
Flang was announced in 2015. In 2017, the source code was released on GitHub.
Flang is a Fortran language front-end designed for integration with LLVM and the LLVM optimizer.
Flang+LLVM is a production-quality Fortran solution designed to be co-installed and is fully interoperable with Clang C++.
Flang single-core and OpenMP performance is now on par with GNU Fortran. Flang has implemented Fortran 2003 and has a near full implementation of OpenMP through version 4.5 targeting multicore CPUs.

Dan Kortschak

Mar 1, 2019, 7:46:11 PM
to golang-nuts
Assembly incurs a function call cost (non-inlineable AFAIU), but Cgo
incurs a function call cost with additional work for C stack and call
conventions translation as said by Tamás.

Jason E. Aten

Mar 1, 2019, 11:17:20 PM
to golang-nuts
On Friday, March 1, 2019 at 6:46:11 PM UTC-6, kortschak wrote:
> Assembly incurs a function call cost (non-inlineable AFAIU), but Cgo
> incurs a function call cost with additional work for C stack and call
> conventions translation as said by Tamás.

Thanks @kortschak. The c2goasm project (https://github.com/minio/c2goasm) that Andrei mentioned shows benchmarks in its README that suggest that, even with the function call cost included, the speed ratio is about 20:1 in favor of assembly over CGO.

Dan Kortschak

Mar 2, 2019, 2:35:04 AM
to Jason E. Aten, golang-nuts
Yes, that's not unreasonable. With f2c, you could potentially get your
Fortran into C, then Go asm and then call that.

Dan

Louki Sumirniy

Mar 2, 2019, 6:51:20 AM
to golang-nuts
The stack requirements are quite important. I think you have to either take care of freeing memory by adding an assembler function that does that, or, if it is possible, allocate the buffers from Go variable declarations so that the GC takes care of it. I think that is very likely possible, since assembler deals mainly with what are essentially arrays of byte or (u)int(16/32/64), and such a buffer will be a flat range of memory, as the assembler will expect. Going this way, you have the other problem of making sure the GC doesn't discard the buffer before the assembler works with it.
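
A minimal sketch of the Go-allocated buffer approach, under stated assumptions: scaleKernel is a pure-Go stand-in for an assembly routine that sees only a base pointer and an element count, and it is not taken from any real library. Passing a typed pointer already keeps the buffer reachable for the duration of the call; the runtime.KeepAlive call is a conservative guard that matters mainly if the address is ever carried around as a uintptr.

```go
package main

import (
	"fmt"
	"runtime"
	"unsafe"
)

// scaleKernel is a pure-Go stand-in (an assumption for this sketch) for an
// assembly routine that receives only a base pointer and an element count
// and multiplies each element in place.
func scaleKernel(p *float64, n int, factor float64) {
	xs := unsafe.Slice(p, n) // view the flat memory range as a slice (Go 1.17+)
	for i := range xs {
		xs[i] *= factor
	}
}

func main() {
	// The buffer is allocated by Go, so the garbage collector owns it and
	// no explicit free routine is needed on the assembly side.
	buf := make([]float64, 4)
	for i := range buf {
		buf[i] = float64(i + 1)
	}

	scaleKernel(&buf[0], len(buf), 2)

	// Conservative guard: buf is treated as live at least until this point.
	runtime.KeepAlive(buf)

	fmt.Println(buf) // prints [2 4 6 8]
}
```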