The reason to switch call stacks when doing Cgo calls


mizze...@gmail.com

Jun 8, 2018, 12:39:30 AM
to golang-nuts
Is the small size of a goroutine stack pretty much the only reason why the Go runtime needs to switch stacks when doing Cgo calls? I wonder if the main goroutine could be created with a big stack (like 1 MB), so that I could call well-behaving* C code from it without the need to constantly switch stacks.
*Well-behaving == non-blocking && non-callbacking
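For concreteness, here is a minimal sketch (names are made up) of the kind of call I have in mind: a tiny C function that returns immediately, never blocks, and never calls back into Go. Today even a call like this goes through the full cgo path, i.e. the stack switch plus the scheduler bookkeeping, on every invocation:

    package main

    /*
    // A trivial "well-behaving" C function: fast, non-blocking,
    // and it never calls back into Go.
    static int add(int a, int b) { return a + b; }
    */
    import "C"

    import "fmt"

    func main() {
        // Even from the main goroutine, this call currently
        // switches to the system stack and notifies the scheduler.
        fmt.Println(int(C.add(2, 3)))
    }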

Ian Lance Taylor

Jun 8, 2018, 12:58:29 AM
to mizze...@gmail.com, golang-nuts
We can't assume that C code is well behaved. That is, as they say, a
footgun: a problem that is likely to arise just when you aren't
prepared for it. For that matter, treating cgo calls specially on one
specific goroutine is also likely to have unexpected performance
characteristics for people who aren't very careful about how they
write their code.

Ian

mizze...@gmail.com

Jun 11, 2018, 3:16:21 AM
to golang-nuts
I shouldn't have mentioned well-behaved functions. It's a whole other topic.
I'm interested specifically in goroutine stacks.
Theoretically, can they be used to run C code?
Are there any pitfalls?


>For that matter, treating cgo calls specially on one
>specific goroutine is also likely to have unexpected performance
>characteristics for people who aren't very careful about how they
>write their code.
Yeah, such a solution looks inflexible. But I can think of another.
What if there were a compiler directive to adjust a Go function's stack growing strategy?
Something like this:
//go:stack_always_grow=1MB
func func_with_cgocalls()

That way, when func_with_cgocalls is entered, the calling goroutine's stack is always grown by 1 MB. Within the function, the cgo tool could then emit more efficient Cgo calls that avoid switching to the system stack. Naturally, when the function is left, the stack shrinks back.
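To spell the idea out (this is purely hypothetical: no //go:stack_always_grow directive exists in any Go toolchain, and the C function is a placeholder), usage might look like this:

    package main

    /*
    static void do_work(void) {}
    */
    import "C"

    // Hypothetical directive: guarantee 1 MB of goroutine stack on
    // entry, so that (under this proposal) cgo could call C directly
    // on the goroutine stack instead of switching to m->g0.
    //go:stack_always_grow=1MB
    func func_with_cgocalls() {
        C.do_work()
    }

    func main() { func_with_cgocalls() }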

P.S. The cost of switching the stacks is around 7 nanoseconds per call (on my system).
// Which alone is comparable to the cost of a JNI call (from Java to C).
The optimization may or may not be worth it, depending on context.

Ian Lance Taylor

Jun 12, 2018, 12:11:52 AM
to Des Nerger, golang-nuts
On Mon, Jun 11, 2018 at 12:15 AM, <mizze...@gmail.com> wrote:
>
> I shouldn't have mentioned well-behaved functions. It's a whole other topic.
> I'm interested specifically in goroutine stacks.
> Theoretically, can they be used to run C code?
> Are there any pitfalls?

Yes, theoretically, if you have a large goroutine stack, it could be
used to run C code.


>>For that matter, treating cgo calls specially on one
>>specific goroutine is also likely to have unexpected performance
>>characteristics for people who aren't very careful about how they
>>write their code.
> Yeah, such a solution looks inflexible. But I can think of another.
> What if there were a compiler directive to adjust a Go function's stack
> growing strategy?
> Something like this:
> //go:stack_always_grow=1MB
> func func_with_cgocalls()
>
> That way, when func_with_cgocalls is entered, a calling goroutine's stack is
> always incremented by 1MB. Thus, within the function the cgo transpiler
> could emit more effective Cgo calls avoiding switching to a system stack.
> Naturally, when the function is left, the stack is shrunk back.
>
> P.S. The cost of switching the stacks is around 7 nanoseconds per call (on
> my system).
> //Which alone is comparable to the cost of a JNI call (from Java to C).
> The optimization may be or not be worth it, depending on context.

What are you measuring?

As you've already observed, switching the stacks is only part of a cgo
call. The more expensive part is telling the scheduler that the
thread is entering C code.

Ian

mizze...@gmail.com

Jun 12, 2018, 2:16:43 AM
to golang-nuts
> What are you measuring?
I mean, when I strip this part from the amd64 asmcgocall, every Cgo call gets faster by 7 nanoseconds:
    // Figure out if we need to switch to m->g0 stack.
    // We get called to create new OS threads too, and those
    // come in on the m->g0 stack already.
    get_tls(CX)
    MOVQ    g(CX), R8
    CMPQ    R8, $0
    JEQ     nosave
    MOVQ    g_m(R8), R8
    MOVQ    m_g0(R8), SI
    MOVQ    g(CX), DI
    CMPQ    SI, DI
    JEQ     nosave
    MOVQ    m_gsignal(R8), SI
    CMPQ    SI, DI
    JEQ     nosave

    // Switch to system stack.
    MOVQ    m_g0(R8), SI
    CALL    gosave<>(SB)
    MOVQ    SI, g(CX)
    MOVQ    (g_sched+gobuf_sp)(SI), SP

    // Now on a scheduling stack (a pthread-created stack).
    // Make sure we have enough room for 4 stack-backed fast-call
    // registers as per windows amd64 calling convention.
    SUBQ    $64, SP
    ANDQ    $~15, SP    // alignment for gcc ABI
    MOVQ    DI, 48(SP)  // save g
    MOVQ    (g_stack+stack_hi)(DI), DI
    SUBQ    DX, DI
    MOVQ    DI, 40(SP)  // save depth in stack (can't just save SP, as stack might be copied during a callback)
    MOVQ    BX, DI      // DI = first argument in AMD64 ABI
    MOVQ    BX, CX      // CX = first argument in Win64
    CALL    AX

    // Restore registers, g, stack pointer.
    get_tls(CX)
    MOVQ    48(SP), DI
    MOVQ    (g_stack+stack_hi)(DI), SI
    SUBQ    40(SP), SI
    MOVQ    DI, g(CX)
    MOVQ    SI, SP

    MOVL    AX, ret+16(FP)
    RET
Yes, I understand that it's not a major part of the Cgo overhead (which is around 63 ns in total), but it's still good to know whether it can be optimized away.
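For reference, here is a rough sketch of one way to put a number like the ~63 ns total on a given machine; the ~7 ns stack-switch share itself can only be isolated by patching asmcgocall as above. A testing.B benchmark would be more rigorous, but cgo can't be imported from _test.go files, so a plain timing loop keeps the example in one file. Absolute numbers of course depend on hardware and Go version:

    package main

    /*
    static void nop(void) {}
    */
    import "C"

    import (
        "fmt"
        "time"
    )

    func main() {
        const n = 10000000
        start := time.Now()
        for i := 0; i < n; i++ {
            C.nop() // each iteration pays the full cgo call overhead
        }
        perCall := float64(time.Since(start).Nanoseconds()) / n
        fmt.Printf("%.1f ns per cgo call\n", perCall)
    }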