Go Stack Design Proposal


Arseny Samoylov

Nov 3, 2025, 4:24:50 AM (8 days ago) Nov 3
to golang-nuts
Hello everyone,

I'd like to get feedback on an idea for changing how Go manages goroutine stack growth.
Below is a short draft of the proposal.

## Current State

Currently, almost every Go function prologue includes a stack growth check.
If the remaining stack space is insufficient, the runtime allocates a larger stack and copies the old one, adjusting pointers to local variables as needed.
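The copying behavior described above can be observed from user code. The following is a minimal sketch, not the runtime's own logic: it records the address of a local variable as an integer, forces deep recursion in a fresh goroutine, and then compares addresses. The helper names `recurse` and `stackMoved` are hypothetical, and the recursion depth is an arbitrary choice assumed to exceed the initial stack size.

```go
package main

import (
	"fmt"
	"unsafe"
)

// recurse burns stack: each frame holds a 1 KiB array, so a few thousand
// levels need far more than a goroutine's small initial stack.
//
//go:noinline
func recurse(depth int) int {
	var pad [1024]byte
	pad[depth%len(pad)] = 1
	if depth == 0 {
		return int(pad[0])
	}
	return recurse(depth-1) + int(pad[depth%len(pad)])
}

// stackMoved reports whether a local variable's address changed across a
// deep call chain, which indicates the stack was copied to a new location.
func stackMoved() bool {
	result := make(chan bool)
	go func() { // fresh goroutine, so the stack starts small
		local := 42
		// Stored as a plain integer, so the runtime does not adjust it.
		before := uintptr(unsafe.Pointer(&local))
		recurse(4096) // needs several MiB of stack, forcing growth
		after := uintptr(unsafe.Pointer(&local))
		result <- before != after
	}()
	return <-result
}

func main() {
	fmt.Println("recurse(4096) =", recurse(4096))
	fmt.Println("stack moved:", stackMoved())
}
```

Real pointers into the stack are rewritten during the copy; only the integer snapshot keeps the old address, which is why the comparison can detect the move.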

**Drawbacks of this approach:**

* Increased CPU usage due to frequent stack size checks and possible reallocations
* Larger code size because of the additional prologue instructions

## Proposed Stack Management Mechanism
I would like to hear your opinion on the following stack growth mechanism and whether it's worth exploring further.
If you think this idea has potential, I'll continue by estimating its effect on CPU usage and code size and, if the estimates look good enough, build a proof of concept.

### Reallocation via Page Faults
The idea is inspired by how Linux manages system thread stacks.

In Linux, each thread reserves (by default) 8 MB of virtual memory for its stack. Physical memory is mapped lazily - new pages are allocated when the thread touches them, via page faults.
When the stack limit is reached, the program aborts.

In Go, however, instead of aborting, we could reuse the existing stack growth logic - relocating the stack to a larger chunk when a page fault occurs near the stack boundary.

**Potential drawbacks**
* The Go runtime would need to handle page faults:
    * This might increase the number of page faults and add handling overhead
    * It could be tricky to distinguish between stack-related and unrelated page faults

* A large number of goroutines will consume a large amount of virtual address space
* The minimum stack size would effectively increase from 2 KB to 4 KB (one physical page). In the worst case, when all goroutines use less than 2 KB of stack space, this will double memory consumption
* This mechanism would depend on OS-level signal handling and may require platform-specific implementations

The main concern, as I see it, is the increased use of virtual address space.
A rough estimation:
100k goroutines with 8 MB stacks each would reserve ~800 GB (= 2^3 * 10^5 * 2^20 B ≈ 2^39.6 B), i.e., about 1/300 of the 2^48-byte virtual address space.
This seems acceptable, especially since we can reserve less than 8 MB.
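The estimate can be checked with a few lines of arithmetic. This sketch assumes a 48-bit address space, as in the text; the usable user-space portion on a given platform may be smaller. `reservedBytes` and `reservedFraction` are hypothetical helper names.

```go
package main

import "fmt"

// reservedBytes returns the virtual memory reserved by n goroutines
// that each reserve stackBytes of address space.
func reservedBytes(n, stackBytes uint64) uint64 {
	return n * stackBytes
}

// reservedFraction returns that reservation as a fraction of an
// address space that is addrBits wide.
func reservedFraction(n, stackBytes uint64, addrBits uint) float64 {
	return float64(reservedBytes(n, stackBytes)) / float64(uint64(1)<<addrBits)
}

func main() {
	const goroutines = 100_000
	const stack = 8 << 20 // 8 MB per goroutine

	total := reservedBytes(goroutines, stack)
	fmt.Printf("reserved: %d bytes (%.2f GiB)\n", total, float64(total)/(1<<30))
	fmt.Printf("fraction of 2^48: %.5f\n", reservedFraction(goroutines, stack, 48))
}
```

The result is a fraction of a percent of the address space, and it scales linearly with both the goroutine count and the per-goroutine reservation.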

The second concern is the larger minimum stack size (4 KB vs 2 KB). This could double memory consumption in the worst case.
I'm not yet sure whether this trade-off would be acceptable or if it can be mitigated.

Cross-platform support is also a major concern.

## Additional notes
* The current implementation supports stack shrinking (when less than 1/4 of the stack is used). I guess we could shrink the stack with MADV_DONTNEED.
* Stack growth checks are currently tied to goroutine preemption. Removing them might indirectly affect the scheduler. However, Go has other preemption mechanisms (cooperative and asynchronous), so this may not be a major issue.

### Conclusion
What do you think about this idea?
Is this direction worth further exploration, i.e., getting concrete estimates of the performance improvement and building a PoC?

Thank you for your time and feedback!

Jan Mercl

Nov 3, 2025, 5:24:01 AM (8 days ago) Nov 3
to Arseny Samoylov, golang-nuts
On Mon, Nov 3, 2025 at 10:25 AM Arseny Samoylov
<samoylo...@gmail.com> wrote:

> **Drawbacks of this approach:**
>
> * Increased CPU usage due to frequent stack size checks and possible reallocations

The cost is non zero, so yes, the increase will be there.

But how much of an increase? Most code does some data processing more
often than performing calls. Is the increase on a mix of real programs
1%? 10%? More? Knowing this number is the most important thing to do
before going any further.

Brian Candler

Nov 3, 2025, 5:28:36 AM (8 days ago) Nov 3
to golang-nuts
Interesting idea.

Have you come across any examples of any other languages which have tried this approach, successfully or otherwise?

Arseny Samoylov

Nov 3, 2025, 6:13:49 AM (8 days ago) Nov 3
to golang-nuts
>  Most code does some data processing more
> often than performing calls.

In my experience, there are still a lot of calls in real Go workloads. In particular, many indirect calls (through interfaces) that can't be inlined, so stack checks are still performed at runtime.

> But how much of an increase? Most code does some data processing more
> often than performing calls. Is the increase on a mix of real programs
> 1%? 10%? More? Knowing this number is the most important thing to do
> before going any further.

I'd roughly guess something like a 3%-5% increase. But this is only a guess at this stage. My goal with this discussion is to understand whether such an idea would be considered worth exploring further - if so, I can prepare a more concrete estimate.

Arseny Samoylov

Nov 3, 2025, 6:21:02 AM (8 days ago) Nov 3
to golang-nuts
>  Have you come across any examples of any other languages which have tried this approach, successfully or otherwise?
I looked into other languages. First of all, a language must have a garbage collector to support stack relocation, since pointer fixups are required. This rules out languages like C, C++, and Rust.

That leaves languages such as Java, Python, JavaScript, etc.
In my (admittedly shallow) research, I haven't found any examples of a similar approach being used.

Jan Mercl

Nov 3, 2025, 6:26:20 AM (8 days ago) Nov 3
to Arseny Samoylov, golang-nuts
On Mon, Nov 3, 2025 at 12:14 PM Arseny Samoylov
<samoylo...@gmail.com> wrote:

> I'd roughly speculate to have something like 3%-5% increase. But this is only a guess at this stage. My goal with this discussion is to understand whether such an idea would be considered worth exploring further ...

I think no one can tell if it is worth it without knowing the
measured, non-guessed current impact first.

Arseny Samoylov

Nov 3, 2025, 6:27:27 AM (8 days ago) Nov 3
to golang-nuts
>  I can prepare a more concrete estimation.
As an estimate, I plan to disable stack checks and set the starting stack size high enough that goroutines won't run out of stack. Then I can run a benchmark suite, for example sweet or etcd.

The minimum starting stack size will be determined from baseline benchmark runs, based on the observed stack usage.

Jason E. Aten

Nov 3, 2025, 6:03:51 PM (8 days ago) Nov 3
to golang-nuts
Hi Arseny,

Remember it is no longer the number of conditionals or reads but
rather the number of L1 cache misses and writes that cause
multi-core cache coherency stalls that dominate performance
typically today.

Even more importantly, signal handling is very slow, and often "late".

You would have to 
1) trap the page fault and context switch to the kernel, 
2) context switch back to the signal handler,
3) context switch back to the kernel to allocate a page or change its protections, 
and then 
4) context switch back to the original faulting code. 

You are looking at at least 3, but usually 4, context switches; this will be 
much, much slower than using the Go allocator.

Arseny Samoylov

Nov 4, 2025, 4:12:47 AM (7 days ago) Nov 4
to golang-nuts

> Remember it is no longer the number of conditionals or reads but
> rather the number of L1 cache misses and writes that cause
> multi-core cache coherency stalls that dominate performance
> typically today.

You are right. One of the issues with stack growth checks is the increased code size, which leads to higher L1i cache pressure.
Each check takes roughly 10 instructions: load the end address of the stack, compute the remaining space, and branch if it is insufficient; the slow path then spills registers (in Go's ABI all registers are caller-saved, so this can be up to 16 x 2 pushes/pops for spill/fill), calls runtime.morestack, refills the registers, and retries.
If a medium-sized function consists of about 100-200 instructions, this amounts to roughly 5%-10% code size overhead.
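The check and the overhead figure can be modeled in plain Go. This is a simplified sketch under stated assumptions, not the code the compiler emits: `stackBounds`, `needsGrowth`, and `codeSizeOverhead` are hypothetical names, and the 10-instruction cost is the rough figure quoted above.

```go
package main

import "fmt"

// stackBounds is a hypothetical, simplified stand-in for the per-goroutine
// stack limits that a function prologue consults.
type stackBounds struct {
	lo, hi uintptr // stack occupies [lo, hi) and grows down from hi
	guard  uintptr // lowest stack pointer value allowed without growth
}

// needsGrowth models the prologue test: would a frame of frameSize bytes,
// pushed below sp, cross the guard address?
func needsGrowth(s stackBounds, sp, frameSize uintptr) bool {
	return sp < s.guard+frameSize
}

// codeSizeOverhead returns the share of extra instructions that a check of
// checkInstrs instructions adds to a body of bodyInstrs instructions.
func codeSizeOverhead(checkInstrs, bodyInstrs int) float64 {
	return float64(checkInstrs) / float64(bodyInstrs)
}

func main() {
	s := stackBounds{lo: 0x1000, hi: 0x1800, guard: 0x1100}
	fmt.Println("small frame needs growth:", needsGrowth(s, 0x1700, 0x100))
	fmt.Println("large frame needs growth:", needsGrowth(s, 0x1700, 0x1000))

	for _, body := range []int{100, 200} {
		fmt.Printf("%d-instruction function: %.0f%% overhead\n",
			body, 100*codeSizeOverhead(10, body))
	}
}
```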


> Even more importantly, signal handling is very slow, and often "late".

Absolutely - I agree. The idea here is to reserve a large, lazily allocated stack (e.g. 8 MB) so that we almost never hit the limit.
The page-fault-based reallocation would only serve as a safety mechanism - ensuring that, if a goroutine stack ever does reach its limit, it can either be reallocated safely or cause a controlled panic.

Arseny Samoylov

Nov 4, 2025, 4:47:39 AM (7 days ago) Nov 4
to golang-nuts
> One of the issues with stack growth checks is the increased code size, which leads to higher L1i cache pressure.
 
Go generally isn't designed for computation-heavy workloads (e.g. matrix multiplication), and the compiler backend prioritizes compilation speed over the absolute performance of the generated code. In my experience, most Go server applications are front-end bound and front-end stalls tend to be a major bottleneck.
That's why we should care about L1i performance.

Rob Pike

Nov 5, 2025, 12:19:53 AM (7 days ago) Nov 5
to Arseny Samoylov, golang-nuts
I believe you are overstating the cost. It's measurable but not as severe as you state. And now with inlining more common, functions tend to be bigger than they once were, so the amortized and actual cost are both reduced.

Moreover, using traps for stack growth has been problematic in the past. I have seen other systems trying to do this, and in most cases eventually abandoning it due to unforeseen complexity. Not saying it can't be done, but it's not easy. Plus it is hard to do portably since the details will depend on the architecture and the operating system.

It's not up to me, but I wouldn't do this.

-rob



Arseny Samoylov

Nov 5, 2025, 4:57:28 AM (6 days ago) Nov 5
to golang-nuts
>  And now with inlining more common, functions tend to be bigger than they once were, so the amortized and actual cost are both reduced.

You are correct that inlining helps reduce the cost of stack-growth checks. However, interface method calls present a significant obstacle to inlining, and Go code uses them frequently. Also, the inlining budget without PGO is quite modest: one uninlined call already takes almost the entire budget (57 out of 80, see https://github.com/golang/go/blob/master/src/cmd/compile/internal/inline/inl.go#L53).

>  I have seen other systems trying to do this, and in most cases eventually abandoning it due to unforeseen complexity. Not saying it can't be done, but it's not easy.
Can you please provide some links? It would be very helpful, because I couldn't find any case studies of a GC'd language or runtime that tried to use page-fault-based stack growth for coroutines =(.
