What is the long term cost of closures in Golang


Alexander Shopov

Mar 8, 2026, 1:05:52 PM
to golang-nuts
Hi all,

Closures have some cost, and we are advised not to use them in code
that strives to be performant.
However, a quick benchmark I ran suggests this is not currently the
case, though I suspect the benchmark is far too trivial.

What optimisations does the official Go compiler attempt here, and
what is the long-term plan?

AFAIK there are two types of costs for closures:
1. Some data may need to be allocated on the heap and collected by
the GC, rather than stack-allocated and released automatically.
2. Inner functions are typically assigned to a variable, so when the
body is called through that variable, the runtime needs to resolve
what the variable points to.

Note that both of these can be optimized away depending on things
such as how the function is defined and called, the size of its body,
and the number of free variables.
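A minimal sketch of both costs (a hypothetical example, not taken from my benchmark): `makeAdder` forces its captured variable onto the heap, while the top-level function keeps everything on the stack.

```go
package main

import "fmt"

// makeAdder returns a closure capturing step. The closure outlives
// makeAdder's stack frame, so escape analysis moves step to the heap
// (cost 1), and calls through the returned func value are indirect
// (cost 2) unless the compiler can devirtualize and inline them.
func makeAdder(step int) func(int) int {
	return func(x int) int { return x + step }
}

// addDirect is an ordinary top-level function: no captured state,
// statically resolved, and a candidate for inlining.
func addDirect(x, step int) int { return x + step }

func main() {
	add := makeAdder(5)
	fmt.Println(add(10))          // 15, via an indirect call
	fmt.Println(addDirect(10, 5)) // 15, via a direct call
}
```

Building with `go build -gcflags=-m` shows the compiler's escape-analysis and inlining decisions for both variants.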

In the utterly trivial benchmark I created, all calls seem
practically the same (the difference between slowest and fastest is
2%). However, I am not sure how well this corresponds to typical
behavior.

- When will a closure call be inlined?
- How long can the body be, and how many free variables can it
capture, before the compiler stops optimizing?
- What can break the optimization?

Kind regards:
al_shopov


Benchmark here: https://go.dev/play/p/2Z7cg0tJVAe
Results:

BenchmarkNoclosurecall-12             92   12593210 ns/op
BenchmarkClosurecall-12               91   12327950 ns/op
BenchmarkClosurecallimmediate-12      92   12359107 ns/op
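A self-contained version of such a comparison (a hypothetical sketch, not the exact playground code) can be run without a test harness via `testing.Benchmark`:

```go
package main

import (
	"fmt"
	"testing"
)

// addStep is the direct-call variant; the noinline directive keeps
// the comparison honest by forcing a real call on each iteration.
//
//go:noinline
func addStep(x, step int32) int32 { return x + step }

func main() {
	step := int32(1)
	closure := func(x int32) int32 { return x + step }

	direct := testing.Benchmark(func(b *testing.B) {
		var s int32
		for i := 0; i < b.N; i++ {
			s = addStep(s, step)
		}
		_ = s
	})
	indirect := testing.Benchmark(func(b *testing.B) {
		var s int32
		for i := 0; i < b.N; i++ {
			s = closure(s)
		}
		_ = s
	})
	fmt.Println("direct: ", direct)
	fmt.Println("closure:", indirect)
}
```

Without the noinline directive, both loops may compile down to nearly identical code, which would explain results within a couple of percent of each other.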

Jason E. Aten

Mar 8, 2026, 4:42:14 PM
to golang-nuts
> we are advised to not use them in code that strives to be performant

We are also advised to always profile before guessing about performance. I might
advise that this second piece of advice should almost always be given more weight.

:)

Best wishes,
Jason

Ugorji Nwoke

Mar 8, 2026, 8:42:47 PM
to golang-nuts
> However the quick benchmark I did showed this is not so currently but I suppose the benchmark is way too trivial.
It seems the OP did. Just answer and help him if you can.

Brian Candler

Mar 9, 2026, 5:28:50 AM
to golang-nuts
Microbenchmarks are often unrepresentative of real world behaviour. The OP's microbenchmarks found no significant difference, but I think they are very poor examples.

For one, a good compiler might have optimised away the loops entirely, although that appears not to have happened here. For another, these are compute-bound functions (each takes about 12 ms to run), so they do not measure the cost of call and return from a closure versus a "normal" function. But even with step=int32(500_000_000) I see no difference. More importantly, these functions are so trivial that they almost certainly run entirely in registers. The initial "d := step" line copies the closure variable to a local variable, and that local is highly likely to live in a register. However, again, changing the loops to use "step" directly doesn't appear to make a difference.

You can use godbolt if you want to look at the compiled code, which will answer your questions about inlining.
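Locally, the compiler's own diagnostics answer the same inlining questions. For example (a hypothetical snippet), building with `go build -gcflags='-m=2'` reports whether the closure literal is inlined and whether the captured variable escapes:

```go
package main

import "fmt"

// scale returns a closure capturing factor; the -m output reports
// whether factor escapes to the heap and which calls get inlined.
func scale(factor int) func(int) int {
	return func(x int) int { return x * factor }
}

func main() {
	double := scale(2)
	fmt.Println(double(21)) // 42
}
```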

Whether the cost is significant in a particular real-world application is a very different, and much more relevant question; that will be specific to the OP's real-world problem.  The only general feedback I can offer is: "I've not heard anybody complain here about the cost of closures in Go".

The OP has also identified that if there is a problem, it is likely to be related to garbage collection. Go's GC is continuously improving, and is not a problem for many real-world workloads. If their application really has a critical bottleneck in this area, then maybe a non-GC language would be more suitable.  But to me, this sounds like a severe case of premature optimisation.

Jason E. Aten

Mar 9, 2026, 6:21:23 AM
to golang-nuts
Brian's words are wise.  My point was it's almost impossible to generalize.

But to provide a specific workflow to analyze your actual production code in place, in practice:

a) write a Benchmark that focuses on your area of interest in your actual code.

b) compile the test binary, "go test -c". You do this so pprof can show you the disassembly in step e) below.

c) run the benchmark with -benchtime=10s and -cpuprofile cpu.prof 

For example, from my run
two minutes ago, the full benchmark run line was:

go test -v -tags memfs -run=xxx -bench Benchmark_Iter_YogaDB_Ascend -benchtime=10s -cpuprofile cpu.prof

d) open pprof; it will show you the flame graph; look for the longest horizontal bar, click it, then click it again
once it expands; then select View -> Source from the upper left menu. It will show you the source
code with the time spent on each hot line, out of the 10 seconds that you ran.

go tool pprof -http :7777 yogadb.test cpu.prof

It will look something like this:

[screenshot: profile.jpg, showing the pprof flame graph and source view]

e) if need be, choose View -> Disassembly instead of View -> Source

f) highlight the entire section that contains the hottest line, such as line 1252 above.

g) paste the text into your favorite LLM and ask it for ideas to help you optimize that code.
     It will give you amazingly good ideas 90% of the time. Try them one by one, running
     your benchmark after each one and observing whether the timing improved. Sometimes you
     need to redirect it to just try to eliminate function calls, or to manually inline hot parts
     of functions to avoid call overhead. My ability to read assembly: horrible. It does
     not matter, because the LLM speaks it fluently. Leverage that.

There is an example of what this process can do here, in this next link. It took a hot iteration
path from 340 nsec down to currently under 10 nsec, which is about one L3 cache load on my
2020-vintage Intel CPU.


Finally, move over to Linux and run "perf" to analyze and optimize your L1 hit rate. Something like:

sudo perf stat -e L1-dcache-loads,L1-dcache-load-misses ./drwmutex-bench -strat 1

Enjoy the process.

- Jason

P.S. There is a nice video that illustrates optimization in Go, in case some of the steps were
unclear, from Prashant V: "Profiling and Optimizing Go"