Asking for guidance: estimating and optimizing synchronization cost and memory access latency (NUMA) on a many-core (256 logical cores) ARM server

Qingwei Li

Oct 27, 2025, 4:19:46 AM
to golang-nuts
Hello, everyone. I'm new to NUMA and to synchronization primitives on ARM, like the DMB ST/SY/... instructions, but I want to do some performance analysis and build runtime optimizations or compiler transformation passes for these scenarios. Although I know the names, I don't have deep insight into these concepts yet, and I can't work out the research steps to find the targets worth optimizing.

I have two ideas based on my naive understanding of NUMA, synchronization primitives, and the Go runtime, especially memory allocation.

Idea 1: allocate objects in the mcache (per-P local cache) of the P where they will be frequently accessed.

The motivating example is as follows:

```
const N = 1_000_000

func capture_by_ref() {
    // x is captured by reference, so it escapes to the heap.
    x := 0
    go func() {
        for range N { // requires Go 1.22+ (range over int)
            x++
        }
    }()
    _ = x
}
```
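For what it's worth, building this with `go build -gcflags=-m` should report `moved to heap: x` for this function, confirming that `x` is heap-allocated and shared by pointer.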

The pseudo-IR/SSA can be:

```
capture_by_ref:
    px = newobject(int)
    closure = &funcval{thunk, px}
    newproc(closure) // just pseudo-code, may not be correct
```

Here `x` is allocated via the mcache of the P running the "main" goroutine, but it is frequently accessed by the `go func` goroutine, which may be scheduled on a P on a different NUMA node. This may cause large access latency under NUMA. I'm not sure: maybe the CPU caches can hide the latency, but the first access from a remote node could still be expensive. I have no real insight here and am just guessing.
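To check the guess before building anything, a micro-benchmark along the following lines could measure the remote first-touch cost directly. This is a minimal sketch under several assumptions: Linux, the `golang.org/x/sys/unix` package available, and CPUs 0 and 128 sitting on different NUMA nodes (those CPU numbers are placeholders; check the real topology with `lscpu` or `numactl --hardware`):

```
// Minimal sketch: pin two goroutines to CPUs on (assumed) different NUMA
// nodes, fault pages in on one node, then time a cold pass from the other.
// CPU numbers 0 and 128 are placeholders for a hypothetical topology.
package main

import (
    "fmt"
    "runtime"
    "time"

    "golang.org/x/sys/unix"
)

// pin locks the calling goroutine to its OS thread and restricts that
// thread to a single CPU.
func pin(cpu int) {
    runtime.LockOSThread()
    var set unix.CPUSet
    set.Zero()
    set.Set(cpu)
    if err := unix.SchedSetaffinity(0, &set); err != nil {
        panic(err)
    }
}

func main() {
    const size = 1 << 28 // 256 MiB, large enough to defeat the CPU caches
    bufc := make(chan []byte)

    go func() {
        pin(0) // assumed to be on node 0
        buf := make([]byte, size)
        for i := range buf {
            buf[i] = 1 // first touch: Linux places the pages on this node
        }
        bufc <- buf
    }()

    pin(128) // assumed to be on node 1
    buf := <-bufc
    start := time.Now()
    var sum byte
    for i := 0; i < len(buf); i += 64 { // one load per cache line
        sum += buf[i]
    }
    fmt.Println(time.Since(start), sum) // print sum so the loop isn't optimized away
}
```

Comparing this timing against a run where both goroutines are pinned to the same node would give a rough remote-vs-local ratio. The kernel's automatic NUMA balancing may migrate the pages during the run and should probably be disabled while measuring.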

If the assumption about access latency is verified, the proposed solution is as follows:
1. Find objects matching this pattern, i.e. allocated in one goroutine but passed to another goroutine that accesses them much more frequently than the original one, through static analysis (like escape analysis) or dynamic analysis (like a thread sanitizer with its runtime replaced by an access counter; I tried this with C++ TSan, but not in Go, and I'm not familiar with Go's race-detector runtime).
2. Transform the code into the following pattern:

```
capture_by_ref:
    px = newobject_remote(int, nextprocid()) // nextprocid tells the allocator which P's mcache to allocate from
    closure = &funcval{thunk, px}
    newproc(closure) // newproc creates the new goroutine bound to that mcache and P
```
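As a point of comparison, the effect of this transformation can already be approximated by hand today: let the receiving goroutine do the allocation (and therefore the first touch) itself. A minimal sketch, reusing `N` from the example above:

```
// Manual version of the proposed transformation: the child goroutine
// allocates and first-touches x itself and sends the result back,
// instead of sharing a pointer allocated by the parent.
func capture_by_value(result chan<- int) {
    go func() {
        x := 0 // allocated and first touched on the child's P
        for range N {
            x++
        }
        result <- x
    }()
}
```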


Also: is there any simple test case that can illustrate the optimization effect of removing DMB ST barriers? And are there other variables, like GC, that I should control for to make the optimization effect more convincing?

(Sorry for omitting the 2nd idea; I can't express it clearly yet, which is why I'm asking for guidance here.)

Jason E. Aten

Oct 28, 2025, 5:21:40 PM
to golang-nuts
Go has a rather sophisticated and complex scheduler to begin with, which
suggests to me that you would be better off conducting your experiments
in a vastly simplified setting, say a single simple C (Zig?) process with
the minimal possible count of CPUs to observe in your experiments.

Also, you will want to account for the kernel's tendency to reactively move
pages closer (NUMA-wise) to where they are used; this may be on or off by default
on your Linux distribution, and it will definitely impact your measurements.
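(On Linux this feature is called automatic NUMA balancing; as far as I know it can be inspected and toggled through the `kernel.numa_balancing` sysctl, i.e. `/proc/sys/kernel/numa_balancing`.)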

Also, you will need to read up on how to know whether memory is actually still far
away, since the same virtual address can be backed by a near or a far physical
page behind your TLB. It's not possible, as far as I understand, to pin TLB
entries without kernel-level privilege. So these experiments are pretty hard to
do to begin with.
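(One way to query where a page actually lives, I believe, is the Linux `move_pages(2)` syscall: called with a NULL `nodes` argument it reports the current node of each page in its status array; `get_mempolicy(2)` with `MPOL_F_NODE|MPOL_F_ADDR` can do the same for a single address.)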

Start simple. Get reproducibility. Get a positive control. Advance from there.