Asking for guidance: estimating and optimizing synchronization cost and memory access latency (NUMA) on a many-core (256 logical cores) ARM server

Qingwei Li

Oct 27, 2025, 4:19:46 AM
to golang-nuts
Hello, everyone. I'm new to NUMA and to synchronization primitives on ARM, like the DMB ST/SY/... instructions, but I want to do some performance analysis and build runtime optimizations or compiler transformation passes for these scenarios. Although I know the names, I don't have deep insight into these concepts yet, and I can't work out the research steps to find the targets worth optimizing.

I have two ideas based on my naive understanding of NUMA, synchronization primitives, and the Go runtime, especially memory allocation.

Idea 1: allocate objects in the mcache (per-P local cache) of the P where they will be frequently accessed.

The motivating example is as follows:

```
const N = 1_000_000

func capture_by_ref() {
    // x is captured by reference, so it escapes to the heap.
    x := 0
    go func() {
        for range N { // requires Go 1.22+ (range over int)
            x++
        }
    }()
    _ = x
}
```
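For what it's worth, building this with `go build -gcflags=-m` should report `moved to heap: x` for this function, confirming that `x` is heap-allocated and shared by pointer.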

The pseudo-IR/SSA can be:

```
capture_by_ref:
    px = newobject(int)
    closure = &funcval{thunk, px}
    newproc(closure) // just pseudo-code, may not be correct
```

Here `x` is allocated via the mcache of the P running the "main" goroutine, but it is frequently accessed by the `go func` goroutine, which may be scheduled on a P on a different NUMA node. This may cause large access latency under NUMA. I'm not sure: maybe the CPU caches can hide the latency, but the first access from a remote node could still be expensive. I have no real insight here and am just guessing.
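To check the guess before building anything, a micro-benchmark along the following lines could measure the remote first-touch cost directly. This is a minimal sketch under several assumptions: Linux, the `golang.org/x/sys/unix` package available, and CPUs 0 and 128 sitting on different NUMA nodes (those CPU numbers are placeholders; check the real topology with `lscpu` or `numactl --hardware`):

```
// Minimal sketch: pin two goroutines to CPUs on (assumed) different NUMA
// nodes, fault pages in on one node, then time a cold pass from the other.
// CPU numbers 0 and 128 are placeholders for a hypothetical topology.
package main

import (
    "fmt"
    "runtime"
    "time"

    "golang.org/x/sys/unix"
)

// pin locks the calling goroutine to its OS thread and restricts that
// thread to a single CPU.
func pin(cpu int) {
    runtime.LockOSThread()
    var set unix.CPUSet
    set.Zero()
    set.Set(cpu)
    if err := unix.SchedSetaffinity(0, &set); err != nil {
        panic(err)
    }
}

func main() {
    const size = 1 << 28 // 256 MiB, large enough to defeat the CPU caches
    bufc := make(chan []byte)

    go func() {
        pin(0) // assumed to be on node 0
        buf := make([]byte, size)
        for i := range buf {
            buf[i] = 1 // first touch: Linux places the pages on this node
        }
        bufc <- buf
    }()

    pin(128) // assumed to be on node 1
    buf := <-bufc
    start := time.Now()
    var sum byte
    for i := 0; i < len(buf); i += 64 { // one load per cache line
        sum += buf[i]
    }
    fmt.Println(time.Since(start), sum) // print sum so the loop isn't optimized away
}
```

Comparing this timing against a run where both goroutines are pinned to the same node would give a rough remote-vs-local ratio. The kernel's automatic NUMA balancing may migrate the pages during the run and should probably be disabled while measuring.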

If the assumption about access latency is verified, the proposed solution is as follows:
1. Find objects matching this pattern, i.e. allocated in one goroutine but passed to another goroutine that accesses them much more frequently than the original one, through static analysis (like escape analysis) or dynamic analysis (like a thread sanitizer with its runtime replaced by an access counter; I tried this with C++ TSan, but not in Go, and I'm not familiar with Go's race-detector runtime).
2. Transform the code into the following pattern:

```
capture_by_ref:
    px = newobject_remote(int, nextprocid()) // nextprocid tells the allocator which P's mcache to allocate from
    closure = &funcval{thunk, px}
    newproc(closure) // newproc creates the new goroutine bound to that mcache and P
```
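As a point of comparison, the effect of this transformation can already be approximated by hand today: let the receiving goroutine do the allocation (and therefore the first touch) itself. A minimal sketch, reusing `N` from the example above:

```
// Manual version of the proposed transformation: the child goroutine
// allocates and first-touches x itself and sends the result back,
// instead of sharing a pointer allocated by the parent.
func capture_by_value(result chan<- int) {
    go func() {
        x := 0 // allocated and first touched on the child's P
        for range N {
            x++
        }
        result <- x
    }()
}
```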


Also: is there any simple test case that can illustrate the optimization effect of removing DMB ST barriers? And are there other variables, like GC, that I should control for to make the optimization effect more convincing?

(Sorry for omitting the 2nd idea; I can't express it clearly yet, which is why I'm asking for guidance here.)

Jason E. Aten

Oct 28, 2025, 5:21:40 PM
to golang-nuts
Go has a rather sophisticated and complex scheduler to begin with, which
suggests to me that you would be better off conducting your experiments
in a vastly simplified setting, say a single simple C (Zig?) process with
the minimal possible count of CPUs to observe in your experiments.

Also, you will want to account for the kernel's tendency to reactively move
pages closer (NUMA-wise) to where they are used; this may be on or off by default
on your Linux distribution, and it will definitely impact your measurements.
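(On Linux this feature is called automatic NUMA balancing; as far as I know it can be inspected and toggled through the `kernel.numa_balancing` sysctl, i.e. `/proc/sys/kernel/numa_balancing`.)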

Also, you will need to read up on how to know whether memory is actually still far
away, since the same virtual address can be backed by a near or a far physical
page behind your TLB. It's not possible, as far as I understand, to pin TLB
entries without kernel-level privilege. So these experiments are pretty hard to
do to begin with.
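(One way to query where a page actually lives, I believe, is the Linux `move_pages(2)` syscall: called with a NULL `nodes` argument it reports the current node of each page in its status array; `get_mempolicy(2)` with `MPOL_F_NODE|MPOL_F_ADDR` can do the same for a single address.)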

Start simple. Get reproducibility. Get a positive control. Advance from there.