Hello, everyone. I'm new to NUMA and to ARM synchronization primitives such as the DMB ST/SY/... instructions, but I want to do some performance analysis and build runtime optimizations or a compiler transformation pass for these scenarios. Although I know the names, I don't yet have deep insight into these concepts and can't work out the research steps needed to find good optimization targets.
I have two ideas based on my naive understanding of NUMA, synchronization primitives, and the Go runtime, especially memory allocation.
idea 1: allocate objects in the mcache (local cache) corresponding to where they are most frequently accessed.
The motivating example is as follows:
```
const N = 1_000_000 // iteration count; any large constant

func capture_by_ref() {
	// x is captured by reference by the closure, so it escapes to the heap.
	x := 0
	go func() {
		for range N { // range over int, Go 1.22+
			x++
		}
	}()
	_ = x
}
```
The pseudo-IR/SSA might look like:
```
capture_by_ref:
    px = newobject(int)
    closure = &funcval{thunk, px}
    newproc(closure) // just pseudo-code, may not be correct
```
Here `x` is heap-allocated from the mcache (local cache) of the "main" goroutine's P, but it is mostly accessed by the `go func` goroutine, which may be running on a different NUMA node. My assumption is that this causes large access latency on a NUMA machine, but I'm not sure: maybe the hardware caches hide most of it, and only the first access from the remote node pays the full latency. I have no real insight here and am just guessing.
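To sanity-check this assumption before touching the runtime, I was thinking of a rough benchmark along the lines of the sketch below. It only contrasts "allocate and access in the same goroutine" with "allocate in one goroutine, write from another"; to see real NUMA effects it would presumably need a multi-socket machine (e.g. running the two cases pinned to different nodes with numactl), since Go itself doesn't expose node affinity. All names and the iteration count are my own placeholders.
```
package numabench

import "testing"

const iters = 1 << 20

// BenchmarkLocalAlloc: each worker goroutine allocates its own counter
// and increments it, so allocation and access happen in the same goroutine.
func BenchmarkLocalAlloc(b *testing.B) {
	for i := 0; i < b.N; i++ {
		done := make(chan *int)
		go func() {
			px := new(int) // allocated by the goroutine that writes it
			for j := 0; j < iters; j++ {
				(*px)++
			}
			done <- px
		}()
		<-done
	}
}

// BenchmarkRemoteAlloc: the counter is allocated by the benchmark
// goroutine and only written by the worker, mirroring capture_by_ref above.
func BenchmarkRemoteAlloc(b *testing.B) {
	for i := 0; i < b.N; i++ {
		px := new(int) // allocated here ...
		done := make(chan struct{})
		go func() {
			for j := 0; j < iters; j++ {
				(*px)++ // ... but written over there
			}
			close(done)
		}()
		<-done
	}
}
```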
If this latency assumption can be verified, the proposed solution is as follows:
1. Find objects that match this pattern, i.e. allocated in one goroutine but passed to another goroutine that accesses them much more frequently than the original one. This could be done with static analysis (similar to escape analysis) or dynamic analysis (for example, thread-sanitizer-style instrumentation whose runtime is replaced by access counters; I have tried this with the C++ TSan runtime, but not in Go, and I'm not familiar with Go's race detector internals).
2. Transform the code into the following pattern (a user-level approximation I could use for measurement is sketched after the block):
```
capture_by_ref:
    px = newobject_remote(int, nextprocid()) // nextprocid() hints the allocator to use the target P's mcache
    closure = &funcval{thunk, px}
    newproc(closure)                         // newproc creates the goroutine bound to that mcache/P
```
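Since `newobject_remote` and `nextprocid` do not exist, one user-level approximation I considered for measurement (not the compiler transformation itself) is to let the worker goroutine allocate its own data after pinning its OS thread to a CPU on the target node, relying on the kernel's first-touch policy. This is only a sketch assuming Linux and `golang.org/x/sys/unix`; whether the memory really lands on the intended node also depends on how the Go heap has already been faulted in, which I don't fully understand. The CPU number is a placeholder; on a real machine I would first look up the node-to-CPU mapping with `numactl --hardware`.
```
package main

import (
	"runtime"

	"golang.org/x/sys/unix"
)

// pinToCPU binds the calling goroutine's OS thread to a single CPU.
// The caller must keep the thread locked while relying on the pin.
func pinToCPU(cpu int) error {
	runtime.LockOSThread()
	var set unix.CPUSet
	set.Zero()
	set.Set(cpu)
	return unix.SchedSetaffinity(0, &set) // pid 0 = calling thread
}

// worker allocates its own counter *after* pinning, instead of
// receiving a pointer that was allocated by the parent goroutine.
func worker(cpuOnTargetNode, n int, out chan<- int) {
	defer runtime.UnlockOSThread()
	if err := pinToCPU(cpuOnTargetNode); err != nil {
		out <- -1
		return
	}
	px := new(int) // allocation now happens in the accessing goroutine
	for i := 0; i < n; i++ {
		(*px)++
	}
	out <- *px
}

func main() {
	out := make(chan int, 1)
	go worker(8 /* placeholder: some CPU on the other node */, 1<<20, out)
	println(<-out)
}
```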
Is there a simple test case that can illustrate the performance effect of removing a DMB ST? And are there other variables, such as GC, that I should control for to make the measured effect more convincing?
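For context on what I mean by a simple test case: my current naive idea is just a tight heap-allocation loop like the sketch below, based on my (possibly wrong) understanding that mallocgc issues a publication barrier after initializing each new object, and that on arm64 this barrier is a DMB ST. I would compare it on a stock toolchain vs. one with the barrier relaxed. Running with GOGC=off (or a very large GOGC) is my attempt to keep GC, the "other variable" I mentioned, out of the numbers, at the cost of heap growth during the benchmark.
```
package dmbbench

import "testing"

type node struct {
	next *node // pointer field, so the object is scannable by the GC
	val  int
}

// sink is package-level so the nodes always escape to the heap:
// every iteration must go through mallocgc (and, if my understanding
// is right, through its publication barrier on arm64).
var sink *node

func BenchmarkAlloc(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sink = &node{val: i}
	}
}
```
I would run it as, e.g., `GOGC=off go test -bench=Alloc -count=10` on the two toolchains and compare.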
(Sorry about the 2nd idea: I can't express it clearly yet, so I'm only asking for guidance on the first one here.)