The discussion is restricted to AArch64.
Question: On arm64, publicationBarrier in mallocgc is implemented as DMB ST.
What is the invariant that requires it to execute at its current position?
Specifically:
- Must it execute before the allocated object becomes visible to another P/M?
- Must it execute before GC metadata becomes visible?
- Or is it required for maintaining the tri-color invariant under concurrent GC?
My reasoning (please correct me if wrong):
The comment in runtime/stubs.go says that the purpose of publicationBarrier is to ensure that other processors observe a fully initialized object before it becomes reachable by the GC.
If that is the case, it seems that as long as:
1) the allocated object is not yet accessible by another goroutine, and
2) the allocating goroutine is not preempted and does not reschedule itself onto another P/M (through chanrecv or similar operations),
then the barrier might be deferrable.
Under this reasoning, it appears possible that a single DMB ST could be shared across multiple consecutive mallocgc calls.
However, I'm unsure whether this reasoning overlooks some GC or scheduler invariants, and that is what I would like to understand.
---
Background:
The current order in mallocgc (simplified) is:
```go
alloc
publicationBarrier // DMB ST
update GC metadata
```
According to measurements in issue comment
https://github.com/golang/go/issues/63640#issuecomment-3661284210, the barrier can account for ~35–40% of mallocgc time on arm64 microbenchmarks.
I experimented with amortizing the barrier across multiple consecutive allocations (i.e., sharing one DMB ST among them). The design is omitted here for brevity. Microbenchmark results show a mixed performance impact:
```
goos: linux
goarch: arm64
pkg: runtime
                     │ default.txt │              batch.txt              │
                     │   sec/op    │   sec/op     vs base                │
Malloc8-64             22.11n ± 0%   21.82n ± 0%   -1.31% (p=0.000 n=10)
Malloc16-64            38.79n ± 0%   33.76n ± 0%  -12.98% (p=0.000 n=10)
MallocTypeInfo8-64     28.49n ± 0%   31.37n ± 0%  +10.11% (p=0.000 n=10)
MallocTypeInfo16-64    38.19n ± 0%   39.57n ± 0%   +3.61% (p=0.000 n=10)
MallocLargeStruct-64   417.9n ± 1%   400.8n ± 1%   -4.10% (p=0.000 n=10)
geomean                52.27n        51.62n        -1.24%
```
However, my main concern is correctness: I would like to understand the exact memory-ordering guarantee enforced by this barrier on AArch64.
Thanks.