This looks like a live-lock problem that occurs when a goroutine running a time-consuming, non-preemptible function is being suspended by its background worker at the same time. I tried a few changes to reduce the cost, and one of them looks promising. Details and data are posted in
https://github.com/golang/go/issues/40229. Thanks.
Furthermore, functions like bulkBarrierPreWriteSrcOnly rely on heap bits to identify pointer fields. Perhaps we could exploit the known layout of some built-in types: for example, if the element type is a slice, a specialized version performs slightly better. I am not sure it is worth the complexity, though, and the layout of a slice might change later.
general version:
	h := heapBitsForAddr(dst)
	for i := uintptr(0); i < size; i += sys.PtrSize {
		if h.isPointer() {
			srcx := (*uintptr)(unsafe.Pointer(src + i))
			if !buf.putFast(0, *srcx) {
				wbBufFlush(nil, 0)
			}
		}
		h = h.next()
	}
specialized version for 'slice of slice':
	for i := uintptr(0); i < size; i += sys.PtrSize * 3 {
		srcx := (*uintptr)(unsafe.Pointer(src + i))
		if !buf.putFast(0, *srcx) {
			wbBufFlush(nil, 0)