runtime: reduce contention in (*lfstack).pop
When profiling CPU usage LiveKit on AArch64/x86 (AWS), the graphs show
CPU spikes that was repeating in a semi-periodic manner and spikes occur
when the GC(garbage collector) is active.
Our analysis found that the getempty function accounted for 10.54% of the
overhead, which was mainly caused by the work.empty.pop() function. And
listing pop shows that the majority of the time, with a 10.29% overhead,
is spent on atomic.Cas64((*uint64)(head), old, next).
This patch adds a backoff approach to reduce the high overhead of the
atomic operation primarily occurs when contention over a specific memory
address increases, typically with the rise in the number of threads
This patch adds a new function nano_delay(), which is an Armv8.0-A compatible
delay function using the counter-timer.
The garbage collector benchmark:
│ master │ opt │
│ sec/op │ sec/op vs base │
Garbage/benchmem-MB=64-160 3.782m ± 4% 2.264m ± 2% -40.12% (p=0.000 n=10)
│ user+sys-sec/op │ user+sys-sec/op vs base │
Garbage/benchmem-MB=64-160 433.5m ± 4% 255.4m ± 2% -41.08% (p=0.000 n=10)
Reference for backoff mechianism:
https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm
diff --git a/src/runtime/asm_arm64.s b/src/runtime/asm_arm64.s
index 6c447ac..24e264d 100644
--- a/src/runtime/asm_arm64.s
+++ b/src/runtime/asm_arm64.s
@@ -970,6 +970,47 @@
CBNZ R0, again
RET
+// The Arm architecture provides a user space accessible counter-timer which
+// is incremented at a fixed but machine-specific rate. Software can (spin)
+// wait until the counter-timer reaches some desired value.
+// Armv8.7-A introduced the WFET (FEAT_WFxT) instruction, which allows the
+// processor to enter a low power state for a set time, or until an event is
+// received.
+// Without this feature, we can instead use the ISB instruction to
+// decrease processor activity and thus power consumption between checks of
+// the counter-timer. Note that we do not depend on the latency of the ISB instruction.
+// Read more in this Arm blog post:
+// https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm
+
+TEXT runtime·nano_delay(SB),NOSPLIT,$0-0
+ MOVWU backoff_ns+0(FP), R0
+ ISB $15
+ CMP $0x12, R0
+ BLS ret
+ SUB $0x12, R0, R0
+ MRS CNTFRQ_EL0, R1
+ ADD R0<<4, R0, R0
+ MUL R1, R0, R0
+ LSR $30, R0, R0
+ CBZ R0, ret
+ MRS CNTVCT_EL0, R2
+ CMP $0x28, R0
+ BLS loop
+ SUB $0x28, R0, R3
+delay:
+ ISB $15
+ MRS CNTVCT_EL0, R1
+ SUB R2, R1, R1
+ CMP R3, R1
+ BCC delay
+loop:
+ MRS CNTVCT_EL0, R1
+ SUB R2, R1, R1
+ CMP R0, R1
+ BCC loop
+ret:
+ RET
+
// Save state of caller into g->sched,
// but using fake PC from systemstack_switch.
// Must only be called from functions with no locals ($0)
diff --git a/src/runtime/lfstack.go b/src/runtime/lfstack.go
index cbec6e8..11b0eee 100644
--- a/src/runtime/lfstack.go
+++ b/src/runtime/lfstack.go
@@ -37,20 +37,6 @@
}
}
-func (head *lfstack) pop() unsafe.Pointer {
- for {
- old := atomic.Load64((*uint64)(head))
- if old == 0 {
- return nil
- }
- node := lfstackUnpack(old)
- next := atomic.Load64(&node.next)
- if atomic.Cas64((*uint64)(head), old, next) {
- return unsafe.Pointer(node)
- }
- }
-}
-
func (head *lfstack) empty() bool {
return atomic.Load64((*uint64)(head)) == 0
}
diff --git a/src/runtime/mspanset.go b/src/runtime/mspanset.go
index 3aa2b5b..f535002 100644
--- a/src/runtime/mspanset.go
+++ b/src/runtime/mspanset.go
@@ -136,90 +136,6 @@
block.spans[bottom].StoreNoWB(s)
}
-// pop removes and returns a span from buffer b, or nil if b is empty.
-// pop is safe to call concurrently with other pop and push operations.
-func (b *spanSet) pop() *mspan {
- var head, tail uint32
-claimLoop:
- for {
- headtail := b.index.load()
- head, tail = headtail.split()
- if head >= tail {
- // The buf is empty, as far as we can tell.
- return nil
- }
- // Check if the head position we want to claim is actually
- // backed by a block.
- spineLen := b.spineLen.Load()
- if spineLen <= uintptr(head)/spanSetBlockEntries {
- // We're racing with a spine growth and the allocation of
- // a new block (and maybe a new spine!), and trying to grab
- // the span at the index which is currently being pushed.
- // Instead of spinning, let's just notify the caller that
- // there's nothing currently here. Spinning on this is
- // almost definitely not worth it.
- return nil
- }
- // Try to claim the current head by CASing in an updated head.
- // This may fail transiently due to a push which modifies the
- // tail, so keep trying while the head isn't changing.
- want := head
- for want == head {
- if b.index.cas(headtail, makeHeadTailIndex(want+1, tail)) {
- break claimLoop
- }
- headtail = b.index.load()
- head, tail = headtail.split()
- }
- // We failed to claim the spot we were after and the head changed,
- // meaning a popper got ahead of us. Try again from the top because
- // the buf may not be empty.
- }
- top, bottom := head/spanSetBlockEntries, head%spanSetBlockEntries
-
- // We may be reading a stale spine pointer, but because the length
- // grows monotonically and we've already verified it, we'll definitely
- // be reading from a valid block.
- blockp := b.spine.Load().lookup(uintptr(top))
-
- // Given that the spine length is correct, we know we will never
- // see a nil block here, since the length is always updated after
- // the block is set.
- block := blockp.Load()
- s := block.spans[bottom].Load()
- for s == nil {
- // We raced with the span actually being set, but given that we
- // know a block for this span exists, the race window here is
- // extremely small. Try again.
- s = block.spans[bottom].Load()
- }
- // Clear the pointer. This isn't strictly necessary, but defensively
- // avoids accidentally re-using blocks which could lead to memory
- // corruption. This way, we'll get a nil pointer access instead.
- block.spans[bottom].StoreNoWB(nil)
-
- // Increase the popped count. If we are the last possible popper
- // in the block (note that bottom need not equal spanSetBlockEntries-1
- // due to races) then it's our responsibility to free the block.
- //
- // If we increment popped to spanSetBlockEntries, we can be sure that
- // we're the last popper for this block, and it's thus safe to free it.
- // Every other popper must have crossed this barrier (and thus finished
- // popping its corresponding mspan) by the time we get here. Because
- // we're the last popper, we also don't have to worry about concurrent
- // pushers (there can't be any). Note that we may not be the popper
- // which claimed the last slot in the block, we're just the last one
- // to finish popping.
- if block.popped.Add(1) == spanSetBlockEntries {
- // Clear the block's pointer.
- blockp.StoreNoWB(nil)
-
- // Return the block to the block pool.
- spanSetBlockPool.free(block)
- }
- return s
-}
-
// reset resets a spanSet which is empty. It will also clean up
// any left over blocks.
//
diff --git a/src/runtime/pop.go b/src/runtime/pop.go
new file mode 100644
index 0000000..fd093c9
--- /dev/null
+++ b/src/runtime/pop.go
@@ -0,0 +1,111 @@
+// Copyright 2012 The Go Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+//go:build !arm64
+
+package runtime
+
+import (
+ "internal/runtime/atomic"
+ "unsafe"
+)
+
+// free-stack
+func (head *lfstack) pop() unsafe.Pointer {
+ for {
+ old := atomic.Load64((*uint64)(head))
+ if old == 0 {
+ return nil
+ }
+ node := lfstackUnpack(old)
+ next := atomic.Load64(&node.next)
+ if atomic.Cas64((*uint64)(head), old, next) {
+ return unsafe.Pointer(node)
+ }
+ }
+}
+
+// pop removes and returns a span from buffer b, or nil if b is empty.
+// pop is safe to call concurrently with other pop and push operations.
+func (b *spanSet) pop() *mspan {
+ var head, tail uint32
+claimLoop:
+ for {
+ headtail := b.index.load()
+ head, tail = headtail.split()
+ if head >= tail {
+ // The buf is empty, as far as we can tell.
+ return nil
+ }
+ // Check if the head position we want to claim is actually
+ // backed by a block.
+ spineLen := b.spineLen.Load()
+ if spineLen <= uintptr(head)/spanSetBlockEntries {
+ // We're racing with a spine growth and the allocation of
+ // a new block (and maybe a new spine!), and trying to grab
+ // the span at the index which is currently being pushed.
+ // Instead of spinning, let's just notify the caller that
+ // there's nothing currently here. Spinning on this is
+ // almost definitely not worth it.
+ return nil
+ }
+ // Try to claim the current head by CASing in an updated head.
+ // This may fail transiently due to a push which modifies the
+ // tail, so keep trying while the head isn't changing.
+ want := head
+ for want == head {
+ if b.index.cas(headtail, makeHeadTailIndex(want+1, tail)) {
+ break claimLoop
+ }
+ headtail = b.index.load()
+ head, tail = headtail.split()
+ }
+ // We failed to claim the spot we were after and the head changed,
+ // meaning a popper got ahead of us. Try again from the top because
+ // the buf may not be empty.
+ }
+ top, bottom := head/spanSetBlockEntries, head%spanSetBlockEntries
+
+ // We may be reading a stale spine pointer, but because the length
+ // grows monotonically and we've already verified it, we'll definitely
+ // be reading from a valid block.
+ blockp := b.spine.Load().lookup(uintptr(top))
+
+ // Given that the spine length is correct, we know we will never
+ // see a nil block here, since the length is always updated after
+ // the block is set.
+ block := blockp.Load()
+ s := block.spans[bottom].Load()
+ for s == nil {
+ // We raced with the span actually being set, but given that we
+ // know a block for this span exists, the race window here is
+ // extremely small. Try again.
+ s = block.spans[bottom].Load()
+ }
+ // Clear the pointer. This isn't strictly necessary, but defensively
+ // avoids accidentally re-using blocks which could lead to memory
+ // corruption. This way, we'll get a nil pointer access instead.
+ block.spans[bottom].StoreNoWB(nil)
+
+ // Increase the popped count. If we are the last possible popper
+ // in the block (note that bottom need not equal spanSetBlockEntries-1
+ // due to races) then it's our responsibility to free the block.
+ //
+ // If we increment popped to spanSetBlockEntries, we can be sure that
+ // we're the last popper for this block, and it's thus safe to free it.
+ // Every other popper must have crossed this barrier (and thus finished
+ // popping its corresponding mspan) by the time we get here. Because
+ // we're the last popper, we also don't have to worry about concurrent
+ // pushers (there can't be any). Note that we may not be the popper
+ // which claimed the last slot in the block, we're just the last one
+ // to finish popping.
+ if block.popped.Add(1) == spanSetBlockEntries {
+ // Clear the block's pointer.
+ blockp.StoreNoWB(nil)
+
+ // Return the block to the block pool.
+ spanSetBlockPool.free(block)
+ }
+ return s
+}
diff --git a/src/runtime/pop_arm64.go b/src/runtime/pop_arm64.go
new file mode 100644
index 0000000..f52a127
--- /dev/null
+++ b/src/runtime/pop_arm64.go
@@ -0,0 +1,128 @@
+// Copyright 2012 The Go Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+// Lock-free stack.
+
+package runtime
+
+import (
+ "internal/runtime/atomic"
+ "unsafe"
+)
+
+func (head *lfstack) pop() unsafe.Pointer {
+ var backoff uint32 = 128
+ for {
+ old := atomic.Load64((*uint64)(head))
+ if old == 0 {
+ return nil
+ }
+ node := lfstackUnpack(old)
+ next := atomic.Load64(&node.next)
+ if atomic.Cas64((*uint64)(head), old, next) {
+ return unsafe.Pointer(node)
+ }
+ // Use a backoff approach to reduce demand to the shared memory location
+ // decreases memory contention and allows for other threads to make quicker
+ // progress.
+ // Read more in this Arm blog post:
+ // https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm
+ nano_delay(backoff)
+ // Increase backoff time.
+ backoff += backoff >> 1
+ }
+}
+
+// pop removes and returns a span from buffer b, or nil if b is empty.
+// pop is safe to call concurrently with other pop and push operations.
+func (b *spanSet) pop() *mspan {
+ var head, tail uint32
+ var backoff uint32 = 128
+claimLoop:
+ for {
+ headtail := b.index.load()
+ head, tail = headtail.split()
+ if head >= tail {
+ // The buf is empty, as far as we can tell.
+ return nil
+ }
+ // Check if the head position we want to claim is actually
+ // backed by a block.
+ spineLen := b.spineLen.Load()
+ if spineLen <= uintptr(head)/spanSetBlockEntries {
+ // We're racing with a spine growth and the allocation of
+ // a new block (and maybe a new spine!), and trying to grab
+ // the span at the index which is currently being pushed.
+ // Instead of spinning, let's just notify the caller that
+ // there's nothing currently here. Spinning on this is
+ // almost definitely not worth it.
+ return nil
+ }
+ // Try to claim the current head by CASing in an updated head.
+ // This may fail transiently due to a push which modifies the
+ // tail, so keep trying while the head isn't changing.
+ want := head
+ for want == head {
+ if b.index.cas(headtail, makeHeadTailIndex(want+1, tail)) {
+ break claimLoop
+ }
+ // Use a backoff approach to reduce demand to the shared memory location
+ // decreases memory contention and allows for other threads to make quicker
+ // progress.
+ // Read more in this Arm blog post:
+ // https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm
+ nano_delay(backoff)
+ // Increase backoff time.
+ backoff += backoff >> 1
+ headtail = b.index.load()
+ head, tail = headtail.split()
+ }
+ // We failed to claim the spot we were after and the head changed,
+ // meaning a popper got ahead of us. Try again from the top because
+ // the buf may not be empty.
+ }
+ top, bottom := head/spanSetBlockEntries, head%spanSetBlockEntries
+
+ // We may be reading a stale spine pointer, but because the length
+ // grows monotonically and we've already verified it, we'll definitely
+ // be reading from a valid block.
+ blockp := b.spine.Load().lookup(uintptr(top))
+
+ // Given that the spine length is correct, we know we will never
+ // see a nil block here, since the length is always updated after
+ // the block is set.
+ block := blockp.Load()
+ s := block.spans[bottom].Load()
+ for s == nil {
+ // We raced with the span actually being set, but given that we
+ // know a block for this span exists, the race window here is
+ // extremely small. Try again.
+ s = block.spans[bottom].Load()
+ }
+ // Clear the pointer. This isn't strictly necessary, but defensively
+ // avoids accidentally re-using blocks which could lead to memory
+ // corruption. This way, we'll get a nil pointer access instead.
+ block.spans[bottom].StoreNoWB(nil)
+
+ // Increase the popped count. If we are the last possible popper
+ // in the block (note that bottom need not equal spanSetBlockEntries-1
+ // due to races) then it's our responsibility to free the block.
+ //
+ // If we increment popped to spanSetBlockEntries, we can be sure that
+ // we're the last popper for this block, and it's thus safe to free it.
+ // Every other popper must have crossed this barrier (and thus finished
+ // popping its corresponding mspan) by the time we get here. Because
+ // we're the last popper, we also don't have to worry about concurrent
+ // pushers (there can't be any). Note that we may not be the popper
+ // which claimed the last slot in the block, we're just the last one
+ // to finish popping.
+ if block.popped.Add(1) == spanSetBlockEntries {
+ // Clear the block's pointer.
+ blockp.StoreNoWB(nil)
+
+ // Return the block to the block pool.
+ spanSetBlockPool.free(block)
+ }
+ return s
+}
diff --git a/src/runtime/stubs_arm64.go b/src/runtime/stubs_arm64.go
index df04e64..e83c10f 100644
--- a/src/runtime/stubs_arm64.go
+++ b/src/runtime/stubs_arm64.go
@@ -25,3 +25,5 @@
// getfp returns the frame pointer register of its caller or 0 if not implemented.
// TODO: Make this a compiler intrinsic
func getfp() uintptr
+
+func nano_delay(backoff_ns uint32)
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Run-TryBot | +1 |
Can you help to reivew this patch? Thank you.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Thanks.
Is there a reason to think that there is more of a problem on arm64 than other processors? It seems to me likely that we should do the same thing on all processors.
Rather than nano_sleep (which is misnamed for Go, should be nanoSleep or nsleep), I suggest that we use the existing procyield function. If that function is not a good choice, why not?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Commit-Queue | +1 |
Thanks.
Is there a reason to think that there is more of a problem on arm64 than other processors? It seems to me likely that we should do the same thing on all processors.
Rather than nano_sleep (which is misnamed for Go, should be nanoSleep or nsleep), I suggest that we use the existing procyield function. If that function is not a good choice, why not?
This is not an arm64-specific issue, it is the most common scalability bottleneck we see inside the GC. https://go.dev/issue/21056 is a very old issue. More recently examples like https://go.dev/issue/65064#issuecomment-2609805922 still show this bottleneck (in the second screenshot, the hot function under `runtime.scanstack` is `getempty`).
We definitely want to improve this. Perhaps backoff is a good/easy first step. Better I think would be redesigns that actually fundamentally reduce contention.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Thanks.
Is there a reason to think that there is more of a problem on arm64 than other processors? It seems to me likely that we should do the same thing on all processors.
Rather than nano_sleep (which is misnamed for Go, should be nanoSleep or nsleep), I suggest that we use the existing procyield function. If that function is not a good choice, why not?
I agree that this is not just a problem with the arm64 platform. But I don't have other non-arm64 environments to test the impact of the modification on their performance.
On arm64, procyield function is implemented with the YIELD instruction, which does not provide a backoff when spinning or waiting. We considered changing the procyield implementation to a nano_sleep implementation, but there are many other places in the runtime that call procyield and we don't know if the new implementation will function differently from the original.
Do you think it is ok to change the implementation of procyield to that of nano_sleep? Thanks.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Dmitri ShuralyovTryBots beginning. Status page: https://farmer.golang.org/try?commit=c254c9b8
There weren't any specific legacy trybots requested, to stopping this empty request.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Run-TryBot | +1 |
Fannie ZhangThanks.
Is there a reason to think that there is more of a problem on arm64 than other processors? It seems to me likely that we should do the same thing on all processors.
Rather than nano_sleep (which is misnamed for Go, should be nanoSleep or nsleep), I suggest that we use the existing procyield function. If that function is not a good choice, why not?
I agree that this is not just a problem with the arm64 platform. But I don't have other non-arm64 environments to test the impact of the modification on their performance.
On arm64, procyield function is implemented with the YIELD instruction, which does not provide a backoff when spinning or waiting. We considered changing the procyield implementation to a nano_sleep implementation, but there are many other places in the runtime that call procyield and we don't know if the new implementation will function differently from the original.
Do you think it is ok to change the implementation of procyield to that of nano_sleep? Thanks.
@ia...@golang.org Kindly ping.
@ia...@google.com The patch was refactored, the patch abandoned nano_delay and rewritten the procyield. Please help to review this patch again. Thank you.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Removed Run-TryBot+1 by Fannie Zhang <Fannie...@arm.com>
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
sorry for not looking at this sooner, but we've been trying to address some of these issues with Green Tea instead. that being said, we still use the old workbufs for larger objects, so something like may still be worthwhile.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Commit-Queue | +1 |
// Armv8.7-A introduced the WFET (FEAT_WFxT) instruction, which allows the
// processor to enter a low power state for a set time, or until an event is
// received.what's the reason we're not using it here? would it require an extra feature check? since we're not using it but aware of it (and it's recommended), we should document why.
// Use a backoff approach to reduce demand to the shared memory location
// decreases memory contention and allows for other threads to make quicker
// progress.
// Read more in this Arm blog post:
// https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm
procyield(backoff)
// Increase backoff time.
backoff += backoff >> 1the idea that we should back off instead of spinning honestly doesn't seem to be that arm64 specific.
I argue that we should make the initial value of backoff dependent on the platform. maybe to start with, we can set it to be zero on all platforms except arm64? and skip the procyield if it's zero? that seems a little bit more general.
lastly, we should add a TODO to tweak this number on other platforms besides arm64.
| Commit-Queue | +1 |
ah, your change needs to be rebased on tip. it should be a clean rebase, AFAICT.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
When profiling CPU usage LiveKit on AArch64/x86 (AWS), the graphs showSince you seem to have a benchmark or production workload to measure, we would appreciate any results you can share on https://go.dev/issue/73581 on the impact of the Green Tea GC. Particularly after CL 700496 (which is only on tip, not in 1.25).
This patch rewrites the implementation of procyiely(), which is an Armv8.0-Aprocyield
// the counter-timer. Note that we do not depend on the latency of the ISB instruction.I think what you mean here is that we use a real counter to measure the delay rather than assuming a fixed delay from ISB, but out of context this sentence is confusing because the code below does hard-code an assumed 18ns ISB delay. Please expand the comment to clarify.
CMP $0x12, R0These constants (and 0x28 below) should probably be in decimal to match the blog post.
CMP $0x12, R0How much do the heuristics around avoiding ISB for small n, and avoiding ISB at the end of a longer delay actually matter?
The smallest input we currently pass is 128ns, so we'll never skip ISB entirely, but will use the heuristic for the last ISB.
It seems to me like it shouldn't matter much, especially the tail case. The delay time is already an arbitrary heuristic, so making this function very accurate doesn't seem too important.
If we could simplify this function to just the ns -> cycle conversion plus the `delay` loop it would be much easier to reason about.
ADD R0<<4, R0, R0Shouldn't this be `R0>>4`?
MOVWU backoff_ns+0(FP), R0
ISB $15
CMP $0x12, R0
BLS ret
SUB $0x12, R0, R0
MRS CNTFRQ_EL0, R1
ADD R0<<4, R0, R0
MUL R1, R0, R0
LSR $30, R0, R0
CBZ R0, ret
MRS CNTVCT_EL0, R2
CMP $0x28, R0
BLS loop
SUB $0x28, R0, R3
delay:
ISB $15
MRS CNTVCT_EL0, R1
SUB R2, R1, R1
CMP R3, R1
BCC delay
loop:
MRS CNTVCT_EL0, R1
SUB R2, R1, R1
CMP R0, R1
BCC loop
ret:
RETPlease add comments throughout to explain what is going on. The blog post provides psuedocode, but if the blog post disappears this code will be quite confusing on its own, particularly because it contains several heuristics.
if goarch.IsArm64 == 1 {```suggestion
if GOARCH == "arm64" {
```
Comparing the GOARCH is the preferred form outside of constants that must use a number.
backoff += backoff >> 1In my opinion, division is more intuitive.
```suggestion
backoff += backoff / 2
```
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
runtime: reduce contention in (*lfstack).pop"on arm64"
func (head *lfstack) pop() unsafe.Pointer {Why is pop moving from where it was? It seems like the current location below push is the logical one.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
ah, your change needs to be rebased on tip. it should be a clean rebase, AFAICT.
Done
I just got back from the National Day holiday, so my reply is late. Thank you for the review. I've already resolved some of the comments and am still working on it. Thanks again.
runtime: reduce contention in (*lfstack).popFannie Zhang"on arm64"
Done
This patch rewrites the implementation of procyiely(), which is an Armv8.0-AFannie Zhangprocyield
Done
// Armv8.7-A introduced the WFET (FEAT_WFxT) instruction, which allows the
// processor to enter a low power state for a set time, or until an event is
// received.what's the reason we're not using it here? would it require an extra feature check? since we're not using it but aware of it (and it's recommended), we should document why.
Done
func (head *lfstack) pop() unsafe.Pointer {Why is pop moving from where it was? It seems like the current location below push is the logical one.
It was a mistake. Moved it back to its original location. Done.
```suggestion
if GOARCH == "arm64" {
```Comparing the GOARCH is the preferred form outside of constants that must use a number.
Done
backoff += backoff >> 1In my opinion, division is more intuitive.
```suggestion
backoff += backoff / 2
```
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
These constants (and 0x28 below) should probably be in decimal to match the blog post.
Done
How much do the heuristics around avoiding ISB for small n, and avoiding ISB at the end of a longer delay actually matter?
The smallest input we currently pass is 128ns, so we'll never skip ISB entirely, but will use the heuristic for the last ISB.
It seems to me like it shouldn't matter much, especially the tail case. The delay time is already an arbitrary heuristic, so making this function very accurate doesn't seem too important.
If we could simplify this function to just the ns -> cycle conversion plus the `delay` loop it would be much easier to reason about.
Done
Shouldn't this be `R0>>4`?
Done
MOVWU backoff_ns+0(FP), R0Please add comments throughout to explain what is going on. The blog post provides psuedocode, but if the blog post disappears this code will be quite confusing on its own, particularly because it contains several heuristics.
Done
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Run-TryBot | +1 |
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Michael KnyszekTryBots beginning. Status page: https://farmer.golang.org/try?commit=1b38e926
Acknowledged
Removed Run-TryBot+1 by Fannie Zhang <Fannie...@arm.com>
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
All done. Please help to review it, thanks.
When profiling CPU usage LiveKit on AArch64/x86 (AWS), the graphs showSince you seem to have a benchmark or production workload to measure, we would appreciate any results you can share on https://go.dev/issue/73581 on the impact of the Green Tea GC. Particularly after CL 700496 (which is only on tip, not in 1.25).
I am working on it. Will update results later.
// the counter-timer. Note that we do not depend on the latency of the ISB instruction.I think what you mean here is that we use a real counter to measure the delay rather than assuming a fixed delay from ISB, but out of context this sentence is confusing because the code below does hard-code an assumed 18ns ISB delay. Please expand the comment to clarify.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
func (head *lfstack) pop() unsafe.Pointer {Fannie ZhangWhy is pop moving from where it was? It seems like the current location below push is the logical one.
It was a mistake. Moved it back to its original location. Done.
Done
// Use a backoff approach to reduce demand to the shared memory location
// decreases memory contention and allows for other threads to make quicker
// progress.
// Read more in this Arm blog post:
// https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm
procyield(backoff)
// Increase backoff time.
backoff += backoff >> 1the idea that we should back off instead of spinning honestly doesn't seem to be that arm64 specific.
I argue that we should make the initial value of backoff dependent on the platform. maybe to start with, we can set it to be zero on all platforms except arm64? and skip the procyield if it's zero? that seems a little bit more general.
lastly, we should add a TODO to tweak this number on other platforms besides arm64.
Do you mean the modification as below:
```
var backoff uint32 = 0
if GOARCH == "arm64" {
backoff = 128
}
...
procyield(backoff)
backoff += backoff / 2
```
However, based on this modification, the behavior on other architectures may change. On amd64, the procyield is implemented as follows, which runs PAUSE for delay even if the value of cycles is 0.
```
TEXT runtime·procyield(SB),NOSPLIT,$0-0
MOVL cycles+0(FP), AX
again:
PAUSE
SUBL $1, AX
JNZ again
RET
```
If I miss something please correct me. Thanks.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Nice work and thank you Fannie. Backoff is a great remedy even for reducing
contention of even high level processes. However, without jitter (randomization of a couple of units of time) every single contenting process will retry with the
exact same backoff which can then cause more contention in lockstep, hence I suggest that we add some jitter randomized and bounded, for every single retry.
Perhaps can we add to the subject “using backoff and ISB instruction”
backoff += backoff >> 1This backoff jitter (randomization of a few nanoseconds) of which that is to prevent lockstep retries if all similar processes retry at the exact same time, to help further reduce the contention. With the current code, every one of the contending processes will repeat with the same backoff like this:
n=1; backoff is 128 + 128/2 =192
n=2; backoff is 192 + 192/2 =288
n=3; backoff is 288 + 288/2 =432
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
backoff += backoff >> 1This backoff jitter (randomization of a few nanoseconds) of which that is to prevent lockstep retries if all similar processes retry at the exact same time, to help further reduce the contention. With the current code, every one of the contending processes will repeat with the same backoff like this:
n=1; backoff is 128 + 128/2 =192
n=2; backoff is 192 + 192/2 =288
n=3; backoff is 288 + 288/2 =432
at the scale of a couple hundred nanoseconds of backoff, computing a pseudorandom jitter might be more trouble than it's worth. the longer you take to decide how to wait, the more time you're spending burning CPU time instead of getting into that low power state. there's also likely some amount of jitter in the form of the current state of the CPU -- procyield is not going to execute in the same amount of cycles every single time.
I'm not sure what the trade-off is, but I don't think we need to complicate the assembly unless we see a real issue or a real improvement from adding jitter. (so, maybe worth trying, but IMO this shouldn't be a blocker.)
yeah, that's pretty much what I meant.
I think it's OK to change the behavior slightly on other platforms. we see contention problems there, too. might just be worth mentioning in the commit message.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Nice work and thank you Fannie. Backoff is a great remedy even for reducing
contention of even high level processes. However, without jitter (randomization of a couple of units of time) every single contenting process will retry with the
exact same backoff which can then cause more contention in lockstep, hence I suggest that we add some jitter randomized and bounded, for every single retry.
Thanks for the suggestion. I agree with Michael Knyszek's idea and don't want to make the implementation too complicated.
Perhaps can we add to the subject “using backoff and ISB instruction”
Done
// Use a backoff approach to reduce demand to the shared memory location
// decreases memory contention and allows for other threads to make quicker
// progress.
// Read more in this Arm blog post:
// https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm
procyield(backoff)
// Increase backoff time.
backoff += backoff >> 1Fannie Zhangthe idea that we should back off instead of spinning honestly doesn't seem to be that arm64 specific.
I argue that we should make the initial value of backoff dependent on the platform. maybe to start with, we can set it to be zero on all platforms except arm64? and skip the procyield if it's zero? that seems a little bit more general.
lastly, we should add a TODO to tweak this number on other platforms besides arm64.
Michael KnyszekDo you mean the modification as below:
```
var backoff uint32 = 0
if GOARCH == "arm64" {
backoff = 128
}
...
procyield(backoff)
backoff += backoff / 2
```
However, based on this modification, the behavior on other architectures may change. On amd64, the procyield is implemented as follows, which runs PAUSE for delay even if the value of cycles is 0.
```
TEXT runtime·procyield(SB),NOSPLIT,$0-0
MOVL cycles+0(FP), AX
again:
PAUSE
SUBL $1, AX
JNZ again
RET
```If I miss something please correct me. Thanks.
yeah, that's pretty much what I meant.
I think it's OK to change the behavior slightly on other platforms. we see contention problems there, too. might just be worth mentioning in the commit message.
Got it. Done. Thanks.
Michael KnyszekThis backoff jitter (randomization of a few nanoseconds) of which that is to prevent lockstep retries if all similar processes retry at the exact same time, to help further reduce the contention. With the current code, every one of the contending processes will repeat with the same backoff like this:
n=1; backoff is 128 + 128/2 =192
n=2; backoff is 192 + 192/2 =288
n=3; backoff is 288 + 288/2 =432
at the scale of a couple hundred nanoseconds of backoff, computing a pseudorandom jitter might be more trouble than it's worth. the longer you take to decide how to wait, the more time you're spending burning CPU time instead of getting into that low power state. there's also likely some amount of jitter in the form of the current state of the CPU -- procyield is not going to execute in the same amount of cycles every single time.
I'm not sure what the trade-off is, but I don't think we need to complicate the assembly unless we see a real issue or a real improvement from adding jitter. (so, maybe worth trying, but IMO this shouldn't be a blocker.)
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Fannie ZhangNice work and thank you Fannie. Backoff is a great remedy even for reducing
contention of even high level processes. However, without jitter (randomization of a couple of units of time) every single contenting process will retry with the
exact same backoff which can then cause more contention in lockstep, hence I suggest that we add some jitter randomized and bounded, for every single retry.
Thanks for the suggestion. I agree with Michael Knyszek's idea and don't want to make the implementation too complicated.
And it seems orthogonal to our backoff solution. It could be implemented afterwards in a separate commit. But first such a randomization feature needs to be benchmarked and its actual benefits measured. Thanks.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
looks good overall -- one more comment for consistency, but that's all I've got. thank you!
if goarch.IsArm64 == 1 {
// Use a backoff approach to reduce demand to the shared memory location
// decreases memory contention and allows for other threads to make quicker
// progress.
// Read more in this Arm blog post:
// https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm
procyield(backoff)
// Increase backoff time.
backoff += backoff >> 1
}could you please also make this platform agnostic in the same way you did for lfstack?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Run-TryBot | +1 |
// Use a backoff approach to reduce demand to the shared memory location
// decreases memory contention and allows for other threads to make quicker
// progress.
// Read more in this Arm blog post:
// https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm
procyield(backoff)
// Increase backoff time.
backoff += backoff >> 1
}could you please also make this platform agnostic in the same way you did for lfstack?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
looks good overall -- one more comment for consistency, but that's all I've got. thank you!
Thanks for the comments.
if goarch.IsArm64 == 1 {
// Use a backoff approach to reduce demand to the shared memory location
// decreases memory contention and allows for other threads to make quicker
// progress.
// Read more in this Arm blog post:
// https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm
procyield(backoff)
// Increase backoff time.
backoff += backoff >> 1
}Fannie Zhangcould you please also make this platform agnostic in the same way you did for lfstack?
Of course, I forgot to update it. Thanks.
Done
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Removed Run-TryBot+1 by Fannie Zhang <Fannie...@arm.com>
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Michael KnyszekTryBots beginning. Status page: https://farmer.golang.org/try?commit=fa722027
Acknowledged
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
FYI, don't use Run-TryBot+1 anymore. it's not going to do anything. it's the legacy trybots, and it's only useful if used in conjunction with the last few builders that haven't been ported over to the new system yet. use Commit-Queue+1 instead.
| Commit-Queue | +1 |
Removed Run-TryBot+1 by Fannie Zhang <Fannie...@arm.com>
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
it seems the builders are timing out running make.bash. I can reproduce this locally. I'm not yet sure what the problem is, but there's probably a bug here where we're getting stuck in an infinite loop.
AFAICT, the problem is that procyield(0) is invalid on amd64. :( the procyield assembly looks like it will loop infinitely if passed 0.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
it seems the builders are timing out running make.bash. I can reproduce this locally. I'm not yet sure what the problem is, but there's probably a bug here where we're getting stuck in an infinite loop.
AFAICT, the problem is that procyield(0) is invalid on amd64. :( the procyield assembly looks like it will loop infinitely if passed 0.
Yes, that's why it fails. So we need to add a check to see if cyclys is equal to 0, and return if it is.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Fannie Zhangit seems the builders are timing out running make.bash. I can reproduce this locally. I'm not yet sure what the problem is, but there's probably a bug here where we're getting stuck in an infinite loop.
AFAICT, the problem is that procyield(0) is invalid on amd64. :( the procyield assembly looks like it will loop infinitely if passed 0.
Yes, that's why it fails. So we need to add a check to see if cyclys is equal to 0, and return if it is.
maybe we should just fix procyield on amd64? it's bizarre that the behavior at 0 is just to loop forever 😄
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
FYI, don't use Run-TryBot+1 anymore. it's not going to do anything. it's the legacy trybots, and it's only useful if used in conjunction with the last few builders that haven't been ported over to the new system yet. use Commit-Queue+1 instead.
Got it. Thanks.
Fannie Zhangit seems the builders are timing out running make.bash. I can reproduce this locally. I'm not yet sure what the problem is, but there's probably a bug here where we're getting stuck in an infinite loop.
AFAICT, the problem is that procyield(0) is invalid on amd64. :( the procyield assembly looks like it will loop infinitely if passed 0.
Michael KnyszekYes, that's why it fails. So we need to add a check to see if cyclys is equal to 0, and return if it is.
maybe we should just fix procyield on amd64? it's bizarre that the behavior at 0 is just to loop forever 😄
In addition to amd64, ppc64, arm and 386 also have the same behavior. So I wonder if we can change the initial value of backoff to 1 to avoid modifying the implementation of multiple procyeield?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Fannie Zhangit seems the builders are timing out running make.bash. I can reproduce this locally. I'm not yet sure what the problem is, but there's probably a bug here where we're getting stuck in an infinite loop.
AFAICT, the problem is that procyield(0) is invalid on amd64. :( the procyield assembly looks like it will loop infinitely if passed 0.
Michael KnyszekYes, that's why it fails. So we need to add a check to see if cyclys is equal to 0, and return if it is.
Fannie Zhangmaybe we should just fix procyield on amd64? it's bizarre that the behavior at 0 is just to loop forever 😄
In addition to amd64, ppc64, arm and 386 also have the same behavior. So I wonder if we can change the initial value of backoff to 1 to avoid modifying the implementation of multiple procyeield?
ugh. I'll just send a change to fix that. frankly, it's a pretty awful pitfall of the API.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Fannie Zhangit seems the builders are timing out running make.bash. I can reproduce this locally. I'm not yet sure what the problem is, but there's probably a bug here where we're getting stuck in an infinite loop.
AFAICT, the problem is that procyield(0) is invalid on amd64. :( the procyield assembly looks like it will loop infinitely if passed 0.
Michael KnyszekYes, that's why it fails. So we need to add a check to see if cyclys is equal to 0, and return if it is.
Fannie Zhangmaybe we should just fix procyield on amd64? it's bizarre that the behavior at 0 is just to loop forever 😄
Michael KnyszekIn addition to amd64, ppc64, arm and 386 also have the same behavior. So I wonder if we can change the initial value of backoff to 1 to avoid modifying the implementation of multiple procyeield?
ugh. I'll just send a change to fix that. frankly, it's a pretty awful pitfall of the API.
or better yet, maybe we rename procyield to procyield0 and procyield becomes a Go function that checks for <= 0 and skips the call. that will compile down to nothing with a constant 0.
Fannie Zhangit seems the builders are timing out running make.bash. I can reproduce this locally. I'm not yet sure what the problem is, but there's probably a bug here where we're getting stuck in an infinite loop.
AFAICT, the problem is that procyield(0) is invalid on amd64. :( the procyield assembly looks like it will loop infinitely if passed 0.
Michael KnyszekYes, that's why it fails. So we need to add a check to see if cyclys is equal to 0, and return if it is.
Fannie Zhangmaybe we should just fix procyield on amd64? it's bizarre that the behavior at 0 is just to loop forever 😄
Michael KnyszekIn addition to amd64, ppc64, arm and 386 also have the same behavior. So I wonder if we can change the initial value of backoff to 1 to avoid modifying the implementation of multiple procyeield?
Michael Knyszekugh. I'll just send a change to fix that. frankly, it's a pretty awful pitfall of the API.
or better yet, maybe we rename procyield to procyield0 and procyield becomes a Go function that checks for <= 0 and skips the call. that will compile down to nothing with a constant 0.
I landed a change to both: make procyield stop looping infinitely, and to wrap procyield in a Go function. try rebasing!
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Commit-Queue | +1 |
Fannie Zhangit seems the builders are timing out running make.bash. I can reproduce this locally. I'm not yet sure what the problem is, but there's probably a bug here where we're getting stuck in an infinite loop.
AFAICT, the problem is that procyield(0) is invalid on amd64. :( the procyield assembly looks like it will loop infinitely if passed 0.
Michael KnyszekYes, that's why it fails. So we need to add a check to see if cyclys is equal to 0, and return if it is.
Fannie Zhangmaybe we should just fix procyield on amd64? it's bizarre that the behavior at 0 is just to loop forever 😄
Michael KnyszekIn addition to amd64, ppc64, arm and 386 also have the same behavior. So I wonder if we can change the initial value of backoff to 1 to avoid modifying the implementation of multiple procyeield?
Michael Knyszekugh. I'll just send a change to fix that. frankly, it's a pretty awful pitfall of the API.
Michael Knyszekor better yet, maybe we rename procyield to procyield0 and procyield becomes a Go function that checks for <= 0 and skips the call. that will compile down to nothing with a constant 0.
I landed a change to both: make procyield stop looping infinitely, and to wrap procyield in a Go function. try rebasing!
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
thanks!
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
runtime: use backoff and ISB instruction to reduce contention in (*lfstack).pop and (*spanSet).pop on arm64
When profiling CPU usage LiveKit on AArch64/x86 (AWS), the graphs show
CPU spikes that was repeating in a semi-periodic manner and spikes occur
when the GC(garbage collector) is active.
Our analysis found that the getempty function accounted for 10.54% of the
overhead, which was mainly caused by the work.empty.pop() function. And
listing pop shows that the majority of the time, with a 10.29% overhead,
is spent on atomic.Cas64((*uint64)(head), old, next).
This patch adds a backoff approach to reduce the high overhead of the
atomic operation primarily occurs when contention over a specific memory
address increases, typically with the rise in the number of threads.
Note that on paltforms other than arm64, the initial value of backoff is zero.
This patch rewrites the implementation of procyield() on arm64, which is an
Armv8.0-A compatible delay function using the counter-timer.
The garbage collector benchmark:
│ master │ opt │
│ sec/op │ sec/op vs base │
Garbage/benchmem-MB=64-160 3.782m ± 4% 2.264m ± 2% -40.12% (p=0.000 n=10)
│ user+sys-sec/op │ user+sys-sec/op vs base │
Garbage/benchmem-MB=64-160 433.5m ± 4% 255.4m ± 2% -41.08% (p=0.000 n=10)
Reference for backoff mechianism:
https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |