// Make sure we have permission to use it, or write a new one.
I've reached out on that issue to get permission to use this code.
Done
Nick Ripley: Fixing the records after the fact looks tricky. It's hard to hide fully from the user (see line comments), and it seems like it'll result in repeating a lot of the work that the original change was trying to avoid (skip/inline calculations on every read, though not on every event). Plus allocating a big map on each read.
Is there a path forward where an mprof bucket could serve as a forwarding record to another bucket? That would involve doing the stack skip / inlining calculations once (at the time of creating the bucket corresponding to the non-deduped stack). It might double the memory footprint of block/mutex profiles .. though in my experience they're usually dwarfed by the size of the heap profile when all are enabled.
Rhys Hiltner:
it seems like it'll result in repeating a lot of the work that the original change was trying to avoid
Indeed, I don't see a way forward without losing a lot of the gain from frame pointer unwinding.
Plus allocating a big map on each read.
Yeah, I chose the implementation as of PS3 for simplicity, but the added memory overhead is unfortunate. We could do it without additional memory overhead by sorting the records (lexicographically by stack) in place and merging consecutive records with identical stacks. I opted not to do this for a first implementation because we can't use an existing sort implementation here, and I didn't trust myself to write one without subtle bugs. But I could give that a go if it sounds better.
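(For illustration only: outside the runtime, where package sort is available, the sort-and-merge idea might look roughly like the sketch below; `dedupSorted` is a hypothetical name. The runtime itself can't import sort, which is the complication mentioned above.)
```
import (
	"runtime"
	"sort"
)

// dedupSorted sorts records lexicographically by stack and merges
// consecutive records with identical stacks, in place.
func dedupSorted(records []runtime.BlockProfileRecord) []runtime.BlockProfileRecord {
	sort.Slice(records, func(i, j int) bool {
		a, b := &records[i].Stack0, &records[j].Stack0
		for k := range a {
			if a[k] != b[k] {
				return a[k] < b[k]
			}
		}
		return false
	})
	out := records[:0]
	for _, r := range records {
		if n := len(out); n > 0 && out[n-1].Stack0 == r.Stack0 {
			// Same stack as the previous record: fold the sample in.
			out[n-1].Count += r.Count
			out[n-1].Cycles += r.Cycles
			continue
		}
		out = append(out, r)
	}
	return out
}
```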
Is there a path forward where an mprof bucket could serve as a forwarding record to another bucket?
I could see something like that working. This also brings up the point that this issue affects the runtime/pprof output as well. That is, we can get mutex profiles where there are multiple records with the same call stack. This can already happen for the heap profile due to removing runtime frames during reporting, so in theory pprof consumers already have to deal with this. But if we consider that a regression as well then we'll need to address it here. And if so, it'd probably be better to handle aggregation/deduplication in one place than to make each mutex/block profile consumer do it.
Talking through this has me reconsidering what the best fix is. Given that there doesn't seem to be a way forward without _some_ performance hit, perhaps the best options are:
- Don't use frame pointer unwinding for now, i.e. revert CL 533258
- Still use frame pointer unwinding to collect the stack, but handle skipping & inline expansion at sample time.
The second option would lose most of the performance benefit of frame pointer unwinding, but we'd still save _some_ work and come out ahead of 1.22 perf-wise. Thoughts? cc @cher...@google.com as a runtime maintainer
I was proposing what I see as a third option, maybe it didn't come across. What if we handle skipping and inline expansion at sample time, but only the first time we see a particular stack (and create the mprof bucket)?
Previously (before CL 533258?) we had one `runtime.bucket` for each profile entry. Now, we sometimes (#67548) have multiple `runtime.bucket` values that end up with identical stacks in the resulting profile. I'm proposing that we consider having two kinds of `runtime.bucket` values: one with an expanded/skipped stack (which shows up in the profile), and one that holds the unprocessed stack that we get from frame pointer unwinding. The second kind of bucket would forward to the first (`type bucket struct { ... ; forwardTo *bucket ; ... }`), and be filtered out when exporting the profile to the user.
We'd still use frame pointer unwinding to collect the stack. We'd look up the entry in the mprof hash table, we'd see that the `runtime.(*bucket).forwardTo` field was set, we'd follow that pointer and add the samples to that other bucket. In effect, it's a cache of the expansion/skipping work. The main runtime cost would be the memory due to (nearly) doubling the number of bucket values.
But in my experience the mutex/block profiles are usually small relative to heap profiles, so I estimate that as a low / acceptable cost. Do your data and experience match that?
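(For concreteness, a sketch of what that forwarding scheme might look like. The types here are heavily simplified -- the real `runtime.bucket` layout is different -- and `expandAndSkip`/`findOrCreate` are hypothetical stand-ins for the inline-expansion work and the mprof hash-table lookup.)
```
import "slices"

// bucket is a simplified stand-in for runtime.bucket. A bucket holding a raw
// frame-pointer stack forwards to the canonical bucket holding the
// expanded/skipped stack; canonical buckets have forwardTo == nil and are the
// only ones exported to the profile.
type bucket struct {
	stk       []uintptr
	count     int64
	cycles    int64
	forwardTo *bucket
}

var canonical []*bucket // stand-in for the mprof hash table

// expandAndSkip stands in for the one-time skip/inline-expansion work.
func expandAndSkip(stk []uintptr) []uintptr { return stk }

// findOrCreate stands in for looking up the expanded stack in the hash table.
func findOrCreate(stk []uintptr) *bucket {
	for _, b := range canonical {
		if slices.Equal(b.stk, stk) {
			return b
		}
	}
	b := &bucket{stk: stk}
	canonical = append(canonical, b)
	return b
}

// record adds a sample to a raw-stack bucket. The expansion cost is paid only
// on the first sample for a given raw stack; later samples follow the cached
// forwarding pointer.
func record(raw *bucket, count, cycles int64) {
	if raw.forwardTo == nil {
		raw.forwardTo = findOrCreate(expandAndSkip(raw.stk))
	}
	raw.forwardTo.count += count
	raw.forwardTo.cycles += cycles
}
```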
Yeah, I was writing my comment and was about to propose something similar. What if we do expansion only up to the skip count? That is, we still use the frame pointer unwinder, which is cheaper than the full unwinding. Then we expand just a few frames to cover the skip count, leave the rest of the PCs unexpanded, and record the partially expanded stack to buckets. At profile read time, we expand the rest of the PCs. (We probably don't need two types of buckets.)
This will be more expensive than just the frame pointer unwinding. But skip counts are usually pretty small, so we don't need to expand many frames. And if it matters we probably can memoize the expansion result somehow. (A sketch of this follows below.)
Could we try this and see if the performance is acceptable?
Thanks.
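(A rough sketch of that partial-expansion idea; `fpUnwind`, `expandInline`, and `recordToBucket` are hypothetical stand-ins, not real runtime functions.)
```
// fpUnwind stands in for the cheap frame-pointer stack walk (physical frames
// only; inlined frames are not expanded).
func fpUnwind() []uintptr { return nil }

// expandInline stands in for expanding one physical PC into its logical
// (inline-aware) frames.
func expandInline(pc uintptr) []uintptr { return []uintptr{pc} }

// recordToBucket stands in for the mprof bucket update.
func recordToBucket(stk []uintptr, cycles int64) {}

// saveEvent expands only enough leading frames to apply skip; the tail stays
// as raw PCs and is expanded later, at profile read time.
func saveEvent(skip int, cycles int64) {
	pcs := fpUnwind()
	var head []uintptr
	i := 0
	for len(head) <= skip && i < len(pcs) {
		head = append(head, expandInline(pcs[i])...)
		i++
	}
	// Drop the skipped logical frames and keep the unexpanded tail.
	stk := append(head[min(skip, len(head)):], pcs[i:]...)
	recordToBucket(stk, cycles)
}
```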
stackHash := func(stk []uintptr) uintptr {
var h uintptr
for _, pc := range stk {
h += pc
h += h << 10
h ^= h >> 6
}
// finalize
h += h << 3
h ^= h >> 11
return h
}
Since we're operating on BlockProfileRecord, we can just hash the Stack0 array and use a Go map.
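(A minimal sketch of that suggestion: `Stack0` is a fixed-size `[32]uintptr` array, so it's comparable and can be used directly as a Go map key, with no manual hashing; `dedupByStack` is a hypothetical name.)
```
import "runtime"

// dedupByStack merges records that share an identical Stack0, keyed by the
// comparable [32]uintptr array itself.
func dedupByStack(records []runtime.BlockProfileRecord) []runtime.BlockProfileRecord {
	index := make(map[[32]uintptr]int, len(records)) // stack -> position in out
	out := records[:0]
	for _, r := range records {
		if i, ok := index[r.Stack0]; ok {
			out[i].Count += r.Count
			out[i].Cycles += r.Cycles
			continue
		}
		index[r.Stack0] = len(out)
		out = append(out, r)
	}
	return out
}
```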
// I don't see any tests for this behavior
Rhys Hiltner: Good question. I think it's okay because the caller can't assume that the returned `n` stays the same between calls.
However, it's possible to imagine a caller that makes the assumption that `n` is not decreasing between calls for this profile type.
So the most-compatible thing to do here is to remove the `size` argument from `mutexProfileInternal` (so it always invokes its callbacks) and handle the `n` counting case inside of this function, including the aggregation when needed.
But I'd wait for feedback from the maintainers. I'm a bit uncertain how far to interpret Hyrum's law given the new [compatibility focus](https://go.dev/blog/compat).
Another non-maintainer view:
Most profiles say "If len(p) < n, [FooProfile] does not change p and returns n, false."; this one says "If len(p) >= n, MutexProfile copies the profile into p and returns n, true."
I'm not sure how to interpret those when "n" isn't a single value (at some point in time). PS3 looks like its behavior could be:
```
MutexProfile(buf[:100]) == (1, true)
MutexProfile(buf[:1]) == (3, false)
MutexProfile(buf[:2]) == (3, false)
MutexProfile(buf[:3]) == (1, true)
```
The docs guarantee that there won't be changes to the slice if ok==false, which I think will cause trouble for Felix's suggestion of doing the count inside this function -- hard to deduplicate when we're not allowed to use the user-allocated memory as scratch space.
The docs don't make guarantees about not touching the elements in `p[n:len(p)]`. As of PS3, this will overwrite some of them (and leave those records in place after it returns).
Interesting questions.
I'm not too worried about writing to `p[n:len(p)]`. Since it is possible for all the elements in p to be written, the caller cannot keep any useful data there.
The non-monotonicity is unfortunate and more concerning. Most uses are probably still fine; the count is essentially only useful for allocating the buffer (see the grow-and-retry sketch below).
As Rhys mentioned, it would be hard to dedup inside mutexProfileInternal, since if ok==false we don't want to write to p.
Maybe this is okay? (But see the other discussion.)
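(For reference, the count is typically consumed by a grow-and-retry loop like the one below, similar to what runtime/pprof does; the retry still terminates as long as some call eventually returns ok == true.)
```
// Standard grow-and-retry pattern for the profile-reading APIs. The extra
// headroom (+50 here, an arbitrary choice) absorbs a profile that grows
// between the size query and the copy.
var p []runtime.BlockProfileRecord
n, ok := runtime.MutexProfile(nil)
for !ok {
	p = make([]runtime.BlockProfileRecord, n+50)
	n, ok = runtime.MutexProfile(p)
}
p = p[:n]
```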
})
Maybe mention that we don't need to dedup here, as the pprof package dedups.
func TestMutexDeduplication(t *testing.T) {
Nick Ripley: Do we also need this test and logic for block profiles? This could be a separate CL, I'm just curious.
Rhys Hiltner: I looked through the existing call sites leading to `blockevent` and I didn't see any places where it looked like there were multiple ways to reach it (as is the case for `semrelease` for mutex unlocking). That's not _proof_ it can't happen, of course. We could do deduplication for the block profiler defensively. But perhaps absent a clear case where we can get duplicates, we should avoid the cost of deduplication?
Nick Ripley: Is there a test that will break as soon as someone adds a `blockevent` call that violates that invariant? Absent that, it doesn't seem like a safe / future-proof assumption to make.
There isn't such a test, so you're right, the only safe thing to do is handle this for the block profiler as well.
It would be good to add a test (can be a separate CL). We could do actual dedup logic later. Thanks.