// Make sure we have permission to use it, or write a new one.
I've reached out on that issue to get permission to use this code.
Done
Nick Ripley: Fixing the records after the fact looks tricky. It's hard to hide fully from the user (see line comments), and it seems like it'll result in repeating a lot of the work that the original change was trying to avoid (skip/inline calculations on every read, though not on every event). Plus allocating a big map on each read.
Is there a path forward where an mprof bucket could serve as a forwarding record to another bucket? That would involve doing the stack skip / inlining calculations once (at the time of creating the bucket corresponding to the non-deduped stack). It might double the memory footprint of block/mutex profiles .. though in my experience they're usually dwarfed by the size of the heap profile when all are enabled.
Rhys Hiltner:
it seems like it'll result in repeating a lot of the work that the original change was trying to avoid
Indeed, I don't see a way forward without losing a lot of the gain from frame pointer unwinding.
Plus allocating a big map on each read.
Yeah, I chose the implementation as of PS3 for simplicity, but the added memory overhead is unfortunate. We could do it without additional memory overhead by sorting the records (lexicographically by stack) in place and merging consecutive records with identical stacks. I opted not to do this for a first implementation because we can't use an existing sort implementation here, and I didn't trust myself to write one without subtle bugs. But I could give that a go if it sounds better.
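(For illustration only: outside the runtime, where package sort is available, the sort-and-merge idea might look roughly like the sketch below; `dedupSorted` is a hypothetical name. The runtime itself can't import sort, which is the complication mentioned above.)
```
import (
	"runtime"
	"sort"
)

// dedupSorted sorts records lexicographically by stack and merges
// consecutive records with identical stacks, in place.
func dedupSorted(records []runtime.BlockProfileRecord) []runtime.BlockProfileRecord {
	sort.Slice(records, func(i, j int) bool {
		a, b := &records[i].Stack0, &records[j].Stack0
		for k := range a {
			if a[k] != b[k] {
				return a[k] < b[k]
			}
		}
		return false
	})
	out := records[:0]
	for _, r := range records {
		if n := len(out); n > 0 && out[n-1].Stack0 == r.Stack0 {
			// Same stack as the previous record: fold the sample in.
			out[n-1].Count += r.Count
			out[n-1].Cycles += r.Cycles
			continue
		}
		out = append(out, r)
	}
	return out
}
```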
Is there a path forward where an mprof bucket could serve as a forwarding record to another bucket?
I could see something like that working. This also brings up the point that this issue affects the runtime/pprof output as well. That is, we can get mutex profiles where there are multiple records with the same call stack. This can already happen for the heap profile due to removing runtime frames during reporting, so in theory pprof consumers already have to deal with this. But if we consider that a regression as well then we'll need to address it here. And if so, it'd probably be better to handle aggregation/deduplication in one place than to make each mutex/block profile consumer do it.
Talking through this has me reconsidering what the best fix is. Given that there doesn't seem to be a way forward without _some_ performance hit, perhaps the best options are:
- Don't use frame pointer unwinding for now, i.e. revert CL 533258
- Still use frame pointer unwinding to collect the stack, but handle skipping & inline expansion at sample time.
The second option would lose most of the performance benefit of frame pointer unwinding, but we'd still save _some_ work and come out ahead of 1.22 perf-wise. Thoughts? cc @cher...@google.com as a runtime maintainer
I was proposing what I see as a third option, maybe it didn't come across. What if we handle skipping and inline expansion at sample time, but only the first time we see a particular stack (and create the mprof bucket)?
Previously (before CL 533258?) we had one `runtime.bucket` for each profile entry. Now, we sometimes (#67548) have multiple `runtime.bucket` values that end up with identical stacks in the resulting profile. I'm proposing that we consider having two kinds of `runtime.bucket` values: one with an expanded/skipped stack (which shows up in the profile), and one that holds the unprocessed stack that we get from frame pointer unwinding. The second kind of bucket would forward to the first (`type bucket struct { ... ; forwardTo *bucket ; ... }`), and be filtered out when exporting the profile to the user.
We'd still use frame pointer unwinding to collect the stack. We'd look up the entry in the mprof hash table, we'd see that the `runtime.(*bucket).forwardTo` field was set, we'd follow that pointer and add the samples to that other bucket. In effect, it's a cache of the expansion/skipping work. The main runtime cost would be the memory due to (nearly) doubling the number of bucket values.
But in my experience the mutex/block profiles are usually small relative to heap profiles, so I estimate that as a low / acceptable cost. Do your data and experience match that?
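(For concreteness, a sketch of what that forwarding scheme might look like. The types here are heavily simplified -- the real `runtime.bucket` layout is different -- and `expandAndSkip`/`findOrCreate` are hypothetical stand-ins for the inline-expansion work and the mprof hash-table lookup.)
```
import "slices"

// bucket is a simplified stand-in for runtime.bucket. A bucket holding a raw
// frame-pointer stack forwards to the canonical bucket holding the
// expanded/skipped stack; canonical buckets have forwardTo == nil and are the
// only ones exported to the profile.
type bucket struct {
	stk       []uintptr
	count     int64
	cycles    int64
	forwardTo *bucket
}

var canonical []*bucket // stand-in for the mprof hash table

// expandAndSkip stands in for the one-time skip/inline-expansion work.
func expandAndSkip(stk []uintptr) []uintptr { return stk }

// findOrCreate stands in for looking up the expanded stack in the hash table.
func findOrCreate(stk []uintptr) *bucket {
	for _, b := range canonical {
		if slices.Equal(b.stk, stk) {
			return b
		}
	}
	b := &bucket{stk: stk}
	canonical = append(canonical, b)
	return b
}

// record adds a sample to a raw-stack bucket. The expansion cost is paid only
// on the first sample for a given raw stack; later samples follow the cached
// forwarding pointer.
func record(raw *bucket, count, cycles int64) {
	if raw.forwardTo == nil {
		raw.forwardTo = findOrCreate(expandAndSkip(raw.stk))
	}
	raw.forwardTo.count += count
	raw.forwardTo.cycles += cycles
}
```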
Yeah, I was writing my comment and was about to propose something similar. What if we do expansion only up to the skip count? That is, we still use the frame pointer unwinder, which is cheaper than the full unwinding. Then we expand just a few frames to cover the skip count, leave the rest of the PCs unexpanded, and record the partially expanded stack to buckets. At profile read time, we expand the rest of the PCs. (We probably don't need two types of buckets.)
This will be more expensive than just the frame pointer unwinding. But skip counts are usually pretty small, so we don't need to expand many frames. And if it matters we probably can memoize the expansion result somehow. (A sketch of this follows below.)
Could we try this and see if the performance is acceptable?
Thanks.
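(A rough sketch of that partial-expansion idea; `fpUnwind`, `expandInline`, and `recordToBucket` are hypothetical stand-ins, not real runtime functions.)
```
// fpUnwind stands in for the cheap frame-pointer stack walk (physical frames
// only; inlined frames are not expanded).
func fpUnwind() []uintptr { return nil }

// expandInline stands in for expanding one physical PC into its logical
// (inline-aware) frames.
func expandInline(pc uintptr) []uintptr { return []uintptr{pc} }

// recordToBucket stands in for the mprof bucket update.
func recordToBucket(stk []uintptr, cycles int64) {}

// saveEvent expands only enough leading frames to apply skip; the tail stays
// as raw PCs and is expanded later, at profile read time.
func saveEvent(skip int, cycles int64) {
	pcs := fpUnwind()
	var head []uintptr
	i := 0
	for len(head) <= skip && i < len(pcs) {
		head = append(head, expandInline(pcs[i])...)
		i++
	}
	// Drop the skipped logical frames and keep the unexpanded tail.
	stk := append(head[min(skip, len(head)):], pcs[i:]...)
	recordToBucket(stk, cycles)
}
```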
stackHash := func(stk []uintptr) uintptr {
var h uintptr
for _, pc := range stk {
h += pc
h += h << 10
h ^= h >> 6
}
// finalize
h += h << 3
h ^= h >> 11
return h
}
Since we're operating on BlockProfileRecord, we can just hash the Stack0 array and use a Go map.
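(A minimal sketch of that suggestion: `Stack0` is a fixed-size `[32]uintptr` array, so it's comparable and can be used directly as a Go map key, with no manual hashing; `dedupByStack` is a hypothetical name.)
```
import "runtime"

// dedupByStack merges records that share an identical Stack0, keyed by the
// comparable [32]uintptr array itself.
func dedupByStack(records []runtime.BlockProfileRecord) []runtime.BlockProfileRecord {
	index := make(map[[32]uintptr]int, len(records)) // stack -> position in out
	out := records[:0]
	for _, r := range records {
		if i, ok := index[r.Stack0]; ok {
			out[i].Count += r.Count
			out[i].Cycles += r.Cycles
			continue
		}
		index[r.Stack0] = len(out)
		out = append(out, r)
	}
	return out
}
```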
// I don't see any tests for this behavior
Rhys Hiltner: Good question. I think it's okay because the caller can't assume that the returned `n` stays the same between calls.
However, it's possible to imagine a caller that makes the assumption that `n` is not decreasing between calls for this profile type.
So the most-compatible thing to do here is to remove the `size` argument from `mutexProfileInternal` (so it always invokes its callbacks) and handle the `n` counting case inside of this function, including the aggregation when needed.
But I'd wait for feedback from the maintainers. I'm a bit uncertain how far to interpret Hyrum's law given the new [compatibility focus](https://go.dev/blog/compat).
Another non-maintainer view:
Most profiles say "If len(p) < n, [FooProfile] does not change p and returns n, false."; this one says "If len(p) >= n, MutexProfile copies the profile into p and returns n, true."
I'm not sure how to interpret those when "n" isn't a single value (at some point in time). PS3 looks like its behavior could be:
```
MutexProfile(buf[:100]) == (1, true)
MutexProfile(buf[:1]) == (3, false)
MutexProfile(buf[:2]) == (3, false)
MutexProfile(buf[:3]) == (1, true)
```
The docs guarantee that there won't be changes to the slice if ok==false, which I think will cause trouble for Felix's suggestion of doing the count inside this function -- hard to deduplicate when we're not allowed to use the user-allocated memory as scratch space.
The docs don't make guarantees about not touching the elements in `p[n:len(p)]`. As of PS3, this will overwrite some of them (and leave those records in place after it returns).
Interesting questions.
I'm not too worried about writing to `p[n:len(p)]`. Since it is possible for all the elements in p to be written, the caller cannot keep any useful data there.
The non-monotonicity is unfortunate and more concerning. Most uses are probably still fine; the count is essentially only useful for allocating the buffer (see the grow-and-retry sketch below).
As Rhys mentioned, it would be hard to dedup inside mutexProfileInternal, since if ok==false we don't want to write to p.
Maybe this is okay? (But see the other discussion.)
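(For reference, the count is typically consumed by a grow-and-retry loop like the one below, similar to what runtime/pprof does; the retry still terminates as long as some call eventually returns ok == true.)
```
// Standard grow-and-retry pattern for the profile-reading APIs. The extra
// headroom (+50 here, an arbitrary choice) absorbs a profile that grows
// between the size query and the copy.
var p []runtime.BlockProfileRecord
n, ok := runtime.MutexProfile(nil)
for !ok {
	p = make([]runtime.BlockProfileRecord, n+50)
	n, ok = runtime.MutexProfile(p)
}
p = p[:n]
```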
})
Maybe mention that we don't need to dedup here, as the pprof package dedups.
func TestMutexDeduplication(t *testing.T) {
Nick Ripley: Do we also need this test and logic for block profiles? This could be a separate CL, I'm just curious.
Rhys Hiltner: I looked through the existing call sites leading to `blockevent` and I didn't see any places where it looked like there were multiple ways to reach it (as is the case for `semrelease` for mutex unlocking). That's not _proof_ it can't happen, of course. We could do deduplication for the block profiler defensively. But perhaps absent a clear case where we can get duplicates, we should avoid the cost of deduplication?
Nick Ripley: Is there a test that will break as soon as someone adds a `blockevent` call that violates that invariant? Absent that, it doesn't seem like a safe / future-proof assumption to make.
There isn't such a test, so you're right, the only safe thing to do is handle this for the block profiler as well.
It would be good to add a test (can be a separate CL). We could do actual dedup logic later. Thanks.