Re: [chromium-dev] Debugging bad Callbacks

468 views
Skip to first unread message

Daniel Cheng

unread,
May 22, 2024, 4:49:07 PMMay 22
to i...@chromium.org, Chromium-dev, scheduler-dev
(re-adding scheduler-dev@, sorry)

On Wed, 22 May 2024 at 13:48, Daniel Cheng <dch...@chromium.org> wrote:
I think you'll need to measure the overhead of doing that. I suppose it'd be cheaper (in terms of code size) than having the thunk invoker base::debug::Alias() the callable onto the stack though.

That all being said, for debugging this immediate bug, it seems like the animation builder could just take a base::Location and we could alias that onto the stack before invoking the on_abort_ callback. This would be pretty straightforward and non-controversial.

Daniel

On Wed, 22 May 2024 at 13:21, Ian Barkley-Yeung <i...@chromium.org> wrote:
We're trying to research a bug (https://g-issues.chromium.org/issues/335902543) where a OnceCallback has a bad pointer and is crashing.

The issue is that the call site doesn't tell us who owns the callback. It's one of those common places that many pieces of code hook into, so it could be any one of dozens of callbacks that has the problem. 

I've tried examining the stack variables using lldb, but unfortunately, I don't get any useful information about which function the callback will eventually resolve to. Minidumps don't save heap memory, so the contents of the BindState are not recorded.

First question: Does anyone have any ideas on how to debug this crash further? Is there a way I'm not seeing to figure out which piece of code has the issue with the invalid pointer?

Second question: Assuming no one has better ideas on debugging the problem.... I'm thinking of trying to improve the debuggability of these types of issues. (This isn't the first one I've seen.)

In particular, I'd like to move the RETURN_ADDRESS() macro from location.cc into a header, and then have base::BindOnce and base::BindRepeating store the results of RETURN_ADDRESS() into a const void * in BindStateBase, call it BindStateBase::creation_address_. Then, in base::OnceCallback::Run and base::RepeatingCallback::Run, copy the creation_address_ into a local variable and use base::debug::Alias to ensure it's visible on the stack in a minidump. It's annoying but I can then manually symbolize the creation address to the creation function.

Thoughts?

(Side note: Why not a full Location object? Because the normal trick of having a defaulted parameter Location loc = FROM_HERE at the end of a function's parameter list doesn't work with base::BindOnce / base::BindRepeating since they use variadic templates and I don't think you can have a defaulted parameter in a variadic template function. And I'm not willing to do an LSC to change every single call to base::BindOnce/base::BindRepeating to have a FROM_HERE at the beginning. If someone with more template knowledge than me knows of a clever way to accomplish this without requiring every call site to change, please let me know!)

--
--
Chromium Developers mailing list: chromi...@chromium.org
View archives, change email options, or unsubscribe:
http://groups.google.com/a/chromium.org/group/chromium-dev
---
You received this message because you are subscribed to the Google Groups "Chromium-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to chromium-dev...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/chromium-dev/16e70588-f302-4607-9c0c-9030e452df59n%40chromium.org.

Ian Barkley-Yeung

unread,
May 22, 2024, 5:08:50 PMMay 22
to Chromium-dev, Joe Mason, Chromium-dev, scheduler-dev, i...@chromium.org
Yes, I can add a "just for this crash" debugging parameter. 

My concern is that I've seen this type of issue -- there's a callback, it's bad in some way, we have no idea which callback it is -- before. Each time, we have to add "just for this crash" debugging code and then get it into a dev release and then wait until we get enough adoption of the new dev release before we can even start fixing the real bug, increasing the chances that we ship something buggy to the end user. Perhaps because I'm on a stability gardening rotation, I see more of the "we don't know who to assign this to" problems than other people do. Do other stability gardener types see similar issues?


On Wednesday, May 22, 2024 at 1:52:02 PM UTC-7 Joe Mason wrote:
BindStateBase is VERY performance (and size) sensitive, since so many callbacks are created and passed around. If there's really common debug info missing, it might be worth adding more, but it'll be a heavy lift.

For this particular crash, since you know that the callback is passed to SetOnAborted, you could save some debug info (eg. a base::debug::StackTrace and/or base::debug:TaskTrace) in a member variable alongside the callback, instead of augmenting the callback. Then use base::Alias on those member variables in the stack frame that calls "on_aborted_.Run()". 

On Wed, May 22, 2024 at 4:30 PM Joe Mason <joenot...@google.com> wrote:
+scheduler-dev, where the experts on task tracing hang out.

On Wed, May 22, 2024 at 4:21 PM Ian Barkley-Yeung <i...@chromium.org> wrote:
We're trying to research a bug (https://g-issues.chromium.org/issues/335902543) where a OnceCallback has a bad pointer and is crashing.

The issue is that the call site doesn't tell us who owns the callback. It's one of those common places that many pieces of code hook into, so it could be any one of dozens of callbacks that has the problem. 

I've tried examining the stack variables using lldb, but unfortunately, I don't get any useful information about which function the callback will eventually resolve to. Minidumps don't save heap memory, so the contents of the BindState are not recorded.

First question: Does anyone have any ideas on how to debug this crash further? Is there a way I'm not seeing to figure out which piece of code has the issue with the invalid pointer?

Second question: Assuming no one has better ideas on debugging the problem.... I'm thinking of trying to improve the debuggability of these types of issues. (This isn't the first one I've seen.)

In particular, I'd like to move the RETURN_ADDRESS() macro from location.cc into a header, and then have base::BindOnce and base::BindRepeating store the results of RETURN_ADDRESS() into a const void * in BindStateBase, call it BindStateBase::creation_address_. Then, in base::OnceCallback::Run and base::RepeatingCallback::Run, copy the creation_address_ into a local variable and use base::debug::Alias to ensure it's visible on the stack in a minidump. It's annoying but I can then manually symbolize the creation address to the creation function.

Thoughts?

(Side note: Why not a full Location object? Because the normal trick of having a defaulted parameter Location loc = FROM_HERE at the end of a function's parameter list doesn't work with base::BindOnce / base::BindRepeating since they use variadic templates and I don't think you can have a defaulted parameter in a variadic template function. And I'm not willing to do an LSC to change every single call to base::BindOnce/base::BindRepeating to have a FROM_HERE at the beginning. If someone with more template knowledge than me knows of a clever way to accomplish this without requiring every call site to change, please let me know!)

--
--
Chromium Developers mailing list: chromi...@chromium.org
View archives, change email options, or unsubscribe:
http://groups.google.com/a/chromium.org/group/chromium-dev
---
You received this message because you are subscribed to the Google Groups "Chromium-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to chromium-dev+unsubscribe@chromium.org.

Daniel Cheng

unread,
May 22, 2024, 5:13:05 PMMay 22
to Ian Barkley-Yeung, Chromium-dev, Joe Mason, scheduler-dev, memory-s...@chromium.org
I think this was less problematic before we implemented BRP, as we typically wouldn't crash before we invoked the bound functor. It's worth considering if there's some way to improve the debuggability of these reports, as I do agree it's not great. I'm just hoping we can do it without blowing up memory use or code size too much.

For example, if we somehow knew the concrete type of the object, would that help? Could we base::debug::Alias the remains of the object on the stack? I'm guessing even the vtable would be gone by the (currently irreversible) zapping though...

Daniel

To unsubscribe from this group and stop receiving emails from it, send an email to chromium-dev...@chromium.org.

--
You received this message because you are subscribed to the Google Groups "scheduler-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scheduler-de...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/scheduler-dev/2628b2c4-a4c3-433d-a8b5-ada611d25687n%40chromium.org.

Dave Tapuska

unread,
May 22, 2024, 5:38:08 PMMay 22
to Daniel Cheng, Ian Barkley-Yeung, Chromium-dev, Joe Mason, scheduler-dev, memory-s...@chromium.org
Have you looked at:


The call stack shows AnimateGroupedChildExpandedCollapse:$_0 on the stack, and those pass raw_ptr... so you'll get the dangling ptr checks.

dave.

Ian Barkley-Yeung

unread,
May 22, 2024, 6:15:30 PMMay 22
to Chromium-dev, Dave Tapuska, Ian Barkley-Yeung, Chromium-dev, Joe Mason, scheduler-dev, memory-s...@chromium.org, Daniel Cheng
> The call stack shows AnimateGroupedChildExpandedCollapse:$_0 on the stack, and those pass raw_ptr... so you'll get the dangling ptr checks.

I'm not sure I believe it though. go/crash marks those frames as having been code-folded (ICF). So any callback with a similar signature (a raw pointer and a weak pointer) will go through that code address. Is it possible that's the source of the problem? Yes. Is it certain? No. And I've had other bugs from previous stability gardening rotations where I believed those code-folded signatures and assigned the crash to the wrong team -- it wasted a week. 

For example, if we somehow knew the concrete type of the object, would that help? Could we base::debug::Alias the remains of the object on the stack? I'm guessing even the vtable would be gone by the (currently irreversible) zapping though...

My understanding is that vtable pointers point to read-only data sections in the ELF? So it might be able to understand a vtable pointer.

I don't think BindState is copyable, but yes, one could memcpy the whole thing into a local char array of the correct size in one of the Invoker functions. I think it would make more sense to only copy the BindState::functor_ which gives you address the callback wanted to call. But that also does impose some cost on each call to Once/RepeatingCallback::Run. (And it's even more difficult to interpret.) Still, it should also work.

To unsubscribe from this group and stop receiving emails from it, send an email to chromium-dev+unsubscribe@chromium.org.

--
You received this message because you are subscribed to the Google Groups "scheduler-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scheduler-dev+unsubscribe@chromium.org.

--
You received this message because you are subscribed to the Google Groups "scheduler-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scheduler-dev+unsubscribe@chromium.org.

Daniel Cheng

unread,
May 22, 2024, 6:20:47 PMMay 22
to Ian Barkley-Yeung, Chromium-dev, Dave Tapuska, Joe Mason, scheduler-dev, memory-s...@chromium.org
I don't really want to alias the entire BindState, and we certainly can't do it on every invocation. It could be very large, and it would introduce more bloat in the BindState specializations, which already take up a non-trivial % of the Chrome binary.

I was suggesting memcpying the dangling object so we could at least figure out what type of object it was. However, that doesn't work today because PartitionAlloc zaps an object and overwrites the entire thing with a pattern, *including* the vtable pointers. The pattern was intentionally chosen to try to maximize the chances of crashing with bad data/bad deref when using freed objects.

The other problem here, of course, is that the object has already been destroyed, so the vtable entries would point to the base class vtables. So we'd have to somehow save them before the dtor started running.

Daniel

To unsubscribe from this group and stop receiving emails from it, send an email to chromium-dev...@chromium.org.

--
You received this message because you are subscribed to the Google Groups "scheduler-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scheduler-de...@chromium.org.

--
You received this message because you are subscribed to the Google Groups "scheduler-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scheduler-de...@chromium.org.

Dave Tapuska

unread,
May 22, 2024, 6:35:32 PMMay 22
to Ian Barkley-Yeung, Chromium-dev, Joe Mason, scheduler-dev, memory-s...@chromium.org, Daniel Cheng
Right but I thought the signature would need to include an AshNotificationView (so I'd expect the closures in this method (and the other 5 in this class to match)).. Either way this code is funky..

The "if (parent)" would never execute because it owns collapsed_summary_view_ and main_view_ which would fail the dangling raw_ptr checks, I think those should actually be removed as args on the callbacks and just access them directly when "if (parent)" is true. 

dave.

To unsubscribe from this group and stop receiving emails from it, send an email to chromium-dev...@chromium.org.

--
You received this message because you are subscribed to the Google Groups "scheduler-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scheduler-de...@chromium.org.

--
You received this message because you are subscribed to the Google Groups "scheduler-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scheduler-de...@chromium.org.

Arthur Sonzogni

unread,
May 23, 2024, 9:53:45 AMMay 23
to dtap...@chromium.org, Ian Barkley-Yeung, Chromium-dev, Joe Mason, scheduler-dev, memory-s...@chromium.org, Daniel Cheng
I strongly agree with what Dave found in StackTrace. Thanks!
If the WeakPtr to the parent is gone, it means the pointer to the child is dangling.

I am going to propose a patch fixing this.

What caused the author to use this unfortunate detour is because we don't have a `View::GetWeakPtr()` function. Maybe we should? Given the importance of this class, I'm guessing there must be some push back?
Arthur @arthursonzogni


Peter Kasting

unread,
May 23, 2024, 11:06:39 AMMay 23
to Arthur Sonzogni, Dave Tapuska, Ian Barkley-Yeung, Chromium-dev, Joe Mason, scheduler-dev, memory-s...@chromium.org, Daniel Cheng
I haven't been paying close enough attention to this thread to deeply understand the bug or the most desirable fix. Classically getting a weak pointer to a view was rarely needed and would generally suggest larger design problems, and we'd try to avoid bandaiding them (similar to our guidance on refcounting). It's possible that's the case here and some combo of making safe use easier, dangerous use harder, and auditing and fixing best practice compliance is the right route. It's also possible the pros of such an API outweigh any cons. If you want to pursue it, I think raising it with the views team with a motivating design or two (and why this is the best solution in such cases) is the right route. I don't think we would reject such proposals out of hand, we'd just want some careful thought.

PK

Reply all
Reply to author
Forward
0 new messages