Set Ready For Review
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Hi Alex,
As the one who suggested to explore delaying RFH in this scenario, could you give a first pass to content/browser?
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Thanks! I didn't look at any test changes yet, but this seems like a good start. Seems like there are still some failures on the Mac bot? Some initial comments below, and I'll also try to discuss this approach tomorrow with the CSA team.
// RenderFrameDeleted();
Remove debug comment?
features::kRfhCheckOneChildAtATime.Get()) {
Should probably check the feature flag as well?
base::SingleThreadTaskRunner::GetCurrentDefault()->PostTask(
I'm not sure about this part of the optimization. Is there evidence that it is needed? Further breaking up the destruction into a bunch of tasks strikes me as increasing complexity and IIUC potentially redoing a bunch of the traversal work (e.g., if we traversed a complex hierarchy of 50 frames and only deleted one frame, we'll post this task and then traverse the whole hierarchy again to see if there's any other frames to delete, rather than only doing the traversal once.)
// the dead branches now. This is a performance optimization.
Maybe move this comment out in front of the "if" and explain how the feature gates whether the RFHs that are ready to be deleted are destroyed synchronously or asynchronously, which might allow the current task (where a navigation might be committing) to finish sooner.
}
I wonder if a third option is possible here, which is to do nothing. In theory, this shouldn't affect correctness, as all frames with unload handlers will still be deleted once they receive their unload ACKs and trigger other branches to be cleaned up, but I can't remember if any subframes without unload handlers will still be cleaned up if we do nothing here. Curious if that's something you tried?
// happen on PendingDeletionCheckCompletedOnSubTree.
nit: lowercase t in "Subtree" to match the function's actual name
BASE_FEATURE_PARAM(bool,
Can you describe the purpose of this param as well?
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Thanks! I didn't look at any test changes yet, but this seems like a good start. Seems like there are still some failures on the Mac bot? Some initial comments below, and I'll also try to discuss this approach tomorrow with the CSA team.
That last test was much harder to fix, but it revealed an assumption that didn't hold anymore when the feature is enabled. See ~FrameTreeNode.
// RenderFrameDeleted();
Aldo CulquicondorRemove debug comment?
Oops, removed.
Should probably check the feature flag as well?
Done
base::SingleThreadTaskRunner::GetCurrentDefault()->PostTask(
I'm not sure about this part of the optimization. Is there evidence that it is needed? Further breaking up the destruction into a bunch of tasks strikes me as increasing complexity and IIUC potentially redoing a bunch of the traversal work (e.g., if we traversed a complex hierarchy of 50 frames and only deleted one frame, we'll post this task and then traverse the whole hierarchy again to see if there's any other frames to delete, rather than only doing the traversal once.)
I figured I would let the study tell me whether this is necessary. I plan to have two arms. There is evidence from flamegraphs and traces that we might be spending hundreds of milliseconds on PendingDeletionCheckCompletedOnSubtree right now.
// the dead branches now. This is a performance optimization.
Maybe move this comment out in front of the "if" and explain how the feature gates whether the RFHs that are ready to be deleted are destroyed synchronously or asynchronously, which might allow the current task (where a navigation might be committing) to finish sooner.
Moved the comment to the header file.
I wonder if a third option is possible here, which is to do nothing. In theory, this shouldn't affect correctness, as all frames with unload handlers will still be deleted once they receive their unload ACKs and trigger other branches to be cleaned up, but I can't remember if any subframes without unload handlers will still be cleaned up if we do nothing here. Curious if that's something you tried?
Would the browser tests have all the machinery necessary to send the unload ACKs? Right now, if I comment out the post task, the tests fail because they don't satisfy rfh_deleted_observer->WaitUntilDeleted(), so it doesn't look like it will work. There's probably a bit too many assumptions now that the deletions are synchronous.
nit: lowercase t in "Subtree" to match the function's actual name
Done
Can you describe the purpose of this param as well?
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Aldo CulquicondorThanks! I didn't look at any test changes yet, but this seems like a good start. Seems like there are still some failures on the Mac bot? Some initial comments below, and I'll also try to discuss this approach tomorrow with the CSA team.
That last test was much harder to fix, but it revealed an assumption that didn't hold anymore when the feature is enabled. See ~FrameTreeNode.
Oh, this is concerning. Can you provide more details about which test was failing and where? Is it FrameNavigationEntry_RecreatedInjectedSrcdocSubframe? If so, I'm surprised it would only fail on Mac, since it's enabled on all platforms. I wonder if it was flaky to start with, and your CL just made it worse. (Also, mac_rel still seems to be failing.)
More importantly, I don't really understand the fix in the latest PS, and don't think this is the right direction for it. Conceptually it doesn't make much sense to pass in a nav entry ID to places like RemoveChild(). But also, I'd expect the test to have the same problem if any of the subframes registered an unload handler, which would make their RFH destruction asynchronous even without this CL. Can you try that and see if the test still fails? If so, it would indicate that there's an independent problem with the FNE pruning logic added in https://chromium-review.googlesource.com/c/chromium/src/+/767627 that we should file a bug for and fix separately from this CL. Maybe the FNE pruning should still happen when we enter pending deletion state, and not when we complete the unload handlers and/or destroy the RFH/FTN later, or maybe we should always look up the nav entry ID of a subframe FTN from its `parent_` RFH's nav_entry_id() and never look at the last committed one.
base::SingleThreadTaskRunner::GetCurrentDefault()->PostTask(
Aldo CulquicondorI'm not sure about this part of the optimization. Is there evidence that it is needed? Further breaking up the destruction into a bunch of tasks strikes me as increasing complexity and IIUC potentially redoing a bunch of the traversal work (e.g., if we traversed a complex hierarchy of 50 frames and only deleted one frame, we'll post this task and then traverse the whole hierarchy again to see if there's any other frames to delete, rather than only doing the traversal once.)
I figured I would let the study tell me whether this is necessary. I plan to have two arms. There is evidence from flamegraphs and traces that we might be spending hundreds of milliseconds on PendingDeletionCheckCompletedOnSubtree right now.
Hundreds of milliseconds on PendingDeletionCheckCompletedOnSubtree is pretty wild. Do you have a sample trace where this is the case? I took a look at a [trace](https://chrometto.googleplex.com/chrometto/open_trace?p=%5Bnull%2Cnull%2Cnull%2Cnull%2Cnull%2C%5B%228b96a5ebcacc8ad6db01220a323032352f30372f3138%22%2Cnull%2Cnull%2C%5B%5B%22RenderFrameHostManager%3A%3AUnloadOldFrame%22%2C30000%2C40000%5D%5D%5D%5D&trace_id=8b96a5ebcacc8ad6db01220a323032352f30372f3138) linked in your doc but didn't see any cases where PendingDeletionCheckCompletedOnSubtree was overly long or involved more than a couple of ~RFHI destructors. The per-frame PostTasks would really only make sense if we see a single PendingDeletionCheckCompletedOnSubtree comprised of a bunch of several smaller ~RFHIs, rather than one ~RFHI taking a long time (which is what I saw in that trace).
The concern with per-frame PostTasks is that they're adding a bunch of new states with partially destroyed frame trees, making races and bugs more likely (being in pending deletion state has always been kind of fragile), in addition to accumulating the overhead of posting many tasks and re-traversing the tree for larger hierarchies. So there's a complexity cost to supporting this experiment arm, and I'd like to make sure it's justified (as compared to starting with a single PostTask to see if it gets us most of the way there).
Aldo CulquicondorI wonder if a third option is possible here, which is to do nothing. In theory, this shouldn't affect correctness, as all frames with unload handlers will still be deleted once they receive their unload ACKs and trigger other branches to be cleaned up, but I can't remember if any subframes without unload handlers will still be cleaned up if we do nothing here. Curious if that's something you tried?
Would the browser tests have all the machinery necessary to send the unload ACKs? Right now, if I comment out the post task, the tests fail because they don't satisfy rfh_deleted_observer->WaitUntilDeleted(), so it doesn't look like it will work. There's probably a bit too many assumptions now that the deletions are synchronous.
Ack. Looking at the code again, it seems that if neither the navigating frame nor any of its descendants have unload handlers, and we're navigating in a subframe, there's nothing else to trigger this destruction. And this is consistent with the test fixes you've done which seems to be only touching subframe cases. So we probably do need to keep the call.
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Aldo CulquicondorThanks! I didn't look at any test changes yet, but this seems like a good start. Seems like there are still some failures on the Mac bot? Some initial comments below, and I'll also try to discuss this approach tomorrow with the CSA team.
Alex MoshchukThat last test was much harder to fix, but it revealed an assumption that didn't hold anymore when the feature is enabled. See ~FrameTreeNode.
Oh, this is concerning. Can you provide more details about which test was failing and where? Is it FrameNavigationEntry_RecreatedInjectedSrcdocSubframe? If so, I'm surprised it would only fail on Mac, since it's enabled on all platforms. I wonder if it was flaky to start with, and your CL just made it worse. (Also, mac_rel still seems to be failing.)
More importantly, I don't really understand the fix in the latest PS, and don't think this is the right direction for it. Conceptually it doesn't make much sense to pass in a nav entry ID to places like RemoveChild(). But also, I'd expect the test to have the same problem if any of the subframes registered an unload handler, which would make their RFH destruction asynchronous even without this CL. Can you try that and see if the test still fails? If so, it would indicate that there's an independent problem with the FNE pruning logic added in https://chromium-review.googlesource.com/c/chromium/src/+/767627 that we should file a bug for and fix separately from this CL. Maybe the FNE pruning should still happen when we enter pending deletion state, and not when we complete the unload handlers and/or destroy the RFH/FTN later, or maybe we should always look up the nav entry ID of a subframe FTN from its `parent_` RFH's nav_entry_id() and never look at the last committed one.
The test wasn't initially flaky, but this CL made it flaky. It was failing on Linux too, but somehow the rate of failure was lower so that it passed the bots.
But also, I'd expect the test to have the same problem if any of the subframes registered an unload handler, which would make their RFH destruction asynchronous even without this CL. Can you try that and see if the test still fails?
No, they are not failing, I have run them hundreds of times with my feature disabled and they never failed. Doesn't that mean that there isn't an unload handler for those tests? Any other tests I could check?
Maybe the FNE pruning should still happen when we enter pending deletion state, and not when we complete the unload handlers and/or destroy the RFH/FTN later, or maybe we should always look up the nav entry ID of a subframe FTN from its parent_ RFH's nav_entry_id() and never look at the last committed one.
I think we need to do the FNE prunning when entering the deletion state. I tried using the parent's nav_entry_id, but it had already changed to the last committed entry too (hence why I added that hacky parameter).
base::SingleThreadTaskRunner::GetCurrentDefault()->PostTask(
Aldo CulquicondorI'm not sure about this part of the optimization. Is there evidence that it is needed? Further breaking up the destruction into a bunch of tasks strikes me as increasing complexity and IIUC potentially redoing a bunch of the traversal work (e.g., if we traversed a complex hierarchy of 50 frames and only deleted one frame, we'll post this task and then traverse the whole hierarchy again to see if there's any other frames to delete, rather than only doing the traversal once.)
Alex MoshchukI figured I would let the study tell me whether this is necessary. I plan to have two arms. There is evidence from flamegraphs and traces that we might be spending hundreds of milliseconds on PendingDeletionCheckCompletedOnSubtree right now.
Hundreds of milliseconds on PendingDeletionCheckCompletedOnSubtree is pretty wild. Do you have a sample trace where this is the case? I took a look at a [trace](https://chrometto.googleplex.com/chrometto/open_trace?p=%5Bnull%2Cnull%2Cnull%2Cnull%2Cnull%2C%5B%228b96a5ebcacc8ad6db01220a323032352f30372f3138%22%2Cnull%2Cnull%2C%5B%5B%22RenderFrameHostManager%3A%3AUnloadOldFrame%22%2C30000%2C40000%5D%5D%5D%5D&trace_id=8b96a5ebcacc8ad6db01220a323032352f30372f3138) linked in your doc but didn't see any cases where PendingDeletionCheckCompletedOnSubtree was overly long or involved more than a couple of ~RFHI destructors. The per-frame PostTasks would really only make sense if we see a single PendingDeletionCheckCompletedOnSubtree comprised of a bunch of several smaller ~RFHIs, rather than one ~RFHI taking a long time (which is what I saw in that trace).
The concern with per-frame PostTasks is that they're adding a bunch of new states with partially destroyed frame trees, making races and bugs more likely (being in pending deletion state has always been kind of fragile), in addition to accumulating the overhead of posting many tasks and re-traversing the tree for larger hierarchies. So there's a complexity cost to supporting this experiment arm, and I'd like to make sure it's justified (as compared to starting with a single PostTask to see if it gets us most of the way there).
Yeah, in most cases there is just one ~RFH that takes a long time, in some cases followed by shorter ones. Again, I was trying to get the study to tell me whether it was overkill. But I see your point about more races, so yeah, let's start simpler. I'll remove it in my next update.
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Aldo CulquicondorThanks! I didn't look at any test changes yet, but this seems like a good start. Seems like there are still some failures on the Mac bot? Some initial comments below, and I'll also try to discuss this approach tomorrow with the CSA team.
Alex MoshchukThat last test was much harder to fix, but it revealed an assumption that didn't hold anymore when the feature is enabled. See ~FrameTreeNode.
Aldo CulquicondorOh, this is concerning. Can you provide more details about which test was failing and where? Is it FrameNavigationEntry_RecreatedInjectedSrcdocSubframe? If so, I'm surprised it would only fail on Mac, since it's enabled on all platforms. I wonder if it was flaky to start with, and your CL just made it worse. (Also, mac_rel still seems to be failing.)
More importantly, I don't really understand the fix in the latest PS, and don't think this is the right direction for it. Conceptually it doesn't make much sense to pass in a nav entry ID to places like RemoveChild(). But also, I'd expect the test to have the same problem if any of the subframes registered an unload handler, which would make their RFH destruction asynchronous even without this CL. Can you try that and see if the test still fails? If so, it would indicate that there's an independent problem with the FNE pruning logic added in https://chromium-review.googlesource.com/c/chromium/src/+/767627 that we should file a bug for and fix separately from this CL. Maybe the FNE pruning should still happen when we enter pending deletion state, and not when we complete the unload handlers and/or destroy the RFH/FTN later, or maybe we should always look up the nav entry ID of a subframe FTN from its `parent_` RFH's nav_entry_id() and never look at the last committed one.
The test wasn't initially flaky, but this CL made it flaky. It was failing on Linux too, but somehow the rate of failure was lower so that it passed the bots.
But also, I'd expect the test to have the same problem if any of the subframes registered an unload handler, which would make their RFH destruction asynchronous even without this CL. Can you try that and see if the test still fails?
No, they are not failing, I have run them hundreds of times with my feature disabled and they never failed. Doesn't that mean that there isn't an unload handler for those tests? Any other tests I could check?
Maybe the FNE pruning should still happen when we enter pending deletion state, and not when we complete the unload handlers and/or destroy the RFH/FTN later, or maybe we should always look up the nav entry ID of a subframe FTN from its parent_ RFH's nav_entry_id() and never look at the last committed one.
I think we need to do the FNE prunning when entering the deletion state. I tried using the parent's nav_entry_id, but it had already changed to the last committed entry too (hence why I added that hacky parameter).
The test wasn't initially flaky, but this CL made it flaky. It was failing on Linux too, but somehow the rate of failure was lower so that it passed the bots.
But also, I'd expect the test to have the same problem if any of the subframes registered an unload handler, which would make their RFH destruction asynchronous even without this CL. Can you try that and see if the test still fails?
No, they are not failing, I have run them hundreds of times with my feature disabled and they never failed. Doesn't that mean that there isn't an unload handler for those tests? Any other tests I could check?
Right, I would only expect them to fail without your CL if you modify the existing test so that the subframes have an unload handler. Was the failure happening [here](https://source.chromium.org/chromium/chromium/src/+/main:content/browser/renderer_host/navigation_controller_impl_browsertest.cc;l=8007;drc=f57cc21393cfc33b95bbf88117df6f8fba371a95)? If so, I'd run something like this:
```
ASSERT_TRUE(ExecJs(root->child_at(0), "window.onpagehide = ()=>{}"));
ASSERT_TRUE(ExecJs(root->child_at(0)->child_at(0), "window.onpagehide = ()=>{}"));
```
before step 2 and see if that causes the test to fail (without your CL). You might also need to modify navigation_controller/inject_iframe_srcdoc_with_nested_frame.html so that the grandchild iframe points to "/cross-site/foo.com/navigation_controller/form.html" instead of just "form.html", so that it's an OOPIF (required to involve async unload handling).
> Maybe the FNE pruning should still happen when we enter pending deletion state, and not when we complete the unload handlers and/or destroy the RFH/FTN later, or maybe we should always look up the nav entry ID of a subframe FTN from its parent_ RFH's nav_entry_id() and never look at the last committed one.I think we need to do the FNE prunning when entering the deletion state. I tried using the parent's nav_entry_id, but it had already changed to the last committed entry too (hence why I added that hacky parameter).
Oh, interesting - the nav_entry_id change should really only be possible if the main frame navigation stayed in the same RFH - which shouldn't really be the case anymore thanks to RenderDocument, but these tests do cover running both with and without RenderDocument. And I see that the only test variant that failed was explicitly without RenderDocument (RDCrashedFrame_, and not RDAllFrames_). Maybe that means that this problem doesn't exist with RenderDocument (which should launch very soon). But also, I'm confused because in that state the new logic shouldn't be invoked at commit time, because we aren't swapping the main RFH and hence not even entering pending deletion.
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
It is indeed failing with unload handlers! https://chromium-review.googlesource.com/c/chromium/src/+/6966438
But also, I'm confused because in that state the new logic shouldn't be invoked at commit time, because we aren't swapping the main RFH and hence not even entering pending deletion.
Not sure what you mean by new logic but, in RDCrashedFrame, there is a call for Detach for the iframe created by a script when navigating away.
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |