Hey Bryan,I realized UseCounters not working correctly with OOPIF is likely to become urgent, and I'm thinking your suggestion of extending PageTimingMetricsSender to handle use counters is probably the right way to go. But I wanted to ask a few more questions about it to make sure (feel free to point me at any design docs - I didn't find any with a quick search). If the answers to these questions suggest this is likely the right path forward, I can write up a design doc with more detail.1) Unified notion of "Page visit"?The trickiest bit of UseCounters is probably correctly correlating to and counting the notion of "page visit". It seems like we should be doing this once for all page visit scoped metrics, and this seems to me like the main advantage of unifying page load and UseCounter metrics. I think our concept of what a "visit" should be essentially the same as the loading team's, right? In particular, we want something that's well correlated to the user's notion of a "page" - i.e. it's largely irrelevant to the user what sort of frame navigations happen while they're using a web page. Conceptually I think "did this page break" and "did this page load fast" are referring to the same scope.
2) Handling of different schemesWhat about special schemes like chrome:// and chrome-extension:// - do you treat those the same as https:// from a metrics perspective? Ignoring at least chrome:// seems important for UseCounter (otherwise pages like the NTP can bias our metrics). I think it's OK if we have a different policy here, but it might be useful to share some logic.
3) Mapping frames back to a "page visit"It looks like it's instances of PageLoadTracker which really define the notion of a "page visit" for you, right? I think this does pretty much what we want for the common case, but there are some edge cases.In particular, how do you handle the different type of workers today? I think we'd want to continue to delegate the UseCounters from workers to (one of) the page(s) that they're associated with.
Also SVGImages are really special in Blink - each get their own 'Page' instance (which appears as a top-level frame) and I even think one SVGImage can be used by multiple WebContents instances. Do you know how those are handled today from a page load metrics perspective? We treat them specially for UseCounter (separate 'SVGImage' histograms) but I think I'd rather delegate them to some page like workers.
4) Biases from throttlingThe once-per-second throttle seems likely to be important for performance, but won't that introduce some biases where metrics are lost for certain types of "page visits"? Eg. how are redirects handled? We could get a renderer fast shutdown on redirect meaning that the renderer process is always killed before it gets a chance to send any metrics, right? Perhaps we could put some "send now" trigger into the navigation code to mitigate?
5) Data volume / cumulative updatesThe UseCounter bits are potentially a lot of data. Ignoring the SVG case, we've got about 6000 bits across 6 histograms. We're not going to want to add ~1kB to each update IPC, will we?
Perhaps instead of sending the current BitVector we could just send a vector of ints for the features that have been newly used for a given DocumentLoader since the last update? In our case that will often be 0, but occasionally (first load of each frame) be as many as 100 ints. Or is there some reason it's important that the data on any given update includes all the information from past updates?
6) Duplication of enum definitionDifferent (I think) from WebLoadingBehaviorFlag, the UseCounter::Features histogram changes a LOT. I'd prefer to avoid having multiple definitions of it (beyond the copies we already need in UseCounter.h and histograms.xml - the latter being updated by a script semi-automatically from the former). Do you think it would be OK to just treat our IDs as opaque integer values outside of blink? We could pick a maximum value to share with chromium big enough not to change very often.
Sorry for the long list. Let me know if it would be better to chat over VC.
--Thanks,Rick
You received this message because you are subscribed to the Google Groups "loading-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to loading-dev+unsubscribe@chromium.org.
To post to this group, send email to loadi...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/loading-dev/CAFUtAY8yXzcFw05G_RX%3D%3DWrk8n8yhR5hZfdc7Q4H2J5%2B-CMqSw%40mail.gmail.com.
Rick, thanks for sending this out. I've left some comments inline. Please see https://docs.google.com/document/d/1HJsJ5W2H_3qRdqPAUgAEo10AF8gXPTXZLUET4X4_sII/edit?usp=sharing for a more general PLM doc.
On Tue, May 2, 2017 at 5:35 PM, Rick Byers <rby...@chromium.org> wrote:Hey Bryan,I realized UseCounters not working correctly with OOPIF is likely to become urgent, and I'm thinking your suggestion of extending PageTimingMetricsSender to handle use counters is probably the right way to go. But I wanted to ask a few more questions about it to make sure (feel free to point me at any design docs - I didn't find any with a quick search). If the answers to these questions suggest this is likely the right path forward, I can write up a design doc with more detail.
1) Unified notion of "Page visit"?The trickiest bit of UseCounters is probably correctly correlating to and counting the notion of "page visit". It seems like we should be doing this once for all page visit scoped metrics, and this seems to me like the main advantage of unifying page load and UseCounter metrics. I think our concept of what a "visit" should be essentially the same as the loading team's, right? In particular, we want something that's well correlated to the user's notion of a "page" - i.e. it's largely irrelevant to the user what sort of frame navigations happen while they're using a web page. Conceptually I think "did this page break" and "did this page load fast" are referring to the same scope.Yes I think this is true. There are complexities here WRT "page load end" if that's when you want to collect metrics. Often the "end" of a page load from the renderer's perspective will happen during fast shutdown, so there is no time to log metrics.Loading team solves this by IPCing relevant data to the browser process.
2) Handling of different schemesWhat about special schemes like chrome:// and chrome-extension:// - do you treat those the same as https:// from a metrics perspective? Ignoring at least chrome:// seems important for UseCounter (otherwise pages like the NTP can bias our metrics). I think it's OK if we have a different policy here, but it might be useful to share some logic.We probably restrict things too much here. You can look at page_track_decider for the logic of which pages we track in page_load_metrics code.
3) Mapping frames back to a "page visit"It looks like it's instances of PageLoadTracker which really define the notion of a "page visit" for you, right? I think this does pretty much what we want for the common case, but there are some edge cases.In particular, how do you handle the different type of workers today? I think we'd want to continue to delegate the UseCounters from workers to (one of) the page(s) that they're associated with.We only track main frames, so we don't really deal with workers. Delegating UseCounters seems okay to me.
Also SVGImages are really special in Blink - each get their own 'Page' instance (which appears as a top-level frame) and I even think one SVGImage can be used by multiple WebContents instances. Do you know how those are handled today from a page load metrics perspective? We treat them specially for UseCounter (separate 'SVGImage' histograms) but I think I'd rather delegate them to some page like workers.I'm less sure about SVGs but I'm fairly sure the page track decider will filter them out.
4) Biases from throttlingThe once-per-second throttle seems likely to be important for performance, but won't that introduce some biases where metrics are lost for certain types of "page visits"? Eg. how are redirects handled? We could get a renderer fast shutdown on redirect meaning that the renderer process is always killed before it gets a chance to send any metrics, right? Perhaps we could put some "send now" trigger into the navigation code to mitigate?The first set of IPCs is throttled to 50ms, and subsequent ones are at 1 second to account for this somewhat. I agree that this will bias pages that end very quickly (e.g. client redirects). If UseCounters hook into our IPC system we could probably have a stronger argument to reduce the throttling.
5) Data volume / cumulative updatesThe UseCounter bits are potentially a lot of data. Ignoring the SVG case, we've got about 6000 bits across 6 histograms. We're not going to want to add ~1kB to each update IPC, will we?We should investigate this. Loading team did an experiment and found in some cases that directly IPCing small resources (<= 1kB I think?) was more efficient than using shared memory. I think the overhead will essentially be 2 1k copies.Perhaps instead of sending the current BitVector we could just send a vector of ints for the features that have been newly used for a given DocumentLoader since the last update? In our case that will often be 0, but occasionally (first load of each frame) be as many as 100 ints. Or is there some reason it's important that the data on any given update includes all the information from past updates?This might make sense. Sending a vector<tuple<int /* index */, int /* delta */>> associative array could make sense. IMO it depends on the code complexity tradeoff vs the bit vector copying.
6) Duplication of enum definitionDifferent (I think) from WebLoadingBehaviorFlag, the UseCounter::Features histogram changes a LOT. I'd prefer to avoid having multiple definitions of it (beyond the copies we already need in UseCounter.h and histograms.xml - the latter being updated by a script semi-automatically from the former). Do you think it would be OK to just treat our IDs as opaque integer values outside of blink? We could pick a maximum value to share with chromium big enough not to change very often.This sounds reasonable to me.
Sorry for the long list. Let me know if it would be better to chat over VC.Don't worry about it, I would be happy to jump on VC about this.
Thanks,Rick
--
You received this message because you are subscribed to the Google Groups "loading-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to loading-dev...@chromium.org.
Thanks Charles! Agree with what Charles shared earlier - I'll just add a few more details.First, I'm really glad to have UseCounter info flowing through page load metrics, as it will also enable page load metrics observers to do things like only log metrics for pages where a certain UseCounter is encountered. In the longer term, I wonder if it makes sense to get rid of WebLoadingBehaviorFlag and migrate everything to use counters. This is something we can consider going forward.On Tue, May 2, 2017 at 9:55 PM Charles Harrison <cshar...@chromium.org> wrote:Rick, thanks for sending this out. I've left some comments inline. Please see https://docs.google.com/document/d/1HJsJ5W2H_3qRdqPAUgAEo10AF8gXPTXZLUET4X4_sII/edit?usp=sharing for a more general PLM doc.Yes. In addition there's https://docs.google.com/document/d/1ApM09ygFdNBHjQogYJ_rUwGyFcCplWZFhcJfsUFkFQM/edit which is intended for consumers who want to add custom page load metrics, but has some useful architecture details (which page loads are tracked and which aren't etc - see the 'Which page loads are tracked by PageLoadMetricsObservers?' section for details).On Tue, May 2, 2017 at 5:35 PM, Rick Byers <rby...@chromium.org> wrote:Hey Bryan,I realized UseCounters not working correctly with OOPIF is likely to become urgent, and I'm thinking your suggestion of extending PageTimingMetricsSender to handle use counters is probably the right way to go. But I wanted to ask a few more questions about it to make sure (feel free to point me at any design docs - I didn't find any with a quick search). If the answers to these questions suggest this is likely the right path forward, I can write up a design doc with more detail.This sounds great. I think existing page load metrics implementers will benefit from being able to determine which use counters were activated for a given page.1) Unified notion of "Page visit"?The trickiest bit of UseCounters is probably correctly correlating to and counting the notion of "page visit". It seems like we should be doing this once for all page visit scoped metrics, and this seems to me like the main advantage of unifying page load and UseCounter metrics. I think our concept of what a "visit" should be essentially the same as the loading team's, right? In particular, we want something that's well correlated to the user's notion of a "page" - i.e. it's largely irrelevant to the user what sort of frame navigations happen while they're using a web page. Conceptually I think "did this page break" and "did this page load fast" are referring to the same scope.Yes I think this is true. There are complexities here WRT "page load end" if that's when you want to collect metrics. Often the "end" of a page load from the renderer's perspective will happen during fast shutdown, so there is no time to log metrics.Loading team solves this by IPCing relevant data to the browser process.RE: fast shutdown, it's correct that any process that's killed by fast shutdown within 1 second will not get to report any data buffered in the render process. We decided that this was a reasonable tradeoff between accuracy and IPC overhead.Our thinking here is that if something happens just 1 second before the page dies, then it's unclear that could have had a meaningful impact on the user experience, and thus maybe it's actually reasonable (or better) to not really count it as having happened. This may not be valid for some uses, in particular any features that persist information for longer than the lifetime of that page, but it's been an ok tradeoff so far.We can definitely revisit the 1 second threshold, shortening it etc, if we think that might be important to do so. One thing I could do is run a field trial with varying thresholds and see how much of a difference there is in resulting metrics, to find the optimal tradeoff between delay and accuracy.
One more thought: adding the use counter data to PageLoadMetadata, as a peer to WebLoadingBehaviorFlag, is probably the easiest path forward here. We already send this struct for both main and child frames, so all we should have to do is plumb the use counter data out from blink to MetricsRenderFrameObserver and we should get the rest for free, more or less.
On Wed, May 3, 2017 at 11:59 AM Bryan McQuade <bmcq...@chromium.org> wrote:On Wed, May 3, 2017 at 11:56 AM Bryan McQuade <bmcq...@chromium.org> wrote:Thanks Charles! Agree with what Charles shared earlier - I'll just add a few more details.First, I'm really glad to have UseCounter info flowing through page load metrics, as it will also enable page load metrics observers to do things like only log metrics for pages where a certain UseCounter is encountered. In the longer term, I wonder if it makes sense to get rid of WebLoadingBehaviorFlag and migrate everything to use counters. This is something we can consider going forward.On Tue, May 2, 2017 at 9:55 PM Charles Harrison <cshar...@chromium.org> wrote:Rick, thanks for sending this out. I've left some comments inline. Please see https://docs.google.com/document/d/1HJsJ5W2H_3qRdqPAUgAEo10AF8gXPTXZLUET4X4_sII/edit?usp=sharing for a more general PLM doc.Yes. In addition there's https://docs.google.com/document/d/1ApM09ygFdNBHjQogYJ_rUwGyFcCplWZFhcJfsUFkFQM/edit which is intended for consumers who want to add custom page load metrics, but has some useful architecture details (which page loads are tracked and which aren't etc - see the 'Which page loads are tracked by PageLoadMetricsObservers?' section for details).
On Tue, May 2, 2017 at 5:35 PM, Rick Byers <rby...@chromium.org> wrote:Hey Bryan,I realized UseCounters not working correctly with OOPIF is likely to become urgent, and I'm thinking your suggestion of extending PageTimingMetricsSender to handle use counters is probably the right way to go. But I wanted to ask a few more questions about it to make sure (feel free to point me at any design docs - I didn't find any with a quick search). If the answers to these questions suggest this is likely the right path forward, I can write up a design doc with more detail.This sounds great. I think existing page load metrics implementers will benefit from being able to determine which use counters were activated for a given page.1) Unified notion of "Page visit"?The trickiest bit of UseCounters is probably correctly correlating to and counting the notion of "page visit". It seems like we should be doing this once for all page visit scoped metrics, and this seems to me like the main advantage of unifying page load and UseCounter metrics. I think our concept of what a "visit" should be essentially the same as the loading team's, right? In particular, we want something that's well correlated to the user's notion of a "page" - i.e. it's largely irrelevant to the user what sort of frame navigations happen while they're using a web page. Conceptually I think "did this page break" and "did this page load fast" are referring to the same scope.Yes I think this is true. There are complexities here WRT "page load end" if that's when you want to collect metrics. Often the "end" of a page load from the renderer's perspective will happen during fast shutdown, so there is no time to log metrics.Loading team solves this by IPCing relevant data to the browser process.RE: fast shutdown, it's correct that any process that's killed by fast shutdown within 1 second will not get to report any data buffered in the render process. We decided that this was a reasonable tradeoff between accuracy and IPC overhead.Our thinking here is that if something happens just 1 second before the page dies, then it's unclear that could have had a meaningful impact on the user experience, and thus maybe it's actually reasonable (or better) to not really count it as having happened. This may not be valid for some uses, in particular any features that persist information for longer than the lifetime of that page, but it's been an ok tradeoff so far.
We can definitely revisit the 1 second threshold, shortening it etc, if we think that might be important to do so. One thing I could do is run a field trial with varying thresholds and see how much of a difference there is in resulting metrics, to find the optimal tradeoff between delay and accuracy.One more detail here: for non-fast-shutdown, we do flush any buffered updates in the destructor, so we will report all metrics as long as the full destruction path is allowed to run.
2) Handling of different schemesWhat about special schemes like chrome:// and chrome-extension:// - do you treat those the same as https:// from a metrics perspective? Ignoring at least chrome:// seems important for UseCounter (otherwise pages like the NTP can bias our metrics). I think it's OK if we have a different policy here, but it might be useful to share some logic.We probably restrict things too much here. You can look at page_track_decider for the logic of which pages we track in page_load_metrics code.Right, as Charles notes we ignore some types of commits in the main frame. In particular we ignore non http/https pages.The policy for this can be found here:
With implementations of the various policy methods:render side: https://cs.chromium.org/chromium/src/chrome/renderer/page_load_metrics/renderer_page_track_decider.ccbrowser side: https://cs.chromium.org/chromium/src/chrome/browser/page_load_metrics/browser_page_track_decider.ccThis is also explained a bit in the earlier linked doc.3) Mapping frames back to a "page visit"It looks like it's instances of PageLoadTracker which really define the notion of a "page visit" for you, right? I think this does pretty much what we want for the common case, but there are some edge cases.In particular, how do you handle the different type of workers today? I think we'd want to continue to delegate the UseCounters from workers to (one of) the page(s) that they're associated with.We only track main frames, so we don't really deal with workers. Delegating UseCounters seems okay to me.Right - page load metrics as it exists has no direct knowledge of workers. We can think about adding support for them, or perhaps we can track usecounters for workers in a separate custom way, specific to workers, outside of the page load metrics code.Also SVGImages are really special in Blink - each get their own 'Page' instance (which appears as a top-level frame) and I even think one SVGImage can be used by multiple WebContents instances. Do you know how those are handled today from a page load metrics perspective? We treat them specially for UseCounter (separate 'SVGImage' histograms) but I think I'd rather delegate them to some page like workers.I'm less sure about SVGs but I'm fairly sure the page track decider will filter them out.That's right - our code to filter out non HTML/XHTML mime types will cause us to ignore SVG documents altogether. Our general policy has been to relax the constraints of what we ignore as consumers have a need for it, so we could add support for tracking SVG if you need it.
The other thing to know is that we don't do anything special for same-page navigations (e.g. #anchor navs, pushState, etc) - we just treat those as the same 'page' as the non-same-page nav it was associated with. Hopefully that's ok for / consistent with the existing UseCounter implementation.
4) Biases from throttlingThe once-per-second throttle seems likely to be important for performance, but won't that introduce some biases where metrics are lost for certain types of "page visits"? Eg. how are redirects handled? We could get a renderer fast shutdown on redirect meaning that the renderer process is always killed before it gets a chance to send any metrics, right? Perhaps we could put some "send now" trigger into the navigation code to mitigate?The first set of IPCs is throttled to 50ms, and subsequent ones are at 1 second to account for this somewhat. I agree that this will bias pages that end very quickly (e.g. client redirects). If UseCounters hook into our IPC system we could probably have a stronger argument to reduce the throttling.Right - the 50ms initial delay is intended to report anything that happens shortly after commit, to catch things like pages that abort quickly. Of course, if the abort happens in less than 50ms we may miss it. We can similarly modify this initial delay if needed. Definitely open to suggestions on how to improve - it'd be helpful to identify certain cases that we think are important, verify they are dropped by the current impl, and then figure out how to address them. We could also just run a field trial with very aggressive flushing (e.g. 1ms delay) to see if we see significant differences in metrics as a result, to understand how much room there is for improvement here.
5) Data volume / cumulative updatesThe UseCounter bits are potentially a lot of data. Ignoring the SVG case, we've got about 6000 bits across 6 histograms. We're not going to want to add ~1kB to each update IPC, will we?We should investigate this. Loading team did an experiment and found in some cases that directly IPCing small resources (<= 1kB I think?) was more efficient than using shared memory. I think the overhead will essentially be 2 1k copies.Perhaps instead of sending the current BitVector we could just send a vector of ints for the features that have been newly used for a given DocumentLoader since the last update? In our case that will often be 0, but occasionally (first load of each frame) be as many as 100 ints. Or is there some reason it's important that the data on any given update includes all the information from past updates?This might make sense. Sending a vector<tuple<int /* index */, int /* delta */>> associative array could make sense. IMO it depends on the code complexity tradeoff vs the bit vector copying.Yeah, we could both use this encoding and only send deltas from the previous state to minimize the amount of data being sent. I'm fine with doing this or whatever else we think makes sense here.
6) Duplication of enum definitionDifferent (I think) from WebLoadingBehaviorFlag, the UseCounter::Features histogram changes a LOT. I'd prefer to avoid having multiple definitions of it (beyond the copies we already need in UseCounter.h and histograms.xml - the latter being updated by a script semi-automatically from the former). Do you think it would be OK to just treat our IDs as opaque integer values outside of blink? We could pick a maximum value to share with chromium big enough not to change very often.This sounds reasonable to me.+1 fine with this in the near term - perhaps in the medium term we could move the UseCounter enum's def to a common shared location. Rick do you have a sense for where that might be? In addition to just logging use counters, I can imagine that other page load metrics observers might be interested in only logging metrics if a certain use counter is encountered - for that case it seems really valuable to be able to reference the enum definition in the browser process code. We don't have to solve this problem right away but it'd be ideal to have a path/plan to having a single shared definition we can use from the browser process, like with WebLoadingBehaviorFlag.
Sorry for the long list. Let me know if it would be better to chat over VC.Don't worry about it, I would be happy to jump on VC about this.Yep, Charles and I can VC if you like. From this thread it sounds like this should be a very doable change. Happy to work through any remaining details over VC or however works best for you.
Thanks,Rick
--
You received this message because you are subscribed to the Google Groups "loading-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to loading-dev+unsubscribe@chromium.org.
This is great guys, thank you!I think this is clear enough that I can start working on an initial CL to sketch this out.A couple responses inline. And one more question: What sort of testing do you have in place? For UseCounters we've had only unit testing, relying on some manual ad-hoc testing and post-hoc data analysis to convince us it's likely working as intended. I'm not sure but I had been starting to think it could be worth the effort to create some sort of content_browsertest framework so we can get more integration testing of the historically problematic cases. Then again, since we'll always have to do the data analysis to convince ourselves the metrics are telling the truth, perhaps having better integration tests isn't that high priority?
On Wed, May 3, 2017 at 9:22 PM, Bryan McQuade <bmcq...@chromium.org> wrote:One more thought: adding the use counter data to PageLoadMetadata, as a peer to WebLoadingBehaviorFlag, is probably the easiest path forward here. We already send this struct for both main and child frames, so all we should have to do is plumb the use counter data out from blink to MetricsRenderFrameObserver and we should get the rest for free, more or less.Yeah that's what I was thinking too. Luna and I have started to sketch out some design details here. The key open question I have about this is if we're going to have the IPC contain only a list of the features hit since the previous IPC, where do we convert that to a full bit-set of all features hit since page load? I see two obvious choices:
- The simplest answer I think is to do that inside our own PageLoadMetricsObserver instance which is responsible for generating the histograms. Then PageLoadMetadata (and hence PageLoadExtraInfo) contains only the small list of newly hit features. This gives us everything we need at the moment, but then this information isn't terribly useful by itself in, say a PageLoadMetricsObserver::OnComplete call.
- Alternately I like your idea that other metrics observers should be able to ask what features have been hit, which suggests that maybe PageLoadTracker should be maintaining the bit vector of seen features itself and making it available in the PageLoadExtraInfo separately from the PageLoadMetadata. This would suggest that perhaps the "newly observed features list" shouldn't actually be a member of PageLoadMetadata but a peer to it.
Any thoughts? We could always start with #1 now and plan to migrate to the more complicated #2 solution once things are clearly. In particular, I can unblock my urgent UKM work with just #1.
On Wed, May 3, 2017 at 11:59 AM Bryan McQuade <bmcq...@chromium.org> wrote:On Wed, May 3, 2017 at 11:56 AM Bryan McQuade <bmcq...@chromium.org> wrote:Thanks Charles! Agree with what Charles shared earlier - I'll just add a few more details.First, I'm really glad to have UseCounter info flowing through page load metrics, as it will also enable page load metrics observers to do things like only log metrics for pages where a certain UseCounter is encountered. In the longer term, I wonder if it makes sense to get rid of WebLoadingBehaviorFlag and migrate everything to use counters. This is something we can consider going forward.On Tue, May 2, 2017 at 9:55 PM Charles Harrison <cshar...@chromium.org> wrote:Rick, thanks for sending this out. I've left some comments inline. Please see https://docs.google.com/document/d/1HJsJ5W2H_3qRdqPAUgAEo10AF8gXPTXZLUET4X4_sII/edit?usp=sharing for a more general PLM doc.Yes. In addition there's https://docs.google.com/document/d/1ApM09ygFdNBHjQogYJ_rUwGyFcCplWZFhcJfsUFkFQM/edit which is intended for consumers who want to add custom page load metrics, but has some useful architecture details (which page loads are tracked and which aren't etc - see the 'Which page loads are tracked by PageLoadMetricsObservers?' section for details).These are great, thank you!On Tue, May 2, 2017 at 5:35 PM, Rick Byers <rby...@chromium.org> wrote:Hey Bryan,I realized UseCounters not working correctly with OOPIF is likely to become urgent, and I'm thinking your suggestion of extending PageTimingMetricsSender to handle use counters is probably the right way to go. But I wanted to ask a few more questions about it to make sure (feel free to point me at any design docs - I didn't find any with a quick search). If the answers to these questions suggest this is likely the right path forward, I can write up a design doc with more detail.This sounds great. I think existing page load metrics implementers will benefit from being able to determine which use counters were activated for a given page.1) Unified notion of "Page visit"?The trickiest bit of UseCounters is probably correctly correlating to and counting the notion of "page visit". It seems like we should be doing this once for all page visit scoped metrics, and this seems to me like the main advantage of unifying page load and UseCounter metrics. I think our concept of what a "visit" should be essentially the same as the loading team's, right? In particular, we want something that's well correlated to the user's notion of a "page" - i.e. it's largely irrelevant to the user what sort of frame navigations happen while they're using a web page. Conceptually I think "did this page break" and "did this page load fast" are referring to the same scope.Yes I think this is true. There are complexities here WRT "page load end" if that's when you want to collect metrics. Often the "end" of a page load from the renderer's perspective will happen during fast shutdown, so there is no time to log metrics.Loading team solves this by IPCing relevant data to the browser process.RE: fast shutdown, it's correct that any process that's killed by fast shutdown within 1 second will not get to report any data buffered in the render process. We decided that this was a reasonable tradeoff between accuracy and IPC overhead.Our thinking here is that if something happens just 1 second before the page dies, then it's unclear that could have had a meaningful impact on the user experience, and thus maybe it's actually reasonable (or better) to not really count it as having happened. This may not be valid for some uses, in particular any features that persist information for longer than the lifetime of that page, but it's been an ok tradeoff so far.Yep that makes sense. I think we'll be able to test this hypothesis by recording both a new UMA histogram along with our existing one and looking for non-trivial differences in the data.
We can definitely revisit the 1 second threshold, shortening it etc, if we think that might be important to do so. One thing I could do is run a field trial with varying thresholds and see how much of a difference there is in resulting metrics, to find the optimal tradeoff between delay and accuracy.One more detail here: for non-fast-shutdown, we do flush any buffered updates in the destructor, so we will report all metrics as long as the full destruction path is allowed to run.Ah, great! Do you know if that includes the case of a page navigating itself like the user clicking on a cross-origin <a> link or a JS redirect? We want (and probably have), for example, people to be able to have UseCounters around navigation behavior (eg. how often does a cross-origin iframe navigate the main frame with window.top.location=foo).
2) Handling of different schemesWhat about special schemes like chrome:// and chrome-extension:// - do you treat those the same as https:// from a metrics perspective? Ignoring at least chrome:// seems important for UseCounter (otherwise pages like the NTP can bias our metrics). I think it's OK if we have a different policy here, but it might be useful to share some logic.We probably restrict things too much here. You can look at page_track_decider for the logic of which pages we track in page_load_metrics code.Right, as Charles notes we ignore some types of commits in the main frame. In particular we ignore non http/https pages.The policy for this can be found here:Oh nice, so even at first glance I see this correctly handles a case that could be a bug for us - new tab pages served over http.With implementations of the various policy methods:render side: https://cs.chromium.org/chromium/src/chrome/renderer/page_load_metrics/renderer_page_track_decider.ccbrowser side: https://cs.chromium.org/chromium/src/chrome/browser/page_load_metrics/browser_page_track_decider.ccThis is also explained a bit in the earlier linked doc.3) Mapping frames back to a "page visit"It looks like it's instances of PageLoadTracker which really define the notion of a "page visit" for you, right? I think this does pretty much what we want for the common case, but there are some edge cases.In particular, how do you handle the different type of workers today? I think we'd want to continue to delegate the UseCounters from workers to (one of) the page(s) that they're associated with.We only track main frames, so we don't really deal with workers. Delegating UseCounters seems okay to me.Right - page load metrics as it exists has no direct knowledge of workers. We can think about adding support for them, or perhaps we can track usecounters for workers in a separate custom way, specific to workers, outside of the page load metrics code.Also SVGImages are really special in Blink - each get their own 'Page' instance (which appears as a top-level frame) and I even think one SVGImage can be used by multiple WebContents instances. Do you know how those are handled today from a page load metrics perspective? We treat them specially for UseCounter (separate 'SVGImage' histograms) but I think I'd rather delegate them to some page like workers.I'm less sure about SVGs but I'm fairly sure the page track decider will filter them out.That's right - our code to filter out non HTML/XHTML mime types will cause us to ignore SVG documents altogether. Our general policy has been to relax the constraints of what we ignore as consumers have a need for it, so we could add support for tracking SVG if you need it.Yeah I think we'd want something. We have been deprecating some SVG features so it's important to have visibility into them. Ideally we'd treat them similar to iframes, but they're tricky because (I think) they don't have a single parent frame. So perhaps we need to handle them more like shared workers.
The other thing to know is that we don't do anything special for same-page navigations (e.g. #anchor navs, pushState, etc) - we just treat those as the same 'page' as the non-same-page nav it was associated with. Hopefully that's ok for / consistent with the existing UseCounter implementation.Yep, that's consistent with what we do already. I've been a little torn on how to handle single-page-app style applications. It seems wrong that a popular site like Facebook moving from a new-page-to-navigate to a same-page-navigation model could have a huge impact on our metrics (naively this suggests are usage metrics are biased in favor of older-style sites over newer ones). But I don't see an obviously better solution (I don't think I want opening each e-mail in GMail to be considered a "page visit"), so as long as this sort of problem is something we could work on together in the PLM infrastructure then I'm not too worried about it.4) Biases from throttlingThe once-per-second throttle seems likely to be important for performance, but won't that introduce some biases where metrics are lost for certain types of "page visits"? Eg. how are redirects handled? We could get a renderer fast shutdown on redirect meaning that the renderer process is always killed before it gets a chance to send any metrics, right? Perhaps we could put some "send now" trigger into the navigation code to mitigate?The first set of IPCs is throttled to 50ms, and subsequent ones are at 1 second to account for this somewhat. I agree that this will bias pages that end very quickly (e.g. client redirects). If UseCounters hook into our IPC system we could probably have a stronger argument to reduce the throttling.Right - the 50ms initial delay is intended to report anything that happens shortly after commit, to catch things like pages that abort quickly. Of course, if the abort happens in less than 50ms we may miss it. We can similarly modify this initial delay if needed. Definitely open to suggestions on how to improve - it'd be helpful to identify certain cases that we think are important, verify they are dropped by the current impl, and then figure out how to address them. We could also just run a field trial with very aggressive flushing (e.g. 1ms delay) to see if we see significant differences in metrics as a result, to understand how much room there is for improvement here.I suggest we start by logging the new UseCounter data in parallel with continuing to log the existing stuff. That should give us an A/B test that's at least as good (if not better than) such a field trial.
5) Data volume / cumulative updatesThe UseCounter bits are potentially a lot of data. Ignoring the SVG case, we've got about 6000 bits across 6 histograms. We're not going to want to add ~1kB to each update IPC, will we?We should investigate this. Loading team did an experiment and found in some cases that directly IPCing small resources (<= 1kB I think?) was more efficient than using shared memory. I think the overhead will essentially be 2 1k copies.Perhaps instead of sending the current BitVector we could just send a vector of ints for the features that have been newly used for a given DocumentLoader since the last update? In our case that will often be 0, but occasionally (first load of each frame) be as many as 100 ints. Or is there some reason it's important that the data on any given update includes all the information from past updates?This might make sense. Sending a vector<tuple<int /* index */, int /* delta */>> associative array could make sense. IMO it depends on the code complexity tradeoff vs the bit vector copying.Yeah, we could both use this encoding and only send deltas from the previous state to minimize the amount of data being sent. I'm fine with doing this or whatever else we think makes sense here.Why would we need a "delta"? This is just the list of features newly seen during the lifetime of this Page. Despite the (perhaps poorly chosen) class name, we don't actually know the "count" of usage within a given page load (the "count" refers to the number of distinct page loads that used a feature).
6) Duplication of enum definitionDifferent (I think) from WebLoadingBehaviorFlag, the UseCounter::Features histogram changes a LOT. I'd prefer to avoid having multiple definitions of it (beyond the copies we already need in UseCounter.h and histograms.xml - the latter being updated by a script semi-automatically from the former). Do you think it would be OK to just treat our IDs as opaque integer values outside of blink? We could pick a maximum value to share with chromium big enough not to change very often.This sounds reasonable to me.+1 fine with this in the near term - perhaps in the medium term we could move the UseCounter enum's def to a common shared location. Rick do you have a sense for where that might be? In addition to just logging use counters, I can imagine that other page load metrics observers might be interested in only logging metrics if a certain use counter is encountered - for that case it seems really valuable to be able to reference the enum definition in the browser process code. We don't have to solve this problem right away but it'd be ideal to have a path/plan to having a single shared definition we can use from the browser process, like with WebLoadingBehaviorFlag.Yep having it be in some common place makes sense. I would have said WebKit/public/platform like WebLoadingBehaviorFlag except I know that there's some effort to not depend on WebKit from some parts of Chrome. But maybe that's still the right place?
Sorry for the long list. Let me know if it would be better to chat over VC.Don't worry about it, I would be happy to jump on VC about this.Yep, Charles and I can VC if you like. From this thread it sounds like this should be a very doable change. Happy to work through any remaining details over VC or however works best for you.I think the next step (other than a couple minor questions here) is for me to put up an initial sketch of a CL (and as a Pri-2, try to get some of the conclusions here captured in the design doc). Sound good? Maybe once that CL is up and it shows how poor my understanding is of the PLM space then it'll make sense to get on a VC to correct my misunderstandings ;-)
Thanks,Rick
--
You received this message because you are subscribed to the Google Groups "loading-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to loading-dev...@chromium.org.
6) Duplication of enum definitionDifferent (I think) from WebLoadingBehaviorFlag, the UseCounter::Features histogram changes a LOT. I'd prefer to avoid having multiple definitions of it (beyond the copies we already need in UseCounter.h and histograms.xml - the latter being updated by a script semi-automatically from the former). Do you think it would be OK to just treat our IDs as opaque integer values outside of blink? We could pick a maximum value to share with chromium big enough not to change very often.This sounds reasonable to me.+1 fine with this in the near term - perhaps in the medium term we could move the UseCounter enum's def to a common shared location. Rick do you have a sense for where that might be? In addition to just logging use counters, I can imagine that other page load metrics observers might be interested in only logging metrics if a certain use counter is encountered - for that case it seems really valuable to be able to reference the enum definition in the browser process code. We don't have to solve this problem right away but it'd be ideal to have a path/plan to having a single shared definition we can use from the browser process, like with WebLoadingBehaviorFlag.Yep having it be in some common place makes sense. I would have said WebKit/public/platform like WebLoadingBehaviorFlag except I know that there's some effort to not depend on WebKit from some parts of Chrome. But maybe that's still the right place?I think that's a fine place to put it in the near term, but it may be that as this code integrates with GRC it may be difficult to reference these headers there. Unclear yet, though, so WebKit/public/platform seems ok for the time being.Sorry for the long list. Let me know if it would be better to chat over VC.Don't worry about it, I would be happy to jump on VC about this.Yep, Charles and I can VC if you like. From this thread it sounds like this should be a very doable change. Happy to work through any remaining details over VC or however works best for you.I think the next step (other than a couple minor questions here) is for me to put up an initial sketch of a CL (and as a Pri-2, try to get some of the conclusions here captured in the design doc). Sound good? Maybe once that CL is up and it shows how poor my understanding is of the PLM space then it'll make sense to get on a VC to correct my misunderstandings ;-)SGTM thx!
Thanks,Rick
--
You received this message because you are subscribed to the Google Groups "loading-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to loading-dev+unsubscribe@chromium.org.
Thanks,Rick
To unsubscribe from this group and stop receiving emails from it, send an email to loading-dev...@chromium.org.
Thanks,Rick
To unsubscribe from this group and stop receiving emails from it, send an email to loading-dev+unsubscribe@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/platform-predictability/CADjAqN5OmWzg7D0V4qdyOS8cQkqEWT8SKbdm5yECFKvyzT1BvA%40mail.gmail.com.You received this message because you are subscribed to the Google Groups "platform-predictability" group.
To unsubscribe from this group and stop receiving emails from it, send an email to platform-predictability+unsub...@chromium.org.
To post to this group, send email to platform-predictability@chromium.org.