Using PageTimingMetricsSender for UseCounters

14 views

Skip to first unread message

Rick Byers

unread,

May 2, 2017, 5:35:36 PM5/2/17

to Bryan McQuade, Luna Lu, platform-predictability, Timothy Dresser, Hiroki Nakagawa, Lucas Gadani, loadi...@chromium.org

Hey Bryan,

I realized UseCounters not working correctly with OOPIF is likely to become urgent, and I'm thinking your suggestion of extending PageTimingMetricsSender to handle use counters is probably the right way to go. But I wanted to ask a few more questions about it to make sure (feel free to point me at any design docs - I didn't find any with a quick search). If the answers to these questions suggest this is likely the right path forward, I can write up a design doc with more detail.

1) Unified notion of "Page visit"?

The trickiest bit of UseCounters is probably correctly correlating to and counting the notion of "page visit". It seems like we should be doing this once for all page visit scoped metrics, and this seems to me like the main advantage of unifying page load and UseCounter metrics. I think our concept of what a "visit" should be essentially the same as the loading team's, right? In particular, we want something that's well correlated to the user's notion of a "page" - i.e. it's largely irrelevant to the user what sort of frame navigations happen while they're using a web page. Conceptually I think "did this page break" and "did this page load fast" are referring to the same scope.

2) Handling of different schemes

What about special schemes like chrome:// and chrome-extension:// - do you treat those the same as https:// from a metrics perspective? Ignoring at least chrome:// seems important for UseCounter (otherwise pages like the NTP can bias our metrics). I think it's OK if we have a different policy here, but it might be useful to share some logic.

3) Mapping frames back to a "page visit"

It looks like it's instances of PageLoadTracker which really define the notion of a "page visit" for you, right? I think this does pretty much what we want for the common case, but there are some edge cases.

In particular, how do you handle the different type of workers today? I think we'd want to continue to delegate the UseCounters from workers to (one of) the page(s) that they're associated with.

Also SVGImages are really special in Blink - each get their own 'Page' instance (which appears as a top-level frame) and I even think one SVGImage can be used by multiple WebContents instances. Do you know how those are handled today from a page load metrics perspective? We treat them specially for UseCounter (separate 'SVGImage' histograms) but I think I'd rather delegate them to some page like workers.

4) Biases from throttling

The once-per-second throttle seems likely to be important for performance, but won't that introduce some biases where metrics are lost for certain types of "page visits"? Eg. how are redirects handled? We could get a renderer fast shutdown on redirect meaning that the renderer process is always killed before it gets a chance to send any metrics, right? Perhaps we could put some "send now" trigger into the navigation code to mitigate?

5) Data volume / cumulative updates

The UseCounter bits are potentially a lot of data. Ignoring the SVG case, we've got about 6000 bits across 6 histograms. We're not going to want to add ~1kB to each update IPC, will we?

Perhaps instead of sending the current BitVector we could just send a vector of ints for the features that have been newly used for a given DocumentLoader since the last update? In our case that will often be 0, but occasionally (first load of each frame) be as many as 100 ints. Or is there some reason it's important that the data on any given update includes all the information from past updates?

6) Duplication of enum definition

Different (I think) from WebLoadingBehaviorFlag, the UseCounter::Features histogram changes a LOT. I'd prefer to avoid having multiple definitions of it (beyond the copies we already need in UseCounter.h and histograms.xml - the latter being updated by a script semi-automatically from the former). Do you think it would be OK to just treat our IDs as opaque integer values outside of blink? We could pick a maximum value to share with chromium big enough not to change very often.

Sorry for the long list. Let me know if it would be better to chat over VC.

Thanks,

Rick

Charles Harrison

unread,

May 2, 2017, 9:55:01 PM5/2/17

to Rick Byers, Bryan McQuade, Luna Lu, platform-predictability, Timothy Dresser, Hiroki Nakagawa, Lucas Gadani, Chromium Loading Performance

Rick, thanks for sending this out. I've left some comments inline. Please see https://docs.google.com/document/d/1HJsJ5W2H_3qRdqPAUgAEo10AF8gXPTXZLUET4X4_sII/edit?usp=sharing for a more general PLM doc.

On Tue, May 2, 2017 at 5:35 PM, Rick Byers <rby...@chromium.org> wrote:

Hey Bryan,
I realized UseCounters not working correctly with OOPIF is likely to become urgent, and I'm thinking your suggestion of extending PageTimingMetricsSender to handle use counters is probably the right way to go. But I wanted to ask a few more questions about it to make sure (feel free to point me at any design docs - I didn't find any with a quick search). If the answers to these questions suggest this is likely the right path forward, I can write up a design doc with more detail.

1) Unified notion of "Page visit"?
The trickiest bit of UseCounters is probably correctly correlating to and counting the notion of "page visit". It seems like we should be doing this once for all page visit scoped metrics, and this seems to me like the main advantage of unifying page load and UseCounter metrics. I think our concept of what a "visit" should be essentially the same as the loading team's, right? In particular, we want something that's well correlated to the user's notion of a "page" - i.e. it's largely irrelevant to the user what sort of frame navigations happen while they're using a web page. Conceptually I think "did this page break" and "did this page load fast" are referring to the same scope.

Yes I think this is true. There are complexities here WRT "page load end" if that's when you want to collect metrics. Often the "end" of a page load from the renderer's perspective will happen during fast shutdown, so there is no time to log metrics.

Loading team solves this by IPCing relevant data to the browser process.

2) Handling of different schemes
What about special schemes like chrome:// and chrome-extension:// - do you treat those the same as https:// from a metrics perspective? Ignoring at least chrome:// seems important for UseCounter (otherwise pages like the NTP can bias our metrics). I think it's OK if we have a different policy here, but it might be useful to share some logic.

We probably restrict things too much here. You can look at page_track_decider for the logic of which pages we track in page_load_metrics code.

3) Mapping frames back to a "page visit"
It looks like it's instances of PageLoadTracker which really define the notion of a "page visit" for you, right? I think this does pretty much what we want for the common case, but there are some edge cases.

In particular, how do you handle the different type of workers today? I think we'd want to continue to delegate the UseCounters from workers to (one of) the page(s) that they're associated with.

We only track main frames, so we don't really deal with workers. Delegating UseCounters seems okay to me.

Also SVGImages are really special in Blink - each get their own 'Page' instance (which appears as a top-level frame) and I even think one SVGImage can be used by multiple WebContents instances. Do you know how those are handled today from a page load metrics perspective? We treat them specially for UseCounter (separate 'SVGImage' histograms) but I think I'd rather delegate them to some page like workers.

I'm less sure about SVGs but I'm fairly sure the page track decider will filter them out.

4) Biases from throttling
The once-per-second throttle seems likely to be important for performance, but won't that introduce some biases where metrics are lost for certain types of "page visits"? Eg. how are redirects handled? We could get a renderer fast shutdown on redirect meaning that the renderer process is always killed before it gets a chance to send any metrics, right? Perhaps we could put some "send now" trigger into the navigation code to mitigate?

The first set of IPCs is throttled to 50ms, and subsequent ones are at 1 second to account for this somewhat. I agree that this will bias pages that end very quickly (e.g. client redirects). If UseCounters hook into our IPC system we could probably have a stronger argument to reduce the throttling.

5) Data volume / cumulative updates
The UseCounter bits are potentially a lot of data. Ignoring the SVG case, we've got about 6000 bits across 6 histograms. We're not going to want to add ~1kB to each update IPC, will we?

We should investigate this. Loading team did an experiment and found in some cases that directly IPCing small resources (<= 1kB I think?) was more efficient than using shared memory. I think the overhead will essentially be 2 1k copies.

Perhaps instead of sending the current BitVector we could just send a vector of ints for the features that have been newly used for a given DocumentLoader since the last update? In our case that will often be 0, but occasionally (first load of each frame) be as many as 100 ints. Or is there some reason it's important that the data on any given update includes all the information from past updates?

This might make sense. Sending a vector<tuple<int /* index */, int /* delta */>> associative array could make sense. IMO it depends on the code complexity tradeoff vs the bit vector copying.

6) Duplication of enum definition
Different (I think) from WebLoadingBehaviorFlag, the UseCounter::Features histogram changes a LOT. I'd prefer to avoid having multiple definitions of it (beyond the copies we already need in UseCounter.h and histograms.xml - the latter being updated by a script semi-automatically from the former). Do you think it would be OK to just treat our IDs as opaque integer values outside of blink? We could pick a maximum value to share with chromium big enough not to change very often.

This sounds reasonable to me.

Sorry for the long list. Let me know if it would be better to chat over VC.

Don't worry about it, I would be happy to jump on VC about this.

Thanks,
Rick

--
You received this message because you are subscribed to the Google Groups "loading-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to loading-dev+unsubscribe@chromium.org.
To post to this group, send email to loadi...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/loading-dev/CAFUtAY8yXzcFw05G_RX%3D%3DWrk8n8yhR5hZfdc7Q4H2J5%2B-CMqSw%40mail.gmail.com.

Bryan McQuade

unread,

May 3, 2017, 11:57:06 AM5/3/17

to Charles Harrison, Rick Byers, Luna Lu, platform-predictability, Timothy Dresser, Hiroki Nakagawa, Lucas Gadani, Chromium Loading Performance

Thanks Charles! Agree with what Charles shared earlier - I'll just add a few more details.

First, I'm really glad to have UseCounter info flowing through page load metrics, as it will also enable page load metrics observers to do things like only log metrics for pages where a certain UseCounter is encountered. In the longer term, I wonder if it makes sense to get rid of WebLoadingBehaviorFlag and migrate everything to use counters. This is something we can consider going forward.

On Tue, May 2, 2017 at 9:55 PM Charles Harrison <cshar...@chromium.org> wrote:

Rick, thanks for sending this out. I've left some comments inline. Please see https://docs.google.com/document/d/1HJsJ5W2H_3qRdqPAUgAEo10AF8gXPTXZLUET4X4_sII/edit?usp=sharing for a more general PLM doc.

Yes. In addition there's https://docs.google.com/document/d/1ApM09ygFdNBHjQogYJ_rUwGyFcCplWZFhcJfsUFkFQM/edit which is intended for consumers who want to add custom page load metrics, but has some useful architecture details (which page loads are tracked and which aren't etc - see the 'Which page loads are tracked by PageLoadMetricsObservers?' section for details).

On Tue, May 2, 2017 at 5:35 PM, Rick Byers <rby...@chromium.org> wrote:
Hey Bryan,
I realized UseCounters not working correctly with OOPIF is likely to become urgent, and I'm thinking your suggestion of extending PageTimingMetricsSender to handle use counters is probably the right way to go. But I wanted to ask a few more questions about it to make sure (feel free to point me at any design docs - I didn't find any with a quick search). If the answers to these questions suggest this is likely the right path forward, I can write up a design doc with more detail.

This sounds great. I think existing page load metrics implementers will benefit from being able to determine which use counters were activated for a given page.

1) Unified notion of "Page visit"?
The trickiest bit of UseCounters is probably correctly correlating to and counting the notion of "page visit". It seems like we should be doing this once for all page visit scoped metrics, and this seems to me like the main advantage of unifying page load and UseCounter metrics. I think our concept of what a "visit" should be essentially the same as the loading team's, right? In particular, we want something that's well correlated to the user's notion of a "page" - i.e. it's largely irrelevant to the user what sort of frame navigations happen while they're using a web page. Conceptually I think "did this page break" and "did this page load fast" are referring to the same scope.
Yes I think this is true. There are complexities here WRT "page load end" if that's when you want to collect metrics. Often the "end" of a page load from the renderer's perspective will happen during fast shutdown, so there is no time to log metrics.

Loading team solves this by IPCing relevant data to the browser process.

RE: fast shutdown, it's correct that any process that's killed by fast shutdown within 1 second will not get to report any data buffered in the render process. We decided that this was a reasonable tradeoff between accuracy and IPC overhead.

Our thinking here is that if something happens just 1 second before the page dies, then it's unclear that could have had a meaningful impact on the user experience, and thus maybe it's actually reasonable (or better) to not really count it as having happened. This may not be valid for some uses, in particular any features that persist information for longer than the lifetime of that page, but it's been an ok tradeoff so far.

We can definitely revisit the 1 second threshold, shortening it etc, if we think that might be important to do so. One thing I could do is run a field trial with varying thresholds and see how much of a difference there is in resulting metrics, to find the optimal tradeoff between delay and accuracy.

2) Handling of different schemes
What about special schemes like chrome:// and chrome-extension:// - do you treat those the same as https:// from a metrics perspective? Ignoring at least chrome:// seems important for UseCounter (otherwise pages like the NTP can bias our metrics). I think it's OK if we have a different policy here, but it might be useful to share some logic.
We probably restrict things too much here. You can look at page_track_decider for the logic of which pages we track in page_load_metrics code.

Right, as Charles notes we ignore some types of commits in the main frame. In particular we ignore non http/https pages.

The policy for this can be found here:

https://cs.chromium.org/chromium/src/chrome/common/page_load_metrics/page_track_decider.cc

With implementations of the various policy methods:

render side: https://cs.chromium.org/chromium/src/chrome/renderer/page_load_metrics/renderer_page_track_decider.cc

browser side: https://cs.chromium.org/chromium/src/chrome/browser/page_load_metrics/browser_page_track_decider.cc

This is also explained a bit in the earlier linked doc.

3) Mapping frames back to a "page visit"
It looks like it's instances of PageLoadTracker which really define the notion of a "page visit" for you, right? I think this does pretty much what we want for the common case, but there are some edge cases.

In particular, how do you handle the different type of workers today? I think we'd want to continue to delegate the UseCounters from workers to (one of) the page(s) that they're associated with.
We only track main frames, so we don't really deal with workers. Delegating UseCounters seems okay to me.

Right - page load metrics as it exists has no direct knowledge of workers. We can think about adding support for them, or perhaps we can track usecounters for workers in a separate custom way, specific to workers, outside of the page load metrics code.

Also SVGImages are really special in Blink - each get their own 'Page' instance (which appears as a top-level frame) and I even think one SVGImage can be used by multiple WebContents instances. Do you know how those are handled today from a page load metrics perspective? We treat them specially for UseCounter (separate 'SVGImage' histograms) but I think I'd rather delegate them to some page like workers.
I'm less sure about SVGs but I'm fairly sure the page track decider will filter them out.

That's right - our code to filter out non HTML/XHTML mime types will cause us to ignore SVG documents altogether. Our general policy has been to relax the constraints of what we ignore as consumers have a need for it, so we could add support for tracking SVG if you need it.

The other thing to know is that we don't do anything special for same-page navigations (e.g. #anchor navs, pushState, etc) - we just treat those as the same 'page' as the non-same-page nav it was associated with. Hopefully that's ok for / consistent with the existing UseCounter implementation.

4) Biases from throttling
The once-per-second throttle seems likely to be important for performance, but won't that introduce some biases where metrics are lost for certain types of "page visits"? Eg. how are redirects handled? We could get a renderer fast shutdown on redirect meaning that the renderer process is always killed before it gets a chance to send any metrics, right? Perhaps we could put some "send now" trigger into the navigation code to mitigate?
The first set of IPCs is throttled to 50ms, and subsequent ones are at 1 second to account for this somewhat. I agree that this will bias pages that end very quickly (e.g. client redirects). If UseCounters hook into our IPC system we could probably have a stronger argument to reduce the throttling.

Right - the 50ms initial delay is intended to report anything that happens shortly after commit, to catch things like pages that abort quickly. Of course, if the abort happens in less than 50ms we may miss it. We can similarly modify this initial delay if needed. Definitely open to suggestions on how to improve - it'd be helpful to identify certain cases that we think are important, verify they are dropped by the current impl, and then figure out how to address them. We could also just run a field trial with very aggressive flushing (e.g. 1ms delay) to see if we see significant differences in metrics as a result, to understand how much room there is for improvement here.

5) Data volume / cumulative updates
The UseCounter bits are potentially a lot of data. Ignoring the SVG case, we've got about 6000 bits across 6 histograms. We're not going to want to add ~1kB to each update IPC, will we?
We should investigate this. Loading team did an experiment and found in some cases that directly IPCing small resources (<= 1kB I think?) was more efficient than using shared memory. I think the overhead will essentially be 2 1k copies.

Perhaps instead of sending the current BitVector we could just send a vector of ints for the features that have been newly used for a given DocumentLoader since the last update? In our case that will often be 0, but occasionally (first load of each frame) be as many as 100 ints. Or is there some reason it's important that the data on any given update includes all the information from past updates?
This might make sense. Sending a vector<tuple<int /* index */, int /* delta */>> associative array could make sense. IMO it depends on the code complexity tradeoff vs the bit vector copying.

Yeah, we could both use this encoding and only send deltas from the previous state to minimize the amount of data being sent. I'm fine with doing this or whatever else we think makes sense here.

6) Duplication of enum definition
Different (I think) from WebLoadingBehaviorFlag, the UseCounter::Features histogram changes a LOT. I'd prefer to avoid having multiple definitions of it (beyond the copies we already need in UseCounter.h and histograms.xml - the latter being updated by a script semi-automatically from the former). Do you think it would be OK to just treat our IDs as opaque integer values outside of blink? We could pick a maximum value to share with chromium big enough not to change very often.
This sounds reasonable to me.

+1 fine with this in the near term - perhaps in the medium term we could move the UseCounter enum's def to a common shared location. Rick do you have a sense for where that might be? In addition to just logging use counters, I can imagine that other page load metrics observers might be interested in only logging metrics if a certain use counter is encountered - for that case it seems really valuable to be able to reference the enum definition in the browser process code. We don't have to solve this problem right away but it'd be ideal to have a path/plan to having a single shared definition we can use from the browser process, like with WebLoadingBehaviorFlag.

Sorry for the long list. Let me know if it would be better to chat over VC.
Don't worry about it, I would be happy to jump on VC about this.

Yep, Charles and I can VC if you like. From this thread it sounds like this should be a very doable change. Happy to work through any remaining details over VC or however works best for you.

Thanks,
Rick

--
You received this message because you are subscribed to the Google Groups "loading-dev" group.

To unsubscribe from this group and stop receiving emails from it, send an email to loading-dev...@chromium.org.

Bryan McQuade

unread,

May 3, 2017, 11:59:23 AM5/3/17

to Charles Harrison, Rick Byers, Luna Lu, platform-predictability, Timothy Dresser, Hiroki Nakagawa, Lucas Gadani, Chromium Loading Performance

On Wed, May 3, 2017 at 11:56 AM Bryan McQuade <bmcq...@chromium.org> wrote:

Thanks Charles! Agree with what Charles shared earlier - I'll just add a few more details.

First, I'm really glad to have UseCounter info flowing through page load metrics, as it will also enable page load metrics observers to do things like only log metrics for pages where a certain UseCounter is encountered. In the longer term, I wonder if it makes sense to get rid of WebLoadingBehaviorFlag and migrate everything to use counters. This is something we can consider going forward.

On Tue, May 2, 2017 at 9:55 PM Charles Harrison <cshar...@chromium.org> wrote:
Rick, thanks for sending this out. I've left some comments inline. Please see https://docs.google.com/document/d/1HJsJ5W2H_3qRdqPAUgAEo10AF8gXPTXZLUET4X4_sII/edit?usp=sharing for a more general PLM doc.

Yes. In addition there's https://docs.google.com/document/d/1ApM09ygFdNBHjQogYJ_rUwGyFcCplWZFhcJfsUFkFQM/edit which is intended for consumers who want to add custom page load metrics, but has some useful architecture details (which page loads are tracked and which aren't etc - see the 'Which page loads are tracked by PageLoadMetricsObservers?' section for details).

On Tue, May 2, 2017 at 5:35 PM, Rick Byers <rby...@chromium.org> wrote:
Hey Bryan,
I realized UseCounters not working correctly with OOPIF is likely to become urgent, and I'm thinking your suggestion of extending PageTimingMetricsSender to handle use counters is probably the right way to go. But I wanted to ask a few more questions about it to make sure (feel free to point me at any design docs - I didn't find any with a quick search). If the answers to these questions suggest this is likely the right path forward, I can write up a design doc with more detail.

This sounds great. I think existing page load metrics implementers will benefit from being able to determine which use counters were activated for a given page.

1) Unified notion of "Page visit"?
The trickiest bit of UseCounters is probably correctly correlating to and counting the notion of "page visit". It seems like we should be doing this once for all page visit scoped metrics, and this seems to me like the main advantage of unifying page load and UseCounter metrics. I think our concept of what a "visit" should be essentially the same as the loading team's, right? In particular, we want something that's well correlated to the user's notion of a "page" - i.e. it's largely irrelevant to the user what sort of frame navigations happen while they're using a web page. Conceptually I think "did this page break" and "did this page load fast" are referring to the same scope.
Yes I think this is true. There are complexities here WRT "page load end" if that's when you want to collect metrics. Often the "end" of a page load from the renderer's perspective will happen during fast shutdown, so there is no time to log metrics.

Loading team solves this by IPCing relevant data to the browser process.

RE: fast shutdown, it's correct that any process that's killed by fast shutdown within 1 second will not get to report any data buffered in the render process. We decided that this was a reasonable tradeoff between accuracy and IPC overhead.

Our thinking here is that if something happens just 1 second before the page dies, then it's unclear that could have had a meaningful impact on the user experience, and thus maybe it's actually reasonable (or better) to not really count it as having happened. This may not be valid for some uses, in particular any features that persist information for longer than the lifetime of that page, but it's been an ok tradeoff so far.

We can definitely revisit the 1 second threshold, shortening it etc, if we think that might be important to do so. One thing I could do is run a field trial with varying thresholds and see how much of a difference there is in resulting metrics, to find the optimal tradeoff between delay and accuracy.

One more detail here: for non-fast-shutdown, we do flush any buffered updates in the destructor, so we will report all metrics as long as the full destruction path is allowed to run.

Bryan McQuade

unread,

May 3, 2017, 9:23:01 PM5/3/17

to Charles Harrison, Rick Byers, Luna Lu, platform-predictability, Timothy Dresser, Hiroki Nakagawa, Lucas Gadani, Chromium Loading Performance

One more thought: adding the use counter data to PageLoadMetadata, as a peer to WebLoadingBehaviorFlag, is probably the easiest path forward here. We already send this struct for both main and child frames, so all we should have to do is plumb the use counter data out from blink to MetricsRenderFrameObserver and we should get the rest for free, more or less.

Rick Byers

unread,

May 8, 2017, 1:43:22 PM5/8/17

to Bryan McQuade, Charles Harrison, Luna Lu, platform-predictability, Timothy Dresser, Hiroki Nakagawa, Lucas Gadani, Chromium Loading Performance

This is great guys, thank you!

I think this is clear enough that I can start working on an initial CL to sketch this out.

A couple responses inline. And one more question: What sort of testing do you have in place? For UseCounters we've had only unit testing, relying on some manual ad-hoc testing and post-hoc data analysis to convince us it's likely working as intended. I'm not sure but I had been starting to think it could be worth the effort to create some sort of content_browsertest framework so we can get more integration testing of the historically problematic cases. Then again, since we'll always have to do the data analysis to convince ourselves the metrics are telling the truth, perhaps having better integration tests isn't that high priority?

On Wed, May 3, 2017 at 9:22 PM, Bryan McQuade <bmcq...@chromium.org> wrote:

One more thought: adding the use counter data to PageLoadMetadata, as a peer to WebLoadingBehaviorFlag, is probably the easiest path forward here. We already send this struct for both main and child frames, so all we should have to do is plumb the use counter data out from blink to MetricsRenderFrameObserver and we should get the rest for free, more or less.

Yeah that's what I was thinking too. Luna and I have started to sketch out some design details here. The key open question I have about this is if we're going to have the IPC contain only a list of the features hit since the previous IPC, where do we convert that to a full bit-set of all features hit since page load? I see two obvious choices:

The simplest answer I think is to do that inside our own PageLoadMetricsObserver instance which is responsible for generating the histograms. Then PageLoadMetadata (and hence PageLoadExtraInfo) contains only the small list of newly hit features. This gives us everything we need at the moment, but then this information isn't terribly useful by itself in, say a PageLoadMetricsObserver::OnComplete call.
Alternately I like your idea that other metrics observers should be able to ask what features have been hit, which suggests that maybe PageLoadTracker should be maintaining the bit vector of seen features itself and making it available in the PageLoadExtraInfo separately from the PageLoadMetadata. This would suggest that perhaps the "newly observed features list" shouldn't actually be a member of PageLoadMetadata but a peer to it.

Any thoughts? We could always start with #1 now and plan to migrate to the more complicated #2 solution once things are clearly. In particular, I can unblock my urgent UKM work with just #1.

On Wed, May 3, 2017 at 11:59 AM Bryan McQuade <bmcq...@chromium.org> wrote:
On Wed, May 3, 2017 at 11:56 AM Bryan McQuade <bmcq...@chromium.org> wrote:
Thanks Charles! Agree with what Charles shared earlier - I'll just add a few more details.

First, I'm really glad to have UseCounter info flowing through page load metrics, as it will also enable page load metrics observers to do things like only log metrics for pages where a certain UseCounter is encountered. In the longer term, I wonder if it makes sense to get rid of WebLoadingBehaviorFlag and migrate everything to use counters. This is something we can consider going forward.

On Tue, May 2, 2017 at 9:55 PM Charles Harrison <cshar...@chromium.org> wrote:
Rick, thanks for sending this out. I've left some comments inline. Please see https://docs.google.com/document/d/1HJsJ5W2H_3qRdqPAUgAEo10AF8gXPTXZLUET4X4_sII/edit?usp=sharing for a more general PLM doc.

Yes. In addition there's https://docs.google.com/document/d/1ApM09ygFdNBHjQogYJ_rUwGyFcCplWZFhcJfsUFkFQM/edit which is intended for consumers who want to add custom page load metrics, but has some useful architecture details (which page loads are tracked and which aren't etc - see the 'Which page loads are tracked by PageLoadMetricsObservers?' section for details).

These are great, thank you!

On Tue, May 2, 2017 at 5:35 PM, Rick Byers <rby...@chromium.org> wrote:
Hey Bryan,
I realized UseCounters not working correctly with OOPIF is likely to become urgent, and I'm thinking your suggestion of extending PageTimingMetricsSender to handle use counters is probably the right way to go. But I wanted to ask a few more questions about it to make sure (feel free to point me at any design docs - I didn't find any with a quick search). If the answers to these questions suggest this is likely the right path forward, I can write up a design doc with more detail.

This sounds great. I think existing page load metrics implementers will benefit from being able to determine which use counters were activated for a given page.

1) Unified notion of "Page visit"?
The trickiest bit of UseCounters is probably correctly correlating to and counting the notion of "page visit". It seems like we should be doing this once for all page visit scoped metrics, and this seems to me like the main advantage of unifying page load and UseCounter metrics. I think our concept of what a "visit" should be essentially the same as the loading team's, right? In particular, we want something that's well correlated to the user's notion of a "page" - i.e. it's largely irrelevant to the user what sort of frame navigations happen while they're using a web page. Conceptually I think "did this page break" and "did this page load fast" are referring to the same scope.
Yes I think this is true. There are complexities here WRT "page load end" if that's when you want to collect metrics. Often the "end" of a page load from the renderer's perspective will happen during fast shutdown, so there is no time to log metrics.

Loading team solves this by IPCing relevant data to the browser process.

RE: fast shutdown, it's correct that any process that's killed by fast shutdown within 1 second will not get to report any data buffered in the render process. We decided that this was a reasonable tradeoff between accuracy and IPC overhead.

Our thinking here is that if something happens just 1 second before the page dies, then it's unclear that could have had a meaningful impact on the user experience, and thus maybe it's actually reasonable (or better) to not really count it as having happened. This may not be valid for some uses, in particular any features that persist information for longer than the lifetime of that page, but it's been an ok tradeoff so far.

Yep that makes sense. I think we'll be able to test this hypothesis by recording both a new UMA histogram along with our existing one and looking for non-trivial differences in the data.

We can definitely revisit the 1 second threshold, shortening it etc, if we think that might be important to do so. One thing I could do is run a field trial with varying thresholds and see how much of a difference there is in resulting metrics, to find the optimal tradeoff between delay and accuracy.

One more detail here: for non-fast-shutdown, we do flush any buffered updates in the destructor, so we will report all metrics as long as the full destruction path is allowed to run.

Ah, great! Do you know if that includes the case of a page navigating itself like the user clicking on a cross-origin <a> link or a JS redirect? We want (and probably have), for example, people to be able to have UseCounters around navigation behavior (eg. how often does a cross-origin iframe navigate the main frame with window.top.location=foo).

2) Handling of different schemes
What about special schemes like chrome:// and chrome-extension:// - do you treat those the same as https:// from a metrics perspective? Ignoring at least chrome:// seems important for UseCounter (otherwise pages like the NTP can bias our metrics). I think it's OK if we have a different policy here, but it might be useful to share some logic.
We probably restrict things too much here. You can look at page_track_decider for the logic of which pages we track in page_load_metrics code.

Right, as Charles notes we ignore some types of commits in the main frame. In particular we ignore non http/https pages.

The policy for this can be found here:
https://cs.chromium.org/chromium/src/chrome/common/page_load_metrics/page_track_decider.cc

Oh nice, so even at first glance I see this correctly handles a case that could be a bug for us - new tab pages served over http.

With implementations of the various policy methods:
render side: https://cs.chromium.org/chromium/src/chrome/renderer/page_load_metrics/renderer_page_track_decider.cc
browser side: https://cs.chromium.org/chromium/src/chrome/browser/page_load_metrics/browser_page_track_decider.cc

This is also explained a bit in the earlier linked doc.

3) Mapping frames back to a "page visit"
It looks like it's instances of PageLoadTracker which really define the notion of a "page visit" for you, right? I think this does pretty much what we want for the common case, but there are some edge cases.

In particular, how do you handle the different type of workers today? I think we'd want to continue to delegate the UseCounters from workers to (one of) the page(s) that they're associated with.
We only track main frames, so we don't really deal with workers. Delegating UseCounters seems okay to me.

Right - page load metrics as it exists has no direct knowledge of workers. We can think about adding support for them, or perhaps we can track usecounters for workers in a separate custom way, specific to workers, outside of the page load metrics code.

Also SVGImages are really special in Blink - each get their own 'Page' instance (which appears as a top-level frame) and I even think one SVGImage can be used by multiple WebContents instances. Do you know how those are handled today from a page load metrics perspective? We treat them specially for UseCounter (separate 'SVGImage' histograms) but I think I'd rather delegate them to some page like workers.
I'm less sure about SVGs but I'm fairly sure the page track decider will filter them out.

That's right - our code to filter out non HTML/XHTML mime types will cause us to ignore SVG documents altogether. Our general policy has been to relax the constraints of what we ignore as consumers have a need for it, so we could add support for tracking SVG if you need it.

Yeah I think we'd want something. We have been deprecating some SVG features so it's important to have visibility into them. Ideally we'd treat them similar to iframes, but they're tricky because (I think) they don't have a single parent frame. So perhaps we need to handle them more like shared workers.

The other thing to know is that we don't do anything special for same-page navigations (e.g. #anchor navs, pushState, etc) - we just treat those as the same 'page' as the non-same-page nav it was associated with. Hopefully that's ok for / consistent with the existing UseCounter implementation.

Yep, that's consistent with what we do already. I've been a little torn on how to handle single-page-app style applications. It seems wrong that a popular site like Facebook moving from a new-page-to-navigate to a same-page-navigation model could have a huge impact on our metrics (naively this suggests are usage metrics are biased in favor of older-style sites over newer ones). But I don't see an obviously better solution (I don't think I want opening each e-mail in GMail to be considered a "page visit"), so as long as this sort of problem is something we could work on together in the PLM infrastructure then I'm not too worried about it.

4) Biases from throttling
The once-per-second throttle seems likely to be important for performance, but won't that introduce some biases where metrics are lost for certain types of "page visits"? Eg. how are redirects handled? We could get a renderer fast shutdown on redirect meaning that the renderer process is always killed before it gets a chance to send any metrics, right? Perhaps we could put some "send now" trigger into the navigation code to mitigate?
The first set of IPCs is throttled to 50ms, and subsequent ones are at 1 second to account for this somewhat. I agree that this will bias pages that end very quickly (e.g. client redirects). If UseCounters hook into our IPC system we could probably have a stronger argument to reduce the throttling.

Right - the 50ms initial delay is intended to report anything that happens shortly after commit, to catch things like pages that abort quickly. Of course, if the abort happens in less than 50ms we may miss it. We can similarly modify this initial delay if needed. Definitely open to suggestions on how to improve - it'd be helpful to identify certain cases that we think are important, verify they are dropped by the current impl, and then figure out how to address them. We could also just run a field trial with very aggressive flushing (e.g. 1ms delay) to see if we see significant differences in metrics as a result, to understand how much room there is for improvement here.

I suggest we start by logging the new UseCounter data in parallel with continuing to log the existing stuff. That should give us an A/B test that's at least as good (if not better than) such a field trial.

5) Data volume / cumulative updates
The UseCounter bits are potentially a lot of data. Ignoring the SVG case, we've got about 6000 bits across 6 histograms. We're not going to want to add ~1kB to each update IPC, will we?
We should investigate this. Loading team did an experiment and found in some cases that directly IPCing small resources (<= 1kB I think?) was more efficient than using shared memory. I think the overhead will essentially be 2 1k copies.

Perhaps instead of sending the current BitVector we could just send a vector of ints for the features that have been newly used for a given DocumentLoader since the last update? In our case that will often be 0, but occasionally (first load of each frame) be as many as 100 ints. Or is there some reason it's important that the data on any given update includes all the information from past updates?
This might make sense. Sending a vector<tuple<int /* index */, int /* delta */>> associative array could make sense. IMO it depends on the code complexity tradeoff vs the bit vector copying.

Yeah, we could both use this encoding and only send deltas from the previous state to minimize the amount of data being sent. I'm fine with doing this or whatever else we think makes sense here.

Why would we need a "delta"? This is just the list of features newly seen during the lifetime of this Page. Despite the (perhaps poorly chosen) class name, we don't actually know the "count" of usage within a given page load (the "count" refers to the number of distinct page loads that used a feature).

6) Duplication of enum definition
Different (I think) from WebLoadingBehaviorFlag, the UseCounter::Features histogram changes a LOT. I'd prefer to avoid having multiple definitions of it (beyond the copies we already need in UseCounter.h and histograms.xml - the latter being updated by a script semi-automatically from the former). Do you think it would be OK to just treat our IDs as opaque integer values outside of blink? We could pick a maximum value to share with chromium big enough not to change very often.
This sounds reasonable to me.

+1 fine with this in the near term - perhaps in the medium term we could move the UseCounter enum's def to a common shared location. Rick do you have a sense for where that might be? In addition to just logging use counters, I can imagine that other page load metrics observers might be interested in only logging metrics if a certain use counter is encountered - for that case it seems really valuable to be able to reference the enum definition in the browser process code. We don't have to solve this problem right away but it'd be ideal to have a path/plan to having a single shared definition we can use from the browser process, like with WebLoadingBehaviorFlag.

Yep having it be in some common place makes sense. I would have said WebKit/public/platform like WebLoadingBehaviorFlag except I know that there's some effort to not depend on WebKit from some parts of Chrome. But maybe that's still the right place?

Sorry for the long list. Let me know if it would be better to chat over VC.
Don't worry about it, I would be happy to jump on VC about this.

Yep, Charles and I can VC if you like. From this thread it sounds like this should be a very doable change. Happy to work through any remaining details over VC or however works best for you.

I think the next step (other than a couple minor questions here) is for me to put up an initial sketch of a CL (and as a Pri-2, try to get some of the conclusions here captured in the design doc). Sound good? Maybe once that CL is up and it shows how poor my understanding is of the PLM space then it'll make sense to get on a VC to correct my misunderstandings ;-)

Thanks,
Rick

--
You received this message because you are subscribed to the Google Groups "loading-dev" group.

To unsubscribe from this group and stop receiving emails from it, send an email to loading-dev+unsubscribe@chromium.org.

Bryan McQuade

unread,

May 8, 2017, 1:59:02 PM5/8/17

to Rick Byers, Charles Harrison, Luna Lu, platform-predictability, Timothy Dresser, Hiroki Nakagawa, Lucas Gadani, Chromium Loading Performance

On Mon, May 8, 2017 at 1:43 PM Rick Byers <rby...@chromium.org> wrote:

This is great guys, thank you!
I think this is clear enough that I can start working on an initial CL to sketch this out.

A couple responses inline. And one more question: What sort of testing do you have in place? For UseCounters we've had only unit testing, relying on some manual ad-hoc testing and post-hoc data analysis to convince us it's likely working as intended. I'm not sure but I had been starting to think it could be worth the effort to create some sort of content_browsertest framework so we can get more integration testing of the historically problematic cases. Then again, since we'll always have to do the data analysis to convince ourselves the metrics are telling the truth, perhaps having better integration tests isn't that high priority?

Yes, we have browsertests in page_load_metrics_browsertest.cc - you could add a few there or create a similar browsertest cc for usecounters.

We tend to do both browser and unit tests. If we had to pick one I think I'd pick browsertests since there have been cases where unittests passed but something wasn't wired up correctly end to end and a bug crept in.

There are some unittests in MetricsWebContentsObserverTest - you could add some browser side unit tests there.

On Wed, May 3, 2017 at 9:22 PM, Bryan McQuade <bmcq...@chromium.org> wrote:
One more thought: adding the use counter data to PageLoadMetadata, as a peer to WebLoadingBehaviorFlag, is probably the easiest path forward here. We already send this struct for both main and child frames, so all we should have to do is plumb the use counter data out from blink to MetricsRenderFrameObserver and we should get the rest for free, more or less.

Yeah that's what I was thinking too. Luna and I have started to sketch out some design details here. The key open question I have about this is if we're going to have the IPC contain only a list of the features hit since the previous IPC, where do we convert that to a full bit-set of all features hit since page load? I see two obvious choices:

The simplest answer I think is to do that inside our own PageLoadMetricsObserver instance which is responsible for generating the histograms. Then PageLoadMetadata (and hence PageLoadExtraInfo) contains only the small list of newly hit features. This gives us everything we need at the moment, but then this information isn't terribly useful by itself in, say a PageLoadMetricsObserver::OnComplete call.

Alternately I like your idea that other metrics observers should be able to ask what features have been hit, which suggests that maybe PageLoadTracker should be maintaining the bit vector of seen features itself and making it available in the PageLoadExtraInfo separately from the PageLoadMetadata. This would suggest that perhaps the "newly observed features list" shouldn't actually be a member of PageLoadMetadata but a peer to it.

(2) would be ideal if we can make it work. I'd be fine with (1) initially to bootstrap and then we can generalize in the future - up to you.

Any thoughts? We could always start with #1 now and plan to migrate to the more complicated #2 solution once things are clearly. In particular, I can unblock my urgent UKM work with just #1.

Ah, I should've kept reading. :) This SGTM.

On Wed, May 3, 2017 at 11:59 AM Bryan McQuade <bmcq...@chromium.org> wrote:
On Wed, May 3, 2017 at 11:56 AM Bryan McQuade <bmcq...@chromium.org> wrote:
Thanks Charles! Agree with what Charles shared earlier - I'll just add a few more details.

First, I'm really glad to have UseCounter info flowing through page load metrics, as it will also enable page load metrics observers to do things like only log metrics for pages where a certain UseCounter is encountered. In the longer term, I wonder if it makes sense to get rid of WebLoadingBehaviorFlag and migrate everything to use counters. This is something we can consider going forward.

On Tue, May 2, 2017 at 9:55 PM Charles Harrison <cshar...@chromium.org> wrote:
Rick, thanks for sending this out. I've left some comments inline. Please see https://docs.google.com/document/d/1HJsJ5W2H_3qRdqPAUgAEo10AF8gXPTXZLUET4X4_sII/edit?usp=sharing for a more general PLM doc.

Yes. In addition there's https://docs.google.com/document/d/1ApM09ygFdNBHjQogYJ_rUwGyFcCplWZFhcJfsUFkFQM/edit which is intended for consumers who want to add custom page load metrics, but has some useful architecture details (which page loads are tracked and which aren't etc - see the 'Which page loads are tracked by PageLoadMetricsObservers?' section for details).

These are great, thank you!

On Tue, May 2, 2017 at 5:35 PM, Rick Byers <rby...@chromium.org> wrote:
Hey Bryan,
I realized UseCounters not working correctly with OOPIF is likely to become urgent, and I'm thinking your suggestion of extending PageTimingMetricsSender to handle use counters is probably the right way to go. But I wanted to ask a few more questions about it to make sure (feel free to point me at any design docs - I didn't find any with a quick search). If the answers to these questions suggest this is likely the right path forward, I can write up a design doc with more detail.

This sounds great. I think existing page load metrics implementers will benefit from being able to determine which use counters were activated for a given page.

1) Unified notion of "Page visit"?
The trickiest bit of UseCounters is probably correctly correlating to and counting the notion of "page visit". It seems like we should be doing this once for all page visit scoped metrics, and this seems to me like the main advantage of unifying page load and UseCounter metrics. I think our concept of what a "visit" should be essentially the same as the loading team's, right? In particular, we want something that's well correlated to the user's notion of a "page" - i.e. it's largely irrelevant to the user what sort of frame navigations happen while they're using a web page. Conceptually I think "did this page break" and "did this page load fast" are referring to the same scope.
Yes I think this is true. There are complexities here WRT "page load end" if that's when you want to collect metrics. Often the "end" of a page load from the renderer's perspective will happen during fast shutdown, so there is no time to log metrics.

Loading team solves this by IPCing relevant data to the browser process.

RE: fast shutdown, it's correct that any process that's killed by fast shutdown within 1 second will not get to report any data buffered in the render process. We decided that this was a reasonable tradeoff between accuracy and IPC overhead.

Our thinking here is that if something happens just 1 second before the page dies, then it's unclear that could have had a meaningful impact on the user experience, and thus maybe it's actually reasonable (or better) to not really count it as having happened. This may not be valid for some uses, in particular any features that persist information for longer than the lifetime of that page, but it's been an ok tradeoff so far.

Yep that makes sense. I think we'll be able to test this hypothesis by recording both a new UMA histogram along with our existing one and looking for non-trivial differences in the data.

+1 this sounds like a great approach

We can definitely revisit the 1 second threshold, shortening it etc, if we think that might be important to do so. One thing I could do is run a field trial with varying thresholds and see how much of a difference there is in resulting metrics, to find the optimal tradeoff between delay and accuracy.

One more detail here: for non-fast-shutdown, we do flush any buffered updates in the destructor, so we will report all metrics as long as the full destruction path is allowed to run.

Ah, great! Do you know if that includes the case of a page navigating itself like the user clicking on a cross-origin <a> link or a JS redirect? We want (and probably have), for example, people to be able to have UseCounters around navigation behavior (eg. how often does a cross-origin iframe navigate the main frame with window.top.location=foo).

I believe fast shutdown makes it hard to track these cases well. We may need to flag some usecounters as 'urgent' and either use a very short timeout or flush them right away to address this.

Once we have the initial implementation in place, we could run a field trial where we dial the flush delay down to 1ms and looking for counters with significant differences would be a good way to see if this ends up being a real problem (I expect it may be for some counters).

2) Handling of different schemes
What about special schemes like chrome:// and chrome-extension:// - do you treat those the same as https:// from a metrics perspective? Ignoring at least chrome:// seems important for UseCounter (otherwise pages like the NTP can bias our metrics). I think it's OK if we have a different policy here, but it might be useful to share some logic.
We probably restrict things too much here. You can look at page_track_decider for the logic of which pages we track in page_load_metrics code.

Right, as Charles notes we ignore some types of commits in the main frame. In particular we ignore non http/https pages.

The policy for this can be found here:
https://cs.chromium.org/chromium/src/chrome/common/page_load_metrics/page_track_decider.cc

Oh nice, so even at first glance I see this correctly handles a case that could be a bug for us - new tab pages served over http.

With implementations of the various policy methods:
render side: https://cs.chromium.org/chromium/src/chrome/renderer/page_load_metrics/renderer_page_track_decider.cc
browser side: https://cs.chromium.org/chromium/src/chrome/browser/page_load_metrics/browser_page_track_decider.cc

This is also explained a bit in the earlier linked doc.

3) Mapping frames back to a "page visit"
It looks like it's instances of PageLoadTracker which really define the notion of a "page visit" for you, right? I think this does pretty much what we want for the common case, but there are some edge cases.

In particular, how do you handle the different type of workers today? I think we'd want to continue to delegate the UseCounters from workers to (one of) the page(s) that they're associated with.
We only track main frames, so we don't really deal with workers. Delegating UseCounters seems okay to me.

Right - page load metrics as it exists has no direct knowledge of workers. We can think about adding support for them, or perhaps we can track usecounters for workers in a separate custom way, specific to workers, outside of the page load metrics code.

Also SVGImages are really special in Blink - each get their own 'Page' instance (which appears as a top-level frame) and I even think one SVGImage can be used by multiple WebContents instances. Do you know how those are handled today from a page load metrics perspective? We treat them specially for UseCounter (separate 'SVGImage' histograms) but I think I'd rather delegate them to some page like workers.
I'm less sure about SVGs but I'm fairly sure the page track decider will filter them out.

That's right - our code to filter out non HTML/XHTML mime types will cause us to ignore SVG documents altogether. Our general policy has been to relax the constraints of what we ignore as consumers have a need for it, so we could add support for tracking SVG if you need it.

Yeah I think we'd want something. We have been deprecating some SVG features so it's important to have visibility into them. Ideally we'd treat them similar to iframes, but they're tricky because (I think) they don't have a single parent frame. So perhaps we need to handle them more like shared workers.

Sure, we an generalize to support SVG documents for you all - we've added one off support for mhtml pages for the offline pages team already - svg can be a second case here.

The other thing to know is that we don't do anything special for same-page navigations (e.g. #anchor navs, pushState, etc) - we just treat those as the same 'page' as the non-same-page nav it was associated with. Hopefully that's ok for / consistent with the existing UseCounter implementation.

Yep, that's consistent with what we do already. I've been a little torn on how to handle single-page-app style applications. It seems wrong that a popular site like Facebook moving from a new-page-to-navigate to a same-page-navigation model could have a huge impact on our metrics (naively this suggests are usage metrics are biased in favor of older-style sites over newer ones). But I don't see an obviously better solution (I don't think I want opening each e-mail in GMail to be considered a "page visit"), so as long as this sort of problem is something we could work on together in the PLM infrastructure then I'm not too worried about it.

4) Biases from throttling
The once-per-second throttle seems likely to be important for performance, but won't that introduce some biases where metrics are lost for certain types of "page visits"? Eg. how are redirects handled? We could get a renderer fast shutdown on redirect meaning that the renderer process is always killed before it gets a chance to send any metrics, right? Perhaps we could put some "send now" trigger into the navigation code to mitigate?
The first set of IPCs is throttled to 50ms, and subsequent ones are at 1 second to account for this somewhat. I agree that this will bias pages that end very quickly (e.g. client redirects). If UseCounters hook into our IPC system we could probably have a stronger argument to reduce the throttling.

Right - the 50ms initial delay is intended to report anything that happens shortly after commit, to catch things like pages that abort quickly. Of course, if the abort happens in less than 50ms we may miss it. We can similarly modify this initial delay if needed. Definitely open to suggestions on how to improve - it'd be helpful to identify certain cases that we think are important, verify they are dropped by the current impl, and then figure out how to address them. We could also just run a field trial with very aggressive flushing (e.g. 1ms delay) to see if we see significant differences in metrics as a result, to understand how much room there is for improvement here.

I suggest we start by logging the new UseCounter data in parallel with continuing to log the existing stuff. That should give us an A/B test that's at least as good (if not better than) such a field trial.

Sounds great.

5) Data volume / cumulative updates
The UseCounter bits are potentially a lot of data. Ignoring the SVG case, we've got about 6000 bits across 6 histograms. We're not going to want to add ~1kB to each update IPC, will we?
We should investigate this. Loading team did an experiment and found in some cases that directly IPCing small resources (<= 1kB I think?) was more efficient than using shared memory. I think the overhead will essentially be 2 1k copies.

Perhaps instead of sending the current BitVector we could just send a vector of ints for the features that have been newly used for a given DocumentLoader since the last update? In our case that will often be 0, but occasionally (first load of each frame) be as many as 100 ints. Or is there some reason it's important that the data on any given update includes all the information from past updates?
This might make sense. Sending a vector<tuple<int /* index */, int /* delta */>> associative array could make sense. IMO it depends on the code complexity tradeoff vs the bit vector copying.

Yeah, we could both use this encoding and only send deltas from the previous state to minimize the amount of data being sent. I'm fine with doing this or whatever else we think makes sense here.

Why would we need a "delta"? This is just the list of features newly seen during the lifetime of this Page. Despite the (perhaps poorly chosen) class name, we don't actually know the "count" of usage within a given page load (the "count" refers to the number of distinct page loads that used a feature).

Oh, I was thinking a delta of features observed since we sent the last update. So for example on a single page if we observe

usecounter a

usecounter b

usecounter c

flush IPC 1

usecounter d

usecounter e

flush IPC 2

in flush IPC 2, do we want to send a, b, c, d, e or just d, e and let the client-side reconstruct the full list, having already observed a, b, c in an earlier IPC? The latter would result in smaller IPCs but not sure if it results in a real perf benefits. I'd suggest starting simple by sending everything observed on the page in each IPC & revisiting if the IPCs start to get very large.

6) Duplication of enum definition
Different (I think) from WebLoadingBehaviorFlag, the UseCounter::Features histogram changes a LOT. I'd prefer to avoid having multiple definitions of it (beyond the copies we already need in UseCounter.h and histograms.xml - the latter being updated by a script semi-automatically from the former). Do you think it would be OK to just treat our IDs as opaque integer values outside of blink? We could pick a maximum value to share with chromium big enough not to change very often.
This sounds reasonable to me.

+1 fine with this in the near term - perhaps in the medium term we could move the UseCounter enum's def to a common shared location. Rick do you have a sense for where that might be? In addition to just logging use counters, I can imagine that other page load metrics observers might be interested in only logging metrics if a certain use counter is encountered - for that case it seems really valuable to be able to reference the enum definition in the browser process code. We don't have to solve this problem right away but it'd be ideal to have a path/plan to having a single shared definition we can use from the browser process, like with WebLoadingBehaviorFlag.

Yep having it be in some common place makes sense. I would have said WebKit/public/platform like WebLoadingBehaviorFlag except I know that there's some effort to not depend on WebKit from some parts of Chrome. But maybe that's still the right place?

I think that's a fine place to put it in the near term, but it may be that as this code integrates with GRC it may be difficult to reference these headers there. Unclear yet, though, so WebKit/public/platform seems ok for the time being.

Sorry for the long list. Let me know if it would be better to chat over VC.
Don't worry about it, I would be happy to jump on VC about this.

Yep, Charles and I can VC if you like. From this thread it sounds like this should be a very doable change. Happy to work through any remaining details over VC or however works best for you.

I think the next step (other than a couple minor questions here) is for me to put up an initial sketch of a CL (and as a Pri-2, try to get some of the conclusions here captured in the design doc). Sound good? Maybe once that CL is up and it shows how poor my understanding is of the PLM space then it'll make sense to get on a VC to correct my misunderstandings ;-)

SGTM thx!

Thanks,
Rick

--
You received this message because you are subscribed to the Google Groups "loading-dev" group.

To unsubscribe from this group and stop receiving emails from it, send an email to loading-dev...@chromium.org.

Rick Byers

unread,

May 8, 2017, 2:20:52 PM5/8/17

to Bryan McQuade, Charles Harrison, Luna Lu, platform-predictability, Timothy Dresser, Hiroki Nakagawa, Lucas Gadani, Chromium Loading Performance

Oh interesting, I was definitely thinking we would want to do the latter. The most common case is probably a page that hits ~200 different counters during load, and then no others (or just one or two) for the duration of the page. For example, we have a "feature" for each and every CSS property, so things like "display", "position" etc. are virtually always hit. So it seems like a pretty inefficient design to do the former (the vast majority of data sent would be redundant/useless). Do you feel like that would add a lot of extra complexity? It doesn't seem too bad to me (compared to the challenge of trying to decide what exactly meaningfully counts as "get very large"), but I haven't totally figured out what the option (2) above (i.e. sibling to PageLoadMetadata not a member of it) would look like yet. Maybe I should try to sketch out that first and see if we can agree on something that is relatively clean? If it starts to feel really hard/ugly then we could fall back to the less efficient approach to keep things simpler for now. WDYT?

6) Duplication of enum definition
Different (I think) from WebLoadingBehaviorFlag, the UseCounter::Features histogram changes a LOT. I'd prefer to avoid having multiple definitions of it (beyond the copies we already need in UseCounter.h and histograms.xml - the latter being updated by a script semi-automatically from the former). Do you think it would be OK to just treat our IDs as opaque integer values outside of blink? We could pick a maximum value to share with chromium big enough not to change very often.
This sounds reasonable to me.

+1 fine with this in the near term - perhaps in the medium term we could move the UseCounter enum's def to a common shared location. Rick do you have a sense for where that might be? In addition to just logging use counters, I can imagine that other page load metrics observers might be interested in only logging metrics if a certain use counter is encountered - for that case it seems really valuable to be able to reference the enum definition in the browser process code. We don't have to solve this problem right away but it'd be ideal to have a path/plan to having a single shared definition we can use from the browser process, like with WebLoadingBehaviorFlag.

Yep having it be in some common place makes sense. I would have said WebKit/public/platform like WebLoadingBehaviorFlag except I know that there's some effort to not depend on WebKit from some parts of Chrome. But maybe that's still the right place?

I think that's a fine place to put it in the near term, but it may be that as this code integrates with GRC it may be difficult to reference these headers there. Unclear yet, though, so WebKit/public/platform seems ok for the time being.

Sorry for the long list. Let me know if it would be better to chat over VC.
Don't worry about it, I would be happy to jump on VC about this.

Yep, Charles and I can VC if you like. From this thread it sounds like this should be a very doable change. Happy to work through any remaining details over VC or however works best for you.

I think the next step (other than a couple minor questions here) is for me to put up an initial sketch of a CL (and as a Pri-2, try to get some of the conclusions here captured in the design doc). Sound good? Maybe once that CL is up and it shows how poor my understanding is of the PLM space then it'll make sense to get on a VC to correct my misunderstandings ;-)

SGTM thx!

Thanks,
Rick

--
You received this message because you are subscribed to the Google Groups "loading-dev" group.

To unsubscribe from this group and stop receiving emails from it, send an email to loading-dev+unsubscribe@chromium.org.

Bryan McQuade

unread,

May 8, 2017, 2:29:36 PM5/8/17

to Rick Byers, Charles Harrison, Luna Lu, platform-predictability, Timothy Dresser, Hiroki Nakagawa, Lucas Gadani, Chromium Loading Performance

Sure, absolutely, I'm supportive of either approach - sending only things that have been observed since the last update in a member of PageLoadMetadata sounds great.

Thanks,

Bryan

Thanks,
Rick

To unsubscribe from this group and stop receiving emails from it, send an email to loading-dev...@chromium.org.

loon...@chromium.org

unread,

Jun 27, 2017, 3:15:54 PM6/27/17

to platform-predictability, rby...@chromium.org, cshar...@chromium.org, tdre...@chromium.org, nhi...@chromium.org, l...@chromium.org, loadi...@chromium.org, bmcq...@chromium.org

Hi Bryan,

Thanks for all your help so far!

I have a short question: I see page load timing / metadata is only sent over IPC when PageTimingMetricsSender is destructed. What happens if a page crashes before that?

My current approach is to keep track of a set of features observed by UseCounters and send the features along with timing / metadata (which means send features upon destruction of PageTimingMetricsSender as well). Does that sound reasonable to you?

Thanks,

Luna

Thanks,
Rick

To unsubscribe from this group and stop receiving emails from it, send an email to loading-dev+unsubscribe@chromium.org.

Charles Harrison

unread,

Jun 27, 2017, 3:20:37 PM6/27/17

to loon...@chromium.org, platform-predictability, Rick Byers, Timothy Dresser, Hiroki Nakagawa, Lucas Gadani, Chromium Loading Performance, Bryan McQuade

Hey Luna,

We should be sending metrics over via a timer, in addition to the destructor.

Luna Lu

unread,

Jun 27, 2017, 3:27:23 PM6/27/17

to Charles Harrison, platform-predictability, Rick Byers, Timothy Dresser, Hiroki Nakagawa, Lucas Gadani, Chromium Loading Performance, Bryan McQuade

Ahh, I see.

I added PageTimingMetricsSender::EnsureSendTimer()

Thanks!

Luna Lu | Software Engineer | loon...@google.com | +1 519 513 5751

You received this message because you are subscribed to the Google Groups "platform-predictability" group.
To unsubscribe from this group and stop receiving emails from it, send an email to platform-predictability+unsub...@chromium.org.
To post to this group, send email to platform-predictability@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/platform-predictability/CADjAqN5OmWzg7D0V4qdyOS8cQkqEWT8SKbdm5yECFKvyzT1BvA%40mail.gmail.com.

Reply all

Reply to author

Forward

0 new messages