Page breakup


Andrew Fleenor

Jan 25, 2011, 4:46:39 PM
to pcap...@googlegroups.com
I've pushed what's basically my best shot at referrer-based page breakup to GitHub on branch "pages". But it's not that great. A few things don't get requested with referrer headers (favicon.ico under Chrome, apparently), and it probably can't distinguish between simultaneous loads of the same or similar pages. The only other methods I can think of are guessing based on the time between requests/responses, and trying to guess whether individual requests are likely to be top-level requests, based on the size and type of their content.

Does anyone else have ideas on how to do this?
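
For reference, here is a minimal sketch of the referrer-following idea. It is not the code on the "pages" branch; the function name and the (url, referer) tuple format are assumptions made for illustration only.

```python
def group_by_referer(requests):
    """Assign each request to a page by following its Referer header.

    `requests` is a list of (url, referer) tuples in capture order;
    referer is None when the header is absent (hypothetical input format).
    Returns a dict mapping each URL to the URL of its page's root request.
    """
    page_of = {}
    for url, referer in requests:
        if referer in page_of:
            # The referrer already belongs to a page, so join that page.
            page_of[url] = page_of[referer]
        else:
            # No referrer, or one we have not seen: start a new page.
            page_of[url] = url
    return page_of
```

A request without a referrer (such as the favicon.ico case) ends up as its own page here, and two simultaneous loads of the same URL are merged, which are exactly the weaknesses described above.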

Ryan Witt

Jan 25, 2011, 5:30:21 PM
to pcap...@googlegroups.com
- You could check whether the new request's cookie matches a cookie from the set of existing domains. This won't catch simultaneous page views that overlap domains, but it should help for the common case of Ajax requests from multiple open tabs or applications.
- Checking the user agent header could also eliminate requests from other applications or browsers.
- Another technique that might work is unzipping/decoding each object already in a page view and applying a regex like http://daringfireball.net/2009/11/liberal_regex_for_matching_urls to pull out URLs that should be included in this page view if they are requested later. Again, this won't work for multiple simultaneous loads of the same page.
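
A rough sketch of the last idea, assuming the response body is available as bytes along with its Content-Encoding value. The pattern below is a deliberately simple stand-in, not the regex linked above, and `extract_urls` is a hypothetical helper name.

```python
import gzip
import re

# Simplified URL pattern (an assumption for illustration); it stops at
# whitespace, quotes, angle brackets, and closing parentheses.
URL_RE = re.compile(r'https?://[^\s"\'<>)]+')

def extract_urls(body, content_encoding=None):
    """Return the set of absolute URLs found in a response body.

    `body` is the raw bytes of the response; `content_encoding` is the
    value of its Content-Encoding header, if any. Only gzip is handled.
    """
    if content_encoding == "gzip":
        body = gzip.decompress(body)
    # Replace undecodable bytes rather than failing on odd charsets.
    text = body.decode("utf-8", errors="replace")
    return set(URL_RE.findall(text))
```

A later request whose URL appears in the set returned for a page's objects would then be a candidate for that page.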

In general, all of this will probably be harder when the root document is missing.

On a related note, is it possible to fill in a "dummy" document when the root object is missing? Does the HAR spec allow this?

--
Ryan Witt
http://onecreativeblog.com

Andrew Fleenor

Jan 27, 2011, 11:22:08 PM
to pcap...@googlegroups.com
Okay. Digging through page content to look for URLs would be time-consuming and of dubious benefit (e.g., links to external sites might be red herrings), but maybe it could be switched on with a command-line flag. I don't think cookies are really being parsed yet.

Regarding dummy documents, I don't think the HAR spec says one way or the other. We would probably be forgiven if we mentioned it in the comment for the entry.
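
As a sketch of what such a placeholder might look like, assuming the entry layout from my reading of HAR 1.2 (worth checking the field names against the spec). The function name and the choice of 0 and -1 for unknown values are assumptions, not anything the spec prescribes for this case.

```python
def placeholder_entry(page_id, url, started):
    """Build a stand-in HAR entry for a root document missing from the capture.

    `page_id` is the id of the page the entry belongs to, `url` is the
    guessed root URL, and `started` is an ISO 8601 timestamp string.
    """
    return {
        "pageref": page_id,
        "startedDateTime": started,
        "time": 0,
        "request": {
            "method": "GET", "url": url, "httpVersion": "unknown",
            "cookies": [], "headers": [], "queryString": [],
            "headersSize": -1, "bodySize": -1,
        },
        "response": {
            "status": 0, "statusText": "", "httpVersion": "unknown",
            "cookies": [], "headers": [],
            "content": {"size": 0, "mimeType": ""},
            "redirectURL": "", "headersSize": -1, "bodySize": -1,
        },
        "cache": {},
        "timings": {"send": 0, "wait": 0, "receive": 0},
        # The comment flags the entry as synthetic for anyone reading the HAR.
        "comment": "Placeholder: root document was not present in the capture.",
    }
```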

Is assuming that a "page load" has ended after a certain amount of time passes between requests a reasonable method, or will something more complicated be required?
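
To make the question concrete, a minimal sketch of the time-gap approach. The 2-second default is an arbitrary assumption, and the input is assumed to be request start times in seconds.

```python
def split_by_gap(timestamps, max_gap=2.0):
    """Split request timestamps (in seconds) into page loads.

    A new page load starts whenever the gap since the previous request
    exceeds `max_gap` seconds. The threshold is a guess, not a tuned value.
    """
    pages = []
    for ts in sorted(timestamps):
        if pages and ts - pages[-1][-1] <= max_gap:
            # Close enough to the previous request: same page load.
            pages[-1].append(ts)
        else:
            # Long silence (or first request): start a new page load.
            pages.append([ts])
    return pages
```

For example, `split_by_gap([0.0, 0.3, 1.1, 6.0, 6.2])` yields two groups, `[0.0, 0.3, 1.1]` and `[6.0, 6.2]`. Pages that keep polling with Ajax, or two tabs loading at once, would still be lumped together by this alone.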