In the "pages" array, HAR has the fields
"id": "page_0",
"title": "Test Page",
Each element in the "entries" array also has a
"pageref": "page_0",
field.
I'm not seeing a way to get the URL of the top-level HTML resource for
a page. The id/pageref is an identifier, not a URL. Is there a field
for the "url" of the root HTML page somewhere? If not, can we add it to
the "pages" array?
Also, the lowest startedDateTime may point to a redirect to the main
page, e.g. http://example.com/ redirecting to http://www.example.com/
(just an example, but many sites do this).
Since the rendered page is http://www.example.com/ I would like to see
that in the URL field but the resource with the lowest startedDateTime
will be the initial redirect.
I don't think #2 will work either, for the same reason.
One could also argue that "startedDateTime" in the "pages" structure
is redundant for the same reason you point out here; "startedDateTime"
for pages should be the min of all startedDateTimes of resources,
right? But it's convenient to roll that value up in the "pages"
structure. Rolling up the main "url" is convenient as well.
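To make the roll-up concrete, here is a rough sketch of how a page-level "startedDateTime" could be derived from the entries. The function names (parse_har_time, rolled_up_start) are made up for illustration, and the sketch assumes every entry carries an ISO 8601 "startedDateTime" and a "pageref".

```python
from datetime import datetime


def parse_har_time(value):
    # HAR timestamps are ISO 8601. Map a trailing "Z" to an explicit offset
    # so datetime.fromisoformat also accepts it on older Python versions.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def rolled_up_start(har_log, page_id):
    """Return the earliest startedDateTime among the entries of one page.

    Hypothetical helper: har_log is the dict under the top-level "log" key.
    Returns None if no entry references the page.
    """
    entries = [e for e in har_log["entries"] if e.get("pageref") == page_id]
    if not entries:
        return None
    earliest = min(entries, key=lambda e: parse_har_time(e["startedDateTime"]))
    return earliest["startedDateTime"]
```

Note that the entry found this way may be a redirect, which is why the same approach does not directly give the URL of the rendered page.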
Yes, it would be great to include the URL fragment for instance.
>
> That shouldn't be too hard for browser based tools, but what about
> tools like wireshark or fiddler? I'm wondering if it would be
> difficult for them to determine the final URL. Would the optional
> nature of the "url" parameter mean supply it if possible?
Yes, good question. For these tools I think a best effort computation
of the URL is sufficient. We can document that the URL should be the
contents of the location bar if available, otherwise the URL of the
main HTML page, after any redirects. All tools should be able to
compute the latter by looking for the resource with the lowest
startedDateTime and then following redirect chains until they land at
a 200 response. This won't catch meta redirects but it's good enough
for most cases.
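A minimal sketch of that fallback, for tools that cannot see the location bar. It assumes the entries carry "request.url", "response.status" and "response.redirectURL" as in HAR 1.2; the function name best_effort_page_url and the choices made when the chain breaks (a redirect loop, or a target that was never captured) are illustrative assumptions, not something the spec defines.

```python
from datetime import datetime
from urllib.parse import urljoin


def _parse(ts):
    # Accept a trailing "Z" on Python versions whose fromisoformat rejects it.
    return datetime.fromisoformat((ts[:-1] + "+00:00") if ts.endswith("Z") else ts)


def best_effort_page_url(har_log, page_id):
    """Guess the main page URL: start at the earliest entry, follow redirects."""
    entries = [e for e in har_log["entries"] if e.get("pageref") == page_id]
    if not entries:
        return None
    entries.sort(key=lambda e: _parse(e["startedDateTime"]))

    current = entries[0]  # resource with the lowest startedDateTime
    seen = set()
    while True:
        url = current["request"]["url"]
        seen.add(url)
        response = current["response"]
        target = response.get("redirectURL")
        # Stop at the first response that is not a redirect (normally a 200).
        if not (300 <= response["status"] < 400 and target):
            return url
        next_url = urljoin(url, target)  # resolve relative Location values
        if next_url in seen:
            return url  # assumption: give up on a redirect loop
        following = next(
            (e for e in entries if e["request"]["url"] == next_url), None
        )
        if following is None:
            # Assumption: target was not captured, so report where it pointed.
            return next_url
        current = following
```

As noted, this only follows HTTP redirects recorded in the capture; meta refresh and script-driven navigation are not detected.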
This is the URL that people have in the browser BEFORE the browser
starts to load, redirect, execute JS and so on.
It is important in all cases, but the easiest way to see the problem
is to think of it in the context of monitoring or testing tools like
ShowSlow or WebPageTest, where the user enters the URL and then the
test tool launches the browser and so on.
It will not be possible to get that URL in tools like network
sniffers, but they can use heuristics like "first HTML URL" in the
stream or first request on the first connection. Tools that can
provide it should do so.
Now, I'm not 100% sure whether this should be just "url" or whether we
need two items, "intended_url" and "resulting_url", or some combination
of those, but I believe the spec should be clear about which one is
which.
I suggest we have two of them as they have different semantic meanings
- one is the URL that the person wished for and the other is the URL
that they got in the end.
(There is also the whole notion of redirects that comes up in long-term
monitoring, but that's probably off-topic here.)
What do you guys think about this issue?
Sergey
On 19/11/2010, at 3:47 AM, Bryan McQuade wrote:
>
> Yes, good question. For these tools I think a best effort computation
> of the URL is sufficient. We can document that the URL should be the
> contents of the location bar if available, otherwise the URL of the
> main HTML page, after any redirects. All tools should be able to
> compute the latter by looking for the resource with the lowest
> startedDateTime and then following redirect chains until they land at
> a 200 response. This won't catch meta redirects but it's good enough
> for most cases.
--
Mark Nottingham http://www.mnot.net/