how to identify the "root" resource for a page?

Bryan McQuade

Nov 16, 2010, 3:01:28 PM

to http-archive-...@googlegroups.com

Hi,

In the "pages" array, HAR has the fields
"id": "page_0",
"title": "Test Page",

Each element in the "entries" array also has a
"pageref": "page_0",
field.

I'm not seeing a way to get the URL of the top-level HTML resource for
a page. The id/pageref is an identifier not a URL. Is there a field
for the "url" of the root HTML page somewhere? If not, can we add it to
the "pages" array?
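For readers following along, here is a minimal Python sketch of the linkage described above: entries point at a page only through "pageref", so grouping them gives the set of requests for a page but no indication of which one is the root. The "log" wrapper and "request.url" are assumed from the HAR format rather than quoted in this post, and the request URLs and the helper name entries_by_page are made up for illustration.

```python
from collections import defaultdict

# Minimal HAR-like structure. "id", "title" and "pageref" are the fields
# quoted above; the request URLs are invented for this example.
har = {
    "log": {
        "pages": [{"id": "page_0", "title": "Test Page"}],
        "entries": [
            {"pageref": "page_0", "request": {"url": "http://www.example.com/"}},
            {"pageref": "page_0", "request": {"url": "http://www.example.com/style.css"}},
        ],
    }
}


def entries_by_page(har):
    """Group entries under the page id they reference via "pageref"."""
    grouped = defaultdict(list)
    for entry in har["log"]["entries"]:
        grouped[entry.get("pageref")].append(entry)
    return dict(grouped)


grouped = entries_by_page(har)
# Both entries belong to "page_0", but nothing marks either as the root document.
```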

Jan Odvarko

Nov 18, 2010, 11:07:45 AM

to HTTP Archive Specification

Yes, makes sense to me.

So, the new modified structure would be:

"pages": [
    {
        "startedDateTime": "2009-04-16T12:07:25.123+01:00",
        "id": "page_0",
        "url": "http://www.example.com",
        "title": "Test Page",
        "pageTimings": {...},
        "comment": ""
    }
]

* url [string, optional] - URL of the page. This URL represents the
top-level HTML resource of the page.

Honza
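As a rough illustration of what "optional" means for a consumer, the sketch below reads the proposed "url" field and tolerates its absence. The field is only a proposal at this point in the thread, and the helper name page_url and the second sample page are hypothetical.

```python
def page_url(page):
    """Return the page's "url" if the producer recorded one, else None."""
    url = page.get("url")
    return url or None  # treat a missing or empty value the same way


# A page object shaped like the proposal above.
page_with_url = {
    "startedDateTime": "2009-04-16T12:07:25.123+01:00",
    "id": "page_0",
    "url": "http://www.example.com",
    "title": "Test Page",
}

# A page object as written by a tool that does not emit the proposed field.
page_without_url = {"id": "page_1", "title": "Test Page"}

recorded = page_url(page_with_url)
missing = page_url(page_without_url)
```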

Bryan McQuade

Nov 18, 2010, 11:10:15 AM

to http-archive-...@googlegroups.com

great! can we get this into the 1.2 spec or is that frozen already?

Jan Odvarko

Nov 18, 2010, 11:13:54 AM

to HTTP Archive Specification

1.2 is already frozen, so we need to target 1.3.

Honza

On Nov 18, 5:10 pm, Bryan McQuade <bmcqu...@google.com> wrote:
> great! can we get this into the 1.2 spec or is that frozen already?
>

Bryan McQuade

Nov 18, 2010, 11:15:59 AM

to http-archive-...@googlegroups.com

Ok thank you! I will look for this in 1.3. I think the syntax you
propose looks good (new 'url' field in each entry of the 'pages'
array).

simonp

Nov 18, 2010, 11:17:27 AM

to HTTP Archive Specification

Is that necessary? Couldn't one of the following methods be used
with the existing format:

1. Find the entry with the lowest value of "startedDateTime" for
the page and use its URL

2. Find the entry with the same value of "startedDateTime" as the
page and use its URL
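The two methods above can be sketched roughly as follows in Python. This assumes every entry carries a "startedDateTime" and a "request.url", and that timestamps use a numeric UTC offset that datetime.fromisoformat accepts; the function names and sample data are invented for illustration.

```python
from datetime import datetime


def started(obj):
    """Parse the ISO 8601 "startedDateTime" of a page or entry."""
    return datetime.fromisoformat(obj["startedDateTime"])


def url_of_earliest_entry(page, entries):
    """Method 1: URL of the page's entry with the lowest startedDateTime."""
    own = [e for e in entries if e.get("pageref") == page["id"]]
    if not own:
        return None
    return min(own, key=started)["request"]["url"]


def url_of_entry_starting_with_page(page, entries):
    """Method 2: URL of the entry whose startedDateTime equals the page's."""
    page_start = started(page)
    for entry in entries:
        if entry.get("pageref") == page["id"] and started(entry) == page_start:
            return entry["request"]["url"]
    return None


page = {"id": "page_0", "startedDateTime": "2009-04-16T12:07:25.123+01:00"}
entries = [
    {"pageref": "page_0", "startedDateTime": "2009-04-16T12:07:25.400+01:00",
     "request": {"url": "http://www.example.com/style.css"}},
    {"pageref": "page_0", "startedDateTime": "2009-04-16T12:07:25.123+01:00",
     "request": {"url": "http://www.example.com/"}},
]

method_1 = url_of_earliest_entry(page, entries)
method_2 = url_of_entry_starting_with_page(page, entries)
```

On this sample data both methods pick the same entry; they can diverge when a producer sets the page's "startedDateTime" independently of its entries.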

Simon

Bryan McQuade

Nov 18, 2010, 11:26:51 AM

to http-archive-...@googlegroups.com

Yes and no. First, if resources don't have startedDateTimes, this is
not going to be possible.

Also, lowest startedDateTime may point to a redirect to the main page, e.g.

http://example.com/ redirecting to http://www.example.com/ (just an
example but many sites do this)

Since the rendered page is http://www.example.com/ I would like to see
that in the URL field but the resource with the lowest startedDateTime
will be the initial redirect.

I don't think #2 will work either, for the same reason.

One could also argue that "startedDateTime" in the "pages" structure
is redundant for the same reason you point out here; "startedDateTime"
for pages should be the min of all startedDateTimes of resources,
right? But it's convenient to roll that value up in the "pages"
structure. Rolling up the main "url" is convenient as well.
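To make the "roll up" idea concrete, here is a small Python sketch that derives a page-level start time as the minimum over its entries and compares it with the value stored on the page. Whether producers actually compute the page value this way is not confirmed in this thread; the helper name rolled_up_start and the sample timestamps are illustrative.

```python
from datetime import datetime


def rolled_up_start(page_id, entries):
    """Earliest "startedDateTime" among the entries that reference page_id."""
    times = [
        datetime.fromisoformat(entry["startedDateTime"])
        for entry in entries
        if entry.get("pageref") == page_id
    ]
    return min(times) if times else None


page = {"id": "page_0", "startedDateTime": "2009-04-16T12:07:25.123+01:00"}
entries = [
    {"pageref": "page_0", "startedDateTime": "2009-04-16T12:07:25.400+01:00"},
    {"pageref": "page_0", "startedDateTime": "2009-04-16T12:07:25.123+01:00"},
]

# True when the page-level value is exactly the minimum over its entries.
matches = rolled_up_start("page_0", entries) == datetime.fromisoformat(page["startedDateTime"])
```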

simonp

Nov 18, 2010, 11:43:42 AM

to HTTP Archive Specification

I can see how redirection could cause a problem. Do you think that the
page URL should be the value in the location bar after the page has
loaded? For example, after any redirections, meta refreshes or
JavaScript manipulation of the location.

That shouldn't be too hard for browser-based tools, but what about
tools like Wireshark or Fiddler? I'm wondering if it would be
difficult for them to determine the final URL. Would the optional
nature of the "url" parameter mean "supply it if possible"?

Is it possible for resources not to have startedDateTimes? I thought
that was mandatory in HAR 1.1 and 1.2.

Simon

On Nov 18, 4:26 pm, Bryan McQuade <bmcqu...@google.com> wrote:
> Yes and no. First, if resources don't have startedDateTimes, this is
> not going to be possible.
>
> Also, lowest startedDateTime may point to a redirect to the main page, e.g.
>
> http://example.com/ redirecting to http://www.example.com/ (just an
> example but many sites do this)
>
> Since the rendered page is http://www.example.com/ I would like to see
> that in the URL field but the resource with the lowest startedDateTime
> will be the initial redirect.

Bryan McQuade

Nov 18, 2010, 11:47:26 AM

to http-archive-...@googlegroups.com

On Thu, Nov 18, 2010 at 11:43 AM, simonp <simon....@simtec.ltd.uk> wrote:
> I can see how redirection could cause a problem. Do you think that the
> page URL should be the value in the location bar after the page has
> loaded? For example, after any redirections, meta refreshes or
> javascript manipulation of the location.

Yes, it would be great to include the URL fragment for instance.

>
> That shouldn't be too hard for browser based tools, but what about
> tools like wireshark or fiddler? I'm wondering if it would be
> difficult for them to determine the final URL. Would the optional
> nature of the "url" parameter mean supply it if possible?

Yes, good question. For these tools I think a best effort computation
of the URL is sufficient. We can document that the URL should be the
contents of the location bar if available, otherwise the URL of the
main HTML page, after any redirects. All tools should be able to
compute the latter by looking for the resource with the lowest
startedDateTime and then following redirect chains until they land at
a 200 response. This won't catch meta redirects but it's good enough
for most cases.
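A best-effort version of that fallback might look like the Python sketch below: start from the entry with the lowest "startedDateTime" and follow HTTP redirects until a non-redirect response is reached. It assumes entries expose "response.status" and "response.redirectURL" as in the HAR format; the function name root_url, the max_hops guard, and the sample data are illustrative, and meta refreshes or script-driven navigation are not handled.

```python
from datetime import datetime
from urllib.parse import urljoin

REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def root_url(page_id, entries, max_hops=10):
    """Best-effort URL of the main document for page_id, after redirects."""
    own = [e for e in entries if e.get("pageref") == page_id]
    if not own:
        return None
    by_url = {}
    for entry in own:
        by_url.setdefault(entry["request"]["url"], entry)  # first entry per URL

    current = min(own, key=lambda e: datetime.fromisoformat(e["startedDateTime"]))
    for _ in range(max_hops):  # guard against redirect loops
        response = current["response"]
        target = response.get("redirectURL")
        if response["status"] not in REDIRECT_STATUSES or not target:
            break
        # redirectURL may be relative, so resolve it against the request URL.
        next_url = urljoin(current["request"]["url"], target)
        following = by_url.get(next_url)
        if following is None:
            return next_url  # target not captured; report where the redirect pointed
        current = following
    return current["request"]["url"]


entries = [
    {"pageref": "page_0", "startedDateTime": "2009-04-16T12:07:25.123+01:00",
     "request": {"url": "http://example.com/"},
     "response": {"status": 301, "redirectURL": "http://www.example.com/"}},
    {"pageref": "page_0", "startedDateTime": "2009-04-16T12:07:25.300+01:00",
     "request": {"url": "http://www.example.com/"},
     "response": {"status": 200, "redirectURL": ""}},
    {"pageref": "page_0", "startedDateTime": "2009-04-16T12:07:25.500+01:00",
     "request": {"url": "http://www.example.com/style.css"},
     "response": {"status": 200, "redirectURL": ""}},
]

resolved = root_url("page_0", entries)
```

Returning the unresolved redirect target when the follow-up request is missing from the capture is a design choice of this sketch; a stricter tool could return None there instead.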

Sergey Chernyshev

Nov 18, 2010, 11:30:57 PM

to http-archive-...@googlegroups.com

Actually, I ran into the problem of tools not having a notion of the
"requested URL", i.e. the URL the user wanted to load.

This is the URL that people have in the browser before the browser
starts to load, redirect, execute JS and so on.

It is important in all cases, but the easiest way to see the problem
is to think of it in the context of monitoring or testing tools like
ShowSlow or WebPageTest, where the user enters the URL and then the test
tool launches the browser and so on.

It will not be possible to get that URL for tools like network
sniffers, which can fall back on heuristics like the "first HTML URL" in
the stream or the first request on the first connection, but tools that
can provide it should do so.

Now, I'm not 100% sure if this should be just "url" or we have to have
two items, "intended_url" and "resulting_url" or some combination of
those, but I believe there should be clarity in the spec which one is
which.

I suggest we have two of them as they have different semantic meaning
- one is the URL that person wished for and another one is the URL
that they got in the end.

(There is also the whole notion of redirects that comes up in long-term
monitoring, but that's probably off-topic here.)

What do you guys think about this issue?

Sergey
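To illustrate why the two URLs carry different meaning, here is a small Python sketch of a consumer that compares them. The field names "intended_url" and "resulting_url" are only the tentative names floated in this message, not part of any HAR version, and the helper describe_navigation is hypothetical.

```python
def describe_navigation(page):
    """Classify a page by comparing the URL asked for with the URL reached."""
    intended = page.get("intended_url")
    resulting = page.get("resulting_url")
    if not intended or not resulting:
        return "unknown"  # the producer could not supply one of the two values
    return "direct" if intended == resulting else "redirected"


# The user typed http://example.com/ and the browser ended up on the www host.
redirected_page = {
    "id": "page_0",
    "intended_url": "http://example.com/",
    "resulting_url": "http://www.example.com/",
}

# A network sniffer that only knows where the page ended up.
sniffed_page = {"id": "page_1", "resulting_url": "http://www.example.com/"}

first = describe_navigation(redirected_page)
second = describe_navigation(sniffed_page)
```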

Mark Nottingham

Nov 21, 2010, 4:36:08 PM

to http-archive-...@googlegroups.com

If the tool has to use a heuristic like this to calculate it, and it's possible for the end user to calculate it using the same heuristic, then the field should be optional, so the tool doesn't inject information that it doesn't have first-hand.

On 19/11/2010, at 3:47 AM, Bryan McQuade wrote:

>
> Yes, good question. For these tools I think a best effort computation
> of the URL is sufficient. We can document that the URL should be the
> contents of the location bar if available, otherwise the URL of the
> main HTML page, after any redirects. All tools should be able to
> compute the latter by looking for the resource with the lowest
> startedDateTime and then following redirect chains until they land at
> a 200 response. This won't catch meta redirects but it's good enough
> for most cases.

--
Mark Nottingham http://www.mnot.net/

Sergey Chernyshev

Nov 21, 2010, 8:13:29 PM

to http-archive-...@googlegroups.com

I guess there is no way around making this field optional.

Unfortunately, some tools (network devices or other traffic scanners) will not be able to produce this value without guessing, that's right. But for one class of tools (those that run on the client), it is quite possible to produce it, and it can add important information for tools like ShowSlow.

Bryan, do you think it's an important addition to your request? Can you describe how Page Speed sends URLs in the performance beacon?