Request for comments: Resource packages specification

Justin Lebar

Apr 19, 2010, 6:28:23 PM
I've written a formal specification based on Alexander Limi's proposal
[1] for resource packages. I'd really appreciate any feedback on the
specification, which is available at

http://stanford.edu/~jlebar/moz/respkg

You can track the progress of the implementation of this spec at

https://bugzilla.mozilla.org/show_bug.cgi?id=529208

-Justin

[1]: http://limi.net/articles/resource-packages

Aryeh Gregor

Apr 20, 2010, 12:55:42 PM
On Apr 19, 6:28 pm, Justin Lebar <justin.le...@gmail.com> wrote:
> I've written a formal specification based on Alexander Limi's proposal
> [1] for resource packages.  I'd really appreciate any feedback on the
> specification, which is available at
>
>  http://stanford.edu/~jlebar/moz/respkg

"Iterate over each link element contained in the document's head
element, starting with the last link."

Since you're iterating backwards, the UA has to wait until the end of
the <head> before it's sure which package contains a particular
stylesheet or script, no? In particular, with markup like

<!doctype html>
<link rel='resource-package' href='pkg1.zip' content='script.js'>
<script src=script.js></script>
...
<body>
...

it seems like the UA cannot safely start executing the script (and,
therefore, cannot start parsing beyond that script non-speculatively)
until it's reached the end of the <head>. Which is actually circular,
isn't it? What if the script retrieved from a later <link> removes
that <link>?

Perhaps you should just specify that when retrieving a resource, you
only consider resource-packages that were specified earlier. So as
soon as you hit a request for a resource while parsing, you know which
resource-packages could possibly contain it, without parsing any
further.
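
A minimal sketch of that "earlier declarations only" rule, assuming the parser emits events in document order. The event tuples and the assign_packages name are made up for illustration and are not taken from the spec:

```python
# Hypothetical parse events, in document order (illustrative names only):
#   ("package", href, [listed files])  for <link rel='resource-package'>
#   ("resource", url)                  for a script/stylesheet/image request
def assign_packages(events: list) -> dict:
    """Map each requested resource to the package that serves it, or None."""
    seen = []    # packages declared so far, in document order
    chosen = {}
    for event in events:
        if event[0] == "package":
            _, href, files = event
            seen.append((href, set(files)))
        else:
            _, url = event
            # Only packages declared earlier are candidates; the most
            # recently declared one is checked first.
            chosen[url] = next(
                (href for href, files in reversed(seen) if url in files),
                None,
            )
    return chosen


events = [
    ("package", "pkg1.zip", ["script.js"]),
    ("resource", "script.js"),
    ("package", "pkg2.zip", ["script.js", "style.css"]),
    ("resource", "style.css"),
]
result = assign_packages(events)
print(result)  # {'script.js': 'pkg1.zip', 'style.css': 'pkg2.zip'}
```

Here script.js is bound to pkg1.zip as soon as it is requested; the later pkg2.zip declaration cannot change that, so the parser never has to look ahead.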

Or is this a non-issue, because the algorithm is run immediately after
the element is added to the DOM? Maybe best to call this out
explicitly, if so.

Also, nitpick: you should explicitly state here that you're iterating
in reverse order, rather than iterating forward and wrapping around
from the first element to the last, or some other odd thing.

"We apply this algorithm before we check the user agent's cache to see
if it contains the requested resource. Thus a UA must not fetch a
resource from its cache unless it cannot fetch that resource from any
of the document's resource packages."

Why? Surely this will hurt performance, if a resource is cached with
long expiry time, but the corresponding resource-package is not cached
or has expired?

"We should support spaces within filenames. How do we do it for other
resources?"

Maybe just require the filenames to be URL-encoded? Authors can just
copy and paste from the href/src attributes of other elements that
way, so it's possibly even more convenient.
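
A rough sketch of what that could look like, assuming the content attribute holds a whitespace-separated list of names (that separator is my assumption, and the helper names are made up):

```python
from urllib.parse import quote, unquote


def encode_content(filenames):
    """Build a content attribute value from raw filenames."""
    # quote() percent-encodes spaces as %20 and leaves "/" alone by default.
    return " ".join(quote(name) for name in filenames)


def decode_content(attr):
    """Recover the raw filenames from a content attribute value."""
    return [unquote(token) for token in attr.split()]


names = ["my script.js", "css/main style.css"]
attr = encode_content(names)
print(attr)  # my%20script.js css/main%20style.css
decoded = decode_content(attr)
```

Because an encoded name never contains a literal space, splitting on whitespace is unambiguous.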

Justin Lebar

Apr 20, 2010, 5:01:36 PM
On Apr 20, 9:55 am, Aryeh Gregor <simetri...@gmail.com> wrote:
> "Iterate over each link element contained in the document's head
> element, starting with the last link."
> (snip)

> Perhaps you should just specify that when retrieving a resource, you
> only consider resource-packages that were specified earlier.

This is a really good exposition of the issue.

I'd been thinking about specifying that resources are only affected by
resource-package declarations which appear earlier in the DOM, but I
was hoping that I could come up with a more clever solution which
didn't involve a DOM walk on every resource load.

But if we ignore resource-packages which don't appear in the <head>,
then on loading a resource within the <body>, we only need to check
that it's inside the <body> to know that it appears after all the
resource packages (assuming that the document doesn't have a <head>
after the <body>, anyway). Assuming we can tell whether an element is
in the <body> without a DOM walk, we only have to do a walk for loads
within the <head>.

I'd also need to specify that for resources not in the DOM (e.g.
Javascript Image()s), we use the set of resource packages present when
that object is fetched.

> Or is this a non-issue, because the algorithm is run immediately after
> the element is added to the DOM?  Maybe best to call this out
> explicitly, if so.

I don't think we should require the algorithm to be run immediately
after the element is added to the DOM. The UA might e.g. postpone
downloading images until all the scripts it's seen have finished.

> Also, nitpick: you should explicitly state here that you're iterating
> in reverse order, rather than iterating forward and wrapping around
> from the first element to the last, or some other odd thing.

Yes; I'll fix that.

> "We apply this algorithm before we check the user agent's cache to see
> if it contains the requested resource. Thus a UA must not fetch a
> resource from its cache unless it cannot fetch that resource from any
> of the document's resource packages."
>
> Why?  Surely this will hurt performance, if a resource is cached with
> long expiry time, but the corresponding resource-package is not cached
> or has expired?

I'm uncomfortable with this as well, but I think I like the
alternative less. If our rule was to use a file from a resource
package if the package has finished downloading by the time we request
the resource, and otherwise to use a cached copy of the image if we
have one, then you can imagine getting different behavior on different
page loads, depending on whether the resource package was in the cache
and whether the file was in the cache.
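
To make the difference concrete, here is a toy model of the two rules. The function names are hypothetical, and the fallback of waiting for the package when nothing is cached is an assumption on my part:

```python
from itertools import product


def relaxed_policy(package_ready: bool, file_cached: bool) -> str:
    """Alternative rule: use the package only if it has already arrived."""
    if package_ready:
        return "package copy"
    if file_cached:
        return "cached copy"
    # Assumption: with nothing cached, the UA waits for the package.
    return "package copy"


def strict_policy(package_ready: bool, file_cached: bool) -> str:
    """Current spec text: the cache is not consulted while a package can
    supply the resource, so the UA waits for the package if necessary."""
    return "package copy"


states = list(product([True, False], repeat=2))
for ready, cached in states:
    print(ready, cached, relaxed_policy(ready, cached), strict_policy(ready, cached))

relaxed_outcomes = {relaxed_policy(r, c) for r, c in states}
strict_outcomes = {strict_policy(r, c) for r, c in states}
```

Under the relaxed rule, which copy the page gets depends on the cache state; under the strict rule it is always the package's copy.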

> "We should support spaces within filenames. How do we do it for other
> resources?"
>
> Maybe just require the filenames to be URL-encoded?  Authors can just
> copy and paste from the href/src attributes of other elements that
> way, so it's possibly even more convenient.

I added language to this effect this morning.

Thanks for the feedback!

-Justin

Aryeh Gregor

Apr 22, 2010, 6:36:06 PM
On Apr 20, 5:01 pm, Justin Lebar <justin.le...@gmail.com> wrote:
> I'd been thinking about specifying that resources are only affected by
> resource-package declarations which appear earlier in the DOM, but I
> was hoping that I could come up with a more clever solution which
> didn't involve a DOM walk on every resource load.
>
> But if we ignore resource-packages which don't appear in the <head>,
> then on loading a resource within the <body>, we only need to check
> that it's inside the <body> to know that it appears after all the
> resource packages (assuming that the document doesn't have a <head>
> after the <body>, anyway).  Assuming we can tell whether an element is
> in the <body> without a DOM walk, we only have to do a walk for loads
> within the <head>.

Surely you could compute the last applicable resource-package for a
given resource during parsing? Then you'd just have to read off the
last resource-package parsed at that point.

> I'd also need to specify that for resources not in the DOM (e.g.
> Javascript Image()s), we use the set of resource packages present when
> that object is fetched.

Is that even deterministic? (No hope for it if it's not, I guess.)

> I don't think we should require the algorithm to be run immediately
> after the element is added to the DOM.  The UA might e.g. postpone
> downloading images until all the scripts it's seen have finished.

Sure.

> I'm uncomfortable with this as well, but I think I like the
> alternative less.  If our rule was to use a file from a resource
> package if the package has finished downloading by the time we request
> the resource, and otherwise to use a cached copy of the image if we
> have one, then you can imagine getting different behavior on different
> page loads, depending on whether the resource package was in the cache
> and whether the file was in the cache.

Well, you can always get different behavior on cache hits vs. misses,
whenever you have any kind of cache. Are you thinking of a particular
scenario that might cause confusing results? If a resource is in the
cache but the copy from the resource-package is functionally
different, the site is buggy regardless -- some users will be randomly
getting the old copy from cache and some will be getting the new copy,
no matter what. Or if the resource-package is out of sync with the
actual files, some browsers will be getting one version and some the
other. Adding further unpredictability in this case is not a huge
cost AFAICT.

On the other hand, the current algorithm will definitely cause
real-world slowdown. For instance, I expect many authors will not provide
manifests or set the content attribute for some of their packages, and
in that case, *all* resource loads will be blocked until the package
is fully downloaded (at least for ZIP?). This would be bad
regardless, but the current algorithm will make it much worse. To a
lesser extent, it will degrade performance whenever a resource-package
contains a file that's also available in another resource-package, or
outside one. I don't see any benefits that outweigh this.


A further thought: what's the story on what happens if an author
doesn't provide content="" or a manifest? ZIP puts the list of files
at the end, so all resource loads will be blocked until the ZIP is
fully read in this case, no? That seems like a major gotcha that a
lot of authors will run into. Is the story for .tar.gz any better?
Or for any other common format?
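
For anyone who wants to check the ZIP layout, here is a quick sketch using Python's standard zipfile module. It assumes a small archive with no archive comment, so the end-of-central-directory record is exactly the last 22 bytes:

```python
import io
import struct
import zipfile

# Build a small archive in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("script.js", "alert(1);")
    zf.writestr("style.css", "body { margin: 0 }")
data = buf.getvalue()

# End-of-central-directory record: signature at bytes 0-3, offset of the
# central directory (the file list) at bytes 16-19, little-endian.
eocd = data[-22:]
eocd_signature = eocd[:4]
(cd_offset,) = struct.unpack("<I", eocd[16:20])

# Offset of the last entry's local header, i.e. where the last file starts.
with zipfile.ZipFile(io.BytesIO(data)) as zf:
    last_local_header = max(info.header_offset for info in zf.infolist())

print("last file starts at", last_local_header)
print("central directory starts at", cd_offset, "of", len(data), "bytes")
```

The central directory starts after the last file's data, so a reader that relies on it cannot list the archive until nearly all of it has arrived.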

If there's some common format that wouldn't need a manifest, it would
be a good idea to standardize on that format and ban others.
Requiring Windows users to download a freeware tool to make their
packages is a *much* smaller burden than requiring them to remember to
manually use a manifest file, because they *will* forget and their
pages will silently become a lot slower. Authors tend to avoid
fragile/hard-to-maintain features like this.

Justin Lebar

Apr 23, 2010, 12:09:24 AM
On Apr 22, 3:36 pm, Aryeh Gregor <simetri...@gmail.com> wrote:
> Surely you could compute the last applicable resource-package for a
> given resource during parsing? Then you'd just have to read off the
> last resource-package parsed at that point.

I suppose we could keep a list of the seen resource packages while
parsing. But a DOM node can trigger a resource load even after the
document has finished parsing (say, an <img>'s src is changed by a
script), so this doesn't help us too much.

Walking the <head> is probably fine if we only have to do it for
resources linked from the <head>. I need to see.

> > I'd also need to specify that for resources not in the DOM (e.g.
> > Javascript Image()s), we use the set of resource packages present when
> > that object is fetched.
>
> Is that even deterministic? (No hope for it if it's not, I guess.)

Yeah, I'm not sure there's much hope. At least it's well-defined,
even if it's unpredictable. :)

> If a resource is in the
> cache but the copy from the resource-package is functionally
> different, the site is buggy regardless -- some users will be randomly
> getting the old copy from cache and some will be getting the new copy,
> no matter what. Or if the resource-package is out of sync with the
> actual files, some browsers will be getting one version and some the
> other. Adding further unpredictability in this case is not a huge
> cost AFAICT.

I haven't been around for too long, but I thought that the lesson of
HTML5 was that ambiguity should be avoided at most costs; it's most
important that all browsers work the same way, even in error cases.

I can think of some situations where a more relaxed cache policy might
be nice. For instance, if the browser already has script.js in the
cache from a page which didn't use resource packages, the browser
doesn't need to wait until the resource package has downloaded before
it can start running script.js.

But in reality, if the site is using resource packages, it should just
use resource packages everywhere. With the strict cache policy, we'll
experience a slowdown relative to the relaxed cache policy exactly
once, on the first visit to a page with resource packages.

Similarly, you probably wouldn't provide a file in two separate
resource packages, at least not on purpose. So I'm not convinced that
the slowdown we get here by saying that we always wait for the last
package to download is worth the loss of consistency of saying that we
can load a resource from whichever package we want.

> A further thought: what's the story on what happens if an author
> doesn't provide content="" or a manifest? ZIP puts the list of files
> at the end, so all resource loads will be blocked until the ZIP is
> fully read in this case, no?
>

> If there's some common format that wouldn't need a manifest, it would
> be a good idea to standardize on that format and ban others.

I'm not aware of any common format that wouldn't require a manifest.
tar.gz files actually don't contain a dictionary at the end; to get a
complete list of a .tar.gz's files, you have to extract the whole
tarball. (At least, this is what I understand; I'm not 100% sure
that's right, as I haven't implemented tar.gz yet.)
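
A small experiment with Python's standard tarfile module seems consistent with that. This is only a sketch: it reads an in-memory .tar.gz in streaming mode ("r|gz"), where each member's header arrives just before that member's data:

```python
import io
import tarfile

# Build a small .tar.gz in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for name, body in [("script.js", b"alert(1);"), ("style.css", b"body {}")]:
        info = tarfile.TarInfo(name)
        info.size = len(body)
        tar.addfile(info, io.BytesIO(body))

# Read it back as a forward-only stream, as a UA downloading it would.
buf.seek(0)
seen = []
with tarfile.open(fileobj=buf, mode="r|gz") as tar:
    for member in tar:
        seen.append(member.name)
        print("names known so far:", seen)
```

Names show up one at a time as the stream is consumed, so the list is only known to be complete once the end of the archive has been read.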

Having said this, one can obviate the whole problem by putting the
resource package inside a subdirectory. If the package is located at
/static/pkg.zip, then only loads from the /static directory will block
on the resource package, even if it doesn't contain a content
attribute or a manifest. Whether or not developers will do that is
another story.
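
Roughly, the check I have in mind looks like the sketch below. The may_block name is made up, and the sketch assumes both URLs are on the same origin and that the package has no content attribute or manifest:

```python
import posixpath
from urllib.parse import urlsplit


def may_block(package_url: str, resource_url: str) -> bool:
    """True if a load of resource_url might have to wait on the package."""
    # The package can only supply files at or below its own directory.
    package_dir = posixpath.dirname(urlsplit(package_url).path)
    if not package_dir.endswith("/"):
        package_dir += "/"
    return urlsplit(resource_url).path.startswith(package_dir)


print(may_block("/static/pkg.zip", "/static/img/logo.png"))  # True
print(may_block("/static/pkg.zip", "/images/logo.png"))      # False
print(may_block("/pkg.zip", "/images/logo.png"))             # True
```

A package at the site root can block every load, which is the case to avoid.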

Even if we changed the semantics of loading to let browsers ignore
resource packages whenever they wanted, I think most browsers will
block optimistically when they see a resource package which might
contain a resource they want. The alternative is to be pessimistic
and assume the resource package doesn't contain what you want, but in
that case, why implement resource packages at all?

One last thing: If the content attribute is explicitly set to "", the
spec currently says that the browser treats the resource package as if
it's empty.
