unpackaged publications (was Restrictions on ToC nav contents?)

Bill McCoy

Mar 13, 2013, 11:13:39 AM
to Hadrien Gardeur, Peter Hatch, epub-ng
> I'll talk about this in another thread (we still need to talk about a few
> things regarding the <nav> element in index.html), but I'm a little puzzled
> by the fact that you're very interested in unpackaged publications (with
> resources spread across the Web), yet hostile towards HTML. We certainly
> can't expect a distributed publication to be XHTML only.

Hadrien,

You're right that we can't expect distributed publications to be XHTML only.

There are two reasons for my POV, one philosophical and one practical.

First, to me the "packaged portable document" meme is about delivering
a reliable experience with content. It should be archivable,
manipulable, accessible, etc. And generally these documents will be
assembled and checked in workflows via batch tools. Requiring the use
of XHTML here seems OK. Whereas distributed publications created on
the fly by definition won't have these portable document attributes so
being more lenient about syntax seems to make sense. If you wanted to
snapshot a distributed publication into a packaged portable document
then among the things that would apply could be "canonicalization" of
HTML to XHTML.

And practically speaking we have in EPUB a "good enough" packaged
portable document format that's well established. Incompatible changes
here would at best provide incremental benefits and at worst would
create undesirable fragmentation and aid would-be proprietizers. Right
now IMO it's more important to further establish our standard in the
face of proprietary and less Web-based alternatives vs. polishing up
its rough edges (especially when the rough edges in EPUB 3 aren't
accidents but were agreed upon in order to enable compatibility with
existing EPUB 2 reading systems, which is important in moving the
whole ecosystem towards HTML5 and the modern web stack). Whereas figuring
out unpackaged distributed publications could deliver more significant
new capabilities to enable whole new means of distribution and digital
reading, without any downside of causing fragmentation.

Even if I'm wrong on the first point and ultimately the HTML
serialization is just as good as XHTML for packaged documents it would
seem better to figure out distributed publications first - where as
you say supporting HTML is a no-brainer - and then see what that
implies for packaged publications. You guys seem to be going at it
backwards.

--Bill

Hadrien Gardeur

Mar 13, 2013, 11:33:27 AM
to Bill McCoy, Peter Hatch, epub-ng
> Even if I'm wrong on the first point and ultimately the HTML
> serialization is just as good as XHTML for packaged documents it would
> seem better to figure out distributed publications first - where as
> you say supporting HTML is a no-brainer - and then see what that
> implies for packaged publications. You guys seem to be going at it
> backwards.

I don't think we're going at it backwards.

The only things we've said so far about packaging publications are:
  • a ZIP is good enough, no need for a mimetype file or a manifest that lists every resource
  • a packaged publication must have an index.html (well-known filename as a discovery mechanism)
  • a packaged EPUB NG with DRM must use a different extension and media type
Basically we know how to deliver the content (ZIP) and how to discover what makes this package a publication (index.html).
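
To make the discovery step concrete, here is a minimal Python sketch, assuming the well-known entry is literally named index.html and sits at the root of the ZIP (both are assumptions taken from the proposal above, not from any published spec). The helper names are hypothetical.

```python
import io
import zipfile
from typing import Optional

# Assumption: the well-known entry name proposed in this thread.
WELL_KNOWN_ENTRY = "index.html"


def find_publication_document(zip_bytes: bytes) -> Optional[str]:
    """Return the well-known entry name if the archive has it at its root."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = set(archive.namelist())
    return WELL_KNOWN_ENTRY if WELL_KNOWN_ENTRY in names else None


def build_sample_package(include_index: bool = True) -> bytes:
    """Build a tiny in-memory package, only for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        if include_index:
            archive.writestr("index.html", "<nav><a href='ch1.html'>Chapter 1</a></nav>")
        archive.writestr("ch1.html", "<p>Hello</p>")
    return buffer.getvalue()
```

A reader would treat the archive as a publication only when find_publication_document returns a name; no mimetype file or manifest is consulted.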

How would this work for a distributed publication? Pretty much the same way:
  • we'll deliver content documents with HTTP
  • instead of a well-known name for the discovery of index.html, we would rely on a link with proper rel/media type (in either HTTP headers or the content)
... and nothing more than that.

Here are the questions that we need to ask ourselves now:
  • can we reference content documents that are not in the package from the <nav>? (A packaged publication with some resources outside of the package, or a fully distributed publication)
  • how do we point to index.html? (Links obviously, but in HTTP, in HTML, or both?)
  • how do we identify the index.html in a link? (Do we need a media parameter to identify index.html as something different from pure HTML/XHTML? Which rel value do we use to point from a content document to an index.html?)

Hadrien Gardeur

Mar 13, 2013, 11:50:18 AM
to Bill McCoy, Peter Hatch, epub-ng
> (especially when the rough edges in EPUB 3 aren't
> accidents but were agreed upon in order to enable compatibility with
> existing EPUB 2 reading systems, which is important in moving the
> whole ecosystem towards HTML5 and the modern web stack).

It's a mix of compatibility and bad decisions.

For example the decision to use an IDref system for metadata is something that EPUB3 introduced. Some of the discussions right now in the AHL group would also increase the complexity of EPUB3 dramatically (the use of multiple OPF files with rules to select an OPF, or the ability to map multiple renditions together using an OPF per rendition and a mapping file between them) if they're adopted in the core spec. 

Dividing work into several WGs helped EPUB3 WG to finalize a spec (I was surprised by how little time it took the IDPF to get there) but it also made it harder to follow what's going on in each WG, which raised the odds of having such problems.

Bill McCoy

Mar 13, 2013, 12:26:51 PM
to Hadrien Gardeur, epub-ng
Hi Hadrien,

Right - well my optimal e0 would start from the online presumption and then speak about the packaged scenario second but that's a detail, I'm glad you are thinking through the implications of online.

Of course the first two things you said about the packaged case are still debatable wrt the cost-benefits of deviating from EPUB:
> • a ZIP is good enough, no need for a mimetype file or a manifest that lists every resource
A ZIP is isomorphic to a filesystem directory. So no you don't "need" a manifest, but if you don't have such a data structure there's no easy way to distinguish components of the publication from "junk DNA" that is at best useless and at worst may be vectors for malware. Most systems for handling content don't overload the filesystem directory or ZIP equivalent as their manifest. For example Google Packaged Apps and Mozilla Open Web Apps both define manifests, iTunes has a DB, etc.
> • a packaged publication must have an index.html (well-known filename as a discovery mechanism)
EPUB already has a well-known filename as discovery mechanism (META-INF/container.xml), so to be clear you are simply proposing changing the name and removing indirection wrt TOC (and thus removing the ability to package multiple versions of a publication together, such as both fixed and reflow). And index.html has some problems: it's not from any standard but just from default settings of Apache web server, sometimes it's index.htm instead, or even other things in other languages.

Again I'm not saying these things might not be improvements (although as noted that's arguable) but only that I personally don't see all that much prospective bang (simplification) for the definite buck (cost of fragmentation/forking plus other problems caused). 

And other more incremental changes could provide similar benefits w/ less cost. For example another hypothetical approach would be to leave META-INF/container.xml as the well-known name and simply revise the schema for rootfile to allow pointing straight at a Navigation Document that would by default also define the spine (in which case OPF could be omitted entirely). This would allow for simplified authoring if someone didn't want to bother creating OPF, and also ability to deliver full backwards compatibility where desired (because the rootfile could still optionally point at an OPF), similar to current interim situation with NCX & Navigation Document.  And processors and reading systems that wanted to handle both today's EPUB and a future e0 would have less work to do overall than with an e0 as presently suggested, e.g. one well-known name not two (or more).

--Bill

Hadrien Gardeur

Mar 13, 2013, 12:49:27 PM
to Bill McCoy, epub-ng
> A ZIP is isomorphic to a filesystem directory. So no you don't "need" a manifest, but if you don't have such a data structure there's no easy way to distinguish components of the publication from "junk DNA" that is at best useless and at worst may be vectors for malware. Most systems for handling content don't overload the filesystem directory or ZIP equivalent as their manifest. For example Google Packaged Apps and Mozilla Open Web Apps both define manifests, iTunes has a DB, etc.

Since you never need to access such "junk", the threat is minimal. Most of the time a manifest would do little more than duplicate information that is already available.

 
> EPUB already has a well-known filename as discovery mechanism (META-INF/container.xml), so to be clear you are simply proposing changing the name and removing indirection wrt TOC (and thus removing the ability to package multiple versions of a publication together, such as both fixed and reflow). And index.html has some problems: it's not from any standard but just from default settings of Apache web server, sometimes it's index.htm instead, or even other things in other languages.

There are multiple levels of indirection here.
If I want to access the ToC, I will have to open META-INF/container.xml, then open the OPF (and potentially select the OPF first, with all kinds of complex rules), parse the manifest for the Navigation Document and finally access the document.

With the proposed solution in EPUB NG, we'd have a single level of indirection, worst case scenario (index.html). 
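
As a rough illustration of that chain, here is a Python sketch of the EPUB 3 lookup. It assumes a single rendition (it simply takes the first rootfile) and a manifest item carrying the "nav" property; the function name find_nav_document is hypothetical, and OPF selection rules are not modelled at all.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"
OPF_NS = "http://www.idpf.org/2007/opf"


def find_nav_document(archive: zipfile.ZipFile) -> str:
    """Follow container.xml -> OPF -> manifest to the Navigation Document."""
    # Step 1: open the well-known container file.
    container = ET.fromstring(archive.read("META-INF/container.xml"))
    # Step 2: take the first rootfile to locate the OPF (no selection rules).
    rootfile = container.find(f".//{{{CONTAINER_NS}}}rootfile")
    if rootfile is None:
        raise LookupError("container.xml declares no rootfile")
    opf_path = rootfile.get("full-path")
    # Step 3: parse the OPF and scan the manifest for the "nav" item.
    package = ET.fromstring(archive.read(opf_path))
    for item in package.iter(f"{{{OPF_NS}}}item"):
        if "nav" in (item.get("properties") or "").split():
            # Step 4: resolve the href relative to the OPF's own location.
            base = posixpath.dirname(opf_path)
            return posixpath.normpath(posixpath.join(base, item.get("href")))
    raise LookupError("no Navigation Document declared in the manifest")
```

Under the index.html proposal, the equivalent lookup would be a single archive.read("index.html").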

I don't believe that the complexity that we have in EPUB3 right now is needed, and I don't think that it would work at all in a distributed environment with anything more than a single resource turning the resources into a coherent publication.

While some of the complexity about container.xml and OPF already existed in EPUB2, they were never widely used (most of the time container.xml is just used to discover the single OPF file available in the package).
Turning some of these features that we carried into EPUB3 mostly for compatibility reasons into first-class citizens is the wrong solution from my point of view.
Multiple OPFs will mean duplicating a lot of information (metadata, spine, manifest) and I'm pretty sure that using OPF selection in container.xml will break some of the existing EPUB RS.
 
If I had to boil it down to one thing, I would say that this is the main difference between the EPUB NG discussions here and the EPUB3 WG. 
EPUB NG is about having a single, HTML-based document that turns a collection of documents into a publication. 
EPUB3 needs several proprietary XML documents to achieve the same goal. Even considering some of the goals that EPUB3 is trying to achieve (multiple renditions), I don't think that this level of complexity is needed.

As for your comments regarding Apache, I never suggested using a well-known name for distributed publications. It's obvious that links (Link header in HTTP, <link> in HTML) are the right solution for that.

Bill McCoy

Mar 13, 2013, 1:10:01 PM
to Hadrien Gardeur, epub-ng
Hi Hadrien,

The threat of "junk" content is more than just extra information. It would enable less detectable and more persistent (since it would survive unpackaging and repacking based on manifest) way to put malware together with a security exploit that might lurk in JS or media streams or whatever. But anyway my point was not that manifest is definitely better, just that EPUB is far from alone in using the manifest approach and there are benefits as well as costs to consider.
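
To show what a manifest buys in this respect, here is a small Python sketch that lists archive entries no manifest accounts for. The declared set stands in for whatever a manifest would list; how that set is obtained is left open, and the function name is hypothetical.

```python
import io
import zipfile
from typing import Iterable, List


def unlisted_entries(archive: zipfile.ZipFile, declared: Iterable[str]) -> List[str]:
    """Return file entries present in the ZIP but absent from the declared set."""
    declared_set = set(declared)
    return sorted(
        name
        for name in archive.namelist()
        # Directory entries end with "/" and carry no content of their own.
        if not name.endswith("/") and name not in declared_set
    )
```

A tool that repacks only declared entries would drop whatever this function reports; without a manifest there is no declared set to compare against.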

Anyway I agree with you about the complexity of EPUB packaging not being needed in distributed case. 

And, I definitely support the idea that e0 is fundamentally about the question of whether we can have a "single, (X)HTML-based document that turns a collection of documents into a (reliable) publication". This is consistent with the step already taken in EPUB 3.0 with Navigation Document replacing NCX schema (although I would not use the word "proprietary" to describe the XML data structures since that implies to me something vendor-specific which of course nothing in EPUB is... maybe "custom" or "special" is a better adjective).

I'm just trying to push on the secondary question of how much we might be able to achieve this goal with less rather than more fragmentation and additional effort for agents that would need to process both EPUB and such an e0.

--Bill

Baldur Bjarnason

Mar 13, 2013, 1:12:58 PM
to Bill McCoy, Hadrien Gardeur, epub-ng

On 13 Mar 2013, at 16:26, Bill McCoy <whm...@gmail.com> wrote:

> Hi Hadrien,
>
> Right - well my optimal e0 would start from the online presumption and then speak about the packaged scenario second but that's a detail, I'm glad you are thinking through the implications of online.
>
> Of course the first two things you said about the packaged case are still debatable wrt the cost-benefits of deviating from EPUB:
> • a ZIP is good enough, no need for a mimetype file or a manifest that lists every resource
> A ZIP is isomorphic to a filesystem directory. So no you don't "need" a manifest, but if you don't have such a data structure there's no easy way to distinguish components of the publication from "junk DNA" that is at best useless and at worst may be vectors for malware. Most systems for handling content don't overload the filesystem directory or ZIP equivalent as their manifest. For example Google Packaged Apps and Mozilla Open Web Apps both define manifests, iTunes has a DB, etc.

Actually, the picture is more interesting and more varied than you paint it. Especially the work that Mozilla has been doing in creating an app format that doesn't need to be packaged.

Mozilla Open Web Apps define a manifest but that manifest isn't like anything in EPUB. In fact, it bears remarkable similarities to e0's index.html.

* It defines the name, description, author and other metadata.
* It links to the icons.
* Defines the default language.
* Optionally links to an AppCache manifest as a list of files that should be cached.
* Outlines primary activities, launch paths, and the files for those activities.
* Links to a security policy.

Those are parallels to e0's metadata and ToC. Since the index.html is HTML5/XHTML, the AppCache manifest is also an option in e0: you can link to it from your index.html file and include it in the book. What's missing is a content security policy, but that could be included as meta http-equiv tags carrying X-Content-Security-Policy directives.

(Actually, I have a lot of opinions on the security side of ebooks and javascript, but it's a bit too early to start a discussion on that.)

The reason that Mozilla is using JSON for the app's index is that they can assume the app installer will provide a human-friendly presentation, which is an assumption that I don't think we should make with e0. Unlike Chrome's packaged apps, their Open Web Apps are distributed unpacked, just with links to the .webapp JSON file.

If we recommend that e0 books hosted on the open web should include an AppCache manifest then we get most of the benefits of offline packaged ebooks but in a format that's compatible with existing web standards and widely supported in browsers.
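
For anyone who hasn't written one, here is a short Python sketch that produces an AppCache manifest for a book. The resource names and the version comment are made up for the example, and the NETWORK wildcard is just one possible policy.

```python
from typing import Iterable


def build_appcache_manifest(resources: Iterable[str], version: str) -> str:
    """Render an AppCache manifest that caches the given resources."""
    # The first line of an AppCache manifest must be exactly "CACHE MANIFEST".
    lines = ["CACHE MANIFEST", f"# version {version}", "", "CACHE:"]
    lines.extend(resources)
    # Anything not listed above is fetched from the network when online.
    lines.extend(["", "NETWORK:", "*"])
    return "\n".join(lines) + "\n"
```

The resulting file would be referenced from the manifest attribute on the html element of index.html, for example manifest="book.appcache" (a hypothetical file name). Changing the version comment is the usual way to make browsers refresh the cache.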

- best
- baldur

Baldur Bjarnason

Mar 13, 2013, 1:25:45 PM
to Bill McCoy, Hadrien Gardeur, epub-ng

On 13 Mar 2013, at 17:10, Bill McCoy <whm...@gmail.com> wrote:

> Hi Hadrien,
>
> The threat of "junk" content is more than just extra information. It would enable less detectable and more persistent (since it would survive unpackaging and repacking based on manifest) way to put malware together with a security exploit that might lurk in JS or media streams or whatever. But anyway my point was not that manifest is definitely better, just that EPUB is far from alone in using the manifest approach and there are benefits as well as costs to consider.

Actually, that's a completely unwarranted concern. Anybody who can modify the zip archive can modify the contents of files in the zip to include their malware files in the manifest and make the infection trivially and easily persistent. In fact, they'd have to do it to guarantee that their malware code gets run at some point. If you have malware that can modify zip archives you have already lost and there's nothing you can do in the format that can prevent abuse.

What you want is a distribution system that verifies the integrity of the files using something like sha1sum (or something more cryptographically secure), and that's an issue that's completely out of scope for e0 and affects EPUB, Mobi, Chrome Packaged Apps, Node.js modules, and Ruby gems equally. This is not a security problem we can solve here and it's reckless to try because it would give people a false sense of security that malware writers could exploit.
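
A minimal sketch of that kind of check in Python, using SHA-256 rather than SHA-1. It assumes the distributor publishes the expected digests through some trusted channel outside the package; that channel and the function names are hypothetical.

```python
import hashlib
import io
import zipfile
from typing import Dict, Mapping


def digest_entries(archive: zipfile.ZipFile) -> Dict[str, str]:
    """Map every file entry in the archive to its SHA-256 hex digest."""
    return {
        name: hashlib.sha256(archive.read(name)).hexdigest()
        for name in archive.namelist()
        if not name.endswith("/")  # skip directory entries
    }


def verify_package(archive: zipfile.ZipFile, expected: Mapping[str, str]) -> bool:
    """True only if the archive holds exactly the expected files and contents."""
    # Comparing whole mappings catches changed, missing, and extra entries alike.
    return digest_entries(archive) == dict(expected)
```

Note that the expected digests live outside the archive: a list stored inside the ZIP could be rewritten by whoever modified the ZIP.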

> I'm just trying to push on the secondary question of how much we might be able to achieve this goal with less rather than more fragmentation and additional effort for agents that would need to process both EPUB and such an e0.

I thought this group's raison d'être was exactly to see what a format would look like if we ignored fragmentation issues and backwards compatibility and focused on compatibility with basic web standards? What you are trying to push goes directly against that.

- best
- baldur

Bill McCoy

Mar 13, 2013, 1:26:52 PM
to Baldur Bjarnason, Hadrien Gardeur, epub-ng
Hi Baldur, I generally agree with all of your points. 

Of course at present there are far more Chrome Packaged Web Apps (which are both packaged and manifest-ed) than Mozilla Open Web apps (manifest-ed but as you noted not packaged). Which way things go in terms of adoption surely depends on things that have nothing at all to do with eBooks, such as whether or not Firefox OS gets any traction in the device market and whether the Firefox browser can climb back up above 20% market share in general and/or get any significant share on existing popular device OSes. Not to mention what, if any, adopted specs ultimately emerge from the new W3C System Applications WG. So, while I hate to sound unduly conservative, jumping on any of these shiny new bandwagons may be premature. Yet at the same time defining a brand new approach to composite (to use a different word than "packaged") web documents may seem out of step with where the composite web app world may end up in the coming year.

--Bill 

Hadrien Gardeur

Mar 13, 2013, 1:40:11 PM
to Bill McCoy, Baldur Bjarnason, epub-ng
> Of course at present there are far more Chrome Packaged Web Apps (which are both packaged and manifest-ed) than Mozilla Open Web apps (manifest-ed but as you noted not packaged).

I'm reading https://developer.chrome.com/stable/apps/manifest.html right now and this looks very similar to what we're doing.
They do list the resources, but this is within the scope of a security policy rather than the way we do it in EPUB2/3 (manifest and spine form an IDref system, just like metadata: you can't create a spine without first listing documents and assigning them an ID).

Bill McCoy

Mar 13, 2013, 1:43:01 PM
to Baldur Bjarnason, Hadrien Gardeur, epub-ng
> I thought this group's raison d'être was exactly to see what a format would look like
> if we ignored fragmentation issues and backwards compatibility and focused 
> on compatibility with basic web standards? What you are trying to push goes
> directly against that.

If you guys don't want to consider costs as well as benefits, and would rather attempt to design something in a vacuum, away from well-specified use cases and practical considerations about adoption and fragmentation, that's of course your call.

I'm far more interested in what I thought was the question on the table of what could be achieved by sacrificing (some) backwards compatibility with today's EPUB (in the case of packaged documents) in order to achieve (more) simplicity and compatibility with web standards (and better support for unpackaged distributed publications). Or to put it another way, whether we can in the future achieve a more optimized balance point among competing requirements rather than looking only at certain practical requirements and ignoring all others.

If you guys want to declare that out of scope for this discussion then I will stop pestering. I would instead then wait and see what if anything you emerge with as a specification (and then seek to take the best ideas from your theoretical exercise and try to get 80%+ of any benefit with far less cost and disruption, and with more input from a broader community of stakeholders and experts). I am OK either way ;-)

And again I would personally suggest that "e0" or some other name would in that case be a better name for the discussion list than "epub-ng". I.e. I don't think "epub-ng" is appropriate if you are going to declare that compatibility with the EPUB ecosystem is entirely a non-concern for the purposes of the exercise.

--Bill

 

Hadrien Gardeur

Mar 13, 2013, 1:57:50 PM
to Bill McCoy, Baldur Bjarnason, epub-ng
> And again I would personally suggest that "e0" or some other name would in that case be a better name for the discussion list than "epub-ng". I.e. I don't think "epub-ng" is appropriate if you are going to declare that compatibility with the EPUB ecosystem is entirely a non-concern for the purposes of the exercise.

I've replied to some of these points in previous messages:
  • at this point in our discussions, it is entirely possible to generate a valid EPUB 2/3 from what we've agreed on
  • it would also be possible to generate a file that's compatible with both specs
  • the additional requirements for an EPUB3 reader are so far fairly low: look for an index.html file and parse it (supporting conditions in a container.xml to select the right OPF is every bit as difficult as that for example)
Purely from a compatibility perspective, HTML vs XHTML is the bigger issue (which is why you often bring it to the table), the rest is fairly trivial.

If EPUB 3 continues in a direction where more complex interactions are adopted around OPF and container.xml (how we "glue" resources together to make a publication), it'll be much harder to align it with what they're doing here, or to have a good fit with the requirements of distributed publications.

Bill McCoy

Mar 13, 2013, 2:48:12 PM
to Hadrien Gardeur, Baldur Bjarnason, epub-ng
Hi Hadrien, I'm with you so far. And if an e0 spec ends up with the property that for any e0 file it's possible to deterministically generate a valid EPUB 3 file with the same information (and, ideally, vice versa) I think that would be a very powerful plus.

But again your POV at least takes into account EPUB's existence and the issues e.g. in imposing new requirements on an EPUB 3 reading system. Maybe some others on the list have a different POV that does not, so I'm not sure what the consensus if any is on that.

Re: EPUB 3 continuing in a direction where more complex interactions are adopted around OPF and container.xml - I assume you mean the Advanced-Hybrid Layout WG. Agreed that this presents challenges to alignment with a simpler approach. But I would frame this up another way: IDPF is committed to evolving EPUB to make sure it can be used for a wide variety of publication types. Manga, rich interactive textbooks, digital magazines - all have their own issues. I'm not sure that there is consensus among you guys about the scope of publications that you would aspire to address with an e0. Nor on the importance of accessibility, which is where things like combining a reflow version of a publication together with fixed-layout version(s) come into play. If these things are to be tackled then a certain amount of necessary complexity may result. The trick is again to deliver the additional benefits with least cost. My understanding is that for example having multiple rootfiles does not break existing reading systems, even EPUB 2 reading systems, which simply ignore all but the first one as specified in EPUB 2.

Also I'm not sure that anything being worked on in AHL has any negative intersection with distributed publications. To me the concept of being able to supply different forms of the same content, which is the main thing AHL is tackling, is consistent with e.g. the inherent fallback mechanism for web video, the content negotiation of HTTP, really the whole REST architecture of the Web. (I may be wrong though, as I have only been on the periphery of the AHL WG discussions to date).

--Bill

Hadrien Gardeur

Mar 14, 2013, 12:01:19 AM
to Bill McCoy, Baldur Bjarnason, epub-ng
> Re: EPUB 3 continuing in a direction where more complex interactions are adopted around OPF and container.xml - I assume you mean the Advanced-Hybrid Layout WG. Agreed that this presents challenges to alignment with a simpler approach. But I would frame this up another way: IDPF is committed to evolving EPUB to make sure it can be used for a wide variety of publication types. Manga, rich interactive textbooks, digital magazines - all have their own issues. I'm not sure that there is consensus among you guys about the scope of publications that you would aspire to address with an e0. Nor on the importance of accessibility, which is where things like combining a reflow version of a publication together with fixed-layout version(s) come into play. If these things are to be tackled then a certain amount of necessary complexity may result. The trick is again to deliver the additional benefits with least cost. My understanding is that for example having multiple rootfiles does not break existing reading systems, even EPUB 2 reading systems, which simply ignore all but the first one as specified in EPUB 2.

Well the same way that you advise this group to keep some EPUB3 related issues in mind, I believe that it would be a good thing for EPUB3 sub-WGs to keep two things in mind:
  • that it's best to stick as close as possible to the Web stack
  • that it might benefit EPUB in the future to head toward a unique document listing resources and defining the publication
I don't think that this comes at the cost of supporting a wide variety of publication types. If you follow these two principles, you can find an alternative solution to pretty much anything that I've seen proposed lately (I recently made such a proposal myself, arguing against adding a new well-known folder name in EPUB3 and relying on the OPF instead).

As for this group, I can only speak for myself, but anything that can be done on the Web should be supported (which includes comics, interactive textbooks and digital magazines).
 

> Also I'm not sure that anything being worked on in AHL has any negative intersection with distributed publications. To me the concept of being able to supply different forms of the same content, which is the main thing AHL is tackling, is consistent with e.g. the inherent fallback mechanism for web video, the content negotiation of HTTP, really the whole REST architecture of the Web. (I may be wrong though, as I have only been on the periphery of the AHL WG discussions to date).

Hmm, not really.

Content negotiation in HTTP is usually about two things: media type and language. 
You could (in theory) have media parameters that would enable the browser to fetch images in different resolutions (one of the main use cases for having multiple renditions and rendition selection in EPUB3), or provide other resources in the Alternates header (see http://my.opera.com/karlcow/blog/2011/12/08/responsive-images-and-transparent-content-negotiation-in-http).
In practice, things are different and people mostly rely on CSS and JS for such responsive design elements. 
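
For reference, here is a simplified Python sketch of what media type negotiation amounts to on the server side. It only handles q-values and the type/* and */* wildcards, ignores other media type parameters, and all function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Split an Accept header into (media range, quality) pairs."""
    ranges = []
    for part in header.split(","):
        pieces = [piece.strip() for piece in part.split(";")]
        media_range = pieces[0].lower()
        if not media_range:
            continue
        quality = 1.0  # default when no q parameter is given
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        ranges.append((media_range, quality))
    return ranges


def quality_for(media_type: str, ranges: Sequence[Tuple[str, float]]) -> float:
    """Quality of the most specific range matching media_type, else 0."""
    main_type = media_type.split("/")[0]
    for pattern in (media_type, main_type + "/*", "*/*"):
        for media_range, quality in ranges:
            if media_range == pattern:
                return quality
    return 0.0


def choose_variant(accept_header: str, available: Sequence[str]) -> Optional[str]:
    """Pick the available media type the client prefers, or None."""
    ranges = parse_accept(accept_header)
    best, best_quality = None, 0.0
    for candidate in available:
        quality = quality_for(candidate, ranges)
        if quality > best_quality:
            best, best_quality = candidate, quality
    return best
```

Nothing in this exchange describes screen size or layout preferences, which is why rendition selection does not map onto it directly.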

Hadrien Gardeur

Mar 14, 2013, 12:15:24 AM
to Bill McCoy, Peter Hatch, epub-ng
Back to the main point after these last few messages.

> Here are the questions that we need to ask ourselves now:
>   • can we reference content documents that are not in the package from the <nav>? (A packaged publication with some resources outside of the package, or a fully distributed publication)
I would be in favor of supporting this use case. I believe that about 95% of our work will be on defining what's inside index.html (should we call it something other than its file name? Publication document?), and things would work almost exactly the same in a distributed environment.

>   • how do we point to index.html? (Links obviously, but in HTTP, in HTML, or both?)

For example in HTML:
<link href="index.html" rel="contents" type="text/html" />

In HTTP headers:
Link: <index.html>; rel="contents"; type="text/html"

This has the benefit of working with any media delivered over HTTP.
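
On the consuming side, here is a rough Python sketch that extracts the rel="contents" target from either form. The Link header parsing is deliberately simplified (it does not cover every corner of the header grammar), and the function and class names are hypothetical.

```python
import re
from html.parser import HTMLParser
from typing import List, Optional

# Simplified grammar: <target> followed by ;name=value or ;name="value" pairs.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^;,\s]+))*)')
PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^;,\s]+))')


def contents_from_link_header(header_value: str) -> Optional[str]:
    """Return the target of the first link whose rel includes "contents"."""
    for match in LINK_RE.finditer(header_value):
        target, raw_params = match.group(1), match.group(2)
        params = {}
        for name, quoted, bare in PARAM_RE.findall(raw_params):
            params[name.lower()] = quoted or bare
        if "contents" in params.get("rel", "").lower().split():
            return target
    return None


class ContentsLinkFinder(HTMLParser):
    """Collect href values of <link> elements whose rel includes "contents"."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        if "contents" in rels and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def contents_from_html(markup: str) -> Optional[str]:
    """Return the first rel="contents" link target found in the markup."""
    finder = ContentsLinkFinder()
    finder.feed(markup)
    return finder.hrefs[0] if finder.hrefs else None
```

The returned target is relative, so a client would still resolve it against the URL of the document or response it came from.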

>   • how do we identify the index.html in a link? (Do we need a media parameter to identify index.html as something different from pure HTML/XHTML? Which rel value do we use to point from a content document to an index.html?)
I used "contents" in my example as this is already registered under the following definition: "Refers to a table of contents."
Since this document also contains additional metadata about the publication itself, I don't know how other people feel about using this rel value.

Now the main issue here is that if we stick to "contents" and use "text/html" as the media type, there's no easy way to identify whether we're just pointing to any kind of HTML document, or if we're pointing to an EPUB Zero/NG document.