[301][NeedsDiscussion] CFI (W3C DOM Range, text-node vs text-range)

Daniel Weck

Mar 29, 2013, 8:09:47 PM

to epub-work...@googlegroups.com

This is a request to discuss issue #218 (via email first, followed-up by a concall if necessary).

Note that I suggested a simple prose fix in the actual comments for this issue, as a proposed solution. However, I would really like to hear your feedback about the text-node vs text-range issue, to see whether or not the CFI specification contains a "design hole" that would lead to implementations producing non-interoperable results.

In summary:

----
The issue committer (natevw) articulated a problem by using W3C DOM Range as a basis for his reasoning. So I ask: is it a functional requirement for "CFI simple ranges" to map exactly (or even gracefully) to W3C DOM Ranges?

I think that the expressiveness of W3C DOM Ranges is indeed greater than that of CFI simple ranges (the distinction between "select the whole text node" and "select the entire text content within the text node" is not possible, because of the nature of CFI locations). As a consequence, round-tripping between CFI and DOM ranges should not be expected to be a "lossless" process. Let's take, for example, the Readium-SDK CFI implementation: text/document selections are made through a web browser component (let's say, WebKit). Such selections are extracted via JavaScript and translated to the CFI range syntax (e.g. for annotation / bookmark persistence). The obtained CFI expression may subsequently be turned back into a form suitable for web browser rendering (e.g. highlighting). I do not think that the original selection (perhaps expressed as a DOM Range) is expected to be exactly identical to the restored selection: the underlying CFI expression is the authoritative data, not the DOM Range on the "user interface" side.
----

https://code.google.com/p/epub-revision/issues/detail?id=218

Brady Duga

Mar 30, 2013, 1:07:52 PM

to epub-work...@googlegroups.com

I don't think CFI ranges can map to W3C DOM ranges. In fact, they are designed specifically so that a DOM is not required for the CFI calculation, to make them lightweight. Otherwise we could have just used DOM ranges and skipped CFI altogether. A specific case where they will not match is in the presence of collapsed white-space. For instance, the source text "abc<sp><sp><sp>xyz" may become "abc<sp>xyz" in the DOM, but any CFI range will refer to the original source instead.
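
The offset drift described above can be illustrated with a minimal Python sketch. The `collapse_whitespace` helper is a hypothetical stand-in for whatever collapsing a rendering engine applies; it is not taken from any CFI implementation.

```python
import re

def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with one space (assumed normalization)."""
    return re.sub(r"\s+", " ", text)

source = "abc   xyz"                     # three spaces, as in the source text
collapsed = collapse_whitespace(source)  # what a collapsing engine might expose

source_offset = source.index("x")        # offset counted in the raw source
collapsed_offset = collapsed.index("x")  # offset counted after collapsing

print(source_offset, collapsed_offset)
```

The same character "x" sits at offset 6 in the source but at offset 4 after collapsing, so an offset produced under one model cannot be resolved under the other.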

--
You received this message because you are subscribed to the Google Groups "EPUB Working Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to epub-working-gr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Jim Dovey

Mar 30, 2013, 4:33:01 PM

to Daniel Weck, epub-work...@googlegroups.com, bor...@evidentpoint.com

I've always assumed that CFIs would be both created and consumed by JavaScript (or otherwise from a DOMRange object). I can see how they might be implemented based on a simple stream parser built using a simple token-recognition engine; I'm not certain how often that would take place, however. I always assumed that they would be created based on user input (highlight range, page position) and would be resolved purely for use by an in-use web view as well.

If anyone can give guidance on other use cases more suited to situations where only a character stream view of a document is available then I'd be glad to take that into account. I'd honestly not even realized the discrepancy existed. Durr.

If necessary, we might require that all character-stream applications of CFIs would apply only to documents canonicalized using C14N1.1 <http://www.w3.org/TR/xml-c14n11/> or even XML Normalization <http://www.w3.org/2008/xmlsec/Drafts/xml-norm/Overview.html>. The latter works better with a simple tokenizing parser, and I'd be quite happy if it got some interest outside the W3C for reasons which should be apparent upon reading the headers :-P

Sent from my iPhone

On 2013-03-30, at 2:44 PM, "Daniel Weck" <danie...@gmail.com> wrote:

In practice, the original XHTML "source" is not available at the Javascript / DOM layer, so if I am not mistaken, real world implementations (such as the Readium-SDK [1]) compute CFI character offsets based on DOM text nodes. At this stage in the XML parsing / processing pipeline, "insignificant" whitespace (such as source indentation / spacing characters used for pretty-formatting the original source) has already been discarded, and adjacent whitespace characters within mixed-content XML fragments have already been collapsed into single spaces.

So, as much as I appreciate the design motivation behind CFI's "DOM-independent" approach, we need to ascertain that this is a realistically-implementable approach. What I have seen so far seems to contradict this assumption, but perhaps I am missing something. Jim, Boris, any thoughts?

[1]
https://github.com/readium/readium-sdk/blob/master/ePub3/ePub/cfi.cpp
https://github.com/readium/SDKLauncher-OSX/blob/master/Scripts/lib/epub_cfi.js
https://github.com/readium/SDKLauncher-OSX/blob/master/Scripts/js/views/cfi_navigation_logic.js

David K. Ream

Mar 30, 2013, 5:46:22 PM

to epub-work...@googlegroups.com

The Indexes Working Group discussed this quite a bit though it doesn’t impinge on the draft spec since we allowed CFIs or id/href “linking”.

But we did note in discussion that indexers have their own tools for creating indexes (embedded within InDesign or Word, or stand-alone CINDEX, Sky …) which are unlikely to provide CFI support anytime soon. The timing of how indexes are written with respect to the writing of the text and its publishing to an EPUB also works against this.

It seemed more likely that id/href based index links would be provided by publishers. Then the EPUB “build” process/software might convert id/href linking into CFIs when the index is added to the EPUB.

I would like to hear other thoughts on this issue.

Dave

Daniel Weck

Mar 30, 2013, 9:05:10 PM

to Jim Dovey, epub-work...@googlegroups.com, bor...@evidentpoint.com

Yes Jim, the CFI specification needs to clarify which XML parsing / processing model must be used in order to enable interoperable character offsets in CFI expressions. I am not convinced that XML c14n meets the expectations of real-world CFI implementations, primarily because of conflicting whitespace handling in a typical DOM/Javascript context. I am not familiar with XML Normalisation: at a quick glance, leading/trailing whitespace handling looks consistent with c14n, but I am not sure I understand the parsing rules for collapsing interspersed whitespace in XML mixed-content fragments. Handling of expanded entities and CDATA sections looks less problematic.

Personally, I feel that the model that determines the calculation of CFI character offsets should be as compatible as possible with DOM text nodes exposed in web browsers via the Javascript API (i.e. post-parse data model, not source character stream). However, I realise that this wasn't an original design goal in CFI, so perhaps the merit of a DOM-independent approach is greater than that of my pragmatic / practical viewpoint, in which case I would need to be educated (e.g. what are the use-cases for source-level CFI expressions?).

At any rate, we need to reach consensus pretty swiftly before CFI implementations proliferate ;)

Regards, Daniel

Daniel Glazman

Mar 31, 2013, 3:21:28 AM

to epub-work...@googlegroups.com

On 31/03/13 03:05, Daniel Weck wrote:

> Personally, I feel that the model that determines the calculation of CFI
> character offsets should be as compatible as possible with DOM text
> nodes exposed in web browsers via the Javascript API (i.e. post-parse
> data model, not source character stream). However, I realise that this
> wasn't an original design goal in CFI, so perhaps the merit of a
> DOM-independent approach is greater than that of my pragmatic
> / practical viewpoint, in which case I would need to be educated
> (e.g. what are the use-cases for source-level CFI expressions?).
> At any rate, we need to reach consensus pretty swiftly before CFI
> implementations proliferate ;)

The problem is simple: whatever is decided must be implementable in
applications embedding the current browsers/rendering engines and
reachable/callable from JavaScript.
A model relying on whitespace detection at parsing time and not at DOM
time is then purely utopian, sorry. And CanonicalXML is about the same;
betting on such a requirement should be a no-go...

</Daniel>

Daniel Weck

Mar 31, 2013, 12:27:06 PM

to Jim Dovey, epub-work...@googlegroups.com, bor...@evidentpoint.com

Jim, in the Readium SDK implementation, do you index each and every location between element tags (odd indices) even when the text node collection is empty? (I believe the answer is yes, but I wanted to double-check) This is implementable even at the DOM level, by numbering inter-element locations regardless of the presence of interspersed text nodes (XML mixed content model vs. adjacent element start/end tags). Daniel

Jim Dovey

Mar 31, 2013, 1:55:01 PM

to Daniel Weck, epub-work...@googlegroups.com, bor...@evidentpoint.com

I think the answer is "yes" although I think about it in a much simpler way: even numbers refer only to elements, and are resolved/created by halving the index of the given element in the DOM. I frankly don't care whether text nodes are interspersed or not, and I don't believe the spec makes any particular commandment that odd nodes be 'counted'.

So for a <span> element I'm targeting, I'll query its index attribute: that's 5, so the CFI index will be 10. When resolving I'll halve the 10 to get 5, and will use parent.children[5] to locate the <span> again.

For text nodes, I locate the preceding element and take its CFI index, then add 1. In essence, I assume that the content is a single text node. Then for the character location I'll use the value from the DOMRange itself, or I'll calculate the offset from the start of the text node following the prior element node.

Usually I'll rely on the values in the DOMRange though; the design of CFI seemed so similar to DOMRange that my assumption was that this was the intention.

If I were to explicitly codify everything, I'd say that CFI is intended to be interoperable with DOMRange, and that the odd-number text node references refer to an implied single text node between elements at the same level of the tree; also that these text nodes may be zero characters in size, meaning an offset of n:0 is the only allowed value in that case (this should always be considered valid for the implicit character position between two adjacent element nodes).
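
A minimal Python sketch of the indexing scheme described above, using the standard-library `xml.dom.minidom` in place of a browser DOM. The function names are hypothetical, and the sketch assumes elements are counted from 1 (so the first child element gets step 2), which is the convention of the CFI specification; it is an illustration, not the Readium SDK code.

```python
from xml.dom import minidom

def element_step(node):
    """Even CFI step: twice the 1-based position among sibling elements."""
    count = 0
    for sibling in node.parentNode.childNodes:
        if sibling.nodeType == sibling.ELEMENT_NODE:
            count += 1
        if sibling is node:
            return 2 * count
    raise ValueError("node is not a child of its parent")

def text_step(node):
    """Odd CFI step: step of the preceding element (0 if none) plus one."""
    count = 0
    for sibling in node.parentNode.childNodes:
        if sibling is node:
            return 2 * count + 1
        if sibling.nodeType == sibling.ELEMENT_NODE:
            count += 1
    raise ValueError("node is not a child of its parent")

def resolve_element(parent, step):
    """Inverse of element_step: halve the even step and pick that element."""
    elements = [c for c in parent.childNodes if c.nodeType == c.ELEMENT_NODE]
    return elements[step // 2 - 1]

doc = minidom.parseString("<p>intro<em>a</em>mid<span>b</span>tail</p>")
p = doc.documentElement
em = p.getElementsByTagName("em")[0]
span = p.getElementsByTagName("span")[0]
mid = em.nextSibling  # the text between <em> and <span>

print(element_step(em), element_step(span), text_step(mid))
```

Here `<em>` and `<span>` get the even steps 2 and 4, and the text between them gets the odd step 3, whether or not a parser happens to split that text into several nodes.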

I trust that you can glean the meaning from all that; if not I can happily write it up formally. I apologize for the ungainly description above— as you can see, this was:

Sent from my iPhone

Lee Passey

Mar 31, 2013, 3:06:45 PM

to epub-work...@googlegroups.com

On 3/30/2013 2:33 PM, Jim Dovey wrote:

> I've always assumed that CFIs would be both created and consumed by
> JavaScript (or otherwise from a DOMRange object). I can see how they
> might be implemented based on a simple stream parser built using a
> simple token-recognition engine; I'm not certain how often that would
> take place, however. I always assumed that they would be created based
> on user input (highlight range, page position) and would be resolved
> purely for use by an in-use web view as well.

I don't think that this is the genesis of CFI at all, and I don't think
that it was intended to be created or consumed by JavaScript. It was
intended to be created or consumed by whatever programming language was
used to create any particular user agent.

As I understand it, CFI was designed by Adobe specifically to enable
storage and sharing of bookmarks, highlights and notes for
encrypted/read-only documents.

CFIs have two features that make them ideal as a solution to the
bookmarking problem: 1. They are canonical. With XMLPath or XMLPointer
the same point in the text can be referenced using different
expressions; not so with CFI. A CFI has but one valid representation, so
it is easy to determine if two CFIs are identical points in the text. 2.
CFIs were designed to sort in document order, so if I had a list of
bookmarks I could easily display them in the order they appeared in the
document.
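
The sorting property can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: `cfi_sort_key` is a hypothetical helper that handles only simple (non-range) CFIs with an optional character offset, drops `[...]` assertions without handling escaped brackets, and ignores temporal and spatial offsets. The full ordering rules are those of the CFI specification.

```python
import re

def cfi_sort_key(cfi: str):
    """Sort key for a simple CFI: numeric steps first, then character offset."""
    body = cfi[len("epubcfi("):-1]          # strip the epubcfi( ... ) wrapper
    body = re.sub(r"\[[^\]]*\]", "", body)  # drop [..] assertions (simplified)
    path, _, offset = body.partition(":")
    steps = [int(s) for s in re.findall(r"/(\d+)", path)]
    return (steps, int(offset) if offset else -1)

bookmarks = [
    "epubcfi(/6/4!/4/10/3:5)",
    "epubcfi(/6/4!/4/2/1:0)",
    "epubcfi(/6/4[chap01]!/4/10/3:2)",
    "epubcfi(/6/2!/4/2)",
]

# No document is opened or parsed: the order comes from the strings alone.
ordered = sorted(bookmarks, key=cfi_sort_key)
for cfi in ordered:
    print(cfi)
```

Steps are compared as numbers rather than as text, so `/4/10` sorts after `/4/2`; a plain string sort would get this wrong.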

The advantage of CFI (and XPath/XPointer/DOMRange for that matter) over
imbedded anchors is that they can be calculated, and stored, outside the
target ePub. So, using CFI I could (in an enabled user agent) create a
set of notes from an encrypted ePub and save those notes in a file
external to the target. I could then pass those notes to any other
individual (a classmate, for example) saying, "if you buy that
particular encrypted ePub, you could associate it with this set of
notes." Likewise, it would be possible to create a stand-alone index to
an encrypted ePub and distribute it independently of the target file.

I think it might be worthwhile for the IDPF to specify the file format
for bookmarks associated with an ePub; simply adopting Adobe's format
might be the easiest way.

At first blush, I think that DOMRange and CFI could, in effect, be
losslessly round-tripped, but given the fact that XPath/XPointer is not
canonical (i.e. two different XPointer expressions can reference the
exact same DOM location) the resulting XPointer expressions may not be
syntactically identical. The new XPointer expressions should still
reduce to identical CFIs.

Bottom line: CFI is superior to XPath/XPointer when creating bookmarks,
highlights, notes and indexes created after an ePub has been released.
XPath, XPointer, DOMRange is an also-ran solution but by supporting only
CFI the IDPF has attempted to lessen the impact on user agent
implementers by requiring only a single method for out-of-document, ad
hoc references.

Unlike XPath/XPointer/DOMRange, CFI is extremely difficult for humans to
work with, but fairly easy algorithmically. About one year ago I wrote a
simple program to convert XPointer expressions to CFI. If anyone wants
the source (Java) I would be happy to provide it.

Daniel Weck

Mar 31, 2013, 6:48:40 PM

to epub-work...@googlegroups.com

On Sunday, March 31, 2013, Lee Passey wrote:

> At first blush, I think that DOMRange and CFI could, in effect, be losslessly round-tripped,

Lee, thanks for your thorough recap, which I am sure many will find useful.

However, please allow me to reframe the discussion around the interoperability issue regarding CFI text references: DOM vs. "unparsed" source character stream => differences in whitespace handling / collapsing which lead to discrepancies in calculated character offsets.

I think we should judge the merits of a particular approach based upon whether or not it can be realistically implemented. If CFI-based bookmarks / annotations are to be created via web-browser engines, then we should ascertain whether or not the DOM/Javascript layer is able to supply the kind of document information necessary to build CFI expressions. As it stands now, CFI wants to be agnostic to any Document Object Model, yet the real-world applicability of this approach seems seriously questionable because of the aforementioned character offset discrepancies.

Daniel Weck

Mar 31, 2013, 6:53:27 PM

to epub-work...@googlegroups.com

sorry, my email was sent too early by mistake :)

anyway, my point should have come across reasonably clearly, despite the typos in my draft :D

daniel

Daniel Weck

Mar 30, 2013, 2:44:36 PM

to epub-work...@googlegroups.com, Jim Dovey, bor...@evidentpoint.com

In practice, the original XHTML "source" is not available at the Javascript / DOM layer, so if I am not mistaken, real world implementations (such as the Readium-SDK [1]) compute CFI character offsets based on DOM text nodes. At this stage in the XML parsing / processing pipeline, "insignificant" whitespace (such as source indentation / spacing characters used for pretty-formatting the original source) has already been discarded, and adjacent whitespace characters within mixed-content XML fragments have already been collapsed into single spaces.

So, as much as I appreciate the design motivation behind CFI's "DOM-independent" approach, we need to ascertain that this is a realistically-implementable approach. What I have seen so far seems to contradict this assumption, but perhaps I am missing something. Jim, Boris, any thoughts?

[1]
https://github.com/readium/readium-sdk/blob/master/ePub3/ePub/cfi.cpp
https://github.com/readium/SDKLauncher-OSX/blob/master/Scripts/lib/epub_cfi.js
https://github.com/readium/SDKLauncher-OSX/blob/master/Scripts/js/views/cfi_navigation_logic.js

Brady Duga

Apr 2, 2013, 11:40:24 AM

to epub-work...@googlegroups.com

The problem is, which browsers' capabilities should CFI be based around? If the purpose of CFI is to be interoperable, we need a mechanism that works in every browser, but browsers differ in the way they handle white space and various CSS styling. For instance, innerText (IE, WK browsers) reports text with whitespace collapsed and styling applied (display: none elements have no text), while textContent (FF, WK browsers) returns the original source text without whitespace collapsing or styling. I have no idea what quirks exist with DOMRanges across browsers. In addition, relying on node index is a mistake, since different browsers may handle various text-like nodes (CDATA, regular text) differently in the DOM. That's why everything before the first element node is marked as '1'. I don't see how we get to an interoperable solution if we base CFI on the various DOMs created by all browsers.

Jim Dovey

Apr 2, 2013, 12:11:22 PM

to <epub-working-group@googlegroups.com>, epub-work...@googlegroups.com

I should clarify my reference to node-index when generating CFIs: I look through the list from 1 to n and count the number of element nodes I encounter.

Sent from my iPhone

Roger Webster

Apr 2, 2013, 12:52:07 PM

to epub-work...@googlegroups.com

There’s also the annoying pragmatics of round-tripping CFIs to Adobe Location Strings (from RMSDK). We do that trivially now, and we have millions of location strings stored that must still work when we switch to an ePub 3 engine. Something that breaks that (i.e., a non-round-trippable solution) is a non-starter for us.

This electronic mail message contains information that (a) is or
may be CONFIDENTIAL, PROPRIETARY IN NATURE, OR OTHERWISE
PROTECTED
BY LAW FROM DISCLOSURE, and (b) is intended only for the use of
the addressee(s) named herein. If you are not an intended
recipient, please contact the sender immediately and take the
steps necessary to delete the message completely from your
computer system.

Not Intended as a Substitute for a Writing: Notwithstanding the
Uniform Electronic Transaction Act or any other law of similar
effect, absent an express statement to the contrary, this e-mail
message, its contents, and any attachments hereto are not
intended
to represent an offer or acceptance to enter into a contract and
are not otherwise intended to bind this sender,
barnesandnoble.com
llc, barnesandnoble.com inc. or any other person or entity.

Daniel Weck

Apr 2, 2013, 12:58:39 PM

to epub-work...@googlegroups.com

Well, here are 2 possible options:

(1) CFI text offsets are based on the source XHTML (raw, unparsed character stream => insignificant whitespace is included, entities are not expanded).

Benefits:

CFI offset calculation rules can be specified regardless of parser discrepancies (e.g. Mozilla DOM includes whitespace in DOM tree, which must be "cleaned-up" programmatically at the application layer when working directly with text nodes). This *theoretically* guarantees cross-platform interoperability.

Drawbacks:

Bookmark/annotation implementations that are browser-based (99.9% of them?) cannot reliably use the DOM exposed by Javascript, because the post-parse XHTML artifact may contain collapsed whitespace, etc. Consequently, implementations must parse the XHTML in a separate pass, and map the DOM Range objects exposed at the UI layer to the underlying character stream. On a *practical* level, this is an unreasonable expectation (it is both inefficient and hard to implement correctly).

Furthermore, any minute changes in the source XML (e.g. formatting) would break existing CFI references.

(2) CFI text offsets are based on a normalised XHTML DOM.

Benefits:

Implementations can realistically produce CFI expressions that are interoperable across platforms. CFI textual links are less sensitive to "insignificant" changes in XML formatting.

Drawbacks:

The CFI specification must define precisely what the "normalised" DOM is (note that the DOM API includes a normalize() function).

Specific implementations may have to manually implement some of the DOM normalisation routine, in cases where there are discrepancies in Gecko/Webkit/Trident, etc. compared to the CFI reference model.

PS: using the innerText property seems like a bad idea; I would assume that the lower-level childNodes[i].nodeValue would be used instead for building a "normalised" view of the DOM.
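
To make the difference between options (1) and (2) concrete, here is a minimal Python sketch using the standard-library `xml.dom.minidom` as a stand-in for a browser parser. The sample text and variable names are illustrative assumptions; the sketch only shows how entity expansion shifts offsets between the raw stream and the parsed text.

```python
from xml.dom import minidom

raw = "Tom &amp; Jerry"  # character data as written in the XHTML source

# Parse a one-paragraph document and read back the expanded text.
parsed = minidom.parseString(f"<p>{raw}</p>").documentElement.firstChild.nodeValue

raw_offset = raw.index("Jerry")        # option (1): counted in the raw stream
parsed_offset = parsed.index("Jerry")  # option (2): counted in the parsed text

print(repr(parsed), raw_offset, parsed_offset)
```

The word "Jerry" starts at offset 10 in the raw source but at offset 6 once `&amp;` has been expanded, so the two options yield different CFI character offsets for the same selection.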

Thoughts?

Dan

Jim Lester

Apr 2, 2013, 1:58:01 PM

to epub-work...@googlegroups.com

> Bookmark/annotation implementations that are browser-based (99.9% of them?)

This number is off, unless you are limiting yourself to fully compliant EPUB3 readers. If you include EPUB2 readers then this number is most likely reversed (i.e. 1% of the current EPUB readers are browser-based). Interoperability between existing readers and future readers, especially around bookmarks, annotations, and current reading position, is a must-have for us, and those existing readers are using either CFIs or RMSDK Location Strings (which have a trivial transformation to a CFI).

Also, as was mentioned, CFIs can be compared without parsing the document to create a DOM, which is a property that we make use of for the millions (constantly growing) of locations (Bookmarks, Annotations, Reading Positions) that we have stored for our users.

--

-jim

Jim Dovey

Apr 2, 2013, 2:25:11 PM

to <epub-working-group@googlegroups.com>

On 2013-04-02, at 12:52 PM, Roger Webster <rweb...@book.com> wrote:

There’s also the annoying pragmatics of round-tripping CFIs to Adobe Location Strings (from RMSDK). We do that trivially now, and we have millions of location strings stored that must still work when we switch to an ePub 3 engine. Something that breaks that (i.e., a non-round-trippable solution) is a non-starter for us.

Given that Adobe is not implementing or supporting EPUB 3 *at all* in RMSDK, I'm not actually convinced that we should worry about converting from a CFI back to a location string. That would amount to holding hostage the EPUB CFI standard based on a single entity's decision not to support it. Now, if they *did* support EPUB 3, and they'd shipped a LOT of software with particular quirks, then that might be a different situation.

Ideally we will be able to convert *to* CFIs, so that those identifiers can be read by EPUB 3-compliant engines, and anyone using Adobe locations for EPUB 2 content is welcome to add support for that format into e.g. Readium SDK (or their own app built upon it), and use that in lieu of CFI when bookmarking EPUB 2 content.

Remember that the onus on compatibility is that an EPUB 3 reading system should be able to handle EPUB 2 (and presumably OEBPS) content— there's no requirement that an EPUB 2 system should be able to handle data defined as part of EPUB 3, or that anything defined by EPUB 3 necessarily be representable in a manner compatible with EPUB 2. It's nice if it works, sure, but there's no requirement of that. Compare to <noscript> usage: while it's recommended that EPUB 3 files using JavaScript offer unscripted fallbacks either through a manifest item's fallback attribute or through <noscript> tags in the content, there's no requirement that this be done, and it's actually fairly uncommon to see such fallbacks in the wild. A pure-spec EPUB 2 renderer would likely choke on a LOT of more advanced EPUB 3 content.

_________________________________________

Jim Dovey

Digital Content Format Evangelist | Kobo Inc.

jdo...@kobo.com

C: (416) 716-0413

135 Liberty St. | Suite 101 | Toronto, ON | M6K 1A7

Daniel Weck

Apr 2, 2013, 3:28:42 PM

to epub-work...@googlegroups.com

With a normalised-DOM approach, CFI expressions that reference the same document are comparable/sortable without having to parse the XHTML into an in-memory object model.

And yes, I meant "a great majority (if not all) of" EPUB *3* reading systems.

/dan

Brady Duga

Apr 2, 2013, 4:10:20 PM

to epub-work...@googlegroups.com

My comment about innerText was just to point at cross-browser issues, not as an implementation suggestion. nodeValue has similar problems (IE and WK/FF do different things). This also raises the issue of what happens to content inserted via styling. So, if I have foo:before { content: "abc" } do I count the "abc"? What if that style is in a media query? I don't think the built-in DOM normalization helps with these issues, but I can't say I have ever used it. Maybe someone else can chime in on that topic. We could choose a normalization that works both in browsers and as a raw stream, for instance by skipping all white space (keeping in mind that \s returns different results across browsers!) in text when calculating character offsets. Of course, that makes it impossible to point at white space, so marking up a Python manual might be hard.
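
The whitespace-skipping normalization suggested above can be sketched as follows. `non_whitespace_offset` is a hypothetical helper, and it relies on Python's `str.isspace()` definition of white space, which is itself an assumption given the cross-engine differences just mentioned.

```python
def non_whitespace_offset(text: str, index: int) -> int:
    """Count the non-whitespace characters that precede text[index]."""
    return sum(1 for ch in text[:index] if not ch.isspace())

source = "abc   xyz"   # raw stream, three spaces
collapsed = "abc xyz"  # the same text after whitespace collapsing

from_source = non_whitespace_offset(source, source.index("x"))
from_collapsed = non_whitespace_offset(collapsed, collapsed.index("x"))

# Every position from the first space up to "x" maps to one offset.
inside_run = {non_whitespace_offset(source, i) for i in range(3, 7)}

print(from_source, from_collapsed, inside_run)
```

Both views of the text agree on the offset of "x", which is the interoperability gain. The cost is the one noted above: all positions inside the whitespace run collapse to a single offset, so white space itself can no longer be addressed.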

Roger Webster

Apr 2, 2013, 4:32:03 PM

to epub-work...@googlegroups.com

It’s more complicated than that.

We will have ePub 2.0 renderers for several years; indefinitely, perhaps, on eInk devices. These devices use RMSDK and use location strings for bookmarks, annotations, etc.

Future Reading Systems will support ePub 3.0, but MUST support interoperable bookmarks, et al. with older engines, or people get very annoyed with us. There have been lawsuits filed over “lost” annotations.

CFIs were designed to be interoperable with location strings (they were at least partly designed by the same guy).

Unless there’s a trivial, accurate mapping from a location string to whatever-you-envision-to-replace CFIs, again, it’s a non-starter for us.

Daniel Weck

Apr 4, 2013, 7:07:06 PM

to epub-work...@googlegroups.com

So, let us try to move towards a proposed solution.

We can leave the CFI specification as it is, i.e. character offsets are based on raw XHTML text stream, no (DOM) parsing / normalisation. In this case, the Readium-SDK (and all other web-browser-based implementations?) must be updated to handle whitespace and entities correctly. This requires reading the XHTML source in a separate pass, because the loaded DOM (accessible via its Javascript API) does not necessarily retain whitespace in a way that is consistent with the raw source, and the entity expansion / resolving mechanism gets in the way as well. Speaking of which, I would like to hear a clarification with regard to the handling of HTML entities in RMSDK locations: how does a user interface map user-generated text selections to underlying CFI ranges, if entities are not expanded by a parser? (Note that although entities usually resolve to PCDATA, they could also resolve to CDATA or an XML fragment.) The CFI specification is currently rather obscure about this.

The other option is for CFIs to be designed / aligned with web browser technology (as suggested by contributors to this discussion), which means considering a normalised DOM (in order for CFI references to be interoperable across browser engines). As I said before: in this mode, CFI expressions would remain canonical (unique), and comparable / sortable without having to parse the referenced document to an in-memory representation. The RMSDK locations used in EPUB2 readers would need to be upgraded to CFI references when moving content to an EPUB3 reading system.

Please send your comments as soon as possible so that I can compile technical + political arguments, on time for when this discussion escalates to a concall resolution. Many thanks!

Daniel

Jim Dovey

Apr 4, 2013, 7:23:53 PM

to <epub-working-group@googlegroups.com>, epub-work...@googlegroups.com

I honestly have no idea how to go from a JavaScript touch event (the only way non-Apple folks can get selection events from WebKit on iOS) with a DOMRange to a location in a source file. Can anyone enlighten me? Please don't point me at a WebCore API that maps a tree node to its location in a text file— if you do that I shall cry, because I can't use it on iOS :-(

Sent from my iPhone

Romain Deltour

Apr 4, 2013, 7:40:24 PM

to epub-work...@googlegroups.com

> We can leave the CFI specification as it is, i.e. character offsets are based on raw XHTML text stream

As far as I understand, the current CFI spec does not even clearly say it's based on the raw XHTML text stream. Am I missing something? If we opt for this option, that should be clarified.

> The other option is for CFIs to be designed / aligned with web browser technology

I believe the spec should be based on a clearly defined *data model*.

For the sake of completeness, note that one of the possible candidates is the XPath Data Model [1], which I believe could be mapped from a DOM (about that, note the following issue [2]). In the context of CFI, one of the benefits of XDM over a DOM is that it clarifies the notion of adjacent text nodes or empty text nodes.

That said, a "normalized" DOM would be equally fine and maybe more aligned with what RS are actually manipulating.

Romain.

[1] http://www.w3.org/TR/xpath-datamodel/

[2] https://www.w3.org/Bugs/Public/show_bug.cgi?id=20321

Daniel Weck

Apr 4, 2013, 7:45:37 PM

to epub-work...@googlegroups.com

On 5 Apr 2013, at 00:40, Romain Deltour wrote:
> AS far as I understand the current CFI spec does not even clearly say it's based on the raw XHTML text stream. Am I missing something ? If we opt for this option, that should be clarified.

It is a bit "hidden" (and unclear), I agree totally:

http://www.idpf.org/epub/linking/cfi/#sec-path-child-ref

"
This indexing method ensures that node identification is not sensitive to XML parser handling of whitespace text nodes, CDATA sections and entity references (e.g., to avoid the ambiguity that can arise depending on whether a parser collapses whitespace-only text nodes, keeps text, CDATA sections and entity references as distinct nodes or doesn't, or breaks text in multiple nodes).
"
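
For readers following along, the indexing method that this sentence refers to can be sketched in a few lines of Python. This is only an illustration under assumptions: it takes the rule to be that child elements get even indices starting at 2 and that each run of character data between elements, even an empty one, gets an odd index starting at 1. The helper name `cfi_child_steps` is hypothetical, and Python's `xml.dom.minidom` stands in for whatever parser a reading system uses.

```python
from xml.dom import Node, minidom


def cfi_child_steps(element):
    """List (index, content) pairs for the children of a DOM element.

    Assumed rule: elements take even indices (2, 4, ...); each run of
    character data between elements, even an empty one, takes an odd index.
    """
    steps = []
    index = 1   # odd index of the chunk currently being collected
    chunk = []  # character data seen since the previous element
    for child in element.childNodes:
        if child.nodeType == Node.ELEMENT_NODE:
            steps.append((index, "".join(chunk)))
            steps.append((index + 1, child.tagName))
            index += 2
            chunk = []
        elif child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            # Text and CDATA land in the same chunk however the parser split them.
            chunk.append(child.data)
        # Comments and processing instructions do not create indices here.
    steps.append((index, "".join(chunk)))
    return steps


doc = minidom.parseString("<p>Hello <em>big</em><![CDATA[ wide ]]>world<br/></p>")
steps = cfi_child_steps(doc.documentElement)
```

Because the text and CDATA nodes between `<em>` and `<br/>` collapse into the single chunk at index 3, the result does not depend on how many DOM nodes the parser produced for that run.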

Romain Deltour

Apr 4, 2013, 7:56:40 PM

to epub-work...@googlegroups.com

Right. What I meant is that as soon as we start talking about "nodes" we're in the realm of a data model. XML itself does not introduce the concept of "text nodes"; it's all about character data and markup [1] ;)

Romain.

[1] http://www.w3.org/TR/REC-xml/#syntax

Peter Sorotokin

Apr 9, 2013, 2:11:53 PM

to epub-work...@googlegroups.com

Sorry for jumping into the discussion a bit late, I do not read this group too often.

Could someone clarify what the problem is with implementing CFI in JavaScript?

I believe that parsing the source as XML is a requirement for EPUB for many other (non-CFI-related) reasons; if people seriously think they can get away with not doing it, let's discuss it. (I understand that you can prototype without it; I am talking about developing a reasonably secure and conformant system.)

If you go down the XML parsing path, are there any specific problems people can flag?

Entity references should be treated as text nodes. This would cover the normal use of entities whether or not they are expanded in the DOM. It is true that this does not handle external parsed entities that can bring in elements. That's unfortunate, but is it a real use case? Do browsers support them (it does not seem so)? Does anyone support them?

Peter

Ori Idan

Apr 9, 2013, 3:20:43 PM

to Epub working group

On Tue, Apr 9, 2013 at 9:11 PM, Peter Sorotokin <soro...@gmail.com> wrote:

> Sorry for jumping into the discussion a bit late, I do not read this group too often.
>
> Could someone clarify what the problem is with implementing CFI in JavaScript?

The problem as I understand it is the different ways the DOM is handled by different JavaScript engines, so this is not a problem of the JavaScript language but more of the DOM.

--

Ori Idan

Peter Sorotokin

Apr 9, 2013, 3:28:16 PM

to epub-work...@googlegroups.com

I think this is the case for the HTML DOM in the browser window (or for APIs like innerHTML). I think XML parsing (in DOMParser or XMLHttpRequest.responseXML) is much more consistent (and does not strip whitespace, for instance).

Ori Idan

Apr 9, 2013, 4:01:58 PM

to Epub working group

On Tue, Apr 9, 2013 at 10:28 PM, Peter Sorotokin <soro...@gmail.com> wrote:

> I think this is the case for the HTML DOM in the browser window (or for APIs like innerHTML). I think XML parsing (in DOMParser or XMLHttpRequest.responseXML) is much more consistent (and does not strip whitespace, for instance).

XMLHttpRequest is, as far as I know, used more for Ajax, so I consider the name a misnomer.

I am not sure that XML parsing using DOMParser is much different from HTML DOM parsing.

--

Ori Idan

Peter Sorotokin

Apr 9, 2013, 4:34:10 PM

to epub-work...@googlegroups.com

It is certainly different than HTML DOM.

Daniel Weck

Apr 17, 2013, 10:25:34 AM

to epub-work...@googlegroups.com

FYI, "natevw" says in the issue tracker:

"

Why is "termstep" optional in "local_path" used by "range" in the grammar?

If "termstep" were required, then it would be clear that DOM ranges and CFI ranges cannot be round-tripped — a CFI range would always use spots on two leaves, while DOM could also put its hands on each side of a branch.

Since "termstep" is optional, I'd suggest the spec should allow the final step to refer to both text collections AND elements not actually present in the document. This retains DOM-independence for purposes of sorting and comparison.

"

Daniel Weck

Apr 17, 2013, 10:34:15 AM

to epub-work...@googlegroups.com

So, this is a pretty critical issue and we seem to have various diverging opinions on what the CFI data model should be (whitespace handling, entities expansion), compatibility with Adobe's RMSDK locations (non-DOM), practical reality of EPUB3 implementations (DOM), etc.

I suggest that we allocate some discussion time during next week's conference call (I hope that Peter Sorotokin and Jim Dovey can make it, as well as anyone else involved in Adobe's RMSDK or Readium-SDK -related products). In the meantime, please feel free to continue to comment in this email thread.

Regards, Daniel

Jim Dovey

Apr 17, 2013, 10:53:53 AM

to <epub-working-group@googlegroups.com>

On 2013-04-17, at 10:34 AM, Daniel Weck <danie...@gmail.com> wrote:

> So, this is a pretty critical issue and we seem to have various diverging opinions on what the CFI data model should be (whitespace handling, entities expansion), compatibility with Adobe's RMSDK locations (non-DOM), practical reality of EPUB3 implementations (DOM), etc.
> I suggest that we allocate some discussion time during next week's conference call (I hope that Peter Sorotokin and Jim Dovey can make it, as well as anyone else involved in Adobe's RMSDK or Readium-SDK -related products). In the meantime, please feel free to continue to comment in this email thread.

I'll set multiple alarms for this one to make sure I don't miss it (again!).

Ric Wright

Apr 17, 2013, 11:01:29 AM

to epub-work...@googlegroups.com

I believe there is an alarm on that iPhone of yours… :-)

Ric

From: Jim Dovey <jdo...@kobo.com>
Reply-To: <epub-work...@googlegroups.com>
Date: Wednesday, April 17, 2013 9:53 AM
To: "<epub-work...@googlegroups.com>" <epub-work...@googlegroups.com>
Subject: Re: [301][NeedsDiscussion] CFI (W3C DOM Range, text-node vs text-range)

Ric Wright

Apr 17, 2013, 11:04:19 AM

to epub-work...@googlegroups.com

Apologies, hit wrong reply button. Please ignore.

Ric

Peter Sorotokin

Apr 17, 2013, 12:23:47 PM

to epub-work...@googlegroups.com

Actually, reading the original issue, I do not see how it is coupled with the discussion on this thread. There are two questions raised in the issue:

The first is whether a fictional extra element index should be allowed in CFI for convenience. I think the current spec does not allow it, but neither do I see a strong reason to disallow it. (It creates a fictional position after the last child of an element, but it does not create aliasing if you treat this position as distinct from the last position of the last child.)

The other is whether the position refers to the space between nodes and characters or to the nodes/characters themselves. I think it refers to the space between nodes/characters.

I do not think that these two questions can be decided independently of the relationship of CFIs to W3C DOM or DOM Ranges.

Moving to address some points made in this thread (but not in the issue):

The CFI spec is written in terms of the XML specification, not DOM (and certainly not raw text!). Writing it in terms of DOM would require defining what the canonical DOM representation of the XML is. But it can be implemented in terms of DOM, if the parser is not too aggressive and care is taken not to be sensitive to parser variation. It could probably be implemented on raw text (but I am not sure why one would). This is similar to other pointer specs, such as XPath or the XPointer element scheme. Redefining it on top of DOM would be a mistake IMHO.

Browser DOM implementability is certainly a useful discussion and I think we need to make sure that CFI can be implemented reasonably in the browsers. I do not think that DOM Ranges are necessary (or convenient) for such implementation (although they are useful as a data structure to represent the parsed position and the only way to do any UI-related manipulations, such as highlighting).

As far as I can tell, the DOM parser in browsers preserves insignificant whitespace and expands entities. It is quite suitable for a CFI implementation. It does have problems with external resources (e.g. DTDs), but that is a problem no matter how CFI is defined. So I do not really see what kind of problems with CFI implementability in the browsers there are.

One clarification worth mentioning: In XML, "\r\n" is always treated as a single "\n" (section "2.11 End-of-Line Handling" of the XML spec). This is what XML parsers do (including those in the browsers) and that should apply to CFIs as well. If looking at CFIs in terms of raw text, that should be handled correctly.
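
As a small illustration of the end-of-line point, here is a Python sketch that compares what an XML parser reports with a hand-written normalisation of the raw text. The helper `normalize_eol` is a hypothetical name, and Python's `xml.etree.ElementTree` (backed by expat) stands in for a browser parser, so this shows the XML rule rather than any particular reading system.

```python
import xml.etree.ElementTree as ET


def normalize_eol(raw):
    """Apply XML 1.0 section 2.11 by hand: CRLF and lone CR both become LF."""
    return raw.replace("\r\n", "\n").replace("\r", "\n")


raw_text = "line one\r\nline two\rline three"

# What the parser hands to the application after end-of-line handling.
parsed = ET.fromstring(f"<p>{raw_text}</p>").text

# Character offsets counted on the raw bytes drift by one for every CRLF.
drift = len(raw_text) - len(parsed)
```

An implementation that counts offsets on the raw text stream would need to apply the same normalisation first, otherwise every character after a CRLF is off by one compared with a parser-based implementation.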

--

Peter Sorotokin

Apr 17, 2013, 12:25:15 PM

to epub-work...@googlegroups.com

Sorry, I meant "I think that these two questions can be decided independently of the relationship of CFIs to W3C DOM or DOM Ranges", not "I do not think"

Kevin Ballard

Apr 17, 2013, 1:22:45 PM

to epub-work...@googlegroups.com, Peter Sorotokin

These are definitely two different questions. The space question is valid, although I think you're right when you say the DOM preserves insignificant whitespace (it must; after all, if you flip the style of an element to have `white-space: pre`, it starts rendering the previously-collapsed whitespace).

The original question I can answer. As it stands today, the CFI spec disallows pointing to a fictional element after the last text node. Thus you cannot produce a CFI that's equivalent to a DOMRange that selects the entire final text node in a container (as opposed to selecting the text node's contents). This was an unintentional restriction. It was obvious that pointing to a fictional text node at the end should be allowed in order to represent a range that selects an element, but no thought was given to a range that selects the final text node.

However, I think this is something that is worth fixing, and doing so now before CFI implementations become too widespread. Speaking personally, even though CFI was not originally specced to be able to map 1-to-1 with DOMRanges, and won't even with this hole fixed (how does a DOMRange point at a spatial offset in an image, or a temporal offset in a video? And DOMRanges can be sensitive to the presence of processing instructions and comments), being able to represent the vast majority of DOMRanges (or the entirety of what I would consider to be useful DOMRanges) is useful enough that I think we should consider it.

My recommendation is that the first bullet point in section 3.1.1 should be amended to state that if the final element is succeeded by a non-empty collection of text nodes, then the following even index is considered valid.

-Kevin

Peter Sorotokin

Apr 17, 2013, 2:48:18 PM

to epub-work...@googlegroups.com

This makes sense. For the fictional node I certainly see some pros, and I can also see some cons: a fictional element is a complex construct that can be confusing. The wording has to exclude aliasing (having two CFIs for the same position) and avoid infinitely nested fictional nodes.

I would probably allow this fictional element in any real (non-fictional) element, because explaining what counts as a non-empty collection of text nodes is tricky. What about comments? <![CDATA[]]>? Entity references expanding to nothing? Who knows what various parsers in the world do in such cases?
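
The three positions discussed so far can be compared side by side with a small Python sketch. Everything in it is an assumption made for illustration: the function name `valid_child_steps`, the policy labels, and the reading that "strict" matches the spec as described above (no even index after the trailing text), "non_empty_tail" matches the amendment proposed for section 3.1.1, and "always" matches the alternative of allowing the fictional element in any real element.

```python
def valid_child_steps(n_elements, trailing_text, policy="strict"):
    """Child-step indices accepted under one of three hypothetical policies.

    n_elements:    number of real child elements (indices 2, 4, ..., 2n)
    trailing_text: character data after the last child element
    """
    last_chunk = 2 * n_elements + 1         # odd index of the trailing text chunk
    steps = list(range(1, last_chunk + 1))  # indices that address real positions
    fictional = last_chunk + 1              # even index one past the trailing chunk
    if policy == "always":
        steps.append(fictional)
    elif policy == "non_empty_tail" and trailing_text != "":
        steps.append(fictional)
    elif policy not in ("strict", "non_empty_tail"):
        raise ValueError("unknown policy: " + policy)
    return steps


strict = valid_child_steps(2, "tail", "strict")
amended = valid_child_steps(2, "tail", "non_empty_tail")
amended_empty = valid_child_steps(2, "", "non_empty_tail")
always = valid_child_steps(2, "", "always")
```

Note that the "non_empty_tail" branch has to decide what "non-empty" means; the sketch uses a plain string comparison, which is exactly the kind of choice (whitespace, empty CDATA, entities expanding to nothing) that the "always" policy avoids.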

Daniel Weck

Apr 17, 2013, 3:24:16 PM

to epub-work...@googlegroups.com

On 17 Apr 2013, at 17:23, Peter Sorotokin wrote:
> CFI spec is written in terms of XML specification, not DOM (and certainly not raw text!).

Then, perhaps the CFI specification should clarify that the data model is XML InfoSet (just like XPointer does, although note that XPath defines its data model in terms of a tree of nodes, not XML InfoSet).

http://www.w3.org/TR/xml-infoset/

...or at least reference XML 1.0 (5th edition), with certain conditions (e.g. entity references expanded by default, like in XML-InfoSet)

http://www.w3.org/TR/REC-xml/

In my opinion, the current prose is at best ambiguous, at worst misleading:

http://www.idpf.org/epub/linking/cfi/#sec-path-child-ref

"
This indexing method ensures that node identification is not sensitive to XML parser handling of whitespace text nodes, CDATA sections and entity references (e.g., to avoid the ambiguity that can arise depending on whether a parser collapses whitespace-only text nodes, keeps text, CDATA sections and entity references as distinct nodes or doesn't, or breaks text in multiple nodes).
"

I would also suggest adding something along the lines of:

When calculating character offsets (odd indices) in CFI expressions:
- Processing instructions and comments are ignored.
- Contiguous text nodes (e.g. when a run of text is separated / divided by a comment) are preserved as individual nodes, there is no implicit merging of text nodes.
- All whitespace is preserved, within runs of character data, and between elements. There is no collapsing of contiguous whitespaces. There is no trimming.
- The content of CDATA sections is treated with the same whitespace rules as above.
- Entities are expanded before CFI processing is applied.
- etc.
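
One possible reading of the rules above, sketched in Python for illustration only: the helper `text_runs` is a hypothetical name, and Python's `xml.dom.minidom` stands in for a browser DOM. It skips comments and processing instructions, keeps each text-bearing node separate, preserves all whitespace, treats CDATA content like text, and relies on the parser to expand the predefined entity.

```python
from xml.dom import Node, minidom


def text_runs(element):
    """Character data of each text-bearing child, one entry per DOM node."""
    runs = []
    for child in element.childNodes:
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            runs.append(child.data)  # kept verbatim: no trimming, no collapsing
        # Comments, processing instructions and elements are skipped.
    return runs


source = "<p>  a<!-- note -->b &amp; c<?demo x?><![CDATA[ <d> ]]>\n</p>"
runs = text_runs(minidom.parseString(source).documentElement)
joined = "".join(runs)
```

In this example the comment separates `a` from `b`, so they end up in different runs, while the leading spaces, the trailing newline, and the spaces inside the CDATA section all survive.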

> Writing it in terms of DOM would require defining what is the canonical DOM representation of the XML is. But it can be implemented in therms of DOM, if the parser is not too aggressive and care is taken to not be sensitive to parser variation.

Yes, the potential problem in real-world implementations is indeed that there may be discrepancies in DOM parsing depending on the browser engine, i.e. differences in the data exposed via the (usually JavaScript) API. For example, this is what Gecko does (does anyone know of similar documentation for WebKit?):

https://developer.mozilla.org/en/docs/Whitespace_in_the_DOM

I would expect that in all modern browser engines, element.childNodes contains all whitespace (uncollapsed, untrimmed), and also CDATASections as first class objects.

> One clarification worth mentioning: In XML, "\r\n" is always treated as a single "\n" (section "2.11 End-of-Line Handling" of XML spec).

Right, good reminder.

Daniel

Daniel Weck

Apr 17, 2013, 3:36:47 PM

to epub-work...@googlegroups.com

Thanks Kevin and Peter for your thoughts on "fictional nodes". Your comments indeed directly address the concerns expressed by the issue submitter:

https://code.google.com/p/epub-revision/issues/detail?id=218

The side discussion about DOM (raw character stream, XML InfoSet, etc.) emerged from the lack of clarity in the CFI normative prose regarding the data model to use (which made it difficult to discuss the original issue at hand). I will file a separate issue.

Dan

Daniel Weck

Apr 17, 2013, 3:52:38 PM

to epub-work...@googlegroups.com

On Wednesday, April 17, 2013 8:36:47 PM UTC+1, Daniel Weck wrote:

> The side discussion about DOM (raw character stream, XML InfoSet, etc.) emerged from the lack of clarity in the CFI normative prose regarding the data model to use (which made it difficult to discuss the original issue at hand). I will file a separate issue.

Done:

https://code.google.com/p/epub-revision/issues/detail?id=351

Markus Gylling

Apr 17, 2013, 5:05:00 PM

to epub-work...@googlegroups.com

> I'll set multiple alarms for this one to make sure I don't miss it (again!).

To be very sure we all set those clocks right then: next week's con call is at 16:00 UTC on the 25th, and it will indeed be dedicated to CFI.

http://www.timeanddate.com/worldclock/fixedtime.html?iso=20130425T16

/markus

Roger Webster

Apr 18, 2013, 2:42:13 PM

to epub-work...@googlegroups.com

Sigh. I will be on a plane at that time...

-----Original Message-----
From: epub-work...@googlegroups.com [mailto:epub-work...@googlegroups.com] On Behalf Of Markus Gylling
Sent: Wednesday, April 17, 2013 2:05 PM
To: epub-work...@googlegroups.com
Subject: Re: [301][NeedsDiscussion] CFI (W3C DOM Range, text-node vs text-range)

Daniel Weck

May 7, 2013, 6:42:44 PM

to epub-work...@googlegroups.com

Joint [72h] proposal for issues #351 and #218:

https://groups.google.com/forum/#!topic/epub-working-group/pgBJLfVcozY
