parsing HTML into a document object in Fx3

Myk Melez

Nov 16, 2006, 3:16:43 PM

Folks (particularly extension developers) regularly ask for a way to
parse HTML into a document object, which is currently hard and hacky to do.

bzbarsky suggested last year that things may get better in Gecko 1.9
[1], and shaver recently started a wiki page on the subject [2].

My questions are:

1. Will things get better in Gecko 1.9/Firefox 3 (i.e. are there
concrete plans or promising developments in this area)?

2. If not, is it worth turning the MicrosummaryResource object [3],
which does this (hackily, but perhaps as well as currently possible),
into an XPCOM component usable by other code?

[1] http://groups-beta.google.com/group/netscape.public.mozilla.dom/msg/a584d4ed6b907b5c

[2] http://developer.mozilla.org/en/docs/Parsing_HTML_From_Chrome

[3] http://lxr.mozilla.org/mozilla/source/browser/components/microsummaries/src/nsMicrosummaryService.js.in#1873

Boris Zbarsky

Nov 16, 2006, 7:15:38 PM

Myk Melez wrote:
> Folks (particularly extension developers) regularly ask for a way to
> parse HTML into a document object, which is currently hard and hacky to do.

So as I see it, the steps to get this working are:

1) Decide what the problem we're solving is. Specifically, how should
noscript, noframes, and such be parsed in these documents? Keep in mind that
depending on user settings (like whether script is enabled) we create different
DOMs from the same source.

2) Decide what the plan is for charsets (currently we depend on having a
docshell to handle charset autodetect and in some cases <meta> tags, because we
have to throw away the document and reparse).

3) Go through the HTML content sink and HTML document, and make sure all the
places that use the docshell or window can survive without one.

4) Do whatever we decided to do for charsets.

5) Make DOMParser parse HTML.
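
To illustrate why step 2 matters, here is a minimal sketch of the charset problem: the parser starts with an assumed encoding, then finds a <meta> declaration that contradicts it and has to start over. This is not Gecko's code. The helper names `sniffMetaCharset` and `needsReparse`, the 1024-byte scan limit, and the simplified regular expression are all assumptions made for the example.

```javascript
// Hypothetical helper, not a Gecko API: scan the first bytes of an HTML
// document for a <meta> charset declaration.
function sniffMetaCharset(bytes, limit = 1024) {
  // Map each byte to one character so the scan does not depend on the
  // (still unknown) real encoding. Only ASCII matters for the match.
  let head = "";
  const end = Math.min(bytes.length, limit);
  for (let i = 0; i < end; i++) {
    head += String.fromCharCode(bytes[i]);
  }
  // Simplified pattern (an assumption): first "charset=" inside a <meta> tag.
  const match = /<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9_.:-]+)/i.exec(head);
  return match ? match[1].toLowerCase() : null;
}

// Hypothetical helper: true when the document declares a charset other than
// the one the parser assumed, i.e. the parsed result would have to be
// discarded and the source parsed again.
function needsReparse(bytes, assumedCharset) {
  const declared = sniffMetaCharset(bytes);
  return declared !== null && declared !== assumedCharset.toLowerCase();
}

const page = new TextEncoder().encode(
  '<html><head><meta http-equiv="Content-Type" ' +
  'content="text/html; charset=ISO-8859-1"></head><body></body></html>'
);
console.log(sniffMetaCharset(page));       // declared charset, lowercased
console.log(needsReparse(page, "UTF-8"));  // whether a reparse would be needed
```

A parser without a docshell would need some equivalent of this decision, either by scanning ahead before parsing or by being able to discard and rebuild the document itself.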

> 1. Will things get better in Gecko 1.9/Firefox 3 (i.e. are there
> concrete plans or promising developments in this area)?

I'm not aware of significant changes in this area since 1.8, and I'm not sure
anyone is working on this actively. I strongly suspect that given our existing
code, once item #1 above is sorted out handling item #3 and item #5 should not
be that bad -- a few days' work at most. Items #2 and #4 I'm really not sure
about; I guess in large part it depends on what we decide to do about #2.

-Boris

Myk Melez

Nov 21, 2006, 7:58:25 PM

to Boris Zbarsky

Boris Zbarsky wrote:
> Myk Melez wrote:
>> Folks (particularly extension developers) regularly ask for a way to
>> parse HTML into a document object, which is currently hard and hacky
>> to do.
>
> So as I see it, the steps to get this working are:

Ok, I posted your comments to bug 102699, and I also requested
blocking1.9 on the bug, since it seems to me that Firefox's microsummary
service would really benefit from it, not to mention extension authors
and other Gecko consumers.

I also filed a dependent bug 361449 to have the microsummary service use
DOMParser instead of hidden iframes to parse HTML once DOMParser can do
so. And I added a comment to bug 102699 about potentially turning
MicrosummaryResource into an XPCOM component if that bug doesn't get
fixed in Gecko 1.9.
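
For reference, the call the microsummary service would make once DOMParser can parse HTML might look like the sketch below. `DOMParser.parseFromString` with the "text/html" type exists in later browser releases, not in the Gecko versions discussed in this thread. The wrapper name `parseHtmlToDocument` and its injectable `ParserCtor` parameter are hypothetical, added only so the sketch can run where no global DOMParser exists.

```javascript
// Hypothetical wrapper; the ParserCtor parameter is an assumption that lets
// the sketch run where no global DOMParser exists.
function parseHtmlToDocument(source, ParserCtor = globalThis.DOMParser) {
  if (typeof ParserCtor !== "function") {
    throw new Error("no DOMParser implementation available");
  }
  // "text/html" asks the parser for an HTML document rather than XML.
  return new ParserCtor().parseFromString(source, "text/html");
}

// Usage in an environment that provides DOMParser:
if (typeof DOMParser === "function") {
  const doc = parseHtmlToDocument("<title>Example</title><p>Hello</p>");
  console.log(doc.title);
}
```

Unlike the hidden-iframe approach, a call of this shape needs no open window to host the parse.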

-myk

Boris Zbarsky

Nov 21, 2006, 11:25:17 PM

Myk Melez wrote:
> Ok, I posted your comments to bug 102699, and I also requested
> blocking1.9 on the bug, since it seems to me that Firefox's microsummary
> service would really benefit from it, not to mention extension authors
> and other Gecko consumers.

Right. We just need to make some decisions here...

> I also filed a dependent bug 361449 to have the microsummary service use
> DOMParser instead of hidden iframes to parse HTML once DOMParser can do
> so. And I added a comment to bug 102699 about potentially turning
> MicrosummaryResource into an XPCOM component if that bug doesn't get
> fixed in Gecko 1.9.

I don't think the "find some random chrome window and parse in an iframe in
there" approach is really something we want to turn into an "XPCOM component"...
For one thing, it doesn't work if no window is open (think Mac).

-Boris