Why can't Firefox parse HTML?

Matthew Gertner

May 17, 2005, 9:37:37 AM
I stumbled on a previous thread in this group:
http://groups-beta.google.com/group/netscape.public.mozilla.dom/browse_thread/thread/22724c738fb67f0f/6fbfd47fb97c34d7
which claims that it is impossible to create a new HTML DOM document
from a Firefox script without displaying it in a new window. This means
that HTML screen scraping using XMLHttpRequest is not possible.

In a fit of pique I ranted about this on my blog
(http://www.allpeers.com/blog/?p=136). I was trying to be funny, but
the issue is serious. I'm probably missing something, but can someone
explain to me why the appropriate interfaces are not exposed to
scripters using XPCOM?

Cheers,
Matt

Boris Zbarsky

May 17, 2005, 11:16:48 AM
Matthew Gertner wrote:
> In a fit of pique I ranted about this on my blog
> (http://www.allpeers.com/blog/?p=136). I was trying to be funny, but
> the issue is serious. I'm probably missing something, but can someone
> explain to me why the appropriate interfaces are not exposed to
> scripters using XPCOM?

Because HTML content model construction is so tied to having a window. As one
simple example, it assumes the existence of a window it can reload to handle
charset autodetection and <meta> charset declarations that are not in the first
chunk of data we get from the document. For XML this is not an issue, of
course, since the problem simply cannot arise.
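[Editor's illustration, not part of the original post.] The late-charset problem described above can be sketched as a toy in Python. This is not Gecko's logic: the function name `decode_stream`, the regular expression, and the assumed default encoding `windows-1252` are all hypothetical, chosen only to show why a declaration arriving after the first chunk forces the work to start over.

```python
import re

# Hypothetical, simplified pattern for a <meta ... charset=...> declaration.
META_CHARSET_RE = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?([\w-]+)", re.IGNORECASE)


def decode_stream(chunks, assumed="windows-1252"):
    """Decode network chunks, counting how often decoding must restart.

    A restart is needed when a charset declaration that differs from the
    encoding in use shows up after the first chunk, because earlier data
    has already been interpreted with the wrong encoding.
    """
    encoding = assumed
    restarts = 0
    seen = b""
    for index, chunk in enumerate(chunks):
        seen += chunk
        match = META_CHARSET_RE.search(seen)
        if match:
            declared = match.group(1).decode("ascii").lower()
            if declared != encoding:
                encoding = declared
                if index > 0:
                    # Earlier chunks were already consumed: start over.
                    restarts += 1
    return seen.decode(encoding, errors="replace"), restarts


# The declaration arrives in the second chunk, so one restart is needed.
late = [b"<html><head><title>caf", b"\xc3\xa9</title><meta charset='utf-8'></head></html>"]
text, restarts = decode_stream(late)
```

In a browser, that "start over" step is the window reload mentioned above, which is why the parser expects a window to exist.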

There are other issues; for example the HTML parser needs the window to find out
whether scripts and frames are enabled (for parsing <noscript> and <noframes>
tags). This is not an issue in XML, again, because the _parsing_ doesn't depend
on anything. In HTML it does. And since frames and scripts can be
enabled/disabled on a per-window basis, this is a bit of a problem.
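[Editor's illustration, not part of the original post.] A minimal Python sketch of the <noscript> point, assuming the usual behaviour that the contents of <noscript> are treated as plain text when scripting is enabled and as ordinary markup when it is disabled. The helper `element_names` and its regular expressions are hypothetical and far cruder than a real HTML parser.

```python
import re

# Hypothetical, simplified patterns; a real HTML tokenizer is far more involved.
NOSCRIPT_RE = re.compile(r"(<noscript\b[^>]*>)(.*?)(</noscript\s*>)", re.IGNORECASE | re.DOTALL)
START_TAG_RE = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)")


def element_names(html, scripting_enabled):
    """List the start tags a toy parser would turn into elements."""
    if scripting_enabled:
        # With scripting on, <noscript> contents are text, not markup,
        # so neutralise any tags found inside it.
        html = NOSCRIPT_RE.sub(
            lambda m: m.group(1) + m.group(2).replace("<", "&lt;") + m.group(3),
            html,
        )
    return [name.lower() for name in START_TAG_RE.findall(html)]


page = "<p>Hi</p><noscript><img src='fallback.png'></noscript>"
with_scripts = element_names(page, scripting_enabled=True)
without_scripts = element_names(page, scripting_enabled=False)
```

The same bytes yield different element lists depending on a setting that, as noted above, lives on the window.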

Some work has been done to make the parsing not require a window, but a lot more
needs to be done, especially if people want it to work like it would with a
window around.

-Boris

Matthew Gertner

May 18, 2005, 12:49:44 PM
Boris,

Many thanks for the reply. I understand the issue much better now. Two
more questions:

1) You mention that "a lot more needs to be done." Is there an active
effort to break the remaining dependence of the HTML parser on the
existence of a window?
2) Wouldn't it be a viable workaround, in the meantime, to associate an
HTML document retrieved with XMLHttpRequest with an invisible window?

Matt

Boris Zbarsky

May 18, 2005, 1:02:43 PM
Matthew Gertner wrote:
> 1) You mention that "a lot more needs to be done." Is there an active
> effort to break the remaining dependence of the HTML parser on the
> existence of a window?

Not very active right now, since Gecko is in 1.8 freeze, more or less. There
may be more work on it in the 1.9 cycle.

> 2) Wouldn't it be a viable workaround, in the meantime, to associate an
> HTML document retrieved with XMLHttpRequest with an invisible window?

That would execute scripts in the document in question, load stylesheets, etc,
etc. That seems undesirable (especially executing scripts).

-Boris

Matthew Gertner

May 19, 2005, 4:30:56 AM
Ok. Personally I think this functionality is important enough to merit
a short-term workaround involving an invisible window with scripts
disabled. I can't believe that this would be a huge programming effort.
At the same time, I can see how this approach could meet with
resistance since it's obviously a hack and might have other unintended
side effects.

Is there a Bugzilla report related to this that you know of? I had a
look around but couldn't find anything.

Cheers,
Matt

Boris Zbarsky

May 19, 2005, 10:52:25 AM
Matthew Gertner wrote:
> Ok. Personally I think this functionality is important enough to merit
> a short-term workaround involving an invisible window with scripts
> disabled.

Patches accepted....

> Is there a Bugzilla report related to this that you know of?

Not that I know of, no.

-Boris

Matthew Gertner

May 19, 2005, 12:01:03 PM
> Patches accepted....

Fair enough, I'll take a crack at it.

Cheers,
Matt
