[Dillo-dev] Quoted attribute parsing: summary

Jeremy Henty

Aug 16, 2010, 2:21:02 AM

to Dillo developers

Prompted by some private conversation with corvid I've been digging
through specs and source code to see what the state of play is.

The HTML5 specification[1] states that the user agent should consume
text, converting character references until it finds the matching
close quote. If there is no matching close quote (ie. it sees an EOF
first) then it terminates (strictly speaking, it switches to the data
state and reconsumes the EOF, which makes it emit an EOF token).
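
For illustration only, here is a minimal Python sketch of that rule as I
have summarised it. It is not taken from the spec text or from any
browser's source. The name parse_quoted_value is made up, and the
standard library's html.unescape stands in for the spec's character
reference handling, which differs in some attribute-specific details.

```python
import html


def parse_quoted_value(text, start):
    """Parse a quoted attribute value whose opening quote is text[start].

    Returns (value, next_index). If EOF comes before the closing quote,
    value is None and next_index is len(text): parsing simply stops.
    """
    quote = text[start]                    # either ' or "
    end = text.find(quote, start + 1)      # the matching close quote
    if end == -1:
        return None, len(text)             # EOF first: parse error, give up
    # Everything in between is the value, '<' and '>' included.
    return html.unescape(text[start + 1:end]), end + 1


src = '<a title="1 &lt; 2 > 1" href=x>'
value, rest = parse_quoted_value(src, src.index('"'))
print(value)   # 1 < 2 > 1
```

Note that nothing in the loop looks at the content of the value: a '>'
inside the quotes is just another character.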

Taking out Dillo's bogus attribute value detection as I proposed would
make Dillo parse quoted attribute values as per the HTML5 spec.

The Hubbub HTML parser library[2] parses quoted attribute values as
per the HTML5 spec.

Firefox parses quoted attribute values as per the HTML5 spec *except*
that if it sees an EOF then it backs up to the open quote, discards
it, then reparses as though it was expecting an unquoted attribute
value. Otherwise (ie. if it finds the matching close quote) it makes
no attempt to detect a broken attribute value, no matter what content
the attribute value has swallowed up.

So it seems that the world at large has given up on trying to detect
and correct broken attribute values.

Jeremy Henty

[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/
[2] http://www.netsurf-browser.org/projects/hubbub/

_______________________________________________
Dillo-dev mailing list
Dill...@dillo.org
http://lists.auriga.wearlab.de/cgi-bin/mailman/listinfo/dillo-dev

Johannes Hofmann

Aug 16, 2010, 4:09:14 AM

to Dillo developers

On Mon, Aug 16, 2010 at 07:21:02AM +0100, Jeremy Henty wrote:
>
> Prompted by some private conversation with corvid I've been digging
> through specs and source code to see what the state of play is.
>
> The HTML5 specification[1] states that the user agent should consume
> text, converting character references until it finds the matching
> close quote. If there is no matching close quote (ie. it sees an EOF
> first) then it terminates (strictly speaking, it switches to the data
> state and reconsumes the EOF, which makes it emit an EOF token).
>
> Taking out Dillo's bogus attribute value detection as I proposed would
> make Dillo parse quoted attribute values as per the HTML5 spec.
>
> The Hubbub HTML parser library[2] parses quoted attribute values as
> per the HTML5 spec.
>
> Firefox parses quoted attribute values as per the HTML5 spec *except*
> that if it sees an EOF then it backs up to the open quote, discards
> it, then reparses as though it was expecting an unquoted attribute
> value. Otherwise (ie. if it finds the matching close quote) it makes
> no attempt to detect a broken attribute value, no matter what content
> the attribute value has swallowed up.
>
> So it seems that the world at large has given up on trying to detect
> and correct broken attribute values.

I'd agree that we should not make compromises displaying correct
HTML when trying to deal with buggy HTML.
But are the '>' characters in the attribute value in the reddit page
actually valid?
The HTML validators at least warn about them.

Cheers,
Johannes

Jeremy Henty

Aug 17, 2010, 4:13:11 PM

to dill...@dillo.org

Johannes Hofmann wrote:

> I'd agree that we should not make compromises displaying correct
> HTML when trying to deal with buggy HTML. But are the '>'
> characters in the attribute value in the reddit page actually valid?

Yes, at least according to the HTML5 specification. Indeed, according
to that specification, the only possible parse errors while parsing a
quoted attribute value are (i) EOF, and (ii) a malformed entity
reference. Anything else is valid!

I doubt that those '>' characters are valid according to SGML, but the
HTML5 specification explicitly states that HTML5 is not an SGML
instance. No popular client has ever parsed HTML as an SGML instance
and servers have been sending non-SGML-compliant "HTML" since forever.
No matter what earlier HTML specifications may have claimed, the
practical reality is that HTML has never been a kind of SGML.

> The HTML validators at least warn about them.

Warning about them is probably a good idea, but that's a different
issue from how to handle them. Whatever Dillo should do, its current
behaviour (a) does not conform to HTML5, and (b) breaks Reddit.

Of course, there's no reason that Dillo *must* conform to HTML5.
Indeed the HTML5 specification is peppered with the lovely phrase
"willful violation", meaning "yes we know this breaks someone else's
specification but we think it's for the best". So it's fine in
principle for Dillo to say "we're going to violate HTML5 because we
think it's for the best", but I think that this particular behaviour
is a bad idea. It violates the HTML5 standard, it deviates (as far as
I can tell) from standard practice, and it breaks an otherwise
perfectly compliant website. If we can think of a useful way to
willfully violate standards so as to better handle broken HTML then
let's do it, but I think Dillo is better off without this particular
workaround because it does more harm than good.

NB: HTML5 is still a work in progress. These bug reports show some of
the discussion of parsing attribute values:

Bug 9872: "trigger a conformance error when javascript is included
in href attribute" (rejected because there are legitimate use
cases and even if it's sometimes abused it's not the HTML5
specification's role to police its use)

http://www.w3.org/Bugs/Public/show_bug.cgi?id=9872

Bug 9987: "attribute values should be allowed to contain ambiguous
ampersands ..." (still new)

http://www.w3.org/Bugs/Public/show_bug.cgi?id=9987

Regards,

Jeremy Henty

Johannes Hofmann

Aug 17, 2010, 4:40:22 PM

to dill...@dillo.org

I also played with this some more... If all standard browsers
insisted on correctly closed quotes, we could expect web-developers to
immediately fix those errors. But for me firefox 3.6.3 shows
something given the following HTML (as does current dillo):

<div title="foo >hello world</div>dillo is great

When we remove the unquoted attribute detection as in your patch,
nothing is shown. Are newer firefox versions treating this
differently?

Regards,
Johannes

Jeremy Henty

Aug 17, 2010, 7:05:30 PM

to dill...@dillo.org

Johannes Hofmann wrote:

> But for me firefox 3.6.3 shows something given the following HTML
> (as does current dillo):
>
> <div title="foo >hello world</div>dillo is great

That's Firefox's workaround that I described in my original post: if
it sees EOF while parsing a quoted attribute value (ie. if it *never*
sees a matching quote) then it goes back to the opening quote,
discards it, and parses an unquoted attribute value. So it ends up
parsing your example exactly as it would parse

<div title=foo >hello world</div>dillo is great

which gives the same result as vanilla Dillo, but for entirely
different reasons.

But Firefox only does that if it can't find the matching quote at all;
if you feed it

<div title="foo >hello world</div>dillo is great [... repeat
'dillo is great' 10000 times ...]</div><div title="bar">

then it matches the second double quote with the first and *all* the
text disappears. Which is exactly what HTML5 says it should do. Of
course vanilla Dillo does *better* than Firefox for this example, but
in the real world I think it does *worse*. JavaScript fragments that
confound Dillo's algorithm are far more common than examples such as
the above that it handles well.
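
A rough Python sketch of the Firefox behaviour as I have described it
here (based on what I observed, not copied from the Firefox source; the
function name is made up, and the long run of 'dillo is great' is cut
down to a few repeats):

```python
def parse_value_firefox_like(text, start):
    """text[start] is the opening quote of an attribute value.

    Returns (value, next_index).
    """
    quote = text[start]
    end = text.find(quote, start + 1)
    if end != -1:
        # A matching quote exists somewhere: take everything up to it,
        # however much markup that swallows.
        return text[start + 1:end], end + 1
    # EOF reached: drop the opening quote and reparse as an unquoted
    # value, which ends at whitespace or '>'.
    i = start + 1
    while i < len(text) and not text[i].isspace() and text[i] != ">":
        i += 1
    return text[start + 1:i], i


unclosed = '<div title="foo >hello world</div>dillo is great'
swallowed = unclosed + ' dillo is great' * 3 + '</div><div title="bar">'

for src in (unclosed, swallowed):
    value, rest = parse_value_firefox_like(src, src.index('"'))
    print(repr(value))
```

For the first input the value is just foo, so the rest of the line is
parsed as ordinary content. For the second, the value runs all the way
to the quote before bar, taking the text with it.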

OK, here's a new proposal: when parsing quoted attribute values, let's
copy Firefox! That would: (a) sensibly handle the missing quotes
examples that people have suggested (which my proposed patch does not
do), (b) handle well-formed JavaScript fragments correctly (which
vanilla Dillo does not do), (c) parse well-formed HTML5 as per the
HTML5 specification, (d) conform to Firefox's established practice,
and (e) not break Reddit! That's 5 wins!

It's true that we can't expect people to fix their HTML just because
the HTML5 specification says it's broken. And it's even less likely
that they will fix it just because it breaks in Dillo. But it is very
likely that they will fix it if it breaks in Firefox, so copying
Firefox is a good idea, even if you don't care about the HTML5
specification.

And, why should we care about edge cases that vanilla Dillo handles
better than Firefox, since those are precisely the cases that people
will fix to keep their Firefox users happy and that we can therefore
expect *not* to see! There's no point in having an algorithm that in
theory is better than Firefox's, because in practice it's not.

So, why not just copy Firefox? I can't see any downside.

Regards,

Jeremy Henty

Johannes Hofmann

Aug 18, 2010, 3:57:09 AM

to dill...@dillo.org

I agree. Can you adjust the patch? Then I'd like to wait to see what
Jorge and corvid say, but I think it's best to mimic Firefox.

Regards,
Johannes

Jeremy Henty

Aug 18, 2010, 6:14:47 AM

to dill...@dillo.org

Johannes Hofmann wrote:

I'll work on a new patch next week. (I'd like to do it sooner but
real life has already alloc()ed the rest of this week.)

Thanks for your comments,

Jeremy Henty

Jorge Arellano Cid

Aug 18, 2010, 1:45:08 PM

to dill...@dillo.org

Hi Jeremy,

It's a relief to see how things have evolved in ten years. I'm
happily surprised!

In the beginning Mozilla/FF did crazy stuff to make sense out
of the most obnoxious tag soup. At that time we only parsed
correct content (as per the SPEC) and everybody ended up saying
"dillo is broken" (despite the nice warning messages :-).
Then we ended up correcting tag soup as much as we could (but kept
the nice warnings, which just a few souls cared for).
At that moment we followed the "When in doubt, follow FF" motto
which served us well, for the reasons you describe well.
Now it's great to see a sensible turn into a better direction.

I've considered this patch at least three times during these ten
years, and I know FF behaved differently back then. I even
committed the patch once, only to see lots of content disappear
from the page, and had to backpedal.

Looking at the attached examples with FF is quite telling.

(please read the comments below).

On Wed, Aug 18, 2010 at 12:05:30AM +0100, Jeremy Henty wrote:
>

Agreed.
I'll be looking forward to the patch.

> It's true that we can't expect people to fix their HTML just because
> the HTML5 specification says it's broken. And it's even less likely
> that they will fix it just because it breaks in Dillo. But it is very
> likely that they will fix it if it breaks in Firefox, so copying
> Firefox is a good idea, even if you don't care about the HTML5
> specification.
>
> And, why should we care about edge cases that vanilla Dillo handles
> better than Firefox, since those are precisely the cases that people
> will fix to keep their Firefox users happy and that we can therefore
> expect *not* to see! There's no point in having an algorithm that in
> theory is better than Firefox's, because in practice it's not.
>
> So, why not just copy Firefox? I can't see any downside.

Agreed, and that's what we've done for some time now.

In the beginning we followed the SPECS, which caused more harm
than good to the project in the long term.

FWIW, when HTML was defined as XML, only a few servers provided
it with the xhtml MIME type, because it risked not being rendered
if incorrect, which didn't happen when served as "tag soup".

The W3C and WDG, among others, built nice validators which
were not used to correct the state of the web. Business logic
prevailed: why waste resources on 1-2% of the market?

As you see, we've had to sail with the tides.

--
Cheers
Jorge.-

quote-fault.html

BadAttrValue.html
