[Dillo-dev] [patch] HTML parser bugfix

0 views
Skip to first unread message

123

unread,
Jun 3, 2012, 2:29:51 PM6/3/12
to dill...@dillo.org
The following HTML fragment is shown as
BUG<">abc
by Dillo. It is valid (checked with validator.w3.org) HTML. In Firefox
it is shown as
abc
with '>BUG<' tooltip.

<div title=">BUG<">abc</div>

What happens is that Dillo see < character inside quotes, shows bug
"attribute lacks closing quote" and breaks out of tag parsing loop. In
fact it is not possible to say if there is a bug in HTML at that
point.

In WWW you can see this bug on http://reddit.com. Each entry on this
page contains
votehash', null, event)" >
and bug meter shows bugs for them.

Attached patch removes logic that tries to detect unterminated quoted
attributes and fixes this bug.
html.diff

Jeremy Henty

unread,
Jun 3, 2012, 4:00:55 PM6/3/12
to dill...@dillo.org

123 wrote:

> Attached patch removes logic that tries to detect unterminated
> quoted attributes and fixes this bug.

I would be wary of applying this patch without further thought. I run
Dillo with a very similar patch and it breaks *many* more pages than
it fixes. It is unfortunately necessary to detect and recover from
quoting errors in attributes. For instance, the web is *full* of
pages with tags containing this:

foo="bar""

If you don't detect and ignore that extra double quote you will break
many pages that every other browser renders perfectly well.

It's not even that uncommon to see monstrosities like this:

foo=""bar"""

It is also common to see short attributes (often CSS lengths)
delimited with a mixture of single and double quotes, eg:

size="12pt'

For these reasons, sadly, it is IMHO not realistic to just drop the
misquote-detection logic.

Regards,

Jeremy Henty

_______________________________________________
Dillo-dev mailing list
Dill...@dillo.org
http://lists.auriga.wearlab.de/cgi-bin/mailman/listinfo/dillo-dev

123

unread,
Jun 3, 2012, 4:45:29 PM6/3/12
to dill...@dillo.org
On Sun, Jun 03, 2012 at 09:00:55PM +0100, Jeremy Henty wrote:
> I would be wary of applying this patch without further thought. I run
> Dillo with a very similar patch and it breaks *many* more pages than
> it fixes. It is unfortunately necessary to detect and recover from
> quoting errors in attributes. For instance, the web is *full* of
> pages with tags containing this:
>
> foo="bar""
>
> If you don't detect and ignore that extra double quote you will break
> many pages that every other browser renders perfectly well.

Then there should be some logic for detecting double quotes. Searching
for < inside quotes is not the right way as it breaks valid pages like
reddit main page.

Can you give examples of real web pages with double quotes?

> It's not even that uncommon to see monstrosities like this:
>
> foo=""bar"""

<div title=""abc""">def</div>

does not show tip "abc" for def in Firefox. It silently displays
nothing in Dillo after patch, however. I will try to fix it.

> It is also common to see short attributes (often CSS lengths)
> delimited with a mixture of single and double quotes, eg:
>
> size="12pt'

<div title="abc'>def</div>

displays nothing in Firefox. What browser can render it?

123

unread,
Jun 4, 2012, 8:19:13 AM6/4/12
to dill...@dillo.org
According to WHATWG [1] fragment

<div title=""abc""">def</div>

should be parsed without errors until 'a' character. abc""" should be
parsed as attribute name with parse errors because of lack of space
character and quotes inside attribute name.

[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html

Jorge Arellano Cid

unread,
Jun 4, 2012, 11:35:42 AM6/4/12
to dill...@dillo.org
On Sun, Jun 03, 2012 at 09:00:55PM +0100, Jeremy Henty wrote:
>
> 123 wrote:
>
> > Attached patch removes logic that tries to detect unterminated
> > quoted attributes and fixes this bug.
>
> I would be wary of applying this patch without further thought. I run
> Dillo with a very similar patch and it breaks *many* more pages than
> it fixes. It is unfortunately necessary to detect and recover from
> quoting errors in attributes. For instance, the web is *full* of
> pages with tags containing this:
>
> foo="bar""
>
> If you don't detect and ignore that extra double quote you will break
> many pages that every other browser renders perfectly well.
>
> It's not even that uncommon to see monstrosities like this:
>
> foo=""bar"""
>
> It is also common to see short attributes (often CSS lengths)
> delimited with a mixture of single and double quotes, eg:
>
> size="12pt'
>
> For these reasons, sadly, it is IMHO not realistic to just drop the
> misquote-detection logic.

Ditto.

(FWIW, maybe this is the fourth time this patch is received;
In the past the submitter has always desisted after some more
testing).

--
Cheers
Jorge.-

Jeremy Henty

unread,
Jun 6, 2012, 7:14:06 PM6/6/12
to dill...@dillo.org

123 wrote:

> On Sun, Jun 03, 2012 at 09:00:55PM +0100, Jeremy Henty wrote:

> > If you don't detect and ignore that extra double quote you will
> > break many pages that every other browser renders perfectly well.
>
> Then there should be some logic for detecting double quotes.

I agree. The hard question is: what logic?

> Searching for < inside quotes is not the right way as it breaks
> valid pages like reddit main page.

Again, I agree. In fact, I decided to experiment with removing
Dillo's misquote-detection just because it broke reddit.
Unfortunately it is very hard to come up with an algorithm that
correctly handles the various quoting horrors that you see all the
time, yet also correctly parses embedded javascript fragments like:

onclick='alert("clicked")'

(IIRC Dillo mis-renders reddit because of embedded javascript like
this.)

> Can you give examples of real web pages with double quotes?

I have attached a bunch of links that I collected after spending a lot
of time debugging pages that broke my locally-patched Dillo. (Ignore
the file:///... URLs as they obviously won't work for you.) I have
also attached the two patches I use. I think the first does the same
thing that you propose, although I haven't checked that in detail.
The second attempts to copy Firefox's misquote-detection algorithm.

I agree that Dillo's misquote-detection isn't good enough, but just
taking it out is a step backwards, and it is a mistake to propose
changes to it based solely on particular pages that Dillo mis-renders,
because any such change needs to be tested against the many other
broken pages that Dillo currently renders more or less correctly.

Regards,

Jeremy Henty
dillo.html
html_parse_quoted_attributes_as_per_strict_html5
html_parse_quoted_attributes_as_per_firefox

123

unread,
Jun 7, 2012, 8:45:42 AM6/7/12
to dill...@dillo.org
On Thu, Jun 07, 2012 at 12:14:06AM +0100, Jeremy Henty wrote:
>
> 123 wrote:
>
> > On Sun, Jun 03, 2012 at 09:00:55PM +0100, Jeremy Henty wrote:
>
> > > If you don't detect and ignore that extra double quote you will
> > > break many pages that every other browser renders perfectly well.
> >
> > Then there should be some logic for detecting double quotes.
>
> I agree. The hard question is: what logic?

IMO the right way to do it is to implement HTML Standard [1]. When
Html_write_raw is called, parser is in the Data state. When it returns
to Data state again, it is the end of token.

I have implemented standard comment/DOCTYPE parsing. It is incomplete,
EOF is not handled and DOCTYPE parsing is not changed. Patch is
attached. Next step is to rewrite tag parsing in standard way.

[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenization

123

unread,
Jun 7, 2012, 11:18:11 AM6/7/12
to dill...@dillo.org
Forgot to attach.
dilloparser.patch

Johannes Hofmann

unread,
Jun 12, 2012, 4:39:18 PM6/12/12
to dill...@dillo.org
On Thu, Jun 07, 2012 at 04:45:42PM +0400, 123 wrote:
> On Thu, Jun 07, 2012 at 12:14:06AM +0100, Jeremy Henty wrote:
> >
> > 123 wrote:
> >
> > > On Sun, Jun 03, 2012 at 09:00:55PM +0100, Jeremy Henty wrote:
> >
> > > > If you don't detect and ignore that extra double quote you will
> > > > break many pages that every other browser renders perfectly well.
> > >
> > > Then there should be some logic for detecting double quotes.
> >
> > I agree. The hard question is: what logic?
>
> IMO the right way to do it is to implement HTML Standard [1]. When
> Html_write_raw is called, parser is in the Data state. When it returns
> to Data state again, it is the end of token.
>
> I have implemented standard comment/DOCTYPE parsing. It is incomplete,
> EOF is not handled and DOCTYPE parsing is not changed. Patch is
> attached. Next step is to rewrite tag parsing in standard way.
>
> [1] http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenization

With your patch text disappears e.g. on
http://www.cnas.org/blogs/abumuqawama/2011/04/quote-day.html-0
which was mentioned by Jeremy while it renders ok with current dillo and firefox.

I would rather put together a test page that includes all the cases
Jeremy brought up plus the reddit one.

Also looking into firefox or other browser sources might be a good
start.

Cheers,
Johannes
Reply all
Reply to author
Forward
0 new messages