Found a parsing bug in HTMLParser

Grzegorz Adam Hankiewicz

9 Feb 2003, 12:06:56

Hi.

I've found a bug in HTMLParser parsing some of my webpages. The
problem is using an attribute with a value inside double quotes
which is near another attribute. I've created a small testcase
which you can see below. The w3c validator says the page is ok
(http://validator.w3.org/check?uri=http://www.terra.es/personal7/gradha/test.html),
and browsers render it without problems. Does it happen with newer
Python versions? What's the procedure for bug reports?

PS: Don't CC me your replies.

$ cat test.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>t</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head><body>
<a href="http://ss"title="pe">P</a>
</body></html>

$ python
Python 2.2.1 (#1, Apr 21 2002, 08:38:44)
[GCC 2.95.4 20011002 (Debian prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from HTMLParser import HTMLParser
>>> p = HTMLParser()
>>> file = open("test.html", "rt")
>>> p.feed("".join(file.readlines()))
>>> file.close()
>>> p.close()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.2/HTMLParser.py", line 112, in close
    self.goahead(1)
  File "/usr/lib/python2.2/HTMLParser.py", line 166, in goahead
    self.error("EOF in middle of construct")
  File "/usr/lib/python2.2/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: EOF in middle of construct, at line 5, column 1
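
On the question of newer Python versions: the sketch below feeds the same test page to a recent Python 3 parser. It assumes Python 3, where the module is named html.parser. The names StartTagCollector, collect_start_tags and TEST_HTML are hypothetical, introduced only for this illustration. On such versions the parser is lenient about the missing space and should report both attributes of the <a> tag instead of raising.

```python
from html.parser import HTMLParser

# The test page from the post, reproduced verbatim.
TEST_HTML = '''<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>t</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head><body>
<a href="http://ss"title="pe">P</a>
</body></html>
'''


class StartTagCollector(HTMLParser):
    """Hypothetical helper: records every start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs supplied by the parser.
        self.start_tags.append((tag, attrs))


def collect_start_tags(html):
    """Parse html and return the list of (tag, attrs) pairs seen."""
    parser = StartTagCollector()
    parser.feed(html)
    parser.close()
    return parser.start_tags


if __name__ == "__main__":
    for tag, attrs in collect_start_tags(TEST_HTML):
        print(tag, attrs)
```

Running the script prints one line per start tag, so the attributes recovered for the <a> element can be inspected directly.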

Wojtek Walczak

9 Feb 2003, 15:56:14

On Sun, 9 Feb 2003 18:06:56 +0100, Grzegorz Adam Hankiewicz wrote:
> I've found a bug in HTMLParser parsing some of my webpages. The

The bug is triggered by this line:

<a href="http://ss"title="pe">P</a>

The code responsible for the complaint is the method
check_for_whole_start_tag() of class HTMLParser, lines 308 to 312:

        if next in ("abcdefghijklmnopqrstuvwxyz=/"
                    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
            # end of input in or before attribute value, or we have the
            # '/' from a '/>' ending
            return -1

I don't want to change this, since I'm sure I would make HTMLParser fail
on some other input.
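
The effect of that check can be modelled in a few lines. This is a simplified sketch, not the code from HTMLParser.py: the pattern START_TAG_PREFIX and the function check_start_tag are assumptions made for illustration, written so that every attribute must be preceded by whitespace. Under that assumption the matcher stops right after "http://ss", the next character is the letter t, and the tag is classified as incomplete input rather than as malformed. That would explain why the failure only surfaces as "EOF in middle of construct" once close() is called.

```python
import re

# Illustrative pattern (an assumption, not the library's own regex):
# a tag name followed by zero or more attributes, each of which must be
# preceded by at least one whitespace character.
START_TAG_PREFIX = re.compile(r"""
    <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
    (?:\s+                             # whitespace is required before a name
       [a-zA-Z_][-.:a-zA-Z0-9_]*       # attribute name
       (?:\s*=\s*                      # optional value part
          (?:'[^']*'|"[^"]*"|[^'">\s]+)
       )?
    )*
    \s*                                # trailing whitespace
""", re.VERBOSE)

LETTERS_ETC = ("abcdefghijklmnopqrstuvwxyz=/"
               "ABCDEFGHIJKLMNOPQRSTUVWXYZ")


def check_start_tag(rawdata, i=0):
    """Classify the start tag beginning at rawdata[i]."""
    m = START_TAG_PREFIX.match(rawdata, i)
    if m is None:
        return "not a start tag"
    j = m.end()
    nxt = rawdata[j:j + 1]
    if nxt == ">":
        return "complete"
    if nxt == "" or nxt in LETTERS_ETC:
        # Treated as "more data may still arrive", not as an error.
        return "incomplete"
    return "malformed"


if __name__ == "__main__":
    print(check_start_tag('<a href="http://ss"title="pe">P</a>'))   # incomplete
    print(check_start_tag('<a href="http://ss" title="pe">P</a>'))  # complete
```

In this model a stricter parser would report "malformed" for the first input, but doing so for every letter after a match would also reject tags that really are split across two feed() calls, which is the trade-off mentioned above.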

PS: I'll inform people on the python-dev mailing list.

--
[ ] gminick (at) underground.org.pl http://gminick.linuxsecurity.pl/ [ ]
[ "I just like morning solitude, because that's when coffee tastes best." ]

Richard Jones

9 Feb 2003, 15:51:11

On Mon, 10 Feb 2003 4:06 am, Grzegorz Adam Hankiewicz wrote:
> I've found a bug in HTMLParser parsing some of my webpages. The
> problem is using an attribute with a value inside double quotes
> which is near another attribute. I've created a small testcase
> which you can see below. The w3c validator says the page is ok
> (http://validator.w3.org/check?uri=http://www.terra.es/personal7/gradha/test.html),
> and browsers render it without problems. Does it happen with newer
> Python versions? What's the procedure for bug reports?

Please go to www.python.org and follow the 'Developers', 'Bug Manager' links
to the bug reporting system.

Richard

Bengt Richter

9 Feb 2003, 16:38:36

On Sun, 9 Feb 2003 18:06:56 +0100, Grzegorz Adam Hankiewicz <gra...@terra.es> wrote:

>Hi.
>
>I've found a bug in HTMLParser parsing some of my webpages. The
>problem is using an attribute with a value inside double quotes
>which is near another attribute. I've created a small testcase

Too "near" to be legal HTML 4.0, I believe. From the spec:
(http://www.w3.org/TR/1998/REC-html40-19980424)
"""
3.2.2 Attributes

Elements may have associated properties, called attributes, which may have values
(by default, or set by authors or scripts). Attribute/value pairs appear before
the final ">" of an element's start tag. Any number of (legal) attribute value pairs,
separated by spaces, may appear in an element's start tag. They may appear in any order.
^^^^^^^^^^^^^^^^^^^
"""
Your DTD specification is HTML 4.0, but even if it's trying to do new XHTML stuff,
XML requires a space before each attribute definition, i.e.,
from my XML spec copy of http://www.w3.org/TR/1998/REC-xml-19980210

STag ::= '<' Name (S Attribute)* S? '>'
where
S ::= (#x20 | #x9 | #xD | #xA)+

so it surprises me that you get an ok validation, though I'm not surprised
that browsers ignore anomalies.
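
The "separated by spaces" rule can be checked mechanically before a page is handed to a strict parser. The sketch below is a rough lint under stated assumptions: the function find_glued_attributes and the patterns TAG and GLUED are hypothetical names, and the tag pattern assumes no ">" appears inside an attribute value. It reports every start tag in which a quoted attribute value is followed directly by the start of another attribute name.

```python
import re

# A start tag, assuming no '>' inside attribute values (simplification).
TAG = re.compile(r"<[A-Za-z][^>]*>")

# A quoted attribute value immediately followed by a name-start character,
# i.e. with no whitespace separating it from the next attribute.
GLUED = re.compile(r"""=\s*(?:"[^"]*"|'[^']*')(?=[A-Za-z_:])""")


def find_glued_attributes(html):
    """Return (line_number, tag_text) for each start tag with glued attributes."""
    hits = []
    for tag in TAG.finditer(html):
        if GLUED.search(tag.group()):
            # Line numbers are 1-based, counted from the start of the input.
            line = html.count("\n", 0, tag.start()) + 1
            hits.append((line, tag.group()))
    return hits


if __name__ == "__main__":
    sample = ('<html>\n<body>\n'
              '<a href="http://ss"title="pe">P</a>\n'
              '</body></html>\n')
    for line, tag in find_glued_attributes(sample):
        print("line %d: %s" % (line, tag))
```

A regular expression is used here instead of a real parser on purpose, since the point is to find input that a strict parser would reject.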

>which you can see below. The w3c validator says the page is ok
>(http://validator.w3.org/check?uri=http://www.terra.es/personal7/gradha/test.html),
>and browsers render it without problems. Does it happen with newer
>Python versions? What's the procedure for bug reports?
>
>PS: Don't CC me your replies.
>
>$ cat test.html
><!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
><html><head><title>t</title>
><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
></head><body>
><a href="http://ss"title="pe">P</a>

                    ^^^^^^^^^^ -- need white space in front of this, e.g.,

 <a href="http://ss" title="pe">P</a>
></body></html>
>
>$ python
>Python 2.2.1 (#1, Apr 21 2002, 08:38:44)
>[GCC 2.95.4 20011002 (Debian prerelease)] on linux2
>Type "help", "copyright", "credits" or "license" for more information.
>>>> from HTMLParser import HTMLParser
>>>> p = HTMLParser()
>>>> file = open("test.html", "rt")
>>>> p.feed("".join(file.readlines()))
>>>> file.close()
>>>> p.close()
>Traceback (most recent call last):
>  File "<stdin>", line 1, in ?
>  File "/usr/lib/python2.2/HTMLParser.py", line 112, in close
>    self.goahead(1)
>  File "/usr/lib/python2.2/HTMLParser.py", line 166, in goahead
>    self.error("EOF in middle of construct")
>  File "/usr/lib/python2.2/HTMLParser.py", line 115, in error
>    raise HTMLParseError(message, self.getpos())
>HTMLParser.HTMLParseError: EOF in middle of construct, at line 5, column 1
>

Seems like a better message could have been generated, though.
Regards,
Bengt Richter

Grzegorz Adam Hankiewicz

10 Feb 2003, 17:45:57

On Sun, Feb 09, 2003 at 09:38:36PM +0000, Bengt Richter wrote:

> On Sun, 9 Feb 2003 18:06:56 +0100, Grzegorz Adam Hankiewicz wrote:
> > I've found a bug in HTMLParser parsing some of my webpages. The
> > problem is using an attribute with a value inside double quotes
> > which is near another attribute. I've created a small testcase
>
> Too "near" to be legal HTML 4.0, I believe. [...] so it surprises
> me that you get an ok validation, though I'm not surprised that
> browsers ignore anomalies.

I'll contact the validator team then.

> >HTMLParser.HTMLParseError: EOF in middle of construct, at line 5, column 1
> >
> Seems like a better message could have been generated, though.

Like all error messages, it's crystal clear once you know what's
happening, or you have seen it once before.

Grzegorz Adam Hankiewicz

11 Feb 2003, 18:14:39

> > > I've found a bug in HTMLParser parsing some of my webpages. The
> > > problem is using an attribute with a value inside double quotes
> > > which is near another attribute. I've created a small testcase
> >
> > Too "near" to be legal HTML 4.0, I believe. [...]
>
> I'll contact the validator team then.

Cool, I've found a "documented feature" of the validator:

> The validator relies on OpenSP to parse XML, and as stated in
> the results of validation, OpenSP has some limitations:
> http://openjade.sourceforge.net/doc/xml.htm "OpenSP does not
> enforce the following XML constraints:
> [...]
> # XML does not allow a parameter separator that is adjacent to a delimiter to be omitted."
>
> Which is the case in your example.