I've found a bug in HTMLParser parsing some of my webpages. The
problem is using an attribute with a value inside double quotes
which is near another attribute. I've created a small testcase
which you can see below. The w3c validator says the page is ok
(http://validator.w3.org/check?uri=http://www.terra.es/personal7/gradha/test.html),
and browsers render it without problems. Does it happen with newer
Python versions? What's the procedure for bug reports?
PS: Don't CC me your replies.
$ cat test.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>t</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head><body>
<a href="http://ss"title="pe">P</a>
</body></html>
$ python
Python 2.2.1 (#1, Apr 21 2002, 08:38:44)
[GCC 2.95.4 20011002 (Debian prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from HTMLParser import HTMLParser
>>> p = HTMLParser()
>>> file = open("test.html", "rt")
>>> p.feed("".join(file.readlines()))
>>> file.close()
>>> p.close()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.2/HTMLParser.py", line 112, in close
    self.goahead(1)
  File "/usr/lib/python2.2/HTMLParser.py", line 166, in goahead
    self.error("EOF in middle of construct")
  File "/usr/lib/python2.2/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: EOF in middle of construct, at line 5, column 1
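For anyone who wants to retry the test on a newer interpreter, here is a minimal sketch, assuming Python 3, where the module is named html.parser instead of HTMLParser. TagCollector and collect_start_tags are names made up for this example. It feeds the offending line to the parser and records what handle_starttag receives, so you can see how a given version treats the missing space.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records every start tag together with its attribute list."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs as seen by the parser
        self.start_tags.append((tag, attrs))


def collect_start_tags(html):
    """Parse html and return the (tag, attrs) pairs that were reported."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.start_tags


# The line from test.html with no space between the two attributes
TEST_LINE = '<a href="http://ss"title="pe">P</a>'

if __name__ == "__main__":
    print(collect_start_tags(TEST_LINE))
```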
The code responsible for this complaint is the method
check_for_whole_start_tag() of class HTMLParser, lines 308 to 312:
    if next in ("abcdefghijklmnopqrstuvwxyz=/"
                "ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
        # end of input in or before attribute value, or we have the
        # '/' from a '/>' ending
        return -1
I don't want to change this, since I'm sure I would make HTMLParser weaker
under some other conditions.
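To illustrate why the error ends up being "EOF in middle of construct", here is a simplified model of that check. It is not the library's code: the regular expression START_TAG_PREFIX and the function classify_start_tag are invented for this sketch, on the assumption that the real pattern also requires white space before every attribute. The match stops right after "http://ss", the next character is the letter 't', and so the tag is treated as not yet complete rather than as malformed.

```python
import re

# Hypothetical stand-in for the parser's start-tag pattern: a tag name
# followed by attributes, each of which must be preceded by white space.
START_TAG_PREFIX = re.compile(
    r"<[a-zA-Z][-.a-zA-Z0-9:_]*"                      # tag name
    r"(?:\s+[a-zA-Z_][-.:a-zA-Z0-9_]*"                # white space + attribute name
    r"(?:\s*=\s*(?:'[^']*'|\"[^\"]*\"|[^\s>'\"]*))?"  # optional attribute value
    r")*\s*"
)

# Same character set as in the excerpt quoted above
LETTERS_ETC = ("abcdefghijklmnopqrstuvwxyz=/"
               "ABCDEFGHIJKLMNOPQRSTUVWXYZ")


def classify_start_tag(rawdata, i=0):
    """Return 'complete', 'incomplete', 'malformed' or 'not a start tag'."""
    m = START_TAG_PREFIX.match(rawdata, i)
    if m is None:
        return "not a start tag"
    nxt = rawdata[m.end():m.end() + 1]
    if nxt == ">":
        return "complete"
    if nxt == "" or nxt in LETTERS_ETC:
        # Looks like more input is still to come, so wait for it.
        # If no more input ever arrives, close() hits EOF.
        return "incomplete"
    return "malformed"


if __name__ == "__main__":
    print(classify_start_tag('<a href="http://ss"title="pe">'))
    print(classify_start_tag('<a href="http://ss" title="pe">'))
```

In this model the tag without the space is never rejected outright; it is only ever "incomplete", which matches the fact that feed() stays silent and the error shows up in close().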
PS. I'll inform people on the python-dev mailing list.
--
[ ] gminick (at) underground.org.pl http://gminick.linuxsecurity.pl/ [ ]
[ "I simply like the morning solitude, because that's when coffee tastes best." ]
Please go to www.python.org and follow the 'Developers', 'Bug Manager' links
to the bug reporting system.
Richard
>Hi.
>
>I've found a bug in HTMLParser parsing some of my webpages. The
>problem is using an attribute with a value inside double quotes
>which is near another attribute. I've created a small testcase
Too "near" to be legal HTML 4.0, I believe. From the spec:
(http://www.w3.org/TR/1998/REC-html40-19980424)
"""
3.2.2 Attributes
Elements may have associated properties, called attributes, which may have values
(by default, or set by authors or scripts). Attribute/value pairs appear before
the final ">" of an element's start tag. Any number of (legal) attribute value pairs,
separated by spaces, may appear in an element's start tag. They may appear in any order.
^^^^^^^^^^^^^^^^^^^
"""
Your DOCTYPE declares HTML 4.0, but even if the page is trying to do new XHTML stuff,
XML requires white space before each attribute, i.e.,
from my copy of the XML spec, http://www.w3.org/TR/1998/REC-xml-19980210
STag ::= '<' Name (S Attribute)* S? '>'
where
S ::= (#x20 | #x9 | #xD | #xA)+
so it surprises me that you get an ok validation, though I'm not surprised
that browsers ignore anomalies.
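As a rough check of the grammar argument, the STag production can be transcribed into a regular expression. This is only a sketch: NAME and ATT_VALUE are simplified ASCII-only approximations of the spec's Name and AttValue productions (no entity references), and is_xml_start_tag is a helper name made up for this example.

```python
import re

# S ::= (#x20 | #x9 | #xD | #xA)+
S = r"[\x20\x09\x0D\x0A]+"
# Simplified, ASCII-only approximation of the spec's Name production
NAME = r"[A-Za-z_:][A-Za-z0-9_:.\-]*"
# Simplified AttValue: quoted text without '<' or '&' (no references)
ATT_VALUE = r"(?:\"[^<&\"]*\"|'[^<&']*')"
# Attribute ::= Name Eq AttValue, with Eq ::= S? '=' S?
ATTRIBUTE = NAME + r"(?:" + S + r")?=(?:" + S + r")?" + ATT_VALUE
# STag ::= '<' Name (S Attribute)* S? '>'
STAG = re.compile(r"<" + NAME + r"(?:" + S + ATTRIBUTE + r")*(?:" + S + r")?>")


def is_xml_start_tag(text):
    """True if text is a start tag under the simplified STag pattern."""
    return STAG.fullmatch(text) is not None


if __name__ == "__main__":
    print(is_xml_start_tag('<a href="http://ss"title="pe">'))
    print(is_xml_start_tag('<a href="http://ss" title="pe">'))
```

Because every Attribute must be preceded by S, the pattern has no way to get from the closing quote of the href value to the name "title" without white space in between.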
>which you can see below. The w3c validator says the page is ok
>(http://validator.w3.org/check?uri=http://www.terra.es/personal7/gradha/test.html),
>and browsers render it without problems. Does it happen with newer
>Python versions? What's the procedure for bug reports?
>
>PS: Don't CC me your replies.
>
>$ cat test.html
><!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
><html><head><title>t</title>
><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
></head><body>
><a href="http://ss"title="pe">P</a>
                    ^^^^^^^^^^ -- need white space in front of this, e.g.,
 <a href="http://ss" title="pe">P</a>
></body></html>
>
>$ python
>Python 2.2.1 (#1, Apr 21 2002, 08:38:44)
>[GCC 2.95.4 20011002 (Debian prerelease)] on linux2
>Type "help", "copyright", "credits" or "license" for more information.
>>>> from HTMLParser import HTMLParser
>>>> p = HTMLParser()
>>>> file = open("test.html", "rt")
>>>> p.feed("".join(file.readlines()))
>>>> file.close()
>>>> p.close()
>Traceback (most recent call last):
> File "<stdin>", line 1, in ?
> File "/usr/lib/python2.2/HTMLParser.py", line 112, in close
> self.goahead(1)
> File "/usr/lib/python2.2/HTMLParser.py", line 166, in goahead
> self.error("EOF in middle of construct")
> File "/usr/lib/python2.2/HTMLParser.py", line 115, in error
> raise HTMLParseError(message, self.getpos())
>HTMLParser.HTMLParseError: EOF in middle of construct, at line 5, column 1
>
Seems like a better message could have been generated, though.
Regards,
Bengt Richter
I'll contact the validator team then.
> >HTMLParser.HTMLParseError: EOF in middle of construct, at line 5, column 1
> >
> Seems like a better message could have been generated, though.
Like all error messages, it's crystal clear once you know what's
happening, or you have seen it once before.
Cool, I've found a "documented feature" of the validator:
> The validator relies on OpenSP to parse XML, and as stated in
> the results of validation, OpenSP has some limitations:
> http://openjade.sourceforge.net/doc/xml.htm "OpenSP does not
> enforce the following XML constraints:
> [...]
> # XML does not allow a parameter separator that is adjacent to a delimiter to be omitted."
>
>Which is the case in your example.