I am familiar with Beautiful Soup (use it all the time) but it is
intended to cope with bad syntax. I just tried feeding
HTMLParser.HTMLParser some HTML containing '<p>a<b>b</p></b>' and it
didn't complain.
That is, this:
h = HTMLParser.HTMLParser()
try:
    h.feed('<p>a<b>b</p></b>')
    h.close()
    print "I expect not to see this line"
except Exception, err:
    print "exception:", str(err)
gives me "I expect not to see this line".
Am I using that routine incorrectly? Is there a natural Python choice
for this job?
Thanks,
Jim
Try HTML5lib.
http://code.google.com/p/html5lib/downloads/list
The syntax for HTML5 has well-defined notions of "correct",
"fixable", and "unparseable". For example, the common but
incorrect form of HTML comments,
<- comment ->
is understood.
HTML5lib is slow, though. Sometimes very slow. It's really a reference
implementation of the spec. There's code like this:
#Should speed up this check somehow (e.g. move the set to a constant)
if ((0x0001 <= charAsInt <= 0x0008) or
(0x000E <= charAsInt <= 0x001F) or
(0x007F <= charAsInt <= 0x009F) or
(0xFDD0 <= charAsInt <= 0xFDEF) or
charAsInt in frozenset([0x000B, 0xFFFE, 0xFFFF, 0x1FFFE,
0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE,
0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE,
0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE,
0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE,
0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE,
0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE,
0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE,
0xFFFFF, 0x10FFFE, 0x10FFFF])):
    self.tokenQueue.append({"type": tokenTypes["ParseError"],
                            "data": "illegal-codepoint-for-numeric-entity",
                            "datavars": {"charAsInt": charAsInt}})
Every time through the loop (once per character), they build that frozen
set again.
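The fix that comment asks for is straightforward. A sketch (names are mine, not html5lib's) that builds the forbidden-codepoint data once at import time instead of once per character:

```python
# Hypothetical rewrite of the check above (names are mine, not html5lib's):
# build the forbidden-codepoint data once, at module load, not per character.
_FORBIDDEN_RANGES = ((0x0001, 0x0008), (0x000E, 0x001F),
                     (0x007F, 0x009F), (0xFDD0, 0xFDEF))

# 0x000B plus the two noncharacters at the end of each of the 17 planes
# (0xFFFE/0xFFFF, 0x1FFFE/0x1FFFF, ..., 0x10FFFE/0x10FFFF).
_FORBIDDEN_SINGLES = frozenset(
    [0x000B] + [plane + tail
                for plane in range(0x0000, 0x110000, 0x10000)
                for tail in (0xFFFE, 0xFFFF)])

def illegal_codepoint(charAsInt):
    return (charAsInt in _FORBIDDEN_SINGLES or
            any(low <= charAsInt <= high
                for low, high in _FORBIDDEN_RANGES))
```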
John Nagle
Jim wrote:
> I generate some HTML and I want to include in my unit tests a check
> for syntax. So I am looking for a program that will complain at any
> syntax irregularities.
>
> I am familiar with Beautiful Soup (use it all the time) but it is
> intended to cope with bad syntax. I just tried feeding
> HTMLParser.HTMLParser some HTML containing '<p>a<b>b</p></b>' and it
> didn't complain.
HTMLParser is a tokeniser, not a parser. It treats the data as a
stream of tokens (tags, entities, PCDATA, etc); it doesn't know anything
about the HTML DTD. For all it knows, the above example could be perfectly
valid (the "b" element might allow both its start and end tags to be
omitted).
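A quick demonstration of the tokeniser behaviour (the module is HTMLParser in 2.x, html.parser in 3.x; the 3.x spelling is used here):

```python
# HTMLParser only reports token events; nothing checks that tags nest.
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []
    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))
    def handle_endtag(self, tag):
        self.events.append(('end', tag))

logger = EventLogger()
logger.feed('<p>a<b>b</p></b>')   # mis-nested, but tokenises fine
logger.close()
print(logger.events)
```

The stray `</b>` is reported as just another end-tag event, exactly like the others.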
Does the validation need to be done in Python? If not, you can use
"nsgmls" to validate any SGML document for which you have a DTD. OpenSP
includes nsgmls along with the various HTML DTDs.
First thing to note here is that you should consider switching to an HTML
generation tool that does this automatically. Generating markup manually is
usually not a good idea.
> I am familiar with Beautiful Soup (use it all the time) but it is
> intended to cope with bad syntax. I just tried feeding
> HTMLParser.HTMLParser some HTML containing '<p>a<b>b</p></b>' and it
> didn't complain.
>
> That is, this:
> h=HTMLParser.HTMLParser()
> try:
> h.feed('<p>a<b>b</p></b>')
> h.close()
> print "I expect not to see this line"
> except Exception, err:
> print "exception:",str(err)
> gives me "I expect not to see this line".
>
> Am I using that routine incorrectly? Is there a natural Python choice
> for this job?
You can use lxml and let it validate the HTML output against the HTML DTD.
Just load the DTD from a catalog using the DOCTYPE in the document (see the
'docinfo' property on the parse tree).
http://codespeak.net/lxml/validation.html#id1
Note that when parsing the HTML file, you should disable the parser failure
recovery to make sure it barks on syntax errors instead of fixing them up.
http://codespeak.net/lxml/parsing.html#parser-options
http://codespeak.net/lxml/parsing.html#parsing-html
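For concreteness, a minimal sketch of the recovery-off idea (assuming lxml is installed; validating against the DTD per the first link is a separate step):

```python
from lxml import etree

def parses_cleanly(html):
    # recover=False makes libxml2 raise on parse errors
    # instead of silently repairing them.
    parser = etree.HTMLParser(recover=False)
    try:
        etree.fromstring(html, parser)
        return True
    except etree.XMLSyntaxError:
        return False

print(parses_cleanly('<html><body><p>a<b>b</b></p></body></html>'))
print(parses_cleanly('<p>a<b>b</p></b>'))  # the mis-nested sample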
Stefan
> Jim, 06.02.2010 20:09:
>
>> I generate some HTML and I want to include in my unit tests a check
>> for syntax. So I am looking for a program that will complain at any
>> syntax irregularities.
>
> First thing to note here is that you should consider switching to an HTML
> generation tool that does this automatically.
I think that’s what he’s writing.
I don't read it that way. There's a huge difference between
- generating HTML manually and validating (some of) it in a unit test
and
- generating HTML using a tool that guarantees correct HTML output
the advantage of the second approach being that others have already done
all the debugging for you.
Stefan
> I don't read it that way. There's a huge difference between
>
> - generating HTML manually and validating (some of) it in a unit test
>
> and
>
> - generating HTML using a tool that guarantees correct HTML output
>
> the advantage of the second approach being that others have already done
> all the debugging for you.
Anyone TDDing around HTML or XML should use or fork my assert_xml()
(from django-test-extensions).
The current version trivially detects a leading <html> tag and uses
etree.HTML(xml); else it goes with the stricter etree.XML(xml). The
former will not complain about the provided sample HTML.
Sadly, the industry has such a legacy of HTML written in Notepad that
well-formed (X)HTML will never be well-formed XML. My own action item
here is to apply Stefan's parser_options suggestion to make the
etree.HTML() stricter.
However, a generator is free to produce arbitrarily restricted XML
that avoids the problems with XHTML. It could, for example, push any
Javascript that even dreams of using & instead of &amp; out into .js
files.
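The ampersand point in stdlib terms: a bare & in inline script is tolerated by HTML parsers, but it makes the document ill-formed as XML:

```python
import xml.etree.ElementTree as ET

snippet = '<script>if (a && b) go();</script>'
try:
    ET.fromstring(snippet)
    print('well-formed XML')
except ET.ParseError as err:
    # The bare ampersands are an XML well-formedness error;
    # moving the script into a .js file sidesteps this entirely.
    print('not well-formed XML:', err)
```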
So an assert_xml() hot-wired to process only XML - with the true HTML
doctype - is still useful to TDD generated code, because its XPath
reference will detect that you get the nodes you expect.
--
Phlip
http://c2.com/cgi/wiki?ZeekLand
parser = etree.HTMLParser(recover=False)
return etree.HTML(xml, parser)
That reduces tolerance. The entire assert_xml() is (apologies for
wrapping lines!):
def _xml_to_tree(self, xml):
    from lxml import etree
    self._xml = xml
    try:
        if '<html' in xml[:200]:  # NOTE the condition COULD suck more!
            parser = etree.HTMLParser(recover=False)
            return etree.HTML(xml, parser)
        else:
            return etree.XML(xml)
    except ValueError:  # TODO don't rely on exceptions for normal control flow
        tree = xml
        self._xml = str(tree)  # CONSIDER does this reconstitute the nested XML?
        return tree

def assert_xml(self, xml, xpath, **kw):
    'Check that a given extent of XML or HTML contains a given XPath, and return its first node'
    tree = self._xml_to_tree(xml)
    nodes = tree.xpath(xpath)
    self.assertTrue(len(nodes) > 0, xpath + ' not found in ' + self._xml)
    node = nodes[0]
    if kw.get('verbose', False):
        self.reveal_xml(node)  # "here have ye been? What have ye seen?"--Morgoth
    return node

def reveal_xml(self, node):
    'Spews an XML node as source, for diagnosis'
    from lxml import etree
    print etree.tostring(node, pretty_print=True)  # CONSIDER does pretty_print work? why not?

def deny_xml(self, xml, xpath):
    'Check that a given extent of XML or HTML does not contain a given XPath'
    tree = self._xml_to_tree(xml)
    nodes = tree.xpath(xpath)
    self.assertEqual(0, len(nodes), xpath + ' should not appear in ' + self._xml)
> - generating HTML using a tool that guarantees correct HTML output
Where do you think these tools come from? They don’t write themselves, you
know.
Usually PyPI.
Stefan
> Usually PyPI.
Since I think there was some question: it happens that I am working
under Django, and submitting a certain form triggers an HTML mail. I
wanted to validate the HTML in some of my unit tests. It is only
about 30 lines of HTML, so I think I'll take a pass on automated HTML
generation, but FWIW the intolerant parser found a couple of errors.
Thanks again,
Jim