Handling bad DOCTYPE declarations
From: Kent Johnson <ken...@tds.net>
To: beautifulsoup <beautifulsoup@googlegroups.com>
Subject: Handling bad DOCTYPE declarations
Date: Wed, 03 Oct 2007 16:05:42 -0000
Currently, if the DOCTYPE declaration doesn't parse correctly,
BeautifulSoup puts the entire document into a single NavigableString.
The problem is in the exception handler in parse_declaration(), which
passes the entire rest of the document to handle_data(). Here is a
version of parse_declaration() that attempts to skip just the
declaration itself.
Kent
def parse_declaration(self, i):
    """Treat a bogus SGML declaration as raw data. Treat a CDATA
    declaration as a CData object."""
    j = None
    if self.rawdata[i:i+9] == '<![CDATA[':
        k = self.rawdata.find(']]>', i)
        if k == -1:
            k = len(self.rawdata)
        data = self.rawdata[i+9:k]
        j = k+3
        self._toStringSubclass(data, CData)
    else:
        try:
            j = SGMLParser.parse_declaration(self, i)
        except SGMLParseError:
            # Could not parse the DOCTYPE declaration
            # Try to just skip the actual declaration
            match = re.search(r'<!DOCTYPE([^>]*?)>', self.rawdata,
                              re.MULTILINE)
            if match:
                toHandle = self.rawdata[i:match.end()]
            else:
                toHandle = self.rawdata[i:]
            self.handle_data(toHandle)
            j = i + len(toHandle)
    return j
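
To try the fallback branch without a parser, here is a minimal
standalone sketch of the same idea. The helper name
split_bad_declaration and the sample document are hypothetical and not
part of BeautifulSoup. It departs from the code above in one way, which
is an assumption rather than part of the patch: the search starts at
offset i instead of at the beginning of rawdata, so a declaration that
ends before i cannot produce an empty slice.

```python
import re

# Hypothetical helper, not part of BeautifulSoup. It mirrors only the
# except-branch of parse_declaration() above.
DOCTYPE_RE = re.compile(r'<!DOCTYPE([^>]*?)>')


def split_bad_declaration(rawdata, i):
    """Return (text to pass to handle_data, index at which to resume)."""
    # Assumption: search from i rather than from the start of rawdata.
    match = DOCTYPE_RE.search(rawdata, i)
    if match:
        # Consume only up to the '>' that closes the declaration.
        to_handle = rawdata[i:match.end()]
    else:
        # No closing '>' at all: fall back to consuming the rest.
        to_handle = rawdata[i:]
    return to_handle, i + len(to_handle)


# Made-up document with a DOCTYPE that has trailing junk.
doc = ('<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "bogus" extra!>\n'
       '<html><body>hi</body></html>')
skipped, resume = split_bad_declaration(doc, 0)
print(repr(skipped))       # the declaration only
print(repr(doc[resume:]))  # what is left for the parser to handle
```

Only the declaration is treated as data. Everything after its closing
'>' is left for the parser to handle as usual.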