From:  Kent Johnson <ken...@tds.net>
To:  beautifulsoup <beautifulsoup@googlegroups.com>
Subject: Handling bad DOCTYPE declarations
Date: Wed, 03 Oct 2007 16:05:42 -0000

Currently, if the DOCTYPE declaration fails to parse, BeautifulSoup
puts the entire document into a single NavigableString. The problem is
in the exception handler in parse_declaration(), which passes the
entire remainder of the document to handle_data(). Here is a version
of parse_declaration() that attempts to skip just the declaration
itself.

Kent

    def parse_declaration(self, i):
        """Treat a bogus SGML declaration as raw data. Treat a CDATA
        declaration as a CData object."""
        j = None
        if self.rawdata[i:i+9] == '<![CDATA[':
            k = self.rawdata.find(']]>', i)
            if k == -1:
                k = len(self.rawdata)
            data = self.rawdata[i+9:k]
            j = k+3
            self._toStringSubclass(data, CData)
        else:
            try:
                j = SGMLParser.parse_declaration(self, i)
            except SGMLParseError:
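                # Could not parse the declaration. If it is a DOCTYPE,
                # try to skip just the declaration itself. The match is
                # anchored at i so that a DOCTYPE elsewhere in the
                # document is never picked up by mistake.
                doctype = re.compile(r'<!DOCTYPE[^>]*>', re.IGNORECASE)
                match = doctype.match(self.rawdata, i)
                if match:
                    toHandle = self.rawdata[i:match.end()]
                else:
                    toHandle = self.rawdata[i:]
                self.handle_data(toHandle)
                j = i + len(toHandle)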
        return j
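
To try the skipping logic outside the parser, here is a minimal standalone sketch of the fallback branch only. The helper name skip_bad_declaration, the DOCTYPE_RE constant, and the sample document are hypothetical and not part of BeautifulSoup. The sketch anchors the pattern at offset i with a compiled pattern's match(), which is one way to make sure only the declaration at the current position is skipped. It does not show which inputs SGMLParser actually rejects.

```python
import re

# Hypothetical helper, not part of BeautifulSoup: it mirrors only the
# fallback branch of parse_declaration().
DOCTYPE_RE = re.compile(r'<!DOCTYPE[^>]*>', re.IGNORECASE)


def skip_bad_declaration(rawdata, i):
    """Return (text_to_emit, next_index) for a declaration starting at i."""
    match = DOCTYPE_RE.match(rawdata, i)
    if match:
        # Emit only the declaration; parsing can resume right after '>'.
        to_handle = rawdata[i:match.end()]
    else:
        # No DOCTYPE ending in '>' at this position: emit the rest as data.
        to_handle = rawdata[i:]
    return to_handle, i + len(to_handle)


# A made-up string standing in for a document whose DOCTYPE was rejected.
doc = '<!DOCTYPE html PUBLIC "unterminated><html><body>hi</body></html>'
text, j = skip_bad_declaration(doc, 0)
print(text)     # the declaration that would go to handle_data()
print(doc[j:])  # the markup left for the parser to process normally
```

With this input, only the declaration is emitted as data and the rest of the markup, starting at <html>, remains available for normal parsing.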