Message from discussion BeautifulSoup choking on quotation mark typo
From: Leonard Richardson <leona...@segfault.org>
Date: Sun, 18 Jan 2009 11:10:18 -0500
Local: Sun, Jan 18 2009 11:10 am
Subject: Re: BeautifulSoup choking on quotation mark typo
I thought I'd posted this to the list, but it was actually a private
email. This is my general stand on this kind of problem:

Low-level HTML problems like this are not something I can fix.
Beautiful Soup operates on the level of the tag, and if the parser
can't create a tag from the data there's nothing I can do.

I chose to switch to HTMLParser so that Beautiful Soup could run under
Python 3.0. There's some markup SGMLParser handles that HTMLParser
doesn't, like the cases you mention.

My plan for handling this is to make the underlying parser pluggable.
The default implementation will use HTMLParser with the heuristics
I've developed over the course of Beautiful Soup development. But if
that doesn't work or is too slow, you'll be able to plug in lxml,
html5lib, or write an interface to any other parser.
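
To illustrate the pluggable-parser idea above, here is a minimal sketch in present-day Python 3. It is not Beautiful Soup's actual interface: `BACKENDS`, `stdlib_backend`, and `start_tags` are hypothetical names invented for this example, and the only real library call is the standard library's `html.parser.HTMLParser`.

```python
from html.parser import HTMLParser

# Hypothetical registry: parser name -> function that turns markup into a
# flat list of start-tag names. Invented for illustration only.
BACKENDS = {}


class _TagCollector(HTMLParser):
    """Records the name of every start tag the stdlib parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def stdlib_backend(markup):
    """Backend built on the standard library's HTMLParser."""
    collector = _TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


BACKENDS["html.parser"] = stdlib_backend


def start_tags(markup, parser="html.parser"):
    """Tree-side code calls this and never touches a parser directly."""
    return BACKENDS[parser](markup)
```

Another backend (for example one wrapping lxml or html5lib) would be registered under its own key, and the code that works with the result would not need to change.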

Basically, I want to get out of the business of writing parsers and
focus on making it really easy to manipulate the parse tree once you
have one.

In the meantime, you have three options:

1. Pre-process the data so that HTMLParser can handle it.
2. Use lxml or html5lib.
3. Use Beautiful Soup 3.0.7a, the last version that uses SGMLParser.
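
For option 1, a rough sketch of a pre-processing step for the specific typo in this thread (a stray extra quote after an attribute value) might look like the following. The function names are hypothetical, and the pattern is an assumption about what the bad markup looks like, not a general HTML repair. It is deliberately narrower than a blanket replacement of `""`, which would also rewrite valid empty attributes.

```python
import re

# Matches ="value"" (an attribute value followed by one stray quote).
# An empty value such as ="" does not match, because there is no
# additional quote after the one that closes it.
_STRAY_QUOTE = re.compile(r'(=\s*"[^"<>]*")"')


def strip_stray_quotes(html):
    """Drop one stray quote directly after a quoted attribute value."""
    return _STRAY_QUOTE.sub(r"\1", html)


def naive_fix(html):
    """The blanket replacement suggested earlier in the thread."""
    return re.sub('""', '"', html)


broken = '<div align="left""><strong>Next page:</strong>'
print(strip_stray_quotes(broken))
print(naive_fix('<div align="">'))  # shows the damage to a valid empty attribute
```

This only covers the one pattern reported here; other kinds of malformed start tags would need their own handling or one of the other two options.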

Leonard

On Sun, Jan 18, 2009 at 7:40 AM, Jonathan <jonathan.north.washing...@gmail.com> wrote:

> David, you should be aware of the following—straight from
> BeautifulSoup's page:

> "You didn't write that awful page. You're just trying to get some data
> out of it. Right now, you don't really care what HTML is supposed to
> look like.  Neither does this parser."

> The whole reason not to use regexes is that there are always
> contingencies you didn't think of.  For instance, you obviously
> didn't consider things like <div align="">, which is syntactically
> valid.  Your regex would then *damage* the parse tree.

> Christian, this appears to be a bug introduced by using HTMLParser
> instead of SGMLParser.  BS should *not* crash on slightly invalid but
> guessable HTML.  You should file a bug report.  Someone needs to fix
> it.

> --
> Jonathan

> On Jan 5, 6:23 pm, "David Barnett" <daviebd...@gmail.com> wrote:
>> You can probably use re.sub("\"\"", "\"", html). I don't think HTML allows
>> syntax like "blah""blah".

>> David

>> On Sun, Jan 4, 2009 at 10:08 PM, Christian <kreib...@gmail.com> wrote:

>> > Hi all,

>> > I have BS choking on this content ...

>> >   <div align="left""><strong>Next page:</strong> [...]

>> > (note the double quotation marks) ... with a:

>> >  File "/usr/lib/python2.5/site-packages/BeautifulSoup.py", line 1261, in _feed
>> >    self.builder.feed(markup)
>> >  File "/usr/lib/python2.5/HTMLParser.py", line 108, in feed
>> >    self.goahead(0)
>> >  File "/usr/lib/python2.5/HTMLParser.py", line 148, in goahead
>> >    k = self.parse_starttag(i)
>> >  File "/usr/lib/python2.5/HTMLParser.py", line 226, in parse_starttag
>> >    endpos = self.check_for_whole_start_tag(i)
>> >  File "/usr/lib/python2.5/HTMLParser.py", line 301, in check_for_whole_start_tag
>> >    self.error("malformed start tag")
>> >  File "/usr/lib/python2.5/HTMLParser.py", line 115, in error
>> >    raise HTMLParseError(message, self.getpos())
>> > HTMLParser.HTMLParseError: malformed start tag, at line 49, column 20

>> > Is there an easy way to make BS tolerate this problem and soldier on?

>> > Thanks,
>> > Christian


 