BeautifulSoup choking on quotation mark typo

Christian

Jan 4, 2009, 10:08:08 PM
to beautifulsoup
Hi all,

I have BS choking on this content ...

<div align="left""><strong>Next page:</strong> [...]

(note the double quotation marks) ... with a:

  File "/usr/lib/python2.5/site-packages/BeautifulSoup.py", line 1261, in _feed
    self.builder.feed(markup)
  File "/usr/lib/python2.5/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/usr/lib/python2.5/HTMLParser.py", line 301, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/usr/lib/python2.5/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 49, column 20

Is there an easy way to make BS tolerate this problem and soldier on?

Thanks,
Christian

David Barnett

Jan 5, 2009, 6:23:32 PM
to beauti...@googlegroups.com
You can probably use re.sub("\"\"", "\"", html). I don't think HTML allows syntax like "blah""blah".

David

Jonathan

Jan 18, 2009, 7:40:15 AM
to beautifulsoup
David, you should be aware of the following—straight from
BeautifulSoup's page:

"You didn't write that awful page. You're just trying to get some data
out of it. Right now, you don't really care what HTML is supposed to
look like. Neither does this parser."

The whole reason not to use regexes is that there are always
contingencies you didn't think of. For instance, you obviously
didn't consider things like <div align="">, which is syntactically
valid. Your regex would then *damage* the parse tree.
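
To make the point above concrete, here is a minimal Python sketch (not from the thread) comparing the blanket substitution with a narrower one. The names `naive_fix` and `targeted_fix` and the pattern itself are illustrative assumptions. The narrower pattern only collapses a doubled quote that closes a non-empty attribute value; it is still a heuristic, and other kinds of malformed markup could defeat it.

```python
import re

# Hypothetical helper names; neither function is part of Beautiful Soup.

def naive_fix(html):
    # Blanket substitution: every doubled quote becomes a single quote,
    # including the legitimate empty value in align="".
    return re.sub('""', '"', html)

# Assumed heuristic: "=", optional spaces, an opening quote, at least one
# character of value, then a doubled quote followed by whitespace, "/" or ">".
_DOUBLED_CLOSE = re.compile(r'(=\s*"[^"<>]+)""(?=[\s/>])')

def targeted_fix(html):
    # Keep the attribute value and its first closing quote, drop the extra one.
    return _DOUBLED_CLOSE.sub(r'\1"', html)

broken = '<div align="left""><strong>Next page:</strong>'
empty = '<div align="">'

print(naive_fix(empty))      # the empty attribute value is damaged
print(targeted_fix(empty))   # the empty attribute value is left alone
print(targeted_fix(broken))  # the stray quote is removed
```

Whether such a pre-processing step is acceptable depends on how much you trust the pattern against the pages you actually have.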


Christian, this appears to be a bug introduced by using HTMLParser
instead of SGMLParser. BS should *not* crash on slightly invalid but
guessable HTML. You should file a bug report. Someone needs to fix
it.

--
Jonathan

On Jan 5, 6:23 pm, "David Barnett" <daviebd...@gmail.com> wrote:
> You can probably use re.sub("\"\"", "\"", html). I don't think HTML allows
> syntax like "blah""blah".
>
> David
>

Leonard Richardson

Jan 18, 2009, 11:10:18 AM
to beauti...@googlegroups.com
I thought I'd posted this to the list, but it was actually a private
email. This is my general stand on this kind of problem:

Low-level HTML problems like this are not something I can fix.
Beautiful Soup operates on the level of the tag, and if the parser
can't create a tag from the data there's nothing I can do.

I chose to switch to HTMLParser so that Beautiful Soup could run under
Python 3.0. There's some markup SGMLParser handles that HTMLParser
doesn't, like the cases you mention.

My plan for handling this is to make the underlying parser pluggable.
The default implementation will use HTMLParser with the heuristics
I've developed over the course of Beautiful Soup development. But if
that doesn't work or is too slow, you'll be able to plug in lxml,
html5lib, or write an interface to any other parser.
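
As a rough illustration of what "pluggable" could mean here, the following Python 3 sketch keeps a registry of parser backends behind one entry point. It is not Beautiful Soup's design or API: `BACKENDS`, `register_backend`, `parse`, and `TagCollector` are hypothetical names, and the only backend shown is built on the standard library's `html.parser`.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Hypothetical default backend: records start tags and their attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.tags.append((tag, dict(attrs)))

def stdlib_backend(markup):
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags

# Registry mapping a backend name to a callable that takes markup.
BACKENDS = {"html.parser": stdlib_backend}

def register_backend(name, func):
    """Plug in another parser, e.g. a wrapper around lxml or html5lib."""
    BACKENDS[name] = func

def parse(markup, backend="html.parser"):
    try:
        func = BACKENDS[backend]
    except KeyError:
        raise ValueError("unknown parser backend: %r" % backend)
    return func(markup)

print(parse('<div align="left"><strong>Next page:</strong></div>'))
```

The tree-manipulation code would then depend only on what `parse` returns, not on which library produced it.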

Basically, I want to get out of the business of writing parsers and
focus on making it really easy to manipulate the parse tree once you
have one.

In the meantime, you have three options:

1. Pre-process the data so that HTMLParser can handle it.
2. Use lxml or html5lib.
3. Use Beautiful Soup 3.0.7a, the last version that uses SGMLParser.

Leonard

chris...@gmail.com

Jan 26, 2009, 1:19:49 PM
to beautifulsoup
On Jan 18, 4:10 pm, Leonard Richardson <leona...@segfault.org> wrote:
> In the meantime, you have three options:
>
> 1. Pre-process the data so that HTMLParser can handle it.
> 2. Use lxml or html5lib.
> 3. Use Beautiful Soup 3.0.7a, the last version that uses SGMLParser.

Just a heads up to say that using html5lib to do the parsing, and then
having html5lib create a BeautifulSoup tree seems to do the trick for
me (parsing TiddlyWiki files, which tend to have badly formed CDATA).

The docs explain how: http://code.google.com/p/html5lib/wiki/UserDocumentation

Simon Morgan

Mar 3, 2009, 11:32:03 AM
to beautifulsoup
On Jan 18, 4:10 pm, Leonard Richardson <leona...@segfault.org> wrote:
> I thought I'd posted this to the list, but it was actually a private
> email. This is my general stand on this kind of problem:

So essentially Beautiful Soup isn't really two of the main things it
claims to be on the front page of the website?

It seems to me it's neither an HTML parser (because it uses the one
included in the Python standard library), nor robust (because it fails
on the simplest markup errors).

Leonard Richardson

Mar 3, 2009, 11:41:10 AM
to beauti...@googlegroups.com
> So essentially Beautiful Soup isn't really two of the main things it
> claims to be on the front page of the website?

If you like, yes. Until I can put in the time necessary to modernize
Beautiful Soup, you're welcome to use another package, or none at all,
or to just complain.

Leonard

John Glazebrook

Mar 3, 2009, 12:19:00 PM
to beautifulsoup@googlegroups.com
>> or to just complain.

lol :-D -- you forgot 'or fork the project and show us how to do it'

I think BS is very nice. I have used it to parse tens of thousands of pages and it works fine. HTML is a standard, and if someone didn't code to it, we should all expect random results. Just catch the exception and move on.
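
A minimal sketch of "catch the exception and move on" over a batch of pages might look like the following. Everything here is an assumption for illustration: `parse_all` is a hypothetical helper, and `strict_parse` is only a stand-in for whatever parser you call, failing with `ValueError` where a real parser would raise its own exception type.

```python
def strict_parse(markup):
    # Stand-in parser (assumption): treat an odd number of double quotes
    # as a malformed start tag, otherwise return a dummy result.
    if markup.count('"') % 2:
        raise ValueError("malformed start tag")
    return len(markup)

def parse_all(pages, parse, errors=(Exception,)):
    """Parse each (name, markup) pair; collect failures instead of aborting."""
    parsed, failed = [], []
    for name, markup in pages:
        try:
            parsed.append((name, parse(markup)))
        except errors as exc:
            # Remember which page failed and why, then keep going.
            failed.append((name, exc))
    return parsed, failed

pages = [
    ("good", '<p class="a">ok</p>'),
    ("bad", '<div align="left""><strong>Next page:</strong>'),
]
parsed, failed = parse_all(pages, strict_parse, errors=(ValueError,))
print([name for name, _ in parsed])
print([name for name, _ in failed])
```

Passing a narrow `errors` tuple rather than the default keeps unrelated bugs from being silently swallowed.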

m.e.b

Leonard Richardson

Mar 3, 2009, 12:17:27 PM
to beauti...@googlegroups.com
> lol :-D  -- you forgot 'or fork the project and show us how to do it'

Forking out of spite is never a good idea, but if someone is seriously
interested in making Beautiful Soup work with multiple parsing
libraries, I'd like to talk to them.

I'm working on many other projects. I don't expect you to be
interested in them, but they interest me or pay my bills. Beautiful
Soup development is a chore that doesn't bring in much money. The
reason I haven't packed it in is that, unlike my other open source
projects, a huge number of people depend on Beautiful Soup.

My serious recommendation right now is to use html5lib. I'll try to
make some BS progress next week.

Leonard
