Questionable markup - should beautifulsoup be able to handle this?

Eloff

May 30, 2009, 12:19:22 PM
to beautifulsoup
I came across this markup in the wild:

<a href="/catalog/world/readfile?fk_files=98531" title="Read this book
online."rel="nofollow">Read online</a>

Browsers handle it just fine, but BeautifulSoup chokes because there is
no whitespace between the two attributes: title="..."rel="..."

I am busy fixing it with regexps. Is this an area BeautifulSoup can be
improved in?
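
A minimal sketch of the kind of regexp fix described above, not a
definitive implementation. It assumes attribute values are double-quoted
and contain no angle brackets; the function name
separate_glued_attributes is hypothetical. It runs over the raw markup
before it is handed to the parser:

```python
import re

# A double-quoted attribute value that is immediately followed by the first
# letter of another attribute name, with no whitespace in between.
# Assumption: values are double-quoted and contain no '<' or '>'.
_GLUED_ATTR = re.compile(r'(="[^"<>]*")(?=[A-Za-z])')


def separate_glued_attributes(html):
    """Insert a space between attributes written as title="..."rel="..."."""
    return _GLUED_ATTR.sub(r'\1 ', html)


markup = ('<a href="/catalog/world/readfile?fk_files=98531" '
          'title="Read this book\nonline."rel="nofollow">Read online</a>')
fixed = separate_glued_attributes(markup)
```

Attributes that are already separated by whitespace are left alone, so
the pass should be safe to apply to a whole page. Text outside of tags
that happens to look like ="..."x would also get a space, which is the
usual risk of patching HTML with regexps.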

-Dan

Jim

Jun 4, 2009, 9:59:40 AM
to beautifulsoup
I'm using SoupStrainer on http://www.digitalpoint.com/ and can't find
the title tag. The whole main page is on a single line, with no newlines.

excerpt:

tags = []
try:
    links = SoupStrainer(u'title')
    for tag in BeautifulSoup(result.content,
                             parseOnlyThese=links):
        tags.append(tag)
        if DEBUG: logging.info('tag is ' + str(tag))

# result.content comes from Urlfetch, which Google App Engine uses

<! (C) Copyright 1996-2006 Digital Point Solutions - No portion of
this site may be reproduced in ANY form.><html><head><title>Digital
Point Solutions</title><meta name="description" content="Offering
business software packages and free online tools."><meta
name="keywords" content=""><meta http-equiv="Content-Type"
content="text/html; charset=UTF-8" /> <link rel="StyleSheet"
type="text/css" href="/style.css"></head><body background="./images/
background_1.gif">...deleted

I've tried trimming down the size of the file but that has the
(presumably) bad effect of leaving off the closing tags.

So I'm going to try to use a regular expression to trim off anything
in front of the <html> tag.
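
A rough sketch of that trimming step, under the assumption that the page
contains a literal <html start tag somewhere after the malformed
copyright declaration. The function name drop_before_html is
hypothetical, and the sample page is a shortened stand-in for the real
one:

```python
import re

# Start of the <html> tag, in any letter case, with or without attributes.
_HTML_START = re.compile(r'<html\b', re.IGNORECASE)


def drop_before_html(markup):
    """Return markup starting at the <html> tag; unchanged if there is none."""
    match = _HTML_START.search(markup)
    if match is None:
        return markup
    return markup[match.start():]


# Shortened stand-in for the page quoted above.
page = ('<! (C) Copyright 1996-2006 Digital Point Solutions - No portion of '
        'this site may be reproduced in ANY form.><html><head><title>Digital '
        'Point Solutions</title></head><body></body></html>')
trimmed = drop_before_html(page)
```

Unlike cutting the file down to a fixed size, this only removes the
leading junk, so the closing tags at the end of the page stay in place.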