HTMLParseError: malformed start tag


Aether

Mar 31, 2009, 1:19:53 AM
to beautifulsoup
I'm new to Beautiful Soup and am trying to grab some values from a
table, but I'm running into trouble.
I think the problem stems from img tags that are written like
this:
<img src"=http://blah.com" width=1 height=1 border=0>

So I think the error is raised because the " and the = after src are
swapped.

Here is a clip of the code:

1 RemoveTags = ['img', 'input', 'form', 'script']
2 proxy_con = urllib2.ProxyHandler({"http":"http://127.0.0.1:8118"})
3 proxy_open = urllib2.build_opener(proxy_con)
4 LookupName = Name.replace(" ", "+")
5 url = 'http://www.site.com/dir/?subtopic=sub&name=' + LookupName
6 print str(url)
7 page = proxy_open.open(url)
8 soup = BeautifulSoup(page, parseOnlyThese=table)
9 print str(soup)
10 for tag in soup.findAll():
11     if tag.name.lower() in RemoveTags:
12         tag.extract()
13 CharacterInfo = soup.find('td', width="20%")
14 print str(CharacterInfo)

The prints are just in there to help me debug.
The error seems to happen on line 8, and I'm not sure how to
clean up the malformed tag before that line.

I would really like to just get rid of the img tag altogether, but
it's nested in the table and I don't know how.
If anyone could help with a method for doing this, I would really
appreciate it.
Also, if anyone has advice on where I could improve, or knows of guides
besides the documentation that I could read, please don't hesitate to
let me know.
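
One way to drop the img tags before the parser ever sees them is to strip them from the raw page text. The following is only a minimal sketch, written for Python 3 with the standard library; the helper name strip_img_tags and the sample markup are hypothetical, and it assumes no attribute value inside an img tag contains a ">" character.

```python
import re

# Hypothetical pattern: matches a whole <img ...> tag, however damaged
# its attributes are, as long as no attribute value contains '>'.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def strip_img_tags(html):
    """Return html with every <img ...> tag removed."""
    return IMG_TAG.sub("", html)


# Made-up table fragment containing the malformed tag from the post.
sample = (
    '<table><tr><td width="20%">'
    '<img src"=http://blah.com" width=1 height=1 border=0>'
    'Level 42</td></tr></table>'
)
cleaned = strip_img_tags(sample)
print(cleaned)
```

The cleaned string could then be handed to the parser in place of the raw page. A regular expression is a blunt tool for HTML, so this is best treated as a workaround for one known-bad page rather than a general fix.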

Thank you.

Leonard Richardson

Mar 31, 2009, 8:29:59 AM
to beauti...@googlegroups.com
Hi,

This is a well-known problem. See here:

http://www.crummy.com/software/BeautifulSoup/3.1-problems.html

Leonard

Aether

Mar 31, 2009, 9:11:27 PM
to beautifulsoup
Thank you. After posting this I discovered that others were having
this problem too.
In case anyone else sees this, I found a solution that worked
for my specific case:

page = proxy_open.open(url)
page = page.read()
page = page.replace("\"=", "=\"")
soup = BeautifulSoup(page, parseOnlyThese=table)

This works because, on the page I was looking at, there was never a
legitimate reason for a quotation mark to come directly before an
equals sign.
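
For anyone who wants a slightly narrower version of the same idea, here is a minimal sketch in Python 3 using only the standard library. It swaps "= back to =" only when it directly follows an attribute name, then checks the result with html.parser instead of Beautiful Soup. The names fix_swapped_quote and ImgSrcCollector are hypothetical, and the assumption that the bad sequence always follows a word character comes from the one example tag above.

```python
import re
from html.parser import HTMLParser


def fix_swapped_quote(html):
    """Turn name"=value" into name="value" (hypothetical helper).

    Only rewrites '"=' when it sits right after a word character and
    right before a non-space character, i.e. after an attribute name.
    """
    return re.sub(r'(?<=\w)"=(?=\S)', '="', html)


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every img tag it is fed."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))


broken = '<td><img src"=http://blah.com" width=1 height=1 border=0></td>'
fixed = fix_swapped_quote(broken)

collector = ImgSrcCollector()
collector.feed(fixed)
print(fixed)
print(collector.sources)
```

Like the plain replace, this only suits pages where a quotation mark never legitimately precedes an equals sign.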
